

GCC Improvement for 68080

Stefan "Bebbo" Franke

Posts 139
26 Jul 2019 15:21


Gunnar von Boehn wrote:

I think a very important topic for coding and compiling on 68K is understanding the 68K design principle.
 
  The 68K is based on the idea of separation of duties to improve performance.
 
  The concept of the 68K allows EA units and ALU units to work independently. The EA units are there to calculate addresses.
  The ALU units are there for the normal math and loop stuff.
  Only the ALU unit creates flags!
 
  Each unit type has its own registers: An registers for the EA units
  and Dn registers for the ALU units.
  The units can "borrow" and use the other units' registers too, but this generally comes with penalties.
 
  The same concept is valid for the FPU. The FPU has its own registers but can also use Dn registers - if needed.
  That the 68K units can "borrow" other units' registers is a great freedom and makes coding a lot simpler.
  But this should be done carefully and not as general practice, as it comes at an extra cost!
 
  Some bad coding examples
 
  move.l (A0,D0),  -- borrow Dn as Index
  Using Dn as Index can be done but using An is preferable as An is bubble free.
  --
  move.l (D0),      -- Using Dn as a pointer: this can be done with the zero-suppressing EA mode, but it is bad practice and generally comes with penalties.
  --
  FADD.S D0,          -- Using Dn as FPU register, can be done but with penalty on older 68Ks, the 080 supports this for free.
  --
  CMPA A0,A1          -- Using An as loop condition is an edge case: it can be done but is often not optimal
  bne  LOOP
  --
  SUBQ.l #1,A0        -- Misuse of an address register as counter: this is very bad practice and should not be done
  TST.l  A0
  BNE    LOOP
 

 
 
  Bebbo, how good is GCC today in understanding this design philosophy of 68k?

In gcc it's all about costs. Almost all: the scheduler also tracks units and their usage. So I suggest for the 68080:

  [a]
  move.l (a0),

cost = 4
(4 is used instead of one to have room for slight modifications)

  [b]
  move.l (A0,D0),

cost = 8

  [c]
  move.l ([a0],d0),

cost = 12

and so on.

"Borrowing" means unit usage, which can be modeled and is useful for the scheduler.
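
To make the scaling idea concrete, here is a minimal sketch in Python (not GCC's own C, and not GCC code). It assumes a hypothetical table `EA_COST` holding the three suggested costs, with 4 units per step so that two otherwise equal modes could later be separated by 1 to 3 units. All names are illustrative only.

```python
# Hypothetical cost table; the names are illustrative, not GCC identifiers.
COST_PER_STEP = 4  # 4 instead of 1 leaves room for slight modifications

EA_COST = {
    "(a0)": 1 * COST_PER_STEP,       # plain address register indirect
    "(a0,d0)": 2 * COST_PER_STEP,    # indexed
    "([a0],d0)": 3 * COST_PER_STEP,  # memory indirect
}


def cheapest(modes):
    """Return the addressing mode with the lowest modelled cost."""
    return min(modes, key=EA_COST.__getitem__)
```

With such a table a selection pass would simply prefer the mode with the lowest number, for example `cheapest(["([a0],d0)", "(a0,d0)"])` picks the indexed mode.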

=== snip ===

cmp ax,ay is a good thing on 68k, especially with auto inc and if it saves a variable. I'm working to keep this for the 68k and avoid it on the 68080.

And loops aren't trivial. gcc decomposes all loops, evaluates all loop variables for their best use, and introduces additional temp vars if that's profitable (see costs). It all ends up in tuning costs.

Unfortunately gcc does not know about insn costs; it uses operation and EA costs. Plus you have to guess the costs for the higher processors, where there are best cases, cache hits, cache misses, worst cases...

the current cost models for the 68000 and the 68020 are ok for me. The 68030 is an adapted 68020 model and 68040+ is random guessing^^.



Stefan "Bebbo" Franke

Posts 139
26 Jul 2019 15:23


Gunnar von Boehn wrote:

Regarding FPU tuning and tuning in general.
 
  I think we should avoid "over" optimization.
 
  It makes sense to convert a MUL into an ADD or a single SHIFT
  It makes sense to convert a DIV into a single SHIFT
  It makes sense to convert a FDIV #imm into FMUL #imm
 
  Other than this I think the architectural instructions should be used.
 
  For FPU performance it's very important that the compiler handles the most important instructions, FADD/FMUL, properly and avoids register dependencies on them.
  And that we add more scratch registers to the 68K ABI.
  This together will greatly improve the code - other tunings have very little influence on the global picture.
 
 
  On the EA modes it's important to understand why Motorola added the double-indirect mode (there was a special business reason), and that this mode should be largely avoided in today's code!
 
 
  Bebbo what do you think about this?
  Can we reach this together, and how can we help you?
 
 

The decision is cost based. If the costs are correct, the compiler correctly selects whether mul or add/shift should be used.
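
As a rough illustration of such a cost-based decision, the sketch below counts one shift per non-zero set bit of the constant plus the adds needed to combine the terms, and compares that against a multiply cost. This is a deliberately naive decomposition written for this example; GCC's real multiply synthesis is more sophisticated. The function names and the per-operation cost of 4 are assumptions, and 80 and 12 are the MUL costs discussed later in this thread.

```python
def shift_add_ops(n):
    """Count shifts and adds for x*n using one term per set bit (n > 0)."""
    bits = [i for i in range(n.bit_length()) if (n >> i) & 1]
    shifts = sum(1 for i in bits if i > 0)  # bit 0 needs no shift
    adds = len(bits) - 1                    # combine the terms pairwise
    return shifts + adds


def prefer_mul(n, mul_cost, op_cost=4):
    """True if a single multiply is modelled as cheaper than shift/add."""
    return mul_cost < shift_add_ops(n) * op_cost
```

For `n = 99` (binary 1100011) this counts 3 shifts and 3 adds, so a MUL cost of 80 loses against 24, while a MUL cost of 12 wins.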



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
26 Jul 2019 15:40


Stefan "Bebbo" Franke wrote:

the current cost models for the 68000 and the 68020 are ok for me. The 68030 is an adapted 68020 model and 68040+ is random guessing^^.

Ok, I understand.
Then let's review the 040/060/080 costs.
We can tell you the clock cycles of all instructions for those 3 cores.

Can you point me to the files in question so that we can help?

move.l (A0,D0)
and
move.l (A0,A0)

Both have the same cost: 1+0 cycles (1 cycle total).
The (A0,D0) has the risk of bubbles.
The (A0,A0) has NO risk of bubbles.

How about the scratch register topic?
How about adding 8 Scratch Data registers / 8 FPU scratch registers for subroutines?
What do you think?


Stefan "Bebbo" Franke

Posts 139
26 Jul 2019 17:09


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

  the current cost models for the 68000 and the 68020 are ok for me. The 68030 is an adapted 68020 model and 68040+ is random guessing^^.
 

 
  Ok, I understand.
  Then let's review the 040/060/080 costs.
  We can tell you the clock cycles of all instructions for those 3 cores.
 
  Can you point me to the files in question so that we can help?
 
 
  move.l (A0,D0)
  and
  move.l (A0,A0)
 
  Both have the same cost: 1+0 cycles (1 cycle total).
  The (A0,D0) has the risk of bubbles.
  The (A0,A0) has NO risk of bubbles.
 
 
 
  How about the scratch register topic?
  How about adding 8 Scratch Data registers / 8 FPU scratch registers for subroutines?
  What do you think?

The file for the 68040 costs is m68k_68040_costs.c in gcc/config/m68k.

If you add the switch -fbbb=+V you'll get the insn costs in the asm code (plus the register tracking info), e.g.


        link.w a5,#0
                                #0      -1 0    p                                                                  !*a5      !*a7   
        move.l (8,a5),d0
                                #2      8 8        +d0                                                              .a5                                                     
        move.l d0,a1
                                #3      4 4        .d0                                          +a1                  a5             

The link insn has no cost: -1, which results in 0.
The move.l (8,a5),d0 has cost 8, which stays 8.
The move.l d0,a1 has cost 4, which stays 4.

So the minimal cost is 4, then 8, 12, etc.

This gives room for slight variations, e.g.

move.l 0(a0,d0),d1  with 12 and
move.l 0(a0,a1),d1  with 13 (or even higher).
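
To collect these numbers from a larger listing, a small script can pair each instruction with the annotation line that follows it. The sketch below is Python and rests on an assumption read off the example above: that an annotation line starts with `#<index>` followed by two integers (the cost and the resulting cost). The exact format of the `-fbbb=+V` output is not documented here, so treat the pattern as a guess.

```python
import re

# Assumed layout of an annotation line: "#<insn index> <cost> <resulting cost> ..."
ANNOTATION = re.compile(r"^\s*#(\d+)\s+(-?\d+)\s+(-?\d+)")


def insn_costs(asm_text):
    """Pair each instruction with the two cost numbers on the next comment line."""
    result = []
    previous = None
    for line in asm_text.splitlines():
        match = ANNOTATION.match(line)
        if match:
            if previous is not None:
                result.append((previous, int(match.group(2)), int(match.group(3))))
            previous = None
        elif line.strip():
            previous = line.strip()
    return result


SAMPLE = """\
        link.w a5,#0
                                #0      -1 0    p      !*a5      !*a7
        move.l (8,a5),d0
                                #2      8 8        +d0      .a5
        move.l d0,a1
                                #3      4 4        .d0      +a1      a5
"""
```

Calling `insn_costs(SAMPLE)` yields one tuple per instruction, which makes it easy to sort a whole file by cost and spot outliers.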

All new stuff which requires a modified gnu as requires a modified gnu as... as a prerequisite, which in turn requires a final decision on how registers are named, whether there is load/store or a new letter for move.? and so on...




Grom 68k

Posts 61
26 Jul 2019 21:19


Samuel Devulder wrote:

Grom 68k wrote:
  With -mtune=68080, it works well. :)

  Doesn't seem so: EXTERNAL LINK   
  You probably meant: -mtune=68030 (680-thirty) EXTERNAL LINK 
 
 
lsl works well now. Can the moveq be removed?

  Moveq is mandatory since lsl #n is limited to n<=8 AFAIK.

Try with a bigger number :)

The cost of mul (cost of 68020?) is probably too high for your example.


Samuel Devulder

Posts 248
27 Jul 2019 00:11


@Grom: I don't understand. Which example, which bigger number ?
  If it is EXTERNAL LINK , then it is a bigger number but no LSL nor MOVEQ is used. Or maybe you were talking about LSL #13,d0. But that's an impossible instruction. EXTERNAL LINK states
The shift count can be specified in the instruction operation word (*to shift from 1 – 8 places*) or in a register (modulo 64 shift count)



Grom 68k

Posts 61
27 Jul 2019 06:42


Samuel Devulder wrote:

    @Grom: I don't understand. Which example, which bigger number ?
    If it is EXTERNAL LINK , then it is a bigger number but no lsl nor moveq is used. Or you were talking about LSL #13,d0. But that's an impossible instruction. EXTERNAL LINK states
The shift count can be specified in the instruction operation word (*to shift from 1 – 8 places*) or in a register (modulo 64 shift count)

   

   
    I was talking about needing -mtune=68080 together with -m68080 to get mul used. This bug was fixed last night. All that remains is to reduce the cost of mul.


Samuel Devulder

Posts 248
27 Jul 2019 07:36


I still don't understand: -mtune=68080 with -m68080 still doesn't use mul. EXTERNAL LINK


Grom 68k

Posts 61
27 Jul 2019 07:46


Samuel Devulder wrote:

I still don't understand: -mtune=68080 with -m68080 still doesn't use mul. EXTERNAL LINK 

Your number must be bigger. The cost of mul is too high.

I found out how to share a link from my phone:
EXTERNAL LINK 
-mtune=68080 isn't needed anymore.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
27 Jul 2019 13:47


Stefan "Bebbo" Franke wrote:

    If you add the switch -fbbb=+V
    you'll get the insn costs in the asm code
    (plus the register tracking info), e.g.
   

   
    OK.
    Then lets work together and improve this.
   
    Can you make a new file for 68080.
    For which we change the cost?
   
    1)
    The lowest cost is currently  4
    4 cycles was the cost on the 68000.
    And for the 68000 this number was correct.
    Do you want to keep it this way?
   
    2) I see that LSR #,Dn has a cost of 12.
    This is too expensive.
    Please set all SHIFTs, both
    SHIFT #,Dn and
    SHIFT Dn,Dm,
    to the lowest cost (4?) like MOVEQ
   
 
    3) MUL cost is way too high
        It seems currently to be set to a cost of "80"
        Please set the cost for MUL to 3x MOVEQ (=12)
   
    4) Please set the cost for Ea (d16,An) to +1 of (An)
        In reality
      ADD.L (An),D0  =  1 cycle
      ADD.L (8,An),D1 = 1 cycle
      this means they have the same cost.
      If we want to "hint" GCC to prefer the shorter instruction - then 
      we could use cost 4 and 5 for them!
     
    5) Please note that READ/WRITE instructions cost only 1 cycle!
      ADD.L D0,(An)  =  1 cycle
      ADD.L D0,(8,An) =  1 cycle
   
 
  ADD.L D0,(8,An) has right now a cost of 16.
  This is wrong. 5 might make more sense.
 
    6) EA (An,index) is much too expensive!
      ADD.L (An,Dn*2),D0  =  1 cycle  but right now cost 16 - this is wrong
 
    7) DIV cost is set to 176, please set it to 72
 
    8) The added cost of ([mem-double-indirect]) should be +20.
 
    Thanks
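
The requests above can be summarised as a small table. The Python sketch below records only the numbers named in the list; the dictionary name and layout are hypothetical and do not mirror GCC's cost files. Where the list names no target value, the entry is left as `None` instead of guessing, and the SHIFT target of 4 carries the same question mark as in the list.

```python
MOVEQ = 4  # lowest cost in the current scale

# (current cost, proposed cost); None where the list names no target value.
COSTS_68080 = {
    "LSR #,Dn":           (12, MOVEQ),       # "lowest cost (4?) like MOVEQ"
    "MUL":                (80, 3 * MOVEQ),   # 3x MOVEQ = 12
    "ADD.L D0,(8,An)":    (16, 5),
    "ADD.L (An,Dn*2),D0": (16, None),        # "much too expensive", no target given
    "DIV":                (176, 72),
}


def reductions(table):
    """Return {insn: current - proposed} for entries with a proposed value."""
    return {
        insn: current - proposed
        for insn, (current, proposed) in table.items()
        if proposed is not None
    }
```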


Stefan "Bebbo" Franke

Posts 139
27 Jul 2019 19:35


Gunnar von Boehn wrote:

      3) MUL cost is way too high
        It seems currently to be set to a cost of "80"

80 -> 20 cycles, that's what I read in the manual for the 68040.
So it's reasonable that you can do a lot of add/shift in 20 cycles instead.

I'll provide separate costs for the 68080.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
27 Jul 2019 19:43


Stefan "Bebbo" Franke wrote:

  80 -> 20 cycles, that's what I read in the manual for the 68040.
  So it's reasonable that you can do a lot of add/shift in 20 cycles instead.
 

 
The time in the manual is the MAX time; depending on the operand contents the actual time can be shorter!

Stefan "Bebbo" Franke wrote:

  I'll provide separate costs for the 68080.
 


 
This is a very good idea.
GCC needs its own cost file for each of 040, 060 and 080
 
  for example:
  MUL.W on 040 takes 16 cycles
  MUL.W on 060 takes  2 cycles
  MUL.W on 080 takes  2 cycles
 
 


Stefan "Bebbo" Franke

Posts 139
27 Jul 2019 19:50


Grom 68k wrote:

   
  EDIT2: The same with const and -O3 EXTERNAL LINK    Can gcc remove useless a5 ?
 
 

          move.l (a1)+,(-8,a5)
          move.l (a1)+,(-4,a5)
          move.l (a1)+,(-16,a5)
          move.l (a1)+,(-12,a5)
          move.l (a1)+,(-24,a5)
          move.l (a1)+,(-20,a5)
          move.l (a1)+,(-32,a5)
          move.l (a1)+,(-28,a5)
          move.l (a1)+,(-40,a5)
          move.l (a1)+,(-36,a5)
          ...
 


 
There was no option, but I've done another hack in...
... you can play with it in the cex - and maybe it passes the tests and goes live


Stefan "Bebbo" Franke

Posts 139
27 Jul 2019 19:52


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

    80 -> 20 cycles, that's what I read in the manual for the 68040.
    So it's reasonable that you can do a lot of add/shift in 20 cycles instead.
 

   
  The time in the manual is the MAX time; depending on the operand contents the actual time can be shorter!
 
 
 
Stefan "Bebbo" Franke wrote:

 
    I'll provide separate costs for the 68080.
 

 
  This is a very good idea.
  GCC needs its own cost file for each of 040, 060 and 080
 
  for example:
  MUL.W on 040 takes 16 cycles
  MUL.W on 060 takes  2 cycles
  MUL.W on 080 takes  2 cycles

I could adapt the cycle algorithm for the 68000-68030 for the 68040 too...



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
27 Jul 2019 20:03


Stefan "Bebbo" Franke wrote:

  I could adapt the cycle algorithm for the 68000-68030 for the 68040 too...


Yes this would be great!
Please make own files for 040/060/080.

I always found this small overview pretty nice:
EXTERNAL LINK


Stefan "Bebbo" Franke

Posts 139
27 Jul 2019 21:22


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

  I could adapt the cycle algorithm for the 68000-68030 for the 68040 too...
 
 

  Yes this would be great!
  Please make own files for 040/060/080.
 
  I always found this small overview pretty nice:
  EXTERNAL LINK 

68080 costs are live on my cex.

e.g.


_mul31:
        muls.l #-727379969,d0
                                #2      12 12




Grom 68k

Posts 61
27 Jul 2019 22:46


Stefan "Bebbo" Franke wrote:

Grom 68k wrote:

     
    EDIT2: The same with const and -O3 EXTERNAL LINK    Can gcc remove useless a5 ?
   
   

            move.l (a1)+,(-8,a5)
            move.l (a1)+,(-4,a5)
            move.l (a1)+,(-16,a5)
            move.l (a1)+,(-12,a5)
            move.l (a1)+,(-24,a5)
            move.l (a1)+,(-20,a5)
            move.l (a1)+,(-32,a5)
            move.l (a1)+,(-28,a5)
            move.l (a1)+,(-40,a5)
            move.l (a1)+,(-36,a5)
            ...
   

 

 
  There was no option, but I've done another hack in...
  ... you can play with it in the cex - and maybe it passes the tests and goes live

Hi,

Do you know why fp7 is used? It could explain why gcc uses too many registers for unrolling.

Otherwise, mul works pretty well now, thanks.




Samuel Devulder

Posts 248
27 Jul 2019 23:47


I wonder why in this code ( EXTERNAL LINK ) d2 is saved onto the stack. I'd have thought that since -mregparm=3 is used, d2 would be considered a scratch register and need no save/restore at all.

Anyway, speaking generally of preserving regs on the stack, why not rather use a free scratch reg like A0 or A1 in this case? Saving regs (even FPU ones when precision doesn't matter[*]) in scratch regs rather than on the stack is what I tend to do in asm where possible.
___
[*] And since the 68080 has 64-bit regs, we can save all 64 bits of FPU regs in scratch data regs without worrying about a loss of precision.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
28 Jul 2019 06:26


Samuel Devulder wrote:

I wonder why in this code ( EXTERNAL LINK ) d2 is saved onto the stack.

 
 
A good human coder will consider several factors when coding and choose a good compromise.
 
GCC right now looks very "focused" on the cost.
And the cost right now does not make a compromise.
 
A human coder will look at:
  a) instruction clocks
  b) memory reads
  c) instruction length
  d) readability
 
Let's make an example:
  Let's say a MUL costs 100 cycles.
  Let's say an ADD costs 1 cycle.
 
A = A*99
A good coder will of course write this with MUL.
The compiler would emit 99 ADD instructions to save a single cycle.
This is a bad choice:
the whole program becomes fatter and slows down because of more cache misses, which the compiler does not consider.
 
I would propose we fix this by making the cost take this into account.
 
How about using a formula like this for the cost?
 
 
  a) 4 per clock cycle
  b) +1 per instruction word
  c) +2 for using memory
 

 
I think this will create much more balanced code.
What do you think?
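
The proposed formula is easy to try out on the A*99 example. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes every instruction is one word long and touches no memory, which the post does not state.

```python
def insn_cost(cycles, words, uses_memory=False):
    """Proposed cost: 4 per clock cycle, +1 per instruction word, +2 for memory."""
    return 4 * cycles + words + (2 if uses_memory else 0)


# Cycle-only comparison: the 99 ADDs look one cycle cheaper than the MUL.
cycles_mul = 100
cycles_adds = 99 * 1

# With the proposed formula the code size of the ADD chain is charged as well.
cost_mul = insn_cost(cycles=100, words=1)       # 4*100 + 1
cost_adds = 99 * insn_cost(cycles=1, words=1)   # 99 * (4*1 + 1)
```

Under these assumptions the single MUL comes out at 401 and the ADD chain at 495, so the size term is enough to flip the decision in favour of MUL.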
 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
28 Jul 2019 06:27


Bebbo,

can GCC also print the cost for ASM input?
I mean, is there a way to make GCC print out the costs for all instruction types and their EAs?
Like an overview, to spot errors?
