

GCC Improvement for 68080

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
28 Jul 2019 06:41


Bebbo,

how can GCC model latency?

For example:


instruction          clocks  latency
LEA     2(A0),A0     1       1
ADDQ.L  #2,A0        1       1
ADDA.W  #2,A0        1       1
ADDA.L  #4,A0        1       1
ADDA.L  (SP),A0      1       3

So if you use a pointer in an EA calculation, updating it beforehand with EA operations
such as ADDQ, ADDA #im or ADDA Reg adds no latency.
But updating the pointer with a memory operand adds usage latency.

Can we model this in GCC?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
28 Jul 2019 07:07


Bebbo regarding the latency,
 
 
  The 68K has several units:
  [EA] - [ALU] - [FPU]
 
  These instructions can be executed in the EA unit:
 
  LEA (ea),An
  ADDQ #,An
  ADDA #im,An
  ADDA Reg,An
  SUBQ #,An
  SUBA #im,An
  SUBA Reg,An
  MOVE.L #im,An
  MOVE.L Reg,An
 

 
  This means an EA calculation can use the result of the above instructions without LATENCY!
 
  All other instructions are executed in the ALU.
  This means for example:
  MOVE.L (mem),An  has a LATENCY if you want to use An in an EA calculation.
 
 
  Some examples:
 
  ADDA.L #1,A0      -- No LATENCY
  MOVE.L (A0),D0   
 
  ADDA.L #1,A0      -- No LATENCY
  MOVE.L (A5,A0),D0   
 
  ADD.L #1,D4        -- 2 cycle extra LATENCY
  MOVE.L (A5,D4),D0   
 
  MOVE.L (A6),D4    -- 2 cycle extra LATENCY
  MOVE.L (A5,D4),D0   
 
  MOVE.L (A6),A0    -- 2 cycle extra LATENCY
  MOVE.L (A0),D0   
 

  Is this concept clear?
  Can we model this?
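
A minimal sketch of the classification above, written in C purely for illustration. The enum and function names are hypothetical and are not part of GCC or of any Apollo tool. The rule encoded is only the list given in this post: the destination is An and the source is an immediate or a register, or the instruction is LEA with any EA.

```c
#include <stdbool.h>

/* Hypothetical operand and opcode kinds, invented for this sketch. */
enum src_kind { SRC_IMMEDIATE, SRC_REGISTER, SRC_MEMORY };
enum opcode   { OP_LEA, OP_ADDQ, OP_ADDA, OP_SUBQ, OP_SUBA, OP_MOVE_L, OP_OTHER };

/* Returns true if the instruction matches the EA-unit list from the post:
 * the destination must be an address register, and the source must be an
 * immediate or a register (LEA accepts any effective address). */
bool runs_in_ea_unit(enum opcode op, enum src_kind src, bool dest_is_an)
{
    if (!dest_is_an)
        return false;                /* e.g. ADD.L #1,D4 goes to the ALU */
    switch (op) {
    case OP_LEA:
        return true;                 /* LEA (ea),An */
    case OP_ADDQ:
    case OP_SUBQ:
        return src == SRC_IMMEDIATE; /* ADDQ #,An and SUBQ #,An */
    case OP_ADDA:
    case OP_SUBA:
    case OP_MOVE_L:
        return src != SRC_MEMORY;    /* #im,An or Reg,An */
    default:
        return false;                /* everything else runs in the ALU */
    }
}
```

With this reading, ADDA.L #1,A0 is an EA-unit instruction, while MOVE.L (A6),A0 is not, which matches the examples listed above.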


Grom 68k

Posts 61
28 Jul 2019 07:57


Gunnar von Boehn wrote:

  Bebbo regarding the latency,
   
   
    The 68K has several units:
    [EA] - [ALU] - [FPU]
   
    These instructions can be executed in the EA unit:
   
    LEA (ea),An
    ADDQ #,An
    ADDA #im,An
    ADDA Reg,An
    SUBQ #,An
    SUBA #im,An
    SUBA Reg,An
    MOVE.L #im,An
    MOVE.L Reg,An
   

   
    This means an EA calculation can use the result of the above instructions without LATENCY!
   
    All other instructions are executed in the ALU.
    This means for example:
    MOVE.L (mem),An  has a LATENCY if you want to use An in an EA calculation.
   
   
    Some examples:
   
    ADDA.L #1,A0      -- No LATENCY
    MOVE.L (A0),D0   
   
    ADDA.L #1,A0      -- No LATENCY
    MOVE.L (A5,A0),D0   
   
    ADD.L #1,D4        -- 2 cycle extra LATENCY
    MOVE.L (A5,D4),D0   
   
    MOVE.L (A6),D4    -- 2 cycle extra LATENCY
    MOVE.L (A5,D4),D0   
   
    MOVE.L (A6),A0    -- 2 cycle extra LATENCY
    MOVE.L (A0),D0   
   
 
    Is this concept clear?
    Can we model this?
 

 
  It looks like the latency of fadd. It could be modeled like the FPU pipeline.


Stefan "Bebbo" Franke

Posts 139
28 Jul 2019 09:30


Grom 68k wrote:

Stefan "Bebbo" Franke wrote:

 
Grom 68k wrote:

     
    EDIT2: The same with const and -O3: EXTERNAL LINK      Can gcc remove the useless a5?
   
   

            move.l (a1)+,(-8,a5)
            move.l (a1)+,(-4,a5)
            move.l (a1)+,(-16,a5)
            move.l (a1)+,(-12,a5)
            move.l (a1)+,(-24,a5)
            move.l (a1)+,(-20,a5)
            move.l (a1)+,(-32,a5)
            move.l (a1)+,(-28,a5)
            move.l (a1)+,(-40,a5)
            move.l (a1)+,(-36,a5)
            ...
   

 

   
  There was no option, but I've done another hack in...
  ... you can play with it in the cex - and maybe it passes the tests and goes live
 

 
  Hi,
 
  Do you know why fp7 is used? It could explain why gcc uses too many registers for unrolling.
 
  Else, Mul works pretty well now, thanks.

gcc has various optimizer passes that extract loop constants and load them into registers ahead of the loop.

Right now I have coded a replacement: if the register ends up on the stack and it is a similar memory access, use the latter instead of the stack.

This decision should take costs into account. Once that is done, the costs versus real registers can also be compared, and for the 68080 these preloads will be gone.

For the 68k these preloads are a performance gain; sometimes preloads onto the stack are also a gain, since access via a register is faster than via a symbol.



Stefan "Bebbo" Franke

Posts 139
28 Jul 2019 09:33


Gunnar von Boehn wrote:

Bebbo,
 
  can GCC also print the cost for ASM input?

No, it can't.

Gunnar von Boehn wrote:

  I mean is there a way we can make GCC print out all cost for all instructions types and their EA?
  Like on overview to spot errors?

Yes, there are ways:

a) provide some C code which contains all/most insns.
b) code a gcc extension which, if triggered, adds a synthetic function containing the insns.


Stefan "Bebbo" Franke

Posts 139
28 Jul 2019 09:47


Grom 68k wrote:

   
Gunnar von Boehn wrote:

      Bebbo regarding the latency,
       
       
        The 68K has several units:
        [EA] - [ALU] - [FPU]
       
        These instructions can be executed in the EA unit:
       
        LEA (ea),An
        ADDQ #,An
        ADDA #im,An
        ADDA Reg,An
        SUBQ #,An
        SUBA #im,An
        SUBA Reg,An
        MOVE.L #im,An
        MOVE.L Reg,An
       

       
        This means an EA calculation can use the result of the above instructions without LATENCY!
       
        All other instructions are executed in the ALU.
        This means for example:
        MOVE.L (mem),An  has a LATENCY if you want to use An in an EA calculation.
       
       
        Some examples:
       
        ADDA.L #1,A0      -- No LATENCY
        MOVE.L (A0),D0   
       
        ADDA.L #1,A0      -- No LATENCY
        MOVE.L (A5,A0),D0   
       
        ADD.L #1,D4        -- 2 cycle extra LATENCY
        MOVE.L (A5,D4),D0   
       
        MOVE.L (A6),D4    -- 2 cycle extra LATENCY
        MOVE.L (A5,D4),D0   
       
        MOVE.L (A6),A0    -- 2 cycle extra LATENCY
        MOVE.L (A0),D0   
       
     
        Is this concept clear?
        Can we model this?
     

     
      It looks like latency of fadd. It could be modeled as fpu pipeline.
   

   
    It will look different, but it should be possible to model this.
   
    The concept is visible somehow but not clear enough.
   
EDIT:

Ok - guess you need a pipeline per output register:

alu_d0_0, alu_d0_1, alu_d0_2
alu_d1_0, alu_d1_1, alu_d1_2
...
alu_a7_0, alu_a7_1, alu_a7_2

and an insn using several registers will use their pipelines simultaneously

  move.l 123(a6,a0.w*8),d7

will use

alu_a6_0 + alu_a6_1 + alu_a6_2 + alu_a0_0 + alu_a0_1 + alu_a0_2

and stall until all units are free.

maybe there is a more elegant way which requires less typing...
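
One possible reading of this idea, as a small C model rather than a GCC machine description: every register gets a counter of cycles until an ALU result in it can be used for address generation, and an instruction that uses several registers in its EA stalls until the slowest one is ready. All names and the register numbering are hypothetical, one instruction per cycle is assumed, and the 2-cycle value is the extra latency quoted earlier in the thread.

```c
#define NUM_REGS 16          /* assumed numbering: d0-d7 = 0..7, a0-a7 = 8..15 */
#define ALU_TO_EA_BUBBLE 2   /* cycles, as quoted earlier in the thread */

/* Cycles left until each register's last ALU result can feed the EA unit. */
struct scoreboard { int busy[NUM_REGS]; };

/* Record that an ALU instruction has just written register r. */
void alu_write(struct scoreboard *sb, int r)
{
    sb->busy[r] = ALU_TO_EA_BUBBLE;
}

/* One cycle passes (an independent instruction issues, or a stall). */
void tick(struct scoreboard *sb)
{
    for (int r = 0; r < NUM_REGS; r++)
        if (sb->busy[r] > 0)
            sb->busy[r]--;
}

/* Stall cycles before an EA using base and index can be computed:
 * the instruction waits for whichever register becomes ready last. */
int ea_stall(const struct scoreboard *sb, int base, int index)
{
    int a = sb->busy[base];
    int b = sb->busy[index];
    return a > b ? a : b;
}
```

For move.l 123(a6,a0.w*8),d7 the query would be ea_stall(&sb, 14, 8) under the numbering assumed above. A real GCC description would express this through reservations or bypasses, not through a run-time counter.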


Stefan "Bebbo" Franke

Posts 139
28 Jul 2019 12:08


to sum it up:

1. there is a latency of 2, if an ALU result is used by EA.


  addq.l #1,d1
  addq.l #2,d2
; one cycle stall left
  move.l (a0,d1),d3

2. there is no latency if an EA result is used by ALU

  lea (a0,d1),a1
; no stall
  add.l a1,d2

3. there is no latency inside EA/ALU.


Grom 68k

Posts 61
28 Jul 2019 13:53


Stefan "Bebbo" Franke wrote:

  to sum it up:
 
  1. there is a latency of 2, if an ALU result is used by EA.
 

    addq.l #1,d1
    addq.l #2,d2
  ; one cycle stall left
    move.l (a0,d1),d3
 

  2. there is no latency if an EA result is used by ALU
 

    lea (a0,d1),a1
  ; no stall
    add.l a1,d2
 

 
  3. there is no latency inside EA/ALU.
 

 
  What do you think about this:
 
  1. Define a pipeline (p0,p1,p2) on all alu instructions with Dn as output.
 
  2. Create a bypass for all except EA instructions.
 
  (define_bypass 1 "cpu_alu_*" "cpu_alu_*, cpu_fpu_*").

Edit:
2'. A guard should be easier.

Documentation:

The following construction is used to describe exceptions in the latency time for a given instruction pair. These are so-called bypasses.

(define_bypass number out_insn_names in_insn_names
                [guard])

number defines when the result generated by the instructions given in string out_insn_names will be ready for the instructions given in string in_insn_names. Each of these strings is a comma-separated list of filename-style globs and they refer to the names of define_insn_reservations. For example:

(define_bypass 1 "cpu1_load_*, cpu1_store_*" "cpu1_load_*")

defines a bypass between instructions that start with ‘cpu1_load_’ or ‘cpu1_store_’ and those that start with ‘cpu1_load_’.

guard is an optional string giving the name of a C function which defines an additional guard for the bypass. The function will get the two insns as parameters. If the function returns zero the bypass will be ignored for this case. The additional guard is necessary to recognize complicated bypasses, e.g. when the consumer is only an address of insn ‘store’ (not a stored value).

If there is more than one bypass with the same output and input insns, the chosen bypass is the first bypass with a guard in the description whose guard function returns nonzero. If there is no such bypass, then the bypass without a guard function is chosen.
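
The selection rule in the last paragraph can be sketched as a toy model in C. This is not GCC code: real guards receive GCC's own insn objects, while struct toy_insn, struct toy_bypass and both functions below are hypothetical stand-ins invented for illustration.

```c
#include <stddef.h>

/* Toy stand-in for an instruction (GCC passes its own insn type). */
struct toy_insn {
    int dest_reg;   /* register written by the insn, or -1 */
    int addr_reg;   /* register used in an address calculation, or -1 */
};

typedef int (*guard_fn)(const struct toy_insn *out, const struct toy_insn *in);

struct toy_bypass {
    int      latency;
    guard_fn guard;   /* NULL means the bypass has no guard */
};

/* Example guard: nonzero only if the producer's result feeds an address. */
int result_feeds_address(const struct toy_insn *out, const struct toy_insn *in)
{
    return out->dest_reg >= 0 && out->dest_reg == in->addr_reg;
}

/* Pick a latency as the quoted text describes: the first guarded bypass
 * whose guard returns nonzero wins, otherwise the unguarded bypass,
 * otherwise the default latency of the producing insn. */
int choose_latency(const struct toy_bypass *b, size_t n,
                   const struct toy_insn *out, const struct toy_insn *in,
                   int default_latency)
{
    const struct toy_bypass *unguarded = NULL;
    for (size_t i = 0; i < n; i++) {
        if (b[i].guard != NULL) {
            if (b[i].guard(out, in))
                return b[i].latency;
        } else if (unguarded == NULL) {
            unguarded = &b[i];
        }
    }
    return unguarded != NULL ? unguarded->latency : default_latency;
}
```

With a single guarded bypass of latency 3 and a default latency of 1 (illustrative values, taken from the ADDA.L (SP),A0 row posted earlier), an ALU result used as an address would see 3 and any other consumer would see 1.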



Stefan "Bebbo" Franke

Posts 139
28 Jul 2019 16:53


Grom 68k wrote:

    What do you think about this:
    1. Define a pipeline (p0,p1,p2) on all alu instructions with Dn as output.
 

 
  Considering only Dn is not enough; see the examples above.
 
 
Grom 68k wrote:
 
    2. Create a bypass for all except EA instructions.
   
    (define_bypass 1 "cpu_alu_*" "cpu_alu_*, cpu_fpu_*").
 
  Edit:
  2'. A guard should be easier.
 

 
  A bypass (with or without a guard function) might be easier, but I will wait with a decision until I understand the requirements.
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
28 Jul 2019 17:04


Stefan "Bebbo" Franke wrote:

to sum it up:
 
  1. there is a latency of 2, if an ALU result is used by EA.
 

    addq.l #1,d1
    addq.l #2,d2
  ; one cycle stall left
    move.l (a0,d1),d3
 

  2. there is no latency if an EA result is used by ALU
 

    lea (a0,d1),a1
  ; no stall
    add.l a1,d2
 

 
  3. there is no latency inside EA/ALU.

Yes, you are correct.

EA result to EA input => no LATENCY

ALU result to ALU input => no LATENCY

ALU result to EA input => 2 cycle bubble
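
As a rough illustration of these rules, here is a small C sketch. The names are hypothetical, and one instruction per cycle is assumed, which is a simplification made only for this example.

```c
/* The two units discussed in the thread (the FPU is left out here). */
enum unit { UNIT_EA, UNIT_ALU };

/* Extra cycles before a consumer in unit `to` can use a result
 * produced in unit `from`. Only ALU -> EA has a bubble. */
int forwarding_bubble(enum unit from, enum unit to)
{
    return (from == UNIT_ALU && to == UNIT_EA) ? 2 : 0;
}

/* Stall cycles that remain when `independent` unrelated instructions
 * are scheduled between the producer and the consumer. */
int stall_left(enum unit from, enum unit to, int independent)
{
    int bubble = forwarding_bubble(from, to);
    return independent >= bubble ? 0 : bubble - independent;
}
```

For the addq.l #1,d1 / addq.l #2,d2 / move.l (a0,d1),d3 sequence quoted above, stall_left(UNIT_ALU, UNIT_EA, 1) gives the one remaining stall cycle.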




Grom 68k

Posts 61
28 Jul 2019 22:01


Stefan "Bebbo" Franke wrote:

Grom 68k wrote:

    What do you think about this:
    1. Define a pipeline (p0,p1,p2) on all alu instructions with Dn as output.
 

 
  Considering only Dn is not enough; see the examples above.
 
 
Grom 68k wrote:
 
    2. Create a bypass for all except EA instructions.
   
    (define_bypass 1 "cpu_alu_*" "cpu_alu_*, cpu_fpu_*").
   
    Edit:
    2'. A guard should be easier.
 

 
  A bypass (with or without a guard function) might be easier, but I will wait with a decision until I understand the requirements.
 

Does a bypass of 0 launch the two instructions in the same cycle?

If so, a bypass could model the fusing too.

I still have a doubt about this fusion:
MOVE.L (an)+,(am)+
MOVE.L (an)+,(am)+


Grom 68k

Posts 61
28 Jul 2019 22:10


Samuel Devulder wrote:

  I wonder why in this code ( EXTERNAL LINK ) d2 is saved onto the stack. I'd thought that since -mregparm=3 is used, d2 would have been considered as a scratch register and needed no save/restore at all.
 
  Anyway, generally speaking of preserving regs onto stack, why not better use a free scratch reg like A0 or A1 in this case ? Saving regs (even fpu ones when precision doesn't matter[*]) in scratch regs rather onto stack is what I tend to do in asm where possible.
  ___
  [*] And since 68080 has 64bits regs, we can save all 64bits or  fpu regs in scratch data-reg without worrying for a loss of precision.
 

 
  I tried this example with /4 and I don't get the same code as with >>2.
  EXTERNAL LINK 
Edit: unsigned int works. It's jlt and not jne, it's late...


Samuel Devulder

Posts 248
28 Jul 2019 22:27


This is because the C standard (C89) defines "/" as rounding toward 0 when both args are positive, and implementation-dependent otherwise; since C99, rounding toward 0 is required in all cases. Gcc chose to round toward 0 in all cases ( EXTERNAL LINK ) just like Fortran does. I suppose this makes division a symmetric operation: (-x)/y == -(x/y). Unfortunately an arithmetic shift rounds toward "-inf" (so to say). It isn't symmetric. For instance 1>>2 is 0 whereas (-1>>2) is -1. Hence there is a test for the sign of the input, and 3 is added if it is negative to get proper rounding when implementing /4 with an arithmetic shift.
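
A short C sketch of that fix-up, for illustration. The function name div4_shift is made up here. Note that right-shifting a negative signed int is implementation-defined in C; the sketch assumes an arithmetic shift, which is what GCC documents.

```c
/* Divide by 4 with an arithmetic shift plus the sign fix-up described above. */
int div4_shift(int x)
{
    if (x < 0)
        x += 3;      /* bias negative inputs so the shift rounds toward 0 */
    return x >> 2;   /* assumes >> on a negative int is an arithmetic shift */
}
```

Under that assumption div4_shift(x) matches x / 4 for every int x, whereas a bare x >> 2 would turn -1 into -1 instead of 0.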


Grom 68k

Posts 61
29 Jul 2019 16:43


Samuel Devulder wrote:

This is because the C standard (C89) defines "/" as rounding toward 0 when both args are positive, and implementation-dependent otherwise; since C99, rounding toward 0 is required in all cases. Gcc chose to round toward 0 in all cases ( EXTERNAL LINK ) just like Fortran does. I suppose this makes division a symmetric operation: (-x)/y == -(x/y). Unfortunately an arithmetic shift rounds toward "-inf" (so to say). It isn't symmetric. For instance 1>>2 is 0 whereas (-1>>2) is -1. Hence there is a test for the sign of the input, and 3 is added if it is negative to get proper rounding when implementing /4 with an arithmetic shift.

Yes, I understood once I actually read the jlt.
Otherwise, is there no way to improve the size? Why not use jge? It would save one asr and one rts. Isn't bge better in this case? What about misprediction?

EXTERNAL LINK 


Samuel Devulder

Posts 248
29 Jul 2019 18:08


You mean something like this:
_f:
        tst.l d0
        jge .L5
        addq.l #3,d0
.L5:
        asr.l #2,d0
        rts

This code is definitely better for the 68080 than the previous one. Indeed, if I recall correctly, the addq.l is free in that case because it is a single instruction that doesn't reference memory following a conditional jump. Or maybe this is only possible with another conditional jump (jcs?). I don't remember exactly the rule here.
   
It might be useful for Gunnar to indicate the general schema for a free operation after a branch so that GCC can use it when possible. This is such a cool feature.


Grom 68k

Posts 61
29 Jul 2019 23:01


Samuel Devulder wrote:

  You mean something like this:
_f:
            tst.l d0
            jge .L5
            addq.l #3,d0
    .L5:
            asr.l #2,d0
            rts

    This code is definitely better for the 68080 than the previous one. Indeed, if I recall correctly, the addq.l is free in that case because it is a single instruction that doesn't reference memory following a conditional jump. Or maybe this is only possible with another conditional jump (jcs?). I don't remember exactly the rule here.
     
    It might be useful for Gunnar to indicate the general schema for a free operation after a branch so that GCC can use it when possible. This is such a cool feature.
 

 
  Yes, that's it.
 
  Otherwise, I tried to read Bebbo's commits. I found that ftst is set to 6 instead of 1. I also tried 1/fsqrt, but fsqrt is not defined.
 
  EXTERNAL LINK 
  I would like to check them all, but I don't know which instructions are in the types falu, fcmp and fbcc (fneg and fabs should be set to 1, for example).
 


Samuel Devulder

Posts 248
30 Jul 2019 00:03


In C, square root is sqrt() not fsqrt(), and you forgot to include <math.h> which is needed for maths functions like sqrt & friends.
 
Try this: EXTERNAL LINK and notice the strange code around negative values of x and the strange fp2 being pointlessly pushed onto the stack and fp3 being used in place of fp1. This asm code is very ugly. This one is better: EXTERNAL LINK It was produced from the very same C code but with different command-line options.
 
As a rule of thumb: always add -ffast-math to gcc command-line.


Grom 68k

Posts 61
30 Jul 2019 02:53


Samuel Devulder wrote:

In C, square root is sqrt() not fsqrt(), and you forgot to include <math.h> which is needed for maths functions like sqrt & friends.
   
  Try this: EXTERNAL LINK and notice the strange code around negative values of x and the strange fp2 being pointlessly pushed onto the stack and fp3 being used in place of fp1. This asm code is very ugly. This one is better: EXTERNAL LINK It was produced from the very same C code but with different command-line options.
 
  As a rule of thumb: always add -ffast-math to gcc command-line.

I knew that :(
I had too many bugs in cex this evening on the phone. I restarted at least 5 times. I had the include at the start. I also had sqrt, but not this time. It made me doubt.

I found in m68k.md falu and others. Only ftst must be modified. Fabs is in any group. Fneg is a group.



Stefan "Bebbo" Franke

Posts 139
30 Jul 2019 10:16


Samuel Devulder wrote:

In C, square root is sqrt() not fsqrt(), and you forgot to include <math.h> which is needed for maths functions like sqrt & friends.
   
  Try this: EXTERNAL LINK and notice the strange code around negative values of x and the strange fp2 being pointlessly pushed onto the stack and fp3 being used in place of fp1. This asm code is very ugly. This one is better: EXTERNAL LINK It was produced from the very same C code but with different command-line options.
 
  As a rule of thumb: always add -ffast-math to gcc command-line.

Correct math is very ugly: signed zeroes, inf, nan...

Also, there are optimizations before the stack frame gets inserted,
and there are optimizations after the stack frame is in place.

If an optimization frees a register after the stack frame is in place, a useless push/pop remains.



Grom 68k

Posts 61
30 Jul 2019 16:36


Hi,

I tried pow(x,-.5). It works well: it is converted to fdiv and fsqrt.

I tried pow(x,y). Must I include math-68881.h? It is not in math.h.

Regards
