Information about the Apollo CPU and FPU.

GCC Improvement for 68080

Grom 68k

Posts 61
10 Jul 2019 16:32


I can help you make worse C/C++ code. :)
Thanks for your awesome work.

#include <string.h>
void Scale3(double scalar, double* b, double* c)
{
    size_t j;
    double t1;
    double t2;
    double t3;
    double t4;
     
    for (j=1000; j; j--){
        t1 = scalar* *c++;
        *b++ = t1;
        t2 = scalar* *c++;
        *b++ = t2;
        t3 = scalar* *c++;
        *b++ = t3;
        t4 = scalar* *c++;
        *b++ = t4;
    }
}


Samuel Devulder

Posts 248
10 Jul 2019 18:57


Stefan "Bebbo" Franke wrote:

  Plus note the early scheduling of the div in the assembly for this function:
 

  double foo(double a, double b, double c) {
        return c/2 + c * (b-a) + b * b + a * (a + 1) / (a * a - 1);
  }
 

 

Yeah, you are doing a good job of tuning the FPU scheduling. I should try recompiling Quake with your latest gcc6.5b when it becomes available.
 
One note though
 

          fmove.d fp0,-(sp)
          move.l (sp)+,d0
          move.l (sp)+,d1
 

GCC always returns doubles in the d0/d1 pair (d0 alone for floats) using this kind of code, which is not great (memory access). I know the ABI imposes this, but I wonder if some magic 080 trick (e.g. move fp0 to d1 "verbatim 64-bit", then use VPERM to extract the highest 32 bits into d0) could do the same in a tiny number of cycles. Of course we would have to do the reverse on the caller side when transforming d0:d1 back into fp0.
 
Actually, if gcc were able to return the result in fp0 instead of d0:d1 (as VBCC does IIRC, but that is a different ABI), it would be even better, since the FPU computations on fp0 might run in parallel with the final MOVEM/RTS/ADDQ.l #n,sp that usually follows the assignment of the result. The conversion from d0:d1 to fp0 on the caller side would no longer be necessary either. (I'm not sure I'm being clear, but the idea is that FPU cycles are available in the epilogue of the function, and those cycles can be used to finish the FPU computation.)
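
To make the d0:d1 convention concrete, here is a minimal C sketch of what the fmove.d / move.l / move.l sequence achieves, assuming IEEE 754 binary64 doubles. The `DoublePair` type and both helper names are hypothetical and only illustrate the bit layout (d0 holds the high 32 bits, d1 the low 32 bits); they are not part of any ABI or compiler.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical type: the two 32-bit halves of a double as they would
   end up in d0 (high word) and d1 (low word) on big-endian 68k. */
typedef struct {
    uint32_t d0; /* sign, exponent and top 20 mantissa bits */
    uint32_t d1; /* low 32 mantissa bits */
} DoublePair;

/* Callee side: split a double into the d0:d1 pair.
   Shifting a uint64_t keeps this independent of host byte order. */
static DoublePair double_to_pair(double value)
{
    uint64_t bits;
    DoublePair pair;

    memcpy(&bits, &value, sizeof bits);
    pair.d0 = (uint32_t)(bits >> 32);
    pair.d1 = (uint32_t)(bits & 0xFFFFFFFFu);
    return pair;
}

/* Caller side: rebuild the double from the d0:d1 pair. */
static double pair_to_double(DoublePair pair)
{
    uint64_t bits = ((uint64_t)pair.d0 << 32) | pair.d1;
    double value;

    memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, `double_to_pair(1.0)` gives d0 = 0x3FF00000 and d1 = 0. Returning the value in fp0 would make both conversions unnecessary.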


Stefan "Bebbo" Franke

Posts 139
10 Jul 2019 19:19


Samuel Devulder wrote:

 
Stefan "Bebbo" Franke wrote:

    Plus note the early scheduling of the div in the assembly for this function:
   

    double foo(double a, double b, double c) {
          return c/2 + c * (b-a) + b * b + a * (a + 1) / (a * a - 1);
    }
   

   

  Yeah, you are doing a good job of tuning the FPU scheduling. I should try recompiling Quake with your latest gcc6.5b when it becomes available.
 

 
  It's available now.

 
Samuel Devulder wrote:

  One note though
   

            fmove.d fp0,-(sp)
            move.l (sp)+,d0
            move.l (sp)+,d1
   

  GCC always returns doubles in the d0/d1 pair (d0 alone for floats) using this kind of code, which is not great (memory access). I know the ABI imposes this, but I wonder if some magic 080 trick (e.g. move fp0 to d1 "verbatim 64-bit", then use VPERM to extract the highest 32 bits into d0) could do the same in a tiny number of cycles. Of course we would have to do the reverse on the caller side when transforming d0:d1 back into fp0.
   
  Actually, if gcc were able to return the result in fp0 instead of d0:d1 (as VBCC does IIRC, but that is a different ABI), it would be even better, since the FPU computations on fp0 might run in parallel with the final MOVEM/RTS/ADDQ.l #n,sp that usually follows the assignment of the result. The conversion from d0:d1 to fp0 on the caller side would no longer be necessary either. (I'm not sure I'm being clear, but the idea is that FPU cycles are available in the epilogue of the function, and those cycles can be used to finish the FPU computation.)
 

 
  It should be possible to add an attribute, e.g. `__retfp0`, to advise the compiler to use fp0 instead of d0/d1 for the return value.
 
  And `__regargs` needs to learn about fp*...
 
 


Stefan "Bebbo" Franke

Posts 139
10 Jul 2019 21:53


Stefan "Bebbo" Franke wrote:

...

  It should be possible to add an attribute, e.g. `__retfp0`, to advise the compiler to use fp0 instead of d0/d1 for the return value.
 
  And `__regargs` needs to learn about fp*... 


__retfp0 __regargs double add(double a, double b) {
        return a + b;
}

yields (local beta)

_add:
        fdadd.x fp1,fp0
        rts
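
A sketch of how a source file might stay portable while using these attributes, assuming the build system defines a flag when the compiler supports them. `HAVE_RETFP0` and `FP_CALL` are hypothetical names invented for this example; only `__retfp0` and `__regargs` come from the post above, and their exact behaviour is whatever the local beta implements.

```c
/* HAVE_RETFP0 is a hypothetical build flag: define it only when compiling
   with a gcc that understands __retfp0 and __regargs on FP arguments. */
#ifdef HAVE_RETFP0
#define FP_CALL __retfp0 __regargs
#else
#define FP_CALL /* plain ABI: result comes back in d0:d1 */
#endif

/* Same function as in the post; the calling convention is chosen by FP_CALL. */
FP_CALL double add(double a, double b)
{
    return a + b;
}
```

Callers need to see the same declaration, so the prototype in a shared header would carry `FP_CALL` as well.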




Samuel Devulder

Posts 248
11 Jul 2019 06:51


That's nice :) I suppose __retfp0 has no impact when the returned value is an int.

Is there a cmd-line switch or pragma to add an implicit __retfp0 to every function (except maybe for functions in math.h)?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
11 Jul 2019 07:13


Stefan "Bebbo" Franke wrote:

  have a look at cex: EXTERNAL LINK   
    It's not live yet, since some tests do fail, but it's the wanted result for the Scale() function.
   
    It also shows that newer gcc versions really do optimize code, whilst gcc-2.95.3 is closer to an exact translation of the provided code. Thus worse C/C++ code may yield the same assembler code with gcc-6 as the "best" C/C++ code.
 
  Plus note the early scheduling of the div in the assembly for this function:
 

  double foo(double a, double b, double c) {
        return c/2 + c * (b-a) + b * b + a * (a + 1) / (a * a - 1);
  }
 

 

 
Hi Bebbo,
 
I have a question why this code is generated:

  fmovecr #0x32,fp6
  fdsub.x fp6,fp4

Would this not be simpler?
fdsub.s #1.0,fp4
 
 


Grom 68k

Posts 61
11 Jul 2019 07:28


Stefan "Bebbo" Franke wrote:

have a look at cex: EXTERNAL LINK   
  It's not live yet, since some tests do fail, but it's the wanted result for the Scale() function.
 
  It also shows that newer gcc versions really do optimize code, whilst gcc-2.95.3 is closer to an exact translation of the provided code. Thus worse C/C++ code may yield the same assembler code with gcc-6 as the "best" C/C++ code.

Hi Bebbo,

In a few tests, gcc generates this code:


subq.l #1,d0
...
tst.l d0
jne .L2

Example:


#include <string.h>
void Scale3(double scalar, double* b, double* c)
{
    size_t j;
    double t1;
    double t2;
    double t3;
    double t4;
     
    for (j=1000; j; j--){
        t1 = scalar* *c++;
        *b++ = t1;
        t2 = scalar* *c++;
        *b++ = t2;
        t3 = scalar* *c++;
        *b++ = t3;
        t4 = scalar* *c++;
        *b++ = t4;
    }
}




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
11 Jul 2019 07:38


With Bebbo's GCC 6.5b I did not see this useless TST instruction.

But what I do see at lower optimization levels are totally useless MOVEA instructions.
These are effectively NOPs, and it would be good if GCC never created them, not even at -O1.


compile with -O1
.L2:
        fdmove.x fp0,fp4
        fdmove.x fp0,fp3
        fdmul.d (a0)+,fp4
        fdmove.x fp0,fp2
        fdmul.d (a0)+,fp3
        fdmul.d (a0)+,fp2
        fdmove.x fp0,fp1
      move.l a0,a0
        fdmul.d (a0)+,fp1
        fmove.d fp4,(a1)+
        fmove.d fp3,(a1)+
        fmove.d fp2,(a1)+
    move.l a1,a1
        fmove.d fp1,(a1)+
        subq.l #1,d0
        jne .L2




Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 07:49


Gunnar von Boehn wrote:

 
  Hi Bebbo,
 
  I have a question why this code is generated:
 

    fmovecr #0x32,fp6
    fdsub.x fp6,fp4
 

  Would this not be simpler?
  fdsub.s #1.0,fp4

FSUB.<fmt> <ea>,FPn
FSUB.X FPm,FPn



Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 07:51


Grom 68k wrote:

Stefan "Bebbo" Franke wrote:

  have a look at cex: EXTERNAL LINK   
    It's not live yet, since some tests do fail, but it's the wanted result for the Scale() function.
   
    It also shows that newer gcc versions really do optimize code, whilst gcc-2.95.3 is closer to an exact translation of the provided code. Thus worse C/C++ code may yield the same assembler code with gcc-6 as the "best" C/C++ code.
 

 
  Hi Bebbo,
 
  In a few tests, gcc generates this code:
 
 

  subq.l #1,d0
  ...
  tst.l d0
  jne .L2
 

 
  Example:
 
 

  #include <string.h>
  void Scale3(double scalar, double* b, double* c)
  {
      size_t j;
      double t1;
      double t2;
      double t3;
      double t4;
     
      for (j=1000; j; j--){
          t1 = scalar* *c++;
          *b++ = t1;
          t2 = scalar* *c++;
          *b++ = t2;
          t3 = scalar* *c++;
          *b++ = t3;
          t4 = scalar* *c++;
          *b++ = t4;
      }
  }
 

 

Aye - since the m68080 has to wait for the fmul, the scheduler moves insns in between...
... without real gain here, since the cmp is retained.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
11 Jul 2019 08:09


Stefan "Bebbo" Franke wrote:

 
Gunnar von Boehn wrote:

   
    Hi Bebbo,
   
    I have a question why this code is generated:
   

      fmovecr #0x32,fp6
      fdsub.x fp6,fp4
   

    Would this not be simpler?
    fdsub.s #1.0,fp4
 

 
  FSUB.<fmt> <ea>,FPn
  FSUB.X FPm,FPn
 
 

Sorry, I do not understand your answer.
Can you please elaborate?

My question was: why does GCC use 2 instructions instead of 1?


Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 08:14


Gunnar von Boehn wrote:

With Bebbo's GCC 6.5b I did not see this useless TST instruction.
 
  But what I do see at lower optimization levels are totally useless MOVEA instructions.
  These are effectively NOPs, and it would be good if GCC never created them, not even at -O1.
 
 

  compile with -O1
  .L2:
          fdmove.x fp0,fp4
          fdmove.x fp0,fp3
          fdmul.d (a0)+,fp4
          fdmove.x fp0,fp2
          fdmul.d (a0)+,fp3
          fdmul.d (a0)+,fp2
          fdmove.x fp0,fp1
        move.l a0,a0
          fdmul.d (a0)+,fp1
          fmove.d fp4,(a1)+
          fmove.d fp3,(a1)+
          fmove.d fp2,(a1)+
      move.l a1,a1
          fmove.d fp1,(a1)+
          subq.l #1,d0
          jne .L2
 

 

That's a leftover from converting offsets into auto-inc - and yes, it should be optimized away - np.
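
For readers unfamiliar with the transformation mentioned here, the two addressing styles look roughly like this at the C level. Both functions are hypothetical illustrations (the `blocks` parameter is added so the loop length can be varied); which instructions GCC actually emits for either form depends on the version and flags used.

```c
#include <stddef.h>

/* Form 1: fixed offsets from a base pointer, with the pointers bumped
   once per iteration. On 68k this maps naturally to (d16,An) operands. */
void scale_indexed(double scalar, double *b, const double *c, size_t blocks)
{
    size_t j;

    for (j = 0; j < blocks; j++) {
        b[0] = scalar * c[0];
        b[1] = scalar * c[1];
        b[2] = scalar * c[2];
        b[3] = scalar * c[3];
        b += 4;
        c += 4;
    }
}

/* Form 2: every access post-increments its pointer, which maps
   naturally to (An)+ operands and needs no separate pointer update. */
void scale_autoinc(double scalar, double *b, const double *c, size_t blocks)
{
    size_t j;

    for (j = 0; j < blocks; j++) {
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
    }
}
```

Both forms compute the same result; a pointer update that has become redundant after the conversion is the kind of leftover a later pass is expected to remove.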




Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 08:15


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

 
Gunnar von Boehn wrote:

   
    Hi Bebbo,
     
    I have a question why this code is generated:
   

      fmovecr #0x32,fp6
      fdsub.x fp6,fp4
   

    Would this not be simpler?
    fdsub.s #1.0,fp4
   

   
    FSUB.<fmt> <ea>,FPn
    FSUB.X FPm,FPn
   
 

  Sorry, I do not understand your answer.
  Can you please elaborate?
 
  My question was: why does GCC use 2 instructions instead of 1?

FSUB allows either an <ea> or a FPreg as first operand.
Immediates are not allowed.



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
11 Jul 2019 08:27


Stefan "Bebbo" Franke wrote:

      FSUB allows either an <ea> or a FPreg as first operand.
      Immediates are not allowed.
     

     
Actually, Immediates are allowed.
On 68K #Immediates are a valid type of <EA>.
 
 
From the Motorola Manual
 
  Valid <EA>
  Dn*
  An
  (An)
  (An)+
  -(An)
  (d16,An)
  (d16,PC)
  (d8,An,Xn)
  (d8,PC,Xn)
  (bd,An,Xn)
  (bd,PC,Xn)
  ([bd,An,Xn],od)
  ([bd,PC,Xn],od)
  ([bd,An],Xn,od)
  ([bd,PC],Xn,od)
  (xxx).W
  (xxx).L
  #<data>
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
11 Jul 2019 09:45


Let's take the time to discuss the general design of the 68K FPU ISA.
 
The 68K FPU ISA is pretty nice and follows some logical rules:
 
a) FPU instructions always store their result in FPU registers.
This is good because it allows FPU instructions to execute in parallel with ALU instructions.
 
b) FPU instructions always use one FPU register as the 2nd input,
but can use a variety of other types as the 1st input:
  - Dn registers.
  This is very nice and allows easy passing of parameters from general integer code.
  - All types of memory EA.
  - #immediates.
  The FPU can also do type conversion on the 1st input.
 
As Dn used to be 32-bit on older 68K models, floating-point double inputs in Dn could not be supported. This limitation was removed in the 68080!
This makes the 68080 more flexible and allows Dn registers to be used better as temporaries or constant holders.

The old 68K FPU ISA allowed as inputs:
  (Memory-EA)
  (8 Data-Registers)
  (8 FPU-Registers)
  (Immediates)
So a total of 16 registers were available.

 
The new 68080 FPU ISA allows as inputs:
  (Memory-EA)
  (8 Data-Registers)
  (32 FPU-Registers)
  (Immediates)
So a total of 40 registers are available!
 
 
The old 68K FPU ISA allowed as destination:
  (8 FPU-Registers)
 
 
The new 68080 FPU ISA allows as destination:
  (32 FPU-Registers)
 

The new 68080 FPU ISA also allows a 3-operand form, which greatly reduces the number of FMOVE instructions and results in significantly higher FPU performance.
 
  The 68K FPU ISA was already very powerful.
  The new 68080 ISA makes it even more flexible and more powerful.
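
Why a 3-operand form saves FMOVEs can be shown with a toy model in C. This is only an illustration under stated assumptions: `ToyFpu` and all function names are hypothetical, nothing here is 68080 syntax or timing, and the model simply counts register copies when a scalar in fp0 has to survive four multiplications (the same pattern as the fdmove.x fp0,fpN copies in the -O1 listing earlier in this thread).

```c
/* Toy register file: 32 FPU registers plus a counter of register copies. */
typedef struct {
    double fp[32];
    int copies;
} ToyFpu;

/* Register-to-register copy (stands in for an FMOVE); counted. */
static void toy_move(ToyFpu *f, int src, int dst)
{
    f->fp[dst] = f->fp[src];
    f->copies++;
}

/* Two-operand multiply: dst = dst * src, so dst is overwritten. */
static void toy_mul2(ToyFpu *f, int src, int dst)
{
    f->fp[dst] *= f->fp[src];
}

/* Three-operand multiply: dst = a * b, both inputs stay intact. */
static void toy_mul3(ToyFpu *f, int a, int b, int dst)
{
    f->fp[dst] = f->fp[a] * f->fp[b];
}

/* Scalar in fp0, inputs in fp1..fp4, results go to fp5..fp8.
   Returns the number of register copies that were needed. */
static int scale_two_operand(ToyFpu *f)
{
    int i;

    f->copies = 0;
    for (i = 1; i <= 4; i++) {
        toy_move(f, 0, 4 + i); /* copy the scalar so fp0 stays alive */
        toy_mul2(f, i, 4 + i);
    }
    return f->copies;
}

static int scale_three_operand(ToyFpu *f)
{
    int i;

    f->copies = 0;
    for (i = 1; i <= 4; i++)
        toy_mul3(f, 0, i, 4 + i); /* no copy needed */
    return f->copies;
}
```

In this model the two-operand version needs four copies and the three-operand version needs none, while both leave the same values in fp5..fp8.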


Grom 68k

Posts 61
11 Jul 2019 10:08


Gunnar von Boehn wrote:

With BEBBOs GCC 6.5b I did not see this useless TST instruction

Just try my example with -Os, -O2 or -O3

It loses the FUSING possibility:


  subq.l #1,d0
  jne .L2

Does the TST add a misprediction penalty?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
11 Jul 2019 10:15


Grom 68k wrote:

 
Gunnar von Boehn wrote:

  With BEBBOs GCC 6.5b I did not see this useless TST instruction
 

 
  Just try my example with -Os, -O2 or -O3
 
  It loses the FUSING possibility:
 

    subq.l #1,d0
    jne .L2
 

 
  Does the TST add a misprediction penalty?
 

 
 
  compiled with -O2 -m68080
 

  .L2:
          fdmove.d (a0)+,fp1
          fdmul.x fp0,fp1
          fmove.d fp1,(a1)+
          subq.l #1,d0
          fdmove.d (a0)+,fp1
          fdmul.x fp0,fp1
          fmove.d fp1,(a1)+
          fdmove.d (a0)+,fp1
          fdmul.x fp0,fp1
          fmove.d fp1,(a1)+
          fdmove.d (a0)+,fp1
          fdmul.x fp0,fp1
          fmove.d fp1,(a1)+
          tst.l d0
          jne .L2
          unlk a5
          rts
 

 
Yes, you are absolutely correct.
GCC 6.5b does include the unneeded TST instruction.

GCC seems to make 2 mistakes here:
a) it needlessly moves the SUBQ up in the code
b) it incorrectly believes the flags set by SUBQ are no longer valid, because some FPU instructions were issued after it.
FPU instructions do NOT touch the flags of the integer ALU.
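
At the source level, the loop shape that keeps the decrement and the test together is a count-down `do ... while`. The following is a minimal sketch, not a guaranteed fix: the function name and the `blocks` parameter are hypothetical, and whether the scheduler still hoists the SUBQ away from the branch is up to the compiler.

```c
#include <stddef.h>

/* Count-down variant of the Scale3 example. The decrement and the
   zero test form one expression at the bottom of the loop, which is
   the shape that corresponds to "subq.l #1,d0 / jne .L2". */
void Scale3_countdown(double scalar, double *b, const double *c, size_t blocks)
{
    if (blocks == 0)
        return; /* the do-while below assumes at least one block */

    do {
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
    } while (--blocks != 0);
}
```

Calling it with `blocks` set to 1000 processes the same 4000 elements as the original loop.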



Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 12:44


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

        FSUB allows either an <ea> or a FPreg as first operand.
        Immediates are not allowed.
     

     
  Actually, Immediates are allowed.
  ... 

Aye Sir!

If the constant is one of the built-in constants, fmovecr is faster.
That may be different for the 68080.
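
The decision described here amounts to a table lookup. Below is a minimal C sketch of such a check, not GCC's actual implementation: the type and function names are hypothetical, and apart from offset 0x32 (1.0, as seen in the listing in this thread) the offsets are taken from memory of the Motorola constant ROM table and should be verified against the manual. Only entries that are exactly representable as a double are listed.

```c
#include <stddef.h>

/* One entry of the FMOVECR constant ROM: offset and the value it yields. */
typedef struct {
    unsigned offset;
    double value;
} RomConstant;

/* Subset of the constant ROM (assumed offsets, see the note above). */
static const RomConstant rom_constants[] = {
    { 0x0F, 0.0 },
    { 0x32, 1.0 },    /* 10^0, the constant used by "fmovecr #0x32,fp6" */
    { 0x33, 10.0 },   /* 10^1 */
    { 0x34, 100.0 },  /* 10^2 */
    { 0x35, 1.0e4 },  /* 10^4 */
    { 0x36, 1.0e8 },  /* 10^8 */
};

/* Returns the ROM offset for value, or -1 if it is not in the table.
   Note: -0.0 compares equal to 0.0 here; a stricter check would
   compare bit patterns instead. */
static int fmovecr_offset(double value)
{
    size_t i;

    for (i = 0; i < sizeof rom_constants / sizeof rom_constants[0]; i++) {
        if (rom_constants[i].value == value)
            return (int)rom_constants[i].offset;
    }
    return -1;
}
```

A code generator could emit fmovecr when the lookup succeeds and fall back to an immediate operand otherwise; which of the two is cheaper depends on the target CPU.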



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
11 Jul 2019 13:31


Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

 
Stefan "Bebbo" Franke wrote:

        FSUB allows either an <ea> or a FPreg as first operand.
        Immediates are not allowed.
       

       
  Actually, Immediates are allowed.
  ... 
 

 
  Aye Sir!
 
  If the constant is one of the built-in constants, fmovecr is faster.
  That may be different for the 68080.
 

Great that you found the issue here.
Can we compare the timing calculation in GCC?

fpmovecr

For the 68080 the FPU rules are pretty simple.
Basically the (EA) costs nothing.
EA:
Dn    = free
(mem) = free
#imm  = free
FPn   = free

On the 68080 the FMOVECR instruction is fast = 1 clock

Maybe we should consider that FMOVECR was removed from the 68060!
If GCC wants to compile code that also runs on the 68060, using FMOVECR is a problem.




Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 14:26


Gunnar von Boehn wrote:

  On the 68080 the FMOVECR instruction is fast = 1 clock
 
 
  Maybe we should consider that FMOVECR was removed from the 68060!
  If GCC wants to compile code that also runs on the 68060, using FMOVECR is a problem.

gcc is aware of 68040/60:
  /* fmovecr must be emulated on the 68040 and 68060, so it shouldn't be used at all on those chips.  */

and for the 68080 all FP constants can be used directly now.
