Information about the Apollo CPU and FPU. |
|
---|
| | Grom 68k
Posts 61 10 Jul 2019 16:32
| I can help you make worse C/C++ code. :) Thanks for your awesome work.
#include <string.h>
void Scale3(double scalar, double* b, double* c) {
    size_t j;
    double t1; double t2; double t3; double t4;
    for (j=1000; j; j--){
        t1 = scalar* *c++; *b++ = t1;
        t2 = scalar* *c++; *b++ = t2;
        t3 = scalar* *c++; *b++ = t3;
        t4 = scalar* *c++; *b++ = t4;
    }
}
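For comparison, here is a minimal sketch of how the hand-unrolled loop relates to a plain loop. `ScaleSimple`, `SCALE_COUNT` and `scale_versions_agree` are hypothetical names introduced only for this illustration; they are not from the thread. The sketch assumes both arrays hold 4000 doubles (1000 iterations times 4 elements).

```c
#include <stddef.h>

#define SCALE_COUNT 4000 /* assumption: 1000 iterations x 4 elements */

/* Unrolled version, behaviour as posted (size_t comes from <stddef.h>). */
void Scale3(double scalar, double *b, double *c)
{
    size_t j;
    for (j = 1000; j; j--) {
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
    }
}

/* Hypothetical plain-loop equivalent over the same 4000 elements. */
void ScaleSimple(double scalar, double *b, const double *c)
{
    size_t i;
    for (i = 0; i < SCALE_COUNT; i++)
        b[i] = scalar * c[i];
}

/* Returns 1 when both versions write identical results, 0 otherwise. */
int scale_versions_agree(double scalar)
{
    static double src[SCALE_COUNT];
    static double out_unrolled[SCALE_COUNT];
    static double out_simple[SCALE_COUNT];
    size_t i;

    for (i = 0; i < SCALE_COUNT; i++)
        src[i] = (double)i * 0.5;

    Scale3(scalar, out_unrolled, src);
    ScaleSimple(scalar, out_simple, src);

    for (i = 0; i < SCALE_COUNT; i++)
        if (out_unrolled[i] != out_simple[i])
            return 0;
    return 1;
}
```

Both functions do the same work, so any difference in the generated assembler comes from how the compiler handles the loop shape, not from the arithmetic.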
| |
| | Samuel Devulder
Posts 248 10 Jul 2019 18:57
| Stefan "Bebbo" Franke wrote:
| Plus note the early scheduling of the div in the assembly for this function: double foo(double a, double b, double c) { return c/2 + c * (b-a) + b * b + a * (a + 1) / (a * a - 1); }
|
Yeah, you are doing a good job at tuning the FPU scheduling. I should try recompiling Quake with your latest gcc6.5b when it is available. One note though:
    fmove.d fp0,-(sp)
    move.l (sp)+,d0
    move.l (sp)+,d1
GCC always returns doubles in the d0/d1 pair (d0 alone for floats) using this kind of code, which is not great (memory access). I know the ABI imposes this, but I wonder if some magic 080 tricks (e.g. move fp0 to d1 "verbatim 64bit", then use VPERM to extract the highest 32 bits into d0) could do the same in a tiny number of cycles. Of course we would do the reverse when transforming d0:d1 back into fp0 (on the caller side).
Actually, if gcc were able to return the result in fp0 instead of d0:d1 (as VBCC does IIRC, but it is a different ABI) it would be even better, since the FPU computations on fp0 might run in parallel with the final MOVEM/RTS/ADDQ.l #n,sp that usually follows the assignment of the result. And the conversion from d0:d1 to fp0 on the caller side would no longer be necessary. (I'm not sure I'm being clear, but the idea is that FPU cycles are available in the epilogue of the function. These cycles can be used to finish the FPU computation.)
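To make the d0:d1 convention concrete, here is a small host-side sketch of what the fmove.d / move.l / move.l sequence achieves: the 64-bit double is split bit for bit into two 32-bit halves, with the upper half ending up in d0. `RegPair`, `double_to_pair` and `pair_to_double` are hypothetical names used only for illustration, and the sketch assumes an IEEE-754 64-bit `double`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the d0:d1 register pair used to return a double. */
typedef struct {
    uint32_t d0; /* upper 32 bits: sign, exponent, top of the mantissa */
    uint32_t d1; /* lower 32 bits of the mantissa */
} RegPair;

/* Callee side: split the double into two halves without converting it. */
RegPair double_to_pair(double value)
{
    uint64_t bits;
    RegPair pair;

    memcpy(&bits, &value, sizeof bits); /* bit-exact copy */
    pair.d0 = (uint32_t)(bits >> 32);
    pair.d1 = (uint32_t)(bits & 0xFFFFFFFFu);
    return pair;
}

/* Caller side: rebuild the double from the two halves. */
double pair_to_double(RegPair pair)
{
    uint64_t bits = ((uint64_t)pair.d0 << 32) | pair.d1;
    double value;

    memcpy(&value, &bits, sizeof value);
    return value;
}
```

On the real hardware both directions go through the stack, which is the memory access the post complains about; returning in fp0 would remove both steps.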
| |
| | Stefan "Bebbo" Franke
Posts 142 10 Jul 2019 19:19
| Samuel Devulder wrote:
| Stefan "Bebbo" Franke wrote:
| Plus note the early scheduling of the div in the assembly for this function: double foo(double a, double b, double c) { return c/2 + c * (b-a) + b * b + a * (a + 1) / (a * a - 1); }
|
Yeah, you are doing a good job at tuning the FPU scheduling. I should try recompiling Quake with your latest gcc6.5b when it is available. |
It's available now. Samuel Devulder wrote:
| One note though:
    fmove.d fp0,-(sp)
    move.l (sp)+,d0
    move.l (sp)+,d1
GCC always returns doubles in the d0/d1 pair (d0 alone for floats) using this kind of code, which is not great (memory access). I know the ABI imposes this, but I wonder if some magic 080 tricks (e.g. move fp0 to d1 "verbatim 64bit", then use VPERM to extract the highest 32 bits into d0) could do the same in a tiny number of cycles. Of course we would do the reverse when transforming d0:d1 back into fp0 (on the caller side).
Actually, if gcc were able to return the result in fp0 instead of d0:d1 (as VBCC does IIRC, but it is a different ABI) it would be even better, since the FPU computations on fp0 might run in parallel with the final MOVEM/RTS/ADDQ.l #n,sp that usually follows the assignment of the result. And the conversion from d0:d1 to fp0 on the caller side would no longer be necessary. (I'm not sure I'm being clear, but the idea is that FPU cycles are available in the epilogue of the function. These cycles can be used to finish the FPU computation.) |
It should be possible to add an attribute, e.g. `__retfp0`, to advise the compiler to use fp0 instead of d0/d1 for the return value. And `__regargs` needs to learn about fp*...
| |
| | Stefan "Bebbo" Franke
Posts 142 10 Jul 2019 21:53
| Stefan "Bebbo" Franke wrote:
| ... It should be possible to add an attribute, e.g. `__retfp0`, to advise the compiler to use fp0 instead of d0/d1 for the return value. And `__regargs` needs to learn about fp*...
|
__retfp0 __regargs double add(double a, double b) { return a + b; }
yields (local beta)
_add:
    fdadd.x fp1,fp0
    rts
| |
| | Samuel Devulder
Posts 248 11 Jul 2019 06:51
| That's nice :) I suppose __retfp0 has no impact when the returned value is an int. Is there a command-line switch or pragma to add an implicit __retfp0 to every function (except maybe for functions in math.h)?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 11 Jul 2019 07:13
| Stefan "Bebbo" Franke wrote:
| Have a look at cex: EXTERNAL LINK. It's not live yet, since some tests still fail, but it shows the wanted result for the Scale() function. It also shows that newer gcc versions really do optimize code, whilst gcc-2.95.3 is closer to an exact translation of the provided code. Thus worse C/C++ code may yield the same assembler code with gcc-6 as the "best" C/C++ code. Plus note the early scheduling of the div in the assembly for this function: double foo(double a, double b, double c) { return c/2 + c * (b-a) + b * b + a * (a + 1) / (a * a - 1); }
|
Hi Bebbo, I have a question: why is this code generated?
    fmovecr #0x32,fp6
    fdsub.x fp6,fp4
Would this not be simpler?
    fdsub.s #1.0,fp4
| |
| | Grom 68k
Posts 61 11 Jul 2019 07:28
| Stefan "Bebbo" Franke wrote:
| Have a look at cex: EXTERNAL LINK. It's not live yet, since some tests still fail, but it shows the wanted result for the Scale() function. It also shows that newer gcc versions really do optimize code, whilst gcc-2.95.3 is closer to an exact translation of the provided code. Thus worse C/C++ code may yield the same assembler code with gcc-6 as the "best" C/C++ code.
|
Hi Bebbo, in a few tests gcc generates this code:
    subq.l #1,d0
    ...
    tst.l d0
    jne .L2
Example:
#include <string.h>
void Scale3(double scalar, double* b, double* c) {
    size_t j;
    double t1; double t2; double t3; double t4;
    for (j=1000; j; j--){
        t1 = scalar* *c++; *b++ = t1;
        t2 = scalar* *c++; *b++ = t2;
        t3 = scalar* *c++; *b++ = t3;
        t4 = scalar* *c++; *b++ = t4;
    }
}
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 11 Jul 2019 07:38
| With Bebbo's GCC 6.5b I did not see this useless TST instruction. But what I do see in lower optimization modes are totally useless MOVEA instructions. These are effectively NOPs, and it would be good if GCC never created them, not even in -O1 mode.
Compiled with -O1:
.L2:
    fdmove.x fp0,fp4
    fdmove.x fp0,fp3
    fdmul.d (a0)+,fp4
    fdmove.x fp0,fp2
    fdmul.d (a0)+,fp3
    fdmul.d (a0)+,fp2
    fdmove.x fp0,fp1
    move.l a0,a0
    fdmul.d (a0)+,fp1
    fmove.d fp4,(a1)+
    fmove.d fp3,(a1)+
    fmove.d fp2,(a1)+
    move.l a1,a1
    fmove.d fp1,(a1)+
    subq.l #1,d0
    jne .L2
| |
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 07:49
| Gunnar von Boehn wrote:
| Hi Bebbo, I have a question why this code is generated: fmovecr #0x32,fp6 fdsub.x fp6,fp4
Would this not be simpler? fdsub.s #1.0,fp4
|
FSUB.<fmt> <ea>,FPn
FSUB.X FPm,FPn
| |
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 07:51
| Grom 68k wrote:
|
Stefan "Bebbo" Franke wrote:
| Have a look at cex: EXTERNAL LINK. It's not live yet, since some tests still fail, but it shows the wanted result for the Scale() function. It also shows that newer gcc versions really do optimize code, whilst gcc-2.95.3 is closer to an exact translation of the provided code. Thus worse C/C++ code may yield the same assembler code with gcc-6 as the "best" C/C++ code. |
Hi Bebbo, in a few tests gcc generates this code:
    subq.l #1,d0
    ...
    tst.l d0
    jne .L2
Example:
#include <string.h>
void Scale3(double scalar, double* b, double* c) {
    size_t j;
    double t1; double t2; double t3; double t4;
    for (j=1000; j; j--){
        t1 = scalar* *c++; *b++ = t1;
        t2 = scalar* *c++; *b++ = t2;
        t3 = scalar* *c++; *b++ = t3;
        t4 = scalar* *c++; *b++ = t4;
    }
}
|
Aye - since the m68080 has to wait for the fmul, the scheduler moves insns in between... without real gain here, since the cmp is retained.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 11 Jul 2019 08:09
| Stefan "Bebbo" Franke wrote:
| Gunnar von Boehn wrote:
| Hi Bebbo, I have a question why this code is generated: fmovecr #0x32,fp6 fdsub.x fp6,fp4
Would this not be simpler? fdsub.s #1.0,fp4 |
FSUB. < fmt > < ea > ,FPn FSUB.X FPm,FPn |
Sorry, I do not understand your answer. Can you please elaborate? My question was: why does GCC use 2 instructions instead of 1?
| |
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 08:14
| Gunnar von Boehn wrote:
| With Bebbo's GCC 6.5b I did not see this useless TST instruction. But what I do see in lower optimization modes are totally useless MOVEA instructions. These are effectively NOPs, and it would be good if GCC never created them, not even in -O1 mode.
Compiled with -O1:
.L2:
    fdmove.x fp0,fp4
    fdmove.x fp0,fp3
    fdmul.d (a0)+,fp4
    fdmove.x fp0,fp2
    fdmul.d (a0)+,fp3
    fdmul.d (a0)+,fp2
    fdmove.x fp0,fp1
    move.l a0,a0
    fdmul.d (a0)+,fp1
    fmove.d fp4,(a1)+
    fmove.d fp3,(a1)+
    fmove.d fp2,(a1)+
    move.l a1,a1
    fmove.d fp1,(a1)+
    subq.l #1,d0
    jne .L2
|
that's a left over from converting offsets into auto-inc - and yes, it should be optimized away - np.
| |
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 08:15
| Gunnar von Boehn wrote:
|
Stefan "Bebbo" Franke wrote:
| Gunnar von Boehn wrote:
| Hi Bebbo, I have a question why this code is generated: fmovecr #0x32,fp6 fdsub.x fp6,fp4
Would this not be simpler? fdsub.s #1.0,fp4 |
FSUB. < fmt > < ea > ,FPn FSUB.X FPm,FPn |
Sorry, I do not understand your answer. Can you please elaborate? My question was: why does GCC use 2 instructions instead of 1?
|
FSUB allows either an <ea> or a FPreg as first operand. Immediates are not allowed.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 11 Jul 2019 08:27
| Stefan "Bebbo" Franke wrote:
| FSUB allows either an <ea> or a FPreg as first operand. Immediates are not allowed. |
Actually, immediates are allowed. On 68K, #immediates are a valid type of <EA>. From the Motorola manual, valid <EA>:
    Dn*  An  (An)  (An)+  -(An)  (d16,An)  (d16,PC)  (d8,An,Xn)  (d8,PC,Xn)  (bd,An,Xn)  (bd,PC,Xn)
    ([bd,An,Xn],od)  ([bd,PC,Xn],od)  ([bd,An],Xn,od)  ([bd,PC],Xn,od)  (xxx).W  (xxx).L  #<data>
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 11 Jul 2019 09:45
| Let's take the time to discuss the general design of the 68K FPU ISA. The 68K FPU ISA is pretty nice and has some logical rules:
a) FPU instructions always store their result in FPU registers. This is good and allows FPU instructions to execute in parallel with ALU instructions.
b) FPU instructions always use one FPU register as the 2nd input, but can use a variety of other types as the 1st input:
- Dn registers. This is very nice and allows easy passing of parameters from general integer code.
- All types of memory EA
- #immediates
The FPU can also do type conversion on the 1st input.
As Dn used to be 32-bit on older 68K models, floating-point double inputs in Dn could not be supported. This limitation was removed in the 68080! This makes the 68080 more flexible and allows Dn registers to be used better as temporaries or constant holders.
The old 68K FPU ISA allowed as inputs: (memory EA) (8 data registers) (8 FPU registers) (immediates). So a total of 16 registers were available.
The new 68080 FPU ISA allows as inputs: (memory EA) (8 data registers) (32 FPU registers) (immediates). So a total of 40 registers are available!
The old 68K FPU ISA allowed as destination: (8 FPU registers).
The new 68080 FPU ISA allows as destination: (32 FPU registers).
The new 68080 FPU ISA also allows a 3-operand form, which greatly reduces the number of FMOVE instructions and results in significantly higher FPU performance.
The 68K FPU ISA was already very powerful. The new 68080 ISA makes it even more flexible and more powerful.
| |
| | Grom 68k
Posts 61 11 Jul 2019 10:08
| Gunnar von Boehn wrote:
| With BEBBOs GCC 6.5b I did not see this useless TST instruction
|
Just try my example with -Os, -O2 or -O3. It loses the FUSING possibility:
    subq.l #1,d0
    jne .L2
Does the TST add a misprediction penalty?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 11 Jul 2019 10:15
| Grom 68k wrote:
| Gunnar von Boehn wrote:
| With BEBBOs GCC 6.5b I did not see this useless TST instruction |
Just try my example with -Os, -O2 or -O3. It loses the FUSING possibility: subq.l #1,d0 jne .L2
Does the TST add a misprediction penalty? |
| Compiled with -O2 -m68080:
.L2:
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    subq.l #1,d0
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    tst.l d0
    jne .L2
    unlk a5
    rts
Yes, you are absolutely correct. GCC 6.5b does include the unneeded TST instruction. GCC seems to make 2 mistakes here:
a) it needlessly moves the SUBQ up in the code
b) it incorrectly believes the flags created by SUBQ are no longer valid because some FPU instructions were issued after it. The FPU instructions do NOT touch the flags of the integer ALU.
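The flag argument can be sketched with a tiny host-side model. This is not an emulator: `LoopCpu`, `subq_l_1`, `tst_l`, `fpu_op` and the two `loop_continues_*` functions are hypothetical names, and the model tracks only d0 and the Z flag, assuming (as stated above) that FPU instructions leave the integer flags alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimal CPU state: one data register and the Z flag. */
typedef struct {
    uint32_t d0;
    bool z;
} LoopCpu;

/* subq.l #1,d0: decrements d0 and sets Z from the result. */
static void subq_l_1(LoopCpu *cpu)
{
    cpu->d0 -= 1u;
    cpu->z = (cpu->d0 == 0u);
}

/* tst.l d0: sets Z from d0 without changing d0. */
static void tst_l(LoopCpu *cpu)
{
    cpu->z = (cpu->d0 == 0u);
}

/* Stand-in for any FPU instruction: integer flags stay untouched. */
static void fpu_op(LoopCpu *cpu)
{
    (void)cpu;
}

/* subq, FPU work, jne: the branch uses the flags left by subq. */
bool loop_continues_without_tst(uint32_t count)
{
    LoopCpu cpu = { count, false };
    subq_l_1(&cpu);
    fpu_op(&cpu);
    return !cpu.z; /* jne is taken while Z is clear */
}

/* subq, FPU work, tst, jne: the sequence GCC 6.5b emitted. */
bool loop_continues_with_tst(uint32_t count)
{
    LoopCpu cpu = { count, false };
    subq_l_1(&cpu);
    fpu_op(&cpu);
    tst_l(&cpu);
    return !cpu.z;
}
```

In this model both sequences always take the same branch decision, which is why the TST adds nothing as long as only FPU instructions sit between the SUBQ and the branch.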
| |
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 12:44
| Gunnar von Boehn wrote:
|
Stefan "Bebbo" Franke wrote:
| FSUB allows either an <ea> or a FPreg as first operand. Immediates are not allowed. |
Actually, Immediates are allowed. ...
|
Aye Sir! If the constant is one of the builtin constants, fmovecr is faster. That's maybe different for the 68080.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 11 Jul 2019 13:31
| Stefan "Bebbo" Franke wrote:
|
Gunnar von Boehn wrote:
| Stefan "Bebbo" Franke wrote:
| FSUB allows either an <ea> or a FPreg as first operand. Immediates are not allowed. |
Actually, Immediates are allowed. ... |
Aye Sir! If the constant is one of the builtin constants, fmovecr is faster. That's maybe different for the 68080.
|
Great that you found the issue here. Can we compare the timing calculation in GCC for fmovecr?
For 68080 the FPU rules are pretty simple. Basically the (EA) costs nothing.
EA:
    Dn = free
    (mem) = free
    #imm = free
    FPn = free
On 68080 the FMOVECR instruction is fast too = 1 clock.
Maybe we should consider that FMOVECR was removed from the 68060! If GCC wants to compile code that also runs on the 68060, using FMOVECR is a problem.
| |
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 14:26
| Gunnar von Boehn wrote:
| On 68080 the FMOVECR instruction is fast too = 1 clock. Maybe we should consider that FMOVECR was removed from the 68060! If GCC wants to compile code that also runs on the 68060, using FMOVECR is a problem.
|
gcc is aware of 68040/60:
/* fmovecr must be emulated on the 68040 and 68060, so it shouldn't be used at all on those chips. */
and for the 68080 all FP constants can be used directly now.
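The policy that came out of this exchange can be summarised in a short sketch. This is not gcc's actual code: `TargetCpu`, `ConstLoad`, `rom_offset_for` and `choose_const_load` are hypothetical names. The ROM table is a small subset, with #0x32 = 1.0 as seen in the generated code above and the other offsets taken from the Motorola FMOVECR description. "Immediate" here stands for any non-FMOVECR way of supplying the constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { CPU_68881, CPU_68040, CPU_68060, CPU_68080 } TargetCpu;
typedef enum { LOAD_FMOVECR, LOAD_IMMEDIATE } ConstLoad;

typedef struct {
    unsigned offset; /* FMOVECR ROM offset */
    double value;    /* constant stored at that offset */
} RomConst;

/* Subset of the FMOVECR constant ROM, for illustration only. */
static const RomConst rom_consts[] = {
    { 0x0F, 0.0 },
    { 0x32, 1.0 },
    { 0x33, 10.0 },
    { 0x34, 100.0 },
};

/* Looks up a bit-exact match in the table; true if found. */
bool rom_offset_for(double value, unsigned *offset)
{
    size_t i;

    assert(offset != NULL);
    for (i = 0; i < sizeof rom_consts / sizeof rom_consts[0]; i++) {
        if (memcmp(&value, &rom_consts[i].value, sizeof value) == 0) {
            *offset = rom_consts[i].offset;
            return true;
        }
    }
    return false;
}

/* Sketch of the selection discussed in the thread. */
ConstLoad choose_const_load(TargetCpu cpu, double value)
{
    unsigned offset;

    /* 68040/68060: fmovecr must be emulated, so avoid it entirely. */
    if (cpu == CPU_68040 || cpu == CPU_68060)
        return LOAD_IMMEDIATE;
    /* 68080: the EA is free, so constants are used directly. */
    if (cpu == CPU_68080)
        return LOAD_IMMEDIATE;
    /* Classic FPU: fmovecr pays off only for builtin constants. */
    return rom_offset_for(value, &offset) ? LOAD_FMOVECR : LOAD_IMMEDIATE;
}
```

With this sketch, `choose_const_load(CPU_68881, 1.0)` picks FMOVECR, while the same constant on the 68060 or 68080 is supplied directly.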
| |