Information about the Apollo CPU and FPU. |
| | Samuel Devulder
Posts 248 30 Jul 2019 17:08
| $ man pow EXTERNAL LINK EXTERNAL LINK :) This gives: EXTERNAL LINK Actually, we can see that it calls libm.a/pow. But I think a __stdargs is missing from the declaration of pow in math.h, since functions in libm.a should use the default (stack-based) ABI even if -mregparm is present. Also notice that "jsr XXX; rts" should be optimized to "jra XXX". This doesn't happen here, but it does when the "-mregparm" option is removed. (Still that famous post-stack-setup optimization pass that seems to be missing.)
| |
| | Stefan "Bebbo" Franke
Posts 142 30 Jul 2019 17:17
| Samuel Devulder wrote:
| $ man pow EXTERNAL LINK EXTERNAL LINK :) This gives: EXTERNAL LINK Actually, we can see that it calls libm.a/pow. But I think a __stdargs is missing from the declaration of pow in math.h, since functions in libm.a should use the default (stack-based) ABI even if -mregparm is present. |
Grom68k already provided many headers - I 'only' have to put them in... EDIT: math.h has __stdargs now.
| |
| | Samuel Devulder
Posts 248 30 Jul 2019 21:27
| Cool B) Works fine EXTERNAL LINK
| |
| | Grom 68k
Posts 61 30 Jul 2019 22:42
| Is __retfp0 active now on pow ? It seems not. EXTERNAL LINK Edit: -mregparm is active on pow this morning. -mregparm could be useful too in complex.h :). EXTERNAL LINK
| |
| | Stefan "Bebbo" Franke
Posts 142 31 Jul 2019 07:08
| Grom 68k wrote:
|
Is __retfp0 active now on pow ? It seems not. EXTERNAL LINK Edit: -mregparm is active on pow this morning. -mregparm could be useful too in complex.h :). EXTERNAL LINK
|
stdlib functions can't use register parameters or fp0 to return something. /shrug? complex returns more than one FP register -> a pointer is used. This is not covered by __retfp0; as the name states, fp0 is ONE register.
| |
| | Grom 68k
Posts 61 31 Jul 2019 15:28
| Stefan "Bebbo" Franke wrote:
| Grom 68k wrote:
| Is __retfp0 active now on pow ? It seems not. EXTERNAL LINK Edit: -mregparm is active on pow this morning. -mregparm could be useful too in complex.h :). EXTERNAL LINK |
stdlib functions can't use register parameters or fp0 to return something. /shrug? complex returns more than one FP register -> a pointer is used. This is not covered by __retfp0; as the name states, fp0 is ONE register. |
That's why I am trying to use math-68881.h to remove the builtin. Otherwise, there is something I don't understand: is __stdargs really working? pow now uses fp0 and fp1 as input. EXTERNAL LINK I only mentioned -mregparm for complex.h because it works on pow. Complex data is converted to 2 FP registers.
| |
| | Stefan "Bebbo" Franke
Posts 142 31 Jul 2019 15:40
| Grom 68k wrote:
| Otherwise, there is something I don't understand: pow now uses fp0 and fp1 as input. EXTERNAL LINK I only mentioned -mregparm for complex.h because it works on pow. Complex data is converted to 2 FP registers.
|
it depends which headers are live at cex... the old ones without __stdargs or the new ones with __stdargs :-)
| |
| | Samuel Devulder
Posts 248 31 Jul 2019 17:44
| Grom 68k wrote:
| pow now uses fp0 and fp1 as input.
|
hmm, no it doesn't:
  fmove.d fp1,-(sp)
  fmove.d fp0,-(sp)
  jsr _pow
(at the moment on EXTERNAL LINK ) Maybe the header has changed since.
| |
| | Grom 68k
Posts 61 31 Jul 2019 22:16
| Yes, cex improves quickly. I tried integer division again. It's better. EXTERNAL LINK Which is better: move d0,d0 or tst d0? Is tst smaller? Same question for jpl versus bpl. Edit: It's not better with /4. It's only better with /2! Why?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 01 Aug 2019 06:50
| Gunnar von Boehn wrote:
| How about using a formula like this for the cost?
a) 4 per clock cycle
b) +1 per instruction word
c) +2 for using memory
I think this will create much more balanced code. What do you think?
|
Bebbo, what do you think about this balanced cost proposal?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 01 Aug 2019 06:55
| More ideas to make better code:
int f(int* ptr)
{
    int tmp;
    tmp = *ptr & 2;
    return tmp;
}
GCC right now creates this, which is not optimal:
  move.l (a0),d0   #2  4 4 +d0
  and.l #2,d0      #3  0 4 *d0
  rts              #5 -1 0
GCC should produce this:
  MOVEQ #2,D0
  AND.L (A0),D0
We should include the instruction length in the COST model! I propose that we include the memory access and the length in the cost, so that the cost of the current code becomes:
  move.l (a0),d0   #2  4 7 +d0
  and.l #2,d0      #3  0 7 *d0
  rts              #5 -1 0
and the cost of the better code becomes:
  moveq #2,d0      #2  4 5 +d0
  and.l (a0),d0    #3  0 7 *d0
  rts              #5 -1 0
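The proposed formula (4 per clock cycle, +1 per instruction word, +2 for using memory) can be sketched in C as follows. This is only an illustration of the arithmetic; the function name `balanced_cost` and its parameters are hypothetical and are not part of GCC's cost hooks.

```c
#include <assert.h>
#include <stdbool.h>

/* Proposed balanced cost: 4 per clock cycle, +1 per 16-bit
 * instruction word, +2 if the instruction accesses memory.
 * The name and signature are made up for this sketch. */
int balanced_cost(int cycles, int words, bool uses_memory)
{
    assert(cycles >= 0 && words >= 1);
    return 4 * cycles + words + (uses_memory ? 2 : 0);
}
```

With one cycle each, `move.l (a0),d0` (1 word, memory) and `and.l #2,d0` (3 words, no memory) both come out at 7, while `moveq #2,d0` (1 word, no memory) comes out at 5, matching the numbers in the listings above.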
| |
| | Stefan "Bebbo" Franke
Posts 142 01 Aug 2019 17:06
| Gunnar von Boehn wrote:
| More ideas to make better code:
int f(int* ptr)
{
    int tmp;
    tmp = *ptr & 2;
    return tmp;
}
GCC right now creates this, which is not optimal:
  move.l (a0),d0   #2  4 4 +d0
  and.l #2,d0      #3  0 4 *d0
  rts              #5 -1 0
GCC should produce this:
  MOVEQ #2,D0
  AND.L (A0),D0
We should include the instruction length in the COST model! I propose that we include the memory access and the length in the cost, so that the cost of the current code becomes:
  move.l (a0),d0   #2  4 7 +d0
  and.l #2,d0      #3  0 7 *d0
  rts              #5 -1 0
and the cost of the better code becomes:
  moveq #2,d0      #2  4 5 +d0
  and.l (a0),d0    #3  0 7 *d0
  rts              #5 -1 0
|
I expect a gain of less than 0.1% ...
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 01 Aug 2019 20:30
| Stefan "Bebbo" Franke wrote:
| I expect a gain of less than 0.1% ...
|
This is very easy to count. :-)
                    Clocks  Bytes
  move.l (a0),d0      1       2
  and.l #2,d0         1       6
versus
  moveq #2,D0         1       2
  and.l (a0),D0       0       2   (fused!)
Clocks: 2 => 1
Bytes:  8 => 4
Twice as fast and half the size. I would call this a great improvement. This tuning is possible for very many operations, not only for AND but also ADD/SUB/OR/EOR/...
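The count above is a plain sum over each sequence. A minimal C sketch of it follows; the struct and array names are hypothetical, and the per-instruction clock and byte values are the ones given in the post.

```c
#include <assert.h>
#include <stddef.h>

/* Clocks and encoded size of one instruction, as listed in the post. */
struct insn_stat { int clocks; int bytes; };

/* Sum clocks and bytes over an instruction sequence. */
struct insn_stat total(const struct insn_stat *seq, size_t n)
{
    struct insn_stat sum = {0, 0};
    assert(seq != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        sum.clocks += seq[i].clocks;
        sum.bytes += seq[i].bytes;
    }
    return sum;
}

/* move.l (a0),d0 ; and.l #2,d0 */
const struct insn_stat current_code[2] = { {1, 2}, {1, 6} };
/* moveq #2,d0 ; and.l (a0),d0 (fused, so 0 extra clocks) */
const struct insn_stat proposed_code[2] = { {1, 2}, {0, 2} };
```

Summing `current_code` gives 2 clocks and 8 bytes; summing `proposed_code` gives 1 clock and 4 bytes.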
| |
| | Stefan "Bebbo" Franke
Posts 142 01 Aug 2019 20:33
| Gunnar von Boehn wrote:
|
Stefan "Bebbo" Franke wrote:
| I expect a gain of less than 0.1% ... |
This is very easy to count. :-)
                    Clocks  Bytes
  move.l (a0),d0      1       2
  and.l #2,d0         1       6
versus
  moveq #2,D0         1       2
  and.l (a0),D0       0       2   (fused!)
Clocks: 2 => 1
Bytes:  8 => 4
Twice as fast and half the size. I would call this a great improvement. This tuning is possible for very many operations, not only for AND but also ADD/SUB/OR/EOR/...
|
not for SUB ...
| |
| | Grom 68k
Posts 61 01 Aug 2019 22:14
| Gunnar von Boehn wrote:
| For the modern CPUs 060/080 we have 2 pipes, so "grouping" of instructions becomes an important topic. As the CPU can do 1 memory operation per cycle plus another register operation, instructions should be scheduled accordingly. Instead of this:
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,D1
  ADDq.l #1,D2
  ADDq.l #1,D3
do this:
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,(a0)+
  ADDq.l #1,D1
  ADDq.l #1,(a0)+
  ADDq.l #1,D2
  ADDq.l #1,(a0)+
  ADDq.l #1,D3
Such scheduling is also important for AMMX and FPU code.
|
Hi, To limit memory usage, I think a memory pipeline can be added.
(define_reservation "i_pipelines" "(i0_pipeline | i1_pipeline)")
;; simple insns with 1 cycle
(define_insn_reservation "simple" 1 (eq_attr "type" "alu_l") "i_pipelines, i_ports, i_memory")
Super scalar requirements reduce memory usage.
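To see why the interleaved order quoted above helps, here is a toy issue model in C. It is a sketch under loud assumptions: strictly in-order issue, two adjacent instructions issue in the same cycle unless both access memory, and dependencies and latencies are ignored. It is not a description of the real 060/080 pipelines, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy in-order dual-issue model (an assumption for illustration only):
 * the next two instructions issue together unless both are memory
 * operations; otherwise only the first one issues this cycle. */
int issue_cycles(const bool *is_mem, size_t n)
{
    int cycles = 0;
    size_t i = 0;
    assert(is_mem != NULL || n == 0);
    while (i < n) {
        bool can_pair = (i + 1 < n) && !(is_mem[i] && is_mem[i + 1]);
        i += can_pair ? 2 : 1;
        cycles++;
    }
    return cycles;
}

/* true = memory operand such as (a0)+, false = register-only operand */
const bool grouped[8]     = { true, true, true, true, false, false, false, false };
const bool interleaved[8] = { true, false, true, false, true, false, true, false };
```

Under this model the grouped order needs 6 cycles and the interleaved order needs 4, because every memory operation in the interleaved order has a register operation to pair with.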
| |
| | Stefan "Bebbo" Franke
Posts 142 01 Aug 2019 22:40
| Stefan "Bebbo" Franke wrote:
|
Gunnar von Boehn wrote:
| Stefan "Bebbo" Franke wrote:
| I expect a gain of less than 0.1% ... |
This is very easy to count. :-)
                    Clocks  Bytes
  move.l (a0),d0      1       2
  and.l #2,d0         1       6
versus
  moveq #2,D0         1       2
  and.l (a0),D0       0       2   (fused!)
Clocks: 2 => 1
Bytes:  8 => 4
Twice as fast and half the size. I would call this a great improvement. This tuning is possible for very many operations, not only for AND but also ADD/SUB/OR/EOR/... |
not for SUB ...
|
and not for EOR ...
| |
| | Stefan "Bebbo" Franke
Posts 142 01 Aug 2019 22:45
| Grom 68k wrote:
|
Gunnar von Boehn wrote:
| For the modern CPUs 060/080 we have 2 pipes, so "grouping" of instructions becomes an important topic. As the CPU can do 1 memory operation per cycle plus another register operation, instructions should be scheduled accordingly. Instead of this:
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,D1
  ADDq.l #1,D2
  ADDq.l #1,D3
do this:
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,(a0)+
  ADDq.l #1,D1
  ADDq.l #1,(a0)+
  ADDq.l #1,D2
  ADDq.l #1,(a0)+
  ADDq.l #1,D3
Such scheduling is also important for AMMX and FPU code. |
Hi, To limit memory usage, I think a memory pipeline can be added.
(define_reservation "i_pipelines" "(i0_pipeline | i1_pipeline)")
;; simple insns with 1 cycle
(define_insn_reservation "simple" 1 (eq_attr "type" "alu_l") "i_pipelines, i_ports, i_memory")
Super scalar requirements reduce memory usage.
|
Allowing 2 (or more) insns per cycle is more effort, since this
  fmul.x fp0,fp1
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,(a0)+
  ADDq.l #1,D1
  ADDq.l #1,(a0)+
  fmul.x fp1,fp2
would result in a stall, since the five insns take only 3 cycles...
| |
| | Grom 68k
Posts 61 02 Aug 2019 01:23
| :( I thought pipelines were easier than fusing. https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html#Processor-pipeline-description I found a method for fusing. Is it the same problem with the insn/cycle count? EXTERNAL LINK
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 02 Aug 2019 06:32
| Stefan "Bebbo" Franke wrote:
| and not for EOR ...
|
Why should it not?
  MOVEQ #$F,D0
  EOR.L (A0),D0
This works just fine.
The 68K offers a rich selection of instructions. For different immediate ranges, the 68K provides tuned instructions. Example:
                        Bytes
  SUBQ.L #$1,A0           2
  SUBA.W #$111,A0         4
  SUBA.L #$222222,A0      6
Using the tuned instruction makes programs smaller and increases the Icache hit rate, so it both saves size and increases speed. BTW, the 68080 offers this too:
                        Bytes
  ADDQ.L #$1,D0           2
  ADDIW.L #$111,D0        4
  ADDI.L #$222222,D0      6
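The three immediate ranges in the tables above can be expressed as a small size-selection helper. This is a hedged sketch: the function name `imm_op_bytes` is hypothetical, it only handles positive immediates, and it assumes the 4-byte forms take a sign-extended 16-bit immediate (true for SUBA.W; assumed here for the 68080's ADDIW.L).

```c
#include <assert.h>
#include <stdint.h>

/* Encoded size in bytes of a long-sized add/sub with a positive
 * immediate, picking the shortest of the three forms listed above.
 * Hypothetical helper for illustration, not compiler code. */
int imm_op_bytes(int32_t imm)
{
    assert(imm > 0);
    if (imm <= 8)
        return 2;   /* quick form: ADDQ / SUBQ, immediate 1..8 */
    if (imm <= INT16_MAX)
        return 4;   /* 16-bit immediate: SUBA.W, or ADDIW.L (assumed range) */
    return 6;       /* full 32-bit immediate: SUBA.L / ADDI.L */
}
```

For the values in the tables, `#$1` selects the 2-byte form, `#$111` the 4-byte form, and `#$222222` the 6-byte form.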
| |
| | Grom 68k
Posts 61 02 Aug 2019 06:59
| Gunnar von Boehn wrote:
|
Stefan "Bebbo" Franke wrote:
| and not for EOR ... |
Why should it not?
|
It's simply not in the PDF list for now.
| |