Information about the Apollo CPU and FPU.
| | Stefan "Bebbo" Franke
Posts 142 15 Jul 2019 18:30
| -funroll-loops is working better now too: EXTERNAL LINK
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 15 Jul 2019 18:44
| Stefan "Bebbo" Franke wrote:
| -funroll-loops is working better now too: EXTERNAL LINK
|
Well done Bebbo, this is major progress!
| |
| | Samuel Devulder
Posts 248 15 Jul 2019 19:16
| Stefan "Bebbo" Franke wrote:
| -funroll-loops is working better now too: EXTERNAL LINK |
I don't understand the first operation:

_Scale8:
        fdmove.x fp1,fp0 ; <== HERE
        move.l #640,d0
        fmovem fp2/fp3/fp4,-(sp)
.L2:
        fdmove.d (a1)+,fp4
        fdmul.x fp0,fp4
        fdmove.d (a1)+,fp3
        fdmul.x fp0,fp3
        fdmove.d (a1)+,fp2
        fdmul.x fp0,fp2
        fdmove.d (a1)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp4,(a0)+
        fmove.d fp3,(a0)+
        fmove.d fp2,(a0)+
        fmove.d fp1,(a0)+
        subq.l #4,d0
        jne .L2
        fmovem (sp)+,fp4/fp3/fp2
        rts

We could have worked with fp1 as the source throughout the unrolled loop and used fp0 as a temporary register inside it. The two registers are interchangeable in this context, but copying one into the other costs one (needless) instruction. If we swap the scalar0/scalar args, the copy doesn't occur (the rest of the code is unchanged).
| |
| | Stefan "Bebbo" Franke
Posts 142 15 Jul 2019 19:27
regparms are a hack in the parser: they still remain ordinary parameters, so the compiler's first action is a load into a register, because it does not know about register parameters.
| |
| | Samuel Devulder
Posts 248 15 Jul 2019 19:44
| Okay, thanks for explaining. And now for something totally different (maybe?), if I do ( EXTERNAL LINK )

    double Scale8(double scalar0, double scalar, double* restrict b, double* restrict c) {
        double t=0;
        size_t j;
        for (j=640; j; j--){
            t+=*b++ = scalar * *c++;
        }
        return t;
    }

I get

_Scale8:
        fmove.d #0x000000000,fp0
        fmovem fp2/fp3/fp4/fp5,-(sp)
        fdmove.d (60,sp),fp1
        move.l (72,sp),a1
        move.l (68,sp),a0
        move.l #640,d0
.L2:
        fdmove.d (a1)+,fp5
        fdmul.x fp1,fp5
        fdmove.d (a1)+,fp4
        fdmul.x fp1,fp4
        fdmove.d (a1)+,fp3
        fdmul.x fp1,fp3
        fdmove.d (a1)+,fp2
        fdmul.x fp1,fp2
        fmove.d fp5,(a0)+
        fmove.d fp4,(a0)+
        fdadd.x fp5,fp0
        ; ~5 cycles waiting for fp0
        fdadd.x fp4,fp0
        fmove.d fp3,(a0)+
        ; ~4 cycles waiting for fp0
        fdadd.x fp3,fp0
        fmove.d fp2,(a0)+
        ; ~4 cycles waiting for fp0
        fdadd.x fp2,fp0
        subq.l #4,d0
        jne .L2
        fmove.d fp0,-(sp)
        move.l (sp)+,d0
        move.l (sp)+,d1
        fmovem (sp)+,fp5/fp4/fp3/fp2
        rts

We have 13 wait-cycles, but if we reorganize the final computation like this

        ...
        fmove.d fp5,(a0)+
        fadd.x fp4,fp5 ; <== notice
        fmove.d fp4,(a0)+
        fmove.d fp3,(a0)+
        fadd.x fp2,fp3 ; <== notice
        fmove.d fp2,(a0)+
        subq.l #4,d0
        ; 0 cycles on fp5 (kewl!)
        fadd.x fp5,fp0
        ; ~5 cycles waiting for fp0 (well...)
        fadd.x fp3,fp0
        jne .L2

we drop to 5 (instead of 13).
| |
| | Stefan "Bebbo" Franke
Posts 142 15 Jul 2019 19:47
| Pardon? fdadd.x is done in 1 cycle, isn't it?
| |
| | Samuel Devulder
Posts 248 15 Jul 2019 19:56
| I think this is 6, like fmul. Roughly speaking, provided the regs are available, all fmoves take 1 cycle, all fadd/fsub/fcmp/fmul take 6 (these are "simple" FPU ops), fdiv takes 10 (a "complex" FPU op), and the others take "a lot" (too complex and rare FPU ops), IIRC. See BigGun's post on page 3, which details cycles for int operations as well.
| |
| | Stefan "Bebbo" Franke
Posts 142 15 Jul 2019 20:01
| Samuel Devulder wrote:
| I think this is 6 like fmul. Roughly speaking, provided the regs are available, all fmoves are 1 cycles, all fadd/fsub/fcmp/fmul are 6, fdiv is 10, and others are "a lot" IIRC (see BigGun post on page 3 which details cycles for int operations as well). |
EDIT: So far I treat only fmul as taking 6 cycles. Dunno what's correct.
| |
| | Samuel Devulder
Posts 248 15 Jul 2019 20:04
| Stefan "Bebbo" Franke wrote:
| that's why I don't understand the wait cycles after fdadd.x
|
In the annotated asm I reported, or generally speaking? If it is in the annotated asm, I might have miscalculated something. Which one is puzzling?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 15 Jul 2019 20:07
| Stefan "Bebbo" Franke wrote:
| Pardon? fdadd.x is done in 1 cycle, isn't it?
|
FADD and FMUL both take 6 cycles. That both take several cycles is typical for an FPU. It's normal that in integer arithmetic ADD is faster than MUL, but in floating-point arithmetic FADD is the same work and has the same latency as FMUL.
| |
| | Samuel Devulder
Posts 248 15 Jul 2019 20:17
| Gunnar von Boehn wrote:
| But in floating point arithmetic FADD is same work and same latency as FMUL. |
Work like this, maybe:
1) extract sign / exponent / mantissa
2) add exponents in case of '*', or align mantissas in case of '+/-/fcmp'
3) work on the mantissa (add or multiply)
4) normalize mantissa & fix exponent if needed
5) pack everything back into IEEE format (except for fcmp)
6) update status regs
(This is how I arrive at the 6 cycles, which is probably wrong.)
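To make the listed steps concrete, here is a minimal software sketch of a floating-point addition following the same outline (unpack, align, add, normalize, repack). It is only an illustration under loud assumptions: it handles positive normal doubles only, truncates instead of rounding, ignores overflow and status flags, and says nothing about how the Apollo FPU actually splits its 6 cycles. The function names are hypothetical.

```python
import struct

FRAC_BITS = 52                      # explicit fraction bits of an IEEE 754 double
FRAC_MASK = (1 << FRAC_BITS) - 1


def unpack(x):
    """Step 1: split a double into sign, biased exponent and fraction fields."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    return bits >> 63, (bits >> FRAC_BITS) & 0x7FF, bits & FRAC_MASK


def pack(sign, exp, frac):
    """Step 5: reassemble the three fields into a double."""
    bits = (sign << 63) | (exp << FRAC_BITS) | frac
    return struct.unpack('>d', struct.pack('>Q', bits))[0]


def add_positive_normals(a, b):
    """Add two positive normal doubles (illustrative only, truncating)."""
    _, ea, fa = unpack(a)
    _, eb, fb = unpack(b)
    ma = fa | (1 << FRAC_BITS)      # restore the implicit leading 1
    mb = fb | (1 << FRAC_BITS)
    if ea < eb:                     # make 'a' the operand with the larger exponent
        ea, eb, ma, mb = eb, ea, mb, ma
    mb >>= ea - eb                  # step 2: align the smaller mantissa
    m = ma + mb                     # step 3: add the mantissas
    e = ea
    if m >> (FRAC_BITS + 1):        # step 4: carry out, so renormalize
        m >>= 1
        e += 1
    return pack(0, e, m & FRAC_MASK)


print(add_positive_normals(1.5, 2.25))
```

For inputs whose sum is exactly representable, the result agrees with the native `+`; otherwise the truncation in the alignment step can differ from the hardware's rounding.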
| |
| | Stefan "Bebbo" Franke
Posts 142 15 Jul 2019 20:23
| ok - I updated the cycles for fadd and friends:
_Scale8:
        fmove.d #0x000000000,fp0
        fmovem fp2/fp3/fp4/fp5,-(sp)
        fdmove.x fp1,fp2
        move.l #640,d0
.L2:
        fdmove.d (a1)+,fp5
        fdmul.x fp2,fp5
        fdmove.d (a1)+,fp4
        fdmul.x fp2,fp4
        fdmove.d (a1)+,fp3
        fdmul.x fp2,fp3
        fdmove.d (a1)+,fp1
        fdmul.x fp2,fp1
        fmove.d fp5,(a0)+
        fdadd.x fp5,fp0
        fmove.d fp4,(a0)+
        fdadd.x fp4,fp0
        fmove.d fp3,(a0)+
        fdadd.x fp3,fp0
        fmove.d fp1,(a0)+
        fdadd.x fp1,fp0
        subq.l #4,d0
        jne .L2
        fmove.d fp0,-(sp)
        move.l (sp)+,d0
        move.l (sp)+,d1
        fmovem (sp)+,fp5/fp4/fp3/fp2
        rts
| |
| | Samuel Devulder
Posts 248 15 Jul 2019 20:34
| There are still wait-states when doing the additions in a linear/serial fashion:

        fdadd.x fp5,fp0
        fmove.d fp4,(a0)+
        fdadd.x fp4,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline
        fmove.d fp3,(a0)+
        fdadd.x fp3,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline
        fmove.d fp1,(a0)+
        fdadd.x fp1,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline

I think it is better to "parallelize" things into independent operations. I typically see the computation as a tree with independent branches computing independent partial results that converge to the trunk (== end result). The partial results become more and more dependent on one another as we get closer to the trunk. Most of the branches are latency-free. Only the trunk concentrates the wait-cycles, which can easily be filled with integer operations or with other independent computational branches coming from expressions occurring a bit later in the code:

        fmove.d fp5,(a0)+
        fdadd.x fp4,fp5
        fmove.d fp4,(a0)+
        fmove.d fp3,(a0)+
        fdadd.x fp1,fp3 ; works in parallel with fp4+fp5
        fmove.d fp1,(a0)+
        fdadd.x fp5,fp0 ; wait 1 cycle for fp5 finishing its addition
        fadd.x fp3,fp0 ; wait 5 cycles for fp0 finishing from prev instruction

Total: 6 wait-cycles instead of 15.
| |
| | Stefan "Bebbo" Franke
Posts 142 15 Jul 2019 20:56
I see, I understand, and I know the pass where this happens... it's reload, where the register assignment is done, and this pass does not consider the latency... maybe a no-fix for me. EDIT: isn't it 4 cycles per wait, for a total of 12?
| |
| | Philippe Flype (Apollo Team Member) Posts 299 15 Jul 2019 21:08
| Since the 080 has a precise cycle counter, I can output the real results for each of them. These are register-to-register operations, with the exception of FMOVE R/W and FMOVEM R/W.

+------------+--------+-----+
| FPU instr  | Single | OoO |
+------------+--------+-----+
| FABS       |      1 |   1 |
| FADD       |      6 |   1 |
| FCMP       |      6 |   1 |
| FDABS      |      1 |   1 |
| FDADD      |      6 |   1 |
| FDDIV      |      9 |   2 |
| FDIV       |      9 |   2 |
| FDMOVE     |      1 |   1 |
| FDMUL      |      6 |   1 |
| FDNEG      |      1 |   1 |
| FDSQRT     |     21 |  12 |
| FDSUB      |      6 |   1 |
| FINTRZ     |      2 |   1 |
| FMOVERm    |      1 |   1 |
| FMOVEWm    |      1 |   1 |
| FMOVERi    |      1 |   1 |
| FMOVEWi    |      1 |   1 |
| FMOVECR    |      1 |   1 |
| FMOVECTRL  |      4 |   4 |
| FMOVEMR    |      8 |   8 |
| FMOVEMW    |     25 |  25 |
| FMUL       |      6 |   1 |
| FNEG       |      1 |   1 |
| FSABS      |      1 |   1 |
| FSADD      |      6 |   1 |
| FSDIV      |      9 |   2 |
| FSGLDIV    |      9 |   2 |
| FSGLMUL    |      6 |   1 |
| FSMOVE     |      1 |   1 |
| FSMUL      |      6 |   1 |
| FSNEG      |      1 |   1 |
| FSQRT      |     21 |  12 |
| FSSQRT     |     21 |  12 |
| FSSUB      |      6 |   1 |
| FSUB       |      6 |   1 |
| FTST       |      1 |   1 |
| FSEQ       |      1 |   1 |
| FSCC       |      1 |   1 |
| FNOP       |      1 |   1 |
+------------+--------+-----+
| FPSP instr | Single | OoO |
+------------+--------+-----+
| FACOS      |    121 | 121 |
| FASIN      |    121 | 121 |
| FATAN      |    198 | 198 |
| FATANH     |    153 | 153 |
| FCOS       |    209 | 209 |
| FCOSH      |    264 | 264 |
| FETOX      |    220 | 220 |
| FETOXM1    |    231 | 231 |
| FGETEXP    |     88 |  88 |
| FGETMAN    |     88 |  88 |
| FINT       |     99 |  99 |
| FLOG10     |    231 | 231 |
| FLOG2      |    242 | 242 |
| FLOGN      |    220 | 220 |
| FLOGN1P    |    220 | 220 |
| FMOD       |    121 | 121 |
| FREM       |    121 | 121 |
| FSCALE     |     99 |  99 |
| FSIN       |    238 | 238 |
| FSINCOS    |    264 | 264 |
| FSINH      |    286 | 286 |
| FTAN       |    198 | 198 |
| FTANH      |    275 | 275 |
| FTENTOX    |    231 | 231 |
| FTWOTOX    |    231 | 231 |
+------------+--------+-----+

Source code provided: EXTERNAL LINK
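As a rough way to read the table, the sketch below estimates the cost of a run of n identical instructions. It rests on an assumption the post does not spell out: that "Single" is the result latency (each instruction waits for the previous result) and "OoO" is the issue interval when the instructions are independent of one another. The function names are hypothetical, and only a few table rows are copied in.

```python
# (Single, OoO) pairs copied from the table above for a few instructions.
TIMINGS = {
    "FADD": (6, 1),
    "FMUL": (6, 1),
    "FDIV": (9, 2),
    "FSQRT": (21, 12),
}


def dependent_chain_cycles(op, n):
    """n instructions where each one needs the previous result (assumed model)."""
    latency, _ = TIMINGS[op]
    return n * latency


def independent_cycles(op, n):
    """n mutually independent instructions issued back to back (assumed model)."""
    latency, interval = TIMINGS[op]
    return (n - 1) * interval + latency


for op in ("FADD", "FDIV"):
    print(op, dependent_chain_cycles(op, 4), independent_cycles(op, 4))
```

Under this reading, four chained FADDs cost about 24 cycles while four independent ones finish in about 9, which is the gap the register reordering discussed in this thread tries to close.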
| |
| | Samuel Devulder
Posts 248 15 Jul 2019 21:44
| @bebbo: For a 6-cycle FPU instruction, I count 5 other instruction slots before the result can be reused. Yes, you are right: this makes 4 wait-cycles per addition in that code, hence a total of 12. The total count for the "parallelized" version is still 6. Sam (this thread is very interesting)
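The counting above can be checked with a small scoreboard model. This is a sketch under simplifying assumptions that are mine, not the Apollo team's: one instruction issues per cycle, in order; an fadd result becomes readable 6 cycles after issue; stores issue as soon as their source is ready; and all registers are ready when the sequence starts. The tuple layout and function name are hypothetical.

```python
FP_LATENCY = 6  # cycles until an fadd result can be read (figure from the thread)


def count_wait_cycles(program, latency=FP_LATENCY):
    """Count stall cycles for a list of (mnemonic, sources, dest) tuples.

    dest is None for instructions that write no FPU register (stores).
    """
    ready = {}   # register -> first cycle in which its value can be read
    cycle = 0    # cycle in which the next instruction would like to issue
    waits = 0
    for _mnemonic, sources, dest in program:
        needed = max((ready.get(reg, 0) for reg in sources), default=0)
        if needed > cycle:          # stall until every source is available
            waits += needed - cycle
            cycle = needed
        if dest is not None:
            ready[dest] = cycle + latency
        cycle += 1
    return waits


serial = [
    ("fdadd", ("fp5", "fp0"), "fp0"),
    ("fmove", ("fp4",), None),
    ("fdadd", ("fp4", "fp0"), "fp0"),
    ("fmove", ("fp3",), None),
    ("fdadd", ("fp3", "fp0"), "fp0"),
    ("fmove", ("fp1",), None),
    ("fdadd", ("fp1", "fp0"), "fp0"),
]

paired = [
    ("fmove", ("fp5",), None),
    ("fdadd", ("fp4", "fp5"), "fp5"),
    ("fmove", ("fp4",), None),
    ("fmove", ("fp3",), None),
    ("fdadd", ("fp1", "fp3"), "fp3"),
    ("fmove", ("fp1",), None),
    ("fdadd", ("fp5", "fp0"), "fp0"),
    ("fadd", ("fp3", "fp0"), "fp0"),
]

print(count_wait_cycles(serial), count_wait_cycles(paired))
```

With these assumptions the model reports 12 wait-cycles for the serial accumulation and 6 for the paired one, matching the counts agreed on above. It ignores the latency of the preceding fdmul results and the loop branch, so it should not be read as a prediction for the whole loop.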
| |
| | Grom 68k
Posts 61 15 Jul 2019 22:20
| Stefan "Bebbo" Franke wrote:
|
not exactly - he's looking for the headers which are in several repos. Easiest: grab the built version, and mail me the fixed headers and I'll put them live |
Hi bebbo, this is the script.

    #!/usr/bin/python
    import os
    import re

    for root, dirs, files in os.walk('.'):
        for file in files:
            if file.endswith('.h'):
                count=0
                with open(os.path.join(root, file), 'r') as infile:
                    with open(os.path.join(root, file + 'new'), 'w') as outfile:
                        for currentline in infile.readlines():
                            match = re.search(r'^\s*\w+\s+\w+\(.*$', currentline, flags=0)
                            if match:
                                outfile.write(re.sub(r'^(\s*)(\w+\s+\w+)(\(.*)$', r'\1__stdargs \2\3', currentline, flags=0))
                                count=count+1
                            else:
                                match = re.search(r'^\s*struct\s+\w+\s+\*\w+\(.*$', currentline, flags=0)
                                if match:
                                    outfile.write(re.sub(r'^(\s*)(struct\s+\w+\s+\*\w+)(\(.*)$', r'\1__stdargs \2\3', currentline, flags=0))
                                    count=count+1
                                else:
                                    outfile.write(currentline)
                if (count==0):
                    os.remove(os.path.join(root, file + 'new'))
                else:
                    os.remove(os.path.join(root, file))
                    os.rename(os.path.join(root, file + 'new'),os.path.join(root, file))

I think I must check all files before delivery. Could you give me your e-mail address for the delivery? Must I modify these ones too?

    uint32 APICALL (*Release)(struct SocketIFace *Self);
    struct Interface * APICALL (*Clone)(struct SocketIFace *Self);
Thanks
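To see what the two patterns in the script do and do not catch, here is a small sketch that applies them to single lines instead of files. The helper name `add_stdargs` and the sample declarations are made up for illustration; only the regular expressions come from the script.

```python
import re

# Same patterns as in the script, precompiled.
PLAIN = re.compile(r'^(\s*)(\w+\s+\w+)(\(.*)$')
STRUCT_PTR = re.compile(r'^(\s*)(struct\s+\w+\s+\*\w+)(\(.*)$')


def add_stdargs(line):
    """Prefix a matching prototype with __stdargs; return other lines unchanged."""
    for pattern in (PLAIN, STRUCT_PTR):
        if pattern.search(line):
            return pattern.sub(r'\1__stdargs \2\3', line)
    return line


samples = [
    "int socket(int domain, int type, int protocol);",
    "struct hostent *gethostbyname(const char *name);",
    "uint32 APICALL (*Release)(struct SocketIFace *Self);",  # not matched
    "unsigned long inet_addr(const char *cp);",              # not matched
    "    return foo(x);",                                    # matched by accident
]
for sample in samples:
    print(add_stdargs(sample))
```

The function-pointer lines are left alone because of the space before `(*Release)`, which is why they have to be handled separately. Multi-word return types are skipped as well, and a statement such as `return foo(x);` inside an inline function would be rewritten by mistake, so checking the result by hand before delivery seems wise.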
| |
| | Nixus Minimax
Posts 416 16 Jul 2019 08:03
| Samuel Devulder wrote:
| (this thread is very interesting)
|
I agree. What may be incomprehensible gibberish to many is really exciting to us.
| |
| | Stefan "Bebbo" Franke
Posts 142 16 Jul 2019 08:35
| Grom 68k wrote:
| I think I must check all files before delivery. Could you give me your mail for the delivery? I must modify these one too ? uint32 APICALL (*Release)(struct SocketIFace *Self); struct Interface * APICALL (*Clone)(struct SocketIFace *Self);
Thanks
|
Please mail to bebbo at bejy.net. And yes, the function pointers need it too. Looking at "APICALL", it would be sufficient to modify the define for "APICALL". Less work? Thank you!
| |
| | Grom 68k
Posts 61 16 Jul 2019 10:18
| Stefan "Bebbo" Franke wrote:
|
Grom 68k wrote:
| I think I must check all files before delivery. Could you give me your mail for the delivery? I must modify these one too ? uint32 APICALL (*Release)(struct SocketIFace *Self); struct Interface * APICALL (*Clone)(struct SocketIFace *Self);
Thanks |
please mail to bebbo at bejy.net And yes, the function pointers do need it too. If I read "APICALL" it would be sufficient to modify the define for "APICALL" - less work? thank you!
|
I will modify APICALL manually; only a few files are affected. Do you prefer a zip with all files or only the modified ones?