The team will post updates and news about our project here |
68k FPU Coding Challenge - Win a Prize! | page 1 2
|
---|
|
---|
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 09 Sep 2018 22:16
| We would like to invite you to participate on a little coding challenge. All 3D games need some math routine behind. You all know the famous game ELITE. In the intro of the game the star-ships do this nice rotation. If your computer has an FPU than a common way to calculate such rotation is to use a rotation matrix on an array of vertexes. Here is the challenge: Lets assume we have an array of vertex of type SINGLE with (X,Y,Z) per element. This means 12 byte per Vertex. Lets assume you have an array containing 1000 Vertex =(12KB)Lets assume we have a rotation matrix of type SINGLE We use type SINGLE in this challenge because this is very often used for 3D games. Please write an FPU routine which processes this array and applies the rotation calculation on it and stores the rotated result in a new array. Your routine gets called with these parameters: A0= PTR to array of source vertex A1= PTR where you shall store the rotated values A2= PTR to the rotation matrix to use D0= 1000 Number of elements to process. APOLLO 68080 x12 will max reach 85 MegaFlops with optimal code. The coder who writes a routine coming the closest to this score will win a price of a Vampire-Team-T-shirt. For the measurement we assume that the src array is hot in Dcache. We look forward to see your routines. Good Luck! Please ask if you have questions.
| |
| | Thellier Alain
Posts 143 10 Sep 2018 09:15
| Nice idea Gunnar To give a starting base here it is the code that give gcc 2.90.27 with options -m68040 -m68881 -O3 (edit: use wanted registers) // /*==================================================================*/ // #include <string.h> // #include <stdarg.h> // #include <math.h> // #include <proto/exec.h> // /*==================================================================================*/ // typedef struct _Vertex3D // { // float x,y,z; // } Vertex3D; // /*=================================================================*/ // void CopyTransformV(Vertex3D* Vs,Vertex3D* V2s,float *Ms,LONG Vnbs) // { pea a5@ movel sp,a5 fmovem #0x1c,sp@- movel a0,sp@- // register Vertex3D* V =Vs; LBB2: movel a5@(8),a0 // register Vertex3D* V2 =V2s; movel a5@(12),a1 // register float* M =Ms; /* put stack values to given registers */ movel a5@(16),a2 // register LONG Vnb =Vnbs; movel a5@(20),d0 // register float x; // register float y; // register float z; // // while(Vnb) jeq L75 .even L76: // { // x=V[0].x;y=V[0].y;z=V[0].z; fsmoves a0@,fp2 fsmoves a0@(4),fp4 fsmoves a0@(8),fp3 // V2[0].x= M[0]*x + M[4]*y+ M[8] *z; fsmoves a2@,fp0 fsmulx fp2,fp0 fsmoves a2@(16),fp1 fsmulx fp4,fp1 fsaddx fp0,fp1 fsmoves a2@(32),fp0 fsmulx fp3,fp0 fsaddx fp0,fp1 fmoves fp1,a1@ // V2[0].y= M[1]*x + M[5]*y+ M[9] *z; fsmoves a2@(4),fp0 fsmulx fp2,fp0 fsmoves a2@(20),fp1 fsmulx fp4,fp1 fsaddx fp0,fp1 fsmoves a2@(36),fp0 fsmulx fp3,fp0 fsaddx fp0,fp1 fmoves fp1,a1@(4) // V2[0].z= M[2]*x + M[6]*y+ M[10]*z; fsmuls a2@(8),fp2 fsmuls a2@(24),fp4 fsaddx fp2,fp4 fsmuls a2@(40),fp3 fsaddx fp3,fp4 fmoves fp4,a1@(8) // // V++; addw #12,a0 // V2++; addw #12,a1 // Vnb--; subql #1,d0 // } jne L76 L75: // } LBE2: movel sp@+,a0 fmovem sp@+,#0x38 unlk a5 rts
| |
| | Thellier Alain
Posts 143 10 Sep 2018 10:06
| First ideas: * Process 2 vertex per loop * Remove the addw #12,a0 and addw #12,a1 and use an index register in d1 for accessing vertices * No more use subql #1,d0 but cmp d1,d2 with d2 containing 12*1000 (12*Vnb) * Store some matrices values in fp5 fp6 fp7 (68040) * If enough fp registers then store whole matrix in fp0 fp16 (68080) * feed x y z with a movem BTW a non straightforward solution would be to analyze the matrix for simples x rotation or y rotation or z rotation. Then have 3 dedicated functions for those 3 cases For example in quake3 of course the scene use full xyz transformation matrix but there are some "3d icons" that only y rotate. Depending of the game simple rotations maybe more important in proportion
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 10 Sep 2018 10:42
|
thellier alain wrote:
| First ideas:
|
Hi Alain, thanks for the ideas. Let me give some more details about 68080. 1) the FPU can issue one FPU instruction per cycle. The CPU can in addition execute another integer instruction each cycle. This means be interleaving FPU/CPU instruction you can do some CPU instructions in parallel for "free". This means interleaved PTR increments can be done for free.Another free option is the EA mode (an)+ 2) dbra Dn,loop takes only 0.5 cycle in 2nd pipe. This means a DBRA in the last instruction can be free. The same is true for "SUBQ.l #1,Dn, bne" combination. This combo is identified as LOOP and can be executed for free in 2nd pipe. As you correctly pointed out the 68080 has more Register. 3) APOLLO has 8 normal FPU register + 24 Extended FPU register. This means FPU code can use up to 32 FPU regs. Plus 8 Data regs as Inputs. 4) APOLLO allows to use 3 Opp per cycle. This means using an extra instruction extension word something like this can be done in 1 cycle. FADD (a0),Fp0,Fp1
| |
| | Thellier Alain
Posts 143 10 Sep 2018 14:08
| If we have 32 fpu registers then we can have the full matrice in registers once for all AND movem 5 vertex (15 fpu registers) at one time from/to memory I will remain one fpu register for the calculation>free option is the EA mode (an)+ I allways thought (an)+ (an)+ (an)+ was worse than doing an(0) an(1) an(2) etc as the + change the adresse register so dont use the pipeline I suppose 68080 change the rules :-)
| |
| | Samuel Devulder
Posts 248 10 Sep 2018 14:43
| The M being constant through-out the loop, I'd put them in some regs. Here is what I'd get with pseudo-asm
; a0 = input ; a1 = output ; a2 = M ; d7 = counter ; using extra regs e0-e8, if not allowed use n(a2) fmove.s (a2)+,e0 fmove.s (a2)+,e1 fmove.s (a2)+,e2 fmove.s (a2)+,e3 fmove.s (a2)+,e4 fmove.s (a2)+,e5 fmove.s (a2)+,e6 fmove.s (a2)+,e7 fmove.s (a2)+,e8 move.l (a0)+,d0-d2 ; 3 cycles loop: ; 3 arg op, or use macro for plain 68882 fmul3 d0,e0,fp0 ; 1 fmul3 d1,e1,fp1 ; 1 fmul3 d2,e2,fp2 ; 1 fmul3 d0,e3,fp3 ; 1 fmul3 d1,e4,fp4 ; 1 fmul3 d2,e5,fp5 ; 1 fadd fp0,fp1 ; 1 fmul3 d2,e6,fp6 ; 1 fmul3 d2,e7,fp7 ; 1 fmul3 d2,e8,fp0 ; 1 fadd fp3,fp4 ; 1 fadd fp1,fp2 ; 1 ; 2 cycles bubble fadd fp6,fp7 ; 1 ; 3 cycles bubble --> move.l (a0)+,d0-d2 ; 3 cycles fadd fp4,fp5 ; 1 ; 1 cycles bubble fadd fp7,fp0 ; 1 fmove.s fp2,(a1)+ ; 1 ; 3 cycles bubble fmove.s fp5,(a1)+ ; 1 ; 3 cycles bubble fmove.s fp0,(a1)+ ; 1 dbra d7,loop ; 0 ; total = 21 + 9 bubbles = 30 cycles/element
Bubbles in the bottom of the loop can be eliminated by putting there the beginning (fmul3 parts) of the next iteration, but is a pain to unroll loop in asm. I need time to write this correctly. But you got the idea.
| |
| | Renee Cousins (Apollo Team Member) Posts 142 10 Sep 2018 18:11
| Your move.l (movem.l?) needs to be inside the loop. If you use more registers and unroll the loop somewhat you should be able to intermix the fmuls of the next vertex with the faad/fmove of the prior.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 10 Sep 2018 18:25
| Nice ideas so far. Lets continue this, and lets see if you can score better than an 68060 @ 400 MHz!
| |
| | Thellier Alain
Posts 143 10 Sep 2018 19:07
| @samuel You are using d0 d1 d2 for floats operations : is it legal ? Why do you store the new x y z values only at end if not using movem ? @gunnar Of course we can interleave int/float operations for better performance but here we do floats only... especially if accesses to adresses are not indexed (else we will have several int index++ )
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 10 Sep 2018 19:31
| thellier alain wrote:
| @samuel You are using d0 d1 d2 for floats operations : is it legal ?
|
Yes Dn is legal for the 1st operant. 68K FPU has normally 2 Operants FADD.S 1st,2nd 1st can be any of those memory data register Fpu register Extended Register (68080 only)
| |
| | Samuel Devulder
Posts 248 10 Sep 2018 20:00
| Renee Cousins wrote:
| Your move.l (movem.l?) needs to be inside the loop. |
They are! ( in the pseudo-code after the right arrow "--->" in the comment about the "3 cycles bubble"). I'll get rid of as many bubbles as possible by rotating the loop logic a little and clean the source code. Let's see if I can get rid of all the bubbles. (To optimize, I typically write asm by hand, then add comments about where are the fpu bubbles and try to reorg the code baby-steps by baby-steps until I'm happy with the solution.) @Alain: yes using Dn reg as float source (.s extension) is perfectly legal. They are handled as "vey fast" memory-access in my thinking. As you can see I read them with movem.l to actually read them as "is", eg. as floating point values. I do not write the result with a fmovem because I'm not sure if fmovem can store floats (I tend to think only .x (eg 80bits) values are stored.) Anyway, I think movem and signle moves costs the same for 32-bits values. And also having the writes spread in different places in the code allow filling the bubbles and gain cycles in the end.
| |
| | Samuel Devulder
Posts 248 10 Sep 2018 20:59
| How about this? I got rid of the 3 cycles bubbles in the end of the loop in the previous code. The code also contains an "equivalent" of 68080 3-way ops for 68882 (slower of course) ; a0 = input ; a1 = output ; a2 = M ; d7 = counter ifne ac68080 ; code for apollo accelerator 68080 fmove.s (a2)+,e0 fmove.s (a2)+,e1 fmove.s (a2)+,e2 fmove.s (a2)+,e3 fmove.s (a2)+,e4 fmove.s (a2)+,e5 fmove.s (a2)+,e6 fmove.s (a2)+,e7 fmove.s (a2)+,e8 else ; mc68882 version e0 equ 0 e1 equ 4 e2 equ 8 e3 equ 12 e4 equ 16 e5 equ 20 e6 equ 24 e7 equ 32 e8 equ 36 ; emulation of 3 arg op for mc68882 fmul3 macro fmove.s \1,\3 fmul.s \2(a2),\3 endm ; end of 68882 specific code endc move.l (a0)+,d0-d2 ; 3 cycles fmul3 d0,e0,fp0 ; 1 fmul3 d1,e1,fp1 ; 1 fmul3 d2,e2,fp2 ; 1 fmul3 d0,e3,fp3 ; 1 fmul3 d1,e4,fp4 ; 1 loop: fmul3 d2,e5,fp5 ; 1 fmul3 d2,e6,fp6 ; 1 ; 3 cycles bubble when looping back (fp0) fadd fp0,fp1 ; 1 ----+ <----------+ | | fmul3 d2,e7,fp7 ; 1 | | fmul3 d2,e8,fp0 ; 1 | | | | fadd fp3,fp4 ; 1 | | | | ; 2 cycles bubble (fp1) | | fadd fp1,fp2 ; 1 <---+ | fadd fp6,fp7 ; 1 | | movem.l (a0)+,d0-d2 ; 3 (ex 3 bubbles) | fadd fp4,fp5 ; 1 | | fmove.s fp2,(a1)+ ; 1 | fadd fp7,fp0 ; 1 | | fmul3 d1,e1,fp1 ; 1 (ex 3 bubbles) | fmul3 d2,e2,fp2 ; 1 (ex 3 bubbles) | fmul3 d0,e3,fp3 ; 1 (ex 3 bubbles) | fmove.s fp5,(a1)+ ; 1 | | fmul3 d1,e4,fp4 ; 1 (ex 1 bubble) | fmove.s fp0,(a1)+ ; 1 | fmul3 d0,e0,fp0 ; 1 -----------------+ dbra d7,loop ; 0 total : 21 + 5 bubbles = 26 cycles / loop Yet, there is still 5 cycles in bubbles, mainly due to fp0 upon loopback :( I cannot see a way to do something useful in the ned opf the loop to fill that 3 cycle loop. :(
| |
| | Thellier Alain
Posts 143 11 Sep 2018 09:27
| @all I just noted that I posted the wrong C/Asm function in my first post: I suppose what Gunnar wants is 4x4 matrix not a 3x3 Anyway the problem is still more or less the same
| |
| | Samuel Devulder
Posts 248 11 Sep 2018 15:13
| Gunnar spoke of 12 bytes vectors (X,Y,Z) so for me the matrix is also 3x3. (and there are some typos in my latest code.. some "d2" shoulde be replaced with "d0" and "d1".. well anyway it is just a sketch of code, not a real one.)
| |
| | Renee Cousins (Apollo Team Member) Posts 142 11 Sep 2018 16:44
| thellier alain wrote:
| I just noted that I posted the wrong C/Asm function in my first post: I suppose what Gunnar wants is 4x4 matrix not a 3x3. Anyway the problem is still more or less the same |
The advantage of 4x4 is that the two common (object and camera) transform matrices can be combined into one. So instead of two 3x3 transforms you have one 4x4.
| |
| | Thellier Alain
Posts 143 12 Sep 2018 08:23
| @Samuel You still have unused registers so you should process 2 (4?) vertices per loop : interlacing vertex1 and vertex2 computations should enhance the bubble problemPerhaps using movem for reading vertices was not such a good idea as it impose to wait 3 reads before doing anything Also d0 d1 d2 store x y z but are never used as destination for a computation : I mean when x is last used as input it can serve as output for a computation = economizing regs will allow to process more vertices per loop
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Sep 2018 08:28
| Alain raises some very good points here. a) Yes, processing 2 rows in parallel in the Loop will give more independent work to do - which allows you to fill bubbles. This is a good idea. b) Using three simple MOVE instructions can be faster than one MOVEM, as the simple MOVE instructions could be done in parallel "for free" in a free 2nd pipe slot.
| |
| | Thellier Alain
Posts 143 12 Sep 2018 13:21
| IMHO processing 3 vertices per loop would be an even better idea as those vertices certainly represents triangles so certainly the vertices count will be a multiple of 3 >Using three simple MOVE instructions can be faster than one MOVEM Yes but I dont know if it will be the same for 3*3 move if processing 3 vertices per loop
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Sep 2018 13:31
| thellier alain wrote:
| >Using three simple MOVE instructions can be faster than one MOVEM Yes but I dont know if it will be the same for 3*3 move if processing 3 vertices per loop
|
The instruction in the CPU can looks like this 1pipe 2pipe FMUL free FMUL free FMUL free FADD free
In the example the 2nd pipe does nothing. "Finger in nose" as the french say. For example you can for free do a MOVE instruction each cycle in the 2nd pipe. MOVEM will use the 1st pipe for several clock. This would cost extra.
| |
| | Thellier Alain
Posts 143 12 Sep 2018 15:31
| Ok I see so interleaving the MOVEs is the better solution as we dont have much others int instructions to fill the 2nd pipe I am reposting the problem as C/asm if using a 4x4 matrice used as a 3x4 matrice (I mean rotation + translation) // /*==================================================================*/ // #include <string.h> // #include <stdarg.h> // #include <math.h> // #include <proto/exec.h> // /*==================================================================================*/ // typedef struct _Vertex3D // { // float x,y,z; // } Vertex3D; // /*=================================================================*/ // void CopyTransformV(Vertex3D* Vs,Vertex3D* V2s,float *Ms,LONG Vnbs) // { pea a5@ movel sp,a5 fmovem #0x1c,sp@- movel a0,sp@- // register Vertex3D* V =Vs; /* put stack values to wanted registers */ LBB2: movel a5@(8),a0 // register Vertex3D* V2 =V2s; movel a5@(12),a1 // register float* M =Ms; movel a5@(16),a2 // register LONG Vnb =Vnbs; movel a5@(20),d0 // register float x; // register float y; // register float z; // // while(Vnb) jeq L75 .even L76: // { // x=V[0].x;y=V[0].y;z=V[0].z; fsmoves a0@,fp2 fsmoves a0@(4),fp3 fsmoves a0@(8),fp4 // // V2[0].x= M[0]*x + M[4]*y+ M[8] *z+ M[12]; fsmoves a2@,fp0 fsmulx fp2,fp0 fsmoves a2@(16),fp1 fsmulx fp3,fp1 fsaddx fp0,fp1 fsmoves a2@(32),fp0 fsmulx fp4,fp0 fsaddx fp1,fp0 fsadds a2@(48),fp0 fmoves fp0,a1@ // V2[0].y= M[1]*x + M[5]*y+ M[9] *z+ M[13]; fsmoves a2@(4),fp0 fsmulx fp2,fp0 fsmoves a2@(20),fp1 fsmulx fp3,fp1 fsaddx fp0,fp1 fsmoves a2@(36),fp0 fsmulx fp4,fp0 fsaddx fp1,fp0 fsadds a2@(52),fp0 fmoves fp0,a1@(4) // V2[0].z= M[2]*x + M[6]*y+ M[10]*z+ M[14]; fsmuls a2@(8),fp2 fsmuls a2@(24),fp3 fsaddx fp2,fp3 fsmuls a2@(40),fp4 fsaddx fp3,fp4 fsadds a2@(56),fp4 fmoves fp4,a1@(8) // // V++; addw #12,a0 // V2++; addw #12,a1 // Vnb--; subql #1,d0 // } jne L76 L75: // } LBE2: movel sp@+,a0 fmovem sp@+,#0x38 unlk a5 rts
| |
|
|
|