The team will post updates and news about our project here |
68k FPU Coding Challenge - Win a Prize!
| | Samuel Devulder
Posts 248 12 Sep 2018 17:17
| As you can see, most of the code is FPU-based. There are very few instructions that can fit in the 2nd pipe. Most bubbles are filled. There is a big one on fp0 while looping back. It seems difficult to eliminate. The only way I see to eliminate it is to use an extra floating-point reg. All FPs are used, so we'll have to use one E reg (E9) for that purpose.
| |
| | Samuel Devulder
Posts 248 12 Sep 2018 18:17
| [EDIT] Sorry for the duplicate. The post just above was posted by mistake, and I don't know how to delete it. Please consider only the following text:
--------------------------------------------------------------------
As you can see, most of the code is FPU-based. There are very few instructions that can fit in the 2nd pipe. It is possible to split the movem inside the loop into several moves (btw, does the 2nd pipe accept memory accesses?), but this would re-create a 3-cycle bubble that was exactly filled by the movem. Most bubbles are filled. There is a big one on fp0 while looping back. It seems difficult to eliminate. The only way I see to eliminate it is to use another floating-point reg. All FPs are used, so we'll have to use one E reg (E9) for that purpose. If I do this, I am able to remove all bubbles:
    loop:
        fadd    fp0,fp1             ; 1
        fmul3   d2,e5,fp5           ; 1
        fmul3   d0,e6,fp6           ; 1
        fmul3   d1,e7,fp7           ; 1
        fmul3   d2,e8,E9            ; 1
        fmul3   d0,e0,fp0           ; 1
        fadd    fp3,fp4             ; 1
        fadd    fp1,fp2             ; 1
        fadd    fp6,fp7             ; 1
        movem.l (a0)+,d0-d2         ; 3 (was 3 bubbles) -- is it possible to split this into 3 free moves ?
        fadd    fp4,fp5             ; 1
        fmove.s fp2,(a1)+           ; 1
        fadd    fp7,E9              ; 1
        fmul3   d1,e1,fp1           ; 1 (was 3 bubbles)
        fmul3   d2,e2,fp2           ; 1  "   "
        fmul3   d0,e3,fp3           ; 1  "   "
        fmove.s fp5,(a1)+           ; 1
        fmul3   d1,e4,fp4           ; 1 (was 1 bubble)
        fmove.s E9,(a1)+            ; 1
        dbra    d7,loop             ; 0
                                    ; total = 21 cycles/loop

It is still possible to gain 3 extra cycles if the movem can be spread into 3 separate memory accesses placed after a few FPU instructions. In addition, provided the FPU is enabled on the 2nd pipe (not the case at the moment, I think), we could gain even more cycles.

Note: The code looks awful. It is a huge mess :( I can't spot any symmetry in the way it is written. I think it needs a good cleanup/rewrite so that the symmetry of the datapath is easy to spot. With cleaner code, places where optimisation is still possible might show up more easily.
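For readers trying to spot the symmetry, here is a minimal Python sketch of what the loop appears to compute, judging from which products are added together: a 3x3 matrix applied to a stream of (x, y, z) triples. It assumes d0, d1, d2 hold x, y, z and that e0..e8 hold the matrix in row-major order. The function name `transform` is made up for illustration, and the sketch ignores single-precision rounding and the software pipelining across iterations.

```python
def transform(points, e):
    """Apply a row-major 3x3 matrix e[0..8] to a flat list of x, y, z triples.

    Assumption: mirrors the products/sums in the assembly listing
    (e0*x + e1*y + e2*z, and so on), not the author's exact data layout.
    """
    out = []
    for i in range(0, len(points), 3):
        x, y, z = points[i:i + 3]
        out.append(e[0] * x + e[1] * y + e[2] * z)  # accumulated in fp2
        out.append(e[3] * x + e[4] * y + e[5] * z)  # accumulated in fp5
        out.append(e[6] * x + e[7] * y + e[8] * z)  # accumulated in E9
    return out


identity = [1, 0, 0,
            0, 1, 0,
            0, 0, 1]
print(transform([1.0, 2.0, 3.0], identity))
```

Written this way, each output is one row of three multiplies and two adds, which is the regularity the scheduled assembly hides.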
| |
| | Don Adan
Posts 38 12 Sep 2018 18:59
If I understand Gunnar's info, you can use the following code:
    loop:
        fadd    fp0,fp1             ; 1
        fmul3   d2,e5,fp5           ; 1
        fmul3   d0,e6,fp6           ; 1
        fmul3   d1,e7,fp7           ; 1
        fmul3   d2,e8,E9            ; 1
        fmul3   d0,e0,fp0           ; 1
        fadd    fp3,fp4             ; 1
        move.l  (a0)+,d0            ; free
        fadd    fp1,fp2             ; 1
        move.l  (a0)+,d1            ; free
        fadd    fp6,fp7             ; 1
        move.l  (a0)+,d2            ; free
        fadd    fp4,fp5             ; 1
        fmove.s fp2,(a1)+           ; 1
        fadd    fp7,E9              ; 1
        fmul3   d1,e1,fp1           ; 1 (was 3 bubbles)
        fmul3   d2,e2,fp2           ; 1  "   "
        fmul3   d0,e3,fp3           ; 1  "   "
        fmove.s fp5,(a1)+           ; 1
        fmul3   d1,e4,fp4           ; 1 (was 1 bubble)
        fmove.s E9,(a1)+            ; 1
        dbra    d7,loop             ; 0
                                    ; total = 21 cycles/loop
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Sep 2018 21:21
| A little bit more CPU info: the APOLLO 68080 can do a free DCache read per cycle. So technically, instead of loading values into registers in advance, you can also just do this:

    FMUL.S (a0),Fp0
    FMUL.S 4(a0),Fp1
    FMUL.S 8(a0),Fp2

Every instruction can read from the cache, and re-reading the same value carries no disadvantage. Cheers
| |
| | Thellier Alain
Posts 143 12 Sep 2018 22:08
| NICE, well done Samuel. You wrote 21 cycles, but it is 18 if the moves for reading x, y, z are free, no? If writing x, y, z is not free (this is what your listing says), then perhaps using movem to write y, z or x, y, z is possible. Anyway, your code is for a 3x3 matrix; I am almost sure it is a 4x4 (used as 4x3) that is needed.
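The 21 versus 18 question can be checked by tallying the per-instruction costs annotated in Samuel's listing. This is a minimal Python sketch, assuming those annotations are accurate and that replacing the movem with three moves really costs nothing. The names `LISTING` and `total_cycles` are made up for illustration.

```python
# Per-instruction cycle costs copied from the comments in the posted listing.
LISTING = [
    ("fadd", 1), ("fmul3", 1), ("fmul3", 1), ("fmul3", 1), ("fmul3", 1),
    ("fmul3", 1), ("fadd", 1), ("fadd", 1), ("fadd", 1),
    ("movem.l", 3),
    ("fadd", 1), ("fmove.s", 1), ("fadd", 1),
    ("fmul3", 1), ("fmul3", 1), ("fmul3", 1),
    ("fmove.s", 1), ("fmul3", 1), ("fmove.s", 1),
    ("dbra", 0),
]


def total_cycles(listing, free=()):
    """Sum the annotated costs, counting any mnemonic in `free` as 0 cycles."""
    return sum(0 if name in free else cycles for name, cycles in listing)


print(total_cycles(LISTING))                     # 21, as posted
print(total_cycles(LISTING, free=("movem.l",)))  # 18 if the loads are free
```

The tally says nothing about whether the hardware actually issues the loads for free; it only confirms the arithmetic behind the two figures.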
| |
| | Szyk Cech
Posts 191 25 Sep 2018 14:17
| Gunnar von Boehn wrote:
| We would like to invite you to participate on a little coding challenge.
|
Who won this challenge?!?
| |