

68k FPU Coding Challenge - Win a Prize!

Samuel Devulder

Posts 248
12 Sep 2018 17:17


As you can see, most of the code is FPU-based. There are very few instructions that can fit in the 2nd pipe.

Most bubbles are filled. There is a big one on fp0 while looping back. It seems difficult to eliminate. The only way I see to eliminate it is to use an extra floating-point register. All FP registers are used, so we'll have to use one E register (E9) for that purpose.


Samuel Devulder

Posts 248
12 Sep 2018 18:17


[EDIT] Sorry for the duplicate. The post just above was posted by mistake, and I don't know how to delete it. Please consider only the following text:
--------------------------------------------------------------------
As you can see, most of the code is FPU-based. There are very few instructions that can fit in the 2nd pipe. It is possible to split the movem inside the loop into several moves (by the way, does the 2nd pipe accept memory accesses?), but this would re-create the 3-cycle bubble that the movem fills exactly.

Most bubbles are filled. There is a big one on fp0 while looping back. It seems difficult to eliminate. The only way I see to eliminate it is to use another floating-point register. All FP registers are used, so we'll have to use one E register (E9) for that purpose.

If I do this, I am able to remove all bubbles:
loop:
    fadd    fp0,fp1         ; 1

    fmul3   d2,e5,fp5       ; 1
    fmul3   d0,e6,fp6       ; 1

    fmul3   d1,e7,fp7       ; 1
    fmul3   d2,e8,E9        ; 1
    fmul3   d0,e0,fp0       ; 1

    fadd    fp3,fp4         ; 1

    fadd    fp1,fp2         ; 1
    fadd    fp6,fp7         ; 1

    movem.l (a0)+,d0-d2     ; 3 (was 3 bubbles) -- is it possible to split this into 3 free moves?
    fadd    fp4,fp5         ; 1

    fmove.s fp2,(a1)+       ; 1
    fadd    fp7,E9          ; 1

    fmul3   d1,e1,fp1       ; 1 (was 3 bubbles)
    fmul3   d2,e2,fp2       ; 1    "    "
    fmul3   d0,e3,fp3       ; 1    "    "
    fmove.s fp5,(a1)+       ; 1

    fmul3   d1,e4,fp4       ; 1 (was 1 bubble)
    fmove.s E9,(a1)+        ; 1

    dbra    d7,loop         ; 0
    ; total = 21 cycles/loop
 

It is still possible to gain 3 extra cycles if the movem can be spread into 3 separate memory accesses placed after a few FPU instructions. In addition, if the FPU were enabled on the 2nd pipe (not the case at the moment, I think), we could gain even more cycles.
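
The cycle arithmetic above can be checked with a small tally. This is only a sketch in Python, not Apollo code: the per-instruction costs are copied from the comments in the listing, the helper names are hypothetical, and the idea that three separate move.l loads would cost nothing is exactly the open question here, not a confirmed fact.

```python
# Costs in cycles, one entry per instruction, in listing order.
# These numbers are the annotations from the listing, not measurements.
LOOP_WITH_MOVEM = [
    ("fadd", 1),
    ("fmul3", 1), ("fmul3", 1),
    ("fmul3", 1), ("fmul3", 1), ("fmul3", 1),
    ("fadd", 1),
    ("fadd", 1), ("fadd", 1),
    ("movem.l", 3),
    ("fadd", 1),
    ("fmove.s", 1), ("fadd", 1),
    ("fmul3", 1), ("fmul3", 1), ("fmul3", 1), ("fmove.s", 1),
    ("fmul3", 1), ("fmove.s", 1),
    ("dbra", 0),
]


def total_cycles(schedule):
    """Sum the annotated cost of every instruction in the loop body."""
    return sum(cost for _, cost in schedule)


def split_movem(schedule, load_cost=0):
    """Replace the movem.l by three move.l loads.

    ASSUMPTION: each load costs `load_cost` cycles; 0 models the
    hoped-for "free" case, 1 models loads that are not free.
    """
    out = []
    for name, cost in schedule:
        if name == "movem.l":
            out.extend([("move.l", load_cost)] * 3)
        else:
            out.append((name, cost))
    return out


print(total_cycles(LOOP_WITH_MOVEM))                  # 21
print(total_cycles(split_movem(LOOP_WITH_MOVEM)))     # 18
print(total_cycles(split_movem(LOOP_WITH_MOVEM, 1)))  # 21
```

The last line shows why the split only pays off if the loads really are free: at one cycle each, the total is back to 21.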
 
Note: The code looks awful. It is a huge mess :( I can't spot any symmetry in the way it is written. I think it needs a good cleanup and rewrite so that the symmetry of the datapath is easy to spot. With cleaner code, places where further optimisation is possible might show up more easily.


Don Adan

Posts 38
12 Sep 2018 18:59


If I understand Gunnar's info correctly, you can use the following code:

loop:
    fadd    fp0,fp1         ; 1

    fmul3   d2,e5,fp5       ; 1
    fmul3   d0,e6,fp6       ; 1

    fmul3   d1,e7,fp7       ; 1
    fmul3   d2,e8,E9        ; 1
    fmul3   d0,e0,fp0       ; 1

    fadd    fp3,fp4         ; 1
    move.l  (a0)+,d0        ; free
    fadd    fp1,fp2         ; 1
    move.l  (a0)+,d1        ; free
    fadd    fp6,fp7         ; 1

    move.l  (a0)+,d2        ; free
    fadd    fp4,fp5         ; 1

    fmove.s fp2,(a1)+       ; 1
    fadd    fp7,E9          ; 1

    fmul3   d1,e1,fp1       ; 1 (was 3 bubbles)
    fmul3   d2,e2,fp2       ; 1    "    "
    fmul3   d0,e3,fp3       ; 1    "    "
    fmove.s fp5,(a1)+       ; 1

    fmul3   d1,e4,fp4       ; 1 (was 1 bubble)
    fmove.s E9,(a1)+        ; 1

    dbra    d7,loop         ; 0
    ; total = 18 cycles/loop if the three move.l loads are really free (21 with the movem)
   




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Sep 2018 21:21


A little bit more CPU info:

The APOLLO 68080 can do one free DCache read per cycle.
So technically, instead of loading values into registers in advance, you can also just do this:
 


  FMUL.S (a0),Fp0
  FMUL.S 4(a0),Fp1
  FMUL.S 8(a0),Fp2
 

 
You can read from the cache in every instruction. Even re-reading the same value carries no penalty.

Cheers


Thellier Alain

Posts 141
12 Sep 2018 22:08


Nice, well done Samuel!
You wrote 21 cycles, but it is 18 if the moves that read x, y and z are free, no?
If writing x, y and z is not free (which is what your listing says), then perhaps a movem could be used to write y and z, or x, y and z together.

Anyway, your code is for a 3x3 matrix. I am almost sure that a 4x4 (used as 4x3) is what is needed.
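
To make the 3x3 versus 4x3 distinction concrete, here is a plain reference model in Python (not Apollo code). It is a sketch under assumptions: the function names are hypothetical, the 3x3 case is read off from the nine products in the listing (e0..e2 feed the first stored value, e3..e5 the second, e6..e8 the third, with d0, d1, d2 taken to hold x, y, z), and the 4x3 case assumes the common convention of a 3x3 part plus a translation column applied to (x, y, z, 1).

```python
def transform_3x3(m, v):
    """m: 9 coefficients in row-major order (e0..e8), v: (x, y, z)."""
    x, y, z = v
    return (
        m[0] * x + m[1] * y + m[2] * z,
        m[3] * x + m[4] * y + m[5] * z,
        m[6] * x + m[7] * y + m[8] * z,
    )


def transform_4x3(m, v):
    """m: 12 coefficients, three rows of (a, b, c, t).

    ASSUMPTION: the input vector has an implicit w = 1, so the fourth
    coefficient of each row is simply added as a translation.
    """
    x, y, z = v
    return tuple(
        m[4 * r] * x + m[4 * r + 1] * y + m[4 * r + 2] * z + m[4 * r + 3]
        for r in range(3)
    )


identity_3x3 = (1, 0, 0,
                0, 1, 0,
                0, 0, 1)
translate_4x3 = (1, 0, 0, 10,
                 0, 1, 0, 20,
                 0, 0, 1, 30)

print(transform_3x3(identity_3x3, (1.0, 2.0, 3.0)))   # (1.0, 2.0, 3.0)
print(transform_4x3(translate_4x3, (1.0, 2.0, 3.0)))  # (11.0, 22.0, 33.0)
```

Under that convention the 4x3 form costs one extra addition per output component, so three more additions per vertex than the 3x3 case, with the same nine multiplications.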


Szyk Cech

Posts 191
25 Sep 2018 14:17


Gunnar von Boehn wrote:

We would like to invite you to participate in a little coding challenge.

Who won this challenge?!?
