Overview Features Coding ApolloOS Performance Forum Downloads Products Order Contact

Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.
Please read the forum usage manual.
Please visit our Apollo-Discord Server for support.



All TopicsNewsPerformanceGamesDemosApolloVampireAROSWorkbenchATARIReleases
Information about the Apollo CPU and FPU.

GCC Improvement for 68080page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

Grom 68k

Posts 61
16 Jul 2019 10:29


This is an other example which work better than last week but there is still a problem:


void multiplyMatrix(double* restrict inputMatrix,double transformMatrix[4][4],double* restrict outputMatrix)
{
    for(int i = 0; i < 4; i++ ){
        outputMatrix = 0;
        for(int j = 0; j < 4; j++){
            outputMatrix += transformMatrix[j] * inputMatrix[j];
        }
    }
}

gcc use an unnessecary fp2 to init (a2):

-m68080 -O2 -mregparm=3


_multiplyMatrix:
        moveq #4,d1
        fmovem fp2,-(sp)
        fmove.s #0x0,fp2

        movem.l a4/a3/a2,-(sp)
.L3:
        fmove.d #0x000000000,fp0
        fmove.d fp2,(a2)
        move.l a0,a4
        move.l a1,a3
        moveq #4,d0
.L2:
        fdmove.d (a3)+,fp1
        fdmul.d (a4)+,fp1
        fdadd.x fp1,fp0
        subq.l #1,d0
        jne .L2
        fmove.d fp0,(a2)+
        lea (32,a1),a1
        subq.l #1,d1
        jne .L3
        movem.l (sp)+,a2/a3/a4
        fmovem (sp)+,fp2
        rts

with -O3, it's worse, i add 0 to fp1

-m68080 -O3 -mregparm=3


  fdadd.d #0x000000000,fp1



Samuel Devulder

Posts 248
16 Jul 2019 11:10


The C-code is strange. OutputMatrix is used as a double, but it is a pointer. The 'i' index isn't used as index either, which is weird. What is this function supposed to do? Mulitply 4x4 matrices or multiply a vector by a matrix ?


Grom 68k

Posts 61
16 Jul 2019 11:21


Samuel Devulder wrote:

The C-code is strange. OutputMatrix is used as a double, but it is a pointer. The 'i' index isn't used as index either, which is weird. What is this function supposed to do? Mulitply 4x4 matrices or multiply a vector by a matrix ?

Thanks, I discover the italic tag. This is better like this:


void multiplyMatrix(double* restrict inputMatrix,double transformMatrix[4][4],double* restrict outputMatrix)
{
    for(int l = 0; l < 4; l++ ){
        outputMatrix[l] = 0;
        for(int j = 0; j < 4; j++){
            outputMatrix[l] += transformMatrix[l][j] * inputMatrix[j];
        }
    }
}

It multiplies vector by a matrix to make a rotation and a translation. I use this in CAD program.
I write before

double outputMatrix[4]
but it's better with restrict.


Stefan "Bebbo" Franke

Posts 139
16 Jul 2019 11:50


a zip with all is ok.
   
    regarding that code: I'm not sure if it's allowed to suppress the explicite coded zero assignment. I recommend fixing the c code and use a local variable to sum it up and only store the result.
   
   
    EDIT: hint: use
 [ code] for your code
 
  LOL: hard to format this


Samuel Devulder

Posts 248
16 Jul 2019 13:01


There are strange things occuring with this matrix*vector routine.  EXTERNAL LINK There are lots of "fadd #0,freg" as pointed out by Grom68k, and even, what looks like bad scheduling when unrolling all the loops:

        fdmul.x fp2,fp1
        fdadd.d #0x000000000,fp1 ; useless & waits for fp1 during 5 cycles
        ...
        fdmul.x fp0,fp3
        fdmove.x fp2,fp4
        fdmul.d (16,a1),fp4
        fdadd.x fp1,fp3 ; has to wait 2 cycles for fp3
        ...
        fdmul.x fp6,fp5
        fdadd.x fp5,fp3 ; has to wait 5 cycles for fp5
        ...
        fdmul.d (16,a1),fp3
        fdadd.d #0x000000000,fp4 ; useless
        fdadd.d #0x000000000,fp1 ; useless
        fdadd.x fp3,fp7 ; has to wait 3 cycles for fp3
        ...
        fdmul.d (56,a1),fp0
        fdmul.d (88,a1),fp2
        fdadd.x fp0,fp4 ; wait 4 cycles for fp0
        fdadd.x fp1,fp2 ; wait 1 cycle for fp2 (4 already used before.)
        ...

Strangely enough, when rewriting the C code to invert the order of the l/j loops, the resulting asm looks a bit better (no fadd #0) EXTERNAL LINK       
       
[EDIT] adding -funsafe-math-optimizations (or -ffast-math which includes it) removes the useless "fadd #0" as well, but scheduling seem still bad.

      fdmove.d (8,a1),fp3
      fdmul.x fp4,fp3
      fdmul.x fp2,fp1
      fdadd.x fp3,fp1 ; <== has to wait for fp1 & fp3 (5 wait-ycles)

       
 
[EDIT2] Playing with "-Ofast -mregpam" lead to a compiler crash ==> EXTERNAL LINK   
 
This piece of c-code is, I think, very good to check scheduling performance & limitations. I wonder if hand-written asm can beat gcc.


Grom 68k

Posts 61
16 Jul 2019 16:57


Samuel Devulder wrote:

  This piece of c-code is, I think, very good to check scheduling performance & limitations. I wonder if hand-written asm can beat gcc.

I think this code is easier to start to check scheduling.


void crossproduct(const double* restrict vector1, const double* vector2, double* restrict vector3)
{
    double v10 = *vector1++;
    double v11 = *vector1++;
    double v12 = *vector1;
    double v20 = *vector2++;
    double v21 = *vector2++;
    double v22 = *vector2;   
    *vector3++ = v11 * v22 - v12 * v21;   
    *vector3++ = v12 * v20 - v10 * v22;
    *vector3  = v10 * v21 - v11 * v20;
}




Stefan "Bebbo" Franke

Posts 139
16 Jul 2019 21:28


Grom 68k wrote:

Samuel Devulder wrote:

  This piece of c-code is, I think, very good to check scheduling performance & limitations. I wonder if hand-written asm can beat gcc.
 

 
  I think this code is easier to start to check scheduling.
 
 

  void crossproduct(const double* restrict vector1, const double* vector2, double* restrict vector3)
  {
      double v10 = *vector1++;
      double v11 = *vector1++;
      double v12 = *vector1;
      double v20 = *vector2++;
      double v21 = *vector2++;
      double v22 = *vector2;   
      *vector3++ = v11 * v22 - v12 * v21;   
      *vector3++ = v12 * v20 - v10 * v22;
      *vector3  = v10 * v21 - v11 * v20;
  }
 

 
 

13 wait states are ok for me - not bad for an algorithm.




Samuel Devulder

Posts 248
16 Jul 2019 23:58


I count only 9 here: EXTERNAL LINK 

  _crossproduct:
          fdmove.d (a0)+,fp0
          fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
          fdmove.d (a0)+,fp1
          fdmove.d (a1)+,fp3
          fdmove.d (a0),fp6
          fdmove.d (a1)+,fp2
          fdmove.x fp1,fp5
          fdmove.d (a1),fp4
          fdmove.x fp6,fp7
          fdmul.x fp4,fp5
          fdmul.x fp2,fp7
          fdmul.x fp0,fp4
          fdmul.x fp3,fp6
  ; 3 because of fp7
          fdsub.x fp7,fp5
          fdmul.x fp2,fp0
          fdmul.x fp3,fp1
  ; no wait state for fp4/fp6 because of 3 above
          fdsub.x fp4,fp6
  ; 4 because of fp1
          fdsub.x fp1,fp0
          move.l d0,a0
          fmove.d fp5,(a0)+
          fmove.d fp6,(a0)+
  ; 2 because of fp0
          fmove.d fp0,(a0)
          fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
          rts
 
If we manually reorg we can get even fewer (4):
 

  _crossproduct:
          fdmove.d (a0)+,fp0
          fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
          fdmove.d (a0)+,fp1
          fdmove.d (a1)+,fp3
          fdmove.d (a0),fp6
          fdmove.d (a1)+,fp2
          fdmove.x fp1,fp5
          fdmove.d (a1),fp4
          fdmove.x fp6,fp7
          fdmul.x fp4,fp5
          fdmul.x fp2,fp7
          fdmul.x fp0,fp4
          fdmul.x fp3,fp6
          fdmul.x fp2,fp0
          fdmul.x fp3,fp1
  ; 1 because of fp7
          fdsub.x fp7,fp5
  ; 1 because of fp6
          fdsub.x fp4,fp6
          move.l d0,a0
          fdsub.x fp1,fp0
  ; 1 because of dfp5
          fmove.d fp5,(a0)+
  ; 1 because of fp6
          fmove.d fp6,(a0)+
          fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
          fmove.d fp0,(a0)
          rts
 

 


Stefan "Bebbo" Franke

Posts 139
17 Jul 2019 07:20


Samuel Devulder wrote:

I count only 9 here: EXTERNAL LINK 

...
If we manually reorg we can get even fewer (4):
... 

I tweaked it a bit, but that's it. It's easy to unbalance the weights and get worse results. So 9 is better than 13 before :-)

If you are able to define the algorithm of your 'manual reorg', it might be worth to implement it.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
17 Jul 2019 07:27


Stefan "Bebbo" Franke wrote:

 
Samuel Devulder wrote:

  I count only 9 here: EXTERNAL LINK 

  ...
  If we manually reorg we can get even fewer (4):
  ... 
 

 
I tweaked it a bit, but that's it. It's easy to unbalance the weights and get worse results. So 9 is better than 13 before :-)
 
If you are able to define the algorithm of your 'manual reorg', it might be worth to implement it.

 
I see several optimizations here which sometimes might conflict.
 
a) Make GCC be aware of the latency
6 cycle for FADD/FSUB/FCMP/FMUL
 
b) bundle "FMOVE and FOPP" to one unit.
This can allow the CPU to execute both together!
 
In general the 68080 CPU can reach great FPU speed for bigger workloops which are unrolled.
Like the "SCALE" over a bigger amount of data.
I believe our focus should be on such algorithms.
Algorithms with are more "sequential" are a lost cause anyhow.


Stefan "Bebbo" Franke

Posts 139
17 Jul 2019 08:28


I created a better m68080 automaton definition


_crossproduct:
        fdmove.d (a0)+,fp0
        fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
        fdmove.d (a0)+,fp1
        fdmove.d (a1)+,fp3
        fdmove.d (a1)+,fp2
        fdmove.d (a0),fp6
        fdmove.x fp1,fp5
        fdmove.d (a1),fp4
        fdmove.x fp6,fp7
        fdmul.x fp4,fp5
        fdmul.x fp2,fp7
        fdmul.x fp0,fp4
        fdmul.x fp3,fp6
        fdmul.x fp2,fp0
        fdmul.x fp3,fp1
        move.l a2,-(sp)
        fdsub.x fp7,fp5
        fdsub.x fp4,fp6
        fdsub.x fp1,fp0
        fmove.d fp5,(a2)+
        fmove.d fp6,(a2)+
        fmove.d fp0,(a2)
        move.l (sp)+,a2
        fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
        rts

it's 4 too, isn't it?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
17 Jul 2019 08:42


Stefan "Bebbo" Franke wrote:

  I created a better m68080 automaton definition
 
 

  _crossproduct:
          fdmove.d (a0)+,fp0
          fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
          fdmove.d (a0)+,fp1
          fdmove.d (a1)+,fp3
          fdmove.d (a1)+,fp2
          fdmove.d (a0),fp6
          fdmove.x fp1,fp5  (1)
          fdmove.d (a1),fp4
          fdmove.x fp6,fp7    (2)
          fdmul.x fp4,fp5          (1*)
          fdmul.x fp2,fp7            (2*)
          fdmul.x fp0,fp4
          fdmul.x fp3,fp6
          fdmul.x fp2,fp0
          fdmul.x fp3,fp1
          move.l a2,-(sp)
          fdsub.x fp7,fp5
          fdsub.x fp4,fp6
          fdsub.x fp1,fp0
          fmove.d fp5,(a2)+
          fmove.d fp6,(a2)+
          fmove.d fp0,(a2)
          move.l (sp)+,a2
          fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
          rts
 

 
  it's 4 too, isn't it?
 

 
The instruction marked (1) and (1*) should be bundled
Same for (2) and (2*)

Does this make sense?
We should "prepare" the code to allow the CPU to do 3opp operations


Grom 68k

Posts 61
17 Jul 2019 10:01


Stefan "Bebbo" Franke wrote:

I created a better m68080 automaton definition
 
 

  _crossproduct:
          fdmove.d (a0)+,fp0
          fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
          fdmove.d (a0)+,fp1
          fdmove.d (a1)+,fp3
          fdmove.d (a1)+,fp2
          fdmove.d (a0),fp6
          fdmove.x fp1,fp5
          fdmove.d (a1),fp4
          fdmove.x fp6,fp7
          fdmul.x fp4,fp5
          fdmul.x fp2,fp7
          fdmul.x fp0,fp4
          fdmul.x fp3,fp6
          fdmul.x fp2,fp0
          fdmul.x fp3,fp1
          move.l a2,-(sp)
          fdsub.x fp7,fp5
          fdsub.x fp4,fp6
          fdsub.x fp1,fp0
          fmove.d fp5,(a2)+
          fmove.d fp6,(a2)+
          fmove.d fp0,(a2)
          move.l (sp)+,a2
          fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
          rts
 

 
  it's 4 too, isn't it?

Amazing ! (but I count 5 nop)

This test case is already dead :)




Samuel Devulder

Posts 248
17 Jul 2019 11:11


I also count 5 which is much better that 13 or 9
_crossproduct:
            fdmove.d (a0)+,fp0
            fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
            fdmove.d (a0)+,fp1
            fdmove.d (a1)+,fp3
            fdmove.d (a1)+,fp2
            fdmove.d (a0),fp6
            fdmove.x fp1,fp5
            fdmove.d (a1),fp4
            fdmove.x fp6,fp7
            fdmul.x fp4,fp5
            fdmul.x fp2,fp7
            fdmul.x fp0,fp4
            fdmul.x fp3,fp6
            fdmul.x fp2,fp0
            fdmul.x fp3,fp1
            move.l a2,-(sp)
            fdsub.x fp7,fp5
  ; 1 (fp6)
            fdsub.x fp4,fp6
  ; 1 (fp1)
            fdsub.x fp1,fp0
  ; 1 (fp5)
            fmove.d fp5,(a2)+
  ; 1 (fp6)
            fmove.d fp6,(a2)+
  ; 1 (fp0)
            fmove.d fp0,(a2)
            move.l (sp)+,a2
            fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
            rts
The code coule be better if instead of using a2 we used a0 or a1 (which are dead at this point). The other advantage of using a0 or a1, is that the final fmove.d fp0,(reg) can be delayed after the final fmovem, sparing 1 wait cycle.

Anyway, I'll give it a try by hand to see if I can get any better. But I have no real "algorithm", just intuition and symmetry in the  computation-tree. However, for quake I've found out that running several computation threads in a diagonal like fashion produces interesting pieces of code where it is easy to spot optimizeable parts... let's see if this strategy is of some help here (more on it later, hopefully).




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
17 Jul 2019 11:47


Seriouly I doubt this crossproduct is a useful testcase.
The function has not enough math to be useful "unrolled"
FMOVEM registers away to have TMP registers add extra cost and makes only limited sense for such few math operations.


Grom 68k

Posts 61
17 Jul 2019 12:36


Gunnar von Boehn wrote:

Seriouly I doubt this crossproduct is a useful testcase.
  The function has not enough math to be useful "unrolled"
  FMOVEM registers away to have TMP registers add extra cost and makes only limited sense for such few math operations.

I am agree, it was just to show scheduling issues.

I have some ideas to optimize unrolling but I am afraid it will be  complex.

Stefan "Bebbo" Franke wrote:

-funroll-loops is working better now too: EXTERNAL LINK 

Is gcc can make difference between  subq.l #4,d0  and  subq.l #1,d0 ?
If yes,  subq.l #4,d0  could be move upward.

Is gcc can simplify the step of the loop ?


        move.l #640,d0      ;160
.L2:
        ...
        subq.l #4,d0        ;1
        jne .L2

I have an other idea but the asm doesn't look like this anymore.


Scale8:
          fdmove.x fp1,fp0
          move.l #640,d0
          fmovem fp2/fp3/fp4,-(sp)
  .L2:
          fdmove.d (a1)+,fp4
          fdmul.x fp0,fp4
          fdmove.d (a1)+,fp3
          fdmul.x fp0,fp3
          fdmove.d (a1)+,fp2
          fdmul.x fp0,fp2
          fdmove.d (a1)+,fp1
          fdmul.x fp0,fp1
          fmove.d fp4,(a0)+
          fmove.d fp3,(a0)+
          fmove.d fp2,(a0)+
          fmove.d fp1,(a0)+
          subq.l #4,d0
          jne .L2
          fmovem (sp)+,fp4/fp3/fp2
          rts



Samuel Devulder

Posts 248
17 Jul 2019 12:52


Gunnar von Boehn wrote:

Seriouly I doubt this crossproduct is a useful testcase.
The function has not enough math to be useful "unrolled"
FMOVEM registers away to have TMP registers add extra cost and makes only limited sense for such few math operations.

Yeah, after a bit of reflection, this cross product doesn't have enough fpu/integer ops to be optimized very much.
   
It has basically 6 independent products. Great, they can work in parallel and the result of the first two one might arrive "just on time" without any additional delay (or may be 1 wait-state) and the other pairs arrive on time two. But then there are only 3 independent subtraction. This is not enough to cover the delay to  wait for the result of the first subtraction. This mean that saving to memory the result of the first subtraction will add at least a 3 cycles delay, meaning that fpu-side this calculation cannot be done with less than 3 or 4 wait-states. Usually we can use these wait-state to deal with integer operations, unfortunately, this codes hasn't any. so I think it is very hard to beat the 4-cycles version suggested above.
 
Also, as indicated by Gunnar the fmovem load and saves a total of 12 fpu values hence 12 cycles. This is a lot when compared to the wait-cycles. Reducing register usege in that case might be worth it. Here is the solution I came to by hand-optimization
_crossproduct:  
  fmove.d  (8,a0),fp0
  fmul.d  (16,a1),fp0
  fmove.d  (16,a0),fp1
  fmul.d  (8,a1),fp1
 
  fmove.d fp4,-(sp)
  fmove.d fp3,-(sp)
  fmove.d fp2,-(sp)
 
  fmove.d (16,a0),fp2
  fmul.d  (a1),fp2
  fsub.x  fp1,fp0
  fmove.d (a0),fp3
  fmul.d  (16,a1),fp3
 
  fmove.d (a0),fp1
  fmul.d  (8,a1),fp1
  fmove.d (8,a0),fp4
  fmul.d  (a1),fp4
 
  fmove.d fp0,(a2)
 
  fmove.x fp2,fp0
  fsub.x  fp3,fp0
  fmove.d (sp)+,fp2
  fmove.d (sp)+,fp3
 
  fsub.x  fp4,fp1
  fmove.d (sp)+,fp4
;1
  fmove.d  fp0,(8,a2)
;2
  fmove.d  fp1,(16,a2)
  rts
; total = 29 cycles (include 3 wait cycles)
I tried to reduce registers usage (this is possible because "fop reg,reg" is as fast as "fop <ea>,reg") and fill the bubbles in the end of calculation with single fmove.d coming from a split fmovem. I'm usure gcc can split a single fmovem into several fmove.d and move them individually into wait-cycles to reduce total cycle-count.


Stefan "Bebbo" Franke

Posts 139
17 Jul 2019 12:58


Grom 68k wrote:

...
  Is gcc can make difference between  subq.l #4,d0  and  subq.l #1,d0 ?
  If yes,  subq.l #4,d0  could be move upward.
...

since the sub is fused to the jne, there is no gain.


Grom 68k

Posts 61
17 Jul 2019 13:25


Stefan "Bebbo" Franke wrote:

Grom 68k wrote:

  ...
  Is gcc can make difference between  subq.l #4,d0  and  subq.l #1,d0 ?
  If yes,  subq.l #4,d0  could be move upward.
  ...
 

 
  since the sub is fused to the jne, there is no gain.

Unlike other, #1 is specified for subq/bne fusing.

Gunnar von Boehn wrote:

  MOVEq  #,Dn
  AND.L  Dx,Dn   
 
 
  SUBQ.L #1,Dn
  BNE.s  LOOP




Samuel Devulder

Posts 248
17 Jul 2019 13:45


;2
    fmove.d  fp1,(16,a2)
    rts
  ; total = 29 cycles (include 3 wait cycles)
Maybe that final bubble can be reduced if we do something like this if the final fmove+jmp operation can be executed together in the same cycle:

    move.l (sp)+,a0
  ; 1
    fmove.d fp1,(16,a2)
    jmp (a0)


posts 367page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19