Overview Features Coding ApolloOS Performance Forum Downloads Products Order Contact

Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.
Please read the forum usage manual.
Please visit our Apollo-Discord Server for support.



All TopicsNewsPerformanceGamesDemosApolloVampireAROSWorkbenchATARIReleases
Information about the Apollo CPU and FPU.

GCC Improvement for 68080page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

Stefan "Bebbo" Franke

Posts 139
29 Jun 2019 10:20


I forgot to mention the modified handling of int constants:
 
 
 
  EXTERNAL LINK 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
29 Jun 2019 10:22


Stefan "Bebbo" Franke wrote:

Which cpu - of the ones implemented in gcc - is the closest match and a good starting point?
 
Some cold fire? one of i386? mips? sparc? ...?

Not coldfire. (Coldfire FPU is slower than 68080)

"Modern" FPU are pipelined like our.
E.g. Todays x86 (starting with Pentium), All PowerPC.

Our goal should be to "inform" GCC that it can throw an FADD every cycle - but that each has a LATENCY..
Teaching GCC that running several FADD in parallem will increase speed a lot.

For example, GCC does this very well for PowerPC.



Stefan "Bebbo" Franke

Posts 139
29 Jun 2019 11:27


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

  Which cpu - of the ones implemented in gcc - is the closest match and a good starting point?
 
  Some cold fire? one of i386? mips? sparc? ...?
 

 
  Not coldfire. (Coldfire FPU is slower than 68080)
 
  "Modern" FPU are pipelined like our.
  E.g. Todays x86 (starting with Pentium), All PowerPC.
 
  Our goal should be to "inform" GCC that it can throw an FADD every cycle - but that each has a LATENCY..
  Teaching GCC that running several FADD in parallem will increase speed a lot.
 
  For example, GCC does this very well for PowerPC.
 

  - how many float insns are processed parallel?
  - what is the latency of each insn?
  - is there a dependency of some insns to other insns?

same for integer insn (disregard fusing here - 2 fused insns in 68080 should be treated as one insn in gcc)
  - how many int insns are processed parallel?
  --> 2
  - what is the latency of each insn?
  - is there a dependency of some insns to other insns?

best is put dependand insns into groups and specify the dependency of the groups.

e.g.
  - indirect mem: blocks both pipes
  - address used, is blocked by address calculation

and so on...

and for AMMX a capable GNU ASM is mandatory




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
29 Jun 2019 14:12


Stefan "Bebbo" Franke wrote:

    - how many float insns are processed parallel?
 

 
Please help me understand what is your definition of processed?
How many are in flight?
How many can be issued,
how many can be retired?
 
APOLLO 68080 can today issue/retire 1 FPU instruction per cycle.
 
 
Stefan "Bebbo" Franke wrote:

    - what is the latency of each insn?

FMOVE 1
FADD  6
FSUB  6
FCMP  6
FMUL  6
FDIV  10
FSQRT 22
FMOVEM 1 per cycle
 
Inputformat which is float and comes from (Mem) (Dn) or FREG
== FREE
 
Inputformat which is TYPE INTEGER "+1 cycle" for any source
 
 
Stefan "Bebbo" Franke wrote:

  - is there a dependency of some insns to other insns?

No
 
All calculations can be done in parallel.
Instruction which needs the FLAGS like
FBCC will wait for all instruction to finish.
 
 
 
 
Stefan "Bebbo" Franke wrote:

  same for integer insn (disregard fusing here - 2 fused insns in 68080 should be treated as one insn in gcc)
    - how many int insns are processed parallel?

2 INTEGER
(3 in some exception but lets ignored this today)

 
Stefan "Bebbo" Franke wrote:

    - what is the latency of each insn?
 

always 1
 
 
More expensive are

MUL=2
DIV=32
MOVEM=1 per Reg
MOVE16=4
CMPM=2
JMP/JSR with calculated EA =4  E.g. "JSR -40(A6)"
JMP /JSR absolute or PC-relativ  =1
 
 
 
Stefan "Bebbo" Franke wrote:

  - is there a dependency of some insns to other insns?

Both pipes can do memory access
but only 1 pipe can do it per cycle.
 
Flag dependency and Register dependency of course.
 
 


Stefan "Bebbo" Franke

Posts 139
29 Jun 2019 22:30


I have no good example, but:


double foo(double a, double b, double c) {
    return c/2 + c * (b-a) + b * b + a * (a + 1) / (a * a - 1);
}

yields now


_foo:
        link.w a5,#0
        fmovem #28,-(sp)
        fdmove.d (8,a5),fp0
        fdmove.d (16,a5),fp4
        fdmove.x fp4,fp2
        fdsub.x fp0,fp4
        fdmove.x fp4,fp1
        fdmove.d (24,a5),fp3
        fdmul.x fp3,fp1
        fdmul.d #0x3fe0000000000000,fp3
        fdmul.x fp2,fp2
        fdadd.x fp3,fp1
        fdadd.x fp2,fp1
        fmovecr #0x32,fp2
        fdadd.x fp0,fp2
        fdmul.x fp0,fp2
        fmovecr #0x32,fp3
        fdmul.x fp0,fp0
        fdsub.x fp3,fp0
        fddiv.x fp0,fp2
        fdadd.x fp2,fp1
        fmove.d fp1,-(sp)
        move.l (sp)+,d0
        move.l (sp)+,d1
        fmovem (sp)+,#56
        unlk a5
        rts

would this be better?


_foo:
        link.w a5,#0
        fdmove.d (16,a5),fp0
        fmovem #60,-(sp)
        fdmove.d (8,a5),fp4
        fdmove.x fp4,fp3
        fmovecr #0x32,fp2
        fdadd.x fp3,fp2
        fdmul.x fp3,fp4
        fdmul.x fp3,fp2
        fdmove.x fp0,fp1
        fdsub.x fp3,fp0
        fmovecr #0x32,fp3
        fdmove.d (24,a5),fp5
        fdsub.x fp3,fp4
        fddiv.x fp4,fp2
        fdmul.x fp5,fp0
        fdmul.d #0x3fe0000000000000,fp5
        fdmul.x fp1,fp1
        fdadd.x fp5,fp0
        fdadd.x fp1,fp0
        fdadd.x fp2,fp0
        fmovem (sp)+,#60
        fmove.d fp0,-(sp)
        move.l (sp)+,d0
        move.l (sp)+,d1
        unlk a5
        rts



Stefan "Bebbo" Franke

Posts 139
30 Jun 2019 21:40


No feedback? Never mind. I put that change into the build queue -> if it does'nt break anything, it's live in 1 hour


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
30 Jun 2019 22:10


Hi Bebbo,

I think the code uses the parallelism already better.
So yes I think its an improvement. Thanks a lot.

Maybe we can create other examples to discuss this better?
 
How about a workloop example which e.g. does a 3D rotation?
 


Stefan "Bebbo" Franke

Posts 139
01 Jul 2019 08:22


Gunnar von Boehn wrote:

  Hi Bebbo,
   
    I think the code uses the parallelism already better.
    So yes I think its an improvement. Thanks a lot.
   
    Maybe we can create other examples to discuss this better?
   
    How about a workloop example which e.g. does a 3D rotation?
     
 

 
  Sorry, I'm not good at creating such examples.
  Can someone else step in? (maybe even using EXTERNAL LINK )
 


Steve Ferrell

Posts 424
01 Jul 2019 09:08


Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

    Hi Bebbo,
   
    I think the code uses the parallelism already better.
    So yes I think its an improvement. Thanks a lot.
   
    Maybe we can create other examples to discuss this better?
     
    How about a workloop example which e.g. does a 3D rotation?
     
   

   
    Sorry, I'm not good at creating such examples.
    Can someone else step in? (maybe even using EXTERNAL LINK )
   

This C source loads and compiles:

#include <stdio.h>
#include <math.h>

#define M_PI 3.14159265358979323846264338327950288

typedef struct {
  float x;
  float y;
  float z;
}Point;
Point points;

float temp = 0;

void showPoint() {
  printf("%f %f %f\n", points.x, points.y, points.z);
}

void translate(float tx, float ty, float tz) {
  points.x += tx;
  points.y += ty;
  points.z += tz;
  printf("After Translation, new point is :");
  showPoint();
}

void rotatex(float angle) {
  angle = angle * M_PI / 180.0;
  temp = points.y;
  points.y = points.y * cos(angle) - points.z * sin(angle);
  points.z = temp * sin(angle) + points.z * cos(angle);
  printf("After rotation about x, new point is: ");
  showPoint();
}

void rotatey(float angle) {
  angle = (angle * M_PI) / 180.0;
  temp = points.z;
  points.z = points.z * cos(angle) - points.x * sin(angle);
  points.x = temp * sin(angle) + points.x * cos(angle);
  printf("After rotation about y, new point is: ");
  showPoint();

}

void rotatez(float angle) {
  angle = angle * M_PI / 180.0;
  temp = points.x;
  points.x = points.x * cos(angle) - points.y * sin(angle);
  points.y = temp * sin(angle) + points.y *cos(angle);
  printf("After rotation about z, new point is: ");
  showPoint();

}

void scale(float sf, float xf, float yf, float zf) {
  points.x = points.x * sf + (1 - sf) * xf;
  points.y = points.y * sf + (1 - sf) * yf;
  points.z = points.z * sf + (1 - sf) * zf;
  printf("After scaling, new point is: ");
  showPoint();
}

int main()
{
  float tx = 0, ty = 0, tz = 0;
  float sf = 0, xf = 0, yf = 0, zf = 0;
  int choose;
  float angle;
  float my_x, my_y, my_z;
  printf("Enter the initial point you want to transform: ");
  scanf("%f %f %f", &my_x, &my_y, &my_z);
  points.x = my_x; points.y = my_y;  points.z = my_z;
  printf("Choose the following: \n");
  printf("1. Translate\n");
  printf("2. Rotate about X axis\n");
  printf("3. Rotate about Y axis\n");
  printf("4. Rotate about Z axis\n");
  printf("5. Scale\n");
  scanf("%d", &choose);
  switch (choose) {
  case 1:
    printf("Enter the value of tx, ty and tz: ");
    scanf("%d %d %d", &tx, &ty, &tz);
    translate(tx, ty, tz);
    break;
  case 2:
    printf("Enter the angle: ");
    scanf("%f", &angle);
    rotatex(angle);
    break;
  case 3:
    printf("Enter the angle: ");
    scanf("%f", &angle);
    rotatey(angle);
    break;
  case 4:
    printf("Enter the angle: ");
    scanf("%f", &angle);
    rotatez(angle);
    break;
  case 5:
    printf("Enter the value of sf, xf, yf and zf: ");
    scanf("%f %f %f %f", &sf, &xf, &yf, &zf);
    scale(sf, xf, yf, zf);
    break;
  default:
    break;
  }
  return 0;
}



Steve Ferrell

Posts 424
01 Jul 2019 09:21


This is probably more appropriate:

#include <iostream>
#include <cmath>
 
using namespace std;
 
typedef struct {
    float x;
    float y;
    float z;
}Point;
Point points;
 
float rotationMatrix[4][4];
float inputMatrix[4][1] = {0.0, 0.0, 0.0, 0.0};
float outputMatrix[4][1] = {0.0, 0.0, 0.0, 0.0};
 
void showPoint(){
    cout<<"("<<outputMatrix[0][0]<<","<<outputMatrix[1][0]<<","<<outputMatrix[2][0]<<")"<<endl;
}
 
void multiplyMatrix()
{
    for(int i = 0; i < 4; i++ ){
        for(int j = 0; j < 1; j++){
            outputMatrix[j] = 0;
            for(int k = 0; k < 4; k++){
                outputMatrix[j] += rotationMatrix[k] * inputMatrix[k][j];
            }
        }
    }
}
void setUpRotationMatrix(float angle, float u, float v, float w)
{
    float L = (u*u + v * v + w * w);
    angle = angle * M_PI / 180.0; //converting to radian value
    float u2 = u * u;
    float v2 = v * v;
    float w2 = w * w;
 
    rotationMatrix[0][0] = (u2 + (v2 + w2) * cos(angle)) / L;
    rotationMatrix[0][1] = (u * v * (1 - cos(angle)) - w * sqrt(L) * sin(angle)) / L;
    rotationMatrix[0][2] = (u * w * (1 - cos(angle)) + v * sqrt(L) * sin(angle)) / L;
    rotationMatrix[0][3] = 0.0;
 
    rotationMatrix[1][0] = (u * v * (1 - cos(angle)) + w * sqrt(L) * sin(angle)) / L;
    rotationMatrix[1][1] = (v2 + (u2 + w2) * cos(angle)) / L;
    rotationMatrix[1][2] = (v * w * (1 - cos(angle)) - u * sqrt(L) * sin(angle)) / L;
    rotationMatrix[1][3] = 0.0;
 
    rotationMatrix[2][0] = (u * w * (1 - cos(angle)) - v * sqrt(L) * sin(angle)) / L;
    rotationMatrix[2][1] = (v * w * (1 - cos(angle)) + u * sqrt(L) * sin(angle)) / L;
    rotationMatrix[2][2] = (w2 + (u2 + v2) * cos(angle)) / L;
    rotationMatrix[2][3] = 0.0;
 
    rotationMatrix[3][0] = 0.0;
    rotationMatrix[3][1] = 0.0;
    rotationMatrix[3][2] = 0.0;
    rotationMatrix[3][3] = 1.0;
}
 
int main()
{
    float angle;
    float u, v, w;
    cout<<"Enter the initial point you want to transform:";
    cin>>points.x>>points.y>>points.z;
    inputMatrix[0][0] = points.x;
    inputMatrix[1][0] = points.y;
    inputMatrix[2][0] = points.z;
    inputMatrix[3][0] = 1.0;
 
    cout<<"Enter axis vector: ";
    cin>>u>>v>>w;
 
    cout<<"Enter the rotating angle in degree: ";
    cin>>angle;
 
    setUpRotationMatrix(angle, u, v, w);
    multiplyMatrix();
    showPoint();
 
    return 0;
}


Stefan "Bebbo" Franke

Posts 139
01 Jul 2019 10:55


Steve Ferrell wrote:

This is probably more appropriate:
...
  void multiplyMatrix()
  {
      for(int i = 0; i < 4; i++ ){
          for(int j = 0; j < 1; j++){
              outputMatrix[j] = 0;
              for(int k = 0; k < 4; k++){
                  outputMatrix[j] += rotationMatrix[k] * inputMatrix[k][j];
              }
          }
      }
  }
...

does not compile


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Jul 2019 11:41


How about we make it simple?
   
Please can you look at STREAM benchmark!
Its very popular and it is in fact a a FPU benchmark too.
   
EXTERNAL LINK     
We should easily be able to measure and see the performance speed up.
   
   
Example:

#include <string.h>
 
void Scale(double scalar, double* b, double* c)
{
    size_t j;
    for (j=1000; j; j--){
        *b++ = scalar * *c++;
    }
  }

And unrolled this one?


#include <string.h>
 
void Scale(double scalar, double* b, double* c)
{
    size_t j;
    for (j=1000; j; j--){
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
        *b++ = scalar * *c++;
    }
  }

   
  The online 6.5B compiler give my strange code for this.
  It not uses FPU at all?
 
  Can you help?
 
 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Jul 2019 13:23


And how does this code look ?


#include <string.h>
void Scale(double scalar, double* b, double* c)
{
    size_t j;
    double t1;
    double t2;
    double t3;
    double t4;
    for (j=1000; j; j--){
        t1 = scalar* *c++;
        t2 = scalar* *c++;
        t3 = scalar* *c++;
        t4 = scalar* *c++;
        *b++ =t1;
        *b++ =t2;
        *b++ =t3;
        *b++ =t4;
    }
}



Stefan "Bebbo" Franke

Posts 139
01 Jul 2019 13:25


Gunnar von Boehn wrote:

How about we make it simple?
 
  Please can you look at STREAM benchmark!
  Its very popular and it is in fact a a FPU benchmark too.
 
  EXTERNAL LINK   
  We should easily be able to measure and see the performance speed up.
 
 
  Example:
 

  void tuned_STREAM_Scale(STREAM_TYPE scalar)
  {
  ssize_t j;
  #pragma omp parallel for
  for (j=0; j<STREAM_ARRAY_SIZE; j++)
      b[j] = scalar*c[j];
  }
 

 
  How does this compile?

this is a good example to show how hard it is to provide a good example.

The result varies on the given options.

It may yield:


tuned_STREAM_Scale(double):
        rts

or


_tuned_STREAM_Scale:
        link.w a5,#0
        move.l a2,-(sp)
        fdmove.d (8,a5),fp0
        lea _c,a0
        lea _b,a1
        move.l #80000000,a2
        add.l a0,a2
.L2:
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp1,(a1)+
        cmp.l a0,a2
        jne .L2
        move.l (sp)+,a2
        unlk a5
        rts

or do you want an unrolled loop?



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Jul 2019 13:30


Stefan "Bebbo" Franke wrote:

  or do you want an unrolled loop?
 

Yes.
 
Can you please also compile the 3 snippest I posted above?

Maybe we can also compare -OS and -O3


Stefan "Bebbo" Franke

Posts 139
01 Jul 2019 13:43


Gunnar von Boehn wrote:

  Can you help?

add -m68881  :-)


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Jul 2019 13:47


Lets look here


#include <string.h>
void Scale(double scalar, double* b, double* c)
{
    size_t j;
    double t1;
    double t2;
    double t3;
    double t4;
    for (j=1000; j; j--){
        t1 = scalar* *c++;
        t2 = scalar* *c++;
        t3 = scalar* *c++;
        t4 = scalar* *c++;
        *b++ =t1;
        *b++ =t2;
        *b++ =t3;
        *b++ =t4;
    }
}
 

-Os -m68080 -mhard-float -fomit-frame-pointer


_Scale:
        fmovem #28,-(sp)
        fdmove.d (40,sp),fp0
        move.l (48,sp),a0
        move.l (52,sp),a1
        clr.l d0
.L2:
        fdmove.x fp0,fp3
        fdmove.x fp0,fp2
        fdmul.d (8,a1,d0.l),fp3
        fdmove.x fp0,fp1
        fdmul.d (16,a1,d0.l),fp2
        fdmove.d (a1,d0.l),fp4
        fdmul.x fp0,fp4
        fdmul.d (24,a1,d0.l),fp1
        fmove.d fp4,(a0,d0.l)
        fmove.d fp3,(8,a0,d0.l)
        fmove.d fp2,(16,a0,d0.l)
        fmove.d fp1,(24,a0,d0.l)
        add.l #32,d0
        cmp.l #32000,d0
        jne .L2
        fmovem (sp)+,#56
        rts

OK what could be improved?

a) use LOOP which counts a D0 down
subq.l #1,D0
BNE.b

b) GCC by accident used fat EA modes?
Does GCC calc the instruction length of this correctly?
(8,a0,d0.l)  is 2 byte longer than
(a0)+

OK now use O3


_Scale:
        fmovem #28,-(sp)
        move.l a2,-(sp)
        fdmove.d (44,sp),fp0
        move.l (52,sp),a0
        move.l (56,sp),a1
        lea (32000,a0),a2
.L2:
        fdmove.x fp0,fp4
        fdmove.x fp0,fp3
        fdmul.d (a1),fp4
        fdmove.x fp0,fp2
        fdmul.d (8,a1),fp3
        fdmul.d (16,a1),fp2
        fdmove.x fp0,fp1
        lea (32,a1),a1
        fdmul.d (-8,a1),fp1
        fmove.d fp4,(a0)
        fmove.d fp3,(8,a0)
        fmove.d fp2,(16,a0)
        lea (32,a0),a0
        fmove.d fp1,(-8,a0)
        cmp.l a0,a2
        jne .L2
        move.l (sp)+,a2
        fmovem (sp)+,#56
        rts

Again GCC uses EA (D16,A0) this is bigger and not faster than (a0)+
The code would be a lot better if (A0)+  is used.




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Jul 2019 13:48


Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

    Can you help?
 

 
  add -m68881  :-)

68080 has FPU build in.
Can Hard-Fpu be default for it?



Stefan "Bebbo" Franke

Posts 139
01 Jul 2019 13:49


Gunnar von Boehn wrote:

 
  First: you can't unroll-loops with -Os^^
 
 
  And how does this code look ?
 
 

  #include <string.h>
  void Scale(double scalar, double* b, double* c)
  {
      size_t j;
      double t1;
      double t2;
      double t3;
      double t4;
      for (j=1000; j; j--){
          t1 = scalar* *c++;
          t2 = scalar* *c++;
          t3 = scalar* *c++;
          t4 = scalar* *c++;
          *b++ =t1;
          *b++ =t2;
          *b++ =t3;
          *b++ =t4;
      }
  }
 

 

 
  this code can be scheduled: EXTERNAL LINK 
 

  _Scale3:
          link.w a5,#0
          fmovem #28,-(sp)
          move.l a2,-(sp)
          fdmove.d (8,a5),fp0
          move.l (16,a5),a0
          move.l (20,a5),a1
          lea (32000,a0),a2
  .L2:
          fdmove.x fp0,fp4
          fdmove.x fp0,fp3
          fdmul.d (a1),fp4
          fdmove.x fp0,fp2
          fdmul.d (8,a1),fp3
          fdmul.d (16,a1),fp2
          fdmove.x fp0,fp1
          lea (32,a1),a1
          fdmul.d (-8,a1),fp1
          fmove.d fp4,(a0)
          fmove.d fp3,(8,a0)
          fmove.d fp2,(16,a0)
          lea (32,a0),a0
          fmove.d fp1,(-8,a0)
          cmp.l a0,a2
          jne .L2
          move.l (sp)+,a2
          fmovem (sp)+,#56
          unlk a5
          rts
 

 
  the shorter ones won't work for now - since the scheduler is feed with single (gcc internal) insns like
 

        fmul (a0),(a1)
 

 
  it might help to split those insns in a prepent pass...
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Jul 2019 14:01


Did you saw my post reporting that unneeded longer EA modes are used?
How could this happen with -Os

posts 367page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19