GCC Improvement for 68080

Stefan "Bebbo" Franke

Posts 139
15 Jul 2019 18:30


-funroll-loops is working better now too: EXTERNAL LINK 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
15 Jul 2019 18:44


Stefan "Bebbo" Franke wrote:

-funroll-loops is working better now too: EXTERNAL LINK 

Well done Bebbo,

this is major progress!


Samuel Devulder

Posts 248
15 Jul 2019 19:16


Stefan "Bebbo" Franke wrote:

    -funroll-loops is working better now too: EXTERNAL LINK   

I don't understand the first operation
_Scale8:
        fdmove.x fp1,fp0  ; <== HERE
        move.l #640,d0
        fmovem fp2/fp3/fp4,-(sp)
.L2:
        fdmove.d (a1)+,fp4
        fdmul.x fp0,fp4
        fdmove.d (a1)+,fp3
        fdmul.x fp0,fp3
        fdmove.d (a1)+,fp2
        fdmul.x fp0,fp2
        fdmove.d (a1)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp4,(a0)+
        fmove.d fp3,(a0)+
        fmove.d fp2,(a0)+
        fmove.d fp1,(a0)+
        subq.l #4,d0
        jne .L2
        fmovem (sp)+,fp4/fp3/fp2
        rts
We could have used fp1 as the source throughout the unrolled loop and fp0 as a temporary register inside it. The two registers are interchangeable in this context, so copying one onto the other costs one needless instruction.

If we swap the scalar0/scalar arguments, the copy no longer occurs (the rest of the code is unchanged).


Stefan "Bebbo" Franke

Posts 139
15 Jul 2019 19:27


Samuel Devulder wrote:

Stefan "Bebbo" Franke wrote:

    -funroll-loops is working better now too: EXTERNAL LINK     

  I don't understand the first operation
_Scale8:
          fdmove.x fp1,fp0  ; <== HERE
          move.l #640,d0
          fmovem fp2/fp3/fp4,-(sp)
  .L2:
          fdmove.d (a1)+,fp4
          fdmul.x fp0,fp4
          fdmove.d (a1)+,fp3
          fdmul.x fp0,fp3
          fdmove.d (a1)+,fp2
          fdmul.x fp0,fp2
          fdmove.d (a1)+,fp1
          fdmul.x fp0,fp1
          fmove.d fp4,(a0)+
          fmove.d fp3,(a0)+
          fmove.d fp2,(a0)+
          fmove.d fp1,(a0)+
          subq.l #4,d0
          jne .L2
          fmovem (sp)+,fp4/fp3/fp2
          rts
We could have used fp1 as the source throughout the unrolled loop and fp0 as a temporary register inside it. The two registers are interchangeable in this context, so copying one onto the other costs one needless instruction.

  If we swap the scalar0/scalar arguments, the copy no longer occurs (the rest of the code is unchanged).

regparms are a hack in the parser: they still remain ordinary parameters, so the compiler's first action is to load them into a register, because it does not know about register parameters.



Samuel Devulder

Posts 248
15 Jul 2019 19:44


Okay, thanks for explaining. And now for something totally different (maybe?), if I do ( EXTERNAL LINK )

  double Scale8(double scalar0, double scalar, double* restrict b, double* restrict c)
  {
      double t = 0;
      size_t j;
      for (j = 640; j; j--) {
          t += *b++ = scalar * *c++;
      }
      return t;
  }
I get
_Scale8:
          fmove.d #0x000000000,fp0
          fmovem fp2/fp3/fp4/fp5,-(sp)
          fdmove.d (60,sp),fp1
          move.l (72,sp),a1
          move.l (68,sp),a0
          move.l #640,d0
  .L2:
          fdmove.d (a1)+,fp5
          fdmul.x fp1,fp5
          fdmove.d (a1)+,fp4
          fdmul.x fp1,fp4
          fdmove.d (a1)+,fp3
          fdmul.x fp1,fp3
          fdmove.d (a1)+,fp2
          fdmul.x fp1,fp2
          fmove.d fp5,(a0)+
          fmove.d fp4,(a0)+
          fdadd.x fp5,fp0
  ; ~5 cycles waiting for fp0
          fdadd.x fp4,fp0
          fmove.d fp3,(a0)+
  ; ~4 cycles waiting for fp0
          fdadd.x fp3,fp0
          fmove.d fp2,(a0)+
  ; ~4 cycles waiting for fp0
          fdadd.x fp2,fp0
          subq.l #4,d0
          jne .L2
          fmove.d fp0,-(sp)
          move.l (sp)+,d0
          move.l (sp)+,d1
          fmovem (sp)+,fp5/fp4/fp3/fp2
          rts
We have 13 wait cycles, but if we reorganize the final computation like this
         ...
          fmove.d fp5,(a0)+
          fadd.x  fp4,fp5        ; <== notice
          fmove.d fp4,(a0)+
          fmove.d fp3,(a0)+
          fadd.x  fp2,fp3        ; <== notice
          fmove.d fp2,(a0)+
          subq.l #4,d0
  ; 0 cycles on fp5 (kewl!)
          fadd.x  fp5,fp0
  ; ~5 cycles waiting for fp0 (well...)
          fadd.x  fp3,fp0
          jne .L2
 
we drop to 5 (instead of 13).
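
The two counts can be reproduced with a toy scheduling model. This is only a sketch under simplifying assumptions, not a description of the real 68080 pipeline: one instruction issues per cycle in program order, an fadd result becomes usable 6 cycles after issue, everything else takes 1 cycle, and the products in fp2..fp5 are assumed to be available already. The name `count_stalls` and the tuple encoding are made up for this illustration.

```python
# Assumed latencies for this toy model (cycles from issue until the result is usable).
LATENCY = {"fadd": 6, "fmove": 1, "subq": 1}

def count_stalls(program):
    """program: list of (op, source_registers, dest_register or None)."""
    ready = {}    # register -> cycle at which its value becomes available
    cycle = 0     # earliest cycle at which the next instruction may issue
    stalls = 0
    for op, sources, dest in program:
        # An instruction waits until all of its source registers are ready.
        issue = max([cycle] + [ready.get(r, 0) for r in sources])
        stalls += issue - cycle
        if dest is not None:
            ready[dest] = issue + LATENCY[op]
        cycle = issue + 1
    return stalls

# Tail of the loop as generated: every fadd depends on the previous one via fp0.
SERIAL = [
    ("fmove", ["fp5"], None),
    ("fmove", ["fp4"], None),
    ("fadd", ["fp5", "fp0"], "fp0"),
    ("fadd", ["fp4", "fp0"], "fp0"),
    ("fmove", ["fp3"], None),
    ("fadd", ["fp3", "fp0"], "fp0"),
    ("fmove", ["fp2"], None),
    ("fadd", ["fp2", "fp0"], "fp0"),
]

# Reorganized tail: two independent partial sums, folded into fp0 at the end.
TREE = [
    ("fmove", ["fp5"], None),
    ("fadd", ["fp4", "fp5"], "fp5"),
    ("fmove", ["fp4"], None),
    ("fmove", ["fp3"], None),
    ("fadd", ["fp2", "fp3"], "fp3"),
    ("fmove", ["fp2"], None),
    ("subq", [], "d0"),
    ("fadd", ["fp5", "fp0"], "fp0"),
    ("fadd", ["fp3", "fp0"], "fp0"),
]

print(count_stalls(SERIAL), count_stalls(TREE))
```

Under these assumptions the serial order stalls for 5 + 4 + 4 = 13 cycles and the reorganized order for 5.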


Stefan "Bebbo" Franke

Posts 139
15 Jul 2019 19:47


Pardon?

fdadd.x is done in 1 cycle, isn't it?


Samuel Devulder

Posts 248
15 Jul 2019 19:56


I think it is 6, like fmul. Roughly speaking, provided the registers are available, all fmoves take 1 cycle, all fadd/fsub/fcmp/fmul take 6 (these are "simple" FPU ops), fdiv takes 10 (a "complex" FPU op), and the others take "a lot" (too complex and rare), IIRC. See BigGun's post on page 3, which details cycle counts for integer operations as well.


Stefan "Bebbo" Franke

Posts 139
15 Jul 2019 20:01


Samuel Devulder wrote:

  I think this is 6 like fmul. Roughly speaking, provided the regs are available, all fmoves are 1 cycles, all fadd/fsub/fcmp/fmul are 6, fdiv is 10, and others are "a lot" IIRC (see BigGun post on page 3 which details cycles for int operations as well).
 

 
EDIT: So far I only treat fmul as taking 6 cycles.

I don't know what's correct.


Samuel Devulder

Posts 248
15 Jul 2019 20:04


Stefan "Bebbo" Franke wrote:

  that's why I don't understand the wait cycles after fdadd.x

In the annotated asm I posted, or generally speaking? If it is in the annotated asm, I might have miscalculated something. Which one is puzzling?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
15 Jul 2019 20:07


Stefan "Bebbo" Franke wrote:

perdon?
 
  fdadd.x is done in 1 cycle, isn't it?

FADD and FMUL both take 6 cycles.
That both take several cycles is typical for an FPU.

It's normal that in integer arithmetic ADD is faster than MUL.
But in floating-point arithmetic FADD is the same work and has the same latency as FMUL.




Samuel Devulder

Posts 248
15 Jul 2019 20:17


Gunnar von Boehn wrote:

    But in floating point arithmetic FADD is same work and same latency  as FMUL.
 

  Work like this, maybe:
  1) extract sign / exponent / mantissa
  2) add the exponents in case of '*', or align the mantissas in case of '+', '-' or fcmp
  3) work on the mantissas (add or multiply)
  4) normalize the mantissa and fix the exponent if needed
  5) pack everything back into IEEE format (except for fcmp)
  6) update the status registers
(This is how I account for the 6 cycles, which is probably wrong.)
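
As a rough software illustration of steps 1 to 5 (not of how the 68080 hardware is actually staged, which this thread does not document), the sketch below uses `math.frexp` and `math.ldexp` to split doubles into mantissa and exponent and put them back together. The function names are made up for this example; the sign travels with the mantissa, and step 6 (status registers) is not modelled.

```python
import math

def fmul_steps(a, b):
    # 1) unpack: x = m * 2**e with 0.5 <= |m| < 1 (m carries the sign)
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    # 2) multiplication: add the exponents
    e = ea + eb
    # 3) work on the mantissas
    m = ma * mb
    # 4) normalize the mantissa and get the exponent correction
    m, adjust = math.frexp(m)
    # 5) pack mantissa and exponent back into a double
    return math.ldexp(m, e + adjust)

def fadd_steps(a, b):
    # 1) unpack
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    # 2) addition: align both mantissas to the larger exponent
    e = max(ea, eb)
    ma = math.ldexp(ma, ea - e)
    mb = math.ldexp(mb, eb - e)
    # 3) work on the mantissas
    m = ma + mb
    # 4) normalize, 5) pack
    m, adjust = math.frexp(m)
    return math.ldexp(m, e + adjust)

print(fmul_steps(3.0, 1.5), fadd_steps(3.0, 0.25))
```

Both operations go through the same unpack, operate, normalize and pack sequence; only step 2 differs, which fits the remark that FADD is about as much work as FMUL.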


Stefan "Bebbo" Franke

Posts 139
15 Jul 2019 20:23


ok - I updated the cycles for fadd and friends:

_Scale8:
        fmove.d #0x000000000,fp0
        fmovem fp2/fp3/fp4/fp5,-(sp)
        fdmove.x fp1,fp2
        move.l #640,d0
.L2:
        fdmove.d (a1)+,fp5
        fdmul.x fp2,fp5
        fdmove.d (a1)+,fp4
        fdmul.x fp2,fp4
        fdmove.d (a1)+,fp3
        fdmul.x fp2,fp3
        fdmove.d (a1)+,fp1
        fdmul.x fp2,fp1
        fmove.d fp5,(a0)+
        fdadd.x fp5,fp0
        fmove.d fp4,(a0)+
        fdadd.x fp4,fp0
        fmove.d fp3,(a0)+
        fdadd.x fp3,fp0
        fmove.d fp1,(a0)+
        fdadd.x fp1,fp0
        subq.l #4,d0
        jne .L2
        fmove.d fp0,-(sp)
        move.l (sp)+,d0
        move.l (sp)+,d1
        fmovem (sp)+,fp5/fp4/fp3/fp2
        rts



Samuel Devulder

Posts 248
15 Jul 2019 20:34


There are still wait states when the additions are done in a linear/serial fashion:

              fdadd.x fp5,fp0
              fmove.d fp4,(a0)+
              fdadd.x fp4,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline
              fmove.d fp3,(a0)+
              fdadd.x fp3,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline
              fmove.d fp1,(a0)+
              fdadd.x fp1,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline

I think it is better to "parallelize" things into independent operations. I typically see the computation as a tree, with independent branches computing independent partial results that converge to the trunk (the end result). The partial results become more and more dependent on one another as we get closer to the trunk. Most of the branches are latency-free. Only the trunk concentrates the wait cycles, and these can easily be filled with integer operations or with other independent computational branches coming from expressions occurring a bit later in the code:
              fmove.d fp5,(a0)+
              fdadd.x fp4,fp5
              fmove.d fp4,(a0)+
              fmove.d fp3,(a0)+
              fdadd.x fp1,fp3 ; works in parallel with fp4+fp5
              fmove.d fp1,(a0)+
              fdadd.x fp5,fp0 ; wait 1 cycle for fp5 to finish its addition
              fadd.x  fp3,fp0 ; wait 5 cycles for fp0 to finish from the previous instruction
Total: 6 wait-cycles instead of 15.
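
One caveat worth keeping in mind with this kind of regrouping: floating-point addition is not associative, so summing in pairs can round differently from summing one element at a time. The sketch below is only an illustration in Python doubles (the function names are made up); it is not a statement about what GCC does for the 68080, although GCC in general only reassociates floating-point sums when -fassociative-math (implied by -ffast-math) permits it.

```python
def serial_sum(t, xs):
    # t += x for each element, in order (the order the C loop specifies)
    for x in xs:
        t += x
    return t

def paired_sum(t, xs):
    # add neighbours first, then fold each partial sum into the accumulator
    # (assumes an even number of elements, as in the 4x unrolled loop)
    for i in range(0, len(xs), 2):
        t += xs[i] + xs[i + 1]
    return t

small = [1.5, 2.5, 3.5, 4.5]
print(serial_sum(0.0, small) == paired_sum(0.0, small))    # exact values: same result

ones = [1.0, 1.0, 1.0, 1.0]
print(serial_sum(1e16, ones), paired_sum(1e16, ones))      # large accumulator: results differ
```

With a large accumulator each single 1.0 is rounded away in the serial order, while the paired order adds 2.0 twice and keeps them.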
     


Stefan "Bebbo" Franke

Posts 139
15 Jul 2019 20:56


Samuel Devulder wrote:

  There are still wait-state when doing additions in a linear/serial fashion:

                fdadd.x fp5,fp0
                fmove.d fp4,(a0)+
                fdadd.x fp4,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline
                fmove.d fp3,(a0)+
                fdadd.x fp3,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline
                fmove.d fp1,(a0)+
                fdadd.x fp1,fp0 ; <== has to wait 5 cycles for fp0 getting out of the pipeline

  I think it is better to "parallelize" things into independent operations. I typically see the computation as a tree, with independent branches computing independent partial results that converge to the trunk (the end result). The partial results become more and more dependent on one another as we get closer to the trunk. Most of the branches are latency-free. Only the trunk concentrates the wait cycles, and these can easily be filled with integer operations or with other independent computational branches coming from expressions occurring a bit later in the code:
          fmove.d fp5,(a0)+
                fdadd.x fp4,fp5
                fmove.d fp4,(a0)+
                fmove.d fp3,(a0)+
                fdadd.x fp1,fp3 ; works in parallel with fp4+fp5
                fmove.d fp1,(a0)+
                fdadd.x fp5,fp0 ; wait 1 cycle for fp5 to finish its addition
                fadd.x  fp3,fp0 ; wait 5 cycles for fp0 to finish from the previous instruction
Total: 6 wait-cycles instead of 15.
       
 

 
  I see, I understand, and I know the pass where this happens...
  ... it's reload, where the register assignment is done, and this pass does not consider latency...
 
  ... so this may be a no-fix for me.
 
EDIT: isn't it 4 cycles per wait, so 12 in total?
 


Philippe Flype
(Apollo Team Member)
Posts 299
15 Jul 2019 21:08


Since the 080 has a precise cycle counter,
  I can output the measured results for each instruction.
  These are register-to-register operations,
  with the exception of FMOVE R/W and FMOVEM R/W.
 
 

    +------------+--------+-----+
    | FPU instr  | Single | OoO |
    +------------+--------+-----+
    | FABS       |      1 |   1 |
    | FADD       |      6 |   1 |
    | FCMP       |      6 |   1 |
    | FDABS      |      1 |   1 |
    | FDADD      |      6 |   1 |
    | FDDIV      |      9 |   2 |
    | FDIV       |      9 |   2 |
    | FDMOVE     |      1 |   1 |
    | FDMUL      |      6 |   1 |
    | FDNEG      |      1 |   1 |
    | FDSQRT     |     21 |  12 |
    | FDSUB      |      6 |   1 |
    | FINTRZ     |      2 |   1 |
    | FMOVERm    |      1 |   1 |
    | FMOVEWm    |      1 |   1 |
    | FMOVERi    |      1 |   1 |
    | FMOVEWi    |      1 |   1 |
    | FMOVECR    |      1 |   1 |
    | FMOVECTRL  |      4 |   4 |
    | FMOVEMR    |      8 |   8 |
    | FMOVEMW    |     25 |  25 |
    | FMUL       |      6 |   1 |
    | FNEG       |      1 |   1 |
    | FSABS      |      1 |   1 |
    | FSADD      |      6 |   1 |
    | FSDIV      |      9 |   2 |
    | FSGLDIV    |      9 |   2 |
    | FSGLMUL    |      6 |   1 |
    | FSMOVE     |      1 |   1 |
    | FSMUL      |      6 |   1 |
    | FSNEG      |      1 |   1 |
    | FSQRT      |     21 |  12 |
    | FSSQRT     |     21 |  12 |
    | FSSUB      |      6 |   1 |
    | FSUB       |      6 |   1 |
    | FTST       |      1 |   1 |
    | FSEQ       |      1 |   1 |
    | FSCC       |      1 |   1 |
    | FNOP       |      1 |   1 |
    +------------+--------+-----+
    | FPSP instr | Single | OoO |
    +------------+--------+-----+
    | FACOS      |    121 | 121 |
    | FASIN      |    121 | 121 |
    | FATAN      |    198 | 198 |
    | FATANH     |    153 | 153 |
    | FCOS       |    209 | 209 |
    | FCOSH      |    264 | 264 |
    | FETOX      |    220 | 220 |
    | FETOXM1    |    231 | 231 |
    | FGETEXP    |     88 |  88 |
    | FGETMAN    |     88 |  88 |
    | FINT       |     99 |  99 |
    | FLOG10     |    231 | 231 |
    | FLOG2      |    242 | 242 |
    | FLOGN      |    220 | 220 |
    | FLOGN1P    |    220 | 220 |
    | FMOD       |    121 | 121 |
    | FREM       |    121 | 121 |
    | FSCALE     |     99 |  99 |
    | FSIN       |    238 | 238 |
    | FSINCOS    |    264 | 264 |
    | FSINH      |    286 | 286 |
    | FTAN       |    198 | 198 |
    | FTANH      |    275 | 275 |
    | FTENTOX    |    231 | 231 |
    | FTWOTOX    |    231 | 231 |
    +------------+--------+-----+
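
One possible way to use the table, sketched under an assumption that the post itself does not spell out: "Single" is read here as the cost when each instruction has to wait for the previous result, and "OoO" as the cost when the neighbouring instructions are independent. With that reading, the two column sums bracket the cycle count of a short FPU sequence. The helper name `cycle_bounds` is made up, and only a few rows are copied from the table.

```python
# (Single, OoO) cycle counts copied from a few rows of the table above.
CYCLES = {
    "FDMOVE": (1, 1),
    "FDMUL": (6, 1),
    "FDADD": (6, 1),
    "FDDIV": (9, 2),
    "FDSQRT": (21, 12),
}

def cycle_bounds(instructions):
    """Return (best, worst): all instructions independent vs. fully dependent."""
    best = sum(CYCLES[name][1] for name in instructions)
    worst = sum(CYCLES[name][0] for name in instructions)
    return best, worst

# The FPU arithmetic of one unrolled iteration: four multiplies, four adds.
body = ["FDMUL"] * 4 + ["FDADD"] * 4
print(cycle_bounds(body))
```

Real code falls somewhere between the two bounds, depending on how well dependent instructions are spaced apart.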
 

 
 
 
  Source code provided:
 
 
  EXTERNAL LINK 
 
 
 


Samuel Devulder

Posts 248
15 Jul 2019 21:44


@bebbo: For a 6-cycle FPU instruction, I count 5 other instruction slots before the result can be reused. Yes, you are right: this makes 4 wait cycles each in that code, hence a total of 12. The total for the "parallelized" version is still 6.

Sam (this thread is very interesting)


Grom 68k

Posts 61
15 Jul 2019 22:20


Stefan "Bebbo" Franke wrote:

Niclas A wrote:

 
Grom 68k wrote:

  Where is source ?
 

 
  EXTERNAL LINK   
 

 
  not exactly - he's looking for the headers which are in several repos.
 
  Easiest: grab the built version, and mail me the fixed headers and I'll put them live

Hi bebbo,

This is the script.


#!/usr/bin/env python3
# Prefix function prototypes in every .h file below the current directory with __stdargs.
import os
import re

# Matches "type name(..." or "struct tag *name(..." at the start of a line.
PROTO = re.compile(r'^(\s*)(\w+\s+\w+|struct\s+\w+\s+\*\w+)(\(.*)$')

for root, dirs, files in os.walk('.'):
  for name in files:
    if not name.endswith('.h'):
      continue
    path = os.path.join(root, name)
    count = 0
    # latin-1 maps every byte to a character, so headers with non-ASCII bytes survive unchanged
    with open(path, 'r', encoding='latin-1') as infile, \
         open(path + 'new', 'w', encoding='latin-1') as outfile:
      for currentline in infile:
        currentline, changed = PROTO.subn(r'\1__stdargs \2\3', currentline)
        outfile.write(currentline)
        count += changed
    if count == 0:
      os.remove(path + 'new')         # nothing changed: drop the temporary copy
    else:
      os.replace(path + 'new', path)  # put the rewritten header in place

I think I must check all files before delivery.
Could you give me your e-mail address for the delivery?

Must I modify these too?


uint32 APICALL (*Release)(struct SocketIFace *Self);
struct Interface * APICALL (*Clone)(struct SocketIFace *Self);
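
To see why these two lines need separate handling, the sketch below applies the same two patterns the script uses (written here as one alternation) to a few sample lines. The first two prototypes are made-up examples; the last two are the lines quoted above. The helper name `add_stdargs` is hypothetical.

```python
import re

# "type name(..." or "struct tag *name(..." at the start of a line
PROTO = re.compile(r'^(\s*)(\w+\s+\w+|struct\s+\w+\s+\*\w+)(\(.*)$')

def add_stdargs(line):
    # Insert __stdargs in front of the return type when the line looks like a prototype.
    return PROTO.sub(r'\1__stdargs \2\3', line)

SAMPLES = [
    "int foo(int a);",                                                 # made-up prototype
    "struct Node *bar(void);",                                         # made-up prototype
    "uint32 APICALL (*Release)(struct SocketIFace *Self);",            # from the post
    "struct Interface * APICALL (*Clone)(struct SocketIFace *Self);",  # from the post
]

for line in SAMPLES:
    print(add_stdargs(line))
```

The plain prototypes get the prefix, while the function-pointer members are left untouched because a space and "(*" follow the second word instead of an opening parenthesis directly.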

Thanks


Nixus Minimax

Posts 416
16 Jul 2019 08:03


Samuel Devulder wrote:
(this thread is very interesting)

I agree. What may be incomprehensible gibberish to many is really exciting to us.



Stefan "Bebbo" Franke

Posts 139
16 Jul 2019 08:35


Grom 68k wrote:

  I think I must check all files before delivery.
  Could you give me your mail for the delivery?
 
  I must modify these one too ?
 

  uint32 APICALL (*Release)(struct SocketIFace *Self);
  struct Interface * APICALL (*Clone)(struct SocketIFace *Self);
 

 
  Thanks

please mail to  bebbo at bejy.net

And yes, the function pointers need it too. Since they already carry "APICALL", it should be sufficient to modify the define for "APICALL". Less work?

thank you!



Grom 68k

Posts 61
16 Jul 2019 10:18


Stefan "Bebbo" Franke wrote:

Grom 68k wrote:

  I think I must check all files before delivery.
  Could you give me your mail for the delivery?
 
  I must modify these one too ?
 

  uint32 APICALL (*Release)(struct SocketIFace *Self);
  struct Interface * APICALL (*Clone)(struct SocketIFace *Self);
 

 
  Thanks
 

 
  please mail to  bebbo at bejy.net
 
  And yes, the function pointers do need it too. If I read "APICALL" it would be sufficient to modify the define for "APICALL" - less work?
 
  thank you!
 

I will modify APICALL manually; only a few files are affected.

Do you prefer a zip with all files or only the modified ones?

