APOLLO CPU Knowledge Forum

Information about the Apollo CPU and FPU.

GCC Improvement for 68080


Stefan "Bebbo" Franke

Posts 142
15 Jul 2019 18:30

-funroll-loops is working better now too: EXTERNAL LINK

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
15 Jul 2019 18:44

Stefan "Bebbo" Franke wrote:

-funroll-loops is working better now too: EXTERNAL LINK

Well done Bebbo,

this is major progress!

Samuel Devulder

Posts 248
15 Jul 2019 19:16

Stefan "Bebbo" Franke wrote:

-funroll-loops is working better now too: EXTERNAL LINK

I don't understand the first operation

_Scale8:
         fdmove.x fp1,fp0  ; <== HERE
         move.l #640,d0
         fmovem fp2/fp3/fp4,-(sp)
.L2:
         fdmove.d (a1)+,fp4
         fdmul.x fp0,fp4
         fdmove.d (a1)+,fp3
         fdmul.x fp0,fp3
         fdmove.d (a1)+,fp2
         fdmul.x fp0,fp2
         fdmove.d (a1)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp4,(a0)+
         fmove.d fp3,(a0)+
         fmove.d fp2,(a0)+
         fmove.d fp1,(a0)+
         subq.l #4,d0
         jne .L2
         fmovem (sp)+,fp4/fp3/fp2
         rts

We could have used fp1 as the source throughout the unrolled loop and fp0 as a temporary register inside it. Both registers are interchangeable in this context, so copying one onto the other costs one needless instruction.

If we swap the scalar0/scalar arguments, the copy does not occur (the rest of the code is unchanged).

Stefan "Bebbo" Franke

Posts 142
15 Jul 2019 19:27

Samuel Devulder wrote:

Stefan "Bebbo" Franke wrote:

-funroll-loops is working better now too: EXTERNAL LINK

I don't understand the first operation

regparms are a hack in the parser: they still remain ordinary parameters, so the compiler's first action is to load them into a register, because it does not know about register parameters.

Samuel Devulder

Posts 248
15 Jul 2019 19:44

Okay, thanks for explaining. And now for something totally different (maybe?), if I do ( EXTERNAL LINK )


   #include <stddef.h>   /* size_t */

   /* scalar0 is unused in this variant. */
   double Scale8(double scalar0, double scalar, double* restrict b, double* restrict c)
   {
       double t = 0;
       size_t j;
       for (j = 640; j; j--) {
           t += *b++ = scalar * *c++;
       }
       return t;
   }

I get

_Scale8:
           fmove.d #0x000000000,fp0
           fmovem fp2/fp3/fp4/fp5,-(sp)
           fdmove.d (60,sp),fp1
           move.l (72,sp),a1
           move.l (68,sp),a0
           move.l #640,d0
   .L2:
           fdmove.d (a1)+,fp5
           fdmul.x fp1,fp5
           fdmove.d (a1)+,fp4
           fdmul.x fp1,fp4
           fdmove.d (a1)+,fp3
           fdmul.x fp1,fp3
           fdmove.d (a1)+,fp2
           fdmul.x fp1,fp2
           fmove.d fp5,(a0)+
           fmove.d fp4,(a0)+
           fdadd.x fp5,fp0
   ; ~5 cycles waiting for fp0
           fdadd.x fp4,fp0
           fmove.d fp3,(a0)+
   ; ~4 cycles waiting for fp0
           fdadd.x fp3,fp0
           fmove.d fp2,(a0)+
   ; ~4 cycles waiting for fp0
           fdadd.x fp2,fp0
           subq.l #4,d0
           jne .L2
           fmove.d fp0,-(sp)
           move.l (sp)+,d0
           move.l (sp)+,d1
           fmovem (sp)+,fp5/fp4/fp3/fp2
           rts

That makes 13 wait cycles, but if we reorganize the final additions like this

         ...
           fmove.d fp5,(a0)+
           fadd.x  fp4,fp5        ; <== notice
           fmove.d fp4,(a0)+
           fmove.d fp3,(a0)+
           fadd.x  fp2,fp3        ; <== notice
           fmove.d fp2,(a0)+
           subq.l #4,d0
   ; 0 cycles on fp5 (kewl!)
           fadd.x  fp5,fp0 
   ; ~5 cycles waiting for fp0 (well...)
           fadd.x  fp3,fp0
           jne .L2

we drop to 5 (instead of 13).
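
One caveat on this kind of reordering: it changes the order in which the partial sums are added, and floating-point addition is not associative, so the reorganized loop can return a slightly different t. As far as I know, GCC only reassociates floating-point additions on its own when options such as -fassociative-math (implied by -ffast-math) are in effect. A minimal Python sketch of the effect follows; the input values are made up to exaggerate the difference, and the function names are only illustrative.

```python
def serial_sum(values, t=0.0):
    # Mirrors the compiler output: every addition goes through the accumulator.
    for v in values:
        t += v
    return t

def paired_sum(values, t=0.0):
    # Mirrors the hand-reordered loop: add two neighbours first,
    # then add that partial sum to the accumulator.
    # Assumes an even number of values (the loop is unrolled by 4 anyway).
    for i in range(0, len(values), 2):
        t += values[i] + values[i + 1]
    return t

# Made-up values: at 1e16 the spacing between doubles is 2, so adding 1.0 is lost.
values = [1e16, 1.0, -1e16, 1.0]

print(serial_sum(values))  # 1.0
print(paired_sum(values))  # 0.0
```

With realistic data the two results usually differ only in the last bits, but they are not guaranteed to be identical.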


Stefan "Bebbo" Franke

Posts 142
15 Jul 2019 19:47

Pardon? fdadd.x is done in 1 cycle, isn't it?


Samuel Devulder

Posts 248
15 Jul 2019 19:56

I think it is 6, like fmul. Roughly speaking, provided the registers are available, all fmoves take 1 cycle, all fadd/fsub/fcmp/fmul take 6 (these are "simple" FPU ops), fdiv takes 10 (a "complex" FPU op), and the others take "a lot" (too complex and rare FPU ops), IIRC. See BigGun's post on page 3, which details the cycles for integer operations as well.

Stefan "Bebbo" Franke

Posts 142
15 Jul 2019 20:01

Samuel Devulder wrote:

I think it is 6, like fmul. Roughly speaking, provided the registers are available, all fmoves take 1 cycle, all fadd/fsub/fcmp/fmul take 6, fdiv takes 10, and the others take "a lot", IIRC (see BigGun's post on page 3, which details the cycles for integer operations as well).

EDIT: I consider only fmul with 6 cycles.

Dunno what's correct

Samuel Devulder

Posts 248
15 Jul 2019 20:04

Stefan "Bebbo" Franke wrote:

that's why I don't understand the wait cycles after fdadd.x

In the annotated asm I reported, or generally speaking? If it is in the annotated asm, I might have miscalculated something. Which one is puzzling?

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
15 Jul 2019 20:07

Stefan "Bebbo" Franke wrote:

Pardon?

fdadd.x is done in 1 cycle, isn't it?

FADD and FMUL both take 6 cycles.
Taking several cycles for both is typical for an FPU.

It is normal that in integer arithmetic ADD is faster than MUL.
But in floating-point arithmetic FADD is the same amount of work and has the same latency as FMUL.

Samuel Devulder

Posts 248
15 Jul 2019 20:17

Gunnar von Boehn wrote:

But in floating point arithmetic FADD is same work and same latency as FMUL.

Work like this, maybe:
1) extract sign / exponent / mantissa
2) add the exponents for '*', or align the mantissas for '+', '-' and fcmp
3) work on the mantissa (add or multiply)
4) normalize the mantissa and fix the exponent if needed
5) pack everything back into IEEE format (except for fcmp)
6) update the status registers
(This is how I arrive at the 6 cycles, which is probably wrong.)
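
The steps listed above can be sketched in Python with exact integer arithmetic. This is only a toy model: it shows that addition needs an alignment shift where multiplication needs an exponent add, and that both end with the same normalize/round/pack work. It says nothing about how the 68080 FPU is actually built, it skips the status-register step, and the helper names (unpack, pack, toy_fadd, toy_fmul) are made up for this sketch.

```python
import math
from fractions import Fraction

MANT_BITS = 53  # significand width of an IEEE 754 double

def unpack(x):
    """Step 1: split x so that x == sig * 2**exp exactly (the sign lives in sig)."""
    frac, exp = math.frexp(x)                  # x == frac * 2**exp, 0.5 <= |frac| < 1
    return int(math.ldexp(frac, MANT_BITS)), exp - MANT_BITS

def pack(sig, exp):
    """Steps 4 and 5: normalize, round once and pack back into a double."""
    return float(Fraction(sig) * Fraction(2) ** exp)

def toy_fadd(a, b):
    sa, ea = unpack(a)
    sb, eb = unpack(b)
    exp = min(ea, eb)                          # step 2: align to the smaller exponent
    total = (sa << (ea - exp)) + (sb << (eb - exp))   # step 3: integer add
    return pack(total, exp)

def toy_fmul(a, b):
    sa, ea = unpack(a)
    sb, eb = unpack(b)
    return pack(sa * sb, ea + eb)              # steps 2 and 3: add exponents, multiply

print(toy_fadd(0.1, 0.2) == 0.1 + 0.2)
print(toy_fmul(0.1, 0.2) == 0.1 * 0.2)
```

Both toy functions round the exact result once, so for finite inputs they should agree with the native + and * operators.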

Stefan "Bebbo" Franke

Posts 142
15 Jul 2019 20:23

ok - I updated the cycles for fadd and friends:


_Scale8:
         fmove.d #0x000000000,fp0
         fmovem fp2/fp3/fp4/fp5,-(sp)
         fdmove.x fp1,fp2
         move.l #640,d0
.L2:
         fdmove.d (a1)+,fp5
         fdmul.x fp2,fp5
         fdmove.d (a1)+,fp4
         fdmul.x fp2,fp4
         fdmove.d (a1)+,fp3
         fdmul.x fp2,fp3
         fdmove.d (a1)+,fp1
         fdmul.x fp2,fp1
         fmove.d fp5,(a0)+
         fdadd.x fp5,fp0
         fmove.d fp4,(a0)+
         fdadd.x fp4,fp0
         fmove.d fp3,(a0)+
         fdadd.x fp3,fp0
         fmove.d fp1,(a0)+
         fdadd.x fp1,fp0
         subq.l #4,d0
         jne .L2
         fmove.d fp0,-(sp)
         move.l (sp)+,d0
         move.l (sp)+,d1
         fmovem (sp)+,fp5/fp4/fp3/fp2
         rts

Samuel Devulder

Posts 248
15 Jul 2019 20:34

There are still wait states when the additions are done in a linear/serial fashion:


               fdadd.x fp5,fp0
               fmove.d fp4,(a0)+
               fdadd.x fp4,fp0 ; <== has to wait 5 cycles for fp0 to get out of the pipeline
               fmove.d fp3,(a0)+
               fdadd.x fp3,fp0 ; <== has to wait 5 cycles for fp0 to get out of the pipeline
               fmove.d fp1,(a0)+
               fdadd.x fp1,fp0 ; <== has to wait 5 cycles for fp0 to get out of the pipeline

I think it is better to "parallelize" things into independent operations. I typically see the computation as a tree with independent branches computing independent partial results that converge to the trunk (the end result). The partial results depend more and more on one another as we get closer to the trunk. Most of the branches are latency-free. Only the trunk concentrates the wait cycles, which can easily be filled with integer operations or other independent computational branches coming from expressions occurring a bit later in the code:

               fmove.d fp5,(a0)+
               fdadd.x fp4,fp5
               fmove.d fp4,(a0)+
               fmove.d fp3,(a0)+
               fdadd.x fp1,fp3 ; works in parallel with fp4+fp5
               fmove.d fp1,(a0)+
               fdadd.x fp5,fp0 ; wait 1 cycle for fp5 to finish its addition
               fadd.x  fp3,fp0 ; wait 5 cycles for fp0 to finish from the previous instruction

Total: 6 wait-cycles instead of 15.

Stefan "Bebbo" Franke

Posts 142
15 Jul 2019 20:56

Samuel Devulder wrote:

There are still wait-state when doing additions in a linear/serial fashion:


Total: 6 wait-cycles instead of 15.

I see, understand and know the pass where this happens...
... it's reload where the register assignment is done, and this pass does not consider the latency...

... maybe a no-fix for me.

EDIT: isn't it 4 cycles per wait, so 12 in total?

Philippe Flype
(Apollo Team Member)
Posts 299
15 Jul 2019 21:08

Since the 080 has a precise cycle counter,
I can output the real results for each of them.
These are register-to-register operations,
with the exception of FMOVE R/W and FMOVEM R/W.


    +------------+--------+-----+
    | FPU instr  | Single | OoO |
    +------------+--------+-----+
    | FABS       |      1 |   1 |
    | FADD       |      6 |   1 |
    | FCMP       |      6 |   1 |
    | FDABS      |      1 |   1 |
    | FDADD      |      6 |   1 |
    | FDDIV      |      9 |   2 |
    | FDIV       |      9 |   2 |
    | FDMOVE     |      1 |   1 |
    | FDMUL      |      6 |   1 |
    | FDNEG      |      1 |   1 |
    | FDSQRT     |     21 |  12 |
    | FDSUB      |      6 |   1 |
    | FINTRZ     |      2 |   1 |
    | FMOVERm    |      1 |   1 |
    | FMOVEWm    |      1 |   1 |
    | FMOVERi    |      1 |   1 |
    | FMOVEWi    |      1 |   1 |
    | FMOVECR    |      1 |   1 |
    | FMOVECTRL  |      4 |   4 |
    | FMOVEMR    |      8 |   8 |
    | FMOVEMW    |     25 |  25 |
    | FMUL       |      6 |   1 |
    | FNEG       |      1 |   1 |
    | FSABS      |      1 |   1 |
    | FSADD      |      6 |   1 |
    | FSDIV      |      9 |   2 |
    | FSGLDIV    |      9 |   2 |
    | FSGLMUL    |      6 |   1 |
    | FSMOVE     |      1 |   1 |
    | FSMUL      |      6 |   1 |
    | FSNEG      |      1 |   1 |
    | FSQRT      |     21 |  12 |
    | FSSQRT     |     21 |  12 |
    | FSSUB      |      6 |   1 |
    | FSUB       |      6 |   1 |
    | FTST       |      1 |   1 |
    | FSEQ       |      1 |   1 |
    | FSCC       |      1 |   1 |
    | FNOP       |      1 |   1 |
    +------------+--------+-----+
    | FPSP instr | Single | OoO |
    +------------+--------+-----+
    | FACOS      |    121 | 121 |
    | FASIN      |    121 | 121 |
    | FATAN      |    198 | 198 |
    | FATANH     |    153 | 153 |
    | FCOS       |    209 | 209 |
    | FCOSH      |    264 | 264 |
    | FETOX      |    220 | 220 |
    | FETOXM1    |    231 | 231 |
    | FGETEXP    |     88 |  88 |
    | FGETMAN    |     88 |  88 |
    | FINT       |     99 |  99 |
    | FLOG10     |    231 | 231 |
    | FLOG2      |    242 | 242 |
    | FLOGN      |    220 | 220 |
    | FLOGN1P    |    220 | 220 |
    | FMOD       |    121 | 121 |
    | FREM       |    121 | 121 |
    | FSCALE     |     99 |  99 |
    | FSIN       |    238 | 238 |
    | FSINCOS    |    264 | 264 |
    | FSINH      |    286 | 286 |
    | FTAN       |    198 | 198 |
    | FTANH      |    275 | 275 |
    | FTENTOX    |    231 | 231 |
    | FTWOTOX    |    231 | 231 |
    +------------+--------+-----+

Source code provided:

EXTERNAL LINK


Samuel Devulder

Posts 248
15 Jul 2019 21:44

@bebbo: For a 6-cycle FPU instruction, I count 5 other instruction slots before the result can be reused. Yes, you are right: this makes 4 wait cycles per addition in that code, hence a total of 12. The total count for the "parallelized" version is still 6.

Sam (this thread is very interesting)
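
These counts can be reproduced with a very small model: one instruction issues per cycle, in order, and an instruction stalls until every register it reads is available. This is only a sketch under those assumptions. It ignores the real 68080 pipeline, superscalar issue and the latency of the preceding fdmul instructions, and count_wait_cycles is a made-up helper. The 6-cycle fadd latency is the figure quoted in this thread.

```python
FADD_LATENCY = 6  # cycles, as quoted in the thread for fadd/fmul

def count_wait_cycles(program, latency=FADD_LATENCY):
    """program is a list of (op, source_registers, destination_or_None)."""
    ready = {}    # register -> first cycle at which its new value can be read
    cycle = 0     # cycle at which the next instruction would like to issue
    waits = 0
    for op, sources, dest in program:
        # Stall until all source registers are available.
        start = max([cycle] + [ready.get(reg, 0) for reg in sources])
        waits += start - cycle
        if dest is not None:
            ready[dest] = start + (latency if op == "fadd" else 1)
        cycle = start + 1
    return waits

# Serial accumulation: every fadd reads and writes fp0.
serial = [
    ("store", ["fp5"], None), ("fadd", ["fp5", "fp0"], "fp0"),
    ("store", ["fp4"], None), ("fadd", ["fp4", "fp0"], "fp0"),
    ("store", ["fp3"], None), ("fadd", ["fp3", "fp0"], "fp0"),
    ("store", ["fp1"], None), ("fadd", ["fp1", "fp0"], "fp0"),
]

# Tree-shaped accumulation: two independent partial sums, then the trunk.
tree = [
    ("store", ["fp5"], None), ("fadd", ["fp4", "fp5"], "fp5"),
    ("store", ["fp4"], None), ("store", ["fp3"], None),
    ("fadd", ["fp1", "fp3"], "fp3"), ("store", ["fp1"], None),
    ("fadd", ["fp5", "fp0"], "fp0"), ("fadd", ["fp3", "fp0"], "fp0"),
]

print(count_wait_cycles(serial))  # 12
print(count_wait_cycles(tree))    # 6
```

Under this model the serial version stalls 4 cycles on each of the last three additions, while the tree version stalls 1 cycle on fp5 and 5 cycles on fp0.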

Grom 68k

Posts 61
15 Jul 2019 22:20

Stefan "Bebbo" Franke wrote:

Niclas A wrote:

Grom 68k wrote:

Where is source ?

EXTERNAL LINK

not exactly - he's looking for the headers which are in several repos.

Easiest: grab the built version, and mail me the fixed headers and I'll put them live

Hi bebbo,

This is the script.


#!/usr/bin/env python3
# Prefix function prototypes in all .h files below the current
# directory with __stdargs.
import os
import re

# Matches "type name(" and "struct tag *name(" at the start of a line.
PROTOTYPE = re.compile(r'^(\s*)(\w+\s+\w+|struct\s+\w+\s+\*\w+)(\(.*)$')

for root, dirs, files in os.walk('.'):
    for name in files:
        if not name.endswith('.h'):
            continue
        path = os.path.join(root, name)
        with open(path, 'r') as infile:
            lines = infile.readlines()
        patched = [PROTOTYPE.sub(r'\1__stdargs \2\3', line) for line in lines]
        # Rewrite the header only if at least one prototype was changed.
        if patched != lines:
            with open(path, 'w') as outfile:
                outfile.writelines(patched)

I think I must check all files before delivery.
Could you give me your e-mail address for the delivery?

Do I have to modify these too?


uint32 APICALL (*Release)(struct SocketIFace *Self);
struct Interface * APICALL (*Clone)(struct SocketIFace *Self);

Thanks
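
A quick way to see what the script's patterns do is to run them on a few sample lines. In this sketch the first three lines are made-up examples, and the last two are the function pointers quoted in this post. The pattern below combines the script's two cases (plain "type name(" and "struct tag *name(") into one alternation, and patch is a hypothetical helper name.

```python
import re

# "type name(" or "struct tag *name(" at the start of a line.
PROTOTYPE = re.compile(r'^(\s*)(\w+\s+\w+|struct\s+\w+\s+\*\w+)(\(.*)$')

def patch(line):
    # Insert __stdargs after the leading whitespace of a matching line.
    return PROTOTYPE.sub(r'\1__stdargs \2\3', line)

samples = [
    "int foo(int a);",                  # made-up prototype
    "struct Node *bar(void);",          # made-up prototype
    "  return foo(x);",                 # made-up statement (false positive)
    "uint32 APICALL (*Release)(struct SocketIFace *Self);",
    "struct Interface * APICALL (*Clone)(struct SocketIFace *Self);",
]

for line in samples:
    print(patch(line))
```

The third sample shows a false positive: a statement inside an inline function would also get the prefix, which is a good reason to review the patched headers by hand. The two function-pointer lines are left untouched by the patterns, so they need separate treatment.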

Nixus Minimax

Posts 416
16 Jul 2019 08:03

Samuel Devulder wrote:

(this thread is very interesting)

I agree. What may be incomprehensible gibberish to many is really exciting to us.

Stefan "Bebbo" Franke

Posts 142
16 Jul 2019 08:35

Grom 68k wrote:

I think I must check all files before delivery.
Could you give me your e-mail address for the delivery?

Do I have to modify these too?


  uint32 APICALL (*Release)(struct SocketIFace *Self);
  struct Interface * APICALL (*Clone)(struct SocketIFace *Self);

Thanks

please mail to bebbo at bejy.net

And yes, the function pointers do need it too. If I read "APICALL" it would be sufficient to modify the define for "APICALL" - less work?

thank you!

Grom 68k

Posts 61
16 Jul 2019 10:18

Stefan "Bebbo" Franke wrote:

please mail to bebbo at bejy.net

And yes, the function pointers do need it too. If I read "APICALL" it would be sufficient to modify the define for "APICALL" - less work?

thank you!

I will modify APICALL manually; only a few files are affected.

Do you prefer a zip with all files or only the modified ones?
