APOLLO CPU Knowledge Forum

Information about the Apollo CPU and FPU.

GCC Improvement for 68080

Stefan "Bebbo" Franke

Posts 139
17 Jul 2019 13:54

Grom 68k wrote:

Stefan "Bebbo" Franke wrote:

Grom 68k wrote:

...
Can gcc tell the difference between subq.l #4,d0 and subq.l #1,d0?
If yes, subq.l #4,d0 could be moved upward.
...

since the sub is fused to the jne, there is no gain.

Unlike the other fusing cases, subq/bne fusing is specified with #1 only.

Gunnar von Boehn wrote:

MOVEq #,Dn
AND.L Dx,Dn

SUBQ.L #1,Dn
BNE.s LOOP


_Scale1: | unrolled 4 times
         fdmove.x fp1,fp0
         fmovem fp2/fp3/fp4,-(sp)
         moveq #20,d0
.L2:
         fdmove.d (a1)+,fp4
         fdmul.x fp0,fp4
         fdmove.d (a1)+,fp3
         fdmul.x fp0,fp3
         fdmove.d (a1)+,fp2
         fdmul.x fp0,fp2
         fdmove.d (a1)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp4,(a0)+
         fmove.d fp3,(a0)+
         subq.l #4,d0
         fmove.d fp2,(a0)+
         fmove.d fp1,(a0)+
         tst.l d0
         jne .L2
         fmove.d fp0,-(sp)
         move.l (sp)+,d0
         move.l (sp)+,d1
         fmovem (sp)+,fp4/fp3/fp2
         rts

Grom 68k

Posts 61
17 Jul 2019 14:09

Stefan "Bebbo" Franke wrote:

Grom 68k wrote:

Stefan "Bebbo" Franke wrote:

Grom 68k wrote:

...
Can gcc tell the difference between subq.l #4,d0 and subq.l #1,d0?
If yes, subq.l #4,d0 could be moved upward.
...

since the sub is fused to the jne, there is no gain.

Unlike the other fusing cases, subq/bne fusing is specified with #1 only.

Gunnar von Boehn wrote:

MOVEq #,Dn
AND.L Dx,Dn

SUBQ.L #1,Dn
BNE.s LOOP


   _Scale1: | unrolled 4 times
           fdmove.x fp1,fp0
           fmovem fp2/fp3/fp4,-(sp)
           moveq #20,d0
   .L2:
           fdmove.d (a1)+,fp4
           fdmul.x fp0,fp4
           fdmove.d (a1)+,fp3
           fdmul.x fp0,fp3
           fdmove.d (a1)+,fp2
           fdmul.x fp0,fp2
           fdmove.d (a1)+,fp1
           fdmul.x fp0,fp1
           fmove.d fp4,(a0)+
           fmove.d fp3,(a0)+
           subq.l #4,d0
           fmove.d fp2,(a0)+
           fmove.d fp1,(a0)+
           tst.l d0
           jne .L2
           fmove.d fp0,-(sp)
           move.l (sp)+,d0
           move.l (sp)+,d1
           fmovem (sp)+,fp4/fp3/fp2
           rts

The tst.l is useless: fmove doesn't modify the zero flag.
You can remove the unused Scalar0 (= remove the first fdmove.x fp1,fp0).

How does gcc choose the unroll count (4, 5, 6)?
If you can simplify the step to 1, the subq can be fused, and it would be better to unroll 5 or 6 times.
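
A minimal C sketch of the loop shape suggested here, assuming the element count is a non-zero multiple of 4. The counter holds the number of 4-element blocks, so it steps by 1 per pass. The name scale4 and its signature are made up for illustration; whether gcc then emits a fused subq.l #1 / bne pair depends on the compiler version and flags and is not demonstrated by this code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dst[i] = src[i] * factor, unrolled 4 times.
 * The loop counter counts blocks of 4 elements, so it is decremented
 * by 1 per pass instead of by 4.
 * Assumption: n is a non-zero multiple of 4. */
static void scale4(double *dst, const double *src, double factor, size_t n)
{
    assert(n != 0 && n % 4 == 0);
    size_t blocks = n / 4;

    do {
        /* four independent multiplies, loads first and stores last */
        double a = src[0] * factor;
        double b = src[1] * factor;
        double c = src[2] * factor;
        double d = src[3] * factor;
        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
        src += 4;
        dst += 4;
    } while (--blocks != 0);   /* step of 1 on the counter */
}
```

With n = 20 this runs 5 passes, which matches the 20 elements handled by the moveq #20,d0 / subq.l #4,d0 version in the listing.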

Stefan "Bebbo" Franke

Posts 139
17 Jul 2019 14:29

Grom 68k wrote:

The tst.l is useless: fmove doesn't modify the zero flag.
You can remove the unused Scalar0 (= remove the first fdmove.x fp1,fp0).

I know - not done yet.

Grom 68k wrote:

How does gcc choose the unroll count (4, 5, 6)?
If you can simplify the step to 1, the subq can be fused, and it would be better to unroll 5 or 6 times.

I limited it by some formula, otherwise gcc tends to use stack variables for unrolling.

Grom 68k

Posts 61
18 Jul 2019 08:08

I simplified my example to transform a vector3 with a matrix4x4: EXTERNAL LINK.
EDIT: This one is better for unrolling: EXTERNAL LINK
| Bx |   | Ux Vx Wx Tx | | Ax |
| By | = | Uy Vy Wy Ty | | Ay |
| Bz |   | Uz Vz Wz Tz | | Az |
| 1. |   | 0. 0. 0. 1. | | 1. |
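
For readers without access to the linked source, the product above can be sketched in plain C roughly as follows. This is only an illustration of the formula, not the linked code: the function name transform_vec3 is made up, the matrix is assumed to be stored row by row (Ux Vx Wx Tx first), and the fourth row (0 0 0 1) is treated as implicit.

```c
#include <assert.h>

/* Hypothetical sketch: out = M * in, with an implicit w = 1.
 * Row r of m holds (U, V, W, T) for component r, so
 *   Bx = Ux*Ax + Vx*Ay + Wx*Az + Tx   and likewise for By, Bz.
 * The last row of m is assumed to be 0 0 0 1 and is not read. */
static void transform_vec3(double out[3], const double m[4][4],
                           const double in[3])
{
    assert(out != in);   /* in-place use would overwrite inputs too early */

    for (int r = 0; r < 3; r++) {
        out[r] = m[r][0] * in[0]
               + m[r][1] * in[1]
               + m[r][2] * in[2]
               + m[r][3];            /* translation column times w = 1 */
    }
}
```

Written this way, each output component is three multiplies and three adds with no dependency on the other components, which is the kind of independent work the scheduler can interleave.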

EDIT2: The same with const and -O3: EXTERNAL LINK. Can gcc remove the useless a5?


         move.l (a1)+,(-8,a5)
         move.l (a1)+,(-4,a5)
         move.l (a1)+,(-16,a5)
         move.l (a1)+,(-12,a5)
         move.l (a1)+,(-24,a5)
         move.l (a1)+,(-20,a5)
         move.l (a1)+,(-32,a5)
         move.l (a1)+,(-28,a5)
         move.l (a1)+,(-40,a5)
         move.l (a1)+,(-36,a5)
         ...

And this is the rainflow: EXTERNAL LINK. It is used to calculate the fatigue damage of metal parts.

Thellier Alain

Posts 141
18 Jul 2019 12:27

Hello

I know you are only tuning the compiler, but would it be possible to collect those optimized sources into one ASM source?

I mean having 68080 optimized versions of
CrossProduct
DotProduct
MultyplyMatrices4x4
MultyplyMatrices3x3
TransformVec3Matrices4x4
TransformVec3Matrices3x3
Distance3
etc...
could be useful for lots of future 3D programs :-)

Thanks

Samuel Devulder

Posts 248
18 Jul 2019 12:59

That's what I did for the Monkey demo in CoffinOS. But beware of library versions: they tend to work against the full potential of optimizing compilers. For instance, here is an excerpt of a profiling session I did some time ago with a library version of DotProduct:

Test date:             Sun Oct 21 20:47:51 2018
   
   Execution profile for: sam/quake.gcc-3.2.2.030
   Time units:            Percentual
   Sort order:            Overall time
   Profiling mode:        Separate
   Used commandline:      -safe -usemode 0 
   All symbols shown
   
   _DotProduct                 7979523     0.000    13.965     0.000   434.592
   R_ClipEdge                  3525197     0.000     7.550     0.000     0.000
   @R_RenderFace                891605     0.000     4.741     0.000     0.000
   @R_EmitEdge                 1525151     0.000     3.433     0.000     0.000
   @D_DrawSpansXP4              252339     0.000     3.401     0.000     0.000

As you can see, the most costly function is DotProduct. This is not because it isn't optimized (the library version is as optimized as possible), but because it is called almost everywhere in the code. When it is called through a library (that is, not inlined), the compiler cannot optimize it together with the surrounding FPU computations. Such a function contains too few FPU ops to offer a major speed boost on its own, and serializing/deserializing the vectors to and from memory just to call the library is a complete waste of time. Such a library function should really always be inlined and optimized globally along with the other instructions of the calling C function. The same goes for primitive-like operations (CrossProduct, etc.) that are often used to make decisions in the code (i.e. their result is combined with other computations and used in an if() statement).

Stefan "Bebbo" Franke

Posts 139
18 Jul 2019 19:51

Grom 68k wrote:

EDIT2: The same with const and -O3: EXTERNAL LINK. Can gcc remove the useless a5?


          move.l (a1)+,(-8,a5)
          move.l (a1)+,(-4,a5)
          move.l (a1)+,(-16,a5)
          move.l (a1)+,(-12,a5)
          move.l (a1)+,(-24,a5)
          move.l (a1)+,(-20,a5)
          move.l (a1)+,(-32,a5)
          move.l (a1)+,(-28,a5)
          move.l (a1)+,(-40,a5)
          move.l (a1)+,(-36,a5)
          ...

there might be an option - but I'll have a look.

Samuel Devulder

Posts 248
18 Jul 2019 22:06

Grom 68k wrote:

Can gcc remove the useless a5?

If you want to remove the frame-pointer, just add -fomit-frame-pointer: EXTERNAL LINK

Now, if your question is "why on earth does gcc make a local copy of const double transformMatrix[4][4]", then I have no clue. It is very odd. I see no obvious reason for this local copy (even playing with the restrict keyword doesn't help EXTERNAL LINK ).

[EDIT] I think I kind of "understand" what's going on. Replace the number of loops (900) by a smaller value (say 3). Then you'll see gcc preload transformMatrix into fpu regs. If you increase the number of loops a little bit, you'll see gcc use more and more fpu regs, up to a point (say 5) where there aren't enough fpu regs for the preload, and then it seems that gcc uses the local stack as extra "free" regs. This is very, very odd. Of course using memory as a source is as fast as using an fpu reg, but then why use a local stack-based copy? Copying adds many cycles and kills the speed.

Thellier Alain

Posts 141
19 Jul 2019 09:04

@Samuel

I never talked about a library. I just meant some functions (in a .h) with parameters in registers that you can use inlined in a C source.


Samuel Devulder

Posts 248
19 Jul 2019 12:49

Library or inline asm are the same, optimization-wise. Both appear as an "atomic" function call, and the compiler cannot optimize them as much as fully inlined C code. So it is better to use "static inline" with plain C code in include.h, to give the compiler a better chance to schedule the instructions over the whole function.
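
As a rough illustration of the "static inline in a header" approach, a header could contain something like the sketch below. The names dot3 and in_front_of_plane are hypothetical and not taken from Quake or from any existing library; the second function only shows the typical pattern where the dot product feeds straight into a decision.

```c
#include <assert.h>

/* Hypothetical header-style helper: plain C, visible to the compiler at
 * every call site, so it can be inlined and scheduled together with the
 * caller's other FPU operations. */
static inline float dot3(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Hypothetical caller: the dot product result is combined with another
 * value and used directly in a comparison. Returns non-zero when the
 * point lies on the positive side of the plane (normal, dist). */
static inline int in_front_of_plane(const float normal[3], float dist,
                                    const float point[3])
{
    assert(normal && point);
    return dot3(normal, point) - dist > 0.0f;
}
```

Because both functions are plain C in the same translation unit, the compiler is free to keep the operands in registers across the call boundary instead of storing them to memory first.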

Stefan "Bebbo" Franke

Posts 139
21 Jul 2019 07:49

Samuel Devulder wrote:

There are strange things occurring with this matrix*vector routine. EXTERNAL LINK There are lots of "fadd #0,freg", as pointed out by Grom68k,

...

I undid my changes to remove these zeros. You have to use -ffast-math:

Why?

X + 0 and X - 0 both give X when X is NaN, infinite, or nonzero and finite. The problematic cases are when X is zero, and its mode has signed zeros. In the case of rounding towards -infinity, X - 0 is not the same as X because 0 - 0 is -0. In other rounding modes, X + 0 is not the same as X because -0 + 0 is 0.

Thus you can't omit the fadd #0,fpx unless the user forces it via -ffast-math.
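
The signed-zero case can be checked with a few lines of C. This is a small sketch assuming IEEE 754 doubles and the default round-to-nearest mode, compiled without -ffast-math; the helper names are made up, and volatile is only there to stop the compiler from folding the additions at compile time.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helpers: the zero is read through a volatile so the
 * addition or subtraction really happens at run time. */
static double add_zero(double x) { volatile double z = 0.0; return x + z; }
static double sub_zero(double x) { volatile double z = 0.0; return x - z; }

/* Assumes the default rounding mode (round to nearest). */
static void check_signed_zero_cases(void)
{
    assert(signbit(-0.0) != 0);              /* -0 carries a sign bit     */
    assert(signbit(add_zero(-0.0)) == 0);    /* -0 + 0 gives +0: X changed */
    assert(signbit(sub_zero(-0.0)) != 0);    /* -0 - 0 stays -0           */
    assert(add_zero(1.5) == 1.5);            /* nonzero X is unaffected   */
}
```

So in the default rounding mode it is the X + 0 form that cannot be dropped, exactly because of the X = -0 input.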

Samuel Devulder

Posts 248
21 Jul 2019 10:28

So it is signed zeros that are causing troubles. Damn non-mathematical concept ;)

Anyway, using your latest version with "-ffast-math" gives a great result concerning wait cycles. I now count only 2 of them remaining at the very end of the calculation (that's really not much) EXTERNAL LINK

_multiplyMatrix:
            subq.l #8,sp
            fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
            fdmove.d (a0)+,fp6
            fdmove.x fp6,fp1
            fdmul.d (a1)+,fp6
            fdmove.d (a0)+,fp7
            fdmove.x fp7,fp0
            fdmove.x fp0,fp3
            fdmove.x fp1,fp2
            fdmul.d (a1)+,fp7
            fdmove.d (24,a1),fp5
            fdmove.d (16,a1),fp4
            fdmul.x fp0,fp5
            fdmul.d (88,a1),fp0
            fdmul.x fp1,fp4
            fdmul.d (56,a1),fp3
            fdmul.d (48,a1),fp2
            fdmul.d (80,a1),fp1
            fdadd.x fp6,fp7
            fdmove.d (a1)+,fp6
            fmove.d fp0,(72,sp)
            fdadd.x fp4,fp5
            fdmove.d (a0)+,fp0
            fdmove.d (24,a1),fp4
            fdadd.x fp3,fp2
            fdmul.x fp0,fp6
            fdmul.x fp0,fp4
            fdmove.x fp0,fp3
            fdmul.d (88,a1),fp0
            fdmul.d (56,a1),fp3
            fdadd.d (72,sp),fp1
            fdadd.x fp7,fp6
            fdmove.d (a1)+,fp7
            fdadd.x fp5,fp4
            fmove.d fp0,(72,sp)
            fdmove.d (a0),fp0
            fdmul.x fp0,fp7
            fdmove.d (24,a1),fp5
            fdadd.x fp3,fp2
            fdmul.x fp0,fp5
            fdmove.x fp0,fp3
            fdmul.d (56,a1),fp3
            fdadd.d (72,sp),fp1
            fdmul.d (88,a1),fp0
            fdadd.x fp6,fp7
            fdadd.x fp5,fp4
            move.l d0,a0
            fdadd.x fp3,fp2
            fdadd.x fp0,fp1
    ; 1 wait-cycle (fp7)
            fmove.d fp7,(a0)+
            fmove.d fp4,(a0)+
    ; 1 wait-cycle (fp2)
            fmove.d fp2,(a0)+
            fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
            fmove.d fp1,(a0)
            addq.l #8,sp
            rts

/me happy with the result :)

Grom 68k

Posts 61
21 Jul 2019 11:08

Samuel Devulder wrote:

_multiplyMatrix:
             subq.l #8,sp
             fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
             fdmove.d (a0)+,fp6
             fdmove.x fp6,fp1
             fdmul.d (a1)+,fp6
             fdmove.d (a0)+,fp7
             fdmove.x fp7,fp0
             fdmove.x fp0,fp3
             fdmove.x fp1,fp2
             fdmul.d (a1)+,fp7
             fdmove.d (24,a1),fp5
             fdmove.d (16,a1),fp4
             fdmul.x fp0,fp5
             fdmul.d (88,a1),fp0
             fdmul.x fp1,fp4
             fdmul.d (56,a1),fp3
             fdmul.d (48,a1),fp2
             fdmul.d (80,a1),fp1
             fdadd.x fp6,fp7
             fdmove.d (a1)+,fp6
             fmove.d fp0,(72,sp)
             fdadd.x fp4,fp5
             fdmove.d (a0)+,fp0
             fdmove.d (24,a1),fp4
             fdadd.x fp3,fp2
             fdmul.x fp0,fp6
             fdmul.x fp0,fp4
             fdmove.x fp0,fp3
             fdmul.d (88,a1),fp0
             fdmul.d (56,a1),fp3
             fdadd.d (72,sp),fp1
             fdadd.x fp7,fp6
             fdmove.d (a1)+,fp7
             fdadd.x fp5,fp4
             fmove.d fp0,(72,sp)
             fdmove.d (a0),fp0
             fdmul.x fp0,fp7
             fdmove.d (24,a1),fp5
             fdadd.x fp3,fp2
             fdmul.x fp0,fp5
             fdmove.x fp0,fp3
             fdmul.d (56,a1),fp3
             fdadd.d (72,sp),fp1
             fdmul.d (88,a1),fp0
             fdadd.x fp6,fp7
             fdadd.x fp5,fp4
             move.l d0,a0
             fdadd.x fp3,fp2
             fdadd.x fp0,fp1
     ; 1 wait-cycle (fp7)
             fmove.d fp7,(a0)+
             fmove.d fp4,(a0)+
     ; 1 wait-cycle (fp2)
             fmove.d fp2,(a0)+
             fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
             fmove.d fp1,(a0)
             addq.l #8,sp
             rts

/me happy with the result :)

Hi,

Is it possible to reserve fp0 and fp1 for the last 2 fmove ?

Example:


             fmove.d fp4,(a0)+
             fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
             fmove.d fp1,(a0)+
             fmove.d fp0,(a0)

Grom 68k

Posts 61
21 Jul 2019 21:24

Philippe Flype wrote:

Since the 080 has a precise cycle counter,
I can output the real results for each of them.
These are register-to-register operations,
with the exception of FMOVE R/W and FMOVEM R/W.


     +------------+--------+-----+
     | FPU instr  | Single | OoO |
     +------------+--------+-----+
     | FABS       |      1 |   1 |
     | FADD       |      6 |   1 |
     | FCMP       |      6 |   1 |
     | FDABS      |      1 |   1 |
     | FDADD      |      6 |   1 |
     | FDDIV      |      9 |   2 |
     | FDIV       |      9 |   2 |
     | FDMOVE     |      1 |   1 |
     | FDMUL      |      6 |   1 |
     | FDNEG      |      1 |   1 |
     | FDSQRT     |     21 |  12 |
     | FDSUB      |      6 |   1 |
     | FINTRZ     |      2 |   1 |
     | FMOVERm    |      1 |   1 |
     | FMOVEWm    |      1 |   1 |
     | FMOVERi    |      1 |   1 |
     | FMOVEWi    |      1 |   1 |
     | FMOVECR    |      1 |   1 |
     | FMOVECTRL  |      4 |   4 |
     | FMOVEMR    |      8 |   8 |
     | FMOVEMW    |     25 |  25 |
     | FMUL       |      6 |   1 |
     | FNEG       |      1 |   1 |
     | FSABS      |      1 |   1 |
     | FSADD      |      6 |   1 |
     | FSDIV      |      9 |   2 |
     | FSGLDIV    |      9 |   2 |
     | FSGLMUL    |      6 |   1 |
     | FSMOVE     |      1 |   1 |
     | FSMUL      |      6 |   1 |
     | FSNEG      |      1 |   1 |
     | FSQRT      |     21 |  12 |
     | FSSQRT     |     21 |  12 |
     | FSSUB      |      6 |   1 |
     | FSUB       |      6 |   1 |
     | FTST       |      1 |   1 |
     | FSEQ       |      1 |   1 |
     | FSCC       |      1 |   1 |
     | FNOP       |      1 |   1 |
     +------------+--------+-----+
     | FPSP instr | Single | OoO |
     +------------+--------+-----+
     | FACOS      |    121 | 121 |
     | FASIN      |    121 | 121 |
     | FATAN      |    198 | 198 |
     | FATANH     |    153 | 153 |
     | FCOS       |    209 | 209 |
     | FCOSH      |    264 | 264 |
     | FETOX      |    220 | 220 |
     | FETOXM1    |    231 | 231 |
     | FGETEXP    |     88 |  88 |
     | FGETMAN    |     88 |  88 |
     | FINT       |     99 |  99 |
     | FLOG10     |    231 | 231 |
     | FLOG2      |    242 | 242 |
     | FLOGN      |    220 | 220 |
     | FLOGN1P    |    220 | 220 |
     | FMOD       |    121 | 121 |
     | FREM       |    121 | 121 |
     | FSCALE     |     99 |  99 |
     | FSIN       |    238 | 238 |
     | FSINCOS    |    264 | 264 |
     | FSINH      |    286 | 286 |
     | FTAN       |    198 | 198 |
     | FTANH      |    275 | 275 |
     | FTENTOX    |    231 | 231 |
     | FTWOTOX    |    231 | 231 |
     +------------+--------+-----+

Source code provided :

EXTERNAL LINK

In the gcc commits, I found fdiv with a latency of 10 instead of 9. I also understand that fdiv is not fully pipelined.
In which cycles is the pipeline not usable?


;; all insns with latency 10
(define_insn_reservation "m68080_fpu_10" 10
   (and (eq_attr "cpu" "m68080")
     (eq_attr "type" "fdiv"))
   "f0_pipeline, f1_pipeline, f2_pipeline, f3_pipeline, f4_pipeline, f5_pipeline, f6_pipeline, f7_pipeline, f8_pipeline, f9_pipeline")

Grom 68k

Posts 61
23 Jul 2019 12:36

Philippe Flype wrote:

Since the 080 has a precise cycle counter,
I can output the real results for each of them.
These are register-to-register operations,
with the exception of FMOVE R/W and FMOVEM R/W.


         +------------+--------+-----+
         | FPU instr  | Single | OoO |
         +------------+--------+-----+
         | FDIV       |      9 |   2 |
         | FMUL       |      6 |   1 |
         | FSQRT      |     21 |  12 |
         +------------+--------+-----+

Source code provided :

EXTERNAL LINK

Hi,

To help scheduling, could the fpu pipelines be defined like this?

;; all insns with latency 6
(define_insn_reservation "m68080_fpu_6" 6
(and (eq_attr "cpu" "m68080")
(eq_attr "type" "fmul,falu,fcmp,ftst"))
"f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_end")

;; all insns with latency 10
(define_insn_reservation "m68080_fpu_10" 10
(and (eq_attr "cpu" "m68080")
(eq_attr "type" "fdiv"))
"f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3,
f0_pipeline_instr4, f0_pipeline_instr5, f0_pipeline_instr6, f0_pipeline_end")

;; all insns with latency 21
(define_insn_reservation "m68080_fpu_21" 21
(and (eq_attr "cpu" "m68080")
(eq_attr "type" "fsqrt"))
"f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_end")

Regards

Stefan "Bebbo" Franke

Posts 139
23 Jul 2019 16:08

Grom 68k wrote:

Philippe Flype wrote:

Since the 080 has a precise cycle counter,
I can output the real results for each of them.
These are register-to-register operations,
with the exception of FMOVE R/W and FMOVEM R/W.


          +------------+--------+-----+
          | FPU instr  | Single | OoO |
          +------------+--------+-----+
          | FDIV       |      9 |   2 |
          | FMUL       |      6 |   1 |
          | FSQRT      |     21 |  12 |
          +------------+--------+-----+

Source code provided :

EXTERNAL LINK

Hi,

To help scheduling, could the fpu pipelines be defined like this?


     ;; all insns with latency 6
     (define_insn_reservation "m68080_fpu_6" 6
       (and (eq_attr "cpu" "m68080")
         (eq_attr "type" "fmul,falu,fcmp,ftst"))
     "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_end")
  
  
     ;; all insns with latency 10
     (define_insn_reservation "m68080_fpu_10" 10
       (and (eq_attr "cpu" "m68080")
         (eq_attr "type" "fdiv"))
     "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3,
     f0_pipeline_instr4, f0_pipeline_instr5, f0_pipeline_instr6, f0_pipeline_end")
  
  
     ;; all insns with latency 21
     (define_insn_reservation "m68080_fpu_21" 21
       (and (eq_attr "cpu" "m68080")
         (eq_attr "type" "fsqrt"))
     "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
     f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
     f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
     f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_end")

Regards

the latency for fdiv can be changed to 9 - np.

Your version will result in some stalls, since each state can only be used by one insn at a time:

e.g. if fsqrt switches from f0_pipeline_instr3 to f0_pipeline_instr1 and a fmul wants to switch from f0_pipeline_start2 to f0_pipeline_instr1, one has to wait.
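
One way to read the two columns of the quoted table, sketched in C. This rests on an assumption that is not confirmed in the thread: that "Single" is the latency of one instruction and "OoO" is the interval at which independent instructions of that kind can be issued. The function names are made up, and the model ignores every other source of stalls.

```c
#include <assert.h>

/* Assumed model, not a statement about the real core:
 * each op in a dependent chain waits for the previous result. */
static unsigned dependent_chain_cycles(unsigned n, unsigned latency)
{
    return n * latency;
}

/* Assumed model: n independent ops; the first result arrives after
 * `latency` cycles, each further op can start `interval` cycles after
 * the one before it. */
static unsigned independent_cycles(unsigned n, unsigned latency,
                                   unsigned interval)
{
    assert(n >= 1);
    return latency + (n - 1) * interval;
}
```

Under this reading, four independent FDIVs (9 / 2) would take 9 + 3*2 = 15 cycles instead of 36 for a dependent chain, and four independent FMULs (6 / 1) would take 9. An issue interval of 2 for FDIV is also what "not fully pipelined" would look like in such a model.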

Grom 68k

Posts 61
24 Jul 2019 15:18

Stefan "Bebbo" Franke wrote:

the latency for fdiv can be changed to 9 - np.

Your version will result in some stalls, since each state can only be used by one insn at a time:

e.g. if fsqrt switches from f0_pipeline_instr3 to f0_pipeline_instr1 and a fmul wants to switch from f0_pipeline_start2 to f0_pipeline_instr1, one has to wait.

Hi,

As fdiv and fsqrt do not seem to be fully pipelined, they probably sometimes use the same pipe as fadd or fmul. It will be easy to fix, but we need an hour of the FPU core developer's time to describe the FPU pipelines.

If it is useful, do you know how to add the FPSP insns (complex FPU functions such as sin, cos...)? Does FPSP lock the entire FPU? How would this lock be written for the other insns (fadd, fmul...)? If you can write the first one, I can do the others after my holidays.

Next, I tried integer instructions: EXTERNAL LINK. Mul should be modified for -m68080.

Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

- what is the latency of each insn?

always 1

More expensive are

MUL=2
DIV=32
MOVEM=1 per Reg
MOVE16=4
CMPM=2
JMP/JSR with calculated EA =4, e.g. "JSR -40(A6)"
JMP/JSR absolute or PC-relative =1

Is it mandatory to subtract 1 and then add 1 when converting short to int? EXTERNAL LINK

I am impressed that gcc uses dbeq, but it leaves the jne. EXTERNAL LINK

Regards

Samuel Devulder

Posts 248
24 Jul 2019 16:14

Grom 68k wrote:

As fdiv and fsqrt do not seem to be fully pipelined,

Is this true? As far as I can test, simple fpu ops can run concurrently with fdiv.

Concerning complex fpu functions like fsin/fcos/ftan/fexp etc., you can ignore the pipeline. They are more or less emulated and take plenty of operations (see the fpsp lib), plus the interrupt mechanism, which is quite fast but probably flushes the pipeline. Better to consider fsin/fcos and friends as not pipelined at all.

Concerning integer multiplication, it is even worse if you multiply by 16 instead of 11. It produces 4 additions in a row, accounting for 4 cycles, whereas a single LSL #4 takes only one cycle! Am I wrong in estimating the cycles for LSL?
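
The two code shapes being compared can be written out in C as below. This only shows that the results are identical for 32-bit unsigned values; it says nothing about which one gcc picks or how many cycles each takes on the 68080. The function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* x * 16 as four doublings, the shape reported in the generated code. */
static uint32_t mul16_by_adds(uint32_t x)
{
    x += x;   /* x * 2  */
    x += x;   /* x * 4  */
    x += x;   /* x * 8  */
    x += x;   /* x * 16 */
    return x;
}

/* x * 16 as a single shift, the shape an LSL #4 would give. */
static uint32_t mul16_by_shift(uint32_t x)
{
    return x << 4;
}

/* Both forms wrap modulo 2^32 in the same way. */
static void check_mul16(uint32_t x)
{
    assert(mul16_by_adds(x) == mul16_by_shift(x));
}
```

Since the values agree for every input, any preference for the add chain would have to come from the cost model or from flag handling, not from the arithmetic.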

Grom 68k

Posts 61
24 Jul 2019 16:52

Samuel Devulder wrote:

Grom 68k wrote:

As fdiv and fsqrt do not seem to be fully pipelined,

Where does that come from? As far as I can test, simple fpu ops can run concurrently with fdiv.

Concerning complex fpu functions like fsin/cos/tan etc., you can ignore the pipeline. They are more or less emulated and take plenty of operations (see the fpsp lib), plus the interrupt mechanism, which is quite fast but probably flushes the pipeline. Better to consider fsin/fcos and friends as not pipelined at all.

Concerning integer operations, if you multiply by 16 instead of 11, I'm surprised by the produced ASM. It is a series of 4 additions in a row, accounting for 4 cycles, whereas an LSL #4 takes only one cycle! Am I wrong in estimating the cycles for LSL?

At the very least, the fdiv definition must be modified. It is defined as fully pipelined in gcc, which is not the case, as the FPU cycle counter shows.

Philippe Flype wrote:

Since the 080 has a precise cycle counter,
I can output the real results for each of them.
These are register-to-register operations,
with the exception of FMOVE R/W and FMOVEM R/W.


                  +------------+--------+-----+
                  | FPU instr  | Single | OoO |
                  +------------+--------+-----+
                  | FDIV       |      9 |   2 |
                  | FMUL       |      6 |   1 |
                  | FSQRT      |     21 |  12 |
                  +------------+--------+-----+

Source code provided :

EXTERNAL LINK

For the FPSP, this was my question: how can the pipeline be blocked for the other fpu instructions?

For integer, I think as you do, but there is probably a reason, such as flag differences or something else.
EDIT: (zero, negative, overflow...) I tried with unsigned int, and it's the same.
EDIT2: Worse, <<4 is replaced by 4 adds :( EXTERNAL LINK EDIT3: There is no reason EXTERNAL LINK

Also, did you try the 1/sqrt(x) function from Quake with gcc? Would 3-operand instructions make it faster? Could creating an FPSP instruction help?

Samuel Devulder

Posts 248
24 Jul 2019 17:24

I have an asm implementation of 1/sqrt(x) which I use in quake. It has a lot of wait states. It would be better to let the compiler inline and schedule the corresponding C code among the other fpu calculations in the caller.

   * float Q_rsqrt( float number )
   * {
   * long i;
   * float x2, y;
   * const float threehalfs = 1.5F;
   *
   * x2 = number * 0.5F;
   * y = number;
   * i = * ( long * ) &y; // evil floating point bit level hacking
   * i = 0x5f3759df - ( i >> 1 ); // what the fuck?
   * y = * ( float * ) &i;
   * y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
   *// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
   *
   * return y;
   *}
   
   * fp0/d0 = 1/fsqrt(fp0) (fp1,d1 preserved)
    xdef _Q_rsqrt
    xdef @Q_rsqrt
    cnop 0,4
   _Q_rsqrt
    ifnd REGPARM
    fmove.s 4(sp),fp0
    endc
   @Q_rsqrt
    fmove.s fp0,d0
    fmul.s #-0.5,fp0
    lsr.l #1,d0
    neg.l d0
    add.l #$5f3759df,d0
    fmul.s d0,fp0
    fmul.s  d0,fp0
    fadd.s  #1.5,fp0
    fmul.s  d0,fp0
    ifd __GNUC__
    fmove.s fp0,d0
    endc
    rts
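
For anyone who wants to hand the routine to the compiler instead, here is a C sketch that follows the same steps as the asm above (multiply by -0.5 first, then add 1.5). It is a variant written for illustration, not the code used in the Quake port: it uses memcpy and uint32_t instead of the pointer casts in the commented C version, because those casts are undefined behaviour in standard C and long is not 32 bits everywhere. The name q_rsqrt is made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "32-bit float assumed");

/* Approximate 1/sqrt(x) with one Newton step.
 * Assumption: x is a positive, normal IEEE 754 single. */
static inline float q_rsqrt(float x)
{
    assert(x > 0.0f);

    uint32_t bits;
    float y;
    float neg_half = -0.5f * x;           /* fmul.s #-0.5,fp0            */

    memcpy(&bits, &x, sizeof bits);       /* fmove.s fp0,d0              */
    bits = 0x5f3759dfu - (bits >> 1);     /* lsr.l / neg.l / add.l       */
    memcpy(&y, &bits, sizeof y);          /* initial guess y             */

    /* y * (1.5 - 0.5*x*y*y), written the way the asm computes it */
    return y * (neg_half * y * y + 1.5f);
}
```

As a static inline in a header, the compiler can interleave these few FPU operations with the caller's other work. With a single Newton step the result is only approximate; for x = 4 it comes out slightly below 0.5.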
