
Information about the Apollo CPU and FPU.

GCC Improvement for 68080

Stefan "Bebbo" Franke

Posts 136
17 Jul 2019 13:54


Grom 68k wrote:

Stefan "Bebbo" Franke wrote:

 
Grom 68k wrote:

  ...
    Can gcc tell the difference between  subq.l #4,d0  and  subq.l #1,d0 ?
    If so,  subq.l #4,d0  could be moved upward.
  ...
 

 
  since the sub is fused to the jne, there is no gain.
 

 
  Unlike other immediates, only #1 is specified for subq/bne fusing.
 
 
Gunnar von Boehn wrote:

  MOVEq  #,Dn
  AND.L  Dx,Dn   
 
 
  SUBQ.L #1,Dn
  BNE.s  LOOP
 

 


_Scale1: | unrolled 4 times
        fdmove.x fp1,fp0
        fmovem fp2/fp3/fp4,-(sp)
        moveq #20,d0
.L2:
        fdmove.d (a1)+,fp4
        fdmul.x fp0,fp4
        fdmove.d (a1)+,fp3
        fdmul.x fp0,fp3
        fdmove.d (a1)+,fp2
        fdmul.x fp0,fp2
        fdmove.d (a1)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp4,(a0)+
        fmove.d fp3,(a0)+
        subq.l #4,d0
        fmove.d fp2,(a0)+
        fmove.d fp1,(a0)+
        tst.l d0
        jne .L2
        fmove.d fp0,-(sp)
        move.l (sp)+,d0
        move.l (sp)+,d1
        fmovem (sp)+,fp4/fp3/fp2
        rts




Grom 68k

Posts 61
17 Jul 2019 14:09


 
  The tst.l is useless: fmove doesn't modify the zero flag.
  You can also remove the unused Scalar0 (i.e. remove the first fdmove.x fp1,fp0).
 
  How does gcc choose the unroll count (4, 5, 6)?
  If you can reduce the counter step to 1, the subq can be fused with the branch, and it would then be better to unroll 5 or 6 times.
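
To illustrate the "step of 1" idea at the C level, here is a minimal sketch. The function name scale_unrolled4 and its signature are hypothetical (the C source of _Scale1 is not shown in this thread), and whether gcc really emits a fused subq.l #1 / bne pair for it depends on the compiler version and flags.

```c
#include <stddef.h>

/* Hypothetical example: scale n doubles by s, four per loop pass.
   n is assumed to be a multiple of 4. */
static void scale_unrolled4(double *dst, const double *src, double s, size_t n)
{
    size_t blocks = n / 4;      /* the counter moves in steps of 1, not 4 */

    while (blocks != 0) {
        dst[0] = src[0] * s;
        dst[1] = src[1] * s;
        dst[2] = src[2] * s;
        dst[3] = src[3] * s;
        dst += 4;
        src += 4;
        blocks--;               /* decrement by one, then branch if not zero */
    }
}
```

With 20 elements this gives 5 passes counted 5, 4, 3, 2, 1 instead of 20, 16, 12, 8, 4.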
 


Stefan "Bebbo" Franke

Posts 136
17 Jul 2019 14:29


Grom 68k wrote:

 
  The tst.l is useless: fmove doesn't modify the zero flag.
  You can also remove the unused Scalar0 (i.e. remove the first fdmove.x fp1,fp0).

I know - not done yet.
 
Grom 68k wrote:

  How does gcc choose the unroll count (4, 5, 6)?
  If you can reduce the counter step to 1, the subq can be fused with the branch, and it would then be better to unroll 5 or 6 times.

I limited it by some formula, otherwise gcc tends to use stack variables for unrolling.



Grom 68k

Posts 61
18 Jul 2019 08:08


I simplified my example to transform a vector3 by a matrix4x4: EXTERNAL LINK .
EDIT: This one is better suited to unrolling: EXTERNAL LINK
  | Bx |   | Ux Vx Wx Tx | | Ax |
  | By | = | Uy Vy Wy Ty | | Ay |
  | Bz |   | Uz Vz Wz Tz | | Az |
  | 1. |   | 0. 0. 0. 1. | | 1. |
 
EDIT2: The same with const and -O3: EXTERNAL LINK  Can gcc remove the useless a5?


        move.l (a1)+,(-8,a5)
        move.l (a1)+,(-4,a5)
        move.l (a1)+,(-16,a5)
        move.l (a1)+,(-12,a5)
        move.l (a1)+,(-24,a5)
        move.l (a1)+,(-20,a5)
        move.l (a1)+,(-32,a5)
        move.l (a1)+,(-28,a5)
        move.l (a1)+,(-40,a5)
        move.l (a1)+,(-36,a5)
        ...

 
  And this is the rainflow: EXTERNAL LINK . It's used to calculate the fatigue damage of metallic parts.


Thellier Alain

Posts 117
18 Jul 2019 12:27


Hello

I know you are only tuning the compiler, but would it be possible to gather those optimized routines into a single ASM source?

I mean having 68080 optimized versions of
CrossProduct
DotProduct
MultyplyMatrices4x4
MultyplyMatrices3x3
TransformVec3Matrices4x4
TransformVec3Matrices3x3
Distance3
etc...
could be useful for lots of future 3D programs :-)

Thanks




Samuel Devulder

Posts 246
18 Jul 2019 12:59


That's what I did for the Monkey demo in CoffinOS. But beware of library versions: they tend to work against the full potential of optimizing compilers. For instance, here is an excerpt of a profiling session I did some time ago with a library version of DotProduct:
 
Test date:             Sun Oct 21 20:47:51 2018
 
  Execution profile for: sam/quake.gcc-3.2.2.030
  Time units:            Percentual
  Sort order:            Overall time
  Profiling mode:        Separate
  Used commandline:      -safe -usemode 0
  All symbols shown
 
  _DotProduct                7979523    0.000    13.965    0.000  434.592
  R_ClipEdge                  3525197    0.000    7.550    0.000    0.000
  @R_RenderFace                891605    0.000    4.741    0.000    0.000
  @R_EmitEdge                1525151    0.000    3.433    0.000    0.000
  @D_DrawSpansXP4              252339    0.000    3.401    0.000    0.000

As you can see, the #1 most costly function is DotProduct. This is not because it isn't optimized (the library version is as optimized as possible), but because it is used almost everywhere in the code. When it is used via a library (that is, not inlined), the compiler cannot really optimize it along with the other FPU computations. Such a function contains too few FPU ops for a standalone version to bring a major speed boost, and serializing/deserializing the vectors to and from memory just to call the library is a complete waste of time.

Such a library function should really always be inlined and optimized globally along with the other instructions of the calling C function. The same goes for primitive-like operations (CrossProduct, etc.) that are often used to make decisions in the code (i.e. their result is combined with other computations and used in an if() statement).
 


Stefan "Bebbo" Franke

Posts 136
18 Jul 2019 19:51


Grom 68k wrote:

   
  EDIT2: The same with const and -O3 EXTERNAL LINK  Can gcc remove useless a5 ?
 
 

          move.l (a1)+,(-8,a5)
          move.l (a1)+,(-4,a5)
          move.l (a1)+,(-16,a5)
          move.l (a1)+,(-12,a5)
          move.l (a1)+,(-24,a5)
          move.l (a1)+,(-20,a5)
          move.l (a1)+,(-32,a5)
          move.l (a1)+,(-28,a5)
          move.l (a1)+,(-40,a5)
          move.l (a1)+,(-36,a5)
          ...
 

there might be an option - but I'll have a look.



Samuel Devulder

Posts 246
18 Jul 2019 22:06


Grom 68k wrote:

      Can gcc remove useless a5 ?

If you want to remove the frame-pointer, just add -fomit-frame-pointer: EXTERNAL LINK   
     
Now, if your question is "why the hell does gcc make a local copy of const double transformMatrix[4][4]", then I have no clue. It is very odd. I see no obvious reason for this local copy (even playing with the restrict keyword doesn't help: EXTERNAL LINK ).
     
[EDIT] I think I kind of "understand" what's going on. Replace the number of loops (900) by a smaller value (say 3). Then you'll see gcc preload transformMatrix into FPU regs. If you increase the number of loops a little, you'll see gcc use more and more FPU regs, up to a point (say 5) where there aren't enough FPU regs for the preload, and then it seems that gcc uses the local stack as extra "free" regs. This is very, very odd. Of course using memory as a source is as fast as using an FPU reg, but then why use a local stack-based copy? The copying adds many cycles and kills the speed.


Thellier Alain

Posts 117
19 Jul 2019 09:04


@Samuel

I never talked about a library. I just meant some functions (in a .h) with parameters in registers that you can use inlined in a C source.




Samuel Devulder

Posts 246
19 Jul 2019 12:49


Library or inline asm are the same, optimization-wise. Both appear as an "atomic" function call, and the compiler cannot optimize them as well as fully inlined C code. So it is better to use "static inline" with plain C code in include.h, to give the compiler a better chance to schedule the instructions over the whole function.
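
A minimal sketch of the "static inline" approach described above. The header name, the vec3 type and the function names are all hypothetical and not taken from any existing library:

```c
/* vecmath.h (hypothetical header name) */
#ifndef VECMATH_H
#define VECMATH_H

/* Hypothetical 3-component vector type, for illustration only. */
typedef struct { float x, y, z; } vec3;

/* Plain C, static inline: once inlined, the three multiplies and two adds
   are visible to the compiler alongside the caller's other FPU work. */
static inline float vec3_dot(const vec3 *a, const vec3 *b)
{
    return a->x * b->x + a->y * b->y + a->z * b->z;
}

/* Typical use: the dot product feeds straight into a decision. */
static inline int vec3_is_in_front(const vec3 *normal, const vec3 *point, float dist)
{
    return vec3_dot(normal, point) - dist >= 0.0f;
}

#endif /* VECMATH_H */
```

Because both functions are plain C, the compiler is free to keep the operands in FPU registers and interleave the arithmetic with surrounding code, which it cannot do across an opaque call or an asm block.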


Stefan "Bebbo" Franke

Posts 136
21 Jul 2019 07:49


Samuel Devulder wrote:

There are strange things occurring with this matrix*vector routine: EXTERNAL LINK There are lots of "fadd #0,freg", as pointed out by Grom68k,

...

I undid my changes to remove these zeros. You have to use -ffast-math:

Why?

X + 0 and X - 0 both give X when X is NaN, infinite, or nonzero and finite.  The problematic cases are when X is zero, and its mode has signed zeros.  In the case of rounding towards -infinity, X - 0 is not the same as X because 0 - 0 is -0.  In other rounding modes, X + 0 is not the same as X because -0 + 0 is 0.

Thus you can't omit the fadd #0,fpx unless the user forces it via -ffast-math.
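
The -0 + 0 case can be checked with a few lines of C. This is only a sketch: the helper name add_zero_keeps_sign is made up, and it assumes IEEE 754 doubles, the default round-to-nearest mode, and a build without -ffast-math.

```c
#include <math.h>

/* Returns 1 if x + 0.0 has the same sign bit as x, else 0.
   volatile keeps the compiler from folding the addition at build time. */
static int add_zero_keeps_sign(double x)
{
    volatile double v = x;
    volatile double sum = v + 0.0;

    return (signbit(sum) != 0) == (signbit(x) != 0);
}
```

Under those assumptions add_zero_keeps_sign(-0.0) returns 0, because -0 + 0 gives +0, while it returns 1 for +0.0 and for nonzero values. Note that -0.0 == 0.0 compares equal, so only the sign bit reveals the difference.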




Samuel Devulder

Posts 246
21 Jul 2019 10:28


So it is signed zeros that are causing trouble. Damn non-mathematical concept ;)
     
Anyway, using your latest version with "-ffast-math" gives a great result in terms of wait cycles. I now count only 2 of them remaining, at the very end of the calculation (that's really not much): EXTERNAL LINK
_multiplyMatrix:
            subq.l #8,sp
            fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
            fdmove.d (a0)+,fp6
            fdmove.x fp6,fp1
            fdmul.d (a1)+,fp6
            fdmove.d (a0)+,fp7
            fdmove.x fp7,fp0
            fdmove.x fp0,fp3
            fdmove.x fp1,fp2
            fdmul.d (a1)+,fp7
            fdmove.d (24,a1),fp5
            fdmove.d (16,a1),fp4
            fdmul.x fp0,fp5
            fdmul.d (88,a1),fp0
            fdmul.x fp1,fp4
            fdmul.d (56,a1),fp3
            fdmul.d (48,a1),fp2
            fdmul.d (80,a1),fp1
            fdadd.x fp6,fp7
            fdmove.d (a1)+,fp6
            fmove.d fp0,(72,sp)
            fdadd.x fp4,fp5
            fdmove.d (a0)+,fp0
            fdmove.d (24,a1),fp4
            fdadd.x fp3,fp2
            fdmul.x fp0,fp6
            fdmul.x fp0,fp4
            fdmove.x fp0,fp3
            fdmul.d (88,a1),fp0
            fdmul.d (56,a1),fp3
            fdadd.d (72,sp),fp1
            fdadd.x fp7,fp6
            fdmove.d (a1)+,fp7
            fdadd.x fp5,fp4
            fmove.d fp0,(72,sp)
            fdmove.d (a0),fp0
            fdmul.x fp0,fp7
            fdmove.d (24,a1),fp5
            fdadd.x fp3,fp2
            fdmul.x fp0,fp5
            fdmove.x fp0,fp3
            fdmul.d (56,a1),fp3
            fdadd.d (72,sp),fp1
            fdmul.d (88,a1),fp0
            fdadd.x fp6,fp7
            fdadd.x fp5,fp4
            move.l d0,a0
            fdadd.x fp3,fp2
            fdadd.x fp0,fp1
    ; 1 wait-cycle (fp7)
            fmove.d fp7,(a0)+
            fmove.d fp4,(a0)+
    ; 1 wait-cycle (fp2)
            fmove.d fp2,(a0)+
            fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
            fmove.d fp1,(a0)
            addq.l #8,sp
            rts
/me happy with the result :)
 


Grom 68k

Posts 61
21 Jul 2019 11:08


Samuel Devulder wrote:

So it is signed zeros that are causing troubles. Damn non-mathematical concept ;)
     
  Anyway, using your latest version and "-ffast-math" gives a great result concerning wait-cycle. I now only count 2 of them remaining in the very end of the calculation (that's really not very much) EXTERNAL LINK

Hi,

Would it be possible to reserve fp0 and fp1 for the last 2 fmoves?

Example:


            fmove.d fp4,(a0)+
            fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
            fmove.d fp1,(a0)+
            fmove.d fp0,(a0)



Grom 68k

Posts 61
21 Jul 2019 21:24


Philippe Flype wrote:

Since the 080 has a precise cycle counter,
  I can output the real results for each of them.
  These are REGS to REGS operations,
  with the exception of FMOVE R/W and FMOVEM R/W.
 
 

    +------------+--------+-----+
    | FPU instr  | Single | OoO |
    +------------+--------+-----+
    | FABS       |      1 |   1 |
    | FADD       |      6 |   1 |
    | FCMP       |      6 |   1 |
    | FDABS      |      1 |   1 |
    | FDADD      |      6 |   1 |
    | FDDIV      |      9 |   2 |
    | FDIV       |      9 |   2 |
    | FDMOVE     |      1 |   1 |
    | FDMUL      |      6 |   1 |
    | FDNEG      |      1 |   1 |
    | FDSQRT     |     21 |  12 |
    | FDSUB      |      6 |   1 |
    | FINTRZ     |      2 |   1 |
    | FMOVERm    |      1 |   1 |
    | FMOVEWm    |      1 |   1 |
    | FMOVERi    |      1 |   1 |
    | FMOVEWi    |      1 |   1 |
    | FMOVECR    |      1 |   1 |
    | FMOVECTRL  |      4 |   4 |
    | FMOVEMR    |      8 |   8 |
    | FMOVEMW    |     25 |  25 |
    | FMUL       |      6 |   1 |
    | FNEG       |      1 |   1 |
    | FSABS      |      1 |   1 |
    | FSADD      |      6 |   1 |
    | FSDIV      |      9 |   2 |
    | FSGLDIV    |      9 |   2 |
    | FSGLMUL    |      6 |   1 |
    | FSMOVE     |      1 |   1 |
    | FSMUL      |      6 |   1 |
    | FSNEG      |      1 |   1 |
    | FSQRT      |     21 |  12 |
    | FSSQRT     |     21 |  12 |
    | FSSUB      |      6 |   1 |
    | FSUB       |      6 |   1 |
    | FTST       |      1 |   1 |
    | FSEQ       |      1 |   1 |
    | FSCC       |      1 |   1 |
    | FNOP       |      1 |   1 |
    +------------+--------+-----+
    | FPSP instr | Single | OoO |
    +------------+--------+-----+
    | FACOS      |    121 | 121 |
    | FASIN      |    121 | 121 |
    | FATAN      |    198 | 198 |
    | FATANH     |    153 | 153 |
    | FCOS       |    209 | 209 |
    | FCOSH      |    264 | 264 |
    | FETOX      |    220 | 220 |
    | FETOXM1    |    231 | 231 |
    | FGETEXP    |     88 |  88 |
    | FGETMAN    |     88 |  88 |
    | FINT       |     99 |  99 |
    | FLOG10     |    231 | 231 |
    | FLOG2      |    242 | 242 |
    | FLOGN      |    220 | 220 |
    | FLOGN1P    |    220 | 220 |
    | FMOD       |    121 | 121 |
    | FREM       |    121 | 121 |
    | FSCALE     |     99 |  99 |
    | FSIN       |    238 | 238 |
    | FSINCOS    |    264 | 264 |
    | FSINH      |    286 | 286 |
    | FTAN       |    198 | 198 |
    | FTANH      |    275 | 275 |
    | FTENTOX    |    231 | 231 |
    | FTWOTOX    |    231 | 231 |
    +------------+--------+-----+
 

   
   
   
  Source code provided :
 
 
  EXTERNAL LINK 

In the gcc commits, I found fdiv defined with a latency of 10 instead of 9. I also understand that fdiv is not fully pipelined.
In which cycle is the pipeline not usable?


;; all insns with latency 10
(define_insn_reservation "m68080_fpu_10" 10
  (and (eq_attr "cpu" "m68080")
    (eq_attr "type" "fdiv"))
  "f0_pipeline, f1_pipeline, f2_pipeline, f3_pipeline, f4_pipeline, f5_pipeline, f6_pipeline, f7_pipeline, f8_pipeline, f9_pipeline")



Grom 68k

Posts 61
23 Jul 2019 12:36


Philippe Flype wrote:

    Since the 080 has a precise cycle counter,
    I can output the real results for each of them.
    These are REGS to REGS operations,
    with the exception of FMOVE R/W and FMOVEM R/W.
     
     

        +------------+--------+-----+
        | FPU instr  | Single | OoO |
        +------------+--------+-----+
        | FDIV       |      9 |   2 |
        | FMUL       |      6 |   1 |
        | FSQRT      |     21 |  12 |
        +------------+--------+-----+
     

   
      Source code provided :
   
      EXTERNAL LINK   

   
    Hi,
   
    To help scheduling, could the FPU pipelines be defined like this?
   
   

    ;; all insns with latency 6
    (define_insn_reservation "m68080_fpu_6" 6
      (and (eq_attr "cpu" "m68080")
        (eq_attr "type" "fmul,falu,fcmp,ftst"))
    "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_end")

    ;; all insns with latency 10
    (define_insn_reservation "m68080_fpu_10" 10
      (and (eq_attr "cpu" "m68080")
        (eq_attr "type" "fdiv"))
    "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3,
    f0_pipeline_instr4, f0_pipeline_instr5, f0_pipeline_instr6, f0_pipeline_end"
)

    ;; all insns with latency 21
    (define_insn_reservation "m68080_fpu_21" 21
      (and (eq_attr "cpu" "m68080")
        (eq_attr "type" "fsqrt"))
    "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
    f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
    f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1,
    f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_end"
)
   


   
    Regards


Stefan "Bebbo" Franke

Posts 136
23 Jul 2019 16:08


The latency for fdiv can be changed to 9, no problem.

Your version will result in some stalls, since a state can only be used by one insn at a time:

e.g. if fsqrt switches from f0_pipeline_instr3 to f0_pipeline_instr1 while an fmul wants to switch from f0_pipeline_start2 to f0_pipeline_instr1, one of them has to wait.



Grom 68k

Posts 61
24 Jul 2019 15:18


Stefan "Bebbo" Franke wrote:

  The latency for fdiv can be changed to 9, no problem.
 
  Your version will result in some stalls, since a state can only be used by one insn at a time:
 
  e.g. if fsqrt switches from f0_pipeline_instr3 to f0_pipeline_instr1 while an fmul wants to switch from f0_pipeline_start2 to f0_pipeline_instr1, one of them has to wait.

Hi,

As fdiv and fsqrt do not seem to be fully pipelined, they probably sometimes use the same pipe as fadd or fmul. It would be easy to fix, but we would need an hour of the FPU core developer's time to describe the FPU pipelines.

If it is useful, do you know how to add the FPSP insns (complex FPU operations such as sin, cos...)? Does an FPSP insn lock the entire FPU? How should this lock be expressed against the other insns (fadd, fmul...)? If you can write the first one, I can do the others after my holidays.

Next, I tried integer instructions: EXTERNAL LINK  Mul should be modified for -m68080.

Gunnar von Boehn wrote:

 
Stefan "Bebbo" Franke wrote:

    - what is the latency of each insn?
 

 
  always 1
 
 
  More expensive are
 
  MUL=2
  DIV=32
  MOVEM=1 per Reg
  MOVE16=4
  CMPM=2
  JMP/JSR with calculated EA =4  E.g. "JSR -40(A6)"
  JMP/JSR absolute or PC-relative =1

Is it mandatory to subtract and then add 1 when converting short to int? EXTERNAL LINK 

I am impressed that gcc uses dbeq, but it still leaves the jne. EXTERNAL LINK 

Regards


Samuel Devulder

Posts 246
24 Jul 2019 16:14


Grom 68k wrote:

      As fdiv and fsqrt seem to be not fully pipelined,

Is this true? As far as I can test, simple fpu ops can run concurrently with fdiv.
     
Concerning complex FPU functions like fsin/fcos/ftan/fexp etc., you can ignore the pipeline. They are more or less emulated and take plenty of operations (see the fpsp lib), plus the interrupt mechanism, which is quite fast nonetheless but probably flushes the pipeline. Better to consider fsin/fcos and friends as not pipelined at all.
   
Concerning integer multiplication, it is even worse if you mul by 16 instead of 11. It produces 4 additions in a row accounting for 4 cycles whereas a single LSL #4 is only one cycle! Am I wrong at estimating cycles for LSL?
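
For reference, the three forms being compared compute the same value for unsigned 32-bit integers. The sketch below only shows that equivalence in C; the function names are made up, and it says nothing about which instruction sequence a given gcc version actually emits.

```c
#include <stdint.h>

/* Multiply by 16 written as a multiplication. */
static inline uint32_t times16_mul(uint32_t x)
{
    return x * 16u;
}

/* The same value as a single left shift by 4 (LSL #4 on 68k). */
static inline uint32_t times16_shift(uint32_t x)
{
    return x << 4;
}

/* The same value as four doublings in a row, as in the reported ASM. */
static inline uint32_t times16_adds(uint32_t x)
{
    x += x;
    x += x;
    x += x;
    x += x;
    return x;
}
```

All three wrap modulo 2^32 in the same way, so the choice between them is purely a question of cycle cost.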


Grom 68k

Posts 61
24 Jul 2019 16:52


Samuel Devulder wrote:

       
Grom 68k wrote:

            As fdiv and fsqrt seem to be not fully pipelined,
       

        Where does that come from? As far as I can test, simple fpu ops can run concurrently with fdiv.
           
        Concerning complex fpu functions like fsin/cos/tan etc., you can ignore the pipeline. They are more or less emulated and take plenty of operations (see the fpsp lib), plus the interrupt mechanism, which is quite fast nonetheless but probably flushes the pipeline. Better to consider fsin/fcos and friends as not pipelined at all.
         
        Concerning integer operations, if you mul by 16 instead of 11, I'm surprised by the produced ASM. It is a series of 4 additions in a row accounting for 4 cycles, whereas an LSL #4 is only one cycle! Am I wrong in estimating the cycles for LSL?
       

       
        At the very least, the fdiv definition must be modified. It is defined as fully pipelined in gcc, which is not the case, as the FPU cycle counter shows.
       
       
       
Philippe Flype wrote:

          Since the 080 has a precise cycle counter,
          I can output the real results for each of them.
          These are REGS to REGS operations,
          with the exception of FMOVE R/W and FMOVEM R/W.
               
       

                  +------------+--------+-----+
                  | FPU instr  | Single | OoO |
                  +------------+--------+-----+
                  | FDIV       |      9 |   2 |
                  | FMUL       |      6 |   1 |
                  | FSQRT      |     21 |  12 |
                  +------------+--------+-----+
               

           
          Source code provided :
           
          EXTERNAL LINK       

       
       
        For the FPSP, that was my question: how can the pipeline be blocked for the other FPU instructions?
       
       
        For integer, I think as you do, but there is probably a reason, such as flag differences or something else.
      EDIT: (zero, negative, overflow...) I tried with unsigned int; it's the same.
      EDIT2: Worse, <<4 is replaced by 4 adds :( EXTERNAL LINK
      EDIT3: There is no reason: EXTERNAL LINK
       
        Otherwise, have you tried the 1/sqrt(x) function from Quake with gcc? Would 3-operand instructions make it faster? Could creating an FPSP instruction help?


Samuel Devulder

Posts 246
24 Jul 2019 17:24


I have an asm implementation of 1/sqrt(x) which I use in Quake. It has a lot of wait states. It would be better to let the compiler inline and schedule the corresponding C code among the other FPU calculations in the caller.
  * float Q_rsqrt( float number )
  * {
  * long i;
  * float x2, y;
  * const float threehalfs = 1.5F;
  *
  * x2 = number * 0.5F;
  * y = number;
  * i = * ( long * ) &y; // evil floating point bit level hacking
  * i = 0x5f3759df - ( i >> 1 ); // what the fuck?
  * y = * ( float * ) &i;
  * y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
  *// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
  *
  * return y;
  *}
 
  * fp0/d0 = 1/fsqrt(fp0) (fp1,d1 preserved)
    xdef _Q_rsqrt
    xdef @Q_rsqrt
    cnop 0,4
  _Q_rsqrt
    ifnd REGPARM
    fmove.s 4(sp),fp0
    endc
  @Q_rsqrt
    fmove.s fp0,d0
    fmul.s #-0.5,fp0
    lsr.l #1,d0
    neg.l d0
    add.l #$5f3759df,d0
    fmul.s d0,fp0
    fmul.s  d0,fp0
    fadd.s  #1.5,fp0
    fmul.s  d0,fp0
    ifd __GNUC__
    fmove.s fp0,d0
    endc
    rts
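
For comparison, here is a sketch of what an inlinable plain-C version of the same routine could look like. It is not the code used in the Quake port above: the name q_rsqrt_inline is made up, it uses uint32_t and memcpy instead of the original pointer casts through long (to avoid aliasing problems and to keep the integer at 32 bits), and it assumes a 32-bit IEEE 754 float and a positive, normal input.

```c
#include <stdint.h>
#include <string.h>

/* Fast approximate 1/sqrt(number) with a single Newton-Raphson step. */
static inline float q_rsqrt_inline(float number)
{
    const float threehalfs = 1.5f;
    float x2 = number * 0.5f;
    float y = number;
    uint32_t i;

    memcpy(&i, &y, sizeof i);             /* read the float's bit pattern */
    i = 0x5f3759dfu - (i >> 1);           /* magic-constant initial guess */
    memcpy(&y, &i, sizeof y);             /* back to float */
    y = y * (threehalfs - (x2 * y * y));  /* 1st iteration */

    return y;
}
```

As a static inline function in a header, the compiler gets the chance to interleave these few operations with the caller's other FPU work. The result is only an approximation, so callers should not expect exact values.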

