APOLLO CPU Knowledge Forum

GCC Improvement for 68080


Grom 68k

Posts 61
24 Jul 2019 17:29

There is no reason to not use << EXTERNAL LINK

Samuel Devulder

Posts 248
24 Jul 2019 17:33

[Oops, I somehow managed to post this message several times when I just wanted to edit it. Damn! Sorry about that, guys.]

I have an asm implementation of 1/sqrt(x) which I use in Quake. It has a lot of wait-states. It would be better to let the compiler inline the corresponding C code and schedule it among the other FPU calculations in the caller.


     * float Q_rsqrt( float number )
     * {
     * long i;
     * float x2, y;
     * const float threehalfs = 1.5F;
     *
     * x2 = number * 0.5F;
     * y = number;
     * i = * ( long * ) &y; // evil floating point bit level hacking
     * i = 0x5f3759df - ( i >> 1 ); // what the fuck?
     * y = * ( float * ) &i;
     * y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
     *// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
     *
     * return y;
     *}
     
     * fp0/d0 = 1/fsqrt(fp0) (fp1,d1 preserved)
      xdef _Q_rsqrt
      xdef @Q_rsqrt
      cnop 0,4
     _Q_rsqrt
      ifnd REGPARM
      fmove.s 4(sp),fp0
      endc
     @Q_rsqrt
      fmove.s fp0,d0
      fmul.s #-0.5,fp0
      lsr.l #1,d0
      neg.l d0
      add.l #$5f3759df,d0
  ; 2 wait-states
      fmul.s d0,fp0
  ; 5 wait-states
      fmul.s  d0,fp0
  ; 5 wait-states
      fadd.s  #1.5,fp0
  ; 5 wait-states
      fmul.s  d0,fp0
      ifd __GNUC__
  ; 5 wait-states
      fmove.s fp0,d0
      endc
      rts


Grom 68k

Posts 61
24 Jul 2019 17:43

The mul instruction does not appear with -m68040, -m68060 or -m68080. EXTERNAL LINK
It does appear with -m68020. EXTERNAL LINK

Nixus Minimax

Posts 416
25 Jul 2019 13:06

Is it possible to use an 882 flag together with the 080 flag and then produce FPU code that schedules well on the 080 but uses the complex instructions for short code? Or would you say this doesn't make sense and the 080's support for complex FPU instructions should merely be seen as an improved tool for remaining 882-compatible? After all, a compiler could always come up with equally fast or even faster explicit code, with the only downside being larger code size.


Samuel Devulder

Posts 248
25 Jul 2019 13:23

Even with -m68080, complex FPU instructions are used provided you add "-ffast-math": EXTERNAL LINK (without -ffast-math, gcc calls sin/cos from libm.a).

Gunnar von Boehn
(Apollo Team Member)
Posts 6229
25 Jul 2019 15:01

Grom 68k wrote:

The mul instruction does not appear with -m68040, -m68060 or -m68080. EXTERNAL LINK
It does appear with -m68020. EXTERNAL LINK

Wow this is funny.

Bebbo can you explain this?

Stefan "Bebbo" Franke

Posts 139
25 Jul 2019 16:39

Gunnar von Boehn wrote:

Grom 68k wrote:

The mul instruction does not appear with -m68040, -m68060 or -m68080. EXTERNAL LINK
It does appear with -m68020. EXTERNAL LINK

Wow this is funny.

Bebbo can you explain this?

it all depends on the modelled costs. And the costs for the 68040 do have a bug...

and there is no extra cost model for the 68060 or 68080.

Gunnar von Boehn
(Apollo Team Member)
Posts 6229
25 Jul 2019 17:08

Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

Grom 68k wrote:

The mul instruction does not appear with -m68040, -m68060 or -m68080. EXTERNAL LINK
It does appear with -m68020. EXTERNAL LINK

Wow this is funny.

Bebbo can you explain this?

it all depends on the modelled costs. And the costs for the 68040 do have a bug...

and there is no extra cost model for the 68060 or 68080.

OK, I see.
But this can lead to crazy bad code - as we see here.

We know the internals of all 68K CPU models very well.
We would be glad to help you improve those models.
Do you see a way we can do this together?

Stefan "Bebbo" Franke

Posts 139
25 Jul 2019 18:24

Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

Grom 68k wrote:

The mul instruction does not appear with -m68040, -m68060 or -m68080. EXTERNAL LINK
It does appear with -m68020. EXTERNAL LINK

Wow this is funny.

Bebbo can you explain this?

it all depends on the modelled costs. And the costs for the 68040 do have a bug...

and there is no extra cost model for the 68060 or 68080.

OK, I see.
But this can lead to crazy bad code - as we see here.

We know the internals of all 68K CPU models very well.
We would be glad to help you improve those models.
Do you see a way we can do this together?

While I'm tampering with it, there will always be effects like this.

Unfortunately, only a 68000/68020 emulator (vamos) is available to comfortably check the efficiency of the changes. For correctness there are still the torture tests, but they don't guarantee that everything is OK, and performance doesn't matter there.

All other tests currently require manual effort. And currently there are applications that don't run correctly...

The table is already helpful for a 68080 cost model, but the addressing modes are still missing, and it is unclear to me what the second column means.

Grom 68k

Posts 61
25 Jul 2019 21:46

Samuel Devulder wrote:


       * float Q_rsqrt( float number )
       * {
       * long i;
       * float x2, y;
       * const float threehalfs = 1.5F;
       *
       * x2 = number * 0.5F;
       * y = number;
       * i = * ( long * ) &y; // evil floating point bit level hacking
       * i = 0x5f3759df - ( i >> 1 ); // what the fuck?
       * y = * ( float * ) &i;
       * y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
       *// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
       *
       * return y;
       *}
       
       * fp0/d0 = 1/fsqrt(fp0) (fp1,d1 preserved)
        xdef _Q_rsqrt
        xdef @Q_rsqrt
        cnop 0,4
       _Q_rsqrt
        ifnd REGPARM
        fmove.s 4(sp),fp0
        endc
       @Q_rsqrt
        fmove.s fp0,d0
        fmul.s #-0.5,fp0
        lsr.l #1,d0
        neg.l d0
        add.l #$5f3759df,d0
    ; 2 wait-states
        fmul.s d0,fp0
    ; 5 wait-states
        fmul.s  d0,fp0
    ; 5 wait-states
        fadd.s  #1.5,fp0
    ; 5 wait-states
        fmul.s  d0,fp0
        ifd __GNUC__
    ; 5 wait-states
        fmove.s fp0,d0
        endc
        rts

Hi,

I think that the conversion of the integer in d0 in an FPU instruction adds 1 to the latency. I read this somewhere, but I am not sure whether it also applies to the 68080.
EDIT: Oops, this is not a conversion.

Otherwise, this function is cool: it doesn't use too many registers. It will be great with gcc.

Grom 68k

Posts 61
25 Jul 2019 22:21

Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

Grom 68k wrote:

The mul instruction does not appear with -m68040, -m68060 or -m68080. EXTERNAL LINK
It does appear with -m68020. EXTERNAL LINK

Wow this is funny.

Bebbo can you explain this?

it all depends on the modelled costs. And the costs for the 68040 do have a bug...

and there is no extra cost model for the 68060 or 68080.

Hi,

I see TUNE_68040 and TUNE_68060 flags in the m68k.md file. They disable the mul instruction. -m68080 probably activates one of these flags.

Regards

Samuel Devulder

Posts 248
25 Jul 2019 23:08

Otherwise, this function is cool: it doesn't use too many registers. It will be great with gcc.

Only for slow FPUs like the MC68881, since the cycle count of this code on an AC68080 is comparable to fsqrt + fdiv. Note, however, that the precision of this code isn't that great either. It is sufficient for a game, but for real scientific computation it introduces huge errors.
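To put a rough number on the precision remark, here is a minimal C sketch of the same routine in a form a compiler can inline, plus a small checker. It is an illustration under assumptions, not the code used in Quake: `q_rsqrt` uses `memcpy` instead of the pointer casts so the bit reinterpretation is well defined, and `max_residual` is a hypothetical helper invented for this example. It reports the largest |x*y*y - 1|, which for a small relative error e in y is about 2*e.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The bit trick below only makes sense for 32-bit IEEE-754 floats. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Fast inverse square root with one Newton-Raphson step.
 * memcpy replaces the pointer casts of the original so the type punning
 * is well defined; compilers normally reduce it to a register move. */
static inline float q_rsqrt(float number)
{
    uint32_t i;
    float y = number;
    const float x2 = number * 0.5f;

    memcpy(&i, &y, sizeof i);          /* reinterpret the float bits */
    i = 0x5f3759dfu - (i >> 1);        /* initial guess */
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - x2 * y * y);    /* 1st iteration */
}

/* Largest |x*y*y - 1| for y = q_rsqrt(x), sampling x from lo up to hi in
 * 1% steps. Hypothetical helper for this sketch; it avoids libm on purpose. */
static double max_residual(float lo, float hi)
{
    double worst = 0.0;

    assert(lo > 0.0f && hi > lo);
    for (float x = lo; x <= hi; x *= 1.01f) {
        double y = q_rsqrt(x);
        double r = (double)x * y * y - 1.0;
        if (r < 0.0)
            r = -r;
        if (r > worst)
            worst = r;
    }
    return worst;
}
```

Calling `max_residual(1.0f, 16.0f)` covers both exponent parities of the input. The residual stays well under one percent, which fits the "fine for a game, not for scientific computation" assessment above.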

Gunnar von Boehn
(Apollo Team Member)
Posts 6229
26 Jul 2019 06:30

Stefan "Bebbo" Franke wrote:

The table is already helpful for a 68080 cost model, but the addressing modes are still missing, and it is unclear to me what the second column means.

Regarding the costs, maybe we can discuss what information exists and what is missing.

For SIZE TUNING, GCC will need a size table telling it how many bytes each instruction needs and how many bytes each addressing mode needs.
Does this table exist, and is it fully correct?

For performance tuning, GCC will need cycle tables for each instruction, plus for each addressing mode.
Timings differ between the groups 68000/010, 020/030, 040, 060 and 080.

Generally, EA modes are slower on 000/020/030 and become faster or even free on 040/060/080.

For the modern CPUs 060/080 we have 2 pipes, so "grouping" of instructions becomes an important topic.

As the CPU can do 1 memory operation per cycle plus another register operation, instructions should be scheduled accordingly.
Instead of this:


ADDq.l #1,(a0)+
ADDq.l #1,(a0)+
ADDq.l #1,(a0)+
ADDq.l #1,(a0)+
ADDq.l #1,D0
ADDq.l #1,D1
ADDq.l #1,D2
ADDq.l #1,D3

do this:


ADDq.l #1,(a0)+
ADDq.l #1,D0
ADDq.l #1,(a0)+
ADDq.l #1,D1
ADDq.l #1,(a0)+
ADDq.l #1,D2
ADDq.l #1,(a0)+
ADDq.l #1,D3

Such scheduling is also important for AMMX and FPU code.
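The effect of the two orderings can be illustrated with a toy issue model in C. This is only a sketch under loudly stated assumptions, not real 68060 or 68080 timing: it assumes two neighbouring instructions issue together in one cycle unless both are memory operations, and it ignores register dependencies (including the one on a0) and all latencies. The names `op_kind`, `issue_cycles`, `grouped` and `interleaved` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Kind of instruction in this toy model. */
enum op_kind { OP_REG, OP_MEM };

/* Count issue cycles for an in-order stream, assuming (hypothetically)
 * that two neighbouring instructions pair in one cycle unless both are
 * memory operations. Dependencies and real latencies are ignored. */
static size_t issue_cycles(const enum op_kind *ops, size_t n)
{
    size_t cycles = 0;
    size_t i = 0;

    assert(ops != NULL || n == 0);
    while (i < n) {
        int can_pair = (i + 1 < n) &&
                       !(ops[i] == OP_MEM && ops[i + 1] == OP_MEM);
        i += can_pair ? 2 : 1;   /* issue one or two instructions */
        ++cycles;
    }
    return cycles;
}

/* The two orderings from the post: four ADDQ to (a0)+ and four to Dn. */
static const enum op_kind grouped[8] = {
    OP_MEM, OP_MEM, OP_MEM, OP_MEM, OP_REG, OP_REG, OP_REG, OP_REG
};
static const enum op_kind interleaved[8] = {
    OP_MEM, OP_REG, OP_MEM, OP_REG, OP_MEM, OP_REG, OP_MEM, OP_REG
};
```

In this model `issue_cycles(grouped, 8)` gives 6 cycles and `issue_cycles(interleaved, 8)` gives 4. The absolute numbers mean nothing for real hardware; the point is only that alternating memory and register operations lets every cycle issue a pair.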

It is also very important to handle the ALU2EA dependency.
E.g.

addq.l #1,D0
move.l (a0,D0.l),D1

Here the compiler should try to move the addq.l up in the code and place some instructions between the modification of Dn and its use in an EA mode.

On the other hand, it is important to understand that:

addq.l #1,A1
move.l (a0,A1.l),D1

This code will NOT create any bubble on 68K! Using an ADDRESS register as INDEX is therefore often much better on 68K.
Is GCC aware of this?

I think that GCC is now most important for compiling modern software such as web browsers, video players or games like Quake2.
I see only the 060 and 080 as capable enough for many modern software titles.
Therefore, proper code tuning for 060 and 080 will be the most useful for people in the future.

We should also discuss a very important topic:
the register ABI.
On 68K the old ABI has 8 data registers, plus 7 address registers, plus 8 FPU registers, with 2 registers of each kind being scratch.

2 scratch registers are not enough for many routines.
We have seen in our FPU examples that we constantly need more scratch registers.

APOLLO 68080 provides a lot more registers:
32 DATA, 16 ADDRESS and 32 FPU registers.

We should consider allowing the ABI to use more scratch registers.
One could say that 8 new DATA regs, 8 new FPU regs and maybe also some address regs are free scratch registers for subroutines.

Giving the CPU and compiler more scratch registers will greatly reduce the need for register saving/restoring and should improve overall performance.

Gunnar von Boehn
(Apollo Team Member)
Posts 6229
26 Jul 2019 06:54

I think a very important topic for coding and compiling on 68K is understanding the 68K design principle.

The 68K is based on the idea of separation of duties to improve performance.

The concept of the 68K allows EA units and ALU units to work independently. The EA units are there to calculate addresses.
The ALU units are there for the normal math and loop stuff.
Only the ALU unit creates flags!

Each unit type has its own registers: An registers for the EA units
and Dn registers for the ALU units.
The units can "borrow" and use the other units' registers too, but this generally comes with penalties.

The same concept is valid for the FPU. The FPU has its own registers but can also use Dn registers if needed.
That the 68K units can "borrow" other units' registers is a great freedom and makes coding a lot simpler.
But this should be done carefully and not as general practice, as it comes with an extra cost!

Some bad coding examples:

move.l (A0,D0), -- borrows Dn as index
Using Dn as index can be done, but using An is preferable as An is bubble-free.
--
move.l (D0), -- uses Dn as pointer; this can be done with a zero-suppressing EA mode but is bad practice and generally comes with penalties.
--
FADD.S D0, -- uses Dn as FPU register; can be done, but with a penalty on older 68Ks; the 080 supports this for free.
--
CMPA A0,A1 -- uses An as loop condition; this is an edge case that can be done but is often not optimal
bne LOOP
--
SUBQ.l #1,A0 -- misuse of an address register as counter; this is very bad practice and should not be done
TST.l A0
BNE LOOP

Bebbo, how good is GCC today at understanding this design philosophy of the 68K?

Gunnar von Boehn
(Apollo Team Member)
Posts 6229
26 Jul 2019 09:14

Regarding FPU tuning and tuning in general.

I think we should avoid "over" optimization.

It makes sense to convert a MUL into an ADD or a single SHIFT.
It makes sense to convert a DIV into a single SHIFT.
It makes sense to convert a FDIV #imm into FMUL #imm.

Other than this, I think the architectural instructions should be used.
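The three rewrites listed above can be written out in C to show when they are safe. This is a small sketch with invented function names, not a description of what GCC actually emits for any -m680x0 setting. It uses unsigned integers because signed division rounds toward zero, so a plain shift is not equivalent for negative values, and it uses a power-of-two divisor for the FDIV case because only then is the reciprocal exactly representable.

```c
#include <assert.h>
#include <stdint.h>

/* MUL by 2 as an ADD or a single SHIFT. */
static uint32_t mul2_add(uint32_t x)   { return x + x; }
static uint32_t mul2_shift(uint32_t x) { return x << 1; }

/* Unsigned DIV by 8 as a single SHIFT. Signed division rounds toward zero,
 * so a plain arithmetic shift is not equivalent for negative values. */
static uint32_t div8_shift(uint32_t x) { return x >> 3; }

/* FDIV by 4 as FMUL by 0.25. Exact here because 0.25 is a power of two;
 * for other constants the product may round differently from the quotient. */
static float fdiv4_as_fmul(float x) { return x * 0.25f; }

/* Compare each rewrite against the plain operation over a range of inputs. */
static void check_rewrites(void)
{
    for (uint32_t x = 0; x < 100000u; x += 7u) {
        assert(mul2_add(x) == x * 2u);
        assert(mul2_shift(x) == x * 2u);
        assert(div8_shift(x) == x / 8u);
        assert(fdiv4_as_fmul((float)x) == (float)x / 4.0f);
    }
}
```

Calling `check_rewrites()` passes silently when all four identities hold over the sampled range. For divisors that are not powers of two, replacing a floating-point division by a multiplication changes the rounding, so a compiler would normally need a relaxed floating-point mode to do it.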

For FPU performance it's very important that the compiler handles the most important instructions, FADD/FMUL, properly and avoids register dependencies on them.
And that we add more scratch registers to the 68K ABI.
Together this will greatly improve the code; other tunings have very little influence on the global picture.

On the EA modes, it's important to understand why Motorola added the double-indirect mode (there was a special business reason) and that using this mode in today's code should be strongly avoided!

Bebbo, what do you think about this?
Can we achieve this together, and how can we help you?

Grom 68k

Posts 61
26 Jul 2019 13:31

Grom 68k wrote:

Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

Grom 68k wrote:

The mul instruction does not appear with -m68040, -m68060 or -m68080. EXTERNAL LINK
It does appear with -m68020. EXTERNAL LINK

Wow this is funny.

Bebbo can you explain this?

it all depends on the modelled costs. And the costs for the 68040 do have a bug...

and there is no extra cost model for the 68060 or 68080.

Hi,

I see TUNE_68040 and TUNE_68060 flags in the m68k.md file. They disable the mul instruction. -m68080 probably activates one of these flags.

Regards

With -mtune=68080, it works well. :)

Grom 68k

Posts 61
26 Jul 2019 13:41

Grom 68k wrote:

There is no reason to not use << EXTERNAL LINK

lsl works well now. Can the moveq be removed?

PS: Compiler Explorer is difficult to use on a phone ;)

Samuel Devulder

Posts 248
26 Jul 2019 14:32

Grom 68k wrote:

With -mtune=68080, it works well. :)

Doesn't seem so: EXTERNAL LINK
You probably meant: -mtune=68030 (680-thirty) EXTERNAL LINK

lsl works well now. Can the moveq be removed?

Moveq is mandatory since lsl #n is limited to n<=8 AFAIK.

Don Adan

Posts 38
26 Jul 2019 14:55

Samuel Devulder wrote:

Grom 68k wrote:

With -mtune=68080, it works well. :)

Doesn't seem so: EXTERNAL LINK
You probably meant: -mtune=68030 (680-thirty) EXTERNAL LINK

lsl works well now. Can the moveq be removed?

Moveq is mandatory since lsl #n is limited to n<=8 AFAIK.

Not exactly. Two lsl.l can be used too, but I don't know which is better for the 68080. The advantage is that no scratch register is necessary.
