APOLLO CPU Knowledge Forum

Information about the Apollo CPU and FPU.

GCC Improvement for 68080

Samuel Devulder

Posts 248
30 Jul 2019 17:08

$ man pow
EXTERNAL LINK  
EXTERNAL LINK  :)

This gives: EXTERNAL LINK

Actually we can see it calls libm.a/pow. But I think a __stdargs is missing from the declaration of the pow function in math.h, since functions in libm.a should use the default ABI (i.e. stack-based) even if -mregparm is present.

Also notice that "jsr XXX;rts" should be optimized to "jra XXX". This doesn't happen here, but it does appear when the "-mregparm" option is removed. (Still that famous post-stack-setup optimization pass that seems to be missing.)
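As a minimal sketch of the declaration issue discussed above: a library routine is marked __stdargs so that callers keep passing its arguments on the stack, whatever the rest of the file is compiled with. The names demo_pow and demo_cube are hypothetical stand-ins, the body is a toy, and the no-op fallback for __stdargs is only an assumption so the sketch also compiles on toolchains that do not provide the macro.

```c
/* Fallback so the sketch compiles on toolchains without the Amiga
   attribute macros; there __stdargs expands to nothing (assumption). */
#ifndef __stdargs
#define __stdargs
#endif

/* Hypothetical stand-in for a libm routine. Declared __stdargs so that
   callers use the stack-based ABI even when built with -mregparm. */
__stdargs double demo_pow(double base, double exponent);

/* An ordinary caller; it only sees the prototype above. */
double demo_cube(double x)
{
    return demo_pow(x, 3.0);
}

/* Toy body: only handles small non-negative whole exponents. */
__stdargs double demo_pow(double base, double exponent)
{
    double result = 1.0;
    unsigned i;
    for (i = 0; i < (unsigned)exponent; ++i)
        result *= base;
    return result;
}
```

The point is only where the attribute sits: on the prototype in the header, so every caller agrees with the library on the calling convention.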

Stefan "Bebbo" Franke

Posts 142
30 Jul 2019 17:17

Samuel Devulder wrote:

$ man pow
   EXTERNAL LINK    
   EXTERNAL LINK    :)

Grom68k already provided many headers - I 'only' have to put em in...

EDIT: math.h has __stdargs now


Samuel Devulder

Posts 248
30 Jul 2019 21:27

Cool B) Works fine EXTERNAL LINK

Grom 68k

Posts 61
30 Jul 2019 22:42

Samuel Devulder wrote:

Cool B) Works fine EXTERNAL LINK

Is __retfp0 active now on pow ? It seems not.

EXTERNAL LINK
Edit: -mregparm is active on pow this morning.

-mregparm could be useful too in complex.h :).

EXTERNAL LINK

Stefan "Bebbo" Franke

Posts 142
31 Jul 2019 07:08

Grom 68k wrote:

Samuel Devulder wrote:

Cool B) Works fine EXTERNAL LINK

Is __retfp0 active now on pow ? It seems not.

EXTERNAL LINK
Edit: -mregparm is active on pow this morning.

-mregparm could be useful too in complex.h :).

EXTERNAL LINK

stdlib functions can't use register parameters or fp0 to return something. /shrug?

complex returns more than one fp-register -> a pointer is used
this is not covered by __retfp0 - as the name states: fp0 is ONE register.
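To illustrate the point about complex results, here is a rough sketch of the "pointer is used" convention: a value with two floating-point components is written through a caller-supplied pointer, while a plain double is the case a single return register can carry. The type demo_complex and both functions are hypothetical and only mimic what a compiler might do internally.

```c
/* Hypothetical two-component value; stands in for a C99 complex double. */
typedef struct {
    double re;
    double im;
} demo_complex;

/* Result written through a caller-supplied pointer, since a single
   register such as fp0 can carry only one double. */
void demo_cmul(const demo_complex *a, const demo_complex *b, demo_complex *out)
{
    demo_complex r;
    r.re = a->re * b->re - a->im * b->im;
    r.im = a->re * b->im + a->im * b->re;
    *out = r;  /* the temporary keeps this safe if out aliases a or b */
}

/* A plain double result is the case a one-register return can cover. */
double demo_norm2(const demo_complex *a)
{
    return a->re * a->re + a->im * a->im;
}
```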

Grom 68k

Posts 61
31 Jul 2019 15:28

Stefan "Bebbo" Franke wrote:

Grom 68k wrote:

Samuel Devulder wrote:

Cool B) Works fine EXTERNAL LINK

Is __retfp0 active now on pow ? It seems not.

EXTERNAL LINK
Edit: -mregparm is active on pow this morning.

-mregparm could be useful too in complex.h :).

EXTERNAL LINK

stdlib functions can't use register parameters or fp0 to return something. /shrug?

complex returns more than one fp-register -> a pointer is used
this is not covered by __retfp0 - as the name states: fp0 is ONE register.

That's why I am trying to use math-68881.h to remove the builtin.

Otherwise, there is something I don't understand: is __stdargs really working? pow now uses fp0 and fp1 as input.

EXTERNAL LINK

I only mention -mregparm for complex.h because it works on pow. Complex data is converted to 2 fp registers.

Stefan "Bebbo" Franke

Posts 142
31 Jul 2019 15:40

Grom 68k wrote:

Else, i don't understand something, pow use now fp0 and fp1 as input.

EXTERNAL LINK

I only speak about -mregparm for the complex.h since it works on pow. Complex data is converted as 2 fp registers.

It depends on which headers are live at cex: the old ones without __stdargs or the new ones with __stdargs :-)

Samuel Devulder

Posts 248
31 Jul 2019 17:44

Grom 68k wrote:

pow use now fp0 and fp1 as input.

hmm no it doesn't


             fmove.d fp1,-(sp)
             fmove.d fp0,-(sp)
             jsr _pow

(at the moment on EXTERNAL LINK ) Maybe the header has changed since.


Grom 68k

Posts 61
31 Jul 2019 22:16

Yes, cex improves quickly. I tried integer division again. It's better. EXTERNAL LINK

Which is better, move d0,d0 or tst d0? Is tst smaller? Same question for jpl versus bpl.

Edit: It's not better with /4. It's only better with /2! Why?

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
01 Aug 2019 06:50

Gunnar von Boehn wrote:

How about using such formula for the cost?

a) 4 per clock cycle
b) +1 per instruction word
c) +2 for using memory

I think this will create much more balanced code.
What do you think?

Bebbo, what do you think about this balanced cost proposal?
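The proposed formula can be written down as a small function. This is only a sketch of the arithmetic: insn_cost is a hypothetical helper, not an existing GCC hook, and "instruction word" is taken to mean a 16-bit word of the encoding.

```c
/* Proposed weights from the post: 4 per clock cycle, +1 per 16-bit
   instruction word, +2 if the instruction accesses memory. */
enum { COST_PER_CYCLE = 4, COST_PER_WORD = 1, COST_MEMORY = 2 };

/* Hypothetical helper for illustration; not part of GCC. */
int insn_cost(int cycles, int words, int uses_memory)
{
    return COST_PER_CYCLE * cycles
         + COST_PER_WORD * words
         + (uses_memory ? COST_MEMORY : 0);
}
```

Under these weights a one-cycle, one-word register-only instruction costs 5, and the same instruction with a memory operand costs 7, so length and memory use both show up without outweighing the cycle count.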

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
01 Aug 2019 06:55

More ideas to make better code:


  int f(int* ptr)
  {
      int tmp;
      tmp= *ptr &2;
      
      return tmp;
  }

GCC right now creates this, which is not optimal


          move.l (a0),d0         #2      4 4    +d0                                                                        
          and.l #2,d0            #3      0 4    *d0                                                
          rts                    #5      -1 0

GCC should produce this:

MOVEQ #2,D0
AND.L (A0),D0

We should include the instruction length in the COST model!

I propose that we include the memory access and the instruction length in the cost, so that the cost of the current sequence becomes this:


          move.l (a0),d0         #2      4 7    +d0                                                                        
          and.l #2,d0            #3      0 7    *d0                                                
          rts                    #5      -1 0

And the cost of the better sequence should be this:


          moveq #2,d0            #2      4 5    +d0                                                                        
          and.l (a0),d0          #3      0 7    *d0                                                
          rts                    #5      -1 0

Stefan "Bebbo" Franke

Posts 142
01 Aug 2019 17:06

Gunnar von Boehn wrote:

More ideas to make better code:


   int f(int* ptr)
   {
       int tmp;
       tmp= *ptr &2;
       
       return tmp;
   }

GCC right now creates this, which is not optimal


           move.l (a0),d0         #2      4 4    +d0                                                                        
           and.l #2,d0            #3      0 4    *d0                                                
           rts                    #5      -1 0

GCC should produce this:

MOVEQ #2,D0
AND.L (A0),D0

We should include the instruction length in the COST model!

I would propose we include the Memory access and the length in the cost so that the cost should be this:


           move.l (a0),d0         #2      4 7    +d0                                                                        
           and.l #2,d0            #3      0 7    *d0                                                
           rts                    #5      -1 0

The cost should be this:


           moveq #2,d0            #2      4 5    +d0                                                                        
           and.l (a0),d0          #3      0 7    *d0                                                
           rts                    #5      -1 0

I expect a gain of less than 0.1% ...

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
01 Aug 2019 20:30

Stefan "Bebbo" Franke wrote:

I expect a gain of less than 0.1% ...

This is very easy to count. :-)

                   Clocks  Bytes
    move.l (a0),d0    1      2
    and.l  #2,d0      1      6

  versus

    moveq  #2,D0      1      2
    and.l  (a0),D0    0      2  (fused!)

Clocks 2 => 1
Bytes 8 => 4

Twice as fast and half the size.
I would call this a great improvement.

This tuning is possible for very many operations
not only for AND but also ADD/SUB/OR/EOR/...
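The totals in this post can be reproduced by summing the per-instruction figures. The sketch below only does that bookkeeping; the clock and byte values are the ones quoted in the post (including the 0 clocks for the fused instruction), and the type and function names are made up for illustration.

```c
#include <stddef.h>

/* Per-instruction figures as quoted in the post. */
typedef struct {
    const char *text;
    int clocks;
    int bytes;
} insn_stat;

static const insn_stat current_seq[] = {
    { "move.l (a0),d0", 1, 2 },
    { "and.l #2,d0",    1, 6 },
};

static const insn_stat proposed_seq[] = {
    { "moveq #2,d0",    1, 2 },
    { "and.l (a0),d0",  0, 2 },  /* counted as fused with the moveq */
};

/* Sum of the clock column over a sequence. */
int total_clocks(const insn_stat *seq, size_t n)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        sum += seq[i].clocks;
    return sum;
}

/* Sum of the byte column over a sequence. */
int total_bytes(const insn_stat *seq, size_t n)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        sum += seq[i].bytes;
    return sum;
}
```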

Stefan "Bebbo" Franke

Posts 142
01 Aug 2019 20:33

Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

I expect a gain of less than 0.1% ...

This is very easy to count. :-)

                   Clocks  Byte  
    move.l (a0),d0         1       2                                                                        
    and.l  #2,d0           1       6                                    
  
  versus
  
   moveq   #2,D0           1       2
   and.l   (a0),D0         0       2  (fused!)

Clocks 2 => 1
Bytes 8 => 4

Twice as fast and halve the size.
I would call this a great improvement.

This tuning is possible for very many operations
not only for AND but also ADD/SUB/OR/EOR/...

not for SUB ...
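The post gives no reason, but one plausible reading (an assumption) is that subtraction is not commutative: "sub.l (a0),d0" computes d0 minus the memory operand, so loading the constant into d0 first yields 2 - *ptr instead of *ptr - 2. The sketch below models both orders in C; the function names are hypothetical.

```c
#include <stdint.h>

/* What the C source asks for: memory operand OP constant. */
int32_t and_source(int32_t mem) { return mem & 2; }
int32_t sub_source(int32_t mem) { return mem - 2; }

/* What "moveq #2,d0" followed by "OP.l (a0),d0" computes:
   the destination register d0 is the left operand. */
int32_t and_swapped(int32_t mem)
{
    int32_t d0 = 2;
    d0 &= mem;      /* 2 & mem, same as mem & 2 */
    return d0;
}

int32_t sub_swapped(int32_t mem)
{
    int32_t d0 = 2;
    d0 -= mem;      /* 2 - mem, not mem - 2 */
    return d0;
}
```

For AND, OR and ADD the swapped form gives the same value, so the constant-first sequence is a valid replacement; for SUB it would need an extra negation or a different instruction.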

Grom 68k

Posts 61
01 Aug 2019 22:14

Gunnar von Boehn wrote:

For the modern CPUs 060/080 we have 2 pipes, so "grouping" of instructions becomes an important topic.

As the CPU can do 1 memory operation per cycle, plus another register operation, instructions should be scheduled accordingly.
Instead of this


  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,D1
  ADDq.l #1,D2
  ADDq.l #1,D3

Do this


  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,(a0)+
  ADDq.l #1,D1
  ADDq.l #1,(a0)+
  ADDq.l #1,D2
  ADDq.l #1,(a0)+
  ADDq.l #1,D3

Such scheduling is also important for AMMX and FPU code.

Hi,

To limit memory usage, I think a memory pipeline can be added.


  (define_reservation "i_pipelines" "(i0_pipeline | i1_pipeline)")

  ;; simple insns with 1 cycle
  (define_insn_reservation "simple" 1 (eq_attr "type" "alu_l")
  "i_pipelines, i_ports, i_memory")

Super scalar requirements reduce memory usage.
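The effect of the interleaving quoted above can be seen with a toy issue model. This is a sketch under simplifying assumptions, not a description of the real 68080: issue is in order, up to two instructions go per cycle, at most one of them may be a memory operation, and all dependencies (including the (a0)+ updates) are ignored.

```c
#include <stddef.h>

typedef enum { OP_REG, OP_MEM } op_kind;

/* Toy in-order dual-issue model (assumption): up to two instructions per
   cycle, at most one of them a memory operation; dependencies ignored. */
int issue_cycles(const op_kind *ops, size_t n)
{
    int cycles = 0;
    size_t i = 0;
    while (i < n) {
        /* Pair with the next instruction unless both need the memory unit. */
        if (i + 1 < n && !(ops[i] == OP_MEM && ops[i + 1] == OP_MEM))
            i += 2;
        else
            i += 1;
        ++cycles;
    }
    return cycles;
}

/* Four memory adds followed by four register adds. */
static const op_kind grouped[] = {
    OP_MEM, OP_MEM, OP_MEM, OP_MEM, OP_REG, OP_REG, OP_REG, OP_REG
};

/* The same eight instructions, alternating memory and register adds. */
static const op_kind interleaved[] = {
    OP_MEM, OP_REG, OP_MEM, OP_REG, OP_MEM, OP_REG, OP_MEM, OP_REG
};
```

In this model the grouped order needs 6 cycles and the interleaved order 4, which is the kind of difference a memory reservation in the pipeline description is meant to expose to the scheduler.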

Stefan "Bebbo" Franke

Posts 142
01 Aug 2019 22:40

Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

I expect a gain of less than 0.1% ...

This is very easy to count. :-)

                   Clocks  Byte  
     move.l (a0),d0         1       2                                                                        
     and.l  #2,d0           1       6                                    
   
   versus
   
    moveq   #2,D0           1       2
    and.l   (a0),D0         0       2  (fused!)

Clocks 2 => 1
Bytes 8 => 4

Twice as fast and halve the size.
I would call this a great improvement.

This tuning is possible for very many operations
not only for AND but also ADD/SUB/OR/EOR/...

not for SUB ...

and not for EOR ...

Stefan "Bebbo" Franke

Posts 142
01 Aug 2019 22:45

Grom 68k wrote:

Gunnar von Boehn wrote:


   ADDq.l #1,(a0)+
   ADDq.l #1,(a0)+
   ADDq.l #1,(a0)+
   ADDq.l #1,(a0)+
   ADDq.l #1,D0
   ADDq.l #1,D1
   ADDq.l #1,D2
   ADDq.l #1,D3

Do this


   ADDq.l #1,(a0)+
   ADDq.l #1,D0
   ADDq.l #1,(a0)+
   ADDq.l #1,D1
   ADDq.l #1,(a0)+
   ADDq.l #1,D2
   ADDq.l #1,(a0)+
   ADDq.l #1,D3

Such scheduling is also important for AMMX and FPU code.

Hi,

To limit memory usage, I think a memory pipeline can be added.


  (define_reservation "i_pipelines" "(i0_pipeline | i1_pipeline)")
  
  
  ;; simple insns with 1 cycle
  (define_insn_reservation "simple" 1 (eq_attr "type" "alu_l")
  "i_pipelines, i_ports, i_memory")

Super scalar requirements reduce memory usage.

Allowing 2 (or more) insns per cycle takes more effort, since this


   fmul.x fp0,fp1
   ADDq.l #1,(a0)+
   ADDq.l #1,D0
   ADDq.l #1,(a0)+
   ADDq.l #1,D1
   ADDq.l #1,(a0)+
   fmul.x fp1,fp2

would result in a stall, since the five insns take 3 cycles only...
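The stall argument can be put into numbers with a very small model. It is only a sketch: the producer is assumed to issue in cycle 0, its result is assumed to be usable "latency" cycles later, and the independent instructions in between keep the pipes busy for "filler_cycles". The latency is a free parameter here, not a documented 68080 figure, and stall_cycles is a hypothetical helper.

```c
/* Cycles a dependent instruction has to wait for its input.
   latency:       cycles from issue of the producer until its result is usable
   filler_cycles: cycles taken by the independent instructions in between */
int stall_cycles(int latency, int filler_cycles)
{
    int earliest_by_order = 1 + filler_cycles;  /* first cycle after the fillers */
    int earliest_by_data  = latency;            /* first cycle the result is usable */
    int wait = earliest_by_data - earliest_by_order;
    return wait > 0 ? wait : 0;
}
```

If the scheduler believes the five filler instructions take five cycles, it sees no stall for a latency of 6; once they really take only three cycles, the same latency leaves the second fmul waiting for two cycles.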


Grom 68k

Posts 61
02 Aug 2019 01:23

:( I thought pipelines would be easier than fusing.
https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html#Processor-pipeline-description

I found a method for fusing. Does it have the same problem with the insns/cycles count? EXTERNAL LINK

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
02 Aug 2019 06:32

Stefan "Bebbo" Franke wrote:

and not for EOR ...

Why should it not?

MOVEQ #$F,D0
EOR.L (A0),D0

This works just fine.

The 68K offers a rich selection of instructions.
For different immediate ranges, the 68K provides tuned instructions.

Example:


                     Bytes
SUBQ.L #$1,A0       2
SUBA.W #$111,A0     4
SUBA.L #$222222,A0  6

Using the tuned instruction makes programs smaller and increases the Icache hit rate, so size is saved and speed is increased.

BTW 68080 offers this too:


                     Bytes
ADDQ.L  #$1,D0       2
ADDIW.L #$111,D0     4
ADDI.L  #$222222,D0  6
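Choosing the tuned form is a simple range check on the immediate. The sketch below does this for the SUBQ / SUBA.W / SUBA.L table: 1 to 8 fits the quick form, anything that fits a sign-extended 16-bit word takes the word form, and the rest needs the long form. The helper suba_imm_bytes is hypothetical and ignores other options a compiler might consider, such as ADDQ for small negative values.

```c
#include <stdint.h>

/* Encoded size in bytes of "subtract immediate from An":
   SUBQ for 1..8, SUBA.W for values that fit a sign-extended
   16-bit word, SUBA.L otherwise. Hypothetical helper. */
int suba_imm_bytes(int32_t imm)
{
    if (imm >= 1 && imm <= 8)
        return 2;               /* SUBQ.L #imm,An */
    if (imm >= INT16_MIN && imm <= INT16_MAX)
        return 4;               /* SUBA.W #imm,An */
    return 6;                   /* SUBA.L #imm,An */
}
```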

Grom 68k

Posts 61
02 Aug 2019 06:59

Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

and not for EOR ...

Why should it not?

It's simply not in the PDF list for now.
