APOLLO CPU Knowledge Forum

Information about the Apollo CPU and FPU.

GCC and Other Stories From the Dark Side

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
29 Dec 2023 17:02

Let's look at some C code and the assembly GCC creates from it:


  int main(int size)
  {
          int wert = 1;   /* "wert" is German for "value" */
          int sum = 0;
          for (; size; size--) {
                  sum += wert;
                  wert++;
          }
          return (int)sum;
  }

gcc -m68080 -O2


         .section .text
         .align  2
         .globl  _main
_main:
         link.w a5,#0
         move.l (8,a5),d1
         move.l d2,-(sp)
         tst.l d1
         jeq .L4
         clr.l d0
         moveq #1,d2
         subq.l #1,d1
.L3:
         add.l d2,d0
         addq.l #1,d2
         subq.l #1,d1
         jcc .L3
         move.l (sp)+,d2
         unlk a5
         rts
.L4:
         move.l (sp)+,d2
         move.l d1,d0
         unlk a5
         rts

The result makes me unhappy.
The code has two return blocks, and the loop could be done simpler and nicer.

Let's see if we can improve this.

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
29 Dec 2023 17:05

gcc -m68080 -fomit-frame-pointer -O2


         .section .text
         .align  2
         .globl  _main
_main:
         move.l d2,-(sp)
         move.l (8,sp),d1
         jeq .L4
         clr.l d0
         moveq #1,d2
         subq.l #1,d1
.L3:
         add.l d2,d0
         addq.l #1,d2
         subq.l #1,d1
         jcc .L3
         move.l (sp)+,d2
         rts
.L4:
         move.l d1,d0
         move.l (sp)+,d2
         rts

OK, with -fomit-frame-pointer the LINK and UNLK are gone.
But the loop is still not really good.

Don Adan

Posts 38
30 Dec 2023 11:08

You can try to reach something like this:


        .section .text
        .align  2
        .globl  _main
_main:
        move.l d2,-(sp)
        move.l 8(sp),d1
        clr.l d0
        moveq #0,d2
.L3:
        add.l d2,d0
        addq.l #1,d2
        subq.l #1,d1
        jcc .L3
        move.l (sp)+,d2
        rts

Or this, if A0 can be used as a scratch register:

        .section .text
        .align  2
        .globl  _main
_main:
        move.l 4(sp),d1
        clr.l d0
        move.l d0,a0 ; or sub.l a0,a0
.L3:
        add.l a0,d0
        addq.l #1,a0
        subq.l #1,d1
        jcc .L3
        rts
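
Both suggestions above drop the zero check by starting the running value at 0 and letting the loop run one extra time. A minimal C sketch of that idea follows, to check that the result stays the same. The function names are made up for this example, and the assembly in the comments is only matched by hand against the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* The loop as written in the original C source. */
int32_t sum_reference(uint32_t size)
{
    int32_t wert = 1, sum = 0;
    for (; size; size--) {
        sum += wert;
        wert++;
    }
    return sum;
}

/* Hypothetical C rendering of the suggested assembly: start at 0 and
   run size + 1 iterations, so size == 0 needs no special case. */
int32_t sum_from_zero(uint32_t size)
{
    int32_t wert = 0, sum = 0;   /* moveq #0,d2 ; clr.l d0 */
    uint32_t counter = size;     /* move.l 8(sp),d1 */
    do {
        sum += wert;             /* add.l  d2,d0 */
        wert++;                  /* addq.l #1,d2 */
    } while (counter-- != 0);    /* subq.l #1,d1 ; jcc: loop until the subtract borrows */
    return sum;
}

/* Self-check: both versions must agree for every size below limit. */
void check_sums_agree(uint32_t limit)
{
    for (uint32_t n = 0; n < limit; n++)
        assert(sum_reference(n) == sum_from_zero(n));
}
```

The first pass through the loop adds 0, which is why the extra iteration is harmless.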

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
30 Dec 2023 14:23

OK, after applying a patch to GCC we now get the code below.
This brings the inner loop from 4 instructions down to 3.
This is good.


          .section .text
          .align  2
          .globl  _main
_main:
          move.l d2,-(sp)
          move.l (8,sp),d1
          jeq .L4
          clr.l d0
          moveq #1,d2
          subq.l #1,d1
.L3:
          add.l d2,d0
          addq.l #1,d2
          dbral d1,.L3
          move.l (sp)+,d2
          rts
.L4:
          move.l d1,d0
          move.l (sp)+,d2
          rts
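
For readers less used to DBRA-style loops, here is a rough C sketch of the counting scheme in the listing above: the counter is decremented once before the loop, and the loop ends when a further decrement would take it to -1. The function names are hypothetical, and the mapping to instructions in the comments is my reading of the listing, not compiler output.

```c
#include <assert.h>
#include <stdint.h>

/* The loop as written in the original C source. */
int32_t sum_reference(uint32_t size)
{
    int32_t wert = 1, sum = 0;
    for (; size; size--) {
        sum += wert;
        wert++;
    }
    return sum;
}

/* Hypothetical C rendering of the patched assembly. */
int32_t sum_dbra_style(uint32_t size)
{
    int32_t wert = 1, sum = 0;
    uint32_t counter;

    if (size == 0)             /* jeq .L4: nothing to add */
        return 0;
    counter = size - 1;        /* subq.l #1,d1 before the loop */
    do {
        sum += wert;           /* add.l  d2,d0 */
        wert++;                /* addq.l #1,d2 */
    } while (counter-- != 0);  /* dbra-style: decrement, loop unless it became -1 */
    return sum;
}

/* Self-check: both loop shapes must agree for every size below limit. */
void check_loops_agree(uint32_t limit)
{
    for (uint32_t n = 0; n < limit; n++)
        assert(sum_reference(n) == sum_dbra_style(n));
}
```

The decrement and the branch are a single instruction in the assembly, which is where the saving of one instruction per iteration comes from.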

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
30 Dec 2023 15:20

OK, here I see another problem in the GCC source:


  ;; Special case of fullword move when source is zero for 68000_10.
  ;; moveq is faster on the 68000.
  (define_insn "*movsi_const0_68000_10"
    [(set (match_operand:SI 0 "movsi_const0_operand" "=d,a,g")
          (const_int 0))]
    "TUNE_68000_10"
    "@
     moveq #0,%0
     sub%.l %0,%0
     clr%.l %0"
    [(set_attr "type" "moveq_l,alu_l,clr_l")
     (set_attr "opy" "*,0,*")])
  
  
  
  
  ;; Special case of fullword move when source is zero for 68040_60.
  ;; On the '040, 'subl an,an' takes 2 clocks while lea takes only 1
  (define_insn "*movsi_const0_68040_60"
    [(set (match_operand:SI 0 "movsi_const0_operand" "=a,g")
          (const_int 0))]
    "TUNE_68040_80"
  {
    if (which_alternative == 0)
      return MOTOROLA ? "lea 0.w,%0" : "lea 0:w,%0";
    else if (which_alternative == 1)
      return "clr%.l %0";
    else
      {
        gcc_unreachable ();
        return "";
      }
  }
    [(set_attr "type" "lea,clr_l")])

On 68K we often have different instructions available for setting something to zero.

Data registers are best set to 0 with MOVEQ #0,Dn.
Address registers are best cleared with SUBA.L An,An.
For everything else, CLR is the choice.

This rule is valid and a good choice for ALL 68k CPUs.
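
The rule of thumb above can be written down as a tiny lookup. This is only an illustrative sketch in C, not GCC code: the enum and the function name are made up for this example, and the strings simply repeat the three idioms named in the post.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical operand classes, named only for this sketch. */
enum operand_kind { DATA_REG, ADDR_REG, OTHER_OPERAND };

/* Returns the zeroing idiom the post recommends for each operand class. */
const char *zero_idiom(enum operand_kind kind)
{
    switch (kind) {
    case DATA_REG:      return "moveq #0,Dn";
    case ADDR_REG:      return "suba.l An,An";
    case OTHER_OPERAND: return "clr.l <ea>";
    }
    assert(!"unknown operand kind");
    return "";
}

/* Self-check: one idiom per class, as stated in the rule. */
void check_idioms(void)
{
    assert(strcmp(zero_idiom(DATA_REG), "moveq #0,Dn") == 0);
    assert(strcmp(zero_idiom(ADDR_REG), "suba.l An,An") == 0);
    assert(strcmp(zero_idiom(OTHER_OPERAND), "clr.l <ea>") == 0);
}
```

The three cases line up with the three alternatives (d, a, g) in the first define_insn quoted above.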

I can see that GCC 6.5 was misled here by false information.
Do you see the error here?

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
02 Jan 2024 11:16

Gunnar von Boehn wrote:

;; Special case of fullword move when source is zero for 68040_60.
;; On the '040, 'subl an,an' takes 2 clocks while lea takes only 1

This part in GCC is wrong.
Someone must have misread or misunderstood the 68K manual, and now GCC generates bad code because of it.


Kamelito Loveless

Posts 260
02 Jan 2024 13:22

Updating GCC to use 080 opcodes while also improving code generation seems a giant effort. Do you think that you can achieve this goal? It has to be done, but from the outside it looks like a difficult task.

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
02 Jan 2024 16:41

As you know, on the Amiga the majority of programs are written either in assembly or in C.

GCC as a C compiler is omnipresent in the Linux world.
Almost everything that is developed on Linux gets compiled using GCC.
Therefore many ported programs were originally developed using GCC as the compiler.

Having a GCC that is as good as possible therefore makes a lot of sense to me.

Yes, GCC is a complex program, and changing it requires taking a little time to understand GCC first.

But it's not impossible to do.
We have already added support for six common 68080 instructions to GCC, and this works pretty well. We see that GCC uses these instructions many thousands of times in some programs.

Using these new instructions helps to make compiled programs smaller and faster.


Kamelito Loveless

Posts 260
03 Jan 2024 09:00

Bravo !


DiscreetFX Studios

Posts 145
05 Jan 2024 05:39

Great news, thanx a lot!

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
05 Jan 2024 07:41

Let's look at another example:


  void main(char * src, int * dest, int len) {
  
    int myint;
    char mychar;
  
    for(;len--;){
      mychar = *src++;
      myint = (int) mychar;
      *dest++ = myint;
  
    }
  }

Compile with:
gcc -Os -S -m68080 -fomit-frame-pointer new.c


  --------------------------------------------------
  #NO_APP
      .section .text
      .align    2
      .globl    _main
  _main:
      move.l (8,sp),a1
      move.l (4,sp),a0
      move.l (12,sp),d0
      jra .L2
  .L3:
      mvs.b (a0)+,d1
      move.l d1,(a1)+
  .L2:
      dbral d0,.L3
      rts
  -----------------------------------------------------------

Now compile with -O2:
gcc -O2 -S -m68080 -fomit-frame-pointer new.c


  #NO_APP
   .section .text
   .align 2
   .globl _main
  _main:
   move.l (4,sp),a0
   move.l (12,sp),d0
   move.l a0,d1
   move.l (8,sp),a1
   add.l d0,d1
   tst.l d0
   jeq .L1
   move.l a0,d0
   sub.l d1,d0
   not.l d0
  .L5:
   mvs.b (a0)+,d1
   move.l d1,(a1)+
   dbral d0,.L5
  .L1:
   rts

-Os creates 4 instructions for setup.
-O2 creates 10 instructions for the same.

I think the result with -Os is OK.

But -O2, -O3 and -Ofast all create this very suboptimal code.

What do you think?
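
To see what the longer -O2 setup actually computes, the counter arithmetic can be replayed in C. This is a hand-made sketch: the function names and the example address are assumptions, and the comments map each step to the -O2 listing above as I read it. The outcome is that the computed loop counter is simply len - 1.

```c
#include <assert.h>
#include <stdint.h>

/* Counter as derived by the -O2 setup code, using 32-bit wraparound
   arithmetic like the CPU registers do. */
uint32_t o2_counter(uint32_t src, uint32_t len)
{
    uint32_t end = src + len;   /* move.l a0,d1 ; add.l d0,d1 */
    uint32_t d0  = src - end;   /* move.l a0,d0 ; sub.l d1,d0  (= -len) */
    return ~d0;                 /* not.l d0                    (= len - 1) */
}

/* Self-check for non-zero lengths (len == 0 takes the jeq .L1 exit). */
void check_counter_identity(uint32_t limit)
{
    for (uint32_t len = 1; len < limit; len++)
        assert(o2_counter(0x00200000u, len) == len - 1);
}
```

Since ~x equals -x - 1 in two's complement, ~(-len) is len - 1, which is the start value a DBRA-style loop needs to run len times.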


Robo Kupka

Posts 50
05 Jan 2024 08:23

What is "mvs" instruction ? I only found it mentioned with the ColdFire cpus and your own note from 2016, mentioning the instruction being removed from Apollo ISA. (http://apollo-core.com/knowledge.php?b=1&note=1943)

Tommo Noorduin

Posts 134
05 Jan 2024 16:28

Robo Kupka wrote:

What is "mvs" instruction ? I only found it mentioned with the ColdFire cpus and your own note from 2016, mentioning the instruction being removed from Apollo ISA. (http://apollo-core.com/knowledge.php?b=1&note=1943)

It cannot be the ColdFire one, since that opcode is already used by bank.

There was talk about it on discord yesterday, and...
today at 17:07 Gunnar wrote:

lets us talk about .... 68080 instructions
--
ROBINHOOD.exe instruction count
mvs = 833
mvz = 5670
mov3 = 8826
dbf = 1787
addiwl = 2186
cmpiwl = 5511

src: EXTERNAL LINK
I guess it is very new and does move.b & extb.l like the ColdFire one.

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
08 Jan 2024 05:34

Robo Kupka wrote:

What is "mvs" instruction ?

MVS and MVZ are move with extend instructions.
MVS does sign extend.
MVZ does zero extend.

These instructions are useful when you convert from byte to long or from word to long (in C terms: char to int, or short to int).
The 68K address mode (An,Dn) always treats a word index as signed, so using an unsigned short index in any high-level language also requires zero-extending the index to int.
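
In C terms the two kinds of extension look roughly like this. It is a portable sketch with made-up function names, showing the effect described for MVS and MVZ rather than their exact instruction semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a byte to 32 bits, the effect described for MVS.B. */
int32_t extend_signed_byte(uint8_t raw)
{
    return (int32_t)(raw ^ 0x80u) - (int32_t)0x80;
}

/* Zero-extend a byte to 32 bits, the effect described for MVZ.B. */
uint32_t extend_unsigned_byte(uint8_t raw)
{
    return (uint32_t)raw;
}

/* Sign-extend a 16-bit word to 32 bits, the effect described for MVS.W. */
int32_t extend_signed_word(uint16_t raw)
{
    return (int32_t)(raw ^ 0x8000u) - (int32_t)0x8000;
}

/* Zero-extend a 16-bit word to 32 bits, the effect described for MVZ.W. */
uint32_t extend_unsigned_word(uint16_t raw)
{
    return (uint32_t)raw;
}

/* Self-check: the same bit pattern gives different 32-bit values. */
void check_extensions(void)
{
    assert(extend_signed_byte(0xFF) == -1);
    assert(extend_unsigned_byte(0xFF) == 255u);
    assert(extend_signed_word(0x8000) == -32768);
    assert(extend_unsigned_word(0x8000) == 32768u);
}
```

The last two checks show the indexing problem: an unsigned short index of 0x8000 means element 32768, but read as a signed word index it would mean -32768, hence the need to zero-extend first.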

Arne, whom you know as the coder of Apollo Invader, has recently activated a number of instructions for the Apollo 68080 CPU that were dormant.

These re-activated instructions are:
- MVS
- MVZ
- MOV3
- MOVIW
- CLR.Q
- MOVE2
and others

In parallel I've created a number of patches for GCC, allowing the C compiler to make good use of these instructions.
The patches improve the code GCC generates in terms of both performance and code density.

It's widely known that the code density of the 68K family is exceptionally good. The Apollo 68080 is by far the best of the 68K family here, having the densest code and the highest performance.

I have been monitoring the code generation of GCC for a while.
The 68k code quality of recent GCC is of course not as good as code written by human assembly coders, but in comparison to older Amiga compilers it's actually not bad.
We have created a number of patches to improve this even more.

In our Discord developer channel we have a group of Amiga and Atari developers who are jointly discussing options to further improve the compiler. The GCC group is supported by a number of Apollo Team members.


Robo Kupka

Posts 50
08 Jan 2024 07:22

That is a very nice initiative. I assume the instructions will be well documented in the next update of A68060.pdf (found on Discord). Which Core version are instructions available from ? Will these instructions make it to the older Vampires (V1200, V500, V600) ?

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
08 Jan 2024 08:01

Robo Kupka wrote:

That is a very nice initiative.

Thank you.

We already see the clear benefit of this in benchmarks of new games such as "Robin Hood".

Robo Kupka wrote:

I assume the instructions will be well documented in the next update of A68060.pdf (found on Discord).

Yes we are updating the online manual and HTML pages accordingly.

Robo Kupka wrote:

Which Core version are instructions available from ?

The following have already been available for both V2 and V4 for a number of releases:

CMPIW.L
ADDIW.L
DBRA.L
BRA.B+
BSR.B+
BCC.B+

With the next release, planned for Easter, the re-activated instructions mentioned earlier (MVS, MVZ and the others) will be released.

Robo Kupka wrote:

Will these instructions make it to the older Vampires (V1200, V500, V600) ?

Yes, we keep both V2 and V4 fully compatible.
The instructions will come to V2 with Release 2.18.


Robo Kupka

Posts 50
08 Jan 2024 14:53

Awesome. Thanks. I appreciate that the CPU 080 core has the same features across all Vampires, so the code compiled with -m68080 optimisations can run on all "V" devices, unless it utilizes V4 exclusive features.


Gunnar von Boehn
(Apollo Team Member)
Posts 6223
11 Jan 2024 06:14

GCC makes very good progress in using 68080 instructions.

Bebbo added a new patch to his cross compiler repository and now GCC also supports CLR.Q
This nicely improves performance in memory clearing and clearing of structures or when doing clear on memory or stack regions.

Don Adan

Posts 38
11 Jan 2024 16:52

Gunnar von Boehn wrote:

GCC makes very good progress in using 68080 instructions.

Bebbo added a new patch to his cross compiler repository and now GCC also supports CLR.Q
This nicely improves performance in memory clearing and clearing of structures or when doing clear on memory or stack regions.

From my point of view it would be much better if the "moveq" implementation for the 68080 worked as a 64-bit command, not only as a 32-bit command. It would be much more useful than the clr.q command. But perhaps it is now too late to make changes.

Gunnar von Boehn
(Apollo Team Member)
Posts 6223
12 Jan 2024 06:32

Don Adan wrote:

If you think about using CLR.Q only to clear a register, then I fully agree with you - but the main point of CLR.Q is that it's not limited to register operations like MOVEQ.

The 68000/68010 have a 16-bit memory bus.
The 68020/30/40/60 have a 32-bit memory bus.
The 68080 has a 64-bit memory bus.

The 64-bit bus of the 68080 CPU gives a big performance boost.
Using instructions like CLR.Q and MOVE2 makes it possible to benefit from this.

Let me give some examples to make this clear (pun).

CLR.Q can be used to clear memory fields.
Using the 64-bit access of CLR.Q here gives you a big performance advantage.

If we look at code generated by C compilers, we can see that programs very often create, copy and process structures in memory or on the stack.

This means operations like the following are very common in many programs.

clr.l 40(A2)
clr.l 44(A2)

Here two longs are cleared in a structure.
The 68000/010 would need 4 memory bus cycles for this.
The 20/30/40/60 will need 2 memory bus cycles.

But the 68080 could do this twice as fast, with 1 memory bus cycle.

=
clr.q 40(A2) * faster/better/shorter

Clearing certain fields in structures is common practice.
The CLR.Q instruction makes it possible to perform 2 LONG accesses in 1 cycle.
This is twice as fast and will improve performance.
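
A small C sketch of the same idea, for illustration only. The structure layout and all names are assumptions chosen so that the two fields sit at byte offsets 40 and 44 as in the example above. Whether a given compiler turns the 8-byte clear into CLR.Q is not shown here, and the cycle count helper ignores caches and alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure with two adjacent longs at offsets 40 and 44. */
struct record {
    uint32_t header[10];   /* bytes 0..39 */
    uint32_t first;        /* byte offset 40 */
    uint32_t second;       /* byte offset 44 */
    uint32_t tail;         /* byte offset 48 */
};

/* Two separate 32-bit clears: clr.l 40(A2) ; clr.l 44(A2) */
void clear_two_longs(struct record *r)
{
    r->first = 0;
    r->second = 0;
}

/* One 8-byte clear covering both fields: the access clr.q 40(A2) expresses. */
void clear_one_quad(struct record *r)
{
    memset((unsigned char *)r + offsetof(struct record, first), 0, 8);
}

/* Bus cycles to move 'bytes' aligned bytes over a bus 'bus_bits' wide. */
unsigned bus_cycles(unsigned bytes, unsigned bus_bits)
{
    unsigned bus_bytes = bus_bits / 8;
    assert(bus_bytes > 0);
    return (bytes + bus_bytes - 1) / bus_bytes;
}

/* Self-check: both clears leave the structure in the same state. */
void check_clear(void)
{
    struct record a, b;
    assert(offsetof(struct record, first) == 40);
    assert(offsetof(struct record, second) == 44);
    memset(&a, 0xAA, sizeof a);
    memset(&b, 0xAA, sizeof b);
    clear_two_longs(&a);
    clear_one_quad(&b);
    assert(memcmp(&a, &b, sizeof a) == 0);
    assert(b.first == 0 && b.second == 0 && b.tail == 0xAAAAAAAAu);
}
```

With this helper, bus_cycles(8, 16), bus_cycles(8, 32) and bus_cycles(8, 64) give 4, 2 and 1, matching the cycle counts listed above.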

Let's look at another example.
Operations like this are also very common in programs:

move.l 32(A2),D4
move.l 36(A2),A0

Here two longs are loaded from memory into 2 registers.
The 68000/010 would need 4 memory bus cycles.
The 20/30/40/60 will need 2 memory bus cycles.
The 68080 can do this with 1 memory bus cycle = twice as fast.

= move2.l 32(A2),D4:A0 * faster/better/shorter

Using these 64-bit memory operations allows us to make better use of the potential of the 64-bit memory bus interface.

In many programs we very often see several values being read from the stack or from structures into arbitrary registers.

This could be improved a lot by using one MOVE2 instruction instead of two MOVE.L instructions. Using the advanced 68080 instructions here again helps to halve the number of memory accesses, save space and speed programs up.
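
The paired load can be sketched in C as one 8-byte copy that delivers both values. This is a hypothetical illustration with made-up names. It shows the access pattern MOVE2 expresses, not how GCC decides to emit it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two destination values, standing in for the register pair D4:A0. */
struct pair {
    uint32_t first;
    uint32_t second;
};

/* Two 32-bit loads: move.l 32(A2),D4 ; move.l 36(A2),A0 */
struct pair load_two_longs(const uint32_t *base)
{
    struct pair p;
    p.first  = base[8];    /* byte offset 32 */
    p.second = base[9];    /* byte offset 36 */
    return p;
}

/* One 8-byte load feeding both destinations: the access pattern of
   move2.l 32(A2),D4:A0 */
struct pair load_one_quad(const uint32_t *base)
{
    struct pair p;
    memcpy(&p, &base[8], sizeof p);
    return p;
}

/* Self-check: both ways of loading deliver the same two values. */
void check_loads(void)
{
    uint32_t mem[12];
    struct pair a, b;

    for (uint32_t i = 0; i < 12; i++)
        mem[i] = 0x1000u + i;
    a = load_two_longs(mem);
    b = load_one_quad(mem);
    assert(sizeof(struct pair) == 8);
    assert(a.first == b.first && a.second == b.second);
    assert(b.first == 0x1008u && b.second == 0x1009u);
}
```

The two source longs have to be adjacent in memory for this to work, which is the usual case for neighbouring structure fields and stack arguments.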

What do you think?
