
Performance and Benchmark Results!

Another 68K Coding Challenge

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
16 Feb 2018 09:39


OK, another challenge for the experts.
  This one is a tough nut to crack.
 
  This is part of the NEO-GEO Sprite render routines.
  This routine is for ZOOMED sprites.
 
 

    myword = gfxdata[0];
    if (dda_x_skip[ 0]) {if ((col=((myword>>28)&0xf))) PUTPIXEL(*br,paldata[col]);br++;}
    if (dda_x_skip[ 1]) {if ((col=((myword>>24)&0xf))) PUTPIXEL(*br,paldata[col]);br++;}
    if (dda_x_skip[ 2]) {if ((col=((myword>>20)&0xf))) PUTPIXEL(*br,paldata[col]);br++;}
    if (dda_x_skip[ 3]) {if ((col=((myword>>16)&0xf))) PUTPIXEL(*br,paldata[col]);br++;}
    if (dda_x_skip[ 4]) {if ((col=((myword>>12)&0xf))) PUTPIXEL(*br,paldata[col]);br++;}
    if (dda_x_skip[ 5]) {if ((col=((myword>>8)&0xf))) PUTPIXEL(*br,paldata[col]);br++;}
    if (dda_x_skip[ 6]) {if ((col=((myword>>4)&0xf))) PUTPIXEL(*br,paldata[col]);br++;}
    if (dda_x_skip[ 7]) {if ((col=((myword>>0)&0xf))) PUTPIXEL(*br,paldata[col]);br++;}
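
For reference, the eight unrolled lines above can be written as a loop. This is only a sketch: the function name `draw_zoomed_group` is made up, `PUTPIXEL` is assumed to be a plain store, and the palette is assumed to be an array of 16-bit entries (the generated ASM indexes a longword table instead).

```c
#include <stdint.h>

/* Hypothetical stand-in for the emulator's PUTPIXEL macro: a plain store. */
#define PUTPIXEL(dst, value) ((dst) = (value))

/*
 * Draw one group of 8 pixels of a zoomed sprite row.
 * myword     : eight packed 4-bit pixels, leftmost pixel in the top nibble
 * dda_x_skip : per-column zoom mask, 0 means the column is dropped
 * paldata    : palette lookup, indexed by the 4-bit colour
 * br         : destination pointer; the advanced pointer is returned
 */
static uint16_t *draw_zoomed_group(uint32_t myword,
                                   const uint8_t dda_x_skip[8],
                                   const uint16_t paldata[16],
                                   uint16_t *br)
{
    for (int i = 0; i < 8; i++) {
        if (!dda_x_skip[i])
            continue;                 /* column removed by the zoom */
        unsigned col = (myword >> (28 - 4 * i)) & 0xf;
        if (col)                      /* colour 0 is transparent */
            PUTPIXEL(*br, paldata[col]);
        br++;                         /* kept column: advance even if transparent */
    }
    return br;
}
```

Note that a kept column always advances the destination pointer, while a write only happens for a non-zero colour. These are the two conditions that become the two branches in the ASM.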
 

 
 
  Now let's look at the generated ASM code.
 

    tst.b (a3)    // X ZOOM MASK
    beq L5
    bfextu d1{#4:#4},d0  // 4bit pixel
    beq L6
    move.w 2(a2,d0.l*4),(a0)
  L6:
    addq.l #2,a0
  L5:
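
For readers less used to bit-field instructions: BFEXTU counts its offset from the most significant bit, so `bfextu d1{#4:#4},d0` corresponds to `(myword>>24)&0xf` in the C code. A minimal C model of that extraction follows. The helper name `bfextu32` is made up, and the sketch only handles fields with offset + width <= 32.

```c
#include <stdint.h>

/*
 * Model of BFEXTU on a 32-bit data register.
 * offset counts from the most significant bit (bit 31 is offset 0),
 * width is 1..32. The field is returned zero-extended.
 * Assumption of this sketch: offset + width <= 32.
 */
static uint32_t bfextu32(uint32_t value, unsigned offset, unsigned width)
{
    uint32_t shifted = value >> (32u - offset - width);
    uint32_t mask = (width == 32u) ? 0xffffffffu : ((1u << width) - 1u);
    return shifted & mask;
}
```

With this model, `bfextu32(myword, 4, 4)` yields the second pixel of the group, which matches the `dda_x_skip[ 1]` line.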
 

 

What is bad about the generated code?
There are some problems. Can you explain what makes it slow?



Nixus Minimax

Posts 416
16 Feb 2018 12:10


Bad: The index used in the address for the source in the move.w is determined two instructions before it gets used. Since the ea unit is earlier in the pipeline, this means there have to be several bubbles in the pipeline before the move.w can be executed. The order of conditions could be inverted to solve this:

  bfextu d1{#4:#4},d0
  beq L6
  tst.b  (a3)
  beq L5
  move.w 2(a2,d0.l*4),(a0)
L6:
  addq.l #2,a0
L5:




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
16 Feb 2018 13:05


Nixus Minimax wrote:

Bad: The index used in the address for the source in the move.w is determined two instructions before it gets used.

Yes this is bad.

The Conditional-Branches are also not good.
Do you see ways to improve them?


Don Adan

Posts 38
16 Feb 2018 13:12


I have no information about the current 68080 timings.
A shorter version is possible without any problem, but the shortest version is not always the fastest.
Also, the current version uses a longword table, which makes little sense to me. But maybe this table is used in other places too.
A word-table version could look like this:

    tst.b (a3)    // X ZOOM MASK
    beq.b L5
    bfextu d1{#4:#4},d0  // 4bit pixel
    move.w (a0),(a2) ; backup
    move.w (a2,d0.l*2),(a0)+ ; store
  L5:
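
The trick in this version is that the background pixel is first copied into palette entry 0, so a transparent pixel (colour 0) simply writes the background back. A rough C equivalent, assuming a writable 16-entry word palette and a made-up function name:

```c
#include <stdint.h>

/*
 * Word-table variant: no test for colour 0.
 * pal[0] is overwritten with the current background pixel, so looking up
 * colour 0 returns the background and the store leaves the pixel unchanged.
 */
static uint16_t *put_pixel_backup(uint16_t *dst, uint16_t pal[16], unsigned col)
{
    pal[0] = *dst;       /* move.w (a0),(a2)          ; backup */
    *dst = pal[col];     /* move.w (a2,d0.l*2),(a0)+  ; store  */
    return dst + 1;
}
```

The inner branch is gone, but every pixel now costs one extra memory read and one extra memory write.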


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
16 Feb 2018 13:43


Don Adan wrote:

    tst.b (a3)    // X ZOOM MASK
    beq.b L5
    bfextu d1{#4:#4},d0  // 4bit pixel
    move.w (a0),(a2) ; backup
    move.w (a2,d0.l*2),(a0)+ ; store
    L5:

You trade a branch for extra memory accesses.
This is not such a good trade.
move.w (a0),(a2) ; backup
adds 1 memory read and 1 memory write.
This will slow the system down and affect the cache.

You can do this better on the 68080.
Hint: a BCC over a single instruction that does at most 1 memory access is free and is never mispredicted on the 68080.


Don Adan

Posts 38
16 Feb 2018 14:30


Gunnar von Boehn wrote:

Don Adan wrote:

      tst.b (a3)    // X ZOOM MASK
      beq.b L5
      bfextu d1{#4:#4},d0  // 4bit pixel
      move.w (a0),(a2) ; backup
      move.w (a2,d0.l*2),(a0)+ ; store
    L5:
 

  You trade a branch for extra memory accesses.
  This is not such a good trade.
  move.w (a0),(a2) ; backup
  adds 1 memory read and 1 memory write.
  This will slow the system down and affect the cache.
 
  You can do this better on the 68080.
  Hint: a BCC over a single instruction that does at most 1 memory access is free and is never mispredicted on the 68080.

The write can be pipelined, and a branch-free version may be the fastest, e.g. on the 68020. The code can be reordered:

    tst.b (a3)    // X ZOOM MASK
    beq.b L5
    move.w (a0),(a2) ; backup
    bfextu d1{#4:#4},d0  // 4bit pixel
    move.w (a2,d0.l*2),(a0)+ ; store
  L5:

I don't know how the 68080 caches work, but A0 and A2 are unchanged for both instructions, so as far as I can see nothing extra has to be calculated. The second branch is gone and there is one instruction fewer. As for the caches, the longword version uses twice the data cache space, which is fine if it really is the fastest.


Nixus Minimax

Posts 416
16 Feb 2018 14:58


Well, we could change the algorithm to always load the background pixel and to write the background pixel if col==0 and to write actual hicolor data if col!=0. Using some clever AND logic this could be used to combine the right pixel (background or newly read from palette) into the pixel to write. In this way you never branch but either rewrite the background pixel or set a new pixel. The incrementing of the pixel pointer depending on dda_x_skip can be skipped without penalty:
 
 

      bfextu d1{#4:#4},d0  // 4bit pixel
      sne    d2
      ext.w  d2
      move.w d2,d3
      not.w  d2  // inverted mask selects the background pixel
      and.w  (a0),d2  // backup background pixel
      and.w  2(a2,d0.l*4),d3
      or.w  d2,d3
      move.w d3,(a0)
      tst.b (a3)    // X ZOOM MASK
      beq L5
      addq.l #2,a0
    L5:
 

 
  This would, of course, be terribly slow on an 030. I'm pretty sure there are MMx instructions that turn all the AND/OR stuff above into a single operation.
 
  Another idea would be to use the high words of the long word palette as alpha value to indicate whether this is the 0 entry or a colour entry. This would also require some AND/OR operation like above.
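
The select logic described in this post, where the written pixel is either the old background or the palette colour, can be sketched in C as follows. The function name is made up, and the sketch covers only the colour mask, not the zoom mask.

```c
#include <stdint.h>

/*
 * Branch-free choice between background and palette colour.
 * mask is 0xFFFF when col != 0 and 0x0000 when col == 0 (transparent),
 * so exactly one of the two AND terms survives the OR.
 */
static uint16_t select_pixel(uint16_t background, uint16_t colour, unsigned col)
{
    uint16_t mask = (uint16_t)-(col != 0);
    return (uint16_t)((colour & mask) | (background & (uint16_t)~mask));
}
```

The background term needs the complement of the mask, not its negation: negating 0xFFFF gives 1, which would keep only the lowest bit of the background pixel.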
 
 


Nixus Minimax

Posts 416
16 Feb 2018 15:44


Oh, I might have to add an "and.b (a3),d2" after the "sne d2"...


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
16 Feb 2018 16:08


Don Adan wrote:

  The write can be pipelined, and a branch-free version may be the fastest, e.g. on the 68020. The code can be reordered.
 

 
Maybe, but let's clarify this:
The topic here is performance tuning of the NEO-GEO emulator code.
Please mind that the NEO-GEO emulator is a really demanding program.
Some users told me that even a 240 MHz PPC cannot run it fast.
 
Please mind that the 68020 is much too slow for this program.
Not even a 68060 overclocked to 100 MHz has any chance of running the NEO-GEO emulator at good speed.
 
So let's focus in this discussion on the only 68K CPU
with enough horsepower for this task.
Let's focus on tuning for the 68080.
 
 

  tst.b (a3)    // X ZOOM MASK
  beq.b L5 
  move.w (a0),(a2) ; backup
  bfextu d1{#4:#4},d0  // 4bit pixel
  *BUBBLE*
  move.w (a2,d0.l*2),(a0)+ ; store
  L5:
 

This code above suffers from an ALU-2-EA bubble.
 
 
Let's remove this bubble!
Nixus already correctly showed how this can be done.
 
The BFEXTU needs to be done first.
So let's do this.
 
But we can improve the code even more.
We can also remove the cost of one branch.
 

  bfextu d1{#4:#4},d0  // 4bit pixel
  tst.b (a3)    // X ZOOM MASK
  beq.b L6 
  move.w (a2,d0.l*2),d0
  beq.b L5 
  move.w D0,(a0)  ; store
  L5:
  addq.l #2,A0
  L6:
 

Putting the BFEXTU before all other instructions avoids
the bubble. Ideally, doing even 2 BFEXTUs before using their results would be perfect. So this is very good.
 
Note: the BEQ over the move.w D0,(a0) is FREE!
So this BCC costs nothing and is NEVER mispredicted.
So this is a great saving here.

But also the cost of the 1st Branch can be reduced.
Any idea how?


Nixus Minimax

Posts 416
16 Feb 2018 18:05


Gunnar von Boehn wrote:

  But also the cost of the 1st Branch can be reduced.
  Any idea how?

See my comment above: read the destination pixel and either write that or a new pixel depending on a logic combination with the skip mask.



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
16 Feb 2018 18:56


Nixus Minimax wrote:

 
Gunnar von Boehn wrote:

    But also the cost of the 1st Branch can be reduced.
    Any idea how?
 

 
  See my comment above: read the destination pixel and either write that or a new pixel depending on a logic combination with the skip mask.
 

Reading the background pixel is slow and totally unneeded.


Don Adan

Posts 38
16 Feb 2018 19:46


Gunnar von Boehn wrote:

 
Don Adan wrote:

    The write can be pipelined, and a branch-free version may be the fastest, e.g. on the 68020. The code can be reordered.
   

   
  Maybe, but let's clarify this:
  The topic here is performance tuning of the NEO-GEO emulator code.
  Please mind that the NEO-GEO emulator is a really demanding program.
  Some users told me that even a 240 MHz PPC cannot run it fast.
   
  Please mind that the 68020 is much too slow for this program.
  Not even a 68060 overclocked to 100 MHz has any chance of running the NEO-GEO emulator at good speed.
   
  So let's focus in this discussion on the only 68K CPU
  with enough horsepower for this task.
  Let's focus on tuning for the 68080.
   
   

    tst.b (a3)    // X ZOOM MASK
    beq.b L5 
    move.w (a0),(a2) ; backup
    bfextu d1{#4:#4},d0  // 4bit pixel
    *BUBBLE*
    move.w (a2,d0.l*2),(a0)+ ; store
    L5:
   

  This code above suffers from an ALU-2-EA bubble.
   
   
  Let's remove this bubble!
  Nixus already correctly showed how this can be done.
   
  The BFEXTU needs to be done first.
  So let's do this.
   
  But we can improve the code even more.
  We can also remove the cost of one branch.
   
 

    bfextu d1{#4:#4},d0  // 4bit pixel
    tst.b (a3)    // X ZOOM MASK
    beq.b L6 
    move.w (a2,d0.l*2),d0
    beq.b L5 
    move.w D0,(a0)  ; store
    L5:
    addq.l #2,A0
    L6:
   

  Putting the BFEXTU before all other instructions avoids
  the bubble. Ideally, doing even 2 BFEXTUs before using their results would be perfect. So this is very good.
   
  Note: the BEQ over the move.w D0,(a0) is FREE!
  So this BCC costs nothing and is NEVER mispredicted.
  So this is a great saving here.
 
 
  But also the cost of the 1st Branch can be reduced.
  Any idea how?
 

 
For really good speed optimisation, some statistical tests can be done: how often (a3) is zero, and how often move.w (a2,d0.l*2),d0 yields zero.
Adding more instructions wastes cache space. This is perhaps only part of the routine, I think. You could place 4 (?) bfextu instructions at the beginning of the routine, and then there would perhaps be no bubble in my code. I think that the 68060 has better bubble handling than the 68080, because some memory access routines are faster on the 68060.
 
About the NeoGeo emulator: I think that a 100 MHz 68060 with a graphics card is enough for emulation without music. Of course, some changes would have to be made to the emulator's design. Emulating the 68000 CPU on a 68060 makes no sense; it only wastes 68060 CPU power, which could be used for graphics or music emulation. The MMU can be used to catch accesses to the NeoGeo registers, as in some Atari ST emulators on the Amiga.


Don Adan

Posts 38
16 Feb 2018 19:59


BTW, here is NeoGeo emulation on a Falcon with a 16 MHz 68030 and MMU.
  EXTERNAL LINK  Done by a very good Atari ST coder, AnimaInCorpore. The 68060 and 68080 are much faster than a 16 MHz 68030.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
16 Feb 2018 20:00


Don Adan wrote:

Adding more instructions wastes cache space.

 
Yes, your 68020 had a total of 256 bytes of Icache.
I fully agree with you that size tweaking made good sense on the 68020.
 
But the 68080 has 16384 bytes of Icache.
So the world is very different now.
Whether your work loop is 240 bytes or 260 bytes in total makes no difference anymore.
 
Besides, the 68080 can load instructions from memory into the Icache at a speed of 600 MB/sec.
 
For comparison, the Icache of the 14 MHz 68020 provided 14 MB/sec on a hit!
Apollo@x11 provides 1250 MB/sec on a hit and can reach 600 MB/sec on a miss!
And the 68060@100MHz has a maximum Icache performance of 400 MB/sec on a hit.
 
As you see, the 68080 is about 40 times faster on a cache miss than the 68020 on a cache hit!
Also, the 68080 is faster on a cache miss than the 68060 on a cache hit.
   
   
Don Adan wrote:

I think that the 68060 has better bubble handling than the 68080, because some memory access routines are faster on the 68060.
 

This might be a misunderstanding.
   
Read the Motorola 68060 manual to understand the ALU-2-EA bubbles.
Motorola explained the reason behind them quite well.
The bubbles we are talking about also affect the 68060.
   
To measure memory performance you can consult BUSTEST.
Here are some values:
CLICK HERE 
VAMP = 650 MB/sec
Cyberstorm 68060 = 55 MB/sec
 
As you can clearly see, the Vamp scores over 10 times the memory speed of the 68060. :-)


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
16 Feb 2018 20:11


Don Adan wrote:

BTW, here is NeoGeo emulation on a Falcon with a 16 MHz 68030 and MMU.
  EXTERNAL LINK    Done by a very good Atari ST coder, AnimaInCorpore. The 68060 and 68080 are much faster than a 16 MHz 68030.

 
Yes - this is very cool.
But you cannot compare the two programs.
 
The ATARI program is quite cleverly made, but the emulation is not perfect and partly requires resources from the games in order to emulate them.
 
The MAME-like NEO-GEO PC emulator, on the other hand, is perfect but brute force: it needs more than 20 times the CPU power of the ATARI solution but runs every game perfectly.

I fully agree with you that the NEO-GEO emulator has major room for improvement.
We here in the forum discussed ways to speed up rendering.
And we have already accelerated rendering by 400%.
The PC NEO-GEO emulator even emulates the whole 68000 CPU.
This of course does not need to be done on AMIGA / Vamp - it could get much faster by changing this too.


Nixus Minimax

Posts 416
16 Feb 2018 21:23


Gunnar von Boehn wrote:

  Reading the background pixel is slow and totally unneeded.

With cache prefetch and good memory bandwidth it shouldn't be too bad. You'll need some scattered write operation to avoid both the branch and the reading of the background.



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
16 Feb 2018 21:37


Nixus Minimax wrote:

With cache prefetch and good memory bandwidth it shouldn't be too bad.

 
Yes, but it is much better to avoid it completely.
Let's look at a simple example.
 
The NEO-GEO emulator code tests a byte per column, pointed to by A3, to see whether it should write this pixel or skip it.
 
The C code translates to a BRANCH.
BRANCHES have well-known drawbacks.
Therefore ideally we want to avoid them.
But of course we also do not want to add unneeded memory accesses.
So how about this:

    bfextu d1{#4:#4},d0  // 4bit pixel
    move.b (a3),D7      // load ZOOM value (0/2)
    move.w (a2,d0.l*2),d0
    beq.b L5 
    storeCount D0,D7,(a0)  ; store count=0/2 byte
  L5:
    add.l D7,A0            // add 0 or 2 to pointer

 
This code is branch free!
The branch over a single instruction can be "removed" by Apollo.
The code uses a storeCount to completely avoid the penalty of the MASK compare/read.

You can clearly see how much more elegant and much faster this code is.
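
To make the idea concrete, here is a small C model of the sequence for one pixel. It is only a sketch under assumptions: `storeCount` is taken to store the low `count` bytes of the register, `count` is the zoom byte (0 or 2), and the function name is made up.

```c
#include <stdint.h>
#include <string.h>

/*
 * Model of the proposed sequence for one pixel.
 * count is the zoom value for this column: 0 (column dropped) or 2 (column kept).
 * The store writes `count` bytes, so a dropped column writes nothing,
 * and the destination always advances by `count`.
 */
static uint8_t *put_zoomed_pixel(uint8_t *dst, uint16_t colour, unsigned count)
{
    if (colour != 0)                  /* the BEQ over the single store */
        memcpy(dst, &colour, count);  /* storeCount: 0 or 2 bytes */
    return dst + count;               /* add.l D7,A0 */
}
```

The zoom mask is no longer tested at all: it is used as a byte count for the store and as the pointer increment, so the only remaining conditional is the one over the single store.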


Samuel Devulder

Posts 248
16 Feb 2018 23:06


That is rocking cool! I doubt a C compiler could ever reach that level of optimization. Only Gunnaaaaar makes it possible... EXTERNAL LINK :)


Nixus Minimax

Posts 416
17 Feb 2018 07:07


Gunnar von Boehn wrote:

  So how about this:
 

    bfextu d1{#4:#4},d0  // 4bit pixel
    move.b (a3),D7      // load ZOOM value (0/2)
    move.w (a2,d0.l*2),d0
    beq.b L5 
    storeCount D0,D7,(a0)  ; store count=0/2 byte
    L5:
    add.l D7,A0            // add 0 or 2 to pointer
 

You are skipping on black palette values again and not on transparent.



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
17 Feb 2018 08:53


Nixus Minimax wrote:

 
Gunnar von Boehn wrote:

    So how about this:
   

      bfextu d1{#4:#4},d0  // 4bit pixel
      move.b (a3),D7      // load ZOOM value (0/2)
      move.w (a2,d0.l*2),d0
      beq.b L5 
      storeCount D0,D7,(a0)  ; store count=0/2 byte
      L5:
      add.l D7,A0            // add 0 or 2 to pointer
   

 

 
  You are skipping on black palette values again and not on transparent.
 
 

The code above works.
The NEO-GEO emulator uses 15-bit colors.
This means 1 bit is unused in the WORD.
The palette can use this 1 bit to signal transparency.
Depending on whether you mark transparent or
non-transparent entries, you can use BMI or BEQ.
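
A small C sketch of the "mark transparent" variant, which corresponds to the BMI test. The flag name, the function names and the choice of bit 15 for the spare bit are assumptions made for illustration.

```c
#include <stdint.h>

#define PAL_TRANSPARENT 0x8000u   /* hypothetical flag: the spare 16th bit */

/* Build a 16-entry palette from 15-bit colours; entry 0 is the transparent pen. */
static void build_palette(const uint16_t rgb15[16], uint16_t pal[16])
{
    for (int i = 0; i < 16; i++)
        pal[i] = (uint16_t)(rgb15[i] & 0x7fffu);
    pal[0] |= PAL_TRANSPARENT;
}

/* Equivalent of BMI after the palette load: the top bit marks transparency. */
static int is_transparent(uint16_t entry)
{
    return (entry & PAL_TRANSPARENT) != 0;
}
```

With this layout an opaque black entry (value 0) is still drawn, because only the flag bit decides whether a pixel is skipped.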
 
