Overview Features Coding ApolloOS Performance Forum Downloads Products Order Contact

Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.
Please read the forum usage manual.
Please visit our Apollo-Discord Server for support.



All TopicsNewsPerformanceGamesDemosApolloVampireAROSWorkbenchATARIReleases
Performance and Benchmark Results!

68k Coding Challengepage  1 2 3 

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
08 Feb 2018 10:40


This is an invitation to show your 68K coding skills.

Lets have some fun and show us you ideas how to optimize a short work loop

The Loop is the key copy routine of a popular NEO GEO emulator:

Pseudo code:

                for(y=0;y<zy;y++) {

                    if (gfxdata[1] || gfxdata[0]) {
                    myword = gfxdata[0];
                    col=(myword>>28)&0xf; if (col) PUTPIXEL(br[0],paldata[col]);
                    col=(myword>>24)&0xf; if (col) PUTPIXEL(br[1],paldata[col]);
                    col=(myword>>20)&0xf; if (col) PUTPIXEL(br[2],paldata[col]);
                    col=(myword>>16)&0xf; if (col) PUTPIXEL(br[3],paldata[col]);
                    col=(myword>>12)&0xf; if (col) PUTPIXEL(br[4],paldata[col]);
                    col=(myword>>8)&0xf; if (col) PUTPIXEL(br[5],paldata[col]);
                    col=(myword>>4)&0xf; if (col) PUTPIXEL(br[6],paldata[col]);
                    col=(myword>>0)&0xf; if (col) PUTPIXEL(br[7],paldata[col]);

                    myword = gfxdata[1];
                    col=(myword>>28)&0xf; if (col) PUTPIXEL(br[8],paldata[col]);
                    col=(myword>>24)&0xf; if (col) PUTPIXEL(br[9],paldata[col]);
                    col=(myword>>20)&0xf; if (col) PUTPIXEL(br[10],paldata[col]);
                    col=(myword>>16)&0xf; if (col) PUTPIXEL(br[11],paldata[col]);
                    col=(myword>>12)&0xf; if (col) PUTPIXEL(br[12],paldata[col]);
                    col=(myword>>8)&0xf; if (col) PUTPIXEL(br[13],paldata[col]);
                    col=(myword>>4)&0xf; if (col) PUTPIXEL(br[14],paldata[col]);
                    col=(myword>>0)&0xf; if (col) PUTPIXEL(br[15],paldata[col]);
                    }
                    br+=(buffer->pitch>>1);
                    gfxdata+=2;
              }

The code copies a sprite (16x16) to the screen
The screen buffer is 16bit hicolor.
The sprite data is 4bit (16color)
The copy uses 3 pointers
gfxdata = pointer to 4bit sprite data
paldata = pointer to colomap of sprite (16 colors)
br      = screen pointer

The current C code needs about 10,000 clock for a 16x16 sprite.
Which is a lot.

Show us your ideas to tune it!


Manfred Bergmann

Posts 226
08 Feb 2018 11:29


I can only clean the code up a bit, but not speed it up. Don't have the knowledge about that.
 

for(y=0;y<zy;y++) {
  if (gfxdata[1] || gfxdata[0]) { 
  shiftbits = 0;
  for(brn = 0;brn < 16;brn++) {
    if(brn == 0 || brn == 8) {
      myword = gfxdata[brn == 0 ? 0 : 1];
    shiftbits = 28;
  }
  col=(myword>>shiftbits)&0xf; if (col) PUTPIXEL(br[brn],paldata[col]);
  shiftbits -= 4;
  }
  }
  br+=(buffer->pitch>>1);
  gfxdata+=2;
}



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
08 Feb 2018 12:09


A little background information
The NEO-GEO is was a very nice game console.
The NEO-GEO used an 68000 CPU like the AMIGA.
 
The NEO-GEO had very powerful Sprite freatures
It could display up to 380 Sprites on Screen.
Each sprite had 16 colors and could use a different palette.
 
To emulate the NEO-GEO the above sprite copy code would need be called up to 380 times per frame.

The current C code need ~ 10.000 cycle per sprite
Which means you need a CPU @ 200 MHz for fullspeed.

Obviously we need to tune this, ideally in 68K ASM.


Szyk Cech

Posts 191
08 Feb 2018 12:45


Manfred Bergmann wrote:

I can only clean the code up a bit

Don't be ridiculous! - if you want compete - then you need use AMMX instructions power...



Manfred Bergmann

Posts 226
08 Feb 2018 13:07


Szyk Cech wrote:

 
Manfred Bergmann wrote:

  I can only clean the code up a bit
 

  Don't be ridiculous! - if you want compete - then you need use AMMX instructions power...
 
 

  I don't want to compete. Sorry.
OK, but it's about enhancing it using AMMX?



Thellier Alain

Posts 141
08 Feb 2018 14:06


Here what come to my mind at first
  I assume that the cpu can do 64bits reads
 
  for(y=0;y<zy;y++)
  {
  register UINT64 myint64 = *gfxdata64++;
  register ULONG  pitch=(buffer->pitch>>1);
  register ULONG  mask4=&0xf;
  register ULONG  shif4=4;
  register UBYTE  col,col8;
 
  if (myint64)
  {
  col8=myint64; myint64=myint64>>8;
  col=col8&mask4;  if (col) PUTPIXEL(br[15],paldata[col]);
  col=col8>>shif4; if (col) PUTPIXEL(br[14],paldata[col]);
  col8=myint64; myint64=myint64>>8;
  col=col8&mask4;  if (col) PUTPIXEL(br[13],paldata[col]);
  col=col8>>shif4; if (col) PUTPIXEL(br[12],paldata[col]);
  col8=myint64; myint64=myint64>>8;
  col=col8&mask4;  if (col) PUTPIXEL(br[11],paldata[col]);
  col=col8>>shif4; if (col) PUTPIXEL(br[10],paldata[col]);
  col8=myint64; myint64=myint64>>8;
  col=col8&mask4;  if (col) PUTPIXEL(br[09],paldata[col]);
  col=col8>>shif4; if (col) PUTPIXEL(br[08],paldata[col]);
 
  col8=myint64; myint64=myint64>>8;
  col=col8&mask4;  if (col) PUTPIXEL(br[07],paldata[col]);
  col=col8>>shif4; if (col) PUTPIXEL(br[06],paldata[col]);
  col8=myint64; myint64=myint64>>8;
  col=col8&mask4;  if (col) PUTPIXEL(br[05],paldata[col]);
  col=col8>>shif4; if (col) PUTPIXEL(br[04],paldata[col]);
  col8=myint64; myint64=myint64>>8;
  col=col8&mask4;  if (col) PUTPIXEL(br[03],paldata[col]);
  col=col8>>shif4; if (col) PUTPIXEL(br[02],paldata[col]);
  col8=myint64; myint64=myint64>>8;
  col=col8&mask4;  if (col) PUTPIXEL(br[01],paldata[col]);
  col=col8>>shif4; if (col) PUTPIXEL(br[00],paldata[col]);   
  }
  br+=pitch;
  }
     

     
 
 
 


Markus (mfro)

Posts 99
08 Feb 2018 15:03


1.)
 
  what does PUTPIXEL() do exactly? Is it just a macro for
 
  *(br + x) = col;
 
  or is there more to it???
 
  2.)
  according to the algorithm, color 0 in sprite data appears to be considered transparent (so you are looking for a little more than just a plain memcpy() of bit-expanded data, right)? 


Artur Jarosik
(Apollo Team Member)
Posts 94
08 Feb 2018 15:04


#define PUTPIXEL(dst,src) dst=src



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
08 Feb 2018 15:07


thellier alain wrote:

Here what come to my mind at first
I assume that the cpu can do 64bits reads 

Yes the CPU can do 64bit operation.

The general problem and main bottleneck of the code
might become more clear when we look at the ASM level.

The Code for 1 pixel will looks somewhat like this


  BFEXTU D1{28:4},D0      ; EXTRACT 4Bit color
  beq  transparent
  move.w (A0,D0.L*2),(A1)  ; copy color value
transparent:

This code looks not complex with 3 instructions.
But its relative slow because of 2 problems.

a) The BEQ is behaving "random" and can not be predicted
as it does depend on the location of transparent pixel
which will change in every row, every sprite.
So we have to expect a high branch misspredict rate.

b) The color move.w (A0,D0.L*2), does depend on the INDEX Register.
This index register is updated shortly before.
This means this will create an ALU-2-EA dependency stall, as explained in the Motorola 68060 manual.

So these 3 instruction will run slow.




Samuel Devulder

Posts 248
08 Feb 2018 15:30


Quick remarks on a pause at work :)
   
The algorithm does col=(myword>>28)&0xf then col=(myword>>24)&0xf then col=(myword>>20)&0xf etc. This is lot of redundant work. For instance to shift by 28 we must first shift by 24 which will be needed in a later stage. This is dumb. We'd better do this the other way around: getting shift-by-24 from the value of shift-by-20. This means works right to left with 4 bits steps. This what Alain's solution intended to do. As a side effect, this also gets rid of the infamously slow operation for non 68040[*] cpus (BFEXTU):

    if (col=(myword&0xF)) PUTPIXEL(br[7],paldata[col]); myword >>= 4;
    if (col=(myword&0xF)) PUTPIXEL(br[6],paldata[col]); myword >>= 4;
    if (col=(myword&0xF)) PUTPIXEL(br[5],paldata[col]); myword >>= 4;
    ...
    if (col=(myword&0xF)) PUTPIXEL(br[2],paldata[col]); myword >>= 4;
    if (col=(myword&0xF)) PUTPIXEL(br[1],paldata[col]); myword >>= 4;
    if (col=myword)      PUTPIXEL(br[0],paldata[col]); myword >>= 4;
 

  Then paldata[col] will be bad if col is typed as (U)WORD or (U)BYTE. The C compiler will extend it to LONG. Better type col as ULONG in the first place. But the optimization is not rewriting C to force the compiler do what we would write directly in asm. So let's do it the ASM way then:
   
  Ideally we can arrange to have myword in some data reg (say D1), and col in D0. Then paldata in A0 and &br[0] in A0. The resulting asm might look like this:
 

    moveq #15,d0
    and.l d1,d0
    beq next
    move.w (a0,d0.l*2),7(A1)
 
next:
    lsr.l #4,d1
    moveq #15,d0
    and.l d1,d0
    beq next2
    move.w (a0,d0.l*2),6(A1)
 
next2: 
    lsr.l #4,d1
    moveq #15,d0
    and.l d1,d0
    beq next3
    move.w (a0,d0.l*2),5(A1)
    next3:
    ...
 
next7:
    lsr.l #4,d1
    beq next8
    move.w (a0,d1.l*2),(A1)
 
next8:
      .. (same with gfx data1)

But this is plain 68030 code. It has a lot of defects as the move.qw <mem>,<mem> which put stress on the memory unit and the EA calculations. Using AMMX extensions we can do much better. I let people read the doc to find an interesting opcode in there: http://www.apollo-core.com/AMMX.doc.txt...
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
08 Feb 2018 15:59


Samuel Devulder wrote:

    This will get rid of the BFEXTU operations
   

        lsr.l #4,d1
        moveq #15,d0
        and.l d1,d0
   

   
  Yes this change , will help CPUs which are very slow on Bitfields.
 
  But the 68080 can do Bitfields very fast.
  The BFEXTU instruction for example, needs only a single cycle.
  The Bitfield is therefore not a bottleneck and replacing it with 3 other instructions will not improve speed.
   
  But a real bottleneck is the unpredictable BCC.
 


Samuel Devulder

Posts 248
08 Feb 2018 16:10


A hint: doing

    ULONG u = myword, t =0;
    t |= u&0x11111111; u>>=1; 
    t |= u&0x11111111; u>>=1;
    t |= u&0x11111111; u>>=1;
    t |= u&0x11111111;
gives a nice bitmask telling if the corresponding color is 0 or not. And a specific AMMX code likes such bitmasks.
 
[EDIT] Here is another way to make this 01010101 mask indicating if any of the 4 bits slice is one or not:
 
MASK = (((myword & 0x77777777)+0x77777777)|mywords)>>4)&0x01010101
(idea inspired from https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord)


Samuel Crow

Posts 424
08 Feb 2018 17:13


Since we know we will be using AMMX, the two duplicated longword unrolled loops are redundant.  We can use a single quadword set of operations.  That said, the smallest unit in an AMMX vector is the byte size element.  Since we want to do nybble size operations we'll have to mask and shift and do each operation twice per vector.

Also, the loops all count upward and compare to a non-zero constant inside the loop.  Reversing them so they count down to zero will make a substantial difference by eliminating the need for the comparison.

Lastly, the for-loop structure in C generates a pre-test loop which means there are always two branches generated.  If we are looping a fixed number of times, this is unneeded.  A post-test loop generates only one branch so it is more efficient.

In C99 this will look something like:


void drawSprite(short *br, short *paldata, long long *gfxdata, int zx, short pitch)
{
  unsigned long long myVec, evenPix, oddPix;
  int y, count;
  char pixel;
  const long long mask=0x0f0f0f0f0f0f0f0fLL;

  y=zx-1;
  do
  {
    // load a row of sprite data
    myVec= *gfxdata++;
    if (myVec)
    {
      // load and mask pixels into even and odd pixel buffers
      evenPix=mask & myVec;
      oddPix=~mask & myVec >> 4;  // shift makes the nybbles low

      count = 8-1;
      do
      {
        pixel = (char)(evenPix>>56);
        *br++ = pixel?paldata[pixel]:*br;
        evenPix<<=8;
        pixel = (char)(oddPix>>56);
        *br++ = pixel?paldata[pixel]:*br;
        oddPix<<=8;
      } until (count--);
    } // endif
    br+=pitch-16*sizeof(short); // assuming buffer->pitch>>1 passed in as parameter
  } until(y--);
}

As Gunnar says though, the part where the transparency is processed is the kicker.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
08 Feb 2018 17:58


One proposal:
 
 

    BFEXTU D0{28:4},D1      ; EXTRACT 4Bit color
    BFEXTU D0{24:4},D2      ; EXTRACT 4Bit color
    BFEXTU D0{20:4},D3      ; EXTRACT 4Bit color
    BFEXTU D0{16:4},D4      ; EXTRACT 4Bit color
 
    move.w (A0,D1.L*2),D1
    beq  z1
    move.w D1,(A1)
  z1
 
    move.w (A0,D2.L*2),D2
    beq  z2
    move.w D2,2(A1)
  z2
 
    move.w (A0,D3.L*2),D3
    beq  z3
    move.w D3,4(A1)
  z3
 
    move.w (A0,D4.L*2),D4
    beq  z4
    move.w D4,6(A1)
  z4
 

 
  What do you think about the above code?
  I would assume it runs 2-3 times faster


Samuel Devulder

Posts 248
08 Feb 2018 18:23


Since BFEXTU doesn't costs much, it eliminates the need for masking and shifting. Sure, this almost free BFEXTU require changing coding habits!
 
  Though, there still remain the unpredictable aspects of the beq.
 
Another aspect is that now we don't check transparency anymore on col==0, but on paldata[col] being zero. It means we assume that color index 0 is mapped onto R5G6B5 to 0 (black). I.e. all blacks are transparent. All, blacks, including the ones not at index #0. This isn't strictly equivalent as the original code especially if the sprites contains a non-transparent black color in its palette.


Fabrice "Lexo" L.

Posts 2
08 Feb 2018 18:26


Problem, with this method, is that you move the transparency check on the color value and not on the colormap index.
  So, if you have a black pixel that is not transparent on your sprite it will not be displayed.

An old trick with 16bits sprites was to generate directly the rendering code, and avoid all the transparency check, but i don't think it could work for an emulator, cause if the sprite change you have to regenerate the code.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
08 Feb 2018 18:31


Samuel Devulder wrote:

Since BFEXTU doesn't costs much, it eliminates the need for masking and shifting. Sure, this almost free BFEXTU require changing coding habits!
 
Though, there still remain the unpredictable aspects of the beq.

Actually, the BCC is not unpredictable anymore
A BCC over a SINGLE instruction which does max 1 memory access
is on APOLLO FREE and never mispredicted.

With the code change both conditions are now valid.
This means the BCC is FREE!

Samuel Devulder wrote:

Another aspect is that now we dont test on col==0, but on colormap[col] being zero. Tis mean we ssume that color index 0 is mapped onto RGB 000 hence that black is transparent. This isn't strictly equivalent especially if the sprites contains the black color in its palette.

Yes this is true.
This is a hack.
But if we for example convert black to 16bit RGB with R=0,G=0,B=1
Then black becomes a very dark blue, which depending on your monitor settings could look nearly like black.


Samuel Devulder

Posts 248
08 Feb 2018 19:16


Ah cheating! Yeah cheating is what makes programming for the amiga so special and so fun :D
   
Nice hack!
   
Knowing that b<cc> over one simple instruction is free is also a very nice feature of the 68080. Does this mean that if the instruction itself doesn't change the flags we can skip over several instructions with free b<cc>? I.e. does this code:

      bcc l3
      lea (a2,a3.w*4),a4
      adda.l d1,a5
  l3:
(presumably 3 cycles) being rewritten as:

      bcc l1
      lea (a2,a3.w*4),a4
  l1:
      bcc l2
      adda.l d1,a5
  l2:
benefits from free bcc (i.e. takes 2 cycles) ?
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
08 Feb 2018 19:28


Samuel Devulder wrote:

Does this mean that if the instruction itself doesn't change the flags we can skip over several instructions with free b<cc> like here ?

    bcc l1
    lea (a2,a3.w*4),a4
  l1:
    bcc l2
    adda.l d1,a5
  l2:
 

Ok let us explain this.

MOTOROLA designed the 68K ISA as CPU with 2 reg files

Regfile with Pointers
A0-A7

Regfile with Work-Register
D0-D7

Motorola also designed several CPUs as chips with separate EA unit and ALU Units.

The 68060 for example looks like this


AN-Regfile
EA-Unit-1    EA-unit-2

DCache
   
DN-Regfile
ALU-Unit-1    ALU-unit-2


   
68080 is an improved 68060 and shares the same design.

Instructions altering AN registers are executed in the EA units.
Instructions working on Memory /Data regs are executed in the ALU Unit.

The flags are generated in the ALU.
The free BCC needs the flags which are generated in the ALU
and therefore only works for ALU instructions.


Samuel Crow

Posts 424
08 Feb 2018 19:33


Samuel Devulder wrote:

Knowing that b<cc> over one simple instruction is free is also a very nice feature of the 68080. Does this mean that if the instruction itself doesn't change the flags we can skip over several instructions with free b<cc> like the following ?

      bcc l3
      lea (a2,a3.w*4),a4
      adda.l d1,a5
  l3:
 

  is rewritten as:
 

      bcc l1
      lea (a2,a3.w*4),a4
  l1:
      bcc l2
      adda.l d1,a5
  l2:
 
and benefit from free bcc.

Does LEA affect the flags ever?  This may be a bad example.  Anyway, the Bcc over a single opcode is a prefix for predication that is made possible by Gunnar's opcode fusion detection.  It should only suppress the flags if the branch is taken.

posts 47page  1 2 3