




68k Coding Challenge

Markus (mfro)

Posts 99
08 Feb 2018 20:29


4 bit color depth -> 2 pixels per byte on the source raster, 32 bits for two pixels on the destination.

Precalculate a lookup table with all possible double pixel destination raster combinations ('extended palette' of 256 longwords = 1024 bytes) and you only need 3 shifts+4 lookups/dest writes per source longword (8 pixels).

Limited to multiples of 8 pixel widths the easy way, however.
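
A minimal C sketch of the pair lookup table described above. The names build_pair_table and expand8, and the pixel order (leftmost pixel in the most significant nibble, first pixel of a pair in the high word), are assumptions made for illustration and are not taken from the thread. The sketch does not handle a transparent color.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 256-entry "extended palette": entry b holds the two 16-bit
   destination pixels for source byte b (high nibble first, assumed order). */
static void build_pair_table(const uint16_t pal[16], uint32_t pair[256])
{
    assert(pal != 0 && pair != 0);
    for (unsigned b = 0; b < 256; b++)
        pair[b] = ((uint32_t)pal[b >> 4] << 16) | pal[b & 0x0f];
}

/* Expand one source longword (8 pixels) into four destination longwords
   (two 16-bit pixels each): 3 shifts and 4 lookups/writes. */
static void expand8(uint32_t src, const uint32_t pair[256], uint32_t dst[4])
{
    dst[0] = pair[src >> 24];
    dst[1] = pair[(src >> 16) & 0xff];
    dst[2] = pair[(src >> 8) & 0xff];
    dst[3] = pair[src & 0xff];
}
```

The table costs 1024 bytes and has to be rebuilt whenever the 16-color palette changes.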



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
08 Feb 2018 20:42


Markus (mfro) wrote:

Precalculate a lookup table with all possible double pixel destination raster combinations ('extended palette' of 256 longwords = 1024 bytes) 

 
Yes this is also a nice idea.

We could also try another trick.
We have relatively many registers,
and Apollo supports INDIRECT register loads.
We could load the colors into registers and translate the colors using register-only loads. This would remove a lot of Dcache/memory accesses.


Thellier Alain

Posts 141
09 Feb 2018 09:10


There is one way to remove ALL the "if (col)" tests.
 
  Let's have a "pointer palette" called palptr:
  palptr[0]=currentpixelptr;
  palptr[1]=&paldata[1];
  palptr[2]=&paldata[2];
  palptr[3]=&paldata[3];
  ...
  palptr[15]=&paldata[15];
 
  so now palptr[col] tells where (ptr) the next pixel color is read from (palette or screen).
 
  Unfortunately this means color00 needs a read access too, but it may be faster anyway if color00 is infrequent
(remember that "if (myint64)" already removes lots of those color00 pixels)
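
A minimal C sketch of the pointer palette for a single pixel, to make the read and write traffic visible. The function names init_palptr and put_pixel_branchless are hypothetical; the 16-bit pixel type is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Point entries 1..15 (and, initially, 0) at the palette colors. */
static void init_palptr(const uint16_t paldata[16], const uint16_t *palptr[16])
{
    for (int i = 0; i < 16; i++)
        palptr[i] = &paldata[i];
}

/* Draw one 4-bit pixel without an "if (col)" test. */
static void put_pixel_branchless(uint16_t *dst, unsigned col,
                                 const uint16_t *palptr[16])
{
    assert(col < 16);
    palptr[0] = dst;      /* index 0 (transparent) reads back the screen pixel */
    *dst = *palptr[col];  /* double indirection: pointer table, then color */
}
```

For col == 0 the pixel is read from the screen and written back unchanged, so the result is the same as skipping it.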




Samuel Devulder

Posts 248
09 Feb 2018 10:10


Markus (mfro) wrote:
Precalculate a lookup table with all possible double pixel destination raster combinations ('extended palette' of 256 longwords = 1024 bytes) and you only need 3 shifts+4 lookups/dest writes per source longword (8 pixels).

I don't see how this technique will handle the transparent color.
     
@Alain, interesting. But we don't even need the palptr table and the double indirection it requires. We can simply write the current pixel value into paldata[0]. However, both of these solutions require an extra step for each pixel (overwriting either paldata[0] or palptr[0]). This adds 2 cycles or more depending on memory contention. Better to use the free b<cc> solution, IMHO.
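
For comparison, a small C sketch of the two alternatives mentioned here: copying the current screen pixel into paldata[0], and the plain conditional write that the b<cc> version corresponds to. The function names are hypothetical and the 16-bit pixel type is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1: make palette entry 0 equal to the screen pixel, then always write.
   Costs one extra screen read and one palette write on every pixel. */
static void put_pixel_pal0(uint16_t *dst, unsigned col, uint16_t paldata[16])
{
    assert(col < 16);
    paldata[0] = *dst;
    *dst = paldata[col];   /* single indirection */
}

/* Variant 2: test the index and skip the write for the transparent color. */
static void put_pixel_branch(uint16_t *dst, unsigned col,
                             const uint16_t paldata[16])
{
    assert(col < 16);
    if (col != 0)
        *dst = paldata[col];
}
```

Both leave a transparent pixel untouched on screen; they differ only in how much memory traffic they generate per pixel.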


Thellier Alain

Posts 141
09 Feb 2018 10:53


With your method, writing the current pixel value to palette0 needs one read and one write per pixel.
  With mine, writing currentpixelptr needs one write, and one read ONLY if color0.
  Edit: I mean a read from the screen, not from the palette.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
09 Feb 2018 11:05


Alain, how about we write the proposal in ASM?
I think it will then be even clearer how much time the ideas save.


Thellier Alain

Posts 141
09 Feb 2018 14:15


Hi Gunnar,
I can't write good ASM: I haven't written ASM for 20 years :-(

Here is the cleaned-up C source (compilation tested).
I tried to compile it to ASM with GCC 2.95 but the compiler didn't want to put shif4 and mask4 in registers.

#include <exec/types.h>  /* AmigaOS types: UBYTE, UWORD, ULONG, APTR */

typedef unsigned long long UINT64;
/*=================================================================*/
#define PUTPIXEL(dst,src) dst=src;
/*=================================================================*/
UWORD  paldata[16];
UWORD* palptr[16];
/*=================================================================*/
void DrawPixels(APTR gfxdata,UWORD *br,ULONG zy,ULONG bufferpitch)
{
register UINT64* gfxdata64=gfxdata;
register UINT64  myint64;
register ULONG y;
register ULONG pitch=(bufferpitch>>1);
register ULONG mask4=0xf;
register ULONG shif4=4;
register UBYTE col8;
register UBYTE col;
 
/* palette pointers */
  for(y=0;y<16;y++)
  palptr[y]=&paldata[y];

// do pixels: each 64-bit source word holds 16 four-bit pixels
  for(y=0;y<zy;y++)
  {
  myint64 = *gfxdata64++;
 
  if (myint64)  // if not fully transparent
  {
  col8=myint64;  // get lower 8 bits
  myint64=myint64>>8;  // go next 8 bits
  col=col8 &mask4; palptr[0]=&br[15]; PUTPIXEL(br[15],*(palptr[col]));
  col=col8>>shif4; palptr[0]=&br[14]; PUTPIXEL(br[14],*(palptr[col]));
  col8=myint64;  // get lower 8 bits
  myint64=myint64>>8;  // go next 8 bits
  col=col8 &mask4; palptr[0]=&br[13]; PUTPIXEL(br[13],*(palptr[col]));
  col=col8>>shif4; palptr[0]=&br[12]; PUTPIXEL(br[12],*(palptr[col]));
  col8=myint64;  // get lower 8 bits
  myint64=myint64>>8;  // go next 8 bits
  col=col8 &mask4; palptr[0]=&br[11]; PUTPIXEL(br[11],*(palptr[col]));
  col=col8>>shif4; palptr[0]=&br[10]; PUTPIXEL(br[10],*(palptr[col]));
  col8=myint64;  // get lower 8 bits
  myint64=myint64>>8;  // go next 8 bits
  col=col8 &mask4; palptr[0]=&br[ 9]; PUTPIXEL(br[ 9],*(palptr[col]));
  col=col8>>shif4; palptr[0]=&br[ 8]; PUTPIXEL(br[ 8],*(palptr[col]));
   
  col8=myint64;  // get lower 8 bits
  myint64=myint64>>8;  // go next 8 bits
  col=col8 &mask4; palptr[0]=&br[ 7]; PUTPIXEL(br[ 7],*(palptr[col]));
  col=col8>>shif4; palptr[0]=&br[ 6]; PUTPIXEL(br[ 6],*(palptr[col]));
  col8=myint64;  // get lower 8 bits
  myint64=myint64>>8;  // go next 8 bits
  col=col8 &mask4; palptr[0]=&br[ 5]; PUTPIXEL(br[ 5],*(palptr[col]));
  col=col8>>shif4; palptr[0]=&br[ 4]; PUTPIXEL(br[ 4],*(palptr[col]));
  col8=myint64;  // get lower 8 bits
  myint64=myint64>>8;  // go next 8 bits
  col=col8 &mask4; palptr[0]=&br[ 3]; PUTPIXEL(br[ 3],*(palptr[col]));
  col=col8>>shif4; palptr[0]=&br[ 2]; PUTPIXEL(br[ 2],*(palptr[col]));
  col8=myint64;  // get lower 8 bits
  myint64=myint64>>8;  // go next 8 bits
  col=col8 &mask4; palptr[0]=&br[ 1]; PUTPIXEL(br[ 1],*(palptr[col]));
  col=col8>>shif4; palptr[0]=&br[ 0]; PUTPIXEL(br[ 0],*(palptr[col]));
  }
  br+=pitch;
  }

}



Samuel Devulder

Posts 248
09 Feb 2018 16:57


I noticed
*(palptr[col])
which is a double indirection. It costs 2 memory reads, on top of 2 write cycles (one for palptr[0] and one for the final pixel). Add the masking/shifting, and this method costs at least 5 or 6 cycles per pixel.
 
The free b<cc> is about 3 cycles/pixel, twice as fast. So far it is the fastest code. Congrats BigGun!


Nixus Minimax

Posts 416
09 Feb 2018 17:34


I wouldn't give up on the two-pixel approach yet. To deal with transparent pixels you need to check three cases in this sequence: (1) byte == 0: two transparent pixels, skip everything; (2) byte < 16: one transparent pixel and one coloured pixel, so do the lookup and write only the second pixel; (3) otherwise do the lookup and AND the low four bits of the byte: if the result is non-zero write both pixels, if it is zero skip the second pixel, SWAP and write only the first pixel.
 
  It would look something like this (the offsets would still have to be calculated, I'm too tired for that now):
 
 
 
    move.l  (a0)+,d0
    move.b  d0,d1
    beq  nextpair
    cmp.b  #15,d1
    bhi  .upper
    move.w 2(a1,d1.b*2),offset(a2)
    bra  nextpair
  .upper:
    move.l  (a1,d1.b*2),d2
    and.b  #$0f,d1
    bne  .twopixels
    swap  d2
    move.w d2,offset(a2)
    bra  nextpair
  .twopixels:
    move.l  d2,offset(a2)
  nextpair:
    lsr.l #8,d0
 
  ...and loop
 

 
  Depending on the sprite data, it could be fast.
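
The three cases can also be sketched in C. This is only an illustration under assumptions: the names build_pair_table and put_pair are hypothetical, the first pixel of a pair is assumed to sit in the high nibble and the high word of the table entry, and the two-pixel case writes two halfwords instead of one longword so that the sketch does not depend on byte order.

```c
#include <assert.h>
#include <stdint.h>

/* 256-entry table: entry b holds both destination pixels for source byte b. */
static void build_pair_table(const uint16_t pal[16], uint32_t pair[256])
{
    for (unsigned b = 0; b < 256; b++)
        pair[b] = ((uint32_t)pal[b >> 4] << 16) | pal[b & 0x0f];
}

/* Draw the two pixels encoded in one source byte, index 0 = transparent. */
static void put_pair(uint16_t dst[2], uint8_t byte, const uint32_t pair[256])
{
    assert(dst != 0 && pair != 0);
    if (byte == 0)                      /* case 1: both pixels transparent */
        return;
    uint32_t two = pair[byte];
    if (byte < 16) {                    /* case 2: first pixel transparent */
        dst[1] = (uint16_t)two;
        return;
    }
    dst[0] = (uint16_t)(two >> 16);     /* case 3: first pixel is visible */
    if (byte & 0x0f)                    /* second pixel only if non-zero */
        dst[1] = (uint16_t)two;
}
```

How well this performs depends on how often each case occurs in the sprite data, since every case ends in a different branch.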
 
 
  Other than that I can only come up with this simple code:
 
 
    move.l  (a0)+,d0
    moveq  #0,d2
    move.l  d0,d1
    and.l  #$f0f0f0f0,d0
    and.l  #$0f0f0f0f,d1
    lsr.l  #4,d0
 
    move.b  d1,d2
    beq  .skip0
    move.w (a1,d2.b*2),offset(a2)
  .skip0:
    lsr.l  #8,d1
 
    move.b  d0,d2
    beq  .skip1
    move.w (a1,d2.b*2),offset(a2)
  .skip1:
    lsr.l  #8,d0
 
  ...and so on for the other pixels.
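
The same mask-and-shift idea, written as a C sketch for one source longword. The function name draw8_masked and the destination order (leftmost pixel in the most significant nibble) are assumptions; the assembly above leaves the destination offsets open.

```c
#include <assert.h>
#include <stdint.h>

/* Draw 8 four-bit pixels from one longword; index 0 is transparent. */
static void draw8_masked(uint32_t src, const uint16_t pal[16], uint16_t dst[8])
{
    assert(pal != 0 && dst != 0);
    uint32_t hi = (src & 0xf0f0f0f0u) >> 4;  /* high nibble of each byte */
    uint32_t lo = src & 0x0f0f0f0fu;         /* low nibble of each byte */

    /* Walk from the least significant byte (rightmost pixel pair) upwards. */
    for (int i = 0; i < 4; i++) {
        unsigned c_lo = lo & 0xff;
        unsigned c_hi = hi & 0xff;
        if (c_lo != 0)
            dst[7 - 2 * i] = pal[c_lo];
        if (c_hi != 0)
            dst[6 - 2 * i] = pal[c_hi];
        lo >>= 8;
        hi >>= 8;
    }
}
```

After the two AND operations every byte of lo and hi already holds a ready-to-use index, so each pixel needs only a byte move, a test and a shift.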
 

  BTW, Gunnar's similar code above with the four Bitfield Extracts doesn't work for transparent pixels: it skips on black pixels, which is why the extra move.b/lsr is needed.
 
 


Denis Markovic

Posts 4
09 Feb 2018 18:03


You wrote that this routine is used to copy a sprite to a screen. I assume that the sprite is either static or does not change too often.

I am not sure if I remember right, but doesn't the 68080 support self-modifying code?

If not, the following idea is useless, but if it does:

What about using a pre-initialization step to generate, on the fly, a function made of move instructions?

You could create 16x16 pure move instructions moving constants to memory via address post-increment, with an additional increment other than 2 at row ends. 0/transparent pixels would create a greater increment in the previous non-zero move. You might even consider using moves wider than 16 bits if unaligned writes are supported.

This generated code block could then always be called as a function with the current screen location pointer as input.

Of course, this is only useful if there is no cache problem with code generated on the fly, and only if the number of copies to the screen is much higher than the cost of creating this piece of code. It also needs more memory/code generation for animated "sprites"/bobs.
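
Generating machine code cannot be shown portably in C, so the following sketch swaps in an interpreted analogue of the idea: a sprite row is "compiled" once into a list of (skip, color) operations that is then replayed for every copy. The type SpriteOp and the functions compile_row and run_row are hypothetical names, and the skip is stored before each write rather than folded into the previous move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t skip;   /* transparent pixels to step over before the write */
    uint16_t color;  /* constant destination pixel value */
} SpriteOp;

/* "Compile" n palette indices into ops; returns the number of ops written.
   The caller provides room for up to n ops. */
static size_t compile_row(const uint8_t *idx, size_t n,
                          const uint16_t pal[16], SpriteOp *ops)
{
    size_t count = 0;
    uint16_t skip = 0;
    for (size_t i = 0; i < n; i++) {
        if (idx[i] == 0) {          /* transparent: only grow the increment */
            skip++;
            continue;
        }
        ops[count].skip = skip;
        ops[count].color = pal[idx[i] & 0x0f];
        count++;
        skip = 0;
    }
    return count;
}

/* Replay the compiled row at the given screen position. */
static void run_row(uint16_t *dst, const SpriteOp *ops, size_t count)
{
    assert(dst != 0);
    for (size_t i = 0; i < count; i++) {
        dst += ops[i].skip;
        *dst++ = ops[i].color;
    }
}
```

As with generated code, this only pays off when the same sprite and palette combination is drawn many more times than it is compiled.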



Matthew Burroughs

Posts 59
09 Feb 2018 18:41


Gunnar von Boehn wrote:

A little background information
  The NEO-GEO was a very nice game console.
  The NEO-GEO used a 68000 CPU like the AMIGA.
   
  The NEO-GEO had very powerful sprite features.
  It could display up to 380 Sprites on Screen.
  Each sprite had 16 colors and could use a different palette.
   
  To emulate the NEO-GEO the above sprite copy code would need be called up to 380 times per frame.
 
  The current C code needs ~10.000 cycles per sprite,
  which means you need a CPU @ 200 MHz for full speed.
 
  Obviously we need to tune this, ideally in 68K ASM.

Um, "very nice"???

More like comparing a Ford Fiesta to the Starship Enterprise!

The Neo Geo had probably the biggest performance leap compared to its competition in video game history -- nothing even came close.

And if you had told an Amiga fan that one day a 600 could run Neo Geo games, you would have been laughed out of the playground/office (delete as appropriate).



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
09 Feb 2018 18:44


Denis Markovic wrote:

  You wrote, this routine is used to copy a sprite to a screen. I assume that the sprite is either static or does not change too often.
 

 
NEO-GEO uses ROMs for games.
This means the sprite DATA is in the game ROM and is read-only.
The ROM can be big and can contain many thousands of different sprites.
 
  The CPU will make "display" lists telling the HW which of the sprites to put where on the screen.
 
For playing NEO-GEO games we need a Software Sprite-Render routine which will copy the sprites to the display screen.
This routine will copy 300 or more sprites each frame.
 
You can assume that each time the routine is called it has to copy a different sprite image, so self-modifying code will not be useful.


Don Adan

Posts 38
10 Feb 2018 12:55


I will use something like this:
      move.w (a1),(a0)
      BFEXTU D0{28:4},D1      ; EXTRACT
   
      move.w (A0,D1.L*2),(A1)
   
 
 
  This is only an example; perhaps (a1)+ can be used.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
10 Feb 2018 19:58


Don Adan wrote:

I will use something like this:
      move.w (a1),(a0)

For what is this?




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
10 Feb 2018 21:37


Here is a small example of how the code could be written.

The code makes use of superscalarity.
The code uses the fast, single-cycle bitfield instructions on Apollo.
The code is written to avoid the 68060 EA-ALU bubbles.
The code makes use of the FREE BCC on Apollo.
 
From ~10 cycles per pixel, we are now down to 2.2 cycles per pixel.

This ASM code is now several times faster than the C code.

We need 3 instructions at the start of the loop to avoid the EA-ALU penalty. Then we can interleave the code and from then on reach a speed of 2.0 cycles per pixel.
 


        bfextu d0{#0:#4},d1
        bfextu d0{#4:#4},d2
        bfextu d0{#8:#4},d3
 
        bfextu d0{#12:#4},d4
        move.w 2(a1,d1.l*4),d1
        beq.s p6
        move.w d1,(a0)
p6:
 
        bfextu d0{#16:#4},d1
        move.w 2(a1,d2.l*4),d2
        beq.s p7
        move.w d2,2(a0)
p7:
 
        bfextu d0{#20:#4},d2
        move.w 2(a1,d3.l*4),d3
        beq.s p8
        move.w d3,4(a0)
p8:
        bfextu d0{#24:#4},d3
        move.w 2(a1,d4.l*4),d4
        beq.s p9
        move.w d4,6(a0)
p9:

        bfextu d0{#28:#4},d4
        move.w 2(a1,d1.l*4),d1
        beq.s p10
        move.w d1,8(a0)
p10:

        bfextu d5{#0:#4},d1
        move.w 2(a1,d2.l*4),d2
        beq.s p11
        move.w d2,10(a0)
p11:
       

 




Samuel Devulder

Posts 248
10 Feb 2018 21:56


Gunnar von Boehn wrote:

 
Don Adan wrote:

    I will use something like this:
          move.w (a1),(a0)
   

    For what is this?

I think this corresponds to
pal[0]=*currentpixelptr;
It copies the current pixel value to palette index 0, therefore removing the need to test for the transparent color. The problem with this is that you add read+write operation for each pixel, even if it is of no use. Accessing ram is the bottleneck actually since the test for non-zero is pretty free on the vamp.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
10 Feb 2018 22:29


Samuel Devulder wrote:

Accessing ram is the bottleneck actually since the test for non-zero is pretty free on the vamp.

You are correct.
We are at roughly 2.0 cycles per sprite pixel now.

Copying the screen pixel to the background color would add 2 clocks on top.
As you correctly said, this instruction would cost us 50% of our speed.



Markus B

Posts 209
10 Feb 2018 23:11


So, instead of the initial 10.000 cycles, only 512 cycles are now needed?
Or am I getting it wrong?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
10 Feb 2018 23:17


We only used 68K coding so far.
We should look into AMMX now.
I'm sure we can more than double the speed.


Nixus Minimax

Posts 416
11 Feb 2018 08:56


Gunnar von Boehn wrote:

 

          bfextu d0{#12:#4},d4
          move.w 2(a1,d1.l*4),d1
          beq.s p6
          move.w d1,(a0)
  p6:
 

This skips on black hicolor pixels, not on transparent four bit palette pixels.
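
The difference can be illustrated with a small C sketch. The names nibble, put_skip_on_color and put_skip_on_index are hypothetical, and the palette is simplified to 16 plain 16-bit entries rather than the 4-byte stride used in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Extract 4-bit field n (0 = most significant nibble), as
   BFEXTU d0{#n*4:#4} does on a data register. */
static unsigned nibble(uint32_t d0, unsigned n)
{
    assert(n < 8);
    return (d0 >> (28 - 4 * n)) & 0x0f;
}

/* As in the quoted listing: the branch tests the looked-up color. */
static void put_skip_on_color(uint16_t *dst, unsigned idx,
                              const uint16_t pal[16])
{
    uint16_t c = pal[idx & 0x0f];
    if (c != 0)
        *dst = c;
}

/* Transparency test on the 4-bit palette index instead. */
static void put_skip_on_index(uint16_t *dst, unsigned idx,
                              const uint16_t pal[16])
{
    if ((idx & 0x0f) != 0)
        *dst = pal[idx & 0x0f];
}
```

The two versions only agree if palette entry 0 is stored as zero and no visible palette entry is pure black.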

