
Running Games and Apps.

Quake ;)

Samuel Devulder

Posts 248
01 Nov 2017 08:13


Thanks guys for the reports! :D
 
It is strange that you two, Claudio & Simo, don't have exactly the same FPS. I guess Simo has overclocked his vampire :)
 
The fastest version for both of you is VBCC. This is good to know. It runs at 80% of the speed of ClickBoom's version. That's not bad considering this is free-time work. Possibly the remaining 20% is spent in the gfx driver which, being very complex (lots of gfx combinations with ham6, dither and public screen), is still coded in C (except for the gcc version).
 
Too bad the gcc version doesn't even load the PAK file. On my setup it is faster than vbcc (sas/c being in-between). It contains valuable inline asm optimizations that are difficult to back-port to other compilers. The bad PAK error is strange because loading doesn't rely on the asm parts. Strangely enough, neither this issue nor the texture one can be reproduced on my setup. This is likely to be related to something subtle in the 68080. It'll be interesting to discover what is going on with these. Unfortunately I don't have the hardware to test by myself. I'm still waiting for reports on genuine 680x0 to really see if it's an 080 issue or not.
 
Once these issues are corrected, the next step is possibly to learn more about superscalar and 68080 timings in order to improve the 68030 asm code that I have. For instance I wonder if it's faster to do an operation with a single instruction (say "and.l #$0000ffff,d0") or with multiple ("swap d0; clr.w d0; swap d0"), as well as which addressing modes are fastest (e.g. which of (a0), (a0)+, 4(a0) is faster, or (a0, d0.w) vs (a0, d0.l)).
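To make the question concrete, here is a minimal C model of the two sequences, assuming d0 is a 32-bit register. The helper names (mask_and, swap_halves, mask_swap_clr) are hypothetical; the sketch only shows that both forms compute the same value and says nothing about which one is faster.

```c
#include <stdint.h>

/* and.l #$0000ffff,d0 : keep the low word, clear the high word. */
static uint32_t mask_and(uint32_t d0)
{
    return d0 & 0x0000ffffu;
}

/* Model of the 68k "swap" instruction: exchange the two 16-bit halves. */
static uint32_t swap_halves(uint32_t d0)
{
    return (d0 << 16) | (d0 >> 16);
}

/* swap d0; clr.w d0; swap d0 : the same result in three instructions. */
static uint32_t mask_swap_clr(uint32_t d0)
{
    d0 = swap_halves(d0);    /* original high word is now the low half  */
    d0 &= 0xffff0000u;       /* clr.w clears that low half              */
    return swap_halves(d0);  /* original low word is back in place      */
}
```

Since the results are identical, the choice between the two is purely a matter of instruction count and timing on the target CPU.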
 
In the meantime, if you want to increase the FPS, you can run quake on a 256-color (or more) public screen (use "-usepub screen-title", or just "-usepub" for the wb) and reduce the window to its minimum size. The minimum size is 170 pixels high, which reduces the number of things to compute and display by 30% compared with a 240-pixel window. I've found that 170 pixels is still playable on a 640x480 screen, and the dither does a good job of making the best use of the shared pens. A 30% boost in speed might be worth having. Conversely, you might want to increase the window size to see how big you can go without affecting playability too much.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Nov 2017 10:11


Samuel Devulder wrote:

Once these issues are corrected, the next step is possibly to learn more about superscalar and 68080 timings in order to improve the 68030 asm code that I have. For instance I wonder if it's faster to do an operation with a single instruction (say "and.l #$0000ffff,d0") or with multiple ("swap d0; clr.w d0; swap d0"), as well as which addressing modes are fastest (e.g. which of (a0), (a0)+, 4(a0) is faster, or (a0, d0.w) vs (a0, d0.l)).

On 68080 longer instructions are _NOT_ slowed down.
Therefore:
and.l #$0000ffff,d0  == 1 clock

All normal EA modes are free.
(a0) == free
1234(a0) == free
(a0)+ == free
(1234556,a0,D7.l*8) == free

Exactly like on the 68060, there is one thing you should avoid:
using a Dn register as an index right after altering it.
move.l (a0),D0
move.l (a1,d0),D1  -- D0 ALU usage to EA usage Bubble!
This construct, where the index register D0 is altered directly before being used for the EA calculation, will cause a bubble.
This is the same effect as explained in the 68060 manual.



Nixus Minimax

Posts 416
01 Nov 2017 11:42


Samuel Devulder wrote:
The fastest version for both of you is VBCC. This is good to know. It runs at 80% of the speed of ClickBoom's version. That's not bad considering this is free-time work. Possibly the remaining 20% is spent in the gfx driver which, being very complex (lots of gfx combinations with ham6, dither and public screen), is still coded in C (except for the gcc version).

How much do you reckon could your Quake be sped up by just passing the graphics buffer to the display DMA and be done with the "graphics driver" part?



Samuel Devulder

Posts 248
01 Nov 2017 12:43


@Gunnar So asm for 060 is a good starting point to optimize for the vampire. Good to know :) There are definitely places where this can be applied.

@Nixus: I cannot make predictions because I haven't successfully run a profiling benchmark of the latest version. But what I can say is that it depends on the gfx mode.

Copying the screen is not a big deal. With RTG, a 320x200 screen in 256 colors takes about 64 KB (320*200 = 64000 bytes). Copying it to video RAM at 30 fps takes about 1.9 MB/s. This is peanuts when the bandwidth is 320+ MB/s on the vampire. On the other hand, when targeting OCS, this takes much more bandwidth due to dithering, palette mapping, possibly HAM processing, and ultimately C2P. There must be room for improvement right there.
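As a sanity check on the orders of magnitude, here is a small C sketch of the arithmetic, assuming one byte per pixel for an 8-bit chunky screen. The function names are hypothetical and only illustrate the calculation.

```c
#include <stdint.h>

/* Bytes in one frame of a chunky display with the given bytes per pixel. */
static uint32_t frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Bytes per second needed to copy one such frame `fps` times a second. */
static uint32_t copy_bytes_per_second(uint32_t bytes_per_frame, uint32_t fps)
{
    return bytes_per_frame * fps;
}
```

For 320x200 at one byte per pixel this gives 64000 bytes per frame and 1920000 bytes per second at 30 fps, which is well under 1% of a 320 MB/s budget.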

The part actually sending the gfx "on screen" is IIRC done by a simple WritePixelArray8(). So depending on whether or not you are using one of the many WPA8 accelerators or a fine-tuned p96 driver, it'll speed up as well.

But it all depends on which places the profiling points to as worth improving. Maybe it's something in an unexpected place in the code that is slowing things down. (I hope gprof is able to profile any amiga exe with HUNK_DEBUG infos.)


Gregthe Canuck

Posts 274
01 Nov 2017 17:27


Samuel -

Are you aware of the GCC 6.2 by bebbo?

EXTERNAL LINK


Samuel Devulder

Posts 248
01 Nov 2017 18:03


Nope. I wasn't aware of it.

It looks like this is a cross-compiler. This will complicate the build process in my case. Too bad a native 68k toolchain is not provided. Building and testing in the same environment is really important to me: it is what lets me consider my amigas still alive and kickin'.

Rebuilding a 68k version doesn't seem possible as building requires GCC5 that doesn't exist for 68k :(


Gregthe Canuck

Posts 274
01 Nov 2017 18:54



Yes, that is a cross-compiler. There is no native 68K version.  :\  It was an option to see if later versions of GCC eliminate some of the odd issues you are seeing.

If anyone is listening/cares I am happy to sponsor a native GCC 68K in a beer-and-pizza way.




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Nov 2017 19:28


Samuel Devulder wrote:

The part actually sending the gfx "on screen" is IIRC done by a simple WritePixelArray8(). So depending on whether or not you are using one of the many WPA8 accelerators or a fine-tuned p96 driver, it'll speed up as well.

What speedup do you see if you SKIP this completely?
SAGA can display _any_ memory, so a copy is a waste of time;
instead, just poking the current buffer_adr is enough.

Can you, as a simple test, skip the WritePixelArray8() and tell us how many FPS it gives?
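The difference between copying a frame and simply re-pointing the display can be sketched in plain C. This is only an illustration under stated assumptions: display_base is a hypothetical variable standing in for whatever register holds the displayed buffer address, and none of this is the actual SAGA or Picasso96 interface. Real hardware would also need the correct register address, alignment and vertical-blank synchronisation.

```c
#include <stddef.h>
#include <string.h>

#define FRAME_BYTES (320 * 200)

/* Hypothetical stand-in for the display base address register. */
static const unsigned char *display_base;

static unsigned char buffers[2][FRAME_BYTES];
static int back = 0;   /* index of the buffer currently being rendered into */

/* Copying approach: every finished frame is copied to a separate screen. */
static void present_by_copy(unsigned char *screen)
{
    memcpy(screen, buffers[back], FRAME_BYTES);
}

/* Flipping approach: show the finished buffer, then render into the other. */
static unsigned char *present_by_flip(void)
{
    display_base = buffers[back];  /* one pointer write, no pixel copy   */
    back ^= 1;                     /* swap roles of the two buffers      */
    return buffers[back];          /* next buffer for the renderer       */
}
```

With flipping, the per-frame cost is a single pointer write, at the price of keeping two full frame buffers in memory.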


Samuel Devulder

Posts 248
01 Nov 2017 21:34


claudio guglielmotti wrote:

      gcc version refuse to work (it says bad pak or something like that)

There is no such string as "bad pak" in the source. What is the exact message that is displayed? It may contain valuable information.
     
@Gunnar I cannot test by myself on real hardware, but I've added a new command-line option named "-nowpa8" which disables WritePixelArray8() only (but not the color reduction, the palette mapping and the dither). By comparing the FPS with and without this option, one can measure the exact FPS cost of WritePixelArray8() for a given screen-mode. The exe can be downloaded here: EXTERNAL LINK
   
   
  By the way, I think there are issues specifically related to the vampire because:
 
    1) I've had reports that the gcc exes work fine on a genuine 68060 system (no "bad pak" or anything like it).
 
    2) I've also had reports that on plain 68k there is no defect in the textures like the one seen in the youtube video posted a little further up.
   
This must be some kind of subtle bug related to the 080 implementation, and I don't know how to figure out what exactly is going wrong (how can I help the team?).
 
But one can be happy because on a 060 my quake version does 8fps at most where Simo gets 21fps.


Simo Koivukoski
(Apollo Team Member)
Posts 601
01 Nov 2017 22:03


With WritePixelArray8():
21.0fps quake.vbcc.030.881
18.8fps quake.sasc.030.881
Without WritePixelArray8():
20.9fps quake.vbcc.030.881
18.7fps quake.sasc.030.881
gcc builds work on WinUAE, so this error message is for Jari/femu:
the Necropolis
PackFile: OS39:QUAKE/id1/pak0.pak : maps/e1m3.bsp
Error: Bad surface extents
VID_Shutdown

   


Samuel Devulder

Posts 248
01 Nov 2017 23:10


This confirms that WritePixelArray8() doesn't cost that much.
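To put Simo's numbers in per-frame terms, here is a small C sketch of the conversion from FPS to frame time. The function names are hypothetical. Note that in the figures above the build with the call is marginally the faster one (21.0 vs 20.9 fps), so the difference is best read as measurement noise.

```c
/* Time spent per frame, in milliseconds, at a given frame rate. */
static double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Difference in milliseconds per frame between a slower and a faster run. */
static double frame_cost_ms(double fps_fast, double fps_slow)
{
    return frame_time_ms(fps_slow) - frame_time_ms(fps_fast);
}
```

Between 21.0 and 20.9 fps the gap is roughly 0.23 ms on a frame of about 47.6 ms, which is under half a percent.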
   
The bad surface extents error is very interesting because the original source I got 20 years ago already contains debugging info, possibly indicating that somebody else had an issue with the FPU computations in there (some comments elsewhere in the code indicate that the NeXT workstation had trouble with changing the FPU precision):

    for (i=0 ; i<2 ; i++)
    {
    //static int cpt2 = 0; ++cpt2;
      bmins[i] = floor(mins[i]/16);
      bmaxs[i] = ceil(maxs[i]/16);
   
      s->texturemins[i] = bmins[i] * 16;
      s->extents[i] = (bmaxs[i] - bmins[i]) * 16;
    //if(cpt2==8165) goto label;
      if ( !s->extents[i] || (!(tex->flags & TEX_SPECIAL) && (s->extents[i] > 256))) {
    //label:
    //Con_Printf("float: %d %d 0x4367ec00 0x43f3f600\n", cpt, cpt2);
    //Con_Printf("float: [%08x %08x]\n", *(long*)&mins[i], *(long*)&maxs[i]);
    //Con_Printf("float: [%g %g]\n", mins[i], maxs[i]);
    //Con_Printf("float: [%g %g]\n", mins[i]/16, maxs[i]/16);
    //Con_Printf("float: [%g %g]\n", floor(mins[i]/16), ceil(maxs[i]/16));
    //Con_Printf("int: [%d %d] %d\n", bmins[i], bmaxs[i], (bmaxs[i] - bmins[i])*16);
      Sys_Error ("Bad surface extents");
      }
    }

I'll compile a version that displays the commented debug infos to see what is wrong in the computations...done
   
Here it is: EXTERNAL LINK   
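For readers who want to follow the arithmetic, here is a self-contained C sketch of the snapping done for one axis in the loop above. The struct, the function names and the integer helpers are hypothetical: floor_to_int and ceil_to_int stand in for floor() and ceil() so that the sketch has no math library dependency, and they assume values that fit in an int.

```c
/* Result of snapping one texture axis to a grid of 16 units. */
struct extent16 {
    int texturemin;  /* lower bound, a multiple of 16 */
    int extent;      /* span covered, a multiple of 16 */
};

/* Stand-in for floor(): largest integer not above v. */
static int floor_to_int(float v)
{
    int i = (int)v;                    /* truncates toward zero */
    return (v < (float)i) ? i - 1 : i;
}

/* Stand-in for ceil(): smallest integer not below v. */
static int ceil_to_int(float v)
{
    int i = (int)v;                    /* truncates toward zero */
    return (v > (float)i) ? i + 1 : i;
}

static struct extent16 calc_extent(float mins, float maxs)
{
    struct extent16 r;
    int bmins = floor_to_int(mins / 16);
    int bmaxs = ceil_to_int(maxs / 16);
    r.texturemin = bmins * 16;
    r.extent = (bmaxs - bmins) * 16;
    return r;
}

/* Same condition as the check above, for a texture without TEX_SPECIAL. */
static int extent_is_bad(int extent)
{
    return extent == 0 || extent > 256;
}
```

Because the result is scaled by 16, a floor or ceil that is off by one changes the extent by 16, which is enough to push a surface close to the limit over 256 and trigger the error.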


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Nov 2017 23:16


Samuel Devulder wrote:

@Gunnar I cannot test by myself on real hardware, but I've added a new command-line option named "-nowpa8" which disables WritePixelArray8() only (but not the color reduction, the palette mapping and the dither).

When you say it disables WritePixelArray8(), does it now really do nothing instead?

- color reduction,
- the palette mapping and the dither

Do we really need those steps?
Would it not be simpler to display a hicolor screen directly?




Samuel Devulder

Posts 248
02 Nov 2017 00:20


If you choose a 256-color screen, these steps are skipped. Otherwise they are used when fewer colors are available. They are quite fast because they use some "big" precalc tables. For instance, I'm pretty proud of the dither algorithm:

  /* SETLSB(x,y): replace the low byte of x with y (a single move.b on 68k). */
  #if defined(__mc68000__) && defined(__GNUC__)
  #define SETLSB(x,y) __asm__ __volatile__("moveb %1,%0" : "=&d" (x) : "dmi" (y), "0" (x))
  #else
  #define SETLSB(x,y) ((x) = ((x) & ~255UL) | (y))
  #endif

  /* EORLSW(x,y): xor the low word of x with the constant y (a single eor.w on 68k). */
  #if defined(__mc68000__) && defined(__GNUC__)
  #define EORLSW(x,y) __asm__ __volatile__("eorw %1,%0" : "=&d" (x) : "i" (y), "0" (x))
  #else
  #define EORLSW(x,y) ((x) ^= (y))
  #endif
 
  static __inline__ void
  dither_line(unsigned char *src, int y, int len)
  {
      unsigned char *dith = fast_mtrx[y&(DMSIZE-1)];
      unsigned long offset = 7*256;   /* column 7, so the first xor selects column 0 */
     
      /* Bits 8-10 of offset hold the matrix column, bits 0-7 the source pixel. */
      for(;len>0; len -= DMSIZE) {
          EORLSW(offset, (0^7)*256); SETLSB(offset, *src); *src++ = dith[offset];
          EORLSW(offset, (1^0)*256); SETLSB(offset, *src); *src++ = dith[offset];
          EORLSW(offset, (2^1)*256); SETLSB(offset, *src); *src++ = dith[offset];
          EORLSW(offset, (3^2)*256); SETLSB(offset, *src); *src++ = dith[offset];
          EORLSW(offset, (4^3)*256); SETLSB(offset, *src); *src++ = dith[offset];
          EORLSW(offset, (5^4)*256); SETLSB(offset, *src); *src++ = dith[offset];
          EORLSW(offset, (6^5)*256); SETLSB(offset, *src); *src++ = dith[offset];
          EORLSW(offset, (7^6)*256); SETLSB(offset, *src); *src++ = dith[offset];
      }
  }


It does "in-place" dithering very quickly. The C code looks awful but results in nice ASM (considering this is compiler generated):

      52562: 2f02            movel %d2,%sp@-
      52564: 206f 0008      moveal %sp@(8),%a0
      52568: 222f 0010      movel %sp@(16),%d1
      5256c: 7007            moveq #7,%d0
      5256e: c0af 000c      andl %sp@(12),%d0
      52572: 740c            moveq #12,%d2
      52574: e5a8            lsll %d2,%d0
      52576: 2240            moveal %d0,%a1
      52578: d3fc 0000 36e4  addal #14052,%a1
      5257e: 203c 0000 0700  movel #1792,%d0
      52584: 4a81            tstl %d1
      52586: 6f56            bles 525de <_D_EndDirectRect+0x11da>
      52588: 0a40 0700      eoriw #1792,%d0
      5258c: 1010            moveb %a0@,%d0
      5258e: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
      52592: 0a40 0100      eoriw #256,%d0
      52596: 1010            moveb %a0@,%d0
      52598: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
      5259c: 0a40 0300      eoriw #768,%d0
      525a0: 1010            moveb %a0@,%d0
      525a2: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
      525a6: 0a40 0100      eoriw #256,%d0
      525aa: 1010            moveb %a0@,%d0
      525ac: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
      525b0: 0a40 0700      eoriw #1792,%d0
      525b4: 1010            moveb %a0@,%d0
      525b6: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
      525ba: 0a40 0100      eoriw #256,%d0
      525be: 1010            moveb %a0@,%d0
      525c0: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
      525c4: 0a40 0300      eoriw #768,%d0
      525c8: 1010            moveb %a0@,%d0
      525ca: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
      525ce: 0a40 0100      eoriw #256,%d0
      525d2: 1010            moveb %a0@,%d0
      525d4: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
      525d8: 5181            subql #8,%d1
      525da: 4a81            tstl %d1
      525dc: 6eaa            bgts 52588 <_D_EndDirectRect+0x1184>
      525de: 241f            movel %sp@+,%d2
      525e0: 4e75            rts

Notice, however, that the compiler is not smart enough to get rid of the tst.l opcode at $525da. Maybe dividing by 8 at the beginning of the function and working with an unsigned short counter would result in better code (==> dbf).
 
I can see that this code causes a lot of pipeline bubbles (d0 being modified just before being used as an index). A version alternating d0/d1 as index might be better for the 080. The C compiler was not smart enough to figure that out.
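The idea of alternating two index registers can be sketched at the C level as follows. This is a minimal sketch under assumptions, not the code used in the port: DMSIZE is taken to be 8, fast_mtrx is declared here with a guessed shape, both function names are hypothetical, and len is assumed to be a multiple of DMSIZE. Whether a given compiler actually schedules the two loads ahead of the two lookups is not guaranteed, so hand-written asm may still be needed.

```c
#define DMSIZE 8

/* Assumed shape: one row per screen line (mod 8), 8 columns of 256 entries. */
static unsigned char fast_mtrx[DMSIZE][DMSIZE * 256];

/* Plain reference: offset = column * 256 + pixel, one pixel at a time. */
static void dither_line_ref(unsigned char *src, int y, int len)
{
    const unsigned char *dith = fast_mtrx[y & (DMSIZE - 1)];
    for (; len > 0; len -= DMSIZE) {
        for (unsigned long col = 0; col < DMSIZE; col++) {
            unsigned long offset = col * 256 + *src;
            *src++ = dith[offset];
        }
    }
}

/* Paired variant: compute two independent offsets before using either,
   so each index has an unrelated instruction between its load and use. */
static void dither_line_pairs(unsigned char *src, int y, int len)
{
    const unsigned char *dith = fast_mtrx[y & (DMSIZE - 1)];
    for (; len > 0; len -= DMSIZE, src += DMSIZE) {
        for (int x = 0; x < DMSIZE; x += 2) {
            unsigned long o0 = (unsigned long)x * 256 + src[x];
            unsigned long o1 = (unsigned long)(x + 1) * 256 + src[x + 1];
            src[x]     = dith[o0];
            src[x + 1] = dith[o1];
        }
    }
}
```

Both functions produce the same output; the second only changes the order in which indices are formed and consumed, which is the property that matters for avoiding the ALU-to-EA bubble.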


Simo Koivukoski
(Apollo Team Member)
Posts 601
02 Nov 2017 06:31


Samuel Devulder wrote:
The bad surface extents error is very interesting because the original source I got 20 years ago already contains debugging info, possibly indicating that somebody else had an issue with the FPU computations in there (some comments elsewhere in the code indicate that the NeXT workstation had trouble with changing the FPU precision).

The texture issue only occurs with femu and these 030.881 builds. The alternative QuakeFPS.exe build has the same issue. On WinUAE your 030.881 builds and this QuakeFPS work ok.
 
quake.sasc.030.881 / QuakeFPS (femu):

 
When you build for 882, the texture issue is gone with femu.
 
quake.sasc.68040.68882:

 


Claudio Guglielmotti
(Apollo Team Member)
Posts 185
02 Nov 2017 06:44


The error I get with GCC is:

The Necropolis
PackFile: Dati:Games/Quake/id1/pak0.pak  :  maps/e1m3.bsp
Error: Bad surface extents
VID_Shutdown


Simo Koivukoski
(Apollo Team Member)
Posts 601
02 Nov 2017 06:56


Simo Koivukoski wrote:
With WritePixelArray8():
21.0fps quake.vbcc.030.881
18.8fps quake.sasc.030.881
Without WritePixelArray8():
20.9fps quake.vbcc.030.881
18.7fps quake.sasc.030.881
gcc builds work on WinUAE, so this error message is for Jari/femu:
the Necropolis
PackFile: OS39:QUAKE/id1/pak0.pak : maps/e1m3.bsp
Error: Bad surface extents
VID_Shutdown
22.2fps QuakeFPS              <--- 030.881 ??
26.0fps quake.clickBOOM.060

   


Samuel Devulder

Posts 248
02 Nov 2017 07:39


Good to know that the texture error doesn't exist with the 68882 version. But this is strange because, to my knowledge, the 68881 and the 68882 share the same instruction set, so both exes should be equivalent. The difference must be in the way SAS/C generates the code. Later today I'll build 2 exes with sas/c from the very same source code, one for 68881 and another for 68882, and then compare the assembly output. This should help find the difference between the two versions.


Wawa T

Posts 695
02 Nov 2017 09:06


Samuel Devulder wrote:

    Rebuilding a 68k version doesn't seem possible as building requires GCC5 that doesn't exist for 68k :(
 

 
  you could try to use aros68k as compiler environment on your amiga.
  the nightly (contribs) contains gcc-4.6.4 and i can compile gcc-6.3.0 natively for 68k.
 
  the question is if this is practicable to use such a heavy modern compiler on an actual amiga. i am assuming you use a vampire expanded machine.
 
  as for testing i assume the quake you are compiling would run on aros also, what dependencies it has? sdl? ixemul? i could test a binary under aros if you send it to me or provide a link.
 
  as for aros install i think someone here or on apollo irc has a working environment with vampire extensions. michael ness, shk (simo?) or marlon. i could then simply provide an 6.x toolchain.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
02 Nov 2017 09:23


Samuel Devulder wrote:

  <_D_EndDirectRect+0x11da>
        52588: 0a40 0700      eoriw #1792,%d0
        5258c: 1010            moveb %a0@,%d0
        5258e: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
        52592: 0a40 0100      eoriw #256,%d0
        52596: 1010            moveb %a0@,%d0
        52598: 10f1 0800      moveb %a1@(00000000,%d0:l),%a0@+
        5259c: 0a40 0300      eoriw #768,%d0
 

 
  Let's write this in a more readable form:
 

    eori.w #$700,d0
    move.b (a0),d0
      -- ALU EA bubble 2 cycle
    move.b (a1,d0.l),(a0)+
 
    eori.w #$100,d0
    move.b (a0),d0
      -- ALU EA bubble 2 cycle
    move.b (a1,d0.l),(a0)+
 

 
 
Great find.
Here is room for improvement,
as the code creates many ALU to EA usage bubbles.
Rewriting it to remove the bubbles should increase speed.

On the other hand, if the code is not used in the normal 8-bit path,
we do not need to tweak it.


Samuel Devulder

Posts 248
02 Nov 2017 09:30


Samuel Devulder wrote:
I'll build later this day 2 exes with sas/c: one for 68881 and another one for 68882 with the very same source-code, and then compare the assembly-source. This should help find the difference between the two versions.

Ok done ==> EXTERNAL LINK     
 
The odd thing is that the 030.881 and 030.882 builds are identical (see the provided disassembled ASM code). So the issue might not be related to 881/882, but most likely to the difference between 030 and 040! That's why I've also provided a 040 exe.
   
Can you please test again with these 3 EXEs and tell which one is having texture issues?
   
If they all behave the same wrt textures, then this means that the issue arrived with later changes (ASM code).
