
Performance and Benchmark Results!

Can Vampire Do 640x240x32bit ?

Vladimir Repcak

Posts 327
06 May 2020 10:07


Gunnar von Boehn wrote:

  DBRA can be executed in 2nd Pipe (for free)
 

  Well, I'd rather the second pipe did something useful (e.g. copy second pixel), not just process looping.
 
  That is, assuming it's even possible for the memory controller to initiate two such parallel copies from two pipes. I think we talked about this in my thread a few months ago; I should go reread it.
 
 
Gunnar von Boehn wrote:

 
 
Vladimir Repcak wrote:

    Wouldn't it be more effective to have 2 sets of pointers (say, one scanline apart) and just keep copying two rows at the same time ?
    Something like:
    move.l (a0)+,(a1)+    ; Row X
    move.l (a2)+,(a3)+    ; Row X+1 (e.g. a2 is 320px further than a0)
 

 
  No, this is not good.
  This code would read from 2 different memory regions and write to 2 different regions, and the writes would be 32bit.
 
  The memory stream prefetching works better if you read continuously from 1 memory region. And if you write continuously to one region, the writes can be merged into 64bit writes, which is faster.
 
  Such a loop should give you good performance.
 

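  LOOP:
    move.l (A0)+,(A1)+    ; copy one 32-bit pixel, advance both pointers
    move.l (A0)+,(A1)+    ; sequential reads from one region, writes to one region
    move.l (A0)+,(A1)+
    move.l (A0)+,(A1)+
    dbra D0,LOOP          ; D0 holds (iterations - 1)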
 

 
  Just make sure your "screen buffer" is 64bit aligned.
 
 
 

  Good catch on the 64-bit alignment. Need to go figure out the VASM syntax. Something like CNOP 0,8 or similar ?
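
Whatever the assembler directive turns out to be, 64-bit alignment comes down to rounding an address up to the next multiple of 8. A minimal sketch of that arithmetic in Python; the name `align_up` and the sample addresses are illustrative assumptions, not something from the thread:

```python
def align_up(address: int, alignment: int = 8) -> int:
    """Round address up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (address + alignment - 1) & ~(alignment - 1)

# Hypothetical buffer addresses, chosen only for illustration.
for addr in (0x40000, 0x40001, 0x40004, 0x40007):
    print(hex(addr), "->", hex(align_up(addr)))
```

An address that is already a multiple of 8 is returned unchanged; anything else moves up by at most 7 bytes.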
 
  So, in other words, the way I have it now, is as good as it gets using vanilla 68000 code.
 
  I reckon AMMX could copy a RAM region faster, right ?


Gunnar von Boehn
(Apollo Team Member)
Posts 4788
06 May 2020 11:11


Vladimir Repcak wrote:

I reckon AMMX could copy a RAM region faster, right ?

 
Not really.
 
I would always recommend looking at where the "problem area" is and how to fix it.
 
The "problem area" of a memcopy is not the CPU and not the DBRA.
What you want to optimize is avoiding cache misses and making optimal use of the memory bus.
 
The memory bus is 64bit.
The 68080 can "merge" the 2 MOVE.L that we used in our example into one 64bit access. So the code is already very good in this regard.
You might want to align the destination to 64bit.
As for the "cache miss" problem: the cache and the memory controller of the 68080 will both try to help you if they spot sequential, continuous READs. So if you read continuously, they will speed things up a lot by removing the cache misses through automatic prefetching.


Vladimir Repcak

Posts 327
06 May 2020 11:30


So, by unrolling the copying (and aligning to 64-bit), it's as fast as possible ?

As fast as possible is good enough for me :)

640x480 = 307,200 pixels = 307,200 move.l (a0)+,(a1)+ ops

Does the fusing of two move.l (into one 64-bit) mean we execute the above code in:
a) 307,200 cycles : Only 1 pipe working
b) (307,200 / 2) cycles : Both pipes executing one 32-bit move.l
c) (307,200 / 4) cycles : Both pipes executing one 64-bit move.l

Which scenario is it?

There's one more option, I realized. If the benchmark on real HW shows the 32-bit FrameBuffer copy as too slow, I could double the speed of that particular stage of the pipeline by reducing color depth to 16 bits.

However, from experience on Jaguar, the atmosphere rendering on the planet isn't very smooth at 65,536 colors. So, I'd really like to have 16.7 Mil colors for the background.

We'll see soon...


Gunnar von Boehn
(Apollo Team Member)
Posts 4788
06 May 2020 12:00


Vladimir Repcak wrote:

As fast as possible is good enough for me :)

:-)

Vladimir Repcak wrote:

  640x480 = 307,200 pixels = 307,200 move.l (a0)+,(a1)+ ops
 
  Does the fusing of two move.l (into one 64-bit) mean we execute the above code in:
  a) 307,200 cycles : Only 1 pipe working
  b) (307,200 / 2) cycles : Both pipes executing one 32-bit move.l
  c) (307,200 / 4) cycles : Both pipes executing one 64-bit move.l

The pipes are not the limit here; the memory interface is.

Therefore I would recommend "counting" differently
and looking at the amount of memory and the bandwidth.
640*480*4*2 = 2.5 MB per frame

With good code you might expect to reach 400 MB/sec
So around 160 FPS frame clear speed would be good.
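
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are made up, the factor of 2 is taken from the post as written, and 1 MB is assumed to mean 10^6 bytes.

```python
def bytes_per_frame(width: int, height: int, bytes_per_pixel: int, passes: int) -> int:
    """Memory traffic for one frame; `passes` is the '*2' factor from the post."""
    return width * height * bytes_per_pixel * passes

def frames_per_second(bandwidth_bytes_per_sec: float, frame_bytes: int) -> float:
    """How many such frames fit into the given memory bandwidth per second."""
    return bandwidth_bytes_per_sec / frame_bytes

frame_bytes = bytes_per_frame(640, 480, 4, 2)       # 32-bit pixels, factor 2
fps = frames_per_second(400_000_000, frame_bytes)   # 400 MB/sec, assumed decimal MB
print(f"{frame_bytes:,} bytes per frame, about {fps:.0f} frames per second")
```

The result lands a little above 160, which is where the rounded "around 160 FPS" comes from.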



Vladimir Repcak

Posts 327
06 May 2020 12:11


Gunnar von Boehn wrote:

 
Vladimir Repcak wrote:

  As fast as possible is good enough for me :)
 

  :-)
 
 
Vladimir Repcak wrote:

    640x480 = 307,200 pixels = 307,200 move.l (a0)+,(a1)+ ops
   
    Does the fusing of two move.l (into one 64-bit) mean we execute the above code in:
    a) 307,200 cycles : Only 1 pipe working
    b) (307,200 / 2) cycles : Both pipes executing one 32-bit move.l
    c) (307,200 / 4) cycles : Both pipes executing one 64-bit move.l
 

 
  The pipes are not the limit here; the memory interface is.
 
  Therefore I would recommend "counting" differently
  and looking at the amount of memory and the bandwidth.
  640*480*4*2 = 2.5 MB per frame
 
  With good code you might expect to reach 400 MB/sec
  So around 160 FPS frame clear speed would be good.
 
 

  I interpret those 160 FPS as 1/(160/60) = 37.5% of frame time. Is that correct ?
 
  My initial estimate of the unrolled code was around 25% of frame time (though I should get the benchmark results soon), at 60 fps.
 
  Then again, it's 1.23 MB to clear and then the same 1.23 MB to fill, per frame...

And, there's going to be some overdraw...
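
The 37.5% interpretation can be checked directly: a task that could run 160 times per second uses 60/160 of each frame on a 60 Hz display. A small sketch, where `frame_share` is a hypothetical helper name and the 60 Hz refresh rate is the one assumed in the post:

```python
def frame_share(throughput_fps: float, display_hz: float = 60.0) -> float:
    """Fraction of one display frame used by a task that could run
    throughput_fps times per second."""
    return display_hz / throughput_fps

print(f"{frame_share(160):.1%}")
```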


Vladimir Repcak

Posts 327
06 May 2020 23:08


So, I got the fresh benchmark results from the V2. One of the stages was copying the background bitmap 640*480@32bpp

I made a simple loop of 1,000 iterations:
Loop (#1000) CopySkyboxTexture ()

And it took 9.93173 seconds.
Which is 595.9 frames (@60 fps).
Which is 59.59% of a frame time.

Leaving only 40% of frame time for pixel fill, which is slower than unrolled copying.

If it was an indoor environment, I could get away with copying, as walls would fill the screen.

But since I want windows in the tunnels to view the warm space outdoors, it's unavoidable.

Ideas?
That's way too much.
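
For reference, the benchmark figures above can be converted into a frame share and an effective copy rate. This is a sketch under stated assumptions: `benchmark_summary` is a made-up name, the rate counts only the bytes copied (the bus carries each byte twice, once read and once written), and 1 MB is taken as 10^6 bytes.

```python
def benchmark_summary(total_seconds: float, iterations: int,
                      bytes_per_copy: int, display_hz: float = 60.0) -> dict:
    """Turn a timed copy loop into per-frame share and payload throughput."""
    seconds_per_copy = total_seconds / iterations
    return {
        "frame_share": seconds_per_copy * display_hz,
        "mb_per_sec": bytes_per_copy / seconds_per_copy / 1e6,
    }

result = benchmark_summary(9.93173, 1000, 640 * 480 * 4)
print(f"{result['frame_share']:.2%} of a frame, "
      f"{result['mb_per_sec']:.1f} MB/sec copied")
```

By this reckoning the loop moves roughly 124 MB of pixel data per second.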



Vladimir Repcak

Posts 327
06 May 2020 23:35


Some data per frame:
515,522 pixels : Bitmap copy
637,484 pixels : Scanline fill

The reason the second one is bigger is that scanline fill stores the register value via move.l d0,(a0)+ (no memory read).

The scanline fill was measured by rendering 230 Million pixels of real-world data (anything from single pixel scanline to full screen width).
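
Taking the two per-frame throughput figures above at face value, a rough budget can be sketched: how much of a frame a full 640x480 background copy uses, and how many filled pixels would fit into what remains. The function names are made up for illustration, and the sketch assumes the two stages simply run back to back with no other work in the frame.

```python
COPY_PIXELS_PER_FRAME = 515_522   # bitmap copy throughput quoted above
FILL_PIXELS_PER_FRAME = 637_484   # scanline fill throughput quoted above
SCREEN_PIXELS = 640 * 480

def copy_share(pixels: int = SCREEN_PIXELS) -> float:
    """Fraction of a frame spent copying `pixels` background pixels."""
    return pixels / COPY_PIXELS_PER_FRAME

def fill_pixels_in_remaining(share_used: float) -> float:
    """Pixels the scanline fill could write in the rest of the frame."""
    return max(0.0, 1.0 - share_used) * FILL_PIXELS_PER_FRAME

used = copy_share()
print(f"background copy: {used:.2%} of a frame")
print(f"fill budget left: {fill_pixels_in_remaining(used):,.0f} pixels")
```

By this estimate the remaining frame time covers fewer filled pixels than one full screen (307,200), before any overdraw.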




Gunnar von Boehn
(Apollo Team Member)
Posts 4788
07 May 2020 06:00


Vladimir Repcak wrote:

I made a simple loop of 1,000 iterations:
Loop (#1000) CopySkyboxTexture ()
 
Ideas?

 
What is the executed ASM code?
What screenmode was shown in parallel?

If speed is important, then maybe other ideas can be discussed as well?
24bit mode looks the same as 32bit but is 25% less memory to copy.
640x360 is a 16:9 mode which matches many TVs better.
And 640x360 is 25% less memory to copy than 640x480.
How different would a game or the sky look when done in 256 CLUT?
Could you share some pictures to help us imagine this?



Vladimir Repcak

Posts 327
07 May 2020 09:47


Gunnar von Boehn wrote:

 
Vladimir Repcak wrote:

  I made a simple loop of 1,000 iterations:
  Loop (#1000) CopySkyboxTexture ()
   
  Ideas?
 

   
  What is the executed ASM code?
  What screenmode was shown in parallel?
 

  This is the code:

  loopStart:
 
    move.l (a0)+,(a1)+
    move.l (a0)+,(a1)+
 
    ......    320x move.l within loop
 
    move.l (a0)+,(a1)+
    move.l (a0)+,(a1)+
 
  dbra d7, loopStart

  Screenmode is 640x480, and one of BGRA/BRGA/RBGA/RGBA (don't recall which exactly now).


Vladimir Repcak

Posts 327
07 May 2020 09:50


Gunnar von Boehn wrote:

 
  If speed is important, then maybe other ideas can be discussed as well?
  24bit mode looks the same as 32bit but is 25% less memory to copy.
  640x360 is a 16:9 mode which matches many TVs better.
  And 640x360 is 25% less memory to copy than 640x480.

Yes, 640x360 would be a great compromise.

OK, but is it a default RTG mode that doesn't need any configuring from the user's side ? Like 320x240 or 640x480 ?

If I request 360 through the OS requester, will RTG supply it by default without any action from the user ?



Vladimir Repcak

Posts 327
07 May 2020 09:56


Gunnar von Boehn wrote:

  24bit mode looks the same as 32bit but is 25% less memory to copy.
I didn't know 24 bit is an actual option. I assumed it's either 16 or 32.

If we combined this with 640x360, then that's a substantial decrease in the instructions needed to fill the screen.
That's 691,200 Bytes vs 1,228,800. About 44% less !

As it stands right now, even heavily unrolled code takes 60% of frame time to copy the background bitmap.
Combined with pixel fill, it's going to take a full frame (or slightly more) just for pixel drawing.

And I really must keep at least 50% of frame time as a buffer against CPU spikes, leaving ~40% of frame time for actual 3D scene processing. That's not much.
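
The byte counts for the modes under discussion are easy to tabulate. A small sketch, assuming tightly packed pixels with no row padding; `framebuffer_bytes` is a hypothetical helper name:

```python
def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size of one frame, assuming tightly packed pixels and no row padding."""
    return width * height * bits_per_pixel // 8

baseline = framebuffer_bytes(640, 480, 32)
for w, h, bpp in [(640, 480, 32), (640, 480, 24), (640, 360, 32), (640, 360, 24)]:
    size = framebuffer_bytes(w, h, bpp)
    print(f"{w}x{h}@{bpp}: {size:>9,} bytes ({size / baseline:.1%} of baseline)")
```

Each of the two changes removes a quarter of the data on its own; together they leave 56.25% of the original 640x480@32 frame.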



Vladimir Repcak

Posts 327
07 May 2020 10:07


Gunnar von Boehn wrote:

  How different would a game or the sky look when done in 256 CLUT?
  Could you share some pictures to help us imagine this?

256 colors for whole scene ? Including space background ?
Ouch! That's going to look like sh*t :)

Now, the flatshaded foreground 3D scene can look great at 256 colors in itself, as long as you pull those 256 shades from 65,536 colors. I've done that on Jaguar and could cycle nicely through 65,536 colors within the limited 256 CLUT, while flying through the level.

But, the planet background would look ugly as hell.

Realize, that with 16.7 Mil colors we can do many special postprocessing FX on the background bitmap (during loading time, so performance cost of those is irrelevant), as we work directly in screen-space. Most of them are impossible to do at 65,536 colors (or simply look real bad with staircase and aliasing artifacts).

I'm talking FX like:
- antialiasing
- Bloom
- lens flare
- day/night transition
- etc.



Gunnar von Boehn
(Apollo Team Member)
Posts 4788
07 May 2020 10:31


Vladimir Repcak wrote:

  Now, the flatshaded foreground 3D scene can look great at 256 colors in itself, as long as you pull those 256 shades from 65,536 colors. 
 

 
You have the following GFX modes:
 
CLUT: 2..256 colors from a palette of 16 million.
E.g. for a picture of the "Moon" you could pick 256 matching "grey" colors. Or for a picture of Mars you could pick 256 "red-sand" colors.
 
 
I have no clue what your game will look like, but
if I imagine just a background picture of the "moon", then 256 colors might look very good for this?
 
SAGA also supports 15bit/16bit direct color,
 
24bit direct color,
 
and 32bit (24bit plus 8bit padding).
 
Another mode is "YUV", which is the video format used in MPEG video compression. YUV offers 24bit colors with compressed storage.
 
Maybe one interesting idea might even be using different modes in layers...
Maybe YUV video animation in the background with 8bit CLUT on top?
But I'm just tossing ideas around now...
 
I think what would help me a lot would be to see a draft of how it shall look. Can you show me more so I can better imagine this?
 
 



Vladimir Repcak

Posts 327
07 May 2020 10:57


Gunnar von Boehn wrote:

  Maybe one interesting idea might even be using different modes in layers...
  Maybe YUV video animation in the background with 8bit CLUT on top?
  But I'm just tossing ideas around now...

Yes, on Jaguar, the ObjectProcessor can merge 2 bitmaps of 2 different bit depths at run-time.

Meaning, I had a 16-bit background (the space bitmap) and an 8-bit foreground (the 3D flatshaded scene).
And there was no need to copy the background bitmap into the framebuffer first. The ObjectProcessor simply drew it on screen from the pointer.

But you said a few months ago that SAGA cannot do that.

Are you saying it is possible to merge two bitmaps like that now ?



Vladimir Repcak

Posts 327
07 May 2020 11:01


Gunnar von Boehn wrote:

  I have no clue what your game will look like, but
  if I imagine just a background picture of the "moon", then 256 colors might look very good for this?

But, you're forgetting that if we put all 256 palette colors into the background bitmap, then the 3D foreground will need to have the exact same palette.

Once you start dividing 256 colors among all elements (HUD, 3D scene, laser shots, menu, etc.), you'll be lucky to have a total of 64 colors for the planet.




Kef Emzy

Posts 43
07 May 2020 11:06


Gunnar. Isn't there a 16-bit ARGB mode as well with 4-bits per component?


Vladimir Repcak

Posts 327
07 May 2020 11:18


Kef Emzy wrote:

Gunnar. Isn't there a 16-bit ARGB mode as well with 4-bits per component?

4-4-4 ?

I think I'd rather cripple the framerate to 20 fps (and slow down the speed of racing so it is still smooth) than suffer 4-4-4 RGB.

Hell, even 5-6-5 sucks. Can't even do simple antialiasing without quite ugly artifacts.

Forget about more complicated kernels or fractional weights. When you compute a fraction from the range <0,15> it's still going to be a very high value, which would completely kill any soft postprocessing FX. Want to do a blend between atmosphere and background ? Get ready for random ugly giant pixels exactly at the border.

It would be much more work than 8-8-8, yet look like sh*t.

Thanks, but no thanks :)
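
The coarseness complained about here can be put in numbers: with n bits per channel, neighbouring levels sit 255/(2^n - 1) apart on an 8-bit scale. A rough sketch that quantizes an 8-bit channel value and expands it back; the function names and the sample gradient are illustrative assumptions, and rounding to the nearest level is only one possible conversion:

```python
def step_size(bits: int) -> float:
    """Distance between neighbouring levels, measured on a 0..255 scale."""
    return 255 / ((1 << bits) - 1)

def quantize(value_8bit: int, bits: int) -> int:
    """Reduce an 8-bit channel value to `bits` bits, then expand back to 0..255."""
    levels = (1 << bits) - 1
    q = round(value_8bit * levels / 255)
    return round(q * 255 / levels)

for bits in (4, 5, 6, 8):
    print(f"{bits} bits per channel: step of {step_size(bits):.2f} on a 0..255 scale")

# A soft 8-bit gradient from 100 to 110 seen through a 4-bit channel.
print([quantize(v, 4) for v in range(100, 111)])
```

In this example the whole 100..110 gradient collapses to a single 4-bit level, which is the kind of loss that makes soft blends and antialiasing break down at 4-4-4.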


Gunnar von Boehn
(Apollo Team Member)
Posts 4788
07 May 2020 11:20


Vladimir Repcak wrote:

But, you're forgetting that if we put all 256 palette colors into the background bitmap, then the 3D foreground will need to have the exact same palette.

This is obvious.

Many games used 256 color mode (there was a whole era of VGA games, from DOOM to "Simon the Sorcerer" or "Age of Empires").

Whether this is useful or usable for you, I cannot decide, as I have not seen your GFX.


Gunnar von Boehn
(Apollo Team Member)
Posts 4788
07 May 2020 11:24


Kef Emzy wrote:

Gunnar. Isn't there a 16-bit ARGB mode as well with 4-bits per component?

No there is no such mode.


Vladimir Repcak

Posts 327
07 May 2020 11:48


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

  But, you're forgetting that if we put all 256 palette colors into the background bitmap, then the 3D foreground will need to have the exact same palette.
 

  This is obvious.
 
  Many games used 256 color mode (there was a whole era of VGA games, from DOOM to "Simon the Sorcerer" or "Age of Empires").
 
  Whether this is useful or usable for you, I cannot decide, as I have not seen your GFX.

Trust me, I spent several hundred hours trying to make it work at both 256 and 65,536 colors in the past. Writing shaders in both 68000 assembler and GPU RISC assembler, and handling Jaguar's HW bugs when writing DSP code.

I don't need to decide on it. I know exactly what visual issues will happen when I go for 65,536 colors.

Wouldn't you rather have a flagship Vampire game that showcases the HW in an unexpected way rather than looking like a 1989 video game ?

I'm surprised you're even advocating 256 colors, honestly.

