Writing 3D Engine for 68080 In ASM

Samuel Devulder

Posts 248
07 Feb 2020 10:26


Vladimir Repcak wrote:

  I can now go implement the double buffering.

On the Vamp, triple buffering is easier and faster, both to implement and at runtime (no need to poll the VBeamPos).

Idea: Allocate 3 memory buffers of the size of the screen: B[0], B[1], B[2]. Say you draw in B[i]. Once finished, make the SAGA pointer point to B[i], then advance i by one modulo 3, so that you draw the next frame in B[(i+1)%3]. That's all! No need to wait for the end of the screen before changing pointers. (This gives you a bit of extra MIPS to draw the next frame.)

Ehm, there's one buffer untouched. It is unnecessary, right?

No! Actually that "extra" buffer is a pretty clever solution to the following problem. Recall that the SAGA pointer is only updated at the end of the current VBL. Now suppose you have only 2 buffers: B0 and B1. You fill B0 while SAGA is showing B1. Once you finish B0, you make SAGA point to it and start working on B1 (still without waiting for the VBL, because you don't want to lose hundreds of thousands or millions of cycles in stupid polling loops when you can use those cycles to do something more useful). What you then observe is that, although you told SAGA to display B0, it is still displaying B1 (the SAGA pointer is not latched yet), the very same buffer you are working on. Hence you'll see a mix of the current and previous frames: tearing!

If you introduce a third buffer, this will never happen even if your code works way faster than the VBL. SAGA will always show one of the other 2 buffers, B[(i+2)%3] or B[(i+1)%3] (i.e. a finished screen), while you work on B[i%3]. No tearing anymore!

Notice: I use "%3" (i.e. modulo three) as an indication here. Many other implementations are possible.
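
The buffer rotation described above can be modelled on any host compiler. This is only a minimal sketch, not Samuel's code: the struct, the function names and the split into a "pending" pointer and a "latched" pointer are assumptions made for illustration, and no real SAGA register is touched. The model covers the case where at most one flip happens between two VBLs; a renderer that finishes two frames inside a single refresh is outside what the sketch checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the display side:
   'pending' stands in for the value written to the SAGA pointer,
   'latched' for the buffer the display is really scanning out. */
typedef struct {
    int count;    /* number of buffers in the chain (2 or 3) */
    int draw;     /* index of the buffer being rendered into */
    int pending;  /* index last written to the pointer */
    int latched;  /* index picked up at the last VBL */
} FlipChain;

static void chain_init(FlipChain *c, int count) {
    assert(c != NULL && count >= 2);
    c->count = count;
    c->draw = 0;
    c->pending = count - 1;
    c->latched = count - 1;
}

/* Frame in B[draw] is finished: publish it and move to the next buffer,
   without waiting for the VBL. */
static void chain_flip(FlipChain *c) {
    c->pending = c->draw;
    c->draw = (c->draw + 1) % c->count;
}

/* VBL: the display latches whatever pointer was written last. */
static void chain_vblank(FlipChain *c) {
    c->latched = c->pending;
}

/* Non-zero when the renderer is writing into the visible buffer (tearing). */
static int chain_draws_into_visible(const FlipChain *c) {
    return c->draw == c->latched;
}
```

With count set to 2, a single chain_flip() before the next chain_vblank() already makes chain_draws_into_visible() return non-zero, which is the tearing case from the post. With count set to 3 and one flip per VBL it stays zero.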


Vladimir Repcak

Posts 359
07 Feb 2020 14:50


Which SAGA pointers ? Do you have the addresses ? Or do we use some OS functionality for that ?

I was just checking cybergraphics.doc and didn't see anything on this there.


Philippe Flype
(Apollo Team Member)
Posts 299
07 Feb 2020 16:49


What Samuel describes here is indeed the easiest and fastest.

It's something a coder should do with a system-friendly approach, using Intuition.library -> ChangeScreenBuffer() for example. However, this approach is not that easy to deal with (you'd need a very good example) and it has some limits for our HW. Triple-Buffering and Memory alignment come to mind.

The easy (but efficient) method is to go 95% system-friendly (still use the OS/P96/CGX to init/uninit all the screen setup) + 5% HW-oriented (push our own framebuffer, triple-buffered).

Here is a list of SAGA registers,

EXTERNAL LINK 

About poking the Video Bitplane Pointer,
for Direct-Access Triple-Buffering :

EXTERNAL LINK


Vladimir Repcak

Posts 359
08 Feb 2020 05:54


Thanks a lot for the example ! Looks like I won't be able to run it under emulator, so I will wait till I get my V4.

In the meantime, WaitTOF will do a good job (temporarily).


Vladimir Repcak

Posts 359
08 Feb 2020 06:05


Alright, now that it appears the quad clipping is covering all cases, I can start porting the rest of the codebase - the actual gameplay/AI/input.

How do we properly do a 2D background bitmap layer - e.g. SkyBox (below framebuffer) on V4 ? From reading Amiga books, it appears the Dual Playfield technique would be used, but I have no idea if that approach is workable with CGX.

Also, there is a second bitmap - the HUD/GUI.

The final screen composite is actually:
1. Skybox
2. FrameBuffer (3D engine)
3. HUD/GUI

Of course, a plan C that will always work is just a bruteforce copy each frame - but I presume we won't have to resort to that here, right ?


Gunnar von Boehn
(Apollo Team Member)
Posts 6253
08 Feb 2020 07:00


Hi,

Can you show images of the three layers, so we can imagine how it should look in the end?


Vladimir Repcak

Posts 359
08 Feb 2020 07:44


I don't have an image from the Jag here, but imagine a classic 2D background from Outrun (full-screen, not just the half screen 320x100), overlaid with a transparent framebuffer on top of it.

And at the top of the screen, there's a 240x32 transparent HUD.

ObjectProcessor on Jag handled all the transparency, basically for free (<1% performance difference with transparency disabled).

I hope I won't have to handle the transparency on CPU ? That would brutally slow it down...


Gunnar von Boehn
(Apollo Team Member)
Posts 6253
08 Feb 2020 07:55


Vladimir Repcak wrote:

I hope I won't have to handle the transparency on CPU ? That would brutally slow it down...

If you use AMMX this is very fast.
AMMX is extremely fast at blitting cookie-cut sprites.
You can process sprites at memcopy speed, and reach around 500 MB/sec blitting speed with AMMX.


Vladimir Repcak

Posts 359
08 Feb 2020 08:15


I didn't think of AMMX, but I was hoping the HW would display them for me (like, using some Amiga feature). That explains the Cannonball's missing background, I guess.

So, basically, I will have to create the final composite by myself, on CPU. Ouch.

Well, I think I will now have to redesign it a bit. I can't be doing the following:
1. Copy 320x240x24bit background into current output framebuffer
2. Clear 320x240x24bit (3D scene framebuffer)
3. Merge (with transparency, per pixel) both into output framebuffer

So, this is where having 24-bit will destroy the performance. It's 4x as much data to move around.

That Jaguar was doing a lot of MIPS work for free - clearing framebuffer, overlaying transparent full-screen bitmaps, almost free flatshading+clipping,...

I now have literally zero idea how it will run on V4. I really won't know at all till the moment it all runs together :)
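
The three steps listed in this post can be fused into one pass per pixel. The sketch below is plain C, not AMMX, and it is not taken from the thread: the function name and the use of a color key (TRANSPARENT_KEY) to mark empty 3D pixels are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical color key: scene pixels with this value count as empty. */
#define TRANSPARENT_KEY 0x00000000u

/* For each pixel, take the 3D scene pixel unless it is the key,
   otherwise fall back to the background pixel. */
static void composite(uint32_t *out, const uint32_t *background,
                      const uint32_t *scene, size_t pixel_count) {
    assert(out != NULL && background != NULL && scene != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        uint32_t s = scene[i];
        out[i] = (s == TRANSPARENT_KEY) ? background[i] : s;
    }
}
```

For a 320x240 frame, pixel_count would be 76,800. The scene buffer still has to be reset to the key value before each frame in this variant.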




Vladimir Repcak

Posts 359
08 Feb 2020 08:21


If I have a full-screen 2D background, I can do away with the separate 3D framebuffer and with clearing it.

Copying it will also serve the functionality of clearing.

And rendering of 3D scene will happen directly there.

A few pages back we talked about the performance of unrolled fused copying - e.g. move.l (a0)+,(a1)+ should execute in 1 cycle, if I recall correctly.

At 32-bit, that's 76,800 cycles (one longword per pixel at 320x240), but at 8-bit color depth it would be just 19,200 cycles.
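
The two cycle counts follow from the frame size alone. A small sketch of the arithmetic, assuming (as the post does) one cycle per move.l and 4 bytes per move; the function name is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Cycles to copy one frame, assuming each move.l (a0)+,(a1)+
   transfers 4 bytes and costs 1 cycle. */
static uint32_t copy_cycles(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel) {
    uint32_t bytes = width * height * bytes_per_pixel;
    assert(bytes % 4u == 0u);  /* whole longwords only */
    return bytes / 4u;
}
```

copy_cycles(320, 240, 4) gives 76,800 and copy_cycles(320, 240, 1) gives 19,200, matching the figures in the post.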

I really like 32-bit, makes the flatshading so much more colorful...



Gunnar von Boehn
(Apollo Team Member)
Posts 6253
08 Feb 2020 09:08


Vladimir Repcak wrote:

Copying it will also serve the functionality of clearing.

Correct, as you want to CLR the screen anyway.
You can fill it with some nice content at the same time.
And this content does not need to be fixed: it can scroll or zoom, be animated, or whatever you like ...




Gunnar von Boehn
(Apollo Team Member)
Posts 6253
08 Feb 2020 09:17


Vladimir Repcak wrote:

  So, basically, I will have to create the final composite by myself, on CPU. Ouch.
 

 
It really depends on how you do the math.
There is no "OUCH" if you look at the MB/sec and not at the cycle count.

Let me explain.
Let's say you show a 320x240 screen at 50 Hz in 32-bit.
This means 15 MB/sec for display.

Let's say you want a Dual Playfield of this:
This means 30 MB/sec for display.

If you CLRSCREEN one of the 2 playfields,
this means 15 MB/sec for the CLRSCREEN.

So to get a non-rendered screen you have 30 MB display + 15 MB CLRSCREEN == 45 MB/sec.
This is for the Dual Playfield approach.

If you change this to 1 playfield only, and copy the picture instead of doing a CLRSCREEN,
then it's 30 MB/sec for the copy and 15 MB/sec for the display.
== the total is 45 MB/sec.

This is if you draw and display both at 50 FPS.
And if your render code is slower ... let's say you render at 30 FPS ...
you now save bandwidth with the memcopy approach.

So if you see memory bandwidth as the component that ultimately limits your game speed, then copying the screen together will not slow the game down.
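
The comparison above can be worked through in a few lines. This is a sketch of the arithmetic only; the function names are made up, and counting the copy as one read plus one write per pixel is my reading of how the 30 MB/sec figure comes about.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one frame in bytes. */
static uint32_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel) {
    assert(bytes_per_pixel > 0u);
    return width * height * bytes_per_pixel;
}

/* Dual Playfield: two layers scanned out on every refresh,
   plus one clear (write only) per rendered frame. */
static uint32_t dual_playfield_bw(uint32_t frame, uint32_t refresh_hz,
                                  uint32_t render_fps) {
    return 2u * frame * refresh_hz + frame * render_fps;
}

/* Single playfield: one layer scanned out on every refresh,
   plus one background copy (read + write) per rendered frame. */
static uint32_t copy_bw(uint32_t frame, uint32_t refresh_hz,
                        uint32_t render_fps) {
    return frame * refresh_hz + 2u * frame * render_fps;
}
```

For a 320x240 32-bit screen at 50 Hz, both functions return 46,080,000 bytes/sec when rendering at 50 FPS (the post rounds this to 45 MB/sec). When rendering at 30 FPS, the Dual Playfield figure is 39,936,000 and the copy figure is 33,792,000, which is the saving mentioned in the post.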


Vladimir Repcak

Posts 359
08 Feb 2020 09:52


Yeah, background has to scroll to account for camera angle. And since we have 128/512 MB RAM, the off-screen size is not an issue at all. We can easily store a 360 degree cubemap and just copy the 320x240 subset.
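
Copying a screen-sized subset out of a larger off-screen background comes down to a row-by-row copy with a source stride. A minimal sketch, assuming the window lies fully inside the source image (no wrap-around at the edges); the function and parameter names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a win_w x win_h window starting at (x, y) from a larger source
   image into a tightly packed destination buffer. */
static void copy_window(uint32_t *dst, size_t win_w, size_t win_h,
                        const uint32_t *src, size_t src_w, size_t src_h,
                        size_t x, size_t y) {
    assert(dst != NULL && src != NULL);
    assert(x + win_w <= src_w && y + win_h <= src_h);
    for (size_t row = 0; row < win_h; ++row) {
        memcpy(dst + row * win_w,
               src + (y + row) * src_w + x,
               win_w * sizeof *dst);
    }
}
```

For the case discussed here, win_w and win_h would be 320 and 240, and (x, y) would follow the camera angle. A background that wraps around 360 degrees would need an extra split copy at the seam, which this sketch leaves out.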


Vladimir Repcak

Posts 359
08 Feb 2020 10:05


The Ouch happened as I got spoiled by Jaguar's Object Processor. I quickly got used to a ~free (zero CPU time, just system bandwidth cost) bitmap rasterizer :)

However, we can use this situation to our advantage. I always wanted to have the environment in space, but 16-bit is not good enough for planet shading. It still shows visible rings even after antialiasing.

24-bit solves that problem completely, and the quality of the procedurally generated backgrounds will depend solely on the quality of my algorithm. Instead of writing a terrain+building rasterizer I will write a space background rasterizer. Roughly the same amount of work, but at 24-bit it will at least have the potential to look good (atmosphere, clouds, lens flare, etc.)

I'll thus try to target a 30 fps lock, which gives me a budget of 2.83 million cycles (up to 5.67 million for 2 pipes), but at least I won't have to redraw the background+3D scene so often, leaving more room for some additional details (be it lighting or texturing or something else).
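
The per-frame budget is simply clock rate divided by frame rate. The post does not state the clock it assumes; 2.83 million cycles at 30 fps is consistent with 85 MHz, so that value is used below as an inference, not as a quoted spec. The function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available per frame at a given clock and target frame rate. */
static uint32_t cycles_per_frame(uint32_t clock_hz, uint32_t fps) {
    assert(fps > 0u);
    return clock_hz / fps;
}
```

cycles_per_frame(85000000, 30) gives 2,833,333. Doubling it for two pipes gives 5,666,666, which is the "up to" figure and only holds if both pipes are kept busy every cycle.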


Samuel Devulder

Posts 248
08 Feb 2020 13:35


Which SAGA pointers ? Do you have the addresses ?

There is no OS support for this yet, so at the moment you have to poke the regs. But a game is not a utility. It is meant to be the only app displayed on screen. So HW banging is not a real issue if you check that your screen is indeed the front one (IntuitionBase->FirstScreen==MyScreen).

You can browse my github to see how I use the SAGA registers to perform triple buffering (and cropping using the modulo feature): EXTERNAL LINK

Doing (bits of) ASM and poking the HW instead of using many (and possibly inefficient) software layers of abstraction is indeed necessary to get the best out of the Amiga since, as good as the 68080 is, an 80 MHz CPU doesn't have lots of cycles to waste when running stuff that was designed for 200+ MHz processors.

For instance, consider the fps drop between Diablo running fullscreen (with direct SAGA access) vs Diablo in a window on the WB using the complete SDL abstraction layers (param -x on cmd in devilutionx 1.0.0). This is astounding: IIRC it is around 2 or 3 times slower with SDL than with the SAGA screen.

Also notice that you can make your program work correctly both with SAGA and with HW abstraction, depending on the config of the machine it runs on (see the ac68080_saga and ac68080_ammx flags in the devilutionx source code). This allows the program to run equally fast both in UAE using SDL and on a real Vampire Amiga using Apollo-core support.


Vladimir Repcak

Posts 359
12 Feb 2020 09:31


Thanks Samuel. This will come in handy real soon.

I don't think I understand what you mean by your last paragraph. Would I have to implement an SDL layer ?  I am currently running my code within UAE, the code is targeting 040 instruction set and CGX.



Vladimir Repcak

Posts 359
12 Feb 2020 09:38


So, I finally got my V4 delivered - great timing since I just about got most of the engine ported and running under WinUAE.

Now, before I can turn the thing on, there's a consideration about the power plug. I am in U.S., so I can't just plug the EU plug into the wall socket.

I presume it should be safe to use just my regular USB charger for a cell phone ?

But, there's a difference in input amperage, and I don't know if that difference won't burn my V4. Definitely preferable to be safe than sorry.

EU V4 PowerPlug  : Input (100V-240V : 0.5A)  Output (5V, 2A)
US CellPhone Plug: Input (100V-240V : 0.3A)  Output (5V, 2000 mA)

The only difference appears to be in Input : 0.5A (EU) vs 0.3A (US).

Can I use the U.S. cellphone plug or would that (0.5A vs 0.3A) destroy V4 ?


Gunnar von Boehn
(Apollo Team Member)
Posts 6253
12 Feb 2020 09:39


Vladimir Repcak wrote:

Thanks Samuel. This will come in handy real soon.
 
  I don't think I understand what you mean by your last paragraph. Would I have to implement an SDL layer ?  I am currently running my code within UAE, the code is targeting 040 instruction set and CGX.
 

He was saying that, if you want, you can provide both code paths in your program:

a) the slow copy-framebuffer-to-screen method, to work on UAE or other old AMIGA
b) the much faster MOVE.L A0,SAGAPTR approach, which saves you a memcopy per frame

Then you get max speed and more FPS on Vampire, and a working but slower version on older machines.


Vladimir Repcak

Posts 359
12 Feb 2020 10:27


Oh, I see now. Because I already have a working version (via the framebuffer copy), it would indeed be foolish of me to disable that code path.

Rather, I will make a compile-time switch that will either target this method (for WinUAE and 060 HW) or sagaptr method.

There are certainly huge productivity advantages to running the build on same screen on Windows.
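
A compile-time switch between the two paths could look roughly like the sketch below. Everything named here is hypothetical: USE_SAGA_PTR is a made-up build flag, display_ptr is a plain variable standing in for the hardware pointer (the real register address is not given in this thread), and screen stands in for the memory the OS screen setup would provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_W 320
#define SCREEN_H 240

/* Stand-in for the hardware display pointer (hypothetical). */
static const uint32_t *display_ptr;

/* Stand-in for the screen memory obtained from the OS (hypothetical). */
static uint32_t screen[SCREEN_W * SCREEN_H];

/* Show a finished frame. Build with -DUSE_SAGA_PTR for path b. */
static void present_frame(const uint32_t *frame) {
    assert(frame != NULL);
#ifdef USE_SAGA_PTR
    /* Path b: just repoint the display, no copy. */
    display_ptr = frame;
#else
    /* Path a: copy the finished frame into the screen memory. */
    memcpy(screen, frame, sizeof screen);
    display_ptr = screen;
#endif
}
```

The rest of the engine only calls present_frame(), so the WinUAE build and the V4 build differ by one compiler flag.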



