Writing 3D Engine for 68080 In ASM

Gunnar von Boehn
(Apollo Team Member)
Posts 6253
22 Dec 2019 14:52


Vladimir Repcak wrote:

Is that a 16-bit texture?
It surely is filtered at full floating-point precision

Yes, 16-bit, with mipmapping plus bilinear interpolation and light+shade (Gouraud+Phong).
But frankly, I wonder if this was overkill, and whether aiming for a simpler renderer would not also create games which are super fun.
 


Vladimir Repcak

Posts 359
22 Dec 2019 15:40


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

  Is that a 16-bit texture?
  It surely is filtered at full floating-point precision
 

  Yes, 16-bit, with mipmapping plus bilinear interpolation and light+shade (Gouraud+Phong).
  But frankly, I wonder if this was overkill, and whether aiming for a simpler renderer would not also create games which are super fun.
 

Of course, nobody will argue that fun factor is a separate consideration and no matter how beautiful art assets are, it's the gameplay, in the end, which matters most.

That being said, nice gfx can surely bump the wow factor and keep player hooked :)

So, what kind of performance did you get out of this ? It's 640x480 ? More than 10 fps ?

Frankly, that thing is PS3-quality, given the texture resolution.

Then again, given how many floating point registers you guys have in 080, you can certainly interpolate more than just UV coords for the texture...
Basically, it's just one FADD (or so) for the light, once the scanline coefficients are computed, right ? That should be just 1 cycle...
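
The "one add per pixel" idea can be sketched in C as follows. This is only an illustration, not code from the thread: the function name `shade_scanline` and the linear left-to-right light model are assumptions, and the real inner loop would be 68080 assembler.

```c
#include <stddef.h>

/* Fill out[0..width-1] with light values interpolated linearly from
 * `left` (first pixel) to `right` (last pixel).
 * The step is computed once per scanline; after that the inner loop
 * needs a single floating-point add per pixel. */
void shade_scanline(float *out, size_t width, float left, float right)
{
    if (width == 0)
        return;

    /* Per-scanline setup: one divide, done once. */
    float step = (width > 1) ? (right - left) / (float)(width - 1) : 0.0f;
    float light = left;

    for (size_t x = 0; x < width; ++x) {
        out[x] = light;
        light += step;      /* the single add per pixel */
    }
}
```

The same pattern extends to any other per-pixel quantity (U, V, a second light term): each one costs one extra accumulator register and one extra add per pixel.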

You know what - about 14 yrs ago I was trying to get financing for an Aquanox3 remake and it looks like some of my art assets just might be usable here :)

EXTERNAL LINK  EXTERNAL LINK 
Ignore the high-poly vegetation - those could be 2D impostors rendered every 30 degree angle (or somesuch).



Vladimir Repcak

Posts 359
22 Dec 2019 15:47


Stefano Briccolani wrote:

I really like this metal-bashing approach... Are there any videos of your Jaguar StunRunner-clone engine to see? Just to get an idea of what you're aiming for. And (of course)... welcome to the Vampire world, Vladimir.

I'll see about posting something here later.

But, if you look up Arcade StunRunner, then that's basically what it looks like (just with more colors - as I smoothly blend between shades from one track segment to another).

If I get this far here, I would certainly spend a week or two recreating art assets for whatever complexity 080 can handle - so don't worry about being stuck with jaguar-level art assets.

And since I'll be soon creating art assets for Medium and High Detail for jag, those could serve as a base for 080, yet still get 60 fps on 080 - I presume here that 080 is certainly at least 5x faster than Jaguar's GPU+Blitter+68000, assuming proper asm coding...


Vladimir Repcak

Posts 359
22 Dec 2019 16:18


Gunnar von Boehn wrote:

  Do you want 24bit texture and good quality bi-linear/tri-linear filtering?

I can give you one additional use-case scenario here. Are you familiar with Radiosity lighting ?
I'm a long-time fan and recently managed to convert the Radiosity from floating-point to straight integer (for jag, but we probably could keep floating point for 080).

While right now I compute form factors for each surface on PC, given that Apollo has 512 MB, we can certainly do it at run-time there too.

Now, about 4 weeks ago, I was playing with Radiosity on Jaguar in 640x200, 16-bit. I can import the form-factor file for each quad and can recompute final lightmaps at run-time on Jag.

You can adjust light intensity and color - so this allows for a form of dynamic lighting. Jaguar's GPU is fast enough to recompute the following scene at about 20 fps (just not render it at that frame rate yet).

EXTERNAL LINK  EXTERNAL LINK 
These screenshots are from my own personal dev emulator of various Ataris (Lynx, 800XL, Jaguar) with cycle-exact benchmarks, which lets me intermix C++ with target Assembler (so I can discard algorithms long before they are fully ported to ASM).
But I have the exact same scene running on Jaguar - I was just fighting the brutal imprecision of the Jaguar Blitter's fractional scaling - you'd think that 16 bits gives you a lot of precision, but there are quite a few hardwired bumps there, so the clipped&scaled texture looks like crap there (hence forcing the SW rasterizing - which shouldn't be an issue here).

There's a tremendous visual difference between 16-bit and 24-bit.
16-bit shows clear steps between lumens but 24-bit is butter smooth. I don't think those screenshots are 24-bit, would have to run it again.

So, forget Quake2. This basically gives us per-pixel lighting Doom-3 style (minus normal maps - though those could certainly be emulated to a degree) in ~real-time :)

At 24-bit, this thing could be a brutal next-gen ;)

Of course, lightmap resolution is a variable - so don't worry, it can go lower. But if you double the resolution, you quadruple the workload :)

And, of course, this is the prime candidate for AMMX - working on all 4 components in one instruction (whether it's RGBA color or XYZW vertex). So, compared to Jaguar, where I have to work on each component separately, this thing could fly on 080 !

If you're wondering what's the core idea - then upon entering each room, the lightmaps would be generated from the Radiosity solution during the course of 4-8 frames (depends on resolution). Now, unlike Jaguar, we can't have a separate chip (DSP or GPU) keep computing lightmaps in parallel. We'd have to break it into batches on 080, as it still needs to render the scene and handle gameplay in normal framerate.
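
Breaking the lightmap work into per-frame batches could look roughly like the C sketch below. Everything named here (`LightmapJob`, `lightmap_budget`, `lightmap_job_step`, the `count_texel` placeholder) is hypothetical and only illustrates the scheduling; the actual Radiosity math per texel is left to the callback.

```c
#include <stddef.h>

/* Hypothetical job state: how far the lightmap rebuild has progressed. */
typedef struct {
    size_t total;   /* texels in all lightmaps of the room */
    size_t next;    /* first texel not yet computed        */
} LightmapJob;

/* Per-frame budget that finishes `total` texels within `frames` frames
 * (rounded up so the last frame never overflows the budget). */
size_t lightmap_budget(size_t total, size_t frames)
{
    return frames ? (total + frames - 1) / frames : total;
}

/* Process at most `budget` texels this frame; returns how many were done.
 * `compute_texel` stands in for the real per-texel Radiosity work. */
size_t lightmap_job_step(LightmapJob *job, size_t budget,
                         void (*compute_texel)(size_t index, void *ctx),
                         void *ctx)
{
    size_t remaining = job->total - job->next;
    size_t n = remaining < budget ? remaining : budget;

    for (size_t i = 0; i < n; ++i)
        compute_texel(job->next + i, ctx);

    job->next += n;
    return n;
}

/* Placeholder callback used for illustration: just counts texels. */
void count_texel(size_t index, void *ctx)
{
    (void)index;
    ++*(size_t *)ctx;
}
```

Calling `lightmap_job_step` once per frame, before or after rendering, spreads the rebuild over the chosen number of frames while the scene keeps drawing with the previous lightmaps.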

Remember how early Unreal3-engine games used to start with super-low texture resolution and over the course of next 3-4 seconds would res up into full res ? Same thing :)



Gunnar von Boehn
(Apollo Team Member)
Posts 6253
22 Dec 2019 17:54


Vladimir Repcak wrote:

Remember how early Unreal3-engine games

To be honest, I'm not at all a fan of first-person shooters.
I'm not for playing Quake, Doom or anything like that.
I think those games are too violent, as you see the people being shot at.

Driving with space craft or cars or tanks, or a race game and I'm with you.

Regarding technical aspects...
I assume avoiding memory access is clever,
and I would rather do 3 multiplications instead of one memory access.




Mr Niding

Posts 459
22 Dec 2019 18:29


I won't pollute the thread beyond this post;

I too love spacecraft, cars and race games, but with regard to FPS;

Quake, Quake 3 Arena (I never liked Quake 2), Unreal Tournament and Counterstrike 1.6 are great games and fun. The newer "realistic" FPS I'm not a fan of, as you allude to, Gunnar.


Jacek Rafal Tatko

Posts 19
22 Dec 2019 20:16


One of the Best Games in that style was Stunt Car Racer (1989)
3d was somewhat limited, resolution etc... but on the V4 it could
be put onto another level.
 
Visually, there was little difference between Atari and Amiga
versions. From my experience, on Amiga it was quite addictive,
not sure about the real gameplay differences on an Atari 520 ST.
 
Here is an informative video about the Atari 520 ST version
EXTERNAL LINK 

 
Here is a video on the Amiga version
EXTERNAL LINK 
 
This is a 2-hour longplay of the Amiga version
(and maybe a bit better player)
EXTERNAL LINK 
 
Regarding some of the features,
the car is dropped from a height onto the track,
turbo boost is essential to overcome obstacles in time
(accelerate enough from obstacle to obstacle and take-off to fly)
but usage is time limited per track.
 
 
The fun was to overtake or ram opponents off track,
make lap and track records, advance onto more difficult
tracks, find a way out of wrecking pits and gaps in the track
without getting totally damaged, in general avoid damage
off track or getting rammed.
 
 
Enjoy high jumps, flights over bumps & ramps at the right speed,
on some parts or tracks it was possible to get over 2 obstacles,
thus gaining even more time - without over flying the curves -
landing with measured damages on the damage counter.
 
I vaguely remember that one could see or do a Replay
(not so sure about it).
 
Stunt Car Racer was very popular but it was an early game
on C64, Amiga & Atari, very basic but immensely fun, playable
for days, weeks, months (some might wonder, we were kids…).
 
A few more ideas if I may suggest,
additional features to mod the car,
engine, suspensions, brakes, fuels,
lights, tires, have different race track
conditions, like day, night, rain, snow, icy,
oily parts, damaging leftover car parts from
ramming or damaging events,
maybe even be able to shoot bullets or rockets
(all stuff bought in the shop between tracks & races)
or just drop some oil, mines, hit some items
on the side of the track like barrels or lamps
with little damage and make them fall onto
the track for the opponents behind…
 
I never played it with null modem in 2 player mode
but apparently that was yet another great feature. 
Now over network with two V4 could be awesome,
being able to race full screen against friend(s).
 
Having a good responsiveness to the Joystick, to
steer & accelerate is also a thing to get really right.
 
15 years later Stunt Car Racer TNT came out (2019)
with new tracks & stuff - haven't played it yet but it
shows the interest in this type of 3d-style racers EXTERNAL LINK   
     
Other racing games, albeit overhead-style,
that were both fun & had cool features were:
 
Super Cars (2 versions)
  EXTERNAL LINK   
 
Super Offroad EXTERNAL LINK 
 
       
and ofc Badlands EXTERNAL LINK
 
Those three are all fine too & may offer you some
valuable inspiration for different features that
each & all could add to the fun... racing, flying,
sliding, ramming, wrecking, getting ahead & win!
 
  Looking forward to what you come up with
  in your game project, Vladimir. Welcome!


Vojin Vidanovic
(Needs Verification)
Posts 1916/ 1
22 Dec 2019 20:32


Vladimir,
 
  May you reign the peace :) Thanks for the effort, share the results!
 
  Like Gunnar, I prefer games with more "sense" than FPS. FPS is just 3D Wizards of War :)
 
  So, in racing some blend of Barbados Racing AGA and "kill racing"
  like Roadkill AGA/CD32 :) Or 3D form of All Terrain Racing, Null modem and Internet play please.
 
  If not, then Elite First Encounters :)

A combat isometric 3D game ahead of its time that deserves texturing?

There was an open world adventure with isometric graphics, Guardian on a single floppy (not the crappy CD32 shooter) and one robot arena shooter with add-ons :) I'll recall the names.

Whatever suits you for demo of engine :)



Jacek Rafal Tatko

Posts 19
22 Dec 2019 23:01


Jacek Rafal Tatko wrote:

...
and ofc Badlands EXTERNAL LINK

Link to Badlands above was wrong. It's here EXTERNAL LINK


Vladimir Repcak

Posts 359
23 Dec 2019 08:32


Gunnar von Boehn wrote:

 
  To be honest, I'm not at all a fan of first-person shooters.
  I'm not for playing Quake, Doom or anything like that.
  I think those games are too violent, as you see the people being shot at.
 
  Driving with space craft or cars or tanks, or a race game and I'm with you.

Well, designing a 3D engine for FPS from scratch in Assembler is a crazy idea anyway :)

I'll see if I can think of some other good use of 24 bits...

Well, we could obviously use the runtime Radiosity lightmaps in a 3D platformer style - remember Prince Of Persia on X360 ? We can totally have the same visual quality of environment as the lightmaps would be computed upon entering each room. Characters would have to be sprites, but we do have 512 MB RAM so that shouldn't be an issue...

Gunnar von Boehn wrote:

 
  Regarding technical aspects...
  I assume avoiding memory access is clever,
  and I would rather do 3 multiplications instead of one memory access.

Well, here's a related question for you:
On Jaguar, the performance discrepancy between loading value from an external LookUp Table (in RAM) vs computing it (inside 4 KB cache) was quite severe, rendering many lookups too expensive to use, because:
- you have to compute the index (5-8 ops)
- you have to load the table's address (2 ops)
- you have to combine the two (1 op)
- execute the load
- now wait till the bus controller gets the privilege from ObjectProcessor (highest priority on the bus)
- only now the register has the value

It was quite crazy benchmarking the above, as the equivalent number of ops that GPU can execute from cache without being halted by external read was sometimes quite ridiculous.
Example - given the 26.6 MHz frequency of GPU, at 60 fps we get ~445,000 RISC cycles per frame. When you have an empty screen, zero load in the system, in a tight 2-op loop (load,jmp) I only got ~18,500 reads. Start doing stuff and the number drops below 10,000. Even if the pipelining was suboptimal, that's still over 200,000 ops, but you only get 20,000 - meaning factor of 10:1. Brutal.
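
The arithmetic behind these numbers can be checked with a few lines of C. This is just a worked calculation using the figures quoted above (26.6 MHz, 60 fps, ~18,500 reads); the helper names are made up for illustration.

```c
/* Cycles available per frame for a given clock and frame rate. */
unsigned long cycles_per_frame(unsigned long clock_hz, unsigned fps)
{
    return clock_hz / fps;
}

/* Average number of cycles that pass per completed external read,
 * given how many reads fit into one frame. */
unsigned long cycles_per_read(unsigned long frame_cycles, unsigned long reads)
{
    return frame_cycles / reads;
}
```

With 26,600,000 Hz and 60 fps this gives 443,333 cycles per frame, in line with the ~445,000 quoted. Dividing by ~18,500 reads per frame leaves roughly 23-24 cycles per external read, which is where the gap between achievable reads and achievable ops comes from.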

Given the superscalar nature of 080, roughly how many Integer+FPU ops will get executed before the external read is fetched ? Only 6 cycles (3 Int + 3 FP) ? That would be awesome if it's just that.

Also, how do you handle dependencies like that within the pipeline ? Do you halt the execution 100% for all cases till dependencies are resolved?
Jaguar wasn't doing much (other than register scoreboarding) and if you didn't account for the pipelining, your code has wrong data despite being written correctly.
Meaning, do I have to keep inserting lots of nops between instructions for the code to execute properly ?



Vladimir Repcak

Posts 359
23 Dec 2019 08:43


Jacek Rafal Tatko wrote:

One of the Best Games in that style was Stunt Car Racer (1989)
  3d was somewhat limited, resolution etc... but on the V4 it could
  be put onto another level.

Well, when I tried that scene complexity on Jaguar, I got 60 fps in 640x200 and still over 50% of GPU cycles were available (e.g. for physics).

So, the 080 should do it easily with texturing.

It is, actually, a great type of game to make use of the parallel FPU unit of 080 - as not only you use it for physics, you would use it also for texturing.

We probably wouldn't even need to have pre-processed track physics in a separate file (per track), and could just compute the darn thing in real time...



Vladimir Repcak

Posts 359
23 Dec 2019 09:06


Samuel Devulder wrote:

On 68k, no compiler is really beating human ASM, but converting a whole source to ASM usually fails because the size makes it unmaintainable.

Actually, my own Higgs compiler sometimes beats me.
You see, I don't always have the same focus, and whichever code you're in right now directly affects the way you write the next code.
So, I found out that I don't always write the same things (loops, conditions) with identical effectiveness.
But, my Higgs compiler is 100% effective. I have days when I can barely type 10 lines of code, and it definitely wouldn't be Asm. But in Higgs, it's easy. Here's an example from my AI routine:


{ ; Process Shooting Enemy : Activation, Strafing, Shooting

  SEnemy.InitRegister (MainEnemy)
  if.w (ShootingEnemiesCount == #0)
  { ; Activate it (if time has come)
    ; SEnemy_GetTrackWidthUnderEnemy ()

    ; Has enough time elapsed since last destroyed enemy ?
    if.w (Frame > FrameOfNextSpawn:d0)
    { ; We can activate shooting enemy
      ; print.b #7,210,60
      ; vbl 1*60
      { ; Grab next Enemy from the pool and activate it
        SEnemyPool.InitRegister (EnemyPool)
        InitEnemyFromPool ()
        SEnemy_UpdateXposBounds ()
        SEnemy_PlaceToTrackCenter ()
        TurnMainEnemyOn ()
      }
      ; halt
    }
  }
  else
  { ; Enemy is active, now process it
    ; Check if it's time to recompute XposBounds
    if.w (Frame > SEnemy.FrameUpdateXposBounds:d0)
    {
      SEnemy_UpdateXposBounds ()
      ; print.b #$22,128,40
      ; vbl 1*10
    }
    SEnemy_ProcessStrafing ()
    SEnemy_ProcessShooting ()
    SEnemy_CheckCollisionAgainstPlayerLS ()
    { ; Collision Detection: Player vs Enemy.Lasershots
      local word lpLaser
      SLaserShot.InitRegister (EnemyLaserShots)
      loop.w (lpLaser = #MaxLaserShots)
      {
        if.w (SLaserShot.IsActive == #1)
        {
          CheckCollisionLSvsPlayer ()
          if.w (SPlayer.HP == #0)
          {
            ; Game Over
            jmp gsGameOver
          }
        }
        SLaserShot.Next ()
      }
    }
  }
}

Writing nested conditions in Asm results in code that is hard to throw away due to effort required to write it.
But the above - I don't have problem deleting it, if need be.


   
Samuel Devulder wrote:

  My rule of thumb is: 1) do it in C, then 2) profile, and 3) replace costly C algorithm with "better thought" algorithm (yeah good algorithm is important), 4) profile again and 5) if still some function appear on top, introduce ASM equivalents (based on good algorithms this time).

Yeah, I've done that on Jag. I lost more time figuring out all the alignment issues, random glitches and bugs of both the assembler and HW.
You write the code in 2 hours in C, and spend 2 weeks debugging WTF is going on.
I'd rather spend 2 days writing the same code in Asm but have 100% full control over alignment and performance.

And, given my Higgs compiler, I don't even have to do that anymore - it's almost C :)

Samuel Devulder wrote:

  For the programming env. I work on PC with cygwin + notepad++ + bebbo's amiga-gcc. I test under UAE to check that the code is working on pure 68k, then go to the vamp, download the exe via "wget" from the pc or some other external storage and test/measure real fps. Using network is quite efficient, and some people reported that it is possible to remote-debug a program running on the vamp directly from eclipse, so the UAE phase might be skipped then.

I presume it's gdb they use for remote debugging ?
Anyway, after half a decade of working with the Jag, I created an extensive set of runtime debugging routines right on target HW.
Given the High Resolution of Apollo, these can be significantly extended (there's only so much onscreen printing you can do when your vertical resolution is 240).

Is there any Motorola Assembler debugger that can remotely step through instructions ? Although, just realized, that would have to support 080 opcodes, so that's not going to happen...



Gunnar von Boehn
(Apollo Team Member)
Posts 6253
23 Dec 2019 09:08


Vladimir Repcak wrote:

Meaning, do I have to keep inserting lots of nops between instructions for the code to execute properly ?

The 68080 is a CISC chip that is beautiful to program.
It will always correctly track dependencies by itself.
You (the programmer) never need to worry about this.
This means you never need to insert NOPs manually like on a shitty MIPS RISC chip.




Gunnar von Boehn
(Apollo Team Member)
Posts 6253
23 Dec 2019 09:21


Regarding all your technical questions.

APOLLO on the V4 has
  - 16 KB ICache.
    This should be enough for your work loop.

  - The ICache provides 16 bytes of instructions per cycle to the decoder.
    This is plenty, and enough to execute several instructions per cycle.

  - The DCache is 64 KB.
    A memory burst is 32 bytes.
    The CPU can handle several parallel read requests in flight and will correctly track them.
    The CPU will by itself detect continuous memory access and will automatically prefetch the memory. This means that on linear memory access the DCache will start prefetching automatically and completely hide the memory latency.
    The latency is around 12 CPU cycles.
    So if you miss the DCache, this is the stall delay.

    The DCache is 3-ported.
    The DCache provides, in the same cycle:
    - 8 bytes READ per cycle to the CPU.
      The read can span 2 cache lines (misaligned) and will be correctly loaded over 2 lines and corrected for free (0 cycles).
    - 8 bytes WRITE per cycle from the CPU.
    - Line burst insert from the memory controller.

    This means the DCache can simultaneously preload cache lines,
    offer READ data to the CPU and accept WRITEs from the CPU.

    The CPU can order the DCache to preload cache lines at calculated addresses.
    This means that when your algorithm is written in such a way that it calculates the memory addresses it will use about 12 cycles before actually needing them, the CPU can use "TOUCH" DCache preload commands to instruct the DCache to preload those lines.
    Doing this carefully can completely remove the cache latency.
    This way you can constantly reach up to 600 MB/sec of memory throughput.

  Regarding computing power, the AMMX unit is pretty strong and you can do 4 multiplications per cycle with it.
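
The "calculate the address early, preload, use it later" pattern can be sketched in C. This is a portable stand-in, not 68080 code: it uses the GCC/Clang builtin `__builtin_prefetch` where the post talks about "TOUCH" commands, and whether a given compiler maps that builtin to anything useful on the 68080 is an open assumption. The function name `gather_sum` and the lookahead distance are likewise made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* How many elements ahead to request. A tuning guess, not a measured
 * value; the post only says the address should be known ~12 cycles early. */
#define PREFETCH_AHEAD 2

/* Sum texels fetched through an index table (non-linear access, so the
 * automatic linear prefetcher cannot help). While element i is being
 * processed, the cache line for element i + PREFETCH_AHEAD is requested. */
uint32_t gather_sum(const uint16_t *texels, const uint32_t *idx, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&texels[idx[i + PREFETCH_AHEAD]]);
#endif
        sum += texels[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint: the result is the same with or without it, so the code stays correct on targets where the hint is ignored.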


Gunnar von Boehn
(Apollo Team Member)
Posts 6253
23 Dec 2019 09:26


Regarding the texturing.

While the AMMX unit and the CPU are very strong,
our current concept is to off-load the rasterizing into a line unit.
(This has nothing to do with the AMIGA Blitter.
The AMIGA Blitter can only process planar bits.)

The beauty of such a dedicated unit is that it will run in parallel and can offer 4 or 8 or however many multiplications or operations per cycle we want on top.

Such a unit could also be pipelined in such a way that it hides memory latency even better.

The general concept would be that the CPU calculates the 3D coordinates and runs the edges of the polygons to draw, and puts these start-end points into a FIFO... and a Line-Rasterizer takes the work items out of this FIFO and handles them on its own.

The FIFO allows the CPU to pre-calculate and removes dependencies between CPU and rasterizer. Each of them can run fully parallel at its best speed.
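
As a software model of that producer/consumer split, a minimal ring-buffer FIFO in C might look like this. The work-item layout (`Span`), the depth of 64 entries and the function names are all assumptions for illustration; the post does not describe the real hardware interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 64u   /* hypothetical depth; must be a power of two */

/* Hypothetical work item: one horizontal run of pixels to rasterize. */
typedef struct {
    int16_t y;
    int16_t x_start;
    int16_t x_end;
} Span;

typedef struct {
    Span     items[FIFO_SIZE];
    uint32_t head;   /* total items written (producer side: the CPU)    */
    uint32_t tail;   /* total items read (consumer side: the rasterizer) */
} SpanFifo;

/* Producer: returns false when the FIFO is full, so the CPU knows it
 * is running ahead of the rasterizer. */
bool fifo_push(SpanFifo *f, Span s)
{
    if (f->head - f->tail == FIFO_SIZE)
        return false;
    f->items[f->head % FIFO_SIZE] = s;
    f->head++;
    return true;
}

/* Consumer: returns false when there is no work pending. */
bool fifo_pop(SpanFifo *f, Span *out)
{
    if (f->head == f->tail)
        return false;
    *out = f->items[f->tail % FIFO_SIZE];
    f->tail++;
    return true;
}
```

Because the counters only ever increase and the size is a power of two, the full/empty checks keep working when the 32-bit counters wrap around. This single-threaded model ignores synchronization, which real hardware would handle on its own.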



Vladimir Repcak

Posts 359
23 Dec 2019 12:45


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

  Meaning, do I have to keep inserting lots of nops between instructions for the code to execute properly ?
 

 
  The 68080 is a CISC chip that is beautiful to program.
  It will always correctly track dependencies by itself.
  You (the programmer) never need to worry about this.
  This means you never need to insert NOPs manually like on a shitty MIPS RISC chip.
 
 
 

You have no idea how happy I am about this behavior ! And while I removed most of those NOPs by detecting those scenarios in my Higgs Compiler, there are scenarios where you must rearrange ops directly in RISC ASM, and that's where the RISC rears its ugly head every single time. Last time was two weeks ago, when I had 95% of the code in Higgs, but the inner loop for Voxel terrain really needed to be manually reordered and NOP'ed - it took half a day and 40 builds to get it stable...


Vladimir Repcak

Posts 359
23 Dec 2019 12:57


Gunnar von Boehn wrote:

Regarding all your technical questions.
 
  APOLLO on the V4 has
  - 16 KB ICache.
    This should be enough for your work loop.
 
  - The ICache provides 16 bytes of instructions per cycle to the decoder.
    This is plenty, and enough to execute several instructions per cycle.
 
  - The DCache is 64 KB.

Please excuse my simple questions - I'm still in the middle of studying the 060 manual's section on MMU and caches.
On Jaguar, the control was manual - I had to halt GPU, copy 4 KB of code and restart GPU.

How can we do that efficiently for 080 - meaning ensuring that each 3D engine pipeline stage (as long as it's 16 KB of course) is always cached?

I'm checking the Translation control Register and it allows me to enable/disable the caches or choose 4/8 KB page size, or writethrough/copyback.
I don't see a way to specify a RAM address which I could designate as cache, thus implying I'd have to wait for MMU to catch up and cache the data ?

On DirectX, I often had to prime gfx card's caches, explicitly. Is this approach workable here too?

Given the 64 KB of data cache, we can certainly have separate transform/clip stage and render stage. 64 KB is plenty data to contain both 3D vertices and transformed 2D ones.

Ideally, the cache wouldn't be thrashed between the two stages and still contain the transformed vertices after the transform stage finishes.



Gunnar von Boehn
(Apollo Team Member)
Posts 6253
23 Dec 2019 13:54


Ok let us clarify a few things.

a) The old AMIGA had only planar gfx.
The old AMIGA blitter had a max mem-copy speed of ~3.5 MB/sec.
The old AMIGA blitter could only copy planar bits around.
The blitter has zero cache.

This means the old AMIGA has no HW acceleration for 3D.
This means any 3D game would render every pixel with the CPU.
This also means 320x200 with flat direct pixels was about the peak on accelerated old AMIGAs (even 68060).

Truecolor or bilinear interpolation was totally out of the picture.

So if you look online for existing documentation
on how to control a texture cache, you will never find it - as this did not exist in old Amigas.

We are the first and only ones ever to develop 3D acceleration for the Amiga chipset.

Now regarding caches:

Let's sum this up again:

A game's rendering could logically be split into 3 parts:
1) 3D Object calculation
2) Edge Interpolation
3) Line Rasterizing

All 3 could in theory be done by a CPU. (Doom Style)

Our approach is to split this into 2 groups:
1 and 2 being done by the CPU
3 being done by a Line-Rasterizer (Voodoo Style)
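
To make the edge-interpolation stage concrete, here is a small C sketch of what the CPU side might produce for the rasterizer: one start/end pair per scanline, stepped with 16.16 fixed-point adds. The names (`ScanSpan`, `walk_edges`) and the two-edge setup are assumptions for illustration, and coordinates are assumed to be non-negative and small enough to fit 16.16.

```c
#include <stdint.h>

/* Hypothetical output of the edge stage: one run of pixels per scanline. */
typedef struct {
    int y;
    int x_left;
    int x_right;
} ScanSpan;

/* Walk a left edge (xl0 -> xl1) and a right edge (xr0 -> xr1) that share
 * the scanline range [y0, y1) and emit one span per scanline.
 * Uses 16.16 fixed point, so each scanline costs two adds.
 * Returns the number of spans written (at most max_spans). */
int walk_edges(int y0, int y1,
               int xl0, int xl1, int xr0, int xr1,
               ScanSpan *out, int max_spans)
{
    int height = y1 - y0;
    if (height <= 0)
        return 0;

    /* Per-edge setup: one divide each, done once per edge. */
    int32_t xl  = (int32_t)xl0 * 65536;
    int32_t xr  = (int32_t)xr0 * 65536;
    int32_t dxl = ((int32_t)(xl1 - xl0) * 65536) / height;
    int32_t dxr = ((int32_t)(xr1 - xr0) * 65536) / height;

    int n = 0;
    for (int y = y0; y < y1 && n < max_spans; ++y) {
        out[n].y       = y;
        out[n].x_left  = (int)(xl >> 16);   /* integer part of 16.16 */
        out[n].x_right = (int)(xr >> 16);
        ++n;
        xl += dxl;
        xr += dxr;
    }
    return n;
}
```

Each emitted span is exactly the kind of start-end work item that stage 3 would consume, so stages 1 and 2 never need to touch individual pixels.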

Our CPU looks like this
- 85 MHz clock
- 16 KB Icache
- 64 KB DCache
  64bit memory interface

The rasterizer would be a separate unit with its own cache.
You will not find any documentation from Motorola about it.

What the best cache organization of the unit will be
depends on the use case and on what we aim for.
We have a layout for it atm, but I wonder if less could be more.
We could discuss this...



Vladimir Repcak

Posts 359
23 Dec 2019 13:55


Gunnar von Boehn wrote:

    The DCache is 3-ported.
    The DCache provides, in the same cycle:
    - 8 bytes READ per cycle to the CPU.
      The read can span 2 cache lines (misaligned) and will be correctly loaded over 2 lines and corrected for free (0 cycles).
    - 8 bytes WRITE per cycle from the CPU.
    - Line burst insert from the memory controller.
 
    This means the DCache can simultaneously preload cache lines,
    offer READ data to the CPU and accept WRITEs from the CPU.
 
    The CPU can order the DCache to preload cache lines at calculated addresses.
    This means that when your algorithm is written in such a way that it calculates the memory addresses it will use about 12 cycles before actually needing them, the CPU can use "TOUCH" DCache preload commands to instruct the DCache to preload those lines.
    Doing this carefully can completely remove the cache latency.
    This way you can constantly reach up to 600 MB/sec of memory throughput.
 

It's going to be challenging to design an algorithm that conforms to all the core behavioral rules of the cache (R, W, preload, etc.), but when that happens... :)

600 MB/s means 10 MB of data per frame at 60 fps.

Of course, most of that will be eventually thrashed, but it does mean that you can process up to 10 MB within the 64 KB. That's 160 full screens of 320x200 @8bit ! Insane !
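
The per-frame figures can be reproduced with a short calculation in C. The helper names are made up, and decimal megabytes (1 MB = 1,000,000 bytes) are assumed.

```c
/* Bytes that can be streamed per frame at a sustained throughput. */
unsigned long bytes_per_frame(unsigned long bytes_per_sec, unsigned fps)
{
    return bytes_per_sec / fps;
}

/* How many full screens of the given format fit into that many bytes. */
unsigned long screens_per_frame(unsigned long frame_bytes,
                                unsigned width, unsigned height,
                                unsigned bytes_per_pixel)
{
    unsigned long screen_bytes =
        (unsigned long)width * height * bytes_per_pixel;
    return frame_bytes / screen_bytes;
}
```

600,000,000 bytes/sec at 60 fps gives 10,000,000 bytes per frame, and a 320x200 8-bit screen is 64,000 bytes, so about 156 screens fit, close to the ~160 quoted.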


Vladimir Repcak

Posts 359
23 Dec 2019 14:04


I have just realized, that we could use the MMU caching algorithm to cache the actual FrameBuffer. 64 KB at 320x200@8bit would quite literally be full screen inside cache.

But even at 16bit and hi-res, it would still cover dozens of scanlines.

We don't even need Blitter to do this - MMU will take care of the memory write by itself once we linearly move far enough from the very first scanline.

I'm reasonably sure I could fit the whole flat-shader pipeline (frustum cull, 3D transform, clip, rasterize) of my StunRunner into 16 KB of Icache.

The 64 KB of Dcache would totally cover all vertices (3D+2D) and clearly at least a dozen scanlines of the FrameBuffer.

The Gameplay, AI, Input would be a second 16 KB (Icache) + 64 (DCache) chunk.
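
How many framebuffer scanlines fit into the data cache is a one-line calculation. The sketch below assumes the whole 64 KB (65,536 bytes) is available for the framebuffer, so it is an upper bound; vertices and other data would reduce it. The function name is made up.

```c
/* Upper bound on scanlines that fit into a cache of `cache_bytes`. */
unsigned scanlines_in_cache(unsigned long cache_bytes,
                            unsigned width, unsigned bytes_per_pixel)
{
    return (unsigned)(cache_bytes / ((unsigned long)width * bytes_per_pixel));
}
```

For 320 pixels at 8 bits this gives 204 lines, more than the 200 lines of a full 320x200 screen. For 640 pixels at 16 bits it gives 51 lines, which matches the "dozens of scanlines" estimate.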
