Information about the Apollo CPU and FPU. |
|
---|
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 30 Jan 2020 10:11
| Vladimir Repcak wrote:
| So, those 4 months should roughly correlate with me finishing the V4 build. |
Sounds like a plan. :-) If you have questions or need some help, then keep in touch. In theory you have endless tuning options. The APOLLO 68080 supports Hyperthreading internally, but we do not leverage this today as AMIGA OS does not support it. So there is a lot of optional stuff you could do in the future to get more "Wow". But I assume your focus is to finish everything and get it working? We can give you tips for doing triple buffering most efficiently, and give you code examples for audio playback. I look forward to seeing your game soon.
| |
| | Vladimir Repcak
Posts 359 31 Jan 2020 01:30
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| So, those 4 months should roughly correlate with me finishing the V4 build. |
Sounds like a plan. :-) In theory you have endless tuning options.
| Yeah, but no plan can survive my obsession with cycle counting and constant refactoring. This was always a problem for me and poses the single greatest hazard to my schedules.
Gunnar von Boehn wrote:
| So there is a lot of optional stuff you could do in the future to get more "Wow". But I assume your focus is to finish everything and get it working? | Yeah, I will try, but I also have to be reasonable and allow myself some cycle-counting fun. It'd be virtually impossible for me not to with a new platform. But I do have a great running start. Yesterday I ported the quad rasterizer, which reduces the scanline count literally by 50%. I also managed to rewrite it to use only registers without accessing RAM at all, doing only the pixel write to RAM. So, the integer version runs at full speed (minus the bubbles). Today I've already rewritten the 3-pass quad case to use floating-point scanline traversal and am in the middle of doing the same for the two-pass and single-pass quads.
Gunnar von Boehn wrote:
| We can give you tips for doing triple buffering most efficiently.
|
Gonna start with double buffering first :)
Gunnar von Boehn wrote:
| Give you code examples for audio playback.
| You can do it now :) I can write the code now and test when I get my V4.
| |
| | Vladimir Repcak
Posts 359 31 Jan 2020 03:43
| Alright, I got the quad rasterizer ported and even optimized. Since it handles quads, it reduces the scanline traversal cost by 50%, which is obviously very significant. I also wrote a version that does the scanline traversal via floating point, though it most likely exhibits the same pipeline bubbles as the other code I posted. It will be interesting to benchmark them. Here's the screenshot of my test scene: EXTERNAL LINK I will try to upload the executable tomorrow (any suggestions where I can easily upload executables?) as I'd love to do some testing on a real V4, if somebody is bored :)
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 31 Jan 2020 12:08
| Vladimir Repcak wrote:
| Here's the screenshot of my test scene: EXTERNAL LINK |
Looks real nice. I look forward to testing it.
| |
| | Vladimir Repcak
Posts 359 31 Jan 2020 22:07
| Thanks, it's beginning to shape up, slowly... Does this download link work ? EXTERNAL LINK It's a static scene, doesn't move, just renders the framebuffer, waits and then quits.
| |
| | Vladimir Repcak
Posts 359 31 Jan 2020 22:31
| Now, let's talk bubbles :) Scanline Fill, in its simplest form, is a single-op loop:
ScanlineFillLoop:
    move.l  d0,(a0)+            ; Pipe 1
    dbra    d1,ScanlineFillLoop ; Pipe 2
|
I *think* there should be no bubbles whatsoever. There is no computation of anything other than the (a0)+, which I hope can happen in same cycle, correct ? Will this execute on both CPU pipes and take just 1 cycle per pixel ? Meaning a loop of 10 pixels is 10*2 = 20 ops, but takes only 10 cycles ?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 01 Feb 2020 01:39
| Vladimir Repcak wrote:
| Now, let's talk bubbles :) Scanline Fill, in its simplest form, is a single-op loop: ScanlineFillLoop: move.l d0,(a0)+ ; Pipe 1 dbra d1,ScanlineFillLoop ; Pipe 2 |
I *think* there should be no bubbles whatsoever. There is no computation of anything other than the (a0)+, which I hope can happen in same cycle, correct ? Will this execute on both CPU pipes and take just 1 cycle per pixel ? Meaning a loop of 10 pixels is 10*2 = 20 ops, but takes only 10 cycles ?
|
Yes you are fully correct. The loop can be executed in 1 cycle per iteration. How long is your average loop count? I ask because 68080 can write 64bit per cycle to memory.
| |
| | Vladimir Repcak
Posts 359 01 Feb 2020 02:35
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| Now, let's talk bubbles :) Scanline Fill, in its simplest form, is a single-op loop: ScanlineFillLoop: move.l d0,(a0)+ ; Pipe 1 dbra d1,ScanlineFillLoop ; Pipe 2 |
I *think* there should be no bubbles whatsoever. There is no computation of anything other than the (a0)+, which I hope can happen in same cycle, correct ? Will this execute on both CPU pipes and take just 1 cycle per pixel ? Meaning a loop of 10 pixels is 10*2 = 20 ops, but takes only 10 cycles ? |
Yes you are fully correct. The loop can be executed in 1 cycle per iteration.
|
Awesome, thanks ! So, at least the pixel fill can actually run at the full steam of 170 MIPS.
Gunnar von Boehn wrote:
| How long is your average loop count?
|
I have several benchmarking counters. Currently it's displaying only one of them (the scanline count), but I can also add this one. It's just one add per scanline anyway.
Gunnar von Boehn wrote:
| I ask because 68080 can write 64bit per cycle to memory.
| Yeah, the problem with such optimizations is that once you introduce a condition per scanline, it gets executed for every single scanline, including those that don't benefit from the faster codepath, so the whole pipeline stage becomes slower. But it will be easy to benchmark and compare whether it saved more cycles than it introduced. So, if we shoot for a scene complexity of ~10,000 scanlines, one such condition will eat 2 ops (CMP/BNE) x 10,000 = 20,000 cycles. But in higher resolutions, e.g. 640x480, the scanlines will be twice as long as in 320x240, so it will hopefully result in net savings. If we could guarantee that every scanline starts and ends at an even XPOS, then we could do away with such a condition and save 50% of the pixel fill cost. But that would reduce the horizontal resolution to just 50%... On 6502, my scanline fill was optimized via Jump - e.g. I had a fully unrolled scanline (of ScrWidth) and merely jumped into the respective move.l (indexed from the end):
    ....
    move.l  d0,(a0)+
    move.l  d0,(a0)+
    move.l  d0,(a0)+
    move.l  d0,(a0)+
    move.l  d0,(a0)+
    move.l  d0,(a0)+
    ....
|
Would the code above result in two pixels handled in both pipes ? Probably not, correct ? The second move.l doesn't yet have the correct (a0) address, right ? And introducing a second address register (e.g. a1) doesn't help us here, because then we would have to adjust it via a separate add.l #8,a1 - so we might as well just keep the simple initial loop that is guaranteed to execute in 1 cycle per pixel.
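The "jump into a fully unrolled run of writes, indexed from the end" idea can be sketched in C rather than 68k assembly, using a switch with deliberate fall-through as the computed entry point. This is only an illustration under assumptions: `fill_run` and `MAX_RUN` are hypothetical names, the run is capped at 8 pixels instead of a full ScrWidth, and whether a compiler turns the switch into a jump table is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit for this sketch; the post unrolls a full ScrWidth. */
#define MAX_RUN 8

/* Write `count` 32-bit pixels starting at dst by entering an unrolled
   sequence of stores `count` entries before its end. There is no loop
   counter and no per-pixel branch. Returns the address after the run. */
static uint32_t *fill_run(uint32_t *dst, uint32_t color, size_t count)
{
    assert(count <= MAX_RUN);
    switch (count) {            /* entry point = "last write - count" */
    case 8: *dst++ = color;     /* fall through */
    case 7: *dst++ = color;     /* fall through */
    case 6: *dst++ = color;     /* fall through */
    case 5: *dst++ = color;     /* fall through */
    case 4: *dst++ = color;     /* fall through */
    case 3: *dst++ = color;     /* fall through */
    case 2: *dst++ = color;     /* fall through */
    case 1: *dst++ = color;     /* fall through */
    default: break;             /* count == 0: nothing to write */
    }
    return dst;
}
```

Calling `fill_run(row + x, color, 5)` writes exactly five pixels and leaves the neighbours untouched. In assembly the same effect would come from computing a branch target relative to the end of the unrolled block, which is the per-scanline jump cost mentioned in the thread.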
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 01 Feb 2020 07:48
| Vladimir Repcak wrote:
| .... move.l d0,(a0)+ move.l d0,(a0)+ move.l d0,(a0)+ move.l d0,(a0)+ move.l d0,(a0)+ move.l d0,(a0)+ .... |
Would the code above result in two pixels handled in both pipes ?
|
This would write 64bit per cycle. For CLRSCREEN I would use it.
| |
| | Vladimir Repcak
Posts 359 01 Feb 2020 08:34
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| .... move.l d0,(a0)+ move.l d0,(a0)+ move.l d0,(a0)+ move.l d0,(a0)+ move.l d0,(a0)+ move.l d0,(a0)+ .... |
Would the code above result in two pixels handled in both pipes ? |
This would write 64bit per cycle. For CLRSCREEN I would use it.
|
Right, it would be fused ! Forgot about that. Actually, fusing might solve the problem. If we needed 13 pixels drawn, we would jump to LastPixelWrite - 13. The CPU would then execute 12 of the writes as fused pairs, and the last one would be done on a single pipe. No conditions, nothing. Even better :) So, how much faster would that be compared to the loop ? Factor of 2x ?
Example: Scanline with 64 pixels
Loop approach: 128 ops but done in 64 cycles (as looping is done on the second pipe)
Unrolled approach: 64 writes fused into 32 writes (of 64 bits), so 32 cycles.
Yep, factor of 2x (minus the per-scanline jump computation cost).
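The arithmetic in this comparison can be written down as a small C model. It assumes only what the thread states: the loop costs 1 cycle per pixel, and adjacent 32-bit writes fuse pairwise into one 64-bit write per cycle, with an odd trailing pixel taking one more cycle. The function names and the `jump_overhead` parameter are hypothetical, and the real per-scanline jump cost is not given in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Loop approach: move.l in one pipe, dbra in the other, 1 cycle per pixel. */
static uint32_t loop_fill_cycles(uint32_t pixels)
{
    return pixels;
}

/* Unrolled approach: writes fuse in pairs (one 64-bit write per cycle);
   an odd last pixel is written alone and costs one more cycle. */
static uint32_t unrolled_fill_cycles(uint32_t pixels)
{
    return pixels / 2 + pixels % 2;
}

/* Loop cost relative to unrolled cost, in percent (200 means 2x faster).
   jump_overhead is an assumed fixed per-scanline cost for computing
   the entry point into the unrolled block. */
static uint32_t speedup_percent(uint32_t pixels, uint32_t jump_overhead)
{
    uint32_t unrolled = unrolled_fill_cycles(pixels) + jump_overhead;
    assert(unrolled > 0);
    return loop_fill_cycles(pixels) * 100 / unrolled;
}
```

With these assumptions, a 64-pixel scanline gives 64 versus 32 cycles, and the 13-pixel case gives 6 fused writes plus 1 single write, so 7 cycles. Any nonzero `jump_overhead` pulls the ratio below 2x, and more so on short scanlines.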
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 02 Feb 2020 15:16
| Vladimir Repcak wrote:
| Thanks, it's beginning to shape up, slowly... Does this download link work ? EXTERNAL LINK It's a static scene, doesn't move, just renders the framebuffer, waits and then quits. |
The link works, the EXE works too. The colors look nice and smooth. The diagonal lines look a little bit "strange", not as smooth as I would have expected them. Do you know what I mean? Can you explain why? What is the "time" in the top left corner, how do I read it?
| |
| | Vladimir Repcak
Posts 359 03 Feb 2020 00:45
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| Thanks, it's beginning to shape up, slowly... Does this download link work ? EXTERNAL LINK It's a static scene, doesn't move, just renders the framebuffer, waits and then quits. |
The link works, the EXE works too. The colors look nice and smooth. The diagonal lines look a little bit "strange" not as smooth as I would have expected them. Do you know what i mean? Can you explain why? What is the "time" on top left corner, how do I read it?
| Thanks for testing it. I presume it was on real HW, right :) ? Glad to see I haven't introduced anything [yet, but I'm sure soon] that would break on real HW.
The counter is a hexadecimal number of scanlines. So, $523 = 1,315 scanlines.
I noticed a weird discrepancy when I introduced floating-point. Looks like I need a special codepath for vertical lines as they aren't really vertical - it is really visible, especially on a screenshot. Which is funny, as I wouldn't expect precision to be the issue with floating point. And because the end-point of an edge is forced (not computed), you can see the occasional odd point at the end of most edges. Those are things that will be high on my debugging list.
But right now, I am working on a version of the quad rasterizer that handles all clipping scenarios. Especially the top screen edge is wreaking havoc with code complexity (I already have over a dozen scenarios there) - I may have to write a generic version that will simply do two more checks per scanline - thinking about that right now...
The one good thing about Amiga's OS is that if I write somewhere I shouldn't, e.g. on an overflow, I will get an exception msg from the OS (unlike on Jaguar), so it won't just keep running as if nothing happened. This helps tremendously with unit testing.
| |
| | Samuel Devulder
Posts 248 03 Feb 2020 10:13
| Vladimir Repcak wrote:
| The one good thing about Amiga's OS is that if I write somewhere where I shouldn't, I will get an exception msg from OS, so an overflow (unlike on Jaguar), it won't just keep running as if nothing happened. This helps tremendously with unit testing.
|
Not quite true. AmigaOS is very permissive in essence. But since you are sharing memory with other apps, when you trash some random memory, you are likely to corrupt something used by some other app or the OS itself (free memory list, process stacks, libraries, devices, etc.), resulting in a visit from the well-known guru sooner or later.
| |
|
| | Samuel Crow
Posts 424 03 Feb 2020 22:44
| AROS has its own MungWall equivalent but I don't know about Enforcer.
| |
| | Vladimir Repcak
Posts 359 03 Feb 2020 23:46
| Samuel Devulder wrote:
|
Vladimir Repcak wrote:
| The one good thing about Amiga's OS is that if I write somewhere where I shouldn't, I will get an exception msg from OS, so an overflow (unlike on Jaguar), it won't just keep running as if nothing happened. This helps tremendously with unit testing. |
Not quite true. AmigaOS is very permissive in essence. But since you are sharing memory with other apps, when you trash some random memory, you are likely to corrupt something used by some other app or the OS itself (free memory list, process stacks, libraries, devices, etc.), resulting in a visit from the well-known guru sooner or later.
|
So, if I understand it correctly, I don't get the guru upon the very first write to RAM that isn't mine. I only get a guru when the write happens to hit memory of some app/OS ?
| |
| | Vladimir Repcak
Posts 359 03 Feb 2020 23:52
| Thanks. So these are the 'enforcer hits' I kept reading about on the forums. It appears to be focused solely on dynamic RAM allocation. Is there perhaps some similar SW that detects illegal RAM writes without using OS dynamic allocation ? I'm currently using RAM via the .bss segment (e.g. MyArray ds.l 1024). Thinking about it, I just realized I could write my own brutally simplified version of Enforcer:
1. Mark whole RAM with FEFEBABA
2. Run my game
3. Check whole RAM for FEFEBABA and exclude the range occupied by the game
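The three steps above can be sketched in C as a mark-and-scan check. This is a simplified illustration, not how Enforcer works: it operates on a caller-supplied region instead of "whole RAM", and `mark_region`, `count_overwrites` and the `own_*` parameters are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENTINEL 0xFEFEBABAu

/* Step 1: stamp every longword in [begin, end) with the sentinel. */
static void mark_region(uint32_t *begin, uint32_t *end)
{
    assert(begin <= end);
    for (uint32_t *p = begin; p < end; ++p)
        *p = SENTINEL;
}

/* Step 3: count longwords in [begin, end) that no longer hold the
   sentinel, skipping [own_begin, own_end), the range the program is
   allowed to modify. If first_hit is non-NULL it receives the first
   offending address, or NULL when nothing was overwritten. */
static size_t count_overwrites(const uint32_t *begin, const uint32_t *end,
                               const uint32_t *own_begin,
                               const uint32_t *own_end,
                               const uint32_t **first_hit)
{
    size_t hits = 0;
    if (first_hit)
        *first_hit = NULL;
    for (const uint32_t *p = begin; p < end; ++p) {
        if (p >= own_begin && p < own_end)
            continue;               /* the game's own data: ignore */
        if (*p != SENTINEL) {
            if (hits == 0 && first_hit)
                *first_hit = p;
            ++hits;
        }
    }
    return hits;
}
```

Step 2 (running the game) happens between the two calls. A stray write that happens to store the sentinel value itself goes unnoticed, and the scan only reports where the damage is, not which instruction caused it.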
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 04 Feb 2020 06:44
| Amiga OS programs can share memory. It works like this: a program (A) can allocate some memory and give a pointer to it to another program (B) to do something with it. This technique of "passing" information by pointer is common. Messages are passed this way, and data is shared between different programs this way. This sharing without the need to copy information is part of why AMIGA OS is swift.
Because of this concept, memory protection is not possible on AMIGA, as the idea of reading or writing another task's memory is part of the Amiga concept. This means memory protection as you may know it from Unix cannot be done on Amiga. The Amiga only knows that memory is owned by some program or that it's free. The Amiga does not keep track of who owns what, and can never prevent one task from accessing another task's memory. As mentioned, this access to shared memory is done on Amiga by design.
This means writing into another task's memory is never spotted and will never create a guru, unless you destroy/overwrite instructions in other programs, which then causes an ILLEGAL instruction exception or similar. Pointer violations can go undetected on Amiga for a while. Enforcer does a little trick to speed up finding them a little. Enforcer allocates the unused memory and monitors whether a program writes into the free regions. Enforcer will not spot a program writing into another program.
| |
| | Samuel Devulder
Posts 248 04 Feb 2020 10:33
| Vladimir Repcak wrote:
| So, if I understand it correctly, I don't get the guru upon very first write to a RAM that isn't mine. I only get a guru when the write happens to a memory of some app/os ?
|
Right! There is *no* memory protection in Amiga OS. Everything is shared, making it a very fast & responsive OS, but somewhat fragile.
| |
| | Vladimir Repcak
Posts 359 06 Feb 2020 14:35
| Thanks guys, that makes sense. As for the update, I finally finished implementing the clipping for the quad rasterizer and tested it by moving the camera in a loop that covered all clipping scenarios. It's kind of a unit test too. I can now go implement the double buffering.
| |