Information about the Apollo CPU and FPU.
| | Vladimir Repcak
Posts 359 22 Dec 2019 09:12
| I grew up on the Atari 800 XL and wrote my first ASM code (6502) 30 years ago, so the Amiga itself is an unknown entity to me (I believe I have only ever met a single person who confessed to owning one). Please bear that particular ignorance on my end in mind while answering my questions - YouTube is as close to an Amiga as I ever got.
I have spent the last few years working on the Atari Jaguar and love to push the system to its limits - I have a working playable demo of a StunRunner-style game (with many gameplay improvements) that runs at 640x240 at 60 fps in 65,536 colors, and several other demos that run at up to 1,024x240, so this is exactly what I'd love to do with the Vampire when I take breaks from Jag coding.
I've got well over 100,000 lines of 68000 and RISC (GPU/DSP) Assembler code covering various areas (road, terrain, voxels, indoor, etc.). I tend to optimize the pipelining in inner loops to the maximum and often write up to 10 versions of the same method, with detailed benchmarks, till I'm happy with the performance.
Now, on Jaguar, I put a lot of effort into designing a multi-threaded (or multiprocessor) engine where each of 68000, GPU and DSP has its own code and can work in parallel with everybody else without waiting for anyone else - thus achieving full parallelism of all 5 chips (OP, Blitter, 68000, GPU, DSP):
- while the Object Processor is drawing the last frame's framebuffer
- the Blitter is clearing the current framebuffer
- the GPU is doing transformation and scanline traversal without waiting (e.g. computing the next scanline while the Blitter draws the previous one)
- the 68000 is processing input, AI, physics, RPG mechanics, collision detection, OP interrupts and everything else
- the DSP is handling audio
Feel free to answer any of the questions. Hopefully, eventually, all of them will be answered. Thanks !
Q1: Apollo only has the 68080 as a standalone processor, so is the best possible parallelism there achieved only through the parallel integer and FPU pipelines and perhaps the Blitter (more on the Blitter later) ?
Q2: Blitter - does Apollo's Blitter run at the full system bandwidth of 700 MB/s or at the original one ? While I will benchmark it as soon as I get the system, I'd love to know now. I'm well aware of the scanline-length threshold on Jaguar below which it doesn't make sense to spin the Blitter up (and you just do the blit on the GPU) - but for flatshading it's one of the best forms of parallelism we can get (like on Jaguar) - while the current scanline is being blitted by the Blitter, the next one is being computed by the 68080.
Q3: Build Deployment - I have zero desire to work directly on the target HW - I prefer my productivity at maximum on PC. I currently use Notepad++ and made a deployment script that sends the build to the Jag within 3 seconds of hitting F5. How do I replicate this with Apollo ? I need to be able to test single-instruction differences - hence I sometimes deploy 100+ builds within an hour (especially when "debugging" the Jag's GPU RISC code - I believe something similar will happen with pipelining refactoring on Apollo).
Q4. Backwards compatibility - The furthest I am willing to go, in terms of fragmenting the source code, is to have separate codepaths for 68060 (later, not now for sure). Is there anything I need to know now (other than the absence of AMMX), before writing the first line of code, that will make it easier ? Note that I do not intend to bastardize the 68080 performance for the sake of not having to write a 060-specific codepath.
Q5. Hello World Graphics mode - for initial prototyping, due to my zero prior exposure to Amiga HW - is there reference ASM code that will set up a gfx mode for me (ideally 640x400) ? I can take it from there. It doesn't even have to be double-buffered - just a working gfx screen.
Q6. SAGA registers - it's my current understanding that the Amiga used to have something like AGA and that it performed a variety of gfx-related functionality - is that where I find the registers for double-buffering/vsync/etc. ? Meaning - does AGA actually support chunky non-bitplane addressing (I know that Apollo does - but that could be SAGA - not sure if the HW design allows concurrent execution of both) ?
Q7. Can I keep Apollo turned on 24/7, 365 days/year ?
Q8. What Assemblers other than vasm support all 68080 instructions ?
Q9. Docs. I currently have:
- Amiga HW Reference Manual (371 pages)
- M68060 User's Manual (416 pages)
- AMMX Guide from this site : CLICK HERE
Anything else to ease the migration from 68000 to 68080 ?
Q10. Any Apollo-safe coding forums out there that are shielded from fulltime AntiApollo trolls "On a Mission From God" ? From my last two days of browsing various Amiga forums it doesn't appear to be so...
If it helps, I also have direct experience with these APIs: DirectX, OpenGL, XNA, MESA, CUDA, HLSL (from a 400 MHz Celeron to the XBOX360) and the assemblers 80286, 386, 486, 6502, 68000, RISC DSP/GPU. While I did some bitplane ASM routines on the 80386 for a 360x480 gfx mode on PC, I don't have a desire to ever work with bitplanes again.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6251 22 Dec 2019 09:29
| Vladimir Repcak wrote:
| Q1: Apollo only has 68080 as a standalone processor, so is the best possible parallelism there achieved only through parallel integer and FPU pipeline and perhaps Blitter (more on Blitter later) ? |
You can execute FPU and INT instructions in parallel. The FPU is fully pipelined; with good scheduling you can issue 1 FADD/FDIV or FMUL each clock. SAGA has 3D/texture features built in. We have HW functions for compressed textures and for bilinear filtering. We are working on a new pipeline for 3D line rendering.
Vladimir Repcak wrote:
| Q2: Blitter - does Apollo's Blitter run at full system bandwidth of 700 MB/s or the original one ? While I will benchmark it as soon as I get the system, I'd love to know now. I'm well aware of the scanline-length threshold on Jaguar where it doesn't make sense to spin Blitter up (and just do the blit on GPU) - but for flatshading it's one of best parallelism we can get (like on Jaguar) - while current scanline is being blit by Blitter, next one is being computed by 68080. |
For a 3D game I would use S3 compressed 24bit textures and ideally work very close together with us to use the "WIP" Voodoo like line raster features. Legacy AMIGA Planar Blitter will match here.
Vladimir Repcak wrote:
| Q3: Build Deployment - I have zero desire to work directly on target HW - I prefer my productivity at maximum on PC. I currently use Notepad++ and made a deployment script that sends the build to Jag within 3 seconds of hitting F5. How do I replicate this with Apollo ? I need to be able to test single instruction differences - hence sometimes deploy 100+ builds within an hour (especially when "debugging" jag's GPU RISC code - I believe something similar will happen with pipelining refactoring on Apollo). |
APOLLO has cycle-exact performance counters. This means you can measure exactly how many clock cycles each of your routines takes. You can also monitor where you have stalls/delays and account for their reasons, e.g. branch misses, DCache misses etc.
You can of course connect the Vampire and a PC over a network share, e.g. with SAMBA. This way you can use the editor you want.
Vladimir Repcak wrote:
| Q4. Backwards compatibility - The furthest I am willing to go, in terms of fragmenting the source code, is have separate codepaths for 68060 (later, not now for sure). Is there anything I need to know now (other than absence of AMMX), before writing first line of code, that will make it easier ? Note that I do not intend to bastardize the 68080 performance for the sake of not having to write 060-specific codepath. |
Your question is very broad. If you compare the 080 and the 060, then some things are very different. The 68060's main deficits are:
- missing AMMX
- FPU not pipelined
- several instructions, like 64bit MUL, missing
- Instruction Cache read limit of 4 bytes per cycle (16 for APOLLO)
- no DataCache streaming or prefetching
- bitfield instructions an order of magnitude slower
- of course, much lower memory access speed
Vladimir Repcak wrote:
| Q5. Hello World Graphics mode - for initial prototyping, due to my zero prior exposure to Amiga HW - is there a reference ASM code that will set up gfx mode for me (ideally 640x400). I can take it from there. Doesn't even have to be double-buffered - just a working gfx screen. |
I would recommend joining our IRC channels. There are many coders there who can answer and give you pointers.
Vladimir Repcak wrote:
| Q6. SAGA registers - it's my current understanding that Amiga used to have something like AGA and it performed a variety of gfx-related functionality - is that where I find registers for double-buffering/vsync/etc. ? Meaning - does AGA actually support chunky non-bitplane addressing (I know that Apollo does - but that could be SAGA - not sure if the HW design allows concurrent execution of both) ? |
On other AMIGAs, AGA supports planar modes only. SAGA supports both planar and 8/15/16/24/32-bit chunky modes.
Vladimir Repcak wrote:
| Q7. Can I keep Apollo turned on 24/7, 365 days/year ? |
Yes
Vladimir Repcak wrote:
| Q8. What Assemblers other than vasm support all 68080 instructions ? |
VASM only today
Vladimir Repcak wrote:
| Q9. Any Apollo-safe coding forums out there that are shielded from fulltime AntiApollo trolls "On a Mission From God" ? From my last two days of browsing various Amiga forums it doesn't appear to be so... |
You can post here. But brainstorming and code sharing might be faster on our IRC or Slack support channels.
| |
| | Mr Niding
Posts 459 22 Dec 2019 09:33
| Q10. Any Apollo-safe coding forums out there that are shielded from fulltime AntiApollo trolls "On a Mission From God" ? From my last two days of browsing various Amiga forums it doesn't appear to be so... |
Not to my knowledge. Tuko ran a parallel forum for a while, but it merged back into this forum. My advice is to start a thread with a headline "Strictly ON TOPIC", to let Gunnar/Moderators know that they should be heavy handed with moderation to avoid letting your thread spin out of control. We have a few very positive posters too, who will fill the pages with their dreams, which can be fun to read, but will "pollute" a *serious* thread.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6251 22 Dec 2019 09:48
| I think the most important thing for you is to find out what your goal is.
What type of game do you want to do?
What screen resolution do you aim for?
Do you want 24bit textures and good quality bi-linear/tri-linear filtering?
How many coordinates do you have to calculate per frame? The FPU on Apollo is pipelined: a 68060@50 peaks at 16 MFlops, versus 80 MFlops on the V4.
What target platform do you want to support? If you aim for Classic AMIGAs (like A1200 + 68060 CPU) then it's a completely different ballpark.
Let's give some rough indicative numbers:
SAGA supports both Planar and Chunky
Other AMIGA AGA supports only Planar
SAGA memory write speed ~ 500 MB/sec
Other AMIGA AGA < 7 MB/sec
AMMX/SAGA supports fast texture decompression and bilinear filtering. On the 68060 you need 100 cycles for the same work where the 68080 needs 1 cycle.
Very simple texture example on APOLLO, truecolor with bi-linear filtering EXTERNAL LINK
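To put the rough write-speed figures above into perspective, here is a minimal back-of-envelope sketch in Python. It only does the arithmetic: how many bytes one chunky frame takes and how many full-screen writes per second a given write speed allows at best. The two speeds are the ballpark numbers from the post; treating "MB" as 10^6 bytes, ignoring all other bus traffic, and applying the AGA figure to a chunky frame (purely to show the ratio) are assumptions of this sketch, not measurements.

```python
# Back-of-envelope fill budget. The speeds are the rough figures quoted in
# the post above, not measurements. "MB" is taken as 10**6 bytes (assumption).

SAGA_WRITE_BPS = 500 * 1_000_000  # "~ 500 MB/sec"
AGA_WRITE_BPS = 7 * 1_000_000     # "< 7 MB/sec", so this is an upper bound


def frame_bytes(width, height, bytes_per_pixel):
    """Bytes needed to write every pixel of one chunky frame exactly once."""
    return width * height * bytes_per_pixel


def max_full_frame_writes_per_second(width, height, bytes_per_pixel, write_bps):
    """Upper bound on full-screen writes per second at a given write speed.

    Ignores overdraw, texture reads and everything else on the bus.
    """
    return write_bps / frame_bytes(width, height, bytes_per_pixel)


if __name__ == "__main__":
    for width, height in [(640, 400), (800, 600), (1920, 1080)]:
        saga = max_full_frame_writes_per_second(width, height, 2, SAGA_WRITE_BPS)
        aga = max_full_frame_writes_per_second(width, height, 2, AGA_WRITE_BPS)
        print(f"{width}x{height} @ 16 bit: "
              f"<= {saga:.0f} writes/s at the SAGA figure, "
              f"<= {aga:.1f} writes/s at the AGA figure")
```

Real code will land well below these bounds, since a renderer also reads textures and touches some pixels more than once.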
| |
| | Vladimir Repcak
Posts 359 22 Dec 2019 10:07
| Thanks for your instant answers ! I barely edited the post for typos and things I forgot :)
Gunnar von Boehn wrote:
| You can execute FPU and INT instructions in parallel. The FPU is fully pipelined; with good scheduling you can issue 1 FADD/FDIV or FMUL each clock.
|
Yeah, that is going to be fun ! Is there a list of all 68080 pipeline rules (I'm reading about 68060 pipelining now in the M68060 User's Manual), so I can print it out and smash it onto the wall so my circuits get wired in this mode ?
Gunnar von Boehn wrote:
| SAGA has 3D/Texture features built in. We have HW functions for compressed Textures and for Bilinear Filtering. We have a testpipeline for 3D line rendering.
|
I have browsed over two years of these forums in last two days, but I was under impression those features were for the next release (whenever that might be) ? Well, thinking about it now, as some of those threads were written a year+ ago, they probably meant V4 release. So, that's already in V4 ? Damn :) ! Gunnar von Boehn wrote:
| For a 3D game I would use S3 compressed 24bit textures and ideally work very close together with us to use the "WIP" Voodoo like line raster features.
|
So, the V4 already has S3TC ? Nice ! I loved it in DirectX ! I was initially targeting flatshading (e.g. learn to crawl before I walk), which is where the Blitter would be useful, but for S3TC, the Blitter is obviously out of the question. I have several versions of texturing routines on Jaguar:
- Integer-only
- Fixed-Point
- Fully Perspective
- Partially Perspective
But, since Apollo has a parallel FPU unit, it looks like I might even go straight to floating point routines ? Awesome !
Gunnar von Boehn wrote:
| APOLLO has cycle exact performance counters. This means you can measure on the clockcycle how long each routine takes.
|
I forgot to ask that Q, but you answered it already :) On jag, those cycle counters weren't super useful, so I mostly used rendering 1,000 frames as a benchmark scenario and accepted half a frame as a measurement error. What is the precision of Apollo counters ? How many cycles before they roll over ?
Gunnar von Boehn wrote:
| You can of course connect a network the Vampire and a PC over share with e.g. SAMBA. This way you can use the editor you want.
|
Thanks for the initial direction - I will start doing some research on this. From corporate environment, I have some experience in this area (Linux Servers), so that may help a bit.
Gunnar von Boehn wrote:
| Your question is very broad. If you compare 080 and 060 then some things are very different. 68060 main deficit are: - missing AMMX - FPU not pipelined - several instruction like 64bit MUL missing - Instruction Cache read limit of 4 Byte per cycle (16 for APOLLO) - No DataCache streaming or prefetching - Magnitude slower Bitfield Instructions
|
I guess it helps then that I haven't been exposed to 68060, as my brain doesn't operate within 060 bounds. I will simply learn the fastest way - the 080 way- from the get go. My only point is that I want, eventually, to be able to have 060 builds, but the more I read about it, the more it looks like it will have to be a completely separate codepath anyway. So, at least you adjusted my expectations, and when the time comes, I will simply assume to have to rewrite the performance-critical code for 060. Fair enough.
Gunnar von Boehn wrote:
| I would recommend to join our IRC channels. There are many coders which can answer and give you pointers
|
IRC ? Talk about Blast From The Past :) ! I definitely will, although forums are also great because the information isn't lost and when somebody new comes in, makes a search, can see my questions answered - kinda like StackOverflow-style. So, forums have that advantage.
Gunnar von Boehn wrote:
| AGA supports planar modes only. SAGA does support both planar and 8/15/16/24/32 bit Chunky |
Cool, I like my pixels Chunky anyway :)
Gunnar von Boehn wrote:
| Vladimir Repcak wrote:
| Q7. Can I keep Apollo turned on 24/7, 365 days/year ? | Yes |
Awesome !
Gunnar von Boehn wrote:
| Vladimir Repcak wrote:
| Q8. What Assemblers other than vasm support all 68080 instructions ? | VASM only today |
Alright, I use vasm for Jaguar anyway - but it never hurts having additional options :)
Gunnar von Boehn wrote:
| You can post here. But brainstorming code sharing might be faster on our IRC or Slack support channels. |
Slack ? Got a link ? Or is it an invitation-only thing ?
| |
| | Vladimir Repcak
Posts 359 22 Dec 2019 10:26
| Mr Niding wrote:
| Q10. Any Apollo-safe coding forums out there that are shielded from fulltime AntiApollo trolls "On a Mission From God" ? From my last two days of browsing various Amiga forums it doesn't appear to be so... Not to my knowledge. Tuko ran a paralell forum for a while, but it merged back to this forum. My advice is to start a thread with a headline "Strictly ON TOPIC", to let Gunnar/Moderators know that they should be heavy handed with moderation to avoid letting your thread spin out of control. |
Yeah, I kinda figured that the only safe place should be here (where else, after all). But I would be lying if I said that I didn't hope for some other forum to exist where your work isn't constantly attacked and people can freely discuss their coding experiments (without nonstop anxiety of being attacked).
Mr Niding wrote:
| We have a few very positive posters too, that will fill the pages with their dreams, which can be fun to read. But will "pollute" a *serious* thread.
|
Yeah, I noticed a few spammers like that yesterday while browsing 2 years of posts (about 50 pages). It's still much better than on, say, AtariAge's Jaguar section where such threads would have full-on bullying (even enforced by local mods - but then again - "their turf = their rules").
But, ideally, we would be able to focus just on work, with an occasional joke or something. It does suck when a 3-page thread has only 3 on-topic responses. Or, even worse, when one of those 3 responses is from a guy who actually does the work (e.g. the gcc or OpenGL guy) yet is demotivated and driven off...
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6251 22 Dec 2019 10:35
| Vladimir Repcak wrote:
| Yeah, that is going to be fun ! Is there a list of all 68080 pipeline rules (I'm reading about 68060 pipelining now in the M68060 User's Manual), so I can print it out and smash it onto the wall so my circuits get wired in this mode ?
|
The V4 FPU today can do 1 FPU instruction per cycle (fully pipelined). So if your code is unrolled enough you can get 85 MFlops.
In a nutshell, the 68080 core looks like this:
* Icache (16 bytes of instructions per cycle)
* Decoders. The decoders can decode up to 4 integer instructions, including 1 AMMX instruction and 1 FPU instruction, per cycle. The core has 2 main pipelines: the primary pipeline, which can do up to 2 INT instructions, or 1 AMMX or 1 FPU; and the secondary pipeline, which can do up to 2 INT instructions, and a selection of FMOVE, AMMX STORE.
* 2 EA units to calculate up to 2 EAs per cycle
* DCACHE unit to allow 1 read and 1 write per cycle
* ALUs (INT/AMMX/FPU)
If you have real code to test, I can help you profile and review scheduling.
Vladimir Repcak wrote:
| I have browsed over two years of these forums in last two days, but I was under impression those features were for the next release (whenever that might be) ? Well, thinking about it now, as some of those threads were written a year+ ago, they probably meant V4 release. So, that's already in V4 ? Damn :) !
|
The V4 texture pipeline is WIP right now. We need time and people using it and feedback of coders to make it finished.
Vladimir Repcak wrote:
| So, the V4 already has S3TC ? Nice ! I loved it in DirectX ! I was initially targeting flatshading (e.g. learn to crawl before I walk), which is where the Blitter would be useful, but for S3TC, the Blitter is obviously out of the question. I have several versions of texturing routines on Jaguar: - Integer-only - Fixed-Point - Fully Perspective - Partially Perspective. But, since Apollo has a parallel FPU unit, it looks like I might even go straight to floating point routines ? Awesome !
|
FPU code needs to be unrolled for best performance. But as APOLLO has 32 FPU registers, doing this is possible.
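Why unrolling matters on a pipelined FPU can be shown with a toy model. The Python sketch below assumes an FPU that can issue one add per cycle but needs a fixed number of cycles before the result can be used again; a single running sum then stalls on every add, while several independent accumulators (which is what unrolling across many registers buys you) keep the pipeline busy. The latency of 4 cycles is a made-up illustration value, not a documented 68080 figure, and the final step of combining the partial sums is ignored.

```python
def cycles_for_sum(n_adds, accumulators, latency):
    """Toy model: cycles until the last add result is available.

    Assumptions (hypothetical, for illustration only):
    - one add can be issued per cycle,
    - a result is usable `latency` cycles after its add was issued,
    - add number i goes to accumulator i % accumulators and depends
      only on the previous add into that same accumulator.
    """
    ready = [0] * accumulators  # cycle at which each accumulator is usable
    next_slot = 0               # earliest cycle the next add can be issued
    for i in range(n_adds):
        acc = i % accumulators
        issue = max(next_slot, ready[acc])  # wait for the dependency
        ready[acc] = issue + latency
        next_slot = issue + 1
    return max(ready)


if __name__ == "__main__":
    HYPOTHETICAL_LATENCY = 4  # illustration value, not a real 68080 number
    for accs in (1, 2, 4, 8):
        print(accs, "accumulator(s):",
              cycles_for_sum(64, accs, HYPOTHETICAL_LATENCY), "cycles")
```

In this model, once the number of independent accumulators reaches the latency, the cost approaches one add per cycle, which is the "1 FPU instruction per cycle" case described above.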
Vladimir Repcak wrote:
| I forgot to ask that Q, but you answered it already :) On jag, those cycle counters weren't super useful, so I mostly used rendering 1,000 frames as a benchmark scenario and accepted half a frame as a measurement error. What is the precision of Apollo counters ? How many cycles before they roll over ?
|
The counters count:
- passed clock cycles
- instructions processed by Pipe1
- instructions processed by Pipe2
- FPU instructions processed
- branches
- mispredicted branches
- DCache reads
- DCache misses
- memory writes
- memory stalls because of the memory controller
- pipeline stalls because of register dependencies (ALU to EA)
All counters are 32bit.
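Since the counters are 32 bits wide, the roll-over question comes down to simple arithmetic. The Python sketch below shows a wrap-safe way to subtract two readings of a free-running 32-bit counter and how long such a counter lasts at a given clock. The 85 MHz clock is only an assumption, inferred from the "1 FPU instruction per cycle ... 85 MFlops" statement earlier in the thread; how the counters are actually read on the hardware is not covered here.

```python
COUNTER_BITS = 32                       # "All counters are 32bit"
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def elapsed(start, end):
    """Difference between two reads of a free-running 32-bit counter.

    Correct as long as the counter wrapped at most once between the reads.
    """
    return (end - start) & COUNTER_MASK


def seconds_until_wrap(clock_hz):
    """Time a 32-bit cycle counter takes to wrap at the given clock."""
    return (1 << COUNTER_BITS) / clock_hz


if __name__ == "__main__":
    ASSUMED_CLOCK_HZ = 85_000_000  # assumption, see the note above
    print("wrap after about", round(seconds_until_wrap(ASSUMED_CLOCK_HZ), 1), "s")
    # A read just before the wrap and one just after it still subtract correctly.
    print(elapsed(0xFFFFFFF0, 0x00000010))
```

Under that assumed clock, a benchmark run longer than roughly 50 seconds would need to sample the cycle counter more than once to stay unambiguous.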
| |
| | Olaf Schoenweiss
Posts 690 22 Dec 2019 11:51
| welcome and interesting discussion ;) just because I am curious... you want to port a 3D engine you already wrote on Jaguar? Or is the idea to do something from scratch? I ask because there are two options already... stormmesa EXTERNAL LINK a new vampire optimized version would be a big win. also there is aros that supports mesa/gallium even on 68k (or did, I am not sure what the current situation is). It was of course much too slow on 68k
| |
| | Vladimir Repcak
Posts 359 22 Dec 2019 12:06
| Gunnar von Boehn wrote:
| I think for you most important is to find out what your goal is What type of game do you want to do?
|
I already have working StunRunner-style gameplay on the Jag's 68000 with up to 20 levels, so I could merely create high-poly environment and enemy meshes and just replace them. The 68000 gameplay/AI/input code can totally be left as it is - no need to rewrite it, as it was heavily optimized to take less than 10% of the 68000's frame time (at 60 fps) - which at Apollo's frequency will be even less (probably 3% or less).
I also have a few other racing prototypes:
- OutRun-style road+environment
- Road Rash (textured buildings and road) - this one I fully recreated about a month ago
So, currently mostly racing/driving. Of course, Apollo's power could be used for a NeedForSpeed-style environment, so... I do, however, tend to jump from one project to another fairly often. Which, incidentally, should work out for the better for Apollo...
Gunnar von Boehn wrote:
| What screen resolution do you aim for?
|
I am a huge fan of flatshaded gfx because it ages and scales incredibly well - no shimmering half-meter big texels (with ugly interpolation artifacts) - everything is nice and smooth, especially at 65,536 colors. I'm currently maintaining 60 fps on a 13.3 MHz Jaguar with a 26.6 MHz GPU at 640x240 (16bit).
So, my expectations for Apollo (given its bandwidth) are to at least go for 800x600, but my code is resolution-independent so I will obviously try full HD (even if it is just 5 fps). Wouldn't it be insanely cool to run a StunRunner-style game at Full HD on Amiga ?!?
From my other experiments, the single greatest visual difference is in the vertical resolution - the jump from 200 to 400/480 scanlines. That particular jump has a much greater impact than a horizontal resolution jump (say, from 640 to 1024). However, the blitting performance directly depends on the number of scanlines - so if we double the vertical resolution we directly halve the performance of that particular pipeline stage.
I really hope that I can maintain 640x480 at a much higher polycount on Apollo, but I understand that my first (or second or third) version won't be anywhere near its full potential, and that's fine.
Gunnar von Boehn wrote:
| Do you want 24bit texture and good quality bi-linear/tri-linear filtering?
|
Depends on the project. Can we do flatshading at 24bit ? That would improve visuals significantly while cutting down on the 565 conversion per pixel.
On Jaguar, one version of the Road Rash road texturing had a codepath where I was doing runtime quasi-bilinear filtering within each scanline:
- Render scanline into buffer1
- Blitter in parallel draws scanline from buffer2
- swap buffer pointers
- there was less than 8% waiting time (for the Blitter) so really efficient
If there was more than 4 KB of cache on Jaguar, I could have had full bilinear filtering across 2 scanlines - the GPU was certainly powerful enough for that. But the 4 KB had to contain the texture, 2 scanlines plus the rendering code, so it was really tight and impossible to set aside space for a third scanline...
With Apollo's cache sizes, it will definitely be an interesting exercise to see how far we can push it. What's the access-time / bandwidth difference between data cache and RAM in Apollo ? On Jaguar there was, obviously, a huge performance difference depending on whether the texture was in cache or RAM.
Gunnar von Boehn wrote:
| How many Coordinates do you have to calculate per frame? FPU on Apollo is pipelined 68060@50 peak 16 MFlops, versus V4 80 MFlops
|
I don't know yet - the final 3D mesh would be recreated in 3dsmax only after the codebase becomes stable and fully benchmarked. For sure, I will keep rewriting the transform code till it's at least at 80-90% of peak pipeline performance. If I need to write 25 versions to get there, I will write 25 versions.
On Jaguar, the polycount was pretty low (StunRunner-style scene complexity), as the target was to keep 60 fps at 640x240, but I had LevelOfDetail there to quadruple the view distance. Understand also that my pipeline is fully integer, with fixedpoint used only in scanline traversal. And I'm not doing the full view-transform thing either, it's really just the fastest possible perspective projection (2 Divs, 2 Muls, a couple of Adds).
I certainly plan on benchmarking how many vertices you can transform within a single frame. Of course, if I don't mind halving the framerate to 30 fps, then the scene complexity can roughly double. I've done a lot of benchmarks on the Jag on that - how many scanlines can be processed at 60, 30, 20, 15, 12, 10 fps... Obviously, till I have benchmarks from Apollo, I can only talk from experience with the Jag.
A compromise has to be made between:
- expectations of people
- manhours required for creating such meshes
As I've experienced on AtariAge, even if nobody else ever pulled off 640x240x16b at 60 fps, people will then ridicule the low scene complexity with such absurd statements as "I've seen Atari ST do better". Yeah, sure, Atari ST, 640x240 at 65,536 colors and 60 fps, right :) ...
So, the scene complexity is an ongoing discussion that can be had till the point I pull the plug and go create the final 3D meshes.
There's a huge difference in scene complexity depending on game type. A first-person shooter doesn't need anywhere remotely near the framerate required for something like StunRunner. You can play an FPS comfortably at 6-7 fps. That's 10 vblanks. I've done that in the past with Quake. But a fast-paced racer cannot be played at 20 fps. Even 30 fps is awful (especially the drops from 60 to 30 - that just sucks) and unless you drastically reduce the driving speed, anything sub-60 sucks.
A solution to that is to create a bicycle driving game. 10 kmh, so we can have a framerate of 6 fps (yet still be smooth, and internally run input+physics at 60fps) and spend 10x more CPU time on the 3D world :)
Regardless, there will always be camps of unhappy people:
- people bitching they want 60 fps
- people bitching they want higher scene complexity and don't mind 10 fps
- people bitching they could code much, much, much, much better stuff, if only they could be bothered to try
Can't please 'em all :)
Gunnar von Boehn wrote:
| What target platform do you want to support. If you aim for Classic AMIGAs (like A1200 + 68060 CPU) then its a complete different ballpark.
|
I have zero emotional attachment to the Amiga, being an Atari guy. So, luckily, I don't really care for the A1200 itself. Now, I do care for the 68060 because of the Falcon's demos and the Amiga's demos (it's the second most performant platform now). But from what you already explained in your first post, I would have to completely bastardize 080 performance for the code to be reusable on the 060. Which, I just realized, would only lead to:
- "meh" visuals on 060
- "meh" visuals on 080
So, screw that approach, given how much work that is anyway :) If I ever get to the point of creating a 060 build, then I'll set aside enough time to rewrite the renderer for the 060. Expectations managed :)
Gunnar von Boehn wrote:
| SAGA memory write speed ~ 500 MB/sec Other AMIGA AGA < 7 MB/sec |
So, am I interpreting this correctly - does SAGA's Blitter run at ~500 MB/s ? That would mean I could use my current scanline blitting approach (just figure out the different register setup).
Gunnar von Boehn wrote:
| Very simple texture example on APOLLO, truecolor with bi-linear filtering EXTERNAL LINK
|
It's pretty cool ! Being a coder, I definitely wouldn't call it simple. I know how much work is involved behind the scenes to pull off such visuals at that framerate. It's a testament to the 080's power, for sure !
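For readers wondering what a projection with "2 Divs, 2 Muls, a couple of Adds" (as mentioned earlier in this post) can look like, here is a minimal integer-only sketch in Python. It shows one common textbook form of perspective projection, not necessarily the routine used in the Jaguar engine. The names `focal`, `center_x` and `center_y` are illustrative assumptions. Note also that Python's `//` rounds toward negative infinity, whereas a signed divide on a 68k CPU truncates toward zero, so results for negative coordinates can differ by one pixel.

```python
def project(x, y, z, focal, center_x, center_y):
    """Integer-only perspective projection of a camera-space point.

    Cost: 2 multiplies, 2 divides, 2 adds/subtracts.
    Returns None for points on or behind the camera plane; a real
    renderer would clip those before projecting.
    """
    if z <= 0:
        return None
    screen_x = center_x + (x * focal) // z
    screen_y = center_y - (y * focal) // z  # screen y grows downward
    return screen_x, screen_y


if __name__ == "__main__":
    # Hypothetical 640x240 screen with the projection centre in the middle.
    print(project(100, 50, 200, 256, 320, 120))
    print(project(100, 50, 400, 256, 320, 120))  # twice as far: half the offset
```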
| |
| | Wawa T
Posts 695 22 Dec 2019 12:09
| id rather implement vamp backend lower level. say warp3d, which stormmesa depends on and is using. there is an open replacement for w3d called wazp3d and its author is commenting here: alain thellier. other than that vamp backend can be also added to aros mesa.
| |
| | Vladimir Repcak
Posts 359 22 Dec 2019 12:42
| Olaf Schoenweiss wrote:
| I ask because there are two options already... stormmesa EXTERNAL LINK a new vampire optimized version would be a big win. also there is aros that supports mesa/gallium even on 68k (or did, I am not sure what the current situation is). It was of course much too slow on 68k
|
Thanks, wasn't aware of those, but in the past I have been burnt so many times trying to use other people's engines or APIs that unless there's a Microsoft-sized army of developers and testers of the API, I'd rather reinvent the wheel, the door, the window and the roof by myself instead of going to Walmart and buying the pre-made tool shed which starts disintegrating within a week and after 3 months of creating spaghetti workarounds, the resulting structure doesn't really have doors anymore (it was easier to just pry them open forever instead of kicking the cats and hyenas out and just get used to the smell), the roof progressed from the leaking stage into an occasional stable shingle stage and 3 out of 5 walls already fell :)
The sheer amount of work involved in making such an API isn't the problem, really. It's the ingratitude and outright hostile approach of people using your API later that makes you think twice whether you watch Netflix or fork the latest performance updates to such an API. For me, after decades of this experience, especially the last decade of the Jaguar community, Netflix wins :)
Besides, it's 2020, and I'm totally willing to work in ASM, counting cycles and rearchitecting each component till it fills the HW's pipeline in an ideal way. Just don't tell that to any mental health professional :) To me, a 32-cycle jsr+rts combo is offensive, so a C-style API with several layers of indirection on a sub-10 GHz CPU doesn't sound like a good use of my time.
Now, I could, theoretically, try to write some API for a certain game type - like for my racing games. It would have the pipeline completely optimized for the 080, which is the absolute fastest way there is. All stages optimized for cache access, minimum cache thrashing, AMMX, maximum possible parallelism - all of which is only possible if you handcraft and rewrite each component 6-10 times.
That being said, there are certain subsets of OpenGL that can be made fast easily - like the VertexArrays, VBOs, etc. But, there's still the whole OpenGL state machine on top of that... So, I have only the deepest respect for anybody working on such APIs. I know I couldn't.
Olaf Schoenweiss wrote:
| welcome and interesting discussion ;) just because I am curious... you want to port a 3D engine you already wrote on Jaguar? Or is the idea to do something from scratch?
|
Hi ! Initially - yes - my current engine (game, really) would serve as a "reference rasterizer", because on top of the GPU RISC rasterizer I also have a 68000-only rasterizer, which could be reused as it is.
So, this will allow me to slowly start upgrading 68000 code into 080 AMMX code, component by component, and gradually benchmark each version.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6251 22 Dec 2019 13:19
| Vladimir Repcak wrote:
| Besides, it's 2020, and I'm totally willing to work in ASM, counting cycles and rearchitecting each component till it fills the HW's pipeline in an ideal way. Just don't tell that to any mental health professional :) |
Sounds like a good plan. I would propose to just do it like this. I would NOT recommend using Mesa or other layers! If you have a working 68K rasterizer then my proposal is to use this as a starting base. APOLLO 68080 is roughly 200 times faster than a stock 68000 ... So you should be able to get something usable. Regarding ASM coding, some basic ballpark numbers. Most instructions are single cycle. Example: ADDi.l #$125345,(A0)+ -- 1 cycle. The core can read 8 bytes from DCache each cycle - and store 8 bytes in the same cycle! The core has 2 pipes. Each pipe can do memory operations - but only one pipe per cycle. Misaligned Memory Access is supported in the core. Misaligned Reads are cost free. We have drafted a basic HW-Row-Rasterizer which depends on the CPU to calculate edges and feed it. The idea is that the CPU does 3D coordinates and edge processing and can offload the actual line rasterizing. Regarding Frame rates and games. What do you think about a TANK game? What do you think about something like this ? Regarding Screen-Res I think a 16/9 format would be very cool. 640x360 maybe?
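A minimal sketch of the CPU/rasterizer split described above, written in Python for readability rather than 68080 ASM. The CPU side walks the triangle edges in 16.16 fixed point and emits one span per scanline; `fill_row` is a software stand-in for the row rasterizer. Every name here, the fixed-point format and the flat-bottom restriction are assumptions made for illustration, since the thread does not describe the actual hardware interface.

```python
# Sketch only: the CPU walks triangle edges and emits one span per scanline;
# fill_row() stands in for whatever would consume the spans (hypothetical,
# the thread does not describe the real hardware interface).

FP_SHIFT = 16  # 16.16 fixed point, an assumption for this sketch


def edge_steps(x0, y0, x1, y1):
    """Return (start_x, dx_per_row) in 16.16 fixed point for an edge with y1 > y0."""
    dx = ((x1 - x0) << FP_SHIFT) // (y1 - y0)
    return x0 << FP_SHIFT, dx


def flat_bottom_spans(top, left, right):
    """Yield (y, x_start, x_end) for a triangle whose bottom edge is horizontal.

    top, left and right are (x, y) integer tuples; left and right share one y.
    """
    (tx, ty), (lx, ly), (rx, ry) = top, left, right
    assert ly == ry and ly > ty
    xl, dxl = edge_steps(tx, ty, lx, ly)
    xr, dxr = edge_steps(tx, ty, rx, ry)
    for y in range(ty, ly):
        # One add per edge per scanline: this is the "edge processing" part.
        yield y, xl >> FP_SHIFT, xr >> FP_SHIFT
        xl += dxl
        xr += dxr


def fill_row(framebuffer, y, x_start, x_end, color):
    """Software stand-in for the row rasterizer: fill [x_start, x_end) on row y."""
    row = framebuffer[y]
    for x in range(x_start, x_end):
        row[x] = color


def draw_flat_bottom(framebuffer, top, left, right, color):
    """CPU side computes the spans, the row filler consumes them."""
    for y, x_start, x_end in flat_bottom_spans(top, left, right):
        fill_row(framebuffer, y, x_start, x_end, color)
```

In this arrangement the per-scanline work on the CPU is two additions and two shifts, so the next span could in principle be computed while the previous row is still being filled.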
| |
| | Stefano Briccolani
Posts 586 22 Dec 2019 13:23
| I really like this metal-bashing approach.. Are there any videos of your Jaguar stun runner-clone engine to see? Just to have an idea of what you're aiming for. And (of course).. welcome to the Vampire world, Vladimir..
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6251 22 Dec 2019 13:44
| During start of the project I did some 3D code to test our render logic. Here is an example of what we could do. Saying this I would like to have another 3D coder here to brainstorm and to testdrive our HW accelerator the best way.
| |
| | Vladimir Repcak
Posts 359 22 Dec 2019 13:45
| wawa t wrote:
| id rather implement vamp backend lower level. say warp3d, which stormmesa depends on and is using. there is an open replacement for w3d called wazp3d and its author is commenting here: alain thellier. other than that vamp backend can be also added to aros mesa.
|
Well, something like that is a great first choice for people who don't have anything. But, I already have a fully working 68000+GPU(RISC)-based codebase. Plus, wouldn't I have to go back to C ? I've done a fair amount of interfacing between C and 68000 and the amount of issues and assembler/compiler bugs/alignment issues I discovered doing that cost me waaaaaaay more time than if I had instead started in pure ASM in the first place. Also, because the C has to conform to the C spec, the compiler has to do an insane amount of work to keep it conforming, which obviously results in super inefficient code (most of the time). No such thing in ASM. Zero dependency. Just one dependency on the HW. No SW update will ever break it (unless it fixes the Assembler bugs that you have workarounds for). The idea of C sounds so great, on paper. Perhaps, on Amiga, the C compiler produces usable code ? On Jaguar, it didn't. Also, over the course of last year, I designed my own high-level language ("Higgs") that compiles to the exact same amount of instructions that you would write as a human. It doesn't have the feature set of C, but it handles the most important and usable features, like:
- intermix of ASM and Higgs
- loops
- conditions (fully irregular if.then.else combo)
- 3-op expressions parsing (resulting in multi-instruction combo)
- functions with default parameters
- signed / unsigned math
- explicit working registers (for functionality that requires temp registers)
- register variables
- array access
- structures
- arrays of structures
- scope-based local and global constants
- local variables without the cost of constructing them
- global constants and variables
- complex debugging prints (1-2-3 combos, arrays, formatting, hexdump, ...)
Now it looks ALMOST like C, but it's way more efficient than a C compiler can ever be (because of its standard compliance). 
You still have to think in terms of registers but over 90% of all new code I write is in it, only rarely do I write actual 68000 ASM. Especially conditions and local blocks are incredibly more efficient. I don't care anymore if I have to scrap some code, I just do it now :) I can combine 68000, GPU RISC and DSP RISC in the same file. So, I presume I should be able to do the same with AMMX and 080-specific instructions, yet retain 100% exact performance. I don't introduce a feature into the language (the Higgs parser is written in .NET) unless I'm 100% sure I can write it in the same number of ops as I would by hand. Of course, I first need to get some experience in 080 code :) But, from what I'm reading in the docs right now, I think it should be possible to -eventually- create a high-level wrapper functionality to the parallelism between the Integer and FP Unit. Meaning - you would wrap the code within some _PARALLEL_ Block and it would execute on FP Unit. Probably need to experiment with the SIMD and parallelism a bit first, anyway...
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6251 22 Dec 2019 14:16
| Vladimir Repcak wrote:
| I think it should be possible to -eventually- create a high-level wrapper functionality to the parallelism between the Integer and FP Unit. Meaning - you would wrap the code within some _PARALLEL_ Block and it would execute on FP Unit. Probably need to experiment with the SIMD and parallelism a bit first, anyway...
|
Please forget this. The existing mesa etc stuff is totally unoptimized - you risk a heart attack if you use it - and look at the inner code. We don't want this, right? Regarding FPU and CPU. I'm sure that you really want to write this in ASM for control and performance.
| |
| | Samuel Devulder
Posts 248 22 Dec 2019 14:18
| On 68k, no compiler is really beating human ASM, but converting a whole source to ASM usually fails because its size makes it unmaintainable. Better to concentrate on where ASM needs to be introduced! My rule of thumb is: 1) do it in C, then 2) profile, and 3) replace costly C algorithms with "better thought" algorithms (yeah, a good algorithm is important), 4) profile again and 5) if some functions still appear on top, introduce ASM equivalents (based on good algorithms this time). This helps increase the speed while not having to rewrite everything. Tested successfully on quake (running > 30fps on my V2), and diablo (wip but already >30fps on my V2). For the programming env. I work on PC with cygwin + notepad++ + bebbo's amiga-gcc. I test under UAE to check that the code is working on pure 68k, then go to the vamp, download the exe via "wget" from the pc or some other external storage and test/measure real fps. Using the network is quite efficient, and some people reported that it is possible to remote-debug a program running on the vamp directly from eclipse, so the UAE phase might be skipped then.
| |
| | Vladimir Repcak
Posts 359 22 Dec 2019 14:24
| Gunnar von Boehn wrote:
| If you have a working 68K rasterizer then my proposal is to use this as a starting base. APOLLO 68080 is roughly 200 times faster than a stock 68000 ... So you should be able to get something usable.
|
True, but I'm getting 60 fps on Jag because of the parallel work of the Blitter and the parallel work of GPU and 68000. So, that will normalize the factor of 200x to something like 10x of Jaguar, I'd suspect. Gunnar von Boehn wrote:
| Regarding ASM coding, some basic ballpark numbers. Most instructions are single cycle. Example: ADDi.l #$125345,(A0)+ -- 1 cycle. The core can read 8 bytes from DCache each cycle - and store 8 bytes in the same cycle! The core has 2 pipes. Each pipe can do memory operations - but only one pipe per cycle. Misaligned Memory Access is supported in the core. Misaligned Reads are cost free.
|
8 Bytes in 1 cycle ? That is awesome :) Misaligned access is soooo annoying on 68000 - I managed to get rid of most of it (due to my high-level language), but it still happens to me at least once or twice every week... So, as long as I interleave memory operations between pipes, I could get full bandwidth this way. Will keep that in mind. Gunnar von Boehn wrote:
| We have drafted a basic HW-Row-Rasterizer which depends on the CPU to calculate edges and feed it. The idea is the CPU does 3D coordinates and edge processing and can offload the actual line rasterizing.
|
Yeah, I do the same on jag. So - do you use the AGA Blitter to draw scanlines ? Is it operating at 500 MB/s to not block CPU ?
Gunnar von Boehn wrote:
| Regarding Frame rates and games. What do you think about a TANK game? What do you think about something like this ?
| Well, that's definitely a more fun game than a bicycle simulator :) And it probably can get away with sub-10 fps (yet be smooth), meaning we can go bonkers on scene complexity. For SillyVenture (2 weeks ago), I created a simple 4 KB demo of a voxel terrain - the party version sucks tremendously (I only worked on it for 3 days), but I've been working the last 2 weeks on improving it, so it sucks less. Point being, I have about 7 different versions of the voxel rasterizer running on GPU - with various combinations of HW+SW rasterizing (e.g. Blitter). This could be useful for a tank game (e.g. voxel terrain like in Comanche, but at 65,536 colors) - terrain via voxels and tanks via flatshading. One thing I hate about voxels is that up-close they look awful - Comanche fixes it by putting the HUD there, otherwise those 12-16 blocks at the closest row look real ugly. On the other hand, given the parallel nature, we might be able to compute interpolation (simple lerp, really) between two neighbouring voxels, thus giving it a much better and less blocky look. Gunnar von Boehn wrote:
| Regarding Screen-Res I think a 16/9 format would be very cool. 640x360 maybe?
| 16:9 is not a problem. I've had that option in my codebase (4:3, 16:9) for a long time. Easy to do with flatshading. Simple coefficient applied during loading time. Resolution - hard to say - would need some benchmarks. If we smash a giant HUD over half of the screen, then even 640x480 could work (and look real nice into distance), yet have performance characteristics of 240p.
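One plausible reading of the "simple coefficient applied during loading time", together with the pixel-count arithmetic behind the HUD remark, is sketched below in Python. The function names, the 16.16 fixed-point format and the choice to scale only the x coordinate are assumptions for illustration; the thread does not show the actual code.

```python
# Sketch only: aspect handling as a single coefficient baked into vertex data
# at load time, plus the pixel-count arithmetic behind the HUD remark.
# All names are hypothetical; the thread does not show the actual code.

FP_SHIFT = 16  # 16.16 fixed point, an assumption for this sketch


def aspect_coefficient(base_w, base_h, target_w, target_h):
    """Horizontal scale (16.16) taking base-aspect x coordinates to the target aspect."""
    # (base_w / base_h) / (target_w / target_h), kept entirely in integers.
    return ((base_w * target_h) << FP_SHIFT) // (base_h * target_w)


def apply_at_load(vertices, coeff):
    """Scale the x of each (x, y, z) integer vertex once, at load time."""
    return [((x * coeff) >> FP_SHIFT, y, z) for x, y, z in vertices]


def visible_pixels(width, height, hud_num, hud_den):
    """Pixels left for the 3D view when a HUD covers hud_num/hud_den of the screen."""
    total = width * height
    return total - total * hud_num // hud_den
```

For 4:3 content shown at 16:9 the coefficient is 0.75, and a HUD covering half of a 640x480 screen leaves 153,600 pixels for the 3D view, the same count as a full 640x240 frame.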
| |
| | Vladimir Repcak
Posts 359 22 Dec 2019 14:30
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| I think it should be possible to -eventually- create a high-level wrapper functionality to the parallelism between the Integer and FP Unit. Meaning - you would wrap the code within some _PARALLEL_ Block and it would execute on FP Unit. Probably need to experiment with the SIMD and parallelism a bit first, anyway... |
Please forget this. The existing mesa etc stuff is totally unoptimized - you risk a heart attack if you use it - and look at the inner code. We don't want this, right?
|
Don't worry, I was talking about my own language - "Higgs". I've had enough exposure to C-compiler Asm code to last me for several lifetimes :) Gunnar von Boehn wrote:
| Regarding FPU and CPU. I'm sure that you really want to write this in ASM for control and performance.
|
Yes, that's exactly the point. Invest 2-3 days of work into unrolling some tight inner loop, but reuse it a thousand times in future. Now, realistically, it's going to take months of experimenting till I'm anywhere near the efficiency I want to have - which is zero bubbles in both pipelines and the least amount of cache thrashing for the rasterizer. But, it's a doable target.
| |
| | Vladimir Repcak
Posts 359 22 Dec 2019 14:39
| Gunnar von Boehn wrote:
|
During start of the project I did some 3D code to test our render logic. Here is an example of what we could do.
|
Is that a 16-bit texture ? It surely is filtered at full floating-point precision - I doubt an integer version would come close to looking like this - I tried doing integer texturing at 32 bits, and it's not enough to look like this on such large surfaces (not to mention the close-up). Gunnar von Boehn wrote:
| Saying this I would like to have another 3D coder here to brainstorm and to testdrive our HW accelerator the best way.
|
I sure would love to help in pushing the boundaries ! That's what always excited me anyway.
| |