Information about the Apollo CPU and FPU.
| | Vladimir Repcak
Posts 359 28 Jan 2020 06:47
| Gunnar von Boehn wrote:
| Regarding unrolling: in real-life code, texture loops will often have short runs of 5-15 pixels, and their length will be variable (random). Unrolling is then not really possible. I can see unrolling working only for a very special test case.
|
That will actually be very easy to benchmark. Besides, we are practically unlimited in RAM on the V4, so having an additional scanline codepath (unlike with the Jaguar's 4 KB GPU cache) certainly doesn't mean we have to cut something somewhere else. From my benchmarking experience, it's not the faster code that needs to be benchmarked, but the impact of the condition that decides which fork to take. If our scene has 20,000 scanlines, we execute that condition 20,000x every single frame. So whatever the faster codepath saves must also cover the cycles spent executing the condition 20,000x (regardless of whether that path is taken or not). That is exactly why it's sometimes better to force a somewhat slower codepath that doesn't need any condition in the first place.
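The trade-off above can be put into numbers. This is only a minimal sketch: the 20,000 scanlines come from the post, while the condition cost, the saving per scanline and the share of scanlines taking the fast path are made-up values, not measurements.

```python
def net_cycles_saved(scanlines, cond_cost, fast_fraction, saving_per_fast_line):
    """Net cycles saved per frame by adding a conditional fast path.

    The condition is paid on every scanline, taken or not; the saving is
    only collected on the scanlines that actually take the fast path.
    """
    paid = scanlines * cond_cost
    gained = scanlines * fast_fraction * saving_per_fast_line
    return gained - paid


def break_even_fraction(cond_cost, saving_per_fast_line):
    """Share of scanlines that must take the fast path just to break even."""
    return cond_cost / saving_per_fast_line


SCANLINES = 20_000  # figure from the post

# Hypothetical numbers: the condition costs 4 cycles, the fast path saves
# 10 cycles, and only a quarter of the scanlines qualify for it.
print(net_cycles_saved(SCANLINES, 4, 0.25, 10))  # negative: a net loss
print(break_even_fraction(4, 10))                # share needed to break even
```

With these assumed values the fork costs more than it saves, which is the case where forcing the single, slightly slower path wins.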
| |
| | Vladimir Repcak
Posts 359 28 Jan 2020 06:52
| Gunnar von Boehn wrote:
| In real-life code, texture loops will often have short runs of 5-15 pixels
| 5 px would imply an extremely high-poly scene. Just imagine 640x480: that's roughly 240 or more triangles in each scanline. 5 px may be common in the furthest distance. In racing, the road polygons will easily span the full width of the screen (e.g. 640).
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 28 Jan 2020 07:17
| Vladimir Repcak wrote:
| So, how can you even theoretically achieve 85 MFLOPS? Can the FP unit handle 5 parallel FP ops?
|
Yes, even a lot more. The FPU can work on all of these in parallel: 6 FADD/FSUB, 6 FMUL, 10 FDIV, 20 FSQRT.
| |
| | Samuel Devulder
Posts 248 28 Jan 2020 08:07
| Vladimir Repcak wrote:
| Can you do the following and issue all five ops in 5 cycles?
| cycle 1: FDIV fp7,fp6 ; fp5 available in 9c
| cycle 2: FADD fp0,fp1
| cycle 3: FSUB fp2,fp3
| cycle 4: FMUL fp4,fp5 ; fp5 available in 5c
| cycle 5: FMOVE fp0,fp3
| cycle 6-10: results from above ops get written to respective registers |
Since there is no dependency between the calculations, yes, they execute in parallel along the FPU pipe.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 28 Jan 2020 08:59
| Samuel Devulder wrote:
|
Vladimir Repcak wrote:
| Can you do the following and issue all five ops in 5 cycles?
| cycle 1: FDIV fp7,fp6 ; fp5 available in 9c
| cycle 2: FADD fp0,fp1
| cycle 3: FSUB fp2,fp3
| cycle 4: FMUL fp4,fp5 ; fp5 available in 5c
| cycle 5: FMOVE fp0,fp3
| cycle 6-10: results from above ops get written to respective registers |
Since there is no dependency between the calculations, yes, they execute in parallel in the FPU pipe.
|
Actually, there is a WRITE BEFORE WRITE collision on fp3. The FMOVE is faster than the FSUB: the FMOVE would finish in cycle 5 (it's really single-cycle), while the FSUB will finish in cycle 8. To keep the writes in the correct order, the FMOVE will wait for the FSUB to finish.
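The collision can be reproduced with a toy scheduling model. This is a sketch under stated assumptions, not a description of the real core: the latencies are taken from the pipeline depths quoted earlier in the thread (FMOVE as single-cycle), one instruction is issued per cycle, and the rule used to hold back the FMOVE is a guess at the simplest behaviour consistent with "the FMOVE will wait for the FSUB to finish".

```python
# Assumed latencies in cycles (depths quoted in the thread; FMOVE = 1).
LATENCY = {"FDIV": 10, "FADD": 6, "FSUB": 6, "FMUL": 6, "FMOVE": 1}

# (op, source, destination), in the order given in the thread.
PROGRAM = [
    ("FDIV", "fp7", "fp6"),
    ("FADD", "fp0", "fp1"),
    ("FSUB", "fp2", "fp3"),
    ("FMUL", "fp4", "fp5"),
    ("FMOVE", "fp0", "fp3"),
]


def schedule(program):
    """Return (op, dest, issue_cycle, finish_cycle) for each instruction.

    Simplified model: in-order issue, one instruction per cycle, finishing
    at issue + latency - 1. If an older instruction writing the same
    destination is still in flight, the younger one is held back so that
    its write lands one cycle after the older write. Source-operand
    dependencies are not modeled (the example has none).
    """
    pending_write = {}  # destination register -> cycle of its latest write
    timeline = []
    next_issue = 1
    for op, _src, dest in program:
        issue = next_issue
        finish = issue + LATENCY[op] - 1
        older = pending_write.get(dest, 0)
        if finish <= older:  # would write before the older op does
            finish = older + 1
            issue = finish - LATENCY[op] + 1
        pending_write[dest] = finish
        timeline.append((op, dest, issue, finish))
        next_issue = issue + 1
    return timeline


for op, dest, issue, finish in schedule(PROGRAM):
    print(f"{op:5} -> {dest}: issue cycle {issue}, result written in cycle {finish}")
```

In this model the FSUB writes fp3 in cycle 8, matching the post, and the FMOVE slips from cycle 5 to cycle 9; the exact cycle in which the real core retires the held FMOVE is not stated in the thread.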
| |
| | Samuel Devulder
Posts 248 28 Jan 2020 09:14
| Oh yes, I hadn't noticed that fp3 is being written while it is still 'locked' in the pipe. I presume the example is meaningless, since the result of the subtraction is overwritten just 2 ops later. My guess is that this is a typo (the FSUB and FMOVE order is swapped), or this is just random code.
| |
| | Vladimir Repcak
Posts 359 28 Jan 2020 10:32
| Yeah, sorry guys - it was completely random code. I just needed to verify that a single FP op doesn't stall the FP unit completely. Very glad to hear that's not the case :)
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 28 Jan 2020 14:00
| Vladimir Repcak wrote:
| Yeah, sorry guys - it was completely random code. I just needed to verify that a single FP op doesn't stall the FP unit completely. Very glad to hear that's not the case :)
|
No, there is no stall. The FPU has completely independent units: FADD / FMUL / FDIV / FSQRT / FMOVE. In theory the FPU could even execute 1x FMOVE, 1x FADD, 1x FMUL and 1x FDIV in parallel ... but Apollo does NOT support this today. At most, APOLLO schedules 1 FMOVE and 1 FOPP per cycle in parallel.
| |
| | Nixus Minimax
Posts 416 28 Jan 2020 16:52
| Gunnar von Boehn wrote:
| At most, APOLLO schedules 1 FMOVE and 1 FOPP per cycle in parallel.
|
Is that a fusing case like this:
fmove fp0,fp1
fadd fp2,fp1
that will be executed like a 3-operand FOPP? Or is this more general, so that this would also execute in parallel:
fmove fp0,fp1
fadd fp2,fp3
?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 28 Jan 2020 18:34
| Nixus Minimax wrote:
| that will be executed like a 3-operand FOPP?
|
The APOLLO 68080 does support a 3-operand FPU instruction encoding of the form: FADD (ea),Fp,Fm. This allows for MUCH higher throughput if used cleverly.
| |
| | Vladimir Repcak
Posts 359 28 Jan 2020 21:06
| What are the odds of the FP unit ever finishing execution of most ops (except things like FDIV) in 1-2 cycles? I have no idea about the implementation costs of this on an FPGA (or the space in terms of gates), and I'm sure you already have a ToDo list that spans half a century. But where would such a feature lie on the scale of: <NeverEver: 0% , SureWhyNot: 100%> ?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 28 Jan 2020 22:06
| Vladimir Repcak wrote:
| What are the odds of the FP unit ever finishing execution of most ops (except things like FDIV) in 1-2 cycles ? |
Latency is tied to the clock rate your CPU design is aiming for. Aiming for short latency would reduce the MHz speed of the Core, so generally you want a latency high enough to let you also reach a high clock rate. Making an FPU nicely pipelined like ours is a lot of extra work, but only because of such a pipeline design can the Core reach its target clock. With a 1-cycle FPU the Core would be limited to 14 MHz or so...
| |
| | Vladimir Repcak
Posts 359 28 Jan 2020 22:18
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| What are the odds of the FP unit ever finishing execution of most ops (except things like FDIV) in 1-2 cycles ? |
Zero! Unless your goal is to reach a speed of only 5 MHz. Latency is tied to the clock rate your CPU design is aiming for.
|
Damn, that was fast :) Fair enough - I'll just have to wrap my brain around it. Like all new things, it will take some practice, say, like riding a bicycle :) Now, would it be more realistic to provide a second CPU, even if it was just single-pipe and a second-class citizen on the bus? Even a 5 MHz 68000 would be great to have, to handle things like:
- input
- base game logic
- AI
- rough scene management
The reason I'm asking is that a multithreaded engine can handle many things more efficiently, and there is no reason for the secondary core to have the same performance as the primary one. My StunRunner code only takes 10% of the 13.3 MHz 68000, and the list of things it handles would span two pages here: basically everything except 3D transform and rasterizing. So, a parallel 5 MHz 68000 would be extremely helpful and would allow the 68080 to just focus on number crunching and pixel pushing.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 28 Jan 2020 22:26
| Vladimir Repcak wrote:
| So, a parallel 5 MHz 68000 would be extremely helpful and would allow 68080 to just focus on number crunching and pixel pushing.
|
The 68080 is roughly 300 times faster than a 7 MHz 68000, so adding an extra 68000 would not make a difference. If you want to query user input, this is no hassle for the 68080 CPU. A typical way to code this on AMIGA is to write your program as 2 threads: one doing the heavy computing work, and a second running in the Vertical Blank IRQ, doing the user input and position calculation at a steady frequency of 50 FPS.
| |
| | Vladimir Repcak
Posts 359 28 Jan 2020 23:08
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| So, a parallel 5 MHz 68000 would be extremely helpful and would allow 68080 to just focus on number crunching and pixel pushing. |
The 68080 is roughly 300 times faster than a 7 MHz 68000, so adding an extra 68000 would not make a difference. If you want to query user input, this is no hassle for the 68080 CPU. A typical way to code this on AMIGA is to write your program as 2 threads: one doing the heavy computing work, and a second running in the Vertical Blank IRQ, doing the user input and position calculation at a steady frequency of 50 FPS.
|
An interrupt is not the same as a dedicated processor executing the code fully in parallel (with its own registers). That's how I started on the Jaguar: everything was on the GPU, and eventually I got to the point of having 3 processors running their code fully in parallel. It would take me pages to explain the advantages. Now, your post made me realize one thing: the VBlank IRQ will have some serious computational power available. But it comes at the cost of completely grinding the 3D component to a halt. So, is it a brutal amount of work to get a second (much slower, perhaps 8/16-bit) CPU running on an FPGA? I honestly have no idea. I heard some people say on other forums that it can be very easy, as these cores are already pre-made and it's just a question of configuring one (and having enough gates still available on your current FPGA). All the Namco/Sega arcade boards always had a separate slower CPU (sometimes even an 8-bit Z80 or 6502) to handle audio and gameplay fully in parallel with the DSPs crunching the 3D scene. It was a great architecture, and I only really started to appreciate it last year when I went fully parallel on the Jaguar.
| |
| | A1200 Coder
Posts 74 29 Jan 2020 00:30
| It still works reasonably well with the Amiga interrupt system and a single CPU. You commonly have three interrupt levels. The base level 0 is where the CPU normally runs. The level 3 vertical blank interrupt is synchronized to the display rate and can vary with the screenmode, e.g. 50-60 Hz (not a good place for a music routine, as its timing will be ruined if the game can run at different display rates). The level 6 CIA interrupt has a programmable rate; you can set it to e.g. 1000 times per second if you want, but the interrupt overhead starts to become significant (at least on old Amigas), as the CIA chips are really old chips, also used in the C64, and therefore very slow to access from the CPU. Nothing else can take priority over this level 6 interrupt, so you should put your most critical code here, such as the music routine, which should never slow down and should always run at a fixed rate. If you also use this level 6 CIA interrupt, the level 3 VBL interrupt can be allowed to occasionally exceed its time budget without ruining your most critical routines. But yeah, this is how it is done on the Amigas, take it or leave it; I doubt there will be more than one CPU in the foreseeable future. That's also why we don't have a separate DSP but the AMMX instruction set in the 68080 CPU instead: it is simpler, and still efficient enough.
| |
| | Vladimir Repcak
Posts 359 29 Jan 2020 01:55
| Yeah, an audio interrupt at a rate of 22050 will certainly eat some resources. It will be very interesting for me to compare the final performance of the 68080 to the Jaguar with audio and everything running. On one hand you have:
- two RISCs (DSP + GPU) working in parallel that can provide up to 53 MIPS if fully pipelined
- a 68000 working in parallel, handling 95% of all code
- the Blitter drawing scanlines in parallel to all other chips
- the Object Processor, which handles displaying the framebuffer and any other bitmaps (HUD, sprites, etc.), also working in parallel to everything else, and which displays a certain number of bitmaps for ~free
Note that the Blitter especially is the important part here: it clears the framebuffer and fills it (including clipping) basically for free (if you do it properly). So the costs of these substages of the pipeline are all something the 080 will have to absorb to compensate for the lack of a Blitter:
- clearing the framebuffer (640x240x2 = 300 KB per frame): with fusing we should be able to clear 4px / cycle in a loop -> with loop overhead 77,000 c
- filling scanlines, pixel by pixel: at 8c/pixel it takes 640*240*8 = ~1.2M cycles, and that's with zero overdraw (impossible with Z-sorting; there is always some overdraw in non-flat terrain)
- clipping each scanline's XPOS: up to 15-20c per scanline, which will add up significantly
So a simple scanline Blitter (working fully in parallel) can save an insane amount of performance. BTW, I'm done with 24-bit experimenting and started porting the game today. Hopefully I will have something running by the end of this week...
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 29 Jan 2020 05:26
| Vladimir Repcak wrote:
| Yeah, an audio interrupt at a rate 22050 will certainly eat some resources.
|
This is not how you normally do audio on AMIGA. AMIGA plays audio with DMA. Our SAGA chipset has 8 DMA channels to play up to 8 streams of 8-16 bit samples in parallel. You point the DMA at a PCM file ("WAVE") in memory and tell it to play it. You don't get an IRQ per sample, but one per WAVE file played. Playing audio costs you next to no CPU time.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 29 Jan 2020 05:37
| Vladimir Repcak wrote:
| Now, your post made me realize one thing - the VBlank IRQ will have some serious computational power available. But it comes at the cost of completely grinding the 3D component to a halt.
|
Let's do some simplified math. Let's say your CPU can execute 170 million instructions per second. At 50 Hz, this is 3.4 million per frame. Let's say your code to check user movement needs 100 instructions and you call it once per frame. You still have 3,399,900 instructions available for your render loop. The effect of such an IRQ every 3 million instructions is not even measurable.
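The same budget, written out as a small sketch. The three input figures are the ones assumed in the post; the function name is made up for illustration, and interrupt entry/exit overhead is ignored, as it is in the post.

```python
def frame_budget(instr_per_second, frame_rate_hz, handler_instr):
    """Split one frame's instruction budget between an IRQ handler and the rest."""
    per_frame = instr_per_second // frame_rate_hz
    remaining = per_frame - handler_instr
    handler_share = handler_instr / per_frame
    return per_frame, remaining, handler_share


# Figures from the post: 170 million instructions/s, 50 Hz, 100-instruction handler.
per_frame, remaining, share = frame_budget(170_000_000, 50, 100)
print(per_frame)        # 3400000
print(remaining)        # 3399900
print(f"{share:.4%}")   # 0.0029%
```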
Vladimir Repcak wrote:
| So, is it a brutal amount of work to get some second (much slower, perhaps 8/16 bit) CPU running on an FPGA ? I honestly have no idea..
|
About as useful as adding a candle to the sun to make it brighter.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 29 Jan 2020 06:06
| Vladimir Repcak wrote:
| - clearing framebuffer (640x240x2 = 300 KB per frame) - with fusing we should be able to clear 4px / cycle in a loop -> with loop overhead 77,000 c
|
Apollo has no loop overhead for such loops (if correctly coded). The correct figure is ~38,400 cycles, not 77,000.
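The figures in this exchange can be rechecked with a few lines of arithmetic. This is a sketch using only the numbers quoted in the thread (640x240 at 2 bytes per pixel, 4 px cleared per cycle, 8 cycles per filled pixel); reading the 77,000 estimate as "loop overhead doubles the clear cost" is an assumption on my part, and the function names are made up.

```python
WIDTH, HEIGHT, BYTES_PER_PIXEL = 640, 240, 2  # framebuffer from the thread


def framebuffer_bytes():
    """Size of one frame in bytes."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL


def clear_cycles(pixels_per_cycle=4, overhead_factor=1):
    """Cycles to clear the frame; overhead_factor=2 models a loop that
    spends as many cycles on loop control as on the stores (assumption)."""
    return WIDTH * HEIGHT // pixels_per_cycle * overhead_factor


def fill_cycles(cycles_per_pixel=8):
    """Cycles to fill every pixel exactly once (zero overdraw)."""
    return WIDTH * HEIGHT * cycles_per_pixel


print(framebuffer_bytes() // 1024)      # 300 (KB per frame)
print(clear_cycles())                   # 38400, no loop overhead
print(clear_cycles(overhead_factor=2))  # 76800, i.e. the ~77,000 estimate
print(fill_cycles())                    # 1228800, i.e. ~1.2M cycles
```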
| |