Information about the Apollo CPU and FPU. |
|
---|
| | Vladimir Repcak
Posts 359 26 Jan 2020 01:35
| Vladimir Repcak wrote:
| I think I might be able to do the scanline traversal with only FP2-FP5. Will try next.
| Yep. What took about two pages of code on Jaguar's RISC GPU, with multiple workarounds for HW bugs/pipeline/etc., is -literally- just two lines of code here :)
    fsub.x fp5,fp4 ; xpLeft
    fadd.x fp3,fp2 ; xpRight
|
And there are no special cases to handle (when a polygon's edge is close to the camera).
And no ugly precision issue (jumping first and last pixel).
And a very simple positioning to the requested XPOS when clipping.
And I gained back 6 registers that were used for this.
And it's just two ops.
Win:Win:Win:Win:Win :- )))))
| |
| | Vladimir Repcak
Posts 359 26 Jan 2020 05:36
| It took one full day of FP refactoring, but I've replaced all 3 divisions (3*35 = 105 cycles) with floating-point ones (plus all neighboring computations), and now the whole scanline loop is just 19 ops (10 of which are FP, 9 are INT). I now have only two RAM variables (compared to about 20 yesterday); everything else is just registers.
Yesterday the same loop took about 160 cycles, so it's a huge improvement. I reckon there will be a couple of INT ops that will execute in parallel with the FP ones, so those 19 ops will really be just ~16 cycles, hence an improvement of a factor of 10:1.
On Jaguar, at 60 fps, I could process about 2,100 scanlines (just processing, no drawing), regardless of whether it was 1 or 100 polygons. So, if I can do 1 scanline in 16 cycles on Apollo, at 85 MHz, we get about 1.417 M cycles per frame @ 60 fps. That's a safe number without the second pipe engaged (with two CPU pipes, it'd be ~2.83 M cycles). 1.417M / 16c = 88,541 scanlines. That is a factor of 42:1 compared to Jaguar. Poor kitty :)
To give some perspective on that number of scanlines: the first room in Quake 1 (where you choose the difficulty), which I recreated manually in code, has about 5,000 scanlines. I'm having a brain freeze trying to imagine what kind of complexity a 3D scene with 100,000 scanlines has. Probably the only way to reach that number is with a very detailed terrain or high-poly characters. So, if we had a heightmap terrain at 640*480, we could have 100,000 / 480 = 208 columns on average, which is just ridiculous. Of course, rotating that many vertices would take a lot of time, no doubt about that...
Regardless, such a number of scanlines can only come from a very high poly count, so it will be interesting to benchmark that threshold (cull, transform, clip, rasterize)...
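The estimate above is easy to re-run for other loop costs. The following is only a back-of-the-envelope sketch in Python: the clock, frame rate and Jaguar figure are the ones quoted in the post, while the function names are made up for illustration. It assumes a perfectly bubble-free loop and ignores the second pipe and any drawing cost.

```python
def cycles_per_frame(clock_hz, fps):
    """CPU cycles available in one frame."""
    return clock_hz / fps

def scanlines_per_frame(clock_hz, fps, cycles_per_scanline):
    """Whole scanlines that fit into one frame at a given per-scanline cost."""
    return int(cycles_per_frame(clock_hz, fps) // cycles_per_scanline)

CLOCK_HZ = 85_000_000     # core clock quoted in the post
FPS = 60
JAGUAR_SCANLINES = 2_100  # Jaguar figure quoted in the post

apollo = scanlines_per_frame(CLOCK_HZ, FPS, 16)
print(f"{cycles_per_frame(CLOCK_HZ, FPS) / 1e6:.3f} M cycles per frame")
print(f"{apollo} scanlines per frame at 16 cycles each")
print(f"about {apollo / JAGUAR_SCANLINES:.0f}:1 versus the Jaguar")
```

Changing the last argument of `scanlines_per_frame` shows how quickly the budget shrinks once pipeline bubbles raise the real per-scanline cost.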
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 26 Jan 2020 06:44
| Vladimir Repcak wrote:
| was using lots of registers, hence there was a lot of register push/pop to/from stack. Not needed anymore.
|
Please mind that you have a lot more registers available on APOLLO.
You have 16 Address Regs.
You have 8+24 FPU Regs.
And you have 8+24 Data Regs.
This means push/pop should be gone.
| |
| | Vladimir Repcak
Posts 359 26 Jan 2020 07:12
| I just refactored the code and got the scanline loop down to 16 ops (9 FP, 7 INT), so if there are no bubbles, it should take 14c at most, resulting in ~100,000 scanlines processing throughput (1 frame time @ 60fps -> 1.417 M cycles / 14 = ~100,000). This is the code:
loopStart:
    fsub.x fp5,fp4 ; xpLeft
    fadd.x fp3,fp2 ; xpRight
    fmove.x fp7,fp1
    fmove.l DeltaYRemap,fp0
    fdiv.x fp0,fp1
    fmove.l zpCam32Front,fp0
    fadd.x fp1,fp0
    fmul.x fp6,fp0
    fmove.l fp0,d4
    cmp.l d4,d6
    bgt if_15_soe
    move.l l_localVar102,d4
if_15_soe:
    addq.l #1,d5
    addq.l #1,l_localVar91
    cmpi.l #0,l_localVar91
    bne if_16_soe
    move.l #1,l_localVar91
if_16_soe:
    dbra d7, loopStart
|
I have no idea on Floating Point bubbles, so would appreciate some feedback. Can this run in 14 cycles ?
| |
| | Vladimir Repcak
Posts 359 26 Jan 2020 07:30
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| was using lots of registers, hence there was a lot of register push/pop to/from stack. Not needed anymore. |
Please mind that you have a lot more registers available on APOLLO.
You have 16 Address Regs.
You have 8+24 FPU Regs.
And you have 8+24 Data Regs.
This means push/pop should be gone.
|
That approach will have to wait till I have my V4 shipped and set up. Cannot access those additional registers under emulator, correct ?
| |
| | Samuel Devulder
Posts 248 26 Jan 2020 07:47
| ScanlineLoop:
    fmove.l fp6,d0
    move.l (texPtrStart,d0.l*4),(vidPtr)+
    fadd.x fp7,fp6
    dbra d1,ScanlineLoop
Does this code have a bubble ? |
Yes it does. First, there must be "room" between the computation of d0 and its use; then, an fadd costs 5 cycles (IIRC). So once again that code will get full power only if you interleave the computation for 4 or more pixels. But beware, you cannot do fadd fp7,fp6 every other instruction. There must be 5 cycles between two of them. So the proper code would rather look like this:
    fmove.l #0,fp3
    fmove.x fp3,fp4
    fadd.x fp7,fp4 ; bubbles here
    fmove.x fp4,fp5
    fadd.x fp7,fp5 ; bubbles here
    fmove.x fp5,fp6
    fadd.x fp7,fp6
    fmul.s #4,fp7
ScanlineLoop:
    fmove.l fp3,d0 ; 1 cycle
    fmove.l fp4,d1 ; 1 cycle
    fmove.l fp5,d2 ; 1 cycle
    fmove.l fp6,d3 ; 1 cycle
    fadd.x fp7,fp3 ; 1 cycle
    fadd.x fp7,fp4 ; 1 cycle
    fadd.x fp7,fp5 ; 1 cycle
    fadd.x fp7,fp6 ; 1 cycle
    move.l (texPtrStart,d0.l*4),d0 ; 1 cycle
    move.l (texPtrStart,d1.l*4),d1 ; 1 cycle
    move.l (texPtrStart,d2.l*4),d2 ; 1 cycle
    move.l (texPtrStart,d3.l*4),d3 ; 1 cycle
    movem.l d0-d3,(vidptr)+ ; 2 cycles on v4 core because of fusing mem ops
    dbra d4,ScanlineLoop
This is 14 cycles for 4 pixels without any bubbles in the loop. This is slower than the fixed-point version because FPU ops have a bigger latency than integer ops (and because they can only execute on P1, IIRC).
Beware, fdiv is not 1 cycle, but 9, and fadd/fmul are 5 (or 6, I never remember) cycles:
    fdiv.x fp0,fp1
    fmove.l zpCam32Front,fp0
    ; here we must wait 7 cycles at least to get fp1 (result of fdiv)
    fadd.x fp1,fp0
    ; here we must wait 5 cycles to get fp0 (result of fadd above)
    fmul.x fp6,fp0
    ; here we must wait 5 cycles to get fp0 (result of fmul above)
    fmove.l fp0,d4
The only way to get rid of these bubbles is to unroll the loop 4 times and spread the computation into the bubbles, like I did in the first asm code of this message.
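The interleaving idea can be checked independently of the hardware: four accumulators that start at 0, step, 2*step and 3*step and each advance by 4*step should visit exactly the same texture coordinates as one accumulator advancing by step. The Python sketch below is only a model of that bookkeeping; the function names are hypothetical, exact fractions stand in for the FPU registers (so rounding behaviour of a real FPU is not represented), and the pixel count is assumed to be a multiple of four.

```python
from fractions import Fraction

def single_accumulator(step, count):
    """Reference: one accumulator advanced by `step` for every pixel."""
    acc, out = Fraction(0), []
    for _ in range(count):
        out.append(acc)
        acc += step
    return out

def interleaved(step, count, lanes=4):
    """Four independent accumulators, each advanced by lanes*step per pass."""
    accs = [step * i for i in range(lanes)]   # 0, step, 2*step, 3*step
    stride = step * lanes                     # models scaling the step by 4
    out = []
    for _ in range(count // lanes):
        out.extend(accs)                      # four independent reads
        accs = [a + stride for a in accs]     # four independent additions
    return out

step = Fraction(3, 8)
print(interleaved(step, 16) == single_accumulator(step, 16))
```

Because no addition in a pass depends on another addition in the same pass, the latency of each one can overlap with the others, which is the whole point of the unrolled loop.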
| |
| | Kamelito Loveless
Posts 261 26 Jan 2020 09:03
| I wonder if all those rules for avoiding bubbles could be added to VASM, so that it warns about them when you assemble the code. Even better would be an optimizer that follows the rules and re-arranges the code, but that might be too much; at least a warning that spots the problem and gives hints should be doable. This would speed up development and produce faster code more easily.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 26 Jan 2020 11:32
| Samuel Devulder wrote:
| ScanlineLoop:
    fmove.l fp3,d0 ; 1 cycle
    fmove.l fp4,d1 ; 1 cycle
    fmove.l fp5,d2 ; 1 cycle
    fmove.l fp6,d3 ; 1 cycle |
For the 3D coordinate transformation I'm sure Float does make sense. For the Rasterline run I think Float has little or no advantage.
Don't we agree that the Rasterline run can easily go in Integer?
| |
| | Vladimir Repcak
Posts 359 26 Jan 2020 23:08
| Samuel Devulder wrote:
| Beware, fdiv is not 1 cycle, but 9, and fadd/fmul are 5 (or 6, I never remember) cycles
|
9 ? I thought I read that the FPU is supposed to do a division every cycle. Then it would follow that fmul is 1 cycle too.
But 5/6 for fadd ? :( That quite destroys FP's usability for fast parallel code :(
I mean, there are still other advantages:
- we don't clutter INT registers (less RAM access for variables)
- much shorter code is easier to debug and modify
| |
| | Vladimir Repcak
Posts 359 26 Jan 2020 23:23
| Samuel Devulder wrote:
|
ScanlineLoop:
    fmove.l fp6,d0
    move.l (texPtrStart,d0.l*4),(vidPtr)+
    fadd.x fp7,fp6
    dbra d1,ScanlineLoop
Does this code have a bubble ? |
Yes it does. First, there must be "room" between the computation of d0 and its use; then, an fadd costs 5 cycles (IIRC). So once again that code will get full power only if you interleave the computation for 4 or more pixels. But beware, you cannot do fadd fp7,fp6 every other instruction. There must be 5 cycles between two of them. So the proper code would rather look like this:
    fmove.l #0,fp3
    fmove.x fp3,fp4
    fadd.x fp7,fp4 ; bubbles here
    fmove.x fp4,fp5
    fadd.x fp7,fp5 ; bubbles here
    fmove.x fp5,fp6
    fadd.x fp7,fp6
    fmul.s #4,fp7
ScanlineLoop:
    fmove.l fp3,d0 ; 1 cycle
    fmove.l fp4,d1 ; 1 cycle
    fmove.l fp5,d2 ; 1 cycle
    fmove.l fp6,d3 ; 1 cycle
    fadd.x fp7,fp3 ; 1 cycle
    fadd.x fp7,fp4 ; 1 cycle
    fadd.x fp7,fp5 ; 1 cycle
    fadd.x fp7,fp6 ; 1 cycle
    move.l (texPtrStart,d0.l*4),d0 ; 1 cycle
    move.l (texPtrStart,d1.l*4),d1 ; 1 cycle
    move.l (texPtrStart,d2.l*4),d2 ; 1 cycle
    move.l (texPtrStart,d3.l*4),d3 ; 1 cycle
    movem.l d0-d3,(vidptr)+ ; 2 cycles on v4 core because of fusing mem ops
    dbra d4,ScanlineLoop
This is 14 cycles for 4 pixels without any bubbles in the loop. This is slower than the fixed-point version because FPU ops have a bigger latency than integer ops (and because they can only execute on P1, IIRC).
Beware, fdiv is not 1 cycle, but 9, and fadd/fmul are 5 (or 6, I never remember) cycles:
    fdiv.x fp0,fp1
    fmove.l zpCam32Front,fp0
    ; here we must wait 7 cycles at least to get fp1 (result of fdiv)
    fadd.x fp1,fp0
    ; here we must wait 5 cycles to get fp0 (result of fadd above)
    fmul.x fp6,fp0
    ; here we must wait 5 cycles to get fp0 (result of fmul above)
    fmove.l fp0,d4
The only way to get rid of these bubbles is to unroll the loop 4 times and spread the computation into the bubbles, like I did in the first asm code of this message.
|
Thanks a lot. I had incorrect assumptions about the FP ops when I wrote it (I thought they all take 1 cycle).
So, it is still possible to write fast FP code; it's just much harder, and the resulting code will have to be heavily interleaved (due to the 5-9 cycle execution latency). That's still worth it for things like the scanline loop, but I don't think I want to be doing that for anything else :)
Is there a PDF or text file somewhere with the execution times for the FP ops on the 080 ? Or does it follow the 060 ?
I just checked the 060's FPU instruction execution times and almost nothing there is 1 cycle, indeed. It's all at least 3-5 cycles, with FDIV being 37. So, I guess from that standpoint it's still good.
Can't wait to get my V4 and start doing benchmarks on real HW myself...
| |
| | Vladimir Repcak
Posts 359 26 Jan 2020 23:27
| Kamelito Loveless wrote:
| I wonder if all those rules for avoiding bubbles could be added to VASM, so that it warns about them when you assemble the code. Even better would be an optimizer that follows the rules and re-arranges the code, but that might be too much; at least a warning that spots the problem and gives hints should be doable. This would speed up development and produce faster code more easily.
|
Yeah, that would be very nice, indeed, if we got a warning about bubbles.
| |
| | Vladimir Repcak
Posts 359 26 Jan 2020 23:33
| Gunnar von Boehn wrote:
|
Samuel Devulder wrote:
| ScanlineLoop:
    fmove.l fp3,d0 ; 1 cycle
    fmove.l fp4,d1 ; 1 cycle
    fmove.l fp5,d2 ; 1 cycle
    fmove.l fp6,d3 ; 1 cycle |
For the 3D coordinate transformation I'm sure Float does make sense. For the Rasterline run I think Float has little or no advantage.
Don't we agree that the Rasterline run can easily go in Integer?
|
Well, we're not talking easy, we're talking fast here :)
Before I was told that fadd actually takes 5 cycles, FP had a great advantage, as there were just 3 ops in the scanline loop for the horizontal scanlines (and far fewer INT ops and registers). FP is, however, utterly unusable for tight loops (3 ops, like in my scanline case), so the loop must be unrolled to some minimum degree (just like Samuel showed).
| |
| | Vladimir Repcak
Posts 359 27 Jan 2020 01:54
| Samuel Devulder wrote:
|
movem.l d0-d3,(vidptr)+ ; 2 cycles on v4 core because of fusing mem ops
This is 14 cycles for 4 pixels without any bubbles in the loop. This is slower than the fixed-point version because FPU ops have a bigger latency than integer ops (and because they can only execute on P1, IIRC).
|
Cool, this is actually a great example for an unrolled flatshading scanline loop. Each scanline can have any combination of 32-bit aligned / non-aligned Start and End address (which is handled separately).
That leaves a loop (though I had a separate jump solution on my 6502 flatshader that might be great to reimplement here) of fused 128-bit writes (4 pixels) into 32-bit aligned addresses:
    ; d0-d3 hold the same 32-bit value of the current scanline color (still 24-bit color space) and never change throughout the loop
    ; Can the following code run in 4*2 = 8 cycles (plus 1 cycle for dbra) with zero bubbles ?
Flatshading_Scanline_Loop:
    movem.l d0-d3,(vidptr)+ ; 2c: Pixel 0-3
    movem.l d0-d3,(vidptr)+ ; 2c: Pixel 4-7
    movem.l d0-d3,(vidptr)+ ; 2c: Pixel 8-11
    movem.l d0-d3,(vidptr)+ ; 2c: Pixel 12-15
    dbra d4,Flatshading_Scanline_Loop
|
Or would it make sense to rather do two of these (copying color in d0 into d4-d7 prior to this, of course)?
    movem.l d0-d7,(vidptr)+ ; 4c: Pixel 0-7
    movem.l d0-d7,(vidptr)+ ; 4c: Pixel 8-15
|
Really, the only thing that changes here is the (a0) register...
| |
| | Don Adan
Posts 38 27 Jan 2020 02:11
| Hi Don, In theory a nice trick. But please mind that this code creates an ALU 2 EA bubble |
Ok, what about this? The old version was perhaps buggy.
    moveq #0,D5
    swap D2
    swap D3
loop_10_start:
    move.l (a3,d3.w*4),(a0)+
    add.l d2,d3
    addx.w D5,D3 ; addx.l d5,d3 can be used too, if it is faster
    dbra d1,loop_10_start
And of course this code can be unrolled too.
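The swap/addx trick above keeps the integer part of a 16.16 fixed-point position in the low word (so it can be used directly as `d3.w`) and the fraction in the high word; the carry out of the fraction then reaches the integer part through the X flag. The Python sketch below models that arithmetic and compares it with conventional 16.16 stepping. It is only an illustration: the helper names are hypothetical, the step is assumed to be positive, and the integer part is assumed never to overflow 16 bits.

```python
MASK32 = 0xFFFFFFFF
MASK16 = 0x0000FFFF
HIGH16 = 0xFFFF0000

def swap(value):
    """Model of 68k SWAP: exchange the two 16-bit halves of a 32-bit value."""
    return ((value << 16) | (value >> 16)) & MASK32

def indices_plain(pos, step, count):
    """Conventional 16.16 stepping: the integer part is the upper word."""
    out = []
    for _ in range(count):
        out.append(pos >> 16)
        pos = (pos + step) & MASK32
    return out

def indices_swapped(pos, step, count):
    """Swapped layout: integer part in the low word, fraction in the high word."""
    d3, d2 = swap(pos), swap(step)
    out = []
    for _ in range(count):
        out.append(d3 & MASK16)           # d3.w is used directly as the index
        total = d3 + d2                   # models add.l d2,d3
        carry = total >> 32               # carry out of the fraction (X flag)
        d3 = total & MASK32
        low = (d3 + carry) & MASK16       # models addx.w d5,d3 with d5 = 0
        d3 = (d3 & HIGH16) | low
    return out

start, step = 0x0002C000, 0x00018000      # 2.75 and 1.5 in 16.16
print(indices_swapped(start, step, 8) == indices_plain(start, step, 8))
```

The model says nothing about timing; it only shows that the two layouts produce the same index sequence under the stated assumptions.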
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 27 Jan 2020 07:48
| Don Adan wrote:
|     moveq #0,D5
    swap D2
    swap D3
loop_10_start:
    move.l (a3,d3.w*4),(a0)+
    add.l d2,d3
    addx.w D5,D3
    dbra d1,loop_10_start
|
Regarding ADDX:
ADDX always looks like a nice trick. But there is a technical limitation with ADDX. ADDX depends on the flags of the previous instruction. As the flags need time to be "created", ADDX can never be the 2nd instruction in a super-scalar pair. This means ADDX is always in the 1st pipe. For this technical reason ADDX often limits super-scalar execution.
Regarding unrolling:
In real life, texture loops will often have short runs like 5-15 pixels and they will have variable (random) length. Unrolling is then not really possible. Only for a very special test case can I see unrolling work.
The cure for pipeline bubbles in the EA unit is the trick of using the EA unit itself to increment the registers.
LOOP:
    ADDA.L D1,A1 ; This runs in EA unit without bubble
    move.b (A0,A1.b1),(A2)+
    DBRA D0,LOOP
| |
| | Samuel Devulder
Posts 248 27 Jan 2020 10:05
| Yes, 5/6 for fadd and fmul, but that is latency. (I really need to check out Flype's benchmarking... [EDIT] here it is:
    fabs/fneg/ftst: 1 cycle
    fadd/fsub/fmul/fcmp: 6 cycles
    fdiv: 9 cycles
    others: "a lot"
    fmove fp,fp and fmove #imm,fp: 1 cycle (but I don't know about fmove fp,Dq).)
But yeah, the pipeline executes 1 op per cycle, but it has 5 stages. That means you cannot directly use the result of an FPU op right after it has been executed. You must wait for the operation to go through the stages, each one requiring 1 cycle. I mean: schedule your code and you'll get 1 FPU op per cycle with full power. Otherwise you'll add 5 cycles of latency in the case of a simple fmul and perform way slower than expected.
About scheduling: in my code above I was handling 4 pixels at a time, but I think it is possible to process fewer, because when I count cycles it seems that there are at least 10 cycles or so between an FPU op and the use of its result, where 5 cycles would have been enough. Maybe doing 2 pixels at a time would be enough (assuming "fmove Fp,Dq" can be done in 1 or 2 cycles).
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 27 Jan 2020 19:21
| FPU instructions have 2 attributes:
1) throughput
2) latency
Throughput of 1/cycle means that you can issue and finish 1 instruction per cycle. This means that at 85 MHz you can reach 85 MFlops this way.
Latency means the time until the result is "valid".
Good FPU code is parallel rather than sequential: it waits out the latency before using a result and can therefore reach the full throughput score.
My feeling is that the FPU is great for doing 3D transformation, as this can be easily parallelized. For raster code I feel integer is a lot easier. Or even better, using the "Rastaman".
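The difference between throughput and latency can be made concrete with a toy model. The Python sketch below assumes a single in-order pipe that issues at most one instruction per cycle and stalls until every source register is valid; the latency table uses the figures Samuel quoted from Flype's benchmark, and everything else (names, the model itself) is an assumption for illustration, not a description of the real 68080 core.

```python
# Result latencies in cycles, as quoted earlier in the thread.
LATENCY = {"fadd": 6, "fsub": 6, "fmul": 6, "fdiv": 9, "fmove": 1}

def run(program):
    """Return (total_issue_cycles, stall_cycles) for a list of (op, src, dst)."""
    ready = {}            # register -> first cycle at which its value is valid
    cycle = 0
    stalls = 0
    for op, src, dst in program:
        # Two-operand FPU arithmetic reads its destination too; fmove does not.
        sources = (src,) if op == "fmove" else (src, dst)
        needed = max(ready.get(reg, 0) for reg in sources)
        if needed > cycle:
            stalls += needed - cycle
            cycle = needed
        ready[dst] = cycle + LATENCY[op]
        cycle += 1
    return cycle, stalls

# Four additions into the same register: each one waits for the previous result.
dependent = [("fadd", "fp7", "fp6")] * 4
# Four additions into different registers: nothing waits.
independent = [("fadd", "fp7", reg) for reg in ("fp3", "fp4", "fp5", "fp6")]

print(run(dependent))
print(run(independent))
```

In this model the dependent chain spends most of its time stalled, while the independent version issues one operation per cycle, which is the behaviour the "85 MFlops" figure relies on.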
| |
| | Vladimir Repcak
Posts 359 28 Jan 2020 00:34
| Samuel Devulder wrote:
| Yes, 5/6 for fadd and fmul, but that is latency. (I really need to check out Flype's benchmarking... [EDIT] here it is: fabs/fneg/ftst: 1 cycle, fadd/fsub/fmul/fcmp: 6 cycles, fdiv: 9 cycles, others: "a lot"; fmove fp,fp and fmove #imm,fp: 1 cycle, but I don't know about fmove fp,Dq.) But yeah, the pipeline executes 1 op per cycle, but it has 5 stages. That means you cannot directly use the result of an FPU op right after it has been executed. You must wait for the operation to go through the stages, each one requiring 1 cycle. I mean: schedule your code and you'll get 1 FPU op per cycle with full power. Otherwise you'll add 5 cycles of latency in the case of a simple fmul and perform way slower than expected.
|
Thanks, that explains it. And adjusts my expectations accordingly.
Basically, the FPU is utterly unusable for short tight loops. You really need long and already slow code so you can hide the FPU latency behind it. Or, you butcher the algorithm in a way that it works with "old values", so that the FPU latency can be hidden.
I've done that on Jaguar when I was hiding the division latency, but it made the code utterly incomprehensible two weeks after I wrote it, resulting in really nasty and hard-to-debug bugs once I made adjustments to that code and stuff started to behave weirdly on occasion. So, I'm pretty sure I don't want to go down that route. It's too expensive in the long term. If this was a day job, for a corporation, and I was salaried - sure - why not :)
Safer to just avoid FP altogether and let the integer pipes run at full steam.
Samuel Devulder wrote:
| About scheduling: in my code above I was handling 4 pixels at a time, but I think it is possible to process fewer, because when I count cycles it seems that there are at least 10 cycles or so between an FPU op and the use of its result, where 5 cycles would have been enough. Maybe doing 2 pixels at a time would be enough (assuming "fmove Fp,Dq" can be done in 1 or 2 cycles).
|
Yeah, I'm sure it *could* be done. But the only way to get there is to have a detailed table of all FP latencies, and every time you touch that code, you go through it again :)
Again, that's something that can be done for fun, over the weekend, if all you worry about is that one routine. I have too much experience from the last two years, when inevitably some adjustments to such initial assumptions had to be made for the sake of gameplay (or other factors that could not have been known at design time), turning such code into a complete throwaway.
I'm really glad I only burnt 3 days on floating point. It was fun while it lasted :)
| |
| | Vladimir Repcak
Posts 359 28 Jan 2020 00:45
| Gunnar von Boehn wrote:
| FPU instructions have 2 attributes: 1) throughput 2) latency. Throughput of 1/cycle means that you can issue and finish 1 instruction per cycle. This means that at 85 MHz you can reach 85 MFlops this way. Latency means the time until the result is "valid". Good FPU code is parallel rather than sequential: it waits out the latency before using a result and can therefore reach the full throughput score. My feeling is that the FPU is great for doing 3D transformation, as this can be easily parallelized. For raster code I feel integer is a lot easier. Or even better, using the "Rastaman".
|
So, how can you even theoretically achieve 85 MFLOPS ?
Can the FP unit handle 5 parallel FP ops ? Can you do the following and issue all five ops in 5 cycles ?
    cycle 1: FDIV fp7,fp6 ; fp6 available in 9c
    cycle 2: FADD fp0,fp1
    cycle 3: FSUB fp2,fp3
    cycle 4: FMUL fp4,fp5 ; fp5 available in 5c
    cycle 5: FMOVE fp0,fp3
    cycle 6-10: results from above ops get written to respective registers
| |
| | Vladimir Repcak
Posts 359 28 Jan 2020 01:03
| Yes, I finally found the thread which gave me the [incorrect] impression of 1 cycle per FP op: EXTERNAL LINK
Initially, I didn't notice the note about implied scheduling, but it is -indeed- down there, just a couple of replies below :)
| |