Information about the Apollo CPU and FPU. |
|
---|
| | Vladimir Repcak
Posts 359 21 Jan 2020 12:11
| Gunnar von Boehn wrote:
| Nice progress! Congratulations! I wonder if going for 15/16bit screenmode might be smart decision. What do you think? For games maybe more FPS has more value than slightly finer more color shades.. What do you think?
|
Thanks :) From my experience on Jaguar, in flat shading, the difference between 8-bit and 16-bit was around 15%, so it was absolutely worth it, on Jaguar, to ditch 8-bit and just go for 16-bit. Hoping for a similar scenario here - just not (8 vs 16) but rather (16 vs 24) :)
However, on Jaguar, the rendering was done in parallel - using the Blitter (e.g. while the Blitter was drawing the current scanline, the GPU was computing another in parallel) - which we won't have on Vampire via SW rasterizing (but we will via the texturing parallel unit - which will be yet another (the third) codepath). So, until I have benchmarks from a real V4, I honestly don't really have a clue how it will perform. Theoretically, each pixel store is one move op, so it shouldn't really matter much whether it's a 16-bit store or a 32-bit store, especially when it's fully cached. Computation of color, that's another matter...
Gunnar von Boehn wrote:
| Would you like to change the test engine to 15/16bit?
| The Radiosity lighting - I already have a 16-bit codepath even right now - that's what I was using on Jaguar - I just haven't tested it under WinUAE, but I have the two separate 16-bit functions. However, I don't really want to do benchmarks under an emulator as that might become quite misleading - I'm guessing the emulator will just run at the full speed of my PC, rendering the results completely useless... I'm hoping my V4 will get shipped within the next 2 weeks - I just paid the invoice into the IBAN account, so the real benchmarks should hopefully happen quite soon. By that time, I should implement vblank, double buffering and get rid of the framebuffer copying.
| |
| | Vladimir Repcak
Posts 359 21 Jan 2020 12:21
| Gunnar von Boehn wrote:
| For games maybe more FPS has more value than slightly finer more color shades.. |
Unfortunately, with Radiosity, the visual difference between 16-bit and 24-bit is quite drastic. At first I was quite shocked and spent 3 days triple checking my calculations in Excel, but they were alright - it's just that 65,536 colors is a surprisingly low number when it comes to shading (which must sound ironic given that most oldschool games were done in ~16 colors). At 16-bit, even though it's 65,536 colors, you can clearly see the "rings" - e.g. circles of separate shades. There's almost never a smooth transition.
Now, when the lightmap gets applied to the textures, that visual difference becomes smaller, and I'm guessing that at fast movement (FPS genre), 16-bit *might* be OK. That visual ratio can be controlled by the high-frequency vs low-frequency base textures. A high-frequency material will be fine at 16-bit, as its contrast will simply override the lightmap.
Ideally, I would write a separate 24-bit codepath for the final game so that the player has a full choice over the "framerate vs visuals", but I'd rather not promise it now till I see on real HW how much work that is...
| |
| | Vladimir Repcak
Posts 359 23 Jan 2020 20:38
| Don Adan wrote:
|
Vladimir Repcak wrote:
| Here's an example of the inner loop for the horizontal scanlines:
Higgs:
loop (lpMain = xlVisible)
{
    idxPixel = idxCurrent >> BitShiftR
    idxPixel <<= #2
    texPtr = texPtrStart + idxPixel
    (vidPtr)+ = (texPtr)
    idxCurrent += xpAdd
} |
ASM output:
loop_10_start:
    move.l d3,d4
    lsr.l d5,d4
    lsl.l #2,d4
    move.l a3,a2
    add.l d4,a2
    move.l (a2),(a0)+
    add.l d2,d3
    dbra d1,loop_10_start |
An FP version would be able to compute the Texel index in parallel. And, a third version - using the internal texturing unit should be even faster :) Of course, the example above is axis-aligned, so won't work for generic angled surfaces, but you can still make lots of games with that. |
perhaps you can use move.l (a3,d4.l*4),(a0)+ for replacing 4 instructions
|
I wasn't sure if the (a3,d4.l*4) would compile, but I just tried it and to my huge surprise it did compile, indeed! So, now the inner loop looks like this:
Higgs:
loop (lpMain = xlVisible)
{
    idxPixel = idxCurrent >> BitShiftR
    move.l (texPtrStart,idxPixel.l*4),(vidPtr)+
    idxCurrent += xpAdd
} |
I will need to update my Higgs compiler so it can parse more complex indirect assignments, e.g.: (vidPtr)+ = (texPtrStart,idxPixel.l*4)
The Asm version is now 3 ops shorter:
ASM output:
loop_10_start:
    move.l d3,d4
    lsr.l d5,d4
    move.l (a3,d4.l*4),(a0)+
    add.l d2,d3
    dbra d1,loop_10_start
|
Which is 1 op shorter than I would have gotten using the (a3,d4) displacement (that would be my next step in optimization). 1 op per pixel, at higher res would add up real fast for sure. More importantly, I learnt new stuff :) Thanks!
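For readers who don't read 68k assembly, here is a rough Python model of what the shortened loop computes. It is only a sketch under assumptions: `draw_scanline`, its parameters and the sample texture are hypothetical names for illustration, and the shift of 16 is assumed here (the thread only confirms D5 = 16 a few posts further down).

```python
def draw_scanline(texture, idx_start, xp_add, count, shift=16):
    """Model of the fixed-point inner loop: one texel copied per pixel."""
    out = []
    idx_current = idx_start                      # d3 in the ASM output
    for _ in range(count):                       # dbra d1,loop_10_start
        idx_pixel = idx_current >> shift         # move.l d3,d4 / lsr.l d5,d4
        out.append(texture[idx_pixel])           # move.l (a3,d4.l*4),(a0)+
        idx_current = (idx_current + xp_add) & 0xFFFFFFFF  # add.l d2,d3
    return out


# Hypothetical 8-texel texture, stepped at 0.5 texel per pixel (0x8000 with a 16-bit fraction).
texture = [0x100 + i for i in range(8)]
row = draw_scanline(texture, idx_start=0, xp_add=0x8000, count=8)
```

With a step of half a texel, each texel ends up written to two neighbouring pixels, which is the axis-aligned stretching the loop is meant for.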
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 23 Jan 2020 20:47
| Vladimir Repcak wrote:
| ASM output:
loop_10_start:
    move.l d3,d4
    lsr.l d5,d4
    move.l (a3,d4.l*4),(a0)+
    add.l d2,d3
    dbra d1,loop_10_start |
|
What value has D5 ?
| |
| | Vladimir Repcak
Posts 359 23 Jan 2020 21:12
| Gunnar von Boehn wrote:
| I wonder if going for 15/16bit screenmode might be smart decision. What do you think? For games maybe more FPS has more value than slightly finer more color shades.. What do you think? |
Well, I just thought of one genre where lower performance due to 24-bit shouldn't matter too much - a grid-based dungeon - something like Legend Of Grimrock, where the camera movement is slow enough that we can substantially lower the framerate, yet still retain smoothness (unlike, say, a first person shooter):
- All the walls there are axis-aligned
- At 800x600, there are 480,000 pixels on screen
- For simple calculations, let's assume 50% of the pixels are for horizontal walls and 50% are for vertical walls
- 5 ops per pixel: Horizontal walls
- 7 ops per pixel: Vertical walls
240,000 * 5 = 1.2 Mil
240,000 * 7 = 1.68 Mil
1.2 + 1.68 = 2.88 Mil ops
So, roughly ~3 Mil ops per frame to fill the 800x600x24bit screen. Now, there will be some pipeline cycles for the indirect addressing, so let's round it to 4-5 Mil ops per frame - which should be plenty smooth for this type of game. And that's just basic 68000 code without AMMX, the HW texturing unit or parallel floating-point computation. I'm sure a highly parallel code utilizing FP+AMMX+Integer could be written that would bring that number down to ~2 Mil ops per frame.
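The estimate above can be reproduced with a few lines of Python. This is just the arithmetic from the post written out; the constant names are made up for illustration, and the 5 and 7 ops per pixel are the poster's own figures, not measurements.

```python
WIDTH, HEIGHT = 800, 600
OPS_PER_PIXEL_HORIZONTAL = 5   # figure quoted in the post
OPS_PER_PIXEL_VERTICAL = 7     # figure quoted in the post

pixels = WIDTH * HEIGHT
horizontal_pixels = pixels // 2                 # assumed 50/50 split
vertical_pixels = pixels - horizontal_pixels

horizontal_ops = horizontal_pixels * OPS_PER_PIXEL_HORIZONTAL
vertical_ops = vertical_pixels * OPS_PER_PIXEL_VERTICAL
ops_per_frame = horizontal_ops + vertical_ops

print(f"{pixels} pixels, {ops_per_frame / 1e6:.2f} Mil ops per frame")
```

Note that this is a per-frame count; multiplying by the target frame rate gives the ops per second the CPU would have to sustain.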
| |
| | Vladimir Repcak
Posts 359 23 Jan 2020 21:15
| Gunnar von Boehn wrote:
| What value has D5 ? |
D5 = 16
It's a fixed-point, basically. So it could be directly replaced with FP code that would execute in parallel and save that 1 bitshift op. Also, if the index was FP, the Integer unit would get rid of the add.l d2,d3 at the end of the loop and continue working in parallel. So, an inner loop like this could really benefit from parallel Int + FP, I'm guessing.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 23 Jan 2020 21:18
| Vladimir Repcak wrote:
|
Gunnar von Boehn wrote:
| What value has D5 ? |
D5 = 16
|
You could do this on APOLLO:
loop_10_start:
    move.l (A3,D3.b2*4),(A0)+
    add.l D2,D3
    dbra d1,loop_10_start
D3.b2 means: use byte number 2 as the index. Bytes in a long: (3)(2)(1)(0)
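To make the byte-index idea concrete, here is a small Python comparison of the two ways of deriving the texel index. It is a sketch under an assumption: given the (3)(2)(1)(0) numbering, byte 2 is read here as bits 16..23 of the long, and the function names are hypothetical. Since a single byte can only hold values 0..255, the two forms agree only while the shifted index stays below 256.

```python
def index_by_shift(d3, shift=16):
    """Index as computed by move.l d3,d4 / lsr.l d5,d4 with D5 = 16."""
    return (d3 & 0xFFFFFFFF) >> shift


def index_by_byte2(d3):
    """Index taken from byte 2 of the long (assumed to be bits 16..23)."""
    return (d3 >> 16) & 0xFF


small = 0x00057FFF   # integer part 5, fits in one byte
large = 0x01230000   # integer part 0x123, does not fit in one byte

print(index_by_shift(small), index_by_byte2(small))
print(index_by_shift(large), index_by_byte2(large))
```

So the byte form drops the shift instruction, but only covers textures of at most 256 entries along the stepped axis.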
| |
| | Vladimir Repcak
Posts 359 23 Jan 2020 21:21
| Also, since this is not a 68000, where there is a difference (in number of cycles) between writing an 8-bit register value and a 32-bit register value, what exactly is the performance difference on 68080 between the two? Because it would seem to me that the code should execute in the same number of cycles, no? It's still 1 register per pixel (whether it's 256 colors or 16.7 Mil colors).
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 23 Jan 2020 21:23
| Not sure I understand your question correctly. Do you want to know the cycle count for the LOOP 10 code? Or whether a MOVE.B is faster than MOVE.L?
| |
| | Samuel Devulder
Posts 248 23 Jan 2020 21:54
| Beware of EA-bubbles on the 68080 core:
loop_10_start:
    move.l d3,d4               ; \ these two are fused: 1 cycle for both
    lsr.l d5,d4                ; /
    ; here there are 3 cycles waiting for the index reg (d4) to be ready.
    move.l (a3,d4.l*4),(a0)+   ; p1 2 cycles here
    add.l d2,d3                ; p2: free I guess
    dbra d1,loop_10_start      ; p1 |
Total is 3+4=7 cycles. Nearly half of it is a huge bubble on d4.l as index-reg. You can try to unroll this loop 3 or 4 times (depending on free regs) to fill that bubble with useful instructions. Here is my suggestion; considering that self-modifying code works fine on the 68080 (but not on other 68k), we can replace d2 by some #nnnn and get an extra reg:
    lsr.l #2,d1                ; divide the loop count by 4; assume d1 is a multiple of 4 (if not, the code needs adaptation, but this is easy)
loop_10_start:
    move.l d3,d4               ; Fused
    lsr.l d5,d4                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d2               ; Fused
    lsr.l d5,d2                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d6               ; Fused
    lsr.l d5,d6                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d7               ; Fused
    lsr.l d5,d7                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    ; bubble has been filled --> gone!
    move.l (a3,d4.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d2.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d6.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d7.l*4),(a0)+   ; p1 2 cycles here
    dbra d1,loop_10_start |
This makes 12 cycles for 4 pixels. This is less than 1 cycle per byte. Not bad?
@Gunnar: I think that fixed point is 16.16 in this case, so byte-index is too short.
| |
| | Vladimir Repcak
Posts 359 23 Jan 2020 22:18
| Gunnar von Boehn wrote:
| Not sure I understand your question correctly. Do you want to know the cycle count for the LOOP 10 code? Or whether a MOVE.B is faster than MOVE.L?
|
The second one - move.b versus move.l (how many cycles each or if they take the same number of cycles, what are the implications of each one on the pipeline/bandwidth/etc.)
| |
| | Vladimir Repcak
Posts 359 23 Jan 2020 22:19
| Gunnar von Boehn wrote:
| You could do this on APOLLO:
loop_10_start:
    move.l (A3,D3.b2*4),(A0)+
    add.l D2,D3
    dbra d1,loop_10_start
D3.b2 means: use byte number 2 as the index. Bytes in a long: (3)(2)(1)(0)
|
Wow, that is very interesting and for sure useful! Is there a document/link describing all these 080-specific nuances somewhere?
| |
| | Samuel Devulder
Posts 248 23 Jan 2020 22:24
| Oh, we can be even faster and gain 2 cycles, making a total of 10 cycles for 16 bytes (4 pixels), by using the fact that move Dn,(a0)+ can be fused :)
    move.l (a3,d4.l*4),d4      ; p1 1 cycle
    move.l (a3,d2.l*4),d2      ; p1 1 cycle
    move.l (a3,d6.l*4),d6      ; p1 1 cycle
    move.l (a3,d7.l*4),d7      ; p1 1 cycle
    move.l d4,(a0)+            ; Fused
    move.l d2,(a0)+            ; F 1 cycle
    move.l d6,(a0)+            ; Fused
    move.l d7,(a0)+            ; F 1 cycle
Notice: if one doesn't like the #nnnn above, it is still possible to use some An reg in place of #nnnn. This way the code will run on any 68020+ (not as fast as on the Apollo core though.)
@Vladimir: Moving a byte or a long to memory is the same cost, but take care of EA-bubbles. These are nasty and can slow down your code by a lot without you noticing (see my post above.) If the code is reorganized to remove the bubble, we get from 7 cycles per pixel in the original code down to 2.5 cycles per pixel. That's roughly 2.8 times the initial speed (yes, bubbles cost a lot!)
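The cycle figures quoted in the last two posts can be checked with simple arithmetic. The sketch below only restates the numbers given in the thread (7 cycles per pixel, then 12 and 10 cycles per 4 pixels); the variable names are illustrative and nothing here is measured on hardware.

```python
PIXELS_PER_ITERATION = 4
BYTES_PER_PIXEL = 4            # move.l writes one 32-bit pixel

original_cycles_per_pixel = 7  # 3-cycle bubble + 4 cycles of work, per the post
unrolled_cycles = 12           # first unrolled version, 4 pixels
fused_store_cycles = 10        # second version with fused move Dn,(a0)+

unrolled_per_pixel = unrolled_cycles / PIXELS_PER_ITERATION
unrolled_per_byte = unrolled_cycles / (PIXELS_PER_ITERATION * BYTES_PER_PIXEL)
fused_per_pixel = fused_store_cycles / PIXELS_PER_ITERATION
speedup = original_cycles_per_pixel / fused_per_pixel

print(unrolled_per_pixel, unrolled_per_byte, fused_per_pixel, speedup)
```

The 2.8x figure is therefore the ratio of the original 7 cycles per pixel to the 2.5 cycles per pixel of the final version.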
| |
| | Vladimir Repcak
Posts 359 23 Jan 2020 23:04
| Samuel Devulder wrote:
| Beware of EA-bubbles on the 68080 core:
loop_10_start:
    move.l d3,d4               ; \ these two are fused: 1 cycle for both
    lsr.l d5,d4                ; /
    ; here there are 3 cycles waiting for the index reg (d4) to be ready.
    move.l (a3,d4.l*4),(a0)+   ; p1 2 cycles here
    add.l d2,d3                ; p2: free I guess
    dbra d1,loop_10_start      ; p1 |
Total is 3+4=7 cycles. Nearly half of it is a huge bubble on d4.l as index-reg. You can try to unroll this loop 3 or 4 times (depending on free regs) to fill that bubble with useful instructions. Here is my suggestion; considering that self-modifying code works fine on the 68080 (but not on other 68k), we can replace d2 by some #nnnn and get an extra reg:
    lsr.l #2,d1                ; divide the loop count by 4; assume d1 is a multiple of 4 (if not, the code needs adaptation, but this is easy)
loop_10_start:
    move.l d3,d4               ; Fused
    lsr.l d5,d4                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d2               ; Fused
    lsr.l d5,d2                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d6               ; Fused
    lsr.l d5,d6                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d7               ; Fused
    lsr.l d5,d7                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    ; bubble has been filled --> gone!
    move.l (a3,d4.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d2.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d6.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d7.l*4),(a0)+   ; p1 2 cycles here
    dbra d1,loop_10_start |
This makes 12 cycles for 4 pixels. This is less than 1 cycle per byte. Not bad?
@Gunnar: I think that fixed point is 16.16 in this case, so byte-index is too short.
|
Very interesting! Thanks a lot!
Basically, just like on Jaguar's RISC GPU (& DSP), where a great deal of effort had to be spent on minimizing the bubbles in inner loops, we have to do the same here, but at least:
- there are no HW bugs and illegal combinations of ops
- the ops on 080 can do way more than RISC ops
So a simple brute-force 8x unroll (that I was originally thinking of) won't help here at all, as we would merely reproduce the bubble 8x. Fun :)
BTW, the fixed point is more like 0.16 - i.e. I only process the fractional part.
I'm wondering if using FP math for indexing would help with the bubble? I guess it's impossible to know for sure without writing it first.
This also means that one may have to write way more than 3-4 versions of the same inner loop till we get the max. performance possible :) Then again, for a grid-based game with axis-aligned walls there's literally just 2-3 pieces of code like that, so there's zero excuse for not doing that :)
| |
| | Vladimir Repcak
Posts 359 24 Jan 2020 01:30
| Samuel Devulder wrote:
| @Vladimir: Moving a byte or a long to memory is the same cost |
So, from a purely performance standpoint, it doesn't make any sense to favor the 256-color mode. Might as well go for 16.7 Mil, as the texturing code is the same. I'm guessing the only difference is in clearing the framebuffer, where a 32-bit store would clear 4 pixels instead of one. But for texturing, the code is the same - we still have to process texel by texel, regardless of bit depth.
Samuel Devulder wrote:
| ...but take care of EA-bubbles. These are nasty and can slow down your code by a lot without you noticing (see my post above.) If the code is reorganized to remove the bubble, we get from 7 cycles per pixel in the original code down to 2.5 cycles per pixel. That's roughly 2.8 times the initial speed (yes, bubbles cost a lot!)
|
This reminds me of optimizing shaders, where one hidden pipeline bubble could cripple the performance. Fun :)
I'm going to ask some questions about the code you wrote tomorrow after I drink coffee, as it's over my head now how exactly you fused it :) But 2.5c per pixel sounds pretty awesome. There's still a branching overhead, but I suspect my inner loop will be unrolled at least 64x to remove the branching overhead per pixel as much as possible.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 24 Jan 2020 02:17
| Regarding INDEX Mode
Some interesting information:
    add.l D2,D3
    * * Bubble -- Bubble because D3 was touched in ALU
    move.l (A3,D3),(A0)+
    add.l D2,A4
    -- No bubble!
    move.l (A3,A4),(A0)+
Use an address register to avoid the ALU2EA bubble. Then you do not need to unroll!
| |
| | Vladimir Repcak
Posts 359 24 Jan 2020 06:46
| Gunnar von Boehn wrote:
| Regarding INDEX Mode
Some interesting information:
    add.l D2,D3
    * * Bubble -- Bubble because D3 was touched in ALU
    move.l (A3,D3),(A0)+
    add.l D2,A4
    -- No bubble!
    move.l (A3,A4),(A0)+
Use an address register to avoid the ALU2EA bubble. Then you do not need to unroll! |
I don't understand why the first case has a bubble and the second doesn't. On Jaguar's RISC GPU, most ops finished within 3-4 cycles, even though more than 1 could be dispatched (just like on 080). How many cycles does it take internally to finish executing each integer op? The 68060 documentation mentions a 4-stage Execution Pipeline. Does that mean that in reality it takes up to 4 cycles for each op to finish executing? That would introduce 3-cycle bubbles after every op that uses a register that was already fetched a couple of ops ago, but whose final write hasn't happened yet.
| |
| | Vladimir Repcak
Posts 359 24 Jan 2020 07:17
| Samuel Devulder wrote:
| This way the code will run on any 68020+ (not as fast as on the apollo core though.) |
Yeah, well - now that I'm thinking of 800x600 @16.7M colors, I don't think 68020 will be much of a consideration, really :)
Now that it's clicking that Apollo has 2 execution units (with most ops being single-EA, hence finishing in 1 cycle), which at 85 MHz should really execute 2*85 = 170 MIPS, the 68020 looks like a 1.79 MHz Atari XL from the viewpoint of an Atari Falcon :) Meaning, completely different art assets, everything. I know, never say never...
| |
| | Vladimir Repcak
Posts 359 24 Jan 2020 07:54
| Alright, woke up, had my coffee and took a bite on this :) So, I re-read the first few pages of this thread to recall the info on fusing and pipelining:
1. If there's no dependency between two consecutive ops, each one is processed in parallel on a separate EU (Execution Unit) or, as we call 'em here, "pipe" (P1,P2)
2. If there is a dependency, then two such ops are "fused" into one op at whichever pipe is processing it (either P1 or P2).
Once I realized the two rules above, your description of fusing suddenly made complete sense and it instantly clicked :)
Samuel Devulder wrote:
| loop_10_start:
    move.l d3,d4               ; Fused
    lsr.l d5,d4                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d2               ; Fused
    lsr.l d5,d2                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d6               ; Fused
    lsr.l d5,d6                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
    move.l d3,d7               ; Fused
    lsr.l d5,d7                ; F P1 \ 1 cycle for these 3 instructions
    add.l #nnnnn,d3            ;   P2 /
|
12 ops in 4 cycles ? Possible ? 0.33 cycle per op ? Let's call it Superscalar Fusing :)
Samuel Devulder wrote:
|   ; bubble has been filled --> gone!
    move.l (a3,d4.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d2.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d6.l*4),(a0)+   ; p1 2 cycles here
    move.l (a3,d7.l*4),(a0)+   ; p1 2 cycles here
    dbra d1,loop_10_start
|
I presume the 2-cycle cost is here because of the Two-EA per op, correct?
Samuel Devulder wrote:
| This makes 12 cycles for 4 pixels. This is less than 1 cycle per byte. Not bad? |
If by "Not bad" you mean "Pretty awesome", then - yeah - Not bad at all :)
We still have looping, but that should be 1 cycle (for dbra) for those 4 unrolled pixels, so just 0.25c per pixel. In our 800x600 axis-aligned walls use case, with roughly 50% of pixels being Horizontal scanlines, that's 3.25c per pixel, i.e. 780,000c.
If we were shooting for 15 fps (more than enough for a grid-based dungeon crawler), we'd have a cycle budget:
85*2 = 170 MIPS (i.e. per second)
170 / 15 (fps) = 11.33 Mil ops per frame
And since this is indoor, we should be able to avoid clearing the framebuffer, as each pixel would be redrawn anyway.
0.78 M: Horizontal walls
1.72 M: Vertical walls (estimate based on fusing the Hor. walls)
--------
2.50 Mil ops to draw the screen
11.33 - 2.50 = 8.8 (or, in the worst case, for a single pipe: 4.4M)
There's no way the other logic will consume 4.4M. The next most taxing feature is going to be the scanline loop, but even at 600 vertical resolution, we shouldn't have more than ~10 vertical stripes in this game, i.e. ~5,000 scanlines. It's not going to take 100c on 080 per iteration, that's for sure, so a worst-case scenario is 0.5M.
4.4M - 0.5M = 3.9M for game logic/input/3D transform/clipping. Way too much :) Looks like we could go for 1024x768 :)
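The budget above can be written out as a short Python calculation. This is only a restatement of the post's own numbers (85 MHz, 2 pipes, 15 fps, 3.25 cycles per pixel, the 1.72 M estimate for vertical walls, ~5,000 scanlines at a pessimistic 100 cycles each); the variable names are made up and none of the values are measured.

```python
CLOCK_MHZ = 85
PIPES = 2
FPS = 15

mops_per_second = CLOCK_MHZ * PIPES                  # 170, best case with both pipes busy
budget_per_frame = mops_per_second / FPS             # in millions per frame

horizontal_cycles = 240_000 * 3.25                   # 3 cycles per pixel + 0.25 for dbra
horizontal_m = horizontal_cycles / 1e6
vertical_m = 1.72                                    # the post's estimate
draw_total_m = horizontal_m + vertical_m

remaining_m = budget_per_frame - draw_total_m        # both pipes
remaining_single_pipe_m = remaining_m / 2            # worst case: one pipe only

scanline_overhead_m = 5_000 * 100 / 1e6              # pessimistic 100 cycles per scanline
left_for_game_m = remaining_single_pipe_m - scanline_overhead_m

print(f"budget {budget_per_frame:.2f} M, drawing {draw_total_m:.2f} M, "
      f"left for game logic {left_for_game_m:.1f} M")
```

Changing `FPS` or the resolution-dependent pixel count is an easy way to see how quickly the headroom shrinks, for example when considering 1024x768.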
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 24 Jan 2020 09:38
| Vladimir Repcak wrote:
|
Gunnar von Boehn wrote:
| Regarding INDEX Mode
Some interesting information:
    add.l D2,D3
    * * Bubble -- Bubble because D3 was touched in ALU
    move.l (A3,D3),(A0)+
    add.l D2,A4
    -- No bubble!
    move.l (A3,A4),(A0)+
Use an address register to avoid the ALU2EA bubble. Then you do not need to unroll! |
I don't understand why first case has a bubble and second doesn't.
|
I can explain this easily.
The pipeline of the 68K Family looks like this:
1) Icache Fetch
2) Decode
3) Reg-Fetch
4) EA Calculation in EA-Unit(s)
5) Dcache Fetch
6) ALU Operation in ALU-Unit(s)
Because of this pipeline design, a 68K instruction can do a "free" EA calculation and a "free" DCache access as part of the instruction - in addition to the ALU operation.
Let's look at one example instruction:
ADD.L (A0)+,D0
This instruction does not do 1 thing, it does 3 things!
a) It uses the EA of (A0), increments it by 4, and then updates A0
b) It does a Cache/Memory Read
c) It uses the result from memory and adds it to D0
This design allows the 68K to do a lot more work per instruction than a RISC can do.
The advanced chips of the 68K family have dedicated units for these tasks:
(1) The EA unit(s) do the EA calculation and updates.
(2) The Dcache does the Cache read.
(3) The ALU does the ALU operations.
This separate-unit design is also the reason why the 68K has two types of registers.
Address Registers A0-A7 are owned by the EA-Units.
The Data Registers D0-D7 are owned by the ALU-Units.
The 68K instructions ADDA, SUBA and LEA are executed in the EA-Units.
Operations having memory or DATA registers as destination are executed in the ALU.
The 68K is by design a lot stronger than a RISC chip, as it can do significantly more operations per instruction.
The 68k coder has to take care to not create dependencies between ALU and EA-Unit.
Is the answer clear now? Or more questions?
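The three things that ADD.L (A0)+,D0 does can be spelled out in a tiny Python model. This is a functional sketch only: it shows which unit is responsible for which step according to the explanation above, not any timing, and the register dictionary and `add_l_postinc` helper are hypothetical names.

```python
def add_l_postinc(memory, regs):
    """Functional model of ADD.L (A0)+,D0 split into its three steps."""
    # a) EA unit: take the address in A0, then post-increment A0 by 4 (long size)
    ea = regs["a0"]
    regs["a0"] = ea + 4
    # b) Dcache: read a 32-bit operand (68k is big-endian)
    operand = int.from_bytes(memory[ea:ea + 4], "big")
    # c) ALU: add the operand to D0, wrapping at 32 bits
    regs["d0"] = (regs["d0"] + operand) & 0xFFFFFFFF


# Two longs in memory: 0x10 and 0x20
memory = bytearray((0x10).to_bytes(4, "big") + (0x20).to_bytes(4, "big"))
regs = {"a0": 0, "d0": 1}

add_l_postinc(memory, regs)
add_l_postinc(memory, regs)
print(regs)
```

In this picture, steps a) and b) only touch A0 and memory, while step c) touches D0, which matches the split of address registers to the EA units and data registers to the ALU described in the post.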
| |