Information about the Apollo CPU and FPU. |
|
---|
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 24 Jan 2020 09:44
| Vladimir Repcak wrote:
| 1. If there's no dependency between two consecutive ops, each one is processed in parallel on separate EU (Execution Unit) or as we call 'em here "pipe" (P1,P2)
|
Each pipe has an EA unit at the top and an ALU unit below it. The design looks somewhat like this:
  (icache)     16 bytes per cycle
  (decoders)   up to 2 instruction pairs (4 instr)
  (EA-Unit1) (EA-Unit2)
  (Dcache)     Read 64bit from Cache/Mem per Cycle
  (ALU1) (ALU2) (AMMX) (FPU)
  (write)      Write 64bit to Cache/Mem per Cycle
| |
| | Vladimir Repcak
Posts 359 24 Jan 2020 10:27
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| Gunnar von Boehn wrote:
| Regarding INDEX mode, some interesting information:
  add.l  D2,D3
  *
  * Bubble    -- Bubble because D3 was touched in ALU
  move.l (A3,D3),(A0)+

  add.l  D2,A4    -- No bubble!
  move.l (A3,A4),(A0)+
Use an address register to avoid the ALU2EA bubble. Then you do not need to unroll! |
I don't understand why first case has a bubble and second doesn't. |
I can explain this easily. The pipeline of the 68K family looks like this:
  1) Icache Fetch
  2) Decode
  3) Reg-Fetch
  4) EA Calculation in EA-Unit(s)
  5) Dcache Fetch
  6) ALU Operation in ALU-Unit(s)
Because of this pipeline design, a 68K instruction can do "free" EA calculation and "free" DCache access as part of the instruction, in addition to the ALU operation.
Let's look at one example instruction:
  ADD.L (A0)+,D0
This instruction does not do 1 thing, it does 3 things!
  a) It uses the EA of (A0), increments it by 4, and then updates A0
  b) It does a Cache/Memory read
  c) It uses the result from memory and adds it to D0
This design allows the 68K to do a lot more work per instruction than a RISC can do.
The advanced chips of the 68K family have dedicated units for these tasks:
  (1) The EA unit(s) do the EA calculation and updates.
  (2) The Dcache does the cache read.
  (3) The ALU does the ALU operations.
This separate-unit design is also the reason why the 68K has two types of registers. The address registers A0-A7 are owned by the EA units. The data registers D0-D7 are owned by the ALU units.
The 68K instructions ADDA, SUBA and LEA are executed in the EA units. Operations with memory or data registers as their destination are executed in the ALU.
The 68K is by design a lot stronger than a RISC chip, as it can do significantly more operations per instruction. The 68K coder has to take care not to create dependencies between the ALU and the EA unit.
Is the answer clear now? Or more questions?
|
Thanks. My brain however, literally, hurts right now :)
Surprisingly, despite this being brought up for the first time in this thread (as far as I can tell, that is), it makes complete sense. It does cast a slightly different light on the "easiness" of efficient 080 coding, though. Basically, all algorithms need to be looked at from the point of view of ALU vs EA, otherwise there will be random bubbles all over the place.
For example, I just examined the next-in-line most-often-executed loop, the scanline loop. It takes 50 cycles plus two divisions (2*35), so 120 cycles per scanline. There is not a single EA operation - everything is:
- D0-7 vs D0-7
- D0-7 vs RAM
- move/add/sub/mul/lsr
- div (2x)
- cmp/bgt/bne
So, that means that there are no bubbles then, right ? Probably a lot of fusing and a lot of P1/P2 execution.
Which brings me to my new realization : One should try to avoid using (d0-7,a0-7) indirect addressing as much as possible then, other than absolutely necessary, correct ?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 24 Jan 2020 10:38
| Vladimir Repcak wrote:
| One should try to avoid using (d0-7,a0-7) indirect addressing as much as possible then, other than absolutely necessary, correct ? |
EA calculation is "free" on 68K. Even the An++ update or --An is "free". Use these powerful 68K EAs as much as you like. EA will never create bubbles unless you "touch" the used regs very shortly before in the ALU. If you avoid this "touching" of the EA-needed regs in parallel with both units, all is good.
Btw, the EA-ALU bubble topic is well explained by Motorola in the 68060 manual. So maybe you would like to reread that paragraph as well. If you have questions, please just ask. I can happily explain with more examples.
The 68K pipeline design is vertical for a very good reason. This design allows you to do free memory operations without bubbles. Compare this with a RISC design, which is horizontal. The RISC design cannot do a free memory operation; it needs 2-3 instructions for 1 68K instruction:
  LOAD (ea)
  ALUOPP
In real life this is even more painful, as modern RISC chips have a bubble between a LOAD and an OPP using the result. This is how it looks in reality on PPC/ARM and others:
  LOAD (ea)
  *bubble*
  *bubble*
  ALUOPP
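The ALU-to-EA rule described in this thread can be sketched as a toy Python model. This is only an illustration, not a cycle-accurate description of the core: the 2-cycle penalty is taken from the two bubble slots shown in the examples, and the rule "only the immediately preceding instruction matters" is a simplifying assumption. The function name `count_bubbles` and the tuple encoding are made up for this sketch.

```python
# Toy model, not a cycle-accurate simulator of the 68080.
# Assumption: an EA that needs a register written by the ALU in the
# immediately preceding instruction costs 2 bubble cycles; everything
# else is treated as bubble-free.
ALU_TO_EA_BUBBLE = 2  # assumed penalty, in cycles


def count_bubbles(program):
    """Count bubble cycles for a list of (alu_written_regs, ea_regs) tuples.

    alu_written_regs: registers whose new value is produced in the ALU stage.
    ea_regs: registers the EA unit needs to form the address.
    """
    bubbles = 0
    previous_alu_writes = set()
    for alu_written_regs, ea_regs in program:
        if previous_alu_writes & set(ea_regs):
            bubbles += ALU_TO_EA_BUBBLE
        previous_alu_writes = set(alu_written_regs)
    return bubbles


# add.l D2,D3 ; move.l (A3,D3),(A0)+   -> D3 comes out of the ALU
data_index = [({"D3"}, set()), (set(), {"A3", "D3", "A0"})]
# add.l D2,A4 ; move.l (A3,A4),(A0)+   -> A4 is updated in the EA unit
addr_index = [(set(), set()), (set(), {"A3", "A4", "A0"})]

print(count_bubbles(data_index))  # 2
print(count_bubbles(addr_index))  # 0
```

In this model the only way to pay a penalty is to compute an index in a data register and use it in an address right away, which is the case the thread warns about.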
| |
| | Samuel Devulder
Posts 248 24 Jan 2020 12:06
| Vladimir Repcak wrote:
| 1. If there's no dependency between two consecutive ops, each one is processed in parallel on separate EU (Execution Unit) or as we call 'em here "pipe" (P1,P2)
|
Ok. But some instructions are bound to P1. These are FPU, AMMX, branches (B<CC>, except BRA) including jsr/bsr, MUL/DIV and some uncommon instructions. (BigGun will correct me if I'm not up-to-date here)
2. If there is dependency, then two such ops are "fused" into one op at whichever pipe is processing it (either P1 or P2). |
Fusing does not deal with general dependencies, but with specific patterns like this one:
  MOVE  Dp,Dn
  ALUOP Dq|#imm,Dn
which is internally treated as
  ALUOP Dp,(#Imm|Dq),Dn    (2 inputs, 1 output)
These patterns are detected early, and both 2-arg instructions are fused into one 3-arg instruction and executed in 1 cycle (works on P1 or P2).
Another case of fusing is writing or reading longs at consecutive memory addresses:
  MOVE.L Dp,(An)+
  MOVE.L Dq,(An)+
or
  MOVE.L (An)+,Dp
  MOVE.L (An)+,Dq
These two 32-bit accesses are "fused" into a single 64-bit access in 1 cycle.
In your code, "MOVE.L (mem),(mem)" costs at least 2 cycles because it requires 2 memory accesses (one for the read, one for the write), and the core can only do a single memory read or write each cycle. However, it can do 1 read and write to the same memory location in a single cycle, IIRC. Hence things like "ADDQ.L #4,(A0)" are done in 1 cycle, which is super cool. Again, BigGun will correct me if I'm making errors. (I got all this info by chatting with him on IRC, and my memory might be inaccurate on specific points.)
In my version, by not doing the read/write in the same operation, but collecting all the "reads" together first and doing the "writes" together at the end, I can benefit from fusing two 32-bit writes into a single 64-bit write and save 2 cycles out of 8 (25% faster).
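The "2 cycles out of 8" figure can be reproduced with a small Python counting model. The rules below are a simplification assumed for illustration (one memory access per cycle, and two back-to-back accesses of the same kind to consecutive longs fuse into one 64-bit access); the addresses are hypothetical, and the source addresses are deliberately scattered so that only the writes can fuse.

```python
# Toy model of the memory-access fusing described above; the rule set is an
# assumption for illustration, not a description of the real core.
# Each access is (kind, address) with kind "R" or "W", all 32-bit longs.


def memory_cycles(accesses):
    """Return the number of memory cycles under the assumed fusing rules."""
    cycles = 0
    i = 0
    while i < len(accesses):
        kind, addr = accesses[i]
        if i + 1 < len(accesses):
            next_kind, next_addr = accesses[i + 1]
            if next_kind == kind and next_addr == addr + 4:
                cycles += 1   # fused pair: one 64-bit access
                i += 2
                continue
        cycles += 1           # unfused 32-bit access
        i += 1
    return cycles


# Four longs copied from scattered source addresses (hypothetical values)
# to consecutive destination addresses.
src = [0x1000, 0x1100, 0x1200, 0x1300]
dst = [0x2000, 0x2004, 0x2008, 0x200C]

# move.l (mem),(mem) style: read, write, read, write, ...
interleaved = [a for s, d in zip(src, dst) for a in (("R", s), ("W", d))]
# all reads first, then all writes
grouped = [("R", s) for s in src] + [("W", d) for d in dst]

print(memory_cycles(interleaved))  # 8
print(memory_cycles(grouped))      # 6
```

Under these assumptions the grouped ordering needs 6 memory cycles instead of 8, which is the 25% saving mentioned in the post.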
| |
| | Nixus Minimax
Posts 416 24 Jan 2020 14:41
| Another thing worth mentioning is that code like this:
  Bcc .skip
  OPP
.skip
will be fused into a predicated OPP executed in a single cycle, and the Bcc will be practically free and never be mispredicted. I think this is true for most if not all 1-cycle single-pipe instructions.
| |
| | Samuel Devulder
Posts 248 24 Jan 2020 18:44
| Also, it applies to OPs that don't access memory, IIRC (so the operation can be reverted in case of misprediction).
| |
| | Vladimir Repcak
Posts 359 24 Jan 2020 23:32
Thanks. The 060 manual does, indeed, mention this 2-clock incurred stall in CH10 - just found it. Looks like I have some more reading to do :)
| |
| | Vladimir Repcak
Posts 359 24 Jan 2020 23:38
Thanks for the explanation. I incorrectly inferred the fusing use case from my prior code example. Any chance I merely missed some of the explanation of these features and it's already in some form of PDF or link somewhere around here ? Regardless, this thread should serve as an excellent learning spot for any 080-newbie like me :)
| |
| | Vladimir Repcak
Posts 359 24 Jan 2020 23:43
| Nixus Minimax wrote:
| Another thing worth mentioning is that code like this:
  Bcc .skip
  OPP
.skip
will be fused into a predicated OPP executed in a single cycle, and the Bcc will be practically free and never be mispredicted. I think this is true for most if not all 1-cycle single-pipe instructions.
|
So, the following code (bgt+move) will be fused into 1 op ?
  ; C version
  if (d1 >= d6) texScanline = l_localVar87

  ; ASM Version
  cmp.l  d1,d6
  bgt    .skip
  move.l l_localVar87,texScanline
.skip:
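Independent of whether this particular pair gets fused, the branch sense of the ASM can be checked against the C version with a small Python model. This is only a sketch: the function names are made up, and Python integers stand in for signed 32-bit register values.

```python
# Illustration only: models the compare-and-skip pair as a conditional
# update. It says nothing about whether the core fuses this pair; it only
# checks that the ASM and the C version agree.


def asm_version(d1, d6, tex_scanline, local_var87):
    # cmp.l d1,d6 sets the flags from d6 - d1;
    # bgt .skip branches when d6 > d1 (signed), skipping the move.
    if d6 > d1:
        return tex_scanline   # branch taken, move skipped
    return local_var87        # move.l l_localVar87,texScanline


def c_version(d1, d6, tex_scanline, local_var87):
    # if (d1 >= d6) texScanline = l_localVar87
    return local_var87 if d1 >= d6 else tex_scanline


# Compare both versions over a small range of signed values.
agree = all(
    asm_version(a, b, "old", "new") == c_version(a, b, "old", "new")
    for a in range(-4, 5)
    for b in range(-4, 5)
)
print(agree)  # True
```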
|
| |
| | Vladimir Repcak
Posts 359 25 Jan 2020 00:55
| Looks like I hit a roadblock for now. I've optimized the scanline loop as much as possible while retaining the most often used data in registers, but it became increasingly obvious that the next step should be utilizing the floating-point unit in parallel with the integer one.
Example: Computing the endpoints (xpLeft and xpRight) for every scanline is currently done using 5 ops per point (thus 5+5 = 10 ops total), which was very efficient on Jaguar (but not here):
  ; xpLeft
  add.l  a6,d4
  move.l d4,d1
  lsr.l  d7,d1
  move.l l_localVar71,xpLeft
  sub.l  d1,xpLeft
  ; xpRight
  add.l  a5,d5
  move.l d5,d1
  lsr.l  d7,d1
  move.l l_localVar72,xpRight
  add.l  d1,xpRight
But, if I used floating-point, I could simply do this in 2 ops:
  FADD fp0,fp1   ; fp0: xposLeftDelta   fp1: xposLeft
  FADD fp2,fp3   ; fp2: xposRightDelta  fp3: xposRight
On top of that, I could suddenly reuse 4 out of 8 data registers for a different purpose and wouldn't destroy 1 more register during such a computation.
The second substage of scanline traversal is computing the perspective-correct scanline of the texture, based on distance from the camera. That takes a whopping 11 integer ops (including a 35-cycle division and a 2-cycle mul). But, with floating-point, we could remove that division altogether (by computing it once per polygon outside of the loop) and just do a single FMUL, reducing it to about 3-4 ops.
It is thus very possible that the entire scanline loop could be refactored to never touch any RAM, except for the actual texturing (e.g. read texel, write texel). Since I have never touched floating-point in 68000 assembler, I'm looking at one sample from Aminet (plus MC68060UM.pdf). If anybody has any good links for FP code samples, I would appreciate it :)
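The trade between the 5-op fixed-point step and the single FADD can be sketched in Python. Everything concrete here is an assumption for illustration: the accumulator starts at 0, the shift is 16 bits, and the base x and delta values are hypothetical. The function names are made up as well.

```python
# Sketch of the two ways to step the left edge, with hypothetical values.
# Assumptions: the fixed-point accumulator is shifted right by `shift` bits
# (the lsr.l d7,d1 step) and subtracted from a base x (l_localVar71).


def left_edge_fixed(base_x, delta, shift, scanlines):
    acc = 0
    xs = []
    for _ in range(scanlines):
        acc += delta                         # add.l  a6,d4
        xs.append(base_x - (acc >> shift))   # move.l / lsr.l / move.l / sub.l
    return xs


def left_edge_float(base_x, delta, shift, scanlines):
    x = float(base_x)                        # xposLeft
    step = -delta / (1 << shift)             # xposLeftDelta, computed once
    xs = []
    for _ in range(scanlines):
        x += step                            # FADD fp0,fp1
        xs.append(x)
    return xs


fixed = left_edge_fixed(base_x=160, delta=0x18000, shift=16, scanlines=8)
flt = left_edge_float(base_x=160, delta=0x18000, shift=16, scanlines=8)

print(fixed)  # [159, 157, 156, 154, 153, 151, 150, 148]
print(flt)    # [158.5, 157.0, 155.5, 154.0, 152.5, 151.0, 149.5, 148.0]
```

The float version keeps the fractional part, so it still needs one conversion when the pixel coordinate is used. In this sketch the two results differ by less than one pixel, and which way that fraction goes depends on the rounding applied at conversion time.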
| |
| | Don Adan
Posts 38 25 Jan 2020 01:51
| Gunnar von Boehn wrote:
| Vladimir Repcak wrote:
| Gunnar von Boehn wrote:
| What value has D5 ? |
D5 = 16 |
You could do this on APOLLO:
loop_10_start:
  move.l (A3,D3.b2*4),(A0)+
  add.l  D2,D3
  dbra   d1,loop_10_start
D3.b2 means: use byte number 2 as the index. Bytes in a long: (3)(2)(1)(0) |
If D5=16, then perhaps you can use one more trick too. If I remember right, it will look like this:
  swap   D2
  swap   D3
  sub.l  d2,d3
loop_10_start:
  addx.l d2,d3
  move.l (a3,d3.w*4),(a0)+
  dbra   d1,loop_10_start
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 25 Jan 2020 07:49
| Hi Don, In theory a nice trick. But please mind that this code creates an ALU2EA bubble:
  swap   D2
  swap   D3
  sub.l  d2,d3
loop_10_start:
  addx.l d2,d3
  *bubble*
  *bubble*
  move.l (a3,d3.w*4),(a0)+
  dbra   d1,loop_10_start
Vlad, I have a general question. Where does A3 point to? Is this just a color table? How many color entries are in there? If you do not use texturing but only color gradients, how about calculating them in real time?
| |
| | Vladimir Repcak
Posts 359 25 Jan 2020 11:03
| It is texturing. Lightmap only at this point, but at 24-bit applying a classic material texture (concrete, brick) wouldn't be a problem if needed. To compute radiosity at real-time, Apollo would need to get a gfx accelerator of GeForce 7 caliber. Preferably GF 8800 GTX. If you check my screenshot again you can notice that there are textures at top of screen.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 25 Jan 2020 11:25
| Vladimir Repcak wrote:
| It is texturing.
|
What made me wonder is that there is no U and V, but only one direction calculated here?
| |
| | Vladimir Repcak
Posts 359 25 Jan 2020 11:52
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| It is texturing. |
What made me wonder is that there is no U and V, but only one direction calculated here?
|
Technically, there is UV, it's just merged into the address index computation, because the walls are axis aligned. But I have to separately handle the left wall from the right wall and the top wall from the bottom wall. I have had a lot of time to think this through in the past, so it's as efficient as possible per pixel. A generic triangle under an arbitrary camera angle would be quite a bit more complex, for sure. Although, with Apollo's texturing unit, the difference might become much smaller.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6252 25 Jan 2020 12:16
| Vladimir Repcak wrote:
| Technically, there is UV, it's just merged into address index computation, because the walls are axis aligned.
|
How big is the texture in X/Y?
| |
| | Vladimir Repcak
Posts 359 25 Jan 2020 21:27
| Gunnar von Boehn wrote:
| How big is the texture in X/Y? |
80x40 (3 walls)
80x80 (2 walls)
To increase the resolution, I adjust the dimension of a Patch (e.g. texel). The way the Radiosity works is a loop of:
- gathering energy from all other patches in the scene
- distributing a certain percentage of its energy to all other patches
There is no concept of RGB in Radiosity - it does all its computations in the Energy (watts) space. Only once a specific energy distribution threshold has been met are the final energies (per patch) converted into RGB using an Exposure formula. So, unlike a classic simplistic pixel shader that doesn't take indirect light into account, the radiosity allows you to see the light that has bounced a thousand times around the room (each time giving away - say - 40% of its current energy) - hence the indirect light.
| |
| | Vladimir Repcak
Posts 359 25 Jan 2020 21:34
| One of the things that are super exciting about AMMX is that instead of doing a 3x loop over (R,G,B), AMMX should in theory allow me to do just one operation over all 3 components. So, in theory, we could actually have a real-time pixel shader :) The room I have has 22,400 texels. While right now there's 5 ops per each RGB component, that's because I am not using floats, but fixed-point (e.g. add/mul/bitshift). I think I should be able to reduce the number of ops to just two via AMMX. Which would bring it down to ~45,000 ops. What's the performance of the AMMX unit ? Also 1 cycle per op ?
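For reference, the figures quoted in this thread can be worked through in a few lines of Python. The wall sizes are the ones given earlier in the thread (80x40 for 3 walls, 80x80 for 2 walls); the 2 ops per texel for AMMX is the hoped-for figure from the post, not a measured one.

```python
# Worked arithmetic for the figures quoted in this thread.
walls = [(80, 40, 3), (80, 80, 2)]   # (width, height, count) of wall textures
texels = sum(w * h * n for w, h, n in walls)

ops_fixed_point = texels * 3 * 5     # 3 components (R, G, B) x 5 ops each
ops_ammx = texels * 2                # hoped-for 2 AMMX ops per texel

print(texels)           # 22400
print(ops_fixed_point)  # 336000
print(ops_ammx)         # 44800
```

So the texture sizes add up to the 22,400 texels mentioned, and two ops per texel gives 44,800 ops, which is the "~45,000" estimate, compared with 336,000 ops for the per-component fixed-point version.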
| |
| | Vladimir Repcak
Posts 359 25 Jan 2020 23:39
| Alright, I just got my first floating-point working - within the scanline loop:
  ; fp7: TexelsPerPixel = TextureWidth / ScanlineWidth
  ; fp6: TexelIndex <0,TextureWidth>
  fmove.l #0,fp6
ScanlineLoop:
  fmove.l fp6,d0
  move.l  (texPtrStart,d0.l*4),(vidPtr)+
  fadd.x  fp7,fp6
  dbra    d1,ScanlineLoop

Now, I tried replacing d0.l*4 with fp6*4, but that obviously didn't compile - it was worth a try, though :) So, before we go unroll this loop, what kind of pipeline bubbles can I expect now that I am trying to run both FP and integer code ? Does this code have a bubble ?
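What the loop computes (not how fast it runs) can be sketched in Python. Two assumptions are baked in and should be checked against the real setup: dbra executes the body d1 + 1 times, and the float-to-integer fmove.l rounds to nearest, which is the usual default rounding mode of the 68K FPU. Truncation is offered as an alternative for comparison. The function name and the tiny texture sizes are made up for the example.

```python
# Behavioural sketch of the scanline loop, not a timing model.
# Assumptions: dbra runs the body d1 + 1 times, and fmove.l fp6,d0 converts
# with round-to-nearest (assumed default; truncation is the alternative).
import math


def scanline_indices(texture_width, scanline_width, truncate=False):
    texels_per_pixel = texture_width / scanline_width    # fp7
    texel_index = 0.0                                     # fp6
    indices = []
    for _ in range(scanline_width):                       # d1 = scanline_width - 1
        if truncate:
            d0 = math.floor(texel_index)
        else:
            d0 = round(texel_index)                       # ties go to the even integer
        indices.append(d0)                                # index for (texPtrStart,d0.l*4)
        texel_index += texels_per_pixel                   # fadd.x fp7,fp6
    return indices


print(scanline_indices(4, 8))                 # [0, 0, 1, 2, 2, 2, 3, 4]
print(scanline_indices(4, 8, truncate=True))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

In this model the rounding variant reaches index 4 for a 4-texel texture, one past the last texel, so it may be worth checking which rounding mode the conversion actually uses.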
| |
| | Vladimir Repcak
Posts 359 26 Jan 2020 00:11
| So, as I suspected yesterday, using FP reduced the number of ops and thus the registers used. The fixed-point computation especially was using lots of registers, hence there was a lot of register push/pop to/from the stack. Not needed anymore.
Took me an hour to clean it up (refactor, rearrange, etc.), but now my DrawScanline () function only uses D0,D1 and FP6,FP7,FP0. Plus, there was one division (35 cycles) that I originally planned to replace with a LUT, but since it's FDIV, it should really only be 1 cycle (if I recall FDIV's supposed performance correctly). I think I might be able to do the scanline traversal with only FP2-FP5. Will try next.
| |