




Writing 3D Engine for 68080 In ASM

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
24 Jan 2020 09:44


Vladimir Repcak wrote:

  1. If there's no dependency between two consecutive ops, each one is processed in parallel on separate EU (Execution Unit) or as we call 'em here "pipe" (P1,P2)

Each pipe has an EA-unit on top and an ALU-unit below.
The design looks somewhat like this:


      (icache)          16byte per cycle
      (decoders)          up to 2 instruction pairs (4 instr)
(EA-Unit1)  (EA-Unit2)
      (Dcache)            Read 64bit from Cache/Mem per Cycle
(ALU1)      (ALU2)
(AMMX)
(FPU)
      (write)            Write 64bit to Cache/Mem per Cycle



Vladimir Repcak

Posts 359
24 Jan 2020 10:27


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

 
Gunnar von Boehn wrote:

    Regarding INDEX Mode
     
    Some interesting information:
   
     

            add.l D2,D3
              *bubble*             -- Bubble because D3 was just touched in the ALU
            move.l (A3,D3),(A0)+
     
            add.l D2,A4            -- No bubble!
            move.l (A3,A4),(A0)+
     
    Use an address register to avoid the ALU-to-EA bubble.
    Then you do not need to unroll!
   

    I don't understand why the first case has a bubble and the second doesn't.
 

 
  I can explain this easily.
  The pipeline of the 68K Family looks like this:
  1) Icache Fetch
  2) Decode
  3) Reg-Fetch
  4) EA Calculation in EA-Unit(s)
  5) Dcache Fetch
  6) ALU Operation in ALU-Unit(s)
 
  Because of this pipeline design, a 68K instruction
  can do a "free" EA calculation and a "free" Dcache access as part of the instruction - in addition to the ALU operation.
 
  Let's look at one example instruction:
  ADD.L (A0)+,D0
 
  This instruction does not do 1 thing, it does 3 things!
  a) It takes the EA from (A0), increments it by 4, and updates A0
  b) It does a Cache/Memory read
  c) It takes the result from memory and adds it to D0
 
  This design allows the 68K to do a lot more work per instruction than a RISC can do.
 
  The advanced chips of 68K family have dedicated units for these tasks.
  (1) The EA unit(s) does the EA calculation and updates.
  (2) The Dcache does the Cache read
  (3) The ALU does the ALU operations.
 
  This separate-unit design is also the reason why the 68K has two types of registers.
 
  Address Registers A0-A7 are owned by the EA-Units
  The Data Registers D0-D7 are owned by the ALU-Units
 
  The 68K instructions ADDA, SUBA and LEA are executed in the EA-units.
  Operations with memory or data registers as destination are executed in the ALU.
 
  The 68K is by design a lot stronger than a RISC chip,
  as it can do significantly more operations per instruction.
  The 68K coder has to take care not to create dependencies between the ALU and the EA-unit.
 
  Is the answer clear now?
  Or more questions?

Thanks. My brain, however, literally hurts right now :)

Surprisingly, despite this being brought up for the first time in this thread (as far as I can tell, that is), it makes complete sense.

It does cast a slightly different light on the "easiness" of efficient 080 coding, though. Basically, all algorithms need to be looked at from the point of view of ALU vs EA, otherwise there will be random bubbles all over the place.

For example, I just examined the next most-often-executed loop - the scanline loop. It takes 50 cycles plus two divisions (2*35) = 120 cycles per scanline.

There is not a single EA operation - everything is:
- D0-7 vs D0-7
- D0-7 vs RAM
- move/add/sub/mul/lsr
- div (2x)
- cmp/bgt/bne

So, that means there are no bubbles then, right? Probably a lot of fusing and a lot of P1/P2 execution.

Which brings me to my new realization :

One should try to avoid using (d0-7,a0-7) indirect addressing as much as possible then, other than when absolutely necessary, correct?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
24 Jan 2020 10:38


Vladimir Repcak wrote:

  One should try to avoid using (d0-7,a0-7) indirect addressing as much as possible then, other than when absolutely necessary, correct?
 

 
 
EA calculation is "free" on 68K.
Even (An)+ or -(An) updates are "free".
Use these powerful 68K EAs as much as you like.
 
EA will never create bubbles unless you "touch" the used regs
in the ALU very shortly before.
If you avoid this "touching" of the EA-needed regs in parallel across both units, all is good.
 
Btw the EA-ALU bubble topic is well explained by Motorola in the 68060 Manual, so you might like to reread that paragraph too.
 
If you have questions, please just ask.
I can happily explain with more examples.

The 68K pipeline design is vertical for a very good reason.
This design allows you to do free memory operations without bubbles.

Compare this with a RISC design, which is horizontal.
A RISC design cannot do a free memory operation; it needs 2-3 instructions for 1 68K instruction:

LOAD (ea)
ALUOPP

In real life this is even more painful, as modern RISC chips have bubbles between a LOAD and an OP using the result.
This is how it looks in reality on PPC/ARM and others:

LOAD (ea)
*bubble*
*bubble*
ALUOPP




Samuel Devulder

Posts 248
24 Jan 2020 12:06


Vladimir Repcak wrote:

1. If there's no dependency between two consecutive ops, each one is processed in parallel on separate EU (Execution Unit) or as we call 'em here "pipe" (P1,P2)

Ok. But some instructions are "bound" to P1. These are FPU, AMMX, branches (B<cc>, except BRA) including jsr/bsr, MUL/DIV, and some uncommon instructions. (BigGun will correct me if I'm not up-to-date here.)

2. If there is dependency, then two such ops are "fused" into one op at whichever pipe is processing it (either P1 or P2).

Fusing doesn't deal with general dependency, but with specific patterns like these:
MOVE Dp,Dn
ALUOP Dq|#imm,Dn
which are internally treated as ALUOP Dp,(Dq|#imm),Dn (2 inputs, 1 output). These patterns are detected early, and the two 2-argument instructions are fused into one 3-argument instruction executed in 1 cycle (works on P1 or P2).
 
Another case of fusing is writing or reading longs at consecutive memory addresses:
MOVE.L Dp,(An)+
MOVE.L Dq,(An)+
or
MOVE.L (An)+,Dp
MOVE.L (An)+,Dq
These two 32-bit accesses are "fused" into a single 64-bit access in 1 cycle.
 
In your code, "MOVE.L (mem),(mem)" costs at least 2 cycles because it requires 2 memory accesses (one read, one write), and the core can only do a single memory read or write each cycle. However, it can do one read&write to the same memory location in a single cycle, IIRC. Hence things like "ADDQ.L #4,(A0)" are done in 1 cycle, which is super cool. Again, BigGun will correct me if I'm making errors. (I got all this info by chatting with him on IRC, and my memory might be inaccurate on specific points.)
 
In my version, by not doing the read/write in the same operation but collecting all the "reads" before and doing the "writes" together at the end, I can benefit from fusing two 32-bit writes into a single 64-bit one and save 2 cycles out of 8 (25% faster).


Nixus Minimax

Posts 416
24 Jan 2020 14:41


Another thing worth mentioning is that code like this:

  Bcc .skip
  OPP
.skip

will be fused into a predicated OPP executed in a single cycle, and the Bcc will be practically free and never mispredicted. I think this is true for most if not all 1-cycle single-pipe instructions.
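In higher-level terms, the fused pair behaves like a conditional (predicated) operation. A rough Python sketch of the semantics (not of the hardware, and with made-up values):

```python
# Semantics of the fused "Bcc .skip / OPP / .skip" pattern: the OPP executes
# predicated on the branch NOT being taken, with no branch actually performed.

def predicated_assign(branch_taken, old_value, new_value):
    """Keep old_value if the Bcc skips the op, else take new_value."""
    return old_value if branch_taken else new_value

# Example shape: cmp.l d1,d6 / bgt .skip / move.l src,dst -- the move happens
# only when "d6 > d1" is false, i.e. when d1 >= d6.
d1, d6 = 5, 3
dst, src = 0, 42
dst = predicated_assign(d6 > d1, dst, src)
```

Because the op is always issued and merely predicated, there is no branch to mispredict, which is why the Bcc comes out free.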



Samuel Devulder

Posts 248
24 Jan 2020 18:44


Also, only ops that don't access memory, IIRC (so the operation can be reverted in case of misprediction).


Vladimir Repcak

Posts 359
24 Jan 2020 23:32


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

    One should try to avoid using (d0-7,a0-7) indirect addressing as much as possible then, other than when absolutely necessary, correct?
 
  EA calculation is "free" on 68K.
  Even (An)+ or -(An) updates are "free".
  Use these powerful 68K EAs as much as you like.
  [...]
  Btw the EA-ALU bubble topic is well explained by Motorola in the 68060 Manual, so you might like to reread that paragraph too.

Thanks. The 060 manual does, indeed, mention this 2-clock stall in Chapter 10 - just found it.
Looks like I have some more reading to do :)



Vladimir Repcak

Posts 359
24 Jan 2020 23:38


Samuel Devulder wrote:

Vladimir Repcak wrote:

  1. If there's no dependency between two consecutive ops, each one is processed in parallel on separate EU (Execution Unit) or as we call 'em here "pipe" (P1,P2)
 
  Ok. But some instructions are "bound" to P1. [...]
 
  In my version, by not doing the read/write in the same operation but collecting all the "reads" before and doing the "writes" together at the end, I can benefit from fusing two 32-bit writes into a single 64-bit one and save 2 cycles out of 8 (25% faster).

Thanks for the explanation. I had incorrectly inferred the fusing use case from my prior code example.

Any chance I merely missed some explanation of these features, and it's already in some PDF or link somewhere around here?

Regardless, this thread should serve as an excellent learning spot for any 080-newbie like me :)



Vladimir Repcak

Posts 359
24 Jan 2020 23:43


Nixus Minimax wrote:

Another thing worth mentioning is that code like this:
 
    Bcc .skip
    OPP
  .skip
 
  will be fused into a predicated OPP executed in a single cycle, and the Bcc will be practically free and never mispredicted. I think this is true for most if not all 1-cycle single-pipe instructions.
 

So, the following code (bgt+move) will be fused into 1 op?


  ; C version
if (d1 >= d6) texScanline = l_localVar87


  ; ASM Version
cmp.l d1,d6
bgt .skip
  move.l l_localVar87,texScanline
.skip:




Vladimir Repcak

Posts 359
25 Jan 2020 00:55


Looks like I hit a roadblock for now.
 
I've optimized the scanline loop as much as possible while keeping the most often used data in registers, but it became increasingly obvious that the next step should be utilizing the floating-point unit in parallel with the integer one.
 
Example: computing the endpoints (xpLeft and xpRight) for every scanline is currently done using 5 ops per point (thus 5+5 = 10 ops total), which was very efficient on Jaguar (but not here):
 

    ; xpLeft
 
  add.l a6,d4
  move.l d4,d1
  lsr.l d7,d1
  move.l l_localVar71,xpLeft
  sub.l d1,xpLeft
 
    ; xpRight
  add.l a5,d5
  move.l d5,d1
  lsr.l d7,d1
  move.l l_localVar72,xpRight
  add.l d1,xpRight
 
 

 
  But, if I used floating-point, I could simply do this in 2 ops:
 

    FADD fp0,fp1    ; fp0: xposLeftDelta      fp1:xposLeft
    FADD fp2,fp3    ; fp2: xposRightDelta    fp3:xposRight
 

 
  On top of that, I could suddenly reuse 4 out of 8 data registers for a different purpose, and wouldn't destroy one more register during the computation.
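For what it's worth, the two versions really do walk the same integer x positions. A small Python sketch with hypothetical numbers (not from the actual engine), assuming d7 holds the fixed-point fraction shift:

```python
# Compare the 5-op fixed-point edge step with the single-FADD version:
# integer code keeps base - (accumulator >> d7), the FPU version keeps
# the x position's fractional accumulator directly in a register.

FRAC = 16                       # stands in for the shift count in d7
delta_fixed = 3 << 14           # 0.75 in 16.16 fixed point (stands in for a6)
delta_float = delta_fixed / (1 << FRAC)
base = 100                      # stands in for l_localVar71

acc_fixed = 0                   # d4
acc_float = 0.0                 # fp accumulator for xposLeftDelta
for scanline in range(5):
    acc_fixed += delta_fixed                 # add.l a6,d4
    xp_fixed = base - (acc_fixed >> FRAC)    # move/lsr/move/sub (5 ops total)
    acc_float += delta_float                 # FADD fp0,fp1 (1 op)
    xp_float = base - int(acc_float)
    assert xp_fixed == xp_float              # same pixel position every line
```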
 
  The second substage of scanline traversal is computing the perspective-correct scanline of the texture, based on distance from the camera.
  That takes a whopping 11 integer ops (including a 35-cycle division and a 2-cycle mul).
  But, with floating-point, we could remove that division altogether (by computing it once per polygon, outside the loop) and just do a single FMUL, reducing it to about 3-4 ops.
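The divide-hoisting idea can be sanity-checked in a few lines of Python (illustrative values only - the real per-polygon term depends on the engine):

```python
# Replace a per-scanline divide by one divide per polygon plus a
# per-scanline multiply: n / z == n * (1 / z) up to rounding.

z = 2.5                          # hypothetical per-polygon depth term
numerators = [10.0, 12.5, 40.0]  # hypothetical per-scanline values

# divide every scanline (a full ~35-cycle division each time):
divided = [n / z for n in numerators]

# divide once, multiply per scanline (FDIV outside the loop, FMUL inside):
inv_z = 1.0 / z
multiplied = [n * inv_z for n in numerators]

for a, b in zip(divided, multiplied):
    assert abs(a - b) < 1e-9     # results agree to rounding error
```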
 
 
  It is thus very possible that the entire scanline loop could be refactored to never touch any RAM, except for the actual texturing (e.g. read texel, write texel).
 
  Since I've never touched floating-point in 68000 assembler, I'm looking at a sample from Aminet (plus MC68060UM.pdf).
 
  If anybody has any good links for FP code samples, I would appreciate it :)
 
 


Don Adan

Posts 38
25 Jan 2020 01:51


Gunnar von Boehn wrote:

 
Vladimir Repcak wrote:

 
Gunnar von Boehn wrote:

    What value has D5 ?

    D5 = 16
 

 
  You could do this on APOLLO
 
 

      loop_10_start:
        move.l (A3,D3.b2*4),(A0)+
        add.l D2,D3
        dbra d1,loop_10_start
 

  D3.b2 means: use byte number 2 as the index.
 
  Bytes in a long: (3)(2)(1)(0)
 
 

 
  If D5=16, then perhaps you can use one more trick too. If I remember right, it will look like this:
 
 
 

    swap D2
    swap D3
    sub.l d2,d3
 
    loop_10_start:
      addx.l d2,d3
      move.l (a3,d3.w*4),(a0)+
    dbra d1,loop_10_start
 
 




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
25 Jan 2020 07:49


Hi Don,

In theory a nice trick.
But please mind that this code creates an ALU-to-EA bubble:

 


    swap D2
    swap D3
    sub.l d2,d3
    loop_10_start:
      addx.l d2,d3

      *bubble*
      *bubble*

      move.l (a3,d3.w*4),(a0)+
    dbra d1,loop_10_start
 
 

Vlad, I have a general question.
Where does A3 point?
Is it just a colortable?
How many color entries are in there?

If you don't use texturing but only color gradients,
how about calculating them in real time?




Vladimir Repcak

Posts 359
25 Jan 2020 11:03


It is texturing. Lightmap only at this point, but at 24-bit applying a classic material texture (concrete, brick) wouldn't be a problem if needed.

To compute radiosity in real time, Apollo would need a gfx accelerator of GeForce 7 caliber. Preferably a GF 8800 GTX.

If you check my screenshot again, you can notice that there are textures at the top of the screen.




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
25 Jan 2020 11:25


Vladimir Repcak wrote:

It is texturing. 

What made me wonder is that there is no U and V,
but only one direction calculated here?



Vladimir Repcak

Posts 359
25 Jan 2020 11:52


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

  It is texturing. 
 

 
  What made me wonder is that there is no U and V,
  but only one direction calculated here?
 

Technically, there is UV, it's just merged into address index computation, because the walls are axis aligned.

But I have to separately handle left from right wall and top from bottom wall.

I have had a lot of time to think this through in the past, so it's as efficient as possible per pixel.

A generic triangle under arbitrary camera angle would be quite a bit more complex, for sure.
Although, with Apollo's texturing unit, the difference might become much smaller.



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
25 Jan 2020 12:16


Vladimir Repcak wrote:

Technically, there is UV, it's just merged into address index computation, because the walls are axis aligned.
 

How big is the texture in X/Y?


Vladimir Repcak

Posts 359
25 Jan 2020 21:27


Gunnar von Boehn wrote:

  How big is the texture in X/Y?

80x40 (3 walls)
80x80 (2 walls)

To increase the resolution, I adjust the dimensions of a patch (i.e. a texel).
The way radiosity works is a loop of:
- gathering energy from all other patches in the scene
- distributing a certain percentage of its energy to all other patches

There is no concept of RGB in radiosity - it does all its computations in energy (watts) space. Only once a specific energy-distribution threshold has been met are the final energies (per patch) converted into RGB using an exposure formula.

So, unlike a classic simplistic pixel shader that doesn't take indirect light into account, radiosity lets you see light that has bounced a thousand times around the room (each time giving away - say - 40% of its current energy) - hence the indirect light.




Vladimir Repcak

Posts 359
25 Jan 2020 21:34


One of the things that is super exciting about AMMX is that instead of doing a 3x loop over (R,G,B), AMMX should in theory allow me to do just one operation over all 3 components.

So, in theory, we could actually have a real-time pixel shader :)
The room I have has 22,400 texels.

While right now there are 5 ops per RGB component, that's because I'm not using floats but fixed-point (e.g. add/mul/bitshift).

I think I should be able to reduce the number of ops to just two via AMMX. Which would bring it down to ~45,000 ops.

What's the performance of the AMMX unit? Also 1 cycle per op?
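The "all 3 components in one operation" idea is essentially SIMD-within-a-register. A scalar Python sketch of the principle, with hypothetical 16-bit lanes - this is not AMMX syntax:

```python
# Pack R,G,B into one machine word so a single add updates all three
# components at once, instead of a 3x loop over the channels.

def pack_rgb(r, g, b):
    return (r << 32) | (g << 16) | b    # 16 bits of headroom per component

def add_packed(p, q):
    # One add updates R, G and B simultaneously, valid as long as no
    # component overflows its 16-bit lane into the neighbour.
    return p + q

def unpack_rgb(p):
    return (p >> 32) & 0xFFFF, (p >> 16) & 0xFFFF, p & 0xFFFF

a = pack_rgb(10, 20, 30)
b = pack_rgb(1, 2, 3)
print(unpack_rgb(add_packed(a, b)))
```

A real SIMD unit additionally handles per-lane saturation and multiplies in hardware, which is where the per-pixel op count shrinks further.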


Vladimir Repcak

Posts 359
25 Jan 2020 23:39


Alright, I just got my first floating-point working - within the scanline loop:


; fp7: TexelsPerPixel = TextureWidth / ScanlineWidth
; fp6: TexelIndex <0,TextureWidth>

fmove.l #0,fp6

ScanlineLoop:
  fmove.l  fp6,d0
  move.l  (texPtrStart,d0.l*4),(vidPtr)+
  fadd.x  fp7,fp6
dbra d1,ScanlineLoop

Now, I tried replacing d0.l*4 with fp6*4, but that obviously didn't compile - it was worth a try, though :)

So, before we unroll this loop, what kind of pipeline bubbles can I expect now that I'm running both FP and integer code?
Does this code have a bubble?
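For reference, the loop above is a simple DDA over the texture. A Python rendering of its semantics (assuming the fmove.l conversion truncates - the actual result depends on the FPU's rounding mode):

```python
# DDA texel walk: step a float texel index by TexelsPerPixel each pixel
# and convert it to an integer texel offset, as fmove.l fp6,d0 does.

def scanline_indices(texture_width, scanline_width):
    texels_per_pixel = texture_width / scanline_width   # fp7
    texel_index = 0.0                                   # fp6
    out = []
    for _ in range(scanline_width):
        out.append(int(texel_index))      # fmove.l fp6,d0 (truncation assumed)
        texel_index += texels_per_pixel   # fadd.x fp7,fp6
    return out

print(scanline_indices(8, 4))   # minifying: skips texels
print(scanline_indices(4, 8))   # magnifying: repeats texels
```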



Vladimir Repcak

Posts 359
26 Jan 2020 00:11


So, as I suspected yesterday, using FP reduced the number of ops and thus registers used.
The fixed-point computation especially was using lots of registers, hence there was a lot of register push/pop to/from the stack. Not needed anymore.

Took me an hour to clean it up (refactor, rearrange, etc.), but now my DrawScanline() function only uses D0,D1 and FP6,FP7,FP0.

Plus, there was one division (35 cycles) that I originally planned to replace with a LUT, but since it's an FDIV, it should really only be 1 cycle (if I recall FDIV's supposed performance correctly).

I think I might be able to do the scanline traversal with only FP2-FP5. Will try next.
