

Writing 3D Engine for 68080 In ASM

Vladimir Repcak

Posts 359
26 Jan 2020 01:35


Vladimir Repcak wrote:

  I think I might be able to do the scanline traversal with only FP2-FP5. Will try next.
Yep. What took about two pages of code on Jaguar's RISC GPU, with multiple workarounds for HW bugs/pipeline/etc., is -literally- just two lines of code here :)


fsub.x fp5,fp4  ; xpLeft
fadd.x fp3,fp2  ; xpRight

And there are no special cases to handle (when polygon's edge is close to camera)
And no ugly precision issue (jumping first and last pixel)
And a very simple positioning to the requested XPOS when clipping.
And I gained back 6 registers that were used for this.
And it's just two ops

Win:Win:Win:Win:Win
:- )))))


Vladimir Repcak

Posts 359
26 Jan 2020 05:36


It took one full day of FP refactoring, but I've replaced all 3 divisions (3*35 = 105 cycles) with FloatingPoint ones (plus, all neighboring computations) and now the whole scanline loop is just 19 ops (10 of which are FP, 9 are INT). I now have only two RAM variables (compared to about 20 yesterday), everything else is just registers.

Yesterday the same loop took about 160 cycles, so it's a huge improvement - I reckon there will be a couple of INT ops that will execute in parallel (with the FP ones), so those 19 ops will really take just ~16 cycles, hence an improvement by a factor of 10:1.

On Jaguar, at 60 fps, without any drawing, I could process about 2,100 scanlines (just processing, no drawing), regardless of whether it was 1 or 100 polygons.

So, if I can do 1 scanline in 16 cycles on Apollo, at 85 MHz, we get about 1.417 M cycles (per frame) @ 60 fps. That's a safe number without second pipe engaged (with two CPU pipes, it'd be ~2.83 M cycles).

1.417M / 16c = 88,541 scanlines.
That is a factor of 42:1 compared to Jaguar. Poor kitty :)
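
The budget arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the estimate in the post: the 85 MHz clock, 60 fps, 16 cycles per scanline and the 2,100-scanline Jaguar figure come from the text, while the function and constant names are made up for illustration. It assumes a single pipe and no stalls, exactly like the estimate itself.

```python
# Back-of-the-envelope frame budget, using the figures quoted in the post.
APOLLO_CLOCK_HZ = 85_000_000   # 85 MHz core clock (from the post)
FPS = 60                       # target frame rate (from the post)
JAGUAR_SCANLINES = 2_100       # scanlines per frame measured on Jaguar (from the post)


def cycles_per_frame(clock_hz, fps):
    """Whole CPU cycles available in one frame."""
    return clock_hz // fps


def scanlines_per_frame(clock_hz, fps, cycles_per_scanline):
    """Whole scanlines that fit in one frame at a fixed per-scanline cost."""
    return clock_hz // (fps * cycles_per_scanline)


budget = cycles_per_frame(APOLLO_CLOCK_HZ, FPS)
apollo_scanlines = scanlines_per_frame(APOLLO_CLOCK_HZ, FPS, 16)
ratio = apollo_scanlines / JAGUAR_SCANLINES

print(f"cycles per frame:    {budget}")
print(f"scanlines per frame: {apollo_scanlines}")
print(f"ratio vs Jaguar:     {ratio:.1f}:1")
```

The per-scanline cost is the only input that is a guess here, so it is the one to vary once real measurements are available.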

To give some perspective on that number of scanlines, the first room in Quake 1 (where you choose the difficulty) -which I recreated manually in code- has about 5,000 scanlines.

I'm having a brain freeze, trying to imagine what kind of complexity is a 3D scene with 100,000 scanlines. Probably the only way to reach that number is with a  very detailed terrain or high-poly characters.
So, if we had a heightmap terrain, at 640*480, we could have 100,000 / 480 = 208 columns on average, which is just ridiculous. Of course, rotating that many vertices would take a lot of time, no doubt about that...

Regardless, such a number of scanlines can only come from a very high poly count, so it will be interesting to benchmark that threshold (cull, transform, clip, rasterize)...



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
26 Jan 2020 06:44


Vladimir Repcak wrote:

was using lots of registers, hence there was a lot of register push/pop to/from stack. Not needed anymore.

Please mind that on APOLLO you have a lot more registers available.
You have 16 Address Regs
You have  8+24 FPU Regs
And you have 8+24 Data Regs

This means push/pop should be gone.


Vladimir Repcak

Posts 359
26 Jan 2020 07:12


I just refactored the code and got the scanline loop down to 16 ops (9 FP, 7 INT), so if there are no bubbles, it should take 14c at most, resulting in ~100,000 scanlines processing throughput (1 frame time @ 60fps -> 1.417 M cycles / 14 = ~100,000).

This is the code:

loopStart:
  fsub.x fp5,fp4  ; xpLeft
  fadd.x fp3,fp2  ; xpRight
  fmove.x fp7,fp1
  fmove.l DeltaYRemap,fp0
  fdiv.x fp0,fp1
  fmove.l zpCam32Front,fp0
  fadd.x fp1,fp0
  fmul.x fp6,fp0
  fmove.l fp0,d4
  cmp.l d4,d6
  bgt if_15_soe
    move.l l_localVar102,d4
    if_15_soe:
  addq.l #1,d5
  addq.l #1,l_localVar91
  cmpi.l #0,l_localVar91
  bne if_16_soe
    move.l #1,l_localVar91
    if_16_soe:
dbra d7, loopStart

I have no idea on Floating Point bubbles, so would appreciate some feedback.
Can this run in 14 cycles ?



Vladimir Repcak

Posts 359
26 Jan 2020 07:30


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

  was using lots of registers, hence there was a lot of register push/pop to/from stack. Not needed anymore.
 

  Please mind that on APOLLO you have a lot more registers available.
  You have 16 Address Regs
  You have  8+24 FPU Regs
  And you have 8+24 Data Regs
 
  This means push/pop should be gone.

That approach will have to wait till I have my V4 shipped and set up. Cannot access those additional registers under emulator, correct ?


Samuel Devulder

Posts 248
26 Jan 2020 07:47


ScanlineLoop:
    fmove.l  fp6,d0
    move.l  (texPtrStart,d0.l*4),(vidPtr)+
    fadd.x  fp7,fp6
  dbra d1,ScanlineLoop
Does this code have a bubble ?

Yes it does.
 
First, there must be "room" between the computation of d0 and its use; then, an fadd costs 5 cycles (IIRC). So once again, that code will get full power only if you interleave the computation for 4 or more pixels. But beware, you cannot do fadd fp7,fp6 every other instruction. There must be 5 cycles between two of them. So the proper code would rather look like this:

    fmove.l #0,fp3
    fmove.x fp3,fp4
    fadd.x  fp7,fp4
  ; bubbles here
    fmove.x fp4,fp5
    fadd.x  fp7,fp5
  ; bubbles here
    fmove.x fp4,fp6
    fadd.x  fp7,fp6
    fmul.s #4,fp7
  ScanlineLoop:
    fmove.l  fp3,d0 ; 1 cycle
    fmove.l  fp4,d1 ; 1 cycle
    fmove.l  fp5,d2 ; 1 cycle
    fmove.l  fp6,d3 ; 1 cycle
 
    fadd.x  fp7,fp3 ; 1 cycle
    fadd.x  fp7,fp4 ; 1 cycle
    fadd.x  fp7,fp5 ; 1 cycle
    fadd.x  fp7,fp6 ; 1 cycle
 
    move.l  (texPtrStart,d0.l*4),d0 ; 1 cycle
    move.l  (texPtrStart,d1.l*4),d1 ; 1 cycle
    move.l  (texPtrStart,d2.l*4),d2 ; 1 cycle
    move.l  (texPtrStart,d3.l*4),d3 ; 1 cycle
 
    movem.l d0-d3,(vidptr)+ ; 2 cycles on v4 core because of fusing mem ops
    dbra d4,ScanlineLoop
 
This is 14 cycles for 4 pixels without any bubbles in the loop. This is slower than the fixed pt version because fpu ops have a bigger latency than integer ops (and because they can only execute on P1 iirc).
 
Beware, fdiv is not 1 cycle, but 9, and fadd/fmul are 5 (or 6, I never remember) cycles
   fdiv.x fp0,fp1
    fmove.l zpCam32Front,fp0
  ; here we must wait 7 cycles at least to get fp1 (result of fdiv)
    fadd.x fp1,fp0
  ; here we must wait 5 cycles to get fp0 (result of fadd above)
    fmul.x fp6,fp0
  ; here we must wait 5 cycles to get fp0 (result of fmul above)
    fmove.l fp0,d4
The only way to get rid of these bubbles is to unroll the loop 4 times and spread the computation into the bubbles, like I did in the first asm code of this message.
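
To make the effect of interleaving concrete, here is a toy Python model of the fdiv / fadd / fmul / fmove chain annotated above. It is a sketch under loud assumptions, not a description of the real 68080 pipeline: it assumes a single-issue, in-order FPU where every op (including fdiv) is fully pipelined, uses the latencies quoted in this post (fdiv 9, fadd/fmul 5), assumes fmove takes 1 cycle in every form, and uses made-up register labels (fa0, fb0, ...) so that four chains do not share registers.

```python
# Toy in-order, single-issue model: an op issues once its sources are ready,
# and its result becomes usable LATENCY cycles after issue.
# Latencies for fdiv/fadd/fmul are the ones quoted in the post; fmove = 1 is an assumption.
LATENCY = {"fdiv": 9, "fadd": 5, "fmul": 5, "fmove": 1}


def finish_time(ops):
    """Return the cycle at which the last result of `ops` becomes available."""
    ready = {}          # register label -> cycle at which its value is usable
    next_issue = 0      # earliest cycle the next op may issue
    finish = 0
    for name, sources, dest in ops:
        issue = max([next_issue] + [ready.get(reg, 0) for reg in sources])
        ready[dest] = issue + LATENCY[name]
        finish = max(finish, ready[dest])
        next_issue = issue + 1
    return finish


def chain(k):
    """One copy of the dependent chain, on hypothetical registers fa<k>/fb<k>."""
    a, b = f"fa{k}", f"fb{k}"
    return [
        ("fdiv", (a, b), b),        # fdiv.x  a,b
        ("fmove", (), a),           # fmove.l mem,a
        ("fadd", (b, a), a),        # fadd.x  b,a
        ("fmul", ("fp6", a), a),    # fmul.x  fp6,a
        ("fmove", (a,), f"d{k}"),   # fmove.l a,Dk
    ]


one_chain = finish_time(chain(0))
serial = finish_time(chain(0) * 4)   # same registers reused, four times in a row
interleaved = finish_time(
    [op for group in zip(*(chain(k) for k in range(4))) for op in group]
)

print(one_chain, serial, interleaved)
```

In this model the four interleaved chains finish only a few cycles after a single chain would, because the waits of one chain are filled with the work of the others. Real hardware may differ in every assumption listed above.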


Kamelito Loveless

Posts 260
26 Jan 2020 09:03


I wonder if all those rules for avoiding bubbles could be added to VASM, so that it warns about them when you assemble the code. Even better would be an optimizer that follows the rules and re-arranges the code, but that might be too much; at least a warning spotting the problem and giving hints should be doable. This would speed up development and produce faster code more easily.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
26 Jan 2020 11:32


Samuel Devulder wrote:

  ScanlineLoop:
      fmove.l  fp3,d0 ; 1 cycle
      fmove.l  fp4,d1 ; 1 cycle
      fmove.l  fp5,d2 ; 1 cycle
      fmove.l  fp6,d3 ; 1 cycle
   
 

 
For the 3D coordinate transformation I'm sure Float does make sense.
For the Rasterline run I think Float has little or no advantage.

Don't we agree that the Rasterline run can easily go in Integer?


Vladimir Repcak

Posts 359
26 Jan 2020 23:08


Samuel Devulder wrote:

  Beware, fdiv is not 1 cycle, but 9, and fadd/fmul are 5 (or 6, I never remember) cycles

9 ? I thought I read that FPU is supposed to make division every cycle. Then it would follow that fmul is 1 cycle too.

But 5/6 for fadd ? :(

That quite destroys FP's usability for fast parallel code :(

I mean, there are still other advantages:
- we don't clutter INT registers (less RAM access for variables)
- much shorter code is easier to debug and modify




Vladimir Repcak

Posts 359
26 Jan 2020 23:23


Samuel Devulder wrote:

ScanlineLoop:
      fmove.l  fp6,d0
      move.l  (texPtrStart,d0.l*4),(vidPtr)+
      fadd.x  fp7,fp6
    dbra d1,ScanlineLoop
Does this code have a bubble ?

  Yes it does.
 
  First, there must be "room" between the computation of d0 and its use; then, an fadd costs 5 cycles (IIRC). So once again, that code will get full power only if you interleave the computation for 4 or more pixels. But beware, you cannot do fadd fp7,fp6 every other instruction. There must be 5 cycles between two of them. So the proper code would rather look like this:

      fmove.l #0,fp3
      fmove.x fp3,fp4
      fadd.x  fp7,fp4
  ; bubbles here
      fmove.x fp4,fp5
      fadd.x  fp7,fp5
  ; bubbles here
      fmove.x fp4,fp6
      fadd.x  fp7,fp6
      fmul.s #4,fp7
  ScanlineLoop:
      fmove.l  fp3,d0 ; 1 cycle
      fmove.l  fp4,d1 ; 1 cycle
      fmove.l  fp5,d2 ; 1 cycle
      fmove.l  fp6,d3 ; 1 cycle
   
      fadd.x  fp7,fp3 ; 1 cycle
      fadd.x  fp7,fp4 ; 1 cycle
      fadd.x  fp7,fp5 ; 1 cycle
      fadd.x  fp7,fp6 ; 1 cycle
   
      move.l  (texPtrStart,d0.l*4),d0 ; 1 cycle
      move.l  (texPtrStart,d1.l*4),d1 ; 1 cycle
      move.l  (texPtrStart,d2.l*4),d2 ; 1 cycle
      move.l  (texPtrStart,d3.l*4),d3 ; 1 cycle
   
      movem.l d0-d3,(vidptr)+ ; 2 cycles on v4 core because of fusing mem ops
      dbra d4,ScanlineLoop
 
This is 14 cycles for 4 pixels without any bubbles in the loop. This is slower than the fixed pt version because fpu ops have a bigger latency than integer ops (and because they can only execute on P1 iirc).
 
  Beware, fdiv is not 1 cycle, but 9, and fadd/fmul are 5 (or 6, I never remember) cycles
   fdiv.x fp0,fp1
    fmove.l zpCam32Front,fp0
  ; here we must wait 7 cycles at least to get fp1 (result of fdiv)
    fadd.x fp1,fp0
  ; here we must wait 5 cycles to get fp0 (result of fadd above)
    fmul.x fp6,fp0
  ; here we must wait 5 cycles to get fp0 (result of fmul above)
    fmove.l fp0,d4
The only way to get rid of these bubbles is to unroll the loop 4 times and spread the computation into the bubbles, like I did in the first asm code of this message.

Thanks a lot. I had incorrect assumptions about the FP when I wrote it (I thought they all take 1 cycle).

So, it is still possible to write fast FP code; it's just much harder, and the resulting code will have to be heavily interleaved (due to the 5-9 cycle execution latency).

That's still worth it for things like scanline loop, but I don't think I want to be doing that for anything else :)

Is there a PDF or text file somewhere with the execution times for the FP ops on 080 ? Or does it follow 060 ?
I just checked the 060's FPU Instruction execution times and almost nothing there is 1 cycle, indeed. It's all at least 3-5 cycles with FDIV being 37.

So, I guess from that standpoint it's still good.

Can't wait to get my V4 and start doing benchmarks on real HW myself...



Vladimir Repcak

Posts 359
26 Jan 2020 23:27


Kamelito Loveless wrote:

I wonder if all those rules for avoiding bubbles could be added to VASM, so that it warns about them when you assemble the code. Even better would be an optimizer that follows the rules and re-arranges the code, but that might be too much; at least a warning spotting the problem and giving hints should be doable. This would speed up development and produce faster code more easily.

Yeah, that would be very nice, indeed, if we got a warning about bubbles.


Vladimir Repcak

Posts 359
26 Jan 2020 23:33


Gunnar von Boehn wrote:

Samuel Devulder wrote:

    ScanlineLoop:
        fmove.l  fp3,d0 ; 1 cycle
        fmove.l  fp4,d1 ; 1 cycle
        fmove.l  fp5,d2 ; 1 cycle
        fmove.l  fp6,d3 ; 1 cycle
     
 

 
  For the 3D coordinate transformation I'm sure Float does make sense.
  For the Rasterline run I think Float has little or no advantage.
 
  Don't we agree that the Rasterline run can easily go in Integer?

Well, we're not talking easy, we're talking fast here :)

Before I was told that fadd actually takes 5 cycles, it had a great advantage, as there were just 3 ops in the scanline loop for the horizontal scanlines (and much less INT ops and registers).

FP is however utterly unusable for tight loops (3 ops - like in my scanline case), so it must be unrolled to some minimum degree (just like Samuel showed).


Vladimir Repcak

Posts 359
27 Jan 2020 01:54


Samuel Devulder wrote:


  movem.l d0-d3,(vidptr)+ ; 2 cycles on v4 core because of fusing mem ops

This is 14 cycles for 4 pixels without any bubbles in the loop. This is slower than the fixed pt version because fpu ops have a bigger latency than integer ops (and because they can only execute on P1 iirc).

Cool, this is actually a great example for an unrolled flatshading scanline loop.
Each Scanline can have any combination of 32-bit aligned / non-aligned Start and End address (which is handled separately).

That leaves a loop (though I had a separate jump solution on 6502 flatshader that might be great to reimplement here) of fused 128-bit writes (4 pixels) into 32-bit aligned addresses:

    ; d0-d3 have the same 32-bit value of the current scanline color (still 24-bit color space) and never change throughout the loop
    ; Can the following code run in 4*2 = 8 cycles (plus 1 cycle for dbra) with zero bubbles ?

Flatshading_Scanline_Loop:
  movem.l d0-d3,(vidptr)+  ; 2c: Pixel 0-3
  movem.l d0-d3,(vidptr)+  ; 2c: Pixel 4-7
  movem.l d0-d3,(vidptr)+  ; 2c: Pixel 8-11
  movem.l d0-d3,(vidptr)+  ; 2c: Pixel 12-15
dbra d4,Flatshading_Scanline_Loop

Or would it make sense to rather do two of these (copying color in d0 into d4-d7 prior to this, of course)?

  movem.l d0-d7,(vidptr)+  ; 4c: Pixel 0-7
  movem.l d0-d7,(vidptr)+  ; 4c: Pixel 8-15

Really, the only thing that changes here is the (a0) register...




Don Adan

Posts 38
27 Jan 2020 02:11



Hi Don,
 
  In theory a nice trick.
  But please mind that this code creates a ALU 2 EA bubble

  Ok, what about this, old version was perhaps buggy.

   


      moveq #0,D5
      swap D2
      swap D3

      loop_10_start:
        move.l (a3,d3.w*4),(a0)+
        add.l d2,d3
        addx.w D5,D3 ; addx.l d5,d3 can be used too, if it is faster
      dbra d1,loop_10_start
   
   


  And of course this code can be unrolled too.
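
The add.l / addx.w pair above works on a 16.16 fixed-point position whose halves have been swapped, so that the integer part sits in the low word (usable as d3.w in the index) and the fraction sits in the high word. The Python sketch below only models that arithmetic, to show that it produces the same texel indices as ordinary 16.16 stepping. The helper names are invented for illustration, and it assumes the integer word itself never overflows 16 bits (a carry out of the low word would leak into the fraction).

```python
MASK32 = 0xFFFFFFFF


def swap(value):
    """Model of 68k SWAP: exchange the two 16-bit halves of a 32-bit value."""
    return ((value << 16) | (value >> 16)) & MASK32


def step_swapped(d3, d2):
    """Model of: add.l d2,d3 followed by addx.w d5,d3 with d5 = 0."""
    total = d3 + d2
    x_flag = total >> 32             # carry out of bit 31, i.e. the fraction overflowed
    d3 = total & MASK32
    low = (d3 + x_flag) & 0xFFFF     # addx.w only touches the low (integer) word
    return (d3 & 0xFFFF0000) | low


def indices_swapped(start, delta, count):
    """Texel indices produced by the swapped-halves loop."""
    d2, d3 = swap(delta), swap(start)
    out = []
    for _ in range(count):
        out.append(d3 & 0xFFFF)      # d3.w is what the indexed move reads
        d3 = step_swapped(d3, d2)
    return out


def indices_plain(start, delta, count):
    """Reference: ordinary 16.16 stepping, index = upper 16 bits."""
    pos, out = start, []
    for _ in range(count):
        out.append(pos >> 16)
        pos = (pos + delta) & MASK32
    return out


# Step of 1.5 texels per pixel, starting at texel 0.
print(indices_swapped(0, 0x18000, 6))
```

The point of the swap is that the integer index is available as a plain word register with no shift inside the loop; the price is the extra addx to propagate the fraction's carry.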
 
 




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
27 Jan 2020 07:48


Don Adan wrote:

     

        moveq #0,D5
        swap D2
        swap D3
   
        loop_10_start:
          move.l (a3,d3.w*4),(a0)+
          add.l d2,d3
          addx.w D5,D3
        dbra d1,loop_10_start   
     

 

 
Regarding ADDX
ADDX looks like a nice trick always.
But there is a technical limitation with ADDX.
ADDX depends on the flags of the previous instruction.
As the FLAGS need time to be "created"
ADDX can never be the 2nd instruction in a super-scalar pair.
This means ADDX is always in the 1st pipe.
This technical reason makes ADDX often limit super scalability.
 
 
Regarding unrolling.
In real life, texture loops will often have short runs like 5-15 pixels
and they will have variable (random) length.
Unrolling is then not really possible.
Only for a very special testcase I can see unrolling work.
 

The cure for pipeline bubbles in the EA unit
is the trick of using the EA unit to increment the registers.

  ADDA.L D1,A1    -- This runs in EA unit without bubble
  move.b (A0,A1.b1),(A2)+
  DBRA D0,LOOP




Samuel Devulder

Posts 248
27 Jan 2020 10:05


Yes, 5/6 for fadd and fmul, but that is latency (I really need to check out Flype's benchmarking... [EDIT] here it is: fabs/fneg/ftst: 1 cycle, fadd/fsub/fmul/fcmp: 6 cycles, fdiv: 9 cycles, others: "a lot"; fmove fp,fp and fmove #imm,fp: 1 cycle, but I don't know about fmove fp,Dq).
 
But yeah, the pipeline executes 1 op per cycle, but it has 5 stages. That means you cannot directly use the result of an fpu op right after it has been issued. You must wait for the operation to go through the stages, each one requiring 1 cycle.
 
I mean: schedule your code and you'll get 1 fpu op per cycle with full power. Otherwise you'll add a 5-cycle latency in the case of a simple fmul and perform way slower than expected.
 
About scheduling: in my code above I was handling 4 pixels at a time, but I think it is possible to process fewer, because when I count cycles it seems that there are at least 10 cycles or so between an fpu op and the use of its result, where 5 cycles would have been enough. Maybe doing 2 pixels at a time would be enough (assuming "fmove Fp,Dq" can be done in 1 or 2 cycles).


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
27 Jan 2020 19:21


FPU instructions have 2 attributes
1) throughput
2) latency
 
Throughput of 1/cycle means that you can issue and finish 1 instruction per cycle.
This means that at 85 MHz you can reach 85 MFlops this way.
 
Latency means the time for the result to be "valid".
 
Good FPU code is parallel and not sequential
and waits the latency time before using the result
and therefore can reach the full throughput score.
 
My feeling is FPU is great for doing 3D transformation as this can be easily parallelized. For Rastercode I feel integer is a lot easier. Or even better using the "Rastaman"
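
The relation between the two attributes can be put into a one-line formula. The sketch below is an idealised model and an assumption on my part, not a statement about the actual core: if every op in a chain depends on the previous one, a chain can issue only one op per latency period, so N independent chains issued round-robin reach min(1, N / latency) ops per cycle. The 6-cycle latency is the fadd/fmul figure from the benchmark numbers quoted earlier in the thread; the function name is made up.

```python
def peak_mflops(clock_mhz, latency_cycles, independent_chains):
    """Idealised FPU rate when each chain is fully dependent on its own last result.

    One chain can issue one op every `latency_cycles`; several independent
    chains interleave, capped at the throughput limit of 1 op per cycle.
    """
    utilisation = min(1.0, independent_chains / latency_cycles)
    return clock_mhz * utilisation


for chains in (1, 2, 4, 6):
    print(chains, round(peak_mflops(85, 6, chains), 2))
```

Under these assumptions a single dependent chain gets roughly a sixth of the peak rate, and about six independent computations in flight are needed to reach the 85 MFlops figure, which is why transformation code (many independent vertices) fits better than a short rasterline loop.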


Vladimir Repcak

Posts 359
28 Jan 2020 00:34


Samuel Devulder wrote:

Yes, 5/6 for fadd and fmul, but that is latency (I really need to check out Flype's benchmarking... [EDIT] here it is: fabs/fneg/ftst: 1 cycle, fadd/fsub/fmul/fcmp: 6 cycles, fdiv: 9 cycles, others: "a lot"; fmove fp,fp and fmove #imm,fp: 1 cycle, but I don't know about fmove fp,Dq).
 
  But yeah, the pipeline executes 1 op per cycle, but it has 5 stages. That means you cannot directly use the result of an fpu op right after it has been issued. You must wait for the operation to go through the stages, each one requiring 1 cycle.
   
  I mean: schedule your code and you'll get 1 fpu op per cycle with full power. Otherwise you'll add a 5-cycle latency in the case of a simple fmul and perform way slower than expected.

Thanks, that explains it. And adjusts my expectations accordingly.

Basically, FPU is utterly unusable for short tight loops. You really need long and already slow code so you can hide the FPU latency behind it.

Or, you butcher the algorithm in a way that it works with "old values", hence the FPU latency can be hidden.
I've done that on Jaguar when I was hiding the Division latency, but it made the code utterly incomprehensible two weeks after I wrote it, resulting in really nasty and hard-to-debug bugs once I made adjustment to that code and stuff started to behave weird occasionally.

So, I'm pretty sure I don't want to go down that route. It's too expensive in the long term. If this was a day job, for a corporation, and I was salaried - sure - why not :)

Safer to just avoid FP altogether and let the integer pipes run at full steam.

Samuel Devulder wrote:

  About scheduling: in my code above I was handling 4 pixels at a time, but I think it is possible to process fewer, because when I count cycles it seems that there are at least 10 cycles or so between an fpu op and the use of its result, where 5 cycles would have been enough. Maybe doing 2 pixels at a time would be enough (assuming "fmove Fp,Dq" can be done in 1 or 2 cycles).

Yeah, I'm sure it *could* be done. But the only way to get there is to have a detailed table of all FP latencies, and every time you touch that code, you go through it again :)

Again, that's something that can be done for fun, over the weekend, if all you worry about is that one routine.

I have too much experience from the last two years, when inevitably some adjustments to such initial assumptions had to be made, for the sake of gameplay (or other factors that could not have been known at design time), rendering such code a complete throwaway.

I'm really glad I only burnt 3 days on floating point. It was fun while it lasted :)



Vladimir Repcak

Posts 359
28 Jan 2020 00:45


Gunnar von Boehn wrote:

FPU instructions have 2 attributes
  1) throughput
  2) latency
 
  Throughput of 1/cycle means that you can issue and finish 1 instruction per cycle.
  This means that at 85 MHz you can reach 85 MFlops this way.
 
  Latency means the time for the result to be "valid".
 
  Good FPU code is parallel and not sequential
  and waits the latency time before using the result
  and therefore can reach the full throughput score.
 
  My feeling is FPU is great for doing 3D transformation as this can be easily parallelized. For Rastercode I feel integer is a lot easier. Or even better using the "Rastaman"

So, how can you even theoretically achieve 85 MFLOPS ?

Can FP unit handle 5 parallel FP ops ?

Can you do the following and issue all five ops in 5 cycles ?
cycle 1: FDIV fp7,fp6  ; fp6 available in 9c
cycle 2: FADD fp0,fp1
cycle 3: FSUB fp2,fp3
cycle 4: FMUL fp4,fp5  ; fp5 available in 5c
cycle 5: FMOVE fp0,fp3
cycle 6-10: results from above ops get written to respective registers



Vladimir Repcak

Posts 359
28 Jan 2020 01:03


Yes, I finally found the thread which gave me the [incorrect] impression of 1 cycle per FP op:
EXTERNAL LINK 
Initially, I didn't notice the note about implied scheduling, but it is -indeed- down there, just a couple of replies below :)


