Information about the Apollo CPU and FPU. |
|
---|
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 14 Feb 2016 17:14
| .loopX                          ; clk pipe
    clr.l   d2                    ;  1   0
    clr.l   d3                    ;  1   1
    clr.l   d4                    ;  2   0
    move.b  -1(a0),d2             ;  2   1
    move.b  1(a0),d3              ;  3   0
    move.b  FIREW(a0),d4          ;  4   0
    add.l   d3,d2                 ;  4   1
    add.l   d4,d2                 ;  5   0
    add.l   d4,d2                 ;  6   0
    lsr.l   #2,d2                 ;  7   0
    move.b  d2,(a0)+              ;  8   0
    dbf     d1,.loopX             ;  8   1 / 9
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 14 Feb 2016 17:22
|   clr.l   d3
    clr.l   d4
    clr.l   d2
    move.b  -1(a0),d2
.loopX                            ; clk pipe
    move.b  FIREW(a0),d4          ;  1   0
    move.b  1(a0),d3              ;  2   0
    add.l   d4,d2                 ;  2   1
    add.l   d3,d2                 ;  3   0
    add.l   d4,d2                 ;  4   0
    lsr.l   #2,d2                 ;  5   0
    move.b  d2,(a0)+              ;  6   0
    dbf     d1,.loopX             ;  6   1 / 7
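The 7-clock loop no longer reloads `-1(a0)` inside the loop. A minimal Python sketch of why that appears to be safe: the byte just stored by `move.b d2,(a0)+` is exactly the left neighbour of the next pixel, so `d2` can be carried over. The function names, the buffer layout and the test data are assumptions made for illustration only.

```python
def blur_reload(buf, start, count, firew):
    """Model of the 9-clock loop: the left neighbour is reloaded for every pixel."""
    for i in range(start, start + count):
        left = buf[i - 1]         # move.b -1(a0),d2
        right = buf[i + 1]        # move.b 1(a0),d3
        below = buf[i + firew]    # move.b FIREW(a0),d4
        buf[i] = (left + right + below + below) >> 2   # three adds, then lsr.l #2


def blur_carry(buf, start, count, firew):
    """Model of the 7-clock loop: d2 is loaded once and carried between iterations."""
    acc = buf[start - 1]          # move.b -1(a0),d2 before the loop
    for i in range(start, start + count):
        below = buf[i + firew]    # move.b FIREW(a0),d4
        right = buf[i + 1]        # move.b 1(a0),d3
        acc = (acc + below + right + below) >> 2
        buf[i] = acc              # stored byte becomes the next left neighbour


def make_test_buffer(firew=16, rows=3):
    """Hypothetical test pattern, not real fire data."""
    return bytearray((i * 37 + 11) % 256 for i in range(firew * rows))
```

Running both functions over copies of the same buffer gives identical results. The carried value always fits in a byte, because four terms of at most 255 shifted right by two cannot exceed 255.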
| |
| | Philippe Flype (Apollo Team Member) Posts 299 14 Feb 2016 19:01
| Thanks Gunnar, I will try these optimisations. By the way, the code posted here is part of a fire effect I coded on the Vampire, still compatible with all RTG Amigas. 180 fps ;) See the video here: EXTERNAL LINK
| |
| | Philippe Flype (Apollo Team Member) Posts 299 20 Feb 2016 22:06
| Hi, I modified the routine to make it look better; it is a little more greedy. How would you optimize the XLoop in the following routine, please?
;------------------------------------------------------------
FireDraw3:                              ; FireDraw()
;------------------------------------------------------------
    movem.l d0-d4/a0,-(sp)              ; Store registers
    move.l  _FirePower,d4               ; Get power
    move.l  _CgxBaseAddress,a0          ; Get screen
    move.w  #SCRH-1,d0                  ; y < height-1
.loopY                                  ; For y
    move.w  #SCRW-2,d1                  ; x < width-2
    clr.w   d2                          ; Reset Pixel
.loopX                                  ; For x
    clr.w   d3                          ; Reset Color
    move.b  SCRW*1-0(a0),d2             ; Pixel = Pixel(x+0,y+1)
    add.l   d2,d3                       ; Color + Pixel
    move.b  SCRW*3-0(a0),d2             ; Pixel = Pixel(x+0,y+3)
    add.l   d2,d3                       ; Color + Pixel
    move.b  SCRW*2-1(a0),d2             ; Pixel = Pixel(x-1,y+2)
    add.l   d2,d3                       ; Color + Pixel
    move.b  SCRW*2+1(a0),d2             ; Pixel = Pixel(x+1,y+2)
    add.l   d2,d3                       ; Color + Pixel
    add.l   d4,d3                       ; Color + Power
    lsr.w   #2,d3                       ; Color >> 2
    beq.s   .next                       ; If Color > 0 {
    subq.b  #1,d3                       ;   Color--
.next                                   ; }
    move.b  d3,(a0)+                    ; Pixel(x,y) = Color
    dbf     d1,.loopX                   ; Next x
    addq.l  #1,a0                       ; Screen++
    dbf     d0,.loopY                   ; Next y
    movem.l (sp)+,d0-d4/a0              ; Restore registers
    rts                                 ; Return
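When trying optimized variants of this routine, it can help to have a plain reference to compare the output against. Below is a minimal Python sketch of what `FireDraw3` seems to compute, read directly from the listing above. The names `fire_pixel` and `fire_draw` are hypothetical, and the model assumes a linear 8-bit buffer with at least three rows below the area being processed.

```python
def fire_pixel(below1, below3, left2, right2, power):
    """One pixel of the inner loop, using the registers' low word and byte only."""
    color = (below1 + below3 + left2 + right2 + power) & 0xFFFF  # low word seen by lsr.w
    color >>= 2                                     # lsr.w #2,d3
    if color != 0:                                  # beq.s .next skips the decrement
        color = (color & 0xFF00) | ((color - 1) & 0xFF)   # subq.b #1,d3 (byte only)
    return color & 0xFF                             # move.b d3,(a0)+ stores the low byte


def fire_draw(buf, width, height, power):
    """In-place pass over `height` rows; the last column of each row is skipped."""
    if len(buf) < (height + 3) * width:
        raise ValueError("buffer must hold three extra rows below the processed area")
    pos = 0
    for _y in range(height):                        # dbf d0 runs SCRH times
        for _x in range(width - 1):                 # dbf d1 runs SCRW-1 times
            buf[pos] = fire_pixel(
                buf[pos + width],                   # Pixel(x+0, y+1)
                buf[pos + 3 * width],               # Pixel(x+0, y+3)
                buf[pos + 2 * width - 1],           # Pixel(x-1, y+2)
                buf[pos + 2 * width + 1],           # Pixel(x+1, y+2)
                power,
            )
            pos += 1
        pos += 1                                    # addq.l #1,a0
```

The upper words of `d2`, `d3` and `d4` are ignored in the model because carries from `add.l` only travel upwards, so they cannot change the low word that `lsr.w` and `move.b` use.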
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 21 Feb 2016 07:07
| Hi Philippe, thanks for the interesting routine. Let's look at it and see how the instruction decoder/scheduler will put it on the Apollo pipes.
.loopX                          CLK Pipe      ; For x
    clr.w   d3                   1   0        ; Reset Color
    move.b  SCRW*1-0(a0),d2      1   1        ; Pixel = Pixel(x+0,y+1)
    add.l   d2,d3                2   0        ; Color + Pixel
    move.b  SCRW*3-0(a0),d2      2   1        ; Pixel = Pixel(x+0,y+3)
    add.l   d2,d3                3   0        ; Color + Pixel
    move.b  SCRW*2-1(a0),d2      3   1        ; Pixel = Pixel(x-1,y+2)
    add.l   d2,d3                4   0        ; Color + Pixel
    move.b  SCRW*2+1(a0),d2      4   1        ; Pixel = Pixel(x+1,y+2)
    add.l   d2,d3                5   0        ; Color + Pixel
    add.l   d4,d3                6   0        ; Color + Power
    lsr.w   #2,d3                7   0        ; Color >> 2
    beq.s   .next                8   0        ; If Color > 0 {
    subq.b  #1,d3                8   1        ;   Color--
.next
    move.b  d3,(a0)+             9   0        ; Pixel(x,y) = Color
    dbf     d1,.loopX            9   1 /10    ; Next x
I count 15 instructions, executed in 10 cycles, so you reach an IPC of 1.5. This is pretty good already. The first half of the loop was already cleverly written, so it is very easy for the CPU to execute 2 instructions per clock. This block here
    add.l   d2,d3                5   0        ; Color + Pixel
    add.l   d4,d3                6   0        ; Color + Power
    lsr.w   #2,d3                7   0        ; Color >> 2
has instruction dependencies: each instruction needs the d3 result of the one before it, so the CPU cannot use its superscalar potential here.
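The CLK/Pipe columns can be reproduced with a very small toy model. The Python sketch below is an illustration only, not the real Apollo decoder. It assumes in-order issue on two pipes, where the second instruction may pair with the first unless it reads a register (or the condition codes) that the first one writes, or both touch memory. These rules are inferred from the annotated listings in this thread, branch outcomes are ignored, and the extra clock after `dbf` is simply copied from the "/10" annotation.

```python
def load(ea):
    """move.b <ea>,d2: reads a0, writes d2 and the condition codes, touches memory."""
    return (f"move.b {ea},d2", {"a0"}, {"d2", "ccr"}, True)


def add(src):
    """add.l <src>,d3: reads src and d3, writes d3 and the condition codes."""
    return (f"add.l {src},d3", {src, "d3"}, {"d3", "ccr"}, False)


# (text, registers read, registers written, touches memory)
FIRE_LOOP = [
    ("clr.w d3", set(), {"d3", "ccr"}, False),
    load("SCRW*1-0(a0)"), add("d2"),
    load("SCRW*3-0(a0)"), add("d2"),
    load("SCRW*2-1(a0)"), add("d2"),
    load("SCRW*2+1(a0)"), add("d2"),
    add("d4"),
    ("lsr.w #2,d3", {"d3"}, {"d3", "ccr"}, False),
    ("beq.s .next", {"ccr"}, set(), False),
    ("subq.b #1,d3", {"d3"}, {"d3", "ccr"}, False),
    ("move.b d3,(a0)+", {"d3", "a0"}, {"a0", "ccr"}, True),
    ("dbf d1,.loopX", {"d1"}, {"d1"}, False),
]


def schedule(instrs):
    """Return (clk, pipe, text) for each instruction under the toy pairing rules."""
    slots, clk, i = [], 0, 0
    while i < len(instrs):
        clk += 1
        first = instrs[i]
        slots.append((clk, 0, first[0]))
        i += 1
        if i < len(instrs):
            second = instrs[i]
            read_after_write = bool(first[2] & second[1])
            both_use_memory = first[3] and second[3]
            if not read_after_write and not both_use_memory:
                slots.append((clk, 1, second[0]))
                i += 1
    return slots


def ipc(instrs, loop_overhead=1):
    """Instructions per clock, with the extra loop clock taken from the annotation."""
    cycles = schedule(instrs)[-1][0] + loop_overhead
    return len(instrs) / cycles
```

With this model the three dependent instructions at clocks 5 to 7 each occupy a clock alone, which is where the loop loses its pairing.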
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 21 Feb 2016 07:16
| As we see, the code was already pretty well written for normal 68k code. You can calculate 1 pixel in 10 clocks. I think if you unroll it once to calculate two pixels per loop iteration, you get more independent instructions, and reaching a speed of 1 pixel in 8 clocks should be possible. I think with normal old 68k instructions this will be the maximum. But it can be further tuned by using some of the new 64-bit instructions:
.loopX                               CLK Pipe   ; For x
    move.q   SCRW*1-0(a0),d3           1   0      ; Pixel = Pixel(x+0,y+1)
    pavgu.b  SCRW*3-0(a0),d3           2   0      ; Pixel = Pixel(x+0,y+3)
    pavgu.b  SCRW*2-1(a0),d3           3   0      ; Pixel = Pixel(x-1,y+2)
    psubsu.b #$0101010101010101,d3     4          ;
    move.q   d3,(a0)+                  5   0      ; Pixel(x,y) = Color
    dbf      d1,.loopX                 5   1 /6   ; Next x
As you see, the 64-bit code is even simpler. :) Using 64-bit instructions you can calculate 8 pixels in 6 clocks.
This is a huge speedup, from 80 clocks to 6 clocks for 8 pixels. How many hundred FPS will you then get?
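To make the packed operations concrete, here is a minimal Python sketch that models a 64-bit register as 8 unsigned byte lanes. It is an illustration under stated assumptions, not the documented behaviour of the Apollo instructions: the function names mirror the mnemonics above, and since the thread does not say whether `pavgu.b` rounds up, that choice is exposed as the hypothetical `round_up` parameter.

```python
LANES = 8  # eight 8-bit pixels per 64-bit register


def unpack(q):
    """Split a 64-bit integer into 8 byte lanes, lowest lane first."""
    return [(q >> (8 * i)) & 0xFF for i in range(LANES)]


def pack(lanes):
    """Join 8 byte lanes back into one 64-bit integer."""
    q = 0
    for i, value in enumerate(lanes):
        q |= (value & 0xFF) << (8 * i)
    return q


def pavgu_b(a, b, round_up=True):
    """Lane-wise unsigned average; the rounding behaviour is an assumption."""
    bias = 1 if round_up else 0
    return pack([(x + y + bias) >> 1 for x, y in zip(unpack(a), unpack(b))])


def psubsu_b(minuend, subtrahend):
    """Lane-wise unsigned subtract that saturates at 0 instead of wrapping."""
    return pack([max(0, x - y) for x, y in zip(unpack(minuend), unpack(subtrahend))])
```

The saturating subtract is what replaces the `beq.s`/`subq.b` pair of the scalar loop: a lane that is already 0 stays 0. Note that chaining two averages weights the last source twice as heavily as the first two, so the 64-bit loop is not bit-identical to the scalar routine.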
| |
| | Philippe Flype (Apollo Team Member) Posts 299 22 Feb 2016 01:11
| The current routine runs at 100 fps in 320x240x8, which means about 1300 fps at 8 pixels in 6 clocks. So maybe about 160 fps in 640x480x16 bits (1300/4/2 = 162).
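The estimate can be written out as a small Python sketch. It assumes, as the post does, that the frame rate scales linearly with clocks per pixel, with pixel count and with bytes per pixel; memory bandwidth and other limits are ignored. The function names are hypothetical.

```python
def projected_fps(current_fps, old_clocks, new_clocks):
    """Scale a frame rate by the ratio of clocks needed for the same work."""
    return current_fps * old_clocks / new_clocks


def scale_fps(fps, pixel_factor, bytes_factor):
    """Scale a frame rate down for more pixels and more bytes per pixel."""
    return fps / pixel_factor / bytes_factor


# 80 clocks -> 6 clocks for 8 pixels, starting from 100 fps in 320x240x8
fast_fps = projected_fps(100, 80, 6)

# 640x480 has 4 times the pixels, 16-bit has 2 times the bytes per pixel
hires_fps = scale_fps(1300, 4, 2)
```

The unrounded projection is a little above 1300 fps, so the 160 fps figure for 640x480x16 is on the cautious side.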
| |
| | Philippe Flype (Apollo Team Member) Posts 299 05 Mar 2016 12:19
| With some modifications and some unrolling I reached more than 200 fps, as you can see in this video: EXTERNAL LINK EXTERNAL LINK Further tests I did bring it over 250 fps (better fusing).
| |
| | Philippe Flype (Apollo Team Member) Posts 299 06 Mar 2016 02:30
| Plasma effect : EXTERNAL LINK
| |