
Coding Example:

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
14 Feb 2016 17:14



 
  .loopX                          ; clk pipe
      clr.l   d2                  ;  1   0
      clr.l   d3                  ;  1   1
      clr.l   d4                  ;  2   0
      move.b  -1(a0),d2           ;  2   1
      move.b  1(a0),d3            ;  3   0
      move.b  FIREW(a0),d4        ;  4   0
      add.l   d3,d2               ;  4   1
      add.l   d4,d2               ;  5   0
      add.l   d4,d2               ;  6   0
      lsr.l   #2,d2               ;  7   0
      move.b  d2,(a0)+            ;  8   0
      dbf     d1,.loopX           ;  8   1  /  9
 

 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
14 Feb 2016 17:22



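          ; move.b only changes the low byte, so the clears are needed once
          clr.l   d3
          clr.l   d4
          clr.l   d2
          move.b  -1(a0),d2             ; left neighbour of the first pixel

          ; d2 carries the stored pixel over as -1(a0) of the next iteration
          .loopX                        ; clk pipe
            move.b  FIREW(a0),d4        ;  1   0

            move.b  1(a0),d3            ;  2   0
            add.l   d4,d2               ;  2   1

            add.l   d3,d2               ;  3   0
            add.l   d4,d2               ;  4   0

            lsr.l   #2,d2               ;  5   0
            move.b  d2,(a0)+            ;  6   0
            dbf     d1,.loopX           ;  6   1  / 7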
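
The second listing no longer reloads -1(a0) inside the loop: after the shift, d2 already holds the byte that was just stored, which is the left neighbour of the next pixel. A minimal Python sketch of both loops follows, assuming a flat byte buffer. The names loop_reload, loop_carry and the value of FIREW are hypothetical, chosen only for illustration. It checks that both orderings store the same values:

```python
import random

def loop_reload(buf, start, count, firew):
    """Model of the first listing: the left neighbour is reloaded every iteration."""
    out = list(buf)
    for i in range(start, start + count):
        d2 = out[i - 1]                  # move.b -1(a0),d2
        d3 = out[i + 1]                  # move.b 1(a0),d3
        d4 = out[i + firew]              # move.b FIREW(a0),d4
        d2 = (d2 + d3 + d4 + d4) >> 2    # three adds, then lsr.l #2,d2
        out[i] = d2                      # move.b d2,(a0)+ (result always fits a byte)
    return out

def loop_carry(buf, start, count, firew):
    """Model of the second listing: d2 is loaded once and carried across iterations."""
    out = list(buf)
    d2 = out[start - 1]                  # move.b -1(a0),d2 before the loop
    for i in range(start, start + count):
        d4 = out[i + firew]              # move.b FIREW(a0),d4
        d3 = out[i + 1]                  # move.b 1(a0),d3
        d2 = (d2 + d4 + d3 + d4) >> 2    # d2 still holds the pixel stored last
        out[i] = d2                      # move.b d2,(a0)+
    return out

random.seed(1)
FIREW = 16                               # hypothetical row width for the demo
pixels = [random.randrange(256) for _ in range(4 * FIREW)]
a = loop_reload(pixels, 1, FIREW - 2, FIREW)
b = loop_carry(pixels, 1, FIREW - 2, FIREW)
print(a == b)
```

The sum of four bytes shifted right by two never exceeds 255, which is why the upper bits of d2 stay clear without a clr.l inside the loop.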
       




Philippe Flype
(Apollo Team Member)
Posts 299
14 Feb 2016 19:01


Thanks Gunnar, I will try these optimisations.

By the way, the code posted here is part of a fire effect I coded on the Vampire; it is still compatible with all RTG Amigas.
 
  180 fps ;) See the video here:
 
  EXTERNAL LINK


Philippe Flype
(Apollo Team Member)
Posts 299
20 Feb 2016 22:06


Hi,

I modified the routine to make it look better; it is a little more greedy. How would you optimize the .loopX loop in the following routine, please?

;------------------------------------------------------------
FireDraw3:                      ; FireDraw()
;------------------------------------------------------------
    movem.l d0-d4/a0,-(sp)       ; Store registers
    move.l  _FirePower,d4        ; Get power
    move.l  _CgxBaseAddress,a0   ; Get screen
    move.w  #SCRH-1,d0           ; y < height-1
.loopY                           ; For y
    move.w  #SCRW-2,d1           ; x < width-2
    clr.w   d2                   ; Reset Pixel
.loopX                           ; For x
    clr.w   d3                   ; Reset Color
    move.b  SCRW*1-0(a0),d2      ; Pixel = Pixel(x+0,y+1)
    add.l   d2,d3                ; Color + Pixel
    move.b  SCRW*3-0(a0),d2      ; Pixel = Pixel(x+0,y+3)
    add.l   d2,d3                ; Color + Pixel
    move.b  SCRW*2-1(a0),d2      ; Pixel = Pixel(x-1,y+2)
    add.l   d2,d3                ; Color + Pixel
    move.b  SCRW*2+1(a0),d2      ; Pixel = Pixel(x+1,y+2)
    add.l   d2,d3                ; Color + Pixel
    add.l   d4,d3                ; Color + Power
    lsr.w   #2,d3                ; Color >> 2
    beq.s   .next                ; If Color > 0 {
    subq.b  #1,d3                ;   Color--
.next                            ; }
    move.b  d3,(a0)+             ; Pixel(x,y) = Color
    dbf     d1,.loopX            ; Next x
    addq.l  #1,a0                ; Screen++
    dbf     d0,.loopY            ; Next y
    movem.l (sp)+,d0-d4/a0       ; Restore registers
    rts                          ; Return
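
For checking optimized variants against the routine above, a small reference model of one pass through .loopX can help. This is only a sketch: the function name fire_pixel and its argument names are made up for illustration, and it assumes the four neighbours are plain 0..255 byte values.

```python
def fire_pixel(p_down1, p_down3, p_left2, p_right2, power):
    """One pass of .loopX for a single pixel, following the operand sizes used."""
    # Only the low word of d3 matters to lsr.w, so the sum is kept to 16 bits.
    d3 = (p_down1 + p_down3 + p_left2 + p_right2 + power) & 0xFFFF
    d3 >>= 2                      # lsr.w #2,d3
    if d3 != 0:                   # beq.s .next skips the decrement when Color is 0
        d3 = (d3 - 1) & 0xFF      # subq.b #1,d3 works on the low byte only
    return d3 & 0xFF              # move.b d3,(a0)+ stores the low byte

print(fire_pixel(10, 10, 10, 10, 4))
```

With four neighbours of 10 and a power of 4 the sum is 44, the shift gives 11, and the decrement leaves 10.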



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
21 Feb 2016 07:07



Hi Philippe,

Thanks for the interesting routine.
Let's look at it and see how the instruction decoder/scheduler
will put it on the Apollo pipes.


  .loopX                           CLK Pipe   ; For x
      clr.w   d3                    1   0     ; Reset Color
      move.b  SCRW*1-0(a0),d2       1   1     ; Pixel = Pixel(x+0,y+1)
      add.l   d2,d3                 2   0     ; Color + Pixel
      move.b  SCRW*3-0(a0),d2       2   1     ; Pixel = Pixel(x+0,y+3)
      add.l   d2,d3                 3   0     ; Color + Pixel
      move.b  SCRW*2-1(a0),d2       3   1     ; Pixel = Pixel(x-1,y+2)
      add.l   d2,d3                 4   0     ; Color + Pixel
      move.b  SCRW*2+1(a0),d2       4   1     ; Pixel = Pixel(x+1,y+2)
      add.l   d2,d3                 5   0     ; Color + Pixel
      add.l   d4,d3                 6   0     ; Color + Power
      lsr.w   #2,d3                 7   0     ; Color >> 2
      beq.s   .next                 8   0     ; If Color > 0 {
      subq.b  #1,d3                 8   1     ;   Color--
  .next
      move.b  d3,(a0)+              9   0     ; Pixel(x,y) = Color
      dbf     d1,.loopX             9   1 /10 ; Next x

I count 15 instructions, executed in 10 cycles.
So you reach an IPC of 1.5.
This is pretty good already.

The first half of the loop was already cleverly written,
so it is very easy for the CPU to execute 2 instructions per clock.

This block here


      add.l  d2,d3              5  0    ; Color + Pixel
      add.l  d4,d3              6  0    ; Color + Power
      lsr.w  #2,d3              7  0    ; Color >> 2

This block has instruction dependencies: each instruction needs the d3 result of the one before it, so the CPU cannot use its superscalar potential here.
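
The pairing in the annotated listing can be reproduced with a toy model. The sketch below is not the real Apollo issue logic. It assumes an in-order machine with two pipes, where the second slot is refused only when the instruction reads something the first slot writes, and it adds one clock for the loop branch to match the "/10" annotation. The register sets and the schedule function are written for this illustration only.

```python
from collections import Counter

LOAD = ({"a0"}, {"d2", "cc"})           # move.b x(a0),d2: reads a0, writes d2 and flags
ADD2 = ({"d2", "d3"}, {"d3", "cc"})     # add.l d2,d3

# (mnemonic, registers read, registers written); "cc" stands for the condition codes
LOOP = [
    ("clr.w d3", set(), {"d3", "cc"}),
    ("move.b SCRW*1-0(a0),d2", *LOAD),
    ("add.l d2,d3", *ADD2),
    ("move.b SCRW*3-0(a0),d2", *LOAD),
    ("add.l d2,d3", *ADD2),
    ("move.b SCRW*2-1(a0),d2", *LOAD),
    ("add.l d2,d3", *ADD2),
    ("move.b SCRW*2+1(a0),d2", *LOAD),
    ("add.l d2,d3", *ADD2),
    ("add.l d4,d3", {"d4", "d3"}, {"d3", "cc"}),
    ("lsr.w #2,d3", {"d3"}, {"d3", "cc"}),
    ("beq.s .next", {"cc"}, set()),
    ("subq.b #1,d3", {"d3"}, {"d3", "cc"}),
    ("move.b d3,(a0)+", {"d3", "a0"}, {"a0", "cc"}),
    ("dbf d1,.loopX", {"d1"}, {"d1"}),
]

def schedule(instrs, branch_clocks=1):
    """Greedy in-order dual issue; returns [(name, clk, pipe)] and the total clocks."""
    placed, clk, i = [], 0, 0
    while i < len(instrs):
        clk += 1
        name, _, writes = instrs[i]
        placed.append((name, clk, 0))
        i += 1
        # Second pipe: allowed unless it reads a result produced in this same clock.
        if i < len(instrs) and not (instrs[i][1] & writes):
            placed.append((instrs[i][0], clk, 1))
            i += 1
    return placed, clk + branch_clocks

placed, total = schedule(LOOP)
ipc = len(LOOP) / total
per_clk = Counter(clk for _, clk, _ in placed)
single_issue = sorted(c for c, n in per_clk.items() if n == 1)
for name, clk, pipe in placed:
    print(f"{name:26} {clk:2} {pipe}")
print(total, ipc, single_issue)
```

Under these assumptions the model lands on the same slots as the listing: clocks 5, 6 and 7 issue a single instruction each because of the d3 chain (and the flags needed by beq.s), and the loop takes 10 clocks for 15 instructions.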




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
21 Feb 2016 07:16


As we see, the code was already pretty well written for normal 68k code.

You can calculate 1 pixel in 10 clocks.
I think if you unroll it once to calculate two pixels per loop iteration,
you get more independent instructions, and a speed of
1 pixel in 8 clocks should be possible.

I think with normal old 68k instructions, this will be the max.

But it can be further tuned by using some of the new 64-bit instructions:
 


     
  .loopX                              CLK Pipe   ; For x (8 pixels per pass)
      move.q   SCRW*1-0(a0),d3         1   0     ; 8 pixels from (x+0,y+1)
      pavgu.b  SCRW*3-0(a0),d3         2   0     ; average with 8 pixels from (x+0,y+3)
      pavgu.b  SCRW*2-1(a0),d3         3   0     ; average with 8 pixels from (x-1,y+2)
      psubsu.b #$0101010101010101,d3   4         ; Color-- in each of the 8 bytes
      move.q   d3,(a0)+                5   0     ; Pixel(x..x+7,y) = Color
      dbf      d1,.loopX               5   1 /6  ; Next 8 pixels
 

 
As you see, the 64-bit code is even simpler. :)
Using 64-bit instructions you can calculate

8 pixels in 6 clocks.

This is a huge speedup, from 80 clocks to 6 clocks for 8 pixels.
How many hundred FPS will you get then?
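
To see what the 64-bit loop would compute, here is a byte-wise sketch in Python. It rests on assumptions the post does not spell out: that pavgu.b is a per-byte unsigned average, that psubsu.b is a per-byte unsigned subtract that saturates at 0, and that the average truncates (the real rounding may differ). The helper names pavg_u8, psubs_u8 and fire8 are hypothetical.

```python
def pavg_u8(a, b):
    """Per-byte unsigned average of two 8-byte rows (truncation is an assumption)."""
    return bytes((x + y) >> 1 for x, y in zip(a, b))

def psubs_u8(a, k):
    """Per-byte unsigned subtract that stops at 0 instead of wrapping around."""
    return bytes(max(x - k, 0) for x in a)

def fire8(row_y1, row_y3, row_y2_left):
    """Eight pixels at once, one step per instruction of the 64-bit loop."""
    d3 = bytes(row_y1)               # move.q   SCRW*1-0(a0),d3
    d3 = pavg_u8(row_y3, d3)         # pavgu.b  SCRW*3-0(a0),d3
    d3 = pavg_u8(row_y2_left, d3)    # pavgu.b  SCRW*2-1(a0),d3
    return psubs_u8(d3, 1)           # psubsu.b with $01 in every byte

row1 = bytes([0, 0, 4, 40, 80, 120, 200, 255])
row3 = bytes([0, 2, 4, 40, 80, 120, 200, 255])
row2 = bytes([0, 0, 4, 40, 80, 120, 200, 255])
print(list(fire8(row1, row3, row2)))
```

The saturating subtract replaces the beq.s/subq.b pair of the scalar loop: a byte that is already 0 stays 0. Note that two chained averages weight the last row read by one half and the first two by one quarter each, so the output is close to, but not identical with, the four-neighbour sum of the scalar routine.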



Philippe Flype
(Apollo Team Member)
Posts 299
22 Feb 2016 01:11


The current routine runs at 100 fps in 320x240x8, so 8 pixels in 6 clocks would mean about 1300 fps (100 x 80/6).

So maybe about 160 fps in 640x480x16 bits (1300/4/2 = 162).
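
The estimate above can be written out step by step. The sketch assumes the frame time is spent entirely in the inner loop and scales linearly with pixel count and bytes per pixel, which the thread does not establish; the variable names are only for illustration.

```python
scalar_clocks_per_pixel = 10        # annotated scalar loop: 1 pixel in 10 clocks
packed_clocks_per_8 = 6             # 64-bit loop: 8 pixels in 6 clocks

speedup = (8 * scalar_clocks_per_pixel) / packed_clocks_per_8   # 80 / 6

base_fps = 100                      # current routine at 320x240, 8 bits per pixel
fps_320 = base_fps * speedup        # about 1333, rounded to 1300 in the post

pixel_factor = (640 * 480) / (320 * 240)    # 4 times as many pixels
depth_factor = 16 / 8                       # 2 times as many bytes per pixel
fps_640 = 1300 / pixel_factor / depth_factor

print(round(speedup, 1), round(fps_320), fps_640)
```

This gives a speedup of roughly 13x and about 162 fps for 640x480x16, as an upper bound under the stated assumptions.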


Philippe Flype
(Apollo Team Member)
Posts 299
05 Mar 2016 12:19


With some modifications and some unrolling, I reached more than 200 fps, as you can see in this video:
   
  EXTERNAL LINK 
  EXTERNAL LINK   
 
Further tests I did bring it over 250 fps (better fusing).
 
 



Philippe Flype
(Apollo Team Member)
Posts 299
06 Mar 2016 02:30


Plasma effect:

EXTERNAL LINK
