
Coding Example - SuperScalar

Philippe Flype
(Apollo Team Member)
Posts 299
10 Mar 2016 15:06


Just a short example of how a developer can speed up very critical routines using the superscalar feature of 68060+ CPUs.

This is not intended as a lesson for already-skilled developers (and I'm not one of them), but for beginners it can help to learn how instructions are executed when the CPU has 2 pipes. Let's look at a little routine from the MPEG player RiVA by Stephen Fellner.

The player loads data from disk, decodes frames and renders each frame in a given format (RGB16, 24, 32, YUV422). Here is the LoopX routine of the YUV conversion.


mpr_YUV422_loopy                    ; CLK  PIPE
        move.b  (a3)+,d5            ;  1    0  u
        swap    d5                  ;  2    0
        move.b  (a4)+,d5            ;  3    0  v
        move.w  (a2),d2             ;  4    0  ?? ?? y0 y1
        move.w  (a2,d1.w),d3        ;  5    0  ?? ?? y2 y3
        lsl.l   #8,d2               ;  5    1  ?? y0 y1 --
        lsr.w   #8,d2               ;  6    0  ?? y0 -- y1
        lsl.l   #8,d2               ;  7    0  y0 -- y1 --
        or.l    d5,d2               ;  8    0  y0 u  y1 v
        move.l  d2,(a1)             ;  9    0
        lsl.l   #8,d3               ;  9    1  ?? y2 y3 --
        lsr.w   #8,d3               ; 10    0  ?? y2 -- y3
        lsl.l   #8,d3               ; 11    0  y2 -- y3 --
        or.l    d5,d3               ; 12    0  y2 u  y3 v
        move.l  d3,(a1,d7.w)        ; 13    0
        addq.l  #$2,a2              ; 13    1  source y next
        addq.l  #$4,a1              ; 14    0  dest yuv422 next
        subq.l  #1,d6               ; 14    1
        bne.b   mpr_YUV422_loopy    ; 15    0  continue

In the comments I added the CPU clock (CLK) and the pipe in which each instruction executes (PIPE). The routine has 19 instructions and executes in 15 cycles. This gives an IPC (instructions per cycle) of about 1.27. An IPC above 1 means the 2nd pipe is used, but not used enough. This can be improved, which gains some FPS.
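To make the byte shuffling in the listing's comments easier to follow, here is a minimal C sketch of what one iteration does to a pair of Y bytes. It is only a model of the register contents shown in the comments: the name `pack_yuyv` and its parameters are made up for illustration, and it assumes the two unused bytes of D5 are zero when the loop starts.

```c
#include <stdint.h>

/* Hypothetical helper (not part of RiVA): models the LSL/LSR/LSL/OR
 * sequence from the listing on a 32-bit register.
 * y_pair holds two luma bytes as a big-endian word (y0 in the high byte),
 * the way MOVE.W (a2),d2 loads them. */
static uint32_t pack_yuyv(uint16_t y_pair, uint8_t u, uint8_t v)
{
    /* d5 after MOVE.B u / SWAP / MOVE.B v:  00 u 00 v
     * (assumption: the other two bytes of d5 were cleared before the loop) */
    uint32_t d5 = ((uint32_t)u << 16) | v;

    uint32_t d2 = y_pair;            /* ?? ?? y0 y1 (upper word modelled as 0) */
    d2 <<= 8;                        /* lsl.l #8 : ?? y0 y1 -- */
    d2 = (d2 & 0xFFFF0000u)          /* lsr.w #8 shifts only the low word:     */
       | ((d2 & 0x0000FFFFu) >> 8);  /*          : ?? y0 -- y1 */
    d2 <<= 8;                        /* lsl.l #8 : y0 -- y1 -- */
    return d2 | d5;                  /* or.l     : y0 u  y1 v  */
}
```

For example, `pack_yuyv(0x1122, 0xAA, 0xBB)` yields `0x11AA22BB`, which is the long word layout the routine stores to the destination.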


Philippe Flype
(Apollo Team Member)
Posts 299
10 Mar 2016 15:15


Now, how do we get a better IPC from the superscalar feature?

This can be achieved by reordering instructions so that there are no more dependencies between instruction A in the 1st pipe and instruction B in the 2nd pipe, while of course taking care not to break the logic itself.


mpr_YUV422_loopy                    ; CLK  PIPE
        move.w  (a2)+,d2            ;  1    0
        move.b  (a3)+,d5            ;  2    0
        lsl.l   #8,d2               ;  2    1
        move.w  (-2,a2,d1.w),d3     ;  3    0
        swap    d5                  ;  3    1
        lsl.l   #8,d3               ;  4    0
        move.b  (a4)+,d5            ;  4    1
        lsr.w   #8,d2               ;  5    0
        lsr.w   #8,d3               ;  5    1
        lsl.l   #8,d2               ;  6    0
        lsl.l   #8,d3               ;  6    1
        or.l    d5,d2               ;  7    0
        or.l    d5,d3               ;  7    1
        move.l  d2,(a1)+            ;  8    1
        move.l  d3,(-4,a1,d7.w)     ;  9    0
        subq.l  #1,d6               ;  9    1
        bne.b   mpr_YUV422_loopy    ; 10    0

As you can see, many more instructions now execute in the 2nd pipe (PIPE column).

The post-increment EA mode (An)+ is also used instead of ADDQ.l, which saves some cycles.

There are now a) fewer instructions and b) better use of the superscalar feature. IPC is now 1.7 (17 instructions / 10 cycles).
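The IPC figures are plain divisions, so they are easy to recompute for any loop you annotate the same way. A small C sketch follows; the function names are illustrative only.

```c
/* Instructions per cycle for one pass of a loop body. */
static double ipc(unsigned instructions, unsigned cycles)
{
    return (double)instructions / (double)cycles;
}

/* How much faster one pass gets when its cycle count drops. */
static double speedup(unsigned cycles_before, unsigned cycles_after)
{
    return (double)cycles_before / (double)cycles_after;
}
```

With the counts from the two listings, `ipc(19, 15)` is about 1.27, `ipc(17, 10)` is 1.7, and `speedup(15, 10)` is 1.5, i.e. one pass of the reordered loop takes two thirds of the original time.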


Philippe Flype
(Apollo Team Member)
Posts 299
10 Mar 2016 15:20


For instance, in our example it is better to interleave the LSL and LSR so that the CPU works on D2 in the 1st pipe and D3 in the 2nd pipe.

If 2 consecutive instructions use the same register, the 2nd instruction can't execute in the 2nd pipe because of 'dependencies'.

 


    lsl.l  #8,d2 ;  1    1st pipe
    lsr.w  #8,d2 ;  2    1st pipe
    lsl.l  #8,d3 ;  3    1st pipe
    lsr.w  #8,d3 ;  4    1st pipe

    ==> 4 cycles.
 

It is better to do something like this whenever possible:


    lsl.l  #8,d2 ;  1    1st pipe
    lsl.l  #8,d3 ;  1    2nd pipe
    lsr.w  #8,d2 ;  2    1st pipe
    lsr.w  #8,d3 ;  2    2nd pipe

    ==> 2 cycles.

This example gives only a small improvement when playing MPEG video. But applied in many other places it can improve the whole thing by maybe 5 FPS or even more, especially with hi-res video (640x480).

EDIT: A note on the first post: 'IPC' stands for 'Instructions Per Cycle' (clock), of course.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
19 Mar 2016 11:01


To convert the YUV, I could propose the following code:
 
 

 
  LEA Y0,A0     ; Line 0
  LEA Y1,A1     ; Line 1
  LEA U,A2      ; U plane
  LEA V,A3      ; V plane
  LEA Dest0,A4  ; Line 0
  LEA Dest1,A5  ; Line 1

  MOVE.Q (A2)+,D0  ; 8 bytes U
  MOVE.Q (A3)+,D1  ; 8 bytes V
  PERM  #$08192A3B,D0,D1,D2  ; Mixed UV, first 8 bytes
  PERM  #$4C5D6E7F,D0,D1,D3  ; Mixed UV, second 8 bytes

  MOVE.Q (A0)+,D0  ; 8 bytes Y
  MOVE.Q (A0)+,D1  ; 8 bytes Y

  PERM  #$08192A3B,D0,D2,D4  ; Mixed YUYV, first 8 bytes
  MOVE.Q D4,(A4)+
  PERM  #$4C5D6E7F,D0,D2,D4  ; Mixed YUYV, second 8 bytes
  MOVE.Q D4,(A4)+
  PERM  #$08192A3B,D1,D3,D4  ; Mixed YUYV, third 8 bytes
  MOVE.Q D4,(A4)+
  PERM  #$4C5D6E7F,D1,D3,D4  ; Mixed YUYV, fourth 8 bytes
  MOVE.Q D4,(A4)+

  MOVE.Q (A1)+,D0  ; 8 bytes Y, 2nd line
  MOVE.Q (A1)+,D1  ; 8 bytes Y, 2nd line

  PERM  #$08192A3B,D0,D2,D4  ; Mixed YUYV, first 8 bytes
  MOVE.Q D4,(A5)+
  PERM  #$4C5D6E7F,D0,D2,D4  ; Mixed YUYV, second 8 bytes
  MOVE.Q D4,(A5)+
  PERM  #$08192A3B,D1,D3,D4  ; Mixed YUYV, third 8 bytes
  MOVE.Q D4,(A5)+
  PERM  #$4C5D6E7F,D1,D3,D4  ; Mixed YUYV, fourth 8 bytes
  MOVE.Q D4,(A5)+
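For readers who want to check the selector constants, here is a small C model of the byte permute as the listing appears to use it. The semantics are an assumption inferred from the constants and comments in the post, not taken from official documentation, and `perm_model` is a made-up name.

```c
#include <stdint.h>

/* Assumed model of PERM #sel,a,b,dst (inferred from the listing only):
 * a and b form a 16-byte table, a = bytes 0..7, b = bytes 8..F, with
 * byte 0 being the most significant byte of a. Each selector nibble,
 * read left to right, picks one table byte for the result. */
static uint64_t perm_model(uint32_t sel, uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        unsigned idx  = (sel >> (28 - 4 * i)) & 0xFu;   /* i-th nibble from the left */
        uint64_t src  = (idx < 8) ? a : b;              /* 0..7 -> a, 8..F -> b      */
        uint64_t byte = (src >> (56 - 8 * (idx & 7u))) & 0xFFu;
        out = (out << 8) | byte;                        /* append as next result byte */
    }
    return out;
}
```

Under this model, with a = U0..U7 and b = V0..V7, selector `$08192A3B` produces U0 V0 U1 V1 U2 V2 U3 V3 and `$4C5D6E7F` produces U4 V4 U5 V5 U6 V6 U7 V7. Applying the same two selectors to 8 Y bytes and 8 mixed UV bytes then gives the Y U Y V order.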
 



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
19 Mar 2016 12:32


If I count correctly, then:

  The original routine copies
  14 bytes in 14 cycles
  = 1.0 byte / clock

  The SuperScalar version copies
  14 bytes in 9 cycles
  = about 1.56 bytes / clock

  The 64-bit version copies
  112 bytes in 24 cycles
  = about 4.67 bytes / clock
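One plausible way to arrive at these byte counts (the post does not spell it out, so treat this accounting as an assumption) is to add the bytes read and the bytes written per pass. The sketch below uses a made-up helper name.

```c
/* Bytes moved per clock, counting both loads and stores
 * (assumed accounting, not stated in the post). */
static double bytes_per_clock(unsigned bytes_read, unsigned bytes_written,
                              unsigned cycles)
{
    return (double)(bytes_read + bytes_written) / (double)cycles;
}
```

The 32-bit loops read 1 U + 1 V + 4 Y = 6 bytes and write 8 bytes, which gives the 14 bytes quoted: `bytes_per_clock(6, 8, 14)` is 1.0 and `bytes_per_clock(6, 8, 9)` is about 1.56. The 64-bit version reads 8 U + 8 V + 32 Y = 48 bytes and writes 64 bytes, 112 in total: `bytes_per_clock(48, 64, 24)` is about 4.67.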
 



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
19 Mar 2016 17:23


Now with the original registers:


mpr_YUV422_loopy
        ;  (a3)      == U
        ;  (a4)      == V
        ;  (a2)      == Y0
        ;  (a2,d1.w) == Y1
        ;  (a1)      == Dest0
        ;  (a1,d7.w) == Dest1

  MOVE.Q (A3)+,D0  ; 8 bytes U
  MOVE.Q (A4)+,D2  ; 8 bytes V
  PERM  #$08192A3B,D0,D2,D3  ; Mixed UV, first 8 bytes
  PERM  #$4C5D6E7F,D0,D2,D5  ; Mixed UV, second 8 bytes

  MOVE.Q (A2)+,D0  ; 8 bytes Y
  PERM  #$08192A3B,D0,D3,D2  ; Mixed YUYV, first 8 bytes
  MOVE.Q D2,(A1)+
  PERM  #$4C5D6E7F,D0,D3,D2  ; Mixed YUYV, second 8 bytes
  MOVE.Q D2,(A1)+

  MOVE.Q (A2)+,D0  ; 8 bytes Y
  PERM  #$08192A3B,D0,D5,D2  ; Mixed YUYV, third 8 bytes
  MOVE.Q D2,(A1)+
  PERM  #$4C5D6E7F,D0,D5,D2  ; Mixed YUYV, fourth 8 bytes
  MOVE.Q D2,(A1)+

  MOVE.Q (-16,A2,D1.w),D0  ; 8 bytes Y, 2nd line
  PERM  #$08192A3B,D0,D3,D2  ; Mixed YUYV, first 8 bytes
  MOVE.Q D2,(-32,A1,D7.w)
  PERM  #$4C5D6E7F,D0,D3,D2  ; Mixed YUYV, second 8 bytes
  MOVE.Q D2,(-24,A1,D7.w)

  MOVE.Q (-8,A2,D1.w),D0  ; 8 bytes Y, 2nd line
  PERM  #$08192A3B,D0,D5,D2  ; Mixed YUYV, third 8 bytes
  MOVE.Q D2,(-16,A1,D7.w)
  PERM  #$4C5D6E7F,D0,D5,D2  ; Mixed YUYV, fourth 8 bytes
  MOVE.Q D2,(-8,A1,D7.w)
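When experimenting with reordered or 64-bit variants like the ones in this thread, a plain byte-by-byte reference is handy for comparing output. The following C sketch is a hypothetical reference routine written for this purpose, not code from RiVA or the Apollo team. It assumes, as in the listings, that two Y lines share one row of U and V bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: converts two planar Y lines plus one shared
 * U row and V row into two packed Y U Y V lines.
 * 'pairs' is the number of 2-pixel groups per line. */
static void yuyv_two_lines(const uint8_t *y0, const uint8_t *y1,
                           const uint8_t *u,  const uint8_t *v,
                           uint8_t *dst0, uint8_t *dst1, size_t pairs)
{
    for (size_t i = 0; i < pairs; i++) {
        /* line 0: y0 u y1 v */
        dst0[4 * i + 0] = y0[2 * i];
        dst0[4 * i + 1] = u[i];
        dst0[4 * i + 2] = y0[2 * i + 1];
        dst0[4 * i + 3] = v[i];
        /* line 1: same u and v, luma taken from the second line */
        dst1[4 * i + 0] = y1[2 * i];
        dst1[4 * i + 1] = u[i];
        dst1[4 * i + 2] = y1[2 * i + 1];
        dst1[4 * i + 3] = v[i];
    }
}
```

One pass of the 32-bit loops corresponds to `pairs = 1`, and one pass of the 64-bit listing corresponds to `pairs = 8` (16 Y bytes per line, 8 U and 8 V bytes).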


