TABLE OF CONTENTS AMMX/--------------Introduction---------------- AMMX/--Instruction_Words_and_Addressing_Modes-- AMMX/-------Detecting_AMMX_in_AmigaOS---------- AMMX/--------Enabling_AMMX_in_AmigaOS---------- AMMX/BFLYW AMMX/BSEL AMMX/C2P AMMX/LOAD AMMX/LOADi AMMX/PACKUSWB AMMX/PACK3216 AMMX/PADDB AMMX/PADDUSB AMMX/PADDUSW AMMX/PADDW AMMX/PAND_POR_PEOR_PANDN AMMX/PAVGB AMMX/PCMPccB AMMX/PCMPccW AMMX/PMAXxB AMMX/PMAXxW AMMX/PMINxB AMMX/PMINxW AMMX/PMUL88 AMMX/PMULA AMMX/PMULH AMMX/PMULL AMMX/PSUBB AMMX/PSUBUSB AMMX/PSUBUSW AMMX/PSUBW AMMX/STORE AMMX/STOREC AMMX/STOREi AMMX/STOREilm AMMX/STOREm AMMX/TRANSHI AMMX/TRANSLO AMMX/UNPACK1632 AMMX/VPERM AMMX/__MISC__ AMMX/__MISC__ ____ ____ .__ \ \ / /____ _____ ______ |__|______ ____ \ Y /\__ \ / \\____ \| \_ __ \_/ __ \ \ / / __ \| Y Y \ |_> > || | \/\ ___/ \___/ (____ /__|_| / __/|__||__| \___ > \/ \/|__| \/ Conversion from AutoDoc format to AmigaGuide (keeping the backspaces intact): cat AMMX_doc.txt | sed 's/\\/\\\\/g' >ram:AMMX_bspace.doc ad2ag ram:AMMX_bspace.doc Undocumented Instructions (for now): ------------------------------ BANK - register bank switching ------------------------------ BANK SrcA,SrcB,Size ; BANK SrcA,SrcB,Dest,Size ; ; "Size" is the length of the whole bundle = opcode length + bank_length (2) ; Size = %00 : 4 bytes ; Size = %01 : 6 bytes ; Size = %10 : 8 bytes ; Size = %11 : 10 bytes ; currently valid banks: ; 2 Address register banks (An,Bn) ; 4 Data register banks (D0-D7,E0-E7,E8-E15,E16-E23) BANK MACRO ; ----CCC-DDCCAABB AA BB DD dc.w (%0111000100000000+((\1)*%100)+(\2)+((\3)*%1000000)) ENDM short examples: BANK 0,0,%10 ; dc.w $7180 ; examplary/redundant: ; bank 0 is default lea NUMBERS,a5 ; Load in A5 BANK 0,1,%10 ; dc.w $7181 lea NUMBERS,a5 ; Load in B5 moveq #0,d0 moveq #0,d1 BANK 0,0,%00 ; dc.w $7100 ; select register bank 0 ; (redundant, example only) add.l (a5)+,d0 ; BANK 1,0,%00 ; select Address register Bank 1 add.l (a5)+,d1 ; add.l (b5)+.d1 ; D0 should be = D1 = 1 rts numbers: dc.l 
1,2

--------
MINITERM
--------
Full documentation is in TBD status. Miniterm replicates the Amiga Blitter
miniterm calculations. Register usage restrictions are like those of
TRANSHI/-LO. Inputs are a consecutive set of four registers (D0-D3, D4-D7,
E0-E3, etc.). Inputs are assigned as Channels A, B, C, Miniterm. The output
register will carry the result.

--------
- LSLQ -
--------
mnemonic:   lslq <vea>,b,d
short:      64 Bit shift left
equivalent C code
    uint64_t a,b,d;
    d = b<<a;

--------
- LSRQ -
--------
mnemonic:   lsrq <vea>,b,d
short:      64 Bit shift right
equivalent C code
    uint64_t a,b,d;
    d = b>>a;

LSRQ is a 64 Bit right shift operation. The shift constant in input a is
handled as modulo 64. While this operation shares the AMMX encoding style,
it is a full 64 Bit scalar operation.

-----
BFLYB
-----
Obsolete

-------------------------------------------------------------------------------
AMMX/PSUBW

mnemonic:   psubw <vea>,b,d

short:      vector subtract short

graphic:
    input a
    -----------------------------------------------------
    |     a0     |     a1     |     a2     |     a3     |
    -----------------------------------------------------
          |            |            |            |
    input b
    -----------------------------------------------------
    |     b0     |     b1     |     b2     |     b3     |
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |   b0-a0    |   b1-a1    |   b2-a2    |   b3-a3    |
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |     d0     |     d1     |     d2     |     d3     |
    -----------------------------------------------------

equivalent C Code:
    int i;
    short a[4];
    short b[4];
    short d[4];
    for( i = 0 ; i<4 ; i++ )
    {
        d[i] = b[i] - a[i];
    }

typical application cases:
    PSUBW is a plain and simple vectorized 16 Bit subtraction operation
    that performs four independent sub.w operations in one shot.
------------------------------------------------------------------------------- AMMX/PADDW mnemonic: paddw ,b,d short: vector add short graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | ----------------------------------------- | | | | | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | ----------------------------------------- | | | | | | | \_____________________ | \______ \______________ \ | \ \ \ ----------------- ----------------- ----------------- ----------------- | b0+a0 | | b1+a1 | | b2+a2 | | b3+a3 | ----------------- ----------------- ----------------- ----------------- | ____/ ____________/ / | / / ____________________/ | / / / ----------------------------------------- | d0 | d1 | d2 | d3 | ----------------------------------------- equivalent C Code: int i; short a[4]; short b[4]; short d[4]; for( i = 0 ; i<4 ; i++ ) { d[i] = a[i] + b[i]; } typical application cases: PADDW is a plain and simple vectorized 16 Bit addition operation that performs four independent add.w operations in one shot. 
-------------------------------------------------------------------------------
AMMX/PSUBUSW

mnemonic:   psubusw <vea>,b,d

short:      vector subtract unsigned short with saturation

graphic:
    input a
    -----------------------------------------------------
    |     a0     |     a1     |     a2     |     a3     |
    -----------------------------------------------------
          |            |            |            |
    input b
    -----------------------------------------------------
    |     b0     |     b1     |     b2     |     b3     |
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |max(0,b0-a0)|max(0,b1-a1)|max(0,b2-a2)|max(0,b3-a3)|
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |     d0     |     d1     |     d2     |     d3     |
    -----------------------------------------------------

equivalent C Code:
    int i;
    unsigned short a[4];
    unsigned short b[4];
    unsigned short d[4];
    for( i = 0 ; i<4 ; i++ )
    {
        d[i] = max( 0, b[i] - a[i] );
    }

typical application cases:
    PSUBUSW might come in handy for pixel manipulation with accuracy
    demands beyond 8 Bit per gun. Subtraction results below 0 are
    implicitly saturated to 0, leaving the complete 16 Bit range available
    plus the ability to subtract dithering offsets, for example.
------------------------------------------------------------------------------- AMMX/PADDUSW mnemonic: paddusw ,b,d short: vector add unsigned short with saturation graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | ----------------------------------------- | | | | | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | ----------------------------------------- | | | | | | | \_____________________ | \______ \______________ \ | \ \ \ ----------------- ----------------- ----------------- ----------------- |min($ffff,b0+a0| |min($ffff,b1+a1| |min($ffff,b2+a2| |min($ffff,b3+a3| ----------------- ----------------- ----------------- ----------------- | ____/ ____________/ / | / / ____________________/ | / / / ----------------------------------------- | d0 | d1 | d2 | d3 | ----------------------------------------- equivalent C Code: int i; unsigned short a[4]; unsigned short b[4]; unsigned short d[4]; for( i = 0 ; i<4 ; i++ ) { d[i] = min(0xffff, a[i] + b[i] ); } typical application cases: PADDUSW might come in handy for pixel manipulation with accuracy demands beyond 8 Bit per gun. Addition results are implicitly saturated, leaving the complete 16 Bit range available plus the ability to add dithering offsets, for example. ------------------------------------------------------------------------------- AMMX/PSUBB mnemonic: psubb ,b,d short: vector subtract bytes graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- | | ... | | | | | | | | | | | | \ | \_____________ ... \ \ \ \_____ | \ \ ----------- ----------- ----------- | b0-a0 | | b1-a1 | ... | b7-a7 | ----------- ----------- ----------- | _____________/ ... 
______/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; char a[8]; /* signed inputs */ char b[8]; char d[8]; for( i = 0 ; i<8 ; i++ ) { d[i] = b[i] - a[i]; } typical application cases: Whenever a byte subtraction is needed and overflows either won't happen or are expected, this is the vector variant of sub.b. ------------------------------------------------------------------------------- AMMX/PADDB mnemonic: paddb ,b,d short: vector add bytes graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- | | ... | | | | | | | | | | | | \ | \_____________ ... \ \ \ \_____ | \ \ ---------- ---------- ---------- | a0+b0 | | a1+b1 | ... | a7+b7 | ---------- ---------- ---------- | _____________/ ... ______/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; char a[8]; /* signed inputs */ char b[8]; char d[8]; for( i = 0 ; i<8 ; i++ ) { d[i] = a[i] + b[i]; } typical application cases: Whenever a byte addition is needed and overflows either won't happen or are expected, this is the vector variant of add.b. ------------------------------------------------------------------------------- AMMX/PSUBUSB mnemonic: psubusb ,b,d short: vector subtract unsigned bytes with unsigned saturation graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- | | ... | | | | | | | | | | | | \ | \_____________ ... 
\ \ \ \_____ | \ \ ------------------ ------------------ ------------------ | max(0,(b0-a0)) | | max(0,(b1-a1)) | | max(0,(b7-a7)) | ------------------ ------------------ ------------------ | _____________/ ... ______/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char a[8]; unsigned char b[8]; unsigned char d[8]; for( i = 0 ; i<8 ; i++ ) { d[i] = max( 0, ( b[i] - a[i] )); } typical application cases: Psubusb is useful when dealing with 8 Bit pixel data. All subtractions are guaranteed to be kept in the 8 Bit range. Subtraction results <0 are saturated to 0. ------------------------------------------------------------------------------- AMMX/PADDUSB mnemonic: paddusb ,b,d short: vector add unsigned bytes with unsigned saturation graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- | | ... | | | | | | | | | | | | \ | \_____________ ... \ \ \ \_____ | \ \ ------------------ ------------------ ------------------ |min(255,(a0+b0))| |min(255,(a1+b1))| ... |min(255,(a7+b7))| ------------------ ------------------ ------------------ | _____________/ ... ______/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char a[8]; unsigned char b[8]; unsigned char d[8]; for( i = 0 ; i<8 ; i++ ) { d[i] = min( 255, ( a[i] + b[i] )); } typical application cases: Paddusb is useful when dealing with 8 Bit pixel data. All additions are guaranteed to be kept in the 8 Bit range. Addition results >255 are saturated to 255. 
------------------------------------------------------------------------------- AMMX/PMULA mnemonic: pmula ,b,d short: vector multiply and add unsigned bytes graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- | | ... | | | | input/output d ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- | | | | | | | | \ | \_____________ ... \ \ \ \ | \ \ ----------------- ----------------- ----------------- |((b0*a0)>>8)+d0| |((b1*a1)>>8)+d1| |((b7*a7)>>8)+d7| ----------------- ----------------- ----------------- | _____________/ ... _/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char a[8]; unsigned char b[8]; unsigned char d[8]; for( i = 0 ; i<8 ; i++ ) { d[i] = ( ( a[i] * b[i] ) >> 8 ) + d[i]; } typical application cases: Alpha Blending is the most prominent use for this instruction. In case of premultiplied Alpha, a single PMULA instruction is sufficient for the task. Just preload 255-Alpha in the relevant byte slots of b, destination content in a and RGB (premultiplied with Alpha) into d. 
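The premultiplied Alpha recipe above can be sketched in plain C. This is only an illustration under assumptions: the helper names pmula_emul and blend_premultiplied are hypothetical, the lane arithmetic follows the equivalent C code of PMULA (results wrap modulo 256), and two ARGB pixels are assumed per 64 Bit register with one Alpha value per pixel.

```c
#include <stdint.h>

/* Hypothetical emulation of one PMULA over 8 byte lanes, following the
   equivalent C code of the AutoDoc entry: d = ((a*b)>>8) + d.
   Wrap-around on overflow is an assumption taken from that C code. */
void pmula_emul(const uint8_t a[8], const uint8_t b[8], uint8_t d[8])
{
    int i;
    for (i = 0; i < 8; i++)
        d[i] = (uint8_t)(((a[i] * b[i]) >> 8) + d[i]);
}

/* Blend two premultiplied source pixels onto two destination pixels.
   dst        : destination content            (operand a)
   src_premul : RGB premultiplied with Alpha    (operand d, input/output)
   alpha      : one Alpha value per pixel; 255-Alpha goes into operand b */
void blend_premultiplied(const uint8_t dst[8], const uint8_t src_premul[8],
                         const uint8_t alpha[2], uint8_t out[8])
{
    uint8_t inv[8];
    int i;
    for (i = 0; i < 8; i++) {
        inv[i] = (uint8_t)(255 - alpha[i / 4]);  /* b = 255-Alpha    */
        out[i] = src_premul[i];                  /* d = premult. RGB */
    }
    pmula_emul(dst, inv, out);                   /* one PMULA        */
}
```

With dst = 200, src_premul = 50 and Alpha = 128 in a lane, the sketch yields ((200*127)>>8)+50 = 149; with Alpha = 255 the source value passes through unchanged.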
-------------------------------------------------------------------------------
AMMX/PMULL

mnemonic:   pmull <vea>,b,d

short:      vector multiply short and keep lower 16 Bit

graphic:
    input a
    -----------------------------------------------------
    |     a0     |     a1     |     a2     |     a3     |
    -----------------------------------------------------
          |            |            |            |
    input b
    -----------------------------------------------------
    |     b0     |     b1     |     b2     |     b3     |
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |(b0*a0)&ffff|(b1*a1)&ffff|(b2*a2)&ffff|(b3*a3)&ffff|
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |     d0     |     d1     |     d2     |     d3     |
    -----------------------------------------------------

equivalent C Code:
    int i;
    short a[4]; /* signed inputs */
    short b[4];
    short d[4];
    for( i = 0 ; i<4 ; i++ )
    {
        d[i] = ( a[i] * b[i] ) & 0xffff; /* the AND is here for    */
                                         /* clarification only...  */
    }

typical application cases:
    PMULL mostly applies to cases where the result is known to fit in
    16 Bit. Usually, it is expected that PMULL is used in tandem with
    PMULH to obtain a 16x16 = 32 Bit result.

    Another application for PMULL is to serve as a left shift operator.
    Being a multiplication instruction, it expects 2^(shift value) as
    operand instead of the number of bits.

examples:
    pmull.w #54,E0,E2       ; E2 = E0*54
    pmull.w #1024,E0,E3     ; E3 = E0<<10

    ;16*16 = 32 Bit multiplication example
    pmull.w E4,E5,E6                ;low part  l0 l1 l2 l3
    pmulh.w E4,E5,E7                ;high part h0 h1 h2 h3
    vperm   #$018923ab,E7,E6,E8     ; h0 l0 h1 l1
    vperm   #$45cd67ef,E7,E6,E9     ; h2 l2 h3 l3

-------------------------------------------------------------------------------
AMMX/PMULH

mnemonic:   pmulh <vea>,b,d

short:      vector multiply short and keep upper 16 Bit

graphic:
    input a
    -----------------------------------------------------
    |     a0     |     a1     |     a2     |     a3     |
    -----------------------------------------------------
          |            |            |            |
    input b
    -----------------------------------------------------
    |     b0     |     b1     |     b2     |     b3     |
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |(b0*a0)>>16 |(b1*a1)>>16 |(b2*a2)>>16 |(b3*a3)>>16 |
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |     d0     |     d1     |     d2     |     d3     |
    -----------------------------------------------------

equivalent C Code:
    int i;
    short a[4]; /* signed inputs */
    short b[4];
    short d[4];
    for( i = 0 ; i<4 ; i++ )
    {
        d[i] = ( a[i] * b[i] ) >> 16;
    }

typical application cases:
    PMULH is useful on its own in cases where one of the operands is a
    fixed-point fractional variable with an absolute value <1. In such
    cases, the usual renormalizing shift is not necessary. Application
    cases like trigonometric transforms and 3D matrix transforms come to
    mind.

    PMULH teams up with PMULL when 16x16 = 32 Bit results are of interest.

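How the PMULL and PMULH halves fit together can be sketched per lane in C. The function names below are hypothetical; the sketch works on bit patterns (uint16_t/uint32_t) to stay clear of implementation-defined signed shifts, which is a choice made for this example and not a statement about the hardware.

```c
#include <stdint.h>

/* Low 16 Bit of the signed 16x16 product (one PMULL lane, as bit pattern). */
uint16_t pmull_lane(int16_t a, int16_t b)
{
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(p & 0xffffu);
}

/* Upper 16 Bit of the signed 16x16 product (one PMULH lane, as bit pattern). */
uint16_t pmulh_lane(int16_t a, int16_t b)
{
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(p >> 16);
}

/* Interleave high and low half, as the vperm step does in the
   16*16 = 32 Bit example: h0 l0 forms one 32 Bit result. */
uint32_t combine32(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}
```

For a = -300 and b = 200 the halves are $ffff and $15a0, which combine to $ffff15a0, the two's complement pattern of -60000.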
examples:
    pmulh.w #1024,E0,E2     ; E2 = (E0*1024)>>16 = E0>>6

-------------------------------------------------------------------------------
AMMX/PMUL88

mnemonic:   pmul88 <vea>,b,d

short:      vector multiply short and shift down 8

graphic:
    input a
    -----------------------------------------------------
    |     a0     |     a1     |     a2     |     a3     |
    -----------------------------------------------------
          |            |            |            |
    input b
    -----------------------------------------------------
    |     b0     |     b1     |     b2     |     b3     |
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    | (b0*a0)>>8 | (b1*a1)>>8 | (b2*a2)>>8 | (b3*a3)>>8 |
    -----------------------------------------------------
          |            |            |            |
    -----------------------------------------------------
    |     d0     |     d1     |     d2     |     d3     |
    -----------------------------------------------------

equivalent C Code:
    int i;
    short a[4]; /* signed inputs */
    short b[4];
    short d[4];
    for( i = 0 ; i<4 ; i++ )
    {
        d[i] = ( a[i] * b[i] ) >> 8;
    }

typical application cases:
    PMUL88 is effectively a multiply of a 16.0 integer with an 8.8 fixed
    point number. As such, it can be used for short-range shifts as well.
    Should accuracy and dynamic range not fit the particular needs, please
    have a look at PMULL and PMULH.

examples:
    pmul88.w #64,E0,E1      ; E1 = (E0*64)>>8   = E0>>2
    pmul88.w #1024,E0,E2    ; E2 = (E0*1024)>>8 = E0<<2

-------------------------------------------------------------------------------
AMMX/-------Detecting_AMMX_in_AmigaOS----------

AMMX is a new feature added to the 68k architecture. In particular, the
extended registers need to be handled by the multitasking of AmigaOS. As of
the Gold 2.7 release, the extensions to context switching are part of the
VampireSupport kickstart module. Without this module, AMMX cannot be safely
used in a multitasking setting.

Apollo Core presence can be checked in several ways. The recommended method
is to check the AttnFlags in ExecBase.
The following additions need to be made either in your own code or in the
respective includes:

execbase.i

    IFND AFB_68080
    ;
    ; The AFB_68080 bit is set when a working AC68080
    ; is in the system. If this is set then all bits
    ; for 010/020/030/040 are also set, since the 080
    ; is intended to be compatible with all of them.
    ;
    BITDEF AF,68080,10 ; Set if AC68080
    ENDC

execbase.h

    #ifndef AFB_68080
    /*
     * The AFB_68080 bit is set when a working AC68080
     * is in the system. If this is set then all bits
     * for 010/020/030/040 are also set, since the 080
     * is intended to be compatible with all of them.
     */
    #define AFB_68080 10
    #define AFF_68080 (1<<10)
    #endif

ASM code to check for Apollo Core may look like the following example:

    ; returns 1 if the program is running on Apollo Core, else 0
            INCLUDE exec/execbase.i
    Apollo_Presence:
            move.l  $4.w,a0             ; Load ExecBase
            move.w  AttnFlags(a0),d1    ; Read AttnFlags
            moveq   #1,d0               ;
            btst    #AFB_68080,d1       ; AC68080 ?
            bne.s   _have_apollo
            moveq   #0,d0               ; side note for optimization: Apollo will
                                        ; merge a single instruction after a branch
                                        ; to a conditional execution (=free branch)
    _have_apollo:
            rts

C code to check for Apollo Core may look like the following example:

    #include <exec/execbase.h>

    #ifndef AFB_68080
    #define AFB_68080 10
    #endif
    #ifndef AFF_68080
    #define AFF_68080 (1<<10)
    #endif

    extern struct ExecBase *SysBase;

    /* returns 1 if the program is running on Apollo Core, else 0 */
    /* (function name assumed here, chosen to match the ASM example) */
    int Apollo_Presence( void )
    {
        if( SysBase->AttnFlags & AFF_68080 )
            return 1;
        return 0;
    }

AMMX1 is guaranteed to be available when AFF_68080 is present in ExecBase
(Apollo Core Gold 2.5 release). The Gold 2.7 release, however, introduced
additional instructions, AMMX2 for short (PCMP, BSEL, PMIN/PMAX, STOREC,
STOREILM, LOADi/STOREi, PMULA/PMULL/PMULH etc.).

Although manual checks for these instructions would be possible, doing so is
not recommended. The best and shortest way is simply to call
vampire.resource. The vampire.resource has a function that will check for
AMMX presence and version and enable it accordingly on request.

SEE ALSO AMMX/--------Enabling_AMMX_in_AmigaOS---------- ------------------------------------------------------------------------------- AMMX/--------Enabling_AMMX_in_AmigaOS---------- AMMX is a feature that was added to an 68k processor long after AmigaOS3 development stalled. Hence, the extended registers that come along with Apollo Core are not saved in the regular AmigaOS stackframe. In order to find a balance between backwards compatibility, context switch overhead and Apollo Core support, it was decided that AMMX aware tasks have to announce themselves to the operating system. Please note at this point: This setting is per task. If your program is using multiple tasks/processes, every one of them needs to call the respective function in vampire.resource. It is _not_ necessary to switch off AMMX explicitly. The functionality to switch off AMMX was provided for library functions to enable/disable AMMX on the fly. Technical details: AMMX awareness of tasks is signaled in SR, bit 11. Once this flag is set, the task scheduling replacement in the VampireSupport kickstart module will save the extended registers to the user stack. Otherwise, it reverts to the standard stackframe. Please refer to the calling conventions of vampire.resource for further details. example: #include struct Library *VampireBase; /* sample program returns 1 if AMMX2 available and activated, 0 if not */ int main( int argc, char ** argv ) { if( !(VampireBase = OpenResource( V_VAMPIRENAME ) ) ) return 0; if( VampireBase->lib_Version >= 45 ) { if( V_EnableAMMX( V_AMMX_V2 ) != VRES_ERROR ) return 1; } return 0; } SEE ALSO vampire/V_EnableAMMX vampire/vampire.h vampire/vampire.i ------------------------------------------------------------------------------- AMMX/--------------Introduction---------------- AMMX, as Gunnar named it, is a 64 Bit SIMD extension to an 68k CPU. 
Apart from the fact that it shares the 64 Bit width with the MMX of a well
known company, the concept we followed is more geared towards the SIMD
extensions in RISC architectures (AltiVec, Wireless MMX).

In the current state of development, 32 registers are available for SIMD
usage. These 32 registers include the well-known D0-D7 (extended to 64 Bit)
and 24 new registers which are SIMD exclusive. This way, a lot of work can be
done in registers, reducing the strain on memory reads and writes
considerably.

While reading on, it should become clear that we studied the previously
available SIMD architectures but at the same time came up with unique
features that separate AMMX from the rest.

Most instructions follow a 3 operand logic D = A op B, where the result of
the operation between A and B is stored in any register D. It must be noted
at this point that the input operand A doesn't have to be a register. Any
effective address in 68k notation is allowed, including immediates. Some
examples:

    PADDW   D0,D1,D2                    ; 4x16 Bit addition D0+D1=D2
    PADDW   (A0),D1,D2                  ; same, from memory (unaligned)
    PADDW   #$8100810081008100,D1,D2    ; add 4x16 Bit constant
    PADDW.W #$8100,D1,D2                ; same as above, with implicit splat

The latter two code lines above demonstrate a convenient feature in AMMX.
You can specify immediates in AMMX SIMD code, something you don't find
easily somewhere else. The constants can be given in full 64 Bit. While this
may be useful for some applications, the 64 Bit immediates result in
instructions of 12 Bytes. As an alternative, we added a second way of
specifying constants. The .w syntax in the last of the example mnemonics
triggers the implicit distribution of the immediate data word to all four
16 Bit slots. This way, the latter two instructions are identical in their
arithmetic operation. The difference with implicit splat is a reduction of
the instruction to 6 Bytes.

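The implicit splat can be sketched in C as follows. The function name splat16 is hypothetical and only models the arithmetic effect described above: one 16 Bit immediate is copied into all four 16 Bit slots of a 64 Bit value.

```c
#include <stdint.h>

/* Distribute a 16 Bit immediate to all four 16 Bit slots (implicit splat). */
uint64_t splat16(uint16_t imm)
{
    uint64_t v = imm;
    return v | (v << 16) | (v << 32) | (v << 48);
}
```

Under this model, splat16(0x8100) gives 0x8100810081008100, the same constant as the full 64 Bit immediate in the third PADDW example.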
These two concepts of 3 operand logic and immediates can help to save a
number of move instructions that were common to 68k code.

In terms of data movement, two basic operations are supported: LOAD and
STORE. While input data for the operations can be gathered via the <vea> of
one of the operands in the arithmetic operations, the destination is a
register in the majority of instructions. Therefore, movement to memory
needs to be done by STORE. Example:

    LOAD    (A0)+,D1        ;D1=64 bit from any memory location, A0=A0+8
    PAVGB   (A1)+,D1,D1     ;8x unsigned byte average (a+b+1)>>1
    STORE   D1,(A2)+        ;write result

A special case of STORE is also provided, one that can selectively write the
individual bytes. STOREM Rn,Rm,<vea> will only write the bytes whose
corresponding mask bit is set (both in MSB to LSB notation).

    moveq   #4,d3           ;yes yes, this will stall in the following calculation
    LOAD    4(A0,D3.l*4),D1 ;D1=64 bit from any memory location
    moveq   #%01010101,D2   ;D2.b=bit mask which bytes (bit=1) are to be written
    STOREM  D1,D2,(A2)+     ;write every second byte from D1

The third special STORE variant is targeted at 8 Bit pixel data. Typical
operations in image/video processing produce intermediate results exceeding
the 8 Bit range, which implies clipping before going back to 8 Bit. The
Apollo features its own interpretation of PACKUSWB for this purpose.
Clipping is done to (0,255). Example:

    LOAD     (A0)+,D1       ;4 signed words: a0.w a1.w a2.w a3.w
    LOAD     (A1)+,D2       ;4 signed words: b0.w b1.w b2.w b3.w
    PACKUSWB D1,D2,(A2)+    ;8 unsigned bytes: a0 a1 a2 a3 b0 b1 b2 b3
    ; operation: vn.b = ( vn.w < 0 ) ? 0 : ( ( vn.w > 255 ) ? 255 : vn.w ); // n=0...7

One catch with SIMD is that you cannot always guarantee that you are able to
lay out your data as needed by the arithmetics. That's why coders have been
fond of the permute instruction, introduced with Motorola's PPC7400 (aka G4)
series. The Apollo core offers one, too.
Two input registers Ra and Rb can be permuted by a given permutation
constant into the destination Rd. Example:

    ;byte permutation key semantics for Rm,Rn
    ; Rm m0 m1 m2 m3 m4 m5 m6 m7 = 0 1 2 3 4 5 6 7
    ; Rn n0 n1 n2 n3 n4 n5 n6 n7 = 8 9 a b c d e f
    ;
    ; ex1: word interleaving
    LOAD    (A0)+,D1            ;4 signed words: m0.w m1.w m2.w m3.w
    LOAD    (A1)+,D2            ;4 signed words: n0.w n1.w n2.w n3.w
    VPERM   #$018923ab,D1,D2,D3 ;D3: m0.w n0.w m1.w n1.w
    ; ex2: unsigned byte to words
    LOAD    (A0),D4             ;8 unsigned bytes m0 m1 m2 m3 m4 m5 m6 m7
    moveq   #0,d5               ;0.l
    VPERM   #$F0F1F2F3,D4,D5,D6 ; first four bytes as words m0.w m1.w m2.w m3.w
    VPERM   #$F4F5F6F7,D4,D5,D6 ; second four bytes as words m4.w m5.w m6.w m7.w

Let's come to arithmetics. Bit-wise operations are:

    PAND    <vea>,Rb,Rd
    POR     <vea>,Rb,Rd
    PEOR    <vea>,Rb,Rd
    PANDN   <vea>,Rb,Rd

Addition/Subtraction can be done on 8 Bit or 16 Bit.

    PADDB   <vea>,Rb,Rd     ;Rd = Rb + <vea>
    PADDW   <vea>,Rb,Rd     ;
    PSUBB   <vea>,Rb,Rd     ;Rd = Rb - <vea>
    PSUBW   <vea>,Rb,Rd     ;

One special case of add/sub is the BFLYW. A common recurrence in signal
transforms (FFT, DCT, DWT) is the butterfly, an operation where the result of
an addition and subtraction of two operands is required. In order to augment
such transforms, AMMX offers BFLYW <vea>,Rb,Rd:Rd+1. Please note that the
destination register is actually a consecutive pair (with an even index for
the first one).

    BFLYW   D0,D1,D2:D3     ; D2 = D1 + D0 , D3 = D1 - D0 (4 words each)

As a side note, we replaced 28 add+sub combinations by butterflies in an 8x8
iDCT, roughly 15% of the total instructions in that function block.

Multiplies are currently offered by the PMUL88 <vea>,Rb,Rd instruction. It
multiplies four words with the given operand and shifts down by 8 Bits after
the multiply (Rd = (Rb*<vea>)>>8 ). Example:

    PMUL88.W #16,D0,D1      ; D1 = (D0*16)>>8   = D0/16
    PMUL88.W #1024,D0,D1    ; D1 = (D0*1024)>>8 = D0*4
    PMUL88.W D2,D3,D4       ;

The multiply is implemented with full throughput. The implicit downshift
(>>8) can serve as short range shift replacement with the respective
multipliers.

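Choosing the multiplier for such a shift replacement can be sketched per lane in C. All function names here are hypothetical. The lane function mirrors the equivalent C code of PMUL88; the right shift of a negative product is implementation-defined in C, so the sketch is only meant for the non-negative values shown.

```c
#include <assert.h>
#include <stdint.h>

/* One PMUL88 lane: 16.0 integer times 8.8 fixed point, shifted down by 8. */
int16_t pmul88_lane(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 8);
}

/* Multiplier that makes PMUL88 act as a right shift by n (n = 0..8). */
int16_t pmul88_factor_shr(unsigned n)
{
    assert(n <= 8);
    return (int16_t)(256 >> n);
}

/* Multiplier that makes PMUL88 act as a left shift by n (n = 0..6);
   larger n would not fit into a signed 16 Bit operand. */
int16_t pmul88_factor_shl(unsigned n)
{
    assert(n <= 6);
    return (int16_t)(256 << n);
}
```

A factor of 16 therefore corresponds to a right shift by 4 (division by 16) and a factor of 1024 to a left shift by 2 (multiplication by 4), matching the two PMUL88.W examples.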
Now, a second special operation pair is TRANS. It comes in two flavors,
TRANSHi and TRANSLo. Normally, a matrix transpose is quite time consuming as
you can only shuffle two operands with other ISAs (just counted 18
instructions for an 8x4 block with 16 bits per element using Intel SSE in an
old routine of mine). This overhead may well have a significant impact on
SIMD execution speed when it comes to matrix operations.

Apollo's TRANS operations transpose a 4x4 block with 16 bit per element from
row to column order and vice versa. Example:

    LOAD    (A0)+,E0        ; A0 B0 C0 D0 (A-D = 16 bit words)
    LOAD    (A0)+,E1        ; A1 B1 C1 D1 (A-D = 16 bit words)
    LOAD    (A0)+,E2        ; A2 B2 C2 D2 (A-D = 16 bit words)
    LOAD    (A0)+,E3        ; A3 B3 C3 D3 (A-D = 16 bit words)
    ;Now transpose the first two words of E0-E3 into the output registers
    TRANSHi E0-E3,E4:E5     ; E4: A0 A1 A2 A3
                            ; E5: B0 B1 B2 B3
    ;Transpose the lower two words of E0-E3 into the chosen output registers
    TRANSLo E0-E3,E6:E7     ; E6: C0 C1 C2 C3
                            ; E7: D0 D1 D2 D3
    ; Done. Calculate or store...

As one can clearly see from the operand list, TRANS is a beast. It takes
32 Bytes as input and provides 16 Bytes output, with a throughput of 1. This
requires some compromises. Technically speaking, an Apollo feature called
"late write" is used here. This induces latency. Place two cycles' worth of
instructions between a TRANS and the instruction referencing its result to
avoid bubbles.

The second compromise concerns input and output registers. The inputs are
restricted to a consecutive block of registers, starting with an index
divisible by 4, i.e. D0-D3,D4-D7,E0-E3,...,E20-E23. A similar restriction
applies to the outputs. Here, the register index must be a multiple of two.
TRANS does not accept memory locations or immediates.

-------------------------------------------------------------------------------
AMMX/--Instruction_Words_and_Addressing_Modes--

The AMMX instruction words are 32 bit in length.
The first word is organized as follows:

    -------------------------------------------------------------
    | Bit     | 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
    | Content |  1  1  1  1  1  1  1  A  B  D <------VEA------> |
    -------------------------------------------------------------

The second word contains the register indices and instruction numbering.
Currently, 5 bits are in use for the op itself. The remaining 3 Bits are
reserved for future opcodes.

    -------------------------------------------------------------
    | Bit     | 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
    | Content | <- REG-B -> <- REG-D ->  0  0  0 <---- Op ----> |
    -------------------------------------------------------------

AMMX generally provides the usual m68k addressing modes for one of the
operands. In case of operations with a clear or typical target in memory,
<vea> applies to the destination. Otherwise, one of the inputs is of the
type <vea>. Exceptions to this scheme are VPERM and TRANS, where memory
operands are not allowed.

Before the <vea> itself is explained, some other notes on the Apollo Core.
There have been requests to provide additional address registers. As a
consequence, Apollo Core offers the additional registers Bn = B0...B7. A
number of scalar instructions have been implemented to support the new Bn.
In terms of instruction format, these registers were carried over to the
AMMX instructions. The distinction between classic addressing modes and the
new register set(s) is done by Bits 8,7,6 (=A,B,D) of the instruction word.

The most prominent use of the Bits A,B,D in AMMX is the bank selection for
E8-E23. When one of the selector bits is set, E8-E23 are selected instead of
D0-D7,E0-E7 for the operands b and d. D0-D7 correspond to the bit
combinations 0000...0111 and E0-E7 to 1000...1111 in the REG-B and REG-D
fields. With the bank selector bit on, the registers are E8-E23 with the
consecutive bit combinations 0000...1111.

The following table lists the valid addressing modes for AMMX.
Please note that the 68020+ memory indirect modes are not among the valid choices for AMMX. Also, the register direct mode only refers to 64 Bit registers, replacing An/Bn source operands by En. Another difference between and was chosen in terms of immediates. The default immediate encoding specifies the full 64 bit with four extension words. With A=1, the short variant with implicit splat is selected. The other extension words to modes are unchanged in comparison to . Please refer to the 68020+ manual for the common encoding. +---------+-------------------------------------------------+ | MOD REG | Effective Adressing Mode in dependency of A-Bit | +---------+-------------------------------------------------+ | | A=0 | A=1 | +---------+-------------------------+-----------------------+ | 000 --- | Dn | E8...E15 | | 001 --- | E0-E7 | E16...E23 | | 010 --- | (An) | (Bn) | | 011 --- | (An)+ | (Bn)+ | | 100 --- | -(An) | -(Bn) | | 101 --- | (d16,An) | (d16,Bn) | +---------+-------------------------------------------------+ | 110 --- | (d8,An,Xn.SIZE*SCALE) | (d8,Bn,Xn.SIZE*SCALE) | | 110 --- | (bd,An,Xn.SIZE*SCALE) | (bd,Bn,Xn.SIZE*SCALE) | +---------+-------------------------------------------------+ | 111 010 | (d16,PC) | | 111 011 | (d8,PC,Xn.SIZE*SCALE) | | 111 011 | (bd,PC,Xn.SIZE*SCALE) | | 111 000 | (xxxx).W | | 111 001 | (xxxxxxxx).L | | 111 100 | #.q | #.w | +---------+-------------------------------------------------+ ------------------------------------------------------------------------------- AMMX/PACKUSWB mnemonic: packuswb a,b, short: pack 2x4 signed shorts into 8 unsigned char, saturate to 0..255 graphic: ----------------------------------------- | a0w | a1w | a2w | a3w | ----------------------------------------- | / _____/ _________| | / / __/ ----------------------------------------- | / / / | b0w | b1w | b2w | b3w | | / / / ----------------------------------------- | | | | ____| | | | | | | | / __________| | | | | | | / / ________________| | | | | | 
        / / / ______________________|
        | | | | / / / /
        -----------------------------------------
        | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 |
        -----------------------------------------

   equivalent C Code:
        int i;
        short a[4];
        short b[4];
        unsigned char d[8];

        for( i = 0 ; i<4 ; i++ )
        {
          d[i]   = ( a[i] < 0 ) ? 0 : (a[i] > 255) ? 255 : a[i];
          d[i+4] = ( b[i] < 0 ) ? 0 : (b[i] > 255) ? 255 : b[i];
        }

   constraints:
        The destination can be either a register or memory location.
        Immediate operands don't apply here.

   typical application cases:
        Filter calculations in 16 bit, with 8 Bit unsigned pixel data as
        the output.

-------------------------------------------------------------------------------
AMMX/PACK3216

   mnemonic:
        pack3216 a,b,

   short:
        pack 32 Bit ARGB data into 16 Bit RGB565

   graphic:
        -----------------------------------------
        | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 |
        -----------------------------------------
        -----------------------------------------
        | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |
        -----------------------------------------
           (a0, a4, b0 and b4 are not used; the remaining bytes are
            routed into the destination words as labelled below)
        ---------------------------------------------------------------------
        |  a1   a2   a3  |  a5   a6   a7  |  b1   b2   b3  |  b5   b6   b7  |
        | 7..3 7..2 7..3 | 7..3 7..2 7..3 | 7..3 7..2 7..3 | 7..3 7..2 7..3 |
        |      d0w       |      d1w       |      d2w       |      d3w       |
        ---------------------------------------------------------------------

   pseudocode:
        _REGISTER_ a,b;
        _VEA_      d;

        /* diw denotes word i of the destination (d0w..d3w) */
        for( i = 0 ; i<4 ; i++ )
        {
          if( i < 2 )
            diw = ((a[1+(i&1)*4]&0xf8)<<8) |
                  ((a[2+(i&1)*4]&0xfc)<<3) |
                  ((a[3+(i&1)*4]&0xf8)>>3);
          else
            diw = ((b[1+(i&1)*4]&0xf8)<<8) |
                  ((b[2+(i&1)*4]&0xfc)<<3) |
                  ((b[3+(i&1)*4]&0xf8)>>3);
        }

   constraints:
        The destination can be either a register or memory location.
        Immediate operands don't apply here.

   typical application cases:
        Conversion of ARGB input data to HiColor screen and texture formats.
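
   notes:
        The pseudocode above can be checked on a host machine with plain C.
        The following is a minimal sketch, not an official reference: the
        function name pack3216_ref and the representation of registers as
        byte arrays in register order (a0 first) are assumptions made for
        illustration only.

```c
#include <stdint.h>

/* Hypothetical host-side model of pack3216 (illustration only).
 * a and b each hold two ARGB pixels as bytes A,R,G,B in register order.
 * d receives four RGB565 words: d[0..1] from a, d[2..3] from b. */
void pack3216_ref(const uint8_t a[8], const uint8_t b[8], uint16_t d[4])
{
    int i;
    for (i = 0; i < 4; i++) {
        const uint8_t *src = (i < 2) ? a : b;    /* words 0,1 from a; 2,3 from b */
        const uint8_t *px  = src + (i & 1) * 4;  /* first or second pixel        */
        d[i] = (uint16_t)(((px[1] & 0xf8) << 8) |   /* R bits 7..3 -> 15..11 */
                          ((px[2] & 0xfc) << 3) |   /* G bits 7..2 -> 10..5  */
                          ((px[3] & 0xf8) >> 3));   /* B bits 7..3 ->  4..0  */
    }
}
```

        The alpha byte of each pixel (index 0 and 4) is never read, which
        matches the unused a0/a4/b0/b4 inputs in the graphic.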
------------------------------------------------------------------------------- AMMX/UNPACK1632 mnemonic: unpack1632 ,d:d+1 short: unpack 16 Bit RGB565 data into 32 Bit ARGB graphic: --------------------------------------------------------------------- | 7..3 7..2 7..3 | 7..3 7..2 7..3 | 7..3 7..2 7..3 | 7..3 7..2 7..3 | | a0w | a1w | a2w | a3w | --------------------------------------------------------------------- | | | | | | | | | \ \ \ | | | | | | \ \ \____ \__ \ \___________ | | | \ \ \ \ \_______ \ \ \__________ | \ \ \ \ \ \ \_______ \ \___ \_______ | | \ \ \ \ \ \ \ | | | | | \ \ \ \ \ \ ----------------------------------------- \ \ \ \ \_ \_ | e0 | e1 | e2 | e3 | e4 | e5 | e6 | e7 | | | | \ \ \ ----------------------------------------- ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- constraints: The destination register pair needs to be consecutive, starting with an even register index (e.g. unpack1632 (a0),E0:E1 ). Immediate operands don't apply here. equivalent C Code: int i; unsigned short a[4]; unsigned char d[16]; for( i = 0 ; i<4 ; i++ ) { d[i*4] = 0xff; d[i*4+1] = ((a[i]>>8) & 0xf8) | ((a[i]>>13) & 0x7); d[i*4+2] = ((a[i]>>3) & 0xfc) | ((a[i]>>9 ) & 0x3); d[i*4+3] = ((a[i]<<3) & 0xf8) | ((a[i]>>2 ) & 0x7); } typical application cases: Conversion of RGB565 HiColor input data to ARGB screen and texture formats. 
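
   notes:
        The C code above fills the low bits of every 8 Bit channel with a
        copy of the channel's top bits, so that full intensity expands to
        $ff instead of $f8/$fc. A minimal host-side sketch of the same
        arithmetic follows; the name unpack1632_ref is hypothetical and the
        byte order of d (A,R,G,B per pixel) is taken from the C code above.

```c
#include <stdint.h>

/* Hypothetical host-side model of unpack1632 (illustration only).
 * a holds four RGB565 pixels, d receives four ARGB pixels (16 bytes). */
void unpack1632_ref(const uint16_t a[4], uint8_t d[16])
{
    int i;
    for (i = 0; i < 4; i++) {
        unsigned p = a[i];
        d[i * 4]     = 0xff;                                         /* opaque alpha   */
        d[i * 4 + 1] = (uint8_t)(((p >> 8) & 0xf8) | ((p >> 13) & 0x7)); /* R5 -> R8  */
        d[i * 4 + 2] = (uint8_t)(((p >> 3) & 0xfc) | ((p >> 9) & 0x3));  /* G6 -> G8  */
        d[i * 4 + 3] = (uint8_t)(((p << 3) & 0xf8) | ((p >> 2) & 0x7));  /* B5 -> B8  */
    }
}
```

        Packing the result again with pack3216 drops exactly the replicated
        low bits, so an unpack followed by a pack returns the original
        RGB565 value.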
-------------------------------------------------------------------------------
AMMX/VPERM

   mnemonic:
        vperm #N,a,b,d

   short:
        permute the contents of two registers into destination register

   graphic:
        ---------------------------------
        | N0 N1 | N2 N3 | N4 N5 | N6 N7 |   (4 Bit each)
        ---------------------------------

        -----------------------------------------
        | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 |   (selected by N = 0..7)
        -----------------------------------------
        -----------------------------------------
        | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |   (selected by N = 8..15)
        -----------------------------------------

           (example permutation, #N = #$1203687d:
            e0-e7 = a1,a2,a0,a3,a6,b0,a7,b5)

        -----------------------------------------
        | e0 | e1 | e2 | e3 | e4 | e5 | e6 | e7 |
        -----------------------------------------

   constraints:
        This instruction does not support memory operands. The two inputs
        (a,b) and the destination (d) must be registers. The permutation
        constant N must be given at assembly time.

   equivalent C Code:
        int i,n;
        unsigned char a[8];
        unsigned char b[8];
        unsigned char d[8];
        unsigned char N[4];   /* N[0] = most significant byte of #N */

        for( i = 0 ; i<8 ; i++ )
        {
          /* get next 4 bits from N, high nibble of each byte first */
          n = ( N[i>>1] >> (((i&1)^1)*4) ) & 0xf;
          if( n < 8 )
            d[i] = a[n];
          else
            d[i] = b[n-8];
        }

   typical application cases:
        Flexible shuffling of bytes can be necessary virtually everywhere in
        SIMD-World.
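
   notes:
        The selection rule can be modelled in portable C to try out
        permutation constants before using them in assembly. This is a
        minimal sketch under one assumption: N0 is the most significant
        nibble of the 32 Bit constant, as suggested by the example constant
        #$1203687d in the graphic. The name vperm_ref is hypothetical.

```c
#include <stdint.h>

/* Hypothetical host-side model of vperm (illustration only).
 * n: 32 bit permutation constant, one nibble per destination byte,
 *    most significant nibble first.
 * Nibble values 0..7 select a[0..7], values 8..15 select b[0..7]. */
void vperm_ref(uint32_t n, const uint8_t a[8], const uint8_t b[8], uint8_t d[8])
{
    uint8_t tmp[8];   /* temporary, so d may alias a or b */
    int i;
    for (i = 0; i < 8; i++) {
        unsigned sel = (n >> (28 - 4 * i)) & 0xf;
        tmp[i] = (sel < 8) ? a[sel] : b[sel - 8];
    }
    for (i = 0; i < 8; i++)
        d[i] = tmp[i];
}
```

        With n = 0x48494a4b and a register a whose lower four bytes are
        zero, the model interleaves zeros with the first four bytes of b,
        i.e. it widens 8 Bit values to 16 Bit.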
example: ; convert 8 Bit pixel data to 16 Bit load (a0),E1 ;load 8 Bytes moveq #0,d0 ;D0 = 0 (lower 32 Bit) vperm #$48494a4b,d0,E1,E2 ; first 4 Bytes of E1 into 4 Words (insert zeros from d0) vperm #$4c4d4e4f,d0,E1,E3 ; second 4 Bytes of E1 into 4 Words ------------------------------------------------------------------------------- AMMX/TRANSHI mnemonic: transhi a-d , e:f short: matrix transposition, upper half graphic: input: 4 consecutive registers ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- ----------------------------------------- | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | ----------------------------------------- ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- output: two consecutive registers ----------------------------------------- | a0 | a1 | b0 | b1 | c0 | c1 | d0 | d1 | ----------------------------------------- ----------------------------------------- | a2 | a3 | b2 | b3 | c2 | c3 | d2 | d3 | ----------------------------------------- constraints: This instruction does not support memory operands. The four inputs (a-d) and the destination (e,f) must be consecutive registers. The first source a is constrained to a multiple of 4 (i.e. D0-D3,D4-D7,E0-E3,...,E20-E23). The destination register index pair (e,f) are restricted to a multiple of two (i.e. D0:D1,D2:D3 etc.). equivalent C Code: int i,n; unsigned short a[16]; /* 4 registers of 8 bytes each */ unsigned short e[4]; /* 1 register of 8 bytes */ unsigned short f[4]; /* 1 register of 8 bytes */ for( i = 0 ; i<4 ; i++ ) { e[i] = a[ 4*i ]; f[i] = a[ 4*i + 1 ]; } typical application cases: This instruction allows a 4x4 matrix transposition in just two CPU cycles (in conjunction with translo). 
The extension towards larger matrices (8x8,16x16 etc.) is straightforward. example: load (a0),E0 load 8(a0),E1 load 16(a0),E2 load 24(a0),E3 transhi E0-E3,E4:E5 ;transpose upper (left) half of matrix ------------------------------------------------------------------------------- AMMX/TRANSLO mnemonic: translo a-d , e:f short: matrix transposition, lower half graphic: input: 4 consecutive registers ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- ----------------------------------------- | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | ----------------------------------------- ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- output: two consecutive registers ----------------------------------------- | a4 | a5 | b4 | b5 | c4 | c5 | d4 | d5 | ----------------------------------------- ----------------------------------------- | a6 | a7 | b6 | b7 | c6 | c7 | d6 | d7 | ----------------------------------------- constraints: This instruction does not support memory operands. The four inputs (a-d) and the destination (e,f) must be consecutive registers. The first source a is constrained to a multiple of 4 (i.e. D0-D3,D4-D7,E0-E3,...,E20-E23). The destination register index pair (e,f) are restricted to a multiple of two (i.e. D0:D1,D2:D3 etc.). equivalent C Code: int i; unsigned short a[16]; /* 4 registers of 8 bytes each */ unsigned short e[4]; /* 1 register of 8 bytes */ unsigned short f[4]; /* 1 register of 8 bytes */ for( i = 0 ; i<4 ; i++ ) { e[i] = a[ 4*i + 2 ]; f[i] = a[ 4*i + 3 ]; } typical application cases: This instruction allows a 4x4 matrix transposition in just two CPU cycles (in conjunction with transhi). The extension towards larger matrices (8x8,16x16 etc.) is straightforward. 
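
   notes:
        How transhi and translo combine into a full 4x4 transposition of
        16 Bit elements can be sketched in plain C. The function names
        below are hypothetical; the matrix is stored row by row in a flat
        array of 16 words, like the array a[16] in the C code above.

```c
#include <stdint.h>

/* Hypothetical host-side models (illustration only).
 * m: four source registers of four words each, stored back to back. */
void transhi_ref(const uint16_t m[16], uint16_t e[4], uint16_t f[4])
{
    int i;
    for (i = 0; i < 4; i++) {
        e[i] = m[4 * i];       /* column 0 of the matrix */
        f[i] = m[4 * i + 1];   /* column 1 of the matrix */
    }
}

void translo_ref(const uint16_t m[16], uint16_t e[4], uint16_t f[4])
{
    int i;
    for (i = 0; i < 4; i++) {
        e[i] = m[4 * i + 2];   /* column 2 of the matrix */
        f[i] = m[4 * i + 3];   /* column 3 of the matrix */
    }
}

/* Full transposition: the four result registers are the four columns. */
void transpose4x4_ref(const uint16_t in[16], uint16_t out[16])
{
    transhi_ref(in, out, out + 4);
    translo_ref(in, out + 8, out + 12);
}
```

        Each call produces two result rows, which is why the pair of
        instructions is sufficient for the whole 4x4 block.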
   example:
        load    (a0),E0
        load    8(a0),E1
        load    16(a0),E2
        load    24(a0),E3
        transhi E0-E3,E4:E5   ;transpose upper (left) half of matrix
                              ;(to complete example)
        translo E0-E3,E6:E7   ;transpose lower (right) half of matrix

-------------------------------------------------------------------------------
AMMX/BFLYW

   mnemonic:
        bflyw ,b,d:d+1

   short:
        butterfly operation, single-cycle vector short addition _and_
        subtraction

   graphic:
        inputs:
        -----------------------------------------
        |   a0w   |   a1w   |   a2w   |   a3w   |
        -----------------------------------------
             |         |         |         |
        -----------------------------------------
        |   b0w   |   b1w   |   b2w   |   b3w   |
        -----------------------------------------
           |   |     |   |     |   |     |   |
           |   |     |   |     |   |     |   |
        outputs:     |   |     |   |     |   |
        -----------------------------------------
        | b0w+a0w | b1w+a1w | b2w+a2w | b3w+a3w |
        -----------------------------------------
           |   |     |   |     |   |     |   |
        -----------------------------------------
        | b0w-a0w | b1w-a1w | b2w-a2w | b3w-a3w |
        -----------------------------------------

   constraints:
        The destination register pair needs to be consecutive, starting
        with an even register index (e.g. bflyw (a0),E8,E0:E1 ).
        Immediate operands don't apply here.

   equivalent C Code:
        int i;
        short a[4]; /* first source operand            */
        short b[4]; /* second source operand           */
        short d[4]; /* first destination:  sums        */
        short e[4]; /* second destination: differences */

        for( i = 0 ; i<4 ; i++ )
        {
          d[i] = b[ i ] + a[ i ];
          e[i] = b[ i ] - a[ i ];
        }

   constraints:
        The destination can be either a register or memory location.
        Immediate operands don't apply here.

   typical application cases:
        Signal transforms like FFT, DCT, DWT etc. and lifting approaches in
        terms of filter banks in general.
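
   notes:
        A butterfly step and its inverse can be sketched in portable C.
        This is a minimal model under an assumption the text above does not
        state explicitly: results wrap around at 16 Bit, as the plain short
        arithmetic in the C code suggests (no saturation). The name
        bflyw_ref is hypothetical.

```c
#include <stdint.h>

/* Hypothetical host-side model of bflyw (illustration only).
 * sum[i]  = b[i] + a[i]
 * diff[i] = b[i] - a[i]
 * Arithmetic is done unsigned to get well-defined 16 bit wrap-around. */
void bflyw_ref(const int16_t a[4], const int16_t b[4],
               int16_t sum[4], int16_t diff[4])
{
    int i;
    for (i = 0; i < 4; i++) {
        uint16_t ua = (uint16_t)a[i];
        uint16_t ub = (uint16_t)b[i];
        sum[i]  = (int16_t)(uint16_t)(ub + ua);
        diff[i] = (int16_t)(uint16_t)(ub - ua);
    }
}
```

        As long as no overflow occurs, the inputs can be recovered from the
        outputs: b = (sum + diff) / 2 and a = (sum - diff) / 2. This is the
        property that FFT and lifting schemes rely on.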
------------------------------------------------------------------------------- AMMX/C2P mnemonic: c2p ,d short: chunky to planar conversion, bit-wise transpose graphic: input (bytes k-r with their bits 7...0): ------------------------------------------------------------------------- |kkkkkkkk|llllllll|mmmmmmmm|nnnnnnnn|oooooooo|pppppppp|qqqqqqqq|rrrrrrrr| |76543210|76543210|76543210|76543210|76543210|76543210|76543210|76543210| ------------------------------------------------------------------------- output: ------------------------------------------------------------------------- |klmnopqr|klmnopqr|klmnopqr|klmnopqr|klmnopqr|klmnopqr|klmnopqr|klmnopqr| |77777777|66666666|55555555|44444444|33333333|22222222|11111111|00000000| ------------------------------------------------------------------------- equivalent C Code: int i,j; unsigned char t; unsigned char a[8]; unsigned char d[8]; for( i = 0 ; i<8 ; i++ ) { for( j=0,t=0 ; j < 8 ; j++ ) t |= ( ( a[j] >> (7-i) ) & 1 ) << (7-j); d[i] = t; } typical application cases: Chunky-to-planar conversion and vice versa. Consider the combination of C2P and VPERM/TRANSHI/TRANSLO for efficient implementations. Also useful to extract condition masks from AMMX registers (e.g. after PCMP) into regular data registers. ------------------------------------------------------------------------------- AMMX/BSEL mnemonic: bsel ,b,d short: graphic: input1: (bytes l-s with their bits 7...0) ------------------------------------------------------------------------- |llllllll|mmmmmmmm|nnnnnnnn|oooooooo|pppppppp|qqqqqqqq|rrrrrrrr|ssssssss| |76543210|76543210|76543210|76543210|76543210|76543210|76543210|76543210| ------------------------------------------------------------------------- | | | | | ... | ... | ... 
        | | | | | input3: (contents of destination register d, with bytes d-k) | |
        -------------------------------------------------------------------------
        |dddddddd|eeeeeeee|ffffffff|gggggggg|hhhhhhhh|iiiiiiii|jjjjjjjj|kkkkkkkk|
        |76543210|76543210|76543210|76543210|76543210|76543210|76543210|76543210|
        -------------------------------------------------------------------------
        | | | | | ... | ... | ...
        | | | | | input2: (contents of register b with bytes b,c,t-y) | |
        -------------------------------------------------------------------------
        |bbbbbbbb|cccccccc|tttttttt|uuuuuuuu|vvvvvvvv|wwwwwwww|xxxxxxxx|yyyyyyyy|
        |76543210|76543210|76543210|76543210|76543210|76543210|76543210|76543210|
        -------------------------------------------------------------------------
           (every bit position of input1, input3 and input2 is combined
            individually, shown here for the first and the last byte)
        --------------------      --------------------
        | (b7&l7)|(!b7&d7) |  ..  | (b0&l0)|(!b0&d0) |  ..
        --------------------      --------------------
        --------------------      --------------------
        | (y7&s7)|(!y7&k7) |  ..  | (y0&s0)|(!y0&k0) |
        --------------------      --------------------
           (the result replaces the contents of the destination register)
        -------------------------------------------------------------------------
        |dddddddd|eeeeeeee|ffffffff|gggggggg|hhhhhhhh|iiiiiiii|jjjjjjjj|kkkkkkkk|
        |76543210|76543210|76543210|76543210|76543210|76543210|76543210|76543210|
        -------------------------------------------------------------------------

   equivalent C Code:
        {
          unsigned long long a;
          unsigned long long b;
          unsigned long long d;

          d = ( d & (~b) ) | ( a & b );  /* BSEL a,b,d */
        }

   typical application cases:
        This instruction allows a bit-by-bit selection of data from two
        sources into the destination. Typically, this is applied in
        conjunction with a prior pcmp instruction. Tasks like conditional
        replenishment and clipping easily come to mind.
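
   notes:
        The selection is a bitwise multiplexer: where a mask bit in b is
        set, the bit is taken from a, otherwise the old destination bit is
        kept. A minimal sketch in C with fixed-width types follows; the
        name bsel_ref is hypothetical and only serves for illustration.

```c
#include <stdint.h>

/* Hypothetical host-side model of bsel a,b,d (illustration only).
 * a: new data, b: selection mask, d: old destination contents.
 * Returns the new destination contents. */
uint64_t bsel_ref(uint64_t a, uint64_t b, uint64_t d)
{
    return (d & ~b) | (a & b);   /* bitwise complement, not logical NOT */
}
```

        With a mask produced by a pcmp instruction ($ff or $00 per byte,
        $ffff or $0000 per word), whole elements are taken from either a or
        d, which is what conditional replacement and clipping need.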
-------------------------------------------------------------------------------
AMMX/PAND_POR_PEOR_PANDN

   mnemonic:
        pand  ,b,d
        por   ,b,d
        peor  ,b,d
        pandn ,b,d

   short:
        64 Bit logic operations

   graphic:
        input1: (bytes l-s with their bits 7...0)
        -------------------------------------------------------------------------
        |llllllll|mmmmmmmm|nnnnnnnn|oooooooo|pppppppp|qqqqqqqq|rrrrrrrr|ssssssss|
        |76543210|76543210|76543210|76543210|76543210|76543210|76543210|76543210|
        -------------------------------------------------------------------------
        | | | | | ... | ... | ...
        | | | | |
        | | | | | input2: (contents of register b with bytes b,c,t-y) | |
        -------------------------------------------------------------------------
        |bbbbbbbb|cccccccc|tttttttt|uuuuuuuu|vvvvvvvv|wwwwwwww|xxxxxxxx|yyyyyyyy|
        |76543210|76543210|76543210|76543210|76543210|76543210|76543210|76543210|
        -------------------------------------------------------------------------
           (every bit position is combined individually, shown here for
            the first and the last byte)
        ---------      ---------                                ---------      ---------
        | b7&l7 |  ..  | b0&l0 |  (partial PAND ILLUSTRATION)   | s7&y7 |  ..  | s0&y0 |
        ---------      ---------                                ---------      ---------
        -------------------------------------------------------------------------
        |dddddddd|eeeeeeee|ffffffff|gggggggg|hhhhhhhh|iiiiiiii|jjjjjjjj|kkkkkkkk|
        |76543210|76543210|76543210|76543210|76543210|76543210|76543210|76543210|
        -------------------------------------------------------------------------

   equivalent C Code:
        {
          unsigned long long a;
          unsigned long long b;
          unsigned long long d;

          d = a & b;   /* PAND  */
          d = a | b;   /* POR   */
          d = a ^ b;   /* PEOR  */
          d = ~a & b;  /* PANDN */
        }

   typical application cases:
        EOR,AND,OR AND I think NOT that I have to write more.
:-) ------------------------------------------------------------------------------- AMMX/PAVGB mnemonic: pavgb ,b,d short: average 8 unsigned bytes with 8 unsigned bytes graphic: ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | | | | ... | | | | ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- \___ \_______________ ... \______ \ \ \ ---------------- ---------------- ---------------- | (a0+b0+1)>>1 | | (a1+b1+1)>>1 | ... | (a7+b7+1)>>1 | ---------------- ---------------- ---------------- ___/ _______________/ ... ______/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char a[8]; unsigned char b[8]; unsigned char d[8]; for( i = 0 ; i<8 ; i++ ) { d[i] = ( a[i] + b[i] + 1 ) >> 1; } typical application cases: Simple half-sample linear interpolation. This type of operation is used (read: mandatory) in many classic video codecs, ranging from MPEG-1 to MPEG-4 part 2. ------------------------------------------------------------------------------- AMMX/LOAD mnemonic: load ,d short: load 64 bit into destination register equivalent C Code: { unsigned long long src; unsigned long lond d; d = src; } typical application cases: Load is the AMMX equivalent to move ,dn. As most other AMMX instructions, it can load either from memory (any An,Bn addressing modes), from another register or constants. In case of memory sources, there are no restrictions towards alignment. You may load from any valid address in Chip- and FastRAM. 
   examples:
        load    (a0),E0
        load    1(a0),E1
        load    #$c0ffee00feedface,E2
        load.w  #$beef,E3             ; this one is with implicit splat
                                      ; E3 = $beefbeefbeefbeef

-------------------------------------------------------------------------------
AMMX/LOADi

   mnemonic:
        loadi ,d

   short:
        load 64 bit indirect into destination register

   equivalent C Code:
        {
          unsigned long long src;
          unsigned char d;
          unsigned long long regfile[64]; /* Apollo-internal register file */

          regfile[d] = src;
        }

   notes:
        In reality, the mnemonic should be "loadi ,(d)". The interim
        implementation in VASM does not reflect the indirect reference to
        "d", however. So in the meantime, think of "(d)" while specifying
        "d" in code.

   typical application cases:
        In many cases, the normal load instruction is more appropriate and
        convenient. This indexed variant requires preloading the index
        register, but it helps, for example, where the destination register
        is to be chosen conditionally. You may also preload AMMX registers
        in a loop instead of in a row to keep code size small (where
        appropriate).

   examples:
        moveq   #1,d0
        loadi   (a0),d0     ;register number 1 is "d1", so d1=(a0)

        moveq   #7,d0       ;example: preload E0-E7
        moveq   #40,d1      ;= E0 index
   .loop:
        loadi   (a0)+,d1
        addq.l  #1,d1
        dbf     d0,.loop

   register map (partial, decimal numbers):
        00 - 07 = D0  - D7
        08 - 15 = A0  - A7
        16 - 23 = B0  - B7
        40 - 47 = E0  - E7
        48 - 55 = E8  - E15
        56 - 63 = E16 - E23

-------------------------------------------------------------------------------
AMMX/STORE

   mnemonic:
        store a,

   short:
        store 64 bit from source register in memory

   equivalent C Code:
        {
          unsigned long long a;
          unsigned long long dest;

          dest = a;
        }

   typical application cases:
        Store is the AMMX equivalent to move dn,. It unconditionally writes
        64 Bit into the destination. In typical cases, the destination is
        some location in memory. You may store to any valid address in
        Chip- and FastRAM. There are no alignment restrictions.
   examples:
        store   E0,(a0)
        store   E1,9(a0)

-------------------------------------------------------------------------------
AMMX/STOREi

   mnemonic:
        storei a,

   short:
        store 64 bit indirect source into memory location

   equivalent C Code:
        {
          unsigned long long dest;
          unsigned char a;
          unsigned long long regfile[64]; /* Apollo-internal register file */

          dest = regfile[a];
        }

   notes:
        In reality, the mnemonic should be "storei (d),". The interim
        implementation in VASM does not reflect the indirect reference to
        "d", however. So in the meantime, think of "(d)" while specifying
        "d" in code.

   typical application cases:
        In many cases, the normal store instruction is more appropriate and
        convenient. This indexed variant requires preloading the index
        register, but it helps, for example, where the source register is
        to be chosen conditionally. You may also store a list of AMMX
        registers in a loop instead of in a row to keep code size small
        (where appropriate).

   examples:
        moveq   #1,d0
        storei  d0,(a0)     ;register number 1 is "d1", so (a0) = d1

   register map (partial, decimal numbers):
        00 - 07 = D0  - D7
        08 - 15 = A0  - A7
        16 - 23 = B0  - B7
        40 - 47 = E0  - E7
        48 - 55 = E8  - E15
        56 - 63 = E16 - E23

-------------------------------------------------------------------------------
AMMX/STOREm

   mnemonic:
        storem a,m,

   short:
        store a selection of bytes from a into destination

   graphic:
        input a
        -----------------------------------------
        | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 |
        -----------------------------------------
          |    |   ...        |    |    |    |
        logical input: destination contents
        -----------------------------------------
        | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 |
        -----------------------------------------
          |    |   ...        |    |    |    |
        input m: lower 8 bit of second argument
        -----------------------------------------
        |    |    |    |    |    |    |    | m  |
        -----------------------------------------
          \___   \____________     ...     \___
              \               \                \
        --------------   -------------       ----------------
        | (m&128) ?  |   | (m&64) ?  |  ...  | (m&1) ?
| | a0 : d0 | | a1 : d1 | | a7 : d7 | -------------- ------------- ---------------- ___/ ____________/ ... ___/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char a[8]; unsigned char m; unsigned char d[8]; for( i = 0 ; i<8 ; i++ ) { d[i] = ( m & (1<<(7-i)) ) ? a[i] : d[i]; } typical application cases: It is usual in SIMD that you can write only the native amount of bytes at once. This instruction enables selective overwriting of memory, based on the contents of the mask register (lower 8 Bit). This instruction has several uses. Perhaps the most prominent application is for cookie-cut: selective writing of pixels from sprites to the screen. examples: move.b #$ff,d0 storem E0,d0,(a0) ;this example is the same as store E0,(a0) move.b #$f0,d0 storem E0,d0,8(a0) ;overwrite 4 bytes from 8(a0) with upper 32 bit of E0 ; one way of color key (see also storeilm) pcmpeqw.w #$f81f,E0,E2 ;magenta HiColor RGB565 pixel(s) in E0 ? c2p e2,e2 ;re-order: get bits from word mask into one byte peor.w #$ffff,e2,e2 ;negate mask (logical: !magenta) storem E0,E2,(a0) ;store only words where the magenta check didn't match ------------------------------------------------------------------------------- AMMX/STOREilm mnemonic: storeilm a,m, short: store bytes from a into destination, based on inverted long mask graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | ... | | | | logical input: destination contents ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- | | ... | | | | input m: lower 8 bit of second argument ----------------------------------------- | m0 | m1 | m2 | m3 | m4 | m5 | m6 | m7 | ----------------------------------------- \___ \____________ ... 
\___ \ \ \ -------------- -------------- -------------- | (m0&128) ? | | (m1&128) ? | ... | (m7&128) ? | | a0 : d0 | | a1 : d1 | | a7 : d7 | -------------- -------------- -------------- ___/ ____________/ ... ___/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char a[8]; unsigned char m[8]; unsigned char d[8]; for( i = 0 ; i<8 ; i++ ) { d[i] = ( m[i] & 128 ) ? d[i] : a[i]; } typical application cases: It is usual in SIMD that you can write only the native amount of bytes at once. This instruction enables selective overwriting of memory, based on the contents of the mask register. Depending on your needs, you might consider either storem or storeilm for particular problems. examples: load.w #$ffff,d0 ;-1.q storeilm E0,d0,(a0) ;this example stores nothing :-) peor d0,d0,d0 ;0.q storeilm E0,d0,(a0) ;is the same as store E0,(a0) ; another way of color key (see also storem) pcmpeqw.w #$f81f,E0,E2 ;magenta HiColor RGB565 pixel(s) in E0 ? storeilm E0,E2,(a0) ;store only words where the magenta check didn't match ------------------------------------------------------------------------------- AMMX/STOREC mnemonic: storec a,count, short: store at most "count" bytes from a into destination graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | ... | | | | logical input: destination contents ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- | | ... | | | | input count: number of bytes to write ----------------------------------------- | | count | ----------------------------------------- \___ \____________ ... \___ \ \ \ -------------- -------------- -------------- | count>0 ? | | count>1 ? | ... | count>7 ? 
| | a0 : d0 | | a1 : d1 | | a7 : d7 | -------------- -------------- -------------- ___/ ____________/ ... ___/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char a[8]; int count; unsigned char d[8]; for( i = 0 ; i<8 ; i++ ) { if( (count - i) > 0 ) d[i] = a[i]; } typical application cases: Memcopy. You can always read/write with 64 Bit instructions and won't have to worry about overwritten locations (as long as you count the written bytes within the loop, ofc). examples: move.l #1523,d0 .loop load (a0)+,E0 storec E0,d0,(a1)+ subq.l #8,d0 bgt .loop ------------------------------------------------------------------------------- AMMX/PCMPccB mnemonic: pcmpeqb ,b,d pcmpgtb ,b,d pcmpgeb ,b,d pcmphib ,b,d short: byte-by-byte vector compare graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | ----------------------------------------- | | ... | | | | \___ \____________ ... \___ \ \ \ -------------- -------------- -------------- | b0 > a0 ? | | b1 > a1 ? | ... | b7 > a7 ? | example drawing: | $ff : $00 | | $ff : $00 | | $ff : $00 | pcmpgtb -------------- -------------- -------------- ___/ ____________/ ... ___/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char au[8]; /* unsigned inputs */ unsigned char bu[8]; char as[8]; /* signed inputs */ char bs[8]; unsigned char d[8]; /* all 4 variants shown in this loop, in practice only one condition apply per instruction */ for( i = 0 ; i<8 ; i++ ) { /* pcmpeqb, applies to au/bu and as/bs */ d[i] = ( bu[i] == au[i] ) ? 0xff : 0x00; d[i] = ( bs[i] == as[i] ) ? 
0xff : 0x00; /* pcmpgtb */ d[i] = ( bs[i] > as[i] ) ? 0xff : 0x00; /* pcmpgeb */ d[i] = ( bs[i] >= as[i] ) ? 0xff : 0x00; /* pcmphib */ d[i] = ( bu[i] > au[i] ) ? 0xff : 0x00; } typical application cases: Comparisons are as important in SIMD as in scalar code. The difference to the latter is that condition codes are not set in SIMD. Instead, you get bit masks. In the easiest case, you can just BSEL for handling of true/false conditions. Also, STOREILM might be worth a look. In case that scalar handling of vector condition masks is desired, C2P with a Dn register target might come in handy. examples: pcmpeqb.w #$0101,E0,E1 ; any byte slot in E0 equal to a constant of "1" ? ; ;pcmphs calculation pcmpeqb E3,E0,E1 ; E1: E0 == E3 ? pcmphib E3,E0,E2 ; E2: E0 > E3 ? (unsigned) por E1,E2,E1 ; E1: E0 >= E3 ? (unsigned) ;pcmphs when E3 is always >= 1 psubusb.w #$0101,E3,E2 pcmphib E2,E0,E1 ; E1: E0 >= E3 ? ; pcmpgtb #$4040,E0,E1 ; any signed byte slot in E0 > $40 ? pcmpgeb #$4040,E0,E1 ; any signed byte slot in E0 >= $40 ? ------------------------------------------------------------------------------- AMMX/PCMPccW mnemonic: pcmpeqw ,b,d pcmpgtw ,b,d pcmpgew ,b,d pcmphiw ,b,d short: short-by-short vector compare graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | ----------------------------------------- | | ... | | | | \_ \_____ ... \____ | \ \ -------------- -------------- -------------- | b0 > a0 ? | | b1 > a1 ? | ... | b3 > a3 ? | example drawing: |$ffff:$0000 | |$ffff:$0000 | |$ffff:$0000 | pcmpgtw -------------- -------------- -------------- / ____/ ... 
        ___/
        | / /
        -----------------------------------------
        |   d0    |   d1    |   d2    |   d3    |
        -----------------------------------------

   equivalent C Code:
        int i;
        unsigned short au[4]; /* unsigned inputs */
        unsigned short bu[4];
        short as[4];          /* signed inputs */
        short bs[4];
        unsigned short d[4];

        /* all 4 variants shown in this loop, in practice only one
           condition applies per instruction */
        for( i = 0 ; i<4 ; i++ )
        {
          /* pcmpeqw, applies to au/bu and as/bs */
          d[i] = ( bu[i] == au[i] ) ? 0xffff : 0x0000;
          d[i] = ( bs[i] == as[i] ) ? 0xffff : 0x0000;
          /* pcmpgtw */
          d[i] = ( bs[i] >  as[i] ) ? 0xffff : 0x0000;
          /* pcmpgew */
          d[i] = ( bs[i] >= as[i] ) ? 0xffff : 0x0000;
          /* pcmphiw */
          d[i] = ( bu[i] >  au[i] ) ? 0xffff : 0x0000;
        }

   typical application cases:
        Comparisons are as important in SIMD as in scalar code. The
        difference to the latter is that condition codes are not set in
        SIMD. Instead, you get bit masks. In the easiest case, you can just
        use BSEL to handle true/false conditions. Also, STOREILM might be
        worth a look. If scalar handling of vector condition masks is
        desired, C2P with a Dn register target might come in handy.

   examples:
        pcmpeqw.w #$0001,E0,E1  ; any word slot in E0 equal to a constant of "1" ?
        ;
        ;pcmphs calculation
        pcmpeqw   E3,E0,E1      ; E1: E0 == E3 ?
        pcmphiw   E3,E0,E2      ; E2: E0 >  E3 ? (unsigned)
        por       E1,E2,E1      ; E1: E0 >= E3 ? (unsigned)

        ;pcmphs when E3 is always >= 1
        psubusw.w #$0001,E3,E2  ; E2 = E3 - 1 in every word slot
        pcmphiw   E2,E0,E1      ; E1: E0 >= E3 ?
        ;
        pcmpgtw   #$4040,E0,E1  ; any signed short slot in E0 >  $4040 ?
        pcmpgew   #$4040,E0,E1  ; any signed short slot in E0 >= $4040 ?

-------------------------------------------------------------------------------
AMMX/PMINxB

   mnemonic:
        pminub ,b,d
        pminsb ,b,d

   short:
        byte-by-byte vector compare and obtain smaller

   graphic:
        input a
        -----------------------------------------
        | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 |
        -----------------------------------------
          |    |   ...
        | | | |
        input b
        -----------------------------------------
        | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |
        -----------------------------------------
          |    |   ...        |    |    |    |
          \___   \____________     ...     \___
              \               \                \
        --------------   --------------       --------------
        |  b0 < a0 ? |   |  b1 < a1 ? |  ...  |  b7 < a7 ? |
        |  b0 : a0   |   |  b1 : a1   |       |  b7 : a7   |
        --------------   --------------       --------------
           ___/    ____________/      ...       ___/
          |       /                            /
        -----------------------------------------
        | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 |
        -----------------------------------------

   equivalent C Code:
        int i;
        unsigned char au[8]; /* unsigned inputs */
        unsigned char bu[8];
        char as[8];          /* signed inputs */
        char bs[8];
        unsigned char d[8];

        /* both variants shown in this loop, in practice only one
           condition applies per instruction */
        for( i = 0 ; i<8 ; i++ )
        {
          /* pminsb */
          d[i] = ( bs[i] < as[i] ) ? bs[i] : as[i];
          /* pminub */
          d[i] = ( bu[i] < au[i] ) ? bu[i] : au[i];
        }

   typical application cases:
        PMIN simplifies clipping operations. Normally, one would write a
        pcmp/bsel sequence. PMIN reduces the operation to a single
        instruction.

   examples:
        pmaxub  E0,E1,E2    ; max( E0, E1 )
        pminub  E0,E1,E1    ; min( E0, E1 )
        psubb   E1,E2,E1    ; E1 = max( E0,E1 ) - min( E0,E1 ) = abs( E0-E1 )

-------------------------------------------------------------------------------
AMMX/PMAXxB

   mnemonic:
        pmaxub ,b,d
        pmaxsb ,b,d

   short:
        byte-by-byte vector compare and obtain larger

   graphic:
        input a
        -----------------------------------------
        | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 |
        -----------------------------------------
          |    |   ...        |    |    |    |
        input b
        -----------------------------------------
        | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |
        -----------------------------------------
          |    |   ...        |    |    |    |
          \___   \____________     ...     \___
              \               \                \
        --------------   --------------       --------------
        |  b0 > a0 ? |   |  b1 > a1 ? |  ...  |  b7 > a7 ? |
        |  b0 : a0   |   |  b1 : a1   |       |  b7 : a7   |
        --------------   --------------       --------------
           ___/    ____________/      ...
___/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | ----------------------------------------- equivalent C Code: int i; unsigned char au[8]; /* unsigned inputs */ unsigned char bu[8]; char as[8]; /* signed inputs */ char bs[8]; unsigned char d[8]; /* both variants shown in this loop, in practice only one condition apply per instruction */ for( i = 0 ; i<8 ; i++ ) { /* pmaxsb */ d[i] = ( bs[i] > as[i] ) ? bs[i] : as[i]; /* pmaxub */ d[i] = ( bu[i] > au[i] ) ? bu[i] : au[i]; } typical application cases: PMAX allows to simplify clipping operations. Normally, one would write a pcmp/bsel sequence. This construct by PMAX reduces the operation to one instruction only. examples: pmaxub E0,E1,E2 ; max( E0, E1 ) pminub E0,E1,E1 ; min( E0, E1 ) psubb E1,E2,E1 ; E1 = max( E0,E1 ) - min( E0,E1 ) = abs( E0-E1 ) ------------------------------------------------------------------------------- AMMX/PMINxW mnemonic: pminuw ,b,d pminsw ,b,d short: short-by-short vector compare and obtain smaller graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | ----------------------------------------- | | ... | | | | \_ \_____ ... \____ | \ \ -------------- -------------- -------------- | b0 < a0 ? | | b1 < a1 ? | ... | b3 < a3 ? | | b0 : a0 | | b1 : a1 | | b3 : a3 | -------------- -------------- -------------- / ____/ ... ___/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | ----------------------------------------- equivalent C Code: int i; unsigned short au[4]; /* unsigned inputs */ unsigned short bu[4]; short as[4]; /* signed inputs */ short bs[4]; short d[4]; /* both variants shown in this loop, in practice only one condition apply per instruction */ for( i = 0 ; i<4 ; i++ ) { /* pminsw */ d[i] = ( bs[i] < as[i] ) ? bs[i] : as[i]; /* pminuw */ d[i] = ( bu[i] < au[i] ) ? 
bu[i] : au[i]; } typical application cases: PMIN allows to simplify clipping operations. Normally, one would write a pcmp/bsel sequence. This construct by PMAX reduces the operation to one instruction only. examples: pmaxuw E0,E1,E2 ; max( E0, E1 ) pminuw E0,E1,E1 ; min( E0, E1 ) psubw E1,E2,E1 ; E1 = max( E0,E1 ) - min( E0,E1 ) = abs( E0-E1 ) ------------------------------------------------------------------------------- AMMX/PMAXxW mnemonic: pmaxuw ,b,d pmaxsw ,b,d short: short-by-short vector compare and obtain larger graphic: input a ----------------------------------------- | a0 | a1 | a2 | a3 | ----------------------------------------- | | ... | | | | input b ----------------------------------------- | b0 | b1 | b2 | b3 | ----------------------------------------- | | ... | | | | \_ \_____ ... \____ | \ \ -------------- -------------- -------------- | b0 > a0 ? | | b1 > a1 ? | ... | b3 > a3 ? | | b0 : a0 | | b1 : a1 | | b3 : a3 | -------------- -------------- -------------- / ____/ ... ___/ | / / ----------------------------------------- | d0 | d1 | d2 | d3 | ----------------------------------------- equivalent C Code: int i; unsigned short au[4]; /* unsigned inputs */ unsigned short bu[4]; short as[4]; /* signed inputs */ short bs[4]; short d[4]; /* both variants shown in this loop, in practice only one condition apply per instruction */ for( i = 0 ; i<4 ; i++ ) { /* pmaxsw */ d[i] = ( bs[i] > as[i] ) ? bs[i] : as[i]; /* pmaxuw */ d[i] = ( bu[i] > au[i] ) ? bu[i] : au[i]; } typical application cases: PMAX allows to simplify clipping operations. Normally, one would write a pcmp/bsel sequence. This construct by PMAX reduces the operation to one instruction only. examples: pmaxuw E0,E1,E2 ; max( E0, E1 ) pminuw E0,E1,E1 ; min( E0, E1 ) psubw E1,E2,E1 ; E1 = max( E0,E1 ) - min( E0,E1 ) = abs( E0-E1 ) -------------------------------------------------------------------------------
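   note on the PMINxW/PMAXxW entries above:
        The two uses named there (clipping, and the max/min/sub sequence
        for abs( E0-E1 )) can be checked on a host machine with a scalar
        model. The following is only a minimal sketch in plain C, assuming
        four unsigned 16-bit lanes per vector and the operand order of the
        "equivalent C Code" sections. The names LANES, model_pminuw,
        model_pmaxuw, model_clamp_uw and model_absdiff_uw are hypothetical
        helpers for illustration; they are not part of any AMMX toolchain.

```c
#include <stdint.h>

#define LANES 4 /* assumption: one 64-bit vector = 4 unsigned words */

/* Scalar model of pminuw a,b,d: d[i] = smaller of a[i] and b[i]. */
static void model_pminuw(const uint16_t a[LANES], const uint16_t b[LANES],
                         uint16_t d[LANES])
{
    for (int i = 0; i < LANES; i++)
        d[i] = (b[i] < a[i]) ? b[i] : a[i];
}

/* Scalar model of pmaxuw a,b,d: d[i] = larger of a[i] and b[i]. */
static void model_pmaxuw(const uint16_t a[LANES], const uint16_t b[LANES],
                         uint16_t d[LANES])
{
    for (int i = 0; i < LANES; i++)
        d[i] = (b[i] > a[i]) ? b[i] : a[i];
}

/* Clipping: raise every lane to at least lo, then cap it at hi.
   This is one max followed by one min, with no compare/select pair. */
static void model_clamp_uw(const uint16_t v[LANES], uint16_t lo, uint16_t hi,
                           uint16_t d[LANES])
{
    uint16_t lov[LANES], hiv[LANES], t[LANES];

    for (int i = 0; i < LANES; i++) {
        lov[i] = lo; /* constant replicated into all lanes */
        hiv[i] = hi;
    }
    model_pmaxuw(lov, v, t);
    model_pminuw(hiv, t, d);
}

/* Absolute difference as in the examples: max(a,b) - min(a,b).
   The subtraction cannot wrap because max >= min in every lane. */
static void model_absdiff_uw(const uint16_t a[LANES], const uint16_t b[LANES],
                             uint16_t d[LANES])
{
    uint16_t mx[LANES], mn[LANES];

    model_pmaxuw(a, b, mx);
    model_pminuw(a, b, mn);
    for (int i = 0; i < LANES; i++)
        d[i] = (uint16_t)(mx[i] - mn[i]); /* models psubw */
}
```

        Calling model_absdiff_uw() on two arrays of four words yields the
        per-lane distance regardless of which input is larger, which is
        why the examples use the unsigned variants with a plain subtract.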