




Can I Have Some New AMMX Mnemonics?

Andreas Timmermann

Posts 11
01 Aug 2017 23:26


Hello,

Is there any way to compare 4 shorts with one instruction?
Maybe something like pge d0,d1,d2, where d0 and d1 hold the 8 values (4 words each) and d2 receives the result mask: 0 where d0.w < d1.w and 1 where d0.w >= d1.w.
With that I could do 4 z-buffer compares in a single cycle :-)

Best regards

PS: What is this?
PCMPgt.W (ea),B,D Parallel CMPgt.W - Greater Than.


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
02 Aug 2017 08:57


Andreas Timmermann wrote:

  Hello,
 
  Is there any way to compare 4 shorts with one instruction?
 

 
  Yes, this is possible.
 
  AMMX supports the following parallel compare instructions:
 
  PCMPeq.b
  PCMPeq.w
  PCMPgt.b
  PCMPgt.w
 
  The ".b" instructions do 8 parallel BYTE compares.
  The ".w" instructions do 4 parallel WORD/SHORT compares.
 
  These instructions have 3 operands:
  A,B,D
  A and B are compared and the resulting BITMASK is stored in D.
 
 
 
Andreas Timmermann wrote:

  Maybe something like pge d0,d1,d2, where d0 and d1 hold the 8 values (4 words each) and d2 receives the result mask: 0 where d0.w < d1.w and 1 where d0.w >= d1.w.
 

  Yes, exactly like this.
  The result per operand is either "00" or "FF" for FALSE / TRUE.
 
 
Andreas Timmermann wrote:

  With that I could do 4 z-buffer compares in a single cycle :-)
 

  Yes, you can do this.
 
 
  I assume you then also want to conditionally store 4 pixels?
  And conditionally update the 4 Z values?
  What pixel format do you use: 8-bit, 16-bit, 24-bit?


Andreas Timmermann

Posts 11
02 Aug 2017 09:08


My Z-buffer is 16-bit, so this is a very good solution :-) Thank you.
There is a 4-pixel store that I use, so that's OK.
BTW: an AMMX addition with carry flag would also be cool, so that I can do 32-bit additions ;-)


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
02 Aug 2017 10:14


Hello Andreas,
 
if you want to conditionally update the Z-buffer,
then a good solution is to use a BSEL after the PCMP.
The BSEL has 3 inputs: A, B, and MASK.
 
This way you can pick the 4 highest Z values in parallel with 1 instruction, and write them back into the Z-buffer with 1 more instruction.

The AMMX STORE can also be done in the 2nd pipe, which means it can be executed superscalar in 1 cycle.

Do you CLEAR the Z-buffer every frame?
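
For readers following along, the PCMP-then-BSEL update described above can be modelled in plain C. This is only a rough sketch of the data flow, not AMMX code: the function names are made up for illustration, and it assumes an unsigned compare, an all-ones mask (0xFFFF) per true word lane, and the "keep the highest Z" convention from this post.

```c
#include <stdint.h>

#define LANES 4 /* four 16-bit words per 64-bit register */

/* Model of a parallel word compare (assumption: unsigned, all-ones on true):
   each lane of mask becomes 0xFFFF where a > b, else 0x0000. */
void pcmpgt_w(const uint16_t a[LANES], const uint16_t b[LANES],
              uint16_t mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFF : 0x0000;
}

/* Model of a bit select: take bits from a where mask is 1, from b where 0. */
void bsel(const uint16_t a[LANES], const uint16_t b[LANES],
          const uint16_t mask[LANES], uint16_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (uint16_t)((a[i] & mask[i]) | (b[i] & ~mask[i]));
}

/* Keep the higher of the incoming and the stored Z value in every lane,
   then write the merged lanes back to the Z-buffer. */
void zbuffer_update4(uint16_t zbuf[LANES], const uint16_t z_new[LANES])
{
    uint16_t mask[LANES], merged[LANES];

    pcmpgt_w(z_new, zbuf, mask);     /* compare step */
    bsel(z_new, zbuf, mask, merged); /* select step  */
    for (int i = 0; i < LANES; i++)  /* store step   */
        zbuf[i] = merged[i];
}
```

The same mask can be reused to decide which of the 4 pixels get written.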
 


Andreas Timmermann

Posts 11
02 Aug 2017 10:55


No. It's a compressed buffer, like Nvidia's. I only clear 1/16 of the Z-buffer :-) So it's very fast.
I also don't write the buffer every time. I only update it at the end of the frame, and only if 2 blocks are in the same z-range.
That means: if you have a block of 4x4 pixels and its zmin is higher than the zmax stored for this block in the Z-buffer, you can skip the whole block. And if the zmax of the block you want to write is nearer than the zmin of the block in the Z-buffer, you only write the buffer; you don't need to compare.
Also, you only check the min/max for the block. Only if you find more than 1 block in the same min/max range do you need to compare every single entry of the block. So you save a lot of Z-buffer compares and writes.
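
The per-block decision described above could be sketched in C roughly like this. It is an illustration only, not the actual TinyGL or renderer code: the type and function names are invented, and it assumes that a smaller z means nearer to the viewer.

```c
#include <stdint.h>

/* What to do with an incoming 4x4 block (names are illustrative). */
typedef enum {
    BLOCK_SKIP,      /* incoming block is completely hidden          */
    BLOCK_OVERWRITE, /* incoming block is completely in front        */
    BLOCK_COMPARE    /* z-ranges overlap: per-pixel compare required */
} block_action;

/* Min/max depth of one 4x4 block. Assumption: smaller z = nearer. */
typedef struct {
    uint16_t zmin;
    uint16_t zmax;
} z_range;

block_action classify_block(z_range incoming, z_range stored)
{
    if (incoming.zmin > stored.zmax)
        return BLOCK_SKIP;      /* even its nearest point is behind */
    if (incoming.zmax < stored.zmin)
        return BLOCK_OVERWRITE; /* even its farthest point is in front */
    return BLOCK_COMPARE;       /* only now touch the 16 single entries */
}
```

Only the BLOCK_COMPARE case needs the parallel compares on the individual Z values.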


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
02 Aug 2017 11:50


Andreas Timmermann wrote:

No. It's a compressed buffer, like Nvidia's. I only clear 1/16 of the Z-buffer :-) So it's very fast.

Very interesting.

What type of application / game is this?

I assume you also update pixels, right?
How do you want to update them?
There are 2 options:
A) READ BACKGROUND, MERGE, WRITE BACK
B) WRITE WITH MASK

The 2nd option is a special APOLLO feature and saves you 1 memory access.
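
To make the difference between the two options concrete, here is a small C model. It is a sketch under assumptions: the function names are hypothetical, the pixels are taken to be 16-bit, and the mask is taken to be all ones or all zeros per pixel. The real granularity of the masked write is not stated in this thread.

```c
#include <stdint.h>

#define PIXELS 4 /* four 16-bit pixels per 64-bit store */

/* Option A: read background, merge, write back.
   Every pixel costs one read and one write of the destination. */
void store_read_merge_write(uint16_t dst[PIXELS], const uint16_t src[PIXELS],
                            const uint16_t mask[PIXELS])
{
    for (int i = 0; i < PIXELS; i++) {
        uint16_t bg = dst[i]; /* the extra memory read */
        dst[i] = (uint16_t)((src[i] & mask[i]) | (bg & ~mask[i]));
    }
}

/* Option B: write with mask.
   The destination is never read; unselected pixels are left untouched. */
void store_with_mask(uint16_t dst[PIXELS], const uint16_t src[PIXELS],
                     const uint16_t mask[PIXELS])
{
    for (int i = 0; i < PIXELS; i++)
        if (mask[i])
            dst[i] = src[i];
}
```

With all-or-nothing masks both functions leave the same result in memory; option B just gets there without loading the background first.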


Andrew Copland

Posts 113
02 Aug 2017 14:12


@Gunnar, it's a hierarchical Z-buffer, as usually used for rendering. Remember I wrote one for the old Natami 3D-core development work, and we planned to have it as a feature?
 
It can give a really good speedup when combined with a block/tile-based rasteriser. You'll read about it being used for occlusion culling on modern GPUs.


Andreas Timmermann

Posts 11
03 Aug 2017 00:46


I want to optimize the TinyGL library.
The first priority was to reduce the Z-buffer usage.
I haven't thought about the color buffer yet; that comes much later :-D
So I will see ;-)



Gregthe Canuck

Posts 274
03 Aug 2017 03:11


Would it be possible for a block/tile renderer to take advantage of the core's multithreading features mentioned back in June?
 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
03 Aug 2017 07:00


gregthe canuck wrote:

Would it be possible for a block/tile renderer to take advantage of the core's multithreading features mentioned back in June?

Yes, this could be done.
But writing multithreaded code is of course extra work.

There are also other coding options you can use on APOLLO.

a) Use AMMX for SIMD data parallelism.

b) Benefit from more registers.
APOLLO has 16 pointer registers and 32 data registers it can use.
These registers can be used both with AMMX and with every normal instruction.

c) Benefit from data prefetch.
APOLLO automatically detects memory streams and prefetches them for you. This is very useful for linear memory access.
For "random" / non-linear memory read patterns the programmer can use TOUCH instructions to preload cache lines in advance.
Used well, this lets the programmer completely avoid memory latency.
 
d) Benefit from BRANCH rewrite.
APOLLO will internally rewrite code like this


Bcc notdo
  1 Instruction
notdo:

into a conditional form of this instruction.
This rewrite avoids taking a branch and also avoids any misprediction, which makes this instruction construct much faster.

e) Optimize BRANCH prediction.
For branches which can NOT be avoided, and for semi-random branches which have no clear pattern, the programmer can use HINT instructions to help the CPU. In cases where the condition can be calculated some cycles before the BRANCH is used, this technique can guarantee 100% correct branch prediction.
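
As a rough illustration of point d), this is what "a branch over 1 instruction" versus its conditional form looks like when written out in C. It is only an analogy with invented function names: on APOLLO the rewrite happens inside the CPU, so the programmer keeps writing the branchy form.

```c
#include <stdint.h>

/* Branchy form: corresponds to "Bcc notdo / 1 instruction / notdo:". */
int32_t clamp_min_branch(int32_t v, int32_t lo)
{
    if (v < lo) /* Bcc skips the move when the condition is false */
        v = lo; /* the single instruction being skipped */
    return v;
}

/* Conditional form: the move always flows through the pipeline, and the
   condition only decides which value is kept. No branch is taken. */
int32_t clamp_min_select(int32_t v, int32_t lo)
{
    int32_t take = -(int32_t)(v < lo); /* all ones if true, else zero */
    return (lo & take) | (v & ~take);
}
```

Both functions return the same value for every input; the second one simply has nothing left to mispredict.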


Lorenzo Pistone

Posts 22
03 Aug 2017 08:48


Is it possible to include an OpenCL unit in the Apollo core alongside the other instructions, or would it work better in a separate FPGA?


Andreas Timmermann

Posts 11
03 Aug 2017 09:09


That sounds very good.
Maybe you can help me with another code construct.
At the moment my z-value is a 32-bit value, but my buffer is a 16-bit buffer.
Is it possible to get an addition with carry flag? Then I could use 2 additions to get
4x 32-bit additions with AMMX.
That would be nice, because I could keep the 4 lower 16-bit values in one register and the 4 higher 16-bit values in another register ;-)
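
What is being asked for here can be written down as a C model. Note that this is a sketch of a wished-for operation with an invented function name, not of an existing AMMX instruction: a per-lane carry out of the low-word addition feeds the high-word addition.

```c
#include <stdint.h>

#define LANES 4

/* lo[] and hi[] hold the low and high 16-bit halves of four 32-bit values,
   dlo[] and dhi[] the halves of the four 32-bit increments.
   Step 1 adds the low halves and remembers one carry per lane.
   Step 2 adds the high halves plus that carry. */
void add32_split(uint16_t lo[LANES], uint16_t hi[LANES],
                 const uint16_t dlo[LANES], const uint16_t dhi[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint32_t sum = (uint32_t)lo[i] + dlo[i];
        uint32_t carry = sum >> 16; /* 0 or 1 */

        lo[i] = (uint16_t)sum;
        hi[i] = (uint16_t)(hi[i] + dhi[i] + carry);
    }
}
```

For example, a lane holding 0x0000FFFF plus an increment of 1 ends up as hi = 0x0001, lo = 0x0000. With this layout the high register already contains the four 16-bit values that would go into a 16-bit buffer.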



Gunnar von Boehn
(Apollo Team Member)
Posts 6197
03 Aug 2017 10:48


Andreas Timmermann wrote:

That sounds very good.
  Maybe you can help me with another code construct.
  At the moment my z-value is a 32-bit value, but my buffer is a 16-bit buffer.

Not sure how this works.
You store 16 bits in memory and compare against those 16 bits?




Andreas Timmermann

Posts 11
03 Aug 2017 12:58


I don't think that this is a good solution.
  An add with carry flag would be very helpful for texture mapping too. I think the fastest texture mapping on the Amiga goes like this:
 
  d0.l = xx00YYyy
  d1.l = 000000XX
  XX and YY are the high bytes and xx and yy are the low bytes.
 
  Then you have the gradients of x and y in d2 and d3:
 
  add.l d2,d0
  addx.l d3,d1
  move.w d0,d4
  move.b d1,d4
 
  And then you have it: d4.w = YYXX
 
  And this 4x with AMMX. ;-)
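
For anyone who has not seen this trick before, here is a C model of one step of the four instructions above. The struct and function names are invented for the sketch; the register layout follows the post, and it assumes d2 is laid out like d0 and d3 like d1.

```c
#include <stdint.h>

typedef struct {
    uint32_t d0; /* xx00YYyy: x fraction in the top byte, y as 8.8 in the low word */
    uint32_t d1; /* 000000XX: x integer part in the low byte */
} tex_pos;

/* Advance the texture position by one pixel and return the texel
   offset YYXX, as it would end up in d4.w. */
uint16_t tex_step(tex_pos *p, uint32_t d2, uint32_t d3)
{
    uint64_t wide = (uint64_t)p->d0 + d2;     /* add.l  d2,d0 */
    uint32_t x_flag = (uint32_t)(wide >> 32); /* carry out of xx -> X flag */

    p->d0 = (uint32_t)wide;
    p->d1 = p->d1 + d3 + x_flag;              /* addx.l d3,d1 */

    /* move.w d0,d4 gives YYyy, then move.b d1,d4 replaces yy with XX */
    return (uint16_t)((p->d0 & 0xFF00u) | (p->d1 & 0x00FFu));
}
```

Starting at x = 5.5, y = 1.5 (d0 = 0x80000180, d1 = 0x05) and stepping by dx = 1.5, dy = 0.5 (d2 = 0x80000080, d3 = 0x01), the first call returns 0x0207, that is YY = 2 and XX = 7. The offset wraps naturally at 256 in both directions, which is why this trick is tied to 256x256 textures.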


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
03 Aug 2017 13:20


Andreas Timmermann wrote:

  add.l d2,d0
  addx.l d3,d1
  move.w d0,d4
  move.b d1,d4

Yes, I know this type of code.
Of course one has to keep in mind that building this
D4 index register and then using it in a MOVE.B (A0,D4)
has an ALU-to-EA latency. So between the 2 instructions
there should be 2 cycles = 4 instructions on the 68060 to avoid the bubble.

I used to write the code interleaved like this in the old days.


        add.l  A0,D7                  ; xx__YYyy
        addx.b  D4,D6                  ; ______XX
        move.w  D7,D1                  ; 04
        move.b  D6,D1
        move.w  (A6,D0.l*2),(A1)+      ; move pixel

        add.l  A0,D7                  ; xx__YYyy
        addx.b  D4,D6                  ; ______XX
        move.w  D7,D0                  ; 01
        move.b  D6,D0
        move.w  (A6,D1.l*2),(A1)+      ; move pixel




Andreas Timmermann

Posts 11
03 Aug 2017 13:30


Very nice.
  And now I just want to combine that with AMMX like this:
 
  e0 = YYXXYYXXYYXXYYXX
  e1 = yyxxyyxxyyxxyyxx
 
  e2 holds the high-byte gradients
  e3 holds the low-byte gradients
 
  padd.b e3,e1,e1
  paddx.b e2,e0,e0
 
  and so on ;-)
 
  4 pixels with 2 additions, and already in the right position ;-)

But I think the memory accesses will kill all the saved cycles :-/


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
03 Aug 2017 13:34


Andreas Timmermann wrote:

Very nice.
And now I just want to combine that with AMMX like this

Working with "next pixel" 8-bit color was OK in the early 90s.
Today we could fetch 4 texels per pixel and bilinearly interpolate them with AMMX. This will look many times better.


Andreas Timmermann

Posts 11
03 Aug 2017 13:43


OK, I must think about that :-D I have thought about bilinear interpolation on AMMX, but I need 2 more memory accesses per pixel, and the textures must be 32-bit, or I need palettes. But that needs more random memory accesses. This sounds horrible.


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
03 Aug 2017 13:47


Andreas Timmermann wrote:

OK, I must think about that :-D I have thought about bilinear interpolation on AMMX, but I need 2 more memory accesses per pixel, and the textures must be 32-bit, or I need palettes. But that needs more random memory accesses. This sounds horrible.

Let's say the texture is 16-bit.
Then with 2 instructions you can fetch 4 texels:
MOVE.L (A0,D0),E0
MOVE.L WIDTH(A0,D0),E1
 
right?
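
As a plain C illustration of the idea (two row fetches of two 16-bit texels each, then a bilinear blend), something like the following would do. It is a sketch under assumptions, not AMMX code: the texel format is assumed to be RGB565, the weights are 8-bit fractions (0..255), all names are made up, and nothing here handles wrapping, so the caller must keep x+1 and y+1 inside the texture.

```c
#include <stdint.h>

/* Blend two channel values with an 8-bit weight: w = 0 returns a. */
uint32_t lerp8(uint32_t a, uint32_t b, uint32_t w)
{
    return (a * (256u - w) + b * w) >> 8;
}

/* Blend two RGB565 texels channel by channel (format is an assumption). */
uint16_t blend565(uint16_t a, uint16_t b, uint32_t w)
{
    uint32_t r  = lerp8(a >> 11, b >> 11, w);
    uint32_t g  = lerp8((a >> 5) & 0x3Fu, (b >> 5) & 0x3Fu, w);
    uint32_t bl = lerp8(a & 0x1Fu, b & 0x1Fu, w);

    return (uint16_t)((r << 11) | (g << 5) | bl);
}

/* Sample at integer position (x, y) with fractions fx, fy in 0..255.
   width is the texture width in texels. */
uint16_t sample_bilinear(const uint16_t *tex, int width, int x, int y,
                         uint32_t fx, uint32_t fy)
{
    const uint16_t *row0 = tex + y * width + x; /* 1st fetch: (x,y), (x+1,y)     */
    const uint16_t *row1 = row0 + width;        /* 2nd fetch: (x,y+1), (x+1,y+1) */

    uint16_t top    = blend565(row0[0], row0[1], fx);
    uint16_t bottom = blend565(row1[0], row1[1], fx);

    return blend565(top, bottom, fy);
}
```

So the cost per pixel is 2 memory accesses for the 4 texels, plus 3 blends that are the natural candidates for SIMD.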


Andreas Timmermann

Posts 11
03 Aug 2017 13:50


The problem in the code is the wrapping.
You need a compare and a branch every time.
