
Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.
Performance and Benchmark Results!

OpenGL On Vampire Cards

Gunnar von Boehn
(Apollo Team Member)
Posts 4754
16 Mar 2017 09:05


Norbert Kett wrote:

  on other areas we need float math with FPU.
 

  Of course, the softfloat code makes no sense.
 
 
 
Norbert Kett wrote:

  for example a matrix multiplication uses 64 fmul, 64 fadd. its 'just' 128 fpu instructions.
 

Can you explain which type of matrix multiplication this is?
128 FPU instructions is an odd number.
 
For a normal (3x3) x (3) matrix * vector multiplication you need
9 FMUL and 6 FADD = 15 clock cycles.

For a normal (4x4) x (4) matrix * vector multiplication you need
16 FMUL and 12 FADD = 28 clock cycles.
 
Either option, 15 to 28 clocks, is peanuts.
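
The operation counts for a matrix * vector product can be checked with a small C sketch. This is only an illustration: the `OpCount` struct and the `mat_vec` name are made up here, and the "one instruction per clock" reading assumes a fully pipelined FPU, as the post implies. Each row needs n multiplies and n-1 adds, so a 3x3 case gives 9 mul + 6 add and a 4x4 case gives 16 mul + 12 add.

```c
#include <assert.h>

/* Hypothetical counters, only here to make the operation count visible. */
typedef struct { int mul; int add; } OpCount;

/* y = M * x for an n x n row-major matrix, 1 <= n <= 4.
   Each row costs n multiplies and n-1 adds. */
void mat_vec(const float *m, const float *x, float *y, int n, OpCount *ops)
{
    assert(n >= 1 && n <= 4);
    for (int r = 0; r < n; r++) {
        float s = m[r * n] * x[0];        /* first product, no add needed */
        ops->mul++;
        for (int c = 1; c < n; c++) {
            s += m[r * n + c] * x[c];     /* one multiply, one add */
            ops->mul++;
            ops->add++;
        }
        y[r] = s;
    }
}
```

Calling `mat_vec` with n = 3 and n = 4 and reading back `ops` gives the two counts directly.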


Thellier Alain

Posts 116
16 Mar 2017 10:39


Hi Wawa Hi everybody

@Gunnar
As you know there are several states in GL like texturing, gouraud, blending...
Plus the various Z tests (less, greater, no zbuffer, etc.)

So the drawpoly will need to have several versions
if it needs a texture pixel
if it needs to interpolate color
if it needs to blend color & tex
etc...

So the question: is it conceivable to auto-generate the 68080+AMMX code based on those states? (to have a function that does exactly what is needed, no more)

Alain Thellier




Gunnar von Boehn
(Apollo Team Member)
Posts 4754
16 Mar 2017 11:03


Salut Alain,

thellier alain wrote:

As you know there are several states in GL like texturing, gouraud, blending...
Plus the various Z tests (less, greater, no zbuffer, etc.)

  So the drawpoly will need to have several versions
  if it needs a texture pixel
  if it needs to interpolate color
  if it needs to blend color & tex
  etc...


Yes

thellier alain wrote:

So the question: is it conceivable to auto-generate the 68080+AMMX code based on those states? (to have a function that does exactly what is needed, no more)

Alain Thellier


Not sure what you mean by auto-generate exactly.

There is no point in using C IMHO.
C loses too much speed. You can waste this if your CPU runs at 3 GHz, but not when you run in the 100 MHz range.

I see good speed potential for us if we code well in ASM.
For example I can alpha-blend two 32-bit texels (A,R,G,B)
with 1 ASM instruction, needing only 1 clock cycle.

For a beautiful bilinear interpolation of 4 source texels (each 32-bit ARGB) we need to run this instruction 3 times = 3 clocks.

With conditional stores we can avoid branches in the code.
This means the whole rasterizer should be written branch-free,
and laid out with prefetching to hide memory latency.

We can discuss special support for some texture formats.
I could offer you an instruction for doing CLUT8 texture expansion in HW, and very fast. We could also help to speed up other formats.

I think there are a lot of options.
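
Why bilinear filtering comes out as three blends can be seen in a plain C sketch: two horizontal blends and one vertical blend of the results. This is not the AMMX instruction mentioned in the post; `blend_argb` and `bilinear_argb` are hypothetical names, and the 0..256 weight range is an assumption chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Blend two packed 32-bit ARGB texels per 8-bit channel:
   result = (a * (256 - t) + b * t) / 256, with t in 0..256.
   Scalar stand-in for what the post describes as one blend instruction. */
uint32_t blend_argb(uint32_t a, uint32_t b, unsigned t)
{
    assert(t <= 256);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        unsigned ca = (a >> shift) & 0xFF;
        unsigned cb = (b >> shift) & 0xFF;
        unsigned c  = (ca * (256 - t) + cb * t) >> 8;   /* stays in 0..255 */
        out |= (uint32_t)c << shift;
    }
    return out;
}

/* Bilinear filter of a 2x2 texel block: three calls to the blend.
   fx and fy are the horizontal and vertical weights, 0..256. */
uint32_t bilinear_argb(uint32_t t00, uint32_t t10,
                       uint32_t t01, uint32_t t11,
                       unsigned fx, unsigned fy)
{
    uint32_t top    = blend_argb(t00, t10, fx);   /* blend 1: upper row  */
    uint32_t bottom = blend_argb(t01, t11, fx);   /* blend 2: lower row  */
    return blend_argb(top, bottom, fy);           /* blend 3: between rows */
}
```

In the scalar version each blend is a four-channel loop; the point of a packed instruction is that the whole loop body collapses into one operation.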


Norbert Kett
(Apollo Team Member)
Posts 38
16 Mar 2017 11:25


Gunnar von Boehn wrote:

  Can you explain which type of matrix multiplication this is?
  128 FPU instructions is an odd number.
 

 
  TinyGL has the following matrix mul function:
 
  void gl_M4_MulLeft(M4 *c, M4 *b)
  {
    int i, j, k;
    float s;
    M4 a;
 
    a = *c;   /* work on a copy, since c is overwritten below */
 
    for (i = 0; i < 4; i++)
    {
      for (j = 0; j < 4; j++)
      {
        s = 0.0;
        for (k = 0; k < 4; k++)
        {
          s += a.m[i][k] * b->m[k][j];
        }
        c->m[i][j] = s;
      }
    }
  }
 
  the inner part runs 64 times, so it's 64 mul and 64 add.


Thellier Alain

Posts 116
16 Mar 2017 13:15


Hello again

@Norbert Kett
Yes, it is 64 mul & 48 add, see below
Anyway, there are not that many matrix muls in a program; I mean there are many more calls to drawpoly/drawpixel

@Gunnar
Of course the C code was only here for demonstration
I meant generating the ASM on the fly based on the GL states, so the drawpoly code will only contain (and loop on) the needed instructions

Alain

/*=================================================================*/
inline void MultM(register float *M1,register float *M2,float *M3)
{
float M[16];

  M[0 ]= M1[0]*M2[0] + M1[1]*M2[4] + M1[2]*M2[8] + M1[3]*M2[12];
  M[1 ]= M1[0]*M2[1] + M1[1]*M2[5] + M1[2]*M2[9] + M1[3]*M2[13];
  M[2 ]= M1[0]*M2[2] + M1[1]*M2[6] + M1[2]*M2[10] + M1[3]*M2[14];
  M[3 ]= M1[0]*M2[3] + M1[1]*M2[7] + M1[2]*M2[11] + M1[3]*M2[15];

  M[4 ]= M1[4]*M2[0] + M1[5]*M2[4] + M1[6]*M2[8] + M1[7]*M2[12];
  M[5 ]= M1[4]*M2[1] + M1[5]*M2[5] + M1[6]*M2[9] + M1[7]*M2[13];
  M[6 ]= M1[4]*M2[2] + M1[5]*M2[6] + M1[6]*M2[10] + M1[7]*M2[14];
  M[7 ]= M1[4]*M2[3] + M1[5]*M2[7] + M1[6]*M2[11] + M1[7]*M2[15];

  M[8 ]= M1[8]*M2[0] + M1[9]*M2[4] + M1[10]*M2[8] + M1[11]*M2[12];
  M[9 ]= M1[8]*M2[1] + M1[9]*M2[5] + M1[10]*M2[9] + M1[11]*M2[13];
  M[10]= M1[8]*M2[2] + M1[9]*M2[6] + M1[10]*M2[10] + M1[11]*M2[14];
  M[11]= M1[8]*M2[3] + M1[9]*M2[7] + M1[10]*M2[11] + M1[11]*M2[15];

  M[12]= M1[12]*M2[0] + M1[13]*M2[4] + M1[14]*M2[8] + M1[15]*M2[12];
  M[13]= M1[12]*M2[1] + M1[13]*M2[5] + M1[14]*M2[9] + M1[15]*M2[13];
  M[14]= M1[12]*M2[2] + M1[13]*M2[6] + M1[14]*M2[10] + M1[15]*M2[14];
  M[15]= M1[12]*M2[3] + M1[13]*M2[7] + M1[14]*M2[11] + M1[15]*M2[15];

  CopyM(M3,M);
}



Norbert Kett
(Apollo Team Member)
Posts 38
16 Mar 2017 13:32


Alain: yes, this is the same as irrlicht's 4x4 matrix mul function (64 mul, 48 add). you are right, the rendering consumes the most time.

TinyGL has many types of triangle fill functions, and if i implement alpha blending there will be many more. it's ok.


Gunnar von Boehn
(Apollo Team Member)
Posts 4754
16 Mar 2017 13:41


Hi Alain,
 
thellier alain wrote:

@Gunnar
Of course the C code was only here for demonstration
I meant generating the ASM on the fly based on the GL states, so the drawpoly code will only contain (and loop on) the needed instructions

 
Yes ABSOLUTELY!
All conditions and checks need to be done outside the per-line rasterizing. And the line loop needs to be tuned optimally without any IF/THEN/ELSE.
 
But why generate on the fly?
Why not pre-define the different functions fully?

If we do it like this, then we will see very good speed.
 
Alain, would you like to work with us on this?
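
A line loop without IF/THEN/ELSE could look roughly like the C sketch below, written for one fixed state (Z_LESS, flat colour). The mask select is a scalar stand-in for a conditional store; `span_zless` is a hypothetical name, and keeping `z` constant across the span is a simplification made only to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Z_LESS span for one pre-selected state: colour and depth are written
   only where z < zbuf[i]. The choice of Z mode happened before this
   function was picked, so the loop body contains no state checks and
   no if/else. */
void span_zless(uint32_t *color, uint16_t *zbuf, int count,
                uint32_t src, uint16_t z)
{
    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        /* all ones when the depth test passes, zero otherwise */
        uint32_t pass = (uint32_t)0 - (uint32_t)(z < zbuf[i]);
        color[i] = (src & pass) | (color[i] & ~pass);
        zbuf[i]  = (uint16_t)((z & pass) | (zbuf[i] & ~pass));
    }
}
```

Whether a C compiler really emits branch-free code for the comparison depends on the target, which is one reason the thread argues for hand-written ASM here.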
 
 


Norbert Kett
(Apollo Team Member)
Posts 38
17 Mar 2017 04:52


- for me it's ok if every GL state combination has an optimized triangle filler (we have no GHz CPU yet :), except for the Z function, where the instruction can be changed.
- correct me if i'm wrong, but as i see it Alain's example does not do perspective-correct texture mapping. TinyGL has perspective-correct mapping, and because of this its render code is more complex.
- i measured the float ranges in use, and fixed 16:16 is not enough. overflow and underflow will happen with this datatype.
- we may use fixed 32:32 arithmetic, which requires more instructions per operand, but still far fewer than a soft-fpu. but why use this instead of a faster FPU solution?
- we can optimize the integer-only put pixel, but that is not enough. as i see it, HW support is required for a good result.

i'm listing the possible optimization levels:

- optimizing the put-pixel part (imho: can't provide enough speed)
- support the texel processing with new instructions (the mentioned CMPZ, and MOVET method for example)
- provide a complete HW put-pixel instruction.
- provide a complete triangle filler function like blitter.

i think the last one is the proper, long term solution. (i know the team is busy with other things.) so, i will do further tests after the FPU is available in Apollo core.

any comments are welcome :)
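
The range limits of 16:16 fixed point can be made concrete with a short C sketch. The type and function names (`fix16`, `fix16_representable`, `fix16_from_float`, `fix16_mul`) are made up for this example and are not taken from TinyGL or any Apollo code.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fix16;        /* 16.16: 16 integer bits, 16 fraction bits */
#define FIX16_ONE 65536

/* Returns 1 if v lies in the 16.16 range [-32768, 32768) and is either
   zero or at least one fractional step (1/65536) in magnitude. */
int fix16_representable(float v)
{
    float mag = v < 0.0f ? -v : v;
    if (v >= 32768.0f || v < -32768.0f)
        return 0;                                  /* would overflow */
    if (mag != 0.0f && mag < 1.0f / 65536.0f)
        return 0;                                  /* would underflow to 0 */
    return 1;
}

fix16 fix16_from_float(float v)
{
    assert(fix16_representable(v));
    return (fix16)(v * (float)FIX16_ONE);
}

/* Multiply through a 64-bit intermediate so the raw product cannot wrap.
   The caller still has to make sure the final result fits in 16.16. */
fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * (int64_t)b) / FIX16_ONE);
}
```

Values such as 40000.0 or 0.000001 fail the `fix16_representable` check, which is the overflow and underflow situation described in the list above.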


Gunnar von Boehn
(Apollo Team Member)
Posts 4754
17 Mar 2017 08:10


Norbert Kett wrote:

  - optimizing the put-pixel part (imho: can't provide enough speed)

Optimizing the function should be done.
I agree that tuning this function alone is not enough.
Actually the separation into a subfunction limits our performance.
The code needs to be looked at and tuned at the level of the whole rasterline.

Norbert Kett wrote:

  - support the texel processing with new instructions (the mentioned CMPZ, and MOVET method for example)

Yes, I agree.
The rasterline code needs to be written in ASM and needs to take maximum advantage of the possibilities.

Norbert Kett wrote:

  - provide a complete HW put-pixel instruction.

It is better to do this in ASM.
Doing this in ASM has huge advantages.
In ASM you can provide many different implementations perfectly matching different use cases. ASM gives you much more flexibility than a fixed-function instruction.
 

Norbert Kett wrote:

- provide a complete triangle filler function like blitter.

A blitter is very limited.
Doing it in ASM provides much more flexibility.

The goal of an ASM implementation would be to offload the computation to AMMX and to try to maximize the memory bandwidth.
The CPU benefits from the very good caches of APOLLO here.

At the end of the day, a hardware blitter would also not be faster than the available memory.

Norbert Kett wrote:

  i think the last one is the proper, long term solution.

Nope.
The FPGA logic is designed to give the CPU the maximum performance.
All the cache resources in the FPGA are allocated to the CPU.

Any "outside" blitter without these caches will by design always be slower than the Apollo CPU.

If you want to develop an extra FPGA card which comes with its own memory and extra caches, then you can also aim for developing a complete hardware blitter in it.
If you want to target the Vampire, the best option is to code AMMX-ASM for the CPU.



Thellier Alain

Posts 116
17 Mar 2017 09:43


Hello

@Gunnar
I can't help you much as I don't have time to code much anymore
But I can give you some ideas
Also you are the ASM guru :-) I can't produce any good ASM code

@all
Please note that if you want a function per draw mode we will need lots of functions, as there are lots of "states" combinations

8 Zbuffer modes
Z_NEVER   
Z_LESS   
Z_GEQUAL   
Z_LEQUAL   
Z_GREATER   
Z_NOTEQUAL   
Z_EQUAL   
Z_ALWAYS 

2 TexMode
on/off

2 TexPerspective
on/off

2 GouraudMode
on/off

2 FogMode
on/off

4 TexEnvMode
on/off

so 8*2*2*2*2*4 = 512 possibilities
Even more if we include the different blending operations (source/dest/alpha stuff)

So there are 2 solutions:
1) Make just a few optimized functions for classic state combinations like Z_LESS+tex+persp+gouraud+modulate, plus a "catch-all" (slow) function with tests or conditional asm for all other state combinations
2) Generate the ASM on the fly

About the 16.16 fixed numbers, I think they will work fine for filling the polygon (as they worked in Wazp3D and PatchCompositeTags)
In my example code u v are still floats when entering DrawPoly() (so the triangle clipping is done on floats) but once we begin to draw the edges in DrawEdge() we use only 16.16
>not perspective correct texture mapping in my code
Yes, this is just an example
But it can serve as a test: if this C source, once converted to ASM+AMMX, can't achieve a simple textured_with_no_perspective draw at a decent speed, the case will be closed :-/
So I urge you to start by converting this simplest case to AMMX and testing its speed before talking further ...
As we say in French: "Ne vendez pas la peau de l'ours avant de l'avoir tué" (don't sell the bear's skin before you have killed it)

Alain
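
The first solution (a few optimized functions plus a catch-all) can be sketched in C as a lookup table indexed by a packed state key. Everything named here is hypothetical: the `DrawState` layout, the bit positions, the `SpanFunc` signature, the placeholder fillers, and the value 1 standing for "modulate" in the usage note. The state counts are the ones listed in the post, 8*2*2*2*2*4 = 512 slots.

```c
#include <assert.h>
#include <stddef.h>

enum { Z_NEVER, Z_LESS, Z_GEQUAL, Z_LEQUAL,
       Z_GREATER, Z_NOTEQUAL, Z_EQUAL, Z_ALWAYS };

typedef struct {
    unsigned zmode;    /* 0..7, one of the Z_* values */
    unsigned tex;      /* 0 or 1 */
    unsigned persp;    /* 0 or 1 */
    unsigned gouraud;  /* 0 or 1 */
    unsigned fog;      /* 0 or 1 */
    unsigned texenv;   /* 0..3 */
} DrawState;

#define STATE_COUNT (8 * 2 * 2 * 2 * 2 * 4)   /* 512 */

typedef int (*SpanFunc)(int x0, int x1);      /* assumed signature */

/* Pack the state into a 9-bit key: 3 + 1 + 1 + 1 + 1 + 2 bits. */
unsigned state_key(const DrawState *s)
{
    assert(s->zmode < 8 && s->tex < 2 && s->persp < 2 &&
           s->gouraud < 2 && s->fog < 2 && s->texenv < 4);
    return s->zmode | (s->tex << 3) | (s->persp << 4) |
           (s->gouraud << 5) | (s->fog << 6) | (s->texenv << 7);
}

/* Placeholder fillers: they draw nothing and only return a tag so a
   caller can see which path was chosen (0 = catch-all, 1 = specialised). */
int span_generic(int x0, int x1) { (void)x0; (void)x1; return 0; }
int span_fast(int x0, int x1)    { (void)x0; (void)x1; return 1; }

/* File-scope table, zero-initialised, so every slot starts as NULL. */
static SpanFunc span_table[STATE_COUNT];

void register_span(const DrawState *s, SpanFunc f)
{
    span_table[state_key(s)] = f;
}

/* Chosen once per polygon, outside the pixel loop. */
SpanFunc select_span(const DrawState *s)
{
    SpanFunc f = span_table[state_key(s)];
    return f != NULL ? f : span_generic;
}
```

Registering `span_fast` for a state such as `{ Z_LESS, 1, 1, 1, 0, 1 }` makes `select_span` return it for exactly that combination, while every other combination falls through to `span_generic`.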



Gunnar von Boehn
(Apollo Team Member)
Posts 4754
17 Mar 2017 10:33


thellier alain wrote:

Hello
 
@Gunnar
I can't help you much as I don't have time to code much anymore

I understand.
I also have no time to write all of this.

I can offer to help tune some inner loops.
And I can show ways to get the most out of the CPU.
And I can measure how good the cache prefetching is for the code, and so forth.

One thing is clear to me.
OpenGL has too many options.
And writing a one-size-fits-all work loop in C
which covers all switches and options is by design so slow and wasteful that it makes no sense at all to use it at 100 MHz.

So we should do it the other way around.
We should start by picking one sensible use case / one set of settings,
which will allow us to run one or a few games properly, with great speed.



Mr Niding

Posts 444
17 Mar 2017 11:28


If you decide to pick one game: Quake 1 is brilliant, just from a gameplay point of view. It's very competitive, both solo and multiplayer. Speedruns etc.

Just my 2 cents.


Daniel Sevo

Posts 299
17 Mar 2017 12:00


Gunnar von Boehn wrote:

 
  One thing is clear to me.
  OpenGL has too many options.
  And writing a one-size-fits-all work loop in C
  which covers all switches and options is by design so slow and wasteful that it makes no sense at all to use it at 100 MHz.
 
  So we should do it the other way around.
  We should start by picking one sensible use case / one set of settings,
  which will allow us to run one or a few games properly, with great speed.
 

Basically the same as 3dfx did with MiniGL? A bare minimum OpenGL version, juuust enough to run GLQuake.
Sounds like a good starting point.
Having GL Quake run at decent resolution and good FPS would be pretty awesome.

And in the long run maybe have Q2 as a realistic target?

(I reckon Q3 is beyond the reach of this hardware as long as it's an FPGA and not an ASIC)



Wawa T

Posts 695
17 Mar 2017 12:16


if you ask me that's exactly the reason why 3d is done with standard dedicated hardware, and why the advent of that dedicated hardware wiped out the alternative solutions.

from my experience with trying to port several pieces of 3d software years ago using dedicated hardware and accelerated drivers, namely mediator and voodoo3, i almost never achieved an acceptable speed.

a future vampire accelerator or standalone might provide a bus to a dedicated off-the-shelf chip, but i'm not sure if it's worth the effort to have one or two cult applications running.

if one wants something simple for the beginning, here is one of the non-textured games that ran almost acceptably on amiga:
EXTERNAL LINK


Andrew Copland

Posts 113
17 Mar 2017 12:20


There are a few areas which could be optimised, but the biggest appears to be the rasterisation, which looks like the usual scanline/Bresenham algorithm involving a lot of floating point maths.
 
Instead of using that I suggest taking a look at something like a block/tile based rasteriser.
EXTERNAL LINK and EXTERNAL LINK 
Gunnar: That's basically the scheme I suggested back in the Natami days for hardware implementation, and it could be possible to parallelise some of it with the AMMX instructions you've created.

Andy
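
A block/tile based rasteriser usually rests on integer edge functions. The C sketch below is a generic version of that idea under stated assumptions, not the scheme from the linked pages: the tile size, the `Pt` type and all function names are made up, and the routine only counts covered pixels instead of drawing them. A tile is skipped when all four of its corners lie outside the same edge.

```c
#include <assert.h>

#define TILE 8                     /* hypothetical tile size */

typedef struct { int x, y; } Pt;

/* Twice the signed area of (a, b, p). The sign tells on which side
   of the edge a->b the point p lies. Integer only, no floats. */
int edge(Pt a, Pt b, Pt p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/* 1 if all four corners of the tile at (tx, ty) are outside edge a->b.
   The edge function is linear, so then the whole tile is outside. */
int tile_outside_edge(Pt a, Pt b, int tx, int ty)
{
    int x1 = tx + TILE - 1, y1 = ty + TILE - 1;
    Pt c0 = { tx, ty }, c1 = { x1, ty }, c2 = { tx, y1 }, c3 = { x1, y1 };
    return edge(a, b, c0) < 0 && edge(a, b, c1) < 0 &&
           edge(a, b, c2) < 0 && edge(a, b, c3) < 0;
}

/* Count the pixels of a width x height target covered by the triangle,
   border included. Degenerate (zero-area) triangles are not handled. */
int count_covered(Pt v0, Pt v1, Pt v2, int width, int height)
{
    assert(width > 0 && height > 0);
    if (edge(v0, v1, v2) < 0) {            /* force positive winding */
        Pt t = v1; v1 = v2; v2 = t;
    }
    int n = 0;
    for (int ty = 0; ty < height; ty += TILE) {
        for (int tx = 0; tx < width; tx += TILE) {
            if (tile_outside_edge(v0, v1, tx, ty) ||
                tile_outside_edge(v1, v2, tx, ty) ||
                tile_outside_edge(v2, v0, tx, ty))
                continue;                  /* whole tile rejected */
            for (int y = ty; y < ty + TILE && y < height; y++) {
                for (int x = tx; x < tx + TILE && x < width; x++) {
                    Pt p = { x, y };
                    if (edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 &&
                        edge(v2, v0, p) >= 0)
                        n++;
                }
            }
        }
    }
    return n;
}
```

The three edge tests per pixel are independent of each other, which is the part that lends itself to being evaluated in parallel.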


Andrew Copland

Posts 113
17 Mar 2017 14:31


Also, I can't check the code at the moment as I'm at work, but instead of optimising the matrix multiplication which should only be used rarely, you can optimise both the matrix*vertex operation and implement a post-transform vertex cache to avoid re-transforming vertices in the first place.
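
A post-transform vertex cache can be sketched in C as a small direct-mapped table keyed by vertex index. This is a minimal illustration under assumptions: the cache size, the `VertexCache` layout, the row-major matrix convention and all names are made up here, and the cache has to be reset whenever the matrix changes.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE 16              /* hypothetical, must be a power of two */

typedef struct { float x, y, z, w; } Vec4;

typedef struct {
    uint32_t tag[CACHE_SIZE];      /* vertex index held in each slot */
    uint8_t  valid[CACHE_SIZE];
    Vec4     out[CACHE_SIZE];      /* transformed vertex */
    unsigned transforms;           /* number of real transforms done */
} VertexCache;

/* Call at start-up and every time the matrix changes. */
void cache_reset(VertexCache *c)
{
    assert((CACHE_SIZE & (CACHE_SIZE - 1)) == 0);
    for (int i = 0; i < CACHE_SIZE; i++)
        c->valid[i] = 0;
    c->transforms = 0;
}

/* r = M * v for a row-major 4x4 matrix. */
Vec4 transform(const float m[16], Vec4 v)
{
    Vec4 r;
    r.x = m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w;
    r.y = m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w;
    r.z = m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w;
    r.w = m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w;
    return r;
}

/* Return the transformed vertex, reusing the cached result when the
   same index was transformed before and has not been evicted. */
Vec4 fetch_transformed(VertexCache *c, const float m[16],
                       const Vec4 *verts, uint32_t index)
{
    unsigned slot = index & (CACHE_SIZE - 1);
    if (!c->valid[slot] || c->tag[slot] != index) {
        c->out[slot]   = transform(m, verts[index]);
        c->tag[slot]   = index;
        c->valid[slot] = 1;
        c->transforms++;
    }
    return c->out[slot];
}
```

For two triangles sharing an edge (indices 0,1,2 and 2,1,3) this does four transforms instead of six.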


Gunnar von Boehn
(Apollo Team Member)
Posts 4754
17 Mar 2017 21:41


Andrew Copland wrote:

  Also, I can't check the code at the moment as I'm at work, but instead of optimising the matrix multiplication which should only be used rarely, you can optimise both the matrix*vertex operation and implement a post-transform vertex cache to avoid re-transforming vertices in the first place.
 

 
Yes, a clever guy could surely get something like this.


Thellier Alain

Posts 116
18 Mar 2017 07:27


Hello

Could you give a full description of the AMMX instructions
that can be useful for drawing GL pixels, such as the
mulalpha or getpixel ones you mentioned

Thanks

Alain Thellier


Niclas A
(Apollo Team Member)
Posts 213
18 Mar 2017 07:55


thellier alain wrote:

Hello
 
  Could you give a full description of the AMMX instructions
  that can be useful for drawing GL pixels, such as the
  mulalpha or getpixel ones you mentioned
 
  Thanks
 
  Alain Thellier

Here is some info that i know about.
CLICK HERE  EXTERNAL LINK


Gunnar von Boehn
(Apollo Team Member)
Posts 4754
18 Mar 2017 08:38


Alain, Andy, Norbert, all

The forum discussion adds a little delay to everything.
To improve this I would propose to continue brainstorming in our IRC channel. There you can meet the development team and also other coders using AMMX.


