OpenGL On Vampire Cards




  Gunnar von Boehn (Apollo Team Member) Posts 4723 16 Mar 2017 09:05
 Norbert Kett wrote:
 on other areas we need float math with FPU. 
Of course, the softfloat code makes no sense.
 Norbert Kett wrote:
 for example a matrix multiplication uses 64 fmul, 64 fadd. its 'just' 128 fpu instructions. 
Can you explain which type of matrix multiplication this is? 128 FPU instructions is an odd number.
For a normal (3x3) x (3) matrix * vector multiplication you need 9 FMUL and 6 FADD = 15 clock cycles.
For a normal (4x4) x (4) matrix * vector multiplication you need 16 FMUL and 9 FADD = 25 clock cycles.
Either option, 15 to 25 clocks, is peanuts.
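To make the 3x3 count above concrete, here is a minimal C sketch of a matrix * vector product. The type and function names (Mat3, Vec3, mat3_mul_vec3) are invented for this illustration and are not taken from TinyGL or any Apollo code. Each row costs 3 multiplies and 2 adds, which gives the 9 FMUL and 6 FADD mentioned; whether each of those takes one clock is a property of the core, not of this code.

```c
/* Hypothetical types for illustration only. */
typedef struct { float m[3][3]; } Mat3;
typedef struct { float v[3]; } Vec3;

/* out = a * x.
 * Per row: 3 multiplies and 2 adds, so 9 multiplies and 6 adds in total. */
static Vec3 mat3_mul_vec3(const Mat3 *a, const Vec3 *x)
{
    Vec3 out;
    for (int r = 0; r < 3; r++) {
        out.v[r] = a->m[r][0] * x->v[0]
                 + a->m[r][1] * x->v[1]
                 + a->m[r][2] * x->v[2];
    }
    return out;
}
```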
 
  Thellier Alain
Posts 116 16 Mar 2017 10:39
 Hi Wawa, hi everybody
@Gunnar
As you know, there are several states in GL like texturing, gouraud, blending... plus the various Z tests (less, greater, no zbuffer, etc.).
So the drawpoly will need to have several versions: if it needs a texture pixel, if it needs to extrapolate color, if it needs to blend color & tex, etc.
So the question: is it conceivable to autogenerate the 68080+AMMX code based on those states? (to have a function that does exactly what is needed, not more)
Alain Thellier
 
  Gunnar von Boehn (Apollo Team Member) Posts 4723 16 Mar 2017 11:03
 Salut Alain,
thellier alain wrote:
 As you know there are several states in GL like texturing,gouraud,blending... Plus the various Z test (less,greater,no zbuffer,etc...) So the drawpoly will need to have several versions if need a texture pixel if need to extrapolate color if need to blend color & tex etc...

Yes.
thellier alain wrote:
 So the question : is it conceivable to autogenerate the 68080+AMMX code based on those states ? (to have a function that do exactly what is needed not more)Alain Thellier

Not sure what you mean by autogenerate exactly. There is no point in using C IMHO. C loses too much speed. You can waste this if your CPU runs at 3 GHz, but not when you run in the 100 MHz order. I see good speed potential for us if we code well in ASM.
For example, I can alpha blend two 32-bit texels (A,R,G,B) for you with 1 ASM instruction, needing only 1 clock cycle. For a beautiful bilinear interpolation of 4 source texels (each 32-bit ARGB) we need to run this instruction 3 times = 3 clocks.
With conditional stores we can avoid branches in the code. This means the whole rasterizer should be written branch-free, and laid out with prefetching to hide memory latency.
Special support for some texture formats we can discuss. I could offer you an instruction for doing CLUT8 texture expansion in HW, and this very fast. We could also help to speed up other formats. I think there are a lot of options.
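The "3 blends for one bilinear sample" arithmetic can be checked with a scalar C model. This is only a sketch of the math: it does not use or imitate any real AMMX instruction, and the names blend_argb and bilinear_argb as well as the 0..256 weight range are assumptions made for the example. The per-channel loop stands in for what a packed instruction would do to all four channels at once.

```c
#include <stdint.h>

/* Blend two ARGB8888 texels channel by channel.
 * w is the weight of b in the range 0..256 (0 returns a, 256 returns b). */
static uint32_t blend_argb(uint32_t a, uint32_t b, uint32_t w)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t ca = (a >> shift) & 0xFF;
        uint32_t cb = (b >> shift) & 0xFF;
        uint32_t c  = (ca * (256 - w) + cb * w) >> 8;   /* never exceeds 255 */
        out |= c << shift;
    }
    return out;
}

/* Bilinear sample from a 2x2 texel neighbourhood using exactly three blends:
 * two horizontal ones (weight fx) and one vertical one (weight fy). */
static uint32_t bilinear_argb(uint32_t t00, uint32_t t10,
                              uint32_t t01, uint32_t t11,
                              uint32_t fx, uint32_t fy)
{
    uint32_t top    = blend_argb(t00, t10, fx);
    uint32_t bottom = blend_argb(t01, t11, fx);
    return blend_argb(top, bottom, fy);
}
```

In a rasterizer, fx and fy would typically come from the fractional bits of the fixed-point texture coordinates.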
 
  Norbert Kett (Apollo Team Member) Posts 38 16 Mar 2017 11:25
 Gunnar von Boehn wrote:
 Can you explain which type matrix multiplication this is? 128 FPU instruction is an odd number. 
TinyGL has the following matrix mul function:

    void gl_M4_MulLeft(M4 *c, M4 *b)
    {
        int i, j, k;
        float s;
        M4 a;

        a = *c;   /* work on a copy, because c is overwritten below */
        for (i = 0; i < 4; i++) {
            for (j = 0; j < 4; j++) {
                s = 0.0;
                for (k = 0; k < 4; k++) {
                    s += a.m[i][k] * b->m[k][j];
                }
                c->m[i][j] = s;
            }
        }
    }

the inner part runs 64 times, so it's 64 mul and 64 add.
 
  Thellier Alain
Posts 116 16 Mar 2017 13:15
 Hello again
@Norbert Kett
Yes, it is 64 mul & 48 add, see below.
Anyway, there are not that many matrix muls in a program; I mean there are many more calls to drawpoly/drawpixel.
@Gunnar
Of course the C code was only here for demonstration. I meant generating the ASM on the fly based on the GL states, so the drawpoly code will only contain (and loop on) the needed instructions.
Alain

    /*=================================================================*/
    inline void MultM(register float *M1,register float *M2,float *M3)
    {
    float M[16];

    M[0 ]= M1[0]*M2[0] + M1[1]*M2[4] + M1[2]*M2[8] + M1[3]*M2[12];
    M[1 ]= M1[0]*M2[1] + M1[1]*M2[5] + M1[2]*M2[9] + M1[3]*M2[13];
    M[2 ]= M1[0]*M2[2] + M1[1]*M2[6] + M1[2]*M2[10] + M1[3]*M2[14];
    M[3 ]= M1[0]*M2[3] + M1[1]*M2[7] + M1[2]*M2[11] + M1[3]*M2[15];
    M[4 ]= M1[4]*M2[0] + M1[5]*M2[4] + M1[6]*M2[8] + M1[7]*M2[12];
    M[5 ]= M1[4]*M2[1] + M1[5]*M2[5] + M1[6]*M2[9] + M1[7]*M2[13];
    M[6 ]= M1[4]*M2[2] + M1[5]*M2[6] + M1[6]*M2[10] + M1[7]*M2[14];
    M[7 ]= M1[4]*M2[3] + M1[5]*M2[7] + M1[6]*M2[11] + M1[7]*M2[15];
    M[8 ]= M1[8]*M2[0] + M1[9]*M2[4] + M1[10]*M2[8] + M1[11]*M2[12];
    M[9 ]= M1[8]*M2[1] + M1[9]*M2[5] + M1[10]*M2[9] + M1[11]*M2[13];
    M[10]= M1[8]*M2[2] + M1[9]*M2[6] + M1[10]*M2[10] + M1[11]*M2[14];
    M[11]= M1[8]*M2[3] + M1[9]*M2[7] + M1[10]*M2[11] + M1[11]*M2[15];
    M[12]= M1[12]*M2[0] + M1[13]*M2[4] + M1[14]*M2[8] + M1[15]*M2[12];
    M[13]= M1[12]*M2[1] + M1[13]*M2[5] + M1[14]*M2[9] + M1[15]*M2[13];
    M[14]= M1[12]*M2[2] + M1[13]*M2[6] + M1[14]*M2[10] + M1[15]*M2[14];
    M[15]= M1[12]*M2[3] + M1[13]*M2[7] + M1[14]*M2[11] + M1[15]*M2[15];
    CopyM(M3,M);    /* copy the temporary result into M3 */
    }
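The difference between "64 mul, 64 add" and "64 mul, 48 add" comes only from how each sum is started. The following sketch counts the operations of both styles; the OpCount struct and both function names are made up for this illustration, and the counters exist only to make the numbers visible. A loop that starts from s = 0 pays one extra add per output element (16 in total), while seeding the sum with the first product, as the unrolled version does, avoids it.

```c
/* Counters are only here to make the operation counts visible. */
typedef struct { unsigned mul, add; } OpCount;

/* Loop style: the sum starts at 0, so every product is also added. */
static void mat4_mul_loop(const float a[16], const float b[16],
                          float out[16], OpCount *n)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++) {
                s += a[i * 4 + k] * b[k * 4 + j];
                n->mul++;
                n->add++;
            }
            out[i * 4 + j] = s;
        }
    }
}

/* Unrolled style: the first product seeds the sum, so each of the
 * 16 output elements needs 4 multiplies but only 3 adds. */
static void mat4_mul_seeded(const float a[16], const float b[16],
                            float out[16], OpCount *n)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float s = a[i * 4 + 0] * b[0 * 4 + j];
            n->mul++;
            for (int k = 1; k < 4; k++) {
                s += a[i * 4 + k] * b[k * 4 + j];
                n->mul++;
                n->add++;
            }
            out[i * 4 + j] = s;
        }
    }
}
```

Both functions expect out to be a different array from a and b.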
 
  Norbert Kett (Apollo Team Member) Posts 38 16 Mar 2017 13:32
 Alain: yes, this is the same as irrlicht's 4x4 matrix mul function (64 mul, 48 add). you are right, the render consumes the most time. TinyGL has many types of triangle fill functions, and if i implement alpha blending, there will be many more. it's ok.
 
  Gunnar von Boehn (Apollo Team Member) Posts 4723 16 Mar 2017 13:41
 Hi Alain,
thellier alain wrote:
 @Gunnar Of course C code was only here for demonstration I was meaning generating the ASM on the fly based on the GL states So the drawpoly code will only contain (and loop on) the needed instructions

Yes ABSOLUTELY! All conditions and checks need to be done outside the rasterizing of the line. And the line loop needs to be tuned optimally, without any IF/THEN/ELSE.
But why generate on the fly? Why not predefine the different functions fully? If we do it like this, then we will see very good speed.
Alain, would you like to work with us on this?
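As a rough picture of a line loop "without any IF/THEN/ELSE", here is a portable C sketch of one predefined variant: a flat-colored span with a Z_LESS test. The function name, the 16-bit Z buffer and the 16.16 Z stepping are assumptions for the example. The conditional store is modelled with an all-ones/all-zeros mask; a C compiler is free to turn this back into branches, which is one reason hand-written ASM is being discussed here.

```c
#include <stdint.h>

/* Fill `count` pixels with `src` wherever the new depth is less than the
 * stored depth. z and dz are 16.16 fixed point; the Z buffer keeps the
 * integer part. There is no if/else inside the pixel loop: the test result
 * becomes a mask that selects between the new and the old value. */
static void span_flat_zless(uint32_t *color, uint16_t *zbuf, int count,
                            uint32_t src, uint32_t z, int32_t dz)
{
    for (int x = 0; x < count; x++) {
        uint16_t zn   = (uint16_t)(z >> 16);
        uint32_t pass = (uint32_t)0 - (uint32_t)(zn < zbuf[x]); /* ~0 or 0 */

        color[x] = (src & pass) | (color[x] & ~pass);
        zbuf[x]  = (uint16_t)((zn & pass) | (zbuf[x] & ~pass));

        z += (uint32_t)dz;   /* unsigned wrap-around handles negative steps */
    }
}
```

Other state combinations (textured, gouraud, other Z tests) would each get their own loop of this kind, chosen once per polygon before any span is drawn.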
 
  Norbert Kett (Apollo Team Member) Posts 38 17 Mar 2017 04:52
  for me it's ok if every GL state combination has an optimized triangle filler. we have no GHz CPU (yet :)
except the Z function, there the instruction can be changed.
correct me, but as i see it, Alain's example is not perspective correct texture mapping. TinyGL has perspective correct mapping, and because of this its render code is more complex.
i measured the used float ranges, and the fixed 16:16 is not enough. overflow and underflow will happen with this datatype.
we may use a fixed 32:32 arithmetic, which requires more instructions / operand, but still much less than a softfpu. but why use this instead of a faster FPU solution?
we can optimize the integer-only put pixel, but that is not enough. as i see it, HW support is required for a good result.
i am listing the possible optimization levels:
1) optimizing the putpixel part (imho: can't provide enough speed)
2) support the texel processing with new instructions (the mentioned CMPZ and MOVET method, for example)
3) provide a complete HW putpixel instruction.
4) provide a complete triangle filler function like a blitter.
i think the last one is the proper, long term solution. (i know the team is busy with other things.)
so, i will do further tests after the FPU is available in the Apollo core.
any comments are welcome :)
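The overflow and underflow concern about 16:16 can be illustrated with a few lines of C. This is a generic fixed-point sketch, not code from TinyGL; the fix16 type and the helper names are invented here. A 16.16 value covers roughly -32768 to +32768 in steps of 1/65536, so a product like 200 * 200 no longer fits and a value like 0.00001 collapses to zero.

```c
#include <stdint.h>

typedef int32_t fix16;          /* 16.16 fixed point, hypothetical name */
#define FIX16_ONE 65536

static fix16 fix16_from_float(float f) { return (fix16)(f * (float)FIX16_ONE); }
static float fix16_to_float(fix16 v)   { return (float)v / (float)FIX16_ONE; }

/* Returns 1 if a * b is representable in 16.16, 0 if it overflows.
 * The product is formed in 64 bits so the check itself cannot overflow.
 * (Right-shifting a negative value is arithmetic on common compilers,
 * although the C standard leaves it implementation-defined.) */
static int fix16_mul_fits(fix16 a, fix16 b)
{
    int64_t p = ((int64_t)a * (int64_t)b) >> 16;
    return p >= INT32_MIN && p <= INT32_MAX;
}

/* Multiply two 16.16 values; only meaningful when fix16_mul_fits() is true. */
static fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * (int64_t)b) >> 16);
}
```

Whether these limits matter in practice depends on which stage of the pipeline the numbers come from; the measured ranges mentioned in the post are not reproduced here.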
 
  Gunnar von Boehn (Apollo Team Member) Posts 4723 17 Mar 2017 08:10
 Norbert Kett wrote:
  optimizing the putpixel part (imho: can't provide enough speed)

Optimizing the function should be done. I agree that tuning this function alone is not enough. Actually, the separation into a subfunction limits our performance. The code needs to be looked at and tuned at the level of the whole rasterline.
Norbert Kett wrote:
  support the texel processing with new instructions (the mentioned CMPZ, and MOVET method for example)

Yes, I agree. The rasterline code needs to be written in ASM and needs to take maximum advantage of the possibilities.
Norbert Kett wrote:
  provide a complete HW putpixel instruction.

It is better to do this in ASM. Doing this in ASM has huge advantages. In ASM you can provide many different implementations perfectly matching different use cases. ASM gives you so much more flexibility than a fixed-function instruction.
Norbert Kett wrote:
  provide a complete triangle filler function like blitter.

A blitter is very limited. Doing it in ASM provides much more flexibility. The goal of an ASM implementation would be to offload the computation to AMMX and to try to maximize the memory bandwidth. The CPU benefits from the very good caches of APOLLO here. A hardware blitter would, at the end of the day, also not be faster than the available memory.
Norbert Kett wrote:
 i think the last one is the proper, long term solution.

Nope. The FPGA logic is designed to give the CPU the maximum performance. All the cache resources in the FPGA are allocated to the CPU. Any "outside" blitter without these caches will by design always be slower than the Apollo CPU. If you wanted to develop an extra FPGA card which comes with its own memory and with extra caches, then you could aim at developing a complete hardware blitter in it as well. If you want to target the Vampire, the best option is to code AMMX ASM for the CPU.
 
  Thellier Alain
Posts 116 17 Mar 2017 09:43
 Hello
@Gunnar
I can't help you much, as I don't have much time to code anymore. But I can give you some ideas. Also, you are the ASM guru :) I can't produce any good ASM code.
@all
Please note that if you want a function per draw mode, we will need lots of functions, as there are lots of "states" combinations:
8 Zbuffer modes: Z_NEVER Z_LESS Z_GEQUAL Z_LEQUAL Z_GREATER Z_NOTEQUAL Z_EQUAL Z_ALWAYS
2 TexMode on/off
2 TexPerspective on/off
2 GouraudMode on/off
2 FogMode on/off
4 TexEnvMode
so 8*2*2*2*2*4 = 512 possibilities. Even more if we include the different blending operations (source/dest/alpha stuff).
So there are 2 solutions:
1) Make just some optimized functions for classic state combinations like Z_LESS+tex+persp+gouraud+modulate, and a "catchall" (slow) function with tests or conditional asm for all other state combinations
2) Generate the ASM on the fly
About the 16.16 fixed numbers: I think it will work fine for filling the polygon (as it worked in Wazp3D and PatchCompositeTags). In my example code u, v are still floats when entering DrawPoly() (so the triangle clipping is done on floats), but once we begin to draw the edges in DrawEdge() we use only 16.16.
>not perspective correct texture mapping in my code
Yes, this is just an example. But it can serve as a test: if this C source, once converted to ASM+AMMX, can't achieve a simple textured_with_no_perspective draw at a decent speed, the case will be closed :/
So I urge you to start by converting this simplest case to AMMX and test the speed before talking further... as we say in French, "Ne vendez pas la peau de l'ours avant de l'avoir tué" (don't sell the bear's skin before you have killed it).
Alain
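Solution 1 above can be sketched as a lookup table indexed by the packed GL state. Everything in this example is hypothetical: the RasterState layout, the TEXENV_* numbering, the span function signature and the two stand-in span functions are invented to show the shape of the idea, not how Wazp3D or TinyGL do it. The table has 8*2*2*2*2*4 = 512 entries; all of them point to the slow catch-all except the combinations that have a tuned routine.

```c
#include <stdint.h>

enum { Z_NEVER, Z_LESS, Z_GEQUAL, Z_LEQUAL, Z_GREATER, Z_NOTEQUAL, Z_EQUAL, Z_ALWAYS };
enum { TEXENV_REPLACE, TEXENV_MODULATE, TEXENV_DECAL, TEXENV_BLEND }; /* assumed numbering */

typedef struct {
    unsigned zmode;                     /* 0..7 */
    unsigned tex, persp, gouraud, fog;  /* 0 or 1 each */
    unsigned texenv;                    /* 0..3 */
} RasterState;

#define STATE_COUNT (8 * 2 * 2 * 2 * 2 * 4)

/* Pack the state into a table index: 3 + 1 + 1 + 1 + 1 + 2 = 9 bits. */
static unsigned state_key(const RasterState *s)
{
    return s->zmode | (s->tex << 3) | (s->persp << 4)
         | (s->gouraud << 5) | (s->fog << 6) | (s->texenv << 7);
}

typedef void (*SpanFunc)(uint32_t *dst, int count);

/* Stand-ins: a real catch-all would test every state per pixel, and a real
 * fast path would be the tuned inner loop. Here they only write a marker. */
static void span_catchall(uint32_t *dst, int count)
{
    for (int i = 0; i < count; i++) dst[i] = 0;
}
static void span_classic(uint32_t *dst, int count)
{
    for (int i = 0; i < count; i++) dst[i] = 1;
}

static SpanFunc span_table[STATE_COUNT];

/* Default every combination to the catch-all, then register tuned routines. */
static void span_table_init(void)
{
    RasterState classic = { Z_LESS, 1, 1, 1, 0, TEXENV_MODULATE };
    for (unsigned k = 0; k < STATE_COUNT; k++) span_table[k] = span_catchall;
    span_table[state_key(&classic)] = span_classic;
}

/* Called once per polygon (or per state change), never per pixel. */
static SpanFunc span_select(const RasterState *s)
{
    return span_table[state_key(s)];
}
```

Adding a new tuned routine is then a single extra table entry, and unsupported combinations keep working through the catch-all.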
 
  Gunnar von Boehn (Apollo Team Member) Posts 4723 17 Mar 2017 10:33
 thellier alain wrote:
 Hello @Gunnar I can't help you much, as I don't have much time to code anymore

I understand. I also have no time to write all this. I can offer to help tune some inner loops. And I can show ways to get the most out of the CPU. And I can measure how good the cache prefetching is for the code, and so forth.
One thing is clear to me: OpenGL has too many options. Writing a one-size-fits-all work loop in C which covers all switches and options is by design so slow and wasteful that it makes no sense at all to use it at 100 MHz.
So we should do it the other way around. We should start by picking one sensible use case / one set of settings, which will allow us to run one or a few games properly, with great speed.
 
  Mr Niding
Posts 443 17 Mar 2017 11:28
 If you decide to pick one game: Quake 1 is brilliant, just from a gameplay point of view. It's very competitive, both solo and multiplayer. Speedruns etc. Just my 2 cents.
 
  Daniel Sevo
Posts 298 17 Mar 2017 12:00
 Gunnar von Boehn wrote:
 One thing is clear to me. OpenGL has too many options. And writing a one-size-fits-all work loop in C which covers all switches and options is by design so slow and wasteful that it makes no sense at all to use it at 100 MHz. So we should do it the other way around. We should start by picking one sensible use case / one set of settings, which will allow us to run one or a few games properly, with great speed.

Basically the same as 3dfx did with MiniGL? A bare minimum OpenGL version, juuust enough to run GLQuake. Sounds like a good starting point. Having GLQuake run at a decent resolution and good FPS would be pretty awesome. And in the long run maybe have Q2 as a realistic target? (I reckon Q3 is beyond the reach of this hardware as long as it's an FPGA and not an ASIC)
 
  Wawa T
Posts 695 17 Mar 2017 12:16
 if you ask me, that's exactly the reason why 3d is used with standard dedicated hardware, and why the advent of that dedicated hardware wiped out the alternative solutions. from my experience with trying to port several pieces of 3d software years ago using dedicated hardware and accelerated drivers, namely mediator and voodoo3, i almost never achieved an acceptable speed. a future vampire accelerator or standalone might provide a bus to a dedicated off-the-shelf chip, but i'm not sure if it's worth the effort to have one or two cult applications running. if one wants something simple for the beginning, here is one of the non-textured games that ran almost acceptably on amiga: EXTERNAL LINK
 
  Andrew Copland
Posts 113 17 Mar 2017 12:20
 There are a few areas which could be optimised, but the biggest appears to be the rasterisation, which looks like the usual scanline/Bresenham algorithm involving a lot of floating point maths. Instead of using that, I suggest taking a look at something like a block/tile based rasteriser. EXTERNAL LINK and EXTERNAL LINK
Gunnar: That's basically the scheme I suggested back in the Natami days for hardware implementation, and it could be possible to parallelise some of it with the AMMX instructions you've created.
Andy
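One common form of block-based rasterisation uses integer edge functions: a pixel is inside the triangle when it lies on the inner side of all three edges, and a whole tile can be accepted or rejected by testing only its four corners. The sketch below shows that general technique; it is not taken from the linked material, and all names (Pt, edge, classify_tile, count_covered) are invented for the example. It only counts covered pixels so the logic stays visible; a real rasteriser would shade them, and the edge values could be stepped incrementally instead of recomputed.

```c
typedef struct { int x, y; } Pt;

/* Edge function for the directed edge a->b: its sign tells on which side
 * of the edge the point (x, y) lies; 0 means exactly on the edge. */
static int edge(Pt a, Pt b, int x, int y)
{
    return (b.x - a.x) * (y - a.y) - (b.y - a.y) * (x - a.x);
}

/* Inside test for a triangle whose winding makes all edge values >= 0 inside. */
static int inside(Pt v0, Pt v1, Pt v2, int x, int y)
{
    return edge(v0, v1, x, y) >= 0 && edge(v1, v2, x, y) >= 0
        && edge(v2, v0, x, y) >= 0;
}

enum { TILE_OUT, TILE_PARTIAL, TILE_IN };

/* Classify a size x size tile by its four corner pixels. Edge functions are
 * linear, so the corners bound every pixel in the tile. A tile near a vertex
 * may be reported PARTIAL although it is empty; that is merely conservative. */
static int classify_tile(Pt v0, Pt v1, Pt v2, int tx, int ty, int size)
{
    const Pt e[3][2] = { { v0, v1 }, { v1, v2 }, { v2, v0 } };
    int x1 = tx + size - 1, y1 = ty + size - 1;
    int all_in = 1;

    for (int i = 0; i < 3; i++) {
        int in = (edge(e[i][0], e[i][1], tx, ty) >= 0)
               + (edge(e[i][0], e[i][1], x1, ty) >= 0)
               + (edge(e[i][0], e[i][1], tx, y1) >= 0)
               + (edge(e[i][0], e[i][1], x1, y1) >= 0);
        if (in == 0) return TILE_OUT;    /* tile entirely outside this edge */
        if (in != 4) all_in = 0;
    }
    return all_in ? TILE_IN : TILE_PARTIAL;
}

/* Count covered pixels; width and height are assumed multiples of size. */
static int count_covered(Pt v0, Pt v1, Pt v2, int width, int height, int size)
{
    int covered = 0;
    for (int ty = 0; ty < height; ty += size) {
        for (int tx = 0; tx < width; tx += size) {
            int c = classify_tile(v0, v1, v2, tx, ty, size);
            if (c == TILE_OUT) continue;                      /* skipped whole */
            if (c == TILE_IN) { covered += size * size; continue; }
            for (int y = ty; y < ty + size; y++)              /* partial tile */
                for (int x = tx; x < tx + size; x++)
                    covered += inside(v0, v1, v2, x, y);
        }
    }
    return covered;
}
```

Fully covered tiles need no per-pixel test at all, which is where the approach can pay off for large triangles.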
 
  Andrew Copland
Posts 113 17 Mar 2017 14:31
 Also, I can't check the code at the moment as I'm at work, but instead of optimising the matrix multiplication, which should only be used rarely, you can optimise both the matrix*vertex operation and implement a post-transform vertex cache to avoid re-transforming vertices in the first place.
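A post-transform vertex cache can be as small as a direct-mapped table keyed by vertex index. The following C sketch shows the idea under stated assumptions: indexed geometry, a row-major 4x4 matrix, and a 16-entry cache; VertexCache, vcache_fetch and the other names are invented for the example and do not come from TinyGL. The transforms counter is there only to show how much work is saved.

```c
#include <stdint.h>

typedef struct { float x, y, z, w; } Vec4;

#define VCACHE_SIZE  16            /* assumed size; power of two for masking */
#define VCACHE_EMPTY 0xFFFFFFFFu   /* vertex indices must stay below this */

typedef struct {
    uint32_t tag[VCACHE_SIZE];     /* vertex index held by each slot */
    Vec4     out[VCACHE_SIZE];     /* its transformed position */
    unsigned transforms;           /* number of real transforms performed */
} VertexCache;

/* Must be called at start and whenever the matrix changes. */
static void vcache_reset(VertexCache *c)
{
    for (int i = 0; i < VCACHE_SIZE; i++) c->tag[i] = VCACHE_EMPTY;
    c->transforms = 0;
}

/* Row-major 4x4 matrix times column vector. */
static Vec4 transform(const float m[16], Vec4 v)
{
    Vec4 r;
    r.x = m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w;
    r.y = m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w;
    r.z = m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w;
    r.w = m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w;
    return r;
}

/* Return the transformed vertex, transforming it only on a cache miss. */
static Vec4 vcache_fetch(VertexCache *c, const float m[16],
                         const Vec4 *verts, uint32_t index)
{
    unsigned slot = index & (VCACHE_SIZE - 1);
    if (c->tag[slot] != index) {
        c->out[slot] = transform(m, verts[index]);
        c->tag[slot] = index;
        c->transforms++;
    }
    return c->out[slot];
}
```

For two triangles sharing an edge (indices 0,1,2 and 2,1,3) this performs 4 transforms instead of 6; how much it helps in a real game depends on how the geometry is indexed.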
 
  Gunnar von Boehn (Apollo Team Member) Posts 4723 17 Mar 2017 21:41
 Andrew Copland wrote:
 Also, I can't check the code at the moment as I'm at work, but instead of optimising the matrix multiplication, which should only be used rarely, you can optimise both the matrix*vertex operation and implement a post-transform vertex cache to avoid re-transforming vertices in the first place.
Yes, a clever guy could surely get something like this.
 
  Thellier Alain
Posts 116 18 Mar 2017 07:27
 Hello
Could you give a full description of the AMMX instructions that can be useful for drawing GL pixels, such as the mulalpha or getpixel ones you mentioned?
Thanks
Alain Thellier
 
  Niclas A (Apollo Team Member) Posts 213 18 Mar 2017 07:55
 thellier alain wrote:
 Hello Could you give a full description of the AMMX instructions that can be useful for drawing GL pixels such as the mulalpha or getpixel ones you mentioned Thanks Alain Thellier

Here is some info that i know about. CLICK HERE EXTERNAL LINK
 
  Gunnar von Boehn (Apollo Team Member) Posts 4723 18 Mar 2017 08:38
 Alain, Andy, Norbert, all,
The forum discussion adds a little delay to everything. To improve this, I would propose to continue brainstorming in our IRC channel. There you can meet the development team and also other coders using AMMX.
 
