Performance and Benchmark Results!
OpenGL On Vampire Cards
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 14 Mar 2017 15:52
| Norbert Kett wrote:
| i made a quick calculation: if i assume the CPU runs at 100MHz, and i want 30 frames / second, and 640x400 pixels. then i have 13 CPU cycles to compute one pixel which is not too much. usually an application need to compute many other things too. so, if i'm correct without a special texturing HW we can not achieve ps2 quality. i feel AMMX wont help. |
The calculation is a good start. But mind that we also decode VIDEO in 640x400 at 30 FPS with no problem. We also play DOOM in 640x400 smoothly, without problems. And this even without using MMX! So we have already proved that we can do this resolution. Now show us which instructions/operations are needed per texel. Then we can do the clock counting! ;-)
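The cycle budget quoted above can be checked with a small helper. This is only a sketch of the arithmetic; the function name `cycles_per_pixel` is made up for illustration, and the 100 MHz clock is the assumption stated in the quoted post.

```c
#include <assert.h>

/* Cycles available per pixel = clock rate / (frames per second * pixels per frame). */
static double cycles_per_pixel(double clock_hz, int fps, int width, int height)
{
    assert(fps > 0 && width > 0 && height > 0);
    return clock_hz / ((double)fps * (double)width * (double)height);
}
```

With the numbers from the post, `cycles_per_pixel(100e6, 30, 640, 400)` gives roughly 13 cycles per pixel, which matches the estimate. That budget covers everything the CPU does per frame, not only texel fetching.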
| |
| | Norbert Kett (Apollo Team Member) Posts 39 14 Mar 2017 17:53
| ok, at first, let's see what we are talking about :) here is a triangle renderer source: EXTERNAL LINK and the include: EXTERNAL LINK TinyGL has many kinds of triangle renderer functions, all of which use this include, so it's a bit messy. here is the merged renderer: EXTERNAL LINK we can see it's optimized, it can emit 8 texels in a row. the PUT_PIXEL part is integer only. here is the no-fpu asm code: EXTERNAL LINK and with fpu: EXTERNAL LINK it's easy to find the put pixel part, which is included 8 times in a row. here is the put_pixel asm code:
    movel d6,d2
    moveq #14,d0
    lsrl d0,d2
    clrl d0
    movew a2@(4),d0
    cmpl d2,d0
    jhi L43
    movel d4,d1
    andl a5@(-28),d1
    clrl d0
    movew a5@(-22),d0
    lsll d0,d1
    movel d5,d0
    andl a5@(-28),d0
    orl d0,d1
    clrl d0
    movew a5@(-24),d0
    lsrl d0,d1
    movel a5@(-4),a0
    movew a0@(d1:l),a3@(4)
    movew d2,a2@(4)
L43:
    addl a5@(-96),d6
    addl d7,d5
    addl d3,d4
and we have no pixel blending yet. that's all for now ;)
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 14 Mar 2017 18:15
| What is the typical texture size (X/Y)? What texture sizes are supported?
| |
| | Norbert Kett (Apollo Team Member) Posts 39 14 Mar 2017 18:23
| TinyGL originally supported only 256x256, now: from 64x64 to 2048x2048.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 14 Mar 2017 18:26
| Norbert Kett wrote:
| TinyGL originally supported only 256x256, now: from 64x64 to 2048x2048. |
OK, let's make a simple example together. Let's write in ASM the code to put a pixel from a 256x256 texture. For this case the core function code is:
    move.w (A0,D0*2),(A1)+
== 1 cycle. Here D0 is the texture index. The layout in the index register is YyXx, so this core is pretty simple. You also maintain a Z buffer, so you do:
    CMP.W D1,(A2)
    bhi .noset
    move.w (A0,D0*2),(A1)+
    move.w D1,(A2)+
.noset
So the main function of the code is only 4 instructions. --- May I ask which compiler version and which compile options you use? Your disasm looks a bit like it does some extra casting from LONG to WORD.
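For readers who prefer C, the four-instruction core above could be sketched roughly as follows. This is an illustration, not the TinyGL code: the names `texel_index` and `put_pixel_z` are made up, 16-bit pixels and depth values are assumed from the `.w` operand sizes, and the depth test mirrors the unsigned `bhi` (skip when the stored depth is higher than the new one).

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Pack 8-bit texture coordinates into one 16-bit index, Y in the high byte ("YyXx"). */
static uint16_t texel_index(uint8_t x, uint8_t y)
{
    return (uint16_t)(((unsigned)y << 8) | x);
}

/* Write one texel if the new depth passes the test; returns 1 if the pixel was written. */
static int put_pixel_z(uint16_t *dst, uint16_t *zbuf,
                       const uint16_t *texture, uint16_t index, uint16_t z)
{
    assert(dst != NULL && zbuf != NULL && texture != NULL);
    if (*zbuf > z)           /* stored depth is higher: skip, like "bhi .noset" */
        return 0;
    *dst = texture[index];   /* move.w (A0,D0*2),(A1) */
    *zbuf = z;               /* move.w D1,(A2) */
    return 1;
}
```

Because a 256x256 texture needs exactly 16 index bits, no masking or shifting is required per texel, which is where most of the extra instructions in the compiled TinyGL code go. Advancing the destination and depth pointers is left to the caller in this sketch.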
| |
| | Norbert Kett (Apollo Team Member) Posts 39 15 Mar 2017 04:21
| gcc 2.95.3.
| |
| | Norbert Kett (Apollo Team Member) Posts 39 15 Mar 2017 04:45
| i peeked into the gcc soft-float lib src, i was curious about how many instructions are used for float operations/conversions... omg! i think it's pointless to do any optimizations until the FPU is enabled in the Apollo core :/
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 15 Mar 2017 06:52
| Norbert Kett wrote:
| i think it's pointless to do any optimizations until the FPU is enabled in the Apollo core :/
|
Certainly using softfloat makes absolutely no sense here. On the other hand, enabling FPU instructions would be possible. But I agree with you that this C code will generally be a problem. We have to keep in mind that competitive 3D games like DOOM used to have their render code written in hand-tuned ASM. The inner loop of the DOOM rasterizer is probably less than 10 ASM instructions in total, including incrementing Y,X and U,H. The project can be a success if we also write the render functions in hand-tuned ASM.
| |
| | Norbert Kett (Apollo Team Member) Posts 39 15 Mar 2017 07:18
| yes, i already updated vasm to a new version with Apollo support. and i believe the rendering can be much faster with proper optimization. the asm compiled from c is just a good guide here. but the fpu is essential, some soft fpu functions use more than 300 instructions. btw doom's vertical column rendering is much simpler compared to a 'real' 3D polygon renderer with perspective correction and z buffer r/w. i saw quake's hand-tuned x86 asm render code. many thousands of lines for rendering.
| |
| | Wawa T Posts 695 15 Mar 2017 07:19
| but doom is a 2d engine providing an illusion of 3d action, while an implementation of even a subset of some historical version of ogl is still a true 3d multipurpose engine, demanding its share in complexity and flexibility. if tinygl is going to be a base for porting applications it cannot be that tricky and restricted. it might be possible to have it implemented within reason with integer arithmetic only, it seems that is what os4 people (hans) are doing when they use compositing to fake 3d in demos, still im not sure it will alone gain enough speed to have your average simple 3d game port run on a vampire at a playable speed, say some 10 fps.. edit: norbert, you have beat me to it;)
| |
| | Grzegorz Wójcik (pisklak) (Apollo Team Member) Posts 87 15 Mar 2017 08:25
| Well, OGL is for sure nice, but I think if anyone wants it on a no-FPU platform then TinyGL needs a really serious rewrite. For sure someone can write a good working 3D engine with only fixed-point math, but I guess that will not be OGL compatible. We have some nice AMMX instructions that for sure can help with texture filtering and some other stuff. We may focus on that right now, and later, when we have a full FPU, optimize all the other stuff...
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 15 Mar 2017 08:28
| wawa t wrote:
| but doom is a 2d engine providing an illusion of 3d action,
|
Of course every 3D engine is a simplification. Some more, some less. It's always a trade-off of correctness for performance. 3D game engines never do true raycasting. Important for success is to be aware of the possibilities and to set a reasonable goal. I would recommend writing the rasterizer in ASM to be able to take full control of Super-Scalar execution, of data prefetching, of AMMX. I think it would make sense to work from the inside to the outside: start with simple rasterizer demos first and understand how to get the core routines to full speed. If someone likes to do this, then I'm happy to help with the ASM tuning. And if needed we can always cheat by adding new instructions if we identify very useful cases.
| |
| | Wawa T Posts 695 15 Mar 2017 12:33
| gunnar, this is not a question of writing an engine. you dont need an ogl library, against which you compile your app, to do that. but hardly anyone will write a dedicated separate 3d engine for every separate amiga app he wants to develop, if any. a library you can compile ogl stuff against is mostly good for porting sources from linux. such sources dont necessarily contain floating point variables, id say, but if i dont remember well, one would have to grep some of them for types. the question is, does it pay to develop now an integer only library, taking speed penalties and other limits into account, or is it better to wait for fpu implementation.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 15 Mar 2017 12:37
| wawa t wrote:
| the question is, does it pay to develop now an integer only library.. |
Thank you for repeating the point I made! If you read my post carefully you will see that I clearly said to use FPU and to use AMMX.
| |
| | Thellier Alain Posts 143 15 Mar 2017 13:19
| Hello guys, as you talked about my Wazp3D stuff, let me join your discussion...
1) As clipping may occur, we will have to draw not only triangles but also clipped triangles, which are in fact polygons. So the right way to do a fill is: for(polygonheight) draw the horizontal segment between the 2 edges
2) At this point we don't need the FPU, as all (float) values can be stored as 32 bits with the upper 16 bits representing the integer part (fixed point)
3) So the "game" will be to interpolate several values linearly along this segment. Let's start with 2 values: U V, the texture coordinates. We interpolate simply with u=u+du; v=v+dv;
4) This C code (which comes from my PatchCompositeTags prog) shows how to fill a poly from the two edges already computed
/*==================================================================*/
void FillPoly_A8R8G8B8_Fast(struct Comp3D *C,struct Edge3D *Edge1,struct Edge3D *Edge2)
{
register LONG x,dx;
register LONG u,du;
register LONG v,dv;
register ULONG m,n;
register LONG y;
register ULONG *Src32;  /* bm memory */
register ULONG *Dst32;  /* bm memory */
register ULONG *Dst32X; /* bm memory */
register ULONG sline;
register ULONG dline;
register ULONG pix32;

FUNC
    y=C->y;
    Edge1=&Edge1[y];
    Edge2=&Edge2[y];

/* now lock and begin to draw pixels */
    LOCKBM(C->Src);
    Src32=C->renderInfo.Memory;
    sline=C->Src.LineSize/4;
    LOCKBM(C->Dst);
    Dst32=C->renderInfo.Memory;
    dline=C->Dst.LineSize/4;

    MLOOP(C->high)
    {
    x =(Edge1->x);
    dx=(Edge2->x - Edge1->x)+1;
    if(dx < 1) goto LineDone;

    u=(Edge1->u);
    v=(Edge1->v);
    du=(((Edge2->u>>16) - (Edge1->u>>16))<<16)/dx;
    dv=(((Edge2->v>>16) - (Edge1->v>>16))<<16)/dx;
    Dst32X=&Dst32[y*dline + x];

    NLOOP(dx)
        {
        pix32=Src32[ (v>>16)*sline + (u>>16)];
        Dst32X[n]=pix32;
        u=u+du;
        v=v+dv;
        }
LineDone:
    Edge1++;
    Edge2++;
    y++;
    }

/* unlock all */
done:
    UNLOCKBM(C->Src);
    UNLOCKBM(C->Dst);
}
Alain Thellier - Wazp3D
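The 16.16 fixed-point interpolation from points 2) and 3) can be isolated into a small self-contained sketch. This is not Alain's code: the function name `fill_span` and its parameters are made up for illustration, it uses plain `stdint.h` types instead of the Amiga `LONG`/`ULONG` and the `MLOOP`/`NLOOP` macros, and it assumes non-negative texture coordinates that stay inside the texture.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define FIX_SHIFT 16  /* upper 16 bits hold the integer part, lower 16 the fraction */

/* Fill `count` pixels of one horizontal segment, stepping u and v in 16.16 fixed point. */
static void fill_span(uint32_t *dst, int count,
                      const uint32_t *tex, int tex_width,
                      int32_t u0, int32_t v0, int32_t u1, int32_t v1)
{
    assert(dst != NULL && tex != NULL && tex_width > 0);
    if (count < 1)
        return;

    /* One integer division per segment; the inner loop only adds and shifts. */
    int32_t du = (u1 - u0) / count;
    int32_t dv = (v1 - v0) / count;
    int32_t u = u0;
    int32_t v = v0;

    for (int n = 0; n < count; n++) {
        dst[n] = tex[(v >> FIX_SHIFT) * tex_width + (u >> FIX_SHIFT)];
        u += du;
        v += dv;
    }
}
```

Unlike the posted code, this sketch keeps the fractional bits of the edge values when computing `du` and `dv`; either choice leaves the per-pixel work at two additions, two shifts, one multiply and one load.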
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 15 Mar 2017 13:34
| thellier alain wrote:
| Hello guys |
Nice to see you here Alain! You are spot on, the inner loop can nicely be done in INTEGER. Thank you for the code example. In my opinion one should write the inner rasterizer in ASM, to be able to control CPU behavior the most. With Apollo you have a SW programmable DCache prefetch instruction. If you write the code accordingly, e.g. mix this texel while fetching the next-next texel, then you can ideally kill memory latency. What do you think Alain?
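The "work on this texel while fetching the next-next texel" idea can be illustrated in portable C. This sketch does not use the Apollo prefetch instruction; it assumes the GCC/Clang builtin `__builtin_prefetch` as a stand-in, and the function name `gather_texels` and the precomputed index list are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 0)  /* read access, low temporal locality */
#else
#define PREFETCH(p) ((void)(p))                    /* no-op on other compilers */
#endif

/* Copy texels selected by an index list, prefetching the texel two steps ahead. */
static void gather_texels(uint16_t *dst, const uint16_t *tex,
                          const uint16_t *indices, size_t count)
{
    assert(dst != NULL && tex != NULL && indices != NULL);
    for (size_t i = 0; i < count; i++) {
        if (i + 2 < count)
            PREFETCH(&tex[indices[i + 2]]);  /* start fetching the next-next texel */
        dst[i] = tex[indices[i]];            /* process the current texel */
    }
}
```

A prefetch is only a hint, so the result is the same with or without it; whether it hides latency depends on the cache and memory system, which is why hand-written ASM gives more control here.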
| |
| | Thellier Alain Posts 143 15 Mar 2017 13:34
| And now the draw edges part (I dont give a TestPoly/ClipPoly code as any GL implementation will have it)
/*==================================================================*/
void DrawEdge(struct Comp3D *C,struct Point3D *P0,struct Point3D *P1)
{
register struct Edge3D *E;
register LONG x,dx;
register LONG y,dy;
register LONG u,du;
register LONG v,dv;
register LONG w,dw;
register ULONG m;
APTR temp;

    if(C->UseLine) DrawLine(C,P0,P1);
FUNC
    if(P0->y < P1->y)
        {E=C->edge1;}
    else
        {SWAP(P0,P1); E=C->edge2;}

    dy= floor(P1->y) - floor(P0->y) + 1;
    if(dy<1) return;

    x=floor(P0->x); x=(x<<16); dx=floor(P1->x) - floor(P0->x); dx=(dx<<16)/dy;
    u=floor(P0->u); u=(u<<16); du=floor(P1->u) - floor(P0->u); du=(du<<16)/dy;
    v=floor(P0->v); v=(v<<16); dv=floor(P1->v) - floor(P0->v); dv=(dv<<16)/dy;
    w=floor(P0->w); w=(w<<16); dw=floor(P1->w) - floor(P0->w); dw=(dw<<16)/dy;

    y=P0->y;
    E=&E[y];
    MLOOP(dy)
    {
    E[m].x=x>>16; x=x+dx;
    E[m].y=y;     y=y+1;
    E[m].u=u;     u=u+du;
    E[m].v=v;     v=v+dv;
    E[m].w=w;     w=w+dw;
    }

/* securize extremities */
    E[0].x=floor(P0->x);
    E[0].y=floor(P0->y);
    E[0].u=floor(P0->u); E[0].u=(E[0].u<<16);
    E[0].v=floor(P0->v); E[0].v=(E[0].v<<16);
    E[0].w=floor(P0->w); E[0].w=(E[0].w<<16);
    dy=dy-1;
    E[dy].x=floor(P1->x);
    E[dy].y=floor(P1->y);
    E[dy].u=floor(P1->u); E[dy].u=(E[dy].u<<16);
    E[dy].v=floor(P1->v); E[dy].v=(E[dy].v<<16);
    E[dy].w=floor(P1->w); E[dy].w=(E[dy].w<<16);
}
/*==================================================================*/
void DrawPoly(struct Comp3D *C,ULONG Pnb)
{
register LONG ymin,ymax,ymed;
register struct Point3D *P=C->PolyP;
register n;

    C->PolyP[Pnb].x=C->PolyP[0].x; /* close poly */
    C->PolyP[Pnb].y=C->PolyP[0].y;
    C->PolyP[Pnb].u=C->PolyP[0].u;
    C->PolyP[Pnb].v=C->PolyP[0].v;
    C->PolyP[Pnb].w=C->PolyP[0].w;

    C->NotClipped=TestPoly(C,Pnb);
    if(C->NotClipped==-1) return;
    if(C->NotClipped== 0) Pnb=ClipPoly(C,Pnb);

    ymin=ymax=P[0].y;
    NLOOP(Pnb)
    {
    if(P[n].y < ymin) ymin=P[n].y;
    if(ymax < P[n].y) ymax=P[n].y;
    }
    C->y=ymin;
    C->high=ymax-ymin+0;
    if(C->high<2) return;

    NLOOP(Pnb)
        DrawEdge(C,&P[n],&P[n+1]);

    ymed=(ymax+ymin)/2;
    if(C->edge1[ymed].x < C->edge2[ymed].x)
        FillPoly(C,C->edge1,C->edge2);
    else
        FillPoly(C,C->edge2,C->edge1);
}
| |
| | Wawa T Posts 695 15 Mar 2017 13:47
| hi alain, good to see you here, have you been lurking?
| |
| | Thellier Alain Posts 143 15 Mar 2017 13:49
| >Mix this texel, while fetch next-next-texel. Certainly at ASM level it will add some speed. Anyway, even if the Vampire can't reach enough speed for all GL effects, having a triangle texturer with blending can be interesting for emulating CompositeTags (the OS4 compositing function) for games like MACE. Alain Thellier
| |
| | Norbert Kett (Apollo Team Member) Posts 39 16 Mar 2017 03:53
| Hello Alain, Thanks for sharing these infos here :) using fixed 16:16 for rendering is a good idea. (OpenGL has such a datatype: GL_FIXED) TinyGL uses fixed point arithmetic in texel processing. it should be extended to the whole triangle drawing part. on other areas we need float math with FPU. for example a matrix multiplication uses 64 fmul, 64 fadd. its 'just' 128 fpu instructions. but with soft-fpu its about 10000 integer instructions.
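The operation count for a 4x4 matrix multiplication can be checked with a straightforward triple loop. This is a sketch for illustration, not TinyGL code: the name `mat4_mul` and the counter parameters are made up, matrices are assumed row-major, and `out` must not alias `a` or `b`.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply two 4x4 row-major matrices, counting the floating-point operations used. */
static void mat4_mul(const float a[16], const float b[16], float out[16],
                     int *muls, int *adds)
{
    assert(a != NULL && b != NULL && out != NULL && muls != NULL && adds != NULL);
    *muls = 0;
    *adds = 0;
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) {
                sum += a[r * 4 + k] * b[k * 4 + c];  /* one multiply, one add */
                (*muls)++;
                (*adds)++;
            }
            out[r * 4 + c] = sum;
        }
    }
}
```

Written this way the loop performs 64 multiplies and 64 adds, as stated in the post. Starting each sum from the first product instead of zero would need only 48 adds, but either way the count is tiny compared with routing every operation through a soft-float library.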
| |