APOLLO CPU Knowledge Forum

Performance and Benchmark Results!

OpenGL On Vampire Cards

Gunnar von Boehn
(Apollo Team Member)
Posts 6301
14 Mar 2017 15:52

Norbert Kett wrote:

i made a quick calculation: if i assume the CPU runs at 100MHz and i want 30 frames / second at 640x400 pixels, then i have about 13 CPU cycles to compute one pixel, which is not much. usually an application needs to compute many other things too. so, if i'm correct, without special texturing HW we cannot achieve ps2 quality. i feel AMMX won't help.

The calculation is a good start.
But mind that we also decode VIDEO at 640x400 and 30 FPS with no problem.
We also play DOOM at 640x400 smoothly, without problems.
And this even without using AMMX!
So we have already proved that we can do this resolution.

Now show us which instructions/operations are needed per texel.
Then we can do the clock counting!

;-)
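
For reference, the cycle budget quoted above can be reproduced with a few lines of C. This is only a sketch of the arithmetic; the function and parameter names are made up for illustration and do not come from any Apollo tool.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available per pixel for a given clock, resolution and frame rate.
   Names are illustrative only. */
static uint32_t cycles_per_pixel(uint32_t clock_hz, uint32_t width,
                                 uint32_t height, uint32_t fps)
{
    uint32_t pixels_per_second = width * height * fps;
    assert(pixels_per_second > 0);
    return clock_hz / pixels_per_second;   /* integer division rounds down */
}
```

With a 100 MHz clock, 640x400 pixels and 30 frames per second this gives 13, and that budget has to cover everything the application does, not only texturing.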

Norbert Kett
(Apollo Team Member)
Posts 39
14 Mar 2017 17:53

ok, first let's see what we are talking about :)

here is a triangle renderer source:
EXTERNAL LINK and the include:
EXTERNAL LINK
tinygl has many kinds of triangle renderer functions, and all of them use this include, so it's a bit messy. here is the merged renderer:
EXTERNAL LINK
we can see it's optimized: it can emit 8 texels in a row. the PUT_PIXEL part is integer only.

here is the no-fpu asm code:
EXTERNAL LINK
and with fpu:
EXTERNAL LINK
it is easy to find the put pixel part, which is included 8 times in a row.

here is the put_pixel asm code:

movel d6,d2
moveq #14,d0
lsrl d0,d2
clrl d0
movew a2@(4),d0
cmpl d2,d0
jhi L43
movel d4,d1
andl a5@(-28),d1
clrl d0
movew a5@(-22),d0
lsll d0,d1
movel d5,d0
andl a5@(-28),d0
orl d0,d1
clrl d0
movew a5@(-24),d0
lsrl d0,d1
movel a5@(-4),a0
movew a0@(d1:l),a3@(4)
movew d2,a2@(4)
L43:
addl a5@(-96),d6
addl d7,d5
addl d3,d4

and we have no pixel blending yet. that's all for now ;)
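
As a reading aid, here is a rough C rendering of what the listing above appears to do for one pixel: shift the depth accumulator down by 14 bits, compare it with the 16-bit Z-buffer entry, build a byte offset into the texture from the two masked texture coordinates, copy one 16-bit texel, then step the three accumulators. The struct, its field names and the choice of which coordinate is called s or t are assumptions made for this sketch; they are not taken from the TinyGL sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-span state; field names are illustrative only. */
typedef struct {
    uint32_t z, dzdx;       /* depth accumulator and step (d6, a5@(-96)) */
    uint32_t s, dsdx;       /* unshifted coordinate and step (d5, d7)    */
    uint32_t t, dtdx;       /* shifted coordinate and step (d4, d3)      */
    uint32_t coord_mask;    /* mask applied to both coordinates          */
    unsigned t_shift;       /* left shift applied to the masked t        */
    unsigned index_shift;   /* final right shift giving a byte offset    */
} SpanState;

static void put_pixel(SpanState *st, const uint8_t *texture_bytes,
                      uint16_t *pixel, uint16_t *depth)
{
    assert(st->t_shift < 32 && st->index_shift < 32);
    uint32_t zz = st->z >> 14;
    if (*depth <= zz) {                       /* the jhi skips when *depth > zz */
        uint32_t offset = (((st->t & st->coord_mask) << st->t_shift)
                           | (st->s & st->coord_mask)) >> st->index_shift;
        memcpy(pixel, texture_bytes + offset, sizeof *pixel);  /* one texel */
        *depth = (uint16_t)zz;
    }
    st->z += st->dzdx;                        /* the three addl at L43 */
    st->s += st->dsdx;
    st->t += st->dtdx;
}
```

Seen this way, most of the instructions go into building the texel offset from masks and shifts that are reloaded from memory for every pixel.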


Gunnar von Boehn
(Apollo Team Member)
Posts 6301
14 Mar 2017 18:15

What is the typical texture size (X/Y)? What sizes of texture are supported?


Norbert Kett
(Apollo Team Member)
Posts 39
14 Mar 2017 18:23

TinyGL originally supported only 256x256, now: from 64x64 to 2048x2048.

Gunnar von Boehn
(Apollo Team Member)
Posts 6301
14 Mar 2017 18:26

Norbert Kett wrote:

TinyGL originally supported only 256x256, now: from 64x64 to 2048x2048.

OK, let's make a simple example together.
Let's write in ASM the code to put a pixel from a 256x256 texture.

for this case the core function code is:


move.w (A0,D0*2),(A1)+

== 1 cycle

Where D0 is the texture index.
Layout in Index Register is
YyXx

So this core is pretty simple.
You also maintain a Z buffer, so you do:


CMP.W  D1,(A2)
bhi    .noset
move.w (A0,D0*2),(A1)+
move.w D1,(A2)+
.noset

So the main function of the code is only 4 instructions.
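
A C sketch of the same idea for a 256x256 texture of 16-bit texels follows. It assumes that "YyXx" means Y and X each held as 8.8 fixed point, with Y in the upper word and X in the lower word; that reading of the layout is an assumption, as are all names in the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 31..24 = Y integer, 23..16 = Y fraction,
   bits 15..8  = X integer,  7..0  = X fraction ("YyXx"). */
static uint32_t texel_index_256(uint32_t coord)
{
    uint32_t y = (coord >> 24) & 0xFFu;
    uint32_t x = (coord >> 8) & 0xFFu;
    return (y << 8) | x;                 /* row * 256 + column */
}

static void put_texel_256(const uint16_t *texture, uint32_t coord,
                          uint16_t depth, uint16_t *pixel, uint16_t *zbuffer)
{
    assert(texture && pixel && zbuffer);
    if (*zbuffer > depth)                /* same sense as CMP.W D1,(A2) / BHI */
        return;
    *pixel = texture[texel_index_256(coord)];
    *zbuffer = depth;
}
```

Note that the single MOVE.W in the asm assumes the index is already sitting in D0; extracting it from packed coordinates, as done here, costs extra work per texel.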
---

May I ask which compiler version and which compile options you use?
Your disasm looks a bit like it does some extra casting from LONG to WORD


Norbert Kett
(Apollo Team Member)
Posts 39
15 Mar 2017 04:21

gcc 2.95.3.


Norbert Kett
(Apollo Team Member)
Posts 39
15 Mar 2017 04:45

i peeked into the gcc soft-float lib src, i was curious about how many instructions are used for float operations/conversions... omg! i think it's pointless to do any optimizations until the FPU is enabled in the Apollo core :/

Gunnar von Boehn
(Apollo Team Member)
Posts 6301
15 Mar 2017 06:52

Norbert Kett wrote:

i think it's pointless to do any optimizations until the FPU is enabled in the Apollo core :/

Certainly using softfloat makes absolutely no sense here.
On the other hand, enabling FPU instructions would be possible.
But I agree with you that this C code will generally be a problem.

We have to keep in mind that competitive 3D games like DOOM had
their render code written in hand-tuned ASM.
The inner loop of the DOOM rasterizer is probably fewer than 10 ASM instructions in total, including incrementing Y,X and U,H.

The project can be a success if we write the render functions also in handtuned ASM.

Norbert Kett
(Apollo Team Member)
Posts 39
15 Mar 2017 07:18

yes, i already updated vasm to a new version with Apollo support.
and i believe the rendering can be much faster with proper optimization. the asm compiled from c is just a good guide here.
but the fpu is essential, some soft fpu functions use more than 300 instructions.
btw doom's vertical column rendering is much simpler compared to a 'real' 3D polygon renderer with perspective correction and z buffer r/w.
i saw quake's handtuned x86 asm render code. many thousands of lines for rendering.

Wawa T

Posts 695
15 Mar 2017 07:19

but doom is a 2d engine providing an illusion of 3d action, while an implementation of even a subset of some historical version of ogl is still a true 3d multipurpose engine, demanding its share of complexity and flexibility. if tinygl is going to be a base for porting applications it cannot be that tricky and restricted. it might be possible to have it implemented within reason only with integer arithmetic, it seems that is what os4 people (hans) are doing when they use compositing to fake 3d in demos, still im not sure it will alone gain enough speed to have your average simple 3d game port run on a vampire at a playable speed, say some 10 fps..

edit: norbert, you have beat me to it;)

Grzegorz Wójcik (pisklak)
(Apollo Team Member)
Posts 87
15 Mar 2017 08:25

Well OGL is for sure nice - but I think if anyone wants that on a no-FPU platform then TinyGL needs a really serious rewrite.
For sure someone can write a good working 3D engine with only fixed-point math, but I guess that will not be OGL compatible.
We have some nice AMMX instructions that for sure can help with texture filtering and some other stuff. We may focus on that right now, and later, when we have a full FPU, optimize all the other stuff...

Gunnar von Boehn
(Apollo Team Member)
Posts 6301
15 Mar 2017 08:28

wawa t wrote:

but doom is a 2d engine providing an illusion of 3d action,

Of course every 3D engine is a simplification.
Some more, some less.
It's always a trade-off of correctness for performance.
Some more, some less.
3D game engines never do true raycasting.

Important for success is to be aware of the possibilities and to set a reasonable goal.

I would recommend writing the rasterizer in ASM to be able to take full control of superscalar execution, of data prefetching, and of AMMX.
I think it would make sense to work from the inside to the outside.
Start with simple rasterizer demos first and understand how to get the core routines to full speed.

If someone would like to do this, then I'm happy to help with the ASM tuning. And if needed we can always cheat by adding new instructions if we identify very useful cases.

Wawa T

Posts 695
15 Mar 2017 12:33

gunnar, this is not a question of writing an engine. you dont need an ogl library, against which you compile your app, to do that. but hardly anyone will write a dedicated separate 3d engine for every separate amiga app he wants to develop, if any.

a library you can compile ogl stuff against is mostly good for porting sources from linux. such sources dont necessarily contain floating point variables, id say, but i dont remember well; one would have to grep some of them for types.

the question is, does it pay to develop an integer only library now, taking speed penalties and other limits into account, or is it better to wait for the fpu implementation.

Gunnar von Boehn
(Apollo Team Member)
Posts 6301
15 Mar 2017 12:37

wawa t wrote:

the question is, does it pay to develop now an integer only library..

Thank you for repeating the point I made!

If you read my post carefully you see that I clearly said to use FPU and to use AMMX.

Thellier Alain

Posts 144
15 Mar 2017 13:19

Hello guys

As you talked about my Wazp3D stuff, let me join your discussion...

1) as clipping may occur, we will have to draw not only triangles but also clipped triangles, which are in fact polygons.
So the right way to do a fill is:

for(polygonheight)
draw the horizontal segment between the 2 edges

2) at this point we dont need the fpu, as all (float) values may be stored as 32 bits with the upper 16 bits representing the integer part (16.16 fixed point)

3) So the "game" will be to linearly interpolate several values along this segment
Lets start with 2 values: U V, the texture coordinates
We interpolate simply
u=u+du;
v=v+dv;

4) This C code (that comes from my PatchCompositeTags prog) shows how to fill a poly from the two edges already computed

/*==================================================================*/
void FillPoly_A8R8G8B8_Fast(struct Comp3D *C,struct Edge3D *Edge1,struct Edge3D *Edge2)
{
    register LONG x,dx;
    register LONG u,du;
    register LONG v,dv;
    register ULONG m,n;
    register LONG y;

    register ULONG *Src32; /* bm memory */
    register ULONG *Dst32; /* bm memory */
    register ULONG *Dst32X; /* bm memory */
    register ULONG sline;
    register ULONG dline;
    register ULONG pix32;

    FUNC
    y=C->y;
    Edge1=&Edge1[y];
    Edge2=&Edge2[y];

    /* now lock and begin to draw pixels */
    LOCKBM(C->Src);
    Src32=C->renderInfo.Memory;
    sline=C->Src.LineSize/4;

    LOCKBM(C->Dst);
    Dst32=C->renderInfo.Memory;
    dline=C->Dst.LineSize/4;

    MLOOP(C->high)
    {
        x =(Edge1->x); dx=(Edge2->x - Edge1->x)+1;

        if(dx < 1)
            goto LineDone;

        u=(Edge1->u);
        v=(Edge1->v);

        du=(((Edge2->u>>16) - (Edge1->u>>16))<<16)/dx;
        dv=(((Edge2->v>>16) - (Edge1->v>>16))<<16)/dx;
        Dst32X=&Dst32[y*dline + x];

        NLOOP(dx)
        {
            pix32=Src32[ (v>>16)*sline + (u>>16)];
            Dst32X[n]=pix32;
            u=u+du;
            v=v+dv;
        }

LineDone:
        Edge1++;
        Edge2++;
        y++;
    }

    /* unlock all */
done:
    UNLOCKBM(C->Src);
    UNLOCKBM(C->Dst);
}

Alain Thellier - Wazp3D
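
The 16.16 interpolation from points 2) and 3) can also be tried outside Wazp3D with a small standalone sketch. The type, macro and function names below are invented for this example, and it assumes non-negative coordinates, because right-shifting a negative signed value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed16;                      /* 16.16 fixed point */
#define FIX_FROM_INT(i) ((fixed16)((uint32_t)(i) << 16))
#define FIX_TO_INT(f)   ((f) >> 16)           /* assumes f >= 0 */

/* Per-pixel step that spreads (end - start) over `count` pixels. */
static fixed16 fix_step(fixed16 start, fixed16 end, int32_t count)
{
    assert(count > 0);
    return (end - start) / count;
}

/* Fill one horizontal span, stepping u and v in 16.16 as in the inner loop. */
static void draw_span(uint32_t *dst, int32_t count,
                      const uint32_t *tex, int32_t tex_width,
                      fixed16 u, fixed16 v, fixed16 du, fixed16 dv)
{
    for (int32_t n = 0; n < count; n++) {
        dst[n] = tex[FIX_TO_INT(v) * tex_width + FIX_TO_INT(u)];
        u += du;
        v += dv;
    }
}
```

The inner loop contains only shifts, adds and one multiply; the single division happens once per span, when the step is computed.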

Gunnar von Boehn
(Apollo Team Member)
Posts 6301
15 Mar 2017 13:34

thellier alain wrote:

Hello guys

Nice to see you here Alain!

You are spot on, the inner loop can nicely be done in INTEGER.
Thank you for the code example.

In my opinion one should write the inner rasterizer in ASM,
to have the most control over CPU behavior.
With Apollo you have a SW-programmable DCache prefetch instruction.
If you write the code accordingly,
e.g. mix this texel while fetching the next-next texel,
then you can ideally hide memory latency.
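
To illustrate the pattern in C: the sketch below uses the GCC/Clang builtin `__builtin_prefetch` as a stand-in for the Apollo prefetch instruction. That substitution is an assumption, and whether a given compiler emits anything useful for it on this target is not known here; older compilers such as the gcc 2.95.3 mentioned earlier may not provide the builtin at all. The function and macro names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 2   /* the "next-next" texel */

/* Copy texels selected by `indices`, requesting the texel two steps
   ahead while the current one is being processed. */
static void gather_texels(uint16_t *dst, const uint16_t *texture,
                          const uint32_t *indices, size_t count)
{
    assert(dst && texture && indices);
    for (size_t n = 0; n < count; n++) {
        if (n + PREFETCH_AHEAD < count)
            __builtin_prefetch(&texture[indices[n + PREFETCH_AHEAD]], 0, 1);
        dst[n] = texture[indices[n]];
    }
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing can change.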

What do you think Alain?

Thellier Alain

Posts 144
15 Mar 2017 13:34

And now the draw edges part (I dont give a TestPoly/ClipPoly code as any GL implementation will have it)

/*==================================================================*/
void DrawEdge(struct Comp3D *C,struct Point3D *P0,struct Point3D *P1)
{
    register struct Edge3D *E;
    register LONG x,dx;
    register LONG y,dy;
    register LONG u,du;
    register LONG v,dv;
    register LONG w,dw;
    register ULONG m;
    APTR temp;

    if(C->UseLine)
        DrawLine(C,P0,P1);
    FUNC
    if(P0->y < P1->y)
        {E=C->edge1;}
    else
        {SWAP(P0,P1); E=C->edge2;}

    dy= floor(P1->y) - floor(P0->y) + 1;
    if(dy<1)
        return;

    x=floor(P0->x); x=(x<<16); dx=floor(P1->x) - floor(P0->x); dx=(dx<<16)/dy;
    u=floor(P0->u); u=(u<<16); du=floor(P1->u) - floor(P0->u); du=(du<<16)/dy;
    v=floor(P0->v); v=(v<<16); dv=floor(P1->v) - floor(P0->v); dv=(dv<<16)/dy;
    w=floor(P0->w); w=(w<<16); dw=floor(P1->w) - floor(P0->w); dw=(dw<<16)/dy;

    y=P0->y;
    E=&E[y];
    MLOOP(dy)
    {
        E[m].x=x>>16; x=x+dx;
        E[m].y=y; y=y+1;
        E[m].u=u; u=u+du;
        E[m].v=v; v=v+dv;
        E[m].w=w; w=w+dw;
    }

    /* securize extremities */
    E[0].x=floor(P0->x);
    E[0].y=floor(P0->y);
    E[0].u=floor(P0->u); E[0].u=(E[0].u<<16);
    E[0].v=floor(P0->v); E[0].v=(E[0].v<<16);
    E[0].w=floor(P0->w); E[0].w=(E[0].w<<16);

    dy=dy-1;
    E[dy].x=floor(P1->x);
    E[dy].y=floor(P1->y);
    E[dy].u=floor(P1->u); E[dy].u=(E[dy].u<<16);
    E[dy].v=floor(P1->v); E[dy].v=(E[dy].v<<16);
    E[dy].w=floor(P1->w); E[dy].w=(E[dy].w<<16);

}
/*==================================================================*/
void DrawPoly(struct Comp3D *C,ULONG Pnb)
{
    register LONG ymin,ymax,ymed;
    register struct Point3D *P=C->PolyP;
    register n;

    C->PolyP[Pnb].x=C->PolyP[0].x; /* close poly */
    C->PolyP[Pnb].y=C->PolyP[0].y;
    C->PolyP[Pnb].u=C->PolyP[0].u;
    C->PolyP[Pnb].v=C->PolyP[0].v;
    C->PolyP[Pnb].w=C->PolyP[0].w;

    C->NotClipped=TestPoly(C,Pnb);

    if(C->NotClipped==-1) return;
    if(C->NotClipped== 0) Pnb=ClipPoly(C,Pnb);

    ymin=ymax=P[0].y;
    NLOOP(Pnb)
    {
        if(P[n].y < ymin) ymin=P[n].y;
        if(ymax < P[n].y) ymax=P[n].y;
    }
    C->y=ymin;
    C->high=ymax-ymin+0;

    if(C->high<2)
        return;

    NLOOP(Pnb)
        DrawEdge(C,&P[n],&P[n+1]);

    ymed=(ymax+ymin)/2;

    if(C->edge1[ymed].x < C->edge2[ymed].x)
        FillPoly(C,C->edge1,C->edge2);
    else
        FillPoly(C,C->edge2,C->edge1);
}


Wawa T

Posts 695
15 Mar 2017 13:47

hi alain, good to see you here, have you been lurking?

Thellier Alain

Posts 144
15 Mar 2017 13:49

>Mix this texel, while fetch next-next-texel.
Certainly at ASM level it will add some speed

Anyway even if Vampire cant have enough speed for all GL effects having a triangle texturer with blending can be interesting for emulating CompositeTags (the OS4 compositing function) for games like MACE

Alain Thellier

Norbert Kett
(Apollo Team Member)
Posts 39
16 Mar 2017 03:53

Hello Alain,

Thanks for sharing these infos here :)

using fixed 16:16 for rendering is a good idea. (OpenGL has such a datatype: GL_FIXED) TinyGL uses fixed point arithmetic in texel processing. it should be extended to the whole triangle drawing part.

in other areas we need float math with the FPU.
for example a 4x4 matrix multiplication uses 64 fmul, 64 fadd. it's 'just' 128 fpu instructions. but with soft-fpu it's about 10000 integer instructions.
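
The operation count can be checked with a plain 4x4 multiply that tallies its own multiplies and adds. This is a sketch with made-up names, not TinyGL's matrix code. It reaches 64 adds because each sum starts from zero; starting each sum from the first product would need only 48.

```c
#include <assert.h>

typedef struct { unsigned mul, add; } OpCount;

/* out = a * b for row-major 4x4 matrices stored as 16 floats.
   Returns how many float multiplies and adds were performed. */
static OpCount mat4_mul(float *out, const float *a, const float *b)
{
    OpCount ops = {0u, 0u};
    assert(out != a && out != b);        /* result must not alias the inputs */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) {
                sum += a[i * 4 + k] * b[k * 4 + j];
                ops.mul++;
                ops.add++;
            }
            out[i * 4 + j] = sum;
        }
    }
    return ops;
}
```

With soft-float, every one of those 128 operations becomes a library call, which is where the large integer instruction count comes from.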
