
Information about the Apollo CPU and FPU.


Olaf Schoenweiss

Posts 690
08 May 2017 17:40


EXTERNAL LINK


Olaf Schoenweiss

Posts 690
08 May 2017 17:41


Aksel Andersen wrote:

Gunnar von Boehn wrote:

 
Aksel Andersen wrote:

    Now you are just being difficult.
 

  Do you really not understand what I mean?
 
 

 
  Well I am confused. Many of us are confused about this. Doesn't this thread show this?
 
  But let's not go into a flame war here guys. :)

He wants an idea for a use case that is useful and shows the superiority of Vampire/Apollo compared to existing solutions.


Gunnar von Boehn
(Apollo Team Member)
Posts 6214
08 May 2017 17:47


Aksel Andersen wrote:

  Well I am confused.
 

 
  You know RIVA?
  RIVA is a video player written in 68k ASM.
  This video player allows you to watch videos on the AMIGA.
  Henryk used RIVA as a demonstration case.
  That means he wrote some inner functions to use AMMX instructions.
  This demonstrated very clearly how AMMX tremendously improves performance.
 
 
  Now we need the same for the FPU.
  All existing 68k FPU code is probably NOT useful for this.
  Why?
  Because the old 68K FPU was very weak.
  So code was written for it without the expectation or consideration that it would need to reach high performance on a fast FPU.
 
  Does this make sense now?


Chain Q

Posts 19
08 May 2017 17:50


Gunnar von Boehn wrote:

  We talk here about a use case..
  Something to show the superiority...
 

*RANT DELETED*


Aksel Andersen

Posts 120
08 May 2017 17:51


Gunnar von Boehn wrote:

Aksel Andersen wrote:

  Well I am confused.
 

 
  You know RIVA?
  RIVA is a video player written in 68k ASM.
  This video player allows you to watch videos on the AMIGA.
  Henryk used RIVA as a demonstration case.
  That means he wrote some inner functions to use AMMX instructions.
  This demonstrated very clearly how AMMX tremendously improves performance.
 
 
  Now we need the same for the FPU.
  All existing 68k FPU code is probably NOT useful for this.
  Why?
  Because the old 68K FPU was very weak.
  So code was written for it without the expectation or consideration that it would need to reach high performance on a fast FPU.
 
  Does this make sense now?

Yes, I understand all of this. I just don't see why you push so hard to convince us that the FPU is not really needed.

Especially if you intend to implement a backwards-compatible FPU anyway.



Gunnar von Boehn
(Apollo Team Member)
Posts 6214
08 May 2017 17:56


Aksel Andersen wrote:

  Yes, I understand all of this. I just don't see why you push so hard to convince us that the FPU is not really needed.

OK, I give up. Maybe it's a language barrier.


Aksel Andersen

Posts 120
08 May 2017 17:59


You beat me to it.


Thierry Atheist

Posts 644
08 May 2017 18:50


Google HATES ME.... no matter what I type in, it mostly returns garbage. I typed in "what are floating point units used for" and got nothing useful.

Then I remembered watching this a while back.

EXTERNAL LINK


Jari Eskelinen

Posts 23
08 May 2017 19:40


Gunnar von Boehn wrote:

  OK, I give up. Maybe it's a language barrier.

It is a language barrier. People use English words with the communication style of their native language and culture, and misunderstandings are guaranteed. I've seen it so many times on different forums... people would actually agree, but due to the language barrier it escalates into a flame war.

Also, I think there are two points of view: some want just 68881 compatibility, without any performance boost, so they can run some legacy software, while Gunnar wants something better - like AMMX.


David Wright

Posts 373
08 May 2017 21:01


Funny, I have taken it upon myself to make an earnest effort to learn German.
  I wanted to in my youth for various reasons, and then, after forgoing English on an order site, I almost ordered 5 of the same item.
 
  If it goes any further I will probably learn just enough to be dangerous.

So, Guten Tag


Wawa T

Posts 695
08 May 2017 22:02


  I just don't see why you push so hard to convince us that the FPU is not really needed.

 
  If an FPU is not abstracted away via a library, for example (which it usually won't be, for speed reasons), it might not be possible to trap its instructions, even taking the penalties into account. So if a new adapted binary becomes necessary (a recompilation or a patch), the change introduces a compatibility breach between the old and new implementations and makes old FPU software obsolete, impossible to use. So far it is clear to me, I guess.


Daniel Sevo

Posts 299
08 May 2017 22:19



Gunnar, am I getting this right?

The test case you want - the one to show the real benefit of the improved FPU that you already have, but which is "deactivated/not included" in the current core: is this all about showing how to greatly accelerate some piece of software with properly optimized FPU code, in a way the CPU alone couldn't, and thus proving that the FPU "has earned its place in the FPGA" and that the use of precious logic elements in the FPGA is justified?

Meaning, it is not seen as "enough" to simply achieve straight compatibility with the 68060 FPU - allowing some percentage of old software to run - to be worth using up FPGA space?




Marcus Sackrow

Posts 37
08 May 2017 22:24


Gunnar von Boehn wrote:

  You know RIVA?
  RIVA is a video player written in 68k ASM.
  This video player allows you to watch videos on the AMIGA.
  Henryk used RIVA as a demonstration case.
  That means he wrote some inner functions to use AMMX instructions.
  This demonstrated very clearly how AMMX tremendously improves performance.
 
 
  Now we need the same for the FPU.
  All existing 68k FPU code is probably NOT useful for this.
  Why?
  Because the old 68K FPU was very weak.
  So code was written for it without the expectation or consideration that it would need to reach high performance on a fast FPU.
 
  Does this make sense now?

POV-Ray?
Or a real-time raytracing routine (I wrote/ported some recently), something like that?



Gunnar von Boehn
(Apollo Team Member)
Posts 6214
08 May 2017 22:54


Marcus Sackrow wrote:

POV-Ray?
Or a real-time raytracing routine (I wrote/ported some recently), something like that?

I'll try to explain.
The 68060 FPU was not pipelined.
So you could write stuff like


FADD F0,F1
FADD F1,F2
FMUL F2,F3
FMUL F3,F4

and the code would execute sequentially at normal speed,
each instruction taking 3 clocks!

Our FPU works differently.
We can execute FADD and FMUL in parallel, with a throughput of 1 per clock each. So 3 FLOPS per clock.
But these 2 instructions of course need to be of different types:
1 MUL, 1 ADD.
And they need to use different inputs/outputs,
i.e. not depend on each other. And as there is a LATENCY of several cycles before the result becomes valid, several instructions need to be fully independent.

Our requirement for independence is "normal" and state of the art.
A modern x86 FPU or a modern POWER FPU has the very same requirements.

The old 68K FPUs, which were slow and not parallel,
did not have these requirements.
So we can assume that old code did not take care of this.

So our test case will need to be written like this for maximum performance.

The test case should also NOT be too long in ASM,
so that we can TRACE and SINGLE-step it many times and measure things. We would like to use it to learn whether our result forwarding works ideally, whether prefetching works perfectly, whether the core picks the ideal superscalar grouping, etc.

So ideally we have a test case for which we can "hand count" the expected MIPS and review whether all units in the core work without bubbles, so that we really reach the performance goal.

We need the test case not only to show how cool we are but also to "set up" our engine optimally. We would use it to profile the core and to locate areas to improve.

Does the explanation make sense now?
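The dependency point above can be sketched in plain C (a hypothetical illustration, not Apollo team code): a single-accumulator loop serializes every add, while splitting the sum across several independent accumulators gives a pipelined FPU several operations in flight at once.

```c
#include <stddef.h>

/* Serial dependency chain: each add must wait for the previous
 * result, so an FPU with multi-cycle result latency stalls. */
static float sum_serial(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];               /* next iteration depends on s */
    return s;
}

/* Four independent accumulators: the four adds in the loop body do
 * not depend on each other, so they can overlap in the pipeline.
 * This is the kind of independence the test case needs to exhibit. */
static float sum_pipelined(const float *x, size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)           /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both functions compute the same sum; only the shape of the dependency graph differs, which is exactly what a hand-scheduled ASM test case would control.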


Saladriel Amrael

Posts 166
08 May 2017 23:05


So, something like a custom-made ASM routine for drawing a Mandelbrot or other fractals? They tend to be quite simple algorithms, AFAIK.
 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6214
08 May 2017 23:20


Saladriel Amrael wrote:

So, something like a custom-made ASM routine for drawing a Mandelbrot or other fractals? They tend to be quite simple algorithms, AFAIK.

Maybe.
Normally a Mandelbrot is iterative, i.e. each step depends on the previous calculation, so it may be complex to create the desired amount of data independence.

But something pretty simple to code that creates high data/instruction parallelism would be a MATRIX-MUL like the ones done by normal 3D games.

So maybe calculating 3D coordinates is a sensible test case,
as this also has real practical reuse value for games.
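For reference, the MATRIX-MUL suggested above can be sketched in C (a hypothetical illustration, not code from the thread): each of the 16 output elements is an independent 4-term dot product, so the FPU can keep many FMUL/FADD pairs in flight instead of stalling on one chain.

```c
/* 4x4 matrix multiply, out = a * b, as used by typical 3D games.
 * Every out[r][c] is independent of every other output element,
 * which is exactly the data parallelism a superscalar FPU needs. */
static void mat4_mul(const float a[4][4], const float b[4][4],
                     float out[4][4]) {
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[r][c] = a[r][0] * b[0][c] + a[r][1] * b[1][c]
                      + a[r][2] * b[2][c] + a[r][3] * b[3][c];
}
```

The inner expression is also short enough to hand-translate to ASM and "hand count" the expected cycles, as the profiling goal above requires.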


Kolbjørn Barmen
(Needs Verification)
Posts 219/ 2
08 May 2017 23:34


Olaf Schoenweiss wrote:

  At the moment SAGA and 16-bit sound are more important, and I guess most users will agree

16-bit sound? Whatever for? Hardly any software supports it.


Kolbjørn Barmen
(Needs Verification)
Posts 219/ 2
08 May 2017 23:55


Thierry Atheist wrote:

Google HATES ME.... no matter what I type in, it mostly returns garbage. I typed in "what are floating point units used for" and got nothing useful.
 
  Then I remembered watching this a while back.
 
  EXTERNAL LINK 

Here, maybe you find this useful...
EXTERNAL LINK


Chain Q

Posts 19
09 May 2017 00:08


Gunnar von Boehn wrote:

  I'll try to explain.
  The 68060 FPU was not pipelined.
  So you could write stuff like
 

  FADD F0,F1
  FADD F1,F2
  FMUL F2,F3
  FMUL F3,F4
 

  and the code would execute sequentially at normal speed,
  each instruction taking 3 clocks!

Hand-written FPU assembly is rare. So better to optimize for what compilers generate.

So here, have a real-world, compiler-generated FPU code example. It's an actual 4x4 matrix multiplication used in a 3D engine. Yes, it's far from ideal when it comes to two pipes running in parallel, but still, a lot of things can be paired, and some of the overlaps can be dealt with by register renaming and potential OoOE. Also, for the memory reads, a prefetch engine can work ahead.

Of course, it's also not that hard to add instruction rescheduling to compilers, once there's actually a CPU which needs FPU instruction rescheduling.

As usual, when you try to eliminate bottlenecks, the biggest problem is the lack of 3-operand instructions, which results in a lot of extra FMOVEs; otherwise a lot of the FMOVE+FOP or FOP+FMOVE pairs could be turned into 3-operand instructions. The fact that one of the operands is both read *and* written at the same time for most ops doesn't help. Maybe the core could also add exceptions for that; I think even the '060 has exceptions for similar cases when it comes to integer instruction pairing.


  link.w %a5,#-88
  movem.l %a2/%a3/%a4/%a6,-88(%a5)
  fmovem.x %fp2/%fp3/%fp4/%fp5/%fp6/%fp7,-72(%a5)

  move.l 8(%a5),%a6
  moveq.l #-1,%d1
  .balignw 4,0x4e71
.Lj20:
  addq.l #1,%d1
  move.l %d1,%d0
  lsl.l #4,%d0
  lea (%a1,%d0.l),%a3
  move.l %a3,%a2

  move.l %d1,%d0
  lsl.l #4,%d0
  lea (%a0,%d0.l),%a3
  move.l %a3,%a4

  fmove.s (%a2),%fp1
  fmove.s 4(%a2),%fp3
  lea (%a6),%a3

  fmove.x %fp1,%fp0
  fmul.s (%a3),%fp0
  fmove.x %fp3,%fp2
  fmul.s 16(%a3),%fp2
  fadd.x %fp2,%fp0
  fmove.x %fp0,%fp7
  lea (%a6),%a3

  fmove.x %fp1,%fp0
  fmul.s 4(%a3),%fp0
  fmove.x %fp3,%fp2
  fmul.s 20(%a3),%fp2
  fadd.x %fp2,%fp0
  fmove.x %fp0,%fp4
  lea (%a6),%a3

  fmove.x %fp1,%fp0
  fmul.s 8(%a3),%fp0
  fmove.x %fp3,%fp2
  fmul.s 24(%a3),%fp2
  fadd.x %fp2,%fp0
  fmove.x %fp0,%fp5
  lea (%a6),%a3

  fmove.x %fp1,%fp0
  fmul.s 12(%a3),%fp0
  fmove.x %fp3,%fp2
  fmul.s 28(%a3),%fp2
  fadd.x %fp2,%fp0
  fmove.x %fp0,%fp6

  fmove.s 8(%a2),%fp1
  fmove.s 12(%a2),%fp3
  lea (%a6),%a3

  fmove.x %fp1,%fp0
  fmul.s 32(%a3),%fp0
  fmove.x %fp3,%fp2
  fmul.s 48(%a3),%fp2
  fadd.x %fp2,%fp0
  fadd.x %fp7,%fp0
  fmove.s %fp0,(%a4)
  lea (%a6),%a3

  fmove.x %fp1,%fp0
  fmul.s 36(%a3),%fp0
  fmove.x %fp3,%fp2
  fmul.s 52(%a3),%fp2
  fadd.x %fp2,%fp0
  fadd.x %fp4,%fp0
  fmove.s %fp0,4(%a4)
  lea (%a6),%a3

  fmove.x %fp1,%fp0
  fmul.s 40(%a3),%fp0
  fmove.x %fp3,%fp2
  fmul.s 56(%a3),%fp2
  fadd.x %fp2,%fp0
  fadd.x %fp5,%fp0
  fmove.s %fp0,8(%a4)
  lea (%a6),%a3

  fmove.x %fp1,%fp0
  fmul.s 44(%a3),%fp0
  fmove.x %fp3,%fp2
  fmul.s 60(%a3),%fp2
  fadd.x %fp2,%fp0
  fadd.x %fp6,%fp0
  fmove.s %fp0,12(%a4)
  cmp.l #3,%d1
  jlt .Lj20

  move.l %a0,%d0
  movem.l -88(%a5),%a2/%a3/%a4/%a6
  fmovem.x -72(%a5),%fp2/%fp3/%fp4/%fp5/%fp6/%fp7
  unlk %a5
  rtd #4





Gunnar von Boehn
(Apollo Team Member)
Posts 6214
09 May 2017 05:30


Many thanks for posting the example code.
Your code example shows a few things:
 
 
Our proposed FMOVE+FOP fusing is important.
 
The compiler generates too many dependencies.
The FPU performance is crippled because of this.
Hand scheduling will greatly improve performance, by 3x.
 
 
The compiler creates unneeded LEA instructions.
 
Our FOP + MOVE-to-MEM fusing is again very useful.
 
 
Your example verifies very nicely what we have been saying.
Which is not surprising, as we analysed a lot of code beforehand.
 
So we all agree that hand scheduling will greatly improve performance.
With good code, being many times faster than the 68060 is possible.