

Writing 3D Engine for 68080 In ASM

Vojin Vidanovic
(Needs Verification)
Posts 1916/ 1
23 Dec 2019 14:12


Vladimir Repcak wrote:

  I have just realized, that we could use the MMU caching algorithm to cache the actual FrameBuffer. 64 KB at 320x200@8bit would quite literally be full screen inside cache.
 

 
  Note that the Apollo core presently DOES NOT have an m68k-compatible MMU, so you have to do it by hand OR consider exactly what you want to do and program it for Apollo's own "PMMU" unit.
 
  This could change in future.
 
  OPEN WORLD ADVENTURE PROPOSAL
 
  Hunter
  https://en.wikipedia.org/wiki/Hunter_(video_game)
  Trex Warrior
  EXTERNAL LINK 
  Also great isometric 3D titles (these need textures):
 
  Robocop 3D 1992
  No Second Prize (1992) EXTERNAL LINK
  F1 GP https://en.wikipedia.org/wiki/Formula_One_Grand_Prix_(video_game)
  D/generation
  Virus
  Syndicate
  Zeewolf 2
  Adventures of Robin Hood, The
  Megalomania


Nixus Minimax

Posts 416
23 Dec 2019 14:37


Vojin Vidanovic wrote:

    Note that the Apollo core presently DOES NOT have an m68k-compatible MMU

I think Vlad uses the term "MMU" in a non-standard way. None of the caching operations of the 080 need any programmer-side configuration nor interference with the processor's (P)MMU.



Vojin Vidanovic
(Needs Verification)
Posts 1916/ 1
23 Dec 2019 14:38


Nixus Minimax wrote:

  I think Vlad uses the term "MMU" in a non-standard way. None of the caching operations of the 080 need any programmer-side configuration nor interference with the processor's (P)MMU.

If it's memory caching by the CPU, done in m68k ASM, the merrier.



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
23 Dec 2019 15:11


Vladimir Repcak wrote:

I'm pretty reasonably sure I could fit whole flatshader pipeline (frustum cull, 3D transform, clip, rasterize) of my StunRunner into 16 KB of Icache.

This is good.
But even if not, it does not matter,
as the Icache can also stream in hundreds of MB/sec of instructions if needed.
This means you are not limited by the Icache size.



Vladimir Repcak

Posts 359
23 Dec 2019 18:02


Nixus Minimax wrote:

Vojin Vidanovic wrote:

    Note that the Apollo core presently DOES NOT have an m68k-compatible MMU

 
  I think Vlad uses the term "MMU" in a non-standard way. None of the caching operations of the 080 need any programmer-side configuration nor interference with the processor's (P)MMU.
 

Well, there's lots of new stuff in the 060 architecture doc I'm going through right now, so feel free to chime in when I make a wrong assumption.
The doc however does mention that the MMU manages the caches - you can even control some functionality via the CACR register. Not sure if that's the same MMU we are talking about - I stumbled upon some MMU threads which just made me more confused...

Now, I'd have to look up the 4-way set-associative ruleset, as I honestly don't recall its algorithm right now. Cache lines are 16-Byte long, preceded with a physical addr+status bit.
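
For reference, a minimal sketch of how a set-associative cache maps addresses to sets, which is what decides whether an inner loop and its data can evict each other. The geometry defaults (8 KB, 4 ways, 16-byte lines) are an assumption based on my reading of the 060 documentation, and the function names are made up for illustration. The replacement policy inside a set is not modelled at all.

```python
from collections import defaultdict

def cache_set_index(addr, cache_bytes=8 * 1024, ways=4, line_bytes=16):
    """Return (set_index, tag) for an address in a set-associative cache."""
    num_sets = cache_bytes // (ways * line_bytes)   # 128 sets with the defaults
    line_number = addr // line_bytes                # which cache-line-sized block
    return line_number % num_sets, line_number // num_sets

def set_pressure(addresses, cache_bytes=8 * 1024, ways=4, line_bytes=16):
    """Return {set_index: distinct_lines} for sets holding more lines than ways."""
    per_set = defaultdict(set)
    for addr in addresses:
        index, tag = cache_set_index(addr, cache_bytes, ways, line_bytes)
        per_set[index].add(tag)
    # Only sets asked to hold more distinct lines than they have ways can thrash.
    return {i: len(tags) for i, tags in per_set.items() if len(tags) > ways}

# Five blocks spaced 2 KB apart all land in set 0, one more than the 4 ways.
print(set_pressure([i * 2048 for i in range(5)]))
```

With these assumed numbers, anything spaced a multiple of 2 KB apart competes for the same set, so a loop touching more than four such blocks would keep evicting its own lines regardless of the replacement policy.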

But, all I wanna avoid is cache thrashing of inner loops. I'm hoping that the way MMU's cache tagging works is some kind of LRU, so it won't thrash recent cache lines just because its predictive algorithm thinks I might wanna execute code from a different page.

I'm fine with additional manual tagging/precaching/warming up caches - I've done that on PC - if anything, I prefer full manual control of cache as the usual predictive algorithms that are implemented are based on statistics which my code is usually anything but...


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
23 Dec 2019 18:18


Vladimir Repcak wrote:

I'm fine with additional manual tagging/precaching/warming up caches -

You better forget all this
You can NOT do this on any 68k.

With the MMU you have no influence on this.
The MMU can only set memory region to not cacheable.
Like IO regions. But this is not helping you at all for 3D.
 

Some side info: Cache lines are 32 bytes on Apollo.
2nd: cache lines are no limits on APOLLO.
In contrast to the older 68K, APOLLO can access 2 lines in parallel.
This is true for Icache and Dcache.
This means you can access instructions or data at any address,
even spanning two cache lines, without penalty.
 
Also what you can do on APOLLO is prefetch cache lines.
If you know POWERPC/POWER then you might know the DCBT instruction.
APOLLO has the same.
 
 


Vladimir Repcak

Posts 359
23 Dec 2019 18:42


Cool, that's not a big deal :)

What do you mean by "2nd cachelines are no limits" ?


Vladimir Repcak

Posts 359
23 Dec 2019 18:44


Just pasting this from the other thread, so I got it here for future reference - that's some real nice profiling feature !

Gunnar von Boehn wrote:

To improve the 68K ISA
  the following performance counters were added.
 
  All counters are 32-bit and can be read using MOVEC
 
 

  809 CCC  = Clock Cycle Counter
  80A IEP1 = Instructions Executed Pipe1
  80B IEP2 = Instructions Executed Pipe2
  80C BPC  = Branch Predicted Correct
  80D BPW  = Branch Predicted Wrong
  80E DCH  = Data Cache Hit
  80F DCM  = Data Cache Miss
  00E CMW  = Counter Memory Writes
 
  00A SCR - Stall Counter, caused by Register dependencies
  00B SCC - Stall Counter, caused by DCache misses
  00C SCH - Stall Counter, caused by Hazards
  00D SCB - Stall Counter, caused by Write Buffer full
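
A rough sketch of how two samples of these counters could be turned into rates on the host side. It assumes the values were already read on the hardware (e.g. with MOVEC before and after a routine) and are passed in as dictionaries keyed by the abbreviations above. The helper names and the choice of derived metrics are my own, not part of any Apollo tooling.

```python
MASK32 = 0xFFFFFFFF

def delta32(before, after):
    """Difference of two 32-bit counter samples, tolerating one wraparound."""
    return (after - before) & MASK32

def ratio(part, whole):
    """part / whole, or 0.0 when nothing was counted."""
    return part / whole if whole else 0.0

def derived_metrics(before, after):
    """Turn two counter snapshots into a few rates for the measured interval."""
    d = {name: delta32(before[name], after[name]) for name in before}
    return {
        # Instructions retired by both pipes per clock cycle.
        "ipc": ratio(d["IEP1"] + d["IEP2"], d["CCC"]),
        # Share of predicted branches that were predicted correctly.
        "branch_hit_rate": ratio(d["BPC"], d["BPC"] + d["BPW"]),
        # Share of counted data cache accesses that hit.
        "dcache_hit_rate": ratio(d["DCH"], d["DCH"] + d["DCM"]),
    }

names = ["CCC", "IEP1", "IEP2", "BPC", "BPW", "DCH", "DCM"]
start = dict.fromkeys(names, 0)
end = {"CCC": 1000, "IEP1": 900, "IEP2": 600, "BPC": 95, "BPW": 5, "DCH": 990, "DCM": 10}
print(derived_metrics(start, end))
```

The masking in `delta32` matters because the counters are only 32 bits wide, so a long run can wrap between the two samples.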
 
 
 





Gunnar von Boehn
(Apollo Team Member)
Posts 6207
23 Dec 2019 18:46


Vladimir Repcak wrote:

Cool, that's not a big deal :)
 
  What do you mean by "2nd cachelines are no limits" ?

MOVE.L 31,D0

This reads 4 bytes from addresses 31, 32, 33, 34.
These 4 bytes span 2 cache lines.
This read can be done by Apollo (combining 2 cache lines) in just 1 cycle.
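
The address arithmetic behind this example can be sketched as follows, using the 32-byte line size stated for Apollo. The function name is hypothetical and the sketch only shows which lines an access touches, not how the core combines them.

```python
def lines_touched(addr, size, line_bytes=32):
    """Return the cache line numbers covered by an access of `size` bytes at `addr`."""
    first = addr // line_bytes
    last = (addr + size - 1) // line_bytes
    return list(range(first, last + 1))

# MOVE.L 31,D0 reads bytes 31..34: byte 31 is in line 0, bytes 32..34 in line 1.
print(lines_touched(31, 4))   # [0, 1]
# An aligned longword at address 28 stays inside line 0.
print(lines_touched(28, 4))   # [0]
```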



Vladimir Repcak

Posts 359
23 Dec 2019 18:59


That penalty absence is going to be especially nice for 8-bit rasterizing. The 16-bit access will be aligned automatically anyway.

Although you did mention that there's no exception for misaligned reads, previously - if I recall correctly.

It sure would be nice not to have to constantly mess around with manual alignment.

Though, some linkers can force automatic alignment of each symbol...


Vladimir Repcak

Posts 359
23 Dec 2019 19:08


Pasting the following from other thread, so it's all in one place.

If I'm reading this right, the 32 bits of precision here should be enough for up to 50s of a benchmark at 85 MHz.
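
A quick check of that estimate, assuming the counter simply increments once per clock and wraps at 2^32 (the helper name is made up):

```python
def seconds_until_wrap(clock_hz, bits=32):
    """Time for a free-running cycle counter of `bits` width to wrap around."""
    return (1 << bits) / clock_hz

# 2^32 cycles at 85 MHz is a bit over 50 seconds.
print(round(seconds_until_wrap(85_000_000), 2))
```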

Philippe Flype wrote:

Since the SILVER3 Core, a unique feature is available.
   
    The SAGA Core provides a Clock-Cycle counter Register. It is a new 32-Bits ReadOnly Register that can help developers in some way to optimize their critical routines:
   
     

      // Register Address
      SAGA_CLK_REGISTER = 0xDE0008
     

   
    The idea of this register is very simple:
    Each time you read it, the internal cycle counter is reset to zero.
   
   

      // Reset the counter
      tst.l SAGA_CLK_REGISTER
     
      // Read the counter
      move.l SAGA_CLK_REGISTER,d0
   

   
   
    You can use it to determine how many cycles a routine consumes, or one single APOLLO instruction.

    You can find here an example I made using GCC to illustrate the use of this register:

    EXTERNAL LINK

    I use GCC 2.95.3 + ADE includes to compile this project.

    The tool outputs this result in CLI:
   
     

        Abcd    Dm,Dn              : 1
        Abcd    -(Ax),-(Ay)        : 2
        Add.l   Dm,Dn              : 1
        Add.l   (Ax),Dm            : 1
        And.l   Dm,Dn              : 1
        And.l   (Ax),Dm            : 1
        Asr.l   Dm,Dn              : 1
        Asr.l   #Imm,Dm            : 1
        Bchg.l  Dm,Dn              : 1
        Bchg.l  Dm,(Ax)            : 1
        Bfextu  Dm{Dx:Dy},Dn       : 1
        Bfextu  (Ax){Dx:Dy},Dm     : 1
        Clr.l   Dm                 : 1
        Clr.l   (Ax)               : 1
        Cmp.l   Dm,Dn              : 1
        Cmp.l   (Ax),Dm            : 1
        Divu.l  Dm,Dn              : 35
        Divu.l  (Ax),Dm            : 35
        Divul.l Dm,Dr:Dq           : 35
        Divul.l (Ax),Dr:Dq         : 35
        Exg     Dm,Dn              : 1
        Ext.l   Dm                 : 1
        Move.l  Dm,Dn              : 1
        Move.l  #Imm,Dn            : 1
        Move.l  (Ax),(Ay)          : 2
        Move.l  (Ax)+,(Ay)+        : 2
        Move.l  -(Ax),-(Ay)        : 2
        Move.l  (Ax,Dm),(Ay,Dn)    : 2
        Mulu.l  Dm,Dn              : 2
        Mulu.l  (Ax),Dm            : 2
        Neg.l   Dm                 : 1
        Neg.l   (Ax)               : 1
        Not.l   Dm                 : 1
        Not.l   (Ax)               : 1
        Ror.l   Dm,Dn              : 1
        Ror.l   #Imm,Dn            : 1
        Sub.l   Dm,Dn              : 1
        Sub.l   (Ax),Dm            : 1
        Swap    Dn                 : 1
     

     
    This output is not exhaustive, but as you can see and verify for yourself, many instructions take only 1 cycle.




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
23 Dec 2019 19:49


Vladimir Repcak wrote:

This output is not exhaustive, but as you can see and verify for yourself, many instructions take only 1 cycle.

This is fully correct.
An instruction normally takes 1 cycle,
2 cycles if the instruction needs 2 memory accesses to different EAs.
 
ADD.L D0,(A0) -- needs 1 cycle as it reads and writes the same EA address

You have 2 pipes therefore you can simultaneously execute two independent instructions.

Example:
ADDI.L #$1234,D0
ADDI.L #$45678,D1
Together they take 1 cycle.

Sometimes you have DEPENDENT instructions:
Example:
MOVE.L D0,D2
ADDI.L  #$45678,D2
Here is a dependency chain.
These instructions normally need 2 cycles ...
APOLLO sees that it can "combine" them
and will do FUSING on both instructions, doing this:
ADDI.L #$45678+D0,D2
These 2 instructions can be executed in one pipe.

This means 4 instructions can be executed like this per cycle - peak.
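
To make the pairing rule concrete, here is a deliberately crude toy model of dual issue. It only captures "two adjacent register-independent instructions issue together in one cycle" and nothing else: no fusing, no memory operands, no decoder or port limits. The instruction encoding as (writes, reads) register sets and the function name are assumptions made purely for illustration, not a description of the real scheduler.

```python
def dual_issue_cycles(instructions):
    """Count cycles when adjacent independent instructions pair up (toy model).

    Each instruction is a (writes, reads) pair of register-name sets.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1
        if i + 1 < len(instructions):
            writes_a, _ = instructions[i]
            writes_b, reads_b = instructions[i + 1]
            # Pair only if the second instruction neither reads nor
            # overwrites a register written by the first one.
            if not (writes_a & (reads_b | writes_b)):
                i += 2
                continue
        i += 1
    return cycles

# ADDI.L #$1234,D0 / ADDI.L #$45678,D1: independent, one per pipe.
independent = [({"D0"}, {"D0"}), ({"D1"}, {"D1"})]
# MOVE.L D0,D2 / ADDI.L #$45678,D2: the second depends on the first.
dependent = [({"D2"}, {"D0"}), ({"D2"}, {"D2"})]

print(dual_issue_cycles(independent))  # 1
print(dual_issue_cycles(dependent))    # 2
```

The toy reports 2 cycles for the dependent pair because it has no notion of fusing. Per the description above, the real core fuses that MOVE/ADDI pair into a single operation in one pipe, which is where the peak of 4 instructions per cycle comes from.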




Vladimir Repcak

Posts 359
23 Dec 2019 21:06


I was just about to go ask if the parallelism isn't actually of the 3rd order (2xINT + FP) and then I realized there's also AMMX (which would imply the possibility of AMMX+FP+INTx2), but then I recalled your earlier post:

Gunnar von Boehn wrote:

 
  In a Nutshell, the 68080 Core looks like
 
  * Icache  (16 bytes of instructions per cycle)
  * Decoders
    The decoders can decode up to 4 Integer instructions,
    including 1 AMMX instruction, 1 FPU instruction per cycle.
    The Core has 2 main pipelines.
    The primary pipeline can do up to 2 INT instructions, or 1 AMMX or 1 FPU.
    The secondary pipeline can do up to 2 INT instructions, and a selection of FMOVE, AMMX STORE.
  * 2 EA units to calc up to 2 EA per cycle
  * DCACHE unit to allow 1 Read and 1 Write per cycle
  * ALUs (INT/AMMX/FPU)

So, if I'm reading this right, these are the possible scenarios ?

Scenario 1: No AMMX or FP:
pOEP: 2xINT
sOEP: 2xINT

Scenario 2: AMMX
pOEP: 0xINT, 1xAMMX
sOEP: 2xINT, AMMX STORE

Scenario 3: FP
pOEP: 0xINT, 1xFP
sOEP: 2xINT, FMOVE

Assuming I got the sOEP right, the highest possible throughput could be 5 ops / cycle ?
pOEP: 2xINT
sOEP: 2xINT, AMMX STORE



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
24 Dec 2019 06:59


Vladimir Repcak wrote:

  So, if I'm reading this right, these are the possible scenarios ?
 
  Scenario 1: No AMMX or FP:
  pOEP: 2xINT
  sOEP: 2xINT
 
  Scenario 2: AMMX
  pOEP: 0xINT, 1xAMMX
  sOEP: 2xINT, AMMX STORE
 
  Scenario 3: FP
  pOEP: 0xINT, 1xFP
  sOEP: 2xINT, FMOVE
 
  Assuming I got the sOEP right, the highest possible throughput could be 5 ops / cycle ?
  pOEP: 2xINT
  sOEP: 2xINT, AMMX STORE

Let's try to start with features.
Then it will become clear what is possible:

- The core can read 16 Byte instructions per cycle.
- The core has 2 decoders
  The 1st decoder can take up to all Icache bytes
  The 2nd decoder can take up to 8 Byte per cycle
  - The 1st decoder can decode in 1 cycle either
    1 INT multicycle instr
    1 INT single instr
    2 INT single instr (Combination)
    1 FPU instr
    1 AMMX instr
  - The 2nd decoder can decode in 1 cycle either
    1 INT single instr
    2 INT single instr (Combination)
    1 FPU/AMMX Store

- The core has 2 EA Units
  The core can use them to calculate memory address
    or execute EA instruction like
    LEA,SUBA,ADDA

- The DCache unit can do 
  1 read of up to 8 Byte per cycle
  1 write of up to 8 Byte per cycle
  1 Cache-Reload from Memory per cycle

- 2 ALU Units
  1st Unit can issue per cycle either
  1 INT instr
  2 INT Combined Instructions
  1 AMMX instr
  1 FPU instr
 
  2nd Unit can issue per cycle either
  1 INT instr
  2 INT Combined Instructions

The Core has in total 8 register read ports and 4 register write ports.

 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
24 Dec 2019 08:57


AMMX and FPU instructions that are started share the register read ports and memory port of the 1st INT ALU.
This means the 1st ALU is used to launch AMMX and FPU instructions.

Of the 4 register write ports, 3 are dedicated to storing INTEGER and EA results and 1 port is reserved for AMMX/FPU results.
This means AMMX and FPU instructions can run in parallel with further integer instructions and will retire fully independently.



Vladimir Repcak

Posts 359
24 Dec 2019 10:42


Thanks a lot. I think it finally all clicked.

This, however, means that even if I managed to get to 100% utilization of both decoders by using INT, it isn't necessarily the ideal use of them - because there could be a different version of the algorithm using a combo of INT+FP+AMMX that would run faster despite lower utilization of the second decoder.

Essentially, one should ideally design 3 versions of each algorithm:
1. Int
2. FP
3. AMMX

Of course, an INT version that manipulates vectors will never come close to AMMX, which can do up to 16x (or more: PADDSUB) more work in one instruction.

So, basically, rewrite each part of 3D pipeline like this:
1. AMMX: Trigonometry / vectors / physics : Complex 3D Math
2. FP:  Various coefficients, lerping, simpler Math
3. 3D Core: texturing, filtering
4. INT: All other logic (gameplay, initialization, looping, etc.)

Now, if one thinks about what a C++ compiler is doing internally (compared to 080 code), it's not just a different continent.

It's like a remote galaxy recently discovered by Hubble...

The only way such a compiler would ever come close to hand-made code is if it was written by an AI that would go through all permutations of each algorithm :)




Vladimir Repcak

Posts 359
24 Dec 2019 10:47


On an unrelated note: 3 days ago I placed an order for Vampire, but to my surprise, there was no Order form - it's like you just received an email or something.

I suspect a combination of the following happened:
1. Holidays
2. Perhaps you ran out of current batch of boards so you don't take money yet

I suspect things pick up the first weekend after New Year - around Jan-12, correct ?

I was looking into WinUAE, but I can't seem to find its support for 080. Perhaps it doesn't ? Is there any other emulator I could use in the meantime ?



Vojin Vidanovic
(Needs Verification)
Posts 1916/ 1
24 Dec 2019 10:59


Vladimir Repcak wrote:

On an unrelated note: 3 days ago I placed an order for Vampire, but to my surprise, there was no Order form - it's like you just received an email or something.

That is it. There are no emulators that emulate 080.



Vladimir Repcak

Posts 359
24 Dec 2019 11:54


Vojin Vidanovic wrote:

  That is it. There are no emulators that emulate 080.

Well, there's been quite a few WinUAE releases in last 2 years, supporting all kinds of external HW. Last release just 4 days ago.

So, it's certainly not due to the lack of free time, but perhaps it's just super low priority...

Or is it the general Anti-Apollo sentiment ?

Understand also, that I myself have written my personal dev emulator on PC for various Atari architectures (6502, 6502C, 68000, RISC GPU, DSP) so I do happen to have an idea about the effort involved in adding new instruction set. I'm not talking cycle-perfect emulation or superscalar execution - just the new ops.

Adding support for, say - at least - AMMX instructions, surely isn't a herculean effort. If one was inclined to do so, that is...



Vojin Vidanovic
(Needs Verification)
Posts 1916/ 1
24 Dec 2019 11:59


UAE added PPC only recently, when it was arranged in the interest of OS4 emulation.

It's not anti-Apollo. UAE is classic emulation, Vamp is a new Classic. Only dead ones go UAE :)

AMMX is not widely used or accepted in Amiga land yet. Although 040-mmx could be a testbed.
