
Information about the Apollo CPU and FPU.

New Performance Counter Available

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
06 Oct 2016 19:20


To improve the 68K ISA,
  the following performance counters were added.
 
  All counters are 32-bit and can be read using MOVEC.
 
 

  809 CCC  = Clock Cycle Counter
  80A IEP1 = Instructions Executed Pipe1
  80B IEP2 = Instructions Executed Pipe2
  80C BPC  = Branch Predicted Correct
  80D BPW  = Branch Predicted Wrong
  80E DCH  = Data Cache Hit
  80F DCM  = Data Cache Miss
  00E CMW  = Counter Memory Writes
 
  00A SCR  = Stall Counter, caused by Register dependencies
  00B SCC  = Stall Counter, caused by DCache misses
  00C SCH  = Stall Counter, caused by Hazards
  00D SCB  = Stall Counter, caused by Write Buffer full
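
For anyone writing a tool around these counters, the list above can be kept as a small lookup table. This is only a minimal Python sketch: it assumes the numbers in the list are hexadecimal control register numbers as used with MOVEC, and the names `COUNTERS` and `describe` are made up for this illustration.

```python
# Control register numbers copied from the list above (assumed to be hexadecimal).
COUNTERS = {
    0x809: ("CCC", "Clock Cycle Counter"),
    0x80A: ("IEP1", "Instructions Executed Pipe1"),
    0x80B: ("IEP2", "Instructions Executed Pipe2"),
    0x80C: ("BPC", "Branch Predicted Correct"),
    0x80D: ("BPW", "Branch Predicted Wrong"),
    0x80E: ("DCH", "Data Cache Hit"),
    0x80F: ("DCM", "Data Cache Miss"),
    0x00E: ("CMW", "Counter Memory Writes"),
    0x00A: ("SCR", "Stall Counter, caused by Register dependencies"),
    0x00B: ("SCC", "Stall Counter, caused by DCache misses"),
    0x00C: ("SCH", "Stall Counter, caused by Hazards"),
    0x00D: ("SCB", "Stall Counter, caused by Write Buffer full"),
}


def describe(regno):
    """Return a one-line description for a counter register number."""
    name, text = COUNTERS[regno]
    return f"{regno:03X} {name:<4} = {text}"


if __name__ == "__main__":
    # Print the table sorted by register number.
    for regno in sorted(COUNTERS):
        print(describe(regno))
```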

 




Gregthe Canuck

Posts 274
07 Oct 2016 02:29



Amazing these were never there in the first place.

Nice work.


Simo Koivukoski
(Apollo Team Member)
Posts 601
07 Oct 2016 06:09


I hope someone will code a tool that can show all this information for a selected program.

With it, it would be easy to compare different compilers and the effects of their settings.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
07 Oct 2016 17:57


Also interesting to note: the APOLLO 68080 is not only the first 68K CPU that provides information like DCache misses to the programmer.

It's also the first and only 68K CPU that offers new instructions allowing the programmer to fix this!

The Apollo 68080 provides, for example, instructions that let the programmer control prefetching; used ideally, they can achieve a 100% cache hit rate!


Philippe Flype
(Apollo Team Member)
Posts 299
18 Oct 2016 22:41


Hi Folks,
 
  I wrote a tool that uses these new counters.
 
  The tool runs in a CLI window, or with its output redirected to a file, and queries the CPU's internal state once per second.
 
  It is very cool, as it shows in real time how code is distributed across the pipes, how the data cache is used, ... Impressive and unique stuff that helps improve code and detect bottlenecks.
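
A tool that polls free-running 32-bit counters once per second has to work with differences between two reads and has to cope with wraparound. The following is only a Python sketch of that bookkeeping, not the actual tool's code; `delta32`, `sample_deltas` and `seconds_until_wrap` are hypothetical names, and the raw values would come from MOVEC reads on the real hardware.

```python
MASK32 = 0xFFFFFFFF  # the counters are 32 bits wide


def delta32(previous, current):
    """Events counted between two reads of a 32-bit counter, tolerating one wrap."""
    return (current - previous) & MASK32


def sample_deltas(previous, current):
    """Per-interval deltas for two snapshots given as {name: raw_value} dicts."""
    return {name: delta32(previous[name], current[name]) for name in current}


def seconds_until_wrap(clock_hz):
    """How long a counter that ticks once per clock cycle takes to wrap around."""
    return (MASK32 + 1) / clock_hz


if __name__ == "__main__":
    # Hypothetical raw snapshots taken one second apart; CCC wraps in between.
    before = {"CCC": 0xFFFFFFF0, "IEP1": 1000}
    after = {"CCC": 0x00000010, "IEP1": 5000}
    print(sample_deltas(before, after))
    print(seconds_until_wrap(89_000_000))
```

At a clock of around 89 MHz the cycle counter wraps roughly every 48 seconds, so sampling once per second sees at most one wrap per interval and the masked subtraction stays correct.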
 
 
 
 
 



Philippe Flype
(Apollo Team Member)
Posts 299
18 Oct 2016 23:14


My friend Pisklak and I had a friendly little contest, playing with this tool and those counters.
 
  - Pisklak with his plasma effect
  - Me with the fire effect (already posted some months ago)
 
Both use an RTG screen and draw directly to the screen.
 
  ====== PlasmaEffect (Pisklak) ======
 
  Clock:  89 MHz (x12)
  Total:  140.82 MIPS (IPC: 1.58)
  Pipe1:  81.32 M (57%)
  Pipe2:  59.49 M (42%)
  DCache:  39.98 M (99%) (40438 miss)
  Branch:  20.09 M (98%) (294722 wrong)
  St-Reg:  00.20 M
  St-Cac:  01.23 M
  St-Hzd:  00.01 M
  St-Buf:  02.42 M
 
  ====== FireEffect (Flype) ======
 
  Clock:  86 MHz (x12)
  Total:  135.72 MIPS (IPC: 1.52)
  Pipe1:  83.88 M (61%)
  Pipe2:  51.84 M (38%)
  DCache:  59.54 M (99%) (482906 miss)
  Branch:  01.08 M (98%) (14491 wrong)
  St-Reg:  00.17 M
  St-Cac:  00.92 M
  St-Hzd:  00.01 M
  St-Buf:  00.27 M
 
 
Technically speaking, the plasma effect's inner routine is more efficient, but the fire effect is very close behind. Both show very good use of the 2nd pipe.
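
The derived figures in such a report (IPC, pipe share, hit rates) can be recomputed from the per-second counts. This is a Python sketch under stated assumptions, not the tool's actual code: it assumes the DCache and Branch figures in millions are hits (DCH) and correct predictions (BPC), that IPC means instructions per clock cycle, and that percentages are truncated rather than rounded. The function name `derive` is made up for this illustration.

```python
def derive(clock_mhz, pipe1_m, pipe2_m, dch_m, dcm, bpc_m, bpw):
    """Derive summary metrics from per-second counts.

    Values ending in _m are in millions per second; dcm and bpw are raw counts.
    """
    total_m = pipe1_m + pipe2_m  # instructions from both pipes, in millions
    return {
        "mips": total_m,
        "ipc": total_m / clock_mhz,
        "pipe1_pct": 100.0 * pipe1_m / total_m,
        "pipe2_pct": 100.0 * pipe2_m / total_m,
        "dcache_hit_pct": 100.0 * dch_m / (dch_m + dcm / 1e6),
        "branch_ok_pct": 100.0 * bpc_m / (bpc_m + bpw / 1e6),
    }


# Figures from the PlasmaEffect report above.
plasma = derive(89, 81.32, 59.49, 39.98, 40438, 20.09, 294722)

if __name__ == "__main__":
    print(f"IPC: {plasma['ipc']:.2f}")
    print(f"Pipe1: {int(plasma['pipe1_pct'])}%  Pipe2: {int(plasma['pipe2_pct'])}%")
    print(f"DCache: {int(plasma['dcache_hit_pct'])}%  Branch: {int(plasma['branch_ok_pct'])}%")
```

With the plasma numbers this gives an IPC of about 1.58 and truncated percentages of 57, 42, 99 and 98, in line with the report. The summed total comes out as 140.81 rather than the reported 140.82, which is consistent with the per-pipe figures having been rounded.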


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
18 Oct 2016 23:22


Awesome result guys!
  I'm deeply impressed.
 
140 MIPS with normal demo code!
And you do lots of memory operations with them too.

Flype, you do 140 million integer operations
plus 60 million cache reads per second!
Over 200 million operations in total per second!

Really awesome result!
 
 


Grzegorz Wójcik (pisklak)
(Apollo Team Member)
Posts 87
19 Oct 2016 08:15


Yep, very good results for both effects!
The fire effect is a little more complicated, but overall they are quite similar. Thanks to the shared memory model, we can work efficiently directly on FB data and on in-memory buffers (DBUFF). Both programs use plain good 68k code to achieve that speed, plus a little fusing/SS magic :-)
