APOLLO CPU Knowledge Forum

68080

Markus (mfro)

Posts 99
22 Oct 2016 12:25

Just for completeness: with (nearly) the exact same code (I just replaced the dbf instruction and adapted the movem), the FireBee scores 194 MIPS.

If you allow me an additional comment and - hopefully received as constructive - criticism: the code as provided doesn't show the quality of branch prediction very well.

It repeatedly calls the assembler loops as subroutines and uses movem to save and restore registers. This makes it a perfect candidate for inlining, as the expensive register save and restore operations effectively "hide" the branch prediction quality from the result.

I just did a quick test inlining the assembler subroutine into the calling code and ended up with a score pretty close to the FireBee's clock rate, which indicates a near-100% hit rate for branch prediction.

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
22 Oct 2016 12:39

Markus (mfro) wrote:

Just for completeness: with (nearly) the exact same code (I just replaced the dbf instruction and adapted the movem), the FireBee scores 194 MIPS.

Thanks for the test. :-)

We have some more CPU tests that show much more detailed numbers.
Would you like to compile, for example, our minibench?

Cheers

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
22 Oct 2016 13:43

Markus (mfro) wrote:

Just for completeness: with the (nearly) exact same code (just replaced the dbf instruction

Can you show us the new code?

Cheers
Gunnar

Markus (mfro)

Posts 99
22 Oct 2016 19:51

Gunnar von Boehn wrote:

Can you show us the new code?

Sure. The only changes to the original sources are:


diff bench/loop1.S bench_fb/loop1.S
5c5,6
<  movem.l d0-a6,-(sp)
---
>  lea  -15 * 4(sp),sp
>  movem.l d0-a6,(sp)
9,10c10,11
<  moveq   #1,D6
<  addq    #1,D5
---
>  moveq   #2,D6
>  addq.l    #1,D5
22c23,25
<  dbra    D6,L2
---
>  subq.l #1,d6
>  bne.s  L2
>  // dbra    D6,L2
25c28
<  bne     L1
---
>  bne.s     L1
27c30,31
<  movem.l (sp)+,d0-a6
---
>  movem.l (sp),d0-a6
>  lea  15 * 4(sp),sp

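The change of the initial counter from 1 to 2 in the diff above compensates for different loop semantics: dbra decrements the counter and falls through only when it reaches -1, so the body runs count + 1 times, while a subq/bne pair runs the body count times. A minimal C model of both loop forms (the function names are illustrative and not part of the benchmark sources):

```c
#include <stdint.h>

/* Model of "body; dbra Dn,label": the body runs, the low word of the
   counter is decremented, and the loop repeats unless it became -1. */
unsigned dbra_iterations(int16_t count)
{
    unsigned n = 0;
    do {
        n++;                 /* loop body */
        count--;
    } while (count != -1);
    return n;
}

/* Model of "body; subq.l #1,Dn; bne label": the loop repeats until
   the counter reaches 0. */
unsigned subq_bne_iterations(int32_t count)
{
    unsigned n = 0;
    do {
        n++;                 /* loop body */
        count--;
    } while (count != 0);
    return n;
}
```

With these models, dbra_iterations(1) and subq_bne_iterations(2) both run the body twice, which matches the adjusted moveq in the diff.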

Markus (mfro)

Posts 99
22 Oct 2016 20:31

Just tried to compile minibench on the FireBee. Are the sources complete? I had to comment out the _64x calls since they only seem to appear in the x86 assembly sources?


Philippe Flype
(Apollo Team Member)
Posts 299
26 Oct 2016 00:05

Hi, I took time to sort the files. Here is an archive of the sources: EXTERNAL LINK

Markus (mfro)

Posts 99
26 Oct 2016 12:52

Philippe Flype wrote:

Hi, I took time to sort the files.

Here is an archive of the sources: EXTERNAL LINK

Thank you. Will probably look into it during the weekend.

Vincent Rivière

Posts 87
29 Oct 2016 12:43

Gunnar von Boehn wrote:

Coldfire 266 Firebee 8

For precision:

- The "ColdFire" processor is spelled with a capital F in the middle; this is how Freescale spells it. The "Atari Coldfire Project" (ACP) deliberately chose to remove that capital F in their name; that's their own choice.

- Similarly, "FireBee" is spelled with a capital B in the middle.

- The FireBee ColdFire clock is exactly 264 MHz (contrary to what we sometimes read).

Markus (mfro)

Posts 99
30 Oct 2016 13:20

Got the sources to compile on the FireBee. I had to mill them through PortAsm (the Micro-APL EXTERNAL LINK tool to convert 68k code to the ColdFire instruction set).

The code spits out a lot of bogus values, while others seem reasonable. The bogus ones appear even where PortAsm didn't change anything, so it's probably not the tool that is to blame.

Before I start searching the haystack: we do use the same calling conventions on Atari and Amiga, don't we?

Arguments on the stack, left to right, d0-d2/a0 = scratch, return values in d0, all other registers to be preserved by called function?


  ------------------------------------------------
  Processor & Memory Performance Benchmark.
  $VER: Minibench 8.06 (04.07.16) Apollo Team
  ------------------------------------------------
  ------------------------------------------------
  CPU - Math                512KB 
  ------------------------------------------------
  NOP                       209.7 
  ADD.L REG             1048576.0 
  ADD.W Im16            1048576.0 
  ADD.L Im32            1048576.0 
  SHIFT REG             1048576.0 
  SHIFT Imm             1048576.0 
  AND.L REG             1048576.0 
  ANDI.L Im32           1048576.0 
  MULU.L                1048576.0 
  DIV.L                      52.4 
  ROL.L Dn,Dm               209.7 
  BFFFO  Dn{},Dm              5.8 
  BFEXTU (a0){},Dn           17.4 
  ------------------------------------------------
  CPU - Special             512KB 
  ------------------------------------------------
  fuse_ma_x16_1 Reg     1048576.0 
  fuse_ma_x16_2 Imm8    1048576.0 
  fuse_ma_x16_3 Imm32   1048576.0 
  fuse_ma_x16_4 And     1048576.0 
  bond_ma_x16_1 Reg     1048576.0 
  bond_ma_x16_2 Imm     1048576.0 
  bond_ma_x16_3 Mem     1048576.0 
  ea_latencyx16             209.7 
  alu_latency1x16       1048576.0 
  cache_latency1x16         104.8 
  ------------------------------------------------
  CPU - EA                  512KB 
  ------------------------------------------------
  R (d16,An)                209.7 
  R (d32,An)                209.7 
  R (An)+                   209.7 
  R (An) ; ADDQ #,An        209.7 
  R (An,Dn)                 104.8 
  R (d32,An,Dn)             209.7 
  W (d16,An)            1048576.0 
  W (d32,An)                104.8 
  W (An)+                   209.7 
  W (An) ; ADDQ #,An        209.7 
  W (An,Dn)                 209.7 
  W (d32,An,Dn)              69.9 
  U (d16,An)                209.7 
  U (d32,An)                 52.4 
  U (An)+                   209.7 
  U (An) ; ADDQ #,An        104.8 
  U (An,Dn)                 104.8 
  U (d32,An,Dn)              69.9 
  ------------------------------------------------
  CPU - Loop                512KB 
  ------------------------------------------------
  loopx2                    209.8 
  loopx4                    209.7 
  loopx6                    209.7 
  loopx8                1048576.0 
  loopx16               1048576.0 
  loopx32               1048576.0 
  loopx64               1048576.0 
  loopx128              1048576.0 
  loopix2                   209.8 
  loopix4                   209.7 
  loopix6                   209.7 
  loopix8               1048576.0 
  loopix16              1048576.0 
  loopix32              1048576.0 
  loopix64              1048576.0 
  loopix128             1048576.0 
  ------------------------------------------------
  CPU - Goto                512KB 
  ------------------------------------------------
  goto_x16                  104.8 
  goto2_x16                 209.7 
  goto4_x16                 209.7 
  gotoCC                    104.8 
  gotoCCTRUE                209.7 
  gotoCCFALSE               209.7 
  gosup_chainx1              69.9 
  gosup_chainx2              69.9 
  gosup_chainx4             104.8 
  ------------------------------------------------
  CPU - Workload            512KB 
  ------------------------------------------------
  workload_AAAA             210.5 
  workload_LA           1048576.0 
  workload_LAA          1048576.0 
  workload_LAAA         1048576.0 
  workload_LAAAA        1048576.0 
  workload_LLA          1048576.0 
  workload_LLAA         1048576.0 
  workload_LLAAA        1048576.0 
  workload_LLAAAA       1048576.0 
  workload_LAALA        1048576.0 
  ------------------------------------------------
  Measuring memory throughput:
  Results are in MB/sec. Higher value is faster.
  Memory 2 Memory
  Alignment 0-0      512KB       16KB        4KB 
  ------------------------------------------------
  libc memcpy           52.4       52.4       52.4 
  read 8                52.4       52.4       69.9 
  read 8x4              69.9       69.9       69.9 
  read 32               69.9       69.9       69.9 
  read 32x4            104.8      104.8      104.8 
  read 32x8            104.8      104.8      104.8 
  write 8               41.9       41.9       41.9 
  write 8x4             52.4       52.4       52.4 
  write 32              41.9       52.4       41.9 
  write 32x4            52.4       41.9       52.4 
  write 32x8            41.9       52.4       52.4 
  copy 8                58.9       52.4       52.4 
  copy 8x4              52.4       52.4       52.4 
  copy 32               58.9       58.9       58.9 
  copy 32x4             52.4       58.9       58.9 
  copy 32x8             58.9       58.9       52.4 
  ------------------------------------------------
  Cache 2 Cache
  Alignment 0-0      512KB       16KB        4KB 
  ------------------------------------------------
  libc memcpy           52.4      418.6  2097152.0 
  read 8                52.4      104.8       69.9 
  read 8x4              69.9      104.8      209.7 
  read 32               69.9  1048576.0      209.7 
  read 32x4             69.9  1048576.0  1048576.0 
  read 32x8            104.8  1048576.0  1048576.0 
  write 8               41.9      104.8      104.9 
  write 8x4             52.4      209.7      209.7 
  write 32              41.9  1048576.0      209.7 
  write 32x4            52.4  1048576.0  1048576.0 
  write 32x8            52.4  1048576.0  1048576.0 
  copy 8                58.9      208.7      138.8 
  copy 8x4              58.9      208.7      208.7 
  copy 32               58.9      418.5  2097152.0 
  copy 32x4             52.4  2097152.0      418.6 
  copy 32x8             58.9  2097152.0  2097152.0 
  ------------------------------------------------
  MIPS:   37754778 / 75 = 503397 
  MEMORY: 6992967 / 32 = 218530 
  TOTAL:  418203 
  ------------------------------------------------

Vincent Rivière

Posts 87
30 Oct 2016 17:32

Markus (mfro) wrote:

Before I start searching the haystack: we do use the same calling conventions on Atari and Amiga, don't we?

Arguments on the stack, left to right, d0-d2/a0 = scratch, return values in d0, all other registers to be preserved by called function?

Beware, calling conventions may be different depending on the API.

For Atari GCC functions, the scratch registers are d0-d1/a0-a1. I believe this is the standard for almost all GCC 680x0 targets.

For Atari BIOS/XBIOS/GEMDOS system calls, scratch registers are d0-d2/a0-a2. We must be extremely careful about that when mixing C functions and system calls.
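The practical consequence of the two conventions can be worked out with register bitmasks: any register the system call may clobber but the compiler expects to be preserved has to be saved by the caller. A small sketch, assuming the scratch sets exactly as stated above (the bit encoding, macro names and function name are illustrative):

```c
#include <stdint.h>

/* Illustrative encoding: d0..d7 are bits 0..7, a0..a7 are bits 8..15. */
#define DREG(n) (1u << (n))
#define AREG(n) (1u << (8 + (n)))

/* Scratch sets as described in the post above. */
#define GCC_SCRATCH (DREG(0) | DREG(1) | AREG(0) | AREG(1))
#define TOS_SCRATCH (DREG(0) | DREG(1) | DREG(2) | AREG(0) | AREG(1) | AREG(2))

/* Registers the callee may destroy although the caller's convention
   treats them as preserved; these must be saved around the call. */
uint16_t must_save_around_call(uint16_t caller_scratch, uint16_t callee_clobbers)
{
    return (uint16_t)(callee_clobbers & ~caller_scratch);
}
```

For GCC code calling BIOS/XBIOS/GEMDOS, the result is the mask for d2 and a2, which are the two registers that need extra care when mixing C functions and system calls.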

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
30 Oct 2016 17:47

Markus (mfro) wrote:

The code spits out a lot of bogus values,

Did you run it with the "strongest" parameter?

Cheers
Gunnar

Markus (mfro)

Posts 99
30 Oct 2016 19:00

Gunnar von Boehn wrote:

Markus (mfro) wrote:

The code spits out a lot of bogus values,

Did you run it with the "strongest" parameter?

Cheers
Gunnar

Not until just now. "strongest" results in slightly different values, but there are still these 7-digit numbers in places.

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
30 Oct 2016 19:41

Markus (mfro) wrote:

Not until just now. "strongest" results in slightly different values, but there are still these 7-digit numbers in places.

The values look to me like the time for the run is too short,
or the resolution of the timer value is not high enough.

If you increase the runtime this should be fixed.
Can you try?

Cheers
Gunnar
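Many values in the first run are close to integer fractions of roughly 209.7 (209.7, 104.8, 69.9, 52.4, 41.9), which is what one would expect if only one to five timer ticks elapsed per test. The sketch below models a timer that counts whole ticks only. The function names and the tick rate used in the usage note are hypothetical and not taken from the minibench sources.

```c
/* Hypothetical coarse timer: only whole ticks are counted. */
unsigned long elapsed_ticks(double seconds, double tick_hz)
{
    return (unsigned long)(seconds * tick_hz);   /* fractional ticks are lost */
}

/* Rate in operations per second as a benchmark would derive it from
   whole ticks. Returns -1.0 when no tick elapsed, because the rate
   cannot be computed from a zero time difference. */
double measured_rate(double ops, double seconds, double tick_hz)
{
    unsigned long ticks = elapsed_ticks(seconds, tick_hz);
    if (ticks == 0)
        return -1.0;
    return ops * tick_hz / (double)ticks;
}
```

With an assumed 100 Hz timer, 1000 operations that really take 0.025 s are reported as 50000 ops/s instead of 40000, because only 2 whole ticks are counted, and a run shorter than one tick cannot be measured at all. A longer run spreads the same one-tick uncertainty over many more ticks, which is why increasing the runtime helps.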

Markus (mfro)

Posts 99
30 Oct 2016 20:21

Gunnar von Boehn wrote:

Markus (mfro) wrote:

Not until just now. "strongest" results in slightly different values, but there are still these 7-digit numbers in places.

Just had a look into main.c - "strongest" is not a valid parameter, it seems. The only one that is recognized appears to be "68000".

Do I have the latest sources?

Gunnar von Boehn wrote:

The values look to me like the time for the run is too short,
or the resolution of the timer value is not high enough.

If you increase the runtime this should be fixed.
Can you try?

Thanks.

Set LOOPS (from 2) to 32 (didn't really check what it does in detail, but it appeared to me the most straightforward thing to do ;) ).

At least it made the strange values vanish.

As the memory throughput numbers seem reasonable now (about the same, pretty disappointing, values I got from my own measurements), I guess it was indeed caused by the timer granularity and we're getting closer. Can you show 68080 values? This is what I have now:


  firebee:~#./benchv6_68k
  ------------------------------------------------
  Processor & Memory Performance Benchmark.
  $VER: Minibench 8.06 (04.07.16) Apollo Team
  ------------------------------------------------
  ------------------------------------------------
  CPU - Math                512KB 
  ------------------------------------------------
  NOP                       167.4 
  ADD.L REG                 838.4 
  ADD.W Im16                838.4 
  ADD.L Im32                838.4 
  SHIFT REG                1677.8 
  SHIFT Imm                1677.8 
  AND.L REG                1118.8 
  ANDI.L Im32               671.8 
  MULU.L                    258.0 
  DIV.L                      50.0 
  ROL.L Dn,Dm               139.6 
  BFFFO  Dn{},Dm              5.2 
  BFEXTU (a0){},Dn           15.8 
  ------------------------------------------------
  CPU - Special             512KB 
  ------------------------------------------------
  fuse_ma_x16_1 Reg        1118.8 
  fuse_ma_x16_2 Imm8        838.4 
  fuse_ma_x16_3 Imm32      1118.8 
  fuse_ma_x16_4 And        1677.8 
  bond_ma_x16_1 Reg        1118.8 
  bond_ma_x16_2 Imm         671.8 
  bond_ma_x16_3 Mem         335.8 
  ea_latencyx16             124.6 
  alu_latency1x16           258.0 
  cache_latency1x16          93.8 
  ------------------------------------------------
  CPU - EA                  512KB 
  ------------------------------------------------
  R (d16,An)                258.0 
  R (d32,An)                119.4 
  R (An)+                   258.0 
  R (An) ; ADDQ #,An        129.0 
  R (An,Dn)                 129.0 
  R (d32,An,Dn)             124.6 
  W (d16,An)                258.0 
  W (d32,An)                124.6 
  W (An)+                   239.8 
  W (An) ; ADDQ #,An        129.0 
  W (An,Dn)                 124.6 
  W (d32,An,Dn)              64.0 
  U (d16,An)                134.2 
  U (d32,An)                 55.4 
  U (An)+                   134.2 
  U (An) ; ADDQ #,An        101.2 
  U (An,Dn)                 104.4 
  U (d32,An,Dn)              52.2 
  ------------------------------------------------
  CPU - Loop                512KB 
  ------------------------------------------------
  loopx2                    176.0 
  loopx4                    223.8 
  loopx6                    223.8 
  loopx8                    671.8 
  loopx16                   838.4 
  loopx32                   838.4 
  loopx64                  1118.8 
  loopx128                 1118.8 
  loopix2                   129.0 
  loopix4                   152.4 
  loopix6                   159.8 
  loopix8                   479.8 
  loopix16                  671.8 
  loopix32                  671.8 
  loopix64                  671.8 
  loopix128                 838.4 
  ------------------------------------------------
  CPU - Goto                512KB 
  ------------------------------------------------
  goto_x16                   76.6 
  goto2_x16                 176.0 
  goto4_x16                 239.8 
  gotoCC                    134.2 
  gotoCCTRUE                223.8 
  gotoCCFALSE               167.4 
  gosup_chainx1              71.4 
  gosup_chainx2              79.8 
  gosup_chainx4              81.0 
  ------------------------------------------------
  CPU - Workload            512KB 
  ------------------------------------------------
  workload_AAAA             258.0 
  workload_LA                 9.4 
  workload_LAA                9.4 
  workload_LAAA               9.4 
  workload_LAAAA              9.4 
  workload_LLA                9.4 
  workload_LLAA               9.4 
  workload_LLAAA              9.4 
  workload_LLAAAA             9.4 
  workload_LAALA              9.4 
  ------------------------------------------------
  Measuring memory throughput:
  Results are in MB/sec. Higher value is faster.
  Memory 2 Memory
  Alignment 0-0      512KB       16KB        4KB 
  ------------------------------------------------
  libc memcpy           52.2       52.2       52.2 
  read 8                56.4       55.4       56.4 
  read 8x4              62.8       61.8       61.8 
  read 32               79.8       78.8       76.6 
  read 32x4             78.8       76.6       78.8 
  read 32x8             79.8       78.8       78.8 
  write 8               45.8       45.8       45.8 
  write 8x4             45.8       45.8       45.8 
  write 32              45.8       45.8       45.8 
  write 32x4            45.8       44.6       45.8 
  write 32x8            45.8       45.8       45.8 
  copy 8                56.4       52.2       50.2 
  copy 8x4              56.4       54.4       52.2 
  copy 32               56.4       54.4       54.4 
  copy 32x4             56.4       54.4       54.4 
  copy 32x8             56.4       54.4       54.4 
  ------------------------------------------------
  Cache 2 Cache
  Alignment 0-0      512KB       16KB        4KB 
  ------------------------------------------------
  libc memcpy           54.2      744.6      958.8 
  read 8                56.4       88.4       86.2 
  read 8x4              62.8      115.2      119.4 
  read 32               78.8      419.2      419.2 
  read 32x4             78.8      838.4      838.4 
  read 32x8             79.8      838.4     1118.8 
  write 8               45.8      108.6      104.4 
  write 8x4             45.8      209.0      209.0 
  write 32              45.8      479.8      559.8 
  write 32x4            45.8      838.4      838.4 
  write 32x8            45.8     1118.8      838.4 
  copy 8                56.4      172.6      176.0 
  copy 8x4              56.4      208.0      208.0 
  copy 32               56.4      670.8      744.6 
  copy 32x4             56.4      744.6      838.4 
  copy 32x8             56.4      838.4      958.8 
  ------------------------------------------------
  MIPS:   28569 / 75 = 380 
  MEMORY: 7013 / 32 = 219 
  TOTAL:  332 
  ------------------------------------------------
  
  firebee:~#


Gunnar von Boehn
(Apollo Team Member)
Posts 6254
30 Oct 2016 20:45

Hi Markus,

Sorry, I tried to ping you on IRC with info.
I was confused by the code versions.
The scores that you have are still impossible values.
There must be a config variable called CONFIG_TEST_SIZE.
Please set it to 64 MB.

I hope this will fix it.
Thanks

Markus (mfro)

Posts 99
30 Oct 2016 21:52

Gunnar von Boehn wrote:

Hi Markus,

Sorry tried to ping you on IRC with info.
I was confused with the code versions.
The scores that you have are still impossible values
There must be a Config variable called
CONFIG_TEST_SIZE
Please set it to 64 MB

I hope this will fix it
Thanks

O.k., done, next try:

Results in waaaay longer runtime (yawn ... I had to set LOOPS to 8 additionally because I got odd numbers for the CPU workload benchmark again), but then pretty much the same values (like less than 5% off) as posted above, so I decided not to clutter the forum with it.

Maybe I have to inspect what PortAsm did to the code. Is there anything particular you'd consider way off so we'd look into that first?

Markus (mfro)

Posts 99
31 Oct 2016 11:55

I guess I've found at least most of the problematic parts.

The first thing was rather trivial: the original code uses preprocessor macros for instruction sequences such as


#define NOP4     nop; nop; nop; nop

which looks innocent, but doesn't work with PortAsm.

PortAsm interprets the semicolon as the start of a comment (although it has been told we are using the GNU assembler, where this is valid syntax), so only the very first instruction was executed.

Fixed, but still no go.
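The effect on a macro like NOP4 can be shown by counting the statements on one line under both readings of the semicolon. This is only an illustration of the two interpretations; the function names are made up, and neither function reflects how PortAsm or the GNU assembler is actually implemented.

```c
#include <stddef.h>

/* Count statements on one source line when ';' separates statements,
   which is the reading the original macros rely on. */
size_t count_with_separator(const char *line)
{
    size_t n = 0;
    int in_statement = 0;
    for (; *line; line++) {
        if (*line == ';') {
            in_statement = 0;            /* next non-blank starts a new statement */
        } else if (*line != ' ' && *line != '\t' && !in_statement) {
            in_statement = 1;
            n++;
        }
    }
    return n;
}

/* Count statements when ';' starts a comment: everything after the
   first ';' is ignored, so at most one statement survives. */
size_t count_with_comment(const char *line)
{
    for (; *line && *line != ';'; line++) {
        if (*line != ' ' && *line != '\t')
            return 1;
    }
    return 0;
}
```

For the line "nop; nop; nop; nop", the first function counts four statements and the second counts one, so a test that was meant to time four instructions times only a single one.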

Second was a little trickier and not so obvious (at least not for an Atarian like me).

As I just had to learn the hard way, Amiga code appears to use register A5 as frame pointer while the rest of the world uses A6.

This isn't going to be a problem as long as you consistently use either %a5 or %fp.

Unfortunately, this wasn't the case. The routines in tests_WORKLOAD_68k.S were using both (%fp in the LINK instruction, %a5 for the UNLK), which obviously corrupted registers of the calling routine and caused the code to fail.
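Why the mismatch is fatal can be seen in a toy model of LINK and UNLK. The sketch below is a simplification: the stack is an array indexed in 32-bit words rather than bytes, and all names and initial register values are invented for the illustration.

```c
#include <stdint.h>

enum { A5 = 5, A6 = 6, SP = 7, STACK_WORDS = 64 };

typedef struct {
    uint32_t a[8];                 /* address registers a0..a7 */
    uint32_t stack[STACK_WORDS];   /* stack memory, indexed in words */
} Cpu;

/* LINK An,#-locals: push An, let An point at the saved value, reserve locals. */
void link_reg(Cpu *c, int n, uint32_t local_words)
{
    c->a[SP] -= 1;
    c->stack[c->a[SP]] = c->a[n];
    c->a[n] = c->a[SP];
    c->a[SP] -= local_words;
}

/* UNLK An: reload SP from An, then pop the saved An. */
void unlk_reg(Cpu *c, int n)
{
    c->a[SP] = c->a[n];
    c->a[n] = c->stack[c->a[SP]];
    c->a[SP] += 1;
}

/* Run a prologue/epilogue pair and report whether SP, a5 and a6 all
   return to the values the caller had. */
int frame_is_balanced(int link_with, int unlk_with)
{
    Cpu c = {0};
    c.a[SP] = 32;
    c.a[A5] = 10;
    c.a[A6] = 20;
    Cpu before = c;

    link_reg(&c, link_with, 2);
    unlk_reg(&c, unlk_with);

    return c.a[SP] == before.a[SP]
        && c.a[A5] == before.a[A5]
        && c.a[A6] == before.a[A6];
}
```

In this model frame_is_balanced(A6, A6) succeeds, while frame_is_balanced(A6, A5) fails: the stack pointer is reloaded from the caller's a5, a5 is overwritten with whatever lies there, and a6 is never restored.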

Fixing that gets us here:


firebee:~#./benchv6_68k 
   ------------------------------------------------
   Processor & Memory Performance Benchmark.
   $VER: Minibench 8.06 (04.07.16) Apollo Team
   ------------------------------------------------
   ------------------------------------------------
   CPU - Math                 64MB 
   ------------------------------------------------
   NOP                        43.4 
   ADD.L REG                 255.6 
   ADD.W Im16                 86.8 
   ADD.L Im32                255.6 
   SHIFT REG                 255.6 
   SHIFT Imm                 255.6 
   AND.L REG                 255.6 
   ANDI.L Im32               172.0 
   MULU.L                     65.2 
   DIV.L                       9.0 
   ROL.L Dn,Dm                36.8 
   BFFFO  Dn{},Dm              1.4 
   BFEXTU (a0){},Dn            3.8 
   ------------------------------------------------
   CPU - Special              64MB 
   ------------------------------------------------
   fuse_ma_x16_1 Reg         479.2 
   fuse_ma_x16_2 Imm8        479.2 
   fuse_ma_x16_3 Imm32       260.6 
   fuse_ma_x16_4 And         479.2 
   bond_ma_x16_1 Reg         255.6 
   bond_ma_x16_2 Imm         255.6 
   bond_ma_x16_3 Mem         104.0 
   ea_latencyx16             123.6 
   alu_latency1x16           253.2 
   cache_latency1x16          93.4 
   ------------------------------------------------
   CPU - EA                   64MB 
   ------------------------------------------------
   R (d16,An)                255.6 
   R (d32,An)                120.2 
   R (An)+                   248.4 
   R (An) ; ADDQ #,An        127.8 
   R (An,Dn)                 127.8 
   R (d32,An,Dn)             124.2 
   W (d16,An)                253.2 
   W (d32,An)                124.2 
   W (An)+                   239.6 
   W (An) ; ADDQ #,An        127.8 
   W (An,Dn)                 125.4 
   W (d32,An,Dn)              64.2 
   U (d16,An)                132.8 
   U (d32,An)                 55.8 
   U (An)+                   132.2 
   U (An) ; ADDQ #,An        102.8 
   U (An,Dn)                 103.6 
   U (d32,An,Dn)              52.6 
   ------------------------------------------------
   CPU - Loop                 64MB 
   ------------------------------------------------
   loopx2                    175.4 
   loopx4                    211.2 
   loopx6                    225.4 
   loopx8                    235.4 
   loopx16                   248.4 
   loopx32                   255.6 
   loopx64                   258.0 
   loopx128                  263.0 
   loopix2                   131.4 
   loopix4                   150.8 
   loopix6                   157.8 
   loopix8                   162.6 
   loopix16                  168.8 
   loopix32                  172.0 
   loopix64                  174.2 
   loopix128                 175.4 
   ------------------------------------------------
   CPU - Goto                 64MB 
   ------------------------------------------------
   goto_x16                   75.0 
   goto2_x16                 175.4 
   goto4_x16                 239.6 
   gotoCC                    126.6 
   gotoCCTRUE                229.4 
   gotoCCFALSE               170.8 
   gosup_chainx1              71.2 
   gosup_chainx2              78.4 
   gosup_chainx4              82.4 
   ------------------------------------------------
   CPU - Workload             64MB 
   ------------------------------------------------
   workload_AAAA             258.0 
   workload_LA               258.0 
   workload_LAA              258.0 
   workload_LAAA             258.0 
   workload_LAAAA            258.0 
   workload_LLA              258.0 
   workload_LLAA             258.0 
   workload_LLAAA            258.0 
   workload_LLAAAA           258.0 
   workload_LAALA            258.0 
   ------------------------------------------------
   Measuring memory throughput:
   Results are in MB/sec. Higher value is faster.
   Memory 2 Memory
   Alignment 0-0       64MB       16KB        4KB 
   ------------------------------------------------
   libc memcpy           54.2       54.6       54.2 
   read 8                56.2       56.0       56.0 
   read 8x4              61.0       60.8       60.6 
   read 32               76.8       76.6       76.4 
   read 32x4             76.8       76.4       76.2 
   read 32x8             78.2       77.8       77.8 
   write 8               45.0       44.8       44.8 
   write 8x4             45.0       44.8       44.8 
   write 32              45.0       44.8       44.8 
   write 32x4            45.0       45.0       44.8 
   write 32x8            45.0       44.8       44.8 
   copy 8                58.0       54.2       52.4 
   copy 8x4              58.2       56.0       52.2 
   copy 32               58.4       56.8       56.4 
   copy 32x4             58.4       56.2       54.8 
   copy 32x8             58.4       56.8       56.4 
   ------------------------------------------------
   Cache 2 Cache
   Alignment 0-0       64MB       16KB        4KB 
   ------------------------------------------------
   libc memcpy           54.2      800.2      852.0 
   read 8                56.2       87.6       87.4 
   read 8x4              61.0      116.6      116.2 
   read 32               76.8      419.4      412.8 
   read 32x4             76.8      838.8      813.4 
   read 32x8             78.2      925.6      894.6 
   write 8               45.0      105.2      104.8 
   write 8x4             45.0      209.6      208.0 
   write 32              45.0      526.2      516.2 
   write 32x4            45.0      838.8      813.4 
   write 32x8            45.0      925.6      894.6 
   copy 8                58.0      172.6      174.8 
   copy 8x4              58.2      206.4      208.6 
   copy 32               58.4      678.4      688.2 
   copy 32x4             58.4      800.2      824.8 
   copy 32x8             58.4      838.8      908.8 
   ------------------------------------------------
   MIPS:   13920 / 75 = 185 
   MEMORY: 6870 / 32 = 214 
   TOTAL:  194 
   ------------------------------------------------
   
firebee:~#

Probably a little disappointing for us FireBee users, but it's not as bad as it looks. Nobody in the Atari world would ever come up with the strange idea of using bitset instructions. PortAsm generates code that loops 32 times through the register with 16 instructions ...

If we do not count them, we reach a score of


   ------------------------------------------------
   MIPS:   13840 / 71 = 194 
   MEMORY: 6880 / 32 = 215 
   TOTAL:  201 
   ------------------------------------------------

- at least more than 200 ;)
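The TOTAL lines are consistent with pooling both sums and both test counts and truncating the result: (13920 + 6870) / (75 + 32) gives 194, and (13840 + 6880) / (71 + 32) gives 201. This is inferred from the printed numbers only, not from the minibench sources, so the function below is a sketch under that assumption (the function name is made up):

```c
/* Pooled score: all CPU and memory results summed, divided by the total
   number of tests. Integer division truncates toward zero, which
   matches the printed totals. */
long pooled_total(long cpu_sum, long cpu_tests, long mem_sum, long mem_tests)
{
    return (cpu_sum + mem_sum) / (cpu_tests + mem_tests);
}
```

Under this reading, dropping slow tests raises the total in two ways: the CPU average itself goes up, and the CPU part gets slightly less weight relative to the memory part.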

On the other hand, minibench is mostly nice to the FireBee in that it uses word-sized instructions very sparingly. These are what really hurt ColdFire performance on TOS in the real world.

All in all: well done, Apollians!

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
31 Oct 2016 15:16

Thanks Markus.

Very interesting result.
For comparison here are current Vampire scores.


-----------------------------------------------------------
MiniBench 8.07h (MC68K)
MEMORY USED: 2MB
-----------------------------------------------------------
CPU - Math                512KB
-----------------------------------------------------------
NOP                       173.1 
ADD.L Reg                 173.1 
ADD.W Imm16               163.5 
ADD.L Imm32               163.5 
SHIFT Reg                 173.0 
SHIFT Imm16               173.1 
AND.L Reg                 173.0 
ANDI.L Imm32              162.1 
MULU.L                     30.0 
DIV.L                       2.6 
ROL.L Dn,Dm               173.1 
BFFFO Dn,Dm                86.6 
BFEXTU (a0),Dn             85.8 
-----------------------------------------------------------
CPU - Special             512KB
-----------------------------------------------------------
Fuse1 x16 Reg             325.8 
Fuse2 x16 Imm8            326.4 
Fuse3 x16 Imm32           326.1 
Fuse4 x16 And             326.2 
Bond1 x16 Reg             173.1 
Bond2 x16 Imm             163.5 
Bond3 x16 Mem             173.1 
EA Latency x16             86.6 
ALU Latency x16            92.0 
Cache Latency x16          86.6 
-----------------------------------------------------------
CPU - EA                  512KB
-----------------------------------------------------------
R (d16,An)                 89.2 
R (d32,An)                 88.8 
R (An)+                    89.3 
R (An); ADDQ #,An          86.5 
R (An,Dn)                  86.2 
R (d32,An,Dn)              86.6 
W (d16,An)                 87.4 
W (d32,An)                 87.4 
W (An)+                    87.8 
W (An); ADDQ #,An          84.8 
W (An,Dn)                  85.2 
W (d32,An,Dn)              84.7 
U (d16,An)                 87.4 
U (d32,An)                 87.8 
U (An)+                    87.4 
U (An); ADDQ #,An          85.2 
U (An,Dn)                  84.8 
U (d32,An,Dn)              84.8 
-----------------------------------------------------------
CPU - Loop                512KB
-----------------------------------------------------------
Loop1 x2                   92.0 
Loop1 x4                  121.8 
Loop1 x6                  138.0 
Loop1 x8                  147.2 
Loop1 x16                 162.1 
Loop1 x32                 173.0 
Loop1 x64                 173.1 
Loop1 x128                176.7 
Loop2 x2                   92.0 
Loop2 x4                  121.9 
Loop2 x6                  138.0 
Loop2 x8                  147.2 
Loop2 x16                 162.1 
Loop2 x32                 163.5 
Loop2 x64                 173.1 
Loop2 x128                176.5 
-----------------------------------------------------------
CPU - Goto                512KB
-----------------------------------------------------------
Goto1                      81.8 
Goto2                     129.9 
Goto4                     147.1 
Gosup1                     39.2 
Gosup2                     39.0 
Gosup4                     36.4 
GotoCC                    132.5 
GotoCC0                   132.5 
GotoCC1                   132.5 
-----------------------------------------------------------
CPU - Workload            512KB
-----------------------------------------------------------
WorkLoad AAAA             176.4 
WorkLoad LA               169.7 
WorkLoad LAA              169.7 
WorkLoad LAAA             175.9 
WorkLoad LAAAA            176.5 
WorkLoad LLA              129.9 
WorkLoad LLAA             169.7 
WorkLoad LLAAA            142.0 
WorkLoad LLAAAA           169.7 
WorkLoad LAALA            169.7 
-----------------------------------------------------------
Memory to Memory (MB/sec)
Alignment 0-0      512KB      16KB       4KB
-----------------------------------------------------------
Libc Memcpy          220.3      216.6      210.5 
Read 8                92.0       91.1       90.9 
Read 8x4              92.0       91.5       90.1 
Read 32              240.4      238.1      234.0 
Read 32x4            240.4      237.4      232.9 
Read 32x8            240.3      236.8      230.6 
Write 8               90.5       90.2       89.6 
Write 8x4             90.0       90.2       88.9 
Write 32             359.5      356.2      345.3 
Write 32x4           359.5      355.5      344.2 
Write 32x8           359.4      354.6      340.1 
Copy 8                54.1       54.1       54.8 
Copy 8x4              70.0       70.8       70.5 
Copy 32              170.9      170.1      168.9 
Copy 32x4            274.6      272.5      268.1 
Copy 32x8            274.2      272.3      268.4 
-----------------------------------------------------------
Cache to Cache (MB/sec)
Alignment 0-0      512KB      16KB       4KB
-----------------------------------------------------------
Libc Memcpy          220.5      306.5      298.2 
Read 8                91.5       91.8       91.2 
Read 8x4              92.0       91.3       90.9 
Read 32              240.4      363.0      353.1 
Read 32x4            240.4      362.7      351.2 
Read 32x8            240.4      361.2      346.1 
Write 8               90.3       89.8       89.4 
Write 8x4             90.5       90.2       89.3 
Write 32             359.4      356.0      345.8 
Write 32x4           359.5      355.6      344.4 
Write 32x8           359.5      354.7      340.4 
Copy 8                54.2       60.9       60.8 
Copy 8x4              70.9       80.9       80.6 
Copy 32              170.6      242.7      240.2 
Copy 32x4            274.8      576.4      558.1 
Copy 32x8            274.6      636.8      614.5 
-----------------------------------------------------------
CPU: 10048 / 75 = 133 MIPS.
MEM: 7152 / 32 = 223 MB/Sec.
ALL: 160 Points.
-----------------------------------------------------------

Markus (mfro)

Posts 99
31 Oct 2016 17:56

Maybe someone would volunteer to throw this (EXTERNAL LINK) at an Amiga compiler and post the outcome?

Yes, it's aged and probably far from being the best benchmark, but since Motorola originally claimed to score 401 VAX MIPS with this on the ColdFire V4, we simply _had to_ test it (and missed the goal miserably, so much for marketing).

We have collected some numbers for different Atari machines here:
http://firebee.org/~firebee/pictures/files/dhrystone.pdf

Would be nice if we could add another 68k to it ...


Vincent Rivière

Posts 87
31 Oct 2016 21:02

Excellent investigation, Markus :-D
