
Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.
Please read the forum usage manual.



Performance and Benchmark Results!

World Record ATARI FreeMiNT Kronos 68080-FPU

Gunnar von Boehn
(Apollo Team Member)
Posts 3089
14 Dec 2017 22:32


Olivier Landemarre wrote:

lf reversing memory with 32

move.l (a1)+,-(a0)
move.l (a1)+,-(a0)
move.l (a1)+,-(a0)


 
Ok, thanks for explaining.
This code explains the result.
 
A code pattern like this:

move.l (a1)+,(a0)+
move.l (a1)+,(a0)+
move.l (a1)+,(a0)+
move.l (a1)+,(a0)+

Could be "merged" by the core.
The core would then combine 2 instructions internally into one 64-bit instruction (as long as the memory region is not flagged as IO space).
The merged 64bit code could reach higher performance.
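
As a rough illustration of the merging idea in C (a sketch only: the function names are hypothetical, and how the core fuses instructions internally is not described in this thread), copying four longwords with 32-bit moves gives the same result as copying them in 64-bit units, with half as many transfers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Equivalent of repeated "move.l (a1)+,(a0)+": one 32-bit transfer
 * per longword. Returns the number of transfers performed. */
static size_t copy_longs32(uint32_t *dst, const uint32_t *src, size_t n_longs)
{
    size_t transfers = 0;
    for (size_t i = 0; i < n_longs; i++) {
        memcpy(&dst[i], &src[i], sizeof(uint32_t));
        transfers++;
    }
    return transfers;
}

/* Same copy, but each pair of adjacent longwords moves as one 64-bit
 * unit. This only models the effect of merging; n_longs must be even. */
static size_t copy_longs64(uint32_t *dst, const uint32_t *src, size_t n_longs)
{
    size_t transfers = 0;
    assert(n_longs % 2 == 0);
    for (size_t i = 0; i < n_longs; i += 2) {
        uint64_t quad;
        memcpy(&quad, &src[i], sizeof quad);
        memcpy(&dst[i], &quad, sizeof quad);
        transfers++;
    }
    return transfers;
}
```

Both functions leave the destination identical; only the transfer count differs.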
 


Vincent Rivière

Posts 81
15 Dec 2017 07:13



move.l (a1)+,-(a0)
move.l (a1)+,-(a0)

On the other hand, the core could be smart and also optimize code like that ;-)


Markus (mfro)

Posts 83
15 Dec 2017 17:41


Vincent Rivière wrote:


  move.l (a1)+,-(a0)
  move.l (a1)+,-(a0)
 

  On the other hand, the core could be smart and also optimize code like that ;-)

... but would then have to swap longwords in a quad word ;)
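
To make the swap concrete, here is a small C sketch (hypothetical function names; the quad word is modelled with the first longword in the upper half, as on a big-endian 68k). A merged version of the reversing copy has to exchange the two longwords of every quad word before storing it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What n repetitions of "move.l (a1)+,-(a0)" do.
 * dst_end points one past the last destination longword. */
static void reverse_copy32(const uint32_t *src, uint32_t *dst_end, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *--dst_end = *src++;
}

/* Exchange the two longwords held in one quad word. */
static uint64_t swap_longwords(uint64_t quad)
{
    return (quad << 32) | (quad >> 32);
}

/* Hypothetical merged variant: two longwords per step. n must be even. */
static void reverse_copy64(const uint32_t *src, uint32_t *dst_end, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        /* Load two adjacent longwords as one quad word. */
        uint64_t quad = ((uint64_t)src[0] << 32) | src[1];
        quad = swap_longwords(quad);
        /* Store the swapped quad word below the previous one. */
        dst_end -= 2;
        dst_end[0] = (uint32_t)(quad >> 32);
        dst_end[1] = (uint32_t)quad;
        src += 2;
    }
}
```

Without the call to swap_longwords() the merged version would reverse the order of the quad words but leave each pair of longwords in its original order.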


Gunnar von Boehn
(Apollo Team Member)
Posts 3089
15 Dec 2017 18:50


Vincent Rivière wrote:


  move.l (a1)+,-(a0)
  move.l (a1)+,-(a0)
 

  On the other hand, the core could be smart and also optimize code like that ;-)

This code is not a normal memcopy.
It's a copy which reverses the content of the memory.
I would assume that this code is not common in real life.
Or do you think that is done often?

On the other hand, a normal memcopy is used very often, so tuning it to be faster really makes sense.

Vincent, can you tell us what code KRONOS triggers when copying memory to the GFX card?

KRONOS measures "only" 40 MB/sec with the GFX-mem operation.
I would have expected 200 MB/sec.
Can you show the code and help us understand the reason?



OneSTone O2o

Posts 119
15 Dec 2017 21:03


Gunnar, the video memory test uses a VDI function (VDI is a component of GEM):
   
   
Olivier Landemarre wrote:

    So in theory we should get a value similar to the memory-to-memory copy result in your case, and it looks like we get around half of it; I fully agree with you. But rather than copying memory to memory directly, this test uses the VDI function vro_cpyfm to copy an area of the video screen somewhere else. Probably the routine is not optimized; I'm sure that with an optimized routine we would get perhaps an even better result than with my own simple linear copy routine. It depends on EmuTOS, not Kronos.
   

   


Markus (mfro)

Posts 83
15 Dec 2017 21:36


Olivier Landemarre wrote:
...It depends on EmuTOS, not Kronos...

 
  I don't think EmuTOS has anything to do with that. These screenshots have apparently been taken from an fVDI driven screen, using Vincent's driver. This driver replaces EmuTOS' vro_cpyfm() routine with its own (a pretty straightforward 16 bit copy routine implemented in C).
 
  I'm not sure if I have the latest sources, but apparently fVDI holds an offscreen buffer in FastRAM. Not sure if it's needed here and if it really ended up in the driver, but if it did, it would explain the low speed: memory appears to be actually copied twice - once to/from screen, once to/from the offscreen buffer. Maybe Vincent is able to tell if this is the case or not.

The background is probably speed: on a machine with very slow screen memory access (and reasonably fast FastRAM, like on the CT60), overall screen blit speed (with logic ops) will accelerate considerably (because you can do the logic op in FastRAM and only need to write once to slow screen memory).

If Vampire screen memory access is indeed as fast as FastRAM, this is probably just a waste of time?


Peter Slegg

Posts 13
15 Dec 2017 22:05


For reference I just ran Kronos 2.01 on the Milan060 @50MHz

It managed:

Mothercard Perf.: 501
BogoMIPS: 47.77


Gunnar von Boehn
(Apollo Team Member)
Posts 3089
15 Dec 2017 22:53


Markus (mfro) wrote:

I'm not sure if I have the latest sources, but apparently fVDI holds an offscreen buffer in FastRAM. Not sure if it's needed here and if it really ended up in the driver, but if it did, it would explain the low speed: memory appears to be actually copied twice - once to/from screen, once to/from the offscreen buffer. Maybe Vincent is able to tell if this is the case or not.

I do not fully understand what the KRONOS test does here.
Maybe you can help me to understand it better, so that we can see whether this is test related or a driver issue which could be improved.

From the video numbers printed in KRONOS it looks like the GFX speed could be improved a lot.
I think it would be great to make ATARI OS run really fast, and I would like to help.



Markus (mfro)

Posts 83
16 Dec 2017 05:56


Gunnar von Boehn wrote:

  I do not fully understand what the KRONOS test does here.
  Maybe you can help me to understand it better, so that we can see whether this is test related or a driver issue which could be improved.
 
  From the video numbers printed in KRONOS it looks like the GFX speed could be improved a lot.
  I think it would be great to make ATARI OS run really fast, and I would like to help.
 

I'm not familiar with the Kronos internals, but what I understand from the discussion is that it tests video memory access with the OS' (software) blit routines.

vro_cpyfm() is the VDI software blitter. It copies rectangular rasters from main memory to video memory (and v.v.) with a logic op (AND, OR, XOR, ...) between source and destination.
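
To illustrate what such a software blit with a logic op involves, here is a minimal C sketch. It is not the VDI or fVDI code: the type, the enum names and the function signatures are made up for this example, and the enum values are not the VDI mode numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t pixel_t;   /* 16-bit pixel, chosen for illustration */

/* Hypothetical operation names; these are not the VDI mode numbers. */
enum blit_op { OP_SRC_ONLY, OP_AND, OP_OR, OP_XOR };

static pixel_t combine(enum blit_op op, pixel_t s, pixel_t d)
{
    switch (op) {
    case OP_AND: return (pixel_t)(s & d);
    case OP_OR:  return (pixel_t)(s | d);
    case OP_XOR: return (pixel_t)(s ^ d);
    default:     return s;  /* OP_SRC_ONLY: destination is ignored */
    }
}

/* Copy a w x h rectangle. Strides are in pixels per line. */
static void blit_rect(enum blit_op op,
                      const pixel_t *src, size_t src_stride,
                      pixel_t *dst, size_t dst_stride,
                      size_t w, size_t h)
{
    assert(w <= src_stride && w <= dst_stride);
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            pixel_t s = src[y * src_stride + x];
            pixel_t *d = &dst[y * dst_stride + x];
            /* Every op except source-only has to read the
             * destination before writing it. */
            *d = (op == OP_SRC_ONLY) ? s : combine(op, s, *d);
        }
    }
}
```

The extra destination read in the non-copy cases is what makes these ops expensive when the destination memory is slow.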

Blits from FastRAM to video mem are (obviously) a read-modify-write operation on target RAM. The driver appears to be derived from the Falcon CT60 driver. The CT60 has very slow 16 bit video memory, but considerably faster 32 bit FastRAM. The driver appears to accelerate the blit logic ops by maintaining a video shadow in FastRAM to do the logical composition there instead. This avoids a read-modify-write operation in slow video memory at the expense of double writes (to video memory AND the shadow buffer) in case there is no logic op involved.
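
The trade-off can be sketched in C with simple access counters. This is an illustration under my own assumptions, not the fVDI driver code; the struct and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t pixel_t;

struct screen {
    pixel_t *video;              /* slow video memory */
    pixel_t *shadow;             /* same pixels mirrored in fast RAM */
    unsigned long video_reads;
    unsigned long video_writes;
    unsigned long shadow_writes;
};

/* XOR without a shadow: read-modify-write on slow video memory. */
static void xor_direct(struct screen *s, const pixel_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pixel_t d = s->video[i];
        s->video_reads++;
        s->video[i] = (pixel_t)(src[i] ^ d);
        s->video_writes++;
    }
}

/* XOR with a shadow: the old pixel comes from fast RAM, and the
 * result is written to both buffers. No slow read is needed. */
static void xor_shadowed(struct screen *s, const pixel_t *src, size_t n)
{
    assert(s->shadow != NULL);
    for (size_t i = 0; i < n; i++) {
        pixel_t v = (pixel_t)(src[i] ^ s->shadow[i]);
        s->shadow[i] = v;
        s->shadow_writes++;
        s->video[i] = v;
        s->video_writes++;
    }
}

/* Plain copy with a shadow: no logic op at all, yet every pixel is
 * still written twice to keep the shadow in sync. */
static void copy_shadowed(struct screen *s, const pixel_t *src, size_t n)
{
    assert(s->shadow != NULL);
    for (size_t i = 0; i < n; i++) {
        s->shadow[i] = src[i];
        s->shadow_writes++;
        s->video[i] = src[i];
        s->video_writes++;
    }
}
```

The shadow pays off only when video reads are much slower than FastRAM accesses; for a plain copy it adds one extra write per pixel.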



Olivier Landemarre

Posts 39
16 Dec 2017 07:51


Gunnar von Boehn wrote:

Vincent Rivière wrote:

 

  move.l (a1)+,-(a0)
  move.l (a1)+,-(a0)
 

  On the other hand, the core could be smart and also optimize code like that ;-)
 

 
  This code is not a normal memcopy.
  It's a copy which reverses the content of the memory.
  I would assume that this code is not common in real life.
  Or do you think that is done often?
 
  On the other hand, a normal memcopy is used very often, so tuning it to be faster really makes sense.
 
  Vincent, can you tell us what code KRONOS triggers when copying memory to the GFX card?
 
  KRONOS measures "only" 40 MB/sec with the GFX-mem operation.
  I would have expected 200 MB/sec.
  Can you show the code and help us understand the reason?
 

I fully agree it is not a normal memcopy! The reason I did it this way was to reduce the memory needed to run Kronos, so I allocate only 64 KB, and I wanted to reduce cache effects as much as possible. But your results are very good, because the CT60 gives the maximum memory access speed possible for a 68060. These tests just give an idea, to compare apples to apples. For me the most interesting tests are the GFX information and the small OpenGL test.
- The first one because it is useful for driver writers to know the speed of 4 routines: line, filled rectangle, text display and memory block; the other tests are just for fun. Kronos was used intensively by Didier Mequignon when he worked on the Radeon driver, and his driver is really fast now, but the first time I used it and we started testing, it was very slow on the ColdFire evaluation board. A fast video driver is very important for the interface: with a fast driver, even if the computer is not very fast, the user does not notice, because most of the time the computer is waiting for the user, but the user doesn't like to wait!
- The second one, the small OpenGL test, is a full processor test using memory, CPU and FPU. I like it.

To answer your question, here is the source code for the screen-to-screen copy.
It copies an area as wide as the screen minus 16 pixels from left to right, with an offset of 16 pixels:

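nb = 1L;                /* number of blits per timed run */
apres = 0;              /* "after": time at the end of a run */
avant = 0;              /* "before": time at the start of a run */
/* xy[0..3]: source rectangle, xy[4..7]: destination rectangle */
xy[0] = 0;
xy[1] = 0;
xy[2] = xy[6] = _var_sys->work_out[0] - 16;
xy[3] = xy[7] = _var_sys->work_out[1];
xy[4] = 16;
xy[5] = 0;
/* repeat until one timed run lasts at least 200 time units */
while ((apres - avant) < 200L)
{
    encours = 0;        /* "in progress": blits done in this run */

    avant = my_sync();
    while (encours < nb)
    {
        vro_cpyfm(_var_sys->vdihandle, 3, xy, &ecran, &ecran);
        encours++;
    }
    apres = stop_chrono();

    /* run too short to measure well: raise the blit count and retry */
    if ((apres - avant) < 20)
        nb *= 15L;
    else if ((apres - avant) < 50L)
        nb *= 5;
    else if ((apres - avant) >= 200L)
    {
        /* long enough: keep nb, the outer loop ends */
    }
    else
        nb *= 2L;
}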

For the other copies, from screen to memory or from memory to screen, it is nearly the same test: a copy of the full screen to/from a TT-RAM buffer.
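
The self-calibrating pattern of the timing loop (grow the repeat count until one run is long enough to measure) can be isolated as in the following sketch. This is not Kronos code: timed_run_fn, next_count(), calibrate() and fake_run() are hypothetical names, and the thresholds 20, 50 and 200 are taken from the loop as posted, in whatever time unit its timer uses.

```c
#include <assert.h>
#include <stddef.h>

/* Runs the workload `count` times; returns the elapsed time in ticks. */
typedef long (*timed_run_fn)(long count, void *ctx);

/* Scaling rule: the shorter the run was, the faster the count grows. */
static long next_count(long count, long elapsed)
{
    if (elapsed < 20)
        return count * 15;
    if (elapsed < 50)
        return count * 5;
    if (elapsed < 200)
        return count * 2;
    return count;               /* long enough, keep the count */
}

/* Repeats timed runs until one lasts at least 200 ticks. Returns the
 * final count; the last elapsed time is stored in *elapsed_out.
 * Note: this never terminates if the workload always reports a
 * shorter time, for example with a broken timer. */
static long calibrate(timed_run_fn run, void *ctx, long *elapsed_out)
{
    long count = 1;
    long elapsed;

    assert(run != NULL);
    for (;;) {
        elapsed = run(count, ctx);
        if (elapsed >= 200)
            break;
        count = next_count(count, elapsed);
    }
    if (elapsed_out != NULL)
        *elapsed_out = elapsed;
    return count;
}

/* Stand-in workload for demonstration: ten iterations take one tick. */
static long fake_run(long count, void *ctx)
{
    (void)ctx;
    return count / 10;
}
```

With fake_run() the count grows 1, 15, 225, 1125, 2250 and the last run reports 225 ticks. A throughput figure would then follow from the bytes moved per blit, the final count and the final elapsed time.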

Olivier




Markus (mfro)

Posts 83
16 Dec 2017 10:26


Olivier Landemarre wrote:

              vro_cpyfm(_var_sys->vdihandle,3,xy,&ecran,&ecran);
 

 
  Mode 3 above is "source only", so no logic op involved. If Vincent's VDI driver really maintains the FastRAM screen buffer (as assumed above), it receives the full penalty for double buffer writes here.


Olivier Landemarre

Posts 39
16 Dec 2017 11:27


Markus (mfro) wrote:

Olivier Landemarre wrote:

              vro_cpyfm(_var_sys->vdihandle,3,xy,&ecran,&ecran);
 

 
  Mode 3 above is "source only", so no logic op involved. If Vincent's VDI driver really maintains the FastRAM screen buffer (as assumed above), it receives the full penalty for double buffer writes here.

Yes, that is possible, so it should be easy to make it far faster in this case!


Vincent Rivière

Posts 81
16 Dec 2017 13:57


Gunnar von Boehn wrote:

Vincent can you tell us what the CODE is that KRONOS triggers when copying the memory to GFX card?

Kronos is closed source, so I have no idea what it does. Only Olivier can tell.

Markus (mfro) wrote:

I'm not sure if I have the latest sources, but apparently fVDI holds an offscreen buffer in FastRAM. Not sure if it's needed here and if it really ended up in the driver, but if it did, it would explain the low speed: memory appears to be actually copied twice - once to/from screen, once to/from the offscreen buffer. Maybe Vincent is able to tell if this is the case or not.

The fVDI sources I used are there:
EXTERNAL LINK  AFAIK they are the latest official fVDI sources, plus a few bugfixes from me to keep it from crashing completely, plus my SAGA driver.

My SAGA driver is more or less just a copy/paste of the original Falcon 16-bit driver provided by fVDI. I mainly changed the video mode initialization to switch to the SAGA screen, not much more. The actual drawing routines are the ones provided by the Falcon 16-bit sample, implemented in C. Do not expect any performance there; this was just an (incomplete) example of a C driver provided with fVDI. My SAGA driver for fVDI was mainly a proof of concept. Once again, do not expect any performance.

Generally speaking, fVDI is a big mess. I can just tell that it calls the graphics primitives from the drivers, and fills the gaps to provide the proper VDI interface. I don't know much more.

I did indeed see that offscreen buffer stuff, but IIRC I disabled it in the SAGA driver.

And final note about EmuTOS: in Vampire optimized binaries (floppy and ROM), I put everything possible into FastRAM. Chip RAM is only used when absolutely necessary.
About FastRAM usage in FreeMiNT / XaAES / fVDI / Kronos, I have no idea.


Gunnar von Boehn
(Apollo Team Member)
Posts 3089
16 Dec 2017 15:01


Cool, reading the sources!
 
There are of course very many functions in them.
 
Maybe someone with knowledge of ATARI OS can shed some light on which functions are most important / most used.
 
For example what function is used if a WINDOW is moved?
 
I assume if you open a folder on screen, the folder/window is first cleared/filled and then Icons are printed in it?
 
If this is the case then maybe looking 1st at the Rect_fill code makes sense?

I'm not sure which routine is the final code called for filling a rect?
Is this the final work loop?
vdi/fvdi/drivers/16_bit/16b_fill.c




Olivier Landemarre

Posts 39
16 Dec 2017 17:19


Gunnar von Boehn wrote:

Cool, reading the sources!
 
  There are of course very many functions in them.
 
  Maybe someone with knowledge of ATARI OS can shed some light on which functions are most important / most used.
 
  For example what function is used if a WINDOW is moved?
 
  I assume if you open a folder on screen, the folder/window is first cleared/filled and then Icons are printed in it?
 
  If this is the case then maybe looking 1st at the Rect_fill code makes sense?
 
 
  I'm not sure which routine is the final code called for filling a rect?
  Is this the final work loop?
  vdi/fvdi/drivers/16_bit/16b_fill.c
 
 

I think I can easily answer this.
When windows move, the most important thing for speed is generally the block copy from screen to screen or screen <-> memory (I speak mostly for my own AES; I think XaAES is a bit better optimized). If some area of the screen was not displayed before, rectangle filling is probably one of the most used functions.

As I said in my previous message, the most important functions for the AES are rectangle filling, lines, text and block copy; the other functions are almost never used.



Markus (mfro)

Posts 83
16 Dec 2017 17:21


Vincent Rivière wrote:

  The fVDI sources I used are there:
  EXTERNAL LINK  AFAIK they are the latest official fVDI sources, plus a few bugfixes from me to keep it from crashing completely, plus my SAGA driver.

Yes, that's what I've been looking at as well. I assume the SAGA driver uses the 16_bit/16b_blit.c software blit routines? That file still has

#define FAST
#define BOTH

set, which activates the FastRAM buffer (and thus causes double writes), if I'm not mistaken.


Gunnar von Boehn
(Apollo Team Member)
Posts 3089
16 Dec 2017 17:30


Markus (mfro) wrote:

I assume the SAGA driver uses the 16_bit/16b_blit.c software blit routines?

Maybe we can also look at the created ASM of those?




Markus (mfro)

Posts 83
16 Dec 2017 18:04


Gunnar von Boehn wrote:

 
Markus (mfro) wrote:

  I assume the SAGA driver uses the 16_bit/16b_blit.c software blit routines?
 

 
  Maybe we can also look at the created ASM of those?
 
 
 

 
  That's not worth it; it's extremely simple:
 
 

  for(i = h - 1; i >= 0; i--) {
      for(j = w - 1; j >= 0; j--) {
          v = *src_addr++;
  #ifdef BOTH
          *(volatile PIXEL *)dst_addr_fast++ = v, 0; 
  #endif
          *dst_addr++ = v;
      }
      src_addr += src_line_add;
      dst_addr += dst_line_add;
  }
 

 
  PIXEL is a typedef for short.
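
For comparison, here is a sketch of how a plain copy with a single destination could be done one line at a time instead of one pixel at a time. It is only an illustration, not part of fVDI: both function names are hypothetical, the line-add parameters follow the convention of the loop above (pixels to skip after each line), and memcpy is valid only if source and destination do not overlap, which a screen-to-screen blit may violate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef short PIXEL;

/* Reference: pixel-by-pixel copy with the same pointer arithmetic as
 * the loop above, single destination only (no BOTH write). */
static void blit_copy_pixels(const PIXEL *src_addr, PIXEL *dst_addr,
                             int w, int h,
                             int src_line_add, int dst_line_add)
{
    for (int i = h - 1; i >= 0; i--) {
        for (int j = w - 1; j >= 0; j--)
            *dst_addr++ = *src_addr++;
        src_addr += src_line_add;
        dst_addr += dst_line_add;
    }
}

/* Variant: one memcpy per line, leaving the choice of transfer width
 * to the C library. Assumes non-overlapping buffers. */
static void blit_copy_rows(const PIXEL *src_addr, PIXEL *dst_addr,
                           int w, int h,
                           int src_line_add, int dst_line_add)
{
    assert(w >= 0 && h >= 0);
    for (int i = 0; i < h; i++) {
        memcpy(dst_addr, src_addr, (size_t)w * sizeof(PIXEL));
        src_addr += w + src_line_add;
        dst_addr += w + dst_line_add;
    }
}
```

Whether this is actually faster on a given machine depends on the memcpy implementation and on the memory being written, so it would need to be measured.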


Gunnar von Boehn
(Apollo Team Member)
Posts 3089
16 Dec 2017 19:03


Markus (mfro) wrote:

That's not worth it; it's extremely simple:

It could still be interesting to see what ASM is created.


Markus (mfro)

Posts 83
16 Dec 2017 19:20


Gunnar von Boehn wrote:

Markus (mfro) wrote:

  That's not worth it; it's extremely simple:
 

  It could still be interesting to see what ASM is created.


000001e0 <_s_blit_copy>:
  1e0:  48e7 3f3c      moveml %d2-%d7/%a2-%a5,%sp@-
  1e4:  242f 002c      movel %sp@(44),%d2
  1e8:  262f 0032      movel %sp@(50),%d3
  1ec:  282f 0036      movel %sp@(54),%d4
  1f0:  386f 003c      moveaw %sp@(60),%a4
  1f4:  3a2f 003e      movew %sp@(62),%d5
  1f8:  5345            subqw #1,%d5
  1fa:  6b50            bmis 24c <.L5>
  1fc:  366f 0030      moveaw %sp@(48),%a3
  200:  200b            movel %a3,%d0
  202:  d080            addl %d0,%d0
  204:  2640            moveal %d0,%a3
  206:  3e2f 003a      movew %sp@(58),%d7
  20a:  48c7            extl %d7
  20c:  de87            addl %d7,%d7
  20e:  3c0c            movew %a4,%d6
  210:  5346            subqw #1,%d6
  212:  0286 0000 ffff  andil #65535,%d6
  218:  2006            movel %d6,%d0
  21a:  d080            addl %d0,%d0
  21c:  2a40            moveal %d0,%a5
  21e:  548d            addql #2,%a5
  220:  5286            addql #1,%d6
  222:  dc86            addl %d6,%d6

00000224 <.L9>:
  224:  300c            movew %a4,%d0
  226:  6f1a            bles 242 <.L7>
  228:  2202            movel %d2,%d1
  22a:  d28d            addl %a5,%d1
  22c:  2444            moveal %d4,%a2
  22e:  2243            moveal %d3,%a1
  230:  2042            moveal %d2,%a0

00000232 <.L8>:
  232:  3018            movew %a0@+,%d0
  234:  34c0            movew %d0,%a2@+
  236:  32c0            movew %d0,%a1@+
  238:  b288            cmpl %a0,%d1
  23a:  66f6            bnes 232 <.L8>
  23c:  d486            addl %d6,%d2
  23e:  d886            addl %d6,%d4
  240:  d686            addl %d6,%d3

00000242 <.L7>:
  242:  d48b            addl %a3,%d2
  244:  d687            addl %d7,%d3
  246:  d887            addl %d7,%d4
  248:  51cd ffda      dbf %d5,224 <.L9>

0000024c <.L5>:
  24c:  4cdf 3cfc      moveml %sp@+,%d2-%d7/%a2-%a5
  250:  4e75            rts
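
Read back into C, the inner loop at .L8 appears to do the following. This is a hand-written interpretation for readability, not compiler output: the function name is hypothetical, the parameters are named after the registers, and which of the two destination registers points to the FastRAM buffer is not determined here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One line of the copy as the loop at .L8 seems to perform it.
 * a0 = source, d1 = end of the source line, a2 and a1 = the two
 * destinations that both receive every pixel. */
static void l8_row(const int16_t *a0, const int16_t *d1,
                   int16_t *a2, int16_t *a1)
{
    assert(a0 != d1);           /* the code before .L8 skips empty lines */
    do {
        int16_t d0 = *a0++;     /* movew %a0@+,%d0 */
        *a2++ = d0;             /* movew %d0,%a2@+ */
        *a1++ = d0;             /* movew %d0,%a1@+ */
    } while (a0 != d1);         /* cmpl %a0,%d1 ; bnes .L8 */
}
```

Each iteration moves a single 16-bit pixel with one load and two stores, so the double write is present in this build, and the loop uses 16-bit movew rather than the 32-bit move.l pattern discussed earlier in the thread.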


