Performance and Benchmark Results!
|
World Record ATARI FreeMiNT Kronos 68080-FPU
|
---|
|
---|
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 14 Dec 2017 22:32
| Olivier Landemarre wrote:
| lf reversing memory with 32
move.l (a1)+,-(a0)
move.l (a1)+,-(a0)
move.l (a1)+,-(a0)
|
OK, thanks for explaining. This code explains the result. A code pattern like this:
move.l (a1)+,(a0)+
move.l (a1)+,(a0)+
move.l (a1)+,(a0)+
move.l (a1)+,(a0)+
could be "merged" by the core. The core would then combine 2 instructions internally into one 64-bit instruction (as long as the memory region is not flagged as IO space). The merged 64-bit code could reach higher performance.
| |
| | Vincent Rivière
Posts 87 15 Dec 2017 07:13
| move.l (a1)+,-(a0)
move.l (a1)+,-(a0)
|
On the other hand, the core could be smart and also optimize code like that ;-)
| |
| | Markus (mfro)
Posts 99 15 Dec 2017 17:41
| Vincent Rivière wrote:
|
move.l (a1)+,-(a0)
move.l (a1)+,-(a0) |
On the other hand, the core could be smart and also optimize code like that ;-)
|
... but would then have to swap longwords in a quad word ;)
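To make the swap concrete, here is a minimal C sketch of the data movement only. It is not a description of how the Apollo core works internally; the function names are hypothetical. The first function mirrors the `move.l (a1)+,-(a0)` loop, the second moves two longwords per step and shows that the pair has to be written in swapped order to get the same result.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference behaviour of the move.l (a1)+,-(a0) loop: read forwards,
   write backwards, one 32-bit longword at a time.
   dst_end points one past the last destination longword. */
void reverse_copy_32(const uint32_t *src, uint32_t *dst_end, size_t count)
{
    while (count--)
        *--dst_end = *src++;
}

/* Same result, handling two longwords per step (hypothetical helper).
   The two halves of each 64-bit pair must swap places on the way out,
   which is the extra work a "merged" access would need. */
void reverse_copy_pairs(const uint32_t *src, uint32_t *dst_end, size_t count)
{
    while (count >= 2) {
        uint32_t first = src[0];
        uint32_t second = src[1];

        dst_end -= 2;
        dst_end[0] = second;   /* later source longword lands first */
        dst_end[1] = first;
        src += 2;
        count -= 2;
    }
    if (count)                 /* odd leftover longword */
        *--dst_end = *src;
}
```

For a plain forward copy (`(a1)+,(a0)+`) no such swap is needed, which is why that pattern is the easier one to merge.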
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 15 Dec 2017 18:50
| Vincent Rivière wrote:
|
move.l (a1)+,-(a0)
move.l (a1)+,-(a0) |
On the other hand, the core could be smart and also optimize code like that ;-)
|
This code is not a normal memcopy. It's a copy which reverses the content of the memory. I would assume that this code is not common in real life. Or do you think that is done often? On the other hand, a normal memcopy is used very often, so tuning this to be faster really makes sense.
Vincent, can you tell us what the code is that KRONOS triggers when copying the memory to the GFX card? KRONOS measures "only" 40 MB/sec with the GFX-mem operation. I would have expected 200 MB/sec. Can you show the code and help us to understand the reason?
| |
| | OneSTone O2o
Posts 159 15 Dec 2017 21:03
| Gunnar, the video memory test is using a VDI function (VDI is a component of GEM). Olivier Landemarre wrote:
| So in theory we should have a similar value in your case to the value found for the memory-to-memory copy, and it looks like we have around half. I fully agree with you, but this test, rather than directly copying memory to memory, uses the VDI function vro_cpyfm to copy an area of the video screen somewhere else. Probably the routine is not optimized; I'm sure that with an optimized routine we should get perhaps an even better result than my own simple linear copy routine. It depends on EmuTOS, not Kronos. |
| |
| | Markus (mfro)
Posts 99 15 Dec 2017 21:36
| Olivier Landemarre wrote:
| ...It depends on EmuTOS, not Kronos... |
I don't think EmuTOS has anything to do with that. These screenshots have apparently been taken from an fVDI-driven screen, using Vincent's driver. This driver replaces EmuTOS' vro_cpyfm() routine with its own (a pretty straightforward 16-bit copy routine implemented in C).
I'm not sure if I have the latest sources, but apparently fVDI holds an offscreen buffer in FastRAM. Not sure if it's needed here and if it really ended up in the driver, but if it did, it would explain the low speed: memory appears to be actually copied twice, once to/from screen, once to/from the offscreen buffer. Maybe Vincent is able to tell if this is the case or not.
The background is probably speed: on a machine with very slow screen memory access (and reasonably fast FastRAM, like on the CT60), overall screen blit speed (with logic ops) will considerably accelerate (because you can do the logic op in FastRAM and only need to write once to slow screen memory). If Vampire screen memory access is indeed as fast as FastRAM, this is probably just a waste of time?
| |
| | Peter Slegg
Posts 22 15 Dec 2017 22:05
| For reference, I just ran Kronos 2.01 on the Milan060 @ 50 MHz. It managed: Mothercard Perf.: 501, BogoMIPS: 47.77
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 15 Dec 2017 22:53
| Markus (mfro) wrote:
| I'm not sure if I have the latest sources, but apparently fVDI holds an offscreen buffer in FastRAM. Not sure if it's needed here and if it really ended up in the driver, but if it did, it would explain the low speed: memory appears to be actually copied twice - once to/from screen, once to/from the offscreen buffer. Maybe Vincent is able to tell if this is the case or not.
|
I do not fully understand what the KRONOS test does here. Maybe you can help me to understand it better, so that we can see if this is test related or a driver issue which could be improved. From the video numbers printed in KRONOS it looks like the GFX speed could be improved a lot. I think it would be great to make ATARI OS run really fast, and I would like to help.
| |
| | Markus (mfro)
Posts 99 16 Dec 2017 05:56
| Gunnar von Boehn wrote:
| I do not fully understand what the KRONOS test does here. Maybe you can help me to understand it better, so that we can see if this is test related or a driver issue which could be improved. From the video numbers printed in KRONOS it looks like the GFX speed could be improved a lot. I think it would be great to make ATARI OS run really fast, and I would like to help.
|
I'm not familiar with the Kronos internals, but what I understand from the discussion is that it tests video memory access with the OS' (software) blit routines. vro_cpyfm() is the VDI software blitter. It copies rectangular rasters from main memory to video memory (and vice versa) with a logic op (AND, OR, XOR, ...) between source and destination. Blits from FastRAM to video memory are (obviously) a read-modify-write operation on the target RAM.
The driver appears to be derived from the Falcon CT60 driver. The CT60 has very slow 16-bit video memory, but considerably faster 32-bit FastRAM. The driver appears to accelerate the blit logic ops by maintaining a video shadow in FastRAM and doing the logical composition there instead. This avoids a read-modify-write operation in slow video memory, at the expense of double writes (to video memory AND the shadow buffer) in case there is no logic op involved.
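The trade-off described above can be sketched in C as follows. This is only an illustration under assumptions, not the fVDI driver code: the function names, the `pixel_t` type and the flat one-dimensional buffers are hypothetical, and XOR stands in for any logic op.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t pixel_t;   /* assumed 16-bit pixels */

/* Blit with a logic op (XOR as an example). The old destination value is
   read from the FastRAM shadow, so slow video memory is only written,
   never read. */
void blit_xor_shadowed(const pixel_t *src, pixel_t *video,
                       pixel_t *shadow, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pixel_t v = (pixel_t)(src[i] ^ shadow[i]);  /* compose in FastRAM */
        shadow[i] = v;                              /* keep shadow in sync */
        video[i] = v;                               /* one write to video  */
    }
}

/* Plain copy (no logic op). Here the shadow buys nothing: every pixel is
   still written twice, which is the "double write" penalty. */
void blit_copy_shadowed(const pixel_t *src, pixel_t *video,
                        pixel_t *shadow, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        shadow[i] = src[i];
        video[i] = src[i];
    }
}
```

When video memory is about as fast as FastRAM, the first function gains little over reading the destination directly, and the second one simply does twice the writes.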
| |
| | Olivier Landemarre
Posts 147 16 Dec 2017 07:51
| Gunnar von Boehn wrote:
|
Vincent Rivière wrote:
| move.l (a1)+,-(a0)
move.l (a1)+,-(a0) |
On the other hand, the core could be smart and also optimize code like that ;-) |
This code is not a normal memcopy. It's a copy which reverses the content of the memory. I would assume that this code is not common in real life. Or do you think that is done often? On the other hand, a normal memcopy is used very often, so tuning this to be faster really makes sense.
Vincent, can you tell us what the code is that KRONOS triggers when copying the memory to the GFX card? KRONOS measures "only" 40 MB/sec with the GFX-mem operation. I would have expected 200 MB/sec. Can you show the code and help us to understand the reason?
|
I fully agree it is not a normal memcopy! The reason I did this was to reduce the memory needed to run Kronos, so I allocate only 64 KB, and I want to reduce cache effects as much as possible. But your results are very good, because the CT60 gives the maximum memory access speed possible for a 68060. These tests just give an idea, comparing apples to apples. The most interesting tests are the GFX information and the small OpenGL test.
- The first one because it is useful for a driver writer to know the speed of 4 routines: line, filled rectangle, text display and memory block; the other tests are just for fun. Kronos was used intensively by Didier Mequignon when he worked on the Radeon driver, and his driver is really fast now, but the first time I used it and we started testing, it was very slow on the ColdFire evaluation board. A fast video driver is very important for the interface: with a fast driver, even if the computer is not very fast, the user does not notice, because most of the time the computer is waiting for the user, but users don't like to wait!
- The second, the small OpenGL test, is a full processor test using memory, CPU and FPU. I like it.
For your question, here is the source code for the copy from screen to screen. This is a copy from the left to the right of the width of the screen minus 16 pixels, with an offset of 16 pixels:
    nb=1L; apres=0; avant=0;
    xy[0]=0; xy[1]=0;
    xy[2]=xy[6]=_var_sys->work_out[0]-16;
    xy[3]=xy[7]=_var_sys->work_out[1];
    xy[4]=16; xy[5]=0;
    while((apres - avant)<200L)
    {
        encours=0;
        avant=my_sync();
        while(encours<nb)
        {
            vro_cpyfm(_var_sys->vdihandle,3,xy,&ecran,&ecran);
            encours++;
        }
        apres=stop_chrono();
        if((apres-avant)<20) nb*=15L;
        else if((apres-avant)<50L) nb*=5;
        else if((apres-avant)>=200L) { }
        else nb *=2L;
    }
For the other copies, with the screen to memory or from memory, it is nearly the same test: a copy of the full screen to/from a TT-RAM buffer.
Olivier
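The timing loop above grows the repeat count `nb` until one run lasts at least 200 timer units. The sketch below isolates that schedule so it can be followed without a VDI. It is a simplified model, not Kronos code: both function names are hypothetical, the timer is replaced by a fixed assumed cost per call, and the timer unit is whatever `my_sync()` and `stop_chrono()` return in the original.

```c
#include <assert.h>

/* Next repeat count, using the thresholds from the loop quoted above:
   very short runs grow by 15x, short runs by 5x, medium runs by 2x,
   and a run of 200 units or more leaves the count unchanged. */
long next_repeat_count(long nb, long elapsed)
{
    if (elapsed < 20L)
        return nb * 15L;
    if (elapsed < 50L)
        return nb * 5L;
    if (elapsed >= 200L)
        return nb;
    return nb * 2L;
}

/* Model of the outer loop, assuming each vro_cpyfm() call costs a fixed
   whole number of timer units (an assumption made for illustration). */
long calibrate_repeat_count(long ticks_per_call)
{
    long nb = 1L;
    long elapsed = 0L;

    assert(ticks_per_call >= 1L);
    while (elapsed < 200L) {
        elapsed = nb * ticks_per_call;   /* stands in for apres - avant */
        nb = next_repeat_count(nb, elapsed);
    }
    return nb;
}
```

With a cost of one unit per call, the model runs 1 call, then 15, then 225, and stops there because the last run exceeded 200 units.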
| |
| | Markus (mfro)
Posts 99 16 Dec 2017 10:26
| Olivier Landemarre wrote:
| vro_cpyfm(_var_sys->vdihandle,3,xy,&ecran,&ecran); |
Mode 3 above is "source only", so no logic op involved. If Vincent's VDI driver really maintains the FastRAM screen buffer (as assumed above), it receives the full penalty for double buffer writes here.
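As a sketch of why mode 3 involves no logic op: to the best of my knowledge the 16 VDI raster modes (0 to 15) act as a truth table over source and destination bits, with 3 = source only, 6 = S XOR D, 7 = S OR D and so on. Only the meaning of mode 3 is confirmed in this thread; the rest of the numbering and both function names below are assumptions for illustration.

```c
#include <stdint.h>

/* Apply a raster mode bitwise to one 16-bit word.
   Assumed encoding: bit 0 selects (S AND D), bit 1 (S AND NOT D),
   bit 2 (NOT S AND D), bit 3 (NOT S AND NOT D). */
uint16_t raster_op(int mode, uint16_t s, uint16_t d)
{
    uint16_t ns = (uint16_t)~s;
    uint16_t nd = (uint16_t)~d;
    uint16_t r = 0;

    if (mode & 1) r |= (uint16_t)(s & d);
    if (mode & 2) r |= (uint16_t)(s & nd);
    if (mode & 4) r |= (uint16_t)(ns & d);
    if (mode & 8) r |= (uint16_t)(ns & nd);
    return r;
}

/* Non-zero if the result depends on the old destination value, i.e. the
   blit is a read-modify-write. For modes 0, 3, 12 and 15 it is not. */
int mode_reads_destination(int mode)
{
    return ((mode ^ (mode >> 1)) & 5) != 0;
}
```

A driver could use a check like `mode_reads_destination()` to skip the shadow buffer entirely for pure copies such as the Kronos test.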
| |
| | Olivier Landemarre
Posts 147 16 Dec 2017 11:27
| Markus (mfro) wrote:
|
Olivier Landemarre wrote:
| vro_cpyfm(_var_sys->vdihandle,3,xy,&ecran,&ecran); |
Mode 3 above is "source only", so no logic op involved. If Vincent's VDI driver really maintains the FastRAM screen buffer (as assumed above), it receives the full penalty for double buffer writes here.
|
Yes, possible, so it should be easy to make it far faster in this case!
| |
| | Vincent Rivière
Posts 87 16 Dec 2017 13:57
| Gunnar von Boehn wrote:
| Vincent, can you tell us what the code is that KRONOS triggers when copying the memory to the GFX card?
|
Kronos is closed source, I have no idea what it does. Only Olivier can tell.
Markus (mfro) wrote:
| I'm not sure if I have the latest sources, but apparently fVDI holds an offscreen buffer in FastRAM. Not sure if it's needed here and if it really ended up in the driver, but if it did, it would explain the low speed: memory appears to be actually copied twice - once to/from screen, once to/from the offscreen buffer. Maybe Vincent is able to tell if this is the case or not.
|
The fVDI sources I used are there: EXTERNAL LINK
AFAIK they are the latest official fVDI sources, plus a few bugfixes from me to keep it from crashing completely, plus my SAGA driver.
My SAGA driver is more or less just a copy/paste of the original Falcon 16-bit driver provided by fVDI. I mainly changed the video mode initialization to switch to the SAGA screen, not much more. The actual drawing routines are the ones provided by the Falcon 16-bit sample, implemented in C. Do not expect any performance there; this was just an (incomplete) example of a C driver provided with fVDI. My SAGA driver for fVDI was mainly a proof of concept. Once again, do not expect any performance.
Generally speaking, fVDI is a big mess. I can just tell that it calls the graphics primitives from the drivers, and fills the gaps to provide the proper VDI interface. I don't know much more. I did indeed see that offscreen buffer stuff, but IIRC I disabled it in the SAGA driver.
And a final note about EmuTOS: in the Vampire-optimized binaries (floppy and ROM), I put everything possible into FastRAM. Chip RAM is only used when absolutely necessary. About FastRAM usage in FreeMiNT / XaAES / fVDI / Kronos, I have no idea.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 16 Dec 2017 15:01
| Cool, reading the sources! There are of course very many functions in them. Maybe someone with knowledge of the ATARI OS can shed some light on which functions are most important / most used. For example, what function is used if a WINDOW is moved? I assume if you open a folder on screen, the folder/window is first cleared/filled and then the icons are printed in it? If this is the case, then maybe looking first at the Rect_fill code makes sense? I'm not sure which routine is the final code called for filling a rect. Is this the final work loop? vdi/fvdi/drivers/16_bit/16b_fill.c
| |
| | Olivier Landemarre
Posts 147 16 Dec 2017 17:19
| Gunnar von Boehn wrote:
| Cool, reading the sources! There are of course very many functions in them. Maybe someone with knowledge of the ATARI OS can shed some light on which functions are most important / most used. For example, what function is used if a WINDOW is moved? I assume if you open a folder on screen, the folder/window is first cleared/filled and then the icons are printed in it? If this is the case, then maybe looking first at the Rect_fill code makes sense? I'm not sure which routine is the final code called for filling a rect. Is this the final work loop? vdi/fvdi/drivers/16_bit/16b_fill.c
|
I think I can easily answer this. When windows move, generally the most important thing for speed is the block copy from screen to screen or screen <-> memory (I speak a bit for my own AES; I think XaAES is a bit better optimized). If some area of the screen was not displayed before, rectangle filling is probably one of the most used functions. As I said in a previous message, the most important functions for an AES system are rectangle filling, lines, text and block copy; the other functions are almost never used.
| |
| | Markus (mfro)
Posts 99 16 Dec 2017 17:21
| Vincent Rivière wrote:
| The fVDI sources I used are there: EXTERNAL LINK AFAIK they are the latest official fVDI sources, plus a few bugfixes from me to keep it from crashing completely, plus my SAGA driver.
|
Yes, that's what I've been looking at as well. I assume the SAGA driver uses the 16_bit/16b_blit.c software blit routines? That file still has #define FAST and #define BOTH set, which activates the FastRAM buffer (and thus causes double writes), if I'm not mistaken.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 16 Dec 2017 17:30
| Markus (mfro) wrote:
| I assume the SAGA driver uses the 16_bit/16b_blit.c software blit routines?
|
Maybe we can also look at the created ASM of those?
| |
| | Markus (mfro)
Posts 99 16 Dec 2017 18:04
| Gunnar von Boehn wrote:
| Markus (mfro) wrote:
| I assume the SAGA driver uses the 16_bit/16b_blit.c software blit routines? |
Maybe we can also look at the created ASM of those? |
That's not worth it; it's extremely simple:
    for(i = h - 1; i >= 0; i--) {
        for(j = w - 1; j >= 0; j--) {
            v = *src_addr++;
    #ifdef BOTH
            *(volatile PIXEL *)dst_addr_fast++ = v, 0;
    #endif
            *dst_addr++ = v;
        }
        src_addr += src_line_add;
        dst_addr += dst_line_add;
    }
PIXEL is a typedef for short.
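For comparison, here is a minimal sketch of the same rectangle copy done row by row with `memmove()`, which leaves it to the C library to use wider moves than one 16-bit pixel at a time. This is not fVDI code: the function name and the pitch parameters (row strides counted in pixels) are hypothetical, and there is no FastRAM shadow write. It only handles overlap within a row; a blit whose rows overlap vertically would also need to choose the row order.

```c
#include <stddef.h>
#include <string.h>

typedef short PIXEL;   /* as stated above: PIXEL is a typedef for short */

/* Copy a w x h pixel rectangle. src_pitch and dst_pitch are the distances
   between the starts of consecutive rows, in pixels. */
void blit_copy_rows(const PIXEL *src, size_t src_pitch,
                    PIXEL *dst, size_t dst_pitch,
                    size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++) {
        /* memmove tolerates source and destination overlapping in a row */
        memmove(dst, src, w * sizeof(PIXEL));
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

Whether this is actually faster on a given machine depends on the library's `memmove()` and on the memory involved, so it would need measuring, for example with Kronos.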
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6258 16 Dec 2017 19:03
| Markus (mfro) wrote:
| That's not worth it; it's extremely simple:
|
It could still be interesting to see what ASM is created.
| |
| | Markus (mfro)
Posts 99 16 Dec 2017 19:20
| Gunnar von Boehn wrote:
|
Markus (mfro) wrote:
| That's not worth it; it's extremely simple: |
It could still be interesting to see what ASM is created.
|
000001e0 <_s_blit_copy>:
 1e0: 48e7 3f3c       moveml %d2-%d7/%a2-%a5,%sp@-
 1e4: 242f 002c       movel %sp@(44),%d2
 1e8: 262f 0032       movel %sp@(50),%d3
 1ec: 282f 0036       movel %sp@(54),%d4
 1f0: 386f 003c       moveaw %sp@(60),%a4
 1f4: 3a2f 003e       movew %sp@(62),%d5
 1f8: 5345            subqw #1,%d5
 1fa: 6b50            bmis 24c <.L5>
 1fc: 366f 0030       moveaw %sp@(48),%a3
 200: 200b            movel %a3,%d0
 202: d080            addl %d0,%d0
 204: 2640            moveal %d0,%a3
 206: 3e2f 003a       movew %sp@(58),%d7
 20a: 48c7            extl %d7
 20c: de87            addl %d7,%d7
 20e: 3c0c            movew %a4,%d6
 210: 5346            subqw #1,%d6
 212: 0286 0000 ffff  andil #65535,%d6
 218: 2006            movel %d6,%d0
 21a: d080            addl %d0,%d0
 21c: 2a40            moveal %d0,%a5
 21e: 548d            addql #2,%a5
 220: 5286            addql #1,%d6
 222: dc86            addl %d6,%d6
00000224 <.L9>:
 224: 300c            movew %a4,%d0
 226: 6f1a            bles 242 <.L7>
 228: 2202            movel %d2,%d1
 22a: d28d            addl %a5,%d1
 22c: 2444            moveal %d4,%a2
 22e: 2243            moveal %d3,%a1
 230: 2042            moveal %d2,%a0
00000232 <.L8>:
 232: 3018            movew %a0@+,%d0
 234: 34c0            movew %d0,%a2@+
 236: 32c0            movew %d0,%a1@+
 238: b288            cmpl %a0,%d1
 23a: 66f6            bnes 232 <.L8>
 23c: d486            addl %d6,%d2
 23e: d886            addl %d6,%d4
 240: d686            addl %d6,%d3
00000242 <.L7>:
 242: d48b            addl %a3,%d2
 244: d687            addl %d7,%d3
 246: d887            addl %d7,%d4
 248: 51cd ffda       dbf %d5,224 <.L9>
0000024c <.L5>:
 24c: 4cdf 3cfc       moveml %sp@+,%d2-%d7/%a2-%a5
 250: 4e75            rts
| |
|
|
|