Information about the Apollo CPU and FPU. |
|
---|
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 14 Jan 2020 11:14
| Kyle Blake wrote:
| Insistence on C on Amiga in general is because the Amiga is a real computer platform, with a real OS. .. ASM code practices do not scale or adapt.
|
Actually there is no insistence on C on AMIGA. The official AMIGA HW coding books are full of ASM examples. The AMIGA OS is actually designed to be used from ASM: parameters are passed to OS functions in registers, not on the stack. This means that to use the AMIGA OS from C you need special C compiler patches.
Fact: Major parts of the AMIGA OS are written in ASM, simply because ASM is much faster.
Fact: Important graphics routines of the OS and the RTG drivers are written in ASM on AMIGA, for the same reason.
Fact: Amiga video players like RIVA are written in 100% ASM for the same reason.
In high-FPS games (Doom etc.) the render code is always in ASM, again for the same reason. The Vampire Diablo port has the render function written in ASM. This more than doubled the FPS. Rewriting the render code of the NEOGEO emulator from C to ASM improved the speed by a factor of 5.
Writing non-time-critical code in C is fully OK. For time-critical routines like render / GFX code, ASM is always the best choice. On AMIGA a huge number of programs and tools are written in pure ASM.
The most important part is of course the ALGORITHM. Tuning a screen copy in ASM makes no sense if you can avoid the copy altogether by doing a simple PTR swap. Likewise, writing a BubbleSort in ASM is the wrong approach if MergeSort or QuickSort would be the better algorithm to use.
If you write a text editor, then either C or ASM can be used. Pascal, Modula or Oberon2 would also be good choices. But if your goal is to write a fast 3D FPS game, then ASM is really the best option!
| |
| | A1200 Coder
Posts 74 14 Jan 2020 14:18
| I also agree that coding in asm is the best thing you can do with the Amiga. You can also make more complex applications in asm. I once made an ANSI/VT100 terminal client without using any OS calls on the A500. I just cut the OS off, stole the keyboard routine from some game, and used sprites for the cursor in the terminal.
| |
| | Kamelito Loveless
Posts 261 14 Jan 2020 17:12
| Does it mean that AROS critical code that needs speed will be rewritten in ASM? AmigaOS Exec is pure ASM, for instance.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 14 Jan 2020 17:15
| Kamelito Loveless wrote:
| Does it mean that AROS critical code that needs speed will be rewritten in ASM? AmigaOS Exec is pure ASM, for instance. |
I know that AROS got 68k ASM tuning for IDE access the other day. So yes there are improvements done like this today. But this does not mean that AROS is rewritten in ASM.
| |
| | Vladimir Repcak
Posts 359 14 Jan 2020 23:26
| I spent some time cleaning up the source code, removing all bitplane/copper variables (plus include file), resulting in a single file. Also, I reduced the number of required includes to just 4 from over a dozen. Hopefully, somebody will find this useful in the future to start their coding for Vampire. This sample renders a colorful 256x240 rectangle.

.68000
; The Base NDK 3.9 doesn't have all files
; - cybergraphics_lib.i: taken from ChunkyStartup2 on Aminet
; - intuition_lib.i: taken from amiga-sdk on github/deplinenoise
; vasm -m68020 -Fhunkexe -I c:\Vampire\samples\gt\build\include_i -o gt.exe gt.asm
    include intuition/screens.i ; NDK 3.9
    include intuition_lib.i ; EXTERNAL LINK
    include cybergraphics.i ; NDK 3.9
    include cybergraphics_lib.i ; EXTERNAL LINK

ResX equ 320
ResY equ 240
BitDepth equ 32
PixelBytes equ 4

START:
; Open graphics.library
    move.l 4,a6
    move.l #graphicsLibName,a1
    jsr -552(a6) ; Open library
    move.l d0,graphicsBase
; Open CyberGraphX
    move.l 4,a6
    move.l #CyberGraphXLibName,a1 ; Name
    move.l #41,d0 ; Version
    jsr -552(a6) ; Open library
    move.l d0,CyberGraphXBase ; Store Ptr
; Open Intuition
    move.l 4,a6
    move.l #IntuitionLibName,a1 ; Name
    move.l #39,d0 ; Version
    jsr -552(a6) ; Open library
    move.l d0,IntuitionBase2 ; Store Ptr
; Open Dos
    move.l 4,a6
    move.l #DosLibName,a1 ; Name
    move.l #39,d0 ; Version
    jsr -552(a6) ; Open library
    move.l d0,DosBase ; Store Ptr
; Requester
    move.l CyberGraphXBase,a6
    suba.l a0,a0
    lea.l requestertags,a1
    jsr _LVOCModeRequestTagList(a6)
    move.l d0,mode_insert+4 ; store into screen taglist
; Open Screen
    suba.l a0,a0
    lea.l screentags,a1
    move.l IntuitionBase2,a6
    jsr _LVOOpenScreenTagList(a6) ; open the screen
    move.l d0,screen
    add.l #sc_RastPort,d0
    move.l d0,rastport ; save rport address
; Open Dummy Window
    suba.l a0,a0
    lea.l windowtags,a1
    move.l IntuitionBase2,a6
    jsr _LVOOpenWindowTagList(a6) ; open a dummy window...
    move.l d0,window

; Render 256*240 = 61,440 colors out of 16.7M
    lea FrameBuffer,a0
    clr.l d2 ; d2: ARGB Color
    move.l #(ResY-1),d0 ; d0: YPOS loop
.scrOuter:
    move.l d2,d3 ; Store
    move.l #256-1,d1 ; d1: XPOS loop
.scrInner:
    move.l d2,(a0)+
    add.l #1,d2 ; Update Blue
    add.l #256*2,d2 ; Update Green
    dbra d1,.scrInner
    move.l d3,d2 ; Restore
    add.l #256*256,d2 ; Update Red
; Skip remaining (320-256) pixels
    add.l #(ResX-256)*PixelBytes,a0
    dbra d0,.scrOuter
    jsr UpdateBuffer

ExitWait:
; "Wait" Loop without introducing Timer OS calls into the mix (utterly useless for the purpose of this code sample)
SecondsToWait equ 5
;   move.l #7000,d2 ; 7000 = 10 seconds (on my computer)
    move.l #SecondsToWait*700,d2
.FunnyWaitLoopOuter:
    move.l #$FFFF,d3
    move.l #$FFFFFFFF,d1
.FunnyWaitLoopInner:
    divu #2,d1
    dbra d3,.FunnyWaitLoopInner
    dbra d2,.FunnyWaitLoopOuter

exit:
    move.l window,a0
    move.l IntuitionBase2,a6
    jsr _LVOCloseWindow(a6)
    move.l screen,a0
    move.l IntuitionBase2,a6
    jsr _LVOCloseScreen(a6)
    move.l graphicsBase,a6
    jsr -270(a6) ; WaitTOF
    move.l $4,a6
    jmp -126(a6) ; Enable
; May need more close calls (CGX/DOS ?)

UpdateBuffer:
    movem.l d0-d7,-(sp)
    lea.l FrameBuffer,a0
    move.l rastport,a1
    clr.w d0
    clr.w d1
    move.w #PixelBytes*ResX,d2 ; bytes per line in source
    clr.w d3
    clr.w d4
    move.w #ResX,d5
    move.w #ResY,d6
    move.w #RECTFMT_ARGB,d7
    move.l CyberGraphXBase,a6
    jsr _LVOWritePixelArray(a6)
    movem.l (sp)+,d0-d7
    rts

; OS Libs (Pointer + name)
graphicsBase: dc.l 0
graphicsLibName DC.B 'graphics.library',0
DosBase: dc.l 0
DosLibName dc.b 'dos.library',0
IntuitionBase2: dc.l 0
IntuitionLibName dc.b 'intuition.library',0
CyberGraphXBase: dc.l 0
CyberGraphXLibName dc.b 'cybergraphics.library',0
reqtitle dc.b "Pick a screenmode",0
    even
requestertags
    dc.l CYBRMREQ_WinTitle,reqtitle
    dc.l CYBRMREQ_MinWidth,ResX
    dc.l CYBRMREQ_MaxWidth,ResX
    dc.l CYBRMREQ_MinHeight,ResY
    dc.l CYBRMREQ_MaxHeight,ResY
    dc.l CYBRMREQ_MinDepth,8
    dc.l CYBRMREQ_MaxDepth,32
    dc.l 0,0
rastport dc.l 0
screentags
    dc.l SA_Left,0
    dc.l SA_Top,0
    dc.l SA_Width,ResX
    dc.l SA_Height,ResY
    dc.l SA_Depth
mode_depth dc.l BitDepth
    dc.l SA_Type,CUSTOMSCREEN
mode_insert dc.l SA_DisplayID,0
    dc.l SA_Draggable,0
    dc.l SA_Exclusive,1
    dc.l 0,0
window dc.l 0
windowtags
    dc.l WA_Left,0
    dc.l WA_Top,0
    dc.l WA_Width,20
    dc.l WA_Height,20
    dc.l WA_CustomScreen
screen dc.l 0
    dc.l WA_Borderless,1
    dc.l WA_BackFill,LAYERS_NOBACKFILL
    dc.l WA_Activate,1
    dc.l 0,0

    Section Chunky,BSS_F
FrameBuffer ds.b ResX*ResY*PixelBytes |
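For readers who want to check what the `.scrOuter`/`.scrInner` loops put into the frame buffer, here is a minimal C model of the same arithmetic. It is a sketch for reasoning about the pixel values only, not a replacement for the listing; the function name `fill_gradient` and the constants are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { RES_X = 320, RES_Y = 240, FILL_W = 256 };

/* Hypothetical C model of the .scrOuter/.scrInner loops:
 * each pixel adds 1 (blue) and 256*2 (green) to the colour,
 * each row restarts from the row colour and adds 256*256 (red).
 * Only the first FILL_W pixels of every RES_X-wide row are written. */
void fill_gradient(uint32_t *fb)
{
    uint32_t row_color = 0;
    for (int y = 0; y < RES_Y; y++) {
        uint32_t color = row_color;
        uint32_t *p = fb + (size_t)y * RES_X;
        for (int x = 0; x < FILL_W; x++) {
            *p++ = color;
            color += 1;        /* Update Blue  */
            color += 256 * 2;  /* Update Green */
        }
        row_color += 256 * 256; /* Update Red */
    }
}
```

In this model the pixel at (x, y) holds `y*65536 + x*513`, and the 64 pixels to the right of column 255 are never touched.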
| |
| | Vladimir Repcak
Posts 359 15 Jan 2020 00:04
| Gunnar von Boehn wrote:
|
Vladimir Repcak wrote:
| Yes, it's going to cost some bandwidth - 320x240x2 = 150 KB per frame, which at 60 fps makes ~9 MB/s. |
9 MB read + 9 MB write = 18 MB/s of memory traffic. At a higher resolution such as 640x360 this becomes 55 MB/s. That is a lot of time spent for nothing. The bigger problem with the copy is that it looks like shit: you copy to the screen while it is being displayed, so you see the copy/redraw. This looks bad. A much cleaner solution is having 2 buffers, rendering into the 2nd while the 1st is displayed, and then swapping the PTR. This PTR swap is free and looks a lot better. You only need to sync the SWAP time. The very best solution is having 3 buffers: the 1st is displayed, the 2nd is rendered, and the 3rd is also rendered. Using 3 buffers you can unsync the render loop from the display time. This means your render routine does not need to wait for the display. For the final product I can highly recommend that you use triple buffering and only do a PTR update. On Vampire/SAGA the PTR SWAP is auto-synced with the screen display. This means all you need to do is 1 MOVE.L and the HW does the rest for you.
|
I presume the copying shouldn't be visible once double/triple buffering is implemented. But yeah, especially at higher resolutions, this would mean avoidable performance losses. I just wanted to share the working cleaned-up code before I go and implement it.
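The triple-buffer scheme described in the quote can be sketched in C as pure index bookkeeping. This is only an illustration of the idea: the names `TripleBuffer`, `tb_frame_done` and `tb_vblank` are hypothetical, and nothing here touches SAGA registers, where, per the quote, the pointer swap is synced by the hardware.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for three frame buffers. */
typedef struct {
    uint32_t *buf[3];
    int shown;  /* buffer currently displayed                   */
    int ready;  /* finished frame waiting for display, -1: none */
    int draw;   /* buffer the renderer is writing into          */
} TripleBuffer;

void tb_init(TripleBuffer *tb, uint32_t *a, uint32_t *b, uint32_t *c)
{
    tb->buf[0] = a;
    tb->buf[1] = b;
    tb->buf[2] = c;
    tb->shown = 0;
    tb->ready = -1;
    tb->draw = 1;
}

/* Renderer finished a frame: publish it and continue in a free buffer
 * without waiting for the display. */
void tb_frame_done(TripleBuffer *tb)
{
    int old_ready = tb->ready;
    tb->ready = tb->draw;
    if (old_ready >= 0)
        tb->draw = old_ready;                 /* reuse the frame that was never shown */
    else
        tb->draw = 3 - tb->shown - tb->ready; /* the one remaining index */
}

/* Called once per display frame: swap the pointer if a new frame is waiting.
 * Returns the buffer that should be displayed. */
uint32_t *tb_vblank(TripleBuffer *tb)
{
    if (tb->ready >= 0) {
        tb->shown = tb->ready;
        tb->ready = -1;
    }
    return tb->buf[tb->shown];
}
```

The three indices always stay distinct, so the renderer never writes into the buffer being displayed and no pixel data is copied; only a pointer changes per frame.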
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 15 Jan 2020 06:46
| You have made a nice start. You can also test user input, e.g. testing the left mouse button:
Wait:
    btst #6,$bfe001 ; test left mouse button
    bne Wait ; if not pressed, go back to Wait
Testing the right mouse button is also very easy:
    btst #2,$dff016 ; Right mouse button
    beq RMB_pressed
Joystick input can be tested just as easily with simple ASM. Maybe you can upgrade your demo so that it shows a rotating 3D object?
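The two `btst` tests are active-low bit checks. A small C model makes the bit positions explicit. It works on values passed in rather than reading the hardware addresses, and the function names are made up for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte at $BFE001 (CIA-A port A): bit 6 is the left mouse button,
 * and the bit reads 0 while the button is held down. */
bool lmb_pressed(uint8_t ciaa_pra)
{
    return (ciaa_pra & (1u << 6)) == 0;
}

/* Word at $DFF016: the right mouse button is bit 10, also active low. */
bool rmb_pressed(uint16_t potinp)
{
    return (potinp & (1u << 10)) == 0;
}

/* btst on a memory operand is a byte operation, so "btst #2,$dff016"
 * looks at bit 2 of the high byte, which is bit 10 of the word. */
bool rmb_pressed_high_byte(uint8_t high_byte)
{
    return (high_byte & (1u << 2)) == 0;
}
```

This is why the left-button loop uses `bne` to keep waiting (bit set means not pressed) while the right-button test uses `beq` to branch when the button is down.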
| |
| | Kamelito Loveless
Posts 261 15 Jan 2020 07:01
| Nice, but yes, you should free all resources you allocate. There is a jump to Enable() while there is no Disable(). Disable()/Enable() should not be used under AmigaOS, at least not for a long period.
| |
| | Nixus Minimax
Posts 416 15 Jan 2020 08:53
| Vlad, you open graphics.library without giving a version, so it will fail if there is some junk left in d0 from a previous task. For all I know the OS will not hand you empty registers. Furthermore, you don't check whether the OpenLibrary() call succeeds. There is a reason the pointer to the library base is returned in a data register, not an address register. If the pointer is ==0, you need to do some error handling. I also believe your code should have some "even" directives between the strings and the dc.l for the pointers. I haven't counted all the string lengths, but "graphics.library",0 has an odd length. I'm not sure storing the screen pointer within the taglist for opening the window is a good idea, because I don't know whether the OS will preserve the arguments you pass to OpenWindowTagList(), and you will need the screen pointer for closing the screen upon exit. Because yes, AmigaOS does not have any resource tracking, which is why you must free all allocated resources, close all libraries and so on. Another thing is that you might want to use the exec includes and then use "_LVOOpenLibrary" instead of "-552" and so on. But you are clearly getting there. If you get the timer.device into the mix and calculate a nice FPS counter, you will soon see that Disable()/Enable() doesn't make that much of a difference (unless the user is running a raytracing job in the background, which would clearly be user error...).
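The point about odd string lengths can be checked with a few lines of C. This is a sketch of the byte counting only; the helper names are hypothetical, and whether a given assembler pads automatically before `dc.l` depends on the assembler and its options, which is not assumed here.

```c
#include <stddef.h>

/* Size in bytes of `dc.b 'text',0`: the characters plus the terminating zero. */
#define DCB_STRING_SIZE(text) (sizeof(text))

/* Round an offset up to the next even (word-aligned) value,
 * which is what an "even" directive does. */
size_t align_even(size_t offset)
{
    return (offset + 1u) & ~(size_t)1u;
}

/* Offset of whatever follows a 4-byte pointer plus a string that start at `start`. */
size_t offset_after_entry(size_t start, size_t string_size)
{
    return start + 4u + string_size;
}
```

With `graphicsBase` at an even offset, the 17-byte name pushes the following `dc.l` to an odd offset unless it is padded by one byte.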
| |
| | Vladimir Repcak
Posts 359 15 Jan 2020 18:26
| Kamelito Loveless wrote:
| Nice, but yes, you should free all resources you allocate. There is a jump to Enable() while there is no Disable(). Disable()/Enable() should not be used under AmigaOS, at least not for a long period. |
Thanks. I forgot which sample I took those from, but I have commented them out now. I checked the OS docs and found that they explicitly say each OpenLibrary must have a matching CloseLibrary, so that's what I did:

; Exit
exit:
; Close window+screen
    move.l window,a0
    move.l IntuitionBase2,a6
    jsr _LVOCloseWindow(a6)
    move.l screen,a0
    move.l IntuitionBase2,a6
    jsr _LVOCloseScreen(a6)
; Close dos.library
    move.l 4,a6
    move.l DosBase,a1
    jsr _LVOCloseLibrary(a6) ; Close library
; Close intuition.library
    move.l 4,a6
    move.l IntuitionBase2,a1
    jsr _LVOCloseLibrary(a6) ; Close library
; Close cybergraphics.library
    move.l 4,a6
    move.l CyberGraphXBase,a1
    jsr _LVOCloseLibrary(a6) ; Close library
; Close graphics.library
    move.l 4,a6
    move.l graphicsBase,a1
    jsr _LVOCloseLibrary(a6) ; Close library
; commented out
;   move.l graphicsBase,a6
;   jsr -270(a6) ; WaitTOF
;   move.l $4,a6
;   jmp -126(a6) ; Enable |
| |
| | Vladimir Repcak
Posts 359 15 Jan 2020 18:32
| Nixus Minimax wrote:
| Vlad, you open graphics.library without giving a version, so it will fail if there is some junk left in d0 from a previous task. For all I know the OS will not hand you empty registers. Furthermore, you don't check whether the OpenLibrary() call succeeds. There is a reason the pointer to the library base is returned in a data register, not an address register. If the pointer is ==0, you need to do some error handling. I also believe your code should have some "even" directives between the strings and the dc.l for the pointers. I haven't counted all the string lengths, but "graphics.library",0 has an odd length. I'm not sure storing the screen pointer within the taglist for opening the window is a good idea, because I don't know whether the OS will preserve the arguments you pass to OpenWindowTagList(), and you will need the screen pointer for closing the screen upon exit. Because yes, AmigaOS does not have any resource tracking, which is why you must free all allocated resources, close all libraries and so on. Another thing is that you might want to use the exec includes and then use "_LVOOpenLibrary" instead of "-552" and so on. But you are clearly getting there. If you get the timer.device into the mix and calculate a nice FPS counter, you will soon see that Disable()/Enable() doesn't make that much of a difference (unless the user is running a raytracing job in the background, which would clearly be user error...).
|
Nice catch on the missing library version. Looks like I cleaned the code up too much :- ))))) I included the exec_lib from github/deplinenoise and used _LVOOpenLibrary. Error handling will have to wait for now (but will be implemented later for sure); I think I've had enough OS stuff in the last 2 weeks and should start porting the engine :) Here's the current initialization section:

    include intuition_lib.i ; EXTERNAL LINK
    include exec_lib.i ; EXTERNAL LINK
    include exec/execbase.i
    include intuition/screens.i ; NDK 3.9
    include cybergraphics.i ; NDK 3.9
    include cybergraphics_lib.i ; EXTERNAL LINK

START:
; Open graphics.library
    move.l 4,a6
    move.l #graphicsLibName,a1
    move.l #39,d0 ; Version
    jsr _LVOOpenLibrary(a6) ; Open library
    move.l d0,graphicsBase
; Open CyberGraphX
    move.l 4,a6
    move.l #CyberGraphXLibName,a1 ; Name
    move.l #41,d0 ; Version
    jsr _LVOOpenLibrary(a6) ; Open library
    move.l d0,CyberGraphXBase ; Store Ptr
; Open Intuition
    move.l 4,a6
    move.l #IntuitionLibName,a1 ; Name
    move.l #39,d0 ; Version
    jsr _LVOOpenLibrary(a6) ; Open library
    move.l d0,IntuitionBase2 ; Store Ptr
; Open Dos
    move.l 4,a6
    move.l #DosLibName,a1 ; Name
    move.l #39,d0 ; Version
    jsr _LVOOpenLibrary(a6) ; Open library
    move.l d0,DosBase ; Store Ptr
; Requester
    move.l CyberGraphXBase,a6
    suba.l a0,a0
    lea.l requestertags,a1
    jsr _LVOCModeRequestTagList(a6)
    move.l d0,mode_insert+4 ; store into screen taglist
; Open Screen
    suba.l a0,a0
    lea.l screentags,a1
    move.l IntuitionBase2,a6
    jsr _LVOOpenScreenTagList(a6) ; open the screen
    move.l d0,screen
    add.l #sc_RastPort,d0
    move.l d0,rastport ; save rport address
; Open Dummy Window
    suba.l a0,a0
    lea.l windowtags,a1
    move.l IntuitionBase2,a6
    jsr _LVOOpenWindowTagList(a6) ; open a dummy window...
    move.l d0,window |
| |
| | Kamelito Loveless
Posts 261 15 Jan 2020 20:00
| I was able to assemble it; it works fine at 320x240 in 16 and 24 bits, but I get an empty screen in 8-bit. I was about to tell you about exec_lib.i but see that you resolved it. There's a problem with the release of the allocated resources, as after 3/4 launches I got "not enough memory available". Update: after more tests I can't reproduce the memory problem; maybe it is due to the fact that when launched under WinUAE you can choose 8/16/24 bit, while your code is aimed at 24 bits. No Enforcer hits and no memory loss when I choose 24 bits.
| |
| | Vladimir Repcak
Posts 359 16 Jan 2020 01:52
| Kamelito Loveless wrote:
| I was able to assemble it; it works fine at 320x240 in 16 and 24 bits, but I get an empty screen in 8-bit. |
Correct, the 8-bit mode isn't supposed to work; I just couldn't easily get rid of it. Only later did I figure out that if I set MinDepth in the requester tags to 24 bits, the 8-bit and 16-bit modes disappear from the dialog, leaving the user with only the 24-bit resolutions to choose from. Kamelito Loveless wrote:
| There's a problem with the release of the allocated resources, as after 3/4 launches I got "not enough memory available". | Interesting. I can usually run it 50 times or more and haven't encountered that message. Perhaps my config in UAE is to blame (I guess I gave it too much RAM). Are you using the first or the second version? The first one didn't do any clean-up, only the second one does. But you'd have to replace the exit portion manually, as I didn't upload the full code (I figured only the differences would be enough). I am running avail from the command line - it gives a chip/fast available/in-use breakdown. It appears this leaks 432 bytes per run. I guess there are some other things to release besides closing the window/screen and all libraries? EDIT: Also, the first version I pasted here didn't have an rts at the end :) So, if that's what you ran, then it surely leaked a lot :)
| |
| | Vladimir Repcak
Posts 359 21 Jan 2020 05:35
| Screenshot from emulator: EXTERNAL LINK
Trying to use the img tag to embed the file, not sure if it will work for external paths: [img=https://pasteboard.co/IQWDYwV.png]
Took some time to implement the workarounds for the vasm idiosyncrasies (quite different from my previous assembler), but now all Higgs language features are finally compiling under vasm.
I spent a couple of days implementing the Radiosity lightmapper, as there is no better way to test the 24-bit color space:
- full 24-bit precision
- lights are smoothly merged
- currently there are 2 lights in the scene, but there's no actual limit
- the lightmaps are generated at run-time
- any light can have any RGB color (16.7 Mil)
Texturing is implemented using generic 68000 code:
- inner loops run entirely in registers
- code is fully integer - there is no floating point
- inner scanline loops touch RAM only for the texel read and the pixel write
- scene is currently axis-aligned - so while the 3D mesh can be relatively generic, it is best to keep quads axis-aligned
This code could be relatively easily extended into that Star Wars Tunnel demo scene we talked about earlier.
| |
| | Nixus Minimax
Posts 416 21 Jan 2020 07:54
| Vladimir Repcak wrote:
| Screenshot from emulator: https://pasteboard.co/IQWDYwV.png |
This looks pretty nice! - code is fully integer - there is no floating point
|
That's a pity because the 080 has such an amazingly fast FPU. I guess you will be able to make good use of the FPU for point projection and transformation stuff.
| |
| | Vladimir Repcak
Posts 359 21 Jan 2020 09:52
| Nixus Minimax wrote:
| This looks pretty nice! | Thanks. That's Radiosity. It accounts not only for direct lighting but also for indirect lighting (bounced off walls). I believe this particular scene has a 95% energy threshold - meaning the energy keeps bouncing until 95% of it has been redistributed. Then I store the resulting form factors, so I only need to do a single linear multiply+add pass (at runtime) over the texture to get the final result, yet it's still possible to change the light color, intensity and overall scene brightness. Each additional light, because it's a 24-bit color space, is merely added together, which is very fast (should be one AMMX op, really). The greatest use of this would, obviously, be an FPS shooter - imagine classic Wolfenstein with such colored lighting ;) Another use case could be a top-down 3D RPG, like Dungeon Siege or Torchlight or Diablo3. Especially Diablo3 could use the 24-bit color space for its height-based fog (alongside proper wall lighting)... Nixus Minimax wrote:
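As a rough illustration of "each additional light is merely added together", here is a per-channel add of two ARGB lightmap texels in C. The clamping at 255 is an assumption of this sketch (the post does not say how overflow is handled), and the function names are hypothetical.

```c
#include <stdint.h>

/* Add two 8-bit channel values, clamping at 255 (assumed behaviour). */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > 255u ? 255u : (uint8_t)sum;
}

/* Combine the contributions of two lights for one ARGB texel,
 * treating the four bytes as independent channels. */
uint32_t add_light_argb(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t ca = (uint8_t)(a >> shift);
        uint8_t cb = (uint8_t)(b >> shift);
        out |= (uint32_t)sat_add_u8(ca, cb) << shift;
    }
    return out;
}
```

A SIMD instruction that adds packed bytes would do the same work for all channels at once, which is the appeal of AMMX mentioned in the post.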
| - code is fully integer - there is no floating point |
That's a pity because the 080 has such an amazingly fast FPU. I guess you will be able to make good use of the FPU for point projection and transformation stuff. |
Yeah, this is my first code that runs on the emulator - I spent about 3 days updating my Higgs compiler (for the vasm differences) and then about 3 days writing this, so not too bad. AMMX would be really good to use here - for the RGBA processing in a single instruction. The computation of lightmaps will certainly be greatly accelerated. I suspect I should be able to use the emulator for the floating-point instructions, right? Meaning - at least 68040 FP instructions? I never really used FP in Asm, as Jaguar's interpretation of FP is suboptimal at best, hence I always rewrote each algorithm in two versions - fixed-point and then integer.
| |
| | Vladimir Repcak
Posts 359 21 Jan 2020 10:00
| Here's an example of the inner loop for the horizontal scanlines:
Higgs:
loop (lpMain = xlVisible)
{
    idxPixel = idxCurrent >> BitShiftR
    idxPixel <<= #2
    texPtr = texPtrStart + idxPixel
    (vidPtr)+ = (texPtr)
    idxCurrent += xpAdd
} |
ASM output:
loop_10_start:
    move.l d3,d4
    lsr.l d5,d4
    lsl.l #2,d4
    move.l a3,a2
    add.l d4,a2
    move.l (a2),(a0)+
    add.l d2,d3
    dbra d1,loop_10_start |
An FP version would be able to compute the Texel index in parallel. And, a third version - using the internal texturing unit should be even faster :) Of course, the example above is axis-aligned, so won't work for generic angled surfaces, but you can still make lots of games with that.
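The inner loop is a fixed-point texture walk: a fractional index is stepped per pixel and shifted down to pick the texel. A C sketch of the same idea follows; the function name and parameters are hypothetical, and `count` here is the number of pixels, whereas `dbra` with a counter of N runs N+1 times.

```c
#include <stdint.h>

/* Copy `count` texels from one texture row to video memory.
 * idx_current is a fixed-point position; `shift` fractional bits are
 * dropped to get the texel index, and xp_add is the per-pixel step. */
void draw_span(uint32_t *vid, const uint32_t *tex,
               uint32_t idx_current, uint32_t xp_add,
               unsigned shift, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        uint32_t texel = idx_current >> shift; /* lsr.l d5,d4 */
        *vid++ = tex[texel];                   /* scaled index + move.l (a2),(a0)+ */
        idx_current += xp_add;                 /* add.l d2,d3 */
    }
}
```

For example, with 16 fractional bits and a step of 0x8000 (half a texel per pixel), every texel is drawn twice, i.e. the row is stretched to double width.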
| |
| | Don Adan
Posts 38 21 Jan 2020 10:29
| Vladimir Repcak wrote:
| Here's an example of the inner loop for the horizontal scanlines:
Higgs:
loop (lpMain = xlVisible)
{
    idxPixel = idxCurrent >> BitShiftR
    idxPixel <<= #2
    texPtr = texPtrStart + idxPixel
    (vidPtr)+ = (texPtr)
    idxCurrent += xpAdd
} |
ASM output:
loop_10_start:
    move.l d3,d4
    lsr.l d5,d4
    lsl.l #2,d4
    move.l a3,a2
    add.l d4,a2
    move.l (a2),(a0)+
    add.l d2,d3
    dbra d1,loop_10_start |
An FP version would be able to compute the Texel index in parallel. And, a third version - using the internal texturing unit should be even faster :) Of course, the example above is axis-aligned, so won't work for generic angled surfaces, but you can still make lots of games with that.
|
Perhaps you can use move.l (a3,d4.l*4),(a0)+ to replace 4 instructions.
| |
| | Vladimir Repcak
Posts 359 21 Jan 2020 10:42
| Don Adan wrote:
| Vladimir Repcak wrote:
| ASM output:
loop_10_start:
    move.l d3,d4
    lsr.l d5,d4
    lsl.l #2,d4
    move.l a3,a2
    add.l d4,a2
    move.l (a2),(a0)+
    add.l d2,d3
    dbra d1,loop_10_start
An FP version would be able to compute the Texel index in parallel. And, a third version - using the internal texturing unit should be even faster :) Of course, the example above is axis-aligned, so won't work for generic angled surfaces, but you can still make lots of games with that. |
Perhaps you can use move.l (a3,d4.l*4),(a0)+ to replace 4 instructions. |
Yes, Indirect addressing should remove quite a few ops. On Jaguar, I was occasionally having hard-to-debug issues with the (const, An, Xn.s) displacement modes and mostly used just the simplest (const,a0) displacement. But, this is a different platform, so it's worth trying it out. Thanks for pointing this out !
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6253 21 Jan 2020 11:02
| Nice progress! Congratulations! I wonder if going for a 15/16-bit screenmode might be a smart decision. For games, maybe more FPS has more value than slightly finer color shades. What do you think? Would you like to change the test engine to 15/16-bit?
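To make the trade-off concrete, here is a C sketch that packs a 24-bit colour into 16 bits and compares frame sizes. The RGB565 layout (5 bits red, 6 green, 5 blue) is an assumption for illustration; the actual 15/16-bit pixel formats of the target screen modes are not specified in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack 0x00RRGGBB into RGB565 by dropping the low bits of each channel. */
uint16_t rgb888_to_rgb565(uint32_t rgb)
{
    uint32_t r = (rgb >> 16) & 0xFFu;
    uint32_t g = (rgb >> 8) & 0xFFu;
    uint32_t b = rgb & 0xFFu;
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Bytes needed for one frame at the given size and pixel width. */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}
```

At 320x240 a 16-bit frame is 153,600 bytes versus 307,200 bytes at 32 bits per pixel, so every pixel write and every texel read moves half the data.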
| |
|
|
|