Gunnar von Boehn wrote:
| I try to explain. The 68060 FPU was not pipelined. So you could write stuff like
FADD F0,F1
FADD F1,F2
FMUL F2,F3
FMUL F3,F4
And the code would execute at normal speed sequentially. Each instruction taking 3 clocks!
|
Handwritten FPU assembly is rare, so it is better to optimize for what compilers generate. So here, have a real-world, compiler-generated FPU code example. It's an actual 4x4 matrix multiplication used in a 3D engine. Yes, it's far from ideal when it comes to two pipes running in parallel, but still, a lot of things can be paired, and some of the overlaps can be dealt with by register renaming and, potentially, OoOE. Also, for the memory reads, a prefetch engine can work ahead. Of course, it's also not that hard to add instruction rescheduling to compilers once there's actually a CPU which needs FPU instruction rescheduling.

As usual, when you try to eliminate bottlenecks, the biggest problem is the lack of 3-operand instructions, which results in a lot of extra FMOVEs; otherwise a lot of the FMOVE+FOP or FOP+FMOVE pairs could be turned into single 3-operand instructions. The fact that, for most ops, one of the operands is both read *and* written by the same instruction doesn't help. Maybe the core could also add exceptions for that; I think even the '060 has exceptions for similar cases when it comes to integer instruction pairing.
    link.w %a5,#-88
    movem.l %a2/%a3/%a4/%a6,-88(%a5)
    fmovem.x %fp2/%fp3/%fp4/%fp5/%fp6/%fp7,-72(%a5)
    move.l 8(%a5),%a6
    moveq.l #-1,%d1
    .balignw 4,0x4e71
.Lj20:
    addq.l #1,%d1
    move.l %d1,%d0
    lsl.l #4,%d0
    lea (%a1,%d0.l),%a3
    move.l %a3,%a2
    move.l %d1,%d0
    lsl.l #4,%d0
    lea (%a0,%d0.l),%a3
    move.l %a3,%a4
    fmove.s (%a2),%fp1
    fmove.s 4(%a2),%fp3
    lea (%a6),%a3
    fmove.x %fp1,%fp0
    fmul.s (%a3),%fp0
    fmove.x %fp3,%fp2
    fmul.s 16(%a3),%fp2
    fadd.x %fp2,%fp0
    fmove.x %fp0,%fp7
    lea (%a6),%a3
    fmove.x %fp1,%fp0
    fmul.s 4(%a3),%fp0
    fmove.x %fp3,%fp2
    fmul.s 20(%a3),%fp2
    fadd.x %fp2,%fp0
    fmove.x %fp0,%fp4
    lea (%a6),%a3
    fmove.x %fp1,%fp0
    fmul.s 8(%a3),%fp0
    fmove.x %fp3,%fp2
    fmul.s 24(%a3),%fp2
    fadd.x %fp2,%fp0
    fmove.x %fp0,%fp5
    lea (%a6),%a3
    fmove.x %fp1,%fp0
    fmul.s 12(%a3),%fp0
    fmove.x %fp3,%fp2
    fmul.s 28(%a3),%fp2
    fadd.x %fp2,%fp0
    fmove.x %fp0,%fp6
    fmove.s 8(%a2),%fp1
    fmove.s 12(%a2),%fp3
    lea (%a6),%a3
    fmove.x %fp1,%fp0
    fmul.s 32(%a3),%fp0
    fmove.x %fp3,%fp2
    fmul.s 48(%a3),%fp2
    fadd.x %fp2,%fp0
    fadd.x %fp7,%fp0
    fmove.s %fp0,(%a4)
    lea (%a6),%a3
    fmove.x %fp1,%fp0
    fmul.s 36(%a3),%fp0
    fmove.x %fp3,%fp2
    fmul.s 52(%a3),%fp2
    fadd.x %fp2,%fp0
    fadd.x %fp4,%fp0
    fmove.s %fp0,4(%a4)
    lea (%a6),%a3
    fmove.x %fp1,%fp0
    fmul.s 40(%a3),%fp0
    fmove.x %fp3,%fp2
    fmul.s 56(%a3),%fp2
    fadd.x %fp2,%fp0
    fadd.x %fp5,%fp0
    fmove.s %fp0,8(%a4)
    lea (%a6),%a3
    fmove.x %fp1,%fp0
    fmul.s 44(%a3),%fp0
    fmove.x %fp3,%fp2
    fmul.s 60(%a3),%fp2
    fadd.x %fp2,%fp0
    fadd.x %fp6,%fp0
    fmove.s %fp0,12(%a4)
    cmp.l #3,%d1
    jlt .Lj20
    move.l %a0,%d0
    movem.l -88(%a5),%a2/%a3/%a4/%a6
    fmovem.x -72(%a5),%fp2/%fp3/%fp4/%fp5/%fp6/%fp7
    unlk %a5
    rtd #4
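Read back into C, the loop above appears to compute the following. This is only a sketch reconstructed from the listing, not the engine's actual source: the function name `mat4_mul` and the parameter names are made up, with `a` standing for the matrix addressed through %a1, `b` for the one through %a6 and `dst` for the one through %a0. The compiler unrolled the inner column loop; it is kept as a loop here for brevity.

```c
/* Hypothetical reconstruction of the listing: dst = a * b for 4x4
   single-precision matrices stored row by row (16 bytes per row). */
void mat4_mul(float dst[4][4], float a[4][4], float b[4][4])
{
    for (int i = 0; i < 4; i++) {          /* %d1 counts the rows 0..3 */
        float partial[4];

        /* First half: a[i][0] and a[i][1] (held in %fp1 and %fp3)
           against rows 0 and 1 of b. The four partial sums are the
           values parked in %fp7, %fp4, %fp5 and %fp6. */
        for (int j = 0; j < 4; j++)
            partial[j] = a[i][0] * b[0][j] + a[i][1] * b[1][j];

        /* Second half: a[i][2] and a[i][3] against rows 2 and 3 of b,
           then add the parked partial sum and store the result. */
        for (int j = 0; j < 4; j++)
            dst[i][j] = (a[i][2] * b[2][j] + a[i][3] * b[3][j]) + partial[j];
    }
}
```

Counting the listing by hand, each pass through the loop has 20 register-to-register fmove.x instructions next to 28 fmul/fadd instructions, and those copies are the ones 3-operand forms would make unnecessary. Note that the 68k code keeps its intermediates in FPU registers, so its rounding can differ slightly from this plain float version.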
|