Sonic and Sega Retro Message Board: Optimized KosDec and NemDec, considerably faster decompression - Sonic and Sega Retro Message Board

Jump to content

Hey there, Guest!  (Log In · Register) Help
  • 3 Pages +
  • 1
  • 2
  • 3
    Locked
    Locked Forum

Optimized KosDec and NemDec, considerably faster decompression UPDATE: New improved Kosinski and Nemesis compressors by flamewing

#31 User is offline MainMemory 

Posted 28 February 2015 - 02:50 PM

  • Every day's the same old thing... Same place, different day...
  • Posts: 4178
  • Joined: 14-August 09
  • Gender:Not Telling
  • Project:SonLVL
  • Wiki edits:1,339
Bumping so this doesn't get lost: I modified the optimized NemDec for Sonic 2: http://pastebin.com/2s3PnCDQ

#32 User is offline Clownacy 

Posted 08 August 2015 - 11:58 AM

  • Posts: 718
  • Joined: 06-July 13
  • Gender:Not Telling
(Bump-mad, lately, huh?)

I came across some scrambled art in my hack. It uses S&K's KosM system, ported to S2, which, itself had this improved KosDec built into it. I initially followed this post's directions to fix it, but, looking into it, all this "fix" seems to do is disable the bugged bookmark system.

It seems to me that the bug is simply that the Kosinski decompressor's registers aren't all being backed up. The part of the decompressor used by KosM uses all data registers, and address registers a0, a1, a3 and a4, so Restore_Kos_Bookmark and Backup_Kos_Registers need to suit these. The size of Kos_decomp_stored_registers will need to be increased to $30 (8*4 + 4*4).

You can get away with not saving a4, so Kos_decomp_stored_registers will only require $2C bytes of RAM, by manually loading KosDec_ByteMap to a4 at the end of Restore_Kos_Bookmark.

For those wondering how to get this Kosinski decompressor into KosM, go to Process_Kos_Queue, and there should be a comment that reads 'what follows is identical to the normal Kosinski decompressor except for using Kos_description_field instead of the stack'. Note that, after switching decompressors, Kos_description_field won't be used, so consider that free RAM. Anyway, for the sake of simplicity (Set_Kos_Bookmark gets a little complicated, otherwise) we'll be duplicating the KosDec code. Delete everything from after that comment to the label Process_Kos_Queue_EndReached. In its place, you'll want to put the code from after the KosDec label up to, and including, the KosDec_Quit label. Modify the labels however you want, to prevent 'label multiply defined' errors.

#33 User is offline flamewing 

Posted 08 August 2015 - 02:30 PM

  • Emerald Hunter
  • Posts: 1137
  • Joined: 11-October 10
  • Gender:Male
  • Location:🇫🇷 France
  • Project:Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
  • Wiki edits:12
Oh, I guess I forgot to release this…

For starters: new, faster version of Kos decoder. The speed gain is relatively minor, but it is there ($087F versus $08E2, according to vladik's measurements). It has the advantage that it uses a5 instead of a3, which S3&K assumed was preserved in calls to KosDec. It also uses dot labels. For asm68k, you have to replace the dotted labels with @ labels, and the "endm" after "rept" with "endr".

Regarding KosM: here is a version of KosM using the above decoder, as well as all auxiliary functions needed by KosM (except the function calls and V-Int modifications). It includes the following versions of Restore_Kos_Bookmark and Backup_Kos_Registers:

Restore_Kos_Bookmark:
	movem.w	(Kos_decomp_stored_registers).w,d0-d6
	movem.l	(Kos_decomp_stored_registers+2*7).w,a0-a1/a5
	move.l	(Kos_decomp_bookmark).w,-(sp)
	move.w	(Kos_decomp_stored_SR).w,-(sp)
	moveq	#(1<<_Kos_LoopUnroll)-1,d7
	lea	KosDec_ByteMap(pc),a4		; Load LUT pointer.
	rte
; End of function Process_Kos_Queue
; ===========================================================================
Backup_Kos_Registers:
	move	sr,(Kos_decomp_stored_SR).w
	movem.w	d0-d6,(Kos_decomp_stored_registers).w
	movem.l	a0-a1/a5,(Kos_decomp_stored_registers+2*7).w
	rts
; ===========================================================================

These versions needs less RAM than the stock S3&K version does to save the decoder state (saves 14 bytes), and is slightly faster (12 cycles) to save and restore because of it, despite having more state to restore.

Edit: Oh, the KosM version does not include the lookup table; you need to use the one from the Kos decoder. But if you place one the KosM decoder under the Kos decoder (as S3&K also does), then the lookup table will be there and close enough to be used. Also, it does not have any of the macros, which are also only on the Kos decoder.

Edit 2: As Clownacy pointed out in IRC, there is a bug in the save/restore code I originally posted. I changed the post and the paste to a fixed (but slightly slower) version, which uses up as much RAM as the stock S3&K version.

Edit 3: Another new version, this time saving RAM and speed on the bookmark system by saving/restoring all data registers as words instead of longs. This needed a change to both decoders (as they share a macro) because d3 was used as a long. The change allows it to be used only as a word, which has the nice side-effect of making the decoder 4 cycles faster.
This post has been edited by flamewing: 08 August 2015 - 04:31 PM

#34 User is offline carljr 

Posted 15 December 2017 - 11:52 PM

  • Posts: 5
  • Joined: 12-December 17
Hello all,

As a way to learn M68K assembler, I have been looking at the sonic 1 disassembly. For whatever reason, I decided to start with the NemDec decompression methods. I found this thread and saw that there were some optimizations made to the routines.

I decided to take it a little further, but the following code is untested. If I've understood how everything works, it should work in theory. At the moment I don't have the ability to test as I am just starting out and learning the ropes.

If someone does get a chance to test it, or finds the code useful, it would be nice to hear back.

Thanks, Carl

https://pastebin.com/EuaKJurd

#35 User is offline InvisibleUp 

Posted 16 December 2017 - 12:37 AM

  • Posts: 104
  • Joined: 10-November 12
  • Gender:Not Telling
  • Project:Sonic Redux (Sonic R hack)
Not to put down your effort, but it doesn't even assemble.

SN 68k version 2.53

NEMESIS DECOMPRESSION.ASM(32) : Error : Illegal value (16755200)
 lea $ffaa00.w,a1
NEMESIS DECOMPRESSION.ASM(85) : Error : Illegal value (9)
 subq.b #9,d6
Errors during pass 1 - pass 2 aborted
Assembly completed.
2 error(s) from 52894 lines in 0.19 seconds



The first error is because $FFAA00 is greater than the size of a word, which only goes up to $FFFF. You'd want to use a .l instead.

The second error is because subq only supports values from 1 to 8. You'd need to use sub.b for anything past that.

Also note that the Sonic 1 git diasm requires NemDec_WriteRowToVDP to have a loc_1502 label immediately below it.

Unfortunately the game freezes on a black screen even when those changes are made. Looking at the VDP debugger in Regen, it seems to be writing garbage to VRAM and never stopping.

If you want to test this, I would just get a copy of the Sonic 1 (or 2) Git disasm and replacing the Nemeis decompression file. It's super simple to get the diasm working and compiling your own code. And try to make one change at a time, and see if that works before continuing.

#36 User is offline Ralakimus 

Posted 16 December 2017 - 01:23 AM

  • Some say that Heaven is Hell
  • Posts: 266
  • Joined: 16-April 13
  • Gender:Male
  • Project:ass

View PostInvisibleUp, on 16 December 2017 - 12:37 AM, said:

The first error is because $FFAA00 is greater than the size of a word, which only goes up to $FFFF. You'd want to use a .l instead.


What the ".w" actually does is take the 32-bit number it is used with and checks if it fits in the signed word range, which can be done by getting rid of the highest 2 bytes (only if they are both $FF or $00) to get only the lowest bytes. The signed word range ranges from -$8000 through $7FFF ($0000-$7FFF being positive and $8000-$FFFF being negative, mapped to -$8000 through -$0001). In this case, $AA00 (the lowest 2 bytes from the address he/she used) is considered a negative word number, but in his/her code, he/she never properly sign extended it to a longword as it was expected to be. The correct number to use is $FFFFAA00 if you want to use ".w". While changing it to ".l" also works just fine, fixing it for ".w" would result in a little more optimised instruction.

How does this work for writing to the second half of RAM? -$8000 through -$0001 as a signed longword is mapped to $FFFF8000-$FFFFFFFF, and yet all of RAM is only in $FF0000-$FFFFFF in the Mega Drive's memory space. Simply put, the 68000 has a 24-bit address bus, so, the highest byte is pretty much ignored. So when it writes to/reads from $FFFF8000-$FFFFFFFF, it's actually writing to/reading from $FF8000-$FFFFFF.

This should also explain why writing to/reading from $FF0000-$FF7FFF with ".w" won't work, and why you must use a ".l" for that range. $FFFF0000-$FFFF7FFF do not fit the signed word range anyways, since that is considered as -$10000 through -$8001, so it is impossible to find a signed word that can be sign extended into that range.
This post has been edited by Ralakimus: 16 December 2017 - 01:50 AM

#37 User is offline carljr 

Posted 16 December 2017 - 04:56 PM

  • Posts: 5
  • Joined: 12-December 17

View PostInvisibleUp, on 16 December 2017 - 12:37 AM, said:

Not to put down your effort, but it doesn't even assemble.

SN 68k version 2.53

NEMESIS DECOMPRESSION.ASM(32) : Error : Illegal value (16755200)
 lea $ffaa00.w,a1
NEMESIS DECOMPRESSION.ASM(85) : Error : Illegal value (9)
 subq.b #9,d6
Errors during pass 1 - pass 2 aborted
Assembly completed.
2 error(s) from 52894 lines in 0.19 seconds



The first error is because $FFAA00 is greater than the size of a word, which only goes up to $FFFF. You'd want to use a .l instead.

The second error is because subq only supports values from 1 to 8. You'd need to use sub.b for anything past that.

Also note that the Sonic 1 git diasm requires NemDec_WriteRowToVDP to have a loc_1502 label immediately below it.

Unfortunately the game freezes on a black screen even when those changes are made. Looking at the VDP debugger in Regen, it seems to be writing garbage to VRAM and never stopping.

If you want to test this, I would just get a copy of the Sonic 1 (or 2) Git disasm and replacing the Nemeis decompression file. It's super simple to get the diasm working and compiling your own code. And try to make one change at a time, and see if that works before continuing.


I really appreciate your reply. I meant to use absolute word addressing, apparently the assembler does not accept that syntax and as the other poster mentioned you would need to do a $FFFFAA00.w when you want to use absolute word addressing on something that is sign extended to 32 bits.

Anyway, even with a fresh copy of the sonic 1 disassembly with no changes, I can get it to compile with no errors but when I run it with REGEN it just deposits me into the debugger screen. I try to exit, but it just keeps popping up. Maybe the code is causing one exception after the other. :-( Seems to me it's not that easy to compile the assembly. (I'm using the /p to compile the binary.)

Any help getting started would be much appreciated :-)

#38 User is offline InvisibleUp 

Posted 16 December 2017 - 08:15 PM

  • Posts: 104
  • Joined: 10-November 12
  • Gender:Not Telling
  • Project:Sonic Redux (Sonic R hack)
Does running "build.bat" not work for you? What version of the disasm are you using?

#39 User is offline carljr 

Posted 17 December 2017 - 02:19 AM

  • Posts: 5
  • Joined: 12-December 17

View PostInvisibleUp, on 16 December 2017 - 08:15 PM, said:

Does running "build.bat" not work for you? What version of the disasm are you using?


Success! Yes, part of the problem was that I was not using build.bat which has extra assembler parameters that I do not yet understand. That did the trick.

Then there were some modifications to my code. When I wrote it, I did not consider that outside routines would make calls to NemDec3 or 4, etc. there were some register incompatibility issues which I have compensated for.

Here is the updated code. If someone knows how to bench it against the original to see how much faster it is, I'd be interested to know the results. :-)

There was even one more optimization I noticed for in-line code, but I think I'll leave things as is.

Thanks, Carl

https://pastebin.com/3eDw1NNx

#40 User is offline carljr 

Posted 30 December 2017 - 11:45 PM

  • Posts: 5
  • Joined: 12-December 17
Hi all,

I'm continuing to learn M68K assembly and using these decompression routines as a sort of springboard. As a result, I've also succeeded in improving the current fastest KosDec routine a bit.

This was quite challenging, because the coding by vladicomper was already so tight (nice job!). However, I was able to gain some cycles here and there. The biggest win was getting 20 cycles for every 16 bits processed from the stream. I don't know if this would make any perceptible difference, though. If anyone can bench it against the previous routine, I'd be interested in knowing how much of a difference it makes.

Anyway, here is the improved KosDec:
https://pastebin.com/GWvCyRjD

And again, here is my improved NemDec:
https://pastebin.com/3eDw1NNx

Hope someone finds this helpful!

Carl

#41 User is offline Clownacy 

Posted 31 December 2017 - 06:16 AM

  • Posts: 718
  • Joined: 06-July 13
  • Gender:Not Telling
If you're that desperate for cycles in KosDec, you could unroll the descriptor field loop (removing the need for the 'dbra d2,@skip\@'). Sonic 2 Nick Arcade had a "Chameleon" decompressor that had that done to it.
This post has been edited by Clownacy: 31 December 2017 - 06:17 AM

#42 User is offline MarkeyJester 

Posted 31 December 2017 - 09:45 AM

  • Do you have a sister?
  • Posts: 1846
  • Joined: 22-July 08
  • Gender:Male
  • Location:Japan
  • Wiki edits:16
You can speed up your bit-field reading by a few more cycles still, by replacing;

        move.b  (a0)+,d1                        ;  8 -.
        move.b  (a4,d1.w),d0                    ; 14  :
        lsl.w   #8,d0                           ; 22 -' d0.8-15  <- [a0++] (x-mirror, process left to right)
        move.b  (a0)+,d1                        ;  8 -.
        move.b  (a4,d1.w),d0                    ; 14 -' d0.0-7   <- [a0++] (x-mirror, process left to right)

With:
	move.b	(a0)+,d1			;  8
	move.b	(a4,d1.w),(sp)			; 18
	move.w	(sp),d0				;  8
	move.b	(a0)+,d1			;  8
	move.b	(a4,d1.w),d0			; 14

You'll have to push the stack back by a word before starting decompression, and push it forwards again after decompression is finished, but this'll allow you to shift the contents of d0 8 pixels left, without actually shifting, and saves a few cycles.

To be fair though, the only real speed optimisations which you could perform on the routines, are mostly those that will cost memory to compensate, ala look up tables and unrolled loops (just like the method Clownacy mentioned), I think these routines are generally reaching the peak whereby; it's a matter of choice (the old speed vs size syndrome), and given them to cost memory, rather than actually saving memory (which is what the compression is designed specifically to do), I guess it really depends on context of the game itself, do you need the speed more than the size? (I know I'm the one to talk given my compression requires a shit load of look up tables (practice what you preach right?), but still, I felt it worth noting, it could be important).

#43 User is offline carljr 

Posted 31 December 2017 - 09:46 PM

  • Posts: 5
  • Joined: 12-December 17

View PostMarkeyJester, on 31 December 2017 - 09:45 AM, said:

With:
	move.b	(a0)+,d1			;  8
	move.b	(a4,d1.w),(sp)			; 18
	move.w	(sp),d0				;  8
	move.b	(a0)+,d1			;  8
	move.b	(a4,d1.w),d0			; 14

You'll have to push the stack back by a word before starting decompression, and push it forwards again after decompression is finished, but this'll allow you to shift the contents of d0 8 pixels left, without actually shifting, and saves a few cycles.

Hey, this is a really nice trick! I'll have to implement that. I guess you could also just use (a1) instead of (sp), assuming there is at least 1-byte left to output, which there would have to be. My point is you would not have to worry about putting something on the stack and popping it off later.

Are there any other forums or posts about optimizing code and coming up with new tricks? It works well for me when studying M68K programming.

Thanks again, Carl
This post has been edited by carljr: 31 December 2017 - 09:51 PM

#44 User is online nineko 

Posted 01 January 2018 - 06:55 AM

  • I am the Holy Cat
  • Posts: 5642
  • Joined: 17-August 06
  • Gender:Male
  • Location:italy
  • Project:I... don't even know anymore :U
  • Wiki edits:5,251
You can take a look at GenDev. And if you'll ever want to learn Z80 too, here is smspower.

#45 User is offline MarkeyJester 

Posted 01 January 2018 - 07:50 AM

  • Do you have a sister?
  • Posts: 1846
  • Joined: 22-July 08
  • Gender:Male
  • Location:Japan
  • Wiki edits:16

View Postcarljr, on 31 December 2017 - 09:46 PM, said:

Hey, this is a really nice trick! I'll have to implement that. I guess you could also just use (a1) instead of (sp), assuming there is at least 1-byte left to output, which there would have to be. My point is you would not have to worry about putting something on the stack and popping it off later.

Sorry, but I'm afraid you cannot use (a1) as it could be on an odd address at any point, the 68k is not able to load word data from an odd offset, it will trigger an address error exception. I chose the stack and the sp, simply as the other registers are in circumstances in which they cannot be used in such a manner. You could use a previously unused address register, but you'd need to store the contents of that address register onto the stack, and point the address register to a rewrite-able address, which would probably take longer than simply moving the stack back and using that.

Good thinking though, I admire that you're thinking and trying to find more efficient ways around it =)

  • 3 Pages +
  • 1
  • 2
  • 3
    Locked
    Locked Forum

1 User(s) are reading this topic
0 members, 1 guests, 0 anonymous users