don't click here

Optimized KosDec and NemDec, considerably faster decompression

Discussion in 'Engineering & Reverse Engineering' started by vladikcomper, Nov 4, 2013.

  1. RetroKoH

    RetroKoH

    Member
    1,681
    37
    28
    S1Fixed: Successor to ReadySonic
    SO... I tried to run the Comper compressor on my laptop... got greeted by one of those black console windows for a split second, then nothing... This isn't something new, but I never know what to do about this. Surely I cannot be the only one???
     
  2. vladikcomper

    vladikcomper

    Tech Member
    210
    158
    43
    Sonic Warped
    All compressors provided by flamewing are command-line tools, which means they only work from a command prompt, where you should pass 3 or 2 arguments to the compressor.

    To use them, open Command Prompt, set the current path to where compressors' executable files are, like so:

    Code (Text):
    1. cd D:\Sega\Dev\Tools\KENS_FL\
    Now you can run any executable file from that folder just by typing its name (even without extension). For example, if you want to execute Comper's compressor (compcmp.exe), simply type in: compcmp. However, this won't get you anywhere, because, in order to make it to work, you should pass arguments to the compressor first. Executing it without them simply leads you to a short help message - the one that popped up for a split second when you launched it right from Windows Explorer.

    Now, as for the arguments.
    If you want compress file A.bin to Comper format, execute the following command in command prompt:

    Code (Text):
    1. compcmp A.bin A.bin.comper
    The output will be A.bin.comper (obviously). You may choose any name you like. I prefer extra extensions to identify compression format more easily.

    If you want decompress B.bin.comper, do this:

    Code (Text):
    1. compcmp -x B.bin.comper B.bin
    If you don't like dealing with command prompt, you may check out my own original compressor tools (the compcmp.exe from the recent KENS was written by flamewing himself): https://dl.dropboxusercontent.com/u/44757401/Comper%20Compressor.7z
    It works simple: just drag&drop the file you want to compress onto executable. The output will be a file of the same name, but with extra .comper extension. This compressor, however, works slowly due to debug output, it also was compiled with MinGW and may not work on some machines if GCC dynamic link library is absent -- this can be easily fixed by finding and placing the said dll-file into compressors folder (the name will be displayed in the error message).
     
  3. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,161
    67
    28
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    So, I was running the tools to have some statistics to show and I discovered that the KosM encoder was far from optimal as it was not properly managing the internal padding of each module. I also did some things to improve the Nemesis encoder a bit by using some new heuristics. And finally, I found out a case (that does not occur in practice) were the Saxman could be improved: if it started with a sequence of zeroes. I updated the download link in the previous post; but here it is, for convenience.

    Anyway, stats: I reencoded all Sonic 1 Nemesis art, 256x256 blocks and its sole Kosinski art file; for Sonic 2, I reencoded all Nemesis art, all Kosinski art, all 16x16 and 128x128 blocks and all Saxman songs; and for S&K, I reencoded all Kosinski art, all Nemesis art and all moduled Kosinski art. I did this for both original KENS and the latest versions of my tools (in the download linked to above). The results:

    Code (Text):
    1.                     Uncompressed    Original      KENS         Mine     Original ratio  KENS ratio    My ratio  KENS/Original   Mine/Original    Mine/KENS
    2. S1 256x256 blocks    335872 B    70896 B     70511 B      66300 B      21.11%      20.99%      19.74%      99.46%      93.52%      94.03%
    3. S1 Nemesis art       361344 B   168306 B    166251 B     164198 B      46.58%      46.01%      45.44%      98.78%      97.56%      98.77%
    4. S1 Kosinski art        4096 B     1424 B      1419 B       1366 B      34.77%      34.64%      33.35%      99.65%      95.93%      96.26%
    5. S1 Total             701312 B   240626 B    238181 B     231864 B      34.31%      33.96%      33.06%      98.98%      96.36%      97.35%
    6. S2 Saxman music       29430 B    21935 B     21933 B      21819 B      74.53%      74.53%      74.14%      99.99%      99.47%      99.48%
    7. S2 16x16 blocks       42128 B    30688 B     30630 B      30342 B      72.84%      72.71%      72.02%      99.81%      98.87%      99.06%
    8. S2 128x128 blocks    262144 B    85104 B     84599 B      81098 B      32.46%      32.27%      30.94%      99.41%      95.29%      95.86%
    9. S2 Nemesis art       329216 B   158758 B    157823 B     154990 B      48.22%      47.94%      47.08%      99.41%      97.63%      98.20%
    10. S2 Kosinski art      238164 B   104816 B    104515 B     100446 B      44.01%      43.88%      42.18%      99.71%      95.83%      96.11%
    11. S2 Total             901082 B   401301 B    399500 B     388695 B      44.54%      44.34%      43.14%      99.55%      96.86%      97.30%
    12. S&K Nemesis art      264000 B   113268 B    108641 B     107274 B      42.90%      41.15%      40.63%      95.91%      94.71%      98.74%
    13. S&K KosM art         444544 B   205564 B    204595 B     197198 B      46.24%      46.02%      44.36%      99.53%      95.93%      96.38%
    14. S&K Kosinski art     304096 B   126736 B    126350 B     120586 B      41.68%      41.55%      39.65%      99.70%      95.15%      95.44%
    15. S&K Total           1012640 B   445568 B    439586 B     425058 B      44.00%      43.41%      41.98%      98.66%      95.41%      96.70%
    16.  
    All told, you can save 8762 B in S1, 12606 B in S2 and 20510 B in S3&K.

    Edit: Added in a further improved version of the KosM encoder and updated the table accordingly -- it takes into consideration the 3-byte end-of-compression sequence when managing the padding for a module. It is as good as it gets now.
     
  4. RetroKoH

    RetroKoH

    Member
    1,681
    37
    28
    S1Fixed: Successor to ReadySonic
    So, I'm trying out implementing the Decompressor code... and I've a question.

    In the Nemesis Decompression code, above NemDec: There's a block labeled NemDec_RAM:

    Should I just ignore this and port the code into Sonic 1 as is? Or is NemDec_RAM supposed to be used in certain instances?
     
  5. Clownacy

    Clownacy

    Tech Member
    1,091
    662
    93
    It would appear that NemDec_RAM (or NemDecToRAM as Sonic 2 calls it) is in fact present and unused in Sonic 1. Look here (taken from HG disasm):

    Code (ASM):
    1. NemDec:
    2.         movem.l d0-a1/a3-a5,-(sp)
    3.         lea (loc_1502).l,a3
    4.         lea (vdp_data_port).l,a4
    5.         bra.s   loc_145C
    6. ; ===========================================================================
    7.         movem.l d0-a1/a3-a5,-(sp)
    8.         lea (loc_1518).l,a3


    Above the '='s you can see NemDec. Below them, with no label, you can see NemDec_RAM. This is actually used in Sonic 2. Its use is similar to KosDec in the sense that, to use it, the source is loaded into a0 and the RAM destination is loaded into a4 (a1 in KosDec), so you can easily swap them out, as I did with CompDec.

    To answer your question, no, you don't need NemDec_RAM unless you're planning to replace some Kosinski art/mappings/anything, and therefore their decompression routines, with Nemesis. From my experiences, NemDec_RAM, KosDec and CompDec are fully interchangeable (though Sonic 2's 16x16s are giving me some trouble.) So choose whichever suits your needs best; small file size or fast decompression speeds.

    EDIT: Grammar!
     
  6. MainMemory

    MainMemory

    Kate the Wolf Tech Member
    4,781
    362
    63
    SonLVL
    My understanding is that Nemesis is designed for, and should only be used with, 8x8 tiles. Kosinski and Comper can be used for any type of data, and Enigma is for plane mappings or 16x16 blocks (Sonic 1 uses it for them).
     
  7. RetroKoH

    RetroKoH

    Member
    1,681
    37
    28
    S1Fixed: Successor to ReadySonic
    That said, would it be feasible to simply oust Nemesis entirely in favor of Kosinski (or Comper now that we have that now)? I'm guessing part of the reason that's not done... is a filesize issue???
     
  8. Clownacy

    Clownacy

    Tech Member
    1,091
    662
    93
    I think the main reason is... well, you know how NemDec_RAM could be replaced because Kos/CompDec had the same results (source (a0) decompressed to a RAM address (a4/1)?) Well, NemDec is a different story: for example, it writes to VRAM and involves the VDP data port. I don't see that working with KosDec and CompDec in their current states. Maybe Sonic 3 found a way around that, having replaced a good chunk of its art (Nemesis included) with KosinskiM. Aside from that though, I'm not too sure either, though I can't imagine it being space limitations.
     
  9. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,161
    67
    28
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    The main issue in ousting Nemesis in favor of Kosinski (or Moduled Kosinski, or Comper, or a putative Moduled Comper) is simple -- RAM. As is, Kosinski (et al) can only decompress to RAM, while Nemesis can decompress directly to VRAM. Moduled Kosinski would be a viable alternative if you could get a buffer with $1000 bytes in main RAM -- or you use a modified encoder, such as mine, which allows you to specify a smaller buffer size for KosM, with a corresponding penalty in compression ratio.

    It is entirely possible to free $1000 bytes in main RAM for KosM in S1 and S2 -- S2 being easier because it uses a z80 driver, which frees up a lot of RAM right there -- but you need to juggle around an awful lot of addresses to get the full $1000.

    In theory, since it uses words internally, a modified Comper decoder could be written using VRAM writes and VRAM copies, thus being able to decompress directly to VRAM. I am not sure if a VRAM copy would work with Kosinski or not -- I need to test how VRAM copies would do what is required when copying an overlapping byte stream to be sure. I am not sure how fast this would be, but it would probably be a lot slower. Which might be OK for a PLC-like mechanism, I guess.
     
  10. RetroKoH

    RetroKoH

    Member
    1,681
    37
    28
    S1Fixed: Successor to ReadySonic
    Code (Text):
    1. Kos_decomp_buffer =     ramaddr( $FFFFD000 ) ; $1000 bytes ; each module in a KosM archive is decompressed here and then DMAed to VRAM
    Ah... you mean this. Pulled it from S3K. Ok. I think I understand completely now. This greatly helps me with my hack. Thanks guys!
     
  11. MainMemory

    MainMemory

    Kate the Wolf Tech Member
    4,781
    362
    63
    SonLVL
    Bumping so this doesn't get lost: I modified the optimized NemDec for Sonic 2: http://pastebin.com/2s3PnCDQ
     
  12. Clownacy

    Clownacy

    Tech Member
    1,091
    662
    93
    (Bump-mad, lately, huh?)

    I came across some scrambled art in my hack. It uses S&K's KosM system, ported to S2, which, itself had this improved KosDec built into it. I initially followed this post's directions to fix it, but, looking into it, all this "fix" seems to do is disable the bugged bookmark system.

    It seems to me that the bug is simply that the Kosinski decompressor's registers aren't all being backed up. The part of the decompressor used by KosM uses all data registers, and address registers a0, a1, a3 and a4, so Restore_Kos_Bookmark and Backup_Kos_Registers need to suit these. The size of Kos_decomp_stored_registers will need to be increased to $30 (8*4 + 4*4).

    You can get away with not saving a4, so Kos_decomp_stored_registers will only require $2C bytes of RAM, by manually loading KosDec_ByteMap to a4 at the end of Restore_Kos_Bookmark.

    For those wondering how to get this Kosinski decompressor into KosM, go to Process_Kos_Queue, and there should be a comment that reads 'what follows is identical to the normal Kosinski decompressor except for using Kos_description_field instead of the stack'. Note that, after switching decompressors, Kos_description_field won't be used, so consider that free RAM. Anyway, for the sake of simplicity (Set_Kos_Bookmark gets a little complicated, otherwise) we'll be duplicating the KosDec code. Delete everything from after that comment to the label Process_Kos_Queue_EndReached. In its place, you'll want to put the code from after the KosDec label up to, and including, the KosDec_Quit label. Modify the labels however you want, to prevent 'label multiply defined' errors.
     
  13. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,161
    67
    28
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    Oh, I guess I forgot to release this…

    For starters: new, faster version of Kos decoder. The speed gain is relatively minor, but it is there ($087F versus $08E2, according to vladik's measurements). It has the advantage that it uses a5 instead of a3, which S3&K assumed was preserved in calls to KosDec. It also uses dot labels. For asm68k, you have to replace the dotted labels with @ labels, and the "endm" after "rept" with "endr".

    Regarding KosM: here is a version of KosM using the above decoder, as well as all auxiliary functions needed by KosM (except the function calls and V-Int modifications). It includes the following versions of Restore_Kos_Bookmark and Backup_Kos_Registers:

    [68k]Restore_Kos_Bookmark:
    movem.w (Kos_decomp_stored_registers).w,d0-d6
    movem.l (Kos_decomp_stored_registers+2*7).w,a0-a1/a5
    move.l (Kos_decomp_bookmark).w,-(sp)
    move.w (Kos_decomp_stored_SR).w,-(sp)
    moveq #(1<<_Kos_LoopUnroll)-1,d7
    lea KosDec_ByteMap(pc),a4 ; Load LUT pointer.
    rte
    ; End of function Process_Kos_Queue
    ; ===========================================================================
    Backup_Kos_Registers:
    move sr,(Kos_decomp_stored_SR).w
    movem.w d0-d6,(Kos_decomp_stored_registers).w
    movem.l a0-a1/a5,(Kos_decomp_stored_registers+2*7).w
    rts
    ; ===========================================================================[/68k]
    These versions needs less RAM than the stock S3&K version does to save the decoder state (saves 14 bytes), and is slightly faster (12 cycles) to save and restore because of it, despite having more state to restore.

    Edit: Oh, the KosM version does not include the lookup table; you need to use the one from the Kos decoder. But if you place one the KosM decoder under the Kos decoder (as S3&K also does), then the lookup table will be there and close enough to be used. Also, it does not have any of the macros, which are also only on the Kos decoder.

    Edit 2: As Clownacy pointed out in IRC, there is a bug in the save/restore code I originally posted. I changed the post and the paste to a fixed (but slightly slower) version, which uses up as much RAM as the stock S3&K version.

    Edit 3: Another new version, this time saving RAM and speed on the bookmark system by saving/restoring all data registers as words instead of longs. This needed a change to both decoders (as they share a macro) because d3 was used as a long. The change allows it to be used only as a word, which has the nice side-effect of making the decoder 4 cycles faster.
     
  14. carljr

    carljr

    Member
    5
    0
    0
    Hello all,

    As a way to learn M68K assembler, I have been looking at the sonic 1 disassembly. For whatever reason, I decided to start with the NemDec decompression methods. I found this thread and saw that there were some optimizations made to the routines.

    I decided to take it a little further, but the following code is untested. If I've understood how everything works, it should work in theory. At the moment I don't have the ability to test as I am just starting out and learning the ropes.

    If someone does get a chance to test it, or finds the code useful, it would be nice to hear back.

    Thanks, Carl

    https://pastebin.com/EuaKJurd
     
  15. InvisibleUp

    InvisibleUp

    friendly internet ghost Member
    139
    13
    18
    Not to put down your effort, but it doesn't even assemble.

    Code (Text):
    1.  
    2. SN 68k version 2.53
    3.  
    4. NEMESIS DECOMPRESSION.ASM(32) : Error : Illegal value (16755200)
    5.  lea $ffaa00.w,a1
    6. NEMESIS DECOMPRESSION.ASM(85) : Error : Illegal value (9)
    7.  subq.b #9,d6
    8. Errors during pass 1 - pass 2 aborted
    9. Assembly completed.
    10. 2 error(s) from 52894 lines in 0.19 seconds
    11.  
    The first error is because $FFAA00 is greater than the size of a word, which only goes up to $FFFF. You'd want to use a .l instead.

    The second error is because subq only supports values from 1 to 8. You'd need to use sub.b for anything past that.

    Also note that the Sonic 1 git diasm requires NemDec_WriteRowToVDP to have a loc_1502 label immediately below it.

    Unfortunately the game freezes on a black screen even when those changes are made. Looking at the VDP debugger in Regen, it seems to be writing garbage to VRAM and never stopping.

    If you want to test this, I would just get a copy of the Sonic 1 (or 2) Git disasm and replacing the Nemeis decompression file. It's super simple to get the diasm working and compiling your own code. And try to make one change at a time, and see if that works before continuing.
     
  16. Devon

    Devon

    DROWN, DROWN, DROWN MYSELF! Tech Member
    1,361
    1,612
    93
    your mom
    What the ".w" actually does is take the 32-bit number it is used with and checks if it fits in the signed word range, which can be done by getting rid of the highest 2 bytes (only if they are both $FF or $00) to get only the lowest bytes. The signed word range ranges from -$8000 through $7FFF ($0000-$7FFF being positive and $8000-$FFFF being negative, mapped to -$8000 through -$0001). In this case, $AA00 (the lowest 2 bytes from the address he/she used) is considered a negative word number, but in his/her code, he/she never properly sign extended it to a longword as it was expected to be. The correct number to use is $FFFFAA00 if you want to use ".w". While changing it to ".l" also works just fine, fixing it for ".w" would result in a little more optimised instruction.

    How does this work for writing to the second half of RAM? -$8000 through -$0001 as a signed longword is mapped to $FFFF8000-$FFFFFFFF, and yet all of RAM is only in $FF0000-$FFFFFF in the Mega Drive's memory space. Simply put, the 68000 has a 24-bit address bus, so, the highest byte is pretty much ignored. So when it writes to/reads from $FFFF8000-$FFFFFFFF, it's actually writing to/reading from $FF8000-$FFFFFF.

    This should also explain why writing to/reading from $FF0000-$FF7FFF with ".w" won't work, and why you must use a ".l" for that range. $FFFF0000-$FFFF7FFF do not fit the signed word range anyways, since that is considered as -$10000 through -$8001, so it is impossible to find a signed word that can be sign extended into that range.
     
  17. carljr

    carljr

    Member
    5
    0
    0
    I really appreciate your reply. I meant to use absolute word addressing, apparently the assembler does not accept that syntax and as the other poster mentioned you would need to do a $FFFFAA00.w when you want to use absolute word addressing on something that is sign extended to 32 bits.

    Anyway, even with a fresh copy of the sonic 1 disassembly with no changes, I can get it to compile with no errors but when I run it with REGEN it just deposits me into the debugger screen. I try to exit, but it just keeps popping up. Maybe the code is causing one exception after the other. :-( Seems to me it's not that easy to compile the assembly. (I'm using the /p to compile the binary.)

    Any help getting started would be much appreciated :-)
     
  18. InvisibleUp

    InvisibleUp

    friendly internet ghost Member
    139
    13
    18
    Does running "build.bat" not work for you? What version of the disasm are you using?
     
  19. carljr

    carljr

    Member
    5
    0
    0
    Success! Yes, part of the problem was that I was not using build.bat which has extra assembler parameters that I do not yet understand. That did the trick.

    Then there were some modifications to my code. When I wrote it, I did not consider that outside routines would make calls to NemDec3 or 4, etc. there were some register incompatibility issues which I have compensated for.

    Here is the updated code. If someone knows how to bench it against the original to see how much faster it is, I'd be interested to know the results. :-)

    There was even one more optimization I noticed for in-line code, but I think I'll leave things as is.

    Thanks, Carl

    https://pastebin.com/3eDw1NNx
     
  20. carljr

    carljr

    Member
    5
    0
    0
    Hi all,

    I'm continuing to learn M68K assembly and using these decompression routines as a sort of springboard. As a result, I've also succeeded in improving the current fastest KosDec routine a bit.

    This was quite challenging, because the coding by vladicomper was already so tight (nice job!). However, I was able to gain some cycles here and there. The biggest win was getting 20 cycles for every 16 bits processed from the stream. I don't know if this would make any perceptible difference, though. If anyone can bench it against the previous routine, I'd be interested in knowing how much of a difference it makes.

    Anyway, here is the improved KosDec:
    https://pastebin.com/GWvCyRjD

    And again, here is my improved NemDec:
    https://pastebin.com/3eDw1NNx

    Hope someone finds this helpful!

    Carl