don't click here

Utility Accurate Kosinski compressor

Discussion in 'Engineering & Reverse Engineering' started by Clownacy, Oct 16, 2021.

  1. Clownacy

    Clownacy

    Tech Member
    1,093
    665
    93
    Right now we have many Kosinski compressors such as the ones in KENS, KENSSharp, mdcomp (a.k.a. KENSC), and clownlzss, but none of them are actually capable of compressing files in the exact same way as Sega's original Kosinski compressor. This means that if you decompressed all of the Kosinski files in a Sonic disassembly, recompressed them, and then built that disassembly, then the resulting ROM would not be the same as the original.

    Back in 2018 I started working on a compressor that would remedy this, and produce files identically to Sega's original compressor. While I did get pretty close, with my compressor being able to recompress most files accurately, there were still a few that differed. In the last few days, however, I've finally cracked it, addressing the last inaccuracy and now having a compressor that accurately reproduces every single Kosinski-compressed file in the entire Sonic trilogy.

    This might seem like an odd niche, but it does have its uses. For example, the Sonic disassemblies have to compress their assembled Z80 sound driver code, and if the compressor it uses is inaccurate, then the built ROM will be inaccurate. Until not too long ago, we'd been making do with a hacked-up version of KENS that would accurately-compress that specific code, but not anything else.

    Another use for something like this is that it would allow for the disassemblies to store all of their assets completely uncompressed, making it simpler for level editors and the like to support them. The disassemblies could then compress the files as part of their build process, similar to what some of the Pokemon disassemblies do.

    Part of the reason that accurate Kosinski compression has been so hard to achieve is that Sega's original compressor was riddled with bugs. For example, while the Kosinski format itself allows for up to 0x100 bytes to be compressed at once, Sega's compressor would only ever compress up to 0xFD bytes. Likewise, while compressed data had a 'range' of 0x2000 bytes, Sega's compressor would only ever reach 0x1F03 at most.

    The final bug is perhaps the most obscure and complicated: the final block of compressed data at the end of a file would often be unusual, in the sense that it wouldn't reference data in the same pattern that the other blocks do. It turns out that the reason for this is that the original compressor would accidentally (or, rather, deliberately, as part of a bizarre optimisation) read past the end of the file when scanning for data to compress. Because of this, it would end up matching data in the file with garbage data beyond the end of the file. This might seem like an impossible behaviour to recreate, as the garbage data would have been random bytes in RAM located after the file buffer, but it turns out that the original compressor actually used a special kind of file buffer: rather than load the entire uncompressed file into a single memory buffer, it would instead stream the file into a 0x2000 byte long ring buffer, and then, when the end of the file is reached, the compressor would simply stop reading new data into it. This means that the garbage data which the compressor ends up reading isn't the bytes in RAM after the file buffer, but rather just data from 0x2000 bytes earlier in the file (or just a series of 0x00 bytes if the file was less than 0x2000 bytes long, as this is what the ring buffer was initialised to). This is the inaccuracy that took me over three years to figure out.

    With this bug correctly emulated, my compressor now passes a test suite of every single bit of Kosinski-compressed data that I could find in Sonic 1, Sonic 2, Sonic 3, and Sonic & Knuckles. I'm hoping to expand this test suite with more 'official' Kosinski files in the future to catch any potential remaining inaccuracies.

    By the way, in case anyone reading this doesn't know, we already have an accurate Saxman compressor, as I found the original Saxman compressor's source code a while back. We haven't had that kind of luck with Kosinski however, which is why an accurate recreation is necessary. Fun fact, it was actually because of the Saxman compressor's source code that I was able to figure out the final Kosinski bug... because it does the exact same thing.

    Anyway, you can find my compressor on GitHub. To use it, you can either use the command line, or just drag-and-drop files onto it. The compressed file will be called 'out.kos' for Kosinski, and 'out.kosm' for Moduled Kosinski.

    For those interested in the code, it's written in C99 (but also compiles as valid C++11), and released under the zlib licence, so it should be fairly simple to incorporate into other projects. I hope to convert it to C89 at some point, for extra portability.
     
    Last edited: Oct 24, 2022
    • Like Like x 7
    • Useful Useful x 2
    • List
  2. MarkeyJester

    MarkeyJester

    Original, No substitute Resident Jester
    2,232
    483
    63
    Japan
    Fantastic work Clownacy!

    The compression method with 01 in the bitfield and RRRR RRRR RRRR RDDD CCCC CCCC in the data, where the D portion is 111, and the C portion is 0000 0001 ($01), which results in... well nothing being decompressed, would appear in Sonic 1's chunk data when the compression reaches the A000 mark. You say yours re-compresses exactly 1:1 to the original Sonic 1 data, so I assume you also have this quirk in place too. My question is; would you happen to know why this was always present? When I was working on a perfect Kosinski msyelf (small world), I noticed this rather weird reoccurring quirk too, and while luckily it was consistent and repeatable, and other games didn't seem to compress data nearly as large. I have yet to figure out what it was, I was assuming some sort of fail-safe marker, or an early attempt at some type of cue system, but A000 is such a specific size to reach. Any ideas?
     
  3. Clownacy

    Clownacy

    Tech Member
    1,093
    665
    93
    I have two ideas, which both assert that those 'dummy matches' were used by a decompressor that ran on PC:

    The first idea is that the dummy matches are a signal to the PC decompressor to allocate more memory for the decompression buffer. So, say you start off with a decompression buffer that's 0xA100 bytes large (the extra 0x100 is to hold the data that crosses the 0xA000 boundary), then, when a dummy match is encountered, it realloc's the buffer to be 0xA000 bytes larger. My decompressor actually did this at one point. By having the compressed data tell the decompressor to allocate more memory, it avoids the need for the decompressor to constantly check for when it needs more memory instead. But this idea falls apart when you consider that the decompressor likely decompressed straight to disk rather than a memory buffer, just like the original Saxman decompressor does, so a decompression buffer big enough to hold the entire uncompressed file would not have been necessary.

    My second idea is that it's a file-integrity thing: perhaps the decompressor would expect one of those dummy matches to occur every 0xA000 bytes, and if it doesn't, the decompressor would bail and report to the user that the input data is corrupt or not a valid Kosinski archive. But in that case, why is it only every 0xA000 bytes? Wouldn't it make sense for it to be more often? Because of this seemingly-arbitrary limit, most files don't even have one of these dummy matches - they're only in Sonic 1's chunks.

    Of course, this is all assuming that a decompressor existed for PC, which I don't think is too far-fetched: the fact that Kosinski's descriptor fields are byteswapped indicates that the format was intended for little-endian CPU architectures like x86 rather than the big-endian 68k.

    As for why these dummy matches occur specifically every 0xA000 bytes, I'm afraid I have no idea. 0xA000 is 40KiB, but I can't think of anything significant about that number. Hard drive sizes seem to have been in the tens of megabytes around 1990, so if Kosinski was actually originally a PC compression format, then perhaps it was also intended for compressing much larger data than it ended up being used for on the Mega Drive?
     
    Last edited: Oct 16, 2021
  4. Sonic Hachelle-Bee

    Sonic Hachelle-Bee

    Taking a Sand Shower Tech Member
    812
    204
    43
    Lyon, France
    Sonic 2 Long Version
    Please, yes. This is definitely a step in the right direction. All compressed files shall be generated at compile time.

    Very good work Clownacy!
     
  5. Hivebrain

    Hivebrain

    Administrator
    3,063
    184
    43
    53.4N, 1.5W
    Github
    Good stuff. I've implemented it in my disassembly, if that's okay. Seems to work perfectly, although I'm not sure what to do with the Z80 driver, which requires specific bytes to point to the Sega PCM file.
     
  6. Clownacy

    Clownacy

    Tech Member
    1,093
    665
    93
    It's absolutely okay: the compressor is released under a licence that allows you to freely use, modify and redistribute it, so long as a few conditions are met. Your usage meets those requirements, so it's perfectly fine. You can read the licence here.
     
  7. MarkeyJester

    MarkeyJester

    Original, No substitute Resident Jester
    2,232
    483
    63
    Japan
    In the original game, the compressed Z80 is followed by 10 ($A) bytes of 00. This works out such that the compressed Z80 and the 00's add up to a multiple of $20 bytes.

    So... you assemble the Z80 with the PCM pointer/size to 0, compress it, include it into the main source with extra padding up to the next $20 multiple of the Z80 ROM. Assemble the main ROM, acquire the Sega PCM address, reassemble the Z80 with the new pointer, re-compress, then inject it into the ROM (where the padded 00 bytes gives you enough leeway in-case the compression ends up slightly larger than before). I am convinced this is how they did theirs...
     
    • Informative Informative x 1
    • Useful Useful x 1
    • List
  8. Clownacy

    Clownacy

    Tech Member
    1,093
    665
    93
    The padding you're describing appears to be a quirk of Kosinski: for some reason, every Kosinski file is padded to $10 bytes, even the ones that make up Kosinski modules.

    The Sonic 2 Simon Wai prototype contains an unassembled Kosinski file. Yes, unassembled. For some reason, it's not just a binary file. Notably, each line is a $10 byte long 'dc.b' directive. My guess is that whatever tool was used to create these assembly files was unable to create a 'dc.b' directive that was shorter than $10 bytes, and so every file it produced would be padded with 00's. I could be wrong though: maybe the compressor itself would pad to $10 for some reason. Either way, it doesn't seem to be a quirk specific to Sonic 1's Z80 driver.

    I guess, while I'm here, I can give a small update to on the project: I've tested my compressor against the Mega CD BIOS's Kosinski-compressed SUB-CPU payload, but it doesn't recompress identically at all. The differences are so massive, in fact, that I genuinely think that the BIOS was made with a different compressor altogether: it doesn't follow the patterns of the other files I've tested with.
     
    Last edited: Apr 20, 2023
    • Like Like x 2
    • Informative Informative x 1
    • List
  9. Brainulator

    Brainulator

    Regular garden-variety member Member
    These files being unassembled is actually consistent with how PCM samples and the Z80 driver itself is handled in the SMPS 68000 source code we have... which makes me wonder whether everything was included as data to be assembled in the original ROMs.

    As for the Z80 driver itself? A part of me wants to speculate that the developers may have relied on the SEGA chant sample being A) a certain size and B) at the very end of the $80000-byte-long ROM (as well as, possibly, C) the compression being imperfect and D) the file being a bunch of stringed-together dc.b directives).
     
  10. Hivebrain

    Hivebrain

    Administrator
    3,063
    184
    43
    53.4N, 1.5W
    Github
    Slightly off-topic question here: Would accurate Nemesis and Enigma compressors be a possibility?
     
  11. Clownacy

    Clownacy

    Tech Member
    1,093
    665
    93
    I think so, but I'm not familiar with their types of compression (both apparently use run-length encoding and Nemesis additionally uses entropy encoding), so I'm not sure how simple it would be.
     
  12. Brainulator

    Brainulator

    Regular garden-variety member Member
    Relevant post from another thread, reposting because it's relevant here (with an additional post referenced for clarity):