Utility Accurate Kosinski compressor

Discussion in 'Engineering & Reverse Engineering' started by Clownacy, Oct 16, 2021.

  1. Clownacy

    Tech Member
    Right now we have many Kosinski compressors, such as the ones in KENS, KENSSharp, mdcomp (a.k.a. KENSC), and clownlzss, but none of them are actually capable of compressing files in the exact same way as Sega's original Kosinski compressor. This means that if you decompressed all of the Kosinski files in a Sonic disassembly, recompressed them, and then built that disassembly, the resulting ROM would not match the original.

    Back in 2018 I started working on a compressor that would remedy this, and produce files identical to those produced by Sega's original compressor. While I did get pretty close, with my compressor being able to recompress most files accurately, there were still a few that differed. In the last few days, however, I've finally cracked it, addressing the last inaccuracy, and I now have a compressor that accurately reproduces every single Kosinski-compressed file in the entire Sonic trilogy.

    This might seem like an odd niche, but it does have its uses. For example, the Sonic disassemblies have to compress their assembled Z80 sound driver code, and if the compressor they use is inaccurate, then the built ROM will be inaccurate. Until not too long ago, we'd been making do with a hacked-up version of KENS that could accurately compress that specific code, but nothing else.

    Another use for something like this is that it would allow for the disassemblies to store all of their assets completely uncompressed, making it simpler for level editors and the like to support them. The disassemblies could then compress the files as part of their build process, similar to what some of the Pokemon disassemblies do.

    Part of the reason that accurate Kosinski compression has been so hard to achieve is that Sega's original compressor was riddled with bugs. For example, while the Kosinski format itself allows for up to 0x100 bytes to be compressed at once, Sega's compressor would only ever compress up to 0xFD bytes. Likewise, while compressed data had a 'range' of 0x2000 bytes, Sega's compressor would only ever reach 0x1F03 at most.
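    To illustrate, a match finder with those buggy limits baked in might look like this (a minimal sketch with my own names and constants, not code from any of the compressors mentioned):

```c
#include <stddef.h>

#define FORMAT_MAX_MATCH   0x100  /* what the Kosinski format allows per match */
#define SEGA_MAX_MATCH     0xFD   /* what Sega's compressor actually emits */
#define FORMAT_MAX_OFFSET  0x2000 /* the format's full sliding-window range */
#define SEGA_MAX_OFFSET    0x1F03 /* the furthest back Sega's compressor looks */

/* Find the longest match for data[pos..] within the (buggy) search window. */
static size_t find_match(const unsigned char *data, size_t pos, size_t size,
                         size_t *match_offset)
{
    size_t best_length = 0;
    const size_t earliest = pos > SEGA_MAX_OFFSET ? pos - SEGA_MAX_OFFSET : 0;

    for (size_t back = pos; back-- > earliest; )
    {
        size_t length = 0;

        /* Note the 0xFD cap, rather than the format's full 0x100. */
        while (pos + length < size
            && length < SEGA_MAX_MATCH
            && data[back + length] == data[pos + length])
            ++length;

        if (length > best_length)
        {
            best_length = length;
            *match_offset = pos - back;
        }
    }

    return best_length;
}
```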

    The final bug is perhaps the most obscure and complicated: the final block of compressed data at the end of a file would often be unusual, in the sense that it wouldn't reference data in the same pattern that the other blocks do. It turns out that the original compressor would accidentally (or, rather, deliberately, as part of a bizarre optimisation) read past the end of the file when scanning for data to compress. Because of this, it would read garbage data beyond the end of the file, and would only match the to-be-compressed data with earlier data in the file that happened to be followed by a pattern of bytes perfectly matching that garbage.

    This might seem like impossible behaviour to recreate, as the garbage data would have been whatever random bytes sat in RAM after the file buffer, which cannot be reproduced. But it turns out that the original compressor actually streamed its file data into a ring buffer and, when the end of the file was reached, simply stopped reading new data into it. This means that the garbage data which the compressor ends up reading is just data from 0x2000 bytes earlier in the file (or a series of 00 bytes, if the file was less than 0x2000 bytes long). This is the inaccuracy that took me over three years to figure out.
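    A minimal sketch of how that behaviour can be emulated (the structure and names here are my own, not Clownacy's actual code): a read past the end of the file resolves to the byte from 0x2000 earlier, or to 00 if the file never filled the ring buffer.

```c
#include <stddef.h>

#define RING_SIZE 0x2000

/* Emulate the original compressor's streaming behaviour: the file was fed
   into a 0x2000-byte ring buffer, and once the file was exhausted, the
   buffer simply stopped being refilled. Reading "past the end" therefore
   yields the byte from 0x2000 earlier, or 0 for files shorter than
   0x2000 bytes (the ring slot was never written). Sketch only. */
static unsigned char byte_at(const unsigned char *file, size_t file_size,
                             size_t position)
{
    if (position < file_size)
        return file[position];

    /* Past the end: the ring slot still holds the byte written one full
       revolution (0x2000 bytes) earlier, if there ever was one. */
    if (position >= RING_SIZE && position - RING_SIZE < file_size)
        return file[position - RING_SIZE];

    return 0;
}
```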

    With this bug correctly emulated, my compressor now passes a test suite of every single bit of Kosinski-compressed data that I could find in Sonic 1, Sonic 2, Sonic 3, and Sonic & Knuckles. I'm hoping to expand this test suite with more 'official' Kosinski files in the future to catch any potential remaining inaccuracies.

    By the way, in case anyone reading this doesn't know, we already have an accurate Saxman compressor, as we've actually managed to find the original Saxman compressor's source code. We haven't had that kind of luck with Kosinski however, which is why an accurate recreation is necessary. Fun fact, it was actually because of this source code that I was able to figure out the final Kosinski bug... because the Saxman compressor does the exact same thing.

    Anyway, you can find my compressor on GitHub. To use it, you can either use the command line, or just drag-and-drop files onto it. The compressed file will be called 'out.kos' for Kosinski, and 'out.kosm' for Moduled Kosinski.

    For those interested in the code, it's written in C99 (but also compiles as valid C++11), and released under the zlib licence, so it should be fairly simple to incorporate into other projects. I hope to convert it to C89 at some point, for extra portability.
     
    Last edited: Oct 19, 2021
  2. MarkeyJester

    A D V A N C E Resident Jester
    Japan
    Fantastic work Clownacy!

    The compression method with 01 in the bitfield and RRRR RRRR RRRR RDDD CCCC CCCC in the data, where the D portion is 111 and the C portion is 0000 0001 ($01), which results in... well, nothing being decompressed, would appear in Sonic 1's chunk data whenever the compression reaches the A000 mark. You say yours recompresses exactly 1:1 to the original Sonic 1 data, so I assume you have this quirk in place too. My question is: would you happen to know why this was always present? When I was working on a perfect Kosinski compressor myself (small world), I noticed this rather weird recurring quirk too, and while it was luckily consistent and repeatable (other games didn't seem to compress data nearly as large), I have yet to figure out what it was. I was assuming some sort of fail-safe marker, or an early attempt at some type of cue system, but A000 is such a specific size to reach. Any ideas?
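    For readers unfamiliar with the encoding being discussed, here is a sketch of decoding a Kosinski full match, following the publicly documented format (a 13-bit offset, a 3-bit count, and an optional extension byte where $00 terminates the stream and $01 is the do-nothing 'dummy'). The exact bit-field notation may differ from the description above; this is illustrative code, not taken from either compressor.

```c
typedef enum { MATCH_COPY, MATCH_DUMMY, MATCH_TERMINATOR } MatchKind;

typedef struct {
    MatchKind kind;
    int offset; /* negative: how far back in the output to copy from */
    int count;  /* number of bytes to copy */
} FullMatch;

/* Decode one full match (descriptor bits %01): bytes %LLLLLLLL %HHHHHCCC.
   Advances *stream past the consumed bytes. */
static FullMatch decode_full_match(const unsigned char **stream)
{
    FullMatch match;
    const unsigned char low = *(*stream)++;
    const unsigned char high = *(*stream)++;

    /* 13-bit negative offset. */
    match.offset = (((high & 0xF8) << 5) | low) - 0x2000;
    match.count = high & 7;

    if (match.count != 0)
    {
        match.count += 2;
        match.kind = MATCH_COPY;
    }
    else
    {
        /* Extended form: a third byte follows. */
        const unsigned char extension = *(*stream)++;

        if (extension == 0)
            match.kind = MATCH_TERMINATOR;
        else if (extension == 1)
            match.kind = MATCH_DUMMY; /* decodes to nothing at all */
        else
        {
            match.kind = MATCH_COPY;
            match.count = extension + 1;
        }
    }

    return match;
}
```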
     
  3. Clownacy

    Tech Member
    I have two ideas, which both assert that those 'dummy matches' were used by a decompressor that ran on PC:

    The first idea is that the dummy matches are a signal to the PC decompressor to allocate more memory for the decompression buffer. So, say you start off with a decompression buffer that's 0xA100 bytes large (the extra 0x100 is to hold the data that crosses the 0xA000 boundary); then, when a dummy match is encountered, it reallocs the buffer to be 0xA000 bytes larger. My decompressor actually did this at one point. By having the compressed data tell the decompressor to allocate more memory, it avoids the need for the decompressor to constantly check whether it needs more memory. But this idea falls apart when you consider that the decompressor likely decompressed straight to disk rather than to a memory buffer, just like the original Saxman decompressor does, so a decompression buffer big enough to hold the entire uncompressed file would not have been necessary.
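    That growth strategy could be sketched like this (entirely speculative, with made-up names; it mirrors the idea, not any known decompressor):

```c
#include <stdlib.h>

#define CHUNK_SIZE 0xA000
#define SLACK      0x100 /* room for data crossing the 0xA000 boundary */

typedef struct {
    unsigned char *data;
    size_t capacity;
} Buffer;

/* Start with enough space for one 0xA000 chunk plus slack. */
static int buffer_init(Buffer *buffer)
{
    buffer->capacity = CHUNK_SIZE + SLACK;
    buffer->data = malloc(buffer->capacity);
    return buffer->data != NULL;
}

/* Hypothetically called whenever a dummy match is encountered:
   grow the buffer by another 0xA000-byte chunk. */
static int buffer_grow(Buffer *buffer)
{
    unsigned char *grown = realloc(buffer->data, buffer->capacity + CHUNK_SIZE);

    if (grown == NULL)
        return 0;

    buffer->data = grown;
    buffer->capacity += CHUNK_SIZE;
    return 1;
}
```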

    My second idea is that it's a file-integrity thing: perhaps the decompressor would expect one of those dummy matches to occur every 0xA000 bytes, and if it didn't, the decompressor would bail and report to the user that the input data is corrupt or not a valid Kosinski archive. But in that case, why is it only every 0xA000 bytes? Wouldn't it make sense for it to be more often? Because of this seemingly arbitrary limit, most files don't even have one of these dummy matches: they're only in Sonic 1's chunks.

    Of course, this is all assuming that a decompressor existed for PC, which I don't think is too far-fetched: the fact that Kosinski's descriptor fields are byteswapped indicates that the format was intended for little-endian CPU architectures like x86 rather than the big-endian 68k.

    As for why these dummy matches occur specifically every 0xA000 bytes, I'm afraid I have no idea. 0xA000 is 40KiB, but I can't think of anything significant about that number. Hard drive sizes seem to have been in the tens of megabytes around 1990, so if Kosinski was actually originally a PC compression format, then perhaps it was also intended for compressing much larger data than it ended up being used for on the Mega Drive?
     
    Last edited: Oct 16, 2021
  4. Sonic Hachelle-Bee

    Taking a Sand Shower Tech Member
    Lyon, France
    Sonic 2 Long Version
    Please, yes. This is definitely a step in the right direction. All compressed files shall be generated at compile time.

    Very good work Clownacy!
     
  5. Hivebrain

    Administrator
    53.4N, 1.5W
    HiveView
    Good stuff. I've implemented it in my disassembly, if that's okay. Seems to work perfectly, although I'm not sure what to do with the Z80 driver, which requires specific bytes to point to the Sega PCM file.
     
  6. Clownacy

    Tech Member
    It's absolutely okay: the compressor is released under a licence that allows you to freely use, modify and redistribute it, so long as a few conditions are met. Your usage meets those requirements, so it's perfectly fine. You can read the licence here.
     
  7. MarkeyJester

    A D V A N C E Resident Jester
    Japan
    In the original game, the compressed Z80 driver is followed by 10 ($A) bytes of 00. This works out such that the compressed Z80 and the 00s add up to a multiple of $20 bytes.

    So... you assemble the Z80 driver with the PCM pointer/size set to 0, compress it, and include it in the main source with extra padding up to the next $20 multiple. Assemble the main ROM, acquire the Sega PCM address, reassemble the Z80 driver with the new pointer, recompress it, then inject it into the ROM (where the padded 00 bytes give you enough leeway in case the compression ends up slightly larger than before). I am convinced this is how they did theirs...
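    The padding rule works out to rounding the compressed size up to the next multiple of $20. As a sketch (helper names are my own):

```c
#include <stddef.h>

/* Round a compressed Z80 driver's size up to the next multiple of $20,
   leaving zero-byte slack for the second-pass recompression to grow into. */
static size_t padded_size(size_t compressed_size)
{
    return (compressed_size + 0x1F) & ~(size_t)0x1F;
}

/* Number of 00 bytes to append after the compressed data. */
static size_t padding_bytes(size_t compressed_size)
{
    return padded_size(compressed_size) - compressed_size;
}
```

    For example, a compressed size ending in $16 modulo $20 yields exactly the $A bytes of 00 padding seen in the original game.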