Utility clownnemesis - An Accurate Nemesis Compressor

Clownacy · Feb 18, 2024

I've cracked it! It turns out that Sega's compressor was using a selection sort! The thing about selection sort is that it's an "unstable" sorting algorithm, which means that values which are neither greater nor less than each other end up shuffled-around! Previously I was using a bubble sort, which is a stable algorithm.

I now have every Nemesis file from Sonic 1 and Sonic 2 compressing identically. Testing Sonic 3 & Knuckles is just a matter of tediously putting together a list of every single one of its Nemesis files.

The compressor can be downloaded here.

Combined with my accurate Kosinski compressor and the rediscovered original Saxman compressor, 3 out of the 4 main Sonic compression formats now have accurate compressors!

Brainulator said:

Clownacy said:

While the codes themselves are the same
Click to expand...

Hmm... I see that this is Cucky's art in Sonic 1, and there's a code for nybble B of length 1 that is absent in the recompressed version, replaced by a nybble 6 of length 2. This looks to be more than a sorting thing, though minor nonetheless.
Click to expand...

You were right to notice that: that was the result of an oversight in my compressor. It's fixed now.

Clownacy · Feb 18, 2024

(Cross-post)

Making my Nemesis Compressor Accurate

While it was nice to create a compressor that surpassed Sega's own, what I wanted more was to create a compressor that perfectly mimicked it.

When making a disassembly or decompilation of a game, assets should ideally be stored in a user-friendly format; the assets should be easy to view and easy to modify. However, often assets are left in their original compressed form. The reason for this is that the disassembly/decompilation is aiming to produce an exact copy of the game, down to matching binary, but a recompressed file would have different binary. By not decompressing the assets, the need for recompression is bypassed, avoiding this problem.

There are usually multiple ways to encode data in a given compression format. Some ways are better than others, such as by being faster to compress, faster to decompress, using the least RAM, or producing the smallest compressed data. Other ways aren't necessarily better, but simply the result of the inner workings of the compressor, such as the order in which it does tasks, the sorting algorithm it uses, and even things as minute as whether it uses '<' or '<=' to check the order of two integers. Because of this, it is unlikely for two compressors to produce the exact same compressed data.

When I was creating my Nemesis compressor, I was specifically aiming to mimic the behaviour, and thus output, of Sega's compressor. I noticed early on that the code table was arranged first by order of the matching nybble run's length and then by the value of the nybble itself, so I replicated that behaviour in my compressor. Later, I figured out that the codes were generated through Fano coding, so I implemented that too. But, even with the compressor complete, with both of these behaviours accounted for, the compressed output differed. Not long afterwards, I noticed that nybble runs with a length of less than 3 were never assigned a code, and mimicking this brought the data closer, but still without matching it exactly.

At this point, the differences in the data were quite subtle: the codes were the same, the size of the data was similar, the inlined data was the same, but the assignment of codes to nybble runs was different. The original compressor appeared to perform an optimisation that mine did not: it was somehow deviating from the usual Fano coding to assign the shorter codes to the most common nybble runs. The nature of Fano coding makes the occasional assignment of shorter codes to less common nybbles unavoidable, so I mistook this to mean that Sega's compressor was not using Fano coding, and instead something more efficient like Huffman coding.

After implementing Huffman coding, I found that it was far less accurate than Fano coding ever was, so I switched focus to making Fano more optimal. Eventually I discovered that Sega's compressor was not deviating from standard Fano coding, but rather it was performing some post-processing. In order to ensure that the most common nybble runs were assigned the shortest codes, it used a sorting algorithm to reassign the codes after Fano coding had already assigned them.

I also found that, unlike the code table, codes were assigned to nybble runs first in the order of run length first and second in the order of the run nybble. Fixing this simply requiring swapping the order of two nested for-loops.

But, even with both of these implemented, the compressed data was still not exactly the same. While the shortest codes correctly mapped to the most common nybble runs, codes of the same size would be assigned to different nybble runs. The fact that differences in ordering only applied to codes of the same size was enough of a clue for me to tell that this was a quirk of the compressor's sorting algorithm.

Sorting algorithms can be divided into two groups: stable and unstable. Stable sorting algorithms will sort the data without changing the order of values that are equal, whereas unstable algorithms will. My compressor was using a stable sorting algorithm, and so all codes with the same length were in the same order both before and after sorting. For the output of Sega's compressor to have the codes in a different order meant that it was using an unstable algorithm.

But how was I supposed to know which one it was? Every unstable sorting algorithm affects equal values in different ways, so I had to make my compressor use the exact same one that Sega's compressor used. I could have figured it out through trial-and-error, implementing every sorting algorithm and checking which one produced the correct output, but I did not want to do that as sorting algorithms can be very complex and hard to write. Instead, I figured that I would simply examine the data and figure out what sorting algorithm could create it.

Here are some codes generated by Sega's compressor:

Code (Text):

Code 1110110 encodes nybble A of length 2 and was seen 19 times.

Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.

Code 11101111 encodes nybble 8 of length 3 and was seen 16 times.

Code 1111000 encodes nybble F of length 3 and was seen 19 times.

Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.

Code 11110011 encodes nybble D of length 2 and was seen 16 times.

Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

And here are those same codes generated by my compressor:

Code (Text):

Code 1110110 encodes nybble A of length 2 and was seen 19 times.

Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.

Code 11101111 encodes nybble 7 of length 3 and was seen 17 times.

Code 1111000 encodes nybble F of length 3 and was seen 19 times.

Code 11110010 encodes nybble D of length 2 and was seen 16 times.

Code 11110011 encodes nybble 8 of length 3 and was seen 16 times.

Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

As expected, the shortest codes are assigned to the most common nybble runs, and the ordering of the 8-bit codes differ. Nybble runs 7-3, 8-3, and D-2 are in the wrong order.

Across a variety of compressed data, I noticed a pattern: the ordering of equal nybble runs would only ever differ if those nybble runs were sandwiched between codes of a shorter length. This suggested that the the reordering was a side-effect of moving a nybble run in order to sort it.

This is what the codes are before they are sorted:

Code (Text):

Code 1110110 encodes nybble A of length 2 and was seen 19 times.

Code 11101110 encodes nybble F of length 3 and was seen 19 times.

Code 11101111 encodes nybble 8 of length 5 and was seen 19 times.

Code 1111000 encodes nybble 0 of length 6 and was seen 18 times.

Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.

Code 11110011 encodes nybble D of length 2 and was seen 16 times.

Code 1111010 encodes nybble 8 of length 3 and was seen 16 times.

Sorting is needed so that the nybble runs F-3 and 8-5 are assigned to codes 1111000 and 1111010, respectively. In my compressor, this is done without disturbing the other codes, while in Sega's compressor this is not.

In a moment of serendipity, I tried sorting the data by hand by swapping each nybble with its destination. I swapped F-3 with 0-6, and 8-5 with 8-3, resulting in this:

Code (Text):

Code 1110110 encodes nybble A of length 2 and was seen 19 times.

Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.

Code 11101111 encodes nybble 8 of length 3 and was seen 16 times.

Code 1111000 encodes nybble F of length 3 and was seen 19 times.

Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.

Code 11110011 encodes nybble D of length 2 and was seen 16 times.

Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

This exactly matched the Sega-compressed data!

I looked-into various unstable sorting algorithms, searching for one that worked by swapping in this way, and I found selection sort. After making my compressor sort its codes using this algorithm, I was delighted to see that the nybble runs were in the same order! I had finally cracked it!

Recompressing other data exposed a few bugs which were affecting accuracy, as well as some edge-cases that both compressors handled differently. For instance, when deciding whether to use XOR mode or not, Sega's compressor would only compare the number of bytes used, rather than bits. When XOR mode and non-XOR mode use the exact same number of bytes, then non-XOR mode is preferred. Additionally, when performing its binary chop, the Fano code generator does not round when dividing by 2.

With these issues addressed, my compressor was able to perfectly recompress every bit of Nemesis-compressed data in the first two Sonic the Hedgehog games!

And that brings us to now. My compressor is able to both mimic Sega's compressor as well as surpass it. I've polished-up the code and released a Windows executable on GitHub. Being written in ANSI C (C89), I was able to compile it with Visual Studio 6, so it should run even on ancient versions of Windows. Hooray for portability.

A use-case for this compressor is Hivebrain's Sonic 1 disassembly. Years ago, my accurate Kosinski compressor was integrated into it, allowing some of the game's assets to be stored uncompressed. With my Nemesis compressor, this can finally be extended to the majority of the game's assets!

With this compressor, my accurate Kosinski compressor, and the rediscovered original Saxman compressor, 3 out of 4 of the Sonic compression formats have accurate compressors! All that remains now is Enigma.

OrionNavattan · Feb 18, 2024

Clownacy said:

(Cross-post)

Making my Nemesis Compressor Accurate

While it was nice to create a compressor that surpassed Sega's own, what I wanted more was to create a compressor that perfectly mimicked it.

When making a disassembly or decompilation of a game, assets should ideally be stored in a user-friendly format; the assets should be easy to view and easy to modify. However, often assets are left in their original compressed form. The reason for this is that the disassembly/decompilation is aiming to produce an exact copy of the game, down to matching binary, but a recompressed file would have different binary. By not decompressing the assets, the need for recompression is bypassed, avoiding this problem.

There are usually multiple ways to encode data in a given compression format. Some ways are better than others, such as by being faster to compress, faster to decompress, using the least RAM, or producing the smallest compressed data. Other ways aren't necessarily better, but simply the result of the inner workings of the compressor, such as the order in which it does tasks, the sorting algorithm it uses, and even things as minute as whether it uses '<' or '<=' to check the order of two integers. Because of this, it is unlikely for two compressors to produce the exact same compressed data.

When I was creating my Nemesis compressor, I was specifically aiming to mimic the behaviour, and thus output, of Sega's compressor. I noticed early on that the code table was arranged first by order of the matching nybble run's length and then by the value of the nybble itself, so I replicated that behaviour in my compressor. Later, I figured out that the codes were generated through Fano coding, so I implemented that too. But, even with the compressor complete, with both of these behaviours accounted for, the compressed output differed. Not long afterwards, I noticed that nybble runs with a length of less than 3 were never assigned a code, and mimicking this brought the data closer, but still without matching it exactly.

At this point, the differences in the data were quite subtle: the codes were the same, the size of the data was similar, the inlined data was the same, but the assignment of codes to nybble runs was different. The original compressor appeared to perform an optimisation that mine did not: it was somehow deviating from the usual Fano coding to assign the shorter codes to the most common nybble runs. The nature of Fano coding makes the occasional assignment of shorter codes to less common nybbles unavoidable, so I mistook this to mean that Sega's compressor was not using Fano coding, and instead something more efficient like Huffman coding.

After implementing Huffman coding, I found that it was far less accurate than Fano coding ever was, so I switched focus to making Fano more optimal. Eventually I discovered that Sega's compressor was not deviating from standard Fano coding, but rather it was performing some post-processing. In order to ensure that the most common nybble runs were assigned the shortest codes, it used a sorting algorithm to reassign the codes after Fano coding had already assigned them.

I also found that, unlike the code table, codes were assigned to nybble runs first in the order of run length first and second in the order of the run nybble. Fixing this simply requiring swapping the order of two nested for-loops.

But, even with both of these implemented, the compressed data was still not exactly the same. While the shortest codes correctly mapped to the most common nybble runs, codes of the same size would be assigned to different nybble runs. The fact that differences in ordering only applied to codes of the same size was enough of a clue for me to tell that this was a quirk of the compressor's sorting algorithm.

Sorting algorithms can be divided into two groups: stable and unstable. Stable sorting algorithms will sort the data without changing the order of values that are equal, whereas unstable algorithms will. My compressor was using a stable sorting algorithm, and so all codes with the same length were in the same order both before and after sorting. For the output of Sega's compressor to have the codes in a different order meant that it was using an unstable algorithm.

But how was I supposed to know which one it was? Every unstable sorting algorithm affects equal values in different ways, so I had to make my compressor use the exact same one that Sega's compressor used. I could have figured it out through trial-and-error, implementing every sorting algorithm and checking which one produced the correct output, but I did not want to do that as sorting algorithms can be very complex and hard to write. Instead, I figured that I would simply examine the data and figure out what sorting algorithm could create it.

Here are some codes generated by Sega's compressor:

Code (Text):

Code 1110110 encodes nybble A of length 2 and was seen 19 times.

Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.

Code 11101111 encodes nybble 8 of length 3 and was seen 16 times.

Code 1111000 encodes nybble F of length 3 and was seen 19 times.

Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.

Code 11110011 encodes nybble D of length 2 and was seen 16 times.

Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

And here are those same codes generated by my compressor:

Code (Text):

Code 1110110 encodes nybble A of length 2 and was seen 19 times.

Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.

Code 11101111 encodes nybble 7 of length 3 and was seen 17 times.

Code 1111000 encodes nybble F of length 3 and was seen 19 times.

Code 11110010 encodes nybble D of length 2 and was seen 16 times.

Code 11110011 encodes nybble 8 of length 3 and was seen 16 times.

Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

As expected, the shortest codes are assigned to the most common nybble runs, and the ordering of the 8-bit codes differ. Nybble runs 7-3, 8-3, and D-2 are in the wrong order.

Across a variety of compressed data, I noticed a pattern: the ordering of equal nybble runs would only ever differ if those nybble runs were sandwiched between codes of a shorter length. This suggested that the the reordering was a side-effect of moving a nybble run in order to sort it.

This is what the codes are before they are sorted:

Code (Text):

Code 1110110 encodes nybble A of length 2 and was seen 19 times.

Code 11101110 encodes nybble F of length 3 and was seen 19 times.

Code 11101111 encodes nybble 8 of length 5 and was seen 19 times.

Code 1111000 encodes nybble 0 of length 6 and was seen 18 times.

Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.

Code 11110011 encodes nybble D of length 2 and was seen 16 times.

Code 1111010 encodes nybble 8 of length 3 and was seen 16 times.

Sorting is needed so that the nybble runs F-3 and 8-5 are assigned to codes 1111000 and 1111010, respectively. In my compressor, this is done without disturbing the other codes, while in Sega's compressor this is not.

In a moment of serendipity, I tried sorting the data by hand by swapping each nybble with its destination. I swapped F-3 with 0-6, and 8-5 with 8-3, resulting in this:

Code (Text):

Code 1110110 encodes nybble A of length 2 and was seen 19 times.

Code 11101110 encodes nybble 0 of length 6 and was seen 18 times.

Code 11101111 encodes nybble 8 of length 3 and was seen 16 times.

Code 1111000 encodes nybble F of length 3 and was seen 19 times.

Code 11110010 encodes nybble 7 of length 3 and was seen 17 times.

Code 11110011 encodes nybble D of length 2 and was seen 16 times.

Code 1111010 encodes nybble 8 of length 5 and was seen 19 times.

This exactly matched the Sega-compressed data!

I looked-into various unstable sorting algorithms, searching for one that worked by swapping in this way, and I found selection sort. After making my compressor sort its codes using this algorithm, I was delighted to see that the nybble runs were in the same order! I had finally cracked it!

Recompressing other data exposed a few bugs which were affecting accuracy, as well as some edge-cases that both compressors handled differently. For instance, when deciding whether to use XOR mode or not, Sega's compressor would only compare the number of bytes used, rather than bits. When XOR mode and non-XOR mode use the exact same number of bytes, then non-XOR mode is preferred. Additionally, when performing its binary chop, the Fano code generator does not round when dividing by 2.

With these issues addressed, my compressor was able to perfectly recompress every bit of Nemesis-compressed data in the first two Sonic the Hedgehog games!

And that brings us to now. My compressor is able to both mimic Sega's compressor as well as surpass it. I've polished-up the code and released a Windows executable on GitHub. Being written in ANSI C (C89), I was able to compile it with Visual Studio 6, so it should run even on ancient versions of Windows. Hooray for portability.

A use-case for this compressor is Hivebrain's Sonic 1 disassembly. Years ago, my accurate Kosinski compressor was integrated into it, allowing some of the game's assets to be stored uncompressed. With my Nemesis compressor, this can finally be extended to the majority of the game's assets!

With this compressor, my accurate Kosinski compressor, and the rediscovered original Saxman compressor, 3 out of 4 of the Sonic compression formats have accurate compressors! All that remains now is Enigma.
Click to expand...

As always, I appreciate your in-depth explanations. Will be adding this to my Sonic 2 disasm in short order.

Hivebrain · Feb 18, 2024

Very nice work. I'll add it to my disassembly later.

Hivebrain · Feb 20, 2024

This isn't the fault of your compressor, but several of the nem files in Sonic 1 had two bytes of padding for no particular reason. I had to manually insert them after the incbin after recompression. Something to be aware of when making a byte-perfect disassembly.

Clownacy · Feb 20, 2024

Hivebrain said:

This isn't the fault of your compressor, but several of the nem files in Sonic 1 had two bytes of padding for no particular reason. I had to manually insert them after the incbin after recompression. Something to be aware of when making a byte-perfect disassembly.
Click to expand...

I noticed that while verifying my compressor against the Git disassembly. From what I can tell, those padding bytes are actually empty, unreferenced Nemesis files. There's no pattern to when they appear, so them being padding bytes doesn't make sense.

Clownacy · Feb 22, 2024

Sorry, I was wrong about there being empty Nemesis files. What is actually happening is that Sega's Nemesis compressor had a slight bug: normally, upon reaching the end of the uncompressed data, the compressor will write a single byte containing the remaining unflushed bitstream bits. If there are no unflushed bits, however, then there is no point in emitting this byte. Sega's compressor failed to account for this, and so it occasionally emits a useless empty byte at the end of its data. The second byte is caused by this junk byte increasing the data's size to an odd number of bytes, leading to the following 'even' macro inserting a byte to make it even.

I have fixed this and another inaccuracy that would cause S3K's Buggernaut tiles to compress inaccurately. An update that contains these fixes can be found here.

I have extended the compressor's test suite to include Sonic 3 & Knuckles. Now every(?) Nemesis file in that game has been tested to recompress accurately also.

Brainulator · Feb 22, 2024

I had a feeling it was a bug. I've noticed a fair few files in the Sonic 2 Nick Arcade prototype do this too, only that ROM has a little something extra: the names of every art file after the main chunk of code. No labels for would-be empty Nemesis files can be found.

Something else worth noting, not that I suspect any real-world use of your program would need this: in Sonic & Knuckles Collection, the first word of data is little-endian, as it runs on an x86, but everything else seems to be the same. (The same thing happens with Kosinski Moduled files.)

Utility clownnemesis - An Accurate Nemesis Compressor

Useful Searches