Optimized KosDec and NemDec, considerably faster decompression

Discussion in 'Engineering & Reverse Engineering' started by vladikcomper, Nov 4, 2013.

  1. vladikcomper

    vladikcomper

    Tech Member
    183
    5
    18
    Russia
    Sonic Warped
    A much faster Kosinski decompressor for the 68k

    Kosinski is a variant of LZSS compression algorithm, used in various Sega Genesis games. Being much faster than Nemesis algorithm in decompression, this format also provides a fairly better compression ratio on data with frequently repeated sequences of bytes. The decompression code for the 68k found in many Genesis games seems identical. Its decompression routine is known as 'KosDec' in most of Sonic diassemblies. Although written in quite a wise and accurate manner, it still suffers from some optimization errors, both on logic and instructions side of things.

    I started looking into the ways of reducing decompression times after once I attempted decompressing big blocks of data on the fly. The process was time-critical and the original decompressor was doing pretty well, but not as well as I wanted it to be: it was kind of balancing on the edge -- I needed just a tiny bit faster processing to get rid of unwanted slowdown. I analyzed the whole decompressor's code back then in order to optimize it and get the best out of it. I figured few tricks which allowed me about 5% faster decompression which was enough back at the time.

    Later I did realize that the decompression routine can be optimized even further, but more fundamental changes should be made to it. I then decided to create my own implementation of Kosinski decompressor by applying all the optimization tricks I planned and more. It was doing the same thing, but in a different way, using a different, more efficient code logic.

    The result overcame all my expectations, the very first revision of my own Kosinski decompressor was overall 40% faster than the original one. Later on, MarkeyJester and flamewing have suggested a few more ideas, which helped to optimize it even further. Initially, I didn't plan to use some extreme optimization tricks like look-up tables (LUTs) in order to simplify calculations, but flamewing's test proved to give it a good performance increase that I didn't expect, which quite makes up for a larger decompression code.

    Now, let me present you, an extremely fast Kosinski decompressor!

    http://pastebin.com/EtYf1vz6

    The performance increase compared to the original decompressor is unreal. Believe or not, it works more than 1.5 times faster. I've set several tests to measure new compressor's effectiveness. Here's a test involving decompressing the original EHZ art from Sonic 2:

    [​IMG] [​IMG]

    Using the new decompressor will not only greatly reduce loading times on any Kosinski-compressed art, but also provide you more freedom with "on the fly" decompression, since you will be able to decompress nearly twice as much data within one TV-frame.

    An optimized Nemesis decompressor for the 68k

    Unlike Kosinski, Nemesis is an entropy-based compression method. Due to this method's nature, characters in the compressed stream have variable size in bits, making decompression process way more time-consuming: it is required to read compressed stream bit by bit and analyze it to decode characters. Although compression was rather high, it took about 10 times slower than Kosinski to decompress. The latter also provided quite a close compression ratio, which was even higher for many files with frequently repeated byte sequences. That was the reason Sonic Team started slowly migrating from this compression format starting from Sonic 2. Most of the compressed art in Sonic 3K was in Kosinski format already.

    I attempted to optimize the Nemesis decompression routine, known as 'NemDec' shortly after my success with the Kosinski decoder. However, even given the fact that Nemesis decompression was terribly slow, their original code was surprisingly well-designed. The algorithm was quite complex and used nearly all the registers processor provided, leaving me almost no gaps for usual optimization tricks.

    I've managed to pull just a few optimization tricks, but this didn't make a considerable improvement though. Anyways, a little optimization over the original version was reached, which, considering how long the decompression takes, helps it to save up to 4 frames on decompressing a single level art file in Sonic 1.

    http://pastebin.com/cAJSJ0np

    The performance increase is about 5% as seen below. This test program decompressed the original GHZ art from Sonic 1. The optimized version works 3 frames faster:

    [​IMG] [​IMG]

    * * *

    Also, stay tuned! More on the Kosinski and Nemesis compression shall come soon.
     
  2. Cinossu

    Cinossu

    Administrator
    2,820
    8
    18
    London, UK
    Sonic the Hedgehog Extended Edition
    Highly impressive, and highly useful. I shall definitely be making use of the two of these, ta very much. :>
    I would definitely say document this on the wiki under either a SCHG general tutorial of sorts or under the individual compression method pages.
     
  3. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,138
    0
    16
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    Part two is by me, so here it goes!

    Improved compressors and decompressors
    When we were working on the above decompressor, vladikcomper mentioned that he thought that KENS' LZSS encoder did not result in optimal sizes. He felt that if you broke some runs earlier, you could maybe gain better compression. I went with that idea, and did some research. As in reading scholarly computer science papers. It did not take long to see that this was true, and there were methods to make optimum compression. They were behind a paywall; and when I did get to them, they were a rather simple report on a recent conference. So I kept digging more and more. I found more articles, one of which mapped LZSS compression to graph theory. If you are interested in knowing more, I added lots of comments explaining what is going on in the source code (see below).

    So I set about to writing a perfect compression Kosinski implementation. But rather than do just that, I made a framework for perfect compression of LZSS, which I proceeded to apply to Kosinski, Saxman and Comper (an LZSS-based format created by vladikcomper that has decompression speed as the primary goal).

    Some folks know that I have also written a much improved Nemesis encoder too; FraGag made a version in C# (which has some undiagnosed issues) for MainMemory. This compressor has vastly improved compression over old KENS, but it sadly does not have perfect compression -- I have yet to devise a non-brute force way to get past the last remaining barrier to perfect compression. The brute-force solution is not acceptable here because it takes FAR too long.

    Anyway: you can get the Windows binaries here; you can get the source code along with that of the S2-SSEdit (here); the tools compile and install by default. These command line tools use a KENS-like interface. I will eventually switch to a KENSSharp-like interface, but for now:
    Code (Text):
    1. To compress:   <tool> [options] input output
    2. To decompress: <tool> [options] -x <input> <output>
    3. To recompress: <tool> [options] -c <input> <output>
     
  4. Hitaxas

    Hitaxas

    Retro 80's themed Paladins Twich streamer Member
    In the 2007 version of Sonic 2, which yes, I still use... I cannot get the kosdec routine to work due to the macro. I get this error with @skip\@ telling me it's an invalid symbol. and if I try to change it to anything else it gives me an error saying, error: open macro definition.
     
  5. muteKi

    muteKi

    Fuck it Member
    7,575
    2
    18
    Do you have the LZSS/graph theoretic paper still handy? I'd like to take a look at what's going on in that one.
     
  6. MainMemory

    MainMemory

    Have no fear...Amy Rose is here! Tech Member
    4,412
    67
    28
    SonLVL
    This code was written with ASM68K syntax in mind, so two changes need to be made for it to work with AS:
    1. The "@skip\@" label needs to be replaced with either "skip" or "+".
    2. The "endr" directive under KosDec_largecopy needs to be replaced with "endm" (AS uses endm for both macro and rept structures).
    This version works for AS.
     
  7. Hitaxas

    Hitaxas

    Retro 80's themed Paladins Twich streamer Member
    I did try the first change, which lead to the Open macro error. number 2 fixed the issue up for me. It didn't show up as an error when building, and it was easy to overlook that.

    Thank you, it is building and running perfectly now. :)
     
  8. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,138
    0
    16
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    Hitaxas: since you probably are To anyone that is using this in S3&K, you should also replace the decoder included in the KosQueue for faster KosM decoding — the KosM decoder includes a copy of the Kosisnki decoder so it can detect if it needs to set a bookmark during vint. You probably won't need to include the lookup table again, unless you want to.

    muteKi: I was thinking of writing an explanation myself, including nice graphics; I ended up not doing it because vladikcomper dissuaded me with the argument that probably nobody would be interested. If you give me a couple days, I can write it up — the actual articles involved are rather dry and technical.
     
  9. RetroKoH

    RetroKoH

    Member
    1,657
    11
    18
    Project Sonic 8x16
    I'll definitely get on this today. I wonder... provided either flamewing or vladikomper could walk me through a bit of how it works... if I could modify the .DLL's to share the same improvements. NOTE: I use the OLD KENS.dll's for Triad, as there doesn't seem to be any support for the newer files for Game Maker implementation.
     
  10. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,138
    0
    16
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    I do have commit privileges on KENS repository -- but when I tried using the "new" DLLs (which already have the improved Nemesis and Enigma encoders, but not the improved Kosinski and Saxman ones) with SonMapEd, it failed badly. I am guessing that kram1024 didn't want to/didn't know how to preserve binary compatibility. I am not sure I know how myself.
     
  11. RetroKoH

    RetroKoH

    Member
    1,657
    11
    18
    Project Sonic 8x16
    My issue with a DLL is that it will need to have functionality with Game Maker... I could look into studying a .dll via IDA Pro and try to figure it out, but I wouldn't know where to begin. NEver touched a .dll before.
     
  12. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,138
    0
    16
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    As promised, here is the LZSS->Graph write-up.

    First, some background/terminology on LZSS: for each byte in the input stream to compress, LZSS has a sliding window, which is composed of a number of bytes that came before that byte, and a look-ahead buffer, which is composed of the byte in question and a number of bytes that come after it. For Kosinski, the sliding window is composed of at most the 8192 bytes that came before the current byte, and the lookahead buffer is composed of the 256 bytes that start from the current byte.

    LZSS compresses by looking at matches in the sliding window (starting in the sliding window) and the lookahead buffer; if there is no match (the very first byte in the file, for example), the value is encoded as a literal byte; otherwise, the bytes in the lookahead that had a contiguous match in the sliding window are encoded as a (offset,length) pair that indicates how far behind the byte the match was (offset) and how long it is (length). There are descriptor fields in the file that are a set of bit flags; these descriptor fields marks which bytes are literal bytes and which ones are (offset,length) pairs. In general, there will be multiple matches from a given byte with different length and/or offset.

    Kosinski has 3 different match encodings of different sizes; shorter matches closer to the current byte take up less space, while longer matches compress better; there is a trade-off in which one to use at any given time. Saxman has an additional mode that allows it to generate a sequence of zeros without out of nowhere, which can reduce file size in some cases.

    Now for some primer in graphs. This is a basic overview; more can be seen here. A graph is a set of vertices or nodes, plus a set of edges that connect these nodes. There are several categories of graphs: undirected versus directed; with cycles vs without cycles (acyclic); weighted vs unweighted. The image in the spoiler tags at the end of the post has illustrations of each kind; here is a direct link to it. Graphs are very simple structures; but they are also very powerful. They are everywhere in computer science, from the routing of packages on the Internet to compilers.

    But moving on: how do we go from bytes in a file to a graph and how does it help with LZSS? To answer this, we will build the LZSS graph of a file (I don't think anyone ever called it like this in the papers I read; it just feels appropriate). To do that, we need to build the set of nodes and the set of edges connecting them. The LZSS graph is a directed weighted graph. It will have more properties, but I will talk about them below.

    To build the nodes:
    1. Add one node for each byte in the file; label this node with the (0-based) position where the byte appears in the file. So the first byte in the file becomes node 0.
    2. Add one additional node for the end-of-file; I will refer to it as "the EOF node", but it should be labeled with the length of the file.
    As an implementation detail, these nodes don't need to be actually created; we will only need the edges for LZSS.

    To build the edges:
    1. EOF has no outgoing edges; so it is excluded from the steps below except as a target for an edge.
    2. For each nodes U, add an edge from U to U+1; this node's weight is the cost, in bits, of encoding the corresponding byte as a literal match.
    3. For each node U, add an edge from U to a node V > U if there is an LZSS match that encodes all bytes from U to V-1. Since there are multiple LZSS matches that cover this range of bytes, pick the one that takes the least amount of bits to encode the match; this cost is the weight of this edge.
    4. For Saxman, if the current byte and the next several bytes (at least 2) are zeros, then add an additional edge between these two nodes with the weight of encoding this match (or replace the cost of the existing edge if it had a higher cost).
    To sum it up, two nodes U, V with U < V have an edge between them if you have a match for all (V - U) bytes starting in U and ending in V-1; that is, each edge starts at the first byte of a compressed sequence of bytes and ends at the next byte that needs to be compressed. The weight of this edge is the smallest size, in bits, that is needed to encode this compressed sequence. As far as implementations go, the (offset,length) pair of each edge should also be saved as attributes of each edge so they can be used later.

    This gives us a graph that (a) has no self-loops; (b) has no edges from a node to an earlier node; © only has positive edge costs. (a) and (b) means we have no cycles, and they combine to make this a directed acyclic graph (DAG) which is already topologically sorted.

    Now to put this to use for LZSS: given a path from node 0 to EOF, if we sum the costs of each edge in this path, we will have the size of the compressed file*. So we just need to find the shortest path from node 0 to EOF and we have the lowest file size*. Finding the shortest path in a topologically-sorted DAG is trivial, and can be done even as the graph is built (I don't know why I didn't do it in the version I released...).

    * Well, almost; this gives size in bits; not only we need to round it up to bytes, but there is an additional thorny issue on some LZSS formats, such as Kosinski. Kosinski suffers from what I call "early descriptor loading": whenever Kosinski has used up the last bit in a descriptor field, it immediately loads a new descriptor field (Saxman and Comper only do that when they need a new bit). Thus, the end-of-compression token in Kosinski (which needs 2 bits in the descriptor field) may end up causing a whole new descriptor field (the whole 16 bits) to be printed, sometimes with only 1 bit used, other times with 0 bits used. My encoder also handles this case.

    Hopefully, this is clear enough; if not, please ask away and I will improve the explanation.

    Edit: something I forgot to mention: I thought of generating the graph for a small file as it was being Kosinski-compressed to show here; it was so large that it would take an hour to upload, and that is for the smallest non-trivial file I tried.

    [​IMG]
     
  13. vladikcomper

    vladikcomper

    Tech Member
    183
    5
    18
    Russia
    Sonic Warped
    I should've posted this earlier along with flamewing's first post in this topic, but unfortunately real life events got me busy. As you might notice, the newly released compressors by flamewing also included a new compression format, Comper. So, here it comes...

    Comper Compression

    Comper is a new compression format, developed by me while I was working on a small proof-of-concept a while ago. My goal was to provide the best compression ratio possible along with faster decompression speeds that no other algorithms offered before. This way, large amounts of data can be loaded "on the fly" within one frame, giving no slowdown. Therefore, the compressed data can be operated as easily as uncompressed.

    Here's a little demonstration to show this format's potential. I've compressed the original GHZ art from Sonic 1 with Nemesis and Comper algorithms and compared both compression effectiveness and decompression times. The recently released flamewing's compressors were used to guarantee the best compression ratio for Nemesis so far.

    Code (Text):
    1. Uncompressed art: 25.9 KB
    2. Nemesis-compressed art: 10.3 KB
    3. Comper-compressed art: 12.6 KB
    [​IMG] [​IMG]

    Resulted in just 2.3 KB larger file, Comper-compressed art was decompressed 10 times faster than Nemesis, which further prooves that Nemesis has a terrible compression/speed ratio. In some rare cases Comper may even overcome Nemesis's compression ratio, just like Kosinski. But in most of the cases, obviously, Nemesis has more efficient compression due to algorithm complexity and times it takes to decompress. While Comper compression had decompression speed as its primary goal, however, I was surprised how close it could get to complex and "heavy" compressions like Nemesis, like in my particular example with GHZ art.

    The decompression speed Comper has is so high that you may even be able to decompress small tilesets in-game on the fly, as if it was uncompressed art. The decompressor's code is also small and compact just like the format itself -- it's pretty simple, but effective.

    Decompressor's code (ASM68K version): http://pastebin.com/aMrfKTbz
    Decompressor's code (AS version): http://pastebin.com/fVp8d5Fr

    (Yeah, and sorry for not doing AS versions of KosDec and NemDec, I completely forgot. =X)
     
  14. Puto

    Puto

    Shin'ichi Kudō, detective. Tech Member
    2,013
    0
    16
    Portugal, Oeiras
    Part of Team Megamix, but haven't done any actual work in ages.
    Curious, would this thing be feasible to compress Sonic's art (or any other DPLCable art)?
     
  15. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,138
    0
    16
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    Unless you can fit all the decompressed art in main RAM, it would not be suitable for that: you can't start decompressing a part of the file until you have decompressed everything before it, which is not very good for DPLCs. If you were to break the file into the contiguous chunks of art for each DPLC frame, the compression ratio would probably be bad as there is not enough space to find good matches.

    For DPLCs, a better choice would probably be Sik's UFTC format -- or would be, except FileDen seems to have gone the way of the dodo. Its compression ratio is much worse, but it allows decompression of specific tiles.
     
  16. Puto

    Puto

    Shin'ichi Kudō, detective. Tech Member
    2,013
    0
    16
    Portugal, Oeiras
    Part of Team Megamix, but haven't done any actual work in ages.
    Yeah, was afraid of that. would be nice if we could just decompress specific parts...
     
  17. FireRat

    FireRat

    Nah. Fuck it. Misfit
    48
    9
    8
    Chile
    Mobius Evolution 2
    It actually had a pretty good compression ratio when compressing Green Hill and Hidden Palace's (S2), and a custom tileset's tiles. Only around 5kb more each, compared to Nemesis.

    UFTC can be found here: https://github.com/sikthehedgehog/mdtools/tree/master/uftc
    The compression tool needs to be compiled though.

    EDIT: I've compiled the tool, and here it is: http://trox.binary-division.com/public/uftc.exe
     
  18. Caverns 4

    Caverns 4

    Member
    342
    0
    16
    Sanik Quest: Journey To The Right
    Wow, this is awesome! Amazing job, seriously.
    Once SonLvL has proper support for this (if it doesn't already), I want to convert over all the nemesis art in my hack. And the Level art too, depending on how much it affects the performance.
     
  19. MainMemory

    MainMemory

    Have no fear...Amy Rose is here! Tech Member
    4,412
    67
    28
    SonLVL
    If you're referring to the Comper format, well, the only way any new compression formats are getting support is if someone else (usually FraGag) writes a pure .NET library for it.
     
  20. vladikcomper

    vladikcomper

    Tech Member
    183
    5
    18
    Russia
    Sonic Warped
    Flamewing's right, you cannot compress Sonic's art right away. You need to compress each frame separately, for that you need to assemble all tiles transferred by DPLCs for each frame. Different frames may use the same group of tiles to save space, but in your reassembled separate frames you'll have to duplicate them. Nevertheless, I considered doing something like this earlier. Despite the need for duplicated tiles, this may worth a try. There aren't too many of them, at least in Sonic 1's tileset and the compression ratio could theoretically make up for that loss.

    Also, you may easily use Comper compression instead of certain uncompressed art DPLCs.

    Objects like Spin Dash dust and shields in Sonic 3 load the entire unique tile set from every animation frame they display. Therefore, each such frame can be fully compressed separately, no shared tiles will be met. Decompression should be fast enough to cause no lag, just like the uncompressed art. However, a lot of adjustments to the DPLC code should be made to handle the new way of delivering art. Luckily, each object has its own DPLC-related code, at least in Sonic 2.

    I would recommend compressing DPLCs only if you're running out of space or dealing with really huge set of sprites with unique tiles each. I shall set some tests soon to see how well my compression can be used in Sonic games.