I found the Saxman compressor source code

Discussion in 'Engineering & Reverse Engineering' started by Clownacy, Apr 10, 2020.

  1. Clownacy

    Clownacy

    Tech Member
    837
    79
    28
    I cannot believe I'm writing this. On an absolute whim, I found the source code to Sega's Saxman compressor, used for Sonic 2's sound engine and music files.

    Saxman is not completely custom: it's based on some public-domain Japanese LZSS code from 1989, by Haruhiko Okumura. The only modification, to my knowledge, is the addition of Saxman's unique ability to compress runs of 00 bytes - this is a mere eight-line edit.

    You can find the code here, in a file called "LZSS.C":
    https://web.archive.org/web/1999020...edu/pub/simtelnet/msdos/arcutils/lz_comp2.zip

    Here's the modification to make it produce zero-run matches:

    Code (Text):
    1. void InsertNode(int r)
    2.    /* Inserts string of length F, text_buf[r..r+F-1], into one of the
    3.       trees (text_buf[r]'th tree) and returns the longest-match position
    4.       and length via the global variables match_position and match_length.
    5.       If match_length = F, then removes the old node in favor of the new
    6.       one, because the old one will be deleted sooner.
    7.       Note r plays double role, as tree node and position in buffer. */
    8. {
    9.    int  i, p, cmp;
    10.    unsigned char  *key;
    11.    cmp = 1;  key = &text_buf[r];  p = N + 1 + key[0];
    12.    rson[r] = lson[r] = NIL;  match_length = 0;
    13.    for ( ; ; ) {
    14.        if (cmp >= 0) {
    15.            if (rson[p] != NIL) p = rson[p];
    16.            else {  rson[p] = r;  dad[r] = p;  return;  }
    17.        } else {
    18.            if (lson[p] != NIL) p = lson[p];
    19.            else {  lson[p] = r;  dad[r] = p;  return;  }
    20.        }
    21.        /* NEW CODE START */
    22.        if (ftell(infile) < 0xFFF - THRESHOLD) {
    23.            for (i = 0; i < F; i++)
    24.                if ((cmp = key[i] - 0) != 0)  break;
    25.            if (i > match_length) {
    26.                match_position = 0xFFF - F - THRESHOLD;
    27.                if ((match_length = i) >= F)  break;
    28.            }
    29.        }
    30.        /* NEW CODE END */
    31.        for (i = 1; i < F; i++)
    32.            if ((cmp = key[i] - text_buf[p + i]) != 0)  break;
    33.        if (i > match_length) {
    34.            match_position = p;
    35.            if ((match_length = i) >= F)  break;
    36.        }
    37.    }
    38.    dad[r] = dad[p];  lson[r] = lson[p];  rson[r] = rson[p];
    39.    dad[lson[p]] = r;  dad[rson[p]] = r;
    40.    if (rson[dad[p]] == p) rson[dad[p]] = r;
    41.    else                   lson[dad[p]] = r;
    42.    dad[p] = NIL;  /* remove p */
    43. }

    I still can't believe this... I never thought I'd see this code in my life.

    Just this morning, I started writing my own Saxman compressor, in the hopes of making it able to produce files identical to the original (so far, every community-made Saxman compressor is inaccurate), but I hit a brick wall when I realised the original files have no pattern to how they select dictionary-matches - sometimes they use the earliest data in the file, sometimes they use the latest, and sometimes they use data in the middle, even when there's perfectly good data before and after.

    Having been stumped by this issue, I decided to check-out an old 1989 compressor I read about on Wikipedia, to see how devs would compress LZSS files back-in-the-day.

    To my absolute amazement, I saw the compression format shared numerous similarities with Saxman - heck, even the decompressor shared similarities with the 68k/Z80 ASM decompressors used in Sonic 2: just look at how they detect when the descriptor field is empty!

    LZSS.C:
    Code (Text):
    1.        if (((flags >>= 1) & 256) == 0) {
    2.            if ((c = getc(infile)) == EOF) break;
    3.            flags = c | 0xff00;       /* uses higher byte cleverly */
    4.        }                           /* to count eight */

    Sonic 2's 68k decompressor:
    Code (Text):
    1.     lsr.w    #1,d6    ; Next descriptor bit
    2.     btst    #8,d6    ; Check if we've run out of bits
    3.     bne.s    +    ; (lsr 'shifts in' 0's)
    4.     jsr    SaxDec_GetByte(pc)
    5.     move.b    d0,d6
    6.     ori.w    #$FF00,d6    ; These set bits will disappear from the high byte as the register is shifted
    7. +

    Sonic 2's Z80 decompressor:
    Code (Text):
    1.     srl    c            ; c >> 1 (active control byte)
    2.     srl    b            ; b >> 1 (just a mask that lets us know when we need to reload)
    3.     bit    0,b            ; test next bit of 'b'
    4.     jr    nz,+            ; if it's set, we still have bits left in 'c'; jump to '+'
    5.     ; If you get here, we're out of bits in 'c'!
    6.     call    zDecEndOrGetByte    ; get next byte -> 'a'
    7.     ld    c,a            ; a -> 'c'
    8.     ld    b,0FFh            ; b = FFh (8 new bits in 'c')
    9. +

    And, yes, it produces identical files.

    You know... for years, I've wanted to know the real names of the various compression formats used in the Mega Drive Sonic games. Time after time we've come so close, be it through leaked function names in Sonic 2's Nick Arcade prototype, leftover compilation data in Sonic & Knuckles Collection, or a complete symbol-list in Sonic CD PS2. Each time, the names of the decompression functions were conveniently absent. At least now, we finally have some closure on Saxman - it doesn't have a name! It's just 'LZSS.C'...

    EDIT (2020/10/07):
    It turns out there's a much simpler way to make LZSS.C produce Saxman files: just change both uses of the ASCII space character to 0. I've updated Sega Retro's Saxman page with instructions.
     
    Last edited: Oct 7, 2020
    • Like Like x 12
    • Informative Informative x 3
    • Useful Useful x 1
    • List
  2. ICEknight

    ICEknight

    Researcher Researcher
    So... it does indeed have a name, being just a small variation of LZSS.

    I mean, same as Sinclair Basic on the ZX Spectrum is just a small variation of Sinclair Basic on the ZX81 and they're called the same.
     
  3. Blastfrog

    Blastfrog

    See ya starside. Member
    Incredible find, thanks for sharing! Source code is meant to be shared.
     
  4. Brainulator

    Brainulator

    Regular garden-variety member Member
    Late, I know, but looking at the C file for myself, alongside ValleyBell's comments on the equivalent SSRG thread, I noticed this line in the section that is now to be modified:

    As you have now noticed, replacing the space character (0x20) with zeroes (0x00) generates Saxman-compressed files. Makes sense: when dealing with plain text, spaces are the most common character you'll find, but with code, you see zeroes far more often. I imagine if there was a file that had a bunch of plaintext 1s in it, it would make more sense to change that space to a 1.

    At this point, I'd inclined to drop the Saxman name and just call it LZSS compression...
     
  5. Clownacy

    Clownacy

    Tech Member
    837
    79
    28
    Saxman and Kosinski are both LZSS, so the name is more to differentiate Saxman from other LZSS variants than anything.
     
    Last edited: Aug 27, 2021
    • Agree Agree x 2
    • Like Like x 1
    • List
  6. starfrost

    starfrost

    Member
    4
    0
    1
    "lzss00", perhaps? As it is just LZSS with 0x00...
     
  7. Blastfrog

    Blastfrog

    See ya starside. Member
    Besides which, he was the one who cracked it to begin with, so I think we should stick with the moniker in his honor.
     
    • Agree Agree x 2
    • Like Like x 1
    • List
  8. MarkeyJester

    MarkeyJester

    A D V A N C E Resident Jester
    2,080
    211
    43
    Japan
    That's an understatement.

    LZSS is a "general model", not really a "format", the theorem proposed by Lempel and Ziv (in 77/78), then further improved by Storer and Szymanski (in 82) provides the idea in a generic form, and does not provide thorough specifics for how the offset/length are stored, nor the marker to decide between using offset/length vs an uncompressed block (or in this case, an addition of a 00 blank). You would also have to specify the bit format in the name too, along with any prerequisites (such as a length or entropy setup for formats which have them), such a naming scheme can get out of hand given we have multiple compression formats which utilise the general model, something which would not have been a problem for a company with limit use of LZ given it's patent history at the time, and could lightly name their file "LZSS.c" without giving much of a damn.

    Furthermore, Blastfrog's got the right idea. We name these compressions the same way scientists name elements, the person who cracks the format gets to name it, or it becomes named after them through casual communication, "We need to decompress this file, it's in that format Saxman cracked, yeah, in Saxman's compression...". Chemistry would be interesting if we renamed Chlorine to "Hologen 2" or Iron to "Element C8R4" suddenly after all these years, doing the same to the KENS family seems unnecessary.
     
    • Like Like x 4
    • Agree Agree x 1
    • List
  9. Brainulator

    Brainulator

    Regular garden-variety member Member
    Admittedly, I'm new to this compression thing, which may have limited my understanding of what was going on, but here's what I'm getting from this: LZSS is a method while Kosinski and Saxman are implementations of that method. I guess, alternatively, this would be akin to LZSS being an element with Kosinski and Saxman being isotopes thereof. Thanks for the heads up.

    I guess my concern was against acting like Saxman was its own format when its actual nature was discovered... but you know what? Okumura didn't name his name anything beyond LZSS.C, and I imagine Sonic 2's source code probably just says "songdecode" or something, so I won't go telling people not to call it Saxman (not that I had planned to).