I found the Saxman compressor source code

Discussion in 'Engineering & Reverse Engineering' started by Clownacy, Apr 10, 2020.

  1. Clownacy

    Clownacy

    Tech Member
    802
    36
    28
    I cannot believe I'm writing this. On an absolute whim, I found the source code to Sega's Saxman compressor, used for Sonic 2's sound engine and music files.

    Saxman is not completely custom: it's based on some public-domain Japanese LZSS code from 1989, by Haruhiko Okumura. The only modification, to my knowledge, is the addition of Saxman's unique ability to compress runs of 00 bytes - this is a mere eight-line edit.

    You can find the code here, in a file called "LZSS.C":
    https://web.archive.org/web/1999020...edu/pub/simtelnet/msdos/arcutils/lz_comp2.zip

    Here's the modification to make it produce zero-run matches:

    Code (Text):
    1. void InsertNode(int r)
    2.    /* Inserts string of length F, text_buf[r..r+F-1], into one of the
    3.       trees (text_buf[r]'th tree) and returns the longest-match position
    4.       and length via the global variables match_position and match_length.
    5.       If match_length = F, then removes the old node in favor of the new
    6.       one, because the old one will be deleted sooner.
    7.       Note r plays double role, as tree node and position in buffer. */
    8. {
    9.    int  i, p, cmp;
    10.    unsigned char  *key;
    11.    cmp = 1;  key = &text_buf[r];  p = N + 1 + key[0];
    12.    rson[r] = lson[r] = NIL;  match_length = 0;
    13.    for ( ; ; ) {
    14.        if (cmp >= 0) {
    15.            if (rson[p] != NIL) p = rson[p];
    16.            else {  rson[p] = r;  dad[r] = p;  return;  }
    17.        } else {
    18.            if (lson[p] != NIL) p = lson[p];
    19.            else {  lson[p] = r;  dad[r] = p;  return;  }
    20.        }
    21.        /* NEW CODE START */
    22.        if (ftell(infile) < 0xFFF - THRESHOLD) {
    23.            for (i = 0; i < F; i++)
    24.                if ((cmp = key[i] - 0) != 0)  break;
    25.            if (i > match_length) {
    26.                match_position = 0xFFF - F - THRESHOLD;
    27.                if ((match_length = i) >= F)  break;
    28.            }
    29.        }
    30.        /* NEW CODE END */
    31.        for (i = 1; i < F; i++)
    32.            if ((cmp = key[i] - text_buf[p + i]) != 0)  break;
    33.        if (i > match_length) {
    34.            match_position = p;
    35.            if ((match_length = i) >= F)  break;
    36.        }
    37.    }
    38.    dad[r] = dad[p];  lson[r] = lson[p];  rson[r] = rson[p];
    39.    dad[lson[p]] = r;  dad[rson[p]] = r;
    40.    if (rson[dad[p]] == p) rson[dad[p]] = r;
    41.    else                   lson[dad[p]] = r;
    42.    dad[p] = NIL;  /* remove p */
    43. }

    I still can't believe this... I never thought I'd see this code in my life.

    Just this morning, I started writing my own Saxman compressor, in the hopes of making it able to produce files identical to the original (so far, every community-made Saxman compressor is inaccurate), but I hit a brick wall when I realised the original files have no pattern to how they select dictionary-matches - sometimes they use the earliest data in the file, sometimes they use the latest, and sometimes they use data in the middle, even when there's perfectly good data before and after.

    Having been stumped by this issue, I decided to check-out an old 1989 compressor I read about on Wikipedia, to see how devs would compress LZSS files back-in-the-day.

    To my absolute amazement, I saw the compression format shared numerous similarities with Saxman - heck, even the decompressor shared similarities with the 68k/Z80 ASM decompressors used in Sonic 2: just look at how they detect when the descriptor field is empty!

    LZSS.C:
    Code (Text):
    1.        if (((flags >>= 1) & 256) == 0) {
    2.            if ((c = getc(infile)) == EOF) break;
    3.            flags = c | 0xff00;       /* uses higher byte cleverly */
    4.        }                           /* to count eight */

    Sonic 2's 68k decompressor:
    Code (Text):
    1.     lsr.w    #1,d6    ; Next descriptor bit
    2.     btst    #8,d6    ; Check if we've run out of bits
    3.     bne.s    +    ; (lsr 'shifts in' 0's)
    4.     jsr    SaxDec_GetByte(pc)
    5.     move.b    d0,d6
    6.     ori.w    #$FF00,d6    ; These set bits will disappear from the high byte as the register is shifted
    7. +

    Sonic 2's Z80 decompressor:
    Code (Text):
    1.     srl    c            ; c >> 1 (active control byte)
    2.     srl    b            ; b >> 1 (just a mask that lets us know when we need to reload)
    3.     bit    0,b            ; test next bit of 'b'
    4.     jr    nz,+            ; if it's set, we still have bits left in 'c'; jump to '+'
    5.     ; If you get here, we're out of bits in 'c'!
    6.     call    zDecEndOrGetByte    ; get next byte -> 'a'
    7.     ld    c,a            ; a -> 'c'
    8.     ld    b,0FFh            ; b = FFh (8 new bits in 'c')
    9. +

    And, yes, it produces identical files.

    You know... for years, I've wanted to know the real names of the various compression formats used in the Mega Drive Sonic games. Time after time we've come so close, be it through leaked function names in Sonic 2's Nick Arcade prototype, leftover compilation data in Sonic & Knuckles Collection, or a complete symbol-list in Sonic CD PS2. Each time, the names of the decompression functions were conveniently absent. At least now, we finally have some closure on Saxman - it doesn't have a name! It's just 'LZSS.C'...
     
    Last edited: May 5, 2020
    • Like Like x 12
    • Informative Informative x 3
    • List
  2. ICEknight

    ICEknight

    Researcher Researcher
    So... it does indeed have a name, being just a small variation of LZSS.

    I mean, same as Sinclair Basic on the ZX Spectrum is just a small variation of Sinclair Basic on the ZX81 and they're called the same.
     
  3. Blastfrog

    Blastfrog

    Frog blast the vent core! Member
    Incredible find, thanks for sharing! Source code is meant to be shared.