don't click here

Brett Kosinski format

Discussion in 'Engineering & Reverse Engineering' started by Sonic Hachelle-Bee, Nov 6, 2004.

  1. Sonic Hachelle-Bee

    Sonic Hachelle-Bee

    Taking a Sand Shower Tech Member
    814
    213
    43
    Lyon, France
    Sonic 2 Long Version
    The aim of this topic is to explain how the famous Brett Kosinski format works.
    Reading this post, you will learn how to compress and decompress this compression format all in Hex editing, without utilities.


    THEORY:

    First of all, you have to know that in all the Brett Kosinski format, there is some 16 bits value (2 bytes) whose tell the game which byte is uncompressed data or not.
    These 16 bits values (called bitfields by Brett Kosinski) begins at start, the first 2 bytes of the compressed data.
    As an example, we have this begining of compressed data:

    53 FB 01 23 FE 67 FF F8 0D 98 FF FD 65 98 75 FB B2 00 15...

    53 FB is this 16 bits value. Let's convert it into bits:

    53 FB = 0101 0011 1111 1011

    You have to read it carefully. Cut it into 2 parts in the middle, paste the left part after the right part, then read it from right to left, like this:

    0101 0011 1111 1011 --> 1111 1011 0101 0011 --> 1100 1010 1101 1111

    Now, all bits of these 16 represent one byte and his status: uncompressed, compressed (format).
    There is 3 compression forms you can encounter into Brett Kosinski format. I will use the terms of Brett Kosinski himself to call them:

    1 --> Uncompressed byte (UB).
    01 --> Separate compression (SC).
    00 XX --> Inline compression (IC).

    In our example, we have this:

    53 FB 01 23 FE 67 FF F8 0D 98 FF FD 65 98 75 FB B2 00 15...
    1100 1010 1101 1111 --> 1 (UB) / 1 (UB) / 00 10 (IC) / 1 (UB) / 01 (SC) / 1 (UB) / 01 (SC) / 1 (UB) / 1 (UB) / 1 (UB) / 1 (UB)

    Then:
    01 is an UB. Read it as it.
    23 is another UB.
    FE is IC. I will explain it later.
    67 is an UB.
    FF F8 0D is SC. I will explain it later.
    98 is an UB.
    FF FD is SC.
    65 is an UB.
    98 is an UB.
    75 is an UB.
    FB B2 is the next bitfield, before the last byte of the previous bitfield.
    00 is an UB.
    ...

    After the end of this 16 bits bitfield, the game will check for another coming after. The next is before the last byte (or compression form) referred by the current bitfield. In this example, the next bitfield is FB B2. Convert it into bits again, and continue the code with 15...

    Inline compression:

    This is when you read something like this under the bitfield:

    00 XX

    XX is a value that tells how much bytes to copy after the current uncompressed byte:

    00 --> 2 bytes.
    01 --> 3 bytes.
    10 --> 4 bytes.
    11 --> 5 bytes.

    The byte this 00 XX value refers to, is a negative value.
    In our previous example, it's FE = -2.
    The game will read all the previous bytes this negative value is telling, and will copy them for XX bytes.
    In our example: 01 23 FE
    And the bitfield value of the byte FE is 00 10, XX = 10.
    Then, 01 23 FE = 01 23 01 23 01 23.
    Writing FF instead of FE will make this: 01 23 FF = 01 23 23 23 23 23.

    Separate compression:

    This is when there is 01 under the bitfield.
    The separate compression uses at least 2 bytes, and sometimes 3 (the last is optional). In a binary form:

    NN DC (CC) = NNNN NNNN DDDD DCCC (CCCC CCCC)

    NN is another negative value. Unlike IC, this one will tell the game where to read/copy the (uncompressed) data from. Writing FF for NN will read/copy the previous byte only. Writing another value will read/copy the previous bytes this value refers to, until you have CC of them.

    DD is again a negative value on 5 bits. This is an addon to NN. Take NN and substract 256 (100 in hex) * |DD+1|. Take the result as your new NN value.

    DD = 11111 = -1 --> NN = NN - (256 * |-1+1|) = NN (Do nothing)
    DD = 11110 = -2 --> NN = NN - (256 * |-2+1|) = NN - 256
    DD = 11101 = -3 --> NN = NN - (256 * |-3+1|) = NN - 256 * 2 = NN - 512
    DD = 11100 = -4 --> NN = NN - (256 * |-4+1|) = NN - 256 * 3 = NN - 768
    ...
    DD = 00000 = -32 --> NN = NN - (256 * |-32+1|) = NN - 256 * 31 = NN - 7936

    Example:
    66 67 FF F8 0D --> NN = -1 and DD = 11111 --> NN = -1, we are starting at 67.
    66 67 FE F8 0D --> NN = -2 and DD = 11111 --> NN = -2, we are starting at 66.
    ...
    66 67 FF F0 0D --> NN = -1 and DD = 11110 --> NN = -257, we are starting 257 bytes before.
    66 67 FD E7 --> NN = -3 and DD = 11100 --> NN = -771, we are starting 771 bytes before.

    CC: Count.
    If you have no more than 9 bytes to read/copy, then the last CC byte is useless.

    F0 --> Like F8, be careful of the DD value.
    F1 --> Copy the read data for 3 bytes, be careful of the DD value.
    F2 --> Copy the read data for 4 bytes, be careful of the DD value.
    F3 --> Copy the read data for 5 bytes, be careful of the DD value.
    F4 --> Copy the read data for 6 bytes, be careful of the DD value.
    F5 --> Copy the read data for 7 bytes, be careful of the DD value.
    F6 --> Copy the read data for 8 bytes, be careful of the DD value.
    F7 --> Copy the read data for 9 bytes, be careful of the DD value.
    F8 --> Read next.
    F9 --> Copy the read data for 3 bytes.
    FA --> Copy the read data for 4 bytes.
    FB --> Copy the read data for 5 bytes.
    FC --> Copy the read data for 6 bytes.
    FD --> Copy the read data for 7 bytes.
    FE --> Copy the read data for 8 bytes.
    FF --> Copy the read data for 9 bytes.

    Else, write F8 = 1111 1000 (CCC = 000) for the first count, and use the last byte to actually write your count (-1):
    F8 09 --> Copy the read data for 10 bytes.
    F8 0A --> Copy the read data for 11 bytes.
    ...

    Writing 00 F8 00 (NN = 00 and both counts are 00's) will end the compressed data.

    In our example:
    53 FB 01 23 FE 67 FF F8 0D 98 FF FD 65 98 75 FB B2 00 15...
    We have 2 SC: FF F8 0D and FF FD.

    67 FF F8 0D = 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 (67 + 0Ex67)
    98 FF FD = 98 98 98 98 98 98 98 98 (98 + 07x98)
    Another: 12 34 56 78 9A BC DE 10 FA F9 = 12 34 56 78 9A BC DE 10 56 78 9A


    EXAMPLE:

    We have this compressed data:

    FF 3F 54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5
    FC 15 FE C3 44 78 88 98 44 30 FF FF 00 F8 00

    Bitfields:
    FF 3F = 1111 1111 0011 1111 --> 1111 1111 1111 1100
    FC 15 = 1111 1100 0001 0101 --> 0011 1111 1010 1000

    Green color: Uncompressed byte.
    Red color: Separate compression.
    Yellow color: Inline compression.

    Uncompressed data:

    54 3B C4 44 54 33 33 5B 2D 5C 44 5C C4 C5 C4 C5
    C3 44 78 88 98 44 30 30 30 30 30 30 30 30 30 30



    I hope someone will understand something.
    That's all! :(
     
  2. Syren

    Syren

    Member
    4,330
    0
    0
    Teesside, UK
    Reintergration
    Oh jesus, just skim reading that gave me a migrane. Congradulations, it looks good.
     
  3. Sonic Hachelle-Bee

    Sonic Hachelle-Bee

    Taking a Sand Shower Tech Member
    814
    213
    43
    Lyon, France
    Sonic 2 Long Version
    Yeah, that's not as easy as you expected.
    This is also very hard to explain.
     
  4. Lostgame

    Lostgame

    producer/turnablist. homebrew dev. cosplayer. Oldbie
    4,134
    58
    28
    Toronto, ON
    The O.I.C.
    Horray! Now I can build a compressor/decompressor into a program and no one will guess that's what I used!
     
  5. Korama

    Korama

    Tech Member
    272
    2
    0
    I haven't looked at the Kosinski algorithm before. Good explanation, Sonic Hachelle-Bee, I understood it. :(
    So, Kosinski appears to be an extension/variation of LZSS.
     
  6. LOst

    LOst

    Tech Member
    4,891
    8
    18
    Are you 100% sure of that?
     
  7. Quickman

    Quickman

    be attitude for gains Tech Member
    5,595
    18
    18
    :x
    omg porjcet
    That's impressive. Nice work.
     
  8. Hayate

    Hayate

    Tech Member
    Looks to me like this Kosinski is just a variant of RLE (run-length encoding).
     
  9. Quickman

    Quickman

    be attitude for gains Tech Member
    5,595
    18
    18
    :x
    omg porjcet
    It's definitely a Lempel-Ziv compression scheme; LOst knows that much.
     
  10. Hayate

    Hayate

    Tech Member
    And wtf is a Lempel-Ziv compression..?
     
  11. Quickman

    Quickman

    be attitude for gains Tech Member
    5,595
    18
    18
    :x
    omg porjcet
    If you weren't a moron you'd know, or would have been able to plug something intelligent into Google and find out.
     
  12. Korama

    Korama

    Tech Member
    272
    2
    0
    Yes, but judge for yourself.

    First, let's take a look at the original compression algorithm by Lempel and Ziv, LZ77. Its idea is to take a byte sequence, search the previously processed data for a match and then output a reference to that match.
    The output of LZ77 always consists of a triplet of the kind (Offset, Length, New Character).

    Example: compression of the text "ABABCABCD"
    Code (Text):
    1. Input ? ? ? ?Found at Offset ? ?Match Length ? ?Output
    2. ABABCABCD ? ? ? ? ? 0 ? ? ? ? ? ? ? ? 0 ? ? ? ? (0, 0, A)
    3. BABCABCD ? ? ? ? ? ?0 ? ? ? ? ? ? ? ? 0 ? ? ? ? (0, 0, B)
    4. ABCABCD ? ? ? ? ? ? 0 ? ? ? ? ? ? ? ? 2 ? ? ? ? (0, 2, C)
    5. ABCD ? ? ? ? ? ? ? ?2 ? ? ? ? ? ? ? ? 3 ? ? ? ? (2, 3, D)
    If you think about it a bit, then you can easily see the problems of this algorithm. For practical use, the "Found at Offset" and "Match Length" values should be a fixed size, eg. 16 bit for the Offset and 8 bit for the Length. This also means that you can't search the entire previous data for matches (you wouldn't want to do this anyway as it would take too much time), because the Offset can only contain values from 0 to 65535 (16 bit). So, a "sliding window" is used: from any given file position you only take the previous 64 KB into consideration and Offset acts as index into this window.

    Same example as above, but now with Offset relative to the current input position instead of absolute.
    Code (Text):
    1. Input ? ? ? ?relative Offset ? ?Match Length ? ?Output
    2. ABABCABCD ? ? ? ? ? 0 ? ? ? ? ? ? ? ? 0 ? ? ? ? (0, 0, A)
    3. BABCABCD ? ? ? ? ? ?0 ? ? ? ? ? ? ? ? 0 ? ? ? ? (0, 0, B)
    4. ABCABCD ? ? ? ? ? ? 2 ? ? ? ? ? ? ? ? 2 ? ? ? ? (2, 2, C)
    5. ABCD ? ? ? ? ? ? ? ?3 ? ? ? ? ? ? ? ? 3 ? ? ? ? (3, 3, D)
    Decoding is really simple. Example: (0, 0, A)(0, 0, B)(2, 2, C)(3, 3, D)
    Code (Text):
    1. Input ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Output
    2. (0, 0, A) ?go back 0 bytes, copy 0 bytes from there, add A ? ?A
    3. (0, 0, B) ?go back 0 bytes, copy 0 bytes from there, add B ? ?AB
    4. (2, 2, C) ?go back 2 bytes, copy 2 bytes from there, add C ? ?ABABC
    5. (3, 3, D) ?go back 3 bytes, copy 3 bytes from there, add D ? ?ABABCABCD

    On its own, LZ77 compression is quite weak. The big disadvantage is that it always outputs a (Offset, Length, New Character) triplet even if no match is found. How wasteful. So, two men named Storer and Szymanski had an idea. In their modification of LZ77, called LZSS, those triplets are replaced by two modes: literals and repeats.
    Extremely wasteful LZ77 triplets like (0, 0, X) ("repeat nothing, write X") are replaced by the literal "X". All other LZ77 triplets of the kind (y, z, A) ("go back y bytes, copy z bytes from there, write A") are replaced by the repeat pair (y, z) plus the literal "A".

    Sounds good, but when you decompress LZSS data, how do you know whether the next byte is a literal or the start of a repeat pair? Well, you need an extra bit of information for that. Thus, a literal byte in LZSS actually requires 9 bits for storage: one "flag" bit telling that it's a literal, plus the eight data bits. In order to avoid having to do awkward bit shifts, you usually group eight of those flag bits to one byte.
    Now to the repeat pairs. They are just pairs of the kind (Offset, Length). The same things I discussed for LZ77 also apply here: For a practical implementation of LZSS you have to limit Offset and Length to a certain bit size and use a sliding window. The window size depends on the range of Offset again. A typical example, as used by Microsoft's old "compress.exe" (eg. used for the data files of Sonic CD PC), is 12 bits for Offset and 4 bits for Length. Thus, the sliding window is 4 KB in size (2^12 = 4096) and a repeat pair takes up 16 bits (12 + 4).
    4 bits for Length means that we can encode the values 0 to 15. But think about it, Length = 0 is nonsense, and so is Length = 1 (because for a match length of 1, we rather use a literal instead of a repeat pair). Length = 2 isn't used either, because two literals are quicker and the same size as a repeat with Length = 2. Thus, the 4 bits of Length are used to encode the values 3 to 18.

    It starts to get confusing, so it's about time to look at an example.
    Let's decompress this LZSS sequence:

    5F 41 42 41 42 43 00 20 44 00 51 05 45 00 92 46

    The first byte 5F contains the first 8 flag bits. In binary: 0101 1111.
    In this sample implementation of LZSS, the flag bits are read from low to high (right to left) with 1 meaning literal and 0 meaning repeat.

    1st bit is 1 ==> next item is a literal: 41
    2nd bit is 1 ==> next item is a literal: 42
    3rd bit is 1 ==> next item is a literal: 41
    4th bit is 1 ==> next item is a literal: 42
    5th bit is 1 ==> next item is a literal: 43
    6th bit is 0 ==> next item is a repeat: 00 20
    Now, 00 20 could mean Offset = 002 and Length = 0 + 3 = 3.
    How Offset is exactly used depends on the specific implementation of LZSS. Let's assume here that it's the index for a ring buffer that contains the last 4 KB of our decoded output. In this example here, the ring buffer contains the bytes "41 42 41 42 43" at this point. Thus, the current repeat goes to buffer index 2 and copies three bytes: 41 42 43. After that, the new buffer content is "41 42 41 42 43 41 42 43".

    7th bit is 1 ==> next item is a literal: 44
    8th bit is 0 ==> next item is a repeat: 00 51
    Offset = 005, Length = 1 + 3 = 4.
    Buffer contains "41 42 41 42 43 41 42 43 44" at this point. Go to index 5 and copy 4 bytes. New buffer content: "41 42 41 42 43 41 42 43 44 41 42 43 44".

    We have processed all 8 bits of the current flag byte, so we need to get a new one: 05. In binary: 0000 0101

    1st bit is 1 ==> next item is a literal: 45
    2nd bit is 0 ==> next item is a repeat: 00 92
    Offset = 009, Length = 2 + 3 = 5.
    Buffer contains "41 42 41 42 43 41 42 43 44 41 42 43 44 45" at this point. Go to index 9 and copy 5 bytes. New buffer content: "41 42 41 42 43 41 42 43 44 41 42 43 44 45 41 42 43 44 45".
    3rd bit is 1 ==> next item is a literal: 46

    We have reached the end of the compressed input data, so the remaining 5 bits in the current flag byte are meaningless and we terminate.
    We have decoded the byte sequence
    41 42 41 42 43 41 42 43 44 41 42 43 44 45 41 42 43 44 45 46
    That's "ABABCABCDABCDEABCDEF" in ASCII.
    20 bytes that were reduced to 16 bytes with LZSS. Great, isn't it? :(


    The main difference between LZSS and Kosinski is that LZSS has only two modes (literal and repeat) whereas Kosinski has three modes (literal, and two kinds of repeats: "separate compression" and "inline compression").


    That's it. I hope this was somewhat understandable. :D
     
  13. Sonic Hachelle-Bee

    Sonic Hachelle-Bee

    Taking a Sand Shower Tech Member
    814
    213
    43
    Lyon, France
    Sonic 2 Long Version
    This is very understandable. ;)
    LZSS appears to have some similarities with the Brett Kosinski format, that's right.
    As you said, it seems they are not exactly the same, because Kosinski uses 3 modes instead of 2.

    For my C++ project at college, I will try to make a small Kosinski compression/decompression program on PC, so as to port it for Macintosh later. I hope I will succeed.

    EDIT: We should archive this topic. There is a lot of informations about Kosinski and LZSS formats written here now.
     
  14. saxman

    saxman

    Oldbie Tech Member
    Hey hey hey, I can't stand him either, but dissing 'intelligence' that way isn't right either. There are many more people who don't know what it is. IQ isn't everything. Let's not go there in this topic -- I don't wanna go off-topic.


    SHB, great work! I saw this at my board too, but I simply 'forgot' to tell you how much I liked your work. Although you were about one day late I think..... Magus explained the format to me, so I'll have to give him credit in my program since that's where I came to understand it. Still though, it'll help out plenty of other people!
     
  15. LOst

    LOst

    Tech Member
    4,891
    8
    18
    Great to know!

    Korama, you should be a tech member for sure. After your great GSaveState program, I managed to finally understand those horrible VDP commands and DMA commands.

    And this LZSS explaination is fantastic!
     
  16. Hayate

    Hayate

    Tech Member
    Argh! Don't quote a two-page-long post to write three frikkin lines! >_<
     
  17. Quickman

    Quickman

    be attitude for gains Tech Member
    5,595
    18
    18
    :x
    omg porjcet
    That wasn't that long. Don't bitch.
     
  18. Hayate

    Hayate

    Tech Member
    Yes it was.

    Even on 1280x1024 resolution. ??
     
  19. Quickman

    Quickman

    be attitude for gains Tech Member
    5,595
    18
    18
    :x
    omg porjcet
    I'm on 1024x768 and I can scroll past it in two or three mousewheel swathes. Now if it took enough time for the human race to become extinct to scroll past it, THAT would be too long.
     
  20. Kles

    Kles

    Member
    While overblown by Bob, I really don't think the whole thing needed to be quoted. Didn't bother me but hey.