don't click here

Sonic 1 SMS/GG compression question

Discussion in 'Engineering & Reverse Engineering' started by Travelsonic, Dec 21, 2011.

  1. Travelsonic

    Travelsonic

    Member
    826
    20
    18
    Perhaps it has been answered before, but this puzzled me.

    According to the SCHG guide for hacking Sonic 1 [SMS/GG]:

    So, why does it waste that byte by writing it twice then [# of times it is repeated - 1]? Why not just [byte][# of times repeated]?

    Would that help improve anything at all, or like the method it uses, would it have limits within which it would actually be useful?


    EDIT:

    SO WE'RE CLEAR:

    I am ASKING: Why didn't they do it slightly differently, like this:


    And would THAT TWEAK be better AT ALL?
     
  2. Rika Chou

    Rika Chou

    Tech Member
    5,276
    169
    43
    I don't know about this stuff, but it sounds like if you have.

    01 02 03 04 04 04 04 for example

    it could compress like

    01 02 03 04 04 03

    it repeats the same byte twice so to tell it's a multiple byte. Otherwise there's no way to say that byte 02 isn't supposed to repeat 03 times.



    (probably wrong here)
     
  3. Travelsonic

    Travelsonic

    Member
    826
    20
    18

    Sounds right, actually - I guess the algorithm among other things would play a role in things.
     
  4. dsrb

    dsrb

    Member
    3,149
    0
    16
    This is incredibly simple—so much so that I can answer it. If it were to do that, it'd have to do it for every byte; that is, even those that were only written once. Defeating the point and massive inflating the file size. Like RC said, there has to be some way to tell the decompressor to kick in, and this algorithm does it very simply by writing the same byte twice followed by a quantity (which is wasteful when it only occurs twice in the uncompressed data, but presumably saves enough space on runs of 3+ bytes to be worth it).
     
  5. Travelsonic

    Travelsonic

    Member
    826
    20
    18
    Like I said above, the algorithms used would play a huge role in how it would be.
     
  6. dsrb

    dsrb

    Member
    3,149
    0
    16
    Nice tautology! But we're talking about the very algorithm you posted, not some hypothetical one.

    My point is that in your question…
    …your suggestion is impossible, because it fundamentally clashes with this rule:
    How would this algorithm determine that all the bytes that were intended to be read singly were read as such, but that all those that were intended to be read as [byte][repeats] were read like that? Hence why there needs to be some system to mark the latter—in this case, two consecutive occurrences of the same byte.

    It can't “just” compress more efficiently, because it would have no way of discerning between uncompressed (1 repeat) and compressed (2+ repeats) data, and therefore no way to determine whether it should just read any given byte as-is or check the next for how many times to repeat it.
     
  7. Travelsonic

    Travelsonic

    Member
    826
    20
    18
    Of course!

    I guess my question would have better OVERALL phrased as, why the wasted byte, and why that algorithm that results in the wasted byte instead of one that wastes as little as possible.
     
  8. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,161
    65
    28
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    There are two things that have to be taken into consideration when deciding on a compression algorithm for a video game -- compression ratio and decompression speed. If it takes too long to decompress, it limits the usefulness of the compressed data (longer loading times); if it has poor compression ratio, it limits the usefulness of the format (little gain relative to the uncompressed data). Since we are talking about SMS/GG, the processor is the z80 -- a very slow processor. If you use too fancy a compression algorithm, it will likely take too long to decompress, so they had to keep it simple.

    Moreover, the byte is only "wasted" for the case of 2 consecutive identical bytes in the uncompressed data: for 3 consecutive identical bytes you break even, and for 4 or more consecutive bytes, you always gain more than you lose to the "wasted" byte. And for all cases where the bytes are not identical, there is no loss.
     
  9. Travelsonic

    Travelsonic

    Member
    826
    20
    18
    You are 100% spot on that speed is a critical factor.

    STRICTLY talking about space saved for a moment, If you did it [byte repeated][# of times repeated] for larger quantities of repeated bytes, wouldn't you end up saving even more than with this algorithm? [Of course, if it is too slow, then it wouldn't - even if IT DID save more space - e too useful in the grander scheme]
     
  10. dsrb

    dsrb

    Member
    3,149
    0
    16
    Yes, of course. That's the point.

    Edit—I AM DERP; I thought you meant the original algorithm.
     
  11. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,161
    65
    28
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    Depends on the type of data you are compressing. If you have data that is dominated by long stretches of identical bytes and a handful of odd bytes sprinkled, this might compress better than the algorithm they used (emphasis on might); the more 'odd bytes' you have, the less the advantage. You are forgetting that the [# of times repeated] would have to appear for every byte you have. Here, lets contrast your RLE compressor and the SMS/GG compressor in a few examples:

    Sequence 1:
    00 00 00 00 01 00 00 00 02 00 00 00 03 02
    [byte repeated][# of times repeated]:
    00 04 01 01 00 03 02 01 00 03 03 01 02 01
    SMS/GG:
    00 00 03 01 00 00 02 02 00 00 02 03 02
    Result: SMS/GG gains one byte versus uncompressed and versus yours.

    Sequence 2:
    00 00 00 00 01 02 02 01 03 04 02 00 00 00
    [byte repeated][# of times repeated]:
    00 04 01 01 02 02 01 01 03 01 04 01 02 01 00 03
    SMS/GG:
    00 00 03 01 02 02 01 01 03 04 02 00 00 01
    Result: SMS/GG neither gains nor loses against uncompressed, yours loses two bytes

    Sequence 3:
    00 01 02 03 02 03 02 04 03 04 02 01 00 01
    [byte repeated][# of times repeated]:
    00 01 01 01 02 01 03 01 02 01 03 01 02 01 04 01 03 01 04 01 02 01 01 01 00 01 01 01
    SMS/GG:
    00 01 02 03 02 03 02 04 03 04 02 01 00 01
    Result: SMS/GG neither gains nor loses versus uncompressed, yours is twice as long as uncompressed.
     
  12. Travelsonic

    Travelsonic

    Member
    826
    20
    18
    I see what you did there, BUT you assumed wrong - I meant that when bytes repeat to put it as [byte][# of times repeated], for fuck's sake if you put the # of bytes it repeated even with non-repeating bytes, you'd have a good way to needlessly inflate the file size.

    [and tbh I thought it was clear that the byte/# of repeats thing would only be for bytes that actually repeat - similarly to the GG/SMS algorithm]
     
  13. Glitch

    Glitch

    Tech Member
    175
    12
    18
    I'm not sure if you're trolling here or just completely missing the point.

    With RLE compression you need some way of identifying a stretch has been compressed. That is why they use the double value followed by a count. If you didn't have the double value you wouldn't be able to tell where the compressed data started. The sonic 2 algorithm does things slightly differently: it uses a marker byte (0xFF) to identify an RLE compressed stretch.
     
  14. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,161
    65
    28
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    And now we come back full circle to what dsrb has been telling you all along -- there is no way to put the repeat count only on the bytes that repeat. Well, you can have a compressor that does that -- but the decompressor won't be able to make heads or tails out of the compressed data. For example, consider the following compressed sequence of bytes:

    01 02

    How would you decompress it? Does 01 02 mean 'repeat 01 twice' or does it say 'we have a 01 followed by a 02'? That is, does this sequence of bytes come from compressing this:
    01 01

    or from [compressing] this:
    01 02

    See, you have to provide information to the decompressor about this: it doesn't magically know what byte means what, and it certainly does not know which byte is a repeat count and which byte is a literal byte unless you tell it which is which. This is why the compressed format on SMS/GG uses a repeat byte -- it is telling the decompressor that this byte repeats, and the next one is the repeat count.
     
  15. Travelsonic

    Travelsonic

    Member
    826
    20
    18
    Of course - that is the crux, while the method I described sounds in theory better - to me, at least - that IS the issue, how would the processes work, how would the program know how to handle what pieces of data correctly. That makes it all clear.