# Sonic and Sega Retro Message Board: Sonic 1 SMS/GG compression question

Page 1 of 1

## Sonic 1 SMS/GG compression question

### #1Travelsonic  Posted 20 December 2011 - 07:04 PM

• Posts: 715
• Joined: 01-March 05
Perhaps it has been answered before, but this puzzled me.

According to the SCHG guide for hacking Sonic 1 [SMS/GG]:

Quote

The algorithm for compressing data is as follows :

-> While two successive bytes are different
-> Bytes are stored without any compression
-> If two or more successive bytes are the same
-> The byte repeated is written twice
-> The number of times the byte is repeated, minus 1, is calculated. The result is written as a third byte.

So, why does it waste that byte by writing it twice then [# of times it is repeated - 1]? Why not just [byte][# of times repeated]?

Would that improve anything at all, or would it, like the method it uses now, only be useful within certain limits?

EDIT:

SO WE'RE CLEAR:

I am ASKING: Why didn't they do it slightly differently, like this:

Quote

The algorithm for compressing data is as follows :

-> While two successive bytes are different
-> Bytes are stored without any compression
-> If two or more successive bytes are the same
-> The byte repeated is written once
-> The number of times the byte is repeated is calculated. The result is written as a second byte.

And would THAT TWEAK be better AT ALL?
This post has been edited by Travelsonic: 21 December 2011 - 04:18 PM
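For reference, the scheme quoted in the opening post can be sketched in Python (my own illustration, not the game's actual Z80 routine; the cap on run length is my assumption, since a single count byte can only hold up to 255):

```python
def sms_compress(data):
    """RLE compression as described in the SCHG guide: literal bytes pass
    through unchanged; a run of 2+ identical bytes is stored as the byte
    written twice, followed by (run length - 1) as a count byte."""
    out = []
    i = 0
    while i < len(data):
        # Measure the run of identical bytes starting at i.
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 256:
            run += 1
        if run >= 2:
            out += [data[i], data[i], run - 1]  # doubled byte marks a run
        else:
            out.append(data[i])                 # literal, stored as-is
        i += run
    return out

# Rika Chou's example from post #2 below:
print(sms_compress([0x01, 0x02, 0x03, 0x04, 0x04, 0x04, 0x04]))
# [1, 2, 3, 4, 4, 3]
```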

### #2Rika Chou  Posted 20 December 2011 - 07:24 PM

• Posts: 5195
• Joined: 11-January 03
• Gender:Not Telling
• Location:CA US
• Wiki edits:4

01 02 03 04 04 04 04 for example

it could compress like

01 02 03 04 04 03

It repeats the same byte twice so you can tell it's a run of bytes. Otherwise there'd be no way to tell that byte 02 isn't supposed to repeat 03 times.

(probably wrong here)

### #3Travelsonic  Posted 20 December 2011 - 07:31 PM

• Posts: 715
• Joined: 01-March 05

Rika Chou, on 20 December 2011 - 07:24 PM, said:

01 02 03 04 04 04 04 for example

it could compress like

01 02 03 04 04 03

It repeats the same byte twice so you can tell it's a run of bytes. Otherwise there'd be no way to tell that byte 02 isn't supposed to repeat 03 times.

(probably wrong here)

Sounds right, actually - I guess the choice of algorithm, among other things, would play a role here.

### #4dsrb  Posted 20 December 2011 - 07:34 PM

• Posts: 3150
• Joined: 10-June 09
• Gender:Male
• Wiki edits:196

Travelsonic, on 20 December 2011 - 07:04 PM, said:

So, why does it waste that byte by writing it twice then [# of times it is repeated - 1]? Why not just [byte][# of times repeated]?
This is incredibly simple - so much so that I can answer it. If it were to do that, it'd have to do it for every byte; that is, even those that occur only once - defeating the point and massively inflating the file size. Like RC said, there has to be some way to tell the decompressor to kick in, and this algorithm does it very simply by writing the same byte twice followed by a quantity (which is wasteful when a byte only occurs twice in the uncompressed data, but presumably saves enough space on runs of 3+ bytes to be worth it).

### #5Travelsonic  Posted 20 December 2011 - 08:26 PM

• Posts: 715
• Joined: 01-March 05

dsrb, on 20 December 2011 - 07:34 PM, said:

[If it were to do that, it'd have to do it for every byte;

Like I said above, the choice of algorithm would play a huge role in how well it would work.

### #6dsrb  Posted 20 December 2011 - 08:37 PM

• Posts: 3150
• Joined: 10-June 09
• Gender:Male
• Wiki edits:196
Nice tautology! But we're talking about the very algorithm you posted, not some hypothetical one.

My point is that in your question…

Quote

So, why does it waste that byte by writing it twice then [# of times it is repeated - 1]? Why not just [byte][# of times repeated]?
…your suggestion is impossible, because it fundamentally clashes with this rule:

Quote

-> While two successive bytes are different
-> Bytes are stored without any compression
How would this algorithm determine which bytes were intended to be read singly and which were intended to be read as [byte][repeats]? Hence there needs to be some system to mark the latter - in this case, two consecutive occurrences of the same byte.

It can't “just” compress more efficiently, because it would have no way of discerning between uncompressed (1 repeat) and compressed (2+ repeats) data, and therefore no way to determine whether it should just read any given byte as-is or check the next for how many times to repeat it.

### #7Travelsonic  Posted 21 December 2011 - 09:56 AM

• Posts: 715
• Joined: 01-March 05
Of course!

I guess my question would have been better phrased OVERALL as: why the wasted byte, and why an algorithm that results in a wasted byte instead of one that wastes as little as possible?

### #8flamewing  Posted 21 December 2011 - 10:47 AM

• Emerald Hunter
• Posts: 1135
• Joined: 11-October 10
• Gender:Male
• Location:🇫🇷 France
• Project:Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
• Wiki edits:12
There are two things that have to be taken into consideration when deciding on a compression algorithm for a video game -- compression ratio and decompression speed. If it takes too long to decompress, it limits the usefulness of the compressed data (longer loading times); if it has a poor compression ratio, it limits the usefulness of the format (little gain relative to the uncompressed data). Since we are talking about SMS/GG, the processor is the Z80 -- a very slow processor. If you use too fancy a compression algorithm, it will likely take too long to decompress, so they had to keep it simple.

Moreover, the byte is only "wasted" for the case of 2 consecutive identical bytes in the uncompressed data: for 3 consecutive identical bytes you break even, and for 4 or more consecutive bytes, you always gain more than you lose to the "wasted" byte. And for all cases where the bytes are not identical, there is no loss.
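The decompression side of the doubled-byte scheme really is cheap -- one compare per byte decides literal versus run, which suits a slow CPU. A Python sketch (my own illustration; the real routine is Z80 assembly, and details such as how the end of the data is signalled aren't covered in this thread):

```python
def sms_decompress(comp):
    """Inverse of the doubled-byte RLE: a byte equal to the one after it
    marks a run, and the byte after the pair holds (repeats - 1).
    Assumes well-formed input, i.e. every doubled byte has a count byte."""
    out = []
    i = 0
    while i < len(comp):
        b = comp[i]
        if i + 1 < len(comp) and comp[i + 1] == b:
            out += [b] * (comp[i + 2] + 1)  # count byte stores repeats - 1
            i += 3
        else:
            out.append(b)                   # lone byte: emit as-is
            i += 1
    return out

# Round-tripping Rika Chou's example:
print(sms_decompress([0x01, 0x02, 0x03, 0x04, 0x04, 0x03]))
# [1, 2, 3, 4, 4, 4, 4]
```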

### #9Travelsonic  Posted 21 December 2011 - 11:52 AM

• Posts: 715
• Joined: 01-March 05
You are 100% spot on that speed is a critical factor.

STRICTLY talking about space saved for a moment, if you did it [byte repeated][# of times repeated] for larger quantities of repeated bytes, wouldn't you end up saving even more than with this algorithm? [Of course, if it is too slow, then it wouldn't - even if IT DID save more space - be too useful in the grander scheme]

### #10dsrb  Posted 21 December 2011 - 12:03 PM

• Posts: 3150
• Joined: 10-June 09
• Gender:Male
• Wiki edits:196

Travelsonic, on 21 December 2011 - 11:52 AM, said:

STRICTLY talking about space saved for a moment, if you did it [byte repeated][# of times repeated] for larger quantities of repeated bytes, wouldn't you end up saving even more than with this algorithm?
Yes, of course. That's the point.

Edit—I AM DERP; I thought you meant the original algorithm.
This post has been edited by dsrb: 21 December 2011 - 02:33 PM

### #11flamewing  Posted 21 December 2011 - 01:01 PM

• Emerald Hunter
• Posts: 1135
• Joined: 11-October 10
• Gender:Male
• Location:🇫🇷 France
• Project:Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
• Wiki edits:12

Travelsonic, on 21 December 2011 - 11:52 AM, said:

STRICTLY talking about space saved for a moment, if you did it [byte repeated][# of times repeated] for larger quantities of repeated bytes, wouldn't you end up saving even more than with this algorithm? [Of course, if it is too slow, then it wouldn't - even if IT DID save more space - be too useful in the grander scheme]

Depends on the type of data you are compressing. If you have data that is dominated by long stretches of identical bytes with a handful of odd bytes sprinkled in, this might compress better than the algorithm they used (emphasis on might); the more 'odd bytes' you have, the less the advantage. You are forgetting that the [# of times repeated] would have to appear for every byte you have. Here, let's contrast your RLE compressor and the SMS/GG compressor in a few examples:

Sequence 1:
00 00 00 00 01 00 00 00 02 00 00 00 03 02
[byte repeated][# of times repeated]:
00 04 01 01 00 03 02 01 00 03 03 01 02 01
SMS/GG:
00 00 03 01 00 00 02 02 00 00 02 03 02
Result: SMS/GG gains one byte versus uncompressed and versus yours.

Sequence 2:
00 00 00 00 01 02 02 01 03 04 02 00 00 00
[byte repeated][# of times repeated]:
00 04 01 01 02 02 01 01 03 01 04 01 02 01 00 03
SMS/GG:
00 00 03 01 02 02 01 01 03 04 02 00 00 02
Result: SMS/GG neither gains nor loses against uncompressed, yours loses two bytes.

Sequence 3:
00 01 02 03 02 03 02 04 03 04 02 01 00 01
[byte repeated][# of times repeated]:
00 01 01 01 02 01 03 01 02 01 03 01 02 01 04 01 03 01 04 01 02 01 01 01 00 01 01 01
SMS/GG:
00 01 02 03 02 03 02 04 03 04 02 01 00 01
Result: SMS/GG neither gains nor loses versus uncompressed, yours is twice as long as uncompressed.
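The three comparisons above can be checked mechanically. A throwaway script (my own; `naive_rle` is the [byte][count]-on-every-run reading that flamewing is contrasting, with count-byte limits ignored for brevity):

```python
def naive_rle(data):
    """[byte][# of repeats] for EVERY run, including runs of length 1."""
    out = []
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        out += [data[i], run]
        i += run
    return out

def sms_rle(data):
    """SCHG description: literals as-is; runs as [byte][byte][repeats - 1]."""
    out = []
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        out += [data[i], data[i], run - 1] if run >= 2 else [data[i]]
        i += run
    return out

seq1 = [0x00] * 4 + [0x01] + [0x00] * 3 + [0x02] + [0x00] * 3 + [0x03, 0x02]
seq2 = [0x00] * 4 + [0x01, 0x02, 0x02, 0x01, 0x03, 0x04, 0x02] + [0x00] * 3
seq3 = [0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 0x02, 0x04,
        0x03, 0x04, 0x02, 0x01, 0x00, 0x01]
for s in (seq1, seq2, seq3):
    print(len(s), len(naive_rle(s)), len(sms_rle(s)))
# 14 14 13
# 14 16 14
# 14 28 14
```

The lengths match flamewing's conclusions: the doubled-byte scheme never does worse than the raw data on these inputs, while the always-count scheme doubles the size of run-free data.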

### #12Travelsonic  Posted 21 December 2011 - 04:08 PM

• Posts: 715
• Joined: 01-March 05

flamewing, on 21 December 2011 - 01:01 PM, said:

Sequence 3:
00 01 02 03 02 03 02 04 03 04 02 01 00 01
[byte repeated][# of times repeated]:
00 01 01 01 02 01 03 01 02 01 03 01 02 01 04 01 03 01 04 01 02 01 01 01 00 01 01 01
SMS/GG:
00 01 02 03 02 03 02 04 03 04 02 01 00 01
Result: SMS/GG neither gains nor loses versus uncompressed, yours is twice as long as uncompressed.

I see what you did there, BUT you assumed wrong - I meant putting it as [byte][# of times repeated] only when bytes repeat; for fuck's sake, if you put the # of times repeated even on non-repeating bytes, you'd have a good way to needlessly inflate the file size.

[and tbh I thought it was clear that the byte/# of repeats thing would only be for bytes that actually repeat - similarly to the GG/SMS algorithm]
This post has been edited by Travelsonic: 21 December 2011 - 04:10 PM

### #13Glitch  Posted 21 December 2011 - 04:22 PM

• Posts: 158
• Joined: 22-September 08
• Gender:Male
• Project:Sonic 2 LD
• Wiki edits:22
I'm not sure if you're trolling here or just completely missing the point.

With RLE compression you need some way of identifying that a stretch has been compressed. That is why they use the double value followed by a count. If you didn't have the double value you wouldn't be able to tell where the compressed data started. The Sonic 2 algorithm does things slightly differently: it uses a marker byte (0xFF) to identify an RLE-compressed stretch.
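The marker-byte idea Glitch mentions can be sketched generically like this (illustrative only -- the thread gives just the 0xFF marker, not the actual Sonic 2 field layout, so the [marker][count][byte] ordering here is my assumption):

```python
MARKER = 0xFF

def marker_rle_decompress(comp):
    """Generic marker-byte RLE: 0xFF introduces a run encoded as
    [0xFF][count][byte]; any other byte is a literal.
    Hypothetical layout -- the real Sonic 2 format may differ."""
    out = []
    i = 0
    while i < len(comp):
        if comp[i] == MARKER:
            count, b = comp[i + 1], comp[i + 2]
            out += [b] * count
            i += 3
        else:
            out.append(comp[i])
            i += 1
    return out

# A literal 0xFF in the data would itself need escaping,
# e.g. as a run of length 1: [0xFF, 0x01, 0xFF].
print(marker_rle_decompress([0x01, 0xFF, 0x04, 0x00, 0x02]))
# [1, 0, 0, 0, 0, 2]
```

The trade-off versus the doubled-byte scheme: a marker costs a fixed escape byte whenever 0xFF appears as a literal, while the doubled-byte scheme costs an extra byte on every 2-byte run.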

### #14flamewing  Posted 21 December 2011 - 04:23 PM

• Emerald Hunter
• Posts: 1135
• Joined: 11-October 10
• Gender:Male
• Location:🇫🇷 France
• Project:Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
• Wiki edits:12

Travelsonic, on 21 December 2011 - 04:08 PM, said:

I see what you did there, BUT you assumed wrong - I meant putting it as [byte][# of times repeated] only when bytes repeat; for fuck's sake, if you put the # of times repeated even on non-repeating bytes, you'd have a good way to needlessly inflate the file size.

[and tbh I thought it was clear that the byte/# of repeats thing would only be for bytes that actually repeat - similarly to the GG/SMS algorithm]

And now we come back full circle to what dsrb has been telling you all along -- there is no way to put the repeat count only on the bytes that repeat. Well, you can have a compressor that does that -- but the decompressor won't be able to make heads or tails out of the compressed data. For example, consider the following compressed sequence of bytes:

01 02

How would you decompress it? Does 01 02 mean 'repeat 01 twice' or does it say 'we have a 01 followed by a 02'? That is, does this sequence of bytes come from compressing this:
01 01

or from [compressing] this:
01 02

See, you have to provide information to the decompressor about this: it doesn't magically know what byte means what, and it certainly does not know which byte is a repeat count and which byte is a literal byte unless you tell it which is which. This is why the compressed format on SMS/GG uses a repeat byte -- it is telling the decompressor that this byte repeats, and the next one is the repeat count.
This post has been edited by flamewing: 21 December 2011 - 04:24 PM

### #15Travelsonic  Posted 21 December 2011 - 06:13 PM

• Posts: 715
• Joined: 01-March 05

flamewing, on 21 December 2011 - 04:23 PM, said:

Of course - that is the crux: while the method I described sounds better in theory - to me, at least - the issue IS how the process would work, how the program would know how to handle each piece of data correctly. That makes it all clear.
This post has been edited by Travelsonic: 21 December 2011 - 06:52 PM
