Perhaps it has been answered before, but this puzzled me. According to the SCHG guide for hacking Sonic 1 [SMS/GG]: So, why does it waste that byte by writing it twice then [# of times it is repeated - 1]? Why not just [byte][# of times repeated]? Would that help improve anything at all, or like the method it uses, would it have limits within which it would actually be useful? EDIT: SO WE'RE CLEAR: I am ASKING: Why didn't they do it slightly differently, like this: And would THAT TWEAK be better AT ALL?
I don't know about this stuff, but it sounds like if you have, for example, 01 02 03 04 04 04 04, it could compress to 01 02 03 04 04 03. It repeats the same byte twice to signal that it's a multi-byte run. Otherwise there'd be no way to say that byte 02 isn't supposed to repeat 03 times. (probably wrong here)
This is incredibly simple—so much so that I can answer it. If it were to do that, it'd have to do it for every byte; that is, even those that were only written once, defeating the point and massively inflating the file size. Like RC said, there has to be some way to tell the decompressor to kick in, and this algorithm does it very simply by writing the same byte twice followed by a quantity (which is wasteful when it only occurs twice in the uncompressed data, but presumably saves enough space on runs of 3+ bytes to be worth it).
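To make the scheme concrete, here is a sketch of such a decompressor. This is Python purely for illustration (the actual game code is Z80 assembly), and the count semantics (count byte = run length minus one) are inferred from the examples in this thread, not from the original source:

```python
def rle_decompress(data):
    """Decompress the double-byte RLE scheme described above.

    Assumed format, inferred from this thread: a byte appearing twice
    in a row signals a run; the byte after the pair is the total run
    length minus one.
    """
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if i + 1 < len(data) and data[i + 1] == byte:
            count = data[i + 2]            # run length - 1
            out.extend([byte] * (count + 1))
            i += 3
        else:
            out.append(byte)               # single literal byte
            i += 1
    return out

# The example from the post above: four 04s become 04 04 03.
print(rle_decompress([0x01, 0x02, 0x03, 0x04, 0x04, 0x03]))
# [1, 2, 3, 4, 4, 4, 4]
```

Note that because the count is a single byte, a run longer than 256 bytes would have to be split into multiple encoded runs by the compressor.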
Nice tautology! But we're talking about the very algorithm you posted, not some hypothetical one. My point is that in your question… …your suggestion is impossible, because it fundamentally clashes with this rule: How would this algorithm determine that all the bytes that were intended to be read singly were read as such, but that all those that were intended to be read as [byte][repeats] were read like that? Hence why there needs to be some system to mark the latter—in this case, two consecutive occurrences of the same byte. It can't “just” compress more efficiently, because it would have no way of discerning between uncompressed (1 repeat) and compressed (2+ repeats) data, and therefore no way to determine whether it should just read any given byte as-is or check the next for how many times to repeat it.
Of course! I guess my question would have been better phrased OVERALL as: why the wasted byte, and why an algorithm that results in a wasted byte instead of one that wastes as little as possible?
There are two things that have to be taken into consideration when deciding on a compression algorithm for a video game -- compression ratio and decompression speed. If it takes too long to decompress, it limits the usefulness of the compressed data (longer loading times); if it has a poor compression ratio, it limits the usefulness of the format (little gain relative to the uncompressed data). Since we are talking about SMS/GG, the processor is the Z80 -- a very slow processor. If you use too fancy a compression algorithm, it will likely take too long to decompress, so they had to keep it simple. Moreover, the byte is only "wasted" for the case of 2 consecutive identical bytes in the uncompressed data: for 3 consecutive identical bytes you break even, and for 4 or more consecutive bytes you always gain more than you lose to the "wasted" byte. And for all cases where the bytes are not identical, there is no loss.
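That break-even arithmetic can be checked directly. Under the scheme described in this thread, a lone byte passes through at a cost of 1 byte, while a run of 2 or more identical bytes always encodes to exactly 3 bytes ([byte][byte][count]), no matter how long the run is:

```python
# Encoded cost of a run of n identical bytes under the scheme above:
# a lone byte passes through (1 byte); a run of n >= 2 always becomes
# [byte][byte][count], i.e. 3 bytes, whatever n is.
def run_cost(n):
    return 1 if n == 1 else 3

for n in range(2, 7):
    print(n, run_cost(n) - n)   # positive = bytes lost, negative = gained
# 2 1    (the one 'wasted' byte)
# 3 0    (break even)
# 4 -1
# 5 -2
# 6 -3
```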
You are 100% spot on that speed is a critical factor. STRICTLY talking about space saved for a moment: if you did it [byte repeated][# of times repeated] for larger quantities of repeated bytes, wouldn't you end up saving even more than with this algorithm? [Of course, if it is too slow, then it wouldn't - even if IT DID save more space - be too useful in the grander scheme]
Depends on the type of data you are compressing. If you have data that is dominated by long stretches of identical bytes with a handful of odd bytes sprinkled in, this might compress better than the algorithm they used (emphasis on might); the more 'odd bytes' you have, the smaller the advantage. You are forgetting that the [# of times repeated] would have to appear for every byte you have. Here, let's contrast your RLE compressor and the SMS/GG compressor in a few examples:

Sequence 1: 00 00 00 00 01 00 00 00 02 00 00 00 03 02
[byte repeated][# of times repeated]: 00 04 01 01 00 03 02 01 00 03 03 01 02 01
SMS/GG: 00 00 03 01 00 00 02 02 00 00 02 03 02
Result: SMS/GG gains one byte versus uncompressed and versus yours.

Sequence 2: 00 00 00 00 01 02 02 01 03 04 02 00 00 00
[byte repeated][# of times repeated]: 00 04 01 01 02 02 01 01 03 01 04 01 02 01 00 03
SMS/GG: 00 00 03 01 02 02 01 01 03 04 02 00 00 02
Result: SMS/GG neither gains nor loses against uncompressed; yours loses two bytes.

Sequence 3: 00 01 02 03 02 03 02 04 03 04 02 01 00 01
[byte repeated][# of times repeated]: 00 01 01 01 02 01 03 01 02 01 03 01 02 01 04 01 03 01 04 01 02 01 01 01 00 01 01 01
SMS/GG: 00 01 02 03 02 03 02 04 03 04 02 01 00 01
Result: SMS/GG neither gains nor loses versus uncompressed; yours is twice as long as uncompressed.
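The three comparisons above can be reproduced mechanically. Here is a sketch of both schemes (Python purely for illustration; the real decompressor is Z80 assembly, and the run-length semantics are inferred from this thread):

```python
def rle_pairs(data):
    """The proposed scheme: every run, even of length 1, becomes
    [byte][# of times repeated] (two bytes)."""
    out, i = [], 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        out.extend([data[i], run])
        i += run
    return out

def rle_double(data):
    """The SMS/GG-style scheme as described in this thread: lone bytes
    pass through; runs of 2+ become [byte][byte][run length - 1]."""
    out, i = [], 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        if run == 1:
            out.append(data[i])
        else:
            out.extend([data[i], data[i], run - 1])
        i += run
    return out

seq1 = [0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00,
        0x02, 0x00, 0x00, 0x00, 0x03, 0x02]
seq2 = [0x00, 0x00, 0x00, 0x00, 0x01, 0x02, 0x02, 0x01,
        0x03, 0x04, 0x02, 0x00, 0x00, 0x00]
seq3 = [0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 0x02, 0x04,
        0x03, 0x04, 0x02, 0x01, 0x00, 0x01]

for raw in (seq1, seq2, seq3):
    print(len(raw), len(rle_pairs(raw)), len(rle_double(raw)))
# 14 14 13
# 14 16 14
# 14 28 14
```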
I see what you did there, BUT you assumed wrong - I meant putting it as [byte][# of times repeated] only when bytes repeat. For fuck's sake, if you put the # of repeats even on non-repeating bytes, you'd have a good way to needlessly inflate the file size. [And tbh I thought it was clear that the byte/# of repeats thing would only be for bytes that actually repeat - similar to the GG/SMS algorithm]
I'm not sure if you're trolling here or just completely missing the point. With RLE compression you need some way of identifying that a stretch has been compressed. That is why they use the doubled value followed by a count. If you didn't have the double value, you wouldn't be able to tell where the compressed data started. The Sonic 2 algorithm does things slightly differently: it uses a marker byte (0xFF) to identify an RLE-compressed stretch.
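The exact byte layout of the Sonic 2 format isn't spelled out in this thread, so the [0xFF][count][byte] ordering below is an assumption for illustration only. The point it demonstrates is the trade-off of a marker-byte design: any literal 0xFF in the data must itself be escaped as a run, or the decompressor would mistake it for a marker:

```python
MARKER = 0xFF  # assumed marker value; the real Sonic 2 layout may differ

def marker_rle_compress(data):
    """Marker-byte RLE sketch: MARKER introduces a [count][byte] run.
    A literal MARKER in the data is escaped as a run of length 1."""
    out, i = [], 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= 4 or data[i] == MARKER:
            out.extend([MARKER, run, data[i]])   # 3 bytes, gains on runs of 4+
        else:
            out.extend([data[i]] * run)          # short runs stay literal
        i += run
    return out

def marker_rle_decompress(data):
    out, i = [], 0
    while i < len(data):
        if data[i] == MARKER:
            count, byte = data[i + 1], data[i + 2]
            out.extend([byte] * count)
            i += 3
        else:
            out.append(data[i])
            i += 1
    return out

raw = [0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x02]
packed = marker_rle_compress(raw)
assert marker_rle_decompress(packed) == raw
```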
And now we come back full circle to what dsrb has been telling you all along -- there is no way to put the repeat count only on the bytes that repeat. Well, you can have a compressor that does that -- but the decompressor won't be able to make heads or tails of the compressed data. For example, consider the following compressed sequence of bytes: 01 02. How would you decompress it? Does 01 02 mean 'repeat 01 twice', or does it say 'we have a 01 followed by a 02'? That is, does this sequence of bytes come from compressing this: 01 01 -- or from compressing this: 01 02? See, you have to provide information to the decompressor: it doesn't magically know what byte means what, and it certainly does not know which byte is a repeat count and which is a literal byte unless you tell it which is which. This is why the compressed format on SMS/GG uses a repeated byte -- it tells the decompressor that this byte repeats, and that the next one is the repeat count.
Of course - that is the crux. While the method I described sounds better in theory - to me, at least - that IS the issue: how would the process work, how would the program know how to handle each piece of data correctly? That makes it all clear.