Sonic CD decompilation

Discussion in 'Engineering & Reverse Engineering' started by BenoitRen, Jul 17, 2023.

  1. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    Edit from the future: if you're looking for the Git repository, it's hosted at sourcehut.

    It has been known for years now that the Sonic CD version included as part of Sonic Gems Collection (which is a port of the PC version) comes with debug symbols.

    Last year, Devon dug deeper and produced a disassembly using the linker of the original compiler. Some time later, he found how to extract even more debug information. Now, a skeleton of the original source code repository was available, and he showed a sample decompilation.

    That's when I joined Sonic Retro and started this project, which aims to restore the original C89 source code. This is possible because part of the recovered debug information links line numbers of the original source code to groups of MIPS assembly instructions. Of course, comments can't be recovered, which results in lots of whitespace at times.

    After dedicating almost all of my free time for the past three months to this project, I've hit the first milestone: the decompiled main/root files are now available!

    These files can't be considered finished yet, though. I haven't yet figured out how the global variables are structured across files and in which file they belong. Hopefully, that'll become clearer while decompiling the rest of the files.

    As you can see, the source code is currently hosted at my website, but I'd like to upload it to a Git repository in Europe. I was thinking of NotABug. As for the license, I want to release this into the public domain, so I was thinking of using Unlicense. What are your thoughts on this?

    Also, does this mean I can also start rambling about quirks I found? :)
     
    Last edited: Jul 25, 2024
  2. Billy

    Billy

    RIP Oderus Urungus Member
    2,149
    215
    43
    Colorado, USA
    Indie games
    As for Git hosting, I imagine anything that lets people clone the repo will be fine. If you want people to be able to contribute, it should also handle pull requests and such.

    I'm far from an expert on licensing. Obviously people can't legally sell Sonic CD, but I'm guessing you just want a public domain license with a "software is provided as-is with no warranty, etc." disclaimer, so I imagine that'd be fine. Looks like the Mario 64 decomp project does something similar and uses CC.
     
  3. Devon

    Devon

    DROWN, DROWN, DROWN MYSELF! Tech Member
    1,391
    1,678
    93
    your mom
    I dunno anything about licensing or hosting, but I'm actually really happy to see this. Can't wait to see more progress.
    This thread could use some new posts ;)
     
  4. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
  5. Brainulator

    Brainulator

    Regular garden-variety member Member
    Personally, I'd like to see if there's a way we can make clear which labels are original to the code and which ones were made up in lieu of better information.

    Was this taken from the PS2 version of Gems Collection or the GCN version?
     
  6. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    I had to name most of the structs and all of the unions, as their names weren't included. Would a wiki page listing those work? Everything else is original.

    The PS2 collection's version is the one that's being decompiled.
     
  7. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    I'm still hard at work decompiling the files related to R11A (the first act in the present), and am almost done.

    At the same time, I've also pushed the root files through a C compiler, fixing all the compilation errors and gaining a better understanding of the global variables. The result of that work is now available.
     
  8. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    The past week I've finished decompiling the files related to R11A, and have been fighting with Metrowerks CodeWarrior, the compiler that was originally used to compile the game. Yes, "fighting", because getting it to work is a pain. The documentation in general is lacking, and the little PS2-specific documentation there is seems like an afterthought.

    After lots of frustration, I was able to compile an ELF file that resembles the original. What's different is that the code section is supposed to be at a much higher address, and something must be going on with the data section because not all global variables are stored in it.

    Despite this, I was able to start comparing assembly. It's interesting how small changes that achieve the same end result affect the output when compiled without optimisations (because, yes, this game was compiled with *all* of them off!).

    For example, if you have an integer you want to test for not having a value of zero, you can do this:
    Code (Text):
    if (someNumber != 0)
    However, this generates an extra instruction compared to this:
    Code (Text):
    if (someNumber)
    In all cases I've seen thus far, the second notation seems to be what's used.
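
    To make the comparison concrete, here's a minimal C89 sketch of the two forms side by side. The function names are made up for illustration. As far as the language is concerned both tests mean the same thing, so the extra instruction is an artifact of the unoptimised build rather than something C dictates.

```c
/* Hypothetical helpers: both return 1 for any non-zero input, 0 otherwise. */

/* Explicit comparison against zero. */
int test_explicit(int someNumber)
{
    if (someNumber != 0)
        return 1;
    return 0;
}

/* Implicit truth test: any non-zero value counts as true. */
int test_implicit(int someNumber)
{
    if (someNumber)
        return 1;
    return 0;
}
```

    Since the results are identical, only the assembly comparison can tell which form the original source used.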

    There are several differences that I can't explain, however. For example, in some cases when a function argument is assigned to a register, it's assigned twice in the assembly I generated.

    I'll be continuing the comparison for now, as it does make the code more complete and has unearthed a mistake, but I don't think I'll be able to get everything like the original without help.
     
  9. Devon

    Devon

    DROWN, DROWN, DROWN MYSELF! Tech Member
    1,391
    1,678
    93
    your mom
    To be fair, this is basically a debug build, considering all the debugging information left in. Leaving the compiled code unoptimized allows for easy step by step debugging as the program is run for the developers. At least the good news with it is that you can more or less get a 1:1 recreation of the source code with that, whereas optimizations would've stripped and rearranged some stuff.
     
  10. Black Squirrel

    Black Squirrel

    no reverse gear Wiki Sysop
    9,016
    2,842
    93
    Northumberland, UK
    steamboat wiki
    I don't know if their IDE of choice would let them switch between "debug" and "release" builds like you'd have today, but there are often bugs exclusive to release mode.

    If you're a year out from the actual release, and you know big chunks of the codebase are likely to change, fixing these bugs isn't a priority. Just compile in debug - it's not like anyone's going to break in and analyse the code oh
     
  11. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    I've now created the Missing symbols wiki page for this.
     
  12. Brainulator

    Brainulator

    Regular garden-variety member Member
    Thanks. May I ask why, though, you replaced the symbols for certain structs which did have their names saved?
     
  13. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    I figured that the type name would probably be different from the tag. For example, it's unlikely that tagPOINT would also be the type name. That logic doesn't hold true for dlink_export, though, so I'll change its type name to be the same.
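
    For anyone unfamiliar with the distinction, here's a minimal sketch of a struct tag versus a typedef name. Only the tag tagPOINT comes from the debug information; the typedef name POINT, the member names and both helper functions are assumptions made up for this example.

```c
/* tagPOINT is the struct tag; POINT is the (assumed) typedef name. */
typedef struct tagPOINT {
    int x;  /* member names are assumptions */
    int y;
} POINT;

/* Builds a value using the typedef name. */
POINT make_point(int x, int y)
{
    POINT p;
    p.x = x;
    p.y = y;
    return p;
}

/* Accepts the same type spelled with the tag instead. */
int point_sum(struct tagPOINT p)
{
    return p.x + p.y;
}
```

    Both spellings refer to the same type, which is why a recovered tag alone doesn't reveal what the typedef was called.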

    If anyone has suggestions for better names, I'm all ears. I intend to change act_info to spr_sts_tbl (sprite status table) for the sake of consistency.
     
  14. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    All the source code related to the Special Stage has been committed! It compiles, but I haven't compared the resulting ASM to the original ASM yet.

    Speaking of ASM comparisons, I've also committed all the results of comparisons I've done for the source found in the root directory. It took a long time because, frankly, it was quite discouraging to find out that, at least for now, I'm not able to compile an ELF that is 99% identical to the original, which would have let me be sure that the code matches the original as closely as possible. The resulting assembly uses different registers to process data, function arguments sometimes get written to a register twice for no discernible reason, shifting and logical AND instructions are missing, etc. All of this results in a diff that has differences on almost every line, most of which I have to ignore. I'm keeping a rough record of the meaningful differences on the wiki's ASM differences page.

    Meanwhile, at work I've learned how to properly use Git's interactive rebase feature, and in professional deformation style, it has affected this project's Git repository. I have to be careful to not make it use too much of my time.
     
  15. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    All the source code related to R1 has been committed! This has already been compared to the original ASM, so it should be as authentic as possible. At least until someone figures out the compiler.

    Of particular note is BOSS_1.C, which contains all the code for the first encounter with Eggman. At 3684 lines long (with the code beginning at line 628) and more than 100 kB, it seems to be the most complex boss of the game. It's such a shame that it's so easily dispatched.

    There's something I'd like some feedback on: multiplications by a power of two and left bitshifting in C translate to the same ASM. How do I know which to pick? It's not always obvious that the code is shifting bits around instead of calculating something.
     
  16. MarkeyJester

    MarkeyJester

    Original, No substitute Resident Jester
    2,236
    491
    63
    Japan
    Assuming the compiler produces the result 1:1, then I think it's person by person preference.

    Though I assume the consensus would be:
    • If you're shifting to multiply, use the multiply operator.
    • If you're shifting to arrange binary flags in a specific spot for non-multiplication reasons, use the shift operator.
    The purpose of a high-level language is to make code human-readable, so ideally you want the code to reflect the purpose, unlike in assembly, where shifting is an optimisation necessity.

    Let the compiler do the work.
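
    As a rough illustration of that rule of thumb, here's a small sketch. Every name and constant in it is hypothetical (nothing is taken from the game's source), and it uses unsigned values, for which multiplying by 32 and shifting left by 5 give the same result.

```c
/* Hypothetical flag: a bit is being placed, so the shift reads naturally. */
#define FLAG_FLIP_X (1u << 3)

/* Arithmetic intent: each tile is assumed to occupy 32 bytes. */
unsigned int tile_offset(unsigned int tile_index)
{
    return tile_index * 32;
}

/* Same value written as a shift; the intent is harder to see. */
unsigned int tile_offset_shifted(unsigned int tile_index)
{
    return tile_index << 5;
}

/* Bit-twiddling intent: set one flag without touching the others. */
unsigned int set_flip(unsigned int flags)
{
    return flags | FLAG_FLIP_X;
}
```

    When the intent can't be recovered, either spelling is equally faithful to the assembly, so it comes down to which one reads better in context.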
     
  17. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    I would like the code to reflect the purpose, but in most cases I don't know what that was, and it's frustrating to have to choose based on nothing. I also wonder if the staff responsible for the C conversion even knew what the purpose was and made the C code reflect that.

    Braces usage also frustrates me. I started using K&R as that's my preference, but when there are consistent gaps after flow control statements, it suggests that Allman was used. So if the entire file has that pattern, I convert it to Allman. The result, of course, is that the style used is not always consistent.

    The file I'm currently working on is obviously using Allman, but then I encountered a case where there was no gap after a for statement, but the spot where the closing brace would be does have a line number associated with it (so it's not a brace-less for statement). It's certainly possible that Sega staff wasn't always consistent, but it's also possible that those gaps are there for lost comments. There's no way to know.
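
    For reference, here's the same loop in both brace styles. The functions are made up for illustration. In the Allman version the opening brace sits on a line of its own, and such a line presumably has no instructions mapped to it, which would explain a gap in the recovered line numbers right after the for statement.

```c
/* K&R: the opening brace shares a line with the for statement. */
int sum_kr(int n)
{
    int i, total = 0;
    for (i = 0; i < n; i++) {
        total += i;
    }
    return total;
}

/* Allman: the opening brace gets a line of its own. */
int sum_allman(int n)
{
    int i, total = 0;
    for (i = 0; i < n; i++)
    {
        total += i;
    }
    return total;
}
```

    Both compile to the same behaviour, so the line-number gaps are the only evidence for one style over the other.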
     
  18. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    I've just committed the TITLE folder. It contains the source for:
    • the ending movies (AVIGOOD)
    • the opening and pencil test movies (AVIOPEN)
    • the screen showing the developers' best times (BESTTIME)
    • the title screen (OPENING)
    • D.A. Garden (PLANET)
    • the save file menu (SAVEDATA)
    • Sound Test (SOUNDTST)
    • Stage Select (STAGETST)
    • the Time Attack menu (TA)
    • the post-credits screen (THANKS)
    • Visual Mode (VISUALMD)
    What's notable is how much code the post-credits screen has. It's a game mode on its own.

    I've also changed some type names in the code. Some to avoid name clashing (map_init_data, POINT), and others as an improvement (for example, act_info to sprite_status).
     
  19. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    While verifying the ASM of my decompilation of R3 (yes, it's coming!), I've noticed that several objects seem to calculate the array index of their parent/child incorrectly.

    Usually, the array index of an object is calculated as follows:
    Code (Text):
    actwk - pActwk
    actwk being a pointer to the global object array (technically a pointer to its first element), and pActwk being a pointer to one of its elements. Thanks to C pointer arithmetic, the result is not the subtraction of two addresses, but the difference in number of elements.

    However, as an example, here's how trapdr3 calculates it:
    Code (Text):
    floorwk - actwk
    I don't have enough experience with pointer arithmetic to say this with 100% certainty, but this seems to calculate the object's negative index. As array indexes can only be positive, this seems like a bad idea. Of course, that's why these are stored as unsigned numbers, but that also means that the resulting index will be wrong.

    One would think that this is the kind of error that would be caught during playtesting because this means the link between objects is broken. Maybe there's no problem here, after all?
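
    To check the sign question in isolation, here's a minimal sketch of how pointer subtraction behaves in standard C. The struct is a made-up stand-in for the game's object type, and the 8-element array size is an assumption; only the names actwk and pActwk come from the code discussed above.

```c
#include <stddef.h>

/* Stand-in for the game's object struct; the member is made up. */
typedef struct {
    int dummy;
} object;

/* Assumed size; the real array is larger. */
object actwk[8];

/* element - base: the element's index, here in the range 0..7. */
ptrdiff_t index_of(const object *pActwk)
{
    return pActwk - actwk;
}

/* base - element: the same magnitude with the sign flipped. */
ptrdiff_t negated_index_of(const object *pActwk)
{
    return actwk - pActwk;
}
```

    The operand order alone decides the sign: element minus base gives the index, base minus element gives its negation, and a negative result stored in an unsigned variable wraps around to a large value. So it's worth double-checking the operand order in each case against the assembly.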
     
  20. BenoitRen

    BenoitRen

    Tech Member
    771
    380
    63
    All the source code related to R3, Collision Chaos, has been committed!

    Like I've noted in the other Sonic CD thread, there's an object called "miracle" in this zone that only does one thing: delete itself. It is listed in the object table, but other than that there is no information on what this could have been.