Edit: The improved DMA queue (and instructions for its use) can now be obtained on GitHub. This post will no longer be updated with new developments, but the thread will.
Edit: I make corrections to account for the edge cases described on this post and on this post. Rest of original post follows.
I have been going over the list of things to optimize in the original Genesis games, so I can get Sonic Classic Heroes to work as smoothly as possible. One of the things I just finished optimizing was the DMA queue management function, and I was amazed at how much I managed to save. I am now sharing this so everyone can benefit.
The new functions require that you do two things before you can use them (which I will detail later): the first is going through your assembly files and changing the way the queue is cleared; the second is calling the initialization function for the new DMA functions. At the end, you will have gained two bytes in RAM, and a DMA queue that runs much faster.
How much faster? Well, it depends. There are 3 different DMA functions that are used nowadays:
- The stock S2 function;
- the stock S&K function;
- the Sonic3_Complete version that is used when you assemble the Git disassembly with Sonic3_Complete=1.
The stock S2 function is the fastest of the 3:
- 52(12/0) cycles if the queue was full;
- 336(51/9) cycles if the new transfer filled the queue;
- 346(52/10) cycles otherwise.
The stock S&K function is 8(2/0) cycles slower than the S2 version, but it can be safely used when the source is in RAM (the S2 version requires some extra care, but I won't go into details). So its times are:
- 52(12/0) cycles if the queue was full;
- 344(53/9) cycles if the new transfer filled the queue;
- 354(54/10) cycles otherwise.
The Sonic3_Complete version is based on the stock S&K version; it is thus also safe with RAM sources. However, it breaks up DMA transfers that cross 32kB boundaries into two DMA transfers(*). The way it does this adds an enormous overhead to all DMA transfers. Its times are:
- If the source is at address $800000 and up (32X RAM, Z80 RAM, main RAM):
  - 72(16/0) cycles if the queue was full;
  - 364(57/9) cycles if the queue became full with the new command;
  - 374(58/10) cycles otherwise.
- If the source is at address $7FFFFF and down (ROM, both SCD RAMs):
  - If the DMA does not need to be split:
    - 294(53/10) cycles if the queue was full at the start;
    - 586(94/19) cycles if the queue became full with the new command;
    - 596(95/20) cycles otherwise.
  - If the DMA needs to be split in two:
    - 436(83/30) cycles if the queue was full at the start;
    - 728(124/21) cycles if the queue became full with the first command;
    - 1030(166/31) cycles if the queue became full with the second command;
    - 1040(167/32) cycles otherwise.
Yeah, you are wasting hundreds of cycles by using the Sonic3_Complete version... and even more than you think when you note the asterisk I added above. You see, the VDP has issues with DMAs that cross a 128kB boundary in ROM; the Sonic3_Complete version tries to handle this, but is overzealous -- it breaks up transfers that cross a 32kB boundary instead. Thus, loads of DMAs that should not be broken at all are broken in two... leading to several hundred wasted cycles. The function is bad enough that manually breaking up the transfers would be much faster -- potentially running in 2/3 of the time.
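To make the 128kB rule concrete: a transfer only needs to be split when its first and last bytes lie in different 128kB banks, that is, when their addresses differ somewhere in bits 17 and up. A rough sketch of that check follows; this is not the actual queue code, and the register usage is made up for illustration:

```m68k
; Sketch only: d0 = source address (in bytes), d1 = length (in words).
; The transfer crosses a 128kB boundary iff the first and last bytes
; differ in bits 17 and up of the address.
	move.l	d0,d2
	add.l	d1,d2
	add.l	d1,d2			; d2 = source + 2 * length
	subq.l	#1,d2			; d2 = address of last byte transferred
	eor.l	d0,d2			; which address bits changed?
	and.l	#$FFFE0000,d2		; keep only the bank bits (17 and up)
	beq.s	.NoSplit		; same bank: queue the transfer whole
	; Otherwise split: the first piece runs up to the end of the source's
	; 128kB bank, and the second piece starts at the next bank.
.NoSplit:
```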
So, how does my optimized function compare with this? There are two versions you can select with a flag during assembly: the "competitor" to the stock S2/stock S&K versions, which does not care whether or not transfers cross a 128kB boundary; and the "competitor" to the Sonic3_Complete version, which is 128kB safe. Both of them are safe for RAM sources, and this safety is implemented in an optimized way with zero cost -- the functions would not be faster without the added protection. The times for the non-128kB-safe version are:
- 48(11/0) cycles if the queue was full at the start;
- 194(33/9) cycles otherwise.
The times for the 128kB-safe version are:
- 48(11/0) cycles if the queue was full at the start (as always);
- 214(37/9) cycles for DMA transfers that do not need to be split into two;
- 252(46/9) cycles if the first piece of the DMA filled the queue;
- 368(64/16) cycles if both pieces of the DMA were queued.
I will leave comparisons to whoever wants to make them. I will just mention that if you use SonMapEd-generated DPLCs and the Sonic3_Complete function, you are easily wasting thousands of cycles every frame.
So, now on to how to add this function to your hacks.
Git S2 version
[collapsed code block]
Git S&K version
[collapsed code block]
The new function
[collapsed code block]
Additional Care
There are some additional points that are worth paying attention to.
128kB boundaries and you
For both S2 and S&K (or anywhere else you want to use this), the version that does not check for 128kB boundaries is the default. The reason is this: you can (and should) always align the problematic art in such a way that the DMA never needs to be split in two. Enabling the check by default would thus carry a penalty with little real benefit. In any case, you can toggle it by setting the Use128kbSafeDMA option to 1.
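As an illustration of the alignment approach, with the AS assembler used by the Git disassemblies you can assert at build time that a piece of art never straddles a 128kB boundary. The label and file names here are hypothetical:

```m68k
; Hypothetical example: fail the build if DMAing this art would require a split.
Art_Example:	binclude "art/example.bin"	; hypothetical uncompressed art
Art_Example_End:
	if (Art_Example>>17)<>((Art_Example_End-1)>>17)
		fatal "Art_Example crosses a 128kB boundary; move or pad it"
	endif
```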
Transfers of 64kB or larger
If you have enabled the version that breaks DMAs in two when they cross a 128kB boundary, this is relevant to you. There is an option that saves 4(1/0) cycles in the case where a DMA transfer is broken in two pieces and both pieces are successfully queued (that is, the first piece did not fill the queue). This optimization assumes that you never perform a transfer of 64kB or larger; note that exactly 64kB is included here! It is disabled by default to avoid this edge case.
Interrupt Safety
The original functions have several race conditions that make them unsafe with regard to interrupts. My version removes one of them, but adds another. For the vast majority of cases, this is irrelevant -- the QueueDMATransfer function is generally called only when Vint_routine is zero, meaning that the DMA queue will not be processed, and all is well.
There is one exception, though: the S3&K KosM decoder. Since the KosM decoder sets Vint_routine to a nonzero value, you can potentially run into an interrupt in the middle of a QueueDMATransfer call. Effects range from lost DMA transfers, to garbage DMA transfers, to one garbage DMA and a lost DMA (if the transfer was split), to, in the best possible outcome, no ill effects at all. You can toggle interrupt safety by setting the UseVIntSafeDMA flag to 1, but this adds overhead to all safe callers; it would be better to fix the unsafe callers to mask interrupts while the DMA transfer is being queued.
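A minimal sketch of that fix for an unsafe caller: mask interrupts around the call, then restore the caller's previous interrupt level. This is not taken from the queue code itself, and assumes the registers holding the transfer parameters are already set up:

```m68k
; Sketch only: queue a DMA transfer with interrupts masked so V-int
; cannot run the queue while it is being modified.
	move.w	sr,-(sp)		; save current status register (interrupt mask)
	move.w	#$2700,sr		; disable all maskable interrupts
	jsr	(QueueDMATransfer).l	; queue the transfer safely
	move.w	(sp)+,sr		; restore the previous interrupt level
```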
ASM68k
If you use this crap, all you need to do to use the code above is:
- replace the dotted labels (composed symbols) by @ labels (local symbols);
- replace the last two instances of "endm" by "endr";
- edit the VRAMCommReg macro to use asm68k-style parameters.
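As an illustration of the first point, a dotted local label in the AS code would become an @ label under asm68k; the loop body here is made up:

```m68k
; AS syntax (composed symbols), as in the code above:
.loop:	move.w	(a1)+,(a2)
	dbf	d0,.loop

; asm68k syntax (local symbols), after the conversion:
@loop:	move.w	(a1)+,(a2)
	dbf	d0,@loop
```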
[collapsed code block]
This post has been edited by flamewing: 12 July 2015 - 10:31 AM

