Sonic and Sega Retro Message Board: Optimizing the DMA queue - Sonic and Sega Retro Message Board


Optimizing the DMA queue Now on GitHub

#1 User is offline flamewing 

Posted 09 August 2014 - 01:00 PM

  • Emerald Hunter
  • Posts: 831
  • Joined: 11-October 10
  • Gender:Male
  • Location:Brasil
  • Project:Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
  • Wiki edits:12
Edit: The improved DMA queue (and instructions for its use) can now be obtained on GitHub. This post will no longer be updated with new developments, but the thread will.

Edit: I made corrections to account for the edge cases described in this post and in this post. Rest of original post follows.


I have been going over the list of things to optimize in the original Genesis games, so I can get Sonic Classic Heroes to work as smoothly as possible. One of the things I just finished optimizing was the DMA queue management function, and I was amazed at how much I managed to save. I am now sharing this so everyone can benefit.

The new functions require that you do two things before you can use them (which I will detail later): the first is going through your assembly files and changing the way the queue is cleared; the second is calling the initialization function for the new DMA functions. At the end, you will have gained two bytes in RAM, and a DMA queue that runs much faster.

How much faster? Well, it depends. There are 3 different DMA functions that are used nowadays:
  • The stock S2 function;
  • the stock S&K function;
  • the Sonic3_Complete version that is used when you assemble the Git disassembly with Sonic3_Complete=1.

The stock S2 function is the fastest of the 3:
  • 52(12/0) cycles if the queue was full;
  • 336(51/9) cycles if the new transfer filled the queue;
  • 346(52/10) cycles otherwise.

The stock S&K function is 8(2/0) cycles slower than the S2 version, but it can be safely used when the source is in RAM (the S2 version requires some extra care, but I won't go into details). So its times are:
  • 52(12/0) cycles if the queue was full;
  • 344(53/9) cycles if the new transfer filled the queue;
  • 354(54/10) cycles otherwise.

The Sonic3_Complete version is based on the S&K stock version; it is thus also safe with RAM sources. However, it breaks up DMA transfers that cross 32kB boundaries into two DMA transfers(*). The way it does this adds an enormous overhead on all DMA transfers. Its times are:
  • If the source is in address $800000 and up (32x RAM, z80 RAM, main RAM):
    • 72(16/0) cycles if the queue was full;
    • 364(57/9) cycles if queue became full with new command;
    • 374(58/10) cycles otherwise;
  • If the source is in address $7FFFFF and down (ROM, both SCD RAMs):
    • If the DMA does not need to be split:
      • 294(53/10) cycles if the queue was full at the start;
      • 586(94/19) cycles if queue became full with new command;
      • 596(95/20) cycles otherwise;
    • If the DMA needs to be split in two:
      • 436(83/12) cycles if the queue was full at the start;
      • 728(124/21) cycles if queue became full with the first command;
      • 1030(166/31) cycles if queue became full with the second command;
      • 1040(167/32) cycles otherwise.

Yeah, you are wasting hundreds of cycles by using the Sonic3_Complete version... but even more than you think when you note the asterisk I added above. You see, the VDP has issues with DMAs that cross a 128kB boundary in ROM; the Sonic3_Complete version tries to handle this, but is overzealous -- it breaks up transfers that cross a 32kB boundary instead. Thus, loads of DMAs that should not be broken at all are broken into two... leading to several hundreds of wasted cycles. The function is bad enough that manually breaking up the transfers would be much faster -- potentially 2/3rds of the time.
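To make the difference between the two boundary sizes concrete, here is a minimal Python model of splitting a transfer wherever it crosses a boundary. It is only a sketch of the idea, not the 68k code of any of the functions discussed; the name split_at_boundary and the example addresses are made up for illustration.

```python
def split_at_boundary(source, length, boundary=0x20000):
    """Split a transfer of `length` bytes starting at `source` into pieces
    that never cross a multiple of `boundary` (128kB by default)."""
    pieces = []
    while length > 0:
        room = boundary - (source % boundary)  # bytes left before the next boundary
        chunk = min(length, room)
        pieces.append((source, chunk))
        source += chunk
        length -= chunk
    return pieces


# Hypothetical art at $7F00, $200 bytes long: crosses a 32kB boundary only.
print(split_at_boundary(0x7F00, 0x200))          # 128kB rule: stays in one piece
print(split_at_boundary(0x7F00, 0x200, 0x8000))  # 32kB rule: needlessly split in two

# Hypothetical art at $1FF00, $200 bytes long: really crosses a 128kB boundary.
print(split_at_boundary(0x1FF00, 0x200))         # split in two under either rule
```

Under the 128kB rule only the last transfer is split; under the 32kB rule the first one is split as well, and that is the kind of unnecessary split that costs the extra cycles listed above.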

So, how does my optimized function compare with this? There are two versions you can select with a flag during assembly: the "competitor" to stock S2/stock S&K versions, which does not care whether or not transfers cross a 128kB boundary; and the "competitor" to Sonic3_Complete version, which is 128kB safe. Both of them are safe for RAM sources, and this is done in an optimized way that has zero cost -- the functions would not be faster without this added protection. The times for the non-128kB-safe version are:
  • 48(11/0) cycles if the queue was full at the start;
  • 194(33/9) cycles otherwise.

The times for the 128kB-safe version are:
  • 48(11/0) cycles if the queue was full at the start (as always);
  • 214(37/9) cycles for DMA transfers that do not need to be split into two;
  • 252(46/9) cycles if the first piece of the DMA filled the queue;
  • 368(64/16) cycles if both pieces of the DMA were queued.

I will leave comparisons to whoever wants to make them. I will just mention that if you use SonMapEd-generated DPLCs and you are using the Sonic3_Complete function, you are easily wasting thousands of cycles every frame.
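For anyone who does want the comparison, a few lines of Python do the subtraction. This sketch uses only the total cycle counts quoted above for the case where the transfer is queued and the queue is not filled; the labels are arbitrary, and which old/new pairs are worth comparing is a judgment call.

```python
# (label, old cycles, new cycles), all taken from the lists above.
comparisons = [
    ("stock S2 vs new (not 128kB-safe)", 346, 194),
    ("stock S&K vs new (not 128kB-safe)", 354, 194),
    ("Sonic3_Complete, RAM source vs new (128kB-safe)", 374, 214),
    ("Sonic3_Complete, ROM, unsplit vs new (128kB-safe), unsplit", 596, 214),
    ("Sonic3_Complete, ROM, split vs new (128kB-safe), unsplit", 1040, 214),
    ("Sonic3_Complete, ROM, split vs new (128kB-safe), split", 1040, 368),
]

# Cycles saved on every call to the queueing function.
savings = {label: old - new for label, old, new in comparisons}

for label, cycles in savings.items():
    print(f"{label}: {cycles} cycles saved per call")
```

The fifth row is the interesting one: it covers a transfer that crosses a 32kB boundary but not a 128kB one, which Sonic3_Complete splits and the new function does not.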

So, now for how to add this function to your hacks.

Git S2 version
Spoiler


Git S&K version
Spoiler


The new function
Spoiler

Additional Care
There are some additional points that are worth paying attention to.

128kB boundaries and you
For both S2 and S&K (or anywhere you want to use this), the version that does not check for 128kB boundaries is the default. The reason is this: you can (and should) always align the problematic art in such a way that the DMA never needs to be split in two. So enabling this by default carries a penalty with little real benefit. In any case, you can toggle this by setting the Use128kbSafeDMA option to 1.

Transfers of 64kB or larger
If you have enabled the version that breaks DMAs into two if they go over a 128kB boundary, this is relevant for you. There is an option that saves 4(1/0) cycles on the case where a DMA transfer is broken in two pieces and both pieces are correctly queued (that is, the first transfer did not fill the queue). This option assumes that you never perform a transfer with length of 64kB or higher; note that 64kB is included here! In these conditions, a small optimization exists that leads to the small savings mentioned. This is disabled by default to avoid this edge case.

Interrupt Safety
The original functions have several race conditions that make them unsafe regarding interrupts. My version removes one of them, but adds another. For the vast majority of cases, this is irrelevant -- the QueueDMATransfer function is generally called only when Vint_routine is zero, meaning that the DMA queue will not be processed, and all is well.

There is one exception, though: the S3&K KosM decoder. Since the KosM decoder sets Vint_routine to a nonzero value, you can potentially run into an interrupt in the middle of a QueueDMATransfer call. Effects range from lost DMA transfers, to garbage DMA transfers, to one garbage DMA and a lost DMA (if the transfer was split), or, in the best possible outcome, no ill effects at all. You can toggle interrupt safety by setting the UseVIntSafeDMA flag to 1, but this adds overhead to all safe callers; better would be to fix the unsafe callers to mask interrupts while the DMA transfer is being queued.

ASM68k
If you use this crap, all you need to do to use the code above is:
  • replace the dotted labels (composed symbols) by @ labels (local symbols);
  • replace the last two instances of "endm" by "endr";
  • edit the VRAMCommReg macro to use asm68k-style parameters.


Spoiler

This post has been edited by flamewing: 12 July 2015 - 10:31 AM

#2 User is offline KingofHarts 

Posted 10 August 2014 - 10:07 AM

  • Call me back when people stop shitting in the punch bowl...
  • Posts: 1480
  • Joined: 07-August 10
  • Gender:Male
  • Wiki edits:1
I wanna port this to Sonic 1, as I've got the DMA going on there.
DISREGARD THE QUESTION IF YOU SAW IT... I THINK I'm FIGURING IT OUT
This post has been edited by KingofHarts: 10 August 2014 - 10:17 AM

#3 User is offline KingofHarts 

Posted 10 August 2014 - 10:31 PM

Double posting as I'm in need of help. Here is my DMA queue.asm of my attempt to port this over into a Sonic 1 disassembly that has DMA queue already implemented. When I do this... my screen gets completely glitched and Sonic's tiles are thrown all over VRAM like mad... some help would be appreciated.

#4 User is offline Hitaxas 

Posted 11 August 2014 - 12:54 AM

  • SEGA: Sorry Classic Sonic, we are sending you back to 1994
  • Posts: 1432
  • Joined: 30-September 07
  • Gender:Male
  • Location:Back in Litchfield,CT
  • Project:Sonic: Super Deformed (head director) - Slowly working on it.
  • Wiki edits:196
I'm actually getting something along the lines of what KingofHarts is experiencing, but with my attempt to slap this into sonic 2. Not sure why though.

#5 User is offline flamewing 

Posted 11 August 2014 - 07:54 AM

@KoH: in your version, I noticed you replaced the movep's by move's; for example:
InitDMAQueue_Loop:
        move.l	d1,2(a1)

This should be, instead:
InitDMAQueue_Loop:
        movep.l	d1,2(a1)

So of course your version does not work. movep reads/writes alternating bytes from RAM/ROM/SRAM (so a "movep.l d0,0(a1)" would write to a1+0, a1+2, a1+4 and a1+6), whereas move writes to contiguous bytes ("move.l d0,0(a1)" writes to a1, a1+1, a1+2, a1+3).
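The difference is easy to see in a small Python model of the two stores. This is only an illustration of the byte layout; the helper names move_l and movep_l and the $AA filler value are made up for the example.

```python
def move_l(memory, address, value):
    """move.l: store the 4 bytes of `value` at address .. address+3 (big-endian)."""
    memory[address:address + 4] = value.to_bytes(4, "big")


def movep_l(memory, address, value):
    """movep.l: store the 4 bytes of `value` at address, +2, +4 and +6,
    leaving the bytes in between untouched."""
    for i, byte in enumerate(value.to_bytes(4, "big")):
        memory[address + 2 * i] = byte


ram_a = bytearray(b"\xAA" * 8)  # 8 bytes of filler
ram_b = bytearray(b"\xAA" * 8)

move_l(ram_a, 0, 0x11223344)
movep_l(ram_b, 0, 0x11223344)

print(ram_a.hex())  # 11223344aaaaaaaa
print(ram_b.hex())  # 11aa22aa33aa44aa
```

Swapping one for the other therefore overwrites the bytes that movep deliberately skips, and leaves half of the intended bytes unwritten.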
This post has been edited by flamewing: 11 August 2014 - 07:58 AM

#6 User is offline KingofHarts 

Posted 11 August 2014 - 08:57 AM

Yes sir, that indeed fixed it. I took them out cuz for some reason I thought those instructions were giving me errors... which in hindsight doesn't really make any sense. It was the rept's that didn't work. Everyone using Sonic 1 can see how I took care of that in the file posted above... which has been edited and should now work with no issues.
This post has been edited by KingofHarts: 11 August 2014 - 08:58 AM

#7 User is offline flamewing 

Posted 11 August 2014 - 10:18 AM

Oh, I forgot -- the repts should work fine if you replace the "endm"s with "endr"s. I added this to the guide.

@Hitaxas: can you elaborate on what you did? Because I tested the code on clean S2 and S&K (with and without Sonic3_Complete) disassemblies, and it worked on all cases.
This post has been edited by flamewing: 11 August 2014 - 10:21 AM

#8 User is offline KingofHarts 

Posted 11 August 2014 - 11:07 AM

Ah... well then I found an alternate way to achieve the same result. I'd imagine yours is faster but could you figure that out for sure, one way or the other?

#9 User is offline flamewing 

Posted 11 August 2014 - 11:54 AM

Without even looking at the cycle counts, I say that mine is faster because there is no loop overhead. But I looked at it anyway to give a quantitative value:

For ProcessDMAQueue:
My version:
	rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
	move.w	(a1)+,d0						;  8(2/0)
	beq.w	.done							; T: 10(2/0); F:  12(2/0)
	move.w	d0,(a5)							;  8(1/1)
	move.l	(a1)+,(a5)						; 20(3/2)
	move.l	(a1)+,(a5)						; 20(3/2)
	move.l	(a1)+,(a5)						; 20(3/2)
	endm
	; For up to 17 transfers: 88(14/7) * numtransfers + 18(4/0)
	; For 18 transfers: 1584(252/126)

Yours:
	move.w	(a1)+,d0						;  8(2/0)
	beq.s	ProcessDMAQueue_Done			; T: 10(2/0); F:  8(1/0)
	move.w	d0,(a5)							;  8(1/1)
	move.l	(a1)+,(a5)						; 20(3/2)
	move.l	(a1)+,(a5)						; 20(3/2)
	move.l	(a1)+,(a5)						; 20(3/2)
	cmpa.w	#$C8FC,a1						; 10(2/0)
	bne.s	ProcessDMAQueue_Loop			; T: 10(2/0); F:  8(1/0)
	; For up to 17 transfers: 104(17/7) * numtransfers + 18(4/0)
	; For 18 transfers: 1870(305/126)

There is also the 4(1/0) cycles your version adds because it uses "move.w #0,(VDP_Command_Buffer).w", while mine uses "move.w d0,(VDP_Command_Buffer).w", but mine also adds 4(1/0) cycles due to the "moveq #0,d0" in the case of 18 transfers.

For InitDMAQueue:
Mine:
	rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
	movep.l	d1,2(a1)						; 24(2/4)
	lea	14(a1),a1							;  8(2/0)
	endm
	; 576(72/72)

Yours:
	movep.l	d1,2(a1)						; 24(2/4)
	lea	14(a1),a1							;  8(2/0)
	cmpa.w	#$C8FC,a1						; 10(2/0)
	bne.s	InitDMAQueue_Loop				; T: 10(2/0); F:  8(1/0)
	; 934(143/72)
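
The totals in the comments above can be re-derived from the per-instruction timings with a short Python check. This is only a sketch of the arithmetic; the helper names add and scale are made up, and each tuple is (cycles, reads, writes) as in the comments.

```python
def add(*costs):
    """Sum any number of (cycles, reads, writes) tuples column by column."""
    return tuple(sum(column) for column in zip(*costs))


def scale(cost, n):
    """Cost of repeating the same instructions n times."""
    return tuple(n * value for value in cost)


TRANSFERS = 18
# The last conditional branch of a loop falls through: 8(1/0) instead of 10(2/0).
LAST_BRANCH_FALLS_THROUGH = (-2, -1, 0)

# ProcessDMAQueue, one iteration with a transfer present.
unrolled = add((8, 2, 0), (12, 2, 0), (8, 1, 1), (20, 3, 2), (20, 3, 2), (20, 3, 2))
looped = add((8, 2, 0), (8, 1, 0), (8, 1, 1), (20, 3, 2), (20, 3, 2), (20, 3, 2),
             (10, 2, 0), (10, 2, 0))
process_unrolled = scale(unrolled, TRANSFERS)
process_looped = add(scale(looped, TRANSFERS), LAST_BRANCH_FALLS_THROUGH)

# InitDMAQueue, one iteration.
init_unrolled = scale(add((24, 2, 4), (8, 2, 0)), TRANSFERS)
init_looped = add(scale(add((24, 2, 4), (8, 2, 0), (10, 2, 0), (10, 2, 0)), TRANSFERS),
                  LAST_BRANCH_FALLS_THROUGH)

print(process_unrolled, process_looped)  # (1584, 252, 126) (1870, 305, 126)
print(init_unrolled, init_looped)        # (576, 72, 72) (934, 143, 72)
```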


#10 User is offline KingofHarts 

Posted 11 August 2014 - 04:00 PM

Good to know. I never knew you could do loops this way with such efficiency. I should try putting this to use in other game mechanics that loop in a similar manner as well.

#11 User is offline flamewing 

Posted 11 August 2014 - 04:09 PM

I actually removed the loops; the rept/endm block repeats the block of instructions contained in it as many times as specified.

#12 User is offline flamewing 

Posted 11 August 2014 - 05:11 PM

After helping Hitaxas over PM, it turns out his issues were the "RetroHack" and "Sonic Retro" splash screens, which hard-coded the S1 OST, thus overwriting the DMA queue and making the initialization moot. So if you use either of these, be warned that you will either have to fix them, or you will have to initialize the DMA region after they have executed.

#13 User is offline KingofHarts 

Posted 14 August 2014 - 01:21 PM

So an update... this is weird.

I added the rept instructions. For the second one (in INIT) it builds and runs fine.

But for the other block just before it, I get all these errors occurring on the rept line:

...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (212 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (200 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (188 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (176 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (164 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (152 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (140 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (128 bytes) is out of range

ALSO, I fixed the bug that I mentioned before thanks to you, but now, while all the characters work, Tails' tails object causes the bug... and it only causes this bug when Tails is at a certain diagonal angle when jumping/rolling.
Any idea?
This post has been edited by KingofHarts: 17 August 2014 - 01:14 AM

#14 User is offline flamewing 

Posted 17 August 2014 - 08:45 AM

The other block needs to use beq.w instead of beq.s; the short branch that was in the original cannot reach that far.
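The displacements in the error messages can be reproduced with a short Python calculation. It assumes that each unrolled block assembles to 12 bytes (six 2-byte instructions when the branch is short) and that the branch target immediately follows the last block; both are assumptions inferred from the numbers in the errors, not read from the actual file.

```python
BLOCKS = 18        # number of rept iterations (one per queue slot)
BLOCK_BYTES = 12   # assumed: six 2-byte instructions per block with a short branch
BRANCH_OFFSET = 2  # the beq follows the 2-byte "move.w (a1)+,d0"
SHORT_MAX = 127    # largest forward displacement a short (8-bit) branch can encode


def displacements():
    target = BLOCKS * BLOCK_BYTES  # assumed: the label sits right after the last block
    # 68000 branch displacements are relative to the address of the branch + 2.
    return [target - (i * BLOCK_BYTES + BRANCH_OFFSET + 2) for i in range(BLOCKS)]


out_of_range = [d for d in displacements() if d > SHORT_MAX]
print(out_of_range)  # [212, 200, 188, 176, 164, 152, 140, 128]
```

The first eight copies of the block are too far from the exit label for a short branch, which matches the eight errors; a word branch has a 16-bit displacement, so all 18 copies fit.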

And which bug was that again?

#15 User is offline KingofHarts 

Posted 17 August 2014 - 12:47 PM

The bug I previously encountered before you had me using movep...

Where my screen was consumed by garbled art... last time I didn't post a screenshot because it'd been fixed... but now here it is:
IMAGE

It does this ONLY when Tails' tail object is loaded. I take that out, and never get any bugs. Also, it only occurs when Tails is rolling, and at an angle...
