Optimizing the DMA queue

flamewing · Aug 9, 2014

Edit: The improved DMA queue (and instructions for its use) can now be obtained on GitHub. This post will no longer be updated with new developments, but the thread will.

Edit: I made corrections to account for the edge cases described in this post and in this post. Rest of original post follows.
[hr]
I have been going over the list of things to optimize in the original Genesis games, so I can get Sonic Classic Heroes to work as smoothly as possible. One of the things I just finished optimizing was the DMA queue management function, and I was amazed at how much I managed to save. I am now sharing this so everyone can benefit.

The new functions require that you do two things before you can use them (which I will detail later): the first is going through your assembly files and changing the way the queue is cleared; the second is calling the initialization function for the new DMA functions. At the end, you will have gained two bytes in RAM, and a DMA queue that runs much faster.

How much faster? Well, it depends. There are 3 different DMA functions that are used nowadays:

The stock S2 function;

the stock S&K function;

the Sonic3_Complete version that is used when you assemble the Git disassembly with Sonic3_Complete=1.

The stock S2 function is the fastest of the 3:

52(12/0) cycles if the queue was full;

336(51/9) cycles if the new transfer filled the queue;

346(52/10) cycles otherwise.

The stock S&K function is 8(2/0) cycles slower than the S2 version, but it can be safely used when the source is in RAM (the S2 version requires some extra care, but I won't go into details). So its times are:

52(12/0) cycles if the queue was full;

344(53/9) cycles if the new transfer filled the queue;

354(54/10) cycles otherwise.

The Sonic3_Complete version is based on the S&K stock version; it is thus also safe with RAM sources. However, it breaks up DMA transfers that cross 32kB boundaries into two DMA transfers(*). The way it does this adds an enormous overhead on all DMA transfers. Its times are:

If the source is in address $800000 and up (32x RAM, z80 RAM, main RAM):

72(16/0) cycles if the queue was full;

364(57/0) cycles if queue became full with new command;

374(58/10) cycles otherwise;

If the source is in address $7FFFFF and down (ROM, both SCD RAMs):

If the DMA does not need to be split:

294(53/10) cycles if the queue was full at the start;

586(94/19) cycles if queue became full with new command;

596(95/20) cycles otherwise;

If the DMA needs to be split in two:

436(83/30) cycles if the queue was full at the start;

728(124/21) cycles if queue became full with the first command;

1030(166/31) cycles if queue became full with the second command;

1040(167/32) cycles otherwise.

Yeah, you are wasting hundreds of cycles by using the Sonic3_Complete version... and even more than you think once you note the asterisk I added above. You see, the VDP has issues with DMAs that cross a 128kB boundary in ROM; the Sonic3_Complete version tries to handle this, but is overzealous -- it breaks up transfers that cross a 32kB boundary instead. Thus, loads of DMAs that should not be broken at all are broken into two... leading to several hundreds of wasted cycles. The function is bad enough that manually breaking up the transfers would be much faster -- potentially taking about 2/3rds of the time.
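To make the difference between the two block sizes concrete, here is a minimal sketch in Python (a model of the idea, not the 68k routine itself). The function names and the example addresses are made up for illustration; sources, destinations and lengths are in bytes, and the split assumes a transfer is never longer than one block.

```python
BLOCK_128K = 0x20000  # the boundary the VDP actually has trouble with
BLOCK_32K = 0x8000    # the boundary the Sonic3_Complete version checks instead

def crosses_boundary(src, length, block):
    """True if the byte range [src, src+length) spans two blocks of the given size."""
    return src // block != (src + length - 1) // block

def split_transfer(src, dest, length, block=BLOCK_128K):
    """Return (src, dest, length) pieces, none of which crosses a block boundary.
    Assumes length <= block, so at most two pieces are needed."""
    if not crosses_boundary(src, length, block):
        return [(src, dest, length)]
    first = block - (src % block)  # bytes left before the boundary
    return [(src, dest, first), (src + first, dest + first, length - first)]

# Hypothetical transfer: $400 bytes of art at ROM address $27F00.
print(crosses_boundary(0x27F00, 0x400, BLOCK_128K))  # False: no split needed
print(crosses_boundary(0x27F00, 0x400, BLOCK_32K))   # True: split anyway under a 32kB rule

# Hypothetical transfer that really does cross a 128kB boundary.
for piece in split_transfer(0x1FF00, 0xF000, 0x400):
    print([hex(v) for v in piece])
```

The second call is the wasteful case: a transfer that sits comfortably inside one 128kB block still gets split when the check uses 32kB blocks.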

So, how does my optimized function compare with this? There are two versions you can select with a flag during assembly: the "competitor" to the stock S2/stock S&K versions, which does not care whether or not transfers cross a 128kB boundary; and the "competitor" to the Sonic3_Complete version, which is 128kB safe. Both of them are safe for RAM sources, and they achieve this in an optimized way that has zero cost -- the functions would not be faster without this added protection. The times for the non-128kB-safe version are:

48(11/0) cycles if the queue was full at the start;

194(33/9) cycles otherwise.

The times for the 128kB-safe version are:

48(11/0) cycles if the queue was full at the start (as always);

214(37/9) cycles for DMA transfers that do not need to be split into two;

252(46/9) cycles if the first piece of the DMA filled the queue;

368(64/16) cycles if both pieces of the DMA were queued.

I will leave comparisons to whoever wants to make them. I will just mention that if you use SonMapEd-generated DPLCs and you are using the Sonic3_Complete function, you are easily wasting thousands of cycles every frame.

So, now for how to add this function to your hacks.

Git S2 version

Find every instance of this code:
[68k] move.l #VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w[/68k]
and change it to this:
[68k] move.w #VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w[/68k]
Now find this:
[68k] bsr.w VDPSetupGame[/68k]
and change it to this:
[68k]	jsr	(InitDMAQueue).l
	bsr.w	VDPSetupGame[/68k]
Now find the "SpecialStage" label and scan down to this:
[68k] move #$2700,sr ; Mask all interrupts
lea (VDP_control_port).l,a6
move.w #$8B03,(a6) ; EXT-INT disabled, V scroll by screen, H scroll by line
move.w #$8004,(a6) ; H-INT disabled
move.w #$8ADF,(Hint_counter_reserve).w ; H-INT every 224th scanline
move.w #$8230,(a6) ; PNT A base: $C000
move.w #$8405,(a6) ; PNT B base: $A000
move.w #$8C08,(a6) ; H res 32 cells, no interlace, S/H enabled
move.w #$9003,(a6) ; Scroll table size: 128x32
move.w #$8700,(a6) ; Background palette/color: 0/0
move.w #$8D3F,(a6) ; H scroll table base: $FC00
move.w #$857C,(a6) ; Sprite attribute table base: $F800
move.w (VDP_Reg1_val).w,d0
andi.b #$BF,d0
move.w d0,(VDP_control_port).l[/68k]
Add these lines after the above block:
[68k]	clr.w	(VDP_Command_Buffer).w
	move.w	#VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w[/68k]
Then scan further down until you find this:
[68k] clearRAM PNT_Buffer,$C04 ; PNT buffer[/68k]
and change it to this:
[68k] clearRAM PNT_Buffer,$C00 ; PNT buffer[/68k]
Now find this:
[68k]; ---------------------------------------------------------------------------
; Subroutine for queueing VDP commands (seems to only queue transfers to VRAM),
; to be issued the next time ProcessDMAQueue is called.
; Can be called a maximum of 18 times before the buffer needs to be cleared
; by issuing the commands (this subroutine DOES check for overflow)
; ---------------------------------------------------------------------------

; ||||||||||||||| S U B R O U T I N E |||||||||||||||||||||||||||||||||||||||

; sub_144E: DMA_68KtoVRAM: QueueCopyToVRAM: QueueVDPCommand: Add_To_DMA_Queue:
QueueDMATransfer:[/68k]
and delete everything from this up until (and including) this:
[68k]; loc_14CE:
ProcessDMAQueue_Done:
move.w #0,(VDP_Command_Buffer).w
move.l #VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w
rts
; End of function ProcessDMAQueue[/68k]
In its place, put the new code at the end of the post. You can also edit s2.constants.asm to reflect the fact that VDP_Command_Buffer_Slot is now a word instead of a long.

Git S&K version

Add the following equates somewhere:
[68k]VDP_Command_Buffer := DMA_queue
VDP_Command_Buffer_Slot := DMA_queue_slot[/68k]
Then find all cases of
[68k] move.w #0,(DMA_queue).w
move.l #DMA_queue,(DMA_queue_slot).w[/68k]
and change them to
[68k]	move.w	#0,(DMA_queue).w
	move.w	#DMA_queue,(DMA_queue_slot).w[/68k]
Now find all cases of
[68k] bsr.w Init_VDP[/68k]
and change them to
[68k]	jsr	(InitDMAQueue).l
	bsr.w	Init_VDP[/68k]
Now find this:
[68k]; ---------------------------------------------------------------------------
; Adds art to the DMA queue
; Inputs:
; d1 = source address
; d2 = destination VRAM address
; d3 = number of words to transfer
; ---------------------------------------------------------------------------

; =============== S U B R O U T I N E =======================================

Add_To_DMA_Queue:[/68k]
and delete everything from this up until (and including) this:
[68k]$$stop:
move.w #0,(DMA_queue).w
move.l #DMA_queue,(DMA_queue_slot).w
rts
; End of function Process_DMA_Queue[/68k]
In its place, put the new code at the end of the post. You can also edit the constants file (sonic3k.constants.asm in the Git disassembly) to reflect the fact that DMA_queue_slot is now a word instead of a long.

The new function

[68k]; ---------------------------------------------------------------------------
; Subroutine for queueing VDP commands (seems to only queue transfers to VRAM),
; to be issued the next time ProcessDMAQueue is called.
; Can be called a maximum of 18 times before the queue needs to be cleared
; by issuing the commands (this subroutine DOES check for overflow)
; ---------------------------------------------------------------------------
; Input:
; d1 Source address
; d2 Destination address
; d3 Transfer length
; Output:
; d0,d1,d2,d3,a1 trashed
;
; With both options below set to zero, the function runs in:
; * 48(11/0) cycles if the queue was full at the start;
; * 194(33/9) cycles otherwise
; The times for the original S2 function are:
; * 52(12/0) cycles if the queue was full at the start;
; * 336(51/9) cycles if queue became full with new command;
; * 346(52/10) cycles otherwise
; The times for the original S&K function are:
; * 52(12/0) cycles if the queue was full at the start;
; * 344(53/9) cycles if queue became full with new command;
; * 354(54/10) cycles otherwise
;
; If you are on S3&K, or you have ported the S3&K KosM decompressor, you definitely
; want to edit it to mask off all interrupts before calling QueueDMATransfer:
; both this function *and* the original have numerous race conditions that make
; them unsafe for use by the KosM decoder, since it sets V-Int routine before it
; executes. This can lead to broken DMAs in some rare circumstances.
;
; Like the S3&K version, but unlike the S2 version, this function is "safe" when
; the source is in RAM; this comes at no cost whatsoever, unlike what happens in
; the S3&K version. Moreover, you can gain a few more cycles if the source is in
; RAM in a few cases: whenever a call to QueueDMATransfer has this instruction:
; andi.l #$FFFFFF,d1
; You can simply delete it and gain 16(3/0) cycles.
; ===========================================================================
; This option breaks DMA transfers that cross a 128kB block into two. It is
; disabled by default because you can simply align the art in ROM and avoid the
; issue altogether. It is here so that you have a high-performance routine to do
; the job in situations where you can't align it in ROM. It beats the equivalent
; functionality in the S&K disassembly with Sonic3_Complete flag set by a lot,
; especially since that version breaks up DMA transfers when they cross *32*kB
; boundaries instead of the problematic 128kB boundaries.
; This option adds 16(3/0) cycles to all DMA transfers that don't cross a 128kB
; boundary. For convenience, here are total times for all cases:
; * 48(11/0) cycles if the queue was full at the start (as always);
; * 214(37/9) cycles for DMA transfers that do not need to be split into two;
; * 252(46/9) cycles if the first piece of the DMA filled the queue;
; * 368(64/16) cycles if both pieces of the DMA were queued
; For comparison, times for the Sonic3_Complete version are:
; * If the source is in address $800000 and up (32x RAM, z80 RAM, main RAM):
; * 72(16/0) cycles if the queue was full
; * 364(57/0) cycles if queue became full with new command;
; * 374(58/10) cycles otherwise
; * If the source is in address $7FFFFF and down (ROM, both SCD RAMs):
; * If the DMA does not need to be split:
; * 294(53/10) cycles if the queue was full at the start;
; * 586(94/19) cycles if queue became full with new command;
; * 596(95/20) cycles otherwise
; * If the DMA needs to be split in two:
; * 436(83/30) cycles if the queue was full at the start;
; * 728(124/21) cycles if queue became full with the first command;
; * 1030(166/31) cycles if queue became full with the second command;
; * 1040(167/32) cycles otherwise
; Meaning you are wasting several hundreds of cycles on *each* *call*!
; What makes matters worse is that the Sonic3_Complete breaks up DMAs that it
; should not, meaning you will be wasting more cycles than can be seen by just
; comparing similar scenarios.
Use128kbSafeDMA := 0
; ===========================================================================
; Option to mask interrupts while updating the DMA queue. This fixes many race
; conditions in the DMA function, but it costs 46(6/1) cycles. The better way to
; handle these race conditions would be to make unsafe callers (such as S3&K's
; KosM decoder) prevent these by masking off interrupts before calling and then
; restore interrupts after.
UseVIntSafeDMA := 0
; ===========================================================================
; Option to assume that transfer length is always less than $7FFF. Only makes
; sense if Use128kbSafeDMA is 1. Moreover, setting this to 1 will cause trouble
; on a 64kB DMA, so make sure you never do one if you set it to 1!
; Enabling this saves 4(1/0) cycles on the case where a DMA is broken in two and
; both transfers are properly queued, and nothing at all otherwise.
AssumeMax7FFFXfer := 0&Use128kbSafeDMA
; ===========================================================================
; Convenience macros, for increased maintainability of the code.
	ifndef DMA
DMA = %100111
	endif
	ifndef READ
READ = %001100
	endif
	ifndef VRAMCommReg_defined
VRAMCommReg_defined := 1
VRAMCommReg macro reg,rwd,clr
	lsl.l	#2,reg	; Move high bits into (word-swapped) position, accidentally moving everything else
	if rwd <> READ
	addq.w	#1,reg	; Add write bit...
	endif
	ror.w	#2,reg	; ... and put it into place, also moving all other bits into their correct (word-swapped) places
	swap	reg	; Put all bits in proper places
	if clr <> 0
	andi.w	#3,reg	; Strip whatever junk was in upper word of reg
	endif
	if rwd == DMA
	tas.b	reg	; Add in the DMA bit -- tas fails on memory, but works on registers
	endif
	endm
	endif

	ifndef intMacros_defined
intMacros_defined := 1
enableInts macro
	move	#$2300,sr
	endm

disableInts macro
	move	#$2700,sr
	endm
	endif
; ===========================================================================

; ||||||||||||||| S U B R O U T I N E |||||||||||||||||||||||||||||||||||||||

; sub_144E: DMA_68KtoVRAM: QueueCopyToVRAM: QueueVDPCommand:
Add_To_DMA_Queue:
QueueDMATransfer:
	if UseVIntSafeDMA==1
		move.w	sr,-(sp)	; Save current interrupt mask
		disableInts	; Mask off interrupts
	endif ; UseVIntSafeDMA==1
	movea.w	(VDP_Command_Buffer_Slot).w,a1
	cmpa.w	#VDP_Command_Buffer_Slot,a1
	beq.s	.done	; return if there's no more room in the queue

	lsr.l	#1,d1	; Source address is in words for the VDP registers
	if Use128kbSafeDMA==1
		move.w	d3,d0	; d0 = length of transfer in words
		; Compute position of last transferred word. This handles 2 cases:
		; (1) zero-length DMAs actually transfer $10000 words
		; (2) (source+length)&$FFFF == 0
		subq.w	#1,d0
		add.w	d1,d0	; d0 = ((src_address >> 1) & $FFFF) + ((xfer_len >> 1) - 1)
		bcs.s	.double_transfer	; Carry set = ($10000 << 1) = $20000, or new 128kB block
	endif ; Use128kbSafeDMA==1

	; Store VDP commands for specifying DMA into the queue
	swap	d1	; Want the high byte first
	move.w	#$977F,d0	; Command to specify source address & $FE0000, plus bitmask for the given byte
	and.b	d1,d0	; Mask in source address & $FE0000, stripping high bit in the process
	move.w	d0,(a1)+	; Store command
	move.w	d3,d1	; Put length together with (source address & $01FFFE) >> 1...
	movep.l	d1,1(a1)	; ... and stuff them all into RAM in their proper places (movep for the win)
	lea	8(a1),a1	; Skip past all of these commands

	VRAMCommReg d2, DMA, 1	; Make DMA destination command
	move.l	d2,(a1)+	; Store command

	clr.w	(a1)	; Put a stop token at the end of the used part of the queue
	move.w	a1,(VDP_Command_Buffer_Slot).w	; Set the next free slot address, potentially undoing the above clr (this is intentional!)

.done:
	if UseVIntSafeDMA==1
		move.w	(sp)+,sr	; Restore interrupts to previous state
	endif ;UseVIntSafeDMA==1
	rts
; ---------------------------------------------------------------------------
	if Use128kbSafeDMA==1
.double_transfer:
	; Hand-coded version to break the DMA transfer into two smaller transfers
	; that do not cross a 128kB boundary. This is done much faster (at the cost
	; of space) than by the method of saving parameters and calling the normal
	; DMA function twice, as Sonic3_Complete does.
	; d0 is the number of words-1 that got over the end of the 128kB boundary
	addq.w	#1,d0	; Make d0 the number of words past the 128kB boundary
	sub.w	d0,d3	; First transfer will use only up to the end of the 128kB boundary
	; Store VDP commands for specifying DMA into the queue
	swap	d1	; Want the high byte first
	; Sadly, all registers we can spare are in use right now, so we can't use
	; no-cost RAM source safety.
	andi.w	#$7F,d1	; Strip high bit
	ori.w	#$9700,d1	; Command to specify source address & $FE0000
	move.w	d1,(a1)+	; Store command
	addq.b	#1,d1	; Advance to next 128kB boundary (**)
	move.w	d1,12(a1)	; Store it now (safe to do in all cases, as we will overwrite later if queue got filled) (**)
	move.w	d3,d1	; Put length together with (source address & $01FFFE) >> 1...
	movep.l	d1,1(a1)	; ... and stuff them all into RAM in their proper places (movep for the win)
	lea	8(a1),a1	; Skip past all of these commands

	move.w	d2,d3	; Save for later
	VRAMCommReg d2, DMA, 1	; Make DMA destination command
	move.l	d2,(a1)+	; Store command

	cmpa.w	#VDP_Command_Buffer_Slot,a1	; Did this command fill the queue?
	beq.s	.skip_second_transfer	; Branch if so

	; Store VDP commands for specifying DMA into the queue
	; The source address high byte was done above already in the comments marked
	; with (**)
	if AssumeMax7FFFXfer==1
		ext.l	d0	; With maximum $7FFF transfer length, bit 15 of d0 is unset here
		movep.l	d0,3(a1)	; Stuff it all into RAM at the proper places (movep for the win)
	else
		moveq	#0,d2	; Need a zero for a 128kB block start
		move.w	d0,d2	; Copy number of words on this new block...
		movep.l	d2,3(a1)	; ... and stuff it all into RAM at the proper places (movep for the win)
	endif
	lea	10(a1),a1	; Skip past all of these commands
	; d1 contains length up to the end of the 128kB boundary
	add.w	d1,d1	; Convert it into byte length...
	add.w	d1,d3	; ... and offset destination by the correct amount
	VRAMCommReg d3, DMA, 1	; Make DMA destination command
	move.l	d3,(a1)+	; Store command

	clr.w	(a1)	; Put a stop token at the end of the used part of the queue
	move.w	a1,(VDP_Command_Buffer_Slot).w	; Set the next free slot address, potentially undoing the above clr (this is intentional!)

	if UseVIntSafeDMA==1
		move.w	(sp)+,sr	; Restore interrupts to previous state
	endif ;UseVIntSafeDMA==1
	rts
; ---------------------------------------------------------------------------
.skip_second_transfer:
	; a1 == VDP_Command_Buffer_Slot here, so (a1) is the slot variable itself
	move.w	a1,(a1)	; Set the next free slot address, overwriting what the second (**) instruction did

	if UseVIntSafeDMA==1
		move.w	(sp)+,sr	; Restore interrupts to previous state
	endif ;UseVIntSafeDMA==1
	rts
	endif ; Use128kbSafeDMA==1
; End of function QueueDMATransfer
; ===========================================================================

; ---------------------------------------------------------------------------
; Subroutine for issuing all VDP commands that were queued
; (by earlier calls to QueueDMATransfer)
; Resets the queue when it's done
; ---------------------------------------------------------------------------

; ||||||||||||||| S U B R O U T I N E |||||||||||||||||||||||||||||||||||||||

; sub_14AC: CopyToVRAM: IssueVDPCommands: Process_DMA:
Process_DMA_Queue:
ProcessDMAQueue:
	lea	(VDP_control_port).l,a5
	lea	(VDP_Command_Buffer).w,a1
	move.w	a1,(VDP_Command_Buffer_Slot).w

	rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
		move.w	(a1)+,d0
		beq.w	.done	; branch if we reached a stop token
		; issue a set of VDP commands...
		move.w	d0,(a5)
		move.l	(a1)+,(a5)
		move.l	(a1)+,(a5)
		move.l	(a1)+,(a5)
	endm
	moveq	#0,d0

.done:
	move.w	d0,(VDP_Command_Buffer).w
	rts
; End of function ProcessDMAQueue
; ===========================================================================

; ---------------------------------------------------------------------------
; Subroutine for initializing the DMA queue.
; ---------------------------------------------------------------------------

; ||||||||||||||| S U B R O U T I N E |||||||||||||||||||||||||||||||||||||||

InitDMAQueue:
	lea	(VDP_Command_Buffer).w,a1
	move.w	#0,(a1)
	move.w	a1,(VDP_Command_Buffer_Slot).w
	move.l	#$96959493,d1
	rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
		movep.l	d1,2(a1)
		lea	14(a1),a1
	endm
	rts
; End of function InitDMAQueue
; ===========================================================================
[/68k]

Additional Care
There are some additional points that are worth paying attention to.

128kB boundaries and you
For both S2 and S&K (or anywhere you want to use this), the version that does not check for 128kB boundaries is the default. The reason is this: you can (and should) always align the problematic art in such a way that the DMA never needs to be split in two. So enabling this by default carries a penalty with little real benefit. In any case, you can toggle this by setting the Use128kbSafeDMA option to 1.
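As a rough illustration of what "aligning the problematic art" costs, the Python sketch below computes how much padding would have to go in front of an art file so that the whole file stays inside a single 128kB block. The function name and the example ROM offsets are hypothetical, and the sketch assumes the file is no larger than 128kB.

```python
BLOCK = 0x20000  # 128kB

def padding_to_avoid_crossing(start, size, block=BLOCK):
    """Bytes of padding to insert before a file at ROM offset `start` so that
    [start, start+size) lies within one block. Assumes size <= block."""
    if size == 0 or start // block == (start + size - 1) // block:
        return 0  # already inside a single block
    return block - (start % block)  # push the file to the next boundary

# Hypothetical art file of $2000 bytes that would start at ROM offset $3F000.
print(hex(padding_to_avoid_crossing(0x3F000, 0x2000)))  # 0x1000
# The same file at $3C000 fits as-is.
print(hex(padding_to_avoid_crossing(0x3C000, 0x2000)))  # 0x0
```

The padding is only paid for files that actually straddle a boundary; how much ROM that adds up to depends on how many large art files you have and where they land.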

Transfers of 64kB or larger
If you have enabled the version that breaks DMAs into two if they go over a 128kB boundary, this is relevant for you. There is an option that saves 4(1/0) cycles on the case where a DMA transfer is broken in two pieces and both pieces are correctly queued (that is, the first transfer did not fill the queue). This option assumes that you never perform a transfer with length of 64kB or higher; note that 64kB is included here! In these conditions, a small optimization exists that leads to the small savings mentioned. This is disabled by default to avoid this edge case.
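The edge case can be seen in a small model. The sketch below is Python rather than 68k code, and the helper names are made up; it only mimics how `ext.l` sign-extends the low word of a register, compared with the default path that clears the upper word first (`moveq #0,d2` followed by `move.w d0,d2`). Going by the comments in the code, the upper word has to be zero because the second piece starts at the beginning of a new 128kB block.

```python
def ext_l(d0):
    """Model of 68k ext.l: sign-extend the low word of d0 to 32 bits."""
    low = d0 & 0xFFFF
    return (low | 0xFFFF0000) if (low & 0x8000) else low

def zero_extend(d0):
    """Model of the default path: clear a register, then copy in the low word."""
    return d0 & 0xFFFF

# Second piece of $7FFF words: both paths give a zero upper word.
print(hex(ext_l(0x7FFF)), hex(zero_extend(0x7FFF)))  # 0x7fff 0x7fff
# Second piece of $8000 words: ext.l fills the upper word with $FFFF.
print(hex(ext_l(0x8000)), hex(zero_extend(0x8000)))  # 0xffff8000 0x8000
```

With the option enabled, a piece of $8000 words or more would therefore queue a nonzero source offset instead of zero, which is why the shortcut is only safe when transfers are known to stay below that size.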

Interrupt Safety
The original functions have several race conditions that make them unsafe regarding interrupts. My version removes one of them, but adds another. For the vast majority of cases, this is irrelevant -- the QueueDMATransfer function is generally called only when Vint_routine is zero, meaning that the DMA queue will not be processed, and all is well.

There is one exception, though: the S3&K KosM decoder. Since the KosM decoder sets Vint_routine to a nonzero value, you can potentially run into an interrupt in the middle of a QueueDMATransfer call. Effects range from lost DMA transfers, to garbage DMA transfers, to one garbage DMA and a lost DMA (if the transfer was split), or, in the best possible outcome, no ill effects at all. You can toggle interrupt safety by setting the UseVIntSafeDMA flag to 1, but this adds overhead to all safe callers; better would be to fix the unsafe callers to mask interrupts while the DMA transfer is being queued.

ASM68k
If you use this crap, all you need to do to use the code above is:

replace the dotted labels (composed symbols) by @ labels (local symbols);

replace the last two instances of "endm" by "endr";

edit the VRAMCommReg macro to use asm68k-style parameters.

Before you complain that asm68k is not crap, I invite you to assemble the following and check the output:
[68k] move.w (d0),d1
move.w d0 ,d1
dc.b 1 , 2
moveq #$80,d0[/68k]

RetroKoH · Aug 10, 2014

I wanna port this to Sonic 1, as I've got the DMA going on there.
DISREGARD THE QUESTION IF YOU SAW IT... I THINK I'm FIGURING IT OUT

RetroKoH · Aug 11, 2014

Double posting as I'm in need of help. Here is my DMA queue.asm of my attempt to port this over into a Sonic 1 disassembly that has DMA queue already implemented. When I do this... my screen gets completely glitched and Sonic's tiles are thrown all over VRAM like mad... some help would be appreciated.

Hitaxas · Aug 11, 2014

I'm actually getting something along the lines of what KingofHarts is experiencing, but with my attempt to slap this into sonic 2. Not sure why though.

flamewing · Aug 11, 2014

@KoH: in your version, I noticed you replaced the movep's by move's; for example:
[68k]InitDMAQueue_Loop:
move.l d1,2(a1)[/68k]
This should be, instead:
[68k]InitDMAQueue_Loop:
movep.l d1,2(a1)[/68k]
So of course your version does not work. movep reads/writes alternating bytes from RAM/ROM/SRAM (so a "movep.l d0,0(a1)" would write to a1+0, a1+2, a1+4 and a1+6), whereas move writes to contiguous bytes ("move.l d0,0(a1)" writes to a1, a1+1, a1+2, a1+3).
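The difference can be modelled in a few lines of Python (a sketch for illustration only; the helper names are made up, and the value written in the last step is arbitrary). It shows why the queue depends on `movep`: InitDMAQueue lays down the VDP register numbers on every other byte, and QueueDMATransfer later fills in the bytes in between.

```python
def move_l(mem, addr, value):
    """Model of move.l dN,addr: four contiguous bytes, big-endian."""
    for i in range(4):
        mem[addr + i] = (value >> (24 - 8 * i)) & 0xFF

def movep_l(mem, addr, value):
    """Model of movep.l dN,addr: four bytes written to every other address."""
    for i in range(4):
        mem[addr + 2 * i] = (value >> (24 - 8 * i)) & 0xFF

# One 14-byte queue entry, zero-filled for the demonstration.
entry = bytearray(14)
movep_l(entry, 2, 0x96959493)   # what InitDMAQueue does for each entry
print(entry.hex())              # 0000960095009400930000000000

# QueueDMATransfer then writes the data bytes into the gaps (arbitrary value here).
movep_l(entry, 3, 0x12345678)
print(entry.hex())              # 0000961295349456937800000000

# With move.l instead, the register numbers land on contiguous bytes.
broken = bytearray(14)
move_l(broken, 2, 0x96959493)
print(broken.hex())             # 0000969594930000000000000000
```

In the second printout, each word pairs a register number ($96, $95, $94, $93) with a data byte, which is the shape the VDP expects; the `move.l` layout never produces that.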

RetroKoH · Aug 11, 2014

Yes sir, that indeed fixed it. I took them out cuz for some reason I thought those instructions were giving me errors... which in hindsight doesn't really make any sense. It was the rept's that didn't work. Everyone using Sonic 1 can see how I took care of that in the file posted above... which has been edited and should now work with no issues.

flamewing · Aug 11, 2014

Oh, I forgot -- the repts should work fine if you replace the "endm"s with "endr"s. I added this to the guide.

@Hitaxas: can you elaborate on what you did? Because I tested the code on clean S2 and S&K (with and without Sonic3_Complete) disassemblies, and it worked on all cases.

RetroKoH · Aug 11, 2014

Ah... well then I found an alternate way to achieve the same result. I'd imagine yours is faster but could you figure that out for sure, one way or the other?

flamewing · Aug 11, 2014

Without even looking at the cycle counts, I say that mine is faster because there is no loop overhead. But I looked at it anyway to give a quantitative value:

For ProcessDMAQueue:
My version:
[68k] rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
move.w (a1)+,d0 ; 8(2/0)
beq.w .done ; T: 10(2/0); F: 12(2/0)
move.w d0,(a5) ; 8(1/1)
move.l (a1)+,(a5) ; 20(3/2)
move.l (a1)+,(a5) ; 20(3/2)
move.l (a1)+,(a5) ; 20(3/2)
endm
; For up to 17 transfers: 88(14/7) * numtransfers + 18(4/0)
; For 18 transfers: 1584(252/126)[/68k]
Yours:
[68k] move.w (a1)+,d0 ; 8(2/0)
beq.s ProcessDMAQueue_Done ; T: 10(2/0); F: 8(1/0)
move.w d0,(a5) ; 8(1/1)
move.l (a1)+,(a5) ; 20(3/2)
move.l (a1)+,(a5) ; 20(3/2)
move.l (a1)+,(a5) ; 20(3/2)
cmpa.w #$C8FC,a1 ; 10(2/0)
bne.s ProcessDMAQueue_Loop ; T: 10(2/0); F: 8(1/0)
; For up to 17 transfers: 104(17/7) * numtransfers + 18(4/0)
; For 18 transfers: 1870(305/126)[/68k]
There is also the 4(1/0) cycles your version adds because it uses "move.w #0,(VDP_Command_Buffer).w", while mine uses "move.w d0,(VDP_Command_Buffer).w", but mine also adds 4(1/0) cycles due to the "moveq #0,d0" in the case of 18 transfers.

For InitDMAQueue:
Mine:
[68k] rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
movep.l d1,2(a1) ; 24(2/4)
lea 14(a1),a1 ; 8(2/0)
endm
; 576(72/72)[/68k]
Yours:
[68k] movep.l d1,2(a1) ; 24(2/4)
lea 14(a1),a1 ; 8(2/0)
cmpa.w #$C8FC,a1 ; 10(2/0)
bne.s InitDMAQueue_Loop ; T: 10(2/0); F: 8(1/0)
; 934(143/72)[/68k]
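The totals in the listings above can be re-derived from the per-instruction figures. The Python sketch below does that arithmetic for a full queue of 18 entries; the helper name is made up, and each tuple is (cycles, reads, writes) as in the 68k timing notation used in this thread.

```python
ENTRIES = 18  # the queue holds 18 transfers

def total(per_iteration, count, adjust=(0, 0, 0)):
    """Multiply a (cycles, reads, writes) cost by count, then apply a fix-up."""
    return tuple(c * count + a for c, a in zip(per_iteration, adjust))

# In the looped versions the final bne.s falls through: 8(1/0) instead of 10(2/0).
LAST_BRANCH = (-2, -1, 0)

# ProcessDMAQueue: unrolled body is 88(14/7), looped body is 104(17/7).
print(total((88, 14, 7), ENTRIES))                # (1584, 252, 126)
print(total((104, 17, 7), ENTRIES, LAST_BRANCH))  # (1870, 305, 126)

# InitDMAQueue: unrolled body is 32(4/4), looped body is 52(8/4).
print(total((32, 4, 4), ENTRIES))                 # (576, 72, 72)
print(total((52, 8, 4), ENTRIES, LAST_BRANCH))    # (934, 143, 72)
```

For a partially filled queue, both ProcessDMAQueue variants add the same 18(4/0) for the terminating read and taken branch, so the unrolled version saves 16(3/0) per queued transfer.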

RetroKoH · Aug 11, 2014

Good to know. I never knew you could do loops this way with such efficiency. I should try putting this to use in other game mechanics that loop in a similar manner as well.

flamewing · Aug 11, 2014

I actually removed the loops; the rept/endm block repeats the block of instructions contained in it as many times as specified.

flamewing · Aug 11, 2014

After helping Hitaxas through PM, it turns out his issues were caused by the "RetroHack" and "Sonic Retro" splash screens, which hard-coded the S1 OST, thus overwriting the DMA queue and making the initialization moot. So if you use either of these, be warned that you will either have to fix them, or you will have to initialize the DMA region after they have executed.

RetroKoH · Aug 14, 2014

So an update... this is weird.

I added the rept instructions. For the second one (in INIT) it builds and runs fine.

But for the other block just before it, I get all these errors occurring on the rept line:

...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (212 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (200 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (188 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (176 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (164 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (152 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (140 bytes) is out of range
...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (128 bytes) is out of range

ALSO, I fixed the bug that I mentioned before thanks to you, but now, while all the characters work, Tails' tails object causes the bug... and it only causes this bug when Tails is at a certain diagonal angle when jumping/rolling.
Any idea?

flamewing · Aug 17, 2014

The other block needs to use beq.w instead of beq.s -- the short branch that was in the original cannot reach that far.

And which bug was that again?

RetroKoH · Aug 17, 2014

The bug I previously encountered before you had me using movep...

Where my screen was consumed by garbled art... last time I didn't post a screenshot because it'd been fixed... but now here it is:
IMAGE

Does this ONLY when Tails' tail object is loaded. I take that out, and never get any bugs. Also it only occurs when Tails is rolling, and at an angle...

flamewing · Aug 17, 2014

This is happening because Tails' tails object is overwriting part of the DMA queue's RAM; my version is so much faster because it assumes this does not happen. If you want, you can send me a ROM and I will tell you exactly where the error is.

RetroKoH · Aug 17, 2014

EDITED POST

Posted the rom, looking forward to hearing back

flamewing · Sep 2, 2014

After debugging KingofHarts' issue, I found out that it was caused by a slight oversight of mine in an edge case: when deciding whether or not to break up a DMA transfer that crosses a 128kB block, I was actually checking whether the last word transferred would be on a new 128kB block or not. I updated the first post with the new version. This caused the 128kB-safe version to become slightly slower; the new times for this version are:

48(11/0) cycles if the queue was full at the start (as always) [unchanged];

214(37/9) cycles for DMA transfers that do not need to be split into two [increased by 4(1/0)];

252(46/9) cycles if the first piece of the DMA filled the queue [increased by 8(2/0)];

364(63/16) cycles if both pieces of the DMA were queued [increased by 8(2/0)].

The non-128kB-safe version remains the same speed, which is all the more reason to just align the art in ROM to avoid the issue altogether.

MainMemory · Sep 2, 2014

It may be worth mentioning that there is a way to automatically detect overflows in DPLCs using my sprite mappings macros with AS. Add these lines just before the endm in the dplcEntry macro:

Code (Text):
    if dplcTiles <> 0
    if ((dplcTiles+(offset*$20))/131072) <> ((dplcTiles+(offset*$20)+(tiles*$20)-1)/131072)
    message "Warning: DPLC crosses 128K boundary! line: \{MOMLINE/1.0} start: offset count: tiles overflow: $\{(dplcTiles+(offset*$20)+(tiles*$20))#131072}"
    endif
    endif

Also make sure to put "dplcTiles := 0" above the macro to initialize it. Then you can put "dplcTiles := ArtUnc_Sonic" at the top of your DPLC file and "dplcTiles := 0" at the end, and if any DPLCs cross a 128K boundary you'll get a message like "Warning: DPLC crosses 128K boundary! line: 705 start: $7C8 count: $10 overflow: $60".

RetroKoH · Sep 2, 2014

With 6 characters each having their own art, I have to take the slower route. Aligning them ALL would be brutal on ROM size. Thank you for this though, regardless. :D
