Optimizing the DMA queue

Discussion in 'Engineering & Reverse Engineering' started by flamewing, Aug 9, 2014.

  1. flamewing

    flamewing

    Emerald Hunter Tech Member
    1,161
    65
    28
    France
    Sonic Classic Heroes; Sonic 2 Special Stage Editor; Sonic 3&K Heroes (on hold)
    Edit: The improved DMA queue (and instructions for its use) can now be obtained on GitHub. This post will no longer be updated with new developments, but the thread will.

    Edit: I have made corrections to account for the edge cases described in this post and in this post. The rest of the original post follows.
    [hr]
    I have been going over the list of things to optimize in the original Genesis games, so I can get Sonic Classic Heroes to work as smoothly as possible. One of the things I just finished optimizing was the DMA queue management function, and I was amazed at how much I managed to save. I am now sharing this so everyone can benefit.

    The new functions require that you do two things before you can use them (both of which I will detail later): the first is going through your assembly files and changing the way the queue is cleared; the second is calling the initialization function for the new DMA functions. At the end, you will have gained two bytes in RAM, and a DMA queue that runs much faster.

    How much faster? Well, it depends. There are 3 different DMA functions that are used nowadays:
    1. The stock S2 function;
    2. the stock S&K function;
    3. the Sonic3_Complete version that is used when you assemble the Git disassembly with Sonic3_Complete=1.
    The stock S2 function is the fastest of the 3:
    • 52(12/0) cycles if the queue was full;
    • 336(51/9) cycles if the new transfer filled the queue;
    • 346(52/10) cycles otherwise.
    The stock S&K function is 8(2/0) cycles slower than the S2 version, but it can be safely used when the source is in RAM (the S2 version requires some extra care, but I won't go into details). So its times are:
    • 52(12/0) cycles if the queue was full;
    • 344(53/9) cycles if the new transfer filled the queue;
    • 354(54/10) cycles otherwise.
    The Sonic3_Complete version is based on the stock S&K version; it is thus also safe with RAM sources. However, it breaks up DMA transfers that cross 32kB boundaries into two DMA transfers(*). The way it does this adds an enormous overhead to all DMA transfers. Its times are:
    • If the source is in address $800000 and up (32x RAM, z80 RAM, main RAM):
      • 72(16/0) cycles if the queue was full;
      • 364(57/9) cycles if queue became full with new command;
      • 374(58/10) cycles otherwise;
    • If the source is in address $7FFFFF and down (ROM, both SCD RAMs):
      • If the DMA does not need to be split:
        • 294(53/10) cycles if the queue was full at the start;
        • 586(94/19) cycles if queue became full with new command;
        • 596(95/20) cycles otherwise;
      • If the DMA needs to be split in two:
        • 436(83/30) cycles if the queue was full at the start;
        • 728(124/21) cycles if queue became full with the first command;
        • 1030(166/31) cycles if queue became full with the second command;
        • 1040(167/32) cycles otherwise.
    Yeah, you are wasting hundreds of cycles by using the Sonic3_Complete version... but even more than you think when you note the asterisk I added above. You see, the VDP has issues with DMAs that cross a 128kB boundary in ROM; the Sonic3_Complete version tries to handle this, but is overzealous -- it breaks up transfers that cross a 32kB boundary instead. Thus, loads of DMAs that should not be broken at all are broken into two... leading to several hundred wasted cycles. The function is bad enough that manually breaking up the transfers would be much faster -- potentially two thirds of the time.

    So, how does my optimized function compare with this? There are two versions you can select with a flag during assembly: the "competitor" to the stock S2/stock S&K versions, which does not care whether or not transfers cross a 128kB boundary; and the "competitor" to the Sonic3_Complete version, which is 128kB safe. Both of them are safe for RAM sources, and this is done in an optimized way that has zero cost -- the functions would not be faster without this added protection. The times for the non-128kB-safe version are:
    • 48(11/0) cycles if the queue was full at the start;
    • 194(33/9) cycles otherwise.
    The times for the 128kB-safe version are:
    • 48(11/0) cycles if the queue was full at the start (as always);
    • 214(37/9) cycles for DMA transfers that do not need to be split into two;
    • 252(46/9) cycles if the first piece of the DMA filled the queue;
    • 368(64/16) cycles if both pieces of the DMA were queued.
    I will leave comparisons to whoever wants to make them. I will just mention that if you use SonMapEd-generated DPLCs and you are using the Sonic3_Complete function, you are easily wasting thousands of cycles every frame.

    So, now for how to add this function to your hacks.

    Git S2 version
    Find every instance of this code:
    [68k] move.l #VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w[/68k]
    and change it to this:
    [68k] move.w #VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w[/68k]
    Now find this:
    [68k] bsr.w VDPSetupGame[/68k]
    and change it to this:
    [68k] jsr (InitDMAQueue).l
    bsr.w VDPSetupGame[/68k]
    Now find the "SpecialStage" label and scan down to this:
    [68k] move #$2700,sr ; Mask all interrupts
    lea (VDP_control_port).l,a6
    move.w #$8B03,(a6) ; EXT-INT disabled, V scroll by screen, H scroll by line
    move.w #$8004,(a6) ; H-INT disabled
    move.w #$8ADF,(Hint_counter_reserve).w ; H-INT every 224th scanline
    move.w #$8230,(a6) ; PNT A base: $C000
    move.w #$8405,(a6) ; PNT B base: $A000
    move.w #$8C08,(a6) ; H res 32 cells, no interlace, S/H enabled
    move.w #$9003,(a6) ; Scroll table size: 128x32
    move.w #$8700,(a6) ; Background palette/color: 0/0
    move.w #$8D3F,(a6) ; H scroll table base: $FC00
    move.w #$857C,(a6) ; Sprite attribute table base: $F800
    move.w (VDP_Reg1_val).w,d0
    andi.b #$BF,d0
    move.w d0,(VDP_control_port).l[/68k]
    Add these lines after the above block:
    [68k] clr.w (VDP_Command_Buffer).w
    move.w #VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w[/68k]
    Then scan further down until you find this:
    [68k] clearRAM PNT_Buffer,$C04 ; PNT buffer[/68k]
    and change it to this:
    [68k] clearRAM PNT_Buffer,$C00 ; PNT buffer[/68k]
    Now find this:
    [68k]; ---------------------------------------------------------------------------
    ; Subroutine for queueing VDP commands (seems to only queue transfers to VRAM),
    ; to be issued the next time ProcessDMAQueue is called.
    ; Can be called a maximum of 18 times before the buffer needs to be cleared
    ; by issuing the commands (this subroutine DOES check for overflow)
    ; ---------------------------------------------------------------------------

    ; ||||||||||||||| S U B R O U T I N E |||||||||||||||||||||||||||||||||||||||

    ; sub_144E: DMA_68KtoVRAM: QueueCopyToVRAM: QueueVDPCommand: Add_To_DMA_Queue:
    QueueDMATransfer:[/68k]
    and delete everything from this up until (and including) this:
    [68k]; loc_14CE:
    ProcessDMAQueue_Done:
    move.w #0,(VDP_Command_Buffer).w
    move.l #VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w
    rts
    ; End of function ProcessDMAQueue[/68k]
    In its place, put the new code at the end of the post. You can also edit s2.constants.asm to reflect the fact that VDP_Command_Buffer_Slot is now a word instead of a long.

    Git S&K version
    Add the following equates somewhere:
    [68k]VDP_Command_Buffer := DMA_queue
    VDP_Command_Buffer_Slot := DMA_queue_slot[/68k]
    Then find all cases of
    [68k] move.w #0,(DMA_queue).w
    move.l #DMA_queue,(DMA_queue_slot).w[/68k]
    and change them to
    [68k] move.w #0,(DMA_queue).w
    move.w #DMA_queue,(DMA_queue_slot).w[/68k]
    Now find all cases of
    [68k] bsr.w Init_VDP[/68k]
    and change them to
    [68k] jsr (InitDMAQueue).l
    bsr.w Init_VDP[/68k]
    Now find this:
    [68k]; ---------------------------------------------------------------------------
    ; Adds art to the DMA queue
    ; Inputs:
    ; d1 = source address
    ; d2 = destination VRAM address
    ; d3 = number of words to transfer
    ; ---------------------------------------------------------------------------

    ; =============== S U B R O U T I N E =======================================


    Add_To_DMA_Queue:[/68k]
    and delete everything from this up until (and including) this:
    [68k]$$stop:
    move.w #0,(DMA_queue).w
    move.l #DMA_queue,(DMA_queue_slot).w
    rts
    ; End of function Process_DMA_Queue[/68k]
    In its place, put the new code at the end of the post. You can also edit sonic3k.constants.asm to reflect the fact that DMA_queue_slot is now a word instead of a long.

    The new function
    [68k]; ---------------------------------------------------------------------------
    ; Subroutine for queueing VDP commands (seems to only queue transfers to VRAM),
    ; to be issued the next time ProcessDMAQueue is called.
    ; Can be called a maximum of 18 times before the queue needs to be cleared
    ; by issuing the commands (this subroutine DOES check for overflow)
    ; ---------------------------------------------------------------------------
    ; Input:
    ; d1 Source address
    ; d2 Destination address
    ; d3 Transfer length
    ; Output:
    ; d0,d1,d2,d3,a1 trashed
    ;
    ; With both options below set to zero, the function runs in:
    ; * 48(11/0) cycles if the queue was full at the start;
    ; * 194(33/9) cycles otherwise
    ; The times for the original S2 function are:
    ; * 52(12/0) cycles if the queue was full at the start;
    ; * 336(51/9) cycles if queue became full with new command;
    ; * 346(52/10) cycles otherwise
    ; The times for the original S&K function are:
    ; * 52(12/0) cycles if the queue was full at the start;
    ; * 344(53/9) cycles if queue became full with new command;
    ; * 354(54/10) cycles otherwise
    ;
    ; If you are on S3&K, or you have ported S3&K KosM decompressor, you definitely
    ; want to edit it to mask off all interrupts before calling QueueDMATransfer:
    ; both this function *and* the original have numerous race conditions that make
    ; them unsafe for use by the KosM decoder, since it sets V-Int routine before it
    ; executes. This can lead to broken DMAs in some rare circumstances.
    ;
    ; Like the S3&K version, but unlike the S2 version, this function is "safe" when
    ; the source is in RAM; this comes at no cost whatsoever, unlike what happens in
    ; the S3&K version. Moreover, you can gain a few more cycles if the source is in
    ; RAM in a few cases: whenever a call to QueueDMATransfer has this instruction:
    ; andi.l #$FFFFFF,d1
    ; You can simply delete it and gain 16(3/0) cycles.
    ; ===========================================================================
    ; This option breaks DMA transfers that cross a 128kB boundary into two. It is
    ; disabled by default because you can simply align the art in ROM and avoid the
    ; issue altogether. It is here so that you have a high-performance routine to do
    ; the job in situations where you can't align it in ROM. It beats the equivalent
    ; functionality in the S&K disassembly with Sonic3_Complete flag set by a lot,
    ; especially since that version breaks up DMA transfers when they cross *32*kB
    ; boundaries instead of the problematic 128kB boundaries.
    ; This option adds 16(3/0) cycles to all DMA transfers that don't cross a 128kB
    ; boundary. For convenience, here are total times for all cases:
    ; * 48(11/0) cycles if the queue was full at the start (as always);
    ; * 214(37/9) cycles for DMA transfers that do not need to be split into two;
    ; * 252(46/9) cycles if the first piece of the DMA filled the queue;
    ; * 368(64/16) cycles if both pieces of the DMA were queued
    ; For comparison, times for the Sonic3_Complete version are:
    ; * If the source is in address $800000 and up (32x RAM, z80 RAM, main RAM):
    ; * 72(16/0) cycles if the queue was full
    ; * 364(57/9) cycles if queue became full with new command;
    ; * 374(58/10) cycles otherwise
    ; * If the source is in address $7FFFFF and down (ROM, both SCD RAMs):
    ; * If the DMA does not need to be split:
    ; * 294(53/10) cycles if the queue was full at the start;
    ; * 586(94/19) cycles if queue became full with new command;
    ; * 596(95/20) cycles otherwise
    ; * If the DMA needs to be split in two:
    ; * 436(83/30) cycles if the queue was full at the start;
    ; * 728(124/21) cycles if queue became full with the first command;
    ; * 1030(166/31) cycles if queue became full with the second command;
    ; * 1040(167/32) cycles otherwise
    ; Meaning you are wasting several hundreds of cycles on *each* *call*!
    ; What makes matters worse is that the Sonic3_Complete breaks up DMAs that it
    ; should not, meaning you will be wasting more cycles than can be seen by just
    ; comparing similar scenarios.
    Use128kbSafeDMA := 0
    ; ===========================================================================
    ; Option to mask interrupts while updating the DMA queue. This fixes many race
    ; conditions in the DMA function, but it costs 46(6/1) cycles. The better way to
    ; handle these race conditions would be to make unsafe callers (such as S3&K's
    ; KosM decoder) prevent these by masking off interrupts before calling and then
    ; restore interrupts after.
    UseVIntSafeDMA := 0
    ; ===========================================================================
    ; Option to assume that transfer length is always at most $7FFF words. Only makes
    ; sense if Use128kbSafeDMA is 1. Moreover, setting this to 1 will cause trouble
    ; on a 64kB DMA, so make sure you never do one if you set it to 1!
    ; Enabling this saves 4(1/0) cycles on the case where a DMA is broken in two and
    ; both transfers are properly queued, and nothing at all otherwise.
    AssumeMax7FFFXfer := 0&Use128kbSafeDMA
    ; ===========================================================================
    ; Convenience macros, for increased maintainability of the code.
    ifndef DMA
    DMA = %100111
    endif
    ifndef READ
    READ = %001100
    endif
    ifndef VRAMCommReg_defined
    VRAMCommReg_defined := 1
    VRAMCommReg macro reg,rwd,clr
    lsl.l #2,reg ; Move high bits into (word-swapped) position, accidentally moving everything else
    if rwd <> READ
    addq.w #1,reg ; Add write bit...
    endif
    ror.w #2,reg ; ... and put it into place, also moving all other bits into their correct (word-swapped) places
    swap reg ; Put all bits in proper places
    if clr <> 0
    andi.w #3,reg ; Strip whatever junk was in upper word of reg
    endif
    if rwd == DMA
    tas.b reg ; Add in the DMA bit -- tas fails on memory, but works on registers
    endif
    endm
    endif

    ifndef intMacros_defined
    intMacros_defined := 1
    enableInts macro
    move #$2300,sr
    endm

    disableInts macro
    move #$2700,sr
    endm
    endif
    ; ===========================================================================

    ; ||||||||||||||| S U B R O U T I N E |||||||||||||||||||||||||||||||||||||||

    ; sub_144E: DMA_68KtoVRAM: QueueCopyToVRAM: QueueVDPCommand:
    Add_To_DMA_Queue:
    QueueDMATransfer:
    if UseVIntSafeDMA==1
    move.w sr,-(sp) ; Save current interrupt mask
    disableInts ; Mask off interrupts
    endif ; UseVIntSafeDMA==1
    movea.w (VDP_Command_Buffer_Slot).w,a1
    cmpa.w #VDP_Command_Buffer_Slot,a1
    beq.s .done ; return if there's no more room in the queue

    lsr.l #1,d1 ; Source address is in words for the VDP registers
    if Use128kbSafeDMA==1
    move.w d3,d0 ; d0 = length of transfer in words
    ; Compute position of last transferred word. This handles 2 cases:
    ; (1) zero length DMAs actually transfer $10000 words
    ; (2) (source+length)&$FFFF == 0
    subq.w #1,d0
    add.w d1,d0 ; d0 = ((src_address >> 1) & $FFFF) + ((xfer_len >> 1) - 1)
    bcs.s .double_transfer ; Carry set = ($10000 << 1) = $20000, or new 128kB block
    endif ; Use128kbSafeDMA==1

    ; Store VDP commands for specifying DMA into the queue
    swap d1 ; Want the high byte first
    move.w #$977F,d0 ; Command to specify source address & $FE0000, plus bitmask for the given byte
    and.b d1,d0 ; Mask in source address & $FE0000, stripping high bit in the process
    move.w d0,(a1)+ ; Store command
    move.w d3,d1 ; Put length together with (source address & $01FFFE) >> 1...
    movep.l d1,1(a1) ; ... and stuff them all into RAM in their proper places (movep for the win)
    lea 8(a1),a1 ; Skip past all of these commands

    VRAMCommReg d2, DMA, 1 ; Make DMA destination command
    move.l d2,(a1)+ ; Store command

    clr.w (a1) ; Put a stop token at the end of the used part of the queue
    move.w a1,(VDP_Command_Buffer_Slot).w ; Set the next free slot address, potentially undoing the above clr (this is intentional!)

    .done:
    if UseVIntSafeDMA==1
    move.w (sp)+,sr ; Restore interrupts to previous state
    endif ;UseVIntSafeDMA==1
    rts
    ; ---------------------------------------------------------------------------
    if Use128kbSafeDMA==1
    .double_transfer:
    ; Hand-coded version to break the DMA transfer into two smaller transfers
    ; that do not cross a 128kB boundary. This is done much faster (at the cost
    ; of space) than by the method of saving parameters and calling the normal
    ; DMA function twice, as Sonic3_Complete does.
    ; d0 is the number of words-1 that got over the end of the 128kB boundary
    addq.w #1,d0 ; Make d0 the number of words past the 128kB boundary
    sub.w d0,d3 ; First transfer will use only up to the end of the 128kB boundary
    ; Store VDP commands for specifying DMA into the queue
    swap d1 ; Want the high byte first
    ; Sadly, all registers we can spare are in use right now, so we can't use
    ; no-cost RAM source safety.
    andi.w #$7F,d1 ; Strip high bit
    ori.w #$9700,d1 ; Command to specify source address & $FE0000
    move.w d1,(a1)+ ; Store command
    addq.b #1,d1 ; Advance to next 128kB boundary (**)
    move.w d1,12(a1) ; Store it now (safe to do in all cases, as we will overwrite later if queue got filled) (**)
    move.w d3,d1 ; Put length together with (source address & $01FFFE) >> 1...
    movep.l d1,1(a1) ; ... and stuff them all into RAM in their proper places (movep for the win)
    lea 8(a1),a1 ; Skip past all of these commands

    move.w d2,d3 ; Save for later
    VRAMCommReg d2, DMA, 1 ; Make DMA destination command
    move.l d2,(a1)+ ; Store command

    cmpa.w #VDP_Command_Buffer_Slot,a1 ; Did this command fill the queue?
    beq.s .skip_second_transfer ; Branch if so

    ; Store VDP commands for specifying DMA into the queue
    ; The source address high byte was done above already in the comments marked
    ; with (**)
    if AssumeMax7FFFXfer==1
    ext.l d0 ; With maximum $7FFF transfer length, bit 15 of d0 is unset here
    movep.l d0,3(a1) ; Stuff it all into RAM at the proper places (movep for the win)
    else
    moveq #0,d2 ; Need a zero for a 128kB block start
    move.w d0,d2 ; Copy number of words on this new block...
    movep.l d2,3(a1) ; ... and stuff it all into RAM at the proper places (movep for the win)
    endif
    lea 10(a1),a1 ; Skip past all of these commands
    ; d1 contains length up to the end of the 128kB boundary
    add.w d1,d1 ; Convert it into byte length...
    add.w d1,d3 ; ... and offset destination by the correct amount
    VRAMCommReg d3, DMA, 1 ; Make DMA destination command
    move.l d3,(a1)+ ; Store command

    clr.w (a1) ; Put a stop token at the end of the used part of the queue
    move.w a1,(VDP_Command_Buffer_Slot).w ; Set the next free slot address, potentially undoing the above clr (this is intentional!)

    if UseVIntSafeDMA==1
    move.w (sp)+,sr ; Restore interrupts to previous state
    endif ;UseVIntSafeDMA==1
    rts
    ; ---------------------------------------------------------------------------
    .skip_second_transfer:
    move.w a1,(a1) ; Set the next free slot address, overwriting what the second (**) instruction did

    if UseVIntSafeDMA==1
    move.w (sp)+,sr ; Restore interrupts to previous state
    endif ;UseVIntSafeDMA==1
    rts
    endif ; Use128kbSafeDMA==1
    ; End of function QueueDMATransfer
    ; ===========================================================================

    ; ---------------------------------------------------------------------------
    ; Subroutine for issuing all VDP commands that were queued
    ; (by earlier calls to QueueDMATransfer)
    ; Resets the queue when it's done
    ; ---------------------------------------------------------------------------

    ; ||||||||||||||| S U B R O U T I N E |||||||||||||||||||||||||||||||||||||||

    ; sub_14AC: CopyToVRAM: IssueVDPCommands: Process_DMA:
    Process_DMA_Queue:
    ProcessDMAQueue:
    lea (VDP_control_port).l,a5
    lea (VDP_Command_Buffer).w,a1
    move.w a1,(VDP_Command_Buffer_Slot).w

    rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
    move.w (a1)+,d0
    beq.w .done ; branch if we reached a stop token
    ; issue a set of VDP commands...
    move.w d0,(a5)
    move.l (a1)+,(a5)
    move.l (a1)+,(a5)
    move.l (a1)+,(a5)
    endm
    moveq #0,d0

    .done:
    move.w d0,(VDP_Command_Buffer).w
    rts
    ; End of function ProcessDMAQueue
    ; ===========================================================================

    ; ---------------------------------------------------------------------------
    ; Subroutine for initializing the DMA queue.
    ; ---------------------------------------------------------------------------

    ; ||||||||||||||| S U B R O U T I N E |||||||||||||||||||||||||||||||||||||||

    InitDMAQueue:
    lea (VDP_Command_Buffer).w,a1
    move.w #0,(a1)
    move.w a1,(VDP_Command_Buffer_Slot).w
    move.l #$96959493,d1
    rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
    movep.l d1,2(a1)
    lea 14(a1),a1
    endm
    rts
    ; End of function InitDMAQueue
    ; ===========================================================================
    [/68k]
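    The VRAMCommReg macro in the listing above builds the VDP control-port command through a short sequence of shifts and rotates. As a rough illustration only, here is a Python model of those steps for the DMA write case, checked against the usual closed-form layout of a VRAM write command. The function names are made up for this sketch and are not part of the routine.

```python
def vram_comm_reg(reg, dma=True, clr=True):
    """Model of VRAMCommReg (write variant) acting on a 32-bit register value."""
    reg = (reg << 2) & 0xFFFFFFFF                    # lsl.l #2,reg
    low = ((reg & 0xFFFF) + 1) & 0xFFFF              # addq.w #1,reg (write bit)
    low = ((low >> 2) | (low << 14)) & 0xFFFF        # ror.w #2,reg
    reg = (reg & 0xFFFF0000) | low
    reg = ((reg << 16) | (reg >> 16)) & 0xFFFFFFFF   # swap reg
    if clr:
        reg = (reg & 0xFFFF0000) | (reg & 0x0003)    # andi.w #3,reg
    if dma:
        reg |= 0x80                                  # tas.b reg sets bit 7
    return reg


def vram_dma_command(addr):
    """Closed form: VRAM write with the DMA bit set, for a 16-bit VRAM address."""
    return 0x40000080 | ((addr & 0x3FFF) << 16) | ((addr >> 14) & 3)


print(hex(vram_comm_reg(0xF800)))   # 0x78000083
```

    With clr set, junk in the upper word of the register does not affect the result, which is why the routine can pass d2 without clearing it first.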
    Additional Care
    There are some additional points that are worth paying attention to.

    128kB boundaries and you
    For both S2 and S&K (or anywhere else you want to use this), the version that does not check for 128kB boundaries is the default. The reason is this: you can (and should) always align the problematic art in such a way that the DMA never needs to be split in two. So enabling the check by default would carry a penalty with little real benefit. In any case, you can toggle this by setting the Use128kbSafeDMA option to 1.
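    To make the boundary rule concrete, here is a small Python model of the decision the 128kB-safe version makes. It is a sketch under stated assumptions, not a translation of the routine: split_dma is a hypothetical name, the source is assumed to be a ROM byte address, and the length is given in words (1 to $10000).

```python
BLOCK_WORDS = 0x10000   # 128kB = $20000 bytes = $10000 words


def split_dma(src, dest, words):
    """Return a list of (source, VRAM destination, length in words) pieces."""
    src_word = src >> 1
    # Offset of the LAST transferred word within its 128kB block.
    last = (src_word & 0xFFFF) + words - 1
    if last < BLOCK_WORDS:
        return [(src, dest, words)]             # stays inside one block
    second = last - BLOCK_WORDS + 1             # words past the boundary
    first = words - second                      # words up to the boundary
    next_block = ((src >> 17) + 1) << 17        # start of the next 128kB block
    return [(src, dest, first), (next_block, dest + 2 * first, second)]


# Ends exactly on the boundary: no split is needed.
print(split_dma(0x1FE00, 0xF000, 0x100))
# Crosses the boundary: two pieces of $80 words each.
print(split_dma(0x1FF00, 0xF000, 0x100))
```

    Checking the last word rather than the address one past the end is what keeps a transfer that ends exactly on a boundary from being split.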

    Transfers of 64kB or larger
    If you have enabled the version that breaks DMAs into two if they go over a 128kB boundary, this is relevant for you. There is an option (AssumeMax7FFFXfer) that saves 4(1/0) cycles in the case where a DMA transfer is broken in two pieces and both pieces are correctly queued (that is, the first transfer did not fill the queue). This option assumes that you never perform a transfer with length of 64kB or higher; note that 64kB is included here! In these conditions, a small optimization exists that leads to the small savings mentioned. The option is disabled by default to avoid this edge case.

    Interrupt Safety
    The original functions have several race conditions that make them unsafe regarding interrupts. My version removes one of them, but adds another. For the vast majority of cases, this is irrelevant -- the QueueDMATransfer function is generally called only when Vint_routine is zero, meaning that the DMA queue will not be processed, and all is well.

    There is one exception, though: the S3&K KosM decoder. Since the KosM decoder sets Vint_routine to a nonzero value, you can potentially run into an interrupt in the middle of a QueueDMATransfer call. Effects range from lost DMA transfers, to garbage DMA transfers, to one garbage DMA and a lost DMA (if the transfer was split), or, in the best possible outcome, no ill effects at all. You can toggle interrupt safety by setting the UseVIntSafeDMA flag to 1, but this adds overhead to all safe callers; better would be to fix the unsafe callers to mask interrupts while the DMA transfer is being queued.

    ASM68k
    If you use this crap, all you need to do to use the code above is:
    1. replace the dotted labels (composed symbols) by @ labels (local symbols);
    2. replace the last two instances of "endm" by "endr";
    3. edit the VRAMCommReg macro to use asm68k-style parameters.

    Before you complain that asm68k is not crap, I invite you to assemble the following and check the output:
    [68k] move.w (d0),d1
    move.w d0 ,d1
    dc.b 1 , 2
    moveq #$80,d0[/68k]
     
  2. RetroKoH

    RetroKoH

    Member
    1,662
    22
    18
    Project Sonic 8x16
    I wanna port this to Sonic 1, as I've got the DMA going on there.
    DISREGARD THE QUESTION IF YOU SAW IT... I THINK I'm FIGURING IT OUT
     
  3. RetroKoH

    Double posting as I'm in need of help. Here is my DMA queue.asm of my attempt to port this over into a Sonic 1 disassembly that has DMA queue already implemented. When I do this... my screen gets completely glitched and Sonic's tiles are thrown all over VRAM like mad... some help would be appreciated.
     
  4. Hitaxas

    Hitaxas

    Retro 80's themed Twitch streamer ( on hiatus) Member
    I'm actually getting something along the lines of what KingofHarts is experiencing, but with my attempt to slap this into Sonic 2. Not sure why though.
     
  5. flamewing

    @KoH: in your version, I noticed you replaced the movep's by move's; for example:
    [68k]InitDMAQueue_Loop:
    move.l d1,2(a1)[/68k]
    This should be, instead:
    [68k]InitDMAQueue_Loop:
    movep.l d1,2(a1)[/68k]
    So of course your version does not work. movep reads/writes alternating bytes from RAM/ROM/SRAM (so a "movep.l d0,0(a1)" would write to a1+0, a1+2, a1+4 and a1+6), whereas move writes to contiguous bytes ("move.l d0,0(a1)" writes to a1, a1+1, a1+2, a1+3).
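    The difference is easy to see on a single 14-byte queue entry. The following Python model is only an illustration (the helper names are invented here): it pre-fills the entry the way InitDMAQueue does, then stores source and length the way QueueDMATransfer does, and shows what happens if move.l is used for the pre-fill instead.

```python
def movep_l(mem, addr, value):
    """movep.l Dn,d(An): store the 4 bytes of value at addr, addr+2, addr+4, addr+6."""
    for i, byte in enumerate(value.to_bytes(4, "big")):
        mem[addr + 2 * i] = byte


def move_l(mem, addr, value):
    """move.l Dn,d(An): store the 4 bytes of value at addr..addr+3."""
    mem[addr:addr + 4] = value.to_bytes(4, "big")


def build_entry(init_store):
    slot = bytearray(14)                   # one queue entry
    init_store(slot, 2, 0x96959493)        # InitDMAQueue: register numbers
    # QueueDMATransfer: source bits $1234 and length $0080 go into the
    # odd bytes (1(a1) after the (a1)+, so offset 3 within the entry).
    movep_l(slot, 3, 0x12340080)
    return slot.hex()


print(build_entry(movep_l))   # 0000961295349400938000000000
print(build_entry(move_l))    # 0000961294340000008000000000
```

    With movep the entry holds the command words $9612, $9534, $9400 and $9380; with move.l the register numbers land in the wrong bytes and the VDP receives garbage.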
     
  6. RetroKoH

    Yes sir, that indeed fixed it. I took them out cuz for some reason I thought those instructions were giving me errors... which in hindsight doesn't really make any sense. It was the rept's that didn't work. Everyone using Sonic 1 can see how I took care of that in the file posted above... which has been edited and should now work with no issues.
     
  7. flamewing

    Oh, I forgot -- the repts should work fine if you replace the "endm"s with "endr"s. I added this to the guide.

    @Hitaxas: can you elaborate on what you did? Because I tested the code on clean S2 and S&K (with and without Sonic3_Complete) disassemblies, and it worked in all cases.
     
  8. RetroKoH

    Ah... well then I found an alternate way to achieve the same result. I'd imagine yours is faster but could you figure that out for sure, one way or the other?
     
  9. flamewing

    Without even looking at the cycle counts, I say that mine is faster because there is no loop overhead. But I looked at it anyway to give a quantitative value:

    For ProcessDMAQueue:
    My version:
    [68k] rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
    move.w (a1)+,d0 ; 8(2/0)
    beq.w .done ; T: 10(2/0); F: 12(2/0)
    move.w d0,(a5) ; 8(1/1)
    move.l (a1)+,(a5) ; 20(3/2)
    move.l (a1)+,(a5) ; 20(3/2)
    move.l (a1)+,(a5) ; 20(3/2)
    endm
    ; For up to 17 transfers: 88(14/7) * numtransfers + 18(4/0)
    ; For 18 transfers: 1584(252/126)[/68k]
    Yours:
    [68k] move.w (a1)+,d0 ; 8(2/0)
    beq.s ProcessDMAQueue_Done ; T: 10(2/0); F: 8(1/0)
    move.w d0,(a5) ; 8(1/1)
    move.l (a1)+,(a5) ; 20(3/2)
    move.l (a1)+,(a5) ; 20(3/2)
    move.l (a1)+,(a5) ; 20(3/2)
    cmpa.w #$C8FC,a1 ; 10(2/0)
    bne.s ProcessDMAQueue_Loop ; T: 10(2/0); F: 8(1/0)
    ; For up to 17 transfers: 104(17/7) * numtransfers + 18(4/0)
    ; For 18 transfers: 1870(305/126)[/68k]
    There is also the 4(1/0) cycles your version adds because it uses "move.w #0,(VDP_Command_Buffer).w", while mine uses "move.w d0,(VDP_Command_Buffer).w", but mine also adds 4(1/0) cycles due to the "moveq #0,d0" in the case of 18 transfers.

    For InitDMAQueue:
    Mine:
    [68k] rept (VDP_Command_Buffer_Slot-VDP_Command_Buffer)/(7*2)
    movep.l d1,2(a1) ; 24(2/4)
    lea 14(a1),a1 ; 8(2/0)
    endm
    ; 576(72/72)[/68k]
    Yours:
    [68k] movep.l d1,2(a1) ; 24(2/4)
    lea 14(a1),a1 ; 8(2/0)
    cmpa.w #$C8FC,a1 ; 10(2/0)
    bne.s InitDMAQueue_Loop ; T: 10(2/0); F: 8(1/0)
    ; 934(143/72)[/68k]
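    The totals in the comments above can be reproduced by adding up the per-instruction costs. Here is a small Python sketch that does so; the helper names are invented for illustration, and the timings are the cycles(reads/writes) figures quoted in the listings, with an 18-entry queue.

```python
def total(costs):
    """Sum a list of (cycles, reads, writes) tuples."""
    return tuple(sum(column) for column in zip(*costs))


def scale(cost, n):
    return tuple(n * c for c in cost)


def full_loop(block, n=18):
    """n loop iterations; the final bne.s falls through: 8(1/0), not 10(2/0)."""
    cycles, reads, writes = scale(block, n)
    return (cycles - 2, reads - 1, writes)


# ProcessDMAQueue: one block per queued transfer.
unrolled = total([(8, 2, 0), (12, 2, 0), (8, 1, 1)] + 3 * [(20, 3, 2)])
looped = total([(8, 2, 0), (8, 1, 0), (8, 1, 1)] + 3 * [(20, 3, 2)]
               + [(10, 2, 0), (10, 2, 0)])
exit_cost = total([(8, 2, 0), (10, 2, 0)])   # stop token read + taken beq

print(unrolled, looped, exit_cost)           # (88, 14, 7) (104, 17, 7) (18, 4, 0)
print(scale(unrolled, 18), full_loop(looped))

# InitDMAQueue: movep.l + lea, with cmpa.w + bne.s added in the looped form.
print(scale((32, 4, 4), 18), full_loop((52, 8, 4)))
```

    For a full queue this gives 1584(252/126) against 1870(305/126) for ProcessDMAQueue, and 576(72/72) against 934(143/72) for InitDMAQueue.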
     
  10. RetroKoH

    Good to know. I never knew you could do loops this way with such efficiency. I should try putting this to use in other game mechanics that loop in a similar manner as well.
     
  11. flamewing

    I actually removed the loops; the rept/endm block repeats the block of instructions contained in it as many times as specified.
     
  12. flamewing

    After helping Hitaxas over PM, it turns out his issues were caused by the "RetroHack" and "Sonic Retro" splash screens, which hard-coded the S1 OST, thus overwriting the DMA queue and making the initialization moot. So if you use either of these, be warned that you will either have to fix them, or you will have to initialize the DMA region after they have executed.
     
  13. RetroKoH

    So an update... this is weird.

    I added the rept instructions. For the second one (in INIT) it builds and runs fine.

    But for the other block just before it, I get all these errors occurring on the rept line:

    ...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (212 bytes) is out of range
    ...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (200 bytes) is out of range
    ...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (188 bytes) is out of range
    ...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (176 bytes) is out of range
    ...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (164 bytes) is out of range
    ...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (152 bytes) is out of range
    ...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (140 bytes) is out of range
    ...\SONIC 1 REV C (R46)\_INC\DMA QUEUE.ASM(140) : Error : Branch (128 bytes) is out of range

    ALSO, I fixed the bug that I mentioned before thanks to you, but now, while all the characters work, Tails' tails object causes the bug... and it only causes this bug when Tails is at a certain diagonal angle when jumping/rolling.
    Any idea?
     
  14. flamewing

    The other block needs to use beq.w instead of beq.s -- the short branch that was in the original is too short.

    And which bug was that again?
     
  15. RetroKoH

    The bug I previously encountered before you had me using movep...

    Where my screen was consumed by garbled art... last time I didn't post a screenshot because it'd been fixed... but now here it is:
    IMAGE

    Does this ONLY when Tails' tail object is loaded. I take that out, and never get any bugs. Also it only occurs when Tails is rolling, and at an angle...
     
  16. flamewing

    This is happening then because Tails' tails is overwriting part of the DMA queue's RAM; my version is so much faster because it assumes this does not happen. If you want, you can send me a ROM and I will tell you exactly where the error is.
     
  17. RetroKoH

    EDITED POST

    Posted the rom, looking forward to hearing back
     
  18. flamewing

    After debugging KingofHarts' issue, I found out that it was caused by a slight oversight of mine in an edge case: when deciding whether or not to break up a DMA transfer that crosses a 128kB block, I was actually checking whether the last word transferred would be on a new 128kB block or not. I updated the first post with the new version. This caused the 128kB-safe version to become slightly slower; the new times for this version are:
    • 48(11/0) cycles if the queue was full at the start (as always) [unchanged];
    • 214(37/9) cycles for DMA transfers that do not need to be split into two [increased by 4(1/0)];
    • 252(46/9) cycles if the first piece of the DMA filled the queue [increased by 8(2/0)];
    • 364(63/16) cycles if both pieces of the DMA were queued [increased by 8(2/0)].
    The non-128kB-safe version remains the same speed, which is all the more reason to just align the art in ROM to avoid the issue altogether.
     
  19. MainMemory

    MainMemory

    Kate the Wolf Tech Member
    4,743
    338
    63
    SonLVL
    It may be worth mentioning that there is a way to automatically detect overflows in DPLCs using my sprite mappings macros with AS. Add these lines just before the endm in the dplcEntry macro:
    Code (Text):
    1.     if dplcTiles <> 0
    2.     if ((dplcTiles+(offset*$20))/131072) <> ((dplcTiles+(offset*$20)+(tiles*$20)-1)/131072)
    3.     message "Warning: DPLC crosses 128K boundary! line: \{MOMLINE/1.0} start: offset count: tiles overflow: $\{(dplcTiles+(offset*$20)+(tiles*$20))#131072}"
    4.     endif
    5.     endif
    Also make sure to put "dplcTiles := 0" above the macro to initialize it. Then you can put "dplcTiles := ArtUnc_Sonic" at the top of your DPLC file and "dplcTiles := 0" at the end, and if any DPLCs cross a 128K boundary you'll get a message like "Warning: DPLC crosses 128K boundary! line: 705 start: $7C8 count: $10 overflow: $60".
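    For anyone who wants to run the same check outside the assembler, the condition in the macro translates directly. This Python sketch is an illustration only: dplc_crossing is a made-up name, and the art base address $10560 is a hypothetical value picked so that the result matches the sample message above.

```python
TILE_BYTES = 0x20        # one 8x8 tile
BLOCK_BYTES = 0x20000    # 131072 bytes = 128kB


def dplc_crossing(art_base, tile_offset, tile_count):
    """Return the overflow in bytes if the entry crosses a 128kB boundary, else None."""
    start = art_base + tile_offset * TILE_BYTES
    end = start + tile_count * TILE_BYTES          # one past the last byte
    if start // BLOCK_BYTES == (end - 1) // BLOCK_BYTES:
        return None                                # first and last byte share a block
    return end % BLOCK_BYTES


overflow = dplc_crossing(0x10560, 0x7C8, 0x10)     # hypothetical art address
print(hex(overflow))                               # 0x60
```

    As in the macro, the comparison uses the last byte of the entry, so an entry that ends exactly on a boundary is not reported.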
     
  20. RetroKoH

    With 6 characters each having their own art, I have to take the slower route. Aligning them ALL would be brutal on ROM size. Thank you for this though, regardless. :D