This happens because Tails' tails is overwriting part of the DMA queue's RAM; my version is so much faster because it assumes this does not happen. If you want, you can send me a ROM and I will tell you exactly where the error is.
Optimizing the DMA queue Now on GitHub
#17
Posted 17 August 2014 - 03:05 PM
EDITED POST
Posted the rom, looking forward to hearing back
This post has been edited by KingofHarts: 20 August 2014 - 10:39 AM
#18
Posted 01 September 2014 - 06:42 PM
After debugging KingofHarts' issue, I found out that it was caused by a slight oversight of mine in an edge case: when deciding whether or not to break up a DMA transfer that crosses a 128kB block, I was actually checking whether the last word transferred would be on a new 128kB block or not. I updated the first post with the new version. This caused the 128kB-safe version to become slightly slower; the new times for this version are:
- 48(11/0) cycles if the queue was full at the start (as always) [unchanged];
- 214(37/9) cycles for DMA transfers that do not need to be split into two [increased by 4(1/0)];
- 252(46/9) cycles if the first piece of the DMA filled the queue [increased by 8(2/0)];
- 364(63/16) cycles if both pieces of the DMA were queued [increased by 8(2/0)].
The non-128kB-safe version remains the same speed, which is all the more reason to just align the art in ROM to avoid the issue altogether.
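For anyone who wants to see the splitting rule outside of 68k code, here is a minimal Python sketch of it. It is only a model of the idea, not the queue routine itself: the name split_at_128k is made up for illustration, addresses and lengths are in bytes, and it assumes a transfer is never longer than 128kB, so at most two pieces are needed.

```python
BLOCK = 0x20000  # 128 kB; a single DMA must not cross a multiple of this


def split_at_128k(src, length):
    """Return a list of (source, length) pieces, none of which crosses
    a 128 kB boundary. Hypothetical helper; assumes length <= 128 kB."""
    if length <= 0:
        return []
    last = src + length - 1            # address of the last byte transferred
    if src // BLOCK == last // BLOCK:  # first and last byte in the same block
        return [(src, length)]
    boundary = (src // BLOCK + 1) * BLOCK
    first_len = boundary - src
    return [(src, first_len), (boundary, length - first_len)]


if __name__ == "__main__":
    for pieces in (split_at_128k(0x1FE00, 0x200), split_at_128k(0x1FF00, 0x200)):
        print([(hex(s), hex(n)) for s, n in pieces])
```

Note that a transfer ending exactly on the boundary (last byte at $1FFFF) stays in one piece; the test uses the last byte transferred, not the address one past the end.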
#19
Posted 01 September 2014 - 07:00 PM
It may be worth mentioning that there is a way to automatically detect overflows in DPLCs using my sprite mappings macros with AS. Add these lines just before the endm in the dplcEntry macro:
if dplcTiles <> 0
if ((dplcTiles+(offset*$20))/131072) <> ((dplcTiles+(offset*$20)+(tiles*$20)-1)/131072)
message "Warning: DPLC crosses 128K boundary! line: \{MOMLINE/1.0} start: offset count: tiles overflow: $\{(dplcTiles+(offset*$20)+(tiles*$20))#131072}"
endif
endif
Also make sure to put "dplcTiles := 0" above the macro to initialize it. Then you can put "dplcTiles := ArtUnc_Sonic" at the top of your DPLC file and "dplcTiles := 0" at the end, and if any DPLCs cross a 128K boundary you'll get a message like "Warning: DPLC crosses 128K boundary! line: 705 start: $7C8 count: $10 overflow: $60".
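The same test can be modelled in Python to check a DPLC entry by hand. This is a sketch of the condition in the macro above, nothing more: dplc_check is a made-up name, and the art address $30560 in the usage lines is a hypothetical value chosen so that the numbers match the sample warning.

```python
TILE_SIZE = 0x20   # bytes per 8x8 tile
BLOCK = 131072     # 128 kB, same constant as in the macro


def dplc_check(art_base, offset, tiles):
    """Return (crosses, overflow) for one DPLC entry.

    art_base: ROM address of the art (dplcTiles in the macro)
    offset:   first tile of the entry
    tiles:    number of tiles in the entry
    """
    start = art_base + offset * TILE_SIZE
    end = start + tiles * TILE_SIZE            # one past the last byte
    crosses = (start // BLOCK) != ((end - 1) // BLOCK)
    overflow = end % BLOCK if crosses else 0   # bytes past the boundary
    return crosses, overflow


if __name__ == "__main__":
    crosses, overflow = dplc_check(0x30560, 0x7C8, 0x10)  # hypothetical art address
    print(crosses, hex(overflow))
```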
This post has been edited by MainMemory: 11 September 2014 - 02:06 PM
#20
Posted 01 September 2014 - 11:17 PM
flamewing, on 01 September 2014 - 06:42 PM, said:
After debugging KingofHarts' issue, I found out that it was caused by a slight oversight of mine in an edge case: when deciding whether or not to break up a DMA transfer that crosses a 128kB block, I was actually checking whether the last word transferred would be on a new 128kB block or not. I updated the first post with the new version. This caused the 128kB-safe version to become slightly slower; the new times for this version are:
- 48(11/0) cycles if the queue was full at the start (as always) [unchanged];
- 214(37/9) cycles for DMA transfers that do not need to be split into two [increased by 4(1/0)];
- 252(46/9) cycles if the first piece of the DMA filled the queue [increased by 8(2/0)];
- 364(63/16) cycles if both pieces of the DMA were queued [increased by 8(2/0)].
The non-128kB-safe version remains the same speed, which is all the more reason to just align the art in ROM to avoid the issue altogether.
With 6 characters each having their own art, I have to take the slower route. Aligning them ALL would be brutal on ROM size. Thank you for this though, regardless. :D
#21
Posted 02 September 2014 - 12:21 AM
KingofHarts, on 01 September 2014 - 11:17 PM, said:
With 6 characters each having their own art, I have to take the slower route. Aligning them ALL would be brutal on ROM size. Thank you for this though, regardless. :D
The art doesn't need to start on a 128KB "bank"; it must simply not cross a 128KB boundary. If you have space left in a bank, you can move other, smaller data to the bank to fill it, instead of wasting the space with padding. However, if you're still developing your hack and the data is subject to change in size, you may want to do this later, otherwise you may have to reorganize the data repeatedly (and in the meantime, use the 128KB-safe version).
#22
Posted 02 September 2014 - 10:03 AM
Good to know... how would I know where the first "bank" starts, exactly?
#23
Posted 02 September 2014 - 10:24 AM
I suppose you would have to use a listing file to determine where the art starts in the ROM, then round up to the nearest multiple of $20000. If that address is in your art file, you'll have to check the DPLCs to see if any of them do a transfer that crosses that address, and shift the alignment of the art file so that none of the DPLC entries cross that boundary. The art file itself can cross the boundary, just as long as none of the individual DPLC entries cross it.
Or if you're using AS, you could switch the DPLCs to my macro format and add the detection code, then shift the art if it gives any warnings. It might be possible to do it with ASM68K but I'd have to go searching through the manual.
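If you go the listing-file route, the arithmetic is simple enough to script. Here is a rough Python sketch, assuming you have already read the art's start address and size from the listing; the function names and the addresses in the usage lines are made up for illustration.

```python
BLOCK = 0x20000  # 128 kB


def next_boundary(addr):
    """Round addr up to the nearest multiple of $20000."""
    return (addr + BLOCK - 1) & ~(BLOCK - 1)


def boundary_inside(art_start, art_size):
    """Return the 128 kB boundary that falls strictly inside the art file,
    or None if there is none. Assumes the file is smaller than 128 kB."""
    boundary = next_boundary(art_start)
    if art_start < boundary < art_start + art_size:
        return boundary
    return None


if __name__ == "__main__":
    # Hypothetical values: art at $30560, $14000 bytes long.
    hit = boundary_inside(0x30560, 0x14000)
    print(hex(hit) if hit is not None else "no boundary inside the art")
```

If this reports a boundary, only the DPLC entries whose byte range straddles that address need attention; the file itself can still cross it.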
#24
Posted 26 September 2014 - 11:18 AM
Every single (S2) disasm I've applied this to has blown up. Ranging from fresh disasms, to my own hack. The Special Stage, especially. But I have had ARZ cause the VDP to melt down too.
You can get it to trigger by applying this to a clean Git disasm, and then going to the Special Stage via level select or checkpoint. What's going on? Is this a case of something "overwriting part of the DMA queue's RAM"?
#25
Posted 26 September 2014 - 12:56 PM
Ooh, nice catch, I knew I had forgotten something. What needs to be done is this:
Find the "SpecialStage" label and scan down to this:
move #$2700,sr ; Mask all interrupts
lea (VDP_control_port).l,a6
move.w #$8B03,(a6) ; EXT-INT disabled, V scroll by screen, H scroll by line
move.w #$8004,(a6) ; H-INT disabled
move.w #$8ADF,(Hint_counter_reserve).w ; H-INT every 224th scanline
move.w #$8230,(a6) ; PNT A base: $C000
move.w #$8405,(a6) ; PNT B base: $A000
move.w #$8C08,(a6) ; H res 32 cells, no interlace, S/H enabled
move.w #$9003,(a6) ; Scroll table size: 128x32
move.w #$8700,(a6) ; Background palette/color: 0/0
move.w #$8D3F,(a6) ; H scroll table base: $FC00
move.w #$857C,(a6) ; Sprite attribute table base: $F800
move.w (VDP_Reg1_val).w,d0
andi.b #$BF,d0
move.w d0,(VDP_control_port).l
Add these lines after the above block:
clr.w (VDP_Command_Buffer).w
move.w #VDP_Command_Buffer,(VDP_Command_Buffer_Slot).w
Then scan further down until you find this:
clearRAM PNT_Buffer,$C04 ; PNT buffer
and change it to this:
clearRAM PNT_Buffer,$C00 ; PNT buffer
I missed this because these were fixed in SCH for a long, long time. An alternative fix is to skip the first change and change the latter to this:
clearRAM PNT_Buffer,$C02 ; PNT buffer
This is a buggy fix, though; you may lose DMA transfers because of the queue filling.
I updated the starting post to reflect these changes as well.
#26
Posted 02 February 2015 - 07:35 PM
Tiddles ran into an edge case that happens with Use128kbSafeDMA = 1. After poking around, I found out that the fix to the previous edge case added another edge case, which apparently is much rarer as no one else noticed. The issue is in this code:
move.w d1,d0 ; d0 = (src_address >> 1) & $FFFF
subq.w #1,d0 ; To guard against the case where (d0+d3)&$FFFF == 0
; Note: unless you modded your Genesis for 128kB of VRAM, then d3 can be at
; most $7FFF here in a valid call; we will assume this is the case
add.w d3,d0 ; d0 = ((src_address >> 1) & $FFFF) + (xfer_len >> 1) - 1
bcs.s .double_transfer ; Carry set = ($10000 << 1) = $20000, or new 128kB block
When the source address is exactly at the start of a 128kB boundary, the "subq.w #1,d0" will make the "add.w d3,d0" incorrectly set the carry flag, and the DMA queue will break up the DMA into a zero-length DMA* (bad) and a DMA with the remainder of the transfer.
The fix is rather simple, and comes at no cost; it also has an additional benefit: it handles another edge case, that of a zero-length DMA*. You want to replace that bit of code with this:
; Note: unless you modded your Genesis for 128kB of VRAM, then d3 can be at
; most $7FFF here in a valid call; we will assume this is the case
move.w d3,d0 ; d0 = length of transfer in words
; Compute position of last transferred word. This handles 2 cases:
; (1) zero-length DMAs actually transfer $10000 words
; (2) (source+length)&$FFFF == 0
subq.w #1,d0
add.w d1,d0 ; d0 = ((src_address >> 1) & $FFFF) + ((xfer_len >> 1) - 1)
bcs.s .double_transfer ; Carry set = ($10000 << 1) = $20000, or new 128kB block
I updated the version on the initial post with this, and added some other changes I had made locally as well (which not everyone will like, as it involves a macro).
* = As you know, a zero-length DMA is actually a DMA with length of $10000 words, which transfers 128kB of data.
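To see why the order of operations matters, the two snippets can be modelled in Python with explicit 16-bit wraparound. This is only a sketch of the carry logic: old_check and new_check are made-up names, d1 stands for (src_address >> 1) & $FFFF and d3 for the transfer length in words, as in the comments above.

```python
MASK = 0xFFFF  # registers are treated as 16-bit words


def old_check(d1, d3):
    """Old code: subtract 1 from the source word address, then add the length.
    Returns True when the final add.w would set the carry flag."""
    d0 = (d1 - 1) & MASK   # subq.w #1,d0 (wraps to $FFFF when d1 == 0)
    return d0 + d3 > MASK  # add.w d3,d0 / bcs.s


def new_check(d1, d3):
    """New code: subtract 1 from the length, then add the source word address."""
    d0 = (d3 - 1) & MASK   # subq.w #1,d0 (a zero length becomes $FFFF)
    return d0 + d1 > MASK  # add.w d1,d0 / bcs.s


if __name__ == "__main__":
    # Source exactly at the start of a 128 kB block, $10 words long:
    print(old_check(0x0000, 0x10), new_check(0x0000, 0x10))
```

With the source at the start of a block, the old version wraps d0 to $FFFF and any non-zero length then carries, so the transfer is split even though it fits; the new version only carries when the last word really lands in the next block.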
#27
Posted 17 March 2015 - 05:23 PM
Bumping again to add some constants that were needed by all non-Git-S2 disassemblies which I didn't include in the previous update. These are:
VRAM = %100001
CRAM = %101011
VSRAM = %100101
; values for the rwd argument
READ = %001100
WRITE = %000111
DMA = %100111
and should be defined at some point. I updated the OP with it.
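In case it is not obvious what these bit patterns are for, here is a hedged Python sketch of how they can be combined into a 32-bit VDP command word. It assumes the usual scheme where the target and rwd values are ANDed together to get the 6-bit access code, which is then packed around the address the way the VDP control port expects; vdp_comm is an illustrative name here, not necessarily what your disassembly calls it.

```python
# Target memory (first group of constants above)
VRAM, CRAM, VSRAM = 0b100001, 0b101011, 0b100101
# Access type (the rwd argument)
READ, WRITE, DMA = 0b001100, 0b000111, 0b100111


def vdp_comm(addr, target, rwd):
    """Build a 32-bit VDP command word (illustrative model only).

    Low two code bits go to bits 31-30, address bits 13-0 to bits 29-16,
    the remaining code bits to bits 7-4, address bits 15-14 to bits 1-0.
    """
    code = target & rwd
    return (((code & 0x03) << 30)
            | ((addr & 0x3FFF) << 16)
            | ((code & 0xFC) << 2)
            | ((addr & 0xC000) >> 14))


if __name__ == "__main__":
    print(hex(vdp_comm(0x0000, VRAM, WRITE)))
    print(hex(vdp_comm(0xC000, VRAM, DMA)))
```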
#28
Posted 18 March 2015 - 02:44 PM
Quadruple post to mention a fix to another edge case.
If you have Use128kbSafeDMA set to 1, there is one set of cases where the function won't work correctly: if the transfer length is 64kB or higher (d3 = $8000 or more). For an unmodified Genesis, with the normal amount of VRAM, this is an issue only in the case of exactly 64kB (d3 = $8000): in this case, the second transfer will be wrong. For Tera Drives and modified Genesis consoles with 128kB of VRAM, the other cases also become an issue. I fixed this case by default, which makes the function slower in one case (the DMA is broken in two and both pieces are correctly queued) by 4(1/0) cycles; all other cases are unmodified.
If you want the old behavior (for example because you don't ever use a transfer of 64kB and you are not making a hack targeting machines with 128kB of VRAM), just set variable AssumeMax7FFFXfer to 1.
#29
Posted 12 July 2015 - 10:36 AM
In a record-setting quintuple post, I have an announcement and a bugfix.
The announcement is that the improved DMA queue (and instructions for its use) can now be obtained on GitHub. I will no longer update the OP with new developments, but I will post new things in the thread.
The bugfix: this is actually an issue only with S&K's Perform_DPLC function and 128kB-safe DMA. If you don't have the option enabled, or you are not using Perform_DPLC, then you are not affected.
The issue is that Perform_DPLC expects the high word of d3 to be unchanged by a call to the DMA function; and this was not true in the case where the DMA was split in two if 128kB-safe option was enabled. The fix is here if you want to apply it manually.
