How to improve the PLC Queue

Discussion in 'Engineering & Reverse Engineering' started by Clownacy, Jul 25, 2015.

  1. Clownacy

    Clownacy

    Tech Member
    838
    79
    28
    Sonic 2 Git disasm for now, blah blah


    Part 0: What is the PLC Queue?

    The PLC Queue is responsible for loading nemesis-compressed art. It is special in that it spaces out the loading of data so that the gameplay isn't halted (see Michael Jackson's Moonwalker for the flipside). This makes it suited for loading data... sneakily, and without an obvious loading screen. For example, in Sonic 2, the level only begins after around half of the object art loads, the rest loads during the first seconds of the level.

    It works by having the ROM source address and VRAM destination address of each set of art loaded into a $60 byte queue in RAM. Each entry is then processed, one at a time, transferring a very limited amount of art per frame, so to keep the gameplay going. The data on just how far into the nemesis-compressed art the system got is stored, and later loaded to continue the process. Once the current art is done, the queue is then shifted along to place the next art's data in place to be processed.

    Pattern Load Request Lists decide what goes into the queue itself. They consist of exactly what would be going into the queue: a ROM address and a VRAM address of the art that needs to be transferred, totalling in 6 bytes per entry. In Sonic 1, this includes the level art, the enemy art, obstacle art, etc.

    Technically, because $60/6 = 16, there should be room for 16 entries in the PLC Queue. Unfortunately, due to a bug, the 16th slot is broken, leading to a maximum of 15. Check out Vladikcomper's guide on fixing this. Also, check FraGag's guide on fixing another bug.

    To some, 16 entries just isn't enough, even when each level has two PLR lists to work around it. So, let's look into making better use of that RAM.


    Part 1: The BOOM way

    Saxman's Sonic Boom engine (hereafter referred to as BOOM) features an optimised form of the PLC Queue. It works by making some clever use of the addresses' properties. Firstly, the lowest 5 bits of the VRAM addresses are always blank, as are the highest 8 bits of the ROM addresses. In addition, the nemesis-compressed art is always aligned to an even address, so the lowest bit also goes unused. Because of the abundance of space, BOOM opts to store the data in the queue in a compacted form, with both the VRAM and ROM address occupying the same longword, reducing the size of each entry in the queue to 4 bytes, and increasing the maximum amount of entries that can fit into the queue to 24.

    Unfortunately, this comes at a cost: the art in ROM must be located within the first MB of the ROM, because the bits used for higher addresses are now used for storing the VRAM address. Also, this requires two words of RAM be reserved for storing offset data.

    To go more into more technical detail about BOOM's differences...
    • When the PLR list is loaded into the queue, the data is compacted. Then, when the data is read from the queue, it has to be split and shifted into a readable form.
    • When it comes to halting the art loading, to let the game continue, the stock PLC queue modifies the data in the queue itself to track how much art has been transferred. It does this by simply overwriting the ROM address and VRAM address with up-to-date ones. BOOM splits the up-to-date data into offset values, which are added to the data read from the queue, resulting in the same effect. In the ROM address' case, it does this because the data can sometimes be offsetted by an uneven value, something that can't be reflected by addresses missing their lowest bit. In the VRAM's case... it's pointless; you could easily store the offset data back into the queue, and save yourself 2 bytes of RAM.

    This is a good first step to improving the queue, but since BOOM never came with any guides on applying such changes to another disassembly... yeah, let's port this.

    Keep an eye on the code we add, here; we'll be coming back to it a lot later on.

    First up, go add two RAM variables: Plc_Buffer_ROM_offset and Plc_Buffer_VRAM_offset, each a word in size.

    Following that, go to LoadPLC, and change the addq.w to 4 instead of 6. This reflects the size of each slot in the queue. Next, replace the move.l/move.w/dbf above the movem.l with the following:

    Code (ASM):
    1.     ; BOOM code block
    2.     ; Begin loading data
    3. -    move.l    (a1)+,d1        ; ROM address
    4.     lsr.l    #1,d1
    5.     moveq    #0,d2
    6.     move.w    (a1)+,d2        ; VRAM address
    7.     swap    d2
    8.     or.l    d2,d1
    9.     move.l    d1,(a2)+
    10.     dbf    d0,-                ; Loop
    Speaking of movem.l, make those backup d1 and d2. As you can see, this block of code condenses the data before placing it into RAM.

    Now for LoadPLC2. Like before, replace the move.l/move.w/dbf above the movem.l with the above block of code, and make the movem.l's backup d1 and d2.

    Next up, RunPLC_RAM. Change the ++'s at the start to +++ (that was a fun bug to track down...). Now, replace the 'movea.l (Plc_Buffer).w,a0' with the following:

    Code (ASM):
    1.     ; BOOM code block 1
    2.     movem.l    d4/d7,-(sp)
    3.     tst.w    (Plc_Buffer_ROM_offset).w
    4.     bne.s    +
    5.     move.w    (Plc_Buffer).w,d4
    6.     andi.w    #$FFE0,d4
    7.     move.w    d4,(Plc_Buffer_VRAM_offset).w
    8. +
    9.     move.l    (Plc_Buffer).w,d7
    10.     lsl.l    #1,d7
    11.     andi.l    #$3FFFFF,d7
    12.     moveq    #0,d4
    13.     move.w    (Plc_Buffer_ROM_offset).w,d4
    14.     add.l    d4,d7
    15.     movea.l    d7,a0
    16.     movem.l    (sp)+,d4/d7
    This reads the data from RAM, splits it into a readable form, and adds the offset values. Likewise, replace the 'move.l a0,(Plc_Buffer).w' with this:

    Code (ASM):
    1.     ; BOOM code block 2
    2.     movem.l    d4/d7,-(sp)
    3.     move.l    a0,d7
    4.     move.l    (Plc_Buffer).w,d4
    5.     lsl.l    #1,d4
    6.     andi.l    #$3FFFFF,d4
    7.     sub.l    d4,d7
    8.     move.w    d7,(Plc_Buffer_ROM_offset).w
    9.     movem.l    (sp)+,d4/d7
    This updates the ROM offset value. You can see how all of this is more complicated than simply keeping the data in RAM raw. Now for ProcessDPLC. Alter the writes to 'Plc_Buffer+4' to use 'Plc_Buffer_VRAM_offset' instead. The same applies for ProcessDPLC2. This updates the VRAM offset value.

    ProcessDPLC_Main. Replace the 'movea.l (Plc_Buffer).w,a0' with this...

    Code (ASM):
    1.     ; BOOM code block 1
    2.     movem.l    d4/d7,-(sp)
    3.     move.l    (Plc_Buffer).w,d7
    4.     lsl.l    #1,d7
    5.     andi.l    #$3FFFFF,d7
    6.     moveq    #0,d4
    7.     move.w    (Plc_Buffer_ROM_offset).w,d4
    8.     add.l    d4,d7
    9.     movea.l    d7,a0
    10.     movem.l    (sp)+,d4/d7
    ...and replace the 'move.l a0,(Plc_Buffer).w' with this:

    Code (ASM):
    1.     ; BOOM code block 2
    2.     movem.l    d4/d7,-(sp)
    3.     move.l    a0,d7
    4.     move.l    (Plc_Buffer).w,d4
    5.     lsl.l    #1,d4
    6.     andi.l    #$3FFFFF,d4
    7.     sub.l    d4,d7
    8.     move.w    d7,(Plc_Buffer_ROM_offset).w
    9.     movem.l    (sp)+,d4/d7
    ProcessDPLC_Pop. Revert the moveq to use bytesToLcnt, as before Vladikcomper's guide, and change it and the 'lea 6(a0),a1' to use '4' instead of '6'. Then, remove both 'move.w's. Again, this reflects the slot size. It would also be optimal to remove the 'moveq #0,d0', and change the 'move.l d0,(a0)+' to a 'clr.l (a0)', because optimisation. Then, place a 'clr.w (Plc_Buffer_ROM_offset).w' before the 'rts'.

    Done

    So, now we've increased the total number of slots from 16 to 24. Not too shabby. Though, this comes at the cost of placement of the art.


    Part 1.1: Making BOOM better

    This won't do anything to how many slots our $60 bytes gives us, but it's worth going over one of BOOM's shortcomings: a ROM space improvement, and slight speed optimisation:

    Why does BOOM leave the data uncompacted in ROM, but compact it when loaded into RAM? We could save ROM space by compacting the data in-ROM, and we can reduce cycle-consumption by making LoadPLC not have to compact the data repeatedly.

    Find the plreq macro, and change it to this:

    Code (ASM):
    1. plreq macro toVRAMaddr,fromROMaddr
    2.     dc.l    (tiles_to_bytes(toVRAMaddr)<<16)|(fromROMaddr>>1)
    3.     endm
    This will compact the data. Also, change plrlistheader to this:

    Code (ASM):
    1. plrlistheader macro {INTLABEL}
    2. __LABEL__ label *
    3.     dc.w ((__LABEL___End - __LABEL__ - (2 + 4)) / 4)
    4.     endm
    This will account for the new size of each entry. Now, go to LoadPLC, and replace the block of code we added from BOOM with this:

    Code (ASM):
    1.     ; Begin loading data
    2. -    move.l    (a1)+,(a2)+        ; ROM/VRAM address
    3.     dbf    d0,-                ; Loop
    Same goes for LoadPLC2.

    Because of our changes, we'll need to edit RunPLC_ROM, so it can read the compact data. Remove the 'movea.l (a1)+,a0', remove the '+' from 'move.w (a1)+,d0', and add a 'andi.w #$FFE0,d0' after that instruction. This will make it correctly load the VRAM address. Next, for the ROM address, add this above the 'bsr.w NemDec':

    Code (ASM):
    1.     ; Begin loading data
    2.     move.l    (a1)+,d0
    3.     lsl.l    #1,d0
    4.     andi.l    #$3FFFFF,d0
    5.     movea.l    d0,a0
    Done

    Yay, optimisation. But, yeah, 24 slots. That sound like enough? No? What's wrong with you?

    But, first, mention should go to another improvement: unspliting the offsets. For VRAM, this is simple, but for ROM, this would require modifying the compact data format to save the ROM address' low bit. This would further restrict the location of the art to the first 512KB, but would allow for the offset data to be safely stored back into the queue. We will not be doing this, however, to allow for another change, one that really will require the offsets be split.


    Part 2: Queue of offsets

    To add to the restriction-fest, we can reduce the size of the slots to just 2 bytes! In order to do this, instead of copying the data inside the PLR lists to RAM, we store a relative pointer to each entry in the list in ROM in the queue. Of course, this adds the restriction that all PLR lists be within so many bytes of a certain location. I think the limit is around $8000 bytes from the offset label, because sign extension scrambles higher addresses, and a word-sized relative address can't handle >$FFFF.

    First up, above your first PLR list (usually PlrList_Std1), add the label 'PlrList_Start'. We will use this to calculate the relative offset of each list's entry.

    In case we haven't beaten the thing enough, go to LoadPLC, and replace the block of code from BOOM with this:

    Code (ASM):
    1.     move.l    a1,d1
    2.     subi.l    #PlrList_Start,d1
    3. -    move.w    d1,(a2)+
    4.     addq.w    #4,d1
    5.     dbf    d0,-                ; Loop
    Do the same with LoadPLC2. Also, back at LoadPLC, change the 'addq.w #4,a2' to add 2 instead of 4, and change the 'tst.l' above it to 'tst.w'.

    Now for RunPLC_RAM. Again, change the 'tst.l (Plc_Buffer).w' to tst.w. Then, change the first block of code added from BOOM to this:

    Code (ASM):
    1.     movem.l    d4/d7,-(sp)
    2.     movea.w    (Plc_Buffer).w,a0
    3.     adda.l    #PlrList_Start,a0
    4.     tst.w    (Plc_Buffer_ROM_offset).w
    5.     bne.s    +
    6.     move.w    (a0),d4
    7.     andi.w    #$FFE0,d4
    8.     move.w    d4,(Plc_Buffer_VRAM_offset).w
    9. +
    10.     move.l    (a0),d7
    11.     lsl.l    #1,d7
    12.     andi.l    #$3FFFFF,d7
    13.     moveq    #0,d4
    14.     move.w    (Plc_Buffer_ROM_offset).w,d4
    15.     add.l    d4,d7
    16.     movea.l    d7,a0
    17.     movem.l    (sp)+,d4/d7
    Then, replace the second block with this:

    Code (ASM):
    1.     movem.l    d4/d7,-(sp)
    2.     move.l    a0,d7
    3.     movea.w    (Plc_Buffer).w,a0
    4.     adda.l    #PlrList_Start,a0
    5.     move.l    (a0),d4
    6.     lsl.l    #1,d4
    7.     andi.l    #$3FFFFF,d4
    8.     sub.l    d4,d7
    9.     move.w    d7,(Plc_Buffer_ROM_offset).w
    10.     movem.l    (sp)+,d4/d7
    As for ProcessDPLC_Main, replace the first block from BOOM with this...

    Code (ASM):
    1.     movem.l    d4/d7,-(sp)
    2.     movea.w    (Plc_Buffer).w,a0
    3.     adda.l    #PlrList_Start,a0
    4.     move.l    (a0),d7
    5.     lsl.l    #1,d7
    6.     andi.l    #$3FFFFF,d7
    7.     moveq    #0,d4
    8.     move.w    (Plc_Buffer_ROM_offset).w,d4
    9.     add.l    d4,d7
    10.     movea.l    d7,a0
    11.     movem.l    (sp)+,d4/d7
    ...and as for the second block:

    Code (ASM):
    1.     movem.l    d4/d7/a0,-(sp)
    2.     move.l    a0,d7
    3.     movea.w    (Plc_Buffer).w,a0
    4.     adda.l    #PlrList_Start,a0
    5.     move.l    (a0),d4
    6.     lsl.l    #1,d4
    7.     andi.l    #$3FFFFF,d4
    8.     sub.l    d4,d7
    9.     move.w    d7,(Plc_Buffer_ROM_offset).w
    10.     movem.l    (sp)+,d4/d7/a0
    ProcessDPLC_Pop. Change the usage of '4' to '2', and change the bytesToLcnt to bytesToWcnt, then change the 'move.l' and 'clr.l' to 'move.w' and 'clr.w', respectively.

    Next up, because the slot size is now <4, we should update some checks to account for it. Search for all instances of 'tst.l (Plc_Buffer).w' and change them to use tst.w.

    Done

    Note that this doesn't totally negate BOOM's improvement, since we need the PLR list data to be in as small a place as possible in ROM, and BOOM's compact format helps with that a fair bit.

    Anyway, with this done, the maximum slots made available by a queue of $60 bytes is 48(!). Should we even bother making it greater?

    Sure, why not?


    Part 3: Queue of list IDs

    One of the biggest flaws with the PLC queue is that is doesn't seem to have been built with the PLR lists in mind. In fact, the PLR lists seem to have been an afterthought. We've gone through many forms of PLC Queue slot, from raw ROM/VRAM address to compacted data to word-sized relative ROM addresses, but were the queue designed with PLR lists in mind, it would no doubt queue, not the individual entries of the PLR list, but the ID of the list itself, and simply store a bookmark of which entry it's processing.

    Doing this will require two more bytes of RAM. Go and add two more RAM variables: Plc_Buffer_Bookmark and Plc_Buffer_Cap, each a byte in size.

    We'll also take some steps to make the code a little more readable. Unfortunately, this section will be a little copy/paste-ish; I've had to modify the code in such weird and hackish ways, it's way too much of a hassle to explain well.

    Anyway, above LoadPLC, add these subroutines:

    Code (ASM):
    1. ; Gets address of PLR list
    2. ;  Input:
    3. ;   d0 - PLR list ID
    4. ;  Output
    5. ;   a1 - Address of PLR list
    6. GetPLRListAddr:
    7.     lea    (ArtLoadCues).l,a1
    8.     add.w    d0,d0
    9.     subq.w    #2,d0    ; Make zero-based
    10.     move.w    (a1,d0.w),d0
    11.     lea    (a1,d0.w),a1
    12.     rts
    13.  
    14. ; Gets address of current PLR list entry
    15. ;  Output
    16. ;   a1 - Address of PLR
    17. GetPLRAddr:
    18.     move.l    d0,-(sp)
    19.     moveq    #0,d0
    20.     move.b    (Plc_Buffer).w,d0
    21.     bsr.w    GetPLRListAddr
    22.     addq.w    #2,a1
    23.     moveq    #0,d0
    24.     move.b    (Plc_Buffer_Bookmark).w,d0
    25.     lsl.w    #2,d0
    26.     adda.w    d0,a1
    27.     move.l    (sp)+,d0
    28.     rts
    Now, replace LoadPLC with this:

    Code (ASM):
    1. LoadPLC:
    2.     movem.l    d1-d2/a1-a2,-(sp)
    3.     move.w    d0,d2
    4.     bsr.s    GetPLRListAddr
    5.     lea    (Plc_Buffer).w,a2
    6.  
    7.     ; If the first slot is empty, we'll want to reset Plc_Buffer_Cap and Plc_Buffer_Bookmark
    8.     tst.b    (a2)
    9.     beq.s    + ; if it's zero, exit this loop
    10.  
    11.     ; Look for the first available PLC entry
    12. -    addq.w    #1,a2
    13.     tst.b    (a2)
    14.     beq.s    ++ ; if it's zero, exit this loop
    15.     bra.s    -
    16. +
    17.     clr.b    (Plc_Buffer_Bookmark).w    ; Read from first entry in PLR list
    18.     move.b    1(a1),d1
    19.     bmi.s    ++
    20.     move.b    d1,(Plc_Buffer_Cap).w    ; Store number of entries in PLR list
    21. +
    22.     move.b    d2,(a2)
    23. +
    24.     movem.l    (sp)+,d1-d2/a1-a2 ; a1=object
    25.     rts
    And replace LoadPLC2 with this:

    Code (ASM):
    1. LoadPLC2:
    2.     movem.l    d1-d2/a1-a2,-(sp)
    3.     move.w    d0,d2
    4.     bsr.w    GetPLRListAddr
    5.     bsr.s    ClearPLC
    6.     clr.b    (Plc_Buffer_Bookmark).w    ; Read from first entry in PLR list
    7.     move.b    1(a1),(Plc_Buffer_Cap).w    ; Store number of entries in PLR list
    8.     move.b    d2,(Plc_Buffer).w
    9.     movem.l    (sp)+,d1-d2/a1-a2 ; a1=object
    10.     rts
    As for ClearPLC, let's modify that so it clears oddly-sized queues properly. Change the moveq to 'moveq #Plc_Buffer_End-Plc_Buffer-1,d0', and change the clr.l to clr.b.

    Next up, RunPLC_RAM. Change the 'tst.w (Plc_Buffer).w' to tst.b, and replace the first block of code from BOOM with this:

    Code (ASM):
    1.     movem.l    d0-d1/a1,-(sp)
    2.     bsr.w    GetPLRAddr
    3.     tst.w    (Plc_Buffer_ROM_offset).w
    4.     bne.s    +
    5.     move.w    (a1),d0
    6.     andi.w    #$FFE0,d0
    7.     move.w    d0,(Plc_Buffer_VRAM_offset).w
    8. +
    9.     move.l    (a1),d1
    10.     lsl.l    #1,d1
    11.     andi.l    #$3FFFFF,d1
    12.     moveq    #0,d0
    13.     move.w    (Plc_Buffer_ROM_offset).w,d0
    14.     add.l    d0,d1
    15.     movea.l    d1,a0
    16.     movem.l    (sp)+,d0-d1/a1
    Replace the second block with this:

    Code (ASM):
    1.     movem.l    d0-d1/a1,-(sp)
    2.     move.l    a0,d0
    3.     bsr.w    GetPLRAddr
    4.     move.l    (a1),d1
    5.     lsl.l    #1,d1
    6.     andi.l    #$3FFFFF,d1
    7.     sub.l    d1,d0
    8.     move.w    d0,(Plc_Buffer_ROM_offset).w
    9.     movem.l    (sp)+,d0-d1/a1
    Now for ProcessDPLC_Main. Replace the first block of code from BOOM with this:

    Code (ASM):
    1.     movem.l    d0-d1/a1,-(sp)
    2.     bsr.w    GetPLRAddr
    3.     move.l    (a1),d1
    4.     lsl.l    #1,d1
    5.     andi.l    #$3FFFFF,d1
    6.     moveq    #0,d0
    7.     move.w    (Plc_Buffer_ROM_offset).w,d0
    8.     add.l    d0,d1
    9.     movea.l    d1,a0
    10.     movem.l    (sp)+,d0-d1/a1
    And as for the second block:

    Code (ASM):
    1.     movem.l    d0-d1/a1,-(sp)
    2.     move.l    a0,d0
    3.     bsr.w    GetPLRAddr
    4.     move.l    (a1),d1
    5.     lsl.l    #1,d1
    6.     andi.l    #$3FFFFF,d1
    7.     sub.l    d1,d0
    8.     move.w    d0,(Plc_Buffer_ROM_offset).w
    9.     movem.l    (sp)+,d0-d1/a1
    ProcessDPLC_Pop. Replace it with this:

    Code (ASM):
    1. ProcessDPLC_Pop:
    2.     clr.w    (Plc_Buffer_ROM_offset).w
    3.     move.b    (Plc_Buffer_Bookmark).w,d0
    4.     cmp.b    (Plc_Buffer_Cap).w,d0        ; Has the whole PLR list been processed?
    5.     beq.s    +                ; If so, move on to next list
    6.     addq.b    #1,(Plc_Buffer_Bookmark).w    ; If not, advance to next item in list
    7.     rts
    8. +
    9.     lea    (Plc_Buffer).w,a0
    10.     lea    1(a0),a1
    11.     moveq    #(Plc_Buffer_Only_End-Plc_Buffer-1)-1,d0
    12.  
    13. -    move.b    (a1)+,(a0)+
    14.     dbf    d0,-
    15.  
    16.     clr.b    (a0)+
    17.  
    18.     moveq    #0,d0
    19.     move.b    (Plc_Buffer).w,d0
    20.     bsr.w    GetPLRListAddr
    21.     clr.b    (Plc_Buffer_Bookmark).w    ; Read from first entry in PLR list
    22.     move.b    1(a1),(Plc_Buffer_Cap).w    ; Store number of entries in PLR list
    23.     rts
    RunPLC_ROM isn't so bad: just replace this...

    Code (ASM):
    1.     lea    (ArtLoadCues).l,a1
    2.     add.w    d0,d0
    3.     move.w    (a1,d0.w),d0
    4.     lea    (a1,d0.w),a1
    ...with this:

    Code (ASM):
    1.     bsr.w    GetPLRListAddr
    Now, like before, change all 'tst.w (Plc_Buffer).w's to tst.b.

    Finally, open s2.constants.asm, and find 'PLC IDs'. There, set idstart to 1. This is so there is no PLR list whose ID is 0, since the code recognises a slot with 0 in it as empty.

    (Also, PlrList_Start can be removed)

    Done

    So, now, the queue supports $60 (96) lists being queued, each with their own collection of PLRs. How's that?

    Well, I think this guide's gone on long enough. If you want to try your hand at something, it might be worth undoing the BOOM-formatted PLR data, to remove the ROM limitation. Anyway, you can now go mad with the PLR lists, and can freely strip down Plc_buffer for some more RAM. Of course, the obvious cost of all this is speed, but it's negligible.
     
    Last edited by a moderator: Jul 12, 2019