Given a Spectrum snapshot, is it possible to determine which areas are code?

A typical approach would be to modify an emulator to record the location of every instruction executed, and let the snapshot run normally for a while. This gives a guarantee about marked locations, but it’s limited to code that is actually run during the analysis. For a complete picture you need to follow every code path, which would be a challenge even if you made the effort to use every feature and play through every possible outcome of a game. For bulk-processing snapshots it’s even more difficult.

I’ve long wondered whether spidering code was a realistic option. Given a starting point, you recursively follow every possible path, stopping only when you reach a dead-end (RET) or a previously visited location. It felt pretty straightforward so I knocked up a quick test program to try it.

The initial approach was simple: follow JP instructions, recursively process CALLs and conditional jumps, stop at RETs, and blindly skip over anything else. Skipping instructions requires some opcode filtering, to identify multi-byte instructions with operands and those with CB/ED/DD/FD prefixes. Indexed instructions also need a little extra attention to skip a possible index offset, which isn’t present in the HL version of the same instruction. Despite all that, a while loop, small switch statement, and index flag were enough to cover it.
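To give a feel for the opcode filtering, here’s a minimal Python sketch of an instruction-length function. It isn’t taken from the project source: the table names and `insn_length` are made up for illustration, and it ignores address wrap-around at the top of memory.

```python
# Illustrative sketch only; the table and function names are assumptions.

# Unprefixed opcodes followed by a 16-bit operand.
IMM16 = {0x01, 0x11, 0x21, 0x31,              # LD rr,nn
         0x22, 0x2A, 0x32, 0x3A,              # LD (nn),HL / HL,(nn) / (nn),A / A,(nn)
         0xC3, 0xCD}                          # JP nn / CALL nn
IMM16 |= {0xC2 + 8 * cc for cc in range(8)}   # JP cc,nn
IMM16 |= {0xC4 + 8 * cc for cc in range(8)}   # CALL cc,nn

# Unprefixed opcodes followed by an 8-bit operand.
IMM8 = {0x10, 0x18, 0x20, 0x28, 0x30, 0x38,   # DJNZ / JR / JR cc
        0xD3, 0xDB}                           # OUT (n),A / IN A,(n)
IMM8 |= {0x06 + 8 * r for r in range(8)}      # LD r,n
IMM8 |= {0xC6 + 8 * alu for alu in range(8)}  # ADD/ADC/SUB/... A,n

# ED-prefixed LD (nn),rr and LD rr,(nn).
ED_IMM16 = {0x43, 0x4B, 0x53, 0x5B, 0x63, 0x6B, 0x73, 0x7B}


def base_length(op):
    """Length of an unprefixed instruction."""
    if op in IMM16:
        return 3
    if op in IMM8:
        return 2
    return 1


def uses_hl_indirect(op):
    """True if the opcode addresses (HL), so DD/FD adds an offset byte."""
    if op in (0x34, 0x35, 0x36):              # INC/DEC (HL), LD (HL),n
        return True
    if 0x40 <= op <= 0xBF and op != 0x76:     # 0x76 is HALT
        return (op & 7) == 6 or 0x70 <= op <= 0x77
    return False


def insn_length(mem, pc):
    """Length in bytes of the instruction at pc."""
    op = mem[pc]
    if op == 0xCB:
        return 2
    if op == 0xED:
        return 4 if mem[pc + 1] in ED_IMM16 else 2
    if op in (0xDD, 0xFD):
        nxt = mem[pc + 1]
        if nxt == 0xCB:
            return 4                          # DD CB d op
        if nxt in (0xDD, 0xED, 0xFD):
            return 1                          # stray prefix: skip it alone
        return 1 + base_length(nxt) + (1 if uses_hl_indirect(nxt) else 0)
    return base_length(op)
```

For example, `insn_length(bytes([0xDD, 0x7E, 0x05]), 0)` covers LD A,(IX+5), one byte longer than the plain LD A,(HL).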

To track previously visited locations I used a 64K array, shadowing each addressable memory location. The tracing loop marks the array at the current PC offset as it works, stopping if the entry is already marked. For completeness I also mark entries for the operands, so a normal run of code is a contiguous block. Before the program exits it converts the array to a 256x192 image to help visualise the code found, with one green pixel per instruction byte.
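The marking loop might look something like this in Python. It’s a cut-down sketch rather than the real thing: `toy_length` is a hypothetical stand-in decoder that only knows absolute jumps and calls, relative jumps are left out, and I’ve used an explicit work list in place of recursion.

```python
# Illustrative sketch; toy_length and the handled opcode set are assumptions.
RET, JP, CALL = 0xC9, 0xC3, 0xCD


def is_jp_cc(op):
    return (op & 0xC7) == 0xC2


def is_call_cc(op):
    return (op & 0xC7) == 0xC4


def toy_length(op):
    """Stand-in decoder: 3 bytes for absolute jumps/calls, 1 for the rest."""
    return 3 if op in (JP, CALL) or is_jp_cc(op) or is_call_cc(op) else 1


def trace(mem, start, visited):
    """Mark every instruction byte reachable from start in visited."""
    pending = [start]                       # entry points still to follow
    while pending:
        pc = pending.pop()
        while not visited[pc]:              # stop at already-traced code
            op = mem[pc]
            size = toy_length(op)
            for i in range(size):           # mark opcode and operand bytes
                visited[(pc + i) & 0xFFFF] = 1
            if op == RET:
                break                       # dead end
            if size == 3:
                target = mem[(pc + 1) & 0xFFFF] | (mem[(pc + 2) & 0xFFFF] << 8)
                if op == JP:
                    pc = target             # unconditional: just follow it
                    continue
                pending.append(target)      # CALL/conditional: trace both paths
            pc = (pc + size) & 0xFFFF


mem = bytearray(0x10000)
mem[0x8000:0x8006] = bytes([0xCD, 0x10, 0x80,    # CALL 8010
                            0xC3, 0x00, 0x80])   # JP 8000
mem[0x8010:0x8012] = bytes([0x00, 0xC9])         # NOP ; RET
visited = bytearray(0x10000)
trace(mem, 0x8000, visited)
print(sum(visited), "code bytes marked")
```

Because operands are marked along with opcodes, the six bytes at 8000 and the two at 8010 each show up as an unbroken run.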

The first run was spectacularly short, stopping after only a few instructions when the first RET was encountered. The problem was that the snapshot was taken at a relatively arbitrary position in the code, not at the top-level entry point. In this case we need to follow the return, taking the return address from the stack. If the parent routine also returned, we’d need to do the same thing again for the next value on the stack. That meant tracking changes to the stack pointer to ensure it was at the correct position for further returns.

Related to this, what if there was data on the stack at the time of the snapshot? The RET would pick that up instead, and it’s likely we’d start tracing a non-code location. To fix that we must process PUSH and POP instructions too, and adjust the SP value appropriately. INC SP/DEC SP also needed similar treatment. At this point I was starting to worry the code was turning into an emulator!
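As a rough illustration of the stack tracking, here’s a Python sketch where the snapshot was taken just after a PUSH BC, so SP points at data rather than a return address. The helper names are hypothetical, and only the four plain register pairs are handled (no PUSH IX/IY and no LD SP instructions).

```python
# Illustrative sketch; helper names are assumptions, not the tool's own.
PUSH = {0xC5, 0xD5, 0xE5, 0xF5}      # PUSH BC/DE/HL/AF
POP = {0xC1, 0xD1, 0xE1, 0xF1}       # POP BC/DE/HL/AF
INC_SP, DEC_SP, RET = 0x33, 0x3B, 0xC9


def sp_after(op, sp):
    """Stack pointer after executing a one-byte stack instruction."""
    if op in PUSH:
        sp -= 2
    elif op in POP:
        sp += 2
    elif op == INC_SP:
        sp += 1
    elif op == DEC_SP:
        sp -= 1
    return sp & 0xFFFF


def follow_ret(mem, sp):
    """Return (return address, new SP) for a RET at the given SP."""
    target = mem[sp] | (mem[(sp + 1) & 0xFFFF] << 8)
    return target, (sp + 2) & 0xFFFF


mem = bytearray(0x10000)
mem[0xFF00:0xFF02] = bytes([0x34, 0x12])   # saved BC: data, not an address
mem[0xFF02:0xFF04] = bytes([0x00, 0x90])   # real return address 9000
mem[0x8000:0x8002] = bytes([0xC1, 0xC9])   # POP BC ; RET

pc, sp = 0x8000, 0xFF00                    # values from the snapshot
while mem[pc] != RET:
    sp = sp_after(mem[pc], sp)
    pc += 1
target, sp = follow_ret(mem, sp)
print(hex(target))
```

Without the POP adjustment the RET would have picked up 1234 from the stack and the trace would have wandered off into whatever lives there.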

Of course, returning with data on the stack is still a valid thing to do in some cases. Unfortunately it’s a dynamic value that isn’t known to our static tracing, so the only option is to stop the trace path. Similarly, JP (HL) and friends are also treated as unknown dynamic targets. These are most likely to be used for jump-table lookups, and are probably the biggest limitation with static tracing.

This leads on to the problem of mixing code and data. RST 08 in the Spectrum ROM is followed by a single byte for an error code, which is picked up inside the routine by popping the return address, or using EX (SP),HL. Some games also use the same technique to inline data after a CALL. Under normal circumstances we trace both the called routine and the path beyond the CALL, but this would lead us to trace the trailing data, with unpredictable results. Fortunately, the stack tracking comes to the rescue here, allowing us to detect this as a stack underflow condition. We can’t determine how much data is present, but we can stop the parent trace continuing.

There’s a further complication to this issue. If a different part of the code calls the same routine, it would be skipped as a previously visited location before the stack underflow caught the data access, and would continue into the data. To fix this I added another marker array to allow called locations to be blacklisted, so future calls to the same code are stopped at the call.
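Here’s a small Python sketch of how the underflow check and blacklist could fit together. It’s heavily simplified and the names are mine: the called routine is assumed to be straight-line code made of one-byte instructions, and a RET with data still on the stack isn’t modelled.

```python
# Illustrative sketch; function and array names are assumptions.
PUSH = {0xC5, 0xD5, 0xE5, 0xF5}
POP = {0xC1, 0xD1, 0xE1, 0xF1}
RET = 0xC9


def call_returns_normally(mem, target, visited, blacklist):
    """Trace a called routine; False means the caller must stop at the CALL."""
    if blacklist[target]:
        return False                   # known to read data after the CALL
    depth = 0                          # bytes pushed since routine entry
    pc = target
    while not visited[pc]:
        visited[pc] = 1
        op = mem[pc]
        if op in PUSH:
            depth += 2
        elif op in POP:
            if depth == 0:             # underflow: popping the return address
                blacklist[target] = 1
                return False
            depth -= 2
        elif op == RET:
            return True
        pc = (pc + 1) & 0xFFFF         # demo: one-byte instructions only
    return True                        # already traced, nothing new to learn


mem = bytearray(0x10000)
mem[0x9000:0x9005] = bytes([0xE1, 0x7E, 0x23, 0xE5, 0xC9])  # POP HL ; LD A,(HL) ; INC HL ; PUSH HL ; RET
mem[0x9100:0x9103] = bytes([0xC5, 0xC1, 0xC9])              # PUSH BC ; POP BC ; RET
visited = bytearray(0x10000)
blacklist = bytearray(0x10000)

first = call_returns_normally(mem, 0x9000, visited, blacklist)
second = call_returns_normally(mem, 0x9000, visited, blacklist)
normal = call_returns_normally(mem, 0x9100, visited, blacklist)
print(first, second, normal)
```

The second call to 9000 is the interesting one: the routine is already marked as visited, so without the blacklist check at the top it would fall straight through to the final `return True` and the caller would carry on into the inline data.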

Also connected to this is the technique used to discard return addresses from within a routine, usually by simply popping them off the stack. This also looks like an attempt to access data on the stack, and triggers call blacklisting. As a work-around I attempt to recognise code signatures for data access, including POP ss ; LD r,(ss). This is likely to need further improvement as other examples are found.
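A signature check along those lines could be sketched like this in Python. The function name is made up, and the set of patterns is my guess at the obvious cases (POP HL with any LD r,(HL), plus POP DE/BC with LD A,(DE)/(BC)) rather than a list of what the tool actually recognises.

```python
# Illustrative sketch; the name and the exact pattern set are assumptions.
def pops_inline_data(mem, pc):
    """True if pc starts POP ss followed by a load through that register pair."""
    op = mem[pc]
    nxt = mem[(pc + 1) & 0xFFFF]
    if op == 0xE1:                                    # POP HL
        return (nxt & 0xC7) == 0x46 and nxt != 0x76   # LD r,(HL), not HALT
    if op == 0xD1:                                    # POP DE
        return nxt == 0x1A                            # LD A,(DE)
    if op == 0xC1:                                    # POP BC
        return nxt == 0x0A                            # LD A,(BC)
    return False
```

A bare POP HL followed by anything else would then be treated as discarding the return address, and wouldn’t trigger blacklisting.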

To help detect tracing escaping genuine code I’ve added some diagnostic tests. If tracing encounters a block of 4 or more NOPs in RAM, it suspects the trace has run into open memory. If an inert LD r,r instruction is encountered, it suspects unwanted data tracing. In both cases it displays a warning message and stops the current trace thread, so they can be investigated.
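Both checks are cheap to express. Here’s a Python sketch, assuming RAM starts at 4000h as on a 48K machine; the function name and warning text are invented for the example.

```python
# Illustrative sketch; names and messages are assumptions.
RAM_START = 0x4000
NOP_RUN = 4
# LD B,B / C,C / D,D / E,E / H,H / L,L / A,A (0x76 would be HALT, so it's absent)
INERT_LD = {0x40, 0x49, 0x52, 0x5B, 0x64, 0x6D, 0x7F}


def trace_warning(mem, pc):
    """Return a warning string if pc looks like non-code, else None."""
    if mem[pc] in INERT_LD:
        return "inert LD r,r: possibly tracing data"
    if pc >= RAM_START and all(mem[(pc + i) & 0xFFFF] == 0x00
                               for i in range(NOP_RUN)):
        return "run of NOPs: possibly open memory"
    return None
```

The tracer would call this before processing each instruction and abandon the current path on anything other than `None`.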

Sometimes even the PC value in the snapshot isn’t enough as a starting point. One of the earlier snapshots I tried was Manic Miner, which sits at a PAUSE 0 after loading, requiring a key press to start the game. This is a problem because there’s no code-only path into the game, due to the start location being encoded as data in the BASIC statements. If the PC trace gives no result in RAM, I fall back to looking up the PROG system variable, and if it’s pointing in roughly the correct location I scan the BASIC listing for USR statements. Whenever a USR n or a USR VAL n$ is found the address is traced as a new entry point. Code traced from a USR statement is shown in red, with overlapping locations using additive colour mixing to give yellow.
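For the curious, a BASIC scan of this kind could be sketched in Python as below. It walks the program line by line between PROG and VARS, skips the hidden 5-byte number that follows each numeric literal, and picks up the digits after a USR token, with or without VAL and an opening quote. The function names and the lower bound used to sanity-check PROG are my own assumptions.

```python
# Illustrative sketch; function names and the PROG range check are assumptions.
PROG, VARS = 0x5C53, 0x5C4B          # system variable addresses
TOK_USR, TOK_VAL = 0xC0, 0xB0        # BASIC keyword tokens
NUM_MARK = 0x0E                      # precedes a hidden 5-byte number


def peek16(mem, addr):
    return mem[addr] | (mem[addr + 1] << 8)


def skip_spaces(mem, i, end):
    while i < end and mem[i] == 0x20:
        i += 1
    return i


def find_usr_targets(mem):
    """Return addresses found after USR tokens in the BASIC program."""
    pos, end = peek16(mem, PROG), peek16(mem, VARS)
    if not 0x5CCB <= pos < end:      # assumed check that PROG looks sane
        return []
    targets = []
    while pos + 4 <= end:
        line_end = min(pos + 4 + peek16(mem, pos + 2), end)
        i = pos + 4                  # skip line number and length words
        while i < line_end:
            if mem[i] == NUM_MARK:
                i += 6               # marker plus 5-byte number
            elif mem[i] == TOK_USR:
                i = skip_spaces(mem, i + 1, line_end)
                if i < line_end and mem[i] == TOK_VAL:
                    i = skip_spaces(mem, i + 1, line_end)
                    if i < line_end and mem[i] == 0x22:   # opening quote
                        i += 1
                value, digits = 0, 0
                while i < line_end and 0x30 <= mem[i] <= 0x39:
                    value = value * 10 + (mem[i] - 0x30)
                    digits += 1
                    i += 1
                if digits and value <= 0xFFFF:
                    targets.append(value)
            else:
                i += 1
        pos = line_end
    return targets


# 10 RANDOMIZE USR 32768 / 20 PRINT USR VAL "40000"
line10 = bytes([0xF9, 0xC0]) + b"32768" + bytes([0x0E, 0, 0, 0x00, 0x80, 0, 0x0D])
line20 = bytes([0xF5, 0xC0, 0xB0]) + b'"40000"' + bytes([0x0D])
program = (bytes([0, 10, len(line10), 0]) + line10 +
           bytes([0, 20, len(line20), 0]) + line20)
mem = bytearray(0x10000)
mem[0x5CCB:0x5CCB + len(program)] = program
mem[PROG:PROG + 2] = (0x5CCB).to_bytes(2, "little")
mem[VARS:VARS + 2] = (0x5CCB + len(program)).to_bytes(2, "little")
print(find_usr_targets(mem))
```

A USR followed by a variable or an expression yields no digits and is simply skipped, since its value can’t be known statically.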

Another entry point option is the interrupt handler. The IM 1 handler is of little interest as it’s completely self-contained, but if the snapshot indicates IM 2 we can determine the handler address and call it as another possible entry point. Code traced from the IM 2 handler is shown in blue, with colour mixing as before.
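Finding the IM 2 handler is just a table lookup. In this Python sketch the low byte of the vector is assumed to be FFh, the value usually seen on the bus of an unexpanded Spectrum; the function name and `bus` parameter are hypothetical.

```python
# Illustrative sketch; the bus value of 0xFF is an assumption.
def im2_handler(mem, i_reg, bus=0xFF):
    """Read the IM 2 handler address from the vector table at I*256 + bus."""
    vector = ((i_reg & 0xFF) << 8) | (bus & 0xFF)
    return mem[vector] | (mem[(vector + 1) & 0xFFFF] << 8)


mem = bytearray(0x10000)
mem[0xFEFF] = 0x5D               # low byte of handler address
mem[0xFF00] = 0x5D               # high byte of handler address
print(hex(im2_handler(mem, 0xFE)))
```

Many games fill a 257-byte table with a single repeated value so the handler address is the same whatever appears on the bus, in which case the assumed value makes no difference.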

Static tracing is only suitable for 48K snapshots, as paging may require dynamically calculated values to identify both ports and pages. In many cases 128K games only use the extra memory for music or level data, so I don’t see it as a huge problem. A bigger issue is that it requires snapshots, and most archived software is preserved in tape format. However, generating snapshots from tape images is a trivial task for a modified emulator.

Both static and dynamic tracing have their uses, and a combination of both will be the killer solution. Perhaps dynamic tracing to feel the bones of a program, and static tracing to flesh out the bits that can’t be reached? In some cases dynamic tracing may even be able to determine how to reach areas that weren’t visited on a first run, particularly if decisions are based on keyboard input. It could also help solve the ambiguous cases where static tracing isn’t sure what the code is doing.

Here’s the output from tracing a snapshot of a freshly loaded Dynamite Dan II:

dd2.szx: PC=03D6 SP=5E05 I=3F IM=1
8155: return address popped
714D: return address popped
Traced 7195 code bytes in RAM.

And the trace image it produced:

[Image: Memory Map]

If you’d like to take a closer look, and maybe even improve on what I’ve done so far, the source code is now available on GitHub.