How to travel in time and save hours of Windows debugging

It is time for another entry in our “Behind the Scenes” series. The previous two interviews were focused on doing tricky things on Linux. For a change, we’ll redirect our attention to the unusual stuff that Zoey has been doing on Windows today! As always in this series, this is a conversation about niche topics between developers – here, we assume that readers know what using a “debugger” is like.

Pascal Hertleif

Pascal: Hi Zoey! I hear you’ve been doing some interesting debugging lately. Before we get into that: What are you working on?

Zoey: I’m working on Divvun, a project that adds support for various languages, primarily the Sami languages. Right now I’m focusing on supporting non-standard keyboard layouts on Windows.

Pascal: That sounds super interesting! How does one add “language support” to an operating system?

Zoey: Good question! It’s very dependent on the platform you are adding language support to. On Windows it consists of reverse engineering several undocumented APIs because Microsoft’s official APIs don’t work for many of the languages we work with. For example, we often work with languages that do not have official LCIDs. If there is no official LCID you need to use the undocumented APIs that Windows uses internally.

Pascal: Off to a good start! I can’t imagine this going wrong. But… uh… let’s assume it does go wrong. What does that look like? I’ve never had to deal with any Windows API myself (the last time I programmed on Windows I was – of course – writing Pascal in high school). Where do you even start?

Zoey Riordan

Zoey: The first thing you do is google to see if someone else has already reverse-engineered this API before you. Why do the work when someone else already has? Sadly, most of the time no one else has done this (publicly at least). So when that inevitably fails you, you gotta break out the debuggers.

Pascal: I can only assume that there is no GDB and trapping on syscalls? What do you do on Windows?

Zoey: Debugging on Windows is primarily done with either Visual Studio (proper, not Code) or WinDbg. Visual Studio is useful for basic debugging operations, but if you’re really in the weeds you need the more powerful WinDbg.

Pascal: (I had to look that tool up.) So what does your workflow look like?

Zoey: So the first thing to do is figure out how to reproduce the problem really quickly/simply. Doing that makes it much easier to isolate the problem. Once I have a reproduction I open up the executable in Visual Studio, set a breakpoint around where the bug is happening, and hit Run. Visual Studio is great at working with any executable so long as you have the PDB (debugging symbols) file.

Illustration of debugging timeline, going from broken to fixed via SpongeBob meme image of '72 hours later'
Regular debugging: Very slow

Pascal: I can only assume from that phrasing that this wasn’t enough this time…

Zoey: Unfortunately yes, you are right. The issue I was debugging this time was intermittent and not easily isolated. Even worse, the bug was occurring deep inside optimized Windows binaries that there is no public source code for.

Pascal: I’m imagining a wall of assembly that shows a state that is already broken.

Zoey: Well it's not actually broken yet, since I was able to set the breakpoint before everything went sideways. But you are right that it is a giant wall of assembly. Once I’m staring at a giant wall of assembly I step through line-by-line until I see something that is definitely wrong. The problem is, once it’s in the bad state, I need to figure out how it got there. And doing so the conventional way would require restarting this debugging session multiple times and tracing my investigation backwards.

Pascal: Is this where we leave computer science and go into breaking special relativity?

Zoey: Yes. To debug this in any reasonable timeframe I had to break out the time travelling debugger. Time travel debugging is a nifty feature that lets you step not only forward through a program’s execution, but also backwards. It does this by recording tons of information about the state of the program for its entire lifetime. (And by tons I mean tons: about 4 minutes of program time came to 4GB of debug files.)

Pascal: I have lots of questions. So you can… reverse operations? And branch off a new timeline of execution flow? Is there a multiverse of keyboards on your computer?

Zoey: You can reverse operations, but you aren’t able to mutate state. We have to preserve causality after all! Messing with the past would have untold consequences on the present.

Pascal: Of course. Okay, what does your new/old workflow look like?

Zoey: So in the case of this bug I took two time travel traces. Once when the program worked flawlessly, and another time when the program failed. I then opened up both traces side by side and stepped through them simultaneously. I kept stepping until I found a place where the programs diverged. From there I assessed what was different about the computer's state (registers, memory, etc) and worked backwards to determine how that state had become different.
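The side-by-side comparison Zoey describes can be pictured with a toy model. The sketch below is only an illustration, not how WinDbg stores or compares traces: `Step`, its fields, and `first_divergence` are hypothetical names, and each trace is assumed to be a simple list of recorded steps.

```rust
/// Hypothetical snapshot of one recorded step: where the program was
/// and what a single register held. Real traces carry far more state.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Step {
    address: u64,
    rax: u64,
}

/// Returns the index of the first step at which the two traces differ,
/// or `None` if they are identical. If one trace is a strict prefix of
/// the other, the divergence is the length of the shorter trace.
fn first_divergence(good: &[Step], bad: &[Step]) -> Option<usize> {
    good.iter()
        .zip(bad.iter())
        .position(|(g, b)| g != b)
        .or_else(|| (good.len() != bad.len()).then(|| good.len().min(bad.len())))
}

fn main() {
    let good = vec![
        Step { address: 0x1000, rax: 0 },
        Step { address: 0x1004, rax: 1 },
        Step { address: 0x1008, rax: 2 },
    ];
    // Same path, but a register holds a different value at the last step.
    let mut bad = good.clone();
    bad[2].rax = 99;

    match first_divergence(&good, &bad) {
        Some(i) => println!("traces diverge at step {i}: {:?} vs {:?}", good[i], bad[i]),
        None => println!("traces are identical"),
    }
}
```

Once the divergence index is known, the interesting work is the part a debugger does for you: walking backwards from that step in the failing trace to see where the differing value came from.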

Funny illustration of time travel debugging with a warped timeline axis
Time travel debugging: Not very slow

Pascal: Damn, that still sounds like a lot of work, but at least now you have two assembly walls instead of one! So why did the state become different in the bad run?

Zoey: Well as I traced backwards I noticed that the issue was coming from some string handling functions. I decided to take a look at the strings in memory and noticed that in the bad run the string had been completely corrupted. So instead of the string containing something like “en-US” it contained a “鈠민Ǎ”.

Pascal: Very suspicious! What ended up being the fix?

Zoey: Adding “ref” to a match statement to prevent a use-after-free. People can find the end result on GitHub.

Pascal: So this was memory unsafety? But isn’t kbdi written in Rust?

Zoey: Yes, but when talking to the Windows API you have to introduce some unsafe code to marshal data into a form that Windows is able to understand. In this case it was creating a Windows HSTRING from a Rust String.
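To see why a single `ref` can matter, here is a minimal sketch of the general pattern. It is not the actual kbdi code and does not touch any Windows API: `TrackedString` is a hypothetical stand-in for an owned string buffer that records when it is freed, and the raw pointer is never dereferenced. Matching by value moves the string into the match arm, so it is dropped when the arm ends and any raw pointer taken from it dangles. Matching with `ref` only borrows it, so the owner keeps the buffer alive.

```rust
use std::cell::Cell;

/// Hypothetical owned string that flags when its buffer is freed.
struct TrackedString<'a> {
    text: String,
    dropped: &'a Cell<bool>,
}

impl Drop for TrackedString<'_> {
    fn drop(&mut self) {
        self.dropped.set(true);
    }
}

/// Matches by value: `s` takes ownership and is dropped at the end of
/// the arm, so the pointer is already dangling once the match finishes.
fn freed_after_match_by_value() -> bool {
    let dropped = Cell::new(false);
    let value = Some(TrackedString { text: String::from("en-US"), dropped: &dropped });
    let _ptr: *const u8 = match value {
        Some(s) => s.text.as_ptr(),
        None => std::ptr::null(),
    };
    // Handing `_ptr` to foreign code here would be a use-after-free.
    dropped.get()
}

/// Matches with `ref`: `s` is only a borrow, so `value` still owns the
/// buffer and the pointer stays valid for as long as `value` lives.
fn freed_after_match_by_ref() -> bool {
    let dropped = Cell::new(false);
    let value = Some(TrackedString { text: String::from("en-US"), dropped: &dropped });
    let _ptr: *const u8 = match value {
        Some(ref s) => s.text.as_ptr(),
        None => std::ptr::null(),
    };
    let freed = dropped.get();
    freed
}

fn main() {
    println!("freed after match by value: {}", freed_after_match_by_value());
    println!("freed after match with ref: {}", freed_after_match_by_ref());
}
```

Safe Rust never notices the difference because a raw pointer carries no lifetime. The problem only surfaces once unsafe code, such as a call into a system API, reads through the pointer.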

Pascal: I’m looking forward to not programming on Windows in the future, too. Can I do something similar on other OSes?

Zoey: Time travel debugging does exist on other platforms, but I’m not as experienced with how to do it on anything other than Windows. To paraphrase Todd Howard, WinDbg just works.

Pascal: Awesome! Thanks so much for your time!


Related article

- Check out this interview with Sjur Moshagen of Divvun on how open-source language tools are helping to keep Sámi language alive.

*Top photo by 𝓴𝓘𝓡𝓚 𝕝𝔸𝕀 on Unsplash
