“A lazy graphics card” from Eddy Carroll

Around the time Windows 3.0 was launched, a colleague and I were developing a Windows video driver for an advanced graphics card. During development, we noticed that occasionally, Windows would hang while loading the graphics driver. When this occurred, we’d simply reset the PC and it would work fine the second time. Since we had plenty of other, more pressing issues to work on, we decided to keep an eye open for possible causes but otherwise ignore it for the time being.

After a couple of months, the video driver was in pretty good shape and we were almost ready to ship. And of course, we hadn’t yet figured out why the driver would sometimes cause Windows to hang. I _had_ noticed, however, that it always failed on a cold boot, and this was repeatable. It never happened after pressing CTRL-ALT-DEL or the hardware reset button. (We rarely turned our PCs off, which was why it wasn’t happening often enough to annoy us.)

The graphics card used an onboard TI graphics processor (TMS34020) which ran independently of the main CPU. The host CPU communicated with the TI chip through a 16K shared memory window, mapped into the PC’s memory space at a configurable address in the high memory area. During startup, the PC downloaded graphics code to the TI chip and then brought it out of reset to execute it.
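
To make the start-up sequence concrete, here is a minimal sketch in C. The register addresses, names, and layout are entirely hypothetical (the real hardware interface and driver were different); it simply illustrates the steps described above: hold the TI chip in reset, configure where the shared window appears, copy the graphics code into it, then release the reset.

```c
#include <stdint.h>
#include <string.h>

/* All addresses and register layouts below are hypothetical, purely to
 * illustrate the start-up sequence: map window, download code, release reset. */
#define WINDOW_BASE  ((volatile uint8_t *)0xD0000)   /* 16K shared window in high memory */
#define WINDOW_SIZE  (16u * 1024u)

static volatile uint8_t *map_latch  = (volatile uint8_t *)0xC8000; /* selects window address   */
static volatile uint8_t *reset_ctrl = (volatile uint8_t *)0xC8001; /* holds/releases TI reset  */

void start_graphics_processor(const uint8_t *ti_code, size_t len)
{
    if (len > WINDOW_SIZE)
        len = WINDOW_SIZE;                 /* the window is only 16K                   */

    *reset_ctrl = 1;                       /* hold the TMS34020 in reset               */
    *map_latch  = 0xD0;                    /* map the shared window at segment D000    */

    memcpy((void *)WINDOW_BASE, ti_code, len); /* download the graphics code           */

    *reset_ctrl = 0;                       /* release reset: the TI chip starts running */
}
```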

Now that I could reproduce the crash, the next step was to use the debugger to watch the driver initialisation code. Everything seemed fine: the window was mapped in, and the downloaded code could be read back correctly; the only problem was that the graphics CPU refused to execute it.

So much for the PC’s debugger. Fortunately, we also had a debug port on the graphics card itself. Unfortunately, this port required some stub code on the TI chip to operate. However, by adding some debug output messages at the start of the TI code, I was able to confirm that none of the downloaded code was executing on those occasions when the crash occurred.

So, I resorted to comparing traces of the working vs non-working driver initialisation runs on the PC. Eventually, after much painstaking logging, I noticed that the initial pattern of data in the shared memory window was somewhat different in the failing case (after a cold start) than in the working case. After the first warm start, the data was a mixture of FF’s, FE’s, and the downloaded program code. After a cold start, the data was more like something you’d see in MS-DOS program memory…

… which was impossible, of course, because this memory resided on the graphics card and was inaccessible to MS-DOS until the graphics driver had performed a number of initialisation steps to map it in.

Eventually, it became clear what was happening. Windows 3.0 tried to use as much “high memory” as possible for its own needs, to leave normal MS-DOS program space available for applications. To determine how much high memory could be used, it did a simple non-destructive read/write test on all pages in the high memory area. Any pages that appeared to contain valid RAM were assumed to belong to expansion cards and left alone. All other pages were commandeered for Windows’ own use, and the memory management unit of the 386 processor was used to remap them to extended memory.
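
A minimal sketch of that kind of probe, in C, gives the flavour. This is my reconstruction of the idea rather than Windows 3.0’s actual code: save a byte, write a test pattern, read it back, restore the original, and treat the page as populated RAM only if the pattern sticks.

```c
#include <stdbool.h>
#include <stdint.h>

/* A reconstruction of the idea, not Windows 3.0's actual code: a page
 * "contains RAM" if a written test pattern reads back correctly, and the
 * original contents are restored afterwards (hence non-destructive). */
bool page_contains_ram(volatile uint8_t *page)
{
    volatile uint8_t *probe = page;          /* test one byte per page, for brevity */
    uint8_t saved = *probe;

    *probe = 0x55;                           /* write a test pattern      */
    if (*probe != 0x55) { *probe = saved; return false; }

    *probe = 0xAA;                           /* and its complement        */
    if (*probe != 0xAA) { *probe = saved; return false; }

    *probe = saved;                          /* restore original contents */
    return true;
}
```

Any page for which a probe like this failed looked like empty address space to Windows, and was fair game for remapping.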

By now, you may have figured out the problem: on a cold boot, the shared memory window used by the graphics card hadn’t yet been initialised, and so it wasn’t mapped into memory. Windows therefore saw nothing at that address and remapped ordinary RAM over it. When the shared memory window was later enabled, the processor had no way to access it, since the MMU was intercepting all accesses to that address range.

So, that explained why it failed the first time — but why did it work on subsequent occasions? Because of a slight design flaw on the graphics card: the hardware designer had neglected to connect a reset line to the latch used to control where in PC memory the shared window would get mapped. On power-on, this latch could come up with any random value (though typically FF), and it was assumed that the driver would set it to something sensible before using it.

Since the latch wasn’t reset by a warm start, it retained its previous value — which had been set by the graphics driver as part of its initialisation during the last cold boot. Thus, when Windows booted up the second time, its memory test showed that page to be valid memory on an expansion card, which meant Windows didn’t try to remap it for system use. This in turn meant the graphics driver (which was loaded after Windows did its own initialisation) could happily read and write to the graphics card memory.

Now that the sequence of events was clear, it turned out to be very easy to fix: we simply modified our installation program to add a line to the Windows SYSTEM.INI file telling it to always exclude the graphics board’s memory range from system use; no change to the graphics driver itself was required. This was a lot easier than modifying the reset logic on the board, especially since we had a large number of units already out in the field. We did make sure that future board designs correctly initialised the latch register on reset, so that behaviour would be consistent.
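
For readers who remember Windows 3.x, the relevant setting lived in the [386Enh] section of SYSTEM.INI; the author doesn’t name it, but an `EMMExclude` entry along these lines is the sort of thing the installer would add (the address range here is purely illustrative, since the real range depended on where the card’s 16K window was configured):

```
[386Enh]
; Illustrative range only: keep Windows away from the card's shared memory window
EMMExclude=D000-D3FF
```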

And our driver shipped two weeks late, as a result.

So, what lessons can be learnt from this?

– Understand the system: although Windows 3.0 was very new at the time, and Google wasn’t around to make it easy to find obscure information, it was a chance comment about Windows memory usage in an early Windows programming book that alerted me to the possibility that Windows could steal high memory from I/O cards.

– It’s never a good idea to let things default to a random value, even in hardware. If the graphics board had always been set to a consistent state after a reset, the failure would have been identified and fixed at the start rather than the end of the development period.

– It’s not always a good idea to fix bugs immediately, especially if they are hard to reproduce. In this case, I was able to narrow the cause of the bug to the cold start situation over a period of several weeks, in parallel with normal development. By then, I also had enough confidence in the graphics hardware to be sure that the problem wasn’t likely to be related to flakiness in the graphics chip itself. (In all other respects, the board was proving very reliable.)

– The reason I was seeing any recognisable patterns in memory at all after a cold start was that I was too impatient to wait 10 seconds when I power-cycled my development PC. As a result, the DRAM didn’t have a chance to fully discharge, and a ghost image of the previous contents remained. (This was a 25 MHz 386, and the memory could hold its charge for several seconds, despite the official rating.) While I don’t suggest making a habit of quick power-cycling, it does act as a reminder that help can sometimes come from unexpected quarters!