“A few stories on the theme of Check the Plug” from Nick Coghlan
Some debugging stories that, taken together, probably consumed a few months of development time (it could actually have been a LOT worse!).
1. Check those hardware configuration registers
The system I am working on is normally sold as a complete, vendor-provided solution – they provide the hardware, and package it with third-party signal processing software. The vendor discovered they also had a market for companies like us that really liked the hardware design, but wanted to develop custom signal processing software on top of it.
This was fine, but since the vendor had only recently started doing this, their capacity to support the custom software development wasn’t that great (a lot of the necessary knowledge was held by the third-party software vendor, instead of the hardware vendor). So, we allocated plenty of time for experimentation and prototyping of the signal processing software.
However, for the first few months, we were consistently getting strange behaviour from the threading in the RTOS we were using. Working on the assumption that there was an errant pointer in the code, or something similar, I simply kept an eye on the problem for a while, making progress in the prototyping and generally operating reasonably well. Eventually, however, we’d got to the point where we’d established that there was nothing obviously wrong in the prototype code, but we were still getting threading problems – a software interrupt was getting pre-empted by a standard thread.
So, we stripped out as much of the prototype as we could, and the problem was still there. Not perfectly consistent, but extremely frequent. We then went back to the RTOS vendor, describing the problem we were seeing. They were basically stumped too, so they passed me on to their actual software development team, who were intensely curious as to what was going on – as far as they were concerned, this behaviour was impossible.
We managed to trap an occurrence of the error, and the support engineer had me look at some of the internal RTOS data – which held values that, according to the engineer, should never be reached. Finally, after a few days of trans-Pacific phone calls, the RTOS vendor’s engineer and I were stepping through a section of the RTOS assembler code, monitoring register values. We saw the processor perform a calculation along the lines of “1 & 1 -> 0”. Needless to say, this was confusing the RTOS more than a little. At this point, the RTOS engineer asked me to check the configuration register for the processor’s PLL multiplier. When I’d set that value in the RTOS configuration file, I’d simply used the maximum, without checking what the correct value for the vendor’s hardware was. Once we tried changing it back down to 1, the strange threading behaviour disappeared, as the processor rediscovered its ability to do basic math.
When I went back to our hardware vendor, I discovered I’d been clocking the processor on the development hardware at 150% of its rated speed. Since the deployed hardware uses a faster crystal, if I’d tried running the incorrectly configured code on that, it would have been running at 750% (actually, at that speed, the magic smoke probably would have escaped from the processor).
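With hindsight, even a crude sanity check on the derived clock speed would have caught the bad setting on day one. Here’s a minimal sketch of the idea in C – the crystal frequency, rated speed, and multiplier below are all made up for illustration, not taken from the actual hardware in the story:

    #include <stdio.h>

    /* Illustrative numbers only – check your own board's datasheet
     * for the real crystal frequency and rated CPU speed. */
    #define CRYSTAL_HZ  40000000UL   /* assumed 40 MHz crystal */
    #define CPU_MAX_HZ 200000000UL   /* assumed 200 MHz rated speed */

    static int pll_multiplier_ok(unsigned long mult)
    {
        unsigned long cpu_hz = CRYSTAL_HZ * mult;
        if (cpu_hz > CPU_MAX_HZ) {
            fprintf(stderr, "PLL x%lu -> %lu Hz exceeds rated %lu Hz\n",
                    mult, cpu_hz, CPU_MAX_HZ);
            return 0;
        }
        return 1;
    }

    int main(void)
    {
        /* "Just use the maximum" is exactly the mistake in the story. */
        unsigned long mult = 8;
        if (!pll_multiplier_ok(mult))
            return 1;
        printf("PLL x%lu accepted (%lu Hz)\n", mult, CRYSTAL_HZ * mult);
        return 0;
    }

On a real system a check like this would live in the board support code at startup rather than in main(), but the principle is the same: derive the clock speed from the configuration and compare it against the rating, rather than trusting whatever got typed into the config file.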
2. Sometimes, it IS the operating system
On another occasion, a serial I/O driver seemed to be suffering strange timing problems, with a software interrupt appearing to miss its deadline. Instrumenting the code, stripping out everything _except_ the instrumentation, shutting down the rest of the system – none of it appeared to make any difference. Initially, we chalked it up to interference from the JTAG-based emulation, but then we discovered that the actual I/O paths were totally corrupted, even when the JTAG emulator was not connected.
After much testing and head-scratching, as well as reviews of the driver code by other developers, I checked the vendor’s bug listing page for the first time in a few months. Included was an entry along the lines of ‘Priority 1 software interrupts will sometimes fail to be posted’. We only had one software interrupt – I changed it from priority 1 to priority 2, and the strange driver behaviour disappeared.
Assuming your own code is at fault is generally the right way to go – but keep that list of known OS bugs handy, too.
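One way to stop that kind of workaround from silently evaporating is to encode it, and its reason, where the compiler can see it. A minimal sketch in C, with hypothetical names – the real RTOS configuration in the story worked differently:

    /* Hypothetical configuration header – the macro name is invented. */
    #define SERIAL_SWI_PRIORITY 2

    /* Workaround for a vendor bug (paraphrased): "Priority 1 software
     * interrupts will sometimes fail to be posted". Fail the build if
     * anyone moves this back to 1 without checking the errata first. */
    #if SERIAL_SWI_PRIORITY == 1
    #error "Priority 1 SWIs are unreliable on this RTOS version; see vendor bug list"
    #endif

    int main(void) { return 0; }

The point of the #error is that the next maintainer who “tidies up” the priority gets pointed straight at the vendor bug, instead of reintroducing a failure that took weeks to diagnose the first time.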
3. Linkers are all the same, right?
During the course of the project, we upgraded the version of our toolchain. Overall, this was a Very Good Thing, but there was an interesting teething problem. Our application, which had worked fine with the original version of the toolchain, wasn’t working at all with the newer version of the debugger. In fact, it was causing the entire processor to lock up. Again, we went back to the RTOS vendor, even sending them a stripped-down version of the application that exhibited the problem.
This time, it DID appear to be related to the JTAG emulator – if the emulator wasn’t attached, we didn’t seem to have a problem. But all we’d done was change the tools – what could be so different as to cause the application to crash completely?
It turned out that the new linker was arranging things differently in memory from the previous version. The data buffers used by the serial I/O and the data buffers used to transfer debugging data to the IDE were now in the same block of memory, and the memory port access controls were causing scheduling and/or pipeline issues that were locking up the processor. The simple expedient of moving the serial I/O buffers to dual-access memory eliminated the conflict, and allowed the debugger to work correctly with the updated toolchain. This fix was actually discovered by lucky accident, before we managed to work backward to figure out _why_ moving the serial I/O buffers solved the problem.
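For anyone who hasn’t had to do this, pinning a buffer to a particular memory region usually means giving it a named section and then mapping that section in the linker control file. A minimal sketch, assuming GCC-style section attributes – the toolchain in the story used its own pragmas, and the section and region names below are invented:

    #include <stdint.h>

    /* Put the serial I/O buffers in their own named section... */
    static uint8_t serial_rx_buf[512] __attribute__((section(".dualram")));
    static uint8_t serial_tx_buf[512] __attribute__((section(".dualram")));

    /* ...and map that section to dual-access RAM in the linker control
     * file, away from the memory block the debugger's transfer buffers
     * use – along the lines of:
     *
     *     MEMORY   { DUALRAM : origin = 0x80000000, length = 0x10000 }
     *     SECTIONS { .dualram > DUALRAM }
     */

    int main(void)
    {
        serial_rx_buf[0] = serial_tx_buf[0] = 0;  /* keep the buffers live */
        return serial_rx_buf[0];
    }

The key design point is that the placement decision lives in exactly two places – the section attribute and the linker control file – so a toolchain upgrade can rearrange everything else in memory without dragging the serial buffers back into the contended block.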
The lock-up problem actually meant that our first attempt at upgrading the toolchain was aborted – we had set a deadline that, if the new version wasn’t working as well as the previous version within four weeks, we would postpone the upgrade until the next iteration. The next time around, we managed to track the problem down, and were able to shift to the new toolchain.