“Look at the simple stuff first” from Quentin Lewis
I can’t tell you how many times I have seen people burned because they instantly dig into problems up to their eyeballs, only to find that it was a simple BOM error all along. This is the story of EXACTLY such a case.
While working at a still-famous three-letter computer company (there are a couple of them, you know), I was debugging a CPU board I had designed. Things seemed to be going pretty well, except for a random crash. The crash was clearly the result of a memory ECC error. We began collecting data on every crash, looking for data or address patterns in the failures. But the problem didn’t occur all that often, so it was going to be hard to find.
We did some temperature, voltage and frequency margining, and found that the problem responded as though it were a clear timing violation. Yet no matter how much we looked at it with the oscilloscope and logic analyzer, we did not see any issues.
We then noticed that error rates were memory vendor related, with one particular vendor’s devices having a much higher failure rate than the others (and one vendor’s devices not failing at all).
We called in engineers from the failing vendor, and we looked at the problem together for a week. We saw nothing. All the timing looked good, and even though the problem did not occur often, we had been able to hook up the logic analyzer and create a trigger that allowed us to capture any signals we wanted at the time of the error. (Unfortunately, the error was created during the write, so the cause really wasn’t always in the trace.)
Well, at one point, we were scoping around and we looked at a signal going between two memory controller PALs. (This will probably date the problem for you.) The signal “looked funny”, but it was only a point-to-point signal, so it made little sense. We went in and checked the PAL equations to make sure we had not messed them up, and we just kept looking at this strange signal, knowing that we were onto something, but not knowing what was going on.
Then a quick look at the schematic showed that there really was only one other SMALL item in the circuit, and this was a 20 ohm series resistor. We looked at it, and it was stuffed... but it looked like it might have gone on upside down, as we could not read the value. BUT WAIT... thinking quickly, and remembering the shape of the signal we were looking at, we realized that the resistor was acting like a capacitor!
Sure enough, we took it off and it was a capacitor. We then checked the other boards in the proto run, and they were incorrectly stuffed too. A peek at the BOM made it clear: someone had somehow fat-fingered the BOM entry of an ECO, and a capacitor was called out in the BOM. We had been working for three weeks on a problem that was a simple BOM error, one we should have picked up in the initial board inspection (had we done a close check).
Lesson learned: no matter how much it seems to be a waste of time, go through the “start-up checklist” every time. It will save you in the long run. (Check power-to-ground shorts, visual inspection, board mechanical dimensions, BOM check against board and schematic, etc.)
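The “BOM check against schematic” item is easy to automate, at least in part. What follows is only a minimal sketch of the idea, not the tooling we had at the time: it assumes you can export both the BOM and the schematic parts list as a mapping from reference designator to a (part type, value) pair. The function name, the designators R12 and C7, and the capacitor value are all made up for illustration; only the 20 ohm series resistor comes from the story.

```python
def compare_bom_to_schematic(bom, schematic):
    """Return a list of (refdes, schematic_entry, bom_entry) mismatches.

    Both arguments map a reference designator to a (part_type, value) tuple.
    A designator missing from one side is reported with None on that side.
    """
    mismatches = []
    for refdes in sorted(set(bom) | set(schematic)):
        expected = schematic.get(refdes)
        actual = bom.get(refdes)
        if expected != actual:
            mismatches.append((refdes, expected, actual))
    return mismatches


# Hypothetical data: the schematic calls for a 20 ohm series resistor,
# but a mistyped ECO turned the BOM entry into a capacitor.
schematic = {
    "R12": ("resistor", "20 ohm"),
    "C7": ("capacitor", "0.1 uF"),
}
bom = {
    "R12": ("capacitor", "0.1 uF"),
    "C7": ("capacitor", "0.1 uF"),
}

for refdes, expected, actual in compare_bom_to_schematic(bom, schematic):
    print(f"{refdes}: schematic says {expected}, BOM says {actual}")
```

A check like this only compares paperwork against paperwork. It would have caught the bad ECO entry, but it does not replace looking at the stuffed board itself.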