It sometimes takes a microcontroller to boot up an FPGA
Modern ICs are little engineering marvels. Think of all the logic working within those
tiny boxes! Can anyone ever fully document how to use these modern miracles? Is it
possible to anticipate all operating conditions? In earlier days, references such as the
Texas Instruments TTL Databook went into exhaustive detail about the innards of each chip,
right down to the types of transistors and resistor values they incorporated. These
simplistic devices, at least from our present perspective, seemed thoroughly documented
and characterized. Today, with the proliferation of VLSI, quick turnaround times and short
product lifetimes, this attention to detail is sadly missing. I also personally question
how much ancillary knowledge some of these chip designers have beyond their general logic
equations.
The basis for these cynical thoughts lies in some problems I've recently encountered
with a Xilinx FPGA. This class of device can supposedly replace oodles of logic. However,
as you'll see, this assertion can quickly fall apart when these chips encounter the real,
far-from-ideal world. We take so much for granted that it comes as a surprise when these
little chips don't function as we expect. And the solution we had to implement makes me
question the whole direction of embedded applications with VLSI parts.
Inadvertent program loss
One of my embedded-instrument designs uses a Xilinx FPGA to replace considerable
chip-selection logic. One device it helped control was the flash memory that served as
main program storage. For three years this hardware worked terrifically until one day
someone forgot to write down some instructions, and the Xilinx house of cards came
crashing down around my metaphorical ears.
No problem ever results from a single event, and this one was no exception. My
customer's salesperson went to put the instrument into a demonstration mode. However,
because no one had written down the procedure, he did it incorrectly. When the demo mode
didn't work, he got mad at the instrument and quickly toggled the power off and then on, an
action that corrupted the flash memory. Of course, the description I got was simply that
the instrument lost its program.
With all such strange occurrences, I always try to go back to the source. Hence I
talked to the salesperson, sympathized with his plight and helped him remember his normal
operation. As it turns out, he said that everyone had problems getting into demo mode and
the accepted procedure was to switch the instrument off then on as quickly as possible to
hide the "problem" from the customer.
I examined the problematic instrument and, sure enough, quickly toggling power would
trash the flash memory's contents. Of course, the first solution that occurs to everyone is
simple: don't toggle power in that fashion. And the only reason it had happened in this
case was because of the missing piece of paper. So maybe another solution is to provide
better training. The problem, as we all know, is that the real world is a tough place for
an embedded design, and if a salesperson could break the instrument, you can bet that a
customer could do it just as easily.
With an oscilloscope I discovered that the flash-memory programming voltage (Vpp)
that's supposed to be off at power up would occasionally be on while the user was quickly
cycling power. The control signal for this voltage came from a Xilinx 2018 chip. This
symptom didn't make any sense because on power up, all the 2018's I/O pins are supposed to
be in a high-impedance state with a weak pull-up to Vcc. The Vpp control requires a Low
signal, so it couldn't possibly be on.
Besides, a flash memory requires much more than just a programming voltage to write new
data into it. It also requires a chip select signal, a write signal and the proper
programming code to kick off its internal state machine. I thought that our program was
absolutely secure inside this device. All those signals coming from the Xilinx chip, all
High at reset time, couldn't possibly conspire to destroy my data. Wrong!
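To see why that reasoning felt airtight, and where it breaks, the write conditions can be written as a small predicate. This is only an illustrative sketch: the struct, the function name and the command value 0x40 are assumptions made for the example, not details taken from the instrument or from any particular flash data sheet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command byte that starts a program cycle (assumed value). */
#define PROGRAM_SETUP_CMD 0x40u

/* Snapshot of the pins the flash memory sees. CS/ and WR/ are active Low. */
struct flash_pins {
    bool vpp_on;   /* programming voltage present */
    bool cs_n;     /* chip select: true = High = inactive */
    bool wr_n;     /* write enable: true = High = inactive */
    uint8_t data;  /* value currently on the data bus */
};

/* True only when every condition for starting a write is met at once. */
bool flash_write_can_start(const struct flash_pins *p)
{
    return p->vpp_on && !p->cs_n && !p->wr_n && p->data == PROGRAM_SETUP_CMD;
}
```

With every control pin High at reset, the predicate is false no matter what sits on the data bus. Once Vpp is on and CS/ and WR/ are both driven Low, the data value is the only guard left, and a runaway CPU can supply it by accident.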
As it turns out, the Xilinx chip has a problem at reset time. When you cycle power
quickly, the chip sometimes comes up with its outputs on at logic Zero levels. In
addition, this anomaly only occurs on certain pins that just happened to lead to crucial
parts of my circuits. What happened was that the Xilinx chip drove the flash memory's CS/
and WR/ (chip select and write enable) pins Low. Then the end of the Xilinx configuration
process (even though it never actually configured itself) booted the instrument's main CPU
which, running amuck, changed the data lines such that the flash memory saw its
Of course, as we all do in these situations, Xilinx denied that its part played any
role in this domino effect: "Can't happen, we would've heard about it by now." You know
the scenario. So I took out my trusty oscilloscope and logic analyzer and set out to fix
the problem and prove Xilinx wrong.
I found that the instrument was trashing random 1-byte memory locations in the first
flash memory chip. With further investigation, I could see on the debugging tools that the
Xilinx chip sometimes failed to boot properly, yet it signaled the rest of the system that
it was properly configured. Monitoring Vcc while switching power, I noticed that while
power was going down on the computer board, if I switched power back on when Vcc was
between 0.5V and 1.0V, I could consistently trash the flash memories. For some reason, the
chip wouldn't reset properly.
I then compared my logic-analyzer traces of the power-up process to the chip's timing
diagrams, and again everything looked right. However, a note on several timing diagrams
states that if Vcc takes longer than 25 msec to go from 2V to Vcc min you must install a
special Xilinx reset circuit. Our Vcc rise time was well within this specification, so I
basically ignored this advice.
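The rise-time check itself is easy to express once a power-up trace has been captured. The sketch below assumes the trace is available as millivolt samples at a fixed period; the function names and the 4.75V figure for Vcc min are assumptions for illustration, while the 2V starting point and the 25 msec limit come from the note on the timing diagrams.

```c
#include <stdbool.h>
#include <stddef.h>

#define RISE_START_MV  2000u   /* 2V, from the timing-diagram note */
#define VCC_MIN_MV     4750u   /* assumed Vcc min for a 5V part */
#define RISE_LIMIT_US 25000u   /* 25 msec, from the timing-diagram note */

/* mv[] holds Vcc samples in millivolts, one every sample_period_us.
 * Returns the time from first reaching 2V to first reaching Vcc min,
 * or -1 if the trace never covers that whole span. */
long rise_time_us(const unsigned *mv, size_t n, unsigned sample_period_us)
{
    size_t start = n, end = n;
    for (size_t i = 0; i < n; i++) {
        if (start == n && mv[i] >= RISE_START_MV)
            start = i;
        if (start != n && mv[i] >= VCC_MIN_MV) {
            end = i;
            break;
        }
    }
    if (end == n)
        return -1;
    return (long)(end - start) * (long)sample_period_us;
}

/* An incomplete trace (-1) is treated as "cannot confirm", so it also
 * returns true. */
bool needs_external_reset(long rise_us)
{
    return rise_us < 0 || rise_us > (long)RISE_LIMIT_US;
}
```

A check like this only covers the rising edge. It says nothing about where Vcc started from, which is exactly the condition that caused trouble here.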
Telephoning Xilinx again, I proceeded to tell my story to five different people, who
all said that there was no way that the chip's outputs could come on before configuration.
Finally, I happened upon someone with a good corporate memory. Apparently Xilinx had seen
a problem similar to mine many years ago, and each progressive generation of devices has
gotten better at solving it. The engineer said that we needed to implement a ridiculously
complicated boot-up procedure so that the chip would always reset properly. He also said
that the 3000 and 4000 Series parts didn't need this technique, only the 2000 Series, but
as I later discovered, the 3000 Series also exhibits this problem.
Russian nesting dolls
The timing diagram for the fix looked simple enough, but from a hardware standpoint it
was far from trivial. First of all, I had to retrofit the circuit into an existing
instrument. Secondly, I couldn't count on the validity of most signals because they passed
through the Xilinx chip. I needed to find a solution that would work every time, fit into
the existing space and didn't create its own problems.
The first thing I considered was a PAL, but to implement the logic required more than a
garden-variety PAL chip. In addition, these devices normally need an oscillator or some
other type of clock circuit to operate, and the last thing this EMI-noisy instrument
needed was another high frequency clock leaking all over the place. I even considered
using the venerable 555 timer as a sequential 1-shot device, but this method also created
its own problems. The solution I finally settled on was to use a simple microcontroller to
start the Xilinx chip so it, in turn, could start the main CPU.
What a ridiculous idea: use a microcontroller to start the hardware to start the
computer. On further reflection, though, it does make a lot of sense. With the microcontroller, I
had infinite latitude to adjust for the crazy Xilinx chip reset timing. What's more, small
low-volume microcontrollers like the 87C752 cost almost the same as a PAL. Finally, I had
no real speed requirement; a few more microseconds wouldn't hurt anybody.
The only real concern was how to guarantee that the microcontroller would start each
time. In the old days, I simply drove a microprocessor's reset input with a resistor,
capacitor and maybe a diode. Then I learned the hard way that this type of reset circuit
isn't effective in brown-out conditions, so I started using power-supervisory chips to
control the micro's reset signal. It gave me a nice known signal appropriately applied at
the right times. Now look where I've graduated to. I'm using a power-supervisory chip to
start a microcontroller to start a Xilinx chip to start a microcomputer. What a ludicrous
scenario. I wonder how many transistors I'm using to just kick off the instrument's
With these modifications in place everything now works as it should. My little program
in the microcontroller implements the Xilinx reset procedure (modified for another Xilinx
problem I found), which in turn now properly resets the Xilinx chip, which now properly
starts the microcomputer. All the original flash-memory problems are long gone, and the
salespeople have their sheet of paper with the directions.
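The general shape of such a boot sequencer can be sketched as a small polled state machine. The actual Xilinx reset procedure is not reproduced here, so treat everything below as a generic stand-in: the state names, the timing constants, the retry count and the reliance on a "done" input are all assumptions for illustration. On the real board the FPGA's own status could not be trusted, which is why the real procedure was more involved.

```c
#include <stdbool.h>

#define RESET_HOLD_MS   10u   /* assumed time to hold the FPGA in reset */
#define DONE_TIMEOUT_MS 100u  /* assumed wait for configuration to finish */
#define MAX_ATTEMPTS    5u    /* assumed retry limit */

enum boot_state { WAIT_SUPPLY, HOLD_FPGA_RESET, WAIT_FPGA_DONE, RUN, FAILED };

struct boot_ctx {
    enum boot_state state;
    unsigned ticks;     /* milliseconds spent in the current state */
    unsigned attempts;  /* FPGA reset attempts so far */
    bool fpga_reset;    /* output: true = FPGA held in reset */
    bool cpu_reset;     /* output: true = main CPU held in reset */
};

/* Start with everything held in reset. */
void boot_init(struct boot_ctx *c)
{
    c->state = WAIT_SUPPLY;
    c->ticks = 0;
    c->attempts = 0;
    c->fpga_reset = true;
    c->cpu_reset = true;
}

/* Call once per millisecond with freshly sampled inputs. */
void boot_tick(struct boot_ctx *c, bool supply_ok, bool fpga_done)
{
    if (!supply_ok) {          /* brown-out: start over with all resets on */
        boot_init(c);
        return;
    }
    switch (c->state) {
    case WAIT_SUPPLY:
        c->state = HOLD_FPGA_RESET;
        c->ticks = 0;
        break;
    case HOLD_FPGA_RESET:
        if (++c->ticks >= RESET_HOLD_MS) {
            c->fpga_reset = false;     /* let the FPGA configure itself */
            c->attempts++;
            c->state = WAIT_FPGA_DONE;
            c->ticks = 0;
        }
        break;
    case WAIT_FPGA_DONE:
        if (fpga_done) {
            c->cpu_reset = false;      /* only now may the CPU run */
            c->state = RUN;
        } else if (++c->ticks >= DONE_TIMEOUT_MS) {
            c->fpga_reset = true;      /* give up on this attempt */
            c->ticks = 0;
            c->state = (c->attempts >= MAX_ATTEMPTS) ? FAILED : HOLD_FPGA_RESET;
        }
        break;
    case RUN:
    case FAILED:
        break;                         /* terminal until the supply drops */
    }
}
```

The point of the structure is that the main CPU stays in reset on every path except the one where the FPGA has come up, so a failed configuration can no longer let a runaway CPU loose on the flash memory.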
Just when I finally thought that the power-up problems with the VLSI chips were out of
the way, I got a call about a power-on problem (again with the flipping power switches)
relating to a Cirrus GD6215 video controller chip. All our signals are fine, but the
device enters a test mode.
So what's the real solution? Must we educate instrument users to not switch power off
and on too fast? Must we add a circuit to every system that guarantees to drop Vcc to 0.1V
every time the power switch goes off? Must we build smart power switches that don't allow
a user to quickly flip the power switch (like the switch in the original IBM PC)?
The VLSI chips used in modern PCs are marvels of engineering, but embedded designers
need more than functionality; they also need robustness. Because of a missing description
on a piece of paper, I ended up adding a microcontroller to a system. What's wrong with
this picture? PE&IN