Project update 34 of 34
Here’s our conference schedule for the summer:
If you happen to be attending any of these events, please look us up (assuming we can get tickets)!
It’s been a little while since our last release, but that hasn’t been for a lack of effort. Since our last release we’ve run a new production lot, added a bunch of new features, bug fixes and infrastructure improvements.
Remember back when we had a supply chain shortage? Today we are looking at the inevitable other half of that, a glut of components and a recession. My last update in February referenced arm-twisting to get parts out of vendors. Now, as I write this, the same vendors are arm-twisting me back to get me to take additional parts that I don’t immediately need, because global demand has softened and they want to meet their quarterly sales numbers.
You’d think we’d learn by now how to avoid this, but empirically we can’t: the better we get at producing things on short notice, the less stable the system becomes. "More and faster" is an equivalent way of saying "less gain and phase margin" in feedback analysis, which means wilder oscillations in response to changes. It’s one of the prices humanity pays for "fast tech" (fast as in fashion, not as in clock rates).
It had been a while since our previous production run, so I made the trip to South Korea to make sure all the material and processes were still in place at the factory to run the new lot. The good news is that the factory experienced low turnover, and even upgraded some of its equipment, so the production run itself went quite smoothly.
Since I last visited, the factory upgraded its automated optical inspection (AOI) system, shown above. In AOI, every assembled unit is optically compared against a reference unit. In addition to looking for solder defects and out-of-place parts, it also checks part numbers, orientation, and labels. In terms of "trustability", this is roughly the electronics assembly industry’s standard practice: read the label, and if it looks right, trust it. What could go wrong? ¯\(ツ)/¯
I made the trip short to keep travel costs down, but unfortunately during the post-assembly testing an interesting problem cropped up. Initially, over 60% of the units built were failing test. I’ll save you the gut-wrenching details of the ensuing three weeks of analysis and regression testing: let’s just say I had a crash course in principal component analysis, decision trees, and classifiers. Big data techniques are kind of neat, but also very fallible. They’re good for creating magic wands that seem to solve a problem, until they don’t because big data doesn’t get to the root cause. It’s just very good at grouping symptoms into clusters, and the quality of analysis is only as good as the quality of the underlying normalizations built into the data collection and pre-processing that are required to get the techniques to converge.
Anyways. It turns out the problem was metastability.
For those who are unfamiliar with the term, metastability is a fundamental undecidability problem encountered when mapping analog voltages to digital 1’s and 0’s. When flipping a coin, we generally expect it to result in either a heads or a tails — coins are "digital" like that. Except they aren’t: coins also have an edge. Metastability is that period of time where a flipped coin rolls around on a table before settling into a heads or a tails. Most of the time, it settles quickly, but, every now and then the coin rolls off the table because it lands squarely on its edge.
This is especially problematic in a clocked digital system, where we expect to take that same coin and flip it at regular intervals. If the coin is still rolling on its edge before the next interval comes, what do you call it? Heads? Tails?
In order to solve this problem, we take a metaphorical hand and slam it onto the coin before the next toss, to force the coin into a heads or tails state. In circuit terms, this is done by sampling that analog signal with a flip flop. In mathematical terms, it doesn’t guarantee the coin always goes to a heads or tails: it merely decreases the probability that it’s still on the edge. When I learned this subject in college, I was taught that you always use two registers to sample an analog signal, because at rates of hundreds of millions of samples per second, you will occasionally see the first register go undecided; a second register reduces the probability of that undecided state coming through exponentially.
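The "exponentially" in that rule of thumb comes from the textbook synchronizer model, MTBF = exp(t_r / τ) / (T_w · f_clk · f_data), where t_r is the settling time the synchronizer allows and τ is the flip flop's regeneration time constant. Below is a minimal Rust sketch of that model. The struct, the function names, and every number fed to them are illustrative assumptions, not values measured on this hardware.

```rust
/// Parameters of the textbook synchronizer MTBF model.
/// All values supplied here are assumptions for illustration only.
#[derive(Clone, Copy)]
pub struct SyncParams {
    /// Regeneration time constant of the flip flop, in seconds.
    pub tau_s: f64,
    /// Metastability capture window T_w, in seconds.
    pub window_s: f64,
    /// Sampling clock frequency, in Hz.
    pub f_clk_hz: f64,
    /// Average rate of asynchronous input transitions, in Hz.
    pub f_data_hz: f64,
}

/// Mean time between synchronizer failures, in seconds.
///
/// Assumes each flip flop after the first contributes `slack_s` seconds
/// of settling time (roughly one clock period minus setup and routing).
pub fn mtbf_seconds(p: &SyncParams, stages: u32, slack_s: f64) -> f64 {
    let resolve_time_s = slack_s * f64::from(stages.saturating_sub(1));
    (resolve_time_s / p.tau_s).exp() / (p.window_s * p.f_clk_hz * p.f_data_hz)
}
```

With an assumed τ of 0.5 ns and 8 ns of slack per stage, going from two stages to three multiplies the MTBF by e^16, roughly 8.9 million. A slower-resolving flip flop (larger τ) shrinks that exponent, so the same number of stages buys less margin.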
That bit of knowledge worked for 30 years, until I decided to ship a TRNG…
You see, a ring oscillator-based TRNG is basically a device that "flips" coins by balancing them on their edge, and then smashing them down to get a random heads or tails outcome. In other words, the TRNG core deliberately puts a digital circuit into its metastable state. Metastability is a feature not a bug in this case. This is all well and good if the coins are thin and flat. However, in the case of this new production batch, our coins had a "fat edge": the P and N transistor thresholds cornered such that they were both well-balanced, and non-overlapping. This meant that our digital coins had a propensity to stay on their edges, even after smashing them down twice.
The result is that the signal indicating whether the OS is allowed to read a random number out of the TRNG stays undecided. That undecided state propagates all the way into the bus arbiter for the SoC and, lo and behold, you have a system lockup: a failure to boot during factory test.
It also turns out this problem manifests in other places where analog signals are sampled, such as the I2C block. "But I2C is digital", you say. Except, have you ever seen the rising edge of an I2C bus? It’s a big capacitive bus pulled up by a weak resistor, so the rise time is on the order of several microseconds (thousands of nanoseconds). That means the signal spends quite a bit of time transitioning in that "no man’s land" of neither zero nor one when you’re sampling it at the sysclk rate of 100 MHz (10 ns period).
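To put rough numbers on that, here is a small Rust sketch that estimates how many sysclk samples land inside a slow RC edge. It assumes a simple RC pull-up model and input thresholds at 30% and 70% of the supply (the levels the I2C specification uses to define rise time). The resistor and capacitance values in the usage note are hypothetical, not taken from this board.

```rust
/// Time for an RC pull-up edge to travel from 30% to 70% of the supply,
/// in nanoseconds. Derived from V(t) = VDD * (1 - exp(-t / RC)), which
/// gives t = RC * ln(0.7 / 0.3), about 0.847 * RC.
pub fn rise_time_30_70_ns(r_ohms: f64, c_farads: f64) -> f64 {
    r_ohms * c_farads * (0.7f64 / 0.3f64).ln() * 1e9
}

/// Whole number of clock samples taken while the signal is between
/// the two thresholds.
pub fn samples_in_transition(rise_time_ns: f64, clk_period_ns: f64) -> u64 {
    (rise_time_ns / clk_period_ns).floor() as u64
}
```

For an assumed 10 kΩ pull-up on a 400 pF bus, `rise_time_30_70_ns(10_000.0, 400e-12)` comes to about 3.4 µs, and at a 10 ns sample period that is more than 300 samples per rising edge taken while the input is neither a zero nor a one.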
The upshot is that while the place and route tools for the FPGA are great at doing static timing analysis of well-behaved signals, there is no modeling for metastability behavior. So, the tools are perfectly happy to grind out routed solutions for your logic that aggravate metastability in some situations and work great in others.
This was the final factor that made root cause analysis really tricky — only about 1 in 10 of the bitstreams generated by our tool would fail on the new production lot (and they would almost never fail on the previous lot). I just "happened to get lucky" and the one bitstream I used for this production run was in the 10% that failed, and the material in this lot had transistors that were metaphorically "fat-edged coins".
Of course, when I re-ran the place & route the first time, the problem "went away". But, I am experienced enough to not be satisfied with that outcome, and instead see it as a very big red flag. Without understanding the root cause, eventually I’d push a release that contained the problem and we’d brick 60% of the customer units in the field! (cue weeks of principal component analysis…)
Fortunately, we have a pretty good CI system that logs all the past builds, so I was able to "go backwards in time" as well as generate more builds, cross-correlate build outcomes against hardware lots, and eventually determine that metastability was the root cause of failure, and address it at a design level.
The design remediation included minimizing the logic paths exposed to metastability, using more flip flops to synchronize the data (so instead of 2 stages, we used 3- or 4-stage synchronizers), and ultimately some software patches, particularly in the case of I2C where, despite synchronizers, de-noisers, and design hardening, it would still fail on some occasions. Because I2C is a fairly simple protocol, it’s statistically likely that you can decode a valid start/stop state or even a full transaction out of random noise. Consider that once an N-byte write transfer has been set up, all you need is 9 clock edges with an ACK bit to tack another byte onto an auto-incrementing write (this specifically caused garbage to be written to the RTC registers when the last bus operation was a write to set a wake-up alarm, and power was decaying over millisecond time scales). Fortunately the I2C code was already written with a time-out parameter; it simply meant that instead of calling .unwrap() on the routine, I installed an error handler that reset the I2C controller on time-out, and then retried the operation.
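The reset-and-retry pattern described above might look something like the following Rust sketch. This is not the Xous I2C driver: the `I2cBus` trait, the `I2cError` type, `write_with_retry`, and the `FlakyBus` test double are all hypothetical names invented for illustration.

```rust
use std::time::Duration;

/// Hypothetical error type; a real driver defines its own.
#[derive(Debug, PartialEq)]
pub enum I2cError {
    Timeout,
    Nack,
}

/// Hypothetical stand-in for an I2C controller driver.
pub trait I2cBus {
    fn write(&mut self, dev: u8, reg: u8, data: &[u8], timeout: Duration)
        -> Result<(), I2cError>;
    fn reset(&mut self);
}

/// Instead of unwrapping the result, handle a time-out by resetting the
/// controller and trying again, up to `max_attempts` tries in total.
/// Any other error is returned to the caller immediately.
pub fn write_with_retry<B: I2cBus>(
    bus: &mut B,
    dev: u8,
    reg: u8,
    data: &[u8],
    timeout: Duration,
    max_attempts: u32,
) -> Result<(), I2cError> {
    for _ in 0..max_attempts {
        match bus.write(dev, reg, data, timeout) {
            Ok(()) => return Ok(()),
            Err(I2cError::Timeout) => bus.reset(),
            Err(other) => return Err(other),
        }
    }
    Err(I2cError::Timeout)
}

/// Test double that times out a fixed number of times before succeeding.
pub struct FlakyBus {
    pub timeouts_left: u32,
    pub resets: u32,
}

impl I2cBus for FlakyBus {
    fn write(&mut self, _dev: u8, _reg: u8, _data: &[u8], _timeout: Duration)
        -> Result<(), I2cError> {
        if self.timeouts_left > 0 {
            self.timeouts_left -= 1;
            Err(I2cError::Timeout)
        } else {
            Ok(())
        }
    }

    fn reset(&mut self) {
        self.resets += 1;
    }
}
```

Bounding the number of attempts keeps a persistently wedged bus from turning into an infinite loop; the caller still sees `Err(I2cError::Timeout)` and can decide what to do next.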
I was a little worried that simply adding more stages of synchronizers was a gross hack papering over something more fundamental, but I recently learned that in later process nodes, standard cell libraries feature four-deep flip flop synchronizers as a hard primitive. So, I guess I’m not alone in being bitten by the metastability bug, just late to the game.
With all these fixes in place, I was able to run a hundred trial compilations of the bitstream and test each one on real hardware with zero failures (hooray for hardware-in-the-loop CI!), and so I’m almost confident we won’t have this problem again. But metastability is one of those skeletons you can’t ever bury. One can only put it in a closet and then put that closet in a closet, and pray every night before you go to bed that you never hear an unexpected thump coming from that direction.
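As a back-of-the-envelope check on what a hundred clean builds says, the Rust sketch below treats each compiled bitstream as an independent trial that is simply good or bad. That independence, and the reuse of the earlier 1-in-10 figure as the "before" failure rate, are simplifying assumptions; the function names are made up for this example.

```rust
/// Probability of `trials` consecutive passing builds if each build
/// independently fails with probability `p_fail`.
pub fn prob_all_pass(p_fail: f64, trials: u32) -> f64 {
    (1.0 - p_fail).powi(trials as i32)
}

/// Largest per-build failure rate still consistent with `trials`
/// failure-free builds at the given confidence level. This solves
/// (1 - p)^trials = 1 - confidence for p, the exact form of the
/// "rule of three" approximation p ≈ 3 / trials at 95% confidence.
pub fn failure_rate_upper_bound(trials: u32, confidence: f64) -> f64 {
    1.0 - (1.0 - confidence).powf(1.0 / f64::from(trials))
}
```

If the old 10% rate still applied, `prob_all_pass(0.10, 100)` works out to about 2.7e-5, so a hundred clean runs by luck alone would be very unlikely. On the other hand, `failure_rate_upper_bound(100, 0.95)` is about 3%, which means a hundred passes cannot rule out a residual failure rate of a few percent. That is one way to read "almost confident".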
We’ve also shipped Xous 0.9.13. It addresses a lot of issues, but there is still a lot more to do. Unfortunately, with all the travel, conferences, and vendor meetings coming back in force, we’ve had less time to devote to writing code. Covid-zero is over in China, and Shenzhen is open again. So, while the pandemic was great for shipping lines of code, the resumption of normal hardware supply chain operations means we’ll be splitting our time more evenly between software and hardware issues. In other words, expect the cadence of our releases to have a somewhat slower pace going forward.
Here’s the run-down of what you can find in this release.
precursorupdater, as the CPU debug port is replaced with a simple reset-halt mechanism.
atsama5d27 target support via PRs from Foundation Devices. Xous is now booting on the ATSAMA5D27-SOM1-EK1 dev board!
usb noconsole to turn off). You will need a terminal client that is capable of CRLF translations.
usb notrng to turn off). This should be compatible with existing methods to extract randomness from USB dongles such as the OneRNG (looking for an existing HW RNG dongle user to test and confirm compatibility with their existing system!).
--gdb-stub to the build. When selected, the kernel is built with GDB support over serial port. This works well in Renode. To try it in hardware, one must first run shellchat to activate the GDB UART (otherwise the sole serial port is connected to the console log).
app-image-xip is now the release configuration. This puts a couple services in "execute in place" (XIP) mode, running out of FLASH instead of being RAM-resident, thus freeing up more RAM for apps. Users building custom images are advised to use app-image to avoid OOM behaviors.
vault, deleting a password and saving the record with the blank password triggers a password generation dialog box (useful for updating passwords)
That’s it for now!
Our schedules are already looking fairly packed this summer, so we’ll be prioritizing issues and PRs based on urgency. Please help us provide better support by using our issues interface to report problems, instead of casually mentioning them in our Matrix channel. It’s too easy to forget about problems that are mentioned in a direct message or a chat window, especially when travelling and/or dealing with jetlag.
Hopefully we’ll bump elbows this summer at one of the two events we’re aiming to be at!
-bunnie & xobs