May 16, 2023

Project update 34 of 39

Meetups, New Production and Xous 0.9.13

by bunnie

Meets & Greets

Here’s our conference schedule for the summer:

bunnie will be at Teardown in Portland, USA (June 23-25)
xobs would like to attend CCCamp in Ziegeleipark Mildenberg, Germany (August 15-19), but unfortunately tickets to Camp are still uncertain. It's a long trip from Singapore that requires advance planning, so if anyone can advise on how to secure a ticket, please reach out to @xobs

If you happen to be attending any of these events, please look us up (assuming we can get tickets)!

Updates

It’s been a little while since our last release, but that hasn’t been for a lack of effort. Since then, we’ve run a new production lot and added a bunch of new features, bug fixes, and infrastructure improvements.

New Production

Remember back when we had a supply chain shortage? Today we are looking at the inevitable other half of that, a glut of components and a recession. My last update in February referenced arm-twisting to get parts out of vendors. Now, as I write this, the same vendors are arm-twisting me back to get me to take additional parts that I don’t immediately need, because global demand has softened and they want to meet their quarterly sales numbers.

You’d think we’d learn by now how to avoid this, but empirically we can’t: the better we get at producing things on short notice, the less stable the system becomes. "More and faster" is an equivalent way of saying "less gain and phase margin" in feedback analysis, which means wilder oscillations in response to changes. It’s one of the prices humanity pays for "fast tech" (fast as in fashion, not as in clock rates).

It had been a while since our previous production run, so I made the trip to South Korea to make sure all the material and processes were still in place at the factory to run the new lot. The good news is that the factory experienced low turnover, and even upgraded some of its equipment, so the production run itself went quite smoothly.

Since I last visited, the factory upgraded its automated optical inspection (AOI) system, shown above. In AOI, every assembled unit is optically compared against a reference unit. In addition to looking for solder defects and out-of-place parts, it also checks part numbers, orientation, and labels. In terms of "trustability", this is roughly the electronics assembly industry’s standard practice: read the label, and if it looks right, trust it. What could go wrong? ¯\_(ツ)_/¯

I made the trip short to keep travel costs down, but unfortunately during the post-assembly testing an interesting problem cropped up. Initially, over 60% of the units built were failing test. I’ll save you the gut-wrenching details of the ensuing three weeks of analysis and regression testing: let’s just say I had a crash course in principal component analysis, decision trees, and classifiers. Big data techniques are kind of neat, but also very fallible. They’re good for creating magic wands that seem to solve a problem, until they don’t, because big data doesn’t get to the root cause. It’s just very good at grouping symptoms into clusters, and the quality of analysis is only as good as the quality of the underlying normalizations built into the data collection and pre-processing that are required to get the techniques to converge.

Anyways. It turns out the problem was metastability.

For those who are unfamiliar with the term, metastability is a fundamental undecidability problem encountered when mapping analog voltages to digital 1’s and 0’s. When flipping a coin, we generally expect it to result in either a heads or a tails — coins are "digital" like that. Except they aren’t: coins also have an edge. Metastability is that period of time where a flipped coin rolls around on a table before settling into a heads or a tails. Most of the time, it settles quickly, but, every now and then the coin rolls off the table because it lands squarely on its edge.

This is especially problematic in a clocked digital system, where we expect to take that same coin and flip it at regular intervals. If the coin is still rolling on its edge before the next interval comes, what do you call it? Heads? Tails?

In order to solve this problem, we take a metaphorical hand and slam it onto the coin before the next toss, to force the coin into a heads or tails state. In circuit terms, this is done by sampling that analog signal with a flip flop. In mathematical terms, it doesn’t guarantee the coin always goes to a heads or tails: it merely decreases the probability that it’s still on the edge. When I learned this subject in college, I was taught that you always use two registers to sample an analog signal, because at rates of hundreds of millions of samples per second, you will occasionally see the first register go undecided; a second register reduces the probability of that undecided state coming through exponentially.
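To put rough numbers on that "exponentially", here is a minimal sketch of the textbook synchronizer estimate, MTBF = exp(t_resolve / tau) / (t_window * f_clk * f_data). The function names and every parameter value in the usage note are illustrative assumptions, not measurements from this hardware, and the stage model counts only the settling time between flip flops.

```rust
/// Textbook mean-time-between-failures estimate for a synchronizer:
/// MTBF = exp(t_resolve / tau) / (t_window * f_clk * f_data)
/// - tau: settling time constant of the flip flop (process dependent)
/// - t_window: width of the vulnerable sampling window
fn synchronizer_mtbf_seconds(
    t_resolve_s: f64,
    tau_s: f64,
    t_window_s: f64,
    f_clk_hz: f64,
    f_data_hz: f64,
) -> f64 {
    (t_resolve_s / tau_s).exp() / (t_window_s * f_clk_hz * f_data_hz)
}

/// Simplified settling time for `stages` flip flops in series: every flop
/// after the first gives the previous one a clock period, minus setup and
/// clock-to-Q overhead, to make up its mind.
fn resolve_time_s(stages: u32, clk_period_s: f64, overhead_s: f64) -> f64 {
    f64::from(stages.saturating_sub(1)) * (clk_period_s - overhead_s)
}
```

With an assumed 10 ns clock, 1 ns of overhead, and tau = 100 ps, going from 2 to 3 stages multiplies the estimate by exp(90). The same formula also shows the weakness: MTBF depends exponentially on tau, so a process corner that settles more slowly eats the margin just as fast.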

That bit of knowledge worked for 30 years, until I decided to ship a TRNG…

You see, a ring oscillator-based TRNG is basically a device that "flips" coins by balancing them on their edge, and then smashing them down to get a random heads or tails outcome. In other words, the TRNG core deliberately puts a digital circuit into its metastable state. Metastability is a feature not a bug in this case. This is all well and good if the coins are thin and flat. However, in the case of this new production batch, our coins had a "fat edge": the P and N transistor thresholds cornered such that they were both well-balanced, and non-overlapping. This meant that our digital coins had a propensity to stay on their edges, even after smashing them down twice.

The result is that the signal which indicates whether the OS is allowed to read a random number out of the TRNG is undecided. That undecided state propagates all the way into the bus arbiter for the SoC and, lo and behold, you have a system lockup: a failure to boot during factory test.

It also turns out this problem manifests in other places where analog signals are sampled, such as the I2C block. "But I2C is digital", you say. Except, have you ever seen the rising edge of an I2C bus? It’s a big capacitive bus pulled up by a weak resistor, so the rise time is on the order of several microseconds (thousands of nanoseconds). That means it spends quite a bit of time transitioning in that "no man’s land" of neither zero nor one when you’re sampling it at the sysclk rate of 100 MHz (10 ns period).
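As a rough illustration of how long that "no man’s land" lasts, the sketch below models the bus as a simple RC charge curve and counts how many sysclk samples land between the low and high input thresholds. The RC value in the usage note is an assumption, and the 30%/70% thresholds are the I2C specification's logic levels rather than the actual switching points of this FPGA's input buffers.

```rust
/// Count clock samples taken while an RC-charged line sits between two
/// thresholds. Voltages are normalized to the supply (0.0 to 1.0).
fn samples_in_undefined_region(
    rc_ns: f64,
    clk_period_ns: f64,
    v_low: f64,
    v_high: f64,
) -> u32 {
    assert!(rc_ns > 0.0 && clk_period_ns > 0.0);
    assert!(v_low < v_high && v_high < 1.0);
    let mut count = 0;
    let mut n: u32 = 0;
    loop {
        let t_ns = f64::from(n) * clk_period_ns;
        // Exponential rise of a capacitor charged through a pull-up resistor.
        let v = 1.0 - (-t_ns / rc_ns).exp();
        if v >= v_high {
            break; // the line now reads as a solid "1"
        }
        if v > v_low {
            count += 1; // neither a clean 0 nor a clean 1
        }
        n += 1;
    }
    count
}
```

Calling `samples_in_undefined_region(2000.0, 10.0, 0.3, 0.7)`, that is, an assumed RC of 2 µs sampled every 10 ns, yields well over a hundred consecutive samples in the undefined band on every rising edge.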

The upshot is that while the place and route tools for the FPGA are great at doing static timing analysis of well-behaved signals, there is no modeling for metastability behavior. So, the tools are perfectly happy to grind out routed solutions for your logic that aggravate metastability in some cases and work great in others.

This was the final factor that made root cause analysis really tricky — only about 1 in 10 of the bitstreams generated by our tool would fail on the new production lot (and they would almost never fail on the previous lot). I just "happened to get lucky" and the one bitstream I used for this production run was in the 10% that failed, and the material in this lot had transistors that were metaphorically "fat-edged coins".

Of course, when I re-ran the place & route the first time, the problem "went away". But, I am experienced enough to not be satisfied with that outcome, and instead see it as a very big red flag. Without understanding the root cause, eventually I’d push a release that contained the problem and we’d brick 60% of the customer units in the field! (cue weeks of principal component analysis…)

Fortunately, we have a pretty good CI system that logs all the past builds, so I was able to "go backwards in time" as well as generate more builds, cross-correlate build outcomes against hardware lots, and eventually determine that metastability was the root cause of failure, and address it at a design level.

The design remediation included minimizing the logic paths exposed to metastability, using more flip flops to synchronize the data (so instead of 2 stages, we used 3- or 4-stage synchronizers), and ultimately some software patches, particularly in the case of I2C where, despite synchronizers, de-noisers, and design hardening, it would still fail on some occasions. Because I2C is a fairly simple protocol, it’s statistically likely that you can decode a valid start/stop state or even a full transaction out of random noise. Consider that once an N-byte write transfer has been set up, all you need is 9 clock edges with an ACK bit to tack another byte onto an auto-incrementing write (this specifically caused garbage to be written to the RTC registers when the last bus operation was a write to set a wake-up alarm, and power was decaying over millisecond time scales). Fortunately, the I2C code was already written with a time-out parameter; it simply meant that instead of calling .unwrap() on the routine, I installed an error handler that reset the I2C controller on time-out, and then retried the operation.
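The general shape of that software patch, replacing .unwrap() with a handler that resets and retries, can be sketched as follows. This is not the actual Xous driver code: the `I2cBus` trait, the `I2cError` type, `write_with_retry`, and the `FlakyBus` mock are all hypothetical names made up for the illustration.

```rust
#[derive(Debug, PartialEq)]
enum I2cError {
    Timeout,
    Nack,
}

/// Hypothetical bus interface; a real driver's API will differ.
trait I2cBus {
    fn write(&mut self, addr: u8, data: &[u8]) -> Result<(), I2cError>;
    fn reset(&mut self);
}

/// Instead of `bus.write(..).unwrap()`, reset the controller on a
/// time-out and try again, up to `max_attempts` times in total.
fn write_with_retry<B: I2cBus>(
    bus: &mut B,
    addr: u8,
    data: &[u8],
    max_attempts: usize,
) -> Result<(), I2cError> {
    for _ in 0..max_attempts {
        match bus.write(addr, data) {
            Ok(()) => return Ok(()),
            Err(I2cError::Timeout) => bus.reset(),
            // Other errors are not caused by a wedged controller; pass them up.
            Err(other) => return Err(other),
        }
    }
    Err(I2cError::Timeout)
}

/// Mock bus for experimenting: times out a fixed number of times, then works.
struct FlakyBus {
    device_addr: u8,
    timeouts_left: u32,
    resets: u32,
}

impl I2cBus for FlakyBus {
    fn write(&mut self, addr: u8, _data: &[u8]) -> Result<(), I2cError> {
        if addr != self.device_addr {
            return Err(I2cError::Nack); // nobody home at that address
        }
        if self.timeouts_left > 0 {
            self.timeouts_left -= 1;
            return Err(I2cError::Timeout);
        }
        Ok(())
    }

    fn reset(&mut self) {
        self.resets += 1;
    }
}
```

Bounding the number of attempts matters here: if the bus is genuinely dead, the caller still gets an error back instead of looping forever.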

I was a little worried that simply adding more stages of synchronizers was a gross hack papering over something more fundamental, but I recently learned that in later process nodes, standard cell libraries feature four-deep flip flop synchronizers as a hard primitive. So, I guess I’m not alone in being bitten by the metastability bug, just late to the game.

With all these fixes in place, I was able to run a hundred trial compilations of the bitstream and test each one on real hardware with zero failures (hooray for hardware-in-the-loop CI!), and so I’m almost confident we won’t have this problem again. But metastability is one of those skeletons you can’t ever bury. You can only put it in a closet and then put that closet in a closet, and pray every night before you go to bed that you never hear an unexpected thump coming from that direction.
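For a feel of what a hundred clean builds does and doesn't prove, here is a small back-of-the-envelope sketch. It assumes each build is an independent trial, which is a simplification, and the function names are made up for the illustration.

```rust
/// Probability that `trials` independent builds all pass when each one
/// fails with probability `p_fail`.
fn chance_all_pass(p_fail: f64, trials: u32) -> f64 {
    (1.0 - p_fail).powf(f64::from(trials))
}

/// Largest per-build failure rate still consistent, at the given
/// confidence level, with seeing zero failures in `trials` builds.
fn failure_rate_upper_bound(trials: u32, confidence: f64) -> f64 {
    1.0 - (1.0 - confidence).powf(1.0 / f64::from(trials))
}
```

If the old 1-in-10 failure rate still applied, `chance_all_pass(0.10, 100)` says a clean run of 100 would happen by luck only about 3 times in 100,000. On the other hand, `failure_rate_upper_bound(100, 0.95)` comes out just under 3%, so a hundred passes cannot rule out a residual failure rate of a couple of percent. Hence "almost".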

Xous 0.9.13

We’ve also shipped Xous 0.9.13. It addresses a lot of issues, but there is still a lot more to do. Unfortunately, with all the travel, conferences, and vendor meetings coming back in force, we’ve had less time to devote to writing code. Covid-zero is over in China, and Shenzhen is open again. So, while the pandemic was great for shipping lines of code, the resumption of normal hardware supply chain operations means we’ll be splitting our time more evenly between software and hardware issues. In other words, expect the cadence of our releases to have a somewhat slower pace going forward.

Here’s the run-down of what you can find in this release.

Hardware & Loader Improvements

Multiple VexRiscv core patches
- Fix D$ virtual memory flush bug
- Fix ebreak instruction
SoC yield bugs fixed
- Requires an update to usb_update.py, precursorupdater, as the CPU debug port is replaced with a simple reset-halt mechanism.
- Metastability harden I2C & TRNG
- Handle I2C timeouts.
- Move I/O blocks into always-on domain to avoid clock stoppage during wakeup ops
I2C fixes
- Ensure that the RTC does not interpret line noise during shutdown as garbage by having the very last command issued be a read to the RTC.
- Harden the RTC handler such that if junk corrupts the RTC it doesn't loop forever being confused about the junk data.
Service implementations removed from crates.io (API crates still published to crates.io) -- nobody is using the implementation crates it seems, and they are very hard to maintain.
Wifi firmware blob bumped to 3.16.0 (will trigger an EC update). EC LiteX design also brought into compliance with deprecated LiteX APIs, and toolchain modernized.

Multi-Platform Support

Preliminary Cramium SoC and FPGA targets incorporated
atsama5d27 target support via PRs from Foundation Devices. Xous is now booting on the ATSAMA5D27-SOM1-EK1 dev board!
Xous loader has been refactored, optimized, and made more portable
- Thanks to @southpawflow for reporting loader errors and providing the test case files

Kernel Improvements

USB serial support
- console logs can now be viewed via USB serial with the shellchat command usb console (usb noconsole to turn off). You will need a terminal client that is capable of CRLF translations.
- TRNG can be set to emit raw binary data over USB serial with usb trng (usb notrng to turn off). This should be compatible with existing methods to extract randomness from USB dongles such as the OneRNG (looking for an existing HW RNG dongle user to test and confirm compatibility with their existing system!).
@xobs has added the feature --gdb-stub to the build. When selected, the kernel is built with GDB support over serial port. This works well in Renode. To try it in hardware, one must first run console app inside shellchat to activate the GDB UART (otherwise the sole serial port is connected to the console log).
@xobs added graphical panic output to the kernel. This is different from a guru meditation in that it only happens when the kernel itself panics. Kernel panic messages have a different look and feel, because they have to be rendered with a very resource-limited graphics interface.
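For anyone testing the usb trng mode mentioned above, a first sanity check on the host side might look like the sketch below. It computes the fraction of 1 bits in whatever byte stream it is handed. The function name is made up for the illustration, how the serial device is opened on the host is left as an assumption, and a bit count like this is only a smoke test, not a statistical validation of the TRNG.

```rust
use std::io::{self, Read};

/// Fraction of 1 bits in everything read from `source`.
/// Returns `None` if the source produced no bytes at all.
fn ones_fraction<R: Read>(mut source: R) -> io::Result<Option<f64>> {
    let mut buf = [0u8; 4096];
    let mut ones: u64 = 0;
    let mut bits: u64 = 0;
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        for byte in &buf[..n] {
            ones += u64::from(byte.count_ones());
        }
        bits += 8 * n as u64;
    }
    if bits == 0 {
        Ok(None)
    } else {
        Ok(Some(ones as f64 / bits as f64))
    }
}
```

A serial port never reports end of stream, so in practice the reader would be wrapped with `Read::take` to cap the number of bytes, for example `ones_fraction(port.take(1_000_000))`, where `port` is whatever handle the host uses for the serial device. A healthy source should land very close to 0.5.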

UX and App Improvements

app-image-xip is now the release configuration. This puts a couple services in "execute in place" (XIP) mode, running out of FLASH instead of being RAM-resident, thus freeing up more RAM for apps. Users building custom images are advised to use app-image-xip instead of app-image to avoid OOM behaviors.
Wifi MAC address now added to wifi preferences status screen
All-0 app_id in U2F no longer prompts for a save record
PDDB shellchat writeall command (thanks to @pakl)
Add Force Update option for the EC. This required migrating the time server runtime into the DNS process space, due to exhaustion of connection IDs in the status process space.
In vault, deleting a password and saving the record with the blank password triggers a password generation dialog box (useful for updating passwords)
"Lock device" now sleeps after reboot (thanks to patches by @gsora)
Hosted mode now runs more smoothly, with less lag (thanks @yvt for the patch!)

Getting Support on Issues

That’s it for now!

Our schedules are already looking fairly packed this summer, so we’ll be prioritizing issues and PRs based on urgency. Please help us provide better support by using our issues interface to report problems, instead of casually mentioning them in our Matrix channel. It’s too easy to forget about problems that are mentioned in a direct message or a chat window, especially when travelling and/or dealing with jetlag.

Hopefully we’ll bump elbows this summer at one of the two events we’re aiming to be at!

Happy hacking,

-bunnie & xobs

Precursor

Mobile, Open Hardware, RISC-V System-on-Chip (SoC) Development Kit
