Oct 18, 2022

Project update 31 of 39

Xous 0.9.10: Filing off Rough Edges

by bunnie

The development theme for the past two months has been "gloss": things that improve the user experience or clean up technical debt in the code base. Stuff like clearer documentation, performance improvements, working with community developers to pull in staged features, and better CI coverage.

Gloss is one of those activities where it’s hard to know when you’re done. It’s like trying to polish a fractal: one would think it has a well-defined terminus because it has a bounded volume, but with a little scrubbing one soon discovers that the surface area is actually infinite.

That’s actually a bit of a lie: it’s more like as you polish a surface, suddenly you’ve added more to the volume of the object, which creates more surfaces to polish. In this case, we added about 300k of binary object code (and removed about 225k) for a net additional "mass" of 75k. I’d like to keep Xous at size equilibrium going forward, but I do feel there are still some badly-needed new features (such as a preferences manager), which add yet more surfaces to secure and more UX to QC.

In practice what this means is the scope of new releases may be smaller, and the cadence may slow down as well. Every new feature grows the surface of things to test, and if the project is to be sustainable we’re going to have to adjust our approach to releases.

That being said, here are some of the most significant changes in this release.

Simplified Updates

Update flow has been simplified. There is just one Python script to run on all platforms: precursorupdater, which is published to PyPI. Be sure you’re running 0.0.7 or later. Thanks to @neutralinsomniac for helping to sort out packaging issues: the Python ecosystem is frustrating to figure out, so a template that "just works" is appreciated.

The updater should now stage all the artifacts (SoC, loader, Xous, and EC); and Xous will now guide users through all the necessary post-update steps automatically.

Please remember to reset your device by pushing a paperclip into the hard reset hole in the lower right hand corner after every update. This ensures that the SoC picks up any changes; otherwise, you may end up in a loop where the system asks you to update the SoC over and over again, or in a state where it complains about a signature mismatch.

UX Improvements

A large number of UX improvements made it into this release:

- WLAN submenu added, thanks to a huge effort by @gsora.
  - Scan and configure APs
  - Add/delete configurations
  - Power off WiFi puts the WiFi into hardware reset. This greatly reduces sleep current, at the expense of a long wait to re-connect WiFi when turned back on.
- Import TOTP/passwords into vault over USB
  - Huge thanks to @gsora for this PR.
  - Supports Bitwarden and Google Authenticator (thanks to @zeldovich for Google Authenticator support)
  - The same tool can also back up TOTP and password secrets. However, the backed-up secrets are stored in plaintext. For encrypted backup, see the PDDB backup tool.
  - See apps/vault/tools/vaultbackup-rs/README.md for more details.
- HOTP support added to vault
- ball demo removed from default apps
- Unlock PIN can now be changed

Fixes to Backups and to the PDDB

The backup flow now adds checksums to the backup file. This allows off-line scripts (such as the backalyzer) to check for media errors and backup file format issues, without requiring password disclosure.

The underlying AEADs (Authenticated Encryption with Associated Data, e.g. AES-GCM-SIV in this case) guarantee authenticity, but checking them requires typing passwords into a potentially less trusted device. Naked checksums give us a quick way to make sure everything ran smoothly, protecting against non-malicious adversaries such as bad USB cables, entropy, and poor UX decisions on our part. Unfortunately, a user reported total data loss because we did not make it clear that a fresh backup must be prepared before every backup run, so this was a hard-learned lesson. The new flow should prevent this from ever happening again, and also hardens against other common failure modes.

The main downside of adding checksums is that it now takes a minute for Precursor to checksum its entire 98MiB PDDB region, and this is done before every backup.
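To illustrate how unkeyed checksums let an off-line script spot media errors without any password, here is a minimal sketch. It is not the actual backup format: this update does not say which checksum algorithm or block size the backup file uses, so the CRC-32 routine, the per-block layout, and the function names below are all assumptions chosen for brevity.

```rust
/// Bitwise CRC-32 (IEEE 802.3 polynomial in reflected form, 0xEDB88320).
/// ASSUMPTION: the real backup flow may use a different checksum entirely.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Checksum every fixed-size block of a (still encrypted) backup image.
/// `block_len` is a hypothetical parameter and must be non-zero.
fn block_checksums(image: &[u8], block_len: usize) -> Vec<u32> {
    image.chunks(block_len).map(crc32).collect()
}

/// Return the indices of blocks whose checksum no longer matches the
/// value recorded at backup time. No key or password is involved.
/// Only blocks present in both lists are compared.
fn damaged_blocks(image: &[u8], block_len: usize, recorded: &[u32]) -> Vec<usize> {
    block_checksums(image, block_len)
        .iter()
        .zip(recorded)
        .enumerate()
        .filter(|(_, (now, then))| now != then)
        .map(|(index, _)| index)
        .collect()
}
```

A checker along these lines only detects accidental corruption. It says nothing about authenticity, which remains the job of the AEAD.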

The PDDB also incorporated a number of fixes:

- Cache coherence issues between FLASH write and read fixed
- FSCB SpaceUpdate records are periodically flushed to prevent deniability leakage via free space consumption patterns
- AES-keywrap library had a bug, now fixed. The PDDB should transparently upgrade to the fixed keywrapping protocol.

Robustified Networking

Apparently, Rust has a set of standard tests for std networking, which we just discovered in this release cycle. They’ve all been ported into the net crate. You can run these tests by rebuilding Xous with the nettest feature and then running net test. These tests found a lot of subtle bugs in the implementation that are now fixed:

- Error codes are now compliant with library standards
- Nonblocking sockets added
- Peek fixed
- Short reads fixed -- no more discarding of excess data when the read buffer is shorter than the data in the buffer
- TcpClose fixed -- waits until the close handshake finishes before removing the socket
- Handle "unattended" closes -- this is when someone connects to a server and immediately disconnects, so the application layer never has time to issue a "formal" close request. Now it is automatically issued.
- Fix timeouts
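The short-read fix matters because ordinary Rust code relies on the `std::io::Read` contract: a `read` into a small buffer returns only part of the pending data, and the rest must still be there on the next call. The sketch below shows the kind of loop that depends on this. It is generic over any `Read` implementation (a `TcpStream` is one), and the function name is made up for illustration; it is not taken from the Xous net crate or its tests.

```rust
use std::io::{self, Read};

/// Drain `reader` through a deliberately small buffer and return every
/// byte that arrived. If the underlying socket discarded the excess on a
/// short read, the returned message would be missing data.
/// `chunk_len` should be non-zero; a zero-length buffer reads nothing.
fn read_all_in_small_chunks<R: Read>(mut reader: R, chunk_len: usize) -> io::Result<Vec<u8>> {
    let mut chunk = vec![0u8; chunk_len];
    let mut message = Vec::new();
    loop {
        match reader.read(&mut chunk) {
            // Zero bytes means end of stream (for TCP, the peer closed).
            Ok(0) => return Ok(message),
            // A short read: keep what we got and ask for the remainder.
            Ok(n) => message.extend_from_slice(&chunk[..n]),
            // Interrupted reads are retried, as std's own helpers do.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

With a four-byte buffer and a sixteen-byte message, the loop makes several short reads and should still reassemble the full message.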

Refactored `xtask`; Modularization and Crating of Xous

xtask, the "Makefile" equivalent for Xous, has been cleaned up and refactored. As the build manager, xtask is, by definition, one of the oldest modules in the system, and "Exhibit A" for examples of how not to write idiomatic Rust code. The refactor is still probably not professional-grade idiomatic Rust, but at least we use builder patterns now instead of stacking function arguments ad nauseam.
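For readers unfamiliar with the idiom, here is a small sketch of a builder pattern replacing a long positional argument list. The `BuildPlan` and `BuildPlanBuilder` types and their fields are hypothetical and are not the actual xtask code; only the general shape of the pattern is the point.

```rust
/// Hypothetical description of one build; NOT the real xtask types.
#[derive(Debug, Default, Clone, PartialEq)]
struct BuildPlan {
    target: String,
    features: Vec<String>,
    services: Vec<String>,
    release: bool,
}

/// Builder that accumulates options one named call at a time, instead of
/// a function taking (target, features, services, release, ...) at once.
#[derive(Debug)]
struct BuildPlanBuilder {
    plan: BuildPlan,
}

impl BuildPlanBuilder {
    fn new(target: &str) -> Self {
        BuildPlanBuilder {
            plan: BuildPlan { target: target.to_string(), ..BuildPlan::default() },
        }
    }

    /// Enable one cargo feature for this build.
    fn feature(mut self, name: &str) -> Self {
        self.plan.features.push(name.to_string());
        self
    }

    /// Include one service in the image.
    fn service(mut self, name: &str) -> Self {
        self.plan.services.push(name.to_string());
        self
    }

    /// Choose between a release and a debug build.
    fn release(mut self, yes: bool) -> Self {
        self.plan.release = yes;
        self
    }

    /// Finish configuration and hand back the plan.
    fn build(self) -> BuildPlan {
        self.plan
    }
}
```

A call site then reads as `BuildPlanBuilder::new("riscv32imac-unknown-xous-elf").feature("nettest").service("net").release(true).build()`, where each option is named and new options can be added without touching existing callers.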

We’re also starting on a journey to modularize Xous so it can be ported more easily to new targets. This involves crating up a bunch of APIs and domiciling them on crates.io. This has the side effect of the build system pulling stuff from crates.io instead of using mostly local code. To counter this, xtask has grown a feature that checks the code in the "cargo cache" that was ostensibly used to build the kernel image, against the code in the local tree from which it was derived.

This was an interesting yak-shaving adventure that revealed numerous quirks, such as the fact that the Cargo.toml file goes through a full semantic re-write on export to crates.io, and files are munged for CR/LF normalization. This means that the code you get back from crates.io does not match byte-for-byte with the code you sent it. The CR/LF normalization was irritating but easily worked around.
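One plausible way to work around the CR/LF issue is to normalize both copies before comparing them, as in the sketch below. This is an assumption about the approach, not a copy of what xtask does, and the function names are invented. Both sides are normalized because this update does not say in which direction the line endings are rewritten.

```rust
/// Replace every CR LF pair with a single LF. Lone CR bytes are kept,
/// so only the line-ending convention is erased, nothing else.
fn normalize_line_endings(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\r' && bytes.get(i + 1) == Some(&b'\n') {
            // Skip the CR; the LF is pushed on the next iteration.
            i += 1;
            continue;
        }
        out.push(bytes[i]);
        i += 1;
    }
    out
}

/// Compare a file from the local tree with the copy found in the cargo
/// cache, ignoring differences that are purely CR LF versus LF.
fn same_source(local: &[u8], cached: &[u8]) -> bool {
    normalize_line_endings(local) == normalize_line_endings(cached)
}
```

Any difference that survives normalization, even a single byte, is still reported as a mismatch.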

However, the full re-write of the Cargo.toml file makes it very hard to prove equivalence of the manifest file. The cargo source code has over 1000 lines of high-complexity Rust code to turn your local manifest into an abstract tree and then re-export it into a normalized format, handling many strange edge cases. As far as I can tell, there is no simple tool to confirm that what is in crates.io matches what you intended to be there. Thus, we cannot confirm the congruence of the Cargo.toml for downloaded crates. This is a potent vector for tampering, because if you can change the Cargo.toml file you can swap out arbitrary dependencies and this would largely go undetected. To counter this, after CI builds we check that the Cargo.lock file has not changed, under the presumption that Cargo.toml differences would likewise modify Cargo.lock. This is mostly true, except that feature flags are not recorded in the lock file.
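As a rough illustration of the lock file check, the sketch below extracts `(name, version)` pairs from the text of a Cargo.lock and reports packages that appear after a build but were not locked before it. It is a simplified line scanner rather than a real TOML parser, the function names are made up, and it is not the actual CI check. It also shares the blind spot described above: feature flags do not appear in the lock file, so they are invisible here.

```rust
use std::collections::BTreeSet;

/// Read `key = "value"` from one line, returning the unquoted value.
fn quoted_value(line: &str, key: &str) -> Option<String> {
    let rest = line.trim().strip_prefix(key)?.trim_start().strip_prefix('=')?.trim();
    Some(rest.trim_matches('"').to_string())
}

/// Collect the (name, version) pair of every `[[package]]` entry.
/// Simplification: assumes `name` precedes `version` within an entry.
fn locked_packages(lock_text: &str) -> BTreeSet<(String, String)> {
    let mut packages = BTreeSet::new();
    let mut name: Option<String> = None;
    for line in lock_text.lines() {
        if line.trim() == "[[package]]" {
            name = None;
        } else if let Some(n) = quoted_value(line, "name") {
            name = Some(n);
        } else if let Some(v) = quoted_value(line, "version") {
            // The top-level lock format `version = N` has no name; skip it.
            if let Some(n) = name.take() {
                packages.insert((n, v));
            }
        }
    }
    packages
}

/// Packages present in `after` that were not locked in `before`.
fn lock_drift(before: &str, after: &str) -> Vec<(String, String)> {
    let old = locked_packages(before);
    locked_packages(after)
        .into_iter()
        .filter(|package| !old.contains(package))
        .collect()
}
```

An empty result means no package name or version changed between the two lock files; it does not prove that the manifests themselves are equivalent.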

One side effect of modularizing Xous is that you can now go to crates.io/docs.rs and get API docs for some of the most core crates in Xous:

- log service
- names based service discovery and access control
- suspend/resume
- ticktimer process scheduler
- xous core APIs
- xous-ipc Xous IPC calls

A downside of the modularization process is that crates are fetched from the Internet, adding even more crates.io to our supply chain attack surface. To be fair, I imagine the crates.io maintainers have much better opsec than I have, so they probably aren’t the weak link in the chain. However, I don’t know how to go about checking that assumption; I just really want to believe they are super competent, well-funded, and will say no to governments that make unreasonable demands.

Another downside is that even a tiny change in a fundamental crate like xous means that every other crate that depends on it must be republished with a new minor version number to capture that difference. In most Rust configurations, you would allow dependent crates to vary by a minor number, so this wouldn’t be a problem: you might specify xous = "0.9", which means any of 0.9.0, 0.9.1, … would match, and the build system would simply pick up the latest version available at the time you ran the build.

However, in Xous, we’d like to have our builds be reproducible across systems and over time, so we specify crates down to the minor number. This means that every time we make a tweak in xous and publish it, we have to also bump the version number of every crate that depends on xous — we basically have to republish the entire kernel, crate by crate!
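The difference between the two styles can be modeled in a few lines. This is a deliberately simplified sketch of version matching, not Cargo's resolver: it covers only a loose requirement like `xous = "0.9"` (which Cargo treats as any 0.9.x release) and an exact pin. Cargo expresses an exact pin with the `=` operator, as in `"=0.9.10"`; whether the Xous manifests use that exact syntax is not stated in this update, and the type and function names below are hypothetical.

```rust
/// (major, minor, patch)
type Version = (u64, u64, u64);

enum Requirement {
    /// Like `xous = "0.9"`: any 0.9.x release is acceptable.
    SameMinor { major: u64, minor: u64 },
    /// Like `xous = "=0.9.10"`: exactly one release is acceptable.
    Exact(Version),
}

impl Requirement {
    fn matches(&self, v: Version) -> bool {
        match self {
            Requirement::SameMinor { major, minor } => v.0 == *major && v.1 == *minor,
            Requirement::Exact(wanted) => v == *wanted,
        }
    }
}

/// Pick the newest published version that satisfies the requirement,
/// which is roughly what a fresh resolve without a lock file would do.
fn resolve(req: &Requirement, published: &[Version]) -> Option<Version> {
    published.iter().copied().filter(|v| req.matches(*v)).max()
}
```

With a loose requirement, the answer changes whenever a newer 0.9.x release is published. With an exact pin, the answer never changes, but every dependent crate has to be edited and republished to move to a new release.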

Unfortunately, this is not going to scale. We may ultimately have to relax the version numbering requirement to allow "weaker" version numbers, and simply rely on the sanctity of Cargo.lock to enforce reproducibility of builds. I suppose that is the role of the lock file after all; but I’m a little uneasy relying on Cargo.lock entirely. It is too easy to ignore it, and there is no warning emitted if the lock file is accidentally deleted and regenerated from scratch with a fresh set of dependencies.

TLS, Websockets, and Other Fixes

This update also includes a simple demo of TLS and Websockets inside the net_cmd.rs module of the shellchat. Our previous update discussed this development in detail, so I won’t cover it again.

Other fixes & features that made it into this release include:

- ditherpunk improvements (PR #207):
  - iterator form for PNG decoding (thanks to @nworbnhoj for a ton of work to get that together)
  - memory usage is well-constrained now, and suitable for everyday use
  - primary limit to PNG decode speed is read speed over e.g. TCP
- Performance monitoring framework added. See tools/perflib and services/shellchat/net_cmd.rs for examples of how to use it.
- Fix/close various old issues (in particular, RTC interrupts stripped out, and suspend lock failures now trigger a notification instead of a silent failure)
- Move RTC resume handler to the secure/private server - hopefully resolves a subtle susres failure case

Future Directions

Working on the cryptography code in rustls/ring, the performance profiling framework, and modularizing Xous reminded me how much I enjoy working with primitives that touch bare-iron. For a hardware engineer, getting Xous to this point has been an incredible journey — it’s been a wild ride through parts of the technology stack that tower thousands of feet above my comfort zone of solder and transistors.

Don’t get me wrong, writing Rust code has been a productive way to pass the time as the supply chain sorts itself out. We’re now almost two years on from when The Problems started, yet vendors are still reporting 50+ week lead times. This time, it’s not because they don’t have the capacity to produce; some vendors are just cutting back because they want to keep supply short, so that prices remain high. Everyone in the hardware industry is dreading the crash that comes when a glut of capacity meets a looming recession.

Still, I was optimistic that we would be able to resume regular production of Precursor soon. So far we’ve been able to meet demand with spot-buys of parts and chasing vendors for inventory on a weekly basis, but there was hope that the order of FPGAs we placed (and paid for, in full) over 13 months ago(!) would arrive "any day now". While writing this update, I’ve learned the distributor just tacked another 50 weeks onto the lead time, placing delivery in late September 2023. There are clearly some shenanigans going on… my first suspicion is that someone received the parts and realized they could 10x their money by selling them on the spot market to a higher bidder, at the cost of "delaying" my delivery. Needs more investigation…

All that being said, I’m eager to get back to something closer to hardware, rather than working my way further up the application stack. For example, the next logical step in the product development chain could be to start looking at the Signal or Matrix SDK. However, I took a peek there, and the code terrifies me. I imagine this is what Real Software looks like: enormous code bases written by well-paid professional programmers. Programmers not constrained by the laws of physics, where cryptography is Lego, not math, whose primary job is to cling to the body of a jumbo-jet code base laden with users, and bolt ever more mind-bending features onto it as it lofts itself to cruise altitude.

I feel I have reached the point where I should stop ascending the application stack, and instead focus my efforts on helping others more comfortable and experienced with Real Software to build apps for the platform. So, for the next release cycle, I’m going to continue to focus my efforts on maintaining the code we have: improving documentation, optimizing performance, trimming bloat, and fixing bugs. I’ll add new features as necessary to support other efforts to build applications such as crypto wallets and secure messaging, but for now, I have no plans to take a lead role in developing these applications.

A couple of things in particular that I’m looking forward to attempting are porting Xous to a more generic LiteX FPGA target, so that other developers can toy around with it in an application-neutral environment, and perhaps poking at rustls/ring a bit more and seeing if we can’t do an even better job of enabling a pure-Rust cryptography stack. I think I’ll also take another pass at improving the performance of the PDDB, so that the vault app has a smoother user experience as password databases expand beyond hundreds of entries. And perhaps most importantly, I want (need?) to put more effort into supporting and expanding our community of developers. I feel like we’re just reaching the point in the Precursor journey where success is measured not by the lines of code that I write, but rather by the lines of code I help others to write.

Happy Hacking!

-bunnie
