Traverse Ten64

An eight-core ARM64 networking platform with mainline Linux support

Nov 16, 2020

Crypto & AI Acceleration

by Mathew M

One key differentiator between Ten64 and general-purpose or media-oriented appliances is the set of networking-oriented acceleration capabilities in Ten64’s LS1088 System-on-Chip.

The previous 10G Options & Performance post described some of the options available to improve packet routing performance - all the way up to the programmable offload engine (AIOP).

There are two other workloads you can accelerate on Ten64. In this post, we will describe how Ten64 can accelerate cryptography (important for VPNs) and AI workloads using an AI acceleration card.

Cryptographic & VPN Acceleration

The LS1088 SoC provides two separate methods of cryptography acceleration:

Method 1: Acceleration via the ARMv8 cryptography extension

This provides acceleration for AES, SHA-1, SHA-224, and SHA-256. It is analogous to AES-NI in most modern x86 processors. This is an optional extension that is not present on all ARM processors, but it is present on the LS1088. You can check whether it is available on your ARM machine by looking at the Features flags in /proc/cpuinfo:

# Ten64 supports AES, SHA1, SHA2, and PMULL (polynomial long multiply)
$ cat /proc/cpuinfo  | grep Features | head -n 1
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
# Raspberry Pi 3 and 4 do not implement the crypto extension
$ cat /proc/cpuinfo  | grep Features | head -n 1
Features        : fp asimd evtstrm crc32 cpuid
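The flag check above can also be scripted. The following is a minimal sketch, assuming the four flag names shown in the Ten64 output (aes, pmull, sha1, sha2) are the ones of interest; the function name `missing_crypto_flags` is ours, not part of any library.

```python
# Flags reported in /proc/cpuinfo when the ARMv8 crypto extension is present
# (taken from the Ten64 output above; treat this list as an assumption).
CRYPTO_FLAGS = {"aes", "pmull", "sha1", "sha2"}


def missing_crypto_flags(cpuinfo_text):
    """Return the crypto flags absent from the first 'Features' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("Features"):
            flags = set(line.split(":", 1)[1].split())
            return CRYPTO_FLAGS - flags
    # No Features line at all (for example, on a non-ARM machine).
    return set(CRYPTO_FLAGS)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            missing = missing_crypto_flags(f.read())
    except OSError:
        missing = set(CRYPTO_FLAGS)
    if missing:
        print("missing crypto flags:", ", ".join(sorted(missing)))
    else:
        print("ARMv8 crypto extension flags present")
```

Given the two Features lines above, the Ten64 line yields an empty set, while the Raspberry Pi line reports all four flags as missing.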

To illustrate the difference, we ran OpenSSL’s speed benchmark on the Ten64, and Raspberry Pi 3 and 4. The Raspberry Pi 3 also uses the Cortex-A53 core (like the LS1088), but does not have the ARMv8 crypto extension.

The newer Raspberry Pi 4 uses the Cortex-A72 - a faster, out-of-order core, but also lacks the cryptography extension.

In these results, the lack of AES acceleration is a major handicap: the LS1088 is 18-22x faster in this particular use case.

The ARMv8 cryptography extension is used by OpenSSL and wolfSSL, and by the Linux kernel through the arm/sha*-ce kernel modules, so most applications using these libraries should be able to take advantage of it.

Method 2: Acceleration via the NXP SEC engine

The NXP SEC engine (also known as CAAM) is NXP’s encryption acceleration block. It is designed to accelerate communications workloads like IPSec, as well as some earlier versions of TLS and ciphers used in standards such as 3G/UMTS (Kasumi, Snow) and Wi-Fi. It also implements some older, but still relevant standards such as RSA and 3DES.

The SEC engine is best at accelerating packets going to and from the network stack in the kernel (or similar environments such as DPDK). Latency is higher than with the ARMv8 crypto extensions, which are part of the CPU instruction set, because data packets need to be transferred in and out of the engine via DMA.

It is possible to use SEC from userspace, using mechanisms such as cryptodev, but you might end up with better performance using the CPU instructions.

IPSec throughput comparison between ARMv8 crypto and SEC engine. We anticipate the SEC engine throughput can be improved even further in the future.

Nonetheless, you can get some impressive performance from the SEC engine for IPSec workloads, because it can accelerate not only the encryption cipher but also a chain of related operations such as AEAD and HMAC, as can be seen in /proc/crypto when the SEC drivers are compiled into the kernel:

$ cat /proc/crypto | grep aes | grep caam
driver       : cmac-aes-caam
driver       : xcbc-aes-caam
driver       : seqiv-authenc-hmac-sha512-rfc3686-ctr-aes-caam
driver       : authenc-hmac-sha512-rfc3686-ctr-aes-caam
driver       : seqiv-authenc-hmac-sha384-rfc3686-ctr-aes-caam
driver       : authenc-hmac-sha384-rfc3686-ctr-aes-caam
driver       : seqiv-authenc-hmac-sha256-rfc3686-ctr-aes-caam
driver       : authenc-hmac-sha256-rfc3686-ctr-aes-caam
driver       : seqiv-authenc-hmac-sha224-rfc3686-ctr-aes-caam
driver       : authenc-hmac-sha224-rfc3686-ctr-aes-caam
driver       : seqiv-authenc-hmac-sha1-rfc3686-ctr-aes-caam
driver       : authenc-hmac-sha1-rfc3686-ctr-aes-caam
driver       : seqiv-authenc-hmac-md5-rfc3686-ctr-aes-caam
driver       : authenc-hmac-md5-rfc3686-ctr-aes-caam
driver       : echainiv-authenc-hmac-sha512-cbc-aes-caam
driver       : authenc-hmac-sha512-cbc-aes-caam
driver       : echainiv-authenc-hmac-sha384-cbc-aes-caam
driver       : authenc-hmac-sha384-cbc-aes-caam
driver       : echainiv-authenc-hmac-sha256-cbc-aes-caam
driver       : authenc-hmac-sha256-cbc-aes-caam
driver       : echainiv-authenc-hmac-sha224-cbc-aes-caam
driver       : authenc-hmac-sha224-cbc-aes-caam
driver       : echainiv-authenc-hmac-sha1-cbc-aes-caam
driver       : authenc-hmac-sha1-cbc-aes-caam
driver       : echainiv-authenc-hmac-md5-cbc-aes-caam
driver       : authenc-hmac-md5-cbc-aes-caam
driver       : gcm-aes-caam
driver       : rfc4543-gcm-aes-caam
driver       : rfc4106-gcm-aes-caam
driver       : ecb-aes-caam
driver       : xts-aes-caam
driver       : rfc3686-ctr-aes-caam
driver       : ctr-aes-caam
driver       : cbc-aes-caam

(For a full output from /proc/crypto, see the cryptographic acceleration page in the Ten64 manual.)
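The grep above can be reproduced in a script, for example to check at startup that the SEC drivers are registered. This is a rough sketch, assuming the "key : value" layout that /proc/crypto uses and the "caam" naming shown in the output above; the helper name `caam_drivers` is hypothetical.

```python
def caam_drivers(proc_crypto_text, cipher="aes"):
    """List driver names in /proc/crypto text that mention both
    `cipher` and 'caam', mirroring `grep aes | grep caam`."""
    drivers = []
    for line in proc_crypto_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "driver":
            name = value.strip()
            if "caam" in name and cipher in name:
                drivers.append(name)
    return drivers


if __name__ == "__main__":
    try:
        with open("/proc/crypto") as f:
            names = caam_drivers(f.read())
    except OSError:
        names = []
    # Combined cipher + HMAC chains are the ones prefixed with "authenc".
    chained = [n for n in names if "authenc" in n]
    print(f"{len(names)} AES drivers backed by CAAM, {len(chained)} of them chained")
```

An empty list suggests the SEC drivers are not loaded or not compiled into the running kernel.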

IPSec may not be the easiest VPN solution to use (especially in the face of alternatives like OpenVPN and WireGuard), but this is balanced by its ubiquity (many operating systems and network appliances implement it) and its ability to leverage hardware offloads such as the SEC engine.

AI acceleration

If you work with machine learning and AI, you may be interested to know that the Coral AI EdgeTPU cards work in the Ten64. The Coral PCIe cards are available in both Mini PCIe and M.2 form factors.

The Coral Mini PCIe card installed on a Ten64 board

While we haven’t had an opportunity to piece together an AI/ML demo of our own, the TensorFlow Lite image classification example shows an impressive speedup:

Unaccelerated (CPU only)

----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
140.4ms
138.9ms
139.1ms
139.3ms
139.3ms
-------RESULTS--------
Ara macao (Scarlet Macaw): 0.77734

EdgeTPU accelerated

----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
12.6ms
2.5ms
2.4ms
2.4ms
2.4ms
-------RESULTS--------
Ara macao (Scarlet Macaw): 0.77734

That is a speedup of more than 50x, which opens up possibilities involving real-time processing, such as classifying objects from a video feed.
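The speedup figure can be checked from the timings listed above. The sketch below assumes the first inference in each run should be skipped as warm-up (for the Edge TPU it includes loading the model, as the note in the output says) and averages the remaining four; the variable and function names are ours.

```python
from statistics import mean

# Inference times in milliseconds, copied from the two runs above.
cpu_ms = [140.4, 138.9, 139.1, 139.3, 139.3]
tpu_ms = [12.6, 2.5, 2.4, 2.4, 2.4]


def steady_state_mean(times_ms):
    """Average the timings, skipping the first (warm-up) inference."""
    return mean(times_ms[1:])


speedup = steady_state_mean(cpu_ms) / steady_state_mean(tpu_ms)
print(f"steady-state speedup: {speedup:.1f}x")
```

With these numbers the steady-state averages are about 139.2 ms and 2.4 ms, which works out to a speedup of roughly 57x.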

For information on how to set up a development environment for the Coral EdgeTPU, see our application note.

