Project update 10 of 17
Hello everyone, and thank you for signing up and backing this project!
We learned about how Neural Networks can be mapped into the Vision FPGA SoM in the last update. Let’s look into some of the details in this update…
One of the primary applications of computer vision is to detect and/or localize everyday objects around us, and then act on where those objects are. Many of us have tried connecting a simple USB webcam to a laptop and running a TensorFlow or PyTorch object detection model for simple and sometimes not-so-simple robotics applications. However, the convolutional neural networks commonly used for computer vision involve billions of multiply-accumulate (MAC) operations, which results in high power consumption. Hence, we need tools for neural network quantization and compression that reduce the compute and memory cost of these networks without sacrificing detection accuracy. The need for quantization and compression is even greater when deploying object detection models on highly memory-constrained boards like our FPGA-based SoM. We are happy to share that our SoM consumes only about 30 mW while performing the not-so-easy task of human detection, and it does so at 8 frames per second. That means you can detect human beings 8 times per second!
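To get a feel for where the operations and the savings come from, here is a minimal Python sketch. It counts the MACs of one standard convolution layer and applies symmetric 8-bit quantization to a handful of weights. The layer dimensions, the weight values, and the function names are all made up for illustration, and the simple per-tensor scheme is an assumption, not necessarily the scheme used for the model on the SoM.

```python
def conv2d_macs(out_h, out_w, in_ch, out_ch, kernel):
    """MACs for one standard 2D convolution layer (bias ignored)."""
    return out_h * out_w * out_ch * kernel * kernel * in_ch


def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical layer: 3x3 kernel, 64 -> 64 channels, 56x56 output feature map.
layer_macs = conv2d_macs(56, 56, 64, 64, 3)

# Hypothetical weights, chosen only to show the round trip.
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("MACs for one layer:", layer_macs)
print("int8 weights:", q, "scale:", scale)
print("largest round-trip error:", max_error)
```

A single layer of this hypothetical size already needs over 100 million MACs per image. Storing each weight in 8 bits takes a quarter of the space of a 32-bit float, at the price of a rounding error of at most half a quantization step per weight.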
Details: The green LED lights up every time the image sensor captures an image. The red LED lights up when the neural network in the FPGA detects a human torso. The Vision FPGA SoM is mounted on a breakout board and hooked up to 2x AA batteries. The SoM is flashed with human detection code that works out of the box with no additional training required. The SoM toggles one GPIO whenever it captures an image and a different GPIO when it detects an object it recognizes.
A few gory details regarding the human detection demo.
The human detection model is trained on a dataset containing about 4400 annotated images (with bounding boxes around the upper torso). The model is a convolutional neural network based on SqueezeDet, which was originally published by Bichen Wu (UC Berkeley). Our model has about 68,500 parameters (8-bit quantized) and requires a little over 57 million operations. The training was carried out on Google Colab and took about 6 hours to finish (for about 25,000 epochs). Subsequently, the model was converted to a frozen inference graph and then into a binary file using Lattice Semiconductor’s SensAI Neural Network Compiler. The binary file for the human detection model and the RTL (Verilog) were programmed into the flash on the SoM. After following these steps, we enjoyed watching the LEDs respond to anyone’s presence in front of the SoM camera. With the above simple steps, you can also make your very own ultra-low power “Eye of Sauron”!
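The figures quoted in this update can be combined into a quick back-of-the-envelope budget. The short Python sketch below only does arithmetic on those numbers. The variable names are ours, the 57 million operations are assumed to be per frame, and the 32-bit float comparison is an assumption added for contrast.

```python
# Figures quoted in this update.
PARAMS = 68_500            # model parameters, 8-bit quantized
BITS_PER_PARAM = 8
OPS_PER_FRAME = 57e6       # "a little over 57 million operations" (assumed per frame)
FRAMES_PER_SECOND = 8
POWER_WATTS = 0.030        # "about 30 mW" for the SoM running the demo

# Storage needed for the weights alone.
weight_bytes = PARAMS * BITS_PER_PARAM // 8
# Same parameter count stored as 32-bit floats (assumed baseline for comparison).
float32_bytes = PARAMS * 4

# Sustained throughput and energy spent per processed frame.
ops_per_second = OPS_PER_FRAME * FRAMES_PER_SECOND
energy_per_frame_mj = POWER_WATTS / FRAMES_PER_SECOND * 1e3

print("weights (int8):", weight_bytes, "bytes")
print("weights (float32):", float32_bytes, "bytes")
print("throughput:", ops_per_second / 1e6, "million operations per second")
print("energy per frame:", energy_per_frame_mj, "mJ")
```

Under these assumptions the weights fit in 68,500 bytes, the SoM sustains roughly 456 million operations per second, and each frame costs about 3.75 mJ of energy for the whole SoM.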
Have you ever wondered how Google’s and Amazon’s home devices can spot keywords? Well, now you can actually implement this for yourself on the SoM using the on-board I2S microphone and train the device to respond to any keyword of your liking!
Details: The Vision SoM is mounted on the developer board for power and physical stability. The demo uses the I2S microphone on the SoM. The red LED lights up whenever audio activity is detected. If the audio corresponds to a trained keyword (“seven” in this case), the blue LED lights up. The FPGA code and model for keyword detection, as well as the training framework, will be provided on GitHub.
In this project, we used Google’s open-source dataset of human speech commands and keywords. Training the keyword detection model proceeded in two steps: filter training and model training. In the first step, we trained a convolution-based audio filter to extract features of human speech. In the second step, we froze the audio filter’s parameters and trained a convolutional neural network to classify an input speech command/keyword. Let’s go into a little more detail. In the first step, we apply a 1D convolution to a speech signal that is about 1 second long. The 1D convolution produces a map of the spectral density of frequencies over time (also called a spectrogram). Simply put, it’s a frequency vs. time plot of the signal. So you get a plot… but wait… a plot is also an image! So what magic did we do in the first step? We converted a keyword detection problem into a computer vision problem by plotting the dominant frequencies in the speech signal over time. Now that you have an image, you can follow the same steps as those used for the human detection model to train a keyword detection model (with some changes, of course, but the concept stays the same). Hence, you will not hear from us that you need to implement the tough techniques of RNNs and LSTMs to train a keyword detection model. In the second step, all we have to do is train a model that takes in the “image” generated by the 1D convolution audio filter and classifies it as a keyword.
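The idea of turning audio into an image with a 1D convolution can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the kernels here are fixed, windowed sine and cosine waves rather than trained filters, and the sample rate, frequencies, kernel length, stride, and function names are all hypothetical. As in most deep learning frameworks, the "convolution" is computed as a sliding dot product.

```python
import math


def make_filter_bank(freqs_hz, sample_rate, kernel_len):
    """Build one (cosine, sine) kernel pair per centre frequency, Hann-windowed."""
    bank = []
    for f in freqs_hz:
        cos_k, sin_k = [], []
        for n in range(kernel_len):
            window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (kernel_len - 1))
            phase = 2 * math.pi * f * n / sample_rate
            cos_k.append(window * math.cos(phase))
            sin_k.append(window * math.sin(phase))
        bank.append((cos_k, sin_k))
    return bank


def conv1d_spectrogram(signal, bank, stride):
    """Slide every kernel pair over the signal with the given stride.

    Returns a 2D list: one row per filter (frequency), one column per time step.
    """
    kernel_len = len(bank[0][0])
    starts = range(0, len(signal) - kernel_len + 1, stride)
    image = []
    for cos_k, sin_k in bank:
        row = []
        for s in starts:
            frame = signal[s:s + kernel_len]
            re = sum(x * c for x, c in zip(frame, cos_k))
            im = sum(x * c for x, c in zip(frame, sin_k))
            row.append(math.hypot(re, im))  # magnitude of the filter response
        image.append(row)
    return image


SAMPLE_RATE = 8000  # Hz, assumed for this sketch
# A quarter second of a pure 1 kHz tone stands in for a speech recording.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(2000)]

freqs = [250, 500, 1000, 2000]
bank = make_filter_bank(freqs, SAMPLE_RATE, kernel_len=64)
image = conv1d_spectrogram(tone, bank, stride=32)

row_energy = [sum(row) for row in image]
dominant_hz = freqs[row_energy.index(max(row_energy))]
print(len(image), "x", len(image[0]), "image; strongest band:", dominant_hz, "Hz")
```

The resulting 2D array is the “image” that a convolutional classifier can consume. In a trained filter, the kernel values would be learned from the speech data instead of being fixed in advance.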
Source code for raw audio data download, sample rate conversion, and filter and keyword model training, as well as the Google Colab notebook, will be added to this repository very soon. Note, however, that the keyword detection module has many user-defined training parameters, which are set in the shell scripts provided with the module. You can run these shell scripts in Colab but cannot edit them there. Hence, we recommend that you edit the files on your local computer, upload them to Colab (this takes less than 10 seconds because you are only uploading Python code), and run the scripts on Colab. You will get a training speed of about 54 epochs per minute.