Benchmarking Deep Neural Networks on Edge Devices, using different frameworks
2 Devices at the Edge
- 2.1 MacBook Pro
- 2.2 Jetson Nano
- 2.3 Raspberry Pi 4
- 2.4 Architecture Comparison
3 Models and Frameworks at the Edge
- 3.1 Frameworks
- 3.2 Mobilenet
- 3.3 Squeezenet
- 3.4 Inception net
- 3.5 GPU-enabled results
Today’s society, propelled by the 4th industrial revolution, continues to grow thanks to increasingly innovative technologies. Among them is the rapid multiplication of connected devices. Billions of them, until now linked to cloud services and serving a vast variety of applications, generate data of all kinds. This new black gold has become an essential resource for creating new ideas, solving problems and, above all, making profits.
In parallel, Deep Learning, until now hampered by a lack of data and insufficient computing power, has also risen to the top of the hottest technology trends. Both big companies and startups have invested heavily in this technology.
Alongside today’s cloud-based solutions, the emergence of “Edge computing based Artificial Intelligence”, a concept that combines these two technologies, offers multiple benefits, such as rapid response with low latency, better privacy, more robustness, and more efficient use of network bandwidth. To meet this new need, companies have released new AI frameworks, as well as advanced edge devices like the popular Raspberry Pi and NVIDIA’s Jetson Nano, which act as compute nodes in edge computing environments. Although edge devices are limited in terms of computing power and hardware resources, they can be paired with accelerators to enhance their performance and provide superior Edge AI capability in a wide range of applications. It is therefore interesting to see how Deep Neural Networks perform on such resource-limited devices.
In this article, we present and compare the performance, in terms of inference time, frames per second, CPU/GPU usage and temperature, of two different edge devices: a Jetson Nano and a Raspberry Pi 4. We compare the results to those obtained with a MacBook Pro, which serves as a reference. We also measure the performance of three lightweight models widely used for edge use cases, running with either the TensorFlow Lite or the ONNX framework.
2 Devices at the Edge
The concept of Edge Computing has recently been proposed to complement cloud computing and to address problems like latency or data privacy by performing certain tasks at the edge of the network. The idea is to move parts of the processing and communication to the “edge” of the network, i.e., closer to where they are needed. As a result, the server needs fewer computing resources, the network is less strained and latencies decrease. Edge devices come in a variety of forms, ranging from large servers to low-powered system-on-a-chip (SoC) devices like the popular Raspberry Pi or other ARM-based devices.
Deep Neural Networks (DNNs) may require large amounts of storage and computing resources. Although edge devices are limited in terms of computing power and hardware resources, they can be paired with accelerators to enhance their performance. In the context of Edge Computing, it is interesting to see how devices with low power consumption and limited resources handle DNN inference. In this section, we compare different edge device architectures and present an overview of their hardware. We chose the following devices as targets to assess their performance and capabilities for DNN applications:
- Raspberry Pi 4 (Raspberry Pi Foundation)
- Jetson Nano (NVIDIA)
- MacBook Pro 2019 (Apple)
2.1 MacBook Pro
Launched in January 2006, the MacBook Pro series from the American giant Apple continues to improve every year. Introduced during Apple’s transition to Intel as the second model to feature an Intel processor, the MacBook Pro sits at the top end of the MacBook family thanks to its powerful hardware. Already well known in the computing world for its price and refined design, the MacBook Pro has, with its new M1 chip, opened a new door for Apple to compete in the AI area.
Moreover, Apple has released a version of the well-known TensorFlow v2.4 machine learning library that is optimised for its new M1-powered Macs. It takes advantage of Apple’s ML Compute framework, which is designed to accelerate the training of artificial neural networks using not only the CPU but also all available GPUs.
According to several recent reports, Apple claims that the optimised version of TensorFlow allows the new computers to train models up to 7 times faster in the case of the 13-inch MacBook Pro with the M1 chip. Apple also suggests that a task that takes 6 seconds with the non-optimised version takes 2 seconds on the 2019 Intel Mac Pro with the optimised TensorFlow. In this experiment, we mainly use the MacBook Pro as a reference device against which to compare the behavior of the other devices at the edge.
2.2 Jetson Nano
At the GPU Technology Conference in 2019, an annual event organised by NVIDIA, the Jetson Nano was presented as part of the Jetson series. It is a single-board computer (SBC) that makes it possible to develop cost-effective and energy-efficient AI systems, and it was specifically designed to run AI-related applications. With four ARM cores and a Maxwell GPU serving as CUDA computing accelerator and video engine, it opens up new possibilities for graphics and computation-intensive projects.
Figure 1: Comparison of the components of a MacBook Pro, a Jetson Nano and a Raspberry Pi 4.
CUDA is a parallel computing architecture developed by NVIDIA. Offloading work to the GPU relieves the CPU and increases the computing power of the system. CUDA cores are often loosely compared to CPU cores, since both process data, but the comparison has limits: the CPU is built for serial data processing, while the GPU processes data in parallel across many cores, and each CUDA core is much simpler than a CPU core.
Thanks to its compact design, the Jetson Nano can easily be integrated into complex projects, such as robotics or AI. With 128 CUDA cores, the single-board computer can carry out many operations in parallel and thus supports the use of several sensors with real-time computation. Finally, thanks to CUDA support, a neural network could be trained directly on the board. In contrast, such a project on a Raspberry Pi could only be implemented with an additional GPU.
The Jetson Xavier, a higher-end product in the same series, is even more dedicated to artificial intelligence.
2.3 Raspberry Pi 4
Raspberry Pis are small single-board computers (SBCs) developed by the Raspberry Pi Foundation in association with Broadcom. They are capable of doing most of what a desktop computer can do, from browsing the internet and playing high-definition video to editing spreadsheets, word processing, and playing games. The latest of the series, the Raspberry Pi 4 Model B, was unveiled in 2019 and brings significant improvements over previous versions.
It gains in power with its Broadcom BCM2711 chip, whose ARM Cortex-A72 CPU has four 64-bit cores clocked at 1.5 GHz. The processor is backed by 1GB, 2GB or 4GB of RAM depending on the user’s needs, unlike the Raspberry Pi 3 B+, which is limited to 1GB of RAM. The new processor is accompanied by a VideoCore VI GPU, and the board can now decode 4K HEVC video at 60 frames per second and drive two displays simultaneously. The new board has two micro-HDMI ports that replace the traditional full-size HDMI port.
All in all, users get more power and more options to turn the board into a more comfortable little computer, among other uses. This may not be limited to Linux, as Windows 10 on ARM was recently ported to the Raspberry Pi, although a more stable version is still awaited.
2.4 Architecture Comparison
In this section, we compare the architecture of the three devices presented above, and more specifically the components they are made of.
First of all, the MacBook Pro stands out from the other devices as the most powerful. However, it is quite pricey and cannot be considered a low-cost edge device. In terms of price, the Jetson Nano comes next. At 99 dollars, this SBC is one of the most popular boards competing with the Raspberry Pi, which remains the cheapest option for getting started with edge deployments of AI models.
For day-to-day computing and embedded projects, the Raspberry Pi offers better value for money. The Jetson Nano is worth considering only when a project demands GPU usage, or for ML or AI applications that can benefit from CUDA cores.
The Cortex-A72 in the Raspberry Pi 4 is one generation newer than the Cortex-A57 in the NVIDIA Jetson Nano. This CPU offers higher performance and a higher clock speed. For deep learning and AI, however, this advantage alone might not be enough.
In terms of GPU, the Jetson Nano is one step ahead thanks to its 128-core Maxwell GPU clocked at 921 MHz. While it does not offer dual-monitor support, its GPU is much more powerful than the Pi’s. For machine learning and artificial intelligence applications, the Jetson Nano remains the better choice.
3 Models and Frameworks at the Edge
Deep Learning models are known for being large and computationally expensive. It is a challenge to fit these models into edge devices, which usually have limited memory. This has motivated researchers to minimize the size of neural networks while maintaining accuracy. In this section, we present, compare and rank three popular parameter-efficient neural networks:
- The Mobilenet
- The Squeezenet
- The Inception net
First and foremost, we can highlight the contrast between the models by comparing their size and accuracy.
Figure 2 shows that the Squeezenet is the lightest model, but also the least accurate. Conversely, the Inception model is much more accurate, but also very heavy. The Mobilenet offers a good balance between the two: without being too heavy, it provides an acceptable accuracy for a large variety of classification use cases at the edge.
3.1 Frameworks
To run these DL models, on edge devices or elsewhere, machine learning frameworks are needed. An ML framework is a set of tools, interfaces or libraries meant to simplify the use of ML algorithms. It allows users to develop ML models easily, without having to understand the underlying algorithms in detail. There is a variety of machine learning frameworks, geared towards different purposes. Most of them are used through the Python programming language.
When it comes to pairing frameworks with devices, the big companies compete to create the best combination. For example, the Jetson Nano has been optimised to work with TensorRT, Google’s Coral was designed to run TensorFlow Lite, and so on. For its part, the Open Neural Network Exchange (ONNX) is an open format built to represent machine learning models. ONNX defines a common set of operators (the building blocks of machine learning and deep learning models) and a common file format that lets AI developers use models with a variety of frameworks, tools, runtimes, and compilers. In most cases it works as an interchange format between frameworks, but it is also used as a framework in its own right on most edge devices. Indeed, ONNX makes it easier to access hardware optimizations, as its runtime and libraries are designed to maximize performance across hardware. Besides, extensive documentation and many existing conversion workflows make it an invaluable tool. In our experiment, the .onnx and .tflite versions of the different models are compared.
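As an illustration of how the two formats can be driven from the same test harness, here is a minimal sketch that wraps an .onnx or a .tflite file behind a single callable. It is not the code used for the measurements in this article: the helper names (`make_runner`, `make_onnx_runner`, `make_tflite_runner`) are hypothetical, the ONNX path is assumed to go through the `onnxruntime` package on the CPU, and the TFLite path is assumed to use the interpreter bundled with `tensorflow`.

```python
from pathlib import Path
from typing import Any, Callable


def make_onnx_runner(model_path: str) -> Callable[[Any], Any]:
    """Return a callable that runs one batch through ONNX Runtime (CPU only)."""
    import onnxruntime as ort  # imported lazily: only needed for .onnx files

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name

    def run(batch: Any) -> Any:
        # Passing None as the output list requests every model output;
        # we keep only the first one (the class scores for a classifier).
        return session.run(None, {input_name: batch})[0]

    return run


def make_tflite_runner(model_path: str) -> Callable[[Any], Any]:
    """Return a callable that runs one batch through the TFLite interpreter."""
    import tensorflow as tf  # imported lazily: only needed for .tflite files

    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    def run(batch: Any) -> Any:
        interpreter.set_tensor(input_index, batch)
        interpreter.invoke()
        return interpreter.get_tensor(output_index)

    return run


def make_runner(model_path: str) -> Callable[[Any], Any]:
    """Pick the runtime from the file extension (hypothetical convention)."""
    suffix = Path(model_path).suffix.lower()
    if suffix == ".onnx":
        return make_onnx_runner(model_path)
    if suffix == ".tflite":
        return make_tflite_runner(model_path)
    raise ValueError(f"unsupported model format: {suffix!r}")
```

With such a wrapper, the same timing loop can be applied to both versions of a model, so that only the runtime differs between two measurements.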
3.2 Mobilenet
Neural networks, and more specifically Convolutional Neural Networks (CNNs), are particularly popular for image classification, object/face detection and fine-grained classification. However, these neural networks perform convolutions, which are very costly in terms of computation and memory. Image classification in embedded systems is therefore a major challenge due to hardware constraints.
Figure 2: Comparison between the models: their accuracy in percent (blue and green) and their size in MB (orange)
Figure 3: Mean FPS (left, higher is better) and Inference Time (right, lower is better) for the Mobilenet v2-7 over 10 seconds
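Contrary to classical, heavy CNNs, some models tailored for the edge rely on depthwise separable convolutions instead. A depthwise separable convolution splits a standard convolution into a per-channel (depthwise) spatial filter followed by a 1x1 (pointwise) convolution that mixes the channels, which greatly reduces the number of parameters and operations.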
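To give an order of magnitude for this saving, the sketch below counts the weights of a standard convolution and of its depthwise separable counterpart. The layer sizes (3x3 kernel, 32 input channels, 64 output channels) are illustrative values chosen for the example, not figures taken from the models benchmarked here, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise (per-channel) filter plus a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution mixing the channels
    return depthwise + pointwise


# Illustrative layer sizes (assumption, not taken from the article).
kernel, c_in, c_out = 3, 32, 64
standard = standard_conv_params(kernel, c_in, c_out)
separable = depthwise_separable_params(kernel, c_in, c_out)

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
# The ratio simplifies to 1/c_out + 1/kernel**2.
print(f"ratio:     {separable / standard:.3f}")
```

For a 3x3 kernel the ratio is dominated by the 1/9 term, so the separable version needs roughly eight to nine times fewer weights, and the same factor applies to the number of multiply-add operations.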
MobileNet improves the state-of-the-art performance of mobile models on multiple tasks and benchmarks, across a spectrum of model sizes. MobileNet models perform image classification: they take an image as input and classify its main object into one of a set of pre-defined classes. They are trained on the ImageNet dataset, which contains images from 1000 classes. MobileNet models are also very efficient in terms of speed and size, and are hence ideal for embedded and mobile applications.
Inference latency is one of the most crucial aspects of deploying a deep network in a production environment. Most real-world applications require very fast inference, ranging from a few milliseconds to one second.
We use the MacBook Pro as a reference and expect it to deliver good results, which lets us judge how well the two other devices perform at the edge. As seen in Figure 3, the MacBook Pro reached around 22 FPS with ONNX, and 11 with TFLite. The Raspberry Pi and the Jetson Nano are on par for the .tflite version of the model, while the Pi delivers a slightly higher FPS with ONNX. As seen above, the Jetson Nano has a less advanced CPU than the Pi, but much more computing power in its GPU. These results are therefore not surprising, since neither the ONNX nor the TFLite version runs on the GPU in these experiments.
Linked to FPS, the inference time is the time a trained DNN model takes to process new, unseen data and make a prediction. It is equal to 1/FPS, so the smaller the inference time, the better. Here again, the race is tight between NVIDIA’s board and the Pi 4, but once again the Jetson was not used to its full capabilities. Overall, the .onnx version of the model offers a better inference time than the .tflite one.
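The relation between the two metrics can be made concrete with a small timing loop. The following is a minimal sketch, assuming the model is wrapped in a zero-argument callable (`infer`, a hypothetical name) and that frames are processed back to back for a fixed duration; the 10-second default mirrors the window mentioned in the figure captions, but the actual measurement code behind the figures may differ.

```python
import time
from typing import Callable, Tuple


def benchmark(infer: Callable[[], object],
              duration_s: float = 10.0,
              warmup: int = 3) -> Tuple[float, float]:
    """Run `infer` repeatedly for about `duration_s` seconds.

    Returns (mean FPS, mean inference time in seconds), where the
    inference time is simply 1 / FPS.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")

    # Warm-up runs are excluded from the measurement (caches, lazy init).
    for _ in range(warmup):
        infer()

    frames = 0
    start = time.perf_counter()
    while True:
        infer()
        frames += 1
        elapsed = time.perf_counter() - start
        if elapsed >= duration_s:
            break

    fps = frames / elapsed
    return fps, 1.0 / fps


if __name__ == "__main__":
    # Stand-in workload (assumption): replace with a real model call.
    fps, seconds_per_frame = benchmark(lambda: sum(range(10_000)), duration_s=1.0)
    print(f"{fps:.1f} FPS, {seconds_per_frame * 1000:.2f} ms per inference")
```

Because the loop only stops after a complete inference, the measured window can slightly exceed the requested duration; dividing by the elapsed time rather than by the nominal duration keeps the mean FPS consistent.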
3.3 Squeezenet
SqueezeNet is remarkable not for its accuracy but for how little computation it needs. It reaches accuracy levels close to those of AlexNet, yet the model pre-trained on ImageNet is smaller than 5 MB, which is great for using CNNs in real-world applications. SqueezeNet introduced the Fire module, which is made of alternating squeeze and expand layers.
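The effect of the Fire module on model size can be sketched by counting weights. The example below assumes the usual description of a Fire module, namely a 1x1 squeeze convolution followed by an expand stage made of parallel 1x1 and 3x3 convolutions whose outputs are concatenated. The channel counts are illustrative values picked for the example, and biases are ignored.

```python
def fire_module_params(c_in: int, squeeze: int, expand_1x1: int, expand_3x3: int) -> int:
    """Weights of a Fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand filters."""
    squeeze_weights = c_in * squeeze                  # 1x1 conv reducing the channel count
    expand_1x1_weights = squeeze * expand_1x1         # 1x1 expand branch
    expand_3x3_weights = 3 * 3 * squeeze * expand_3x3 # 3x3 expand branch
    return squeeze_weights + expand_1x1_weights + expand_3x3_weights


def plain_conv3x3_params(c_in: int, c_out: int) -> int:
    """Weights of a single 3x3 convolution producing the same number of output channels."""
    return 3 * 3 * c_in * c_out


# Illustrative sizes (assumption): 64 input channels squeezed to 16,
# then expanded to 64 + 64 = 128 output channels.
fire = fire_module_params(c_in=64, squeeze=16, expand_1x1=64, expand_3x3=64)
plain = plain_conv3x3_params(c_in=64, c_out=128)

print(f"Fire module: {fire} weights")
print(f"plain 3x3:   {plain} weights")
print(f"reduction:   {plain / fire:.1f}x")
```

The saving comes from two choices visible in the count: the 3x3 filters only see the squeezed channels, and half of the expand filters are 1x1 instead of 3x3.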
Figure 4: Mean FPS (left, higher is better) and Inference Time (right, lower is better) for the Squeezenet v1.1-7 over 10 seconds
Once again, we can observe in Figure 4 that, overall, the Raspberry Pi does a bit better than the Jetson in terms of FPS and inference time. This time, however, the .tflite version of the model provided a slightly lower inference time on the Jetson. Thus, for this particular case, the TFLite framework is worth considering.
3.4 Inception net
Inception net marked a milestone for CNN classifiers at a time when previous models simply went deeper to improve performance and accuracy, at the expense of computational cost. The Inception network, on the other hand, is heavily engineered and uses many tricks to push performance, both in terms of speed and accuracy. It won the ImageNet Large Scale Visual Recognition Challenge in 2014, an image classification competition, with a significant improvement over ZFNet (the winner in 2013) and AlexNet (the winner in 2012), and a relatively lower error rate than VGGNet (the runner-up in 2014). The 22-layer model is much heavier than the two others presented here, but ensures a level of accuracy never seen at the edge before.
Unsurprisingly, the FPS results shown in Figure 5 are much lower than for the previous models on all three devices. Even the MacBook Pro does not reach 10 FPS. The results for the Pi and the Jetson are again quite similar. Moreover, it is worth mentioning that the 4th version of this model weighs 162.8 MB and reaches an astonishing accuracy (Top-1 = 80.1% and Top-5 = 95.1%). Still, whereas we measured 8.454 FPS on the MacBook for v3, we obtained only around 3 FPS for v4. This balance between size, speed and accuracy does not make these models suitable for sensitive edge use cases, such as robbery detection.
3.5 GPU-enabled results
As seen in the previous sections, the Pi’s CPU is newer and slightly better than the Nano’s. However, we have not yet used the Nano to its full capacity. Let’s have a look at the graphs when running the models on the Jetson’s GPU.
As shown in Figure 6, the GPU makes the models run faster, at least for the Mobilenet and the Inception net, the two heaviest ones: we see a 43% increase in FPS for the Mobilenet and a 223% increase for the Inception net. For lighter models, CPUs seem to be more efficient than GPUs. If the model is too small, the bottleneck becomes the time needed to transfer the data between the RAM and the GPU. This transfer time seems to be almost the same as the inference time itself, hence the results.
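The transfer bottleneck can be illustrated with a simple per-frame cost model in which each GPU inference pays a fixed data-transfer time on top of its compute time. This is only a sketch: the timings below are hypothetical round numbers chosen to show the two regimes, not measurements from this benchmark, and the function names are ours.

```python
def percent_increase(baseline_fps: float, new_fps: float) -> float:
    """Relative FPS change in percent; positive means the new setup is faster."""
    return (new_fps - baseline_fps) / baseline_fps * 100.0


def fps_with_transfer(compute_s: float, transfer_s: float = 0.0) -> float:
    """FPS when every frame pays its compute time plus a data-transfer time."""
    return 1.0 / (compute_s + transfer_s)


# Hypothetical per-frame timings in seconds (assumptions, not measured values).
TRANSFER_S = 0.020

# Heavy model: the GPU saves far more compute time than the transfer costs.
heavy_cpu = fps_with_transfer(0.100)
heavy_gpu = fps_with_transfer(0.030, transfer_s=TRANSFER_S)

# Light model: the transfer time is comparable to the whole CPU inference.
light_cpu = fps_with_transfer(0.020)
light_gpu = fps_with_transfer(0.005, transfer_s=TRANSFER_S)

print(f"heavy model: {percent_increase(heavy_cpu, heavy_gpu):+.0f}% FPS on GPU")
print(f"light model: {percent_increase(light_cpu, light_gpu):+.0f}% FPS on GPU")
```

Under this model, offloading to the GPU only pays off when the compute time saved exceeds the transfer time added, which is consistent with the heavier models benefiting from the GPU while the lightest one does not.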
Figure 5: Mean FPS (left, higher is better) and Inference Time (right, lower is better) for the Inception v3 over 10 seconds
Figure 6: Mean FPS comparison for the Jetson, GPU enabled and disabled (higher the better)
Figure 7: Mean FPS comparison between a Jetson Nano and a Raspberry Pi (higher the better)
This last graph (Figure 7) corroborates the results seen above: the Jetson Nano with its GPU enabled performs better for models like the Mobilenet and the Inception net. However, the CPU is more efficient for models like the Squeezenet, and in this case the Pi produces better results.
What makes a model suitable for the edge is its ability to strike a balance between speed and accuracy. Use cases at the edge require models that react quickly but are also precise enough. The choice of device is just as important, if not more so, and again depends on the application. In this work, we presented and compared the performance, in terms of inference time and FPS, of three devices that can be found at the edge: the MacBook Pro, the Jetson Nano and the Raspberry Pi 4. We also provided additional information about the models that were deployed. The results for each model turned out to be quite different, depending on their size and framework. Nevertheless, we can already draw some major conclusions. The MacBook Pro serves as a point of comparison that helps highlight how the two other devices perform. With its GPU enabled, the Jetson Nano achieves better performance for the two largest models. However, the Pi overtakes it on the Squeezenet: CPUs seem to deliver notable results for lighter models.
Considering the size and the accuracy of each model, the Mobilenet arguably outperforms the other models and is the most suitable for the edge. The Inception net is the most accurate but also extremely heavy. On the other hand, the Squeezenet proved to be fast but not precise enough.
Future work could extend this study to other SoCs, such as the Google Coral or the Intel Movidius, and evaluate different CNN models. We could also test models optimised with TensorRT on the Jetson Nano, for example, or compile the libraries differently so that they are better tuned for the edge.