Optimization of an action recognition DL model for the NVIDIA Jetson platform
Figure 1. Demo application running on an NVIDIA Jetson Xavier
This blog shows the process of optimizing a PyTorch action recognition model for an assembly line running on an NVIDIA Jetson platform, going from an initial performance of 5.9 fps to a final optimized speed of 32 fps. It is the follow-up to our previous article, Action Recognition for Assembly Lines using Deep Learning. In that article, a deep learning model based on the SlowFast architecture was trained with PyTorch on an x86 computer. The result was a model able to recognize the actions performed by an operator, where each detected action corresponds to an assembly step for a mechanical part on an assembly line. The final model reported accuracy and F1 scores of around 0.90.
The main objective is to show the process of optimizing a deep learning model so that it can run on an edge device in real time. In the original report, the model ran at 30 fps on a desktop computer. However, when the same model was executed on an NVIDIA Jetson AGX Xavier, the performance dropped to 5.9 fps. This is where optimization techniques for edge computing are needed to ensure real-time performance.
The optimization stage described in this article uses the model and framework generated in the previous project as the baseline. The system was ported to the NVIDIA Jetson platform, specifically the Jetson AGX Xavier 16GB and the Jetson Nano 4GB. The main aim was to reach 30 fps on the Jetson Xavier, with the Jetson Nano used as a point of comparison.
Throughout the project, traditional code optimizations were used, but most importantly the inference framework was migrated to TensorRT and the media handling to GStreamer in order to use the platform's hardware accelerators. After the optimizations, the final performance on the Jetson Xavier is around 32 fps with a 1080p source video, which meets the 30 fps target. The model was also able to run on the Jetson Nano at 19 fps, with the input video reduced to 455x256 due to memory limitations.
The project's results are used in real-life environments as an embedded solution that enables the automation of quality assurance or quality control processes. A real use case is presented in Figure 1, where video from a manufacturing facility is processed in real time while an operator manually assembles a part. The video is used as input to detect assembly errors before the part leaves the station, thus preventing further waste of time and resources.
The base project uses PyTorch as the inference framework and PyTorchVideo for some video-oriented data operations. The model was trained using transfer learning and a SlowFast network architecture. The resulting SlowFast model was generated using int64 data types, and both training and inference were done on an x86 CPU only.
The inference process used to detect the operator actions is shown in Figure 2. First, the full video was loaded into memory from a previously recorded file. Then the media went through a processing pipeline to transform the video as required by the model. After that, the inference was made and the output video was written using OpenCV.
A resulting frame from the process can be seen in Figure 3.
The three stages shown in Figure 2 (input stage, inference, and post-processing) were the focus of the optimization efforts. The following paragraphs describe the original state of each stage in more detail, to give a better understanding of how the process worked before the optimizations.
The input stage can be divided into two parts: the media processing stage, where the video feed is decoded and converted to the color format needed by the network, and the data preprocessing stage, where the data is conditioned to the necessary format and type right before it is passed to the inference network.
The time window used is 1s. For a video at 30fps, this means that 30 frames are needed to get a sample for a processing step.
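As a minimal sketch of this windowing, the snippet below groups a stream of frames into one-second samples. The function name `iter_samples` is hypothetical, and integers stand in for decoded frames; it is only meant to illustrate the arithmetic, not the project's actual code.

```python
from typing import Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def iter_samples(frames: Iterable[Frame], fps: int = 30,
                 window_s: float = 1.0) -> Iterator[List[Frame]]:
    """Group a frame stream into fixed-length windows (one window = one sample)."""
    frames_per_sample = int(round(fps * window_s))
    window: List[Frame] = []
    for frame in frames:
        window.append(frame)
        if len(window) == frames_per_sample:
            yield window
            window = []
    # A trailing partial window is simply dropped in this sketch.


# Integers stand in for frames: a 60 s clip at 30 fps has 1800 frames.
samples = list(iter_samples(range(60 * 30)))
print(len(samples), len(samples[0]))  # 60 30
```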
The media processing stage uses PyTorchVideo's EncodedVideo class to decode and load the video into system memory. The preprocessing stage mainly uses PyTorchVideo transforms and consists of several smaller steps: a square crop, a frame normalization, and then sampling steps that produce two groups of frames at different sampling rates, one for the slow pathway and another for the fast pathway of the network.
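The two sampling rates can be illustrated with plain index arithmetic. The sketch below assumes 16 frames for the fast pathway and a slow-to-fast ratio (`alpha`) of 4; both values and the function names are hypothetical, since the article does not state the ones used in the project.

```python
from typing import List, Tuple


def uniform_indices(num_frames: int, count: int) -> List[int]:
    """Pick `count` evenly spaced indices in [0, num_frames - 1]."""
    if count == 1:
        return [0]
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * (num_frames - 1)) // (count - 1) for i in range(count)]


def pathway_indices(num_frames: int, fast_count: int = 16,
                    alpha: int = 4) -> Tuple[List[int], List[int]]:
    """Return (slow, fast) frame indices for one sample window."""
    fast = uniform_indices(num_frames, fast_count)
    # The slow pathway sees a sparser subset of the fast pathway's frames.
    slow = [fast[j] for j in uniform_indices(len(fast), len(fast) // alpha)]
    return slow, fast


slow, fast = pathway_indices(30)  # one 1 s window at 30 fps
print(len(slow), len(fast))       # 4 16
```

Both pathways cover the same one-second window; the slow pathway simply looks at fewer frames of it.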
The inference stage is where the network receives the input frames (after being processed) and applies the deep learning (DL) model. The model generates a vector with one score per action class. To make the results usable for action detection, this vector goes through a softmax operation, and the index and probability of the most likely label are taken from the result. A config file holds the label names, and it is checked to get the label name that corresponds to the action index.
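The step from raw scores to a label and a probability can be sketched in pure Python as follows. The label names and scores are made up for illustration; in the project the names come from the config file.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a vector of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(scores: Sequence[float],
              labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most likely label and its probability."""
    probs = softmax(scores)
    index = max(range(len(probs)), key=probs.__getitem__)
    return labels[index], probs[index]


LABELS = ["install spacer", "install washer", "press down"]  # hypothetical
label, prob = top_label([1.0, 3.0, 0.5], LABELS)
print(label, round(prob, 3))  # install washer 0.821
```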
The post-processing stage receives the resulting label and probability from the previous stage and writes them over the video frames. Those frames are then written to an mp4 file using OpenCV.
Jetson AGX Xavier 16GB
The main target is the Jetson AGX Xavier 16GB. This system is a top-of-the-line device capable of heavy computation in a very efficient package, especially for machine learning tasks that involve neural networks. It provides a high-performance platform with deep learning accelerators (DLA units) and other accelerators such as Tensor Cores. It also has a large amount of available system memory and an octa-core 64-bit ARM processor, which helps keep the GPU and other accelerators fed with data.
Jetson Nano 4GB
This platform is the baseline device (Figure 5). It has a smaller quad-core 64-bit ARM processor and a smaller GPU with a lower clock frequency and fewer functional units. This device is used to check whether an action recognition model can run on a resource-limited device. Together with the testing done on the Jetson Xavier, it also provides data points to estimate the performance of other Jetson devices.
Software and Tools
NVIDIA JetPack: JetPack version 4.6.2 was used since it is the last JetPack release to support the Jetson Nano. It ships with Ubuntu 18.04 and, by default, Python 3.6. It was accompanied by the DeepStream libraries, version 6.0.1.
Python: Python was used because many DL libraries and tools are available for it, and because the base project was written in Python. Version 3.8 was chosen as a balance between library support and JetPack tools support, while still being a recent release.
Python libraries: Torch version 1.10 was used, and 1.12 was tested later. It was accompanied by the matching torchvision and torchaudio versions according to the PyTorch versions page. NVIDIA's DALI library, version 1.17, was also used.
GStreamer: GStreamer version 1.14.5 was used for data handling, for example reading and writing the video, as well as video decoding and encoding.
NVIDIA libraries: TensorRT version 188.8.131.52 was used. For CUDA, the version included with JetPack was used, 10.2.300, and for cuDNN also the one included with JetPack, version 184.108.40.206.
For testing purposes, an existing video from the previous project dataset was used, shown in Figure 6. This video is a real recording of an assembly line, where a part is manually assembled by an operator, following a sequence of steps. The video is at 1080p resolution, with a framerate of 30fps and a duration of 60s. The video has a total of 60 samples for the inference process using a window of 1s corresponding to 30 frames.
For performance measurements, tegrastats was run while executing the inference to get GPU and memory stats. For CPU stats, the psutil library was used to get the usage percentage and the frequency. The temperature was kept under 40°C and the jetson_clocks setting was enabled on both targets. For time measurements, Python's built-in time package was sufficient: a timestamp is taken before and after a piece of code, and the two are subtracted to get the execution time.
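The before/after timing approach can be wrapped in a small helper. This is only a sketch of the idea using the standard time module; the `timed` helper and the stage name are hypothetical, and a short sleep stands in for the real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)  # stage name -> list of elapsed seconds


@contextmanager
def timed(stage: str):
    """Record the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)


with timed("inference"):
    time.sleep(0.01)  # stand-in for the real inference call

mean = sum(timings["inference"]) / len(timings["inference"])
print(f"inference: {mean:.3f} s per sample")
```

Collecting one entry per sample makes it easy to report an average per stage over the whole test video.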
For the baseline metrics, the model was executed without any optimizations. This first step allows measuring the system performance for an unoptimized Torch model that was originally designed for desktop.
On the Jetson Xavier, the initial inference time was around 2.1s per sample, with the preprocessing stage taking 1.6s, and the postprocessing another 1.2s. Those three made a total of around 5s (for 30 frames on every sample). That means that the full process runs at 5.9 fps on average.
On the Jetson Nano, the inference takes 20s, the preprocessing around 6s, and the postprocessing another 3.4s. That comes to a total of 29.8s, for a frame rate of 1 fps. Also, this platform was only able to do around 3 to 4 inferences before the process was killed by the OS because it was using too much memory.
On both devices the GPU was not used at all; only the CPU was used.
First, the inference framework was migrated to TensorRT, a high-performance inference framework that uses the platform's hardware accelerators to provide the best possible inference speeds. TensorRT can take an ONNX model as input, and thanks to the Torch to ONNX export, the Torch model was quickly migrated to TensorRT. For more reference, check the Torch to TensorRT tutorial. The speed-up obtained after this change was around 7.5x, for an inference time of 0.34s per sample. On the Jetson Nano, the model was not able to run at all at this point.
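A rough sketch of the Torch to ONNX step is shown below. It assumes a SlowFast-style model whose forward pass takes a list with the slow and fast tensors; the input shapes, tensor names, and opset version are assumptions for illustration and are not taken from the project.

```python
from typing import Dict, Tuple


def slowfast_input_shapes(fast_frames: int = 32, alpha: int = 4,
                          side: int = 256) -> Dict[str, Tuple[int, ...]]:
    """Shapes (batch, channels, frames, height, width) for both pathways."""
    return {
        "slow": (1, 3, fast_frames // alpha, side, side),
        "fast": (1, 3, fast_frames, side, side),
    }


def export_to_onnx(model, onnx_path: str, opset: int = 13) -> None:
    """Trace `model` with dummy inputs and write an ONNX file."""
    import torch  # imported here so the shape helper works without PyTorch

    shapes = slowfast_input_shapes()
    dummy = [torch.randn(shapes["slow"]), torch.randn(shapes["fast"])]
    model.eval()
    torch.onnx.export(
        model,
        (dummy,),                      # one positional argument: [slow, fast]
        onnx_path,
        input_names=["slow", "fast"],  # hypothetical tensor names
        output_names=["scores"],
        opset_version=opset,
    )


print(slowfast_input_shapes())
```

The resulting ONNX file can then be handed to TensorRT, for example through its ONNX parser or the trtexec tool that ships with it, to build an engine for the target device.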
After that, the biggest speedup came from model quantization. This refers to converting the model's weights to a lower-precision data type in order to reduce the data footprint and increase throughput. For example, instead of the original float32 weights, the data type can be reduced to float16 or even int8, the lowest precision tested in this project. These reductions help alleviate memory bandwidth bottlenecks and let the GPU use its more abundant lower-precision compute resources.
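To get a feel for how much precision each data type gives up, the sketch below rounds a few made-up weights to float16 and to a simple symmetric int8 scheme. This only illustrates the precision trade-off; it is not how TensorRT quantizes a model internally, where int8 typically involves a calibration step.

```python
import struct
from typing import List, Sequence, Tuple


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def quantize_int8(values: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization: q = round(x / scale)."""
    scale = (max(abs(v) for v in values) / 127.0) or 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def dequantize(q: Sequence[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate real values."""
    return [v * scale for v in q]


weights = [0.1234567, -0.5, 0.0312, 0.9]  # made-up weights
fp16 = [to_float16(w) for w in weights]
q, scale = quantize_int8(weights)
int8 = dequantize(q, scale)

for w, h, i in zip(weights, fp16, int8):
    print(f"{w:+.7f}  fp16 err={abs(w - h):.2e}  int8 err={abs(w - i):.2e}")
```

The int8 error is bounded by half the scale step, which is why accuracy has to be re-checked after quantization.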
Quantizing the model to float16 resulted in a speedup of 30x without a significant accuracy loss in the inference results. A further quantization to int8 was tested later, achieving a speedup of 45x over the baseline inference time, again with only a small accuracy loss in the inference output.
The final inference times, using int8 quantization on the Jetson Xavier for a 1080p video, were around 47-50 ms, a speed-up of 45x over the initial inference time of around 2s, with minimal accuracy losses, as seen in Table 1.
Table 1. Scores of the different models used and the different data types used
On the Jetson Nano, the video resolution was changed to 455x256 and the input handling was changed to use GStreamer instead of loading the whole video into memory. Both changes were made to better control the amount of memory being used and to limit, as much as possible, the need for the process to use swap space. Given that the NVIDIA Jetson Nano provides hardware acceleration for float32 and float16, but not for int8, the performance of the int8 model decreased. With that said, a final speedup of 13.6x was obtained on the Jetson Nano using the float16 model.
Preprocessing and Input Optimizations
To optimize the video loading and processing speeds, the input stage was migrated from PyTorchVideo to GStreamer. The main advantage is that GStreamer has hardware-accelerated elements such as encoders, video converters, and scalers, and that it can handle live sources, a functionality that PyTorchVideo does not have. This reduced the processing time by a full second from the original 1.6s.
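As an illustration of what such an input pipeline could look like, the sketch below builds a GStreamer pipeline description that decodes and scales a file with Jetson hardware elements and hands the frames to the application. It assumes an H.264 video in an MP4 container and a hypothetical file name; the element chain is a plausible example, not the project's actual pipeline.

```python
def decode_pipeline(path: str, width: int, height: int) -> str:
    """Describe a file -> hardware decode -> scale -> appsink pipeline."""
    return (
        f'filesrc location="{path}" ! qtdemux ! h264parse ! '
        "nvv4l2decoder ! nvvidconv ! "                      # Jetson HW decode + convert
        f"video/x-raw,format=RGBA,width={width},height={height} ! "
        "appsink name=sink sync=false max-buffers=30 drop=false"
    )


# Example: scale the source down to the resolution used on the Jetson Nano.
print(decode_pipeline("assembly.mp4", 455, 256))
```

A description like this can be passed to Gst.parse_launch from the GStreamer Python bindings, with the application pulling frames from the appsink one sample window at a time instead of loading the full video into memory.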
The preprocessing was also migrated to NVIDIA's DALI library. This library provides hardware-accelerated functions designed specifically to process images and video for deep-learning applications. The library uses a pipeline syntax to define the different stages that process the data. The functions used were resize, crop, binary division, and normalize, all of them hardware accelerated. With the DALI preprocessing, the time was reduced by another 100ms.
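The arithmetic behind the crop and normalization steps can be sketched in plain Python. This is not DALI code: it only shows the math, reading the division step as a scaling of 8-bit pixel values by 255, and the mean and standard deviation values are assumptions rather than the project's settings.

```python
from typing import Tuple


def center_square_crop(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (x, y, w, h) of the largest centered square."""
    side = min(width, height)
    return (width - side) // 2, (height - side) // 2, side, side


def normalize_pixel(value: int, mean: float = 0.45, std: float = 0.225) -> float:
    """Scale an 8-bit value to [0, 1], then standardize it."""
    return (value / 255.0 - mean) / std


print(center_square_crop(1920, 1080))  # (420, 0, 1080, 1080)
print(round(normalize_pixel(255), 3))
```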
The final preprocessing stage takes around 0.7s per sample on the Jetson Xavier at 1080p, counting the IO operations. On the Jetson Nano it takes around 0.3s, but there a hardware-accelerated scaler element reduces the video to 455x256 because of the memory limitations mentioned before.
Postprocessing and Output Optimizations
The main change at this stage was the migration from OpenCV to GStreamer pipelines to handle the output. Thanks to GStreamer's versatility, several output pipelines were created without significantly impacting system resources: a file writer, an encoding pipeline for streaming, and a live preview that shows the video directly as it is being generated. It also opened the possibility of using a shared memory mechanism to broadcast video to another local process.
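The different output options can be pictured as alternative pipeline descriptions fed by the same appsrc. The sketch below is an assumption about how they might be written with standard GStreamer and Jetson elements; paths, host, port, and socket names are hypothetical, and the application would still need to set the appsrc caps to match its frames.

```python
from typing import Dict

# Hardware H.264 encode branch shared by the file and streaming outputs.
ENCODE = ("nvvidconv ! video/x-raw(memory:NVMM),format=NV12 ! "
          "nvv4l2h264enc ! h264parse")


def output_pipelines(file_path: str, host: str, port: int,
                     socket_path: str) -> Dict[str, str]:
    """Return one pipeline description per output type."""
    src = "appsrc name=src is-live=true format=time ! videoconvert"
    return {
        "file": f'{src} ! {ENCODE} ! qtmux ! filesink location="{file_path}"',
        "stream": f"{src} ! {ENCODE} ! rtph264pay ! udpsink host={host} port={port}",
        "preview": f"{src} ! autovideosink",
        "shm": f"{src} ! shmsink socket-path={socket_path} wait-for-connection=false",
    }


pipes = output_pipelines("out.mp4", "127.0.0.1", 5000, "/tmp/action_rec")
for name, description in pipes.items():
    print(f"{name}: {description}")
```

Keeping every variant behind the same appsrc means the inference code does not change when the output is switched.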
The final time for the Jetson Xavier came to be around 0.1s to 0.15s when using the file writer at 1080p, and locked to 0.1s when using the shared memory write. On the Jetson Nano the time obtained for the file writer was 0.048s, at a video of 455 by 256, and a locked 0.035s for the shared memory write.
The full speedup summary can be seen in Figure 7 and Figure 8. The final times for each device's best run were:
Jetson AGX Xavier at 1080p: 0.049s inference time, 0.77s preprocessing time, and 0.1s postprocessing time. That comes to a total of around 0.93s per sample, for a frame rate of 32 fps. The final memory usage was around 5GB, with a GPU usage of 99% and a low CPU usage, on average around 20-30%.
Jetson Nano at 455x256: 1.4s inference time, 0.29s preprocessing time, and 0.09s postprocessing time, for a total of 1.8s per sample and a frame rate of 16 fps. The memory used was almost at the limit: around 3GB without counting the memory used by the OS, and almost 3.7GB when counting it. The GPU usage is high at 95% and the CPU usage is around 20%.
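The frame rates follow from the per-sample stage times, assuming the three stages run one after another and each sample covers 30 frames. The short check below is only a sanity calculation on the numbers reported above, with a hypothetical helper name.

```python
from typing import Sequence


def frames_per_second(stage_times_s: Sequence[float],
                      frames_per_sample: int = 30) -> float:
    """Throughput when the stages of one sample run sequentially."""
    return frames_per_sample / sum(stage_times_s)


xavier = frames_per_second([0.049, 0.77, 0.1])  # inference, pre, post
nano = frames_per_second([1.4, 0.29, 0.09])
print(f"Xavier: {xavier:.1f} fps, Nano: {nano:.1f} fps")
```

The results land slightly above the rounded figures quoted in the text, which is expected since the reported totals are rounded up.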
A demo application was developed in order to test the full process after the optimizations. It consists of two processes, as seen in Figure 9: one for the main processing (it contains the media handling, inference, and media write), and another that runs the user interface and receives the output from the main process. The demo application was divided into two processes to keep the main process independent from the UI: if the UI crashes or is not present, the main process keeps running. It also helps with maintenance, since it is easier to modify one without needing to alter the other.
The UI shows the video feed with the results alongside a log panel with the detected actions. It also has a part status indicator, used to monitor the process for assembly mistakes. The mistake detection uses a configurable file, where the user can set a list of sequential steps that the main process will compare against the steps detected in the video feed.
It also has the ability to change the source of the feed on the fly. Specifically, it can use:
Recorded videos: allow checking for assembly errors with recorded footage. This is useful to investigate possible causes of faulty parts, and also to get process analytics, count errors per time period, or identify recurrent assembly mistakes.
UDP streams: give the ability to use live UDP feeds that can come from other computers or simple live IP cameras. A camera can be placed on an assembly line and the process monitored live for mistakes, helping to stop bad parts from going further down the assembly line and causing problems or losses.
The assembly consists of a mechanical part, and the work area is divided as seen in Figure 10.
Each section of the image shows the location of different parts used during the assembly as follows:
Riveted start, it's the initial step of the assembly.
Press, it's the second to last step, where a press is used to compress the assembly together.
Lid, this is the part that's needed before using the press.
Sequence to Check
The application was made so that the steps to check, and their order, are easily configured. Two examples are provided: first a good sequence is defined, and then it is modified to show the error detection capability.
The sequence to check was defined as follows:
Spacer and washer.
Repeat 2 to 4, 4 more times.
Figure 11. Demo application with a video of a good assembly, according to the defined sequence
As can be seen in Figure 11, the process doesn't flag any step as an error. The only flag is the warning flag, which is raised when the network outputs a probability value lower than a set threshold. This is done to avoid false results, or results that are highly unlikely based on the input.
The sequence was modified so the last spacer and greased disk are now the install lid and press-down actions.
Figure 12. Demo application with a video of a bad assembly, according to the defined sequence. Shows the error flag
At the last spacer and greased disk the process detects the error, as seen in Figure 12: since it expects an install lid and a press down, it marks both steps as errors. In the last steps it detects that the final step was done and returns a summary of the missing steps. Since the summary is not empty, it flags the assembly as an error.
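A simplified version of this kind of check can be sketched as below. The matching policy, the threshold value, and the step names are assumptions made for illustration; the demo application's real logic and configuration format are not shown in this article. In this sketch the missing-step summary is computed once all detections have been processed.

```python
from typing import Dict, List, Tuple


def check_sequence(expected: List[str],
                   detected: List[Tuple[str, float]],
                   min_prob: float = 0.5) -> Dict[str, List[str]]:
    """Compare (label, probability) detections against the expected order."""
    report: Dict[str, List[str]] = {"errors": [], "warnings": [], "missing": []}
    position = 0
    for label, prob in detected:
        if prob < min_prob:
            report["warnings"].append(label)  # low confidence: warn only
        elif position < len(expected) and label == expected[position]:
            position += 1                     # step done in the right order
        else:
            report["errors"].append(label)    # unexpected step
    report["missing"] = expected[position:]   # steps never completed
    return report


expected = ["spacer", "greased disk", "install lid", "press down",
            "install lid", "press down"]
detected = [("spacer", 0.9), ("greased disk", 0.95), ("spacer", 0.9),
            ("greased disk", 0.9), ("install lid", 0.4),
            ("install lid", 0.9), ("press down", 0.9)]
report = check_sequence(expected, detected)
print(report)
```

In this example the second spacer and greased disk are reported as errors, the low-confidence detection only raises a warning, and the non-empty list of missing steps marks the assembly as faulty.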
The process can write a log file of the steps and the errors, as can be seen in the next image. This can be used to debug the process or for users to keep their own records.
These results show that real-time action detection can be done with an optimized deep-learning model and hardware-accelerated media processing. It can tackle real-life problems, such as error detection during a manual assembly process, and can be a valuable asset to reduce or prevent errors from making it off the assembly line.