Go Fast or Go Home! Thinking Parallelly

Dhruval PB
3 min read · Jun 24, 2021
Hardware-accelerated Real-time 3D Point Cloud Generation

As heavy computation becomes more prominent in autonomous machines, hardware acceleration for running such algorithms becomes ever more prevalent. Interested in hardware acceleration for autonomous machines, our Formula Student team, Vega Racing Electric, set out to learn more about these technologies. Our project involved building a vision-based depth-perception software stack capable of running in real time, and we turned to NVIDIA's CUDA® to speed things up.

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs), used to dramatically speed up computing applications.

In GPU-accelerated applications, the sequential part of the workload runs on the CPU — which is optimized for single-threaded performance — while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.

In our application of depth perception from stereo vision, generating a point cloud requires a linear transformation (a projection into 3D space): each pixel's position in 3-dimensional space is computed by a matrix multiplication that is independent of every other pixel, so the whole transformation can run in parallel. We obtained huge speedups by using CUDA for this!
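As a concrete sketch of the per-pixel computation (the function names and camera parameters below are illustrative, not taken from our codebase): under the pinhole camera model, depth follows from disparity as Z = f·B/d, and X and Y follow from the pixel coordinates. On the GPU, this per-pixel function becomes the body of a CUDA kernel launched with one thread per pixel; the CPU loop below shows the same computation serially.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point3 { float x, y, z; };

// Reproject one pixel (u, v) with disparity d into 3D space.
// f      : focal length in pixels
// B      : stereo baseline in metres
// cx, cy : principal point in pixels
// On the GPU, this function is the body of a CUDA kernel with
// one thread per pixel, since no pixel depends on any other.
Point3 reproject(float u, float v, float d,
                 float f, float B, float cx, float cy) {
    float z = f * B / d;            // depth from disparity
    return { (u - cx) * z / f,      // X
             (v - cy) * z / f,      // Y
             z };
}

// CPU reference loop over a disparity map. Every iteration is
// independent, which is exactly what makes the GPU port trivial:
// the double loop disappears and (u, v) come from thread indices.
std::vector<Point3> reprojectAll(const std::vector<float>& disp,
                                 int width, int height,
                                 float f, float B, float cx, float cy) {
    std::vector<Point3> cloud(disp.size());
    for (int v = 0; v < height; ++v)
        for (int u = 0; u < width; ++u) {
            std::size_t i = static_cast<std::size_t>(v) * width + u;
            cloud[i] = reproject(static_cast<float>(u),
                                 static_cast<float>(v),
                                 disp[i], f, B, cx, cy);
        }
    return cloud;
}
```

In the CUDA version, each thread computes its own `(u, v)` from `blockIdx` and `threadIdx` and writes one point, so the kernel body is just the `reproject` arithmetic.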

From the figure above, it is evident that the time taken by the CPU to compute the point cloud from the disparity map grows far more aggressively than on a GPU as the number of pixels goes up.

It is important to note that not all parts of the code can be parallelized on a GPU. In such cases, CPU parallelism using OpenMP and pthreads can still yield considerable speed-ups. Apart from the above, it is important not to ignore the obvious compiler optimizations!
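For those CPU-bound parts, OpenMP parallelism is often a one-line change. A minimal sketch (the function and data here are illustrative, not our actual code; compile with `-fopenmp`, and without that flag the pragma is simply ignored and the loop runs serially with identical results):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scale every disparity value in place, splitting the loop
// across CPU cores. Each iteration touches a distinct element,
// so there are no data races and OpenMP may partition the
// iterations freely among threads.
void scaleDisparity(std::vector<float>& disp, float s) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(disp.size()); ++i)
        disp[i] *= s;
}
```

The signed loop index keeps the pragma valid even on older OpenMP versions, and because the result is independent of the thread count, the serial and parallel builds produce identical output.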

In conclusion,

Improvements in single-threaded performance are reaching their limits; gone are the days of exponential gains from one processor generation to the next. It has therefore become increasingly important to think parallelly: break the computation into independent chunks and make use of all the available hardware. Most future improvements in processors will likely come from more cores, or from lower power consumption at existing (or slightly higher) clock speeds. With this in mind, we used CUDA, OpenMP, and pthreads to improve the performance of the task at hand, i.e. depth perception from stereo vision.

To read more about how we used hardware acceleration to compare LiDAR and vision for the task of autonomous driving, do check out our article “Vision Based Depth for Autonomous Machines”.

To read more about how we Containerised the Formula Student Driverless Simulator using Docker, do check out our article “Formula Student Driverless Simulator on Docker”


Dhruval PB

Computer Science student at PES University, Class of 2023