NEON Vectorization Workshop
Unleash the performance of your embedded ARM chip with NEON!
We introduce vectorization with NEON from the ground up, starting with the most basic concepts and working up to very advanced vectorization topics.
Workshop agenda:
- A short introduction to vectorization
- Introduction to NEON intrinsics
- Advanced NEON intrinsics
- Basic vectorization patterns – vectorizing for loops, for loops with early exit, while loops and convergence loops
- Common vectorization patterns – vectorizing loops with conditions, conditional counting, loops with structs and matrix transposition.
- Vectorization inhibitors – learn to detect and remove obstacles that hinder efficient vectorization
- Vectorization types according to data access pattern – there are several ways to vectorize a loop nest; here we investigate inner-loop vectorization and outer-loop vectorization.
- Advanced vectorization patterns – we talk about how to vectorize copy_if, trees and lookup tables.
- Memory performance – improve the performance of your vectorized code by better using the memory subsystem.
- Peak performance – reach peak software performance by breaking instruction dependencies, avoiding register spills and cleverly using everything the hardware has to offer.
Hardware/Software requirements
The workshop can be done on real hardware or on an emulator.
For doing the NEON workshop on real hardware, you will need:
An ARM 64-bit CPU with NEON extensions. These can be:
- Any embedded CPU with ARM64 and Linux, e.g. a Raspberry Pi 3 or later. On a Linux ARM system, you can check by running: lscpu | grep -e asimd -e aarch64. The output of this command should look like:
Architecture: aarch64
Flags: fp asimd evtstrm crc32 cpuid,
- A macOS device with Apple Silicon M1 or later.
You should have g++ installed on Linux or Xcode on Mac.
For doing the NEON workshop on an emulator, you will need:
- An x86-64 system with either Ubuntu-based Linux OR Ubuntu running on Windows Subsystem for Linux (WSL)
- qemu-aarch64 for emulating the system (available in the Ubuntu repositories through the qemu-user package)
- aarch64-linux-gnu-g++ compiler (available in the Ubuntu repositories through the g++-aarch64-linux-gnu package)

Ivica is a Senior Software Engineer with 15 years of experience in Linux and bare-metal embedded systems. His professional focus is application performance improvement: techniques that make your C/C++ program run faster by using better algorithms, better exploiting the underlying hardware, and making better use of the standard library, the programming language, and the operating system. He writes a performance-related tech blog: https://johnysswlab.com