NEON Vectorization Workshop

Unleash the performance of your embedded ARM chip with NEON!

  • Sept 22
    Magazinet Kongsberg
    2 days
    07:00 - 15:00 UTC
    Ivica Bogosavljevic
    13 490 NOK

We introduce vectorization with NEON from the ground up, starting with the most basic concepts and working up to very advanced vectorization topics.

Workshop agenda:

  • A short introduction to vectorization
  • Introduction to NEON intrinsics
  • Advanced NEON intrinsics
  • Basic vectorization patterns – vectorizing for loops, for loops with early exit, while loops and convergence loops
  • Common vectorization patterns – vectorizing loops with conditions, conditional counting, loops with structs and matrix transposition.
  • Vectorization inhibitors – learn to detect and remove obstacles that hinder efficient vectorization
  • Vectorization types according to data access pattern – there are several ways to vectorize a loop nest; here we investigate inner-loop vectorization and outer-loop vectorization.
  • Advanced vectorization patterns – we talk about how to vectorize copy_if, trees and lookup tables.
  • Memory performance – improve the performance of your vectorized code by better using the memory subsystem.
  • Peak performance – reach peak software performance by breaking instruction dependencies, avoiding register spills and cleverly using everything the hardware has to offer.
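To give a flavor of what the first agenda items cover, here is a minimal sketch of a for loop vectorized with NEON intrinsics. It is an illustration only, not taken from the workshop material: the function name add_arrays is hypothetical, and the scalar fallback is an assumption added so the same file also compiles on machines without NEON. The intrinsics used (vld1q_f32, vaddq_f32, vst1q_f32) come from arm_neon.h.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper for illustration: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    // Precondition: pointers must be valid whenever there is work to do.
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));

    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Vector loop: a 128-bit NEON register holds four 32-bit floats,
    // so each iteration processes four elements at once.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);       // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);       // load 4 floats from b
        vst1q_f32(out + i, vaddq_f32(va, vb));   // lane-wise add, store 4 results
    }
#endif
    // Scalar loop: handles the leftover elements after the vector loop,
    // and the whole array when NEON is not available at compile time.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The split into a vector loop plus a scalar tail is one common way to handle array lengths that are not a multiple of the vector width; other approaches exist, such as padding the arrays.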

Hardware/Software requirements
The workshop can be done on real hardware or on an emulator.

To do the NEON workshop on real hardware, you will need:

A 64-bit ARM CPU with NEON extensions. This can be:

  • Any embedded CPU with ARM64 and Linux, e.g. a Raspberry Pi 3 or later. On a Linux ARM system, you can check by running: lscpu | grep -e asimd -e aarch64. The output of this command should include lines like:

Architecture: aarch64
Flags: fp asimd evtstrm crc32 cpuid,

  • A macOS device with Apple Silicon (M1 or later).

You should have g++ installed on Linux or Xcode on Mac.

To do the NEON workshop on an emulator, you will need:

  • An x86-64 system with either an Ubuntu-based Linux OR Ubuntu running on Windows Subsystem for Linux (WSL)
  • qemu-aarch64 for emulating the system (available in the Ubuntu repositories through the qemu-user package)
  • The aarch64-linux-gnu-g++ compiler (available in the Ubuntu repositories through the g++-aarch64-linux-gnu package)

Ivica Bogosavljevic
Application Performance Engineer at Johnny's Software Lab

Ivica is a Senior Software Engineer with 15 years of experience in Linux and bare-metal embedded systems. His professional focus is application performance improvement: techniques that make your C/C++ program run faster through better algorithms, better use of the underlying hardware, and better use of the standard library, the programming language, and the operating system. He writes a performance-related tech blog: https://johnysswlab.com
