CSE 305 - Tutorial 8b - SIMD intrinsics
In this tutorial, you will explore parallelization using SIMD instructions. It turns out that most of modern processors can performs operations on several numbers in a single instruction if the numbers are properly located. For example, one can perform four multiplications of double precision numbers with a single instruction. This provides an opportunity to do parallel computations even on a single core!
Since the introduction of SIMD instructions in the late 90-s, different sets of instructions have been introduced. In order not to get buried under technical details, today we will restrict ourselves to
- AVX/AVX2 instructions which are almost standard these days (supported by the computer room machines and most likely by your laptop);
- working only with
double
type (most of the instructions have their versions for floats and integers).
The main C++ type for us will be __m256d
which stores
four double
s, so you can think of it as
double[4]
. This line of code will create a variable of this
type storing numbers 1.0, 2.0, 3.0, 4.0
:
= _mm256_set_pd(4.0, 3.0, 2.0, 1.0); __m256d a
Then you can compute squares of all these numbers with a single CPU instruction by writing
= _mm256_mul_pd(a, a); a
After this, a
will contain
1.0, 4.0, 9.0, 16.0
. Take a look at the demo codes
(demo1.cpp
and demo2.cpp
in the archive) for a
longer code using such multiplication and comparing it to the standard
one.
So, the overall idea of using SIMD instructions to speed up the code is (somewhat reminiscent of CPUs):
- put the relevant data into
__m256d
variables usingset
orload
instructions; - perform the computation using SIMD instructions;
- fetch the data using
store
instructions.
The tricky part (and somewhat art) is to do as much computation as
possible staying in the SIMD registers (that is, in
__m256d
) since moving data around often causes slowdowns. A
nice catalogue of existing instructions is avaliable here.
Note that we are interested in instructions working with
double
s, they always end with pd
and are,
thus, of the form _mm256_*_pd
.
Demo code as well as the starting code is available in the archive. There are no automatic tests on the server but you are encouraged to upload your code. In the exercises below, the target speedup is given for the lab machines, it may be a bit different for you depending on the processor/compiler.
File
sqrt.cpp
gives a functionsqrt_vector
which takes as input an array ofdouble
s and computes their square roots. Implement a functionsqrt_vector_simd
which does the same using SIMD instructions (yes, there is an intstruction for taking square roots!). You can test your code by doingmake sqrt
and running./sqrt
.Expected speedup: 2 times.
Upload your file sqrt.cpp
:
Logistic map is a discrete dynamical system defined by an equation \(x_{n + 1} = \lambda x_n (1 - x_n)\). That is, the state at the time \(n + 1\) is computed from the state at the time \(n\) by this formula (and \(\lambda\) is a scalar parameter). File
logistic.cpp
gives a functionlogistic_batch
which takes as input an array of four (guess why) initial values \(x_0\), the value of \(\lambda\), and computes the state after the given number of iterations. Implement a functionlogistic_batch_simd
which does the same using SIMD instructions. You may find_mm256_set1_pd
and_mm256_fmsub_pd
instructions useful (the latter is also more precise!). You can test your code by doingmake logistic
and running./logistic
.Expected speedup: 5-6 times.
Upload your file logistic.cpp
:
We define a structure
Vector2D
with two field (coordinates)x
andy
of typedouble
. Filelengths.cpp
gives a functioncompute_lengths
which takes as input an array ofVector2D
and computes their lengths. Implement a functioncompute_lengths_simd
which does the same using SIMD instructions. You can test your code by doingmake lengths
and running./lengths
.This is trickier than it may sound since in your input data
x
andy
coordinates are interleaved, and you would like to have them separated. You can separate them before putting into SIMD registers but this will be costly and may ruin the speed up. Another solution is to use instructions that allow to shuffle the data inside SIMD registers, you may want to use_mm256_shuffle_pd
and_mm256_permute4x64_pd
.Expected speedup: 2 times.
Upload your file lengths.cpp
:
We would like a ball at
x
meters high moving with the velocityp
under the gravity force and the air resistance force. Furthermore, atx = 0
, there is a floor from which the ball reflects elastically (that is, the speed of the ball changes sign but keeps the absolute value). Filejolly_jumper.cpp
gives a functionjumper_batch
which simulates this process for 4 balls. Implement a functionjumper_batch_simd
which does the same using SIMD instructions. You can test your code by doingmake jolly_jumper
and running./jolly_jumper
.You will need to implement the parallel version of conditionals, take a look at the instructions
_mm256_max_pd
,_mm256_cmp_pd
, and_mm256_blendv_pd
.Expected speedup: 2-3 times.
Upload your file jolly_jumper.cpp
: