CSE 305 - Tutorial 8b - SIMD intrinsics

In this tutorial, you will explore parallelization using SIMD instructions. It turns out that most of modern processors can performs operations on several numbers in a single instruction if the numbers are properly located. For example, one can perform four multiplications of double precision numbers with a single instruction. This provides an opportunity to do parallel computations even on a single core!

Since the introduction of SIMD instructions in the late 90-s, different sets of instructions have been introduced. In order not to get buried under technical details, today we will restrict ourselves to

AVX/AVX2 instructions which are almost standard these days (supported by the computer room machines and most likely by your laptop);
working only with double type (most of the instructions have their versions for floats and integers).

The main C++ type for us will be __m256d which stores four doubles, so you can think of it as double[4]. This line of code will create a variable of this type storing numbers 1.0, 2.0, 3.0, 4.0:

    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);

Then you can compute squares of all these numbers with a single CPU instruction by writing

    a = _mm256_mul_pd(a, a);

After this, a will contain 1.0, 4.0, 9.0, 16.0. Take a look at the demo codes (demo1.cpp and demo2.cpp in the archive) for a longer code using such multiplication and comparing it to the standard one.

So, the overall idea of using SIMD instructions to speed up the code is (somewhat reminiscent of CPUs):

put the relevant data into __m256d variables using set or load instructions;
perform the computation using SIMD instructions;
fetch the data using store instructions.

The tricky part (and somewhat art) is to do as much computation as possible staying in the SIMD registers (that is, in __m256d) since moving data around often causes slowdowns. A nice catalogue of existing instructions is avaliable here. Note that we are interested in instructions working with doubles, they always end with pd and are, thus, of the form _mm256_*_pd.

Demo code as well as the starting code is available in the archive. There are no automatic tests on the server but you are encouraged to upload your code. In the exercises below, the target speedup is given for the lab machines, it may be a bit different for you depending on the processor/compiler.

File sqrt.cpp gives a function sqrt_vector which takes as input an array of doubles and computes their square roots. Implement a function sqrt_vector_simd which does the same using SIMD instructions (yes, there is an intstruction for taking square roots!). You can test your code by doing make sqrt and running ./sqrt.

Expected speedup: 2 times.

Upload your file sqrt.cpp:

Upload form is only available when connected

Logistic map is a discrete dynamical system defined by an equation \(x_{n + 1} = \lambda x_n (1 - x_n)\). That is, the state at the time \(n + 1\) is computed from the state at the time \(n\) by this formula (and \(\lambda\) is a scalar parameter). File logistic.cpp gives a function logistic_batch which takes as input an array of four (guess why) initial values \(x_0\), the value of \(\lambda\), and computes the state after the given number of iterations. Implement a function logistic_batch_simd which does the same using SIMD instructions. You may find _mm256_set1_pd and _mm256_fmsub_pd instructions useful (the latter is also more precise!). You can test your code by doing make logistic and running ./logistic.

Expected speedup: 5-6 times.

Upload your file logistic.cpp:

Upload form is only available when connected

We define a structure Vector2D with two field (coordinates) x and y of type double. File lengths.cpp gives a function compute_lengths which takes as input an array of Vector2D and computes their lengths. Implement a function compute_lengths_simd which does the same using SIMD instructions. You can test your code by doing make lengths and running ./lengths.

This is trickier than it may sound since in your input data x and y coordinates are interleaved, and you would like to have them separated. You can separate them before putting into SIMD registers but this will be costly and may ruin the speed up. Another solution is to use instructions that allow to shuffle the data inside SIMD registers, you may want to use _mm256_shuffle_pd and _mm256_permute4x64_pd.

Expected speedup: 2 times.

Upload your file lengths.cpp:

Upload form is only available when connected

We would like a ball at x meters high moving with the velocity p under the gravity force and the air resistance force. Furthermore, at x = 0, there is a floor from which the ball reflects elastically (that is, the speed of the ball changes sign but keeps the absolute value). File jolly_jumper.cpp gives a function jumper_batch which simulates this process for 4 balls. Implement a function jumper_batch_simd which does the same using SIMD instructions. You can test your code by doing make jolly_jumper and running ./jolly_jumper.

You will need to implement the parallel version of conditionals, take a look at the instructions _mm256_max_pd, _mm256_cmp_pd, and _mm256_blendv_pd.

Expected speedup: 2-3 times.

Upload your file jolly_jumper.cpp:

Upload form is only available when connected