CSE 305 - Tutorial 8b - SIMD intrinsics
In this tutorial, you will explore parallelization using SIMD instructions. It turns out that most of modern processors can performs operations on several numbers in a single instruction if the numbers are properly located. For example, one can perform four multiplications of double precision numbers with a single instruction. This provides an opportunity to do parallel computations even on a single core!
Since the introduction of SIMD instructions in the late 90-s, different sets of instructions have been introduced. In order not to get buried under technical details, today we will restrict ourselves to
- AVX/AVX2 instructions which are almost standard these days (supported by the computer room machines and most likely by your laptop);
- working only with
doubletype (most of the instructions have their versions for floats and integers).
The main C++ type for us will be __m256d which stores
four doubles, so you can think of it as
double[4]. This line of code will create a variable of this
type storing numbers 1.0, 2.0, 3.0, 4.0:
__m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);Then you can compute squares of all these numbers with a single CPU instruction by writing
a = _mm256_mul_pd(a, a);After this, a will contain
1.0, 4.0, 9.0, 16.0. Take a look at the demo codes
(demo1.cpp and demo2.cpp in the archive) for a
longer code using such multiplication and comparing it to the standard
one.
So, the overall idea of using SIMD instructions to speed up the code is (somewhat reminiscent of CPUs):
- put the relevant data into
__m256dvariables usingsetorloadinstructions; - perform the computation using SIMD instructions;
- fetch the data using
storeinstructions.
The tricky part (and somewhat art) is to do as much computation as
possible staying in the SIMD registers (that is, in
__m256d) since moving data around often causes slowdowns. A
nice catalogue of existing instructions is avaliable here.
Note that we are interested in instructions working with
doubles, they always end with pd and are,
thus, of the form _mm256_*_pd.
Demo code as well as the starting code is available in the archive. There are no automatic tests on the server but you are encouraged to upload your code. In the exercises below, the target speedup is given for the lab machines, it may be a bit different for you depending on the processor/compiler.
File
sqrt.cppgives a functionsqrt_vectorwhich takes as input an array ofdoubles and computes their square roots. Implement a functionsqrt_vector_simdwhich does the same using SIMD instructions (yes, there is an intstruction for taking square roots!). You can test your code by doingmake sqrtand running./sqrt.Expected speedup: 2 times.
Upload your file sqrt.cpp:
Logistic map is a discrete dynamical system defined by an equation \(x_{n + 1} = \lambda x_n (1 - x_n)\). That is, the state at the time \(n + 1\) is computed from the state at the time \(n\) by this formula (and \(\lambda\) is a scalar parameter). File
logistic.cppgives a functionlogistic_batchwhich takes as input an array of four (guess why) initial values \(x_0\), the value of \(\lambda\), and computes the state after the given number of iterations. Implement a functionlogistic_batch_simdwhich does the same using SIMD instructions. You may find_mm256_set1_pdand_mm256_fmsub_pdinstructions useful (the latter is also more precise!). You can test your code by doingmake logisticand running./logistic.Expected speedup: 5-6 times.
Upload your file logistic.cpp:
We define a structure
Vector2Dwith two field (coordinates)xandyof typedouble. Filelengths.cppgives a functioncompute_lengthswhich takes as input an array ofVector2Dand computes their lengths. Implement a functioncompute_lengths_simdwhich does the same using SIMD instructions. You can test your code by doingmake lengthsand running./lengths.This is trickier than it may sound since in your input data
xandycoordinates are interleaved, and you would like to have them separated. You can separate them before putting into SIMD registers but this will be costly and may ruin the speed up. Another solution is to use instructions that allow to shuffle the data inside SIMD registers, you may want to use_mm256_shuffle_pdand_mm256_permute4x64_pd.Expected speedup: 2 times.
Upload your file lengths.cpp:
We would like a ball at
xmeters high moving with the velocitypunder the gravity force and the air resistance force. Furthermore, atx = 0, there is a floor from which the ball reflects elastically (that is, the speed of the ball changes sign but keeps the absolute value). Filejolly_jumper.cppgives a functionjumper_batchwhich simulates this process for 4 balls. Implement a functionjumper_batch_simdwhich does the same using SIMD instructions. You can test your code by doingmake jolly_jumperand running./jolly_jumper.You will need to implement the parallel version of conditionals, take a look at the instructions
_mm256_max_pd,_mm256_cmp_pd, and_mm256_blendv_pd.Expected speedup: 2-3 times.
Upload your file jolly_jumper.cpp: