INF442 - Introduction to Unix and basic statistics in C++
- Structure
- Before you begin
- Section 0 (optional) — Useful basic commands in Unix terminal
- Section 1 — Quiz: wine consumption data
- Section 2 (optional) — A few simple C++ programs
- Section 3 — Computing basic statistics: Mean, Variance and Standard Deviation
- Section 4 — Working with matrices
- Section 5 — Covariance and Correlation Matrices
Structure
- Section 0 provides a very brief introduction to the fundamental commands that you can use with a Unix/Linux terminal. More information is provided in a separate tutorial. If you are already comfortable with Unix and/or Linux, then you can safely skip this section; however, we do recommend that you take at least a quick look at it to make sure that you master all the commands introduced therein.
- Section 1 provides the required background and material for the quiz. The quiz is graded.
- Section 2 provides a gentle introduction to writing, compiling and executing a simple C++ program. If you are already comfortable with these concepts, then you can safely skip it as well; however, we do recommend that you take at least a quick look at the section to make sure that you master all the concepts introduced therein.
- Finally, Sections 3, 4 and 5 comprise the graded exercises that you have to complete. Their goal is to compute some basic statistics on data.
Before you begin
Download and unzip the archive INF442-td1-1-handin.zip.
It contains the following files:
- the source file stats_functions.cpp, which you will have to complete,
- the header file stats_functions.hpp, which you should not modify,
- the file main.cpp, which is used by the test scripts,
- a Makefile that you can use to compile these tests (see Section 2).
It also contains some subdirectories that you are not expected to modify:
- csv with the main dataset of the TD,
- grading with the test scripts (you may want to modify it for debugging though),
- gradinglib with the test library.
We compiled some general instructions that can help you get things set up for the lab sessions of the course.
Section 0 (optional) — Useful basic commands in Unix terminal
Open a terminal. Play a bit with the following commands:
- pwd (Print Working Directory): prints the current working directory.
  To access a directory or a file, one has to specify a path. Paths can be absolute or relative. An absolute path starts from the root, denoted by /, and goes all the way to the desired element (directory or file). The command pwd prints absolute paths. For instance, on Linux systems, the home directory of the user student is /home/student. This can be read as follows: from the root, go to the sub-directory home, then to the sub-directory student. (On other systems, the default home directory might be different. For instance, on macOS, it would be /Users/student.)
- ls (LiSt): lists the files and directories in the current working directory or in the directory given as a parameter. Compare the output of this command with different options: ls /, ls -l /, and ls -lh / (the three commands list the content of the root directory in different formats).
  A single dot . refers to the current working directory. Hence, the commands ls and ls . are equivalent. Two dots .. refer to the parent directory of the current one.
- cd (Change Directory): moves us in the directory tree. It can use both absolute and relative paths. For instance, the following command line moves us to the home directory of our hypothetical user student, which becomes the new working directory:
  $ cd /home/student
  (Here, $ is the command prompt; yours could look different. Do not type it in the terminal.)
  In order to move to the home directory of another user student1, one can either issue the same command replacing student by student1 or, instead, issue the following two commands using relative paths:
  $ cd ..
  $ cd student1
  $ pwd
  /home/student1
  The first command moves to the parent directory, i.e. /home. The second one moves from there to the sub-directory student1.
  cd ~ (tilde) moves to the home directory of the current user, and cd - (minus) moves to the previous location:
  $ cd ~ ; pwd
  /home/student
  $ cd /home/student1 ; pwd
  /home/student1
  $ cd - ; pwd
  /home/student
  (Notice how we use ; to chain two commands on the same line.)
- mkdir (MaKe DIRectory): creates a new directory. For instance,
  $ mkdir mydir
  creates a directory named mydir in the working directory.
- rm: deletes a file.
- rmdir: deletes an empty directory.
Streams are input and output communication channels between a computer program and its environment. There are three standard streams: stdin, stdout, and stderr. The first two correspond to the "normal" input and output channels, whereas the third one is used for printing error messages. By default, in Unix/Linux terminals, stdin receives the input coming from the keyboard. Similarly, by default, all content channeled through stdout and stderr is printed directly in the terminal. However, these streams may be redirected to or from files using the >, >>, and < operators:
- > redirects stdout to the file given as an argument, erasing its previous content,
- >> appends stdout to the content of the file given as an argument,
- < reads the contents of the file given as an argument into stdin.
Finally,
- echo: prints its arguments on the standard output.
- cat (conCATenate): prints the contents of the file given as an argument on the standard output (more complex in reality but this is good enough for now).
Now you can play around with the above commands to create files for the INF442 exercises as follows:
- Move to your home directory, using ls and cd
- Create a directory INF442
- Change to the INF442 directory
- Create a new subdirectory TD1
- Change to the TD1 directory
During the next sessions, you will create sub-directories TD2, TD3, etc. For now, you can continue playing around with the above commands:
- Use echo and > to print some message into a file test.txt.
- Use ls -l and >> to append the contents of the root directory / to the file test.txt.
- Use cat to check that the content of the file test.txt corresponds to your expectations.
- Use rm to delete the file.
Repeat until you manage to obtain something like this:
$ cat test.txt
Here are the contents of the root dir:
total 10
drwxrwxr-x+ 98 root admin 3136 Feb 28 20:15 Applications
drwxr-xr-x 83 root wheel 2656 Feb 26 01:35 Library
drwxr-xr-x@ 8 root wheel 256 Dec 5 2019 System
drwxr-xr-x 5 root admin 160 Dec 5 2019 Users
drwxr-xr-x 6 root wheel 192 Feb 28 20:46 Volumes
drwxr-xr-x@ 38 root wheel 1216 Feb 26 01:31 bin
drwxr-xr-x 3 root wheel 96 Dec 5 2019 com.apple.TimeMachine.localsnapshots
drwxr-xr-x 2 root wheel 64 Nov 9 2019 cores
dr-xr-xr-x 3 root wheel 4795 Feb 26 10:00 dev
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 etc -> private/etc
lrwxr-xr-x 1 root wheel 25 Feb 26 10:00 home -> /System/Volumes/Data/home
drwxr-xr-x 5 root wheel 160 Dec 18 2019 opt
drwxr-xr-x 6 root wheel 192 Feb 26 01:32 private
drwxr-xr-x@ 63 root wheel 2016 Feb 26 01:31 sbin
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 tmp -> private/tmp
drwxr-xr-x@ 11 root wheel 352 Dec 18 2019 usr
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 var -> private/var
Section 1 — Quiz: wine consumption data
Let's start playing with some data!
The file wines.csv in the subdirectory csv/ includes the consumption of 17 types of wine in 8 countries, expressed in kL. We will consider wines as variables and countries as individuals.
The histogram below represents the cumulative consumption for each type of wine:
In the second part of this TD, you will program several functions to compute means, variances, and covariance matrices. The eigendecomposition of the covariance matrix is used by Principal Component Analysis (PCA)—a projection technique that reduces the number of dimensions in a given dataset. Given a dataset \(\mathbf{X}\) with \(n\) observations (vectors) \(\mathbf{x}^T = (x_1, x_2, \dots, x_d)\) of \(d\) dimensions, we perform an orthogonal transformation of the vector space, such that the first element of the new basis (called the first principal component) contributes the most to the variance of the data, then the second basis element (second principal component) contributes the most to the residual variance after projecting the data along the first basis element, and so on.
This means that, once we have our data represented in the new basis, the projection of the dataset over the first \(j\) principal components will preserve most of its variance. The following graph is obtained for the individuals (countries) analysis on normalized data.
Let us now create a short Python script using the pandas library, to compute a few basic statistics on our data. Before we begin, download and install the pandas library:
$ pip3 install --user --only-binary=:all: pandas
Then open a new file named stats.py in the root directory of the TD. This file will contain our Python script. You are free to use any editor you wish; to open the file in VS Code, you can use the terminal command code stats.py. You can run it at any time using the following command in the terminal:
$ python3 stats.py
Here is now the content of our Python script, which you can copy-paste into your file stats.py.
First, we import the pandas library and set a few default display options:
import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 2)
Now, we read in the data from the .csv file and store them in a pandas dataframe:
data = pd.read_csv('csv/wines.csv', index_col=0)
Let us print our data (wines are in columns, countries are in rows):
print(data)
The mean for each individual row in the dataframe data can be computed via data.mean(axis=1), while the mean for each individual column can be computed via data.mean(axis=0). Just as the method mean computes the means, the method var computes the variances in a dataframe. More precisely, data.var(axis=1, ddof=0) computes the variance for each individual row, while data.var(axis=0, ddof=0) computes the variance for each individual column. The parameter ddof (delta degrees of freedom) sets the divisor used in the calculation to N - ddof, where N is the number of values, i.e., ddof=0 for the variance and ddof=1 for the (corrected) sample variance.
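For reference, these are the standard definitions behind the two settings of ddof, written for \(N\) values \(x_1, \dots, x_N\) (the symbols \(\bar{x}\), \(\sigma^2\) and \(s^2\) are notation introduced here for illustration, not names used by pandas):

\[
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 \quad (\texttt{ddof=0}), \qquad
s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2 \quad (\texttt{ddof=1}).
\]

For example, for the values 1, 2, 3, 4, the mean is 2.5, the squared deviations sum to 5, the variance is 5/4 = 1.25 and the corrected sample variance is 5/3 ≈ 1.67.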
We can thus compute the mean, variance and corrected variance per country as follows:
# stats per country
print('\nCountries:')
meanvar = pd.concat([data.mean(axis=1), data.var(axis=1, ddof=0), data.var(axis=1, ddof=1)], axis=1)
meanvar.columns = ['mean', 'variance', 'sample variance']
print(meanvar)
Now we do the same per wine:
# stats per wine
print('\nWines:')
meanvar = pd.concat([data.mean(axis=0), data.var(axis=0, ddof=0), data.var(axis=0, ddof=1)], axis=1)
meanvar.columns = ['mean', 'variance', 'sample variance']
print(meanvar)
Computing the correlation matrix for the dataframe data with respect to the columns can be done via data.corr(). To do the computation with respect to the rows, we can simply transpose the matrix before calling the method corr: data.transpose().corr().
We can thus compute the correlation matrices per country and per wine as follows:
# correlation matrices
print('\nWines correlation matrix:')
print(data.corr())
print('\nCountries correlation matrix:')
print(data.transpose().corr())
Analyze the data in wines.csv as well as the two figures above (the histogram and the PCA graph) to answer the questions in the quiz on Moodle.
You can find a more advanced Python script producing, in particular, the PCA plot above here.
Section 2 (optional) — A few simple C++ programs
Section 2.1 — First C++ program
In this part of the TD, we will practice writing, compiling and executing C++ programs step by step.
Caution: If you create a Makefile to build your C++ files, take care not to delete (or overwrite) the Makefile provided in the INF442-td1-1-handin.zip archive. The safest choice is probably to experiment with these optional exercises in a new directory.
Here is a simple program that prints a fixed message on the standard output stream:
// Include the library that defines input/output streams and operations on them
#include <iostream>
int main() {
    // Print "Hello, world!" on the standard output...
    std::cout << "Hello, world!" // <- Notice: no ';' here: 1 statement / 2 lines
    // ...then advance to the next line
              << std::endl;      // <- 2nd line of the statement
    return 0;
}
Using a text editor of your choice (we recommend VS Code), save this program in a file called hello.cpp. Then, in a terminal window (integrated in VS Code or other), issue the following commands to compile and run the program:
$ g++ -o hello hello.cpp
$ ./hello
Here, -o hello tells g++ to generate a program called hello, which is executed in the second line. We have to prefix hello with ./ to tell the system to look for the executable file hello in the current directory (denoted by .).
Section 2.2 — Input from the keyboard
In order to get information from the user via the terminal, you can use the object cin and the operator >>. Modify your program by inserting the following lines before the return statement:
// Declare the variable 'name' to be an array of 256 characters
char name[256];
std::cout << "What is your name?" << std::endl;
// Read the name from the standard input
std::cin >> name;
std::cout << "Hello, " << name << std::endl;
Recompile and run the program using the same commands as above. You should now see the following:
Hello, world!
What is your name?
Enter your first name and press the Enter key. Hello, followed by your name should appear.
Section 2.3 — Input/output using redirection
Create a file names.txt containing some text in the same directory as your files hello.cpp and hello.
Run the program hello by redirecting the input from the file names.txt as follows and observe the output:
./hello < names.txt
Run the program again, redirecting the output to the file output.txt:
./hello < names.txt > output.txt
Check the content of output.txt using cat and repeat using >> instead of >:
./hello < names.txt >> output.txt
Run several more times, alternating > and >>, and see how the contents of output.txt change.
Section 2.4 — Makefiles
A makefile is a file containing a set of directives used to compile different files and generate the executable.
Create a file named makefile or Makefile with the following content:
CC = g++ # defines a variable, called 'CC'
hello: hello.cpp # 'hello' (the target) depends on 'hello.cpp'
	$(CC) -o hello hello.cpp # the command to build the target
clean: # Another target
	rm -f hello *~
Remark: The Makefile format is relatively rigid and has to be carefully respected. (Check the GNU Make manual for a more detailed reference.) In particular, note that command lines (those starting with $(CC) and rm in the example) must start with a tab character.
Alternate the following commands several times and observe the effect using ls or ls -l:
make
make hello
make clean
(By default, the first target, in our case hello, is built.)
Remark: It is possible to have a makefile that is not named makefile or Makefile. If the name of your makefile is foo.bar, it can still be used by passing an explicit argument to make, for example:
make -f foo.bar hello
Section 2.5 — A simple loop
Expand your hello program by inserting the following code before the return statement:
int count = 0;
std::cout << "Give a positive number: ";
std::cin >> count;
std::cout << "Hey";
for (int i = 1; i < count; i++) { // <-- Here goes the loop!
    std::cout << "-hey";
}
std::cout << ", " << name << "!" << std::endl;
Compile and run it to observe the output.
Section 2.6 — Hello, multiple people!
Modify the program to ask instead for the number of people present, then greet each one by their name separately. The output should look like this:
$ ./hello
Hello, world!
How many are you? 3
What is your name?
Alice
Hello, Alice!
What is your name?
Bob
Hello, Bob!
What is your name?
Charlie
Hello, Charlie!
Section 2.7 — Segmentation faults
So far, you saw some basic programs that work. It is important, however, to also see some that do not … or at least not as expected. Here is one:
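#include <iostream>
int main() {
    int n;
    std::cout << "Enter a number: ";
    std::cin >> n;
    std::cout << "I am going to build an array of "
              << n
              << " elements"
              << std::endl;
    // Note: an array whose size is only known at runtime is a compiler
    // extension (accepted by g++), not standard C++
    int array[n];
    // Deliberate bug: the loop runs up to 2*n - 1, past the end of the array
    for (int i = 0; i < 2*n; i++) {
        array[i] = i*i;
        std::cout << "Wrote "
                  << array[i]
                  << " into cell "
                  << i
                  << " of the array"
                  << std::endl;
    }
    return 0;
}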
Copy that program into a file, say segfault.cpp, compile it (for instance with g++ -o segfault segfault.cpp), then run it. The output should be similar (although perhaps not identical) to:
$ ./segfault
Enter a number: 10
I am going to build an array of 10 elements
Wrote 0 into cell 0 of the array
Wrote 1 into cell 1 of the array
Wrote 4 into cell 2 of the array
Wrote 9 into cell 3 of the array
Wrote 16 into cell 4 of the array
Wrote 25 into cell 5 of the array
Wrote 36 into cell 6 of the array
Wrote 49 into cell 7 of the array
Wrote 64 into cell 8 of the array
Wrote 81 into cell 9 of the array
Wrote 100 into cell 10 of the array
Wrote 121 into cell 11 of the array
Wrote 144 into cell 12 of the array
Wrote 169 into cell 13 of the array
Segmentation fault: 11
What's going on? If you look carefully at the code, you will notice that the array has size \(n\), so only \(n\) cells are reserved for it in the computer's memory. However, the loop goes from \(0\) to \(2n-1\), which means that we are trying to write to memory that is not reserved for us. (Read about segmentation faults to find out more.) An important thing to notice is that the segmentation fault occurs when writing to cell 14, not at cell 10 (the eleventh cell, the first one outside the array) as one would expect! Writing outside the bounds of an array is undefined behavior, so the program may crash immediately, crash later, or not crash at all. Take-away: you have to be very careful with array bounds. It is very easy to make a mistake, and in the worst case, you may not notice it immediately at runtime.
Section 3 — Computing basic statistics: Mean, Variance and Standard Deviation
The file stats_functions.cpp provides a skeleton of a program that computes some basic statistics for an array of double values.
Exercise 1
Complete the following functions in your program (for this you will need to use the square root function std::sqrt(double) from the C numerics library cmath):
- double ComputeMean(double values[], int length) computes the mean of the given values
- double ComputeVariance(double values[], int length) computes the variance
- double ComputeSampleVariance(double values[], int length) computes the unbiased sample variance
- double ComputeStandardDeviation(double values[], int length) computes the standard deviation
- double ComputeSampleStandardDeviation(double values[], int length) computes the sample standard deviation (which is still biased)
You can observe the output by compiling the provided program grader with the command make grader and then running ./grader 1. The command make grader launches the compilation using the makefile that we provided. The command ./grader 1 runs the tests for the exercise.
Caution: In C++, when you divide two ints, the result is an int. For more details about the various conversions, you might take a look here. This tip might be useful when you compute the unbiased sample variance.
Once your program passes all (or at least some) of the tests, upload your file stats_functions.cpp.
Section 4 — Working with matrices
In the previous section, you were manipulating linear arrays. As you have seen, such arrays can be used as parameters of functions but the length has to be specified separately. In this section, we move on to matrices, i.e., 2-dimensional arrays. Here, things become even more complex since an array of arrays is not guaranteed to be of rectangular shape: each row could potentially have a different length. For this reason, it is not possible for a matrix to be used as a function parameter in a similar manner as the one used above for arrays:
// This is not allowed!
double MyFunction (double matrix[][], int rows, int columns) {
    ...
}
Instead, one relies on the fact that, since all elements of an array are stored contiguously in memory, an array can be accessed through a pointer to its first element (more details in future lectures). Thus, one can have
// This is fine!
double MyFunction (double **matrix, int rows, int columns) {
    return matrix[0][0];
}
Notice that we can still use the brackets notation matrix[i][j] to access individual cells of the matrix.
Exercise 2
Complete the following functions in your program (pay attention to the indices when manipulating matrices: always row first):
- void PrintMatrix(double** matrix, int rows, int columns) prints a rectangular matrix on the standard output, placing each row on a separate line, with a single whitespace between consecutive entries
- void GetRow(double** matrix, int columns, int index, double row[]) copies the row with index index of the matrix to the array row provided as an argument
- void GetColumn(double** matrix, int rows, int index, double column[]) copies the column with index index of the matrix to the array column provided as an argument
Re-compile the test program via make grader and run the tests via ./grader 2.
Once your program passes all (or at least some) of the tests, upload your file stats_functions.cpp.
Section 5 — Covariance and Correlation Matrices
We will now compute correlations and covariances. Then we will build the covariance matrix which was used in Section 1 to classify countries according to their wine consumption.
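Given a data matrix with variables (columns) \(X_1, \dots, X_d\), its associated covariance matrix is given by the following formula:
\[
\Sigma = \begin{pmatrix}
\mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_d) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_d) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_d, X_1) & \mathrm{Cov}(X_d, X_2) & \cdots & \mathrm{Var}(X_d)
\end{pmatrix}
\]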
Its correlation matrix is given by the same formula, with variances and covariances replaced by correlations between the variables.
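For reference, the individual entries can be written out as follows. This is a sketch under the assumption that the population convention (division by \(n\)) is used, where \(n\) is the number of rows, \(x_{ki}\) is the value of variable \(X_i\) for individual \(k\), and \(\bar{x}_i\) is the mean of \(X_i\). These symbols are notation introduced here, and the normalization expected by the tests should be checked against the provided files.

\[
\mathrm{Cov}(X_i, X_j) = \frac{1}{n}\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \qquad
\mathrm{Corr}(X_i, X_j) = \frac{\mathrm{Cov}(X_i, X_j)}{\sqrt{\mathrm{Var}(X_i)}\,\sqrt{\mathrm{Var}(X_j)}}.
\]

In particular, \(\mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)\), and \(\mathrm{Corr}(X_i, X_i) = 1\) whenever the variance is nonzero. Swapping \(i\) and \(j\) changes neither formula, which is why both matrices are symmetric.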
Exercise 3
Complete the following functions in your program:
- double ComputeCovariance(double values1[], double values2[], int length) computes the covariance of two series of values of same length
- double ComputeCorrelation(double values1[], double values2[], int length) computes the correlation of two series of values of same length.
You can observe and test your program as above by re-compiling and running ./grader 3.
Once your program passes all (or at least some) of the tests, upload your file stats_functions.cpp.
We can now compute the entries of the covariance and correlation matrices, which are symmetric.
Exercise 4
Use the functions that you wrote for Exercise 2 and Exercise 3 to complete the following functions:
- double** ComputeCovarianceMatrix(double** data_matrix, int rows, int columns) returns the covariance matrix of data_matrix
- double** ComputeCorrelationMatrix(double** data_matrix, int rows, int columns) returns the correlation matrix of data_matrix
Again, you can observe and test your program as above by re-compiling and running ./grader 4.
Once your program passes all (or at least some) of the tests, upload your file stats_functions.cpp.