This assignment was closed on March 13, 2024.


INF442 - Introduction to Unix and basic statistics in C++

Structure
Before you begin
Section 0 (optional) — Useful basic commands in Unix terminal
Section 1 — Quiz: wine consumption data
Section 2 (optional) — A few simple C++ programs
Section 3 — Computing basic statistics: Mean, Variance and Standard Deviation
Section 4 — Working with matrices
Section 5 — Covariance and Correlation Matrices

Structure

Section 0 provides a very brief introduction to the fundamental commands that you can use in a Unix/Linux terminal. More information is provided in a separate tutorial. If you are already comfortable with Unix and/or Linux, you can safely skip this section. We nevertheless recommend that you take at least a quick look at it to make sure that you master all the commands introduced there.
Section 1 provides the required background and material for the quiz. The quiz is graded.
Section 2 provides a gentle introduction to writing, compiling and executing a simple C++ program. If you are already comfortable with these concepts, you can safely skip it as well. We nevertheless recommend that you take at least a quick look at it to make sure that you master all the concepts introduced there.
Finally, Sections 3, 4 and 5 comprise the graded exercises that you have to complete. Their goal is to compute some basic statistics on data.

Before you begin

Download and unzip the archive INF442-td1-1-handin.zip. It contains the following files:

the source file stats_functions.cpp, which you will have to complete,
the header file stats_functions.hpp, which you should not modify,
the file main.cpp, which is used by the test scripts,
a Makefile that you can use to compile these tests (see Section 2).

It also contains some subdirectories that you are not expected to modify:

csv with the main dataset of the TD
grading with the test scripts (you may want to modify it for debugging though)
gradinglib with the test library

We compiled some general instructions that can help you get things set up for the lab sessions of the course.

Section 0 (optional) — Useful basic commands in Unix terminal


Open a terminal. Play a bit with the following commands:

pwd (Print Working Directory) — prints out the current working directory.

To access a directory or a file, one has to specify a path. Paths can be absolute or relative. An absolute path starts from the root denoted by / and goes all the way to the desired element (directory or file). The command pwd prints absolute paths. For instance, on Linux systems, the home directory of the user student is /home/student. This can be read as follows: from the root, go to a sub-directory home, then to the sub-directory student. (On other systems, the default home directory might be different. For instance, on MacOS, it would be /Users/student.)

ls (LiSt) — lists the files and directories in the current working directory or in the directory given as a parameter. Compare the output of this command with different options: ls /, ls -l /, and ls -lh / (the three commands list the content of the root directory in different formats).

A single dot . is used to refer to the current working directory. Hence, the commands ls and ls . are equivalent. Two dots .. refer to the parent directory of the current one.
cd (Change Directory) — moves us in the directory tree. It can use both absolute and relative paths. For instance, the following command line moves us to the home directory of our hypothetical user student, which becomes the new working directory:
```
$ cd /home/student
```
(Here, $ is the command prompt — yours could look different — do not type it in the terminal.)

In order to move to the home directory of another user student1, one can either issue the same command replacing student by student1 or, instead issue the following two commands using relative paths:
```
$ cd ..
$ cd student1
$ pwd
/home/student1
```
The first command moves to the parent directory, i.e. /home. The second one moves from there to the sub-directory student1.

cd ~ (tilde) moves to the home directory of the current user, cd - (minus) moves to the previous location:
```
$ cd ~ ; pwd
/home/student
$ cd /home/student1 ; pwd
/home/student1
$ cd - ; pwd
/home/student
```
(Notice how we use ; to chain two commands on the same line.)
mkdir (MaKe DIRectory) — creates a new directory. For instance,
```
$ mkdir mydir
```
creates a directory named mydir in the working directory.
rm — this command is used for deleting a file
rmdir — is used for deleting an empty directory.

Streams are input and output communication channels between a computer program and its environment. There are three standard streams: stdin, stdout, and stderr. The first two correspond to the "normal" input and output channels, whereas the third one is used for printing error messages. By default, in Unix/Linux terminals, stdin receives the input coming from the keyboard. Similarly, by default, all content channeled through stdout and stderr is printed directly in the terminal. However, these streams can be redirected to or from files. For stdout and stdin, this is done with the >, >>, and < operators:

> redirects stdout to the file given as an argument, erasing its previous content,
>> appends stdout to the content of the file given as an argument,
< reads the contents of the file given as an argument into stdin.

Finally,

echo — prints its arguments on the standard output.
cat (conCATenate) — prints the contents of the file given as an argument on the standard output (more complex in reality but this is good enough for now).

Now you can play around with the above commands to create files for the INF442 exercises as follows:

Move to your home directory, using ls and cd
Create a directory INF442
Change to the INF442 directory
Create a new subdirectory TD1
Change to the TD1 directory

During the next sessions, you will create sub-directories TD2, TD3, etc. For now, you can continue playing around with the above commands:

Use echo and > to print some message into a file test.txt.
Use ls -l and >> to append the contents of the root directory / to the file test.txt.
Use cat to check that the content of the file test.txt corresponds to your expectations.
Use rm to delete the file.

Repeat until you manage to obtain something like this:

$ cat test.txt
Here are the contents of the root dir:
total 10
drwxrwxr-x+ 98 root  admin  3136 Feb 28 20:15 Applications
drwxr-xr-x  83 root  wheel  2656 Feb 26 01:35 Library
drwxr-xr-x@  8 root  wheel   256 Dec  5  2019 System
drwxr-xr-x   5 root  admin   160 Dec  5  2019 Users
drwxr-xr-x   6 root  wheel   192 Feb 28 20:46 Volumes
drwxr-xr-x@ 38 root  wheel  1216 Feb 26 01:31 bin
drwxr-xr-x   3 root  wheel    96 Dec  5  2019 com.apple.TimeMachine.localsnapshots
drwxr-xr-x   2 root  wheel    64 Nov  9  2019 cores
dr-xr-xr-x   3 root  wheel  4795 Feb 26 10:00 dev
lrwxr-xr-x@  1 root  admin    11 Dec 18  2019 etc -> private/etc
lrwxr-xr-x   1 root  wheel    25 Feb 26 10:00 home -> /System/Volumes/Data/home
drwxr-xr-x   5 root  wheel   160 Dec 18  2019 opt
drwxr-xr-x   6 root  wheel   192 Feb 26 01:32 private
drwxr-xr-x@ 63 root  wheel  2016 Feb 26 01:31 sbin
lrwxr-xr-x@  1 root  admin    11 Dec 18  2019 tmp -> private/tmp
drwxr-xr-x@ 11 root  wheel   352 Dec 18  2019 usr
lrwxr-xr-x@  1 root  admin    11 Dec 18  2019 var -> private/var

Section 1 — Quiz: wine consumption data


Let's start playing with some data!

The file wines.csv in the subdirectory csv/ includes the consumption of 17 types of wine in 8 countries, expressed in kL. We will consider wines as variables and countries as individuals.

The histogram below represents the cumulative consumption for each type of wine:

In the second part of this TD, you will program several functions to compute means, variances, and covariance matrices. The eigendecomposition of the covariance matrix is used by Principal Component Analysis (PCA)—a projection technique that reduces the number of dimensions in a given dataset. Given a dataset $\mathbf{X}$ with $n$ observations (vectors) $\mathbf{x}^T = (x_1, x_2, \dots, x_d)$ of $d$ dimensions, we perform an orthogonal transformation of the vector space, such that the first element of the new basis (called the first principal component) contributes the most to the variance of the data, then the second basis element (second principal component) contributes the most to the residual variance after projecting the data along the first basis element, and so on.
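
In symbols, one common way to write this decomposition is sketched below. It assumes centered data, and the symbols $\Sigma$ (covariance matrix), $V$, $\Lambda$ and $\lambda_k$ are our own notation rather than something taken from the TD files:

$$\Sigma = V \Lambda V^T, \qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d), \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0$$

The columns of $V$ are the principal components, and $\lambda_k$ is the variance of the data along the $k$-th one. The first $j$ components therefore retain the fraction $(\lambda_1 + \dots + \lambda_j) / (\lambda_1 + \dots + \lambda_d)$ of the total variance.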

This means that, once we have our data represented in the new basis, the projection of the dataset over the first $j$ principal components will preserve most of its variance. The following graph is obtained for the individuals (countries) analysis on normalized data.

Let us now create a short Python script using the pandas library, to compute a few basic statistics on our data. Before we begin, download and install the pandas library:

$ pip3 install --user --only-binary=:all: pandas

Then open a new file named stats.py in the root directory of the TD. This file will contain our Python script. You are free to use any editor you wish; to open the file in VS Code, you can use the terminal command code stats.py. You can run it at any time using the following command in the terminal:

$ python3 stats.py

Here is now the content of our Python script, which you can copy-paste into your file stats.py.

First, we import the pandas library and set a few default display options:

import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 2)

Now, we read in the data from the .csv file and store them in a pandas dataframe:

data = pd.read_csv('csv/wines.csv', index_col=0)

Let us print our data (wines are in columns, countries are in rows):

print(data)

The mean of each row of the dataframe data can be computed via data.mean(axis=1), while the mean of each column can be computed via data.mean(axis=0). Similarly, the method var computes the variances in a dataframe. More precisely, data.var(axis=1, ddof=0) computes the variance of each row, while data.var(axis=0, ddof=0) computes the variance of each column. The parameter ddof ("delta degrees of freedom") selects the estimator: ddof=0 gives the variance and ddof=1 gives the (corrected) sample variance.
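
Written out for $n$ values $x_1, \dots, x_n$ with mean $\bar{x}$, the quantity computed by var is, to the best of our reading of the pandas documentation:

$$\mathrm{var}_{\mathrm{ddof}} = \frac{1}{n - \mathrm{ddof}} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Note that pandas uses ddof=1 when the parameter is omitted, which is why ddof=0 has to be passed explicitly to obtain the variance.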

We can thus compute the mean, variance and corrected variance per country as follows:

# stats per country
print('\nCountries:')
meanvar = pd.concat([data.mean(axis=1), data.var(axis=1, ddof=0), data.var(axis=1, ddof=1)], axis=1)
meanvar.columns = ['mean', 'variance', 'sample variance']
print(meanvar)

Now we do the same per wine:

# stats per wine
print('\nWines:')
meanvar = pd.concat([data.mean(axis=0), data.var(axis=0, ddof=0), data.var(axis=0, ddof=1)], axis=1)
meanvar.columns = ['mean', 'variance', 'sample variance']
print(meanvar)

Computing the correlation matrix for the dataframe data with respect to the columns can be done via data.corr(). To do the computation with respect to the rows, we can simply transpose the matrix before calling the method corr: data.transpose().corr().

We can thus compute the correlation matrices per country and per wine as follows:

# correlation matrices
print('\nWines correlation matrix:')
print(data.corr())

print('\nCountries correlation matrix:')
print(data.transpose().corr())

Analyze the data in wines.csv as well as the two figures above (the histogram and the PCA graph) to answer the questions in the quiz on Moodle.

You can find a more advanced Python script, which produces in particular the PCA plot above, here.

Section 2 (optional) — A few simple C++ programs


Section 2.1 — First C++ program

In this part of the TD, we will practice writing, compiling and executing C++ programs step by step.

Caution: If you create a Makefile to build your C++ files, take care not to delete (or overwrite) the Makefile provided in the INF442-td1-1-handin.zip archive. The safest choice is probably to experiment with these optional exercises in a new directory.

Here is a simple program that prints a fixed message on the standard output stream:

// Include the library that defines input/output streams and operations on them
#include <iostream>

int main()  {
  // Print "Hello, world!" on the standard output...
  std::cout << "Hello, world!"  // <- Notice: no ';' here: 1 statement / 2 lines
                     
  // ...then advance to the next line
        << std::endl;      // <- 2nd line of the statement

  return 0;
}

Using a text editor of your choice (we recommend VS Code), save this program in a file called hello.cpp. Then, in a terminal window (integrated in VS Code or separate), issue the following commands to compile and run the program:

$ g++ -o hello hello.cpp
$ ./hello

Here, -o hello tells g++ to generate a program called hello, which is executed in the second line. We have to prefix hello with ./ to tell the system to look for the executable file hello in the current directory (denoted by .).

Section 2.2 — Input from the keyboard

In order to get information from the user via the terminal, you can use the object cin and the operator >>. Modify your program by inserting the following lines before the return statement:

  // Declare the variable 'name' to be an array of 256 characters 
  char name[256];
  
  std::cout << "What is your name?" << std::endl;
                         
  // Read the name from the standard input
  std::cin >> name;

  std::cout << "Hello, " << name << std::endl;

Recompile and run the program using the same commands as above. You should now see the following:

Hello, world!
What is your name?

Enter your first name and press the Enter key. Hello, followed by your name should appear.

Section 2.3 — Input/output using redirection

Create a file names.txt containing some text in the same directory as your files hello.cpp and hello. Run the program hello by redirecting the input from names.txt file as follows and observe the output:

./hello < names.txt

Run the program again, this time also redirecting the output to a file output.txt:

./hello < names.txt > output.txt

Check the content of output.txt using cat and repeat using >> instead of >:

./hello < names.txt >> output.txt

Run several more times, alternating > and >>, and see how the contents of output.txt change.

Section 2.4 — Makefiles

A makefile is a file containing a set of directives used to compile different files and generate the executable.

Create a file named makefile or Makefile with the following content:

CC = g++                              # defines a variable, called 'CC' 

hello: hello.cpp                      # 'hello' (the target) depends on 'hello.cpp'
	$(CC) -o hello hello.cpp          # the command to build the target

clean:                                # Another target
	rm -f hello *~

Remark: The Makefile format is relatively rigid and has to be carefully respected. (Check the GNU Make manual for a more detailed reference.) In particular, note that command lines (those starting with $(CC) and rm in the example) must start with a tab character.

Alternate the following commands several times and observe the effect using ls or ls -l:

make
make hello
make clean

(By default, the first target—in our case hello—is built.)

Remark: It is possible to have a makefile that is not named makefile or Makefile. If the name of your makefile is foo.bar, it can still be used by passing an explicit argument to make, for example:

make -f foo.bar hello

Section 2.5 — A simple loop

Expand your hello program by inserting the following code before the return statement:

  int count = 0;
  
  std::cout << "Give a positive number: ";
  std::cin >> count;

  std::cout << "Hey";
  for (int i = 1; i < count; i++) {   // <-- Here goes the loop!
    std::cout << "-hey";
  }

  std::cout << ", " << name << "!" << std::endl;

Compile and run it to observe the output.
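
To predict what the loop prints, it can help to isolate it in a function that builds the message instead of printing it. The following sketch does so; the name make_cheer is ours, and it uses std::string rather than the character array of the program above.

```cpp
#include <string>

// Hypothetical helper mirroring the loop above: returns "Hey" followed by
// (count - 1) copies of "-hey", then ", <name>!".
std::string make_cheer(int count, const std::string& name) {
  std::string message = "Hey";
  for (int i = 1; i < count; i++) {  // runs count - 1 times (never if count <= 1)
    message += "-hey";
  }
  return message + ", " + name + "!";
}
```

For example, make_cheer(3, "Alice") yields Hey-hey-hey, Alice!, and any count of 1 or less yields Hey, Alice! because the loop body never runs.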

Section 2.6 — Hello, multiple people!

Modify the program to ask instead for the number of people present, then greet each one by their name separately. The output should look like this:

$ ./hello 
Hello, world!
How many are you? 3
What is your name?
Alice
Hello, Alice!
What is your name?
Bob
Hello, Bob!
What is your name?
Charlie
Hello, Charlie!

Section 2.7 — Segmentation faults

So far, you saw some basic programs that work. It is important, however, to also see some that do not … or at least not as expected. Here is one:

#include <iostream>

int main() {
  int n;
  std::cout << "Enter a number: ";
  std::cin >> n;
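  std::cout << "I am going to build an array of "
            << n
            << " elements"
            << std::endl;
  
  // Note: an array whose size is only known at runtime is a compiler
  // extension (accepted by g++), not standard C++
  int array[n];
  
  for (int i = 0; i < 2*n; i++) {
    array[i] = i*i;
    std::cout << "Wrote "
              << array[i]
              << " into cell "
              << i
              << " of the array"
              << std::endl;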
  }
  
  return 0;
}

Copy that program into a file, say segfault.cpp, compile it (for example with g++ -o segfault segfault.cpp), then run it. The output should be similar (although perhaps not identical) to:

$ ./segfault
Enter a number: 10
I am going to build an array of 10 elements
Wrote 0 into cell 0 of the array
Wrote 1 into cell 1 of the array
Wrote 4 into cell 2 of the array
Wrote 9 into cell 3 of the array
Wrote 16 into cell 4 of the array
Wrote 25 into cell 5 of the array
Wrote 36 into cell 6 of the array
Wrote 49 into cell 7 of the array
Wrote 64 into cell 8 of the array
Wrote 81 into cell 9 of the array
Wrote 100 into cell 10 of the array
Wrote 121 into cell 11 of the array
Wrote 144 into cell 12 of the array
Wrote 169 into cell 13 of the array
Segmentation fault: 11

What's going on? If you look carefully at the code, you will notice that the array has size $n$, so only $n$ cells are reserved for it in the computer's memory. However, the loop goes from $0$ to $2n-1$, which means that we are trying to write to memory that is not reserved for us. (Read about segmentation faults to find out more.) An important thing to notice is that, in the run above, the segmentation fault occurs only when writing to cell 14, not at cell 10 (the first cell outside the array), as one would expect! Take-away: you have to be very careful with array bounds. It is very easy to make a mistake and, in the worst case, you may not notice it immediately at runtime.
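
One way to make such mistakes visible immediately, offered here as a suggestion rather than as part of the TD, is to use std::vector with its bounds-checked accessor at(), which throws std::out_of_range instead of silently writing outside the array. The function name first_invalid_write is ours, and the sketch assumes n is not negative.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical helper: repeats the faulty loop on a vector of n cells, but
// with checked access. Returns the index of the first rejected write,
// or -1 if every write succeeded. Assumes n >= 0.
int first_invalid_write(int n) {
  std::vector<int> cells(n);  // n cells, valid indices 0 .. n-1
  for (int i = 0; i < 2 * n; i++) {
    try {
      cells.at(i) = i * i;    // at() checks the index at runtime
    } catch (const std::out_of_range&) {
      return i;               // first index outside the vector
    }
  }
  return -1;
}
```

With n = 10, the function returns 10: the error is reported at the first out-of-bounds cell, not several cells later.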

Section 3 — Computing basic statistics: Mean, Variance and Standard Deviation

The file stats_functions.cpp provides a skeleton of a program that computes some basic statistics for an array of double values.

Exercise 1

Complete the following functions in your program (for this you will need the square root function std::sqrt(double), declared in the standard header <cmath>):

double compute_mean(double values[], int length) computes the mean of the given values
double compute_variance(double values[], int length) computes the variance
double compute_sample_variance(double values[], int length) computes the unbiased sample variance
double compute_standard_deviation(double values[], int length) computes the standard deviation
double compute_sample_standard_deviation(double values[], int length) computes the sample standard deviation (which is still biased)
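
As a sanity check for your implementation, here is a small worked example (our own, not taken from the grader). It assumes the usual definitions: the variance divides the sum of squared deviations by $n$, the sample variance divides it by $n-1$, and each standard deviation is the square root of the corresponding variance. For the $n = 8$ values $2, 4, 4, 4, 5, 5, 7, 9$:

$$\bar{x} = \frac{40}{8} = 5, \qquad \sum_{i=1}^{8} (x_i - \bar{x})^2 = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32$$

$$\sigma^2 = \frac{32}{8} = 4, \quad \sigma = 2, \qquad s^2 = \frac{32}{7} \approx 4.571, \quad s = \sqrt{32/7} \approx 2.138$$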

You can observe the output by compiling the provided program grader with the command make grader, then running ./grader 1. The command make grader launches the compilation using the Makefile that we provided; the command ./grader 1 runs the tests for this exercise.

Caution: In C++, when you divide two ints, the result is an int. For more details about the various conversions, you might take a look here. This tip might be useful when you compute the sample variance and the sample standard deviation.
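
The following sketch illustrates the pitfall; the two function names are ours.

```cpp
// Both operands are int, so the division truncates before the result
// is converted to double: truncated_ratio(7, 2) is 3.0.
double truncated_ratio(int numerator, int denominator) {
  return numerator / denominator;
}

// Converting one operand to double first forces a floating-point
// division: exact_ratio(7, 2) is 3.5.
double exact_ratio(int numerator, int denominator) {
  return static_cast<double>(numerator) / denominator;
}
```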

Once your program passes all (or at least some) of the tests, upload your file stats_functions.cpp:


Section 4 — Working with matrices

In the previous section, you manipulated linear arrays. As you have seen, such arrays can be passed to functions, but their length has to be specified separately. In this section, we move on to matrices, i.e., 2-dimensional arrays. Here, things become more complex: an array of arrays is not guaranteed to be rectangular, since each row could potentially have a different length, and C++ requires every dimension of an array parameter except the first to be known at compile time. For this reason, a matrix cannot be passed as a function parameter in the same way as the arrays above:

    // This is not allowed!
    double my_function (double matrix[][], int rows, int columns) {
      ...
    }

Instead, one has to rely on the fact that, since all elements of an array are stored contiguously in memory, an array can be handled through a pointer to its first element (more details in future lectures). A matrix can then be passed as a pointer to an array of row pointers:

    // This is fine!
    double my_function (double **matrix, int rows, int columns) {
      return matrix[0][0];
    }

Notice that we can still use the brackets notation matrix[i][j] to access individual cells of the matrix.
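
For completeness, here is a minimal sketch of how such a double** matrix might be allocated, used and released. The helper names allocate_matrix, free_matrix and sum_entries are illustrative assumptions; the TD's own test code may build its matrices differently.

```cpp
// Allocates a rows x columns matrix as an array of row pointers,
// with every entry initialised to 0.
double** allocate_matrix(int rows, int columns) {
  double** matrix = new double*[rows];
  for (int i = 0; i < rows; i++) {
    matrix[i] = new double[columns]();  // () value-initialises to 0.0
  }
  return matrix;
}

// Releases a matrix created by allocate_matrix: first each row,
// then the array of row pointers.
void free_matrix(double** matrix, int rows) {
  for (int i = 0; i < rows; i++) {
    delete[] matrix[i];
  }
  delete[] matrix;
}

// Example of a function taking a matrix parameter: sums all entries,
// always indexing as matrix[row][column].
double sum_entries(double** matrix, int rows, int columns) {
  double total = 0.0;
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < columns; j++) {
      total += matrix[i][j];
    }
  }
  return total;
}
```

Every matrix obtained from allocate_matrix should eventually be passed to free_matrix with the same number of rows; otherwise the memory is never returned.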

Exercise 2

Complete the following functions in your program (pay attention to the indices when manipulating matrices: always row first):

void print_matrix(double** matrix, int rows, int columns) prints a rectangular matrix on the standard output, placing each row on a separate line, with a single whitespace between consecutive entries
void get_row(double** matrix, int columns, int index, double row[]) copies the row with index index of the matrix to the array row provided as an argument
void get_column(double** matrix, int rows, int index, double column[]) copies the column with index index of the matrix to the array column provided as an argument

Re-compile the test program via make grader and run the tests via ./grader 2.

Once your program passes all (or at least some) of the tests, upload your file stats_functions.cpp:


Section 5 — Covariance and Correlation Matrices

We will now compute correlations and covariances. Then we will build the covariance matrix which was used in Section 1 to classify countries according to their wine consumption.

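Given a data matrix with variables (columns) $X_1, \dots, X_d$, its associated covariance matrix is given by the following formula:

$$\Sigma = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_d) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_d, X_1) & \mathrm{Cov}(X_d, X_2) & \cdots & \mathrm{Var}(X_d) \end{pmatrix}$$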

Its correlation matrix is given by the same formula, with variances and covariances replaced by correlations between the variables.

Exercise 3

Complete the following functions in your program:

double compute_covariance(double values1[], double values2[], int length) computes the covariance of two series of values of same length
double compute_correlation(double values1[], double values2[], int length) computes the correlation of two series of values of same length.
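
For reference, the usual definitions for two series $x_1, \dots, x_n$ and $y_1, \dots, y_n$ with means $\bar{x}$ and $\bar{y}$ and standard deviations $\sigma_x$ and $\sigma_y$ are given below. We assume here the population convention (division by $n$); whether the grader expects this or the sample convention (division by $n-1$) is not stated in this text, so check your results against the test output.

$$\mathrm{Cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad \mathrm{Corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y}$$

The correlation does not depend on the convention, provided that the covariance and the two standard deviations use the same one.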

You can observe and test your program as above by re-compiling and running ./grader 3.

Once your program passes all (or at least some) of the tests, upload your file stats_functions.cpp:


We can now compute the entries of the covariance and correlation matrices, which are symmetric.

Exercise 4

Use the functions that you wrote for Exercise 2 and Exercise 3 to complete the following functions:

double** compute_covariance_matrix(double** data_matrix, int rows, int columns) returns the covariance matrix of data_matrix
double** compute_correlation_matrix(double** data_matrix, int rows, int columns) returns the correlation matrix of data_matrix
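
Since both matrices are symmetric, only the entries on and above the diagonal need to be computed; the rest can be mirrored. The following sketch shows this pattern on a generic $d \times d$ matrix whose entries come from a caller-supplied function. The names build_symmetric and entry are ours, and this is one possible approach rather than the expected solution.

```cpp
// Builds a d x d symmetric matrix. 'entry(i, j)' is assumed to be a
// symmetric function of its arguments (entry(i, j) == entry(j, i)),
// so it is evaluated only for j >= i and the result is mirrored.
double** build_symmetric(int d, double (*entry)(int, int)) {
  double** matrix = new double*[d];
  for (int i = 0; i < d; i++) {
    matrix[i] = new double[d];
  }
  for (int i = 0; i < d; i++) {
    for (int j = i; j < d; j++) {
      matrix[i][j] = entry(i, j);
      matrix[j][i] = matrix[i][j];  // mirror across the diagonal
    }
  }
  return matrix;
}
```

The caller owns the returned matrix and is responsible for releasing each row, then the array of row pointers, with delete[].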

Again, you can observe and test your program as above by re-compiling and running ./grader 4.

Once your program passes all (or at least some) of the tests, upload your file stats_functions.cpp:
