Last semester I attended the Multiprocessor Architectures course at the Facultad de Informática, where I am studying for my Computer Science degree.
As part of the assignments required to pass the course, we had to write several C programs to benchmark matrix multiplication, testing different techniques and technologies. First of all we had to write a sequential program in three different versions:
- A normal one, where the result matrix is stored by rows and the loops traverse the matrices by rows too.
- An "inter" version, where the result matrix is stored by rows but the loops traverse the matrices by columns (loop exchange).
- A "noacum" version, where the variable holding the partial sums is removed and, each time a value is calculated, it is written to and read back from the result matrix.
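The three variants can be sketched in Go roughly as follows (the function names are mine and the shapes are assumptions from the descriptions above; the course versions were in C):

```go
package main

import "fmt"

// Normal: result filled row by row, with a local accumulator.
func multiplyNormal(m1, m2 [][]int) [][]int {
	rows, inner, cols := len(m1), len(m2), len(m2[0])
	r := make([][]int, rows)
	for i := 0; i < rows; i++ {
		r[i] = make([]int, cols)
		for j := 0; j < cols; j++ {
			acum := 0
			for k := 0; k < inner; k++ {
				acum += m1[i][k] * m2[k][j]
			}
			r[i][j] = acum
		}
	}
	return r
}

// "inter": the two inner loops are exchanged, so m2 is walked by rows
// while each result row is filled column by column.
func multiplyInter(m1, m2 [][]int) [][]int {
	rows, inner, cols := len(m1), len(m2), len(m2[0])
	r := make([][]int, rows)
	for i := 0; i < rows; i++ {
		r[i] = make([]int, cols)
		for k := 0; k < inner; k++ {
			for j := 0; j < cols; j++ {
				r[i][j] += m1[i][k] * m2[k][j]
			}
		}
	}
	return r
}

// "noacum": no accumulator variable; partial sums are written to and
// read back from the result matrix on every step.
func multiplyNoacum(m1, m2 [][]int) [][]int {
	rows, inner, cols := len(m1), len(m2), len(m2[0])
	r := make([][]int, rows)
	for i := 0; i < rows; i++ {
		r[i] = make([]int, cols)
		for j := 0; j < cols; j++ {
			for k := 0; k < inner; k++ {
				r[i][j] = r[i][j] + m1[i][k]*m2[k][j]
			}
		}
	}
	return r
}

func main() {
	m1 := [][]int{{1, 2}, {3, 4}}
	m2 := [][]int{{1, 0}, {0, 1}} // identity, so the product equals m1
	fmt.Println(multiplyNormal(m1, m2)) // [[1 2] [3 4]]
	fmt.Println(multiplyInter(m1, m2))  // [[1 2] [3 4]]
	fmt.Println(multiplyNoacum(m1, m2)) // [[1 2] [3 4]]
}
```

All three compute the same product; they differ only in memory access pattern and in whether a register-friendly accumulator is used.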
Later on we had to write a parallel implementation using OpenMP, which will not be covered here, and another one using 3 threads. This threaded implementation comes in three versions of its own:
- One using bound threads (system or kernel-level threads), parallelizing over individual elements of the result matrix. Each thread is bound to a kernel-level entity, so there is real parallelism.
- Another using bound threads, but parallelizing over entire rows of the result matrix.
- A last one using unbound threads (user-level threads), parallelizing over elements of the result matrix. Here there is no real parallelism, as several threads are multiplexed onto a single kernel-level entity.
There are some considerations to be made. The programs were written and compiled for a 64-bit machine running Linux CentOS 6.0 with kernel 2.6.32. This machine is part of a cluster, though the programs ran on a single node, which mounts two quad-core Intel® Xeon® X5450 processors at 3 GHz. So we have an 8-core machine across 2 dies; each core has separate L1 instruction and data caches of 32 KB each with a write-back policy, and the cores share a 12 MB L2 cache. Some interesting values in the machine's configuration are that max locked memory is limited to 64 KB per user and the stack size is 10,240 KB.
Other considerations are that this version of Linux doesn't support unbound threads in C, so we will not take their execution times into account, and that the Go versions use the runtime.GOMAXPROCS function to control whether there is real parallelism.
Before digging into the code and the timing results, let's see which tests were run and what the matrices we will be multiplying look like:
- Test 1
  - Matrix 1: 800×900 – 720,000 elements.
  - Matrix 2: 900×3000 – 2,700,000 elements.
  - Result Matrix: 800×3000 – 2,400,000 elements.
- Test 2
  - Matrix 1: 8000×90 – 720,000 elements.
  - Matrix 2: 90×8000 – 720,000 elements.
  - Result Matrix: 8000×8000 – 64,000,000 elements.
- Test 3
  - Matrix 1: 8000×10 – 80,000 elements.
  - Matrix 2: 10×80000 – 800,000 elements.
  - Result Matrix: 8000×80000 – 640,000,000 elements.
So let's analyze the proposed source code. To minimize the differences between the C and Go programs, the global structure is kept coherent between both languages, although better implementations exist in both. This test is meant to compare how each language performs with a common implementation, rather than to show the best possible performance of each version.
The source code shown here is annotated. The full version can be downloaded from GitHub (https://github.com/rcostu/Go-vs-C-Matrix-benchmark).
Note: the GCC version used is [gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC)] and the Go version is [0c39eee85b0d weekly/weekly.2011-12-06].
There are two different implementations, depending on how the matrices are declared. The files named matrix contain matrices defined as vectors of vectors, in C as follows:
```c
int **matrix1 = (int **) calloc(matrix1_rows, sizeof(int *));
for (i = 0; i < matrix1_rows; i++) {
    matrix1[i] = (int *) calloc(matrix1_cols, sizeof(int));
}
```

Go version:

```go
matrix1 := make([][]int, matrix1_rows)
for i := 0; i < matrix1_rows; i++ {
    matrix1[i] = make([]int, matrix1_cols)
}
```
The files named vector declare each matrix as a single vector, using displacements such as matrix[i * cols + j], where i and j are the indexes of an element and cols is the number of columns in the matrix. In C they are declared as follows:
```c
int *matrix1 = (int *) calloc(matrix1_rows * matrix1_cols, sizeof(int));
```

Go version:

```go
matrix1 := make([]int, matrix1_rows*matrix1_cols)
```
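To make the displacement concrete, here is a small hypothetical helper pair (at and set are my names, not part of the benchmark code):

```go
package main

import "fmt"

// at and set illustrate the flattened layout: element (i, j) of a
// matrix with cols columns lives at index i*cols + j.
func at(m []int, cols, i, j int) int { return m[i*cols+j] }

func set(m []int, cols, i, j, v int) { m[i*cols+j] = v }

func main() {
	cols := 4
	m := make([]int, 3*cols) // a 3×4 matrix stored as a single vector
	set(m, cols, 2, 1, 7)    // element (2, 1) maps to index 2*4+1 = 9
	fmt.Println(at(m, cols, 2, 1), m[9]) // both read the same cell: 7 7
}
```

The payoff is that the whole matrix is one contiguous allocation, which matters for the cache behaviour discussed in the results.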
Every version has an init_matrix function which sets the matrix elements to 1. This function runs in a single thread in both the sequential and the parallel versions; parallelism is only applied to the calculations.
Matrix multiplication is done as follows in C (or its equivalent in Go):
```c
for (i = 0; i < matrix1_rows; i++) {
    for (j = 0; j < matrix2_cols; j++) {
        acum = 0;
        for (k = 0; k < matrix1_cols; k++) {
            acum += matrix1[i][k] * matrix2[k][j];
        }
        matrixR[i][j] = acum;
    }
}
```
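The Go equivalent follows the same structure; here is a sketch mirroring the C listing (wrapped in a function so it is self-contained, which the repository version may not do):

```go
package main

import "fmt"

// multiply mirrors the C triple loop: row by row, with a local
// accumulator per result element.
func multiply(matrix1, matrix2 [][]int) [][]int {
	rows, inner, cols := len(matrix1), len(matrix2), len(matrix2[0])
	matrixR := make([][]int, rows)
	for i := 0; i < rows; i++ {
		matrixR[i] = make([]int, cols)
		for j := 0; j < cols; j++ {
			acum := 0
			for k := 0; k < inner; k++ {
				acum += matrix1[i][k] * matrix2[k][j]
			}
			matrixR[i][j] = acum
		}
	}
	return matrixR
}

func main() {
	r := multiply([][]int{{1, 2, 3}, {4, 5, 6}},
		[][]int{{1, 0}, {0, 1}, {1, 1}})
	fmt.Println(r) // [[4 5] [10 11]]
}
```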
The parallel version has these variables defined at the top (C version listed):
```c
// Thread list
pthread_t *thread_list;
// Variable to get exclusive access
pthread_mutex_t mutex;
// Number of pending jobs
int pending_jobs = 0;
// Struct to store the job to be done.
struct job {
    int i, j;
    struct job *next;
};
// Simple linked list of total jobs.
struct job *job_list = NULL;
struct job *last_job = NULL;
```
Jobs are added to and deleted from the list as shown in these functions:
```c
void add_job(int i, int j) {
    struct job *job = malloc(sizeof(struct job));
    job->i = i;
    job->j = j;
    job->next = NULL;
    if (pending_jobs == 0) {
        job_list = job;
        last_job = job;
    } else {
        last_job->next = job;
        last_job = job;
    }
    pending_jobs++;
}

struct job* get_job() {
    struct job *job = NULL;
    if (pending_jobs > 0) {
        job = job_list;
        job_list = job->next;
        if (job_list == NULL) {
            last_job = NULL;
        }
        pending_jobs--;
    }
    return job;
}
```
Each launched thread executes the dispatch_job() function, where a job is obtained from the list and its calculations are done:
```c
void do_job(struct job *job) {
    int k, acum;
    acum = 0;
    for (k = 0; k < matrix1_cols; k++) {
        acum += matrix1[job->i][k] * matrix2[k][job->j];
    }
    matrixR[job->i][job->j] = acum;
}

void* dispatch_job() {
    struct job *job;
    while (1) {
        pthread_mutex_lock(&mutex);
        job = get_job();
        pthread_mutex_unlock(&mutex);
        if (job) {
            do_job(job);
            free(job);
        } else {
            pthread_exit(NULL);
        }
    }
}
```
And the main thread waits for all the threads to finish. The Go version is slightly different. The dispatch_job() function is launched in a new goroutine (and will run in parallel if we have set GOMAXPROCS to more than one processor). I have defined two different channels: ch, which signals that a goroutine has ended, and chm, which gives a goroutine exclusive access to the shared list.
The dispatch_job() function is as follows:
```go
func dispatch_job() {
    var job *struct_job
    for {
        chm <- 1
        job = get_job()
        <-chm
        if job != nil {
            do_job(job)
        } else {
            ch <- 1
            runtime.Goexit()
        }
    }
}
```
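Putting the pieces together, a minimal runnable sketch of the whole Go scheme might look like this (inferred from the description above; the repository code may differ in details such as job ordering):

```go
package main

import (
	"fmt"
	"runtime"
)

type struct_job struct {
	i, j int
	next *struct_job
}

var (
	matrix1 = [][]int{{1, 1}, {1, 1}}
	matrix2 = [][]int{{1, 1}, {1, 1}}
	matrixR = [][]int{{0, 0}, {0, 0}}

	job_list *struct_job
	chm      = make(chan int, 1) // buffered channel acting as a mutex
	ch       = make(chan int)    // signals goroutine termination
)

// add_job pushes a job onto the shared list (LIFO here for brevity;
// the C version keeps FIFO order with last_job).
func add_job(i, j int) {
	job_list = &struct_job{i: i, j: j, next: job_list}
}

func get_job() *struct_job {
	job := job_list
	if job != nil {
		job_list = job.next
	}
	return job
}

func do_job(job *struct_job) {
	acum := 0
	for k := 0; k < len(matrix2); k++ {
		acum += matrix1[job.i][k] * matrix2[k][job.j]
	}
	matrixR[job.i][job.j] = acum
}

func dispatch_job() {
	for {
		chm <- 1 // acquire exclusive access to the list
		job := get_job()
		<-chm // release it
		if job != nil {
			do_job(job)
		} else {
			ch <- 1 // tell main this goroutine is done
			runtime.Goexit()
		}
	}
}

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())
	for i := range matrixR {
		for j := range matrixR[i] {
			add_job(i, j)
		}
	}
	threads := 3
	for t := 0; t < threads; t++ {
		go dispatch_job()
	}
	for t := 0; t < threads; t++ { // main waits for every goroutine
		<-ch
	}
	fmt.Println(matrixR) // [[2 2] [2 2]]
}
```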
Now that we have analyzed the global structure of the code in both versions, let's see how the times are measured and what the final results are. Each time was taken by running make testX with the given makefiles for test X (which runs the executable, measuring with the time command). Each measurement was repeated 3 times to compensate for high load on the server (a shared machine at the university), and the final time is the result of the formula: ((fastest time + slowest time) / 2 + middle time) / 2.
With this formula we take the extreme values into account while weighting the result toward the two closest values.
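The formula can be expressed as a small function (representative is my name for it):

```go
package main

import (
	"fmt"
	"sort"
)

// representative combines three measured times as in the post:
// ((fastest + slowest) / 2 + middle) / 2.
func representative(times []float64) float64 {
	sort.Float64s(times) // times[0] fastest, times[1] middle, times[2] slowest
	return ((times[0]+times[2])/2 + times[1]) / 2
}

func main() {
	fmt.Println(representative([]float64{10, 12, 20})) // (15 + 12) / 2 = 13.5
}
```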
So here are the times for each test, the first set being the execution of the sequential versions and the second the parallelized ones. Note that we don't have unbound threads (user-level threads) in C because of the Linux implementation. "System" stands for bound threads (kernel-level threads).
- Test 1
- Test 2
- Test 3
There are some things to remark here:
- The Go version implementing the matrices as double-dimension arrays is very, very slow on test 1. It gets better as the matrices grow.
- Single-dimension arrays are better than double-dimension arrays. This is logical, because we have contiguous memory against non-contiguous memory. The Go vector version is amazingly fast.
- The overall speedup is between 10 and 20%, with Go faster than C.
Although this test is not very scientific and the implementations can be improved, I think it makes the point that in some kinds of applications, depending on the implementation, Go is faster than C. Where it is not, it should improve in the upcoming Go 1 stable release.
I hope this post gets discussed and helps bring new users into the Go community. Do not hesitate to fork the GitHub repository or leave a comment.
Source: http://roberto.costumero.es/en/2012/02/07/go-vs-c-benchmark-puede-ser-go-mas-rapido-que-c/