The code for this tutorial is on GitHub: https://github.com/sol-prog/threads.
In my previous tutorials I’ve presented some of the newest C++11 additions to the language: regular expressions, raw strings and lambdas.
Perhaps one of the biggest changes to the language is the addition of
multithreading support. Before C++11, it was possible to target
multicore computers using OS facilities (pthreads on Unix-like systems) or libraries like OpenMP and MPI.
This tutorial is meant to get you started with C++11 threads and not to be an exhaustive reference of the standard.
Creating and launching a thread in C++11 is as simple as adding the
thread header to your C++ source. Let’s see how we can create a simple
HelloWorld program with threads:
    #include <iostream>
    #include <thread>

    // This function will be called from a thread
    void call_from_thread() {
        std::cout << "Hello, World!" << std::endl;
    }

    int main() {
        // Launch a thread
        std::thread t1(call_from_thread);

        // Wait for the thread to finish
        t1.join();

        return 0;
    }
On Linux you can compile the above code with g++:
    g++ -std=c++0x -pthread file_name.cpp
On a Mac with Xcode 4.x you can compile the above code with clang++:
    clang++ -std=c++0x -stdlib=libc++ file_name.cpp
On Windows you could use a commercial library, just::thread, for compiling multithreaded code. Unfortunately they don’t supply a trial version of the library, so I wasn’t able to test it.
In a real-world application the call_from_thread function would do
some work independently of the main function. In this particular code,
the main function creates a thread and waits for it to finish at
t1.join(). If you forget to join a thread before its std::thread object
is destroyed, the destructor calls std::terminate and the program
aborts, regardless of whether call_from_thread has finished or not.
Compare the relative simplicity of the above code with an equivalent program that uses POSIX threads:
    #include <iostream>
    #include <pthread.h>

    // This function will be called from a thread
    void *call_from_thread(void *) {
        std::cout << "Launched by thread" << std::endl;
        return NULL;
    }

    int main() {
        pthread_t t;

        // Launch a thread
        pthread_create(&t, NULL, call_from_thread, NULL);

        // Wait for the thread to finish
        pthread_join(t, NULL);

        return 0;
    }
Usually we will want to launch more than one thread at once and do
some work in parallel. In order to do this we could create an array of
threads instead of a single thread like in our first example. In
the next example the main function creates a group of 10 threads that
will do some work, and waits for the threads to finish (there
is also a POSIX version of this example in the GitHub repository for
this article):
    ...
    static const int num_threads = 10;
    ...
    int main() {
        std::thread t[num_threads];

        for (int i = 0; i < num_threads; ++i) {
            t[i] = std::thread(call_from_thread);
        }

        std::cout << "Launched from the main\n";

        for (int i = 0; i < num_threads; ++i) {
            t[i].join();
        }

        return 0;
    }
Remember that the main function is also a thread, usually named the
main thread, so the above code actually runs 11 threads. This allows us
to do some work in the main thread after we have launched the threads
and before joining them. We will see this in an image processing example
at the end of this tutorial.
What about using a function with parameters in a thread? C++11 lets
us pass as many parameters as we need in the thread call. For example, we
could modify the above code in order to receive an integer as a
parameter (you can see the POSIX version of this example in the GitHub
repository for this article):
    #include <iostream>
    #include <thread>

    static const int num_threads = 10;

    void call_from_thread(int tid) {
        std::cout << "Launched by thread " << tid << std::endl;
    }

    int main() {
        std::thread t[num_threads];

        for (int i = 0; i < num_threads; ++i) {
            t[i] = std::thread(call_from_thread, i);
        }

        std::cout << "Launched from the main\n";

        for (int i = 0; i < num_threads; ++i) {
            t[i].join();
        }

        return 0;
    }
The result of the above code on my system is:
    Sol$ ./a.out
    Launched by thread 0
    Launched by thread 1
    Launched by thread 2
    Launched from the main
    Launched by thread 3
    Launched by thread 5
    Launched by thread 6
    Launched by thread 7
    Launched by thread Launched by thread 4
    8L
    aunched by thread 9
    Sol$
You can see in the above result that there is no particular order in
which the threads run once they are created. It is the programmer’s job to
ensure that a group of threads won’t conflict when trying to modify the same
data. Also, the last lines are mangled because thread 4 hadn’t
finished writing to stdout when thread 8 started. In fact, if you run
the above code on your system you can get a completely different result
or even some mangled characters. This is because all 11 threads of this
program compete for the same resource, which is stdout.
You can avoid some of the above problems by using locks in your code
(std::mutex), which let you synchronize the way a group of threads
shares a resource, or you could try to use separate data structures for
your threads, if possible. The use of mutexes is too advanced for the
purpose of this tutorial; you can read more about them in one of the
references suggested at the end of this post.
In principle we have all we need in order to write more complex parallel programs using only the above syntax.
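As a small illustration of passing several parameters and of giving each thread its own data, here is a hedged sketch in which thread i writes only to element i of a vector, so no locking is needed. The functions square_into and parallel_squares are hypothetical names invented for this example.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical worker: takes two parameters, an input value and a
// reference to the slot where this thread stores its result.
void square_into(int value, int &slot) {
    slot = value * value;
}

// Launch one thread per element; thread i writes only to results[i],
// so no two threads ever touch the same element.
std::vector<int> parallel_squares(int num_threads) {
    std::vector<int> results(num_threads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        // std::ref is needed because std::thread copies its arguments by default.
        threads.emplace_back(square_into, i, std::ref(results[i]));
    }
    for (auto &t : threads) {
        t.join();
    }
    return results;
}
```

The vector is sized before any thread starts and is never resized while they run, which keeps the references handed to the threads valid.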
In the next example I will try to illustrate the power of parallel
programming by tackling a slightly more complex problem: removing the
noise from an image with a blur filter. The idea is that we can
dissipate the noise from an image by using some form of weighted average
of a pixel and its neighbours.
This tutorial is not about optimal image processing, nor is the author
an expert in this domain, so we will take a rather simple approach
here. Our purpose is to illustrate how to write parallel code, not
how to efficiently read/write images or convolve them with filters. For
example, I’ve used the definition of the spatial convolution instead of the
more performant, but slightly more difficult to implement, convolution
in the frequency domain by means of the Fast Fourier Transform.
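To make the idea of a weighted average concrete, here is a minimal one-dimensional sketch. It is an assumption-laden illustration, not the filter used in the repository for this article: the function name blur_row, the weights 1/4, 1/2, 1/4 and the handling of the borders are all choices made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 1D blur: every output sample is a weighted average of the
// input sample and its two neighbours, with weights 1/4, 1/2, 1/4.
// At the borders the missing neighbour is replaced by the edge sample.
std::vector<unsigned char> blur_row(const std::vector<unsigned char> &in) {
    std::vector<unsigned char> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned int left   = in[i == 0 ? 0 : i - 1];
        unsigned int centre = in[i];
        unsigned int right  = in[i + 1 == in.size() ? i : i + 1];
        out[i] = static_cast<unsigned char>((left + 2 * centre + right) / 4);
    }
    return out;
}
```

For the row 0 0 100 0 0 this sketch produces 0 25 50 25 0: the isolated spike is spread over its neighbours. A two-dimensional filter applies the same principle with a small matrix of weights instead of three numbers.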
For simplicity we will use a simple non-compressed image file format like PPM.
Next we present the header file of a simple C++ class that allows you
to read/write PPM images and to store them in memory as three arrays
(for the R,G,B colours) of unsigned characters:
    class ppm {
        bool flag_alloc;
        void init();

        unsigned int nr_lines;
        unsigned int nr_columns;

    public:
        unsigned char *r;
        unsigned char *g;
        unsigned char *b;

        unsigned int height;
        unsigned int width;
        unsigned int max_col_val;
        unsigned int size;

        ppm();
        ppm(const std::string &fname);
        ppm(const unsigned int _width, const unsigned int _height);
        ~ppm();

        void read(const std::string &fname);
        void write(const std::string &fname);
    };
A possible way to structure our code is:
1. Load an image into memory.
2. Split the image into a number of chunks corresponding to the maximum number of threads supported by your system; for example, on a quad-core computer we could use 8 threads.
3. Launch (number of threads - 1) threads (7 for a quad-core system); each one will process its chunk of the image.
4. Let the main thread deal with the last chunk of the image.
5. Wait until all threads have finished and join them with the main thread.
6. Save the processed image.
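The splitting step can be sketched as a small helper that returns the chunk boundaries. The version below is only a guess at how such a helper might look: it has the same shape as the bounds(parts, image.size) call used in the main function of this tutorial, but the actual implementation in the repository may differ, and the half-open interpretation of the ranges is an assumption.

```cpp
#include <vector>

// Hypothetical helper: returns parts + 1 indices, so that chunk i covers
// the half-open range [bnd[i], bnd[i + 1]) of an array with `size` elements.
std::vector<int> bounds(int parts, int size) {
    std::vector<int> bnd;
    bnd.reserve(parts + 1);
    for (int i = 0; i <= parts; ++i) {
        // The 64-bit intermediate avoids overflow of size * i for large images.
        bnd.push_back(static_cast<int>(static_cast<long long>(size) * i / parts));
    }
    return bnd;
}
```

With this sketch bounds(4, 10) gives 0, 2, 5, 7, 10, so the chunk sizes differ by at most one element even when the size is not a multiple of the number of parts.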
Next we present the main function that implements the above algorithm (many thanks to wicked for suggesting some code improvements):
    int main() {
        std::string fname = std::string("your_file_name.ppm");

        ppm image(fname);
        ppm image2(image.width, image.height);

        // Number of chunks (and of threads, counting the main thread)
        int parts = 8;

        std::vector<int> bnd = bounds(parts, image.size);

        std::thread *tt = new std::thread[parts - 1];

        time_t start, end;
        time(&start);

        // Launch parts - 1 threads, one chunk each
        for (int i = 0; i < parts - 1; ++i) {
            tt[i] = std::thread(tst, &image, &image2, bnd[i], bnd[i + 1]);
        }

        // Use the main thread to process the last chunk
        for (int i = parts - 1; i < parts; ++i) {
            tst(&image, &image2, bnd[i], bnd[i + 1]);
        }

        // Join the parts - 1 threads
        for (int i = 0; i < parts - 1; ++i)
            tt[i].join();

        time(&end);
        std::cout << difftime(end, start) << " seconds" << std::endl;

        // Save the result
        image2.write("test.ppm");

        delete[] tt;

        return 0;
    }
Please ignore the hard-coded name of the image file and the number of
threads to launch; in a real-world application you should allow the user
to enter these parameters interactively.
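One possible way to avoid hard-coding the number of threads is to ask the standard library for a hint with std::thread::hardware_concurrency(). The helper below is a sketch; choose_parts is a hypothetical name that does not appear in the repository for this article.

```cpp
#include <thread>

// Hypothetical helper: pick the number of chunks from the hardware when
// possible. hardware_concurrency() is only a hint and may return 0 when the
// value cannot be determined, so a fallback is needed.
unsigned int choose_parts(unsigned int fallback) {
    unsigned int hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : fallback;
}
```

In the main function, the line `int parts = 8;` could then become `int parts = static_cast<int>(choose_parts(8));`.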
Now, in order to see parallel code at work we will need to give it
a significant amount of work, otherwise the overhead of creating and
destroying threads will nullify our effort to parallelize this code. The
input image should be large enough to actually see an improvement in
performance when the code is run in parallel. For this purpose I’ve used
an image of 16000×10626 pixels, which occupies about 512 MB in PPM format:
I’ve added some noise over the above image in GIMP. The effect of the noise can be seen in the next detail of the above picture:
Let’s see the above code in action:
As you can see from the above image, the noise level was reduced.
The results of running the last example code on a dual-core MacBook Pro from 2010 are presented in the next table:
| Compiler | Optimization | Threads | Time | Speed up |
|---|---|---|---|---|
| clang++ | none | 1 | 40 s | |
| clang++ | none | 4 | 20 s | 2x |
| clang++ | -O4 | 1 | 12 s | |
| clang++ | -O4 | 4 | 6 s | 2x |
On a dual-core machine this code achieves a perfect 2x speed up when
running in parallel versus running in serial mode (a single
thread).
I’ve also tested the code on a quad-core Intel i7 machine running Linux; these are the results:
| Compiler | Optimization | Threads | Time | Speed up |
|---|---|---|---|---|
| g++ | none | 1 | 33 s | |
| g++ | none | 8 | 13 s | 2.54x |
| g++ | -O4 | 1 | 9 s | |
| g++ | -O4 | 8 | 3 s | 3x |
Apparently Apple’s clang++ is better at scaling a parallel program.
However, this could be due to a combination of compiler and machine
characteristics; it could also be because the MacBook Pro used for the
tests has 8 GB of RAM versus only 6 GB for the Linux machine.
If you are interested in learning the new C++11 syntax I would recommend reading Professional C++ (2nd edition) by M. Gregoire, N. A. Solter and S. J. Kleper, or C++ Primer Plus by Stephen Prata.
A good book for learning about C++11 multithreading support is C++ Concurrency in Action: Practical Multithreading by Anthony Williams.
Source : http://solarianprogrammer.com/2011/12/16/cpp-11-thread-tutorial/