When writing C++ code, you are probably inclined to use data types that are closest to what you are trying to express, expecting that this allows the compiler to provide the most efficient implementation possible. So, when processing boolean values, i.e., single-bit information distinguishing between true
and false
, you will likely want to use the bool
type in C++. However, this bool
type may behave in unexpected ways, in particular if you are working with codes that have to perform well and correctly in multi-threaded codes.
Performance impact
Let’s first have a look at performance. Consider a code that determines if a given memory area is uniformly filled with a given byte value. This code should clearly be memory-bandwidth-bound:
bool same(const char *ptr, const char elem, const size_t len) { bool same = true; for (size_t i = 0; i < len; i++) { if (ptr[i] != elem) same = false; } return same; }
Based on this code, GCC 13.2 invoked with the flags -O3 -ftree-vectorize -march=znver3
will generate scalar instructions for the Zen 3 architecture: https://godbolt.org/z/5GqTzrGro. This shows in quite bad performance. On my test system, which contains an AMD EPYC 7713, we only see about 3.4 GiB/s where nearly 28 GiB/s should be possible.
Let’s modify the code and use the data type char
instead of bool
for the variable same
. This does not change the result in any way, because same
will still be either true
or false
in the end. In this case, GCC invoked with the same flags as above will generate optimized (in particular, vectorized) code for the same architecture: https://godbolt.org/z/o6rbohseP. This directly shows in performance, as the generated code achieves 27 GiB/s for the same task on the same system. So we are getting nearly eight times the performance using char
instead of bool
, although bool
is semantically the best match for the data we want to represent.
Correctness issues in multi-threaded code
Let’s now assume we want to parallelize the above code over many cores to utilize the full memory bandwidth of the system. For example, the AMD EPYC 7713 CPU contains 64 cores that we should make use of for maximum performance.
In HPC codes, we may do this using OpenMP. For example, we could split the task of iterating throught the entire buffer buf
among the threads as follows:
const auto nthreads = omp_get_max_threads(); const auto per_thread = len / nthreads; auto thread_results = std::vector<bool>(nthreads); #pragma omp parallel { const auto tid = omp_get_thread_num(); auto local_result = same( buf + (tid * per_thread), elem, per_thread ); thread_results[tid] = local_result; } auto result = std::all_of( thread_results.begin(), thread_results.end(), [](bool v) { return v; } );
For simplicity, we assume here that the size of buf
is divisible by the number of threads.
Writing from different threads into neighboring elements within thread_results
is a prime example for false sharing. However, this should only affect performance and not correctness. As we are writing to different elements of the vector, an unsuspecting programmer will assume that the code works correctly.
It turns out, it does not! Instead, the code produces incorrect results, sometimes returning false
even if all threads compute true
as local_result
.
There are different ways to fix the code, for example
- adding
#pragma omp critical
tothread_results[tid] = local_result;
, - changing the data type of
thread_results
tostd::array<bool, N>
whereN
is the maximum amount of threads possible, or - changing the data type of
thread_results
tostd::vector<char>
.
Reasons for the observed behavior
So, what is special about type bool
that can explain that behavior?
The second observation regarding correctness issues with thread-parallel access on different elements in std::vector<bool>
can be explained by looking into the C++ standard. In section 24.3.12 it describes a specialization of vector
for bool
. In particular, it states that
[…] all operations have the same requirements and semantics as the primary vector template, except that operations dealing with the bool value type map to bit values in the container storage […]
and that
[t]here is no requirement that the data be stored as a contiguous allocation of bool values. A space-optimized representation of bits is recommended instead.
So, while a single bool
is probably represented in memory by a single byte, a vector
of bool
s is not just a concatenation of multiple bool
s but a space-optimized representation. Standard libraries will likely pack eight neighboring bool
s within a vector
into a single byte. Hence, different threads cannot simply work on neighboring bool
s, as those accesses boil down to non-atomic concurrent accesses on a single byte. Another symptom of this speciality is that it is not possible to determine the address of elements in a std::vector<bool>
: https://godbolt.org/z/Ys5c98GGx.
The first symptom regarding the lack of code optimization is more difficult to explain. I can only guess that ensuring the semantics of bool
somehow interferes with GCC’s code optimization. In fact, Clang 16 does not have these issues: https://godbolt.org/z/x7s6bjT13. This also shows in performance which is on par with the char
version when using Clang to compile the code.
Summary
In summary, you should be aware of the specialities of the bool
type, in particular if you care about performance or if you implement thread-parallel algorithms. In general, if you are working in HPC codes, it may be better to stick to classical C-style integer types, such as char
, when representing boolean values.
Putting “openmp c++ vector bool” into your favorite search engine will show that a couple of people ran into the correctness issue discussed in this article. Being aware of bool
‘s special characteristics will help you avoiding these issues.