Using the C++ bool type in HPC codes

When writing C++ code, you are probably inclined to use data types that are closest to what you are trying to express, expecting that this allows the compiler to provide the most efficient implementation possible. So, when processing boolean values, i.e., single-bit information distinguishing between true and false, you will likely want to use the bool type in C++. However, this bool type may behave in unexpected ways, in particular if you are working with codes that have to perform well and correctly in multi-threaded codes.

Performance impact

Let’s first have a look at performance. Consider a code that determines if a given memory area is uniformly filled with a given byte value. This code should clearly be memory-bandwidth-bound:

bool same(const char *ptr,
          const char elem,
          const size_t len) {

    bool same = true;
    for (size_t i = 0; i < len; i++) {
        if (ptr[i] != elem) same = false;
    }

    return same;
}

Based on this code, GCC 13.2 invoked with the flags -O3 -ftree-vectorize -march=znver3 will generate scalar instructions for the Zen 3 architecture: https://godbolt.org/z/5GqTzrGro. This shows in quite bad performance. On my test system, which contains an AMD EPYC 7713, we only see about 3.4 GiB/s where nearly 28 GiB/s should be possible.

Let’s modify the code and use the data type char instead of bool for the variable same. This does not change the result in any way, because same will still be either true or false in the end. In this case, GCC invoked with the same flags as above will generate optimized (in particular, vectorized) code for the same architecture: https://godbolt.org/z/o6rbohseP. This directly shows in performance, as the generated code achieves 27 GiB/s for the same task on the same system. So we are getting nearly eight times the performance using char instead of bool, although bool is semantically the best match for the data we want to represent.

Correctness issues in multi-threaded code

Let’s now assume we want to parallelize the above code over many cores to utilize the full memory bandwidth of the system. For example, the AMD EPYC 7713 CPU contains 64 cores that we should make use of for maximum performance.

In HPC codes, we may do this using OpenMP. For example, we could split the task of iterating throught the entire buffer buf among the threads as follows:

const auto nthreads = omp_get_max_threads();
const auto per_thread = len / nthreads;

auto thread_results = std::vector<bool>(nthreads);
#pragma omp parallel
{
  const auto tid = omp_get_thread_num();
  auto local_result = same(
      buf + (tid * per_thread),
      elem,
      per_thread
  );
  thread_results[tid] = local_result;
}
auto result = std::all_of(
  thread_results.begin(),
  thread_results.end(),
  [](bool v) { return v; }
);

For simplicity, we assume here that the size of buf is divisible by the number of threads.

Writing from different threads into neighboring elements within thread_results is a prime example for false sharing. However, this should only affect performance and not correctness. As we are writing to different elements of the vector, an unsuspecting programmer will assume that the code works correctly.

It turns out, it does not! Instead, the code produces incorrect results, sometimes returning false even if all threads compute true as local_result.

There are different ways to fix the code, for example

adding #pragma omp critical to thread_results[tid] = local_result;,
changing the data type of thread_results to std::array<bool, N> where N is the maximum amount of threads possible, or
changing the data type of thread_results to std::vector<char>.

Reasons for the observed behavior

So, what is special about type bool that can explain that behavior?

The second observation regarding correctness issues with thread-parallel access on different elements in std::vector<bool> can be explained by looking into the C++ standard. In section 24.3.12 it describes a specialization of vector for bool. In particular, it states that

[…] all operations have the same requirements and semantics as the primary vector template, except that operations dealing with the bool value type map to bit values in the container storage […]

and that

[t]here is no requirement that the data be stored as a contiguous allocation of bool values. A space-optimized representation of bits is recommended instead.

So, while a single bool is probably represented in memory by a single byte, a vector of bools is not just a concatenation of multiple bools but a space-optimized representation. Standard libraries will likely pack eight neighboring bools within a vector into a single byte. Hence, different threads cannot simply work on neighboring bools, as those accesses boil down to non-atomic concurrent accesses on a single byte. Another symptom of this speciality is that it is not possible to determine the address of elements in a std::vector<bool>: https://godbolt.org/z/Ys5c98GGx.

The first symptom regarding the lack of code optimization is more difficult to explain. I can only guess that ensuring the semantics of bool somehow interferes with GCC’s code optimization. In fact, Clang 16 does not have these issues: https://godbolt.org/z/x7s6bjT13. This also shows in performance which is on par with the char version when using Clang to compile the code.

Summary

In summary, you should be aware of the specialities of the bool type, in particular if you care about performance or if you implement thread-parallel algorithms. In general, if you are working in HPC codes, it may be better to stick to classical C-style integer types, such as char, when representing boolean values.

Putting “openmp c++ vector bool” into your favorite search engine will show that a couple of people ran into the correctness issue discussed in this article. Being aware of bool‘s special characteristics will help you avoiding these issues.

0xStubs

Using the C++ bool type in HPC codes

Performance impact

Correctness issues in multi-threaded code

Reasons for the observed behavior

Summary

Leave a Reply Cancel reply