
While evaluating thread pool libraries for short-running tasks, I noticed that they all performed significantly worse than OpenMP. The root cause seems to be that the other libraries struggle to get all of their worker threads running at the same time, while OpenMP somehow manages to.

To demonstrate the problem, I have created a simplified parallel_for example. I start 8 threads and then make them wait, either on a std::condition_variable or in a spin loop on a std::atomic, until their start is signaled; this excludes the overhead of launching the threads themselves. Start and end times per thread are logged to memory and then written to a file for visualization. I also parallelize the same amount of work using OpenMP.

The results can be seen below. With the regular lock and with the spin lock the threads do not start their work at the same time, but with OpenMP all threads start at roughly the same time.

I compiled with g++ -O3 -fopenmp -lm -std=c++20 main.cpp -o main and ran the experiment on an i5-10300H CPU with 8 logical cores (4 physical).

[Plot: per-thread start and end times for the lock, spin_lock and OpenMP variants]

main.cpp

#include <vector>
#include <mutex>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <cmath>
#include <condition_variable>

size_t num_threads = 8;

double sec(){
    std::chrono::duration<double> d = std::chrono::high_resolution_clock::now().time_since_epoch();
    return d.count();
}

void work(){
    // volatile keeps the compiler from optimizing the accumulation loop away
    volatile double accumulator = 0.0;
    for (size_t i = 0; i < 10 * 1000; i++){
        accumulator += std::sin(i);
    }
}

struct LogItem {
    double start, end;
    size_t thread_id;
};

void write_log(const std::vector<LogItem>& log, const char* filename) {
    FILE* f = fopen(filename, "w");
    if (!f) return; // don't write through a null FILE* if the file cannot be opened
    fprintf(f, "start,end,thread_id\n");
    for (const auto& item : log) {
        fprintf(f, "%f,%f,%zu\n", item.start, item.end, item.thread_id);
    }
    fclose(f);
}

void parallel_for_lock(){
    std::vector<std::thread> threads;
    std::vector<LogItem> log(num_threads);

    std::mutex mtx;
    std::condition_variable cv;
    bool start_flag = false;

    for (size_t thread_id = 0; thread_id < num_threads; thread_id++){
        threads.emplace_back([thread_id, &log, &mtx, &cv, &start_flag]{
            // wait until start
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&start_flag]{ return start_flag; });
            }
            // do work and log the time it takes
            double start = sec();
            work();
            double end = sec();
            log[thread_id] = LogItem{start, end, thread_id};
        });
    }

    // (attempt to) start all threads at once
    {
        std::lock_guard<std::mutex> lock(mtx);
        start_flag = true;
    }
    cv.notify_all();

    for (auto& thread : threads){
        thread.join();
    }

    write_log(log, "log_lock.csv");
}

void parallel_for_spin_lock(){
    std::vector<std::thread> threads;
    std::vector<LogItem> log(num_threads);

    std::atomic<bool> start_flag{false};

    for (size_t thread_id = 0; thread_id < num_threads; thread_id++){
        threads.emplace_back([thread_id, &log, &start_flag]{
            while (!start_flag.load(std::memory_order_acquire));
            double start = sec();
            work();
            double end = sec();
            log[thread_id] = LogItem{start, end, thread_id};
        });
    }

    start_flag.store(true, std::memory_order_release);

    for (auto& thread : threads){
        thread.join();
    }

    write_log(log, "log_spin_lock.csv");
}

void parallel_for_omp(){
    std::vector<LogItem> log(num_threads);

    // each loop iteration is expected to run on its own OpenMP thread
    #pragma omp parallel for
    for (size_t thread_id = 0; thread_id < num_threads; thread_id++){
        double start = sec();
        work();
        double end = sec();
        log[thread_id] = LogItem{start, end, thread_id};
    }

    write_log(log, "log_omp.csv");
}

int main(){
    // run a few times for warmup
    for (size_t i = 0; i < 10; i++){
        parallel_for_lock();
        parallel_for_spin_lock();
        parallel_for_omp();
    }

    return 0;
}

plot_results.py

import csv, matplotlib.pyplot as plt

def plot_log(filename):
    with open(filename) as f:
        rows = list(csv.DictReader(f))

    thread_ids = [int(row["thread_id"]) for row in rows]

    # convert to ms
    start_times = [float(row["start"]) * 1e3 for row in rows]
    end_times = [float(row["end"]) * 1e3 for row in rows]

    # start time at 0
    min_time = min(start_times)
    start_times = [s - min_time for s in start_times]
    end_times = [e - min_time for e in end_times]

    for start, end, tid in zip(start_times, end_times, thread_ids):
        plt.barh(tid, end - start, left=start)

    plt.xlabel("Time [ms]")
    plt.ylabel("Thread ID")
    plt.yticks(range(len(thread_ids)))
    plt.grid(axis="y", alpha=0.5)
    plt.xlim([0, 2])

def main():
    plt.figure(figsize=(10, 16))
    for i, name in enumerate(["lock", "spin_lock", "omp"], 1):
        plt.subplot(3, 1, i)
        plot_log(f"log_{name}.csv")
        plt.title(name)
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    main()
  • Unrelated micronote: Be careful using high_resolution_clock for timing. It's specced to prefer resolution over the stability of the clock, so on some systems you may get timer-breaking behaviour like the clock jumping backward in time. Check is_steady to make sure the clock will play nicely, and if it doesn't, consider alternatives such as steady_clock (a small steady_clock sketch follows below these comments).
  • Does the situation change if you wait some time between starting the threads and setting the start signal? It could be that by the time you set the condition variable, the threads aren't actually running yet.
  • I would suspect the threads are not distributed ideally among the cores when they start and have to wait around or get migrated, while OpenMP may be setting CPU affinity to select specific cores. Try experimenting with that. Or you can try setting OMP_PROC_BIND to false to possibly break OpenMP as well...
  • FWIW, your code on godbolt.org: godbolt.org/z/nsqzYWM76. There isn't such a drastic difference between OpenMP and the other implementations there. A visible pattern is that with OpenMP, two threads always seem to start at the same time.
  • @teapot418 Great suggestion! pthread_setaffinity_np did the trick. Would you like to post it as an answer?
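
As a side note on the clock caveat above, here is a minimal sketch of the timing helper rewritten against std::chrono::steady_clock (sec_steady is just an illustrative name, not part of the original code), plus a runtime check of is_steady:

#include <chrono>
#include <cstdio>

// Drop-in alternative to sec(): steady_clock is guaranteed to be monotonic,
// so measured intervals can never come out negative.
double sec_steady(){
    std::chrono::duration<double> d = std::chrono::steady_clock::now().time_since_epoch();
    return d.count();
}

int main(){
    // is_steady reports whether a clock is monotonic; high_resolution_clock is
    // allowed to be an alias of system_clock, in which case it is not steady.
    std::printf("high_resolution_clock::is_steady = %d\n",
                (int)std::chrono::high_resolution_clock::is_steady);
    std::printf("steady_clock::is_steady          = %d\n",
                (int)std::chrono::steady_clock::is_steady);
    return 0;
}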

1 Answer


Try distributing the threads evenly among the cores beforehand using pthread_setaffinity_np. I believe that is what OpenMP does. See also OMP_PROC_BIND and related tunables.

Without that, multiple threads may start on the same core and have to wait to be scheduled or migrated.
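
A minimal sketch of what that could look like for the std::thread variants, assuming Linux/glibc (pin_to_core is a hypothetical helper, and pinning thread i to logical core i is just one possible placement policy):

#include <thread>
#include <cstdio>
#include <pthread.h>   // pthread_setaffinity_np (GNU extension, Linux/glibc)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET

// Hypothetical helper: pin a std::thread to a single logical core.
void pin_to_core(std::thread& t, size_t core_id){
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    int rc = pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0){
        std::fprintf(stderr, "pthread_setaffinity_np failed for core %zu: %d\n", core_id, rc);
    }
}

// Possible use in parallel_for_spin_lock() (and analogously in parallel_for_lock()):
// after creating the threads and before setting start_flag, spread them over the cores.
//
//     for (size_t thread_id = 0; thread_id < num_threads; thread_id++){
//         pin_to_core(threads[thread_id],
//                     thread_id % std::thread::hardware_concurrency());
//     }

With the threads already spread across the cores before the start signal, none of them has to wait behind a sibling on the same core or be migrated once the work begins.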
