
While evaluating thread pool libraries for short-running tasks, I noticed that they all performed significantly worse than OpenMP. The root cause seems to be that the other libraries struggle to get all of their worker threads running at the same time, while OpenMP somehow manages to.

To demonstrate the problem, I have created a simplified parallel_for example. I start 8 threads and then make them wait, either on a std::condition_variable or in a spin loop on a std::atomic, until their start is signaled; this excludes the overhead of launching the threads themselves. Start and end times per thread are logged to memory and then written to a file for visualization. I also parallelize the same amount of work using OpenMP.

The results can be seen below. With the regular lock and with the spin lock the threads do not start their work at the same time, but with OpenMP all threads start at roughly the same time.

I compiled with g++ -O3 -fopenmp -lm -std=c++20 main.cpp -o main and ran the experiment on an i5-10300H CPU with 8 logical cores (4 physical).

[Plot: per-thread start and end times for the lock, spin_lock and OpenMP variants]

main.cpp

#include <vector>
#include <mutex>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <cmath>
#include <condition_variable>

size_t num_threads = 8;

double sec(){
    std::chrono::duration<double> d = std::chrono::high_resolution_clock::now().time_since_epoch();
    return d.count();
}

void work(){
    // volatile keeps the compiler from optimizing the accumulation loop away
    volatile double accumulator = 0.0;
    for (size_t i = 0; i < 10 * 1000; i++){
        accumulator += std::sin(i);
    }
}

struct LogItem {
    double start, end;
    size_t thread_id;
};

void write_log(const std::vector<LogItem>& log, const char* filename) {
    FILE* f = fopen(filename, "w");
    if (!f) return; // don't write through a null FILE* if the file cannot be opened
    fprintf(f, "start,end,thread_id\n");
    for (const auto& item : log) {
        fprintf(f, "%f,%f,%zu\n", item.start, item.end, item.thread_id);
    }
    fclose(f);
}

void parallel_for_lock(){
    std::vector<std::thread> threads;
    std::vector<LogItem> log(num_threads);

    std::mutex mtx;
    std::condition_variable cv;
    bool start_flag = false;

    for (size_t thread_id = 0; thread_id < num_threads; thread_id++){
        threads.emplace_back([thread_id, &log, &mtx, &cv, &start_flag]{
            // wait until start
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&start_flag]{ return start_flag; });
            }
            // do work and log the time it takes
            double start = sec();
            work();
            double end = sec();
            log[thread_id] = LogItem{start, end, thread_id};
        });
    }

    // (attempt to) start all threads at once
    {
        std::lock_guard<std::mutex> lock(mtx);
        start_flag = true;
    }
    cv.notify_all();

    for (auto& thread : threads){
        thread.join();
    }

    write_log(log, "log_lock.csv");
}

void parallel_for_spin_lock(){
    std::vector<std::thread> threads;
    std::vector<LogItem> log(num_threads);

    std::atomic<bool> start_flag{false};

    for (size_t thread_id = 0; thread_id < num_threads; thread_id++){
        threads.emplace_back([thread_id, &log, &start_flag]{
            while (!start_flag.load(std::memory_order_acquire));
            double start = sec();
            work();
            double end = sec();
            log[thread_id] = LogItem{start, end, thread_id};
        });
    }

    start_flag.store(true, std::memory_order_release);

    for (auto& thread : threads){
        thread.join();
    }

    write_log(log, "log_spin_lock.csv");
}

void parallel_for_omp(){
    std::vector<LogItem> log(num_threads);

    // each loop iteration is expected to run on its own OpenMP thread
    #pragma omp parallel for
    for (size_t thread_id = 0; thread_id < num_threads; thread_id++){
        double start = sec();
        work();
        double end = sec();
        log[thread_id] = LogItem{start, end, thread_id};
    }

    write_log(log, "log_omp.csv");
}

int main(){
    // run a few times for warmup
    for (size_t i = 0; i < 10; i++){
        parallel_for_lock();
        parallel_for_spin_lock();
        parallel_for_omp();
    }

    return 0;
}

plot_results.py

import csv, matplotlib.pyplot as plt

def plot_log(filename):
    with open(filename) as f:
        rows = list(csv.DictReader(f))

    thread_ids = [int(row["thread_id"]) for row in rows]

    # convert to ms
    start_times = [float(row["start"]) * 1e3 for row in rows]
    end_times = [float(row["end"]) * 1e3 for row in rows]

    # start time at 0
    min_time = min(start_times)
    start_times = [s - min_time for s in start_times]
    end_times = [e - min_time for e in end_times]

    for start, end, tid in zip(start_times, end_times, thread_ids):
        plt.barh(tid, end - start, left=start)

    plt.xlabel("Time [ms]")
    plt.ylabel("Thread ID")
    plt.yticks(range(len(thread_ids)))
    plt.grid(axis="y", alpha=0.5)
    plt.xlim([0, 2])

def main():
    plt.figure(figsize=(10, 16))
    for i, name in enumerate(["lock", "spin_lock", "omp"], 1):
        plt.subplot(3, 1, i)
        plot_log(f"log_{name}.csv")
        plt.title(name)
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    main()
  • Unrelated micronote: Be careful using high_resolution_clock for timing. It's specced to prefer resolution over the stability of the clock, so on some systems you may get timer-breaking behaviour like the clock jumping backward in time. Check is_steady to make sure the clock will play nicely, and if it doesn't, consider alternatives such as steady_clock (a small steady_clock sketch follows below these comments).
  • Does the situation change if you wait some time between starting the threads and setting the start signal? It could be that by the time you set the condition variable, the threads aren't actually running yet.
  • I would suspect the threads are not distributed ideally among the cores when they start and have to wait around or get migrated, while OpenMP may be setting CPU affinity to select specific cores. Try experimenting with that. Or you can try setting OMP_PROC_BIND to false to possibly break OpenMP as well...
  • FWIW, your code on godbolt.org: godbolt.org/z/nsqzYWM76. There isn't such a drastic difference between OpenMP and the other implementations there. A visible pattern is that with OpenMP, two threads always seem to start at the same time.
  • @teapot418 Great suggestion! pthread_setaffinity_np did the trick. Would you like to post it as an answer?
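
As a side note on the clock caveat above, here is a minimal sketch of the timing helper rewritten against std::chrono::steady_clock (sec_steady is just an illustrative name, not part of the original code), plus a runtime check of is_steady:

#include <chrono>
#include <cstdio>

// Drop-in alternative to sec(): steady_clock is guaranteed to be monotonic,
// so measured intervals can never come out negative.
double sec_steady(){
    std::chrono::duration<double> d = std::chrono::steady_clock::now().time_since_epoch();
    return d.count();
}

int main(){
    // is_steady reports whether a clock is monotonic; high_resolution_clock is
    // allowed to be an alias of system_clock, in which case it is not steady.
    std::printf("high_resolution_clock::is_steady = %d\n",
                (int)std::chrono::high_resolution_clock::is_steady);
    std::printf("steady_clock::is_steady          = %d\n",
                (int)std::chrono::steady_clock::is_steady);
    return 0;
}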

1 Answer


Try distributing the threads evenly among the cores beforehand using pthread_setaffinity_np. I believe that is what OpenMP does. See also OMP_PROC_BIND and related tunables.

Without that, multiple threads may start on the same core and have to wait to be scheduled or migrated.
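
A minimal sketch of what that could look like for the std::thread variants, assuming Linux/glibc (pin_to_core is a hypothetical helper, and pinning thread i to logical core i is just one possible placement policy):

#include <thread>
#include <cstdio>
#include <pthread.h>   // pthread_setaffinity_np (GNU extension, Linux/glibc)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET

// Hypothetical helper: pin a std::thread to a single logical core.
void pin_to_core(std::thread& t, size_t core_id){
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    int rc = pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0){
        std::fprintf(stderr, "pthread_setaffinity_np failed for core %zu: %d\n", core_id, rc);
    }
}

// Possible use in parallel_for_spin_lock() (and analogously in parallel_for_lock()):
// after creating the threads and before setting start_flag, spread them over the cores.
//
//     for (size_t thread_id = 0; thread_id < num_threads; thread_id++){
//         pin_to_core(threads[thread_id],
//                     thread_id % std::thread::hardware_concurrency());
//     }

With the threads already spread across the cores before the start signal, none of them has to wait behind a sibling on the same core or be migrated once the work begins.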
