I'm interested in the most efficient way to check for equality of memory blocks.
Currently, I use memcmp(...) == 0 to determine if two memory blocks are equal. While this works, memcmp is designed not only to check for equality but also to determine the ordering of two memory blocks (greater, less, or equal). This additional functionality might introduce unnecessary overhead.

Task:
I need to efficiently check whether two memory blocks are equal or not.

Problem:
The fact that memcmp provides ordering information means it might be less efficient than a function specifically designed to check for equality.

The additional overhead in memcmp comes from the need to determine the exact ordering of the blocks when they are not equal. If memcmp returns a non-zero result, there are two bytes at some index i that differ, and all bytes before index i are equal. To establish this, memcmp must verify the equality of all preceding bytes, which takes time. In contrast, a function that only checks for equality could, for instance, check the last bytes first and return as soon as it finds a difference. This could improve the average running time, although the worst case stays the same.

In simpler terms: the inefficiency of memcmp is due to the fact that it searches not just for any difference between memory blocks, but for the FIRST difference.
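To illustrate the "first difference" point, here is a minimal sketch. The helper name memcmp_sign is made up for this example; it only normalizes the result of memcmp to -1, 0, or +1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of any library): reduces memcmp's result
// to -1, 0, or +1. The sign depends only on the FIRST differing byte pair.
int memcmp_sign(const void* a, const void* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    const int r = std::memcmp(a, b, n);
    return (r > 0) - (r < 0);
}
```

For "axc" versus "bxa" with n = 3, the result is -1: the bytes at index 0 decide the ordering ('a' < 'b'), even though the bytes at index 2 differ the other way. An equality-only check is free to look at index 2 first and stop there.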

What I've tried:
I wrote my own implementations to demonstrate this idea and ran some tests.

#include <cstddef>    // size_t
#include <cstdint>    // int8_t
#include <cstring>    // memcmp
#include <immintrin.h>

bool memcmp_memeq(const void* s1, const void* s2, size_t n) {
    return memcmp(s1, s2, n) == 0;
}

bool my_memeq(const void* s1, const void* s2, size_t n) {
    const static size_t simd_width = 16;

    if (n <= simd_width) {
        return memcmp(s1, s2, n) == 0;
    }

    const int8_t* end1 = (const int8_t*)s1 + n - simd_width;
    const int8_t* end2 = (const int8_t*)s2 + n - simd_width;

    const __m128i m1 = _mm_loadu_si128((__m128i*)end1);
    const __m128i m2 = _mm_loadu_si128((__m128i*)end2);
    const __m128i result = _mm_cmpeq_epi8(m1, m2);

    if (_mm_movemask_epi8(result) != 0xFFFF) {
        return false;
    }

    return memcmp(s1, s2, (const int8_t*)end1 - (const int8_t*)s1) == 0;
}

bool my_memeq_256(const void* s1, const void* s2, size_t n) {
    const static size_t simd_width = 32;

    if (n <= simd_width) {
        return memcmp(s1, s2, n) == 0;
    }

    const int8_t* end1 = (const int8_t*)s1 + n - simd_width;
    const int8_t* end2 = (const int8_t*)s2 + n - simd_width;

    const __m256i m1 = _mm256_loadu_si256((__m256i*)end1);
    const __m256i m2 = _mm256_loadu_si256((__m256i*)end2);
    const __m256i result = _mm256_cmpeq_epi8(m1, m2);

    if (static_cast<unsigned int>(_mm256_movemask_epi8(result)) != 0xFFFFFFFF) {
        return false;
    }

    return memcmp(s1, s2, (const int8_t*)end1 - (const int8_t*)s1) == 0;
}
#include <chrono>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

#include <gtest/gtest.h>

#include <src/util/my_memeq.h>


template <typename Func>
std::pair<std::chrono::duration<double>, bool> measure_time(
        Func func,
        const std::vector<std::pair<std::string, std::string>>& test_cases,
        int iterations = 100000) {
    if (test_cases.empty()) {
        return { std::chrono::duration<double>(0), true };
    }
    volatile bool result;
    const auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        for (const auto& test_case : test_cases) {
            result = func(test_case.first.c_str(), test_case.second.c_str(), test_case.first.size());
        }
    }
    const auto end = std::chrono::high_resolution_clock::now();
    return { (end - start) / test_cases.size() / iterations, result };
}


void run_tests(std::ostream& out, size_t len, const std::vector<std::pair<std::string, std::string>>& test_cases) {

    const auto memcmp_memeq_time_result = measure_time(memcmp_memeq, test_cases);
    const auto my_memeq_time_result = measure_time(my_memeq, test_cases);
    const auto my_memeq_256_time_result = measure_time(my_memeq_256, test_cases);

    const auto memcmp_memeq_time = memcmp_memeq_time_result.first.count();
    const auto my_memeq_time = my_memeq_time_result.first.count();
    const auto my_memeq_256_time = my_memeq_256_time_result.first.count();

    const auto memcmp_memeq_result = memcmp_memeq_time_result.second;
    const auto my_memeq_result = my_memeq_time_result.second;
    const auto my_memeq_256_result = my_memeq_256_time_result.second;

    EXPECT_EQ(memcmp_memeq_result, my_memeq_result);
    EXPECT_EQ(memcmp_memeq_result, my_memeq_256_result);

    out << "RESULTS for len: " << len << " bytes\n";
    out << std::setw(19) << "memcmp_memeq time: " << std::setw(9) << std::setprecision(4) << memcmp_memeq_time << " seconds\n";
    out << std::setw(19) << "my_memeq time: " << std::setw(9) << std::setprecision(4) << my_memeq_time << " seconds\n";
    out << std::setw(19) << "my_memeq_256 time: " << std::setw(9) << std::setprecision(4) << my_memeq_256_time << " seconds\n";
    out << '\n';
}


std::vector<std::pair<std::string, std::string>>
    generate_test_cases(
        size_t len,
        bool include_equal = true,
        bool include_start = true,
        bool include_middle = true,
        bool include_end = true,
        bool include_quarters = true) {

    std::vector<std::pair<std::string, std::string>> test_cases;

    std::string str1 (len, 'A');
    std::string str2 (len, 'A');

    if (include_equal) {
        test_cases.emplace_back(str1, str2);
    }

    if (include_end) {
        str2[len - 1] = 'B';
        test_cases.emplace_back(str1, str2);
        str2[len - 1] = 'A';
    }

    if (include_start) {
        str2[0] = 'B';
        test_cases.emplace_back(str1, str2);
        str2[0] = 'A';
    }

    if (include_middle) {
        if (len > 2) {
            str2[len / 2] = 'B';
            test_cases.emplace_back(str1, str2);
            str2[len / 2] = 'A';
        }
    }

    if (include_quarters) {
        if (len > 4) {
            str2[len / 4] = 'B';
            test_cases.emplace_back(str1, str2);
            str2[len / 4] = 'A';

            str2[len / 4 * 3] = 'B';
            test_cases.emplace_back(str1, str2);
            str2[len / 4 * 3] = 'A';
        }
    }

    return test_cases;
}


TEST(MemcmpTest, PerformanceComparison) {

    const static size_t start_len = 4;
    const static size_t max_bytes = 131072;

    std::stringstream out;

    out << "VARIOUS STRINGS\n\n";
    for (size_t len = start_len; len <= max_bytes; len *= 2) {
        run_tests(
                out,
                len,
                generate_test_cases(
                        len,
                        true,
                        true,
                        true,
                        true,
                        true)
        );
    }

    out << "DIFFERENCE AT THE START\n\n";
    for (size_t len = start_len; len <= max_bytes; len *= 2) {
        run_tests(
                out,
                len,
                generate_test_cases(
                        len,
                        false,
                        true,
                        false,
                        false,
                        false)
        );
    }

    out << "DIFFERENCE AT THE END\n\n";
    for (size_t len = start_len; len <= max_bytes; len *= 2) {
        run_tests(
                out,
                len,
                generate_test_cases(
                        len,
                        false,
                        false,
                        false,
                        true,
                        false)
        );
    }

    out << "DIFFERENCE AT THE MIDDLE\n\n";
    for (size_t len = start_len; len <= max_bytes; len *= 2) {
        run_tests(
                out,
                len,
                generate_test_cases(
                        len,
                        false,
                        false,
                        true,
                        false,
                        false)
        );
    }

    const auto output = out.str();
//  std::cout << output;

    {
        std::ofstream out_file ("../test/my_memeq_test_results.txt");
        out_file << output;
    }
}

Test results:

VARIOUS STRINGS

RESULTS for len: 4 bytes
memcmp_memeq time:   1.6e-08 seconds
    my_memeq time:   1.7e-08 seconds
my_memeq_256 time:   1.6e-08 seconds

RESULTS for len: 8 bytes
memcmp_memeq time:   1.6e-08 seconds
    my_memeq time:   1.6e-08 seconds
my_memeq_256 time:   1.6e-08 seconds

RESULTS for len: 16 bytes
memcmp_memeq time:   1.6e-08 seconds
    my_memeq time:   1.6e-08 seconds
my_memeq_256 time:   1.6e-08 seconds

RESULTS for len: 32 bytes
memcmp_memeq time:   1.6e-08 seconds
    my_memeq time:   1.7e-08 seconds
my_memeq_256 time:   1.6e-08 seconds

RESULTS for len: 64 bytes
memcmp_memeq time:   1.6e-08 seconds
    my_memeq time:   1.7e-08 seconds
my_memeq_256 time:   1.8e-08 seconds

RESULTS for len: 128 bytes
memcmp_memeq time:   1.5e-08 seconds
    my_memeq time:   1.8e-08 seconds
my_memeq_256 time:   1.9e-08 seconds

RESULTS for len: 256 bytes
memcmp_memeq time:   1.6e-08 seconds
    my_memeq time:   1.8e-08 seconds
my_memeq_256 time:   1.9e-08 seconds

RESULTS for len: 512 bytes
memcmp_memeq time:   1.7e-08 seconds
    my_memeq time:   1.9e-08 seconds
my_memeq_256 time:     2e-08 seconds

RESULTS for len: 1024 bytes
memcmp_memeq time:   1.9e-08 seconds
    my_memeq time:   2.1e-08 seconds
my_memeq_256 time:   2.2e-08 seconds

RESULTS for len: 2048 bytes
memcmp_memeq time:   2.5e-08 seconds
    my_memeq time:   2.5e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 4096 bytes
memcmp_memeq time:   4.1e-08 seconds
    my_memeq time:   3.3e-08 seconds
my_memeq_256 time:   3.5e-08 seconds

RESULTS for len: 8192 bytes
memcmp_memeq time:     8e-08 seconds
    my_memeq time:     6e-08 seconds
my_memeq_256 time:     6e-08 seconds

RESULTS for len: 16384 bytes
memcmp_memeq time:  1.57e-07 seconds
    my_memeq time:  1.21e-07 seconds
my_memeq_256 time:  1.22e-07 seconds

RESULTS for len: 32768 bytes
memcmp_memeq time:  2.97e-07 seconds
    my_memeq time:  2.19e-07 seconds
my_memeq_256 time:   2.2e-07 seconds

RESULTS for len: 65536 bytes
memcmp_memeq time:  6.34e-07 seconds
    my_memeq time:  4.25e-07 seconds
my_memeq_256 time:  4.26e-07 seconds

RESULTS for len: 131072 bytes
memcmp_memeq time: 1.429e-06 seconds
    my_memeq time:  9.97e-07 seconds
my_memeq_256 time:   9.8e-07 seconds

DIFFERENCE AT THE START

RESULTS for len: 4 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.6e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 8 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.6e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 16 bytes
memcmp_memeq time:   2.7e-08 seconds
    my_memeq time:   2.6e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 32 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.8e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 64 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 128 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 256 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.8e-08 seconds
my_memeq_256 time:   2.8e-08 seconds

RESULTS for len: 512 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.8e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 1024 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.8e-08 seconds

RESULTS for len: 2048 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.8e-08 seconds

RESULTS for len: 4096 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 8192 bytes
memcmp_memeq time:   2.4e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 16384 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.8e-08 seconds

RESULTS for len: 32768 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.8e-08 seconds

RESULTS for len: 65536 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.8e-08 seconds

RESULTS for len: 131072 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

DIFFERENCE AT THE END

RESULTS for len: 4 bytes
memcmp_memeq time:   2.5e-08 seconds
    my_memeq time:   2.6e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 8 bytes
memcmp_memeq time:   2.5e-08 seconds
    my_memeq time:   2.6e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 16 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 32 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 64 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.8e-08 seconds

RESULTS for len: 128 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 256 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 512 bytes
memcmp_memeq time:     3e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 1024 bytes
memcmp_memeq time:   3.5e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.8e-08 seconds

RESULTS for len: 2048 bytes
memcmp_memeq time:   4.2e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 4096 bytes
memcmp_memeq time:   6.1e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 8192 bytes
memcmp_memeq time:   9.8e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 16384 bytes
memcmp_memeq time:  1.95e-07 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 32768 bytes
memcmp_memeq time:  4.84e-07 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 65536 bytes
memcmp_memeq time:  9.42e-07 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

RESULTS for len: 131072 bytes
memcmp_memeq time: 1.862e-06 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

DIFFERENCE AT THE MIDDLE

RESULTS for len: 4 bytes
memcmp_memeq time:   2.7e-08 seconds
    my_memeq time:   2.6e-08 seconds
my_memeq_256 time:   2.6e-08 seconds

RESULTS for len: 8 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   3.2e-08 seconds
my_memeq_256 time:   4.9e-08 seconds

RESULTS for len: 16 bytes
memcmp_memeq time:   2.7e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.7e-08 seconds

RESULTS for len: 32 bytes
memcmp_memeq time:   2.8e-08 seconds
    my_memeq time:   2.8e-08 seconds
my_memeq_256 time:   2.7e-08 seconds

RESULTS for len: 64 bytes
memcmp_memeq time:   2.7e-08 seconds
    my_memeq time:     3e-08 seconds
my_memeq_256 time:     3e-08 seconds

RESULTS for len: 128 bytes
memcmp_memeq time:   2.8e-08 seconds
    my_memeq time:     3e-08 seconds
my_memeq_256 time:   3.2e-08 seconds

RESULTS for len: 256 bytes
memcmp_memeq time:   2.8e-08 seconds
    my_memeq time:   3.3e-08 seconds
my_memeq_256 time:   3.1e-08 seconds

RESULTS for len: 512 bytes
memcmp_memeq time:   3.1e-08 seconds
    my_memeq time:   3.2e-08 seconds
my_memeq_256 time:   3.3e-08 seconds

RESULTS for len: 1024 bytes
memcmp_memeq time:   3.3e-08 seconds
    my_memeq time:   3.4e-08 seconds
my_memeq_256 time:   3.6e-08 seconds

RESULTS for len: 2048 bytes
memcmp_memeq time:   3.7e-08 seconds
    my_memeq time:   3.7e-08 seconds
my_memeq_256 time:   3.8e-08 seconds

RESULTS for len: 4096 bytes
memcmp_memeq time:   4.3e-08 seconds
    my_memeq time:   4.7e-08 seconds
my_memeq_256 time:   4.8e-08 seconds

RESULTS for len: 8192 bytes
memcmp_memeq time:   5.8e-08 seconds
    my_memeq time:     6e-08 seconds
my_memeq_256 time:   6.1e-08 seconds

RESULTS for len: 16384 bytes
memcmp_memeq time:  1.02e-07 seconds
    my_memeq time:  1.06e-07 seconds
my_memeq_256 time:  1.07e-07 seconds

RESULTS for len: 32768 bytes
memcmp_memeq time:  2.31e-07 seconds
    my_memeq time:  2.36e-07 seconds
my_memeq_256 time:  2.41e-07 seconds

RESULTS for len: 65536 bytes
memcmp_memeq time:  5.17e-07 seconds
    my_memeq time:   5.2e-07 seconds
my_memeq_256 time:  5.23e-07 seconds

RESULTS for len: 131072 bytes
memcmp_memeq time:   9.9e-07 seconds
    my_memeq time: 1.007e-06 seconds
my_memeq_256 time:  9.78e-07 seconds

Separately, here are the results for the largest memory blocks:

VARIOUS STRINGS
RESULTS for len: 131072 bytes
memcmp_memeq time: 1.429e-06 seconds
    my_memeq time:  9.97e-07 seconds
my_memeq_256 time:   9.8e-07 seconds

DIFFERENCE AT THE START
RESULTS for len: 131072 bytes
memcmp_memeq time:   2.6e-08 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

DIFFERENCE AT THE END
RESULTS for len: 131072 bytes
memcmp_memeq time: 1.862e-06 seconds
    my_memeq time:   2.7e-08 seconds
my_memeq_256 time:   2.9e-08 seconds

DIFFERENCE AT THE MIDDLE
RESULTS for len: 131072 bytes
memcmp_memeq time:   9.9e-07 seconds
    my_memeq time: 1.007e-06 seconds
my_memeq_256 time:  9.78e-07 seconds

As can be seen, my implementations can provide significant performance improvements in certain cases.
However, slight performance degradation can be expected for smaller memory blocks.

I am interested to know if there is an existing solution specifically optimized for checking memory block equality.

5
  • 2
    You have "END" difference time smaller than "MIDDLE" difference time (and the same as "START" time). This implies your version is not exactly faster, it just makes comparisons in a different order (or there's a bug in the test). Commented Jul 1, 2024 at 15:40
  • 6
    Personally I wouldn't want to try and outperform memcmp - it's been optimized to hell and back over the years. The tiny overhead involved in detecting less than/greater than is so minuscule that I doubt it will ever make an actual difference to anyone. Just use memcmp and move on to more interesting and relevant stuff. Commented Jul 1, 2024 at 16:28
  • 4
    In the end every integer comparison is implemented using subtraction. That means keeping the tristate outcome eventually has zero cost. You cannot optimize it out. Commented Jul 1, 2024 at 16:48
  • 1
    volatile bool result; - Why volatile? Unless you are interacting directly with hardware or similar, volatile should never be a thing you use and I don't see any reason for it in your code. It's a huge red flag unless there's an explanation (and a very good one at that) nearby. Commented Jul 1, 2024 at 17:31
  • 1
    Also, don't use std::chrono::high_resolution_clock - see the notes at en.cppreference.com/w/cpp/chrono/high_resolution_clock - you don't know what you are getting. Use steady_clock for timing stuff. Commented Jul 1, 2024 at 17:35
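A minimal sketch of what the last two comments suggest: time with steady_clock and keep the results observable without volatile. The helper name mean_seconds_per_call and its parameters are assumptions made for this example, not part of the question's code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper: calls an equality predicate `iterations` times and
// returns the mean time per call in seconds, measured with steady_clock.
// The number of "equal" results is written to `equal_count`, so the calls
// have an observable effect without a volatile variable.
template <typename Func>
double mean_seconds_per_call(Func is_equal, const void* a, const void* b,
                             std::size_t n, int iterations,
                             std::size_t& equal_count) {
    assert(iterations > 0);
    equal_count = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        equal_count += is_equal(a, b, n) ? 1 : 0;
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

An optimizer may still hoist or fold a call it can prove is pure, so a benchmarking library's facility such as benchmark::DoNotOptimize from Google Benchmark is likely a more robust choice for real measurements.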

2 Answers

0

Background

Fundamentally, processors set a condition code when comparing, then perform a conditional branch or jump. Some common operations are subtraction, exclusive-or and a compare instruction.

(Note: if the sizes of the two memory blocks are known, compare the sizes first. The fastest memory block comparison is the one you never perform.)
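A minimal sketch of the size check, assuming each block carries its own length. The Block struct and blocks_equal function are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical block descriptor: a pointer plus the block's length in bytes.
struct Block {
    const void* data;
    std::size_t size;
};

bool blocks_equal(const Block& a, const Block& b) {
    if (a.size != b.size) {
        return false;  // different lengths: no byte is ever read
    }
    if (a.size == 0) {
        return true;   // two empty blocks; avoids passing null to memcmp
    }
    assert(a.data != nullptr && b.data != nullptr);
    return std::memcmp(a.data, b.data, a.size) == 0;
}
```

When the lengths differ, the function returns after a single integer comparison, regardless of how large the blocks are.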

An efficient equality-comparison function only needs to find a single cell that is different. That cell could be the first, in the middle, or the last. This is the thorn in optimizing the algorithm. Worst case, all cells will need to be tested.

Every branch is slower than a comparison. This means the processor can execute multiple comparison (data processing) instructions in the time it takes to execute one branch instruction. The goal is to reduce the number of branches.

An Experiment

One idea is to have your code perform several of these bitwise operations:
diff |= (*p_block_1++ ^ *p_block_2++);
before performing a conditional branch on diff, which starts at zero and becomes nonzero as soon as any pair of cells differs. Although this technique may perform extra comparisons, their cost is negligible. Memory fetches must stay inside the two blocks; don't read past the end of either one.

Using the same idea, read in as many cells as possible, per operation. For example, if your processor has 64-bit registers, read in 64-bit quantities from memory before the operation.
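The accumulate-then-branch idea with 64-bit reads can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: the function name memeq_chunked and the choice of four words per branch are made up for the example, and the result should be benchmarked against memcmp before use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: XOR 64-bit words, OR the results together, and branch only once
// per group of words instead of once per byte.
bool memeq_chunked(const void* s1, const void* s2, std::size_t n) {
    const unsigned char* p1 = static_cast<const unsigned char*>(s1);
    const unsigned char* p2 = static_cast<const unsigned char*>(s2);
    constexpr std::size_t kWord = sizeof(std::uint64_t);
    constexpr std::size_t kWordsPerBranch = 4;  // assumption: tune by benchmark
    constexpr std::size_t kGroup = kWord * kWordsPerBranch;

    while (n >= kGroup) {
        std::uint64_t diff = 0;
        for (std::size_t i = 0; i < kWordsPerBranch; ++i) {
            std::uint64_t a;
            std::uint64_t b;
            // memcpy is the portable way to do an unaligned 64-bit load.
            std::memcpy(&a, p1 + i * kWord, kWord);
            std::memcpy(&b, p2 + i * kWord, kWord);
            diff |= a ^ b;  // nonzero as soon as any bit differs
        }
        if (diff != 0) {
            return false;   // one branch per 32 bytes
        }
        p1 += kGroup;
        p2 += kGroup;
        n -= kGroup;
    }

    // Tail: fewer than 32 bytes remain, handled bytewise inside the blocks.
    unsigned int tail_diff = 0;
    for (std::size_t i = 0; i < n; ++i) {
        tail_diff |= static_cast<unsigned int>(p1[i] ^ p2[i]);
    }
    return tail_diff == 0;
}
```

The tail loop never reads past either block, which matches the restriction on memory fetches mentioned above.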

Summary

To optimize your memory compares:

  1. Reduce the branch instructions, such as loop branches. (Biggest speed bottleneck)
  2. Reduce comparison branches (e.g. jump if zero).
  3. Reduce fetch time (fetch as much memory as possible in an instruction).
  4. Run blocks of comparisons in parallel (this needs to be benchmarked).

1 Comment

When I implemented this algorithm, I had to check the size first. The algorithm may require a minimal block size in order to outperform a linear comparison.
0

The fact that memcmp provides ordering information means it might be less efficient than a function specifically designed to check for equality.

That is not the case. The sequence of instructions executed is subject to pipelining with slight out-of-order execution (so loads can be performed prior to the branch they depend on) on a super-scalar architecture (more than one instruction per clock) with a large number of execution ports that have different capabilities (so different types of instructions can execute in parallel).

You may safely assume that the instruction behind REPE CMPSB (which memcmp may compile to on some toolchains) is implemented in microcode very similar to your single my_memeq_256 iteration.

Specifically, speaking for recent x86 architectures, out of the 8-10 execution ports each CPU core has, only 2 are capable of performing memory loads. Once those are fully saturated with loads of the maximum permissible width, you are capped out on any form of memory throughput, even when loading straight from the L1 cache. That is typically the case for any sort of operation that performs only a single arithmetic or logical operation on two contiguous memory ranges, up to and including vectorized floating point fused-multiply-add.

Now you have 1 port dealing with memory comparisons on vectors and 1 port dealing with vector shuffling (including e.g. _mm256_movemask_epi8), leaving you with another 2-3 execution ports still capable of handling simple pointer arithmetic (such as incrementing addresses), branching and the like.

Check e.g. for Skylake which ports can execute which instruction type: https://en.wikichip.org/w/images/7/7e/skylake_block_diagram.svg

Effectively this means that a certain level of "scalar" instructions and branching is "for free" as long as you are bound by either vector arithmetic or vectorized memory loads / stores. Not "free" in the sense of "doesn't cost energy", but still "free" in the sense of no measurable overhead in terms of time cost.
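To make the instruction mix concrete, here is a sketch of an equality-only loop using SSE2, with a plain memcmp fallback where SSE2 is not available. The function name memeq_sse2 and the macro MEMEQ_HAVE_SSE2 are assumptions made for this example. It measures nothing; it only shows that each 16-byte step consists of two loads, one vector compare, one movemask, and some scalar bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define MEMEQ_HAVE_SSE2 1  // hypothetical macro, local to this sketch
#endif

bool memeq_sse2(const void* s1, const void* s2, std::size_t n) {
#ifdef MEMEQ_HAVE_SSE2
    const unsigned char* p1 = static_cast<const unsigned char*>(s1);
    const unsigned char* p2 = static_cast<const unsigned char*>(s2);
    while (n >= 16) {
        // Two unaligned 16-byte loads: these are what occupy the load ports.
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p1));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p2));
        // One bytewise vector compare and one movemask per step.
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
        if (mask != 0xFFFF) {
            return false;  // at least one byte pair differs
        }
        // Scalar bookkeeping: pointer increments and the loop branch.
        p1 += 16;
        p2 += 16;
        n -= 16;
    }
    return n == 0 || std::memcmp(p1, p2, n) == 0;
#else
    // Fallback for targets without SSE2.
    return n == 0 || std::memcmp(s1, s2, n) == 0;
#endif
}
```

The pointer increments and the branch are the "scalar" work that, according to the argument above, can run on otherwise idle ports while the loads set the pace.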
