
We have a system that periodically (say every 30 minutes) verifies that two locations hold the same data. You can assume the data is made of chunks and each chunk has a unique name. The current approach queries both locations for their chunk names and compares the two lists. Because there are a lot of chunks, the system spends a lot of time fetching chunk names from the database and sending them over to the matcher.

Is there something out there I can use to optimize this, so that we do not need to send the full list of chunk names each time?

If the chunks were static, we could just compute a CRC-32 on each side and compare; only if the checksums did not match would we query the full chunk lists. But in our system chunks can be deleted or added at any time, so we need something like a running checksum, to which we can add or subtract a chunk name. I thought about a Bloom filter, but it will not work for us because it can generate false positives. We need to be sure.
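One way to get a running checksum with add/subtract support is an order-independent set digest: hash each chunk name to a fixed-width integer and keep the sum modulo 2^64. Adding or removing a chunk is then O(1), and both sides only exchange one 64-bit value per scan. This is a minimal sketch, not a drop-in for any particular system; the chunk names used below are made up, and like CRC-32 it can collide (roughly 2^-64 per comparison), so an occasional full scan is still a good safety net.

```python
import hashlib

MASK = (1 << 64) - 1


def chunk_hash(name: str) -> int:
    # Map a chunk name to a 64-bit integer via SHA-256.
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big")


class RunningChecksum:
    """Order-independent digest of a set of chunk names.

    digest = sum of per-chunk hashes mod 2**64, so add/remove are O(1)
    and the result does not depend on insertion order.
    """

    def __init__(self):
        self.digest = 0

    def add(self, name: str) -> None:
        self.digest = (self.digest + chunk_hash(name)) & MASK

    def remove(self, name: str) -> None:
        self.digest = (self.digest - chunk_hash(name)) & MASK


# Both locations maintain a digest; the matcher compares two integers
# instead of two full name lists.
a, b = RunningChecksum(), RunningChecksum()
for n in ["chunk-1", "chunk-2", "chunk-3"]:
    a.add(n)
for n in ["chunk-3", "chunk-1", "chunk-2"]:  # same set, different order
    b.add(n)
assert a.digest == b.digest

a.remove("chunk-2")
assert a.digest != b.digest
```

Only when the digests differ do you fall back to fetching and diffing the actual chunk lists, so the common all-in-sync case becomes nearly free.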

  • A CRC-32 can also give a false positive. How sure is your "sure"? Commented Jun 13, 2023 at 23:23
  • @MarkAdler Once in a while is okay, since we scan and match periodically. Say out of 100 scans, one can be a full match. A collision would delay fixing an inconsistency between the two data sources, but it lets the system scale; the lower the collision probability, the better. Do you have something in mind? Commented Jun 14, 2023 at 1:24
  • I think, you can use some kind of Merkle Tree: en.wikipedia.org/wiki/Merkle_tree Commented Jun 14, 2023 at 3:55
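The Merkle-tree idea from the last comment can be sketched roughly as follows: partition the chunk names into buckets (here by hash prefix), hash each bucket, and hash the bucket hashes into a root. If the roots match, the sets match (up to hash collisions); if not, comparing per-bucket hashes narrows the diff to a few buckets instead of the whole list. This is an illustrative sketch, not the questioner's system; the bucket count and names are assumptions.

```python
import hashlib


def merkle_root(names, buckets=4):
    """Return (root_hex, leaf_digests) for a set of chunk names.

    Names are bucketed by hash, each bucket is hashed, and the root is the
    hash of the concatenated bucket digests. Deterministic for a given set.
    """
    groups = [[] for _ in range(buckets)]
    for n in sorted(names):
        idx = int(hashlib.sha256(n.encode()).hexdigest(), 16) % buckets
        groups[idx].append(n)
    leaves = [hashlib.sha256("\0".join(g).encode()).digest() for g in groups]
    root = hashlib.sha256(b"".join(leaves)).hexdigest()
    return root, leaves


# Compare roots first; on mismatch, compare leaves to find the bucket(s)
# that differ and re-fetch only those chunks.
root_a, leaves_a = merkle_root({"chunk-1", "chunk-2", "chunk-3"})
root_b, leaves_b = merkle_root({"chunk-3", "chunk-2", "chunk-1"})
assert root_a == root_b  # same set, same root

root_c, leaves_c = merkle_root({"chunk-1", "chunk-3"})
assert root_a != root_c
bad_buckets = [i for i, (x, y) in enumerate(zip(leaves_a, leaves_c)) if x != y]
```

Unlike the flat running checksum, the tree costs more to maintain per update, but it localizes a mismatch, so recovery after a divergence transfers far less data.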
