So you have two sets, right, and you want to know their symmetric
difference --- that is, the set of things that are in one set and not
the other, or in the other set and not the one.
Not too bad. Just sort them, walk down both lists, and see where they differ.
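If the two sets lived on the same machine, that merge walk really is just a few lines. Here's a rough Python sketch, assuming the elements are plain comparable values (the function name is my own):

```python
def local_symmetric_difference(set_a, set_b):
    """Sort both sets and walk them in lockstep, keeping whatever
    appears on only one side."""
    a, b = sorted(set_a), sorted(set_b)
    i = j = 0
    diff = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            diff.append(a[i])
            i += 1
        else:
            diff.append(b[j])
            j += 1
    diff.extend(a[i:])   # whatever's left over is unmatched
    diff.extend(b[j:])
    return diff
```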
However, you are unlucky today, and they are very big sets, in very
far away places.
Now simply comparing them element-by-element really sucks --- you have
to transmit every element across the wire to see if it exists at the
remote end! Oof, such wasted bandwidth. The only piece of good luck you have
is that, for some reason, you know they are very similar ---
that the size of their difference is small. What can you do, then?
Well, your mind turns to checksumming. What happens if each side takes a big
XOR of all the elements of its own list, and we only transmit that?
This'll probably alert you when the sets are different, at least.
Although it's not perfect. If one set is {1,2} and the other {3}, the checksum
will lie to you and tell you they're the same set, even though they're not.
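Just to see that collision concretely (assuming the elements are integers and the checksum is a single XOR of all of them):

```python
def xor_checksum(elements):
    """XOR every element together into one word."""
    acc = 0
    for x in elements:
        acc ^= x
    return acc

print(xor_checksum({1, 2}) == xor_checksum({3}))  # True: 1 ^ 2 == 3, so the checksum can't tell them apart
```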
Let's transmit a checksum of hashes of all the elements as well:
pick your favorite hash function \(H\), and XOR together \(H[x]\) for every
element \(x\), right alongside the plain XOR of the elements themselves.
Ok, great. This is a pretty reliable measure of whether the
sets are different --- that is, whether the symmetric difference is
nonempty or not --- and, even better, in the event that they differ
by exactly one element, we can reliably detect that, and recover what
that element was! Make the abbreviations:
\[s_A = \bigoplus_{i=1}^n A_i \qquad s_B = \bigoplus_{i=1}^m B_i\]
\[h_A = \bigoplus_{i=1}^n H[A_i] \qquad h_B = \bigoplus_{i=1}^m H[B_i]\]
These are the only four chunks of data sent across the wire. Or, actually,
we only need to send two of them: either the person holding the A-database sends \(s_A\) and \(h_A\) to the person holding the B-database (who just
computes \(s_B\) and \(h_B\) herself) or the other way around.
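Concretely, each side could boil its whole set down to such a pair along these lines. This is only a sketch: the 64-bit masking and the SHA-256-based stand-in for \(H\) are illustration choices of mine, not anything prescribed above.

```python
import hashlib

MASK = (1 << 64) - 1   # treat every element as a 64-bit word

def H(x: int) -> int:
    """Stand-in for 'your favorite hash function': 64 bits of SHA-256."""
    return int.from_bytes(hashlib.sha256((x & MASK).to_bytes(8, "big")).digest()[:8], "big")

def summarize(elements):
    """Boil a set down to (s, h): the XOR of its elements and the XOR of their hashes."""
    s = h = 0
    for x in elements:
        s ^= x & MASK
        h ^= H(x)
    return s, h
```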
In any event, take a look at whether
\[H(s_A \oplus s_B) = h_A \oplus h_B\]
I claim this can only happen if the difference-set \(D =
\{A_1,\ldots,A_n\} \triangle \{B_1,\ldots,B_m\}\) has size one.
This is because anything common to both sets appears in both XORs and cancels, leaving
\[ s_A \oplus s_B = \bigoplus_{x\in D} x\]
\[ h_A \oplus h_B = \bigoplus_{x\in D} H(x)\]
and you're only going to find that
\[ H\left( \bigoplus_{x\in D} x\right) =
\bigoplus_{x\in D} H(x)\]
if \(D\) is a singleton (or empty, but that case is easy to tell apart,
since then \(s_A \oplus s_B\) and \(h_A \oplus h_B\) are both zero). And when
\(D\) really is a singleton, the missing element we're interested in is just
\(s_A \oplus s_B\).
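In code, continuing the sketch above (reusing the made-up \(H\) and summarize from there), the whole one-element reconciliation looks something like this:

```python
def reconcile_one(s_a, h_a, s_b, h_b):
    """Return the lone differing element if the digests say |D| is 0 or 1, else None."""
    s, h = s_a ^ s_b, h_a ^ h_b
    if s == 0 and h == 0:
        return None          # the sets are (almost certainly) identical
    if H(s) == h:
        return s             # exactly one element differs, and this is it
    return None              # the difference is bigger than one element

s_a, h_a = summarize({1, 2, 3, 4})
s_b, h_b = summarize({1, 2, 3})
print(reconcile_one(s_a, h_a, s_b, h_b))   # 4
```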
How do we recover more than one differing element?
A recent paper by Eppstein, Goodrich, Uyeda, and Varghese
("What's the Difference? Efficient Set Reconciliation without Prior Context")
gives a really neat way of doing it, by
throwing a clever variant of Bloom filters at the problem. (There is
a longer trail in the literature than just this paper, but this
is the algorithm that I feel like talking about right now!)
Pick a few more of your favorite hash functions, say \(p\) of them,
call them \(K_1,K_2,\ldots,K_p\),
whose range is rather small: \(0\ldots b-1\) for some number of buckets \(b\).
These hash functions assign set-elements to buckets, but redundantly by a factor of \(p\). Here's a picture for \(p=3\) and \(b = 6\):
For every bucket we do the same kind of checksumming calculation that
we did before, taking a big XOR of all the elements that ended up in
that bucket, and a big XOR of the hashes of those elements. We
transmit \(2b\) numbers across the wire, two per bucket.
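Building that table might look roughly like this, continuing the earlier sketch. One liberty I'm taking purely to keep the code simple: each \(K_i\) gets its own slice of \(b/p\) buckets, so a single element can never land in the same cell twice and cancel itself out. That partitioning (and the seeded reuse of \(H\) as the \(K_i\)) is my own simplification, not necessarily how the paper lays out its buckets.

```python
def K(i: int, x: int, width: int) -> int:
    """Bucket hash K_i: a seeded reuse of H, mapped into 0..width-1."""
    return H(H(x) ^ i) % width

def build_table(elements, p: int, b: int):
    """One [xor-of-elements, xor-of-hashes] cell per bucket."""
    width = b // p                        # each K_i owns its own slice of the table
    table = [[0, 0] for _ in range(width * p)]
    for x in elements:
        x &= MASK
        hx = H(x)
        for i in range(p):
            cell = table[i * width + K(i, x, width)]
            cell[0] ^= x
            cell[1] ^= hx
    return table
```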
To actually find the set difference, we
do the same sort of XOR-everything-in-sight, but this time bucket-by-bucket,
and look at what's left.
If we're lucky, some bucket will
contain just one differing element after most of the sets' elements in
that bucket cancel out. For instance, in bucket #1, the Ds cancel out
and we're left with \(E\) and \(H[E]\). In bucket #2, the Bs and Ds
cancel out and we're left with \(C\) and \(H[C]\). And in these
cases we can reliably check that the hash-of-differences is the
difference-of-hashes to see that we have exactly one element, just as
we did above.
The thing that I found really clever is the observation that once
you get an element that you know is in the symmetric difference
of the original two sets, you know all \(p\) of the buckets it hashes
to, so you can cancel it out of them and make even more progress. Once we find
\(C\), we can cancel out the \(C\) in bucket 5, which leaves just an \(F\),
which we were unable to notice before, since \(H[F\oplus C] \ne H[F]\oplus H[C]\).
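Here's the peeling loop I have in mind, under the same assumptions as the sketches above (one side ships its table, the other XORs in its own, then unravels):

```python
def decode(table_a, table_b, p: int, b: int):
    """XOR the two tables cell by cell, then peel: whenever a cell passes the
    hash check (exactly one element left in it), recover that element and
    cancel it out of every cell it hashes to."""
    width = b // p
    diff = [[sa ^ sb, ha ^ hb] for (sa, ha), (sb, hb) in zip(table_a, table_b)]
    recovered = set()
    progress = True
    while progress:
        progress = False
        for cell in diff:
            s, h = cell
            if s != 0 and H(s) == h:         # a "pure" cell
                recovered.add(s)
                for i in range(p):           # wipe s out of all p cells it maps to
                    c = diff[i * width + K(i, s, width)]
                    c[0] ^= s
                    c[1] ^= H(s)
                progress = True
    if any(cell != [0, 0] for cell in diff):
        return None                          # peeling got stuck; b was too small
    return recovered

# Hypothetical example: two sets differing in three elements.
A = set(range(1000))
B = (A - {17, 404}) | {5000}
print(decode(build_table(A, 3, 24), build_table(B, 3, 24), 3, 24))  # {17, 404, 5000}, very likely
```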
So it turns out we have very good odds of succeeding at
stubbornly unravelling the loose thread in the sweater, and recovering
the exact set of elements in symmetric difference, as long as
\(b\) is at least about 1.5 times the size of the difference, and
\(p\) is ideally about 3 or 4. The theory behind these numbers is
calculating the probability that a random \(p\)-uniform hypergraph (every edge
touching \(p\) vertices) with \(|D|\) edges and \(b\) vertices has an empty 2-core.
This also has some fascinating connections to phase transitions
in randomly-created SAT problems --- a topic for another time, maybe.