Featured image of post V. Universal and Perfect Hashing

V. Universal and Perfect Hashing

Minimizing hash collisions

Motivation

A prevalent usecase of hashing is in storing sets or mappings for a subset of the input space — hash tables. An optimal hash table uniformly distributes elements among its buckets.

Universal Hashing

Definition

A randomized algorithm $H$ for constructing hash functions $h:U\rightarrow\lbrace 0,1,…,M-1\rbrace$ is universal if $\forall x \neq y \text{ s.t. } x, y\in U$, we have $$\mathbb{P} [h(x)=h(y)|\thinspace h \leftarrow H]\leq\frac{1}{M}$$

Construction

Random Matrix

Suppose keys are $u$-bits long and $M=2^m$. Define $A$ to be a $m$-by-$u$ matrix filled with $0$ and $1$ randomly.

muAx=h(x)=Ax
Claim
$H=\lbrace h\rbrace$ is universal
Proof
Consider an arbitrary pair of distinct keys $x, y$. Suppose they differ in the $i$th bit. WLOG, $x_i=0$ and $y_i=1$.

Observe that regardless of the elements in the $i$th column of $A$, $h(x)=Ax$ since $x_i=0$.

However, each of the $2^m$ possibilities for the $i$th column of $A$ yield distinct $h(y)=Ay$.

A bit flip in the $i$th column of $A$ at row $j$ flips $Ay$ at the $j$th bit

$$\mathbb{P} [Ax=Ay]=\frac{1}{2^m}$$

This is unfortunately quite space inefficient.

Random Vector

View the key $x$ as a vector of integers $\langle x_1, x_2, …, x_k \rangle$ where $0\leq x_i < M$ and $M$ is prime.

Define a $k$-length vector $r_1, r_2, …, r_k$ filled with random values where $0\leq r_i < M$.

$$h(x)=r\cdot x\mod M$$

Claim
$H=\lbrace h\rbrace$ is universal
Proof
Consider an arbitrary pair of distinct keys $x, y$. Suppose they differ in the $i$th number $x_i \neq y_i$.

Consider the dot product defined by $h$ excluding the $i$th expression. Specifically,

$$h’(x)=\sum_{j\neq i}r_jx_j$$ Thus,
$$h(x)=h’(x)+r_ix_i$$ Collision between $x, y$ occurs precisely when $h’(x) + r_ix_i = h’(y) + r_iy_i\mod M$.
$$r_i(x_i-y_i)=h’(y)-h’(x)\mod M$$
Note that because of $M$’s primality, every integer has a multiplicative inverse. Thus, $r_i$ is unique.
$$\mathbb{P} [h(x)=h(y)]=\frac{1}{M}$$

Perfect Hashing

Definition

A hash function is perfect for a set $S, |S|=N$ if all lookups involve $\mathcal{O}(1)$ work.

Construction

Try 1 — Quadratic Space

Let $H$ be universal and $M=N^2$.

Claim
$\mathbb{P}[\exists\text{ collision in $S$}]< \frac{1}{2}$
Proof
There are $N\choose 2$ pairs $(x, y)$ in $S$. Each pair has at most $\frac{1}{M}=\frac{1}{N^2}$ collision probability by definition of universality.
$\mathbb{P}[\exists\text{ collision in $S$}]\leq \frac{N \choose 2}{N^2}<\frac{1}{2}$

Try 2 — Linear Space

Let $H$ be universal and $M=N$. Hash into the first layer with $N$ buckets. Each bucket maps to a secondary layer each with $C_i^2$ slots, where $C_i$ represents the number of elements that collide in the $i$th bucket of the first layer.

N
Theorem
$\mathbb{P}[\sum_iC_i^2 > 4N]<\frac{1}{2}$
Proof
Let $I_{xy}$ be an indicator that $x,y$ collide. Observe that within any secondary layer with $C_i$ elements ($C_i^2$ slots), for any two elements $x, y$, $I_{xy}=1$ (including $I_{xx}$, this amounts to $C_i^2$).
$$\begin{align*}\mathbb{E}[\sum_iC_i^2]&=\mathbb{E}[\sum_x\sum_yI_{xy}]\newline&=N+\sum_x\sum_{y\neq x}\mathbb{E}[C_{xy}]\newline&\leq N+\frac{N(N-1)}{M}\newline&=N+\frac{N(N-1)}{N}\newline&<2N\end{align*}$$
By Markov’s Inequality, the problem statement is proven.

$$\mathbb{P}[X\geq a]\leq\frac{\mathbb{E}[X]}{a}$$

My heart is in the work ― Andrew Carnegie
Built with Hugo
Theme Stack designed by Jimmy