HASHING: A great practical tool, with interesting & subtle theory too.

Basic setting: We have a set of items we want to store for efficient lookup.
Two versions:
 - STATIC: Given a set S of items, store them for efficient lookup.
 - DYNAMIC: The data structure processes a sequence of insert, lookup, and
   delete requests. We want to do all of these efficiently.

In the static case, we can use a sorted array with binary search for lookups.
In the dynamic case, we can use a balanced search tree. Both of those schemes
have O(log n) complexity. Hashing gives an alternative approach that is often
the fastest and most convenient way of solving these problems.

Applications: An AI-search program may want to "cache" subproblems or
situations (board positions or elements of state-space) that it has already
solved, so that it does not redo the same computation when it encounters them
again. There are many other uses in cryptography, networks, and complexity
theory.

FORMAL SETUP
------------
 - The keys come from a large universe U (e.g., all strings of at most 50
   characters).
 - The INSTANCE S is a subset of U (static or dynamic), with N = |S|. These
   are the keys a particular run of the program must process.
 - The data structure is an array (table) A of size M, and a HASH FUNCTION
   h: U -> {0,...,M-1}, which maps each key to a location in the table.
   Given an element x, we store it in location A[h(x)].

   Note: If U were small (like 3-character strings), we could just store x in
   A[x], as in bucket sort. But U is big (e.g., 50-character strings, or
   32-bit numbers); that's why we need the hash function.

 - The hashing may produce collisions---two keys x and y may map to the same
   location: h(x) = h(y). There are several ways to deal with such collisions.
   The simplest one is called "separate chaining": make each entry in A a
   linked list, which stores all the keys that map to that location.

DESIRABLE PROPERTIES:
 - The hash table size should be small, e.g. O(N).
 - The hash function should be fast and easy to compute.

A SIMPLE IDEA: Suppose N = 100; that is, we expect to deal with at most 100
keys. We create a table of size M = 100 and use the MOD function:
h(x) = x mod 100. This has both of the desired properties: small size and
fast hashing. An example set of keys: 91, 2048, 65000, 547356, 329, etc...

What can go wrong? What if all keys have the same last two digits, e.g. 00?
Then all keys map to the same location, and the hash table becomes an
unordered linked list, with O(N) cost for lookup!!!

This motivates a 3rd desirable property:
 - The hash function should spread keys out in the table. Ideally all buckets
   (lists) should be about the same size.

Why not a RANDOM NUMBER? One idea would be to use randomization: map each key
to a random location in the array. This would help even out the load, but how
do we do a FIND? We would need to know, for each key, what random number was
used to map it! It turns out that pseudo-randomness does help. Before we come
to that, let's first consider some BAD NEWS: the worst-case behavior is bad!

CLAIM: For ANY hash function h, if |U| >= N*M, there exists a set S of size N
where all keys hash to the SAME location.

PROOF: By the pigeonhole principle. Map all keys of U using h. At least one
table location must receive at least N keys; use N of those keys as the set S.

This is partly why hashing seems so mysterious -- how can you claim hashing is
good if for any hash function you can come up with ways of foiling it? One
answer is that there are a lot of simple hash functions that work well in
practice.
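Before moving on to the more formal answers, here is a tiny Python sketch of
the mod-100 failure mode discussed above; the key values are made up for
illustration.

  # Tiny sketch: how h(x) = x mod 100 behaves on two (made-up) key sets.
  from collections import Counter

  M = 100

  def h(x):
      return x % M

  mixed_keys = [91, 2048, 65000, 547356, 329]       # varied last two digits
  all_end_00 = [100, 4200, 73900, 551000, 980000]   # every key ends in 00

  print(Counter(h(x) for x in mixed_keys))   # keys spread over several buckets
  print(Counter(h(x) for x in all_end_00))   # all five keys pile into bucket 0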
There is a formal concept of Universal Hashing and Perfect Hashing, which we
will briefly touch on later. For now, let's claim that hash functions of the
type h(x) = (ax + b) mod p, where p is a prime number larger than N, work
very well in practice, and there is good theoretical justification for that.

Hashing with Separate Chaining:
==============================
Each entry of the hash table has a linked list, which stores all keys mapped
to that position. We use hash functions of the form h(x) = x mod p for
simplicity.

Given a key x, search the linked list at position A[h(x)]. When inserting a
new element x, add it to the head of the list at A[h(x)]. Delete is done by
first searching for the element and then removing it from the list.

Example: x mod 11

 0:
 1: 23 -> 1 -> 56
 2: 24
 3: 36 -> 14
 4:
 5: 16
 6: 17
 7: 7 -> 29
 8:
 9: 31 -> 20 -> 42
10:

Insert 53. 53 mod 11 = 9, so the key 53 is added at location 9, changing that
list to:

 9: 53 -> 31 -> 20 -> 42

ANALYSIS:

Worst case: All keys hash to the same bucket. Insert takes O(1), but delete
and find take O(N).

Average case: If keys are uniformly distributed, then all buckets are about
the same size. Each linked list has expected size N/M. Thus insert is still
O(1), but delete and find become O(1 + N/M). N/M is called the load factor.

--------------------------------------
OPEN ADDRESSING:
================
This is another method for dealing with collisions. Rather than allocate new
memory dynamically, here we use the table itself. Ideally, we would have a
table at least a constant factor bigger than N, so we expect many empty cells.
When the location A[h(x)] is occupied, we use some fixed strategy to look for
the NEXT available open cell.

The simplest strategy is Linear Probing: sequentially try the next position
until an open cell is found.

Example: h(x) = x mod 11
(In this example, 31 was inserted before 9, and 42 was inserted last.)

 0: 42
 1: 1
 2: 24
 3: 14
 4:
 5: 16
 6: 28
 7: 7
 8:
 9: 31
10: 9

Linear Probing: Insert 12

12 mod 11 = 1. Because slots 1, 2, 3 are occupied, the next available slot is
4, so put 12 in A[4].

 0: 42
 1: 1
 2: 24
 3: 14
 4: 12
 5: 16
 6: 28
 7: 7
 8:
 9: 31
10: 9

Linear Probing: Search 15

15 mod 11 = 4. We start at A[4] and continue sequentially until we either find
15 or reach an OPEN SLOT. In this case, we reach the open slot A[8] and report
NOT FOUND.

CODE for Search with Linear Probing:

  hSearch(k) {
      hashVal = k % TableSize;
      j = hashVal;
      do {
          if (empty[j] || A[j] == k) return j;
          j = (j + 1) % TableSize;
      } while (j != hashVal);
      return j;
  }

  find(k, e) {
      b = hSearch(k);
      if (empty[b] || A[b] != k) return false;
      e = A[b];
      return true;
  }

DELETION in Hashing with Linear Probing:

Because empty slots are also used to terminate searches, we cannot simply
delete a key; doing so may invalidate other searches. Instead, we mark the key
as DELETED but do not remove it, and we modify the search so that it simply
goes past deleted keys.

Advantage: easy and correct.
Disadvantage: the table can fill up with dead items.

All these operations take O(1) plus the length of the probe sequence.

Deletion with Linear Probing (Lazy Deletion): Delete 9

9 mod 11 = 9

 0: 42
 1: 1
 2: 24
 3: 14
 4: 12
 5: 16
 6: 28
 7: 7
 8:
 9: 31
10: 9   <-- Found

Change it to:

10: D   <-- Deleted; otherwise a search for 42 could fail

--------------------------------------------------------
Review:

 - Hash function h: U -> {0, 1, ..., M-1}
 - Collision resolution:
     Separate chaining
     Open addressing
       linear probing
       lazy deletion
 - On average, if M = O(N), average search is O(1).
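To make the probing and lazy deletion just reviewed concrete, here is a small
runnable sketch in Python mirroring the hSearch/find pseudocode and the mod-11
example above; the class and its method names are illustrative choices, not
from the notes.

  # Sketch of open addressing with linear probing and lazy deletion.
  # Table size 11 and h(x) = x mod 11, as in the examples above; class and
  # method names are illustrative, not from the notes.
  EMPTY, DELETED = object(), object()

  class LinearProbeTable:
      def __init__(self, size=11):
          self.size = size
          self.slots = [EMPTY] * size

      def _probe(self, key):
          # Walk from h(key) until we hit the key or an EMPTY slot (tombstones
          # marked DELETED are walked past, which keeps old searches valid).
          j = key % self.size
          for _ in range(self.size):
              if self.slots[j] is EMPTY or self.slots[j] == key:
                  return j
              j = (j + 1) % self.size
          return None                      # wrapped all the way around

      def insert(self, key):
          j = key % self.size
          for _ in range(self.size):
              if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                  self.slots[j] = key      # reuse tombstones on insert
                  return
              j = (j + 1) % self.size
          raise RuntimeError("table is full")

      def find(self, key):
          j = self._probe(key)
          return j is not None and self.slots[j] == key

      def delete(self, key):
          j = self._probe(key)
          if j is not None and self.slots[j] == key:
              self.slots[j] = DELETED      # lazy deletion: leave a tombstone

  t = LinearProbeTable()
  for k in [1, 24, 14, 16, 28, 7, 31, 9, 42]:
      t.insert(k)                  # reproduces the table shown above
  t.insert(12)                     # 12 mod 11 = 1; probes 1,2,3, lands in slot 4
  t.delete(9)                      # slot 10 becomes a tombstone, not EMPTY
  print(t.find(42))                # True: the search walks past the tombstone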
-------------------------------------------------------
Eager deletion is possible but expensive: we need to move data around to make
sure the deletion hole does not corrupt valid searches.

QUADRATIC PROBING:

Open addressing with linear probing can have bad search behavior due to
clustered data: a few big blocks of the table get filled. Quadratic probing
attempts to circumvent this by jumping non-linearly: on a collision, probe
h(x) + i^2 for i = 1, 2, 3, ... (all offsets taken mod the table size).

Start at h(x).
If the key is not there, search h(x) + 1.
If the key is not there, search h(x) + 4.
If the key is not there, search h(x) + 9.
If the key is not there, search h(x) + 16, etc.

Example: Insert 12. 12 mod 11 = 1.

 0: 42
 1: 1    <- h(x)
 2: 24   <- h(x) + 1
 3: 14
 4:      <- h(x) + 25   INSERT 12 here
 5: 16   <- h(x) + 4
 6: 28   <- h(x) + 16
 7: 7
 8:
 9: 31
10: 9    <- h(x) + 9

--------------------------------------
DOUBLE HASHING:
==============
Another idea is to use a second hash function to deal with collisions.
Suppose R is the largest prime smaller than the table size. Then consider the
hash function

   h'(x) = R - (x mod R)

 * Why not simply h'(x) = (x mod R)?

INSERTING 12: h'(12) = 7 - (12 mod 7) = 7 - 5 = 2.

Check A[h(12)] = A[1].
If collision, check h(x) + h'(x).
If collision, check h(x) + 2*h'(x).
If collision, check h(x) + 3*h'(x).
I.e., check h(x) + i*h'(x) (mod the table size).

Example:

 0: 42   <- h(x) + 10
 1: 1    <- h(x)
 2: 24   <- h(x) + 12
 3: 14   <- h(x) + 2
 4:      <- h(x) + 14   INSERT 12 here
 5: 16   <- h(x) + 4
 6: 28
 7: 7    <- h(x) + 6
 8:
 9: 31   <- h(x) + 8
10: 9

------------------------------------------------------
Practice of hashing: REHASHING.

 o. If the table gets too full, operations take too long.
 o. Build another table, TWICE as large (use a prime number as the size), and
    rehash everything into it.
 o. Common advice is to rebuild when the table becomes 70% full.

------------------------------------------------------
A Little THEORY OF HASHING:

Recall the bad news:

CLAIM: For any hash function h, if |U| >= (N-1)*M + 1, there exists a set S of
size N where all keys hash to the same location.

PROOF: By the pigeonhole principle.

An idea: Use randomization, as in randomized quicksort. We will use some
random numbers in our construction of h; however, h itself will be a
deterministic function. We will show that for *any* set S, if we pick h in
this random way, things will be good in expectation. This is the idea of
UNIVERSAL hashing. Later we will use universal hashing to get PERFECT HASHING!

UNIVERSAL HASHING
=================
A SET of hash functions H is *universal* if, for a hash function h chosen
uniformly at random from H,

   Pr[ h(x) = h(y) ] <= 1/M,  for all x != y in U.

THEOREM: If H is universal, then for any set S in U, for any x in S, the
*expected* number of collisions x has [for a random h in H] is <= (N-1)/M.

PROOF: Each y != x has <= 1/M chance of colliding with x. Adding up over all
y in S, the expected number is <= (N-1)/M. Or, more formally:
 - Let C(x,y) = 1 if x and y collide, and 0 otherwise.
   Then E[C(x,y)] = Pr(x and y collide) <= 1/M.
 - C(x) = total # of collisions for x = sum_y C(x,y).
 - E[C(x)] = sum_y E[C(x,y)] <= (N-1)/M.

COROLLARY: For any *sequence* of L lookups, the expected *total* time for the
lookups, if we use a random h in H, is O(L(1 + N/M)). That is, constant time
per operation if M = N.

PROOF: By linearity of expectation, the expected total time is just the sum of
the expected times for each lookup. Lookup time is proportional to the length
of the list, which is 1 + (# of collisions).

QUESTION: Can we actually construct a universal hash family? If not, this is
all pretty vacuous. The answer is YES.
We are not actually going to explicitly construct all of H and then pick one
h in H at random. Instead, we'll have a randomized procedure for constructing
h. In fact, it probably would have been more intuitive to define H as a
*probability distribution* over hash functions, and say H is a universal
probability distribution over hash functions if [rest of defn here]. But
``universal sets'' is the way it's done in the literature. The analogy is
randomized quicksort: we can think of it as picking an ordering on pivots at
random from the set of all n! orderings, but instead we find it more natural
to just say: each time you need to pick a pivot, pick one at random.

HOW TO CONSTRUCT: Matrix method
---------------------------------
Suppose the keys are u bits long, and suppose the hash table size is a power
of 2; that is, M = 2^b, so an index is b bits long. We choose the hash
function h to be a random 0/1 matrix of dimensions b-by-u. Then h(x) is the
product of this matrix with the key x (viewed as a u-bit column vector), with
arithmetic modulo 2.

For instance, consider a table of size 8 (3-bit indices) and 4-bit keys:

      h          x        h(x)
  ---------      -        ----
  [1 0 0 0]      1          1
  [0 1 1 1]  x   0    =     1
  [1 1 1 0]      1          0
                 0

CLAIM: For x != y, Pr[h(x) = h(y)] = 1/M = 1/2^b.

PROOF: First of all, what does it mean to multiply h by x? We can think of it
as adding some of the columns of h (mod 2), where the 1 bits in x tell us
which ones to add. (We added the 1st and 3rd columns of h above.)

Now, take arbitrary x and y with x != y. They must differ in some bit
position, say the ith coordinate; for concreteness, assume x_i = 0 while
y_i = 1. Imagine we first choose all of h except the ith column. Since
x_i = 0, the value h(x) is already fixed; it does not depend on the ith
column. But each of the 2^b different settings of the ith column gives a
different value of h(y) [every time we flip a bit in that column, we flip the
corresponding bit of h(y)]. So there is exactly a 1/2^b chance that
h(x) = h(y).

There are other methods based on multiplication modulo primes too.

=====================================================================
Question: If we fix the set S, can we find a hash function h such that *all*
lookups are constant-time? Yes. This leads to...

PERFECT HASHING
===============
A hash function is "perfect" for S if all lookups take O(1) time.

METHOD #1
---------
Suppose we are willing to have a table whose size is quadratic in the size N
of our dictionary S. Then here is an easy method: let H be universal and set
M = N^2. Pick a random hash function h from H and hash everything in S.

CLAIM: Pr[no collisions] >= 1/2. [So we just try an h, and if we get any
collisions, we try a new h.]

PROOF:
 - How many pairs (x,y) in S are there? {N \choose 2}.
 - For each pair, the chance they collide is <= 1/M, by definition of
   universal.
 - So Pr[exists a collision] <= {N \choose 2}/M < 1/2.

This is like the "other side" of the birthday paradox: if the number of days
is a lot *more* than (# people)^2, then there is a reasonable chance that *no*
pair has the same birthday.

BUT, what if we want to use just O(N) space?

METHOD #2:
---------
HISTORY: This was a BIG open problem for a long time -- posed as the question
"Should Tables be Sorted?". With linear space, we can get O(log N) search
using sorted tables. Is it possible to achieve O(1) worst-case search time?
There were many increasingly complicated schemes, with search times
increasingly close to O(1) and increasing generality. The problem was finally
solved using the nice idea of universal hash functions in a 2-level scheme.

PROPOSAL: Hash into a table of size N. We will get some collisions.
Then, for each bin, rehash it using Method #1, squaring the size of the bin
to get zero collisions. E.g., if the 1st hash function maps {a,b,c,d,e,f} to
[--, {a}, {b,c}, {d,e,f}], then the final result would look something like
[--, h_2, h_3, h_4], where h_2 is just the function h_2(x) = 0, h_3 is a
function into a table of size 4, and h_4 is a function into a table of size 9.

The point is: using Method #1, we can find an h_i with no collisions by
picking from a universal H and trying it -- if it doesn't work, we try again
(each time, Pr[success] >= 1/2).

By *design* this is O(1) time for lookups, since we just compute two hash
functions. The question is: how much space does it use? The space used is
O(N) for the 1st table (assume it takes a constant amount of space to store
each hash function), plus O(sum of |B_i|^2) for the other tables (B_i is the
ith bucket). So all we need to do is prove:

THEOREM: If the initial h is chosen from the universal set H, then
Pr[ sum_i |B_i|^2 > 4*N ] < 1/2.

PROOF: We'll prove this by showing that E[ sum_i |B_i|^2 ] < 2*N. (Why does
this give us the result we need? <-- Markov's inequality.)

The neat trick: we count this quantity by counting the number of ordered
pairs that collide, including each element colliding with itself. E.g., if a
bucket has {d,e,f}, then d collides with each of d,e,f; e collides with each
of d,e,f; f collides with each of d,e,f; so we get 9.

Now,
   sum_i |B_i|^2 = # of ordered pairs that collide (allowing x = y)
                 = sum_x sum_y C(x,y)   [C(x,y) = 1 if in same bucket, else 0]
So,
   E[ sum_i |B_i|^2 ] = sum_x sum_y Pr[x and y are in the same bucket]
                      <= N + N(N-1) * (1/M)   [the "1/M" comes from the
                                               definition of universal]
                      < 2*N                   (since we set M = N).

So the final "hash function" is:
 1. Compute i = h(x).
 2. Compute h_i(x) and use it to look up x in table T_i.
The total space used is O(N) + O(sum_i |B_i|^2) = O(N).

============================================================================
Other uses of hashing
=====================
Suppose we have a long sequence of items and we want to see how many
*different* items are in the list. What's a good way of doing that?

One way: Create a hash table; do a lookup and then an insert if the item is
not in the table; count the number of inserts.

Now, what if the list is really huge, so we don't have space to store them
all, but we are OK with just an approximate answer? E.g., imagine we're a
router watching a lot of packets go by, and we want to see (roughly) how many
different source IP addresses there are.

NEAT IDEA: Say we have a hash function h that behaves like a random function,
and let's think of h(x) as a real number between 0 and 1. One thing we can do
is just keep track of the *minimum* hash value produced so far (so we won't
have a table at all). E.g., if the keys are 3,10,3,3,12,10,12 and h(3)=0.4,
h(10)=0.2, h(12)=0.7, then we get 0.2.

The point is: if we pick N random numbers in [0,1], the expected value of the
minimum is 1/(N+1). And there's a good chance it is fairly close (we can
improve the estimate by running several hash functions and taking the median
of the mins). So from the observed MIN we can estimate the number of distinct
elements as roughly (1/MIN) - 1.

Question: Why use a hash function rather than just picking a random number
each time? Because we care about the number of *different* items, not just
the number of items (that problem is a lot easier: just keep a counter...).
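Here is a small runnable sketch in Python of the minimum-hash estimator just
described. The helper names and the choice of SHA-256 as the "random-looking"
hash are my own stand-ins, not something the notes prescribe.

  # Sketch: approximate distinct counting by tracking the minimum hash value.
  # h01 maps a key to a pseudo-random real in [0,1); the salt lets us run
  # several independent hash functions and take the median of their estimates.
  import hashlib
  import random
  from statistics import median

  def h01(salt, x):
      digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
      return int.from_bytes(digest[:8], "big") / 2**64

  def approx_distinct(stream, num_hashes=25):
      salts = [random.getrandbits(64) for _ in range(num_hashes)]
      mins = [1.0] * num_hashes
      for x in stream:
          for i, s in enumerate(salts):
              mins[i] = min(mins[i], h01(s, x))
      # E[min of N uniform values] = 1/(N+1), so each min m suggests about
      # 1/m - 1 distinct elements; the median over independent hash functions
      # reduces the variance.
      return median(1.0 / m - 1.0 for m in mins)

  stream = [random.randrange(10**9) for _ in range(5000)] * 3  # ~5000 distinct
  print(approx_distinct(stream))   # should land in the ballpark of 5000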
=======================================================
Topics:
 * search trees vs hashing,
 * augmenting search tree data structures (if we didn't do it last time)
 * another way of getting universal hashing [if time]
==========================================================================
Search trees versus hashing:
---------------------------
Hashing is very convenient if you just need to do inserts and lookups, but
what are some operations that are better suited to search trees?

 - lookup-by-prefix. E.g., type in the first few letters of someone's name and
   have it output all names in the database that begin with those letters.
 - find-in-range[key1, key2] (output all keys between key1 and key2).
 - a version of lookup(x) that returns the closest key to x when x is not in
   the database.

Now, what if we want to do the following operations:
 - output the kth smallest element (let's assume all keys are distinct),
 - do a version of lookup(x) that tells us the rank of x (how many keys are
   smaller than x)?

If we had a sorted array, these would be easy. To output the kth smallest
element, we just output A[k]. To get the rank of x, we do binary search to
find it and then output which location it's in. So this brings us to the
question: how could we do this with search trees?

Auxiliary information in search trees:
=====================================
By adding extra information to the nodes of a binary search tree, it's often
possible to dramatically expand what can be done. However, the data you add
has to have two properties:
 (1) it's sufficient to solve the problem you want to solve;
 (2) the values of the data in a node can be maintained efficiently.

So here would be a *bad* way of solving the problem: have each node store the
rank of its key. Why is this bad? Because if you insert a new key (that, say,
is smaller than everything else), you may have to update *everyone's* rank.
So we don't have property (2).

What's a better way? One thing we can do is store, in each node, the size of
the subtree rooted at that node.
 - How do we use this to solve the problems?
 - How do we maintain these sizes when we do an insertion or a tree rotation?

===================================================
Here's another method of doing universal hashing that requires less
computation than the one in lecture, and is closer to what you likely coded
up in 15-211.

The hashing scheme: Let's assume keys are in {0,1,...,U-1}, and we are hashing
down to {0,1,...,M-1}. Pick a prime p >= U (or think of U as prime). Define

   h_{a,b}(x) = ( (a*x + b) mod p ) mod M.

H: Pick a at random from 1,...,p-1. Pick b at random from 0,...,p-1.

CLAIM: H is a 2-universal family.

Proof idea: the inner part will spread things out nicely in the range 0..p-1.
In particular, given x != y, (1) x is equally likely to go anywhere, and
(2) given where x goes, y is equally likely to go to each of the p-1 places x
didn't go (there won't be any collisions at the mod p level). Then this allows
us to show things will be nice when we take the results mod M.

Actual proof: Fixing x != y and two locations r and s, let's look at:

   Pr[ (ax + b) = r (mod p)  AND  (ay + b) = s (mod p) ].

 -> This means that ax - ay = r - s (mod p), so a = (r-s)/(x-y) mod p, which
    has exactly one solution (mod p) since Z_p is a field. [We needed p >= U
    to ensure that x != y mod p.]
 -> If r = s, then this means we need a = 0, which isn't allowed. So Pr = 0.
 -> If r != s, then there is a 1/(p-1) chance that a has the right value.
Given this value of a, we need b = r - ax (mod p), and there is a 1/p chance
that b gets this value, so the overall probability is 1/[p(p-1)].

Now, the probability that x and y collide equals 1/[p(p-1)] times the number
of pairs r != s in {0,...,p-1} such that r = s (mod M). We have p choices for
r, and then at most ceiling(p/M) - 1 choices for s (the "-1" is because we
disallow s = r). The product is at most p(p-1)/M.

Putting this all together,

   Pr[ (a*x + b mod p) mod M = (a*y + b mod p) mod M ]
      <= [p(p-1)/M] * [1/(p(p-1))] = 1/M.                              QED
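As a sanity check on this family, here is a short Python sketch that draws
random (a, b) pairs and empirically estimates Pr[h(x) = h(y)] for one fixed
pair of keys; the values of U, p, M, and the test keys are my own illustrative
choices, not from the notes.

  # Sketch: empirical check of h_{a,b}(x) = ((a*x + b) mod p) mod M being
  # universal. U, p, M, and the test keys are illustrative choices.
  import random

  U = 2**16
  p = 65537          # a prime >= U
  M = 97             # table size

  def random_hash():
      a = random.randrange(1, p)      # a drawn from {1, ..., p-1}
      b = random.randrange(0, p)      # b drawn from {0, ..., p-1}
      return lambda x: ((a * x + b) % p) % M

  x, y = 12345, 54321                 # any fixed pair x != y in {0, ..., U-1}
  trials = 200_000
  collisions = 0
  for _ in range(trials):
      h = random_hash()
      collisions += (h(x) == h(y))
  # The empirical collision rate should be at most about 1/M.
  print(collisions / trials, "vs. the bound 1/M =", 1 / M)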