HASHING: A great practical tool, with interesting & subtle theory too.

Basic setting: We have a set of items we want to store for efficient lookup.
Two versions:
 - STATIC: Given a set S of items, store them for efficient lookup.
 - DYNAMIC: The data structure processes a sequence of insert, lookup, and
   delete requests. We want to do all of these efficiently.

In the static case, we can use a sorted array with binary search for lookups.
In the dynamic case, we can use a balanced search tree. Both of those schemes
have O(log n) complexity. Hashing gives an alternative approach that is often
the fastest and most convenient way of solving these problems.

Applications: An AI-search program may want to "cache" subproblems or
situations (board positions or elements of state-space) that it has already
solved, so that it does not redo the same computation when it encounters them
again. There are many other uses in cryptography, networks, and complexity
theory.

FORMAL SETUP
------------
 - The keys come from a large universe U (e.g., all strings of at most 50
   characters).
 - The INSTANCE S is a subset of U (static or dynamic), with N = |S|. These
   are the keys a particular run of the program must process.
 - The data structure is an array (table) A of size M, and a HASH FUNCTION
   h: U -> {0,...,M-1}, which maps each key to a location in the table.
   Given an element x, we store it in location A[h(x)].

   Note: If U were small (like 3-character strings), we could just store x in
   A[x], as in bucket sort. But U is big (e.g., 50-character strings, or
   32-bit numbers); that's why we need the hash function.

 - The hashing may produce collisions---two keys x and y may map to the same
   location: h(x) = h(y). There are several ways to deal with such collisions.
   The simplest one is called "separate chaining": make each entry in A a
   linked list, which stores all the keys that map to that location.

DESIRABLE PROPERTIES:
 - The hash table size should be small, e.g. O(N).
 - The hash function should be fast and easy to compute.

A SIMPLE IDEA: Suppose N = 100; that is, we expect to deal with at most 100
keys. We create a table of size M = 100 and use the MOD function:
h(x) = x mod 100. This has both of the desired properties: small size and
fast hashing. An example set of keys: 91, 2048, 65000, 547356, 329, etc...

What can go wrong? What if all keys have the same last two digits, e.g. 00?
Then all keys map to the same location, and the hash table becomes an
unordered linked list, with O(N) cost for lookup!!!

This motivates a 3rd desirable property:
 - The hash function should spread keys out in the table. Ideally all buckets
   (lists) should be about the same size.

Why not a RANDOM NUMBER? One idea would be to use randomization: map each key
to a random location in the array. This would help even out the load, but how
do we do a FIND? We would need to know, for each key, what random number was
used to map it! It turns out that pseudo-randomness does help. Before we come
to that, let's first consider some BAD NEWS: the worst-case behavior is bad!

CLAIM: For ANY hash function h, if |U| >= N*M, there exists a set S of size N
where all keys hash to the SAME location.

PROOF: By the pigeonhole principle. Map all keys of U using h. At least one
table location must receive at least N keys; use N of those keys as the set S.

This is partly why hashing seems so mysterious -- how can you claim hashing is
good if for any hash function you can come up with ways of foiling it? One
answer is that there are a lot of simple hash functions that work well in
practice.
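Before moving on to the more formal answers, here is a tiny Python sketch of
the mod-100 failure mode discussed above; the key values are made up for
illustration.

  # Tiny sketch: how h(x) = x mod 100 behaves on two (made-up) key sets.
  from collections import Counter

  M = 100

  def h(x):
      return x % M

  mixed_keys = [91, 2048, 65000, 547356, 329]       # varied last two digits
  all_end_00 = [100, 4200, 73900, 551000, 980000]   # every key ends in 00

  print(Counter(h(x) for x in mixed_keys))   # keys spread over several buckets
  print(Counter(h(x) for x in all_end_00))   # all five keys pile into bucket 0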
There is a formal concept of Universal Hashing and Perfect Hashing, which we
will briefly touch on later. For now, let's claim that hash functions of the
type h(x) = (ax + b) mod p, where p is a prime number larger than N, work
very well in practice, and there is good theoretical justification for that.

Hashing with Separate Chaining:
==============================
Each entry of the hash table has a linked list, which stores all keys mapped
to that position. We use hash functions of the form h(x) = x mod p for
simplicity.

Given a key x, search the linked list at position A[h(x)]. When inserting a
new element x, add it to the head of the list at A[h(x)]. Delete is done by
first searching for the element and then removing it from the list.

Example: x mod 11

 0:
 1: 23 -> 1 -> 56
 2: 24
 3: 36 -> 14
 4:
 5: 16
 6: 17
 7: 7 -> 29
 8:
 9: 31 -> 20 -> 42
10:

Insert 53. 53 mod 11 = 9, so the key 53 is added at location 9, changing that
list to:

 9: 53 -> 31 -> 20 -> 42

ANALYSIS:

Worst case: All keys hash to the same bucket. Insert takes O(1), but delete
and find take O(N).

Average case: If keys are uniformly distributed, then all buckets are about
the same size. Each linked list has expected size N/M. Thus insert is still
O(1), but delete and find become O(1 + N/M). N/M is called the load factor.

--------------------------------------
OPEN ADDRESSING:
================
This is another method for dealing with collisions. Rather than allocate new
memory dynamically, here we use the table itself. Ideally, we would have a
table at least a constant factor bigger than N, so we expect many empty cells.
When the location A[h(x)] is occupied, we use some fixed strategy to look for
the NEXT available open cell.

The simplest strategy is Linear Probing: sequentially try the next position
until an open cell is found.

Example: h(x) = x mod 11
(In this example, 31 was inserted before 9, and 42 was inserted last.)

 0: 42
 1: 1
 2: 24
 3: 14
 4:
 5: 16
 6: 28
 7: 7
 8:
 9: 31
10: 9

Linear Probing: Insert 12

12 mod 11 = 1. Because slots 1, 2, 3 are occupied, the next available slot is
4, so put 12 in A[4].

 0: 42
 1: 1
 2: 24
 3: 14
 4: 12
 5: 16
 6: 28
 7: 7
 8:
 9: 31
10: 9

Linear Probing: Search 15

15 mod 11 = 4. We start at A[4] and continue sequentially until we either find
15 or reach an OPEN SLOT. In this case, we reach the open slot A[8] and report
NOT FOUND.

CODE for Search with Linear Probing:

  hSearch(k) {
      hashVal = k % TableSize;
      j = hashVal;
      do {
          if (empty[j] || A[j] == k) return j;
          j = (j + 1) % TableSize;
      } while (j != hashVal);
      return j;
  }

  find(k, e) {
      b = hSearch(k);
      if (empty[b] || A[b] != k) return false;
      e = A[b];
      return true;
  }

DELETION in Hashing with Linear Probing:

Because empty slots are also used to terminate searches, we cannot simply
delete a key; doing so may invalidate other searches. Instead, we mark the key
as DELETED but do not remove it, and we modify the search so that it simply
goes past deleted keys.

Advantage: easy and correct.
Disadvantage: the table can fill up with dead items.

All these operations take O(1) plus the length of the probe sequence.

Deletion with Linear Probing (Lazy Deletion): Delete 9

9 mod 11 = 9

 0: 42
 1: 1
 2: 24
 3: 14
 4: 12
 5: 16
 6: 28
 7: 7
 8:
 9: 31
10: 9   <-- Found

Change it to:

10: D   <-- Deleted; otherwise a search for 42 could fail

--------------------------------------------------------
Review:

 - Hash function h: U -> {0, 1, ..., M-1}
 - Collision resolution:
     Separate chaining
     Open addressing
       linear probing
       lazy deletion
 - On average, if M = O(N), average search is O(1).
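To make the probing and lazy deletion just reviewed concrete, here is a small
runnable sketch in Python mirroring the hSearch/find pseudocode and the mod-11
example above; the class and its method names are illustrative choices, not
from the notes.

  # Sketch of open addressing with linear probing and lazy deletion.
  # Table size 11 and h(x) = x mod 11, as in the examples above; class and
  # method names are illustrative, not from the notes.
  EMPTY, DELETED = object(), object()

  class LinearProbeTable:
      def __init__(self, size=11):
          self.size = size
          self.slots = [EMPTY] * size

      def _probe(self, key):
          # Walk from h(key) until we hit the key or an EMPTY slot (tombstones
          # marked DELETED are walked past, which keeps old searches valid).
          j = key % self.size
          for _ in range(self.size):
              if self.slots[j] is EMPTY or self.slots[j] == key:
                  return j
              j = (j + 1) % self.size
          return None                      # wrapped all the way around

      def insert(self, key):
          j = key % self.size
          for _ in range(self.size):
              if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                  self.slots[j] = key      # reuse tombstones on insert
                  return
              j = (j + 1) % self.size
          raise RuntimeError("table is full")

      def find(self, key):
          j = self._probe(key)
          return j is not None and self.slots[j] == key

      def delete(self, key):
          j = self._probe(key)
          if j is not None and self.slots[j] == key:
              self.slots[j] = DELETED      # lazy deletion: leave a tombstone

  t = LinearProbeTable()
  for k in [1, 24, 14, 16, 28, 7, 31, 9, 42]:
      t.insert(k)                  # reproduces the table shown above
  t.insert(12)                     # 12 mod 11 = 1; probes 1,2,3, lands in slot 4
  t.delete(9)                      # slot 10 becomes a tombstone, not EMPTY
  print(t.find(42))                # True: the search walks past the tombstone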
-------------------------------------------------------
Eager deletion is possible but expensive: we need to move data around to make
sure the deletion hole does not corrupt valid searches.

QUADRATIC PROBING:

Open addressing with linear probing can have bad search behavior due to
clustered data: a few big blocks of the table get filled. Quadratic probing
attempts to circumvent this by jumping non-linearly: on a collision, probe
h(x) + i^2 for i = 1, 2, 3, ... (all offsets taken mod the table size).

Start at h(x).
If the key is not there, search h(x) + 1.
If the key is not there, search h(x) + 4.
If the key is not there, search h(x) + 9.
If the key is not there, search h(x) + 16, etc.

Example: Insert 12. 12 mod 11 = 1.

 0: 42
 1: 1    <- h(x)
 2: 24   <- h(x) + 1
 3: 14
 4:      <- h(x) + 25   INSERT 12 here
 5: 16   <- h(x) + 4
 6: 28   <- h(x) + 16
 7: 7
 8:
 9: 31
10: 9    <- h(x) + 9

--------------------------------------
DOUBLE HASHING:
==============
Another idea is to use a second hash function to deal with collisions.
Suppose R is the largest prime smaller than the table size. Then consider the
hash function

   h'(x) = R - (x mod R)

 * Why not simply h'(x) = (x mod R)?

INSERTING 12: h'(12) = 7 - (12 mod 7) = 7 - 5 = 2.

Check A[h(12)] = A[1].
If collision, check h(x) + h'(x).
If collision, check h(x) + 2*h'(x).
If collision, check h(x) + 3*h'(x).
I.e., check h(x) + i*h'(x) (mod the table size).

Example:

 0: 42   <- h(x) + 10
 1: 1    <- h(x)
 2: 24   <- h(x) + 12
 3: 14   <- h(x) + 2
 4:      <- h(x) + 14   INSERT 12 here
 5: 16   <- h(x) + 4
 6: 28
 7: 7    <- h(x) + 6
 8:
 9: 31   <- h(x) + 8
10: 9

------------------------------------------------------
Practice of hashing: REHASHING.

 o. If the table gets too full, operations take too long.
 o. Build another table, TWICE as large (use a prime number as the size), and
    rehash everything into it.
 o. Common advice is to rebuild when the table becomes 70% full.

------------------------------------------------------
A Little THEORY OF HASHING:

Recall the bad news:

CLAIM: For any hash function h, if |U| >= (N-1)*M + 1, there exists a set S of
size N where all keys hash to the same location.

PROOF: By the pigeonhole principle.

An idea: Use randomization, as in randomized quicksort. We will use some
random numbers in our construction of h; however, h itself will be a
deterministic function. We will show that for *any* set S, if we pick h in
this random way, things will be good in expectation. This is the idea of
UNIVERSAL hashing. Later we will use universal hashing to get PERFECT HASHING!

UNIVERSAL HASHING
=================
A SET of hash functions H is *universal* if, for a hash function h chosen
uniformly at random from H,

   Pr[ h(x) = h(y) ] <= 1/M,  for all x != y in U.

THEOREM: If H is universal, then for any set S in U, for any x in S, the
*expected* number of collisions x has [for a random h in H] is <= (N-1)/M.

PROOF: Each y != x has <= 1/M chance of colliding with x. Adding up over all
y in S, the expected number is <= (N-1)/M. Or, more formally:
 - Let C(x,y) = 1 if x and y collide, and 0 otherwise.
   Then E[C(x,y)] = Pr(x and y collide) <= 1/M.
 - C(x) = total # of collisions for x = sum_y C(x,y).
 - E[C(x)] = sum_y E[C(x,y)] <= (N-1)/M.

COROLLARY: For any *sequence* of L lookups, the expected *total* time for the
lookups, if we use a random h in H, is O(L(1 + N/M)). That is, constant time
per operation if M = N.

PROOF: By linearity of expectation, the expected total time is just the sum of
the expected times for each lookup. Lookup time is proportional to the length
of the list, which is 1 + (# of collisions).

QUESTION: Can we actually construct a universal hash family? If not, this is
all pretty vacuous. The answer is YES.
We are not actually going to explicitly construct all of H and then pick one
h in H at random. Instead, we'll have a randomized procedure for constructing
h. In fact, it probably would have been more intuitive to define H as a
*probability distribution* over hash functions, and say H is a universal
probability distribution over hash functions if [rest of defn here]. But
``universal sets'' is the way it's done in the literature. The analogy is
randomized quicksort: we can think of it as picking an ordering on pivots at
random from the set of all n! orderings, but instead we find it more natural
to just say: each time you need to pick a pivot, pick one at random.

HOW TO CONSTRUCT: Matrix method
---------------------------------
Suppose the keys are u bits long, and suppose the hash table size is a power
of 2; that is, M = 2^b, so an index is b bits long. We choose the hash
function h to be a random 0/1 matrix of dimensions b-by-u. Then h(x) is the
product of this matrix with the key x (viewed as a u-bit column vector), with
arithmetic modulo 2.

For instance, consider a table of size 8 (3-bit indices) and 4-bit keys:

      h          x        h(x)
  ---------      -        ----
  [1 0 0 0]      1          1
  [0 1 1 1]  x   0    =     1
  [1 1 1 0]      1          0
                 0

CLAIM: For x != y, Pr[h(x) = h(y)] = 1/M = 1/2^b.

PROOF: First of all, what does it mean to multiply h by x? We can think of it
as adding some of the columns of h (mod 2), where the 1 bits in x tell us
which ones to add. (We added the 1st and 3rd columns of h above.)

Now, take arbitrary x and y with x != y. They must differ in some bit
position, say the ith coordinate; for concreteness, assume x_i = 0 while
y_i = 1. Imagine we first choose all of h except the ith column. Since
x_i = 0, the value h(x) is already fixed; it does not depend on the ith
column. But each of the 2^b different settings of the ith column gives a
different value of h(y) [every time we flip a bit in that column, we flip the
corresponding bit of h(y)]. So there is exactly a 1/2^b chance that
h(x) = h(y).

There are other methods based on multiplication modulo primes too.

=====================================================================
Question: If we fix the set S, can we find a hash function h such that *all*
lookups are constant-time? Yes. This leads to...

PERFECT HASHING
===============
A hash function is "perfect" for S if all lookups take O(1) time.

METHOD #1
---------
Suppose we are willing to have a table whose size is quadratic in the size N
of our dictionary S. Then here is an easy method: let H be universal and set
M = N^2. Pick a random hash function h from H and hash everything in S.

CLAIM: Pr[no collisions] >= 1/2. [So we just try an h, and if we get any
collisions, we try a new h.]

PROOF:
 - How many pairs (x,y) in S are there? {N \choose 2}.
 - For each pair, the chance they collide is <= 1/M, by definition of
   universal.
 - So Pr[exists a collision] <= {N \choose 2}/M < 1/2.

This is like the "other side" of the birthday paradox: if the number of days
is a lot *more* than (# people)^2, then there is a reasonable chance that *no*
pair has the same birthday.

BUT, what if we want to use just O(N) space?

METHOD #2:
---------
HISTORY: This was a BIG open problem for a long time -- posed as the question
"Should Tables be Sorted?". With linear space, we can get O(log N) search
using sorted tables. Is it possible to achieve O(1) worst-case search time?
There were many increasingly complicated schemes, with search times
increasingly close to O(1) and increasing generality. The problem was finally
solved using the nice idea of universal hash functions in a 2-level scheme.

PROPOSAL: Hash into a table of size N. We will get some collisions.
Then, for each bin, rehash it using Method #1, squaring the size of the bin
to get zero collisions. E.g., if the 1st hash function maps {a,b,c,d,e,f} to
[--, {a}, {b,c}, {d,e,f}], then the final result would look something like
[--, h_2, h_3, h_4], where h_2 is just the function h_2(x) = 0, h_3 is a
function into a table of size 4, and h_4 is a function into a table of size 9.

The point is: using Method #1, we can find an h_i with no collisions by
picking from a universal H and trying it -- if it doesn't work, we try again
(each time, Pr[success] >= 1/2).

By *design* this is O(1) time for lookups, since we just compute two hash
functions. The question is: how much space does it use? The space used is
O(N) for the 1st table (assume it takes a constant amount of space to store
each hash function), plus O(sum of |B_i|^2) for the other tables (B_i is the
ith bucket). So all we need to do is prove:

THEOREM: If the initial h is chosen from the universal set H, then
Pr[ sum_i |B_i|^2 > 4*N ] < 1/2.

PROOF: We'll prove this by showing that E[ sum_i |B_i|^2 ] < 2*N. (Why does
this give us the result we need? <-- Markov's inequality.)

The neat trick: we count this quantity by counting the number of ordered
pairs that collide, including each element colliding with itself. E.g., if a
bucket has {d,e,f}, then d collides with each of d,e,f; e collides with each
of d,e,f; f collides with each of d,e,f; so we get 9.

Now,
   sum_i |B_i|^2 = # of ordered pairs that collide (allowing x = y)
                 = sum_x sum_y C(x,y)   [C(x,y) = 1 if in same bucket, else 0]
So,
   E[ sum_i |B_i|^2 ] = sum_x sum_y Pr[x and y are in the same bucket]
                      <= N + N(N-1) * (1/M)   [the "1/M" comes from the
                                               definition of universal]
                      < 2*N                   (since we set M = N).

So the final "hash function" is:
 1. Compute i = h(x).
 2. Compute h_i(x) and use it to look up x in table T_i.
The total space used is O(N) + O(sum_i |B_i|^2) = O(N).

============================================================================
Other uses of hashing
=====================
Suppose we have a long sequence of items and we want to see how many
*different* items are in the list. What's a good way of doing that?

One way: Create a hash table; do a lookup and then an insert if the item is
not in the table; count the number of inserts.

Now, what if the list is really huge, so we don't have space to store them
all, but we are OK with just an approximate answer? E.g., imagine we're a
router watching a lot of packets go by, and we want to see (roughly) how many
different source IP addresses there are.

NEAT IDEA: Say we have a hash function h that behaves like a random function,
and let's think of h(x) as a real number between 0 and 1. One thing we can do
is just keep track of the *minimum* hash value produced so far (so we won't
have a table at all). E.g., if the keys are 3,10,3,3,12,10,12 and h(3)=0.4,
h(10)=0.2, h(12)=0.7, then we get 0.2.

The point is: if we pick N random numbers in [0,1], the expected value of the
minimum is 1/(N+1). And there's a good chance it is fairly close (we can
improve the estimate by running several hash functions and taking the median
of the mins). So from the observed MIN we can estimate the number of distinct
elements as roughly (1/MIN) - 1.

Question: Why use a hash function rather than just picking a random number
each time? Because we care about the number of *different* items, not just
the number of items (that problem is a lot easier: just keep a counter...).
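Here is a small runnable sketch in Python of the minimum-hash estimator just
described. The helper names and the choice of SHA-256 as the "random-looking"
hash are my own stand-ins, not something the notes prescribe.

  # Sketch: approximate distinct counting by tracking the minimum hash value.
  # h01 maps a key to a pseudo-random real in [0,1); the salt lets us run
  # several independent hash functions and take the median of their estimates.
  import hashlib
  import random
  from statistics import median

  def h01(salt, x):
      digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
      return int.from_bytes(digest[:8], "big") / 2**64

  def approx_distinct(stream, num_hashes=25):
      salts = [random.getrandbits(64) for _ in range(num_hashes)]
      mins = [1.0] * num_hashes
      for x in stream:
          for i, s in enumerate(salts):
              mins[i] = min(mins[i], h01(s, x))
      # E[min of N uniform values] = 1/(N+1), so each min m suggests about
      # 1/m - 1 distinct elements; the median over independent hash functions
      # reduces the variance.
      return median(1.0 / m - 1.0 for m in mins)

  stream = [random.randrange(10**9) for _ in range(5000)] * 3  # ~5000 distinct
  print(approx_distinct(stream))   # should land in the ballpark of 5000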
=======================================================
Topics:
 * search trees vs hashing,
 * augmenting search tree data structures (if we didn't do it last time)
 * another way of getting universal hashing [if time]
==========================================================================
Search trees versus hashing:
---------------------------
Hashing is very convenient if you just need to do inserts and lookups, but
what are some operations that are better suited to search trees?

 - lookup-by-prefix. E.g., type in the first few letters of someone's name and
   have it output all names in the database that begin with those letters.
 - find-in-range[key1, key2] (output all keys between key1 and key2).
 - a version of lookup(x) that returns the closest key to x when x is not in
   the database.

Now, what if we want to do the following operations:
 - output the kth smallest element (let's assume all keys are distinct),
 - do a version of lookup(x) that tells us the rank of x (how many keys are
   smaller than x)?

If we had a sorted array, these would be easy. To output the kth smallest
element, we just output A[k]. To get the rank of x, we do binary search to
find it and then output which location it's in. So this brings us to the
question: how could we do this with search trees?

Auxiliary information in search trees:
=====================================
By adding extra information to the nodes of a binary search tree, it's often
possible to dramatically expand what can be done. However, the data you add
has to have two properties:
 (1) it's sufficient to solve the problem you want to solve;
 (2) the values of the data in a node can be maintained efficiently.

So here would be a *bad* way of solving the problem: have each node store the
rank of its key. Why is this bad? Because if you insert a new key (that, say,
is smaller than everything else), you may have to update *everyone's* rank.
So we don't have property (2).

What's a better way? One thing we can do is store, in each node, the size of
the subtree rooted at that node.
 - How do we use this to solve the problems?
 - How do we maintain these sizes when we do an insertion or a tree rotation?

===================================================
Here's another method of doing universal hashing that requires less
computation than the one in lecture, and is closer to what you likely coded
up in 15-211.

The hashing scheme: Let's assume keys are in {0,1,...,U-1}, and we are hashing
down to {0,1,...,M-1}. Pick a prime p >= U (or think of U as prime). Define

   h_{a,b}(x) = ( (a*x + b) mod p ) mod M.

H: Pick a at random from 1,...,p-1. Pick b at random from 0,...,p-1.

CLAIM: H is a 2-universal family.

Proof idea: the inner part will spread things out nicely in the range 0..p-1.
In particular, given x != y, (1) x is equally likely to go anywhere, and
(2) given where x goes, y is equally likely to go to each of the p-1 places x
didn't go (there won't be any collisions at the mod p level). Then this allows
us to show things will be nice when we take the results mod M.

Actual proof: Fixing x != y and two locations r and s, let's look at:

   Pr[ (ax + b) = r (mod p)  AND  (ay + b) = s (mod p) ].

 -> This means that ax - ay = r - s (mod p), so a = (r-s)/(x-y) mod p, which
    has exactly one solution (mod p) since Z_p is a field. [We needed p >= U
    to ensure that x != y mod p.]
 -> If r = s, then this means we need a = 0, which isn't allowed. So Pr = 0.
 -> If r != s, then there is a 1/(p-1) chance that a has the right value.
Given this value of a, we need b = r - ax (mod p), and there is a 1/p chance
that b gets this value, so the overall probability is 1/[p(p-1)].

Now, the probability that x and y collide equals 1/[p(p-1)] times the number
of pairs r != s in {0,...,p-1} such that r = s (mod M). We have p choices for
r, and then at most ceiling(p/M) - 1 choices for s (the "-1" is because we
disallow s = r). The product is at most p(p-1)/M.

Putting this all together,

   Pr[ (a*x + b mod p) mod M = (a*y + b mod p) mod M ]
      <= [p(p-1)/M] * [1/(p(p-1))] = 1/M.                              QED
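As a sanity check on this family, here is a short Python sketch that draws
random (a, b) pairs and empirically estimates Pr[h(x) = h(y)] for one fixed
pair of keys; the values of U, p, M, and the test keys are my own illustrative
choices, not from the notes.

  # Sketch: empirical check of h_{a,b}(x) = ((a*x + b) mod p) mod M being
  # universal. U, p, M, and the test keys are illustrative choices.
  import random

  U = 2**16
  p = 65537          # a prime >= U
  M = 97             # table size

  def random_hash():
      a = random.randrange(1, p)      # a drawn from {1, ..., p-1}
      b = random.randrange(0, p)      # b drawn from {0, ..., p-1}
      return lambda x: ((a * x + b) % p) % M

  x, y = 12345, 54321                 # any fixed pair x != y in {0, ..., U-1}
  trials = 200_000
  collisions = 0
  for _ in range(trials):
      h = random_hash()
      collisions += (h(x) == h(y))
  # The empirical collision rate should be at most about 1/M.
  print(collisions / trials, "vs. the bound 1/M =", 1 / M)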