As a programmer, I use hash
tables of varying kinds
pretty much all day, every day. But one of the odd and embarrassing
parts of being a community-trained programmer is that I've never
actually implemented one. Eek! Well, today I pulled an algorithms
book off the shelf
and decided to see how long it would take me to implement their simplest
example in Guile. It turns out that it takes less than 25 lines of code
to implement a basic hash table with O(1) best time, O(1) average time,
and O(n) worst case time. The worst case won't be too common depending
on how we size things so this isn't so bad, but we'll get into that as
we go along.
Here's the code:
(use-modules (srfi srfi-1))
(define (make-dumbhash size)
"Make a dumb hash table: an array of buckets"
(make-array '() size))
(define* (dumbhash-ref dumbhash key #:optional (default #f))
"Pull a value out of a dumbhash"
(let* ((hashed-key (hash key (array-length dumbhash)))
(bucket (array-ref dumbhash hashed-key)))
(or (find (lambda (x) (equal? (car x) key))
bucket)
default)))
(define (dumbhash-set! dumbhash key val)
"Set a value in a dumbhash"
(let* ((hashed-key (hash key (array-length dumbhash)))
(bucket (array-ref dumbhash hashed-key)))
(if (not (find (lambda (x) (equal? (car x) key))
bucket))
(array-set! dumbhash
(cons (cons key val) bucket)
hashed-key))))
You might even notice that some of these lines are shared between
[dumbhash-ref]{.title-ref} and [dumbhash-set!]{.title-ref}, so this
could be even shorter. As-is, sans comments and docstrings, it's a mere
17 lines. That's nothing.
We also cheated a little: we're using [hash]{.title-ref} and
[equal?]{.title-ref} to generate a hash and to test for equality, which
are arguably the hard parts of the job. But these are provided by Guile,
and it's one less thing to worry about. Here's a brief demonstration
though:
(equal? 'a 'a) ;; => #t, or true
(equal? 'a 'b) ;; => #f, or false
(equal? "same" "same") ;; => #t
(equal? "same" "different") ;; => #f
(hash "foo" 10) ;; => 6
(hash 'bar 10) ;; => 5
[equal?]{.title-ref} is self-explanatory. The important thing to know
about [hash]{.title-ref} is that it'll pick a hash value for a
[key]{.title-ref} (the first parameter) for a hash table of some
[size]{.title-ref} (the second parameter).
So let's jump into an example. [make-dumbhash]{.title-ref} is pretty
simple. It just creates an array of whatever [size]{.title-ref} we pass
into it. Let's make a simple hash now:
scheme@(guile-user)> (define our-hash (make-dumbhash 8))
scheme@(guile-user)> our-hash
$39 = #(() () () () () () () ())
This literally made an array of 8 items which easy start out with the
empty list as its value (that's [nil]{.title-ref} for you common
lispers). (You can ignore the [$39]{.title-ref} part, which may be
different when you try this; Guile's REPL lets you refer to previous
results at your prompt by number for fast & experimental hacking.)
So our implementation of hash tables is of fixed size, which doesn't
limit the number of items we put into it, since buckets can contain
multiple values in case of collision (and collisions tend to happen a
lot in hash tables, and we come prepared for that), but this does mean
we have an existing guess of about how many buckets we need for
efficiency. (Resizing hash tables is left as an exercise for the
reader.) Our hash table also uses simple linked lists for its buckets,
which isn't too uncommon as it turns out.
Let's put something in the hash table. Animal noises are fun, so:
scheme@(guile-user)> (dumbhash-set! our-hash 'monkey 'ooh-ooh)
scheme@(guile-user)> our-hash
$40 = #(() () () ((monkey . ooh-ooh)) () () () ())
The monkey was appended to the third bucket. This makes sense, because
the hash of [monkey]{.title-ref} for size 8 is 3:
scheme@(guile-user)> (hash 'monkey 8)
$41 = 3
We can get back the monkey:
scheme@(guile-user)> (dumbhash-ref our-hash 'monkey)
$42 = (monkey . ooh-ooh)
We've set this up so that it returns a pair when we get a result, but if
we try to access something that's not there, we get #f instead of a
pair, unless we set a default value:
scheme@(guile-user)> (dumbhash-ref our-hash 'chameleon)
$43 = #f
scheme@(guile-user)> (dumbhash-ref our-hash 'chameleon 'not-here-yo)
$44 = not-here-yo
So let's try adding some more things to `our-hash`:
scheme@(guile-user)> (dumbhash-set! our-hash 'cat 'meow)
scheme@(guile-user)> (dumbhash-set! our-hash 'dog 'woof)
scheme@(guile-user)> (dumbhash-set! our-hash 'rat 'squeak)
scheme@(guile-user)> (dumbhash-set! our-hash 'horse 'neigh)
scheme@(guile-user)> ,pp our-hash
$45 = #(()
((horse . neigh))
()
((rat . squeak) (monkey . ooh-ooh))
((cat . meow))
()
((dog . woof))
())
([,pp]{.title-ref} is a shortcut to [pretty-print]{.title-ref} something
at the REPL, and I've taken the liberty of doing some extra alignment of
its output for clarity.)
So we can see we have a collision in here, but it's no problem. Both
[rat]{.title-ref} and [monkey]{.title-ref} are in the same bucket, but
when we do a lookup of a hashtable in our implementation, we get a list
back, and we search to see if that's in there.
We can figure out why this is O(1) average / best time, but O(n) worst
time. Assume we made a hash table of the same size as the number of
items we put in... assuming our [hash]{.title-ref} procedure gives
pretty good distribution, most of these things will end up in an empty
bucket, and if they end up colliding with another item (as the rat and
monkey did), no big deal, they're in a list. Even though linked lists
are of O(n) complexity to traverse, assuming a properly sized hash
table, most buckets don't contain any or many items. There's no
guarantee of this though... it's entirely possible that we could have
a table where all the entries end up in the same bucket. Luckily, given
a reasonably sized hash table, this is unlikely. Of course, if we ended
up making a hash table that started out with 8 buckets, and then we
added 88 entries... collisions are guaranteed in that case. But I
already said resizing hash tables is an exercise for the reader. :)
If you're familiar enough with any Scheme (or probably any other Lisp),
reading [dumbhash-ref]{.title-ref} and [dumbhash-set!]{.title-ref}
should be pretty self-explanatory. If not, go read an introductory
Scheme tutorial, and come back! (Relatedly, I think there aren't many
good introductory Guile tutorials... I have some
ideas
though!
What lessons are there to be learned from this post? One might be that
Guile is a pretty dang nice hacking environment, which is true! Another
might be that it's amazing how far I've gotten in my career without ever
writing a hash table, which is also true! But the lesson I'd actually
like to convey is: most of these topics are not as unapproachable as
they seem. I had a long-time fear that I would never understand such
code until I took the time to actually sit down and attempt to write it.
As an additional exercise for the reader, here's a puzzle: is the [Any
Free License]{.title-ref} this code released under actually a free
license? And what commentary, if any, might the author be making? :)