Archives

Tags

Posts with tag "code"

Hash tables are easy (in Guile)

By Christopher Allan Webber on Mon 09 November 2015

As a programmer, I use hash tables of varying kinds pretty much all day, every day. But one of the odd and embarrassing parts of being a community-trained programmer is that I've never actually implemented one. Eek! Well, today I pulled an algorithms book off the shelf and decided to see how long it would take me to implement their simplest example in Guile. It turns out that it takes less than 25 lines of code to implement a basic hash table with O(1) best time, O(1) average time, and O(n) worst case time. The worst case won't be too common depending on how we size things so this isn't so bad, but we'll get into that as we go along.

Here's the code:

;;; Simple hash table implementation -- (C) 2015 Christopher Allan Webber
;;; Released under the "Any Free License 2015-11-05", whose terms are the following:
;;;   This code is released under any of the free software licenses listed on
;;;     https://www.gnu.org/licenses/license-list.html
;;;   which for archival purposes is
;;;     https://web.archive.org/web/20151105070140/http://www.gnu.org/licenses/license-list.html

(use-modules (srfi srfi-1))

(define (make-dumbhash size)
  "Make a dumb hash table: an array of buckets"
  (make-array '() size))

(define* (dumbhash-ref dumbhash key #:optional (default #f))
  "Pull a value out of a dumbhash"
  (let* ((hashed-key (hash key (array-length dumbhash)))
         (bucket (array-ref dumbhash hashed-key)))
    (or (find (lambda (x) (equal? (car x) key))
              bucket)
        default)))

(define (dumbhash-set! dumbhash key val)
  "Set a value in a dumbhash"
  (let* ((hashed-key (hash key (array-length dumbhash)))
         (bucket (array-ref dumbhash hashed-key)))
    ;; Only act if it's not already a member
    (if (not (find (lambda (x) (equal? (car x) key))
                   bucket))
        (array-set! dumbhash
                    ;; extend the bucket with the key-val pair
                    (cons (cons key val) bucket)
                    hashed-key))))

You might even notice that some of these lines are shared between dumbhash-ref and dumbhash-set!, so this could be even shorter. As-is, sans comments and docstrings, it's a mere 17 lines. That's nothing.

We also cheated a little: we're using hash and equal? to generate a hash and to test for equality, which are arguably the hard parts of the job. But these are provided by Guile, and it's one less thing to worry about. Here's a brief demonstration though:

(equal? 'a 'a)               ;; => #t, or true
(equal? 'a 'b)               ;; => #f, or false
(equal? "same" "same")       ;; => #t
(equal? "same" "different")  ;; => #f
(hash "foo" 10)              ;; => 6
(hash 'bar 10)               ;; => 5

equal? is self-explanatory. The important thing to know about hash is that it'll pick a hash value for a key (the first parameter) for a hash table of some size (the second parameter).

So let's jump into an example. make-dumbhash is pretty simple. It just creates an array of whatever size we pass into it. Let's make a simple hash now:

scheme@(guile-user)> (define our-hash (make-dumbhash 8))
scheme@(guile-user)> our-hash
$39 = #(() () () () () () () ())

This literally made an array of 8 items which easy start out with the empty list as its value (that's nil for you common lispers). (You can ignore the $39 part, which may be different when you try this; Guile's REPL lets you refer to previous results at your prompt by number for fast & experimental hacking.)

So our implementation of hash tables is of fixed size, which doesn't limit the number of items we put into it, since buckets can contain multiple values in case of collision (and collisions tend to happen a lot in hash tables, and we come prepared for that), but this does mean we have an existing guess of about how many buckets we need for efficiency. (Resizing hash tables is left as an exercise for the reader.) Our hash table also uses simple linked lists for its buckets, which isn't too uncommon as it turns out.

Let's put something in the hash table. Animal noises are fun, so:

scheme@(guile-user)> (dumbhash-set! our-hash 'monkey 'ooh-ooh)
scheme@(guile-user)> our-hash
$40 = #(() () () ((monkey . ooh-ooh)) () () () ())

The monkey was appended to the third bucket. This makes sense, because the hash of monkey for size 8 is 3:

scheme@(guile-user)> (hash 'monkey 8)
$41 = 3

We can get back the monkey:

scheme@(guile-user)> (dumbhash-ref our-hash 'monkey)
$42 = (monkey . ooh-ooh)

We've set this up so that it returns a pair when we get a result, but if we try to access something that's not there, we get #f instead of a pair, unless we set a default value:

scheme@(guile-user)> (dumbhash-ref our-hash 'chameleon)
$43 = #f
scheme@(guile-user)> (dumbhash-ref our-hash 'chameleon 'not-here-yo)
$44 = not-here-yo

So let's try adding some more things to our-hash:

scheme@(guile-user)> (dumbhash-set! our-hash 'cat 'meow)
scheme@(guile-user)> (dumbhash-set! our-hash 'dog 'woof)
scheme@(guile-user)> (dumbhash-set! our-hash 'rat 'squeak)
scheme@(guile-user)> (dumbhash-set! our-hash 'horse 'neigh)
scheme@(guile-user)> ,pp our-hash
$45 = #(()
        ((horse . neigh))
        ()
        ((rat . squeak) (monkey . ooh-ooh))
        ((cat . meow))
        ()
        ((dog . woof))
        ())

(,pp is a shortcut to pretty-print something at the REPL, and I've taken the liberty of doing some extra alignment of its output for clarity.)

So we can see we have a collision in here, but it's no problem. Both rat and monkey are in the same bucket, but when we do a lookup of a hashtable in our implementation, we get a list back, and we search to see if that's in there.

We can figure out why this is O(1) average / best time, but O(n) worst time. Assume we made a hash table of the same size as the number of items we put in... assuming our hash procedure gives pretty good distribution, most of these things will end up in an empty bucket, and if they end up colliding with another item (as the rat and monkey did), no big deal, they're in a list. Even though linked lists are of O(n) complexity to traverse, assuming a properly sized hash table, most buckets don't contain any or many items. There's no guarantee of this though... it's entirely possible that we could have a table where all the entries end up in the same bucket. Luckily, given a reasonably sized hash table, this is unlikely. Of course, if we ended up making a hash table that started out with 8 buckets, and then we added 88 entries... collisions are guaranteed in that case. But I already said resizing hash tables is an exercise for the reader. :)

If you're familiar enough with any Scheme (or probably any other Lisp), reading dumbhash-ref and dumbhash-set! should be pretty self-explanatory. If not, go read an introductory Scheme tutorial, and come back! (Relatedly, I think there aren't many good introductory Guile tutorials... I have some ideas though!

What lessons are there to be learned from this post? One might be that Guile is a pretty dang nice hacking environment, which is true! Another might be that it's amazing how far I've gotten in my career without ever writing a hash table, which is also true! But the lesson I'd actually like to convey is: most of these topics are not as unapproachable as they seem. I had a long-time fear that I would never understand such code until I took the time to actually sit down and attempt to write it.

As an additional exercise for the reader, here's a puzzle: is the Any Free License this code released under actually a free license? And what commentary, if any, might the author be making? :)

Could we please end the SQLite ALTER TABLE pain?

By Christopher Allan Webber on Sat 15 June 2013

I think there's nothing else in the world of programming that's given me more headaches than the lack of proper alter table support in SQLite. I'm not alone, because almost every developer I've worked with has had similar pains and complaints. How many hours of developer time have been wasted on the lack of proper SQLite alter table support? It must be in the thousands of hours. Surely this is fixable, and I'm willing to put my money where my mouth is: if someone is willing to develop SQLite alter support, I'm pre-pledging $200.00 towards fixing the problem. And I bet others would be willing to donate towards such a solution as well.

First of all, yes, there's a venting of frustration above, but it is not a venting of frustration in lack of appreciation of SQLite. SQLite is wonderful software; you could say that it was one of the biggest reasons (maybe the biggest, but of several) for my own project MediaGoblin switching from MongoDB to SQL (with SQLAlchemy, which is also wonderful software). Yes, we want people to be able to run medium to large installations of MediaGoblin (and there's PostgreSQL for that), but we also want people to be able to run smallish installations for themselves or their friends and family as well. SQLite is great for this, and it's also great for making developing super simple.

But one of the original reasons for going with Mongo was remembering how frustrating migration failures in SQL could be. When deciding to switch to SQL, I realized that we'd be moving back into this pain territory (we did do migrations with Mongo; anyone who suggests you don't need migrations with a document store database doesn't know what they're talking about and is aiming a shotgun squarely at their foot... even so, migrations were easier with Mongo). But of ALTER TABLE commands, SQLite only supports RENAME TABLE and ADD COLUMN. I had remembered also how because of this sqlite could require annoying workarounds to make migrations happen. I didn't realize though that at times doing migrations would become nearly impossible.

Since sqlite lacks most ALTER TABLE commands, most migration frameworks like South and sqlalchemy-migrate do crazy workarounds for the missing commands that usually involve renaming the table, creating an entirely new table renamed with the new schema in place, copying all the data back, and killing the old table.

If that sounds like a mess, that's because it is. In fact, the sqlalchemy-migrate project homepage suggests that for new projects to use a successor called Alembic founded by the same core author as sqlalchemy-migrate (edit: I've been corrected on this on HackerNews: "Alembic is not founded by the same core author as sqlalchemy-migrate. Alembic is founded by Mike Bayer who is the core author of SQLAlchemy itself."). But we didn't use Alembic because at the time there wasn't much support for sqlite (I think bit of support has been added since then, though I don't know how much) and we knew we really wanted it. In fact, on the Alembic homepage, this is listed as a goal:

Don't break our necks over SQLite's inability to ALTER things. SQLite has almost no support for table or column alteration, and this is likely intentional. Alembic's design is kept simple by not contorting its core API around these limitations, understanding that SQLite is simply not intended to support schema changes. While Alembic's architecture can support SQLite's workarounds, and we will support these features provided someone takes the initiative to implement and test, until the SQLite developers decide to provide a fully working version of ALTER, it's still vastly preferable to use Alembic, or any migrations tool, with databases that are designed to work under the assumption of in-place schema migrations taking place.

And indeed, there's a lot of neck-breaking involved in trying to use migrations with SQLite...

At one point I tried dropping a boolean field but discovered this was impossible because SQLAlchemy doesn't have a good sense of the constraints on an sqlite table, so sqlalchemy-migrate tries to reproduce the table without the boolean field, but since the boolean check is implemented as a constraint, the new table still has a constraint on a non-existent field and sqlalchemy-migrate doesn't notice. When the statement to create the new table is executed, sqlite explodes wondering what this boolean check is doing on this field that doesn't exist. More recently, one of our Summer of Code students tried writing some migrations, discovered that one broke in sqlite and another that didn't break deleted the unique constraint. We have no idea how to move forward on some of these issues, and that's a frustrating situation to be in.

Given all the above pain, why doesn't SQLite implement ALTER TABLE? I actually don't really know the details, but one of our contributors knows a bit about the sqlite structure and tells me that he thinks it might be because the data format makes some actions like appending rows fairly easy, but other actions like deleting a field would mean rewriting the entire table line by line.

But to that I think: migration frameworks are already rewriting tables entirely! So as far as I can tell, in the worst case scenario, sqlite implementing these other alter table methods means that it will be doing the same thing that migration frameworks have to do already, but in an official way, with a better sense of the structure of the existing tables, and probably even a bit faster than some other program likely operating through a different language doing the same. Sure, this may not be ideal, but it would be much better than the present situation. The documentation could even say this: "be aware that due to the nature of the sqlite file structure, this is a very slow operation that requires rewriting your entire table." But at least it would be a operation that rewrites the table natively, and would not explode in such strange and unpredictable ways!

I would be interested in helping myself, but I don't know SQLite's codebase, database structures is a domain I don't presently know, and I do not have time to learn it. But I'm more than happy to donate money (and I'm running off a crowdfunding campaign salary... I don't normally donate this amount of money to things, but surely fixing the most frustrating recurring bug in my programming career is worth putting $200 down). And I bet I'm not alone. If someone experienced with developing sqlite was willing to make an upstream-aimed contribution to kill this pain point, I bet it'd be a very fundable project.

Update: This post has gotten a fair amount of discussion on HackerNews, which is good! I'm surprised though at the amount of people who are taking the stance of "that feature doesn't exist, so why would you want that feature?" I thought this reply gave a good response to that.

Another update: One comment on HackerNews suggests that SQLite needs to stay around 250kb to stay "light". But on my Debian install, the sqlite binary is 680kb. Now granted, Debian probably has everything optional compiled in. But there's your answer if you're afraid about the binary getting too large: make it a compile-time option!