Hitchhiker's guide to data formats

By Christine Lemmer-Webber on Wed 21 October 2015

Just thinking out loud this morning on what data formats there are and how they work with the world:

  • XML: 2000's hippest technology. Combines a clear, parsable tree based syntax with extension mechanisms and a schema system. Still moderately popular, though not as it once was. Tons of tooling. Many seem to think the tooling makes it overly complex, and JSON has taken over much of its place. Has the advantage of unambiguity over vanilla JSON, if you know how to use it right, but more effort to work with.
  • SGML: XML's soupier grandmother. Influential.
  • HTML: Kind of like SGML and XML but for some specific data. Too bad XHTML never fulfilled its dream. Without XHTML, it's even soupier than SGML, but there's enough tooling for soup-processing that most developers don't worry about it.
  • JSON: Also tree-based, but keeps things minimal, just your basic types. Loved by web developers everywhere. Also ambiguous since on its own, it's schema-free... this may lead to conflicts between applications. But if you know the source and the destination perfectly it's fine. Has the advantage of transforming into basic types in pretty much every language and widespread tooling. (Don't be evil about being evil, though? #vaguejokes) If you want to send JSON between a lot of locations and want to be unambiguous in your meaning, or if you want more than just the basic types provided, you're going to need something more... we'll come to that in a bit.
  • S-expressions: the language of lisp, and lispers claim you can represent anything as s-expressions, which is true, but also that's kind of ambiguous on its own. Capable also of representing code just as well, which is why lispers claim benefits of symmetry and "code that can write code". However, serializing "pure data" is also perfectly possible with s-expressions. So many variations between languages though... it's more of a "generalized family" or even better, a pattern, of data (and code) formats. Some damn cool representations of some of these other formats via sexps. Some people get scared away by all the parens, though, which is too bad, because (though this strays into code + data, not just data) homoiconicity can't be beat. (Maybe Wisp can help there?)
  • Canonical s-expressions: S-expressions, with a canonical representation... cool! Most developers don't know about it, but was designed for public key cryptography usage, and still actively used there (libgcrypt uses canonical s-expressions under the hood, for instance). No schema system, and actually pretty much just lists and binary strings, but the binary strings can be marked with "display hints" so systems can know how to unpack the data into appropriate types.
  • RDF and friends: The "unicode" of graph-oriented data. Not a serialization itself, but a specification on the conceptual modeling of data, and you'll hear "linked data" people talking about it a lot. A graph of "subject, predicate, object" triples. Pretty cool once you learn what it is, though the introductory material is really overwhelming. (Also, good luck representing ordered lists). However, there is no one serialization of RDF, which leads to much confusion among many developers (including myself, while being explained to the contrary, for a long time). For example, rdf/xml looks like XML, but woe be upon ye who uses XML tooling upon it. So, deserialzie to RDF, then deal with RDF in RDF land, then serialize again... that's the way to go with RDF. Has more sane formats than just rdf/xml, for example Turtle is easy to read. RDF community seems to get mad when you want to interpret data as anything other than RDF, which can be very off-putting, though the goal of a "platonic form" of data is highly admirable. That said, graph based tooling is definitely harder for most developers to work with than tree-based tooling, but hopefully "the jQuery of RDF" library will become available some day, and things will be easier. Interesting stuff to learn, anyway!
  • json-ld: A "linked data format", technically can transform itself into RDF, but unlike other forms of RDF syntax, can often be parsed just on its own as simple JSON. So, say you want to have JSON and keep things easy for most of your users who just use their favorite interpreted language to extract key value pairs from your API. Okay, no problem for them! But suddenly you're also consuming JSON from multiple origins, and one of them uses "run" to say "run a mile" whereas your system uses "run" to mean "run a program". How do you tell these apart? With json-ld you can "expand" a JSON representation with supplied context to an unambiguous form, and you can "compact" it down again to the terms you know and understand in your system, leaving out those you don't. No more executing a program for a mile!
  • Microformats and RDFa: Two communities which are notoriously and exasperatingly at odds with each other for over a decade, so why do I link them together? Well, both of these take the same approach of embedding data in HTML. Great when you have HTML for your data to go with, though not all data needs an HTML wrapper. But it's good to be able to extract it! RDFa simply extracts to RDF, which we've discussed plenty; Microformats extracts to its own thing. Frequent form of contention between these groups is about vocabulary, and how to represent vocabulary. RDFa people like their vocabulary to have canonical URIs for each term (well, that's an RDF thing, so not surprising), Microformats people like to document everything in a wiki. Arguments about extensibility is a frequent topic... if you want to get into that, see Amy Guy's summary of things.

Of course, there's more data formats than that. Heck, even on top of these data formats there's a lot more out there (these days I spend a lot of time working on ActivityStreams 2.0 related tooling, which is just JSON with a specific structure, until you want to get fancier, add extensions, or jump into linked data land, in which case you can process it as json-ld). And maybe you'd also find stuff like Cap'n Proto or Protocol Buffers to be interesting. But the above are the formats that, today, I think are generally most interesting or impactful upon my day to day work. I hope this guide was interesting to you!