Adam Jacobs wrote an article called “The Pathologies of Big Data”, which I believe was first published in Queue, and which looks at, well, big data.
The title raises the question of what “big data” really is, and the author doesn’t ignore that. Without saying it explicitly, he points out that the definition of “big data” is relative to time: it depends on what storage technologies are available at the moment, what they cost, and how maintainable they are.
The author then runs an interesting experiment: he builds some sample data, carefully packing each record so it stores as much information as possible. The records represent a set of people, who have ages. Since people don’t live past 127 years old very often, he uses only seven bits of a byte for the age. Most database systems don’t allow this level of control over the storage type and packing, so it should be no surprise that the data he stores in his commercial database package takes a lot more space than his hand-coded representation.
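The kind of hand-packing he describes might look something like this sketch. The layout here (one flag bit plus seven bits of age in a single byte) is my own hypothetical example, not the author’s exact record format:

```python
# Hypothetical packed record: one flag bit (say, gender) in the high bit,
# age in the low seven bits -- the whole record fits in a single byte.
def pack_person(is_female: bool, age: int) -> int:
    assert 0 <= age <= 127, "age must fit in 7 bits"
    return (int(is_female) << 7) | age

def unpack_person(b: int) -> tuple:
    # Recover the flag from the high bit and the age from the low seven bits.
    return bool(b >> 7), b & 0x7F

packed = pack_person(True, 34)
print(unpack_person(packed))  # (True, 34)
```

A general-purpose RDBMS can’t store a column in seven bits like this; it has row headers, alignment, and fixed column types to worry about, which is where the extra space goes.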
What is surprising, however, is that when he stores one billion three-byte rows, the data ends up taking more than 40 gigs on disk. Ideally, the storage would take three gigs: 3 bytes times one billion rows. The author offers no explanation for why his database system imposes a more-than-thirteenfold inflation in the storage of the data, which seems rather negligent, particularly since he calls this “sort of inflation … typical of a traditional RDBMS”.
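Just to check the arithmetic on that inflation figure, a quick sketch (the 3 GB and 40 GB numbers are the ones from the article):

```python
# One billion rows at three bytes each, versus what the database
# reportedly used on disk.
ideal_gb = 3.0
actual_gb = 40.0

inflation = actual_gb / ideal_gb                        # ~13.3x
overhead_pct = (actual_gb - ideal_gb) / ideal_gb * 100  # ~1233%
print(f"{inflation:.1f}x inflation, {overhead_pct:.0f}% overhead")
```

Per-row headers, page overhead, and alignment padding plausibly account for some of that, but a 13x blowup still deserves an explanation.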
The author describes his eight-core Macintosh as “hefty hardware”, but ironically gives no description of the disk subsystem. He says that it has two terabytes of RAID-0 storage, but doesn’t mention whether that’s two commodity one-terabyte drives (which are slow: 12 or 15 milliseconds of latency) or fourteen enterprise-class, 15 KRPM, 147-gig SAS drives (about as fast as mechanical storage gets these days).
In investigating the query performance, the author turns up some interesting facts, though he stops just short of directly indicting the PostgreSQL query optimizer or execution engine as the source of the problem.
The experiment is a good idea, and I intend to run a similar test on my own hardware when I have a few moments. What bothers me about the article, though, is the author’s claim that this behaviour isn’t pathological, and that it happens all the time. Indeed it does: what’s happening is that the author is using the wrong tool for the job. Scanning large tables and computing aggregate summaries over the contained data is a pattern that database systems do exercise when supporting data warehousing or business-decision systems. But applying a transactional RDBMS to that workload is, pretty plainly, a misapplication, and it isn’t surprising that it’s slow.
I think the author is further misguided in stating that businesses don’t produce such large amounts of data. They do: website logs, transactions, network monitoring, and so on are all applications that aren’t uncommon in today’s businesses, and they do (or have the potential to) generate massive amounts of data. Oddly, the author cites several such applications later in the article, along with the multipliers, like time, that make them big.
Anyway, the author does acknowledge that data warehousing is a solution to the problem he’s working on, but he doesn’t entertain it. Indeed, “merely saying we will build a data warehouse” doesn’t get the job done. Merely saying anything doesn’t get the job done; someone has to do the actual work.
The article makes one very interesting point, though it then undercuts it. The author points out that randomly reading values from memory is actually slower (36.7 million values per second) than sequentially reading them from disk (53.2 million values per second). Reading from disk is supposedly the slowest thing your computer can do, and we know that cache misses make memory reads painful, but it is a bit surprising to realize that cache misses add up to something slower than a sequential, physical disk read.
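The effect is easy to glimpse even from a high-level language. This sketch sums the same list twice, once in order and once through a shuffled index list; the numbers it prints will be dominated by interpreter overhead rather than pure cache behaviour, so treat it as an illustration, not a faithful reproduction of the author’s benchmark:

```python
import random
import time

N = 2_000_000
data = list(range(N))
indices = list(range(N))
random.shuffle(indices)  # a random probe order over the same data

t0 = time.perf_counter()
seq_sum = sum(data)  # sequential, cache-friendly traversal
t1 = time.perf_counter()
rnd_sum = sum(data[i] for i in indices)  # random probes, cache-hostile
t2 = time.perf_counter()

print(f"sequential: {t1 - t0:.3f}s, random: {t2 - t1:.3f}s")
assert seq_sum == rnd_sum  # same data, same total, very different cost
```

A C version with a data set larger than the last-level cache would show the memory-access gap much more starkly.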
The author makes the point that denormalizing tables can help. This is certainly true, particularly when it avoids the sort of joins the author describes. However, I think it’s debatable how often such scans are interesting. The suggested pattern, joining transaction data back to the user table to show who performed those transactions, isn’t typical because the aggregation isn’t common. When it is, it’s easy enough to cache or hash the lookups so they’re far more efficient than randomly probing the data. It’s also possible for the query optimizer to recognize this access pattern and fully scan both tables rather than probing randomly. That approach is a bit counter-intuitive, but it turns out the sequential read is so much faster that its benefit easily overwhelms the cost of reading data that isn’t actually required to drive the query.
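The hashed-lookup idea can be sketched in a few lines. The tables and column names here are hypothetical, made up for illustration; the point is that building the hash once turns every per-transaction probe of the user table into an O(1) in-memory lookup:

```python
# Hypothetical tables: a users lookup (user_id -> name) and a list of
# transactions as (txn_id, user_id, amount) tuples.
users = {1: "alice", 2: "bob", 3: "carol"}
transactions = [
    (101, 2, 19.99),
    (102, 1, 5.00),
    (103, 3, 7.50),
]

# Hash join: one pass over transactions, constant-time user lookups,
# instead of randomly probing a user table row by row.
report = [(txn_id, users[user_id], amount)
          for txn_id, user_id, amount in transactions]
print(report)
# [(101, 'bob', 19.99), (102, 'alice', 5.0), (103, 'carol', 7.5)]
```

This is essentially what a query engine’s hash join does when the smaller table fits in memory, which is exactly the case in the author’s scenario.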
I like the author’s treatment of scale-out solutions for handling larger data sets. Scaling to multiple computers can be an inexpensive way to share the load, and it requires less specialty hardware. The problem is that it requires lots of special engineering. These days, engineering is far more expensive than hardware; it’s easier to spend the money on an exorbitant server just once than to pay a team of engineers to make cheaper, lesser hardware do the same job.
The usual solution of commoditization applies here, I think. That is, we have to rely on vendors to absorb the difficult engineering problems and to sell products, at a fraction of the original engineering cost, that address the problems we have with large data. No sane organization would build its own version of a product as complicated as SQL Server, for example; we shouldn’t expect those same organizations to build even more complicated software.
While I think some parts of this article are poorly written or poorly supported, I like the author’s definition of “big data”, which involves stepping past the “tried-and-true techniques” we’re used to. I don’t think large data will be successfully used as an asset in most organizations until the tools are really there.