Archive for October, 2009

The Amazing Math Challenge

My wife enjoys watching The Amazing Race, a reality game show in which contestants travel to different locations and perform challenges based on local customs or industry. For example, in Japan, you might have to eat some number of sushi rolls within a specified amount of time; in Malaysia, the teams help a fisherman load his boat with buckets of fresh water; in Holland, they bicycle through rush hour from their hotel to the train station.

This year, one of the challenges was terrifying. In Dubai, the contestants had the choice of visiting a jeweler. The riches of Dubai were portrayed by a store that had bars, ingots, and coins of gold on display, along with a large monitor showing the current gold price. The contestants had a scale, and were to weigh out enough gold to equal $500,000 at the current price.

Three of the teams on the show couldn't find the formula to compute the number of ounces that would total $500,000 at the quoted price. One team had a calculator and still couldn't get the right answer. Another team couldn't work it out at all and had to ask an allied team for help.

This is nothing short of remarkable to me. The simple problem, resolving a rate, is a matter of dividing $500,000 by the price of an ounce. (Or, more technically, multiplying $500,000 by the reciprocal of the price per ounce to get the number of ounces, thereby converting the units.) Be mindful that these teams didn't know what math problem to do. It's not that they knew what problem they wanted to do and just made a mistake with the mechanical math. They weren't literate enough to discover the operation needed to arrive at their answer.
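
For what it's worth, the whole computation fits in a couple of lines. Here's a minimal Python sketch; the gold price below is an assumed round figure, not the price from the episode.

    # Convert a dollar target into ounces of gold at a given spot price.
    target_dollars = 500000
    price_per_ounce = 1050.0   # assumed example price, dollars per ounce

    ounces_needed = target_dollars / price_per_ounce
    print("Weigh out about %.2f ounces" % ounces_needed)   # about 476.19 ounces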

Engineers use math throughout their day jobs, so I might be guilty of expecting too much. Still, I don't believe I could expect much less than this simple problem. That so many of these teams failed is a fact I simply find terrifying.

A few months ago, I heard an interview with Cliff Mass, a professor of meteorology at the University of Washington. He has said that the students he sees in his meteorology program are unable to use a calculator to perform operations with fractions, or to work with simple ratios and percentages. They aren't familiar with simple trigonometric functions. These aren't trick questions; they aren't day-long applied problems. They're things you should be able to do in your head in the grocery aisle.

Professor Mass is involved with Where's the Math?, a program that's trying to revitalize mathematics education in secondary schools in Washington State. I hope you can find a way to support him, or find a way to help a student in your life build a strong relationship with mathematics.

PC World confused about Bitness

Steve Fox wrote an article for the November 2009 issue of PC World called “Will Windows 7 Leave Users Champing at the Bits?” It’s been a long time since this magazine has had technical content. This article is confusing to readers, whiny, and ill-informed.

Fox claims that Windows 7 will “shake up the computing landscape in ways that Windows Vista didn't” by prompting upgrade decisions and selling lots of PCs. One of Vista's major problems at its launch was incompatibility with existing hardware because of the changed driver model, and the failure of Microsoft to work with vendors early enough to get new drivers in the pipeline. When hardware vendors didn't write Vista drivers, they brought lots of their peripherals and components to end-of-life and provided Vista support only for their newest hardware. Users had to upgrade, and they felt it was terrible to trash equipment that was only a couple of years old, still working, and simply inoperable with Vista.

Windows 7 doesn’t have this problem, and doesn’t force users to upgrade. Releases of Windows have historically been coupled to increased hardware sales, however.

The author says that 64-bit computing is a “brave and zippy new world” but doesn't describe the tangible benefits of 64-bit operating systems compared to 32-bit systems. While he does acknowledge that 64-bit systems can address more memory, he doesn't explain how this helps anyone. He says that some of the “system's power goes to waste” when running a 64-bit-capable machine with a 32-bit operating system, but it's really just potential that goes unused. And for the majority of applications today, the extra memory isn't really that important.
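
To make the memory point concrete, the arithmetic is simple: a 32-bit pointer can address at most 4 GiB, while a 64-bit pointer raises the theoretical ceiling enormously (operating systems impose much lower practical limits, but still far above 4 GiB). A quick sketch:

    # Theoretical address-space ceilings, ignoring OS-imposed limits.
    gib = 2 ** 30
    print("32-bit: %d GiB addressable" % (2 ** 32 // gib))   # 4 GiB
    print("64-bit: %d GiB addressable" % (2 ** 64 // gib))   # 17,179,869,184 GiB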

Somehow, the author predicts that users installing 64-bit versions of Windows 7 will “probably have problems with device drivers”, but offers no evidence to support this assertion. Such an extraordinary claim certainly needs some support if it is to be found credible.

Puzzlingly, the author says that he's disappointed that Windows 7 64-bit won't install as an upgrade from 32-bit versions. But he points out in his own article that a hardware upgrade is needed anyway: since many older machines don't support 64-bit computing, they won't run the 64-bit OS. And because they probably don't have more than three or four gigabytes of memory, they won't see any benefit from Win64.

While Vista shipped a very viable 64-bit edition, the author claims that we're “stuck in 32-bit land” for the time being and still at least one generation away from “a common 64-bit experience”. It's not clear what that “common experience” really is, or why it is at all important. Fox ends the article with a call to action for users to “make noise” about 64-bit machines, and to “agitate for change”. This cry goes out without any explanation of the resulting benefits. After all, won't we just have “problems with device drivers” if we try to use 64-bit versions of Windows, as the author claims?

The author says that one of PC World's test machines, an overclocked Core i7-920, would have torn through 64-bit applications. But they didn't run 64-bit applications because their benchmark suite doesn't support them. Fox should have started with his own organization, it seems. If they did have a 64-bit version of the suite, a direct comparison would've been possible, and would either support his point or demonstrate that the improvements really aren't beneficial for corporate desktop users.

Steve Fox’s article is full of weakly-supported assertions and terribly reasoned arguments, a pinch of nonsense, and calls readers to rally around a vague cause with no clear benefit. I’m not sure why PC World publishes such pieces.

The Art of Failing Early

Nobody wants to fail, obviously. We’re taught by teachers, parents, and greeting cards alike to never give up and follow our dreams.

But failing—and failing early, in particular—is an important part of success. It’s counter-intuitive, so let’s examine the way it works.

Say we're doing a project. Like any challenging project, it's got some risk and some unknowns. We don't know how long it will take; maybe we guess about two years. So we get started; we're working, and we find some of those challenges. We grind through them, find almost-adequate solutions, and keep trucking. Our schedule swells, our budget soars. We chew through all the time and money only to realize that the problems we were facing were signals of a bad design, and it turns out we've got nothing. We might have learned something along the way, but we've used all that time and money.

What if we could fail early? If there were a way to know that we were on the wrong track, we could quit early and save all that time and money.

Failing early is an interesting goal. The idea is to assert that the first milestone of a project must meet critical goals for cutting risk and closing open issues. If solutions to those critical problems are found, or if they're at least becoming more solid, then work can proceed. If not, then it's time to re-evaluate what the project will do, or at least when it will finish. Projects that ignore these signals are going to waste lots of time.

Admitting defeat isn’t failure. It’s actually very smart to recognize an intractable situation and backtrack to a better path. The sooner a team backs up and finds a path to success, the more time they have to realize that success.

Thinking about Large Data and Scale-Out

Adam Jacobs wrote an article called The Pathologies of Big Data, which I believe was first published in Queue, and which looks at, you know, big data.

The title raises the question of what “big data” really is, and the author doesn't ignore that. Without saying it explicitly, he points out that the definition of “big data” is relative to time: it depends on what storage technologies are available at the time, what they cost, and how maintainable they are.

The author then poses an interesting experiment: he builds some sample data, carefully packing each record so it stores as much information in as little space as possible. The data represents a set of people and their ages. Since people don't often live past 127 years, an age needs only seven bits of a byte. Most database systems don't allow this level of control over the storage type and packing, so it should be no surprise that the data he stores in the database package he uses takes a lot more space than his hand-coded representation.
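
As a rough illustration of that kind of packing (my own reconstruction, not the author's code), a seven-bit age leaves a spare bit in each byte for, say, a hypothetical one-bit flag:

    # Hypothetical packing: a 7-bit age plus a 1-bit flag in a single byte.
    def pack_person(age, flag):
        assert 0 <= age <= 127           # ages fit in seven bits
        return ((flag & 1) << 7) | age   # high bit: flag, low seven bits: age

    def unpack_person(b):
        return b & 0x7F, b >> 7

    packed = pack_person(34, 1)
    print(unpack_person(packed))         # (34, 1)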

What is surprising, however, is that when he stores one billion three-byte rows, the data ends up taking more than 40 gigs on disk. Ideally, the storage would take three gigs: 3 bytes times one billion. The author offers no explanation of why his database system causes more than 1300% overhead in the storage of the data, which seems rather negligent, particularly since he calls this “sort of inflation … typical of a traditional RDBMS”.
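
The inflation is easy to check. Treating the observed size as a round 40 gigs (an assumption; the article only says “more than 40”), the overhead works out to roughly 1230 percent, and a slightly larger observed size pushes it past the figure the author quotes:

    # Overhead of the observed on-disk size over the ideal packed size.
    ideal_bytes = 3 * 10 ** 9          # 3 bytes per row, one billion rows
    observed_bytes = 40 * 10 ** 9      # assumed round figure
    overhead = (observed_bytes - ideal_bytes) / float(ideal_bytes)
    print("overhead: %.0f%%" % (overhead * 100))   # about 1233%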

The author describes his eight-core Macintosh as “hefty hardware”, but ironically gives no description of the disk subsystem. He says that it has two terabytes of RAID-0 disk, but doesn't mention whether he's using two commodity one-terabyte drives (which are slow: 12 or 15 milliseconds of latency) or fourteen enterprise-class, 15,000 RPM, 147-gig SAS drives (which are about as fast as you can buy these days, if you're shopping for mechanical storage).

In investigating the query performance, the author does examine some interesting facts, though he stops just short of directly indicting the PostgreSQL query optimizer or execution engine as the performance issue.

The experiment is a good idea, and I intend to perform a similar test on my own hardware when I have a few moments. What irks me about the article, though, is that the author claims that this behaviour isn't pathological, and that it happens all the time. Indeed, it does: what's happening is that the author is using the wrong tool for the job. Scanning large tables and performing aggregate summaries over the contained data is a pattern that database systems do exercise when supporting data warehousing or business decision systems. But it is, pretty plainly, a misapplication of such a system, and it isn't surprising that it's slow.

I think the author is further misguided in stating that businesses don't produce such large amounts of data. They do: website logs, transactions, network monitoring, and so on are all applications that aren't uncommon in today's businesses, and they do (or have the potential to) generate massive amounts of data. Oddly, later in the article the author cites several such applications, along with the multipliers, like time, that make them big.

Anyway, the author does acknowledge that data warehousing is a solution to the problem he's working on, but does not entertain it. Indeed, “merely saying we will build a data warehouse” doesn't get the job done. Just saying anything does not get the job done; someone has to do actual work.

The article makes one very interesting point, though it then undercuts it. The author points out that randomly accessing memory is actually slower (36.7 million per second) than sequentially reading from disk (53.2 million per second). Reading from disk is probably the slowest thing your computer can do, and we know that cache misses on memory reads are painful, but it is a bit surprising to realize that cache misses add up to something slower than a sequential, physical disk read.
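
Put in terms of a one-billion-row table, those rates make the trade-off concrete. A back-of-the-envelope sketch, assuming the figures are records per second:

    # Time to touch one billion records at the rates quoted in the article.
    records = 10 ** 9
    random_memory = 36.7e6      # records per second, random access in memory
    sequential_disk = 53.2e6    # records per second, sequential read from disk

    print("random in memory:   %.1f seconds" % (records / random_memory))    # ~27.2
    print("sequential on disk: %.1f seconds" % (records / sequential_disk))  # ~18.8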

The author makes the point that denormalizing tables can help. This is certainly true, particularly when it avoids the sort of joins the author is describing. However, I think it's debatable how often such scans are interesting. The suggested pattern, joining transaction data back to the user table to show who performed the transactions, isn't typical because the aggregation isn't common. When it is, it's easy enough to cache or hash the lookups so they're far more efficient than randomly probing the data. It's also possible to recognize this access pattern in the query optimizer and fully scan both tables rather than randomly probing. That approach is a bit counter-intuitive, but it turns out the sequential read is so much faster that its benefit easily outweighs the cost of reading data that isn't actually required to answer the query.
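
Here's a tiny sketch of what I mean by hashing the lookups; the tables and column names are made up for illustration. The small user table is read once into a hash, and the large transaction table is then streamed sequentially, so nothing is probed at random:

    # Hypothetical example: aggregate transactions by user with one hash build
    # and one sequential pass, instead of probing the user table per row.
    users = {1: "alice", 2: "bob"}                      # small table: user_id -> name
    transactions = [(1, 19.99), (2, 5.00), (1, 7.50)]   # large table: (user_id, amount)

    totals = {}
    for user_id, amount in transactions:                # sequential scan
        name = users[user_id]                           # O(1) in-memory lookup
        totals[name] = totals.get(name, 0.0) + amount

    print(totals)                                       # {'alice': 27.49, 'bob': 5.0}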

I like the author’s treatment of scale-out solutions for handling larger data sets. Scaling to multiple computers can be an inexpensive way to share the load and requires less specialty hardware. The problem is that it requires lots of special engineering. These days, engineering is lots more expensive than the hardware; it’s easier to spend the money for an exorbitant server just once than it is to spend the money for a team of engineers to make cheaper, lesser hardware do the same job.

The usual solution of commoditization applies here, I think. That is, we have to rely on vendors to absorb the difficult engineering problems and make products that address the problems we have with large data, which we can then purchase at a fraction of the original engineering cost. No sane organization would ever build its own version of a product as complicated as SQL Server, for example; we shouldn't expect those same organizations to build even more complicated software.

While I think some parts of this article are poorly written or poorly supported, I like the author’s definition of “big data”, which involves stepping past the “tried-and-true techniques” that we’re used to. I don’t think large data will be successfully utilized as an asset in most organizations until the tools are really there.

The Windows Server Standard memory limit on newer machines

The newest Xeon processors use tri-channel memory, which means you'll configure memory in increments of 3*2^n gigabytes rather than 2^n. That is, for the newest servers, you might get a machine with 12, 24, 48, or 96 gigs. You could make one with 36 gigs, if you wanted to, by using six 4-gig parts and six 2-gig parts. Anything else gives away performance by running unbalanced channels.
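
A quick way to see which totals keep the channels balanced, assuming three channels populated with one or two identical DIMMs each and common 2009-era part sizes (a sketch, not an exhaustive list of supported configurations):

    # Balanced tri-channel totals: identical DIMMs across all three channels.
    dimm_sizes_gb = [2, 4, 8]     # common DIMM capacities in 2009
    dimms_per_channel = [1, 2]

    for size in dimm_sizes_gb:
        for per_channel in dimms_per_channel:
            total = 3 * per_channel * size
            print("%d x %d GB = %d GB" % (3 * per_channel, size, total))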

This means the Windows Server Standard memory limit of 32 gigs makes even less sense than it did before. Do I configure my server with 24 gigs of memory and waste some of the software's capability? Or do I configure my machine with 36 or 48 gigs of memory and waste the hardware capacity? Jumping to an OS that costs more than three times the price (about $800 street for Windows Server Standard compared to about $2800 street for Windows Server Enterprise) is hard to justify compared to “wasting” $200 worth of DIMMs, I guess.

But with the changing hardware platform, will Microsoft relax the memory limitations on Windows Server boxes and allow up to 48 gigs of memory, a more natural boundary for the new processors, and an attainable limit for the older processors?

A study about DRAM errors

The University of Toronto this week released a paper titled “DRAM Errors in the Wild”. You might have read that the paper was sponsored by Google. In a way, it was; the first author, Bianca Schroeder, is an intern from the University of Toronto who is working at Google. A couple of Google employees are listed as the other authors.

There are lots of interesting facets to this paper, and I read it very eagerly.

Some have to do with the paper itself, or the process behind it. It's remarkable that Google gives interns interesting work to do and allows them to take full credit for it; being listed as first author on a relevant academic paper while interning at a research-heavy company like Google is flattering for the student and very generous of the company. I don't mean that to sound like I'm assuming Ms Schroeder didn't deserve the opportunity or do the work. It's just that companies generally don't reward interns for their hard work, and sometimes don't even offer them opportunities on high-visibility projects.

Another interesting fact is that Google so carefully monitors its servers. The hard drive paper, and now this memory paper, show us that Google is paying close attention to its machines. In a way, you'd assume that it has to, since its farms are geographically distributed and numerous. (One estimate suggests Google has a million servers, while others are in the solid five-digit range. The latter seems more relevant, since the paper identifies the sample population as “many ten-thousands of machines”, though this doesn't discredit the larger estimate.) While redundancy and distribution help, Google still needs people to replace failing hardware. Lots of companies check on the health of their rigs, but doing so at a level of detail that tracks and stores the history of each machine is very sensible and forward-thinking.

Google, with its large (or huge?) array of servers, has the unique opportunity to study computer equipment in the same environment where it's normally installed and used, as it is actually used, rather than in a laboratory. The results of these observations are very valuable to anyone who deploys servers. Since they help show how machines might predictably fail or degrade, server operators can do a better job of estimating and planning, resulting in better service and less waste.

DRAM, as we know, is the volatile memory that stores data closest to the processor. Like any other part of a computer, it can experience errors. When the processor stores data in it, the data might not be stored correctly. When the data is retrieved, it might not match what was stored. That's obviously a problem: if the data in memory is code, the corruption causes a crash or unpredictable execution. If it's user data, the manifestation is incorrect results or, again, a crash.

But how often does DRAM fail?

The study of the Google servers shows that DRAM fails more frequently than we previously thought, that the temperature of the device doesn't correlate with failure, and that the manufacturer doesn't correlate with failure either. The relationship between temperature and failure is interesting, since it might mean that we over-cool our computers. All computers have exhaust fans, and it seems that too much energy might be spent on spinning those fans if we can correlate neither disk drive nor DIMM failure to ambient temperature. The finding that no vendor's memory is notably less prone to errors is also a relief, as the paper concludes that it's the design of the host system that more directly correlates with memory errors. Builders should spend their attention, then, on choosing robust systems rather than on which vendor offers the best memory. That is, paying a premium for a particular vendor's memory is probably a waste.

A very interesting finding is that utilization does increase the memory error rate. While this finding is surprising, I'm not sure how it might be practically applied, as memory is there to be used. It would be impractical to provision a machine with twice as much memory as needed just to try to reduce the chance of memory errors, but the fact that high rates of memory access increase the normalized rate of errors suggests that memory simply isn't as reliable as we might like to think.

The most substantial item in the paper is the high rate of errors. The case is very strong for using ECC memory; most server machines do so already, as do some high-end workstations, but very few desktop machines do. Since density also correlates with higher error rates, I think we can expect that as density increases, ECC will eventually become necessary at all levels.

Why do Drives Fail?

Disk drive failures are terrifying. Drives that haven’t failed yet, arguably, are just as terrifying as failed drives—they’ll fail eventually!

Some companies use lots of disks; Google is one of them. They surveyed their drive population over time, analyzing failures and applications. The observations are recorded in a paper called Failure Trends in a Large Disk Drive Population. The study investigates almost 3500 drives that failed while being used at Google’s data centers.

There are several conclusions based on the observations and data that seem to contradict commonly held beliefs about disk drives. One is that there's no link between drive activity, or the heat of its environment, and the failure rate of the drive. A cooler drive doesn't necessarily enjoy more longevity, and a drive that's used heavily doesn't necessarily fail sooner.

Surprisingly, the paper also fails to conclude that the age of a drive determines its failure rate. This is partially because drive technology changes so quickly, and drives which are older are mostly drives of a different design or model. You'd expect either a linear relationship between age and failure rate, or an upswing as drives age and then fail more. There is an upswing in failure rates as drives age, but there's also a hump and a decline for mid-life drives. I don't think the paper uses enough data to reach a hard conclusion, but I think it's still notable that the conclusion can't be reached even though it might have seemed quite apparent.

Further, the paper explains that SMART doesn’t help much in predicting drive failure. This is rather shocking, considering what the drive industry has done to invest in SMART and the support around it. The drives in Google’s servers are monitored frequently, and the SMART data is recorded. SMART gave no clear indication of impending failure.

Internet forums are full of people who proclaim that they've had a lot of a certain brand of drive fail, but this is very dubious logic. Almost all of these authors have very small samples contributing to their experience, and anecdotal results are nothing to rely on. The fact remains that drives fail, and I don't believe any particular model or line is more or less reliable than another. I was prepared for that, but I wasn't really expecting that SMART wouldn't be useful for monitoring drive health.