The University of Toronto this week released a paper titled “DRAM Errors in the Wild”. You might have read that the paper was sponsored by Google. In a way, it was; the first author, Bianca Schroeder, is an intern from the University of Toronto working at Google, and a couple of Google employees are listed as co-authors.
There are lots of interesting facets to this paper, and I read it very eagerly.
Some have to do with the paper itself, or the process behind it. It’s remarkable that Google gives interns interesting work to do and allows them to take full credit for it; being listed as first author on a relevant academic paper while interning at a research-heavy company like Google is flattering for the student and very generous of the company. I don’t mean to imply that Ms Schroeder didn’t deserve the opportunity or do the work. It’s just that companies generally don’t reward interns for their hard work, and sometimes don’t even offer them opportunities on high-visibility projects.
Another interesting fact is how carefully Google monitors its servers. The hard drive paper, and now this memory paper, show that Google is paying close attention to its machines. In a way, you’d assume it would have to, since its farms are numerous and geographically distributed. (One estimate suggests Google has a million servers, while others are in the solid five-digit range. The latter seems more relevant, since the paper identifies the sample population as “many ten-thousands of machines”, though this doesn’t discredit the larger estimate.) While redundancy and distribution help, Google still needs people to replace failing hardware. Lots of companies check on the health of their rigs, but tracking and storing the history of each machine at this level of detail is very sensible and forward-thinking.
Google, with its large (or huge?) array of servers, has the unique opportunity to study computer equipment in the environment where it’s normally installed and used, as it is actually used, rather than in a laboratory. The results of these observations are very valuable to anyone who deploys servers. Since they help show how machines might predictably fail or degrade, server operators can do a better job of estimating and planning, resulting in better service and less waste.
DRAM, as we know, is the volatile memory that stores data closest to the processor. Like any other part of a computer, it can experience errors. When the processor stores data in it, the data might not be stored correctly; when the data is retrieved, it might not match what was stored. That’s obviously a problem: if the data stored in memory is actually code, the corruption can cause a crash or unpredictable execution. If it’s user data, the result is incorrect output or, again, a crash.
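To make the failure mode concrete, here is a minimal Python sketch (my own illustration, not from the paper) of how a single flipped bit silently changes a stored value:

```python
def flip_bit(value: int, bit: int) -> int:
    """Simulate a single-bit DRAM error by XOR-ing one bit position."""
    return value ^ (1 << bit)

stored = 100                     # the value the processor wrote
read_back = flip_bit(stored, 5)  # bit 5 flips while the data sits in DRAM
print(stored, read_back)         # the program silently reads 68, not 100
```

If those bytes held a pointer or an instruction instead of a number, the same flip could crash the program rather than merely corrupt a result.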
But how often does DRAM fail?
The study of the Google servers shows that DRAM fails more frequently than we previously thought, and that neither the temperature of the device nor its manufacturer correlates with failure. The temperature finding is interesting, since it might mean that we over-cool our computers. All computers have exhaust fans, and it seems that too much energy might be spent spinning them if neither disk drive nor DIMM failure correlates with ambient temperature. The manufacturer finding is also relieving: the paper concludes that it’s the design of the host system that more directly correlates with memory errors, so the idea that certain vendors make memory less prone to errors doesn’t hold up. Builders should spend their attention, then, on choosing robust systems rather than on which vendor offers the best memory. That is, paying a premium for a particular memory vendor is probably a waste.
A very interesting finding is that utilization does increase the memory error rate. While this finding is surprising, I’m not sure how it might be practically applied, as memory is there to be used. It would be remarkable to provision a machine with twice as much memory as needed just to try to reduce the chance of memory errors, but the fact that high rates of memory access increase the normalized rate of errors suggests that memory simply isn’t as reliable as we might like to think.
The most substantial item in the paper is the high rate of errors. The case is very strong for using ECC memory; most server machines do so already, as do some high-end workstations, but very few desktop machines do. Since density also correlates with higher error rates, I think we can expect that as density increases, ECC will eventually become necessary at all levels.
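To show what ECC buys you, here is a minimal sketch of a single-error-correcting Hamming code over 8 data bits. This is my own illustration under simplifying assumptions (the function names are hypothetical, and real ECC DIMMs use a wider SECDED code over 64-bit words), but the principle is the same:

```python
def hamming_encode(data: int) -> int:
    """Encode 8 data bits into a 12-bit Hamming codeword (even parity).
    Positions are 1-based; parity bits sit at powers of two."""
    codeword = [0] * 13  # index 0 unused; positions 1..12
    data_positions = [3, 5, 6, 7, 9, 10, 11, 12]  # non-powers-of-two
    for i, pos in enumerate(data_positions):
        codeword[pos] = (data >> i) & 1
    # Each parity bit covers every position whose index contains its power of two.
    for p in (1, 2, 4, 8):
        for pos in range(1, 13):
            if pos != p and (pos & p):
                codeword[p] ^= codeword[pos]
    return sum(bit << (pos - 1) for pos, bit in enumerate(codeword[1:], start=1))

def hamming_correct(word: int) -> int:
    """Return the codeword with any single-bit error corrected.
    The syndrome (XOR of the positions of all set bits) names the flipped bit."""
    syndrome = 0
    for pos in range(1, 13):
        if (word >> (pos - 1)) & 1:
            syndrome ^= pos
    if syndrome:  # nonzero syndrome: flip the offending bit back
        word ^= 1 << (syndrome - 1)
    return word
```

With one more overall parity bit this becomes SECDED: single-bit errors are corrected transparently, and double-bit errors are at least detected instead of silently corrupting data.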