Archive for September 10th, 2017

debugging MSVC++ CRTL memory leaks

It seems like none of the docs on this subject are quite complete.

One of the most commonly overlooked issues is that a single process can contain multiple heaps. A DLL that uses the CRTL, for example, can get its own module-local heap. Each heap gets a separate run of the dump activity, so the allocation number shown in a dump report is local to that heap, and you might not break in the right place if you don't set the allocation break number in the right module.

When I see the “Detected memory leaks!” message, I put a breakpoint in the _CrtDumpMemoryLeaks() function. This function is in dbgheap.c, which installs to C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\crt\src\dbgheap.c in Visual C++ 2013. It’s the same module that implements _CrtSetDbgFlag() and _CrtSetBreakAlloc(), so stepping into those functions can help open the file.

Breaking in that function while it dumps the leaked blocks reveals which module is actually making the call; just look at the module name in the call stack when the breakpoint is hit.

Once the right module is found, adding these two lines of code to the application's InitInstance() method should get the allocator to break on the appropriate allocation number:

// Track allocations and dump any leaked blocks automatically at process exit.
_CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
// Break into the debugger when allocation number 42530 is made.
_CrtSetBreakAlloc(42530);
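
If the right module turns out to be a DLL that statically links its own copy of the debug CRTL, those calls have to be made through that DLL's copy of the library, since the break number is per-module state. One way to do that is to have the DLL export a small helper; this is only a sketch, and the SetLeakBreak() name is hypothetical rather than part of any real API:

#ifdef _DEBUG
#include <crtdbg.h>

// Exported from the DLL that owns the leaking heap, so the call goes
// through that DLL's statically linked debug CRTL and affects the heap
// whose allocation numbers appear in its dump report.
extern "C" __declspec(dllexport) void SetLeakBreak(long allocationNumber)
{
    _CrtSetBreakAlloc(allocationNumber);
}
#endif

The application can then call SetLeakBreak() early in its startup, before the DLL makes the allocation in question.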

The break call might instead be placed in main() or in DllMain(), whichever is the earliest available point. Static constructors run before those functions are called, so it's conceivable that another static constructor needs to be written to get the call in place early enough.
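
Here's a minimal sketch of that approach, reusing the allocation number from the example above; the EarlyLeakBreak type and g_earlyLeakBreak name are placeholders, not part of any real API. Note that static initialization order across translation units isn't guaranteed, so even this can arm the break too late if statics in another file allocate first.

#ifdef _DEBUG
#include <crtdbg.h>

// The constructor runs during static initialization, before main(),
// InitInstance(), or DllMain(), so the debug heap is armed early.
struct EarlyLeakBreak
{
    EarlyLeakBreak()
    {
        _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
        _CrtSetBreakAlloc(42530);   // allocation number from the leak dump
    }
};

// File-scope instance; its constructor fires during static initialization.
static EarlyLeakBreak g_earlyLeakBreak;
#endif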

Some sources show multiple calls to _CrtSetBreakAlloc(), implying that the library can check for multiple break numbers, but it can't: each call simply replaces the previously set break number.
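
A quick way to confirm this in a debug build is that _CrtSetBreakAlloc() returns the value it replaces, so consecutive calls just show each number overwriting the last; a small sketch:

#include <crtdbg.h>
#include <cstdio>

int main()
{
    long prev1 = _CrtSetBreakAlloc(100);   // returns the prior break number; -1 means none was set
    long prev2 = _CrtSetBreakAlloc(200);   // returns 100; only 200 is active now
    std::printf("replaced break numbers: %ld, %ld\n", prev1, prev2);
    return 0;
}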

Why Do Drives Fail?

Disk drive failures are terrifying. Drives that haven’t failed yet, arguably, are just as terrifying as failed drives—they’ll fail eventually!

Some companies use lots of disks; Google is one of them. They surveyed their drive population over time, analyzing failures and how the drives were being used. The observations are recorded in a paper called Failure Trends in a Large Disk Drive Population. The study investigates almost 3500 drives that failed while being used at Google's data centers.

There are several conclusions based on the observations and data that seem to contradict commonly held beliefs about disk drives. One is that there's no strong link between a drive's activity level, or the temperature of its environment, and its failure rate. A cooler drive doesn't necessarily enjoy more longevity, and a heavily used drive doesn't necessarily fail sooner.

Surprisingly, the paper also fails to conclude that a drive's age determines its failure rate. That's partially because drive technology changes so quickly: the older drives in the population are mostly of different designs and models than the newer ones. You'd expect either a roughly linear relationship between age and failure rate, or a curve that swings upward as drives age and fail more often. There is an upswing in failure rates as drives age, but there's also a hump followed by a decline among mid-life drives. I don't think the paper has enough data to reach a hard conclusion here, but it's still notable that a conclusion which might have seemed obvious can't actually be reached.

Further, the paper explains that SMART doesn't help much in predicting drive failure. This is rather shocking, considering how much the drive industry has invested in SMART and the support around it. The drives in Google's servers are monitored frequently and their SMART data is recorded, yet SMART gave no clear indication of impending failure.

Internet forums are full of people who proclaim that a certain brand of drive is failure-prone because they've had a few of them fail, and that's very dubious logic. Almost all of these authors are drawing on very small samples, and such anecdotal results are nothing to rely upon. The fact remains that drives fail, and I don't believe any particular model or line is more or less reliable than another. I was prepared for that, but I wasn't really expecting SMART to be of so little use for monitoring drive health.