Fewer BSODs with Windows 8?

This content was provided by Windows MVP John D. Carrona, whom I recently interviewed here. One of the questions I originally asked him was: Are there fewer BSODs (Blue Screens of Death) with Windows 8 compared to previous versions? It required a longer response, so John has kindly allowed me to publish this article.

Are there fewer BSODs with Windows 8?

A difficult question, as BSODs ebb and flow as time goes on. In general, I find fewer BSODs in Win8/8.1 systems than I did in Win7 systems. But there are some additional considerations here.
There’s the difference between a developer analyzing a BSOD in order to fix a driver that they’ve written and a user trying to fix a BSOD on their own system. When a developer analyzes a crash dump, they usually look at the individual instructions in order to see where the error occurred – whereas a user doesn’t need to look at the instructions (and even if they did, they would have no use for this information as they cannot rebuild the driver).

What can cause a BSOD?

BSODs are caused by problems with the operating system in the kernel space. They occur when the kernel or a driver running in kernel mode encounters an error from which it cannot recover. When this happens, the system is designed to “fail fast” so that there isn’t any damage to the system. This may result in data loss as the system reboots before the user has a chance to save anything that is being worked on. According to the documentation, a user mode function cannot directly cause a BSOD – it must use accepted routines to perform actions in the kernel space – and the kernel space is where the errors occur that generate BSODs.

BSOD Analysis

BSOD analysis is a complicated task which is made easier by the ability to run a debugger and the ability to see patterns. Sometimes we’re able to stop the BSODs – only to find that the system has other problems. Most often we’ll see user mode hangs, crashes, and black screen errors. IMO these are the remnants of the root cause of the problem – even though the BSODs have been stopped. This is a sign (IMO) that BSODs are complicated errors that can have components in both the kernel and user spaces.
When the system crashes, a crash dump is generated. The crash dump is the main component of a crash dump analysis. Here’s a draft paper on the 5 phases of crash dump generation that I wrote a while ago: http://www.carrona.org/dumpgen.html

As you can see, there are many different ways that a system can fail to capture a crash dump.

Quite often we will see that the memory dump blames Windows drivers or ntoskrnl.exe (the OS kernel). This is most often incorrect. If the kernel were to blame, there would most likely be many more problems than the occasional BSOD. Also, there are protective systems in place that keep the kernel and other “protected” operating system files from becoming corrupted – so this makes Windows drivers even less likely to be the cause of a crash. Lastly, the odds of you being the first one to experience a crash caused by Windows drivers are extremely low – and if the crash is reported to Microsoft, they’ll be hard at work looking for ways to stop the crash and will issue a patch/update as soon as possible in order to fix that particular problem.

Also, BSODs just aren’t all that common. When they’re happening to you they are much too frequent – but in the grand scheme of things one just doesn’t see them all that often. How big a problem this is also depends on who’s counting. For example, Microsoft publishes statistics on the frequency of BSODs. These figures are based on (AFAIK) the submission of WER (Windows Error Reporting) reports to the Microsoft Crash Analysis center. But, at times, owners won’t submit the reports – or they’ll only submit a few.

And there are even more ways of counting BSODs:

  • There’s relying on my experience to see how frequent they are (a very subjective measure – but one that works for me)
  • There’s also counting those posted in online forums. They’re affected by many things, not the least of which is the confidence of the owner. The owners (in this case) feel confident enough to work their way through the BSODs with a bit of help.
  • Other owners won’t feel comfortable fixing things this way – and may elect to suffer through the occasional BSOD – or may decide to take the system to a repair shop for fixing.
  • Then we get into technicians fixing BSODs – there are 2 types here: those who wipe and reinstall, and those who try to get to the root of the problem.
  • Also, we can count the number that come into a repair shop. Since I work in just such a shop, I’ll relate here that we don’t see more than 1 or 2 BSOD problems a month (and most aren’t even brought in for the BSOD – but rather for other problems that the owner is suffering with). We have 10 to 30 computers on our benches at any one time – and we average 3.5 days to fix a system.
  • And there are those that analyze their own crashes. In particular I’m referring to developers – those who develop drivers and have to debug them in order to make them work correctly. Most of this work is private and isn’t documented where users can see it.
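To put the repair shop figures above into perspective, here is a rough back-of-the-envelope sketch. It assumes the shop is in a steady state and works about 30 days a month; neither assumption comes from John, and the function name is just for illustration.

```python
# Rough estimate only. Steady state and a 30-day working month are
# assumptions made for this sketch, not figures from the article.

def bsod_share(on_bench, days_per_fix, bsods_per_month, days_per_month=30):
    """Estimate the fraction of monthly repairs that involve a BSOD."""
    # Little's law: throughput = systems in progress / time per system
    systems_per_day = on_bench / days_per_fix
    systems_per_month = systems_per_day * days_per_month
    return bsods_per_month / systems_per_month

# Most BSOD-heavy reading of the numbers: light bench load, 2 BSOD systems a month.
high = bsod_share(on_bench=10, days_per_fix=3.5, bsods_per_month=2)
# Least BSOD-heavy reading: full bench, 1 BSOD system a month.
low = bsod_share(on_bench=30, days_per_fix=3.5, bsods_per_month=1)

print(f"BSOD share of repairs: {low:.1%} to {high:.1%}")
```

Under those assumptions, BSOD problems come out at somewhere between well under 1% and a little over 2% of the systems passing through the shop.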

Next is that most BSODs (approximately 90%) are caused by 3rd party, non-Windows drivers. Approximately 10% are due to hardware problems (which includes compatibility issues), while less than 1% are due to Windows problems. But this presumes a fully updated system – so the first step in BSOD analysis is to get ALL Windows Updates. It’s not only the BSOD errors that are useful, but also the patterns that help to diagnose them. At times the errors simply repeat themselves over and over. These are generally the simplest errors – and are the easiest to fix as the errors occur frequently enough to make testing easy to accomplish. Other errors have a common thread (such as networking or storage errors) – those are also fairly easy to work with. The most difficult “pattern” is the lack of a pattern – this is most often seen with hardware problems (but can be due to other low-level problems).

Here are some of the possible reasons for a low-level problem:

  • “borked” (broken) hardware (several different procedures used to isolate the problem device)
  • BIOS issues (check for updates at the motherboard manufacturer’s website)
  • overclocking/overheating – You’ll know if you’re overclocking or not. If uncertain we can suggest things to check.
  • missing Windows Updates
  • compatibility issues (3rd party hardware/drivers) – and older systems
  • low-level driver problems
  • or even malware (scanned for when we ask for hardware diagnostics from http://www.carrona.org/initdiag.html or http://www.carrona.org/hwdiag.html ).

In general I like to see at least 4 or 5 memory dumps before I conclude that there is the lack of a pattern – and then focus my efforts on low-level/hardware troubleshooting.
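The three kinds of pattern described earlier (a repeating error, a common thread, or no pattern at all) can be sketched as a simple classifier. This is only an illustration of the reasoning, not John's actual method: the function name, the "area" tag, and the sample stop codes are assumptions made for the example.

```python
from collections import Counter

# The article suggests at least 4 or 5 dumps before concluding "no pattern".
MIN_DUMPS_FOR_NO_PATTERN = 4

def classify_pattern(dumps):
    """Classify a list of (stop_code, area) tuples.

    'area' (e.g. "network", "storage") is a hypothetical tag an analyst
    assigns by hand after reading the dump; use None when unknown.
    """
    if not dumps:
        return "no data"
    codes = Counter(code for code, _ in dumps)
    areas = Counter(area for _, area in dumps if area)
    # Same error over and over: the simplest case.
    if len(dumps) > 1 and len(codes) == 1:
        return "repeating error"
    # Different errors, but every dump points at the same area.
    if len(dumps) > 1 and areas and areas.most_common(1)[0][1] == len(dumps):
        return "common thread: " + areas.most_common(1)[0][0]
    # Enough dumps and still nothing in common.
    if len(dumps) >= MIN_DUMPS_FOR_NO_PATTERN:
        return "no pattern: suspect hardware/low-level"
    return "need more dumps"

sample = [("0x1A", None), ("0x50", None), ("0x3B", None), ("0x124", None)]
print(classify_pattern(sample))
```

In practice the judgment is a lot less mechanical than this, but the ordering is the point: look for the easy patterns first, and only fall back to hardware troubleshooting once there are enough dumps to rule them out.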

And BSODs tend to come in cycles. Now that Microsoft has formalized “Patch Tuesday”, I find that we see an increase in BSODs shortly after Patch Tuesday. I suspect that the majority of these are due to the patches exposing incompatibilities in 3rd party drivers.

BSODs also become less frequent as the OS matures. There are several reasons for this:

  • Microsoft works to make sure that Windows is compatible. This means constant research and subsequent releases of updates that keep conflicts between drivers and Windows to a minimum.
  • as an OS matures, developers become more familiar with working with it – which makes problems less common
  • unsuccessful software dies out if it’s not able to work with the new OS

There are also some patterns that seem to crop up. We’ve identified some BSOD causes and listed them in the Common BSOD Related Drivers page. I’ve also observed that BSODs seem more common in systems that use/are used for gaming, torrents, USB audio, VMs, or USB networking – but there’s no distinct thing that ties them all together (and many exceptions to the above generalities). And BSODs seem to come in 2 basic types – those that have a definite, easy to determine cause, and those that are complicated and are likely to have multiple causes.

We also see issues with older drivers – either due to corruption or to compatibility issues. The review of the dump files includes looking at the stack text and the raw stack text for signs of driver problems, the use of Driver Verifier to attempt to force the system to give up the name of the faulting driver, researching drivers that are most frequently associated with BSODs (see the link in the previous paragraph), and updating drivers that date from before the latest version of the OS.
At times, there’s no evidence of the cause of the problem. Here’s an example of how this can occur:

  • Driver A writes to a memory address that’s owned by Driver B
  • Driver A then exits, leaving no trace of itself
  • A while later, Driver B looks into the memory address that Driver A wrote to and finds something unexpected there.
  • The system panics because of this and crashes to a BSOD.
  • As the system has no evidence of Driver A, it places blame elsewhere (most often on Driver B).
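The Driver A/Driver B scenario above can be mimicked with a toy model. This is a deliberately simplified sketch, not how the Windows kernel actually tracks memory: the ToyKernel class, its methods, and the address used are all made up for illustration.

```python
class ToyKernel:
    """Toy model only: memory is a dict of address -> value."""

    def __init__(self):
        self.memory = {}
        self.owner = {}      # address -> driver that allocated it
        self.loaded = set()  # drivers currently loaded

    def load(self, driver):
        self.loaded.add(driver)

    def unload(self, driver):
        # No record of the driver's past writes is kept.
        self.loaded.discard(driver)

    def allocate(self, driver, address, value):
        self.owner[address] = driver
        self.memory[address] = value

    def write(self, driver, address, value):
        # This toy kernel neither checks ownership nor logs who wrote.
        self.memory[address] = value

    def read_expecting(self, driver, address, expected):
        if self.memory[address] != expected:
            # Only the driver that tripped over the bad value is visible now.
            return f"BSOD: blamed {driver}"
        return "ok"

kernel = ToyKernel()
kernel.load("DriverA")
kernel.load("DriverB")
kernel.allocate("DriverB", 0x1000, "B-data")
kernel.write("DriverA", 0x1000, "garbage")   # Driver A corrupts B's memory
kernel.unload("DriverA")                      # ...and exits without a trace
result = kernel.read_expecting("DriverB", 0x1000, "B-data")
print(result)
```

By the time the bad value is discovered, the only driver still in the picture is Driver B, so that is where the blame lands.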

The final consideration is the complexity of the OS. As the OS has increased in complexity, the BSODs have seemed to become more complicated also. Back in the XP days, there were far fewer BSOD error codes than there are today (I currently have over 400 listed on the BSOD Index page). And, as I recall, the XP errors seemed simpler to diagnose back then. For example, hibernation BSODs in XP were almost always caused by corruptions in the video drivers. With the later OSes, this simple approach will rarely fix these problems – and we end up updating many drivers and even disabling some driver components in order to fix the BSODs.
In short, the process is:

  • get all Windows Updates
  • get all OEM updates (emphasis is on compatibility with the current OS level)
  • get all additional updates for installed devices and software (emphasis is on compatibility with the current OS level)
  • eliminate 3rd party driver problems (common BSOD causes, compatibility/age issues, corruption, etc)
  • perform hardware diagnostics (Hardware Diagnostics and Hardware Stripdown Troubleshooting)
  • fix/reinstall Windows (SFC.EXE, DISM, Refresh, Reset, etc)
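The order of the steps above matters, since each one rules out a cheaper explanation before moving on to a more disruptive one. A minimal sketch of that idea, using a hypothetical helper that simply returns the first step not yet done:

```python
# Steps in the order given above; the wording is shortened for the sketch.
STEPS = [
    "get all Windows Updates",
    "get all OEM updates",
    "get all additional device and software updates",
    "eliminate 3rd party driver problems",
    "perform hardware diagnostics",
    "fix/reinstall Windows",
]

def next_step(completed):
    """Return the first step not yet completed, or None when all are done."""
    for step in STEPS:
        if step not in completed:
            return step
    return None

done = {"get all Windows Updates", "get all OEM updates"}
print(next_step(done))
```

Because the helper always walks the list from the top, a skipped earlier step is picked up before any later one is attempted.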

I wrote this How-To guide back when I was analyzing Win7 memory dumps: “How I Do It”
There have been some significant changes with Win8/8.1 – and I’m working on an update to it.