Thursday, 20 September 2012

How to Debug Kernel Mode Blue Screen Crashes (for beginners)


Perhaps the largest call generator for the Core Team is a request to determine the cause of a hard system crash that generates a blue screen and a memory dump file, commonly called a "Blue Screen of Death" (BSOD).  The vast majority of these memory dumps could be analyzed by Administrators in just a few minutes using the latest debugging tools.  These tools do most of the work for you, once they're set up.  Kernel mode debugging is a pretty specialized skill, with experienced debuggers throwing around lots of imponderable terms.  But it's really pretty simple, and I'll point out the gaffes you'll want to avoid as a beginner.
Keep in mind that the following is very basic (Debugging for Dummies, if you will).  If you're already familiar with !analyze and how to get there, this article is not for you.  Consider instead our sister website, NTDebugging (http://blogs.msdn.com/ntdebugging/).
Here's some terminology you should know before carrying on:
Blue screen
When the system encounters a hardware problem, data inconsistency, or similar error, it may display a blue screen containing information that can be used to determine the cause of the error. This information includes the STOP code and whether a crash dump file was created. It may also include a list of loaded drivers and a stack trace.
Crash dump file
You can configure the system to write information to a crash dump file on your hard disk whenever a STOP code is generated. The file (memory.dmp) contains information the debugger can use to analyze the error. This file can be as big as the physical memory contained in the computer.  By default, it's located in the Windows folder, and you CAN call them "memory dumps" without fear of offending anyone.
Debugger 
A program designed to help detect, locate, and correct errors in another program. It allows the user to step through the execution of the process and its threads, monitoring memory, variables, and other elements of process and thread context.
Kernel mode
The processor mode in which system services and device drivers run. All interfaces and CPU instructions are available, and all memory is accessible.
Minidump file
A minidump is a smaller version of a complete or kernel memory dump.  Usually Microsoft will want a kernel memory dump.  But the debugger will analyze a minidump and quite possibly give you the information needed to resolve the problem.  If it's all you have, then debug it, rather than waiting for the machine to crash again.  Open the file in the debugger (see below) just as you would open memory.dmp in the demonstration.
STOP code
The error code that identifies the error that stopped the system kernel from continuing to run.  It is the first hexadecimal value displayed on the blue screen.  At a minimum, frontline Admins should be required to note this code, the four other codes displayed in parentheses, and any drivers identified on the screen.  Often, this is all you really need!
Symbol files
All system applications, drivers, and DLLs are built such that their debugging information resides in separate files known as symbol files. Therefore, the system is smaller and faster, yet it can still be debugged if the symbol files are available.  You don't need to install the symbol files yourself - the debugger will automatically download the ones it needs from Microsoft's public site.
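The dump types named in the glossary above (complete, kernel, minidump) correspond to a registry setting called CrashDumpEnabled under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl.  As a rough sketch, here is how those values could be translated into plain English.  The helper name and the wording of the descriptions are made up for illustration; only the numeric values come from the registry setting itself.

```python
# Sketch: interpret the CrashDumpEnabled value stored under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl.
# DUMP_TYPES and describe_dump_setting are illustrative names, not a real API.

DUMP_TYPES = {
    0: "none (no dump is written)",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump (newer Windows versions)",
}


def describe_dump_setting(crash_dump_enabled: int) -> str:
    """Return a readable name for a CrashDumpEnabled value."""
    return DUMP_TYPES.get(
        crash_dump_enabled, f"unrecognized value {crash_dump_enabled}"
    )


if __name__ == "__main__":
    for value in (1, 2, 3):
        print(value, "->", describe_dump_setting(value))
```

If a server is set to 0, there will be nothing to debug after the next crash, so it's worth checking this before you need it.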
First, let's install the Debugger and Symbols.  You can debug a 64 bit dump on a 32 bit system, and you can debug a 32 bit dump on an x64 machine.  If you have an x64 machine, then you only need the x64 version to analyze any version of memory.dmp.  Many engineers prefer to use just the 32 bit version, since you'll still see the information necessary to determine the cause.
The sites below identify the system requirements, etc. you'll need for the debugger to work.  For our purposes, we'll assume you have an actual memory dump (memory.dmp) file.  If you don't, the rest is not going to be much fun.  You can open a memory dump over the network from a machine that's recently crashed.  Most times, though, it will make more sense to copy the dump file to your debugging machine.  Oh, and if you're wondering, you don't need a separate "debugging machine" - the debugger doesn't use much memory, and evil code from a memory dump can't sneak onto your machine and devour your movies and music.
For 32 bit, x86 debugging  
For 64 bit debugging
In this article I'll be using x64, but the examples will still apply to a 32 bit system.  You'll need to download the debugger and install it - accept the defaults.
By default, everything you need (for now) is installed here.
C:\Program Files\Debugging Tools for Windows (x64)
Note there's a help file (debugger.chm) that will be very useful as you advance your debugging skills.  You start the debugger from /Start /Debugging Tools for Windows /WinDbg.  This brings up the GUI mode of the Windows Debugger.  There's also a command-line version that can be started using kd.exe.  Unless you work for a driver developer, the GUI version is fine.  If you do work for a driver developer, never open the GUI mode unless you're ready for sneers behind your back.
The debugger opens to a big red window with nothing in it.  Assuming you have a memory.dmp file to be analyzed in your X:\crashes folder, you'll want to go to /File /Open Crash Dump and browse there.
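If you've been handed a file and aren't sure it's really a kernel memory dump, the first eight bytes will tell you.  The sketch below checks for the signatures that kernel dumps start with (PAGEDUMP for 32 bit, PAGEDU64 for 64 bit).  The MDMP check is for user-mode minidumps, which are a different format from the kernel minidumps discussed in this article.  The function names are my own, not part of any debugger API.

```python
# Sketch: classify a dump file by the signature at the start of the file.
# classify_dump and classify_dump_file are illustrative helper names.


def classify_dump(header: bytes) -> str:
    """Classify a dump from its leading bytes (at least 8 are needed)."""
    if header[:8] == b"PAGEDU64":
        return "64-bit kernel-mode dump"
    if header[:8] == b"PAGEDUMP":
        return "32-bit kernel-mode dump"
    if header[:4] == b"MDMP":
        return "user-mode minidump"
    return "unknown"


def classify_dump_file(path: str) -> str:
    """Read just the first 8 bytes of the file and classify them."""
    with open(path, "rb") as dump:
        return classify_dump(dump.read(8))


if __name__ == "__main__":
    # Fake headers stand in for real files so the example runs anywhere.
    print(classify_dump(b"PAGEDU64" + b"\x00" * 8))
    print(classify_dump(b"not a dump"))
```

On a real machine you would call classify_dump_file with the path to your own MEMORY.DMP.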
When you open the memory.dmp, another window will be launched and you'll see output similar to that below.  Note the errors about symbol files.
Loading Dump File [X:\Crashes\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available
Symbol search path is:
Executable search path is:
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for ntkrnlmp.exe -
Windows Server 2003 Kernel Version 3790 (Service Pack 2) MP (8 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 3790.srv03_sp2_gdr.080813-1204
Kernel base = 0xfffff800`01000000 PsLoadedModuleList = 0xfffff800`011d4140
Debug session time: Thu Oct 23 08:53:46.973 2008 (GMT-5)
System Uptime: 6 days 9:45:10.361
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for ntkrnlmp.exe -
Loading Kernel Symbols
..............................................................................................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 000007ff`fffde018).  Type ".hh dbgerr001" for details
Loading unloaded module list
............................................
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck D1, {0, c, 0, 0}
*** ERROR: Module load completed but symbols could not be loaded for mssmbios.sys
***** Kernel symbols are WRONG. Please fix symbols to do analysis.
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: nt!_KPRCB                                     ***
***                                                                   ***
*************************************************************************
*** ERROR: Module load completed but symbols could not be loaded for CLASSPNP.SYS
Obviously, we have a Symbols problem!  More importantly, this is our first experience of the debugger telling us what to do (or giving good hints).  You'll want to watch for these clues as you progress in debugging.  If you've heard people muttering about symbols and not being able to find the right ones, fear not!  Go to the command line at the bottom of the debugger window and type !symfix.
Most of the commands you'll use start with an exclamation point.  But don't call it that!  What you just typed is called "bang symfix."  What it does is connect the debugger to Microsoft's public symbol server on the internet (http://msdl.microsoft.com/download/symbols).  Note this isn't an ordinary web page; you can't browse it.  At this point, you'll need to save your workspace (give it a name in /File /Save Workspace).  Close WinDbg, then reopen it along with your workspace and your memory dump file.
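If you're curious what a symbol path looks like when you set it yourself (with the .sympath command or the _NT_SYMBOL_PATH environment variable) rather than letting !symfix do it, it's just a string in the form srv*cache*server.  Here's a minimal sketch that builds one.  The C:\Symbols cache folder is only an assumption for the example; use whatever local folder you like.

```python
# Sketch: build a symbol path string of the form srv*<local cache>*<server>.
# build_symbol_path is an illustrative helper; the cache folder is an assumption.

PUBLIC_SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def build_symbol_path(cache_dir: str, server: str = PUBLIC_SYMBOL_SERVER) -> str:
    """Return a symbol path that caches downloaded symbols in cache_dir."""
    return f"srv*{cache_dir}*{server}"


if __name__ == "__main__":
    print(build_symbol_path(r"C:\Symbols"))
```

The local cache means the debugger only downloads each symbol file once, which makes the second dump from the same server open much faster.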
This time, information will fly by and voila, you're debugging!  What you'll see in the debugger window will vary by the kind of Stop Code being debugged.  In this example, we're looking at a Stop 0x000000D1 (known to those in the know as a "Stop D1" - zeroes are ignored).  You should see something like the following.  If you get errors, or Symbols errors, for now, ignore them.
Microsoft (R) Windows Debugger Version 6.10.0002.229 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [X:\crashes\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available
Symbol search path is: http://msdl.microsoft.com/download/symbols
Executable search path is: srv*
Windows Server 2003 Kernel Version 3790 (Service Pack 2) MP (8 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 3790.srv03_sp2_gdr.080813-1204
Machine Name:
Kernel base = 0xfffff800`01000000 PsLoadedModuleList = 0xfffff800`011d4140
Debug session time: Thu Oct 23 08:53:46.973 2008 (GMT-5)
System Uptime: 6 days 9:45:10.361
Loading Kernel Symbols
...............................................................
...............................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 000007ff`fffde018).  Type ".hh dbgerr001" for details
Loading unloaded module list
............................................
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck D1, {0, c, 0, 0}
Debugger CompCtrlDb Connection::Open failed 80004005
PEB is paged out (Peb.Ldr = 000007ff`fffde018).  Type ".hh dbgerr001" for details
PEB is paged out (Peb.Ldr = 000007ff`fffde018).  Type ".hh dbgerr001" for details
Probably caused by : HpCISSs2.sys
Followup: wintriag
---------
At this point the debugger might give us a clue to what likely caused the problem, with the statement (which may not be present in your analysis), 
        Probably caused by :              
Then the problem file will be identified.   Nearly all bugchecks are caused by an incorrect driver (most manufacturers are pretty good about fixing flaws in their drivers).  You can fix this (again in most cases) by just obtaining the latest version of that driver (and related installation software) from the vendor.
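If you end up looking at a lot of dumps, the two lines that matter most in the output above (BugCheck and Probably caused by) are easy to pull out of saved debugger text.  The following is only a sketch: it assumes you've copied the debugger output into a text string, and the function name and dictionary keys are my own.

```python
# Sketch: extract the stop code, its four parameters, and the suspected
# module from saved debugger output. summarize() is an illustrative helper.
import re

BUGCHECK_RE = re.compile(
    r"^BugCheck[ \t]+([0-9A-Fa-f]+),[ \t]*\{([^}]*)\}", re.MULTILINE
)
CAUSE_RE = re.compile(
    r"^Probably caused by[ \t]*:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE
)


def summarize(log_text: str) -> dict:
    """Return the bugcheck code, arguments, and probable cause, if present."""
    result = {"code": None, "args": [], "probable_cause": None}
    bugcheck = BUGCHECK_RE.search(log_text)
    if bugcheck:
        # The code and all four parameters are printed in hexadecimal.
        result["code"] = int(bugcheck.group(1), 16)
        result["args"] = [
            int(arg.strip(), 16)
            for arg in bugcheck.group(2).split(",")
            if arg.strip()
        ]
    cause = CAUSE_RE.search(log_text)
    if cause:
        result["probable_cause"] = cause.group(1)
    return result


if __name__ == "__main__":
    sample = (
        "BugCheck D1, {0, c, 0, 0}\n"
        "Probably caused by : HpCISSs2.sys\n"
        "Followup: wintriag\n"
    )
    print(summarize(sample))
```

Remember that the Probably caused by line may be missing, in which case probable_cause stays empty and you move on to !analyze -v.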
If the debugger doesn't give this clue, or you suspect it's incorrect, the debugger tells you what to do...
        Use !analyze -v to get detailed debugging information.
In fact, you don't even have to type, just click on the !analyze -v with your mouse, and you're off and running again.  The debugger gives even more detailed information and a message of what to do next... 
7: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 0000000000000000, memory referenced
Arg2: 000000000000000c, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: 0000000000000000, address which referenced memory
Debugging Details:
------------------
PEB is paged out (Peb.Ldr = 000007ff`fffde018).  Type ".hh dbgerr001" for details
PEB is paged out (Peb.Ldr = 000007ff`fffde018).  Type ".hh dbgerr001" for details
READ_ADDRESS:  0000000000000000
CURRENT_IRQL:  c
FAULTING_IP:
+0
00000000`00000000 ??              ???
PROCESS_NAME:  vssrvc.exe
DEFAULT_BUCKET_ID:  DRIVER_FAULT
BUGCHECK_STR:  0xD1
TRAP_FRAME:  fffffadf238fc110 -- (.trap 0xfffffadf238fc110)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=00000000fff92000 rbx=0000000000000000 rcx=00000000c0000102
rdx=00000000000007ff rsi=0000000000000000 rdi=fffff80001031095
rip=0000000000000000 rsp=fffffadf238fc2a0 rbp=0000000000000007
r8=0004969a8262692a  r9=fffff800011b73e8 r10=0000000000000000
r11=fffffadf29aed450 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na pe nc
00000000`00000000 ??              ???
Resetting default scope
LAST_CONTROL_TRANSFER:  from fffff8000102e5b4 to fffff8000102e890
FAILED_INSTRUCTION_ADDRESS:
+0
00000000`00000000 ??              ???
STACK_TEXT: 
fffffadf`238fbf88 fffff800`0102e5b4 : 00000000`0000000a 00000000`00000000 00000000`0000000c 00000000`00000000 : nt!KeBugCheckEx [d:\nt\base\ntos\ke\amd64\procstat.asm @ 170]
fffffadf`238fbf90 fffff800`0102d547 : fffffadf`35519260 00000000`00008000 00000000`00000100 fffffadf`292ca8cf : nt!KiBugCheckDispatch+0x74 [d:\nt\base\ntos\ke\amd64\trap.asm @ 2122]
fffffadf`238fc110 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiPageFault+0x207 [d:\nt\base\ntos\ke\amd64\trap.asm @ 1006]
STACK_COMMAND:  kb
MODULE_NAME: HpCISSs2
IMAGE_NAME:  HpCISSs2.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  4600a3fe
POOL_CORRUPTOR:  HpCISSs2
FOLLOWUP_NAME:  wintriag
FAILURE_BUCKET_ID:  X64_POOL_CORRUPTION_HpCISSs2
BUCKET_ID:  X64_POOL_CORRUPTION_HpCISSs2
OCA_CRASHES:  854 (in last 90 days)
Followup: wintriag
---------
The debugger again tells you what to do: just click on HpCISSs2 to get details on the driver you should update, including its timestamp (shown in the output below).
7: kd> lmvm HpCISSs2
start             end                 module name
fffffadf`296f3000 fffffadf`29705000   HpCISSs2   (deferred)            
    Image path: HpCISSs2.sys
    Image name: HpCISSs2.sys
    Timestamp:        Tue Mar 20 22:18:22 2007 (4600A3FE)
    CheckSum:         00015F1F
    ImageSize:        00012000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
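The hexadecimal value in parentheses on the Timestamp line is the driver's build time, stored as the number of seconds since January 1, 1970 (UTC).  The debugger prints it in the time zone of the debugging session (GMT-5 here).  As a quick sketch, you can decode it yourself; the function name is just for illustration.

```python
# Sketch: decode the hex image timestamp shown by lmvm into a UTC date.
# decode_image_timestamp is an illustrative helper name.
from datetime import datetime, timezone


def decode_image_timestamp(hex_value: str) -> datetime:
    """Convert a hex seconds-since-1970 value into an aware UTC datetime."""
    return datetime.fromtimestamp(int(hex_value, 16), tz=timezone.utc)


if __name__ == "__main__":
    # 4600A3FE is the value from the lmvm output above.
    print(decode_image_timestamp("4600A3FE").isoformat())
    # prints 2007-03-21T03:18:22+00:00, i.e. Mar 20 22:18:22 2007 at GMT-5
```

A driver built in early 2007 on a server that crashed in late 2008 is a good hint that a newer version probably exists.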
To confirm, you should contact the manufacturer of this driver to see if they have any reported issues, and whether there's a replacement.  You can also search the Microsoft Knowledge Base, and one of the hits will be:
You receive a Stop error message after you install update 932755 or 941276
on an HP ProLiant server that is running Storport in Windows Server 2003
http://support.microsoft.com/default.aspx?scid=kb;EN-US;940015
The article explains exactly what you'll need to do to resolve the bugcheck problem.  It won't always be that easy, but usually a little intelligent searching on the internet (using the bugcheck code and the driver name) will lead you to a resolution.  If it doesn't, please open a case with us to confirm or identify the root cause.
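Once you've done this by hand a few times, the same steps can be scripted with the command-line debugger (kd.exe) mentioned earlier.  The sketch below only builds the command line; it uses kd's -z (dump file), -y (symbol path), -logo (log file) and -c (initial command) options.  The paths are placeholders, and the helper name is my own.

```python
# Sketch: build a kd.exe command line that opens a dump, runs !analyze -v,
# writes the output to a log file, and quits.
# build_kd_command is an illustrative helper; all paths are placeholders.


def build_kd_command(dump_path, symbol_path, log_path, kd_exe="kd.exe"):
    """Return the argument list for a one-shot, non-interactive analysis."""
    return [
        kd_exe,
        "-z", dump_path,         # dump file to open
        "-y", symbol_path,       # symbol path, e.g. srv*cache*server
        "-logo", log_path,       # write (overwrite) a log of the session
        "-c", "!analyze -v; q",  # analyze, then quit the debugger
    ]


if __name__ == "__main__":
    command = build_kd_command(
        r"X:\crashes\MEMORY.DMP",
        r"srv*C:\Symbols*http://msdl.microsoft.com/download/symbols",
        r"X:\crashes\analysis.log",
    )
    print(" ".join(command))
```

On a machine with the debugging tools installed, you could hand this list to subprocess.run and then read the log file at your leisure.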
If you're ready to venture out on your own, hit the helpfile and navigate to the Bug Check Code Reference.
Here, you'll find the information you need to begin debugging the code referenced.  For example, if you're analyzing a Stop A, you'll want to check out the advice under that bug check's entry in the help file.
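As a small illustration of how the short names ("Stop A", "Stop D1") relate to the full codes on the blue screen, here's a sketch of a lookup table.  It covers only a handful of common codes, and the function name is my own; the Bug Check Code Reference in the help file remains the complete list.

```python
# Sketch: turn a full stop code into its short form and symbolic name.
# Only a few well-known codes are listed; lookup_stop_code is illustrative.

BUG_CHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x7B: "INACCESSIBLE_BOOT_DEVICE",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}


def lookup_stop_code(text: str) -> str:
    """Accept forms like '0x000000D1' or 'D1' and return 'Stop D1: NAME'."""
    code = int(text, 16)  # leading zeroes (and an optional 0x) are ignored
    name = BUG_CHECK_NAMES.get(code, "see the Bug Check Code Reference")
    return f"Stop {code:X}: {name}"


if __name__ == "__main__":
    print(lookup_stop_code("0x000000D1"))
    print(lookup_stop_code("0x0000000A"))
```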
Further study:
