Thursday, 20 September 2012

Understanding the DiskRunChkdsk parameter in Windows 2008 and 2008R2 Failover Clusters



I’d like to share a quick tip for Windows Server Cluster administrators.

There may come a time, for whatever reason, that a Cluster-managed volume is flagged as dirty and you see an event message indicating that CHKDSK needs to run against the volume.  Just for a little background, the NTFS file system monitors the drive/partition at all times.  If it detects corruption, it flips a bit on the volume to mark it as dirty.  When a Clustered drive is brought online, the Cluster checks for the existence of this bit and spawns CHKDSK if it sees it.  You can check at any time whether a volume is dirty with the CHKNTFS command.

C:\> chkntfs z:
The type of the file system is NTFS.
Z: is not dirty.

C:\> chkntfs z:
The type of the file system is NTFS.
Z: is dirty.
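
If you want to check this from a script rather than by eye, here is a minimal sketch. Test-ChkntfsDirty is a hypothetical helper name of my own (it is not part of Windows), and it assumes the English output text shown above; a localized server would print different strings.

```powershell
# Hypothetical helper (not part of Windows): interprets the text printed by chkntfs.
# Assumes the English wording "is dirty" / "is not dirty" shown in the samples above.
function Test-ChkntfsDirty {
    param([string[]]$OutputLines)

    foreach ($line in $OutputLines) {
        # Check the "not dirty" wording first so the two cases cannot be confused.
        if ($line -match 'is not dirty') { return $false }
        if ($line -match 'is dirty')     { return $true }
    }
    throw "Could not find a dirty / not dirty line in the chkntfs output."
}

# Sample input copied from the second transcript above.
$sample = @(
    'The type of the file system is NTFS.',
    'Z: is dirty.'
)
Test-ChkntfsDirty -OutputLines $sample
```

On a real server you would pass the live output instead of the sample, for example: Test-ChkntfsDirty -OutputLines (chkntfs z:)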

In a best-case scenario, you can take the volume out of production, run CHKDSK on the volume if needed (refer to: http://technet.microsoft.com/en-us/library/cc772587.aspx), and then put the volume back into production.

In most situations, though, the volume that needs attention is a heavily utilized production volume, and having it offline for any length of time would be extremely disruptive.

For example, a recent case I was involved with had a 14 TB* (see note 1 below) volume that was being flagged for CHKDSK to run on it about once a month. The volume had about 9 TB of data on it. Apart from the concern of why the volume was continually being flagged as corrupt, the length of time that CHKDSK took to run on the volume was extremely painful for the customer’s business. When it ran initially, it took roughly 80 hours to complete a run on the volume.

It may be necessary to temporarily configure a problem volume to block CHKDSK from running against it while troubleshooting continues to determine why the volume is being flagged for CHKDSK to run.

I stress the word temporary here.

Turning off the health monitoring tool for the file system as a permanent solution could only lead to more downtime in the future.  You may end up on the phone with one of the File Systems experts on my team, such as Robert Mitchell.

Ok – so let’s talk specifics about temporarily blocking CHKDSK from doing work on a Cluster volume.

Say we have determined that we need to suspend CHKDSK from running on a problem volume. For you old school Cluster admins, the first command parameter that probably jumps to mind is SKIPCHKDSK.

This works just fine for Windows 2003 Server Clusters, but will NOT work for Windows 2008 and 2008R2 Failover Clusters.

If SKIPCHKDSK is used for a Clustered volume, it will be ignored when the disk is next brought online and CHKDSK will run. In a situation where the volume is 18 TB, the volume will remain unavailable for use until CHKDSK finishes* (see note 2 below).

The correct way to block CHKDSK from running on a volume is to use the DiskRunChkdsk parameter.  Keep in mind that the two parameters we are discussing apply only to the Cluster environment. If the machine is restarted, the OS may still prompt for CHKDSK to run on the affected volumes.

For information on how to configure the OS to ignore the dirty bit, refer to:

KB158675
How to Cancel CHKDSK After It Has Been Scheduled

Before walking through an example of setting the DiskRunChkdsk parameter, I first must explain what the values mean.  In Windows 2003 Server Clusters, the SKIPCHKDSK parameter was either 0x0 (disabled) or 0x1 (enabled).  In Windows 2008 and 2008R2 Failover Clusters, there are more settings, and what each one checks varies.

DiskRunChkDsk <0x0>: This is the default setting for all Failover Clusters. This policy will check the volume to see if the dirty bit is set and it will perform a Normal check of the file system. The Normal check is similar to running the DIR command at the root. If the dirty bit is set or if the Normal check returns a STATUS_FILE_CORRUPT_ERROR or STATUS_DISK_CORRUPT_ERROR, CHKDSK will be started in Verbose mode (Chkdsk /x /f).

DiskRunChkDsk <0x1>: This setting will check the volume to see if the dirty bit is set and it will perform a Verbose check. A Verbose check will scan the volume by traversing from the volume root and checking all the files of the file system. If the dirty bit is set or if the Verbose check returns a STATUS_FILE_CORRUPT_ERROR, CHKDSK will be started in normal mode (Chkdsk /x /f).

DiskRunChkDsk <0x2>: This setting will run CHKDSK in Verbose mode (Chkdsk /x /f) on the volume every time it is mounted.

DiskRunChkDsk <0x3>: This setting will check the volume to see if the dirty bit is set and it will perform a Normal check of the file system. The Normal check is similar to running the DIR command at the root. If the dirty bit is set or if the Normal check returns a STATUS_DISK_CORRUPT_ERROR, CHKDSK will be started in Verbose mode (Chkdsk /x /f), otherwise CHKDSK will be started in read only mode (Chkdsk without any switches).

DiskRunChkDsk <0x4>: This setting doesn’t perform any checks at all.

DiskRunChkDsk <0x5>: This setting will check the volume to see if the dirty bit is set and it will perform a Verbose check (scan the volume by traversing from the volume root and checking all the files) of the file system. If a problem is found, CHKDSK will not be started and the volume will not be brought online.
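
That is a lot to keep in your head, so here is the same list condensed into a lookup table you could paste into a script. This is only a sketch: the variable name $DiskRunChkdskPolicy and the function Get-DiskRunChkdskMeaning are hypothetical names of my own, and the descriptions are my shortened wording of the list above, not text taken from the product.

```powershell
# Condensed summary of the DiskRunChkdsk values described above.
# The names below are hypothetical; nothing here talks to the Cluster.
$DiskRunChkdskPolicy = @{
    0 = 'Default. Dirty bit + Normal check; on a problem, run Chkdsk /x /f.'
    1 = 'Dirty bit + Verbose check; on a problem, run Chkdsk /x /f.'
    2 = 'Always run Chkdsk /x /f when the volume is mounted.'
    3 = 'Dirty bit + Normal check; on a problem, run Chkdsk /x /f, otherwise run read-only Chkdsk.'
    4 = 'No checks at all.'
    5 = 'Dirty bit + Verbose check; on a problem, do not run CHKDSK and do not bring the volume online.'
}

function Get-DiskRunChkdskMeaning {
    param([int]$Value)

    # Reject anything outside the six values covered in this post.
    if (-not $DiskRunChkdskPolicy.ContainsKey($Value)) {
        throw "DiskRunChkdsk value $Value is not one of the values described in this post (0-5)."
    }
    return $DiskRunChkdskPolicy[$Value]
}

# Look up the value used later in this post.
Get-DiskRunChkdskMeaning -Value 4
```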

So now that we know what the various values do: to have CHKDSK never run during an online operation of the disk, we want to set DiskRunChkdsk to 0x4.

Here are the steps you can run through to accomplish this task.

Step 1: Determine the resource name as seen by Cluster
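
If you prefer the command line for this step, the following sketch lists the disk resources by name. It assumes a Windows 2008 R2 node with the FailoverClusters module loaded (the Windows PowerShell Modules shortcut should load it for you), and it assumes your disks use the standard "Physical Disk" resource type.

```powershell
# Assumption: run on a cluster node with the FailoverClusters module available.
Import-Module FailoverClusters

# List only the disk resources; the Name column is what the later steps need.
Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq "Physical Disk" }
```

The Name value returned here (for example "Cluster Disk 8") is what gets passed to the commands in the following steps.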

Step 2: Open either an Administrative command prompt or Windows PowerShell Modules and run the command:

C:\> cluster res "Cluster Disk 8" /priv DiskRunChkdsk=4

or
PS C:\> Get-ClusterResource "Cluster Disk 8" | Set-ClusterParameter DiskRunChkdsk 4

Note: The setting does not take effect until the disk is taken offline and brought back online.  Until then, the value is simply stored.

Step 4: Bring the disk offline, then online again.
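
The same offline/online cycle can be done from PowerShell. This is a sketch that assumes a Windows 2008 R2 node with the FailoverClusters module and reuses the example resource name from Step 2. Be aware that taking a disk offline is disruptive: resources that depend on the disk will likely go offline with it and may need to be brought back online afterwards, so plan it as you would any outage.

```powershell
# Assumption: run on a cluster node with the FailoverClusters module available.
Import-Module FailoverClusters

# Take the disk offline, then bring it back online so the stored value is applied.
Stop-ClusterResource -Name "Cluster Disk 8"
Start-ClusterResource -Name "Cluster Disk 8"
```

From a command prompt, the cluster.exe equivalents should be the /offline and /online switches on the same resource, in the same form as the /priv command in Step 2.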

Step 5: Verify the setting is applied

PS C:\> Get-ClusterResource "Cluster Disk 8" | Get-ClusterParameter DiskRunChkdsk

Object            Name             Value
------            ----             -----
Cluster Disk 8    DiskRunChkDsk    4 

Step 6: Actively start troubleshooting what could cause the volume to end up flagged dirty and needing CHKDSK.
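
Since this is meant to be temporary, it is worth writing down the way back at the same time. The sketch below assumes the same example resource name and a Windows 2008 R2 node with the FailoverClusters module. It returns the disk to the default value of 0 and then lists the value on every disk, so a forgotten 4 does not linger. As with the original change, the new value should only take effect after the disk has been cycled offline and online.

```powershell
# Assumption: run on a cluster node with the FailoverClusters module available.
Import-Module FailoverClusters

# Put the disk back on the default policy once troubleshooting is finished.
Get-ClusterResource "Cluster Disk 8" | Set-ClusterParameter DiskRunChkdsk 0

# Audit: show the current DiskRunChkdsk value for every disk resource.
Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq "Physical Disk" } |
    Get-ClusterParameter DiskRunChkdsk
```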

Footnotes:

Note 1: Running with volumes this large is not recommended. In my experience, once volumes exceed 2 TB in size, they rapidly become an administrative liability, especially in a situation where CHKDSK has to run against the volume. We strongly suggest that mount points be used to carve up larger volumes like this into more administratively friendly chunks. CHKDSK runs against mount points just fine, too.

Note 2: While it’s not recommended to interrupt CHKDSK while it’s running, an admin is not locked into having to let CHKDSK finish once it starts. The process can be terminated if absolutely required. However, we cannot guarantee that the end result will be positive. If the process is interrupted during the “magic moment” when CHKDSK is making changes, the results may be worse than the initial reason for the volume being flagged as corrupt.

Additional reading material related to the components and tools mentioned in this post:

KB947021
How to configure volume mount points on a server cluster in Windows Server 2008

The shared disk on Windows Server 2008 cluster fails to come online

FSUTIL utility; marking a volume dirty for testing

In summary: try to keep the size of your production volumes under control, be aware that command-line switches may not persist through all versions of a product, and continue being successful with Windows Server 2008!

I hope this post has been helpful!
