Thursday, 20 September 2012

Windows 2003 Server Cluster and Access Denied Errors


Hello Cluster fans, I am back for another go at it.  For this one, you will need to hark back to the days of Windows 2003 Server Clusters.  Many of you still have these running and may have seen Access Denied issues with node joins, the opening of Cluster Administrator, and Cluster Service failures.  We seem to be getting these types of problems in bunches and decided it was time to blog about them.  There are three different errors I will cover, but they all have the same cause.

Before beginning, I have to add the sales pitch to look into upgrading to Windows 2008 R2 Failover Clustering.  There are numerous updates, fixes, and enhancements over Windows 2003 Server Clustering.

The first issue is simply opening Cluster Administrator and getting an access denied error. 

When you open Cluster Administrator using the name of the Cluster (the default), it must authenticate because it connects via an RPC call to the Cluster Name.  When you open Cluster Administrator using a period (.), it makes a local LPC call to itself, so the authentication method is different and the connection succeeds.

The second issue you can see is a failure of a node joining the Cluster.  In the System Event Log, you will see these events:

Event ID:  1009
Source:  ClusSvc
Description:  Cluster service could not join an existing server cluster and could not form a new server cluster. Cluster service has terminated.

Event ID:  7031
Source:  Service Control Manager
Description:  The Cluster Service service terminated unexpectedly.  It has done this x time(s).  The following corrective action will be taken in 960000 milliseconds: Restart the service.

If you look in the Cluster Log of the node failing to join, you will see this:

INFO [CS] Cluster Service started - Cluster Node Version 4.3790
INFO      OS Version 5.2.3790 - Service Pack 2 (ADS 03000112L)
INFO [CS] Service Starting...
***
INFO [INIT] Attempting to join cluster CLUSTER-2003
INFO [JOIN] Spawning thread to connect to sponsor 1.1.1.1
INFO [JOIN] Spawning thread to connect to sponsor CLUSTERNODE01
INFO [JOIN] Asking 1.1.1.1 to sponsor us after delay of 0 milliseconds.
INFO [JOIN] Spawning thread to connect to sponsor 2.2.2.2
WARN [JOIN] Unable to get join version data from sponsor 1.1.1.1 using NTLM package, status 5.
WARN [JOIN] JoinVersion data for sponsor 1.1.1.1 is invalid, status 5.
INFO [JOIN] Asking CLUSTERNODE01 to sponsor us after delay of 1000 milliseconds.
WARN [JOIN] Unable to get join version data from sponsor CLUSTERNODE01 using NTLM package, status 5.
WARN [JOIN] JoinVersion data for sponsor CLUSTERNODE01 is invalid, status 5.
INFO [JOIN] Asking 2.2.2.2 to sponsor us after delay of 2000 milliseconds.
WARN [JOIN] Unable to get join version data from sponsor 2.2.2.2 using NTLM package, status 5.
WARN [JOIN] JoinVersion data for sponsor 2.2.2.2 is invalid, status 5.
INFO [JOIN] Got out of the join wait, CsJoinThreadCount = 1.
ERR  [JOIN] Unable to connect to any sponsor node.
WARN [INIT] Failed to join cluster, status 53

You would see this for every IP address and node name it tries to connect to.  Status 5 is an Access Denied error.
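If you want to pull the status codes out of a long Cluster Log quickly, something like the following rough sketch could help. It assumes the lines look like the trimmed excerpt above (a real Cluster Log line also carries a process/thread ID and timestamp prefix, which you would need to strip or allow for), and the function and variable names are made up for illustration. The two code names are the standard Win32 names for errors 5 and 53.

```python
import re

# Win32 error codes that show up in the join excerpt above.
WIN32_ERRORS = {
    5: "ERROR_ACCESS_DENIED",   # Access is denied
    53: "ERROR_BAD_NETPATH",    # The network path was not found
}

# Matches trimmed lines such as:
#   WARN [JOIN] ... using NTLM package, status 5.
STATUS_RE = re.compile(r"^(WARN|ERR)\s+\[(\w+)\].*\bstatus (\d+)")


def summarize_status_codes(log_text):
    """Return (level, component, code, name) for each WARN/ERR line with a status code."""
    results = []
    for line in log_text.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            code = int(match.group(3))
            name = WIN32_ERRORS.get(code, "unknown")
            results.append((match.group(1), match.group(2), code, name))
    return results


sample = """\
INFO [JOIN] Asking 1.1.1.1 to sponsor us after delay of 0 milliseconds.
WARN [JOIN] Unable to get join version data from sponsor 1.1.1.1 using NTLM package, status 5.
ERR  [JOIN] Unable to connect to any sponsor node.
WARN [INIT] Failed to join cluster, status 53
"""

for entry in summarize_status_codes(sample):
    print(entry)
```

A run of status 5 entries against every sponsor, followed by the final status 53, is the pattern to look for on the joining node.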

The third issue you could see is a node having its Cluster Service terminate unexpectedly.  In the System Event Log, you would see the same Event 7031 as shown above.  In the Cluster Log of this node, you could see something similar to:

WARN [EVT] EvtBroadcaster: EvPropEvents for node 1 failed. status 5
WARN [NM]  RpcExtErrorInfo: Error info not found.
WARN [EVT] EvtBroadcaster: EvPropEvents for node 1 failed. status 5
WARN [NM]  RpcExtErrorInfo: Error info not found.
WARN [EVT] EvtBroadcaster: EvPropEvents for node 1 failed. status 5
WARN [NM]  RpcExtErrorInfo: Error info not found.

And:

ERR [GUM] Update routine of type 1, context 0 failed with status 5
ERR [GUM] GumSendUpdate: Update on non-locker node(self) failed with 5 when it must succeed
ERR [CS]  Halting this node to prevent an inconsistency within the cluster. Error status = 5

In the Cluster Log of the node that stays running, you could see this:

INFO [GUM] GumSendUpdate: Dispatching seq 7222 type 1 context 4098 to node 2
INFO [GUM] GumSendUpdate: Locker updating seq 7222 type 1 context 4098
ERR  [GUM] GumUpdateRemoteNode: Failed to get completion status for async RPC call,status 5
ERR  [GUM] GumSendUpdate: Update on node 2 failed with 5 when it must succeed
WARN [NM] RpcExtErrorInfo: Error info not found.
ERR  [GUM] GumpCommFailure 5 communicating with node 2
WARN [NM] RpcExtErrorInfo: Error info not found.
INFO [NM] Received advice that node 2 has failed with error 5.
INFO [NM] Received advice that node 2 has failed with error 5.
ERR  [NM] Banishing node 2 from active cluster membership.

All of the above things can happen when all three of the following are true.

1.       The account used to start the Cluster Service has a password of fewer than 15 characters.

2.       The Network security: LAN Manager authentication level is set to “Send LM & NTLM responses” or “Send LM & NTLM - use NTLMv2 session security if negotiated”.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa
Lmcompatibilitylevel:  REG_DWORD:  0 or 1

3.       The Network security: Do not store LAN Manager hash value on next password change is enabled.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa
Nolmhash:  REG_DWORD:  1
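The three conditions above can be sketched as a quick check. This is only an illustration: the function names are hypothetical, the password length has to come from you (it cannot be read from the system), and the registry helper relies on Python's standard winreg module, so it only works on Windows.

```python
import sys

LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"


def access_denied_conditions_met(password_length, lm_compatibility_level, no_lm_hash):
    """Return True only when all three conditions from the list above hold."""
    short_password = password_length < 15              # condition 1
    sends_lm_response = lm_compatibility_level in (0, 1)  # condition 2
    lm_hash_not_stored = no_lm_hash == 1                # condition 3
    return short_password and sends_lm_response and lm_hash_not_stored


def read_lsa_dword(value_name):
    """Read a value under the Lsa key; return None if it is absent (Windows only)."""
    import winreg  # standard library, available only on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, value_name)
        except OSError:
            return None
    return value


print(access_denied_conditions_met(12, 1, 1))  # all three conditions true
print(access_denied_conditions_met(20, 1, 1))  # password is 15+ characters
print(access_denied_conditions_met(12, 2, 1))  # Send NTLM response only

if sys.platform == "win32":
    print("LmCompatibilityLevel:", read_lsa_dword("LmCompatibilityLevel"))
    print("NoLMHash:", read_lsa_dword("NoLMHash"))
```

Only the first call reports the problem combination. Changing any one input is enough to make the check return False.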

Instead of storing your user account password in clear-text, Microsoft Windows generates and stores user account passwords by using two different password representations, generally known as "hashes." When you set or you change the password for a user account to a password that contains fewer than 15 characters, Windows generates both a LAN Manager Hash (LMHash) and a Microsoft Windows NT hash (NT hash) of the password. These hashes are stored in the local Security Accounts Manager (SAM) database or in Active Directory.

If the Network security: Do not store LAN Manager hash value on next password change policy is enabled, no LMHash is stored for the Cluster Service account (CSA) in Active Directory.

When the CSA has a password of fewer than 15 characters and you join the second node, open Cluster Administrator, or updates occur between the nodes, the process generates the LMHash to build a session key for authentication.  Because no LMHash is stored in Active Directory, the Domain Controller cannot build a matching session key, so access is denied.  When you use a password that has 15 or more characters for the CSA, an LMHash cannot be generated.  Instead, the Windows NT password hash is used to derive the session key.  The Domain Controller is able to generate a matching session key, and the authentication succeeds.

So to resolve this, you need to change just one of the three settings above, since the problem occurs only when all three are true.  Once this is done, you should be good to go, with no more access denied errors.

One thing to note: if the Network security: Do not store LAN Manager hash value on next password change policy is enabled, then you must set your Network security: LAN Manager authentication level to “Send NTLM response only” or above.
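To make "or above" concrete, here is a small sketch that maps each Lmcompatibilitylevel value to the policy setting name used in Windows Server 2003 and checks the rule from the note above. The function name is hypothetical and only illustrates the rule.

```python
# Lmcompatibilitylevel values and the matching policy setting names.
LM_AUTH_LEVELS = {
    0: "Send LM & NTLM responses",
    1: "Send LM & NTLM - use NTLMv2 session security if negotiated",
    2: "Send NTLM response only",
    3: "Send NTLMv2 response only",
    4: "Send NTLMv2 response only\\refuse LM",
    5: "Send NTLMv2 response only\\refuse LM & NTLM",
}


def policy_combination_ok(lm_compatibility_level, no_lm_hash):
    """With Nolmhash enabled, the level must be 2 (Send NTLM response only) or above."""
    if no_lm_hash == 1:
        return lm_compatibility_level >= 2
    return True


for level in sorted(LM_AUTH_LEVELS):
    verdict = "OK" if policy_combination_ok(level, 1) else "change needed"
    print(level, LM_AUTH_LEVELS[level], "->", verdict)
```

With Nolmhash set to 1, levels 0 and 1 are flagged and levels 2 through 5 pass.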

If you decide to go with the Cluster Service password change, you can use the /CHANGEPASS switch of the cluster command so that the Cluster Service does not need to be taken down in production.

KB305813            


We have two articles that talk about this in a little more detail that you can also refer to:

KB823659            


KB828861            


This information should resolve most if not all of the access denied problems you could be receiving.

Happy Clustering !!
