VMware Site Recovery Manager – Connection Error 503

A few days ago I was presented with a Site Recovery Manager 5.5.1 install that was not functioning. The problem was first noticed when trying to log in to SRM via the vSphere C# Client:

Connection Error: Lost connection to SRM Server [Hostname:Port]

The Server [Hostname] could not interpret the client’s request. (The remote server returned an error: (503) Server Unavailable.)

Connecting…

Usually when I see this, I simply restart the SRM service. On this occasion, however, the SRM service would start and then stop again after a few seconds.

This led me to review the SRM logs which are available here:

C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs

Take a look at the latest Vmware-dr-xxx log, where xxx is an incrementing number.
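Picking the newest log by hand is easy to get wrong when there are many rotated files. The following is a minimal sketch that returns the most recently modified match. The function name latest_log and the glob pattern "vmware-dr-*" are assumptions for illustration, so adjust the pattern to whatever names you see in your Logs folder.

```python
from pathlib import Path
from typing import Optional


def latest_log(log_dir: str, pattern: str = "vmware-dr-*") -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None.

    Assumption: the rotated SRM logs share a common prefix that the glob
    pattern can match; the default pattern is a guess, not a documented name.
    """
    candidates = [p for p in Path(log_dir).glob(pattern) if p.is_file()]
    # Modification time is used instead of the number in the file name,
    # because the numbering can wrap around when logs rotate.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

Call it with the Logs folder shown above, for example latest_log(r"C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs"), and open the returned path.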

At the end of the log I noticed the following panic:

[error 'authorize'] [Auth] Failed to initialize: <class Vmomi::Fault::SystemError::Exception(vmodl.fault.SystemError)>
[error 'authorize'] Failed to initialize security
[info 'Default'] CoreDump: Writing minidump
[panic 'Default']
-->
--> Panic: Assert Failed: "rc" @ d:/build/ob/bora-1964818/srm/src/authorization/authorize.cpp:136
--> Backtrace:
--> backtrace[00] rip 00007ffeb8f8a04a
--> backtrace[01] rip 00007ffeb8e4f51f
--> backtrace[02] rip 00007ffeb8e506be
--> backtrace[03] rip 00007ffeb8fa172c
--> backtrace[04] rip 00007ffeb8fa188c
--> backtrace[05] rip 00007ffeb8e40ae0
--> backtrace[06] rip 0000000003141d63
--> backtrace[07] rip 0000000140043741
--> backtrace[08] rip 00000000046c42b6
--> backtrace[09] rip 00000000046c52c5
--> backtrace[10] rip 00000000510c2fdf
--> backtrace[11] rip 00000000510c3080
--> backtrace[12] rip 00007ffef01716ad
--> backtrace[13] rip 00007ffef1164409
-->

The following line in the log pointed to authorization being the cause of the issue:

[error 'authorize'] Failed to initialize security
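Rather than scrolling to the end of a large log, you can filter it for error and panic entries. This is a rough sketch based only on the line shapes quoted in this post (a level word followed by a quoted component name); the helper name find_failures is hypothetical, and real log lines may carry extra fields such as timestamps, which the pattern tolerates by searching anywhere in the line.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g.  error 'authorize'  or  panic 'Default'.
# Both straight and typographic single quotes are accepted, since logs
# pasted through a blog or mail client often have their quotes converted.
LEVEL_RE = re.compile(r"\b(error|panic)\s+['\u2018]([^'\u2019]*)['\u2019]")


def find_failures(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (level, component, full line) for each error or panic entry."""
    hits = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2), line.strip()))
    return hits
```

Because the function accepts any iterable of strings, an open file object can be passed directly, so the whole log does not need to be loaded into memory first.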

When the SRM service starts up, the vCenter Server is prompted to check all logins to the VC. Knowing this, I then checked the VPXD logs on the vCenter Server. For vCenter Server 5.5 on Windows they are located here:

C:\ProgramData\VMware\VMware Virtual Center\Logs

In this instance, SRM was not actually using an AD account; it was set up to use a local SSO account with the administrator role. This means that the issue can occur even if SRM itself does not use any AD identity sources.

Open the vpxd-xxx log file and search around the time you started the SRM service.

In my log I found that, sure enough, the VC was looping through all the users and groups with rights to log in to the VC, and the last one was failing:

[ info '[SSO]' opID=5b78d12a] [UserDirectorySso] GetUserInfo(DOMAIN\Domain Admins, true)
[ info '[SSO][SsoAdminFacadeImpl]' opID=5b78d12a] [Lookup]
[error '[SSO]' opID=5b78d12a] [UserDirectorySso] GetUserInfo exception: class Vmacore::Authorize::AuthUserUnresolvedException(Group DOMAIN\Domain Admins, cause: class Sso::Fault::InternalFault::Exception(sso.fault.InternalFault))
[error 'Default' opID=5b78d12a] Error accessing directory: Group DOMAIN\Domain Admins cause: class Sso::Fault::InternalFault::Exception(sso.fault.InternalFault)
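If the vpxd log is long, a small script can pull out which users or groups failed to resolve. The sketch below keys on the AuthUserUnresolvedException text exactly as it appears in the excerpt above; the function name unresolved_principals is made up for this example, and I have only seen the "Group" form of the message, so treat other kinds (such as "User") as an assumption.

```python
import re
from typing import Iterable, List, Tuple

# Captures the kind (e.g. "Group") and the name (e.g. DOMAIN\Domain Admins)
# from the exception text, stopping at the ", cause:" that follows the name.
UNRESOLVED_RE = re.compile(r"AuthUserUnresolvedException\((\w+) (.+?), cause:")


def unresolved_principals(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return unique (kind, name) pairs that vCenter failed to resolve."""
    seen: List[Tuple[str, str]] = []
    for line in lines:
        match = UNRESOLVED_RE.search(line)
        if match:
            entry = (match.group(1), match.group(2))
            if entry not in seen:  # keep first-seen order, drop repeats
                seen.append(entry)
    return seen
```

Each name it returns is a candidate to remove from the VC permissions and add back in.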

The GetUserInfo exception implies that the VC hits some form of error when checking the Domain Admins group. I logged in to the VC, removed the group from the VC permissions list and went to add it back in, but on doing so I was presented with multiple instances of the same group name to select from. I realised immediately that this was because there were multiple domains in the AD forest!

I decided to change the base DN on the Active Directory identity source for the VC so that the other groups were not available to the VC, thus ensuring that the VC was not confused as to which group it should use.

Log in to the vSphere Web Client with an account that has the administrator role. Navigate to Home > Configuration. Select the identity source that applies and edit it:

Edit the Base DN for groups (if the issue was with a group) so that it points to the OU or container that directly contains the problem group.

In this scenario I changed it from dc=domain, dc=local to cn=users, dc=domain, dc=local

Of course, doing this means that you will not be able to log in to the VC if you are using groups for your VC permissions and a group sits outside of this OU. There are likely more elegant workarounds.
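Before narrowing the base DN, it is worth checking which of the groups used in your VC permissions would fall outside it. The sketch below compares distinguished names as plain strings; the function name is_under_base_dn and the example group DNs are hypothetical, and the parsing is deliberately simplified (it splits on every comma, so DNs containing escaped commas are not handled).

```python
from typing import List


def _rdns(dn: str) -> List[str]:
    """Split a DN into lower-cased components, ignoring spaces after commas.

    Simplification: escaped commas inside a component are not supported.
    """
    return [part.strip().lower() for part in dn.split(",") if part.strip()]


def is_under_base_dn(entry_dn: str, base_dn: str) -> bool:
    """Return True if entry_dn is the base DN itself or sits below it."""
    entry, base = _rdns(entry_dn), _rdns(base_dn)
    # An entry is under the base when the base is a suffix of the entry's DN.
    return len(entry) >= len(base) and entry[len(entry) - len(base):] == base


# Hypothetical group DNs, checked against the new base DN from this post.
new_base = "cn=users, dc=domain, dc=local"
print(is_under_base_dn("CN=Domain Admins,CN=Users,DC=domain,DC=local", new_base))
print(is_under_base_dn("CN=SRM Admins,OU=Groups,DC=domain,DC=local", new_base))
```

Any group for which the check is false would no longer be visible to the VC through this identity source once the base DN is changed.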

Now, if you have not already done so, remove the group from the VC permissions and add it back in.

Start the SRM service and the issue should be resolved.

 

Unfortunately, I could not find any KBs for this issue. This environment was running Site Recovery Manager 5.5.1 and vCenter Server 5.5U1.
