Where, oh where, did my AD site go... [Alternate title: It's the DNS, stupid.]
I recently had a very confusing issue arise at one of my Exchange 2007 clients and I decided to share it with you. At this particular company, an Active Directory site is reserved for Exchange, and there are two domain controllers (both global catalog servers) in that AD site. The front-end is two CAS (CAS1 and CAS2) load-balanced by an ISA enterprise array with a CCR backend.
The week before, we had replaced all of the domain controllers in the forest with Windows Server 2008 R2 domain controllers, and bumped both the domain functional level and the forest functional level to Server 2008 R2 (we are going to enable the AD Recycle Bin). The new DCs replaced the old DCs and kept the original IP addresses.
That's the setup.
An onsite technician was applying patches late one night (good for him!). Unfortunately, he patched and rebooted both of the Exchange AD site DCs at the same time (bad for him!). As you may already know, that makes Exchange very unhappy. System Center Operations Manager is also running in the environment, and it immediately started to generate alerts about the missing domain controllers.
Sidebar: In Exchange 2003 and above, Exchange executes an Active Directory Topology discovery every 15 minutes. The specifics vary between versions of Exchange, but suffice it to say that, within 15 minutes, Exchange will find another DC/GC set (if they exist). In that case, your best bet is just to wait out that 15 minutes.
The technician reacted to the alerts from OpsMgr by rebooting the Client Access Servers. They both found out-of-site DCs and began working.
Then, the fun began. When the in-site DCs came back online (just a few minutes later), CAS1 reassociated with the in-site DCs and reset its secure channel to one of the in-site DCs. CAS2 did not.
The symptom was that all users connected to CAS1 through ISA were fine. However, the users that ISA connected to CAS2 were redirected to the same URL that they had already used, and CAS-to-CAS proxying did not work, so they couldn't access any Exchange services: OWA, Exchange, anything. Quick workaround: remove CAS2 from the webfarm and RPC publishing in ISA so that everything is routed through CAS1. However, that leaves the front end with no redundancy.
Why this problem happened - I don't know. The NetLogon service is responsible for maintaining the AD site a computer identifies itself with and maintaining the secure channel to a proper DC. However, for CAS2, NetLogon refused to reassociate to an in-site DC.
NetLogon bases site affinity on DNS. Both servers, CAS1 and CAS2, were configured identically for DNS. NetLogon uses a Windows API call named DsGetSiteName. In Windows Server 2008 and Windows Server 2008 R2 (and in Windows 7), you can use the nltest.exe utility to check this value. To wit:
PS C:\> nltest.exe /dsgetsite
The command completed successfully
Sidebar: nltest is available for Windows Server 2003 and Windows Server 2003 R2 as well; you just have to download and install the Windows Support Tools.
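If you want to watch for the site flipping, rather than running the command by hand every so often, the check is easy to script. Here is a minimal Python sketch of the parsing side. It assumes that nltest prints the site name on the first non-empty line, followed by the "The command completed successfully" line; the function names and the site name "Exchange-Site" are made up for illustration.

```python
def parse_dsgetsite(output):
    """Return the site name from the text printed by `nltest /dsgetsite`.

    Assumption: the site name is the first non-empty line and the last
    line is the 'The command completed successfully' message.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    success = "the command completed successfully"
    if len(lines) < 2 or not lines[-1].lower().startswith(success):
        raise ValueError("nltest did not report a site name and success")
    return lines[0]


def in_expected_site(output, expected_site):
    """True if the reported site matches the expected one, ignoring case."""
    return parse_dsgetsite(output).lower() == expected_site.lower()


if __name__ == "__main__":
    # Hypothetical sample text; on a real server this would be the
    # captured output of nltest.exe /dsgetsite.
    sample = "Exchange-Site\nThe command completed successfully\n"
    print(parse_dsgetsite(sample))
    print(in_expected_site(sample, "exchange-site"))
```

On the server itself, the text could come from something like subprocess.run(["nltest.exe", "/dsgetsite"], capture_output=True, text=True).stdout, run on a schedule so that a site change is noticed before the users notice it.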
NetLogon does its check-and-reset once an hour, and upon startup. Once you know that, it should be easy to just restart the NetLogon service, right? Well, that didn't make any difference.
So, we have the capability of forcing a particular secure channel for a server, and this also will set its AD site. To wit:
PS C:\> netdom.exe reset cas2 /server:DsEx2
The secure channel from CAS2 to DOMAIN was reset.
The command completed successfully.
Note: nltest.exe has this functionality too, but netdom.exe has been around longer and I was more familiar with its parameters. See the SC_RESET parameter to nltest.
The AD site is updated, the secure channel is updated, and everything looks great. I declare success, put the server back into ISA, and move on.
An hour later, the client calls and says it is broken again.
Well, he's right. The AD site has flipped back again and CAS2 is thus not operating properly. Obviously this has happened because NetLogon did its cycle.
OK. Now it's time to buckle down. AD sites are based on DNS. We know that. So, I ran dcdiag on all the servers, and replmon on all the servers. Everything was clean.
But then I visually examined the DC locator records in DNS, and... I found an extra one.
During the process of standing up all the new DCs, and configuring the new DCs with old permanent IP addresses, the OLD DCs ended up with the temporary IP addresses of the new DCs. Then, the old DCs were demoted.
All of the DCs but one cleaned up after themselves when they were demoted. The extra locator record belonged to the one that did not, and, shockingly, it was now stale DNS.
The fix? Remove the stale DC locator record. Reset the secure channel again, just to ensure it gets to the right place.
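Eyeballing the locator records works, but the comparison itself is mechanical. Here is a rough Python sketch of the idea: build the names of the SRV records that clients query (domain-wide and per-site), then compare the hosts those records point at against the DCs you know are alive. The function names, the domain, and the host names are hypothetical, and the sketch leaves the DNS lookup itself out; you would paste in the SRV targets from the DNS console or nslookup.

```python
def locator_names(domain, site):
    """SRV record names used to locate DCs, domain-wide and for one site."""
    return [
        "_ldap._tcp.dc._msdcs.{0}".format(domain),
        "_ldap._tcp.{0}._sites.dc._msdcs.{1}".format(site, domain),
    ]


def normalize(host):
    # DNS names are case-insensitive and may carry a trailing dot.
    return host.strip().rstrip(".").lower()


def stale_targets(srv_targets, live_dcs):
    """Return the SRV targets that match no known live DC, sorted."""
    live = {normalize(h) for h in live_dcs}
    return sorted({normalize(t) for t in srv_targets} - live)


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    for name in locator_names("example.com", "Exchange"):
        print(name)
    targets = ["DsEx1.example.com.", "dsex2.example.com", "olddc.example.com."]
    live = ["dsex1.example.com", "DSEX2.example.com"]
    print(stale_targets(targets, live))
```

Anything this reports is only a candidate for cleanup. Confirm that the DC really has been demoted before deleting its records.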
And Voila! It's fixed.
If you've ever been to one of my installation seminars or read many of my articles, you know that I talk about the importance of DNS in both Active Directory and Exchange Server. Here, yet again, is an example of that. Sometimes, you just have to look in the right place to find the problem.
Until next time...
If there are things you would like to see written about, please let me know.