Wednesday, February 21, 2007

Wrong Time on a Domain Controller

Today we had to shut down all our servers for electrical maintenance, and while they were starting back up, the clock was set incorrectly on one of our two domain controllers. I am not completely sure why this happened, but the Tardis Internet time service seems to be at fault. One way or another, the time was reset to Jan 1st, 2000. All hell broke loose.

First, the wrong time was picked up by some other computers, which suddenly shifted to the year 2000. That was the least of our problems. Second, users started having trouble logging in to the domain: the faulty domain controller denied access due to clock skew. Both problems were resolved once we noticed the wrong time and reset the clock on the domain controller to the correct value. But this was not the end.
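To give a feel for why logins failed, here is a minimal sketch of a clock skew check. It assumes the common Kerberos default tolerance of 5 minutes; the function name and the sample timestamps are made up for illustration and are not taken from our logs.

```python
from datetime import datetime, timedelta

# Assumption: 5 minutes is the usual default Kerberos clock skew tolerance.
# The value in a given domain is set by policy and may differ.
MAX_SKEW = timedelta(minutes=5)

def within_skew(client_time, dc_time, max_skew=MAX_SKEW):
    """Return True if the two clocks are close enough for authentication."""
    return abs(client_time - dc_time) <= max_skew

# Hypothetical timestamps: a workstation with the right time and a
# domain controller whose clock was reset to the start of 2000.
client = datetime(2007, 2, 21, 9, 0)
bad_dc = datetime(2000, 1, 1, 0, 5)

print(within_skew(client, bad_dc))
```

With a gap of about seven years, the check fails by a wide margin, which matches the access-denied behaviour we saw.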

Unfortunately, while the faulty domain controller (we'll call it DCB) was in the year 2000, it replicated with the other controller (let's call it DCA), which had the correct time. DCA then remembered that it had last replicated with DCB in the year 2000! As a result, when the time on DCB was corrected, DCA no longer wanted to replicate with DCB, thinking that DCB was far too outdated to replicate with again. Attempts to force replication in the replmon utility (from Windows Support Tools) resulted in the error "The Active Directory cannot replicate with this server because the time since the last replication with this server has exceeded the tombstone lifetime." The Directory Service event log on DCA showed Event 2042, which read:

It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.

The reason that replication is not allowed to continue is that the two machine's views of deleted objects may now be different. The source machine may still have copies of objects that have been deleted (and garbage collected) on this machine. If they were allowed to replicate, the source machine might return objects which have already been deleted.
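The arithmetic behind this refusal can be sketched in a few lines. This is only an illustration of the comparison the event text describes, not how Active Directory implements it. The 60-day tombstone lifetime is an assumption (a common default for older forests; newer ones often use 180 days), and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: 60 days. Check the actual tombstone lifetime of your forest.
TOMBSTONE_LIFETIME = timedelta(days=60)

def replication_blocked(last_success, now, tombstone_lifetime=TOMBSTONE_LIFETIME):
    """Return True if the partner looks too stale to replicate with."""
    return now - last_success > tombstone_lifetime

# Last replication recorded while the clock was in 2000, checked in 2007.
last_success = datetime(2000, 1, 1)
now = datetime(2007, 2, 21)

gap = now - last_success
print(gap.days, "days since last recorded replication")
print("blocked:", replication_blocked(last_success, now))
```

Whether the lifetime is 60 or 180 days makes no difference here: a recorded gap of several years exceeds either value.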

This problem soon had consequences. One user reported that her workstation could no longer connect to shared resources on the DCB controller, with the error message "Target principal is incorrect". It was a strange situation: her workstation could connect to every other server on the network, and every other workstation could connect to DCB. Although I still don't have a definite explanation for this, my best guess is that this workstation had the bad luck to automatically change its domain account password while replication was broken. As a result, DCA had the new password, but DCB still had the old one and thus could not authenticate the workstation.

To resolve the replication problem I followed the advice from the Event 2042 log entry and numerous sources on the internet, such as the "Event ID 2042: It has been too long since this machine replicated" chapter of the Windows Server 2003 Active Directory Operations guide on Microsoft TechNet. On DCB I went to the registry key HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters and created a DWORD value named Allow Replication With Divergent and Corrupt Partner, set to 1. Then I forced replication from DCA to DCB in replmon. This time replication worked, and afterwards I reset the registry value to 0.
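I made the registry change by hand in the registry editor, but the same change can be expressed as a reg add command. The sketch below only builds and prints the command lines; it does not touch the registry. The helper name is hypothetical, and the commands would have to be run on the domain controller itself with administrative rights.

```python
import subprocess

# Key and value name as given in the post.
NTDS_PARAMETERS = r"HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"
VALUE_NAME = "Allow Replication With Divergent and Corrupt Partner"

def reg_add_command(enabled):
    """Build (but do not run) a reg add command that sets the DWORD to 1 or 0."""
    data = "1" if enabled else "0"
    return ["reg", "add", NTDS_PARAMETERS,
            "/v", VALUE_NAME,
            "/t", "REG_DWORD",
            "/d", data,
            "/f"]  # /f overwrites an existing value without prompting

# Turn the override on before forcing replication, then off again afterwards.
enable_cmd = reg_add_command(True)
disable_cmd = reg_add_command(False)

print(subprocess.list2cmdline(enable_cmd))
print(subprocess.list2cmdline(disable_cmd))
```

Generating the "off" command alongside the "on" command is deliberate: the override should stay enabled only for as long as the forced replication takes.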

Still, replication in the other direction, from DCB to DCA, didn't work. The Directory Service event log on DCB showed Event 1988, which read:

Active Directory Replication encountered the existence of objects in the following partition that have been deleted from the local domain controllers (DCs) Active Directory database. Not all direct or transitive replication partners replicated in the deletion before the tombstone lifetime number of days passed. Objects that have been deleted and garbage collected from an Active Directory partition but still exist in the writable partitions of other DCs in the same domain, or read-only partitions of global catalog servers in other domains in the forest are known as "lingering objects".

This event is being logged because the source DC contains a lingering object which does not exist on the local DCs Active Directory database. This replication attempt has been blocked.

The best solution to this problem is to identify and remove all lingering objects in the forest.

Additionally, the Application event log was flooded with Event 1053 entries that read "Windows cannot determine the user or computer name. (Access is denied.). Group Policy processing aborted."

On the edge of panic, I searched for suggestions on the Event 1053 error. In retrospect, I'm not sure I did the right thing, but I followed the advice from Microsoft KB article 288167 (or 260575) to reset DCB's computer account password with the netdom utility. After the password was reset and the computer rebooted, the problem of the workstation that could not connect to DCB's shared resources disappeared.

It remained to restore replication from DCB to DCA, though. I followed the "Event ID 1388 or 1988: A lingering object is detected" chapter of the Windows Server 2003 Active Directory Operations guide on Microsoft TechNet (there are also a few KB articles that you can find if you search for "lingering objects" at support.microsoft.com). Using repadmin /removelingeringobjects on DCA, as described there, I removed the objects that were blocking replication, and the issue was finally resolved.
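For reference, the general shape of that repadmin call is: the DC to clean up, the DSA GUID of a DC to compare against, the directory partition, and optionally /advisory_mode to only report what would be removed. The sketch below just assembles such a command line. The helper name, the all-zero GUID, and the DC=example,DC=com partition are placeholders, not the values I used, and which DC plays which role depends on where the lingering objects live, so treat the TechNet chapter as the authority.

```python
import subprocess

def remove_lingering_objects_command(target_dc, reference_dc_guid, partition,
                                     advisory=True):
    """Build (but do not run) a repadmin /removelingeringobjects command."""
    cmd = ["repadmin", "/removelingeringobjects",
           target_dc, reference_dc_guid, partition]
    if advisory:
        # Advisory mode only logs what would be removed; nothing is deleted.
        cmd.append("/advisory_mode")
    return cmd

# Placeholder values for illustration only.
PLACEHOLDER_GUID = "00000000-0000-0000-0000-000000000000"
PLACEHOLDER_PARTITION = "DC=example,DC=com"

dry_run = remove_lingering_objects_command("DCA", PLACEHOLDER_GUID,
                                           PLACEHOLDER_PARTITION)
real_run = remove_lingering_objects_command("DCA", PLACEHOLDER_GUID,
                                            PLACEHOLDER_PARTITION,
                                            advisory=False)

print(subprocess.list2cmdline(dry_run))
print(subprocess.list2cmdline(real_run))
```

Running in advisory mode first and reading the resulting event log entries would have told me which objects were about to be removed before anything was deleted.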

In retrospect, I don't fully understand how those lingering objects came about or whether it was safe to remove them. Perhaps I should have disabled Strict Replication Consistency instead, as described in the same TechNet article and in the Event 1988 log entry.

To wrap up this post: don't mess with time. It's important.

1 comment:

aggybong said...

Thanks! We accidentally brought up a DC with the wrong time and had the same issues. I wasn't receiving the exact same issues you were, but setting the registry key to the value you gave on both servers and then forcing a replication on each seems to have solved all issues.