Ok…for anyone interested, I just got back from flying out to the customer’s site and I’ve ‘fixed’ the problem. It was pretty strange. Basically there were a number of things going on at once, so troubleshooting was difficult. The first thing was that someone (a kid most likely) had looped one of the building’s switches, which was why that building was unable to get phones or network (I’ve since enabled spanning tree on all the switches in all the buildings, so hopefully this won’t happen again). I found this fairly early on and corrected the problem, but it didn’t seem to have any effect on the DHCP issue…it just made it so that the building in question would work at all (it was not working even with statics before this).
Next, the DHCP problem seems to have been some weird caching issue. I tried taking down all the servers and bringing them back up. No effect. Tried going into DHCP and clearing out all of the leases. Nothing. Took down the entire infrastructure and brought that back up. Nothing. You would go to one of the machines that wasn’t working and, sure enough, it wasn’t getting an IP address. Using ipconfig /release and ipconfig /renew…nada. Turn the machine off and back on…the same. No address. However, if I plugged my laptop into the same port I’d get an address on the proper VLAN with no issues. Plug the cable back into the original workstation and…works fine. No issues. :smack: It was just luck that I tried this, since I was just playing around before I used the nuclear option on the domain and did an image restore from last week.
I’m speculating that Windows caches the old reservations between the workstations and the server somehow (my guess…the workstation has it cached and when it makes a request for a new address it tells the server what its old address was…if the address is available the server just gives it that one back). For some reason this was preventing the workstations from getting new addresses. It was almost like port security was on or something (though only for DHCP)…really weird.
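For what it’s worth, the “client tells the server its old address” part is ordinary DHCP behavior (the Requested IP Address option, option 50), not anything Windows-specific. Here is a toy sketch of how a server might decide what to offer. The function name, the in-memory lease table and the addresses are all made up for illustration, and this is not how Windows DHCP is actually implemented:

```python
def choose_offer(pool, leases, client_mac, requested_ip=None):
    """Pick an address to offer; toy model, not Windows DHCP internals.

    pool: list of address strings in the scope (hypothetical)
    leases: dict mapping address string -> MAC that currently holds it
    """
    # Honor the client's requested (previous) address if it is in scope
    # and either free or already leased to this same client.
    if requested_ip in pool and leases.get(requested_ip) in (None, client_mac):
        return requested_ip
    # Otherwise hand out the first free address in the scope.
    for addr in pool:
        if addr not in leases:
            return addr
    return None  # scope exhausted


# Hypothetical scope with one address already leased to another machine.
pool = ["10.1.0.10", "10.1.0.11"]
leases = {"10.1.0.10": "aa:aa:aa:aa:aa:aa"}
offer = choose_offer(pool, leases, "bb:bb:bb:bb:bb:bb", requested_ip="10.1.0.11")
```

None of this explains why the workstations got nothing at all, it only shows that a client asking for its previous address is expected.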
Anyway, we found out, through trial and error, that simply unplugging the Ethernet cable from the machine and plugging it back in ‘fixed’ the problem. Taking down the switch the machine was plugged into didn’t work. Taking down the switch and powering down the machine didn’t work. But unplugging the Ethernet cable while the machine was on and then plugging it back in works. Unreal. There are still some issues: for some reason the machines that are now getting addresses can’t ping each other, even though they can ping the gateway, firewall and domain servers. The machines that were working before can ping each other, and the newly fixed machines can ping those, but not vice versa. Again…really weird. It’s not affecting service, though, since the workstations can get to everything they need to get to with no problem, so I’m letting that go for now and will look at it more in-depth when I’m out there for my next scheduled trip.
Just to answer some of the other posts in the thread:
[QUOTE=Quartz]
Odds are there’s a rogue DHCP server out there. Has someone enabled ICS? Has someone accidentally bridged in an ADSL router?
[/QUOTE]
Wouldn’t work. You’d need to have brought up a rogue DHCP server on every subnet, which didn’t seem reasonable. The way DHCP/BootP works in a routed network is that you need a UDP helper address that tells the router where to forward your DHCP/BootP packets to. So, if you brought up a rogue router or device putting out DHCP at, say, building 1, it would only affect the users of building 1, since you’d have no UDP helper statement to get that rogue device’s DHCP/BootP packets to the other segments. This, whatever it was, was affecting the entire campus…about 8 separate routed networks.
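To put that reasoning in concrete terms, here is a rough model of which DHCP servers can even see a client’s broadcast. It assumes one helper address per router interface; the VLAN numbers and server names are hypothetical, not the customer’s real setup:

```python
def servers_reached(client_vlan, local_servers, helper_address):
    """Return the DHCP servers that can see a client's broadcast DISCOVER.

    local_servers: dict VLAN id -> list of server names on that VLAN
                   (a rogue device would show up here, on one VLAN only)
    helper_address: the one server the router relays to (hypothetical name)
    """
    reached = list(local_servers.get(client_vlan, []))  # same broadcast domain
    reached.append(helper_address)  # unicast relay via the router
    return reached


# A rogue device plugged in at building 1 (VLAN 10 in this made-up example).
rogue_map = {10: ["rogue-router"]}
building1 = servers_reached(10, rogue_map, "real-dhcp")
building2 = servers_reached(20, rogue_map, "real-dhcp")
```

In this model only the building 1 clients ever hear from the rogue device; everyone else still talks to the real server through the helper address.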
[QUOTE=Aestivalis]
Do you have this problem on the same subnet that the server is on?
[/QUOTE]
No…the servers (as well as infrastructure) are all on the backbone VLAN (VLAN 1), and that VLAN doesn’t have an associated DHCP scope (it’s all static).
[QUOTE=teletype]
As the OP said, it would only affect one VLAN. VLANs, by definition, do not share broadcast domains. And broadcast traffic will not traverse a router. Thus the only way DHCP will traverse the router is by being converted to a unicast with an appropriate source address and giaddr field.
[/QUOTE]
Exactly.
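The giaddr part is also what lets one server on the backbone hand out addresses for every building: the relay stamps its own interface address into the request and the server picks the scope that contains it. A minimal sketch, with made-up scope names and subnets:

```python
import ipaddress

# Hypothetical scopes, one per routed building network.
SCOPES = {
    "building-1": ipaddress.ip_network("10.1.0.0/24"),
    "building-2": ipaddress.ip_network("10.2.0.0/24"),
}


def scope_for(giaddr):
    """Return the scope name whose subnet contains the relay's giaddr."""
    addr = ipaddress.ip_address(giaddr)
    for name, net in SCOPES.items():
        if addr in net:
            return name
    return None  # no matching scope, so no offer is made
```

So a request relayed with a giaddr of 10.2.0.1 would be answered from the building-2 scope, and a giaddr outside every scope gets no offer.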
[QUOTE=Quartz]
I do not think this is correct. Routers can be programmed to forward DHCP traffic.
[/QUOTE]
Yes, they can, but they forward it to a specific DHCP device. To make a rogue device work you’d need to give it the same address as your DHCP server/device or the routers would never forward requests to it…and that would mean you’d need to put it on the backbone network in the MDF. The building switches aren’t provisioned with any ports in VLAN 1, so a random user couldn’t just insert a rogue device into the network and have it be the same address as the UDP helper address.
Besides, if someone did have enough knowledge to program their rogue device with the same IP as the DHCP server, and had access to the MDF to put it on the network, then the entire system would have gone down, since you’d have a device with the same IP address as the authentication server (that’s how it was before I started working on this originally). And when I switched servers, moved the DHCP scopes to another server and changed the UDP helper address to the new server, that would have cleared that part of it up, but we’d still have had problems with a device having the same IP address as the authentication server (which is also the primary share server for file services for the campus). It would have changed the problem IOW…and it didn’t. Which is why I had to fly out there yesterday instead of being able to stay home and play Fallout New Vegas.
Anyway, thanks for the comments and suggestions. If anyone has any theories as to what might have happened or what might still be happening with the pinging problem I’d appreciate hearing them. If not, again, my thanks.
-XT