Weird Windows 2003 DHCP problem

One of my customers is having a really weird DHCP problem. It just started on Monday but it’s basically taken down their whole network, since they aren’t getting any DHCP to the clients (they’ve been going around and putting in static addresses, but it’s hundreds of machines).

Basically, what looks to be happening is that, for each VLAN (scope), the DHCP server gives out only a couple of addresses and then stops. The rest get nothing. I’ve looked at the DHCP server (it’s on one of the Windows 2003 domain controllers) and the service is running fine…nothing weird in the Event Viewer, seemingly no problem or issue. Except that only a few addresses are being handed out at a time. And they seem to be mostly renewals…i.e. if you had an address of, say, 192.168.10.104 then you’d get that exact address back, even when I go into the DHCP utility and delete all of the reservations.

I’ve had the customer try to manually release and renew at the workstations, reboot the workstations, take down the entire domain, and take down the domain and the routers, switches and firewall. I even shut down DHCP on the server it was on, started DHCP on a new server, rebuilt the scopes and changed the IP-Helper address on the router to reflect the new server…and the exact same thing happens. A couple of machines work, then nothing.

Static addressing also works fine. Machines can connect to the domain, can connect to domain assets, and can get out to the internet…so I know it’s not a routing or infrastructure issue. It’s something with the domain, but I have no idea what it might be, and I’ve tried everything I can think of short of blowing away the original server and reloading the OS from image…and I’m not even sure this will work in any case, since it’s just one of the domain controllers, and I’ve already tried moving DHCP to another server and that didn’t seem to do anything.

Any thoughts or suggestions would be greatly appreciated. I have no idea what to try at this point except the nuclear option, and I’m not even confident that will work…plus, I’d need to take an unanticipated trip in order to do it if it comes to that…which would take me away from my Fallout New Vegas addiction (doesn’t play well on my laptop…sigh).

-XT

Try and find a rogue DHCP server on your network. I’ve had a co-worker bring a small consumer router from home and plug it into their LAN port so they could connect a second PC on their desk for whatever reason. The default behaviour of these routers (D-Link, I think it was) is to be a DHCP server and it will compete with the legitimate one with strange and unwanted results.

Thought of that but ruled it out. It’s a routed network with multiple VLANs (so you need a UDP helper statement in the router to allow DHCP to be routed across the VLANs), so you’d either need to have someone put such a device on every local network, or it would only affect one VLAN. This, whatever it is, is affecting the entire system, which consists of 8 VLAN networks on a campus.

It’s really weird. According to the customer they’ve changed nothing, loaded no software nor made any changes to the network or domain. The service seems to be running fine. Hell, I moved it from one server to another AND changed the IP helper command to reflect the change. And it works…but only for a few addresses then it seems to just stop.

-XT

I’m not sure if your router/switch environment is to blame, or the DHCP server. I think your single biggest clue here is that you can see lease renewals happening, though.

(Disclaimer: This is from memory so some of it may be speech emanating from my posterior)

When 50% of the original lease time has elapsed, a Windows machine will send a unicast DHCPREQUEST to the server the lease was originally obtained from, to attempt to renew said lease. If all is well, the server responds with a unicast DHCPACK. I believe it also does this when rebinding to an existing lease, e.g. if the network cable was unplugged. It will not fall back to a broadcast DHCPREQUEST until 87.5% of the original lease duration has elapsed.
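To put numbers on those two thresholds, here is a minimal Python sketch. It assumes the standard default timer fractions (0.5 for renewal, 0.875 for rebinding); the 8-day lease and the start date are made up for illustration and are not taken from the customer's scopes.

```python
from datetime import datetime, timedelta

def lease_timers(obtained: datetime, lease_seconds: int) -> dict:
    """Return the renewal (T1), rebinding (T2) and expiry times for a lease."""
    lease = timedelta(seconds=lease_seconds)
    return {
        # At T1 the client unicasts a DHCPREQUEST to the server that gave the lease.
        "t1_renew_unicast": obtained + lease * 0.5,
        # At T2 the client gives up on that server and broadcasts instead.
        "t2_rebind_broadcast": obtained + lease * 0.875,
        "expires": obtained + lease,
    }

# Hypothetical lease: obtained at 08:00 on 1 Nov 2010, 8 days long.
timers = lease_timers(datetime(2010, 11, 1, 8, 0, 0), 8 * 24 * 3600)
for name, when in timers.items():
    print(name, when)
```

With these assumed values the unicast renewal attempt falls four days in, and the broadcast fallback seven days in.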

So, since the renewals are unicast, your ip helper-address configuration does not come into play. Since the DHCPDISCOVERs sent when obtaining a new lease are always going to be broadcast, it does come into play there.

The ip helper-address works by converting the broadcast received on that vlan into a unicast, and forwarding that to the specified address. It will also update the GIADDR (gateway address) field of the DHCP packet with the address of the source interface the broadcast was received on. Windows will then use that to determine which DHCP scope to associate the request with.
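A rough Python sketch of that scope-matching step follows. This is not how the Windows DHCP service is actually implemented; the scope names, subnets and the `pick_scope` helper are all hypothetical, and the only idea borrowed from the text is "match the giaddr against the configured scopes".

```python
import ipaddress

# Hypothetical scopes, one per VLAN; the subnets are made up for illustration.
SCOPES = {
    "VLAN 10": ipaddress.ip_network("192.168.10.0/24"),
    "VLAN 20": ipaddress.ip_network("192.168.20.0/24"),
}

def pick_scope(giaddr: str, arrival_interface: str):
    """Pick a scope by giaddr if the packet was relayed, else by the local interface."""
    key = ipaddress.ip_address(giaddr)
    if key == ipaddress.ip_address("0.0.0.0"):
        # giaddr is empty, so the request was not relayed: the client is
        # on the same segment as the server's own interface.
        key = ipaddress.ip_address(arrival_interface)
    for name, network in SCOPES.items():
        if key in network:
            return name
    return None  # no matching scope, so the request goes unanswered

relayed = pick_scope("192.168.20.1", "192.168.1.5")
local = pick_scope("0.0.0.0", "192.168.1.5")
print(relayed, local)
```

In this sketch a relayed request with a giaddr inside 192.168.20.0/24 lands in the VLAN 20 scope, while a request with a missing or stripped giaddr arriving on a segment with no scope gets nothing at all.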

So, my guesses are:

Something is interfering with the broadcast traffic before it hits the interface w/ the ip helper-address defined.
Something is stripping the GIADDR after it leaves the router, before it hits the DHCP server.
The DHCP server is ignoring the GIADDR or not properly associating it with the correct scope.
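If you get a packet capture on the server side, the giaddr guesses can be checked directly, because giaddr sits at a fixed offset (bytes 24 to 27) in the BOOTP header. A minimal Python sketch using only the standard library; the `parse_bootp_header` name and the sample packet are hypothetical, and it expects the raw UDP payload rather than a whole Ethernet frame.

```python
import socket
import struct

def parse_bootp_header(payload: bytes) -> dict:
    """Pull a few fixed-offset fields out of a BOOTP/DHCP UDP payload."""
    if len(payload) < 28:
        raise ValueError("too short to be a BOOTP message")
    # op, htype, hlen, hops (1 byte each), xid (4), secs (2), flags (2)
    op, htype, hlen, hops, xid, secs, flags = struct.unpack("!BBBBIHH", payload[:12])
    # Four IPv4 addresses follow: ciaddr, yiaddr, siaddr, giaddr
    ciaddr, yiaddr, siaddr, giaddr = (
        socket.inet_ntoa(payload[i:i + 4]) for i in (12, 16, 20, 24)
    )
    return {
        "op": op,                            # 1 = request, 2 = reply
        "hops": hops,                        # incremented by relay agents
        "xid": xid,
        "broadcast": bool(flags & 0x8000),   # client asked for a broadcast reply
        "ciaddr": ciaddr,
        "giaddr": giaddr,                    # 0.0.0.0 means "not relayed"
    }

# Made-up relayed request: one hop, giaddr set to a hypothetical VLAN gateway.
sample = (
    struct.pack("!BBBBIHH", 1, 1, 6, 1, 0x12345678, 0, 0x8000)
    + socket.inet_aton("0.0.0.0") * 3
    + socket.inet_aton("192.168.10.1")
)
fields = parse_bootp_header(sample)
print(fields["giaddr"], fields["hops"])
```

A relayed DISCOVER that shows up at the server with giaddr 0.0.0.0 would point at the second guess; one that never shows up at all would point at the first.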

Good luck, that’s a head-scratcher. :frowning: I’m pretty confident that the unicast nature of renewal DHCPREQUESTs may be a big clue, though.

Missed the edit window but forgot one more possibility. Remember that the original broadcast packet had no source IP associated (0.0.0.0). So, in addition to setting giaddr, the relaying router will set its own IP (the IP of the interface it’s routing through to get to your DHCP server) as the source IP of the unicast DHCP packet it has created. The server will respond to this IP and the router will relay this back out the VLAN interface, to the originating client.

So, make sure nothing has interfered with your dhcp server’s route back to the relaying router, no ACLs are dropping it along the way, etc.

ETA: in retrospect, this seems more likely than the guesses I made in the previous post. Please post when you’ve got it solved though, I’m curious now!

I know the routing is working because I can put in a static address on one of the workstations that isn’t getting DHCP and have it get to basically anywhere on the network or to the internet. There are some ACLs, but none of them affect this traffic…they are mostly in place for their VoIP system.

The weird thing here is that this system has been up and running in a stable way for 2 years now. According to the customer nothing has changed at all. The really strange thing is that I moved DHCP to an entirely different server and it’s STILL having the exact same issue…which, to me, would indicate it’s not the server but the infrastructure. Except that the routing all seems to be working fine (from the testing I’ve been able to have the customer do over the phone)…if I put in a static address it gets there. And, more telling, SOME of the machines are definitely getting DHCP when I have them do a /release and /renew. Just not the majority of the machines. I can see the leases in the DHCP server utility as well. I’ve reconciled the scopes and get no errors. No errors in the event logs. There are some errors in the DHCP logs, but they seem to have to do with the workstations that are actually working.

I’m probably going to grab a flight out this afternoon unless I can come up with something brilliant to try between now and then. I don’t have Wireshark loaded on any of their servers and might try and do that (or wait until I get out there). Most likely I’ll end up restoring the original server from last week’s image if nothing else comes to mind…and hope that the problem is with the domain somehow. I’m really stumped and don’t know what else to try at this point. :frowning:

-XT

Odds are there’s a rogue DHCP server out there. Has someone enabled ICS? Has someone accidentally bridged in an ADSL router?

Do you have this problem on the same subnet that the server is on?

As the OP said, it would only affect one VLAN. VLANs, by definition, do not share broadcast domains. And broadcast traffic will not traverse a router. Thus the only way DHCP will traverse the router is by being converted to a unicast with an appropriate source address and giaddr field.

I do not think this is correct. Routers can be programmed to forward DHCP traffic.

Ok…for anyone interested, I just got back from flying out to the customer’s site and I’ve ‘fixed’ the problem. It was pretty strange. Basically there were a number of things going on at once, so troubleshooting was difficult. The first thing was that someone (a kid most likely) had looped one of the building’s switches, which was why that building was unable to get phones or network (I’ve since fixed all the buildings by putting spanning tree on all the switches so hopefully this won’t happen again). I found this fairly early on and corrected the problem, but it didn’t seem to have any effect on the DHCP issue…it just got the building in question working at all (it was not working even with statics before this).

Next, the DHCP problem seems to have been some weird caching issue. I tried taking down all the servers and bringing them back up. No effect. Tried going in to DHCP and clearing out all of the leases. Nothing. Took down the entire infrastructure and brought that back up. Nothing. You would go to one of the machines that wasn’t working and, sure enough, it wasn’t getting an IP address. Using ipconfig /release and ipconfig /renew…nada. Turn the machine off and back on…the same. No address. However, if I plugged my laptop into the same port I’d get an address on the proper VLAN with no issues. Plug the cable back into the original workstation and…works fine. No issues. :smack: It was just luck that I tried this, since I was just playing around before I used the nuclear option on the domain and did an image restore from last week.

I’m speculating that Windows caches the old reservations between the workstations and the server somehow (my guess…the workstation has it cached and when it makes a request for a new address it tells the server what its old address was…if the address is available the server just gives it that one back). For some reason this was preventing the workstations from getting new addresses. It was almost like port security was on or something (though only for DHCP)…really weird.
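The "client asks for its old address back" part of that guess does match how DHCP normally behaves: a client can include its previous address in the request, and a server will typically hand it back if it is still free. A toy Python sketch of that allocation rule, purely illustrative (the pool, the `choose_address` helper and the addresses are hypothetical, and this is not the Windows DHCP service's actual algorithm):

```python
import ipaddress

def choose_address(pool, leased, requested=None):
    """Offer the requested address if it is free and in the pool, else the first free one."""
    free = [ip for ip in pool if ip not in leased]
    if requested is not None and requested in free:
        return requested
    return free[0] if free else None  # None means the pool is exhausted

# Hypothetical ten-address pool with one address already leased out.
pool = [ipaddress.ip_address("192.168.10.%d" % n) for n in range(100, 110)]
leased = {ipaddress.ip_address("192.168.10.100")}

old = ipaddress.ip_address("192.168.10.104")
offer = choose_address(pool, leased, requested=old)  # client remembers .104
fresh = choose_address(pool, leased)                 # client with no history
print(offer, fresh)
```

This only explains why machines kept getting the same address back after the leases were deleted on the server. It does not by itself explain why other machines got no answer at all.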

Anyway, we found out, through trial and error, that simply unplugging the ethernet cable from the machine and plugging it back in ‘fixed’ the problem. Taking down the switches the machine was in didn’t work. Taking down the switch and powering down the machine didn’t work. But unplugging the ethernet cable while the machine was on and then plugging it back in works. Unreal. There are still some issues (for some reason the machines getting addresses can’t ping each other, even though they can ping the gateway, firewall and domain servers). The machines that were working before can ping each other but none of the machines getting addresses now can ping each other (though they can ping the machines that worked previously…not vice versa though). Again…really weird. It’s not affecting service, though, since the workstations can get to everything they need to get to with no problem, so I’m letting that go for now and will look at it more in-depth when I’m out there for my next scheduled trip.

Just to answer some of the other posts in the thread:

Wouldn’t work. You’d need to have brought up a rogue DHCP server on every subnet, which didn’t seem reasonable. The way DHCP/BootP works in a routed network is that you need a UDP Helper address that tells the router where to forward your DHCP/BootP packets to. So, if you brought up a rogue router or device putting out DHCP at, say, building 1, it would only affect the users of building 1, since you’d have no UDP helper statement to get that rogue device’s DHCP/BootP packets to the other segments. This, whatever it was, was affecting the entire campus…about 8 separate routed networks.

No…the servers (as well as infrastructure) are all on the backbone VLAN (VLAN 1), and that VLAN doesn’t have an associated DHCP scope (it’s all static).

Exactly.

Yes, they can, but they forward it to a specific DHCP device. To make a rogue device work you’d need to give it the same address as your DHCP server/device or it would never forward them…and that would mean you’d need to put it on the backbone network in the MDF. The building switches aren’t provisioned for any ports being in VLAN 1, so a random user couldn’t just insert a rogue device into the network and have it be the same address as the UDP helper address.

Besides, if someone did have enough knowledge to program their rogue device with an IP the same as the DHCP server, and had access to the MDF to put it on the network, then the entire system would have gone down, since you’d have a device with the same IP address as the authentication server (that’s how it was before I started working on this originally). And when I switched servers and moved the DHCP scopes to another server and changed the UDP helper address to the new server then it would have cleared that part of it up, but we’d have still had problems with a device having the same IP address as the authentication server (which is also the primary share server for file services for the campus). It would have changed the problem IOW…and it didn’t. Which is why I had to fly out there yesterday instead of being able to stay home and play Fallout New Vegas. :wink:

Anyway, thanks for the comments and suggestions. If anyone has any theories as to what might have happened or what might still be happening with the pinging problem I’d appreciate it. If not, again, my thanks.

-XT

I was about to suggest Wireshark or some other way of looking at the actual packet traffic to see what was happening. That should have clued you in to the loop problem, since you’d see a whole bunch of BPDUs in the affected VLAN, as well as a general utilization rate through the roof.

Also was going to point out that although technically each VLAN doesn’t share layer 2 broadcasts with other VLANs, some switch implementations do let layer 2 broadcasts go out to every port, regardless of which VLAN the port is assigned to. (I’m looking at you, Enterasys!)

For the non-pinging issue – are you trying to resolve machine names or are you pinging by IP address? If names, are you relying on a master browser or DNS?

Wireshark was exactly how I found the traffic loop. Took me about 10 minutes once I got there. I think I’ve corrected that problem by turning spanning tree on (not sure why it wasn’t on already).

They are using HP Procurve equipment, so it definitely requires an IP Helper command to allow DHCP/BootP packets to cross the routed VLANs.

I was trying to ping them directly using the IP address of the box, not pinging using the resolved addresses. They would be relying on the local DNS, however, for name resolution, though WINS is also enabled.

-XT

So look at the ARP table of the router on the interface of the target box. Is there an ARP entry for your target’s MAC address?

ETA – I think spanning tree is usually off by default. Are their switches dumping into a syslog server somewhere or using SNMP? You also might have seen large numbers of spanning tree elections via the syslog errors and avoided Wireshark…

I cleared the cache for the ARP tables, and no…they didn’t repopulate for the machines that weren’t able to ping each other. It was really strange, since they can ping the IP addresses of the boxes that were previously working…just like they can ping the servers (on VLAN 1), their own gateway, and statically addressed devices like printers and such. Wireshark wasn’t much help here either, since I couldn’t get any indication as to WHY they weren’t able to ping (makes sense since they are on the same subnet).

This is a more minor problem, since it’s not affecting the users’ access in any way. It’s just a strange problem that seemingly cropped up when this weird DHCP caching issue came up (if, indeed, that was the problem…it’s pure speculation on my part based on what I was seeing and how I got around the issue). It’s weird that something like this would happen on several VLANs in the exact same way.

-XT

Can you add a static ARP entry and see if ping works then?

Well, I can’t do anything now…I flew home this morning. :wink: But I’m not sure how that would help, to be honest. The two machines are on the same subnet, so the traffic never goes through the router and doesn’t really NEED an entry in the router’s ARP table to ping.

-XT

Good point.

And we’re sure nothing was hosing up the subnet mask? In other words, the pinging node KNEW its target was on the same subnet?

Yup. Once I got the workstations to accept DHCP addresses again I checked to make sure all of the parameters were correct (since I had to rebuild the scopes on that other server by hand). The subnet masks were correct, as were the default gateway address and DNS server addressing.
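The reason the subnet mask question matters: a host uses its own address and mask to decide whether to ARP for the target directly or to hand the packet to its default gateway. A minimal Python sketch of that decision; the `next_hop` helper and all addresses are hypothetical, not taken from the customer's network.

```python
import ipaddress

def next_hop(source_ip: str, mask: str, gateway: str, target_ip: str) -> str:
    """Return the address the host would ARP for when sending to target_ip."""
    iface = ipaddress.ip_interface(f"{source_ip}/{mask}")
    if ipaddress.ip_address(target_ip) in iface.network:
        return target_ip  # on-link: ARP for the target itself
    return gateway        # off-link: ARP for the default gateway instead

# Correct /24 mask: the neighbour is reached directly.
good = next_hop("192.168.10.104", "255.255.255.0", "192.168.10.1", "192.168.10.120")
# Wrong (all-ones) mask: even a neighbour looks off-link, so it goes to the gateway.
bad = next_hop("192.168.10.104", "255.255.255.255", "192.168.10.1", "192.168.10.120")
print(good, bad)
```

With a correct mask, as reported here, same-subnet pings depend only on the two hosts answering each other's ARP requests through the switch, so that is where a capture would be worth looking next.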

-XT