Running Red Hat 8.0 on a server. It has one network card, bound to two different addresses, both on the same subnet. For discussion, I’ll refer to them as 66.1.1.145 and 66.1.1.146. (Numbers have been changed to protect the innocent.)
Everything has worked fine for over a year. Starting last week, from the internet, I was unable to ping (or otherwise access) the .145 address and yet was able to ping (and otherwise access) the .146 address. From other machines on the local 66.1.1 LAN, I can access both addresses.
I’ve triple-checked to be sure there aren’t duplicate IP addresses or bad configs in upstream devices (including the ISP’s edge router). Also, I’ve had all those devices rebooted and that didn’t help.
But here’s the really weird part: the problem can sometimes be fixed (temporarily) by restarting networking (service network restart) on the host with these two addresses. When that doesn’t fix it, I can fix it (temporarily) by swapping the IP addresses in the /etc/sysconfig/network-scripts/ifcfg-eth0 and ifcfg-eth0:0 files, then restarting networking. So the issue seems to be inside this machine. BTW, I say “temporarily” because the problem comes back, typically the next day.
For reference, here is ifcfg-eth0 and ifcfg-eth0:0:
Also, each time it’s happened, the first thing I try to do is reboot the host (soft reboot as it’s usually remote from me), and that’s never fixed the problem.
The only thing left I can think of to try is a new network card, but I don’t see how it could be that. Any ideas?
If hosts on the same subnet can reach the host, but hosts outside the subnet can’t, it’s probably not a problem on your machine. Is there an ARP table out of whack on a router? IP address/host name mapping can be correct through the whole chain, but ARP tables can screw everything up. And it’s a tricky problem to spot.
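If there is another Linux box on the LAN, one rough way to check its view is to read its ARP cache and confirm that both addresses resolve to the same MAC. The sketch below assumes the usual /proc/net/arp layout; mac_for_ip is a made-up helper name, and the addresses are the placeholders used in this thread. Ping each address first so the cache has entries.

```shell
#!/bin/sh
# Sketch: look up the MAC address a Linux host has cached for an IP.
# mac_for_ip is a hypothetical helper name, not a standard tool.

# mac_for_ip IP [TABLE]
# Prints the hardware address recorded for IP in an ARP table laid out like
# /proc/net/arp (one header line, then IP in column 1 and MAC in column 4).
mac_for_ip() {
    awk -v ip="$1" 'NR > 1 && $1 == ip { print $4 }' "${2:-/proc/net/arp}"
}

# Example (run on another machine on the 66.1.1 LAN after pinging both):
#   mac_for_ip 66.1.1.145
#   mac_for_ip 66.1.1.146
# Both addresses sit on one NIC, so the two MACs should be identical.
```

The routers will have their own commands for showing an ARP table; this only covers a Linux neighbour.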
Thanks, leenmi; that’s an excellent question. At first (before I spotted the weird behaviour inside the machine), that’s exactly what I assumed. Upstream from this device are (in order) switch1, switch2, router1, router2. All except router2 are my property; router2 is outside on the ISP’s side. I’ve reset all of those devices (including router2), which should’ve reset the ARP tables. And no luck. Also, assuming that maybe there was some cached info inside the host, I reset it as well. Still no luck. It was at that point in debugging that I tried disabling one of the interfaces on the host and realized that playing with network configuration and restarting the network fixed the problem.
Note that my MAC address doesn’t change in any of the ARP tables when I swap IP addresses between eth0 and eth0:0 (both map to the same MAC address), yet that fixes the problem (for a while).
Is traceroute an option? Or whatever is similar in Linux?
Something else: Is it at all possible that someone or something in the chain has stolen your IP address? If it’s a laptop, or an entity that is not constantly connected, it might explain what you are seeing.
This was an initial thought of mine as well. Did a traceroute, and it ends at router1 in the list I mentioned above (i.e. the first routing hop beyond the host). So, initially I was thinking it had to be something misconfigured there. But the thing is, that router’s configuration hadn’t changed in months. Nonetheless, I carefully perused the config for errors, and also reset the router in case something was wrong, or an improper ARP cache was there. No dice. Also, if it were the router, you’d think that my restarting networking on the host wouldn’t fix it.
I’m sure there isn’t IP duplication. I did a ping sweep (as well as a port scan for absolute completeness) on the LAN, and didn’t find any overlap. Since there are only 3 other devices on the LAN with it, I manually checked their configs as well.
This device isn’t a laptop, in fact it’s a server that had an uptime of over a year before I rebooted it the other day. Have never had any problems like this with it before.
It’s hard for me to imagine what possible value there is in having two IP addresses on the same subnet on the same NIC. Delete the alias and save yourself some grief.
Well, the reason I need multiple IPs is that this host sports multiple services for multiple domains, where each domain requires a single IP. Yes, I could run several http servers on a single IP address, but these aren’t all http servers.
Also, this is a pretty standard configuration as a Google search shows. Another typical usage is for someone who provides vanity domains for IRC users or other.
You say that switching the IP Addresses temporarily fixes the problems, but the problem resumes, typically, the next day.
Can you nail down the time a little more specifically? An iteration that pings the addresses or tries to connect once every couple of seconds shouldn’t be too expensive, resource-wise.
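A loop along those lines might look like the sketch below. It assumes a Linux machine outside the 66.1.1 subnet with the iputils ping (where -W is a timeout in seconds); the function names and the log file name are made up for illustration. It only logs when an address changes state, so the log stays short and the timestamps show when things broke.

```shell
#!/bin/sh
# Sketch: timestamp every change in reachability of one address.
# All function names here are hypothetical; adjust ping flags for your system.

# is_up ADDRESS: succeeds if a single ping is answered within 2 seconds.
is_up() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# transition ADDRESS PREV NOW: prints a line only when the state changed.
transition() {
    if [ "$2" != "$3" ]; then
        echo "$1: $2 -> $3"
    fi
}

# watch_host ADDRESS [INTERVAL]: probe forever, logging state changes.
watch_host() {
    addr=$1
    interval=${2:-2}
    prev=unknown
    while :; do
        if is_up "$addr"; then now=up; else now=down; fi
        line=$(transition "$addr" "$prev" "$now")
        if [ -n "$line" ]; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
        fi
        prev=$now
        sleep "$interval"
    done
}

# Example: one watcher per address, each in the background.
#   watch_host 66.1.1.145 2 >> reach.log &
#   watch_host 66.1.1.146 2 >> reach.log &
```

Comparing the two addresses side by side also shows whether .146 ever drops at the same moment as .145.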
Is there a cron job or some scheduled task that may be having unanticipated consequences? I still don’t think it’s on your server, since all is well on the local sub-net. But, armed with a more specific time, the ISP might twitch to something they do.
Wow… This is a tricky one. I’m a network/system admin and work with Red Hat Linux all day and you’ve certainly got me stumped. Any chance you could capture a bunch of data with tcpdump and see what the hell’s really going on at the packet level? The command line you’d want would be something like this:
tcpdump -i eth0 -w dump.dat -s 1514
That would capture all traffic to/from interface eth0 to a file called dump.dat and make sure you captured all the bytes in the ethernet packets. You could then either read it like this:
tcpdump -r dump.dat
or load the dump.dat file into Ethereal (the greatest network analyzer on earth, which I believe is included with Red Hat 8.0) and really see what’s going on.
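If the full capture turns out to be too noisy, a narrower one may be easier to read. This is only a sketch: it keeps ARP plus ICMP for the address that stops answering, on the assumption that the interesting questions are whether the pings arrive at all and whether ARP requests for .145 get answered. The function names and the file name dump145.dat are made up here.

```shell
#!/bin/sh
# Sketch: a narrower capture for the address that goes dark.
# Function names are hypothetical wrappers around ordinary tcpdump options.

# capture_filter ADDRESS: prints a tcpdump filter expression matching
# all ARP traffic plus ICMP to or from ADDRESS.
capture_filter() {
    echo "arp or (icmp and host $1)"
}

# capture ADDRESS FILE: write matching packets seen on eth0 to FILE.
# Needs root. -n skips name lookups so DNS problems can't stall the capture.
capture() {
    tcpdump -n -i eth0 -s 1514 -w "$2" "$(capture_filter "$1")"
}

# show FILE: read a capture back; -e prints the Ethernet (MAC) addresses.
show() {
    tcpdump -n -e -r "$1"
}

# Example:
#   capture 66.1.1.145 dump145.dat    # stop with Ctrl-C
#   show dump145.dat
```

If echo requests show up in the trace with no replies going out, the problem is on the host; if nothing arrives, the packets are being lost upstream.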
leenmi, pestie, both great ideas; I’ll do both and see what comes of it. I have some good in-depth knowledge of TCP/IP protocols, so seeing a trace may help.
Naturally, everything’s been ok for the past 1 1/2 days. Happy it isn’t down, but unhappy that I can’t further debug it.
The guy at my ISP said something about Linux having a setting about returning packets on the local LAN or something, but I’d never heard of it and it didn’t make much sense. If you’ve heard of such a thing, I’d appreciate your thoughts.
That sounds vaguely familiar, but there’s all sorts of crazy tweaking that can be done to the Linux IP stack. If the answer is to be found anywhere, it’s probably in the Linux Advanced Routing and Traffic Control HOWTO. Chapter 13 seems to address a lot of the options that might be affecting you.
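For what it’s worth, two kernel settings in that general area are rp_filter (reverse path filtering) and arp_filter. I can’t say whether either is what the ISP had in mind, so treat this as a guess. Here is a small sketch for dumping their current values, assuming the interface is eth0; show_knobs is a made-up name, and a given kernel may not expose every file.

```shell
#!/bin/sh
# Sketch: print a couple of per-interface IPv4 settings from /proc/sys.
# show_knobs is a hypothetical helper; which settings matter here is a guess.

# show_knobs [BASE]
# BASE defaults to /proc/sys. Prints one "iface.knob=value" line per setting,
# or "(not present)" when the kernel does not expose that file.
show_knobs() {
    base=${1:-/proc/sys}
    for iface in all eth0; do
        for knob in rp_filter arp_filter; do
            f="$base/net/ipv4/conf/$iface/$knob"
            if [ -r "$f" ]; then
                echo "$iface.$knob=$(cat "$f")"
            else
                echo "$iface.$knob=(not present)"
            fi
        done
    done
}

# Example:
#   show_knobs
```

Comparing the output from a moment when .145 works with one when it doesn’t would at least show whether something is changing these underneath you.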
Argh. Right now it’s acting up, and restarting the network isn’t fixing it. And naturally, I have other real things to deal with in my life. I’ll do a trace and see what shows up.