Windows 2003 Server gurus...need a hand on a weird problem

I’m at a customer’s site (after having to fly out late last night, so I’m a bit punchy atm), and I’ve encountered a weird problem with the file system. The server uses a RAID 1/5 setup (i.e. the OS is on a two-drive RAID 1 partition and the data is on a three-drive RAID 5 partition). Sometime yesterday the customer’s data drive (on the RAID 5 partition) became inaccessible. It simply disappeared, both from the server’s My Computer view and as shares accessible externally. The users (who aren’t the most computer-sophisticated) claim that it just stopped on its own for reasons unknown. My assumption, talking to them on the phone last night, was that something had happened to the array elements (the array controller and software are the same for the RAID 1 and RAID 5 arrays, so logically if it’s working for one it should be working for both).

However, after doing a few tests I find that in fact both arrays are functioning properly AND all the drives are showing up in the green. Further, when I go into Computer Management in the Disk Management section I find that indeed both partitions are there and showing healthy. The only strange thing is that the data partition no longer has a logical drive letter (which is why it’s not showing up in My Computer I’m guessing). I can, of course, assign a logical drive letter to it (I have done so) and then access the data fine (all the shares are gone however), but on reboot of the system it reverts back.

So, my questions are…what’s going on here exactly? I’ve never seen anything like this before. What could have caused this? And, most importantly, how do I permanently fix it? It seems like this should be a fairly simple thing, but I haven’t found anything on Google about it (granted, I’m not sure exactly how to phrase the question, and am still trying). I’d be grateful if anyone knows of this problem and knows what’s happening here.

Please, use small sentences as well…I’m completely exhausted (I’ve been up for over 30 hours at this point, as it seems this is the week for weird network issues for customers who are hundreds of miles apart). :slight_smile:

-XT

I’d be right chuffed if you were to call me a guru… :slight_smile:

First off, go get some sleep: you need to approach problems like this with a clear head.

Let’s start with the basics: we need information.

- Is there anything in any of the event logs?
- Have any patches been applied recently? Particularly driver updates?
- Next, Basic or Dynamic volumes?
- Is either volume disk full?
- Have you checked for malware?
- Have you run CHKDSK?
- Can you detail the hardware - server model, controller model (Adaptec, perchance?), etc?
- Are all drives internal or external, or a mixture?
- Have you checked SCSI termination?

I definitely need some sleep but it doesn’t look to be in the cards any time soon. But thanks for the suggestion. :slight_smile:

One of the first things I checked. There was only one strange thing in the System section…at some point in the past week someone turned on the internal 4mm DAT tape drive. However, it’s on a second SCSI controller, so I’m not seeing how it would affect this. I have ripped it out at this point as well (one of the first things I did, especially since they don’t need it…they are using a NAS for backups), but this seemed to have no effect.

As far as I can tell there have been no driver updates, though they did get the latest security updates from Microsoft.

Basic volume and NTFS (you didn’t ask and it prolly doesn’t matter, but figured I’d put it in anyway).

Nope, quite the opposite. The primary OS volume is at 57.8 GB free and the data volume is at 136.9 GB free.

No…and unfortunately they don’t have AV running on their system at all (it was a recommendation I made to them months ago but they chose not to heed it…at least my company and I are covered). I suppose it’s possible that there is something like that causing this, though I don’t see anything additional installed under Add or Remove Programs. Of course, that doesn’t really mean all that much.

Yes…both volumes are healthy with no file system issues except the one I described.

It’s an older IBM xSeries 225 with, I believe, an IBM RAID controller. The secondary SCSI controller (for the tape drive) seems to be an Adaptec chip set, though I’m not sure what model it is.

Internal…IBM has a drive caddy system for its SCSI RAID drives. All are seated correctly and all are showing in the green using IBM’s ServeRAID software. Both arrays are also showing in the green with no issues.

No, I haven’t cracked the case, but I don’t see how this could be an issue given what I’m seeing here. I don’t think they use a physical terminator in any case with this type of array…IIRC it’s a backplane array that is self-terminating.
BTW, appreciate the helpage on this. I’m fairly sure that when the problem is found and fixed I’ll be going :smack::smack::smack: and kicking myself because I SHOULD have left for a few hours, slept and then it would have been obvious. That’s how I hope this all plays out.

-XT

Have you got any virtual drive software on there? Windows will knock off drive letters if there is a conflict and some virtual drive software hooks into the OS in bad ways.

Second, how is the external array connected? If the disk isn’t present at OS startup it might not be allocated a drive letter.

As **Quartz** says, event logs would be good.

cheers
t.

Nope, no virtual drive software. Standard Windows partitions.

It’s not an external array at all. It’s an internal array and integrated RAID controller. It’s definitely present at OS bootup since essentially what they have is one array controller segregated into 2 logical parts (one using RAID 1 and one using RAID 5).

Here is the only odd thing from the event logs:

Event Type: Error
Event Source: 4mmdat
Event Category: None
Event ID: 11
Date: 2/19/2009
Time: 11:27:42 AM
User: N/A
Computer: SERVER
Description:
The driver detected a controller error on \Device\Tape0.

For more information, see Help and Support Center at Microsoft Support.
Data:
0000: 0f 00 68 00 01 00 be 00   ..h...¾.
0008: 00 00 00 00 0b 00 04 c0   .......À
0010: 01 01 00 00 85 01 00 c0   ....…..À
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 6d 40 00 00 00 00 00 00   m@......
0030: ff ff ff ff 00 00 00 00   ÿÿÿÿ....
0038: 40 00 00 84 02 02 06 00   @..„....
0040: ff 20 06 12 08 01 00 00   ÿ ......
0048: 00 00 00 00 68 01 00 00   ....h...
0050: 00 00 00 00 d0 07 e2 89   ....Ð.â‰
0058: 00 00 00 00 90 e3 e1 89   .....ãá‰
0060: 00 00 00 00 00 00 00 00   ........
0068: 00 00 00 00 00 00 00 00   ........
0070: 00 00 00 00 00 00 00 00   ........
0078: 70 00 04 00 00 00 00 0a   p.......
0080: 00 00 00 00 44 ae 00 00   ....D®..
0088: 00 ee 00 00 00 00 00 00   .î......
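
For anyone who wants to poke at the Data bytes, here is a rough Python sketch (not something that was run on this server). It reads the first three rows of the dump above as raw bytes and prints the little-endian 32-bit values at offsets 0x0C, 0x10 and 0x14. The helper names `parse_dump` and `dword_at` are made up for illustration, and the choice of those three offsets is an assumption about where the status codes sit in the record, not something confirmed here.

```python
import re
import struct

# First three Data rows copied from the event above (hex column only).
DUMP = """\
0000: 0f 00 68 00 01 00 be 00
0008: 00 00 00 00 0b 00 04 c0
0010: 01 01 00 00 85 01 00 c0
"""

HEX_BYTE = re.compile(r"^[0-9a-fA-F]{2}$")


def parse_dump(text):
    """Turn 'offset: b0 b1 ...' rows into a single bytes object."""
    out = bytearray()
    for row in text.splitlines():
        if ":" not in row:
            continue
        _, _, rest = row.partition(":")
        # Each row carries at most 8 hex bytes; stop at the text column.
        for token in rest.split()[:8]:
            if not HEX_BYTE.match(token):
                break
            out.append(int(token, 16))
    return bytes(out)


def dword_at(data, offset):
    """Read one little-endian unsigned 32-bit value at the given offset."""
    return struct.unpack_from("<I", data, offset)[0]


data = parse_dump(DUMP)
for off in (0x0C, 0x10, 0x14):  # assumed offsets of interest
    print(f"0x{off:02X}: 0x{dword_at(data, off):08X}")
```

The low 16 bits of the value at 0x0C come out as 11, which at least lines up with the Event ID shown in the entry.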

This started yesterday at 11:40am. I have since taken this out of the system (by uninstalling the drivers), but this seems to have no effect on the system. It’s also on a separate SCSI controller, so not sure what effect it would have on this problem.

:slight_smile:

-XT

I used to have terrible problems with Dell PowerVaults (external SCSI RAID) because the backplane controllers were very flaky.

Doesn’t seem to be a controller or hardware issue at all. In fact, I don’t think it’s a driver issue either. It’s something weird with the OS. For whatever reason, on reboot the logical drive pointer is simply not there. I’ve tried using different drive letters (the default next drive letter is E: ), and this doesn’t seem to work either. When I go into Computer Management after reboot, the volume comes up as Volume <volume name> with no logical drive associated with it. Right-clicking on the partition allows me to add a logical drive pointer and access the drive and all its data normally. I can create shares, manipulate the file system, etc etc. However, as soon as I reboot the system it reverts back to no logical drive pointer until I manually put it back in.

I’m totally at a loss as to how to fix this. Everything seems to be working fine…except that I simply can’t get the logical drive pointer to stay after a reboot.
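
As a stopgap, the manual "add a drive letter" step could be scripted rather than clicked through after every reboot. Below is a minimal Python sketch that only builds the text of a diskpart script; it does not run anything. The function name `build_diskpart_script`, the volume number 2 and the letter E are all placeholders for illustration (the real volume number would have to be read off `list volume` first), and this only repeats the workaround; it does not address why the letter is lost.

```python
def build_diskpart_script(volume_number, letter):
    """Return diskpart script text that assigns a drive letter to a volume.

    volume_number: index as shown by diskpart's 'list volume' (placeholder here).
    letter: a single drive letter such as 'E' or 'e:'.
    """
    cleaned = letter.strip().rstrip(":").upper()
    if len(cleaned) != 1 or not ("A" <= cleaned <= "Z"):
        raise ValueError(f"not a drive letter: {letter!r}")
    if volume_number < 0:
        raise ValueError("volume number must be non-negative")
    lines = [
        "list volume",                      # show volumes before the change
        f"select volume {volume_number}",   # focus on the data volume
        f"assign letter={cleaned}",         # same as adding the letter by hand
        "list volume",                      # show volumes after the change
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values: volume 2, letter E.
print(build_diskpart_script(2, "e:"), end="")
```

The output could be saved to a text file and fed to `diskpart /s <file>`; shares would still need to be recreated separately.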

-XT

Go to sleep, mate! Your brain will make all sorts of connections while you sleep. If it needs to be billable time, put it down as research. Trust me: after such a long time awake, you are not thinking straight.

Is your Logical Disk Manager service enabled and running?

Presumably that would turn up as a red entry in one of the event logs.

Would that I could. Unfortunately the customer is fairly frantic at this point so I need to get something working here. I’m considering the nuclear option at this point since they have a decent backup system and good backup images.

Yes it is. One additional weird thing I’ve found, though…after a reboot, when I add a new logical drive pointer and shares, the folders and files in the data partition are all flagged as read only…and this seems to be a hard flag (I can’t change it…it’s grayed out). I found this out when I tried to do a limited restore from the NAS image from 3 days ago and was told that the disk is write protected.
At this point I’m leaning towards blowing the entire partition away and restoring the system from before this issue cropped up. They would lose a few days’ worth of work but they would at least have something…and I can run a backup of the system before I do that, so in theory they might not lose anything.

-XT

Let me reiterate: none of the advice here will trump a good night’s sleep. You do neither yourself nor your clients a service by not sleeping.

It’s good advice. In fact, now I can actually take it and get some sleep. I’ve been able to restore at least rudimentary access. I did end up blowing the partition away and doing a restore of the critical systems (I LOVE Acronis!) and this seems to have worked to a degree. Still going to be a long weekend here but at least the customer is happy since they are able to get their critical accounting stuff done today that they needed.

Thanks for the advice and helpage everyone. :slight_smile:

-XT

FWIW, the reason why you split off the data drives into a separate array is so you can blow it away without impacting the system drive. If you can restore access by reassigning the drive letter, go ahead and back all the data up to NAS (or a 1TB USB drive for that matter), blow away and reconfigure the array, then copy the data back.

Yup…I know. And blowing it away seems to have done the trick. It would have been nice if I could have figured out what actually happened, or been able to fix it without nuking the partition (for one thing, there is over a TB on it, so it’s going to take a long time to restore and reconfigure it all), but at least the customer is happy and the system is working to a degree. It was absolutely critical that they have this one system up and working today.

-XT