Excuse Me, Your SCSI Is Slipping

by Mark Berry 6/29/2007 12:17:00 PM

So you've got a hot-swap SCSI RAID array and you figure you are covered, even if a disk fails. Well, that is usually true. Kinda. Here is a long story with a few short conclusions about what to do when SCSI drives start failing.

Environment

Five-year-old Dell PowerEdge 1500SC with Perc 3/SC RAID controller. Four 18GB SCSI disks configured as a RAID 5 array, with a fifth 18GB available as a hot spare. Sixth disk is a 36GB drive configured as RAID 0 (not redundant). Windows 2003 R2.
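
As a quick check on what this layout provides, here is a minimal Python sketch of the usable capacity. It assumes the usual rules: RAID 5 gives up one disk's worth of space to parity, and RAID 0 uses all of the space with no redundancy. The function names are made up for illustration, and the hot spare is left out because it holds no data until a rebuild starts.

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 5 set: one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


def raid0_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 0 set: all space is usable, none is redundant."""
    return disk_count * disk_size_gb


# The server described above: four 18GB disks in RAID 5, one 36GB disk in RAID 0.
print(raid5_usable_gb(4, 18))  # 54
print(raid0_usable_gb(1, 36))  # 36
```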

In this article, I'll call the physical disks Disk 1 through Disk 6, corresponding to what the SCSI BIOS calls 0:0 through 0:5.

The Drama

First of all, kudos to Dell OpenManage with an IT Assistant workstation. What a pain to set up, but without them, I wouldn't have known about these disk issues until who knows when. The client's server sent SNMP warnings to my IT Assistant workstation (through a router-based VPN). Then IT Assistant sent me emails reporting the critical errors.

Disk 6 Failing

Disk 6, containing the 36GB non-redundant volume, started generating SCSI sense errors, which IT Assistant faithfully passed on to me. Example:  "Alert message ID: 2095, Array disk warning. SCSI sense data Sense key: 3 Sense code:11 Sense qualifier: 1, Controller 0, Channel 0, Array Disk 0:5" (that's Disk 6). After discussing with Dell support, I decided this drive was on its last legs. I moved data from this disk to an external drive and stopped using Disk 6. However, I did not delete the virtual disk, nor did I remove the physical disk. I figured as long as I didn't need the data, it didn't hurt to leave the failing drive in there for a while.
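
The sense data in an alert like this can be decoded with a small lookup. The following Python sketch assumes that OpenManage prints the three fields in hexadecimal, and it carries only a short excerpt of the standard SCSI sense key and additional sense code tables. The function and table names are hypothetical.

```python
import re

# Small excerpts of the standard SCSI tables; not a complete list.
SENSE_KEYS = {
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0xB: "ABORTED COMMAND",
}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
}

ALERT_PATTERN = re.compile(
    r"Sense key:\s*([0-9A-Fa-f]+)\s+"
    r"Sense code:\s*([0-9A-Fa-f]+)\s+"
    r"Sense qualifier:\s*([0-9A-Fa-f]+)"
)


def decode_sense(alert_text):
    """Return (sense key, additional sense) descriptions, or None if no sense data is found."""
    match = ALERT_PATTERN.search(alert_text)
    if match is None:
        return None
    # Assumption: all three fields are printed in hexadecimal.
    key, asc, ascq = (int(field, 16) for field in match.groups())
    return (
        SENSE_KEYS.get(key, "unknown sense key"),
        ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ"),
    )


alert = ("Alert message ID: 2095, Array disk warning. SCSI sense data "
         "Sense key: 3 Sense code:11 Sense qualifier: 1, "
         "Controller 0, Channel 0, Array Disk 0:5")
print(decode_sense(alert))
```

Under that hexadecimal assumption, the alert decodes to a medium error with read retries exhausted, which is consistent with a disk surface that is wearing out.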

Disk 4 Failed 

Suddenly, a timeout error appears in the event log:

Source: mraid35x
Event ID: 9
Description: The device, \Device\Scsi\mraid35x1, did not respond within the timeout period. 

IT Assistant did not forward this message, but it did tell me that the virtual disk was "degraded," which means that a disk had dropped out of the array. When I checked the server, I saw that Disk 4 had been dropped from the array and that the hot spare (Disk 5) had automatically become active and was being rebuilt into the RAID array. So the automatic rebuild using the hot spare worked great, but why did Disk 4 fail in the first place? No hardware errors were reported on Disk 4 before or after it was kicked out of the RAID 5 array.

Using Dell Diagnostics under Windows, a Quick Test of Disk 4 reported that a Verify did in fact fail. (As an aside, the diagnostics caused the Terminal Server and web server to stop responding for some time, perhaps because they were trying to gain access to the tape drive that was under control of the Veritas Backup Exec drivers. Fortunately the machine eventually "un-hung" itself, averting an emergency road trip to the client site.)

On Site:  A Lovely BSOD and a Rare RAID BIOS Issue

The next day I went to the client site to assess the situation.

The first thing I wanted to do was get rid of Disk 6 (the one with SCSI sense errors). Under Windows, using the Dell Server Administrator (the local OpenManage application), I deleted the second virtual disk, then made Disk 6 ready for removal. Its lights flashed appropriately, then turned off. (Flashing lights are a good thing, reminding me that the drives are numbered from right to left.) I removed the left-most Disk 6, then restarted Windows. Windows took a long time shutting down, eventually displaying a blue screen with this message:

KERNEL_STACK_INPAGE_ERROR
Stop 0x00000077 (0xC0000185, 0xC0000185, 0x00000000, 0x00C1D000)

After generating a full memory dump, the system restarted. But the RAID BIOS got stuck on the message

Spinning SCSI devices...80%

Google has only four hits on the "Spinning SCSI devices" error. The one in English seems to be about Linux. However, a search of the Microsoft Knowledge Base for the kernel error was more fruitful.  A quick review of this article:

Troubleshooting "Stop 0x00000077" or "KERNEL_STACK_INPAGE_ERROR"
http://support.microsoft.com/kb/315266

led to the conclusion that Windows had encountered a SCSI termination problem ("0xC0000185, or STATUS_IO_DEVICE_ERROR: improper termination or defective cabling of SCSI-based devices, or two devices attempting to use the same IRQ.") That makes some sense, since this was the last device in the SCSI chain, but isn't the RAID controller supposed to allow removing a drive and dynamically adjust the termination? Well, maybe not.
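
The first parameter of a Stop 0x77 is an NTSTATUS value, so the lookup can be sketched in a few lines of Python. The table below is only an excerpt of disk-related NTSTATUS names, and the function name is hypothetical. The likely cause for each code still has to come from the KB article.

```python
# Excerpt of NTSTATUS values that can appear as the first parameter of Stop 0x77.
NTSTATUS_NAMES = {
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",
    0xC000009C: "STATUS_DEVICE_DATA_ERROR",
    0xC000009D: "STATUS_DEVICE_NOT_CONNECTED",
    0xC000016A: "STATUS_DISK_OPERATION_FAILED",
    0xC0000185: "STATUS_IO_DEVICE_ERROR",
}


def describe_stop_parameter(text):
    """Map a hex string such as '0xC0000185' to its NTSTATUS name."""
    value = int(text, 16)
    return NTSTATUS_NAMES.get(value, "unknown status 0x%08X" % value)


print(describe_stop_parameter("0xC0000185"))  # STATUS_IO_DEVICE_ERROR
```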

I inserted the failed Disk 6 back into its slot and turned on the machine. No more "Spinning SCSI Devices" message. Windows booted fine and shut down fine.

This time I booted into the RAID BIOS and double-checked that Disk 6 was not in use (in other words, it was in a Ready state). I couldn't find an option to make the physical drive "ready for removal," so I just powered off the server, removed Disk 6, and powered on again. Still no "Spinning SCSI Devices" message. Conclusion:  the RAID controller needs to cycle power to change SCSI termination.

Re-Testing Disk 4 

Alright, now what about that Disk 4 that supposedly failed? It's a Quantum Atlas 10K III. I wanted to run the vendor's SCSI Max tests against it (downloaded from Seagate's site), but that only runs from a Windows 98 boot diskette, and that diskette hung when searching the PCI bus of the server. I had doubts that it would be compatible with the Perc RAID controller. My backup plan was to take Disk 4 back to the office and insert it into an old PowerEdge 2400 that has a SCSI backplane but no RAID controller. My hunch is that this would allow accessing the drive as a "normal" SCSI device.

But while I'm on site, let's try the Dell diagnostics. During system boot, I pressed F10 to start the Dell system utilities from the utility partition. I ran the Quick Test on that drive: Pass! Then the Extended Test (45 minutes): Pass! So hopefully Disk 4 is actually still good and was just showing errors because the failing Disk 6 was still in the chain.

At this point I went back into the RAID BIOS to assign Disk 4 as a global hot spare. Oh--and while the server was down, I made a photocopy of Disk 4 so I have the serial number for verifying warranty status, should the need arise. Finally, after booting into Windows, I ran the Quick Test from the Windows version of Dell diagnostics. This test, which failed yesterday, now passed as well.

Lessons Learned

There are a couple of things I will keep in mind for next time.

Work or Get Off the Bus 

If a drive is generating SCSI errors, fix the errors or get the drive off the SCSI bus. Don't leave the drive there, even if you aren't using it, or it might cause problems with other drives.

Hot Swap Does Not Mean Hot Remove

Maybe you can replace a drive without shutting down the server. But to remove a drive, especially the last drive in the SCSI chain, remove the drive from the RAID configuration (whether under Windows or the RAID BIOS), then power down the server. Remove the drive and power up. This way, the RAID controller stands a better chance of adjusting termination. 

Remove First from Windows, Then from RAID

Although this didn't cause problems this time, it would probably be best to delete the volume under Windows Disk Management before removing its logical volume from the RAID configuration.

Certificate Authority on SBS 2003

by Mark Berry 6/26/2007 6:00:00 AM

The June/July 2007 issue of SMB Partner Community magazine has a nice article by Mikael Nyström on setting up certificate services under Small Business Server. 

I set up a Microsoft Certificate Authority (CA) under SBS a few years ago, primarily to centralize management of the Encrypting File System (EFS). And while I also use it for the external web certificate, Mikael takes that a step further by explaining how to configure the certificate server to handle multiple web site names (localhost, companyweb, the external name, etc.) from the same certificate.

Setting up and maintaining a CA is not a trivial task. I only wish I'd had Mikael's step-by-step instructions when I was getting started!

The SMB Partner Community magazine is free to Microsoft Small Business Specialists, but anyone can subscribe for a small fee:  www.smbnation.com/smbpc.htm.

Another excellent reference is the book Microsoft Windows Server 2003 PKI and Certificate Security by Brian Komar with the Microsoft PKI Team, ISBN 0-7356-2021-0.

Seagate FreeAgent Pro eSATA and Bad Block Errors

by Mark Berry 6/25/2007 4:51:00 AM

Background 

I was looking for an external hard drive for a Windows 2003 Server. The drive will be used mostly for backups. My long-range plan is to be able to save workstation backup images to the drive via the network, but to be able to attach the drive directly to a workstation, if necessary, to restore an image. The drive should be reasonably fast when attached to the server, so eSATA seemed to fit the bill. However, since the workstations don't have SATA cards, the drive needed to support USB as well. I chose a 500GB Seagate FreeAgent Pro USB/eSATA version, along with a SIIG eSATA II-150 PCI card for the server.

I should perhaps mention that while the SIIG card is supported under Windows Server 2003, the FreeAgent drive is not. I talked to Seagate sales before I purchased the drive, and the impression I got was that although the drive is not officially supported, it should work. And in fact it does--with one important modification.

Symptoms 

The problem I encountered after connecting the drive to the eSATA card was the following error in the Windows System event log:

Source: Disk
Event ID: 7
Description: The device, \Device\Harddisk2, has a bad block.

At first I thought this indicated a bad disk. But eventually I saw a pattern:  the bad block error only occurred if the drive had entered its "sleep" state, i.e. had spun down. By default, this happens after 15 minutes. If I accessed the drive in Windows Explorer after it had spun down, in the few seconds that it took it to spin back up, Windows would get impatient and report a bad block error. Note that the "sleep" mode I'm referring to is an internal feature of the Seagate drive--it is not under the control of Windows Power Options in the Control Panel.

If I attached the Seagate drive via a USB cable, let it spin down, and then accessed it, I did not get a bad block error.
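
One way to test a pattern like this is to flag every access that follows an idle gap at least as long as the sleep interval, then compare those moments with the Event ID 7 entries in the System log. The Python sketch below assumes the default 15-minute interval. The function name is hypothetical and the timestamps are invented for illustration, not taken from the actual event log.

```python
from datetime import datetime, timedelta

SLEEP_AFTER = timedelta(minutes=15)  # the drive's default spin-down interval


def accesses_after_spin_down(access_times, sleep_after=SLEEP_AFTER):
    """Return the accesses that arrive after the drive has had time to spin down."""
    ordered = sorted(access_times)
    risky = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous >= sleep_after:
            risky.append(current)
    return risky


# Hypothetical access times, invented for illustration.
accesses = [
    datetime(2007, 6, 24, 9, 0),
    datetime(2007, 6, 24, 9, 5),    # 5 minutes idle: drive still spinning
    datetime(2007, 6, 24, 9, 40),   # 35 minutes idle: drive asleep
    datetime(2007, 6, 24, 9, 42),
]
for moment in accesses_after_spin_down(accesses):
    print("Expect a possible Event ID 7 near", moment)
```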

Another symptom, one that I did not at first associate with the FreeAgent disk, was the following message in the Application event log:

Source:  VSS
Event ID:  12289
Description:  Volume Shadow Copy Service error: Unexpected error DeviceIoControl(\\?\Volume{ba849e07-88fb-11d9-9c6f-806d6172696f} - 0000017C,0x0053c020,00039B48,0,00038B40,4096,[0]).  hr = 0x80070017.

This error occurred when I started an ntbackup job. Apparently, during initialization, the Volume Shadow Copy Service couldn't immediately access the Seagate drive and so logged this error. A bad block error was also logged at exactly the same time.
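
The hr value in the VSS message can be taken apart to see what kind of failure it reports. This Python sketch splits a 32-bit HRESULT into its standard fields. The function name is hypothetical, and the two constants are the Win32 facility number and the Win32 error number for ERROR_CRC.

```python
def decode_hresult(hresult):
    """Split a 32-bit HRESULT into its failure flag, facility, and code fields."""
    return {
        "failed": bool(hresult & 0x80000000),  # bit 31: severity
        "facility": (hresult >> 16) & 0x7FF,   # bits 16-26
        "code": hresult & 0xFFFF,              # bits 0-15
    }


FACILITY_WIN32 = 7
ERROR_CRC = 23  # Win32: "Data error (cyclic redundancy check)"

fields = decode_hresult(0x80070017)
print(fields)
if fields["facility"] == FACILITY_WIN32 and fields["code"] == ERROR_CRC:
    print("Win32 ERROR_CRC: data error (cyclic redundancy check)")
```

So 0x80070017 is a wrapped Win32 data error, which fits with the disk-level bad block event logged at the same moment.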

Cause

My speculation is that Windows knows that USB drives may spin down, and it will wait for them to become accessible. However, because an eSATA drive runs as a BIOS-attached drive (similar to a SCSI drive), Windows treats it as an internal drive and expects it to be "on" at all times.

Solution

The solution is to attach the drive to a Windows XP or Vista machine, install the FreeAgent Tools software, go to Utilities, and set the Drive Sleep Interval to Never. Then move the drive back to the server. The Sleep Interval setting is maintained even though the drive is powered down when moving it to another machine. Once I did this, both the bad block and the VSS errors stopped.

Conclusion

It's obviously a pain to have to install a 142MB software package just to change the Sleep Interval, but Seagate Support said there is no other way. Too bad they don't make a simple command-line utility for updating the drive settings. (The software has lots of slick backup/restore features, integration with Internet drive access, etc.--all kinds of things that I don't need in this environment.) I tried installing the FreeAgent software under Windows Server 2003, but the installation failed with a message that it only runs under Windows 2000, XP, or Vista. Hence the solution of temporarily installing the software on a desktop machine.

I wonder if others have had similar problems with eSATA drives that like to go to sleep? Eventually Windows may need to add an option to treat eSATA drives like it treats USB drives. Considering that this drive is only used once a day during backups, it would be nice if sleep mode worked without causing errors. In the meantime, I'll hope that keeping the drive out of sleep mode resolves the "bad block" errors.

Configuring Services for Macintosh under Windows Server 2003

by Mark Berry 6/23/2007 10:14:00 AM

I recently uninstalled and re-installed network drivers on a Windows 2003 system. This clobbered the Services for Macintosh (SFM) configuration made about five years ago, and it took quite a while to figure out how to set up SFM again. This is a small network with one server, a few Windows PCs, and one Macintosh.

Step 1:  Install File and Print Services for Macintosh 

I was able to use Windows Component setup to install File and Print Services for Macintosh, which also installed the AppleTalk protocol. But I couldn't figure out why the zone list dropdown was empty (under Control Panel > Network Connections > Local Area Connection > Properties > AppleTalk Protocol > Properties). I installed Windows Server 2003 in a new virtual machine and it was still empty!

Step 2:  Configure AppleTalk Routing

Finally I started to grasp that the zone list comes from a router. Since this network has no external AppleTalk router, I needed to configure the Windows 2003 Server as an AppleTalk router. The following procedure is expanded from the Help and Support topic "Configure AppleTalk routing", also found on TechNet.

  1. Open Administrative Tools > Routing and Remote Access. If you're not already using RRAS, there will be a red down-arrow next to the server name indicating that the service is not running. As far as I know, the Routing and Remote Access Service does not need to run to enable AppleTalk routing.
  2. Under Routing and Remote Access, double-click your server and right-click AppleTalk Routing.
  3. Click Enable AppleTalk Routing.
  4. In the Adapters list, right-click an adapter, and then click Properties.
  5. Configure seed routing, network number allocation, and the zone list as appropriate for the computer.
    1. Check Enable seed routing on this network.
    2. Set the Network range From 1 To 100. This could get complicated in a multi-router/multi-zone environment, but for this single-server situation, 1-100 is more than enough:  at 253 nodes per number, that allows for 253 * 100 = 25,300 AppleTalk nodes.
    3. Under Zones, click on New and type in the desired Zone Name. Set As Default is grayed out because there is only one zone:  it already is the default.
    4. Click on OK to close the adapter's properties. This seems to take a few seconds.
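
The arithmetic behind the network range in step 5 can be sketched in Python. It uses the figure of 253 nodes per network number quoted above and treats the range as inclusive. The function name is hypothetical.

```python
NODES_PER_NETWORK_NUMBER = 253  # usable node IDs per AppleTalk network number


def appletalk_capacity(range_from, range_to):
    """Total node addresses available in an inclusive network range."""
    if range_from > range_to:
        raise ValueError("range_from must not exceed range_to")
    network_numbers = range_to - range_from + 1
    return network_numbers * NODES_PER_NETWORK_NUMBER


print(appletalk_capacity(1, 100))  # 25300
print(appletalk_capacity(1, 1))    # 253
```

Even a range of 1 to 1 would leave room for 253 nodes, so a network with a single Macintosh has plenty of headroom either way.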

Now, when you go back to Control Panel > Network Connections > Local Area Connection > Properties > AppleTalk Protocol > Properties, you should see the zone you just defined listed and selected in the drop-down.

Step 3:  Set Up Printer and Folder Sharing

It looks like a shared printer will automatically be shared by Print Server for Macintosh without further configuration.

However, folder shares (Mac "volumes") must be set up individually. For instructions, see "Create a Macintosh-accessible volume" in Help or on TechNet.

Notes on Folder Sharing

  • If the folder is shared for both Windows and Macintosh users, it should appear twice in the list of shares under Computer Management > System Tools > Shared Folders > Shares, once with Type = Windows and again with Type = Macintosh.
  • When you set up a new Macintosh share with the Share a Folder Wizard, the share is read-only. To make it writeable, you have to go back in to edit the share's properties and clear the This volume is read-only checkbox. I did not set a password on the shares since file access is controlled by Windows file permissions.
  • Macintosh shares do not appear when you look at a folder's sharing properties under Windows Explorer; you have to edit them from Computer Management > System Tools.

Welcome to Mark's Small Business Developer Blog

by Mark Berry 6/23/2007 8:09:00 AM

Finally blogging...

I'm old school, I guess. I love gadgets, but I still think of computers mostly as tools for work. I don't play video games. I don't listen to digital music. I rarely IM and I never text. And until today, I've never blogged.

So why start? Details. I've been finding it increasingly difficult to keep track of all the technical details that it takes to be in the computer business these days. Working in both IT administration and development adds to the confusion--one day I may be reconfiguring a network card; the next I'm researching the best way to trap errors in a C# program. For my own reference, I need a way to log the various sites and nuggets of information that I dig up.

I've tried building an internal web site for this kind of info, but that's pretty tedious. Maybe blogging will be easier, and therefore more frequently used. If I can occasionally share something that others find useful, so much the better.

I looked at a few free and paid blogging sites, but in the end, I decided to host the blog on my own site using a pretty cool open source .NET blog engine called, uh, BlogEngine.NET.

Let me know what you think!

Recover native zip folder functionality under SBS 2003

by Mark Berry 6/22/2007 11:31:00 PM

My .zip files were not appearing as folders under SBS 2003. I assume that is because I uninstalled WinZip. Finally found some instructions on fixing that under XP that also worked under SBS 2003. Go to a command prompt and type:

REGSVR32 ZIPFLDR.DLL
REGSVR32 CABVIEW.DLL

About the author

Mark Berry owns MCB Systems, a firm active in both IT administration and database software development.

Disclaimer

The opinions expressed herein are my own personal opinions and absolutely represent my employer's views. I'm self-employed! Please keep in mind that what worked for me or someone else may not apply to your situation. Always have a good backup, and use any information here at your own risk!

Entire contents copyright © 2010 by MCB Systems.