Oracle 6140 Copyback doesn't start automatically

Again this week we had a disk fail in one of our ageing 6140 arrays. No big deal, but once I'd pulled the failed disk, waited around for a few minutes, and replaced it with a new one, copyback didn't start of its own accord. I've seen this a number of times before, but not for a year or so now.
It pretty much always happens if the disk fails with:

Event Message: Drive by-passed
Component type: Drive

It can sometimes also declare the disk Missing. Here's how to breathe some life into the process. OK, before we start, this will only work when the array is on 07.xx firmware.

You will have a disk looking something like this:

Replaced Drive – No copyback

First, click on the volume group (VG) the disk belongs to in the Logical tab.

Now right-click the volume group name and click on Replace Drives.

Replace drives

A new window will open. In the top panel you will see the missing disk mentioned, and in the bottom panel it should list the hot spare (HS) that is in use (if you had any) and any unassigned disks that might be in the array.

Select drives to replace

The disk you replaced should show here as an unassigned disk, so click on it in the bottom panel and 'Replace Drive' will become available. Click it and the copyback will now happen.

If you did not want the copyback to happen and would rather keep the HS as part of the VG, you would click on that instead. It would then flip the disk from an 'in-use hot spare' to a member of the VG, but you would now be one HS down unless you made the newly replaced disk the HS.

Once you make your disk selection, copyback should begin.

Copyback Begins
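As an aside, if you'd rather watch the copyback from a shell than keep the GUI open, a minimal Python sketch like the one below does the job by polling the array's health status through SMcli. The SMcli path and controller IPs are made-up examples, and the exact wording of the status output varies by firmware level, so treat it as a starting point rather than gospel.

#!/usr/bin/env python
# Rough sketch: poll the array health status via SMcli until the copyback
# (drive reconstruction) clears. IPs and paths are examples only - adjust
# for your own controllers and SMcli install location.
import subprocess
import time

SMCLI = "/opt/SMgr/client/SMcli"          # typical install path; verify yours
CONTROLLERS = ["10.0.0.10", "10.0.0.11"]  # example controller A/B IPs

def health_status():
    """Return the healthStatus output as text (whatever SMcli prints)."""
    cmd = [SMCLI] + CONTROLLERS + ["-c", "show storageArray healthStatus;"]
    try:
        return subprocess.check_output(cmd).decode("utf-8", "replace")
    except subprocess.CalledProcessError as err:
        return err.output.decode("utf-8", "replace") if err.output else ""

if __name__ == "__main__":
    while True:
        status = health_status()
        print(status.strip())
        # The wording differs between firmware levels; 'optimal' is what we want to see.
        if "optimal" in status.lower():
            break
        time.sleep(300)  # check every 5 minutes while the copyback runs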

Pretty simple, but perhaps not obvious. I certainly can't remember it first time every time, so in some ways this post is for me.

Sun StorageTek/Oracle 6000 series firmware preparation

I recently had some good feedback from a couple of really nice folk asking for advice/help/comment on some Sun/STK/Oracle 6xxx series issues they were having. I'll try to start putting more material up about the arrays.

One very decent document I read was the upgrade guide. It's a pretty simple doc to read and contains good information about the upgrade process.

Sun Storage 6000 Series Array Firmware Upgrade Guide

To that end: since the move away from SunSolve, it's not exactly obvious how to find the software in the MOS portal. To locate the 6xxx series software for the firmware upgrade process, I've put together this 'where is it' step by step.

1. Login to My Oracle Support at https://support.oracle.com/.
2. Along the top of the window that opens as your first page, click on the ‘Patches & Updates’ tab.
3. In the Patch Search pane, click on "Product or Family (Advanced Search)."
4. Tick or check the box for “Include all products in a family.”
5. In the Product field, click the drop-down and select "Sun StorageTek 6000 Series Software".
6. In the Release field, select “Sun StorageTek 6000 Series software 1.0”. It should already be selected, but just check.
7. Select the platform you want to install the tool on and click Search, or, as I wanted to see everything, just click Search.

My Oracle Support – Sun StorageTek 6000 series firmware

8. This will take you to a new window with your search results, including Patch 10265930: "Sun StorageTek 6000 Series Array Firmware Upgrade Utility".
9. Download the zip file and extract the executables.
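If you're pulling the patch down to a box without a desktop, a few lines of Python will unpack it and show you what's inside. Note the zip filename below is a placeholder; use whatever filename MOS actually gives you for patch 10265930.

#!/usr/bin/env python
# Unpack the downloaded patch zip and list its contents.
# The filename is a placeholder - substitute the real download name.
import zipfile

PATCH_ZIP = "p10265930_firmware_upgrade_utility.zip"  # placeholder name
DEST_DIR = "upgrade_utility"

with zipfile.ZipFile(PATCH_ZIP) as patch:
    patch.extractall(DEST_DIR)
    for name in patch.namelist():
        print(name)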

Once you have the file extracted and the utility installed, we can proceed to the firmware updates themselves.

Modifying or clearing Sun STK DACstore on foreign disks

This procedure only works for 06.xx code running on the controllers, and is intended for SANtricity users.

You are not clearing the DACstore so much as removing the foreign one and adding the correct one for the receiving subsystem. The effect is the same: you can use the foreign disk. (A rough command-line sketch of the hot spare juggling follows the steps below.)

1. In the source subsystem, locate the non-data (unassigned) disks you wish to transfer and remove them one at a time with approx. two minutes between each. If the array is off, just remove them.
2. In the target subsystem, locate a hot spare of equivalent spec/size etc.
3. In SANtricity, fail the hot spare.
4. Remove it from the tray and wait a minute or so.
5. Place the 'foreign' disk into the hot spare slot, wait for the system to settle and mark it as a hot spare. (SANtricity can lag behind actual events.)
6. Unassign it as a hot spare.
7. Reassign it as a hot spare.
8. Install the disk previously in this slot into a desired unassigned slot within this subsystem.
9. Repeat steps 2 through 8 for all disks to be installed in this target subsystem.
10. When the last disk is done, leave it as the hot spare.
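If you prefer SMcli to the GUI, steps 6 and 7 (the unassign/reassign of the hot spare) look roughly like the sketch below. The tray/slot numbers, IPs and SMcli path are invented for the example, and the script syntax should be checked against the CLI reference for your 06.xx code level before you rely on it.

#!/usr/bin/env python
# Sketch of steps 6-7: flip the hot-spare flag off and back on for the slot
# now holding the 'foreign' disk. Tray/slot and IPs are made-up examples.
import subprocess

SMCLI = "/opt/SMgr/client/SMcli"   # adjust for your SMcli install
CONTROLLERS = ["10.0.0.10", "10.0.0.11"]
TRAY, SLOT = 1, 4                  # the slot the foreign disk now sits in

def run_script(script):
    """Send a single SANtricity script command to the array via SMcli."""
    subprocess.check_call([SMCLI] + CONTROLLERS + ["-c", script])

# Step 6: unassign the disk as a hot spare...
run_script("set drive [%d,%d] hotSpare=FALSE;" % (TRAY, SLOT))
# Step 7: ...then reassign it, at which point the receiving subsystem adopts the disk.
run_script("set drive [%d,%d] hotSpare=TRUE;" % (TRAY, SLOT))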

For 7.x firmware subsystems, the process of adding 'foreign' disks is different, as they could be from either 6.x or 7.x firmware subsystems. (To follow.)

Sun STK 6140 firmware upgrade

On Saturday we jumped from 06.xx.xx.xx software to the latest 07 software (crystal). We had done a lot of preparation for this event, including practical stuff like vMotioning hosts to alternate storage (even local) and checking backups (repeatedly!). There was also the less practical stuff like talking about it, worrying about it, etc.: hot air.

Some pre-requisites:
* Multi-path software: RDAC is only supported on 06 series firmware up to and including 06.60.xx.xx; conversely MPIO (Sun) is only supported on 06.60.xx.xx and above. You either need to use 06.60 as an intermediate firmware or plan to migrate your Windows hosts from RDAC to MPIO on the day. Practically, I opted for MPIO in advance (it worked, but isn't supported).
* VMware: only 3.5 Update 5 and above is supported.
* Array: save the configuration, save the profile, and save a full support capture (see the sketch after this list). The array has to be 'green'.
* Backups: Make sure they work!
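For the Array item above, I find it quicker to script the saves than to click through three separate dialogs. Here's a rough sketch of what that looks like driven through SMcli; the IPs, paths and filenames are examples, and the exact script options (particularly on the configuration save) are worth double-checking against the CLI reference for your firmware level.

#!/usr/bin/env python
# Sketch: capture configuration, profile and a full support bundle before
# the upgrade. IPs, paths and filenames are examples only.
import subprocess

SMCLI = "/opt/SMgr/client/SMcli"
CONTROLLERS = ["10.0.0.10", "10.0.0.11"]

def smcli(script, outfile=None):
    """Run one SANtricity script command, optionally writing its output to a file."""
    cmd = [SMCLI] + CONTROLLERS + ["-c", script]
    if outfile:
        cmd += ["-o", outfile]   # -o redirects the command output to a file
    subprocess.check_call(cmd)

# Configuration save (check the allConfig option name against your CLI reference).
smcli('save storageArray configuration file="array-config.cfg" allConfig;')
# Profile - capture the text output of the profile command.
smcli("show storageArray profile;", outfile="array-profile.txt")
# Full support capture.
smcli('save storageArray supportData file="array-supportdata.zip";')
# And confirm the array is 'green' before you start.
smcli("show storageArray healthStatus;")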

When you move to 07.xx.xx.xx, a VMWARE host region is created. It's recommended that you move your host regions from LINUX to this new VMWARE region. Before you do this, however, you need to delete all the access volumes (LUN 31). If you do all out-of-band management you do not need them anyway (for any host region), but in-band or not, with the new firmware you don't need these LUNs to run scripts etc. for VMware communication/management. Most importantly, when you use the VMWARE host region they cause VMware problems: VMware attempts to mount them, and so on.
 
Remember, the HBAs on the servers that run VMware would have used a LINUX host region and hence the LINUX HBA recommended settings. You now use the VMWARE settings, which are mostly to leave everything at default (with Update 5 and beyond).

Make the host region change BEFORE you bring your array back online; making this change whilst VMware hosts are attached will cause a VMware failure and at worst could result in data loss.

Go into SANtricity, locate the Mappings view, expand your host group for VMware and right-click on each host. Select Change Host Region, then select VMWARE.
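With a fair number of VMware hosts this gets tedious in the GUI, so here's a hedged sketch of the same change from SMcli. The host names, IPs and SMcli path are invented, and the exact host type label (or index) for VMWARE should come from the array's own host type table, not from this example.

#!/usr/bin/env python
# Sketch: flip a list of VMware hosts from the LINUX host region to VMWARE.
# Host names, IPs and the SMcli path are examples; list the valid host types
# first ('show storageArray hostTypeTable;') and use whatever label/index
# your firmware actually presents.
import subprocess

SMCLI = "/opt/SMgr/client/SMcli"
CONTROLLERS = ["10.0.0.10", "10.0.0.11"]
VMWARE_HOSTS = ["esx01", "esx02", "esx03"]   # your VMware host names as defined on the array

def smcli(script):
    subprocess.check_call([SMCLI] + CONTROLLERS + ["-c", script])

# Show the host type table so you can confirm the VMWARE entry on this firmware.
smcli("show storageArray hostTypeTable;")

for host in VMWARE_HOSTS:
    # Change the host region/type for each VMware host.
    smcli('set host ["%s"] hostType=VMWARE;' % host)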

Now go ahead and connect the array.
I opted to fire up the VMware hosts first, check them, and then the Windows hosts. We then put each VMware host through a maintenance reboot for good measure.

Sun StorageTek and VMware

Anyone who deals with Sun/StorageTek SAN hardware will know all about firmware upgrades and firmware versioning. It's tricky to get the right level for you at the right time, and of course, just like any firmware, there's always a newer one to fix bugs you'll "probably" never have.
At the same time, there are also updates which you need; you just don't know that until you somehow invoke them.
Now, as I explained to my boss and colleagues, I could spend days reading Sun bug reports and still be very little wiser, but the truth is they just aren't published, as a lot of what they hold is commercially sensitive/damaging.

We have 7 StorageTek (Sun, now Oracle) arrays: 1 is retired, 2 are now off maintenance and used as scratch space (trade-in just doesn't get you much; they're more useful working for us). The other 4 are very much live: 3 x 6140 and 1 x 6540.

Across our 3 6140 arrays we run pretty old firmware, 6.19.xx.xx. Why? Well, partly the old adage of 'if it ain't broke...', and also we simply don't require a lot of the newer feature sets.

Going back nearly 3 years, Sun introduced the 'crystal' firmware, the 7.yy.xx.xx range of firmware. It introduced many bug fixes, but also removed the 2TB LUN limit that the 6 series firmware had. The upgrade wasn't (isn't) trivial, and as we had little reason to go past 2TB, we elected to stay put. This is a fully supported thing to do.

All was fine until we hit 4 conditions.
1. We use RVM (Remote Volume Mirroring). 2 of our SANs replicate certain LUNs to each other. I'd always been dubious about this setup as various things had been done badly (mirror DB on the same spindles etc.). It was an inherited config.
2. We have firmware 6.19.xx.xx
3. We use VMWare 3.5U4.
4. A disk failed in a RAID10 group, where the RVM and VMWare storage was held.

This triggered a lesser known fault where VMware fails to receive the correct heartbeat SCSI bus response from the array via the VMFS driver, and it corrupts the MFT (master file table).

How? Well, that's partly a mystery, as neither Sun nor VMware will give us detail. Why? It's commercially sensitive and embarrassing to both the chip manufacturer and VMware.

So, I'm faced with a fault where, though VMware can see its volumes, browsing them shows no files. The files/data are there; it's just that the failed heartbeat commands have caused the MFT to be overwritten. The result is that host OSes on VMware just stop, or report they can't read their disk, BSOD, etc. Imagine it: a 7-host HA farm which loses 48 of its 138 guest OSes... finance systems, email, SQL, you name it, it died.
Now, without backups and a neat tool that replicates the vmdk files off site (and on site), we would have been f*cked. Even so, it's a mountain of work to start recovering that many systems, overnight, to be online. On top of this, you are restoring them to a setup which just caused the problem, but where else? This isn't a trivial amount of storage.
We did it, but that’s not my point here…

So, we discover more detail. Sun engaged VMware and LSI to look into the issue. VMware's analysis was that the corruption occurs in the metadata and heartbeat records: the VMFS driver has a pending heartbeat update but fails to find the heartbeat slot, which indicates corruption. You can find indications of the heartbeat corruption in the logs with an error along the lines of "Waiting for timed-out heartbeat". LSI have also looked into why this happens and identified that VMware is not following the SCSI specification regarding handling aborted commands at certain levels of code.

We gathered logs for VMware and logs for Sun, and in essence the answer back from both parties (who in one way or another blame each other... or in reality LSI, the chip manufacturer) is to upgrade to the crystal firmware.
The testing done by Sun/LSI/VMware reproduced the bug, but showed it only happened every 4 hours (a cycle-based thing) and only for 1-2 seconds. So if you blow out a disk in those 12 seconds across 24 hours, you trigger this bug... LUCK.

The next twist is that at code 6.19.xx.xx you run Sun's own multipathing software, RDAC, which makes sure you only 'see' your LUN once, as opposed to 4 times (depending on the cabling/path redundancy you put in); without RDAC we always see the LUN 4 times (2 HBAs in the host, and 1 link to each controller in the array).
Code 6.60.xx.xx and above allows you to run both RDAC and Microsoft's MPIO (for Windows hosts); however, version 7 code onwards only supports MPIO.
I'm not saying that RDAC won't work on the high code, or MPIO won't work on the low code, but they aren't SUPPORTED... magic words in a support agreement.

To get from 6.19.xx.xx to 7.60.xx.xx we have to go to 6.60.xx.xx in between, else we have no means of converting our Windows hosts to MPIO (and testing!! This is production kit, remember, in a 24/7/365 operation). Yes, I have to arrange to take down 138 VM guests and about 16 Windows hosts... twice.

Another important detail is that this hasn't been 'fixed' by VMware; even in vSphere 4 they don't regard it as their fault (despite it being a SCSI bus standards issue, or rather their lack of adherence to it). LSI had to write in a separate VMWARE host region to take it away from 'Linux', so there is a specific region just for VMware with what are clearly non-standard responses to SCSI-based commands requested from the VM hosts... I find that... ODD.

The moral of the story? There isn't one; there is no right/wrong way here in firmware terms, horses for courses. It's pure chance we hit a disk failure in that situation. We get a fair number of disk failures, it's the nature of the beast, and on this array, this one time, we triggered the event.