ESX Clusterbombed – Part 2

August 14, 2014, Troliver, Day Job

Looks like the storage array is dead. After a bunch of messing around a couple of days ago, it really is apparent that we have lost the FFD2 enclosure.

With it, we lose a few servers, but we also gain a load of storage disks. Until we manage to come up with a new storage solution that has backups, I’m taking all the old SAS drives out to use as hotspares for FFD1. I have a feeling we have lots of 15k RPM SAS drives of the same brand lying around, used more recently in other servers, that we could use to rebuild FFD2. It should work. If not, it’ll be a learning experience.

I ended up familiarising myself a lot with the Dell Modular Storage Manager (DMSM) software and found out the correct way to assign hotspares and replace drives in the enclosure. A lot of messing around, unplugging and restarting took place on Tuesday, eventually resulting in a hot spare being designated as a physical replacement on another enclosure. I had actually written a good amount up about this, but I was writing it in Notepad on a virtual machine that was subsequently restarted when, at some unknown point, it was decided to restart everything that was already running. Frustrating. But not the end of the world.

Moving forward, what needs to be done now is:

Have a backup solution:
- If a server fails, if the hosts fail and if everything is lost, we need to be able to, at worst, rebuild and reconfigure those servers. Each server should have a designated backup plan associated with it.
- Designate some replacement hot spare drives.
- Purchase a new storage array and an appropriate backup, with perhaps something like a scheduled daily backup of the system.
- Ideally the content from our internal wiki should be mirrored elsewhere so that, in the event of a disaster, we can recap on how to fix it.
Maintain the storage array and the ESX hosts more closely. Someone needs to monitor alarms as they appear and be informed of any storage array issues. I also need to look into why we no longer receive the support emails that alarms on the storage array used to generate automatically.
Rebuild the vCenter server, probably on a physical host rather than a virtual one. I will need to look into that.
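
To make the "each server should have a designated backup plan" point a little more concrete, here is a rough sketch of the kind of check I have in mind. Everything in it is made up for illustration: the `Server` record, the server names, the plan labels and the one-day threshold are all assumptions, not anything we actually run yet.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Server:
    name: str                         # hypothetical server name
    backup_plan: Optional[str]        # e.g. "daily image"; None means no plan yet
    last_backup: Optional[datetime]   # None means it has never been backed up


def backup_problems(servers, now, max_age=timedelta(days=1)):
    """Return (server name, problem) pairs for servers that need attention."""
    problems = []
    for s in servers:
        if s.backup_plan is None:
            problems.append((s.name, "no backup plan"))
        elif s.last_backup is None:
            problems.append((s.name, "never backed up"))
        elif now - s.last_backup > max_age:
            # Older than the assumed daily schedule allows
            problems.append((s.name, "backup is stale"))
    return problems


if __name__ == "__main__":
    now = datetime(2014, 8, 14, 9, 0)
    # Invented inventory, purely for illustration
    inventory = [
        Server("wiki", "daily image", now - timedelta(hours=6)),
        Server("fileserver", None, None),
        Server("imaging", "daily image", now - timedelta(days=4)),
    ]
    for name, problem in backup_problems(inventory, now):
        print(f"{name}: {problem}")
```

Something like this could run on a schedule and email whoever is watching the storage alarms, but the real version will depend on whatever backup solution we end up buying.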

For each of these points, I would probably make a new post, but this is just one part of what I am working on. FOG and the redeployment of our labs are also priorities, as are some other projects I have been working on lately. To be continued!

esx, storage
