Troliver

stories of war between boy and machine

ESX Clusterbombed – Part 2

Looks like the storage array is dead. After a bunch of messing around a couple of days ago, it really is apparent that we have lost the FFD2 enclosure.

With it, we lose a few servers, but we also gain a load of storage disks. Until we manage to come up with a new storage solution that has backups, I’m taking all the old SAS drives out to use as hot spares for FFD1. I have a feeling we have lots of 15k RPM SAS drives lying around, of the same brand, that were used more recently in other servers and that we could use to rebuild FFD2. It should work. If not, it’ll be a learning experience.

I ended up familiarising myself a lot with the Dell Modular Storage Manager (DMSM) software and found out the correct way to assign hot spares and replace drives in the enclosure. A lot of messing around, unplugging and restarting took place on Tuesday, eventually resulting in a hot spare being designated as a physical replacement on another enclosure. I had actually written up a good amount about this, but I was writing it in Notepad on a virtual machine that subsequently got restarted when, at some unknown point, it was decided to restart everything that was already running. Frustrating. But not the end of the world.

Moving forward, what needs to be done now is:

  • Have a backup solution:
    • If a server fails, if the hosts fail and if everything is lost, we need to be able to – at worst – rebuild and reconfigure those servers. Each server should have a designated backup plan associated with it.
    • Designate some replacement hot spare drives.
    • Purchase a new storage array and an appropriate backup, with perhaps something like a scheduled daily backup of the system.
    • Ideally the content from our internal wiki should be mirrored elsewhere so that, in the event of a disaster, we can recap on how to fix it.
  • Maintain the storage array and the ESX hosts more closely. Someone needs to monitor alarms as they appear and be informed of any storage array issues. I also need to look into why we no longer receive the support emails that alarms on the storage array used to generate automatically.
  • Rebuild the vCenter server – probably on a physical host rather than a virtual one. Will need to look into that.
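
As a rough sketch of the "each server should have a designated backup plan" point, something like the following could flag servers that don't have one yet. The server names and the plan fields here are made up for illustration; they are not our real inventory, and a proper plan would need more detail than this.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BackupPlan:
    # Hypothetical fields, chosen only for this example.
    target: str    # where the backup would be stored
    schedule: str  # how often it would run, e.g. "daily"


def servers_without_plan(inventory: Dict[str, Optional[BackupPlan]]) -> List[str]:
    """Return the names of servers that have no backup plan, sorted."""
    return sorted(name for name, plan in inventory.items() if plan is None)


# Made-up inventory: None means no plan has been designated yet.
inventory = {
    "wiki": BackupPlan(target="offsite-mirror", schedule="daily"),
    "fog": None,
    "vcenter": None,
}

missing = servers_without_plan(inventory)
print("Servers still needing a backup plan:", ", ".join(missing))
```

Even a list this simple would make it obvious which servers we couldn't rebuild if another enclosure went the same way as FFD2.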

For each of these points, I would probably make a new post – but this is just one part of what I am working with. FOG and the redeployment of our labs is also a priority, as are some other projects I have been working on lately. To be continued!
