I thought I would share some of my thoughts on how I have seen our server and storage platforms evolve over time. What I loved about them, what I hated about them, where I am at today and what made a company that stayed a loyal HP customer for years decide to move to DELL.
Whilst the DEC/Compaq/HP storage solutions were innovative for their time and in most cases aided productivity and flexibility and gave the business varying levels of protection, some of them included software and hardware features that we either had to pay a premium for or simply didn't require. The administrative overhead of keeping all the moving parts at the firmware and software patch levels required for vendor support also became genuinely cumbersome and problematic for a really lean team to manage.
Anyhow, here comes the timeline of how the storage environment has changed over the years.
We started off with an Ultrix machine with direct-attached SCSI storage; this was replaced, shortly after I started fresh from college, with a DEC 3000/800 graphics workstation.
Here's a pic of me for your amusement looking very 90s, with the trusty booking system just sitting under the desk in the IT office – this was before the days of us having a dedicated room with fire suppression and cooling.
At the time this thing was amazing to me, both visually and in terms of the compute power it offered. It featured a set of TurboChannel cards and a BNC connection backed off to an Ethernet transceiver to allow us to connect it to our more 'modern' network switches. We also used a set of six StorageWorks arrays that we grouped together using the joys of LSM (the Logical Storage Manager) on OSF/1. This did us proud as a growing company for many years, although it was like standing on a flight deck when it started up or was under load.
I would say this was probably the longest period we ran without change to either server or storage platform, mainly because this was the pre-web era for us as a company and most of our business came from third parties via our Prestel network/protocol converters. I think it's fair to say the only real limits were the number of StorageWorks arrays we could link together (continuing that indefinitely was costly for us), and managing the LSM environment, which could get really messy and give you a bad day very easily if you lost more than a few disks.
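For anyone who never met LSM, the day-to-day flow looked roughly like this – a from-memory sketch of the VxVM-derived vol* commands, with illustrative disk and volume names, and flags that may well have varied between releases:

```sh
# Initialise some disks and hand them over to LSM (device names illustrative)
voldiskadd dsk10 dsk11 dsk12

# Create a 2 GB mirrored volume in the default disk group
volassist make bookings_vol 2g layout=mirror

# Inspect the state of volumes, plexes and subdisks
volprint -ht
```

Simple enough when everything was healthy; the mess started when plexes detached and you had to work out by hand which side of the mirror still held good data.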
Next up we made the move to an Alpha 1200. Among other reasons, the key driver for the switch was becoming Y2K compliant. The Alpha 1200 had SMP and ran HP Tru64 Unix along with the AdvFS filesystem on StorageWorks arrays. I loved AdvFS, as it allowed us to mix our environment with ease when we needed a reactive quick fix – which became more important as we moved the majority of our focus to becoming a growing Internet business. This was the last of our direct-attached setups, and it was quickly replaced by our first clustered environment, as business continuity became really important for a business that was now 24/7.
This meant we replaced the Alpha with a pair of Alpha ES40s connected together with MemoryChannel interconnects into a set of MemoryChannel hubs. These provided the hardware-based heartbeat used to check whether the other node was still alive. With the ES40s we also made our first move from direct-attached storage to a SAN, with an HSG80 disk array. This gave us a much more fluid way of sharing the disk storage across multiple members of the TruCluster, and featured redundant controllers.
Access to the controllers at the time had to be done through a serially attached terminal, with the concept that any action you performed was against either THIS_CONTROLLER or OTHER_CONTROLLER – which got a bit hairy if you needed to do something and forgot halfway through which controller you were on. It's things like this that have slowly become much more integrated into the whole 'admin experience' with newer storage solutions, which give you access to the interfaces via a web GUI by default, with serial fallback in case things get really serious.
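To give a flavour, a session on the HSG80 serial CLI looked something like this (reconstructed from memory, so treat the exact syntax as approximate); one trick that helped was giving each controller a distinct prompt so you always knew which one you were talking to:

```
HSG80> SHOW THIS_CONTROLLER
HSG80> SET THIS_CONTROLLER PROMPT="TOP> "
TOP> SHOW OTHER_CONTROLLER
TOP> SET FAILOVER COPY=THIS_CONTROLLER
```

Everything – failover config, units, mirrorsets – went through that one serial session, which is exactly why losing track of which controller you were on mid-operation was so easy.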
Whilst this served us really well, it introduced a level of complexity to managing the disk layout that at times became a hindrance, as you needed to configure mirroring and RAID level as two entirely separate operations before you even began thinking about putting a filesystem on the disk. It also made some operations – particularly recovery – very difficult to achieve remotely.
Our last shift was onto a set of HP RX6600 Itaniums running a ServiceGuard HPUX cluster and an EVA8000 storage array. This was great, as the RX6600s featured a BMC that gave us remote access to the console and power, and we could create storage groups on the fly without needing to plug in a separate terminal – an admin could administer nearly all of the infrastructure from a single point. The EVA, with CommandView EVA, was fast, flexible and fault tolerant.
It had (in my opinion) several drawbacks:
All of the storage zoning had to be done separately on the fibre switches. Although a web interface was available, it was an additional action that needed to be carried out, and another set of hardware that needed patching to stay up to date. It was also pretty easy to make a mistake and scrub the zone of an unrelated host system (which never happened, but the concern meant the number of team members able to administer the array was limited).
Also, once the 2C6D was full to capacity it was possible to expand, but that required more shelving and eventually another cabinet (once you got over 10 disk shelves in our rack). From a cost point of view on host connectivity, if you needed a large number of hosts to connect (as we found) you were soon on the verge of purchasing another set of fibre switches and host HBAs just to get the hosts connected, and that cost began to become really prohibitive for us as an organisation. You also needed to run a separate storage management server to interface with the array.
It was also overly complex to stay in support with ServiceGuard and the versions of HPUX, and it was frustrating, every time you wanted to experiment with a new approach to something, to be told 'Ahh, you need a licence for that'.
I did love all of the fault-tolerant applications you could use the EVA for, and tools like 'Business Copy' and 'Continuous Access' certainly proved their worth to us. As the business began to grow it was great to be able to use them along with SSSU to perform scripted snapshots from inside a host, so we could back up our running databases without interruption. Nice touches like RSS (redundant storage sets) fascinated me too, as they showed some real ingenuity. I even wrote a program to manage RSS (something HP didn't offer to their customers, although they had a similar tool for their engineers).
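Those SSSU scripted snapshots looked roughly like this – a from-memory sketch, where the management server name, credentials, system name and vdisk path are all placeholders, and the exact syntax varied between SSSU versions:

```
SELECT MANAGER cv-eva-server USERNAME=admin PASSWORD=secret
SELECT SYSTEM "EVA8000"
ADD SNAPSHOT db_nightly VDISK="\Virtual Disks\dbvol\ACTIVE" ALLOCATION_POLICY=demand
```

Driven from a cron job on the host after quiescing the database, this gave us consistent point-in-time copies to back up without ever taking the live service down.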
Where are we today?
We reviewed our environment and what we wanted to achieve, and this meant we really needed to shake things up, and I’m sad to say we had to move away from HP mainly due to cost.
The decision was made to go with DELL for our server and storage environment, and whilst I was sceptical at the outset, I can honestly say it has proved a good move. We saved a bucket load of money, and a lot of the tools (particularly for storage) that we would have paid a premium for from HP came bundled by default as part of the DELL offering.
We got rid of over 200 servers/services and consolidated them down onto 5 DELL R710s running ESXi or running Linux natively, and moved from our traditional SAN onto DELL EqualLogic iSCSI SANs – and managed all of this at roughly the price of our annual maintenance/support premiums for the HP kit. Being on iSCSI meant we could massively reduce the cabling in our estate, lose the management server and the SAN fabric, and run everything we needed over our normal, familiar network config.
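On the Linux side, presenting an EqualLogic volume to a host is just standard open-iscsi against the group IP – something like this sketch, where the group IP and target IQN are illustrative placeholders:

```sh
# Discover the targets the group presents on its well-known group IP
iscsiadm -m discovery -t sendtargets -p 10.10.10.10:3260

# Log in to the volume's target (EqualLogic IQNs start iqn.2001-05.com.equallogic)
iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-example-volume \
    -p 10.10.10.10:3260 --login

# The volume then appears as a normal SCSI block device, e.g. /dev/sdb
```

No zoning, no fabric, no HBAs – just the same Ethernet and the same tooling we already used for everything else.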
What's so amazing about the DELL kit?
Unlike HP iLO, DELL iDRAC comes fully licensed, with no requirement for a licence extension if you need to use Virtual Media or a full console. Having previously had access only to crippled/restricted software, we typically needed these features in a reactive fashion and didn't have time to mess around requesting licences, etc. As anyone who's had to admin a set of systems will probably tell you, being able to get to the power button and console remotely is pretty important, so full console access and Virtual Media are a must for me.
Now my favourite, the storage.
The great things for us about the EqualLogic approach are:
We can have up to 16 of these in the same group, and because they are frameless they can go in different cabinets if we need it, or even different comms rooms.
Things like snapshotting, replication and thin provisioning are all available at no extra charge.
Every EqualLogic array we buy will always stay useful, as the firmware upgrades are guaranteed backwards compatible, and the controllers and management are all visible/addressable from one place. The firmware update also updates *everything* we need in one go.
You can group multiple arrays together to create storage pools within an EqualLogic group, which is great for distributing I/O and scaling. The PS6100s give us internal storage tiering, as they hold a mix of SAS and SSD disks, which means they can move hot data onto the SSD drives at block level rather than LUN level.
For us this was really good, as it let us be much more granular about how we moved our data around. If we decided, for example, that at the end of the month a particular set of host LUNs would benefit from being moved onto the 6100s and their SSDs, we could do this on the fly, moving a system from a storage pool containing slower SATA-based arrays over to the newer/faster models without *any* service interruption.
For ESXi we used the DELL MEM (Multipathing Extension Module) to get the best out of our MPIO requirements and take the headache out of configuration, and more recently DELL have revised the MEM to work with more vendors, which is a big win for us.
DELL have also packed a load of functionality into their host-side tools, both for virtual environments and for Linux and Windows, giving the user the ability to use ASM/LE or ASM/VE to schedule things like smart copies or cloning from inside the hosting OS. And if something's broken and you complain to DELL, they will actually do something to correct the problem or come up with a compromise – something HP missed time and time again for us over the years.
This also allowed us to rethink some of our backup strategies: in some cases we could move away from traditional agent-based backups, mount smart copies on a backup server instead, and back up the LUN directly on the backup server itself – without being restricted by backup start times, or having to stop services or flush anything just to get something that should be simple to work well.
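In practice that off-host backup boils down to a few steps on the backup server – roughly this sketch, where the smart-copy target IQN, device name, mount point and backup destination are all placeholders:

```sh
# Log in to the smart-copy (snapshot) target the array has exposed
iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-dbvol-smartcopy --login

# Mount the snapshot read-only and back it up with no impact on the live database
mount -o ro /dev/sdc1 /mnt/smartcopy
tar -czf /backups/dbvol-$(date +%F).tar.gz -C /mnt/smartcopy .

# Tidy up: unmount and log out so the smart copy can be retired
umount /mnt/smartcopy
iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-dbvol-smartcopy --logout
```

The live database never sees any of this, which is the whole point: the backup window stops being the application's problem.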
Being able to be almost immediately reactive and answer current-day requirements, whilst preserving your legacy kit, is invaluable. The direction I am going in now has been a massive benefit to the business.
Because the storage and the platform are very flexible, we no longer have to wait to provision new hosts or their storage – we can overcommit on storage, RAM (thanks to memory ballooning) and CPU. The move away from Itanium/HPUX to an estate running almost entirely on Linux means we can virtualise almost anything, put it in a container and shift it from one data centre to another, most of the time without incurring any downtime penalty, using VMotion and Storage VMotion. We can also achieve clustering without needing things like STONITH by using the HA and FT features of VMware, and save energy and resources by shifting things around with DRS. Admittedly, Fault Tolerance is limited at the moment to uniprocessor setups, but once VMware have nailed SMP environments under FT this will be a massive win.
In previous years, being this dynamic would have cost the company a heavy premium and months of testing; now that our entire platform is virtualised we can always upgrade and refresh hardware without worrying about things like host incompatibility. We can stand something up and tear it down in minutes using the cloning tools available to ESXi, and we can now do the same on AWS.
It's been a busy year getting all this to happen, but worth every effort!
(c) Matt Palmer 21 Aug 2012