A story about a XenServer failure

Missing a bit of data?

One machine, an HP DL380G4, plays the role of many. We run a Citrix 6.2.0 XenServer to host our iPhone/iPad weather server, Apple/iOS in-app purchase validation, Atlassian tool set, and some other internal servers. The VMs use Ubuntu Precise 12.04 LTS with full disk encryption and encrypted home directories, sitting in a DMZ behind our OpenBSD firewall. ** Updated to a DL380G6, a more power efficient machine (171 watts).


On Thursday morning we decided to apply the Citrix 6.2 Service Pack 1 to update the server to a more current software base, as the service pack contains various security fixes we would like to have. On the reboot the XenServer fails: the xapi service runs at 100% CPU and later crashes.

http://discussions.citrix.com/topic/346663-xenserver-62-sp1-breaks-network/

Basically the onboard motherboard video isn't properly supported, so xapi fails. Citrix seems unable to address the root problem, which is discussed at http://support.citrix.com/article/CTX139299. Frankly this is the first vendor failure point, and https://taas.citrix.com later flags our box as needing security upgrades, but we can’t apply them...

A solution appears to be to find and install a PCI-X 100 MHz 3.3V video card; "Good Luck" with that, as a tour of our local computer stores/shops/outlets finds no new or even used cards. In hindsight we could have pursued eBay or Amazon and found one for next day delivery…

(a) 1st lesson: Do lots of Googling to ensure your box won't be broken. Mind you, as this is hardware dependent, your actual results may vary.

(b) 2nd lesson: Ensure you have a current copy of all your VMs in another Storage Repository before you do maintenance. And does the vendor back-out plan work? You should test that.

 

Disaster Recovery Plan


We invoke our existing disaster recovery plan, which is to restart a dormant AWS server, and restore our data. That stopped EC2 machine costs a few cents per month to keep, but for disaster recovery it can be restarted in a few minutes.


Weather server restore

A grade of B+ 

A B+ mostly because we waited a few hours before invoking the contingency plan; next time we will do it right away.

We fire up our EC2 server, which has a copy of the database from last month (acceptable for the moment), repoint the AWS Route 53 DNS entry to the EC2 box, and we are done. Testing shows we have an expired SSL certificate, so we have to update the nginx certificate data, and we have to run apt updates to bring the operating system up to current patch levels. As this is a near-live restore, we should keep the box more current during our weekly patch hour.


(c) 3rd lesson: Continuously test the disaster recovery process. In this case we were lucky, as we could refetch the new certificate from the SSL supplier.
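One way to catch an expiring certificate before a restore is a small check run during the weekly patch hour. This is only a sketch: the certificate path and the 30-day window in the usage comment are assumed values, not our actual setup. It relies on the stock `openssl x509 -checkend` option, which tests whether a certificate will still be valid N seconds from now.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if the certificate will have expired within N days.
# Note: an unreadable certificate file also counts as "expiring", which errs
# on the side of raising a warning.
cert_expires_within() {
    cert_file=$1
    days=$2
    # -checkend exits 0 when the cert is still valid after the given number
    # of seconds, so the result is inverted here.
    ! openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$cert_file" >/dev/null 2>&1
}

# Hypothetical usage from a weekly cron job:
#   cert_expires_within /etc/nginx/ssl/server.crt 30 && echo "renew the certificate"
```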

 

Jira/tool kit restore

A grade of 'F'

We store encrypted backups of the Atlassian data, attachments, and PostgreSQL database on AWS S3 with migration to Glacier after two weeks. For a few $ a month this provides excellent encrypted offsite storage of all our critical business data.

Problem. Where is a copy of the GnuPG encryption keys and passphrases?

Well, it exists on a couple of machines, but the broken server hosts ALL of them! Searching shows, yes, nary a copy elsewhere! In fact, after the recovery, Spotlight searches for the key showed it had not been copied elsewhere. This was good key management from a “How many copies are in existence” viewpoint, but bad from a recovery viewpoint, as the key couldn't be located. So none of our encrypted server backups were usable.


*It is important to remember that even if you think everything is ok with your disaster recovery plan, it is never perfect. Ensure you have it reviewed by other participants, and actually perform tests to ensure there isn’t a surprise like this one.*

 

(d) 4th lesson: Ensure you have a paper copy locally and offsite, as you just don’t know how many computers might get toasted…
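As a rough illustration of producing something printable, the sketch below writes an ASCII-armored copy of a GnuPG secret key to a file. The key ID and output path are placeholders, and the DRY_RUN switch is just a convenience added for this example; `gpg --armor --export-secret-keys` is the standard GnuPG export option. The passphrase still has to be recorded separately, and the printout needs to be stored as carefully as the key itself.

```shell
#!/bin/sh
# Sketch: export a secret key in printable ASCII armor.
# key_id and out_file are placeholders supplied by the caller.
export_key_for_paper() {
    key_id=$1
    out_file=$2
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show what would be run.
        echo "gpg --armor --export-secret-keys $key_id > $out_file"
        return 0
    fi
    # Subshell so the restrictive umask does not leak into the caller.
    ( umask 077; gpg --armor --export-secret-keys "$key_id" > "$out_file" )
}

# Hypothetical usage:
#   export_key_for_paper BACKUP-KEY-ID /root/backup-key.asc
```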


We determine an unencrypted backup used for earlier testing is about six months old, which is unacceptable. So we buy a year of Citrix support for $$$$ hoping technical support can dig us out of this hole.


(e) 5th lesson: Have a current copy/snapshot of your virtual machines and meta-data in some other storage repository. I repeat myself, but we could have recovered everything within an hour if we had it.
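A minimal sketch of what such a copy could look like with the xe CLI follows. The backup directory is an assumed mount point and the DRY_RUN switch exists only for this example; `xe vm-export` and `xe pool-dump-database` are the stock commands for exporting a VM and the pool metadata. An export normally needs the VM shut down (or a snapshot exported instead), so this is not a drop-in nightly job.

```shell
#!/bin/sh
# Sketch: export guest VMs plus the pool metadata to a second location.
# BACKUP_DIR is an assumed mount point, not a path from this story.
BACKUP_DIR=${BACKUP_DIR:-/mnt/backup-sr}

run() {
    # With DRY_RUN set, print the command instead of executing it.
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

backup_vms() {
    # $1 is a comma-separated uuid list, the format `xe vm-list --minimal` prints.
    for uuid in $(echo "$1" | tr ',' ' '); do
        run xe vm-export vm="$uuid" filename="$BACKUP_DIR/$uuid.xva"
    done
    run xe pool-dump-database file-name="$BACKUP_DIR/pool-metadata.db"
}

# Typical call on the XenServer host:
#   backup_vms "$(xe vm-list is-control-domain=false power-state=halted params=uuid --minimal)"
```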


We talk to Citrix for about 12 hours, via five engineers around the world over Thursday & Friday, to attempt a full backup. We can restart the tool stack and run xapi for 10 minutes or so before failure. Working via GoToMeeting, Citrix is unable, because of the stack failure, to make a backup of the server meta-data or the VDIs.

They suggest downgrading the server back to 6.2.0 which appears to be documented at:

http://support.citrix.com/article/CTX119905

 As we have a limited set of VMs it seems doable as only a small number of VMs have to be recreated and reattached to their logical drives.

*** But don't do this, it will cause the 2nd disaster ***

 We try a test run on a test server but fail to actually confirm sanity...

(f) 6th lesson: Grab some sleep. Our test did show the vendor process failed, we just didn't recognize it yet...

 

The downgrade failure

Friday, after much talking with Citrix technical support, we proceed with the downgrade, accepting we might lose data. On rebooting, the XenServer is at 6.2.0 but the local Storage Repository containing our data is gone and the host name has changed. Citrix support is puzzled.

This was the second Citrix failure: The upgrade fails, and then the downgrade fails, and seems to wipe out your data. How clever...

Citrix support staff mumble, and we insist they try their process in their laboratory to confirm whether their documentation is correct. We hang up the phone and go exploring. Later Citrix support inform us that if we had upgraded from 6.1.0 to 6.2.0 and then rebooted to install 6.1.0, there is a clean re-install option, but if you apply a service pack or a hot-patch that causes a failure, it seems there is no back-out option?


Fortunately Citrix builds an extensive installer log, found in

/var/log/installer/install-log

This is extremely valuable, as the installer runs diagnostics on the current machine before installing, logging lvm and internal xe data. Looking at the log file we see:


 INFO     [2014-05-02 18:15:39] ran ['/sbin/lvm', 'vgs', '--noheadings', '--nosuffix', '--units', 'b', '--separator', '#', '--options', 'vg_name']; rc 0

STANDARD OUT:
   VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73

That is our lvm-based Storage Repository. Next come its logical volumes, their extents, and the physical volume:

INFO     [2014-05-02 18:15:39] ran ['/sbin/lvm', 'lvs', '--noheadings', '--nosuffix', '--units', 'b', '--separator', '#', '--options', 'lv_name,vg_name']; rc 0

STANDARD OUT:
  MGT#VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73
  VHD-348b7073-3f14-44a2-85b8-88719756ca73#VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73
  VHD-baa6fc4e-7413-4c91-8335-03964c39de9c#VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73
  VHD-f17e8199-54d4-4bb9-956b-862768dfb647#VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73

INFO     [2014-05-02 18:15:39] ran ['/sbin/lvm', 'lvs', '--noheadings', '--nosuffix', '--units', 'b', '--separator', '#', '--segments', '--options', 'seg_pe_ranges']; rc 0

STANDARD OUT:
  /dev/sda3:0-0
  /dev/sda3:1-2567
  /dev/sda3:5391-6161
  /dev/sda3:2568-5390

 

INFO     [2014-05-02 18:15:39] ran ['/sbin/lvm', 'pvs', '--noheadings', '--nosuffix', '--units', 'b', '--separator', '#', '--options', 'pv_name,vg_name,pe_start,pv_size,pv_free,pv_pe_count,dev_size']; rc 0

STANDARD OUT:
  /dev/sda3#VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73#10551296#98771664896#72926363648#23549#9878318233

Those are our logical volumes and the physical volume, and finally the partition table:

 /sbin/sgdisk --print /dev/sda 
                                                             

Number  Start (sector)    End (sector)  Size       Code  Name

   1            2048         8388641   4.0 GiB     0700 
   2         8390656        16777249   4.0 GiB     0700 
   3        16779264       209715166   92.0 GiB    8E00 

  

  

----------------------

So far so good: our SR is c4582230-be8f-ba20-cd32-6f04a9e5dc73, which points to the PV in the lvm at partition 3 on the hard disk and contains the three VDIs.

 

Disaster 2

 However as part of the installation process after collecting all this data the script invokes these deadly commands.

INFO     [2014-05-02 19:51:56] ran ['/sbin/sgdisk', '--zap-all', '/dev/sda']; rc 0

STANDARD OUT:
GPT data structures destroyed! You may now partition the disk using fdisk or other utilities.

INFO     [2014-05-02 19:51:56] ran ['/sbin/sgdisk', '--mbrtogpt', '--clear', '/dev/sda']; rc 0

STANDARD OUT:
Creating new GPT entries.
The operation has completed successfully.

INFO     [2014-05-02 19:51:56] ran ['sfdisk', '-A1', '/dev/sda']; rc 0

STANDARD ERROR:
Done

INFO     [2014-05-02 19:51:58] ran ['/sbin/sgdisk', '--new=1:34:8388641', '/dev/sda']; rc 0

STANDARD OUT:
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.

INFO     [2014-05-02 19:52:01] ran ['/sbin/sgdisk', '--typecode=1:0700', '/dev/sda']; rc 0

STANDARD OUT:
The operation has completed successfully.

INFO     [2014-05-02 19:52:03] ran ['/sbin/sgdisk', '--attributes=1:set:2', '/dev/sda']; rc 0

STANDARD OUT:
The operation has completed successfully.

INFO     [2014-05-02 19:52:05] ran ['/sbin/sgdisk', '--new=2:8388642:16777249', '/dev/sda']; rc 0

STANDARD OUT:
Information: Moved requested sector from 8388642 to 8390656 in
order to align on 2048-sector boundaries.
The operation has completed successfully.

INFO     [2014-05-02 19:52:07] ran ['/sbin/sgdisk', '--typecode=2:0700', '/dev/sda']; rc 0

STANDARD OUT:
The operation has completed successfully.

INFO     [2014-05-02 19:52:11] ran ['mkfs.ext3', '-L', 'root-skirhppo', '/dev/sda1']; rc 0

STANDARD OUT:
Filesystem label=root-skirhppo
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
524288 inodes, 1048324 blocks
52416 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1073741824
32 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736

Writing inode tables: ....

 

SIGH

 

We run sgdisk on the downgraded server and yes, it confirms partition 3, which contains our VDIs, is gone.

 

    /sbin/sgdisk --print /dev/sda

Disk /dev/sda: 209715200 sectors, 100.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): D3AD98BF-078D-432A-A94A-1356D28BDE42
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 209715166
Partitions will be aligned on 2048-sector boundaries
Total free space is 192941945 sectors (92.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         8388641   4.0 GiB     0700 
   2         8390656        16777249   4.0 GiB     0700 

 

Yes, partition 3, our lvm data, is gone.
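With a log this long, a quick filter helps pick out the steps that rewrite the partition table or create filesystems. The sketch below is one possible filter, written against the log line format shown in this post; the list of "destructive" arguments is our own guess at what is worth flagging, not anything Citrix documents.

```shell
#!/bin/sh
# Sketch: print installer log lines whose command wipes or rebuilds the
# partition table (--zap-all, --clear) or makes a filesystem (mkfs.*).
# Reads the log on stdin.
destructive_steps() {
    grep -E "ran \[.*('--zap-all'|'--clear'|'mkfs\.[a-z0-9]+')"
}

# Usage on the host:
#   destructive_steps < /var/log/installer/install-log
```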

 

Now a clue is that the script comments imply the user can set up a custom lvm after installation. We go digging through the lvm and fdisk documentation and realize the lvm is not trashed, just forgotten.

Fortunately the standard install grabs only the first 8 GB of disk and doesn't touch the rest. The re-install of XenServer rewrote the partition table but didn't overwrite the space where partition 3 lives, and the --zap-all only destroys the partition table structures, not the lvm data inside the partition. So we can restore the partition.
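The reasoning can be checked with a little arithmetic on the sector numbers from the two sgdisk listings. This sketch only shows that the new partitions 1 and 2 do not overlap the old partition 3; it cannot prove nothing else wrote to that region.

```shell
#!/bin/sh
# Sketch: succeed if the sector ranges [s1,e1] and [s2,e2] overlap.
# Arguments: s1 e1 s2 e2
ranges_overlap() {
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# New partitions 1-2 span sectors 2048..16777249,
# the old partition 3 spanned 16779264..209715166.
if ranges_overlap 2048 16777249 16779264 209715166; then
    echo "new install overlaps old partition 3"
else
    echo "old partition 3 untouched by new partitions 1-2"
fi
```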

In order to test this theory we set up a XenServer on a VMware machine and ran an example test consisting of:

(1) install 6.2.0
(2) make three VMs
(3) upgrade to SP1
(4) downgrade to 6.2.0
(5) restore partition 3
(6) restore PV & LV
(7) restore VMs and VDIs

We documented all of this and ran the procedure twice to ensure the production run wouldn't have surprises, in case we had overlooked something.

The following information is from the test box, not the production box, and the uuids will be different as they are unique to your configuration.

First we have to gain access to our lvm partition by invoking two sgdisk commands: one to define the partition, the other to tag it as an lvm type. Since 16777249 is the end of partition 2, we add 1 to get 16777250 as the start of partition 3. Note how the 16777250 is altered to 16779264 due to sector boundary rounding, which matches the original start of partition 3. This creates a type '0700' Linux partition.

 /sbin/sgdisk --new=3:16777250:209715166  /dev/sda

Information: Moved requested sector from 16777250 to 16779264 in order to align on 2048-sector boundaries.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.
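The rounding is easy to reproduce. The sketch below assumes sgdisk simply rounds the requested start up to the next multiple of 2048 sectors, which matches every adjustment in the logs in this post; we have not checked it against the sgdisk source.

```shell
#!/bin/sh
# Sketch: round a sector number up to the next multiple of the alignment.
align_up() {
    sector=$1
    alignment=$2
    echo $(( (sector + alignment - 1) / alignment * alignment ))
}

align_up 16777250 2048   # requested start of partition 3, prints 16779264
align_up 8388642 2048    # requested start of partition 2, prints 8390656
```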

 

Now we need to reset the type to '8E00' lvm.

 /sbin/sgdisk --typecode 3:0x8E00 /dev/sda

Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.

 

Reboot the XenServer, and once we log back in a pvscan shows:

[root@xenserver-piqyctev ~]# pvscan

  PV /dev/sda3   VG VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73   lvm2 [91.99 GB / 67.92 GB free]

 

This is great but we need access to the logical volumes. A bit more reading about how you export an lvm between machines leads us to the following commands: vgexport and vgimport, plus vgchange.

The idea is to fake an lvm data move: export the lvm, pretend to copy the partition to another computer, then import the lvm and hook the PV & LVs back up.

 First just to check our assumption we issue

 [root@xenserver-piqyctev ~]# vgimport VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73

  Volume group "VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73" is not exported

 

Now we issue the export. Normally this would delete information about the LVs & VGs from the machine, but we don’t have that information anyway.

 [root@xenserver-piqyctev ~]# vgexport VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73

  Volume group "VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73" successfully exported

 

Now we import the lvm partition after pretending to move it between computers.

[root@xenserver-piqyctev ~]# vgimport VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73

  Volume group "VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73" successfully imported

Once imported we have to invoke vgchange to activate the logical volumes and fix the volume group links etc.

[root@xenserver-piqyctev ~]# sudo vgchange -ay VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73

  4 logical volume(s) in volume group "VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73" now active

 

We reboot, and success, the LVs are there!

 

[root@xenserver-piqyctev ~]# lvs

  LV                                       VG                                                 Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  MGT                                      VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73 -wi-a-  4.00M
  VHD-348b7073-3f14-44a2-85b8-88719756ca73 VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73 -wi-a- 10.03G
  VHD-baa6fc4e-7413-4c91-8335-03964c39de9c VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73 -wi-a-  3.01G
  VHD-f17e8199-54d4-4bb9-956b-862768dfb647 VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73 -wi-a- 11.03G

 

So our data is back. Now we create the storage repository and link it to the PV via sr-introduce as per http://support.citrix.com/article/CTX119905

Note you might use a different SR label than 'Local'. Oddly the example label of "Local Storage" did not work, as the name became just "Local". Just another example of poor quality control from Citrix.

[root@xenserver-piqyctev ~]# pvscan

  PV /dev/sda3   VG VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73   lvm2 [91.99 GB / 67.92 GB free]

  Total: 1 [91.99 GB] / in use: 1 [91.99 GB] / in no VG: 0 [0   ]

Introduce the storage repository using the uuid embedded in the Volume group name.

[root@xenserver-piqyctev ~]# xe sr-introduce uuid=c4582230-be8f-ba20-cd32-6f04a9e5dc73 type=lvm name-label=Local content-type=user

 which gives: c4582230-be8f-ba20-cd32-6f04a9e5dc73
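Since the SR uuid is just the volume group name with its prefix removed, it can be extracted rather than retyped. A small sketch, assuming the VG_XenStorage- naming seen in the pvscan output above:

```shell
#!/bin/sh
# Sketch: strip the "VG_XenStorage-" prefix to recover the SR uuid.
sr_uuid_from_vg() {
    echo "${1#VG_XenStorage-}"
}

# Prints c4582230-be8f-ba20-cd32-6f04a9e5dc73
sr_uuid_from_vg VG_XenStorage-c4582230-be8f-ba20-cd32-6f04a9e5dc73
```

On the host the name could be fed in from `vgs --noheadings --options vg_name` instead of being typed by hand.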

Now we need to create the PBD (Physical Block Device), and for that we need the host uuid and a path to the partition. The documentation implies using a link via /dev/disk/by-id, which seems HP related, but that didn't exist for our VMware box as only the /dev/disk/by-path information was available.

[root@xenserver-piqyctev ~]# xe host-list

uuid ( RO)                : d8d61c2e-219e-4a2e-b53b-288be5b9ec6c
name-label ( RW): xenserver-piqyctev
name-description ( RW): Default install of XenServer

 

[root@xenserver-piqyctev ~]# ls -l /dev/disk/by-path/

total 0

lrwxrwxrwx 1 root root  9 May  2 17:39 pci-0000:00:07.1-scsi-1:0:0:0 -> ../../sr0
lrwxrwxrwx 1 root root  9 May  2 17:39 pci-0000:00:10.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 May  2 17:39 pci-0000:00:10.0-scsi-0:0:0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 May  2 17:39 pci-0000:00:10.0-scsi-0:0:0:0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 May  2 17:39 pci-0000:00:10.0-scsi-0:0:0:0-part3 -> ../../sda3


Create the PBD by supplying the SR uuid, the path, and the host uuid. The clue of course is that our SR, via the pvscan, is located on /dev/sda3.

 [root@xenserver-piqyctev ~]# xe pbd-create sr-uuid=c4582230-be8f-ba20-cd32-6f04a9e5dc73  device-config:device=/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:0:0-part3 host-uuid=d8d61c2e-219e-4a2e-b53b-288be5b9ec6c

 which gives: 80af078c-e960-7873-37f3-126cd103dd16

 

Finally we plug in the PBD using this newly created PBD uuid.

 [root@xenserver-piqyctev ~]# xe pbd-plug uuid=80af078c-e960-7873-37f3-126cd103dd16

 

At this point we again reboot the server. Once rebooted we start up a Windows-based XenCenter, where it shows we have a "Local" storage repository containing our missing VDIs. Unfortunately we do not have the VMs, so we have to recreate them from memory. In this case we had some 64-bit Ubuntu servers, and a sole 32-bit Ubuntu server.

Again the documentation in CTX119905 doesn’t match reality.

We created the VM from the vendor templates, selecting No Start, but had to say we were using a new 8GB VDI and a DVD for installation as we didn't have any other choices. We then deleted the empty 8GB VDI and attached the actual VDI to the VM, but on a VM Start we got a message saying the VDI was unusable, the CDROM install didn't run (or something like that).

The recovery for this problem was to create the VM, indicate an 8GB VDI, and boot from the Ubuntu DVD. Then cancel & shut down the install via the Ubuntu boot prompt, leaving us with an empty VDI. This then made the VM happy that a CD-ROM install happened, so we deleted the 'fake' 8GB VDI, and assigned the actual production data VDI to the VM via the VM storage tab. Obviously be careful when you delete the empty VDIs.

Citrix support later indicates this is because the PV-bootloader parameter is set to ‘eliloader’, which is configured for booting from installation media; after an install it is changed to PV-bootloader=pygrub. So they imply you could use an xe command to alter the PV-bootloader value and then reboot.
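What Citrix hinted at would presumably look something like the sketch below. We are guessing at the exact invocation from their description, so treat it as untested: `xe vm-param-set` is the standard way to change a VM parameter, the uuid is a placeholder, and the DRY_RUN switch exists only for this example.

```shell
#!/bin/sh
# Sketch: switch a VM from the installer bootloader to pygrub.
# vm_uuid is a placeholder supplied by the caller.
set_pygrub() {
    vm_uuid=$1
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show what would be run.
        echo "xe vm-param-set uuid=$vm_uuid PV-bootloader=pygrub"
    else
        xe vm-param-set uuid="$vm_uuid" PV-bootloader=pygrub
    fi
}

# Check the current value first with:
#   xe vm-param-get uuid=<vm-uuid> param-name=PV-bootloader
```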


On the reboot the VM used the production VDI and booted us to the Ubuntu passphrase prompt for the encrypted partition.

 

We dance and drink some wine...

 

Post recovery:

We were extremely lucky: we got to test our recovery process for real, yet in the end actually recovered all our data.

Based on the failures we:

(1) Promptly created an NFS SR, and duplicated all our VMs. That NFS server is then backed up incrementally to two off-site Time Machine systems.

(2) Ensured we could invoke s3cmd to get our data from S3 and decrypt it on a newly set up AWS EC2 box, using hard copy documentation stored in various places.

(3) We repointed the DNS entry for the weather server from the EC2 box to our XenServer box, and started the process of migrating the updated/new data from the EC2 box.

(4) Finally we scheduled time to run a full Atlassian machine restore from the offsite encrypted backup data.

(5) We wrote up and printed a more formal Disaster Recovery Plan document.


In retrospect all disaster recovery planning and execution has issues, so document and test for real. That way you understand the pitfalls you might encounter when it actually happens, before you have to wonder: “Where are those encryption keys?”





© John M McIntosh 2013-2014