As many of you already know, fencing is an important part of maintaining the health of your cluster. When a cluster node experiences issues, misbehaves, or just isn't playing nice with the rest of the nodes, it's important to bring that node down as fast as possible; otherwise you risk service interruption or, even worse, data corruption!
Before virtualization became prevalent in the data center, the most common way to fence a node was to log into its IPMI or DRAC card and issue a shutdown/reboot command. If the node didn't have a DRAC or IPMI card, you could also log into the PDU it was connected to and power off the outlet. Either method ensured that cluster nodes could be quickly taken offline when necessary.
Well, on virtualized cluster nodes there isn't a dedicated IPMI or DRAC card. And you certainly wouldn't want to log into the PDU and shut down the entire physical host. So the only methods left are those that require the host to self-fence, or those that require in-band access to the host to issue a reboot/shutdown command. Sadly, these methods are unreliable at best. For example, if a node is unstable and has partially crashed, SSH access may be unavailable, or components of the OS may be so unstable that it cannot properly self-fence. So then, what is the best way to fence an unstable, VM-based cluster node?
Well, if you are using XenServer or XCP, then I'd say the best way to do that would be through an agent or script that leverages XAPI. For example, if you could execute xe vm-reboot or xe vm-reset-powerstate when a VM cluster node was unstable or hung, that would be awesome. It would be even better if this script or agent was integrated with the cluster stack so that it worked inside rather than outside (i.e. so that the cluster itself was responsible for detecting when it was necessary to fence a node, as opposed to an external script or agent). But does such a thing even exist?
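For illustration, here is roughly what those calls look like from the XenServer command line; the uuid= value is a placeholder for your VM's actual UUID:

    # forcefully reboot the VM
    xe vm-reboot uuid=<vm-uuid> --force

    # or, if the VM is completely wedged, reset its power state outright
    xe vm-reset-powerstate uuid=<vm-uuid> --force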
Well kiddies, yes it does. And we have Matthew J Clark to thank for it. He wrote a pretty nice fencing agent for XenServer/XCP that does everything I said above. You can download it from his Google Code page or from the fedorahosted.org git repo.
I recommend you pull it down from the git repository at fedorahosted.org. It works with CMAN (I think, lol) and Pacemaker clusters, though I've only tested it with the latter.
Pull it down, build it (it requires pexpect and python-suds), and install it. You should have /usr/sbin/fence_xenapi after you install it. Run stonith_admin -M -a fence_xenapi to check out its metadata:
fence_cxs is an I/O Fencing agent used on Citrix XenServer hosts. It uses the XenAPI, supplied by Citrix, to establish an XML-RPC session to a XenServer host. Once the session is established, further XML-RPC commands are issued in order to switch on, switch off, restart and query the status of virtual machines running on the host.

The metadata also describes the agent's parameters:

- Fencing action
- Login name
- Login password or passphrase
- Script to retrieve password
- Physical plug number or name of virtual machine
- The URL of the XenServer host
- The UUID of the virtual machine to fence
- Verbose mode
- Write debug information to given file
- Display version information and exit
- Display help and exit
- Separator for CSV created by operation list
- Test X seconds for status change after ON/OFF
- Wait X seconds for cmd prompt after issuing command
- Wait X seconds for cmd prompt after login
- Wait X seconds after issuing ON/OFF
- Wait X seconds before fencing is started
- Count of attempts to retry power on
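To hook the agent into Pacemaker, you configure one stonith primitive per node plus a location constraint. Here's a minimal sketch in crm configure syntax; the node names (node1/node2), VM labels, session URL, and credentials are all placeholders, the session_url/login/passwd parameter names are assumed from the agent's metadata above, and pcmk_host_map/pcmk_host_check are standard Pacemaker stonith attributes:

    primitive fence-node1 stonith:fence_xenapi \
        params session_url="https://xenserver1.example.com" \
            login="root" passwd="secret" \
            pcmk_host_map="node1:node1-vm-label" \
            pcmk_host_check="static-list"
    primitive fence-node2 stonith:fence_xenapi \
        params session_url="https://xenserver1.example.com" \
            login="root" passwd="secret" \
            pcmk_host_map="node2:node2-vm-label" \
            pcmk_host_check="static-list"
    location fence-node1-placement fence-node1 -inf: node1
    location fence-node2-placement fence-node2 -inf: node2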
Notice the use of pcmk_host_map and pcmk_host_check in the primitive. The agent isn't able to map cluster nodes back to VM labels on its own, so the pcmk_host_map stanza provides that mapping. Also, the location constraints ensure that a node cannot fence itself.
To test, you'd simply need to execute stonith_admin -B <node-name>. That will reboot the specified node.
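For example, using the hypothetical node names from the sketch above:

    stonith_admin -B node1   # request a reboot of node1 through the fencing agent
    crm_mon -1               # one-shot status: watch node1 go offline and rejoin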
After you've tested and you are ready to go to production, don't forget to set stonith-enabled=true and no-quorum-policy="ignore" in your cluster properties, and you'll be all set.
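If you manage the cluster with the crm shell, that is just the following (a sketch; pcs or cibadmin would work equally well):

    crm configure property stonith-enabled=true
    crm configure property no-quorum-policy=ignore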
Hello,
I am trying to get that stonith primitive to run. Could you please explain the pcmk_host_map and pcmk_host_check parameters? I have no idea which values they need. I have two physical servers running XenServer and want to reboot a VM via a stonith primitive.
Could you please provide an example?