Wednesday, May 11, 2016

VMWARE VSPHERE HIGH AVAILABILITY (HA)


Network Load Balancing (NLB) clustering:- The Network Load Balancing configuration involves an aggregation of servers that balances the requests for applications or services. In a typical NLB cluster, all nodes are active participants in the cluster and are consistently responding to requests for services. If one of the nodes in the NLB cluster goes down, client connections are simply redirected to another available node in the NLB cluster. NLB clusters are most commonly deployed as a means of providing enhanced performance and availability.
Examples: NLB on IIS servers, ISA servers, VPN servers, etc.


Windows Failover Clustering (WFC):- It is used solely for the sake of availability. Server clusters, or WFC, do not provide performance enhancements beyond high availability. In a typical server cluster, multiple nodes are configured to be able to own a service or application resource, but only one node owns the resource at a given time. Each node requires at least two network connections: one for the production network and one for the cluster service heartbeat network between nodes. A common datastore is also needed that houses the information accessible by the online active node and all the other passive nodes. When the current active resource owner experiences a failure, causing a loss of the heartbeat between the cluster nodes, another passive node becomes active and assumes ownership of the resource to allow continued access with minimal data loss.

Raw device mapping (RDM):- An RDM is a combination of direct access to a LUN and a normal virtual hard disk file.
An RDM can be configured in either Physical Compatibility mode or Virtual Compatibility mode. The Physical Compatibility mode option allows the VM to have direct raw LUN access. The Virtual Compatibility mode, however, is the hybrid configuration that allows raw LUN access but only through a VMDK file acting as a proxy.
So, why choose one over the other? Because the RDM in Virtual Compatibility mode uses a VMDK proxy file, it offers the advantage of allowing snapshots to be taken. By using the Virtual Compatibility mode, you will gain the ability to use snapshots on top of the raw LUN access in addition to any SAN-level snapshot or mirroring software.

Clustering with Windows Server 2008 VMs:-
Cluster in a Box:- the clustering of two VMs on the same ESXi host.
Cluster across Boxes:- the clustering of two VMs running on different ESXi hosts.
Physical to Virtual Clustering:- the clustering of a physical server and a VM together.


What is VMware HA?
As per VMware Definition:-
VMware® High Availability (HA) provides easy-to-use, cost-effective high availability for applications running in virtual machines. In the event of server failure, affected virtual machines are automatically restarted on other production servers with spare capacity.



What are the prerequisites for HA to work?


1. Shared storage for the VMs running in the HA cluster
2. Essentials Plus, Standard, Advanced, Enterprise, or Enterprise Plus licensing
3. A vSphere HA-enabled cluster
4. Management network redundancy, to avoid frequent isolation responses during temporary network issues (preferred, not a requirement)


FDM:- vSphere HA uses a VMware-developed agent known as the Fault Domain Manager (FDM) to provide HA.

AAM:- Earlier versions of vSphere used the Automated Availability Manager (AAM), which had a number of notable limitations, such as a strong dependence on name resolution and scalability limits.


What are the commands to start/stop/restart the HA agent on an ESXi host?
# /etc/init.d/vmware-fdm stop

# /etc/init.d/vmware-fdm start

# /etc/init.d/vmware-fdm restart

Where to locate HA related logs in case of troubleshooting?
/var/log/fdm.log


HA-MASTER- When vSphere HA is enabled, the vSphere HA agents participate in an election to pick a vSphere HA master. The vSphere HA master is responsible for a number of key tasks within a vSphere HA–enabled cluster. If the existing master fails, a new vSphere HA master is automatically elected. The new master will then take over the responsibilities listed here, including communication with vCenter Server.

HA-Slaves- Once an ESXi host in a vSphere HA–enabled cluster elects a vSphere HA master, all other hosts become slaves connected to that master.
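The election mentioned above can be illustrated with a short sketch. This is not VMware's code; it is a simplified Python model of the documented behavior, in which the host with access to the greatest number of datastores wins and ties are broken by the lexically highest host managed object ID. The function name and host IDs are invented for the example.

```python
# Simplified model of the FDM master election (illustrative only):
# the host that can access the most datastores wins; ties are broken
# by the lexically greatest managed object ID.

def elect_master(hosts):
    """hosts: list of (moid, datastore_count) tuples. Returns the winning moid."""
    return max(hosts, key=lambda h: (h[1], h[0]))[0]

# Example: host-10 and host-22 both see 4 datastores; host-22 wins the tie.
print(elect_master([("host-10", 4), ("host-22", 4), ("host-9", 3)]))  # host-22
```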
HA master's responsibilities:-
- Monitors slave hosts
- Sends heartbeat messages to the slave hosts
- Manages the addition and removal of hosts
- Monitors the power state of VMs
- Reports state information to vCenter Server
- Keeps the list of protected VMs
- Notifies slave hosts of cluster configuration changes

HA slave host's responsibilities:-
- Monitors the health of the HA master
- Implements some vSphere HA features locally, such as local VM health monitoring
- Watches the runtime state of local VMs

network partition:-
"Network partition" is the term used to describe the situation in which one or more slave hosts cannot communicate with the master even though they still have network connectivity between themselves. In this case, vSphere HA is able to use the heartbeat datastores to detect whether the partitioned hosts are still live and whether action needs to be taken to protect VMs on those hosts.

 network isolation:-
Network isolation is the situation in which one or more slave hosts have lost all management network connectivity. Isolated hosts can neither communicate with the vSphere HA master nor communicate with other ESXi hosts.

datastore heartbeating:-
In this case, the slave host uses heartbeat datastores to notify the master that it is isolated. It does so through a special binary file, the host-X-poweron file. The vSphere HA master can then take the appropriate action to ensure that the VMs are protected.


What is the maximum number of hosts per HA cluster?
The maximum number of hosts in an HA cluster is 32.

How is host isolation detected?

In an HA cluster, ESXi hosts use heartbeats to communicate with the other hosts in the cluster. By default, a heartbeat is sent every second.
If the master host in an HA-enabled cluster does not receive heartbeats from a slave host, it assumes that the slave host may be isolated. The slave host, in turn, checks whether it can ping its configured isolation address (the default gateway, by default). If the ping fails, the host executes the configured host isolation response.
In VMware vSphere 5.x, if the agent that stops responding is on a master host, isolation is declared in 5 seconds; if it is on a slave, isolation is declared in 30 seconds.
In vSphere 5.x the master host also uses another technique, datastore heartbeating, to check the liveness of a slave host before declaring it failed. Datastore heartbeating is used to determine whether the slave host has failed, is in a network partition, or is network isolated. If the slave host has also stopped datastore heartbeating, it is considered to have failed and its virtual machines are restarted elsewhere.
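The combined checks described above (network heartbeats, the isolation address ping, and datastore heartbeating) can be summarized in a small sketch. This is purely illustrative Python, not VMware code; the function and its flags are invented for the example.

```python
# Illustrative decision logic (not VMware code): how an HA master might
# classify a slave host once network heartbeats stop arriving.

def classify_slave(network_heartbeat_ok, datastore_heartbeat_ok,
                   poweron_file_marks_isolated):
    """Return 'live', 'failed', 'isolated', or 'partitioned'."""
    if network_heartbeat_ok:
        return "live"          # heartbeats arrive every second: nothing to do
    if not datastore_heartbeat_ok:
        return "failed"        # no network and no datastore heartbeats: restart VMs
    # Datastore heartbeats continue, so the host is still running but cannot
    # be reached over the management network.
    if poweron_file_marks_isolated:
        return "isolated"      # host flagged itself via its host-X-poweron file
    return "partitioned"       # alive, but cut off from the master only
```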

vSphere HA requirements:-
- Same shared storage for all hosts
- Identical virtual networking configuration on all hosts

Does HA use vMotion to migrate running VMs to other hosts when the source host fails?
No. HA restarts the VMs on other hosts when the source host fails. This is not a live migration and involves a few minutes of downtime.

vSphere High Availability admission control:-
It controls the behavior of a vSphere HA-enabled cluster with regard to cluster capacity and failure tolerance. Specifically, should vSphere HA allow the user to power on more VMs than it has the capacity to support in the event of a failure?

Or should the cluster prevent more VMs from being powered on than it can actually protect? That is the basis for the admission control settings and, by extension, the admission control policy.

Admission Control has two settings:-
Enable: Disallow VM power-on operations that violate availability constraints.
Disable: Allow VM power-on operations that violate availability constraints.



Admission control policy:-
When admission control is enabled, the admission control policy determines how much capacity must be reserved so that the cluster can still tolerate the configured number of failures.
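As an illustration of the idea, here is a deliberately simplified, slot-style capacity check in Python. It is not the actual vSphere algorithm (which derives slot sizes from CPU and memory reservations); the function and the numbers are invented for the example.

```python
# Simplified slot-based admission control check (illustrative only).

def can_power_on(host_slot_counts, powered_on_vms, host_failures_to_tolerate):
    """Return True if one more VM can be powered on while still guaranteeing
    failover capacity for the configured number of host failures."""
    # Conservatively reserve the largest hosts' capacity for failover.
    slots = sorted(host_slot_counts, reverse=True)
    reserved = sum(slots[:host_failures_to_tolerate])
    usable = sum(slots) - reserved
    return powered_on_vms + 1 <= usable

# Three hosts with 10 slots each, tolerating 1 host failure: 20 usable slots.
print(can_power_on([10, 10, 10], 19, 1))  # True: the 20th VM still fits
print(can_power_on([10, 10, 10], 20, 1))  # False: a 21st VM would be blocked
```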

VMware HA isolation response:- When an ESXi host in a vSphere HA-enabled cluster is isolated (it cannot communicate with the master host, any other ESXi host, or any other network device), the ESXi host triggers the configured isolation response. The default isolation response is "Leave Powered On"; it can also be set to Shut Down or Power Off.

High Availability VM Monitoring:- vSphere HA has the ability to look for guest OS and application failures. When a failure is detected, vSphere HA can restart the VM or the specific application. The foundation for this functionality is built into the VMware Tools which provide a series of heartbeats from the guest OS up to the ESXi host on which that VM is running. By monitoring these heartbeats in conjunction with disk and network I/O activity, vSphere HA can attempt to determine if the guest OS has failed.

vSphere Fault Tolerance:- vSphere Fault Tolerance (FT) is the evolution of “continuous availability” that works by utilizing VMware vLockstep technology to keep a primary machine and a secondary machine in a virtual lockstep. This virtual lockstep is based on the record/playback technology. vSphere FT will stream data that will be recorded, and then replayed. By doing it this way, VMware has created a process that matches instruction for instruction and memory for memory to get identical results on the secondary VM. So, the record process will take the data stream from primary VM, and the playback will perform all the keyboard actions and mouse clicks on the secondary VM.

Prerequisites or requirements of vSphere Fault Tolerance:-

Cluster level
- Same FT version or build number on at least two hosts
- HA must be enabled
- VMware EVC must be enabled



ESXi host level
- vSphere FT-compatible CPUs
- Hosts must be licensed for vSphere FT
- Hardware Virtualization (HV) must be enabled
- Access to the same datastores
- A vSphere FT logging network with at least Gigabit Ethernet connectivity


VM level
- VMs with a single vCPU
- Supported guest OSes
- VM files on shared storage
- Thick-provisioned (eagerzeroedthick) disks or a virtual mode RDM
- No VM snapshots
- No NIC passthrough or the older vlance NIC driver
- No paravirtualized kernel
- No USB devices, sound devices, serial ports, or parallel ports
- No mapped CD-ROM or floppy devices
- No N_Port ID Virtualization (NPIV)
- No nested page tables/extended page tables (NPT/EPT)
- Not a linked-clone VM

Operational changes or recommendations for FT:-
- Power management must be turned off in the host BIOS
- No Storage vMotion or Storage DRS for vSphere FT VMs
- No hot-plugging of devices
- No hardware changes (this includes no network changes)
- No snapshot-based backup solutions


What are the basic troubleshooting steps when the HA agent installation fails on hosts in an HA cluster?

1. Check for network connectivity issues.

2. Check that DNS is configured properly.

3. Check that the HA-related ports are open in the firewall to allow communication.

Example: port 8182 (TCP/UDP, inbound/outbound) carries traffic between hosts for vSphere High Availability (vSphere HA).
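To verify reachability of such a port from another machine, a generic TCP check can help (note that this exercises TCP only, not UDP). This is a generic illustrative helper, not a VMware tool, and the host name in the example is hypothetical.

```python
# Generic TCP reachability check (illustrative; tests TCP only, not UDP).
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical host name):
# tcp_port_open("esxi01.example.com", 8182)
```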

4. Troubleshoot FDM:
A. Verify that all the configuration files of the FDM agent were pushed successfully from the vCenter Server to your ESXi host:
Location: /etc/opt/vmware/fdm
File names:
clusterconfig (cluster configuration),
compatlist (host compatibility list for virtual machines),
hostlist (host membership list), and
fdm.cfg.

B. Search the log files for any error messages:
/var/log/fdm.log or /var/run/log/fdm* (one log file for FDM operations)
/var/log/fdm-installer.log (FDM agent installation log)

5. Check that network settings such as port groups and switch configuration are properly configured and named exactly as on the other hosts in the cluster.

6. Try to stop/start/restart the VMware HA agent on the affected host using the commands below. In addition, you can also try restarting vpxa and the management agent on the host.

# /etc/init.d/vmware-fdm stop

# /etc/init.d/vmware-fdm start

# /etc/init.d/vmware-fdm restart

7. Right-click the affected host and click "Reconfigure for VMware HA" to reinstall the HA agent on that particular host.

8. Remove the affected host from the cluster. Removing an ESXi host from the cluster is not allowed until that host is put into maintenance mode.


Alternatively, go to the cluster settings, uncheck vSphere HA to turn off HA for the cluster, and then re-enable vSphere HA to get the agent reinstalled.
