Understanding Adaptive Resync in VMware vSAN 6.7

Introduction to Adaptive Resync

Consistent performance and data resiliency are two key tenets of an enterprise storage solution. In case of a host or disk failure, some of a virtual machine's components may become non-compliant because of the missing data. To make the impacted virtual machines compliant with their storage policy again, vSAN resynchronizes the data onto the hosts or drives where sufficient resources are available.

Resync operations aim to recreate the missing components as quickly as possible. They consume I/O on the disk drives where vSAN is rebuilding the missing objects of the impacted virtual machines, and the longer the resync takes, the longer the data is at risk. If background operations such as resync and rebalancing consume all available I/O, application performance suffers; on the other hand, if application servers consume all of the I/O, the backend cannot safely maintain availability. Some of the reasons for resynchronization include:

  • Object policy changes
  • Host or disk group evacuations
  • Host upgrades (Hypervisor, on-disk format)
  • Object or component rebalancing
  • Object or component repairs

In earlier releases, there was a manual throttling mechanism to handle these kinds of situations. In vSAN 6.6, a throttling mechanism was introduced in the UI allowing a user to define a static limit on resync I/O. It required manual intervention, knowledge of performance metrics, and had limited ability to control I/O types at specific points in the storage stack.

With VMware vSAN 6.7, VMware introduced a new method to balance the use of resources during background activities such as resync and rebalancing. vSAN 6.7 distinguishes four I/O classes (VM, namespace, metadata, and resync I/O) and maintains a pending queue for each class.

vSAN employs a sophisticated, highly adaptive congestion control scheme to manage I/O from one or more resources. vSAN 6.7 has two distinct types of congestion to help regulate I/O, improving upon the single congestion type found in vSAN 6.6 and earlier.

Bandwidth congestion. This type of congestion can come from the feedback loop in the “bandwidth regulator”, and is used to tell the vSAN layer on the host that manages vSAN components the speed at which to process I/O.

Backpressure congestion. This type of congestion can come as the result of the pending queues for the various I/O classes filling to capacity. Backpressure congestion is visible in the UI by highlighting the cluster, clicking Monitor > vSAN > Performance, and selecting the “VM” category.

The benefit of this optimized congestion control method is the ability to better isolate the impact of congestion and improve resource utilization. The dispatch/fairness scheduler is at the heart of vSAN's ability to manage and regulate I/O based on the conditions of the environment. The separate queues for the I/O classes allow vSAN to prioritize new incoming I/O that may have an inherently higher priority than existing I/O waiting to be processed. If virtual machine latency reaches the high watermark, vSAN cuts the bandwidth of background operations in half. It then checks again, and if VM latency is still above the threshold, it halves the bandwidth of background operations once more. When latency falls below the low watermark, vSAN gradually increases the bandwidth of resync traffic until the low watermark is reached again, and holds it at that level. When resync and VM I/O activity is occurring and the aggregate bandwidth of the I/O classes exceeds the advertised bandwidth, resync I/Os are assigned no less than approximately 20% of the bandwidth, allocating approximately 80% for VM I/Os.
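
To make the watermark behaviour easier to follow, below is a minimal, hypothetical Python sketch of such a feedback loop. The function name, watermark values, and step size are illustrative assumptions and do not reflect vSAN's internal implementation; only the halving, gradual increase, and ~20% floor mirror the behaviour described above.

def adjust_resync_bandwidth(vm_latency_ms, resync_share,
                            high_wm_ms=30.0, low_wm_ms=10.0,
                            min_share=0.2, step=0.05):
    """Return a new resync bandwidth share (fraction of advertised bandwidth)."""
    if vm_latency_ms >= high_wm_ms:
        # VM latency at or above the high watermark: halve the resync share,
        # but keep the ~20% floor guaranteed to resync I/O under contention.
        return max(resync_share / 2, min_share)
    if vm_latency_ms <= low_wm_ms:
        # VM latency comfortably below the low watermark: grow the share gradually.
        return min(resync_share + step, 1.0)
    # Between the watermarks: hold the current level.
    return resync_share

# Example: latency spikes, then recovers.
bw = 0.8
for latency_ms in (45, 40, 8, 8, 8):
    bw = adjust_resync_bandwidth(latency_ms, bw)
    print(f"VM latency {latency_ms} ms -> resync share {bw:.2f}")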

Conclusion

Adaptive Resync in vSAN 6.7 provides an intelligent, adaptable flow control mechanism for managing resync I/O and VM I/O. If no resync activity is occurring, VM I/Os can consume up to 100% of the available bandwidth. If the combined resync and VM I/O is below the advertised bandwidth, neither I/O class is throttled. If the aggregate bandwidth of the I/O classes exceeds the advertised bandwidth, resync I/Os are assigned no less than approximately 20% of the bandwidth, leaving approximately 80% for VM I/Os.

 

VMware vSAN Capacity Management and Utilization Deep-dive

Introduction

VMware has put a lot of effort into making vSAN capacity tracking simpler. VMware vSAN provides several dashboards within vCenter to track capacity utilization, free space remaining, deduplication and compression efficiency, and more. In the latest release, there are more than 200 alarms that provide various alerts for vSAN.

In addition to the capacity dashboards, vSAN Health includes capacity-related health checks as well.

vSAN Capacity Overview

This dashboard, part of the vCenter UI, provides a capacity overview. The dashboard includes details on:

  1. Total Raw Capacity of the vSAN Datastore
  2. Details on Consumed Capacity
  3. Details on Remaining Capacity

The level of resilience configured has a direct impact on how much usable capacity is consumed. The vSAN Capacity Overview section lets us select a storage policy and see how much usable free space remains. In my lab, there is 113.42 GB of "Free Usable with Policy" capacity when a RAID 5/6 policy is the default policy for the vSAN datastore, but close to 226 GB of "Free Usable with Policy" when the vSAN Default Storage Policy is applied to the vSAN datastore. All of the Capacity Overview metrics are self-explanatory; you can see the details of each metric by hovering the mouse cursor over it.
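
As a rough illustration of how the selected policy factors into this metric, here is a minimal, hypothetical Python sketch. The overhead factors are the commonly cited FTT=1 values (2x for RAID-1 mirroring, 4/3 for RAID-5 erasure coding); the figure shown in the UI also accounts for additional overheads, so treat this as an approximation only.

POLICY_OVERHEAD = {
    "RAID-1 (Mirroring), FTT=1": 2.0,       # two full copies of the data
    "RAID-5 (Erasure Coding), FTT=1": 4/3,  # 3 data segments + 1 parity segment
}

def free_usable_with_policy(raw_free_gb, policy):
    # Divide the raw free space by the policy's capacity overhead factor.
    return raw_free_gb / POLICY_OVERHEAD[policy]

raw_free_gb = 450  # example raw free space, not taken from the lab above
for policy in POLICY_OVERHEAD:
    print(f"{policy}: ~{free_usable_with_policy(raw_free_gb, policy):.1f} GB usable")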

VMware recommends keeping 25-30% of raw capacity free on the vSAN datastore for a few reasons, such as snapshots. In addition, vSAN Health raises an alert if free space drops below 20% on any of the physical vSAN capacity drives. When this happens, vSAN initiates a rebalance operation to evenly distribute vSAN components across the capacity drives and free up space on the highly utilized drives.

Used Capacity Breakdown

The Used Capacity Breakdown displays the percentage of capacity used by different object types or data types. If you group by Data types, vSAN displays the percentage of capacity used by primary VM data, vSAN overhead, and temporary overhead.

If you want details of the capacity used by individual objects, change the "Group by" option to "Object types". When grouped by object type, vSAN displays the percentage of capacity used by the following object types:

  • Virtual disks
  • VM home objects
  • Swap objects
  • Performance management objects
  • Vmem objects
  • File system overhead
  • Checksum overhead
  • Deduplication and compression overhead
  • Space under deduplication engine consideration
  • iSCSI home and target objects, and iSCSI LUNs
  • Other, such as user-created files, VM templates, and so on

You can also view the vSAN datastore capacity history from the "Capacity History" tab. By default it shows the capacity history for 1 day, but you can choose any number of days from 1 to 30.

vSAN Health

The vSAN Health service monitors the utilization of the vSAN capacity drives, not the vSAN cache drives. As long as capacity drive utilization is below 80%, the vSAN Health UI shows it as green. It changes to yellow when utilization is between 80% and 95%, and to red once utilization goes above 95%.
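
A simple way to picture these thresholds is the mapping below; this is only an illustrative sketch of the colour bands described above, not vSAN's actual health-check code.

def capacity_health(utilization_pct):
    # Green below 80%, yellow from 80% to 95%, red above 95%.
    if utilization_pct < 80:
        return "green"
    if utilization_pct <= 95:
        return "yellow"
    return "red"

for pct in (55, 82, 97):
    print(f"{pct}% used -> {capacity_health(pct)}")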

vSAN automatically triggers its rebalance mechanism once capacity drive utilization crosses 80%. The rebalancing mechanism attempts to migrate data onto capacity drives in the vSAN cluster with less than 80% utilization. Because migrating data to other capacity drives generates a lot of traffic, rebalancing operations might impact virtual machine traffic. vSAN's Adaptive Resync feature monitors and dynamically adjusts resource utilization to avoid resync traffic contending with virtual machine traffic.

Dashboards in the vSphere Client

With the latest versions of vSphere, VMware has tried to make vSAN monitoring simpler. It is now very easy to deploy vRealize Operations and view its dashboards directly in the vSphere Client. Administrators no longer need to switch tools or learn a new UI for basic monitoring.

The vSphere Client shows information such as current utilization and historical trends. vRealize Operations can be deployed to provide overview dashboards integrated into the vSphere Client for simplicity and ease of use. I hope this was informative for you. Thanks for reading, and please share if you found it worth sharing!

 

VMware vSAN Stretched Cluster Deep Dive

Introduction to VMware vSAN Stretched Cluster

VMware vSAN Stretched Cluster was introduced back in vSAN 6.1 as a specific configuration for environments where avoiding disaster and downtime is a primary requirement. A stretched cluster deployment has two active/active data sites connected by a well-connected network with a round-trip time (RTT) latency of no more than 5 ms. Both data sites are connected to a third site hosting the vSAN Witness host, to avoid a "split-brain" situation in case connectivity is lost between the data sites. You can have a maximum of 31 hosts in a vSAN Stretched Cluster deployment (15 hosts in each data site and 1 witness host at the third site). A virtual machine deployed in a vSAN Stretched Cluster has one copy of its data in each data site, with its witness components at the third site.

 

vSAN Stretched Cluster implementation limitations

  1. In a vSAN Stretched Cluster, there are only 3 fault domains. The maximum FTT supported is 1 in pre-vSAN 6.6 stretched clusters.
  2. SMP-FT is not supported when the FT primary VM and secondary VM run in different sites; it is supported only when the primary and secondary VMs run in the same fault domain.
  3. The erasure coding feature is not supported in stretched cluster configurations prior to vSAN 6.6, but it is supported for local protection within a site when using a vSAN stretched cluster with per-site policies.
  4. The vSAN iSCSI Target Service is not supported.

Networking and Latency Requirements

  1. VMware recommends that vSAN communication between the data sites be over stretched L2.
  2. VMware recommends that vSAN communication between the data sites and the witness site be routed over L3.
  3. VMware supports a round-trip latency between the data sites and the witness site of less than or equal to 200 milliseconds for vSAN Stretched Cluster configurations up to 10+10+1. For configurations greater than 10+10+1, the supported latency is less than or equal to 100 milliseconds.
  4. The bandwidth required between the data sites and the witness node depends on the number of objects residing on vSAN. A standard rule of thumb is 2 Mbps for every 1000 components on vSAN.

Bandwidth Calculation

Between Data Sites

The bandwidth requirement between the two data sites depends on the workload, in particular the number of write operations per ESXi host; rebuild traffic also needs to be factored in. Assuming read locality, there is no inter-site read traffic, so read operations do not need to be factored in.

The required bandwidth between the two data sites (B) is equal to the Write bandwidth (Wb) * data multiplier (md) * resynchronization multiplier (mr):

B = Wb * md * mr

VMware recommends a data multiplier value of 1.4. For resynchronization traffic, VMware recommends an additional 25%, i.e. mr = 1.25. Assuming 20,000 write IOPS with a "typical" 4 KB write size, the write bandwidth would be 80 MB/s, or 640 Mbps.

Bandwidth = 640 * 1.4 * 1.25  = 1120 Mbps
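
The same sizing calculation can be expressed as a small Python helper. The defaults below are the multipliers and the 20,000 IOPS / 4 KB example from the text; they are sizing assumptions, not fixed values.

def intersite_bandwidth_mbps(write_iops, io_size_kb=4,
                             data_multiplier=1.4, resync_multiplier=1.25):
    # B = Wb * md * mr, with the write bandwidth converted from KB/s to Mbps.
    write_mbps = write_iops * io_size_kb * 8 / 1000
    return write_mbps * data_multiplier * resync_multiplier

# 20,000 write IOPS of 4 KB -> 640 Mbps write bandwidth -> 1120 Mbps required.
print(intersite_bandwidth_mbps(20000))  # 1120.0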

Between Data and Witness Site

The required bandwidth between the witness and each data site is equal to ~1138 B x number of components / 5s. Assuming 1000 components on the data site, the bandwidth required between the witness and the data site will be:

1138 * 8 * 1000 / 5 = 1,820,800 bps, i.e. ~1.82 Mbps

With a 10% buffer, as a rule of thumb we can consider that 2 Mbps of bandwidth is required for every 1000 components.
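
The same rule of thumb can be expressed as a small helper; the 1138-byte payload and 5-second window come from the formula above, and the 10% buffer is the rule of thumb just mentioned.

def witness_bandwidth_mbps(components, payload_bytes=1138,
                           window_s=5, buffer_pct=0.10):
    # ~1138 B per component over a 5-second window, plus a 10% buffer.
    bps = payload_bytes * 8 * components / window_s
    return bps * (1 + buffer_pct) / 1_000_000

print(round(witness_bandwidth_mbps(1000), 2))  # ~2.0 Mbps for 1000 components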

Configuring VMware vSAN Stretched Cluster

I hope this was informative for you. Thanks for reading, and please share if you found it worth sharing!

VMware vSAN Performance Service: What is it?

Introduction to vSAN Performance Service

VMware introduced the vSAN performance service back in vSAN 6.2, which vSAN administrators can leverage to monitor the performance of a vSAN environment. The vSAN performance service collects and analyzes performance statistics and displays the data in a graphical format.

You might ask why VMware introduced the vSAN Performance Service when we already had vSAN Observer to troubleshoot vSAN performance issues. Even though vSAN Observer is a great tool, it has some limitations:

  • vSAN Observer doesn't provide historical data; it shows only the real-time status of the system.
  • vSAN Observer runs as its own separate web service and is not integrated with the vSphere Web Client.
  • vSAN Observer provides a lot of metrics, which sometimes require expert skills to interpret.
  • It has an impact on vCenter Server.
  • It has to be enabled manually from the RVC CLI.
  • There is no API access.

VMware tried to address all of these limitations with the vSAN performance service. In addition, the vSAN performance service does not require the use of vRealize Operations Manager or the vCenter database. Instead, it uses the vSAN object store to store its data in a distributed and protected fashion.

Configuring vSAN Performance Service

When you create a vSAN cluster, the performance service is disabled by default.

You need to turn on the performance service to monitor the vSAN cluster, hosts, disks, and VMs.

You can modify the time range to the last 1-24 hours or a custom date and time range. It is also possible to save performance data for later viewing. Once you enable the performance service, its database is stored as a vSAN object independent of vCenter, and a storage policy is assigned to the object to control its availability and space consumption.

I hope this was informative for you. Please share if you found it worth sharing.

 

 

 

VMware vSAN Witness Appliance Requirements

Introduction to vSAN Witness Appliance

As VMware's SDDC (Software-Defined Datacenter) offerings continued to mature, with Virtual SAN 6.1 VMware announced two great new features: vSAN Stretched Cluster and 2-Node vSAN Cluster. In a stretched or 2-node cluster, a dedicated witness host is required to maintain quorum. Because the purpose of the witness host is only to store virtual machine witness components, to avoid a split-brain situation in case of a host failure, not much compute needs to be available on the witness host. You can configure either a physical host or a virtual appliance to act as the witness. Leveraging the vSAN Witness virtual appliance gives a significant cost saving for customers who wish to deploy a vSAN Stretched Cluster or a 2-Node ROBO cluster, as the witness virtual appliance comes with the required licenses, whereas you need to buy separate licenses if you want to configure a physical server as the witness host. A witness appliance cannot be shared between configurations; it has a 1:1 relationship with a stretched cluster or with a 2-Node/ROBO configuration.

Minimal Requirements to Host the vSAN Witness Appliance

  • The vSAN Witness Appliance must run on an ESXi 5.5 or greater VMware host backed by any supported storage (VMFS datastore, NFS datastore, or vSAN cluster).
  • Networking must be in place that allows for the vSAN Witness Appliance to properly communicate with the vSAN 2 Node Cluster.

Bandwidth requirements Between 2 Node vSAN and the Witness Site

The witness appliance doesn't hold any virtual machine data; it only holds metadata for the VM components such as the VM home, swap object, virtual disks, and snapshots. So the bandwidth requirement between the witness site and the data site is not calculated the same way as the bandwidth requirement between the data sites of a stretched cluster.

The required bandwidth between the witness and each site is equal to ~1138 B x number of components / 5s. The 1138 B value comes from the operations that occur when the preferred site goes offline and the secondary site takes ownership of all of the components.

Deploying a Witness Appliance

When deploying the vSAN Witness Appliance, there are 3 potential deployment options: Tiny, Normal, & Large. These deployment profiles all have 2 vCPUs, 1 vmdk for the ESXi installation, 1 vmdk for the vSAN cache device, and at least 1 vmdk for the vSAN capacity.

Deploying New Witness Host in case of failure

This concludes the deployment of the VMware vSAN Witness Appliance. I hope this was informative for you. Please share if you found it worth sharing.

Configuring a VMware vSAN 2-Node Cluster

Introduction to 2-Node vSAN Cluster

The vSAN 2-Node configuration was initially introduced in vSAN 6.1. Prior to 2-node vSAN, a three-node cluster was the minimum supported configuration for vSAN-enabled environments. VMware vSAN 2-Node clusters are supported on both hybrid and all-flash configurations. A 2-node cluster is very useful for ROBO (remote office/branch office) scenarios, and you don't need a NAS or SAN for shared storage. Each node is configured as a vSAN fault domain, and the supported configuration is 1+1+1 (2 nodes + vSAN witness host). As we know, vSAN is something like RAID over the network. vSAN normally supports RAID 1 and RAID 5, but in a 2-node cluster only RAID 1 is available. When VM objects are created and stored on vSAN, the data is written to the disk drives of both nodes, so two data components are created: the original data and the replica.

vSAN Two-Node Architecture

The two-node vSAN architecture builds on the concept of fault domains, where each of the two VMware ESXi™ hosts represents a single fault domain. In the vSAN architecture, the objects that make up a virtual machine are typically stored as a redundant mirror across two fault domains, assuming the Number of Failures to Tolerate is equal to 1. As a result, in a scenario where one of the hosts goes offline, the virtual machines can continue to run, or be restarted, on the alternate node. To achieve this, a witness is required to act as a tie-breaker, to achieve quorum, and to enable the surviving node in the cluster to restart the affected virtual machines. However, unlike a traditional vSAN-enabled cluster, where the witness objects are local to the configured cluster hosts, in a two-node architecture the witness objects are located externally at a second site, on a dedicated virtual appliance specifically configured to store metadata and to provide the required quorum services in case of a host failure.

vSAN Witness Appliance Licensing

VMware introduced the vSAN Witness Appliance as a free alternative to using a physical ESXi host as a vSAN Witness Host. This appliance is only used for housing vSAN Object Witness Components, and is not allowed to run any virtual machine workloads.

Networking and Latency Requirements

VMware recommends that vSAN communication between the vSAN data nodes be over L2. Communication between the data nodes and the witness appliance can be over L2 (if the witness host is in the same site) or L3 (if the witness host is in an alternate site). If the witness appliance is located in an alternate site, the supported latency between the data nodes and the witness appliance is 500 ms RTT (250 ms one way).

Deploying 2-Node VSAN Cluster

 

I hope this was informative for you. Please share if you found it worth sharing!

Understanding Degraded Device Handling in VMware VSAN

Degraded Device Handling in VMware VSAN

As the name suggests, Degraded Device Handling (DDH), or Dying Disk Handling, is an unhealthy-drive detection method that helps VMware VSAN customers avoid cluster performance degradation caused by an unhealthy drive. There can be situations where a drive that is part of VSAN has not completely failed but shows inconsistent behavior and generates a lot of I/O retries and errors. The question is: how do we deal with such a situation?

With VSAN 6.1, VMware introduced the functionality called Degraded Device Handling (DDH), where vSAN itself monitors drives for excessive read or write latency. If vSAN observed that the average latency of a drive was higher than 50 ms over a 10-minute period, it dismounted the drive. Once dismounted, the components on that drive were marked as absent, and rebuilding of the components started after a period of 60 minutes. Dismounting the drive based only on the last 10 minutes of data leads to a number of challenges, as the drive might only be temporarily reporting a higher average latency.

To overcome the issues caused by false positives from drives temporarily reporting higher average latencies, VMware made a number of enhancements to the VSAN unhealthy-drive detection method in subsequent releases:

  1. Average latency is tracked over multiple, randomly selected 10-minute intervals, not just the last 10 minutes. A drive is marked unhealthy only when the average write I/O round-trip latency exceeds the configured threshold four times within the last six-hour period (see the sketch after this list).
  2. In the case of high read latencies, cache or capacity devices are not dismounted.
  3. In the case of high write latency, the cache drive is not dismounted.
  4. DDH only dismounts capacity devices with high write latencies.
  5. The latency threshold is set to 500 ms for a magnetic disk and 200 ms for an SSD.
  6. DDH attempts to remount an unmounted drive approximately 24 times over a 24-hour period.
  7. DDH does not unmount a drive if it holds the last remaining copy of the data. Instead, DDH starts evacuating data from that device immediately, in contrast to waiting for the vSAN CLOM rebuild timer (60 minutes by default) to expire before rebuilding copies of "absent" components.
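
As a rough illustration of the detection rule in item 1 above, here is a minimal, hypothetical Python sketch; the interval sampling and bookkeeping are simplified assumptions and do not represent vSAN's internal implementation.

import random

def is_drive_unhealthy(interval_latencies_ms, threshold_ms,
                       sampled_intervals=10, exceed_limit=4):
    # interval_latencies_ms: average write latency for each 10-minute
    # interval in the last six hours. Randomly sample some intervals and
    # flag the drive only if the threshold is exceeded at least four times.
    picked = random.sample(interval_latencies_ms,
                           min(sampled_intervals, len(interval_latencies_ms)))
    exceedances = sum(1 for latency in picked if latency > threshold_ms)
    return exceedances >= exceed_limit

# Example: an SSD (200 ms threshold) with a handful of latency spikes.
samples = [20] * 30 + [350] * 6
print(is_drive_unhealthy(samples, threshold_ms=200))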

Once the VSAN unhealthy-drive detection method detects an unhealthy disk, it logs key disk SMART attributes for monitoring and detecting errors on that disk. The SMART attributes listed below give an idea of why the device was inconsistent and why DDH chose to unmount it.

  • Re-allocated sector count. Attribute ID = 0x05.
  • Uncorrectable errors count. Attribute ID = 0xBB.
  • Command timeouts count. Attribute ID = 0xBC.
  • Re-allocated sector event count. Attribute ID = 0xC4.
  • Pending re-allocated sector count. Attribute ID = 0xC5.
  • Uncorrectable sector count. Attribute ID = 0xC6.

We have seen a few cases where a drive failed without any warning. Predicting device failure and proactively evacuating data from a degraded device enhances the resilience of a vSAN datastore.

Updating VMware Virtual SAN HCL database offline

VMware offers certified compatibility guides which list system, I/O, storage/SAN, and backup compatibility with VMware Infrastructure and previous versions of VMware ESX Server. The VSAN health check leverages a copy of the VMware Compatibility Guide database stored on the vCenter Server, rather than querying the VMware website, for its various HCL health checks.

VMware ships a copy of the compatibility guide database that can be used for HCL checks, which was current when the product was released. This database becomes outdated over time, because new partner certifications keep getting added to the VMware Compatibility Guide. Hardware vendors regularly update their drivers, and VMware adds certifications for them. Therefore, it is critically important to keep the local copy up to date.

In one of my earlier posts, I covered the VSAN Hardware Compatibility List Checker, a very nice VMware Fling to verify the underlying VSAN hardware.

The VMware VSAN HCL database can be updated automatically from the VMware website if the vCenter Server can connect to the internet directly or via a proxy. If you don't have direct or proxy internet access, the VSAN HCL database can be downloaded manually and uploaded to the vCenter Server. To download the VSAN HCL database manually, open the URL below in a web browser and save the content to a file with a .json extension.

http://partnerweb.vmware.com/service/vsan/all.json
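
If you prefer to script the download on a machine that does have internet access, a minimal Python sketch like the one below works; the output file name is arbitrary.

import urllib.request

# Download the VSAN HCL database for later offline upload to vCenter Server.
HCL_URL = "http://partnerweb.vmware.com/service/vsan/all.json"

with urllib.request.urlopen(HCL_URL) as response:
    data = response.read()

with open("vsan-hcl.json", "wb") as f:  # save with a .json extension
    f.write(data)

print("Saved", len(data), "bytes to vsan-hcl.json")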

Once you have the updated VSAN HCL database JSON file, you can upload it to the vCenter Server using the Web Client.


In this post we covered the steps to update the VSAN HCL database offline. Thanks for reading. Be social and share it on social media if you feel it is worth sharing. Happy learning 🙂

 

VMware VSAN Network Design Considerations

VMware Virtual SAN is a distributed shared storage solution that enables the rapid provisioning of storage within VMware vCenter. Because Virtual SAN is distributed shared storage, it depends heavily on a correctly configured network for virtual machine I/O and for communication between Virtual SAN cluster nodes. Since the majority of virtual machine I/O travels across the network due to the distributed storage architecture, a high-performing and highly available network configuration is critical to a successful Virtual SAN deployment.

In this post we will cover a few important points that need to be considered from a network perspective before a VMware VSAN deployment.

Supported Network Interface Cards

In a VMware Virtual SAN hybrid configuration, Virtual SAN supports both 1 Gb and 10 Gb network interface cards. If a 1 Gb NIC is installed on the ESXi host, VMware requires this NIC to be dedicated solely to Virtual SAN traffic. If a 10 Gb NIC is used, it can be shared with other network traffic types. It is advisable to implement QoS using Network I/O Control to prevent any one traffic type from claiming all the bandwidth. Considering the potential for an increased volume of network traffic between hosts to achieve higher throughput, for a Virtual SAN all-flash configuration VMware supports only 10 Gb network interface cards, which can be shared with other network traffic types.

Teaming Network Interface Cards

Virtual SAN supports Route Based on IP Hash load balancing, but cannot guarantee a performance improvement for all configurations. IP hash performs load balancing when Virtual SAN traffic is one of many network traffic types on the team. By design, Virtual SAN network traffic is not load balanced across teamed network interface cards. NIC teaming for Virtual SAN traffic is primarily a way of making the Virtual SAN network highly available, where a standby adapter takes over communication if the primary adapter fails.

Jumbo Frame Support

VMware Virtual SAN supports jumbo frames. Even though using jumbo frames can reduce CPU utilization and improve throughput, VMware recommends configuring jumbo frames only if the network infrastructure already supports them. Because vSphere already uses TCP segmentation offload (TSO) and large receive offload (LRO), jumbo frames configured for Virtual SAN provide limited CPU and performance benefits. The biggest gains from jumbo frames are found in all-flash configurations.

Multicast Requirement

Multicast forwarding is a one-to-many or many-to-many distribution of network traffic. Rather than using the network address of the intended recipient for its destination address, multicast uses a special destination address to logically identify a group of receivers.

One of the requirements for VSAN is to allow multicast traffic on the VSAN network between the ESXi hosts participating in the VSAN cluster. Multicast is used for discovering ESXi hosts and for keeping track of changes within the Virtual SAN cluster. Before deploying VMware Virtual SAN, testing the multicast performance of the switch being used is also very important; ensure a high-quality enterprise switch is used for Virtual SAN multicast traffic. The Virtual SAN health service can also be leveraged to test multicast performance.

Summary of network design considerations

  • Virtual SAN hybrid configurations support 1 Gb and 10 Gb networks.
  • Virtual SAN all-flash configurations support only 10 Gb networks.
  • Consider implementing QoS for Virtual SAN traffic using NIOC.
  • Consider jumbo frames for Virtual SAN traffic if they are already configured in the network infrastructure.
  • Consider NIC teaming for availability/redundancy of Virtual SAN traffic.
  • Multicast must be configured and functional between all hosts.

I hope this is informative for you. Thanks for reading; be social and share it on social media if you feel it is worth sharing. Happy learning… 🙂

VMware Virtual SAN Quiz

The questions in this quiz are for learning purposes only and are not intended to reflect exam content.

I hope this quiz is informative for you. Thanks for visiting this blog. Be social and share it on social media if you feel it is worth sharing.