Understanding Adaptive Resync in VMware vSAN 6.7

Introduction to Adaptive Resync

Consistent performance delivery and data resiliency are two key tenets of an enterprise storage solution. In the case of a host or disk failure, some components of a virtual machine may become non-compliant because of the missing data. To make the impacted virtual machines compliant with their storage policy, vSAN resynchronizes the data onto the hosts or drives where sufficient resources are available.

Resync operations aim to recreate missing components as quickly as possible. They consume I/O on the disk drives where vSAN is rebuilding the missing objects of the impacted virtual machines. The longer a resync takes, the longer the data is at risk. If background operations such as resync and rebalancing consume all available I/O, application performance suffers; on the other hand, if application servers consume all I/O, the backend cannot safely maintain availability. Some of the reasons for resynchronizations include:

  • Object policy changes
  • Host or disk group evacuations
  • Host upgrades (Hypervisor, on-disk format)
  • Object or component rebalancing
  • Object or component repairs

In earlier releases, there was a manual throttling mechanism to handle these situations. In vSAN 6.6, a throttling mechanism was introduced in the UI, allowing a user to define a static limit on resync I/O. It required manual intervention, demanded knowledge of performance metrics, and had limited ability to control I/O types at specific points in the storage stack.

With vSAN 6.7, VMware introduced a new method to balance the use of resources during background activities such as resync and rebalancing. vSAN 6.7 distinguishes four classes of I/O and maintains a pending queue for each I/O class.

vSAN employs a sophisticated, highly adaptive congestion control scheme to manage I/O from one or more resources. vSAN 6.7 has two distinct types of congestion to help regulate I/O, improving upon the single congestion type found in vSAN 6.6 and earlier.

Bandwidth congestion. This type of congestion comes from the feedback loop in the “bandwidth regulator” and tells the layer of vSAN on the host that manages vSAN components the rate at which to process I/O.

Backpressure congestion. This type of congestion can come as the result of the pending queues for the various I/O classes filling to capacity. Backpressure congestion is visible in the UI by highlighting the cluster, clicking Monitor > vSAN > Performance, and selecting the “VM” category.

The benefit of this optimized congestion control method is the ability to better isolate the impact of congestion and improve resource utilization. The dispatch/fairness scheduler is at the heart of vSAN’s ability to manage and regulate I/O based on the conditions of the environment. The separate queues for the I/O classes allow vSAN to prioritize new incoming I/O that may have an inherently higher priority over existing I/O waiting to be processed. If virtual machine latency reaches the high watermark, vSAN cuts the bandwidth allotted to background operations in half. It then checks again: if VM latency is still above the watermark, it halves the background bandwidth once more. When latency drops below the low watermark, vSAN gradually increases the bandwidth of resync traffic until the low watermark is reached, and holds it at that level.


Adaptive Resync in vSAN 6.7 introduces a fully intelligent, adaptive flow-control mechanism for managing resync I/O and VM I/O. If no resync activity is occurring, VM I/O can consume up to 100% of the available bandwidth. If the combined resync and VM I/O is below the advertised available bandwidth, neither I/O class is throttled. If the aggregate bandwidth of the I/O classes exceeds the advertised bandwidth, resync I/O is guaranteed no less than approximately 20% of the bandwidth, leaving approximately 80% for VM I/O.
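As a rough illustration of the feedback loop described above, here is a minimal sketch in Python. The watermark values, step size, and function are hypothetical; VMware does not publish the internal algorithm in this detail, so this only mirrors the halving/ramp-up behavior and the ~20% resync floor described in the text.

```python
# Illustrative sketch of an adaptive-resync style bandwidth regulator.
# Watermark thresholds and the ramp-up step are made-up values, not
# vSAN's actual internals.

RESYNC_FLOOR = 0.20        # resync is guaranteed ~20% under contention
HIGH_WATERMARK_MS = 20.0   # hypothetical VM latency high watermark
LOW_WATERMARK_MS = 5.0     # hypothetical VM latency low watermark

def adjust_resync_share(vm_latency_ms: float, resync_share: float) -> float:
    """Return the new fraction of bandwidth allotted to resync traffic."""
    if vm_latency_ms > HIGH_WATERMARK_MS:
        # VM latency too high: cut background bandwidth in half,
        # but never below the guaranteed floor.
        return max(resync_share / 2, RESYNC_FLOOR)
    if vm_latency_ms < LOW_WATERMARK_MS:
        # VMs are comfortable: gradually give resync more bandwidth.
        return min(resync_share * 1.1, 1.0)
    # Between the watermarks: hold steady.
    return resync_share

share = 0.8
share = adjust_resync_share(vm_latency_ms=25.0, resync_share=share)  # halved to 0.4
share = adjust_resync_share(vm_latency_ms=25.0, resync_share=share)  # halved again, floored at 0.2
```

Two consecutive high-latency checks halve the resync share twice, but it never drops below the 20% floor, matching the behavior described above.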


VMware vSAN Capacity Management and Utilization Deep-dive


VMware is putting a lot of effort into making it simpler to track vSAN capacity utilization. VMware vSAN provides several dashboards within vCenter to track capacity utilization, free space remaining, deduplication and compression efficiency, and more. In the latest release, there are more than 200 alarms that provide various alerts for vSAN.

In addition to the capacity dashboard, vSAN Health includes capacity-related health checks as well.

vSAN Capacity Overview

This dashboard, part of the vCenter UI, provides a capacity overview. The dashboard includes details on:

  1. Total Raw Capacity of the vSAN Datastore
  2. Details on Consumed Capacity
  3. Details on Remaining Capacity

The level of resilience configured has a direct impact on how much usable capacity is consumed. The vSAN Capacity Overview section lets us select a storage policy and see how much usable free space remains. In my lab, there is 113.42 GB of “Free Usable with Policy” capacity when a RAID 5/6 policy is the default policy for the vSAN datastore, but close to 226 GB of “Free Usable with Policy” when the vSAN Default Storage Policy is applied to the vSAN datastore. All of the Capacity Overview metrics are self-explanatory; you can see the details of a metric by hovering the mouse cursor over it.
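The relationship between raw free space and usable free space under a policy can be sketched as follows. The multipliers are the standard vSAN protection overheads for each scheme; the raw free value used here is hypothetical and not taken from my lab.

```python
# Rough illustration of how "Free Usable With Policy" relates to raw
# free space. The multipliers are the standard protection overheads;
# the raw_free figure is a made-up example.

POLICY_MULTIPLIER = {
    "RAID-1 (FTT=1)": 2.0,    # full mirror: 2x raw per usable GB
    "RAID-5 (FTT=1)": 4 / 3,  # 3 data + 1 parity: ~1.33x
    "RAID-6 (FTT=2)": 1.5,    # 4 data + 2 parity: 1.5x
}

def free_usable_with_policy(raw_free_gb: float, policy: str) -> float:
    """Estimate usable free capacity after protection overhead."""
    return raw_free_gb / POLICY_MULTIPLIER[policy]

raw_free = 452.0  # hypothetical raw free space in GB
for policy in POLICY_MULTIPLIER:
    print(f"{policy}: {free_usable_with_policy(raw_free, policy):.2f} GB usable")
```

The same raw free space yields different usable figures per policy, which is exactly why the dashboard asks you to pick a policy before showing “Free Usable with Policy”.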

VMware recommends keeping 25-30% raw free space on the vSAN datastore for a few reasons, such as accommodating snapshots. In addition, vSAN Health will raise an alert if free space drops below 20% on any of the physical vSAN capacity drives. When this happens, vSAN initiates a rebalance operation to distribute vSAN components evenly across capacity drives and free up space on the highly utilized drives.

Used Capacity Breakdown

The Used Capacity Breakdown displays the percentage of capacity used, grouped by data types or object types. If you select Data types, vSAN displays the percentage of capacity used by primary VM data, vSAN overhead, and temporary overhead.

If you want details of the capacity used by individual objects, change “Group by” to “Object types”. vSAN then displays the percentage of capacity used by the following object types:

  • Virtual disks
  • VM home objects
  • Swap objects
  • Performance management objects
  • Vmem objects
  • File system overhead
  • Checksum overhead
  • Deduplication and compression overhead
  • Space under deduplication engine consideration
  • iSCSI home and target objects, and iSCSI LUNs
  • Other, such as user-created files, VM templates, and so on

You can also get details of the vSAN datastore capacity history from the “Capacity History” tab. By default it shows the capacity history for 1 day, but you can choose any number of days from 1 to 30.

vSAN Health

The vSAN Health service monitors the capacity of the vSAN capacity drives, not the vSAN cache drives. As long as capacity drive utilization is below 80%, the vSAN Health UI shows it as green. It changes to yellow when utilization is between 80% and 95%, and to red once utilization rises above 95%.
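The color bands above can be expressed as a tiny function (a sketch of the documented thresholds, not VMware's code):

```python
# Minimal sketch of the capacity-drive health thresholds described above.

def capacity_health(utilization_pct: float) -> str:
    """Map capacity-drive utilization (%) to the vSAN Health color."""
    if utilization_pct > 95:
        return "red"
    if utilization_pct >= 80:
        return "yellow"
    return "green"

print(capacity_health(75))  # green
print(capacity_health(85))  # yellow
print(capacity_health(97))  # red
```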

vSAN automatically triggers the rebalance mechanism once capacity drive utilization crosses 80%. The rebalancing mechanism attempts to migrate data onto capacity drives with less than 80% utilization elsewhere in the vSAN cluster. Because migrating data to other capacity drives generates a lot of traffic, the rebalancing operation could impact virtual machine traffic. vSAN’s Adaptive Resync feature monitors and dynamically adjusts resource utilization to avoid resync traffic contending with virtual machine traffic.

Dashboards in the vSphere Client

With the latest versions of vSphere, VMware has tried to make vSAN monitoring simpler. It is now very easy to deploy vRealize Operations and view its dashboards directly in the vSphere Client. Administrators no longer need to switch tools and learn a new UI for basic monitoring.

The vSphere Client shows information such as current utilization and historical trends. vRealize Operations can be deployed to provide overview dashboards integrated into the vSphere Client for simplicity and ease of use. Hope this was informative for you. Thanks for reading. Please share if you found it worth sharing!


VMware vSAN Stretched Cluster Deep Dive

Introduction to VMware vSAN Stretched Cluster

VMware vSAN Stretched Cluster was introduced back in vSAN 6.1 as a specific configuration for environments where protection against site-level disaster and downtime is a primary requirement. A stretched cluster deployment has two active/active data sites connected by a well-connected network with a round-trip time (RTT) latency of no more than 5 ms. Both data sites are connected to a third site hosting the vSAN Witness host, which avoids a “split-brain” situation if connectivity is lost between the data sites. You can have a maximum of 31 hosts in a vSAN stretched cluster deployment (15 hosts in each data site and 1 witness host at the third site). A virtual machine deployed in a vSAN stretched cluster has one copy of its data in each data site, with its witness components at the third site.


vSAN Stretched Cluster implementation limitations

  1. In a vSAN stretched cluster, there are only 3 fault domains. The maximum FTT supported is 1 in pre-vSAN 6.6 stretched clusters.
  2. SMP-FT is not supported if the FT primary VM and secondary VM run in different sites. SMP-FT is supported when the primary and secondary VMs run in the same fault domain.
  3. The erasure coding feature is not supported in stretched cluster configurations pre-vSAN 6.6, but it is supported for local protection within a site when using a vSAN stretched cluster with per-site policies.
  4. The vSAN iSCSI Target Service is not supported.

Networking and Latency Requirements

  1. VMware recommends that vSAN communication between the data sites be over stretched L2.
  2. VMware recommends that vSAN communication between the data sites and the witness site be routed over L3.
  3. VMware supports latency to the witness site of less than or equal to 200 milliseconds in vSAN stretched cluster configurations up to 10+10+1. For configurations greater than 10+10+1, the supported latency is less than or equal to 100 milliseconds.
  4. Bandwidth between the data sites and the witness node depends on the number of objects residing on vSAN. A standard rule of thumb is 2 Mbps for every 1000 components on vSAN.

Bandwidth Calculation

Between Data Sites

The bandwidth requirement between the two data sites depends on the workload, in particular the number of write operations per ESXi host; rebuild traffic also needs to be factored in. Assuming read locality, there is no inter-site read traffic, so read operations do not need to be factored in.

The required bandwidth between the two data sites (B) is equal to the Write bandwidth (Wb) * data multiplier (md) * resynchronization multiplier (mr):

B = Wb * md * mr

VMware recommends a data multiplier value of 1.4. For resynchronization traffic, VMware recommends an additional 25%, i.e. mr = 1.25. Assuming 20,000 write IOPS at a “typical” 4 KB write size, the write bandwidth would be 80 MB/s, or 640 Mbps.

Bandwidth = 640 * 1.4 * 1.25  = 1120 Mbps
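The calculation above can be reproduced in a few lines. The function name and the unit conversion are mine; the constants (1.4 data multiplier, 1.25 resync multiplier) come straight from the example.

```python
# The inter-site bandwidth formula B = Wb * md * mr, worked through in
# code for the example workload: 20,000 write IOPS at 4 KB per write.

def intersite_bandwidth_mbps(write_iops: int, io_size_kb: float,
                             data_multiplier: float = 1.4,
                             resync_multiplier: float = 1.25) -> float:
    """Required inter-site bandwidth in Mbps."""
    # Convert the write workload from IOPS to Mbps (KB/s -> Kbps -> Mbps).
    write_mbps = write_iops * io_size_kb * 8 / 1000
    return write_mbps * data_multiplier * resync_multiplier

print(intersite_bandwidth_mbps(20_000, 4))  # ~1120 Mbps, matching the example
```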

Between Data and Witness Site

The required bandwidth between the witness and each data site is approximately 1138 B × number of components / 5 s, converted to bits per second. Assuming 1000 components in a data site, the bandwidth required between the witness and the data site will be:

1138 * 8 * 1000 / 5 = 1,820,800 bps, i.e. ~1.82 Mbps

With a 10% buffer, as a rule of thumb we can consider that 2 Mbps of bandwidth is required for every 1000 components.
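The witness rule of thumb works out the same way in code. The 1138-byte figure per component and the 5-second window are from the text; the function itself is illustrative.

```python
# Witness-link bandwidth rule of thumb: ~1138 bytes per component
# every 5 seconds, plus a 10% safety buffer.

def witness_bandwidth_mbps(components: int, bytes_per_component: int = 1138,
                           window_s: float = 5.0,
                           buffer_pct: float = 0.10) -> float:
    """Required witness-link bandwidth in Mbps, including the buffer."""
    bps = bytes_per_component * 8 * components / window_s
    return bps * (1 + buffer_pct) / 1_000_000

print(witness_bandwidth_mbps(1000))  # ~2.0 Mbps for 1000 components
```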

Configuring VMware vSAN Stretched Cluster

Hope this was informative for you. Thanks for reading. Please share if you found it worth sharing!

Configuring VMware vSAN 2-Node Cluster

Introduction to 2-Node vSAN Cluster

The vSAN 2-node configuration was initially introduced in vSAN 6.1. Prior to 2-node vSAN, a three-node cluster was the minimum supported configuration for vSAN-enabled environments. VMware vSAN 2-node clusters are supported on both hybrid and all-flash configurations. A 2-node cluster is very useful for ROBO (remote office/branch office) scenarios: you don’t need a NAS or SAN for shared storage. Each node is configured as a vSAN fault domain. The supported configuration is 1+1+1 (2 nodes + a vSAN witness host). vSAN can be thought of as RAID over the network. vSAN normally supports RAID 1 and RAID 5, but in a 2-node cluster only RAID 1 is available. When VM objects are created and stored in vSAN, the data is written to the disk drives of both nodes, so two components are created: the original data and the replica.

vSAN Two-Node Architecture

The two-node vSAN architecture builds on the concept of fault domains, where each of the two VMware ESXi™ hosts represents a single fault domain. In the vSAN architecture, the objects that make up a virtual machine are typically stored as a redundant mirror across two fault domains, assuming the Number of Failures to Tolerate is equal to 1. As a result, if one of the hosts goes offline, the virtual machines can continue to run, or be restarted, on the alternate node. To achieve this, a witness is required to act as a tie-breaker, establish a quorum, and enable the surviving node in the cluster to restart the affected virtual machines. However, unlike a traditional vSAN-enabled cluster, where the witness objects are local to the configured cluster hosts, in a two-node architecture the witness objects are located externally at a second site, on a dedicated virtual appliance specifically configured to store metadata and provide the required quorum services in the event of a host failure.

vSAN Witness Appliance Licensing

VMware introduced the vSAN Witness Appliance as a free alternative to using a physical ESXi host as a vSAN Witness Host. This appliance is only used for housing vSAN Object Witness Components, and is not allowed to run any virtual machine workloads.

Networking and Latency Requirements

VMware recommends that vSAN communication between the vSAN data nodes be over L2. Communication between the data nodes and the witness appliance can be over L2 (if the witness host is in the same site) or L3 (if the witness host is in an alternate site). If the witness appliance is located in an alternate site, the supported latency between the data nodes and the witness appliance is 500 ms RTT (250 ms one way).

Deploying a 2-Node vSAN Cluster


Hope this was informative for you. Please share if you found it worth sharing!