.. post:: Oct 23, 2015
   :tags: monster vm, pvscsi, cache
   :author: Tze Liang

.. _sec-iaas-monster-vm-design:

**********************
IaaS Monster VM Design
**********************

VMs with large performance requirements (i.e., greater than 8 vCPU, and/or
memory or disk IO intensive requirements) are generally classified as monster
VMs. Monster VMs are widely deployed and fully supported, and they will become
increasingly common going forward. This article provides high-level
considerations for designing and configuring monster workloads in a
virtualised environment.

There is a misconception that virtualisation is not suitable for high-end
workloads, such as CPU, memory and disk IO intensive workloads. As with any
design and configuration, underwhelming performance is often due to a lack of
understanding of the application requirements and, as a result, not sizing
the infrastructure appropriately.

Workshops on designing and sizing monster VMs have been among the most active
and highly attended sessions at VMworld (and vForum) over the past several
years. There is a long list of guidelines and best practices out in the wild
regarding virtualising monster VM workloads. [#vmware-perf-links1]_
[#vmware-perf-links2]_ Even public cloud providers like Amazon Web Services
offer monster VM configurations [#aws-ec2-types]_ (e.g., m4.10xlarge,
c4.8xlarge and r3.8xlarge instances).

Most monster VMs are by nature business critical applications. Therefore it
is crucial that extra care and due diligence are exercised to understand the
requirements and to design and configure these workloads to meet the desired
outcome.

.. _sec-iaas-monster-cpu:

Monster VM CPU Considerations
=============================

When sizing the CPU configuration for monster VMs, there are many factors to
consider. For the majority of workloads, the CPU requirement can be satisfied
by generic CPU hardware.
Workloads that require special hardware outside of the standard BoM are out
of scope for this discussion (e.g., workloads that require GPU acceleration,
or a specific CPU model optimised for clock frequency, which we do not cater
for in our general BoM).

For high-end workloads that are consistently high in CPU usage and/or require
a consistent, deterministic performance response from the CPU, a general
understanding of CPU scheduling, NUMA optimisation and the effect of
hyper-threading is required. Importantly, it comes down to knowing the
parameters of the underlying hardware and translating the application's
behaviour and requirements into VM sizing and configuration (within
constraints such as the operational model, supportability and
standardisation).

.. _sec-iaas-monster-overcommit:

Monster VM vCPU:pCPU Overcommit Guideline
-----------------------------------------

The simple guideline for sizing and configuring a monster VM is to **NOT**
overcommit the vCPU:pCPU ratio. When sizing monster VMs, especially those
that are likely to be CPU bound, it is essential that the underlying host has
enough free logical processors to execute CPU requests on demand without
incurring extra CPU scheduling contention.

Virtualisation is not magic; it cannot deliver more performance than the
physical hardware can provide. Many pitfalls in virtualised environments stem
from aggressively consolidating and overcommitting as much as possible to
drive up ROI at the expense of performance.

Hypervisors, regardless of product (VMware ESXi, Microsoft Hyper-V, Xen, KVM)
or type (full virtualisation, paravirtualisation), rely on resource
scheduling to coordinate and execute VMs. If a hypervisor host has more vCPU
assigned than its available pCPU (CPU overcommit), the hypervisor will have
to time-share the pCPU amongst the VMs.
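As a rough self-check of this guideline, the ratio can be computed from the host's logical processor count and the vCPUs placed on it. The helper and the host/VM figures below are an illustrative sketch, not values pulled from vCenter or any real inventory API:

```python
# A minimal sketch of a vCPU:pCPU overcommit check for a host that runs
# monster VMs. All figures are illustrative, not from any real environment.

def overcommit_ratio(vm_vcpus, host_physical_cores, smt_threads_per_core=2):
    """Return the vCPU:pCPU ratio based on logical processors."""
    logical_cpus = host_physical_cores * smt_threads_per_core
    total_vcpus = sum(vm_vcpus)
    return total_vcpus / logical_cpus

# Example host: 2 sockets x 8 cores with hyper-threading = 32 logical CPUs.
vms = [8, 8, 4, 4]  # vCPU counts of the VMs placed on the host
ratio = overcommit_ratio(vms, host_physical_cores=16)
print(f"vCPU:pCPU ratio = {ratio:.2f}")  # 24 / 32 = 0.75

# For monster VM hosts the guideline is to keep this ratio at or below 1.0,
# so every vCPU can be scheduled without time-sharing a logical processor.
assert ratio <= 1.0
```

A ratio above 1.0 means the hypervisor must time-share logical processors, which is exactly the scheduling contention the guideline warns against.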
Time-sharing introduces a wait penalty, on top of virtualisation overhead,
which means that VMs will not get a consistent and deterministic CPU response
and throughput.

.. _sec-iaas-monster-overcommit-ht:

The Value of Hyper-threaded Core
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An HT core is not a real core. It is an SMT context that uses a CPU core to
execute parallel threads, and not all applications benefit greatly from HT.
Even with software optimised to take advantage of multi-threaded processing,
HT does not provide twice the performance - the overall improvement is
typically around 15%.

In a virtualised environment, the VMware ESXi hypervisor will always try to
schedule a full CPU core to execute requests from VMs. If no full core is
freely available, ESXi will try to schedule on a partial core (i.e., an HT
core) as the next option, and if it cannot find a partial core either, it
will halt the request and place it in a wait queue. VMware published a
detailed technical paper on CPU scheduling, which takes an in-depth look at
how ESXi workloads are scheduled on the physical cores. [#esxi-cpu-sched]_

In an overcommitted host, i.e., where the total vCPU assigned to VMs is
greater than the total number of logical CPUs available, the ESXi hypervisor
will most likely schedule some requests in the SMT context, which may result
in those processes taking a performance hit.

.. note:: There are certain cases where scheduling on HT cores could be
   preferred over full logical cores. For example, applications that are
   heavy in multi-threaded scheduling *and* are cache intensive. If these
   workloads are configured with vCPUs that span physical cores, and
   therefore do not fit within a single NUMA node, the latency of accessing
   remote memory could have a greater performance impact than running on HT
   cores. In such situations, it may be worth enforcing that the VMs be
   scheduled in HT mode [#vmware-HT-sched-setting]_.
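To put the roughly 15% figure in perspective, here is a back-of-envelope estimate of host capacity in "full core equivalents". The 15% uplift is the typical figure cited above, not a measurement of any particular workload:

```python
# Back-of-envelope capacity estimate for a hyper-threaded host, using the
# commonly cited ~15% overall uplift from HT rather than 2x. Illustrative
# only; real gains vary widely by workload.

HT_UPLIFT = 0.15  # typical overall gain from hyper-threading

def effective_core_capacity(physical_cores, ht_enabled=True):
    """Estimate host capacity in 'full core equivalents'."""
    if not ht_enabled:
        return float(physical_cores)
    # 16 physical cores present 32 logical CPUs, but deliver roughly
    # 16 * 1.15 = 18.4 cores' worth of throughput, not 32.
    return physical_cores * (1 + HT_UPLIFT)

print(effective_core_capacity(16))                    # 18.4
print(effective_core_capacity(16, ht_enabled=False))  # 16.0
```

This is why counting logical CPUs as full cores when sizing CPU-bound monster VMs leads to disappointment: the extra logical processors add scheduling slots, not double the throughput.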
.. _sec-iaas-monster-numa:

The Impact of Sizing to NUMA Node
---------------------------------

In most cases, IaaS cloud providers' hardware fleets running VMware ESXi
hypervisors are configured with 2 or more physical processors. These are all
NUMA systems, where each physical processor is usually a NUMA node (the
exception being some AMD Opteron architectures, which can have multiple NUMA
nodes within a single physical socket). A processor can access local and
foreign memory: memory access within a NUMA node is local, while memory that
resides in another NUMA node is foreign memory.

VMware ESXi is hardware aware and attempts to optimise its scheduling and
placement algorithms to best fit the NUMA nodes. If a VM is sized with fewer
vCPUs than the number of cores in one physical processor, ESXi will always
attempt to place and run it within a single NUMA node. Where a VM is sized
with a number of vCPUs that cannot fit within a NUMA node, ESXi will attempt
to balance the vCPUs across the NUMA nodes - however, this relies on us right
sizing the VM's vCPUs according to the NUMA parameters.

For example, many of our older ESXi hosts use 2x 8-core processors, which
means that in order to fit within a NUMA node, the vCPUs assigned to a VM
should not exceed 8 (the exception being if we enforce scheduling settings
that prefer HT over cores). If we must size above 8 vCPUs, we should size the
VM's vCPUs as a multiple of the server's NUMA node size - 8 in this example.

When a processor has to access foreign memory (memory in another NUMA node),
there is a performance hit, as the latency of memory access is greater across
NUMA nodes. This is especially noticeable for monster VMs that are cache and
memory intensive. VMware and experts in the field have published multiple
articles and tests demonstrating how NUMA configuration can impact
performance.
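The sizing rule above (fit within a node, otherwise round to a multiple of the node size) can be sketched as follows; the 8-core node size mirrors the 2x 8-core hosts in the example, and the helper itself is purely illustrative:

```python
# Sketch of the NUMA sizing rule: keep a VM's vCPU count within one NUMA
# node, or round it up to a multiple of the node size. The 8-core node
# matches the 2x 8-core example hosts; adjust for your hardware.

import math

def recommended_vcpus(requested_vcpus, numa_node_cores=8):
    """Round a vCPU request to fit NUMA boundaries."""
    if requested_vcpus <= numa_node_cores:
        return requested_vcpus  # fits in one node; all memory access local
    # Spans nodes: size as a multiple of the node so ESXi can balance evenly.
    nodes = math.ceil(requested_vcpus / numa_node_cores)
    return nodes * numa_node_cores

print(recommended_vcpus(6))   # 6  (fits in one 8-core node)
print(recommended_vcpus(10))  # 16 (rounded up to 2 x 8)
print(recommended_vcpus(16))  # 16 (exactly two nodes)
```

An asymmetric size such as 10 vCPUs on 8-core nodes forces an uneven split (8 + 2), so rounding to 16 lets ESXi place two balanced 8-vCPU groups instead.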
See: [#vmware-perf-numa-links1]_ [#vmware-perf-numa-links2]_
[#vmware-perf-numa-links3]_

.. _sec-iaas-monster-disk:

Monster VM Disk Considerations
==============================

For workloads that require high-performance storage (e.g., high IOPS and
throughput), the back-end storage should be sized appropriately to cater for
the requirements. Most service providers, e.g., AWS and Azure, have different
storage tiers based on IOPS-per-TB building blocks to cater for different
workload requirements. From the VM design perspective, there are guidelines
for sizing and configuration that allow the VM to optimise its performance
characteristics.

.. _sec-iaas-disk-pvscsi:

Attach High Performance Virtual Disks to PVSCSI Adapter
-------------------------------------------------------

VMware provides an option to use a PVSCSI (`Paravirtual SCSI adapter `_)
to attach virtual disks. PVSCSI adapters are designed to optimise CPU
utilisation and cater for virtual disks that require very high IO throughput.
It is an optional configuration, which means that unless you specifically
configure the virtual adapter as PVSCSI, it will default to LSI SAS.

.. warning:: Keep in mind the following limitations of PVSCSI:

   * Hot add or hot remove requires a bus rescan from within the guest.
   * Disks with snapshots might not experience performance gains on
     Paravirtual SCSI adapters if memory on the ESX host is overcommitted.
   * Do not use PVSCSI on a virtual machine running Windows with spanned
     volumes; data may become inaccessible to the guest operating system.
     (Note that Windows Storage Spaces is supported, e.g., presenting
     multiple disks on PVSCSI to form a Storage Spaces volume pool.)

VMware PVSCSI can be used for VMs that absolutely require the maximum IOPS
they could potentially drive. PVSCSI is designed to support very high
throughput while remaining efficient with CPU cycle utilisation.
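This selection policy - PVSCSI for disks that need maximum IOPS, the default LSI SAS adapter for the rest - can be sketched as below. The IOPS threshold, adapter labels and disk figures are hypothetical illustrations, not VMware defaults:

```python
# Hypothetical sketch of an adapter-selection policy: put high-IOPS virtual
# disks on PVSCSI adapters (round-robin across several for IO parallelism)
# and leave low-IO disks on the default LSI SAS adapter. The threshold and
# labels below are illustrative assumptions, not VMware-defined values.

HIGH_IOPS_THRESHOLD = 10_000   # assumed cut-off for "high performance"
PVSCSI_ADAPTERS = 4            # assumed number of PVSCSI adapters on the VM

def assign_adapters(disks):
    """Map each (name, required_iops) disk to an adapter label."""
    layout, pvscsi_index = {}, 0
    for name, iops in disks:
        if iops >= HIGH_IOPS_THRESHOLD:
            # Spread hot disks across PVSCSI adapters for IO parallelism.
            layout[name] = f"pvscsi{pvscsi_index % PVSCSI_ADAPTERS}"
            pvscsi_index += 1
        else:
            layout[name] = "lsisas0"  # default adapter is fine for cold disks
    return layout

disks = [("os", 500), ("db-data", 40_000), ("db-log", 25_000), ("backup", 800)]
print(assign_adapters(disks))
# {'os': 'lsisas0', 'db-data': 'pvscsi0', 'db-log': 'pvscsi1', 'backup': 'lsisas0'}
```

Keeping the hot data and log disks on separate PVSCSI adapters is what enables the IO parallelism discussed next.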
If a VM has multiple virtual disks with high IOPS requirements, create
multiple PVSCSI adapters and spread the virtual disks across them. This
allows the VM to drive higher IOPS through IO parallelism.

.. note:: MSCS clusters are now supported on VMware PVSCSI controllers on
   vSphere 5.5U3 and later versions. [#vmware-pvscsi-mscs-support]_

.. _sec-iaas-monster-design-examples:

Monster VM Design Examples
==========================

.. _sec-iaas-monster-perf-enhancements:

Monster VM Performance Enhancements Options
===========================================

For extremely high IOPS throughput workloads, shared T1 storage may not be
able to cater for the performance requirements. Our T1 shared storage
controllers generally serve multiple customer environments with hundreds of
virtual machines across them. For VMs with extremely high IOPS, it may be
difficult to satisfy the IOPS and latency required by these monster
workloads. In many cases, large IOPS workloads could exhaust the capacity of
our shared storage and, without proper QoS, impact other workloads.

.. note:: Most public cloud providers, e.g., AWS, Azure and GCP, publish
   their storage QoS (IOPS and bandwidth limits) online to allow application
   solution designers to design and size their workloads accordingly. These
   well-defined QoS limits serve as a guideline for a deterministic outcome.

Performance Enhancements: AFA (All-Flash Arrays)
------------------------------------------------

One approach is to deploy AFAs (all-flash arrays) to cater for these sorts of
workloads. Flash storage offers higher IOPS capability at much lower latency
compared to traditional spinning disk platforms. In many cases, deploying an
AFA could satisfy the majority of high IOPS workload requirements. However,
there are scenarios where an AFA may not be the only solution required, for
the following reasons:
#. The AFA will be shared across different workloads (as it is economically
   inefficient to dedicate an AFA to any single workload). By sharing an AFA
   across workloads, we are sharing resources and performance, and could
   potentially bottleneck at the shared storage if the VMs drive large IOPS
   and throughput.
#. When the above condition is coupled with latency sensitive and demanding
   workloads, a shared AFA alone would not be enough to satisfy the
   throughput or the latency requirements.

Performance Enhancements: Host-based Caching
--------------------------------------------

Enter host-based caching solutions. Host-based cache solutions provide an
extra layer of buffer between the VMs and the storage network. They cache the
working dataset in high-performance flash devices located on the physical
hosts, closest to the VMs, on a very low latency IO path with extremely high
throughput.

A single modern PCIe flash (or NVMe) device in a server can drive `hundreds
of thousands of IOPS with throughput of GB/s `_ (that is GIGABYTES per
second, and the latest generation of NVMe can drive even `higher IOPS and
throughput rates `_). As NVMe is essentially flash on PCIe, IOs hitting the
cached dataset on NVMe will have much lower latency than IOs leaving the host
and hitting the storage network.

These performance characteristics of NVMe allow host-based caching to provide
extremely high performance IO for the hot working dataset while, at the same
time, buffering those IOs and saving them from hitting the storage network.
The end result is that we are able to scale out and run more monster
workloads and extend the life of our shared storage platform, whilst giving
monster workloads the IOPS and latency they need.

There are many host-based caching solutions available in the market. Each
implements its caching mechanism slightly differently, but they are all
designed to achieve the benefits and outcomes above.

#. `PernixData `_
#. `SanDisk FlashSoft `_
#. `VMware vSphere built-in vFRC `_

.. rubric:: Footnotes

.. [#vmware-perf-links1] `VMware VROOM! Blog (from VMware's performance
   team) `_
.. [#vmware-perf-links2] `When to Overcommit vCPU:pCPU for Monster VMs `_
.. [#aws-ec2-types] `AWS EC2 Instance Types `_
.. [#esxi-cpu-sched] `VMware vSphere CPU scheduler whitepaper `_
.. [#vmware-HT-sched-setting] `Configure VM to use hyper-threading with NUMA
   in ESXi `_
.. [#vmware-perf-numa-links1] `Does corespersocket Affect Performance? `_
.. [#vmware-perf-numa-links2] `SAP on VMware Sizing & Design Example `_
.. [#vmware-perf-numa-links3] `The Importance of VM Size to NUMA Node Size `_
.. [#vmware-pvscsi-mscs-support] `Microsoft Clustering on VMware vSphere `_