Overview of Computer System Latency¶
The advancement of computing power over the past decades has allowed the IT industry to tackle problems that were previously difficult to address, and has opened up opportunities for solutions once thought impossible.
As the underlying technology becomes faster and more sophisticated, solutions that run on modern computing hardware tend to focus on the actual functionality, relying on low-level built-in mechanisms (i.e., compilers, program libraries, frameworks, etc.) to abstract the hardware layer from the software layer. However, it remains important to understand the low-level principles, especially when designing good scalable solutions.
Latency is one of the critical factors to consider when building a scalable infrastructure. Computer systems with latency falling outside the expected threshold can result in a negative user experience or, more seriously, significant financial losses.
Making Sense of Computer System Latency¶
Human perception of time is usually measured in seconds, hours, days and years. However, modern computer subsystems respond to events in nanoseconds, microseconds and milliseconds. Therefore, when thinking about how latency could impact the infrastructure, it is sometimes easier to conceptualise latency on a relative scale rather than in absolute values to put things into perspective.
A high-level view of a computer system is depicted in the following diagram (motherboard chipset and PCI peripheral details excluded).
The latency values shown above are approximations based on the Intel Haswell architecture. The objective here is not the actual numbers, but an overview of the concept.
The above diagram represents a dual-processor computer system. L1 and L2 caches are unique to each physical core; L3 caches are shared across all physical cores within a single physical processor chip (i.e., in dual-socket servers, there will be two L3 caches). DRAM is divided into two domains, each addressed locally by the processor socket it belongs to (NUMA). However, Socket 1 is able to address the remote DRAM belonging to Socket 2, and vice versa, via the SMP interconnect (QPI in Intel's world), albeit with an extra latency penalty.
Borrowing the idea from the book Systems Performance: Enterprise and the Cloud by Brendan Gregg, [1] with a few adjustments based on the Haswell CPU, the system latencies are scaled to unit measurements more relatable to humans in the following table,
| Event | Latency | Scaled |
|---|---|---|
| 1 CPU cycle | 0.4 ns | 1 s |
| L1 cache access | 1.6 ns | 4 s |
| L2 cache access | 4.8 ns | 12 s |
| L3 cache access | 14.4 ns | 36 s |
| Local DRAM access | 70 ns | 3 min |
| Remote DRAM access | 105 ns | 4.5 min |
| Flash (NVMe) I/O | 50-150 us | 1.5-4.5 days |
| Rotational disk I/O | 2-15 ms | 2-15 months |
| WAN: Melbourne to Sydney | 12 ms | 1 year |
| WAN: Perth to Sydney | 24 ms | 2 years |
| TCP Retransmission Timeout | 1-3 s | 79-238 years |
Note
The latency from the CPU through the caches to the memory subsystem varies depending on CPU architecture, revision and clock speed. The above is based loosely on a 2.4GHz Intel Haswell CPU, where 1 CPU cycle takes 0.417ns (a 1GHz CPU completes a cycle in 1ns). The numbers are referenced, with adjustments, from the 7-Zip LZMA Benchmark website. [2] (Note that the accuracy of the actual numbers is not the objective here; the point is to understand these latencies relative to each subsystem in a computer system.)
In modern computers, billions of events are triggered, and a penalty of a few nanoseconds per instruction can have significant flow-on effects on the end result. Scaling the time 'seen' by computer systems to unit measurements more familiar to humans hopefully makes it easier to appreciate that every microsecond (or even nanosecond) matters in modern computer systems - especially so in large-scale enterprise environments.
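The scaling used in the table can be reproduced with a few lines of code. This is just a sketch using the approximate Haswell-era figures above, with one CPU cycle mapped to one second:

```python
# Sketch: reproduce the "scaled to human time" table, assuming a 2.4 GHz
# CPU, so one cycle (~0.417 ns) maps to one second of human time.
CYCLE_NS = 1.0 / 2.4  # ~0.417 ns per cycle at 2.4 GHz

# Approximate event latencies in nanoseconds (Haswell-era figures).
events = {
    "1 CPU cycle": CYCLE_NS,
    "L1 cache access": 1.6,
    "L2 cache access": 4.8,
    "L3 cache access": 14.4,
    "Local DRAM access": 70.0,
    "Remote DRAM access": 105.0,
    "Flash (NVMe) I/O": 100_000.0,        # ~100 us midpoint
    "Rotational disk I/O": 10_000_000.0,  # ~10 ms midpoint
}

def scaled_human_time(latency_ns: float) -> float:
    """Scale a latency so that one CPU cycle corresponds to one second."""
    return latency_ns / CYCLE_NS

for name, ns in events.items():
    print(f"{name:22s} {ns:>14,.1f} ns  ->  {scaled_human_time(ns):>14,.1f} s")
```

Local DRAM access, for instance, scales to about 168 "seconds" - roughly the 3 minutes shown in the table.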
Why Latency Matters¶
Many large businesses invest in efforts to analyse latency in their systems and understand its financial impact in order to maximise efficiency. For example, Amazon found every 100ms of latency cost them 1% in sales, [4] Google stated that an extra 500ms in search page generation time dropped traffic by 20%, [5] and in large global high-frequency trading firms, an extra 1ms of latency can result in $100m per annum of lost opportunity. [3]
There are also other hidden costs to latency that are non-apparent and hard to quantify. For example, smaller IT service providers normally have hundreds of different customers running on consolidated infrastructure in order to minimise cost and maximise profit. It is often easy to dismiss a few tens of milliseconds (or even a few extra milliseconds) in the underlying computer systems as acceptable, as there may not be any immediate financial impact associated with them. However, when the end users (i.e., the customers) regularly experience degraded response times due to applications behaving abnormally, that translates into a bad reputation for the business. This could ultimately result in the loss of contract renewals in the long run.
Impact of Latency on IaaS¶
In an IaaS environment, latency is one of the most important factors to consider. However, it is also one of the most overlooked parameters, simply because the symptoms are often not easily quantifiable (or even empirically provable) by looking at the infrastructure layer alone.
There are many parts of a computer system in which latency can occur, and they are usually inter-related: CPU latency could reduce IO throughput, IO latency could increase CPU wait and network packet processing latency, and so on. Therefore, it is important to look at the whole infrastructure holistically. Any fluctuation of latency in one subsystem has flow-on effects on the others, and the snowballing effect ultimately results in inconsistent application response.
Impact of CPU Overcommit¶
As depicted in the Computer Internal diagram, a typical dual-socket computer system consists of two physical processor chips. Each physical processor chip usually has multiple processor cores within it. Each processor core has separate, dedicated L1 and L2 caches, but the L3 cache is shared amongst the processor cores within a physical processor chip. Each level further down the cache hierarchy adds extra latency.
In an Intel Haswell Xeon 2600 series processor (the E5-2697v3 specifically), the cache sizes are as follows, [8]
| Cache Level | Size |
|---|---|
| L1 | 64 KB (per core) |
| L2 | 256 KB (per core) |
| L3 | 35 MB (shared) |
CPU caches hold both an instruction cache, to speed up executable instruction fetches, and a data cache, to speed up data fetches and stores. The idea is to keep as much frequently used data and as many frequently used instructions in the fastest cache levels as possible, as every cache miss wastes CPU cycles and adds latency.
Overcommitting processor cores at a 2:1 ratio, i.e., 2 vCPUs per physical core, means that the L1 and L2 caches are now shared by two different vCPUs, likely with very different contexts. In the worst-case scenario, this effectively halves the L1 and L2 cache sizes and doubles the chances of a cache miss. An overcommit ratio of 4:1 effectively reduces the cache sizes to a quarter of the actual physical amount and quadruples the chances of a cache miss.
Relating this back to the latency table, a cache miss is relatively expensive. In the worst-case scenario, where an instruction misses on all three cache levels and the CPU has to fetch from DRAM, the penalty is roughly a 44x increase in latency. Overcommitting at a 2:1 ratio could potentially double the chances of this happening, and a 4:1 overcommit ratio could quadruple them.
Note
In practice, chances are that multiple VMs share the same set of common instructions and data, so the cache miss ratio should be lower than the worst-case scenario. However, for cache-intensive workloads, or when different operating systems (or applications) share the same core, cache thrashing is highly likely to occur.
If the CPU cannot find what it is looking for in the L1 cache and has to fetch data from DRAM, the extra wait time is about 68ns. To humans, this number seems very fast.
However, to put things into perspective, consider the equivalent difference scaled into human time measurements. A normal conversation between two people is similar to an L1 cache hit: a 4-second response time, which is perfectly acceptable. A cache miss requiring the CPU to dig into DRAM is equivalent to a conversation where the response is delayed by 3 minutes - i.e., saying hello to a friend and not getting a greeting back until 3 minutes later.
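The effect of rising miss rates can be illustrated with a simplified average memory access time (AMAT) model. The miss rates below are illustrative assumptions, not measurements; the latencies are the approximate Haswell figures from the table above:

```python
# Sketch: simplified average memory access time (AMAT) model.
# Latencies (ns) are the approximate figures from the table above;
# the miss rates are illustrative assumptions only.
L1_NS, L2_NS, L3_NS, DRAM_NS = 1.6, 4.8, 14.4, 70.0

def amat_ns(l1_miss: float, l2_miss: float, l3_miss: float) -> float:
    """Average access time: each level's latency is paid only when the
    level above misses (a simplified hierarchical model)."""
    return (L1_NS
            + l1_miss * (L2_NS
            + l2_miss * (L3_NS
            + l3_miss * DRAM_NS)))

# Dedicated core with modest (assumed) miss rates.
dedicated = amat_ns(0.05, 0.20, 0.20)
# 2:1 overcommit, worst case: cache sharing roughly doubles the miss rates.
shared = amat_ns(0.10, 0.40, 0.40)
print(f"dedicated: {dedicated:.2f} ns, shared: {shared:.2f} ns")
```

Under these assumed figures, doubling the miss rates pushes the average access time from roughly 2.1ns to 3.8ns - a large relative penalty before scheduling overheads are even considered.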
Compounding Effects of CPU Sharing¶
In reality, overcommitting CPU not only affects cache effectiveness, it also introduces scheduling difficulty and increases context switching, both of which carry significant latency from a performance point of view. An overview of CPU scheduling in virtualised environments is given in CPU Scheduling in Virtual Environment.
These effects usually do not exist independently; they compound, and can introduce instability to the overall system if not managed properly.
Impact of NUMA Sizing¶
NUMA is used in most symmetric multiprocessing (SMP) architectures. Inside an Intel dual-socket SMP server, each physical processor chip forms a NUMA node with its own set of DRAM that it can address directly (locally). Between the two NUMA nodes there are usually multiple SMP interconnect links (QPI links in Intel's terminology), which allow each physical processor chip to access the DRAM in the adjacent NUMA node (remote memory access).
As the scaled latency table above shows, accessing remote memory in the adjacent NUMA node incurs a penalty - in this example, 50% extra latency (though in real life it is often lower than that). Therefore, when designing a high-performance solution or devising scalable building blocks, it is very important to size VMs appropriately so that the number of vCPUs and the RAM size fit within a single NUMA node where possible. For VMs that require more vCPUs than a single NUMA node can provide, ensure the VM is NUMA-aware and size the vCPUs and RAM so that they align and balance across the NUMA nodes.
For cache- or memory-intensive applications, crossing the NUMA node boundary has a significant impact on overall throughput and latency. Fitting workloads and sizing VMs to the NUMA node size matters a lot when it comes to performance. [9] [10] Google has published a detailed analysis of how the placement of workloads across NUMA nodes significantly affects application response and throughput. [14]
How to Size for NUMA¶
A general approach to sizing VMs to align with NUMA nodes can be found in the Monster VM Design article. There are many more in-depth articles published on the web. [11] [12] [13]
In a nutshell, when sizing VMs with a large number of vCPUs, or devising building blocks for VM sizes, align them within the boundaries of the physical NUMA nodes.
For example, consider an SMP server with two Intel Xeon E5-2697v3 processor chips and 256GB RAM. Each Xeon processor chip has 14 processor cores and addresses 128GB of local DRAM - this is the physical NUMA node size.
- To avoid crossing the NUMA boundary, each VM running on this server should have no more than 14 vCPUs and no more than 128GB RAM.
- If a VM requires more than 14 vCPUs, ensure that the VM's vCPU and RAM placement is balanced across the NUMA nodes.
- It is an SMP system: always design with symmetry and balance in mind.
- The CPU model used in a server dictates the NUMA sizing, so when changing the CPU model used in servers, the VM service catalog will need to be revisited.
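The sizing rules above can be expressed as a trivial check. The helper below is a hypothetical illustration using the E5-2697v3 example figures, not a feature of any hypervisor:

```python
# Sketch: check whether a VM size fits entirely within one physical NUMA
# node. Node figures follow the E5-2697v3 example above; the helper is a
# hypothetical illustration only.
NODE_CORES = 14    # physical cores per socket (E5-2697v3)
NODE_RAM_GB = 128  # local DRAM per socket in the example server

def fits_in_numa_node(vcpus: int, ram_gb: int) -> bool:
    """True if the VM can be placed entirely within one NUMA node."""
    return vcpus <= NODE_CORES and ram_gb <= NODE_RAM_GB

print(fits_in_numa_node(12, 96))   # fits within one node
print(fits_in_numa_node(16, 96))   # must span NUMA nodes: balance vCPU/RAM
```

A service catalog built from sizes that pass this check avoids remote memory access by construction; anything larger must be NUMA-aware and balanced across nodes.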
Impact of Storage Latency¶
Storage used in IaaS is generally built on magnetic media (spinning disk) and/or flash media. Magnetic media provide large-capacity, low-cost storage. Flash media are faster and have much lower latency, but at a higher cost.
Access times for a single magnetic device are often measured in milliseconds, with IOPS measured in the hundreds, whereas a single flash device can achieve sub-millisecond latency with IOPS measured in the tens of thousands (or even hundreds of thousands with newer NVMe devices). [15]
Compared to DRAM, flash media are generally around 1000x slower in latency, and magnetic media slower still - hundreds of thousands of times the latency of DRAM.
Storage Latency Expectation¶
A sub-millisecond storage latency requirement is becoming the norm for many workloads. Many applications demand high-performance, near real-time analysis of datasets, and many scale-out architectures also depend on low-latency storage access to meet their performance objectives.
Even general end users are now accustomed to their laptops or tablets booting up in seconds rather than minutes. This is only possible with flash media. Even a minimal Microsoft Windows OS can consume thousands of IOPS during the boot process. With flash media, these can easily be served with sub-millisecond responses (and therefore a very fast boot). Magnetic media take tens of milliseconds per IO, which significantly delays the boot process.
Note
In a test lab, a Microsoft Windows 2012 R2 VM and an Ubuntu Linux 15.10 VM were each installed with minimal packages to test OS boot IO requirements. With the statistics sampling rate set to a 2-second interval (to capture IOPS spikes), it was observed that both VMs consumed up to 3000 IOPS in order to boot in under 15 seconds. When the IOPS rate was limited to 1000 (@ 32 KB/IO block) with latency in the tens of milliseconds, the boot time extended to more than 2 minutes.
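The lab observation can be sanity-checked with a crude throughput-only model. The 45,000 total-IO figure below is an assumption chosen to match a ~15-second boot at ~3000 IOPS; note that this simple model ignores the added per-IO latency, which is why the lab saw boot times stretch past 2 minutes rather than the ~45 seconds the model alone predicts:

```python
# Sketch: back-of-envelope boot time from total boot I/Os and sustained
# IOPS. TOTAL_BOOT_IOS is an assumed figure, not a measured one.
TOTAL_BOOT_IOS = 45_000

def boot_seconds(total_ios: int, iops: float) -> float:
    """Estimated boot time if storage sustains `iops` operations/second,
    ignoring per-IO latency and queueing effects."""
    return total_ios / iops

print(boot_seconds(TOTAL_BOOT_IOS, 3000))  # 15.0 s - flash-backed boot
print(boot_seconds(TOTAL_BOOT_IOS, 1000))  # 45.0 s - before latency effects
```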
Highly Consistent Scale-out Storage¶
VMware VSAN (or any scale-out, shared-nothing architecture) is a good example where flash/SSD is an absolute requirement to deliver acceptable performance. Due to its distributed, shared-nothing nature, every write received by a node has to be replicated to another node (or more, depending on the data protection policy) in the cluster to provide data resiliency and high availability. Writes have to be committed to flash before an acknowledgement can be sent back. The only way to do this at scale is to use flash media as a non-volatile cache layer for both reads and writes - magnetic hard disks are too slow for such an architecture to work well.
In large-scale distributed environments (think Lustre in HPC, multi-rack-scale Ceph and GlusterFS in cloud IaaS storage, and Hadoop in the big data world), the use of RDMA cards, NVDIMM, NVMe and ultra-low-latency top-of-rack switches to move storage data is commonplace, cutting latency down to the microseconds.
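The replicated write path described above can be sketched with a simple model. The function and all latency figures below are illustrative assumptions (not VSAN's actual numbers): a write is acknowledged only once both the local commit and the slowest replica's network round trip plus remote commit have completed.

```python
# Sketch: acknowledgement latency of a replicated write in a
# shared-nothing cluster. All figures are illustrative assumptions
# in microseconds; this is not any vendor's actual write path.
def replicated_write_us(local_commit: float, network_rtt: float,
                        remote_commit: float) -> float:
    """Ack time: local and remote commits proceed in parallel, but the
    write is acknowledged only after the slower path completes."""
    return max(local_commit, network_rtt + remote_commit)

# Flash commit (~100 us) keeps the ack in the microsecond range.
flash = replicated_write_us(local_commit=100, network_rtt=50, remote_commit=100)
# Magnetic commit (~8 ms) pushes every replicated write into milliseconds.
disk = replicated_write_us(local_commit=8000, network_rtt=50, remote_commit=8000)
print(flash, disk)  # 150 us vs 8050 us
```

Under these assumed figures, the media commit time dominates the acknowledgement latency, which is why flash is a hard requirement for such architectures.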
Change of Expectations¶
Apart from application requirements, user expectations have also evolved over the years (which in turn further drives the demand for near real-time application response). Consumer devices such as tablets and smartphones, and the speed at which people interact on the Internet, have changed people's expectations. When we turn on tablets or phones, we expect them to be usable straight away. People are used to modern laptops booting in seconds rather than minutes. These expectations have transformed the workplace as well: businesses are fighting to shed milliseconds of latency in order to provide a better user experience and win sales.
The way we consume data and information has changed. Single-digit-millisecond (or sub-millisecond) primary storage latency is becoming the norm, and expectations have evolved to match.
Impact of Network Latency¶
- Impact of datacenter network latency in scale-out architectures - e.g., how important network latency is to VSAN.
- Impact of WAN network latency - why spanning an L2 network across long distances is a bad idea for many modern applications.
TL;DR¶
- Latency matters and its impact can be significant. In IaaS, extra care must be taken to design production environments with latency in mind, to ensure the outcome meets the business requirements.
- Production applications that require consistent response times do not work well on overcommitted systems.
- A 4:1 CPU overcommit makes little sense for busy production workloads (especially for VMs with more than 2 vCPUs). For a 4:1 overcommit ratio to work well, no more than 2 vCPUs per VM should be configured (preferably 1 vCPU per VM).
- IO-intensive applications are highly susceptible to latency. Every millisecond counts.
- Scale-out architectures (i.e., VSAN, or shared-nothing database or storage systems) require a low-latency network and sub-millisecond IO response to work well.
- CPU overcommit is not recommended for production workloads. For example, Microsoft Azure only offers the A-series (specifically the A0 instance) as over-subscribed instances, [6] similar to AWS T2 instances; [7] neither is recommended for production workloads that require consistent response.
- For the lowest possible IO latency in an IaaS environment, NVMe as a local cache is the most viable solution.
Footnotes
| [1] | Idea borrowed from Table 2.2 of the book, Systems Performance: Enterprise and the Cloud, by Brendan Gregg |
| [2] | CPU latency numbers taken from 7-Zip LZMA Benchmark website |
| [3] | Wall Street’s Quest To Process Data At The Speed Of Light |
| [4] | Amazon found every 100ms of latency cost them 1% in sales |
| [5] | Google found latency affects traffic volume |
| [6] | Sizes of VMs in Azure. A0 size is the only offering that is over-subscribed |
| [7] | Amazon EC2 Instance Types |
| [8] | Haswell (microarchitecture) |
| [9] | Shared Memory Clusters: Of NUMA And Cache Latencies |
| [10] | Process and memory affinity: why do you care? |
| [11] | The Importance of VM Size to NUMA Node Size |
| [12] | SAP on VMware Sizing & Design Example |
| [13] | Does corespersocket Affect Performance? |
| [14] | Optimizing Google’s Warehouse Scale Computers: The NUMA Experience |
| [15] | TheSSDReview Intel P3608 NVMe Benchmark |