How to tolerate the tail: Addressing long-tail latency in data centres

Rajkumar Vijayarangakannan, Lead of Network Design and DevOps, ManageEngine

Demand for data centres keeps growing, but it is their responsiveness that drives revenue. Rajkumar Vijayarangakannan, Lead of Network Design and DevOps, ManageEngine, discusses how to minimise long-tail latency and improve data centre operations.

The widespread adoption of data centres has prompted numerous businesses to deliver advanced, highly interactive, real-time services via edge networks distributed worldwide. The proliferation of users and mobile apps, together with the 5G revolution, has contributed massively to the growing scale of, and demand for, these services.

Demand aside, it is the speed and responsiveness of these services that drive revenue and reliability, because the services depend on near-instant response times. The need for quick, reliable delivery has pushed businesses towards distributed platforms and microservices architectures.

To enhance responsiveness, these complex architectures slice end-user requests into several sub-operations and execute them in parallel across a large number of shared, multi-tenant physical machines, either as virtual machines (VMs) or as containers. As a result, response times become less predictable, and the larger the scale of operations in a data centre, the greater the impact of latency variability.

The long-tail latency challenge, the most prevalent threat to overall data centre performance, manifests as an elongated spectrum of response times: most requests complete quickly, but a small fraction take far longer. The familiar adage ‘the tail wags the dog’ is apt here. In modern data centres, subtle or infrequent events, rare as they are, end up dominating the network’s overall performance.

In such complex environments, every subprocess must respond with consistently low latency before the final response can be delivered to the client; otherwise, the overall operation is painfully slow. With thousands of microservices executing in parallel, the slowest response determines the overall response time of the user-facing, real-time web service.
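The arithmetic behind this effect is stark. As a brief illustration, with hypothetical figures rather than measured ones: if each sub-operation is slow only one per cent of the time, a request fanned out to 100 sub-operations still has roughly a 63% chance of waiting on at least one slow response.

```python
# Probability that a fanned-out request hits at least one slow sub-operation.
# The figures below are illustrative, not measured values.
def p_slow(p_fast_per_service: float, fan_out: int) -> float:
    """Chance that at least one of `fan_out` parallel calls is slow."""
    return 1.0 - p_fast_per_service ** fan_out

# Each service meets its latency target 99% of the time, yet at a fan-out
# of 100 the request is dragged into the tail about 63% of the time.
print(f"{p_slow(0.99, 100):.0%}")
```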

What causes long-tail latencies?

The causes of long-tail latency lie not just in the availability of resources but also in interactions at the data centre component level. The various factors causing long-tail latencies include:

  1. Resource contention: This can arise between concurrent workloads sharing the same environment, within a single workload, from lock contention during synchronisation, or from strict resource-ordering strategies. It is the primary contributor to latency variability.
  2. Concurrent activities: Interference from unrelated background activities, such as log compaction or garbage collection, can escalate latency, as can clashes between co-located applications competing for resources in a shared environment.
  3. Queuing delays: Sub-processes stalled in queues amplify the variations in latency.
  4. Other outliers: Performance bugs and sub-optimal software or algorithms can also add latency.

Traditional debugging methods often struggle to address these sources because of the complexity of the data centre environment and the tendency of such problems to manifest only intermittently.

How ManageEngine is addressing long-tail latencies

ManageEngine practices the following tail-tolerant techniques to respond to latency variable workloads quickly and reliably:

Global anycasting

ManageEngine CloudDNS, a robust web-based DNS resolution solution, scales up and distributes ManageEngine’s services across the internet with the aim of getting closer to end-users. This setup comprises multiple anycast sites, data centres at global vantage points on each continent, to deliver services with low latency to end-users wherever they are.

This way, the workload can be shifted to a closer, accessible data centre that is equipped, like the origin, with servers capable of processing and promptly answering incoming requests. This prevents the origin server from being overwhelmed while mitigating potential service disruptions for clients requesting content from it.
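To make the effect concrete, here is a toy Python model of anycast behaviour, with entirely invented sites and round-trip times: every site advertises the same service address, and each client’s traffic lands at the site with the shortest network path. In reality this selection happens in BGP routing, not in application code.

```python
# Toy model of anycast: all sites share one service address, and the network
# delivers each client to the lowest-RTT site. Sites and RTTs are invented.
SITES = {
    "us-east":    {"new_york": 8,   "london": 75,  "chennai": 210},
    "eu-west":    {"new_york": 80,  "london": 6,   "chennai": 130},
    "apac-south": {"new_york": 215, "london": 125, "chennai": 5},
}

def anycast_site_for(client: str) -> str:
    """Return the site a client's packets would land on: the shortest path."""
    return min(SITES, key=lambda site: SITES[site][client])

for client in ("new_york", "london", "chennai"):
    print(client, "->", anycast_site_for(client))
```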

Global load balancing

CloudDNS optimises data centre traffic via global load balancing techniques and smart traffic steering filters. In distributed infrastructures like ManageEngine’s, the global load balancer deftly handles growing loads, orchestrating them across dispersed resources. This involves configuring routing protocols for different geographic zones, ensuring optimised global resource routing.

To minimise latency and answer queries over paths suited to the end-user’s network, the system employs routing based on IP addresses or Autonomous System Numbers (ASNs), which identify specific network groups. This guides users to optimal resources and keeps performance at its peak.
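As a rough illustration of that kind of steering, the sketch below maps client IP prefixes and ASNs to preferred endpoints. The prefixes, ASNs and endpoint names are invented for the example; a production resolver would derive them from full routing data rather than a hand-written table.

```python
import ipaddress

# Hypothetical steering table: client prefixes and ASNs mapped to the
# endpoint judged closest to that network. All values are invented.
PREFIX_ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"):  "apac-south.example.net",
    ipaddress.ip_network("198.51.100.0/24"): "eu-west.example.net",
}
ASN_ROUTES = {64500: "us-east.example.net"}   # private-use ASN for the example
DEFAULT = "us-east.example.net"

def steer(client_ip: str, client_asn: int | None = None) -> str:
    """Pick an endpoint: longest-prefix match first, then ASN, then default."""
    addr = ipaddress.ip_address(client_ip)
    matches = [net for net in PREFIX_ROUTES if addr in net]
    if matches:
        return PREFIX_ROUTES[max(matches, key=lambda net: net.prefixlen)]
    if client_asn in ASN_ROUTES:
        return ASN_ROUTES[client_asn]
    return DEFAULT

print(steer("203.0.113.7"))        # -> apac-south.example.net
print(steer("192.0.2.1", 64500))   # -> us-east.example.net
```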

With a comprehensive tool set for load balancing and routing optimisation, CloudDNS keeps latency to a minimum and enhances the user experience by swiftly serving tailored resources to users accessing ManageEngine’s services. This commitment to performance optimisation gives ManageEngine a competitive edge in delivering exceptional online experiences worldwide.

Integrating health monitoring checks

The integrated health monitoring system of CloudDNS runs proactive checks at frequent intervals over various protocols, including HTTP and HTTPS, TCP, DNS and ICMP (Ping). It watches the network from multiple vital vantage points for failures that warrant failover.

Upon detecting an unhealthy resource, the health monitor promptly updates the DNS failover configuration to point at robust, healthy replicas. This keeps the extra latency seen by end-users accessing ManageEngine services anywhere in the world negligible, if it is perceptible at all.
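A minimal sketch of that loop, using only the Python standard library and placeholder hostnames: probe each replica over HTTPS and TCP, and keep only the replicas that pass in the DNS answer pool. CloudDNS’s actual checker also covers DNS and ICMP probes and runs from multiple vantage points, as described above.

```python
import socket
import urllib.request

# Hypothetical replica pool; hostnames are placeholders for illustration.
REPLICAS = ["a.example.net", "b.example.net"]

def https_ok(host: str, timeout: float = 3.0) -> bool:
    """HTTPS probe: healthy if the endpoint answers below the 4xx range."""
    try:
        with urllib.request.urlopen(f"https://{host}/", timeout=timeout) as resp:
            return resp.status < 400
    except OSError:
        return False

def tcp_ok(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """TCP probe: healthy if the port accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_pool() -> list[str]:
    """Replicas that should remain in the DNS answer set this interval."""
    return [h for h in REPLICAS if https_ok(h) and tcp_ok(h)]

if __name__ == "__main__":
    print(healthy_pool())
```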

Dedicated VMs for latency-sensitive workloads

Resource contention arises easily when co-located applications compete pathologically for resources in shared environments that serve multiple applications. Pending transactions then take longer than expected to complete, which is ultimately experienced as excessive tail latency. In particular, mixing or co-scheduling CPU-intensive workloads with latency-sensitive ones creates noisy-neighbour effects that feed directly into the long tail. ManageEngine therefore deploys dedicated VMs to isolate and execute latency-sensitive workloads, as sketched below.
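The same isolation principle can be illustrated at the process level. This Linux-specific sketch, with arbitrary example core numbers, pins latency-sensitive work to CPUs that batch jobs are excluded from, so the two classes never contend for the same cores; dedicated VMs extend the idea to whole machines.

```python
import os

# Illustrative, Linux-only sketch of workload isolation: latency-sensitive
# work gets cores 0-3 to itself, batch work is confined to cores 4-7.
# Core numbers are arbitrary examples; adjust to the actual machine.
LATENCY_CORES = {0, 1, 2, 3}
BATCH_CORES = {4, 5, 6, 7}

def pin(pid: int, cores: set[int]) -> None:
    """Restrict a process to the given CPU set (Linux sched_setaffinity)."""
    os.sched_setaffinity(pid, cores)

# This process now never shares CPUs with batch jobs pinned to BATCH_CORES.
pin(os.getpid(), LATENCY_CORES)
```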

Tolerating tails over taming tails

It is possible to tame tail latencies by controlling the outliers and making workloads more predictable. This can be achieved by instrumenting data centre resources with embedded hardware and supplementary software that collect and distribute debugging information in real time, alongside high-granularity performance monitoring and system-wide insights.

Though such integrated solutions can tame tail latencies by proactively preventing workload clashes without altering the core infrastructure, it is not feasible to sweep away every source of latency from an actively functioning data centre, like the ones hosting ManageEngine services.
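One widely cited tail-tolerant pattern, not specific to ManageEngine, is the hedged request: if a primary replica has not answered within, say, its 95th-percentile latency, the same request is fired at a second replica and whichever answer arrives first wins. A minimal sketch, with `fetch` standing in for a hypothetical replica call:

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged(fetch, replicas, hedge_delay_s=0.05):
    """Return the first response; duplicate the request if the primary is slow."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fetch, replicas[0])]
        done, _ = wait(futures, timeout=hedge_delay_s, return_when=FIRST_COMPLETED)
        if not done and len(replicas) > 1:      # primary is in its latency tail
            futures.append(pool.submit(fetch, replicas[1]))
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Note: leaving the `with` block waits for the losing call to finish;
        # a production version would cancel it instead.
        return next(iter(done)).result()

def fetch(replica):
    """Stand-in for a real replica call; occasionally hits a slow tail."""
    time.sleep(random.choice([0.01, 0.01, 0.01, 0.2]))
    return f"response from {replica}"

print(hedged(fetch, ["replica-a", "replica-b"]))
```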

ManageEngine’s approach therefore centres on practising tail-tolerant techniques that strengthen its data centres against a wide range of latency challenges, all while delivering reliable responses for end-users.
