Burst-Aware Weighted Fair Queueing for Serverless Inference: Mitigating Noisy Neighbor Effects in Multi-Tenant Systems

Authors

Keywords:

Serverless, cloud computing technology, distributed systems, fairness, multi-tenant systems

Abstract

Multi-tenant serverless inference often devolves into noisy-neighbor scenarios in which a single tenant’s bursty LLM batch floods the fleet and pushes interactive calls beyond their latency budgets. We propose Burst-Aware Weighted Fair Queueing (BWFQ), a scheduler that requires only two counters per tenant (tokens earned and tokens spent) and a constant-time heap pop to pick the next invocation. BWFQ uses the classic token-bucket shaper: tokens accumulate at a tenant-specific base rate and are deducted on each dispatch. When a tenant exhausts its tokens, its requests are queued, giving quieter tenants a chance to run. Unlike techniques such as Dominant Resource Fairness, BWFQ requires neither per-invocation resource profiling nor multi-dimensional share accounting, making it easy to integrate into existing Lambda-style dispatchers. To evaluate the algorithm, we built a prototype on AWS Lambda and observed that BWFQ reduces the P99 latency gap between interactive and batch tenants from 8.5 s to 2.1 s, a 4.0x improvement, while preserving 94% of the throughput achieved by First-Come-First-Served scheduling. The algorithm adds only 35 µs of scheduling overhead per decision and fits in approximately 150 lines of Go code. These results demonstrate that simple token-bucket fair queueing is a practical, immediately usable step toward fairness in production serverless inference.
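
The mechanism described in the abstract can be illustrated with a minimal Go sketch. This is not the authors' implementation: the type and function names (Tenant, Scheduler, Enqueue, Next), the Burst cap, the fixed cost of one token per dispatch, and the heap ordering key (tokens spent divided by base rate, so the least-served tenant comes first) are all assumptions made for illustration. Time is passed in explicitly as seconds so the behavior is deterministic.

```go
package main

import "container/heap"

// Tenant carries the two per-tenant counters named in the abstract.
// Burst (a cap on unspent tokens) is an assumed parameter.
type Tenant struct {
	Name    string
	Rate    float64  // tokens earned per second (base rate, must be > 0)
	Burst   float64  // maximum unspent balance (assumption)
	Earned  float64  // tokens earned so far
	Spent   float64  // tokens spent so far
	last    float64  // time of the last refill, in seconds
	pending []string // queued invocation IDs, FIFO
}

// refill accrues tokens for the elapsed time and caps the balance at Burst.
func (t *Tenant) refill(now float64) {
	t.Earned += (now - t.last) * t.Rate
	if t.Earned-t.Spent > t.Burst {
		t.Earned = t.Spent + t.Burst
	}
	t.last = now
}

// tenantHeap is a min-heap of tenants with pending work, ordered by
// rate-normalized tokens spent (an assumed ordering key).
type tenantHeap []*Tenant

func (h tenantHeap) Len() int { return len(h) }
func (h tenantHeap) Less(i, j int) bool {
	return h[i].Spent/h[i].Rate < h[j].Spent/h[j].Rate
}
func (h tenantHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *tenantHeap) Push(x any)   { *h = append(*h, x.(*Tenant)) }
func (h *tenantHeap) Pop() any {
	old := *h
	n := len(old)
	t := old[n-1]
	*h = old[:n-1]
	return t
}

// Scheduler holds every tenant that currently has queued invocations.
type Scheduler struct {
	ready tenantHeap
}

// Enqueue appends an invocation to the tenant's queue and adds the
// tenant to the heap if it was previously idle.
func (s *Scheduler) Enqueue(t *Tenant, id string) {
	t.pending = append(t.pending, id)
	if len(t.pending) == 1 {
		heap.Push(&s.ready, t)
	}
}

// Next picks the least-served tenant that still has at least one token,
// charges it one token, and returns its oldest invocation. Tenants whose
// balance is exhausted stay queued and are retried on a later call.
func (s *Scheduler) Next(now float64) (tenant, id string, ok bool) {
	var throttled []*Tenant
	for s.ready.Len() > 0 {
		t := heap.Pop(&s.ready).(*Tenant)
		t.refill(now)
		if t.Earned-t.Spent < 1 {
			throttled = append(throttled, t) // out of tokens: skip for now
			continue
		}
		t.Spent++ // Spent changes only while t is outside the heap
		id, t.pending = t.pending[0], t.pending[1:]
		tenant, ok = t.Name, true
		if len(t.pending) > 0 {
			heap.Push(&s.ready, t)
		}
		break
	}
	for _, t := range throttled {
		heap.Push(&s.ready, t)
	}
	return tenant, id, ok
}
```

In this sketch, a batch tenant that enqueues many invocations can dispatch only until its balance runs out; an interactive tenant with a single queued call is served ahead of the batch tenant's backlog because it has spent fewer tokens, and the batch tenant resumes once its bucket refills.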

Author Biographies

  • Rajesh Kumar Pandey, Amazon Web Services

    Rajesh Kumar Pandey is a Principal Engineer at Amazon Web Services, where he leads the design of AWS Lambda’s eventing infrastructure. With deep expertise in distributed systems and serverless computing, he focuses on building resilient and scalable cloud-native architectures. Rajesh has authored technical papers, holds multiple patents, and actively shares his knowledge through industry publications, podcasts, and conference talks. He is passionate about advancing the state of serverless and event-driven systems.

  • Jubin Abhishek Soni, PayPal Inc.

    Jubin Abhishek Soni is a Senior Software Engineer at Yahoo, where he leads large-scale cloud migrations and real-time data platform development. With over 13 years of experience in full-stack development, AI systems, and cloud-native architectures, he has held key roles at Nextdoor, Chartmetric, and S&P Global. Jubin has published research in AI and cybersecurity, serves as a judge for technology awards, and peer-reviews scholarly articles. He actively shares his expertise in scalable AI systems and big data engineering through mentoring, writing, and contributions to professional communities.

  • Amit Anand, Yahoo Inc.

    Amit Anand is a Lead Software Engineer at PayPal, where he leads the Pricing and FX platform. With extensive expertise in distributed systems, system management, and fintech, he is dedicated to creating customer-centric, resilient solutions. Before joining PayPal, Amit led the engineering team for Dell's flagship OpenManage Enterprise product. He actively shares his insights through industry publications and is passionate about developing scalable, distributed systems while mentoring the next generation of technologists.

Published

29-12-2025

Section

Articles

How to Cite

Pandey, R. K., Soni, J. A., & Anand, A. (2025). Burst-Aware Weighted Fair Queueing for Serverless Inference: Mitigating Noisy Neighbor Effects in Multi-Tenant Systems. Journal of Soft Computing and Data Mining, 6(3), 356-369. https://publisher.uthm.edu.my/ojs/index.php/jscdm/article/view/23553