Diverse business needs and customized use cases increasingly require serving many models at once. This raises the challenge of deploying and managing those models efficiently while keeping them easy to use and cost-effective. This talk surveys common patterns for serving many models with Ray Serve. We will look at how three Ray Serve features (model composition, multi-application, and model multiplexing) enable seamless deployment of many models while optimizing resource utilization.
Takeaways:
• Discuss common industry patterns for serving many models.
• Learn how to simplify management and enhance performance of many-model serving through Ray Serve's model composition, multi-application, and model multiplexing features.
• Deep dive into case studies of Ray Serve users running many-model applications in production.
Sihan is a software engineer at Anyscale and a contributor to Ray Serve. Before joining Anyscale, he was a software engineer at Pinterest, working on the ML inference service.
Jon Park is a Principal ML Engineer at Clari, where he leads ML Platform. Previously, he worked as a Director of API Platform Engineering at TIBCO. He holds a BA in Computer Science and an MBA from UC Berkeley, and a Master of Computer Science in Data Science from UIUC.
Cindy Zhang is a software engineer focusing on Ray Serve and Ray infrastructure at Anyscale.