As LLMs and generative models grow in complexity, training them on a CPU or a single GPU is no longer practical; they require multiple GPUs, and managing those can be complicated. GPU partitioning in the cloud is often perceived as a complicated, resource-intensive process reserved for narrowly specialized teams or large enterprises. This talk explores why GPU partitioning is necessary for running Python AI workloads and how it can be done efficiently using open source tooling.
The talk will address some common myths: that GPU partitioning requires advanced hardware configurations or comes with prohibitive costs, even on systems like Kubernetes.
In this talk, we will illustrate how technologies like NVIDIA MIG (Multi-Instance GPU), combined with vCluster, enable seamless sharing of GPUs across different teams, leading to more efficient resource utilization, higher throughput, and broader accessibility for workloads like LLM fine-tuning and inference. The talk aims to help developers and engineers understand the key techniques for efficient GPU scheduling and resource sharing across multiple GPU clusters with open source platform tooling like vCluster.
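As a flavor of what the talk covers, the sketch below shows the general shape of MIG-based partitioning on Kubernetes: the GPU is split into hardware-isolated instances, and a pod requests a single slice rather than a whole device. This is a minimal, hedged example; the device index, the MIG profile (`3g.20gb`), the container image, and the extended resource name all depend on your GPU model and NVIDIA device-plugin configuration, and are assumptions here rather than a prescribed setup.

```shell
# Enable MIG mode on GPU 0 (requires a GPU reset on most systems)
sudo nvidia-smi -i 0 -mig 1

# Create two GPU instances using the 3g.20gb profile, with compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb -C

# A workload then requests one MIG slice instead of a full GPU.
# (Resource name assumes the NVIDIA device plugin's "mixed" MIG strategy.)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: finetune-job          # hypothetical pod name
spec:
  containers:
  - name: trainer
    image: my-registry/llm-finetune:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1
EOF
```

With vCluster layered on top, each team can be given its own virtual cluster whose workloads schedule against these MIG slices, so a single physical GPU pool serves multiple tenants.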