Muhammad Shahbaz receives Google Research Scholar award for work on distributed training in the cloud

Shahbaz is working to develop a novel system to enhance the efficiency of distributed training of large-scale deep learning models in the cloud.
Prof. Muhammad Shahbaz

Muhammad Shahbaz, incoming assistant professor of computer science and engineering at the University of Michigan, has received a Google Research Scholar award in support of his project titled “Robust and Tail-Optimal Collective Communication for Distributed Training in the Cloud.” The Google Research Scholar Program supports world-class, innovative research by early-career professors in computer science and related fields. With the support of this award, Shahbaz aims to develop a new collective-communication system to improve the speed and efficiency of distributed training in the cloud.

Distributed training is the current standard for training large-scale deep learning models used in a wide variety of contexts, from large language models (LLMs) to health care. As its name suggests, it distributes the training workload across multiple processors, or workers, that train the model in parallel. This parallelism significantly speeds up training, but with ever-larger datasets and models, growing computational and communication demands threaten to overwhelm distributed training systems, creating a need for more efficient and adaptable approaches.
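In the common data-parallel form of distributed training, each worker computes gradients on its own shard of the data, and a collective-communication step (typically an all-reduce) averages those gradients so every worker applies the same update. The sketch below illustrates that pattern on a toy linear-regression task; the task, the sequential simulation of workers, and all names in it are assumptions for illustration, not details of Shahbaz's project.

```python
# Minimal data-parallel training sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
NUM_WORKERS = 4

# Shared model: one weight vector for a toy linear-regression task.
w = np.zeros(8)

# Each worker holds its own shard of the training data.
shards = [(rng.normal(size=(32, 8)), rng.normal(size=32))
          for _ in range(NUM_WORKERS)]

def local_gradient(w, X, y):
    """Gradient of the mean-squared error on one worker's shard."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

for step in range(100):
    # Each worker computes a gradient on its own shard (in a real
    # system this happens in parallel; here it is simulated in a loop).
    grads = [local_gradient(w, X, y) for X, y in shards]
    # All-reduce: average the gradients so every worker applies the
    # same update and the model replicas stay in sync.
    avg_grad = np.mean(grads, axis=0)
    w -= 0.01 * avg_grad
```

At scale, the all-reduce step becomes the bottleneck: every training step waits on the slowest worker and the slowest network path, which is exactly the tail behavior Shahbaz's project targets.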

With this in mind, Shahbaz and his collaborators are working to develop a novel system, called Ultima, that will use a set of state-of-the-art tools to ensure time-bounded and predictable training workloads for deep learning models. Using adaptive timeouts, a gradient maximization technique, a method for minimizing dropped gradients, and more, the system will harness an inherent characteristic of distributed training, its resilience to gradient loss, to maximize performance and speed without sacrificing training accuracy. Through these advances, Shahbaz and his team hope to deliver a dependable way to increase the speed and predictability of distributed training, benefiting the vast number of current and emerging applications that rely on deep learning models.
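The announcement does not spell out Ultima's algorithms, but the general idea of trading a small amount of gradient loss for bounded, predictable step times can be sketched as follows. Everything here, the timeout-adaptation rule, the simulated finish times, and the drop-the-stragglers policy, is a hypothetical illustration of that idea, not Ultima's actual design.

```python
# Toy sketch of timeout-based straggler mitigation (an assumed
# illustration of the general idea, not Ultima's algorithm).
import numpy as np

rng = np.random.default_rng(1)
NUM_WORKERS, DIM = 8, 8
w = np.zeros(DIM)   # stand-in model parameters
timeout = 1.0       # current deadline, in arbitrary time units

for step in range(5):
    # Simulated per-worker completion times; the exponential gives
    # occasional heavy-tailed stragglers.
    finish_times = rng.exponential(scale=0.5, size=NUM_WORKERS)
    grads = rng.normal(size=(NUM_WORKERS, DIM))  # stand-in gradients

    on_time = finish_times <= timeout
    if on_time.any():
        # SGD tolerates some gradient loss, so average only the on-time
        # contributions instead of waiting for the slowest worker.
        w -= 0.01 * grads[on_time].mean(axis=0)

    # Adapt the deadline toward a high percentile of observed finish
    # times, bounding tail latency while still keeping most gradients.
    timeout = 0.9 * timeout + 0.1 * np.quantile(finish_times, 0.9)
    print(f"step {step}: kept {int(on_time.sum())}/{NUM_WORKERS} gradients, "
          f"timeout -> {timeout:.2f}")
```

The payoff of this style of design is predictability: step time is capped by the deadline rather than by the slowest worker, while the averaged update remains accurate enough for training to converge.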