“Diving into the GPU Supply Shortage: Is a Wave of Change Coming for AI and High Performance Computing?”

LAIKA AI
6 min read · Dec 30, 2023


Understanding GPU Supply and FLOPS Demand for Next-Gen AI

The demand for high-performance computing, driven by the rise of AI, is undergoing a significant shift as the next wave of GPUs prepares to enter the market. Companies are increasingly focusing on the FLOPS demands of the next generation of AI, emphasizing the need for scalable and cost-effective solutions.

GPU Scaling and Cost Considerations: Large tech companies such as Amazon and Google have a significant cost advantage when it comes to renting infrastructure and accessing high-performance GPUs. This advantage allows them to leverage greater compute power and optimize AI workloads more efficiently.

Concept of GPU Poor and GPU Rich: The concept of GPU poor and GPU rich entities has emerged as a framework to differentiate between organizations with limited access to high-end GPUs and those with abundant resources. This differentiation has prompted discussions about strategies for leveling the playing field and minimizing the disparities in GPU utilization.

Challenges and Opportunities in GPU Utilization: Working with limited access to high-end GPUs presents both challenges and opportunities. While it poses constraints on compute power, it also drives innovation and encourages the development of alternative approaches to maximize the potential of available resources.

Open Source and Resources: The importance of open source resources in the AI and HPC landscape cannot be overstated. The success of open source initiatives is vital for fostering collaboration, knowledge sharing, and advancements in AI and HPC technologies.

Compute and Acceleration: The significance of compute power and the accelerated pace of technological advancements are driving the need for scalable, high-performance computing solutions. As AI workloads continue to grow, the demand for efficient computation and acceleration methods becomes increasingly imperative.

Google’s TPU and PyTorch: Google’s Tensor Processing Units (TPUs) and their accompanying software stack play a crucial role in machine learning, competing with the dominant NVIDIA GPU and PyTorch ecosystem. The competition between these platforms highlights the ongoing evolution and diversification within the AI and HPC landscape.

LLM Inference and Batch Size: The challenges associated with Large Language Model (LLM) inference and the impact of batch size on memory bandwidth utilization and FLOPS underscore the intricacies involved in optimizing performance at scale.
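
To make the batch-size effect concrete, here is a back-of-the-envelope sketch (the 70B parameter count and fp16 weights are illustrative assumptions, not figures from the discussion above): at batch size 1 each generated token streams the full set of weights for roughly one FLOP per byte, so the accelerator stalls on memory; larger batches amortize that traffic.

```python
# Rough model of decoder inference (illustrative assumptions): a dense
# transformer reads all weights once per decoding step and performs
# ~2 FLOPs per parameter per token in the batch.

def arithmetic_intensity(n_params: float, batch_size: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights moved, for one decoding step (fp16 weights)."""
    flops = 2 * n_params * batch_size          # matmul work grows with batch size
    bytes_moved = n_params * bytes_per_param   # weight traffic is (roughly) fixed
    return flops / bytes_moved

for bs in (1, 8, 64, 256):
    print(f"batch={bs:4d}  ~{arithmetic_intensity(70e9, bs):.0f} FLOPs/byte")
# batch=1 gives ~1 FLOP/byte (memory-bound); larger batches push the kernel
# toward the compute-bound regime, improving FLOPS utilization.
```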

Model Flop Utilization (MFU): The importance of Model Flop Utilization (MFU) in both training and inference stages cannot be overstated. Maximizing MFU is a key objective in achieving efficient utilization of computational resources.
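
As a minimal sketch of the metric itself (the throughput and peak-FLOPS figures below are hypothetical, chosen only to show the calculation):

```python
# Model FLOP Utilization (MFU): useful model FLOPs actually processed per
# second divided by the hardware's peak FLOPS. Numbers below are illustrative.

def mfu(tokens_per_second: float, flops_per_token: float, peak_flops: float) -> float:
    return (tokens_per_second * flops_per_token) / peak_flops

# Example: a 70B-parameter model (~2 * 70e9 FLOPs per token for a forward pass)
# generating 1,000 tokens/s on hardware with ~1e15 peak FLOPS (hypothetical).
print(f"MFU ≈ {mfu(1_000, 2 * 70e9, 1e15):.1%}")   # ≈ 14.0%
```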

Rapid Advancements in Technology: The rapid pace of technological advancements in the AI and HPC domains has transformative implications for future developments in machine learning. As computational capabilities expand, new opportunities for innovation and discovery emerge.

Challenges in Training and Inference: Distinguishing between the challenges of training and inference is crucial in understanding the varying computational constraints and batch size considerations that impact performance optimization.

Focusing on Key Metrics: Key metrics such as model flop utilization and memory bandwidth utilization are paramount in evaluating the efficiency and effectiveness of AI and HPC processes. Emphasizing these metrics guides the development of strategies to enhance computational performance.

Memory Bandwidth Utilization: Memory bandwidth has not kept pace with raw FLOPS, and the gap shows up in utilization: the A100 typically achieves around 60% MFU, while the H100 manages only 40–45% because its far higher peak FLOPS are harder to keep fed from memory. Maximizing memory bandwidth utilization is therefore essential for optimizing performance in AI and HPC workloads.

Latency and Model Bandwidth Requirements: To generate tokens at human reading speed from a 70-billion-parameter model at batch size 1, roughly 2,100 gigabytes per second of memory bandwidth is needed. The A100’s ~2 TB/s of peak HBM bandwidth falls short outright, and even the H100’s ~3.35 TB/s becomes marginal once realistic memory bandwidth utilization is factored in, reflecting the formidable challenges posed by latency and bandwidth demands.
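
One plausible reconstruction of the ~2,100 GB/s figure, assuming fp16 weights and a target of roughly 15 tokens per second at batch size 1 (both assumptions, not numbers stated above):

```python
# One way to arrive at the ~2100 GB/s requirement (assumed inputs): at batch
# size 1 every generated token requires streaming all model weights from HBM.

params = 70e9                 # 70B-parameter model
bytes_per_param = 2           # fp16/bf16 weights
tokens_per_second = 15        # assumed "human reading speed" target

weights_gb = params * bytes_per_param / 1e9          # ≈ 140 GB of weights
required_bandwidth = weights_gb * tokens_per_second  # GB/s of weight traffic

print(f"weights ≈ {weights_gb:.0f} GB, required ≈ {required_bandwidth:.0f} GB/s")
# ≈ 140 GB of weights and ≈ 2100 GB/s of sustained bandwidth, before any
# KV-cache traffic is even counted.
```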

Networking and GPU Poor: The slower increase in networking speed compared to FLOPS and bandwidth presents significant challenges for chip designers, particularly for organizations categorized as GPU poor.

Google TPUs and Mellanox: Google’s TPU pod designs and NVIDIA’s acquisition of Mellanox both treat high-speed networking as a critical component of ML hardware, signifying a concerted effort to address networking challenges in AI and HPC.

Challenges in Running GPUs or Models on Device: Running GPUs or models on devices presents fundamental challenges related to memory bandwidth and capacity limitations. Addressing these challenges requires innovative approaches and new paradigms in hardware and software development.
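
A quick footprint estimate shows why capacity alone is a hurdle on device (the 7B model size and quantization levels below are illustrative assumptions):

```python
# Rough on-device memory footprint of model weights at different precisions
# (weights only; KV cache and activations add more). Illustrative figures.

def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model @ {bits}-bit: {weight_footprint_gb(7e9, bits):.1f} GB")
# 14.0 GB, 7.0 GB, 3.5 GB — only aggressively quantized variants fit a
# phone-class memory budget, and bandwidth still caps tokens per second.
```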

Innovative Approaches to Overcome Limitations: Innovative strategies such as speculative decoding are being explored to overcome the limitations of running models on devices, pointing towards the potential for creative solutions to optimize performance within constrained environments.
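
For intuition, here is a heavily simplified toy sketch of the idea behind speculative decoding: a cheap draft model proposes several tokens, and the expensive target model verifies them in one pass, keeping the longest matching prefix. The stand-in models below are hypothetical; real implementations verify against the target model’s full token distributions rather than greedy matches.

```python
import random

# Toy greedy speculative decoding (illustrative only). The draft model is cheap
# and inaccurate; the target model is expensive but authoritative. Several
# tokens can be accepted per expensive call, cutting latency at batch size 1.

VOCAB = list("abcde")

def draft_next(ctx):   # hypothetical cheap draft model
    return random.choice(VOCAB)

def target_next(ctx):  # hypothetical expensive target model (deterministic toy rule)
    return VOCAB[sum(map(ord, ctx)) % len(VOCAB)]

def speculative_step(ctx, k=4):
    drafts = []
    for _ in range(k):                      # k cheap draft proposals
        drafts.append(draft_next(ctx + "".join(drafts)))
    out = []
    for tok in drafts:                      # one (conceptually batched) target check
        expected = target_next(ctx + "".join(out))
        out.append(expected)
        if tok != expected:                 # first mismatch: stop accepting drafts
            break
    return ctx + "".join(out)

ctx = "ab"
for _ in range(5):
    ctx = speculative_step(ctx)
print(ctx)  # identical to what greedy decoding with the target model alone produces
```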

Development of Models with Limited Hardware Resources: The development of impactful models like Medusa with just one server and eight GPUs exemplifies the potential for creating significant AI and HPC models with limited hardware resources. This demonstration highlights opportunities for innovation and resource optimization.

Considerations for Hardware Depreciation and Model Fine-Tuning: Debates surrounding hardware depreciation versus the rapid evolution of models, particularly related to fine-tuning smaller models, underscore the dynamic landscape of AI and HPC and the challenges associated with balancing hardware investments with model advancements.

The Competition between OpenAI, Microsoft, and NVIDIA:

  • Microsoft may face challenges even with OpenAI’s assistance, owing to the performance limitations of its in-house chips.
  • There are concerns about the scalability of the technology and the potential need for alternative solutions such as GPUs and TPUs.

Apple’s Role in the Industry:

  • Apple’s emphasis on polished, tightly controlled products and its relative lack of openness may limit its ability to compete in developing open AI models.
  • Despite this, Apple’s powerful distribution should not be underestimated.

Concerns about Taiwan’s National Security:

  • There are varying levels of concern in Taiwan about the possibility of a Chinese invasion, with differing predictions about potential timeframes.

Semiconductor Supply Chain Fragmentation:

  • The semiconductor supply chain is highly fragmented, with significant dependencies on specific technologies and companies in various regions.

Complexity and Limitations of Rebuilding the Semiconductor Supply Chain:

  • Rebuilding the semiconductor supply chain in an alternate location would be a daunting task due to its extensive complexity and interdependencies.

Impact of Morris Chang’s Role on Technology Development:

  • The role of Morris Chang and his move to Taiwan has likely accelerated technology development and innovation in the semiconductor industry.
  • The dissemination of technology across various regions has led to significant advancements in the industry.

A lot of the smaller companies (when compared to $1T+ giants, it’s all relative) are trying hard to fight against the GPU rich, but they can’t quite offer the same scale:

  • HuggingFace is trying to launch a training cluster as a service, but it appears to be a software wrapper around NVIDIA’s DGX Cloud, as they don’t actually own that much GPU supply. The maximum GPU count selectable in their request form is 1,000.
  • Databricks’ “GPU-enabled clusters” run on AWS, and the largest one listed there is powered by just 8 NVIDIA A10Gs. The Mosaic team is also researching AMD cards with some promising results, but they seem to be pushing up to just 128 cards, which isn’t much.
  • Together actually has 4,424 H100s live in production, which is quite sizable but still nothing compared to the 100,000 that Meta is putting online.

Take LLaMA2 as an example: the 70B model was trained on 2T tokens. Using the highest accelerator count available through HuggingFace’s offering, it’d take ~43 days to train the model from scratch and cost ~$2M, and that doesn’t include all the data and prep work. In the meantime, Zuck is probably burning tens of thousands of H100s to train LLaMA3, which will surely outperform whatever a GPU-poor company can train in the same time span.
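
That day count can be sanity-checked with the common ~6 FLOPs-per-parameter-per-token rule of thumb; the GPU type and utilization below are assumptions, so the result only needs to land in the same ballpark:

```python
# Sanity check of the training-time estimate using the ~6 * params * tokens
# rule of thumb. GPU specs and utilization are assumptions for illustration.

params = 70e9            # LLaMA-2 70B
tokens = 2e12            # 2T training tokens
total_flops = 6 * params * tokens             # ≈ 8.4e23 FLOPs

n_gpus = 1_000                 # the largest option mentioned above
peak_flops_per_gpu = 990e12    # ~bf16 dense peak of an H100 (assumed hardware)
mfu = 0.30                     # assumed sustained utilization

seconds = total_flops / (n_gpus * peak_flops_per_gpu * mfu)
print(f"≈ {seconds / 86_400:.0f} days")
# ≈ 33 days — the same ballpark as the ~43 days cited above; the exact figure
# depends on the GPU type and the MFU actually achieved.
```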

The good news is that there’s a ton of opportunity for the GPU poor to shine, especially around fine-tuning. Most of the open-source models coming out are one-size-fits-all, and startups can take them and tailor them to their customers, or to specific tasks and use cases, to build vertical applications. The other area of improvement is data quality: Mistral showed how you can build a high-quality small model with fewer FLOPs by feeding it better data.
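
As one purely illustrative example of that fine-tuning path, parameter-efficient methods like LoRA let a small team adapt an open base model on a single node. Here is a minimal sketch using the Hugging Face PEFT library, where the model name, target modules, and hyperparameters are placeholders:

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative; the
# model name, target modules, and hyperparameters are placeholder assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                      # any open base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],               # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the weights
# From here, a standard transformers Trainer loop on domain-specific data is
# enough to specialize the model for a vertical use case on a single node.
```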


LAIKA AI

Laika AI is a decentralised open-source software library for machine learning and artificial intelligence.