PyTorch 2.5 Unleashes High-End GPU Performance, Supercharges LLMs 

PyTorch has released its latest update, PyTorch 2.5. The new version includes a CuDNN backend for SDPA, providing up to 75% speedups on H100 GPUs, and regional compilation for torch.compile, which reduces cold start time for repeated nn.Module compilations, making it ideal for LLM use cases.

As per the release notes, “this option allows users to compile a repeated nn.Module (e.g., a transformer layer in LLM) without recompilations,” resulting in faster performance with minimal degradation.

Plus, TorchInductor’s CPP backend brings substantial improvements, including FP16 support and a max-autotune mode, outperforming eager mode in 97.5% of the models tested. With 4095 commits from 504 contributors, the release also highlights ecosystem projects like TorchRec and TorchFix, pushing the boundaries of PyTorch’s versatility in AI applications.

PyTorch 2.4 vs PyTorch 2.5

PyTorch 2.4, released in July this year, focused on introducing Python 3.12 support, improving performance for GPU-based models, and enhancing distributed training. The new version, PyTorch 2.5, shifts towards optimising LLM workflows and leveraging high-end GPUs for significant speed gains.

With the advancements in performance, particularly the new CuDNN backend for GPUs and TorchInductor’s CPU enhancements, PyTorch 2.5 is a critical release for users working on large-scale AI models.

| Feature | PyTorch 2.5 | PyTorch 2.4 |
| --- | --- | --- |
| CuDNN Backend for SDPA | New backend for faster transformer execution | No SDPA-specific backend, generalised performance |
| Regional Compilation | Regional compilation with torch.compile | Entire-model compilation with torch.compile |
| TorchInductor CPP Backend | Significant performance boost on CPU | Initial TorchInductor release, general optimisations |
| Performance Focus | Targeted at attention models, CPU speedup | General performance improvements across models |
| Distributed Training | Optimised for multi-GPU, transformer-based models | Distributed training improvements, but less focused |
| Bug Fixes | Numerous fixes and stability enhancements | Stability improvements, but some torch.compile edge cases |

Key Improvements

CuDNN Backend for SDPA

In PyTorch 2.5, the introduction of the CuDNN backend for scaled dot-product attention (SDPA) provides up to a 75% speedup on H100 GPUs, a substantial leap over version 2.4. The latest update offers performance boosts “enabled by default for all users of SDPA on H100 or newer GPUs.”
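
For readers who want to try the backend directly, here is a minimal sketch (the tensor shapes are illustrative, and an H100-class GPU is assumed) that pins SDPA to the CuDNN backend via the sdpa_kernel context manager:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Illustrative (batch, heads, seq_len, head_dim) tensors in FP16 on CUDA.
q, k, v = (torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# The CuDNN backend is enabled by default on H100 or newer GPUs; the context
# manager merely forces it, which is handy when benchmarking backends.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```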

PyTorch 2.4, in contrast, mostly focused on supporting Python 3.12 and on performance optimisations like AOTInductor freezing, with no GPU-specific breakthroughs of this magnitude.

Regional Compilation for torch.compile

Version 2.5 introduces regional compilation to reduce cold start time in torch.compile, which is crucial for repeated modules like transformer layers in LLMs.

“This option allows users to compile a repeated nn.Module without recompilations,” read the release notes. This brings significant efficiency to repeated computation in LLM architectures.
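
As a rough sketch of what this looks like in practice (a toy feed-forward block stands in for a transformer layer here), the repeated block is compiled rather than the whole model, so the compiled code can be reused across all instances:

```python
import torch
from torch import nn

class Block(nn.Module):
    """Toy stand-in for a repeated transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class Model(nn.Module):
    def __init__(self, dim=256, depth=12):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        # Regional compilation: compile the repeated block instead of the
        # full model, avoiding a fresh compile for each of the 12 instances.
        for block in self.blocks:
            block.compile()

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

model = Model()
out = model(torch.randn(4, 128, 256))  # first call triggers the one compilation
```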

By contrast, PyTorch 2.4 had optimisations for torch.compile but focused on adding support for Python 3.12 rather than reducing compilation overhead for repeated tasks.

TorchInductor Enhancements

PyTorch 2.5 pushes TorchInductor performance even further, introducing a CPP backend with FP16 support, CPP wrappers, and the max-autotune mode for fine-tuning performance.

PyTorch 2.4 laid the groundwork for such enhancements with its introduction of AOTInductor freezing for CPU, which optimised MKLDNN weight serialisation but didn’t offer such a comprehensive range of features for CPU optimisation.

“TorchInductor consistently achieves performance speedups across three benchmark suites—TorchBench, Hugging Face, and TIMM—outperforming eager mode in 97.5% of the 193 models tested,” revealed the PyTorch team.
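
A minimal way to exercise these code paths is to compile a CPU model with the max-autotune mode; the model below is just a placeholder:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()

# mode="max-autotune" tells TorchInductor to benchmark candidate kernels and
# keep the fastest; on a CPU-only setup this runs through the CPP backend.
compiled = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    out = compiled(torch.randn(32, 512))
```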

TCPStore Backend

PyTorch 2.4 had earlier introduced a TCPStore server backend using libuv, which significantly reduced initialisation times for large-scale jobs. Many developers believe this backend was pivotal for distributed training setups. While not a focal point in PyTorch 2.5, this backend still contributes to the overall efficiency of distributed environments in the newer release.
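
For context, TCPStore is the key-value rendezvous that ranks use to find each other during job start-up. A minimal single-process sketch (the address and port are arbitrary) looks like this:

```python
import torch.distributed as dist

# Rank 0 hosts the store (is_master=True); other ranks connect as clients.
# Since PyTorch 2.4 the server side is backed by libuv, which cuts
# initialisation time for large-scale jobs.
store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)

store.set("warmup_done", "1")
print(store.get("warmup_done"))  # b'1'
```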

Beta and Prototype Features

Both releases have numerous beta and prototype features, but the focus in 2.5 shifts toward high-performance LLMs and GPU-intensive applications.

For example, the new FlexAttention API in 2.5 handles various attention mechanisms, and Compiled Autograd extends the flexibility of backward pass execution. In 2.4, the innovations were more related to pipeline parallelism and FSDP2 for data sharding, which were still pivotal but more aligned with distributed model training.
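
To illustrate the FlexAttention idea: a score_mod callable rewrites raw attention scores, so patterns like causal masking can be expressed in plain Python rather than a hand-written kernel. A minimal sketch under the prototype API (shapes illustrative, CUDA assumed):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Illustrative (batch, heads, seq_len, head_dim) tensors.
q, k, v = (torch.randn(2, 4, 128, 64, device="cuda") for _ in range(3))

# score_mod receives each score plus its indices; returning -inf for
# future positions yields a causal mask without a custom kernel.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out = flex_attention(q, k, v, score_mod=causal)
```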

Ecosystem Expansion

PyTorch 2.5 highlights ecosystem projects like TorchRec and TorchFix, further expanding PyTorch’s versatility. In contrast, PyTorch 2.4’s improvements largely focused on integrating Intel GPUs into Linux systems and enhancing CPU operations.

Apart from major highlights like the SDPA CuDNN backend and FlexAttention, PyTorch 2.5 also introduces FP16 support on the CPU path, covering both eager mode and the TorchInductor CPP backend; an Autoload Device Extension mechanism, which streamlines integration with out-of-tree device extensions; and significant enhancements to Intel GPU support. These features make PyTorch 2.5 a highly versatile, performance-optimised release, positioning it as a go-to tool for developers working on diverse hardware setups and complex AI models.

“I run three Arc A770s and have been waiting for tensor parallel outside Vulkan. Hallelujah!” said a user on Reddit, highlighting excitement over the long-awaited Intel GPU support, which significantly expands PyTorch’s capabilities for AI workloads.
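
For the FP16-on-CPU path mentioned above, a minimal sketch, assuming 2.5’s float16 autocast support on CPU, could look like this:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# float16 autocast on the CPU path (newly covered in 2.5 for both eager mode
# and the TorchInductor CPP backend); bfloat16 was the usual CPU choice before.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.float16):
    out = model(torch.randn(8, 256))
```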
