Bibliographic Details
Title: |
Context Parallelism for Scalable Million-Token Inference |
Authors: |
Yang, Amy; Yang, Jingyi; Ibrahim, Aya; Xie, Xinfeng; Tang, Bangsheng; Sizov, Grigory; Reizenstein, Jeremy; Park, Jongsoo; Huang, Jianyu |
Publication Year: |
2024 |
Collection: |
Computer Science |
Subject Terms: |
Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning |
More Details: |
We present context parallelism for long-context large language model inference, which achieves near-linear scaling of long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M-context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K-context prefill in 3.8s. We develop two lossless exact ring attention variants, pass-KV and pass-Q, that cover a wide range of use cases with state-of-the-art performance: full prefill, persistent KV prefill, and decode. Benchmarks on H100 GPU hosts interconnected with RDMA and with TCP both show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth. |
Document Type: |
Working Paper |
Access URL: |
http://arxiv.org/abs/2411.01783 |
Accession Number: |
edsarx.2411.01783 |
Database: |
arXiv |
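To illustrate the pass-KV ring attention idea named in the abstract, the sketch below simulates, in a single process, a ring of ranks where each rank keeps its query shard resident while key/value shards rotate around the ring, with partial attention outputs merged by a numerically stable online-softmax (log-sum-exp) combine. This is a minimal sketch, not code from the paper: real context parallelism runs across GPUs, overlaps KV communication with computation, and applies causal masking, all of which are omitted here; every function and variable name is hypothetical.

```python
# Single-process simulation of pass-KV ring attention (illustrative only).
# Each "rank" keeps its query shard; KV shards rotate one hop per step and
# partial outputs are merged with a log-sum-exp (online softmax) combine.
import torch


def attn_partial(q, k, v, scale):
    """Attention of one query shard against one KV shard.
    Returns the partial output and the row-wise log-sum-exp of the scores."""
    scores = (q @ k.transpose(-2, -1)) * scale            # [T_q, T_kv]
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)   # [T_q, 1]
    out = torch.softmax(scores, dim=-1) @ v                # [T_q, D]
    return out, lse


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results exactly (online softmax merge)."""
    lse = torch.logaddexp(lse_a, lse_b)
    out = out_a * torch.exp(lse_a - lse) + out_b * torch.exp(lse_b - lse)
    return out, lse


def ring_attention_pass_kv(q_shards, k_shards, v_shards):
    """q_shards[i] stays on rank i; KV shards visit every rank over `world` steps."""
    world = len(q_shards)
    scale = q_shards[0].shape[-1] ** -0.5
    outs, lses = [None] * world, [None] * world
    for step in range(world):
        for rank in range(world):
            src = (rank + step) % world  # index of the KV shard held by `rank` this step
            o, l = attn_partial(q_shards[rank], k_shards[src], v_shards[src], scale)
            if outs[rank] is None:
                outs[rank], lses[rank] = o, l
            else:
                outs[rank], lses[rank] = merge(outs[rank], lses[rank], o, l)
    return outs


if __name__ == "__main__":
    torch.manual_seed(0)
    world, t, d = 4, 8, 16  # 4 simulated ranks, 8 tokens per shard, head dim 16
    q, k, v = (torch.randn(world * t, d) for _ in range(3))
    ref = torch.softmax((q @ k.T) * d ** -0.5, dim=-1) @ v  # single-device reference
    outs = ring_attention_pass_kv(list(q.split(t)), list(k.split(t)), list(v.split(t)))
    print(torch.allclose(torch.cat(outs), ref, atol=1e-5))  # exact up to fp rounding
```

The loop only demonstrates that the shard-wise merge is exact; in the paper, pass-KV and pass-Q differ in which tensors travel around the ring, so that communication can be hidden behind computation for full prefill versus persistent-KV prefill and decode.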