A Tensor Processing Framework for CPU-Manycore Heterogeneous Systems.

Bibliographic Details
Title: A Tensor Processing Framework for CPU-Manycore Heterogeneous Systems.
Authors: Cheng, Lin; Pan, Peitian; Zhao, Zhongyuan; Ranjan, Krithik; Weber, Jack; Veluri, Bandhav; Ehsani, Seyed Borna; Ruttenberg, Max; Jung, Dai Cheol; Ivanov, Preslav; Richmond, Dustin; Taylor, Michael B.; Zhang, Zhiru; Batten, Christopher
Source: IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems. Jun 2022, Vol. 41, Issue 6, pp. 1620-1635. 16 pp.
Subjects: OPEN source software, CENTRAL processing units, COMPUTER architecture, COPROCESSORS, ENERGY consumption
Abstract: Future CPU-manycore heterogeneous systems can provide high peak throughput by integrating thousands of simple, independent, energy-efficient cores in a single die. However, there are two key challenges to translating this high peak throughput into improved end-to-end workload performance: 1) manycore co-processors rely on simple hardware, putting significant demands on the software programmer, and 2) manycore co-processors use in-order cores that struggle to tolerate long memory latencies. To address the manycore programmability challenge, this article presents a dense and sparse tensor processing framework based on PyTorch that enables domain experts to easily accelerate off-the-shelf workloads on CPU-manycore heterogeneous systems. To address the manycore memory latency challenge, we use our extended PyTorch framework to explore the potential for decoupled access/execute (DAE) software and hardware mechanisms. More specifically, we propose two software-only techniques, naïve-software DAE and systolic-software DAE, along with a lightweight hardware access accelerator to further improve area-normalized throughput. We evaluate our techniques using a combination of PyTorch operator microbenchmarking and real-world PyTorch workloads running on a detailed register-transfer-level model of a 128-core manycore architecture. Our evaluation on three real-world dense and sparse tensor workloads suggests these workloads can achieve approximately 2–6× performance improvement when scaled to a future 2000-core CPU-manycore heterogeneous system compared to an 18-core out-of-order CPU baseline, while potentially achieving higher area-normalized throughput and improved energy efficiency compared to general-purpose graphics processing units. [ABSTRACT FROM AUTHOR]
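Note: the abstract describes decoupled access/execute only at a high level. The following is a minimal, hypothetical Python sketch of the general idea behind a naïve software-DAE kernel, in which operand accesses for the next block are staged into a local buffer before the current block is computed, so memory latency overlaps with execution. It is not the authors' implementation; the names (stage_block, daxpy_dae, BLOCK) and the NumPy-based staging are illustrative assumptions only.

```python
# Hypothetical sketch of a naive software decoupled access/execute (DAE)
# kernel structure. Illustrates the decoupling idea only; the paper's
# kernels run on an in-order manycore, not on NumPy.
import numpy as np

BLOCK = 64  # assumed scratchpad-sized block (illustrative)


def stage_block(src, start, end):
    """Access phase: copy a block of operands into a local buffer.

    On a manycore core this would correspond to issuing non-blocking
    loads (or a DMA transfer) into scratchpad well before the data is
    consumed by the execute phase.
    """
    return np.array(src[start:end], copy=True)


def daxpy_dae(alpha, x, y):
    """Compute alpha * x + y block by block, with access decoupled from execute."""
    n = len(x)
    out = np.empty_like(y)
    # Prefetch the first block before the compute loop starts.
    xb = stage_block(x, 0, min(BLOCK, n))
    yb = stage_block(y, 0, min(BLOCK, n))
    for start in range(0, n, BLOCK):
        end = min(start + BLOCK, n)
        # Issue the *next* block's accesses before computing on the
        # current block, so memory latency can hide behind execution.
        if end < n:
            next_xb = stage_block(x, end, min(end + BLOCK, n))
            next_yb = stage_block(y, end, min(end + BLOCK, n))
        # Execute phase: operate only on locally staged data.
        out[start:end] = alpha * xb + yb
        if end < n:
            xb, yb = next_xb, next_yb
    return out


if __name__ == "__main__":
    x = np.random.rand(1000).astype(np.float32)
    y = np.random.rand(1000).astype(np.float32)
    assert np.allclose(daxpy_dae(2.0, x, y), 2.0 * x + y)
```

In the paper's systolic-software DAE variant, access and execute are instead split across neighboring cores rather than interleaved on one core; the sketch above only conveys the simpler single-core (naïve) structure.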
Copyright of IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems is the property of IEEE and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Military & Government Collection
More Details
ISSN: 0278-0070
DOI: 10.1109/TCAD.2021.3103825
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems
Language: English