
Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

Abid Rafique, George A. Constantinides, Senior Member, IEEE, and Nachiket Kapre, Member, IEEE

• Abid Rafique and George A. Constantinides are with the Department of Electrical and Electronic Engineering, Imperial College London, UK.
• Nachiket Kapre is with the School of Computer Engineering, Nanyang Technological University, Singapore.

Abstract—Trading communication for redundant computation can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. While k iterations of the iterative solver can be unrolled to provide an O(k) reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure for selecting this algorithmic parameter k, which provides a communication-computation tradeoff on hardware accelerators like FPGAs and GPUs. We provide predictive models to understand this tradeoff and show how careful selection of k can lead to performance improvements that would otherwise demand a significant increase in memory bandwidth. On an Nvidia C2050 GPU, we demonstrate a 1.9×–42.6× speedup over the standard iterative solver for a range of benchmarks, and we show that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication but allows communication between the processing cores. This reduces redundant computation, allows larger k, and hence yields higher speedups. Our FPGA approach provides a 0.3×–4.4× speedup over a same-generation GPU device, with k picked carefully for both architectures, across a range of benchmarks.

Index Terms—Iterative Numerical Methods; Sparse Matrix-Vector Multiply; Matrix Powers Kernel; Field Programmable Gate Arrays (FPGAs); Graphics Processing Units (GPUs)

1 INTRODUCTION

The cost of a high-performance scientific computation operating on large datasets consists of two factors: (1) the computation cost of performing floating-point operations, and (2) the communication cost (both latency and bandwidth) of moving data within the memory hierarchy in the sequential case, or between processors in the parallel case. One family of communication-intensive scientific computations is the iterative solvers used for solving large-scale sparse linear systems of equations (Ax = b) and eigenvalue problems (Ax = λx) [1].

The solution of these problems is computed from a Krylov subspace $\operatorname{span}(x, Ax, A^2x, \ldots, A^r x)$ [1], where a new vector is generated in each iteration. Iterative solvers are challenging to accelerate because they spend most of their time in communication-bound computations, like sparse matrix-vector multiply (SpMV) and vector-vector operations (dot products and vector additions). Additionally, the data dependency between these operations hinders overlapping communication with computation. For large-scale problems where the matrix A does not fit on-chip, no matter how much parallelism is exploited to accelerate SpMV, the performance of the iterative solver is bounded by the available off-chip memory bandwidth, e.g. with 2 flops per 4 bytes (single precision) in SpMV, the maximum theoretical performance is 71 GFLOPs on an Nvidia C2050 GPU and 17 GFLOPs on a Virtex6 FPGA. This results in less than 7% and 4% efficiency for the GPU and FPGA respectively (see Table 1 for peak single-precision GFLOPs and off-chip memory bandwidth).
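These figures follow from the arithmetic intensity of streaming SpMV. As a rough check (our arithmetic, using the Table 1 bandwidths; the paper's 71 GFLOPs figure suggests a slightly lower effective GPU bandwidth than the nominal 144 GB/s):

```latex
% Bandwidth bound for streaming SpMV at 2 flops per 4 bytes moved
% (0.5 flops/byte arithmetic intensity):
\[
  \mathrm{GFLOPs_{max}} \;=\; \mathrm{BW_{off\text{-}chip}} \times \frac{2~\text{flops}}{4~\text{bytes}}
\]
\[
  \text{GPU: } 144~\text{GB/s} \times 0.5 \approx 72~\text{GFLOPs}, \qquad
  \text{FPGA: } 34~\text{GB/s} \times 0.5 = 17~\text{GFLOPs}
\]
\[
  \text{Efficiency: } \tfrac{72}{1030} \approx 7\%~\text{(GPU)}, \qquad
  \tfrac{17}{450} \approx 4\%~\text{(FPGA)}
\]
```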

The communication problem is connected to the memory-wall problem [2]. Due to technology scaling, computation performance is increasing at a dramatic rate (flops/sec improves by 59% each year), whereas communication performance is improving at a much lower rate (DRAM latency improves by 5.5% and bandwidth by 23% each year) [3]. It is a well-known idea to formulate algorithmic innovations that hide memory latency and optimize memory bandwidth [4] [5] [6]. For iterative solvers, Demmel et al. [7] trade communication for redundant computation by replacing k SpMVs with the matrix powers kernel. The key idea is to partition the matrix into blocks, perform k SpMVs on each block without fetching the block again in the sequential case, and perform redundant computation to avoid communication with other processors in the parallel case. In this way the communication cost is reduced by O(k) at the expense of an increase in redundant computation. They show that such an approach can minimize latency in a grid [7], and both latency and bandwidth on multicore CPUs [8], to give up to 4× and 4.3× speedups respectively over k SpMVs for banded matrices.
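To make the matrix powers kernel concrete, here is a minimal sketch (in Python, for a tridiagonal matrix; an illustration of the idea rather than the blocked implementation of [7] or [8]). Each block of x is fetched once together with a ghost region of width k, and x, Ax, ..., A^k x restricted to that block are computed locally; the entries recomputed inside the ghost region are the redundant work that replaces communication:

```python
import numpy as np

def matrix_powers_tridiag(diags, x, k, block):
    """Compute V = [x, Ax, ..., A^k x] for tridiagonal A given by three
    diagonals, touching each block of A and x only once.

    diags = (lo, d, up): row i of A contributes lo[i]*x[i-1] + d[i]*x[i]
    + up[i]*x[i+1] (lo[0] and up[n-1] are unused). A ghost zone of width
    k is copied in with each block; values computed there are redundant
    work that substitutes for communication between blocks.
    """
    lo, d, up = diags
    n = len(x)
    V = np.zeros((k + 1, n))
    V[0] = x
    for start in range(0, n, block):
        end = min(start + block, n)
        g0, g1 = max(start - k, 0), min(end + k, n)  # ghost-extended range
        v = x[g0:g1].copy()                          # the one fetch per block
        for j in range(1, k + 1):
            w = np.empty_like(v)
            for i in range(g0, g1):                  # local tridiagonal SpMV
                t = d[i] * v[i - g0]
                if i > g0:
                    t += lo[i] * v[i - 1 - g0]
                if i < g1 - 1:
                    t += up[i] * v[i + 1 - g0]
                w[i - g0] = t
            v = w
            V[j, start:end] = v[start - g0:end - g0] # owned entries are exact
    return V
```

Per block, the single fetch is amortized over k products (the O(k) communication saving), while the ghost-zone work grows roughly as O(k²) in this banded case, which is why the profitable range of k is bounded.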


TABLE 1
Architectural Features of FPGA and GPU (on-chip memory of the GPU refers to shared memory).

Device               | Tech. (nm) | Peak GFLOPs (single-precision) | On-Chip RAM | On-Chip Registers | RAM BW (On-Chip) | Register BW (On-Chip) | Off-Chip Memory BW | Freq. (MHz)
Virtex6-SX475T       | 40         | 450 [18]                       | 4 MB        | 74 KB             | 5.4 TB/s         | 36 TB/s               | 34 GB/s [18]       | 258
Nvidia C2050 (Fermi) | 40         | 1030                           | 672 KB      | 1.7 MB            | 1.3 TB/s [19]    | 8 TB/s [19]           | 144 GB/s           | 1150

While the communication-avoiding approach is promising, there are two main challenges associated with it on parallel architectures: (1) how to keep the redundant computation as low as possible to minimize the computation cost, and (2) how to select the optimal value of the algorithmic parameter k, which minimizes overall runtime by providing a tradeoff between computation and communication cost.
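Challenge (2) can be framed as minimizing a simple runtime model over k. The sketch below is a hypothetical illustration (the model form, `ghost_flops`, and all constants are our assumptions, not the paper's calibrated predictive model): one off-chip fetch is amortized over k SpMVs while redundant flops grow with k, so the modeled time per SpMV has an interior minimum.

```python
def modeled_time_per_spmv(k, bytes_moved, bandwidth, flops_per_spmv,
                          flop_rate, ghost_flops):
    """Hypothetical cost model: fetching the matrix once serves k SpMVs,
    while redundant (ghost-zone) flops grow with k."""
    t_comm = bytes_moved / bandwidth                 # paid once per k steps
    t_comp = (k * flops_per_spmv + ghost_flops(k)) / flop_rate
    return (t_comm + t_comp) / k                     # average cost per SpMV

def best_k(k_max, **model):
    """Pick the unroll factor that minimizes modeled time per SpMV."""
    return min(range(1, k_max + 1),
               key=lambda k: modeled_time_per_spmv(k, **model))

# Illustrative numbers only (not measured): an 8 GB matrix, 2 Gflop per
# SpMV, ghost work growing quadratically in k, on a 144 GB/s device
# sustaining 500 GFLOP/s.
print(best_k(20, bytes_moved=8e9, bandwidth=144e9, flops_per_spmv=2e9,
             flop_rate=5e11, ghost_flops=lambda k: 1e8 * k * k))
```

Under such a model, a bandwidth-rich device pushes the optimum toward small k, while a bandwidth-starved one pushes it toward large k, consistent with the per-architecture tradeoff the paper discusses next.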

In this paper, we show how we can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. As motivation, Figure 1 shows the tradeoff between computation and communication cost that minimizes overall runtime on an FPGA and a GPU. We observe that in the standard iterative solver (k equal to 1), the communication cost is higher on the FPGA than on the GPU due to the marked difference in off-chip memory bandwidth shown in Table 1. However, we see a unique value of k, which trades communication for redundant computation to reduce overall cost, and this value needs to be selected carefully for each architecture. Additionally, we observe that unlike