Design and FPGA Implementation of a Large-Scale Matrix Inversion Accelerator Based on the LDL Algorithm

doi:10.16180/j.cnki.issn1007-7820.2023.07.001

Abstract

Matrix inversion is a fundamental problem in engineering computation. In applications such as large-scale MIMO systems, array signal processing, and image signal processing, the speed of large-scale matrix inversion is critical to system performance. Traditional matrix inversion methods, however, have high computational complexity, low parallelism, and large storage requirements, which makes them poorly suited to hardware acceleration. To address hardware acceleration of large-scale matrix inversion, this paper studies a matrix inversion algorithm based on LDL decomposition and proposes a large-scale matrix inversion accelerator architecture built on it. Because the triangular matrix produced by LDL decomposition has ones on its entire diagonal, the inversion is organized as a block-wise iteration, which reduces the amount of computation and increases speed. The accelerator is implemented on a Xilinx Virtex7 FPGA. Experimental results show that, for a matrix of order 128, the throughput reaches 105.2 Inv·s^-1 and the maximum clock frequency reaches 200 MHz. Compared with existing matrix inversion accelerators, the design uses fewer hardware resources and delivers higher performance.

Key words: LDL decomposition, matrix inversion, Cholesky decomposition, matrix blocking, triangular matrix transformation, matrix multiplication, hardware acceleration, field programmable gate array
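The abstract relies on the fact that the triangular factor of an LDL decomposition has a unit diagonal. The following is a minimal software sketch of that idea in Python with NumPy, assuming a real symmetric positive-definite input. It is not the authors' block-iterative hardware architecture; the function names `ldl_decompose`, `invert_unit_lower`, and `ldl_inverse` are hypothetical and chosen only for illustration.

```python
import numpy as np


def ldl_decompose(a):
    """Factor a symmetric positive-definite matrix as A = L D L^T.

    L is unit lower triangular (ones on the diagonal); D is returned
    as a 1-D array holding its diagonal entries.
    """
    n = a.shape[0]
    l = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        # d_j = a_jj - sum_k l_jk^2 * d_k
        d[j] = a[j, j] - np.dot(l[j, :j] ** 2, d[:j])
        for i in range(j + 1, n):
            # l_ij = (a_ij - sum_k l_ik * l_jk * d_k) / d_j
            l[i, j] = (a[i, j] - np.dot(l[i, :j] * l[j, :j], d[:j])) / d[j]
    return l, d


def invert_unit_lower(l):
    """Invert a unit lower-triangular matrix by forward substitution.

    Because every diagonal entry is 1, no division is needed here.
    """
    n = l.shape[0]
    inv = np.eye(n)
    for j in range(n):
        for i in range(j + 1, n):
            # inv_ij = -sum_{k=j}^{i-1} l_ik * inv_kj
            inv[i, j] = -np.dot(l[i, j:i], inv[j:i, j])
    return inv


def ldl_inverse(a):
    """Compute A^{-1} = L^{-T} D^{-1} L^{-1} from the LDL factors."""
    l, d = ldl_decompose(a)
    l_inv = invert_unit_lower(l)
    # Dividing row k of L^{-1} by d_k applies D^{-1} from the left.
    return l_inv.T @ (l_inv / d[:, None])


# Example input (an assumption for illustration): a well-conditioned SPD matrix.
rng = np.random.default_rng(0)
m = rng.standard_normal((8, 8))
a = m @ m.T + 8 * np.eye(8)
a_inv = ldl_inverse(a)
print(np.allclose(a @ a_inv, np.eye(8)))
```

In this sketch the only divisions are the ones by the entries of D; the triangular inversion itself uses multiply-accumulate operations only, which is the property the abstract points to as reducing the computational load.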

CLC number:

TP309.7

YU Haoran, XIAO Hao. Design and FPGA Implementation of Large Scale Matrix Inversion Accelerator Based on LDL Algorithm[J]. Electronic Science and Technology, 2023, 36(7): 1-7.

Figures and tables (12): Figures 1-10, Tables 1-2
References (19)

[1]	Liu Q, Qin S, Yu B, et al. π-BA:Bundle adjustment hardware accelerator based on distribution of 3D-point observations[J]. IEEE Transactions on Computers, 2020, 69(7):1083-1095.
[2]	Hyukyeon L, Kyungmook O, Minjeong C, et al. Efficient low-latency implementation of CORDIC-based sorted QR decomposition for multi-gbps MIMO systems[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2018, 65(10):1375-1379. doi: 10.1109/TCSII.2018.2853099
[3]	Wang Yang, Wang Xiaolei, Yuan Ziang, et al. A matrix multiplication mapping technology based on NoC multi-core system[J]. Electronic Science and Technology, 2021, 34(5):54-60.
[4]	Liu H, Wang K, Dong P, et al. Curve-driven-based acoustic inversion for photoacoustic tomography[J]. IEEE Transactions on Medical Imaging, 2016, 35(12):2546-2557. pmid: 27352391
[5]	Lutzweiler C, Tzoumas S, Rosenthal A, et al. High-throughput sparsity-based inversion scheme for optoacoustic tomography[J]. IEEE Transactions on Medical Imaging, 2016, 35(2):674-684. doi: 10.1109/TMI.2015.2490799 pmid: 26469127
[6]	Ding L, Deán-Ben X L, Razansky D. Real-time model-based inversion in cross-sectional optoacoustic tomography[J]. IEEE Transactions on Medical Imaging, 2016, 35(8):1883-1891. doi: 10.1109/TMI.2016.2536779 pmid: 26955023
[7]	Liu Lu, Zhang Hongyan, Zhang Liangpei. Hyperspectral image denoising via spectral weighted low rank matrix approximation[J]. Electronic Science and Technology, 2020, 33(5):21-27.
[8]	Wang S S, Tien Y C, Hwang Y T, et al. MVDR based adaptive beamformer design and its FPGA implementation for ultrasonic imaging[C]. Jeju: IEEE Asia Pacific Conference on Circuits and Systems, 2016:143-145.
[9]	Munoz S D, Hormigo J. High-throughput FPGA implementation of QR decomposition[J]. IEEE Transactions on Circuits & Systems II Express Briefs, 2015, 62(9):861-865.
[10]	Chen J, Liang X, Chen Z. Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs[C]. Chicago: IEEE International Parallel and Distributed Processing Symposium, 2016:993-1002.
[11]	Gangarajaiah R, Prabhu H, Edfors O, et al. A Cholesky decomposition based massive MIMO uplink detector with adaptive interpolation[C]. Baltimore: IEEE International Symposium on Circuits and Systems, 2017:1-4.
[12]	Mahajan Y, Obla S, Namboothiripad M K, et al. FPGA-based acceleration of LU decomposition for analog and RF circuit simulation[C]. Pune: Proceedings of the Thirty-third International Conference on VLSI Design and Nineteenth International Conference on Embedded Systems, 2020:131-136.
[13]	Hashemian R. UaL decomposition, an alternative to the LU factorization of MNA matrices[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2020, 67(4):630-634. doi: 10.1109/TCSII.8920
[14]	Xu Y, Li D, Xi Y, et al. Improved predictive controller on FPGA by hardware matrix inversion[J]. IEEE Transactions on Industrial Electronics, 2018, 65(9):7395-7405. doi: 10.1109/TIE.41
[15]	Vosvrda M S. Discrete random signals and statistical signal processing: Charles W. Therrien[J]. Automatica, 1993, 29(6):1617-1623.
[16]	Huang Tingzhu, Zhong Shouming, Li Zhengliang. Matrix theory[M]. Beijing: Higher Education Press, 2003:23-67.
[17]	Chen Zongze. Design and optimization of large-scale matrix inverse operation circuit[D]. Nanjing: Southeast University, 2019:49-56.
[18]	Gaithuru J N, Salleh M, Mohamad I. NTRU inverse polynomial algorithm based on the LU decomposition method of matrix inversion[C]. Miri:IEEE Conference on Application, Information and Network Security, 2017:1-6.
[19]	Li Li, Zhang Wei. Improved Cholesky decomposition algorithm design and FPGA implementation[J]. Telecommunications Technology, 2020, 60(7):845-849.

Matrix order	QR^[9] /ms	Cholesky^[17] /ms	LU^[18] /ms	LDL^[19] /ms	This work /ms
2²	0.685	0.403	0.583	0.304	0.201
2⁴	1.486	0.846	1.462	0.676	0.484
2⁵	7.784	5.357	6.852	4.147	3.543
2⁶	16.475	13.592	15.634	9.769	7.495
2⁷	43.486	25.468	33.582	14.874	10.495
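To make the timing comparison concrete, the short calculation below derives speedup ratios from the order-2⁷ (128) row of the table above. The values are copied from that row; the variable names are illustrative only, and the ratios are simple quotients of the reported times rather than figures stated by the authors.

```python
# Computation times in ms for matrix order 2^7 (128), copied from the table row above.
times_ms = {
    "QR [9]": 43.486,
    "Cholesky [17]": 25.468,
    "LU [18]": 33.582,
    "LDL [19]": 14.874,
}
this_work_ms = 10.495

# Speedup of this work over each compared design: reported time / this work's time.
speedup = {name: t / this_work_ms for name, t in times_ms.items()}

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
```

On these numbers, the reported time for order 128 is roughly 1.4 times shorter than the compared LDL design and roughly 4.1 times shorter than the compared QR design.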

Compared design	QR^[9]	Cholesky^[17]	LU^[18]	LDL^[19]	This work
LUT	179 402	38 291	39 130	59 055	36 402
DSP	195	128	184	1530	116
BRAM	218	216	126	189	128
Latency /μs	237.4	165.6	175.4	131.1	104.8
Frequency /MHz	150	150	100	250	200
Throughput /Inv·s^-1	47.6	73.5	68.5	52.6	105.2