
Design of Synchronization Accelerator in HPC Computing Node

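  • Abstract (Chinese original, translated): With accelerators such as GPUs now widely used in supercomputing, the hardware parallelism of a single supercomputer node is several times, or even tens of times, higher than in the single-core era. In this environment, the communication density of parallel applications within a single chip, within a compute node, and between compute nodes has increased sharply compared with the single-core era, and the communication bottleneck has become increasingly prominent. To address the communication bottleneck caused by high parallelism, this paper proposes a hardware design for a synchronization engine. The engine effectively supports and accelerates frequent small-data transfers between tasks within a compute node (fine-grained synchronization), as well as Barrier and All-reduce collective operations within and across compute nodes, thereby improving the performance of parallel applications. Test results show that, in collective-operation tests at a scale of 16 processes, the synchronization engine achieves about a 4x speedup over a traditional software implementation, and yields about a 20% performance improvement in a triangular matrix factorization (LU decomposition) test program.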


    Abstract: With the wide use of acceleration devices, the hardware parallelism of a single high-performance computing (HPC) node has increased greatly. As a result, both on-chip and inter-node communication have become more frequent, and communication is becoming the bottleneck of system performance. This paper proposes the design of a hardware module, called the synchronization accelerator, that accelerates synchronization communication patterns: fine-grained synchronization, barrier, and all-reduce. At a scale of 16 processes, the synchronization accelerator achieves about a 4x speedup over software-based collective operations. In addition, the LU benchmark improves by about 20% when the synchronization accelerator is used.
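
    For readers unfamiliar with the collective operations named above, the following is a minimal sketch of what a purely software barrier and all-reduce (sum) look like. It is not the paper's design or its software baseline. It assumes threads standing in for processes, a shared buffer, and Python's standard threading.Barrier; the function name software_allreduce_sum is hypothetical.

```python
import threading


def software_allreduce_sum(values):
    """Sum one value per worker and hand every worker the total.

    Illustrative only: threads stand in for processes, and the
    function name and shared-buffer layout are assumptions.
    """
    n = len(values)
    barrier = threading.Barrier(n)  # software barrier across n workers
    slots = [0] * n                 # one contribution slot per worker
    results = [None] * n            # total as seen by each worker

    def worker(rank):
        slots[rank] = values[rank]  # publish the local contribution
        barrier.wait()              # barrier: all contributions are written
        results[rank] = sum(slots)  # reduce: each worker computes the sum
        barrier.wait()              # barrier: all workers have finished reading

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # 16 workers, matching the process count used in the abstract's test.
    print(software_allreduce_sum(list(range(16))))
```

    Every worker must pass through two barrier waits per all-reduce in this sketch; that per-operation synchronization cost is the kind of overhead a hardware synchronization module aims to reduce.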


