Hardware Accelerated Dynamic RDMA Method for Gigabit Ethernet

LI Long-fei; SHI Yang-chun; WANG Jian-feng; HE Zhan-zhuang

doi:10.3969/j.issn.1001-0548.2018.05.006

Waiting for connection establishment can make remote direct memory access (RDMA) transport over Gigabit Ethernet inefficient. To solve this problem, this paper proposes a hardware-accelerated dynamic RDMA method. The method allows RDMA transport to start at the same time as connection establishment by first sending packets to an indirection buffer in the network interface card (NIC) and later copying them to host memory. Terminal models of the dynamic RDMA method are built, and simulation platforms for performing experiments are developed. Experimental results show that the dynamic RDMA method not only solves the long-wait problem of connection establishment at little extra hardware cost, but also provides higher transport performance than traditional RDMA methods.

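According to Gilder's law and Moore's law, network bandwidth grows at least three times as fast as CPU computing performance (that is, it doubles every 6 months)^[1]. As a result, the bottleneck limiting system performance has in recent years gradually shifted from network bandwidth to CPU computing performance, and reducing the CPU resources consumed by network data is an effective way to improve system performance. As a key InfiniBand technology, RDMA has attracted wide industry attention ever since it was proposed. An RDMA operation reads or writes the memory of a remote application directly, and the remote node's CPU is not involved at any point. This "zero-copy" operation consumes no CPU resources on the remote node and gives InfiniBand high-speed data transfer characteristics that traditional TCP/IP cannot achieve.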

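The spread of Gigabit Ethernet means that RDMA is no longer exclusive to InfiniBand. However, because Gigabit Ethernet is an unreliable network, some technical means must be adopted to guarantee reliable RDMA transport. InfiniBand provides hardware-level end-to-end reliable transport, so RDMA technologies represented by RoCE (RDMA over Converged Ethernet) propose using hardware devices to guarantee lossless transmission at the data link layer^[2]. This approach, however, requires changes to the network topology, is costly, and is ill-suited to retrofitting existing network environments. The other approach is connection-oriented transport, in which upper-layer protocols guarantee the reliability of RDMA transport, as in iWARP^[3]. Besides upper-layer protocol support, this approach also requires an RDMA-capable NIC (RNIC for short), but compared with the former it needs only minor hardware changes^[4] and is compatible with existing networks and devices. It has therefore gradually become a focus and research direction for many researchers in China and abroad.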

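Among connection-oriented approaches, reference [5] optimizes flow control and retransmission control to reduce the retransmission overhead when data is lost, achieving fast RDMA retransmission; reference [6] proposes a lightweight RDMA engine that uses RNIC hardware assistance to apply RDMA in virtual machines; reference [7] focuses on memory registration during RDMA and proposes a registered memory pool based on dynamic linked lists to reduce how often memory registration is performed; reference [8] targets RNIC acceleration design and proposes a batch forwarding mechanism to improve RDMA throughput when data frames are small; reference [9] addresses out-of-order arrival and rerouting of RDMA packets and proposes a dynamic connection solution. In connection-oriented RDMA transport, a connection is established through a handshake. If the initiator waits too long for the responder's reply, the performance advantage of RDMA is weakened. None of the works above considers the impact of a long connection establishment time on RDMA performance. In practice, however, link frame loss caused by unreliable Ethernet transmission, memory registration failures, and other causes can all make the initiator wait for a long time, which degrades transport performance^[10-11].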

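To address this problem, this paper proposes a hardware-supported dynamic RDMA communication method for Gigabit Ethernet. The method remains compatible with the traditional RDMA communication method and, on that basis, introduces indirect RDMA transfer: by extending the RNIC hardware logic and reserving corresponding buffer resources, it lets the initiator perform RDMA transfers before the connection has been successfully established. The paper also gives a method for dynamically switching between the direct and indirect transfer modes, and builds end-system models for simulation experiments. Compared with other studies, the main contributions of this paper are as follows: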

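1) It is the first to propose indirect RDMA transfer and the corresponding method for dynamically switching transfer modes, namely the dynamic RDMA communication method.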

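2) It builds end-system models based on a Gigabit Ethernet NIC prototype and uses them to compare the transport performance of the dynamic and traditional RDMA communication methods.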
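
The indirect transfer and mode switching described above can be pictured with a small Python sketch. It is not the authors' hardware design: the class `ResponderRnic`, its method names, the byte-counted buffer capacity, and the policy of rejecting frames when the buffer is full are all assumptions made for illustration. It only shows the idea stated in the text, namely that frames arriving before the connection is established are held in a buffer reserved on the RNIC and copied to host memory afterwards, while later frames are written directly.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Mode(Enum):
    DIRECT = "direct"      # payload is written straight to host memory
    INDIRECT = "indirect"  # payload is parked in the RNIC buffer first


@dataclass
class ResponderRnic:
    """Toy responder-side RNIC model; every name and size is hypothetical."""
    buffer_capacity: int = 4096        # assumed size of the reserved buffer (bytes)
    connected: bool = False            # True once the handshake has completed
    indirection_buffer: list = field(default_factory=list)  # (offset, payload)
    host_memory: dict = field(default_factory=dict)         # offset -> payload

    def buffered_bytes(self) -> int:
        return sum(len(payload) for _, payload in self.indirection_buffer)

    def receive(self, offset: int, payload: bytes) -> Optional[Mode]:
        """Accept one data frame and report which mode handled it."""
        if self.connected:
            self.host_memory[offset] = payload
            return Mode.DIRECT
        if self.buffered_bytes() + len(payload) <= self.buffer_capacity:
            self.indirection_buffer.append((offset, payload))
            return Mode.INDIRECT
        return None  # assumed policy: buffer full, frame must be resent later

    def on_connection_established(self) -> None:
        """Switch to direct mode and drain the buffer into host memory."""
        self.connected = True
        for offset, payload in self.indirection_buffer:
            self.host_memory[offset] = payload
        self.indirection_buffer.clear()
```

In this sketch the switch from indirect to direct mode is triggered by a single event, the completion of the handshake. A real RNIC would also have to handle ordering, acknowledgment, and retransmission, which are left out here.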

3. Conclusion

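This paper proposes a dynamic RDMA communication method for Gigabit Ethernet. By reserving resources in the RNIC, it lets RDMA data transfer proceed concurrently with connection establishment, overcoming the loss of transfer efficiency caused by long waits for a reply during connection establishment. Models were built and experiments carried out in a Gigabit Ethernet environment. The results show that the dynamic RDMA communication method can effectively improve data transfer efficiency in unreliable network environments.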

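Future work will target specific upper-layer protocols, modifying and optimizing them so that the dynamic RDMA communication method can be applied in real network environments, and will evaluate and analyze its performance experimentally.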

References (18)

[1]	BROWN D. Are new approaches needed for developing long-term strategies for STEM information?[J]. Learned Publishing, 2017, 30(3): 21-28.
[2]	KAGAN M. Performance evaluation of the RDMA over Ethernet (RoCE) standard in enterprise data centers infrastructure[C]//The 23rd International Teletraffic Congress. San Francisco, USA: ACM, 2011: 9-15.
[3]	GRANT R E, RASHTI M J, AFSAHI A, et al. RDMA capable iWARP over datagrams[C]//Parallel & Distributed Processing Symposium. Boston, USA: IEEE, 2013: 628-639.
[4]	GUO C, WU H, SONI G, et al. RDMA over commodity Ethernet at scale[C]//Conference on ACM SIGCOMM. Florianópolis, Brazil: ACM, 2016: 202-215.
[5]	WANG Shao-gang, XU Wei-xia, WU Dan, et al. Fast NIC based RDMA implementation for adaptive unreliable networks[C]//The 11th International Conference on Computer Systems and Applications (AICCSA). Ifrane, Morocco: IEEE, 2014: 302-309.
[6]	MOUZAKITIS A, PINTO C, NIKOLAEV N, et al. Lightweight and generic RDMA engine para-virtualization for the KVM hypervisor[C]//International Conference on High Performance Computing & Simulation. Genoa, Italy: IEEE, 2017: 737-744.
[7]	DONG Yong, ZHOU En-qiang, LU Yu-tong. The implementation of communicating operation in hybrid hierarchy file system H2FS with TH-Express 2[J]. Chinese Journal of Computers, 2017, 40(9): 1961-1979. (in Chinese)
[8]	MA S, KIM J, MOON S. Exploring low-latency interconnect for scaling out software routers[C]//IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era. Barcelona, Spain: IEEE, 2016: 9-15.
[9]	XIA Jun, PANG Zheng-bin, LIU Lu. Design and implementation of a NIC based RDMA reliable communication protocol[J]. Computer Engineering and Science, 2014, 36(2): 216-221. doi: 10.3969/j.issn.1007-130X.2014.02.005 (in Chinese)
[10]	FREY P W, ALONSO G. Minimizing the hidden cost of RDMA[C]//The 29th IEEE International Conference on Distributed Computing Systems. Montreal, Canada: IEEE, 2009: 553-560.
[11]	MAC A P, RUSSELL R D. A performance study to guide RDMA programming decisions[C]//The 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS). Liverpool, United Kingdom: IEEE, 2012: 778-785.
[12]	SUBRAMONI H, LAI P, LUO M, et al. RDMA over Ethernet-a preliminary study[C]//IEEE International Conference on Cluster Computing and Workshops. New Orleans, USA: IEEE, 2009: 1-9.
[13]	ZHANG W, HAO M, XU Z. Communication optimization for RDMA-based science data transmission tools[J]. The Journal of Supercomputing, 2016, 72(9): 3312-3327.
[14]	MAC A P, RUSSELL R D. An efficient method for stream semantics over RDMA[C]//The 28th International Parallel and Distributed Processing Symposium. Phoenix, USA: IEEE, 2014: 841-851.
[15]	SU Wen, ZHANG Long-bing, GAO Xiang. A cache locking and direct cache access based network processing optimization method[J]. Journal of Computer Research and Development, 2014, 51(3): 681-690. (in Chinese)
[16]	TANENBAUM A S, WETHERALL D J. Computer networks[M]. New Jersey, USA: Pearson, 2011.
[17]	JIN H W, NARRAVULA S, BROWN G, et al. Performance evaluation of RDMA over IP: a case study with the Ammasso Gigabit Ethernet NIC[C]//The 14th IEEE International Symposium on High Performance Distributed Computing (HPDC-14). Virginia, USA: IEEE, 2005: 598-605.
[18]	LI Long-fei, HE Zhan-zhuang, WANG Jian-feng, et al. Implementation of gigabit ethernet controller with fault tolerance and prevention mechanism[C]//2017 Prognostics and System Health Management Conference (PHMHarbin). Harbin, China: IEEE, 2017: 1-8.

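Parameter	Description
TID	ID of this RDMA transfer
TYPE	Transfer type (read or write)
MODE	Transfer mode (direct or indirect)
SBA	Base address of the initiator's memory region for this transfer
DBA	Base address of the responder's memory region for this transfer
LEN	RDMA transfer data length (bytes)
SN	Sequence number of the RDMA data frame
ACKN	Acknowledgment number of the RDMA data frame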
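
As a rough illustration of how the parameters in this table could travel in a frame header, the Python sketch below packs them into a fixed-size binary layout. The source gives only the field names and meanings; the field widths, the field order, the network byte order, the numeric encodings of TYPE and MODE, and the names `RdmaHeader`, `HEADER_FORMAT`, and `HEADER_SIZE` are all assumptions chosen for the example.

```python
import struct
from dataclasses import dataclass, astuple

# Assumed layout (not from the source), big-endian with no padding:
# TID u16, TYPE u8, MODE u8, SBA u64, DBA u64, LEN u32, SN u32, ACKN u32
HEADER_FORMAT = "!HBBQQIII"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


@dataclass(frozen=True)
class RdmaHeader:
    tid: int        # TID: ID of this RDMA transfer
    xfer_type: int  # TYPE: assumed encoding 0 = read, 1 = write
    mode: int       # MODE: assumed encoding 0 = direct, 1 = indirect
    sba: int        # SBA: base address of the initiator's memory region
    dba: int        # DBA: base address of the responder's memory region
    length: int     # LEN: transfer data length in bytes
    sn: int         # SN: data frame sequence number
    ackn: int       # ACKN: data frame acknowledgment number

    def pack(self) -> bytes:
        """Serialize the header fields in declaration order."""
        return struct.pack(HEADER_FORMAT, *astuple(self))

    @classmethod
    def unpack(cls, raw: bytes) -> "RdmaHeader":
        """Parse a header from the first HEADER_SIZE bytes of a frame."""
        return cls(*struct.unpack(HEADER_FORMAT, raw[:HEADER_SIZE]))
```

With these assumed widths the header occupies 32 bytes. For example, `RdmaHeader(7, 1, 1, 0x1000, 0x2000, 1024, 3, 2).pack()` yields a byte string that `RdmaHeader.unpack` turns back into an equal object.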

Stratix V 5SGSMD4E2H29I3	Traditional RDMA			Dynamic RDMA
Resource	Used	Total	Utilization/%	Used	Total	Utilization/%
ALMs	106 754	135 840	79	111 846	135 840	82
RAM Blocks	288	957	30	302	957	32
Block memory bits	3 323 924	19 599 360	17	3 717 140	19 599 360	19
Pins	247	416	59	247	416	59
PLLs	4	40	10	4	40	10
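
The extra hardware cost of the dynamic method can be read off this table by subtracting the traditional figures from the dynamic ones. The short Python sketch below does that arithmetic for the three resources that differ; the numbers are copied from the table, while the dictionary and function names are only illustrative.

```python
# Device totals and usage figures copied from the resource table.
TOTALS = {"ALMs": 135_840, "RAM Blocks": 957, "Block memory bits": 19_599_360}
TRADITIONAL = {"ALMs": 106_754, "RAM Blocks": 288, "Block memory bits": 3_323_924}
DYNAMIC = {"ALMs": 111_846, "RAM Blocks": 302, "Block memory bits": 3_717_140}


def extra_cost(resource: str) -> tuple:
    """Return (additional units used, additional share of the device in %)."""
    delta = DYNAMIC[resource] - TRADITIONAL[resource]
    return delta, 100.0 * delta / TOTALS[resource]


if __name__ == "__main__":
    for name in TOTALS:
        delta, pct = extra_cost(name)
        print(f"{name}: +{delta} units, +{pct:.2f}% of the device")
```

The dynamic design uses 5 092 more ALMs, 14 more RAM blocks, and 393 216 more block memory bits (48 KiB), each a few percent of the device or less. Pins and PLLs are unchanged.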


Corresponding author: 陈斌 (CHEN Bin), bchen63@163.com
