Abstract:
Multi-stage networks are widely used in machine learning clusters. Due to the large number of available paths in a multi-stage network, packet routing is a combinatorial optimization problem. Existing routing algorithms based on heuristics lack performance guarantee, which seriously affects the packet transmission delay. This paper proposes a packet routing method based on reinforcement learning for multi-stage networks, using a novel policy iteration algorithm to compute an optimal routing policy by learning. In the policy evaluation step, this algorithm uses the maximum likelihood estimator of the value function, which overcomes the low sample efficiency problem of Monte Carlo (MC) or Temporal-Difference (TD) value function estimators in reinforcement learning. To deal with the high computational complexity of the combinatorial optimization problem in the policy improvement step, this algorithm decomposes the optimization over a combinatorial action space into a sequential optimization of each action. Experiments based on NS-3 network simulator show that the routing policy learnt by the algorithm reduces 13.9% of the average packet transmission delay compared to existing best routing heuristics.