Abstract:
Extractive document summarization uses certain strategies to select some sentences from lengthy texts to form a summary, whose key is to use as much semantic and structural information of the text as possible. In order to better mine such information and then use it to guide the summarization, an extractive document summarization model based on heterogeneous graph and keywords (HGKSum) is proposed, which models the text as a heterogeneous graph composed of sentence nodes and word nodes. The model uses the graph attention networks to learn the features of the nodes in the graph. The multi-task learning is applied to the model, which considers the keywords extraction task as an auxiliary task of the document summarization task. The candidate summary which derived from the prediction of the neural networks in the model is often highly redundant, so the model refines it to create the final summary of low redundancy. The comparative experiment on the document summarization benchmark shows that the proposed model outperforms the baselines. Besides, ablation studies also demonstrate the necessity of introducing heterogeneous nodes and keywords.