Abstract:
Indicators of compromise (IOCs), as behavioral descriptions of cyber threats, are important evidence for identifying and defending against cyberattacks. Current IOC recognition mainly relies on deep neural network models, whose effectiveness depends on a large amount of training data. However, the field of IOC recognition currently lacks recognized datasets: IOCs can only be labeled manually by security experts, so labeling is costly and large amounts of labeled data are difficult to obtain. To solve this problem, we propose a threat intelligence IOC identification method based on active learning, called ICAL (IOC identification combined with active learning). The method first selects initial samples for manual labeling according to their representativeness; it then pseudo-labels the clustered samples according to the cluster hypothesis; finally, it iteratively selects further samples for labeling according to their uncertainty until the termination conditions are satisfied. Using CNNPLUS as the classification model, we perform experiments on a self-built threat intelligence dataset. The results show that ICAL reduces the labeling workload by nearly 58% compared with traditional automatic IOC identification strategies, while reaching a recognition accuracy of 94.2%. ICAL thus reduces the amount of labeled data required for IOC identification and is highly practical.
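The three stages described above (representativeness-based seeding, cluster-based pseudo-labeling, and uncertainty-driven querying) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method as implemented: samples are toy 2-D points instead of threat intelligence text, a small k-means with farthest-first initialisation stands in for the clustering step, a nearest-class-mean classifier stands in for CNNPLUS, uncertainty is measured as the margin between the two nearest class means, and the termination condition is a fixed manual-labeling budget. The names `ical_sketch`, `pseudo_frac`, `batch`, `budget`, and `demo` are hypothetical.

```python
import math
import random


def farthest_first(points, k):
    """Deterministic centroid initialisation: start at points[0], then add the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def nearest(centroids, p):
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))


def kmeans(points, k, iters=20):
    centroids = farthest_first(points, k)
    for _ in range(iters):
        assign = [nearest(centroids, p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, [nearest(centroids, p) for p in points]


def class_means(points, labels):
    """Stand-in classifier (assumption, not CNNPLUS): one mean vector per class."""
    means = {}
    for y in set(labels.values()):
        members = [points[i] for i, lab in labels.items() if lab == y]
        means[y] = tuple(sum(x) / len(members) for x in zip(*members))
    return means


def predict(means, p):
    return min(means, key=lambda y: math.dist(p, means[y]))


def margin(means, p):
    """Gap between the two nearest class means; a small gap means high uncertainty."""
    d = sorted(math.dist(p, m) for m in means.values())
    return d[1] - d[0] if len(d) > 1 else float("inf")


def ical_sketch(points, oracle, k=2, pseudo_frac=0.5, batch=2, budget=8):
    centroids, assign = kmeans(points, k)
    manual, pseudo = {}, {}
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        members.sort(key=lambda i: math.dist(points[i], centroids[c]))
        # Stage 1: manually label the most representative sample of the cluster.
        rep = members[0]
        manual[rep] = oracle(rep)
        # Stage 2: cluster hypothesis, so samples nearest the centroid inherit that label.
        for i in members[1:1 + int(pseudo_frac * len(members))]:
            pseudo[i] = manual[rep]
    # Stage 3: query the most uncertain unlabeled samples until the budget is spent.
    while len(manual) < budget:
        means = class_means(points, {**pseudo, **manual})
        pool = [i for i in range(len(points)) if i not in manual and i not in pseudo]
        if not pool:
            break
        pool.sort(key=lambda i: margin(means, points[i]))
        for i in pool[:min(batch, budget - len(manual))]:
            manual[i] = oracle(i)
    return class_means(points, {**pseudo, **manual}), len(manual)


def demo(seed=0, n_per_class=30):
    """Toy data: two well-separated Gaussian blobs with known labels."""
    rng = random.Random(seed)
    points, truth = [], []
    for label, centre in enumerate([(0.0, 0.0), (5.0, 5.0)]):
        for _ in range(n_per_class):
            points.append((rng.gauss(centre[0], 0.5), rng.gauss(centre[1], 0.5)))
            truth.append(label)
    means, n_queries = ical_sketch(points, oracle=lambda i: truth[i])
    acc = sum(predict(means, p) == t for p, t in zip(points, truth)) / len(points)
    return acc, n_queries
```

Calling `demo()` returns the accuracy on the toy data together with the number of samples sent to the oracle, which plays the role of the security expert. The point of the sketch is that the manual-labeling count is bounded by `budget` while the remaining labels come from pseudo-labeling; it makes no claim about the workload reduction or accuracy figures reported above.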