Towards Large Scale Packet Capture and Network Flow Analysis on Hadoop
Network traffic continues to grow yearly at a compounded rate. However, network traffic is still being analyzed on vertically scaled machines that do not scale as well as distributed computing platforms. Hadoop's horizontally scalable ecosystem provides a better environment for processing these network captures stored in packet capture (PCAP) files. This paper proposes a framework called hcap for analyzing PCAPs on Hadoop, inspired by the Réseaux IP Européens' (RIPE's) existing hadoop-pcap library but built completely from the ground up. The hcap framework improves several aspects of the hadoop-pcap library, namely protocol, error, and log handling. Results show that, while other methods still outperform hcap, it not only performs better than hadoop-pcap by 15% in scan queries and 18% in join queries, but is also more tolerant of broken PCAP entries, which reduces preprocessing time and data loss. It also speeds up the conversion process used in other methods by 85%.
Saavedra, M. Z. N. L., & Yu, W. E. S. (2018). Towards large scale packet capture and network flow analysis on Hadoop. 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), 186–189. https://doi.org/10.1109/CANDARW.2018.00043