Towards Large Scale Packet Capture and Network Flow Analysis on Hadoop

Document Type

Conference Proceeding

Publication Date



Network traffic continues to grow yearly at a compounded rate. However, network traffic is still being analyzed on vertically scaled machines that do not scale as well as distributed computing platforms. Hadoop's horizontally scalable ecosystem provides a better environment for processing these network captures stored in packet capture (PCAP) files. This paper proposes a framework called hcap for analyzing PCAPs on Hadoop inspired by the Rseaux IP Europens' (RIPE's) existing hadoop-pcap library but built completely from the ground up. The hcap framework improves several aspects of the hadoop-pcap library, namely protocol, error, and log handling. Results show that, while other methods still outperform hcap, it not only performs better than hadoop-pcap by 15% in scan queries and 18% in join queries, but it's more tolerant to broken PCAP entries which reduces preprocessing time and data loss, while also speeding up the conversion process used in other methods by 85%.