Developing a framework for horizontally scalable network flow analytics on the Hadoop ecosystem

Date of Award


Document Type


Degree Name

Master of Science in Computer Science, Straight


Information Systems & Computer Science

First Advisor

Yu, William Emmanuel S., Ph.D.


This study proposes hcap, an improved framework for analyzing raw network data on Hadoop. The framework is evaluated against three common methods currently used for this type of analytics: conversion to text, conversion to Parquet, and direct parsing of PCAP binaries with the hadoop-pcap library, both with and without logs. The comparison was conducted with four key performance indicators: preprocessing, storage efficiency, data retention, and query response time. Because the original hadoop-pcap framework failed to process larger datasets, its version with logs suppressed was used for the evaluation instead. Results show that, in terms of query response time, Parquet outperforms hcap by 90% and hadoop-pcap with its logs suppressed by 96%. Text also runs 80% faster than hcap and 92% faster than hadoop-pcap with its logs suppressed; compared to Parquet, however, text runs 30% slower in scan and aggregate queries, 70% slower in joins, and 40% slower in aggregate-joins. The framework created in this study not only provides an improved method for parsing PCAP binaries on Hadoop, outperforming hadoop-pcap by at least 20%, but also provides an alternative technique for conversion to Parquet, reducing preprocessing time by a factor of 5.


The C7.S238 2018