Measuring the Performance of An Object-Based Multi-Cloud Data Lake

Document Type

Conference Proceeding

Publication Date



As the data generated by society grows larger and less structured, more organizations are implementing data lakes in the public cloud to store, process, and analyze it. However, concerns over the availability of this data and the potential for vendor lock-in are leading more users to adopt a multi-cloud approach. This study investigates the viability of that approach for data lake use cases. Results indicate that a multi-cloud data lake can potentially be implemented with less than a 1% impact on query run times, at the cost of a 300% increase in one-time loading. This opens the door for future work on algorithms and implementations that leverage multi-cloud deployments to enhance availability, scalability, and cost optimization.