Undergraduate Research at Jefferson Lab - Analysis of Spark as a Data Engine

Analysis of Spark as a Data Engine

Student: Ethan Buck

School: College of William and Mary

Mentored By: Amber Boehnlein

ROOT is a data analysis tool specifically tailored to nuclear physics. As ROOT ages, alternatives such as Spark become more appealing, because Spark offers more resources and computing power than ROOT's antiquated tools. Spark APIs provide powerful support for automatic parallelization, which creates opportunities to take advantage of a cloud computing service such as Amazon Web Services (AWS). Spark can be used from many programming languages, including Python, R, and Scala, allowing a data analyst to work in their preferred environment. The focus of this project is to evaluate the potential of using Spark in combination with AWS to read current ROOT data files and to provide more resources for data analytics. Spark, Jupyter Notebook, Python, and required Python packages such as py4j were installed locally and tested first with a single 1 GB ROOT file. Then an AWS account was created and the same software was installed on an Amazon Elastic Compute Cloud (EC2) instance, where the notebook file was tested as well. Finally, many ROOT files from the Zeus experiment were combined, and a sample data analysis was done to show how well Spark handles filtering of larger data sets. This project showed that Spark has the potential to be a more powerful data analysis platform, and that transferring some analysis of Jefferson Lab data sets to this data engine could give analysts more resources and a more diverse skill set for future opportunities. Further evaluation is needed before switching to the platform, in particular by attempting to load into Spark other ROOT files that have a different schema structure or different data containers.
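The load-and-filter workflow described above might look roughly like the following PySpark sketch. The abstract does not say how ROOT files were read into Spark, so this assumes the third-party spark-root data source (format name `org.dianahep.sparkroot`) is on the Spark classpath. The function names, the file path, and the column name are hypothetical and chosen only for illustration.

```python
# Sketch only: the abstract does not name the ROOT reader that was used.
# Assumption: the third-party spark-root data source is available to Spark.
ROOT_FORMAT = "org.dianahep.sparkroot"


def build_cut(column, minimum):
    """Return a SQL filter expression, e.g. 'pt > 10.0' (column name is hypothetical)."""
    return f"{column} > {float(minimum)}"


def filter_root_events(path, column, minimum):
    """Load the ROOT data at `path` into a DataFrame and keep rows passing the cut."""
    # Imported lazily so build_cut can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("root-filter-sketch").getOrCreate()
    events = spark.read.format(ROOT_FORMAT).load(path)
    # Spark applies the filter partition by partition; no explicit
    # parallel code is needed on the analyst's side.
    return events.filter(build_cut(column, minimum))
```

With Spark and the assumed reader installed, a call such as `filter_root_events("events.root", "pt", 10.0).count()` would return the number of rows passing the cut. The same call could run unchanged on a local machine or on an EC2 instance, since only the Spark session configuration differs.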
