Analysis of Spark as a Data Engine
Student: Ethan Buck
School: College of William and Mary
Mentored By: Amber Boehnlein
ROOT is a data analysis tool specifically tailored to nuclear physics. As ROOT ages, alternatives such as Spark become more appealing, because Spark offers more resources and computing power than ROOT's aging tools. Spark's APIs provide powerful support for automatic parallelization, which creates opportunities to take advantage of a cloud computing service such as Amazon Web Services (AWS). Spark can be used from several programming languages, including Python, R, and Scala, allowing data analysts to work in their preferred environment. The focus of this project is to evaluate the potential of using Spark in combination with AWS to read current ROOT data files and to provide more resources for data analytics. Spark, Jupyter Notebook, Python, and required Python packages such as py4j were installed locally and tested first with a single 1 GB ROOT file. Then an AWS account was created and the same software stack was installed on an Amazon Elastic Compute Cloud (EC2) instance, where the notebook was tested as well. Finally, many ROOT files from the Zeus experiment were combined, and a sample data analysis was performed to show how well Spark handles filtering of larger data sets. This project showed that Spark has the potential to be a more powerful data analysis platform, and that moving some analysis of Jefferson Lab data sets to this data engine could give analysts more resources and a more diverse skill set for future opportunities. Further evaluation is needed before switching platforms, in particular by loading ROOT files with different schema structures or different data containers into Spark.
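The workflow described above (read ROOT files into Spark, then filter them) might look roughly like the following minimal PySpark sketch. It is an illustration under stated assumptions, not the project's actual notebook: the abstract does not name the ROOT reader that was used, so the third-party spark-root data source (format name `org.dianahep.sparkroot`) is assumed here, and the column names `energy` and `ntracks`, the path `data/*.root`, and the helper names are hypothetical.

```python
from typing import Iterable

# Assumption: the third-party spark-root data source is available to Spark
# (for example, supplied through the --packages option at launch).
SPARK_ROOT_FORMAT = "org.dianahep.sparkroot"


def build_cut(conditions: Iterable[str]) -> str:
    """Join individual selection conditions into one SQL filter expression."""
    return " AND ".join(f"({c})" for c in conditions)


def load_and_filter(spark, paths, cut):
    """Read ROOT files into a Spark DataFrame and keep rows passing `cut`.

    `spark` is an existing SparkSession; `paths` is one path (globs allowed)
    or a list of paths; `cut` is a SQL expression string.
    """
    df = spark.read.format(SPARK_ROOT_FORMAT).load(paths)
    return df.filter(cut)


def run_example(paths="data/*.root"):
    """Hypothetical end-to-end run; requires pyspark and the ROOT reader."""
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("root-in-spark").getOrCreate()
    # Hypothetical column names, for illustration only.
    cut = build_cut(["energy > 10.0", "ntracks >= 2"])
    selected = load_and_filter(spark, paths, cut)
    n_selected = selected.count()  # the action that triggers the parallel job
    spark.stop()
    return n_selected
```

Calling `run_example()` on a machine with Spark and the reader installed would return the number of events passing the cut. Because Spark evaluates lazily, the files are only scanned when `count()` runs, and the same code should work unchanged on a laptop or on an EC2 instance.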