
Undergraduate Research at Jefferson Lab

Developing and Implementing a Web Scraper to Identify Personally Identifiable Information

Student: Jake Brown

School: Tennessee Technological University

Mentored By: Greg Nowicki

The Cyber Security group at JLab has a program that searches JLab users' User Web pages for targeted sensitive information, including personally identifiable information (PII). The program is a web scraper that looks for specifically formatted strings of text that could be sensitive: strings that match the requirements for a JLab user account password, and social security numbers in United States or United Kingdom formats. It can search any URL for targeted information using regular expressions written in Python, but it is currently being used to search the URLs of the JLab User Web sites. Because users upload their own content to these sites, it would be easy to upload a file that contains sensitive information without realizing it. The program extracts the raw text of a webpage, splits it into words, and uses the regular expressions to pick out the words that match the targeted information. When a word matches the format or character requirements given by a regular expression, the program creates a .txt file that contains the potentially sensitive string and the URL where it was found. The program is being used to search for U.S. Social Security numbers (SSNs) and U.K. National Insurance numbers (NINs), and it has found two valid U.S. SSNs within the JLab User Web sites. This shows that PII on the JLab User Web space poses potential cyber security risks. Because any JLab user can have one of these sites and can upload any information they wish, some sensitive information was likely to be present in the space. The discovery of SSNs within the User Web sites suggests that more PII may have been uploaded there. These findings will help JLab coach employees and users on more secure cyber security practices.
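The matching step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the program used at JLab: the abstract does not give the actual regular expressions, so the two patterns below are assumptions based on the published U.S. Social Security number format and a simplified U.K. National Insurance number format (the U.K. identifier the abstract abbreviates as NIN). The names `PATTERNS`, `TextExtractor`, `scan_html`, and `write_report` are hypothetical, and the password check is omitted because the JLab password requirements are not stated.

```python
import re
from html.parser import HTMLParser

# Assumed patterns for illustration; the program's real expressions are not published.
PATTERNS = {
    # Area is not 000, 666, or 900-999; group is not 00; serial is not 0000.
    "US_SSN": re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b"),
    # Simplified: two permitted prefix letters, six digits, suffix letter A-D, no spaces.
    "UK_NIN": re.compile(r"\b[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z]\d{6}[A-D]\b"),
}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML page, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def scan_html(html, url):
    """Return (label, matched string, url) for every word that matches a pattern."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    findings = []
    # Split the raw text into whitespace-separated words and test each one.
    for word in " ".join(parser.chunks).split():
        for label, pattern in PATTERNS.items():
            match = pattern.search(word)
            if match:
                findings.append((label, match.group(), url))
    return findings


def write_report(findings, path):
    """Write one tab-separated line per finding to a plain-text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, value, url in findings:
            handle.write(f"{label}\t{value}\t{url}\n")
```

To scan a live page, the HTML could be fetched with the standard library, for example `urllib.request.urlopen(url).read().decode("utf-8", errors="replace")`, and passed to `scan_html` together with the URL. Because the sketch tests one word at a time, it would miss numbers written with internal spaces, and a format match alone does not prove that a string is a real identifier, so any hit would still need manual review.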

