YPIE Scientist: Jaden

Research: A Novel Natural Language Processing Approach for Analyzing Lengthy Terms and Condition Agreements

Mentors: Sol Vitkin, Chester Curme

Awards: NAACP ACT-SO Competition 2021


Privacy is a growing concern in an increasingly digital world, where very few users read the terms and conditions they accept. The “Notice and Choice” framework requires organizations to inform users about how their data is collected and used. Once a user accepts the terms, companies can exchange the user's data with third-party platforms for profit. Very few practices address these issues, and some have been described as “ineffective” and “unattainable”. The purpose of this research is to develop an improved automated process that uses computational linguistics to inform users about a company's policies and the agreements they accepted hastily. Data is obtained from an open-source website to help build a better and more efficient model for the experiment. Additional sources include the website Polisis, developed by monitoring Google Trends and professionally annotated policies. The program takes each individual data item and organizes it in JSON format (a standard file format for storing data), cleaning and storing the data for inspection and for training the model. The first approach used a bag-of-words form of word vectorization to classify quoted policy text as good or bad, with a simple model trained on a data set of 10 JSON files pulled at random. Each point and quote from the policies was analyzed, and the model reached an AUC (Area Under the Curve) of around 0.8. AUC measures a classification model's performance, that is, how well it distinguishes between classes. With such a small data set, there is a possibility of data manipulation, which leaves a large margin of error in the reported accuracy. The software may therefore be fully functional but inefficient at its current stage. Even with a classifier and count vectorizer in place, there is a lot of room for optimization.
For future research, this work will be helpful as an acknowledgment of the concerning ways companies use consumer data, and it is a first step toward solutions that grant individuals digital privacy and protection.
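
The bag-of-words classification and AUC measurement described above can be illustrated with a minimal sketch. It is not the project's actual code: the records in RAW are hypothetical, the label convention (1 = good, 0 = bad) is an assumption, and the naive Bayes style scoring is a stand-in for the unspecified "simple model". Only the Python standard library is used.

```python
import json
import math
import re
from collections import Counter

# Hypothetical records standing in for quotes loaded from the JSON files.
# Assumed label convention: 1 = "good" quote, 0 = "bad" quote.
RAW = json.loads("""[
  {"quote": "We never sell your data", "label": 1},
  {"quote": "We sell your data to third parties", "label": 0}
]""")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokenize(text))


def train(records):
    """Collect per-class word counts and the shared vocabulary."""
    counts = {0: Counter(), 1: Counter()}
    for record in records:
        counts[record["label"]].update(bag_of_words(record["quote"]))
    vocab = set(counts[0]) | set(counts[1])
    return counts, vocab


def score(text, counts, vocab):
    """Log-likelihood ratio for class 1 versus class 0 (add-one smoothing).

    Class priors are ignored; words outside the vocabulary are skipped.
    """
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (0, 1)}
    total = 0.0
    for word, n in bag_of_words(text).items():
        if word in vocab:
            p_good = (counts[1][word] + 1) / totals[1]
            p_bad = (counts[0][word] + 1) / totals[0]
            total += n * math.log(p_good / p_bad)
    return total


def auc(labels, scores):
    """Fraction of (good, bad) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


counts, vocab = train(RAW)
```

A positive value from score(text, counts, vocab) leans toward "good" and a negative value toward "bad". Passing held-out labels and their scores to auc gives a value between 0 and 1, where 0.5 corresponds to random ranking. With only a handful of files, that value can shift noticeably depending on which quotes are held out.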

