PySpark | AWS | NLP | Big Data | SparkML
Project Gutenberg is the oldest and largest digital repository of public domain literature on the internet. Browsing the website is rather difficult, as there is only a search bar and a list of volunteer-generated “bookshelf” collections, from Science Fiction to Psychology. These bookshelves would be great for searching if they were not so incomplete: only 20% of books appear in any bookshelf at all.
Thus, we used the following methodology to “complete” the bookshelves, using semi-supervised machine learning:
- Encode each full text as a term-frequency vector,
- Create a logistic regression model for each bookshelf, training it on the positive samples (books already in that bookshelf) plus synthetic negative samples, so that the model learns the shape of the inlier class without excluding other real books that may belong,
- Evaluate each model's accuracy on held-out books from its bookshelf (80-20 train-test split), then apply it to books outside the bookshelves to find new candidates.
Here are some of our results:
- US Civil War
- United Kingdom
- Psychology

(Result figures for each bookshelf omitted.)
The models performed well, surfacing books that were highly relevant to each incomplete bookshelf.
Processing these large text files was made feasible by cloud computing: we ran the pipeline on a 15-node EMR cluster on AWS.
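For reference, a cluster of that shape can be launched with the AWS CLI. This is a config sketch only: the cluster name, release label, and instance type below are illustrative placeholders, not the project's actual settings.

```shell
# Launch a 15-node Spark cluster on EMR (placeholder name/release/instance type).
aws emr create-cluster \
  --name "gutenberg-bookshelves" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 15 \
  --use-default-roles
```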
The full notebook and report can be accessed here.