Book Recommendations from Charles Darwin
2 minutes read - 225 words
- Explored the books to be used in the recommendation system, and loaded the contents of each book
- Pre-processed the data to facilitate the downstream analysis
- Used Darwin’s most famous book, “On the Origin of Species”, as the reference text to keep the analysis consistent
- Tokenized the corpus (the collection of texts), i.e., transformed each text into a list of its individual words (called tokens), a format that is easier to work with in the downstream analyses
- Applied stemming to group together the inflected forms of a word so they can be analyzed as a single item: the stem
- Loaded the stemmed tokens from a pickle file to speed up the process, since the stemming algorithm takes several minutes to run
- Created a dictionary (the universe of all words) from the stemmed tokens, then used the tokens and the dictionary to build a Bag-of-Words (BoW) model
- Transformed the results returned by BoW into a DataFrame to better understand how the model has been generated and to visualize its content
- Computed tf-idf (term frequency–inverse document frequency) scores to determine which tokens are the most specific to each book
- Measured the similarity of books using Cosine Similarity, and visualized the results in a Bar Chart
- Visualized the whole similarity matrix as a Dendrogram to better understand the big picture and see how Darwin’s books are generally related to each other (in terms of topics discussed)
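
The core of the steps above (tokenizing, building a bag-of-words, weighting with tf-idf, and comparing books with cosine similarity) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in the project: the three short texts are hypothetical stand-ins for the books, the names `tokenize`, `tfidf`, and `cosine` are made up for this example, stemming is left out to keep the sketch dependency-free, and the tf-idf formula shown (raw count times log2 of inverse document frequency) is one common variant among several.

```python
import math
import re
from collections import Counter

# Toy stand-ins for the book texts (hypothetical, not Darwin's actual text).
texts = {
    "BookA": "Species vary, and varieties of species adapt by natural selection.",
    "BookB": "Natural selection acts on species and on varieties.",
    "BookC": "Coral reefs grow slowly on volcanic islands.",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(bows):
    """Weight each token count by log2(N / document frequency)."""
    n_docs = len(bows)
    doc_freq = Counter()
    for bow in bows.values():
        doc_freq.update(bow.keys())  # count each token once per document
    return {
        title: {
            tok: count * math.log2(n_docs / doc_freq[tok])
            for tok, count in bow.items()
        }
        for title, bow in bows.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Bag-of-words: token counts per book.
bows = {title: Counter(tokenize(text)) for title, text in texts.items()}
weights = tfidf(bows)

# Rank the other books by similarity to the chosen reference book.
reference = "BookA"
ranking = sorted(
    ((title, cosine(weights[reference], vec))
     for title, vec in weights.items() if title != reference),
    key=lambda pair: pair[1],
    reverse=True,
)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

With real book texts, the same ranking is what a bar chart of similarities to the reference book would display, and the full pairwise similarity matrix is the input a hierarchical clustering routine would need to draw a dendrogram.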