How it was made
Finding the Book Dataset
We sourced various Goodreads datasets, some from universities, some from Kaggle, and others from forums. We focused on English books that included titles, genres, ratings, and ISBN13 identifiers, since those fields serve as the foundation for our recommendations.
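As an illustration, combining those sources might look like the sketch below. The file names and column names are hypothetical, since each dataset used its own schema.

```python
import pandas as pd

# Hypothetical file and column names; the real datasets varied by source.
REQUIRED = ["title", "genres", "average_rating", "isbn13", "description"]

frames = [pd.read_csv(path) for path in ["goodreads_kaggle.csv", "goodreads_forum.csv"]]
books = pd.concat(frames, ignore_index=True)

# Keep only English entries that have every field the recommender depends on.
books = books[books["language_code"].str.startswith("en", na=False)]
books = books.dropna(subset=REQUIRED)
```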
Cleaning the Dataset
Much of the raw data was messy. We used Python for all the data processing and cleaning, which included normalizing fields, removing duplicates, handling missing values, and excluding books whose missing information we could not recover. This was a key step, since clean descriptions are a prerequisite for useful embeddings.
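A condensed sketch of that kind of cleaning pass, again with hypothetical column names:

```python
import pandas as pd

def clean(books: pd.DataFrame) -> pd.DataFrame:
    # Normalize text fields so near-identical entries compare equal.
    books["title"] = books["title"].str.strip()
    books["isbn13"] = books["isbn13"].astype(str).str.replace("-", "")

    # Drop exact duplicates, keeping one row per ISBN.
    books = books.drop_duplicates(subset="isbn13")

    # Fill gaps we can tolerate; drop rows we cannot repair.
    books["genres"] = books["genres"].fillna("Unknown")
    books = books.dropna(subset=["title", "description"])
    return books
```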
Vectorizing with OpenAI
Each book’s description was converted into a high-dimensional vector using OpenAI’s text-embedding-3-small model, then stored in a Pinecone vector database for similarity search.
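A minimal sketch of that pipeline, assuming the current openai and pinecone Python client libraries and a hypothetical index named "books":

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                    # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_KEY")  # placeholder key
index = pc.Index("books")                   # hypothetical index name

def embed_and_store(isbn13: str, title: str, description: str) -> None:
    # One API call turns the description into a 1536-dimensional vector.
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=description,
    )
    vector = response.data[0].embedding

    # Store the vector under the book's ISBN, with metadata for display.
    index.upsert(vectors=[{
        "id": isbn13,
        "values": vector,
        "metadata": {"title": title},
    }])
```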
Building the Site
The frontend was built with HTML, CSS, and JavaScript. The backend runs on Node.js and Express, exposing an API that connects the UI to the embedding model and the Pinecone index.
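The production backend is Express, but for consistency with the other snippets, here is the query-time flow a route handler performs, sketched in Python against the same hypothetical index: embed the user's query, then ask Pinecone for the nearest book vectors.

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key="YOUR_PINECONE_KEY").Index("books")  # hypothetical index name

def recommend(query: str, k: int = 5) -> list[str]:
    # Embed the user's query with the same model used for the books.
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding

    # Return the titles of the k most similar books.
    results = index.query(vector=vector, top_k=k, include_metadata=True)
    return [match.metadata["title"] for match in results.matches]
```

Using the same embedding model at index time and query time matters: vectors from different models live in different spaces and cannot be meaningfully compared.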
Pushing to GitHub
All of our source code lives in our GitHub repository, which we use for version control and collaboration. That also gives us a clear history of the changes we have made throughout the project.
Deploying with Vercel
Lastly, we use Vercel, which automatically builds and deploys the site. Paired with a custom domain, that setup keeps our production site running today.