Am I predictable in how I rate books?

I’ve been using Goodreads consistently for the last 8 years to line up future reads and rate books I’ve read. Recently, I exported these 8 years’ worth of data and analysed it to see how my reading behaviour has changed.

As an extension to this, I’m going to see if it’s possible to predict how much I will like/dislike a book.

The dataset

[Image: dataset sample]

The dataset contains quite a few features:

Book details - Title, author, ISBN, original publication year
Goodreads platform data - Rating given by Goodreads members
Personal reading data - Rating given by user, date read, date added to a shelf, shelf the book is currently on i.e. to-read, read, currently-reading

Some top level numbers

I’ve rated 615 books
My average rating is 4.29 out of 5
75% of the books read are fiction and the rest are simply labelled non-fiction

Pre-processing

To keep things as simple as possible, I’ll limit the pre-processing to the following steps.

Extract any series number from the ‘Title’ column - being 11 books into a series is probably a sign that I like it
Remove the ‘Book Id’, ‘Title’, ‘Author’ and ‘ISBN13’ columns
Convert the datetime columns (‘Date Read’ and ‘Date Added’) to the number of days before now
Impute missing values with the column mean
One-hot encode the ‘genre’ (fiction vs non-fiction) column
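
The steps above could be sketched with pandas roughly as follows. This is a minimal sketch, not the exact code behind this post. The function name `preprocess`, the `now` parameter, the new ‘Series Number’ column and the assumption that a series number appears in the title as something like ‘(Series, #11)’ are all mine. The columns being dropped, converted and encoded are the ones listed above.

```python
import pandas as pd


def preprocess(books: pd.DataFrame, now: pd.Timestamp) -> pd.DataFrame:
    """Apply the five pre-processing steps to a Goodreads-style export."""
    df = books.copy()

    # 1. Pull a series number out of the title, assuming a "#11"-style marker.
    #    Titles without one get NaN and are imputed in step 5.
    series = df["Title"].str.extract(r"#(\d+(?:\.\d+)?)", expand=False)
    df["Series Number"] = pd.to_numeric(series)

    # 2. Drop identifier-like columns the models cannot generalise from.
    df = df.drop(columns=["Book Id", "Title", "Author", "ISBN13"])

    # 3. Express each date as a number of days before `now`.
    for col in ["Date Read", "Date Added"]:
        df[col] = (now - pd.to_datetime(df[col])).dt.days

    # 4. One-hot encode the genre column (one 0/1 column per genre).
    df = pd.get_dummies(df, columns=["genre"], dtype=float)

    # 5. Fill any remaining gaps with the mean of that column.
    return df.fillna(df.mean(numeric_only=True))
```

Passing `now` in explicitly, rather than calling `pd.Timestamp.now()` inside the function, keeps the result reproducible from one run to the next.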

The Preprocessed Data

[Image: preprocessed data]

Modelling the data

Since there are only five possible ratings I can give, I’ve framed this as a classification problem: predict which of the five ratings I’m likely to give. I’m going to throw a few classifiers at the data to see what sort of accuracy they come back with.

Here we’ll use Logistic Regression, Decision Trees and KNN alongside a Voting Classifier ensemble method to round it off.
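
For illustration, here is one way the four models could be set up and scored with scikit-learn. It is a sketch under assumptions, not the exact setup used here: the function name `score_classifiers`, the 25% hold-out split, the feature scaling, the default hyperparameters and the hard (majority) voting are all my choices for the example.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def score_classifiers(X, y, seed=0):
    """Fit each classifier on a training split and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )

    # Scale features for the distance- and gradient-based models;
    # the decision tree does not need it.
    log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    tree = DecisionTreeClassifier(random_state=seed)
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

    # Hard voting: the ensemble predicts whichever rating most models chose.
    voting = VotingClassifier(
        estimators=[("lr", log_reg), ("tree", tree), ("knn", knn)],
        voting="hard",
    )

    models = {
        "Logistic Regression": log_reg,
        "Decision Tree": tree,
        "K Nearest Neighbors": knn,
        "Voting Classifier": voting,
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }
```

Calling `score_classifiers(X, y)` with the preprocessed features as `X` and the ratings as `y` returns a dictionary mapping each model name to its accuracy on the held-out books.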

Findings

We score the models on their accuracy, which I judge to be a suitable metric for this classification problem.

Logistic Regression score: 0.45
Decision Tree score: 0.44
K Nearest Neighbors score: 0.39
Voting Classifier score: 0.42

Each of the models gives a score between 0.39 and 0.45. With 5 possible labels, and bearing in mind that ~85% of the time I give either a 4 or a 5, this is not a great accuracy.

In fact, these scores are no better than randomly choosing a rating of either 4 or 5 each time.

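import numpy as np
from sklearn.metrics import accuracy_score

# `read` is the set of books I've rated and `y` holds my actual ratings.
# Guess either 4 or 5 at random for every book, then score the guesses.
random_prediction = np.random.choice([4, 5], len(read))
accuracy_score(y, random_prediction)
# 0.43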
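
The 0.43 is about what you would expect from the arithmetic alone. A coin flip between 4 and 5 is right half the time whenever the true rating is a 4 or a 5, so the expected accuracy is roughly 0.5 × 0.85 ≈ 0.43, however the 85% is split between the two ratings. The short sketch below checks this. The helper `expected_accuracy` and the three splits are hypothetical, chosen only to show that the split does not matter.

```python
def expected_accuracy(rating_shares, guesses):
    """Expected accuracy when guessing uniformly at random from `guesses`.

    `rating_shares` maps each rating to the fraction of books given that rating.
    """
    return sum(rating_shares.get(g, 0.0) for g in guesses) / len(guesses)


# Hypothetical ways of splitting the ~85% of ratings that are a 4 or a 5.
for share_of_fives in (0.25, 0.45, 0.60):
    shares = {4: 0.85 - share_of_fives, 5: share_of_fives}
    # Every split gives about 0.425, since only the combined 85% matters.
    print(round(expected_accuracy(shares, [4, 5]), 3))
```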

Conclusion

Whilst it’s always fun to create a model with a high accuracy, in the end I’m happy that I’m not so predictable in how much I’m going to enjoy a book. If the model had been accurate, I could have picked future books by plugging them in to see what rating I was likely to give. Instead, I’ll go back to the trusty method of judging a book by its cover.

With the world of books being as broad as it is, the data I’m plugging into the models is hardly granular. We’re generalising over whole years and genres, so it is not surprising that the models struggle to predict which rating I’ll give. A better method for predicting how much I’d like a book would be to use a dataset containing the ratings of thousands of users rather than just my own.