Recently I finished an Alura course named Python for Data Science, and I want to put what I learned into practice. To do so, I’ll make a descriptive analysis of the dataset Amazon Top 50 Bestselling Books 2009 - 2019. It contains 550 books, and the data has been categorized as fiction and non-fiction by Goodreads. All of the code can be found here.

I started by checking the first five observations from the dataset.

| Name | Author | User Rating | Reviews | Price | Year | Genre |
|---|---|---|---|---|---|---|
| 10-Day Green Smoothie Cleanse | JJ Smith | 4.7 | 17350 | 8 | 2016 | Non Fiction |
| 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011 | Fiction |
| 12 Rules for Life: An Antidote to Chaos | Jordan B. Peterson | 4.7 | 18979 | 15 | 2018 | Non Fiction |
| 1984 (Signet Classics) | George Orwell | 4.7 | 21424 | 6 | 2017 | Fiction |
| 5,000 Awesome Facts (About Everything!) (Natio… | National Geographic Kids | 4.8 | 7665 | 12 | 2019 | Non Fiction |

Here it’s possible to see that the data has the Year in which the book was on the top 50 list, its Price, the average User Rating, the total number of Reviews, the Author, the Name, and lastly the Genre.
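A minimal sketch of this first look, assuming pandas. The CSV file name is not given in this post, so the frame below is typed in by hand from the five rows shown above. The `read_csv` path in the comment is a hypothetical placeholder, not the real file name.

```python
import pandas as pd

# The five rows shown in the table above, entered by hand for illustration.
# With the real dataset you would load the CSV instead, for example:
# books = pd.read_csv("bestsellers.csv")  # hypothetical file name
books = pd.DataFrame(
    {
        "Name": [
            "10-Day Green Smoothie Cleanse",
            "11/22/63: A Novel",
            "12 Rules for Life: An Antidote to Chaos",
            "1984 (Signet Classics)",
            "5,000 Awesome Facts (About Everything!) (Natio…",
        ],
        "Author": [
            "JJ Smith",
            "Stephen King",
            "Jordan B. Peterson",
            "George Orwell",
            "National Geographic Kids",
        ],
        "User Rating": [4.7, 4.6, 4.7, 4.7, 4.8],
        "Reviews": [17350, 2052, 18979, 21424, 7665],
        "Price": [8, 22, 15, 6, 12],
        "Year": [2016, 2011, 2018, 2017, 2019],
        "Genre": ["Non Fiction", "Fiction", "Non Fiction", "Fiction", "Non Fiction"],
    }
)

# head() returns the first five observations by default
first_rows = books.head()
print(first_rows)

# Column types are a quick sanity check that the numbers were read as numbers
print(books.dtypes)
```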

There are no null values in the dataset. Among the 550 books there are 248 unique authors, so let’s see which authors had the most books in the top 50 bestsellers during this period.

| Author | Number of Books |
|---|---|
| Jeff Kinney | 12 |
| Gary Chapman | 11 |
| Rick Riordan | 11 |
| Suzanne Collins | 11 |
| American Psychological Association | 10 |
| Dr. Seuss | 9 |
| Gallup | 9 |
| Rob Elliott | 8 |
| Stephen R. Covey | 7 |
| Stephenie Meyer | 7 |
| Dav Pilkey | 7 |
| Bill O’Reilly | 7 |
| Eric Carle | 7 |

The author with the most books in the top 50 list was Jeff Kinney, with 12. Tied for second, with 11 books each, were Gary Chapman, Rick Riordan, and Suzanne Collins. Tied for 9th, with 7 books each, were Stephen R. Covey, Stephenie Meyer, Dav Pilkey, Bill O’Reilly, and Eric Carle.
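The null check, the unique-author count, and the books-per-author ranking can be sketched roughly as follows, assuming pandas. The sample below is made up (placeholder names such as "Author X") so that the block runs on its own. On the real data the same calls would run against the full 550-row frame.

```python
import pandas as pd

# Tiny made-up sample for illustration; the real dataset has 550 rows.
books = pd.DataFrame(
    {
        "Name": ["Book A", "Book B", "Book C", "Book D", "Book E", "Book F"],
        "Author": ["Author X", "Author Y", "Author X", "Author Z", "Author X", "Author Y"],
    }
)

# Missing values per column (all zeros means no nulls)
null_counts = books.isna().sum()

# Number of distinct authors
n_authors = books["Author"].nunique()

# Books per author, most frequent first, as a two-column table
books_per_author = (
    books["Author"]
    .value_counts()
    .rename_axis("Author")
    .reset_index(name="Number of Books")
)

print(null_counts)
print(n_authors)
print(books_per_author)
```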

Violin plot of User Rating

With the violin plot, we can see where the user ratings are concentrated. Because our data is composed of bestsellers, it makes sense that the ratings are mostly concentrated between 4.5 and 4.75.

Boxplot of Review Count by Year

This boxplot of review count by year shows that the variability increases through the years, peaking in 2014 and then gradually stabilizing. We can also see that the early years, 2010 and 2011, had more outliers.

I wanted to look at the user rating and price by book genre, so I calculated the average of each.

| Genre | User Rating | Price |
|---|---|---|
| Fiction | 4.65 | 10.85 |
| Non Fiction | 4.60 | 14.84 |
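Averages like these can be computed with a group-by. This is a minimal sketch, assuming pandas and the column names shown in the first table. The numbers are made up for illustration, so the output will not match the table above.

```python
import pandas as pd

# Made-up values for illustration only
books = pd.DataFrame(
    {
        "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
        "User Rating": [4.6, 4.8, 4.5, 4.7],
        "Price": [10, 12, 14, 18],
    }
)

# Mean user rating and mean price per genre, rounded to two decimals
genre_means = (
    books.groupby("Genre")[["User Rating", "Price"]]
    .mean()
    .round(2)
)
print(genre_means)
```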

The average user rating is similar across genres, with just a 0.05 difference, but the price differs more: 10.85 for fiction and 14.84 for non-fiction books. To check whether these differences are statistically significant, I’ll use the Mann-Whitney test.

The Mann-Whitney null hypothesis is that the two samples come from the same distribution, and in both cases we reject the null hypothesis at the 5% significance level. The p-value was 8.34e-08 for the price data and 1.495e-07 for the user rating.
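For reference, here is a rough sketch of how such a test can be run with `scipy.stats.mannwhitneyu`. The two-sided alternative is my assumption here, and the prices are made up, so the resulting p-value is not the one reported above.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Made-up prices for illustration only
books = pd.DataFrame(
    {
        "Genre": ["Fiction"] * 6 + ["Non Fiction"] * 6,
        "Price": [4, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 20],
    }
)

# Split the variable of interest into one sample per genre
fiction = books.loc[books["Genre"] == "Fiction", "Price"]
non_fiction = books.loc[books["Genre"] == "Non Fiction", "Price"]

# Two-sided test (an assumption): are the two samples drawn from the same distribution?
statistic, p_value = mannwhitneyu(fiction, non_fiction, alternative="two-sided")

# Reject the null hypothesis when the p-value falls below the chosen significance level
alpha = 0.05
reject_null = p_value < alpha
print(statistic, p_value, reject_null)
```

The same call works for the user rating by swapping the column name.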

To show visually how the distributions differ, we can take a look at the following plots.

Distribution for Book Price by Genre

The price distribution for fiction books is heavily concentrated on the left, at low prices, and steadily diminishes as the price goes up. The non-fiction price distribution starts high and climbs even higher, with 120 and almost 140 occurrences in the first two bins, and then rapidly diminishes.

Distribution for User Rating by Genre

The distribution of user ratings for fiction increases slowly, peaking at around 4.8, while the distribution for non-fiction peaks at a little over 4.6.