
Amazon Review Scraper

Updated: Oct 26, 2018

It all started one day as I was browsing Amazon. Sometimes I don't care much about the quality of what I'm buying as long as it's a good deal; other times, I've got to know. Reading through the high and low reviews, I wanted to know what actual problems people were having with an item so I could decide whether it was worth the risk of possibly getting a defective one. But sometimes there are a lot of reviews! That's when I wondered if there was a way I could somehow see a summary of all the reviews, beyond the star-rating bars Amazon gives me. So I set out to see if I could make this thought a reality.


I had done a worksheet on web scraping once that involved scraping one of the most basic web pages out there. So I figured, "Hey, why not try scraping one of the most used websites?" So that's what I did. I know Python, but I only had a vague idea of how HTML works, so I knew this was going to be quite the challenge. In the end, though, web scraping is really all about recognizing patterns, identifying uniqueness, and isolating the desired values, all three of which I'm confident in.


The first challenge was figuring out which part of a page's HTML actually holds the reviews. Thanks to Google Chrome's "Inspect" tool, it wasn't too hard to find. After a lot of trial and error, I had code that, as far as I could tell, was pulling the correct information from each page of 50 reviews at a time (the maximum Amazon will put on one page). I put my code in a loop and figured I would give the most-reviewed item on Amazon a shot: the Amazon Fire TV Stick. At the time it had 185,000+ reviews, which meant my scraper was going to have to work through nearly 4,000 individual pages.

With my program throttled to reduce the load it put on Amazon's servers, I figured it would finish scraping the 4,000 pages in about 18 hours... Over 24 hours later, I had to stop the scraper to do some troubleshooting. I found a few problems beyond how long it took, and I ended up making my code more dynamic and building in error handling. One of the big issues was that I had no idea where my scraper was in the process, so I wrote some code to update me as it scraped. For each page, the update reports the page number, whether it hit an error, how many seconds the page took to scrape, and how many reviews it pulled from that page.
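A minimal sketch of that kind of loop, with the per-page progress logging, could look something like this. The URL pattern and the CSS selectors here are placeholders, not Amazon's real markup (which changes often), so treat them as illustration rather than a working scraper:

```python
import time
import requests
from bs4 import BeautifulSoup

# Placeholder URL pattern -- the real review-page URL scheme may differ.
BASE_URL = "https://www.amazon.com/product-reviews/{asin}/?pageNumber={page}&pageSize=50"

def scrape_reviews(asin, num_pages, delay=5):
    """Scrape review pages one at a time, printing a progress update per page."""
    all_reviews = []
    for page in range(1, num_pages + 1):
        start = time.time()
        error = False
        reviews = []
        try:
            resp = requests.get(BASE_URL.format(asin=asin, page=page),
                                headers={"User-Agent": "Mozilla/5.0"},
                                timeout=30)
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "html.parser")
            # Placeholder selectors -- inspect the live page to find the real ones.
            for block in soup.select("div[data-hook='review']"):
                rating = block.select_one("i[data-hook='review-star-rating']")
                body = block.select_one("span[data-hook='review-body']")
                reviews.append({
                    "rating": rating.get_text(strip=True) if rating else None,
                    "text": body.get_text(strip=True) if body else None,
                })
        except requests.RequestException:
            error = True  # note the failure and keep going instead of dying mid-run
        elapsed = time.time() - start
        # The per-page status line: page number, error flag, seconds taken, review count.
        print(f"page {page} | error: {error} | {elapsed:.1f}s | {len(reviews)} reviews")
        all_reviews.extend(reviews)
        time.sleep(delay)  # throttle so we aren't hammering the server
    return all_reviews
```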

In the process of building in the features above, there was a lot more trial and error as I fine-tuned the system. Because of internet connectivity issues, I ended up running the web scraper multiple times and combining each run's output into one giant dataset 185,498 rows long.
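Combining the partial runs is straightforward with pandas. A sketch, assuming each run was saved to its own CSV and that each row carries a hypothetical review_id column for de-duplication:

```python
import glob
import pandas as pd

# Assumes each scraper run wrote its rows to a file like run_1.csv, run_2.csv, ...
runs = [pd.read_csv(path) for path in sorted(glob.glob("run_*.csv"))]
reviews = pd.concat(runs, ignore_index=True)

# Drop any reviews captured twice across overlapping runs
# ('review_id' is a hypothetical column name).
reviews = reviews.drop_duplicates(subset="review_id")
print(len(reviews))
```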


After cleaning the dataset and doing some initial exploratory data analysis, I made the following visualization. Let me explain what it is first. After taking out the common words that don't really mean anything by themselves (known as stopwords), such as "and", "the", "is", "my", etc., I made the size of each word in the visualization correlate with how many times that word appears across all of the reviews. So the bigger the word, the more often it shows up in the reviews as a whole.
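A sketch of how a frequency-sized word cloud like that can be built, assuming the combined reviews DataFrame from above and using the nltk stopword list and the wordcloud package (other libraries would work just as well):

```python
from collections import Counter

import matplotlib.pyplot as plt
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once
from wordcloud import WordCloud

stop_words = set(stopwords.words("english"))

# Count every non-stopword token across all review texts.
counts = Counter(
    word
    for text in reviews["text"].dropna()
    for word in text.lower().split()
    if word.isalpha() and word not in stop_words
)

# Word size in the image is driven by frequency.
cloud = WordCloud(width=800, height=400).generate_from_frequencies(counts)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```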

Looks like people love it, it's great, it works, and it's easy to use. But what do the summaries look like for each rating?

There were some interesting things I could see at this point. The main one that stood out to me is that "remote" (the part of the product the user actually physically interacts with) shows up most in the reviews with lower ratings. When I took a look into it, I found a pretty incredible difference in the percentage of reviews that mention the remote within each rating.
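A sketch of that per-rating calculation, assuming the reviews DataFrame has a numeric rating column:

```python
# Flag reviews that mention the remote.
reviews["mentions_remote"] = reviews["text"].str.contains("remote", case=False, na=False)

# Percent of reviews *within each star rating* that mention it.
pct_within_rating = reviews.groupby("rating")["mentions_remote"].mean() * 100
print(pct_within_rating.round(1))
```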



A much higher percentage of the lower-rated reviews mention the remote within their respective ratings... but how do these percentages compare to the whole?






Now, taking a look at the percentage of reviews that mention the remote compared to all of the reviews given, we see that the numbers that looked so big in the last visualization (the 1-3 star ratings) add up to less than 3% of all of the reviews.
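The only change is the denominator. Continuing the sketch above:

```python
# Same counts, but now as a share of *all* reviews rather than of each rating's reviews.
pct_of_all = (
    reviews[reviews["mentions_remote"]].groupby("rating").size()
    / len(reviews) * 100
)
print(pct_of_all.round(2))
```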



Well, I am still curious to see what that less-than-3% slice of the reviews is saying about the remote.
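One way to dig into that slice, reusing the same columns and the stop_words set from the earlier sketches:

```python
from collections import Counter

# Low-rated reviews that mention the remote.
remote_complaints = reviews[
    reviews["mentions_remote"] & (reviews["rating"] <= 3)
]

# Most common non-stopword terms in those complaints.
complaint_counts = Counter(
    word
    for text in remote_complaints["text"].dropna()
    for word in text.lower().split()
    if word.isalpha() and word not in stop_words
)
print(complaint_counts.most_common(20))
```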

So for people who are having problems with the remote, it looks like one of the more common complaints is about its batteries. Who would have thought?


To come back to the original question: "Is there a way I could see a summary of all the reviews?" Yes, yes there is. So, if you are thinking about buying an Amazon Fire TV Stick, I would recommend it - even though I don't have one. After all, there's less than a 3% chance of having problems with the remote. Not bad.

