Scraping websites can give you valuable data, but it is often not easy. You will most likely run into challenges such as: creating requests (you need to learn how to code and use a library to create the HTTP requests that browsers make behind the scenes); setting the correct request headers (if you don’t set headers such as the language and encoding, a server may return a 403 error instead of the HTML you want); throttling (a website may allow only a certain number of requests in a given amount of time so that you don’t bog down its server); and getting your IP banned (sometimes a website will try to prevent you from crawling by banning your IP address so that you can’t make requests). We are going to show you how AnswersEngine can handle all of these difficult parts of scraping and make it easy for you to get the data you want.
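To make the first three challenges concrete, here is a minimal sketch of what handling them by hand might look like with Ruby's standard Net::HTTP library. This is not AnswersEngine code. The helper names (`build_browser_like_request`, `each_throttled`), the header values, and the example URL are all assumptions made for illustration. The request is only built, never sent.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a GET request with
# browser-like headers. The header values below are illustrative assumptions.
# Net::HTTP normally adds its own Accept-Encoding header, so it is not set here.
def build_browser_like_request(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['Accept-Language'] = 'en-US,en;q=0.9'
  request['Accept'] = 'text/html,application/xhtml+xml'
  request['User-Agent'] = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)'
  request
end

# Hypothetical helper: yields each URL in turn and sleeps between calls,
# which is a very simple form of throttling.
def each_throttled(urls, delay_seconds: 1.0)
  urls.each_with_index do |url, index|
    sleep(delay_seconds) if index > 0
    yield url
  end
end

request = build_browser_like_request('https://www.example.com/')
puts request['Accept-Language']
```

Even this small sketch leaves out retries, proxies, and IP rotation, which is the kind of work the rest of this tutorial hands off to AnswersEngine.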
If you prefer to skip this tutorial, you can clone this script directly here.
For this tutorial, we are going to show you how to use AnswersEngine to scrape information about televisions from two Amazon.com categories: “LED & LCD TVs” and “OLED TVs.” Specifically, we are going to scrape the following data for each television (also highlighted below): name, price, ASIN, seller, category, rating, number of reviews, product availability, and description.
We are going to assume you have Ruby 2.5.3 and the Nokogiri gem installed. If not, follow this link here for instructions on how to install Ruby. Once Ruby is installed, make sure RubyGems is also installed, and then run the following to install Nokogiri:
$ gem install nokogiri
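If you want to confirm these prerequisites before continuing, a short script along the following lines can help. It is only a sketch: the helper names `ruby_at_least?` and `gem_installed?` are made up for this example, and the check uses the standard RubyGems API rather than anything specific to AnswersEngine.

```ruby
require 'rubygems'

# Hypothetical helper: true if the running Ruby is at least `minimum`.
def ruby_at_least?(minimum)
  Gem::Version.new(RUBY_VERSION) >= Gem::Version.new(minimum)
end

# Hypothetical helper: true if any version of the named gem is installed.
def gem_installed?(name)
  Gem::Specification.find_all_by_name(name).any?
end

puts "Ruby #{RUBY_VERSION} (at least 2.5.3: #{ruby_at_least?('2.5.3')})"
puts "nokogiri installed: #{gem_installed?('nokogiri')}"
```

Save it as a file (for example `check_setup.rb`, a name chosen here for illustration) and run it with `ruby check_setup.rb`.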
First let’s set up a new AnswersEngine scraper. Install the AnswersEngine Ruby gem with the following command:
$ gem install answersengine --source https://Q34T4-cZG2rRRuLMNmG2zvwZsIJl7W5g@gem.fury.io/answersengine/
You should see something similar to the following output after running this command:
Successfully installed answersengine-0.2.3
Parsing documentation for answersengine-0.2.3
Done installing documentation for answersengine after 0 seconds
1 gem installed
Now that we have the AnswersEngine gem installed, we need to store our AnswersEngine token in an environment variable so that it is sent with every AnswersEngine request. Run the following command, replacing the placeholder with your own token:
$ export ANSWERSENGINE_TOKEN=<your_token_Here>
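The export only applies to the current shell session, so it can be worth checking that the variable is actually visible to Ruby. The following is a small sanity check, not part of the AnswersEngine gem. The helper name `answersengine_token` is an assumption for this example, and the script prints only the token's length, never the token itself.

```ruby
# Hypothetical helper: returns the token from the given environment
# (ENV by default) and raises if it is missing or blank.
def answersengine_token(env = ENV)
  token = env.fetch('ANSWERSENGINE_TOKEN', '').strip
  raise 'ANSWERSENGINE_TOKEN is not set in this shell' if token.empty?
  token
end

begin
  token = answersengine_token
  puts "Token found (#{token.length} characters)"
rescue RuntimeError => e
  warn e.message
end
```

If the script reports that the variable is not set, run the export command again in the same terminal you will use for the rest of the tutorial.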
We are now ready to create a scraper. Let’s create an empty directory first, and name it ‘amazon-tvs’:
$ mkdir amazon-tvs
Next let’s go into the directory and initialize it as a Git repository:
$ cd amazon-tvs
$ git init .
Now that our setup is finished, let's move on to creating the seeders in Part II.