Every AnswersEngine scraper requires a seeder script that tells the scraper which pages to start scraping. A seeder script is a Ruby file that loads URLs into a reserved variable called "pages". First, create a directory for our seeder script:

$ mkdir seeder

Next, create a file called "seeder.rb" inside this seeder directory with the following code:

pages << {
  page_type: "listings",
  method: "GET",
  headers: {"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"},
  url: "https://www.aliexpress.com/category/100003109/women-clothing.html",
  vars: {
    category: "Women's clothing"
  }
}

In the Ruby script above, we are seeding a link to the "Women's Clothing" category on AliExpress.com. Please note that "pages" is a reserved variable: it is an array representing the pages you want to seed. Let's go through the other values in detail.

The "page_type" setting determines which parser script will handle the page. Later we will create a Ruby parser script called "listings".
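To illustrate how "page_type" connects to a parser, the config.yaml file we create later can declare one parser per page type. This is only a hedged sketch of what we will set up in Part III; the "parsers" section and the ./parsers/listings.rb path are assumptions until then:

```yaml
seeder:
  file: ./seeder/seeder.rb
parsers:
  - page_type: listings          # matches the page_type in the seeder
    file: ./parsers/listings.rb  # parser script to be created in Part III
```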

The "method" is the type of HTTP request we want to make. In this example, we are doing a simple "GET" request, which is what your browser would make if you were viewing this URL.

For the "headers" setting we are setting a "User-Agent", a string that identifies a browser. Whenever you access a website, your browser includes a "User-Agent" so the website knows how to render the page you request. By including a realistic "User-Agent" string, we reduce the chance of the AliExpress website identifying us as a scraping bot and blocking our requests. You can also leave the "headers" setting out entirely, and AnswersEngine will randomly submit a "User-Agent" with each page request. These randomly selected "User-Agents" are all valid strings from the main browsers (Chrome, Firefox, and Internet Explorer), so there is no need to worry if you leave this out.
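If you prefer to control rotation yourself, a seeder can pick a "User-Agent" at random from its own list. The sketch below runs standalone, so it assigns "pages" itself; in a real seeder script AnswersEngine predefines "pages", so you would drop that line. The second User-Agent string is an illustrative example:

```ruby
# Standalone only: AnswersEngine predefines "pages" in a real seeder script.
pages = []

# Illustrative pool of valid browser User-Agent strings.
user_agents = [
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0"
]

pages << {
  page_type: "listings",
  method: "GET",
  # Pick one User-Agent at random for this request.
  headers: {"User-Agent" => user_agents.sample},
  url: "https://www.aliexpress.com/category/100003109/women-clothing.html"
}
```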

The "vars" parameter allows you to pass in user-defined variables; here we are passing the "Women's clothing" category. We will be able to access and save this "vars" value in the "listings" parser designated by the "page_type" value. You can pass whatever information you want through "vars".
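Because "pages" is an array, a seeder can enqueue several category pages at once, tagging each with its own "vars". In this sketch, the seeder assigns "pages" itself only so the snippet runs standalone (a real seeder already has it), and the men-clothing URL and ID are hypothetical placeholders, not verified AliExpress addresses:

```ruby
# Standalone only: AnswersEngine predefines "pages" in a real seeder script.
pages = []

# Map category names to their listing URLs.
# The men-clothing entry is a hypothetical placeholder.
categories = {
  "Women's clothing" => "https://www.aliexpress.com/category/100003109/women-clothing.html",
  "Men's clothing"   => "https://www.aliexpress.com/category/100003070/men-clothing.html"
}

categories.each do |name, url|
  pages << {
    page_type: "listings",
    method: "GET",
    url: url,
    vars: { category: name }  # read back later in the "listings" parser
  }
end
```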

Now that we have created a seeder script we can try it out to see if there are any syntax errors. Run the following command from the root of your project directory:

$ answersengine seeder try seeder/seeder.rb  

You should see the following output:

Trying seeder script
=========== Seeding Script Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "page_type": "listings",
    "method": "GET",
    "headers": {
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    },
    "url": "https://www.aliexpress.com/category/100003109/women-clothing.html",
    "vars": {
      "category": "Women's clothing"
    }
  }
]

Now we can commit this seeder to our git repository with the following commands (run "git init" first if you have not yet initialized a repository in your project directory):

$ git add .
$ git commit -m 'created a seeder file'

AnswersEngine scrapers are deployed from git repositories, so we will need a remote one to push to. Bitbucket offers free git repositories. Create a Bitbucket account and then a new repository here: https://bitbucket.org/repo/create. Use the git repo address from Bitbucket and push your scraper with the following commands (replace <username> with your Bitbucket username):

$ git remote add origin git@bitbucket.org:<username>/ali-express.git
$ git push -u origin master

We will need a config file to tell AnswersEngine where to find our files. Create a config.yaml file in the root project directory with the following content:

seeder:
  file: ./seeder/seeder.rb
  disabled: false # Optional. Set it to true if you want to disable execution of this file

Commit this config file on git, and push it to Bitbucket:

$ git add .
$ git commit -m 'add config.yaml file'
$ git push origin master  

We can now create a scraper and run it on AnswersEngine. The following command creates a scraper called "ali-express"; replace the git repo address (it should end in .git) with your own:

$ answersengine scraper create ali-express git@bitbucket.org:<username>/ali-express.git --workers 1

Next, we need to deploy from your remote Git repository onto AnswersEngine:

$ answersengine scraper deploy ali-express   

After deploying we can start the scraper with the following command:

$ answersengine scraper start ali-express  

Starting a new scraper will create a new scrape job and run it. Wait a minute and then check the status of this job with the following command:

$ answersengine scraper stats ali-express    

You should see something similar to the following:

{
 "job_id": 83,             # Job ID
 "pages": 0,               # How many pages in the scrape job
 "fetched_pages": 1,       # Number of fetched pages
 "to_fetch": 0,            # Pages that need to be fetched
 "fetching_failed": 0,     # Pages that failed fetching
 "fetched_from_web": 1,    # Pages that were fetched from Web
 "fetched_from_cache": 0,  # Pages that were fetched from the shared Cache
 "parsed_pages": 0,        # Pages that have been parsed by parsing script
 "to_parse": 0,            # Pages that need to be parsed
 "parsing_failed": 0,      # Pages that failed parsing
 "outputs": 0,             # Outputs of the scrape
 "output_collections": 0,  # Output collections
 "workers": 1,             # How many workers are used in this scrape job
 "time_stamp": "2019-02-23T22:09:57.956158Z"
}

The "fetched_pages" value indicates that our scraper has successfully fetched the first page from the seeder. In Part III, we will create parsers to parse these pages, extract product data, and enqueue more pages.