API_SPIDER.md

The store locator has an API

Common storefinders

Often, a storefinder is a plugin or a common software component. To encourage code re-use, there are a large number of pre-built storefinders.

Over and above the core Scrapy spider API, these storefinders typically follow a common pattern:

  • Specifying an API key or start_url.
  • pre_process_data - a method for cleaning or transforming a structure from the API (dict, XPath/DOM node) prior to processing.
  • parse_item or post_process_item (StructuredDataSpiders) - a method for further decorating an item after the primary processing is done, e.g. removing invalid names or phone numbers.
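The hook pattern above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual base class: `StorefinderSketch`, `ExampleBrandSketch`, `parse_features`, the hook signatures and the field names are all hypothetical, and the "primary processing" is reduced to a trivial field copy.

```python
from typing import Iterator


class StorefinderSketch:
    """Hypothetical stand-in for a shared storefinder base class."""

    def pre_process_data(self, feature: dict) -> None:
        """Hook: clean or reshape the raw API record in place."""

    def post_process_item(self, item: dict, feature: dict) -> Iterator[dict]:
        """Hook: decorate or filter the item built from the record."""
        yield item

    def parse_features(self, features: list[dict]) -> Iterator[dict]:
        for feature in features:
            self.pre_process_data(feature)
            # Primary processing: a trivial field copy stands in for the real work.
            item = {
                "ref": feature.get("id"),
                "name": feature.get("name"),
                "phone": feature.get("phone"),
            }
            yield from self.post_process_item(item, feature)


class ExampleBrandSketch(StorefinderSketch):
    """A brand spider only overrides the hooks it needs."""

    def pre_process_data(self, feature: dict) -> None:
        # Strip stray whitespace before the generic processing runs.
        feature["name"] = (feature.get("name") or "").strip()

    def post_process_item(self, item: dict, feature: dict) -> Iterator[dict]:
        # Drop a placeholder phone number (made up for this example).
        if item.get("phone") == "000":
            item["phone"] = None
        yield item
```

The point of the design is that the shared class owns the request and parsing logic, so a new spider is often little more than a start URL plus one or two small overrides.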

While exploring the storefinder you wish to spider, look for indicators or common elements that already appear in the existing codebase. For example, is there an AJAX call with a get_stores attribute? If so, are there any existing spiders that make similar calls? Following such leads often results in very simple, minimal spiders.

Examples:

Investigating an API

Writing a spider can be very easy for a site with a good sitemap and good structured data pages. Not all sites work this way, and those that do not will require more work.

There are a good number of sites that employ their own bespoke API (typically returning JSON data). These APIs exhibit a good degree of similarity in terms of the fields returned.

You are encouraged to run the following checks as a first step:

  • pipenv run scrapy sitemap http://example.com/ - determine whether there are sitemaps and useful links; see sitemap for your next steps.
  • pipenv run scrapy sd http://example.com/path/to/individual/store, or paste the URL into https://validator.schema.org/ - check for structured data; see structured data.

If these yield no results, or you wish to explore more efficient ways to query, the next step is to figure out whether there is an API and how it is driven.

With the browser's web development console open, use the store locator to navigate to one of the store pages. Then take a value specific to that POI, such as its phone number or postcode, and search for that string in the network transfers section of the console. If the data arrived via an API call you will quickly see it, and by examining the URL and headers of that transfer you can start to understand the API.
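Once a promising response has been saved from the network panel, the same search can be repeated offline to see exactly where the value sits in the JSON. The following is a small sketch for that purpose; `find_paths` is a hypothetical helper and the response body is made up for illustration.

```python
import json
from typing import Any, Iterator


def find_paths(data: Any, needle: str, path: str = "$") -> Iterator[str]:
    """Yield a path for every scalar in `data` whose text contains `needle`."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from find_paths(value, needle, f"{path}.{key}")
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from find_paths(value, needle, f"{path}[{index}]")
    elif data is not None and needle in str(data):
        yield path


# Made-up response body, as it might be copied from the network panel.
body = '{"results": [{"id": 7, "contact": {"phone": "020 7946 0000"}}]}'
matches = list(find_paths(json.loads(body), "7946"))
```

Each returned path shows how deeply the store records are nested, which is useful when deciding which part of the response to iterate over in the spider.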

Example: Discovering a site is powered by a JSON response from a stockist API

image

If you find that the POI data you want is present in the web page itself as HTML, then you will most likely have to write a spider the hard way.

The rest of this page discusses some different kinds of JSON API by example.

DictParser, a helper for POI JSON

In nearly all cases you should find that our DictParser class helps keep your code concise. There are only so many ways to name a latitude field, and DictParser tries those different names for this and other fields. It will not get everything right for every API, so you may still have to write a little code yourself, but a lot less than otherwise.
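The idea of trying several names for the same field can be sketched as follows. This is not the DictParser implementation: the candidate key lists and the helper names `first_present` and `parse_position` are assumptions chosen for illustration only.

```python
from typing import Any, Optional

# Illustrative candidate names only; not taken from DictParser.
LAT_KEYS = ["latitude", "lat", "y"]
LON_KEYS = ["longitude", "lon", "lng", "long", "x"]


def first_present(record: dict, candidates: list[str]) -> Optional[Any]:
    """Return the value of the first candidate key found, ignoring case."""
    lowered = {str(key).lower(): value for key, value in record.items()}
    for candidate in candidates:
        value = lowered.get(candidate)
        if value not in (None, ""):
            return value
    return None


def parse_position(record: dict) -> dict:
    """Extract a position from a record whatever the API calls its fields."""
    return {
        "lat": first_present(record, LAT_KEYS),
        "lon": first_present(record, LON_KEYS),
    }
```

For instance, `parse_position({"Lat": 51.5, "LNG": -0.1})` and `parse_position({"latitude": 51.5, "longitude": -0.1})` both produce the same normalised result, which is why a generic parser can cover many bespoke APIs.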

One API call, all sites returned

Sometimes a single HTTP request returns a JSON structure giving the details of all stores in the group. You've got lucky. Let DictParser do the best it can, then correct any mistakes and make any additions after it returns. Some examples:
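The "generic parse first, then fix up" flow might look roughly like the sketch below. To keep it self-contained, the generic step is passed in as a callable; `stand_in_parser` is a hypothetical placeholder rather than DictParser, and the JSON keys (`stores`, `storeNumber`) and the "N/A" phone marker are invented for the example.

```python
import json
from typing import Callable, Iterator


def parse_all_stores(body: str, parse_record: Callable[[dict], dict]) -> Iterator[dict]:
    """Yield one item per store from a single JSON response body."""
    for record in json.loads(body)["stores"]:  # "stores" is an assumed key
        item = parse_record(record)
        # Additions: the generic parser cannot know this brand-specific field.
        item["ref"] = str(record["storeNumber"])
        # Fixes: this made-up API marks a missing phone number as "N/A".
        if item.get("phone") == "N/A":
            item["phone"] = None
        yield item


def stand_in_parser(record: dict) -> dict:
    """Stand-in for the generic parsing step (assumption, not DictParser)."""
    return {"name": record.get("name"), "phone": record.get("phone")}


body = '{"stores": [{"storeNumber": 12, "name": "Leeds", "phone": "N/A"}]}'
items = list(parse_all_stores(body, stand_in_parser))
```

In a real spider the loop body would sit in the response callback, with the project's parser taking the place of `stand_in_parser`.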

Driving an API by position, city, or postcode

Sometimes an API requires an area query, and one call may not suffice. You will have to figure out the best approach. To start with, you must know what you are dealing with:

  • can I query by postcode or by city?
  • can I query by position (latitude/longitude)?
  • can I increase the number of POIs returned by a single call?
  • can I increase the radial extent of the query?

We provide a certain amount of library support for driving such APIs with data. The geonames countries and cities dataset is a dependency of the project. There are postcode datasets for various territories, and position point files that cover various territories at various densities. This data should be accessed through the interfaces we have provided in geo.py.
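To make the position-driven approach concrete, here is a self-contained sketch that covers a bounding box with a grid of query points and de-duplicates the stores returned by overlapping queries. It deliberately does not use the geo.py interfaces, which should be preferred in a real spider; `grid_points`, `collect_stores`, `fake_query`, the `id` field and the sample stores are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator


def grid_points(south: float, west: float, north: float, east: float,
                step: float) -> Iterator[tuple[float, float]]:
    """Yield (lat, lon) points covering a bounding box at `step` degree spacing."""
    rows = int((north - south) / step) + 1
    cols = int((east - west) / step) + 1
    for row in range(rows):
        for col in range(cols):
            yield (south + row * step, west + col * step)


def collect_stores(points: Iterable[tuple[float, float]],
                   query: Callable[[float, float], list[dict]]) -> Iterator[dict]:
    """Run `query` at every point, yielding each store once (keyed on "id")."""
    seen = set()
    for lat, lon in points:
        for store in query(lat, lon):
            if store["id"] not in seen:
                seen.add(store["id"])
                yield store


# A fake API standing in for the HTTP call: returns stores within 0.5 degrees.
FAKE_STORES = [
    {"id": "a", "lat": 50.1, "lon": 0.1},
    {"id": "b", "lat": 50.9, "lon": 0.9},
]


def fake_query(lat: float, lon: float) -> list[dict]:
    return [s for s in FAKE_STORES
            if abs(s["lat"] - lat) <= 0.5 and abs(s["lon"] - lon) <= 0.5]


stores = list(collect_stores(grid_points(50.0, 0.0, 51.0, 1.0, 0.5), fake_query))
```

The grid spacing should be chosen against the answers to the questions above: if the API caps the radius or the number of POIs per call, a denser grid is needed to avoid missing stores, at the cost of more requests and more duplicates to filter out.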

All the above is best illustrated with some further examples: