Scrapy
======

`Scrapy `_ is an application framework that allows you to crawl websites and extract structured data for a wide range of applications.

Key Components
``````````````

The `Architecture overview `_ from Scrapy's official documentation has the following diagram, which explains the flow of data:

.. image:: images/scrapy_architecture_02.png

The main components of Scrapy are:

1. Scrapy Engine: Coordinates all data flow between components.
2. Scheduler: Enqueues and stores URLs; it receives them as `Request` objects from the Engine.
3. Downloader: The component that actually downloads pages. Results are fed back to the `Spider` via the `Engine` as `Response` objects.
4. Spiders: Custom classes written with the logic to

   * Parse `Response` objects
   * Extract `Item` objects

5. Item Pipeline: Contains the logic to process items once they are extracted, including:

   * Cleansing
   * Validation
   * Persistence

6. Downloader middlewares: A set of hooks that sit between `Engine` and `Downloader`.
7. Spider middlewares: A set of hooks that sit between `Engine` and `Spider`.

Quickstart
``````````

1. Create a project using:

   .. code-block:: sh

      scrapy startproject tutorial

2. Write a custom Spider class containing the logic to scrape information from one or more sites. It generally extends `scrapy.Spider` and provides:

   * The initial requests
   * How to follow links on a page
   * How to parse the downloaded information

   E.g. create `quotes_spider.py` within the `spiders` directory:

   .. code-block:: python

      import scrapy


      class QuotesSpider(scrapy.Spider):
          name = "quotes"

          def start_requests(self):
              urls = [
                  'http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/',
              ]
              for url in urls:
                  yield scrapy.Request(url=url, callback=self.parse)

          def parse(self, response):
              page = response.url.split("/")[-2]
              filename = 'quotes-%s.html' % page
              with open(filename, 'wb') as f:
                  f.write(response.body)
              self.log('Saved file %s' % filename)

   Note:

   1. `name`: The name of the spider.
   2. `start_requests` method: Yields the seed requests (URLs) from which the crawl expands.
   3. `parse` method: The handler that gets called when a response has been downloaded. The page content is held by the `response` argument (a `TextResponse` object). This method is also responsible for:

      1. Figuring out which URLs to follow
      2. Extracting items as dictionaries

3. Run the spider using:

   .. code-block:: sh

      scrapy crawl quotes

An alternative way to specify the seed URLs is the `start_urls` attribute:

.. code-block:: python

   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = [
           'http://quotes.toscrape.com/page/1/',
           'http://quotes.toscrape.com/page/2/',
       ]

       def parse(self, response):
           page = response.url.split("/")[-2]
           filename = 'quotes-%s.html' % page
           with open(filename, 'wb') as f:
               f.write(response.body)

.. note:: You can inspect `response.url` to dispatch to a custom parsing method; this is useful when you want to use a single `Spider` for multiple domains.

The Scrapy shell lets you explore responses and debug selectors interactively; it can be invoked using:

.. code-block:: sh

   scrapy shell 'http://quotes.toscrape.com/page/1/'

Storing scraped data as JSON can be done as follows:

.. code-block:: sh

   scrapy crawl quotes -o quotes.json
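The same crawl and export can also be driven from a plain Python script instead of the `scrapy crawl` command. Below is a minimal sketch using Scrapy's `CrawlerProcess`; the import path for `QuotesSpider` is an assumption based on the project layout above, and the `FEEDS` setting needs a reasonably recent Scrapy (older versions used `FEED_FORMAT`/`FEED_URI`):

.. code-block:: python

   from scrapy.crawler import CrawlerProcess

   # Assumed import path for the spider defined in the Quickstart above.
   from tutorial.spiders.quotes_spider import QuotesSpider

   process = CrawlerProcess(settings={
       # Mirrors the effect of `-o quotes.json` on the command line.
       "FEEDS": {"quotes.json": {"format": "json"}},
   })
   process.crawl(QuotesSpider)
   process.start()  # blocks here until the crawl is finished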
Returning records using the `yield` keyword can be done as follows:

.. code-block:: python

   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = [
           'http://quotes.toscrape.com/page/1/',
           'http://quotes.toscrape.com/page/2/',
       ]

       def parse(self, response):
           for quote in response.css('div.quote'):
               yield {
                   'text': quote.css('span.text::text').get(),
                   'author': quote.css('small.author::text').get(),
                   'tags': quote.css('div.tags a.tag::text').getall(),
               }

Note: the dictionaries yielded above are collected by Scrapy as scraped items.

We can follow a link using `scrapy.Request` as follows:

.. code-block:: python

   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = [
           'http://quotes.toscrape.com/page/1/',
       ]

       def parse(self, response):
           for quote in response.css('div.quote'):
               yield {
                   'text': quote.css('span.text::text').get(),
                   'author': quote.css('small.author::text').get(),
                   'tags': quote.css('div.tags a.tag::text').getall(),
               }

           next_page = response.css('li.next a::attr(href)').get()
           if next_page is not None:
               next_page = response.urljoin(next_page)
               yield scrapy.Request(next_page, callback=self.parse)

A shortcut for the above is `response.follow`, which accepts relative URLs directly:

.. code-block:: python

   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = [
           'http://quotes.toscrape.com/page/1/',
       ]

       def parse(self, response):
           for quote in response.css('div.quote'):
               yield {
                   'text': quote.css('span.text::text').get(),
                   'author': quote.css('span small::text').get(),
                   'tags': quote.css('div.tags a.tag::text').getall(),
               }

           next_page = response.css('li.next a::attr(href)').get()
           if next_page is not None:
               yield response.follow(next_page, callback=self.parse)

You can create a custom `Item` class and yield that instead of a dictionary. E.g.:

.. code-block:: python

   import scrapy
   from myproject.items import MyItem


   class MySpider(scrapy.Spider):
       name = 'example.com'
       allowed_domains = ['example.com']

       def start_requests(self):
           yield scrapy.Request('http://www.example.com/1.html', self.parse)
           yield scrapy.Request('http://www.example.com/2.html', self.parse)
           yield scrapy.Request('http://www.example.com/3.html', self.parse)

       def parse(self, response):
           for h3 in response.xpath('//h3').getall():
               yield MyItem(title=h3)

           for href in response.xpath('//a/@href').getall():
               yield scrapy.Request(response.urljoin(href), self.parse)

Generic Spiders
```````````````

Scrapy ships a few generic spiders, which include:

1. `CrawlSpider `_
2. `XMLFeedSpider `_
3. `CSVFeedSpider `_
4. `SitemapSpider `_

These let you avoid writing redundant functionality by re-using these classes with your custom logic.

LinkExtractor
`````````````

`LinkExtractor `_ allows you to extract links from web pages (`scrapy.http.Response` objects). Link extractors are combined with `Rule `_ objects to specify which URLs are allowed and which are denied. For example:

.. code-block:: python

   rules = (
       # Extract links matching 'category.php' (but not matching 'subsection.php')
       # and follow links from them (since no callback means follow=True by default).
       Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

       # Extract links matching 'item.php' and parse them with the spider's method parse_item.
       Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
   )
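For context, here is a minimal sketch of how such rules plug into a `CrawlSpider` subclass; the spider name, domain, and `parse_item` logic are assumptions for illustration:

.. code-block:: python

   from scrapy.spiders import CrawlSpider, Rule
   from scrapy.linkextractors import LinkExtractor


   class ItemCrawlSpider(CrawlSpider):
       name = 'item_crawl'                    # hypothetical spider name
       allowed_domains = ['example.com']      # hypothetical domain
       start_urls = ['http://www.example.com/']

       rules = (
           # Follow category pages; no callback means follow=True by default.
           Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
           # Parse item pages with parse_item.
           Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
       )

       def parse_item(self, response):
           # Illustrative extraction; real selectors depend on the site's markup.
           yield {'url': response.url, 'title': response.css('title::text').get()}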

HTML parsing with `bs4`
```````````````````````

`bs4 `_ is a popular package used to parse HTML and XML, and it can be used in conjunction with Scrapy. `lxml` is used along with bs4 as a parser backend.

Find multiple elements using `find_all`:

.. code-block:: python

   from bs4 import BeautifulSoup

   moretxt = """
   <p>
   Visit the <a href="http://www.nytimes.com">New York Times</a>
   </p>
   <p>
   Visit the <a href="http://online.wsj.com">Wall Street Journal</a>
   </p>
   """

   soup = BeautifulSoup(moretxt, 'lxml')
   tags = soup.find_all('a')
   type(tags)
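`find_all` returns a list-like `ResultSet` of `Tag` objects. A small sketch of reading the link text and the `href` attribute from each tag found above:

.. code-block:: python

   # Iterate over the ResultSet returned by find_all(); tag.get('href')
   # reads an attribute and tag.text returns the enclosed text.
   for tag in tags:
       print(tag.get('href'), tag.text)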

""" soup = BeautifulSoup(moretxt, 'lxml') tags = soup.find_all('a') type(tags) Finding html element by id using `find` .. code-block:: python div = soup.find(id="articlebody") Finding html by tag and attirbute .. code-block:: python results = soup.findAll("td", {"valign" : "top"}) Reference: More examples on `official documents `_