Scrapy
======

`Scrapy `_ is an application framework that allows you to crawl websites and extract structured data for a wide range of applications.

Key Components
``````````````

The `Architecture overview `_ from Scrapy's official documentation has the following diagram, which explains the flow of data:

.. image:: images/scrapy_architecture_02.png

The main components of Scrapy are:

1. Scrapy Engine: Coordinates all data flow between components.
2. Scheduler: Enqueues and stores URLs; it receives them as `Request` objects from the Engine.
3. Downloader: The component that actually downloads pages. Results are fed back to the `Spider` via the `Engine` as `Response` objects.
4. Spiders: Custom classes written with the logic to

   * Parse `Response` objects
   * Extract `Item` objects

5. Item Pipeline: Contains the logic to process items once they are extracted, including:

   * Cleansing
   * Validation
   * Persistence

6. Downloader middlewares: A set of hooks that sit between `Engine` and `Downloader`.
7. Spider middlewares: A set of hooks that sit between `Engine` and `Spider`.

Quickstart
``````````

1. Create a project using:

   .. code-block:: sh

      scrapy startproject tutorial

2. Write a custom Spider class containing the logic to scrape information from one or more sites. It generally extends `scrapy.Spider` and provides:

   * The initial requests
   * How to follow links on a page
   * How to parse the downloaded information

   E.g. create `quotes_spider.py` within the `spiders` directory:

   .. code-block:: python

      import scrapy


      class QuotesSpider(scrapy.Spider):
          name = "quotes"

          def start_requests(self):
              urls = [
                  'http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/',
              ]
              for url in urls:
                  yield scrapy.Request(url=url, callback=self.parse)

          def parse(self, response):
              page = response.url.split("/")[-2]
              filename = 'quotes-%s.html' % page
              with open(filename, 'wb') as f:
                  f.write(response.body)
              self.log('Saved file %s' % filename)

   Note:

   1. `name`: The name of the spider.
   2. `start_requests` method: Yields the seed requests (URLs) from which the crawl expands.
   3. `parse` method: The handler that gets called when a response has been downloaded. The page content is held by the `response` argument (a `TextResponse` object). This method is also responsible for:

      1. Figuring out which URLs to follow
      2. Extracting items as dictionaries

3. Run the spider using:

   .. code-block:: sh

      scrapy crawl quotes

An alternative way to specify the seed URLs is the `start_urls` attribute:

.. code-block:: python

   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = [
           'http://quotes.toscrape.com/page/1/',
           'http://quotes.toscrape.com/page/2/',
       ]

       def parse(self, response):
           page = response.url.split("/")[-2]
           filename = 'quotes-%s.html' % page
           with open(filename, 'wb') as f:
               f.write(response.body)

.. note:: You can inspect `response.url` to dispatch to a custom parsing method; this is useful when you want to use a single `Spider` for multiple domains.

The Scrapy shell lets you explore responses and debug selectors interactively; it can be invoked using:

.. code-block:: sh

   scrapy shell 'http://quotes.toscrape.com/page/1/'

Storing scraped data as JSON can be done as follows:

.. code-block:: sh

   scrapy crawl quotes -o quotes.json
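The same crawl and export can also be driven from a plain Python script instead of the `scrapy crawl` command. Below is a minimal sketch using Scrapy's `CrawlerProcess`; the import path for `QuotesSpider` is an assumption based on the project layout above, and the `FEEDS` setting needs a reasonably recent Scrapy (older versions used `FEED_FORMAT`/`FEED_URI`):

.. code-block:: python

   from scrapy.crawler import CrawlerProcess

   # Assumed import path for the spider defined in the Quickstart above.
   from tutorial.spiders.quotes_spider import QuotesSpider

   process = CrawlerProcess(settings={
       # Mirrors the effect of `-o quotes.json` on the command line.
       "FEEDS": {"quotes.json": {"format": "json"}},
   })
   process.crawl(QuotesSpider)
   process.start()  # blocks here until the crawl is finished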
Returning records using the `yield` keyword can be done as follows:

.. code-block:: python

   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = [
           'http://quotes.toscrape.com/page/1/',
           'http://quotes.toscrape.com/page/2/',
       ]

       def parse(self, response):
           for quote in response.css('div.quote'):
               yield {
                   'text': quote.css('span.text::text').get(),
                   'author': quote.css('small.author::text').get(),
                   'tags': quote.css('div.tags a.tag::text').getall(),
               }

Note: the dictionaries yielded above are collected by Scrapy as scraped items.

We can follow a link using `scrapy.Request` as follows:

.. code-block:: python

   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = [
           'http://quotes.toscrape.com/page/1/',
       ]

       def parse(self, response):
           for quote in response.css('div.quote'):
               yield {
                   'text': quote.css('span.text::text').get(),
                   'author': quote.css('small.author::text').get(),
                   'tags': quote.css('div.tags a.tag::text').getall(),
               }

           next_page = response.css('li.next a::attr(href)').get()
           if next_page is not None:
               next_page = response.urljoin(next_page)
               yield scrapy.Request(next_page, callback=self.parse)

A shortcut for the above is `response.follow`, which accepts relative URLs directly:

.. code-block:: python

   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = [
           'http://quotes.toscrape.com/page/1/',
       ]

       def parse(self, response):
           for quote in response.css('div.quote'):
               yield {
                   'text': quote.css('span.text::text').get(),
                   'author': quote.css('span small::text').get(),
                   'tags': quote.css('div.tags a.tag::text').getall(),
               }

           next_page = response.css('li.next a::attr(href)').get()
           if next_page is not None:
               yield response.follow(next_page, callback=self.parse)

You can create a custom `Item` class and yield that instead of a dictionary. E.g.:

.. code-block:: python

   import scrapy
   from myproject.items import MyItem


   class MySpider(scrapy.Spider):
       name = 'example.com'
       allowed_domains = ['example.com']

       def start_requests(self):
           yield scrapy.Request('http://www.example.com/1.html', self.parse)
           yield scrapy.Request('http://www.example.com/2.html', self.parse)
           yield scrapy.Request('http://www.example.com/3.html', self.parse)

       def parse(self, response):
           for h3 in response.xpath('//h3').getall():
               yield MyItem(title=h3)

           for href in response.xpath('//a/@href').getall():
               yield scrapy.Request(response.urljoin(href), self.parse)

Generic Spiders
```````````````

Scrapy ships a few generic spiders, which include:

1. `CrawlSpider `_
2. `XMLFeedSpider `_
3. `CSVFeedSpider `_
4. `SitemapSpider `_

These let you avoid writing redundant functionality by re-using these classes with your custom logic.

LinkExtractor
`````````````

`LinkExtractor `_ allows you to extract links from web pages (`scrapy.http.Response` objects). Link extractors are combined with `Rule `_ objects to specify which URLs are allowed and which are denied. For example:

.. code-block:: python

   rules = (
       # Extract links matching 'category.php' (but not matching 'subsection.php')
       # and follow links from them (since no callback means follow=True by default).
       Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

       # Extract links matching 'item.php' and parse them with the spider's method parse_item.
       Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
   )
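For context, here is a minimal sketch of how such rules plug into a `CrawlSpider` subclass; the spider name, domain, and `parse_item` logic are assumptions for illustration:

.. code-block:: python

   from scrapy.spiders import CrawlSpider, Rule
   from scrapy.linkextractors import LinkExtractor


   class ItemCrawlSpider(CrawlSpider):
       name = 'item_crawl'                    # hypothetical spider name
       allowed_domains = ['example.com']      # hypothetical domain
       start_urls = ['http://www.example.com/']

       rules = (
           # Follow category pages; no callback means follow=True by default.
           Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
           # Parse item pages with parse_item.
           Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
       )

       def parse_item(self, response):
           # Illustrative extraction; real selectors depend on the site's markup.
           yield {'url': response.url, 'title': response.css('title::text').get()}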

HTML parsing with `bs4`
```````````````````````

`bs4 `_ is a popular package used to parse HTML and XML, and it can be used in conjunction with Scrapy. `lxml` is used along with bs4 as a parser backend.

Find multiple elements using `find_all`:

.. code-block:: python

   from bs4 import BeautifulSoup

   moretxt = """
   <p>
   Visit the <a href="http://www.nytimes.com">New York Times</a>
   </p>
   <p>
   Visit the <a href="http://online.wsj.com">Wall Street Journal</a>
   </p>
   """

   soup = BeautifulSoup(moretxt, 'lxml')
   tags = soup.find_all('a')
   type(tags)
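`find_all` returns a list-like `ResultSet` of `Tag` objects. A small sketch of reading the link text and the `href` attribute from each tag found above:

.. code-block:: python

   # Iterate over the ResultSet returned by find_all(); tag.get('href')
   # reads an attribute and tag.text returns the enclosed text.
   for tag in tags:
       print(tag.get('href'), tag.text)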

""" soup = BeautifulSoup(moretxt, 'lxml') tags = soup.find_all('a') type(tags) Finding html element by id using `find` .. code-block:: python div = soup.find(id="articlebody") Finding html by tag and attirbute .. code-block:: python results = soup.findAll("td", {"valign" : "top"}) Reference: More examples on `official documents `_