What Is Scrapy
Scrapy is an application framework for crawling websites and extracting structured data from their pages. In this post we explore Scrapy by using it to implement web scraping in Python.
This post covers the following topics:
- How To Install Scrapy
- Create A Scrapy Project
- Export Scraped Data As CSV
At the time of writing, Scrapy runs on Python 2.7 and on Python 3.4 or above; newer Scrapy releases require Python 3. If you’re using Anaconda, you can install the package from the conda-forge channel, which provides packages for Linux, Windows and OS X.
How To Install Scrapy
You can install Scrapy either with conda or, if you’re familiar with installing Python packages, together with its dependencies from PyPI.
Install Scrapy Using Anaconda
conda install -c conda-forge scrapy
Install Scrapy Using PyPI
pip install Scrapy
Install Scrapy On Ubuntu 14.04 And Above
On Ubuntu 14.04 and above, you need to install these dependencies before installing Scrapy:
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
Install Scrapy On Python 3
If you want to install Scrapy on Python 3, you’ll also need Python 3 development headers:
sudo apt-get install python3 python3-dev
Inside a virtualenv, you can install Scrapy with pip:
pip install scrapy
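To confirm that the installation worked, a small check like the one below can help. It is only a sketch: the helper name scrapy_status is made up for this post, and it relies on nothing beyond the standard library and Scrapy's own __version__ attribute.

```python
import importlib.util
import sys


def scrapy_status():
    """Return a short report about the interpreter and the Scrapy install."""
    python = "Python {}.{}".format(*sys.version_info[:2])
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("scrapy") is None:
        return python + ": Scrapy is not installed"
    import scrapy
    return python + ": Scrapy " + scrapy.__version__


print(scrapy_status())
```

Run it with the same interpreter (or inside the same virtualenv) that you used for the install, otherwise it may report that Scrapy is missing.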
Create A Scrapy Project
Before we start scraping, we need to create a Scrapy project. Switch to the directory where you want the project to live and run:
scrapy startproject project_name
This will create the following directory structure:
project_name/
    scrapy.cfg            # deploy configuration file
    project_name/         # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
The two most important entries to consider are:
settings.py – This file will hold all the settings you have set for your project.
spiders/ – This folder will store all your custom spiders used in the project.
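To give a feel for what settings.py contains, here is a minimal sketch of a few commonly used settings. The values are illustrative assumptions for a project called project_name; the file generated by scrapy startproject contains more entries and its defaults may differ.

```python
# settings.py (illustrative fragment, values are assumptions)

# Name of the bot implemented by this project.
BOT_NAME = "project_name"

# Where Scrapy looks for spiders, and where new ones are created.
SPIDER_MODULES = ["project_name.spiders"]
NEWSPIDER_MODULE = "project_name.spiders"

# Respect the rules published in each site's robots.txt.
ROBOTSTXT_OBEY = True

# Wait (in seconds) between requests to the same site, to stay polite.
DOWNLOAD_DELAY = 1
```

Because settings.py is an ordinary Python module, each setting is just a module-level variable.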
Create A Scrapy Spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).
Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
The spider subclasses scrapy.Spider and defines some attributes and methods:
name: identifies the spider. It must be unique within the project, so no two spiders can share the same name.
start_requests(): returns an iterable of requests that the spider starts crawling from; they are processed one after another, from the first request to the last. The spider above does not define this method and instead lists its initial URLs in start_urls, which Scrapy turns into the initial requests.
parse(): called to handle the response downloaded for each request. The response argument is an instance of TextResponse that holds the page content.
The parse() method usually also parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new requests (Request) from them.
How To Run Spider From Scrapy
To make your spider work, go to the project’s top level directory and run:
scrapy crawl quotes
This command runs the spider and produces output similar to the following:
... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
Export Scraped Data As CSV
We could read the scraped data straight from the command-line output, but it is usually better to export it in a format such as CSV, JSON or XML. This saves a lot of time, and the exported file can be imported into other programs. To make this easy, Scrapy provides a feature called feed exports, which lets you save the scraped items in various formats.
To export as CSV, add the following to the settings.py file:
# Export as CSV feed
FEED_FORMAT = "csv"
FEED_URI = "your csv name.csv"
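In Scrapy 2.1 and later, FEED_FORMAT and FEED_URI are deprecated in favour of a single FEEDS setting. If you are on a newer release, the equivalent configuration might look like the sketch below, where the output file name quotes.csv is an assumption made for this example:

```python
# settings.py (fragment for Scrapy 2.1 or later)

# Map each output file to its export options.
FEEDS = {
    "quotes.csv": {"format": "csv"},
}
```

The same export can also be requested for a single run from the command line, with scrapy crawl quotes -o quotes.csv, without changing settings.py.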
That’s all! We have successfully exported the data as CSV, and we now know how to implement web scraping using Scrapy.