Basic tutorial for Scrapy [Scraping framework]


Scrapy is an open-source (OSS) tool for extracting the data you need from websites. In addition, Scrapy Cloud by Scrapinghub is a powerful and useful platform for running Scrapy crawlers. This article is a basic tutorial for Scrapy.

🐠 Installation

First, install Scrapy and the Scrapinghub command-line library (shub):

pip install scrapy shub

Then create a new Scrapy project:

scrapy startproject PROJECT_NAME
cd PROJECT_NAME

🎳 Create Spider

In Scrapy, you create spiders (crawlers) inside a project. You can generate a spider with the following command:

# Generate code to scrape a new site
scrapy genspider blogspider blog.scrapinghub.com

After running the above command, edit spiders/blogspider.py:

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    allowed_domains = ['blog.scrapinghub.com']
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Extract the title of each blog entry on the page
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        # Follow the link to the next page and parse it with the same callback
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

With this, your spider extracts the title of each entry in the blog list and follows the link to the next page.
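
Before running the whole crawl, it can be handy to check the CSS selectors interactively with scrapy shell. The following is a minimal session sketch (the exact values returned depend on the current state of the blog):

# Start an interactive shell against the target page:
#   scrapy shell 'https://blog.scrapinghub.com'
# Inside the shell, `response` is already bound to the downloaded page
response.css('h2.entry-title a ::text').extract_first()   # first post title
response.css('div.prev-post > a::attr(href)').extract()   # link(s) to the next page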

🐝 Run the crawler locally

You can run your crawler locally:

# Execute scraping locally
scrapy crawl blogspider
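
If you also want to save the scraped items to a file, you can use Scrapy's feed export option, e.g. scrapy crawl blogspider -o items.json.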

😎 Get API Key on Scrapinghub

If you create an account on Scrapy Cloud (Scrapinghub), you can deploy your code there and execute it in the cloud.

After creating the account, please create a project and note your API key and project ID.

🗻 Create Scrapinghub Configuration

Create scrapinghub.yml:

projects:
  default: YOUR_PROJECT_ID
stack: scrapy:1.5-py3

Set your own project ID and the stack you want to use. If you want to know more about stacks, please see shub/deploying.
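
If your spider depends on extra Python packages, the same scrapinghub.yml can also point to a requirements file (a requirements: entry referencing, e.g., requirements.txt); see the shub documentation above for the exact syntax.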

🐰 Deploy to Scrapinghub

Let's deploy your code to Scrapinghub with the API key:

# Install the Scrapinghub CLI (already done if you followed the installation step)
pip install shub

# Log in to Scrapinghub with your API key
shub login

# Deploy to Scrapinghub
shub deploy # Deploy the spider to Scrapy Cloud

# Execute spider on Scrapinghub
shub schedule blogspider

After that, you can check the resulting items on the Scrapinghub dashboard.
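
The shub CLI can also fetch results from the command line with its items subcommand (shub items <job_id>), if you prefer not to use the web UI.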

🐑 How to access settings

print("Existing settings: %s" % self.settings.attributes.keys())

If you want to know more detail, please see https://doc.scrapy.org/en/latest/topics/settings.html.
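
As a rough sketch (the spider name and values here are only examples), a spider can read individual settings with self.settings.get() and friends, or override project settings just for itself via the custom_settings class attribute:

import scrapy


class SettingsDemoSpider(scrapy.Spider):
    name = 'settingsdemo'
    start_urls = ['https://blog.scrapinghub.com']

    # These values override settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,
    }

    def parse(self, response):
        # self.settings is available while the spider is running
        delay = self.settings.getfloat('DOWNLOAD_DELAY')
        self.logger.info('DOWNLOAD_DELAY is %s', delay)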

🎂 Appendix

Scrapy Cloud API

Scrapy Cloud on Scrapinghub has a useful API. Please see the following for details:

https://doc.scrapinghub.com/scrapy-cloud.html
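
There is also an official Python client for this API (pip install scrapinghub). The following is a minimal sketch, assuming YOUR_API_KEY and YOUR_PROJECT_ID are the values from your account and that the blogspider jobs above have finished:

from scrapinghub import ScrapinghubClient

# Both values come from the Scrapinghub dashboard
client = ScrapinghubClient('YOUR_API_KEY')
project = client.get_project(YOUR_PROJECT_ID)  # numeric project ID

# Iterate over finished jobs of the spider and print their scraped items
for job_summary in project.jobs.iter(spider='blogspider', state='finished'):
    job = project.jobs.get(job_summary['key'])
    for item in job.items.iter():
        print(item)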

Downloading and processing files and images

Scrapy provides reusable item pipelines for downloading files attached to a particular item.

  • Download images attached to the scraped items
  • Generate thumbnails of the downloaded images

If you want to know more detail, please see the official document: Downloading and processing files and images.
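
As a rough configuration sketch in settings.py (the store path and thumbnail size are only examples), you enable the built-in ImagesPipeline, tell it where to save files, and make sure your items carry an image_urls field with the URLs to download:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Where downloaded images are stored (a local path or e.g. an S3 bucket URI)
IMAGES_STORE = '/path/to/images'

# Optional: generate thumbnails alongside the original images
IMAGES_THUMBS = {
    'small': (50, 50),
}

Note that the ImagesPipeline requires the Pillow library to be installed.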

Scrapy Configuration

About settings.py:

  • DEPTH_LIMIT : maximum depth that will be crawled below the start pages
  • DOWNLOAD_DELAY : delay (in seconds) between requests while scraping
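
For example, in settings.py:

# settings.py
DEPTH_LIMIT = 2        # do not follow links more than two levels deep
DOWNLOAD_DELAY = 1.0   # wait one second between requests to the same site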

Sentry

The scrapy-sentry extension reports errors raised by your spiders to Sentry.

Installation

pip install scrapy-sentry

Configuration

# settings.py: Sentry DSN and the error-reporting extension
SENTRY_DSN = 'http://public:secret@example.com/1'

EXTENSIONS = {
    "scrapy_sentry.extensions.Errors": 10,
}

Addons

Magic Fields

Automatically adds extra fields (such as the scraping timestamp or the response URL) to every scraped item.

Monitoring

  • Monitor spiders and run item validation checks
  • Define Scrapy stats values to watch
  • Get email and Slack notifications

DotScrapy Persistence

  • Allows the crawler to access persistent storage (backed by S3) and share data among different runs of a spider
  • It ignores duplicate URLs across jobs of the same spider.
  • It can sometimes be hard to recover from errors, so use it with care.

🖥 Recommended VPS Service

VULTR provides a high-performance cloud compute environment. Vultr has 15 data centers strategically placed around the globe, and you can use a VPS with 512 MB of memory for just $2.50/month ($0.004/hour). In addition, Vultr is up to 4 times faster than the competition, so please check it => Check Benchmark Results!!