Scrapy is an open-source (OSS) tool for extracting the data you need from websites. In addition, Scrapy Cloud by Scrapinghub is a powerful and useful platform for running Scrapy crawlers. This article is a basic tutorial for Scrapy.
Installation
First, install Scrapy and the Scrapinghub command-line client (shub):
```
pip install scrapy shub
```
Then create a new Scrapy project:
```
scrapy startproject PROJECT_NAME
```
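This generates a project skeleton roughly like the following (the exact files depend on your Scrapy version):

```
PROJECT_NAME/
    scrapy.cfg            # deploy configuration
    PROJECT_NAME/         # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # directory where your spiders live
            __init__.py
```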
Create Spider
In Scrapy, you create spiders, which are the crawlers in a project. You can generate a spider with the following command:
```
# Generate code to scrape a new site
scrapy genspider SPIDER_NAME DOMAIN
```
After running the above command, edit the generated spider file, e.g. spiders/blogsider.py:
```
import scrapy
```
With a spider like this, you can extract the title of each entry in the list.
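For reference, a minimal spider that collects entry titles might look like this; the class name, start URL, and CSS selectors are assumptions and need to be adapted to the target blog's markup:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = "blogsider"  # must match the name used when generating the spider
    start_urls = ["https://example.com/blog"]  # placeholder: the blog's entry list page

    def parse(self, response):
        # Yield the title of each entry in the list
        for entry in response.css("div.entry"):
            yield {"title": entry.css("h2 a::text").get()}

        # Follow the "next page" link if one exists (selector is an assumption)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```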
Run the crawler locally
You can run your crawler locally:
```
# Execute scraping locally (optionally write the items to a JSON file)
scrapy crawl SPIDER_NAME -o items.json
```
Get an API Key on Scrapinghub
If you create an account on Scrapy Cloud (Scrapinghub), you can deploy your code there and run it in the cloud.
After creating the account, create a project and note your API key and project ID.
Create the Scrapinghub Configuration
Create scrapinghub.yml:
```
projects:
  default: PROJECT_ID
```
Set your project ID and the stack you want to use, as shown in the sketch below. If you want to know more about stacks, see shub/deploying.
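For example, a scrapinghub.yml that also pins a stack might look like this (the project ID and stack name are placeholders to replace with your own values):

```
projects:
  default: 123456            # your Scrapy Cloud project ID
stacks:
  default: scrapy:1.5-py3    # the stack (Scrapy version / Python runtime) to run on
```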
Deploy to Scrapinghub
Let's deploy your code to Scrapinghub with the API key:
```
# Install the Scrapinghub client library (if you have not already)
pip install shub
```
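The usual shub workflow is then to log in with the API key and deploy; a sketch (the project ID and spider name are placeholders):

```
shub login                             # paste your Scrapinghub API key when prompted
shub deploy PROJECT_ID                 # deploy this project to Scrapy Cloud
shub schedule PROJECT_ID/SPIDER_NAME   # optionally start a run from the command line
```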
After that, you can check the scraped items on the Scrapinghub dashboard.
How to access settings
print("Existing settings: %s" % self.settings.attributes.keys()) |
If you want to know more, see https://doc.scrapy.org/en/latest/topics/settings.html.
Appendix
Scrapy Cloud API
Scrapy Cloud on Scrapinghub has a useful API. See the following for details:
https://doc.scrapinghub.com/scrapy-cloud.html
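Beyond the raw HTTP API linked above, there is also a Python client; a sketch of scheduling a run and reading its items with the scrapinghub package (the API key, project ID, and spider name are placeholders, and items can only be read once the job has finished):

```python
from scrapinghub import ScrapinghubClient  # pip install scrapinghub

client = ScrapinghubClient("YOUR_API_KEY")  # API key from your account page
project = client.get_project(123456)        # placeholder project ID
job = project.jobs.run("SPIDER_NAME")        # schedule a spider run
print(job.key)                               # job identifier, e.g. "123456/1/4"

# After the job has finished, iterate over the scraped items
for item in job.items.iter():
    print(item)
```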
Downloading and processing files and images
Scrapy provides reusable item pipelines for downloading files attached to a particular item, for example to:
- Download the images attached to items
- Create thumbnails
If you want to know more, see the official documentation: Downloading and processing files and images.
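For example, the built-in ImagesPipeline can be enabled in settings.py roughly like this (the storage path and thumbnail sizes are illustrative, and your items need an image_urls field):

```python
# settings.py
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}

IMAGES_STORE = "/path/to/images/dir"  # where downloaded images are stored
IMAGES_THUMBS = {                     # optional: generate thumbnails in these sizes
    "small": (50, 50),
    "big": (270, 270),
}
```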
Scrapy Configuration
About settings.py:
- DEPTH_LIMIT: depth to scrape
- DOWNLOAD_DELAY: interval between downloads while scraping
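For example, these might be set in settings.py as follows (the values are illustrative):

```python
# settings.py
DEPTH_LIMIT = 3       # do not follow links deeper than 3 levels
DOWNLOAD_DELAY = 1.0  # wait 1 second between requests to the same site
```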
Sentry
Installation
```
pip install scrapy-sentry
```
Configuration
```
# settings.py
SENTRY_DSN = "YOUR_SENTRY_DSN"  # sentry dsn
```
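scrapy-sentry also needs its extension enabled in settings.py; a sketch based on the plugin's README (the priority value is arbitrary):

```python
EXTENSIONS = {
    "scrapy_sentry.extensions.Errors": 10,  # forward spider errors to Sentry
}
```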
Addons
Magic Fields
- https://github.com/scrapy-plugins/scrapy-magicfields
- It can add fields such as job_id, spider_name, and created_at to items.
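A sketch of how Magic Fields might be configured in settings.py, following the plugin's README (the field definitions are examples; verify the exact setting names against the linked repository):

```python
SPIDER_MIDDLEWARES = {
    "scrapy_magicfields.MagicFieldsMiddleware": 100,  # priority value is arbitrary
}

MAGIC_FIELDS = {
    "timestamp": "$time",                 # when the item was scraped
    "spider": "$spider:name",             # name of the spider that produced it
    "url": "scraped from $response:url",  # URL the item came from
}
```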
Monitoring
- Monitor spiders and validate scraped items
- Lets you define checks based on Scrapy stats values
- Get email and Slack notifications
DotScrapy Persistence
- Allows the crawler to access persistent storage (backed by S3) and share data among different runs of a spider
- It can skip URLs already seen in previous jobs of a spider.
- It can sometimes be difficult to recover from errors, so use it with care.
Recommended VPS Service
VULTR provides a high-performance cloud compute environment.
Vultr has 15 data centers strategically placed around the globe, and you can use a VPS with 512 MB of memory for just $2.50/month ($0.004/hour).
In addition, Vultr is up to 4 times faster than the competition, so please check the benchmark results!