Introduction to and Use of the Scrapy Crawler Framework

Introduction to Scrapy

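Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl websites and extract structured data from their pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing.

What makes Scrapy attractive is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider (named scrapy.Spider in later releases) and sitemap spiders, and the latest version adds support for crawling web 2.0 sites.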

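Basic functionality

Scrapy is an application framework designed for crawling websites and extracting data from them. It can be applied in a wide range of areas, such as data mining, information processing, and archiving historical data.

Although Scrapy was originally designed for screen scraping (more precisely, web scraping), it can also be used to extract data through APIs, such as Amazon's AWS, or as a general-purpose web crawler.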

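The Scrapy framework

Scrapy is an application framework implemented in Python for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and storing historical data.

Scrapy handles network communication with Twisted, an efficient event-driven asynchronous networking framework. This speeds up downloads without requiring you to implement an asynchronous framework yourself. Scrapy also exposes a variety of middleware interfaces, so it can be adapted flexibly to many requirements.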
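The benefit of event-driven, non-blocking I/O can be illustrated with a small sketch. It uses Python's standard asyncio module as a stand-in for Twisted, and the fake_download function, its URLs, and its delay are hypothetical. This is an analogy only, not how Scrapy is implemented.

```python
import asyncio
import time


async def fake_download(url: str, delay: float) -> str:
    """Pretend to fetch a page; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls):
    # All downloads wait concurrently, so the total time is close to the
    # slowest single download rather than the sum of all delays.
    return await asyncio.gather(*(fake_download(u, 0.1) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
started = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - started
print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

Run sequentially, the five simulated downloads would take about half a second; run concurrently they finish in roughly the time of one.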

Scrapy architecture

[Figure: Scrapy architecture diagram]

Scrapy Engine

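The engine controls the flow of data between all components of the system and triggers events when the corresponding actions occur. It acts as the crawler's "brain": the scheduling center of the whole crawler.

Scheduler

The scheduler receives requests from the engine and enqueues them so that it can hand them back later when the engine asks for them.
The initial URLs and any URLs found later in crawled pages are placed in the scheduler to wait for crawling. The scheduler also removes duplicate URLs automatically. If a particular URL should not be deduplicated (for example, the URL of a POST request), this can be configured.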
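The queueing and deduplication behaviour described above can be sketched with a toy class. ToyScheduler and its method names are hypothetical, and the plain FIFO queue and URL-string comparison are simplifying assumptions; Scrapy's real scheduler is more elaborate. The dont_filter flag mirrors the dont_filter argument that Scrapy's Request accepts for skipping the duplicate filter.

```python
from collections import deque


class ToyScheduler:
    """FIFO request queue that drops URLs it has already seen."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, url: str, dont_filter: bool = False) -> bool:
        """Queue a URL; return False if it was filtered as a duplicate."""
        if not dont_filter and url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_request(self):
        """Return the next URL, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


scheduler = ToyScheduler()
print(scheduler.enqueue("https://example.com/a"))                    # queued
print(scheduler.enqueue("https://example.com/a"))                    # filtered
print(scheduler.enqueue("https://example.com/a", dont_filter=True))  # queued again
```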

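Downloader

The downloader fetches page data and hands it to the engine, which then passes it on to the spider.

Spiders

Spiders are classes written by Scrapy users. They do the following:

Parse the response and extract items (the scraped data)
Extract additional URLs to follow and submit them to the engine, which adds them to the Scheduler. Each spider is responsible for one particular website (or a few)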

Item Pipeline

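The Item Pipeline processes the items extracted by the spider. Typical processing includes cleaning, validation, and persistence (for example, storing items in a database).
Once the spider has parsed a page and stored the required data in an Item, the item is sent to the item pipeline and processed by the pipeline components in their configured order, and is finally written to a local file or a database.
This is similar to a shell pipe such as $ ls | grep test, or to the filters in Django templates.

Typical uses of item pipelines include:

Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Saving the scraped results to a database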
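Two of the uses listed above, validation and duplicate dropping, can be sketched as a single pipeline class. The class name ValidateAndDedupPipeline and the choice of the title field as the uniqueness key are assumptions for illustration. The try/except import is only there so that the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class ValidateAndDedupPipeline:
    """Drop items without a title and items whose title was already seen."""

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = (item.get("title") or "").strip()
        if not title:
            raise DropItem("missing title")
        if title in self.seen_titles:
            raise DropItem(f"duplicate title: {title}")
        self.seen_titles.add(title)
        item["title"] = title  # store the cleaned value
        return item


pipeline = ValidateAndDedupPipeline()
kept = []
for raw in [{"title": " Book A "}, {"title": "Book A"}, {"rate": "8.0"}]:
    try:
        kept.append(pipeline.process_item(raw, spider=None))
    except DropItem as exc:
        print("dropped:", exc)
print(kept)
```

Raising DropItem is how a pipeline tells Scrapy to stop processing an item; returning the item passes it on to the next pipeline.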

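Downloader middlewares

Put simply, these are components for customizing and extending the download functionality.

Downloader middlewares are specific hooks that sit between the engine and the downloader and process the requests and responses passing between them.

They provide a convenient mechanism for extending Scrapy by plugging in custom code.

With a downloader middleware, a crawler can, for example, rotate its User-Agent or IP address automatically.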
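A User-Agent rotating middleware might look like the following minimal sketch. The class name, the User-Agent strings, and the fake request object used to exercise it are all hypothetical. The process_request(request, spider) hook follows the signature documented for the Scrapy 1.x line used in this article.

```python
import random
from types import SimpleNamespace


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a User-Agent for each request."""

    # Hypothetical pool; replace with the strings you actually want to send.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None tells Scrapy to keep processing this request


# Exercise the hook with a stand-in object instead of a real scrapy.Request.
fake_request = SimpleNamespace(headers={})
RandomUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

To use such a class in a project it would need to be listed in the DOWNLOADER_MIDDLEWARES setting, for example under a path like 'first.middlewares.RandomUserAgentMiddleware' (a hypothetical location).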

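Spider middlewares

Spider middlewares are specific hooks that sit between the engine and the spiders and process spider input (responses) and output (items or requests).

They provide the same convenient mechanism for extending Scrapy by plugging in custom code.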

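Data flow

1. The engine opens a website (open a domain), finds the Spider that handles it, and asks that spider for the first URL(s) to crawl.

2. The engine gets the first URLs to crawl from the Spider and schedules them as requests in the Scheduler.

3. The engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next URL to the engine, and the engine forwards it through the downloader middlewares to the Downloader.

5. Once the page has been downloaded, the Downloader generates a Response for it and sends it through the downloader middlewares to the engine.

6. The engine receives the Response from the Downloader and sends it through the spider middlewares to the Spider for processing.

7. The Spider processes the Response and returns the extracted Items and any new Requests (to follow) to the engine.

8. The engine hands the Items returned by the Spider to the Item Pipeline and the Requests to the Scheduler.

9. The process repeats (from step 2) until the Scheduler has no pending requests, and then the engine shuts down.

Note:

The program only stops once the Scheduler has no requests left. If a URL fails to download, it is retried (a limited number of times by default).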
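The nine steps above can be mimicked with a toy, single-threaded loop. Everything in this sketch (the SITE dictionary, download, parse, run_engine) is hypothetical and uses only the standard library; the middlewares and the real asynchronous machinery are left out.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page title, links found on the page).
SITE = {
    "/p1": ("Page 1", ["/p2", "/p3"]),
    "/p2": ("Page 2", ["/p3"]),
    "/p3": ("Page 3", []),
}


def download(url):
    """Downloader stand-in: look the page up instead of using the network."""
    title, links = SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider stand-in: yield one item, then the URLs to follow."""
    yield {"title": response["title"]}
    for link in response["links"]:
        yield link


def run_engine(start_urls):
    scheduler, seen, collected = deque(), set(), []
    for url in start_urls:                # steps 1-2: seed the scheduler
        seen.add(url)
        scheduler.append(url)
    while scheduler:                      # step 9: loop until the queue is empty
        url = scheduler.popleft()         # steps 3-4: next request
        response = download(url)          # step 5: downloader returns a response
        for result in parse(response):    # steps 6-7: spider processes it
            if isinstance(result, dict):  # step 8: items go to the pipeline
                collected.append(result)
            elif result not in seen:      # step 8: new requests go to the scheduler
                seen.add(result)
                scheduler.append(result)
    return collected


items = run_engine(["/p1"])
print(items)
```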

Installing Scrapy

Install wheel support

$ pip install wheel

Install the Scrapy framework

$ pip install scrapy

On Windows, to avoid having to compile the Twisted dependency from source, install the prebuilt binary package below

$ pip install Twisted-18.4.0-cp35-cp35m-win_amd64.whl

Without it, the following error appears on Windows:

    copying src\twisted\words\xish\xpathparser.g -> build\lib.win-amd64-3.5\twisted\words\xish
    running build_ext
    building 'twisted.test.raiser' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

The solution is to download a precompiled Twisted from Python Extension Packages for Windows:

For Python 3.5, download Twisted-18.4.0-cp35-cp35m-win_amd64.whl
For Python 3.6, download Twisted-18.4.0-cp36-cp36m-win_amd64.whl

Install Twisted:

$ pip install Twisted-18.4.0-cp35-cp35m-win_amd64.whl

After that, installing Scrapy works without problems.

Once it is installed, try the scrapy command:

    > scrapy
    Scrapy 1.5.0 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      check         Check spider contracts
      crawl         Run a spider
      edit          Edit spider
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      list          List available spiders
      parse         Parse URL (using its spider) and print the results
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

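Developing with Scrapy

Project workflow

1. Create the project

Use scrapy startproject proname to create a Scrapy project

scrapy startproject <project_name> [project_dir]

2. Write the item

Write the Item class in items.py to define which fields are extracted from the response

3. Write the spider

Write spiders/proname_spider.py, the spider that crawls the website and extracts the items

4. Write the item pipeline

Process the items, for example by storing them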

1 Create the project

1.1 Crawling Douban book ratings

The tag is "编程" (programming). The links to the first and second pages are:

https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T

https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
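The two links differ only in the start parameter, which suggests a simple way to build the URL of any page. The helper below is a sketch under that assumption: the function name tag_page_url is hypothetical, and the page size of 20 is inferred from the two URLs above rather than taken from any documented API.

```python
from urllib.parse import quote, urlencode


def tag_page_url(tag: str, page: int, page_size: int = 20) -> str:
    """Build the listing URL for a zero-based page number."""
    query = urlencode({"start": page * page_size, "type": "T"})
    return f"https://book.douban.com/tag/{quote(tag)}?{query}"


print(tag_page_url("编程", 0))
print(tag_page_url("编程", 1))
```

The percent-encoded part of the URL is simply the UTF-8 encoding of the tag, which is what quote produces.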

Pick any directory for the project and run the following command in it

$ scrapy startproject first .

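This produces the following directories and files

    first
    ├─ scrapy.cfg
    └─ first
        ├─ items.py
        ├─ middlewares.py
        ├─ pipelines.py
        ├─ settings.py
        ├─ __init__.py
        └─ spiders
            └─ __init__.py

first:

The outer first directory is the directory of the whole project; the inner first directory is the project's global directory (its Python package)

scrapy.cfg:

The project's configuration file; it is important and must exist

first (the inner project directory)
    __init__.py   required; marks the directory as a package
    items.py      defines Item classes that inherit from scrapy.Item and declare scrapy.Field instances
    pipelines.py  the key part is the process_item() method, which processes items
    settings.py:
        BOT_NAME                  name of the bot
        ROBOTSTXT_OBEY = True     whether to obey the robots.txt protocol
        USER_AGENT = ''           the User-Agent to send when crawling
        CONCURRENT_REQUESTS = 16  number of concurrent requests; the default is 16
        DOWNLOAD_DELAY = 3        download delay in seconds; usually worth setting so that consecutive requests are not sent too quickly
        COOKIES_ENABLED = False   cookies are enabled by default; they are generally only needed when logging in
        SPIDER_MIDDLEWARES        spider middlewares
        DOWNLOADER_MIDDLEWARES    downloader middlewares
            'first.middlewares.FirstDownloaderMiddleware': 543
            543 is the order value; the smaller the value, the higher the priority
        ITEM_PIPELINES            pipeline configuration
            'first.pipelines.FirstPipeline': 300
            names the pipeline that items are handed to; the smaller the value (here 300), the higher the priority

spiders directory
    __init__.py   required; spider classes can be written here, or in separate spider modules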

    # first/settings.py (for reference)
    BOT_NAME = 'first'
    SPIDER_MODULES = ['first.spiders']
    NEWSPIDER_MODULE = 'first.spiders'

    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
    ROBOTSTXT_OBEY = False

    DOWNLOAD_DELAY = 3

    # Disable cookies (enabled by default)
    COOKIES_ENABLED = False

Be sure to change the User-Agent; otherwise requests to https://book.douban.com/ return 403

2 Write the Item

Write the following in items.py:

    import scrapy

    class BookItem(scrapy.Item):
        title = scrapy.Field()  # book title
        rate = scrapy.Field()   # rating

3 Write the spider

Write a spider class for crawling Douban book ratings, in the spiders directory:

The spider class must inherit from scrapy.Spider and define the spider name, the crawl scope, the start URLs, and so on
scrapy.Spider does not implement the parse method, so the subclass should implement it. The method receives a response object

    # In the Scrapy source
    class Spider():
        def parse(self, response):  # parse the returned content
            raise NotImplementedError

Crawl the titles and ratings of books tagged "编程" on the books channel:

https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T

Create the spider from a template: $ scrapy genspider -t basic book https://www.douban.com/

    import scrapy

    class BookSpider(scrapy.Spider):  # BookSpider
        name = 'doubanbook'  # spider name; can be changed; important
        allowed_domains = ['douban.com']  # crawl scope
        url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T'
        start_urls = [url]  # start URLs

        # Once the downloader has the web server's response, parse() parses its content
        def parse(self, response):
            print(type(response), '~~~~~~~~~')  # scrapy.http.response.html.HtmlResponse
            print(response)
            print('-' * 30)

Use the crawl subcommand to run the spider

$ scrapy list
$ scrapy crawl -h
scrapy crawl [options] <spider>

Start crawling with the named spider:
$ scrapy crawl doubanbook

Logging can be suppressed:
$ scrapy crawl doubanbook --nolog

If running on Windows raises the Twisted-related exception ModuleNotFoundError: No module named 'win32api', install pywin32 with $ pip install pywin32.

response is the server's HTTP response; it is an instance of scrapy.http.response.html.HtmlResponse.

With that in mind, change the code as follows

    import scrapy
    from scrapy.http.response.html import HtmlResponse

    class BookSpider(scrapy.Spider):  # BookSpider
        name = 'doubanbook'  # spider name
        allowed_domains = ['douban.com']  # crawl scope
        url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T'
        start_urls = [url]  # start URLs

        # Once the downloader has the web server's response, parse() parses its content
        def parse(self, response: HtmlResponse):
            print(type(response))  # scrapy.http.response.html.HtmlResponse
            print('-' * 30)
            print(type(response.text), type(response.body))
            print('-' * 30)
            print(response.encoding)
            # Save the page so that it can be parsed offline later
            with open('o:/testbook.html', 'w', encoding='utf-8') as f:
                try:
                    f.write(response.text)
                    f.flush()
                except Exception as e:
                    print(e)

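3.1 Parsing HTML

The response object obtained by the spider can be parsed with a parsing library.

Scrapy wraps lxml, and the parent class TextResponse provides both an xpath method and a css method. The two interfaces can be mixed when parsing HTML.

Selector reference: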

https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html#id3

    from scrapy.http.response.html import HtmlResponse

    # Build a response object from the page saved earlier
    with open('o:/testbook.html', encoding='utf8') as f:
        response = HtmlResponse('file:///O:/testbook.html', body=f.read(), encoding='utf-8')
    # print(response.text)

    # Get all titles and ratings
    # Parsing with xpath
    subjects = response.xpath('//li[@class="subject-item"]')
    for subject in subjects:
        title = subject.xpath('.//h3/a/text()').extract()  # list
        print(title[0].strip())

        rate = subject.xpath('.//span[@class="rating_nums"]/text()').extract()
        print(rate[0].strip())

    print('-' * 30)
    # Parsing with css
    subjects = response.css('li.subject-item')
    for subject in subjects:
        title = subject.css('h3 a::text').extract()
        print(title[0].strip())

        rate = subject.css('span.rating_nums::text').extract()
        print(rate[0].strip())
    print('-' * 30)

    # Mixing xpath and css, and matching with regular expressions
    subjects = response.css('li.subject-item')
    for subject in subjects:
        # Extract the link
        href = subject.xpath('.//h3').css('a::attr(href)').extract()
        print(href[0])

        # Use a regular expression
        book_id = subject.xpath('.//h3/a/@href').re(r'\d*99\d*')
        if book_id:
            print(book_id[0])

        # Only show entries rated 9 or above
        rate = subject.xpath('.//span[@class="rating_nums"]/text()').re(r'^9.*')
        # rate = subject.css('span.rating_nums::text').re(r'^9\..*')
        if rate:
            print(rate)
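The two rating patterns in the listing above, r'^9.*' and the commented-out r'^9\..*', differ slightly. The standalone sketch below applies both with the standard re module to a hypothetical list of rating strings; the value "90" is included only to show the difference and would not occur on a ten-point scale.

```python
import re

# Hypothetical rating strings; "90" is included only to show the difference.
ratings = ["9.1", "8.9", "9.0", "90", "7.9"]

# '.' matches any character, so anything starting with 9 passes.
loose = [r for r in ratings if re.match(r"^9.*", r)]
# '\.' requires a literal decimal point right after the 9.
strict = [r for r in ratings if re.match(r"^9\..*", r)]

print(loose)   # ['9.1', '9.0', '90']
print(strict)  # ['9.1', '9.0']
```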

3.2 Wrapping data in items

    # spiders/bookspider.py
    import scrapy
    from scrapy.http.response.html import HtmlResponse
    from ..items import BookItem

    class BookSpider(scrapy.Spider):  # BookSpider
        name = 'doubanbook'  # spider name
        allowed_domains = ['douban.com']  # crawl scope
        url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T'
        start_urls = [url]  # start URLs

        # Once the downloader has the web server's response, parse() parses its content
        def parse(self, response: HtmlResponse):
            items = []
            # Parsing with xpath
            subjects = response.xpath('//li[@class="subject-item"]')
            for subject in subjects:
                title = subject.xpath('.//h3/a/text()').extract()
                rate = subject.xpath('.//span[@class="rating_nums"]/text()').extract_first()
                item = BookItem()
                item['title'] = title[0].strip()
                item['rate'] = rate.strip() if rate else None  # some books have no rating
                items.append(item)

            print(items)

            return items  # the items must be returned, otherwise nothing is saved

    # Save the returned data from the command line
    # scrapy crawl -h
    # --output=FILE, -o FILE dump scraped items into FILE (use - for stdout)
    # Supported file extensions: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
    # scrapy crawl doubanbook -o dbbooks.json

This produces the data shown in the figure below

[Figure: contents of the exported dbbooks.json file]

Note that the data in the figure above is already Unicode: the Chinese characters are written as Unicode escape sequences. Setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py writes the characters themselves instead.

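4 Processing with pipelines

Turn the parse method of BookSpider in bookspider.py into a generator: simply replace return items with yield item, so that it yields items one at a time instead of building a list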
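The difference between the two styles can be seen without Scrapy. In this sketch the function names and the sample rows are hypothetical; the dictionaries stand in for BookItem objects.

```python
def parse_as_list(rows):
    items = []
    for title, rate in rows:
        items.append({"title": title, "rate": rate})
    return items  # nothing is available until the whole loop has finished


def parse_as_generator(rows):
    for title, rate in rows:
        yield {"title": title, "rate": rate}  # handed over one at a time


rows = [("Book A", "9.1"), ("Book B", "8.7")]  # hypothetical sample data
gen = parse_as_generator(rows)
first = next(gen)  # the first item is ready before the rest are built
print(first)
print(parse_as_list(rows) == [first, *gen])  # both styles yield the same items
```

With the generator form, each item can be passed to the pipeline as soon as it is produced.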

The project scaffolding has already created a pipelines.py file with one class in it

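4.1 Enabling the pipeline

    # Configure item pipelines
    # See the Item Pipeline page of the Scrapy 1.8.0 documentation
    ITEM_PIPELINES = {
        'first.pipelines.FirstPipeline': 300,
    }

The integer 300 is the priority: the smaller the value, the higher the priority, so the earlier the pipeline runs.

Values range from 0 to 1000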
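When several pipelines are enabled, items pass through them in ascending order of these numbers. The sketch below only illustrates that ordering rule with a plain sort; the two extra pipeline paths are hypothetical, and this is not the code Scrapy itself uses to load pipelines.

```python
# Hypothetical setting with three pipelines; only the numbers matter here.
ITEM_PIPELINES = {
    "first.pipelines.JsonWriterPipeline": 800,
    "first.pipelines.CleanPipeline": 100,
    "first.pipelines.FirstPipeline": 300,
}

# Sort by the order value: lower numbers run first.
run_order = [
    path for path, order in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```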

4.2 Common methods

    class FirstPipeline(object):
        def __init__(self):  # global setup
            print('~~~~~~~~~~ init ~~~~~~~~~~~~')

        def open_spider(self, spider):  # called when a spider is opened
            print(spider, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')

        def process_item(self, item, spider):
            # item: the scraped item; spider: the spider that scraped it
            return item

        def close_spider(self, spider):  # called when a spider is closed
            print(spider, '========================================')

Requirement

Use a pipeline to store the scraped data in a JSON file

    # spiders/bookspider.py
    import scrapy
    from scrapy.http.response.html import HtmlResponse
    from ..items import BookItem

    class BookSpider(scrapy.Spider):  # BookSpider
        name = 'doubanbook'  # spider name
        allowed_domains = ['douban.com']  # crawl scope
        url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T'
        start_urls = [url]  # start URLs

        # Custom settings defined on the spider
        custom_settings = {
            'filename': 'o:/books.json'
        }

        # Once the downloader has the web server's response, parse() parses its content
        def parse(self, response: HtmlResponse):
            # Parsing with xpath
            subjects = response.xpath('//li[@class="subject-item"]')
            for subject in subjects:
                title = subject.xpath('.//h3/a/text()').extract()
                rate = subject.xpath('.//span[@class="rating_nums"]/text()').extract_first()
                item = BookItem()
                item['title'] = title[0].strip()
                item['rate'] = rate.strip() if rate else None  # some books have no rating

                yield item

    # pipelines.py
    import json

    class FirstPipeline(object):
        def __init__(self):  # global setup
            print('~~~~~~~~~~ init ~~~~~~~~~~~~')

        def open_spider(self, spider):  # called when a spider is opened
            print('{} ~~~~~~~~~~~~~~~~~~~~'.format(spider))
            print(spider.settings.get('filename'))
            self.file = open(spider.settings['filename'], 'w', encoding='utf-8')
            self.file.write('[\n')
            self.first = True  # no separator is needed before the first item

        def process_item(self, item, spider):
            # item: the scraped item; spider: the spider that scraped it
            # Write the comma before every item except the first, so that the
            # file remains valid JSON (no trailing comma before the closing bracket)
            if not self.first:
                self.file.write(',\n')
            self.first = False
            self.file.write(json.dumps(dict(item)))
            return item

        def close_spider(self, spider):  # called when a spider is closed
            self.file.write('\n]')
            self.file.close()
            print('{} ======================='.format(spider))
            print('-' * 30)
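A variation on the pipeline above writes one JSON object per line (the JSON Lines format), which needs no opening bracket, separators, or closing bracket. This is a sketch rather than the article's method: the class name JsonLinesPipeline, the default path, and the sample item are hypothetical, and the demo writes to a temporary directory.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Write each item as one JSON object per line (JSON Lines)."""

    def __init__(self, path="books.jl"):  # hypothetical default path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Exercise the pipeline by hand with a hypothetical item and no real spider.
path = os.path.join(tempfile.mkdtemp(), "books.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "编程入门", "rate": "9.0"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Each line can be parsed independently with json.loads, so a crawl that is interrupted halfway still leaves a usable file.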

5 URL extraction

To crawl the next page, you can either work out how the page number changes from page to page yourself, or extract the links from the pagination bar

    # spiders/bookspider.py
    import scrapy
    from scrapy.http.response.html import HtmlResponse
    from ..items import BookItem

    class BookSpider(scrapy.Spider):  # BookSpider
        name = 'doubanbook'  # spider name
        allowed_domains = ['douban.com']  # crawl scope
        url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T'
        start_urls = [url]  # start URLs

        # Custom settings defined on the spider
        custom_settings = {
            'filename': 'o:/books.json'
        }

        # Once the downloader has the web server's response, parse() parses its content
        def parse(self, response: HtmlResponse):
            # Get the next page; this is only a test, so a regex limits the page range
            print('-' * 30)
            urls = response.xpath('//div[@class="paginator"]/span[@class="next"]/a/@href').re(
                r'.*start=[24]\d[^\d].*')
            print(urls)
            print('-' * 30)
            yield from (scrapy.Request(response.urljoin(url)) for url in urls)
            print('++++++++++++++++++++++++')

            # Parsing with xpath
            subjects = response.xpath('//li[@class="subject-item"]')
            for subject in subjects:
                # Join the title and subtitle of the book
                title = "".join(map(lambda x: x.strip(), subject.xpath('.//h3/a//text()').extract()))
                rate = subject.xpath('.//span[@class="rating_nums"]/text()').extract_first()
                # print(rate)  # some books have no rating, so this may be None

                item = BookItem()
                item['title'] = title
                item['rate'] = rate
                yield item
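The effect of the page-limiting regular expression and of joining relative links to the page URL can be checked in isolation. The sketch below uses only the standard library; the hrefs are hypothetical values shaped like the links in the pagination bar, and urljoin stands in for response.urljoin.

```python
import re
from urllib.parse import urljoin

base = "https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T"
pattern = re.compile(r".*start=[24]\d[^\d].*")

# Hypothetical hrefs, shaped like the relative links in the pagination bar.
hrefs = [
    "/tag/%E7%BC%96%E7%A8%8B?start=20&type=T",
    "/tag/%E7%BC%96%E7%A8%8B?start=40&type=T",
    "/tag/%E7%BC%96%E7%A8%8B?start=60&type=T",
    "/tag/%E7%BC%96%E7%A8%8B?start=200&type=T",
]

# Keep only offsets 20-29 and 40-49, then turn them into absolute URLs.
followed = [urljoin(base, h) for h in hrefs if pattern.match(h)]
print(followed)
```

Only the links with start=20 and start=40 survive, so a crawl that begins at start=0 stops after the third page.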

Article source: http://jinyejixie.com/article22/ggccc.html
