What crawler frameworks are available for Node.js

This article looks at crawler frameworks for Node.js by walking through one of them, Crawl-pet, in detail. Interested readers can use it as a reference.


Step 1: Install Crawl-pet

Node.js needs little introduction. Install crawl-pet with npm:

$ npm install crawl-pet -g --production

Run it, and the program guides you through the configuration. On the first run it generates an info.json file in the project directory.

$ crawl-pet

> Set project dir: ./test-crawl-pet
> Create crawl-pet in ./test-crawl-pet [y/n]: y
> Set target url: http://foodshot.co/
> Set save rule [url/simple/group]: url
> Set file type limit: 
> The limit: not limit
> Set parser rule module:
> The module: use default crawl-pet.parser

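The test site used here, http://foodshot.co/, is a freely licensed site for sharing food photos, and its images are of very high quality. It is used here only for testing and learning; feel free to test against a different site.

With the default parser the project is already runnable, so let's see the result: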

$ crawl-pet -o ./test-crawl-pet

[Screenshot: a test run]

This is the directory structure after the download:

[Screenshot: local directory structure]

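Step 2: Write your own parser

Now let's look at how to write your own parser. There are three ways to set one up:

1. When creating a new project, enter the path of your parser at the "Set parser rule module" prompt.
2. Edit the parser entry in info.json.
3. The simplest: create a parser.js file directly in the project directory.

Use crawl-pet to create a parser template: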
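Editing the parser entry of info.json can also be scripted. The sketch below is only an illustration: setParser is a hypothetical helper, not part of crawl-pet, and it assumes info.json is plain JSON with a top-level "parser" key, as in the sample info.json at the end of this article.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper (not part of crawl-pet): point a project's info.json
// at a custom parser file. Assumes a top-level "parser" key.
function setParser(projectDir, parserFile) {
  const infoPath = path.join(projectDir, "info.json");
  const info = JSON.parse(fs.readFileSync(infoPath, "utf8"));
  info.parser = parserFile;
  // Write the file back, keeping every other setting untouched.
  fs.writeFileSync(infoPath, JSON.stringify(info, null, 1) + "\n");
  return info;
}
```

For example, setParser("./test-crawl-pet", "my_parser.js") would rewrite only the parser entry and leave the other settings as they were.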

$ crawl-pet --create-parser ./test-crawl-pet/parser.js

Open the ./test-crawl-pet/parser.js file:

// crawl-pet supports cheerio for page analysis, if you need it
const cheerio = require("cheerio")

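/*
 * The header function is called before a request is sent. It can configure
 * the request headers; returning false aborts the request.
 *
 * Parameters:
 *  options:   for details see https://github.com/request/request
 *  crawler_handle: object for communicating with the queue, described below
 *
 * The header function is optional and may be omitted.
 */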
exports.header = function(options, crawler_handle) {  
}

/*
 * The body function is called after a request returns and parses the result.
 *
 * Parameters:
 *  url:   the requested url
 *  body:   the returned content, as a string
 *  response:  the response object, for details see https://github.com/request/request
 *  crawler_handle: object for communicating with the queue, with these members:
 *   .info    : the crawl-pet configuration
 *   .uri     : uri information for the current request
 *   .addPage(url)  : add a page to the queue to be parsed
 *   .addDown(url)  : add a file to the queue to be downloaded
 *   .save(content, ext) : save text locally; ext sets the file extension
 *   .over()    : finish the current queue item and fetch the next one
 */

exports.body = function(url, body, response, crawler_handle) {
  // Collect every href and src attribute value in the page
  const re = /\b(href|src)\s*=\s*["']([^'"#]+)/ig
  let m = null
  while ((m = re.exec(body)) !== null) {
    const href = m[2]
    if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
      // Media file: add it as a download
      crawler_handle.addDown(href)
    } else if (!/\.(css|js|json|xml|svg)/.test(href)) {
      // Anything that is not a static asset: add it as a page to parse
      crawler_handle.addPage(href)
    }
  }
  // Always call over() once parsing is finished
  crawler_handle.over()
}
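To see what the template's body function does with a page, it can be run against a stand-in for crawler_handle. The following is a minimal sketch: makeMockHandle and the sample HTML (including its URLs) are made up for illustration, and the mock only records calls instead of queueing anything.

```javascript
// Same link-extraction logic as the template's body function, written as a
// plain function so it runs without crawl-pet installed.
function body(url, html, response, crawler_handle) {
  const re = /\b(href|src)\s*=\s*["']([^'"#]+)/ig;
  let m = null;
  while ((m = re.exec(html)) !== null) {
    const href = m[2];
    if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
      crawler_handle.addDown(href);
    } else if (!/\.(css|js|json|xml|svg)/.test(href)) {
      crawler_handle.addPage(href);
    }
  }
  crawler_handle.over();
}

// Hypothetical stand-in for crawler_handle: records calls instead of queueing.
function makeMockHandle() {
  const calls = { pages: [], downloads: [], over: 0 };
  return {
    calls,
    addPage: (u) => calls.pages.push(u),
    addDown: (u) => calls.downloads.push(u),
    over: () => { calls.over += 1; },
  };
}

// Made-up sample page, for illustration only.
const sampleHtml = `
  <link href="/static/site.css">
  <a href="/recipes/page/2">next</a>
  <img src="http://photos.foodshot.co/a/pasta.jpg">
  <script src="/static/app.js"></script>
`;

const handle = makeMockHandle();
body("http://foodshot.co/", sampleHtml, null, handle);
console.log(handle.calls);
```

In this run the stylesheet and script are skipped, the image goes to addDown, and the relative link goes to addPage unchanged. How crawl-pet itself resolves relative links is not covered by this sketch.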

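A complete example configuration is shared at the end of this article; read on for it.

Step 3: View the crawled data

Look up the download URL of files that have already been downloaded locally: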

$ crawl-pet -f ./test-crawl-pet/photos.foodshot.co/*.jpg

[Screenshot: looking up download URLs]

View the waiting queue:

$ crawl-pet -l queue

[Screenshot: viewing the waiting queue]

View the list of downloaded files:


$ crawl-pet -l down
# View the 5 entries after entry 0 in the downloaded list
$ crawl-pet -l down,0,5
# The --json flag prints the output as JSON
$ crawl-pet -l down,0,5 --json
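The list commands can also be driven from a Node.js script. The sketch below is an assumption-laden illustration: listArgs and listAsJson are hypothetical helpers, crawl-pet is assumed to be on the PATH, and --json is assumed to print a single JSON document. The shape of that JSON is not documented here, so the result is returned as-is.

```javascript
const { execFileSync } = require("child_process");

// Build the arguments for `crawl-pet -l <kind>[,start,count] [--json]`.
function listArgs(kind, start, count, asJson) {
  const allowed = ["page", "down", "queue"];
  if (!allowed.includes(kind)) {
    throw new Error(`unknown list kind: ${kind}`);
  }
  // Join only the parts that were actually given, e.g. "down,0,5".
  const range = [kind, start, count].filter((v) => v !== undefined).join(",");
  return asJson ? ["-l", range, "--json"] : ["-l", range];
}

// Hypothetical wrapper: run the command inside a project directory and
// parse whatever JSON it prints. Not called here, since it needs crawl-pet.
function listAsJson(projectDir, kind, start, count) {
  const out = execFileSync("crawl-pet", listArgs(kind, start, count, true), {
    cwd: projectDir,
    encoding: "utf8",
  });
  return JSON.parse(out);
}
```

A call such as listAsJson("./test-crawl-pet", "down", 0, 5) would correspond to the last command above, run from the project directory.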

[Screenshot: downloaded files]

View the list of parsed pages. The arguments are the same as for the downloaded list:


$ crawl-pet -l page

Those are the basic features. Take a look at the built-in help:

The crawler framework is open source. Its GitHub repository is at https://github.com/wl879/Crawl-pet

$ crawl-pet --help

 Crawl-pet options help:

 -u, --url  string    Destination address
 -o, --outdir string    Save the directory, Default use pwd
 -r, --restart      Reload all page
 --clear        Clear queue
 --save   string    Save file rules following options
          = url: Save the path consistent with url
          = simple: Save file in the project path
          = group: Save 500 files in one folder
 --types   array    Limit download file type
 --limit   number=5   Concurrency limit
 --sleep   number=200   Concurrent interval
 --timeout  number=180000  Queue timeout
 --proxy   string    Set up proxy
 --parser  string    Set crawl rule, it's a js file path!
          The default load the parser.js file in the project path
 --maxsize  number    Limit the maximum size of the download file
 --minwidth  number    Limit the minimum width of the download file
 --minheight  number    Limit the minimum height of the download file
 -i, --info       View the configuration file
 -l, --list  array    View the queue data 
          e.g. [page/down/queue],0,-1
 -f, --find  array    Find the download URL of the local file
 --json        Print result to json format
 -v, --version      View version
 -h, --help       View help
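The help text describes the "url" save rule as keeping the saved path consistent with the URL, and the earlier find command pointed at a folder named after the image host. As a rough illustration of that idea only, here is a hypothetical urlToLocalPath function; it is not taken from crawl-pet's source, and the sample URL is made up.

```javascript
const path = require("path");

// Hypothetical illustration of a "url"-style save rule: mirror the host and
// path of the URL under the output directory. Not crawl-pet's actual code.
function urlToLocalPath(outdir, fileUrl) {
  const u = new URL(fileUrl);
  // Drop the leading "/" so the URL path joins as a relative path.
  const relative = u.pathname.replace(/^\/+/, "");
  return path.posix.join(outdir, u.hostname, relative);
}

console.log(urlToLocalPath("./test-crawl-pet", "http://photos.foodshot.co/a/pasta.jpg"));
// prints "test-crawl-pet/photos.foodshot.co/a/pasta.jpg"
```

The query string and fragment are ignored in this sketch; a real implementation would need to decide how to handle them and how to avoid name collisions.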

Finally, here is an example configuration:

$ crawl-pet -u https://www.reddit.com/r/funny/ -o reddit --save group

info.json

{
 "url": "https://www.reddit.com/r/funny/",
 "outdir": ".",
 "save": "group",
 "types": "",
 "limit": "5",
 "parser": "my_parser.js",
 "sleep": "200",
 "timeout": "180000",
 "proxy": "",
 "maxsize": 0,
 "minwidth": 0,
 "minheight": 0,


 "cookie": "over18=1"
}

my_parser.js

exports.body = function(url, body, response, crawler_handle) {
 const re = /\b(data-url|href|src)\s*=\s*["']([^'"#]+)/ig
 var m = null
 while (m = re.exec(body)){
  let href = m[2]
  if (/thumb|user|icon|\.(css|json|js|xml|svg)\b/i.test(href)) {
   continue
  }
  if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
   crawler_handle.addDown(href)
   continue
  }
  if(/reddit\.com\/r\//i.test(href)){
   crawler_handle.addPage(href)
  }
 }
 crawler_handle.over()
}
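The info.json above carries a "cookie" entry, and the parser template allows an optional header function. The sketch below shows how the two could be combined in my_parser.js. It rests on assumptions: options is taken to be a request-style options object with custom headers under options.headers, the User-Agent value is made up, and whether crawl-pet already applies the cookie entry on its own is not confirmed here.

```javascript
// Sketch of an optional header function for my_parser.js.
// Assumption: options follows the request library's shape (options.headers).
function header(options, crawler_handle) {
  // crawler_handle.info holds the crawl-pet configuration (info.json).
  const info = crawler_handle.info || {};
  options.headers = options.headers || {};
  if (info.cookie) {
    options.headers["Cookie"] = info.cookie;
  }
  // Hypothetical User-Agent value, for illustration only.
  options.headers["User-Agent"] = "Mozilla/5.0 (compatible; my-crawler/0.1)";
  // Returning nothing lets the request proceed; returning false would abort it.
}

exports.header = header;
```

Existing headers on options are preserved, so this can sit alongside whatever defaults the crawler sets.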

That covers crawler frameworks for Node.js. Hopefully the material above is helpful and teaches you something new.

Reposted from: http://jinyejixie.com/article2/pspeic.html
