How to Scrape Jianshu Articles with Python

This article explains how to scrape articles from Jianshu with Python. The example code is walked through in detail and should serve as a useful reference.


Steps for scraping Jianshu articles with Python:

1. Preparation: create the Scrapy spider and set up the database and table

# Open CMD or a terminal in a directory of your choice
# Create a new project
scrapy startproject jianshu_spider
cd jianshu_spider
# Create a spider
scrapy genspider -t crawl jianshu "jianshu.com"
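Step 1 also calls for a database and a table, which the article does not show. The following is only a minimal sketch: it assumes a MySQL database named jianshu (matching the connection settings used in the pipeline step) and a hypothetical table named article whose columns mirror the Item fields. The column types, the table name, and the helper create_article_table are assumptions, not the author's schema.

```python
# Hypothetical schema: table name, column types and lengths are assumptions.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS article (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    content LONGTEXT,
    author VARCHAR(255),
    avatar VARCHAR(255),
    pubtime VARCHAR(64),
    article_id VARCHAR(20),
    origin_url VARCHAR(255)
) DEFAULT CHARSET = utf8
"""


def create_article_table(conn):
    """Create the table through any DB-API connection (for example pymysql)."""
    cursor = conn.cursor()
    try:
        cursor.execute(CREATE_TABLE_SQL)
        conn.commit()
    finally:
        # Always release the cursor, even if the statement fails.
        cursor.close()
```

With pymysql installed and the jianshu database already created, the helper could be called once as create_article_table(pymysql.connect(**db_params)) before the first crawl.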


2. Approach: inspect every href attribute on the page to find the article link addresses
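To make the idea concrete, here is a small sketch that collects every href from a page using only the standard library. It is not how Scrapy does it (the spider relies on LinkExtractor); the class name HrefCollector and the HTML fragment are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the absolute form of every href found on <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


# Made-up HTML fragment standing in for the home page.
sample_html = """
<a href="/p/0123456789ab">An article</a>
<a href="/u/abcdef">An author page</a>
<a href="https://www.jianshu.com/p/abcdef012345?utm=x">Another article</a>
"""

collector = HrefCollector("https://www.jianshu.com/")
collector.feed(sample_html)
print(collector.links)
```

Only some of the collected links point to articles, which is why the crawl rule filters them by URL pattern.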


3. Implementation: parse the home page to get the article links, build an Item model to hold the data, and save the data to the database

The first step is to specify the start URL and the crawl rules.

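# Class attributes of the generated CrawlSpider; requires
# from scrapy.linkextractors import LinkExtractor and from scrapy.spiders import Rule
allowed_domains = ['jianshu.com']
start_urls = ['https://www.jianshu.com/']
rules = (
    # An article id consists of 12 lowercase letters or digits 0-9
    Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
)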
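The pattern passed to allow can be tried out on its own before running the spider. The sketch below approximates the filter with re.search from the standard library; the URLs are made up for illustration.

```python
import re

# Same pattern as the allow= argument of the Rule above.
ARTICLE_PATTERN = re.compile(r'.*/p/[0-9a-z]{12}.*')

# Made-up URLs for illustration.
candidates = [
    "https://www.jianshu.com/p/0123456789ab",
    "https://www.jianshu.com/p/0123456789ab?utm_source=desktop",
    "https://www.jianshu.com/u/0123456789ab",
    "https://www.jianshu.com/p/short",
]

# Keep only the URLs that look like article pages.
article_urls = [url for url in candidates if ARTICLE_PATTERN.search(url)]
print(article_urls)
```

Only the first two URLs pass: the third is a /u/ page rather than a /p/ page, and the fourth has fewer than 12 characters after /p/.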

The second step is to take the Response returned by the downloader and use XPath to extract the useful data. You can run "scrapy shell <url>" to check that the data is extracted correctly.

# Extract the required data
title = response.xpath('//h2[@class="title"]/text()').get()
author = response.xpath('//div[@class="info"]/span/a/text()').get()
# self.HTTPS is a spider attribute that is not shown in this article
avatar = self.HTTPS + response.xpath('//div[@class="author"]/a/img/@src').get()
pub_time = response.xpath('//span[@class="publish-time"]/text()').get().replace("*", "")
current_url = response.url
real_url = current_url.split(r"?")[0]
article_id = real_url.split(r'/')[-1]
content = response.xpath('//div[@class="show-content"]').get()
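The URL handling and the publish-time cleanup can be checked without Scrapy. This is a small sketch with two hypothetical helpers, extract_article_id and clean_pub_time, which are not part of the original spider. The second one also guards against .get() returning None when the XPath matches nothing, a case the code above would fail on.

```python
def extract_article_id(url):
    """Drop the query string, then take the last path segment."""
    real_url = url.split("?")[0]
    return real_url.split("/")[-1]


def clean_pub_time(raw):
    """Strip the '*' marker; tolerate a missing node (None)."""
    return raw.replace("*", "") if raw is not None else None


# Made-up URL for illustration.
print(extract_article_id("https://www.jianshu.com/p/0123456789ab?utm_source=desktop"))
print(clean_pub_time("2019.01.01 12:00*"))
print(clean_pub_time(None))
```

Note that a URL ending in a slash would yield an empty id with this approach, so it relies on article URLs having the form matched by the crawl rule.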

Next, build an Item model to hold the data.

import scrapy

# Article detail Item
class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    # Article id
    article_id = scrapy.Field()
    # Original URL
    origin_url = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Avatar
    avatar = scrapy.Field()
    # Publication time
    pubtime = scrapy.Field()

The third step is to save the extracted data to the database through a Pipeline.

# Database connection settings
db_params = {
    'host': '127.0.0.1',
    'port': 3306,
    'user': 'root',
    'password': 'root',
    'database': 'jianshu',
    'charset': 'utf8'
}
# Database connection object (requires: import pymysql)
self.conn = pymysql.connect(**db_params)

# Execute the SQL statement
self.cursor.execute(self._sql, (item['title'], item['content'], item['author'], item['avatar'], item['pubtime'], item['article_id'], item['origin_url']))
# Commit the insert to the database
self.conn.commit()
# Release the cursor
self.cursor.close()
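The fragments above leave out the surrounding class and the SQL text. One way they might fit together is sketched below, using Scrapy's open_spider, process_item and close_spider pipeline hooks. The class name JianshuPipeline, the INSERT statement, and the table and column names are assumptions, since the article does not show self._sql. Unlike the fragment above, the sketch closes the cursor once, when the spider finishes, so that it can be reused for every item.

```python
class JianshuPipeline:
    # Hypothetical statement: the table and column names are assumptions.
    INSERT_SQL = (
        "INSERT INTO article "
        "(title, content, author, avatar, pubtime, article_id, origin_url) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s)"
    )
    FIELDS = ("title", "content", "author", "avatar",
              "pubtime", "article_id", "origin_url")

    def __init__(self, db_params=None):
        # Defaults mirror the connection settings shown above.
        self.db_params = db_params or {
            "host": "127.0.0.1", "port": 3306, "user": "root",
            "password": "root", "database": "jianshu", "charset": "utf8",
        }
        self.conn = None
        self.cursor = None

    def open_spider(self, spider):
        # Third-party driver; imported here so the class loads without it.
        import pymysql
        self.conn = pymysql.connect(**self.db_params)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized insert: the driver escapes the values.
        values = tuple(item[name] for name in self.FIELDS)
        self.cursor.execute(self.INSERT_SQL, values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Release the cursor and the connection once the crawl ends.
        self.cursor.close()
        self.conn.close()
```

If the class lived in the project's pipelines.py, it would be enabled in settings.py with ITEM_PIPELINES = {"jianshu_spider.pipelines.JianshuPipeline": 300}; that module path is likewise an assumption based on the project name used earlier.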

Execution result: [screenshot omitted]

That covers how to scrape Jianshu articles with Python. Thanks for reading, and we hope it proves helpful.

