Crawler Workflow and Usage of Python Third-Party Libraries - 創新互聯

Usage of requests, pymongo, bs4, and other libraries

from __future__ import print_function
# In Python 2.x, print does not need parentheses, but in Python 3.x it does. With this line at the top
# of the file, print must be called with parentheses, as in Python 3.x, even under Python 2.x.


import requests

This imports requests. If requests is not installed, follow the first two commands at https://pip.pypa.io/en/stable/installing/ to download pip, then run pip install requests to install it. requests sends HTTP requests and fetches a page's source code.
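As a rough sketch of the fetch step, the helper below fills a page number into a URL template and requests that page with a browser-like User-Agent. The name fetch_page, the timeout value, and the shortened header string are assumptions made for illustration. The session argument can be any object with a requests-style get method, such as requests.Session().

```python
# Hypothetical helper: the names and the timeout are illustrative assumptions.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0'}


def fetch_page(session, url_template, page, headers=None, timeout=10):
    """Return the HTML text of one results page.

    session is expected to behave like requests.Session(): its get() method
    returns a response with raise_for_status() and a text attribute.
    """
    url = url_template.format(page)
    response = session.get(url, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error page
    return response.text
```

With requests installed, calling fetch_page(requests.Session(), template, 1) on a template that contains {} would return the source of page 1. Passing the session in keeps the helper easy to exercise without network access.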


from bs4 import BeautifulSoup

Run pip install bs4 to install bs4. BeautifulSoup is a class in the third-party Python library bs4. It parses HTML and makes it easier to locate the information you need by tag.
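To make the tag-based lookup concrete, here is a minimal sketch that parses a small table. The HTML snippet, its values, and the function name parse_rows are made up for illustration; only the class name records-table is taken from the scraper later in this article.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only to illustrate tag-based lookup.
SAMPLE_HTML = """
<table class="records-table">
  <tr><th>Rank</th><th>Mark</th><th>Competitor</th></tr>
  <tr><td>1</td><td>8.00</td><td><a href="/athletes/example-1">Athlete One</a></td></tr>
  <tr><td>2</td><td>7.90</td><td><a href="/athletes/example-2">Athlete Two</a></td></tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row of the first table with class records-table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='records-table')
    rows = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        if len(cells) < 3:
            continue  # The header row has th cells and no td cells
        rows.append({
            'Rank': cells[0].get_text().strip(),
            'Mark': cells[1].get_text().strip(),
            'Competitor': cells[2].get_text().strip(),
            'href': cells[2].find('a')['href'],
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
```

find returns the first matching tag, find_all returns every match, and get_text returns the text inside a tag.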

import pymongo
# Install the MongoDB database from source and install pymongo with pip. pymongo is the third-party driver
# that connects Python to MongoDB, so that Python programs can use a MongoDB database. It is written in Python.

import json
# JSON is a lightweight text format for data interchange: a syntax for storing and exchanging text information.
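A short sketch of a round trip through the json module follows. The record is illustrative: its field names mirror those used later in this article, and its values are made up.

```python
import json

# Illustrative record; the values are made up.
record = {'Rank': '1', 'Mark': '8.00', 'Nat': 'CUB', 'outdoor': [], 'indoor': []}

text = json.dumps(record, sort_keys=True)   # dict -> JSON text
restored = json.loads(text)                 # JSON text -> dict
```

json.dumps produces a plain string that can be written to a file or sent over the network, and json.loads rebuilds an equal dictionary from it.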

Installing the database

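1. Install MongoDB from source: download https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-rhel70-3.2.5.tgz, extract the package, and place it under /usr/local (the commands below assume the directory is /usr/local/mongodb)
2. mkdir -p /data/db
3. cd /usr/local/mongodb/bin
./mongod &
./mongo
Type exit to leave the shell.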

View the database contents:
cd /usr/local/mongodb/bin
./mongo
show dbs

Database: iaaf
use iaaf
show collections
db.athletes.find()
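The same inspection can be done from Python. The sketch below assumes client behaves like pymongo.MongoClient(). The function name inspect_database and the limit of five documents are illustrative choices.

```python
def inspect_database(client, db_name='iaaf', collection_name='athletes', limit=5):
    """Mirror the shell commands: show dbs, use <db>, show collections, db.<coll>.find()."""
    db = client[db_name]                  # use iaaf
    collection = db[collection_name]
    return {
        'databases': client.list_database_names(),           # show dbs
        'collections': db.list_collection_names(),           # show collections
        'documents': list(collection.find().limit(limit)),   # db.athletes.find()
    }
```

Calling inspect_database(pymongo.MongoClient()) against a running server would return the database names, the collections in iaaf, and the first few athlete documents.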

Crawler workflow

Step 1: Fetch the site's HTML

# The target URL

url = 'https://www.iaaf.org/records/toplists/jumps/long-jump/outdoor/men/senior/2018?regionType=world&windReading=regular&page={}&bestResultsOnly=true'  

# Use headers to set the request headers so that the script presents itself as a browser

headers = {  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15', }

for page in range(1, 23):
    res = requests.get(url.format(page), headers=headers)
    html = res.text
    print(page)
    soup = BeautifulSoup(html, 'html.parser')       # Parse the page source with BeautifulSoup
    # tbody_l = soup.find_all('tbody')
    record_table = soup.find_all('table', class_='records-table')
    list_re = record_table[2]
    tr_l = list_re.find_all('tr')
    for tr in tr_l:    # Each tr is one row of the table
        td_l = tr.find_all('td')    # List of td cells; the cell at index 3 carries the href
        # Assign each item in td_l to a field to build a JSON-style record {} and insert it into mongo
        # Later, read the href back from mongo, visit it to get the career data, and store it in the same collection
        # Finally, write all the data to Excel

        j_data = {}
        try:
            j_data['Rank'] = td_l[0].get_text().strip()
            j_data['Mark'] = td_l[1].get_text().strip()
            j_data['WIND'] = td_l[2].get_text().strip()
            j_data['Competitior'] = td_l[3].get_text().strip()
            j_data['DOB'] = td_l[4].get_text().strip()
            j_data['Nat'] = td_l[5].get_text().strip()
            j_data['Pos'] = td_l[6].get_text().strip()
            j_data['Venue'] = td_l[8].get_text().strip()
            j_data['Date'] = td_l[9].get_text().strip()
            j_data['href'] = td_l[3].find('a')['href']
            # The fields we want are now stored in the dictionary
        except (IndexError, KeyError, TypeError):
            # Rows without this layout (for example header rows or rows with no link) are skipped
            continue
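The comments in the listing above say that each record is inserted into mongo. One way to sketch that step is shown below. The helper names build_record and save_record are hypothetical, the column positions are copied from the listing, and collection is assumed to behave like a pymongo Collection such as pymongo.MongoClient().iaaf.athletes.

```python
# Column positions as used by the listing above (index 7 is not read there either).
FIELDS = [('Rank', 0), ('Mark', 1), ('WIND', 2), ('Competitior', 3), ('DOB', 4),
          ('Nat', 5), ('Pos', 6), ('Venue', 8), ('Date', 9)]


def build_record(cell_texts, href):
    """Map the text of each cell to its field name; raises IndexError on short rows."""
    record = {name: cell_texts[index].strip() for name, index in FIELDS}
    record['href'] = href
    return record


def save_record(collection, record):
    """Insert one record; collection is expected to behave like a pymongo Collection."""
    return collection.insert_one(record)
```

Keeping the field-to-column mapping in one list makes it easier to adjust if the table layout changes.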

Step 2: Extract the information we want from the HTML

#!/usr/bin/env python
# encoding=utf-8

from __future__ import print_function
import requests
from bs4 import BeautifulSoup as bs


def long_jump(url):
    # Example: url = 'https://www.iaaf.org/athletes/cuba/juan-miguel-echevarria-294120'
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'}
    res = requests.get(url, headers=headers)
    html = res.text
    soup = bs(html, 'html.parser')
    div = soup.find('div', id='progression')

    h3_l = []
    if div is not None:
        h3_l = div.find_all('h3')

    tbody_l = []
    outdoor = []
    indoor = []
    for i in h3_l:    # Each h3 tag
        text = i.get_text().strip()
        if "Long Jump" in text and "View Graph" in text:
            tbody = i.parent.parent.table.tbody
            # print(tbody)  # The data can be read from here
            # Two tables: one outdoor, one indoor
            tbody_l.append(tbody)
    # tbody_l should hold two elements, outdoor then indoor; try/except covers a missing one
    # Build two lists of records: outdoor and indoor
    # Print first, then db.insert()
    try:
        tbody_out = tbody_l[0]
        tbody_in = tbody_l[1]
        tr_l = tbody_out.find_all('tr')
        for i in tr_l:
            # print(i)
            # print('+++++++++++++')
            td_l = i.find_all('td')
            td_dict = {}
            td_dict['Year'] = td_l[0].get_text().strip()
            td_dict['Performance'] = td_l[1].get_text().strip()
            td_dict['Wind'] = td_l[2].get_text().strip()
            td_dict['Place'] = td_l[3].get_text().strip()
            td_dict['Date'] = td_l[4].get_text().strip()
            outdoor.append(td_dict)

        # print(outdoor)
        # print('+++++++++++++++')
        tr_lin = tbody_in.find_all('tr')
        for i in tr_lin:
            td_l = i.find_all('td')
            td_dict = {}
            td_dict['Year'] = td_l[0].get_text().strip()
            td_dict['Performance'] = td_l[1].get_text().strip()
            td_dict['Place'] = td_l[2].get_text().strip()
            td_dict['Date'] = td_l[3].get_text().strip()
            indoor.append(td_dict)
        # print(indoor)
    except (IndexError, AttributeError):
        pass
    return outdoor, indoor


if __name__ == '__main__':
    long_jump(url='https://www.iaaf.org/athletes/cuba/juan-miguel-echevarria-294120')

Once we have the HTML for the whole page, we need to extract the athlete's long jump data from it.
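The listing above reads tbody_l[0] and tbody_l[1] inside a single try block, so a page with only one matching table raises IndexError before any rows are parsed and both lists come back empty. A possible alternative, sketched below with the hypothetical helper names pick_tables and rows_to_dicts, handles each table separately. It assumes, as the listing does, that the first table found is the outdoor one.

```python
def pick_tables(tbody_list):
    """Return (outdoor, indoor), using None for a table that is missing."""
    outdoor = tbody_list[0] if len(tbody_list) > 0 else None
    indoor = tbody_list[1] if len(tbody_list) > 1 else None
    return outdoor, indoor


def rows_to_dicts(rows, field_names):
    """Turn rows (lists of cell text) into dicts, skipping rows that are too short."""
    records = []
    for cells in rows:
        if len(cells) < len(field_names):
            continue
        records.append({name: cells[i].strip() for i, name in enumerate(field_names)})
    return records
```

Here rows would be built from the td cells of each tr, for example [td.get_text() for td in tr.find_all('td')], with five field names for the outdoor table and four for the indoor table.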

Step 3: Store the extracted data in the database

#!/usr/bin/env python
# coding=utf-8

from __future__ import print_function
import pymongo
import requests
from bs4 import BeautifulSoup
import json
from long_jump import long_jump

db = pymongo.MongoClient().iaaf
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'}


def get_href():
    """Read each href from mongo, visit it, and store the resulting data back in the original collection."""
    href_list = db.athletes.find()
    # 794
    count = 0
    for i in href_list:
        # Use the _id to write the scraped career data back to the same document
        print(count)
        href = i.get('href')
        outdoor = []
        indoor = []
        if href is not None:
            url = 'https://www.iaaf.org' + str(href)
            outdoor, indoor = long_jump(url)

        # update_one is used because Collection.update was removed in PyMongo 4
        db.athletes.update_one({'_id': i.get('_id')}, {"$set": {"outdoor": outdoor, "indoor": indoor}})
        count += 1


def get_progression():
    pass


if __name__ == '__main__':
    get_href()
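The write-back can be isolated in two small helpers, sketched here on the assumption that collection behaves like a pymongo Collection. update_one is PyMongo's method for modifying a single document. The names attach_progression and athlete_url are hypothetical.

```python
BASE_URL = 'https://www.iaaf.org'


def athlete_url(href, base=BASE_URL):
    """Build the absolute profile URL, or return None when no href was stored."""
    if href is None:
        return None
    return base + str(href)


def attach_progression(collection, doc_id, outdoor, indoor):
    """Store both progression lists on the document with the given _id."""
    return collection.update_one(
        {'_id': doc_id},
        {'$set': {'outdoor': outdoor, 'indoor': indoor}},
    )
```

The $set operator only touches the outdoor and indoor fields, so the fields stored in Step 1 remain unchanged.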

Step 4: Write the database contents to Excel, grouped by country

#!/usr/bin/env python
# coding=utf-8

from __future__ import print_function
import xlwt
import pymongo


def write_into_xls(cursor):
    title = ['Rank','Mark','age','Competitior','DOB','Nat','country','Venue','Date','out_year','out_performance','out_wind','out_place','out_date','in_year','in_performance','in_place','in_date']

    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('iaaf', cell_overwrite_ok=True)

    for i in range(len(title)):
        sheet.write(0, i, title[i])

    # db = pymongo.MongoClient().iaaf
    # cursor = db.athletes.find()

    # Row 0 holds the titles; data read from the database starts at row 1
    flag = 1
    db = pymongo.MongoClient().iaaf
    country_l = ['CUB', 'RSA', 'CHN', 'USA', 'RUS', 'AUS', 'CZE', 'URU', 'GRE', 'JAM', 'TTO', 'UKR', 'GER', 'IND', 'BRA', 'GBR', 'CAN', 'SRI', 'FRA', 'NGR', 'POL', 'SWE', 'JPN', 'INA', 'GUY', 'TKS', 'KOR', 'TPE', 'BER', 'MAR', 'ALG', 'ESP', 'SUI', 'EST', 'SRB', 'BEL', 'ITA', 'NED', 'FIN', 'CHI', 'BUL', 'CRO', 'ALB', 'KEN', 'POR', 'BAR', 'DEN', 'PER', 'ROU', 'MAS', 'CMR', 'TUR', 'PHI', 'HUN', 'VEN', 'HKG', 'PAN', 'BLR', 'MEX', 'LAT', 'GHA', 'MRI', 'IRL', 'ISV', 'BAH', 'KUW', 'NOR', 'SKN', 'UZB', 'BOT', 'AUT', 'PUR', 'DMA', 'KAZ', 'ARM', 'BEN', 'DOM', 'CIV', 'LUX', 'COL', 'ANA', 'MLT', 'SVK', 'THA', 'MNT', 'ISR', 'LTU', 'VIE', 'IRQ', 'NCA', 'ARU', 'KSA', 'ZIM', 'SLO', 'ECU', 'SYR', 'TUN', 'ARG', 'ZAM', 'SLE', 'BUR', 'NZL', 'AZE', 'GRN', 'OMA', 'CYP', 'GUA', 'ISL', 'SUR', 'TAN', 'GEO', 'BOL', 'ANG', 'QAT', 'TJK', 'MDA', 'MAC']
    for nat in country_l:
        cursor = db.athletes.find({'Nat': nat})
        for doc in cursor:
            print(doc)
            count_out = len(doc['outdoor'])
            count_in = len(doc['indoor'])
            # count is the number of rows this record occupies (at least one)
            count = max(count_out, count_in, 1)

            # title = ['Rank','Mark','Wind','Competitior','DOB','Nat','Pos','Venue',
            # 'Date','out_year','out_performance','out_wind','out_place','out_date',
            # 'in_year','in_performance','in_place','in_date']

            sheet.write(flag, 0, doc.get('Rank'))
            sheet.write(flag, 1, doc.get('Mark'))
            sheet.write(flag, 2, doc.get('age'))
            sheet.write(flag, 3, doc.get('Competitior'))
            sheet.write(flag, 4, doc.get('DOB'))
            sheet.write(flag, 5, doc.get('Nat'))
            sheet.write(flag, 6, doc.get('country'))
            sheet.write(flag, 7, doc.get('Venue'))
            sheet.write(flag, 8, doc.get('Date'))

            if count_out > 0:
                for j in range(count_out):
                    sheet.write(flag+j, 9, doc['outdoor'][j]['Year'])
                    sheet.write(flag+j, 10, doc['outdoor'][j]['Performance'])
                    sheet.write(flag+j, 11, doc['outdoor'][j]['Wind'])
                    sheet.write(flag+j, 12, doc['outdoor'][j]['Place'])
                    sheet.write(flag+j, 13, doc['outdoor'][j]['Date'])

            if count_in > 0:
                for k in range(count_in):
                    sheet.write(flag+k, 14, doc['indoor'][k]['Year'])
                    sheet.write(flag+k, 15, doc['indoor'][k]['Performance'])
                    sheet.write(flag+k, 16, doc['indoor'][k]['Place'])
                    sheet.write(flag+k, 17, doc['indoor'][k]['Date'])

            flag = flag + count

    book.save(r'iaaf.xls')


if __name__ == '__main__':
    write_into_xls(cursor=None)
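The row bookkeeping in the listing above can be checked on its own. In this sketch, each athlete occupies as many sheet rows as the longer of the outdoor and indoor lists, with a minimum of one row. The names rows_needed and start_rows are hypothetical.

```python
def rows_needed(outdoor, indoor):
    """Number of sheet rows one athlete occupies: the longer list, but at least one."""
    return max(len(outdoor), len(indoor), 1)


def start_rows(athletes, first_row=1):
    """Return the sheet row at which each athlete's block begins (row 0 is the title row)."""
    starts = []
    row = first_row
    for athlete in athletes:
        starts.append(row)
        row += rows_needed(athlete.get('outdoor', []), athlete.get('indoor', []))
    return starts
```

For three athletes with 3, 0, and 2 progression rows respectively, the blocks start at rows 1, 4, and 5.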

After running the code above, the result is the spreadsheet iaaf.xls, with the athletes grouped by country.


Article URL: http://www.rwnh.cn/article4/jogie.html
