
「Docker in Practice」Python with Docker: scraping Douyin web data (19)

Published: 2020-04-01 19:26:47  Source: web  Views: 615  Author: IT人故事  Category: Cloud computing

Original article; reposts are welcome. When reposting, please credit IT人故事會, thank you!
Original link: 「Docker in Practice」Python with Docker: scraping Douyin web data (19)

Why scrape Douyin in the first place? An example: a fresh-food e-commerce company wants to buy ad placements to raise product exposure. Douyin, with its enormous traffic, looks like a promising channel, so the company wants to test whether advertising there actually pays off. To do that it analyzes Douyin's user demographics and judges how well they match its own customer base, which requires each account's follower count, like count, following count, and nickname. Knowing user preferences, the company can weave its products into videos and promote them more effectively. PR agencies also use this kind of data to spot up-and-coming influencers and package them for marketing. Source code: https://github.com/limingios/dockerpython.git (douyin)


The Douyin share page
  • Introduction

    https://www.douyin.com/share/user/<user ID>. The user IDs come from the txt file in the source repo; opening that link loads the corresponding web page, and the basic profile information is scraped from it.
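As a minimal sketch of that step (the one-ID-per-line layout of the txt file is an assumption for illustration), building the share URLs is plain string work:

```python
# Build Douyin share-page URLs from user IDs, skipping blank lines.
def build_share_urls(user_ids):
    base = 'https://www.douyin.com/share/user/'
    return [base + uid.strip() for uid in user_ids if uid.strip()]


print(build_share_urls(['76055758243', '12345\n', '']))
# → ['https://www.douyin.com/share/user/76055758243', 'https://www.douyin.com/share/user/12345']
```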


  • Install the XPath Helper extension for Chrome

    The crx file is included in the source repo.


In the Chrome address bar, open: chrome://extensions/


Drag xpath-helper.crx straight onto the chrome://extensions/ page.

After the installation succeeds:


The shortcut Ctrl+Shift+X toggles XPath Helper; it is normally used alongside Chrome's F12 developer tools.


Scraping the Douyin share page with Python

Analyze the share page https://www.douyin.com/share/user/76055758243

1. Douyin has an anti-scraping mechanism: the digits in the Douyin ID are rendered as icon-font characters, which must be mapped back to real digits.

``` python
# Each 'name' entry holds the icon-font glyphs that render as a single digit;
# the glyphs themselves do not survive copy/paste, hence the blank-looking strings.
{'name':['  ','  ','  '],'value':0},
{'name':['  ','  ','  '],'value':1},
{'name':['  ','  ','  '],'value':2},
{'name':['  ','  ','  '],'value':3},
{'name':['  ','  ','  '],'value':4},
{'name':['  ','  ','  '],'value':5},
{'name':['  ','  ','  '],'value':6},
{'name':['  ','  ','  '],'value':7},
{'name':['  ','  ','  '],'value':8},
{'name':['  ','  ','  '],'value':9},
```
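The substitution itself is an ordinary string replace. The real icon-font glyphs cannot be reproduced here, so this sketch uses hypothetical placeholder entities in their place:

```python
import re

# Hypothetical placeholder entities standing in for the real icon-font glyphs.
regex_list = [
    {'name': ['&#xe602;', '&#xe60e;'], 'value': 0},
    {'name': ['&#xe603;', '&#xe60f;'], 'value': 1},
    {'name': ['&#xe604;'], 'value': 2},
]


def decode_digits(text):
    # replace every glyph entity with its digit, the same loop the scraper runs
    for entry in regex_list:
        for glyph in entry['name']:
            text = re.sub(glyph, str(entry['value']), text)
    return text


print(decode_digits('&#xe603;&#xe604;&#xe602;'))  # → 120
```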


2. Grab the XPath for each node we need

```
# nickname
//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()

# Douyin ID (the digits sit in <i> elements under the shortid paragraph)
//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()

# job / verification info
//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()

# description
//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()

# location
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()

# star sign
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()

# following count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()

# follower count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()

# like count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()
```
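The selectors can be sanity-checked offline. The sketch below uses the stdlib ElementTree (a stand-in for lxml, which supports only a limited XPath subset) against a stripped-down mock of the personal-card markup; the mock structure is an assumption, not the real page:

```python
import xml.etree.ElementTree as ET

# Minimal mock of the share page's personal-card block (structure assumed).
html = """
<div class="personal-card">
  <div class="info1"><p class="nickname">demo-user</p></div>
  <div class="info2"><p class="signature">hello world</p></div>
</div>
"""

root = ET.fromstring(html)
# ElementTree supports relative paths plus [@attr='value'] predicates
nickname = root.find(".//div[@class='info1']/p[@class='nickname']").text
signature = root.find(".//div[@class='info2']/p[@class='signature']").text
print(nickname, signature)  # → demo-user hello world
```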


  • Complete code

``` python
import re
import requests
import time
from lxml import etree


def handle_decode(input_data, share_web_url, task):
    search_douyin_str = re.compile(r'抖音ID:')
    # Each 'name' entry holds the icon-font glyphs that render as a single digit;
    # the glyphs do not survive copy/paste, hence the blank-looking strings.
    regex_list = [
        {'name': ['  ', '  ', '  '], 'value': 0},
        {'name': ['  ', '  ', '  '], 'value': 1},
        {'name': ['  ', '  ', '  '], 'value': 2},
        {'name': ['  ', '  ', '  '], 'value': 3},
        {'name': ['  ', '  ', '  '], 'value': 4},
        {'name': ['  ', '  ', '  '], 'value': 5},
        {'name': ['  ', '  ', '  '], 'value': 6},
        {'name': ['  ', '  ', '  '], 'value': 7},
        {'name': ['  ', '  ', '  '], 'value': 8},
        {'name': ['  ', '  ', '  '], 'value': 9},
    ]

    # map every icon glyph back to its digit
    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2, str(i1['value']), input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]).strip() + douyin_id

    try:
        douyin_info['job'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except IndexError:
        pass  # not every account carries verification info
    douyin_info['describe'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    # counts arrive as bare digit strings; when the unit span says 'w' (万, 10 000),
    # the last digit is the decimal, e.g. '2563' -> '256.3w'
    fans_value = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    like = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    douyin_info['from_url'] = share_web_url

    print(douyin_info)


def handle_douyin_web_share(share_id):
    share_web_url = 'https://www.douyin.com/share/user/' + share_id
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, share_id)


if __name__ == '__main__':
    while True:
        share_id = "76055758243"
        if share_id is None:
            print('current task: %s' % share_id)
            break
        else:
            print('current task: %s' % share_id)
            handle_douyin_web_share(share_id)
            time.sleep(2)
```



#### mongodb
>Create the MongoDB host in a vagrant-built VM; for details see
「Docker in Practice」Python Docker crawler: app scraping with a Python script (13)

``` bash
su -
# password: vagrant
docker
```

>https://hub.docker.com/r/bitnami/mongodb
>default port: 27017

``` bash
docker pull bitnami/mongodb:latest
mkdir bitnami
cd bitnami
mkdir mongodb
docker run -d -v /path/to/mongodb-persistence:/root/bitnami -p 27017:27017 bitnami/mongodb:latest

# disable the firewall
systemctl stop firewalld
```


  • Working with mongodb

    Read the txt file to get the userId values.

``` python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/1/30 19:35
# @Author  : Aries
# @Site    :
# @File    : handle_mongo.py
# @Software: PyCharm

import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        for i in f.readlines():
            task_info = {}
            task_info['share_id'] = i.replace('\n', '')
            # insert_one replaces the insert() method removed in pymongo 4
            task_id_collections.insert_one(task_info)


def handle_get_task():
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})


# handle_init_task()
```
  • Modify the python program to pull tasks from mongodb

``` python
import re
import requests
import time
from lxml import etree
from handle_mongo import handle_get_task
from handle_mongo import handle_insert_douyin


def handle_decode(input_data, share_web_url, task):
    search_douyin_str = re.compile(r'抖音ID:')
    # Each 'name' entry holds the icon-font glyphs that render as a single digit;
    # the glyphs do not survive copy/paste, hence the blank-looking strings.
    regex_list = [
        {'name': ['  ', '  ', '  '], 'value': 0},
        {'name': ['  ', '  ', '  '], 'value': 1},
        {'name': ['  ', '  ', '  '], 'value': 2},
        {'name': ['  ', '  ', '  '], 'value': 3},
        {'name': ['  ', '  ', '  '], 'value': 4},
        {'name': ['  ', '  ', '  '], 'value': 5},
        {'name': ['  ', '  ', '  '], 'value': 6},
        {'name': ['  ', '  ', '  '], 'value': 7},
        {'name': ['  ', '  ', '  '], 'value': 8},
        {'name': ['  ', '  ', '  '], 'value': 9},
    ]

    # map every icon glyph back to its digit
    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2, str(i1['value']), input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]).strip() + douyin_id

    try:
        douyin_info['job'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except IndexError:
        pass  # not every account carries verification info
    douyin_info['describe'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    # counts arrive as bare digit strings; when the unit span says 'w' (万, 10 000),
    # the last digit is the decimal, e.g. '2563' -> '256.3w'
    fans_value = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    like = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    douyin_info['from_url'] = share_web_url

    print(douyin_info)
    handle_insert_douyin(douyin_info)


def handle_douyin_web_share(task):
    share_web_url = 'https://www.douyin.com/share/user/' + task["share_id"]
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, task["share_id"])


if __name__ == '__main__':
    while True:
        task = handle_get_task()
        if task is None:
            break  # the task queue in mongodb is empty
        handle_douyin_web_share(task)
        time.sleep(2)
```


* mongodb fields
>handle_init_task loads the txt file into mongodb
>handle_get_task fetches one document and deletes it; since the txt file still exists, deleting the queue entries loses nothing

``` python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/1/30 19:35
# @Author  : Aries
# @Site    :
# @File    : handle_mongo.py
# @Software: PyCharm

import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        for i in f.readlines():
            task_info = {}
            task_info['share_id'] = i.replace('\n', '')
            # insert_one replaces the insert() method removed in pymongo 4
            task_id_collections.insert_one(task_info)


def handle_insert_douyin(douyin_info):
    douyin_collections = Collection(db, 'douyin_info')
    douyin_collections.insert_one(douyin_info)


def handle_get_task():
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})


handle_init_task()
```


PS: the roughly 1,000 IDs in the txt file are nowhere near enough. In practice the app side and the PC side work together: the PC side handles the seed data, uses each userID to fetch its follower list, and keeps looping over those lists, which yields a far larger data set.
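That PC-side strategy, seed IDs expanded through follower lists, is a breadth-first crawl. A minimal sketch, with fetch_followers as a hypothetical stand-in for the real follower-list API:

```python
from collections import deque


def fetch_followers(user_id):
    # Hypothetical stand-in; the real data comes from the follower-list endpoint.
    demo = {'1': ['2', '3'], '2': ['3', '4'], '3': [], '4': []}
    return demo.get(user_id, [])


def crawl(seed_ids, limit=100):
    # breadth-first expansion over follower lists, de-duplicated with a set
    seen, queue, processed = set(seed_ids), deque(seed_ids), []
    while queue and len(processed) < limit:
        uid = queue.popleft()
        processed.append(uid)  # here the share page for uid would be scraped
        for fid in fetch_followers(uid):
            if fid not in seen:
                seen.add(fid)
                queue.append(fid)
    return processed


print(crawl(['1']))  # → ['1', '2', '3', '4']
```

The set prevents re-scraping a user reachable through several follower lists, and the limit caps a crawl that would otherwise grow without bound.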
