爬蟲基礎篇-BeautifulSoup解析

發布時間：2020-05-24 20:17:55 來源：網絡閱讀：485 作者：YouErAJ 欄目：編程語言

安裝：Installing Beautiful Soup4?
功能：BeautifulSoup用于從HTML和XML文件中提取數據

常用場景：網頁爬取數據或文本資源后，對其進行解析，獲取所需信息

以下詳細的介紹了beautifulsoup的基礎用法

1.結構

BeautifulSoup 將html文檔轉換成樹形結構對象，包含：
① tag(原html標簽，有name和attribute屬性)?
② NavigableString（包裝tag中的字符串，通過string獲得字符串）
③ BeautifulSoup（表示一個文檔的全部內容）

yourhtml = '<b class="boldest">Extremely bold</b>'

soup = BeautifulSoup(yourhtml, parse_method)    
# 獲取tag名稱
soup.tag.name    
# 還可改變soup對象的tag名稱b 
tag.name = "blockquote"   # 變為<blockquote class="boldest">Extremely bold</blockquote>

# 獲取tag的屬性
soup.tag.attrs    # {u'class': u'boldest'}

# 取得標簽下的屬性class對應的值（類似字典取值）
soup.tag["class"]  # u'boldest'
# 如果yourhtml有多值屬性，比如'<p class="body strikeout"></p>',class對應的屬性為["body", "strikeout"]列表

# 獲取NavigableString類包裝的字符串
soup.tag.string    # u'Extremely bold' 
# NavigableString類包裝的子符串不能編輯但可替換，tag.string.replace_with("No longer bold")

1.2 解析方法parse_method

爬蟲基礎篇-BeautifulSoup解析
?BeautifulSoup 為不同的解析器提供了相同的接口，但是如果不是標準格式的文檔結構，不同的解析器可能解析出不同結構的樹型文檔（特別是XML解析器和HTML解析器)，如果是標準格式文檔，則不同解析器只是解析速度不同，解析結果是一致的。

2. 子字節


# 直接獲取tag_name標簽
soup.tag_name # 如soup.head, soup.title, soup.body.b

# 將tag_name的直接子字節以列表方式返回
soup.tag_name.contents # 字符沒有子節點所以沒有contents屬性

# 返回標簽直接子節點的生成器，通過遍歷獲取每一個子節點內容
soup.tag_name.children 

# 返回tag_name下所有子孫節點內容的生成器
soup.tag_name.descendants 

soup.tag_name.string # 僅當tag_name下只有一個字符串字節時，可返回子字節，但若有多個子字節，.string方法無法確定應調用哪個子節點內容，所以會返回None。所以應通過.strings 來循環獲取，.stripped_strings 同時可以去除多余的空白內容
for string in soup.stripped_strings:
 print(repr(string))
"""
u'Once upon a time there were three little sisters; and their names were'
u'Elsie'
 u','
u'Lacie'
"""

3.父節點

4.兄弟節點

第3、4點暫時不做詳細的討論

5.查找方法

5.1 find
查找符合條件的第一個內容

5.1.1 參數：

find( name # 查找標簽
? ? ?,attrs # 查找標簽屬性
? ? ?,recursive # 循環
? ? ?,text # 查找文本) ? ?找目標首次出現的結果，返回一個beautfulsoup的標簽對象

from bs4 import BeautifulSoup

with open('ecologicalpyramid.html', 'r') as ecological_pyramid:
 soup = BeautifulSoup(ecological_pyramid, 'html')
producer_string = soup.find(text = 'plants')    # 通過文本查找：plants
producer_entries = soup.find('ul')    # 標簽查找
soup.find(id="link3")    # 通過id查找

5.1.2 通過class屬性查找

class屬性規定了元素的類名，但是查找時沒辦法將class當做變量名，有以下查詢方法。以下find替換為find_all同樣可行

soup.find("span",{"class":{"green","red"}})
或
soup.find(name="span", attrs={"green" :"red"}
或
soup.find(class_="green")    # 注意class_有下劃線
或
soup.find("", {"class":"green"})

5.2 正則表達式

在沒有標簽或者文本的情況下可使用正則查詢，比如查找郵箱

import re
from bs4 import BeautifulSoup

email_id_example = """<br/>
<div>The below HTML has the information that has email ids.</div> 
abc@example.com 
<div>xyz@example.com</div> 
<span>foo@example.com</span> 
"""

soup = BeautifulSoup(email_id_example)
emailid_regexp = re.compile("\w+@\w+\.\w+")　　　　# \w+ 匹配字母或數字或下劃線或漢字一次或多次
first_email_id = soup.find(text=emailid_regexp)　　# 用正則表達式匹配文本
print(first_email_id)
輸出結果：abc@example.com

5.3 find_all
soup.find_all()查找符合條件的所有內容
5.3.1 參數：
find_all(name, attrs, recursive, text, limit)
limit參數可以限制得到的結果的數目（當limit=1即和find()一樣的結果）
recursive=False表示只返回最近的子節點
text可傳遞指定尋找的字符串（可以是列表）

5.3.2 正則結合find_all()使用
目的：查找所有a標簽，且字符串內容包含關鍵字“Elsie”

for x in soup.find_all('a',href = re.compile('lacie')):
 print(x)

<a class="sister"  id="link2">Lacie</a>

5.3.3 get

一般href屬性對應的值為鏈接
目的：1.獲取所有鏈接，保存在列表中；2.查詢所有a標簽，輸出所有標簽中的“字符串”內容；3.查找所有含"id"屬性的tag

linklist = []
for x in soup.find_all('a'):
 link = x.get('href')    # 獲取屬性值的技巧
 if link:
 linklist.append(link)
"""
linklist：
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
""" 
 string = x.get_text()    # 獲取所有標簽中所有的“字符串”內容

 soup.find_all(id=True)    # 查找所有含"id"屬性的tag,無論id值是什么

向AI問一下細節

中文字幕av专区_日韩电影在线播放_精品国产精品久久一区免费式_av在线免费观看网站

爬蟲基礎篇-BeautifulSoup解析

1.結構

2. 子字節

3.父節點

4.兄弟節點

5.查找方法

猜你喜歡

中文字幕av专区_日韩电影在线播放_精品国产精品久久一区免费式_av在线免费观看网站

爬蟲基礎篇-BeautifulSoup解析

1.結構

2. 子字節

3.父節點

4.兄弟節點

5.查找方法

猜你喜歡

最新資訊

相關推薦

相關標簽