How to Use BeautifulSoup4 for Python Web Scraping

Published: 2020-09-24 09:29:54 | Source: 億速云 | Views: 140 | Author: Leah | Category: Programming Languages

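This article explains how to use BeautifulSoup4 for web scraping in Python. It covers creating a soup object and the four kinds of objects that BeautifulSoup exposes, each with a runnable example and its output.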

Web scraping: the BeautifulSoup4 parser

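BeautifulSoup makes parsing HTML fairly simple. Its API is user-friendly, and it supports CSS selectors, the HTML parser from the Python standard library, and the lxml HTML and XML parsers.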

Compared with regular expressions, it is simpler to use.
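As a minimal sketch of the CSS selector support mentioned above: the snippet below uses a made-up HTML string (the `snippet` variable and its `/elsie` and `/x` links are illustrative, not from the article) and the standard library parser `"html.parser"` instead of lxml, so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
snippet = (
    '<div id="box">'
    '<a class="sister" href="/elsie">Elsie</a>'
    '<a class="other" href="/x">X</a>'
    '</div>'
)

# "html.parser" is the parser that ships with the Python standard library.
soup = BeautifulSoup(snippet, "html.parser")

# select() takes a CSS selector and returns a list of matching Tag objects.
links = soup.select("div#box a.sister")
hrefs = [a["href"] for a in links]
print(hrefs)
```

The same selection with a regular expression would have to account for attribute order and quoting, which is the kind of detail the parser handles for you.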

Example:

First, import the bs4 library:

#!/usr/bin/python3
# -*- coding:utf-8 -*- 
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Pretty-print the contents of the soup object
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

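The four kinds of objects

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: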

(1) Tag

(2) NavigableString

(3) BeautifulSoup

(4) Comment

1. Tag

Put simply, a Tag corresponds to an individual tag in the HTML, for example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

HTML tags such as title, head, a, and p above, together with the content they contain, are each a Tag. Let's try using BeautifulSoup to get some Tags:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Print the title tag
print(soup.title)
 
# Print the head tag
print(soup.head)
 
# Print the first a tag
print(soup.a)
 
# Print the first p tag
print(soup.p)
 
# Print the type of soup.p
print(type(soup.p))

Output:

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>

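We can easily get these tags by writing soup followed by the tag name; the resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. To get every matching tag, use a method such as find_all.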
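To illustrate the difference between "first match" and "all matches", here is a small sketch. It uses a trimmed version of the article's HTML (with plain text in the first link rather than a comment) and the standard library parser; both choices are assumptions made to keep the example short.

```python
from bs4 import BeautifulSoup

# Trimmed variant of the article's sample document.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the first matching tag.
first_link = soup.a

# find_all() returns every matching tag in document order.
all_links = soup.find_all("a")
link_ids = [a["id"] for a in all_links]

print(first_link["id"])
print(link_ids)
```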

A Tag has two important attributes: name and attrs.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
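# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# The soup object is a special case: its name is [document]
print(soup.name)
 
# For any other tag, name is the tag's own name
print(soup.head.name)
 
# Print all attributes of the p tag; they are stored in a dict
print(soup.p.attrs)
 
# Print the class attribute of the p tag
print(soup.p['class'])
# The get method also works: pass the attribute name.
# It gives the same result here, but returns None instead of
# raising KeyError when the attribute is missing.
print(soup.p.get('class'))
 
print(soup.p)
 
# Modify an attribute
soup.p['class'] = "newClass"
print(soup.p)
 
# Delete an attribute
del soup.p['class']
print(soup.p)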

Output:

[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
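The attribute access shown above can be made a little more defensive. The sketch below is an illustration under a few assumptions: the one-line HTML string and the `"no-id"` fallback value are made up for the example, and the standard library parser is used. It shows that class comes back as a list, that the HTML attribute called name is distinct from the Tag's own `.name`, and how to handle a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document for illustration.
soup = BeautifulSoup(
    '<p class="title big" name="dromouse"><b>Hi</b></p>', "html.parser"
)
p = soup.p

# class is a multi-valued attribute, so it is returned as a list.
classes = p["class"]

# p.name is the tag name; p["name"] is the HTML attribute called "name".
tag_name = p.name
name_attr = p["name"]

# get() returns None (or a supplied default) for a missing attribute,
# whereas p["id"] would raise KeyError.
missing = p.get("id")
fallback = p.get("id", "no-id")

# has_attr() tests for presence without fetching the value.
has_name = p.has_attr("name")

print(classes, tag_name, name_attr, missing, fallback, has_name)
```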

2. NavigableString

Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Print the text inside the first p tag
print(soup.p.string)
 
# Print the type of soup.p.string
print(type(soup.p.string))

Output:

The Dormouse's story
<class 'bs4.element.NavigableString'>
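One caveat worth sketching: .string works above because the first p tag contains a single child (the b tag) that itself holds a single string. When a tag mixes text and child tags, .string is None, and get_text() is the usual way to collect the text. The shortened HTML below is an assumption made for this example, and the standard library parser is used.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the article's document.
html = (
    "<p class=\"title\"><b>The Dormouse's story</b></p>"
    '<p class="story">Once upon a time '
    '<a href="http://example.com/lacie">Lacie</a> lived.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

# A single chain of children: .string descends into <b> and returns its text.
single = title_p.string

# Text mixed with a child tag: .string cannot pick one string, so it is None.
mixed = story_p.string

# get_text() concatenates all text found inside the tag.
text = story_p.get_text()

print(single)
print(mixed)
print(text)
```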

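3. BeautifulSoup

A BeautifulSoup object represents the content of a whole document. Most of the time you can treat it as a Tag object, albeit a special one. We can inspect the type of its name, the name itself, and its attributes.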

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
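# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Type of soup.name (a str); type(soup) would show the object's own class
print(type(soup.name))
 
# Name
print(soup.name)
 
# Attributes (empty for the document as a whole)
print(soup.attrs)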

Output:

<class 'str'>
[document]
{}
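The "special Tag" relationship can be checked directly. This is a small sketch with a made-up one-line document and the standard library parser; it relies only on the fact that the BeautifulSoup class is a subclass of bs4.element.Tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical minimal document for illustration.
soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><p>x</p></body></html>",
    "html.parser",
)

# The soup object is itself a Tag, so Tag methods such as find() work on it.
is_tag = isinstance(soup, Tag)
class_name = type(soup).__name__
title_text = soup.find("title").string

print(is_tag, class_name, title_text)
```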

4. Comment

A Comment object is a special type of NavigableString. When it is printed, the comment markers are not included.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
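# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Print the first a tag
print(soup.a)
 
# Print its content (here, the text of the comment)
print(soup.a.string)
 
# Print the type of soup.a.string
print(type(soup.a.string))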

Output:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie
<class 'bs4.element.Comment'>

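The content of this a tag is actually a comment, but when we output it with .string, the comment markers have already been stripped. This means printed output alone cannot tell a comment apart from ordinary text; check the object's type when the difference matters.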
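Because the comment markers are stripped, a comment can be mistaken for real text. One way to guard against that, sketched here with a shortened two-link document (an assumption for the example) and the standard library parser, is to check the type with isinstance before using the string. The `visible` list is just an illustrative name.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Shortened, hypothetical variant of the article's links.
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Skip tags whose only content is a comment (or that have no single string).
    if s is not None and not isinstance(s, Comment):
        visible.append(str(s))

print(visible)
```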

