How to Scrape HTML Pages with requests

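Fetching HTML pages is the foundation of web scraping, and requests is a very popular Python HTTP library for sending HTTP requests and reading server responses. For scraping HTML pages it is simple to use and easy to learn. The sections below explain in detail how to fetch HTML content with requests.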

First, we need to install requests. Run the following command in a terminal:

pip install requests

Next, we go step by step through fetching HTML content with requests.

Sending a GET request

Sending a GET request with requests is simple: just call requests.get(). Here is a basic example:

import requests
url = 'http://www.example.com'
response = requests.get(url)
print(response.text)

In this example, we first import requests and define the URL to fetch. requests.get(url) sends the GET request, and the result is assigned to response. response.text then gives us the page's HTML as a string.
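One detail worth knowing about response.text: requests decodes the body using response.encoding, and when a text response carries no charset in its Content-Type header, it falls back to ISO-8859-1, which can garble non-Latin pages. The sketch below is one possible workaround, not the only one. The helper name decoded_html is made up for illustration, and response.apparent_encoding is only a guess derived from the body bytes, so it can be wrong on short or unusual pages.

```python
def decoded_html(response):
    """Return response.text, guessing the encoding when the server declares none."""
    content_type = response.headers.get("Content-Type", "")
    if "charset=" not in content_type.lower():
        # No charset in the header: let requests guess from the body bytes.
        response.encoding = response.apparent_encoding
    return response.text

# Usage with the response from the example above:
# html = decoded_html(response)
```

When the server does declare a charset, the helper leaves response.encoding untouched and simply returns response.text.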

Once we have the response content, we can process it further, for example by parsing or saving it.

1. Parsing the HTML content

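We could parse HTML with regular expressions, but a dedicated parsing library such as BeautifulSoup is the better choice (it is installed separately with pip install beautifulsoup4). Here is an example that parses the HTML with BeautifulSoup: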

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

In this example, we first import BeautifulSoup and then create a BeautifulSoup object, passing in response.text and the parser name 'html.parser'. soup.prettify() returns the HTML re-indented for readability, which we then print.
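Beyond pretty-printing, the parsed tree can be queried for specific elements. The following is a minimal sketch, assuming we want the page title and the target of every link. The function name extract_title_and_links is hypothetical, and it takes an HTML string such as response.text.

```python
from bs4 import BeautifulSoup

def extract_title_and_links(html):
    """Return (title, links) for an HTML string; title is None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> element.
    title = soup.title.string if soup.title else None
    # href=True skips <a> tags that have no href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, links

# Usage with the response from the earlier example:
# title, links = extract_title_and_links(response.text)
```

The links are returned exactly as written in the page, so relative URLs stay relative.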

2. Saving the HTML content

To save the fetched HTML to a local file, use the following code:

with open('example.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

Here, the with open() statement creates a file named example.html in write mode ('w') with UTF-8 encoding, and response.text is written to it.
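An alternative is to write response.content, the undecoded bytes, in binary mode. That keeps the file identical to what the server sent and sidesteps any decoding step. The sketch below assumes this is what you want. The helper name save_raw_html is made up for illustration, and creating missing parent directories is an added convenience rather than something the original code does.

```python
from pathlib import Path

def save_raw_html(response, path):
    """Write the response body to path byte-for-byte and return the byte count."""
    target = Path(path)
    # Create missing parent directories, e.g. "pages/" in "pages/example.html".
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.write_bytes(response.content)

# Usage with the response from the earlier example:
# save_raw_html(response, "pages/example.html")
```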

Handling exceptions

Network requests made with requests can fail in many ways. To keep the program stable, we should catch and handle the exceptions that may be raised.

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("HTTPError:", e)
except requests.exceptions.ConnectionError as e:
    print("ConnectionError:", e)
except requests.exceptions.Timeout as e:
    print("Timeout:", e)
except requests.exceptions.RequestException as e:
    print("RequestException:", e)

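In this example, a try-except statement catches the exceptions that may occur. timeout=10 sets the request timeout to 10 seconds. response.raise_for_status() checks the status code and raises an HTTPError for 4xx and 5xx responses. RequestException is the base class of the other three exceptions, so its except clause has to come last.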
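When a failure is transient, catching the exception is often followed by a retry. The following is a minimal sketch, assuming that connection errors and timeouts are worth retrying while HTTP errors are not. fetch_with_retries and its get and sleep parameters are hypothetical names introduced here so that the loop can be exercised without a network connection.

```python
import time
import requests

def fetch_with_retries(url, retries=3, delay=1.0, get=requests.get, sleep=time.sleep):
    """Return a successful response, retrying on connection errors and timeouts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=10)
            response.raise_for_status()  # HTTPError is not retried; it propagates
            return response
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout) as e:
            last_error = e
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                sleep(delay * attempt)
    raise last_error

# Usage (performs a real network request):
# response = fetch_with_retries("http://www.example.com")
```

If every attempt fails, the last exception is re-raised, so the caller can still handle it with the try-except pattern shown above.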

Setting request headers

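Some sites restrict requests that look like they come from a crawler. To get past these restrictions, we can set request headers so that the request resembles one sent by a browser.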

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)

In this example, we create a dictionary, headers, containing a common User-Agent value, and pass it to requests.get() through the headers parameter when sending the GET request.
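If several pages are fetched from the same site, passing headers on every call gets repetitive. One option is a requests.Session, whose headers are sent with each request made through it. This is a minimal sketch, and the helper name make_session is an assumption for illustration.

```python
import requests

def make_session(user_agent):
    """Create a Session that sends the given User-Agent with every request."""
    session = requests.Session()
    # Session-level headers are merged into each request sent via the session.
    session.headers.update({"User-Agent": user_agent})
    return session

# Usage (performs a real network request):
# session = make_session("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
# response = session.get("http://www.example.com", timeout=10)
```

A session also reuses the underlying connection between requests to the same host, which can make repeated fetches faster.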

With the steps above, we can now fetch HTML pages. The following complete example combines them:

import requests
from bs4 import BeautifulSoup
url = 'http://www.example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.prettify())
    with open('example.html', 'w', encoding='utf-8') as f:
        f.write(response.text)
except requests.exceptions.HTTPError as e:
    print("HTTPError:", e)
except requests.exceptions.ConnectionError as e:
    print("ConnectionError:", e)
except requests.exceptions.Timeout as e:
    print("Timeout:", e)
except requests.exceptions.RequestException as e:
    print("RequestException:", e)

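This example covers sending a GET request, setting request headers, handling exceptions, parsing the HTML, and saving it. With these steps you can use requests to scrape HTML pages. In practice, you will likely need to adjust and tune the code for your specific needs.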