Code Collector Technical Tutorials 2025-01-26

Getting Started with Python Web Scraping: 7 Small Python Crawler Examples (with Source Code)

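A Python crawler is a web-crawling program written in Python. Such programs automatically visit web pages and collect and extract the data you need. Python is a popular choice for crawler development because it offers mature HTTP libraries (such as requests), HTML parsing libraries (such as BeautifulSoup and lxml), and support for asynchronous and concurrent execution (for example through the asyncio, threading, or concurrent.futures modules).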
The following is a simplified development workflow for a Python crawler:

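Define the goal:
Decide which website and which kinds of data to scrape.
Follow the site's robots.txt rules and the relevant laws and regulations.
Send requests:
Use a library such as requests to send HTTP requests to the target site.
Handle possible exceptions such as network errors and timeouts.
Parse the page:
Use a library such as BeautifulSoup, lxml, or pyquery to parse the page's HTML.
Extract the data you need, such as text, links, and images.
Process the data:
Clean and transform the extracted data so that it is ready for storage.
The pandas library can be used for data processing and analysis.
Store the data:
Save the data to local files (for example CSV or JSON) or to a database.
Optimize and debug:
Use logging (for example the logging module) to track how the program runs.
Optimize the code for efficiency and reliability.
Debug the code to resolve errors and problems as they appear.
Follow the law and ethics:
Make sure your crawling complies with the target site's terms of service and with applicable laws.
Avoid placing excessive load on the target site or harming it.
Consider a framework:
For more complex crawling tasks, consider a framework such as Scrapy to simplify development and maintenance.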
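
The robots.txt step in the workflow above can be checked in code before any page is requested. The following is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content and the user agent name my-tutorial-bot are made up for illustration; a real crawler would load the file from the target site (for example with set_url() and read()).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-tutorial-bot"  # hypothetical crawler name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether this user agent may fetch each URL.
allowed_public = parser.can_fetch(USER_AGENT, "https://www.example.com/index.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://www.example.com/private/data.html")

# The delay (in seconds) the site asks crawlers to wait between requests, if any.
crawl_delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, crawl_delay)
```

A crawler would typically skip any URL for which can_fetch returns False and sleep for at least crawl_delay seconds between requests.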

The following simple Python crawler fetches a page's title and all of its links:

import requests
from bs4 import BeautifulSoup

# Target URL
url = 'https://www.example.com'

# Send an HTTP GET request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the page content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title (soup.title is None when the page has no <title> tag)
    title = soup.title.string if soup.title else None
    print(f'Title: {title}')

    # Extract all links
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    print('Links:')
    for link in links:
        print(link)
else:
    print(f'Failed to retrieve the webpage. Status code: {response.status_code}')

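Note that this example is for teaching purposes only and is not suitable for production use. In a real application you need to consider more factors, such as dynamically loaded content (which may require a tool such as Selenium), anti-scraping mechanisms (such as CAPTCHAs and IP blocking), and data cleaning and storage.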
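
As one illustration of the storage step mentioned above, the sketch below writes scraped records to CSV and JSON files with the standard library. The records, the field names title and url, and the file names are assumptions chosen for the example, not part of the original code.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical records, shaped like what a crawler might collect.
records = [
    {"title": "Example Domain", "url": "https://www.example.com/"},
    {"title": "More information", "url": "https://www.iana.org/domains/example"},
]


def save_csv(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(rows)


def save_json(rows, path):
    """Write a list of dicts to a JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


# Write into a temporary directory so the sketch does not clutter the working directory.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "pages.csv"
json_path = out_dir / "pages.json"

save_csv(records, csv_path)
save_json(records, json_path)
print(f"Saved {len(records)} records to {csv_path} and {json_path}")
```

CSV suits flat, table-like data, while JSON is usually the easier choice when records contain nested fields or lists.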

Below are seven simple Python crawler examples, each with source code, to help you get started with Python web scraping.

Example 1: Fetching a web page

Goal: fetch the HTML content of a web page and print it.

Tools: the requests library

import requests

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    print(response.text)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
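
The code above only checks the status code; network errors and timeouts would raise an exception instead. The following sketch shows one possible way to retry such failures and log them. The helper name fetch_with_retries and its parameters are hypothetical. It catches OSError, which also covers requests.RequestException, so in practice the fetch argument could be something like lambda u: requests.get(u, timeout=10).text. A fake fetcher is used here so the sketch runs without network access.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawler")


def fetch_with_retries(fetch, url, attempts=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with a linearly growing delay.

    Returns the result of fetch, or None if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:
            logger.warning("Attempt %d/%d for %s failed: %s", attempt, attempts, url, exc)
            if attempt < attempts:
                sleep(backoff * attempt)  # wait longer after each failure
    return None


# Demonstration with a fake fetcher that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return "<html>ok</html>"


# sleep is replaced with a no-op so the demonstration finishes immediately.
result = fetch_with_retries(flaky_fetch, "https://www.example.com", sleep=lambda s: None)
print(result)
```

Passing the fetch function and the sleep function as arguments keeps the retry logic independent of any particular HTTP library and easy to test.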

Example 2: Parsing page content with BeautifulSoup

Goal: fetch a web page and parse specific content from it, such as the title.

Tools: the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # soup.title is None when the page has no <title> tag
    title = soup.title.string if soup.title else None
    print(f"Title: {title}")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Example 3: Extracting all links from a page

Goal: fetch a web page and extract all of the links in it.

Tools: the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    print(links)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
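
The href values collected above are returned exactly as they appear in the HTML, so they may be relative paths, duplicates, or non-HTTP links such as mailto: addresses. The sketch below shows one way to clean them up with urllib.parse. The function name normalize_links, the base URL, and the sample hrefs are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, then drop fragments, non-HTTP links and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Turn relative links into absolute ones and strip any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        # Skip schemes a crawler cannot fetch over HTTP, such as mailto: or javascript:.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical raw href values, as they might come out of soup.find_all('a', href=True).
raw_links = [
    "/about",
    "contact.html#team",
    "https://www.iana.org/domains/example",
    "mailto:someone@example.com",
    "/about",
]

links = normalize_links("https://www.example.com/docs/", raw_links)
for link in links:
    print(link)
```

The order of first appearance is preserved, which is convenient when the cleaned list is later used as a crawl queue.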

Example 4: Extracting image links from a page

Goal: fetch a web page and extract all of the image links in it.

Tools: the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    image_links = [img.get('src') for img in soup.find_all('img', src=True)]
    print(image_links)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

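Example 5: Extracting table data from a page

Goal: fetch a web page and extract the table data in it.

Tools: the requests and pandas libraries (pandas.read_html also needs an HTML parser such as lxml to be installed)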

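from io import StringIO

import requests
import pandas as pd

url = 'https://www.example.com/table'
response = requests.get(url)

if response.status_code == 200:
    # Recent pandas versions expect a file-like object rather than a raw HTML string
    tables = pd.read_html(StringIO(response.text))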
    for i, table in enumerate(tables):
        print(f"Table {i+1}:\n{table}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Example 6: Scraping paginated content

Goal: scrape the content of every page of a paginated site.

Tools: the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.example.com/page/'
num_pages = 5  # assume there are 5 pages

for page_num in range(1, num_pages + 1):
    url = f"{base_url}{page_num}"
    response = requests.get(url)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assume we want the headings; get_text() also works when an <h2> contains nested tags
        titles = [h.get_text(strip=True) for h in soup.find_all('h2')]
        print(f"Page {page_num} Titles: {titles}")
    else:
        print(f"Failed to retrieve page {page_num}. Status code: {response.status_code}")
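
The loop above assumes the number of pages is known in advance and sends requests back to back. As a hedged alternative, the sketch below keeps requesting pages until one comes back empty and pauses between requests. The function crawl_pages, its parameters, and the fake site data are all hypothetical. In a real crawler, fetch_titles would wrap the requests and BeautifulSoup code shown above and return a list of titles for one page.

```python
import time


def crawl_pages(fetch_titles, base_url, max_pages=50, delay=1.0, sleep=time.sleep):
    """Collect titles page by page until a page is empty or max_pages is reached."""
    all_titles = []
    for page_num in range(1, max_pages + 1):
        titles = fetch_titles(f"{base_url}{page_num}")
        if not titles:
            break  # an empty page is treated as the end of the listing
        all_titles.extend(titles)
        sleep(delay)  # be polite: pause before requesting the next page
    return all_titles


# Fake site with two pages of titles, used so the sketch runs without network access.
FAKE_SITE = {
    "https://www.example.com/page/1": ["A", "B"],
    "https://www.example.com/page/2": ["C"],
}


def fake_fetch_titles(url):
    return FAKE_SITE.get(url, [])


# sleep is replaced with a no-op so the demonstration finishes immediately.
titles = crawl_pages(fake_fetch_titles, "https://www.example.com/page/", sleep=lambda s: None)
print(titles)
```

The max_pages limit guards against sites that never return an empty page, for example ones that repeat the last page for out-of-range page numbers.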

Example 7: Scraping a page with the Scrapy framework

Goal: use the Scrapy framework to scrape page content.

Tools: the Scrapy framework

First, install Scrapy:

pip install scrapy

Then create a Scrapy project and generate a spider:

scrapy startproject myproject
cd myproject
scrapy genspider example example.com

Write the spider code in myproject/spiders/example.py:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        yield {'title': title}

        # Assume we want to collect all links
        for link in response.css('a::attr(href)').getall():
            yield {'link': link}
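
Crawl politeness in a Scrapy project is usually configured in the project's settings file rather than in the spider itself. The excerpt below is a sketch of settings that could go in myproject/settings.py. The setting names are standard Scrapy settings, but the values and the contact URL in USER_AGENT are assumptions for illustration and should be tuned for the target site.

```python
# myproject/settings.py (excerpt)

# Respect the target site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Limit how many requests run in parallel against a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True

# Identify the crawler; the contact URL here is hypothetical.
USER_AGENT = "myproject (+https://www.example.com/contact)"
```

The settings file generated by scrapy startproject typically already contains ROBOTSTXT_OBEY = True, so only the remaining values may need to be added.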

Run the spider: