Code Collector Technical Tutorials 2024-12-29

Scraping Tencent News with Selenium and Python: From Basics to Practice

In this post, we show how to use Selenium and Python to scrape articles from Tencent News and save the results to a CSV file. The tutorial covers:

Project overview
Installing dependencies
Code for each feature
Key techniques in the implementation
Complete code
Results and caveats

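1. Project Overview

Tencent News hosts a large collection of news articles. Our goals are to:

Scrape each article's title and the first 200 characters of its body.

Click the "next page" link, switch to the newly opened page, and keep scraping.

Strip special characters from the scraped titles.

Save the scraped content to a CSV file.

This project suits beginners who want to learn basic Selenium operations such as switching windows and interacting with elements.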

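2. Installing Dependencies

Before starting, install the following:

Python: version 3.7 or later.
Selenium: for browser automation.
WebDriver Manager: downloads and manages the browser driver automatically.
pandas: for writing the results to a CSV file.

Run the following command to install the required libraries: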

pip install selenium webdriver-manager pandas
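
Before running the scraper, it can help to confirm that the three packages were actually installed into the active environment. The following is a minimal sketch using the standard library's importlib.metadata (available in Python 3.8 and later); the helper name missing_packages is hypothetical and not part of the tutorial's code.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    required = ["selenium", "webdriver-manager", "pandas"]
    print("Missing packages:", missing_packages(required) or "none")
```

If the list is not empty, rerun the pip command above in the same environment that runs the script.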

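3. Code for Each Feature

The main pieces are implemented as follows:

1. Selenium driver setup

WebDriver Manager downloads and manages ChromeDriver automatically, so there is no need to download and configure it by hand.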

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

def setup_driver():
    options = Options()
    options.add_argument("--headless")  # Run in headless mode (no visible browser window)
    options.add_argument("--disable-gpu")
    
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver

2. Click the next-page link and switch windows

Click the next-page link, close the old window, and switch to the newly opened one.

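from selenium.webdriver.common.by import By
import time

def click_next_and_switch_window(driver):
    # Locate the link to the next article by its absolute XPath.
    # The path is tied to the current page layout and breaks if the layout changes.
    next_button = driver.find_element(By.XPATH, '/html/body/div[3]/div[1]/div[3]/div/div/ul/li[6]/div[2]/h3/a')
    next_button.click()
    time.sleep(3)  # Give the new window time to open

    all_windows = driver.window_handles
    driver.close()  # Close the old window, which still has focus
    driver.switch_to.window(all_windows[-1])  # Assumes the new window is listed last
    time.sleep(2)  # Give the new page time to load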
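
Switching to the last entry of driver.window_handles assumes the new window is always listed last, and that a new window opened at all. One way to make the choice explicit is to compare the handles from before and after the click. The sketch below shows only that comparison as a pure function, so it runs without a browser; find_new_window is a hypothetical helper, not part of the tutorial's code.

```python
def find_new_window(handles_before, handles_after):
    """Return the first handle that appears only after the click, or None."""
    known = set(handles_before)
    for handle in handles_after:
        if handle not in known:
            return handle
    return None


# How this could slot into click_next_and_switch_window (not run here,
# since it needs a live browser session):
#
#     handles_before = driver.window_handles
#     next_button.click()
#     time.sleep(3)
#     new_handle = find_new_window(handles_before, driver.window_handles)
#     if new_handle is not None:
#         driver.close()                        # close the old window
#         driver.switch_to.window(new_handle)   # focus the new one

print(find_new_window(["w1"], ["w1", "w2"]))  # w2
print(find_new_window(["w1"], ["w1"]))        # None
```

When the result is None, the link most likely opened in the same window, and closing that window would end the session's only page.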

3. Scrape the article content

Scrape the title and the first 200 characters of the body, and clean the title with a regular expression.

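import re

def crawl_tencent_news(start_url, max_articles=50):
    driver = setup_driver()
    articles = []
    try:
        driver.get(start_url)
        time.sleep(2)
        for _ in range(max_articles):
            try:
                title = driver.find_element(By.XPATH, '//*[@id="dc-normal-body"]/div[3]/div[1]/div[1]/div[2]/h1').text
                # Keep only letters, digits, Chinese characters, whitespace, and four full-width punctuation marks
                title = re.sub(r"[^a-zA-Z0-9\u4e00-\u9fa5\s。，！？]", "", title)
                content = driver.find_element(By.XPATH, '//*[@id="ArticleContent"]/div[2]/div').text
                short_content = content[:200]  # First 200 characters of the body
                articles.append({"Title": title, "Content": short_content})
                click_next_and_switch_window(driver)
            except Exception as exc:
                # Stop at the first failure (missing element, no new window) and keep what was collected
                print(f"Stopping after {len(articles)} articles: {exc}")
                break
    finally:
        driver.quit()  # Always shut down the browser, even if an error occurred
    return articles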
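
To see what the title-cleaning pattern keeps and removes, the regular expression can be tried on its own, without a browser. The sketch below reuses the tutorial's pattern unchanged; the helper names clean_title and shorten and the sample strings are made up for illustration.

```python
import re

# Same whitelist as the tutorial's pattern: ASCII letters, digits, CJK
# characters in U+4E00..U+9FA5, whitespace, and four full-width marks.
TITLE_PATTERN = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5\s。，！？]")


def clean_title(title):
    """Remove every character that is not on the whitelist."""
    return TITLE_PATTERN.sub("", title)


def shorten(content, limit=200):
    """Return at most `limit` characters from the start of `content`."""
    return content[:limit]


print(clean_title("《Python》 入门：Selenium 101！"))  # Python 入门Selenium 101！
print(clean_title("Hello, world!"))                   # Hello world
print(len(shorten("a" * 500)))                        # 200
```

Note that half-width punctuation such as "," and "!" is removed along with the book-title marks and the full-width colon, so English titles lose their punctuation.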

4. Save to CSV

Save the scraped content to a CSV file.

import pandas as pd

def save_to_csv(articles, filename):
    df = pd.DataFrame(articles)
    df.to_csv(filename, index=False, encoding="utf-8")
    print(f"Saved {len(articles)} articles to {filename}.")
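
If the CSV will be opened in Excel, Chinese text written as plain UTF-8 may display as garbled characters, because Excel often relies on a byte order mark to detect the encoding. The sketch below is an alternative to the pandas version, written with the standard library's csv module and the utf-8-sig codec; save_articles is a hypothetical name, and the two column names match the dictionaries built by the scraper.

```python
import csv


def save_articles(articles, filename):
    """Write a list of {"Title": ..., "Content": ...} dicts to a CSV file."""
    # utf-8-sig prepends a byte order mark, which helps Excel detect UTF-8.
    with open(filename, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["Title", "Content"])
        writer.writeheader()
        writer.writerows(articles)
    return len(articles)
```

The same effect should be available in the pandas version by passing encoding="utf-8-sig" to to_csv.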

4. Complete Code

The complete code, with all the pieces combined:

import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
import pandas as pd

def setup_driver():
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver

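def click_next_and_switch_window(driver):
    # Locate the link to the next article by its absolute XPath (layout-dependent)
    next_button = driver.find_element(By.XPATH, '/html/body/div[3]/div[1]/div[3]/div/div/ul/li[6]/div[2]/h3/a')
    next_button.click()
    time.sleep(3)  # Give the new window time to open
    all_windows = driver.window_handles
    driver.close()  # Close the old window, which still has focus
    driver.switch_to.window(all_windows[-1])  # Assumes the new window is listed last
    time.sleep(2)  # Give the new page time to load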

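def crawl_tencent_news(start_url, max_articles=50):
    driver = setup_driver()
    articles = []
    try:
        driver.get(start_url)
        time.sleep(2)
        for _ in range(max_articles):
            try:
                title = driver.find_element(By.XPATH, '//*[@id="dc-normal-body"]/div[3]/div[1]/div[1]/div[2]/h1').text
                # Keep only letters, digits, Chinese characters, whitespace, and four full-width punctuation marks
                title = re.sub(r"[^a-zA-Z0-9\u4e00-\u9fa5\s。，！？]", "", title)
                content = driver.find_element(By.XPATH, '//*[@id="ArticleContent"]/div[2]/div').text
                short_content = content[:200]  # First 200 characters of the body
                articles.append({"Title": title, "Content": short_content})
                click_next_and_switch_window(driver)
            except Exception as exc:
                # Stop at the first failure and keep what was collected
                print(f"Stopping after {len(articles)} articles: {exc}")
                break
    finally:
        driver.quit()  # Always shut down the browser, even if an error occurred
    return articles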

def save_to_csv(articles, filename):
    df = pd.DataFrame(articles)
    df.to_csv(filename, index=False, encoding="utf-8")
    print(f"Saved {len(articles)} articles to {filename}.")

def main():
    start_url = "https://news.qq.com/rain/a/20241201A03DNQ00"
    articles = crawl_tencent_news(start_url, max_articles=50)
    if articles:
        save_to_csv(articles, "tencent_articles.csv")

if __name__ == "__main__":
    main()