A Little Python Side Project: Scraping Autohome Ranking Data with Selenium

Scraping the Autohome rankings with Selenium

  • Overall approach
  • Step 1: define the list of reporting periods and the URL structure, laying the groundwork for the data collection.
  • Step 2: fetch the pages with the requests library and save them as local HTML files.
  • Step 3: use Selenium to drive a browser, automatically loading each page and scrolling until the content is fully rendered.
  • Step 4: parse the HTML with BeautifulSoup, extract the fields of interest, and save them as Excel files.
  • Code
  • Results
  • Project structure
  • Dataset preview
    Overall approach

    Step 1: define the list of reporting periods and the URL structure, laying the groundwork for the data collection (a URL sketch follows the scope list below).

    (Figure: analysis of the Autohome ranking page)
    Scope of the crawl:

  • Period: 2024-07 through 2024-12
  • Categories: 全部, 全部轿车, 微型车, 小型车, 紧凑型车, 中型车, 中大型车, 全部SUV, 小型SUV, 紧凑型SUV, 中型SUV, 中大型SUV, 大型SUV, MPV (these strings are used verbatim as the keys of category_url_dict)
  • Note: the 2024-07 page had already expired by the time this article was written.
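
    For example, the assembled URL for one category and one period looks like this (a quick sketch; the full category_url_dict and periods lists appear in the code section below):

    base_url = "https://www.autohome.com.cn/rank/"
    category_url = "1-1-0-0_9000-x-x-x/"   # the "全部" (all) category
    period = "2024-11"
    url = base_url + category_url + period + ".html"
    print(url)  # https://www.autohome.com.cn/rank/1-1-0-0_9000-x-x-x/2024-11.html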

    Step 2: fetch the pages with the requests library and save them as local HTML files.

    Basic version, using only the requests library. (As step 3 explains, the ranking page keeps loading rows as you scroll, so this version only captures the initially rendered items.)

    # Fetch the HTML pages with requests
    # (base_url, category_url_dict, periods and logger are defined in the
    # full listing in the "Code" section below)
    def get_html():
        for period in periods:
            for category, category_url in category_url_dict.items():
                logger.info(f"Category: {category}\tPeriod: {period}")
                url = base_url + category_url + period + ".html"
                logger.info(url)
                response = requests.get(
                    url=url,
                    headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                                           "Chrome/100.0.4896.60 Safari/537.36 Edg/100.0.1185.29"}
                )
                logger.info(response.status_code)
                if response.status_code == 200:
                    logger.info("Success")
                    # Use the same "category-period" file name that parse_html() expects
                    h5_path = "./h5/" + period + "/" + category + "-" + period + ".html"
                    if not os.path.exists(h5_path):
                        with open(h5_path, "w", encoding="utf-8") as f:
                            f.write(response.text)
                else:
                    # warning, not debug, so failures show up at the INFO log level
                    logger.warning("Failed")
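
    A quick way to see the limitation of this version: count how many ranking rows a saved file actually contains. A minimal sketch (the row class comes from the page analysis in step 4; the path assumes a page saved by get_html()):

    from bs4 import BeautifulSoup

    with open("./h5/2024-11/全部-2024-11.html", "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    items = soup.findAll("div", class_="tw-relative tw-grid tw-items-center tw-grid-cols-[65px_168px_200px_auto_92px]")
    # If this is much smaller than the fully scrolled page shows, the
    # requests-only fetch did not capture the lazy-loaded rows
    print(len(items))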
    

    Step 3: use Selenium to drive a browser, automatically loading each page and scrolling until the content is fully rendered.

    Packages used (the code also imports msedge-selenium-tools, requests and pandas, so requirements.txt needs them too):

    selenium==3.141.0
    urllib3==1.26.2
    msedge-selenium-tools
    requests
    pandas
    openpyxl
    xlsxwriter
    beautifulsoup4
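
    The heart of step 3 is a "scroll until the page height stops growing" loop. Stripped of the period/category loops, the pattern used in the full code below looks like this (a minimal sketch; the driver path must point at a msedgedriver matching your installed Edge version):

    import time
    from msedge.selenium_tools import Edge, EdgeOptions

    options = EdgeOptions()
    options.use_chromium = True
    driver = Edge(executable_path=r"C:\Program Files (x86)\Microsoft\Edge\Application\msedgedriver.exe", options=options)
    driver.get("https://www.autohome.com.cn/rank/1-1-0-0_9000-x-x-x/2024-11.html")

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Jump to the bottom so the next batch of rows lazy-loads
        driver.execute_script("document.documentElement.scrollTop = document.documentElement.scrollHeight")
        time.sleep(2)  # give the new rows time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # height stable => fully loaded
            break
        last_height = new_height

    html = driver.page_source
    driver.quit()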
    

    Step 4: parse the HTML with BeautifulSoup, extract the fields of interest, and save them as Excel files.

    Examine the hierarchy of each ranking item, then write the parsing code. A single item looks like this:

    <div class="tw-relative tw-grid tw-items-center tw-grid-cols-[65px_168px_200px_auto_92px]">
        <div class="tw-flex tw-h-[116px] tw-w-[50px] tw-flex-col tw-justify-start tw-pt-5 tw-text-center">
            <div class="tw-absolute tw-left-0 tw-top-0 tw-flex tw-h-full tw-w-[50px] tw-flex-col tw-justify-start tw-pt-5 tw-text-center">
                <div class="tw-min-w-[50px] tw-bg-[length:100%] tw-text-xl tw-font-bold tw-italic tw-leading-[30px]">01</div>
                <div class="tw-mt-[5px] tw-flex tw-min-w-[50px] tw-items-center tw-justify-center tw-text-[15px] tw-font-[600] tw-leading-[21px] tw-text-[#FF6600]">
                    <svg xmlns="http://www.w3.org/2000/svg" width="1em" height="1em" fill="none" viewBox="0 0 8.58 14.3">
                        <path fill="#F60" fill-rule="evenodd" d="m0 6.454 4.29-4.29 4.29 4.29z"></path>
                        <path fill="#F60" fill-rule="evenodd" d="M3.223 12.743V6.217h2.145v6.526z"></path>
                    </svg>1</div>
            </div>
        </div>
        <div class="tw-mr-5 tw-h-[111px] tw-w-[148px]">
            <img width="148" height="111" alt="Model Y" class="tw-img-placeholder tw-h-full tw-w-full" src="//g.autoimg.cn/@img/car2/cardfs/series/g28/M09/55/A3/872x666_autohomecar__CjIFVGSiMe-AFK8LAAdjPz_Yhv0678.png" srcset="//g.autoimg.cn/@img/car2/cardfs/series/g28/M09/55/A3/872x666_autohomecar__CjIFVGSiMe-AFK8LAAdjPz_Yhv0678.png 2x, //g.autoimg.cn/@img/car2/cardfs/series/g28/M09/55/A3/1308x999_autohomecar__CjIFVGSiMe-AFK8LAAdjPz_Yhv0678.png 3x">
        </div>
        <div class="tw-flex tw-flex-col tw-whitespace-nowrap">
            <div class="tw-text-nowrap tw-text-lg tw-font-medium hover:tw-text-[#ff6600]">Model Y</div>
            <div class="tw-flex tw-items-center">
                <div class="tw-mr-[5px] tw-h-[14px] tw-w-[78px] tw-bg-[length:14px_14px] tw-bg-[position:0_center,16px_center,32px_center,48px_center,64px_center] tw-bg-no-repeat" style="background-image: url(&quot;//z.autoimg.cn/pcm/rank/images/star_grey.png&quot;), url(&quot;//z.autoimg.cn/pcm/rank/images/star_grey.png&quot;), url(&quot;//z.autoimg.cn/pcm/rank/images/star_grey.png&quot;), url(&quot;//z.autoimg.cn/pcm/rank/images/star_grey.png&quot;), url(&quot;//z.autoimg.cn/pcm/rank/images/star_grey.png&quot;);">
                    <div class="tw-mr-[5px] tw-h-[14px] tw-bg-[length:14px_14px] tw-bg-[position:0_center,16px_center,32px_center,48px_center,64px_center] tw-bg-no-repeat" style="width: 87%; background-image: url(&quot;//z.autoimg.cn/pcm/rank/images/star_orange.png&quot;), url(&quot;//z.autoimg.cn/pcm/rank/images/star_orange.png&quot;), url(&quot;//z.autoimg.cn/pcm/rank/images/star_orange.png&quot;), url(&quot;//z.autoimg.cn/pcm/rank/images/star_orange.png&quot;), url(&quot;//z.autoimg.cn/pcm/rank/images/star_orange.png&quot;);"></div>
                </div>
                <span class="tw-text-[#FF6600]">
                    <strong class=" tw-font-bold">4.35</strong>分
                </span>
            </div>
            <div class=" tw-font-medium tw-text-[#717887]">24.99-35.49万</div>
        </div>
        <div class="tw-mx-4 tw-flex tw-flex-col tw-items-center tw-whitespace-nowrap xl:tw-mx-[92px]">
            <div class="tw-mb-0.5 tw-flex tw-items-center">
                <span class="tw-pt-[5px]  tw-text-[22px] tw-font-[500] tw-leading-none">48202</span>
            </div>
            <span class="tw-text-sm tw-text-[#717887]">车系销量</span>
            <span class="tw-text-sm tw-text-[#717887]"></span>
        </div>
        <button data-role="inquiry-pc-newcar" data-inquiry-type="transaction" data-series-id="5769" data-eid="1|2|572|20024|206276|305810" type="button" class="ant-btn css-krwonr ant-btn-primary js-rank-inquiry-btn tw-font-medium !tw-shadow-none">
            <span>查成交价</span>
        </button>
    </div>
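
    The selectors can be sanity-checked against this structure before running the full parser. A minimal sketch, with the sample trimmed to just the nodes the selectors target:

    from bs4 import BeautifulSoup

    sample_html = """
    <div class="tw-relative tw-grid tw-items-center tw-grid-cols-[65px_168px_200px_auto_92px]">
      <div class="tw-min-w-[50px] tw-bg-[length:100%] tw-text-xl tw-font-bold tw-italic tw-leading-[30px]">01</div>
      <div class="tw-text-nowrap tw-text-lg tw-font-medium hover:tw-text-[#ff6600]">Model Y</div>
      <strong class=" tw-font-bold">4.35</strong>
      <div class=" tw-font-medium tw-text-[#717887]">24.99-35.49万</div>
      <span class="tw-pt-[5px]  tw-text-[22px] tw-font-[500] tw-leading-none">48202</span>
    </div>
    """
    item = BeautifulSoup(sample_html, "html.parser")
    rank = item.find("div", class_="tw-min-w-[50px] tw-bg-[length:100%] tw-text-xl tw-font-bold tw-italic tw-leading-[30px]").text.strip()
    name = item.find("div", class_="tw-text-nowrap tw-text-lg tw-font-medium hover:tw-text-[#ff6600]").text.strip()
    score = item.find("strong", class_="tw-font-bold").text.strip()  # leading space in the class attribute is dropped
    price = item.find("div", class_="tw-font-medium tw-text-[#717887]").text.strip()
    sales = item.find("span", class_=["tw-pt-[5px]", "tw-text-[22px]", "tw-font-[500]", "tw-leading-none"]).text.strip()
    print(rank, name, score, price, sales)  # 01 Model Y 4.35 24.99-35.49万 48202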
    

    Code

    """
    # File       : CarSalesRanking.py
    # Time       :2025/1/10 上午9:21
    # Author     :leejack
    # Description:
    爬取汽车之家销量排行榜数据
    https://www.autohome.com.cn/rank/1#pvareaid=6863887
    时间:2024-07~2024-12
    分类:全部、全部轿车、微型车、小型车、紧凑型车、中型车、中大型车、全部SUV、小型SUV、紧凑型SUV、中型SUV、中大型SUV、大型SUV、MPV
    """
    import os
    import time
    import logging

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    from msedge.selenium_tools import Edge, EdgeOptions  # Edge bindings for Selenium 3
    
    
    """
    url规律探索:
    url = 基础url + 类别url + 时期url
    url = base_url + category_url + period_url
    url = https://www.autohome.com.cn/rank/ + 1-1-0-0_9000-x-x-x + /2024-11.html
    """
    # Base URL
    base_url = "https://www.autohome.com.cn/rank/"
    category_url_dict = {
        "全部": "1-1-0-0_9000-x-x-x/",
        "全部轿车": "1-1-1%2C2%2C3%2C4%2C5%2C6-0_9000-x-x-x/",
        "微型车": "1-1-1-0_9000-x-x-x/",
        "小型车": "1-1-2-0_9000-x-x-x/",
        "紧凑型车": "1-1-3-0_9000-x-x-x/",
        "中型车": "1-1-4-0_9000-x-x-x/",
        "中大型车": "1-1-5-0_9000-x-x-x/",
        "全部SUV": "1-1-16%2C17%2C18%2C19%2C20-0_9000-x-x-x/",
        "小型SUV": "1-1-16-0_9000-x-x-x/",
        "紧凑型SUV": "1-1-17-0_9000-x-x-x/",
        "中型SUV": "1-1-18-0_9000-x-x-x/",
        "中大型SUV": "1-1-19-0_9000-x-x-x/",
        "大型SUV": "1-1-20-0_9000-x-x-x/",
        "MPV": "1-1-21%2C22%2C23%2C24-0_9000-x-x-x/",
    }
    periods = [
        "2024-12",
        "2024-11",
        "2024-10",
        "2024-09",
        "2024-08",
        "2024-07",
    ]
    
    # Logging setup
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S'))
    logger.addHandler(handler)
    
    
    # Initialization: create the output directory tree
    def initialize():
        # makedirs(exist_ok=True) also creates ./h5 itself on the first pass,
        # and keeps working if the periods list changes (the old check against
        # a single hard-coded subfolder did not)
        for period in periods:
            os.makedirs(f"./h5/{period}", exist_ok=True)
        os.makedirs("./results", exist_ok=True)
    
    
    # Fetch the HTML pages with requests
    def get_html():
        for period in periods:
            for category, category_url in category_url_dict.items():
                logger.info(f"Category: {category}\tPeriod: {period}")
                url = base_url + category_url + period + ".html"
                logger.info(url)
                response = requests.get(
                    url=url,
                    headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                                           "Chrome/100.0.4896.60 Safari/537.36 Edg/100.0.1185.29"}
                )
                logger.info(response.status_code)
                if response.status_code == 200:
                    logger.info("Success")
                    # Use the same "category-period" file name that parse_html() expects
                    h5_path = "./h5/" + period + "/" + category + "-" + period + ".html"
                    if not os.path.exists(h5_path):
                        with open(h5_path, "w", encoding="utf-8") as f:
                            f.write(response.text)
                else:
                    # warning, not debug, so failures show up at the INFO log level
                    logger.warning("Failed")
    
    
    # Use Selenium to scroll through each page until it is fully loaded
    def get_html_with_selenium():
        options = EdgeOptions()
        options.use_chromium = True

        # Anti-detection: hide the "browser is being automated" switch
        options.add_experimental_option("excludeSwitches", ["enable-automation"])

        driver_path = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedgedriver.exe"  # msedgedriver must match the installed Edge version
        driver = Edge(executable_path=driver_path, options=options)
        driver.implicitly_wait(10)

        for period in periods:
            for category, category_url in category_url_dict.items():
                logger.info(f"Category: {category}\tPeriod: {period}")
                url = base_url + category_url + period + ".html"
                logger.info(url)

                driver.get(url)

                # Use the page height to decide whether everything has loaded
                last_height = driver.execute_script("return document.body.scrollHeight")
                while True:
                    driver.execute_script('document.documentElement.scrollTop = document.documentElement.scrollHeight')  # scroll to the bottom
                    time.sleep(2)  # wait for the page to load

                    # Get the new page height
                    new_height = driver.execute_script("return document.body.scrollHeight")

                    # If the height did not grow, the page is fully loaded
                    if new_height == last_height:
                        break

                    # Remember the new height
                    last_height = new_height

                # Save the rendered page source and log progress
                with open(f"./h5/{period}/{category}-{period}.html", "w", encoding="utf-8") as f:
                    f.write(driver.page_source)
                    logger.info(f"Period: {period}\tCategory: {category}")
                time.sleep(5)  # pause between pages

        # Shut down Selenium
        driver.quit()
    
    
    # Parse the saved HTML files
    def parse_html():
        for period in periods:
            # One xlsx per period, one sheet per category
            with pd.ExcelWriter(f"./results/{period}.xlsx", engine="xlsxwriter") as writer:
                for category, category_url in category_url_dict.items():
                    with open(f"./h5/{period}/{category}-{period}.html", "r", encoding="utf-8") as f:
                        html_content = f.read()
                        soup = BeautifulSoup(html_content, "html.parser")

                        period_category_datas = []

                        # Each ranking row is one grid <div>
                        items = soup.findAll("div", class_="tw-relative tw-grid tw-items-center tw-grid-cols-[65px_168px_200px_auto_92px]")
                        for item in items:
                            # Rank
                            rank = item.find("div", class_="tw-min-w-[50px] tw-bg-[length:100%] tw-text-xl tw-font-bold tw-italic tw-leading-[30px]").text.strip()
                            # Car name
                            car_name = item.find("div", class_="tw-text-nowrap tw-text-lg tw-font-medium hover:tw-text-[#ff6600]").text.strip()
                            # Score; some cars have none, so fall back on AttributeError
                            try:
                                score = item.find("strong", class_="tw-font-bold").text.strip()  # the leading space in the HTML class attribute must be dropped
                            except AttributeError as error:
                                logger.debug(error)
                                score = "暂无"  # "not available"
                            # Price range (min-max)
                            prices = item.find("div", class_="tw-font-medium tw-text-[#717887]").text.strip()
                            if "-" in prices:
                                min_price, max_price = prices.split(sep="-")
                                min_price = min_price + "万"
                            else:
                                min_price = prices
                                max_price = prices
                            # Sales volume
                            sales_volume = item.find("span", class_=["tw-pt-[5px]", "tw-text-[22px]", "tw-font-[500]", "tw-leading-none"]).text.strip()

                            item_dict = {
                                "名次": rank,
                                "车名": car_name,
                                "评分": score,
                                "最低价": min_price,
                                "最高价": max_price,
                                "销量": sales_volume
                            }
                            period_category_datas.append(item_dict)

                        # Log summary statistics
                        logger.info(f"Period: {period}\tCategory: {category}\tCount: {len(period_category_datas)}")

                        # Write one sheet per category
                        df = pd.DataFrame.from_records(period_category_datas)
                        df.to_excel(excel_writer=writer, index=False, sheet_name=f"{category}")
    
    
    if __name__ == "__main__":
        initialize()                  # create the directory tree
        # get_html()                  # step 2: requests-only version (pages are incomplete)
        # get_html_with_selenium()    # step 3: uncomment on the first run to download the pages
        parse_html()                  # step 4: parse the saved pages into xlsx files
    
    
    
    

    Results
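
    The generated workbooks can be spot-checked with pandas (a quick sketch, assuming parse_html() has already run and produced ./results/2024-12.xlsx):

    import pandas as pd

    # Load the "全部" (all categories) sheet of one period's workbook
    df = pd.read_excel("./results/2024-12.xlsx", sheet_name="全部")
    print(df.head())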

    Project structure

    Dataset preview

    Author: 热爱旅行的小李同学
