Code Collector Technical Tutorials 2024-11-11

[Python] Web Scraping: A Detailed wordcloud Tutorial, Scraping the Latest Douban Reviews and Generating Word Clouds

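I. Overview

II. Key Techniques

1. Installing WordCloud

2. Using WordCloud

1. Basic usage of WordCloud

**Parameters**

**Methods provided by WordCloud**

2. A WordCloud example

3. Setting stopwords

4. Using word frequencies with WordCloud

III. Program Design Steps

1. Fetching the web page data

2. Data cleaning

3. Displaying the word cloud

IV. Full Code

V. Final Result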

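I. Overview

A word cloud visually emphasizes the keywords that appear most often in a body of text, forming a "keyword cloud" or "keyword rendering". It filters out the bulk of the text so that a reader can grasp the main point at a glance.

This project scrapes the latest movie reviews from Douban (using the recently released Alien: Romulus, 异形：夺命舰, as the example), cleans the data, extracts weighted keywords, and displays them as a word cloud.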

II. Key Techniques

1. Installing WordCloud

pip install wordcloud

2. Using WordCloud

1. Basic usage of WordCloud

class wordcloud.WordCloud(font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=0.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='black', max_font_size=None, font_step=1, mode='RGB', relative_scaling=0.5, regexp=None, collocations=True, colormap=None, normalize_plurals=True)

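WordCloud provides the following methods:

fit_words(frequencies): generates a word cloud from word frequencies.

generate(text): generates a word cloud from text.

generate_from_frequencies(frequencies[…]): generates a word cloud from word frequencies.

generate_from_text(text): generates a word cloud from text.

process_text(text): splits a long text into words and removes stopwords. This works for English; Chinese text has to be segmented with another library first, and the resulting frequencies passed to fit_words(frequencies) above.

recolor([random_state, color_func, colormap]): recolors the existing output, which is much faster than regenerating the whole word cloud.

to_array(): converts the word cloud to a NumPy array.

to_file(filename): writes the word cloud to an image file.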

2. A WordCloud example

from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
import matplotlib.pyplot as plt
from PIL import Image  # Pillow is used instead of scipy.misc.imread
import numpy as np

with open('test.txt', 'r', encoding='utf-8') as f:  # read a text file
    text = f.read()
bg_pic = Image.open('alice.png')  # load the background (mask) image
# Configure the word cloud style
wc = WordCloud(background_color='white', mask=np.array(bg_pic), font_path="simhei.ttf", max_words=2000, max_font_size=150,
               random_state=30, scale=1.5)
wc.generate_from_text(text)  # build the word cloud from the text

# Color generator sampled from the image; it has no effect unless passed to wc.recolor(color_func=image_colors)
image_colors = ImageColorGenerator(np.array(bg_pic))
plt.imshow(wc)  # display the word cloud
plt.axis('off')
plt.show()
print('display success!')

# Save the word cloud image
wc.to_file('test2.jpg')

Result (word cloud image not reproduced here).

3. Setting stopwords

You can add stopwords manually so that those words do not appear in the word cloud.

from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS  # word cloud package

# Read the whole article
with open('test.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Load the mask / coloring image
alice_coloring = np.array(Image.open('alice.png'))
# Set up the stopwords
stopwords = set(STOPWORDS)
stopwords.add("的")  # manually add a stopword
stopwords.add("了")  # manually add a stopword

# The mask parameter controls the shape of the word cloud
wc = WordCloud(background_color='white', mask=alice_coloring, font_path="simhei.ttf", max_words=2000,
               stopwords=stopwords, max_font_size=40,
               random_state=42)
# Generate the word cloud
wc.generate(text)
# Color generator sampled from the image; it has no effect unless passed to wc.recolor(color_func=image_colors)
image_colors = ImageColorGenerator(alice_coloring)
plt.imshow(wc, interpolation="bilinear")  # display the word cloud
plt.axis('off')
plt.show()

# Save the word cloud image
wc.to_file('test2.jpg')

Result (word cloud image not reproduced here).

4. Using word frequencies with WordCloud

import jieba.analyse
from PIL import Image, ImageSequence
import numpy as np
import matplotlib.pyplot as plt

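from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS  # word cloud package

# Read the whole file in one call
with open('./test.txt', 'r', encoding='utf-8') as f:
    lyric = f.read()
# Use jieba's TextRank to extract the 50 highest-weighted keywords
# (the values are TextRank weights, not raw occurrence counts)
result = jieba.analyse.textrank(lyric, topK=50, withWeight=True)
# Map each keyword to its weight; this dict can be passed to generate_from_frequencies
keywords = dict(result)
print(keywords)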

Output (not reproduced here).
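TextRank weights are one way to rank words. If plain occurrence counts are preferred, a frequency dictionary can also be built with the standard library. The sketch below assumes the text has already been segmented into tokens (for example by jieba); the token list is invented for illustration.

```python
from collections import Counter

# Hypothetical, already-segmented tokens (stand-in for a segmenter's output).
tokens = ["电影", "好看", "异形", "电影", "特效", "异形", "电影"]

# Count how often each token occurs.
counts = Counter(tokens)

# Keep only the two most frequent tokens, as a plain dict of word -> count.
top = dict(counts.most_common(2))
print(top)
```

A dictionary of this shape (word mapped to a number) is what `generate_from_frequencies` expects, whether the numbers are counts or TextRank weights.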

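III. Program Design Steps

1. Fetching the web page data

First, find the div tag in the page whose id is 'nowplaying'.

Then, inside that div, find all li tags whose class is 'list-item'.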
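The two lookups above can be sketched with BeautifulSoup as follows. The HTML string is a hand-written stand-in that only imitates the structure described (the ids and titles are hypothetical); the real page is much larger and may change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating the structure described above.
html = """
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="111"><img alt="Movie A"></li>
    <li class="list-item" data-subject="222"><img alt="Movie B"></li>
    <li class="other" data-subject="333"><img alt="Not a listing"></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: the div with id 'nowplaying' (None if the page layout differs).
container = soup.find("div", id="nowplaying")

# Step 2: every li with class 'list-item' inside that div.
items = container.find_all("li", class_="list-item") if container else []

# Collect the movie id and title from each list item.
movies = [{"id": li["data-subject"], "name": li.find("img")["alt"]} for li in items]
print(movies)
```

Checking `container` before calling `find_all` avoids an AttributeError when the expected div is missing, for example when the site returns an error page.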

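2. Data cleaning

**Remove data irrelevant to the analysis**

1. Strip punctuation with a regular expression. The pattern removes every character that is neither a word character nor whitespace, so Latin letters and digits are kept along with Chinese characters: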

pattern = re.compile(r'[^\w\s]')
cleaned_comments = pattern.sub('', comments)
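To make the effect of this pattern concrete, the sketch below applies it to an invented comment and compares it with a stricter variant that keeps only Chinese characters. The stricter pattern is an assumption added for comparison and is not part of the original program.

```python
import re

# Hypothetical comment mixing Chinese, English, digits, and punctuation.
sample = "太好看了!!! Alien 2024, 值得一看。"

# The tutorial's pattern: drop anything that is neither a word character nor whitespace.
no_punct = re.sub(r"[^\w\s]", "", sample)

# Stricter variant (assumption): keep only characters in the range U+4E00 to U+9FFF.
chinese_only = re.sub(r"[^\u4e00-\u9fff]", "", sample)

print(no_punct)      # 太好看了 Alien 2024 值得一看
print(chinese_only)  # 太好看了值得一看
```

Which variant is appropriate depends on whether English titles and numbers in the comments should be eligible to appear in the word cloud.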

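2. Stopword filtering (applied to the keywords dictionary produced by the keyword extraction in the next step):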

stopwords = set(STOPWORDS)
with open('./StopWords.txt', encoding="utf-8") as f:
    stopwords.update(word.strip() for word in f)
keywords = {word: score for word, score in keywords.items() if word not in stopwords}
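The filtering logic can be tried without any files. In the sketch below, a hypothetical helper `load_stopwords` builds the set from any iterable of lines, and both the stopword lines and the keyword weights are invented stand-ins for StopWords.txt and the TextRank output.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


# Hypothetical stand-ins for the file contents and the extracted keywords.
stopword_lines = ["的\n", "了\n", "\n", "电影\n"]
keywords = {"电影": 1.0, "特效": 0.8, "剧情": 0.6}

stopwords = load_stopwords(stopword_lines)

# Keep only the keywords that are not stopwords.
filtered = {word: score for word, score in keywords.items() if word not in stopwords}
print(filtered)
```

Skipping blank lines keeps the empty string out of the set; an open file object can be passed to `load_stopwords` in place of the list.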

3. Chinese word segmentation and keyword extraction with jieba (TextRank):

result = jieba.analyse.textrank(cleaned_comments, topK=150, withWeight=True)

3. Displaying the word cloud

1. Create the word cloud object:

wordcloud = WordCloud(font_path="simhei.ttf", mask=np.array(bg_pic), background_color="white",
                     max_font_size=80, stopwords=stopwords).generate_from_frequencies(keywords)

2. Display the word cloud:

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

3. Print a success message:

print('Word cloud displayed successfully!')

IV. Full Code

import warnings
import jieba
import jieba.analyse

import re
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup as bs
from wordcloud import WordCloud, STOPWORDS


# Suppress warnings
warnings.filterwarnings("ignore")

# Set the matplotlib figure size
plt.rcParams['figure.figsize'] = (10.0, 5.0)


# Fetch and parse the list of movies that are now playing
def getNowPlayingMovieList():
    url = 'https://movie.douban.com/nowplaying/guangzhou'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
    }
    try:
        resp = requests.get(url, headers=headers, timeout=10)  # timeout so a stalled request cannot hang forever
        resp.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
        html = resp.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP error: {errh}")
        return []
    except requests.exceptions.RequestException as err:
        print(f"Request error: {err}")
        return []
    soup = bs(html, 'html.parser')
    nowplaying_movie = soup.find('div', id='nowplaying')
    if not nowplaying_movie:
        return []
    nowplaying_movie_list = nowplaying_movie.find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        nowplaying_dict['name'] = item.find('img')['alt']
        nowplaying_list.append(nowplaying_dict)
    return nowplaying_list


# Fetch one page of comments for a movie
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum <= 0:
        return eachCommentList
    start = (pageNum - 1) * 20
    url = f'https://movie.douban.com/subject/{movieId}/comments?start={start}&limit=20'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
    }
    try:
        resp = requests.get(url, headers=headers, timeout=10)  # timeout so a stalled request cannot hang forever
        resp.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
        html = resp.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP error: {errh}")
        return []
    except requests.exceptions.RequestException as err:
        print(f"Request error: {err}")
        return []
    soup = bs(html, 'html.parser')
    comment_div_list = soup.find_all('div', class_='comment')
    for item in comment_div_list:
        if item.find('p'):
            eachCommentList.append(item.find('p').text.strip())
    return eachCommentList


def main():
    NowPlayingMovie_list = getNowPlayingMovieList()
    if not NowPlayingMovie_list:
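        print("Failed to fetch the movie list")
        return

    commentList = []
    for i in range(1, 11):  # pages 1 through 10
        comments_temp = getCommentsById(NowPlayingMovie_list[0]['id'], i)  # the index selects the movie; [0] is the first one listed
        commentList.extend(comments_temp)

    comments = " ".join(commentList)
    # Strip punctuation and symbols; letters, digits, and Chinese characters are kept
    pattern = re.compile(r'[^\w\s]')
    cleaned_comments = pattern.sub('', comments)

    # Extract the top 150 keywords and their weights with jieba's TextRank
    result = jieba.analyse.textrank(cleaned_comments, topK=150, withWeight=True)
    keywords = {word: weight for word, weight in result}

    # Build the stopword set
    stopwords = set(STOPWORDS)
    with open('./StopWords.txt', encoding="utf-8") as f:
        stopwords.update(word.strip() for word in f)

    # Filter out stopwords
    keywords = {word: score for word, score in keywords.items() if word not in stopwords}
    if not keywords:  # generate_from_frequencies raises ValueError on an empty dict
        print("No keywords left to display")
        return

    # Build the word cloud; generate_from_frequencies does not apply the
    # stopwords argument, which is why the keywords are filtered manually above
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white",
                          max_font_size=80,
                          stopwords=stopwords).generate_from_frequencies(keywords)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    print('Word cloud displayed successfully!')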


if __name__ == "__main__":
    main()