Code Collector Technical Tutorials 2024-12-20

Python – Using the jieba library

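Contents

jieba library overview

The three segmentation modes of jieba

Installing jieba

How jieba segmentation works

Common jieba functions

Example: text word frequency statistics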

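jieba library overview

jieba is an excellent third-party library for Chinese word segmentation. It is not part of the standard library and must be installed separately.

Chinese text has no spaces between words, so it must be segmented to obtain individual words.

jieba provides three segmentation modes; the simplest use requires learning only one function.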

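The three segmentation modes of jieba

Precise mode, full mode, and search engine mode

Precise mode: cuts the text into words precisely, with no redundant words
Full mode: scans out every possible word in the text, so the result contains redundancy
Search engine mode: starts from the precise-mode result and splits long words again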

Installing jieba

From the command line (cmd): pip install jieba
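After running the install command, it can help to confirm that the package is visible to the interpreter you are actually using. The following is a minimal sketch using only the standard library; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("jieba"):
        import jieba
        # Fall back to "unknown" in case the attribute is missing in some build.
        print("jieba version:", getattr(jieba, "__version__", "unknown"))
    else:
        print("jieba is not installed; run: pip install jieba")
```

If the check fails even though pip reported success, pip and python probably belong to different environments.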

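How jieba segmentation works

jieba uses a Chinese lexicon to determine the association probabilities between Chinese characters.

Characters with a high association probability are grouped into words, and these words form the segmentation result.

Besides segmentation, users can also add custom words.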
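To make the idea of lexicon-based probabilities concrete, here is a toy sketch that picks the split with the highest total log-probability using dynamic programming. It is not jieba's actual implementation: the lexicon TOY_FREQ, its counts, the unknown-character penalty, and the function name segment are all made up for illustration. Raising a word's count in the toy lexicon stands in for adding a custom word; in jieba itself this is done with calls such as jieba.add_word, which this sketch does not use.

```python
import math

# Hypothetical toy lexicon: word -> made-up frequency count (not jieba's data).
TOY_FREQ = {"在": 100, "努力": 50, "学习": 60, "力学": 10, "努力学习": 5}


def segment(text, freq):
    """Return the split of text whose words have the highest total log-probability."""
    total = sum(freq.values())
    max_len = max(len(word) for word in freq)
    # Assumed penalty for a single character that is missing from the lexicon.
    unknown_logp = math.log(1 / (total * 10))
    # best[i] holds (score, start) for the best segmentation of text[:i].
    best = [(-math.inf, 0)] * (len(text) + 1)
    best[0] = (0.0, 0)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in freq:
                logp = math.log(freq[piece] / total)
            elif len(piece) == 1:
                logp = unknown_logp
            else:
                continue  # multi-character pieces must come from the lexicon
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the stored start positions backwards to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("在努力学习", TOY_FREQ))     # ['在', '努力', '学习']

# Giving the long word a high count makes it win as a single unit.
custom_freq = {**TOY_FREQ, "努力学习": 500}
print(segment("在努力学习", custom_freq))  # ['在', '努力学习']
```

With the original counts the two short words are more probable than the rare long word, so the text is split further; once the long word is frequent enough, it is kept whole.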

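Common jieba functions

Function	Description
jieba.cut(s)	Precise mode; returns an iterable (a generator)
jieba.cut(s,cut_all=True)	Full mode; yields every possible word in the text s
jieba.cut_for_search(s)	Search engine mode; produces results suited to building a search engine index
jieba.lcut(s)	Precise mode; returns a list instead of a generator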

'''
@Author: yjy
@Time: 2024/11/12
'''
import jieba

s = 'yjy在努力学习Python'
print(jieba.cut(s)) # <generator object Tokenizer.cut at 0x0000021EFBCA4040>
print(jieba.cut(s,cut_all=True)) # <generator object Tokenizer.cut at 0x000001A6DE434040>
print(jieba.cut_for_search(s)) # <generator object Tokenizer.cut_for_search at 0x000002BDA3E73890>
print(list(jieba.cut(s))) # ['yjy', '在', '努力学习', 'Python']
print(list(jieba.cut(s,cut_all=True))) # ['yjy', '在', '努力', '努力学习', '力学', '学习', 'Python']
print(list(jieba.cut_for_search(s))) # ['yjy', '在', '努力', '力学', '学习', '努力学习', 'Python']

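Example: text word frequency statistics

Problem analysis:
Text word frequency statistics

Requirement: given an article, which words appear in it, and which appear most often?

How do we do this?

Here we use the text at
https://python123.io/resources/pye/hamlet.txt as an example:

"Hamlet English word frequency statistics"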

# Denoise and normalize the text
def getText():
    # Read the whole file; the with-block closes it afterwards
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so that split() sees clean words
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")
    return txt

# Use a dictionary to represent word frequencies
if __name__ == '__main__':
    hamletTxt = getText()
    words = hamletTxt.split()
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1  # count occurrences of each word
    items = list(counts.items())
    # print(items)  # ('the', 1138), ('tragedy', 3), ('of', 669), ('hamlet', 462), ...
    items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))
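The same counting steps can also be sketched with the standard library's collections.Counter, which handles both the tallying and the sorting. This is an alternative sketch, not the approach used above: the function name top_words and the sample sentence are made up, and string.punctuation also contains the apostrophe, which the character list in the script above leaves out.

```python
from collections import Counter
import string


def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)


# Hypothetical sample text standing in for the contents of hamlet.txt.
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 2):
    print("{0:<10}{1:>5}".format(word, count))
```

most_common(n) returns the pairs already ordered by descending count, so no separate sort call is needed.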

"""
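Basic syntax of format strings
The basic syntax of a format string is {} and :, where {} is a placeholder and the part after : is the format specifier. A format specifier can include the alignment, fill character, width, precision, and so on.

Alignment and width
Alignment:
<: left-aligned
>: right-aligned
^: centered
Width:
Specifies the minimum width of the placeholder. If the actual content is shorter than the specified width, it is padded with spaces or with another specified fill character.
Examples
Left alignment (<)

print("{0:<10}".format("hello"))
Output:

hello     
Explanation: "hello" is left-aligned in a total width of 10, padded with spaces on the right.

Right alignment (>)

print("{0:>10}".format("hello"))
Output:

     hello
Explanation: "hello" is right-aligned in a total width of 10, padded with spaces on the left.

Centered (^)

print("{0:^10}".format("hello"))
Output:

  hello   
Explanation: "hello" is centered in a total width of 10, padded with two spaces on the left and three on the right.

Fill character
A fill character can be given before the alignment flag. The default fill character is a space.

print("{0:*<10}".format("hello"))  # left-aligned, filled with *
print("{0:*>10}".format("hello"))  # right-aligned, filled with *
print("{0:*^10}".format("hello"))  # centered, filled with *
Output:

hello*****
*****hello
**hello***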


"""

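The same alignment and width specifiers also work inside f-strings, which some readers find easier to scan than str.format. A small sketch, where the helper name format_row and its default widths are chosen only for this illustration:

```python
def format_row(word, count, word_width=10, count_width=5):
    """Left-align the word and right-align the count in fixed-width columns."""
    # Nested braces let the widths themselves come from variables.
    return f"{word:<{word_width}}{count:>{count_width}}"


print(format_row("the", 1138))
print(f"{'hello':*^10}")  # **hello***
```

The first call produces the same layout as "{0:<10}{1:>5}".format(word, count) in the word frequency scripts.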
Chinese text: character analysis of "Romance of the Three Kingdoms", https://python123.io/resources/pye/threekingdoms.txt

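'''
Segment Chinese text and use a dictionary to represent word frequencies
'''
import jieba
# Read the novel (saved locally from the URL above)
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    text = f.read()
words = jieba.lcut(text)  # precise mode; lcut returns a list directly
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single characters, which are mostly particles and punctuation
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))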

We find that the same person is counted separately under each of their names and titles, so the code needs to be modified to merge them:

import jieba
# Read the novel (saved locally from the URL above)
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Frequent words that are not character names
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
words = jieba.lcut(txt)  # precise mode
counts = {}
for word in words:
    # Skip single characters, then merge different names of the same person
    if len(word) == 1:
        continue
    elif word == '诸葛亮' or word == '孔明曰':
        rword = '孔明'
    elif word == '关公' or word == '云长':
        rword = "关羽"
    elif word == '玄德' or word == '玄德曰':
        rword = '刘备'
    elif word == '孟德' or word == '丞相':
        rword = '曹操'
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError if the word never appeared
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    print(word, count)
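As the list of alternative names grows, the chain of elif branches becomes hard to maintain. One possible refactoring, sketched here on an already segmented token list so that it runs without the novel or jieba, keeps the aliases in a dictionary. The names ALIASES, EXCLUDES, count_names and the sample tokens are made up for this illustration; the alias pairs and excluded words are the ones used above.

```python
from collections import Counter

# Alias -> canonical name, taken from the elif branches above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_names(words):
    """Count multi-character words, merging aliases and dropping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        name = ALIASES.get(word, word)  # unknown words map to themselves
        if name in EXCLUDES:
            continue
        counts[name] += 1
    return counts


# Hypothetical tokens standing in for the output of jieba.lcut(txt).
tokens = ["却说", "玄德", "曰", "孔明", "诸葛亮", "孔明曰", "丞相", "曹操", "将军"]
for name, count in count_names(tokens).most_common(3):
    print(name, count)
```

Adding another alias then only requires a new dictionary entry rather than a new branch.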

Author: yjy