Learning Intelligent Algorithms with Python (15) | Advanced Naive Bayes in Machine Learning: Multi-Document Processing with CountVectorizer

news 2025/7/13 7:25:26

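【1】Introduction

The previous article in this series covered the basics of text processing with CountVectorizer. That article is:

Learning Intelligent Algorithms with Python (14) | Advanced Naive Bayes in Machine Learning: A Simple Test of CountVectorizer Text Processing - CSDN Blog

This article goes one step further and processes several documents together.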

【2】Code test

Compared with the single-text test, the only change is that the input becomes a list of several sentences:

# Import the required module
from sklearn.feature_extraction.text import CountVectorizer

# Multiple documents: each list element is one document
document = [
    "Python programming is fun and useful for data science.",
    "Python is a great programming language for data science.",
    "Data science uses Python for machine learning and AI.",
    "AI and machine learning are fun with Python.",
    "AI is popular at this time.",
]
# Create the vectorizer
vectorizer = CountVectorizer()
print('vectorizer=', vectorizer)
# Fit the vocabulary and transform the documents into a sparse count matrix
X = vectorizer.fit_transform(document)
print('X=', X)
# Inspect the vocabulary
print("Vocabulary:\n", vectorizer.get_feature_names_out())
# Inspect the dense vector representation
print("Vector representation:\n", X.toarray())

Running it gives:

X= (0, 14) 1
(0, 13) 1
(0, 8) 1
(0, 6) 1
(0, 1) 1
(0, 18) 1
(0, 5) 1
(0, 4) 1
(0, 15) 1
(1, 14) 1
(1, 13) 1
(1, 8) 1
(1, 5) 1
(1, 4) 1
(1, 15) 1
(1, 7) 1
(1, 9) 1
(2, 14) 1
(2, 1) 1
(2, 5) 1
(2, 4) 1
(2, 15) 1
(2, 19) 1
(2, 11) 1
(2, 10) 1
(2, 0) 1
(3, 14) 1
(3, 6) 1
(3, 1) 1
(3, 11) 1
(3, 10) 1
(3, 0) 1
(3, 2) 1
(3, 20) 1
(4, 8) 1
(4, 0) 1
(4, 12) 1
(4, 3) 1
(4, 16) 1
(4, 17) 1
Vocabulary:
['ai' 'and' 'are' 'at' 'data' 'for' 'fun' 'great' 'is' 'language'
'learning' 'machine' 'popular' 'programming' 'python' 'science' 'this'
'time' 'useful' 'uses' 'with']
Vector representation:
[[0 1 0 0 1 1 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0]
[0 0 0 0 1 1 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0]
[1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0]
[1 1 1 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1]
[1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0]]