python - How to find which documents are in the same cluster with KMeans

coder 2023-08-14 (original post)

I am clustering various articles together with the Scikit-learn framework. Below are the top 15 words in each cluster:

Cluster 0: whales islands seaworld hurricane whale odile storm tropical kph mph pacific mexico orca coast cabos
Cluster 1: ebola outbreak vaccine africa usaid foundation virus cdc gates disease health vaccines experimental centers obama
Cluster 2: jones bobo sanford children carolina mississippi alabama lexington bodies crumpton mccarty county hyder tennessee sheriff
Cluster 3: isis obama iraq syria president isil airstrikes islamic li strategy terror military war threat al
Cluster 4: yosemite wildfire park evacuation dome firefighters blaze hikers cobb helicopter backcountry trails homes california evacuate

I create the "bag of words" matrix like this:

from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

hasher = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english',
                         use_idf=True)
vectorizer = make_pipeline(hasher, TfidfTransformer())
# document_text_list is a list of all text in a given article
X_train_tfidf = vectorizer.fit_transform(document_text_list)

Then I run KMeans like this:

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                verbose=0, n_clusters=25)
km.fit(X_train_tfidf)

I print out the clusters like this:

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = hasher.get_feature_names()
for i in range(25):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind], end='')
    print()

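However, I would like to know how to find out which documents belong to the same cluster and, ideally, their respective distances to the centroid (center) of that cluster.

I know that each row of the resulting matrix (X_train_tfidf) corresponds to a document, but there is no obvious way to get this information back after running the KMeans algorithm. How would I do this with scikit-learn?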

X_train_tfidf looks like:

X_train_tfidf:   (0, 4661)  0.0405014425985
  (0, 19271)    0.0914545222775
  (0, 20393)    0.287636818634
  (0, 56027)    0.116893929188
  (0, 30872)    0.137815327338
  (0, 35256)    0.0343461345507
  (0, 31291)    0.209804679792
  (0, 66008)    0.0643776635222
  (0, 3806) 0.0967713285061
  (0, 66338)    0.0532881852791
  (0, 65023)    0.0702918299573
  (0, 41785)    0.197672720592
  (0, 29774)    0.120772893833
  (0, 61409)    0.0268609667042
  (0, 55527)    0.134102682463
  (0, 40011)    0.0582437010271
  (0, 19667)    0.0234843097048
  (0, 51667)    0.128270976476
  (0, 52791)    0.57198926651
  (0, 15014)    0.149195054799
  (0, 18805)    0.0277497826525
  (0, 35939)    0.170775938672
  (0, 5808) 0.0473913910636
  (0, 24922)    0.0126531527875
  (0, 10346)    0.0200098997901
  : :
  (23945, 56927)    0.0595132327966
  (23945, 23259)    0.0100977769025
  (23945, 12515)    0.0482102583442
  (23945, 49709)    0.210139450446
  (23945, 28742)    0.0190221880312
  (23945, 16628)    0.137692798005
  (23945, 53424)    0.157029848335
  (23945, 30647)    0.104485375827
  (23945, 57512)    0.0569754813269
  (23945, 39389)    0.0158180459761
  (23945, 26093)    0.0153713768922
  (23945, 9787) 0.0963777149738
  (23945, 23260)    0.158336452835
  (23945, 50595)    0.0527243936945
  (23945, 42447)    0.0527515904547
  (23945, 2829) 0.0351677269698
  (23945, 2832) 0.0175929392039
  (23945, 52079)    0.0849796887889
  (23945, 13523)    0.0878730969786
  (23945, 57849)    0.133869666381
  (23945, 25064)    0.128424780903
  (23945, 31129)    0.0919760384953
  (23945, 65601)    0.0388718258746
  (23945, 1428) 0.391477289626
  (23945, 2152) 0.655211469073
  X_train_tfidf shape: (23946, 67816)

In response to ttttthomasssss's answer:

When I try to run the following:

X_cluster_0 = X_train_tfidf[cluster_0]

I get the error:

File "cluster.py", line 52, in main
    X_cluster_0 = X_train_tfidf[cluster_0]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/csr.py", line 226, in __getitem__
    col = key[1]
IndexError: tuple index out of range

Looking at the structure of cluster_0:

(array([  858,  2012,  2256,  2762,  2920,  3770,  6052,  6174,  8296,
9494,  9966, 10085, 11914, 12117, 12633, 12727, 12993, 13527,
13754, 14186, 14669, 14713, 14973, 15071, 15157, 15208, 15926,
16300, 16301, 17138, 17556, 17775, 18236, 19057, 20106, 21014, 21080]),)

It is a tuple with the content at position 0, so I changed the line to the following:

X_cluster_0 = X_train_tfidf[cluster_0[0]]

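I am pulling the "documents" out of a database, so I can easily look them up by index (iterating through the provided array until I find the corresponding document [assuming, of course, that scikit doesn't alter the order of documents in the matrix]). So I don't understand what exactly X_cluster_0 represents. X_cluster_0 has the following structure: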

  X_cluster_0:   (0, 42726) 0.741747456202
  (0, 13535)    0.115880661286
  (0, 17447)    0.117608794277
  (0, 44849)    0.414829246262
  (0, 14574)    0.10214258736
  (0, 17317)    0.0634383214735
  (0, 17935)    0.0591234431875
  : :
  (17, 33867)   0.0174155914371
  (17, 48916)   0.0227046046275
  (17, 59132)   0.0168864861723
  (17, 40860)   0.0485813219503
  (17, 63725)   0.0271415763987
  (18, 45019)   0.490135684209
  (18, 36168)   0.14595160766
  (18, 52304)   0.139590524213
  (18, 63586)   0.16501953796
  (18, 28709)   0.15075416279
  (18, 11495)   0.0926490431993
  (18, 40860)   0.124236878928

Calculating the distance to the centroid

Currently, running the suggested code (distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])) results in the following error:

File "cluster.py", line 68, in main
    distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/spatial/distance.py", line 211, in euclidean
    dist = norm(u - v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 197, in __sub__
    raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported

This is what km.cluster_centers looks like:

km.cluster_centers: [  9.47080802e-05   2.53907413e-03   0.00000000e+00 ...,   0.00000000e+00
   0.00000000e+00   0.00000000e+00]

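I think the problem I have now is how to extract the i-th item of the matrix (assuming the matrix is traversed from left to right). Whatever level of index nesting I specify makes no difference (i.e. X_cluster_0[0], X_cluster_0[0][0], and X_cluster_0[0][0][0] all give me the same printed matrix structure shown above).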

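Best answer

You can use the fit_predict() function to perform the clustering and obtain the cluster index assigned to each document.

Obtaining the cluster index of each document

You could try the following: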

import numpy as np
import sklearn.cluster

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                            verbose=0, n_clusters=25)
clusters = km.fit_predict(X_train_tfidf)

# The input data has shape (m, n); the clusters array has shape (m,) and
# holds the cluster index of every document
print(X_train_tfidf.shape)
print(clusters.shape)

# Example: get all documents in cluster 0.
# np.where returns a tuple holding one array of row indices
cluster_0 = np.where(clusters == 0)

# cluster_0[0] contains the indices of all documents in this cluster;
# to get the corresponding tf-idf rows:
X_cluster_0 = X_train_tfidf[cluster_0[0]]
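
As a minimal sketch of the same idea for all clusters at once, the labels returned by fit_predict() can be turned into a mapping from cluster index to document indices. The four-document corpus, the variable names docs and members, and the choice of 2 clusters are made up for illustration and are not part of the original question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for document_text_list.
docs = [
    "whale orca pacific coast",
    "orca whale seaworld coast",
    "ebola virus vaccine outbreak",
    "vaccine ebola outbreak health",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one row per document
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)  # labels[i] is the cluster index of docs[i]

# Map every cluster index to the row indices of the documents assigned to it.
members = {c: np.flatnonzero(labels == c) for c in range(km.n_clusters)}

for c, idx in members.items():
    print("Cluster %d:" % c, [docs[i] for i in idx])
```

Because the rows of the matrix keep the order of the input list, the indices in members can be used directly to look documents up in the original list (or database).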

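Finding the distance from each document to each centroid

You can get the centroids with centroids = km.cluster_centers_, which in your case should have dimensionality 25 (number of clusters) x n (number of features). To calculate, for example, the Euclidean distance from a document to a centroid, you can use SciPy (its distance metrics are documented under scipy.spatial.distance):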

# Example: distance from 1 document to 1 cluster centroid
# (euclidean() needs dense 1-D input)
from scipy.spatial.distance import euclidean

distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
print(distance)

Update: distances for sparse and dense matrices

The distance metrics in scipy.spatial.distance require dense input, so if X_cluster_0 is a sparse matrix you can convert it to a dense one:

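# toarray() densifies only the first row (the .A shorthand would densify the
# whole matrix); ravel() flattens the 1 x n result to 1-D
d = euclidean(X_cluster_0[0].toarray().ravel(), km.cluster_centers_[0])
print(d)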

Alternatively, you can use scikit-learn's euclidean_distances() function, which also works with sparse matrices:

from sklearn.metrics.pairwise import euclidean_distances

# Both arguments must be 2-D, so the centroid is reshaped to a single row
D = euclidean_distances(X_cluster_0.getrow(0),
                        km.cluster_centers_[0].reshape(1, -1))
# Equivalent to the SciPy example above, except that euclidean_distances
# returns a matrix (here 1 x 1) and not a scalar
print(D)

Note that with the scikit-learn approach you can also compute the whole distance matrix at once:

D = euclidean_distances(X_cluster_0, km.cluster_centers_)
print(D)
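
To get only the distance from each document to the centroid of its own cluster, one possible approach is to compute the full distance matrix and then pick, per row, the column of the assigned cluster. This is a sketch on a made-up four-document corpus; the names docs and own_dist and the choice of 2 clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy corpus standing in for document_text_list.
docs = [
    "whale orca pacific coast",
    "orca whale seaworld coast",
    "ebola virus vaccine outbreak",
    "vaccine ebola outbreak health",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse CSR matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distances from every document to every centroid: shape (n_docs, n_clusters).
D = euclidean_distances(X, km.cluster_centers_)

# For each row, select the column of the cluster that document was assigned to.
own_dist = D[np.arange(X.shape[0]), km.labels_]

for i, (c, d) in enumerate(zip(km.labels_, own_dist)):
    print("doc %d -> cluster %d, distance %.4f" % (i, c, d))
```

km.transform(X) returns the same document-to-centroid distance matrix, so it can be used in place of the explicit euclidean_distances() call.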

Update: structure and type of `X_cluster_0`:

X_cluster_0 and X_train_tfidf are both sparse matrices (see the documentation for scipy.sparse.csr.csr_matrix).

An entry in the dump such as

(0, 13535)    0.115880661286
(0, 17447)    0.117608794277
(0, 44849)    0.414829246262
(0, 14574)    0.10214258736
.             .
.             .

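would be interpreted as follows: (0, 13535) refers to document 0 and feature 13535, that is, row 0 and column 13535 of the bag-of-words matrix. The float that follows, 0.115880661286, is the tf-idf score of that feature in the given document.

To find out the exact word, you can try hasher.get_feature_names()[13535] (first check len(hasher.get_feature_names()) to see how many features you have). In scikit-learn 1.0 and later the method is called get_feature_names_out().

If your corpus variable document_text_list is a list of lists, then row 0 of X_train_tfidf corresponds to document_text_list[0]. The rows of X_cluster_0 are renumbered from 0, so row i there corresponds to document_text_list[cluster_0[0][i]].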
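
The reading of the dump described above can be sketched in code by walking one sparse row and pairing each column index with its term. The two-document corpus and the names vec, row and pairs are hypothetical; the hasattr check is only there because the method name depends on the installed scikit-learn version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for document_text_list.
docs = ["whale orca pacific coast", "ebola virus vaccine outbreak"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse CSR matrix

# get_feature_names_out() exists in scikit-learn >= 1.0; older releases
# expose get_feature_names() instead.
if hasattr(vec, "get_feature_names_out"):
    terms = vec.get_feature_names_out()
else:
    terms = vec.get_feature_names()

row = X.getrow(0)  # 1 x n_features sparse row for document 0

# row.indices holds the column numbers, row.data the matching tf-idf scores.
pairs = sorted(zip(row.indices, row.data), key=lambda p: -p[1])

for col, score in pairs:
    print("(0, %d) %s %.4f" % (col, terms[col], score))
```

Only the non-zero entries are stored, which is why the dump lists a handful of (row, column) pairs per document instead of all 67816 columns.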

Regarding python - How to find which documents are in the same cluster with KMeans, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25829358/
