python - scikit-learn: Finding the features that contribute to each KMeans cluster

coder 2023-08-13 Original post

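Say you have 10 features that you use to create 3 clusters. Is there a way to see how much each feature contributes to each cluster?

What I want to be able to say is that for cluster k1, features 1, 4, 6 were the primary features, whereas for cluster k2 the primary features were 2, 5, 7.

This is the basic setup I am using: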

from sklearn.cluster import KMeans

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_

Best answer

You can use

Principal Component Analysis (PCA)

PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).

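Some key points:

The eigenvalues reflect the portion of variance explained by the corresponding component. Say we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresponding eigenvectors. The second value belongs to the first principal component, since it explains 50% of the total variance (4 out of 8), and the last value belongs to the second principal component, explaining 25% of the total variance.
The eigenvectors are the components' linear combinations. They give the weights of the features, so you can see which features have a high or low impact.
Use PCA based on the correlation matrix instead of the empirical covariance matrix if the eigenvalues differ strongly in magnitude (typically because the features are measured on very different scales).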
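
As a quick check of the arithmetic in the first point: the share of variance belonging to a component is its eigenvalue divided by the sum of all eigenvalues. A minimal sketch using the example eigenvalues from above (the variable names are only illustrative):

```python
import numpy as np

# Example eigenvalues from the text, in the order a decomposition might return them
e_values = np.array([1.0, 4.0, 1.0, 2.0])

# Share of the total variance explained by each eigenvalue
explained = e_values / e_values.sum()

# Indices sorted by decreasing eigenvalue: the first entry is the first principal component
order = np.argsort(e_values)[::-1]

for rank, i in enumerate(order, start=1):
    print("PC {}: eigenvalue {} -> {:.1f} % of the variance".format(
        rank, e_values[i], explained[i] * 100))
```

The two eigenvalues equal to 1 are tied, so their relative order among the last components is arbitrary.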

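Sample approach

Do PCA on the entire dataset (that is what the function below does)
- take the matrix with observations and features
- center it on its average (the average of each feature's values over all observations)
- compute the empirical covariance matrix (e.g. np.cov) or the correlation matrix (see above)
- perform the decomposition
- sort eigenvalues and eigenvectors by eigenvalue to get the components with the highest impact
- apply the components to the original data
Examine the clusters in the transformed dataset. By checking their location on each component you can derive which features have a high or low impact on the distribution/variance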
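
The last step, examining where each cluster sits on the components, can be sketched as a per-cluster average of the component scores. The scores and labels below are made-up toy values, and `cluster_score_means` is a hypothetical helper, not part of NumPy or scikit-learn:

```python
import numpy as np

def cluster_score_means(scores, labels):
    """Return the cluster ids and the mean component score of each cluster."""
    clusters = np.unique(labels)
    means = np.array([scores[labels == c].mean(axis=0) for c in clusters])
    return clusters, means

# Toy component scores: 6 observations (rows) on 2 components (columns)
scores = np.array([[ 2.0,  0.1],
                   [ 1.8, -0.2],
                   [-1.9,  0.3],
                   [-2.1, -0.1],
                   [ 0.1,  1.0],
                   [ 0.1, -1.1]])
# Toy cluster labels, e.g. as returned by a k-means run
labels = np.array([0, 0, 1, 1, 2, 2])

clusters, means = cluster_score_means(scores, labels)
for c, m in zip(clusters, means):
    print("Cluster {}: mean scores {}".format(c, np.round(m, 2)))
```

A cluster whose mean score on a component is far from zero is separated from the rest along that component, so the features with large weights in that component's eigenvector are the ones worth inspecting.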

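Sample function

You need to import numpy as np and scipy as sp. The function uses sp.linalg.eigh for the decomposition. You might also want to check the scikit-learn decomposition module (sklearn.decomposition).

PCA is performed on a data matrix with observations (objects) in rows and features in columns.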

def dim_red_pca(X, d=0, corr=False):
    r"""
    Performs principal component analysis.

    Parameters
    ----------
    X : array, (n, p)
        Original observations (n observations, p features)

    d : int
        Number of principal components (default is ``0`` => all components).

    corr : bool
        If true, the PCA is performed based on the correlation matrix.

    Notes
    -----
    All eigenvalues and eigenvectors are always returned,
    independently of the desired number of components ``d``.

    Returns
    -------
    Xred : array, (n, p or d)
        Reduced data matrix

    e_values : array, (p)
        The eigenvalues, sorted in descending order.

    e_vectors : array, (p, p)
        The eigenvectors (one per column), sorted to match the eigenvalues.

    """
    # Center each feature on its mean
    X_ = X - X.mean(0)
    # Compute correlation / covariance matrix
    if corr:
        # Correlation-based PCA works on standardized features
        X_ = X_ / X.std(0, ddof=1)
        CO = np.corrcoef(X_.T)
    else:
        CO = np.cov(X_.T)
    # Compute eigenvalues and eigenvectors (eigh, because CO is symmetric)
    e_values, e_vectors = sp.linalg.eigh(CO)

    # Sort the eigenvalues and the eigenvectors descending
    idx = np.argsort(e_values)[::-1]
    e_vectors = e_vectors[:, idx]
    e_values = e_values[idx]
    # Keep the first d eigenvectors (all of them if d == 0)
    d_e_vecs = e_vectors
    if d > 0:
        d_e_vecs = e_vectors[:, :d]
    # Project the centered data onto the selected principal components
    # (component scores, one column per component)
    Xred = np.dot(X_, d_e_vecs)
    return Xred, e_values, e_vectors
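
For comparison, here is a minimal sketch of the same decomposition with the scikit-learn decomposition module mentioned above. The data is made up, and the sketch assumes covariance-based PCA; for the correlation-based variant the features would have to be standardized first, for example with sklearn.preprocessing.StandardScaler.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 observations, 3 features (made-up values for illustration)
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 0.7],
              [3.0, 3.5, 0.2],
              [4.0, 3.0, 0.9],
              [5.0, 5.5, 0.4],
              [6.0, 5.0, 0.6]])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # centers X internally, then projects onto the components

print(scores.shape)                   # (6, 2)
print(pca.components_.shape)          # (2, 3): one row per component, one column per feature
print(pca.explained_variance_ratio_)  # share of variance per kept component
```

Note that `components_` stores one component per row, whereas the function above returns the eigenvectors one per column.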

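Sample usage

Here is a sample script that makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run, because the starting clusters are initialized randomly.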

import numpy as np
import scipy as sp
import scipy.linalg  # make sure sp.linalg is loaded (needed on older SciPy versions)
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt

SN = np.array([ [1.325, 1.000, 1.825, 1.750],
                [2.000, 1.250, 2.675, 1.750],
                [3.000, 3.250, 3.000, 2.750],
                [1.075, 2.000, 1.675, 1.000],
                [3.425, 2.000, 3.250, 2.750],
                [1.900, 2.000, 2.400, 2.750],
                [3.325, 2.500, 3.000, 2.000],
                [3.000, 2.750, 3.075, 2.250],
                [2.075, 1.250, 2.000, 2.250],
                [2.500, 3.250, 3.075, 2.250],
                [1.675, 2.500, 2.675, 1.250],
                [2.075, 1.750, 1.900, 1.500],
                [1.750, 2.000, 1.150, 1.250],
                [2.500, 2.250, 2.425, 2.500],
                [1.675, 2.750, 2.000, 1.250],
                [3.675, 3.000, 3.325, 2.500],
                [1.250, 1.500, 1.150, 1.000]], dtype=float)
    
clust,labels_ = kmeans2(SN,3)    # cluster with 3 random initial clusters
# PCA on orig. dataset 
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues 
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)   

xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()    
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
    scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1], 
                   color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principal Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
# (eigenvectors are stored column-wise, so component k is evecs[:, k])
fig, ax = plt.subplots(2,1)
ax[0].bar([1, 2, 3, 4],evecs[:, 0])
plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
ax[1].bar([1, 2, 3, 4],evecs[:, 1])
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
plt.show()

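Output

The eigenvectors show the weight of each feature for the component

Short interpretation

Let's have a look at cluster zero, the red one. We are mostly interested in the first component, as it explains about 3/4 of the distribution. The red cluster lies in the upper area of the first component: all of its observations yield rather high values. What does that mean? Looking at the linear combination of the first component, we see at first sight that the second feature is rather unimportant (for this component). The first and fourth features have the highest weights, and the third one has a negative weight. This means that, since all red points have a rather high score on the first PC, these points will have high values in the first and last features and, at the same time, low values in the third feature.

For the second feature we can have a look at the second PC. Note, however, that its overall impact is far smaller, since this component explains only roughly 16% of the variance, compared to about 74% for the first PC.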
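
The reading above can also be done numerically: multiplying a cluster's mean component scores by the corresponding eigenvectors gives an approximate per-feature deviation of that cluster from the overall mean. A minimal sketch with made-up numbers; `feature_deviation` is a hypothetical helper, and the eigenvectors are assumed to be stored one per column, as sp.linalg.eigh returns them:

```python
import numpy as np

def feature_deviation(mean_scores, e_vectors):
    """Approximate per-feature deviation of a cluster from the overall mean.

    mean_scores : (d,) mean score of the cluster on the first d components
    e_vectors   : (p, p) eigenvectors, one per column, sorted by eigenvalue
    """
    d = len(mean_scores)
    return e_vectors[:, :d] @ mean_scores

# Made-up orthonormal eigenvectors for 3 features (columns = components)
e_vectors = np.array([[0.6,  0.8, 0.0],
                      [0.0,  0.0, 1.0],
                      [0.8, -0.6, 0.0]])
# Made-up mean scores of one cluster on the first two components
mean_scores = np.array([2.0, 0.5])

dev = feature_deviation(mean_scores, e_vectors)
# Features ranked by how strongly the cluster deviates in them (0-based indices)
ranking = np.argsort(-np.abs(dev))
print(dev, ranking)
```

Features with a large positive or negative entry are the ones in which the cluster differs most from the average observation. The result is only an approximation, because the components beyond the first d are ignored.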

Regarding python - scikit-learn: Finding the features that contribute to each KMeans cluster, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27491197/
