Corpus

python - NLTK words corpus does not contain "okay"?

Does the NLTK words corpus not contain "okay" or "Okay"?
> from nltk.corpus import words
> words.words().__contains__("check")
> True
> words.words().__contains__("okay")
> False
> len(words.words())
> 236736
Any ideas? Best answer: In short:
from nltk.corpus import words
from nltk.corpus import wordnet
manywords = words.words() + wordn…
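The lookups above test membership against a list of 236,736 entries and are case-sensitive. A minimal sketch of a case-insensitive check over a merged vocabulary follows. BASE_WORDS and EXTRA_WORDS are hypothetical stand-ins (for words.words() and a second word source such as WordNet) so that the block runs without any NLTK data; the helper names are also illustrative, not part of NLTK.

```python
# Stand-in vocabularies (assumptions for illustration only). With NLTK data
# installed, BASE_WORDS would come from nltk.corpus.words.words() and
# EXTRA_WORDS from a second source such as WordNet.
BASE_WORDS = ["check", "corpus", "word"]
EXTRA_WORDS = ["okay", "OK", "corpus"]


def build_vocabulary(*word_lists):
    """Merge several word lists into one lowercase set for fast lookups."""
    return {word.lower() for words in word_lists for word in words}


def is_known(word, vocabulary):
    """Case-insensitive membership test against a lowercase vocabulary set."""
    return word.lower() in vocabulary


vocabulary = build_vocabulary(BASE_WORDS, EXTRA_WORDS)

print(is_known("check", vocabulary))                   # present in the base list
print(is_known("Okay", vocabulary))                    # found via the extra list
print(is_known("okay", build_vocabulary(BASE_WORDS)))  # base list alone lacks it
```

Building the set once and reusing it avoids rescanning the full list on every lookup.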

Tags: python, dictionary, nltk, corpus

python - Is it possible to retrain a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in python?

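I am using the pre-trained Google News dataset to get word vectors through the Gensim library in Python:
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
After loading the model, I convert the words of the training review sentences into vectors:
# reading all sentences from training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
# cleaning sentences
x_train = [review_to_word…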

Tags: python, nlp, gensim, word2vec

python - Classification using the movie review corpus in NLTK/Python

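I would like to do some classification along the lines of NLTK Chapter 6. The book seems to skip the step of creating the categories, and I'm not sure what I'm doing wrong. My script is here, with the response below. My problems stem mainly from the first part: creating categories based on directory names. Some other questions here use filenames (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories that I can dump files into.
from nltk.corpus import movie_reviews
reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w…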
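A directory-based category scheme can be sketched with the standard library alone. This is an illustration under stated assumptions, not NLTK's implementation: the file ids and the pattern are hypothetical, and the sketch assumes that the category is whatever the first capture group matches at the start of each file id, which should be checked against the NLTK source before relying on it.

```python
import re
from collections import defaultdict

# Hypothetical file ids, relative to a corpus root, one directory per category.
FILE_IDS = ["pos/review_001.txt", "pos/review_002.txt", "neg/review_001.txt"]

# Hypothetical pattern: the leading directory name is the category.
CAT_PATTERN = r"(\w+)/.*\.txt"


def categories_by_directory(file_ids, pattern):
    """Group file ids by the first capture group of `pattern`."""
    grouped = defaultdict(list)
    for file_id in file_ids:
        match = re.match(pattern, file_id)
        if match is None:
            continue  # skip files that do not fit the naming scheme
        grouped[match.group(1)].append(file_id)
    return dict(grouped)


grouped = categories_by_directory(FILE_IDS, CAT_PATTERN)
print(grouped)

# Read as a regular expression (not a shell glob), "/*.txt" means
# "zero or more slashes, any one character, then txt", so this pattern
# does not match the sample file id.
glob_like = re.match(r"(\w+)/*.txt", "pos/review_001.txt")
print(glob_like)
```

The last two lines show why a glob-style pattern can fail when it is interpreted as a regular expression: the `*` applies to the slash, not to the file name.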

Tags: python, nlp, nltk, sentiment-analysis, corpus

python - Hierarchical Dirichlet Process in Gensim: number of topics is independent of corpus size

I am using the Gensim HDP module on a set of documents.
>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17
Why are the number of topics and the corpus length…

Tags: python, nlp, lda, gensim

python - List the words in the vocabulary by number of occurrences in a text corpus, using Scikit-Learn CountVectorizer

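I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop words. For example: 'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on. Is there a built-in function for this? Best answer: If cv is your CountVectorizer and X is the vectorized corpus, then zip(cv.get_feature_names(), np.asarray(X.sum(axis=0)).ravel()) returns, for each distinct term in the corpus extracted by the CountVectorizer, (te…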

Tags: python, machine-learning, scikit-learn, text-extraction, countvectorizer

python - LDA model generates different topics every time I train on the same corpus

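I am using python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate different topics every time? And how can I stabilize the topic generation? I am using this corpus (http://pastebin.com/WptkKVF0) and this list of stop words (http://pastebin.com/LL7dqLcj), and here is my code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldam…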

Tags: python, nlp, lda, topic-modeling, gensim

python - How to get word frequencies in a corpus using Scikit Learn CountVectorizer?

I am trying to compute a simple word frequency using scikit-learn's CountVectorizer.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
print(cv.vocabulary_)
{u'bird': 0, u'cat': 1, u'dog': 2, u'fish': 3}
I was expecting it to return…

Tags: python, scikit-learn

javascript - Downloadable HTML test corpus

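I am developing a browser plugin for Firefox, and I would like to run some automated tests to make sure it correctly handles a variety of HTML/JavaScript features. Does anyone know of a good downloadable corpus of HTML and/or JavaScript pages that could be used for this kind of testing? Best answer: Dotbot released a torrent containing 14 GB of HTML in 2009. Regarding "javascript - Downloadable HTML test corpus", we found a similar question on Stack Overflow: https://stackoverflow.com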

Tags: javascript, html, testing

html - Creating a corpus from many html files in R

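I would like to create a corpus from a collection of downloaded HTML files and then read them into R for future text mining. Essentially, this is what I want to do: Create a corpus from multiple html files. I tried using DirSource: library(tm) a but it returns "invalid directory parameters". Read the html files in from the corpus all at once. Not sure how to do it. Parse them, convert them to plain text, remove the tags. Many people suggest using XML; however, I did not find a way to process multiple files. They are all for a single file. Thanks very much. Best answer: This should do it. Here I have a folder of HTML files on my computer (a random sample from SO), and I made a corpus out of them, then a…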

Tags: html, r, xml-parsing, text-mining, corpus

python - How to extract phrases from a corpus using gensim

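To preprocess the corpus, I plan to extract common phrases from it. For this I tried using the Phrases model in gensim. I tried the code below, but it does not give me the desired output. My code:
from gensim.models import Phrases
documents = ["the mayor of new york was there", "machine learning can be useful sometimes"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream)
sent = [u'the', u'mayor', u'of', u'new', u'yor…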

Tags: python, nlp, gensim
