nltk

Python re.split() vs. nltk word_tokenize and sent_tokenize

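I was going through this question. I was just wondering whether NLTK would be faster than regular expressions at word/sentence tokenization. Best answer: The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank. Note that str.split() does not produce tokens in the linguistic sense, for example:

    >>> sent = "This is a foo, bar sentence."
    >>> sent.split()
    ['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
    >>> from nltk import w...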
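
To make the contrast in the excerpt above concrete, here is a minimal sketch that uses only the standard library. The regular expression is an illustrative assumption and is not NLTK's Treebank tokenizer: it only separates punctuation from words, which is the one thing the excerpt points out that str.split() does not do.

```python
import re

sent = "This is a foo, bar sentence."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sent.split()

# Illustrative pattern (an assumption, not NLTK's rules): a run of word
# characters, or any single character that is neither word nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sent)

print(whitespace_tokens)
print(regex_tokens)
```

A real tokenizer also has to handle contractions, abbreviations, and quotes, so this pattern is only a starting point for comparing speed against a rule-based tokenizer.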

python - Detecting English verb tense using NLTK

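I am looking for a way, given an English text, to count the verb phrases in it by past, present, and future tense. Right now I am using NLTK to do POS (part-of-speech) tagging and then counting, say, "VBD" tags to get the past tense. That is not accurate enough, though, so I think I need to go further and use chunking, then analyze the VP chunks for specific tense patterns. Is there anything that already does this? Any further reading that might help? The NLTK book focuses mostly on NP chunks, and I can find very little information on VP chunks. Best answer: The exact answer depends on which chunker you intend to use, but list comprehensions will take you a long way. With a non-existent chunker, this gets you...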

tense python nlp nltk
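
The tag-counting approach described in the question above can be sketched as follows. The (word, tag) pairs are written by hand as a stand-in for POS tagger output, and the rules (VBD for past, VBP/VBZ for present, will/shall plus a base-form verb for future) are simplifying assumptions rather than a full verb-phrase analysis.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for POS tagger output;
# the tags follow the Penn Treebank tag set.
tagged = [
    ("She", "PRP"), ("walked", "VBD"), ("home", "NN"), (".", "."),
    ("He", "PRP"), ("walks", "VBZ"), ("daily", "RB"), (".", "."),
    ("They", "PRP"), ("will", "MD"), ("walk", "VB"), ("tomorrow", "NN"), (".", "."),
]


def count_tenses(tagged_tokens):
    """Count verb tenses with a simple tag heuristic (not a full VP analysis)."""
    counts = Counter()
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag == "VBD":
            counts["past"] += 1
        elif tag in ("VBP", "VBZ"):
            counts["present"] += 1
        elif tag == "MD" and word.lower() in ("will", "shall"):
            # Treat will/shall directly followed by a base-form verb as future.
            if i + 1 < len(tagged_tokens) and tagged_tokens[i + 1][1] == "VB":
                counts["future"] += 1
    return counts


tense_counts = count_tenses(tagged)
print(tense_counts)
```

Constructions such as "has walked" or "is going to walk" are not covered by these rules, which is the gap the question hopes to close with VP chunking.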

python - How to save a Python NLTK alignment model for later use?

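In Python, I am using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be a time-consuming process, especially over large corpora. It would be nice to run the alignments in batch one day and use them later.

    from nltk import IBMModel1 as ibm
    biverses = [list of AlignedSent objects]
    model = ibm(biverses, 20)
    with open(path + "eng-taq_model.txt", 'w') as f:
        f.write(model.train(biverses, 20))  // makes empty file

After creating the model, how can I (1) save it to disk and (2)...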

python nltk io nlp machine-translation
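
One general way to keep learned parameters for later is to serialise them to a file and read them back. The sketch below assumes the trained probabilities can be obtained as a nested mapping; the table here is hypothetical stand-in data, not output from a real model, and save_table and load_table are illustrative helper names. Note that JSON stores all keys as strings.

```python
import json
import os
import tempfile
from collections import defaultdict

# Stand-in for a trained model's nested probability table
# (hypothetical data; a real table would come from the trained model).
table = defaultdict(lambda: defaultdict(float))
table["haus"]["house"] = 0.9
table["haus"]["home"] = 0.1


def save_table(nested, path):
    """Write a nested mapping to disk as JSON, converting to plain dicts first."""
    plain = {src: dict(targets) for src, targets in nested.items()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(plain, f, ensure_ascii=False)


def load_table(path):
    """Read the nested mapping back as plain dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "alignment_table.json")
save_table(table, path)
restored = load_table(path)
print(restored["haus"]["house"])
```

Converting to plain dicts first sidesteps the fact that defaultdict objects built with lambda factories cannot be pickled by the standard pickle module.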

python - NLTK chunking and walking the resulting tree

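I am using the NLTK RegexpParser to extract noun groups and verb groups from tagged tokens. How do I walk the resulting tree to find only the chunks that belong to the NP or V groups?

    from nltk.chunk import RegexpParser
    grammar = '''NP: {?**} V: {}'''
    chunker = RegexpParser(grammar)
    tokens = []  ## Some tokens from my POS tagger
    chunked = chunker.parse(tokens)
    print chunked
    # How do I walk the tree?
    # for chunk in chunked:
    #     if chunk.??? == 'NP':
    #         print chunk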

python chunked RegexpParser text-parsing nltk chunking

python - Classification using the movie review corpus in NLTK/Python

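I am looking to do some classification along the lines of NLTK Chapter 6. The book seems to skip the step of creating the categories, and I am not sure what I am doing wrong. My script is here, with the response below. My problem mainly stems from the first part: category creation based on directory names. Some other questions here have used file names (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories that I can dump files into.

    from nltk.corpus import movie_reviews
    reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w...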

python features nlp nltk sentiment-analysis corpus

python - Changing the nltk.download() path directory from the default ~/nltk_data

I was trying to download/update the Python nltk packages on a computing server, and it returned this [Errno 122] Disk quota exceeded: error. Specifically:

    [nltk_data] Downloading package stopwords to /home/sh2264/nltk_data...
    [nltk_data] Error downloading u'stopwords' from
    [nltk_data] : [Errno 122]
    [nltk_data] Disk quota exceeded:
    [nltk_data] u'/home/sh2264/nltk_data/corpora/stopwords.zip
    False

How do I change the nl...

download nltk nltk_data python python-2.7 path default

python - Generating bigrams using NLTK

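I am trying to generate a list of bigrams for a given sentence. For example, if I type "To be or not to be", I want the program to generate: to be, be or, or not, not to, to be. I tried the following code, but it only printed a generator object. This is my code:

    import nltk
    bigrm = nltk.bigrams(text)
    print(bigrm)

So how can I get what I want? I want a list of the word combinations above (to be, be or, or not, not to, to be). Best answer: nltk.bigrams() returns an iterator (specifically, a generator) of bigrams. If you want a list, pass the iterator to list(). It also expects a sequence of items to generate bigrams from, so you have to...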

python nltk n-gram
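
The two points in the bigram answer above (materialise the iterator as a list, and pass in a sequence of words rather than a raw string) can be sketched without NLTK. The bigrams function below is a pure-Python stand-in written for illustration, not NLTK's implementation.

```python
def bigrams(tokens):
    """Return adjacent token pairs as a list (illustrative pure-Python stand-in)."""
    return list(zip(tokens, tokens[1:]))


text = "to be or not to be"
tokens = text.split()  # split first: a raw string would be paired character by character
pairs = bigrams(tokens)
print([" ".join(pair) for pair in pairs])
```

Here zip() pairs each token with its successor, so a sentence of n words yields n - 1 bigrams.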

python - NLTK: Setting a proxy server

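I am trying to learn NLTK, the Natural Language Toolkit written in Python, and I want to install a sample data set to run some examples. My web connection uses a proxy server, and I am trying to specify the proxy address as follows:

    >>> nltk.set_proxy('http://proxy.example.com:3128'('USERNAME','PASSWORD'))
    >>> nltk.download()

But I get an error:

    Traceback (most recent call last):
      File "", line 1, in
    TypeError: 'str' object is not callable

Before calling nltk.download(), I decided to set up a ProxyBas...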

python NLTK proxy-server

python - Extracting all nouns from a text file using nltk

Is there a more efficient way to do this? My code reads a text file and extracts all the nouns.

    import nltk
    File = open(fileName)  # open file
    lines = File.read()  # read all lines
    sentences = nltk.sent_tokenize(lines)  # tokenize sentences
    nouns = []  # empty array to hold all nouns
    for sentence in sentences:
        for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
            if (pos == 'NN' or pos == 'NNP' or...

python nltk
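
The chain of tag comparisons in the noun-extraction question above can be shortened, because the Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with "NN". A minimal sketch, with hand-written (word, tag) pairs standing in for tagger output:

```python
# Hand-written (word, tag) pairs standing in for POS tagger output.
tagged = [
    ("The", "DT"), ("cat", "NN"), ("chased", "VBD"),
    ("mice", "NNS"), ("in", "IN"), ("Paris", "NNP"), (".", "."),
]

# One prefix test covers NN, NNS, NNP, and NNPS.
nouns = [word for word, pos in tagged if pos.startswith("NN")]
print(nouns)
```

This only tidies the filter; whether it makes the whole script faster depends on the tagging step, which this sketch does not measure.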

python - Efficient term-document matrix with NLTK

I am trying to create a term-document matrix with NLTK and pandas. I wrote the following function:

    def fnDTM_Corpus(xCorpus):
        import pandas as pd
        '''to create a Term Document Matrix from a NLTK Corpus'''
        fd_list = []
        for x in range(0, len(xCorpus.fileids())):
            fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
        DTM = pd.DataFrame(fd_list, index=xCorpus.fileids())
        DTM.filln...

python NLTK pandas term-document-matrix
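
The idea behind the function in the question above (one frequency count per document, aligned on a shared vocabulary, with zeros for absent terms) can be sketched with the standard library alone. The toy corpus, its file ids, and the document_term_matrix helper are hypothetical stand-ins for an NLTK corpus reader and the pandas DataFrame step.

```python
from collections import Counter

# Toy corpus (hypothetical file ids and tokens) standing in for a corpus reader.
docs = {
    "a.txt": ["to", "be", "or", "not", "to", "be"],
    "b.txt": ["to", "do", "or", "not"],
}


def document_term_matrix(documents):
    """Return (vocabulary, rows): one count row per document, 0 for absent terms."""
    counts = {doc_id: Counter(tokens) for doc_id, tokens in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {doc_id: [c[term] for term in vocabulary] for doc_id, c in counts.items()}
    return vocabulary, rows


vocabulary, rows = document_term_matrix(docs)
print(vocabulary)
for doc_id, row in rows.items():
    print(doc_id, row)
```

A Counter returns 0 for a missing key, which plays the role that filling missing values with zero plays in a DataFrame-based version.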
