BeautifulSoup4

python - XPath vs DOM vs BeautifulSoup vs lxml vs other: what is the fastest way to parse a web page?

I know how to parse a page with Python. My question is which of the parsing techniques is the fastest, and how much slower the others are by comparison. The techniques I know of are XPath, DOM, BeautifulSoup, and Python's find method. Best answer: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

BeautifulSoup python dom xpath html-parsing lxml
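
For the speed question above, the linked post is the place to look for real numbers. The sketch below only shows one way such a comparison could be set up with timeit. The sample page, the three counting functions and the benchmark helper are all assumptions made for illustration, lxml and XPath are left out so that only bs4 and the standard library are needed, and no claim is made here about which approach wins.

```python
import timeit
from html.parser import HTMLParser

from bs4 import BeautifulSoup

# Hypothetical sample page: 200 paragraphs, each holding one link.
SAMPLE = "<html><body>" + '<p>row <a href="/x">link</a></p>' * 200 + "</body></html>"


class LinkCounter(HTMLParser):
    """Counts <a> start tags with the standard-library event parser."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


def count_with_stdlib(html):
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.links


def count_with_bs4(html):
    # Builds the whole tree first, then searches it.
    return len(BeautifulSoup(html, "html.parser").find_all("a"))


def count_with_str_find(html):
    # Plain string search: no parsing at all, so unusual markup will fool it.
    count, pos = 0, html.find("<a ")
    while pos != -1:
        count += 1
        pos = html.find("<a ", pos + 1)
    return count


def benchmark(repeat=5):
    """Return {name: seconds} for `repeat` runs of each approach on SAMPLE."""
    candidates = {
        "stdlib": count_with_stdlib,
        "bs4": count_with_bs4,
        "str.find": count_with_str_find,
    }
    return {
        name: timeit.timeit(lambda func=func: func(SAMPLE), number=repeat)
        for name, func in candidates.items()
    }


if __name__ == "__main__":
    for name, seconds in sorted(benchmark().items(), key=lambda item: item[1]):
        print(f"{name:10s} {seconds:.4f}s")
```

Timings from a synthetic page like this say little about real sites, so it is worth rerunning the harness on the pages that actually need parsing.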

python - How can I get all the text between two specified tags using BeautifulSoup?

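html = """...all(iterable)¶..."""

I want to get all the text between the opening big tag and the first occurrence of an a tag. For this example, that means I should get (iterable) as a string. Best answer: an iterative approach.

    from BeautifulSoup import BeautifulSoup as bs
    from itertools import takewhile, chain

    def get_text(html, from_tag, until_tag):
        soup = bs(html)
        for big in soup(from_tag):
            until = big.findNe...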

BeautifulSoup python big html-parsing
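
The answer to the two-tags question above is cut off and targets the old BeautifulSoup 3 import. What follows is a minimal sketch of the same iterative idea in bs4, not the original answer's code. The function name text_between and the sample markup are assumptions, since the real HTML is elided in the listing.

```python
from itertools import takewhile

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup, loosely modelled on the elided "all(iterable)¶" example.
SAMPLE_HTML = (
    '<dt id="all"><tt>all</tt><big>(</big><em>iterable</em>'
    '<big>)</big><a href="#all">¶</a></dt>'
)


def text_between(html, from_tag, until_tag):
    """Return the text from the first `from_tag` up to the next `until_tag`."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(from_tag)
    if start is None:
        return ""
    stop = start.find_next(until_tag)
    # next_elements walks the document in order, starting inside `start`.
    # Compare by identity, because Tag equality compares contents.
    nodes = takewhile(lambda node: node is not stop, start.next_elements)
    return "".join(node for node in nodes if isinstance(node, NavigableString))


if __name__ == "__main__":
    print(text_between(SAMPLE_HTML, "big", "a"))  # (iterable)
```

If no until_tag follows the start tag, this version collects text to the end of the document.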

python - Extracting the text before the first child tag with BeautifulSoup

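From this HTML source: Category:Personal

I want to extract the text Category:. Here is my attempt using Python/BeautifulSoup (output shown as comments, after the #):

    parsed = BeautifulSoup(sample_html)
    parsed_div = parsed.findAll('div')[0]
    parsed_div.firstText()    # Personal
    parsed_div.first()        # Personal
    parsed_div.findAll()[0]   # Personal

I expected the "text node" to be available as the first child. Any suggestions on how to solve this? Best answer...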

BeautifulSoup python
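
The accepted answer to the question above is not visible in the listing. The sketch below shows one way to reach the leading text node in bs4, assuming the markup looked roughly like the hypothetical SAMPLE_HTML here. The helper name leading_text is also an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical reconstruction; the real tags were stripped from the listing.
SAMPLE_HTML = '<div>Category: <a href="/category/personal">Personal</a></div>'


def leading_text(tag):
    """Return the text that precedes the first child tag ('' if there is none)."""
    parts = []
    for child in tag.children:
        if not isinstance(child, NavigableString):
            break  # first real tag reached, stop collecting
        parts.append(child)
    return "".join(parts).strip()


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    print(leading_text(soup.find("div")))  # Category:
```

Text nodes are ordinary children in bs4, so iterating over .children reaches them. Tag-only searches such as find_all() with no arguments skip them, which is consistent with the attempts in the question returning Personal.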

python - Finding the nearest link with BeautifulSoup (Python)

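I am working on a small project in which I extract appearances of political leaders in newspapers. Sometimes a politician is mentioned, but neither the parent nor any child element has a link (I guess because of semantically poor markup). So I want to write a function that finds the nearest link and extracts it. In the example below, the search string is Rasmussen and the link I want is /307046.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    import re

    tekst = '''Claus Hjort spiller med mærkede kort Af: Dennis Kristensen Claus Hjort Frederiksens argumenter for at afvise tr...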

python BeautifulSoup link element lxml
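
No answer is shown for the nearest-link question above, so the sketch below is only one possible reading of "nearest": the first link that follows the matching text in document order, falling back to the closest preceding link. The markup, the link text and the function name nearest_link are assumptions. Only the search string Rasmussen and the target /307046 come from the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the mention and the link are siblings, not parent/child.
SAMPLE_HTML = """
<div>
  <p>Comments from Rasmussen were reported yesterday.</p>
  <a href="/307046">Read the article</a>
</div>
"""


def nearest_link(soup, needle):
    """Return the href of the link closest to the first text containing `needle`."""
    text_node = soup.find(string=re.compile(re.escape(needle)))
    if text_node is None:
        return None
    # Prefer the next link in document order, then fall back to the previous one.
    link = text_node.find_next("a", href=True)
    if link is None:
        link = text_node.find_previous("a", href=True)
    return link["href"] if link is not None else None


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    print(nearest_link(soup, "Rasmussen"))  # /307046
```

A stricter notion of distance, such as the fewest tree hops, would need a different traversal. This version simply follows document order.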

python - Selecting the second child with BeautifulSoup

Suppose I have the following HTML: this is some text ...and this is some other text

How can I retrieve the text in the second paragraph using BeautifulSoup? Best answer: you can use a CSS selector to do this:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup("""....this is some text.......and this is some other text....""", "html.parser")
    >>> soup.select('div>p')[1].get_text(strip=Tru...

BeautifulSoup python web-scraping
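
The CSS-selector answer above lost its markup in the listing. Here is a runnable version of the same call, assuming the two paragraphs sat directly inside a div, as the selector div>p suggests. The exact HTML is a reconstruction.

```python
from bs4 import BeautifulSoup

# Reconstructed markup; the tags were stripped from the listing.
HTML = """
<div>
    <p>this is some text</p>
    <p>...and this is some other text</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() returns every <p> that is a direct child of a <div>, in document order.
paragraphs = soup.select("div > p")
second_text = paragraphs[1].get_text(strip=True)

print(second_text)  # ...and this is some other text
```

Indexing with [1] raises IndexError when fewer than two paragraphs match, so a length check may be worth adding for untrusted pages.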

python - BeautifulSoup does not extract all of the HTML (it silently drops most of the page)

I am trying to use BeautifulSoup to extract content from a website (http://brooklynexposed.com/events/). As an example of the problem, I can run the following code:

    import urllib
    import bs4 as BeautifulSoup

    url = 'http://brooklynexposed.com/events/'
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup.BeautifulSoup(html)
    print soup.prettify().encode('utf-8')

The output appears to truncate the HTML, as follows: 9:00pm - 11:00pm Come...

html BeautifulSoup python urllib

python - Find and replace by text in HTML with BeautifulSoup

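I am trying to use Python and BeautifulSoup to mark up an HTML file (literally, to wrap strings in "mark" tags). The problem is roughly as follows. Say I have my original HTML document:

    test = "oh hey here is some SILLY text"

I want to run a case-insensitive search for a string in that document (ignoring the HTML) and wrap it in "mark" tags. So suppose I want to find "here is some silly text" in the HTML (ignoring the bold tags). I want to take the matching HTML and wrap it in "mark" tags. For example, if I search test for "here is some silly text", the desired output is:

    "oh hey here is some SILLY text"

Any ideas?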

BeautifulSoup python regex html-parsing lxml

python - BeautifulSoup and regular expressions

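I just ran the following code in Python to pull all of certain emails out of an IMAP folder. The extraction part works fine and the BeautifulSoup part works fine, but the output contains a lot of '\r' and '\n'. I tried to remove them with the regex sub function, but it does not work, and it does not even give an error message. Any idea what is wrong? I have attached the code. Please note that this is not the complete code, but everything above the part I posted works fine. It still prints the output, and it is "prettified", but the \r and \n are still there. I have tried find_all(), but that does not work either.

    mail.list()  # Lists all labels in GMail
    mail.select('INBOX/Personal')  # Connect...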

BeautifulSoup python soup regex
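
The code that calls the regex sub function is cut off above, so the cause cannot be confirmed from the listing. One common pitfall that produces no error is discarding the return value of re.sub: Python strings are immutable, so the cleaned text exists only in the returned value. The sketch below shows the pattern on a made-up message body. RAW_BODY and clean_text are assumptions.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical e-mail body with Windows-style line endings.
RAW_BODY = (
    "<html><body><p>Hello,\r\n</p>\r\n"
    "<p>this is the\r\nmessage body.</p></body></html>"
)


def clean_text(html):
    """Extract the visible text and collapse every whitespace run to one space."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    # re.sub returns a NEW string; the result must be kept, not thrown away.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text(RAW_BODY))  # Hello, this is the message body.
```

Note that prettify() adds its own newlines for indentation, so cleaning has to happen on the extracted text rather than on the prettified markup.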

python - Getting the document DOCTYPE with BeautifulSoup

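I have just started tinkering with scrapy together with BeautifulSoup, and I wonder whether I am missing something very obvious, but I cannot figure out how to get the doctype of the returned HTML document from the resulting soup object. Given the following HTML: HTML5 Demos and Examples This is paragraph one This is paragraph two.

Can anyone tell me whether there is a way to extract the declared doctype from it using BeautifulSoup? Best answer: BeautifulSoup 4 has a class for DOCTYPE declarations, so you can use it to extract all the declarations at the top level (though you will no doubt expect one or none!)

    def doct...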

BeautifulSoup DOCTYPE python parsing scrapy
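
The answer's function is truncated after "def doct", so the sketch below is a guess at its shape rather than the original code. It filters the top-level nodes of the soup by the Doctype class from bs4.element. The function name doctypes and the sample document are assumptions.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype

# Hypothetical document built from the text fragments visible in the question.
SAMPLE_HTML = """<!DOCTYPE html>
<html><head><title>HTML5 Demos and Examples</title></head>
<body><p>This is paragraph one</p><p>This is paragraph two.</p></body></html>"""


def doctypes(soup):
    """Return the text of every DOCTYPE declaration found at the top level."""
    return [str(item) for item in soup.contents if isinstance(item, Doctype)]


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    print(doctypes(soup))  # ['html']
```

A document without a declaration yields an empty list, which matches the "one or none" remark in the answer.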

python - BeautifulSoup parser appends a semicolon to a bare ampersand, modifying the URL?

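I am trying to parse some websites in Python that contain links to other websites, but as plain text rather than in "a" tags. With BeautifulSoup I get the wrong answer. Consider this code:

    import BeautifulSoup

    html = """Test html example.com/a.php?b=2&c=15"""
    parsed = BeautifulSoup.BeautifulSoup(html)
    print parsed

When I run the code above, I get the following output: Test html example.com/a.php?b=2&c;=15

Note the link in the "div" and the b=2&c;=15 part. It differs from the original HTML. Why does Beaut...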

BeautifulSoup html python
