javascript - Executing a web page's scripts with Python

coder 2024-06-13 Original post

I am trying to scrape a page that is full of JavaScript. The URL is:

http://www.nasdaqomxnordic.com/index/index_info?Instrument=DK0016268840

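I used the following code to fetch the data. This code is supposed to process the JavaScript and return the complete HTML, but it does not. It may be a timing problem; if so, I am not sure where in the program to add a delay so that the full HTML has time to load.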

import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

def getHtml(str_url):
    r_html = Render(str_url)
    html = r_html.frame.toHtml()
    return html

str_url = 'http://www.nasdaqomxnordic.com/index/index_info?Instrument=DK0016268840'
str_html = getHtml(str_url)
print(str_html)

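This gives me the same HTML you would get by requesting the page source from a web browser. There is of course much more on the page, because all the tables are filled in by JavaScript functions. Using Firebug, I found that the ID of the table I am after is "sharesInIndexTable". What I really want to scrape are the links under each company name, but being able to access the whole table and parse it with BeautifulSoup would be even better. It should be possible to find the word "Carlsberg" in this table (a potential test of whether the AJAX content has fully loaded). I then looked for something that could parse the DOM, and tried this: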

import sys
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
    def printit(self):
        data = self.webView.page().mainFrame().findFirstElement('id="sharesInIndexTable"')
        print(data)

    def main(self):
        self.webView = QtWebKit.QWebView()
        self.webView.load(QtCore.QUrl("http://www.nasdaqomxnordic.com/index/index_info?Instrument=DK0016268840"))
        QtCore.QObject.connect(self.webView,QtCore.SIGNAL("loadFinished(bool)"),self.printit)

app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())

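All I get from this is a PyQt4.QtWebKit.QWebElement object at 0x03294830 (your result may vary). Every attempt to convert this into a readable format failed. This code also seems to run twice. Then I tried this (adapted somewhat to my needs):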

#!/usr/bin/python

# These lines will get us the modules we need.
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage, QWebView

class Scrape(QApplication):
  def __init__(self):
    # only work with ["test"] as it normally takes an array of args
    super(Scrape, self).__init__(["test"])
    # Create a QWebView instance and store it.
    self.webView = QWebView()
    # Connect our searchForm method to the loadFinished signal of this new
    # QWebView.
    self.webView.loadFinished.connect(self.searchForm)

  def load(self, url):
    # In the __init__ we stored a QWebView instance into self.webView so
    # we can load a url into it. It needs a QUrl instance though.
    self.webView.load(QUrl(url))

  def searchForm(self):
    # We landed here because the load is finished. Now, load the root document
    # element. It'll be a QWebElement instance. QWebElement is a Qt 4.6
    # addition and it allows easier DOM interaction.
    documentElement = self.webView.page().currentFrame().documentElement()
    # Let's find the table element.
    print("Begin search")
    inputSearch = documentElement.findFirst('id="sharesInIndexTable"')
    # Disconnect ourselves from the signal.
    self.webView.loadFinished.disconnect(self.searchForm)
    print("End search")
    # And connect the next function.
    self.webView.loadFinished.connect(self.searchResults)

  def searchResults(self):
    # As seen above, first grab the root document element and then find all
    # td elements.
    print("Begin results")
    results = self.webView.page().currentFrame().documentElement().findAll('td')

    # Change the resulting QWebElementCollection into a list so we can easily
    # iterate over it.
    for e in results.toList():
      # Just print the results (QWebElement has toOuterXml(), not tohtml()).
      print(e.toOuterXml())
    # We are inside a Qt application and need to terminate that properly.
    print("End results")
    self.exit()

# Instantiate our class.
my_scrape = Scrape()
# Load the index page.
my_scrape.load('http://www.nasdaqomxnordic.com/index/index_info?Instrument=DK0016268840')
# Start the Qt event loop.
my_scrape.exec_()

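I added the print() statements to check whether the program runs all the way through. It produces no results at all (apart from the print statements).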

Inspecting the page source, I can find the script that populates the table:

var sharesInIndex = { 
load: function () {
var index = webCore.getInstrument();
var nLabel = 'nm';
var hiddenAttributes = ",lists,tp,hlp,isin,note,";
var xslt = "inst_table.xsl";
var options =  ",noflag,sectoridicon,";
var xpath = "//index//instruments";
// Check if swedish ränteindex or Icelandic ränteindex.
if ( index.indexOf('OMFSE') >= 0 || webCore.getInstrument().indexOf('IS00000') >= 0 ) {
    hiddenAttributes += ",to,sectid,";
    nLabel = 'fnm';
}

// Check if weights index present (typeof)
var shbindex = ",SE0002834820,SE0002834838,SE0002834846,SE0002977397,";
if ( shbindex.indexOf(index) >= 0 ) {
    xslt = "inst_table_windex.xsl";
    options += "windex,";
    xpath = "//index";
}

var query = webCore.createQuery(
    Utils.Constants.marketAction.getIndexInstrument, {
    inst__a: "0,1,2,5,37,4,20,21,23,24,33,34,97,129,98,10", /* 87,*/
    Instrument: index,
    XPath: xpath,
    ext_xslt: xslt,
    ext_xslt_lang: currentLanguage,
    ext_xslt_tableId: "sharesInIndexTable",
    ext_xslt_hiddenattrs: hiddenAttributes,
    ext_xslt_notlabel: nLabel,
    ext_xslt_options: options
  });

  $("#sharesInIndexOutput").empty().loading("/static/nordic/css/img/loading.gif");
  $("#sharesInIndexOutput").load( webCore.getProxyURL('prod'), {xmlquery: query},
    function( responseText, textStatus, XMLHttpRequest) {
      $("#sharesInIndexTable").tablesorter({
        widgets: ['zebra'], 
        textExtraction: 'complex', 
        numberFormat: Utils.Constants.numberFormat[currentLanguage]
        });
      $("#sharesInIndexTable a").each( function() {
        $(this).attr("href",webCore.getURL( Utils.Constants.pages.micrositeShare, $(this).attr('name') ));
      });
    });
  }
};

$(document).ready( sharesInIndex.load );

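I know there is an "execute_script" command, but I do not know how to use it and have not found any suitable examples. I do not mind whether the result is JSON, HTML, or plain text. I believe this is where the answer lies: (1) load the page, (2) run the page's scripts, (3) fetch the result, (4) parse/print/save the result...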
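
Step (4) could look roughly like the sketch below once the rendered HTML is available as a string. It uses the standard-library html.parser (Python 3) as a stand-in for BeautifulSoup so that it has no extra dependencies. The class name, the function name, and the sample markup are made up for illustration; only the table id "sharesInIndexTable" comes from the page.

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect the text, name and href of every <a> inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0        # how many <table> levels deep we are inside the target
        self.current = None   # the link currently being read, if any
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Enter the target table, or a table nested inside it.
            if self.depth or attrs.get("id") == self.table_id:
                self.depth += 1
        elif tag == "a" and self.depth:
            self.current = {"text": "", "name": attrs.get("name"),
                            "href": attrs.get("href")}

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.current["text"] = self.current["text"].strip()
            self.links.append(self.current)
            self.current = None
        elif tag == "table" and self.depth:
            self.depth -= 1


def extract_links(html, table_id="sharesInIndexTable"):
    parser = TableLinkParser(table_id)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical rendered markup, only to exercise the parser.
SAMPLE = """
<a href="/outside">ignored</a>
<table id="sharesInIndexTable">
  <tr><td><a name="CSE1234" href="/shares/CSE1234">Carlsberg B</a></td></tr>
  <tr><td><a name="CSE5678" href="/shares/CSE5678">Example A/S</a></td></tr>
</table>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE):
        print(link["text"], link["href"])
```

An empty result from extract_links() is also a cheap signal that the AJAX content was not in the HTML yet, which is the same idea as searching for "Carlsberg".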

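Ideally I would like a headless solution, if there is one. Even PhantomJS on Windows is not fully headless, because it pops up a cmd window (I know you can get rid of that with a virtual display on Linux, but that is not my environment). Also, just telling me "oh, you have to poll it to see whether the data has loaded and then retrieve it" is not very helpful: can you show me (even in pseudocode) how the polling is done and, more importantly, roughly where in the program it happens? (This is why I post fully executable code: if someone else has the same problem, they should get a complete and easy-to-understand answer.)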
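
One way the polling could be structured, as a minimal sketch rather than a tested PyQt4 recipe: each check runs once and then hands control back to the event loop, because the page's AJAX request can only make progress while the event loop is running. The Poller class and its parameter names are hypothetical. The scheduler is passed in so the logic can be tried without Qt; in a Qt program it is assumed that QTimer.singleShot(msec, callable) would play that role.

```python
class Poller(object):
    """Re-run `check` every `interval_ms` until it returns a value or attempts run out."""

    def __init__(self, check, on_done, schedule, interval_ms=500, max_attempts=40):
        self.check = check              # returns None while the data is not there yet
        self.on_done = on_done          # called once: with the value, or None on timeout
        self.schedule = schedule        # schedule(delay_ms, callback); assumed QTimer.singleShot in Qt
        self.interval_ms = interval_ms
        self.max_attempts = max_attempts
        self.attempts = 0

    def start(self):
        self._tick()

    def _tick(self):
        self.attempts += 1
        value = self.check()
        if value is not None:
            self.on_done(value)
        elif self.attempts >= self.max_attempts:
            self.on_done(None)
        else:
            # No sleeping here: ask to be called again later and return.
            self.schedule(self.interval_ms, self._tick)


# Dry run without Qt: the stand-in scheduler calls back immediately,
# and the stand-in page "finishes loading" on the third look.
def run_now(delay_ms, callback):
    callback()


states = iter([None, None, "<table id='sharesInIndexTable'>...</table>"])
results = []
poller = Poller(lambda: next(states), results.append, run_now)
poller.start()
```

As for where this goes in the program: the slot connected to loadFinished would create the Poller and call start() instead of quitting, and the on_done callback would store the HTML and call app.quit(). The check would presumably look the table up with a CSS selector such as '#sharesInIndexTable' and return None while the element's isNull() is true.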

My latest attempts (1 - inserting a delay to let the AJAX load)

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *
import time

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)
    self.mainFrame().load(QUrl(url))  
    self.loadFinished.connect(self._loadFinished)   
    self.app.exec_()  

  def _loadFinished(self, result):
    time.sleep(5)
    self.frame = self.currentFrame()  
    self.app.quit()  

url = 'http://www.nasdaqomxnordic.com/index/index_info?Instrument=DK0016268840'  
r = Render(url)  
html = r.frame.toHtml()
print(html)

(2 - polling for a known item in the page source) - the item was found with the Firebug inspector - perhaps the argument syntax for findFirst is wrong.

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *
import time

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)
    self.mainFrame().load(QUrl(url))  
    self.loadFinished.connect(self._loadFinished)   
    self.app.exec_()  

  def _loadFinished(self, result):
    counter = 0
    while(self.mainFrame().documentElement().findFirst("id=sharesInIndexTable")):
      counter+=1
      print(counter)
      time.sleep(1)    
    self.frame = self.currentFrame()  
    self.app.quit()  

url = 'http://www.nasdaqomxnordic.com/index/index_info?Instrument=DK0016268840'  
r = Render(url)  
html = r.frame.toHtml()
print(html)

The last one has a counter to show whether anything is happening. It counts forever and has to be stopped with Ctrl-C.

(3 - another variant, using WebElement)

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *
import time

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)
    self.mainFrame().load(QUrl(url))  
    self.loadFinished.connect(self._loadFinished)   
    self.app.exec_()  

  def _loadFinished(self, result):
    table = self.mainFrame().documentElement().findFirst("id=sharesInIndexTable")
    print(table)    #prints: <PyQt4.QtWebKit.QWebElement object at 0x0319FB0>
    print("Attributes:")
    print(table.attributeNames())    #prints: [] i.e. None 
    print("Classes: ")
    print(table.classes())      #prints: [] i.e. None
    print("InnerXML: " + table.toInnerXml())   #prints nothing
    print("OuterXML: " + table.toOuterXml())   #prints nothing
    print("Done")
    self.frame = self.currentFrame()  
    self.app.quit()  

url = 'http://www.nasdaqomxnordic.com/index/index_info?Instrument=DK0016268840'  
r = Render(url)  
html = r.frame.toHtml()

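This one did not succeed either. I have included what was printed as comments in the code. There is clearly an object there, but I cannot see what is inside it.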

Best answer

I know it has been a long time, but this answer is for future visitors who run into a similar situation.

I had a similar problem and tried various approaches, such as waiting for the loadFinished signal from QWebPage and QWebFrame, waiting for the signal from QWebFrame.initialLayoutCompleted(), and so on.

What finally worked for me was:

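I simply rendered the page in a normal browser and inspected the element that was not being rendered in PyQt because of the JavaScript, then took that element's id (if it is a div containing several elements, a table, and so on, take the div's id). Then, in the Python code, inside the function connected to yourPage.loadFinished, call yourFrame.evaluateJavaScript("document.getElementById('element_id_retrieved_earlier')").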

This waits for the id to be retrieved, which in turn waits for the embedded script to be executed.
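
The call described above might be wrapped as in the following sketch. The helper names are hypothetical. The frame argument is assumed to behave like a QWebFrame, that is, to offer evaluateJavaScript(script), and the binding is assumed to hand back a plain Python value (with PyQt4's older QVariant API the result would need converting first). The script returns the element's outerHTML rather than the element itself so that the result is a string.

```python
import json


def build_lookup_js(element_id):
    """Return JavaScript that evaluates to the element's outer HTML, or null if absent."""
    quoted = json.dumps(element_id)  # yields a safely quoted JavaScript string literal
    return ("(function () {"
            " var el = document.getElementById(%s);"
            " return el ? el.outerHTML : null; })()" % quoted)


def element_html(frame, element_id):
    """Ask the frame for the element's HTML; None means it is not in the DOM (yet)."""
    result = frame.evaluateJavaScript(build_lookup_js(element_id))
    return result if result else None


# Stand-in for a QWebFrame, used only to show the call without Qt installed.
class FakeFrame(object):
    def __init__(self, value):
        self.value = value
        self.scripts = []

    def evaluateJavaScript(self, script):
        self.scripts.append(script)
        return self.value


if __name__ == "__main__":
    frame = FakeFrame('<table id="sharesInIndexTable"></table>')
    print(element_html(frame, "sharesInIndexTable"))
```

Because element_html() returns None when the element is missing, the caller can tell "not loaded yet" apart from "loaded" and decide whether to try again.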

Regarding "javascript - Executing a web page's scripts with Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25860407/

