xml - Loading a Wikipedia dump into Elasticsearch

coder 2024-07-04 Original post

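I want to load an XML Wikipedia dump, for example http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20171001/enwiki-20171001-pages-articles.xml.bz2, into Elasticsearch (5.6.4). However, all the tools and tutorials I have found are outdated and incompatible with my version of Elasticsearch. Can anyone explain the best way to import the dump into Elasticsearch?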

Best answer

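Two years ago, Wikimedia made dumps of its production Elasticsearch indices available.

The indices are exported every week, and for each wiki there are two exports: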

The content index (called content), which contains only article pages;
The general index (called general), which contains all pages, including talk pages, templates, etc.

You can find them here: http://dumps.wikimedia.org/other/cirrussearch/current/

Create a mapping that fits your needs. For example:

{
     "mappings": {
     "page": {
        "properties": {
           "auxiliary_text": {
              "type": "text"
           },
           "category": {
              "type": "text"
           },
           "coordinates": {
              "properties": {
                 "coord": {
                    "properties": {
                       "lat": {
                          "type": "double"
                       },
                       "lon": {
                          "type": "double"
                       }
                    }
                 },
                 "country": {
                    "type": "text"
                 },
                 "dim": {
                    "type": "long"
                 },
                 "globe": {
                    "type": "text"
                 },
                 "name": {
                    "type": "text"
                 },
                 "primary": {
                    "type": "boolean"
                 },
                 "region": {
                    "type": "text"
                 },
                 "type": {
                    "type": "text"
                 }
              }
           },
           "defaultsort": {
              "type": "boolean"
           },
           "external_link": {
              "type": "text"
           },
           "heading": {
              "type": "text"
           },
           "incoming_links": {
              "type": "long"
           },
           "language": {
              "type": "text"
           },
           "namespace": {
              "type": "long"
           },
           "namespace_text": {
              "type": "text"
           },
           "opening_text": {
              "type": "text"
           },
           "outgoing_link": {
              "type": "text"
           },
           "popularity_score": {
              "type": "double"
           },
           "redirect": {
              "properties": {
                 "namespace": {
                    "type": "long"
                 },
                 "title": {
                    "type": "text"
                 }
              }
           },
           "score": {
              "type": "double"
           },
           "source_text": {
              "type": "text"
           },
           "template": {
              "type": "text"
           },
           "text": {
              "type": "text"
           },
           "text_bytes": {
              "type": "long"
           },
           "timestamp": {
              "type": "date",
              "format": "strict_date_optional_time||epoch_millis"
           },
           "title": {
              "type": "text"
           },
           "version": {
              "type": "long"
           },
           "version_type": {
              "type": "text"
           },
           "wiki": {
              "type": "text"
           },
           "wikibase_item": {
              "type": "text"
           }
        }
     }
  }
}
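
The mapping has to be sent to Elasticsearch when the index is created. As a minimal sketch, assuming the cluster listens on http://localhost:9200 and the index is named enwiki (both taken from the bulk command in this answer), the request could be built like this in Python. The helper name build_create_index_request is hypothetical, and the mapping here is trimmed to three fields for brevity.

```python
import json
import urllib.request


def build_create_index_request(mapping, host="http://localhost:9200", index="enwiki"):
    """Build (but do not send) the PUT request that creates the index.

    The host and index defaults are assumptions; adjust them to your setup.
    """
    body = json.dumps(mapping).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Trimmed-down mapping for illustration; use the full mapping in practice.
mapping = {
    "mappings": {
        "page": {
            "properties": {
                "title": {"type": "text"},
                "text": {"type": "text"},
                "timestamp": {
                    "type": "date",
                    "format": "strict_date_optional_time||epoch_millis",
                },
            }
        }
    }
}

request = build_create_index_request(mapping)

# To actually create the index, send the request to a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching the cluster.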

Once the index has been created, just type:

zcat enwiki-current-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki/_bulk --data-binary @- > /dev/null'
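
For readers unfamiliar with GNU parallel: as I understand the flags, -L 2 treats every two lines as one record, -N 2000 puts 2000 records into each request, and -j3 runs three uploads at a time. That only works because the dump is already laid out in bulk format, with an action line followed by a document line. The batching step can be sketched in Python roughly as follows. The function names bulk_bodies and read_dump are hypothetical, and the sketch only prepares request bodies; it does not send them.

```python
import gzip
import itertools


def bulk_bodies(lines, docs_per_request=2000):
    """Yield bulk request bodies holding up to docs_per_request documents each.

    Every document takes two lines (an action line, then the source line),
    so a batch is always cut on an even line boundary.
    """
    iterator = iter(lines)
    lines_per_request = 2 * docs_per_request
    while True:
        batch = list(itertools.islice(iterator, lines_per_request))
        if not batch:
            return
        yield b"".join(batch)


def read_dump(path):
    """Stream raw lines from a gzip-compressed dump, similar to zcat."""
    with gzip.open(path, "rb") as handle:
        yield from handle


# Example with three tiny in-memory documents, two documents per request.
sample_lines = [
    b'{"index":{"_type":"page","_id":"1"}}\n',
    b'{"title":"A"}\n',
    b'{"index":{"_type":"page","_id":"2"}}\n',
    b'{"title":"B"}\n',
    b'{"index":{"_type":"page","_id":"3"}}\n',
    b'{"title":"C"}\n',
]
sample_bodies = list(bulk_bodies(sample_lines, docs_per_request=2))
```

Each yielded body could then be posted to http://localhost:9200/enwiki/_bulk, which is what the curl call in the command does with --data-binary.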

Enjoy!

This answer comes from a similar question about loading a Wikipedia dump into Elasticsearch on Stack Overflow: https://stackoverflow.com/questions/47476122/
