python - Geoip 2's python library doesn' t 在 pySpark 的 map 函数中工作

coder 2023-08-23 原文

我正在使用 geoip2 的 python 库和 pySpark 来获取某些 IP 的地理地址。我的代码是这样的:

geoDBpath = 'somePath/geoDB/GeoLite2-City.mmdb'
geoPath = os.path.join(geoDBpath)
sc.addFile(geoPath)
reader = geoip2.database.Reader(SparkFiles.get(geoPath))
def ip2city(ip):
    try:
        city = reader.city(ip).city.name
    except:
        city = 'not found'
    return city

我试过了

print ip2city("128.101.101.101")

它有效。但是当我试图在 rdd.map 中这样做时:

rdd = sc.parallelize([ip1, ip2, ip3, ip3, ...])
print rdd.map(lambda x: ip2city(x))

报道了

    Traceback (most recent call last):
  File "/home/worker/software/spark/python/pyspark/rdd.py", line 1299, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/home/worker/software/spark/python/pyspark/context.py", line 916, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/home/worker/software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/worker/software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/worker/software/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/home/worker/software/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/home/worker/software/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
TypeError: Required argument 'fileno' (pos 1) not found

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

谁能告诉我如何使 ip2city 函数在 rdd.map() 中工作。谢谢!

最佳答案

看起来您的代码问题来自 reader 对象。它不能作为闭包的一部分正确序列化并发送给 worker 。为了解决这个问题，您已经在 worker 身上实例化了它。处理此问题的一种方法是使用 mapPartitions:

from pyspark import SparkFiles

geoDBpath = 'GeoLite2-City.mmdb'
sc.addFile(geoDBpath)

def partitionIp2city(iter):
    from geoip2 import database

    def ip2city(ip):
        try:
           city = reader.city(ip).city.name
        except:
            city = 'not found'
        return city

    reader = database.Reader(SparkFiles.get(geoDBpath))
    return [ip2city(ip) for ip in iter]

rdd = sc.parallelize(['128.101.101.101', '85.25.43.84'])
rdd.mapPartitions(partitionIp2city).collect()

## ['Minneapolis', None]

关于python - Geoip 2's python library doesn' t 在 pySpark 的 map 函数中工作，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/33745520/

中工 python spark city apache-spark pyspark geoip

有关python - Geoip 2's python library doesn' t 在 pySpark 的 map 函数中工作的更多相关文章

python - 如何使用 Ruby 或 Python 创建一系列高音调和低音调的蜂鸣声？ - 2
关闭。这个问题是opinion-based.它目前不接受答案。想要改进这个问题？更新问题，以便editingthispost可以用事实和引用来回答它.关闭4年前。Improvethisquestion我想在固定时间创建一系列低音和高音调的哔哔声。例如:在150毫秒时发出高音调的蜂鸣声在151毫秒时发出低音调的蜂鸣声200毫秒时发出低音调的蜂鸣声250毫秒的高音调蜂鸣声有没有办法在Ruby或Python中做到这一点？我真的不在乎输出编码是什么(.wav、.mp3、.ogg等等)，但我确实想创建一个输出文件。
ruby-on-rails - rails : "missing partial" when calling 'render' in RSpec test - 2
我正在尝试测试是否存在表单。我是Rails新手。我的new.html.erb_spec.rb文件的内容是:require'spec_helper'describe"messages/new.html.erb"doit"shouldrendertheform"dorender'/messages/new.html.erb'reponse.shouldhave_form_putting_to(@message)with_submit_buttonendendView本身，new.html.erb，有代码:当我运行rspec时，它失败了:1)messages/new.html.erbshou
ruby-on-rails - 'compass watch' 是如何工作的/它是如何与 rails 一起使用的 - 2
我在我的项目目录中完成了compasscreate.和compassinitrails。几个问题:我已将我的.sass文件放在public/stylesheets中。这是放置它们的正确位置吗？当我运行compasswatch时，它不会自动编译这些.sass文件。我必须手动指定文件:compasswatchpublic/stylesheets/myfile.sass等。如何让它自动运行？文件ie.css、print.css和screen.css已放在stylesheets/compiled。如何在编译后不让它们重新出现的情况下删除它们？我自己编译的.sass文件编译成compiled/t
ruby-on-rails - Rails 3.2.1 中 ActionMailer 中的未定义方法 'default_content_type=' - 2
我在我的项目中添加了一个系统来重置用户密码并通过电子邮件将密码发送给他，以防他忘记密码。昨天它运行良好(当我实现它时)。当我今天尝试启动服务器时，出现以下错误。=>BootingWEBrick=>Rails3.2.1applicationstartingindevelopmentonhttp://0.0.0.0:3000=>Callwith-dtodetach=>Ctrl-CtoshutdownserverExiting/Users/vinayshenoy/.rvm/gems/ruby-1.9.3-p0/gems/actionmailer-3.2.1/lib/action_mailer
ruby - 在 jRuby 中使用 'fork' 生成进程的替代方案？ - 2
在MRIRuby中我可以这样做:deftransferinternal_server=self.init_serverpid=forkdointernal_server.runend#Maketheserverprocessrunindependently.Process.detach(pid)internal_client=self.init_client#Dootherstuffwithconnectingtointernal_server...internal_client.post('somedata')ensure#KillserverProcess.kill('KILL',
ruby - 主要 :Object when running build from sublime 的未定义方法 `require_relative' - 2
我已经从我的命令行中获得了一切，所以我可以运行rubymyfile并且它可以正常工作。但是当我尝试从sublime中运行它时，我得到了undefinedmethod`require_relative'formain:Object有人知道我的sublime设置中缺少什么吗？我正在使用OSX并安装了rvm。最佳答案或者，您可以只使用“require”，它应该可以正常工作。我认为“require_relative”仅适用于ruby1.9+ 关于ruby-主要:Objectwhenrun
ruby - 无法让 RSpec 工作—— 'require' : cannot load such file - 2
我花了三天的时间用头撞墙，试图弄清楚为什么简单的“rake”不能通过我的规范文件。如果您遇到这种情况:任何文件夹路径中都不要有空格!。严重地。事实上，从现在开始，您命名的任何内容都没有空格。这是我的控制台输出:(在/Users/*****/Desktop/LearningRuby/learn_ruby)$rake/Users/*******/Desktop/LearningRuby/learn_ruby/00_hello/hello_spec.rb:116:in`require':cannotloadsuchfile--hello(LoadError) 最佳
ruby-on-rails - 新 Rails 项目 : 'bundle install' can't install rails in gemfile - 2
我已经像这样安装了一个新的Rails项目:$railsnewsite它执行并到达:bundleinstall但是当它似乎尝试安装依赖项时我得到了这个错误Gem::Ext::BuildError:ERROR:Failedtobuildgemnativeextension./System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/bin/rubyextconf.rbcheckingforlibkern/OSAtomic.h...yescreatingMakefilemake"DESTDIR="cleanmake"DESTDIR="
ruby-on-rails - rspec should have_select ('cars' , :options => ['volvo' , 'saab' ] 不工作 - 2
关闭。这个问题需要detailsorclarity.它目前不接受答案。想改进这个问题吗？通过editingthispost添加细节并澄清问题.关闭8年前。Improvethisquestion在首页我有:汽车:VolvoSaabMercedesAudistatic_pages_spec.rb中的测试代码:it"shouldhavetherightselect"dovisithome_pathit{shouldhave_select('cars',:options=>['volvo','saab','mercedes','audi'])}end响应是rspec./spec/request
ruby-on-rails - Rails 中的 NoMethodError::MailersController#preview undefined method `activation_token=' for nil:NilClass - 2
似乎无法为此找到有效的答案。我正在阅读Rails教程的第10章第10.1.2节，但似乎无法使邮件程序预览正常工作。我发现处理错误的所有答案都与教程的不同部分相关，我假设我犯的错误正盯着我的脸。我已经完成并将教程中的代码复制/粘贴到相关文件中，但到目前为止，我还看不出我输入的内容与教程中的内容有什么区别。到目前为止，建议是在函数定义中添加或删除参数user，但这并没有解决问题。触发错误的url是http://localhost:3000/rails/mailers/user_mailer/account_activation.http://localhost:3000/rails/mai

python - Geoip 2's python library doesn' t 在 pySpark 的 map 函数中工作

有关python - Geoip 2's python library doesn' t 在 pySpark 的 map 函数中工作的更多相关文章

随机推荐