I am using Apache Spark to analyze query logs. I already ran into some difficulty setting Spark up; I am now using a standalone cluster to process the queries.
At first I ran the Java word-count example, which works fine. The problem appeared when I tried to connect to a MySQL server. I am on 64-bit Ubuntu 14.04 LTS, with Spark 1.4.1 and MySQL 5.1.
Here is my code. When I use the master URL instead of local[*], I get the error "No suitable driver found". I have included the log below.
    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;

    public class LoadFromDb implements Serializable {

        private static final org.apache.log4j.Logger LOGGER =
                org.apache.log4j.Logger.getLogger(LoadFromDb.class);

        private static final String MYSQL_DRIVER = "com.mysql.jdbc.Driver";
        private static final String MYSQL_USERNAME = "spark";
        private static final String MYSQL_PWD = "spark123";
        private static final String MYSQL_CONNECTION_URL =
                "jdbc:mysql://localhost/productsearch_userinfo?user=" + MYSQL_USERNAME
                + "&password=" + MYSQL_PWD;

        private static final JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SparkJdbcDs")
                               .setMaster("spark://shawon-H67MA-USB3-B3:7077"));
        private static final SQLContext sqlContext = new SQLContext(sc);

        public static void main(String[] args) {
            // Data source options
            Map<String, String> options = new HashMap<>();
            options.put("driver", MYSQL_DRIVER);
            options.put("url", MYSQL_CONNECTION_URL);
            options.put("dbtable", "query");
            //options.put("partitionColumn", "sessionID");
            //options.put("lowerBound", "10001");
            //options.put("upperBound", "499999");
            //options.put("numPartitions", "10");

            // Load the MySQL query result as a DataFrame
            DataFrame jdbcDF = sqlContext.load("jdbc", options);
            //jdbcDF.show();
            jdbcDF.select("id", "queryText").show();
        }
    }
Any sample project would be a great help. The log is:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/29 03:38:26 INFO SparkContext: Running Spark version 1.4.1
15/08/29 03:38:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/29 03:38:27 WARN Utils: Your hostname, shawon-H67MA-USB3-B3 resolves to a loopback address: 127.0.0.1; using 192.168.1.102 instead (on interface eth0)
15/08/29 03:38:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/08/29 03:38:27 INFO SecurityManager: Changing view acls to: shawon
15/08/29 03:38:27 INFO SecurityManager: Changing modify acls to: shawon
15/08/29 03:38:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(shawon); users with modify permissions: Set(shawon)
15/08/29 03:38:27 INFO Slf4jLogger: Slf4jLogger started
15/08/29 03:38:27 INFO Remoting: Starting remoting
15/08/29 03:38:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.102:60742]
15/08/29 03:38:27 INFO Utils: Successfully started service 'sparkDriver' on port 60742.
15/08/29 03:38:27 INFO SparkEnv: Registering MapOutputTracker
15/08/29 03:38:27 INFO SparkEnv: Registering BlockManagerMaster
15/08/29 03:38:27 INFO DiskBlockManager: Created local directory at /tmp/spark-85b7b4c4-ed50-4ccf-97fc-25b14ab404b1/blockmgr-57acbba4-d7d4-4557-9e6c-e1acf97d4c88
15/08/29 03:38:27 INFO MemoryStore: MemoryStore started with capacity 473.3 MB
15/08/29 03:38:27 INFO HttpFileServer: HTTP File server directory is /tmp/spark-85b7b4c4-ed50-4ccf-97fc-25b14ab404b1/httpd-a5e6844d-ac3a-4da2-822c-1b98d0a287c4
15/08/29 03:38:27 INFO HttpServer: Starting HTTP Server
15/08/29 03:38:27 INFO Utils: Successfully started service 'HTTP file server' on port 55199.
15/08/29 03:38:27 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/29 03:38:28 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/08/29 03:38:28 INFO SparkUI: Started SparkUI at http://192.168.1.102:4040
15/08/29 03:38:28 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@shawon-H67MA-USB3-B3:7077/user/Master...
15/08/29 03:38:28 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20150829033828-0000
15/08/29 03:38:28 INFO AppClient$ClientActor: Executor added: app-20150829033828-0000/0 on worker-20150829033238-192.168.1.102-36976 (192.168.1.102:36976) with 4 cores
15/08/29 03:38:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150829033828-0000/0 on hostPort 192.168.1.102:36976 with 4 cores, 512.0 MB RAM
15/08/29 03:38:28 INFO AppClient$ClientActor: Executor updated: app-20150829033828-0000/0 is now RUNNING
15/08/29 03:38:28 INFO AppClient$ClientActor: Executor updated: app-20150829033828-0000/0 is now LOADING
15/08/29 03:38:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58874.
15/08/29 03:38:28 INFO NettyBlockTransferService: Server created on 58874
15/08/29 03:38:28 INFO BlockManagerMaster: Trying to register BlockManager
15/08/29 03:38:28 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.102:58874 with 473.3 MB RAM, BlockManagerId(driver, 192.168.1.102, 58874)
15/08/29 03:38:28 INFO BlockManagerMaster: Registered BlockManager
15/08/29 03:38:28 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/08/29 03:38:30 INFO SparkContext: Starting job: show at LoadFromDb.java:43
15/08/29 03:38:30 INFO DAGScheduler: Got job 0 (show at LoadFromDb.java:43) with 1 output partitions (allowLocal=false)
15/08/29 03:38:30 INFO DAGScheduler: Final stage: ResultStage 0(show at LoadFromDb.java:43)
15/08/29 03:38:30 INFO DAGScheduler: Parents of final stage: List()
15/08/29 03:38:30 INFO DAGScheduler: Missing parents: List()
15/08/29 03:38:30 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at show at LoadFromDb.java:43), which has no missing parents
15/08/29 03:38:30 INFO MemoryStore: ensureFreeSpace(4304) called with curMem=0, maxMem=496301506
15/08/29 03:38:30 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.2 KB, free 473.3 MB)
15/08/29 03:38:30 INFO MemoryStore: ensureFreeSpace(2274) called with curMem=4304, maxMem=496301506
15/08/29 03:38:30 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.2 KB, free 473.3 MB)
15/08/29 03:38:30 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.102:58874 (size: 2.2 KB, free: 473.3 MB)
15/08/29 03:38:30 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/08/29 03:38:30 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at show at LoadFromDb.java:43)
15/08/29 03:38:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/08/29 03:38:30 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@192.168.1.102:56580/user/Executor#1344522225]) with ID 0
15/08/29 03:38:30 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.102, PROCESS_LOCAL, 1171 bytes)
15/08/29 03:38:30 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.102:56904 with 265.4 MB RAM, BlockManagerId(0, 192.168.1.102, 56904)
15/08/29 03:38:31 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.102:56904 (size: 2.2 KB, free: 265.4 MB)
15/08/29 03:38:31 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.102): java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost/productsearch_userinfo?user=spark&password=spark123
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:185)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:177)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:359)
at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:350)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/08/29 03:38:31 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 192.168.1.102, PROCESS_LOCAL, 1171 bytes)
15/08/29 03:38:31 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor 192.168.1.102: java.sql.SQLException (No suitable driver found for jdbc:mysql://localhost/productsearch_userinfo?user=spark&password=spark123) [duplicate 1]
15/08/29 03:38:31 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, 192.168.1.102, PROCESS_LOCAL, 1171 bytes)
15/08/29 03:38:31 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on executor 192.168.1.102: java.sql.SQLException (No suitable driver found for jdbc:mysql://localhost/productsearch_userinfo?user=spark&password=spark123) [duplicate 2]
15/08/29 03:38:31 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, 192.168.1.102, PROCESS_LOCAL, 1171 bytes)
15/08/29 03:38:31 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on executor 192.168.1.102: java.sql.SQLException (No suitable driver found for jdbc:mysql://localhost/productsearch_userinfo?user=spark&password=spark123) [duplicate 3]
15/08/29 03:38:31 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/08/29 03:38:31 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/29 03:38:31 INFO TaskSchedulerImpl: Cancelling stage 0
15/08/29 03:38:31 INFO DAGScheduler: ResultStage 0 (show at LoadFromDb.java:43) failed in 1.680 s
15/08/29 03:38:31 INFO DAGScheduler: Job 0 failed: show at LoadFromDb.java:43, took 1.840969 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.1.102): java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost/productsearch_userinfo?user=spark&password=spark123
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:185)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:177)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:359)
at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:350)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/08/29 03:38:31 INFO SparkContext: Invoking stop() from shutdown hook
15/08/29 03:38:31 INFO SparkUI: Stopped Spark web UI at http://192.168.1.102:4040
15/08/29 03:38:31 INFO DAGScheduler: Stopping DAGScheduler
15/08/29 03:38:31 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/08/29 03:38:31 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
15/08/29 03:38:31 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/08/29 03:38:31 INFO Utils: path = /tmp/spark-85b7b4c4-ed50-4ccf-97fc-25b14ab404b1/blockmgr-57acbba4-d7d4-4557-9e6c-e1acf97d4c88, already present as root for deletion.
15/08/29 03:38:31 INFO MemoryStore: MemoryStore cleared
15/08/29 03:38:31 INFO BlockManager: BlockManager stopped
15/08/29 03:38:32 INFO BlockManagerMaster: BlockManagerMaster stopped
15/08/29 03:38:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/08/29 03:38:32 INFO SparkContext: Successfully stopped SparkContext
15/08/29 03:38:32 INFO Utils: Shutdown hook called
15/08/29 03:38:32 INFO Utils: Deleting directory /tmp/spark-85b7b4c4-ed50-4ccf-97fc-25b14ab404b1
15/08/29 03:38:32 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
Best answer
I ran into the same problem with an Oracle database connection.
I did the following four things, though I believe the first two are what fixed it:

1. Passed the JDBC driver jar to spark-submit:

       %YOUR_SPARK_HOME%/bin/spark-submit --jars file://c:/jdbcDrivers/ojdbc7.jar

2. Added the driver to the connection properties (the options map in your case):

       dbProps.put("driver", "oracle.jdbc.driver.OracleDriver");

3. Added the following to the %YOUR_SPARK_HOME%/conf/spark-defaults.conf file:

       spark.driver.extraClassPath = file://C:/jdbcDrivers/ojdbc7.jar

4. Also added, in the same conf file:

       spark.executor.extraClassPath = file://C:/jdbcDrivers/ojdbc7.jar
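The root cause is that each executor runs in its own JVM: in local mode the driver jar on the application classpath is enough, but on a cluster java.sql.DriverManager on every executor must be able to find the driver class. For the MySQL setup in the question, the equivalent steps would look roughly like the sketch below; the jar location and Connector/J version are assumptions, so point them at wherever the MySQL JDBC jar actually lives:

    $SPARK_HOME/bin/spark-submit --jars /path/to/mysql-connector-java-5.1.36.jar ...

and in $SPARK_HOME/conf/spark-defaults.conf:

    # hypothetical path -- adjust to the real location of Connector/J
    spark.driver.extraClassPath   /path/to/mysql-connector-java-5.1.36.jar
    spark.executor.extraClassPath /path/to/mysql-connector-java-5.1.36.jar

The options map in the question already sets "driver" to com.mysql.jdbc.Driver, so the second step is covered.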
That connected me to the DB, although I am currently still working through some issues with inserting into Oracle.
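If you would rather not edit spark-defaults.conf, the driver jar can also be shipped programmatically when the SparkContext is created. Below is a minimal, untested sketch based on the question's code; the jar path is a hypothetical placeholder, and depending on the Spark version you may still need the extraClassPath entries above so that DriverManager can see the class:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LoadFromDbWithJars {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("SparkJdbcDs")
                    .setMaster("spark://shawon-H67MA-USB3-B3:7077")
                    // Ship the MySQL JDBC driver jar to every executor.
                    // Hypothetical path -- adjust to where Connector/J really lives.
                    .setJars(new String[] { "/path/to/mysql-connector-java-5.1.36.jar" });
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... create the SQLContext and load the JDBC DataFrame as in the question ...
            sc.stop();
        }
    }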
Regarding "mysql - No suitable JDBC driver found for Apache Spark MySQL connection", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32280276/