sparking_草庐IT

hadoop - Spark YARN configuration problem: containers keep failing

I am trying to save a DataFrame as a text file, but even small data takes a long time. I believe my configuration is wrong. Can someone tell me what I am doing wrong here?
spark.default.parallelism 640
spark.hadoop.fs.s3.cse.plaintextLength.enabled false
spark.hadoop.fs.s3n.filestatuscache.enable true
spark.hadoop.mapreduce.input.fileinputformat.split.maxsize 33554432
spark.executor.id driver
spark.executor.instan

scala - Spark crashes when processing a large file

I have a Scala program that reads a CSV file, adds a new column to the DataFrame, and saves the result as a Parquet file. It works on small files [...]
16/10/20 10:03:37 WARN scheduler.TaskSetManager: Lost task 14.0 in stage 4.0 (TID 886, 10.0.0.10): java.io.EOFException: reached end of stream after reading 136445 bytes; 1245184 bytes expected at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.ja

scala spark apache debugging hadoop apache-spark

hadoop - How to change the default output delimiter in Spark

hadoop Spark apache-spark

scala - Can I run a shell script from a spark-scala program?

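I am writing a spark-scala program in IntelliJ. My code basically fetches tables from Oracle and stores them as text files in HDFS: insert_df.rdd.saveAsTextFile("hdfs://path"). I tried the following approach, but it did not work: val script_sh = "///samplepath/file_creation_script.sh".! I need to apply some transformations to the generated text files, and I wrote a shell script for that. I do not want to run the Spark jar and the .sh file separately. Please tell me whether there is any way to invoke the shell script from the program. Best answer: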

scala spark-scala script hadoop apache-spark intellij-idea spark-dataframe

java - How can I simply deploy a Spark jar to a remote Hadoop cluster?

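I have a Hadoop cluster (Cloudera CDH 5.2) with Apache Spark 1.5.0. Can I run my application from IntelliJ IDEA or my local PC using the cluster's YARN, Spark, and HDFS? Or should I send the jar to the master node via FTP and then run it with spark-submit?
Best answer: Yes, you can run your job directly from the IDE if you follow these steps:
1. Add the spark-yarn package to your project dependencies (it can be marked as provided).
2. Add the directory containing the Hadoop configuration (HADOOP_CONF_DIR) to the project classpath.
3. Copy the Spark assembly jar to H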

hadoop spark java scala apache-spark

hadoop - Installing Apache Spark with yum

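I am installing Spark on my organization's HDP box. I ran yum install spark, which installed Spark 1.4.1. How do I install Spark 2.0? Please help!
Best answer: Spark 2 is supported in HDP 2.5 (as a technical preview). You can add the HDP 2.5 repository to your yum repository directory and then install it. Spark 1.6.2 is the default version in HDP 2.5.
wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.5.0.0/hdp.repo
sudo cp hdp.repo /etc/yum.re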

hadoop Apache hortonworks HDP apache-spark hortonworks-sandbox

java - Apache Spark: in PairFlatMapFunction, how to add tuples back to the Iterable<Tuple2<Integer, String>> return type

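I am a beginner. I have been working on code that involves two datasets, so I started with a PairFlatMapFunction, in which I handle the mapper.
JavaPairRDD trainingArray = trainingData.flatMapToPair(new PairFlatMapFunction() {
  public Iterable<Tuple2<Integer, String>> call(String s) {
    // code to form the tuples of type Tuple2
    // new Tuples2
  }
How do I add the tuples back to the Iterable so that the reducer (reduceByKey) can process them? Any pointers would be appreciated.
Best answer: Thanks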

PairFlatMapFunction String Integer java hadoop apache-spark rdd bigdata

hadoop - Spark RDD operations

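Suppose I have a table with two columns, A and B, in a CSV file. I select the maximum value from column A [Max value = 100], and I need to return the corresponding value of column B [Return Value = AliExpress] using Java RDD operations, without DataFrames.
Input table:
Column A   Column B
56         Walmart
72         Flipkart
96         Amazon
100        AliExpress
Output table:
Column A   Column B
100        AliExpress
This is the source code I have tried so far:
SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local"); Jav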

hadoop Spark apache-spark apache-spark-sql spark-dataframe rdd
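
For the column A / column B question above, here is a minimal sketch of the reduction logic only. It is written in plain Python rather than the asker's Java, so it runs without a Spark installation; the names rows and keep_larger are illustrative and do not come from the original post. The idea is that a two-argument function which keeps the row with the larger column A value is the kind of function one could hand to an RDD's reduce operation, with no DataFrame involved.

```python
from functools import reduce

# Rows copied from the question's input table: (column A, column B).
rows = [(56, "Walmart"), (72, "Flipkart"), (96, "Amazon"), (100, "AliExpress")]


def keep_larger(left, right):
    """Return whichever (a, b) row has the larger column A value."""
    return left if left[0] >= right[0] else right


# Fold the rows pairwise; the survivor carries its column B value along.
best_row = reduce(keep_larger, rows)
print(best_row)
```

Because the whole row travels through the reduction, the column B value for the maximum never has to be looked up in a second pass.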

windows - How to run a Spark Streaming application on Windows 10?

I am running a Spark Streaming application on 64-bit MS Windows 10 that uses spark-mongo-connector to store data in MongoDB. Whenever I run a Spark application, even pyspark, I get the following exception:
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
Full stack trace:
Caused by: java.lang.RuntimeException: The root scratch dir: /tm

Streaming Hadoop windows apache-spark pyspark

hadoop - Spark: minimize task/partition size skew with textFile's minPartitions option?

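I am reading tens of thousands of files into an RDD via something like sc.textFile("/data/*/*/*"). One problem is that most of these files are tiny, while others are huge. This leads to unbalanced tasks, which causes all sorts of well-known problems. Can I split the largest partitions by instead reading the data via sc.textFile("/data/*/*/*", minPartitions=n_files*5), where n_files is the number of input files? As covered elsewhere on Stack Overflow, minPartitions gets passed down the Hadoop rabbit hole and is used in org.apache.hadoop.mapred.TextInputFormat.getSp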

minPartitions partition hadoop apache-spark
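
To get a feel for what the minPartitions question above is asking, here is a rough model in plain Python. It assumes the split arithmetic of the old mapred FileInputFormat: goal size = total bytes / requested splits, split size = max(minSize, min(goalSize, blockSize)), and a 1.1 slack factor before a file is cut again. The function names, the 128 MB block size, and the example file sizes are illustrative assumptions, and files are assumed to be non-empty and splittable. Treat it as a sketch for reasoning about skew, not as a description of what a particular Spark or Hadoop version does.

```python
SPLIT_SLOP = 1.1  # assumed slack factor: a tail up to 10% over split size stays in one split


def compute_split_size(goal_size, min_size, block_size):
    """Split size under the assumed rule max(minSize, min(goalSize, blockSize))."""
    return max(min_size, min(goal_size, block_size))


def count_splits(file_sizes, num_splits, block_size=128 * 1024 * 1024, min_size=1):
    """Return the number of input splits each (non-empty) file would produce."""
    total = sum(file_sizes)
    goal_size = total // (num_splits if num_splits else 1)
    split_size = compute_split_size(goal_size, min_size, block_size)
    counts = []
    for length in file_sizes:
        n = 0
        remaining = length
        # Cut full-size splits while the remainder is clearly larger than one split.
        while remaining / split_size > SPLIT_SLOP:
            n += 1
            remaining -= split_size
        if remaining != 0:
            n += 1  # the tail becomes the last split
        counts.append(n)
    return counts


# Two tiny files and one large one, requesting n_files * 5 = 15 splits.
sizes = [1_000, 1_000, 1_000_000]
print(count_splits(sizes, num_splits=len(sizes) * 5))
```

Under these assumptions the requested split count only lowers the goal size, so the large file is cut into many pieces while each tiny file still yields one split of its own.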