mapPartitionsWithIndex

Java Apache Spark: Long transformation chains result in quadratic time

I have a Java program that uses Apache Spark. The most interesting part of the program looks like this: long seed = System.nanoTime(); JavaRDD annotated = documents.mapPartitionsWithIndex(new InitialAnnotater(seed), true); annotated.cache(); for (int iter = 0; iter … a.sum(b)); // update overall counts (*) seed = System.nanoTime(); // copy overall counts which CountChanger uses to compute a stoch…

Spark mapPartitionsWithIndex: identifying partitions

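Identifying a partition: mapPartitionsWithIndex(index, iter) applies a function to each partition. I understand that we can track the partition using the "index" parameter. Many examples use this method with an "index == 0" condition to remove the header from a data set. But how do we make sure that the first partition read (that is, the one whose "index" parameter equals 0) really contains the header? Isn't the index random, or based on the partitioner (if one is used)? Answer: Is it random or based on the partitioner, if one is used? It is not random; it is the partition number. You can use the simple example below to understand it: val base = sc.parallelize(1 to 100, 4) base.mapPartitionsWithIndex((index, iterator) => { itera…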
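The point of the answer, that the index is the partition number rather than a random value, can be illustrated without a Spark cluster. The following is a minimal plain-Java sketch, not Spark code: the class `PartitionSlices` and its `slice` helper are hypothetical names, and the slicing rule (contiguous chunks from `i * length / numSlices` to `(i + 1) * length / numSlices`) is an assumption intended to mirror how `sc.parallelize(1 to 100, 4)` splits a range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Spark's range slicing; it does not use Spark itself.
class PartitionSlices {

    // Splits the inclusive range [first, last] into numSlices contiguous chunks.
    // Assumption: chunk i covers offsets [i * length / numSlices, (i + 1) * length / numSlices).
    static List<List<Integer>> slice(int first, int last, int numSlices) {
        int length = last - first + 1;
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) (((long) i * length) / numSlices);
            int end = (int) (((long) (i + 1) * length) / numSlices);
            List<Integer> part = new ArrayList<>();
            for (int k = start; k < end; k++) {
                part.add(first + k);
            }
            partitions.add(part);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Same shape as the question's example: the numbers 1 to 100 in 4 partitions.
        List<List<Integer>> parts = slice(1, 100, 4);
        for (int index = 0; index < parts.size(); index++) {
            List<Integer> p = parts.get(index);
            // The position in the list plays the role of the "index" argument.
            System.out.println("partition " + index + ": "
                    + p.get(0) + ".." + p.get(p.size() - 1));
        }
    }
}
```

Under this assumption, index 0 always holds the first elements of the input (1 through 25 here), which is why filtering on index 0 is commonly used to drop a header line. Whether that holds for a given data set still depends on how the input was split and whether it was repartitioned afterwards.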