這期內(nèi)容當(dāng)中小編將會(huì)給大家?guī)碛嘘P(guān)如何進(jìn)行spark join的源碼分析,文章內(nèi)容豐富且以專業(yè)的角度為大家分析和敘述,閱讀完這篇文章希望大家可以有所收獲。
主要從事網(wǎng)頁(yè)設(shè)計(jì)、PC網(wǎng)站建設(shè)(電腦版網(wǎng)站建設(shè))、wap網(wǎng)站建設(shè)(手機(jī)版網(wǎng)站建設(shè))、成都響應(yīng)式網(wǎng)站建設(shè)公司、程序開發(fā)、微網(wǎng)站、成都小程序開發(fā)等,憑借多年來在互聯(lián)網(wǎng)的打拼,我們?cè)诨ヂ?lián)網(wǎng)網(wǎng)站建設(shè)行業(yè)積累了豐富的網(wǎng)站設(shè)計(jì)制作、成都網(wǎng)站制作、網(wǎng)絡(luò)營(yíng)銷經(jīng)驗(yàn),集策劃、開發(fā)、設(shè)計(jì)、營(yíng)銷、管理等多方位專業(yè)化運(yùn)作于一體,具備承接不同規(guī)模與類型的建設(shè)項(xiàng)目的能力。
import org.apache.spark.rdd.RDD import org.apache.spark.{SparkConf, SparkContext} object JoinDemo { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]") val sc = new SparkContext(conf) sc.setLogLevel("WARN") val random = scala.util.Random val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx")) val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD")) val rdd1: RDD[(Int, String)] = sc.makeRDD(col1) val rdd2: RDD[(Int, String)] = sc.makeRDD(col2) val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2) println (rdd3.dependencies) val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(newHashPartitioner(3))) println(rdd4.dependencies) sc.stop() } }
1.兩個(gè)打印語(yǔ)句: List(org.apache.spark.OneToOneDependency@63acf8f6) List(org.apache.spark.OneToOneDependency@d9a498) 對(duì)應(yīng)的依賴: rdd3對(duì)應(yīng)的是寬依賴,rdd4對(duì)應(yīng)的是窄依賴 原因: 1)參考webUI 由DAG圖可以看出,第一個(gè)join和之前的清晰劃分成單獨(dú)的Satge??梢钥闯鲞@個(gè)是寬依賴。第二個(gè)join,partitionBy之后再進(jìn)行join并沒有單獨(dú)劃分成一個(gè)stage,由此可見是一個(gè)窄依賴。
rdd3的join
rdd4的join
2)代碼解析: a.首先是默認(rèn)的join方法,這里使用了一個(gè)默認(rèn)分區(qū)器
/** * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and * (k, v2) is in `other`. Performs a hash join across the cluster. */ def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope { join(other, defaultPartitioner(self, other)) }
b.默認(rèn)分區(qū)器,對(duì)于第一個(gè)join會(huì)返回一個(gè)以電腦core總數(shù)為分區(qū)數(shù)量的HashPartitioner.第二個(gè)join會(huì)返回我們?cè)O(shè)定的HashPartitioner(分區(qū)數(shù)目3)
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val rdds = (Seq(rdd) ++ others) val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0)) val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) { Some(hasPartitioner.maxBy(_.partitions.length)) } else { None } val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) { rdd.context.defaultParallelism } else { rdds.map(_.partitions.length).max } // If the existing max partitioner is an eligible one, or its partitions number is larger // than the default number of partitions, use the existing partitioner. if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) || defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) { hasMaxPartitioner.get.partitioner.get } else { new HashPartitioner(defaultNumPartitions) } }
c.走到了實(shí)際執(zhí)行的join方法,里面flatMapValues是一個(gè)窄依賴,所以說如果有寬依賴應(yīng)該在cogroup算子中
/** * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD. */ def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope { this.cogroup(other, partitioner).flatMapValues( pair => for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w) ) }
d.進(jìn)入cogroup方法中,核心是CoGroupedRDD,根據(jù)兩個(gè)需要join的rdd和一個(gè)分區(qū)器。由于第一個(gè)join的時(shí)候,兩個(gè)rdd都沒有分區(qū)器,所以在這一步,兩個(gè)rdd需要先根據(jù)傳入的分區(qū)器進(jìn)行一次shuffle,因此第一個(gè)join是寬依賴。第二個(gè)join此時(shí)已經(jīng)分好區(qū)了,不需要再再進(jìn)行shuffle了。所以第二個(gè)是窄依賴
/** * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the * list of values for that key in `this` as well as `other`. */ def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner) : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope { if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) { throw new SparkException("HashPartitioner cannot partition array keys.") } val cg = new CoGroupedRDD[K](Seq(self, other), partitioner) cg.mapValues { case Array(vs, w1s) => (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]]) } }
e.兩個(gè)都打印出OneToOneDependency,是因?yàn)樵贑oGroupedRDD里面,getDependencies方法里面,如果rdd有partitioner就都會(huì)返回OneToOneDependency(rdd)。
override def getDependencies: Seq[Dependency[_]] = { rdds.map { rdd: RDD[_] => if (rdd.partitioner == Some(part)) { logDebug("Adding one-to-one dependency with " + rdd) new OneToOneDependency(rdd) } else { logDebug("Adding shuffle dependency with " + rdd) new ShuffleDependency[K, Any, CoGroupCombiner]( rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer) } } }
join什么時(shí)候是寬依賴什么時(shí)候是窄依賴? 由上述分析可以知道,如果需要join的兩個(gè)表,本身已經(jīng)有分區(qū)器,且分區(qū)的數(shù)目相同,此時(shí),相同的key在同一個(gè)分區(qū)內(nèi)。就是窄依賴。反之,如果兩個(gè)需要join的表中沒有分區(qū)器或者分區(qū)數(shù)量不同,在join的時(shí)候需要shuffle,那么就是寬依賴
上述就是小編為大家分享的如何進(jìn)行spark join的源碼分析了,如果剛好有類似的疑惑,不妨參照上述分析進(jìn)行理解。如果想知道更多相關(guān)知識(shí),歡迎關(guān)注創(chuàng)新互聯(lián)行業(yè)資訊頻道。
分享題目:如何進(jìn)行sparkjoin的源碼分析
網(wǎng)頁(yè)URL:http://www.rwnh.cn/article2/jcjpic.html
成都網(wǎng)站建設(shè)公司_創(chuàng)新互聯(lián),為您提供小程序開發(fā)、網(wǎng)站內(nèi)鏈、、網(wǎng)頁(yè)設(shè)計(jì)公司、外貿(mào)網(wǎng)站建設(shè)、企業(yè)建站
聲明:本網(wǎng)站發(fā)布的內(nèi)容(圖片、視頻和文字)以用戶投稿、用戶轉(zhuǎn)載內(nèi)容為主,如果涉及侵權(quán)請(qǐng)盡快告知,我們將會(huì)在第一時(shí)間刪除。文章觀點(diǎn)不代表本網(wǎng)站立場(chǎng),如需處理請(qǐng)聯(lián)系客服。電話:028-86922220;郵箱:631063699@qq.com。內(nèi)容未經(jīng)允許不得轉(zhuǎn)載,或轉(zhuǎn)載時(shí)需注明來源: 創(chuàng)新互聯(lián)