Every #DataEngineer should know what Spark Shuffle is and why it happens.
Get more info here: linkedin.com
Or read on in the thread ๐
Also subscribe to my newsletter to get similar content weekly: swirlai.substack.com
#Data #MLOps #DataEngineering
Get more info here: linkedin.com
Or read on in the thread ๐
Also subscribe to my newsletter to get similar content weekly: swirlai.substack.com
#Data #MLOps #DataEngineering
โก๏ธ Spark Jobs are executed against Data Partitions.
โก๏ธ Each Spark Task is connected to a single Partition.
โก๏ธ Data Partitions are immutable - this helps with disaster recovery in Big Data applications.
โก๏ธ After each Transformation - number of Child Partitions will be created.
โก๏ธ Each Spark Task is connected to a single Partition.
โก๏ธ Data Partitions are immutable - this helps with disaster recovery in Big Data applications.
โก๏ธ After each Transformation - number of Child Partitions will be created.
โก๏ธ Each Child Partition will be derived from one or more Parent Partitions and will act as Parent Partition for future Transformations.
โก๏ธ Shuffle is a procedure when creation of Child Partitions involves data movement between Data Containers and Spark Executors over the network.
โก๏ธ Shuffle is a procedure when creation of Child Partitions involves data movement between Data Containers and Spark Executors over the network.
โก๏ธ There are two types of Transformations:
๐ก๐ฎ๐ฟ๐ฟ๐ผ๐ ๐ง๐ฟ๐ฎ๐ป๐๐ณ๐ผ๐ฟ๐บ๐ฎ๐๐ถ๐ผ๐ป๐
โก๏ธ These are simple transformations that can be applied locally without moving Data between Data Containers.
โก๏ธ Locality is made possible due to cross-record context not being needed for the Transformation logic.
โก๏ธ These are simple transformations that can be applied locally without moving Data between Data Containers.
โก๏ธ Locality is made possible due to cross-record context not being needed for the Transformation logic.
๐๐ถ๐ฏ๐ค๐ต๐ช๐ฐ๐ฏ๐ด ๐ต๐ฉ๐ข๐ต ๐ต๐ณ๐ช๐จ๐จ๐ฆ๐ณ ๐๐ข๐ณ๐ณ๐ฐ๐ธ ๐๐ณ๐ข๐ฏ๐ด๐ง๐ฐ๐ณ๐ฎ๐ข๐ต๐ช๐ฐ๐ฏ๐ด
๐ map()
๐ mapPartition()
๐ flatMap()
๐ filter()
๐ union()
๐ contains()
๐ โฆ
๐ map()
๐ mapPartition()
๐ flatMap()
๐ filter()
๐ union()
๐ contains()
๐ โฆ
๐ช๐ถ๐ฑ๐ฒ ๐ง๐ฟ๐ฎ๐ป๐๐ณ๐ผ๐ฟ๐บ๐ฎ๐๐ถ๐ผ๐ป๐
โก๏ธ These are complicated transformations that trigger Data movement between Data Containers.
โก๏ธ This movement of Data is necessary due to cross-record dependencies for a given Transformation type.
โก๏ธ These are complicated transformations that trigger Data movement between Data Containers.
โก๏ธ This movement of Data is necessary due to cross-record dependencies for a given Transformation type.
๐๐ถ๐ฏ๐ค๐ต๐ช๐ฐ๐ฏ๐ด ๐ต๐ฉ๐ข๐ต ๐ต๐ณ๐ช๐จ๐จ๐ฆ๐ณ ๐๐ช๐ฅ๐ฆ ๐๐ณ๐ข๐ฏ๐ด๐ง๐ฐ๐ณ๐ฎ๐ข๐ต๐ช๐ฐ๐ฏ๐ด
๐ groupByKey()
๐ aggregateByKey()
๐ groupBy()
๐ aggregate()
๐ join()
๐ repartition()
๐ โฆ
๐ groupByKey()
๐ aggregateByKey()
๐ groupBy()
๐ aggregate()
๐ join()
๐ repartition()
๐ โฆ
โ๏ธShuffle is an expensive operation as it requires movement of Data through the Network.
โ๏ธShuffle procedure also impacts disk I/O since shuffled Data is saved to Disk.
โ๏ธTune your applications to have as little Shuffle Operations as possible.
โ๏ธShuffle procedure also impacts disk I/O since shuffled Data is saved to Disk.
โ๏ธTune your applications to have as little Shuffle Operations as possible.
โ
If Shuffle is necessary - use ๐ฝ๐ฎ๐ฟ๐ธ.๐๐พ๐น.๐๐ต๐๐ณ๐ณ๐น๐ฒ.๐ฝ๐ฎ๐ฟ๐๐ถ๐๐ถ๐ผ๐ป๐ configuration to tune the number of partitions created after shuffle (defaults to 200).
โ
It is good idea to consider the number of cores your cluster will be working with. Rule of thumb could be having partition numbers set to one or two times more than available cores.
Loading suggestions...