Aurimas Griciลซnas
Aurimas Griciลซnas

@Aurimas_Gr

11 Tweets 2 reads Nov 11, 2022
Every #DataEngineer should know what Spark Shuffle is and why it happens.
Get more info here: linkedin.com
Or read on in the thread ๐Ÿ‘‡
Also subscribe to my newsletter to get similar content weekly: swirlai.substack.com
#Data #MLOps #DataEngineering
โžก๏ธ Spark Jobs are executed against Data Partitions.
โžก๏ธ Each Spark Task is connected to a single Partition.
โžก๏ธ Data Partitions are immutable - this helps with disaster recovery in Big Data applications.
โžก๏ธ After each Transformation - number of Child Partitions will be created.
โžก๏ธ Each Child Partition will be derived from one or more Parent Partitions and will act as Parent Partition for future Transformations.
โžก๏ธ Shuffle is a procedure when creation of Child Partitions involves data movement between Data Containers and Spark Executors over the network.
โžก๏ธ There are two types of Transformations:
๐—ก๐—ฎ๐—ฟ๐—ฟ๐—ผ๐˜„ ๐—ง๐—ฟ๐—ฎ๐—ป๐˜€๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐—ถ๐—ผ๐—ป๐˜€
โžก๏ธ These are simple transformations that can be applied locally without moving Data between Data Containers.
โžก๏ธ Locality is made possible due to cross-record context not being needed for the Transformation logic.
๐˜๐˜ถ๐˜ฏ๐˜ค๐˜ต๐˜ช๐˜ฐ๐˜ฏ๐˜ด ๐˜ต๐˜ฉ๐˜ข๐˜ต ๐˜ต๐˜ณ๐˜ช๐˜จ๐˜จ๐˜ฆ๐˜ณ ๐˜•๐˜ข๐˜ณ๐˜ณ๐˜ฐ๐˜ธ ๐˜›๐˜ณ๐˜ข๐˜ฏ๐˜ด๐˜ง๐˜ฐ๐˜ณ๐˜ฎ๐˜ข๐˜ต๐˜ช๐˜ฐ๐˜ฏ๐˜ด
๐Ÿ‘‰ map()
๐Ÿ‘‰ mapPartition()
๐Ÿ‘‰ flatMap()
๐Ÿ‘‰ filter()
๐Ÿ‘‰ union()
๐Ÿ‘‰ contains()
๐Ÿ‘‰ โ€ฆ
๐—ช๐—ถ๐—ฑ๐—ฒ ๐—ง๐—ฟ๐—ฎ๐—ป๐˜€๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐—ถ๐—ผ๐—ป๐˜€
โžก๏ธ These are complicated transformations that trigger Data movement between Data Containers.
โžก๏ธ This movement of Data is necessary due to cross-record dependencies for a given Transformation type.
๐˜๐˜ถ๐˜ฏ๐˜ค๐˜ต๐˜ช๐˜ฐ๐˜ฏ๐˜ด ๐˜ต๐˜ฉ๐˜ข๐˜ต ๐˜ต๐˜ณ๐˜ช๐˜จ๐˜จ๐˜ฆ๐˜ณ ๐˜ž๐˜ช๐˜ฅ๐˜ฆ ๐˜›๐˜ณ๐˜ข๐˜ฏ๐˜ด๐˜ง๐˜ฐ๐˜ณ๐˜ฎ๐˜ข๐˜ต๐˜ช๐˜ฐ๐˜ฏ๐˜ด
๐Ÿ‘‰ groupByKey()
๐Ÿ‘‰ aggregateByKey()
๐Ÿ‘‰ groupBy()
๐Ÿ‘‰ aggregate()
๐Ÿ‘‰ join()
๐Ÿ‘‰ repartition()
๐Ÿ‘‰ โ€ฆ
โ—๏ธShuffle is an expensive operation as it requires movement of Data through the Network.
โ—๏ธShuffle procedure also impacts disk I/O since shuffled Data is saved to Disk.
โ—๏ธTune your applications to have as little Shuffle Operations as possible.
โœ… If Shuffle is necessary - use ๐—ฝ๐—ฎ๐—ฟ๐—ธ.๐˜€๐—พ๐—น.๐˜€๐—ต๐˜‚๐—ณ๐—ณ๐—น๐—ฒ.๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€ configuration to tune the number of partitions created after shuffle (defaults to 200).
โœ… It is good idea to consider the number of cores your cluster will be working with. Rule of thumb could be having partition numbers set to one or two times more than available cores.

Loading suggestions...