Aurimas Griciūnas

11 Tweets 2 reads Nov 11, 2022

Every #DataEngineer should know what Spark Shuffle is and why it happens.
Get more info here: linkedin.com
Or read on in the thread 👇
Also subscribe to my newsletter to get similar content weekly: swirlai.substack.com
#Data #MLOps #DataEngineering

➡️ Spark Jobs are executed against Data Partitions.
➡️ Each Spark Task processes a single Partition.
➡️ Data Partitions are immutable - this helps with fault tolerance in Big Data applications, since a lost Partition can be recomputed from its Parent Partitions.
➡️ After each Transformation, a number of Child Partitions is created.

➡️ Each Child Partition will be derived from one or more Parent Partitions and will act as Parent Partition for future Transformations.
➡️ Shuffle is the procedure in which creating Child Partitions requires moving data between Data Containers and Spark Executors over the network.

➡️ There are two types of Transformations:

𝗡𝗮𝗿𝗿𝗼𝘄 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀
➡️ These are simple transformations that can be applied locally without moving Data between Data Containers.
➡️ Locality is possible because the Transformation logic does not need cross-record context.

𝘍𝘶𝘯𝘤𝘵𝘪𝘰𝘯𝘴 𝘵𝘩𝘢𝘵 𝘵𝘳𝘪𝘨𝘨𝘦𝘳 𝘕𝘢𝘳𝘳𝘰𝘸 𝘛𝘳𝘢𝘯𝘴𝘧𝘰𝘳𝘮𝘢𝘵𝘪𝘰𝘯𝘴
👉 map()
👉 mapPartitions()
👉 flatMap()
👉 filter()
👉 union()
👉 contains() (as a predicate inside filter())
👉 …
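
To make the "no data movement" point concrete, here is a minimal sketch in plain Python rather than Spark itself. Partitions are modelled as lists, and the helper names (narrow_map, narrow_filter) are hypothetical, made up for illustration only. The thing to notice is that every Child Partition is built from exactly one Parent Partition.

```python
from typing import Callable, List

Partition = List[int]


def narrow_map(partitions: List[Partition], fn: Callable[[int], int]) -> List[Partition]:
    # Each child partition is computed from exactly one parent partition,
    # so no record ever has to leave the partition it started in.
    return [[fn(x) for x in part] for part in partitions]


def narrow_filter(partitions: List[Partition], pred: Callable[[int], bool]) -> List[Partition]:
    # Filtering only looks at one record at a time: no cross-record context.
    return [[x for x in part if pred(x)] for part in partitions]


# Three parent partitions (a toy stand-in for data spread across executors).
parents = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

doubled = narrow_map(parents, lambda x: x * 2)
evens = narrow_filter(parents, lambda x: x % 2 == 0)

print(doubled)  # [[2, 4, 6], [8, 10], [12, 14, 16, 18]]
print(evens)    # [[2], [4], [6, 8]]
```

The partition count and boundaries stay the same before and after, which is why transformations of this kind can run without a Shuffle.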

𝗪𝗶𝗱𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀
➡️ These are complicated transformations that trigger Data movement between Data Containers.
➡️ This movement of Data is necessary due to cross-record dependencies for a given Transformation type.

𝘍𝘶𝘯𝘤𝘵𝘪𝘰𝘯𝘴 𝘵𝘩𝘢𝘵 𝘵𝘳𝘪𝘨𝘨𝘦𝘳 𝘞𝘪𝘥𝘦 𝘛𝘳𝘢𝘯𝘴𝘧𝘰𝘳𝘮𝘢𝘵𝘪𝘰𝘯𝘴
👉 groupByKey()
👉 aggregateByKey()
👉 groupBy()
👉 aggregate()
👉 join()
👉 repartition()
👉 …
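
A rough sketch of why grouping by key forces data movement, again in plain Python and not Spark's actual implementation. The function names and the CRC32-based partitioner are assumptions chosen for illustration. All records sharing a key must land in the same Child Partition, so records from every Parent Partition may have to travel.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int]


def target_partition(key: str, num_partitions: int) -> int:
    # Stable hash partitioner (hypothetical choice): same key -> same child partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_group_by_key(
    parents: List[List[Record]], num_partitions: int
) -> List[Dict[str, List[int]]]:
    # One bucket per child partition; each bucket maps key -> grouped values.
    children = [defaultdict(list) for _ in range(num_partitions)]
    for part in parents:
        for key, value in part:
            # This is the "shuffle": a record is routed by its key,
            # regardless of which parent partition it came from.
            children[target_partition(key, num_partitions)][key].append(value)
    return [dict(child) for child in children]


# Key "a" appears in all three parent partitions.
parents = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]

children = shuffle_group_by_key(parents, num_partitions=2)
for index, child in enumerate(children):
    print(index, child)
```

Each Child Partition here depends on several Parent Partitions, which is exactly the cross-record dependency that makes the transformation wide.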

❗️Shuffle is an expensive operation as it requires movement of Data through the Network.
❗️The Shuffle procedure also adds disk I/O, since shuffled Data is written to Disk.
❗️Tune your applications to have as few Shuffle Operations as possible.

✅ If Shuffle is necessary - use the 𝘀𝗽𝗮𝗿𝗸.𝘀𝗾𝗹.𝘀𝗵𝘂𝗳𝗳𝗹𝗲.𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀 configuration to tune the number of partitions created after a shuffle (defaults to 200).

✅ It is a good idea to consider the number of cores your cluster will be working with. A rule of thumb is to set the number of partitions to one to two times the number of available cores.
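
The rule of thumb can be sketched as a tiny helper. The function names and the factor parameter are hypothetical. apply_shuffle_partitions assumes it is handed an existing PySpark SparkSession and uses its spark.conf.set call. Treat the result as a starting point to measure against, not a guaranteed optimum.

```python
def suggested_shuffle_partitions(total_cores: int, factor: int = 2) -> int:
    # Rule of thumb from the thread: one to two times the available cores.
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return total_cores * factor


def apply_shuffle_partitions(spark, total_cores: int, factor: int = 2) -> int:
    # `spark` is assumed to be an existing pyspark.sql.SparkSession.
    partitions = suggested_shuffle_partitions(total_cores, factor)
    spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
    return partitions


# Example: a cluster with 16 executor cores in total (made-up number).
print(suggested_shuffle_partitions(16))            # 32
print(suggested_shuffle_partitions(16, factor=1))  # 16
```

With a live session you would call apply_shuffle_partitions(spark, total_cores=16) before running the job that shuffles.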
