Aurimas Griciūnas

@Aurimas_Gr

2 Tweets 7 reads Jun 19, 2023
What is under the hood of Spark?
Apache Spark is an extremely popular distributed processing framework that uses in-memory processing to speed up task execution. Spark Core is the foundation layer on top of which the rest of its libraries are built.
Let's look into some of Spark's Architecture Basics.
๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—ต๐—ฎ๐˜€ ๐˜€๐—ฒ๐˜ƒ๐—ฒ๐—ฟ๐—ฎ๐—น ๐—ต๐—ถ๐—ด๐—ต ๐—น๐—ฒ๐˜ƒ๐—ฒ๐—น ๐—”๐—ฃ๐—œ๐˜€ ๐—ฏ๐˜‚๐—ถ๐—น๐˜ ๐—ผ๐—ป ๐˜๐—ผ๐—ฝ ๐—ผ๐—ณ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—–๐—ผ๐—ฟ๐—ฒ ๐˜๐—ผ ๐˜€๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜ ๐—ฑ๐—ถ๐—ณ๐—ณ๐—ฒ๐—ฟ๐—ฒ๐—ป๐˜ ๐˜‚๐˜€๐—ฒ ๐—ฐ๐—ฎ๐˜€๐—ฒ๐˜€:
โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ๐—ฆ๐—ค๐—Ÿ - Batch Processing.
โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ๐—ถ๐—ป๐—ด - Near to Real-Time Processing.
โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐— ๐—Ÿ๐—น๐—ถ๐—ฏ - Machine Learning.
โžก๏ธ ๐—š๐—ฟ๐—ฎ๐—ฝ๐—ต๐—ซ - Graph Structures and Algorithms.
๐—ฆ๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜๐—ฒ๐—ฑ ๐—ฝ๐—ฟ๐—ผ๐—ด๐—ฟ๐—ฎ๐—บ๐—บ๐—ถ๐—ป๐—ด ๐—Ÿ๐—ฎ๐—ป๐—ด๐˜‚๐—ฎ๐—ด๐—ฒ๐˜€:
โžก๏ธ Scala
โžก๏ธ Java
โžก๏ธ Python
โžก๏ธ R
๐—š๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐—น ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ:
1๏ธโƒฃ Once you submit a Spark Application - SparkContext Object is created in the Driver Program. This Object is responsible for communicating with the Cluster Manager.
2๏ธโƒฃ SparkContext negotiates with Cluster Manager for required resources to run Spark application. Cluster Manager allocates the resources inside of a respective Cluster and creates a requested number of Spark Executors.
3๏ธโƒฃ After starting - Spark Executors will connect with SparkContext to notify about joining the Cluster. Executors will be sending heartbeats regularly to notify the Driver Program that they are healthy and donโ€™t need rescheduling.
4๏ธโƒฃ Spark Executors are responsible for executing tasks of the Computation DAG (Directed Acyclic Graph). This could include reading, writing data or performing a certain operation on a partition of RDDs.
๐—ฆ๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜๐—ฒ๐—ฑ ๐—–๐—น๐˜‚๐˜€๐˜๐—ฒ๐—ฟ ๐— ๐—ฎ๐—ป๐—ฎ๐—ด๐—ฒ๐—ฟ๐˜€:
โžก๏ธ ๐—ฆ๐˜๐—ฎ๐—ป๐—ฑ๐—ฎ๐—น๐—ผ๐—ป๐—ฒ - simple cluster manager shipped together with Spark.
โžก๏ธ ๐—›๐—ฎ๐—ฑ๐—ผ๐—ผ๐—ฝ ๐—ฌ๐—”๐—ฅ๐—ก - resource manager of Hadoop ecosystem.ย 
โžก๏ธ ๐—”๐—ฝ๐—ฎ๐—ฐ๐—ต๐—ฒ ๐— ๐—ฒ๐˜€๐—ผ๐˜€ - general cluster manager (โ—๏ธ deprecated).
โžก๏ธ ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€ - popular open-source container orchestrator.
๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—๐—ผ๐—ฏ ๐—œ๐—ป๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น๐˜€:
๐Ÿ‘‰ Spark Driver is responsible for constructing an optimized physical execution plan for a given application submitted for execution.ย 
๐Ÿ‘‰ This plan materializes into a Job which is a DAG of Stages.ย 
๐Ÿ‘‰ Some of the Stages can be executed in parallel if they have no sequential dependencies.ย 
๐Ÿ‘‰ Each Stage is composed of Tasks.ย 
๐Ÿ‘‰ All Tasks of a single Stage contain the same type of work which is the smallest piece of work that can be executed in parallel and is performed by Spark Executors.
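The points above can be sketched with a small stdlib model (not Spark's scheduler; the stage graph is hypothetical): stages whose dependencies are all satisfied run in the same "wave", and each stage fans out into one task per partition.

```python
# Illustrative stdlib model of a Job as a DAG of Stages. NOT Spark's
# scheduler - just the scheduling shape described above.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical job: stages 1 and 2 both depend only on stage 0, so they
# can run in parallel; stage 3 must wait for both of them.
deps = {0: [], 1: [0], 2: [0], 3: [1, 2]}

def schedule_waves(deps):
    """Group stages into waves; every stage in a wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = [s for s in deps if s not in done and all(d in done for d in deps[s])]
        waves.append(sorted(ready))
        done.update(ready)
    return waves

def run_stage(stage, partitions):
    # Every Task of a Stage does the same work, each on its own partition.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: (stage, p * p), partitions))

waves = schedule_waves(deps)
stage0_results = run_stage(0, [1, 2, 3])
```

Here `schedule_waves` yields `[[0], [1, 2], [3]]`: the middle wave is exactly the "Stages executed in parallel" case from the thread.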
--------
Follow me to upskill in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space.
Also hit 🔔 to stay notified about new content.
Don't forget to like 💙, share and comment!
If you like the content, join a growing community of 11,000+ Data Tech Enthusiasts by subscribing to my Newsletter: newsletter.swirlai.com
