What is under the hood of Spark?
Apache Spark is an extremely popular distributed processing framework that uses in-memory processing to speed up task execution. Most of its libraries build on the Spark Core layer.
Let's look into some of Spark's Architecture Basics.
Spark has several high-level APIs built on top of Spark Core to support different use cases:
➡️ Spark SQL - Batch Processing.
➡️ Spark Streaming - Near Real-Time Processing.
➡️ Spark MLlib - Machine Learning.
➡️ GraphX - Graph Structures and Algorithms.
Supported programming languages:
➡️ Scala
➡️ Java
➡️ Python
➡️ R
General Architecture:
1️⃣ Once you submit a Spark Application, a SparkContext object is created in the Driver Program. This object is responsible for communicating with the Cluster Manager.
2️⃣ SparkContext negotiates with the Cluster Manager for the resources required to run the Spark application. The Cluster Manager allocates the resources inside the respective Cluster and creates the requested number of Spark Executors.
3️⃣ After starting, Spark Executors connect with SparkContext to notify it that they have joined the Cluster. Executors regularly send heartbeats to notify the Driver Program that they are healthy and don't need rescheduling.
4️⃣ Spark Executors are responsible for executing the tasks of the computation DAG (Directed Acyclic Graph). This can include reading or writing data, or performing a certain operation on a partition of RDDs.
Supported Cluster Managers:
➡️ Standalone - a simple cluster manager shipped together with Spark.
➡️ Hadoop YARN - the resource manager of the Hadoop ecosystem.
➡️ Apache Mesos - a general cluster manager (⚠️ deprecated).
➡️ Kubernetes - a popular open-source container orchestrator.
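The cluster manager is chosen at submission time via the `--master` flag of `spark-submit`; the hosts, ports and application file below are placeholders, not real endpoints:

```shell
# Placeholder hosts/ports/paths - substitute your own deployment details.
spark-submit --master spark://master-host:7077 my_app.py    # Standalone
spark-submit --master yarn my_app.py                        # Hadoop YARN
spark-submit --master k8s://https://k8s-host:6443 my_app.py # Kubernetes
```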
Spark Job Internals:
👉 The Spark Driver is responsible for constructing an optimized physical execution plan for a given application submitted for execution.
👉 This plan materializes into a Job, which is a DAG of Stages.
👉 Some of the Stages can be executed in parallel if they have no sequential dependencies.
👉 Each Stage is composed of Tasks.
👉 All Tasks of a single Stage contain the same type of work, which is the smallest unit of work that can be executed in parallel; it is performed by Spark Executors.
--------
Follow me to upskill in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space.
Also hit 🔔 to stay notified about new content.
Don't forget to like 👍, share and comment!
Join a growing community of Data Professionals by subscribing to my Newsletter.
If you like the content, be sure to join 11,000+ Data Tech Enthusiasts by subscribing to my newsletter: newsletter.swirlai.com