Aurimas Griciūnas

@Aurimas_Gr

2 Tweets 7 reads Jun 19, 2023
What is under the hood of Spark?
Apache Spark is an extremely popular distributed processing framework that uses in-memory processing to speed up task execution. Spark Core is the foundation layer on top of which the rest of its libraries are built.
Let's look into some of Spark's Architecture Basics.
๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—ต๐—ฎ๐˜€ ๐˜€๐—ฒ๐˜ƒ๐—ฒ๐—ฟ๐—ฎ๐—น ๐—ต๐—ถ๐—ด๐—ต ๐—น๐—ฒ๐˜ƒ๐—ฒ๐—น ๐—”๐—ฃ๐—œ๐˜€ ๐—ฏ๐˜‚๐—ถ๐—น๐˜ ๐—ผ๐—ป ๐˜๐—ผ๐—ฝ ๐—ผ๐—ณ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—–๐—ผ๐—ฟ๐—ฒ ๐˜๐—ผ ๐˜€๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜ ๐—ฑ๐—ถ๐—ณ๐—ณ๐—ฒ๐—ฟ๐—ฒ๐—ป๐˜ ๐˜‚๐˜€๐—ฒ ๐—ฐ๐—ฎ๐˜€๐—ฒ๐˜€:
โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ๐—ฆ๐—ค๐—Ÿ - Batch Processing.
โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ๐—ถ๐—ป๐—ด - Near to Real-Time Processing.
โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐— ๐—Ÿ๐—น๐—ถ๐—ฏ - Machine Learning.
โžก๏ธ ๐—š๐—ฟ๐—ฎ๐—ฝ๐—ต๐—ซ - Graph Structures and Algorithms.
๐—ฆ๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜๐—ฒ๐—ฑ ๐—ฝ๐—ฟ๐—ผ๐—ด๐—ฟ๐—ฎ๐—บ๐—บ๐—ถ๐—ป๐—ด ๐—Ÿ๐—ฎ๐—ป๐—ด๐˜‚๐—ฎ๐—ด๐—ฒ๐˜€:
โžก๏ธ Scala
โžก๏ธ Java
โžก๏ธ Python
โžก๏ธ R
๐—š๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐—น ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ:
1๏ธโƒฃ Once you submit a Spark Application - SparkContext Object is created in the Driver Program. This Object is responsible for communicating with the Cluster Manager.
2๏ธโƒฃ SparkContext negotiates with Cluster Manager for required resources to run Spark application. Cluster Manager allocates the resources inside of a respective Cluster and creates a requested number of Spark Executors.
3๏ธโƒฃ After starting - Spark Executors will connect with SparkContext to notify about joining the Cluster. Executors will be sending heartbeats regularly to notify the Driver Program that they are healthy and donโ€™t need rescheduling.
4๏ธโƒฃ Spark Executors are responsible for executing tasks of the Computation DAG (Directed Acyclic Graph). This could include reading, writing data or performing a certain operation on a partition of RDDs.
๐—ฆ๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜๐—ฒ๐—ฑ ๐—–๐—น๐˜‚๐˜€๐˜๐—ฒ๐—ฟ ๐— ๐—ฎ๐—ป๐—ฎ๐—ด๐—ฒ๐—ฟ๐˜€:
โžก๏ธ ๐—ฆ๐˜๐—ฎ๐—ป๐—ฑ๐—ฎ๐—น๐—ผ๐—ป๐—ฒ - simple cluster manager shipped together with Spark.
โžก๏ธ ๐—›๐—ฎ๐—ฑ๐—ผ๐—ผ๐—ฝ ๐—ฌ๐—”๐—ฅ๐—ก - resource manager of Hadoop ecosystem.ย 
โžก๏ธ ๐—”๐—ฝ๐—ฎ๐—ฐ๐—ต๐—ฒ ๐— ๐—ฒ๐˜€๐—ผ๐˜€ - general cluster manager (โ—๏ธ deprecated).
โžก๏ธ ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€ - popular open-source container orchestrator.
๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—๐—ผ๐—ฏ ๐—œ๐—ป๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น๐˜€:
๐Ÿ‘‰ Spark Driver is responsible for constructing an optimized physical execution plan for a given application submitted for execution.ย 
๐Ÿ‘‰ This plan materializes into a Job which is a DAG of Stages.ย 
๐Ÿ‘‰ Some of the Stages can be executed in parallel if they have no sequential dependencies.ย 
๐Ÿ‘‰ Each Stage is composed of Tasks.ย 
๐Ÿ‘‰ All Tasks of a single Stage contain the same type of work which is the smallest piece of work that can be executed in parallel and is performed by Spark Executors.
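The points above can be sketched with a small stdlib model (not Spark's scheduler; the stage graph is hypothetical): stages whose dependencies are all satisfied run in the same "wave", and each stage fans out into one task per partition.

```python
# Illustrative stdlib model of a Job as a DAG of Stages. NOT Spark's
# scheduler - just the scheduling shape described above.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical job: stages 1 and 2 both depend only on stage 0, so they
# can run in parallel; stage 3 must wait for both of them.
deps = {0: [], 1: [0], 2: [0], 3: [1, 2]}

def schedule_waves(deps):
    """Group stages into waves; every stage in a wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = [s for s in deps if s not in done and all(d in done for d in deps[s])]
        waves.append(sorted(ready))
        done.update(ready)
    return waves

def run_stage(stage, partitions):
    # Every Task of a Stage does the same work, each on its own partition.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: (stage, p * p), partitions))

waves = schedule_waves(deps)
stage0_results = run_stage(0, [1, 2, 3])
```

Here `schedule_waves` yields `[[0], [1, 2], [3]]`: the middle wave is exactly the "Stages executed in parallel" case from the thread.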
--------
Follow me to upskill in #MLOps, #MachineLearning, #DataEngineering, #DataScience and overall #Data space.
Also hit 🔔 to stay notified about new content.
Don't forget to like 💙, share and comment!
If you like the content, join a growing community of 11,000+ Data Tech Enthusiasts by subscribing to my Newsletter: newsletter.swirlai.com
