
Apache Spark vs Hadoop

Main differences between Hadoop and Spark

By ansam yousry · Published about a year ago · 5 min read
Hadoop vs Spark

Hadoop and Spark are both big data processing frameworks that are designed to handle large volumes of data, but they differ in a few key ways:

  1. Architecture: Hadoop MapReduce follows a strictly batch-oriented model and writes intermediate results to disk between stages, while Spark keeps data in memory and supports both batch jobs and near-real-time stream processing (handled as micro-batches).
  2. Performance: Spark is generally faster than Hadoop for big data processing tasks because it is designed to process data in memory. Hadoop, on the other hand, is designed to process data on disk, which is slower.
  3. Programming model: Hadoop uses a MapReduce programming model, which can be more complex for developers to learn and use than Spark’s API. Spark, on the other hand, has a more user-friendly API that makes it easier for developers to build big data applications.
  4. Use cases: Both Hadoop and Spark can be used for a wide range of big data processing tasks, but Spark is generally better suited for real-time processing and interactive data analysis, while Hadoop is better suited for batch processing and offline data analysis.

Hadoop Architecture:

Hadoop Distributed File System (HDFS):

Apache Hadoop Distributed File System (HDFS) is a distributed file system that is designed to run on commodity hardware. It has a master/slave architecture, with one node designated as the NameNode and the rest as DataNodes. The NameNode manages the file system namespace and regulates access to files by clients, while the DataNodes store the actual data blocks.

HDFS is designed to be fault-tolerant and to support very large file sizes (in the order of terabytes or even petabytes). It does this by distributing the data blocks of a file across multiple nodes in the cluster and replicating each block on multiple nodes for redundancy. This allows the file system to continue operating even if some of the nodes in the cluster fail.

One of the key features of HDFS is its support for “data locality,” which means that it tries to store data blocks on the same node as the task that will process that data. This helps to reduce network traffic and improve the performance of MapReduce jobs.

HDFS is an important component of the Hadoop ecosystem, as it provides the storage layer for Hadoop-based systems and enables users to store and process large datasets in a distributed manner.
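
As a rough, hedged illustration of these ideas, the snippet below shells out from Python to the standard hdfs command-line tools to upload a file, list it, and inspect how its blocks are replicated across DataNodes. The paths and file names are placeholders for the example, not values from a real cluster:

import subprocess

def run(cmd):
    # Small helper: print and execute an HDFS shell command.
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Copy a local file into HDFS; its blocks are distributed across DataNodes
# and replicated according to the configured replication factor.
run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"])
run(["hdfs", "dfs", "-put", "local_dataset.txt", "/user/demo/dataset.txt"])

# List the file as seen through the NameNode's namespace.
run(["hdfs", "dfs", "-ls", "/user/demo"])

# Show the blocks that make up the file and which DataNodes hold each replica.
run(["hdfs", "fsck", "/user/demo/dataset.txt", "-files", "-blocks", "-locations"])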

Hadoop YARN:

Apache Hadoop YARN (Yet Another Resource Negotiator) is a resource management platform that is responsible for managing compute resources in a Hadoop cluster and scheduling tasks to run on those resources. It was introduced in Hadoop 2.0 to take over the cluster resource-management duties that were built into the MapReduce engine (the JobTracker) in Hadoop 1.x.

YARN is designed to be a generic platform that can support a wide variety of processing frameworks, such as batch processing, stream processing, interactive processing, and machine learning. It provides a central resource management platform that can be used to manage and schedule tasks across a Hadoop cluster, regardless of the processing framework being used.

Some key features of YARN include:

Resource scheduling: YARN is responsible for allocating resources (e.g., CPU, memory) to tasks that are running on the cluster. It uses a scheduling algorithm to determine which tasks should be given priority and how resources should be allocated among the tasks.

Application isolation: YARN enables different types of processing frameworks to run on the same cluster without interference. It does this by isolating the resources that are allocated to each application so that one application cannot consume all of the resources in the cluster.

Fault tolerance: YARN is designed to be highly fault-tolerant, so processing can continue even if individual nodes in the cluster fail. The ResourceManager detects failed NodeManagers and containers and reschedules the affected tasks on healthy nodes (replication of the data itself is handled by HDFS).

Overall, YARN is an important component of the Hadoop ecosystem that enables users to run a variety of big data processing frameworks on a single cluster while managing and allocating resources efficiently.
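
As a concrete, deliberately simplified sketch of working with YARN, the Python snippet below uses the standard yarn and hadoop command-line tools to inspect a cluster and submit a sample job. The example jar path and its arguments are placeholders, not part of the original article:

import subprocess

def run(cmd):
    # Small helper: print and execute a YARN/Hadoop shell command.
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# List the NodeManagers (worker nodes) whose resources YARN can schedule.
run(["yarn", "node", "-list"])

# List the applications currently known to the ResourceManager.
run(["yarn", "application", "-list"])

# Submit a sample MapReduce job; YARN allocates containers (CPU, memory) for it.
# The jar location is a placeholder and depends on the Hadoop installation.
run(["hadoop", "jar",
     "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar",
     "pi", "4", "1000"])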

Hadoop MapReduce:

Apache Hadoop MapReduce is a programming model and an associated implementation for processing large data sets with a parallel, distributed algorithm on a cluster. It is an open-source software framework that was developed as part of the Apache Hadoop project.

The MapReduce model is based on the “map” and “reduce” functions that are provided by the programmer. The “map” function takes a set of key-value pairs as input and processes them to produce a set of intermediate key-value pairs. The “reduce” function takes the intermediate key-value pairs as input and merges them to produce a set of output key-value pairs.

Here is a simple example of how the MapReduce model might be used to count the number of occurrences of each word in a large dataset:

The input to the MapReduce job is a set of text files.

The “map” function reads in a file and processes it line by line. For each line, it splits the line into words and emits a key-value pair for each word, where the key is the word and the value is the number 1.

The intermediate key-value pairs are sorted by key and then passed to the “reduce” function.

The “reduce” function receives a list of values for each key and sums them to produce the final output.
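
As a sketch of how these steps might look in practice, here is the word-count mapper and reducer written as two small Python scripts for Hadoop Streaming. The file names are assumptions for the example, and the reducer relies on the framework having sorted its input by key:

# mapper.py: emit "<word> TAB 1" for every word on every input line.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py: input arrives sorted by key, so counts can be summed per word.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")

These scripts could then be submitted with the Hadoop Streaming jar (whose exact path is installation-specific), for example: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /data/books -output /data/wordcounts. The input and output directories here are placeholders.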

MapReduce is a powerful tool for processing large datasets because it allows the programmer to specify a parallel, distributed algorithm and have it executed on a cluster of machines. However, it can be somewhat complex to use, especially for programmers who are not familiar with the MapReduce model.

In Hadoop 2.0, YARN took over cluster resource management from the MapReduce engine. The MapReduce programming model is still fully supported, however, and runs as one of the application frameworks on top of YARN.

Spark Architecture:

At the center of the Spark architecture is the Spark driver, which is responsible for converting a user’s Spark program into a set of parallel tasks that are executed on the cluster. The driver maintains the overall context of the Spark application and, depending on the deploy mode, runs either on the client machine or on a node inside the cluster.

The Spark driver communicates with a number of executors, which are responsible for executing the tasks that are assigned to them by the driver. The executors are responsible for running the actual code that is defined in the user’s Spark program, as well as for storing and caching data that is used by the tasks.

One of the key features of the Spark architecture is its support for in-memory processing, which allows it to process data much faster than traditional big data processing frameworks that rely on disk-based storage. Spark also has a number of advanced features, such as support for stream processing, machine learning, and graph processing, which make it a powerful and flexible platform for big data processing.
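
To make the driver/executor split more concrete, here is a minimal PySpark word-count sketch; the HDFS path is a placeholder. The script itself is the driver program: it creates the SparkSession and defines the transformations, while the executors run the resulting tasks and hold the cached data in memory:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

# The driver: creates the session and plans the job.
spark = SparkSession.builder.appName("word-count-example").getOrCreate()

# Executors read the file's blocks in parallel (placeholder HDFS path).
lines = spark.read.text("hdfs:///user/demo/dataset.txt")

# Split each line into words and count the occurrences of each word.
words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))
counts = words.where(col("word") != "").groupBy("word").count()

# cache() keeps the counted data in executor memory, so repeated actions
# on it do not recompute the whole pipeline from disk.
counts.cache()
counts.orderBy(col("count"), ascending=False).show(10)

spark.stop()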

I hope you enjoyed reading this and found it informative. Feel free to add your comments, thoughts, or feedback, and don’t forget to get in touch on LinkedIn or follow my Medium account to stay updated.
