Hadoop

Introduction

Hadoop is an open-source software platform, written in Java, for distributed storage and distributed processing of big data. It consists of several components that work together to load, process, and store large data sets.

Hadoop Architecture

The Hadoop architecture consists mainly of the following components:

MapReduce

Note: For more information, read about MapReduce here.

MapReduce is a core component of the Hadoop platform. Its main functionality is to split large amounts of data into smaller chunks and distribute those chunks across multiple servers for massively parallel processing. The architecture splits a workload into a massive number of smaller tasks whose results are later re-combined into a single data set.
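The split / process / re-combine flow above can be simulated on a single machine. The sketch below is purely illustrative (it uses plain Python, not Hadoop's Java API): each chunk stands in for data on a different worker node, and the map, shuffle, and reduce steps mirror what the cluster does at scale.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by key across every chunk.
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into one final count.
    return {key: sum(values) for key, values in groups.items()}

# Pretend these chunks live on different worker nodes.
chunks = ["big data big", "data platform", "big platform"]
result = reduce_phase(shuffle(map_phase(c) for c in chunks))
print(result)  # {'big': 3, 'data': 2, 'platform': 2}
```

In a real Hadoop job the map and reduce functions run on different machines and the shuffle moves data over the network; the logic, however, is the same.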

Hadoop Distributed File System (HDFS)

Note: For more information, read about Hadoop Distributed File System (HDFS).

The Hadoop Distributed File System provides storage in a Hadoop cluster. It is designed to offer a scalable, highly available storage layer on commodity hardware for distributed processing and querying workloads.
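Two numbers drive HDFS's scalability and availability story: the block size (128 MB by default) and the replication factor (3 by default). The sketch below is not the real HDFS placement algorithm, just an illustration of how a file is split into blocks and how each block ends up replicated on several nodes; the round-robin placement is an assumption for demonstration only.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def plan_blocks(file_size, nodes):
    """Return a hypothetical placement: one entry per block, each listing
    the nodes that hold a replica (round-robin, purely illustrative)."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = []
    for b in range(num_blocks):
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placement.append(replicas)
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = plan_blocks(file_size=300 * 1024 * 1024, nodes=nodes)
print(len(plan))   # 3 blocks (300 MB / 128 MB, rounded up)
print(plan[0])     # ['node1', 'node2', 'node3']
```

Because every block lives on three nodes, losing any single machine never loses data, and reads can be served from whichever replica is closest to the computation.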

Yet Another Resource Negotiator (YARN)

Note: For more information, read about Hadoop: Yet Another Resource Negotiator (YARN).

YARN is the framework on which MapReduce runs. It performs two operations: job scheduling and resource management.
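Those two operations can be sketched together as a toy scheduler. The class and method names below are illustrative inventions, not YARN's actual API: the resource-management side tracks free memory per node, and the scheduling side places each job on the first node with enough capacity.

```python
class ResourceManager:
    """Toy model of YARN's two duties: resource tracking and scheduling."""

    def __init__(self, node_memory_mb):
        # Resource management: remaining memory per node.
        self.free = dict(node_memory_mb)

    def schedule(self, job, memory_mb):
        # Job scheduling: first-fit placement onto a node with capacity.
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] -= memory_mb
                return (job, node)
        return (job, None)  # no capacity left: the job must wait

rm = ResourceManager({"node1": 4096, "node2": 2048})
print(rm.schedule("map-task-1", 3072))  # ('map-task-1', 'node1')
print(rm.schedule("map-task-2", 2048))  # ('map-task-2', 'node2')
print(rm.schedule("map-task-3", 2048))  # ('map-task-3', None)
```

Real YARN is far richer (queues, priorities, vcores, pluggable schedulers such as Capacity and Fair), but the division of labor is the same: track what the cluster has, and decide which job gets it next.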

Deployment

TODO: Write more about how to deploy via containers, kubernetes & ansible.

Big Data Europe, an EU Horizon 2020 project, offers open-source resources for deploying Big Data tools such as Hadoop, whether on a cluster of machines or on a single development machine. This includes their Hadoop Docker container, which can be used to deploy Hadoop on a single machine.
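As a rough sketch of what a single-machine deployment could look like, the compose fragment below starts one namenode from a Big Data Europe image. The image tag, environment variable, and port are assumptions based on their published images; check the Big Data Europe repositories for the current values before using this.

```yaml
# Hypothetical docker-compose sketch; verify image tags and settings
# against the Big Data Europe repositories before use.
version: "3"
services:
  namenode:
    image: bde2020/hadoop-namenode   # assumed image name; pin a real tag
    environment:
      - CLUSTER_NAME=dev             # assumed variable used by the image
    ports:
      - "9870:9870"                  # HDFS web UI (Hadoop 3 default port)
    volumes:
      - namenode-data:/hadoop/dfs/name
volumes:
  namenode-data:
```

A full cluster would add datanode, resourcemanager, and nodemanager services in the same style.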

References

Web Links

Note Links