In the last article we looked at what big data is. In this article we will look at a big data framework called "Hadoop". Hadoop is a software library that lets users distribute and process large amounts of data across clusters of commodity servers. The project includes the following modules:
Hadoop Common – The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS) – A distributed file system that provides high-throughput access to data.
Hadoop YARN (Yet Another Resource Negotiator) – A framework for job scheduling and cluster resource management.
Hadoop MapReduce (MRv2) – A YARN-based programming model for processing large data sets in parallel.
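To make the MapReduce programming model more concrete, here is a minimal sketch of a word count written in plain Python. It does not use Hadoop or any Hadoop API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made up for illustration, and everything runs in a single process, whereas Hadoop would spread the map and reduce work across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In this sketch the lines are mapped one after another in one process;
    # the model allows each line (or block of lines) to be mapped independently.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a large file.
    sample = ["big data needs big clusters", "hadoop processes big data"]
    print(word_count(sample))
```

The point of the split is that each map call depends only on its own input line and each reduce call depends only on the values for its own key, which is what allows a framework to run them in parallel on different machines.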
In addition to the modules above, there are other Hadoop-related projects:
ZooKeeper – Coordination service for distributed applications.
Pig – A high-level data-flow language and execution framework for parallel computation.
HBase – A scalable, distributed, column-oriented database.
Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying.
Oozie – A workflow scheduler system to manage Apache Hadoop jobs.
Flume – A tool for moving unstructured data, such as log data, into Apache Hadoop.
Sqoop – A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
The Apache Hadoop ecosystem is depicted below.
In the next article we will introduce the Hadoop Distributed File System.