Friday, October 4, 2013

Apache Hadoop : Introduction

In this post we will discuss the main functions of Apache Hadoop (a core "Big Data" technology) and the supporting projects that make up a Hadoop distribution.

What is Apache Hadoop?

Apache™ Hadoop® is an open-source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.

What are the Main Functions?
Hadoop consists of three main functions: storage, processing, and resource management.
1] Processing – MapReduce (a minimal word-count sketch follows this list)
2] Storage – HDFS
3] Resource Management – YARN
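
To make this concrete, here is a minimal sketch of the classic word-count job written against the Hadoop 2.x MapReduce API. The HDFS input and output paths are placeholders passed on the command line; treat this as an illustration rather than production code.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts emitted for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mapper emits a (word, 1) pair for every token it sees, and the reducer sums those counts per word. HDFS holds the input and output files, and YARN schedules the map and reduce tasks across the cluster, so this one example exercises all three functions listed above.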


A Hadoop Distribution:
A number of supporting Apache Software Foundation (ASF) projects enable the integration of core Apache Hadoop into a data center environment.
1] Apache Pig - Platform for processing and analyzing large datasets
2] Apache HCatalog - Table and metadata management service
3] Apache Hive - Data warehouse that enables easy data summarization and ad-hoc queries over large datasets
4] Apache HBase - Column-oriented NoSQL data storage system (see the client sketch after this list)
5] Apache ZooKeeper - Coordination service for distributed processes
6] Apache Ambari - Tool to provision, monitor, and administer Hadoop clusters
7] Apache Sqoop - Tool to speed up bulk data movement in and out of Hadoop, typically to and from relational databases
8] Apache Oozie - Java web application to schedule Apache Hadoop jobs
9] Apache Mahout - Scalable machine learning algorithms for clustering, classification, and batch-based collaborative filtering
10] Apache Flume - Efficiently aggregates and moves large amounts of log data from different sources
11] Apache YARN - Next-generation framework for Hadoop data processing
12] Apache Tez - Generalizes the MapReduce paradigm to execute a complex DAG (Directed Acyclic Graph) of tasks
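
As a taste of how one of these projects is used from application code, below is a minimal sketch of writing and reading a single cell with the HBase Java client API as it looked around this era. The "users" table, its "profile" column family, and the row key are hypothetical, and the table is assumed to have been created already (for example via the HBase shell).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Picks up hbase-site.xml from the classpath to locate the cluster
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical "users" table with a "profile" column family,
            // assumed to exist already
            HTable table = new HTable(conf, "users");

            // Write one cell: row "row1", column profile:name
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value)); // prints "Alice"

            table.close();
        }
    }

HBase stores its data files in HDFS and uses ZooKeeper for coordination, which is one reason it integrates so naturally with the rest of the stack.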

The Big Guns using Apache Hadoop?
  • Yahoo
  • Google (whose MapReduce and GFS papers inspired Hadoop, though it runs its own internal systems)
  • Twitter
  • Facebook
We will discuss these technologies and terms in detail in the next article!
