Apache: Hadoop (en)

From OnnoWiki

Apache Hadoop is an open-source software framework written in Java for distributed storage and processing of very large datasets on clusters of commodity hardware. All modules in Hadoop are designed with the fundamental assumption that hardware failures (individual machines or racks of machines) are common and should be automatically handled by the framework.

The core of Apache Hadoop consists of the following modules:

  • Hadoop Common: Contains libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop YARN: A resource-management platform that allocates computing resources in the cluster and schedules users' applications on them.
  • Hadoop MapReduce: An implementation of the MapReduce programming model for large-scale data processing.
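
To make the MapReduce programming model more concrete, the following is a minimal single-process sketch in Python. It is not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, run_word_count) are hypothetical and chosen only for illustration, and a real cluster would run the phases in parallel across many machines rather than in one process.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Illustrative only: these names are not part of Hadoop.
Pair = Tuple[str, int]


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterable[Pair]]) -> Iterator[Pair]:
    """Apply the mapper to every input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all values that share a key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply the reducer once per key to that key's grouped values."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line: str) -> Iterator[Pair]:
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word: str, counts: List[int]) -> int:
    """Sum the partial counts collected for one word."""
    return sum(counts)


def run_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain map, shuffle, and reduce into a word-count job."""
    pairs = map_phase(lines, word_mapper)
    return reduce_phase(shuffle_phase(pairs), count_reducer)


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_word_count(sample))
```

The user supplies only the mapper and the reducer; grouping values by key between the two phases is the framework's job, which is what lets the same two functions scale from one process to a cluster.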

The term "Hadoop" has come to refer not only to the core modules above but also to the wider ecosystem: the additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.

Apache Hadoop's MapReduce and HDFS components were inspired by Google's papers on MapReduce and the Google File System.

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Although MapReduce jobs are commonly written in Java, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of a user's program. Other projects in the Hadoop ecosystem expose richer user interfaces.
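
As an illustration of the Hadoop Streaming approach, here is a hedged Python sketch of a word-count mapper and reducer. It assumes the usual Streaming convention that each step reads text lines and writes tab-separated key/value lines, and that the framework sorts the mapper output by key before the reducer sees it. The function names map_lines and reduce_lines are hypothetical, and the sorted() call in the demo only stands in for the framework's sort step.

```python
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts of consecutive lines that share a key.

    Relies on the input already being sorted by key, so that all lines
    for one word are adjacent.
    """
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()  # skip blank lines
    )
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Local simulation of the map -> sort -> reduce pipeline.
    text = ["Hello world", "hello Hadoop"]
    mapped = list(map_lines(text))
    for result in reduce_lines(sorted(mapped)):
        print(result)
```

In an actual Streaming job, the mapper and the reducer would typically live in separate scripts, each passing sys.stdin to its function and printing the results to standard output. The in-memory demo above only imitates that pipeline.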

Prominent users of Hadoop include Facebook and Yahoo. Hadoop can run in traditional on-premises data centers and has also been deployed in public clouds such as Microsoft Azure, Amazon Web Services, Google Compute Engine, and IBM Bluemix.

Apache Hadoop is a registered trademark of the Apache Software Foundation.
