Hadoop: Ecosystem
Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php
Why Hadoop? - Volume, Velocity, and Variety (3 Vs)
- Volume - Cheaper: scales to petabytes or more
- Velocity - Faster: parallel data processing
- Variety - Better: suited to particular types of Big Data problems
Ecosystem
Image source: Hadoop Ecosystem
Samples
Picture source: Apache Hadoop Ecosystem
Hadoop consists of two main pieces: HDFS and MapReduce. MapReduce is the processing part of Hadoop, and the MapReduce server on a typical machine is called a TaskTracker. HDFS is the data part of Hadoop, and the HDFS server on a typical machine is called a DataNode.
MapReduce needs a coordinator, which is called the JobTracker. Regardless of how large the cluster is, there is a single JobTracker. The JobTracker is responsible for accepting a user's job, dividing it into tasks, and assigning each task to an individual TaskTracker. The TaskTracker then runs the task and reports its status as it runs and completes. The JobTracker is also responsible for noticing if a TaskTracker disappears because of a software or hardware failure; if it has gone away, the JobTracker automatically reassigns its tasks to another TaskTracker.
The NameNode plays a similar role for HDFS as the JobTracker does for MapReduce.
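To make the division of work concrete, below is a minimal word-count job written against the classic Hadoop MapReduce Java API. It is only a sketch: the HDFS input and output paths are hypothetical, and a real cluster would package this class into a jar before submitting it.

 import java.io.IOException;
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 
 public class WordCount {
 
   // Mapper: runs on the worker nodes (TaskTrackers) and emits (word, 1) for every word in its split.
   public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
     private static final IntWritable ONE = new IntWritable(1);
     private final Text word = new Text();
 
     @Override
     protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
       for (String token : value.toString().split("\\s+")) {
         if (!token.isEmpty()) {
           word.set(token);
           context.write(word, ONE);
         }
       }
     }
   }
 
   // Reducer: also runs on the workers and sums the counts for each word.
   public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
     @Override
     protected void reduce(Text key, Iterable<IntWritable> values, Context context)
         throws IOException, InterruptedException {
       int sum = 0;
       for (IntWritable v : values) {
         sum += v.get();
       }
       context.write(key, new IntWritable(sum));
     }
   }
 
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     // The job is handed to the cluster's coordinator (the JobTracker in classic MapReduce).
     Job job = Job.getInstance(conf, "word count");
     job.setJarByClass(WordCount.class);
     job.setMapperClass(TokenizerMapper.class);
     job.setReducerClass(SumReducer.class);
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     // Hypothetical HDFS paths used for this sketch.
     FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
     FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }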
However, since the project was first started, lots of other software has been built around it, and that is what we call the Hadoop Ecosystem. Some of this software is intended to make it easier to load data into the Hadoop cluster.
Much of it is designed to make Hadoop easier to use. In fact, writing MapReduce code isn't that simple: we need to write it in a programming language such as Java, Python, or Ruby. But there are quite a few people out there who aren't programmers, yet can write SQL queries to access data in a traditional relational database system, like SQL Server. And, of course, lots of business tools hook into Hadoop.
Therefore, several open source projects have been created to make it easier for people to query their data without knowing how to code.
Instead of having to write mappers and reducers, in Hive we just write a statement which looks very much like a standard SQL query:
SELECT * FROM ...
The Hive interpreter turns the SQL into MapReduce code, which then runs on the cluster. Facebook uses Hive extensively, for about 90% of its computations.
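As a rough sketch of how such a statement can be submitted from a program, the snippet below uses the standard Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) against a HiveServer2 endpoint; the host, port, credentials, and the web_logs table are hypothetical.

 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.ResultSet;
 import java.sql.Statement;
 
 public class HiveQueryExample {
   public static void main(String[] args) throws Exception {
     // Hypothetical HiveServer2 endpoint and credentials.
     String url = "jdbc:hive2://localhost:10000/default";
     try (Connection conn = DriverManager.getConnection(url, "hive", "");
          Statement stmt = conn.createStatement();
          // Hive translates this statement into MapReduce jobs behind the scenes.
          ResultSet rs = stmt.executeQuery("SELECT * FROM web_logs LIMIT 10")) {
       while (rs.next()) {
         System.out.println(rs.getString(1));
       }
     }
   }
 }

The Hive JDBC driver jar needs to be on the classpath so that DriverManager can find it.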
An alternative is Pig, which allows us to write code to analyse our data in a fairly simple scripting language, rather than in MapReduce. It is a high-level data-flow language developed at Yahoo, and it allows easy integration of Java for complex tasks.
Image source : Introduction to Apache Hadoop
The code is just turned into MapReduce and run on the cluster. It works like a compiler which translates a program into assembly; Pig does the same thing for MapReduce jobs.
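For illustration only, Pig can also be embedded in Java through its PigServer API; the sketch below runs a tiny, hypothetical Pig Latin script in local mode, with made-up paths and field names.

 import org.apache.pig.ExecType;
 import org.apache.pig.PigServer;
 
 public class PigExample {
   public static void main(String[] args) throws Exception {
     // Local mode for this sketch; swap in ExecType.MAPREDUCE to run on the cluster.
     PigServer pig = new PigServer(ExecType.LOCAL);
 
     // Load tab-separated log lines, keep only the errors, and store the result.
     pig.registerQuery("logs = LOAD 'input/logs.tsv' AS (level:chararray, msg:chararray);");
     pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
     pig.store("errors", "output/errors");
   }
 }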
Though Hive and Pig are great, they are still running MapReduce jobs, which can take a considerable amount of time to run, especially over large amounts of data.
Impala
That's why we have another open source project called Impala. It was developed as a way to query our data with SQL, but one which directly accesses the data in HDFS rather than going through MapReduce. Impala is optimized for low-latency queries. In other words, Impala queries run very quickly, typically many times faster than Hive, while Hive is optimized for running long batch-processing jobs.
Sqoop
Another project used by many people is Sqoop. Sqoop takes data from a traditional relational database, such as Microsoft SQL Server, and puts it into HDFS as delimited files, so it can be processed along with other data on the cluster.
Flume
Then, there's Flume. Flume is for streaming data into Hadoop. It ingests data as it is generated by external systems and puts it into the cluster. So, if we have servers generating data continuously, we can use Flume.
HBase is a real-time database built on top of HDFS. It's a column-family store based on Google's BigTable design.
Image source : Introduction to Apache Hadoop
We need to read/write data in real time, and HBase is a top-level Apache project that meets that need. It provides a simple interface to our distributed data that allows incremental processing. HBase can be accessed by Hive and Pig, and by MapReduce, and it stores its information in HDFS, so the data is guaranteed to be reliable and durable. HBase is used for applications such as Facebook Messages.
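For a feel of that incremental read/write interface, here is a minimal sketch with the standard HBase Java client; the messages table, the msg column family, and the values are hypothetical, and the table is assumed to already exist.

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.TableName;
 import org.apache.hadoop.hbase.client.Connection;
 import org.apache.hadoop.hbase.client.ConnectionFactory;
 import org.apache.hadoop.hbase.client.Get;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.client.Table;
 import org.apache.hadoop.hbase.util.Bytes;
 
 public class HBaseExample {
   public static void main(String[] args) throws Exception {
     // Reads hbase-site.xml from the classpath to locate the cluster.
     Configuration conf = HBaseConfiguration.create();
     try (Connection connection = ConnectionFactory.createConnection(conf);
          Table table = connection.getTable(TableName.valueOf("messages"))) {
 
       // Write one cell: row "user1", column family "msg", qualifier "body".
       Put put = new Put(Bytes.toBytes("user1"));
       put.addColumn(Bytes.toBytes("msg"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
       table.put(put);
 
       // Read it back immediately - the real-time access that HDFS alone does not give us.
       Result result = table.get(new Get(Bytes.toBytes("user1")));
       byte[] value = result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("body"));
       System.out.println(Bytes.toString(value));
     }
   }
 }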
KijiSchema provides a simple Java API and command-line interface for importing, managing, and retrieving data from HBase, setting up HBase layouts using user-friendly tools including a DDL (data definition/description language).
Picture source: A Hadoop Ecosystem Overview: Including HDFS, MapReduce, Yarn, Hive, Pig, and HBase.
Image source : Hadoop Tutorial: Apache Hive
Hive is a data warehouse system layer built on Hadoop. It allows us to define a structure for our unstructured Big Data. With HiveQL, an SQL-like scripting language, we can simplify analysis and queries.
Note that Hive is NOT a database but uses a database to store metadata. The data that Hive processes is stored in HDFS. Hive runs on Hadoop and is NOT designed for on-line transaction processing because the latency for Hive queries is generally high. Therefore, Hive is NOT suited for real-time queries. Hive is best used for batch jobs over large sets of immutable data such as web logs.
Hue is a graphical front end to the cluster.
Oozie is a workflow scheduler tool; it provides a workflow/coordination service to manage Hadoop jobs. We define when we want our MapReduce jobs to run, and Oozie fires them up automatically; it can also trigger jobs when data becomes available. It was developed at Yahoo.
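As a sketch of how a workflow is kicked off programmatically, the snippet below uses the Oozie Java client to submit a workflow that is assumed to already be defined in HDFS; the Oozie URL and the application path are hypothetical.

 import java.util.Properties;
 import org.apache.oozie.client.OozieClient;
 
 public class OozieSubmitExample {
   public static void main(String[] args) throws Exception {
     // Hypothetical Oozie server URL.
     OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
 
     Properties conf = oozie.createConfiguration();
     // Hypothetical HDFS directory containing a workflow.xml definition.
     conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/demo/my-workflow");
     conf.setProperty("queueName", "default");
 
     // Submit and start the workflow; Oozie coordinates the Hadoop jobs it contains.
     String jobId = oozie.run(conf);
     System.out.println("Submitted workflow job: " + jobId);
   }
 }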
Mahout is a library for scalable machine learning and data mining.
Avro is a Serialization and RPC framework.
In fact, there are so many ecosystem projects that making them all talk to one another and work well together can be tricky. To make installing and maintaining such a cluster easier, a company such as Cloudera has put together a distribution of Hadoop called CDH (Cloudera's Distribution including Apache Hadoop), which takes all the key ecosystem projects, along with Hadoop itself, and packages them together so that installation is a really easy process. It's free and open source, just like Hadoop itself. While we could install everything from scratch, it's far easier to use CDH.
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers. - ZooKeeper wiki
Image source : Introduction to Apache Hadoop
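To illustrate that shared hierarchical namespace, below is a minimal sketch using the standard ZooKeeper Java client; the ensemble address and the /demo-config znode are assumptions for illustration.

 import org.apache.zookeeper.CreateMode;
 import org.apache.zookeeper.ZooDefs;
 import org.apache.zookeeper.ZooKeeper;
 
 public class ZooKeeperExample {
   public static void main(String[] args) throws Exception {
     // Hypothetical ensemble address; 3000 ms session timeout, no watcher for this sketch.
     ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);
 
     // Create a znode (a "data register") in the shared namespace and write a small value.
     zk.create("/demo-config", "v1".getBytes(),
         ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
 
     // Any other process connected to the ensemble can read the same register.
     byte[] data = zk.getData("/demo-config", false, null);
     System.out.println(new String(data));
 
     zk.close();
   }
 }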
Kafka
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Source: Simulating and transporting Realtime event stream with Apache Kafka
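As a rough sketch of the messaging side, here is a minimal producer using the standard Kafka Java client; the broker address and the events topic are hypothetical.

 import java.util.Properties;
 import org.apache.kafka.clients.producer.KafkaProducer;
 import org.apache.kafka.clients.producer.Producer;
 import org.apache.kafka.clients.producer.ProducerRecord;
 
 public class KafkaProducerExample {
   public static void main(String[] args) {
     Properties props = new Properties();
     // Hypothetical broker address.
     props.put("bootstrap.servers", "localhost:9092");
     props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
     props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
 
     try (Producer<String, String> producer = new KafkaProducer<>(props)) {
       // Append one message to the "events" topic; consumers read it back from the commit log.
       producer.send(new ProducerRecord<>("events", "user1", "page_view"));
     }
   }
 }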
Data center Orchestration - Mesos
Image source: Apache Mesos
Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
Image source: Mesos Orchestrates a Data Center Like One Big Computer