Hadoop: Running a MapReduce Job


Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Running_MapReduce_Job.php


MapReduce Preparation

Before we jump into MapReduce programming, we should talk about the preparation steps that are commonly taken. Because MapReduce usually operates on large data sets, we need to think these steps through before we actually run a MapReduce job.

The underlying structure of the HDFS filesystem is very different from a normal filesystem. The block size is quite a bit larger, and the actual block size for our cluster depends on the cluster configuration, as shown in the picture below: 64, 128, or 256 MB. So we may need blocks that are partitioned in a customized way.

MapRPrep.png

Picture source : Hadoop MapReduce Fundamentals.
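
To see which of those block sizes a given cluster actually uses, the dfs.blocksize property (the Hadoop 2.x name) can be queried from the client; the value is reported in bytes. This is only a sketch, and the per-upload override and the file names below are purely illustrative:

# Print the configured HDFS block size in bytes (e.g. 134217728 = 128 MB).
hdfs getconf -confKey dfs.blocksize

# Request a different block size for a single upload (here 256 MB),
# using the generic -D option to set a client-side property for this command.
hadoop fs -D dfs.blocksize=268435456 -put data.txt /user/hduser/data.txt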

Another consideration is where we will get our data from in order to perform the MapReduce operation, or parallel processing, on it. Although we will be working with the Hadoop filesystem, we can execute MapReduce algorithms against information stored in different locations: the native filesystem, cloud storage such as an Amazon S3 bucket, or a Windows Azure blob.
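
In practice this mostly changes the filesystem URI that the shell commands and jobs point at. A small sketch, assuming the relevant connector and credentials are configured on the cluster; my-bucket is a placeholder name:

# Local (native) filesystem, bypassing HDFS
hadoop fs -ls file:///home/hduser/

# Amazon S3 bucket via the s3n connector (AWS credentials must be configured)
hadoop fs -ls s3n://my-bucket/input/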

Another consideration is that the output of a MapReduce job is immutable. So our output is a one-time output, and when new output is generated, we get a new file name for it.

The last consideration in preparing for MapReduce concerns the logic we are going to write, which has to fit the situation we are addressing. We will write the logic in some programming language, library, or tool to map the data and then reduce it, and then we have some output.

Note also that we will be working with key-value pairs, so regardless of the format of the incoming data, we want to output key-value pairs.
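
As a rough illustration of what that means, a word-count style map step turns free text into one key-value line per word, whatever the input looked like. The sketch below only mimics the idea with ordinary shell tools and is not part of any Hadoop job:

# Map step, conceptually: split the input into words and emit a (word, 1)
# pair per line; in text form a pair is simply key<TAB>value.
echo "hadoop mapreduce hadoop" | awk '{ for (i = 1; i <= NF; i++) print $i "\t1" }'
# hadoop	1
# mapreduce	1
# hadoop	1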



Hadoop shell commands

Before performing MapReduce jobs, we should be familiar with some of the Hadoop shell commands. Please visit List of Apache Hadoop hdfs commands.
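
For quick reference, the HDFS shell commands used later on this page follow the familiar Unix file commands, and -help prints the complete list. The file and directory names below are only placeholders:

hadoop fs -help                          # list every filesystem shell command
hadoop fs -ls /                          # list the HDFS root directory
hadoop fs -mkdir -p /user/hduser         # create a directory, with parents
hadoop fs -put local.txt /user/hduser/   # copy a local file into HDFS
hadoop fs -cat /user/hduser/local.txt    # print a file stored in HDFS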



Running a MapReduce Job

Do the following:

cd /usr/local/hadoop
ls
bin  include  libexec      logs        README.txt  share
etc  lib      LICENSE.txt  NOTICE.txt  sbin

Run:

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 2 5
Number of Maps  = 2
Samples per Map = 5
14/07/14 01:28:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/07/14 01:28:07 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/07/14 01:28:07 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/07/14 01:28:07 INFO input.FileInputFormat: Total input paths to process : 2
14/07/14 01:28:07 INFO mapreduce.JobSubmitter: number of splits:2
14/07/14 01:28:09 INFO mapreduce.JobSubmitter: Submitting tokens for job:  job_local1228885165_0001
...
	File Input Format Counters  
		Bytes Read=236 
	File Output Format Counters 
		Bytes Written=97
Job Finished in 6.072 seconds
Estimated value of Pi is 3.60000000000000000000
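
The estimate is rough because the job used only 2 maps with 5 samples each. The same example jar accepts larger arguments, and the Monte Carlo estimate tightens as the number of samples grows, at the cost of a longer run, for example:

# More maps and more samples per map give a better estimate of Pi.
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 16 1000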



Hadoop FileSystem (HDFS)

Files are stored in the Hadoop Distributed File System (HDFS). Suppose we're going to store a file called data.txt in HDFS.

This file is 160 megabytes. When a file is loaded into HDFS, it's split into chunks which are called blocks. The default size of each block is 64 megabytes. Each block is given a unique name, which is blk, an underscore, and a large number. In our case, the first block is 64 megabytes. The second block is 64 megabytes. The third block is the remaining 32 megabytes, to make up our 160 megabyte file.

HDFS_Cloud.png
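
Once a file is in HDFS, we can inspect how it was actually split from the command line. A small sketch, assuming data.txt has been uploaded to /user/hduser/data.txt:

# fsck reports each block (blk_...) of the file and which DataNodes hold it.
hdfs fsck /user/hduser/data.txt -files -blocks -locations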

As the file is uploaded to HDFS, each block will get stored on one node in the cluster. There's a Daemon running on each of the machines in the cluster, and it is called the DataNode. Now, we need to know which blocks make up the original file. And that's handled by a separate machine, running the Daemon called the NameNode. The information stored on the NameNode is known as the Metadata.


NoSQL.png
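
On a single-node setup like the one used here, those daemons can be seen as ordinary Java processes with the jps tool that ships with the JDK (the output below is only an illustration; process ids will differ):

hduser@ubuntu:~$ jps
# typical output on a pseudo-distributed node (illustrative):
# 2624 NameNode
# 2751 DataNode
# 2915 SecondaryNameNode
# 3109 ResourceManager
# 3231 NodeManager
# 3388 Jps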

HDFS Commands

While Hadoop is running, let's create hdfsTest.txt in our home directory:

hduser@k:~$ echo "hdfs test" > hdfsTest.txt

Then, we want to create the user's home directory in HDFS:

hduser@ubuntu:~$ hadoop fs -mkdir -p /user/hduser

We can copy the file hdfsTest.txt from the local disk to the user's home directory in HDFS:


hduser@ubuntu:~$ hadoop fs -copyFromLocal hdfsTest.txt hdfsTest.txt

We could have used put instead of copyFromLocal:

hduser@ubuntu:~$ hadoop fs -put hdfsTest.txt

Get a directory listing of the user's home directory in HDFS:

hduser@k:~$ hadoop fs -ls
Found 1 items
-rw-r--r--   1 hduser supergroup   5 2014-07-14 01:49 hdfsTest.txt

If we want to display the contents of the HDFS file /user/hduser/hdfsTest.txt:


hduser@ubuntu:~$ hadoop fs -cat /user/hduser/hdfsTest.txt

To copy that file from HDFS back to the local disk, saving it as hdfsTest2.txt:

hduser@k:~$ hadoop fs -copyToLocal /user/hduser/hdfsTest.txt hdfsTest2.txt

hduser@k:~$ ls
hdfsTest2.txt  hdfsTest.txt
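
To confirm that the round trip preserved the contents, the local copy can be compared against the original; diff printing nothing means the two files are identical:

hduser@k:~$ diff hdfsTest.txt hdfsTest2.txt
hduser@k:~$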

To delete the file from Hadoop HDFS:

hduser@k:~$ hadoop fs -rm hdfsTest.txt

hduser@k:~$ hadoop fs -ls
hduser@k:~$



Hadoop Setup for Development

HadoopSetup.png

Picture source : Hadoop MapReduce Fundamentals.

Throughout my tutorials on the Hadoop ecosystem, I used:

   Hadoop Binaries - Local (Linux), Cloudera's Demo VM, and AWS for Cloud.
   Data Storage - Local (HDFS Pseudo-distributed, single-node) and Cloud.
   MapReduce - Both Local and Cloud.
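
One quick way to see which storage a given shell is pointed at is to ask for the default filesystem: hdfs://localhost:9000 (or similar) indicates the pseudo-distributed HDFS, while file:/// means the plain local filesystem. A small sketch:

# Print the filesystem that "hadoop fs" commands use by default.
hdfs getconf -confKey fs.defaultFS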



Ways to MapReduce

Java is the most common language to use, but other languages can be used:

WayToMapReduces.png

Picture source : Hadoop MapReduce Fundamentals.
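
One common non-Java route is Hadoop Streaming: any executable that reads lines on stdin and writes key-value lines on stdout can act as the mapper or reducer. A minimal sketch, assuming the streaming jar shipped with this 2.4.1 install sits under share/hadoop/tools/lib, that /user/hduser/input already contains some text files, and that the output directory does not exist yet:

# Classic streaming example: /bin/cat passes each input line through as the
# key, and /usr/bin/wc counts the lines/words/characters reaching the reducer.
hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar \
  -input /user/hduser/input \
  -output /user/hduser/streaming-out \
  -mapper /bin/cat \
  -reducer /usr/bin/wc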



References