Hadoop: Giraph - Graph Analysis
Source: http://giraph.apache.org/quick_start.html
Deploying Giraph
We will now deploy Giraph. To build Giraph from the repository, you first need to install Git and Maven 3 by running the following commands:
su - hdadmin
sudo apt-get install git
sudo apt-get install maven
mvn -version
Make sure that you have installed Maven 3 or higher. Giraph uses the Munge plugin, which requires Maven 3, to support multiple versions of Hadoop. The website plugin also requires Maven 3. You can now clone Giraph from its GitHub mirror:
cd /usr/local/
sudo git clone https://github.com/apache/giraph.git
sudo chown -R hduser:hadoop giraph
su - hduser
After that, add the following line to $HOME/.bashrc for the hduser account:
export GIRAPH_HOME=/usr/local/giraph
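If you prefer not to edit the file by hand, the line can also be appended from the shell. The sketch below is one way to do that without duplicating the line on repeated runs; the helper name add_line_once is hypothetical and not part of Giraph or Hadoop.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
add_line_once() {
  line=$1
  file=$2
  touch "$file"
  # -q is silent, -x matches whole lines, -F treats the line as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (run as hduser):
#   add_line_once 'export GIRAPH_HOME=/usr/local/giraph' "$HOME/.bashrc"
```

Because the check matches the whole line, a commented-out or differently valued GIRAPH_HOME entry is not treated as a duplicate.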
Save and close the file, then validate, compile, test (if required), and package Giraph into JAR files by running the following commands:
source $HOME/.bashrc
cd $GIRAPH_HOME
mvn package -DskipTests
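Downloads from the remote Maven repositories can time out, so the build sometimes needs more than one attempt. A minimal sketch of a retry wrapper follows; the function name retry and the attempt count of 3 are illustrative choices, not something Giraph or Maven provides.

```shell
# Hypothetical helper: run a command up to $1 times, stopping at the first success.
retry() {
  retry_max=$1
  shift
  retry_n=1
  while [ "$retry_n" -le "$retry_max" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $retry_n of $retry_max failed: $*" >&2
    retry_n=$((retry_n + 1))
  done
  return 1
}

# Usage, from $GIRAPH_HOME:
#   retry 3 mvn package -DskipTests
```

Maven keeps artifacts that were already downloaded in the local repository, so a repeated attempt normally only fetches what is still missing.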
The -DskipTests argument to mvn package skips the testing phase. The first run may take a while because Maven downloads the most recent artifacts (plugin JARs and other files) into your local repository. You may also need to run the command a couple of times before it succeeds, because the remote server may time out before your downloads complete. Once packaging succeeds, you will have the Giraph core JAR $GIRAPH_HOME/giraph-core/target/giraph-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar and the Giraph examples JAR $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar. The version strings in these file names depend on the checkout and Hadoop profile you built. You are done with deploying Giraph.
Running a Giraph job
With Giraph and Hadoop deployed, you can run your first Giraph job. We will use the SimpleShortestPathsComputation example job, which reads an input file of a graph in one of the supported formats and computes the length of the shortest paths from a source node to all other nodes. By default the source node is the vertex with ID 1 (the job's SimpleShortestPathsVertex.sourceId option), not necessarily the first node in the input file. We will use the JsonLongDoubleFloatDoubleVertexInputFormat input format. First, create an example graph under /tmp/tiny_graph.txt with the following:
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
Save and close the file. Each line above has the format [source_id,source_value,[[dest_id, edge_value],...]]. In this graph, there are 5 nodes and 12 directed edges. Copy the input file to HDFS:
$HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/tiny_graph.txt /user/hduser/input/tiny_graph.txt
$HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/input
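As a quick local sanity check, the vertex and edge counts stated above can be recomputed from the file with standard tools. The sketch below recreates the graph in a scratch file and counts lines and [dest_id,edge_value] pairs; the GRAPH variable and the graph_stats name are hypothetical, and the counting assumes one vertex per line.

```shell
# Scratch copy of the example graph (the path is an arbitrary choice for this sketch).
GRAPH=$(mktemp)
cat > "$GRAPH" <<'EOF'
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
EOF

# Hypothetical helper: print "<vertices> <edges>" for a graph file.
graph_stats() {
  # One non-empty line per vertex.
  vertices=$(grep -c . "$1")
  # Each directed edge is a [dest_id,edge_value] pair that closes with "]".
  edges=$(grep -o '\[[0-9]*,[0-9.]*]' "$1" | wc -l | tr -d ' ')
  echo "$vertices $edges"
}

graph_stats "$GRAPH"   # prints "5 12"
```

The same function can be pointed at /tmp/tiny_graph.txt to confirm that the file you uploaded has the expected shape.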
We will use the IdWithValueTextOutputFormat output format, in which each line has the form source_id length: a node ID followed by the length of its shortest path from the source, with one line for each node in the input graph (the source node has a length of 0, by convention). You can now run the example with:
$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/hduser/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hduser/output/shortestpaths \
  -w 1
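The JAR file name in the command above embeds the Giraph and Hadoop version strings, which depend on what was actually built. As a sketch, the path can be resolved with a glob instead of being typed out; the helper name find_giraph_jar and the EXAMPLES_JAR variable are hypothetical.

```shell
# Hypothetical helper: print the first *-jar-with-dependencies.jar in a directory.
find_giraph_jar() {
  for jar in "$1"/*-jar-with-dependencies.jar; do
    # If nothing matches, the glob stays unexpanded and the -f test fails.
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no jar-with-dependencies found in $1" >&2
  return 1
}

# Usage after a successful build:
#   EXAMPLES_JAR=$(find_giraph_jar "$GIRAPH_HOME/giraph-examples/target")
#   $HADOOP_HOME/bin/hadoop jar "$EXAMPLES_JAR" org.apache.giraph.GiraphRunner -h
```

If several matching JARs exist in the directory, this picks the first one in glob order, so clean out stale builds before relying on it.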
Note that the -w 1 argument in the run command makes the job use a single worker. To get more information about running a Giraph job, run the following command:
$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner -h
This will output the following:
usage: org.apache.giraph.utils.ConfigurationUtils [-aw <arg>] [-c <arg>]
       [-ca <arg>] [-cf <arg>] [-eif <arg>] [-eip <arg>] [-eof <arg>]
       [-esd <arg>] [-h] [-jyc <arg>] [-la] [-mc <arg>] [-op <arg>]
       [-pc <arg>] [-q] [-th <arg>] [-ve <arg>] [-vif <arg>] [-vip <arg>]
       [-vof <arg>] [-vsd <arg>] [-vvf <arg>] [-w <arg>] [-wc <arg>]
       [-yh <arg>] [-yj <arg>]
 -aw,--aggregatorWriter <arg>          AggregatorWriter class
 -c,--messageCombiner <arg>            Message messageCombiner class
 -ca,--customArguments <arg>           provide custom arguments for the job configuration in the form:
                                       -ca <param1>=<value1>,<param2>=<value2> -ca <param3>=<value3> etc.
                                       It can appear multiple times, and the last one has effect for the sameparam.
 -cf,--cacheFile <arg>                 Files for distributed cache
 -eif,--edgeInputFormat <arg>          Edge input format
 -eip,--edgeInputPath <arg>            Edge input path
 -eof,--vertexOutputFormat <arg>       Edge output format
 -esd,--edgeSubDir <arg>               subdirectory to be used for the edge output
 -h,--help                             Help
 -jyc,--jythonClass <arg>              Jython class name, used if computation passed in is a python script
 -la,--listAlgorithms                  List supported algorithms
 -mc,--masterCompute <arg>             MasterCompute class
 -op,--outputPath <arg>                Vertex output path
 -pc,--partitionClass <arg>            Partition class
 -q,--quiet                            Quiet output
 -th,--typesHolder <arg>               Class that holds types. Needed only if Computation is not set
 -ve,--outEdges <arg>                  Vertex edges class
 -vif,--vertexInputFormat <arg>        Vertex input format
 -vip,--vertexInputPath <arg>          Vertex input path
 -vof,--vertexOutputFormat <arg>       Vertex output format
 -vsd,--vertexSubDir <arg>             subdirectory to be used for the vertex output
 -vvf,--vertexValueFactoryClass <arg>  Vertex value factory class
 -w,--workers <arg>                    Number of workers
 -wc,--workerContext <arg>             WorkerContext class
 -yh,--yarnheap <arg>                  Heap size, in MB, for each Giraph task (YARN only.)
                                       Defaults to giraph.yarn.task.heap.mb => 1024 (integer) MB.
 -yj,--yarnjars <arg>                  comma-separated list of JAR filenames to distribute to Giraph tasks
                                       and ApplicationMaster. YARN only. Search order: CLASSPATH, HADOOP_HOME, user current dir.
You can monitor the progress of your Giraph job from the JobTracker web GUI. Once the job has completed, you can check the results by running:
$HADOOP_HOME/bin/hadoop dfs -cat /user/hduser/output/shortestpaths/p* | less
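To judge whether the job output is plausible, the expected distances for a graph this small can be recomputed locally. The sketch below is a plain Bellman-Ford relaxation written with sed and awk, not Giraph's vertex-centric implementation. The function name shortest_paths and the GRAPH variable are hypothetical, and the source vertex is passed explicitly, so use whichever vertex the job treated as its source (the example call assumes vertex 1).

```shell
# Hypothetical helper: shortest_paths GRAPH_FILE SOURCE_ID
# Prints "<vertex_id> <distance>" per vertex, sorted by ID (-1 means unreachable).
shortest_paths() {
  # Flatten "[id,value,[[dest,weight],...]]" to "id,value,dest,weight,...".
  sed 's/[][ ]//g' "$1" | awk -F, -v src="$2" '
    NF >= 2 {
      ids[$1] = 1
      for (i = 3; i + 1 <= NF; i += 2) {
        m++; from[m] = $1; to[m] = $i; w[m] = $(i + 1)
        ids[$i] = 1
      }
    }
    END {
      for (v in ids) { dist[v] = (v + 0 == src + 0) ? 0 : -1; nv++ }
      # Bellman-Ford: relax every edge up to |V| - 1 times.
      for (round = 1; round < nv; round++)
        for (e = 1; e <= m; e++) {
          if (dist[from[e]] < 0) continue
          d = dist[from[e]] + w[e]
          if (dist[to[e]] < 0 || d < dist[to[e]]) dist[to[e]] = d
        }
      for (v in ids) print v, dist[v]
    }
  ' | sort -n
}

# Scratch copy of the example graph, then distances from vertex 1.
GRAPH=$(mktemp)
printf '%s\n' '[0,0,[[1,1],[3,3]]]' '[1,0,[[0,1],[2,2],[3,1]]]' \
  '[2,0,[[1,2],[4,4]]]' '[3,0,[[0,3],[1,1],[4,4]]]' '[4,0,[[3,4],[2,4]]]' > "$GRAPH"
shortest_paths "$GRAPH" 1
# prints:
#   0 1
#   1 0
#   2 2
#   3 1
#   4 5
```

The job writes its distances as floating-point values and in no particular vertex order, so compare the numbers per vertex ID rather than line by line.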
Getting involved
Giraph is an open-source project and external contributions are extremely appreciated. There are many ways to get involved:
- Subscribe to the mailing lists, particularly the user and developer lists, where you can get a feel for the state of the project and what the community is working on.
- Try out more examples and play with Giraph on your cluster. Be sure to ask questions on the user list or file an issue if you run into problems with your particular configuration.
- Browse the existing issues to find something you may be interested in working on.
- Take a look at the section on generating patches for detailed instructions on contributing your changes.
- Make Giraph more accessible to newcomers by updating this and other site documentation.