Hadoop: Giraph - Graph Analysis


Source: http://giraph.apache.org/quick_start.html


Deploying Giraph

We will now deploy Giraph. In order to build Giraph from the repository, you first need to install Git and Maven 3 by running the following commands:

su - hdadmin
sudo apt-get install git
sudo apt-get install maven
mvn -version

Make sure that you have installed Maven 3 or higher. Giraph uses the Munge plugin, which requires Maven 3, to support multiple versions of Hadoop. The web site plugin also requires Maven 3. You can now clone Giraph from its GitHub mirror:

cd /usr/local/
sudo git clone https://github.com/apache/giraph.git
sudo chown -R hduser:hadoop giraph
su - hduser

After that, edit $HOME/.bashrc for user account hduser and add the following line:

export GIRAPH_HOME=/usr/local/giraph

Save and close the file, then validate, compile, test (if required), and package Giraph into JAR files by running the following commands:

source $HOME/.bashrc
cd $GIRAPH_HOME
mvn package -DskipTests

The argument -DskipTests skips the testing phase. This may take a while on the first run because Maven downloads the most recent artifacts (plugin JARs and other files) into your local repository. You may also need to execute the command a couple of times before it succeeds, because the remote server may time out before your downloads are complete. Once the packaging is successful, you will have the Giraph core JAR $GIRAPH_HOME/giraph-core/target/giraph-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar and the Giraph examples JAR $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar. You are done with deploying Giraph.

Running a Giraph job

With Giraph and Hadoop deployed, you can run your first Giraph job. We will use the SimpleShortestPathsComputation example job, which reads an input file of a graph in one of the supported formats and computes the length of the shortest paths from a source node to all other nodes (by default, the source is the node with ID 1). We will use the JsonLongDoubleFloatDoubleVertexInputFormat input format. First, create an example graph under /tmp/tiny_graph.txt with the following:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]

Save and close the file. Each line above has the format [source_id,source_value,[[dest_id,edge_value],...]]. For example, the first line describes node 0 with value 0 and two outgoing edges: one to node 1 with weight 1, and one to node 3 with weight 3. In this graph, there are 5 nodes and 12 directed edges. Copy the input file to HDFS:

$HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/tiny_graph.txt /user/hduser/input/tiny_graph.txt
$HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/input
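
If you want to double-check the upload, you can print the file back from HDFS:

$HADOOP_HOME/bin/hadoop dfs -cat /user/hduser/input/tiny_graph.txt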

We will use the IdWithValueTextOutputFormat output file format, where each line consists of node_id length for each node in the input graph (the source node has a length of 0, by convention). You can now run the example by running:

$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/hduser/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hduser/output/shortestpaths \
  -w 1
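
The heart of this example is its compute() method, which follows the Pregel pattern: every node tracks the smallest distance seen so far and pushes improvements to its neighbors until no distances change. The following is a minimal sketch of that logic, closely modeled on the SimpleShortestPathsComputation class shipped with Giraph; the isSource() helper here is a simplified stand-in for Giraph's configurable source-ID check (which defaults to node 1).

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ShortestPathsSketch extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  // Simplified stand-in for the configurable source check (default ID: 1).
  private boolean isSource(Vertex<LongWritable, ?, ?> vertex) {
    return vertex.getId().get() == 1;
  }

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // Superstep 0: initialize every node's distance to "infinity".
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // Best distance offered this superstep: 0 for the source node,
    // otherwise the minimum over all incoming messages.
    double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    // If a shorter path was found, record it and offer improved
    // distances (minDist + edge weight) to all neighbors.
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    // Vote to halt; the node is woken up again if a message arrives.
    vertex.voteToHalt();
  }
}

The computation converges when every node has voted to halt and no messages are in flight, at which point the node values hold the final shortest-path lengths.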

Notice that the job runs with a single worker, specified by the argument -w. To get more information about running a Giraph job, run the following command:

$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner -h

This will output the following:

usage: org.apache.giraph.utils.ConfigurationUtils [-aw <arg>] [-c <arg>]

      [-ca <arg>] [-cf <arg>] [-eif <arg>] [-eip <arg>] [-eof <arg>]
      [-esd <arg>] [-h] [-jyc <arg>] [-la] [-mc <arg>] [-op <arg>] [-pc
      <arg>] [-q] [-th <arg>] [-ve <arg>] [-vif <arg>] [-vip <arg>] [-vof
      <arg>] [-vsd <arg>] [-vvf <arg>] [-w <arg>] [-wc <arg>] [-yh <arg>]
      [-yj <arg>]
-aw,--aggregatorWriter <arg>           AggregatorWriter class
-c,--messageCombiner <arg>             Message combiner class
-ca,--customArguments <arg>            provide custom arguments for the
                                       job configuration in the form: -ca
                                       <param1>=<value1>,<param2>=<value2>
                                       -ca <param3>=<value3> etc. It
                                       can appear multiple times, and the
                                       last one has effect for the same param.
-cf,--cacheFile <arg>                  Files for distributed cache
-eif,--edgeInputFormat <arg>           Edge input format
-eip,--edgeInputPath <arg>             Edge input path
-eof,--edgeOutputFormat <arg>          Edge output format
-esd,--edgeSubDir <arg>                subdirectory to be used for the
                                       edge output
-h,--help                              Help
-jyc,--jythonClass <arg>               Jython class name, used if
                                       computation passed in is a python
                                       script
-la,--listAlgorithms                   List supported algorithms
-mc,--masterCompute <arg>              MasterCompute class
-op,--outputPath <arg>                 Vertex output path
-pc,--partitionClass <arg>             Partition class
-q,--quiet                             Quiet output
-th,--typesHolder <arg>                Class that holds types. Needed
                                       only if Computation is not set
-ve,--outEdges <arg>                   Vertex edges class
-vif,--vertexInputFormat <arg>         Vertex input format
-vip,--vertexInputPath <arg>           Vertex input path
-vof,--vertexOutputFormat <arg>        Vertex output format
-vsd,--vertexSubDir <arg>              subdirectory to be used for the
                                       vertex output
-vvf,--vertexValueFactoryClass <arg>   Vertex value factory class
-w,--workers <arg>                     Number of workers
-wc,--workerContext <arg>              WorkerContext class
-yh,--yarnheap <arg>                   Heap size, in MB, for each Giraph
                                       task (YARN only.) Defaults to
                                       giraph.yarn.task.heap.mb => 1024
                                       (integer) MB.
-yj,--yarnjars <arg>                   comma-separated list of JAR
                                       filenames to distribute to Giraph
                                       tasks and ApplicationMaster. YARN
                                       only. Search order: CLASSPATH,
                                       HADOOP_HOME, user current dir.
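
The -ca flag is how you override per-job parameters. For instance, the bundled shortest-paths example reads its source node from a configuration key; assuming the key name SimpleShortestPathsVertex.sourceId used by recent Giraph releases (verify it against your version), you could select node 0 as the source like this:

$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/hduser/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hduser/output/shortestpaths-from-0 \
  -w 1 \
  -ca SimpleShortestPathsVertex.sourceId=0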

You can monitor the progress of your Giraph job from the JobTracker web GUI. Once the job has completed, you can check the results with:

$HADOOP_HOME/bin/hadoop dfs -cat /user/hduser/output/shortestpaths/p* | less
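
With the default source node 1, you can verify the distances by hand against the tiny graph above: node 0 is reached directly at cost 1, node 2 at cost 2, node 3 at cost 1, and node 4 via 1 -> 3 -> 4 at cost 1 + 4 = 5. The output should therefore look roughly like the following (tab-separated node_id and distance; line order and part-file splits may vary):

0	1.0
1	0.0
2	2.0
3	1.0
4	5.0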

Getting involved

Giraph is an open-source project and external contributions are extremely appreciated. There are many ways to get involved:

   Subscribe to the mailing lists, particularly the user and developer lists, where you can get a feel for the state of the project and what the community is working on.
   Try out more examples and play with Giraph on your cluster. Be sure to ask questions on the user list or file an issue if you run into problems with your particular configuration.
   Browse the existing issues to find something you may be interested in working on. Take a look at the section on generating patches for detailed instructions on contributing your changes.
   Make Giraph more accessible to newcomers by updating this and other site documentation.

