<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Werx Limited &#187; hbase</title>
	<atom:link href="http://werxltd.com/wp/tag/hbase/feed/" rel="self" type="application/rss+xml" />
	<link>http://werxltd.com/wp</link>
	<description>We make IT work.</description>
	<lastBuildDate>Wed, 08 Sep 2010 17:00:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Simple HBase query bridge</title>
		<link>http://werxltd.com/wp/2010/03/25/simple-hbase-query-bridge/</link>
		<comments>http://werxltd.com/wp/2010/03/25/simple-hbase-query-bridge/#comments</comments>
		<pubDate>Thu, 25 Mar 2010 12:00:43 +0000</pubDate>
		<dc:creator>wes</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[software development]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[interoperability]]></category>
		<category><![CDATA[json-rpc]]></category>
		<category><![CDATA[map reduce]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[servlet]]></category>
		<category><![CDATA[tomcat]]></category>

		<guid isPermaLink="false">http://werxltd.com/wp/?p=522</guid>
		<description><![CDATA[I&#8217;ve recently released a simple json-rpc query bridge (using our own simple json-rpc framework) for HBase at http://code.google.com/p/hbasebridge/ You can use this bridge to query HBase for either the current record or the last few versions of a record. To see the methods http://localhost:8080/hbasebridge/rpc?debug=true Which returns a list of usable RPC methods: { "jsonrpc": "2.0", "result": [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve recently released a simple json-rpc query bridge (using our own <a href="http://werxltd.com/wp/portfolio/json-rpc/simple-java-json-rpc/">simple json-rpc framework</a>) for <a href="http://hadoop.apache.org/hbase/">HBase</a> at <a href="http://code.google.com/p/hbasebridge/">http://code.google.com/p/hbasebridge/</a>.</p>
<p>You can use this bridge to query HBase for either the current record or the last few versions of a record.</p>
<p>To see the available methods, request:</p>
<pre>http://localhost:8080/hbasebridge/rpc?debug=true</pre>
<p>Which returns a list of usable RPC methods:</p>
<pre class="brush:javascript">{
  "jsonrpc": "2.0",
  "result": {"method": [
    {
      "class": "com.werxltd.hbasebridge.HBaseInfo",
      "name": "listtables",
      "params": [],
      "returns": "org.json.JSONObject",
      "static": false
    },
    {
      "class": "com.werxltd.hbasebridge.HadoopInfo",
      "name": "clusterstatus",
      "params": [],
      "returns": "org.json.JSONObject",
      "static": false
    },
    {
      "class": "com.werxltd.hbasebridge.HadoopInfo",
      "name": "jobstatus",
      "params": [],
      "returns": "org.json.JSONObject",
      "static": false
    },
    {
      "class": "com.werxltd.jsonrpc.RPC",
      "name": "listrpcmethods",
      "params": [],
      "returns": "org.json.JSONObject",
      "static": false
    },
    {
      "class": "com.werxltd.hbasebridge.TableLookup",
      "name": "lookup",
      "params": ["org.json.JSONObject"],
      "returns": "org.json.JSONObject",
      "static": false
    }
  ]}
}</pre>
<p>To list tables:</p>
<pre>http://localhost:8080/hbasebridge/rpc?debug=true&#038;method=listtables</pre>
<p>Which returns:</p>
<pre class="brush:javascript">
{
  "jsonrpc": "2.0",
  "result": {"tables": [
    "mytable"
  ]}
}
</pre>
<p>To get the current status of the cluster:</p>
<pre>http://localhost:8080/hbasebridge/rpc?debug=true&#038;method=clusterstatus</pre>
<p>Which returns:</p>
<pre class="brush:javascript">{
  "jsonrpc": "2.0",
  "result": {
    "activetrackernames": [
      "trackernode1:localhost/127.0.0.1:33455",
      "trackernode2:localhost/127.0.0.1:54616"
    ],
    "blacklistedtrackernames": [],
    "blacklistedtrackers": 0,
    "jobqueues": {"queues": [{
      "jobs": [
        {
          "cleanuptasks": [{"state": ""}],
          "complete": false,
          "filename": "hdfs://hadoophdfsnode:9000/data/hadoop/mapred/system/job_201003191557_0442/job.xml",
          "jobpriority": "normal",
          "mapprogress": 1,
          "name": "My mapreduce job",
          "reduceprogress": 0.9819000363349915,
          "runstate": "running",
          "schedulinginfo": "NA",
          "setupprogress": 1,
          "starttime": 1269024863960,
          "username": "hadoop-admin"
        }
      ],
      "name": "default"
    }]},
    "jobtrackerstate": "running",
    "maptasks": 1,
    "maxmaptasks": 116,
    "maxmemory": 2079719424,
    "maxreducetasks": 58,
    "reducetasks": 16,
    "tasktrackers": 34,
    "ttyexpiryinterval": 600000,
    "usedmemory": 969170944
  }
}</pre>
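<p>A client will often want only a few headline figures from this document. The following is a minimal Python sketch, assuming the response has the shape shown above; the function name summarize_cluster and the trimmed sample are illustrative and not part of the bridge.</p>

```python
import json


def summarize_cluster(response_text):
    """Condense a clusterstatus response into a few headline figures."""
    result = json.loads(response_text)["result"]
    # Collect the names of all jobs that are still running, across queues.
    running = [
        job["name"]
        for queue in result["jobqueues"]["queues"]
        for job in queue["jobs"]
        if job["runstate"] == "running"
    ]
    return {
        "state": result["jobtrackerstate"],
        "tasktrackers": result["tasktrackers"],
        "memory_used_fraction": result["usedmemory"] / result["maxmemory"],
        "running_jobs": running,
    }


# Trimmed copy of the sample response above, for illustration only.
sample = json.dumps({
    "jsonrpc": "2.0",
    "result": {
        "jobqueues": {"queues": [{
            "jobs": [{"name": "My mapreduce job", "runstate": "running"}],
            "name": "default",
        }]},
        "jobtrackerstate": "running",
        "maxmemory": 2079719424,
        "tasktrackers": 34,
        "usedmemory": 969170944,
    },
})

summary = summarize_cluster(sample)
print(summary["running_jobs"], round(summary["memory_used_fraction"], 2))
```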
<p>Key/Value Query:</p>
<pre>http://localhost:8080/hbasebridge/rpc?debug=true&#038;data={"method":"lookup","params":{"table":"tablename","keys":["mykey"]}}</pre>
<p>Results:</p>
<pre class="brush:javascript">
{
  "jsonrpc": "2.0",
  "result": {"rows": [{"mykey": {
    "family:col": "myvalue"
  }}]}
}
</pre>
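<p>Typing the JSON payload into a URL by hand is error-prone, so a client would normally build and escape it programmatically. The following Python sketch assembles the lookup request; the endpoint is assumed to be the same one used in the earlier examples, and build_lookup_url is a hypothetical helper, not part of the bridge.</p>

```python
import json
from urllib.parse import urlencode

# Assumed endpoint, copied from the earlier examples in this post.
BASE_URL = "http://localhost:8080/hbasebridge/rpc"


def build_lookup_url(table, keys, versions=None, base_url=BASE_URL):
    """Build a lookup request URL with a properly escaped JSON payload."""
    params = {"table": table, "keys": list(keys)}
    if versions is not None:
        # Only ask for older versions when the caller wants them.
        params["versions"] = versions
    payload = json.dumps({"method": "lookup", "params": params})
    return base_url + "?" + urlencode({"debug": "true", "data": payload})


url = build_lookup_url("tablename", ["mykey"])
print(url)
```

<p>The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen from the standard library, and the body parsed with json.loads.</p>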
<p>Key/Value query with versions:</p>
<pre>http://localhost:8080/hbasebridge/rpc?debug=true&#038;data={"method":"lookup","params":{"table":"tablename","keys":["mykey"],"versions":2}}</pre>
<p>Results:</p>
<pre class="brush:javascript">
{
  "jsonrpc": "2.0",
  "result": {"rows": [{"mykey": {
    "family:col": [
      {
        "value": "myval",
        "version": 123456789
      },
      {
        "value": "myoldval",
        "version": 123456788
      }
    ]
  }}]}
}
</pre>
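<p>When versions are requested, a client usually still wants the newest value of each column. Here is a minimal Python sketch, assuming each column maps to a list of value/version entries and that a larger version number means a newer cell (HBase versions are timestamps by default); latest_values is a hypothetical helper.</p>

```python
import json


def latest_values(response_text):
    """Map each row key to {column: newest value} from a versioned lookup."""
    latest = {}
    for row in json.loads(response_text)["result"]["rows"]:
        for row_key, columns in row.items():
            latest[row_key] = {
                # Pick the cell with the highest version for every column.
                column: max(cells, key=lambda cell: cell["version"])["value"]
                for column, cells in columns.items()
            }
    return latest


# Sample shaped like the versioned response above, for illustration only.
sample = json.dumps({
    "jsonrpc": "2.0",
    "result": {"rows": [{"mykey": {
        "family:col": [
            {"value": "myval", "version": 123456789},
            {"value": "myoldval", "version": 123456788},
        ],
    }}]},
})

print(latest_values(sample))
```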
<p>The code should also provide a handy reference for anyone who wants to learn how to query HBase and scrape <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Result.html">Result</a> objects for values without knowing family or column names in advance.</p>
]]></content:encoded>
			<wfw:commentRss>http://werxltd.com/wp/2010/03/25/simple-hbase-query-bridge/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using Python with Hadoop</title>
		<link>http://werxltd.com/wp/2009/09/29/using-python-with-hadoop/</link>
		<comments>http://werxltd.com/wp/2009/09/29/using-python-with-hadoop/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 14:46:50 +0000</pubDate>
		<dc:creator>wes</dc:creator>
				<category><![CDATA[software development]]></category>
		<category><![CDATA[framework]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoopstreaming]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[jython]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[python hadoop framework]]></category>
		<category><![CDATA[zohmg]]></category>

		<guid isPermaLink="false">http://werxltd.com/wp/?p=222</guid>
		<description><![CDATA[First, some review Hadoop is a very powerful MapReduce framework based on a white paper released by Google documenting how they have successfully tackled the issue of processing large amounts of data (on the scale of petabytes in many cases) using their proprietary distributed filesystem, GFS. Hadoop is the open source version of this distributed [...]]]></description>
			<content:encoded><![CDATA[<h3>First, some review</h3>
<p><a href="http://hadoop.apache.org/">Hadoop</a> is a very powerful <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> framework based on a <a href="http://news.cnet.com/8301-10784_3-9955184-7.html">white paper released by Google</a> documenting how they have successfully tackled the issue of processing large amounts of data (on the scale of petabytes in many cases) using their proprietary distributed filesystem, <a href="http://labs.google.com/papers/gfs.html">GFS</a>. Hadoop is the open source version of this distributed file system<sup><a href="http://werxltd.com/wp/2009/09/29/using-python-with-hadoop/#footnote_0_222" id="identifier_0_222" class="footnote-link footnote-identifier-link" title="Technically Hadoop is an umbrella name whereas HDFS is the technical name for the GFS alternative.">1</a></sup>, heavily <a href="http://wiki.apache.org/hadoop/PoweredBy">supported by companies like</a> Yahoo, Google, Amazon, Adobe, <a href="http://hadoop.apache.org/hive/">Facebook</a>, Hulu, <a href="http://www-03.ibm.com/press/us/en/pressrelease/22613.wss">IBM</a>, <a href="http://blog.racklabs.com/?p=66">RackSpace</a>, etc., and has a growing number of related projects hosted by the Apache Foundation.</p>
<h3>Why we need to learn &#8220;yet another language&#8221;</h3>
<p>Yet even with all of the buzz and hoopla, many people find it difficult to set up Hadoop and start writing applications capable of leveraging the awesome power of a Hadoop cluster, and many find the learning curve of Java and the Hadoop APIs very steep.</p>
<p>Fortunately, one of the features available in Hadoop is <a href="http://wiki.apache.org/hadoop/HadoopStreaming">HadoopStreaming</a>, which allows programmers to specify any program (or script) as a mapper and/or reducer. Consequently, one of the most popular scripting languages to use alongside Hadoop is <a href="http://www.python.org/">Python</a><sup><a href="http://werxltd.com/wp/2009/09/29/using-python-with-hadoop/#footnote_1_222" id="identifier_1_222" class="footnote-link footnote-identifier-link" title="If you aren&#8217;t familiar with Python and want to learn, here is an excellent site for diving into the language and here is an excellent video series walking you through the basics.">2</a></sup>.</p>
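<p>To make this concrete, here is a minimal word-count sketch in Python written in the streaming style: the mapper and reducer read lines from standard input and write tab-separated key/value records to standard output. The script name wordcount.py and the map/reduce command-line switch are assumptions made for this illustration.</p>

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of each word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as "wordcount.py map" or "wordcount.py reduce" (hypothetical name).
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

<p>The same script can be tried locally without a cluster by piping a text file through the map step, then sort, then the reduce step. On a cluster, the two steps would be handed to the streaming jar through its -mapper and -reducer options.</p>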
<p>One of the reasons Python is well suited to this type of work is its ability to be <a href="http://en.wikipedia.org/wiki/Functional_programming">functional</a>, provided you are careful how you write it. This makes chopping well-written Python map/reduce scripts up into distributable units much easier.</p>
<h3>There&#8217;s a framework for that</h3>
<p>While <a href="http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python">it is possible to write plain Python scripts</a>, the folks at last.fm have helped create an excellent <a href="http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant">Python framework for Hadoop called Dumbo</a> to help streamline the process of writing MapReduce jobs in Python. <a href="http://wiki.github.com/klbostee/dumbo">Dumbo</a> seems to be a fairly <a href="http://wiki.github.com/klbostee/dumbo/short-tutorial">simple framework</a> with <a href="http://github.com/klbostee/dumbo/wikis/example-programs">plenty of examples</a> you can adapt to your <a href="http://www.audioscrobbler.net/development/dumbo/">particular needs</a>.</p>
<h3>There&#8217;s a framework for that too</h3>
<p>Hadoop has many sub-projects, and one that is fairly popular is called <a href="http://hadoop.apache.org/hbase/">HBase</a> which allows a more structured, database-like, approach to storing and retrieving data. An excellent Python framework for quickly parsing data into HBase tables is <a href="http://github.com/zohmg/zohmg">Zohmg</a>. This framework allows programmers to define tables in a <a href="http://www.yaml.org/">YAML</a> <a href="http://github.com/zohmg/zohmg/blob/master/examples/television/config/dataset.yaml">configuration file</a> and corresponding mappers as <a href="http://github.com/zohmg/zohmg/blob/master/examples/television/mappers/mapper.py">simple Python scripts</a>.</p>
<h3>Bringing it back home</h3>
<p>One of the biggest drawbacks to using HadoopStreaming is that it is inherently less efficient than writing MapReduce jobs in Java: the target script or application has to be initialized, and the data then has to be serialized, sent to the target application or script, processed, and sent back (if there are any reducers). All this context switching adds overhead that wouldn&#8217;t exist if the MapReduce job were kept in the JVM where Hadoop runs.</p>
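<p>The round trip described above can be illustrated with a toy Python sketch. It is not how Hadoop implements streaming; it only shows the steps that cost time: a separate interpreter is started, the records are serialized to text, pushed through a pipe, processed, and parsed again on the way back. The child program and the helper name run_through_child are made up for this example.</p>

```python
import subprocess
import sys

# Stand-in for an external mapper script: upper-cases every input line.
CHILD_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
)


def run_through_child(records):
    """Serialize records, pipe them through a separate process, parse the reply."""
    payload = "".join(record + "\n" for record in records)
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],  # starts a fresh interpreter
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.splitlines()


print(run_through_child(["alpha", "beta"]))
```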
<p><a href="http://www.jython.org/">Jython</a> is a viable option for compiling existing Python applications into Java bytecode so that they incur less of this performance penalty. This utility can come in handy if you decide that your &#8220;quick and dirty&#8221; Python script needs to be moved into a production environment.</p>
<ol class="footnotes"><li id="footnote_0_222" class="footnote">Technically Hadoop is an umbrella name whereas <a href="http://hadoop.apache.org/common/docs/current/hdfs_design.html">HDFS</a> is the technical name for the GFS alternative.</li><li id="footnote_1_222" class="footnote">If you aren&#8217;t familiar with Python and want to learn, <a href="http://diveintopython.org/">here is an excellent site</a> for diving into the language and here is an excellent <a href="http://www.youtube.com/watch?v=bDgD9whDfEY">video series</a> walking you through the basics.</li></ol>]]></content:encoded>
			<wfw:commentRss>http://werxltd.com/wp/2009/09/29/using-python-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
