Archive for January, 2011
Over the weekend I was fooling around with HBase. HBase is a distributed database, you would have one master node (name node) set up and multiple regional nodes set up to offer high availability, and performance; the downside is that HBase is not a relational database (that means no foreign keys and the other nice features that comes databases like mySQL). Hbase has its own unique way of storing data that is not too different from google’s BigTable. To understand how data is stored in HBase one can refer to http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture. (Although this is kind of out of date, you can still get a general picture).
HBase requires HDFS to be set up, the instruction for the whole process can be found at http://wiki.apache.org/hadoop/QuickStart.
HBase came with several interfaces: shell, restful, thrift and java. Below I will describe my experiences with some of the interfaces and my opinion on others.
HBase provides a JRUBY IRB-based shell that is very similar to the standard SQL shell. Starting the shell is simple, go to the bin directory of your Hbase, and do ./hbase shell. The shell is really well designed and allows a user to do table and row operations with minimal effort. To create a table in the shell one would only have to do:
The above command will create a table with the name ‘table1’ and with the column families ‘f1’, ‘f2’, and ‘f3’.
To insert data into hbase
put 'table1','row1','f1:col9', 'value', 'timestamp'
The above command will create an row with the key ‘row1’ in the table ‘table1’. ‘f1:col9’ is the specific column which the data “value” is to be put. As you can see this is a lot easier than the standard sql statements to create tables and insert data.
Pro: Easy to use
For a more detailed guide on the HBase shell visit http://wiki.apache.org/hadoop/Hbase/Shell.
The restful interface that HBase provides is quite nice in theory though it is still in alpha development stage. The HBase people calls the restful interface the “Stargate” interface. The Stargate interface runs on the Jetty server by default, but they provide a .war that allows the user to switch from Jetty into the apache tomcat server. There are a lot of discussions on tomcat vs Jetty online. The main lesson I got out of those discussions was that Jetty is more light weight than tomcat, but tomcat provides performance advantages (One would need to use JProfiler to confirm this). A lot of people tend to prefer restful interfaces because they are so easy to access and use, in fact many of the hbase adapter libraries will use restful interface bindings to communicate with HBase.
If you want the current restful docs specific to your version you’d have to go to http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/stargate/package-summary.html#package_description
replace the 0.20.6 with the version of your HBase. (to find the version of your hbase, simply go to the shell and type ‘version’).
Stargate can accept and return three types of data (depending on the header): xml, json, and protobufs(binary). It is good to mention that xml is the most supported data type. Another thing about the restful interface is that the columns, rows, and data are encoded in base64, so for every get and put you have to know what to encode and decode appropriately.
There are tons of posts on the HBase mailing list which talks about the (the lack of) performance that the restful interface provides. Some say that the restful interface is suitable for services which have many clients posting and reading data at a low rate.
In general I am not too happy with the restful interface that HBase provides.
Pro: Ease of access
Cons: Everything else
I admit that I haven’t used thrift with HBase and some says that the performance is terrific. I just didn’t like the fact that I’d have to install 64mb of C++ libraries just to communicate with HBase.
Cons: Client needs to install thrift
If you are doing development in Java, then kudos to you, this is probably the interface that you want to use. It has the best performance compared to all other interfaces and is the most supported interface.
Pros: Speed, community support
Cons: Java only
Thanks for reading my long rambling, the material of the posts are only of my opinion, experience and knowledge, please correct me if I am wrong.