Synapse Blog
Understanding HBase—1 The data model
At Internet Memory, we use HBase as a large-scale repository for our collections, holding terabytes of web documents in a distributed cluster. This article presents the data model of HBase, and explains how it stands between relational DBs and the "No Schema" approach.
Understanding the HBase data model
In 2006, a team at Google published a paper entitled Bigtable: A Distributed Storage System for Structured Data. It describes a distributed index designed to manage very large datasets ("petabytes of data") in a cluster of data servers. Bigtable supports key search, range search and high-throughput file scans, and also provides flexible storage for structured data. HBase is an open-source clone of Bigtable, and closely mimics its design.

At Internet Memory, we use HBase as a large-scale repository for our collections, holding terabytes of web documents in a distributed cluster. HBase is often likened to a large, distributed relational database. In fact, it shares many traits with "NoSQL" systems: distribution, fault tolerance, flexible modeling, and the absence of some features deemed essential in a centralized DBMS (e.g., concurrency control).

This article presents the data model of HBase, and explains how it stands between relational DBs and the "No Schema" approach. It will be followed by an introduction to both the Java and REST APIs, and a final article on system aspects.

The map structure: representing data with key/value pairs
We start with an idea familiar to Lisp programmers: association lists, which are nothing more than lists of key-value pairs. They constitute a simple and convenient way of representing the properties of an object. Our running example is the description of a Web document. For instance, using JSON notation:
{
"url": "http://internetmemory.org", "type": "text/html", "content": "my document content"
}
One obtains what is commonly called an associative array, a dictionary, or a map. Given a context (the object/document), the structure associates values with keys.
We can represent such data as a graph, as shown by the figure below. The key information is captured by edges, whereas data values reside at leaves.
| url | type | content |
|---|---|---|
| http://internetmemory.org | text/html | my document content |
Two features of this representation are worth noting:
- there is no schema that constrains the list of keys (unlike a relational table, where the schema is fixed and uniform for all rows),
- the value may itself be some complex structure.
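As a minimal sketch of these two points, the snippet below models two documents as plain Python dictionaries. Nothing here is HBase-specific: the second document, its `http://example.org/logo.png` URL and its `metadata` key are made up for illustration.

```python
# Two web documents represented as plain maps (Python dicts).
# The first one is the running example; the second one is hypothetical.
doc1 = {
    "url": "http://internetmemory.org",
    "type": "text/html",
    "content": "my document content",
}

# No schema: another document may carry a different set of keys...
doc2 = {
    "url": "http://example.org/logo.png",
    "type": "image/png",
    # ...and a value may itself be a complex structure.
    "metadata": {"width": 120, "height": 60},
}

# Given a context (the document), the map associates a value with each key.
print(doc1["type"])
print(sorted(doc2.keys()))
print(doc2["metadata"]["width"])
```

The two dictionaries live side by side even though their key sets differ, which is exactly what a fixed relational schema would forbid.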
An HBase "table" is a multi-map structure
Instead of keeping one value for each property of an object, HBase allows the storage of several versions. Each version is identified by a timestamp. How can we represent such a multi-version key-value structure? HBase simply replaces atomic values with a map whose key is the timestamp. The extended representation for our example is shown below. It helps convey the power and flexibility of the data representation. Now, our document is built from two nested maps:
- a first one, called "column" in HBase terminology (an unfortunate choice, since it has little to do with the relational concept of a column),
- a second one, called "timestamp" (each map is named after its key).
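The two nested maps can be sketched with ordinary Python dictionaries, as below. This is an illustration of the structure only, not the HBase API: the integer timestamps, the second version of `content`, and the helper names `latest` and `version_at` are assumptions introduced for the example.

```python
# A document as two nested maps: column -> timestamp -> value.
# Timestamps are made-up integers; real ones would typically be clock values.
doc = {
    "url": {1: "http://internetmemory.org"},
    "type": {1: "text/html"},
    "content": {1: "my document content", 2: "my updated content"},
}


def latest(row, column):
    """Return the value stored under the highest timestamp of a column."""
    versions = row[column]
    return versions[max(versions)]


def version_at(row, column, timestamp):
    """Return the most recent value at or before `timestamp`, or None."""
    versions = row[column]
    candidates = [t for t in versions if t <= timestamp]
    return versions[max(candidates)] if candidates else None


print(latest(doc, "content"))
print(version_at(doc, "content", 1))
```

Each atomic value of the earlier representation has become a small map of its own, so reading a property now means choosing a column first and a timestamp second.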
The full picture: rows and tables
So, now, we know how to represent our objects with the HBase data model. It remains to describe how we can put many objects (potentially millions or even billions of objects) in HBase. This is where HBase borrows some terminology from relational databases: objects are called "rows", and rows are stored in a "table". Although one could find some superficial similarities, this comparison is a likely source of confusion. Let us try to list the differences:
- a "table" is actually a map where each row is a value, and the key is chosen by the table designer.
- we already explained that the structure of a "row" has little to do with the flat representation of a relational row.
- regarding data manipulation, the nature of a "table" implies that two basic operations are available: put(key, row) and get(key): row. Nothing comparable to SQL here!
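To make the "table as a map" view concrete, here is a toy in-memory stand-in written in Python. It is a sketch under stated assumptions rather than HBase itself: the `ToyTable` class is hypothetical, it only mirrors the put(key, row) and get(key) operations named above, and using the document URL as the row key is one possible choice among others.

```python
class ToyTable:
    """Hypothetical in-memory "table": a map from row key to row."""

    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        """Store (or replace) the row associated with `key`."""
        self._rows[key] = row

    def get(self, key):
        """Return the row stored under `key`, or None if there is none."""
        return self._rows.get(key)


table = ToyTable()

# The row key is chosen by the table designer; here we assume the URL.
table.put(
    "http://internetmemory.org",
    {
        "type": {1: "text/html"},
        "content": {1: "my document content"},
    },
)

row = table.get("http://internetmemory.org")
print(row["type"][1])
print(table.get("http://unknown.example"))
```

The whole interface is two key-based calls: no joins, no predicates on values, and no query language.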
By Philippe Rigaux, Fri 06 Apr 2012