Synapse Blog

Synapse Blog

On the Power of HBase Filters

Filters are a powerful feature of HBase to delegate the selection of rows to the servers rather than moving rows to the Client. We present the filtering mechanism as an illustration of the general data locality principle and compare it to the traditional select-and-project data access pattern.

Dealing with massive amounts of data changes the way you think about data processing tasks. In a standard business application context, people use a Relational Database System (RDBMS) and consider this system as a service in charge of providing data to the client application. How this data is processed, manipulated, shown to the user, is considered to be the full responsability of the application. In other words, the role of the data server is restricted to what is does best: efficient, safe and consistent storage and access.

The naive approach that consists in getting all the required data at the Client in order to apply locally some processing should be limited in a distributed setting to trivial tasks operating on a tiny subset. There are two fundamentals reasons for that. First, this generates a lot of network exchanges, consuming without necessity a lot of resources and sometimes leading to unacceptable response time. Second, centralizing all the information then processing it, simply misses all the advantages brought by a powerful cluster of hundreds or even thousands machines. The lesson is simply:

When you deal with BigData, the data center is your computer.

It is fair to acknowledge that server-side computation is not limited to Hadoop-like frameworks, but is also possible with relational systems, in the form of “transactional SQL” languages - e.g., PL/SQL - which are executed in the server. Still, moving the computation to the data server, instead of moving the data to the client computing, becomes of the essence in a BigData context. The principle is most often termed data locality.

Server-side filtering

Let us examine one important feature of HBase which helps us to put this principle in action: filters. For concreteness, consider the canonical example of a program which scans a collection of Web documents and applies some analysis method to RSS feeds (typical of the daily tasks operated at Internet Memory). The algorithm is trivial: we need to access each document, check whether it is indeed a RSS one, and run the analytics. We can implement this algorithm as a program running at a Client node, using a scanner on some HBase table and a filter (on documents’ type) on the client side. The program retrieves the documents from the distributed system (input data flows), and locally performs the computation.

Non distributied mode

The disadvantage is obvious. For very large data sets, the computing and storage resources of the Client machine are likely to become quickly overwhelmed, creating a bottleneck. Most of the time will be spent by exchanging documents that do not participate to the result. HBase filters enable a different scenario, illustrated in the Figure below, when the selection of RSS documents occurs at each server. This is likely (on our example) to drastically limit communications by applying local data processing as much as possible. In such a setting, the Client plays the role of a coordinator sending pieces of code to each server, initiating, and possibly coordinating a fully decentralized computation.

Filters in HBase

Filters in HBase

Filters in HBase are objects implementing the “Filter” interface, equipped with a Boolean “FilterRow()” method which tells whether a row passes or not the filter. The semantics is that rows are filtered out by a filter, which means that you rather tell the rows that you want to ignore (just the opposite of what you are used to express if you are familiar with the SQL “WHERE” clause). Scanners can incorporate filters with the “setFilter()” method.

“scan.setFilter(myFilter);”

This means, among others consequences, that you can use filters for MapReduce jobs that take their inputs from a HBase scanner. We will not cover the full Filter functionality in this post (rather look at the HBase Wiki or the Javadoc) but briefly touch a few of it meain features.

HBase comes with a list of pre-defined filters, including:

     
  • “RowFilter”: data filtering based on row key values;
  •  
  • “FamilyFilter”: allows to filter out some families on their names;
  •  
  • “QualifierFilter”: allows to filter out some qualifiers on their names;
  •  
  • “ValueFilter”: allows to filter out some qualifiers on their values;
  •  
  • “TimestampsFilter”: allows to filter out rows based on a list of timestamps.
  •  

As suggested but these examples, filters are much more powerful than a simple all-or-nothing semantics applied to HBase rows. You can choose to filter out a row as whole based on some qualifier value, but also filter out some part of a row, namely a family and/or a column (qualifier) in a family, etc. In other words, you can apply filters to the schema (family and column names) as well as on the values, a recurring features of semi-stuctured data models.

Compare with the well-known SQL world. When you express a SELECT-FROM-WHERE query, you restrict the number or rows (with the “WHERE” clause) and the number of columns for each row (with the “SELECT” clause). Filters in HBase let you do both: fully ignore some rows, and for those rows that pass, restrict the family, columns, or timestamps. This must be related to the underlying motivation: limit as much as possible the network bandwidth used to communicate withe the client application.

There exists many other filters, some of which implementing utilities likes pagination. Again refer to the documentation. 

Combining filters: “FilterList”

Even though HBase comes with a lot of filter types, their expressive power would remain limited without the ability to combine them with Boolean connectors. This is the purpose of the “FilterList” class. A “FilterList” is defined by a connector (“and”, “or”) and a list of component filters (which can be of any type). The main constructor is:

“FilterList (Operator operator, List<Filter> filters)”

Since a filter list is itself a “Filter” instance, we can build hierarchies of filters representing nested Boolean combinations, and gaining the ability to obtain arbitrarily complex filters.

Building custom filters


Finally, it is worth mentioning that you can write your own filters, in case the existing ones would not be sufficient. This amounts to write a Java class subclassing the “FilterBase” abstract class, implementing a few abstract methods which operate, in the Region servers, on the local rows. The downside of custom filters is that they require a dissemination in the cluster prior to their execution, which makes the set-up of the whole system a bit more complicated.

Summary: let HBase do the data selection job for you!

The bottom line of what precedes is: do not ever overload in your application with the burden of row filtering! HBase can do it for you in a much more effective way, at scale. And this encourages us to thing as if our computing machinery is no longer our laptop, but a whole set of servers interconnected with a bandwidth. Consider the resources and limitations of such a system, particularly regarding the network bandwidth, and adopt the measures that avoid to overwhelm these resources. HBase filters (and other features to be covered next) are just built for that.

by: Philippe Rigaux, (post a comment)
 

 

Using Hadoop for Video Streaming

Internet Memory supplies a service to browse archived Web pages, including multimedia content. We use Hadoop, HDFS and HBase for storing and indexing our data, and associates this storage with a Web server that lets users navigate through the archive and retrieve documents. In the present post, we focus on videos and detail the solution adopted to serve true streaming from HDFS storage.

Basics

Many video formats are found on the Web, including Windows Media (.wmv), RealMedia (.rm), Quicktime (.mov), MPEG, Adobe Flash (.flv), etc. In order to display a video, we need a player, which can be incorporated in the Web browser. The player depends on the specific video format, but most browsers are able to detect the format and choose the appropriate player. Firefox for instance comes with a lot of plugins, which can be quickly integrated in the presence of a specific video to display it content.

There are basically two ways to play a video. The simplest one is a two-steps process: first the whole file is downloaded from the Web server to the user’s computer, and then displayed by the player running the local copy. It has the disadvantage that the download step may take a while is the file is big (hundreds of megabytes are not uncommon). The second one uses (true) streaming: the video file is split into fragments which are sent from the Web server to the player, giving the illusion of a continuous stream. From the user point of view, it looks as if a window is swept over the video content, saving the need of a full initial download of the whole file.

Obviously, streaming is a more involved method because it requires a strong coordination between the components involved in the process, namely the player, the Web server, and the file system from which the video is retrieved. We examine this technical issue in the context of a Hadoop system where files are stored in HDFS, a file system dedicated to large distributed storage.

File seeking with HDFS

At explained above, streaming requires a strong coordination between the Web server and the file system. The former produces requests to access chunks of the video file (think to what happens when the user suddenly requires a move to a specific part of the video), whereas the later must be able to seek in the file to position the cursor at a specific location. When using HDFS, enabling such a close cooperation turns out to be a problem because HDFS can in principle only be accessed through a Hadoop client, which the standard Apache server is not. We investigated two possible solutions: Hoop, the Hadoop web server, and Apache/FUSE.

Hoop (see http:///cloudera.github.com/hoop/) is an HTTP-HDFS-Connector. It allows the HDFS file system to be accessed via HTTP. A working local prototype has been developed using JW Player and a large video file. Streaming works, but seeking in an unbuffered part results in the playback stopping. It seems that the Hoop API does not support seeking in a file, so we had to give up this approach.

The second solution is based on HDFS/FUSE. FUSE (File System in User Space) is an API that captures the file system operations and allows to implement them with ad-hoc functions running in the the user’s processus space (thereby saving the need to change the operating system kernel, a tricky and dangerous option). FUSE is provided in Hadoop as a component named “Mountable HDFS” (see http://wiki.apache.org/hadoop/MountableHDFS). It lets the standard file system user or program see the HDFS name space as a locally mounted directory. All file system operations, including directory browsing, file opening and content access, are enabled over HDFS content through the FUSE interface.

Apache server configuration

It remained to configure Apache to access the mounted FUSE system and load content from video files. How this is done depends on the video format. At the moment, we tested and validated .mp4 files and Flash video files. For the first format we use H264 Streaming Module (see http://h264.code-shop.com/trac), an Apache plugin, which enables adaptive streaming. For FLV we used pseudo-stream module for Apache named “mod_flv”. Both behave nicely and go along with the mountable HDFS without problem.

Conclusion

The solution based on Apache + Mountable HDFS (FUSE) turned out to be both reliable, functionally adequate (seeking is well supported) and efficient. The architecture is simple and easy to set up, and allows to combine the benefits of HDFS for very large repositories and standard Web server streaming solutions. Although we chose to adopt Apache plugins in our current service, nothing keeps you from using a more powerful streaming server since the FUSE approach (virtually) moves all the HDFS content in the standard file system scope.

Hoop remains a potential option for the future, but it appeared not mature enough when we tested it, at least for the complex operations (seeking at a specific offset in a file) required by video streaming.

 

by: Philippe Rigaux, (post a comment)
 

 

Understanding HBase—1 The data model

At Internet Memory, we use HBase as a large-scale repository for our collections, holding terabytes of web documents in a distributed cluster. This article presents the data model of HBase, and explains how it stands between relational DBs and the "No Schema" approach.

Understanding the HBase data model

In 2006, the Google Labs team published a paper entitled BigTable: A Distributed Storage System for Structured Data. It describes a distributed index designed to manage very large datasets (``petabytes of data") in a cluster of data servers. BigTable supports key search, range search and high-throughput file scans, and also provides a flexible storage for structured data. HBase is an open-source clone of BigTable, and closely mimics its design. At Internet Memory, we use HBase as a large-scale repository for our collections, holding terabytes of web documents in a distributed cluster. HBase is often assimilated to a large, distributed relational database. It actually presents many aspects common to "NoSQL" systems: distribution, fault tolerance, flexible modeling, absence of some features deemed essential in centralized DBMS (e.g., concurrency), etc. This article presents the data model of HBase, and explains how it stands between relational DBs and the "No Schema" approach. It will be completed by an introduction to both the Java and REST APIs, and a final article on system aspects.

The map structure: representing data with key/value pairs

We start with an idea familiar to Lisp programmers of association lists, which are nothing more than key-value pairs. They constitute a simple and convenient way of representing the properties of an object. We use as a running example the description of a Web document. For instance, using the JSON notation:
{  
   'url': 'http://internetmemory.org', type: 'text/html', content: 'my document content' 
}
One obtains what is commonly called an associative array, a dictionary, or a map. Given a context (the object/document), the structure associates values to keys. We can represent such data as a graph, as shown by the figure below. The key information is captured by edges,whereas data values reside at leaves.
Key value
There exists many possible representations for a map. We showed a JSON example above, but XML is of course an appropriate choice. At first glance, a map can also be represented by a table. The above example is equivalently viewed as
urltypecontent
http://internetmemory.orgtext/htmlmy document content
However, this often introduces some confusion. It is worth understanding several important differences that make a map much more flexible than the strict (relational) table structure. In particular,
  • there is no schema that constrains the list of keys (unlike relational table where the schema is fixed and uniform for all rows),
  • the value may itself be some complex structure.
HBase, following BigTable, builds on this flexibility. First, we can add new key-value pair to describe an object, if needed.This does not require any pre-declaration at the 'schema' level, and the new key remains local. Other objects stored in the same HBase instance remain unaffected by the change. Second, a value can be another map, yielding a multi-map structure which is exemplified below.

An HBase "table" is a multi-map structure

Instead of keeping one value for each property of an object, HBases allows the storage of several versions. Each version is identified by a timestamp. How can we represent such a multi-versions, key-value structure? HBase simply replaces atomic values by a map where the key is the timestamp. The extended representation for our example is shown below. It helps to figure out the power and flexibility of the data representation. Now, our document is built from two nested maps,
  • a first one, called "column" in HBase terminology (an unfortunate choice, since this is hardly related to the column relational concept),
  • a second "timestamp" (each map is named after its key).
Our document is globally viewed as a column map. If we choose a column key, say, type, we obtain a value which is itself a second map featuring as many keys as there are timestamps for this specific column. In our example, there is only one timestamp for url (well, we can assume that the URL of the document does not change much). Looking at, respectively, type and content, we find the former has two versions and the latter three. Moreover, they only have one timestamp (t1) in common. Actually, the "timestamp" maps are completely independent from one another.
Multi-map
Note that we can add as many timestamps (hence, as many keys in one of the second-level maps) as we wish. And, in fact, this is true for the first-level map as well: we can add as many columns as we wish, at the document level, without having to impose such a change to all other documents in a same HBase instance. In essence, each object is just a self-described piece of information (think again to the flexible representation of semi-structured data formats like XML or JSON). In this respect, HBase is in the wake of other 'NoSQL' systems, and its data model shares many aspects that characterize this trend: no schema and self-description of objects. We are not completely done with the multi-map levels of HBase. Columns are grouped in column families, and a family is actually a key in a new map level, referring to a group of columns. In the Figure below, we define two families: meta, grouping url and type, and data representing the content of a document.
Full multi map
Important: Unlike the column and timestamp maps, the keys if a family map is fixed. We cannot add new families to a table once it is created. The family level constitutes therefore the equivalent of a relational schema, although, as we saw, the content of a family value may be a quite complex structure.

The full picture: rows and tables

So, now, we know how to represent our objects with the HBase data model. It remains to describe how we can put many objects (potentially, millions or even billions of object) in HBase. This is where HBase borrows some terminology to relational databases: objects are called "rows", and rows are stored in a "table". Although one could find some superficial similarities, this comparison is a likely source of confusion. Let us try to list the differences:
  1. a "table" is actually a map where each row is a value, and the key is chosen by the table designer.
  2. we already explained that the structure of a "row" has little to do with the flat representation of relational row.
  3. regarding data manipulation, the nature of a "table" implies that two basic operations are available: put(key, row) and get(key): row. Nothing comparable to SQL here!
Finally, it is worth noting that the "table" map is a sorted map: rows are grouped on the key value, and two rows close from one another (with respect to the keys order) are stored is the same physical area. This make possible (and efficient) range queries of keys. We further explore this feature is the article devoted to the system aspects of HBase. The following Figures summarize our structure for an hypothetical webdoc HBase table storing a large collection of web documents. Each document is indexed by its url (which is therefore the key of the highest level map). A row is itself a local map featuring a fixed number of keys defined by the family names (f1, f2, etc.), associated to values which are themselves maps indexed by columns. Finally, column values are versioned, and represented by a timestamp-index map. Columns and timestamps do no obey to a global schema: they are defined on a row basis. The columns may vary arbitrarily from one row to another, and so do the timestamps for columns.
HBase table
The multi-map structure of a HBase table can thus be summarized as
key -> family -> column -> timestamp -> value
It should be clear that the intuitive meaning of common concepts such as "table", "row", and "column" must be revisited when dealing with HBase data. In particular, considering HBase as a kind of large relational database is clearly a misleading option. HBase is essentially a key-value store with efficient indexing on key access, a semi-structured data model for value representation, and range-search capabilities supported by key ordering.

References

by: Philippe Rigaux, (post a comment)
 

 
 

Comments