Filters are a powerful HBase feature that delegates the selection of rows to the servers rather than moving rows to the client. We present the filtering mechanism as an illustration of the general data locality principle and compare it to the traditional select-and-project data access pattern.
Dealing with massive amounts of data changes the way you think about data processing tasks. In a standard business application context, people use a Relational Database Management System (RDBMS) and consider this system as a service in charge of providing data to the client application. How this data is processed, manipulated, and shown to the user is considered the full responsibility of the application. In other words, the role of the data server is restricted to what it does best: efficient, safe, and consistent storage and access.
The naive approach, which consists in fetching all the required data to the client in order to apply some processing locally, should be limited in a distributed setting to trivial tasks operating on a tiny subset of the data. There are two fundamental reasons for this. First, it generates a lot of network exchanges, needlessly consuming resources and sometimes leading to unacceptable response times. Second, centralizing all the information and then processing it simply misses all the advantages brought by a powerful cluster of hundreds or even thousands of machines. The lesson is simple: move the computation to the data, not the data to the computation.
It is fair to acknowledge that server-side computation is not limited to Hadoop-like frameworks: it is also possible with relational systems, in the form of procedural SQL languages (e.g., PL/SQL) which are executed in the server. Still, moving the computation to the data server, instead of moving the data to the client, becomes of the essence in a Big Data context. This principle is most often termed data locality.
Let us examine one important HBase feature which helps put this principle in action: filters. For concreteness, consider the canonical example of a program which scans a collection of Web documents and applies some analysis method to RSS feeds (typical of the daily tasks operated at Internet Memory). The algorithm is trivial: we need to access each document, check whether it is indeed an RSS document, and run the analytics. We could implement this algorithm as a program running on a client node, using a scanner on some HBase table and a filter (on the documents’ type) applied on the client side. The program would retrieve all the documents from the distributed system and locally perform both the filtering and the computation.
The disadvantage is obvious. For very large data sets, the computing and storage resources of the client machine are likely to be quickly overwhelmed, creating a bottleneck. Most of the time will be spent exchanging documents that do not contribute to the result. HBase filters enable a different scenario, illustrated in the figure below, where the selection of RSS documents occurs at each server. In our example, this drastically limits communications by applying data processing locally as much as possible. In such a setting, the client plays the role of a coordinator, sending pieces of code to each server and initiating, and possibly coordinating, a fully decentralized computation.
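To make the scenario concrete, here is a minimal sketch of such a server-side selection with the HBase Java client API. The table name `webdocs` and the `meta:type` column are hypothetical; the point is that the `SingleColumnValueFilter` comparison runs in the region servers, so only RSS documents travel over the network.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RssScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webdocs");  // hypothetical table

        // Keep only rows whose meta:type column equals "rss".
        // The comparison is evaluated in the region servers, not in the client.
        SingleColumnValueFilter rssOnly = new SingleColumnValueFilter(
                Bytes.toBytes("meta"), Bytes.toBytes("type"),
                CompareOp.EQUAL, Bytes.toBytes("rss"));
        rssOnly.setFilterIfMissing(true);  // drop rows without a meta:type column

        Scan scan = new Scan();
        scan.setFilter(rssOnly);
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            // run the analytics on each RSS document here
        }
        scanner.close();
        table.close();
    }
}
```

Running this requires a live HBase deployment, so treat it as a sketch rather than a drop-in program.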
Filters in HBase
Filters in HBase are objects implementing the “Filter” interface, equipped with a Boolean “filterRow()” method which tells whether a row passes the filter or not. The semantics is that rows are filtered out by a filter: you specify the rows that you want to ignore, which is just the opposite of what you are used to expressing if you are familiar with the SQL “WHERE” clause. Scans (and hence scanners) incorporate filters with the “setFilter()” method.
This means, among other consequences, that you can use filters for MapReduce jobs that take their input from an HBase scanner. We will not cover the full filter functionality in this post (see the HBase wiki or the Javadoc), but briefly touch on a few of its main features.
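As a hedged sketch of that MapReduce combination: `TableMapReduceUtil.initTableMapperJob()` accepts a `Scan` object, and any filter attached to that scan is applied server-side before rows ever reach the mappers. Table and column names below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class FilteredJob {
    static class DocMapper extends TableMapper<Text, IntWritable> {
        // map() only ever sees rows that passed the server-side filter
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
            // analytics on the filtered rows
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "rss-analytics");
        job.setJarByClass(FilteredJob.class);

        Scan scan = new Scan();
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("meta"), Bytes.toBytes("type"),
                CompareOp.EQUAL, Bytes.toBytes("rss")));

        // The filter travels with the scan: each mapper's input split
        // is already restricted to RSS documents by its region server.
        TableMapReduceUtil.initTableMapperJob(
                "webdocs", scan, DocMapper.class,
                Text.class, IntWritable.class, job);
        job.waitForCompletion(true);
    }
}
```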
HBase comes with a list of pre-defined filters, including:
- “RowFilter”: filters rows based on their row key values;
- “FamilyFilter”: filters out families based on their names;
- “QualifierFilter”: filters out columns based on their qualifier names;
- “ValueFilter”: filters out cells based on their values;
- “TimestampsFilter”: keeps only the cells whose timestamp belongs to a given list.
As suggested by these examples, filters are much more powerful than a simple all-or-nothing semantics applied to HBase rows. You can choose to filter out a row as a whole based on some qualifier value, but also filter out some part of a row, namely a family and/or a column (qualifier) in a family, etc. In other words, you can apply filters to the schema (family and column names) as well as to the values, a recurring feature of semi-structured data models.
Compare this with the well-known SQL world. When you express a SELECT-FROM-WHERE query, you restrict the number of rows (with the “WHERE” clause) and the number of columns for each row (with the “SELECT” clause). Filters in HBase let you do both: fully ignore some rows and, for those rows that pass, restrict the families, columns, or timestamps. This relates to the underlying motivation: limit as much as possible the network bandwidth used to communicate with the client application.
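The analogy can be sketched directly in code: column restriction plays the role of the SELECT clause, the filter that of the WHERE clause. All names are hypothetical. One caveat worth a comment: a `SingleColumnValueFilter` can only compare a column that the scan actually fetches.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class SelectWhereAnalogy {
    public static Scan rssTitles() {
        // SQL analogy (hypothetical names):
        //   SELECT title FROM webdocs WHERE type = 'rss'
        Scan scan = new Scan();

        // "SELECT" part: vertical restriction to two columns.
        // Note: meta:type must itself be fetched, otherwise the filter
        // below never sees the value it compares against.
        scan.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("title"));
        scan.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("type"));

        // "WHERE" part: horizontal restriction, evaluated server-side.
        SingleColumnValueFilter where = new SingleColumnValueFilter(
                Bytes.toBytes("meta"), Bytes.toBytes("type"),
                CompareOp.EQUAL, Bytes.toBytes("rss"));
        where.setFilterIfMissing(true);
        scan.setFilter(where);
        return scan;
    }
}
```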
There exist many other filters, some of which implement utilities like pagination. Again, refer to the documentation.
Combining filters: “FilterList”
Even though HBase comes with a lot of filter types, their expressive power would remain limited without the ability to combine them with Boolean connectors. This is the purpose of the “FilterList” class. A “FilterList” is defined by a connector (“and”, “or”) and a list of component filters (which can be of any type). The main constructor is:
“FilterList (Operator operator, List<Filter> filters)”
Since a filter list is itself a “Filter” instance, we can build hierarchies of filters representing nested Boolean combinations, gaining the ability to express arbitrarily complex filtering conditions.
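Here is a sketch of such a nested combination, with hypothetical table layout: keep the documents whose type is “rss” OR “atom” (a disjunctive list), AND whose row key starts with a given prefix (a conjunctive list wrapping the first one).

```java
import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedFilter {
    public static Scan feedsFromOrg() {
        // (type = 'rss' OR type = 'atom'): a disjunctive FilterList
        Filter rss = new SingleColumnValueFilter(
                Bytes.toBytes("meta"), Bytes.toBytes("type"),
                CompareOp.EQUAL, Bytes.toBytes("rss"));
        Filter atom = new SingleColumnValueFilter(
                Bytes.toBytes("meta"), Bytes.toBytes("type"),
                CompareOp.EQUAL, Bytes.toBytes("atom"));
        FilterList isFeed = new FilterList(
                FilterList.Operator.MUST_PASS_ONE, Arrays.asList(rss, atom));

        // AND row key starts with "org.": nest the disjunction inside
        // a conjunctive FilterList, which is itself a Filter instance.
        Filter orgKeys = new RowFilter(CompareOp.EQUAL,
                new BinaryPrefixComparator(Bytes.toBytes("org.")));
        FilterList query = new FilterList(
                FilterList.Operator.MUST_PASS_ALL, Arrays.asList(orgKeys, isFeed));

        Scan scan = new Scan();
        scan.setFilter(query);
        return scan;
    }
}
```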
Building custom filters
Finally, it is worth mentioning that you can write your own filters in case the existing ones are not sufficient. This amounts to writing a Java class that subclasses the “FilterBase” abstract class, implementing a few methods which operate, in the region servers, on the local rows. The downside of custom filters is that they must be disseminated in the cluster prior to their execution, which makes the set-up of the whole system a bit more complicated.
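The skeleton of such a subclass looks roughly as follows. The predicate (keeping only rows with even-length keys) is a toy example; a real filter would also implement the serialization methods required by the HBase version in use, and its compiled class must be present on every region server's classpath.

```java
import org.apache.hadoop.hbase.filter.FilterBase;

// Sketch of a custom filter keeping only rows whose key has even length.
public class EvenKeyFilter extends FilterBase {
    private boolean exclude = false;

    @Override
    public boolean filterRowKey(byte[] buffer, int offset, int length) {
        // Returning true means: filter this row out.
        exclude = (length % 2 != 0);
        return exclude;
    }

    @Override
    public boolean filterRow() {
        return exclude;
    }

    @Override
    public void reset() {
        // Called between rows; clear the per-row decision.
        exclude = false;
    }
}
```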
Summary: let HBase do the data selection job for you!
The bottom line of what precedes is: never overload your application with the burden of row filtering! HBase can do it for you in a much more effective way, at scale. And this encourages us to think as if our computing machinery were no longer our laptop, but a whole set of servers interconnected by a network. Consider the resources and limitations of such a system, particularly regarding network bandwidth, and adopt measures that avoid overwhelming these resources. HBase filters (and other features to be covered next) are built just for that.
Nice explanation of how HBase filters work, Philippe. Two small quibbles:
1. In your opening paragraph, you're asserting that the "normal" way to do things with a relational database is to select all the data in a table and send it to the client for processing (unless you're using a procedural language like PL/SQL). But I would argue that you'd almost always use a WHERE clause on the server side (which is plain SQL, not a part of the procedural language). WHERE clauses are exactly analogous to HBase filters. If an RDBMS application is selecting ALL the data in a table and sending it to the client for manipulation, it's doing something seriously wrong.
2. Filters on HBase aren't a substitute for proper row key design; if the underlying query requires scanning over every row of data on the server side, and you have a lot of data, it's going to perform extremely poorly b/c the server still has to physically read all of the rows on disk. Filters save network traffic and client processing time, which is important, but they don't fundamentally save disk IOs on the server, and that's going to be the dominating factor. This is also true for relational databases, but relational databases solve this problem using indexes (i.e. they keep a "cheat sheet" to answer your query without hitting every block of disk). HBase doesn't (yet) have a built-in indexing solution, so the only option with HBase is to lay out the data in the table in pretty much the same way you'll be accessing it (sorted by rowkey). Server-side filters don't relieve you of that burden, because the server is still having to look at all the data.
Those are minor clarifications, of course; kudos to you for writing an informative and well-written article!
(Ian Varley, 2012 03 08)
Thanks for these comments. Regarding the first point: yes, definitely, filters are about limiting network exchanges by slicing HBase tables both horizontally (filter out rows) and vertically (filter out columns). The second point is indeed of primary importance. When it comes to serving rows in milliseconds, distribution is a first step, but efficient local storage is also an essential asset. RDBMSs solve this with B-trees. How this is done in HBase deserves another post.
(Philippe Rigaux, 2012 03 15)
Make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, then using a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) and the rest of the cluster may sit idle.
(web designer usa, 2012 09 04)
Keep in mind that HBase only has a block index per file, which is rather coarse-grained and tells the reader that a key may be in the file because it falls into the start and end key range of an entry in the block index. Whether the key is actually present can only be determined by loading that block and scanning it. This also places a burden on the block cache, and you may create a lot of unnecessary churn that Bloom filters would help avoid.
(Cek, 2012 11 22)