Filters are a powerful HBase feature that delegates the selection of rows to the servers rather than shipping rows to the Client. We present the filtering mechanism as an illustration of the general data locality principle, and compare it to the traditional select-and-project data access pattern.
Dealing with massive amounts of data changes the way you think about data processing tasks. In a standard business application context, people use a Relational Database Management System (RDBMS) and consider this system as a service in charge of providing data to the client application. How this data is processed, manipulated, and shown to the user is considered to be the full responsibility of the application. In other words, the role of the data server is restricted to what it does best: efficient, safe and consistent storage and access.
The naive approach that consists in retrieving all the required data at the Client in order to apply some processing locally should be limited, in a distributed setting, to trivial tasks operating on a tiny subset of the data. There are two fundamental reasons for that. First, it generates a lot of network exchanges, needlessly consuming resources and sometimes leading to unacceptable response times. Second, centralizing all the information and then processing it simply misses all the advantages brought by a powerful cluster of hundreds or even thousands of machines. The lesson is simply: move the computation to the data, not the data to the computation.
It is fair to acknowledge that server-side computation is not limited to Hadoop-like frameworks, but is also possible with relational systems, in the form of procedural SQL languages - e.g., PL/SQL - which are executed in the server. Still, moving the computation to the data server, instead of moving the data to the client, becomes of the essence in a Big Data context. The principle is most often termed data locality.
Let us examine one important feature of HBase which helps us put this principle in action: filters. For concreteness, consider the canonical example of a program which scans a collection of Web documents and applies some analysis method to RSS feeds (typical of the daily tasks operated at Internet Memory). The algorithm is trivial: we need to access each document, check whether it is indeed an RSS one, and run the analytics. We can implement this algorithm as a program running at a Client node, using a scanner on some HBase table and a filter (on the documents’ type) applied on the client side. The program retrieves the documents from the distributed system (input data flows), and performs the computation locally.
The disadvantage is obvious. For very large data sets, the computing and storage resources of the Client machine are likely to become quickly overwhelmed, creating a bottleneck. Most of the time will be spent exchanging documents that do not participate in the result. HBase filters enable a different scenario, illustrated in the Figure below, in which the selection of RSS documents occurs at each server. This is likely (in our example) to drastically limit communications by applying local data processing as much as possible. In such a setting, the Client plays the role of a coordinator, sending pieces of code to each server, initiating, and possibly coordinating, a fully decentralized computation.
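The contrast between the two scenarios can be sketched with the standard HBase client API. This is an illustrative sketch only: the table name “webdocs”, the “meta:type” column and the “analyze()” helper are invented for the example, and the snippet assumes an HBase 1.x client on the classpath.

```java
// Sketch: client-side vs. server-side selection of RSS documents.
// Table "webdocs" and column "meta:type" are hypothetical names.
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RssScan {

    // Naive scenario: every document crosses the network; the client filters.
    static void clientSideScan(Connection conn) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("webdocs"));
             ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result row : scanner) {
                byte[] type = row.getValue(Bytes.toBytes("meta"), Bytes.toBytes("type"));
                if (type != null && "rss".equals(Bytes.toString(type))) {
                    analyze(row); // only now do we learn the row was useful
                }
            }
        }
    }

    // Data locality scenario: each Region server drops non-RSS rows itself.
    static void serverSideScan(Connection conn) throws IOException {
        Scan scan = new Scan();
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("meta"), Bytes.toBytes("type"),
                CompareOp.EQUAL, Bytes.toBytes("rss")));
        try (Table table = conn.getTable(TableName.valueOf("webdocs"));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
                analyze(row); // every received row is already an RSS document
            }
        }
    }

    static void analyze(Result row) { /* the RSS analytics would go here */ }
}
```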
Filters in HBase
Filters in HBase are objects implementing the “Filter” interface, equipped with a Boolean “filterRow()” method which tells whether a row passes the filter or not. The semantics is that rows are filtered out by a filter, which means that you rather specify the rows that you want to ignore (just the opposite of what you are used to expressing if you are familiar with the SQL “WHERE” clause). Scanners incorporate filters through the “setFilter()” method.
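To make the inverted semantics concrete, here is a tiny self-contained model (plain Java, not the real HBase API): the predicate answers true for rows to drop, the opposite of a SQL “WHERE” clause, which answers true for rows to keep.

```java
// A minimal, self-contained model of HBase's "filtered out" semantics.
// The predicate plays the role of filterRow(): true means "exclude this row".
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterSemantics {

    // Keep only the rows that are NOT filtered out by the predicate.
    public static List<String> applyFilter(List<String> rows,
                                           Predicate<String> filterRow) {
        return rows.stream()
                   .filter(r -> !filterRow.test(r)) // note the negation
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of("rss", "html", "rss", "pdf");
        // "Ignore everything that is not an RSS document"
        List<String> kept = applyFilter(docs, doc -> !doc.equals("rss"));
        System.out.println(kept); // [rss, rss]
    }
}
```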
This means, among other consequences, that you can use filters for MapReduce jobs that take their input from an HBase scanner. We will not cover the full filter functionality in this post (look at the HBase Wiki or the Javadoc instead), but briefly touch on a few of its main features.
HBase comes with a list of pre-defined filters, including:
- “RowFilter”: filters rows based on their row key values;
- “FamilyFilter”: filters out some families based on their names;
- “QualifierFilter”: filters out some qualifiers based on their names;
- “ValueFilter”: filters out some qualifiers based on their values;
- “TimestampsFilter”: filters out rows based on a list of timestamps.
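A few of these predefined filters can be instantiated as follows. This is a sketch assuming an HBase 1.x client; the row-key prefix, family name and value compared against are all made up for the example.

```java
// Sketch: instantiating some predefined HBase filters (HBase 1.x API).
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FamilyFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PredefinedFilters {
    public static void main(String[] args) {
        // Keep only rows whose key starts with a given prefix.
        Filter byRow = new RowFilter(CompareOp.EQUAL,
                new BinaryPrefixComparator(Bytes.toBytes("http://example.com/")));

        // Keep only cells belonging to the "meta" family.
        Filter byFamily = new FamilyFilter(CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes("meta")));

        // Keep only cells whose value equals "rss".
        Filter byValue = new ValueFilter(CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes("rss")));

        // A scan carries a single (possibly composite) filter.
        Scan scan = new Scan();
        scan.setFilter(byRow);
    }
}
```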
As suggested by these examples, filters are much more powerful than a simple all-or-nothing semantics applied to HBase rows. You can choose to filter out a row as a whole based on some qualifier value, but also filter out some part of a row, namely a family and/or a column (qualifier) in a family, etc. In other words, you can apply filters to the schema (family and column names) as well as to the values, a recurring feature of semi-structured data models.
Compare with the well-known SQL world. When you express a SELECT-FROM-WHERE query, you restrict the number of rows (with the “WHERE” clause) and the number of columns for each row (with the “SELECT” clause). Filters in HBase let you do both: fully ignore some rows, and, for those rows that pass, restrict the families, columns, or timestamps. This must be related to the underlying motivation: limit as much as possible the network bandwidth used to communicate with the client application.
There exist many other filters, some of which implement utilities like pagination. Again, refer to the documentation.
Combining filters: “FilterList”
Even though HBase comes with a lot of filter types, their expressive power would remain limited without the ability to combine them with Boolean connectors. This is the purpose of the “FilterList” class. A “FilterList” is defined by a connector (“and”, “or”) and a list of component filters (which can be of any type). The main constructor is:
“FilterList (Operator operator, List<Filter> filters)”
Since a filter list is itself a “Filter” instance, we can build hierarchies of filters representing nested Boolean combinations, gaining the ability to express arbitrarily complex filters.
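For instance, a nested combination such as “(row key has a given prefix) AND (value is ‘rss’ OR value is ‘atom’)” could be sketched as follows, again under the assumption of an HBase 1.x client; in this API, “and” and “or” are spelled “MUST_PASS_ALL” and “MUST_PASS_ONE”.

```java
// Sketch: a nested Boolean combination built from FilterList instances.
import java.util.Arrays;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FilterList.Operator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class NestedFilters {
    public static Filter rssOrAtomUnderPrefix() {
        // Inner "or": the cell value is either "rss" or "atom".
        FilterList either = new FilterList(Operator.MUST_PASS_ONE, Arrays.asList(
                new ValueFilter(CompareOp.EQUAL,
                        new BinaryComparator(Bytes.toBytes("rss"))),
                new ValueFilter(CompareOp.EQUAL,
                        new BinaryComparator(Bytes.toBytes("atom")))));

        // Outer "and": restrict to a row-key prefix, then apply the inner
        // disjunction. A FilterList is itself a Filter, hence the nesting.
        return new FilterList(Operator.MUST_PASS_ALL, Arrays.asList(
                new RowFilter(CompareOp.EQUAL,
                        new BinaryPrefixComparator(Bytes.toBytes("http://example.com/"))),
                either));
    }
}
```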
Building custom filters
Finally, it is worth mentioning that you can write your own filters, in case the existing ones are not sufficient. This amounts to writing a Java class subclassing the “FilterBase” abstract class, implementing a few abstract methods which operate, in the Region servers, on the local rows. The downside of custom filters is that they must be disseminated in the cluster prior to their execution, which makes the set-up of the whole system a bit more complicated.
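As a minimal sketch of what such a subclass might look like (HBase 1.x API; the filter itself, which keeps only cells whose value has at least a given length, is invented for illustration):

```java
// Sketch of a custom filter: keep only cells whose value has at least
// minLength bytes. A production-ready custom filter must also implement
// serialization (toByteArray()/parseFrom()), and its jar must be deployed
// on every Region server before it can be used.
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.filter.FilterBase;

public class MinLengthFilter extends FilterBase {
    private final int minLength;

    public MinLengthFilter(int minLength) {
        this.minLength = minLength;
    }

    // Called on every cell of a row, locally inside the Region server.
    @Override
    public ReturnCode filterKeyValue(Cell cell) {
        return CellUtil.cloneValue(cell).length >= minLength
                ? ReturnCode.INCLUDE
                : ReturnCode.SKIP;
    }
}
```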
Summary: let HBase do the data selection job for you!
The bottom line of what precedes is: do not ever overload your application with the burden of row filtering! HBase can do it for you in a much more effective way, at scale. And this encourages us to think as if our computing machinery were no longer our laptop, but a whole set of servers interconnected by a network. Consider the resources and limitations of such a system, particularly regarding the network bandwidth, and adopt the measures that avoid overwhelming these resources. HBase filters (and other features to be covered next) are built just for that.