Synapse Blog

Nice explanation of how HBase filters work, Philippe. Two small quibbles:

1. In your opening paragraph, you assert that the "normal" way to do things with a relational database is to select all the data in a table and send it to the client for processing (unless you're using a procedural language like PL/SQL). But I would argue that you'd almost always use a WHERE clause on the server side (which is plain SQL, not part of the procedural language). WHERE clauses are closely analogous to HBase filters. If an RDBMS application is selecting ALL the data in a table and sending it to the client for manipulation, it's doing something seriously wrong. :)

2. Filters on HBase aren't a substitute for proper row key design; if the underlying query requires scanning over every row of data on the server side, and you have a lot of data, it's going to perform extremely poorly because the server still has to physically read all of the rows on disk. Filters save network traffic and client processing time, which is important, but they don't fundamentally save disk I/O on the server, and that's going to be the dominating factor. This is also true for relational databases, but relational databases solve this problem using indexes (i.e. they keep a "cheat sheet" to answer your query without hitting every block on disk). HBase doesn't (yet) have a built-in indexing solution, so the only option with HBase is to lay out the data in the table in pretty much the same way you'll be accessing it (sorted by row key). Server-side filters don't relieve you of that burden, because the server still has to look at all the data.

Those are minor clarifications, of course; kudos to you for writing an informative and well-written article!

(Ian Varley, 2012 03 08)
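
Ian's second point can be illustrated with a small sketch. It is a toy model in plain Python, not HBase client code: the table is just a sorted list of (row key, value) pairs, and the key layout `userNNN#date`, the data, and the function names are all assumptions made for the illustration. It counts how many rows each access path has to read to answer the same query.

```python
import bisect

# Hypothetical table: rows are kept sorted by row key, as in an HBase table.
rows = [
    (f"user{u:03d}#2012-03-{d:02d}", f"event-{u}-{d}")
    for u in range(100)
    for d in range(1, 11)
]
rows.sort()
keys = [key for key, _ in rows]


def filter_scan(predicate):
    """Full scan with a filter: every row is read, only matches are returned."""
    examined = 0
    matches = []
    for key, value in rows:
        examined += 1
        if predicate(key):
            matches.append((key, value))
    return matches, examined


def range_scan(start_key, stop_key):
    """Scan bounded by row keys: only rows in [start_key, stop_key) are read."""
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_left(keys, stop_key)
    return rows[lo:hi], hi - lo


# Same question asked two ways: "all events of user042".
filtered, read_by_filter = filter_scan(lambda k: k.startswith("user042#"))
# "$" is the character right after "#" in ASCII, so it closes the prefix range.
ranged, read_by_range = range_scan("user042#", "user042$")

print(read_by_filter, read_by_range)
```

Both calls return the same rows, but the filter version reads the whole table while the range version reads only the rows under the `user042#` prefix. The filter saves network traffic; the row key layout is what saves reads.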

Hello Ian,

Thanks for these comments. Regarding the first point: yes, definitely, filters are about limiting network exchanges by slicing an HBase table both horizontally (filtering out rows) and vertically (filtering out columns). The second point is indeed of primary importance. When it comes to serving rows in milliseconds, distribution is a first step, but efficient local storage is also an essential asset. RDBMSs solve this with B-trees. How this is done in HBase deserves another post.

(Philippe Rigaux, 2012 03 15)

Make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.

(web designer usa, 2012 09 04)

Keep in mind that HBase only has a block index per file, which is rather coarse-grained: it tells the reader that a key may be in the file because it falls into a start and end key range in the block index. Whether the key is actually present can only be determined by loading that block and scanning it. This also places a burden on the block cache, and you may create a lot of unnecessary churn that Bloom filters would help avoid.

(Cek, 2012 11 22)
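
The point about block indexes and Bloom filters can be sketched as follows. This is a simplified model in plain Python, not HBase's actual file format: the `StoreFile` and `BloomFilter` classes, the bit-array size, and the hashing scheme are assumptions made for the illustration. The block index can only answer "maybe, the key falls in this block's range", so a block is loaded even when the key is absent; the Bloom filter can answer "definitely not" before any block is touched.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class StoreFile:
    """Toy stand-in for one file: sorted blocks, a block index, a Bloom filter."""

    def __init__(self, blocks):
        self.blocks = blocks  # each block is a sorted list of keys
        self.index = [(b[0], b[-1]) for b in blocks]  # key range per block
        self.bloom = BloomFilter()
        for block in blocks:
            for key in block:
                self.bloom.add(key)
        self.blocks_loaded = 0

    def get(self, key, use_bloom):
        if use_bloom and not self.bloom.might_contain(key):
            return False  # definitely absent: no block is loaded
        for i, (first, last) in enumerate(self.index):
            if first <= key <= last:
                self.blocks_loaded += 1  # index says "maybe": load and scan
                return key in self.blocks[i]
        return False


store = StoreFile([["row-a", "row-c", "row-e"], ["row-g", "row-i", "row-k"]])

# "row-d" is absent but falls inside the first block's key range.
found_plain = store.get("row-d", use_bloom=False)
loads_plain = store.blocks_loaded

store.blocks_loaded = 0
found_bloom = store.get("row-d", use_bloom=True)
loads_bloom = store.blocks_loaded
```

Without the Bloom filter, the lookup for the missing key always costs one block load. With it, the load is skipped unless the filter happens to return a false positive, which is why the comparison is "at most as many loads" rather than "always zero".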
