Mignify: A Big Data Refinery Built on HBase
Tuesday, May 22, 2012, 2:20pm – 3:00pm, InterContinental San Francisco Hotel
Presented by Stanislav Barton
This platform is partly supported by several EU projects, among them LAWA (Longitudinal Analytics of Web Archive data).
Mignify is a platform for collecting, storing, and analyzing Big Data harvested from the web. It aims to provide easy access to focused, structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage layer based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including their temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a large number of small, low-consumption nodes. This talk will cover the decisions made during the design and development of the platform, from both a technical and a functional perspective. It will introduce the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the mechanism that triggers analytics from a declarative filter/extraction specification. The design choices will be illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain.
HBaseCon 2012 is the first industry conference for Apache HBase users, contributors, administrators, and application developers, and we are glad to present