Internet research requires the ability to store and analyse large portions of the Web, a foundational block for most content-centric studies.
This calls for Web archives combined with a distributed infrastructure supporting extended analytical tools. With such an infrastructure, large-scale measurements, topological information and Internet-scale trends can be brought under the scrutiny of researchers and information professionals.
Internet Memory developed a new infrastructure with the ambition to reach “Web-scale” in terms of Web document acquisition (billions of resources crawled per week) and computable data storage (petabytes of data). This platform, partly supported by several EU projects, among them LAWA (Longitudinal Analytics of Web Archive data), includes:
- A new crawler, entirely implemented in Erlang, that supports the retrieval of billions of pages in days. Thanks to its innovative frontier and seen-URL data structures, it sustains high throughput for weeks while enabling Web-scale exploration.
- A new Web archive repository for content and metadata based on HBase. It offers a well-suited storage layer for Web archives, as it is functionally isomorphic to WARC but abstracts much of the underlying data management (replication, index creation, etc.) while exposing analysis-friendly APIs.
- Filters and extractors to distil relevant information and create processing chains in a distributed execution environment.
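The abstract does not describe the crawler's seen-URL structure in detail. As an illustration only, a compact probabilistic membership test such as a Bloom filter is one common way crawlers avoid re-fetching URLs at this scale; the sketch below is a minimal Python version of that general technique, not Internet Memory's actual implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a space-efficient set with no false negatives
    and a tunable false-positive rate, often used for seen-URL tests."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _indexes(self, url):
        # Derive k bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

    def add(self, url):
        for idx in self._indexes(url):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, url):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(url))

seen = BloomFilter()
seen.add("http://example.org/page")
```

A structure like this fits in memory even for billions of URLs (a few bits per entry), which is why variants of it are popular for long-running, high-throughput crawls.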
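The HBase schema itself is not given in the abstract. To make the "functionally isomorphic to WARC" idea concrete, the hypothetical sketch below shows one common row-key convention for web archives (similar in spirit to SURT-ordered URLs): reversing the hostname so that captures of the same site, and successive versions of the same resource, sort next to each other in HBase's lexicographic key order.

```python
from urllib.parse import urlparse

def row_key(url, timestamp):
    """Illustrative row key: reversed host + path + capture timestamp.
    Lexicographic sorting then clusters a site's captures together and
    orders versions of one URL chronologically."""
    parts = urlparse(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return f"{reversed_host}{parts.path or '/'}:{timestamp}"

keys = sorted([
    row_key("http://www.example.org/a", "2012"),
    row_key("http://archive.org/x", "2011"),
    row_key("http://www.example.org/a", "2011"),
])
```

Under such a layout, a WARC-style record (headers in one column family, payload in another) can be fetched by key or scanned in ranges, which is what makes the store friendly to both random access and MapReduce-style analytics.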
This presentation will offer an overview of this platform and discuss the next steps of its development.
International Internet Preservation Consortium (IIPC) 2012 General Assembly
Library of Congress, Washington DC
Tuesday, May 1, 2012, 2:30 pm – 3:45 pm (Members only)
Presented by Leïla Medjkoune