Longitudinal Analytics of Web Archive data
The Internet Memory Foundation has been involved since September 2010 in Longitudinal Analytics of Web Archive data (LAWA), a three-year European project funded by the European Commission through the Seventh Research Framework Programme under the theme [ICT-2009.1.6] Future Internet experimental facility and experimentally-driven research (Project No 258105).
LAWA will build an Internet-based experimental testbed for large-scale data analytics. Its emphasis is on developing a sustainable infrastructure, scalable methods, and software tools for aggregating, querying, and analyzing heterogeneous data at Internet scale. The efforts of the LAWA partners will converge on the design, development, and deployment of a virtual observatory for large-scale analytics of Web data. The LAWA Observatory will host datasets collected on the Web and make them freely available to the whole Web research community. It will also offer a data acquisition tool able to carry out focused crawls of the Web, collecting documents relevant to a specific theme (e.g., documents published on the Web during a given period and related to particular political, social, cultural, or economic events).
The expected project outcomes include scalability gains and community benefits along the production chain of Web data analytics:
- enhanced data capturing and storing
- efficient data distribution and indexing
- exploration, mining and knowledge discovery on aggregated data
- advanced graph analyses and quality assessment of Web data.
The Internet Memory Foundation will be specifically involved in Web-scale data acquisition. A major objective is the design and implementation of a new architecture for Web-scale crawling (billions of resources) and for storing the collected documents in a sophisticated data repository able to serve complex searches over Web-scale datasets.
The project is headed by the Max Planck Gesellschaft in Saarbrücken (Germany).
Besides the Internet Memory Foundation, the partners are the Hebrew University of Jerusalem (Israel), Hanzo Archives Limited (United Kingdom), the University of Patras (Greece), and the Computer and Automation Research Institute of the Hungarian Academy of Sciences (SZTAKI) in Budapest (Hungary).
For more information on Longitudinal Analytics of Web Archive data, please visit the project website.