<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom">

    <channel>
    
    <title><![CDATA[Internet Memory Foundation]]></title>
    <link>http://internetmemory.org/en</link>
    <description>The Internet Memory Foundation is a European non-profit institution dedicated to web archiving.</description>
    <dc:language>en</dc:language>
    <dc:creator>http://internetmemory.org/en</dc:creator>
    <dc:rights>Copyright 2012</dc:rights>
    <pubDate>Fri, 27 Apr 2012 09:54:39 GMT</pubDate>
    <atom:link href="http://internetmemory.org/en/index.php/RSS" rel="self" type="application/rss+xml" />

    

    <item>
      <title>Workshop at the IIPC 2012 General Assembly: Leveraging Web Archives Research</title>
      <link>http://internetmemory.org/en/index.php/News/workshop_at_the_iipc_2012_general_assembly_leveraging_web_archives_research</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/workshop_at_the_iipc_2012_general_assembly_leveraging_web_archives_research#id:153#date:09:54</guid>
      <description><![CDATA[Internet Memory developed a new infrastructure with the ambition to reach “Web-scale” in terms of Web document acquisition and computable data storage.<p>Internet research requires the ability to store and analyse large portions of the Web as a foundational block for most content-centric studies.</p>

<p>For this, Web archives combined with a distributed infrastructure supporting extended analytical tools are necessary. With such an infrastructure, large-scale measurements, topological information and trends at Internet scale can be brought under the scrutiny of researchers and information professionals.</p>

<p>Internet Memory developed a new infrastructure with the ambition to reach “Web-scale” in terms of Web document acquisition (billions of resources crawled per week) and computable data storage (petabytes of data). This platform, partly supported by several EU projects, among them LAWA (<a href="http://www.lawa-project.eu/">Longitudinal Analytics of Web Archive data</a>), includes:</p>

<p>-	<strong>A new crawler</strong>, entirely implemented in Erlang to support the retrieval of billions of pages in days. Thanks to its innovative frontier and seen-URL data structure, it sustains throughput for weeks while enabling Web-scale exploration.<br />
-	<strong>A new Web Archive repository</strong> for content and metadata based on HBase. It is a well-suited storage layer for Web archives: it is functionally isomorphic to WARC, but abstracts away much of the underlying data management (replication, index creation, etc.) while exposing analytics-friendly APIs.<br />
-	<strong>Filters and extractors</strong> to distil relevant information and create processing chains in a distributed execution environment.</p>

<p>This presentation will offer an overview of this platform and discuss the next steps of its development.<br />
<a href="http://netpreserve.org/events/2012ga.php">International Internet Preservation Consortium (IIPC) 2012 General Assembly</a><br />
Library of Congress, Washington DC<br />
Tuesday May 1, 2012, 2:30 pm - 3:45 pm (Members only)<br />
Presented by Leïla Medjkoune</p>



<p>&nbsp;</p>]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Fri, 27 Apr 2012 09:54 GMT</pubDate>
    </item>

    <item>
      <title>HBaseCon 2012: Mignify, A Big Data Refinery Built on HBase</title>
      <link>http://internetmemory.org/en/index.php/News/hbase_con2012_mignify_a_big_data_refinery_built_on_hbase</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/hbase_con2012_mignify_a_big_data_refinery_built_on_hbase#id:149#date:08:46</guid>
      <description><![CDATA[As part of the <a href="http://www.lawa-project.eu/">LAWA project</a>, IMF will present its progress on the design and development of a Big Data platform at <a href="http://www.hbasecon.com/">HBaseCon 2012</a>, on May 22, 2012 in San Francisco.<h1>Mignify: A Big Data Refinery Built on HBase</h1>

<p><a href="http://www.hbasecon.com/sessions/mignify-a-big-data-refinery-built-on-hbase/">HBaseCon 2012</a><br />
Tuesday, May 22, 2012, 2:20pm – 3:00pm, InterContinental San Francisco Hotel<br />
Presented by Stanislav Barton</p>

<p>This platform is partly supported by several EU projects among which LAWA (<a href="http://www.lawa-project.eu/">Longitudinal Analytics of Web Archive data</a>).</p>

<p><a href="http://mignify.com">Mignify</a> is a platform for collecting, storing and analyzing Big Data harvested from the Web. It aims to provide easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage layer based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a large number of small (low-consumption) nodes. This talk will cover the decisions made during the design and development of the platform, from both a technical and a functional perspective. It will introduce the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism of analytics based on a declarative filter/extraction specification. The design choices will be illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain. </p>

<p><a href="http://www.hbasecon.com/">HBaseCon 2012</a> is the first industry conference for Apache HBase users, contributors, administrators and application developers, and we are glad to be presenting there.</p>

<p>&nbsp;</p>]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Fri, 27 Apr 2012 08:46 GMT</pubDate>
    </item>

    <item>
      <title>Web Archiving at the College de France</title>
      <link>http://internetmemory.org/en/index.php/News/web_archiving_at_the_college_de_france</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/web_archiving_at_the_college_de_france#id:139#date:17:06</guid>
      <description><![CDATA[On March 28th, at 11.00 am, a <a href="http://www.college-de-france.fr/site/serge-abiteboul/ouverture-des-donnees-publiques-archivage-du-web-.htm">Web archiving seminar</a> will be given by Julien Masanès.<h3>At the College de France, Chair of Information Technology and Digital Sciences</h3>

<p>Information technology has revolutionized our lives. Computers are traditionally seen as computing machines, although their main purpose is now to manage data. This course will cover essential aspects of data management, including its close relationship with mathematical logic and complexity theory. The Web can be seen as a huge distributed database: its most exciting aspects will also be studied, such as its scale or the challenges of distributed computing and the Semantic Web.</p>

<h3>Wednesday, March 28th, from 10.00 am to 12.00 noon: Semantic Web, Open Data and Web Archiving</h3>
<p><a href="http://www.college-de-france.fr/site/en-serge-abiteboul/index.htm">Serge Abiteboul</a> opens the conference with a lecture about the Semantic Web and invites François Bancilhon, Director of DataPublica, and Julien Masanès, Director of the Internet Memory Foundation, to talk about Open Data and Web archiving.</p>

<h3>Feel free to join!</h3>
<p>Address:<br />
Amphithéâtre Maurice Halbwachs <br />
Collège de France<br />
11, place Marcelin Berthelot<br />
75231 Paris Cedex 05<br />
France</p>]]></description>
      <dc:subject><![CDATA[English, French,]]></dc:subject>
      <pubDate>Tue, 27 Mar 2012 17:06 GMT</pubDate>
    </item>

    <item>
      <title>On the Power of HBase Filters</title>
      <link>http://internetmemory.org/en/index.php/Synapse/on_the_power_of_hbase_filters</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/Synapse/on_the_power_of_hbase_filters#id:134#date:06:57</guid>
      <description><![CDATA[Filters are a powerful feature of HBase that delegates the selection of rows to the servers rather than moving rows to the Client. We present the filtering mechanism as an illustration of the general data locality principle and compare it to the traditional select-and-project data access pattern.<p>Dealing with massive amounts of data changes the way you think about data processing tasks. In a standard business application context, people use a Relational Database System (RDBMS) and consider this system as a service in charge of providing data to the client application. How this data is processed, manipulated and shown to the user is considered to be the full responsibility of the application. In other words, the role of the data server is restricted to what it does best: efficient, safe and consistent storage and access.</p>

<p>In a distributed setting, the naive approach of fetching all the required data to the Client in order to process it locally should be limited to trivial tasks operating on a tiny subset. There are two fundamental reasons for that. First, it generates a lot of network exchanges, needlessly consuming resources and sometimes leading to unacceptable response times. Second, centralizing all the information and then processing it simply misses all the advantages brought by a powerful cluster of hundreds or even thousands of machines. The lesson is simply:</p>

<center><em>When you deal with BigData, the data center is your computer.</em></center><p> </p>

<p>It is fair to acknowledge that server-side computation is not limited to Hadoop-like frameworks, but is also possible with relational systems, in the form of procedural extensions to SQL - e.g., PL/SQL - which are executed on the server. Still, moving the computation to the data server, instead of moving the data to the client, becomes essential in a BigData context. The principle is most often termed <em>data locality</em>. </p>

<h2>Server-side filtering</h2>

<p>Let us examine one important feature of HBase which helps us put this principle into action: <em>filters</em>. For concreteness, consider the canonical example of a program which scans a collection of Web documents and applies some analysis method to RSS feeds (typical of the daily tasks run at Internet Memory). The algorithm is trivial: we need to access each document, check whether it is indeed an RSS document, and run the analytics. We can implement this algorithm as a program running at a Client node, using a scanner on some HBase table and a filter (on the documents&#8217; type) on the client side. The program retrieves the documents from the distributed system (input data flows), and performs the computation locally.</p>

<p><img src="http://internetmemory.org/images/uploads/nondistr-computing.png" alt="Non-distributed mode" width="312" height="182" style="border: 0;" /></p>

<p>The disadvantage is obvious. For very large data sets, the computing and storage resources of the Client machine are likely to be quickly overwhelmed, creating a bottleneck. Most of the time will be spent transferring documents that do not contribute to the result. HBase filters enable a different scenario, illustrated in the Figure below, in which the selection of RSS documents occurs at each server. This is likely (in our example) to drastically limit communications by applying local data processing as much as possible. In such a setting, the Client plays the role of a coordinator sending pieces of code to each server, initiating, and possibly coordinating, a fully decentralized computation.</p>

<p><img src="http://internetmemory.org/images/uploads/filters.png" alt="Filters in HBase" width="312" height="182" style="border: 0;" /></p>
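<p>The difference between the two scenarios can be illustrated with a small, self-contained simulation. This is only a sketch in Python under stated assumptions: the "servers" are plain lists, the document records and the <em>is_rss</em> predicate are hypothetical, and nothing here calls the HBase API. It simply counts how many rows cross the "network" in each mode.</p>

<pre>
```python
# Toy model: each "server" holds a list of (row_key, content_type) rows.
# All names and data below are illustrative assumptions, not HBase calls.

def make_cluster():
    """Build three servers holding ten rows each, two of them RSS."""
    servers = []
    for s in range(3):
        rows = []
        for i in range(10):
            ctype = "application/rss+xml" if i % 5 == 0 else "text/html"
            rows.append((f"server{s}-doc{i}", ctype))
        servers.append(rows)
    return servers

def is_rss(row):
    """Selection predicate: keep only RSS documents."""
    return row[1] == "application/rss+xml"

def client_side_scan(servers):
    """Ship every row to the client, then filter locally."""
    transferred = [row for rows in servers for row in rows]
    selected = [row for row in transferred if is_rss(row)]
    return selected, len(transferred)

def server_side_scan(servers):
    """Apply the predicate on each server and ship only the matching rows."""
    transferred = []
    for rows in servers:
        transferred.extend(row for row in rows if is_rss(row))
    return transferred, len(transferred)

if __name__ == "__main__":
    cluster = make_cluster()
    _, moved_naive = client_side_scan(cluster)
    _, moved_filtered = server_side_scan(cluster)
    print("rows moved, client-side filtering:", moved_naive)
    print("rows moved, server-side filtering:", moved_filtered)
```
</pre>

<p>Both functions select the same documents; only the number of rows transferred differs, which is the point of pushing the selection to the servers.</p>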

<h2>Filters in HBase</h2>

<p>Filters in HBase are objects implementing the &#8220;Filter&#8221; interface, equipped with a Boolean &#8220;filterRow()&#8221; method which tells whether a row is filtered out. The semantics is that rows are filtered <em>out</em> by a filter, which means that you specify the rows that you want to ignore (just the opposite of what you express with the SQL &#8220;WHERE&#8221; clause). Scanners can incorporate filters with the &#8220;setFilter()&#8221; method.</p><center>
<p>&#8220;scan.setFilter(myFilter);&#8221;</p>
</center>
<p>This means, among other consequences, that you can use filters for MapReduce jobs that take their input from an HBase scanner. We will not cover the full Filter functionality in this post (see the HBase Wiki or the Javadoc for that) but briefly touch on a few of its main features.</p>

<p>HBase comes with a list of pre-defined filters, including:</p><ul>
<li>&#8220;RowFilter&#8221;: filters data based on row key values;</li>
<li>&#8220;FamilyFilter&#8221;: filters out some families based on their <em>names</em>;</li>
<li>&#8220;QualifierFilter&#8221;: filters out some qualifiers based on their <em>names</em>;</li>
<li>&#8220;ValueFilter&#8221;: filters out some qualifiers based on their <em>values</em>;</li>
<li>&#8220;TimestampsFilter&#8221;: filters data based on a list of timestamps.</li>
</ul>

<p>As suggested by these examples, filters are much more powerful than a simple all-or-nothing semantics applied to HBase rows. You can choose to filter out a row as a whole based on some qualifier value, but also filter out some part of a row, namely a family and/or a column (qualifier) in a family, etc. In other words, you can apply filters to the schema (family and column names) as well as to the values, a recurring feature of semi-structured data models.</p>

<p>Compare with the well-known SQL world. When you express a SELECT-FROM-WHERE query, you restrict the number of rows (with the &#8220;WHERE&#8221; clause) and the number of columns for each row (with the &#8220;SELECT&#8221; clause). Filters in HBase let you do both: fully ignore some rows, and for those rows that pass, restrict the families, columns, or timestamps. This relates to the underlying motivation: limit as much as possible the network bandwidth used to communicate with the client application.</p>
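<p>The parallel with SELECT and WHERE can be made concrete with a minimal sketch. It models a table as a Python dictionary of rows, each row being a dictionary of families; the function name, the sample keys and the two callbacks are assumptions made for illustration and do not correspond to HBase classes.</p>

<pre>
```python
def select_and_project(table, keep_row, keep_family):
    """Drop whole rows (like WHERE), then drop families of kept rows (like SELECT)."""
    result = {}
    for key, families in table.items():
        if not keep_row(key, families):
            continue  # the whole row is ignored
        result[key] = {
            name: columns
            for name, columns in families.items()
            if keep_family(name)
        }
    return result

# Hypothetical two-row table with a "meta" and a "content" family.
table = {
    "org.example/feed": {"meta": {"type": "rss"}, "content": {"body": "feed body"}},
    "org.example/home": {"meta": {"type": "html"}, "content": {"body": "page body"}},
}

out = select_and_project(
    table,
    keep_row=lambda key, fams: fams["meta"]["type"] == "rss",
    keep_family=lambda name: name == "meta",
)
print(out)
```
</pre>

<p>Only the RSS row survives, and only its "meta" family is returned, so the bulky "content" family never has to leave the server in this model.</p>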

<p>There exist many other filters, some of which implement utilities such as pagination. Again, refer to the documentation.</p>
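<p>The "filter out" semantics described at the start of this section is easy to get backwards, so here is a minimal sketch of it. The class, its <em>filter_row</em> method and the <em>scan</em> helper are hypothetical Python stand-ins that only mirror the idea of a Boolean row-filtering method; they are not the HBase interface.</p>

<pre>
```python
class ContentTypeFilter:
    """Hypothetical filter: excludes rows whose type differs from `wanted`."""

    def __init__(self, wanted):
        self.wanted = wanted

    def filter_row(self, row):
        # True means "filter this row OUT", False means "let it pass".
        return row["type"] != self.wanted

def scan(rows, row_filter=None):
    """Yield the rows that are not excluded by the filter (if any)."""
    for row in rows:
        if row_filter is None or not row_filter.filter_row(row):
            yield row

rows = [
    {"key": "a", "type": "rss"},
    {"key": "b", "type": "html"},
    {"key": "c", "type": "rss"},
]
kept = [r["key"] for r in scan(rows, ContentTypeFilter("rss"))]
print(kept)
```
</pre>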

<h2>Combining filters: &#8220;FilterList&#8221;</h2>

<p>Even though HBase comes with a lot of filter types, their expressive power would remain limited without the ability to combine them with Boolean connectors. This is the purpose of the &#8220;FilterList&#8221; class. A &#8220;FilterList&#8221; is defined by a connector (&#8220;and&#8221;, &#8220;or&#8221;) and a list of component filters (which can be of any type). The main constructor is:</p><center>
<p>&#8220;FilterList (Operator operator, List&lt;Filter&gt; filters)&#8221;</p>
</center>
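<p>The idea of combining filters with a Boolean connector can be modeled in a few lines. The sketch below is not the HBase class: it is a Python stand-in whose <em>Operator</em> values borrow the names MUST_PASS_ALL ("and") and MUST_PASS_ONE ("or") from HBase, while <em>PredicateFilter</em> and the <em>accepts</em> method are assumptions introduced for the example. For readability, <em>accepts</em> returns True when a row passes.</p>

<pre>
```python
from enum import Enum

class Operator(Enum):
    MUST_PASS_ALL = "and"
    MUST_PASS_ONE = "or"

class PredicateFilter:
    """Leaf filter wrapping a function that says whether a row passes."""

    def __init__(self, passes):
        self.passes = passes

    def accepts(self, row):
        return self.passes(row)

class FilterList:
    """Composite filter; being a filter itself, it can contain other lists."""

    def __init__(self, operator, filters):
        self.operator = operator
        self.filters = list(filters)

    def accepts(self, row):
        results = (f.accepts(row) for f in self.filters)
        if self.operator is Operator.MUST_PASS_ALL:
            return all(results)
        return any(results)

# (type is rss OR type is atom) AND size of at least 100
is_feed = FilterList(Operator.MUST_PASS_ONE, [
    PredicateFilter(lambda r: r["type"] == "rss"),
    PredicateFilter(lambda r: r["type"] == "atom"),
])
big_feed = FilterList(Operator.MUST_PASS_ALL, [
    is_feed,
    PredicateFilter(lambda r: r["size"] >= 100),
])

print(big_feed.accepts({"type": "atom", "size": 150}))
```
</pre>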

<p>Since a filter list is itself a &#8220;Filter&#8221; instance, we can build hierarchies of filters representing nested Boolean combinations, and thus obtain arbitrarily complex filters.</p>

<h2>Building custom filters</h2>

<p>Finally, it is worth mentioning that you can write your own filters, in case the existing ones are not sufficient. This amounts to writing a Java class that extends the &#8220;FilterBase&#8221; abstract class and implements a few methods which operate, in the Region servers, on the local rows. The downside of custom filters is that they must be deployed across the cluster before they can be executed, which makes the set-up of the whole system a bit more complicated.</p>

<h2>Summary: let HBase do the data selection job for you!</h2>

<p>The bottom line is: never burden your application with row filtering! HBase can do it for you much more effectively, at scale. This encourages us to think of our computing machinery no longer as our laptop, but as a whole set of servers interconnected by a network of limited bandwidth. Consider the resources and limitations of such a system, particularly regarding network bandwidth, and adopt measures that avoid overwhelming these resources. HBase filters (and other features to be covered next) are built for just that.</p>]]></description>
      <dc:subject><![CDATA[English, Big Data, Hadoop, Hbase,]]></dc:subject>
      <pubDate>Fri, 02 Mar 2012 06:57 GMT</pubDate>
    </item>

    <item>
      <title>Preserving Research Projects&#8217; websites</title>
      <link>http://internetmemory.org/en/index.php/Memoranda/preserving_research_projects_websites</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/Memoranda/preserving_research_projects_websites#id:131#date:13:04</guid>
      <description><![CDATA[Good research project management often requires creating and maintaining a project website that is used to publish new developments and results. But what happens to such a website when the project and its funding end?<h3>Inside Installations use case</h3>

<p><img src="http://internetmemory.org/images/uploads/InsideInstallation_thumb.png" alt="InsideInstallation" width="600" height="339"  style="border: 0;" /></p>

<p>A few months ago, the <a href="http://www.cultureelerfgoed.nl/"><strong>Cultural Heritage Agency of the Netherlands</strong></a> (RCE) contacted us to explain its situation:</p>

<p><strong>Inside Installations Project</strong>, Preservation and Presentation of Installation Art, was a research project (2004-2007) into the management and conservation of installations and was supported by the European Commission’s Culture 2000 programme. <br />
Rapid obsolescence of media technologies, interactivity and, for instance, the site specific character of many installations are a challenge for prevailing views about long-term conservation, documentation and presentation. Thirty complex installations (many multimedia) were re-installed, investigated and documented. <br />
By sharing their experience, the partners worked together to develop guidelines for conservation, re-installation and documentation of installation art. </p>

<p>The Cultural Heritage Agency of the Netherlands was the coordinator of the project, which was co-organised by: <br />
- <a href="http://www.tate.org.uk/">Tate</a>, London; <br />
- <a href="http://www.duesseldorf.de/restaurierungszentrum/index.shtml">Restaurierungszentrum</a>, Düsseldorf; <br />
- <a href="http://www.smak.be/">Stedelijk Museum for Modern Art (S.M.A.K.)</a>, Ghent; <br />
- <a href="http://www.museoreinasofia.es/portada/portada.php">Museo Nacional Centro de Arte Reina Sofia</a>, Madrid <br />
- and the <a href="http://www.sbmk.nl/">Foundation for the Conservation of Modern Art (SBMK)</a> in The Netherlands.</p>

<p>In this framework, they developed a <a href="http://www.inside-installations.org/">content-rich website</a> (Online Version). (<a href="http://collections.europarchive.org/rce/20120208162002/http://www.inside-installations.org/"><em>Archived Version</em></a>)</p>

<p>More than four years after the end of the project, maintaining this website means a recurring annual expense for the coordinator, who does not have specific funding for this. <br />
What alternatives did the coordinator have? <br />
- To continue to fund the website itself, or ask other institutions for contributions,<br />
- To close the website, remove all content and make it unavailable,<br />
- Or to archive it and ensure open access to its Web archive.</p>

<h3>Internet Memory proposes solutions</h3>

<p>The consortium decided to follow the initiative of the Cultural Heritage Agency of the Netherlands (RCE): to pay once and for all for the archiving of the project website “www.inside-installations.org” and thus <strong>to preserve the results of the European project</strong> Inside Installations. <br />
The process of Web archiving and preservation was delegated to Internet Memory Foundation. </p>

<p>See <a href="http://collections.europarchive.org/rce/20120208162002/http://www.inside-installations.org/">archived version</a> captured in February 2012.</p>

<h3>Results of such Web archiving initiatives</h3>

<p><strong>* Websites are preserved and therefore they might remain a part of the cultural heritage for decades.<br />
* They are publicly available <a href="http://internetmemory.org/en/index.php/about/collections1">online</a>.<br />
* This solution is less expensive than maintaining websites that are no longer updated.</strong></p>

<h6><em><strong>Web archiving as an efficient solution to offer a second life to your project websites!</strong></em></h6>

<p>Internet Memory proposes solutions to archive and preserve high quality websites, such as research project websites, thanks to its automated Web archiving platform, <a href="http://archivethe.net"><strong>ArchivetheNet</strong></a>.</p>

<p>&nbsp;</p>]]></description>
      <dc:subject><![CDATA[English, French,]]></dc:subject>
      <pubDate>Mon, 20 Feb 2012 13:04 GMT</pubDate>
    </item>

    <item>
      <title>Using Hadoop for Video Streaming</title>
      <link>http://internetmemory.org/en/index.php/Synapse/using_hadoop_for_video_streaming</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/Synapse/using_hadoop_for_video_streaming#id:130#date:08:52</guid>
      <description><![CDATA[Internet Memory supplies a service to browse archived Web pages, including multimedia content. We use Hadoop, HDFS and HBase for storing and indexing our data, and associate this storage with a Web server that lets users navigate through the archive and retrieve documents. In the present post, we focus on <i>videos</i> and detail the solution adopted to serve true streaming from HDFS storage. <h2>Basics</h2>
<p>
Many video formats are found on the Web, including Windows Media (.wmv), RealMedia (.rm), QuickTime (.mov), MPEG, Adobe Flash (.flv), etc. In order to display a video, we need a <em>player</em>, which can be embedded in the Web browser. The player depends on the specific video format, but most browsers are able to detect the format and choose the appropriate player. Firefox, for instance, supports many plugins, which are loaded when a specific video requires them to display its content.
</p>

<p>
There are basically two ways to play a video. The simplest one is a two-step process:
first the whole file is downloaded from the Web server to the user&#8217;s computer, and then the player plays the local copy. Its disadvantage is that the download step may take a while if the file is big (hundreds of megabytes are not uncommon).
The second one uses (true) <em>streaming</em>: the video file is split into fragments which are sent from the Web server to the player, giving the illusion of a continuous stream. From the user&#8217;s point of view, it looks as if a window were swept over the video content, avoiding the need for a full
initial download of the whole file.
</p>
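<p>The fragment-by-fragment delivery described above can be sketched as a simple generator. This is an illustrative Python model only: the function name, the 64 KB fragment size and the in-memory stand-in for a video file are assumptions, and a real streaming server does considerably more (container parsing, buffering, bitrate adaptation).</p>

<pre>
```python
import io

def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield successive fragments of a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# An in-memory buffer of 150,000 bytes stands in for a real video file.
video = io.BytesIO(b"x" * 150_000)
sizes = [len(chunk) for chunk in iter_chunks(video)]
print(sizes)
```
</pre>

<p>The player can start consuming the first fragment while the later ones have not been read yet, which is what removes the initial full download.</p>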
<p>
Obviously, streaming is a more involved method because it requires a strong coordination between the components involved in the process, namely the player, the Web server, and the file system from which the video is retrieved. We examine this technical issue in the context of a Hadoop system where files are stored in HDFS, a file system dedicated to large distributed storage. 
</p>
<p>
<img src="http://internetmemory.org/images/uploads/components.png" alt="" width="405" height="93" style="border: 0;" />
</p>

<h2>File seeking with HDFS</h2>

<p>
As explained above, streaming requires a strong coordination between the Web server and the file system. The former
issues requests to access chunks of the video file (think of what happens when the user suddenly jumps to a specific part of the video), whereas the latter must be able to seek in the file to position the cursor at a specific location. When using HDFS, enabling such a close cooperation turns out to be a problem because HDFS can in principle only be accessed through a Hadoop client, which the standard Apache server is not. We investigated two possible solutions: Hoop, the Hadoop web server, and Apache/FUSE.
</p>
<p>
Hoop (see http://cloudera.github.com/hoop/) is an HTTP-HDFS connector. It allows the HDFS file system to be accessed via HTTP. 
A working local prototype has been developed using JW Player and a large video file.
Streaming works, but seeking in an unbuffered part results in the playback stopping. 
It seems that the Hoop API does not support seeking in a file, so we had to give up this approach.
</p>
<p>
The second solution is based on HDFS/FUSE. FUSE (Filesystem in Userspace) is an API that intercepts file system operations and allows them to be implemented by ad-hoc functions running in user process space (thereby avoiding the need to change the operating system kernel, a tricky and dangerous option). FUSE support is provided in Hadoop as a component named &#8220;Mountable HDFS&#8221; (see <a href="http://wiki.apache.org/hadoop/MountableHDFS">http://wiki.apache.org/hadoop/MountableHDFS</a>). It lets a standard file system user or program see the HDFS name space as a locally mounted directory. All file system operations, including directory browsing, file opening and content access, are enabled over HDFS content through the FUSE interface. 
</p>
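<p>Once HDFS content is visible as a local directory, serving a requested portion of a video reduces to an ordinary seek-and-read on a file path. The sketch below illustrates this with a simple HTTP-style "bytes=start-end" range. The function names are hypothetical, a temporary file stands in for a video under the mount point, and this is not a description of what the Apache modules do internally.</p>

<pre>
```python
import os
import tempfile

def parse_range(header, file_size):
    """Parse a single 'bytes=start-end' range; the end may be omitted."""
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError("unsupported Range header")
    start_text, _, end_text = spec.strip().partition("-")
    if not start_text:
        raise ValueError("suffix ranges are not handled in this sketch")
    start = int(start_text)
    end = int(end_text) if end_text else file_size - 1
    end = min(end, file_size - 1)
    if start > end:
        raise ValueError("range not satisfiable")
    return start, end

def read_range(path, header):
    """Seek to the requested offset and return only the requested bytes."""
    size = os.path.getsize(path)
    start, end = parse_range(header, size)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start + 1)

# A temporary file plays the role of a video under the FUSE mount point.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)))
    demo_path = tmp.name

middle = read_range(demo_path, "bytes=100-103")
tail = read_range(demo_path, "bytes=250-")
os.remove(demo_path)
print(list(middle), len(tail))
```
</pre>

<p>Because the code only uses standard file operations, the same logic would apply unchanged to any path, whether it is backed by a local disk or by a mounted HDFS name space.</p>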
<h2>Apache server configuration</h2>
<p>
It remained to configure Apache to access the mounted FUSE system and load content from video files. 
How this is done depends on the video format. So far, we have tested and validated
<i>.mp4</i> files and Flash video files. For the first format we use the H264 Streaming Module (see <a href="http://h264.code-shop.com/trac">http://h264.code-shop.com/trac</a>), an Apache plugin which enables adaptive streaming. For FLV we use a pseudo-streaming module for Apache named &#8220;mod_flv&#8221;. Both behave nicely and work with the mountable HDFS without problems.
</p>
<h2>Conclusion</h2>

<p>The solution based on Apache + Mountable HDFS (FUSE) turned out to be reliable, functionally adequate (seeking is well supported) and efficient. The architecture is simple and easy to set up, and combines the benefits of HDFS for very large repositories with standard Web server streaming solutions. Although we chose to adopt Apache plugins in our current service, nothing keeps you from using a more
powerful streaming server since the FUSE approach (virtually) moves all the HDFS content into the standard file system scope. 
</p>
<p>
Hoop remains a potential option for the future, but it did not appear mature enough when we tested it, at least for the complex operations (seeking to a specific offset in a file) required by video streaming.
</p>

<p>&nbsp;</p>]]></description>
      <dc:subject><![CDATA[English, Hadoop, Hbase, Video Streaming,]]></dc:subject>
      <pubDate>Fri, 03 Feb 2012 08:52 GMT</pubDate>
    </item>

    <item>
      <title>Open source version of the LivingKnowledge testbed publicly released on SourceForge</title>
      <link>http://internetmemory.org/en/index.php/News/open_source_version_of_the_livingknowledge_testbed</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/open_source_version_of_the_livingknowledge_testbed#id:129#date:17:49</guid>
      <description><![CDATA[Since the public release of the testbed on <a href="http://sourceforge.net/p/diversityengine/wiki/Home/">SourceForge</a> in August 2011 under the name Diversity Engine, it has been downloaded many times, and some of its components will be reused in other FP7 research projects such as <a href="http://internetmemory.org/en/index.php/projects/trendminer">TrendMiner</a>.<h2>LivingKnowledge Project</h2>

<p>The <a href="http://livingknowledge.europarchive.org/">LivingKnowledge</a> project (LK) enhances the state of the art of retrieving information from the Web by formalizing the notions of bias and diversity, creating tools that analyze, summarize and visualize bias in textual and image documents and finally, by creating applications that exploit this technology.</p>

<h2>LivingKnowledge Testbed</h2>

<p>The testbed integrates the following components, all of which contribute to diversity and bias aware search:<br />
- <strong>document collections</strong> chosen to reflect a diversity of document types and content,<br />
- <strong>image and text analysis tools</strong> supporting the analysis of diversity in text and image documents,<br />
- <strong>indexing and search tools</strong> supporting bias and diversity aware search, including novel visualization methods.</p>

<p>The testbed processing starts with document collections that are available upon request from the <a href="http://internetmemory.org/en/index.php/projects/livingknowledge">Internet Memory Foundation</a>, including 280 News sites and 750 blogs.<br />
Furthermore, the testbed supports a number of collection formats allowing users to incorporate their own collections.</p>

<p>A hands-on session with over 30 participants (Symposium on Bias and Diversity) was held during the 8th <a href="http://essir.uni-koblenz.de/">International Summer School on Information Retrieval</a> (ESSIR), which took place in Koblenz (Germany) in August/September 2011.</p>

<h2>More info</h2>
<p><a href="http://livingknowledge.europarchive.org/">Living Knowledge Project</a> <br />
<a href="http://sourceforge.net/p/diversityengine/wiki/Home/">SourceForge</a><br />
<a href="http://www.diversityengine.org">Diversity Engine</a><br />
<a href="http://essir.uni-koblenz.de/">Symposium on Bias and Diversity in IR (ESSIR 2011) </a></p>]]></description>
      <dc:subject><![CDATA[English, French,]]></dc:subject>
      <pubDate>Thu, 02 Feb 2012 17:49 GMT</pubDate>
    </item>

    <item>
      <title>Temporal Web Analytics Workshop (TempWeb02) at WWW2012 in Lyon on April 17, 2012</title>
      <link>http://internetmemory.org/en/index.php/News/temporal_web_analytics_workshop</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/temporal_web_analytics_workshop#id:128#date:10:23</guid>
      <description><![CDATA[<a href="http://temporalweb.net/">TempWeb02</a> will take place on April 17th, 2012 in conjunction with the <a href="http://www2012.wwwconference.org/">International World Wide Web Conference</a> in Lyon, France. <br />
<p><strong>As PC-Chair and Organizer, Internet Memory Foundation informs you that the submission deadline for papers is February 24, 2012.</strong></p>

<h2>Objectives</h2>

<p>The objective of this workshop is to provide a venue for researchers of all domains (IE/IR, Web mining etc.) where the temporal dimension opens up an entirely new range of challenges and possibilities. The workshop’s ambition is to help shape a community of interest around the research challenges and possibilities resulting from the introduction of the time dimension in Web analysis.</p>

<p>TempWeb focuses on the analysis, along the time dimension, of Web data that has been collected over extended time periods. A major challenge in this regard is the sheer size of the data and the ability to make sense of it in a way that is useful and meaningful for its users. Web scale data analytics therefore needs to develop infrastructures and extended analytical tools to make sense of this data. </p>

<h2>Workshop topics</h2>

<p>• Web scale data analytics<br />
• Temporal Web analytics<br />
• Distributed data analytics<br />
• Web science<br />
• Web dynamics<br />
• Data quality metrics<br />
• Web spam<br />
• Knowledge evolution on the Web<br />
• Systematic exploitation of Web archives<br />
• Large scale data storage<br />
• Large scale data processing<br />
• Data aggregation<br />
• Web trends<br />
• Topic mining<br />
• Terminology evolution<br />
• Community detection and evolution</p>

<h2>Important Dates</h2>

<p>• Paper submission deadline: February 24, 2012<br />
• Notification of acceptance: March 5, 2012<br />
• Camera ready copy deadline: March 16, 2012<br />
• Workshop: April 17, 2012</p>

<p>Please post your submission (up to 8 pages) using the ACM template:<br />
<a href="http://www.acm.org/sigs/publications/proceedings-templates">http://www.acm.org/sigs/publications/proceedings-templates</a><br />
at:<br />
<a href="https://www.easychair.org/account/signin.cgi?conf=tempweb2012">https://www.easychair.org/account/signin.cgi?conf=tempweb2012</a></p>

<p>Note that the workshop proceedings will be published in ACM DL (ISBN 978-1-4503-1188-5)</p>

<h2>Support</h2>

<p>This workshop is organized with the support of the EU 7th Framework ICT STREP on Longitudinal Analytics of Web Archive data (<a href="http://www.lawa-project.eu/">LAWA</a>) under contract no. 258105.</p>

<h2>Workshop Officials</h2>

<p><strong>PC-Chairs and Organizers:</strong></p>

<p>Ricardo Baeza-Yates (<a href="http://research.yahoo.com/Ricardo_Baeza-Yates">Yahoo! Research</a>, Spain)<br />
Julien Masanès (<a href="http://internetmemory.org/en/index.php/about/the_board">Internet Memory Foundation</a>, France and Netherlands)<br />
Marc Spaniol (<a href="http://www.mpi-inf.mpg.de/~mspaniol/">Max Planck Institute for Informatics</a>, Germany)</p>

<p><strong>Program Committee:</strong></p>

<p>Eytan Adar (University of Michigan, USA)<br />
Omar Alonso (Microsoft Bing, USA)<br />
Srikanta Bedathur (IIIT-Delhi, India)<br />
Andras Benczur (Hungarian Academy of Science)<br />
Klaus Berberich (Max Planck Institute for Informatics, Germany)<br />
Roi Blanco (Yahoo! Research, Spain)<br />
Adam Jatowt (Kyoto University, Japan)<br />
Scott Kirkpatrick (Hebrew University Jerusalem, Israel)<br />
Christian König (Microsoft Research, USA)<br />
Frank McCown (Harding University, USA)<br />
Michael Nelson (Old Dominion University, USA)<br />
Nikos Ntarmos (University of Patras, Greece)<br />
Kjetil Norvag (Norwegian University of Science and Technology, Norway)<br />
Philippe Rigaux (Internet Memory Foundation, France and Netherlands)<br />
Thomas Risse (L3S Research Center, Germany)<br />
Pierre Senellart (Télécom ParisTech, France)<br />
Torsten Suel (NYU Polytechnic, USA)<br />
Masashi Toyoda (Tokyo University, Japan)<br />
Peter Triantafillou (University of Patras, Greece)<br />
Michalis Vazirgiannis (Athens University of Economics and Business &amp; École Polytechnique)<br />
Gerhard Weikum (Max Planck Institute for Informatics, Germany)</p>]]></description>
      <dc:subject><![CDATA[English, French,]]></dc:subject>
      <pubDate>Thu, 02 Feb 2012 10:23 GMT</pubDate>
    </item>

    <item>
      <title>TV Show: « La mémoire de toile » (net memory) and Web archiving challenges</title>
      <link>http://internetmemory.org/en/index.php/News/tv_show_la_memoire_de_toile_the_net_memory_and_the_web_archiving_challenges</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/tv_show_la_memoire_de_toile_the_net_memory_and_the_web_archiving_challenges#id:124#date:16:38</guid>
      <description><![CDATA[Report on Web archiving by France24<p><img src="http://internetmemory.org/images/uploads/memoiredelatoile_thumb.png" alt="memoiredelatoile" width="300" height="228"  style="border: 0;" /></p>

<p>The Internet has become one of the most productive media for information and news. There is therefore an absolute need to preserve web content and to promote Web archiving at large scale. This is becoming one of the great challenges of the Web. <br />
The media are already interested in the subject, and <a href="http://www.france24.com/en/" title="France 24">France24</a>, the French international news channel, is broadcasting a <a href="http://www.france24.com/fr/20111231-memoire-internet-archivage">video report</a> on web harvesting in France (carried out under the French legal deposit), on Web archiving in general and on giving access to these <a href="http://internetmemory.org/en/index.php/about/collections1">Web archive collections</a>. </p>

<p>This video gives a quick overview of French initiatives and <a href="http://internetmemory.org/en/index.php/IM/blogs" title="Blog InternetMemory">Web archiving technologies</a>, with the participation of the <a href="http://www.bnf.fr/fr/collections_et_services/livre_presse_medias/a.archives_internet.html">National Library of France</a>, the National Audiovisual Institute of France and the Internet Memory Foundation (interview with Julien Masanès by Natalia Gallois in our offices in Paris).</p>

<p>To view the video and discover the challenges of Web archiving <a href="http://www.france24.com/fr/20111231-memoire-internet-archivage" title="France24">click here</a> (in French).<br />
TV Show: <a href="http://www.france24.com/fr/taxonomy/emission/16758">&#8220;Web News&#8221;</a>, News seen on the Web and about the web.</p>

]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Tue, 03 Jan 2012 16:38 GMT</pubDate>
    </item>

    <item>
      <title>Happy New Year 2012!</title>
      <link>http://internetmemory.org/en/index.php/News/happy_new_year_2012</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/happy_new_year_2012#id:123#date:14:38</guid>
      <description><![CDATA[We send you our best wishes for the New Year 2012!<p><strong>2012</strong> will be a year full of projects and developments, so follow us on <a href="http://twitter.com/#!/InternetMemory">Twitter</a> and subscribe to our <a href="http://internetmemory.org/en/index.php/RSS">RSS feed</a>!</p>

]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Fri, 30 Dec 2011 14:38 GMT</pubDate>
    </item>

    <item>
      <title>November 7-8th, Kick-Off of a new R&amp;D project: TrendMiner</title>
      <link>http://internetmemory.org/en/index.php/News/november_7_8th_kick_off_of_a_new_rd_project_trendminer</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/november_7_8th_kick_off_of_a_new_rd_project_trendminer#id:117#date:09:48</guid>
      <description><![CDATA[We are glad to announce the kick-off of the European research project TrendMiner, on Large-scale, Cross-lingual Trend Mining and Summarization of Real-time Media Streams<p>The TrendMiner project (Large-scale, Cross-lingual Trend Mining and Summarization of Real-time Media Streams) starts today in Luxembourg. It is a three-year European project funded by the European Commission through the Seventh Research Framework Program (FP7-ICT) under Project No 287863. </p>

<p>Besides the Internet Memory Foundation, the following partners are involved:<br />
- <a href="http://www.dfki.de/web/welcome?set_language=en&amp;cl=en" target="new">Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (Germany)</a> as Coordinator,<br />
- <a href="http://www.shef.ac.uk/" target="new">The University of Sheffield (United Kingdom)</a>, <br />
- <a href="http://www.ontotext.com/" target="new">Ontotext AD (Bulgaria)</a>, <br />
- <a href="http://www.soton.ac.uk/" target="new">University of Southampton (UK)</a>, <br />
- <a href="http://en.eurokleis.com/" target="new">Eurokleis S.R.L. (Italy)</a>, <br />
- <a href="http://www.sora.at/index.php?id=72&amp;L=1" target="new">Sora Ogris &amp; Hofinger GmbH (Austria)</a> <br />
- and <a href="http://hardikgroup.com/" target="new">Hardik Fintrade Pvt Ltd. (India)</a>.</p>

<p>This project aims at delivering innovative, portable open-source real-time methods for cross-lingual mining and summarization of large-scale stream media.</p>

<p>IMF will contribute to the platform for real-time media collection, analysis and storage by:<br />
- providing a scalable infrastructure to partners, with support for integration and experiments,<br />
- designing and developing an application-aware crawling mechanism for social media.</p>

<p>For more information on TrendMiner, please visit the <a href="http://www.trendminer-project.eu/" target="new">Project website</a> (under construction).</p>

<p><img src="http://internetmemory.org/images/uploads/fp7logoban1.jpg" alt="" width="60" height="56" style="border: 0;" /> <img src="http://internetmemory.org/images/uploads/Eur-flag.jpg" alt="" width="63" height="44" style="border: 0;" /></p>]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Mon, 07 Nov 2011 09:48 GMT</pubDate>
    </item>

    <item>
      <title>Interview with France Lasfargues after IFTA 2011</title>
      <link>http://internetmemory.org/en/index.php/News/interview_with_france_lasfargues_after_ifta_2011</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/interview_with_france_lasfargues_after_ifta_2011#id:114#date:16:15</guid>
      <description><![CDATA[France Lasfargues, project manager at the foundation, manages two research projects on web archiving and a portfolio of <a href="http://internetmemory.org">Internet Memory</a> partners. She talks about her participation in the conference of the <a href="http://www.fiatifta.org/">International Federation of Television Archives (IFTA)</a> in Turin in September 2011, where she led a workshop on web archiving and audiovisual archives with two partners: <a href="http://www.swr.de/">SWR</a> (German Television) and <a href="http://portal.beeldengeluid.nl/">Beeld en Geluid</a> (Netherlands Institute for Sound and Vision).<p><img src="http://internetmemory.org/images/uploads/conference-fiatifta_thumb.png" alt="FIATIFTA_2011" width="400" height="227"  style="border: 0;" /></p>

<p><strong>Was this your first participation in the FIAT?</strong></p>

<p>France Lasfargues: Personally, yes. But it is not the first participation of the Internet Memory Foundation, which is an associate member of IFTA. Last year, Chloe Martin, Business Developer at the Foundation, presented a poster on our web archiving platform, <a href="http://archivethe.net">Archivethe.net (ATN)</a>, and on issues related to collecting and giving access to videos that are broadcast on the Web.</p>

<p><strong>Is it easy for Internet Memory to participate in this international conference?</strong></p>

<p>FL: To take part in this conference, we must first answer the call for participation, which is issued at least three months in advance. We then decide how to present our activities, which participants we would like to involve and which form of presentation to use (poster, workshop, plenary session, ...). We submit our proposal and wait for a reply from IFTA. This time, we decided to focus on issues where the expectations and needs of audiovisual archives meet our skills and areas of development. The workshop format seemed the most appropriate, not least because it opens a space for dialogue and exchange with the audience.</p>

<p><strong>This brings us to the reason for Internet Memory's presence at IFTA.</strong></p>

<p>FL: Our goal is simple: to communicate about the need for web archiving in audiovisual archives and, thereby, to share our expertise in this area. Internet Memory wants to drive projects and motivate institutions to engage in web archiving now, in order to stop the loss of relevant, high added-value content.</p>

<p><strong>What is the angle chosen by Internet Memory for this workshop?</strong></p>

<p>FL: The workshop was mainly an opportunity to invite audiovisual archives to share their current experience and problems with web archiving, and to bring them up to date on the capture and access solutions that we have developed. We must say that we have strong arguments on the matter. It gave us the opportunity to communicate on all of our projects, <a href="http://www.liwa-project.eu/index.php">LIWA</a>, <a href="http://livingknowledge-project.eu/">LK</a>, <a href="http://www.lawa-project.eu/">LAWA</a>, <a href="http://www.scape-project.eu/">SCAPE</a> and especially <a href="http://www.arcomem.eu/">ARCOMEM</a>, which are large-scale European projects and an excellent reference to show the extent of our technologies and skills. In detail, and because the attendees were audiovisual archives, we focused on the technical challenges of capturing video on web sites (<a href="http://www.liwa-project.eu/index.php">LIWA</a>) and, equally important, on the social web and the challenges it poses for archivists (<a href="http://www.arcomem.eu/">ARCOMEM</a>). We also talked about various tools that we develop (including Application Aware Crawling, API Crawls, etc.) to solve archiving problems and improve archival collections.</p>

<p><strong>How many participants attended this international conference? Was your workshop appreciated?</strong></p>

<p>The conference brought together over 300 audiovisual archivists.<br />
As for our workshop, the room was full, with more than 120 participants. I admit that we did not expect such a success. Last year, the workshop on web archiving drew at most 40 people! Moreover, the organizers of the conference highlighted our &#8220;attendance score&#8221;. This shows that many more archivists are now interested in web archiving for audiovisual archives. Internet Memory services could be developed in the near future and we are always ready to repeat the experience at IFTA.</p>

]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Thu, 13 Oct 2011 16:15 GMT</pubDate>
    </item>

    <item>
      <title>Workshop on web archiving for audiovisual archives at FIAT 2011 in Turin</title>
      <link>http://internetmemory.org/en/index.php/News/workshop_on_web_archiving_for_audiovisual_archives_at_fiat_2011_in_turin</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/workshop_on_web_archiving_for_audiovisual_archives_at_fiat_2011_in_turin#id:113#date:13:02</guid>
      <description><![CDATA[IMF is glad to participate in the workshop Web Archiving for Audiovisual Archives <a href="http://www.fiatifta.org/index.php/archives/4407">'No content without context'</a> with SWR and the Netherlands Institute for Sound and Vision on Friday, 30 September (2.30 pm)<p>This workshop will present our platform <a href="http://archivethe.net">Archivethe.net (AtN)</a> and several concrete use cases, which will help raise awareness and interest among audiovisual archivists.<br />
1/ SWR use case: Rock am Ring 2011<br />
2/ Multimedia Web archiving following recently finalized Living Web Archives project (<a href="http://www.liwa-project.eu/index.php">LiWA</a>).</p>

<p>We hope many of you will attend!</p>

<p>Find all details on <a href="http://www.fiatifta.org/">FIAT website</a></p>]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Thu, 29 Sep 2011 13:02 GMT</pubDate>
    </item>

    <item>
      <title>Understanding HBase&#8212;1 The data model</title>
      <link>http://internetmemory.org/en/index.php/Synapse/understanding_the_hbase_data_model</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/Synapse/understanding_the_hbase_data_model#id:98#date:13:45</guid>
      <description><![CDATA[At Internet Memory, we use HBase as a large-scale repository for our collections, holding terabytes of web documents in a distributed cluster.  This article presents the data model of HBase, and explains how it stands between relational DBs and the "No Schema" approach.<h2>Understanding the HBase data model</h2>

In 2006, the Google Labs team published a paper entitled <a href='http://labs.google.com/papers/bigtable.html'>BigTable: A Distributed Storage System for Structured Data</a>. It describes a distributed index designed to manage very large datasets ("petabytes of data") in a cluster of data servers. BigTable supports key search, range search and high-throughput file scans, and also provides flexible storage for structured data. HBase is an open-source clone of BigTable, and closely mimics its design.

At Internet Memory, we use HBase as a large-scale repository for our collections, holding terabytes of web documents in a distributed cluster. HBase is often likened to a large, distributed relational database. It actually presents many aspects common to "NoSQL" systems: distribution, fault tolerance, flexible modeling, absence of some features deemed essential in centralized DBMS (e.g., concurrency), etc. This article presents the data model of HBase, and explains how it stands between relational DBs and the "No Schema" approach. It will be followed by an introduction to both the Java and REST APIs, and a final article on system aspects.

<h3>The <i>map</i> structure: representing data with key/value pairs</h3>

We start with an idea familiar to Lisp programmers: <em>association lists</em>, which are nothing more than lists of key-value pairs. They constitute a simple and convenient way of representing the properties of an object. We use as a running example the description of a Web document. For instance, using the JSON notation:

<pre>
{
   "url": "http://internetmemory.org", "type": "text/html", "content": "my document content"
}
</pre>

One obtains what is commonly called an associative array, a dictionary, or a <em>map</em>. Given a context (the object/document), the structure associates <em>values</em> to <em>keys</em>.
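
As a minimal sketch, the same map can be written as a Python dictionary. Python is used here only for illustration and is not part of HBase; the variable name <i>doc</i> is an assumption.

```python
# A map for one web document: keys are property names, values are the data.
doc = {
    "url": "http://internetmemory.org",
    "type": "text/html",
    "content": "my document content",
}

# Given the context (this document), a key leads to its value.
print(doc["type"])          # text/html
print(sorted(doc.keys()))   # ['content', 'type', 'url']
```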

We can represent such data as a graph, as shown by the figure below. The key information is captured by edges, whereas data values reside at leaves.

<center>
<img src="http://internetmemory.org/images/uploads/instance6.png" alt="Key value" width="334" height="150" style="border: 0;" />
</center>
There exist many possible representations for a <em>map</em>. We showed a JSON example above, but XML is of course an equally appropriate choice. At first glance, a <i>map</i> can also be represented by a table. The above example is equivalently viewed as

<center>
<table border="1">
<tr bgcolor="lightgrey"><th>url</th><th>type</th><th>content</th></tr>
<tr><td>http://internetmemory.org</td><td>text/html</td><td>my document content</td></tr>
</table>
</center>

However, this often introduces some confusion. It is worth understanding several important differences that make a <em>map</em> much more flexible than the strict (relational) <em>table</em> structure. In particular,
<ul>
  <li>there is no <em>schema</em> that constrains the list of keys (unlike a relational table, where the schema is fixed and uniform for all rows),</li>
  <li>the <em>value</em> may itself be some complex structure.</li>
</ul>

HBase, following BigTable, builds on this flexibility. First, we can add a new key-value pair to describe an object, if needed. This does not require any pre-declaration at the 'schema' level, and the new key remains local. Other objects stored in the same HBase instance remain unaffected by the change.
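
The locality of a new key can be sketched with two plain Python dictionaries stored side by side. This only mimics the behaviour described above; the names <i>store</i>, <i>doc1</i>, <i>doc2</i> and the second URL are illustrative assumptions.

```python
# Two objects stored in the same (toy) store; both names are hypothetical.
store = {
    "doc1": {"url": "http://internetmemory.org", "type": "text/html"},
    "doc2": {"url": "http://example.org/logo.png", "type": "image/png"},
}

# Add a new key to doc1 only: nothing is declared beforehand,
# and doc2 is left untouched.
store["doc1"]["language"] = "en"

print("language" in store["doc1"])   # True
print("language" in store["doc2"])   # False
```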

Second, a value can be another <em>map</em>, yielding a <em>multi-map</em> structure which is exemplified below.

<h3>An HBase "table" is a <i>multi-map</i> structure</h3>

Instead of keeping one value for each property of an object, HBase allows the storage of several <em>versions</em>. Each version is identified by a timestamp. How can we represent such a multi-version, key-value structure? HBase simply replaces atomic values by a <em>map</em> where the key is the timestamp.

The extended representation for our example is shown below. It helps to grasp the power and flexibility of the data representation. Now, our document is built from two nested maps,
  <ul>
    <li>a first one, called "<em>column</em>" in HBase terminology (an unfortunate choice, since it has little to do with the relational concept of a column),</li>
    <li>a second one, called "<em>timestamp</em>" (each map is named after its key).</li>
</ul>
Our document is globally viewed  as a <em>column</em> map. If  we choose a <em>column</em> key, say, <em>type</em>, we obtain a value which is itself a second <em>map</em> featuring as many keys as there are timestamps for this specific column. In our example,  there is only one timestamp for <i>url</i> (well, we can assume that the URL of the document does not change much).  Looking at, respectively,  <em>type</em> and <em>content</em>, we find the former has two versions and the latter three. Moreover, they only have one timestamp (<em>t<sub>1</sub></em>) in common. Actually, the "<em>timestamp</em>" maps are completely independent from one another.
  <center>
    <img src="http://internetmemory.org/images/uploads/multimap.png" alt="Multi-map" width="244" height="230" style="border: 0;" />
  </center>
Note that we can add as many <i>timestamps</i> (hence, as many keys in one of the second-level maps) as we wish. And, in fact, this is true for the first-level map as well: we can add as many <i>columns</i> as we wish, at the document level, without having to impose such a change on <i>all</i> other documents in the same HBase instance. In essence, each object is just a self-described piece of information (think again of the flexible representation of semi-structured data formats like XML or JSON). In this respect, HBase follows in the wake of other 'NoSQL' systems, and its data model shares many aspects that characterize this trend: no schema and self-description of objects.
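
A small sketch of the two nested maps, with plain Python dictionaries standing in for HBase. Integer timestamps, the sample values and the helper names <i>latest</i> and <i>common_timestamps</i> are assumptions made for the example.

```python
# column -> timestamp -> value; timestamps are plain integers here.
doc = {
    "url": {1: "http://internetmemory.org"},
    "type": {1: "text/html", 2: "application/xhtml+xml"},
    "content": {1: "first version", 3: "second version", 4: "third version"},
}

def latest(document, column):
    """Return the value stored under the highest timestamp of a column."""
    versions = document[column]
    return versions[max(versions)]

def common_timestamps(document, col_a, col_b):
    """Return the timestamps that appear in both columns, in order."""
    return sorted(set(document[col_a]) & set(document[col_b]))

# Each "timestamp" map grows independently of the others.
doc["content"][5] = "fourth version"

print(latest(doc, "content"))                     # fourth version
print(common_timestamps(doc, "type", "content"))  # [1]
```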


We are not completely done with the multi-map levels of HBase. Columns are grouped in <i>column families</i>, and a family is actually a key in a new map level, referring to a group of columns. In the Figure below, we define two families:  <em>meta</em>, grouping <em>url</em> and <em>type</em>, and <em>data</em> representing the content of a document.
  <center>
    <img src="http://internetmemory.org/images/uploads/fullmultimap.png" alt="Full multi map " width="251" height="205" style="border: 0;" />
  </center>

<b>Important</b>: Unlike the <em>column</em> and <em>timestamp</em> maps, the keys of a <em>family</em> map are <em>fixed</em>. Families are declared when the table is created, and we cannot introduce a new one simply by writing a row. The <em>family</em> level therefore constitutes the equivalent of a relational schema, although, as we saw, the content of a family value may be a quite complex structure.
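
The contrast between fixed families and free columns can be sketched as follows. This is a toy Python class, not the HBase API; the class name <i>Row</i>, its <i>put</i> method and the <i>stats</i> family are hypothetical.

```python
class Row:
    """family -> column -> timestamp -> value, with a fixed set of families."""

    def __init__(self, families):
        # The families are decided once, when the structure is created.
        self.data = {family: {} for family in families}

    def put(self, family, column, timestamp, value):
        if family not in self.data:
            raise KeyError("unknown family: " + family)
        # Columns and timestamps are created on demand.
        self.data[family].setdefault(column, {})[timestamp] = value

row = Row(["meta", "data"])
row.put("meta", "url", 1, "http://internetmemory.org")
row.put("meta", "type", 1, "text/html")
row.put("data", "content", 1, "my document content")
row.put("meta", "language", 2, "en")    # new column: accepted

try:
    row.put("stats", "size", 1, 1024)   # new family: rejected
    rejected = False
except KeyError:
    rejected = True
```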

<h3>The full picture: rows and tables</h3>

So, now, we know how to represent our objects with the HBase data model. It remains to describe how we can put many objects (potentially, millions or even billions of objects) in HBase. This is where HBase borrows some terminology from relational databases: objects are called "<em>rows</em>", and rows are stored in a "<em>table</em>". Although one could find some superficial similarities, this comparison is a likely source of confusion. Let us try to list the differences:
<ol>
<li>a "table" is actually a <i>map</i> where each row is a value, and the key is chosen by the table designer.</li>
   <li>we already explained that the structure of a "row" has little to do with the flat representation of a relational row.</li>
    <li>regarding data manipulation, the nature of a "table" implies that two basic operations are available: <i>put(key, row)</i> and <i>get(key): row</i>. Nothing comparable to SQL here!</li>
</ol>
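
The two basic operations listed above can be sketched with a toy Python class. This illustrates the map semantics only and is not the HBase client API; the class name <i>Table</i> is an assumption.

```python
class Table:
    """A toy "table": a map from a row key to a row."""

    def __init__(self):
        self.rows = {}

    def put(self, key, row):
        # Store (or overwrite) the row found under this key.
        self.rows[key] = row

    def get(self, key):
        # Return the row for this key, or None when the key is absent.
        return self.rows.get(key)

webdoc = Table()
webdoc.put("http://internetmemory.org", {"type": "text/html"})

print(webdoc.get("http://internetmemory.org"))   # {'type': 'text/html'}
print(webdoc.get("http://example.org"))          # None
```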

Finally, it is worth noting that the "table" map is a <em>sorted</em> map: rows are ordered by key, and two rows that are close to one another (with respect to the key order) are stored in the same physical area. This makes <i>range queries</i> on keys possible (and efficient). We explore this feature further in the article devoted to the system aspects of HBase.
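
Why key ordering makes range queries cheap can be sketched with a sorted list of keys and a binary search. This is an in-memory illustration only; the class <i>SortedTable</i>, its <i>scan</i> method and the reversed-domain keys are assumptions, not HBase conventions taken from this article.

```python
import bisect

class SortedTable:
    """A toy sorted map: row keys are kept in order to allow range scans."""

    def __init__(self):
        self.keys = []   # row keys, always sorted
        self.rows = {}   # key -> row

    def put(self, key, row):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = row

    def scan(self, start, stop):
        """Return (key, row) pairs from start (included) to stop (excluded)."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]

table = SortedTable()
table.put("org.lawa-project/", {"type": "text/html"})
table.put("org.internetmemory/en", {"type": "text/html"})
table.put("com.example/", {"type": "text/html"})
table.put("org.internetmemory/about", {"type": "text/html"})

# All rows whose key starts with "org.internetmemory/" sit next to each other.
found = [k for k, _ in table.scan("org.internetmemory/", "org.internetmemory/~")]
print(found)   # ['org.internetmemory/about', 'org.internetmemory/en']
```

Because neighbouring keys are contiguous, a scan only needs to locate the first key and then read forward, instead of inspecting every row.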

The following figure summarizes our structure for a hypothetical <i>webdoc</i> HBase table storing a large collection of web documents. Each document is indexed by its url (which is therefore the key of the highest level map). A <i>row</i> is itself a local map featuring a fixed number of keys defined by the family names (<em>f<sub>1</sub></em>, <em>f<sub>2</sub></em>, etc.), associated to values which are themselves maps indexed by columns. Finally, column values are versioned, and represented by a timestamp-indexed map. Columns and timestamps do not obey a global schema: they are defined on a row basis. The columns may vary arbitrarily from one row to another, and so do the timestamps for columns.
 
  <center>
<img src="http://internetmemory.org/images/uploads/table.png" alt="HBase table" width="285" height="208" style="border: 0;" />
   </center>
 The multi-map structure of an HBase table can thus be summarized as
<center>
<tt>key -> family -> column -> timestamp -> value</tt>
</center>
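
Following this path in code gives a compact picture of a lookup. The sketch below uses nested Python dictionaries; the function <i>get_cell</i>, its optional <i>timestamp</i> argument and the sample values are assumptions made for illustration, not the HBase API.

```python
# key -> family -> column -> timestamp -> value
webdoc = {
    "http://internetmemory.org": {
        "meta": {
            "url": {1: "http://internetmemory.org"},
            "type": {1: "text/html", 2: "application/xhtml+xml"},
        },
        "data": {
            "content": {1: "v1", 3: "v2"},
        },
    },
}

def get_cell(table, key, family, column, timestamp=None):
    """Follow key, family and column, then pick a version.

    Without a timestamp the newest version is returned; with one, the
    newest version that is not more recent than the given timestamp.
    """
    versions = table[key][family][column]
    candidates = [t for t in versions if timestamp is None or timestamp >= t]
    if not candidates:
        return None
    return versions[max(candidates)]

key = "http://internetmemory.org"
print(get_cell(webdoc, key, "data", "content"))      # v2
print(get_cell(webdoc, key, "data", "content", 2))   # v1
```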

It should be clear that the intuitive meaning of common concepts such as "table", "row", and "column" must be revisited when dealing with HBase data. In particular, considering HBase as a kind of large relational database is clearly a misleading option.  HBase is essentially a key-value store with efficient indexing on key access, a semi-structured data model for value representation, and range-search capabilities supported by key ordering.

<h3>References</h3>

<ul>
  <li><a href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding HBase and BigTable</a></li>
  <li><a href="http://wiki.apache.org/hadoop/Hbase/DataModel">The HBase Wiki page</a></li>
</ul>]]></description>
      <dc:subject><![CDATA[English, French, Hbase,]]></dc:subject>
      <pubDate>Wed, 01 Jun 2011 13:45 GMT</pubDate>
    </item>

    <item>
      <title>Temporal Web Analytics Workshop (TWAW 2011) Proceedings</title>
      <link>http://internetmemory.org/en/index.php/News/temporal_web_analytics_workshop_twaw_2011_proceeding</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/temporal_web_analytics_workshop_twaw_2011_proceeding#id:96#date:09:15</guid>
      <description><![CDATA[The Proceedings of the 1st International Temporal Web Analytics Workshop (TWAW 2011) are online now.<p>The Proceedings of the 1st International Temporal Web Analytics Workshop (<a href="http://www.temporalweb.net/" target="new">TWAW 2011</a>) held in conjunction with the 20th International World Wide Web Conference (www2011) in Hyderabad, India on March 28, 2011 are online. <br />
The workshop was co-organized by the <a href="http://www.lawa-project.eu/" target="new">LAWA project</a> and chaired by R. Baeza-Yates (Yahoo! Research Barcelona), J. Masanès (Internet Memory Foundation) and M. Spaniol (Max-Planck-Institut für Informatik).</p>

]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Tue, 24 May 2011 09:15 GMT</pubDate>
    </item>

    <item>
      <title>ARCOMEM Meeting in Paris on May 9-10-11</title>
      <link>http://internetmemory.org/en/index.php/News/arcomem_meeting_in_paris_on_may_9_10_11</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/arcomem_meeting_in_paris_on_may_9_10_11#id:91#date:11:50</guid>
      <description><![CDATA[The ARCOMEM Consortium will come together to discuss and fix the next milestones of the system architecture work packages. This meeting will be hosted by Télécom ParisTech, a member of the ARCOMEM Consortium. <p>The core areas of this meeting in Paris will be the system architecture of the different modules (e.g. content crawling, social web analysis, archive enrichment, storage module …). <br />
After the meeting in Paris, ARCOMEM will publish the results on the <a href="http://www.arcomem.eu/">project website</a>.</p>

]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Fri, 06 May 2011 11:50 GMT</pubDate>
    </item>

    <item>
      <title>LAWA Presentation at the FIRE Research Workshop</title>
      <link>http://internetmemory.org/en/index.php/News/lawa_presentation_at_the_fire_research_workshop1</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/lawa_presentation_at_the_fire_research_workshop1#id:90#date:07:43</guid>
      <description><![CDATA[The <a href="http://www.lawa-project.eu/">LAWA project</a> will be presented at the FIRE Research Workshop in Budapest (Hungary) on May 16th 2011.<p>The <a href="http://www.ict-fire.eu/events/fire-research-workshop.html">FIRE Research Workshop</a> will be held on May 16th 2011 in Budapest (Hungary). This event is part of the <a href="http://www.fi-budapest.eu/">Future Internet Week 2011</a>. LAWA is proud to give a presentation in the <a href="http://wiki.ict-fire.eu/index.php/FIRE_research_workshop_agenda">Future Internet, Living Labs and Web analytics</a> session.</p>

<p>Presentation Abstract<br />
Organizations like the Internet Archive have been capturing Web contents over decades. This time-versioned content is a gold mine for analysts, focusing on longitudinal studies. An application example is tracking and analyzing a politician’s public appearances over a decade. The LAWA project develops methods and tools for time-travel indexing and querying, entity detection and tracking along the time axis, and advanced analyses and knowledge discovery. For scalability, we pursue Hadoop-based distributed computations. We also prepare reference data and will provide analytics services. We will offer a user workshop in late 2011 to disseminate these opportunities and explore interesting use cases.</p>]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Thu, 05 May 2011 07:43 GMT</pubDate>
    </item>

    <item>
      <title>LivingKnowledge partners attend the Fet11 in Budapest</title>
      <link>http://internetmemory.org/en/index.php/News/livingknowledge_partners_attend_the_fet11_in_budapest</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/livingknowledge_partners_attend_the_fet11_in_budapest#id:89#date:16:21</guid>
      <description><![CDATA[The LivingKnowledge project is currently part of the exhibition, showing the project's <a href="http://www.fet11.eu/exhibition">diversity-aware technologies</a>. <br />
This event is organised by the “ICT forever yours” community. <p>LK partners are present at <a href="http://www.fet11.eu/">FET</a> 2011 in Budapest, Hungary, on May 4th-6th, 2011</p>

<p>Looking for a search engine to find images or web pages about David Beckham arranged in terms of the various clubs he has played for? <br />
A 15 minute demonstration of <a href="http://livingknowledge-project.eu/">LivingKnowledge</a> showcases technologies enabling bias-aware, diversity-aware and evolution-aware information access, including diversity-aware search within texts and images, analysis of future predictions as well as fact-and-opinion extraction.</p>]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Wed, 04 May 2011 16:21 GMT</pubDate>
    </item>

    <item>
      <title>SCAPE Project on the Web</title>
      <link>http://internetmemory.org/en/index.php/News/scape_project_on_the_web</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/scape_project_on_the_web#id:85#date:08:11</guid>
      <description><![CDATA[We are pleased to inform that the official <a href="http://www.scape-project.eu/">SCAPE website</a> is now online!<p><strong>What is it about?</strong><br />
The <a href="http://www.scape-project.eu/">SCAPE project</a> will develop scalable services for planning and execution of institutional preservation strategies on an open source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects. SCAPE will enhance the state of the art of digital preservation in three ways: by developing infrastructure and tools for scalable preservation actions; by providing a framework for automated, quality-assured preservation workflows and by integrating these components with a policy-based preservation planning and watch system. These concrete project results will be validated within three large-scale Testbeds from diverse application areas.</p>

<p>SCAPE also has a <strong>Twitter account</strong>: <a href="http://twitter.com/SCAPEproject">@SCAPEproject</a><br />
Tweets with the hashtag #SCAPEproject (or pointed to @SCAPEproject) will be re-tweeted and appear on the website’s Twitter feed.</p>]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Mon, 02 May 2011 08:11 GMT</pubDate>
    </item>

    <item>
      <title>Museums and the Web</title>
      <link>http://internetmemory.org/en/index.php/News/museums_and_the_web</link>
      <guid isPermaLink="true">http://internetmemory.org/en/index.php/News/museums_and_the_web#id:81#date:07:17</guid>
      <description><![CDATA[Internet Memory will participate in the Museums & the Web conference in Philadelphia, April 6-9 2011.<p>In a one-hour <a href="http://conference.archimuse.com/mw2011/programs/web_archiving_workshop">workshop</a>, the main aspects of Web archiving will be presented.<br />
If you can&#8217;t join, follow the #mw2011 tag on Twitter</p>]]></description>
      <dc:subject><![CDATA[English,]]></dc:subject>
      <pubDate>Thu, 31 Mar 2011 07:17 GMT</pubDate>
    </item>

    
    </channel>
</rss>
