- Alexander Munteanu
- Christian Sohler
- Thomas Nitschke
- Marc Gillé
- Christopher Schröder
- Hendrik Spiegel
Research supported by Deutsche Forschungsgemeinschaft, grants SO 514/3-1 and SO 514/4-2.
The design and implementation of internet search engines has become an increasingly sophisticated challenge.
Several requirements have to be met, such as dealing with massive amounts of data, efficient text
processing and fault tolerance. The aims of the Krake web-search project are the implementation and
practical improvement of the different components of internet search engines, with an emphasis on scalability:
- Distributed data management
- Solid and tested web crawler
- Efficient web graph implementation
- Easy access to the crawled content
- Simplicity and extensibility
An early evaluation of the freely available web crawlers led to the conclusion that a new crawler framework had to be created to meet all of the defined requirements.
In order to satisfy the demands for distribution and simplicity at the same time, it was decided to use the very popular MapReduce paradigm as the basis for the framework.
The development phase was heavily prototype-driven to ensure the practical relevance of the framework and finally led to the stable system presented below.
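To illustrate the MapReduce paradigm in a crawling context, the following is a minimal in-memory sketch of one crawl round: a map step that extracts outgoing links from fetched pages, and a reduce step that deduplicates them against the already-known URLs to form the next frontier. All class and method names here are hypothetical; the actual Krake framework runs such rounds as distributed Hadoop jobs, not as the simplified simulation shown below.

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical in-memory sketch of one MapReduce-style crawl round.
// Krake itself executes comparable map and reduce phases as Hadoop jobs.
public class CrawlRoundSketch {

    // Map step: emit every outgoing link found in a fetched page.
    static List<String> mapPage(String pageHtml) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(pageHtml);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Reduce step: deduplicate the emitted links against already-known
    // URLs, yielding the frontier to be fetched in the next round.
    static Set<String> reduceFrontier(List<String> emitted, Set<String> known) {
        Set<String> next = new LinkedHashSet<>(emitted);
        next.removeAll(known);
        return next;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.de/a\">a</a>"
                    + "<a href=\"http://example.de/b\">b</a>";
        Set<String> known = Set.of("http://example.de/a");
        // Only the not-yet-seen URL survives into the next frontier.
        System.out.println(reduceFrontier(mapPage(page), known));
    }
}
```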
The Krake crawler framework is a reliable, distributed and modern system that can easily be
modified to fit individual research interests. It is meant to be used as an out-of-the-box crawler or as the basis
for a customized crawling and analysis system.
In conjunction with the actual crawler, the Krake framework also provides the means to export and aggregate the
gathered data into a web-graph and a content database. The file formats of the exported data are designed to
be easily read and modified by other programs without sacrificing efficiency.
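As an illustration of how simple such exports are to consume, the sketch below parses a plain-text web-graph as a tab-separated edge list, one "source target" pair per line, into an adjacency map. The concrete Krake file format is not specified here, so this line layout is an assumption made purely for illustration.

```java
import java.util.*;

// Hypothetical reader for a plain-text web-graph export with one edge
// per line, formatted as "<source-id>\t<target-id>". The real Krake
// export format may differ; this only shows how easily a flat edge
// list can be turned into an adjacency structure.
public class WebGraphReader {

    static Map<Integer, List<Integer>> readEdges(String text) {
        Map<Integer, List<Integer>> adjacency = new HashMap<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) continue;
            String[] parts = line.split("\t");
            int source = Integer.parseInt(parts[0]);
            int target = Integer.parseInt(parts[1]);
            adjacency.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
        }
        return adjacency;
    }

    public static void main(String[] args) {
        String edges = "1\t2\n1\t3\n2\t3\n";
        Map<Integer, List<Integer>> graph = readEdges(edges);
        System.out.println(graph.get(1)); // outgoing links of page 1
    }
}
```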
The framework was thoroughly tested by performing two large crawls in the ".de" and ".li" domain spaces, which
lasted several months in total. Both crawls were successful and yielded web-graphs with millions of
pages and links as well as a content database containing multiple terabytes of data.
Features at a glance
- Based on Apache Hadoop / MapReduce
- Makes heavy use of distributed computation
- Exports crawled data as web-graph and content-database
- Written entirely in Java
- Easy to adapt
- Well tested through several months of operation
Please contact Christian Sohler (christian.sohler@tu-dortmund.de)
if you would like to access the crawled data and web-graphs.
Last update: 20.06.2012 by L. Pradel