Announce: MKSearch beta 1

Phil Shaw

MKDoc Ltd. would like to announce the first beta release of
MKSearch, under the GNU General Public Licence. Source and
pre-compiled binary downloads are available from the project Web
site.

MKSearch is a metadata search engine that indexes structured
metadata in Web documents, not free text in the document body.
The data acquisition system:

* Conforms to the Dublin Core metadata in HTML
recommendations [1]

* Supports other application profiles, such as the UK e-Government
Metadata Standard [2]

* Indexes native RDF formats, including RSS 1.0

The MKSearch system has five major components:

1. A Web crawler based on JSpider [3]

    * Multi-threaded processing
    * Per-site throttle, user agent, depth and linking rules
    * Respects the robots.txt exclusion policy
    * Extensible plug-in based content handling

2. An HTML document validator and formatter based on JTidy [4]

    * Cleans up and corrects HTML syntax errors
    * Converts HTML to XHTML

3. A set of custom indexers based on the Simple API for XML (SAX)

    * Extracts metadata from HTML meta and link elements
    * Converts metadata to RDF triple statements
    * Configurable application profiles

4. An RDF storage and query system based on Sesame [5]

    * RDF/XML file-based storage
    * Database storage using PostgreSQL or MySQL
    * Sophisticated Sesame RDF Query Language (SeRQL) queries
    * Scope for more semantically rich queries with inferencing

5. A public query interface, provided through a standard servlet

    * Simple, expandable query builder form
    * Configurable application profile-based presentation
    * Wildcard query handling
    * Phrase searches
    * Paged HTML results
    * Standing RSS results
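
As a rough illustration of the indexing step in component 3, the
following Java sketch uses SAX to read meta elements from an XHTML
page and turn them into N-Triple statements. It is not taken from
the MKSearch source: the class name MetaTripleSketch and the
extract() helper are hypothetical, and it assumes that every
"DC."-prefixed meta name maps straight onto the Dublin Core element
namespace. It ignores link elements, schema declarations and
configurable application profiles.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the MKSearch indexer itself.
class MetaTripleSketch extends DefaultHandler {
    // Dublin Core element set namespace.
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    private final String pageUri;
    private final List<String> triples = new ArrayList<String>();

    MetaTripleSketch(String pageUri) {
        this.pageUri = pageUri;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // The default parser is not namespace-aware, so qName is the raw tag name.
        if (!"meta".equals(qName)) {
            return;
        }
        String name = atts.getValue("name");
        String content = atts.getValue("content");
        // Assumption: only names of the form "DC.xxx" are indexed.
        if (name == null || content == null || !name.startsWith("DC.")) {
            return;
        }
        String property = DC_NS + name.substring(3).toLowerCase();
        triples.add("<" + pageUri + "> <" + property + "> \""
                + escape(content) + "\" .");
    }

    // Minimal literal escaping: backslashes and double quotes only.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    List<String> getTriples() {
        return triples;
    }

    // Parses well-formed XHTML (no DOCTYPE, to avoid fetching a DTD)
    // and returns one N-Triple line per matching meta element.
    static List<String> extract(String pageUri, String xhtml) throws Exception {
        MetaTripleSketch handler = new MetaTripleSketch(pageUri);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getTriples();
    }
}
```

Calling MetaTripleSketch.extract() with a page URI and the cleaned
XHTML returns a list of statements, one per Dublin Core meta
element. Non-Dublin Core meta elements, such as "keywords", are
skipped.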

The two main elements of the MKSearch system can be used
independently. The data acquisition system can be used to gather
large quantities of metadata from the Web and store it as RDF. The
query system can be used to provide a typical search engine-style
interface to existing RDF content.
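
To give an idea of what a query against existing RDF content might
look like, the Java sketch below builds a SeRQL query string for a
case-insensitive wildcard match on Dublin Core titles. The class
and method names are hypothetical, and the query shape is an
assumption about how such a search could be phrased, not a copy of
the queries that the MKSearch servlet generates.

```java
// Hypothetical sketch of a SeRQL query builder, not MKSearch code.
final class SerqlQuerySketch {

    // Builds a SeRQL query that matches pages whose dc:title
    // contains the given term, ignoring case.
    static String titleSearch(String term) {
        // Escape backslashes and double quotes so the term stays
        // inside the quoted pattern (assumed escaping rules).
        String safe = term.replace("\\", "\\\\").replace("\"", "\\\"");
        return "SELECT Page, Title\n"
             + "FROM {Page} dc:title {Title}\n"
             + "WHERE Title LIKE \"*" + safe + "*\" IGNORE CASE\n"
             + "USING NAMESPACE dc = <http://purl.org/dc/elements/1.1/>";
    }
}
```

Any "*" characters in the search term are passed through and act
as further wildcards in the pattern.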

The MKSearch beta 1 distribution includes sample configurations
that crawl a Web site and create:

* A mirror of the site on the local file system in valid XHTML
* An RDF N-Triple record for each page on the local file system
* UK e-Government metadata in a Sesame file-based repository

This distribution also includes a demonstration of the MKSearch
query interface, in the form of a Web Application Archive (WAR)
that can be deployed directly to an existing servlet container. The
sample search content is from an index of the MKSearch project
Web site on 2 November 2005. See the site documentation below:

System requirements and licence

MKSearch is written in the Java programming language and is
designed to run on any platform that supports a Java environment
equivalent to the Sun Java 2 language specification.

The system has specifically been designed, developed and tested
to run on GNU/Linux systems using the GNU Compiler for Java
(GCJ) [6] and Apache Tomcat 5 servlet container, as available on
Fedora Core 4 [7].  This provision means that MKSearch can be
built and run on software systems that are entirely open source
and free from proprietary licensing.

The system has been tested extensively using the Sun Java SDK
1.5 on Microsoft Windows 2000. JUnit test suites for the
MKSearch code base cover 99% of all code branches.

If you have any comments or questions about the MKSearch
system, please join us on the project mailing list.

MKSearch (beta)

Free, open source metadata search engine with RDF storage and query.