[MKSearch-dev] Re: Setting MKSearch

Tue Jan 10 15:10:52 GMT 2006

On 10 Jan 2006 at 10:33, Chris Croome wrote:

> > > $mk_home/conf/rdfstore/sites/mksearch.mkdoc.org.properties
> > 
> > OK... so as a minimum a file like this is needed for each site?

Although it's a chore, this gives you the finest control over what 
happens. It is possible to use the default site properties as a 
catch-all. There's quite a good explanation of per-site configuration 
in the JSpider manual (towards the end), but it will take some trial 
and error.

I've checked in a copy of the manual to 
$mk_home/doc/reference/jspider-0-5-0-doc-user.pdf

One advantage of the property-file-per-target-site is that any site 
that has no property configuration defaults to the skip.properties 
configuration and is not indexed.
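
As a rough illustration of that fallback, here is a small shell sketch that reports whether a host has its own property file or would fall through to skip.properties. The function name site_config_status is made up for this example; only the directory layout is taken from the paths quoted above.

```shell
#!/bin/bash
# Hypothetical helper, not part of MKSearch or JSpider.
# Assumes the layout <mk_home>/conf/rdfstore/sites/<host>.properties.
site_config_status() {
    local mk_home="$1"
    local host="$2"
    if [ -f "$mk_home/conf/rdfstore/sites/$host.properties" ]; then
        echo "$host: configured"
    else
        # No per-site file: the site gets the skip configuration.
        echo "$host: no property file, falls back to skip.properties (not indexed)"
    fi
}
```

For example, site_config_status "$mk_home" example.org would tell you whether example.org is going to be indexed before you start a run.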

JSpider distinguishes two types of tasks: spidering and parsing.
The key thing to remember is that MKSearch indexing is driven by the 
spidering process. The parsing process identifies new Web resources 
and can be controlled separately in the rules configuration.

It's probably best to start with a few target sites and build up from 
there. If you set up a start page on the test site you will be able 
to test the property configuration without having to process too many 
pages. 

> >   $mk_home/conf/rdfstore/sites/example.org.properties
> > 
> > And doing this addresses the com.mkdoc.store.LocalStoreManager.rdf
> > clobbering issue?

For a single run over any number of sites, JSpider/Sesame will 
synchronize the addition of triples to a single store (more below). 
To speed things up, you may wish to adjust the number of threads. The 
per-site throttle interval will ensure that no individual site is 
overloaded.

> So I have set up multiple spider config files:
> 
> And I have written this script to spider these sites:
> 
>   #!/bin/bash
>   # index mkdoc sites
> 
>   for a in www.bndfc.co.uk www.boothcentre.org.uk  etc.

Yes, that re-runs the spider for each site and overwrites the same 
file. It is probably easiest to set up a start page on the test site 
or elsewhere that simply links to all the sites you want to index. 
Use that as the base site for JSpider and it will make a single sweep 
of them all. Something like this IAR page will do the job:

http://test.mksearch.mkdoc.org/iar/index.html
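
If you would rather generate such a page from your existing list of hosts, a minimal sketch might look like the following. The function name make_start_page and the plain http://host/ link form are assumptions for illustration; adjust the links to whatever entry page each site should be spidered from.

```shell
#!/bin/bash
# Hypothetical helper: print a bare HTML start page with one link per host.
# The http://<host>/ URL form is an assumption, not taken from any real config.
make_start_page() {
    echo '<html><head><title>Sites to index</title></head><body><ul>'
    for host in "$@"; do
        echo "<li><a href=\"http://$host/\">$host</a></li>"
    done
    echo '</ul></body></html>'
}

# Example: make_start_page www.bndfc.co.uk www.boothcentre.org.uk > index.html
```

The resulting index.html can then be published on the test site and used as the single base URL for the run.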

If you start from the mksearch test site, you will need to remove 
this parser rule from mksearch.mkdoc.org.properties:

site.rules.parser.count=1
site.rules.parser.1.class=net.javacoding.jspider.mod.rule.BaseSiteOnlyRule

and set the rule count to zero:

site.rules.parser.count=0
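
If you prefer to script that change, something along these lines should do it. This is only a sketch: the function name disable_parser_rule is made up, and it assumes the file holds a single parser rule numbered 1, as in the properties quoted above.

```shell
#!/bin/bash
# Hypothetical helper: read a site properties file on stdin, drop the
# parser rule numbered 1 and set the parser rule count to 0.
# All other lines pass through unchanged.
disable_parser_rule() {
    sed -e '/^site\.rules\.parser\.1\.class=/d' \
        -e 's/^site\.rules\.parser\.count=.*/site.rules.parser.count=0/'
}

# Example (writes to a new file rather than editing in place):
#   disable_parser_rule < mksearch.mkdoc.org.properties > mksearch.mkdoc.org.properties.new
```

Writing to a separate file keeps the original around in case the edited configuration does not behave as expected.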

> But it does what I feared -- creates a
> com.mkdoc.store.LocalStoreManager.rdf for each site and then clobbers it
> so that there is only ever the metadata from one site in this file at
> any one time...

Ah, that clobber problem! Making a single sweep over all sites will 
create a single index of the whole lot. Then your copy and restart 
should be successful.

The index may turn out to be quite a large file. This is why the 
database storage option is ultimately preferable, though not fully 
tested.

> I'm sure I have something set up wrong... can you shed any light on
> this...?

Hopefully these notes shed the light you need. In case the store size 
becomes a problem, I'll check the database store option.

Best regards,

Phil
--
MKSearch (beta)

http://www.mksearch.mkdoc.org/

Free, open source metadata search engine with RDF storage and query.