[MKSearch-dev] Re: Setting MKSearch
Phil Shaw
phil at mkdoc.com
Tue Jan 10 15:10:52 GMT 2006
On 10 Jan 2006 at 10:33, Chris Croome wrote:
> > > $mk_home/conf/rdfstore/sites/mksearch.mkdoc.org.properties
> >
> > OK... so as a minimum a file like this is needed for each site?
Although it's a chore, this gives you the finest control over what
happens. It is possible to use the default site properties as a catch-
all. There's quite a good explanation of per-site configuration in
the JSpider manual (towards the end), but it will take some trial and
error.
I've checked in a copy of the manual to
$mk_home/doc/reference/jspider-0-5-0-doc-user.pdf
One advantage of the property-file-per-target-site is that any site
that has no property configuration defaults to the skip.properties
configuration and is not indexed.
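Because a site without its own property file is silently skipped, it can help to check which target hosts are missing one before a run. A minimal sketch, assuming each site is configured by a file named <host>.properties in the sites directory shown above; the function name missing_site_properties is made up for illustration and is not part of MKSearch or JSpider:

```shell
#!/bin/sh
# Hypothetical helper (not part of MKSearch or JSpider):
# print each host that has no <host>.properties file in the given directory.
# Usage: missing_site_properties SITES_DIR HOST...
missing_site_properties() {
    sites_dir=$1
    shift
    for host in "$@"; do
        # A missing file means the site would fall back to skip.properties.
        [ -f "$sites_dir/$host.properties" ] || printf '%s\n' "$host"
    done
}

# Example (assumes $mk_home is set as in the paths above):
# missing_site_properties "$mk_home/conf/rdfstore/sites" www.bndfc.co.uk www.boothcentre.org.uk
```

Any host it prints would not be indexed until a property file is added for it.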
JSpider distinguishes two types of tasks: spidering and parsing.
The key thing to remember is that MKSearch indexing is driven by the
spidering process. The parsing process identifies new Web resources
and can be controlled separately in the rules configuration.
It's probably best to start with a few target sites and build up from
there. If you set up a start page on the test site you will be able
to test the property configuration without having to process too many
pages.
> > $mk_home/conf/rdfstore/sites/example.org.properties
> >
> > And doing this addresses the com.mkdoc.store.LocalStoreManager.rdf
> > clobbering issue?
For a single run over any number of sites, JSpider/Sesame will
synchronize the addition of triples to a single store (more below).
To speed things up, you may wish to adjust the number of threads. The
per-site throttle interval will ensure the site is not overloaded.
> So I have set up multiple spider config files:
>
> And I have written this script to spider these sites:
>
> #!/bin/bash
> # index mkdoc sites
>
> for a in www.bndfc.co.uk www.boothcentre.org.uk etc.
Yes, that re-runs the spider for each site and overwrites the same
file. Probably easiest to set up a start page on the test site or
elsewhere that simply links to all the sites you want to index -- use
that as the base site for JSpider and that will make a single sweep
of all. Something like this IAR page will do the job:
http://test.mksearch.mkdoc.org/iar/index.html
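A start page of this kind can be generated rather than written by hand. A minimal sketch, assuming plain http:// links to each site root are enough for the spider to follow; the function name make_start_page and the output file name are made up for illustration, and the host names are the two from the quoted script:

```shell
#!/bin/sh
# Hypothetical helper: write a bare HTML page to stdout that links to
# the root of every host given as an argument.
make_start_page() {
    printf '%s\n' '<html>' '<head><title>Sites to index</title></head>' '<body>' '<ul>'
    for host in "$@"; do
        # One link per target site; the spider starts here and follows each.
        printf '<li><a href="http://%s/">%s</a></li>\n' "$host" "$host"
    done
    printf '%s\n' '</ul>' '</body>' '</html>'
}

# Example:
# make_start_page www.bndfc.co.uk www.boothcentre.org.uk > index.html
```

The generated page then serves as the single base URL for one sweep, instead of looping over the sites and re-running the spider for each.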
If you start from the mksearch test site, you will need to remove
these lines from mksearch.mkdoc.org.properties:
site.rules.parser.count=1
site.rules.parser.1.class=net.javacoding.jspider.mod.rule.BaseSiteOnlyRule
and replace them with:
site.rules.parser.count=0
> But it does what I feared -- creates a
> com.mkdoc.store.LocalStoreManager.rdf for each site and then clobbers it
> so that there is only ever the metadata from one site in this file at
> any one time...
Ah, that clobber problem! Making a single sweep over all sites will
create a single index of the whole lot. Then your copy and restart
should be successful.
The index may turn out to be quite a large file. This is why the
database storage option is ultimately preferable, though not fully
tested.
> I'm sure I have something set up wrong... can you shed any light on
> this...?
Hopefully these notes shed the light you need. In case the store size
becomes a problem, I'll check the database store option.
Best regards,
Phil
--
MKSearch (beta)
http://www.mksearch.mkdoc.org/
Free, open source metadata search engine with RDF storage and query.