[MKSearch-dev] Setting MKSearch

Chris Croome chris at webarchitects.co.uk
Mon Jan 9 13:42:33 GMT 2006


Hi Phil

Best do this on the list I think :-)

On Thu 05-Jan-2006 at 06:22:19PM -0000, Phil Shaw wrote:
> On 5 Jan 2006 at 16:49, Chris Croome wrote:
> 
> > The site I have set up is here:
> > 
> >   http://mksearch.dev.webarch.net/mksearch/
> > 
> > And it seems to be fine apart from the fully qualified URIs, like this:
> > 
> >   http://localhost:8080/mksearch/?property=dcterms.alternative
> 
> The system uses the absolute URI as a key for an (as yet
> unimplemented) result cache. Can you send me the server.xml
> configuration you are using? It should be at:
> 
> /etc/tomcat5/server.xml
> 
> Tomcat does not know the host name it is operating under. Some work on
> the configuration should sort that out.
> 
> > Could MKSearch generate these relative to the DocumentRoot, ie:
> > 
> >   /mksearch/?property=dcterms.alternative
> 
> This could be done if fix #1 does not solve it. Even so, it would be
> preferable to add a configuration parameter to the MKSearch servlet to
> tell it the domain it's working under.

Fix #1 won't work unless we can also remove the 8080 port number via
server.xml, since port 8080 (Tomcat) is not directly accessible. I want
to do everything via Apache.

> > My next question is which indexes to generate: triple and/or
> > rdfstore?
> 
> The triple configuration is a good one to check the indexer is
> properly set up. It will write a mirrored set of N-Triple files for
> each Web page at $mk_home/output by default.
> 
> If you start at a deep URL on the test site, it won't take long to
> confirm it's working properly, e.g.:
> 
> http://test.mksearch.mkdoc.org/link/rel/index.html

Yes, things look fine :-)

> Then you can switch to the rdfstore configuration. This will create an
> RDF/XML serialization at:
> 
> $mk_home/output/com.mkdoc.store.LocalStoreManager.rdf

OK, I have done this. I see that each time java-jspider.sh runs it
clobbers com.mkdoc.store.LocalStoreManager.rdf rather than updating or
adding to it? Or perhaps the problem is that I haven't yet done the
multi-site setup stuff...?
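
Until that is settled, one way to avoid losing an earlier index is to
copy the store aside before each spider run. This is only a minimal
sketch: backup_store is a made-up name, the timestamp suffix is an
arbitrary choice, and it assumes the store lives under the output
directory at the path given above.

```shell
# Hypothetical helper (not part of MKSearch): keep a timestamped copy of
# the RDF store so a later spider run cannot overwrite the only copy.
# $1 is the MKSearch home directory, i.e. $mk_home.
backup_store() {
    store="$1/output/com.mkdoc.store.LocalStoreManager.rdf"
    if [ -f "$store" ]; then
        cp "$store" "$store.$(date +%Y%m%d%H%M%S)"
    else
        echo "no store found at $store" >&2
        return 1
    fi
}

# Usage (not run here): backup_store "$mk_home"
```

Run it just before java-jspider.sh; the copies accumulate next to the
original file and can be deleted by hand.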

> If you use this file to replace the current sample index (below) and 
> re-build the WAR, it will make a drop-in replacement:
> 
> $mk_home/src/app/WEB-INF/rdf/com.mkdoc.store.LocalStoreManager.rdf

OK, I did that, though the spider might have died before it completed
the task. These are the last couple of lines of output:

  PANIC! Task net.javacoding.jspider.core.task.work.SpiderHttpURLTask@133c8d0 threw an excpetion!
  java.lang.NullPointerException

> Re-run:
> 
> $mk_home/bin/war-mksearch.sh
> 
> And the updated WAR file is output at:
> 
> $mk_home/dist/mksearch.war

OK, I have done this and deployed the .war file:

  sudo cp $mk_home/dist/mksearch.war /var/lib/tomcat5/webapps/
  sudo /etc/init.d/tomcat5 restart
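
The copy step can be guarded so that a missing or empty WAR never
reaches Tomcat. A rough sketch only: deploy_war is a made-up name, and
the example paths in the comments are the ones used above.

```shell
# Hypothetical wrapper (not part of MKSearch): copy the WAR into the
# webapps directory, refusing to do so if the file is missing or empty.
deploy_war() {
    war="$1"        # e.g. $mk_home/dist/mksearch.war
    webapps="$2"    # e.g. /var/lib/tomcat5/webapps
    if [ ! -s "$war" ]; then
        echo "WAR not found or empty: $war" >&2
        return 1
    fi
    cp "$war" "$webapps/"
}

# Usage as root (not run here):
#   deploy_war "$mk_home/dist/mksearch.war" /var/lib/tomcat5/webapps \
#       && /etc/init.d/tomcat5 restart
```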

So the test install now has an index of http://www.webarchitects.co.uk/

  http://mksearch.dev.webarch.net/mksearch/

And Subject searches work OK:

  http://mksearch.dev.webarch.net/mksearch/HttpQuery?dc.subject=web&type=html&limit=10

But, when I search for documents contributed to by "Chris Croome" I get
no results:

  http://mksearch.dev.webarch.net/mksearch/HttpQuery?dc.subject=&dc.contributor=Chris+Croome&type=html&limit=10

But I'm down as a contributor to the front page of the site...
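
A quick way to tell whether this is an indexing problem or a query
problem is to look for the name in the store itself. A minimal sketch,
assuming the name appears as a literal string in the RDF/XML (it may
not, if it is encoded or split differently); count_mentions is a
made-up name.

```shell
# Hypothetical check: print the number of lines in a file that contain
# the given fixed string. grep exits non-zero when the count is 0.
count_mentions() {
    grep -c -F -- "$1" "$2"
}

# Usage (not run here):
#   count_mentions "Chris Croome" \
#       "$mk_home/output/com.mkdoc.store.LocalStoreManager.rdf"
```

A count of 0 would suggest the contributor metadata never made it into
the index; anything higher points at the query side instead.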

> To configure for multiple sites, you will need to edit the rules in
> the configuration files; see these for example:
> 
> $mk_home/conf/rdfstore/sites.properties
> $mk_home/conf/rdfstore/sites/default.properties
> $mk_home/conf/rdfstore/sites/mksearch.mkdoc.org.properties

OK... so as a minimum a file like this is needed for each site?

  $mk_home/conf/rdfstore/sites/example.org.properties

And doing this addresses the com.mkdoc.store.LocalStoreManager.rdf
clobbering issue?
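
If the answer is yes, a new per-site file could be started from the
defaults. This is a sketch under assumptions: new_site_config is a
made-up name, copying default.properties is just one possible starting
point, and sites.properties may need editing as well.

```shell
# Hypothetical helper (not part of MKSearch): create
# <sites_dir>/<host>.properties from default.properties, without
# overwriting a file that already exists.
new_site_config() {
    sites_dir="$1"   # e.g. $mk_home/conf/rdfstore/sites
    host="$2"        # e.g. example.org
    target="$sites_dir/$host.properties"
    if [ -e "$target" ]; then
        echo "already exists: $target" >&2
        return 1
    fi
    cp "$sites_dir/default.properties" "$target"
}

# Usage (not run here):
#   new_site_config "$mk_home/conf/rdfstore/sites" example.org
```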

Chris

PS I have left in the rest of your email, since having it in the
archives could help people :-)

> More specific rules override the general rules at the base level. You
> can create any number of per-site configurations for throttling,
> robots.txt, user agent, etc. as above, but it's not necessary. If no
> site-specific configuration is declared, the default properties will
> be used.
> 
> The JSpider manual has guidance on configuration, but you'll need to
> skim over lots to find the useful stuff. More pointers below.
> 
> http://prdownloads.sourceforge.net/j-spider/jspider-0-5-0-doc-user.pdf?download
> 
> See this JavaDoc page for an outline of the JSpider configuration
> rules:
> 
> https://svn.mkdoc.com/mksearch/trunk/doc/javadoc/jspider/net/javacoding/jspider/mod/rule/package-summary.html
> 
> And some rules I wrote:
> 
> https://svn.mkdoc.com/mksearch/trunk/doc/javadoc/com/mkdoc/jspider/HtmlAndRdfMimeTypeOnlyRule.html
> https://svn.mkdoc.com/mksearch/trunk/doc/javadoc/com/mkdoc/jspider/RdfMimeTypeOnlyRule.html

-- 
Chris Croome                               <chris at webarchitects.co.uk>
web design                             http://www.webarchitects.co.uk/ 
web content management                               http://mkdoc.com/   

