[MKSearch-dev] Jspider postgres problems
Phil Shaw
phil at mkdoc.com
Wed Feb 8 16:08:45 GMT 2006
On 8 Feb 2006 at 13:29, Chris Croome wrote:
> This is the script:
>
> #!/bin/bash
> SITES="tre.ngfl.gov.uk ferl.becta.org.uk www.aclearn.net"
> for a in $SITES
> do
> $mk_home/bin/java-jspider-pgsql.sh http://$a/ rdfstoredb
> done
It's okay to use this type of script with database storage, but you
are starting and stopping the JVM for each site. If you were to set up
a start page of index links, as we did before, there would be only one
JVM start-up and the spider could make better use of its threading
capability. Not critical.
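As a rough sketch of the start page idea, something like the following
would generate a single page of links from the same SITES list. The
START_PAGE path and the START_URL placeholder are assumptions for
illustration only; the page would need to be published somewhere the
spider can reach, and the final spider call is left commented out for
that reason.

```shell
#!/bin/bash
# Sketch: build one start page that links to every site, so the spider
# can be launched once instead of once per site.
SITES="tre.ngfl.gov.uk ferl.becta.org.uk www.aclearn.net"
START_PAGE="${START_PAGE:-./start.html}"   # hypothetical output path

{
  echo '<html><body>'
  for a in $SITES
  do
    # One index link per site
    echo "<a href=\"http://$a/\">$a</a>"
  done
  echo '</body></html>'
} > "$START_PAGE"

# Assumption: START_PAGE is served at START_URL (a placeholder, not a
# real address). A single run would then cover all three sites:
# $mk_home/bin/java-jspider-pgsql.sh "$START_URL" rdfstoredb
```

With one run the JVM starts once, and the spider's threads can work
across all of the linked sites instead of one site at a time.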
> And it runs for a while with various warnings and then ends with this:
>
> PANIC! Task net.javacoding.jspider.core.task.work.SpiderHttpURLTask at 1231fd8 threw an excpetion! java.lang.NullPointerException
I have occasionally seen this type of error before and I think it has
to do with thread scheduling during the spider shutdown process. I
traced through the code once and found that the engine tries to clean
up a thread that has just terminated itself. I don't think it affects
the indexing because it's all over by then.
> And when I look at the database to see what is there I get:
>
> mksearch_test=> select * from literals;
> id | datatype | labelhash | language | label
> ----+----------+----------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> -1 | 0 | -7566735784996357570 | | Community Learning Resource
> -2 | 0 | -1621285313438006658 | |
> -3 | 0 | 8274037457894185038 | | The Community Learning Resource website supports the Adult and Community Learning (ACL) sector. It provides information, advice and guidance to those working in the sector and is designed to compliment the rollout of effective e-learning and related support into ACL.
> (3 rows)
That's not very much data, but I suspect it's more likely to be a
configuration issue governing which pages are indexed. Take a look at
the site properties configuration and the rules applied to spidering.
Perhaps there is no configuration for this domain and it is only
spidering the first page it encounters.
Best regards,
Phil
--
MKSearch (beta)
http://www.mksearch.mkdoc.org/
Free, open source metadata search engine with RDF storage and query.