[MKSearch-dev] Jspider postgres problems

Wed Feb 8 16:08:45 GMT 2006

On 8 Feb 2006 at 13:29, Chris Croome wrote:

> This is the script:
> 
>   #!/bin/bash
>   SITES="tre.ngfl.gov.uk ferl.becta.org.uk www.aclearn.net"
>   for a in $SITES
>     do
>     $mk_home/bin/java-jspider-pgsql.sh http://$a/ rdfstoredb
>   done

It's okay to use this type of script with database storage, but you 
are starting and stopping the JVM for each site. If you were to set 
up a start page of index links, as we did before, there would be only 
one JVM start-up and the spider could make better use of its 
threading capability. Not critical.
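
A rough sketch of the start-page approach, reusing the site list from 
the script above. The make_start_page function, the START_PAGE file 
location and the START_URL variable are made up for illustration, and 
I am assuming the page is published somewhere the spider can fetch it 
and that the spider rules allow it to follow links to each host.

```shell
#!/bin/bash
# Sketch: build one start page linking to every site, so the spider
# can be launched once instead of once per site.
SITES="tre.ngfl.gov.uk ferl.becta.org.uk www.aclearn.net"

# Print a minimal HTML index page with one link per host name given.
make_start_page() {
  echo '<html><head><title>Spider start page</title></head><body><ul>'
  for a in "$@"
  do
    echo "  <li><a href=\"http://$a/\">$a</a></li>"
  done
  echo '</ul></body></html>'
}

# Hypothetical output file; publish it wherever the spider can reach it.
START_PAGE="${START_PAGE:-${TMPDIR:-/tmp}/start.html}"
make_start_page $SITES > "$START_PAGE"

# Then a single spider run. START_URL is an assumed address for the
# published page, so the call is left commented out here:
#   $mk_home/bin/java-jspider-pgsql.sh "$START_URL" rdfstoredb
```

With this there is one JVM start-up for the whole batch, and the 
spider can schedule fetches across all three sites with its own 
threads.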

> And it runs for a while with various warnings and then ends with this:
> 
>   PANIC! Task net.javacoding.jspider.core.task.work.SpiderHttpURLTask at 1231fd8 threw an excpetion!  java.lang.NullPointerException

I have occasionally seen this type of error before and I think it's 
to do with thread scheduling during the spider shutdown process. I 
traced through the code once and found that the engine tries to clean 
up a thread that has just terminated itself. I don't think it affects 
the indexing because it's all over by then.

> And when I look at the database to see what is there I get:
> 
>   mksearch_test=> select * from literals;
>    id | datatype |      labelhash       | language |                                                                                                                                    label                                                                                                                                    
>   ----+----------+----------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>    -1 |        0 | -7566735784996357570 |          | Community Learning Resource
>    -2 |        0 | -1621285313438006658 |          | 
>    -3 |        0 |  8274037457894185038 |          | The Community Learning Resource website supports the Adult and Community Learning (ACL) sector. It provides information, advice and guidance to those working in the sector and is designed to compliment the rollout of effective e-learning and related support into ACL.
>   (3 rows)

That's not very much data, but I suspect it's more likely to be a 
configuration issue governing which pages should be indexed. Take a 
look at the site properties configuration and the rules applied to 
spidering. Perhaps there is no configuration for this domain, so the 
spider is only indexing the first page it encounters.

Best regards,

Phil
--
MKSearch (beta)

http://www.mksearch.mkdoc.org/

Free, open source metadata search engine with RDF storage and query.