From phil at mkdoc.com Sun Mar 9 10:30:45 2008
From: phil at mkdoc.com (Phil Shaw)
Date: Sun Mar 9 10:31:16 2008
Subject: [MKSearch-dev] Server reply unparseable?
In-Reply-To: <47C33F02.5080404@bu.edu>
Message-ID: <47D3BC55.28403.1030697F@phil.mkdoc.com>

On 25 Feb 2008 at 17:19, Jeff Albro wrote:

> I'm trying to spider a local page, and am getting:
>
> java.io.IOException: Server reply was unparseable: "-//IETF//DTD HTML 2.0//EN">
> Server reply was unparseable: 2.0//EN">
> [Plugin] Error event comment: resource http://emt.bu.edu couldn't be
> fetched [0]
> [Plugin] 0 - ERROR !!!http://emt.bu.edu
>
> It wouldn't surprise me if there were an error in this page's doctype,
> but is there a way to ignore the error?

Jeff,

Sorry for the delay in responding. It would help if you could outline
how you have built and configured MKSearch and what source you have
used please.

All HTML is passed through JTidy to convert it to XHTML before
processing. XML validation and indexing only occurs _after_ the source
has passed through JTidy, so this looks like JSpider may be having
trouble processing the source for link extraction.

The start URL you have given, without a trailing slash, results in a bad
request response from the server, which will probably be handled as an
error by JSpider:

400 Bad Request

Bad Request

Your browser sent a request that this server could not understand.


Apache/2.0.52 (CentOS) Server at imc-sed.bu.edu Port 80
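
As an illustration of the trailing-slash point above, here is a minimal sketch of how a start URL could be normalized before it is handed to a spider. The class and method names (StartUrl, withRootPath) are hypothetical and are not part of MKSearch or JSpider; the sketch only assumes the standard java.net.URI class and does not claim to reflect how JSpider builds its requests.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of MKSearch or JSpider.
final class StartUrl {

    // Returns the URL with "/" as its path when it names a host but has
    // no path at all; any other URL is returned unchanged.
    static String withRootPath(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on bad syntax
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();
        if (scheme == null || authority == null || path == null || path.length() > 0) {
            return url; // relative, opaque, or already has a path
        }
        String query = uri.getRawQuery();
        String fragment = uri.getRawFragment();
        return scheme + "://" + authority + "/"
                + (query == null ? "" : "?" + query)
                + (fragment == null ? "" : "#" + fragment);
    }

    public static void main(String[] args) {
        System.out.println(withRootPath("http://emt.bu.edu"));  // prints http://emt.bu.edu/
        System.out.println(withRootPath("http://emt.bu.edu/")); // prints http://emt.bu.edu/
    }
}
```

The raw (still percent-encoded) components are reassembled by plain string concatenation so that nothing in the query or fragment is re-encoded along the way.
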
Web browsers will apparently redirect to the default base URL
http://emt.bu.edu/ with a trailing slash. Try starting the indexer with
that URL instead.

The target site's markup is declared as XHTML 1.0 Transitional, but does
not validate on many counts. However, this should _probably_ be cleaned
up by JTidy.

It would probably be best to ensure your installation can process our
test site content before trying elsewhere.

Best regards,

Phil

-- 
MKSearch (beta) http://www.mksearch.mkdoc.org/
Free, open source metadata search engine with RDF storage and query.

From phil at mkdoc.com Sun Mar 9 10:39:03 2008
From: phil at mkdoc.com (Phil Shaw)
Date: Sun Mar 9 10:39:27 2008
Subject: [MKSearch-dev] http://test.mksearch.mkdoc.org/ down?
In-Reply-To: <47C33831.8010207@bu.edu>
References: <47BB4691.25583.583EF9B8@phil.mkdoc.com>
Message-ID: <47D3BE47.11391.10380448@phil.mkdoc.com>

On Monday, February 25, 2008 at 16:50, Jeff Albro wrote:

> I can confirm that the site is back up... but it is still not working
> for me... How sensitive it is to java version?
> I'm using:
>
> export mk_build=/home/jalbro/mksearch/build
> export mk_home=/home/jalbro/mksearch
> #export CLASSPATH=/usr/share/java/libgcj-3.4.1.jar
> export CLASSPATH=/usr/share/java/libgcj-3.4.3.jar
>
> And I get the error below. I also got it with gij-jspider.
> exception during spidering
> java.lang.ClassCastException: java.util.List
> java.util.List

Jeff,

This exception does suggest a Java version conflict. It may be that the
java.util.List version the source was compiled against is not compatible
with your runtime system.

To ensure your installation is compatible with your runtime environment,
follow these instructions on compiling with GCJ on Linux:

http://www.mksearch.mkdoc.org/howto/build-mksearch-with-gcj/

Hope this helps.

Phil

-- 
MKSearch (beta) http://www.mksearch.mkdoc.org/
Free, open source metadata search engine with RDF storage and query.
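
When chasing a suspected Java version conflict like the one discussed above, it can help to record exactly which runtime and classpath are in effect and include that in a report to the list. The following is a minimal sketch under that assumption; the class name RuntimeInfo is hypothetical and not part of MKSearch. It reads only standard system properties and does not diagnose the ClassCastException itself.

```java
// Hypothetical diagnostic helper for illustration only; not part of MKSearch.
final class RuntimeInfo {

    // Standard system properties that identify the running Java runtime.
    private static final String[] KEYS = {
        "java.version", "java.vendor", "java.vm.name", "java.class.path"
    };

    // Builds one "key=value" line per property.
    static String describe() {
        String out = "";
        for (int i = 0; i < KEYS.length; i++) {
            out += KEYS[i] + "=" + System.getProperty(KEYS[i]) + "\n";
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Running it with the same runtime and CLASSPATH setting used for the indexer shows which Java implementation is actually being picked up.
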