[MKSearch-dev] Server reply unparseable?

Phil Shaw phil at mkdoc.com
Sun Mar 9 10:30:45 GMT 2008


On 25 Feb 2008 at 17:19, Jeff Albro wrote:

> I'm trying to spider a local page, and am getting:
> 
> java.io.IOException: Server reply was unparseable: <!DOCTYPE HTML PUBLIC 
> "-//IETF//DTD HTML 2.0//EN">
> Server reply was unparseable: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 
> 2.0//EN">
> [Plugin] Error event comment: resource http://emt.bu.edu couldn't be 
> fetched [0]
> [Plugin] 0 - ERROR !!!http://emt.bu.edu
> 
> It wouldn't surprise me if there were an error in this page's doctype, 
> but is there a way to ignore the error?

Jeff,

Sorry for the delay in responding. It would help if you could outline 
how you built and configured MKSearch, and which source code you 
used.

All HTML is passed through JTidy to convert it to XHTML before 
processing. XML validation and indexing only occur _after_ the 
source has passed through JTidy, so this looks like JSpider may be 
having trouble processing the source for link extraction.

The start URL you have given, without a trailing slash, results in a 
bad request response from the server, which will probably be handled 
as an error by JSpider:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not 
understand.<br />
</p>
<hr>
<address>Apache/2.0.52 (CentOS) Server at imc-sed.bu.edu Port 
80</address>
</body></html>

Web browsers will apparently redirect to the default base URL 
http://emt.bu.edu/ with a trailing slash. Try starting the indexer 
with that URL instead.
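
If you would rather guard against this in code than rely on typing the 
slash each time, something along the following lines might do. This is 
only a rough sketch using java.net.URI: the class and method names are 
hypothetical and are not part of MKSearch or JSpider, and where you 
would call it from depends on how you start the indexer.

```java
import java.net.URI;

// Hypothetical helper, not part of MKSearch or JSpider.
final class StartUrlNormalizer {

    private StartUrlNormalizer() {}

    // Returns the URL with "/" as its path when the path is empty,
    // e.g. http://emt.bu.edu becomes http://emt.bu.edu/
    // Any other URL is returned unchanged.
    public static String withRootPath(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();

        // Leave relative and opaque URIs (no scheme or host part) alone,
        // and leave URLs that already have a path alone.
        if (scheme == null || authority == null
                || (path != null && !path.isEmpty())) {
            return url;
        }

        // Rebuild from the raw components so existing escapes are kept.
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(authority).append('/');
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

Calling StartUrlNormalizer.withRootPath("http://emt.bu.edu") should 
give http://emt.bu.edu/, which is the form to pass as the start URL.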

The target site's markup is declared as XHTML 1.0 Transitional, but 
does not validate on many counts. However, this should _probably_ be 
cleaned up by JTidy.

It would probably be best to ensure your installation can process 
our test site content before trying elsewhere.

Best regards,

Phil

--
MKSearch (beta)

http://www.mksearch.mkdoc.org/

Free, open source metadata search engine with RDF storage 
and query.


