[MKSearch-dev] Server reply unparseable?
Phil Shaw
phil at mkdoc.com
Sun Mar 9 10:30:45 GMT 2008
On 25 Feb 2008 at 17:19, Jeff Albro wrote:
> I'm trying to spider a local page, and am getting:
>
> java.io.IOException: Server reply was unparseable: <!DOCTYPE HTML PUBLIC
> "-//IETF//DTD HTML 2.0//EN">
> Server reply was unparseable: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML
> 2.0//EN">
> [Plugin] Error event comment: resource http://emt.bu.edu couldn't be
> fetched [0]
> [Plugin] 0 - ERROR !!!http://emt.bu.edu
>
> It wouldn't surprise me if there were an error in this page's doctype,
> but is there a way to ignore the error?
Jeff,
Sorry for the delay in responding. It would help if you could outline
how you built and configured MKSearch, and which source you used.
All HTML is passed through JTidy to convert it to XHTML before
processing. XML validation and indexing only occur _after_ the
source has passed through JTidy, so this looks like JSpider may be
having trouble processing the source for link extraction.
The start URL you have given, without a trailing slash, results in a
bad request response from the server, which will probably be handled
as an error by JSpider:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not
understand.<br />
</p>
<hr>
<address>Apache/2.0.52 (CentOS) Server at imc-sed.bu.edu Port
80</address>
</body></html>
Web browsers will apparently redirect to the default base URL
http://emt.bu.edu/ with a trailing slash. Try starting the indexer
with that URL instead.
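If you would rather guard against this before the URL reaches the
spider, the trailing-slash fix can be sketched in a few lines of Java.
This is only an illustration using the standard java.net.URI class;
the class and method names (StartUrlNormalizer, withRootPath) are
hypothetical and are not part of MKSearch or JSpider:

```java
import java.net.URI;

// Hypothetical helper, not part of MKSearch or JSpider.
class StartUrlNormalizer {

    // Returns the URL unchanged unless it has an authority (host) and an
    // empty path, in which case "/" is inserted as the path.
    static String withRootPath(String url) {
        URI uri = URI.create(url);
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();
        if (authority == null || (path != null && !path.isEmpty())) {
            return url;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(authority).append('/');
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints http://emt.bu.edu/
        System.out.println(withRootPath("http://emt.bu.edu"));
    }
}
```

URLs that already have a path, including a bare "/", are returned
as they are, so it should be safe to apply to every start URL.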
The target site's markup is declared as XHTML 1.0 Transitional, but
does not validate on many counts. However, this should _probably_ be
cleaned up by JTidy.
It would probably be best to ensure your installation can process
our test site content before trying elsewhere.
Best regards,
Phil
--
MKSearch (beta)
http://www.mksearch.mkdoc.org/
Free, open source metadata search engine with RDF storage
and query.