[MKSearch-dev] Server reply unparseable?

Jeff Albro jalbro at bu.edu
Mon Feb 25 22:19:46 GMT 2008


I'm trying to spider a local page, and am getting:

java.io.IOException: Server reply was unparseable: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Server reply was unparseable: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
[Plugin] Error event comment: resource http://emt.bu.edu couldn't be fetched [0]
[Plugin] 0 - ERROR !!!http://emt.bu.edu

It wouldn't surprise me if there were an error in this page's doctype, 
but is there a way to ignore the error?
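
One way to narrow this down is to look at the first line the server actually sends back. The sketch below is only a diagnostic under an assumption the log does not confirm: that the exception means the HTTP client found HTML where it expected a status line such as "HTTP/1.1 200 OK". The class StatusLineCheck and its methods are hypothetical names and are not part of JSpider or MKSearch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of JSpider or MKSearch.
class StatusLineCheck {
    // An HTTP/1.x status line has the form "HTTP/<major>.<minor> <3-digit code> [reason]".
    private static final Pattern STATUS_LINE =
        Pattern.compile("HTTP/\\d+\\.\\d+ \\d{3}( .*)?");

    // Returns true only if the whole line has the shape of an HTTP status line.
    static boolean looksLikeStatusLine(String firstLine) {
        return firstLine != null && STATUS_LINE.matcher(firstLine).matches();
    }

    // Sends a bare GET for "/" and returns the first line of the raw reply.
    static String firstReplyLine(String host, int port) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(10000); // give up after 10 seconds of silence
            OutputStream out = socket.getOutputStream();
            String request = "GET / HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Host defaults to the site from the log; pass another host as the first argument.
        String host = args.length > 0 ? args[0] : "emt.bu.edu";
        String line = firstReplyLine(host, 80);
        System.out.println("First line: " + line);
        System.out.println("Looks like a status line: " + looksLikeStatusLine(line));
    }
}
```

If the first line printed is the DOCTYPE rather than a status line, the problem would be in the HTTP exchange itself and not in the page's markup, so ignoring the doctype would not help.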

-Jeff

Here is the full error:


sed-linux:~/mksearch/bin$ $mk_home/bin/gij-jspider.sh http://emt.bu.edu triple
@jspider.version.string@
Build: @build.DSTAMP@
Started from .
[Engine] jspider.home=/home/jalbro/mksearch
[Engine] default output folder=/home/jalbro/mksearch/output
[Engine] starting with configuration 'triple'
Loading 2 plugins.
Loading plugin configuration 'console'...
first trying to instantiate via ctr with (name, config) params
plugin 'console' prefix is '[Plugin]'
adding space after prefix
Prefix set to '[Plugin] '
plugin instantiated.
Plugin not configured for local event filtering
Plugin Name    : Console writer JSpider module
Plugin Version : v1.0
Plugin Vendor  : http://www.javacoding.net
Loading plugin configuration 'xhtmltriple'...
first trying to instantiate via ctr with (name, config) params
cannot instantiate module - constructor with name and PropertySet params not found
java.lang.NoSuchMethodException: <init>
plugin not yet instantiated, trying via ctr with (config) param
Custom application profile com.mkdoc.schema.DublinCoreProfile loaded.
plugin instantiated.
Plugin uses local event filtering
EventDispatcher for Plugin 'XHTML metadata triple writer plugin for JSpider' configuring...
EventFilter for engine events = net.javacoding.jspider.mod.eventfilter.AllowNoneEventFilter
EventFilter for monitor events = net.javacoding.jspider.mod.eventfilter.AllowNoneEventFilter
EventFilter for spider events = net.javacoding.jspider.mod.eventfilter.AllowAllEventFilter
EventDispatcher EventDispatcher for Plugin 'XHTML metadata triple writer plugin for JSpider' configured.
Plugin Name    : XHTML metadata triple writer plugin for JSpider
Plugin Version : v0.7
Plugin Vendor  : http://www.mkdoc.com
Loaded 2 plugins.
Global Event Dispatcher configuring...
EventFilter for engine events = net.javacoding.jspider.mod.eventfilter.AllowAllEventFilter
EventFilter for monitor events = net.javacoding.jspider.mod.eventfilter.AllowAllEventFilter
EventFilter for spider events = net.javacoding.jspider.mod.eventfilter.AllowAllEventFilter
EventDispatcher Global Event Dispatcher configured.
Global Event Dispatcher intializing...
EventDispatcher for Plugin 'XHTML metadata triple writer plugin for JSpider' intializing...
EventDispatcher for Plugin 'XHTML metadata triple writer plugin for JSpider' intialized.
Global Event Dispatcher intialized.
Storage provider class is 'class net.javacoding.jspider.core.storage.memory.InMemoryStorageProvider'
rule net.javacoding.jspider.mod.rule.OnlyHttpProtocolRule hasn't got a config-param constructor
added rule net.javacoding.jspider.mod.rule.OnlyHttpProtocolRule to spider ruleset
rule net.javacoding.jspider.mod.rule.TextHtmlMimeTypeOnlyRule hasn't got a config-param constructor
added rule net.javacoding.jspider.mod.rule.TextHtmlMimeTypeOnlyRule to parser ruleset
default user Agent is 'MKSearch 0.1 (http://www.mksearch.mkdoc.org)'
TaskScheduler provider class is 'class net.javacoding.jspider.core.task.impl.DefaultSchedulerProvider'
Spider born - threads: spiders: 1, thinkers: 1
Worker thread (Spider 0) born
Worker thread (Thinker 0) born
[Plugin] Module : Console writer JSpider module
[Plugin] Version: v1.0
[Plugin] Vendor : http://www.javacoding.net
[Plugin] Spidering Started, baseURL = http://emt.bu.edu
using userAgent 'MKSearch 0.1 (http://www.mksearch.mkdoc.org)' for site 'http://emt.bu.edu'
rule net.javacoding.jspider.mod.rule.InternallyReferencedOnlyRule hasn't got a config-param constructor
added rule net.javacoding.jspider.mod.rule.InternallyReferencedOnlyRule to spider ruleset
rule net.javacoding.jspider.mod.rule.BaseSiteOnlyRule hasn't got a config-param constructor
added rule net.javacoding.jspider.mod.rule.BaseSiteOnlyRule to parser ruleset
[Plugin] site discovered : http://emt.bu.edu
[Plugin] net.javacoding.jspider.api.event.site.RobotsTXTSkippedEvent RobotsTXTSkippedEvent for site [Site: http://emt.bu.edu - ROBOTSTXT_SKIPPED *]
[Plugin] resource discovered: http://emt.bu.edu
Thinker task dispatcher running ...
Spider task dispatcher running ...
Throttle provider class is 'class net.javacoding.jspider.core.throttle.impl.DistributedLoadThrottleProvider'
throttle interval set to 1000 ms.
exception during spidering
java.io.IOException: Server reply was unparseable: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Server reply was unparseable: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
[Plugin] Error event comment: resource http://emt.bu.edu couldn't be fetched [0]
[Plugin] 0 - ERROR !!!http://emt.bu.edu




---------------------------------------------------------
Jeff Albro - Information Technology Manager
Boston University School of Education
jalbro at bu.edu   (617) 358-2966

