[MKSearch-dev] Week 9 round up
Phil Shaw
phil at mkdoc.com
Fri Dec 10 18:52:23 GMT 2004
I have something new to try out this week: a first-draft
MetaParserPlugin for JSpider that extracts basic triple statements
from HTML meta elements. It is not thoroughly tested at this stage,
but I have tried it with the test site and it runs as expected.
$mk_home/bin/java-jspider.sh http://test.mksearch.mkdoc.org parse
It works like the JTidy plugin, but writes a text file for each HTML
document to the output directory. The current test files only have
the "generator" metadata that JTidy puts in, but it should extract
more substantial content from other sites where it is present. Please
give it a try.
Note that I have changed the environment variable name for the Sun
SDK installation directory to $JAVA_14.
Best regards,
Phil
Wednesday
~~~~~~~~~
Renamed the Java execution scripts to distinguish GIJ and Sun Java
interpreter targets. Committed a path-independent execution script
for JTidy under Windows. Made an initial commit for the Sesame Open
RDF API, then cut out unnecessary dependencies and applied some
workarounds for GCJ bugs. Wrote first-draft compilation and JAR
scripts for Sesame.
Thursday
~~~~~~~~
Checked in the current Sesame CVS source -- apparently the set
imported on Wednesday was an abandoned archive! Made various
amendments to exclude non-critical dependencies for a quick start;
see below.
http://www.mksearch.mkdoc.org/documentation/sesame/
The "find" pipe trick used for JSpider compilation hit the limit on
input arguments with the Sesame source -- more than 400 classes.
Created a general-purpose source listing script, like the former Java
version, e.g.
$mk_home/bin/util/source-list.sh $mk_home/lib-src/sesame sesame
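For anyone who wants the idea without reading the script, here is a
rough Java sketch of the source listing step. It is not the project's
script or the former Java version; the class name SourceListSketch,
the ".list" output suffix and the use of the modern java.nio.file API
(Java 8 or later) are all assumptions made for illustration. It walks
a source tree and writes one .java path per line, so the compiler can
be handed a single file instead of hundreds of arguments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not part of the MKSearch code base.
public class SourceListSketch {

    /** Returns the path of every .java file under root, sorted. */
    static List<String> listSources(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".java"))
                .map(Path::toString)
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SourceListSketch <source-dir> <name>");
            System.exit(1);
        }
        // Assumed naming: write the listing to "<name>.list".
        Path listFile = Paths.get(args[1] + ".list");
        List<String> sources = listSources(Paths.get(args[0]));
        Files.write(listFile, sources); // one path per line
        System.out.println(sources.size() + " source files listed in " + listFile);
    }
}
```

A compiler that accepts an argument file can then read the listing,
for example "javac @sesame.list" with the Sun compiler, which avoids
the shell's argument limit. Whether the project scripts consume the
listing this way is an assumption.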
Finalised the JTidy plugin configuration and refactored the JTidy
elements in a new TidyDriver class for general usage.
Friday
~~~~~~
Bruno confirmed that the wildcard import statements used in Sesame
are a general problem for GCJ. Noted and set aside for now.
Further refactored the JTidy plugin to create some abstract
superclasses in the course of implementing a new MetaParserPlugin
class. Added GNU JAXP to the compilation scripts and created new
XMLFilter and ContentHandler implementations for extracting meta
elements from tidied XHTML documents. Added a tidyInputStream method
to the TidyDriver class and updated unit tests for coverage. Prepared
a first-draft "parse" plugin configuration and ran it successfully on
the test site.
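To give a feel for the meta extraction, here is a minimal sketch of a
SAX ContentHandler that turns meta elements into simple triples. It
is not the committed MetaParserPlugin code: the class names, the
(document URI, meta name, meta content) triple layout and the use of
the standard JAXP factory are assumptions, and it omits the XMLFilter
stage entirely. It also uses Java 7 syntax for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the MKSearch implementation.
public class MetaTripleSketch {

    /** Collects one {subject, predicate, object} array per named meta element. */
    static class MetaHandler extends DefaultHandler {
        private final String subject;
        final List<String[]> triples = new ArrayList<>();

        MetaHandler(String subject) {
            this.subject = subject;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (!"meta".equals(localName)) {
                return;
            }
            // Unprefixed attributes are in no namespace, hence the "" URI.
            String name = atts.getValue("", "name");
            String content = atts.getValue("", "content");
            if (name != null && content != null) {
                triples.add(new String[] {subject, name, content});
            }
        }
    }

    /** Parses well-formed XHTML text and returns the extracted triples. */
    static List<String[]> extract(String docUri, String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();
        MetaHandler handler = new MetaHandler(docUri);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.triples;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
            + "<meta name=\"generator\" content=\"HTML Tidy\"/>"
            + "<title>Test</title></head><body/></html>";
        for (String[] t : extract("http://test.mksearch.mkdoc.org/", xhtml)) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

Meta elements without a name attribute (http-equiv, for instance) are
skipped here; that is a choice made for the sketch, not a statement
about what the plugin does.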
Added Ant build targets to build all library dependencies from source
and create a JAR distribution of the MKSearch project. There is
currently a bytecode verifier error when the Sun interpreter runs
code compiled with GCJ 3.4.1 on Windows.