[MKSearch-dev] Week 9 round up

Phil Shaw phil at mkdoc.com
Fri Dec 10 18:52:23 GMT 2004


I have something new to try out this week, a first draft 
MetaParserPlugin for JSpider that extracts basic triple statements 
from HTML meta elements. This is not thoroughly tested at this stage, 
but I have tried it with the test site and it runs as expected.

  $mk_home/bin/java-jspider.sh http://test.mksearch.mkdoc.org parse

It works like the JTidy plugin, but writes text files for each HTML 
document in the output directory. The current test files only have 
the "generator" metadata JTidy puts in, but it should get more 
substantial content where present on other sites. Please give it a 
try.

Note that I have changed the environment variable name for the Sun
SDK installation directory to $JAVA_14.

Best regards,

Phil


Wednesday
~~~~~~~~~
Renamed the Java execution scripts to distinguish GIJ and Sun Java 
interpreter targets. Committed a path independent execution script 
for JTidy under Windows. Made an initial commit for the Sesame Open 
RDF API, then cut out unnecessary dependencies and applied some 
workarounds for GCJ bugs. Wrote first draft compilation and JAR 
scripts for Sesame.


Thursday
~~~~~~~~
Checked in the current Sesame CVS source -- apparently the set
imported on Wednesday was an abandoned archive! Made various
amendments to exclude non-critical dependencies for a quick start;
see below.

http://www.mksearch.mkdoc.org/documentation/sesame/

The "find" pipe trick used for JSpider compilation reached the limit 
for input arguments with the Sesame source -- more than 400 classes. 
Created a general purpose source listing script, like the former Java 
version, e.g.

  $mk_home/bin/util/source-list.sh $mk_home/lib-src/sesame sesame
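
As a rough illustration of the idea only (not the actual
source-list.sh script, whose arguments and output format are not
shown here), a source lister can be sketched in Java along these
lines. The class and method names are hypothetical, and the sketch
assumes a current JDK rather than the 1.4 SDK. It walks a source tree
and writes one .java path per line to a list file, so that the
compiler can be handed a single file name instead of hundreds of
arguments.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists Java source files under a directory. */
class SourceLister {

    /** Returns the sorted paths of all regular *.java files below root. */
    static List<String> listSources(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .map(Path::toString)
                        .filter(p -> p.endsWith(".java"))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the listing to listFile, one path per line. */
    static void writeList(Path root, Path listFile) {
        try {
            Files.write(listFile, listSources(root));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A list file written this way could then be passed to a compiler that
accepts an argument file, for example javac's @file syntax, which
avoids the shell's limit on command line length.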

Finalised the JTidy plugin configuration and refactored the JTidy 
elements in a new TidyDriver class for general usage.

Friday
~~~~~~
Bruno confirmed that the wildcard import statements used in Sesame
cause a general problem for GCJ. Noted and set aside for now.

Further refactored the JTidy plugin to create some abstract 
superclasses in the course of implementing a new MetaParserPlugin 
class. Added GNU JAXP to the compilation scripts and created new 
XMLFilter and ContentHandler implementations for extracting meta 
elements from tidied XHTML documents. Added a tidyInputStream method 
to the TidyDriver class and updated unit tests for coverage. Prepared 
a first draft "parse" plugin configuration and successfully ran on 
the test site.
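
To give a feel for the meta element extraction described above, here
is a minimal sketch of a SAX handler that turns meta elements into
simple subject/predicate/object statements. It is not the plugin
code: the MetaTripleHandler name, the choice of the document URI as
subject and the use of the JDK's bundled JAXP parser (rather than GNU
JAXP on the 1.4 SDK) are all assumptions made for illustration. The
input is assumed to be well-formed XHTML, as produced by the tidy
step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects statements from XHTML meta elements. */
class MetaTripleHandler extends DefaultHandler {
    private final String subject;
    private final List<String[]> triples = new ArrayList<>();

    MetaTripleHandler(String documentUri) {
        this.subject = documentUri;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // The default factory is not namespace aware, so the element
        // name arrives in qName.
        if (!"meta".equals(qName)) {
            return;
        }
        String name = atts.getValue("name");
        String content = atts.getValue("content");
        // Skip meta elements without both attributes (e.g. http-equiv).
        if (name != null && content != null) {
            triples.add(new String[] { subject, name, content });
        }
    }

    List<String[]> getTriples() {
        return triples;
    }

    /** Parses well-formed XHTML text and returns the statements found. */
    static List<String[]> extract(String documentUri, String xhtml) {
        MetaTripleHandler handler = new MetaTripleHandler(documentUri);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse " + documentUri, e);
        }
        return handler.getTriples();
    }
}
```

For a document carrying only the "generator" metadata, this yields a
single statement with the document URI as subject, "generator" as
predicate and the attribute's content value as object.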

Added Ant build targets to build all library dependencies from source
and create a JAR distribution of the MKSearch project. There is
currently a bytecode verifier error when the Sun interpreter runs
code compiled with GCJ 3.4.1 on Windows.

