[Petal] HTML::TreeBuilder utf8 troubles

William McKee william at knowmad.com
Thu Jan 20 22:33:09 GMT 2005


Hi again Sean,

A few more details based on some discussion on the Petal mailing list.
The problem seems to lie with reading in data using PerlIO in general,
not with utf8 specifically. You can test this by changing the
:encoding('utf8') value in the open call of the script I sent to latin1:
the Acirc still appears in the output. Somehow, encoding the data to
another format first seems to clean it up enough for HTML::TreeBuilder
to be happy with it.

Our problem now is figuring out which format to encode the data to
before passing it to HTML::TreeBuilder (although I don't use it myself,
Petal strives for strong i18n support). Does TreeBuilder understand
non-Latin1 characters?
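
To make Bruno's objection concrete, here is a minimal sketch of what
happens to characters that don't fit into latin1, plus one possible
workaround. The helper names are made up for illustration, and the
workaround assumes that whatever consumes the bytes accepts HTML numeric
character references; I haven't verified how HTML::TreeBuilder treats
them.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: plain encode() to latin1. Characters that have no
# Latin-1 equivalent (such as the Euro sign) are replaced with "?".
sub to_latin1_lossy {
    my ($text) = @_;
    return encode('latin1', $text);
}

# Hypothetical helper: same target encoding, but unmappable characters
# become HTML numeric character references ("&#NNNN;") instead of "?".
# Assumption: the consumer of these bytes understands such references.
sub to_latin1_with_entities {
    my ($text) = @_;
    return encode('latin1', $text, Encode::FB_HTMLCREF);
}

my $sample = "Euro: \x{20ac} Copyright: \x{00a9}";
print to_latin1_lossy($sample), "\n";
print to_latin1_with_entities($sample), "\n";
```

The copyright sign survives either way because it exists in latin1; only
the Euro sign is affected.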


Thanks,
William

On Thu, Jan 20, 2005 at 04:53:37PM -0500, William McKee wrote:
> On Thu, Jan 20, 2005 at 09:19:00PM +0000, Bruno Postle wrote:
> > 1. Your patch seems to assume that all the data will fit into
> >   latin1:
> > 
> >        encode ('latin1', $$data_ref)
> > 
> >   Now this isn't necessarily the case.  For instance, this
> >   produces the sort of garbage you are trying to prevent in the
> >   first place:
> > 
> >          encode ("latin1", "Euro: \x{20ac} Copyright: \x{00a9}");
> > 
> >   Unless HTML::TreeBuilder and/or HTML::Parser are unsafe with 
> >   anything other than latin1 - In which case it doesn't matter and 
> >   Petal::Parser::HTB should try and squeeze everything into latin1.
> 
> I dunno. I just know that I got rid of my problems by recoding the data
> into latin1 before passing it off to HTML::TreeBuilder. Hopefully I'll
> hear back from Sean Burke with some ideas/suggestions/fixes. In the
> meantime, I wonder if we could figure out what the local encoding is for
> the system in question and use it. Or perhaps we should be using the
> value of DECODE_CHARSET. If neither of these is available, we could
> fall back to latin1.
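
The fallback order suggested above could be sketched roughly as follows.
This is only an illustration: pick_charset() is a hypothetical helper
that is not part of Petal, its first argument merely stands in for a
setting such as DECODE_CHARSET, and asking the locale for its codeset
via I18N::Langinfo is an assumption about how the "local encoding" might
be found (it may not be available on every platform).

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: return the canonical name of the first usable
# charset, trying the configured value, then the locale's codeset, and
# finally latin1.
sub pick_charset {
    my ($configured, $locale_codeset) = @_;
    for my $candidate ($configured, $locale_codeset) {
        next unless defined $candidate && length $candidate;
        # find_encoding() returns undef for names Encode does not know.
        my $enc = find_encoding($candidate);
        return $enc->name if $enc;
    }
    return find_encoding('latin1')->name;    # last resort
}

# Adopt the user's locale so langinfo() reports its codeset.
setlocale(LC_CTYPE, '');
print pick_charset(undef, langinfo(CODESET)), "\n";
```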


-- 
Knowmad Services Inc.
http://www.knowmad.com

