[Petal] HTML::TreeBuilder utf8 troubles

Bruno Postle bruno at mkdoc.com
Thu Jan 20 21:19:00 GMT 2005


On Thu 20-Jan-2005 at 15:40 -0500, William McKee wrote:
>
> After today's research on character encodings (which I'm finally 
> feeling like I'm getting a grasp of), I've attached a patch for
> Petal::Parser::HTB which should fix its desire to output the 
> extra Acirc character.

> Bruno, any chance you'd be willing to post this update to CPAN?

OK, though I have some questions:

1. Your patch seems to assume that all the data will fit into 
   latin1:

      encode ('latin1', $$data_ref)

   Now this isn't necessarily the case.  For instance, the euro sign 
   has no latin1 representation, so this produces the sort of garbage 
   you are trying to prevent in the first place:
   
      encode ("latin1", "Euro: \x{20ac} Copyright: \x{00a9}");
   
   That is, unless HTML::TreeBuilder and/or HTML::Parser are unsafe 
   with anything other than latin1, in which case it doesn't matter 
   and Petal::Parser::HTB should try to squeeze everything into latin1.

2. Petal::Parser::HTB has (the ancient) Petal-1.10 as a 
   prerequisite.  Does it really only work with this exact version of 
   Petal?

   (You can tell that I don't use this backend)

3. Petal::Parser::HTB has HTML::TreeBuilder-3.12 as a prerequisite.  
   Does it really only work with this exact version?
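
To make point 1 concrete, here is a minimal sketch of what Encode does 
with that string.  It is not part of the patch; the variable names are 
mine and only the standard Encode module is assumed.  It compares the 
default latin1 behaviour, a strict latin1 encode that dies on unmappable 
characters, and a UTF-8 encode:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same string as in point 1: U+20AC (euro) is not in latin1,
# U+00A9 (copyright) is.
my $text = "Euro: \x{20ac} Copyright: \x{00a9}";

# Default CHECK: characters that cannot be mapped are silently
# replaced with a substitution character.
my $latin1 = encode('latin1', $text);

# Strict CHECK: die instead of substituting.  A copy is passed
# because a non-zero CHECK may modify its argument.
my $strict = eval {
    encode('latin1', my $copy = $text, Encode::FB_CROAK());
};
my $failed = defined $strict ? 0 : 1;

# UTF-8 can represent both characters.
my $utf8 = encode('UTF-8', $text);

# Print the resulting bytes as hex so the difference is visible.
for my $pair (['latin1', $latin1], ['utf8', $utf8]) {
    my ($name, $bytes) = @$pair;
    printf "%-6s %s\n", $name,
        join ' ', map { sprintf '%02x', ord } split //, $bytes;
}
print "strict latin1 encode ", ($failed ? "died" : "succeeded"), "\n";
```

With the default CHECK the euro comes out as a plain "?", so the data 
is lost without any error.  Whether that matters depends on the answer 
to the HTML::TreeBuilder question above.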

-- 
Bruno

