[Petal] HTML::TreeBuilder utf8 troubles

William McKee william at knowmad.com
Thu Jan 20 20:25:24 GMT 2005


Hi Sean,

I've been having some trouble with the TreeBuilder module when the
string that contains the data to be processed is utf8-encoded and
contains entities that have already been decoded into their Unicode
counterparts. Under these conditions, an extra character, Acirc
(\x{c2}), gets inserted before the character in question.

More curiously, this behavior seems to occur only when loading a file
from disk using PerlIO with utf8 encoding, and only for some entities
(e.g. copyright and nbsp).
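
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the copyright sign (U+00A9) and the non-breaking space (U+00A0) each take two bytes in UTF-8, and the first byte is 0xC2, which is Acirc in Latin-1. The ampersand is plain ASCII and stays one byte, which would fit the "only some entities" observation. The sketch below uses only the core Encode module; the helper names hex_chars and misread_as_latin1 are made up for illustration, and nothing here claims to reproduce what TreeBuilder does internally.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Render each character of a string as a hex code point, e.g. "C2 A9".
sub hex_chars {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $str;
}

# Encode a character string as UTF-8 bytes, then misread those bytes as
# Latin-1. This is the ASSUMED failure mode, not a confirmed one.
sub misread_as_latin1 {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return decode('ISO-8859-1', $bytes);
}

my @cases = (
    [ 'ampersand', '&'      ],
    [ 'copyright', "\x{a9}" ],
    [ 'nbsp',      "\x{a0}" ],
);

for my $case (@cases) {
    my ($name, $char) = @$case;
    printf "%-10s %s -> %s\n",
        $name, hex_chars($char), hex_chars(misread_as_latin1($char));
}
```

Under this assumption the ampersand comes through as 26, while the copyright sign becomes C2 A9 and the non-breaking space becomes C2 A0, i.e. an Acirc followed by the intended character.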

I have attached a sample script that demonstrates this behavior. Test 1
generates the Acirc, whereas Test 2 does not. However, you'll notice
that if the imported data is encoded back to latin1 (set
$recode_to_latin1=1), then everything is fine again.
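
Since the attached script is not reproduced in the archive, the following is only a sketch of what the latin1 workaround might look like. It uses core modules only (Encode, File::Temp). The helper slurp_html and the temporary test file are hypothetical; only the $recode_to_latin1 name comes from the message above. It does not call TreeBuilder; it only shows the difference between the two strings that would be handed to the parser.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-8 file through a PerlIO encoding layer
# and, optionally, turn the decoded characters back into Latin-1 bytes.
sub slurp_html {
    my ($path, $recode_to_latin1) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my $html = do { local $/; <$fh> };
    close $fh;
    $html = encode('ISO-8859-1', $html) if $recode_to_latin1;
    return $html;
}

# Write a small UTF-8 test file containing a literal copyright sign.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "<p>Copyright: \x{a9}</p>\n";
close $out;

my $chars = slurp_html($path, 0);   # character string as read via PerlIO
my $bytes = slurp_html($path, 1);   # same text recoded to Latin-1 bytes

printf "as read:  utf8 flag %s\n", utf8::is_utf8($chars) ? 'on' : 'off';
printf "recoded:  utf8 flag %s\n", utf8::is_utf8($bytes) ? 'on' : 'off';
```

Both strings hold the same text; what differs is Perl's internal UTF-8 flag, which is on for the string read through the encoding layer and off after recoding. Note that recoding to latin1 only works for text whose characters all exist in Latin-1.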

I'm not sure this is a bug so much as something worth noting in the POD.
Hope it saves someone else the trouble (though the time spent learning
utf8 was worth the effort).


Thanks,
William

-- 
Knowmad Services Inc.
http://www.knowmad.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: html_treebuilder.pl
Type: text/x-perl
Size: 2760 bytes
Desc: not available
Url : http://lists.webarch.co.uk/pipermail/petal/attachments/20050120/f83381a1/html_treebuilder.bin
-------------- next part --------------
<html>
<head><title>Test</title></head>
<body>
<p>Ampersand: &amp;</p>
<p>Copyright: &copy;</p>
<p>Non-break space: Sticky&nbsp;Space</p>
</body>
</html>

