[Petal] Petal and &nbsp;

William McKee william at knowmad.com
Fri Mar 18 14:12:41 GMT 2005


On Tue, Mar 15, 2005 at 12:58:29PM -0600, wsmith wrote:
> This has probably been beaten to death, and if someone could point to
> which archives to look through, I would appreciate it. I was just
> wondering if anyone has any input in the fact that petal changes &nbsp;
> to Â , but does *not* modify   (the unicode equivalent). Is
> this a bug or how petal is supposed to work?

Hi Warren,

In my case Bruno's advice was not a solution. I was always using XHTML
as the INPUT. I've attached my running commentary about this issue, which
has been plaguing me ever since the upgrade to 2.0, where Jean-Michel
began using Unicode when importing the text if your version of Perl was
5.7+. The comments ramble a bit, and these days I prefer tweaking the
header over forcing the output via Encode::encode, but you'll find some
helpful info (and hopefully not too much misinformation).

The crux of my problem was a lack of understanding about Unicode.  Take
the time to learn how that works and this problem won't seem so
mysterious. You'll need to understand it one day, anyhow.


HTH,
William



Petal and Â characters

<p>Petal is printing Â characters into my HTML output again. This character is
a capital A circumflex (Acirc): Hex C2 (\x{c2}), Dec 194, Oct 302. It appears
because newer versions of Perl work with text internally in UTF-8, where the
non-breaking space (U+00A0) is encoded as a pair of bytes, \xC2\xA0, not a
single byte. Read as UTF-8, that pair is a single non-breaking space; read as
ISO-8859-1, the \xC2 byte shows up as Â in front of the space. See <a
href="http://www.dpawson.co.uk/xsl/sect2/N7150.html#d8562e1215">http://www.dpawson.co.uk/xsl/sect2/N7150.html#d8562e1215</a>
for further details.</p>
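<p>The byte-pair behaviour described above can be reproduced with a minimal
sketch that uses only the core Encode module. The variable names are
illustrative and are not taken from Petal:</p>

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 is the non-breaking space character.
my $nbsp = "\x{A0}";

# Encoding it as UTF-8 yields two bytes: 0xC2 0xA0.
my $utf8_bytes = encode('UTF-8', $nbsp);
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;

# A reader that assumes ISO-8859-1 treats each byte as one character,
# so 0xC2 becomes U+00C2 (A circumflex) and 0xA0 stays U+00A0.
my $misread = decode('ISO-8859-1', $utf8_bytes);
printf "Misread code points: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
```

<p>The second line of output lists two code points where only one character
was intended, which is the stray Â seen in the browser.</p>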

<p> Interestingly, I only see this behavior on the production server, not my test
server. There was quite a discussion about this problem on the mailing list
several weeks ago.  Basically it has to do with how Petal inserts a blank space
for &amp;nbsp; entities.</p>

<p>Petal v2.00+ uses the UTF-8 charset internally (note that the ENCODE_CHARSET
setting is deprecated; the Encode module can be used to encode the results for
a specific charset, as shown below). The module Petal::Entities turns a
non-breaking space entity into character \240 (U+00A0). This works in UTF-8,
where the character is output as the two bytes \xC2\xA0. However, a browser
that assumes ISO-8859-1 (latin1) interprets the \xC2 byte as an Â character.
The quick fix, which works for most modern browsers, is to tell the browser to
display UTF-8 using the following meta tag:</p>

<blockquote>
	&lt;meta http-equiv="Content-Type" content="text/html; charset=utf-8" /&gt;
</blockquote>

<p>For browsers that support XML, it's probably also a good idea to include the
following line at the top of your template:</p>

<blockquote>
  &lt;?xml version="1.0" encoding="utf-8"?&gt;
</blockquote>

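<p>To check the encoding of a file in Vim, use <code>:set fileencoding?</code>.
You can change it with <code>:set fileencoding=utf-8</code> before writing the
file. Note that <code>:set encoding?</code> shows the encoding Vim uses
internally, and <code>:set fileencodings?</code> shows the list of encodings
Vim tries when opening a file.</p>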

<p>A useful Unicode table can be found at <a
href="http://www.columbia.edu/kermit/utf8-t1.html">http://www.columbia.edu/kermit/utf8-t1.html</a>.</p>

<p>After much more reading and many messages back and forth on the <a
href="http://lists.webarch.co.uk/pipermail/petal/">Petal mailing list</a>, I
have finally resolved the issues. There are three primary problems I have
encountered:</p>

<ol>
  <li>CGI.pm was forcing the charset to ISO-8859-1</li>
  <li>Perl was outputting UTF-8 on the production server and ISO-8859-1 on the
  test server.</li>
  <li>Using Petal::Parser::HTB with a version of HTML::Parser &lt; 3.45
  produces these extra characters. Simply upgrading to a more recent release
  will remove them.</li>
</ol>

<p>The first problem turns out to be a documented feature of CGI.pm. If the
charset is not defined, it defaults to 'iso-8859-1'. The solution was to set
the charset header via a header_add call in my cgiapp_postrun handler as
follows (note that this is only necessary with Petal v2 or greater since
earlier versions used the ISO-8859-1 charset):</p>

<pre>
  # Petal v2 and later emit UTF-8, so 2.00 itself must match too (>=, not >)
  $self->header_add(-charset=>'utf-8') if $Petal::VERSION >= 2;
</pre>
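<p>The CGI.pm default can be seen outside of CGI::Application with a short
sketch. It assumes the CGI module is installed (it is no longer bundled with
recent Perl releases) and builds an empty query object so that nothing is read
from the environment:</p>

```perl
use strict;
use warnings;
use CGI ();

# An empty query object, so no parameters are read from the environment or STDIN.
my $q = CGI->new('');

# With no charset given, CGI.pm falls back to its documented default.
my $default_header = $q->header(-type => 'text/html');

# Passing -charset overrides that default for the response.
my $utf8_header = $q->header(-type => 'text/html', -charset => 'utf-8');

print $default_header, $utf8_header;
```

<p>The first header should carry the ISO-8859-1 charset and the second utf-8,
which is the same change the header_add call makes for the application.</p>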


<p>The solution to the second problem was to force the encoding that Petal uses
by overriding the process() function in my WebBase.pm module with the following
code:</p>

<pre>
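use Encode ();

sub process {
  my $self = shift;
  my $template = shift;

  # Accept either a template filename or an already-loaded Petal object
  unless (ref $template) {
    $template = $self->load_tmpl($template);
  }

  # Petal returns a character string; encode it to UTF-8 bytes for output
  my $string = $template->process(@_);
  return Encode::encode('utf8', $string);
}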
</pre>
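<p>What the final Encode::encode call does to the string can be sketched in
isolation. Here <code>$page</code> is a hypothetical stand-in for the text
Petal returns, and the strict 'UTF-8' encoding name is used instead of the
laxer 'utf8':</p>

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for template output: text containing a non-breaking space.
my $page = "a\x{A0}b";

# Turn the character string into the UTF-8 byte string that is sent to the client.
my $bytes = Encode::encode('UTF-8', $page);

# The character count and the byte count differ by one, because U+00A0
# occupies two bytes in UTF-8.
printf "characters: %d, bytes: %d\n", length($page), length($bytes);
```

<p>Encoding explicitly at this one point means the output no longer depends on
how each server's Perl happens to write strings.</p>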

<p>A nice side effect of all this effort has been that this version of process()
will take either a filename or a Petal object which reduces my code since I
usually call load_tmpl and process in succession.</p>

<p>The fix for the third problem is to either upgrade HTML::Parser or use tidy
to clean up the templates so that you don't need to use the Petal::Parser::HTB
module. If you use a persistent environment like mod_perl, be careful about
other scripts/modules that may be loading this package without your knowledge.
Use Apache::Status and check <a
href="http://perl.apache.org/docs/1.0/guide/install.html#Testing_by_viewing__perl_status">perl-status</a>
to see the list of loaded modules.</p>

<p>OLD INFORMATION (prior to 2005-Jan-06 release of HTML::Parser): I've not
been able to find a way to get Petal::Parser::HTB to properly display encoded
entities (e.g., &amp;nbsp;) without the Acirc. In some cases, I've seen the
output include Ã‚Â. Encoding the output as latin1 (e.g., return
Encode::encode('latin1', $string)) seems to correctly remove the extra two
characters added in front of the entity but does not resolve the issue of the
Acirc.</p>
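<p>One plausible way a sequence like that can arise is double encoding: bytes
that are already UTF-8 get treated as latin1 characters and encoded a second
time. The sketch below is an assumption about the mechanism, not a trace of
what Petal::Parser::HTB actually did:</p>

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $nbsp = "\x{A0}";

# First, correct encoding: bytes C2 A0.
my $once = encode('UTF-8', $nbsp);

# The mistake: the bytes are read as latin1 characters and encoded again,
# giving bytes C3 82 C2 A0.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# A viewer that assumes Windows-1252 then shows four characters.
my $seen = decode('cp1252', $twice);
printf "%s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $seen;
```

<p>The four code points printed are A tilde, a low single quotation mark,
A circumflex, and the non-breaking space itself.</p>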

<p>See also:</p>
<ul>
<li>perldoc perlunicode</li>
</ul>


-- 
Knowmad Services Inc.
http://www.knowmad.com

