One of the nice features of WordPress is that it already has a lot of functionality built in. The whole thing is set up so that normal people can just install it and start writing posts immediately, with WordPress taking care of details like converting HTML entities and adding newlines where appropriate.
Of course, for those who aren't normal and would like to write in raw HTML, these things are somewhat annoying. Fortunately, though, WordPress allows you to disable these kinds of filters. The catch is that you need to find out which filters to disable, namely wptexturize (which converts plain quotes, dashes and the like into typographic HTML entities) and wpautop (which does newline control). WordPress also makes it easy to add additional filters, like the CodeSnippet plugin that I use for code highlighting.
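As a rough sketch of what disabling those two filters can look like: remove_filter() is the standard WordPress call, and the_content is the hook both filters are attached to by default. Where the snippet lives (a theme's functions.php or a small plugin) is my assumption here, and other hooks such as excerpts or comment text have their own registrations that this does not touch.

```php
<?php
// Sketch only: assumes this file is loaded by WordPress (for example from a
// theme's functions.php), so that remove_filter() is available.
if (function_exists('remove_filter')) {
    // Both filters are hooked on 'the_content' at the default priority (10),
    // which is also the priority remove_filter() assumes when none is given.
    remove_filter('the_content', 'wptexturize'); // typographic quotes, dashes
    remove_filter('the_content', 'wpautop');     // automatic <p> and <br />
}
```

The function_exists() guard only keeps the file harmless when it is run outside WordPress; inside WordPress it is always true.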
However, with the number of filters available, sometimes things will clash. A good example of this is comments that have source code in them. Part of what CodeSnippet does is convert certain characters (specifically: ‘<’, ‘>’, ‘&’) to their HTML entities (&lt;, &gt;, &amp;) so that they're no longer treated as special HTML characters. However, several other filters have a similar task, so that when you write this:
Oh hai! This is a useful bitfield function.
[code lang="cpp"]
template<class T>
inline void bfInsert(T &y, u32 x, int start, int len)
{
u32 mask= ((1<<len)-1) << start;
y &= ~mask;
y |= (x<<start) & mask;
}
[/code]
what it becomes is:
Oh hai! This is a useful bit function.
template
inline void bfInsert(T &y, u32 x, int start, int len)
{
u32 mask= (1<<len) << start;
y &= ~mask;
y |= (x<<start) & mask;
}
Not exactly pretty. Note that the <class T> after template is simply removed because it's seen as an illicit HTML tag, and all the special characters are doubly converted (an ampersand, for instance, ends up as &amp;amp; rather than &amp;). This is still a mild example; I think if you place the angle brackets wrong, whole swaths of code can simply be eaten by the sanitizer.
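To make the double conversion concrete, here is a minimal sketch in plain PHP. It uses htmlspecialchars() as a stand-in for the entity-encoding steps; that is an assumption for illustration, not the exact function CodeSnippet or the sanitizer calls.

```php
<?php
// Plain PHP, no WordPress needed. htmlspecialchars() stands in for any
// filter that turns '<', '>' and '&' into entities.
$source = 'y |= (x<<start) & mask;';

$once  = htmlspecialchars($source); // what a single filter would produce
$twice = htmlspecialchars($once);   // what happens when a second filter repeats the job

echo $once, "\n";  // y |= (x&lt;&lt;start) &amp; mask;
echo $twice, "\n"; // y |= (x&amp;lt;&amp;lt;start) &amp;amp; mask;
```

A browser renders the first result as the original code, but the second one as literal &lt; and &amp; text, which is the garbage that ends up in the comment.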
Unfortunately, finding out where the problem lies is tricky. Not only are there dozens of functions that could be doing the conversion, they can be called from anywhere, so you have no idea where to start, and PHP isn't exactly rich in the debugger department. Worse still, in this particular case the place where the bad happens is actually before the comment is even saved to the database (but only for unregistered people; for me the code comments would work fine), and because comments are handled on a page that you never actually see, random echo/print statements are useless as well.
But I think I finally got it: it was wp_kses() using (in a roundabout way) wp_specialchars() in the wp-includes/kses.php file. The contractor is actually wp_filter_comment() from wp-includes/comment.php, using the pre_comment_content filter as a middleman.
The trick now is to keep it from happening. What I've done is define not one but two pre_comment_content filters: one that pre-mangles the brackets and ampersand before wp_kses(), and one that de-mangles them afterwards. Of course, this only matters between [code] tags. Exactly how to do this will depend on the plugin you're using, but in the case of CodeSnippet it goes like this:
// Ensure in-[code] entities ('<>&') work out right in the end.
add_filter('pre_comment_content', array(&$CodeSnippet, 'filterDeEntity'), 1);
add_filter('pre_comment_content', array(&$CodeSnippet, 'filterReEntity'), 50);
...
// Add these methods to the CodeSnippet class.
/**
 * Pre-encode HTML entities. Should come \e before wp_kses.
 */
function filterDeEntity($content)
{
    $content= preg_replace(
        '#(\[code.*?\])(.*?)(\[/code\])#msie',
        '"\\1" . str_replace(
            array("<", ">", "&"),
            array("[|LT|]", "[|GT|]", "[|AMP|]"),
            \'\\2\') . "\\3";',
        $content);
    // The 'e' modifier backslash-escapes quotes in the matches; undo that.
    $content= str_replace('\"', '"', $content);
    return $content;
}
/**
 * Decode HTML entities. Should come \e after wp_kses.
 */
function filterReEntity($content)
{
    if(strstr($content, "[|"))
    {
        $content= preg_replace(
            '#(\[code.*?\])(.*?)(\[/code\])#msie',
            '"\\1" . str_replace(
                array("[|LT|]", "[|GT|]", "[|AMP|]"),
                array("<", ">", "&"),
                \'\\2\') . "\\3";',
            $content);
        // The 'e' modifier backslash-escapes quotes in the matches; undo that.
        $content= str_replace('\"', '"', $content);
    }
    return $content;
}
Notice that both methods are under the same filter group. The trick is that they have different priorities, which makes one act before wp_kses() and one after (the kses filter itself sits at the default priority of 10, so 1 and 50 bracket it). Also note how the regexps work in the replacement part of preg_replace(): the e modifier makes PHP evaluate the replacement string as code. This particular feature of preg_replace() allows for shorter code, but is very fragile, and it was deprecated in PHP 5.5 and removed in PHP 7.0, so on a current PHP you will need preg_replace_callback() instead. In any case, written like this it seems to work:
Oh hai! This is a useful bit function.
template<class T>
inline void bfInsert(T &y, u32 x, int start, int len)
{
u32 mask= ((1<<len)-1)<<start;
y &= ~mask;
y |= (x<<start) & mask;
}
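For reference, here is a minimal sketch of the same two steps written with preg_replace_callback(). The function names and the mapCodeBlocks() helper are hypothetical and written as free functions rather than CodeSnippet methods; only the [|LT|]-style placeholders are taken from the code above. It assumes PHP 5.3 or later for the closures.

```php
<?php
// Sketch under the assumptions named above; not CodeSnippet's actual code.

/** Apply $fn to the text between [code ...] and [/code], leaving the tags alone. */
function mapCodeBlocks($content, $fn)
{
    return preg_replace_callback(
        '#(\[code.*?\])(.*?)(\[/code\])#si',
        function ($m) use ($fn) {
            // $m[1] = opening tag, $m[2] = code, $m[3] = closing tag
            return $m[1] . $fn($m[2]) . $m[3];
        },
        $content
    );
}

/** Hide '<', '>' and '&' behind placeholders; meant to run before wp_kses(). */
function deEntity($content)
{
    return mapCodeBlocks($content, function ($code) {
        return str_replace(
            array('<', '>', '&'),
            array('[|LT|]', '[|GT|]', '[|AMP|]'),
            $code
        );
    });
}

/** Put the characters back; meant to run after wp_kses(). */
function reEntity($content)
{
    if (strpos($content, '[|') === false) {
        return $content; // nothing was mangled, nothing to undo
    }
    return mapCodeBlocks($content, function ($code) {
        return str_replace(
            array('[|LT|]', '[|GT|]', '[|AMP|]'),
            array('<', '>', '&'),
            $code
        );
    });
}

// Round trip on a sample comment: HTML outside [code] is left untouched.
$comment = 'Hi <b>there</b> [code lang="cpp"]y |= (x<<start) & mask;[/code]';
$hidden  = deEntity($comment);
$back    = reEntity($hidden);

echo $hidden, "\n";
echo $back === $comment ? "round trip ok\n" : "round trip FAILED\n";
```

Because the callback receives the raw matches, there are no escaped quotes to clean up afterwards. Hooking these in would presumably use the same two add_filter() calls as before, with priorities 1 and 50.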
Comment preview
The code-comment mangling is just one of the issues one can encounter in blog comments. It's usually impossible to see beforehand what will be accepted and what won't. Is HTML allowed? Are all tags allowed, or just some, or none at all? What about whitespace? Or BB-like tags? Basically, you'll never know what a comment will look like until you've submitted it, and by then it's too late to change it.
You know what'd be really helpful? A comment preview!
You'd think this'd be a fairly obvious feature for a blogging system to have, but apparently not. I was thinking of making my own preview functionality, but when I attempted to do so, several items within WP thwarted my efforts. Fortunately, it seems plugins of this sort exist already. The plugin I'm now using is ajax-comment-preview, which works pretty darn well.
So anyway, comments should be able to handle code properly now and there's a comment-preview to show you what the comment will look like in the end. And there was much rejoicing.