Filter juggling and comment preview

One of the nice features of WordPress is that it already has a lot of functionality built-in. The whole thing is set up so that normal people can just install and start writing posts immediately, with WordPress taking care of all the details like converting HTML entities and adding newline where appropriate.

Of course, for those that aren't normal and that would like to write in raw HTML, these things are somewhat annoying. Fortunately, though, WordPress allows you to disable these kinds of filters. The catch is that you need to find out which filters to disable, namely, wptexturize (which converts HTML entities) and wpautop (which does newline control). WordPress also makes it easy add additional filters, like the CodeSnippet plugin that I use for code highlighting.

However, with the amount of filters available, sometimes things will clash. A good example of this is comments that have source code in them. Part of what CodeSnippet does is convert certain characters (specifically: ‘<’, ‘>’, ‘&’) to printable characters (&lt;, &gt;, &amp;) and aren't considered special HTML characters anymore. However, there are several other filters that have a similar task, so that when you write this:

Oh hai! This is a useful bitfield function.
 
[code lang="cpp"]
template<class T>
inline void bfInsert(T &y, u32 x, int start, int len)
{
   u32 mask= ((1<<len)-1) << start;
   y &= ~mask;
   y |= (x<<start) & mask;
}
[/code]

what it becomes is:

Oh hai! This is a useful bit function.
template
inline void bfInsert(T &amp;y, u32 x, int start, int len)
{
    u32 mask= (1&lt;&lt;len) &lt;&lt; start;
    y &amp;= ~mask;
    y |= (x&lt;&lt;start) &amp; mask;
}

Not exactly pretty. Note that the template class is simply removed because it's seen as an illicit HTML tag, and all the special characters are doubly converted. This is still a mild example; I think if you place the angle brackets wrong, whole swaths of code can simply be eaten by the sanitizer.

Unfortunately, finding out where the problem lies is tricky. Not only are there dozens of potential functions doing the conversion, they can be called from anywhere and PHP isn't exactly rich in the debugger department. You also have no idea where to start, because the filters can be called from everywhere. Worse still, in this particular case the place where the bad happens is actually before the comment is even saved to the database (but only for unregistered people; for me the code comments would work fine), and because comments are handled on a page that you don't actually ever see, random echo/print statements are useless as well.

But I think I finally got it: it was wp_kses() using (in a roundabout way) wp_specialchars() in the wp-includes/kses.php roomfile. The contractor is actually wp_filter_comment() from wp-includes/comment.php, using the pre_comment_content filter as a middleman.

The trick now is to keep it from happening. What I've done is define not one but two pre_comment_content filters: one that pre-mangles the brackets and ampersand before wp_kses, and one that de-mangles them afterwards. Of course, this will only be of importance between [code] tags. Exactly how to do this will depend on the plugin you're using, but in the case of CodeSnippet it goes like this:

//# Put this along with the other add_filter() calls.

// Ensure in-\&#91;code] entities ('<>&') work out right in the end.
add_filter('pre_comment_content', array(&$CodeSnippet, 'filterDeEntity'), 1);
add_filter('pre_comment_content', array(&$CodeSnippet, 'filterReEntity'), 50);

...

//# Add these methods to the CodeSnippet class.
    /**
     * Pre-encode HTML entities. Should come \e before wp_kses.
     */

    function filterDeEntity($content)
    {
        $content=  preg_replace(
            '#(\[code.*?\])(.*?)(\[/code\])#msie',
            '"\\1" . str_replace(
                array("<", ">", "&"),
                array("[|LT|]", "[|GT|]", "[|AMP|]"),
                \'\\2\') . "\\3";'
,
            $content);
        $content= str_replace('"', '"', $content);
       
        return $content;
    }
    /**
     * Decode HTML entities. Should come \e after wp_kses.
     */

    function filterReEntity($content)
    {
        if(strstr($content, "[|"))
        {
            $content= preg_replace(
                '#(\[code.*?\])(.*?)(\[/code\])#msie',
                '"\\1" . str_replace(
                    array("[|LT|]", "[|GT|]", "[|AMP|]"),
                    array("<", ">", "&"),
                    \'\\2\') . "\\3";'
,
                $content);
            $content= str_replace('"', '"', $content);
        }
       
        return $content;
    }

Notice that both methods are under the same filter group. The trick is that they have different priorities, which makes one act before wp_kses(), and one after. Also note how the regexps work in the replacement part of preg_replace(). This particular feature of preg_replace() allows for shorter code, but is very fragile; it may be better to use preg_replace_callback() instead. In any case, written like this it seems to work:

Oh hai! This is a useful bit function.
template<class T>
inline void bfInsert(T &y, u32 x, int start, int len)
{
   u32 mask= ((1<<len)-1)<<start;
   y &= ~mask;
   y |= (x<<start) & mask;
}

Comment preview

The code-comment mangling is just part of the issues one can encounter in blog comments. It's usually impossible to see beforehand what will be accepted and what not. Is HTML allowed? Are all tags allows, or just some or none at all? What about whitespace? Or BB-like tags? Basically, you'll never know what a comment will look like until you submitted it, and by then it's too late to change it.

You know what'd be really helpful? A comment preview!

You'd think this'd be a fairly obvious feature for a blogging system to have, but apparently not. I was thinking of making by own preview functionality, but when attempting to do so several items within WP thwarted my efforts. Fortunately, it seems plugins of this sort exist already. The plugin I'm now using is ajax-comment-preview, which works pretty darn well.

 

So anyway, comments should be able to handle code properly now and there's a comment-preview to show you what the comment will look like in the end. And there was much rejoicing.

Signs from Hell

The integer datatypes in C can be either signed or unsigned. Sometimes, it's obvious which should be used; for negative values you clearly should use signed types, for example. In many cases there is no obvious choice – in that case it usually doesn't matter which you use. Usually, but not always. Sometimes, picking the wrong kind can introduce subtle bugs in your program that, unless you know what to look out for, can catch you off-guard and have you searching for the problem for hours.

I've mentioned a few of these occasions in Tonc here and there, but I think it's worth going over them again in a little more detail. First, I'll explain how signed integers work and what the difference between signed and unsigned and where potential problems can come from. Then I'll discuss some common pitfalls so you know what to expect.

Continue reading