Wet paint (this is not an instruction)

hhhyeah, I guess it's time to give this thing a new theme. Had the old style since, what, 2007 and drab gray is out, right? And now that IE6/7 are firmly on the way out, I can do nice CSS 2/3 effects like rounded borders and shadows and the like. Yay. For better or worse I also added ShareThis buttons, but as they're a little heavy on the JavaScript I might have to remove them from the frontpage later. I'll see how it goes.

 

I'm sure there will still be some CSS bugs here and there, but I think I've covered most of them.

Filter juggling and comment preview

One of the nice features of WordPress is that it already has a lot of functionality built-in. The whole thing is set up so that normal people can just install and start writing posts immediately, with WordPress taking care of all the details like converting HTML entities and adding newlines where appropriate.

Of course, for those that aren't normal and that would like to write in raw HTML, these things are somewhat annoying. Fortunately, though, WordPress allows you to disable these kinds of filters. The catch is that you need to find out which filters to disable, namely, wptexturize (which converts HTML entities) and wpautop (which does newline control). WordPress also makes it easy to add additional filters, like the CodeSnippet plugin that I use for code highlighting.
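Switching those two filters off comes down to a pair of remove_filter() calls. The snippet below is a minimal sketch, assuming both filters sit on the 'the_content' hook at the default priority (as in a stock install); the wrapper function and its name are made up for this example, and outside WordPress it simply does nothing.

```php
<?php
// Hypothetical helper (the name is not part of WordPress or any plugin).
// Assumption: wptexturize and wpautop are attached to 'the_content' with the
// default priority, as a stock WordPress install does.
function raw_html_disable_autoformat()
{
    // remove_filter() only exists inside WordPress; bail out elsewhere.
    if (!function_exists('remove_filter')) {
        return false;
    }

    remove_filter('the_content', 'wptexturize'); // no more entity/quote conversion
    remove_filter('the_content', 'wpautop');     // no more automatic paragraphs

    return true;
}

// Typically called from a plugin file or the theme's functions.php.
raw_html_disable_autoformat();
```

Other hooks (excerpts, comments) carry their own copies of these filters, possibly at other priorities, so those would have to be removed separately.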

However, with the number of filters available, sometimes things will clash. A good example of this is comments that have source code in them. Part of what CodeSnippet does is convert certain characters (specifically: ‘<’, ‘>’, ‘&’) to their printable entities (&lt;, &gt;, &amp;) so that they aren't considered special HTML characters anymore. However, there are several other filters that have a similar task, so that when you write this:

Oh hai! This is a useful bitfield function.
 
[code lang="cpp"]
template<class T>
inline void bfInsert(T &y, u32 x, int start, int len)
{
   u32 mask= ((1<<len)-1) << start;
   y &= ~mask;
   y |= (x<<start) & mask;
}
[/code]

what it becomes is:

Oh hai! This is a useful bit function.
template
inline void bfInsert(T &amp;y, u32 x, int start, int len)
{
    u32 mask= (1&lt;&lt;len) &lt;&lt; start;
    y &amp;= ~mask;
    y |= (x&lt;&lt;start) &amp; mask;
}

Not exactly pretty. Note that the template class is simply removed because it's seen as an illicit HTML tag, and all the special characters are doubly converted. This is still a mild example; I think if you place the angle brackets wrong, whole swaths of code can simply be eaten by the sanitizer.
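Both symptoms are easy to reproduce outside WordPress. The sketch below uses PHP's own strip_tags() and htmlspecialchars() purely as stand-ins for the sanitizing filters; it mimics the effect and makes no claim about what WordPress does internally.

```php
<?php
// Stand-ins only: strip_tags() and htmlspecialchars() are core PHP functions,
// used here to mimic the symptoms, not WordPress's actual sanitizer.

// 1) Anything shaped like <...> is treated as a tag and removed.
$declaration = 'template<class T>';
$stripped    = strip_tags($declaration);

// 2) Escaping text that has already been escaped mangles it a second time.
$line  = 'y |= (x<<start) & mask;';
$once  = htmlspecialchars($line);   // what the highlighter plugin wants
$twice = htmlspecialchars($once);   // what you get when a second filter joins in

echo $stripped, "\n";
echo $once, "\n";
echo $twice, "\n";
```

The first line comes out as a bare "template"; the last one has every '&' of the already-encoded text encoded once more, which is exactly the kind of garbage shown above.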

Unfortunately, finding out where the problem lies is tricky. Not only are there dozens of potential functions doing the conversion, they can be called from anywhere, so you have no idea where to start, and PHP isn't exactly rich in the debugger department. Worse still, in this particular case the place where the bad happens is actually before the comment is even saved to the database (but only for unregistered people; for me the code comments would work fine), and because comments are handled on a page that you don't actually ever see, random echo/print statements are useless as well.

But I think I finally got it: it was wp_kses() using (in a roundabout way) wp_specialchars() in the wp-includes/kses.php file. The contractor is actually wp_filter_comment() from wp-includes/comment.php, using the pre_comment_content filter as a middleman.

The trick now is to keep it from happening. What I've done is define not one but two pre_comment_content filters: one that pre-mangles the brackets and ampersand before wp_kses, and one that de-mangles them afterwards. Of course, this will only be of importance between [code] tags. Exactly how to do this will depend on the plugin you're using, but in the case of CodeSnippet it goes like this:

//# Put this along with the other add_filter() calls.

// Ensure in-[code] entities ('<>&') work out right in the end.
add_filter('pre_comment_content', array(&$CodeSnippet, 'filterDeEntity'), 1);
add_filter('pre_comment_content', array(&$CodeSnippet, 'filterReEntity'), 50);

...

//# Add these methods to the CodeSnippet class.
    /**
     * Pre-encode HTML entities. Should come \e before wp_kses.
     */

    function filterDeEntity($content)
    {
        $content=  preg_replace(
            '#(\[code.*?\])(.*?)(\[/code\])#msie',
            '"\\1" . str_replace(
                array("<", ">", "&"),
                array("[|LT|]", "[|GT|]", "[|AMP|]"),
                \'\\2\') . "\\3";'
,
            $content);
        // The /e modifier backslash-escapes quotes in the matched text; undo that.
        $content= str_replace('\"', '"', $content);
       
        return $content;
    }
    /**
     * Decode HTML entities. Should come \e after wp_kses.
     */

    function filterReEntity($content)
    {
        if(strstr($content, "[|"))
        {
            $content= preg_replace(
                '#(\[code.*?\])(.*?)(\[/code\])#msie',
                '"\\1" . str_replace(
                    array("[|LT|]", "[|GT|]", "[|AMP|]"),
                    array("<", ">", "&"),
                    \'\\2\') . "\\3";'
,
                $content);
            // The /e modifier backslash-escapes quotes in the matched text; undo that.
            $content= str_replace('\"', '"', $content);
        }
       
        return $content;
    }

Notice that both methods are under the same filter group. The trick is that they have different priorities, which makes one act before wp_kses(), and one after. Also note how the regexps work in the replacement part of preg_replace(). This particular feature of preg_replace() allows for shorter code, but is very fragile; it may be better to use preg_replace_callback() instead. In any case, written like this it seems to work:

Oh hai! This is a useful bit function.
template<class T>
inline void bfInsert(T &y, u32 x, int start, int len)
{
   u32 mask= ((1<<len)-1)<<start;
   y &= ~mask;
   y |= (x<<start) & mask;
}
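For reference, here is roughly what the same pair of filters could look like with preg_replace_callback(), which also matters because the /e modifier was deprecated in PHP 5.5 and removed in PHP 7. This is a standalone sketch, not CodeSnippet's code: the function names are made up, and run_comment_filters() is only a toy stand-in for WordPress's priority-ordered filter chain, with htmlspecialchars() playing the part of the sanitizer at priority 10.

```php
<?php
// Placeholder tokens, copied from the filters in the post.
const RAW_CHARS    = array('<', '>', '&');
const PLACEHOLDERS = array('[|LT|]', '[|GT|]', '[|AMP|]');

// Swap $from for $to, but only between [code ...] and [/code].
// (Hypothetical helper; not part of CodeSnippet.)
function swap_inside_code($content, array $from, array $to)
{
    return preg_replace_callback(
        '#(\[code.*?\])(.*?)(\[/code\])#is',
        function ($m) use ($from, $to) {
            return $m[1] . str_replace($from, $to, $m[2]) . $m[3];
        },
        $content
    );
}

// Meant to run before the sanitizer: hide the special characters.
function filter_de_entity($content)
{
    return swap_inside_code($content, RAW_CHARS, PLACEHOLDERS);
}

// Meant to run after the sanitizer: put the special characters back.
function filter_re_entity($content)
{
    return swap_inside_code($content, PLACEHOLDERS, RAW_CHARS);
}

// Toy stand-in for the filter chain: callbacks run in order of ascending
// priority number, whatever order they were registered in.
function run_comment_filters($content)
{
    $filters = array(
        50 => 'filter_re_entity',
        10 => 'htmlspecialchars',   // stand-in for the sanitizer
        1  => 'filter_de_entity',
    );
    ksort($filters);
    foreach ($filters as $callback) {
        $content = call_user_func($callback, $content);
    }
    return $content;
}

$comment = 'a < b [code]y |= (x<<start) & mask;[/code]';
echo run_comment_filters($comment), "\n";
```

Text outside the [code] tags still gets sanitized as usual; only the part between the tags makes it through untouched. A comment that happens to contain a literal "[|LT|]" inside a code block would be converted too, which is a limitation shared with the original.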

Comment preview

The code-comment mangling is just part of the issues one can encounter in blog comments. It's usually impossible to see beforehand what will be accepted and what not. Is HTML allowed? Are all tags allowed, or just some or none at all? What about whitespace? Or BB-like tags? Basically, you'll never know what a comment will look like until you've submitted it, and by then it's too late to change it.

You know what'd be really helpful? A comment preview!

You'd think this'd be a fairly obvious feature for a blogging system to have, but apparently not. I was thinking of making my own preview functionality, but when attempting to do so several items within WP thwarted my efforts. Fortunately, it seems plugins of this sort exist already. The plugin I'm now using is ajax-comment-preview, which works pretty darn well.

 

So anyway, comments should be able to handle code properly now and there's a comment-preview to show you what the comment will look like in the end. And there was much rejoicing.

New and improved Geshi

With Tonc I pretty much did all the syntax highlighting of code manually. As you might expect, this experience was – well, the proper description is something not suitable for anyone under the age of several thousand, so let's keep it at “somewhat less than pleasant”. So the first thing I looked for when starting this whole blogging gig was something that could do that automatically. In my case, that was CodeSnippet, which was built on the very awesome Geshi. There were some small problems with number formatting and whitespace handling, but overall it's served me well.

The Geshi that came with it was … 1.0.7.20, I think. In any case, Geshi is now at 1.0.8.3, so I figured it was time for an upgrade. Most notable was that the way numbers are parsed has been greatly modified, with different types of representations now being parsed separately – and correctly to boot. Right now, it's almost fully correct, as you can see from the list below:

// Regular int
123
123l
123L
123ll       // fail
123LL       // fail
123u        // fail
123U        // fail
+123
-123

// Octal
0123

// Hex
0x12
0x123
0x123.4

// Float
123.4
123.4f
123.4F
+123.4
-123.4
1.2e3
1.2E3
1.2e+3
1.2e-3

// Inner
(1.23)
abc123de

Only some of the more special integer literals aren't parsed correctly: specifically, the unsigned (-U) and long long (-LL) suffixes aren't accepted. I don't suppose hex floats will work either, but that's a GCC extension anyway (in C++, at least; C99 has them in the standard).

To fix this, you need to modify geshi a little; specifically the GESHI_NUMBER_INT_CSTYLE regular expression:

  GESHI_NUMBER_INT_CSTYLE =>
    '(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)l(?![0-9a-z\.])',

… yeah. I'm not sure why it's formulated like that either. I'd have thought '\b' would have worked just as well, but alright. Anyway, notice the single 'l' character in there? That needs to be extended to something that matches a potential single 'u', possibly followed by one or two 'l's. In other words: 'u?l{0,2}'.

  GESHI_NUMBER_INT_CSTYLE =>
    '(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)\<b\>u?l{0,2}\</b\>(?![0-9a-z\.])',
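A quick way to check the change outside Geshi is to run both patterns over a few literals with preg_match(). This is only a sketch: the bold markers are dropped, and the '/' delimiters and the case-insensitive flag are assumptions (the flag is suggested by 123L already passing with a lowercase 'l' in the pattern). The helper name is made up.

```php
<?php
// Both patterns are the ones quoted in the post, minus the bold markers;
// the '/' delimiters and the 'i' flag are assumptions of this sketch.
$old = '/(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)l(?![0-9a-z\.])/i';
$new = '/(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)u?l{0,2}(?![0-9a-z\.])/i';

// True if the pattern matches the literal from start to end.
function matches_whole($pattern, $literal)
{
    return preg_match($pattern, $literal, $m) === 1 && $m[0] === $literal;
}

foreach (array('123l', '123LL', '123u', '123ULL', 'abc123de', '0x12') as $literal) {
    printf("%-10s old: %-3s new: %s\n",
        $literal,
        matches_whole($old, $literal) ? 'yes' : 'no',
        matches_whole($new, $literal) ? 'yes' : 'no');
}
```

Two side effects worth knowing about: since the whole suffix is now optional, the new pattern also matches a plain 123, and a suffix written the other way around (123lu) still isn't recognized.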

HTML in code

An astute reader may have noted the bold in the previous snippet. Normally, you can't do that in Geshi. One of the things that Geshi does is translate special HTML characters like '<' into entities like "&lt;" so that they'll turn up right on the resulting page. This, of course, is one of the things Geshi is expected to do. However, in this case it also makes it impossible to add HTML parts in the code snippet, which at times can be very useful.

So what do we do now? Well, we can use escaped HTML tags. Much like "\n" doesn't actually mean backslash + 'n' but a newline character, "\<" can be used for the actual '<'. And to unescape that, a double backslash can be used, much like it is in C.

\\<b\\>BOLD\\</b\\>    becomes     \<b\>BOLD\</b\>

There are several ways to implement this. One would be to modify it in the geshi code. I haven't tried that route yet because I expect it could get messy. That's arguably how it should be done, but it's easier to do it after the fact: when all the conversions have been done. Basically, you need something like this:

// Initialize geshi with the text to convert and language file to use.
$geshi = new GeSHi($text, $lang, $this->geshi_path);

// This does the actual work.
$text= $geshi->parse_code();

// Replace (un)escaped html entities.
$text= str_replace(
    array(
        // Normal entities
        '\\\&lt;', '\\\&gt;', '\\\&amp;',
        // In-string escapes get crap added, gaddammittohell >_<.
        '<span class="es0"><</span>',
            '<span class="es0">></span>',
            '<span class="es0">&</span>',
        // Unescaped entities
        '\\\&', '\\\<', '\\\>'),
    array(
        '<'     , '>'     , '&',        // Normal entities
        '<'     , '>'     , '&',        // In-string entities.
        '\\\&amp;', '\\\&lt;', '\\\&gt;'    // Unescaped entities
        ),
    $text);

There are three sets of items to search & replace here. The first two are the basic escaped tag delimiters, so that they'll actually result in HTML tags, and unescaped delimiters, so that you can print the combination itself. The third category (listed second in the code) is for HTML in string literals. Since the backslash has a specific meaning there as well, Geshi puts some highlighting stuff around it that would make the standard search fail. So that whole thing needs to be searched for and replaced.

It's ugly, I know, but it seems to work. It'd be nicer if this could be done in the parser itself, but I have a feeling that'd take changes in multiple places. Since I don't know the code that well yet, I'm not touching that one with a ten-foot pole.
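For comparison, the basic escaped/unescaped part can also be done in a single regex pass, which sidesteps any worry about the order in which the search strings are applied. This is a sketch of an alternative, not the code used on this site: the function name is made up, it assumes the highlighter has already turned '<', '>' and '&' into entities while leaving the backslashes alone, and it does not handle the in-string case where extra highlighting spans get wrapped around the escape.

```php
<?php
// Hypothetical helper: turns the escapes into real HTML after highlighting.
// Assumption: <, > and & have already been converted to entities, and the
// backslashes in front of them were left untouched.
function unescape_html_tags($html)
{
    $raw = array('&lt;' => '<', '&gt;' => '>', '&amp;' => '&');

    return preg_replace_callback(
        '/\\\\(\\\\?)(&lt;|&gt;|&amp;)/',
        function ($m) use ($raw) {
            // Two backslashes: keep one, and keep the entity so the browser
            // shows the character itself. One backslash: emit the raw character.
            return $m[1] !== '' ? '\\' . $m[2] : $raw[$m[2]];
        },
        $html
    );
}

echo unescape_html_tags('\\&lt;b\\&gt;BOLD\\&lt;/b\\&gt;'), "\n";
echo unescape_html_tags('\\\\&lt;b\\\\&gt;'), "\n";
```

The first call produces an actual bold tag pair around BOLD; the second leaves a single backslash in front of each entity, so the page shows the escape sequence itself.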

Lastly, let's test the ARM asm highlighter:

// Regular int
123
123l
123L
123ll
123LL  
123u
123U
+123
-123

// Binary
0b01100110
0B10101010

// Octal
0123

// Hex
0x12
0x123
0x123.4

// Float
123.4
123.4f
123.4F
+123.4
-123.4
1.2e3
1.2E3
1.2e+3
1.2e-3

// Inner
(1.23)
abc123de

Still works too. Bitchin'.