new and improved geshi

With Tonc I pretty much did all the syntax highlighting of code manually. As you might expect, this experience was – well, the proper description is something not suitable for anyone under the age of several thousand, so let's keep it at “somewhat less than pleasant”. So the first thing I looked when starting this whole blogging gig for was something that could do that automatically. In my case, that was codesnippet, which was build on the very awesome Geshi. There were some small problems with number formatting and whitespace handling, but overall it's served me well.

The Geshi that came with it was … 1.0.7.20, I think. In any case, Geshi's is now at 1.0.8.3, so I figured it was time for an upgrade. Most notable was that the way numbers were parsed has been greatly modified, with different types of representations now being parsed separately – and correctly to boot. Right now, it's almost fully correct, as you can see from the list below:

// Regular int
123
123l
123L
123ll       // fail
123LL       // fail
123u        // fail
123U        // fail
+123
-123

// Octal
0123

// Hex
0x12
0x123
0x123.4

// Float
123.4
123.4f
123.4F
+123.4
-123.4
1.2e3
1.2E3
1.2e+3
1.2e-3

// Inner
(1.23)
abc123de

Only some of the more special integer literals aren't parsed correctly, specifically the unsigned (-U) and long long (-LL) suffixes aren't accepted. I don't suppose hex floats will work either, but that's a GCC extension anyway.

To fix this, you need to modify geshi a little; specifically the GESHI_NUMBER_INT_CSTYLE regular expression:

  GESHI_NUMBER_INT_CSTYLE =>
    '(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)l(?![0-9a-z\.])',

… yeah. I'm not sure why it's formulated like that either. I'd have thought '\b' would have worked just as well, but alright. Anyway, notice the single 'l' character in there? That needs to be extended to something that matches a potential single 'u', possibly followed by one or two 'l's. In other words: 'u?l{0,2}'.

  GESHI_NUMBER_INT_CSTYLE =>
    '(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)\<b\>u?l{0,2}\</b\>(?![0-9a-z\.])',

HTML in code

An astute readed may have noted the bold in the previous snippet. Normally, you can't do that in Geshi.. One of the things that Geshi does is translate HTML entities like '<' into things like "&lt;" so that it'll turn up right on the resulting page. This, of course, is one of the things Geshi is expected to do. However, in this case it also makes it impossible to add HTML parts in the code snippet, which at times can be very useful.

So what do we do now? Well, we can use escaped HTML tags. Much like "\n" doesn't actually mean backslash + 'n' but a newline character, "\<" can be used for the actual '<'. And to unescape that, a double backslash can be used, much like it is in C.

\\<b\\>BOLD\\</b\\>    becomes     \<b\>BOLD\</b\>

There are several ways to implement this. One would be to modify it in the geshi code. I haven't tried that route yet because I expect it could get messy. That's arguably how it should be done, but it's easier to do it after the fact: when all the conversions have been done. Basically, you need something like this:

// Initialize geshi with the text to convert and language file to use.
$geshi = new GeSHi($text, $lang, $this->geshi_path);

// This does the actual work.
$text= $geshi->parse_code();

// Replace (un)escaped html entities.
$text= str_replace(
    array(
        // Normal entities
        '\\\&lt;', '\\\&gt;', '\\\&amp;',
        // In-string escapes get crap added, gaddammittohell >_<.
        '<span class="es0"><</span>',
            '<span class="es0">></span>',
            '<span class="es0">&</span>',
        // Unescaped entities
        '\\\&', '\\\<', '\\\>'),
    array(
        '<'     , '>'     , '&',        // Normal entities
        '<'     , '>'     , '&',        // In-string entities.
        '\\\&amp;', '\\\&lt;', '\\\&gt;'    // Unescaped entities
        ),
    $text);

There are three sets of items to search & replace here. The first two are the basic escaped tag delimiters, so that they'll actually result in HTML tags, and unescaped delimiters, so that you can print the combination itself. The third category are for HTML in string literals. Since the backslash has a specific meaning there as well, Geshi puts some highlighting stuff around it that would make the standard search fail. So that whole thing would need to be searched for and destroyedreplaced.

It's ugly, I know, but it seems to work. It'd be nicer if this could be done in the parser itself, but I have a feeling that'd take changes in multiple places. Since I don't know the code that well yet, I'm not touching that one with a ten-foot pole.

Lastly, let's test the ARM asm highlighter:

// Regular int
123
123l
123L
123ll
123LL  
123u
123U
+123
-123

// Binary
0b01100110
0B10101010

// Octal
0123

// Hex
0x12
0x123
0x123.4

// Float
123.4
123.4f
123.4F
+123.4
-123.4
1.2e3
1.2E3
1.2e+3
1.2e-3

// Inner
(1.23)
abc123de

Still works too. Bitchin'.

Code highlighting. Neat.

I use quite a bit of code in my documents. Now, you can't exactly copy code from an editor to an HTML page … at least, not if you want formatting to be maintained. Before now, used my standard text editor to convert it to html, and then post-process to remove excess styles and such. This worked, but it's still a little cumbersome.

After some searching, I found GeSHi, a php package that will convert and highlight code for use on the web and it is very customizable as well. Very nice. Before I had even looked into how I could use this, I found out that there is a WP plugin that uses it as well: IG-syntax hiliter. So now I can do this:

Keywords:
void for return int

Comments:
/* cmt */
// cmt

Strings/characters:
"hello"
'x'

Numbers:
12345, , +12, -12, 01234, 0x12ab34, 0X12AB34,
1.234, 1.2e3, 1.2e-3

/* Affine tilemap demo */
void test_tte_ase()
{
    // Base inits
    irq_init(NULL);
    irq_add(II_VBLANK, NULL);
    REG_DISPCNT= DCNT_MODE1 | DCNT_BG2;

    // Init affine text for 32x32t bg
    tte_init_ase(2, BG_CBB(0) | BG_SBB(28) | BG_AFF_32x32,
        0, CLR_YELLOW, 0xFE, NULL, NULL);

    // Write something
    tte_write("\\{P:120,80}o");
    tte_write("\\{P:72,104}Round, round, \\{P:80,112}round we go");

    AFF_SRC_EX asx= { 124 << 8, 84 << 8, 120, 80, 0x100, 0x100, 0 };
    bg_rotscale_ex(&REG_BG_AFFINE[2], &asx);

    // Rotate it
    while(1)
    {
        VBlankIntrWait();
        key_poll();

        asx.alpha += 0x111;
        bg_rotscale_ex(&REG_BG_AFFINE[2], &asx);

        if(key_hit(KEY_START))
            break;
    }
}

That should be C code. And by golly it works :). Now I just have to make something for ARM asm.


There are some caveats to the plugin and geshi, though.

  • On the plugin side, it works as an extra filter in the displaying process. However, when writing the code in the post, that'll be standard C, complete with brackets and ampersands and such. So it is vitally important to turn off the visual editor and auto-validating of XHTML. It you don't, there will be trouble when you save the post.
  • By default, the plugin uses line numbers (bleh) and a few other extras that clutter the actual code. So I turned those off in the plugin options. However, this wasn't enough for everything. The geshi settings for number parsing and using CSS-classes highlighting are also turned off, and if you want those (and I do), you'll have to make a few changes to the plugin manually. In particular, I needed to add `$geshi->enable_classes(true)' and `$geshi->set_number_highlighting(true)'.
  • The regexp for numbers in geshi is incomplete: it doesn't do hex or floats, for example. In parse_non_string_part() use this instead: [php] $reg= "#\\b((0[xX][0-9A-Fa-f]+)|([0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)|([0-9]))\\b#"; $stuff_to_parse= preg_replace($reg, "<|/NUM!/>\\1|>", $stuff_to_parse); [/php]

Regexp help courtesy of www.regular-expressions.info.