new and improved geshi

With Tonc I pretty much did all the syntax highlighting of code manually. As you might expect, this experience was – well, the proper description is something not suitable for anyone under the age of several thousand, so let's keep it at “somewhat less than pleasant”. So the first thing I looked for when starting this whole blogging gig was something that could do that automatically. In my case, that was codesnippet, which was built on the very awesome Geshi. There were some small problems with number formatting and whitespace handling, but overall it's served me well.

The Geshi that came with it was … 1.0.7.20, I think. In any case, Geshi is now at 1.0.8.3, so I figured it was time for an upgrade. The most notable change is that the way numbers are parsed has been greatly modified, with different types of representations now being parsed separately – and correctly to boot. Right now, it's almost fully correct, as you can see from the list below:

// Regular int
123
123l
123L
123ll       // fail
123LL       // fail
123u        // fail
123U        // fail
+123
-123

// Octal
0123

// Hex
0x12
0x123
0x123.4

// Float
123.4
123.4f
123.4F
+123.4
-123.4
1.2e3
1.2E3
1.2e+3
1.2e-3

// Inner
(1.23)
abc123de

Only some of the more special integer literals aren't parsed correctly: the unsigned (-U) and long long (-LL) suffixes aren't accepted. I don't suppose hex floats will work either, but those are only standard in C99; in C++ they're a GCC extension anyway.

To fix this, you need to modify geshi a little; specifically the GESHI_NUMBER_INT_CSTYLE regular expression:

  GESHI_NUMBER_INT_CSTYLE =>
    '(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)l(?![0-9a-z\.])',

… yeah. I'm not sure why it's formulated like that either. I'd have thought '\b' would have worked just as well, but alright. Anyway, notice the single 'l' character in there? That needs to be extended to something that matches a potential single 'u', possibly followed by one or two 'l's. In other words: 'u?l{0,2}'.

  GESHI_NUMBER_INT_CSTYLE =>
    '(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)\<b\>u?l{0,2}\</b\>(?![0-9a-z\.])',
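
If you want to check the change without touching your Geshi install, something like the following sketch should do. Only the two patterns come from the snippets above (with the bold markers stripped); the matches_whole() helper is hypothetical, and the '/i' flag is my assumption about how the pattern ends up being applied, based on '123L' being accepted in the list earlier.

```php
<?php
// The original pattern and the modified one, copied from the post.
$old = '(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)l(?![0-9a-z\.])';
$new = '(?<![0-9a-z_\.%])(?<![\d\.]e[+\-])([1-9]\d*?|0)u?l{0,2}(?![0-9a-z\.])';

// Hypothetical helper: true if the pattern matches the literal as a whole.
// Assumption: the pattern is applied case-insensitively (the /i flag).
function matches_whole(string $pattern, string $literal): bool
{
    return preg_match('/' . $pattern . '/i', $literal, $m) === 1
        && $m[0] === $literal;
}

// Print a small comparison table for a few suffixed literals.
foreach (array('123l', '123LL', '123u', '123ULL') as $literal) {
    printf("%-8s old: %-3s new: %s\n", $literal,
        matches_whole($old, $literal) ? 'yes' : 'no',
        matches_whole($new, $literal) ? 'yes' : 'no');
}
```

Note that 'u?l{0,2}' only covers the 'u' before 'l' order; something like '123lu' would still be rejected.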

HTML in code

An astute reader may have noted the bold in the previous snippet. Normally, you can't do that in Geshi. One of the things that Geshi does is translate HTML special characters like '<' into entities like "&lt;" so that they'll turn up right on the resulting page. This, of course, is one of the things Geshi is expected to do. However, in this case it also makes it impossible to add HTML parts in the code snippet, which at times can be very useful.

So what do we do now? Well, we can use escaped HTML tags. Much like "\n" doesn't actually mean backslash + 'n' but a newline character, "\<" can be used for the actual '<'. And to unescape that, a double backslash can be used, much like it is in C.

\\<b\\>BOLD\\</b\\>    becomes     \<b\>BOLD\</b\>

There are several ways to implement this. One would be to modify it in the geshi code. I haven't tried that route yet because I expect it could get messy. That's arguably how it should be done, but it's easier to do it after the fact: when all the conversions have been done. Basically, you need something like this:

// Initialize geshi with the text to convert and language file to use.
$geshi = new GeSHi($text, $lang, $this->geshi_path);

// This does the actual work.
$text= $geshi->parse_code();

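// Replace (un)escaped HTML entities. str_replace() works through the pairs
// in order, so each replacement sees the result of the previous ones.
$text = str_replace(
    array(
        // Normal entities: a single backslash followed by an entity.
        '\\&lt;', '\\&gt;', '\\&amp;',
        // In-string escapes get crap added, gaddammittohell >_<.
        // The entity inside was already replaced above; drop the leftover span.
        '<span class="es0"><</span>',
        '<span class="es0">></span>',
        '<span class="es0">&</span>',
        // Unescaped entities: what is left of a double backslash.
        '\\&', '\\<', '\\>'),
    array(
        '<'     , '>'     , '&',        // Normal entities
        '<'     , '>'     , '&',        // In-string entities.
        '\\&amp;', '\\&lt;', '\\&gt;'    // Unescaped entities
        ),
    $text);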

There are three sets of items to search & replace here. The first and last are the basic escaped tag delimiters, so that they'll actually result in HTML tags, and the unescaped delimiters, so that you can print the combination itself. The middle set is for HTML in string literals. Since the backslash has a specific meaning there as well, Geshi puts some highlighting stuff around it that would make the standard search fail. So that whole thing would need to be searched for and destroyed, er, replaced.

It's ugly, I know, but it seems to work. It'd be nicer if this could be done in the parser itself, but I have a feeling that'd take changes in multiple places. Since I don't know the code that well yet, I'm not touching that one with a ten-foot pole.
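
For those who want to poke at the replacement step without Geshi in the loop, here's a minimal sketch. The function name unescape_html_markers() is made up, and htmlspecialchars() is only a stand-in for Geshi's own entity translation, so treat it as an approximation of what the plugin sees rather than the real thing. It assumes a single backslash marks an escaped tag delimiter, as described above.

```php
<?php
// Hypothetical helper wrapping the search & replace step from the post.
function unescape_html_markers(string $html): string
{
    // str_replace() applies the pairs in order, so later pairs see the
    // output of earlier ones.
    return str_replace(
        array(
            '\\&lt;', '\\&gt;', '\\&amp;',           // escaped delimiters
            '<span class="es0"><</span>',            // leftover in-string spans
            '<span class="es0">></span>',
            '<span class="es0">&</span>',
            '\\&', '\\<', '\\>'),                    // remains of a double backslash
        array(
            '<', '>', '&',
            '<', '>', '&',
            '\\&amp;', '\\&lt;', '\\&gt;'),
        $html
    );
}

// Assumption: htmlspecialchars() mimics what Geshi does to '<', '>' and '&'.
$escaped   = htmlspecialchars('\<b\>BOLD\</b\>', ENT_NOQUOTES);
$unescaped = htmlspecialchars('\\\\<b\\\\>BOLD\\\\</b\\\\>', ENT_NOQUOTES);

echo unescape_html_markers($escaped), "\n";    // real tags in the page source
echo unescape_html_markers($unescaped), "\n";  // entities that display as \<b\>
```

The first call should give you actual bold tags; the second keeps the backslashes and re-escapes the delimiters, so the browser shows the escape sequence itself.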

Lastly, let's test the ARM asm highlighter:

// Regular int
123
123l
123L
123ll
123LL  
123u
123U
+123
-123

// Binary
0b01100110
0B10101010

// Octal
0123

// Hex
0x12
0x123
0x123.4

// Float
123.4
123.4f
123.4F
+123.4
-123.4
1.2e3
1.2E3
1.2e+3
1.2e-3

// Inner
(1.23)
abc123de

Still works too. Bitchin'.
