An EBNF Grammar for Wiki Creole 1.0 [Technical Report]

Abstract: Today’s wiki engines are not interoperable. This is an unfortunate consequence of the lack of rigorously specified standards. This technical report presents a complete and validated EBNF-based grammar for Wiki Creole, a community standard for wiki markup. Wiki Creole is also the only standard currently available. Wiki Creole is being specified using prose, leading to inconsistencies and ambiguities. Our grammar uncovered those ambiguities, which we fed back into the specification process. The Wiki Creole grammar presented in this report makes the creation of Wiki Creole parsers simple using parser generators, ANTLR in our case. Using a precise specification of wiki markup lets us decouple wiki editors from wiki storage and from further wiki processing tools. Based on this decoupling layer, we expect innovation on these different parts to proceed independently and at a faster pace than before.

Reference: Martin Junghans, Dirk Riehle, Rama Gurram, Matthias Kaiser, Mario Lopes, Umit Yalcinalp. In ACM SIGWEB Newsletter, Volume 2007, Issue Winter (Winter 2007), Article No. 4. ACM Press, 2007.

Available as a PDF file.

10 Replies to “An EBNF Grammar for Wiki Creole 1.0 [Technical Report]”

  1. Thank you and your collaborators for this very useful grammar! It looks complex. But Wiki Text in fact *is* complex to parse. … Johannes

  2. Hey Dirk, I’ve noticed a few rules which look odd to me. Please let me know your thoughts. This is one pair:
    text_line
    :
    text_firstelement ( text_element )* text_lineseparator
    ;
    text_element
    :
    onestar text_unformattedelement
    |
    text_unformattedelement onestar
    |
    text_formattedelement
    ;
    Doesn’t this allow the following invalid text_line?
    ab**c

  3. To clarify what I mean: the “**c” part of “ab**c” would not use the text_formattedelement, ital_markup, or text_italcontent symbols, yet it should be rendered in italics.

  4. Another question: What happens with a string of unformatted text which ends with a ~ followed by a newline? I ask because “~\n” is a valid escaped sequence, and unformatted text can contain escaped sequences. Should the newline be ignored? Should I email you future questions? Also, thank you very much for providing this grammar! It’s really helping me understand how to write my parser. ;D

    1. Hi Shaddy,
      the ~ is part of ANTLR syntax. It expresses that the expressions/characters following it must not occur in the terminal derived from the respective production rule(s).
      In your case, ‘~\n’ matches any character except a newline, assuming that the ~ was not escaped.
      HTH, Martin

      1. I’m referring to the ~ as part of a creole escape sequence in a text line:
        text_unformatted
        : ( ~( ITAL …and so on… EOF )
        | forced_linebreak
        | escaped )+
        ;
        escaped
        : ESCAPE STAR STAR
        | ESCAPE .
        ;
        ESCAPE : '~';
        These rules can generate “~\n” as a valid form of text_unformatted. Along with the other rules related to text paragraphs, a text paragraph can look like this:
        // START CREOLE
        this is a valid text_paragraph~
        ~
        ~
        this is part of the same text_line~
        this is still part of the same text_line
        // END CREOLE
        Which means that any ~ at the end of a text line would result in an issue:
        // START CREOLE
        this is a valid text_paragraph ~
        this text_line should belong to a new paragraph, but it doesn’t
        // END CREOLE

        1. What I was trying to say:
          Hi Shaddy, now I got you. Thanks for the examples.
          As you said, ‘~\n’ can be produced. Having had another brief look at the grammar, I can confirm that we escape the linefeed. If you want to change that, exclude the linefeed from the characters/tokens that are allowed after the tilde in the ‘escaped’ rule. When we wrote the paper, (i) we aimed at defining an LR(k) grammar that is not specific to ANTLR but can be deployed to ANTLR for testing, and (ii) ANTLR could not handle all the different kinds of ambiguities in the Creole grammar very well. (This was also the reason why, for instance, the rules for bold formatting exploded.) With current versions of ANTLR you can easily insert statements into the grammar that make assertions on the lookahead, e.g. “{ LA(2)!='\n' }?”, which allows you to handle the ambiguities.
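
The difference between escaping the linefeed and excluding it from the ‘escaped’ rule can be illustrated outside ANTLR. The following is only a minimal Python sketch, not the grammar from the report: the names tokenize and count_line_breaks and the escape_newline flag are hypothetical, the special “ESCAPE STAR STAR” alternative is left out, and it assumes that a tilde which escapes nothing is kept as an ordinary character.

```python
def tokenize(text, escape_newline=True):
    """Split text into ('ESCAPED', ch), ('NEWLINE', ch) and ('CHAR', ch) tokens.

    escape_newline=True mimics the rule `ESCAPE .` (any character may follow
    the tilde); escape_newline=False mimics excluding the linefeed from the
    characters allowed after the tilde.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "~" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt != "\n" or escape_newline:
                # The tilde swallows the next character, whatever it is.
                tokens.append(("ESCAPED", nxt))
                i += 2
                continue
        # Assumption: a tilde that escapes nothing is plain text.
        if ch == "\n":
            tokens.append(("NEWLINE", ch))
        else:
            tokens.append(("CHAR", ch))
        i += 1
    return tokens


def count_line_breaks(text, escape_newline=True):
    """Count the line separators that survive tokenization."""
    return sum(1 for kind, _ in tokenize(text, escape_newline) if kind == "NEWLINE")


sample = "this is a valid text_paragraph ~\nnext line\n"
print(count_line_breaks(sample, escape_newline=True))   # 1: the first newline is escaped
print(count_line_breaks(sample, escape_newline=False))  # 2: both newlines end a line
```

With the first setting, the line ending in a tilde merges with the following line, which is the behaviour Shaddy describes; with the second, every newline still terminates its line.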
