Dirk Riehle's Industry and Research Publications

An EBNF grammar for Wiki Creole 1.0 [Technical Report]

Abstract: Today’s wiki engines are not interoperable. This is an unfortunate consequence of the lack of rigorously specified standards. This technical report presents a complete and validated EBNF-based grammar for Wiki Creole, a community standard for wiki markup. Wiki Creole is also the only standard currently available. Wiki Creole is being specified using prose, leading to inconsistencies and ambiguities. Our grammar uncovered those ambiguities which we fed back into the specification process. The Wiki Creole grammar presented in this report makes the creation of Wiki Creole parsers simple using parser generators, ANTLR in our case. Using a precise specification of wiki markup lets us decouple wiki editors from wiki storage from further wiki processing tools. Based on this decoupling layer we expect innovation on these different parts to proceed independently and at a faster pace than before.

Reference: Martin Junghans, Dirk Riehle, Rama Gurram, Matthias Kaiser, Mario Lopes, Umit Yalcinalp. In ACM SIGWEB Newsletter vol. 2007 (winter). ACM: Article no. 4, pp 4-es.

Available as a PDF file.

Comments

  1. Shaddy Zeineddine

    Another question: what happens with a string of unformatted text that ends with a ~ followed by a newline? “~\n” is a valid escape sequence, and unformatted text can contain escape sequences, so should the newline be ignored? Should I email you future questions? Also, thank you very much for providing this grammar! It’s really helping me understand how to write my parser. ;D

    1. Martin

      Hi Shaddy,
      the ~ is part of ANTLR syntax: it expresses that the following expressions/characters must not occur in the terminal derived from the respective production rule(s).
      In your case, ‘~\n’ matches any character except a newline, assuming that the ~ was not escaped.
      HTH, Martin

      1. Shaddy Zeineddine

        I’m referring to the ~ as part of a creole escape sequence in a text line:
        text_unformatted
        : ( ~( ITAL …and so on… EOF )
        | forced_linebreak
        | escaped )+
        ;
        escaped
        : ESCAPE STAR STAR
        | ESCAPE .
        ;
        ESCAPE : '~';
        These rules mean that “~\n” is a valid form of text_unformatted. Along with the other rules related to text paragraphs, a text paragraph can look like this:
        // START CREOLE
        this is a valid text_paragraph~
        ~
        ~
        this is part of the same text_line~
        this is still part of the same text_line
        // END CREOLE
        Which means that any ~ at the end of a text line would result in an issue:
        // START CREOLE
        this is a valid text_paragraph ~
        this text_line should belong to a new paragraph, but it doesn’t
        // END CREOLE

        1. Martin

          What I was trying to say:
          Hi Shaddy, now I got you. Thanks for the examples.
          As you said, ‘~\n’ can be produced. Having had another brief look at the grammar, I can confirm that we escape the linefeed. If you want to change that, you can exclude the linefeed from the characters/tokens that are allowed after the tilde in the ‘escaped’ rule. When we wrote the paper, (i) we aimed at defining an LR(k) grammar that is not specific to ANTLR but can be deployed to ANTLR for testing, and (ii) ANTLR could not handle all the different kinds of ambiguities in the Creole grammar very well. (This was also the reason why, for instance, the rules for bold formatting exploded.) With current versions of ANTLR you can easily insert statements into the grammar that make assertions on the lookahead, e.g. “{ LA(2)!=’\n’ }?”, which allows you to handle the ambiguities.

          1. Shaddy Zeineddine

            Perfect, thanks!

  2. Shaddy Zeineddine

    To clarify what I mean: the “**c” part of “ab**c” would not use the text_formattedelement, ital_markup, or text_italcontent symbols, yet it should be rendered in italics.

  3. Dirk Riehle

    Hi Shaddy,
    how is ab**c invalid?
    Thanks,
    Dirk

  4. Shaddy Zeineddine

    Hey Dirk, I’ve noticed a few rules which look odd to me. Please let me know your thoughts. This is one pair:
    text_line
    :
    text_firstelement ( text_element )* text_lineseparator
    ;
    text_element
    :
    onestar text_unformattedelement
    |
    text_unformattedelement onestar
    |
    text_formattedelement
    ;
    Doesn’t this allow the following invalid text_line?
    ab**c

  5. […] of Martin Junghans and Dirk Riehle. They did research on this subject. As a result, they created an EBNF grammar and an XML interchange format (Creole […]

  6. johannes busse

    Thank you and your collaborators for this very useful grammar! It looks complex. But Wiki Text in fact *is* complex to parse. … Johannes
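
The “~\n” question from the first comment thread can be illustrated with a small sketch. It is not the grammar from the report and not an ANTLR parser; it is a hand-written Python stand-in that only models the ‘ESCAPE .’ alternative of the ‘escaped’ rule (the ‘ESCAPE STAR STAR’ alternative is ignored). The function name split_text_lines and the flag escape_newline are hypothetical, and keeping an unescaped ~ before a newline as a literal tilde is an assumption about the desired behavior.

```python
def split_text_lines(source: str, escape_newline: bool) -> list[str]:
    """Split source into text lines while honoring '~' escapes.

    escape_newline=True  mimics `ESCAPE .`: any character may follow '~',
                         so "~\n" swallows the line separator.
    escape_newline=False mimics excluding the linefeed after '~'
                         (an assumed variant, not taken from the report).
    """
    lines: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "~" and i + 1 < len(source):
            nxt = source[i + 1]
            if nxt != "\n" or escape_newline:
                # Escaped pair: the character after '~' is taken literally.
                current.append(nxt)
                i += 2
                continue
            # '~' directly before a newline: keep the tilde as plain text
            # and let the newline end the line on the next iteration.
            current.append(ch)
            i += 1
            continue
        if ch == "\n":
            # An unescaped newline closes the current text line.
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        lines.append("".join(current))
    return lines


sample = "this is a valid text_paragraph ~\nthis should start a new line\n"
merged = split_text_lines(sample, escape_newline=True)
separate = split_text_lines(sample, escape_newline=False)
```

With escape_newline=True the two input lines end up in one text line, which is the effect Shaddy describes; with escape_newline=False the newline keeps acting as a line separator, which corresponds to excluding the linefeed after the tilde as Martin suggests.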
