An EBNF grammar for Wiki Creole 1.0 [Technical Report]

Abstract: Today’s wiki engines are not interoperable. This is an unfortunate consequence of the lack of rigorously specified standards. This technical report presents a complete and validated EBNF-based grammar for Wiki Creole, a community standard for wiki markup. Wiki Creole is also the only standard currently available. Wiki Creole is being specified using prose, leading to inconsistencies and ambiguities. Our grammar uncovered those ambiguities which we fed back into the specification process. The Wiki Creole grammar presented in this report makes the creation of Wiki Creole parsers simple using parser generators, ANTLR in our case. Using a precise specification of wiki markup lets us decouple wiki editors from wiki storage from further wiki processing tools. Based on this decoupling layer we expect innovation on these different parts to proceed independently and at a faster pace than before.

Reference: Martin Junghans, Dirk Riehle, Rama Gurram, Matthias Kaiser, Mario Lopes, Umit Yalcinalp. In ACM SIGWEB Newsletter vol. 2007 (winter). ACM: Article no. 4, pp. 4-es.

Available as a PDF file.

Posted on 2008-01-09

Categories: 4. Society-at-large, 4.4 Wikis and Wikipedia

Tagged as: Publication

Comments

Shaddy Zeineddine

2013-02-28

Another question: What happens with a string of unformatted text which ends with a ~ followed by a newline? Since, “~\n” is a valid escaped sequence and unformatted text can contain escaped sequences. Should the newline be ignored? Should I email you future questions? Also, thank you very much for providing this grammar! It’s really helping me understand how to write my parser. ;D

1. Martin
  
  2013-02-28
  
  Hi Shaddy,
  in ANTLR syntax, ~ is the negation operator: it matches any character or token except the ones that follow it in the respective production rule(s).
  In your case, ‘~\n’ matches any character except a newline, assuming that the ~ itself is not escaped.
  HTH, Martin
  
  1. Shaddy Zeineddine
    
    2013-02-28
    
    I’m referring to the ~ as part of a creole escape sequence in a text line:
    text_unformatted
    : ( ~( ITAL …and so on… EOF )
    | forced_linebreak
    | escaped )+
    ;
    escaped
    : ESCAPE STAR STAR
    | ESCAPE .
    ;
    ESCAPE : '~';
    These rules can generate “~\n” as a valid form of text_unformatted. Together with the other rules for text paragraphs, a text paragraph can look like this:
    // START CREOLE
    this is a valid text_paragraph~
    ~
    ~
    this is part of the same text_line~
    this is still part of the same text_line
    // END CREOLE
    This means that any ~ at the end of a text line causes a problem:
    // START CREOLE
    this is a valid text_paragraph ~
    this text_line should belong to a new paragraph, but it doesn’t
    // END CREOLE
    
    1. Martin
      
      2013-02-28
      
      What I was trying to say:
      Hi Shaddy, now I got you. Thanks for the examples.
      As you said, ‘~\n’ can be produced. Having had another brief look at the grammar, I can confirm that we do escape the linefeed. If you want to change that, the linefeed can be excluded from the characters/tokens allowed after the tilde in the ‘escaped’ rule. When we wrote the paper, (i) we aimed at defining an LR(k) grammar that is not specific to ANTLR but can be deployed to ANTLR for testing, and (ii) ANTLR could not handle all the different kinds of ambiguities in the Creole grammar very well. (This was also the reason why, for instance, the rules for bold formatting exploded.) With current versions of ANTLR you can easily insert statements into the grammar that make assertions on the lookahead, e.g. “{ LA(2)!='\n' }?”, which lets you handle the ambiguities.
      
      1. Shaddy Zeineddine
        
        2013-03-02
        
        Perfect, thanks!
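
The escape behaviour discussed in this thread can be illustrated with a minimal Python sketch. It is a character-level simulation under stated assumptions, not ANTLR output and not the authors' parser: the names `split_escapes` and `count_line_breaks` and the `escape_newline` flag are hypothetical and do not appear in the grammar. With `escape_newline=True` the sketch follows the quoted rule `escaped : ESCAPE STAR STAR | ESCAPE .`, so a tilde swallows the following newline; with `escape_newline=False` it models the suggested variant in which the linefeed is excluded from what may follow the tilde.

```python
ESCAPE = "~"


def split_escapes(text, escape_newline=True):
    """Split text into ("escaped", value) and ("char", value) tokens.

    Assumption: this only mimics the two alternatives of the quoted
    'escaped' rule on raw characters; it is not a Creole tokenizer.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else None
        if ch == ESCAPE and text.startswith("**", i + 1):
            # First alternative: ESCAPE STAR STAR
            tokens.append(("escaped", "**"))
            i += 3
        elif ch == ESCAPE and nxt is not None and (escape_newline or nxt != "\n"):
            # Second alternative: ESCAPE followed by any single character,
            # optionally excluding the newline (hypothetical variant).
            tokens.append(("escaped", nxt))
            i += 2
        else:
            tokens.append(("char", ch))
            i += 1
    return tokens


def count_line_breaks(text, escape_newline=True):
    """Count newlines that were not consumed by an escape sequence."""
    return sum(
        1
        for kind, value in split_escapes(text, escape_newline)
        if kind == "char" and value == "\n"
    )


sample = "first line ~\nsecond line\n"
print(count_line_breaks(sample, escape_newline=True))
print(count_line_breaks(sample, escape_newline=False))
```

For the sample, the first call counts one remaining newline, because the tilde consumes the first one, and the second call counts two, because the tilde is then treated as an ordinary character in front of a newline.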

Shaddy Zeineddine

2013-02-27

To clarify what I mean: the “**c” part of “ab**c” would not use the text_formattedelement, ital_markup, or text_italcontent symbols, yet it should be rendered in italics.

Dirk Riehle

2013-02-27

Hi Shaddy,
how is ab**c invalid?
Thanks,
Dirk

Shaddy Zeineddine

2013-02-27

Hey Dirk, I’ve noticed a few rules which look odd to me. Please let me know your thoughts. This is one pair:
text_line
:
text_firstelement ( text_element )* text_lineseparator
;
text_element
:
onestar text_unformattedelement
|
text_unformattedelement onestar
|
text_formattedelement
;
Doesn’t this allow the following invalid text_line?
ab**c

Wiki to XML using Ant « PrayogShala

2009-07-16

[…] of Martin Junghans and Dirk Riehle. They did research on this subject. As a result, they created an EBNF grammar and an XML interchange format (Creole […]

johannes busse

2008-01-10

Thank you and your collaborators for this very useful grammar! It looks complex. But Wiki Text in fact *is* complex to parse. … Johannes
