Text Formatting Regular Expressions Discussion

Summary of WikiUseful? TextFormattingRegularExpressions:

 Quoted emphasis:
 ('')(.*?)\1  ->  <em>$2</em>
 Quoted strong:
 (''')(.*?)\1  ->  <strong>$2</strong>
 Preformatted code (space at beginning of line):
 ^ +(.*)\n  ->  <pre>$1</pre>
 with a subsequent match and replace of
 </pre><pre>  ->  
 Ordered list (tab-numeral-space at beginning of line):
 ^\t[0-9]\s*(.*)(?:\n?|$)  ->  <li>$1</li>
 with a subsequent match and replace of
 (?<!</li>)((?:<li>.*</li>)+)(?!<li>)  ->  <ol>$1</ol>

 Unordered list (tab-asterisk-space at beginning of line, with consideration for the previous ordered list parsing):
 ^\t\*\s*(.*)(?:\n?|$)  ->  <li>$1</li>
 with a subsequent match and replace of
 (?<!</li>|<ol>)((?:<li>.*</li>)+)(?!<li>|</ol>)  ->  <ul>$1</ul>
 Standard protocols, not image:
 (http|news|ftp|mailto)\:(\/\/)?((?:\S(?!\.gif|\.jpg|\.jpeg))+)\s  ->  <a href="$1:$2$3">$3</a>

 Standard protocols, inline image:
 (?<!href=")(http|news|ftp|mailto)\:(\/\/)?((?:\S(?!\.gif|\.jpg|\.jpeg|\.png))+\S)(\.gif|\.jpg|\.jpeg|\.png)  ->  <img src="$1:$2$3$4" border="0">
The above are the rules used for HtagWiki. I would be really interested to know what Ward uses, so we could analyse possible problems with the various RegularExpressions. I have also started to realise that after a certain number of rules this approach gets inefficient, and of course the fastest method (that I can think of) would be to implement a custom WikiParser. It doesn't seem to be an issue yet, though. -- SvenNeumann
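As a rough sketch of how the two list rules above interact, here they are in Python. The function name render_lists is made up for illustration, and this is not necessarily how HtagWiki applies them. One adaptation is an assumption on my part: Python's re module only accepts fixed-width lookbehinds, so (?<!</li>|<ol>) is split into two separate lookbehinds, which should be equivalent.

```python
import re

def render_lists(text):
    """Hypothetical helper: apply the ordered rule, then the unordered rule."""
    # Ordered items: tab, one digit, optional whitespace, item text, newline.
    text = re.sub(r"^\t[0-9]\s*(.*)(?:\n?|$)", r"<li>\1</li>", text, flags=re.M)
    # Wrap each run of adjacent <li> items in <ol>.
    text = re.sub(r"(?<!</li>)((?:<li>.*</li>)+)(?!<li>)", r"<ol>\1</ol>", text)
    # Unordered items: tab, asterisk, optional whitespace, item text, newline.
    text = re.sub(r"^\t\*\s*(.*)(?:\n?|$)", r"<li>\1</li>", text, flags=re.M)
    # Wrap the remaining runs in <ul>, skipping items already inside an <ol>.
    # Python needs fixed-width lookbehinds, so the alternation is split in two.
    text = re.sub(
        r"(?<!</li>)(?<!<ol>)((?:<li>.*</li>)+)(?!<li>|</ol>)",
        r"<ul>\1</ul>",
        text,
    )
    return text

if __name__ == "__main__":
    print(render_lists("\t1 one\n\t2 two\ntext\n\t* a\n\t* b\n"))
```

Note that the item rules swallow the trailing newline, so a paragraph that directly follows a list ends up glued to the closing tag. That is harmless in HTML but makes the output harder to read.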


The following paragraph was moved from TextFormattingRegularExpressions.

This page could have been really, genuinely useful to my endeavors the last two days and I was very happy when I found it. Unfortunately I must say I find many of the expressions here dodgy at best, and just wrong at worst. Could we turn this into a resource for the (many) ways that WikiAuthors have implemented this? -- SvenNeumann

Let's use this page to discuss the RegularExpressions and use the other page as a document. Great idea! Maybe I'm just a bit slow, but some of these RegExes give me a splittin' headache :)


The emphasis RegularExpression /'{3}(.*?)'{3}/ looks like it would match situations where the apostrophes contain nothing. Wouldn't /'{3}(.*)'{3}/ be simpler? And the horizontal line RegularExpression should match exactly 4 dashes, not 4 or more. -- AdewaleOshineye
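A quick way to compare the two variants is to run both on a line with two marked spans. The sketch below uses Python's re module; the sample line and the names lazy, greedy and non_empty are made up for illustration. It suggests the greedy form is not a drop-in replacement, and that both forms accept an empty span. The (.+?) variant is my assumption about one way to refuse empty spans, not something from the original page.

```python
import re

line = "'''one''' and '''two'''"

# Lazy quantifier: each pair of triple apostrophes is matched on its own.
lazy = re.sub(r"'{3}(.*?)'{3}", r"<strong>\1</strong>", line)

# Greedy quantifier: the match runs from the first ''' to the last one.
greedy = re.sub(r"'{3}(.*)'{3}", r"<strong>\1</strong>", line)

# Both forms accept an empty span between the apostrophes.
empty_lazy = re.sub(r"'{3}(.*?)'{3}", r"<strong>\1</strong>", "''''''")
empty_greedy = re.sub(r"'{3}(.*)'{3}", r"<strong>\1</strong>", "''''''")

# Assumed alternative: require at least one character between the markers.
non_empty = re.sub(r"'{3}(.+?)'{3}", r"<strong>\1</strong>", "''''''")

if __name__ == "__main__":
    for label, value in [("lazy", lazy), ("greedy", greedy),
                         ("empty_lazy", empty_lazy),
                         ("empty_greedy", empty_greedy),
                         ("non_empty", non_empty)]:
        print(label, "->", value)
```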

Also, in HtagWiki I use backreferencing to make it easy to match several alternate formatting rules, like:

 (''|//)(.*?)\1
for example, to allow the slash notation too in one simple expression. Also, I believe matching blank apostrophe parts is accurate too, as long as the rules are ordered with the expression gobbling the most quotes first. -- sn
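To illustrate the ordering point, here is a minimal Python sketch. The RULES table and the render_inline name are hypothetical, and Python writes the replacement group as \2 rather than $2. The strong rule runs before the emphasis rule, and the backreference \1 makes sure a span opened with '' is not closed by //.

```python
import re

# Rules are applied top to bottom; the one that gobbles the most quotes is first.
RULES = [
    (re.compile(r"(''')(.*?)\1"), r"<strong>\2</strong>"),
    (re.compile(r"(''|//)(.*?)\1"), r"<em>\2</em>"),
]

def render_inline(text, rules=RULES):
    """Hypothetical helper: run each (pattern, replacement) pair in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

if __name__ == "__main__":
    print(render_inline("''a'' //b// '''c'''"))
    # Reversed order, to show why the strong rule has to come first.
    print(render_inline("'''c'''", rules=list(reversed(RULES))))
```

One caveat with the slash notation: a line containing two URLs has two // sequences, so the emphasis rule would probably need to run after the link rules or exclude protocol prefixes.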


Let's try this one:

 ^ +(.*)\n  ->  <pre>$1</pre>
to mark all monospaced lines and
 </pre><pre>
for deletion afterwards, to merge the lines into a single block. Is that accurate?
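One way to check is to try it. In the Python sketch below, render_pre and its joiner parameter are made-up names. Deleting </pre><pre> outright appears to run the lines together, so the default replaces the pair with a newline instead. That default is my assumption, not part of the rule as posted.

```python
import re

def render_pre(text, joiner="\n"):
    """Hypothetical helper: wrap space-indented lines, then merge adjacent blocks."""
    # Mark every line that starts with one or more spaces.
    text = re.sub(r"^ +(.*)\n", r"<pre>\1</pre>", text, flags=re.M)
    # Merge neighbouring blocks; joiner="" reproduces plain deletion.
    return text.replace("</pre><pre>", joiner)

if __name__ == "__main__":
    sample = " a\n b\ntext\n"
    print(repr(render_pre(sample)))
    print(repr(render_pre(sample, joiner="")))
```

Two other things to watch: the pattern needs a trailing newline, so an indented last line without one is missed, and the greedy run of spaces strips all leading indentation inside the block.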


What is an efficient, quick-and-dirty way to implement the spell checker? I tried a RegularExpression, but that took over 14 seconds for a small page and a 270k dictionary. A simple search for each word takes only 0.4 seconds. Of course indexing would speed this up. But perhaps I'm approaching this from the wrong end too. How is simple spell checking usually implemented? -- sn

By the users... it's a wiki, after all.
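For those who still want a machine to do it, the usual quick-and-dirty approach, as far as I know, is to load the dictionary into a hash set once and test each word of the page for membership. That is the "indexing" mentioned above. A minimal Python sketch follows; the misspelled function, the WORD pattern and the tiny word list are all made up for illustration, and real tokenising rules would need more care.

```python
import re

# Assumed tokeniser: runs of ASCII letters with an optional inner apostrophe.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def misspelled(page_text, dictionary_words):
    """Hypothetical helper: return unknown words in order of first appearance."""
    known = {word.lower() for word in dictionary_words}  # built once per run
    seen = set()
    unknown = []
    for word in WORD.findall(page_text):
        key = word.lower()
        if key not in known and key not in seen:
            seen.add(key)
            unknown.append(word)
    return unknown

if __name__ == "__main__":
    words = ["the", "quick", "brown", "fox", "again", "isn't", "it"]
    print(misspelled("The quick brwon fox. Brwon again, isn't it?", words))
```

Each lookup is a constant-time hash probe on average, so the cost grows with the length of the page rather than the size of the dictionary. WikiNames would be flagged as unknown unless they are added to the word list or filtered out first.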


Last edited February 13, 2004