Text Formatting Regular Expressions Discussion

Summary of WikiUseful? TextFormattingRegularExpressions:

Expressions for inline tags

 Quoted emphasis:
 ('')(.*?)\1  ->  <em>$2</em>
 Quoted strong:
 (''')(.*?)\1  ->  <strong>$2</strong>
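
The two inline rules above can be tried out with a minimal Python sketch. The function name `inline_tags` is only illustrative, and `$2` is written as `\2` because that is how Python's `re` module spells backreferences in the replacement. The strong rule is applied first, on the assumption that the rule consuming the most quotes should run before the shorter one.

```python
import re

def inline_tags(text: str) -> str:
    """Apply the quoted-strong and quoted-emphasis rules, longest quote run first."""
    # Strong first: otherwise '' would match the first two quotes of '''.
    text = re.sub(r"(''')(.*?)\1", r"<strong>\2</strong>", text)
    # Emphasis second, on whatever double quotes remain.
    text = re.sub(r"('')(.*?)\1", r"<em>\2</em>", text)
    return text

print(inline_tags("a ''b'' and '''c'''"))
```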

Expressions for block level tags

 Preformatted code (space at beginning of line):
 ^ +(.*)\n  ->  <pre>$1</pre>
 with a subsequent match and replace of
 </pre><pre>  ->  (empty string)
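
A minimal Python sketch of the two-pass preformatted rule, assuming multiline matching so that `^` applies to every line. The name `preformat` and the `seam` parameter are hypothetical. With the default empty seam the code follows the rule as written, which also means the consumed `\n` is gone and merged lines run together; passing `seam="\n"` is one possible variation that keeps the line breaks.

```python
import re

def preformat(text: str, seam: str = "") -> str:
    """Wrap space-indented lines in <pre>, then merge adjacent blocks."""
    # Pass 1: each line starting with one or more spaces becomes its own <pre> block.
    # The trailing newline is consumed, so neighbouring blocks end up adjacent.
    text = re.sub(r"^ +(.*)\n", r"<pre>\1</pre>", text, flags=re.MULTILINE)
    # Pass 2: replace the join between adjacent blocks with the seam
    # (empty by default, as in the rule above).
    return text.replace("</pre><pre>", seam)

print(preformat(" a\n b\nc\n"))             # rule as written
print(preformat(" a\n b\nc\n", seam="\n"))  # variation that keeps line breaks
```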

Expressions for nested group tags (such as bullets)

 Ordered list (tab-numeral-space at beginning of line):
 ^\t[0-9]\s*(.*)(?:\n?|$)  ->  <li>$1</li>
 with a subsequent match and replace of
 (?<!</li>)((?:<li>.*</li>)+)(?!<li>)  ->  <ol>$1</ol>

 Unordered list (tab-asterisk-space at beginning of line with consideration for previous unordered list parsing):
 ^\t\*\s*(.*)(?:\n?|$)  ->  <li>$1</li>
 with a subsequent match and replace of
 (?<!</li>|<ol>)((?:<li>.*</li>)+)(?!<li>|</ol>)  ->  <ul>$1</ul>
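
The list rules can be sketched in Python as follows. Two adaptations are assumptions of this sketch rather than part of the rules above: Python's `re` module only accepts fixed-width lookbehinds, so `(?<!</li>|<ol>)` is written as two consecutive lookbehinds, and `$1` becomes `\1`. The function name `convert_lists` is illustrative.

```python
import re

def convert_lists(text: str) -> str:
    """Run the ordered-list passes first, then the unordered-list passes."""
    # Ordered items: tab, one digit, optional whitespace, then the item text.
    text = re.sub(r"^\t[0-9]\s*(.*)(?:\n?|$)", r"<li>\1</li>", text, flags=re.MULTILINE)
    # Wrap each run of <li> items that is not already inside a run.
    text = re.sub(r"(?<!</li>)((?:<li>.*</li>)+)(?!<li>)", r"<ol>\1</ol>", text)
    # Unordered items: tab, asterisk, optional whitespace, then the item text.
    text = re.sub(r"^\t\*\s*(.*)(?:\n?|$)", r"<li>\1</li>", text, flags=re.MULTILINE)
    # Wrap the remaining runs, skipping items that already sit inside <ol>...</ol>.
    text = re.sub(r"(?<!</li>)(?<!<ol>)((?:<li>.*</li>)+)(?!<li>|</ol>)", r"<ul>\1</ul>", text)
    return text

print(convert_lists("\t1 first\n\t2 second\nplain\n\t* a\n\t* b\n"))
```

Because each item rule consumes its trailing newline, text that follows a list ends up on the same line as the closing tag.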

Expressions for parsing links

 standard protocols, not image:
 (http|news|ftp|mailto)\:(\/\/)?((?:\S(?!\.gif|\.jpg|\.jpeg))+)\s  ->  <a href="$1:$2$3">$3</a>

 standard protocols, inline image:
 (?<!href=")(http|news|ftp|mailto)\:(\/\/)?((?:\S(?!\.gif|\.jpg|\.jpeg|\.png))+\S)(\.gif|\.jpg|\.jpeg|\.png)
 ->  <img src="$1:$2$3$4" border="0">
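
A minimal Python sketch of the two link rules, applied in the order listed. The names `LINK`, `IMAGE` and `link_tags` are illustrative. Two things become visible when running it: the link rule consumes the whitespace after the URL, and it does not exclude `.png`, so a `.png` URL followed by whitespace would become a link rather than an image.

```python
import re

# Patterns copied from the rules above; only the replacement syntax ($1 -> \1) changes.
LINK = re.compile(r"(http|news|ftp|mailto)\:(\/\/)?((?:\S(?!\.gif|\.jpg|\.jpeg))+)\s")
IMAGE = re.compile(
    r'(?<!href=")(http|news|ftp|mailto)\:(\/\/)?'
    r"((?:\S(?!\.gif|\.jpg|\.jpeg|\.png))+\S)(\.gif|\.jpg|\.jpeg|\.png)"
)

def link_tags(text: str) -> str:
    """Turn plain URLs into anchors, then image URLs into inline images."""
    # In Python 3.5+, an unmatched optional group (\2 for mailto:) is replaced by "".
    text = LINK.sub(r'<a href="\1:\2\3">\3</a>', text)
    # The lookbehind keeps URLs already placed inside href="..." from matching again.
    text = IMAGE.sub(r'<img src="\1:\2\3\4" border="0">', text)
    return text

print(link_tags("see http://example.com/page now"))
print(link_tags("pic http://example.com/a.gif here"))
```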

Expressions for alternate formatting rules (i.e. rules not native to TheOriginalWiki)

The above are the rules used for HtagWiki. I would be really interested to know what Ward uses, so we could analyse possible problems with the various RegularExpressions. I have also started to realise that past a certain number of rules this approach gets inefficient, and of course the fastest method (that I can think of) would be to implement a custom WikiParser. It doesn't seem to be an issue yet, though. -- SvenNeumann

The following paragraph was moved from TextFormattingRegularExpressions.

This page could have been really, genuinely useful to my endeavors over the last two days, and I was very happy when I found it. Unfortunately, I must say I find many of the expressions here dodgy at best and just wrong at worst. Could we turn this into a resource for the (many) ways that WikiAuthors have implemented this? -- SvenNeumann

Let's use this page to discuss the RegularExpressions and use the other page as a document. Great idea! Maybe I'm just a bit slow, but some of these RegExes give me a splittin' headache :)

The emphasis RegularExpression /'{3}(.*?)'{3}/ looks like it would also match when the apostrophes contain nothing. Wouldn't /'{3}(.*)'{3}/ be simpler? And the horizontal-line RegularExpression should match exactly 4 dashes, not 4 or more. -- AdewaleOshineye
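
To compare the two candidates, here is a small Python sketch; the helper name `contents` is hypothetical. It suggests that the greedy `(.*)` form still matches empty content and additionally swallows everything between the first and last marker on a line, whereas `(.+?)` would be one way to reject empty content.

```python
import re

def contents(pattern: str, text: str) -> list:
    """Return what the single capture group matched, for every match in text."""
    return re.findall(pattern, text)

line = "'''one''' and '''two'''"
print(contents(r"'{3}(.*?)'{3}", line))      # lazy: each marked span separately
print(contents(r"'{3}(.*)'{3}", line))       # greedy: one match spanning both
print(contents(r"'{3}(.*?)'{3}", "''''''"))  # six apostrophes: empty content matches
print(contents(r"'{3}(.+?)'{3}", "''''''"))  # .+? requires at least one character
```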

Also, in HtagWiki I use backreferences to make it easy to match several alternate formatting rules, like:

 (''|//)(.*?)\1

for example, to allow the slash notation as well in one simple expression. Also, I believe matching blank apostrophe parts is accurate too, as long as the rules are ordered so that the expression gobbling the most quotes runs first. -- sn
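
A brief Python sketch of the backreference idea; the names `EMPHASIS` and `emphasize` are illustrative. The `\1` forces the closing marker to be the same as the opening one, so a span opened with `''` is not closed by `//`. Note that URLs such as `http://...` also contain `//`, so in practice this rule would presumably need to run after link parsing or otherwise guard against them.

```python
import re

# Either '' or // opens the span; \1 requires the same marker to close it.
EMPHASIS = re.compile(r"(''|//)(.*?)\1")

def emphasize(text: str) -> str:
    """Replace ''...'' and //...// spans with <em>...</em>."""
    return EMPHASIS.sub(r"<em>\2</em>", text)

print(emphasize("''a'' and //b//"))
print(emphasize("''a// b''"))  # the stray // stays inside the span
```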

Let's try this one:

 ^ +(.*)\n  ->  <pre>$1</pre>

to mark all monospaced lines and

 </pre><pre>

for deletion afterwards, to merge the lines into a single block. Is that accurate?

What is an efficient, quick and dirty way to implement the spell checker? I tried a RegularExpression, but that took over 14 seconds for a small page and a 270k dictionary. A simple search for each word takes only 0.4 seconds. Of course, indexing would speed this up. But perhaps I'm approaching this from the wrong end, too. How is simple spell checking usually implemented? -- sn

By the users... it's a wiki, after all.
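
The indexing idea from the question could be sketched in Python roughly as follows. This is only one possible approach, not how any particular wiki does it: the dictionary is loaded into a hash set once, and each word on the page is then checked by a single set lookup. The function name `misspelled` and the simple letters-only tokenizer are assumptions of this sketch.

```python
import re

def misspelled(page_text: str, dictionary_words) -> list:
    """Return the unknown words of page_text, lowercased, in order of first appearance."""
    # Build the index once; membership tests on a set take roughly constant time.
    known = {word.lower() for word in dictionary_words}
    unknown = []
    seen = set()
    # Assumption: a "word" is a run of ASCII letters.
    for word in re.findall(r"[A-Za-z]+", page_text):
        key = word.lower()
        if key not in known and key not in seen:
            seen.add(key)
            unknown.append(key)
    return unknown

print(misspelled("The quick brwon fox. The fox!", ["the", "quick", "brown", "fox"]))
```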