Text Formatting Regular Expressions

This page complements TextFormattingRules by exposing the exact regular expressions used by Ward's wiki when it was founded. If you do not find these useful, you can also consult TheWikiWay or any number of open source wiki implementations for alternatives.


We look for paragraph boundaries...

/^\s*$/  
(matches empty lines and lines containing only whitespace. We don't look for the paragraph as a whole because the formatting script parses the text line by line.)
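
A minimal Python sketch of how a line-by-line pass might use this pattern to group lines into paragraphs. The function name split_paragraphs and the choice to join lines with a single space are illustrative assumptions, not part of the original script.

```python
import re

BLANK_LINE = re.compile(r"^\s*$")  # the paragraph-boundary pattern from this page

def split_paragraphs(text):
    """Group lines into paragraphs, breaking on empty or whitespace-only lines."""
    paragraphs, current = [], []
    for line in text.split("\n"):
        if BLANK_LINE.match(line):
            # A boundary line closes the paragraph collected so far, if any.
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Consecutive boundary lines collapse into a single break, because an empty paragraph is never emitted.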

definition, unordered and ordered lists...

/^(\t+)(.+):\t/
(matches lines beginning with one or more tabs, followed by one or more characters, a colon, and a tab. The line in question does not have to be tab-final, and usually isn't)
/^(\t+)\*/
/^(\t+)\d+\.?/
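
As a rough illustration, the three list patterns can be tried in turn against each line, with the length of the captured tab run giving the nesting depth. The names classify, "dl", "ul" and "ol", and the order in which the patterns are tried, are assumptions made for this sketch.

```python
import re

DEFINITION = re.compile(r"^(\t+)(.+):\t")
UNORDERED = re.compile(r"^(\t+)\*")
ORDERED = re.compile(r"^(\t+)\d+\.?")

def classify(line):
    """Return (kind, depth) for a list line, or None for any other line."""
    for kind, pattern in (("dl", DEFINITION), ("ul", UNORDERED), ("ol", ORDERED)):
        m = pattern.match(line)
        if m:
            # Group 1 is the run of leading tabs; its length is the nesting depth.
            return kind, len(m.group(1))
    return None
```

For example, classify("\t\t* nested") returns ("ul", 2), and a line without leading tabs returns None.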

preformatted text...
/^\s/
(matches lines beginning with a whitespace character)

emphasis...

/'{3}(.*?)'{3}/
/'{2}(.*?)'{2}/
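
A short Python sketch of applying the two emphasis patterns in the order listed. The tag names strong and em and the function name emphasize are assumptions for illustration.

```python
import re

def emphasize(line):
    """Apply the triple-quote pattern before the double-quote pattern."""
    # Order matters: running the two-quote pattern first on '''bold'''
    # would pair up the wrong quotes and leave stray ones behind.
    line = re.sub(r"'{3}(.*?)'{3}", r"<strong>\1</strong>", line)
    line = re.sub(r"'{2}(.*?)'{2}", r"<em>\1</em>", line)
    return line
```

The non-greedy (.*?) keeps each match as short as possible, so two emphasized spans on one line are handled separately.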
horizontal rules...
/-{4,}/
external and internal links...
/\[(\d)\]/
/[A-Z][a-z]+([A-Z][a-z]+)+/
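
To see what these two patterns accept, here is a small Python sketch. The inner group of the WikiWord pattern is made non-capturing so that findall returns whole matches; find_links is a hypothetical helper name.

```python
import re

REFERENCE = re.compile(r"\[(\d)\]")                    # numbered link such as [1]
WIKI_WORD = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+)+")  # two or more capitalized parts

def find_links(text):
    """Return (wiki_words, reference_numbers) found in text."""
    wiki_words = WIKI_WORD.findall(text)
    references = [int(d) for d in REFERENCE.findall(text)]
    return wiki_words, references
```

The reference pattern accepts a single digit only, so [12] is not a reference. The WikiWord pattern is unanchored and purely lexical, so a name such as McDonald also qualifies.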
in-place hyperlinks...
/\b((http)|(ftp)|(mailto)|(news)):[^\s\<\>\[\]"'\(\)]*[^\s\<\>\[\]"'\(\)\,\.]/
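
The final character class is what keeps trailing punctuation out of a link. A Python sketch follows, assuming the character classes are meant to exclude literal < and > (as the ORO variant on this page writes them); linkify is a hypothetical name.

```python
import re

# Scheme, then any run of allowed characters, ending on a character
# that is not a comma or full stop.
URL = re.compile(r"""\b(?:http|ftp|mailto|news):[^\s<>\[\]"'()]*[^\s<>\[\]"'(),.]""")

def linkify(text):
    """Wrap each in-place URL in an anchor, leaving trailing punctuation outside."""
    return URL.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)
```

A comma or full stop inside a URL is still accepted; only the last character is restricted, so sentence punctuation right after a link stays outside the anchor.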
the search field macro...
/\[Search\]/
characters that could mess up our html... (what about the ampersand, &?)
/</
/>/
and, when asked, spans of spaces to become tabs ...
/ {3,8}/
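
The space-to-tab conversion is a single substitution. A minimal Python sketch, with spaces_to_tabs as a hypothetical name:

```python
import re

def spaces_to_tabs(line):
    """Replace each run of three to eight spaces with one tab."""
    return re.sub(r" {3,8}", "\t", line)
```

Runs shorter than three spaces are left alone, and a run of nine spaces becomes one tab followed by one leftover space, since the pattern consumes at most eight at a time.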


The HTML specs say "In SGML, it is possible to eliminate the final ";" after a character reference in some cases .... We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present." http://www.w3.org/TR/html4/charset.html#entities

The specs also say that an ampersand which is not followed by an alphanumeric is to be displayed as is. -- PLA [[Cite? I can't find evidence to support this -- Danil]]

So, in fact, /&[^;]+\(\b[^;]\|$\)/, if I've not forgotten my perl, needs to be trapped and fixed. Then again, the patterns given genuinely recognize text which 'could mess up our html', which is what was stipulated - the dodgy texts matched don't need to be valid to serve as sentinels for trouble ... -- Eddy.

The HTML::Entities module neatly sidesteps the issue by doing all the hard work for you. If you like modules, that is. -- Paul
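
One way to read the suggestion above is to escape only those ampersands that do not already begin a well-formed, semicolon-terminated character reference. The Python sketch below makes that assumption; escape_markup and BARE_AMP are hypothetical names, and whether existing references should pass through untouched is a design choice, not something the original script did.

```python
import re

# An ampersand NOT followed by "name;" or "#digits;" is treated as bare.
BARE_AMP = re.compile(r"&(?!#?[A-Za-z0-9]+;)")

def escape_markup(text):
    """Escape bare ampersands, then angle brackets."""
    # Ampersands go first, so the entities produced for < and > below
    # are never examined by the ampersand pattern.
    text = BARE_AMP.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")
```

If every ampersand should be escaped unconditionally instead, Python's standard html.escape(text, quote=False) does that for &, < and >.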


Java ORO BUG: If anyone is using Jakarta ORO to hack a Java-based Wiki, beware that it doesn't handle the in-place hyperlinks regexp as expected. Replace it with something like this instead:

/\b((http:)|(ftp:)|(mailto:)|(news:))[^\s\<\>\[\]"'\(\)]*[^\s\<\>\[\]"'\(\)\,\.]/
-- NiclasOlofsson


Could somebody please point me to the right way of handling external links that contain a wikiword?

Currently, I replace all

 (http|news|ftp|mailto)\:(\/\/)?((?:\S(?!\.gif|\.jpg|\.jpeg))+)\s
with
 <a href="$1:$2$3">$3</a>
but if the URL is, for example, http://c2.com/cgi/wiki?WikiPage, the link later gets garbled by my WikiWord replacement, which replaces
 (\b[A-Z][a-z]+[A-Z][A-Za-z]*\b)
with
 <a href="/$1">$1</a>
Could somebody please point me to a solution? Thanks. -- SvenNeumann

Before processing raw page text QuickiWiki extracts links, stores them in a list, and replaces each with a placeholder. Once text formatting is done, it restores the links, removing the placeholders. You can also use this technique to implement literal text (e.g., code blocks).
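
The placeholder technique can be sketched in Python as follows. This is not QuickiWiki's code: the NUL-delimited marker, the URL and WikiWord patterns, and the name format_links are all assumptions chosen for illustration.

```python
import re

URL = re.compile(r"""\b(?:http|ftp|mailto|news):[^\s<>\[\]"'()]*[^\s<>\[\]"'(),.]""")
WIKI_WORD = re.compile(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b")
RESTORE = re.compile(r"\x00(\d+)\x00")  # hypothetical marker: NUL, index, NUL

def format_links(text):
    """Link URLs and WikiWords without letting one rule damage the other."""
    saved = []

    def stash(match):
        # Remember the URL and leave an index-bearing placeholder in its place.
        saved.append(match.group(0))
        return "\x00{}\x00".format(len(saved) - 1)

    text = URL.sub(stash, text)                                  # 1. extract links
    text = WIKI_WORD.sub(r'<a href="/\g<0>">\g<0></a>', text)    # 2. format the rest
    # 3. restore the links, now wrapped in anchors
    return RESTORE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(saved[int(m.group(1))]), text)
```

Because the WikiWord inside http://c2.com/cgi/wiki?WikiPage is hidden behind the placeholder during step 2, it is never rewritten.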


SvenNeumann, if you define a WikiWord as:

/(\s)([A-Z][a-z]+[A-Z][A-Za-z]*)/

and put the captured leading whitespace back in the replacement, it seems to work fine, and avoids wikiWordsWithinWordsAndurls. It does have the adverse effect of .WikiWord not being a WikiWord.

-- DanielStaudigel

Hi DanielStaudigel. Why exactly do I want to capture whitespace at the beginning? A WikiWord might not have whitespace before it (but punctuation instead). --SvenNeumann
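
Both sides of this exchange can be checked with a small Python sketch of the whitespace-prefixed pattern. The replacement template and the name link_wiki_words are assumptions.

```python
import re

# A WikiWord counts only when preceded by whitespace, which is captured as group 1.
PATTERN = re.compile(r"(\s)([A-Z][a-z]+[A-Z][A-Za-z]*)")

def link_wiki_words(text):
    """Link WikiWords that follow whitespace, putting the whitespace back."""
    return PATTERN.sub(r'\1<a href="/\2">\2</a>', text)
```

A WikiWord inside a URL is left alone, as intended, but so is one that follows punctuation or starts the text, which is the limitation raised in the reply above.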

Implementations I have seen replace all URLs temporarily with some implausible-to-type combination of ASCII characters, do the WikiWord substitution, then put the URLs, with HTML, back in. (An alternative is to use an excessively complex Perl regex that handles WikiWords and URLs simultaneously using executed substitutions or some such. Or just use m/\G ... /gc) -- Chris Purcell (KritTer)
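
The single-pass alternative can be approximated in Python with one alternation and a replacement callback, instead of Perl's \G matching. The group names url and wiki, the link targets, and the name format_links_one_pass are assumptions for this sketch.

```python
import re

TOKEN = re.compile(r"""
    (?P<url>\b(?:http|ftp|mailto|news):[^\s<>\[\]"'()]*[^\s<>\[\]"'(),.])
  | (?P<wiki>\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b)
""", re.VERBOSE)

def _link(match):
    # Exactly one of the two named groups is set for any given match.
    if match.group("url"):
        target = label = match.group("url")
    else:
        label = match.group("wiki")
        target = "/" + label
    return '<a href="{}">{}</a>'.format(target, label)

def format_links_one_pass(text):
    """Link URLs and WikiWords in a single left-to-right scan."""
    return TOKEN.sub(_link, text)
```

Since the scan resumes after each match, a WikiWord that is part of a URL is consumed by the url branch and never offered to the wiki branch.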


I.e. using regular expressions in the context of text formatting. Some such things are trivial, others are deceptively slippery.

This page could have been really, genuinely useful to my endeavors the last two days and I was very happy when I found it. Unfortunately, I must say I find many of the expressions here dodgy at best, and just wrong at worst. Could we turn this into a resource for the (many) ways that WikiAuthors have implemented this? -- SvenNeumann

See the influential RobPike paper on StructuralRegularExpressions, though that does not appear to be the idea of this page.


(bug report moved to WikiWikiBugs)


See TextFormattingRegularExpressionsDiscussion, AlternativesToRegularExpressions


CategoryWiki


(last edited December 23, 2007)