Conclusion: yes, Web search engines can index wikis. GoogleLovesWiki.
Therefore, an HTTP search engine is a useful tool for providing increased search capability for any wiki.
However, the Robots.txt file should be properly configured, and the appropriate META tags should be added to "EditPage" and "EditCopy" so that all robots are excluded from indexing obsolete text, as shown in the example below.
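For example, the HTML head of an edit page might carry a tag like this (the page title here is illustrative; the tag itself is the one quoted in the Aug 2004 note below):

<head>
<title>Edit: SomePage</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>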
Robots.txt can be configured to allow a local search engine to index the site while excluding public Internet search engines, something like this:
To allow a single robot:

User-agent: FriendlyWebCrawler
Disallow:

User-agent: *
Disallow: /

See RobotsDotTxt for a link to the file format spec.
Discussion
Any HTTP search engine can index any wiki; whether it happens in a given case is just a matter of the right combination of configuration, URL space, and page content (META tags). There are no inherent technical limits. (copied from WikiNavigationPattern)
Note the following:
I don't know if there is a way for a wiki (or other Web site) to force a searchbot to update its index.
References
A sample robots.txt (for the Portland Pattern Repository):
# Prevent all robots from getting lost in my databases
User-agent: *
Disallow: /cgi/
Disallow: /cgi-bin/
Comments
Sure, why not? MeatballWiki is indexed by Google. Some searches are bizarre; we got one trolling for kiddie porn once. Recently, Cliff decided to exclude bots from hitting the edit pages, but they are greatly encouraged to index the whole site there. On the other hand, Wiki is not indexed, nor will it likely ever be directly, although it is indexed through the mirrors that aren't blocked. -- SunirShah
I wrote up a document about how to give the wiki pages some "fake" URLs that end in .html. This doc is at http://www.riceball.com/wikiwiki/MakingWikiWorkWithSearchEngines.html -- JohnKawakami
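For illustration only, a common way to create such URLs is an Apache mod_rewrite rule that maps static-looking .html addresses onto the wiki script; the script name and paths below are hypothetical, and John's document describes his actual approach:

# Hypothetical Apache configuration: serve /wiki/PageName.html
# by internally rewriting the request to the wiki's CGI script.
RewriteEngine On
RewriteRule ^/wiki/([A-Za-z0-9]+)\.html$ /cgi-bin/wiki.pl?$1 [PT,L]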
Aug 6, 2004: Google no longer indexes the EditPage or EditCopy links. The wiki software now emits

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

in the HTML header of the Edit page and the Edit Copy page. Google, like all properly-written robots, ignores such pages. See WikiSpam for details.
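As a rough sketch of how a wiki script might decide when to emit that tag (the function and its "action" parameter are hypothetical, not the actual software's code), in Python:

def page_head(title, action):
    # Build an HTML head; edit views get the robots exclusion tag
    # so obsolete, editable text stays out of search indexes.
    lines = ['<html>', '<head>', '<title>%s</title>' % title]
    if action in ('edit', 'editcopy'):
        lines.append('<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">')
    lines.append('</head>')
    return '\n'.join(lines)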
Feb 2, 2005: But, using the above implementation, the spider must still access the EditPage or EditCopy links only to find out that they aren't supposed to be indexed. Now Google and blog software are implementing a rel="nofollow" attribute on links. This could be added to the EditPage link on each page to give robots a heads-up not to even bother accessing the EditPage. What is still unclear is whether rel="nofollow" really means "do not follow this link" to robots, or whether it means "follow this link, but don't give it any credit in your page ranking scheme".
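For example, the edit link on each page could carry the attribute like this (the URL scheme here is illustrative, not any particular wiki's):

<a href="wiki.pl?action=edit&amp;id=SomePage" rel="nofollow">Edit this page</a>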
May 02, 2005: A meta tag tutorial is available at http://www.rahsoftware.com/tutorials/meta_tag_tutorial.aspx