Google Hates Wiki

Google giveth, and Google taketh away.

Right now GoogleLovesWiki, pretty much. See:

http://www.google.com/search?q=google+loves+wiki

See GoogleSearch for a c2 Wiki search experiment performed 2005-04-25.

As of June 2005, Google doesn't appear to love Wiki very much any more. Pages that used to rank highly in Google's index no longer seem to. I don't know if this is due to some new change in how PageRank is determined, a change in c2's characteristics that has resulted in it being identified as a LinkFarm, a failure (for some reason) of the Google 'bot to index c2, or what. Yahoo also seems not to rank c2 highly anymore; search.msn.com still ranks pages here highly.

Jul05 GoogleLovesWikiNot

I can no longer do C2 research using Google. Lots of pages are not found. :(

But YahooLovesWiki.

Google, Yahoo, and MSN Search tried 2005-08-27:

Either our c2 Wiki is broken or GoogleSearch has come up with a Wiki-unfriendly algorithm for indexing pages.

Go to Google and search, for example, for "site:c2.com microsoft slave". When there's a space between 'microsoft' and 'slave', Google should turn up the Wiki page whose title is 'Microsoft Slave'. In fact, I bet that page should show up first in the list. It doesn't show up at all. All that show up are a few "Signatures" pages, and signature pages are really quite useless from a Google-user standpoint. You'll also notice that all the entries are "Supplemental Results." That means they're coming from old indexes Google has lying around, not from Google's most recent attempt at indexing Wiki. If you look at each "Cached" page you'll see the date that each useless signature page was indexed. They're all at least a few months old instead of just a few weeks old. I don't think I've spotted anything newer than March 2005 in the GoogleCache.

So try searching for "site:c2.com microsoftslave". Don't you suppose you should see all the pages where dl signed himself as MicrosoftSlave? You don't. The first page in the list is--guess what?--the MicrosoftSlave page. That's nice, but there's no text summary and no cache. I guess it means the page was found but not indexed, and who knows when it was found. The search also turns up a few useless "Signatures" pages, a few old "Changes in ..." pages, and our "Long Pages" script. These useless extra results are all "Supplemental Results," i.e., they're old.

Dogpile (which also reports results from Yahoo, MSN Search, and Ask Jeeves) provides little of interest, either. MSN Search seems to have a fair amount of success, and its results are from just a couple weeks ago.

Maybe Google has been up to something in the past few months. On the other hand, the new meta robots HTML tag we've been inserting in pages might be keeping some search spiders from indexing our site frequently or usefully. That at least would explain why the MicrosoftSlave page was found by Google but not indexed. A meta tag is only a tag. There's no guarantee any given SearchEngine will pay attention to it. Maybe MSN Search ignores it.

I'm in favor of eliminating the "meta" tricks for a couple months and seeing how Google responds. We haven't been blasted by spam in some months now, anyway.

-- ElizabethWiethoff

Google searched 2005-08-28:

I'm going to go ahead and be vain here. Do a GoogleSearch for "elizabethwiethoff site:c2.com". Besides some ancient "Changes In ..." pages and "Signatures" pages, Google turns up little of interest (OrientedVsOrientated, JonThoroddsen, ScAza, AnandJain), and all results are "Supplemental Results", i.e., old. Now try searching "elizabethwiethoff site:bookshelved.org". Google turns up 27 interesting, useful pages, none from its "Supplemental Results." That means the pages were indexed recently, and checking the GoogleCache you can see they were indexed within a couple weeks.

If Google changed its algorithm for indexing wikis, I believe its BookShelved results would be just as pitiful as its c2 wiki results. Therefore, I think it's c2 wiki that's broken. Our problem might be our "meta" tricks. Or maybe our IP blocks block out Google!

-- ElizabethWiethoff

Good point. BTW I've noticed the kind of problem you've been discussing for quite a while now; it's really cramped my style on searching c2. -- DougMerritt

2005-08-28: I haven't spotted anything from our Wiki in the GoogleCache more recent than March 10, 2005. I think Wiki's been broken somehow--maybe blocking Google's spider 'bots (and Yahoo's)--since shortly after March 10. I'm sending a note to Ward... -- Eliz

It must be the wiki itself. The c2 site proper is well indexed and has a good PageRank (I mean index.html and the HTML pages linked from there). But all wiki pages (e.g., http://c2.com/cgi/wiki?PortlandPatternRepository) have PageRank zero. Could it be that:

URLs containing a "?" are no longer indexed by Google?
- Nope. That's not the problem. http://www.rubygarden.org/ruby?ElizabethWiethoff and http://bookshelved.org/cgi-bin/wiki.pl?ElizabethWiethoff are numbers 1 and 2 if you search Google for elizabethwiethoff.
cgi/wiki returns some cache timeout or the like that is unusual or is interpreted as not indexable?
robots.txt accidentally excludes the bot entirely?
- The RobotsDotTxt file appears to be fully standard, excluding only /wiki/history and some private folders of Ward's.

-- .gz
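
The robots.txt theory can be checked offline. Here is a minimal sketch using Python's standard urllib.robotparser. The SAMPLE_ROBOTS_TXT text is an assumption modeled on the description above (the /private/ path is made up); it is not a copy of the real http://c2.com/robots.txt. The helper name is_allowed is hypothetical too.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, modeled on the description in this thread.
# The real file's contents are not reproduced here.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/history
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://c2.com/cgi/wiki?PortlandPatternRepository"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", page))
```

If a file like this allows the ordinary wiki pages, robots.txt is probably not what is keeping the spider out, and the cause lies elsewhere.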

An interesting page to read at Google: http://www.google.com/intl/en/webmasters/index.html

One possibility is that c2.com, during a spam attack, managed to get on Google's naughty list; Ward might ask Google whether c2 is indeed on it.

I was at Google on August 29, 2005, talking to a friend. He said the Google spiders always run around as fast as they can. Another friend (not a Google person) said that you're supposed to use the RobotsDotTxt file to limit the speed. In any case, there's nothing about speed in http://c2.com/robots.txt. But the Wiki server has a few roadblocks to flag and stop speeders. (See the MoreAboutWikiAccess pages, particularly WikiAccessDenied.) Maybe some months ago the Googlebot started running too fast for Ward's roadblocks and he inadvertently added it to the banned IP list. -- ElizabethWiethoff

"...he inadvertently added it [googlebot] to the banned IP list." BINGO!

I notice that the Yahoo search results of Wiki are as pathetic as Google's. I bet the Yahoo bot was inadvertently banned as well.

Yahoo tells you how to use RobotsDotTxt to slow down the crawl rate: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html You specify a "Crawl-delay" for the search spider user agent. This crawl delay directive in robots.txt was invented by MSNSearch (EmbraceAndExtend!) and is honored by msnbot, Slurp (Yahoo), and Teoma (Ask Jeeves). Google, as far as I can tell, does not honor it.

Results of an interesting study done in the spring of 2005. The major search spiders behave quite differently with RobotsDotTxt: http://rss.publish101.com/articles/pages/slurp/teoma/site/search/Mike-Banks-Valentine/112391/

-- ElizabethWiethoff
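
For the curious, a small sketch of how a "Crawl-delay" line is read, again with Python's standard urllib.robotparser. The robots.txt text and the delay values are invented for illustration; nothing here is taken from c2's actual file, and crawl_delay_for is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the agents come from the discussion above,
# the delay values (in seconds) are made up.
SAMPLE_ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 5

User-agent: *
Disallow: /wiki/history
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    for agent in ("Slurp", "msnbot", "Googlebot"):
        print(agent, crawl_delay_for(SAMPLE_ROBOTS_TXT, agent))
```

In this sample Googlebot gets None only because no entry gives it a delay. Whether a spider obeys a delay it is given is entirely up to the spider.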

GoogleHatesWiki may have something to do with procedures Ward has applied to his server since Feb05. It creates lots of difficulties in various wiki-related tasks that I participate in.

Example: MakeRoomForAllViewpoints difficulty

I scan a lot of articles / news items. The ones I think could be of interest to multiple persons are logged in a C2 page. Sometimes, I come across a viewpoint that may be different from the ones expressed and could be useful. Without a Google search, it becomes much more difficult for me to track down where I saw the first link. This is especially true for a large topic like WebServices that has many pages. -- DavidLiu

GoogleHatesWiki was first noticed by me (in C2 pages terms) in May05. I mentioned it to Ward around then but I did not get any feedback. My experiences in Apr05 appeared fine. DM said he noticed the same maybe a bit earlier. I do think a sharp deterioration occurred in May because I use Google search on C2 many times a day.

In Aug05, other C2 members started complaining about it. So I made another note on Ward's home page. Hope something positive comes out of this.

I too noticed it quite a while before people started talking about it in August, although I don't recall quite when. Probably most people assumed it was something temporary... for a long time... before speaking up. Multiple people have by now mentioned that they notified Ward.

Problem is, Ward is notoriously so busy with multiple things that it takes him typically a minimum of months (sometimes years, if a solution suitable to his philosophy and aesthetics is not obvious) to find the time to do things to c2 even when he's indicated they are high priority for him.

So I think we must be prepared to exercise patience. -- DougMerritt

That's funny. It never occurred to me that Ward might be at fault. I assumed that it was an algorithm change on the part of Google (I distinctly recall them intending to limit the page ranking of blogs) and flamed them accordingly (though nicely). Oh well, I hate the way robots.txt is abused anyways; I don't think I'm willing to take any excuse for a site not being archived. -- RK

You don't need Ward to identify the problem, perhaps only to fix it once found. Perhaps we can start a Fix Google campaign to collect funds to pay a Google consultant to hunt down the problem from the inside of Google.

Aside from side issues, yeah, I agree. I think it must be a bug on Google's part, even if it turns out that Ward made some kind of Google-flagging-meta-tag mistake. To be fair, questions of meta tags and robots.txt have in fact been critical arguments in court cases that have set a certain amount of legal precedent, so I think Google should indeed be concerned - but I would still think that Google merely screwed up, here. -- Doug

Well, the page HEAD says ROBOTS NOINDEX, and from my understanding that means "robots, go away!". -- ClaesWallin

See DelayedIndexing for the rationale for that.

Perhaps there is a bug in Google such that once shooed away it never comes back even if the no-index flag goes away.
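
To see what a spider sees, one can pull the robots directives out of a page's HEAD. This is a minimal sketch with Python's standard html.parser. The SAMPLE_HEAD markup is an assumption based on the "ROBOTS NOINDEX" remark above, not a copy of Wiki's actual output, and the names RobotsMetaParser and robots_directives are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page head, modeled on the "ROBOTS NOINDEX" remark above.
SAMPLE_HEAD = """\
<html><head>
<title>Google Hates Wiki</title>
<META NAME="ROBOTS" CONTENT="NOINDEX">
</head><body></body></html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of lowercase robots directives in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(robots_directives(SAMPLE_HEAD))
```

A page that comes back with "noindex" in the set is asking well-behaved spiders not to index it; as noted earlier, nothing forces a given SearchEngine to listen.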

Google hates "Ward's Wiki" only.

 See: http://www.google.com/search?q=GoogleHatesWiki&hl=en&lr=&filter=0

So, the moral of this little debacle is: If you want what you have to say to show up on the net, put it on another wiki! ;)

If we want to fix the Google problem, then all we have to do is create a mirror of this wiki with links back to the original, eh?

According to this tool, there is no problem:

http://www.123promotion.co.uk/tools/googlebanned.php

Perhaps it's not that GoogleHatesWiki, but that WikiHatesGoogle. A few months ago I did a Google search that successfully coughed up a WardsWiki page based on the page title, but instead of page content the abstract contained an error message generated by WardsWiki. The error message (which unfortunately I can no longer find) suggested to me that one of Ward's anti-abuse measures was being triggered by the Google spiders' high volume of hits within a short period of time. -- DaveVoorhis

I suspect that GoogleHatesWiki will eventually kill the c2 wiki, or at least stagnate it. People looking for something usually use Google first.

I think that both too much publicity and too little publicity could kill Wiki. Too much publicity brings in lots of trolls (probably the reason Google is no longer indexing Wiki). If there is too little publicity there won't be enough people to replace those who leave for whatever reason. But Google isn't the only way to bring people to Wiki. -- MichaelSparks

I avoid Microsoft anything and this doesn't fix the Google/Yahoo problem, but at least MSN Search (http://search.msn.com/) still indexes this wiki. You could try it for searching the Wiki. -- Eliz

The page creation rate has been dropping - sometimes below zero - since 30,000 pages. See WikiAtFortyThousand. I suppose the instances of negative growth may be from the RA MindWipe and moving some stuff to TheAdjunct. The slow-down in positive growth might be due to lack of Google & Yahoo, might be due to lack of RA spewing pages right and left, might be due to people leaving in disgust during the 'bot wars, take a guess... FWIW, I found Wiki a couple years ago from googling (with Google) for "functional programming". I was also led here from googling for "design patterns". These searches were about a couple weeks apart, and I don't remember which I did first. My point is, I wouldn't be here were it not for Google (and the fact that Wiki contains loads of interesting pages). -- ElizabethWiethoff

"This is G o o g l e's cache of http://c2.com/cgi/wiki as retrieved on Mar 25, 2006 09:14:23 GMT."

The only pages found were: FrontPage and http:wikiRss. I believe the problem started when DelayedIndexing was implemented.

There might be useful information to be had from https://www.google.com/webmasters/sitemaps/sitestatus . I made a brief attempt to get something useful, but it (justifiably) requires proof that I have control over the piece of the site I'm requesting information on. Apparently I'm not special enough to do that :p. --WilliamUnderwood

June 2007, and Google still hates Wiki. As a result, this site is pretty much in torpor.

Million-dollar companies have similar problems:

http://www.forbes.com/home/technology/2007/04/29/sanar-google-skyfacet-tech-cx_ag_0430googhell.html

I think that the problem is the "?" in the URLs. I've been told at work that Google does not index URLs with question marks anymore.

Interesting! So if the wiki scripts were modified to use paths instead of parameters, like MoinMoin and TwikiClone, Google would pick up on Wiki again? (http://c2.com/cgi/wiki?GoogleHatesWiki would become http://c2.com/cgi/wiki/GoogleHatesWiki and editing would become http://c2.com/cgi/wiki/GoogleHatesWiki?edit ).
Possibly, but the Google spiders might trigger some of the Wiki anti-abuse mechanisms that temporarily deny access if the volume of hits per unit of time is too high.
There must be more to it than just the '?' character. If I do a Google search for WardsWiki, UseMod is near the top (http://www.usemod.com/cgi-bin/wiki.pl?WardsWiki) even though it also uses a parameter with '?'.
I suspect it's mainly the anti-abuse mechanisms that are preventing Google from indexing WardsWiki. As I mentioned above, a few years ago -- about the time Google stopped being friendly to WardsWiki -- I did a Google search that successfully pointed to the relevant page on WardsWiki, but the Google abstract contained an error message generated by WardsWiki indicating that Google had triggered an anti-abuse mechanism.
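
For what it's worth, the parameter-to-path rewrite proposed above is simple to express. A minimal Python sketch follows, with to_path_style as a made-up helper name. It assumes that today's edit URLs look like wiki?edit=PageName, which is a guess about the current script rather than something stated in this discussion.

```python
from urllib.parse import urlsplit, urlunsplit


def to_path_style(url: str) -> str:
    """Rewrite wiki?PageName as wiki/PageName.

    Assumes edit URLs currently look like wiki?edit=PageName and
    rewrites them as wiki/PageName?edit.
    """
    parts = urlsplit(url)
    query = parts.query
    if not query:
        return url  # nothing to rewrite
    if query.startswith("edit="):
        page, new_query = query[len("edit="):], "edit"
    else:
        page, new_query = query, ""
    new_path = parts.path.rstrip("/") + "/" + page
    return urlunsplit((parts.scheme, parts.netloc, new_path, new_query, ""))


if __name__ == "__main__":
    print(to_path_style("http://c2.com/cgi/wiki?GoogleHatesWiki"))
    print(to_path_style("http://c2.com/cgi/wiki?edit=GoogleHatesWiki"))
```

This only changes what the links look like. If the anti-abuse mechanisms are the real cause, prettier URLs would not help by themselves.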

October 2007 and GoogleLovesWiki! Vanity search for site:c2.com elizabethwiethoff turns up 113 real pages cached on 2007-10-07.

I've noticed more C2 hits also of late. Apparently Google tweaked a knob somewhere.

Inspection of Wiki's HTML source reveals the use of a GoogleAnalytics urchinTracker() JavaScript function. Perhaps Ward added this when he added the EditText PNG a few months ago. I imagine signing up for GoogleAnalytics forces Wiki to get indexed no matter what the algorithm is inclined to do otherwise. -- Eliz

Clever, that Ward.

If Google indexes WardsWiki through proxy hacks, the original site and the hack sites are considered duplicate content and the duplicate content gets dropped from the index. See "Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs" (http://www.seofaststart.com/blog/google-proxy-hacking).

For exercise, try searching for dougmerritt earlemartin (sorry, guys). On 2007-10-17, sixteen pages show up in the Google search results. Many of them are legit returns from wikis where they both hang out. A couple pages are from the Simplewebs (see WhatIsSimplewebs) copy of WardsWiki:

simplewebs.com/?PhlIp
simplewebs.com/?DaveVoorhis

Most of the rest are from obvious proxies:

Ward Cunningham: www.surfonsteroids.com/index.php?q=aHR0cDovL2MyLmNvbS9jZ2kvd2lraT9XYXJkQ3VubmluZ2hhbQ%3D%3D
Edit WikiNature: www.browser9.com/index.php/Y29t/YzI/aHR0cDovL2MyLmNvbS9jZ2kvd2lraT9lZGl0PVdpa2lOYXR1cmU/68/0/
AalbertTorsius AbandonPage AbbeNormal AbbreviationsAreEvil ...: www.browser9.com/index.php/b3Jn/c3VuaXI/aHR0cDovL3N1bmlyLm9yZy9hcHBzL21ldGEucGw-bGlzdD1NZWF0QmFsbA/68/0/
Wiki Nature: cloak-me.info/index.php?q=aHR0cDovL2MyLmNvbS9jZ2kvd2lraT9XaWtpTmF0dXJl
Wiki Now: cloak-me.info/index.php?q=aHR0cDovL2MyLmNvbS9jZ2kvd2lraT9XaWtpTm93

One of the sites appears to be a legit attempt at archiving, but to play nice the archive really shouldn't get indexed:

The Adjunct: ww3.wrump.com/archive/2007-05-04/c2.com/cgi/wiki?TheAdjunct

-- ElizabethWiethoff
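
The "q=" values in the surfonsteroids and cloak-me links above look like base64-encoded target URLs. A minimal Python sketch to decode them, using only the standard library; decode_proxy_target is a made-up helper name, and the browser9 links seem to use a slightly different alphabet, so they are not handled here.

```python
import base64
from urllib.parse import unquote


def decode_proxy_target(token: str) -> str:
    """Decode a URL-escaped, base64-encoded proxy 'q=' value."""
    raw = unquote(token)            # turns %3D back into '='
    raw += "=" * (-len(raw) % 4)    # restore padding if it was dropped
    return base64.b64decode(raw).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The two tokens below are copied from the proxy links listed above.
    print(decode_proxy_target(
        "aHR0cDovL2MyLmNvbS9jZ2kvd2lraT9XYXJkQ3VubmluZ2hhbQ%3D%3D"))
    print(decode_proxy_target(
        "aHR0cDovL2MyLmNvbS9jZ2kvd2lraT9XaWtpTmF0dXJl"))
```

Both decode to ordinary c2.com/cgi/wiki page URLs (WardCunningham and WikiNature), which fits the duplicate-content worry: the proxies are serving Wiki's own pages under their own addresses.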

2008 October and GoogleHatesWiki again.

Hmmm. Fickle, like an Earth female.

2008 November 5

Eight facts and a question, seen from far away

The first fact: I live in a faraway, poor, but highly technologized country, as faraway and poor countries usually are.
The second fact: Twenty years ago, if I asked computing people "What do you think about Usenet News?", the usual answer was "Usenet News? But... what is it?". Today, when I ask web people the same question, I get the same answer! And when I ask them "What about Wiki?" I usually get a simple "Wiki? But... what is it?". But be sure that if I ask about blogs, or weblogs, or whatever you want to call them, I'll surely get plenty of enlightened explanations. In faraway regions, popular magazines, technical magazines, technical supplements of newspapers, books in bookstores, TV, and other mass media inform us about "all the new technology issues", but I'm sure that in the last two decades not one line has been published about wiki technologies, interactive Usenet news, or any other truly socially useful resources.
The third fact: It is widely known that, in journalism, a fact that is never reported counts as nonexistent.
The fourth fact: Wiki and Usenet technologies encourage active individual participation in a kind of context where, if someone wants to do (or be) like successful people, the technology paradigm suggests joining that context, while other technologies (like the useful but overpromoted blogs) suggest creating another similar but separate space. In other words, the former hint at collaboration while the latter hint at confrontational competition; i.e., union vs. division.
The fifth fact, usually never communicated to you: The faraway and poor countries all over the world are usually rich in resources, human kindness, and education level (despite the movies, TV series, and cartoons received in profusion from there), all of them in a fast process of deterioration strongly supported by habits and traditions that are, for us, strange, selfish, and awful, subliminally suggested to our children and young people by most of these products.
The sixth fact: The country resources mentioned above are very attractive to those corporations that also develop, directly or indirectly, alone or together, other products like movies, TV series, cartoons, etc. And don't rule out search tools, of course.
The seventh fact: Any family whose members have high levels of education and morals is usually well kept, because its members interact and actually exchange their opinions; the same holds for countries, if resources for this kind of exchange exist.
The eighth fact: A model similar to the one above for countries can be applied within a single country; note that if truly useful tools for exchanging opinions exist there, they remain a risk, too.
The question: Would you care to estimate how long this comment will remain here 'as is'? I'd trust it to be permanent... as proof that I'm wrong!

-- BobTait

CategoryWiki