Here is the algorithm for Analysis of Wiki Link Structure:
- Download from Ward's server the file containing all links.
- Ignore all links to external sites, and all "links" to non-existent pages. For the purposes of finding potentially valuable but hard to access pages, these "links" are irrelevant.
- Consider removing from consideration all pages with CategoryHomePage. Some pages only have signatures as their out-going links. Removing all home pages should make more pages into dead ends, and more pages into orphans, but will no longer list home pages in the output.
We now have a digraph with pages as vertices and links as directed edges. We begin by removing from the graph the vertices that are not part of any strongly connected component.
- Remove from the graph all vertices listed on CategoryAutoIgnore. These are pages that should probably be ignored when trying to understand the linkage structure of the wiki. There is some discussion about this issue on that page.
Next we print a list of dead-ends (and pages that point only to dead-ends).
- In the remaining graph, find all vertices with zero out-degree. List them, then remove them from the graph. Repeat this process until all vertices have out-degree > 0.
Next print a list of orphans (and pages that are pointed to only by orphans).
- In the remaining graph, find all vertices with zero in-degree. List them, then remove them from the graph. Repeat this process until all vertices have in-degree > 0.
- The graph that now remains is such that all vertices have non-zero in-degree and non-zero out-degree.
Print "orphan clusters":
- Identify all small (up to 100 vertices) sets of vertices such that the set is strongly connected and has no incoming links.
Print "dead-end clusters":
- Identify all small (up to 100 vertices) sets of vertices such that the set is strongly connected and has no outgoing links.