Wiki Analytics

Are there algorithmic ways of determining the health of a Wiki?

There are likely a number of different patterns of healthy Wikis and, more importantly, healthy Wiki-based communities. If we can identify and visualize these patterns, we can apply these analytics to:

  • Understand the patterns of interactions in a healthy community
  • Aid the community to use the Wiki more effectively
  • Encourage developers to facilitate these patterns in the tool itself

Philosophy

The end goal is not to come up with some single index indicating health or effectiveness, but to identify patterns. Communities can derive their own meaning from these patterns and act appropriately.

Metrics can become arbitrarily intricate, and at some point that complexity may be valuable. As a starting point, though, identify simple metrics. Simple metrics mean simpler data acquisition and computation requirements, and the simpler those requirements, the more likely people are to actually examine the data.

Wiki Usage

  • How do edits vs. accesses reflect the health of a Wiki?
  • How often are orphan pages accessed/edited?
  • Most wanted pages.
  • Number of editors
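
A minimal sketch of pulling these counts, assuming a MediaWiki-style API whose siteinfo statistics expose page, edit, and user totals (the endpoint URL and field names are assumptions; other engines will differ):

 # Pull site-wide counts from a MediaWiki-style API and compute crude
 # edits-per-page and edits-per-editor ratios.
 import json
 import urllib.request

 API = "https://example.org/w/api.php"  # hypothetical wiki endpoint

 def site_statistics(api_url=API):
     url = api_url + "?action=query&meta=siteinfo&siprop=statistics&format=json"
     with urllib.request.urlopen(url) as resp:
         return json.load(resp)["query"]["statistics"]

 def usage_ratios(stats):
     # Guard against division by zero on an empty wiki.
     pages = max(stats.get("pages", 0), 1)
     users = max(stats.get("users", 0), 1)
     return {
         "edits_per_page": stats.get("edits", 0) / pages,
         "edits_per_editor": stats.get("edits", 0) / users,
     }

 print(usage_ratios(site_statistics()))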

Principal Component Analysis

There may be purely numerical methods or simple machine-learning algorithms that can determine when a wiki falls into one of a number of categories. The point of taking a statistical or machine-learning approach would be to discover whether there are measures human beings haven't thought of. The need for explanatory power suggests these methods are best used for exploration rather than actual in-the-field classification.

So far it appears that human intuition is entirely correct: given a sample of statistical data from around 2,000 wikis, it's been found that the first three principal components (explaining more than 99% of the variance) are RecentChanges, Views, and Edits. The first experiment is written up at Joe Blaylock 20090714.
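
A sketch of this kind of exploration, assuming you already have a matrix with one row per wiki and one column per statistic (the random matrix below is only a stand-in for real data):

 # PCA on a wiki-by-statistic matrix: report how much variance each
 # principal component explains.
 import numpy as np

 def explained_variance(X):
     # Standardize columns so no single statistic dominates by scale.
     X = (X - X.mean(axis=0)) / X.std(axis=0)
     # PCA via SVD of the standardized data matrix.
     _, singular_values, components = np.linalg.svd(X, full_matrices=False)
     variance = singular_values ** 2
     return components, variance / variance.sum()

 rng = np.random.default_rng(0)
 X = rng.lognormal(size=(2000, 3))  # stand-in for real per-wiki statistics
 components, ratios = explained_variance(X)
 for i, r in enumerate(ratios):
     print(f"PC{i + 1}: {r:.1%} of variance")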

Hypotheses

  • All classifications are a function of time.
  • Given some set of statistics for some bounded period of time, you can say whether a Wiki is (1) alive or dead; (2) "healthy."
  • "Health" is a function of its intended purpose.
    • CMS -- "Above the Flow"
    • Collaborative -- "In the Flow"
    • Discussion-oriented w/ limited refactoring
    • Data repository.
      • Active and evolving core group of editors w/ long tail of edits
      • Wide variance of views, which are a function of "interestingness" of content.
  • It would be useful for public Wiki providers to identify link farms so they can react accordingly.
  • Categorize editors based on number of edits per size of edit
  • Number and quality of editors are a greater indicator of success

Factors to include in the next analysis (a per-editor aggregation sketch follows this list):

  • New editors
  • Number of edits / editor
  • Size of edits (new + deleted characters) / editor
  • Editors and the pages they're editing
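
A sketch of the per-editor aggregation, assuming revision records that carry an editor name and the number of characters added and removed (the field names are assumptions; adapt them to whatever your wiki's revision export provides):

 # Aggregate edit count and edit size per editor from a revision log.
 from collections import defaultdict

 def editor_profiles(revisions):
     """revisions: iterable of dicts like
     {"editor": "alice", "chars_added": 120, "chars_removed": 15}"""
     edits = defaultdict(int)
     size = defaultdict(int)
     for rev in revisions:
         edits[rev["editor"]] += 1
         size[rev["editor"]] += rev.get("chars_added", 0) + rev.get("chars_removed", 0)
     return {
         editor: {
             "edits": edits[editor],
             "total_size": size[editor],
             "mean_edit_size": size[editor] / edits[editor],
         }
         for editor in edits
     }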

Page Buddies

You can automatically derive social networks by looking at who edits a page. If you edit the same page as another person, you become that person's page buddy.

If you expose that information, it could help in making useful connections and breaking down silos.
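
A minimal sketch of deriving these connections, assuming you can produce a mapping from each page to the set of people who have edited it:

 # Two editors become "page buddies" if they have edited the same page.
 from itertools import combinations
 from collections import defaultdict

 def page_buddies(page_editors):
     """page_editors: dict like {"HomePage": {"alice", "bob"}, ...}
     Returns a dict mapping each editor to the set of their buddies."""
     buddies = defaultdict(set)
     for editors in page_editors.values():
         for a, b in combinations(sorted(editors), 2):
             buddies[a].add(b)
             buddies[b].add(a)
     return dict(buddies)

 # Example: page_buddies({"HomePage": {"alice", "bob"}, "Sandbox": {"bob", "carol"}})
 # -> {"alice": {"bob"}, "bob": {"alice", "carol"}, "carol": {"bob"}}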

Link (Graph) Analysis

  • Number of Islands/Orphans. If no pages are linked to anything else, then every page is an island of one, and you are probably not using the Wiki in a useful way. Islands consisting of several pages ("components" in graph theory) indicate some level of interconnectedness.
  • Number of Blocks/Peninsulas. Blocks are pages only connected to one other page. If you break that link, the page becomes an orphan, or island.
  • Level/pattern of interconnectedness of clusters.
  • Diameter (the longest shortest path in the graph). A long diameter might be an unhealthy indicator. Computing the diameter exactly requires all-pairs shortest paths, which is expensive on large graphs, so it may not be practical as a general metric.
  • Number of links to and from a page. If you graph pages (x-axis) and links (y-axis) from largest to smallest, you may be able to derive interesting usage patterns. For example, the second derivative of the curve might indicate the variance in linking behavior across a community.
  • How often are external links used to link inside a Wiki?

The hypothesis for islands is that a few large islands are better than many small ones. One way to verify this would be to cross-relate this data with page name analysis (see below). In other words, do larger islands have better page names?
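
A sketch of computing the simpler graph metrics above using networkx, treating the Wiki as an undirected graph of page-to-page links (the input format is an assumption; real links would be extracted from page content):

 # Islands = connected components of size 1; peninsulas/blocks = degree-1 pages.
 import networkx as nx

 def link_graph_metrics(pages, links):
     """pages: iterable of page titles; links: iterable of (src, dst) pairs."""
     G = nx.Graph()
     G.add_nodes_from(pages)
     G.add_edges_from(links)

     components = list(nx.connected_components(G))
     islands = [c for c in components if len(c) == 1]    # orphan pages
     peninsulas = [n for n, d in G.degree() if d == 1]   # "blocks"

     # Diameter of the largest component only, and only when the component is
     # small enough that an all-pairs shortest-path computation is affordable.
     largest = max(components, key=len)
     diameter = nx.diameter(G.subgraph(largest)) if len(largest) <= 2000 else None

     return {
         "components": len(components),
         "orphans": len(islands),
         "peninsulas": len(peninsulas),
         "diameter_of_largest_component": diameter,
     }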

What constitutes a "link"?

  • Forward link
  • Backlink
  • Internal link (Collab:Link As You Think)
    • Links to non-existent pages (incipient links)
  • External link
  • Transclusion
  • Tags

Tag Analysis

As discussed in the section on graph analytics, one way to analyze tags is to treat them as links. Another way to study them is to treat them as page names.

Tag-specific analysis:

  • Emergent namespacing of tags
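
A sketch of detecting emergent namespacing, under the assumption that namespacing shows up as a shared prefix before a delimiter such as ":" or "/":

 # Group tags by prefix and report prefixes shared by more than one tag.
 from collections import defaultdict

 def emergent_namespaces(tags, delimiters=(":", "/")):
     groups = defaultdict(set)
     for tag in tags:
         for d in delimiters:
             if d in tag:
                 groups[tag.split(d, 1)[0]].add(tag)
                 break
     return {prefix: sorted(ts) for prefix, ts in groups.items() if len(ts) > 1}

 # Example: emergent_namespaces(["project:forge", "project:wiki", "todo"])
 # -> {"project": ["project:forge", "project:wiki"]}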

Time Analysis

Some of the most interesting analysis will come when a time axis is added. This will allow us to understand how content evolves -- how it is refactored (or not), how conflict is resolved, and, in general, what the patterns of interaction look like. The best work to date on this is IBM's history flow.

Other things to study:

  • Stubs evolving into fleshed-out pages
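
A sketch of tracking that evolution, assuming non-empty revision histories with a timestamp and a size in bytes (the record shape and the stub threshold are assumptions):

 # Track a page's size over time to see whether a stub ever grew up.
 def growth_curve(revisions):
     """revisions: list of dicts like {"timestamp": "2009-12-03T15:24:00Z",
     "size": 2048}, in any order. Returns (timestamp, size) pairs sorted
     by time, suitable for plotting a page's growth."""
     ordered = sorted(revisions, key=lambda r: r["timestamp"])
     return [(r["timestamp"], r["size"]) for r in ordered]

 def still_a_stub(revisions, threshold=1000):
     """True if the most recent revision is below a (hypothetical) size
     threshold, i.e. the page never grew out of stub-hood."""
     return growth_curve(revisions)[-1][1] < threshold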

Shared Language

Page Names

Hypothesis: Good page names are one indicator of healthy Wikis. The better the names, the more likely people will link to those pages, both intentionally and accidentally.

How do you measure the "goodness" of a page name?

  • Number of characters
  • Number of words/tokens
  • Number of non-alphanumeric characters

The hypothesis for all of the above is that smaller is better.
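
A sketch of these three measurements as straightforward string metrics:

 # Character count, token count, and non-alphanumeric count for a page name.
 import re

 def page_name_metrics(name):
     return {
         "characters": len(name),
         "tokens": len(re.findall(r"[A-Za-z0-9]+", name)),
         # Spaces count as non-alphanumeric here, hence the 1 in the example.
         "non_alphanumeric": sum(1 for c in name if not c.isalnum()),
     }

 # Example: page_name_metrics("Wiki Analytics")
 # -> {"characters": 14, "tokens": 2, "non_alphanumeric": 1}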

Other potential analytics:

  • Variation in normalized link names. Some Wikis normalize page names. For example, "Matt Liggett" and "matt liggett" might point to the same page. Other forms of normalization include treating non-alphanumeric characters as white space. Studying the variation in the text actually used to link to pages would demonstrate the effectiveness of the normalization algorithms.
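
A sketch of measuring that variation under a hypothetical normalization rule (lowercase, non-alphanumerics treated as whitespace); the rule here is an assumption, not any particular engine's algorithm:

 # Group raw link strings by their normalized form to see how many surface
 # variants collapse onto the same page.
 from collections import defaultdict
 import re

 def normalize(link_text):
     return " ".join(re.sub(r"[^a-z0-9]+", " ", link_text.lower()).split())

 def link_text_variants(link_texts):
     variants = defaultdict(set)
     for text in link_texts:
         variants[normalize(text)].add(text)
     return {k: sorted(v) for k, v in variants.items() if len(v) > 1}

 # Example: link_text_variants(["Matt Liggett", "matt liggett", "MattLiggett"])
 # groups the first two together; "MattLiggett" normalizes differently.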

Some initial work on page name analysis.

Page Content Analysis

As with page name analysis, you could also cross-relate this data with the graph analytics.

Data Sources

Wikipedia

Wikipedia is an unusual data source because of its massive scale (which makes the data difficult even to obtain, let alone analyze) and its content focus (an encyclopedia). In some ways, the patterns found there could serve as a model for Wiki health, although we need to be careful about that assumption.

The other nice thing about Wikipedia is that the data is available and there's already been some great analysis work. Erik Zachte is Wikimedia's Data Analyst, and his analysis and scripts are available. There is a general page on Wikimedia statistics.

Tools

MediaWiki

Extension:Usage Statistics

Open Web Analytics is a PHP web analytics framework, with built-in support for MediaWiki. FireStats is similar.

lqt-analytics is an analytics tool for LiquidThreads.

Visualization

Many Eyes

References

  • 2007 Wikithon Analysis
  • WikiTracer -- a web service providing platform-independent analytics and comparative growth statistics for Wikis. References some of my work on Wiki Analytics.
  • I explored some of these ideas with Socialtext.