Thursday, September 06, 2007

Wiki Wednesday's talk on Wiktionary and multilingual collaboration

crossposted from my blog at http://socialtext.com

September's Bay Area Wiki Wednesday featured Betsy Megas, a mechanical engineer and Wiktionary administrator, known in the wikiverse as Dvortygirl. She's a Wiki Wednesday regular and spoke at Wikimania 2006. In her talk, she gave us a ton of information on the history of Wiktionary, a tour of its interesting features, and thoughts on possible future directions for this worldwide, massively multilingual collaboration.

Betsy started by explaining the difference between Wikipedia and Wiktionary. Wikipedia's goal is to capture all the knowledge in the world. Except for dictionary definitions! Wiktionary's modest goal is to include all words in all languages. While an encyclopedia article is about a subject, a dictionary definition is about a word.

But what is a dictionary? Betsy went to a library to browse dictionary collections. Some dictionaries focus on types of words: cliches, law, saints, nonsexist language. Others center on types of content: rhymes, usage, etymology, visual information. Still others are translation dictionaries. Because Wiktionary is not paper, it is searchable and unlimited in size; it can evolve; and it has strong ties to the people who edit it and to the communities those editors form.

Wiktionary content includes audio pronunciations; definitions; etymologies; metadata such as a word's frequency in English according to all the text on Project Gutenberg; pictures (such as this great photo illustrating the concept of "train wreck"); and videos attached to a word, for example, videos of how to express a word in American Sign Language. It also includes translations.

We went off on a few speculations about future directions for Wiktionary, Wikipedia, and perhaps the entire web. What if links knew why they were linked? For example, why is "Lima" linked to "Peru"? Betsy thinks that we are missing out on a lot of metadata that could be quite useful. And for Wiktionary specifically, what if we had categories that were structured around the functionality of a word, for example, its part of speech?

Betsy then went on to sketch out the basic entry layout, which differs from language to language, but which for English is referred to as WT:ELE. She summed up the problem of Wiktionary as "We have structured data, and no structure". This is a problem and a feature of many wikis!

Wiktionary has many tools to help with the tension between structure and structurelessness. It heavily relies on entry templates, which fill a regular wikitext entry box with something like this:


==English==

===Noun===
{{en-noun}}

# {{substub}}

===References===
*Add verifiable references here to show where you found the word in use.
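To make the idea of an entry template concrete, here is a minimal sketch in Python of how a tool might pre-fill the edit box with the skeleton above. The function name and its parameters are my own invention for illustration; this is not how Wiktionary or MediaWiki actually generates the text.

```python
def new_entry_skeleton(language="English", part_of_speech="Noun",
                       headword_template="en-noun"):
    """Return stub wikitext laid out like the example entry above.

    All three parameters are hypothetical knobs added for illustration.
    """
    lines = [
        f"=={language}==",                    # level-2 header: the language
        "",
        f"==={part_of_speech}===",            # level-3 header: part of speech
        "{{" + headword_template + "}}",      # headword template call
        "",
        "# {{substub}}",                      # placeholder definition line
        "",
        "===References===",
        "*Add verifiable references here to show where you found the word in use.",
    ]
    return "\n".join(lines)


print(new_entry_skeleton())
```

The point is only that the "structure" lives in a block of pre-filled text; nothing stops an editor from changing or deleting any line of it afterwards.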


Other useful tools depend mostly on automated detection of problems, relying on human beings to do the cleanup by hand. For example, Connel MacKenzie wrote a bot to list potentially messed-up second level article headers, but a person checks each link by hand to do the gardening.
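I haven't seen Connel's bot, so the following is only a guess at the general shape of such a check, written in Python. It assumes that a second-level header in an entry should be a language name, and it uses a tiny made-up whitelist where a real tool would need the full list of languages.

```python
import re

# Assumption: a toy whitelist for illustration only.
KNOWN_LANGUAGES = {"English", "Spanish", "Dutch", "Swedish"}

# Matches lines of the form ==Header== (two equals signs to open,
# not three or more).
LEVEL2 = re.compile(r"^==([^=].*?)==\s*$")


def suspicious_level2_headers(wikitext):
    """Return level-2 header names that are not recognized language names."""
    suspects = []
    for line in wikitext.splitlines():
        match = LEVEL2.match(line)
        if match:
            name = match.group(1).strip()
            if name not in KNOWN_LANGUAGES:
                suspects.append(name)
    return suspects


page = "==English==\n\n===Noun===\n# a definition\n\n==Noun==\n"
print(suspicious_level2_headers(page))
```

A script like this only produces a list of suspects; as Betsy described, a person still has to look at each page and decide what the fix should be.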

Structurelessness or being structure-light can be a problem for sensible reuse of Wiktionary content. Other dictionary projects such as Onelook and Ninjawords have used content from Wiktionary, but ran into difficulties with their imports. Is Wiktionary content reusable? Yes, but barely.

Somewhere in the mix, we also discussed WT:CFI (Criteria for Inclusion) and WT:RFV (Requests for Verification).

But then, the truly fascinating stuff about translation and multilingual collaboration. Words, or definitions, exist in many places. For example, we might have an English word defined in the English Wiktionary and the Spanish Wiccionario, and then a Spanish equivalent of that word also defined in both places. So, a single word (or definition, or lexeme) can potentially exist in a matrix of all the 2000+ languages which currently have Wiktionaries (or the 6000-7000+ known living languages) squared.

For a taste of how the Wiktionary community has dealt with that level of complexity, take a look at the English entry for the word "board". About halfway down the page, there's a section titled "Translations", with JavaScript show/hide toggles off to the right-hand side of the page. There are many meanings for the English word, including "piece of wood" and "committee". If I show the translations for board meaning a piece of wood, many other languages are listed, with the word in that language as a link. The Dutch word for "piece of wood" is listed as "plank", and if I click that word I get to the English Wiktionary's entry for plank (which, so far, does not list itself as Dutch, but as English and Swedish). I also noted that the noun form and the verb form of "board" have different sections to show the translations.

Ariel, another Wikipedia and Wiktionary editor and admin, showed us a bit of the guts of the translation template. The page looks like this:

[[{{{2}}}#|{{{2}}}]]

But the code behind it, which you can see if you click to edit the page, looks like this, all on one line (I have added artificial line breaks to protect the width of your browser window):

[[{{{2}}}#{{{{#if:{{{xs|}}}|t2|t-sect}}|{{{1|}}}|{{{xs|}}}}}|{{
#if:{{{sc|}}}|{{{{{sc}}}|{{{alt|{{{2}}}}}}}}|{{{alt|{{{2}}}}}}}}]]
 {{#ifeq:{{{1|}}}|{{#language:{{#switch:{{{1|}}}|
nan=zh-min-nan|yue=zh-yue|cmn=zh|{{{1|}}}}}}}||
[[:{{#switch:{{{1}}}|nan=zh-min-nan|yue=zh-yue|
cmn=zh|{{{1}}}}}:{{{2}}}|({{{1}}})]]
}}{{#if:{{{tr|}}}|
&nbsp;({{{tr}}})}}{{#switch:{{{3|}}}|f|m|mf|n|c|nm= {{{{{3}}}}}|
}}{{#switch:{{{4|}}}|s|p= {{{{{4}}}}}|}}

Fortunately, this template has a lovely Talk page which explains everything.
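For anyone who finds nested curly braces hard to read, here is my rough reading of what the template does, sketched in Python. Treat it as an approximation under stated assumptions: the function and parameter names are mine, the `has_wiki` flag stands in for the template's `#ifeq`/`#language` test, and the section anchor and the `sc` (script) option are left out entirely.

```python
# Language codes whose wiki prefix differs from the code itself,
# copied from the #switch in the template source above.
WIKI_PREFIX = {"nan": "zh-min-nan", "yue": "zh-yue", "cmn": "zh"}

GENDERS = {"f", "m", "mf", "n", "c", "nm"}   # allowed values of parameter 3
NUMBERS = {"s", "p"}                          # allowed values of parameter 4


def render_translation(lang, word, gender="", number="", tr="", alt="",
                       has_wiki=True):
    """Build wikitext roughly like the translation template's output.

    has_wiki is a hypothetical flag: True means the language code is one
    the software recognizes, so a link to that language's wiki is added.
    """
    label = alt or word
    # Link to the entry on this wiki (section anchor omitted).
    out = f"[[{word}#|{label}]]"
    if has_wiki:
        prefix = WIKI_PREFIX.get(lang, lang)
        # Link to the same word on the other language's wiki.
        out += f" [[:{prefix}:{word}|({lang})]]"
    if tr:
        # Transliteration in parentheses after a non-breaking space.
        out += f"&nbsp;({tr})"
    if gender in GENDERS:
        out += " {{" + gender + "}}"      # e.g. the {{m}} template
    if number in NUMBERS:
        out += " {{" + number + "}}"      # e.g. the {{p}} template
    return out


print(render_translation("nl", "plank"))
```

So each translation is really two links: one to the entry on the wiki you are reading, and one to the entry for the same word on the other language's wiki, with transliteration, gender, and number tacked on when they are supplied.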

We all sat around marvelling at the extent of language, and the ambition of the multilingual Wiktionary projects. The scope of OmegaWiki was impressive. As Betsy and Ariel demonstrated its editing interface for structured multilingual data, I got a bit scared, though! Maybe a good future step for OmegaWiki contributions could be to build a friendlier editing UI on top of what sounds like a very nice and solid database structure.

We also took a brief tour of Wordreference.com and its forums, which Wordreference editors go through to update the content of its translation dictionaries.

I'm a literary translator, and publish mostly my English translations of Spanish poetry; so I'm a dictionary geek. In order to translate one poem, I might end up in the underbelly of Stanford library, poring over regional dictionaries from 1930s Argentina, as well as browsing online for clues to past and current usage of just a few words in that poem. Wiktionary is a translator's dream — or will be, over time and as more people contribute. I noted as I surfed during Betsy's talk that the Spanish Wiktionary has a core of only 15 or so die-hard contributors. So, with only a little bit of sustained effort, one person could make a substantial difference in a particular language.

The guy who is scanning the OED and who works for the Internet Archive talked about that as an interesting scanning problem. We told him that Kragen has also worked on a similar project. The IA guy, whose name I didn't catch, described his goals of comparing his OCR version to the not-copy-protected first CD version of the second edition.

At some point, someone brought up ideas about structuring and web forms. I have forgotten the exact question, but Betsy's answer was hilariously understated: "I think that the structure of languages is substantially more complex."

Chris Dent brought up some interesting ideas as we closed out the evening. What is a wiki? When we talk about Wikipedia or Wiktionary or most other wiki software implementations, really we're just talking about "the web". And what he thinks wiki originally meant and still means is a particular kind of tight, close collaboration. As I understand it, he was saying that possibly we overstate wiki-ness as "editability" when really the whole web is "editable". I thought about this some more. We say we are "editing a page" but really we are creating a copy of the old one, swapping it to the same URL, and making our changes. The old page still exists. So for the general web, we can't click on a page to "edit" it, but we can make our own page and reference back to the "old" page, which is essentially the same thing as what most wiki software does, but at a different pace and with different tools and ease of entry/editing. So his point is that wiki-ness is about evolving collaborative narratives. I'm not really sure where to go with that idea, but it was cool to think about and I was inspired by the idea that the entire web, really, has a big button on it that says "Edit This Page".

As is often the case, we had low attendance, but a great speaker and unusually good group discussion. I'm happy with only 10 people being there, if they're the right people. And yet I feel that many people are missing out on this great event. Betsy's going to give me her slides and an audio recording for this month, but next month I will try to get a video camera and record the entire event. If any actual videobloggers would like to come and do the recording, I'd love it.

Also, tune in next week, or September 16, for the San Francisco Wikipedia/Mediawiki meetup!

1 comment:

GerardM said...

OmegaWiki's user interface definitely needs an overhaul. It is sucky, but its one redeeming quality is that it works. The best bit is that when you have changed the language in your user preferences from English, you get many of the labels in the selected language.

As OmegaWiki has its content in a relational database, it is possible to have multiple applications on top of it. It can for instance be used as the source for terminology in data mining. If you care for a demo, contact me. :)

Thanks,
GerardM