Wiktionary to XML

From GCompris
Revision as of 23:02, 23 September 2010 by Bruno (talk | contribs)

In GCompris it would be very useful to have a large number of words and definitions in an XML format. This would allow us to create different kinds of activities around reading and writing skills.

In the early days of GCompris there was no such data available under an open license. But now things have changed, and Wikimedia provides the Wiktionary dictionary.

Sadly, the entries are formatted as WikiText instead of XML, so it is very hard for a computer to parse them and extract the relevant information.

I decided to give it a try and transform a Wiktionary dump into a structured XML format.

The primary goal is to provide content appropriate for children, which is a further challenge because in Wiktionary:

  • word definitions include sexual connotations,
  • there are far too many words. For GCompris a list of 1000 or 2000 words and definitions would be perfect,
  • wiki tags and formatting are specific to each language, which makes our life harder.
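To give an idea of what parsing WikiText involves, here is a minimal sketch of pulling definition lines out of one entry. It is not the code of wiktio2xml. It assumes that definitions are list items starting with "#", that "#*" and "#:" mark sub-items such as examples, and that templates are not nested. The sample entry and the function name extract_definitions are invented for illustration, and the real tags differ from one language edition to another.

```python
import re

# [[target|label]] -> label, [[target]] -> target
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# {{template|args}} -> removed (assumption: templates are not nested)
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")


def extract_definitions(wikitext):
    """Return the plain-text definition lines found in one entry."""
    definitions = []
    for line in wikitext.splitlines():
        # Keep "# text" items; skip "#*", "#:" and "##" sub-items.
        if not line.startswith("#") or line[1:2] in ("*", ":", "#"):
            continue
        text = TEMPLATE_RE.sub("", line[1:])
        text = LINK_RE.sub(r"\1", text)
        # Drop bold and italic quote markers.
        text = text.replace("'''", "").replace("''", "")
        # Collapse the whitespace left behind by removed markup.
        text = " ".join(text.split())
        if text:
            definitions.append(text)
    return definitions


# Invented sample entry, loosely shaped like a French Wiktionary page.
SAMPLE = """'''chat''' {{m}}
# {{zoologie|fr}} [[mammifère|Mammifère]] [[carnivore]] domestique.
#* ''Le chat dort.''
# Petit ''félin''.
"""

definitions = extract_definitions(SAMPLE)
```

Even this small case needs three separate rules (templates, links, quote markers), which shows why per-language formatting quickly becomes the hard part.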

I created a tool called wiktio2xml. It can be found under the gcomprixogoo branch in tool/witio2xml. To run it, you need a Wiktionary XML dump and a list of words to extract. Then just run:

  • ./wiktio2xml.py frwiktionary-20100915-pages-articles.xml fr_words.txt > miniwiktio.html

(For now a single HTML-formatted page is created. In the end an XML format has to be defined.)
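The overall flow (read the dump, keep only the listed words, write structured output) could be sketched as follows. This is an illustration under assumptions, not the actual wiktio2xml code: it assumes the dump uses the usual page, title and text elements of a MediaWiki export, and the output element names wiktionary, word and wikitext are hypothetical, since the final XML format is not defined yet. The names iter_pages and build_xml are also invented here.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source, wanted):
    """Stream (title, wikitext) pairs for the wanted words only."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title in wanted:
                yield title, text or ""
            title, text = None, None
            # Release the finished page so a large dump stays cheap.
            elem.clear()


def build_xml(pages):
    """Wrap the selected pages in a hypothetical output format."""
    root = ET.Element("wiktionary")
    for title, text in pages:
        word = ET.SubElement(root, "word", name=title)
        ET.SubElement(word, "wikitext").text = text
    return ET.tostring(root, encoding="unicode")


# Tiny invented dump standing in for the real Wiktionary file.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>chat</title><revision><text># Animal domestique.</text></revision></page>
  <page><title>chien</title><revision><text># Autre animal.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP), {"chat"}))
output = build_xml(pages)
```

With a real dump, the first argument would be the opened dump file and the set of wanted words would be read from the word list, for example one word per line. Streaming with iterparse avoids loading the whole dump into memory.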

Here is an in-progress version of what the script creates at this stage.