Difference between revisions of "Wiktionary to XML"

From GCompris
(Should consider Omega Wiki)
 
(5 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
=== Wiktionary to XML ===
 
=== Wiktionary to XML ===
  
In GCompris it would be very usefull to have a large amount of words and definition in a XML formated form. This would allow us to create different kinds of activities around reading and writing skills.
+
In GCompris it would be very usefull to have a large amount of words and definitions in a XML formatted form. This would allow us to create different kinds of activities around reading and writing skills.
  
In the early day of GCompris there was no such data available under an open license. But now things have changes and in the Wikimedia provides the Wiktionary dictionary.
+
In the early days of GCompris there was no such data available under an open license. But now things have changed and the Wiktionary dictionary is one of the Wikimedia projects.
  
Sadly, these are formatted as ''WikiText'' instead of XML so it is very hard for a computer to parse them and extract relevant informations.
+
Sadly, it is formatted as ''WikiText'' instead of XML so it is very hard for a computer to parse it and extract relevant informations.
  
 
I decided to make it a try and transform a Wiktionary dump in an XML structured format.
 
I decided to make it a try and transform a Wiktionary dump in an XML structured format.
  
The primary goal is to provide a content appropriate for children and this is another challenge because in Wiktionary:
+
The primary goal is to provide content that is appropriate for children and this is another challenge because in Wiktionary:
* we find words definition including their sexual connotation,  
+
* we find words definitions with sexual connotation,  
 
* there are way too much words. For GCompris a list of 1000 or 2000 words and definition would be perfect,
 
* there are way too much words. For GCompris a list of 1000 or 2000 words and definition would be perfect,
* wikitags and formating is specific for each language which makes our life harder.
+
* wikitags and formatting is specific for each language which makes life harder.
  
I created a tool called wiktio2xml, it can be found under gcomprixogoo branch in [http://git.gnome.org/browse/gcompris/tree/tools/wiktio2xml?h=gcomprixogoo tool/witio2xml]. To run it, you must pick the [http://download.wikimedia.org/frwiktionary/ Wiktionary XML dump] and a list of words to extract. Then just run:
+
I created a tool called wiktio2xml, it can be found under the master branch in [http://git.gnome.org/browse/gcompris/tree/tools/wiktio2xml?h=master tool/witio2xml]. To run it, you must pick the [http://download.wikimedia.org/frwiktionary/ Wiktionary XML dump] and a list of words to extract. Then just run:
* ./wiktio2xml.py frwiktionary-20100915-pages-articles.xml fr_words.txt > miniwiktio.html
+
* ./wiktio2xml.py frwiktionary-20100915-pages-articles.xml fr_words.txt -o miniwiktio.html
  
(For now a single HTML formated page is created. In the end an XML file have to be defined).
+
To get more information on what it does, just run ''./wiktio2xml.py -h'':
 +
<code>
 +
Usage: wiktio2xml.py [options] wiktionary_dump.xml word_list.txt
 +
Options:
 +
  -h, --help            show this help message and exit
 +
  -o OUTPUT, --output=OUTPUT
 +
                        write result to file or directory
 +
  -q, --quiet          don't print in progress messages to stdout
 +
  -d, --debug          print debug traces to stdout
 +
  -s, --site            Creates a web site
  
Here is an [http://gcompris.net/incoming/miniwiktio.html in progress version] of what the script creates at this stage.
+
</code>
 +
 
 +
 
 +
== Results of what the script creates at this stage ==
 +
* [http://gcompris.net/incoming/miniwiktio.html in a single HTML-formatted page ]
 +
* [http://gcompris.net/incoming/miniwiktio/index.html in a static html site]
 +
* (In the end an XML file have to be defined)
 +
 
 +
== Should consider Omega Wiki ==
 +
 
 +
[http://www.omegawiki.org Omega Wiki] is a formatted wiki that should be considered for this project.
 +
 
 +
[[Category:English]]

Latest revision as of 12:21, 27 January 2015

Wiktionary to XML

In GCompris it would be very useful to have a large number of words and definitions in an XML format. This would allow us to create different kinds of activities around reading and writing skills.

In the early days of GCompris there was no such data available under an open license. This has changed: the Wiktionary dictionary is now one of the Wikimedia projects.

Sadly, it is formatted as WikiText instead of XML, so it is very hard for a computer to parse it and extract the relevant information.

I decided to give it a try and transform a Wiktionary dump into a structured XML format.

The primary goal is to provide content that is appropriate for children, and this is another challenge because in Wiktionary:

  • we find word definitions with sexual connotations,
  • there are far too many words. For GCompris a list of 1000 or 2000 words and definitions would be perfect,
  • wikitags and formatting are specific to each language, which makes life harder.
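
To make the WikiText-to-XML step more concrete, here is a minimal sketch in Python. It is not the code of wiktio2xml. It assumes the common Wiktionary convention that definition lines start with a single `#` (with `#*` and `#:` used for examples), and the element names `word` and `definition` are hypothetical, since the XML format has not been defined yet. The markup cleaning is deliberately naive and will not cover every language-specific template.

```python
import re
import xml.etree.ElementTree as ET

# Naive patterns for a few common WikiText constructs (assumption: enough
# for a demo, not for every language edition of Wiktionary).
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")                 # {{template|args}}
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")     # [[target|label]]
QUOTES = re.compile(r"'{2,}")                            # ''italic'', '''bold'''


def clean_wikitext(line):
    """Strip templates, keep link labels, drop bold/italic quote marks."""
    line = TEMPLATE.sub("", line)
    line = LINK.sub(r"\1", line)
    line = QUOTES.sub("", line)
    return " ".join(line.split())


def definitions(wikitext):
    """Return the cleaned text of every top-level '#' definition line."""
    result = []
    for line in wikitext.splitlines():
        # Skip examples ('#*', '#:') and sub-definitions ('##').
        if line.startswith("#") and not line.startswith(("#*", "#:", "##")):
            text = clean_wikitext(line[1:])
            if text:
                result.append(text)
    return result


def word_to_xml(word, wikitext):
    """Build a hypothetical <word name="..."> element with its definitions."""
    entry = ET.Element("word", name=word)
    for text in definitions(wikitext):
        ET.SubElement(entry, "definition").text = text
    return entry


if __name__ == "__main__":
    sample = (
        "'''chat''' {{m}}\n"
        "# [[mammifère|Mammifère]] carnivore {{zoologie|fr}} domestique.\n"
        "#* ''Le chat dort.''\n"
        "# Animal de [[compagnie]].\n"
    )
    print(ET.tostring(word_to_xml("chat", sample), encoding="unicode"))
```

Filtering out definitions that are not appropriate for children would need an extra step (for example a list of tags or terms to reject), which this sketch does not attempt.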

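I created a tool called wiktio2xml. It can be found on the master branch in tools/wiktio2xml (http://git.gnome.org/browse/gcompris/tree/tools/wiktio2xml?h=master). To run it, you need a Wiktionary XML dump (http://download.wikimedia.org/frwiktionary/) and a list of words to extract. Then just run: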

  • ./wiktio2xml.py frwiktionary-20100915-pages-articles.xml fr_words.txt -o miniwiktio.html
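
As an illustration of what combining a dump with a word list involves, the following sketch streams a MediaWiki XML dump and keeps only the pages whose title is in the list. It is a simplified stand-in, not the actual logic of wiktio2xml. It relies on the standard dump layout (`page` elements containing a `title` and a `revision`/`text`), and it assumes the word list holds one word per line, which the page above does not specify. The function names are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix so any export schema version matches."""
    return tag.rsplit("}", 1)[-1]


def load_words(path):
    """Read the word list (assumed format: one word per line, UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def iter_wanted_pages(dump, wanted):
    """Yield (title, wikitext) for each page whose title is in `wanted`.

    `dump` is a file name or a binary file object. The dump is parsed
    incrementally, so it is never loaded into memory as a whole.
    """
    title = None
    for _event, elem in ET.iterparse(dump, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            if title in wanted:
                yield title, elem.text or ""
        elif name == "page":
            title = None
            elem.clear()  # release the text of the page just processed


if __name__ == "__main__":
    sample = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">'
        b"<page><title>chat</title><revision><text># Animal.</text></revision></page>"
        b"<page><title>zzz</title><revision><text># Autre.</text></revision></page>"
        b"</mediawiki>"
    )
    for page_title, wikitext in iter_wanted_pages(io.BytesIO(sample), {"chat"}):
        print(page_title, wikitext)
```

With a real dump, `iter_wanted_pages("frwiktionary-20100915-pages-articles.xml", load_words("fr_words.txt"))` would be the equivalent call.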

To get more information on what it does, just run ./wiktio2xml.py -h:

Usage: wiktio2xml.py [options] wiktionary_dump.xml word_list.txt
Options:
 -h, --help            show this help message and exit
 -o OUTPUT, --output=OUTPUT
                       write result to file or directory
 -q, --quiet           don't print in progress messages to stdout
 -d, --debug           print debug traces to stdout
 -s, --site            Creates a web site
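
The layout of this help text looks like the output of Python's standard `optparse` module. Assuming that is what the script uses (the page does not say so), a parser with the same options could be declared roughly as follows. The `dest` names and the helper function names are guesses made for this sketch.

```python
from optparse import OptionParser


def build_parser():
    """Declare the options listed in the help text above."""
    parser = OptionParser(
        usage="usage: %prog [options] wiktionary_dump.xml word_list.txt"
    )
    parser.add_option("-o", "--output", dest="output",
                      help="write result to file or directory")
    parser.add_option("-q", "--quiet", action="store_true", dest="quiet",
                      default=False,
                      help="don't print in progress messages to stdout")
    parser.add_option("-d", "--debug", action="store_true", dest="debug",
                      default=False,
                      help="print debug traces to stdout")
    parser.add_option("-s", "--site", action="store_true", dest="site",
                      default=False,
                      help="Creates a web site")
    return parser


def parse_command_line(argv):
    """Return (options, dump_path, word_list_path) or exit with an error."""
    parser = build_parser()
    options, args = parser.parse_args(argv)
    if len(args) != 2:
        parser.error("expected a Wiktionary dump and a word list")
    return options, args[0], args[1]
```

For example, `parse_command_line(["-o", "miniwiktio.html", "dump.xml", "fr_words.txt"])` returns options with `output` set to `miniwiktio.html`, followed by the two positional paths.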


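Results of what the script creates at this stage

  • in a single HTML-formatted page: http://gcompris.net/incoming/miniwiktio.html
  • in a static HTML site: http://gcompris.net/incoming/miniwiktio/index.html
  • (In the end, an XML format has to be defined.)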

Should consider Omega Wiki

Omega Wiki (http://www.omegawiki.org) is a formatted wiki that should be considered for this project.