Dataset handling

From GCompris

High-level brainstorming

Most activities use datasets to define the exercises they present to the user. The format of a dataset is usually specific to a single activity; at best, a few activities share a common format (wordlists, for example). Some activities load their data from external files, but many define it internally.

To make dataset handling more flexible, the following requirements turned out to be helpful:

  • Datatype: Define a common datatype (QML type) for internal usage: GCDataset
    • Mime-type: The type of dataset a GCDataset instance provides can be defined by a mime-type-like property. Examples: "gcompris/wordlist", "gcompris/imageid", "gcompris/click_on_letter", ...
    • Payload: The payload holding the data content is activity/mime-type specific and usable only by the activity or activities that support that mime-type. Apart from these compatible activities, no other subsystem should need to interpret the data semantically. (This leaves the door open for applications other than GCompris to use this dataset format.)
    • Serializable: GCDataset instances are serialized to JSON whenever they are passed around.
    • Ideally, the datasets of all activities should be decoupled from the activity code, i.e. datasets currently defined in the code should be factored out into JSON-based dataset files.
  • Dataset Editors: Creating datasets requires editors. As a dataset is activity/mime-type specific, one editor per mime-type can be expected. These editors are best implemented as QML applications.
    • Admin: In the school context, creating datasets will be the teacher's task, i.e. the admin console should present the dataset editors.
    • Code organization: Dataset editors are not necessarily specific to a single activity; the wordlist format, for example, is shared by several activities. Still, the editors will most likely share some code with the activities. The code might therefore best be organized in a separate folder:
 -> core/
 -> activities/
    -> click_on_letter/
    -> ...
 -> dataset_editors/
    -> wordlist/
    -> click_on_letter/
    -> ...
  • Dataset Distribution: It would be desirable to completely integrate the dataset infrastructure into the school/admin context.
    • Creation: A teacher creates datasets in the admin console using a dataset editor and pushes them to the central service backend through its webservice API. Existing datasets can be assigned to profiles.
    • Storage: The central backend receives the dataset content as JSON data and stores it as a text blob. Binary data such as images or sounds is base64-encoded.
    • Distribution: Clients use the central webservice API to pull the datasets that have been assigned to the profiles they use.
    • It might be a better option to use Qt's .rcc (binary resource) format for dataset distribution.
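
The requirements above (a mime-type-like property, an opaque payload, JSON serialization, base64 for binary data) can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the proposed QML type. The JSON key names ("mimeType", "payload"), the payload fields ("words", "icon") and the helper functions are assumptions made for this example, not part of any agreed format.

```python
import base64
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class GCDataset:
    """Hypothetical stand-in for the proposed GCDataset type."""
    mime_type: str  # e.g. "gcompris/wordlist"
    payload: Any    # opaque to everything except compatible activities

    def to_json(self) -> str:
        # Only the envelope is fixed; the payload is passed through untouched.
        # The key names "mimeType" and "payload" are assumptions.
        return json.dumps({"mimeType": self.mime_type, "payload": self.payload})

    @classmethod
    def from_json(cls, text: str) -> "GCDataset":
        data = json.loads(text)
        return cls(mime_type=data["mimeType"], payload=data["payload"])


def encode_binary(raw: bytes) -> str:
    """Base64-encode binary data (image, sound) so it fits into a JSON string."""
    return base64.b64encode(raw).decode("ascii")


def decode_binary(text: str) -> bytes:
    """Reverse of encode_binary."""
    return base64.b64decode(text)


def accepts(dataset: GCDataset, supported_mime_types: set) -> bool:
    """An activity only interprets payloads whose mime-type it declares."""
    return dataset.mime_type in supported_mime_types


# A wordlist dataset with a (fake) binary icon; the payload layout is made up.
wordlist = GCDataset(
    mime_type="gcompris/wordlist",
    payload={"words": ["cat", "dog"], "icon": encode_binary(b"\x89PNG")},
)

wire = wordlist.to_json()               # what would be pushed to the backend
restored = GCDataset.from_json(wire)    # what a client would pull back
```

In this sketch the backend could store `wire` as a plain text blob without understanding the payload, and only an activity that lists "gcompris/wordlist" among its supported mime-types would look inside it.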