Massive collaboration to support minority language survival
homunq Mon 04 of May, 2009 15:25 GMT
I am coming at this from a different perspective: my interest is not in translation per se, but in supporting literacy in the Mayan languages. There are ~31 of them, in varying states of health, but collectively they have several million native speakers; at least 7 have over 200,000 speakers each. Yet Mayan-language literacy rates are very low, generally under 10% for even the most basic literate skills (that is, any knowledge whatsoever of the orthographic conventions of one's language; I am not counting the "on-the-spot" ability to slowly puzzle out a word by transfer from Spanish literacy).
Moreover, lexical databases of varying quality and completeness exist, often in digital form, but they are not accessible. As for TMs/corpora, there is a general paucity: for instance, I doubt that all 31 languages together, with over 100 times the speakers of Inuktitut, have significantly more material on the web than is available on the Nunavut domain alone (as reported in the "Translation Wikified" paper).
I intend to use an OmegaWiki-like tool (probably Ambaradan/OmegaWiki Mark II, currently in development) to make a large database, compiled from existing databases, available on the web. I then intend to pilot its use as a teaching tool in elementary-school classrooms. This would attack both content and literacy rates at once - that is, both sides of the chicken-and-egg problem of literacy. Fundamental to this plan is the availability of low-cost laptops for education; the Guatemalan ministry of education recently received a donation of 3000 XO computers from Telmex/Claro.
I have a number of questions for this community, primarily about the possibility of a simple, user-oriented tool which nevertheless gradually refines data towards the high quality needed by NLP tools. I'm starting with pretty poor data - much of what I have is just word pairs between a Mayan language and another language (unfortunately more often English than Spanish), with occasional unformatted comments. So any user tagging and cleanup is a step up. But I'm aiming to use this with rural Guatemalan elementary-school kids, whose Spanish is often very poor, and whose teachers often have no more than a 9th-grade education! I have taught in this context myself, and I'm an incurable optimist, so I focus more on the incredible opportunities for improvement this situation provides than on the challenges - if I accomplish anything at all, I will accomplish wonders. But still...
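To make the "word pairs plus occasional comments, refined gradually by user tags" idea concrete, here is a minimal sketch of what such an entry might look like. The field names, the tag keys, and the sample K'iche' entry are purely illustrative assumptions, not part of Ambaradan, OmegaWiki, or any existing database schema:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """A raw bilingual word pair, to be refined gradually by user tagging."""
    mayan: str            # headword in the Mayan language
    gloss: str            # equivalent word in the other language
    gloss_lang: str       # "es" or "en" (unfortunately, often "en")
    language: str         # ISO 639-3 code, e.g. "quc" for K'iche' (illustrative)
    comment: str = ""     # unformatted note carried over from the source database
    tags: dict = field(default_factory=dict)  # open-ended user refinements

# Start from the poorest level of data: just a pair of words.
e = Entry(mayan="ja'", gloss="water", gloss_lang="en", language="quc")

# Any later cleanup is a step up: users attach tags without a fixed schema,
# nudging entries towards the quality NLP tools eventually need.
e.tags["pos"] = "noun"
e.tags["checked_by_native_speaker"] = True
```

The point of the open `tags` dictionary is that non-linguist users can contribute whatever refinement they are able to, and stricter structure can be imposed later once enough tags accumulate.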
So I'd love to see if anybody here has any experience or research on using massive collaboration with a non-linguist, non-professional-translator audience. My name is Jameson Quinn, my email is my firstname.lastname at gmail.com.
Re: Massive collaboration to support minority language survival
alain_desilets Mon 04 of May, 2009 16:11 GMT
Hi Jameson,
Sounds like a fascinating project! And I think that massive online collaboration is one of the tools that could greatly help save small languages from extinction:
http://www.wiki-translation.com/tiki-index.php?page=Save+minority+languages+from+extinction&bl=y
I don't have experience with massive online collaboration in that kind of context, but as you know, my colleague Benoit Farley has experience in putting together a parallel corpus for Inuktitut:
http://www.webitext.com/bin/webinuk.cgi
If you know of some sites where there are parallel Mayan-Spanish texts, we might be able to help do the same for those languages (but I can't promise anything... we are a bit thinly spread at the moment).
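The first step in building such a corpus from web sites is usually just matching each page with its translation. A toy sketch of that pairing step, assuming a hypothetical site that encodes the language in the filename (the naming convention, URLs, and `pair_parallel_pages` helper are all made up for illustration):

```python
def pair_parallel_pages(urls, src_tag=".es.", tgt_tag=".quc."):
    """Match each source-language URL with its target-language twin,
    assuming both versions share a filename except for a language tag."""
    targets = set(urls)
    pairs = []
    for url in urls:
        if src_tag in url:
            twin = url.replace(src_tag, tgt_tag)
            if twin in targets:
                pairs.append((url, twin))
    return pairs

urls = [
    "http://example.org/noticia-12.es.html",
    "http://example.org/noticia-12.quc.html",
    "http://example.org/noticia-13.es.html",  # no Mayan twin: skipped
]
print(pair_parallel_pages(urls))
# [('http://example.org/noticia-12.es.html', 'http://example.org/noticia-12.quc.html')]
```

Real sites rarely follow one clean convention, so in practice the matched pairs would still need sentence-level alignment before they are usable as a translation memory.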
I think TikiWiki could be a good platform for what you are trying to do. The translation features it contains could be used to create some parallel documents. Tiki also has features for building bilingual lexicons, which are used on this site (medical terminology):
http://htaglossary.net/
Re: Re: Massive collaboration to support minority language survival
marclaporte Wed 06 of May, 2009 13:22 GMT
Hi!
Tiki has a lot of tools that could help.
In addition to all the features useful for terminology/glossary work (wiki pages, tags, categories, semantic links, i18n, wiki links, mindmaps, etc.), you can also take advantage of the community features, like forums, to animate a community and keep people in contact.
Please participate with us in building such profiles, so they become more easily available:
http://profiles.tikiwiki.org/Collaborative_Multilingual_Terminology
http://profiles.tikiwiki.org/Collaborative_Community
Best regards,
M ;-)