These are notes from a breakout session at the AMTA 2010 Workshop on Collaborative Translation and Crowdsourcing, Denver, Colorado, October 31st 2010.
This breakout session focused on a cluster of issues that pertain to the impact and importance of source quality in a collaborative/crowdsourced translation context.
Media/channels where user-generated content (UGC) has source quality issues
- SMS
- IM
- Forums
Noise examples
- Absence of diacritics
- Foreign words used instead of native words
- Abbreviations/shorthand (compression)
- Lack of standardized writing conventions
Strategies to address source quality issues
- Tools: T9 predictive text, spell-check. Open questions: how far should cleaning go, and should it be done before publishing, before training, or both?
- Rewards (bonus points, trust points)
- Normalization (at run-time and during training)
- User acceptance (anecdotal evidence that users adapt their authoring when they interact with bots; more user studies are required)
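
As an illustration of the normalization strategy, here is a minimal Python sketch of dictionary-based clean-up that could run either at run-time or over training data. It was not presented in the session: the lookup tables, their entries, and the function name `normalize` are hypothetical, and a real system would need far larger, language-specific resources and context-aware disambiguation.

```python
import re

# Hypothetical lookup tables, for illustration only.
# Shorthand/compression mapped to its full form.
ABBREVIATIONS = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks", "pls": "please"}
# Words typed without diacritics mapped to the standard spelling.
DIACRITICS = {"cafe": "café", "naive": "naïve"}

# Split into runs of word characters and runs of everything else,
# so spacing and punctuation are preserved when the tokens are rejoined.
TOKEN_RE = re.compile(r"\w+|\W+")


def normalize(text: str) -> str:
    """Replace known shorthand and restore known diacritics, token by token."""
    out = []
    for token in TOKEN_RE.findall(text):
        key = token.lower()
        if key in ABBREVIATIONS:
            # Note: the original capitalization of the token is not kept.
            out.append(ABBREVIATIONS[key])
        elif key in DIACRITICS:
            out.append(DIACRITICS[key])
        else:
            out.append(token)
    return "".join(out)


if __name__ == "__main__":
    print(normalize("thx, u r gr8"))
    print(normalize("pls meet at the cafe"))
```

The same function could be applied to source text before it is sent for translation and to noisy training data, which is one way to keep the two consistent. A context-free lookup like this will over-correct in some cases (for example, a single-letter token that is not shorthand), which bears on the question above of how far cleaning should go.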