Source quality

These are notes from a breakout session at the AMTA 2010 Workshop on Collaborative Translation and Crowdsourcing, Denver, Colorado, October 31st 2010.

This breakout session focused on the impact and importance of source quality in collaborative and crowdsourced translation.

Media/channels where user-generated content (UGC) has source quality issues

  • SMS
  • Twitter
  • IM
  • Forums

Noise examples

  • Absence of diacritics
  • Foreign words used instead of native words
  • Abbreviations/shorthand (compression)
  • Lack of a writing standard

Recipients use a language model to unpack the information that the sender compressed in the first place.
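
The unpacking idea can be illustrated with a minimal sketch for one noise type, missing diacritics. It assumes a toy unigram "language model": the word counts below are invented for illustration, and the function names (strip_diacritics, build_index, restore) are hypothetical, not something presented at the session.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(text: str) -> str:
    """Simulate the noise: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical unigram counts standing in for a real language model.
UNIGRAM_COUNTS = {"été": 90, "élève": 40, "élevé": 25, "marche": 30, "marché": 60}

def build_index(counts):
    """Map each diacritic-free form to the words that collapse onto it."""
    index = defaultdict(list)
    for word in counts:
        index[strip_diacritics(word)].append(word)
    return index

def restore(token, index, counts):
    """Pick the most frequent candidate; leave unknown tokens unchanged."""
    candidates = index.get(strip_diacritics(token.lower()))
    if not candidates:
        return token
    return max(candidates, key=counts.get)

INDEX = build_index(UNIGRAM_COUNTS)
restored = [restore(t, INDEX, UNIGRAM_COUNTS) for t in "ete eleve marche xyz".split()]
print(restored)
```

A unigram model always picks the same candidate regardless of context, so it cannot tell "élève" from "élevé" in a sentence; a recipient (or a stronger model) would use the surrounding words as well.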

Strategies to address source quality issues

  • Tools: T9 predictive text, spell-checkers. How far should cleaning go? Is it done before publishing? Before training?
  • Rewards (bonus points, trust points)
  • Normalization (at run-time and during training)
  • User acceptance (anecdotal evidence that users adapt their authoring when they interact with bots; more user studies are required)
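
As a rough sketch of the normalization strategy, the snippet below applies one and the same shorthand-expansion function to training data and to run-time input, so that both see the same cleaned form. The lookup table and the normalize function are hypothetical examples; a real system would need a much larger, language-specific resource.

```python
import re

# Hypothetical shorthand table; entries are illustrative only.
SHORTHAND = {
    "u": "you",
    "r": "are",
    "c": "see",
    "thx": "thanks",
    "pls": "please",
    "2moro": "tomorrow",
}

# Split into word runs, punctuation runs, and whitespace runs so that
# joining the pieces reproduces the original spacing and punctuation.
_TOKEN = re.compile(r"\w+|[^\w\s]+|\s+")

def normalize(text, table=SHORTHAND):
    """Expand known shorthand tokens; leave everything else unchanged."""
    pieces = []
    for token in _TOKEN.findall(text):
        pieces.append(table.get(token.lower(), token))
    return "".join(pieces)

# During training: clean the corpus before it is used.
raw_corpus = ["thx c u 2moro!", "pls call me"]
training_corpus = [normalize(sentence) for sentence in raw_corpus]

# At run-time: clean incoming text with the same function before translating.
incoming = normalize("r u there?")
print(training_corpus, incoming)
```

Using a single function for both stages avoids a mismatch between what the system was trained on and what it receives. The table lookup is context-free, so a legitimate single-letter token such as "c" would also be expanded; that is one concrete form of the "how far should cleaning go" question.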
