Daniel Hardt's position paper for AMTA 2010 workshop

Author: Daniel Hardt

Position paper submitted for the AMTA 2010 Workshop on Collaborative Translation and Crowdsourcing.


If a translation problem has already been solved, it shouldn't have to
be solved again. This is just common sense, but at the same time it
constitutes an ambitious and distant goal. There are two main issues:
accessibility of data and translation technology.

On the technology side, increasingly sophisticated techniques have
been developed to learn from translation data and apply what has been
learned to new texts. For example, SMT systems have moved from
word-by-word translation to phrase translation, and techniques to
exploit hierarchical structure are being developed. This progress is
important, but incremental.

Accessibility of data is a very different problem, involving difficult
social, economic and legal issues. While initiatives like TAUS/TDA
have taken promising steps, progress here has also been modest,
especially in light of the ambitious goal we have set ourselves.

I would like to suggest that more dramatic progress is possible in the
area of data accessibility, not with respect to the complex social
issues, but by developing relevant technologies. Here is an example:
consider a translator who works through a document interactively,
machine-translating each sentence and then post-editing the output.
At any given point in the process, the post-edited portion of the
file is very likely to contain solutions to translation problems
arising later in that same file. It is therefore well worth making
that data available to the MT system. In a research paper at the
current AMTA conference, I will present work (done together with
Jakob Elming) giving evidence that within-file translation data can
dramatically improve SMT translation quality. We also describe a
prototype showing how this data can be exploited in real time.
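
The interactive loop described above can be illustrated with a small
sketch. This is not the method or the prototype from the research
paper: it swaps in a simple fuzzy lookup over already post-edited
sentences as a stand-in for adapting the SMT system itself. All names
here (WithinFileMemory, translate_file, base_mt, post_edit) and the
0.9 similarity threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


class WithinFileMemory:
    """Hypothetical store of (source, post-edited target) pairs
    collected from the file currently being translated."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.pairs: List[Tuple[str, str]] = []
        self.threshold = threshold  # assumed similarity cut-off

    def add(self, source: str, post_edited: str) -> None:
        self.pairs.append((source, post_edited))

    def lookup(self, source: str) -> Optional[str]:
        """Return the post-edited target of the most similar earlier
        source sentence, or None if nothing is similar enough."""
        best_score, best_target = 0.0, None
        for seen_source, target in self.pairs:
            score = SequenceMatcher(None, source, seen_source).ratio()
            if score > best_score:
                best_score, best_target = score, target
        return best_target if best_score >= self.threshold else None


def translate_file(
    sentences: List[str],
    base_mt: Callable[[str], str],
    post_edit: Callable[[str, str], str],
    memory: Optional[WithinFileMemory] = None,
) -> List[str]:
    """Translate sentence by sentence, feeding every post-edited
    sentence back so later sentences in the same file can reuse it."""
    if memory is None:
        memory = WithinFileMemory()
    results = []
    for source in sentences:
        reused = memory.lookup(source)
        # Prefer an earlier within-file solution; otherwise fall back
        # to the unadapted MT system.
        draft = reused if reused is not None else base_mt(source)
        final = post_edit(source, draft)  # the translator's correction
        memory.add(source, final)
        results.append(final)
    return results
```

In this sketch the translator corrects a sentence once, and a later
repetition of that sentence receives the corrected translation as its
draft instead of the raw MT output. A real system would feed the
post-edited data into the SMT models rather than only matching whole
sentences.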

Right now, SMT systems exploit only a fraction of the potentially
available data in producing translations. This is because there are
technological barriers to data accessibility, and I think it might be
much easier to remove these technological barriers than it is to
remove the social barriers. We should build technologies that ensure
that, whenever a text is being translated, the SMT system exploits all
available data, tuned to the specifics of the text at hand.

Existing, accessible translation data contains the distilled wisdom of
the world's translators; this is the knowledge that will ultimately
solve the problem of translation. SMT technology is already quite
effective at extracting this knowledge. What we lack right now is the
technology that lets systems take advantage of all of this data when
they translate. By removing these technological barriers, we can make
great strides towards solving the world's translation challenges.
