Implementation

As the above scenario illustrates, CLWE supports a very open-ended workflow that lifts most of the assumptions of traditional authoring and translation environments.

A key element in supporting this open workflow is a backend capable of tracking the edits made in the different pages, in such a way that it can help users reproduce these changes in the other linguistic variants of those same pages. We now describe how we addressed this challenge in the CLWE project. A complete description of the theory behind the engine and of its implementation is available in the architecture document (Huberdeau 2008); this section presents an overview of these concepts.

By keeping track of which individual edits are incorporated in a given page version, it is possible to compare the edit sets of two pages. This comparison can quickly identify which pages need updating without even having to look at the page content. For any two edit sets \alpha and \tau, the following situations can exist:

  • \alpha = \tau : Pages are equivalent
  • \alpha \subset \tau : \alpha can be updated from \tau
  • \tau \subset \alpha : \tau can be updated from \alpha
  • else : Each page can benefit from an update from the other one
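The four cases above map directly onto ordinary set comparisons. The following is a minimal sketch, not the CLWE implementation: the names `Relation` and `compare_edit_sets` are hypothetical, and edit identifiers are assumed to be any hashable values.

```python
from enum import Enum


class Relation(Enum):
    """Hypothetical labels for the four situations listed above."""
    EQUIVALENT = "pages are equivalent"
    ALPHA_NEEDS_UPDATE = "alpha can be updated from tau"
    TAU_NEEDS_UPDATE = "tau can be updated from alpha"
    BOTH_NEED_UPDATE = "each page can be updated from the other"


def compare_edit_sets(alpha: set, tau: set) -> Relation:
    """Classify two pages by comparing only their edit sets."""
    if alpha == tau:
        return Relation.EQUIVALENT
    if alpha < tau:   # alpha is a proper subset of tau
        return Relation.ALPHA_NEEDS_UPDATE
    if tau < alpha:   # tau is a proper subset of alpha
        return Relation.TAU_NEEDS_UPDATE
    return Relation.BOTH_NEED_UPDATE  # neither set contains the other


# Example: EN holds e1..e3, FR holds e1, e2 and its own edit e4.
print(compare_edit_sets({"e1", "e2", "e3"}, {"e1", "e2", "e4"}).value)
```

Note that the page content is never read: the classification depends only on which edit identifiers each page carries.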

This last case, in which neither edit set contains the other, means that a page can both be improved from the other page and be used to improve it. It leads to the key aspect that allows collaborative translation: no master language is imposed at any point in time. Contributors to the content are free to improve the pages at any time and the CLWE guarantees that their changes will not be lost in translation.

All these page relationships are used to display the links to the different other linguistic versions. They allow readers of the site to access higher quality content and also serve as an incentive to translate content.

The engine knows which pages need updating. However, translators need to see the content differences in order to replicate them. All wikis allow users to compare page versions and obtain this information. CLWE innovates by automatically selecting the best possible text difference from the stored translation history: the difference is obtained by comparing the current version of the source with the oldest version that includes an edit not yet incorporated in the target.
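The selection of the comparison base can be sketched as a scan over the source page's history. This is an illustrative sketch under stated assumptions: the history is assumed to be a list of edit sets ordered from oldest to newest, and the function name `diff_base_index` is hypothetical rather than taken from the CLWE code.

```python
from typing import Optional, Sequence


def diff_base_index(source_history: Sequence[set], target_edits: set) -> Optional[int]:
    """Return the index of the oldest source version that includes an edit
    not yet incorporated in the target, or None if the target lacks nothing.

    source_history is assumed to be ordered from oldest to newest, each
    entry being the edit set of one version of the source page.
    """
    for index, version_edits in enumerate(source_history):
        if not version_edits <= target_edits:  # some edit is missing in target
            return index
    return None


# Example: the source gained e2 in its second version and e3 in its third.
history = [{"e1"}, {"e1", "e2"}, {"e1", "e2", "e3"}]
print(diff_base_index(history, {"e1"}))  # index of the version that introduced e2
```

The text difference shown to the translator would then be computed between the version at that index and the current (last) version of the source.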

Because this lookup does not rely on the actual translation operations between two pages, the various edits can be incorporated from a different source every time and allow the translation community to adapt over time. While a pivot language is desired to reduce the translation effort, it is not imposed by the CLWE architecture.

When a translation is made from a source to a target, all edits that have been highlighted by the difference can be considered as incorporated in the target version. This operation is called propagation. The only other operation defined is the addition of a new edit, which will later on be propagated.

Because of the propagation model, any translation made must be complete. Because it does not look at the actual content of the pages, the CLWE does not know which edit was incorporated. This all-or-nothing simplification is presented to the translator as a choice between "Complete translation" and "Partial translation". In the first case a propagation occurs; in the second, no action is taken.
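The two operations, and the all-or-nothing rule, can be sketched as follows. This is a simplified illustration rather than the engine's code: the `Page` class, the functions `add_edit` and `record_translation`, and the integer edit identifiers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import count

_edit_ids = count(1)  # hypothetical source of fresh edit identifiers


@dataclass
class Page:
    language: str
    edits: set = field(default_factory=set)  # identifiers of incorporated edits


def add_edit(page: Page) -> int:
    """Operation 1: an original edit receives a new identifier on its page."""
    edit_id = next(_edit_ids)
    page.edits.add(edit_id)
    return edit_id


def record_translation(source: Page, target: Page, complete: bool) -> None:
    """Operation 2: propagation, performed only for a complete translation.

    A partial translation leaves the target's edit set untouched, because
    the engine cannot tell which of the edits were actually carried over.
    """
    if complete:
        target.edits |= source.edits


en, fr = Page("en"), Page("fr")
first = add_edit(en)
record_translation(en, fr, complete=False)  # partial: nothing is recorded
record_translation(en, fr, complete=True)   # complete: fr now includes `first`
print(fr.edits == en.edits)
```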

By allowing content to be added to any page and allowing translation to occur between any pair of pages, the information presented to the end user can become confusing. To provide insight into the quality of the content, it was necessary to find a measure that quantifies the size of the changes.

Finding an accurate measure is not an easy task. In natural languages, a single character change can completely change the meaning and complete paragraphs can be meaningless. For this reason, the measure found is bound to be imprecise and can only be used as an estimate.

The screenshots presented in the previous section contain percentages of "Up-to-date-ness". This measure indicates how many segments have been modified in the other linguistic versions. A modified segment is a segment that is either added or removed. The segments are counted as sentences. To account for the difference in effort required to add a segment and to remove one, additions and removals are weighted differently in the calculation.
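As a rough illustration of such a measure, the sketch below counts added and removed sentences between two versions of a text and turns them into a percentage. Everything specific here is an assumption made for the example and not the formula used by CLWE: the weights `ADD_WEIGHT` and `REMOVE_WEIGHT`, the naive sentence splitter, and the way the percentage is derived from the counts.

```python
import re
from difflib import SequenceMatcher

ADD_WEIGHT = 1.0     # assumed weight of an added sentence
REMOVE_WEIGHT = 0.5  # assumed weight of a removed sentence


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_changes(old_text: str, new_text: str) -> tuple:
    """Return (added, removed, unchanged) sentence counts."""
    old, new = split_sentences(old_text), split_sentences(new_text)
    added = removed = unchanged = 0
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            unchanged += i2 - i1
        else:  # 'replace', 'delete' or 'insert'
            removed += i2 - i1
            added += j2 - j1
    return added, removed, unchanged


def up_to_dateness(old_text: str, new_text: str) -> float:
    """Assumed formula: share of unchanged sentences against weighted changes."""
    added, removed, unchanged = count_changes(old_text, new_text)
    cost = ADD_WEIGHT * added + REMOVE_WEIGHT * removed
    total = cost + unchanged
    return 100.0 if total == 0 else 100.0 * unchanged / total


print(round(up_to_dateness("A one. B two.", "A one. B two. C three."), 1))
```

Because a rewritten sentence counts as one removal plus one addition, even a synonym substitution registers as a change, and on a short article it moves the percentage noticeably.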

Alternatively, segments could have been counted as words or paragraphs. However, a sentence is closer to the meaning of an idea and initial tests demonstrated that other techniques would make variations too large or too small. Using sentences is not perfect either. Especially on small articles, the replacement of a word by its synonym may cause significant variations in the measure displayed.

To resolve this issue, multiple techniques could be used. However, we have preferred to leave the measure as-is until the CLWE has been deployed on a large number of sites and more comparative data and user feedback are available. The potential solutions are:

  • Change segment size
  • Adapt change weight
  • Adaptive segment size or weight depending on the length of the article
  • Deeper content analysis to determine if the change modified the meaning
  • Use symbolic representation of the value rather than numeric to better represent the imprecise nature of the value

From alain... need to see what fits

NOTES FROM AD: Not sure that there is a whole lot to say here. In a sense, the Notations section kind of provided the main insight that forms the basis of the backend. LPH, are there additional meaningful details that we could say about how this is done?

This tracking approach presents a number of limitations, most of them minor.

NOTE FROM AD: LPH, when you first sent out the architecture document, you and I had day long chat about it, during which we discussed some of the limitations of the approach. I think, I was able to recall most of these points but not sure. Do you still have a transcript of that chat? The limitation I am thinking about was soe


Firstly, when more than one edit is translated in the same translation transaction, the system is not able to identify which parts of the translated text correspond to which edit. TODO: Use part of the usage scenario to illustrate that. For example, if Pierre translates edits e1, e2, and e3 from English to French, when Pedro later wants to update the Spanish page, he will only be able to translate these three edits at once.

TODO: Need to better phrase the second limitation below

Secondly, say EN = {e1, e2, e3}, FR = {e1, e2}. You translate ES from FR, and get ES = {e1, e2}. ES will be shown as needing updating from EN (because it's missing e3). As I recall, I think if you update ES from EN, the system will show you all three of e1, e2, and e3 in the EN, when in fact, it should only show you e3. Is that correct LPH? If so, can you remind me why that is (can you explain it using the {e1, e2, e3} notation)?

Thirdly, the approach assumes that users will never do original edits while in the midst of translating edits from one language to another. If a user does that, then the system will simply consider the original edits to be part of the translation task. In other words, no new edit identifier will be generated. The end result of this is that the system will not be able to notify users that other pages need to include that edit as well.

Fourthly, the system relies heavily on the end user to tell it when pages are synchronized. TODO: Talk about what bad things can happen when the user clicks on the Complete Translation button instead of the Partial Translation button, and vice-versa. There is a short discussion of how this could be fixed in the Related work section, which describes the problem. Maybe we should move that problem description here.


In practice, we haven't found the first limitation to be a big issue. The other two however seem like they could be problematic. Ideas for how to address them are proposed in the Future Work section.

End of current argumentation


NOTES FROM AD: The stuff in the remaining sections is stuff that had been written before I reorganized this page on 2008-04-09, and for which I didn't know if it should/could be threaded into the new reorganized argumentation. Need to evaluate this.

Maybe all we need is to reformat LPH's scenario a bit to use the format below, which communicates better than the impersonal approach. Also, add a bit about how prior art (LizzyWiki) succeeded in removing some of the constraints, but not all.

John creates an English page Welcome to this wiki. Pierre then translates it to French page Bienvenue à ce wiki. Later on, John adds ten sentences to English page Welcome to this wiki. Now, Josée, who does not speak English, also wants to add two words to the French page Bienvenue à ce wiki.

With a standard wiki, the two pages would be distinct and no one would be aware of the content evolution in the other linguistic versions. With the LizzyWiki approach, Josée is not allowed to add her two words to Bienvenue à ce wiki before she has translated the ten sentences added by John to the English version Welcome to this wiki. But Josée cannot do this because she does not read English. Even if she could, she might not be in the mood to translate ten English sentences just to be allowed to add two words to the French version.

In (Désilets et al., 2006) the authors also postulated that in order to support collaborative authoring and translation in more than two languages at a time, it might be necessary to impose the use of pivot languages as intermediaries between other languages, in order to provide stable points of references in an otherwise chaotic environment.

With the CLWE project, we deliberately ignore these constraints and allow authors to create original content on any linguistic version of any page, and at any time.

NOTES FROM AD: Not sure what to do with this content either


AD: Maybe some of this needs to be threaded back into the above argumentation, but I am not sure


In adding the required features to support change tracking, a few guidelines had to be followed:
  • Any contribution is worth translating until a translator says otherwise.
  • Only the final result matters. Intermediate steps can be ignored.
  • Content contributors should not have to worry about translation.

NOTES FROM AD: In a way, this breaks the argumentation a bit. In the Intro, we present the assumptions and constraints we are trying to relax. Those make very compelling "guiding principles". Then we get to this section and we present more "guiding principles". Are these new principles subsumed by the ones presented in the intro? If not, should we thread them into the intro, or are they better presented here? How can we present those new principles in a way that makes it clear to the reader that they are consistent with the previous principles? I have to admit, I don't see how the above three principles tie to a "simpler model of tracking". LPH said he will re-read this whole section (which he wrote as a sort of "stream of consciousness") and see if he can rework it and make it clearer, etc.

These three guidelines are tightly connected and point towards a single solution. As mentioned before, an attempt to translate any change individually to every single linguistic version is impractical due to the limited translation resources. By focussing on the final results, it is possible for a translator to catch up on the content of a source page in a single step and abstract away the different steps that led to the final content. In most cases, the translator can simply observe the changes that were made on the source page since the last time a synchronization occurred.

In a wiki content creation process, multiple edits are made, and not all of them are worth translating. Rather than having a single author, wiki pages are a collaborative work between many people. Someone initially writes the base content. Others add to it. In the process, many people make minor contributions that improve the quality of the content but may not be relevant for translators. Such changes include grammatical corrections and syntactic improvements. Some changes may only affect the formatting of the page. Determining which changes have to be translated is a complex task, and the line is very hard to draw.

A simple change to the syntax of a phrase may seem trivial, but if the previous formulation was ambiguous, a translator may already have made a wrong interpretation of it, in which case the translation would need to be updated. A potential solution to this would be to ask the content contributor to say whether or not a change should be translated. However, the content contributor may not have sufficient knowledge about the translation process to make the right decision. By requesting no information from the content contributors, it is possible to make the translation process as invisible as possible.

Because translators will translate an aggregate of changes rather than changes individually, it does not matter if a few trivial changes slip in. Very active translation communities propagating the changes often may get to translate very small changes that do not impact their version of the content. However, it is unlikely that a translation gets updated multiple times a day to match each change on a page.

The result of these guidelines is a very simplified model for change tracking. In the model, each change represents an idea by its author. As far as tracking is concerned, all changes are equal, as they all have to be propagated to the other linguistic versions. When a change is incorporated in a given linguistic version, all pages updating from the given page will also inherit the given change.

In a simplified manner, change propagation can be represented as a directed graph in which nodes are page versions and arcs are page evolutions. Arcs are used for both evolution between versions of a same page and to represent the translation of content from source to target language. In the graph, original content creation is attached to page versions. By following all arcs from the original page version node, all pages containing the change can be found.
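The graph traversal described above can be sketched with a breadth-first search. This is an illustration under stated assumptions: the graph is assumed to be a dictionary mapping each page version to the versions derived from it (by a later edit of the same page or by translation), and the version labels and the function name `versions_containing` are hypothetical.

```python
from collections import deque


def versions_containing(graph: dict, origin: str) -> set:
    """Return every page version reachable from the version where a change
    was first made, that is, every version that contains the change."""
    seen = {origin}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


# Arcs cover both page evolution (en1 -> en2) and translation (en1 -> fr1).
graph = {
    "en1": ["en2", "fr1"],
    "en2": ["es1"],
    "fr1": ["fr2"],
    "fr2": [],
    "es1": [],
}
print(sorted(versions_containing(graph, "en2")))
```

A change first made in `en2` reaches only `es1` in this example; the French versions descend from `en1` and therefore do not contain it yet.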

Consider this sample case using three languages. Any number of languages is supported. However, the resulting scenarios and representations would be too large and impractical for demonstration purposes.
