Translation Memory
Summary
Translation Memory (TM) is a database of previously translated text segments that computer-assisted translation (CAT) tools use to suggest matches for new content. It is the core productivity feature that distinguishes CAT tools from plain text editors.
How It Works
- Segmentation — Source text is split into segments (usually sentences) using SRX rules
- Lookup — For each new segment, the TM is searched for matches
- Matching — Matches are scored by similarity percentage (fuzzy matching)
- Suggestion — The best match is presented to the translator
- Storage — When the translator confirms a translation, it is stored in the TM
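The lookup, matching, and storage steps can be sketched in a few lines of Python. This is a minimal illustration, not how OmegaT or any other CAT tool implements its TM: the `TranslationMemory` class is hypothetical, the similarity score comes from the standard library's `difflib.SequenceMatcher` rather than a production fuzzy-matching algorithm, and the French translation is sample data.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy in-memory TM (hypothetical): maps source segments to translations."""

    def __init__(self):
        self.entries = {}  # source segment -> confirmed target segment

    def store(self, source, target):
        # Storage step: a confirmed translation is added (or overwritten).
        self.entries[source] = target

    def lookup(self, segment, threshold=0.70):
        # Lookup and matching: score every stored source against the new
        # segment and keep the best one at or above the threshold.
        best = None
        for source, target in self.entries.items():
            score = SequenceMatcher(None, segment, source).ratio()
            if score >= threshold and (best is None or score > best[0]):
                best = (score, source, target)
        return best  # None means "no match"


tm = TranslationMemory()
tm.store("Hello world", "Bonjour le monde")

exact = tm.lookup("Hello world")      # identical source, score 1.0
fuzzy = tm.lookup("Hello world!")     # one extra character, score below 1.0
missing = tm.lookup("Completely unrelated sentence")  # nothing close enough
```

A linear scan is fine for a sketch; a real TM with many thousands of entries would need an index to keep lookups fast.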
Match Types
| Type | Description | Example |
|---|---|---|
| Exact match (100%) | Identical source segment found in TM | New segment "Hello world" matches stored "Hello world" |
| Fuzzy match (70-99%) | Similar but not identical segment | New segment "Hello world" matches stored "Hello everyone" |
| Context match (ICE) | Exact match with identical surrounding context | Highest confidence match |
| No match | Nothing similar found in TM | Translator starts from scratch |
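As a rough sketch, the table can be read as a classification over a similarity score. The function name, the 0.0 to 1.0 score scale, and the `context_matches` flag are assumptions made for illustration; the 70% floor is taken from the table, and actual tools let the translator configure it.

```python
def classify_match(score, context_matches=False, fuzzy_floor=0.70):
    """Map a similarity score (0.0 to 1.0) to one of the match types above."""
    if score >= 1.0:
        # An exact match whose surrounding segments also match is a
        # context (ICE) match; otherwise it is a plain exact match.
        return "context" if context_matches else "exact"
    if score >= fuzzy_floor:
        return "fuzzy"
    return "none"


for score, ctx in [(1.0, True), (1.0, False), (0.85, False), (0.40, False)]:
    print(score, ctx, classify_match(score, ctx))
```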
TMX Format
Translation Memory eXchange (TMX) is the standard XML format for storing and exchanging translation memories. OmegaT uses TMX for:
- project_save.tmx — Current project's translation memory
- project_mem.tmx — Accumulated project memory carried forward
- External TM files — Loaded from the /tm/ directory for reference
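To show what a TMX file looks like, the sketch below parses a minimal hand-written TMX document with Python's standard `xml.etree.ElementTree`. The sample content and the `read_tmx` helper are illustrative assumptions, not taken from an OmegaT project; real files carry more header metadata and may contain inline markup inside `<seg>`, which this reader ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Minimal sample: one translation unit (<tu>) with two variants (<tuv>).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx(text, source_lang, target_lang):
    """Return {source: target} pairs for the requested language pair."""
    root = ET.fromstring(text)
    pairs = {}
    for tu in root.iter("tu"):
        # Collect the plain text of each language variant in this unit.
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if source_lang in segs and target_lang in segs:
            pairs[segs[source_lang]] = segs[target_lang]
    return pairs


print(read_tmx(SAMPLE_TMX, "en", "fr"))
```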
Translation Leveraging
When source files are updated:
1. OmegaT compares the new segments with the previous version
2. Segments that haven't changed keep their translations
3. Segments that changed are marked for review (with a fuzzy match from the old translation)
4. New segments have no prior translation
5. Removed segments are archived
This ensures that updating a project doesn't lose existing translation work.
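The update logic can be illustrated with a small sketch. It is an assumption-laden approximation, not OmegaT's implementation: the `leverage` function is hypothetical, similarity is again measured with `difflib.SequenceMatcher`, and the sample translations are made up.

```python
from difflib import SequenceMatcher


def leverage(old_translations, new_segments, threshold=0.70):
    """Carry translations from an old source version over to a new one."""
    kept, review, untranslated = {}, {}, []
    for segment in new_segments:
        if segment in old_translations:
            # Unchanged segment: the translation is reused as-is.
            kept[segment] = old_translations[segment]
            continue
        # Changed or new: look for the closest old source segment.
        best_score, best_source = 0.0, None
        for source in old_translations:
            score = SequenceMatcher(None, segment, source).ratio()
            if score > best_score:
                best_score, best_source = score, source
        if best_source is not None and best_score >= threshold:
            # Close enough: offer the old translation for review.
            review[segment] = old_translations[best_source]
        else:
            untranslated.append(segment)
    # Old segments that no longer appear in the source are archived.
    current = set(new_segments)
    archived = {s: t for s, t in old_translations.items() if s not in current}
    return kept, review, untranslated, archived


old = {
    "Click Save.": "Cliquez sur Enregistrer.",
    "Open the file.": "Ouvrez le fichier.",
    "Old note.": "Ancienne note.",
}
new = ["Click Save.", "Open the log file.", "Restart the application now."]
kept, review, untranslated, archived = leverage(old, new)
```

In this sketch the old version of a changed segment also lands in `archived`, since its exact source text is gone from the new file.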
Glossaries vs Translation Memory
| | Translation Memory | Glossary / Term Base |
|---|---|---|
| Unit | Full segments (sentences/paragraphs) | Individual terms/phrases |
| Purpose | Reuse previous translations | Ensure consistent terminology |
| Match type | Fuzzy + exact | Exact term matching |
| Format | TMX (XML) | Plain text, CSV, or specialized formats |
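For contrast with segment-level fuzzy matching, a glossary lookup can be sketched as a whole-word term search inside a segment. The `find_terms` helper and the sample glossary are hypothetical; real term bases often add stemming or other normalization that is left out here.

```python
import re


def find_terms(segment, glossary):
    """Return glossary entries whose source term occurs in the segment."""
    hits = {}
    for term, translation in glossary.items():
        # Whole-word, case-insensitive match; no similarity scoring involved.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, segment, re.IGNORECASE):
            hits[term] = translation
    return hits


glossary = {"translation memory": "memoire de traduction (sample)", "segment": "segment"}
print(find_terms("Open the Translation Memory.", glossary))
```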
SRX Segmentation
Segmentation Rules eXchange (SRX) is the standard format for defining how text should be split into segments. Rules account for:
- Sentence-ending punctuation (., !, ?)
- Abbreviations that shouldn't trigger breaks (Mr., e.g.)
- Language-specific rules (for example, Chinese and Japanese do not put spaces between words or sentences)
- Custom exceptions
OmegaT uses a Segmenter class with SRX rules to split paragraphs into sentences and glue them back together with proper spacing for the target language.
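The way SRX rules combine break and exception patterns can be sketched as follows. This is not OmegaT's `Segmenter` and does not read SRX files: the rule tuples only mirror the SRX idea of a break flag with a "before break" and an "after break" regular expression, and the rule list itself is a made-up example.

```python
import re

# Each rule is (is_break, before_pattern, after_pattern), checked in order.
# Exception rules come first so they win over the general break rule.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr|e\.g|i\.e)\.", r"\s"),  # no break after abbreviations
    (True, r"[.!?]", r"\s+"),                        # break after ., ! or ?
]


def segment(text, rules=RULES):
    """Split text into segments; the first rule matching a position decides."""
    compiled = [
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # "before" must match the text ending at pos, "after" the text from pos.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins, later rules are skipped
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


sample = "Mr. Smith arrived. He said hello! Did you see, e.g. the report?"
for part in segment(sample):
    print(part)
```

Putting exception rules ahead of the general break rule is what keeps "Mr." and "e.g." from ending a segment.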
See Also
- Cat Tools
- fuzzy-matching
- Filter System
- omegaT