Translation Memory

Summary

Translation Memory (TM) is a database of previously translated text segments that computer-assisted translation (CAT) tools use to suggest matches for new content. It is the core productivity feature that distinguishes CAT tools from plain text editors.

How It Works

  1. Segmentation: Source text is split into segments (usually sentences) using SRX rules
  2. Lookup: For each new segment, the TM is searched for matches
  3. Matching: Matches are scored by similarity percentage (fuzzy matching)
  4. Suggestion: The best match is presented to the translator
  5. Storage: When the translator confirms a translation, it is stored in the TM
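
The lookup, matching, suggestion, and storage steps above can be sketched in a few lines of Python. This is only an illustration: the `similarity` and `lookup` helpers are hypothetical names, the TM is a plain dictionary, and the score comes from `difflib.SequenceMatcher` on characters, whereas real CAT tools typically use their own token-based scoring.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score between two segments (character-based)."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def lookup(tm: dict, segment: str, threshold: int = 70) -> Optional[tuple]:
    """Return (score, source, target) for the best TM match, or None.

    The 70% threshold is an assumption taken from the usual fuzzy-match range.
    """
    best = None
    for source, target in tm.items():
        score = similarity(segment, source)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


tm = {}
# Storage: the translator confirms a segment, so it goes into the TM.
tm["Hello world"] = "Bonjour le monde"

# Lookup, matching, and suggestion for later segments.
exact = lookup(tm, "Hello world")                 # identical source text
fuzzy = lookup(tm, "Hello world!")                # similar, not identical
missing = lookup(tm, "Completely unrelated text") # nothing above threshold
```

Here `exact` carries a score of 100, `fuzzy` carries a score below 100 but above the threshold, and `missing` is `None`, so the translator would start from scratch.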

Match Types

| Type | Description | Example |
|------|-------------|---------|
| Exact match (100%) | Identical source segment found in TM | "Hello world" → "Hello world" |
| Fuzzy match (70-99%) | Similar but not identical segment | "Hello world" → "Hello everyone" |
| Context match (ICE) | Exact match with identical surrounding context | Highest confidence match |
| No match | Nothing similar found in TM | Translator starts from scratch |
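
As a rough illustration of how these categories relate to a similarity score, the sketch below maps a score to a match type. The `classify` function and its `context_matches` flag are hypothetical; the 70% cutoff follows the range given above, and individual tools may use different thresholds.

```python
def classify(score: int, context_matches: bool = False, threshold: int = 70) -> str:
    """Map a 0-100 similarity score to one of the match types described above.

    context_matches is assumed to be True when the neighbouring segments
    are also identical to those stored alongside the TM entry.
    """
    if score == 100:
        return "context" if context_matches else "exact"
    if score >= threshold:
        return "fuzzy"
    return "none"


labels = [classify(100, context_matches=True), classify(100), classify(85), classify(40)]
```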

TMX Format

Translation Memory eXchange (TMX) is the standard XML format for storing and exchanging translation memories. OmegaT uses TMX for:

- project_save.tmx: Current project's translation memory
- project_mem.tmx: Accumulated project memory carried forward
- External TM files: Loaded from the /tm/ directory for reference
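
To make the format concrete, here is a minimal sketch that reads a small TMX document with Python's standard `xml.etree.ElementTree`. The sample content and the `read_tmx` helper are made up for illustration; a real file written by a CAT tool will carry more header attributes, properties, and inline tags.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative TMX 1.4 content: one translation unit (tu) with two variants (tuv).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx(text: str, src: str, tgt: str) -> dict:
    """Return {source segment: target segment} for one language pair."""
    pairs = {}
    for tu in ET.fromstring(text).iter("tu"):
        by_lang = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                # itertext() also collects text nested inside inline tags.
                by_lang[lang.lower()] = "".join(seg.itertext())
        if src in by_lang and tgt in by_lang:
            pairs[by_lang[src]] = by_lang[tgt]
    return pairs


pairs = read_tmx(SAMPLE_TMX, "en", "fr")
```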

Translation Leveraging

When source files are updated:

1. OmegaT compares the new segments with the previous version
2. Segments that haven't changed keep their translations
3. Segments that changed are marked for review (with a fuzzy match from the old translation)
4. New segments have no prior translation
5. Removed segments are archived

This ensures that updating a project doesn't lose existing translation work.
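
The leveraging steps can be sketched as a comparison between the old TM and the updated list of source segments. This is a simplified illustration, not OmegaT's actual logic: `leverage` is a hypothetical helper, and the 0.7 cutoff passed to `difflib.get_close_matches` is an assumed fuzzy threshold.

```python
from difflib import get_close_matches


def leverage(old_tm: dict, new_segments: list, cutoff: float = 0.7):
    """Split updated source segments into kept, review, new, and archived."""
    kept, review, new = {}, {}, []
    for seg in new_segments:
        if seg in old_tm:
            kept[seg] = old_tm[seg]              # unchanged: translation is reused
            continue
        close = get_close_matches(seg, list(old_tm), n=1, cutoff=cutoff)
        if close:
            review[seg] = old_tm[close[0]]       # changed: old translation as a suggestion
        else:
            new.append(seg)                      # new: no prior translation
    current = set(new_segments)
    # Old source segments that no longer appear are set aside, not deleted.
    archived = {s: t for s, t in old_tm.items() if s not in current}
    return kept, review, new, archived


old_tm = {
    "Save the file.": "Enregistrez le fichier.",
    "Close the document.": "Fermez le document.",
    "Print the page.": "Imprimez la page.",
}
updated = ["Save the file.", "Close the current document.", "Export the report as PDF."]
kept, review, new, archived = leverage(old_tm, updated)
```

In this example the first segment is kept, the second is flagged for review with the old translation attached, and the third is new. The old wording of the changed segment and the removed segment both end up in `archived`.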

Glossaries vs Translation Memory

| | Translation Memory | Glossary / Term Base |
|---|---|---|
| Unit | Full segments (sentences/paragraphs) | Individual terms/phrases |
| Purpose | Reuse previous translations | Ensure consistent terminology |
| Match type | Fuzzy + exact | Exact term matching |
| Format | TMX (XML) | Plain text, CSV, or specialized formats |

SRX Segmentation

Segmentation Rules eXchange (SRX) is the standard format for defining how text should be split into segments. Rules account for:

- Sentence-ending punctuation (., !, ?)
- Abbreviations that shouldn't trigger breaks (Mr., e.g.)
- Language-specific rules (CJK languages don't use spaces between words)
- Custom exceptions

OmegaT uses a Segmenter class with SRX rules to split paragraphs into sentences and glue them back together with proper spacing for the target language.
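
The way SRX rules interact can be illustrated with a small Python sketch. Each rule pairs a "before break" pattern with an "after break" pattern and says whether to break; at every position the first matching rule decides, which is how exceptions such as abbreviations override the general punctuation rule. The rule set and the `segment` function below are illustrative assumptions, not OmegaT's shipped rules or its Segmenter API.

```python
import re

# (is_break, before_pattern, after_pattern). Earlier rules take precedence.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr)\.", r"\s"),   # exception: titles
    (False, r"\b(?:e\.g|i\.e)\.", r"\s"),   # exception: Latin abbreviations
    (True, r"[.!?]", r"\s+"),               # break after sentence-ending punctuation
]

COMPILED = [
    (is_break, re.compile(f"(?:{before})\\Z"), re.compile(after))
    for is_break, before, after in RULES
]


def segment(text: str) -> list:
    """Split text into segments using ordered break / no-break rules."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in COMPILED:
            # A rule applies when the text before pos ends with its "before"
            # pattern and the text at pos starts with its "after" pattern.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides for this position
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


sentences = segment("Mr. Smith arrived. He was late, e.g. by an hour! Why?")
```

This sketch strips the whitespace between segments for readability. A segmenter that needs to glue translated sentences back together would instead record what separated them.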

See Also