Translation Memory

Summary

Translation Memory (TM) is a database of previously translated text segments that computer-assisted translation (CAT) tools use to suggest matches for new content. It is the core productivity feature that distinguishes CAT tools from plain text editors.

How It Works

1. Segmentation: Source text is split into segments (usually sentences) using SRX rules.
2. Lookup: For each new segment, the TM is searched for matches.
3. Matching: Matches are scored by similarity percentage (fuzzy matching).
4. Suggestion: The best match is presented to the translator.
5. Storage: When the translator confirms a translation, it is stored in the TM.
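
The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence splitter is a naive regex rather than real SRX rules, `difflib.SequenceMatcher` stands in for whatever fuzzy scorer a given CAT tool uses, and the names `segment`, `similarity`, and `TranslationMemory` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def segment(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def similarity(a, b):
    """Similarity as a 0-100 percentage; a stand-in for a real fuzzy scorer."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


class TranslationMemory:
    """Hypothetical in-memory TM: source segment -> target segment."""

    def __init__(self, threshold=70):
        self.entries = {}
        self.threshold = threshold

    def lookup(self, source):
        """Return (score, tm_source, tm_target) for the best match, or None."""
        best = None
        for tm_source, tm_target in self.entries.items():
            score = similarity(source, tm_source)
            if score >= self.threshold and (best is None or score > best[0]):
                best = (score, tm_source, tm_target)
        return best

    def store(self, source, target):
        """Called when the translator confirms a translation."""
        self.entries[source] = target


tm = TranslationMemory()
tm.store("Save the file.", "Enregistrez le fichier.")

for seg in segment("Save the file. Save the files."):
    print(seg, "->", tm.lookup(seg))
```

The first segment comes back as a 100% match and the second as a fuzzy match carrying the stored translation as a suggestion. A linear scan over all entries is fine for a sketch; real tools index the TM so lookups stay fast on large memories.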

Match Types

Type	Description	Example (new segment → TM segment) or note
Exact match (100%)	Identical source segment found in TM	"Hello world" → "Hello world"
Fuzzy match (70-99%)	Similar but not identical segment	"Hello world" → "Hello everyone"
Context match (ICE)	Exact match with identical surrounding context	Highest confidence match
No match	Nothing similar found in TM	Translator starts from scratch
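
As a rough illustration of the table, the function below maps a similarity score to a match type. The name `classify_match`, the `context_identical` flag, and the default threshold of 70 are assumptions taken from the table, not any tool's actual API; many tools let the user change the fuzzy threshold.

```python
def classify_match(score, context_identical=False, fuzzy_threshold=70):
    """Map a similarity score (0-100) to one of the match types in the table."""
    if score == 100:
        # An exact match is promoted to a context match only when the
        # surrounding segments are identical as well.
        return "context" if context_identical else "exact"
    if score >= fuzzy_threshold:
        return "fuzzy"
    return "none"


print(classify_match(100, context_identical=True))  # context
print(classify_match(100))                          # exact
print(classify_match(85))                           # fuzzy
print(classify_match(40))                           # none
```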

TMX Format

Translation Memory eXchange (TMX) is the standard XML format for storing and exchanging translation memories. OmegaT uses TMX for:

- project_save.tmx: Current project's translation memory
- project_mem.tmx: Accumulated project memory carried forward
- External TM files: Loaded from the /tm/ directory for reference
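
To make the format concrete, here is a minimal sketch that reads source/target pairs from a TMX document with Python's standard `xml.etree.ElementTree`. The sample content and the helper name `read_tmx` are made up for illustration; the element names (`tu`, `tuv`, `seg`) and the `xml:lang` attribute come from the TMX 1.4 format. Real files may carry extra metadata and inline tags that this sketch simply flattens.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: one translation unit (tu) with two language variants (tuv).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx(xml_text, src_lang, tgt_lang):
    """Return a list of (source, target) pairs for the given language pair."""
    pairs = []
    root = ET.fromstring(xml_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens any inline elements inside <seg> to plain text.
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


print(read_tmx(SAMPLE_TMX, "en", "fr"))  # [('Hello world', 'Bonjour le monde')]
```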

Translation Leveraging

When source files are updated:

1. OmegaT compares new segments with the previous version
2. Segments that haven't changed keep their translations
3. Segments that changed are marked for review (with a fuzzy match from the old translation)
4. New segments have no prior translation
5. Removed segments are archived

This ensures that updating a project doesn't lose existing translation work.
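
The update logic described in this section can be approximated as follows. This is a sketch, not OmegaT's implementation: the function name `leverage`, the status labels, the 70% threshold, and the use of `difflib.SequenceMatcher` as the fuzzy scorer are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def leverage(new_segments, old_translations, threshold=70):
    """Classify updated source segments against the previous translations.

    old_translations maps old source segment -> target translation.
    Returns (report, archived), where report is a list of
    (segment, status, suggested_translation) tuples.
    """
    report, used = [], set()
    for seg in new_segments:
        if seg in old_translations:
            # Unchanged segment: keep its translation as-is.
            report.append((seg, "unchanged", old_translations[seg]))
            used.add(seg)
            continue
        # Changed or new: find the closest old segment to offer as a suggestion.
        best_score, best_src = 0.0, None
        for old_src in old_translations:
            score = SequenceMatcher(None, seg, old_src).ratio() * 100
            if score > best_score:
                best_score, best_src = score, old_src
        if best_src is not None and best_score >= threshold:
            report.append((seg, "review", old_translations[best_src]))
            used.add(best_src)
        else:
            report.append((seg, "new", None))
    # Old segments that were neither kept nor reused are archived, not deleted.
    archived = {s: t for s, t in old_translations.items() if s not in used}
    return report, archived


old = {
    "Click the Save button.": "Cliquez sur le bouton Enregistrer.",
    "Close the window.": "Fermez la fenêtre.",
    "Restart the app.": "Redémarrez l'application.",
}
new = ["Click the Save button.", "Close the main window.", "Enable dark mode in settings."]

report, archived = leverage(new, old)
for seg, status, suggestion in report:
    print(f"{status:10} {seg!r} -> {suggestion!r}")
print("archived:", list(archived))
```

In this example the first segment is unchanged, the second is flagged for review with the old translation as a suggestion, the third is new, and "Restart the app." is archived because it no longer appears in the source.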

Glossaries vs Translation Memory

Aspect	Translation Memory	Glossary / Term Base
Unit	Full segments (sentences/paragraphs)	Individual terms/phrases
Purpose	Reuse previous translations	Ensure consistent terminology
Match type	Fuzzy + exact	Exact term matching
Format	TMX (XML)	Plain text, CSV, or specialized formats
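
To contrast with segment-level fuzzy matching, a glossary lookup can be as simple as checking whether known terms occur in a segment. The sketch below assumes a glossary held as a plain dictionary and whole-word, case-insensitive matching; `GLOSSARY` and `find_terms` are hypothetical names, and real term bases often add stemming or morphology-aware matching.

```python
import re

# Hypothetical glossary: source term -> preferred target term.
GLOSSARY = {
    "translation memory": "mémoire de traduction",
    "segment": "segment",
}


def find_terms(segment, glossary):
    """Return (term, translation) pairs whose source term occurs in the segment."""
    hits = []
    for term, translation in glossary.items():
        # \b keeps matching at whole-word level, so "segments" does not hit "segment".
        if re.search(rf"\b{re.escape(term)}\b", segment, flags=re.IGNORECASE):
            hits.append((term, translation))
    return hits


print(find_terms("Open the Translation Memory panel.", GLOSSARY))
```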

SRX Segmentation

Segmentation Rules eXchange (SRX) is the standard format for defining how text should be split into segments. Rules account for:

- Sentence-ending punctuation (., !, ?)
- Abbreviations that shouldn't trigger breaks (Mr., e.g.)
- Language-specific rules (CJK languages don't use spaces between words)
- Custom exceptions

OmegaT uses a Segmenter class with SRX rules to split paragraphs into sentences and glue them back together with proper spacing for the target language.
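
The rule model behind SRX can be illustrated with a small Python sketch. Each SRX rule pairs a "before break" and an "after break" regular expression with a break/no-break flag, and the first rule that matches at a position decides. The rule list and the function `split_segments` below are assumptions for illustration; they are not OmegaT's Segmenter class and not a parser for real SRX files.

```python
import re

# (is_break, before_break_regex, after_break_regex), checked in order.
# Exception (no-break) rules come first so they can pre-empt break rules.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr|e\.g|i\.e)\.", r"\s"),  # abbreviations: no break
    (True, r"[.!?]", r"\s"),                          # sentence-ending punctuation
    (True, r"[。！？]", r""),                           # CJK: no following space needed
]


def split_segments(text, rules=RULES):
    """Split text into segments using ordered break / no-break rules."""
    compiled = [
        (is_break, re.compile(f"(?:{before})$"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        head = text[:pos]
        for is_break, before, after in compiled:
            # A rule applies when the text before pos ends with its "before"
            # pattern and the text at pos starts with its "after" pattern.
            if before.search(head) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, whether it breaks or not
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(split_segments("Mr. Smith arrived. He sat down! Why?"))
print(split_segments("你好。再见。"))
```

The first call keeps "Mr." attached to its sentence because the exception rule matches before the general break rule; the second splits on the CJK full stop even though no space follows it. Unlike this sketch, which strips whitespace between segments, a real segmenter has to remember that spacing so the translated sentences can be joined back correctly.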