Filter System

Summary

OmegaT's filter system reads a wide variety of file formats, extracts their translatable content, and writes the translated documents back. Supported formats range from .po and .properties to .html, .xml, .xlsx, .docx, OpenOffice files, and many more. Each filter is two-fold: it both reads and writes its format.

How Filters Work

A filter class can:

  1. Read: parse a document in a given format
  2. Extract: pull out translatable content as segments
  3. Write: rebuild the document, replacing translatable content with translations

Key invariant: Filters must be two-fold (read & write the same format).
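
The read/extract/write cycle can be illustrated with a minimal sketch. The interface TwoFoldFilter and the class KeyValueFilter below are hypothetical names invented for this example, not OmegaT's actual filter API; the toy format treats everything after the first "=" on a line as translatable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical two-fold contract (an assumption, not OmegaT's real interface). */
interface TwoFoldFilter {
    /** Read + extract: return the translatable segments of the document. */
    List<String> extract(List<String> lines);

    /** Write: rebuild the document with each segment replaced by its translation. */
    List<String> write(List<String> lines, UnaryOperator<String> translate);
}

/** Toy filter for "key=value" lines: only the value is translatable. */
class KeyValueFilter implements TwoFoldFilter {
    @Override
    public List<String> extract(List<String> lines) {
        List<String> segments = new ArrayList<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq >= 0) {
                segments.add(line.substring(eq + 1));
            }
        }
        return segments;
    }

    @Override
    public List<String> write(List<String> lines, UnaryOperator<String> translate) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq >= 0) {
                // Keep the key and separator, swap in the translated value.
                out.add(line.substring(0, eq + 1) + translate.apply(line.substring(eq + 1)));
            } else {
                // Lines without a key are copied through unchanged.
                out.add(line);
            }
        }
        return out;
    }
}
```

Because the same class parses and regenerates the format, everything that is not a segment (keys, comments, layout) survives the round trip untouched, which is the point of the two-fold invariant.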

FilterMaster

FilterMaster is the central organizer:

  • Maintains the registry of all available filters
  • Detects which filter to use for a given file
  • Routes files to appropriate filters based on extension and content

File Detection

OmegaT distinguishes files by:

  • File extension: e.g., *.txt, *.po, *.html
  • File content: some formats require content inspection
  • Filter instantiation: a single filter class can be instantiated multiple times with different parameters (e.g., a text file filter with different encodings)
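
A rough sketch of how such detection could be organized is shown below. FilterInstance and FilterRegistrySketch are hypothetical classes made up for this illustration; they are not FilterMaster's real data structures, and matching on a plain extension plus a check of the first characters of the file is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical registry entry: one configured instance of a filter class. */
final class FilterInstance {
    final String filterName;              // which filter class this instance belongs to
    final String extension;               // accepted extension, without the dot
    final String encoding;                // per-instance parameter
    final Predicate<String> contentCheck; // inspects the start of the file

    FilterInstance(String filterName, String extension, String encoding,
                   Predicate<String> contentCheck) {
        this.filterName = filterName;
        this.extension = extension;
        this.encoding = encoding;
        this.contentCheck = contentCheck;
    }
}

/** Hypothetical stand-in for the registry role described for FilterMaster. */
final class FilterRegistrySketch {
    private final List<FilterInstance> instances = new ArrayList<>();

    void register(FilterInstance instance) {
        instances.add(instance);
    }

    /** Returns the first instance whose extension and content check both accept the file. */
    Optional<FilterInstance> detect(String fileName, String head) {
        String lower = fileName.toLowerCase();
        for (FilterInstance i : instances) {
            if (lower.endsWith("." + i.extension) && i.contentCheck.test(head)) {
                return Optional.of(i);
            }
        }
        return Optional.empty();
    }
}
```

Registering the same filter name twice, for example once for "txt" with ISO-8859-1 and once for "utf8" with UTF-8, mirrors the idea of multiple instances of one filter class. Registering a content-checking instance ahead of a catch-all one for the same extension mirrors detection by file content.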

Filename Patterns

Input patterns use DOS-style wildcards:

  • *.txt: all files with the "txt" extension
  • read*: all files starting with "read"
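
One common way to evaluate DOS-style masks is to translate them into a regular expression. The sketch below does that; the class name WildcardPattern is hypothetical, and both the support for "?" (a single character) and the case-insensitive matching are assumptions that the text above does not state.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of OmegaT: DOS-style mask to regex. */
class WildcardPattern {
    /** Converts a mask to a regex: '*' matches any run, '?' one character (assumed). */
    static Pattern compile(String mask) {
        StringBuilder regex = new StringBuilder();
        for (char c : mask.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote everything else so '.' and similar are taken literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Case-insensitive matching is an assumption made for this sketch.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** True if the whole file name matches the mask. */
    static boolean matches(String mask, String fileName) {
        return compile(mask).matcher(fileName).matches();
    }
}
```

With this sketch, WildcardPattern.matches("*.txt", "notes.txt") and WildcardPattern.matches("read*", "readme.md") both hold, while "notes.txt.bak" does not match "*.txt" because the whole name must match.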

Output patterns use variable substitution:

| Variable | Description |
|---|---|
| ${filename} | Full input filename (default) |
| ${nameOnly} | Name without extension |
| ${extension} | File extension |
| ${sourceLanguage} | Project's source language |
| ${targetLanguage} | Project's target language |

Example: Java Resource Bundles use ${nameOnly}_${targetLanguage}.${extension} to produce Messages_fr.properties from Messages.properties.
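
The substitution in this example can be sketched in a few lines. TargetNamePattern and its expand method are hypothetical names for illustration only; splitting the filename at the last dot is an assumption about how ${nameOnly} and ${extension} are derived.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not OmegaT code: expands output filename patterns. */
class TargetNamePattern {
    static String expand(String pattern, String inputName,
                         String sourceLang, String targetLang) {
        // Assumption: name and extension are split at the last dot.
        int dot = inputName.lastIndexOf('.');
        String nameOnly = dot > 0 ? inputName.substring(0, dot) : inputName;
        String extension = dot > 0 ? inputName.substring(dot + 1) : "";

        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("${filename}", inputName);
        vars.put("${nameOnly}", nameOnly);
        vars.put("${extension}", extension);
        vars.put("${sourceLanguage}", sourceLang);
        vars.put("${targetLanguage}", targetLang);

        // Replace each variable literally; unknown text is left as-is.
        String result = pattern;
        for (Map.Entry<String, String> e : vars.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

Calling expand("${nameOnly}_${targetLanguage}.${extension}", "Messages.properties", "en", "fr") yields Messages_fr.properties, matching the resource bundle example above.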

XML Filter Processing

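OmegaT's XML handling classifies tags, decides which part of each paragraph is translatable, and treats whitespace according to how the XML is laid out: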

Tag Classification

  • Paragraph tags — Declare new paragraphs; don't define translatable/untranslatable parts
  • Intact tags — Mark content that should NOT be translated
  • Paired tags — Opening/closing tag pairs (e.g., <a>...</a>)
  • Content-based tags — Tags that must always be preserved regardless of position

Tag Processing Flow

  1. Handler.java collects all tags and texts
  2. On paragraph tag, calls translateAndFlush()
  3. Entry.detectTags() determines which parts are translatable
  4. Finds first (textStart) and last (textEnd) text elements, skipping spaces-only
  5. Expands markers to include paired tags inside the text range
  6. Content-based tags are always preserved
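
Steps 4 and 5 can be sketched as follows. This is one possible reading of the flow, not the actual Entry.detectTags() code: the class TagRangeSketch and its Token type are hypothetical, "expanding" is interpreted as widening the range whenever one half of a tag pair lies inside it and the other half outside, and content-based tags are not modeled.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the range detection; not OmegaT's implementation. */
class TagRangeSketch {
    /** Hypothetical token: a text run, or an opening/closing tag name. */
    static final class Token {
        enum Kind { TEXT, OPEN, CLOSE }
        final Kind kind;
        final String value;
        Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    /** Returns {start, end} (inclusive) of the translatable range, or null if there is no text. */
    static int[] detectRange(List<Token> tokens) {
        // Step 4: first and last text tokens that are not spaces-only.
        int start = -1;
        int end = -1;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.kind == Token.Kind.TEXT && !t.value.trim().isEmpty()) {
                if (start < 0) {
                    start = i;
                }
                end = i;
            }
        }
        if (start < 0) {
            return null;
        }

        // Pair opening and closing tags with a stack; partner[i] is the match or -1.
        int[] partner = new int[tokens.size()];
        Arrays.fill(partner, -1);
        Deque<Integer> open = new ArrayDeque<>();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.kind == Token.Kind.OPEN) {
                open.push(i);
            } else if (t.kind == Token.Kind.CLOSE && !open.isEmpty()
                    && tokens.get(open.peek()).value.equals(t.value)) {
                int o = open.pop();
                partner[o] = i;
                partner[i] = o;
            }
        }

        // Step 5 (as interpreted here): widen until no pair straddles the boundary.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = start; i <= end; i++) {
                int p = partner[i];
                if (p >= 0 && p < start) {
                    start = p;
                    changed = true;
                }
                if (p > end) {
                    end = p;
                    changed = true;
                }
            }
        }
        return new int[] {start, end};
    }
}
```

For a token sequence corresponding to "  <b>Hello</b> world ", the text range initially runs from "Hello" to " world"; because the closing </b> lies inside it, the range is widened to take in the opening <b> as well, while the surrounding spaces-only text stays outside.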

Spaces Processing

| XML Type | Handling |
|---|---|
| Unformatted | Spaces are read literally; "Remove leading/trailing whitespace" can be disabled |
| Formatted | Impossible to distinguish formatting spaces from real spaces; "Remove leading/trailing whitespace" must be enabled |

TMXReader2/TMXWriter2 use a hybrid approach: segment text is unformatted, other tags are formatted — giving nice-looking XML without space issues.

Filter Types

OmegaT includes filters for:

  • Plain text: .txt, .csv (with configurable encoding)
  • Web: .html, .htm, .xhtml
  • XML: .xml, .svg, and other XML-based formats
  • Localization: .po (gettext), .properties (Java resource bundles)
  • Office: OpenOffice/LibreOffice formats (.odt, .ods, .odp)
  • Microsoft: .docx, .xlsx (via additional modules)
  • DTP: .idml (InDesign)
  • Software: .strings (macOS), .resx, .json
  • Subtitle: .srt, .vtt

Creating Custom Filters

Custom filters can be distributed as plugins:

  1. Implement the filter interface
  2. Package as a .jar with a manifest entry
  3. Register via Core.registerFilterClass(MyFilter.class) in loadPlugins()
  4. Define input/output filename patterns
