LLM PDF/ODT Translator

Proof-of-concept script for LLM-based offline document translation. It talks to a local Ollama chat API endpoint and supplies context by replaying a chunked window of the recent translation history.
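The core idea can be pictured with a minimal sketch like the following. The system prompt, history format, and function name are illustrative assumptions, not necessarily what pdftranslate.py actually does; only the /api/chat request/response shape is standard Ollama:

import json
import urllib.request

def translate_chunk(chunk, history, model="llama3.1:latest",
                    api="http://localhost:11434/api/chat"):
    # Hypothetical helper: replay recent original/translation pairs as
    # prior chat turns, so the model sees consistent terminology and style.
    messages = [{"role": "system",
                 "content": "Translate the user's text from English to German. "
                            "Reply with the translation only."}]
    for source, translation in history:
        messages.append({"role": "user", "content": source})
        messages.append({"role": "assistant", "content": translation})
    messages.append({"role": "user", "content": chunk})
    payload = json.dumps({"model": model, "messages": messages,
                          "stream": False}).encode()
    req = urllib.request.Request(api, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]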

The translation quality seems quite underwhelming, at least for English into German with several general-purpose 8B models. Especially for short fragments or question-style inputs, hallucinations are a problem, too. (Better prompt engineering and a second proof-reading pass might help here, though.)

Thus, there is some polishing as well as several features left for future work. On the other hand, the script should already be easily adaptable to process additional document formats – currently, PDF and ODT are supported.
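As a rough illustration of what such an extension point might look like, here is a hypothetical dispatch on the input format. The library choices (pypdf, odfpy) and the function are assumptions for the sketch; the script's actual internals may differ:

from pathlib import Path

def extract_segments(path: Path) -> list[str]:
    # Hypothetical extractor: supporting a new format is one more branch.
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        from pypdf import PdfReader
        return [page.extract_text() for page in PdfReader(str(path)).pages]
    if suffix == ".odt":
        from odf.opendocument import load
        from odf import text, teletype
        doc = load(str(path))
        return [teletype.extractText(p) for p in doc.getElementsByType(text.P)]
    raise ValueError(f"unsupported format: {suffix}")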

Installation & Usage

python3 -m venv venv
./venv/bin/pip install --require-virtualenv -U -r requirements.txt
./venv/bin/python3 pdftranslate.py -h
usage: pdftranslate.py [-h] [--api URL] --model MODEL
                       --source-lang LANG --target-lang LANG [--title-context TITLE]
                       [--history-len LEN] [--chunk-len LEN] [--debug]
                       infile outfile

Quick prototype for translating PDF and ODT documents using an Ollama chat API.

options:
  -h, --help             show this help message and exit
  --api URL              ollama endpoint to use (default: http://localhost:11434/api/chat)
  --model MODEL          ollama model to use (default: llama3.1:latest)
  --source-lang LANG     original input document language (default: English)
  --target-lang LANG     desired translated output language (default: German)
  --title-context TITLE  document title/context as hinted by prompt (default: None)
  --history-len LEN      length of past original/translation to replay as context (default: 5000)
  --chunk-len LEN        try to split by sentence boundaries when input length exceeded (default: 5000)
  --debug                enable debug logging (default: False)
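An example invocation (file names are placeholders):

./venv/bin/python3 pdftranslate.py --model llama3.1:latest \
    --source-lang English --target-lang German paper.pdf paper_de.pdf

The --chunk-len behaviour – splitting at sentence boundaries when the input exceeds the limit – can be pictured as a naive greedy packer like the one below. This is a sketch of the idea, not necessarily the script's exact algorithm:

import re

def split_chunks(text: str, chunk_len: int = 5000) -> list[str]:
    # Greedily pack sentences into chunks of at most chunk_len characters,
    # falling back to a hard cut when a single sentence is longer than that.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > chunk_len:
            chunks.append(current)
            current = ""
        while len(sentence) > chunk_len:  # oversized sentence: hard split
            chunks.append(sentence[:chunk_len])
            sentence = sentence[chunk_len:]
        current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks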

Code & Download