LLM PDF/ODT Translator
Proof-of-concept script for LLM-based offline document translation. It talks to a local Ollama chat API endpoint and provides context by replaying chunks of the past translation history.
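To illustrate the approach, here is a minimal sketch (not the script's actual code) of sending one chunk to the Ollama chat endpoint with recent source/translation pairs replayed as context. The translate_chunk helper and the prompt wording are illustrative assumptions; only the endpoint, model name, and defaults come from the script's help output.

    import requests

    OLLAMA_URL = "http://localhost:11434/api/chat"  # matches the --api default

    def translate_chunk(chunk, history, model="llama3.1:latest",
                        source_lang="English", target_lang="German"):
        """Translate one chunk, replaying past pairs as chat context."""
        messages = [{"role": "system",
                     "content": f"You are a translator. Translate the user's "
                                f"text from {source_lang} to {target_lang}."}]
        # Replay earlier original/translation pairs so the model keeps
        # terminology and style consistent across chunks (cf. --history-len;
        # trimming the history to a character budget is omitted for brevity).
        for original, translated in history:
            messages.append({"role": "user", "content": original})
            messages.append({"role": "assistant", "content": translated})
        messages.append({"role": "user", "content": chunk})

        resp = requests.post(OLLAMA_URL, json={"model": model,
                                               "messages": messages,
                                               "stream": False})
        resp.raise_for_status()
        translation = resp.json()["message"]["content"]
        history.append((chunk, translation))
        return translation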
The translation quality seems quite underwhelming, at least for English into German with several general-purpose 8B models. Especially for short fragments or question-style inputs, hallucinations are a problem, too. (Better prompt engineering and a second proof-reading pass might help here, though.)
Thus, some polishing as well as several features are left for future work. On the other hand, the script should already be easily adaptable to process additional document formats; currently, PDF and ODT are supported.
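As a purely hypothetical illustration of such an extension point (the helper names are made up, not taken from the script), a small reader dispatch table keeps new formats cheap to add:

    from pathlib import Path

    def read_pdf(path):   # placeholder for e.g. a pdfminer/PyMuPDF extractor
        raise NotImplementedError

    def read_odt(path):   # placeholder for e.g. an odfpy-based extractor
        raise NotImplementedError

    READERS = {".pdf": read_pdf, ".odt": read_odt}

    def load_document(path):
        """Adding a format means registering one more reader here."""
        try:
            return READERS[Path(path).suffix.lower()](path)
        except KeyError:
            raise ValueError(f"unsupported input format: {path}") from None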
Installation & Usage
python3 -m venv venv
./venv/bin/pip install --require-virtualenv -U -r requirements.txt
./venv/bin/python3 pdftranslate.py -h
usage: pdftranslate.py [-h] [--api URL] --model MODEL
--source-lang LANG --target-lang LANG [--title-context TITLE]
[--history-len LEN] [--chunk-len LEN] [--debug]
infile outfile
Quick prototype for translating PDF and ODT documents using an Ollama chat API.
options:
-h, --help show this help message and exit
--api URL ollama endpoint to use (default: http://localhost:11434/api/chat)
--model MODEL ollama model to use (default: llama3.1:latest)
--source-lang LANG original input document language (default: English)
--target-lang LANG desired translated output language (default: German)
--title-context TITLE document title/context as hinted by prompt (default: None)
--history-len LEN length of past original/translation to replay as context (default: 5000)
--chunk-len LEN try to split by sentence boundaries when input length exceeded (default: 5000)
--debug enable debug logging (default: False)
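For reference, the "try to split by sentence boundaries" behavior behind --chunk-len could look roughly like the following sketch; this is an assumption about the strategy, not the script's actual implementation. A sentence longer than the limit is kept whole, which matches the best-effort wording above.

    import re

    def chunk_text(text, chunk_len=5000):
        """Split text into chunks of at most chunk_len characters,
        preferring sentence boundaries (cf. --chunk-len)."""
        chunks, current = [], ""
        # Naive boundary detection: split after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if current and len(current) + len(sentence) + 1 > chunk_len:
                chunks.append(current)
                current = sentence
            else:
                current = (current + " " + sentence).strip()
        if current:
            chunks.append(current)
        return chunks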