Added environment variables for Unstructured to control OCR quality and OCR languages

yuisheaven
2025-10-04 05:21:02 +02:00
parent df5f85e0c6
commit c9a687171a
5 changed files with 96 additions and 6 deletions
@@ -8,3 +8,20 @@ ENABLE_UNSTRUCTURED_PARSING=true
# Unstructured API endpoint (default for docker-compose setup)
UNSTRUCTURED_API_URL=http://unstructured:8000
# Parsing strategy for the Unstructured service
# Valid values: auto, fast, hi_res
# - auto: Automatically choose the best strategy based on document type
# - fast: Fast parsing without OCR - best for simple text documents
# - hi_res: High-resolution parsing with OCR - best for scanned documents, images, and complex layouts (default)
UNSTRUCTURED_STRATEGY=hi_res
# Languages for OCR and document parsing (comma-separated ISO 639-3 language codes)
# Default: eng,deu (English and German)
# Common language codes:
#   eng = English    deu = German     fra = French
#   spa = Spanish    ita = Italian    por = Portuguese
#   rus = Russian    ara = Arabic     zho = Chinese
#   jpn = Japanese   kor = Korean
# Example for English, German, and French: UNSTRUCTURED_LANGUAGES=eng,deu,fra
UNSTRUCTURED_LANGUAGES=eng,deu
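
This hunk does not show how the application consumes these variables. As a rough sketch only, the settings could be read and validated along the following lines. The helper name `load_unstructured_settings` and the returned dictionary layout are hypothetical and not taken from this commit; the defaults mirror the values documented in the comments above.

```python
import os

# Strategy values listed in the .env comments above.
VALID_STRATEGIES = {"auto", "fast", "hi_res"}


def load_unstructured_settings(env=None):
    """Hypothetical helper: read and validate the Unstructured settings.

    `env` is any mapping of variable names to strings; it defaults to
    os.environ so that tests can pass a plain dict instead.
    """
    env = os.environ if env is None else env

    # Fall back to the documented default when the variable is unset.
    strategy = env.get("UNSTRUCTURED_STRATEGY", "hi_res").strip().lower()
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"UNSTRUCTURED_STRATEGY must be one of {sorted(VALID_STRATEGIES)}, "
            f"got {strategy!r}"
        )

    # Split the comma-separated list and drop empty entries and whitespace.
    raw_languages = env.get("UNSTRUCTURED_LANGUAGES", "eng,deu")
    languages = [code.strip().lower() for code in raw_languages.split(",") if code.strip()]
    if not languages:
        languages = ["eng", "deu"]

    return {
        "api_url": env.get("UNSTRUCTURED_API_URL", "http://unstructured:8000"),
        "strategy": strategy,
        "languages": languages,
    }
```

Calling `load_unstructured_settings({"UNSTRUCTURED_LANGUAGES": "eng, deu, fra"})` would yield the `hi_res` strategy with the language list `["eng", "deu", "fra"]`. Rejecting an unknown strategy at startup is one possible design choice; silently falling back to the default would be another.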