Thai and number blocks are not auto-scaled and get wrong hyphen when using &shy; in insert_htmlbox #4613

@Arlencl

Description of the bug

When using insert_htmlbox to render Thai text or long pure number sequences (which do not have natural word breaks), the following problems occur:

  1. No Auto-Scaling:

    • If the input text cannot be split (for example, Thai text without spaces, or a long number like "12345678901234567890"), insert_htmlbox will not auto-scale (shrink) the text to fit the given rectangle. As a result, the content will overflow the boundary and be cut off.
  2. Wrong Hyphen for Thai (&shy; handling):

    • When Thai text is pre-tokenized (e.g., using PyThaiNLP) and then joined with &shy; to mark soft line-break opportunities, insert_htmlbox may insert a "-" (hyphen) at the line break. However, adding a hyphen at word breaks does not follow Thai writing conventions and is visually and semantically incorrect.
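
The pre-tokenized input described in item 2 can be built roughly as follows. This is only a minimal sketch: `join_with_soft_hyphens` is a hypothetical helper name (not part of PyMuPDF), and the token list is written by hand here. In practice it might come from a tokenizer such as PyThaiNLP.

```python
from html import escape


def join_with_soft_hyphens(tokens):
    """Join pre-tokenized words with &shy; so each boundary is a soft break opportunity."""
    # Escape each token so characters such as < or & stay valid inside the HTML.
    return "&shy;".join(escape(tok) for tok in tokens)


# Hand-written tokens matching the example in this report (assumption: a
# tokenizer such as PyThaiNLP would produce a similar split).
tokens = ["ค่าธรรมเนียม", "ชำระ", "เมื่อ", "มาถึง"]
body = join_with_soft_hyphens(tokens)
text = f'<span style="font-size:60.02pt;color:rgb(255,255,211);">{body}</span>'
```

The resulting `text` is the string that would be passed to `insert_htmlbox`.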

How to reproduce the bug

  • Thai case:

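    import pymupdf

    doc = pymupdf.open()
    # Page size is an assumption (not stated in the report); it only needs to contain the bbox.
    page = doc.new_page(width=842, height=595)
    # bbox from the report: [x0, y0, x1, y1]
    rect = pymupdf.Rect(317.98, 201.75, 641.93, 264.3)
    text = '''<span style="font-size:60.02pt;color:rgb(255,255,211);">ค่าธรรมเนียมชำระเมื่อมาถึง</span>'''
    # or, with soft hyphens at the word boundaries:
    # text = '''<span style="font-size:60.02pt;color:rgb(255,255,211);">ค่าธรรมเนียม&shy;ชำระ&shy;เมื่อ&shy;มาถึง</span>'''
    page.insert_htmlbox(rect, text, scale_low=0)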
    
    
    [Two screenshots: rendered output for the Thai case]
  • Number case:

    [Screenshot: rendered output for the number case]

Expected:

  • If the text does not naturally break but overflows the rect, insert_htmlbox should apply auto-scaling so the full content fits.
  • When using &shy; as a break point (such as in Thai tokenization), do not insert a hyphen character at the break; in Thai, no such symbol should appear.

Observed:

  • No auto-shrinking (scaling) for unbreakable blocks.
  • A hyphen (-) is inserted at &shy; line breaks in Thai text, which is not appropriate for Thai (nor for Chinese, Japanese, Korean, etc.).
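
One way to check the first observation programmatically is to inspect the value `insert_htmlbox` returns. The PyMuPDF documentation describes it as a `(spare_height, scale)` tuple, where a negative `spare_height` means the content did not fit. The helper below is a hypothetical sketch that only interprets such a tuple and does not call PyMuPDF itself. What the call actually returns for the cases above is not stated in this report.

```python
def describe_htmlbox_result(result):
    """Summarize a (spare_height, scale) tuple as returned by insert_htmlbox."""
    spare_height, scale = result
    if spare_height < 0:
        return "content did not fit"
    if scale < 1:
        return f"fit after scaling down to {scale:.2f}"
    return "fit at original size"


# Usage with PyMuPDF (not executed here):
#   result = page.insert_htmlbox(rect, text, scale_low=0)
#   print(describe_htmlbox_result(result))
```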

PyMuPDF version

1.26.3

Operating system

Windows

Python version

3.10

Labels

bugfix developed, release schedule to be determined