Description of the bug
When using insert_htmlbox to render Thai text or long pure-number sequences (which have no natural word breaks), the following problems occur:
- No auto-scaling: If the input text cannot be split (for example, Thai text without spaces, or a long number like "12345678901234567890"), insert_htmlbox does not auto-scale (shrink) the text to fit the given rectangle. As a result, the content overflows the boundary and is cut off.
- Wrong hyphen for Thai (&shy; handling): When Thai text is pre-tokenized (e.g., using PyThaiNLP) and then joined with &shy; to mark soft line-break opportunities, insert_htmlbox may insert a "-" (hyphen) at the line break. Adding a hyphen at word breaks does not follow Thai writing conventions and is visually and semantically incorrect.
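The pre-tokenize-and-join step described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual code: the helper name join_with_soft_hyphens is hypothetical, and the token list is hard-coded to match the sample string used in this report (in practice it might come from PyThaiNLP's word_tokenize).

```python
# Hypothetical helper: join pre-tokenized words with the HTML soft-hyphen
# entity so a renderer may break lines between tokens.
SOFT_HYPHEN_ENTITY = "&shy;"


def join_with_soft_hyphens(tokens):
    """Return the tokens joined by &shy;, skipping empty tokens."""
    return SOFT_HYPHEN_ENTITY.join(t for t in tokens if t)


# Assumed token boundaries, matching the sample string in this report.
# With PyThaiNLP installed they might instead be produced by
# pythainlp.tokenize.word_tokenize("ค่าธรรมเนียมชำระเมื่อมาถึง").
tokens = ["ค่าธรรมเนียม", "ชำระ", "เมื่อ", "มาถึง"]

html = (
    '<span style="font-size:60.02pt;color:rgb(255,255,211);">'
    + join_with_soft_hyphens(tokens)
    + "</span>"
)
print(html)
```

The resulting string is what would be passed to insert_htmlbox; the visible "-" reported here appears only when a line actually breaks at one of the &shy; positions.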
How to reproduce the bug
- Thai case:
  rect = (317.98, 201.75, 641.93, 264.3)  # "bbox" of the target area
  text = '''<span style="font-size:60.02pt;color:rgb(255,255,211);">ค่าธรรมเนียมชำระเมื่อมาถึง</span>'''
  # or:
  text = '''<span style="font-size:60.02pt;color:rgb(255,255,211);">ค่าธรรมเนียม&shy;ชำระ&shy;เมื่อ&shy;มาถึง</span>'''
  page.insert_htmlbox(rect, text, scale_low=0)
- Number case:

Expected:
- If the text does not naturally break but overflows the rect, insert_htmlbox should apply auto-scaling so the full content fits.
- When using &shy; as a break point (such as in Thai tokenization), do not insert a hyphen character at the break; in Thai, no such symbol should appear.
Observed:
- No auto-shrinking (scaling) for unbreakable blocks.
- A hyphen (-) is added at Thai &shy; line breaks, which is not appropriate for Thai (and similarly for Chinese, Japanese, Korean, etc.).
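A self-contained script along these lines may help check both observations side by side. It is only a sketch under stated assumptions: the page size (800 x 600), the dark fill behind the rectangle, the markup used for the number case, and the names build_span, CASES, and reproduce are all illustrative and not taken from the report. Only the rectangle, the span style, and the sample strings come from the text above.

```python
def build_span(content, font_size_pt=60.02, color="rgb(255,255,211)"):
    """Wrap content in a span using the style from this report."""
    return f'<span style="font-size:{font_size_pt}pt;color:{color};">{content}</span>'


# Three inputs: unbroken Thai, Thai joined with &shy;, and a long number.
CASES = {
    "thai_plain": build_span("ค่าธรรมเนียมชำระเมื่อมาถึง"),
    "thai_shy": build_span("&shy;".join(["ค่าธรรมเนียม", "ชำระ", "เมื่อ", "มาถึง"])),
    "number": build_span("12345678901234567890"),  # assumed number-case markup
}


def reproduce(out_path="htmlbox_repro.pdf"):
    """Render each case on its own page and return insert_htmlbox's results."""
    import pymupdf  # imported lazily so the helpers above work without PyMuPDF

    doc = pymupdf.open()  # new empty PDF
    rect = pymupdf.Rect(317.98, 201.75, 641.93, 264.3)
    results = {}
    for name, html in CASES.items():
        # Assumed page size, chosen so the rect lies fully inside the page.
        page = doc.new_page(width=800, height=600)
        # Outline and darken the target rect so overflow and pale text are visible.
        page.draw_rect(rect, color=(1, 0, 0), fill=(0.2, 0.2, 0.2))
        # insert_htmlbox returns (spare_height, scale); scale_low=0 permits
        # unlimited shrinking, so a scale of 1 plus clipped text would match
        # the "no auto-scaling" observation.
        spare_height, scale = page.insert_htmlbox(rect, html, scale_low=0)
        results[name] = (spare_height, scale)
    doc.save(out_path)
    doc.close()
    return results
```

Calling reproduce() writes one page per case and returns the (spare_height, scale) pair for each, which can be compared against a visual inspection of the saved PDF.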
PyMuPDF version
1.26.3
Operating system
Windows
Python version
3.10