Commit 20e2203

py version in progress
1 parent 2ea8287 commit 20e2203

17 files changed: +436 -8 lines changed

.Rbuildignore

Lines changed: 1 addition & 0 deletions
@@ -27,3 +27,4 @@ vignettes/.*\.png$
 ^\.covrignore$
 ^vignettes/bondargentina_1\.png$
 ^vignettes/bondargentina_2\.png$
+^pypkg$

CRAN-SUBMISSION

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
 Version: 5.3.5
-Date: 2025-01-14 17:49:03 UTC
-SHA: e854fe79164464d5ba65a4029f767c2f32c51c48
+Date: 2025-01-14 19:57:37 UTC
+SHA: 2ea8287ef2c27901446bafa402728014d99904d4

pypkg/LICENSE.md

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
+Apache License
+==============
+
+_Version 2.0, January 2004_
+_&lt;<http://www.apache.org/licenses/>&gt;_
+
+### Terms and Conditions for use, reproduction, and distribution
+
+#### 1. Definitions
+
+“License” shall mean the terms and conditions for use, reproduction, and
+distribution as defined by Sections 1 through 9 of this document.
+
+“Licensor” shall mean the copyright owner or entity authorized by the copyright
+owner that is granting the License.
+
+“Legal Entity” shall mean the union of the acting entity and all other entities
+that control, are controlled by, or are under common control with that entity.
+For the purposes of this definition, “control” means **(i)** the power, direct or
+indirect, to cause the direction or management of such entity, whether by
+contract or otherwise, or **(ii)** ownership of fifty percent (50%) or more of the
+outstanding shares, or **(iii)** beneficial ownership of such entity.
+
+“You” (or “Your”) shall mean an individual or Legal Entity exercising
+permissions granted by this License.
+
+“Source” form shall mean the preferred form for making modifications, including
+but not limited to software source code, documentation source, and configuration
+files.
+
+“Object” form shall mean any form resulting from mechanical transformation or
+translation of a Source form, including but not limited to compiled object code,
+generated documentation, and conversions to other media types.
+
+“Work” shall mean the work of authorship, whether in Source or Object form, made
+available under the License, as indicated by a copyright notice that is included
+in or attached to the work (an example is provided in the Appendix below).
+
+“Derivative Works” shall mean any work, whether in Source or Object form, that
+is based on (or derived from) the Work and for which the editorial revisions,
+annotations, elaborations, or other modifications represent, as a whole, an
+original work of authorship. For the purposes of this License, Derivative Works
+shall not include works that remain separable from, or merely link (or bind by
+name) to the interfaces of, the Work and Derivative Works thereof.
+
+“Contribution” shall mean any work of authorship, including the original version
+of the Work and any modifications or additions to that Work or Derivative Works
+thereof, that is intentionally submitted to Licensor for inclusion in the Work
+by the copyright owner or by an individual or Legal Entity authorized to submit
+on behalf of the copyright owner. For the purposes of this definition,
+“submitted” means any form of electronic, verbal, or written communication sent
+to the Licensor or its representatives, including but not limited to
+communication on electronic mailing lists, source code control systems, and
+issue tracking systems that are managed by, or on behalf of, the Licensor for
+the purpose of discussing and improving the Work, but excluding communication
+that is conspicuously marked or otherwise designated in writing by the copyright
+owner as “Not a Contribution.”
+
+“Contributor” shall mean Licensor and any individual or Legal Entity on behalf
+of whom a Contribution has been received by Licensor and subsequently
+incorporated within the Work.
+
+#### 2. Grant of Copyright License
+
+Subject to the terms and conditions of this License, each Contributor hereby
+grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
+irrevocable copyright license to reproduce, prepare Derivative Works of,
+publicly display, publicly perform, sublicense, and distribute the Work and such
+Derivative Works in Source or Object form.
+
+#### 3. Grant of Patent License
+
+Subject to the terms and conditions of this License, each Contributor hereby
+grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
+irrevocable (except as stated in this section) patent license to make, have
+made, use, offer to sell, sell, import, and otherwise transfer the Work, where
+such license applies only to those patent claims licensable by such Contributor
+that are necessarily infringed by their Contribution(s) alone or by combination
+of their Contribution(s) with the Work to which such Contribution(s) was
+submitted. If You institute patent litigation against any entity (including a
+cross-claim or counterclaim in a lawsuit) alleging that the Work or a
+Contribution incorporated within the Work constitutes direct or contributory
+patent infringement, then any patent licenses granted to You under this License
+for that Work shall terminate as of the date such litigation is filed.
+
+#### 4. Redistribution
+
+You may reproduce and distribute copies of the Work or Derivative Works thereof
+in any medium, with or without modifications, and in Source or Object form,
+provided that You meet the following conditions:
+
+* **(a)** You must give any other recipients of the Work or Derivative Works a copy of
+this License; and
+* **(b)** You must cause any modified files to carry prominent notices stating that You
+changed the files; and
+* **(c)** You must retain, in the Source form of any Derivative Works that You distribute,
+all copyright, patent, trademark, and attribution notices from the Source form
+of the Work, excluding those notices that do not pertain to any part of the
+Derivative Works; and
+* **(d)** If the Work includes a “NOTICE” text file as part of its distribution, then any
+Derivative Works that You distribute must include a readable copy of the
+attribution notices contained within such NOTICE file, excluding those notices
+that do not pertain to any part of the Derivative Works, in at least one of the
+following places: within a NOTICE text file distributed as part of the
+Derivative Works; within the Source form or documentation, if provided along
+with the Derivative Works; or, within a display generated by the Derivative
+Works, if and wherever such third-party notices normally appear. The contents of
+the NOTICE file are for informational purposes only and do not modify the
+License. You may add Your own attribution notices within Derivative Works that
+You distribute, alongside or as an addendum to the NOTICE text from the Work,
+provided that such additional attribution notices cannot be construed as
+modifying the License.
+
+You may add Your own copyright statement to Your modifications and may provide
+additional or different license terms and conditions for use, reproduction, or
+distribution of Your modifications, or for any such Derivative Works as a whole,
+provided Your use, reproduction, and distribution of the Work otherwise complies
+with the conditions stated in this License.
+
+#### 5. Submission of Contributions
+
+Unless You explicitly state otherwise, any Contribution intentionally submitted
+for inclusion in the Work by You to the Licensor shall be under the terms and
+conditions of this License, without any additional terms or conditions.
+Notwithstanding the above, nothing herein shall supersede or modify the terms of
+any separate license agreement you may have executed with Licensor regarding
+such Contributions.
+
+#### 6. Trademarks
+
+This License does not grant permission to use the trade names, trademarks,
+service marks, or product names of the Licensor, except as required for
+reasonable and customary use in describing the origin of the Work and
+reproducing the content of the NOTICE file.
+
+#### 7. Disclaimer of Warranty
+
+Unless required by applicable law or agreed to in writing, Licensor provides the
+Work (and each Contributor provides its Contributions) on an “AS IS” BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
+including, without limitation, any warranties or conditions of TITLE,
+NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are
+solely responsible for determining the appropriateness of using or
+redistributing the Work and assume any risks associated with Your exercise of
+permissions under this License.
+
+#### 8. Limitation of Liability
+
+In no event and under no legal theory, whether in tort (including negligence),
+contract, or otherwise, unless required by applicable law (such as deliberate
+and grossly negligent acts) or agreed to in writing, shall any Contributor be
+liable to You for damages, including any direct, indirect, special, incidental,
+or consequential damages of any character arising as a result of this License or
+out of the use or inability to use the Work (including but not limited to
+damages for loss of goodwill, work stoppage, computer failure or malfunction, or
+any and all other commercial damages or losses), even if such Contributor has
+been advised of the possibility of such damages.
+
+#### 9. Accepting Warranty or Additional Liability
+
+While redistributing the Work or Derivative Works thereof, You may choose to
+offer, and charge a fee for, acceptance of support, warranty, indemnity, or
+other liability obligations and/or rights consistent with this License. However,
+in accepting such obligations, You may act only on Your own behalf and on Your
+sole responsibility, not on behalf of any other Contributor, and only if You
+agree to indemnify, defend, and hold each Contributor harmless for any liability
+incurred by, or claims asserted against, such Contributor by reason of your
+accepting any such warranty or additional liability.
+
+_END OF TERMS AND CONDITIONS_
+
+### APPENDIX: How to apply the Apache License to your work
+
+To apply the Apache License to your work, attach the following boilerplate
+notice, with the fields enclosed by brackets `[]` replaced with your own
+identifying information. (Don't include the brackets!) The text should be
+enclosed in the appropriate comment syntax for the file format. We also
+recommend that a file or class name and description of purpose be included on
+the same “printed page” as the copyright notice for easier identification within
+third-party archives.
+
+Copyright [yyyy] [name of copyright owner]
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.

pypkg/MANIFEST.in

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+include cpp11tesseract/*.so
+include cpp11tesseract/*.py

pypkg/README.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+# Debug
+
+```
+rm -rf build dist cpp11tesseract.egg-info
+pip install --use-pep517 .
+```

pypkg/cpp11tesseract.cpp

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+#include <leptonica/allheaders.h>
+#include <pybind11/pybind11.h>
+#include <pybind11/stl.h>
+#include <tesseract/baseapi.h>
+
+#include <memory>
+
+namespace py = pybind11;
+
+class TesseractEngine {
+ public:
+  TesseractEngine(const std::string &lang = "eng") {
+    // engine = std::make_unique<tesseract::TessBaseAPI>();
+    engine.reset(new tesseract::TessBaseAPI());
+    if (engine->Init(NULL, lang.c_str())) {
+      throw std::runtime_error("Could not initialize Tesseract.");
+    }
+  }
+
+  ~TesseractEngine() { engine->End(); }
+
+  std::string ocr_file(const std::string &file) {
+    Pix *image = pixRead(file.c_str());
+    if (!image) {
+      throw std::runtime_error("Failed to read image.");
+    }
+
+    engine->SetImage(image);
+    char *text = engine->GetUTF8Text();
+    std::string result(text);
+    delete[] text;
+    pixDestroy(&image);
+
+    return result;
+  }
+
+ private:
+  std::unique_ptr<tesseract::TessBaseAPI> engine;
+};
+
+PYBIND11_MODULE(_cpp11tesseract, m) {
+  if (!py::hasattr(m, "TesseractEngine")) {
+    py::class_<TesseractEngine>(m, "TesseractEngine")
+        .def(py::init<const std::string &>())
+        .def("ocr_file", &TesseractEngine::ocr_file);
+  }
+}

pypkg/cpp11tesseract/__init__.py

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+import os
+import sys
+import glob
+
+def _initialize_tessdata():
+    tessdata_prefix = os.getenv("TESSDATA_PREFIX", "/usr/share/tesseract-ocr/4.00/tessdata")
+    if not os.path.exists(tessdata_prefix):
+        os.makedirs(tessdata_prefix, exist_ok=True)
+    os.environ["TESSDATA_PREFIX"] = tessdata_prefix
+
+_initialize_tessdata()
+
+if "_cpp11tesseract" not in sys.modules:
+    package_dir = os.path.dirname(__file__)
+    shared_object_files = glob.glob(os.path.join(package_dir, "_cpp11tesseract*.so"))
+
+    if shared_object_files:
+        _shared_lib = shared_object_files[0]
+        sys.path.insert(0, package_dir)
+        _cpp11tesseract = __import__(os.path.basename(_shared_lib).split(".")[0])
+    else:
+        raise ImportError("Could not find _cpp11tesseract shared library!")
+
+    from ._cpp11tesseract import TesseractEngine
+else:
+    _cpp11tesseract = sys.modules["_cpp11tesseract"]
+
+from .ocr import ocr
+from .tesseract import TesseractManager
+from .tessdata import tesseract_download
+
+__all__ = [
+    "TesseractEngine",
+    "ocr",
+    "TesseractManager",
+    "tesseract_download"
+]
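
Note on `_initialize_tessdata()`: as committed, it calls `os.makedirs` on the default system path whenever that path is missing, which can raise `PermissionError` at import time for a non-root user. One possible alternative is sketched below. The function name `resolve_tessdata_prefix`, its parameters, and the per-user fallback directory are assumptions for illustration, not part of this commit.

```python
import os
from pathlib import Path

# Default path taken from the committed __init__.py.
SYSTEM_TESSDATA = "/usr/share/tesseract-ocr/4.00/tessdata"


def resolve_tessdata_prefix(env=None, system_default=SYSTEM_TESSDATA, user_dir=None):
    """Return a tessdata directory without ever writing to a system path.

    Order of preference: an explicit TESSDATA_PREFIX, then the system default
    if it already exists, then a per-user directory created on demand.
    """
    env = os.environ if env is None else env
    explicit = env.get("TESSDATA_PREFIX")
    if explicit:
        return explicit
    if os.path.isdir(system_default):
        return system_default
    # Hypothetical fallback location; not defined anywhere in the commit.
    if user_dir is not None:
        fallback = Path(user_dir)
    else:
        fallback = Path.home() / ".cache" / "cpp11tesseract" / "tessdata"
    fallback.mkdir(parents=True, exist_ok=True)
    return str(fallback)
```

The caller would then set `os.environ["TESSDATA_PREFIX"]` to the returned value, as the committed code already does with its own result.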

pypkg/cpp11tesseract/ocr.py

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+import _cpp11tesseract
+
+def ocr(file: str, engine=None, HOCR=False, opw="", upw=""):
+    if isinstance(engine, str):
+        engine = _cpp11tesseract.TesseractEngine(engine)
+
+    if not isinstance(engine, _cpp11tesseract.TesseractEngine):
+        raise ValueError("Invalid engine instance")
+
+    if isinstance(file, str):
+        return engine.ocr_file(file, HOCR)
+    elif isinstance(file, bytes):
+        return engine.ocr_raw(file, HOCR)
+    else:
+        raise ValueError("Argument 'file' must be a file path or raw bytes")
+
+def ocr_data(file: str, engine=None):
+    """Extracts text with bounding boxes and confidence scores."""
+    if isinstance(engine, str):
+        engine = _cpp11tesseract.TesseractEngine(engine)
+
+    if not isinstance(engine, _cpp11tesseract.TesseractEngine):
+        raise ValueError("Invalid engine instance")
+
+    if isinstance(file, str):
+        return engine.ocr_file_data(file)
+    else:
+        raise ValueError("Argument 'file' must be a file path")
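
Note on `ocr()`: in this commit the default `engine=None` fails the `isinstance` check, and the calls to `ocr_file(file, HOCR)`, `ocr_raw` and `ocr_file_data` do not match `cpp11tesseract.cpp`, which binds only a one-argument `ocr_file`. The sketch below shows one way the dispatch could line up with that binding. `FakeEngine` and `engine_factory` are hypothetical stand-ins so the example runs without the compiled extension.

```python
class FakeEngine:
    """Stand-in for _cpp11tesseract.TesseractEngine (hypothetical, for illustration)."""

    def __init__(self, lang="eng"):
        self.lang = lang

    def ocr_file(self, path):
        return f"[{self.lang}] text from {path}"


def ocr(file, engine=None, engine_factory=FakeEngine):
    """Resolve the engine argument first, then validate the input type."""
    if engine is None:
        engine = engine_factory("eng")   # same default language as the C++ class
    elif isinstance(engine, str):
        engine = engine_factory(engine)  # a string is treated as a language code
    if not hasattr(engine, "ocr_file"):
        raise TypeError("engine must be a language code or an engine instance")
    if not isinstance(file, str):
        # Only the path-based method is bound in this commit's C++ source.
        raise TypeError("Argument 'file' must be a file path")
    return engine.ocr_file(file)
```

In the real package, `engine_factory` would presumably be `_cpp11tesseract.TesseractEngine`. Support for raw bytes or hOCR output would need matching methods on the C++ side first.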

pypkg/cpp11tesseract/tessdata.py

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+import os
+import requests
+
+def tesseract_download(lang, model="best", datapath=None):
+    if not isinstance(lang, str):
+        raise ValueError("Language must be a string")
+
+    repo = "tessdata_best" if model == "best" else "tessdata_fast"
+    url = f"https://github.com/tesseract-ocr/{repo}/raw/main/{lang}.traineddata"
+
+    if datapath is None:
+        datapath = os.getenv("TESSDATA_PREFIX", "/usr/share/tesseract-ocr/4.00/tessdata")
+
+    destfile = os.path.join(datapath, f"{lang}.traineddata")
+
+    response = requests.get(url, stream=True)
+    if response.status_code == 200:
+        with open(destfile, 'wb') as f:
+            for chunk in response.iter_content(1024):
+                f.write(chunk)
+        print(f"Downloaded {lang}.traineddata")
+    else:
+        raise Exception(f"Download failed: HTTP {response.status_code}")
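
Note on `tesseract_download()`: the committed version streams straight into the destination file and sets no timeout, so an interrupted transfer can leave a truncated `.traineddata` behind. A minimal sketch of a write-then-rename variant follows. It swaps `requests` for the standard library's `urllib` so that it is self-contained. The names `traineddata_url`, `download_traineddata` and the injectable `fetch` parameter are assumptions for illustration.

```python
import os
import tempfile
import urllib.request


def traineddata_url(lang, model="best"):
    """Build the URL the same way the committed function does."""
    repo = "tessdata_best" if model == "best" else "tessdata_fast"
    return f"https://github.com/tesseract-ocr/{repo}/raw/main/{lang}.traineddata"


def _fetch(url, timeout=30):
    # Stdlib stand-in for requests.get; urlopen raises HTTPError on 4xx/5xx.
    # Reads the whole payload into memory, which is a simplification.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_traineddata(lang, datapath, model="best", fetch=_fetch):
    """Write <lang>.traineddata into datapath and return the file path.

    The payload is written to a temporary file in the same directory and then
    renamed into place, so a failed download never leaves a partial model.
    """
    if not isinstance(lang, str):
        raise ValueError("Language must be a string")
    os.makedirs(datapath, exist_ok=True)
    data = fetch(traineddata_url(lang, model))
    fd, tmp = tempfile.mkstemp(dir=datapath, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        dest = os.path.join(datapath, f"{lang}.traineddata")
        os.replace(tmp, dest)
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return dest
```

Passing `fetch` as a parameter keeps the network call replaceable, which makes the function testable offline.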

pypkg/cpp11tesseract/tesseract.py

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+import hashlib
+
+class TesseractManager:
+    _store = {}
+
+    @staticmethod
+    def get_engine(language="eng", datapath=None, configs=None, options=None, cache=True):
+        key = hashlib.md5(str((language, datapath, configs, options)).encode()).hexdigest()
+
+        if cache and key in TesseractManager._store:
+            return TesseractManager._store[key]
+
+        engine = _cpp11tesseract.TesseractEngine(language)
+        if cache:
+            TesseractManager._store[key] = engine
+        return engine
