|
1 |
| -This page documents various differences between IronPython and CPython. Since IronPython is under active development, any of the differences described here may change or disappear in the future: |
2 |
| - |
3 |
| -- [Environment Variables](#environment-variables) |
4 |
| -- [COM Interaction](#com-interaction) |
5 |
| -- [Strings](#strings) |
6 |
| -- [Interaction with the Operating System](#interaction-with-the-operating-system) |
7 |
| -- [Codecs](#codecs) |
8 |
| -- [Source File Encoding](#source-file-encoding) |
9 |
| -- [Recursion](#recursion) |
10 |
| - |
11 |
| -# Environment Variables |
12 |
| - |
13 |
| -* `IRONPYTHONSTARTUP` is used instead of `PYTHONSTARTUP` |
14 |
| - |
15 |
| -* `IRONPYTHONPATH` is used instead of `PYTHONPATH` |
16 |
| - |
17 |
| -# COM Interaction |
18 |
| - |
19 |
| -* Interaction with COM objects is handled by the CLR rather than by a Python library binding to the native COM DLLs.
20 |
| - |
21 |
| -# Strings |
22 |
| - |
23 |
| -* `str` objects are represented in UTF-16 (like all .NET strings) rather than UTF-32 used by CPython. |
24 |
| - |
25 |
| -This has a few visible consequences if characters outside of the Basic Multilingual Plane (BMP) are used (that is, characters with Unicode code points above `U+FFFF`). A few examples below illustrate the differences.
26 |
| - |
27 |
| -Let's take a Unicode character U+1F70B, '🜋'. In CPython, it is represented by a single character: |
28 |
| - |
29 |
| -_CPython_ |
30 |
| -``` |
31 |
| ->>> len('\U0001f70b') |
32 |
| -1 |
33 |
| ->>> str('\U0001f70b') |
34 |
| -'🜋' |
35 |
| -``` |
36 |
| - |
37 |
| -In IronPython, it is represented by a pair of surrogate characters U+D83D and U+DF0B: |
38 |
| - |
39 |
| -_IronPython_ |
40 |
| -``` |
41 |
| ->>> len('\U0001f70b') |
42 |
| -2 |
43 |
| ->>> str('\U0001f70b') |
44 |
| -'\ud83d\udf0b' |
45 |
| -``` |
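The surrogate pair shown above follows from the UTF-16 encoding rules. A minimal sketch of the arithmetic, runnable on either implementation (the helper name `to_surrogates` is hypothetical and used only for illustration):

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low

high, low = to_surrogates(0x1F70B)
print(hex(high), hex(low))
```

For U+1F70B this yields the two code units U+D83D and U+DF0B that IronPython reports.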
46 |
| - |
47 |
| -In **both** cases, however, the string containing such character is printed out correctly, since `print` will transcode the string from its internal representation to whichever encoding is used by the console or file (usually UTF-8): |
48 |
| - |
49 |
| -_CPython_ and _IronPython_ |
50 |
| -``` |
51 |
| ->>> print('\U0001f70b')
52 |
| -🜋
53 |
| -``` |
54 |
| - |
55 |
| -Any surrogate pair in IronPython strings represents one logical character. CPython, however, sees a surrogate pair as two invalid characters. |
56 |
| - |
57 |
| -_IronPython_ |
58 |
| -``` |
59 |
| ->>> '\ud83d\udf0b' |
60 |
| -'\ud83d\udf0b' |
61 |
| ->>> print('\ud83d\udf0b') |
62 |
| -🜋 |
63 |
| ->>> '\ud83d\udf0b'.encode('utf-8') |
64 |
| -b'\xf0\x9f\x9c\x8b' |
65 |
| ->>> '\U0001f70b'.encode('utf-8') |
66 |
| -b'\xf0\x9f\x9c\x8b' |
67 |
| -``` |
68 |
| - |
69 |
| -_CPython_ |
70 |
| -``` |
71 |
| ->>> '\ud83d\udf0b' |
72 |
| -'\ud83d\udf0b' |
73 |
| ->>> print('\ud83d\udf0b') |
74 |
| -Traceback (most recent call last): |
75 |
| - File "<stdin>", line 1, in <module> |
76 |
| -UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed |
77 |
| ->>> '\ud83d\udf0b'.encode('utf-8')
78 |
| -Traceback (most recent call last): |
79 |
| - File "<stdin>", line 1, in <module> |
80 |
| -UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed |
81 |
| -``` |
82 |
| - |
83 |
| -CPython requires the `'surrogatepass'` error handler to let those pairs through. Note, however, that it still treats them as two separate characters, whereas IronPython encodes the pair as if it were one character.
84 |
| - |
85 |
| -_CPython_ |
86 |
| -``` |
87 |
| ->>> '\ud83d\udf0b'.encode('utf-8','surrogatepass') |
88 |
| -b'\xed\xa0\xbd\xed\xbc\x8b' |
89 |
| ->>> '\U0001f70b'.encode('utf-8') |
90 |
| -b'\xf0\x9f\x9c\x8b' |
91 |
| -``` |
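If code written for IronPython hands a surrogate pair to CPython, one way to merge the pair into a single character is a round trip through UTF-16. This is a sketch for CPython and not something the original page prescribes:

```python
pair = '\ud83d\udf0b'  # two lone surrogates as far as CPython is concerned

# 'surrogatepass' lets the surrogates through as raw UTF-16 code units;
# decoding those code units then recognizes them as one valid pair.
combined = pair.encode('utf-16', 'surrogatepass').decode('utf-16')

print(len(pair), len(combined))
print(combined == '\U0001f70b')
```

On CPython the combined string has length 1 and compares equal to `'\U0001f70b'`.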
92 |
| - |
93 |
| -The `'surrogatepass'` error handler is still needed in IronPython to handle surrogate characters that do not form a valid surrogate pair: |
94 |
| - |
95 |
| -_IronPython_ |
96 |
| -``` |
97 |
| ->>> print('\ud83d\udf0b')
98 |
| -🜋 |
99 |
| ->>> print('\ud83d\udf0b'[::-1]) |
100 |
| -Traceback (most recent call last): |
101 |
| - File "<stdin>", line 1, in <module> |
102 |
| -UnicodeEncodeError: 'cp65001' codec can't encode character '\udf0b' in position 0: Unable to translate Unicode character \\uDF0B at index 0 to specified code page. |
103 |
| ->>> print('\ud83d\udf0b'[::-1].encode('utf-8','surrogatepass')) |
104 |
| -b'\xed\xbc\x8b\xed\xa0\xbd' |
105 |
| -``` |
106 |
| - |
107 |
| -_CPython_ |
108 |
| -``` |
109 |
| ->>> print('\ud83d\udf0b') |
110 |
| -Traceback (most recent call last): |
111 |
| - File "<stdin>", line 1, in <module> |
112 |
| -UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed |
113 |
| ->>> print('\ud83d\udf0b'[::-1]) |
114 |
| -Traceback (most recent call last): |
115 |
| - File "<stdin>", line 1, in <module> |
116 |
| -UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed |
117 |
| ->>> print('\ud83d\udf0b'[::-1].encode('utf-8','surrogatepass')) |
118 |
| -b'\xed\xbc\x8b\xed\xa0\xbd' |
119 |
| -``` |
120 |
| - |
121 |
| -# Interaction with the Operating System |
122 |
| - |
123 |
| -* Environment variables are decoded using the `'replace'` error handler, rather than the `'surrogateescape'` error handler used by CPython. |
124 |
| - |
125 |
| -This is how .NET libraries handle encoding errors in the system. The difference is only visible on POSIX systems that have environment variables defined using a different encoding than the encoding used by the system (Windows environment variables are always in UTF-16, so no conversion takes place when accessed as Python `str` objects).
126 |
| - |
127 |
| -Assume that a Linux system is configured to use UTF-8. Under bash: |
128 |
| - |
129 |
| -``` |
130 |
| -$ python -c 'f=open("test.sh","w",encoding="latin-1");print("NAME=\"André\"",file=f)' |
131 |
| -$ source test.sh |
132 |
| -$ export NAME |
133 |
| -``` |
134 |
| - |
135 |
| -This creates an environment variable that is encoded using Latin-1 encoding, rather than the system encoding. CPython will escape the invalid byte 0xe9 (letter 'é' in Latin-1) as a lone surrogate U+DCE9, which is still an invalid Unicode character.
136 |
| - |
137 |
| -_CPython_ |
138 |
| -``` |
139 |
| ->>> import os |
140 |
| ->>> os.environ["NAME"] |
141 |
| -'Andr\udce9' |
142 |
| ->>> print(os.environ["NAME"]) |
143 |
| -Traceback (most recent call last): |
144 |
| - File "<stdin>", line 1, in <module> |
145 |
| -UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 4: surrogates not allowed |
146 |
| -``` |
147 |
| - |
148 |
| -IronPython will replace the invalid byte with U+FFFD, the Unicode replacement character, which is a valid and printable character. |
149 |
| - |
150 |
| -_IronPython_ |
151 |
| -``` |
152 |
| ->>> import os |
153 |
| ->>> os.environ["NAME"] |
154 |
| -'Andr�' |
155 |
| ->>> print(os.environ["NAME"]) |
156 |
| -Andr� |
157 |
| ->>> hex(ord(os.environ["NAME"][-1])) |
158 |
| -'0xfffd' |
159 |
| -``` |
160 |
| - |
161 |
| -The CPython representation is not printable, but can be safely encoded back to the original form using `'surrogateescape'` (default when dealing with the OS environment): |
162 |
| - |
163 |
| -_CPython_ |
164 |
| -``` |
165 |
| ->>> os.environ["PATH"] = os.environ["PATH"] + ":/home/" + os.environ["NAME"] + "/bin" |
166 |
| ->>> import posix |
167 |
| ->>> posix.environ[b"PATH"] |
168 |
| -b'/bin:/usr/bin:/usr/local/bin:/home/Andr\xe9/bin' |
169 |
| ->>> os.environ["NAME"].encode("utf-8","surrogateescape") |
170 |
| -b'Andr\xe9' |
171 |
| -``` |
172 |
| - |
173 |
| -The IronPython representation is printable, but the original byte value is lost: |
174 |
| - |
175 |
| -_IronPython_ |
176 |
| -``` |
177 |
| ->>> os.environ["NAME"].encode("utf-8","surrogateescape") |
178 |
| -b'Andr\xef\xbf\xbd' |
179 |
| -``` |
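The trade-off between the two error handlers can be reproduced without touching the environment by decoding the raw bytes directly. A small sketch, assuming the same Latin-1 bytes as in the example above:

```python
raw = b'Andr\xe9'  # 'André' encoded as Latin-1, invalid as UTF-8

# CPython's approach for OS data: keep the bad byte as a lone surrogate.
escaped = raw.decode('utf-8', 'surrogateescape')
# The .NET-style approach: substitute U+FFFD for the bad byte.
replaced = raw.decode('utf-8', 'replace')

# Only the escaped form can be turned back into the original bytes.
print(escaped.encode('utf-8', 'surrogateescape') == raw)
print(replaced.encode('utf-8') == raw)
```

The first comparison is true and the second is false, which is the information loss described above.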
180 |
| - |
181 |
| -# Codecs |
182 |
| - |
183 |
| -* Some single-byte codecs may have unused positions in their codepage. There are differences between how CPython and IronPython (and .NET) handle such cases. |
184 |
| - |
185 |
| -A simple example is the Windows-1252 encoding. According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API `MultiByteToWideChar` maps these to the corresponding C1 control codes. The Unicode "best fit" mapping [documents this behavior](https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt). CPython will treat those bytes as invalid, while IronPython will map them to the "best fit" Unicode character:
186 |
| - |
187 |
| -_CPython_ |
188 |
| -``` |
189 |
| ->>> b'\x81'.decode('windows-1252') |
190 |
| -Traceback (most recent call last): |
191 |
| - File "<stdin>", line 1, in <module> |
192 |
| - File "/opt/anaconda3/envs/py34/lib/python3.4/encodings/cp1252.py", line 15, in decode |
193 |
| - return codecs.charmap_decode(input,errors,decoding_table) |
194 |
| -UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined> |
195 |
| ->>> b'\x81'.decode('windows-1252','surrogateescape') |
196 |
| -'\udc81' |
197 |
| -``` |
198 |
| - |
199 |
| -_IronPython_ |
200 |
| -``` |
201 |
| ->>> b'\x81'.decode('windows-1252') |
202 |
| -'\x81' |
203 |
| ->>> b'\x81'.decode('windows-1252','surrogateescape') |
204 |
| -'\x81' |
205 |
| -``` |
206 |
| - |
207 |
| -The same difference in behavior can be observed during encoding: |
208 |
| - |
209 |
| -_CPython_ |
210 |
| -``` |
211 |
| ->>> '\x81'.encode('windows-1252') |
212 |
| -Traceback (most recent call last): |
213 |
| - File "<stdin>", line 1, in <module> |
214 |
| - File "/opt/anaconda3/envs/py34/lib/python3.4/encodings/cp1252.py", line 12, in encode |
215 |
| - return codecs.charmap_encode(input,errors,encoding_table) |
216 |
| -UnicodeEncodeError: 'charmap' codec can't encode character '\x81' in position 0: character maps to <undefined> |
217 |
| -``` |
218 |
| - |
219 |
| -_IronPython_ |
220 |
| -``` |
221 |
| ->>> '\x81'.encode('windows-1252') |
222 |
| -b'\x81' |
223 |
| -``` |
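Code that must decode Windows-1252 the same way on both implementations could register a custom error handler that applies the "best fit" mapping on CPython. The sketch below is one possible approach; the handler name `c1fallback` is hypothetical, and on IronPython the handler would presumably never be invoked because no error is raised:

```python
import codecs

def c1_fallback(exc):
    """Map undecodable bytes to the code points with the same value."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Return the replacement text and the position to resume decoding at.
        return ''.join(chr(b) for b in bad), exc.end
    raise exc

# 'c1fallback' is a made-up name used only for this illustration.
codecs.register_error('c1fallback', c1_fallback)

text = b'caf\xe9 \x81'.decode('windows-1252', 'c1fallback')
print(ascii(text))
```

With this handler, the unused position 0x81 decodes to U+0081 while defined positions such as 0xE9 decode normally.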
224 |
| - |
225 |
| -* When using the UTF-7 encoding, IronPython (and .NET) always terminates the modified Base64 encoded blocks with a '-' while CPython omits the '-' if allowed. |
226 |
| - |
227 |
| -The UTF-7 standard allows encoders some freedom of implementation. One such option is how to end a sequence encoded in the modified Base64 code. In principle, `+` marks the start of the sequence, and `-` is the terminator. However, the terminating `-` may be omitted if the next character unambiguously does not belong to the encoded Base64 block. CPython chooses to drop the terminating `-` in such cases, while IronPython will always terminate Base64-encoded blocks with a `-`:
228 |
| - |
229 |
| -_CPython_ |
230 |
| -``` |
231 |
| ->>> 'abc:~~:zyz'.encode('utf-7') |
232 |
| -b'abc:+AH4Afg:zyz' |
233 |
| -``` |
234 |
| - |
235 |
| -_IronPython_ |
236 |
| -``` |
237 |
| ->>> 'abc:~~:zyz'.encode('utf-7') |
238 |
| -b'abc:+AH4Afg-:zyz' |
239 |
| -``` |
240 |
| - |
241 |
| -Note that both forms are fully interchangeable; IronPython will correctly decode what CPython encoded and vice versa. |
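The interchangeability can be checked by decoding both byte sequences from the examples above; a short sketch:

```python
cpython_form = b'abc:+AH4Afg:zyz'      # terminating '-' omitted
ironpython_form = b'abc:+AH4Afg-:zyz'  # terminating '-' always written

# Both forms decode to the same text.
print(cpython_form.decode('utf-7'))
print(ironpython_form.decode('utf-7'))
```

Because the decoded text is identical, tests that compare encoded bytes across implementations are more robust if they compare the decoded strings instead.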
242 |
| - |
243 |
| -# Source File Encoding |
244 |
| - |
245 |
| -* Widechar Unicode encodings are supported as source file encoding, in addition to standard Python encodings. |
246 |
| - |
247 |
| -The default source file encoding is UTF-8. This also applies to bytestrings used within the program (processed by `compile`, `eval`, or `exec`). The source file encoding can be explicitly specified, and possibly changed, in one of two ways:
248 |
| - |
249 |
| - 1. By declaring the encoding in a Python comment in one of the first two lines — in accordance with [PEP-263](https://www.python.org/dev/peps/pep-0263/). |
250 |
| - 2. By a byte-order-mark (BOM) — only for Unicode encodings. |
251 |
| - |
252 |
| -CPython recognizes only UTF-8 BOM. IronPython recognizes BOM in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. |
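To illustrate what recognizing a BOM involves, here is a minimal sketch that classifies the leading bytes of a file. It is not IronPython's actual implementation; the function name `detect_bom` is hypothetical. The UTF-32 marks are tested before the UTF-16 marks because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM:

```python
import codecs

# Order matters: check the longer UTF-32 marks before the UTF-16 ones.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def detect_bom(data):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(codecs.BOM_UTF16_BE + 'x = 1'.encode('utf-16-be')))
```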
253 |
| - |
254 |
| -If both BOM and PEP-263 methods are used simultaneously in the same file, they should specify the same encoding. If the PEP-263 encoding does not match the BOM, then:
255 |
| - |
256 |
| - * In case of UTF-8 BOM, an error will be reported (by both CPython and IronPython). |
257 |
| - * In case of other BOMs, the encoding specified in the PEP-263 comment is silently ignored. |
258 |
| - |
259 |
| -# Recursion |
260 |
| - |
261 |
| -By default, instead of raising a `RecursionError` when the maximum recursion depth is reached, IronPython will terminate with a `StackOverflowException`. You can enable the recursion limit in IronPython in a number of ways: |
262 |
| - |
263 |
| - 1. From the command line: `ipy -X MaxRecursion=100`. |
264 |
| - 2. In hosted scenarios: `Python.CreateEngine(new Dictionary<string, object>() { { "RecursionLimit", 100 } });`. |
265 |
| - 3. From Python: `sys.setrecursionlimit(100)`. |
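The third option can be sketched as follows. The helper names are hypothetical, and the limit of 200 is an arbitrary value chosen to leave headroom for frames already on the stack; the previous limit is restored afterwards:

```python
import sys

def recurse_forever(n=0):
    # Deliberately unbounded recursion.
    return recurse_forever(n + 1)

def hits_limit(limit=200):
    """Return True if runaway recursion is reported as RecursionError."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        recurse_forever()
    except RecursionError:
        return True
    finally:
        sys.setrecursionlimit(old)  # restore the previous limit
    return False

print(hits_limit())
```

Restoring the old limit keeps the performance cost of an enabled limit confined to the code that needs the check.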
266 |
| - |
267 |
| -*There is a significant performance cost when the recursion limit is enabled*. |
268 |
| - |
269 |
| -Note that IronPython 3.4 adopts the CPython 3.5 behavior and throws a `RecursionError` instead of a `RuntimeError`. |
| 1 | +This document has been moved to the IronPython wiki: [Differences from CPython](https://github.com/IronLanguages/ironpython3/wiki/Differences-from-CPython) |