1.. highlightlang:: c
2
3.. _unicodeobjects:
4
5Unicode Objects and Codecs
6--------------------------
7
8.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
9
10Unicode Objects
11^^^^^^^^^^^^^^^
12
13
14Unicode Type
15""""""""""""
16
17These are the basic Unicode object types used for the Unicode implementation in
18Python:
19
20
21.. c:type:: Py_UNICODE
22
23   This type represents the storage type which is used by Python internally as
24   basis for holding Unicode ordinals.  Python's default builds use a 16-bit type
25   for :c:type:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
26   possible to build a UCS4 version of Python (most recent Linux distributions come
27   with UCS4 builds of Python). These builds then use a 32-bit type for
28   :c:type:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
29   where :c:type:`wchar_t` is available and compatible with the chosen Python
30   Unicode build variant, :c:type:`Py_UNICODE` is a typedef alias for
31   :c:type:`wchar_t` to enhance native platform compatibility. On all other
32   platforms, :c:type:`Py_UNICODE` is a typedef alias for either :c:type:`unsigned
33   short` (UCS2) or :c:type:`unsigned long` (UCS4).
34
35Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
36this in mind when writing extensions or interfaces.
37
38
39.. c:type:: PyUnicodeObject
40
41   This subtype of :c:type:`PyObject` represents a Python Unicode object.
42
43
44.. c:var:: PyTypeObject PyUnicode_Type
45
46   This instance of :c:type:`PyTypeObject` represents the Python Unicode type.  It
47   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.
48
49The following APIs are really C macros and can be used to do fast checks and to
50access internal read-only data of Unicode objects:
51
52
53.. c:function:: int PyUnicode_Check(PyObject *o)
54
55   Return true if the object *o* is a Unicode object or an instance of a Unicode
56   subtype.
57
58   .. versionchanged:: 2.2
59      Allowed subtypes to be accepted.
60
61
62.. c:function:: int PyUnicode_CheckExact(PyObject *o)
63
64   Return true if the object *o* is a Unicode object, but not an instance of a
65   subtype.
66
67   .. versionadded:: 2.2
68
69
70.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
71
72   Return the size of the object.  *o* has to be a :c:type:`PyUnicodeObject` (not
73   checked).
74
75   .. versionchanged:: 2.5
76      This function returned an :c:type:`int` type. This might require changes
77      in your code for properly supporting 64-bit systems.
78
79
80.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
81
82   Return the size of the object's internal buffer in bytes.  *o* has to be a
83   :c:type:`PyUnicodeObject` (not checked).
84
85   .. versionchanged:: 2.5
86      This function returned an :c:type:`int` type. This might require changes
87      in your code for properly supporting 64-bit systems.
88
89
90.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
91
92   Return a pointer to the internal :c:type:`Py_UNICODE` buffer of the object.  *o*
93   has to be a :c:type:`PyUnicodeObject` (not checked).
94
95
96.. c:function:: const char* PyUnicode_AS_DATA(PyObject *o)
97
98   Return a pointer to the internal buffer of the object. *o* has to be a
99   :c:type:`PyUnicodeObject` (not checked).
100
101
102.. c:function:: int PyUnicode_ClearFreeList()
103
104   Clear the free list. Return the total number of freed items.
105
106   .. versionadded:: 2.6
107
108
109Unicode Character Properties
110""""""""""""""""""""""""""""
111
Unicode provides many different character properties. The most often needed ones
are available through these macros, which are mapped to C functions depending on
the Python configuration.
115
116
117.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
118
119   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.
120
121
122.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
123
124   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.
125
126
127.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
128
129   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.
130
131
132.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
133
134   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.
135
136
137.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
138
139   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.
140
141
142.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
143
144   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.
145
146
147.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
148
149   Return ``1`` or ``0`` depending on whether *ch* is a digit character.
150
151
152.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
153
154   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.
155
156
157.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
158
159   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.
160
161
162.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
163
164   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.
165
166These APIs can be used for fast direct character conversions:
167
168
169.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
170
171   Return the character *ch* converted to lower case.
172
173
174.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
175
176   Return the character *ch* converted to upper case.
177
178
179.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
180
181   Return the character *ch* converted to title case.
182
183
184.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
185
186   Return the character *ch* converted to a decimal positive integer.  Return
187   ``-1`` if this is not possible.  This macro does not raise exceptions.
188
189
190.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
191
192   Return the character *ch* converted to a single digit integer. Return ``-1`` if
193   this is not possible.  This macro does not raise exceptions.
194
195
196.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
197
198   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
199   possible.  This macro does not raise exceptions.
200
201
202Plain Py_UNICODE
203""""""""""""""""
204
205To create Unicode objects and access their basic sequence properties, use these
206APIs:
207
208
209.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
210
   Create a Unicode object from the :c:type:`Py_UNICODE` buffer *u* of the given
   *size*. *u* may be *NULL*, which causes the contents to be undefined. It is the
   user's responsibility to fill in the needed data.  The buffer is copied into the
   new object. If the buffer is not *NULL*, the return value might be a shared
   object. Therefore, modification of the resulting Unicode object is only allowed
   when *u* is *NULL*.
217
218   .. versionchanged:: 2.5
219      This function used an :c:type:`int` type for *size*. This might require
220      changes in your code for properly supporting 64-bit systems.
221
222
223.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
224
225   Create a Unicode object from the char buffer *u*.  The bytes will be interpreted
226   as being UTF-8 encoded.  *u* may also be *NULL* which
227   causes the contents to be undefined. It is the user's responsibility to fill in
228   the needed data.  The buffer is copied into the new object. If the buffer is not
229   *NULL*, the return value might be a shared object. Therefore, modification of
230   the resulting Unicode object is only allowed when *u* is *NULL*.
231
232   .. versionadded:: 2.6
233
234
235.. c:function:: PyObject *PyUnicode_FromString(const char *u)
236
237   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
238   *u*.
239
240   .. versionadded:: 2.6
241
242
243.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)
244
245   Take a C :c:func:`printf`\ -style *format* string and a variable number of
246   arguments, calculate the size of the resulting Python unicode string and return
247   a string with the values formatted into it.  The variable arguments must be C
248   types and must correspond exactly to the format characters in the *format*
249   string.  The following format characters are allowed:
250
251   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
252   .. % because not all compilers support the %z width modifier -- we fake it
253   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
254
255   .. tabularcolumns:: |l|l|L|
256
257   +-------------------+---------------------+--------------------------------+
258   | Format Characters | Type                | Comment                        |
259   +===================+=====================+================================+
260   | :attr:`%%`        | *n/a*               | The literal % character.       |
261   +-------------------+---------------------+--------------------------------+
262   | :attr:`%c`        | int                 | A single character,            |
263   |                   |                     | represented as a C int.        |
264   +-------------------+---------------------+--------------------------------+
265   | :attr:`%d`        | int                 | Exactly equivalent to          |
266   |                   |                     | ``printf("%d")``.              |
267   +-------------------+---------------------+--------------------------------+
268   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
269   |                   |                     | ``printf("%u")``.              |
270   +-------------------+---------------------+--------------------------------+
271   | :attr:`%ld`       | long                | Exactly equivalent to          |
272   |                   |                     | ``printf("%ld")``.             |
273   +-------------------+---------------------+--------------------------------+
274   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
275   |                   |                     | ``printf("%lu")``.             |
276   +-------------------+---------------------+--------------------------------+
277   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
278   |                   |                     | ``printf("%zd")``.             |
279   +-------------------+---------------------+--------------------------------+
280   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
281   |                   |                     | ``printf("%zu")``.             |
282   +-------------------+---------------------+--------------------------------+
283   | :attr:`%i`        | int                 | Exactly equivalent to          |
284   |                   |                     | ``printf("%i")``.              |
285   +-------------------+---------------------+--------------------------------+
286   | :attr:`%x`        | int                 | Exactly equivalent to          |
287   |                   |                     | ``printf("%x")``.              |
288   +-------------------+---------------------+--------------------------------+
289   | :attr:`%s`        | char\*              | A null-terminated C character  |
290   |                   |                     | array.                         |
291   +-------------------+---------------------+--------------------------------+
292   | :attr:`%p`        | void\*              | The hex representation of a C  |
293   |                   |                     | pointer. Mostly equivalent to  |
294   |                   |                     | ``printf("%p")`` except that   |
295   |                   |                     | it is guaranteed to start with |
296   |                   |                     | the literal ``0x`` regardless  |
297   |                   |                     | of what the platform's         |
298   |                   |                     | ``printf`` yields.             |
299   +-------------------+---------------------+--------------------------------+
300   | :attr:`%U`        | PyObject\*          | A unicode object.              |
301   +-------------------+---------------------+--------------------------------+
302   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
303   |                   |                     | *NULL*) and a null-terminated  |
304   |                   |                     | C character array as a second  |
305   |                   |                     | parameter (which will be used, |
306   |                   |                     | if the first parameter is      |
307   |                   |                     | *NULL*).                       |
308   +-------------------+---------------------+--------------------------------+
309   | :attr:`%S`        | PyObject\*          | The result of calling          |
310   |                   |                     | :func:`PyObject_Unicode`.      |
311   +-------------------+---------------------+--------------------------------+
312   | :attr:`%R`        | PyObject\*          | The result of calling          |
313   |                   |                     | :func:`PyObject_Repr`.         |
314   +-------------------+---------------------+--------------------------------+
315
316   An unrecognized format character causes all the rest of the format string to be
317   copied as-is to the result string, and any extra arguments discarded.
318
319   .. versionadded:: 2.6
320
321
322.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
323
   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.
326
327   .. versionadded:: 2.6
328
329
330.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
331
332   Return a read-only pointer to the Unicode object's internal
333   :c:type:`Py_UNICODE` buffer, *NULL* if *unicode* is not a Unicode object.
334   Note that the resulting :c:type:`Py_UNICODE*` string may contain embedded
335   null characters, which would cause the string to be truncated when used in
336   most C functions.
337
338
339.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
340
341   Return the length of the Unicode object.
342
343   .. versionchanged:: 2.5
344      This function returned an :c:type:`int` type. This might require changes
345      in your code for properly supporting 64-bit systems.
346
347
348.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
349
350   Coerce an encoded object *obj* to a Unicode object and return a reference with
351   incremented refcount.
352
353   String and other char buffer compatible objects are decoded according to the
354   given encoding and using the error handling defined by errors.  Both can be
355   *NULL* to have the interface use the default values (see the next section for
356   details).
357
358   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
359   set.
360
361   The API returns *NULL* if there was an error.  The caller is responsible for
362   decref'ing the returned objects.
363
364
365.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)
366
367   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
368   throughout the interpreter whenever coercion to Unicode is needed.
369
370If the platform supports :c:type:`wchar_t` and provides a header file wchar.h,
371Python can interface directly to this type using the following functions.
372Support is optimized if Python's own :c:type:`Py_UNICODE` type is identical to
373the system's :c:type:`wchar_t`.
374
375
376wchar_t Support
377"""""""""""""""
378
379:c:type:`wchar_t` support for platforms which support it:
380
381.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
382
383   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
384   Return *NULL* on failure.
385
386   .. versionchanged:: 2.5
387      This function used an :c:type:`int` type for *size*. This might require
388      changes in your code for properly supporting 64-bit systems.
389
390
391.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
392
393   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*.  At most
394   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
395   0-termination character).  Return the number of :c:type:`wchar_t` characters
396   copied or ``-1`` in case of an error.  Note that the resulting :c:type:`wchar_t`
397   string may or may not be 0-terminated.  It is the responsibility of the caller
398   to make sure that the :c:type:`wchar_t` string is 0-terminated in case this is
399   required by the application. Also, note that the :c:type:`wchar_t*` string
400   might contain null characters, which would cause the string to be truncated
401   when used with most C functions.
402
403   .. versionchanged:: 2.5
404      This function returned an :c:type:`int` type and used an :c:type:`int`
405      type for *size*. This might require changes in your code for properly
406      supporting 64-bit systems.
407
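   Because the copied string may or may not be 0-terminated, a caller that needs
   a C string can reserve one extra element and terminate the buffer itself.  The
   following is a minimal sketch of that pattern in plain C.  ``copy_wide`` is a
   hypothetical stand-in that follows the contract described above (copy at most
   *size* characters, return the number copied); it is not part of the Python
   API, and ``copy_wide_terminated`` is likewise an illustrative name::

      #include <assert.h>
      #include <stddef.h>
      #include <wchar.h>

      /* Hypothetical stand-in for PyUnicode_AsWideChar(): copies at most
         size characters from src and never writes a terminator. */
      static ptrdiff_t
      copy_wide(const wchar_t *src, size_t src_len, wchar_t *w, ptrdiff_t size)
      {
          ptrdiff_t n = (ptrdiff_t)src_len < size ? (ptrdiff_t)src_len : size;

          if (n < 0)
              return -1;
          wmemcpy(w, src, (size_t)n);
          return n;
      }

      /* Copy into buf (capacity buf_len elements) and always 0-terminate.
         Returns the number of characters copied, or -1 on error. */
      static ptrdiff_t
      copy_wide_terminated(const wchar_t *src, size_t src_len,
                           wchar_t *buf, size_t buf_len)
      {
          ptrdiff_t n;

          assert(buf != NULL && buf_len > 0);
          /* Offer one element less than the capacity so the terminator fits. */
          n = copy_wide(src, src_len, buf, (ptrdiff_t)(buf_len - 1));
          if (n < 0)
              return -1;
          buf[n] = L'\0';
          return n;
      }

   Terminating the buffer does not help with embedded null characters: functions
   such as ``wcslen`` will still stop at the first one, so the returned count
   should be kept alongside the buffer when that matters.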
408
409.. _builtincodecs:
410
411Built-in Codecs
412^^^^^^^^^^^^^^^
413
414Python provides a set of built-in codecs which are written in C for speed. All of
415these codecs are directly usable via the following functions.
416
Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the corresponding arguments of the built-in
:func:`unicode` Unicode object constructor.
420
421Setting encoding to *NULL* causes the default encoding to be used which is
422ASCII.  The file system calls should use :c:data:`Py_FileSystemDefaultEncoding`
423as the encoding for file names. This variable should be treated as read-only: on
424some systems, it will be a pointer to a static string, on others, it will change
425at run-time (such as when the application invokes setlocale).
426
427Error handling is set by errors which may also be set to *NULL* meaning to use
428the default handling defined for the codec.  Default error handling for all
429built-in codecs is "strict" (:exc:`ValueError` is raised).
430
The codecs all use a similar interface.  For simplicity, only deviations from
the following generic ones are documented.
433
434
435Generic Codecs
436""""""""""""""
437
438These are the generic codec APIs:
439
440
441.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
442
443   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
444   *encoding* and *errors* have the same meaning as the parameters of the same name
445   in the :func:`unicode` built-in function.  The codec to be used is looked up
446   using the Python codec registry.  Return *NULL* if an exception was raised by
447   the codec.
448
449   .. versionchanged:: 2.5
450      This function used an :c:type:`int` type for *size*. This might require
451      changes in your code for properly supporting 64-bit systems.
452
453
454.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
455
456   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
457   string object.  *encoding* and *errors* have the same meaning as the parameters
458   of the same name in the Unicode :meth:`~unicode.encode` method.  The codec
459   to be used is looked up using the Python codec registry.  Return *NULL* if
460   an exception was raised by the codec.
461
462   .. versionchanged:: 2.5
463      This function used an :c:type:`int` type for *size*. This might require
464      changes in your code for properly supporting 64-bit systems.
465
466
467.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
468
   Encode a Unicode object and return the result as a Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`~unicode.encode` method. The codec to be used is looked
   up using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.
474
475
476UTF-8 Codecs
477""""""""""""
478
479These are the UTF-8 codec APIs:
480
481
482.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
483
484   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
485   *s*. Return *NULL* if an exception was raised by the codec.
486
487   .. versionchanged:: 2.5
488      This function used an :c:type:`int` type for *size*. This might require
489      changes in your code for properly supporting 64-bit systems.
490
491
492.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
493
494   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF8`. If
495   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
496   treated as an error. Those bytes will not be decoded and the number of bytes
497   that have been decoded will be stored in *consumed*.
498
499   .. versionadded:: 2.4
500
501   .. versionchanged:: 2.5
502      This function used an :c:type:`int` type for *size*. This might require
503      changes in your code for properly supporting 64-bit systems.
504
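   The value stored in *consumed* can be pictured as the length of the longest
   prefix that does not end in the middle of a UTF-8 sequence.  The sketch below
   computes such a prefix length in plain C.  It is an illustration only:
   ``utf8_complete_prefix`` is a hypothetical helper, it inspects just the final
   sequence, and it does not reproduce the decoder's validation or error
   handling::

      #include <assert.h>
      #include <stddef.h>

      /* Return the number of leading bytes of s that do not end in the
         middle of a UTF-8 sequence.  Only the last sequence is examined. */
      static size_t
      utf8_complete_prefix(const unsigned char *s, size_t size)
      {
          size_t i = size;      /* index just past the byte being examined */
          size_t trailing = 0;  /* continuation bytes seen at the end */
          size_t need;
          unsigned char lead;

          assert(s != NULL || size == 0);

          /* Step back over at most three continuation bytes (10xxxxxx). */
          while (i > 0 && trailing < 3 && (s[i - 1] & 0xC0) == 0x80) {
              i--;
              trailing++;
          }
          if (i == 0)
              return size;      /* no lead byte: malformed, not incomplete */

          lead = s[i - 1];
          if (lead < 0x80)
              need = 1;
          else if ((lead & 0xE0) == 0xC0)
              need = 2;
          else if ((lead & 0xF0) == 0xE0)
              need = 3;
          else if ((lead & 0xF8) == 0xF0)
              need = 4;
          else
              return size;      /* invalid lead byte: leave it to the decoder */

          /* The last sequence has trailing + 1 bytes; cut it off if short. */
          return (trailing + 1 < need) ? i - 1 : size;
      }

   A caller reading from a stream would decode only the returned number of bytes
   and carry the remainder over to the next chunk, which is the role *consumed*
   plays for the stateful decoder.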
505
506.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
507
508   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and return a
509   Python string object.  Return *NULL* if an exception was raised by the codec.
510
511   .. versionchanged:: 2.5
512      This function used an :c:type:`int` type for *size*. This might require
513      changes in your code for properly supporting 64-bit systems.
514
515
516.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
517
518   Encode a Unicode object using UTF-8 and return the result as Python string
519   object.  Error handling is "strict".  Return *NULL* if an exception was raised
520   by the codec.
521
522
523UTF-32 Codecs
524"""""""""""""
525
526These are the UTF-32 codec APIs:
527
528
529.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
530
531   Decode *size* bytes from a UTF-32 encoded buffer string and return the
532   corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
533   handling. It defaults to "strict".
534
535   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
536   order::
537
538      *byteorder == -1: little endian
539      *byteorder == 0:  native order
540      *byteorder == 1:  big endian
541
542   If ``*byteorder`` is zero, and the first four bytes of the input data are a
543   byte order mark (BOM), the decoder switches to this byte order and the BOM is
544   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
545   ``1``, any byte order mark is copied to the output.
546
   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.
549
550   In a narrow build code points outside the BMP will be decoded as surrogate pairs.
551
552   If *byteorder* is *NULL*, the codec starts in native order mode.
553
554   Return *NULL* if an exception was raised by the codec.
555
556   .. versionadded:: 2.6
557
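   The byte order mark check described above can be sketched in plain C as
   follows.  ``utf32_bom_order`` is a hypothetical helper, not part of the Python
   API; it only reports which byte order a leading BOM indicates, using the same
   ``-1``/``1`` convention as *byteorder*, and returns ``0`` when no BOM is
   present::

      #include <assert.h>
      #include <stddef.h>

      /* Return -1 for a little-endian BOM (FF FE 00 00), 1 for a big-endian
         BOM (00 00 FE FF), and 0 if the buffer does not start with a BOM. */
      static int
      utf32_bom_order(const unsigned char *s, size_t size)
      {
          assert(s != NULL || size == 0);
          if (size < 4)
              return 0;
          if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00)
              return -1;
          if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF)
              return 1;
          return 0;
      }

   A decoder that starts with ``*byteorder`` equal to zero would skip these four
   bytes and continue in the detected order; with ``-1`` or ``1`` it would skip
   the check and decode the same bytes as ordinary data.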
558
559.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
560
561   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF32`. If
562   *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
563   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
564   by four) as an error. Those bytes will not be decoded and the number of bytes
565   that have been decoded will be stored in *consumed*.
566
567   .. versionadded:: 2.6
568
569
570.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
571
572   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
573   data in *s*.  Output is written according to the following byte order::
574
575      byteorder == -1: little endian
576      byteorder == 0:  native byte order (writes a BOM mark)
577      byteorder == 1:  big endian
578
579   If byteorder is ``0``, the output string will always start with the Unicode BOM
580   mark (U+FEFF). In the other two modes, no BOM mark is prepended.
581
582   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
583   as a single code point.
584
585   Return *NULL* if an exception was raised by the codec.
586
587   .. versionadded:: 2.6
588
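   On a narrow build, a code point above U+FFFF occupies two
   :c:type:`Py_UNICODE` values, which the encoder joins into one code point.
   A minimal sketch of that arithmetic in plain C, using hypothetical helper
   names and fixed-width integer types rather than :c:type:`Py_UNICODE`::

      #include <assert.h>
      #include <stdint.h>

      /* True if hi/lo form a valid UTF-16 surrogate pair. */
      static int
      is_surrogate_pair(uint16_t hi, uint16_t lo)
      {
          return hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF;
      }

      /* Join a surrogate pair into the code point it represents. */
      static uint32_t
      join_surrogates(uint16_t hi, uint16_t lo)
      {
          assert(is_surrogate_pair(hi, lo));
          return 0x10000 + ((uint32_t)(hi - 0xD800) << 10)
                         + (uint32_t)(lo - 0xDC00);
      }

   For example, the pair ``0xD83D``, ``0xDE00`` joins to U+1F600.  The reverse
   split subtracts ``0x10000`` and distributes the remaining 20 bits across the
   two halves.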
589
590.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
591
592   Return a Python string using the UTF-32 encoding in native byte order. The
593   string always starts with a BOM mark.  Error handling is "strict".  Return
594   *NULL* if an exception was raised by the codec.
595
596   .. versionadded:: 2.6
597
598
599UTF-16 Codecs
600"""""""""""""
601
602These are the UTF-16 codec APIs:
603
604
605.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
606
607   Decode *size* bytes from a UTF-16 encoded buffer string and return the
608   corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
609   handling. It defaults to "strict".
610
611   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
612   order::
613
614      *byteorder == -1: little endian
615      *byteorder == 0:  native order
616      *byteorder == 1:  big endian
617
618   If ``*byteorder`` is zero, and the first two bytes of the input data are a
619   byte order mark (BOM), the decoder switches to this byte order and the BOM is
620   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
621   ``1``, any byte order mark is copied to the output (where it will result in
622   either a ``\ufeff`` or a ``\ufffe`` character).
623
   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.
626
627   If *byteorder* is *NULL*, the codec starts in native order mode.
628
629   Return *NULL* if an exception was raised by the codec.
630
631   .. versionchanged:: 2.5
632      This function used an :c:type:`int` type for *size*. This might require
633      changes in your code for properly supporting 64-bit systems.
634
635
636.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
637
638   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF16`. If
639   *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
640   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
641   split surrogate pair) as an error. Those bytes will not be decoded and the
642   number of bytes that have been decoded will be stored in *consumed*.
643
644   .. versionadded:: 2.4
645
646   .. versionchanged:: 2.5
647      This function used an :c:type:`int` type for *size* and an :c:type:`int *`
648      type for *consumed*. This might require changes in your code for
649      properly supporting 64-bit systems.
650
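   The two kinds of incomplete input named above, an odd trailing byte and a
   split surrogate pair, can be recognised with a few lines of plain C.  The
   helper below is a hypothetical sketch, not the decoder's implementation: it
   assumes the byte order is already known and only looks at the end of the
   buffer::

      #include <assert.h>
      #include <stddef.h>

      /* Return the number of leading bytes of s that end on a complete
         UTF-16 unit and do not end with an unpaired high surrogate.
         big_endian selects how each two-byte unit is read. */
      static size_t
      utf16_complete_prefix(const unsigned char *s, size_t size, int big_endian)
      {
          size_t n = size - (size % 2);   /* drop a dangling odd byte */
          unsigned int unit;

          assert(s != NULL || size == 0);
          if (n < 2)
              return n;

          /* Read the last complete 16-bit unit in the requested order. */
          if (big_endian)
              unit = ((unsigned int)s[n - 2] << 8) | s[n - 1];
          else
              unit = ((unsigned int)s[n - 1] << 8) | s[n - 2];

          /* A high surrogate (D800-DBFF) at the end is half of a split pair. */
          if (unit >= 0xD800 && unit <= 0xDBFF)
              n -= 2;
          return n;
      }

   The returned length corresponds to what would be reported through *consumed*;
   the bytes after it are left for the next call.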
651
652.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
653
654   Return a Python string object holding the UTF-16 encoded value of the Unicode
655   data in *s*.  Output is written according to the following byte order::
656
657      byteorder == -1: little endian
658      byteorder == 0:  native byte order (writes a BOM mark)
659      byteorder == 1:  big endian
660
661   If byteorder is ``0``, the output string will always start with the Unicode BOM
662   mark (U+FEFF). In the other two modes, no BOM mark is prepended.
663
   If *Py_UNICODE_WIDE* is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.
667
668   Return *NULL* if an exception was raised by the codec.
669
670   .. versionchanged:: 2.5
671      This function used an :c:type:`int` type for *size*. This might require
672      changes in your code for properly supporting 64-bit systems.
673
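   The *byteorder* convention can be illustrated with a small plain C routine
   that serialises 16-bit units.  This is a sketch under stated assumptions, not
   the codec itself: ``write_utf16`` and ``native_is_little_endian`` are
   hypothetical names, the input is already a sequence of UTF-16 units, and no
   error handling is modelled::

      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Return 1 if this machine stores the low byte first. */
      static int
      native_is_little_endian(void)
      {
          const uint16_t probe = 1;
          return *(const unsigned char *)&probe == 1;
      }

      /* Write n 16-bit units to out using the byteorder convention from the
         text: -1 little endian, 1 big endian, 0 native order preceded by a
         BOM (U+FEFF).  out must have room for 2 * (n + 1) bytes.
         Returns the number of bytes written. */
      static size_t
      write_utf16(const uint16_t *units, size_t n, int byteorder,
                  unsigned char *out)
      {
          int little = byteorder < 0
                       || (byteorder == 0 && native_is_little_endian());
          size_t pos = 0;
          size_t i;

          assert(out != NULL && (units != NULL || n == 0));
          if (byteorder == 0) {
              /* The BOM is written in the same order as the data. */
              out[pos++] = little ? 0xFF : 0xFE;
              out[pos++] = little ? 0xFE : 0xFF;
          }
          for (i = 0; i < n; i++) {
              unsigned char lo = (unsigned char)(units[i] & 0xFF);
              unsigned char hi = (unsigned char)(units[i] >> 8);
              out[pos++] = little ? lo : hi;
              out[pos++] = little ? hi : lo;
          }
          return pos;
      }

   With *byteorder* set to ``-1`` or ``1`` the output length is exactly twice the
   number of units; with ``0`` it grows by the two bytes of the BOM.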
674
675.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
676
677   Return a Python string using the UTF-16 encoding in native byte order. The
678   string always starts with a BOM mark.  Error handling is "strict".  Return
679   *NULL* if an exception was raised by the codec.
680
681
682UTF-7 Codecs
683""""""""""""
684
685These are the UTF-7 codec APIs:
686
687
688.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
689
690   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
691   *s*.  Return *NULL* if an exception was raised by the codec.
692
693
694.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
695
696   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF7`.  If
697   *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not
698   be treated as an error.  Those bytes will not be decoded and the number of
699   bytes that have been decoded will be stored in *consumed*.
700
701
702.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)
703
   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-7 and
   return a Python bytes object.  Return *NULL* if an exception was raised by
   the codec.
707
708   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
709   special meaning) will be encoded in base-64.  If *base64WhiteSpace* is
710   nonzero, whitespace will be encoded in base-64.  Both are set to zero for the
711   Python "utf-7" codec.
712
713
714Unicode-Escape Codecs
715"""""""""""""""""""""
716
717These are the "Unicode Escape" codec APIs:
718
719
720.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
721
722   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
723   string *s*.  Return *NULL* if an exception was raised by the codec.
724
725   .. versionchanged:: 2.5
726      This function used an :c:type:`int` type for *size*. This might require
727      changes in your code for properly supporting 64-bit systems.
728
729
730.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
731
732   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
733   return a Python string object.  Return *NULL* if an exception was raised by the
734   codec.
735
736   .. versionchanged:: 2.5
737      This function used an :c:type:`int` type for *size*. This might require
738      changes in your code for properly supporting 64-bit systems.
739
740
741.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
742
   Encode a Unicode object using Unicode-Escape and return the result as a Python
   string object.  Error handling is "strict".  Return *NULL* if an exception was
   raised by the codec.
746
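   A round trip through this codec might look like the sketch below, assuming
   ``Python.h`` is on the include path.  The helper name
   ``roundtrip_unicode_escape`` is hypothetical and not part of the API.

   .. code-block:: c

      #include <Python.h>
      #include <string.h>

      /* Hypothetical helper: decode Unicode-Escape text, then encode it again.
         Returns a new reference to a Python string object, or NULL with an
         exception set. */
      static PyObject *
      roundtrip_unicode_escape(const char *escaped)
      {
          PyObject *u, *encoded;

          /* Passing NULL for errors selects the default ("strict") handling. */
          u = PyUnicode_DecodeUnicodeEscape(escaped,
                                            (Py_ssize_t)strlen(escaped), NULL);
          if (u == NULL)
              return NULL;

          encoded = PyUnicode_AsUnicodeEscapeString(u);
          Py_DECREF(u);
          return encoded;
      }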
747
748Raw-Unicode-Escape Codecs
749"""""""""""""""""""""""""
750
751These are the "Raw Unicode Escape" codec APIs:
752
753
754.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
755
756   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
757   encoded string *s*.  Return *NULL* if an exception was raised by the codec.
758
759   .. versionchanged:: 2.5
760      This function used an :c:type:`int` type for *size*. This might require
761      changes in your code for properly supporting 64-bit systems.
762
763
764.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
765
766   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
767   and return a Python string object.  Return *NULL* if an exception was raised by
768   the codec.
769
770   .. versionchanged:: 2.5
771      This function used an :c:type:`int` type for *size*. This might require
772      changes in your code for properly supporting 64-bit systems.
773
774
775.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
776
   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object.  Error handling is "strict".  Return *NULL* if an
   exception was raised by the codec.
780
781
782Latin-1 Codecs
783""""""""""""""
784
These are the Latin-1 codec APIs.  Latin-1 corresponds to the first 256 Unicode
ordinals, and only these are accepted by the codecs during encoding.
787
788
789.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
790
791   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
792   *s*.  Return *NULL* if an exception was raised by the codec.
793
794   .. versionchanged:: 2.5
795      This function used an :c:type:`int` type for *size*. This might require
796      changes in your code for properly supporting 64-bit systems.
797
798
799.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
800
801   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and return
802   a Python string object.  Return *NULL* if an exception was raised by the codec.
803
804   .. versionchanged:: 2.5
805      This function used an :c:type:`int` type for *size*. This might require
806      changes in your code for properly supporting 64-bit systems.
807
808
809.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
810
   Encode a Unicode object using Latin-1 and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.
814
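   Because error handling is "strict", an ordinal outside the Latin-1 range
   makes the call fail.  The sketch below shows one way a caller could turn that
   failure into a simple test.  The helper name ``fits_latin1`` is hypothetical.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: report whether a Unicode object fits in Latin-1.
         Returns 1 if it does, 0 if encoding failed with UnicodeEncodeError
         (the exception is cleared), and -1 on any other error (the exception
         is left set). */
      static int
      fits_latin1(PyObject *unicode)
      {
          PyObject *encoded = PyUnicode_AsLatin1String(unicode);

          if (encoded != NULL) {
              Py_DECREF(encoded);
              return 1;
          }
          if (PyErr_ExceptionMatches(PyExc_UnicodeEncodeError)) {
              PyErr_Clear();
              return 0;
          }
          return -1;
      }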
815
816ASCII Codecs
817""""""""""""
818
819These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
820codes generate errors.
821
822
823.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
824
825   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
826   *s*.  Return *NULL* if an exception was raised by the codec.
827
828   .. versionchanged:: 2.5
829      This function used an :c:type:`int` type for *size*. This might require
830      changes in your code for properly supporting 64-bit systems.
831
832
833.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
834
835   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and return a
836   Python string object.  Return *NULL* if an exception was raised by the codec.
837
838   .. versionchanged:: 2.5
839      This function used an :c:type:`int` type for *size*. This might require
840      changes in your code for properly supporting 64-bit systems.
841
842
843.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
844
   Encode a Unicode object using ASCII and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.
848
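   The decoding function takes an *errors* argument, so a caller that prefers
   not to fail on non-ASCII bytes could pass a different handler name.  The
   sketch below assumes the standard ``"replace"`` handler.  The helper name
   ``decode_ascii_lossy`` is hypothetical.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: decode ASCII data, substituting U+FFFD for every
         byte that is not 7-bit ASCII instead of raising an error.  Returns a
         new reference, or NULL with an exception set. */
      static PyObject *
      decode_ascii_lossy(const char *data, Py_ssize_t size)
      {
          return PyUnicode_DecodeASCII(data, size, "replace");
      }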
849
850Character Map Codecs
851""""""""""""""""""""
852
This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.
857
858Decoding mappings must map single string characters to single Unicode
859characters, integers (which are then interpreted as Unicode ordinals) or ``None``
860(meaning "undefined mapping" and causing an error).
861
862Encoding mappings must map single Unicode characters to single string
863characters, integers (which are then interpreted as Latin-1 ordinals) or ``None``
864(meaning "undefined mapping" and causing an error).
865
The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or
Latin-1 ordinal, respectively. Because of this, mappings only need to contain
those entries which map characters to different code points.
873
874These are the mapping codec APIs:
875
876.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
877
   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object.  Return *NULL* if an exception was raised by the
   codec.  If *mapping* is *NULL*, Latin-1 decoding will be done.  Otherwise it can
   be a dictionary mapping byte values, or a unicode string that is treated as a
   lookup table.  Byte values greater than the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".
884
885   .. versionchanged:: 2.4
886      Allowed unicode string as mapping argument.
887
888   .. versionchanged:: 2.5
889      This function used an :c:type:`int` type for *size*. This might require
890      changes in your code for properly supporting 64-bit systems.
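
   As an illustration of the lookup-table form of *mapping*, the sketch below
   decodes bytes through a three-character unicode string.  The helper name
   ``decode_with_table`` and the table contents are made up for the example.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: decode bytes with a three-entry lookup table in
         which byte 0 maps to "a", byte 1 to "b" and byte 2 to "c".  A byte
         value past the end of the table is an undefined mapping and raises an
         error under "strict" handling. */
      static PyObject *
      decode_with_table(const char *data, Py_ssize_t size)
      {
          PyObject *table, *result;

          table = PyUnicode_DecodeASCII("abc", 3, NULL);
          if (table == NULL)
              return NULL;

          result = PyUnicode_DecodeCharmap(data, size, table, "strict");
          Py_DECREF(table);
          return result;
      }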
891
892
893.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
894
895   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
896   *mapping* object and return a Python string object. Return *NULL* if an
897   exception was raised by the codec.
898
899   .. versionchanged:: 2.5
900      This function used an :c:type:`int` type for *size*. This might require
901      changes in your code for properly supporting 64-bit systems.
902
903
904.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
905
   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object.  Error handling is "strict".  Return *NULL* if an
   exception was raised by the codec.
909
The following codec API is special in that it maps Unicode to Unicode.
911
912
913.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
914
915   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
916   character mapping *table* to it and return the resulting Unicode object.  Return
917   *NULL* when an exception was raised by the codec.
918
   The *table* must map Unicode ordinal integers to Unicode ordinal integers or
   ``None`` (causing deletion of the character).
921
922   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
923   and sequences work well.  Unmapped character ordinals (ones which cause a
924   :exc:`LookupError`) are left untouched and are copied as-is.
925
926   .. versionchanged:: 2.5
927      This function used an :c:type:`int` type for *size*. This might require
928      changes in your code for properly supporting 64-bit systems.
929
930
931MBCS codecs for Windows
932"""""""""""""""""""""""
933
934These are the MBCS codec APIs. They are currently only available on Windows and
935use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
936DBCS) is a class of encodings, not just one.  The target encoding is defined by
937the user settings on the machine running the codec.
938
939
940.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
941
942   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
943   Return *NULL* if an exception was raised by the codec.
944
945   .. versionchanged:: 2.5
946      This function used an :c:type:`int` type for *size*. This might require
947      changes in your code for properly supporting 64-bit systems.
948
949
.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
951
   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.
956
957   .. versionadded:: 2.5
958
959
960.. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
961
962   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return a
963   Python string object.  Return *NULL* if an exception was raised by the codec.
964
965   .. versionchanged:: 2.5
966      This function used an :c:type:`int` type for *size*. This might require
967      changes in your code for properly supporting 64-bit systems.
968
969
970.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
971
   Encode a Unicode object using MBCS and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.
975
976
977Methods & Slots
978"""""""""""""""
979
980.. _unicodemethodsandslots:
981
982Methods and Slot Functions
983^^^^^^^^^^^^^^^^^^^^^^^^^^
984
985The following APIs are capable of handling Unicode objects and strings on input
986(we refer to them as strings in the descriptions) and return Unicode objects or
987integers as appropriate.
988
989They all return *NULL* or ``-1`` if an exception occurs.
990
991
992.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
993
   Concatenate two strings, giving a new Unicode string.
995
996
997.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
998
999   Split a string giving a list of Unicode strings.  If *sep* is *NULL*, splitting
1000   will be done at all whitespace substrings.  Otherwise, splits occur at the given
1001   separator.  At most *maxsplit* splits will be done.  If negative, no limit is
1002   set.  Separators are not included in the resulting list.
1003
1004   .. versionchanged:: 2.5
1005      This function used an :c:type:`int` type for *maxsplit*. This might require
1006      changes in your code for properly supporting 64-bit systems.
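
   A short sketch of whitespace splitting follows, assuming ``Python.h`` is on
   the include path.  The helper name ``count_fields`` is hypothetical.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: count the whitespace-separated fields in a
         string.  Returns the count, or -1 with an exception set. */
      static Py_ssize_t
      count_fields(PyObject *s)
      {
          PyObject *parts;
          Py_ssize_t n;

          /* sep == NULL splits at whitespace; a negative maxsplit sets no
             limit on the number of splits. */
          parts = PyUnicode_Split(s, NULL, -1);
          if (parts == NULL)
              return -1;

          n = PyList_Size(parts);
          Py_DECREF(parts);
          return n;
      }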
1007
1008
1009.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
1010
   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.
1014
1015
1016.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
1017
1018   Translate a string by applying a character mapping table to it and return the
1019   resulting Unicode object.
1020
1021   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
1022   or ``None`` (causing deletion of the character).
1023
1024   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
1025   and sequences work well.  Unmapped character ordinals (ones which cause a
1026   :exc:`LookupError`) are left untouched and are copied as-is.
1027
   *errors* has the usual meaning for codecs.  It may be *NULL*, which selects
   the default error handling.
1030
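   A mapping table can be built directly in C.  The sketch below uses
   :c:func:`Py_BuildValue` to create a small dictionary.  The helper name
   ``translate_example`` and the chosen mapping are made up for the example.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: replace every "a" with "b" and delete every "x".
         Returns a new reference, or NULL with an exception set. */
      static PyObject *
      translate_example(PyObject *str)
      {
          PyObject *table, *result;

          /* Equivalent to the Python dict {ord("a"): ord("b"), ord("x"): None}. */
          table = Py_BuildValue("{i:i,i:O}", 'a', 'b', 'x', Py_None);
          if (table == NULL)
              return NULL;

          /* errors == NULL selects the default error handling. */
          result = PyUnicode_Translate(str, table, NULL);
          Py_DECREF(table);
          return result;
      }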
1031
1032.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
1033
1034   Join a sequence of strings using the given *separator* and return the resulting
1035   Unicode string.
1036
1037
1038.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
1039
1040   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
1041   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
1042   ``0`` otherwise. Return ``-1`` if an error occurred.
1043
1044   .. versionchanged:: 2.5
1045      This function used an :c:type:`int` type for *start* and *end*. This
1046      might require changes in your code for properly supporting 64-bit
1047      systems.
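
   The *direction* argument can be wrapped in two small helpers, sketched below.
   The names ``starts_with`` and ``ends_with`` are hypothetical.  Passing
   ``PY_SSIZE_T_MAX`` as *end* is assumed to be clipped to the string length,
   as with slicing.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helpers: each returns 1 on a match, 0 on no match and
         -1 if an error occurred. */
      static int
      starts_with(PyObject *str, PyObject *prefix)
      {
          /* direction == -1 requests a prefix match. */
          return (int)PyUnicode_Tailmatch(str, prefix, 0, PY_SSIZE_T_MAX, -1);
      }

      static int
      ends_with(PyObject *str, PyObject *suffix)
      {
          /* direction == 1 requests a suffix match. */
          return (int)PyUnicode_Tailmatch(str, suffix, 0, PY_SSIZE_T_MAX, 1);
      }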
1048
1049
1050.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
1051
1052   Return the first position of *substr* in ``str[start:end]`` using the given
1053   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
1054   backward search).  The return value is the index of the first match; a value of
1055   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
1056   occurred and an exception has been set.
1057
1058   .. versionchanged:: 2.5
1059      This function used an :c:type:`int` type for *start* and *end*. This
1060      might require changes in your code for properly supporting 64-bit
1061      systems.
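
   Since "not found" and "error" are reported with different negative values,
   callers need to check for both.  One possible wrapper is sketched below.
   The name ``has_substring`` is hypothetical.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: return 1 if substr occurs in str, 0 if it does
         not, and -1 if the search failed with an exception set. */
      static int
      has_substring(PyObject *str, PyObject *substr)
      {
          /* Forward search (direction == 1) over the whole string. */
          Py_ssize_t pos = PyUnicode_Find(str, substr, 0, PY_SSIZE_T_MAX, 1);

          if (pos == -2)
              return -1;      /* error, exception already set */
          return pos >= 0;    /* -1 means "no match" */
      }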
1062
1063
1064.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
1065
1066   Return the number of non-overlapping occurrences of *substr* in
1067   ``str[start:end]``.  Return ``-1`` if an error occurred.
1068
1069   .. versionchanged:: 2.5
1070      This function returned an :c:type:`int` type and used an :c:type:`int`
1071      type for *start* and *end*. This might require changes in your code for
1072      properly supporting 64-bit systems.
1073
1074
1075.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
1076
1077   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
1078   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
1079   occurrences.
1080
1081   .. versionchanged:: 2.5
1082      This function used an :c:type:`int` type for *maxcount*. This might
1083      require changes in your code for properly supporting 64-bit systems.
1084
1085
1086.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)
1087
1088   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
1089   respectively.
1090
1091
.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
1093
   Rich compare two Unicode strings and return one of the following:

   * *NULL* in case an exception was raised
1097   * :const:`Py_True` or :const:`Py_False` for successful comparisons
1098   * :const:`Py_NotImplemented` in case the type combination is unknown
1099
1100   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
1101   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
1102   with a :exc:`UnicodeDecodeError`.
1103
1104   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
1105   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
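
   The result is an object reference, so it has to be inspected and released by
   the caller.  A minimal sketch of an equality test follows.  The helper name
   ``unicode_equal`` is hypothetical.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: return 1 if the two strings compare equal, 0 if
         they do not (or the comparison is not implemented), and -1 if an
         exception was raised. */
      static int
      unicode_equal(PyObject *left, PyObject *right)
      {
          PyObject *res = PyUnicode_RichCompare(left, right, Py_EQ);
          int equal;

          if (res == NULL)
              return -1;

          equal = (res == Py_True);
          Py_DECREF(res);
          return equal;
      }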
1106
1107
1108.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
1109
1110   Return a new string object from *format* and *args*; this is analogous to
1111   ``format % args``.
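
   For instance, a formatted string could be built from C values along the
   lines of the sketch below.  The helper name ``format_pair`` and the format
   string are made up for the example.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: build the equivalent of u"%s=%d" % (name, value).
         Returns a new reference, or NULL with an exception set. */
      static PyObject *
      format_pair(const char *name, int value)
      {
          PyObject *format, *args, *result = NULL;

          format = PyUnicode_DecodeASCII("%s=%d", 5, NULL);
          if (format == NULL)
              return NULL;

          /* The arguments are passed as a tuple, as with the % operator. */
          args = Py_BuildValue("(si)", name, value);
          if (args != NULL) {
              result = PyUnicode_Format(format, args);
              Py_DECREF(args);
          }
          Py_DECREF(format);
          return result;
      }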
1112
1113
1114.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)
1115
1116   Check whether *element* is contained in *container* and return true or false
1117   accordingly.
1118
   *element* has to coerce to a one-element Unicode string.  ``-1`` is returned if
   there was an error.
1121