1:mod:`email` Package Architecture
2=================================
3
4Overview
5--------
6
7The email package consists of three major components:
8
9    Model
10        An object structure that represents an email message, and provides an
11        API for creating, querying, and modifying a message.
12
13    Parser
14        Takes a sequence of characters or bytes and produces a model of the
15        email message represented by those characters or bytes.
16
17    Generator
18        Takes a model and turns it into a sequence of characters or bytes.  The
19        sequence can either be intended for human consumption (a printable
20        unicode string) or bytes suitable for transmission over the wire.  In
21        the latter case all data is properly encoded using the content transfer
22        encodings specified by the relevant RFCs.
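
As a rough illustration of how the three components fit together, the sketch
below parses some bytes into a model, touches the model, and serializes it in
both output forms.  The sample message and the ``X-Processed`` header are
invented for the example; the parser and generator classes are the ones
shipped in ``email.parser`` and ``email.generator``.

```python
from io import BytesIO, StringIO

from email.generator import BytesGenerator, Generator
from email.parser import BytesParser

# Illustrative input; the addresses and text are made up.
raw = b"From: a@example.com\nTo: b@example.com\nSubject: Hello\n\nBody text\n"

# Parser: bytes in, model out.
msg = BytesParser().parsebytes(raw)

# Model: query and modify through the Message API.
subject = msg["Subject"]
msg["X-Processed"] = "yes"

# Generator: model out as a printable string ...
text_out = StringIO()
Generator(text_out).flatten(msg)
as_text = text_out.getvalue()

# ... or as bytes suitable for transmission.
bytes_out = BytesIO()
BytesGenerator(bytes_out).flatten(msg)
as_wire = bytes_out.getvalue()
```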
23
24Conceptually the package is organized around the model.  The model provides both
25"external" APIs intended for use by application programs using the library,
26and "internal" APIs intended for use by the Parser and Generator components.
This division is intentionally a bit fuzzy; all of the API described by this
documentation is public and stable.  This allows an application with special
needs to implement its own parser and/or generator.
30
In addition to the three major functional components, there is a fourth key
component to the architecture:
33
34    Policy
35        An object that specifies various behavioral settings and carries
36        implementations of various behavior-controlling methods.
37
38The Policy framework provides a simple and convenient way to control the
39behavior of the library, making it possible for the library to be used in a
40very flexible fashion while leveraging the common code required to parse,
represent, and generate message-like objects.  For example, in addition to the
default :rfc:`5322` email message policy, the package also provides a policy
that manages HTTP headers in a fashion compliant with :rfc:`2616`.  Individual
policy controls, such as the maximum line length produced by the generator, can
also be set separately to meet specialized application requirements.
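
A minimal sketch of adjusting a single policy control follows.  It assumes
that ``email.policy.default`` and ``email.policy.HTTP`` are the policy
instances meant above, and it uses a hypothetical ``serialize`` helper to
drive the generator directly.  The generator is used instead of
``Message.as_string`` because that method passes its own ``maxheaderlen``
default, which overrides the policy's line length.

```python
from io import StringIO

from email import policy
from email.generator import Generator
from email.message import Message


def serialize(msg, pol):
    """Hypothetical helper: flatten msg to text under the given policy."""
    out = StringIO()
    Generator(out, policy=pol).flatten(msg)
    return out.getvalue()


msg = Message(policy=policy.default)
msg["Subject"] = " ".join(["word"] * 30)   # long enough to need folding
msg.set_payload("body\n")

# clone() returns a new policy with only the named control changed.
narrow = policy.default.clone(max_line_length=40)

default_text = serialize(msg, policy.default)
narrow_text = serialize(msg, narrow)

# The HTTP policy differs from the default in controls such as these.
http_linesep = policy.HTTP.linesep
http_max_line_length = policy.HTTP.max_line_length
```

Policies are immutable, which is why the example derives ``narrow`` with
``clone()`` rather than assigning to ``max_line_length``.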
46
47
48The Model
49---------
50
51The message model is implemented by the :class:`~email.message.Message` class.
The model divides a message into the two fundamental parts discussed by
:rfc:`5322`: the header section and the body.  The `Message` object acts as a
54pseudo-dictionary of named headers.  Its dictionary interface provides
55convenient access to individual headers by name.  However, all headers are kept
56internally in an ordered list, so that the information about the order of the
57headers in the original message is preserved.
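
The following sketch shows the pseudo-dictionary behaviour with the legacy
``Message`` class and its default policy.  The header values are invented for
the example.

```python
from email.message import Message

msg = Message()
msg["Received"] = "from mx1.example.org"
msg["Subject"] = "Order matters"
msg["Received"] = "from mx2.example.org"   # appended, not overwritten

# Lookup by name is case-insensitive and returns the first match.
first_received = msg["received"]

# All values for a repeated header, in their original order.
all_received = msg.get_all("Received")

# The header names come back in insertion order, duplicates included.
names = msg.keys()
```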
58
59The `Message` object also has a `payload` that holds the body.  A `payload` can
60be one of two things: data, or a list of `Message` objects.  The latter is used
61to represent a multipart MIME message.  Lists can be nested arbitrarily deeply
62in order to represent the message, with all terminal leaves having non-list
63data payloads.
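
To make the nesting concrete, here is a small sketch that builds a two-level
multipart tree by hand and collects the payloads of its leaves.  The
``leaf_payloads`` function is a hypothetical helper written for this example;
``Message.walk`` offers a similar traversal.

```python
from email.message import Message


def make_leaf(content_type, text):
    """Build a non-multipart part whose payload is plain data."""
    part = Message()
    part["Content-Type"] = content_type
    part.set_payload(text)
    return part


# Inner container: its payload becomes a list of Message objects.
alternative = Message()
alternative["Content-Type"] = "multipart/alternative"
alternative.attach(make_leaf("text/plain", "Plain body"))
alternative.attach(make_leaf("text/html", "<p>HTML body</p>"))

# Outer container holding the inner one.
root = Message()
root["Content-Type"] = "multipart/mixed"
root.attach(alternative)


def leaf_payloads(msg):
    """Yield the data payload of every terminal leaf, depth first."""
    if msg.is_multipart():
        for part in msg.get_payload():
            yield from leaf_payloads(part)
    else:
        yield msg.get_payload()


leaves = list(leaf_payloads(root))
```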
64
65
66Message Lifecycle
67-----------------
68
69The general lifecycle of a message is:
70
71    Creation
72        A `Message` object can be created by a Parser, or it can be
73        instantiated as an empty message by an application.
74
75    Manipulation
76        The application may examine one or more headers, and/or the
77        payload, and it may modify one or more headers and/or
78        the payload.  This may be done on the top level `Message`
79        object, or on any sub-object.
80
81    Finalization
82        The Model is converted into a unicode or binary stream,
83        or the model is discarded.
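
The three stages can be sketched for a message that the application creates
itself.  The addresses and text are placeholders.

```python
from email.message import Message

# Creation: instantiate an empty message.
msg = Message()
msg["From"] = "a@example.com"
msg["Subject"] = "Draft"
msg.set_payload("Hello\n")

# Manipulation: examine and modify headers.
old_subject = msg["Subject"]
msg.replace_header("Subject", "Final")   # keeps the header's position
del msg["From"]
msg["From"] = "b@example.com"            # re-added at the end

# Finalization: convert the model to a unicode or binary stream.
text = msg.as_string()
data = msg.as_bytes()
```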
84
85
86
87Header Policy Control During Lifecycle
88--------------------------------------
89
90One of the major controls exerted by the Policy is the management of headers
91during the `Message` lifecycle.  Most applications don't need to be aware of
92this.
93
94A header enters the model in one of two ways: via a Parser, or by being set to
95a specific value by an application program after the Model already exists.
96Similarly, a header exits the model in one of two ways: by being serialized by
97a Generator, or by being retrieved from a Model by an application program.  The
98Policy object provides hooks for all four of these pathways.
99
100The model storage for headers is a list of (name, value) tuples.
101
102The Parser identifies headers during parsing, and passes them to the
103:meth:`~email.policy.Policy.header_source_parse` method of the Policy.  The
104result of that method is the (name, value) tuple to be stored in the model.
105
106When an application program supplies a header value (for example, through the
107`Message` object `__setitem__` interface), the name and the value are passed to
108the :meth:`~email.policy.Policy.header_store_parse` method of the Policy, which
109returns the (name, value) tuple to be stored in the model.
110
111When an application program retrieves a header (through any of the dict or list
112interfaces of `Message`), the name and value are passed to the
113:meth:`~email.policy.Policy.header_fetch_parse` method of the Policy to
114obtain the value returned to the application.
115
When a Generator requests a header during serialization, the name and value are
passed to the :meth:`~email.policy.Policy.fold` method of the Policy, which
returns a string containing line breaks in the appropriate places.  The
:attr:`~email.policy.Policy.cte_type` Policy control determines whether or
not Content Transfer Encoding is performed on the data in the header.  There is
also a :meth:`~email.policy.Policy.fold_binary` method for use by generators
that produce binary output, which returns the folded header as binary data,
possibly folded at different places than the corresponding string would be.
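
One way to watch the four pathways is to subclass a concrete policy and record
each hook as it fires.  ``TracingPolicy`` and the ``calls`` list below are
hypothetical names introduced for this sketch; the hook methods themselves are
the ones described above.

```python
from email import message_from_string
from email.policy import Compat32

calls = []   # records the hooks in the order they are invoked


class TracingPolicy(Compat32):
    """Hypothetical policy that logs each header hook, then defers."""

    def header_source_parse(self, sourcelines):
        calls.append("header_source_parse")
        return super().header_source_parse(sourcelines)

    def header_store_parse(self, name, value):
        calls.append("header_store_parse")
        return super().header_store_parse(name, value)

    def header_fetch_parse(self, name, value):
        calls.append("header_fetch_parse")
        return super().header_fetch_parse(name, value)

    def fold(self, name, value):
        calls.append("fold")
        return super().fold(name, value)


tracing = TracingPolicy()

# Entry via the Parser.
msg = message_from_string("Subject: parsed\n\nbody\n", policy=tracing)
# Entry via the application.
msg["X-Added"] = "by application"
# Exit via the application.
subject = msg["Subject"]
# Exit via the Generator (one fold call per stored header).
text = msg.as_string()
```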
124
125
126Handling Binary Data
127--------------------
128
129In an ideal world all message data would conform to the RFCs, meaning that the
130parser could decode the message into the idealized unicode message that the
131sender originally wrote.  In the real world, the email package must also be
132able to deal with badly formatted messages, including messages containing
133non-ASCII characters that either have no indicated character set or are not
134valid characters in the indicated character set.
135
136Since email messages are *primarily* text data, and operations on message data
137are primarily text operations (except for binary payloads of course), the model
138stores all text data as unicode strings.  Un-decodable binary inside text
139data is handled by using the `surrogateescape` error handler of the ASCII
140codec.  As with the binary filenames the error handler was introduced to
141handle, this allows the email package to "carry" the binary data received
142during parsing along until the output stage, at which time it is regenerated
143in its original form.
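
The mechanism can be sketched in two steps: first the codec behaviour on its
own, then a round trip through the package using the default legacy policy.
The sample header deliberately contains a Latin-1 byte with no declared
character set.

```python
from email import message_from_bytes

# The error handler maps each undecodable byte to a lone surrogate ...
carried = b"caf\xe9".decode("ascii", "surrogateescape")
# ... and maps it back to the original byte when encoding.
restored = carried.encode("ascii", "surrogateescape")

# The same idea applied to a whole message.
raw = b"Subject: caf\xe9\n\nbody\n"
msg = message_from_bytes(raw)
regenerated = msg.as_bytes()
```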
144
145This carried binary data is almost entirely an implementation detail.  The one
146place where it is visible in the API is in the "internal" API.  A Parser must
147do the `surrogateescape` encoding of binary input data, and pass that data to
148the appropriate Policy method.  The "internal" interface used by the Generator
149to access header values preserves the `surrogateescaped` bytes.  All other
150interfaces convert the binary data either back into bytes or into a safe form
151(losing information in some cases).
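
The difference between the two kinds of interface can be seen by comparing
``Message.raw_items``, which is documented as an internal API for generators,
with ordinary header access.  This sketch assumes ``email.policy.default`` as
the policy, under which the application-facing value is sanitized.

```python
from email import message_from_bytes, policy

raw = b"Subject: caf\xe9\n\nbody\n"
msg = message_from_bytes(raw, policy=policy.default)

# "Internal" view: the stored value still carries the escaped byte.
internal_value = dict(msg.raw_items())["Subject"]

# Application view: the byte is replaced by a safe placeholder, so the
# original information is lost on this path.
public_value = msg["Subject"]
```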
152
153
154Backward Compatibility
155----------------------
156
The :class:`~email.policy.Compat32` Policy provides backward
compatibility with version 5.1 of the email package.  It does this via the
following implementation of the five Policy methods described above:
160
161header_source_parse
162    Splits the first line on the colon to obtain the name, discards any spaces
163    after the colon, and joins the remainder of the line with all of the
164    remaining lines, preserving the linesep characters to obtain the value.
165    Trailing carriage return and/or linefeed characters are stripped from the
166    resulting value string.
167
168header_store_parse
169    Returns the name and value exactly as received from the application.
170
171header_fetch_parse
    If the value contains any `surrogateescaped` binary data, returns the value
    as a :class:`~email.header.Header` object, using the character set
    `unknown-8bit`.  Otherwise just returns the value.
175
176fold
    Uses :class:`~email.header.Header`'s folding to fold headers in the
    same way the email 5.1 generator did.
179
fold_binary
    Same as fold, but encodes to 'ascii'.
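
These behaviours can be checked by calling the methods directly on the
``email.policy.compat32`` instance, as in the sketch below.  Note that in
``email.policy`` the binary variant is exposed under the name ``fold_binary``.
The header values are invented for the example.

```python
from email.header import Header
from email.policy import compat32

# header_source_parse: split on the colon, keep internal line breaks,
# strip trailing CR/LF.
name, value = compat32.header_source_parse(
    ["Subject: first line\r\n", " second line\r\n"]
)

# header_store_parse: name and value come back unchanged.
stored = compat32.header_store_parse("Subject", "as given")

# header_fetch_parse: plain text is returned as is; carried binary data
# is wrapped in a Header object.
plain = compat32.header_fetch_parse("Subject", "plain")
carried = b"caf\xe9".decode("ascii", "surrogateescape")
wrapped = compat32.header_fetch_parse("Subject", carried)

# fold and fold_binary: the same folding, as text and as bytes.
folded = compat32.fold("Subject", "short")
folded_bytes = compat32.fold_binary("Subject", "short")
```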
182
183
184New Algorithm
185-------------
186
187header_source_parse
188    Same as legacy behavior.
189
190header_store_parse
191    Same as legacy behavior.
192
193header_fetch_parse
194    If the value is already a header object, returns it.  Otherwise, parses the
195    value using the new parser, and returns the resulting object as the value.
196    `surrogateescaped` bytes get turned into unicode unknown character code
197    points.
198
199fold
200    Uses the new header folding algorithm, respecting the policy settings.
    `surrogateescaped` bytes are encoded using the ``unknown-8bit`` charset for
    ``cte_type=7bit`` or ``8bit``.  Returns a string.

    At some point there will also be a ``cte_type=unicode``, and for that
    policy fold will serialize the idealized unicode message with RFC-like
    folding, converting any `surrogateescaped` bytes into the unicode
    unknown character glyph.
208
fold_binary
    Uses the new header folding algorithm, respecting the policy settings.
    `surrogateescaped` bytes are encoded using the ``unknown-8bit`` charset for
    ``cte_type=7bit``, and get turned back into bytes for ``cte_type=8bit``.
    Returns bytes.

    At some point there will also be a ``cte_type=unicode``, and for that
    policy fold_binary will serialize the message according to :rfc:`5335`.
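
The sketch below exercises the fetch and fold behaviour described in this
section.  It assumes that ``email.policy.default`` is a policy implementing
the new algorithm; that mapping is an assumption of the example, not something
stated above.

```python
from email import policy

# A header value carrying one undecodable byte.
carried = b"caf\xe9".decode("ascii", "surrogateescape")

# Fetching parses the value into a header object (a str subclass that
# also records the header name).
fetched = policy.default.header_fetch_parse("Subject", carried)

# A value that is already a header object is returned unchanged.
again = policy.default.header_fetch_parse("Subject", fetched)

# Folding to text encodes the carried byte with the unknown-8bit charset.
folded = policy.default.fold("Subject", carried)
```

The exact folded text depends on the policy settings, so the example only
relies on the ``unknown-8bit`` charset label appearing in the output.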
217