URL Parsing With WSGI And Paste
+++++++++++++++++++++++++++++++

:author: Ian Bicking <ianb@colorstudy.com>
:revision: $Rev$
:date: $LastChangedDate$

.. contents::

Introduction and Audience
=========================

This document is intended for web framework authors and integrators,
and people who want to understand the internal architecture of Paste.

.. include:: include/contact.txt

URL Parsing
===========

.. note::

   Sometimes people use "URL", and sometimes "URI".  I think URLs are
   a subset of URIs.  But in practice you'll almost never see URIs
   that aren't URLs, and certainly not in Paste.  URIs that aren't
   URLs are abstract Identifiers, that cannot necessarily be used to
   Locate the resource.  This document is *all* about locating.

Most generally, URL parsing is about taking a URL and determining what
"resource" the URL refers to.  "Resource" is a rather vague term,
intentionally.  It's really just a metaphor -- in reality there aren't
any "resources" in HTTP; there are only requests and responses.

In Paste, everything is about WSGI.  But that can seem too fancy.
There are four core things involved: the *request* (personified in the
WSGI environment), the *response* (personified in the
``start_response`` callback and the return iterator), the WSGI
application, and the server that calls that application.  The
application and request are objects, while the server and response are
really more like actions than concrete objects.

In this context, URL parsing is about mapping a URL to an
*application* and a *request*.  The request actually gets modified as
it moves through different parts of the system.  Two dictionary keys
in particular relate to URLs -- ``SCRIPT_NAME`` and ``PATH_INFO`` --
but any part of the environment can be modified as it is passed
through the system.
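
To make the request and the response concrete, here is a minimal
sketch in plain Python (no Paste required).  The names ``hello_app``
and ``call_app`` are made up for this illustration; ``call_app`` only
plays the role of the server by building a bare-bones environment and
supplying ``start_response``.

```python
def hello_app(environ, start_response):
    """A minimal WSGI application: read the request, emit a response."""
    # The request is just a dictionary; these two keys describe the URL.
    text = "SCRIPT_NAME=%r PATH_INFO=%r" % (
        environ.get("SCRIPT_NAME", ""), environ.get("PATH_INFO", ""))
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [text.encode("utf-8")]


def call_app(app, script_name, path_info):
    """Stand in for the server: build an environ and call the app."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    environ = {
        "REQUEST_METHOD": "GET",
        "SCRIPT_NAME": script_name,
        "PATH_INFO": path_info,
    }
    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")


status, text = call_app(hello_app, "/blog", "/edit/285")
```

A real server fills in many more keys, but these are the ones a URL
parser cares about.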

Dispatching
===========

.. note::

   WSGI isn't object oriented?  Well, if you look at it, you'll notice
   there are no objects except built-in types, so it shouldn't be a
   surprise.  Additionally, the interface and promises of the objects
   we do see are very minimal.  An application doesn't have any
   interface except one method -- ``__call__`` -- and that method
   *does* things; it doesn't give any other information.

Because WSGI is action-oriented, rather than object-oriented, it's
more important what we *do*.  "Finding" an application is probably an
intermediate step, but "running" the application is our ultimate goal,
and the only real judge of success.  An application that isn't run is
useless to us, because it doesn't have any other useful methods.

So what we're really doing is *dispatching* -- we're handing the
request and responsibility for the response off to another object
(another actor, really).  In the process we can actually retain some
control -- we can capture and transform the response, and we can
modify the request -- but that's not what the typical URL resolver will
do.
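
As a rough illustration of "retaining some control", the sketch below
hands the request off to another application, but modifies the request
on the way in and transforms the response on the way out.  ``Shout``,
``inner`` and the ``example.shouted`` key are all invented for this
example.

```python
class Shout(object):
    """Hypothetical middleware that dispatches to one application."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Modify the request on the way in.
        environ["example.shouted"] = True
        # Hand off to the real application, then transform its
        # response on the way out (here: upper-case every chunk).
        return [chunk.upper() for chunk in self.app(environ, start_response)]


def inner(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


statuses = []
environ = {"PATH_INFO": "/"}
body = Shout(inner)(
    environ, lambda status, headers, exc_info=None: statuses.append(status))
```

A URL resolver usually does much less than this: it picks the
application and gets out of the way.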

Motivations
===========

The most obvious kind of URL parsing is finding a WSGI application.

Typically when a framework first supports WSGI or is integrated into
Paste, it is "monolithic" with respect to URLs.  That is, you define
(in Paste, or maybe in Apache) a "root" URL, and everything under that
goes into the framework.  What the framework does internally, Paste
does not know -- it probably finds internal objects to dispatch to,
but the framework is opaque to Paste.  Not just to Paste, but to
any code that isn't in that framework.

That means that we can't mix code from multiple frameworks, or as
easily share services, or use WSGI middleware that doesn't apply to
the entire framework/application.

One place where we might want to use an "application" that isn't part
of the framework is uploading large files.  It's possible to keep
track of upload progress, and report that back to the user, but
frameworks typically aren't capable of this.  This is usually because
the POST request is completely read and parsed before the framework
invokes any application code.

This is resolvable in WSGI -- a WSGI application can provide its own
code to read and parse the POST request, and simultaneously report
progress (usually in a way that *another* WSGI application/request can
read and report to the user on that progress).  This is an example
where you want to allow "foreign" applications to be intermingled with
framework application code.
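
Here is one way such a pair of applications might look.  This is only
a sketch under several assumptions: the ``PROGRESS`` dictionary, the
use of ``QUERY_STRING`` as an upload id, and both application names
are invented for the example, and a real deployment would need
storage that is shared between processes.

```python
import io

# Hypothetical shared store: upload id -> (bytes read so far, total bytes).
PROGRESS = {}


def upload_app(environ, start_response):
    """Read the POST body in chunks, recording progress as we go."""
    upload_id = environ.get("QUERY_STRING", "")
    total = int(environ.get("CONTENT_LENGTH") or 0)
    stream = environ["wsgi.input"]
    received = 0
    while received < total:
        chunk = stream.read(min(8192, total - received))
        if not chunk:
            break  # the client went away early
        received += len(chunk)
        # A real application would also store or parse the chunk here.
        PROGRESS[upload_id] = (received, total)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done"]


def progress_app(environ, start_response):
    """A second application that reports on an upload."""
    received, total = PROGRESS.get(environ.get("QUERY_STRING", ""), (0, 0))
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [("%d/%d" % (received, total)).encode("ascii")]


# Usage: simulate a 20000-byte upload identified as "abc".
environ = {
    "QUERY_STRING": "abc",
    "CONTENT_LENGTH": "20000",
    "wsgi.input": io.BytesIO(b"x" * 20000),
}
upload_app(environ, lambda status, headers: None)
report = progress_app({"QUERY_STRING": "abc"}, lambda status, headers: None)
```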

Finding Applications
====================

OK, enough theory.  How does a URL parser work?  Well, it is a WSGI
application, and a WSGI server, in the typical "WSGI middleware"
style.  Except that it determines which application it will serve
for each request.

Let's consider Paste's ``URLParser`` (in ``paste.urlparser``).  This
class takes a directory name as its only required argument, and
instances are WSGI applications.

When a request comes in, the parser looks at ``PATH_INFO`` to see
what's left to parse.  ``SCRIPT_NAME`` represents where we are *now*;
it's the part of the URL that has been parsed.

There are a couple of special cases:

The empty string:

    URLParser serves directories.  When ``PATH_INFO`` is empty, that
    means we got a request with no trailing ``/``, like, say,
    ``/blog``.  If URLParser serves the ``blog`` directory, then this
    won't do -- the user is requesting the ``blog`` *page*.  We have
    to redirect them to ``/blog/``.

A single ``/``:

    So, we got a trailing ``/``.  This means we need to serve the
    "index" page.  In URLParser, this is some file named ``index``,
    though that's really an implementation detail.  You could create
    an index dynamically (like Apache's file listings), or whatever.

Otherwise we get a string like ``/path...``.  Note that ``PATH_INFO``
*must* start with a ``/``, or it must be empty.

URLParser pulls off the first part of the path.  E.g., if
``PATH_INFO`` is ``/blog/edit/285``, then the first part is ``blog``.
It appends this to ``SCRIPT_NAME``, and strips it off ``PATH_INFO``
(which becomes ``/edit/285``).
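
The popping step and the two special cases can be sketched in plain
Python.  ``pop_segment`` is a hypothetical stand-in written for this
illustration; it imitates the behaviour described above and is not
Paste's own code.

```python
def pop_segment(environ):
    """Move the first segment of PATH_INFO onto SCRIPT_NAME.

    Returns None when PATH_INFO is empty (no trailing slash), '' when
    it is a single '/', and the segment name otherwise.
    """
    path = environ.get("PATH_INFO", "")
    if not path:
        return None
    segment, sep, rest = path[1:].partition("/")
    environ["SCRIPT_NAME"] = environ.get("SCRIPT_NAME", "") + "/" + segment
    environ["PATH_INFO"] = sep + rest
    return segment


environ = {"SCRIPT_NAME": "", "PATH_INFO": "/blog/edit/285"}
first = pop_segment(environ)
# first is "blog"; SCRIPT_NAME is now "/blog", PATH_INFO "/edit/285"
```

Note that the two keys always add up to the original path, so any
application further down the line can still reconstruct the full URL.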

It then searches for a file that matches "blog".  In URLParser, this
means it looks for a filename which matches that name (ignoring the
extension).  It then uses the type of that file (determined by
extension) to create a WSGI application.

One case is that the file is a directory.  In that case, the
application is *another* URLParser instance, this time with the new
directory.

URLParser actually allows per-extension "plugins" -- these are just
functions that get a filename, and produce a WSGI application.  One of
these is ``make_py`` -- this function imports the module, and looks
for special symbols; if it finds a symbol ``application``, it assumes
this is a WSGI application that is ready to accept the request.  If it
finds a symbol that matches the name of the module (e.g., ``edit``),
then it assumes that is an application *factory*, meaning that when
you call it with no arguments you get a WSGI application.
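
A per-extension plugin scheme along these lines might look like the
following.  This is a sketch, not Paste's implementation: the
``constructors`` registry, ``make_app`` and ``make_py_sketch`` are
names made up here, and the module loading uses the standard
``importlib`` machinery.

```python
import importlib.util
import os
import tempfile

# Hypothetical registry mapping a file extension to a constructor.
constructors = {}


def make_py_sketch(filename):
    """Import ``filename`` and return the WSGI application it defines."""
    name = os.path.splitext(os.path.basename(filename))[0]
    spec = importlib.util.spec_from_file_location(name, filename)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if hasattr(module, "application"):
        return module.application        # a ready-made WSGI application
    if hasattr(module, name):
        return getattr(module, name)()   # a factory: call with no arguments
    return None


constructors[".py"] = make_py_sketch


def make_app(filename):
    """Pick a constructor by extension; None if nothing is registered."""
    constructor = constructors.get(os.path.splitext(filename)[1])
    return constructor(filename) if constructor else None


# Usage: write a throwaway module and load it through the registry.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edit.py")
    with open(path, "w") as f:
        f.write("def edit():\n"
                "    def app(environ, start_response):\n"
                "        start_response('200 OK', [])\n"
                "        return [b'edit page']\n"
                "    return app\n")
    app = make_app(path)
    missing = make_app(os.path.join(tmp, "notes.txt"))
```

Here ``edit.py`` defines a factory named after the module, so the
second branch is the one that fires.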

Another function takes "unknown" files (files for which no better
constructor exists) and creates an application that simply responds
with the contents of that file (and the appropriate ``Content-Type``).
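
A fallback constructor of that kind could be as small as this.  The
name ``make_static`` is an assumption for the example, and guessing
the type with the standard ``mimetypes`` module is my choice here, not
necessarily what Paste does.

```python
import mimetypes
import os
import tempfile


def make_static(filename):
    """Return a WSGI application that serves ``filename`` as-is."""
    content_type = (mimetypes.guess_type(filename)[0]
                    or "application/octet-stream")

    def static_app(environ, start_response):
        with open(filename, "rb") as f:
            body = f.read()
        start_response("200 OK", [
            ("Content-Type", content_type),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return static_app


# Usage: serve a throwaway text file and capture what the app sends.
sent = {}


def start_response(status, headers, exc_info=None):
    sent["status"] = status
    sent["headers"] = dict(headers)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.txt")
    with open(path, "wb") as f:
        f.write(b"hi")
    body = make_static(path)({}, start_response)
```

Reading the whole file into memory is fine for a sketch; a serious
version would stream it.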

In any case, ``URLParser`` delegates as soon as it can.  It doesn't
parse the entire path -- it just finds the *next* application, which
in turn may delegate to yet another application.

Here's a very simple implementation of URLParser::
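
    import importlib.util
    import os

    from paste import wsgilib

    def import_module(filename):
        # Load a module object from a file path.
        name = os.path.splitext(os.path.basename(filename))[0]
        spec = importlib.util.spec_from_file_location(name, filename)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module

    class URLParser(object):
        def __init__(self, dir):
            self.dir = dir
        def __call__(self, environ, start_response):
            segment = wsgilib.path_info_pop(environ)
            if segment is None:  # No trailing /
                # Redirect to the same URL with a trailing slash.
                location = environ.get('SCRIPT_NAME', '') + '/'
                start_response('301 Moved Permanently',
                               [('Location', location)])
                return [b'']
            for filename in os.listdir(self.dir):
                if os.path.splitext(filename)[0] == segment:
                    return self.serve_application(
                        environ, start_response, filename)
            return self.not_found(start_response)
        def not_found(self, start_response):
            start_response('404 Not Found', [('Content-Type', 'text/plain')])
            return [b'Not Found']
        def serve_application(self, environ, start_response, filename):
            basename, ext = os.path.splitext(filename)
            filename = os.path.join(self.dir, filename)
            if os.path.isdir(filename):
                return URLParser(filename)(environ, start_response)
            elif ext == '.py':
                module = import_module(filename)
                if hasattr(module, 'application'):
                    return module.application(environ, start_response)
                elif hasattr(module, basename):
                    # An application factory: call it to get the app.
                    return getattr(module, basename)()(
                        environ, start_response)
                return self.not_found(start_response)
            else:
                # send_file is assumed to return a WSGI application.
                return wsgilib.send_file(filename)(environ, start_response)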

Modifying The Request
=====================

Well, URLParser is one kind of parser.  But others are possible, and
aren't too hard to write.

Let's imagine a URL like ``/2004/05/01/edit``.  It's likely that
``/2004/05/01`` doesn't point to anything on file, but is really more
of a "variable" that gets passed to ``edit``.  So we can pull those
segments off and put them somewhere.  This is a good place for a WSGI
extension.  Let's put them in ``environ["app.date_parts"]``.

We'll pass one other application in -- once we get the date (if any)
we need to pass the request onto an application that can actually
handle it.  This "application" might be a URLParser or similar system
(that figures out what ``/edit`` means).

::

    from paste import wsgilib

    class GrabDate(object):
        def __init__(self, subapp):
            self.subapp = subapp
        def __call__(self, environ, start_response):
            date_parts = []
            while len(date_parts) < 3:
                first, rest = wsgilib.path_info_split(environ['PATH_INFO'])
                try:
                    # int(None) raises TypeError when the path runs out
                    date_parts.append(int(first))
                    wsgilib.path_info_pop(environ)
                except (ValueError, TypeError):
                    break
            environ['app.date_parts'] = date_parts
            return self.subapp(environ, start_response)

This is really like traditional "middleware", in that it sits between
the server and just one application.

Assuming you put this class in the ``myapp.grabdate`` module, you
could install it by adding this to your configuration::

    middleware.append('myapp.grabdate.GrabDate')
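
To see what the wrapped application receives, here is a Paste-free
sketch of the same idea together with a consumer.  ``grab_date`` is a
simplified stand-in for ``GrabDate`` written as a closure, and
``edit_app`` is a hypothetical downstream application; neither is part
of Paste.

```python
def grab_date(subapp):
    """Simplified stand-in for GrabDate that needs no helper library."""
    def middleware(environ, start_response):
        date_parts = []
        segments = environ.get("PATH_INFO", "").split("/")[1:]
        while len(date_parts) < 3 and segments and segments[0].isdigit():
            part = segments.pop(0)
            date_parts.append(int(part))
            # Keep SCRIPT_NAME in step with what has been consumed.
            environ["SCRIPT_NAME"] = environ.get("SCRIPT_NAME", "") + "/" + part
        environ["PATH_INFO"] = "".join("/" + s for s in segments)
        environ["app.date_parts"] = date_parts
        return subapp(environ, start_response)
    return middleware


def edit_app(environ, start_response):
    """Hypothetical application that relies on the extension key."""
    year, month, day = environ["app.date_parts"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [("editing %04d-%02d-%02d" % (year, month, day)).encode("ascii")]


environ = {"SCRIPT_NAME": "", "PATH_INFO": "/2004/05/01/edit"}
body = grab_date(edit_app)(environ, lambda status, headers: None)
```

``edit_app`` assumes all three parts are present; a more careful
application would check the length of the list first.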

Object Publishing
=================

Besides looking in the filesystem, "object publishing" is another
popular way to do URL parsing.  This is pretty easy to implement as
well -- it usually just means using ``getattr`` with the popped
segments.  But we'll implement a rough approximation of `Quixote's
<http://www.mems-exchange.org/software/quixote/>`_ URL parsing::

    from paste import wsgilib

    class ObjectApp(object):
        def __init__(self, obj):
            self.obj = obj
        def __call__(self, environ, start_response):
            obj = self.obj
            next = wsgilib.path_info_pop(environ)
            if next is None:
                # This is the object, let's serve it...
                return self.publish(obj, environ, start_response)
            next = next or '_q_index'  # the default index method
            if next in obj._q_exports and getattr(obj, next, None):
                return ObjectApp(getattr(obj, next))(
                    environ, start_response)
            next_obj = obj._q_traverse(next)
            if not next_obj:
                start_response('404 Not Found',
                               [('Content-Type', 'text/plain')])
                return [b'Not Found']
            return ObjectApp(next_obj)(environ, start_response)

        def publish(self, obj, environ, start_response):
            if callable(obj):
                output = str(obj())
            else:
                output = str(obj)
            start_response('200 OK', [('Content-type', 'text/html')])
            return [output.encode('utf-8')]

The ``publish`` method is a little weak, and functions like
``_q_traverse`` aren't passed interesting information about the
request, but this is only a rough approximation of the framework.
Things to note:

* The object has standard attributes and methods -- ``_q_exports``
  (attributes that are public to the web) and ``_q_traverse``
  (a way of overriding the traversal without having an attribute for
  each possible path segment).

* The object isn't rendered until the path is completely consumed
  (when ``next`` is ``None``).  This means ``_q_traverse`` has to
  consume extra segments of the path.  In this version ``_q_traverse``
  is only given the next piece of the path; Quixote gives it the
  entire path (as a list of segments).

* ``publish`` is really a small and lame way to turn a Quixote object
  into a WSGI application.  For any serious framework you'd want to do
  a better job than what I do here.

* It would be even better if you used something like `Adaptation
  <http://www.python.org/peps/pep-0246.html>`_ to convert objects into
  applications.  This would include removing the explicit creation of
  new ``ObjectApp`` instances, which could also be a kind of fall-back
  adaptation.
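
For a feel of what a published object might look like, here is a small
self-contained sketch.  The ``Blog`` and ``Entry`` classes and the
``resolve`` helper are invented for the example; ``resolve`` walks a
path with the same two conventions (``_q_exports`` and a
single-segment ``_q_traverse``) but leaves out WSGI and the index
method entirely.

```python
class Entry(object):
    _q_exports = []

    def __init__(self, entry_id):
        self.entry_id = entry_id

    def __str__(self):
        return "Entry %d" % self.entry_id

    def _q_traverse(self, name):
        return None  # nothing lives below an entry


class Blog(object):
    _q_exports = ["about"]  # attributes that are public to the web

    def about(self):
        return "About this blog"

    def _q_traverse(self, name):
        # Anything that looks like a number is treated as an entry id.
        if name.isdigit():
            return Entry(int(name))
        return None


def resolve(obj, path):
    """Walk ``path`` one segment at a time; None stands for a 404."""
    for name in [s for s in path.split("/") if s]:
        if name in getattr(obj, "_q_exports", ()) and hasattr(obj, name):
            obj = getattr(obj, name)
            continue
        traverse = getattr(obj, "_q_traverse", None)
        obj = traverse(name) if traverse else None
        if obj is None:
            return None
    return obj() if callable(obj) else str(obj)


about = resolve(Blog(), "/about")
entry = resolve(Blog(), "/285")
missing = resolve(Blog(), "/nothing-here")
```

``about`` is reached through ``_q_exports``, while ``/285`` has no
matching attribute and falls through to ``_q_traverse``.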

Anyway, this ``ObjectApp`` example is less complete, but maybe it will
get you thinking.