1:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5   :synopsis: mathematical statistics functions
6
7.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
12**Source code:** :source:`Lib/statistics.py`
13
14.. testsetup:: *
15
16   from statistics import *
17   __name__ = '<doctest>'
18
19--------------
20
This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.
23
24.. note::
25
26   Unless explicitly noted otherwise, these functions support :class:`int`,
27   :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
   Behaviour with other types (whether in the numeric tower or not) is
   currently unsupported.  Collections that mix types are also undefined and
   implementation-dependent.  If your input data consists of mixed types,
31   you may be able to use :func:`map` to ensure a consistent result, e.g.
32   ``map(float, input_data)``.
33
34Averages and measures of central location
35-----------------------------------------
36
37These functions calculate an average or typical value from a population
38or sample.
39
40=======================  =============================================
41:func:`mean`             Arithmetic mean ("average") of data.
42:func:`harmonic_mean`    Harmonic mean of data.
43:func:`median`           Median (middle value) of data.
44:func:`median_low`       Low median of data.
45:func:`median_high`      High median of data.
46:func:`median_grouped`   Median, or 50th percentile, of grouped data.
47:func:`mode`             Mode (most common value) of discrete data.
48=======================  =============================================
49
50Measures of spread
51------------------
52
53These functions calculate a measure of how much the population or sample
54tends to deviate from the typical or average values.
55
56=======================  =============================================
57:func:`pstdev`           Population standard deviation of data.
58:func:`pvariance`        Population variance of data.
59:func:`stdev`            Sample standard deviation of data.
60:func:`variance`         Sample variance of data.
61=======================  =============================================
62
63
64Function details
65----------------
66
67Note: The functions do not require the data given to them to be sorted.
68However, for reading convenience, most of the examples show sorted sequences.
69
70.. function:: mean(data)
71
   Return the sample arithmetic mean of *data*, which can be a sequence or iterator.
73
74   The arithmetic mean is the sum of the data divided by the number of data
75   points.  It is commonly called "the average", although it is only one of many
76   different mathematical averages.  It is a measure of the central location of
77   the data.
78
79   If *data* is empty, :exc:`StatisticsError` will be raised.
80
81   Some examples of use:
82
83   .. doctest::
84
85      >>> mean([1, 2, 3, 4, 4])
86      2.8
87      >>> mean([-1.0, 2.5, 3.25, 5.75])
88      2.625
89
90      >>> from fractions import Fraction as F
91      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
92      Fraction(13, 21)
93
94      >>> from decimal import Decimal as D
95      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
96      Decimal('0.5625')
97
98   .. note::
99
100      The mean is strongly affected by outliers and is not a robust estimator
101      for central location: the mean is not necessarily a typical example of the
102      data points.  For more robust, although less efficient, measures of
103      central location, see :func:`median` and :func:`mode`.  (In this case,
104      "efficient" refers to statistical efficiency rather than computational
105      efficiency.)
106
107      The sample mean gives an unbiased estimate of the true population mean,
108      which means that, taken on average over all the possible samples,
109      ``mean(sample)`` converges on the true mean of the entire population.  If
110      *data* represents the entire population rather than a sample, then
111      ``mean(data)`` is equivalent to calculating the true population mean μ.
112
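As a rough illustration of the definition and of the outlier sensitivity
described for :func:`mean`, the following sketch compares the function with a
hand-computed sum divided by count, and then with :func:`median` after one
extreme value is added.  The sample values are made up for illustration only.

```python
from statistics import mean, median

# Illustrative data only; these values are not from any real data set.
data = [2, 3, 3, 4, 5]

# The arithmetic mean is the sum divided by the number of data points.
manual_mean = sum(data) / len(data)
print(manual_mean == mean(data))  # True

# A single outlier pulls the mean far more than it pulls the median.
with_outlier = data + [100]
print(mean(with_outlier))    # 19.5
print(median(with_outlier))  # 3.5
```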
113
114.. function:: harmonic_mean(data)
115
116   Return the harmonic mean of *data*, a sequence or iterator of
117   real-valued numbers.
118
119   The harmonic mean, sometimes called the subcontrary mean, is the
120   reciprocal of the arithmetic :func:`mean` of the reciprocals of the
121   data. For example, the harmonic mean of three values *a*, *b* and *c*
122   will be equivalent to ``3/(1/a + 1/b + 1/c)``.
123
   The harmonic mean is a type of average, a measure of the central
   location of the data.  It is often appropriate when averaging quantities
   which are rates or ratios, such as speeds.

   For example, suppose an investor purchases an equal value of shares in
   each of three companies, with P/E (price/earnings) ratios of 2.5, 3 and 10.
   What is the average P/E ratio for the investor's portfolio?
131
132   .. doctest::
133
134      >>> harmonic_mean([2.5, 3, 10])  # For an equal investment portfolio.
135      3.6
136
137   Using the arithmetic mean would give an average of about 5.167, which
138   is too high.
139
140   :exc:`StatisticsError` is raised if *data* is empty, or any element
141   is less than zero.
142
143   .. versionadded:: 3.6
144
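The "reciprocal of the mean of the reciprocals" definition given for
:func:`harmonic_mean` can be sketched in a few lines.  The helper name
``harmonic_mean_sketch`` is hypothetical, and unlike the library function this
simplified version assumes strictly positive data.  The speeds are made-up
values: equal distances driven at 40 km/h and at 60 km/h.

```python
import math
from statistics import harmonic_mean, mean


def harmonic_mean_sketch(values):
    """Reciprocal of the arithmetic mean of the reciprocals (hypothetical helper)."""
    values = list(values)
    # Assumption of this sketch: data must be non-empty and strictly positive.
    if not values or any(x <= 0 for x in values):
        raise ValueError("sketch expects non-empty, strictly positive data")
    return len(values) / sum(1 / x for x in values)


# Equal distances driven at 40 km/h and at 60 km/h.
speeds = [40, 60]
print(math.isclose(harmonic_mean_sketch(speeds), harmonic_mean(speeds)))  # True
print(mean(speeds))  # 50, which overstates the average speed for the whole trip
```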
145
146.. function:: median(data)
147
148   Return the median (middle value) of numeric data, using the common "mean of
149   middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
150   *data* can be a sequence or iterator.
151
   The median is a robust measure of central location, and is less affected
   than the mean by the presence of outliers in your data.  When the number of
   data points is odd, the middle data point is returned:
155
156   .. doctest::
157
158      >>> median([1, 3, 5])
159      3
160
161   When the number of data points is even, the median is interpolated by taking
162   the average of the two middle values:
163
164   .. doctest::
165
166      >>> median([1, 3, 5, 7])
167      4.0
168
   This is suitable when your data is discrete, and you don't mind that the
   median may not be an actual data point.
171
172   If your data is ordinal (supports order operations) but not numeric (doesn't
173   support addition), you should use :func:`median_low` or :func:`median_high`
174   instead.
175
176   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
177
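The "mean of middle two" method described for :func:`median` can be sketched
as follows.  This is a minimal illustration under the assumption that the data
fits in memory and supports sorting and addition; ``median_sketch`` is a
hypothetical name, not part of the module.

```python
from statistics import median


def median_sketch(values):
    """Middle value, or the mean of the two middle values (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the middle data point is returned unchanged.
        return ordered[mid]
    # Even count: interpolate between the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_sketch([5, 1, 3]))     # 3
print(median_sketch([7, 1, 5, 3]))  # 4.0
```

The input does not need to be sorted beforehand, because the sketch sorts a
copy of it.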
178
179.. function:: median_low(data)
180
181   Return the low median of numeric data.  If *data* is empty,
182   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterator.
183
184   The low median is always a member of the data set.  When the number of data
185   points is odd, the middle value is returned.  When it is even, the smaller of
186   the two middle values is returned.
187
188   .. doctest::
189
190      >>> median_low([1, 3, 5])
191      3
192      >>> median_low([1, 3, 5, 7])
193      3
194
195   Use the low median when your data are discrete and you prefer the median to
196   be an actual data point rather than interpolated.
197
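The empty-data case mentioned for :func:`median_low` (and for the other
functions in this module) can be handled by catching :exc:`StatisticsError`.
The wrapper ``safe_median_low`` and its *default* parameter are hypothetical
and shown only as one possible pattern.

```python
from statistics import StatisticsError, median_low


def safe_median_low(values, default=None):
    """Return the low median, or *default* for empty input (hypothetical helper)."""
    try:
        return median_low(values)
    except StatisticsError:
        return default


print(safe_median_low([1, 3, 5, 7]))  # 3
print(safe_median_low([]))            # None

# StatisticsError is a ValueError subclass, so `except ValueError` also works.
print(issubclass(StatisticsError, ValueError))  # True
```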
198
199.. function:: median_high(data)
200
201   Return the high median of data.  If *data* is empty, :exc:`StatisticsError`
202   is raised.  *data* can be a sequence or iterator.
203
204   The high median is always a member of the data set.  When the number of data
205   points is odd, the middle value is returned.  When it is even, the larger of
206   the two middle values is returned.
207
208   .. doctest::
209
210      >>> median_high([1, 3, 5])
211      3
212      >>> median_high([1, 3, 5, 7])
213      5
214
215   Use the high median when your data are discrete and you prefer the median to
216   be an actual data point rather than interpolated.
217
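Because :func:`median_low` and :func:`median_high` only order the data and
never add values together, they can be applied to ordinal data that is not
numeric.  The sketch below uses made-up letter grades, assuming that plain
alphabetical ordering of the strings is the intended ranking.

```python
from statistics import median, median_high, median_low

# Illustrative ordinal data; alphabetical order is assumed to be the ranking.
grades = ["C", "A", "B", "D"]

low = median_low(grades)
high = median_high(grades)
print(low, high)  # B C

# median() tries to average the two middle values, which fails for strings.
try:
    median(grades)
except TypeError:
    averaged = False
else:
    averaged = True
print(averaged)  # False
```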
218
219.. function:: median_grouped(data, interval=1)
220
221   Return the median of grouped continuous data, calculated as the 50th
222   percentile, using interpolation.  If *data* is empty, :exc:`StatisticsError`
223   is raised.  *data* can be a sequence or iterator.
224
225   .. doctest::
226
227      >>> median_grouped([52, 52, 53, 54])
228      52.5
229
230   In the following example, the data are rounded, so that each value represents
231   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
232   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc.  With the data
233   given, the middle value falls somewhere in the class 3.5--4.5, and
234   interpolation is used to estimate it:
235
236   .. doctest::
237
238      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
239      3.7
240
241   Optional argument *interval* represents the class interval, and defaults
242   to 1.  Changing the class interval naturally will change the interpolation:
243
244   .. doctest::
245
246      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
247      3.25
248      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
249      3.5
250
251   This function does not check whether the data points are at least
252   *interval* apart.
253
254   .. impl-detail::
255
256      Under some circumstances, :func:`median_grouped` may coerce data points to
257      floats.  This behaviour is likely to change in the future.
258
259   .. seealso::
260
261      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
262        Larry B Wallnau (8th Edition).
263
264      * The `SSMEDIAN
265        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
266        function in the Gnome Gnumeric spreadsheet, including `this discussion
267        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
268
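The interpolation used for grouped data can be sketched with the usual textbook
formula ``L + interval * (n/2 - cf) / f``, where ``L`` is the lower limit of
the median class, ``cf`` is the number of values below that class and ``f`` is
the number of values in it.  The function ``median_grouped_sketch`` is a
hypothetical simplification written for illustration; it is not the library's
implementation and may differ from it in edge cases.

```python
import math
from statistics import median_grouped


def median_grouped_sketch(values, interval=1):
    """Interpolated median of grouped data (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    x = ordered[n // 2]                       # midpoint of the median class
    lower = x - interval / 2                  # lower limit L of that class
    below = sum(1 for v in ordered if v < x)  # cumulative frequency cf
    freq = ordered.count(x)                   # frequency f of the median class
    return lower + interval * (n / 2 - below) / freq


data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
print(math.isclose(median_grouped_sketch(data), median_grouped(data)))  # True
```

For the data above the median class is 3.5--4.5, four values lie below it and
five lie inside it, so the estimate is ``3.5 + 1 * (5 - 4) / 5``.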
269
270.. function:: mode(data)
271
272   Return the most common data point from discrete or nominal *data*.  The mode
273   (when it exists) is the most typical value, and is a robust measure of
274   central location.
275
276   If *data* is empty, or if there is not exactly one most common value,
277   :exc:`StatisticsError` is raised.
278
279   ``mode`` assumes discrete data, and returns a single value. This is the
280   standard treatment of the mode as commonly taught in schools:
281
282   .. doctest::
283
284      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
285      3
286
   The mode is unique in that it is the only statistic in this module which
   also applies to nominal (non-numeric) data:
289
290   .. doctest::
291
292      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
293      'red'
294
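When it is not known in advance whether the data has exactly one most common
value, the candidates can be inspected before calling :func:`mode`.  The
helper ``most_common_values`` below is hypothetical; it is a small sketch
built on :class:`collections.Counter`, and the order of its result relies on
dictionaries preserving insertion order.

```python
from collections import Counter


def most_common_values(data):
    """Return every value tied for the highest count (hypothetical helper)."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


colours = ["red", "blue", "blue", "red", "green", "red", "red"]
print(most_common_values(colours))          # ['red']
print(most_common_values([1, 1, 2, 2, 3]))  # [1, 2]
```

A result with more than one element indicates that the data has no single
mode.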
295
296.. function:: pstdev(data, mu=None)
297
298   Return the population standard deviation (the square root of the population
299   variance).  See :func:`pvariance` for arguments and other details.
300
301   .. doctest::
302
303      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
304      0.986893273527251
305
306
307.. function:: pvariance(data, mu=None)
308
309   Return the population variance of *data*, a non-empty iterable of real-valued
310   numbers.  Variance, or second moment about the mean, is a measure of the
311   variability (spread or dispersion) of data.  A large variance indicates that
312   the data is spread out; a small variance indicates it is clustered closely
313   around the mean.
314
315   If the optional second argument *mu* is given, it should be the mean of
316   *data*.  If it is missing or ``None`` (the default), the mean is
317   automatically calculated.
318
319   Use this function to calculate the variance from the entire population.  To
320   estimate the variance from a sample, the :func:`variance` function is usually
321   a better choice.
322
323   Raises :exc:`StatisticsError` if *data* is empty.
324
325   Examples:
326
327   .. doctest::
328
329      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
330      >>> pvariance(data)
331      1.25
332
333   If you have already calculated the mean of your data, you can pass it as the
334   optional second argument *mu* to avoid recalculation:
335
336   .. doctest::
337
338      >>> mu = mean(data)
339      >>> pvariance(data, mu)
340      1.25
341
342   This function does not attempt to verify that you have passed the actual mean
343   as *mu*.  Using arbitrary values for *mu* may lead to invalid or impossible
344   results.
345
346   Decimals and Fractions are supported:
347
348   .. doctest::
349
350      >>> from decimal import Decimal as D
351      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
352      Decimal('24.815')
353
354      >>> from fractions import Fraction as F
355      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
356      Fraction(13, 72)
357
358   .. note::
359
360      When called with the entire population, this gives the population variance
361      σ².  When called on a sample instead, this is the biased sample variance
362      s², also known as variance with N degrees of freedom.
363
364      If you somehow know the true population mean μ, you may use this function
365      to calculate the variance of a sample, giving the known population mean as
366      the second argument.  Provided the data points are representative
367      (e.g. independent and identically distributed), the result will be an
368      unbiased estimate of the population variance.
369
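The population variance is the mean of the squared deviations from *mu*.  The
following sketch spells that out; ``pvariance_sketch`` is a hypothetical
helper, and it makes no attempt to match the numerical care taken by the
library function.  Fractions are used so that the comparison is exact.

```python
from fractions import Fraction
from statistics import mean, pvariance


def pvariance_sketch(values, mu=None):
    """Mean of squared deviations from mu (hypothetical helper)."""
    values = list(values)
    if not values:
        raise ValueError("sketch expects at least one data point")
    if mu is None:
        # Fall back to the mean of the data, as pvariance() does.
        mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


data = [Fraction(1, 4), Fraction(5, 4), Fraction(1, 2)]
print(pvariance_sketch(data) == pvariance(data))  # True
```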
370
371.. function:: stdev(data, xbar=None)
372
373   Return the sample standard deviation (the square root of the sample
374   variance).  See :func:`variance` for arguments and other details.
375
376   .. doctest::
377
378      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
379      1.0810874155219827
380
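The sample and population statistics differ only in the divisor: the sample
forms divide the sum of squared deviations by ``n - 1`` (Bessel's correction),
while the population forms divide by ``n``.  The sketch below checks that
relationship exactly with Fractions, and checks that :func:`stdev` agrees with
the square root of :func:`variance` up to floating-point rounding.

```python
import math
from fractions import Fraction
from statistics import pvariance, stdev, variance

data = [Fraction(1, 6), Fraction(1, 2), Fraction(5, 3)]
n = len(data)

sample_var = variance(data)       # divides by n - 1
population_var = pvariance(data)  # divides by n

print(sample_var)  # 67/108
print(sample_var == population_var * Fraction(n, n - 1))  # True

# The standard deviation is the square root of the corresponding variance.
floats = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]
print(math.isclose(stdev(floats), math.sqrt(variance(floats))))  # True
```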
381
382.. function:: variance(data, xbar=None)
383
384   Return the sample variance of *data*, an iterable of at least two real-valued
385   numbers.  Variance, or second moment about the mean, is a measure of the
386   variability (spread or dispersion) of data.  A large variance indicates that
387   the data is spread out; a small variance indicates it is clustered closely
388   around the mean.
389
390   If the optional second argument *xbar* is given, it should be the mean of
391   *data*.  If it is missing or ``None`` (the default), the mean is
392   automatically calculated.
393
394   Use this function when your data is a sample from a population. To calculate
395   the variance from the entire population, see :func:`pvariance`.
396
397   Raises :exc:`StatisticsError` if *data* has fewer than two values.
398
399   Examples:
400
401   .. doctest::
402
403      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
404      >>> variance(data)
405      1.3720238095238095
406
407   If you have already calculated the mean of your data, you can pass it as the
408   optional second argument *xbar* to avoid recalculation:
409
410   .. doctest::
411
412      >>> m = mean(data)
413      >>> variance(data, m)
414      1.3720238095238095
415
416   This function does not attempt to verify that you have passed the actual mean
417   as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
418   impossible results.
419
420   Decimal and Fraction values are supported:
421
422   .. doctest::
423
424      >>> from decimal import Decimal as D
425      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
426      Decimal('31.01875')
427
428      >>> from fractions import Fraction as F
429      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
430      Fraction(67, 108)
431
432   .. note::
433
434      This is the sample variance s² with Bessel's correction, also known as
435      variance with N-1 degrees of freedom.  Provided that the data points are
436      representative (e.g. independent and identically distributed), the result
437      should be an unbiased estimate of the true population variance.
438
      If you somehow know the actual population mean μ, you should pass it to
      the :func:`pvariance` function as the *mu* parameter to get the variance
      of a sample.
442
443Exceptions
444----------
445
446A single exception is defined:
447
448.. exception:: StatisticsError
449
450   Subclass of :exc:`ValueError` for statistics-related exceptions.
451
452..
   # This modeline must appear within the last ten lines of the file.
454   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
455