Descriptive Statistics

Exercise 1: Plotting data and calculating descriptive statistics

Write a function distribution_analysis(x) that expects a numpy array x as its input argument and returns a dictionary stats whose keys name the following statistics:

  1. mean

  2. standard deviation (biased and unbiased)

  3. median

  4. interquartile range

Moreover, the function should generate a figure with 3 subplots, showing:

  1. the raw data

  2. a histogram of the data

  3. the boxplot of the data
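As a starting point for the figure, here is a minimal sketch of a three-panel layout with matplotlib. The toy array demo, the panel titles, and the figure size are illustrative assumptions, not part of the exercise; your function should use its own input x instead.

```python
import matplotlib.pyplot as plt
import numpy as np

demo = np.array([1.0, 2.0, 2.5, 3.0, 7.0])  # hypothetical toy data, not the exercise data

# one row with three panels side by side
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].plot(demo, ".")          # raw data: value against sample index
axes[0].set_title("raw data")

axes[1].hist(demo, bins="auto")  # histogram; bin edges chosen from the data
axes[1].set_title("histogram")

axes[2].boxplot(demo)            # boxplot: median, quartiles, whiskers, outliers
axes[2].set_title("boxplot")

fig.tight_layout()
```

Nothing in this layout depends on the size or range of the data, which is what the exercise asks for.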

Make sure that this function is written in a general way (without fixed numbers or any other assumptions about the data to be analyzed), because you will use it repeatedly in this exercise. Test the function using samples drawn from a Normal distribution with mean 2 and standard deviation 1, stored in the variable x below.

Check if you understood the concepts of generating random numbers and analyzing the distribution of values in a sample:

Hint: numpy does not provide a dedicated function for every one of these statistics. You can find the missing ones in the scipy.stats module.
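The following sketch shows the calls this hint most likely points to, applied to a small toy array. The array demo and the variable names are assumptions for illustration. In np.std the ddof argument selects the normalization, and scipy.stats.iqr returns the difference between the 75th and 25th percentiles.

```python
import numpy as np
from scipy import stats

# hypothetical toy data with mean 5, chosen so the numbers are easy to check by hand
demo = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std_biased = np.std(demo)            # ddof=0 (default): divide by N
std_unbiased = np.std(demo, ddof=1)  # ddof=1: divide by N - 1
iqr = stats.iqr(demo)                # 75th percentile minus 25th percentile

print(std_biased, std_unbiased, iqr)
```

For this toy array the sum of squared deviations is 32, so the biased value is sqrt(32/8) = 2 and the unbiased value is sqrt(32/7), which is slightly larger.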

import numpy as np
x = np.random.normal(2, 1, 1000)  # generate some data for testing your function

# your solution here

Exercise 2: Compare the descriptive statistics of Normal, Uniform, and Poisson distributions

Now run the same distribution analysis for 10000 samples from a Uniform and a Poisson distribution.

Generate 10000 samples from:

  1. A normal distribution with mean 2.0 and standard deviation 1.0 as a reference. Use the function np.random.normal. Check the numpy documentation on how to use it.

  2. A uniform distribution with mean 2.0 and span (difference between smallest and largest value) 1.0. Use the function np.random.random_sample.

  3. A Poisson distribution with mean 2.0. Use the function np.random.poisson. Check the numpy documentation for how to use it. Hint: The first argument, lam(bda) corresponds to the mean of the distribution.
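One way to draw the three samples is sketched below; the variable names are assumptions. Note that np.random.random_sample draws from the interval [0, 1), which has span 1.0 but mean 0.5, so the sketch shifts the values by 1.5 to move the mean to 2.0.

```python
import numpy as np

n = 10000  # number of samples per distribution

normal = np.random.normal(2.0, 1.0, n)      # loc (mean) = 2.0, scale (std) = 1.0
uniform = np.random.random_sample(n) + 1.5  # [0, 1) shifted to [1.5, 2.5)
poisson = np.random.poisson(2.0, n)         # lam = 2.0, the mean of the distribution

print(normal.mean(), uniform.mean(), poisson.mean())
```

The Poisson samples are non-negative integers, which is worth keeping in mind when you look at their histogram and median.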

Use your function to visualize the three data sets and compute the descriptive statistics. Compare the descriptive statistics of your Normal distribution to the statistics from the Uniform and Poisson distributions:

  • Look at all output values and all subplots: Do they meet your expectation? Do mean and std make sense? Can you explain the values for the median and the IQR?

# your solution here

Exercise 3 (bonus): Check the convergence of different statistics for different sample sizes

Explore how the estimates of mean and standard deviation scatter for multiple samples of different size. To do that you will repeatedly generate datasets of normally distributed samples, compute their statistics and then plot the distribution of these statistics.

Make a function that

  • takes the sample size N as an input,

  • generates a normally distributed dataset with mean 10 and standard deviation 2 with the specified sample size,

  • computes:

    • the empirical mean of the generated sample of random numbers

    • the empirical unbiased standard deviation (normalized by N-1)

    • the empirical biased standard deviation (normalized by the sample size N)
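For reference, the two normalizations can be written out as follows. The symbols are assumptions introduced here: $x_i$ are the samples, $\bar{x}$ is the empirical mean, and $s_N$ and $s_{N-1}$ denote the biased and unbiased standard deviation in the sense used in this exercise.

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
s_{N} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}, \qquad
s_{N-1} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
```

The two differ only by the factor $\sqrt{N/(N-1)}$, so $s_{N-1}$ is always the larger one and the gap shrinks as $N$ grows.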

To explore how the estimates of these statistics depend on sample size, you will estimate them from samples of 3 different sizes (10, 100, 1000). First, call your function 100 times with a sample size of 10 and save the three output values (mean, unbiased std, biased std). Now do the same, but with sample sizes 100 and 1000.

Plot the histograms of the 3 empirical measures for the three sample sizes 10, 100, 1000. How do the distributions change with sample size? What is the difference between the biased and the unbiased standard deviation?

Bonus: You can make your code more efficient by generating a matrix of 100x1000 samples at once (100 repetitions of 1000 samples each) and computing the statistics along axis 1 of this matrix.
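A minimal sketch of this vectorized idea, shown for the sample size 1000 only. The variable names are assumptions, and each row of the matrix is treated as one repetition of the experiment.

```python
import numpy as np

n_repeats, n_samples = 100, 1000

# one row per repetition, one column per sample; mean 10, standard deviation 2
data = np.random.normal(10, 2, size=(n_repeats, n_samples))

# reducing over axis 1 collapses the samples and leaves one value per repetition
means = data.mean(axis=1)
stds_unbiased = data.std(axis=1, ddof=1)  # normalized by N - 1
stds_biased = data.std(axis=1, ddof=0)    # normalized by N

print(means.shape, stds_unbiased.shape, stds_biased.shape)
```

Each of the three resulting arrays has 100 entries and can be passed directly to a histogram. The same pattern should carry over to the sample sizes 10 and 100 by changing n_samples.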

# your solution here