Thursday, August 30, 2012

Sample Mean vs. Population Mean


I did not mention this in my last post, which outlined some basic statistical functions related to central tendency, but when studying stats, it is important to understand that you will be dealing with two kinds of means: a sample mean and a population mean.  Conceptually, they both do the same kind of thing, though their meanings are slightly different.  It is a very good idea to know when to use a sample mean vs. a population mean, and I will try to go over these concepts and their uses in this post.

Mathematically, I already explained how you determine a mean value.  It is what you have likely always known as an average value, and you can very easily find it using the mean formula.  Without actually writing out the equation, you already know that the mean is the sum of all your values, divided by the total number of your values.  This is straightforward and nothing new.  Here is where I am going to make a distinction that you need to be aware of.

In statistics, you deal with populations.  Populations are complete groups of people, of things, of measurements.  As an example, you likely know of the population of the planet Earth.  That refers to all of the people on the planet.  Or, you could have a population of bald eagles in a nesting ground, or a population of Ferrari sports cars manufactured in 2011.  Populations refer to the whole group of whatever you are talking about.  However, in many cases, you don't have access to data about the entire population.  You only have access to a subset of that population... a sample of the population.  So, a sample can be considered to be a small part of the population that is meant to be representative of that population as a whole.

A sample could also be looked at as only an estimate of the larger population.  Samples are frequently sufficient to work with, since collecting data for an entire population could involve a very long and complicated set of data.  The closer your sample size is to your population size, the more accurate this estimate tends to become.  This is why people tend to question things that are only based upon a few observations... error is generally higher when sample size is smaller.  More observations usually means less error.

So then, with those definitions in mind, you should hopefully be able to understand what is meant by population mean and sample mean.  Literally, a population mean is the average of the entire population, whereas the sample mean is the average of a sample (which represents a larger population).  Of course, since this is mathematics, we have different ways to write the notation for these two statistics concepts.

When we are talking about a population mean, where we have data about all of the subjects or measurements of a given population, we represent that mean by the Greek letter mu (μ), which looks like a fancy lower-case u:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

This is calculated by summing all of the values in the entire population, and then dividing by the total number of values in that population, which is denoted by a capital N for a population.

On the other hand, when we are dealing with a sample mean (the mean of a subset that is representative of a whole population), we denote it by the symbol x-bar:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

As before, we find this by summing all of the values in your set, and then dividing by the total number of values in your set, in this case, the number being denoted by a lower-case n for a sample.

As I mentioned, calculating these values means essentially doing the same thing.  However, in stats, it is wise to pay attention to the group that you are analyzing.  Making a mistake at this point could lead to much larger errors in any further statistical analysis.  Keep in mind that a sample mean is an approximation of a population mean, and that approximation becomes more accurate as the size of your sample (n values) approaches the size of your whole population (N values).
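
To see the relationship between x-bar and mu in action, here is a minimal Python sketch.  The population (the whole numbers 1 to 1000), the sample sizes, and the random seed are all made up for illustration; none of them come from a real dataset.

```python
import random
from statistics import mean

# Hypothetical population: every value is known, so mu can be computed exactly.
population = list(range(1, 1001))   # N = 1000
mu = mean(population)               # population mean, 500.5

rng = random.Random(42)             # fixed seed so the run is repeatable


def sample_mean(n):
    """Draw n values without replacement and return their mean (x-bar)."""
    return mean(rng.sample(population, n))


# Compare x-bar with mu for growing sample sizes, ending with n = N.
for n in (5, 50, 500, len(population)):
    x_bar = sample_mean(n)
    print(f"n = {n:4d}  x-bar = {x_bar:8.2f}  error = {abs(x_bar - mu):7.2f}")
```

When n equals N, the sample is the whole population, so the error is exactly zero.  For smaller n, the error usually shrinks as n grows, although any single random sample can be lucky or unlucky.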


Sunday, August 26, 2012

Central Tendency - Statistics


This first post of my new Statistics series is a refresher on the most common statistics you have ever done (and perhaps didn't realize were actually statistics): measures of central tendency.  I've previously done posts that briefly described the functions that measure center (e.g., mean, median, mode), but here I am going to compile them all together in one place, and provide perhaps a better explanation of these statistics concepts.

Mean

The first statistic that I include here is likely the most common statistic you have ever worked with.  You probably know it by the name "average," but in the field of statistics, you will find it referred to as "mean," "arithmetic mean," or "arithmetic average."  It probably doesn't need much of an explanation, as most students learn how to calculate averages very early on in school!  It represents a calculated measure of the center of a distribution of values, simply obtained by adding up all of the values and then dividing that sum by the number of values you added together.  (It is important to be aware that there are different types of means in statistics: sample and population means.  I describe these in more detail in a separate post.  For the sake of demonstration, consider the math in this post to describe samples instead of populations.)

There are a couple of important points to make about the notation involved in calculating means.  The first is regarding the actual mathematical symbol for mean (because you don't want to always have to write down the word "mean" in your solutions!).  The symbol for mean is written as an x (or whatever variable you are using) with a small horizontal bar over it, like this:

$$\bar{x}$$

You say this symbol as "x bar."  You can use and will see this notation wherever an arithmetic mean value is being used in statistical analysis and calculations.  It is extraordinarily common, yet it can appear confusing at first to a student who is new to statistics, because it looks like nothing they have ever dealt with before.

In addition to this, there is a second notation that you will see that may need an explanation first.  This notation is used to describe the arithmetic mean formula.  I explained the concept and process of calculating a mean above, but here is one way in which you could write this down in your work:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Mathematically, this simply says that the mean is equal to the sum of all your values (x1 all the way up to xn, whatever your last value happens to be) divided by the total number of values that you are adding up.  This average formula could also be represented in another way, like this:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This formula for mean is saying the same thing as the previous one.  The 1/n part is the same in both equations (in the first, dividing by n is the same as multiplying by 1/n).  The fancy capital E-looking thing is the Greek capital letter sigma (which is not equivalent to E, but rather to S), and in math, it means to "sum up everything in the following equation."  And the xi part represents all the values of x.  So the sigma would start with x1, then add x2, then add x3, and so on, for all the values of x.  (I will do a separate post on sigma notation to perhaps explain this a bit better, with more examples.)
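
If it helps to see the two forms side by side, here is a small Python sketch that computes the same mean both ways.  The four values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
values = [3, 5, 8, 12]     # hypothetical sample, not taken from the post
n = len(values)

# First form: add everything up, then divide by n.
mean_by_division = (values[0] + values[1] + values[2] + values[3]) / n

# Second form: sigma notation, written out as a loop over every x_i.
total = 0
for x_i in values:
    total += x_i
mean_by_sigma = (1 / n) * total

print(mean_by_division, mean_by_sigma)   # both forms give 7.0
```

The loop is exactly what the sigma asks for: start with the first value, add the second, add the third, and so on until you run out of values.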

An important concept to understand about the mean is just what exactly it represents, and how it can be influenced by its dataset.  For a collection of values that are similar, the mean will provide a fairly reasonable measure of the center of this data.  However, if you consider the inclusion of any extreme values, you can see how this would cause the arithmetic average to be biased in their direction.  The more extreme the outliers are, the greater their effect on the mean.  Try for yourself to see what I mean.  Consider the dataset of values 1, 2, 3, 4, 5, and then consider the dataset of 1, 2, 3, 4, 20.  You can see that the mean is pulled in the direction of the outlier.  This is simply a result of how the mean is calculated, and is one of its flaws as a statistical tool.  Similarly, if you have a distribution of values in your dataset that is "skewed" (that is, if you graph the values, you will see that the graph isn't symmetrical, and it has a tail on one end), the long tail will tend to bias the measurement of the mean in its direction.  Because of these characteristics, the mean is considered not to be a resistant measure (in that it can't resist being pulled by extreme data).  However, despite these points, the mean is an incredibly useful tool for statistics, if for no other reason than that it is so simple to use, and provides a very quick evaluation of how the dataset is centered.
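
The two datasets from the paragraph above can be checked in a couple of lines of Python.  This sketch uses the standard library's statistics.mean; the variable names are just labels picked for this example.

```python
from statistics import mean

similar = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 20]    # same data, but the last value is an outlier

print(mean(similar))        # 3
print(mean(with_outlier))   # 6
```

Changing a single value from 5 to 20 doubles the mean, even though four of the five values are identical in both datasets.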

Median

The median is the second of the three measures of center that I want to talk about here.  Conceptually, I think that it is probably even simpler to understand than the mean.  It's also usually easier to find.  Whereas the arithmetic mean requires you to perform the calculation I described above (or really keen people know how to use their calculator's mean calculation function!), to determine the median, you often don't have to do any mathematical operations at all!  Quite basically, the median represents the midpoint of your dataset, the point where half of the data is larger and the other half is smaller.  Most of the time, you don't have to calculate it, you just have to identify it.

To do this, all you need to do is take your dataset, and arrange all of the values in increasing size.  The value in the center is your median, often represented by the capital letter M.  When you have an odd number of values in your dataset, you will be able to find the median very easily.  You can identify it through a quick calculation to find which is the center value, which is simply the (n+1)/2 value in your order, where n is the total number of values in your dataset.  Note that this median formula only tells you where in the order your median is located, not the value of the median.  If you have an even number of values, then your median is the mean of the two center values (using the same calculation to determine the location, you'll get a location such as 4.5, indicating that the median is the mean of the values at locations 4 and 5).  So, in this case, your median does not necessarily have to be one of your data points, but is instead the average of the middle two.
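
Here is a rough Python sketch of that procedure, using the (n+1)/2 location rule directly.  The function and the two test datasets are made up for illustration; Python's standard library also has statistics.median, which you would normally use instead.

```python
def median(values):
    """Return the median using the (n + 1) / 2 location rule."""
    ordered = sorted(values)        # arrange the values in increasing size
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    location = (n + 1) / 2          # 1-based position in the ordered list
    if location.is_integer():       # odd n: a single center value
        return ordered[int(location) - 1]
    lower = ordered[int(location) - 1]   # e.g. location 4.5 -> position 4
    upper = ordered[int(location)]       # e.g. location 4.5 -> position 5
    return (lower + upper) / 2


print(median([7, 1, 5, 3, 9]))             # 5
print(median([8, 1, 6, 2, 7, 3, 5, 4]))    # 4.5
```

The subtraction of 1 is only there because Python counts list positions from 0, while the location rule counts from 1.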

Determining the median can be a very tedious process if you have a very large dataset.  In these cases, spreadsheet software will come in extremely handy!  You can automatically sort your values, and then identify the one(s) you require.  For small datasets, on the other hand, it takes very little effort to sort through and rearrange the values, making the median another very simple and useful statistical tool to evaluate central tendency.

There are a few differences to consider when comparing the mean and the median.  Since the mean uses every data value in its calculation, it is influenced more by extreme or skewed data.  The median depends only on the middle of the ordered data, so for skewed data or data with outliers, it will often represent a better estimate of the center of the distribution.  In this sense, the median can be considered to be a more resistant measure than the mean.  If you have a symmetric distribution of data, the mean and the median will be very similar.  However, when you have a skewed distribution, the mean will be located more in the long tail of the distribution, further away from the median.  For example, if you have a set of prices in a dataset, and then you double the highest price, the median will be the same in both cases, though the doubled price will push the mean further towards that extreme end of the distribution.  The mean and the median provide differing assessments of the central tendency of a distribution, but both functions are extremely useful in statistical analysis.
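
The price example is easy to try out in Python.  The five prices below are invented for this sketch; any set of prices with a single highest value would show the same effect.

```python
from statistics import mean, median

prices = [10, 12, 15, 18, 20]     # hypothetical prices
highest = max(prices)

# Same dataset, except that the highest price is doubled.
doubled = [p * 2 if p == highest else p for p in prices]

print(median(prices), median(doubled))   # 15 15
print(mean(prices), mean(doubled))       # 15 19
```

The median never notices the change, because the middle value of the ordered list is still 15.  The mean moves from 15 to 19, towards the doubled price.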

Mode

The mode is the third statistical function used to evaluate the center of a dataset.  It is just as easy to determine as the median.  Once again, there is no mathematical operation needed to determine the mode.  That is because the mode is quite simply the value that is most common in your dataset.  If you have arranged all of your data points in increasing order to assess the median, as described above, then it is quite easy to find the mode.  For example, in the dataset 1, 2, 3, 4, 4, 4, 5, the mode is 4 because it is repeated the most often.  See?  Easy!

There are two points to keep in mind: for one, you can have more than one mode (if you have a dataset of 1, 2, 2, 3, 4, 4, then you have two modes, 2 and 4); and second, if no term is repeated at all, then there is no mode to the dataset.  
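
Both of those points can be captured in a short Python sketch.  The modes function below is a hypothetical helper written for this post, following the convention that a dataset with no repeated values has no mode.  (The standard library's statistics.multimode handles that last case differently: when nothing repeats, it returns every value.)

```python
from collections import Counter


def modes(values):
    """Return every most-common value, or [] if nothing is repeated."""
    counts = Counter(values)        # how many times each value appears
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:                    # no value repeats, so there is no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 3, 4, 4, 4, 5]))   # [4]
print(modes([1, 2, 2, 3, 4, 4]))      # [2, 4]
print(modes([1, 2, 3, 4, 5]))         # []
```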

It is also good to know that of these three statistical functions I've covered already, the mode is the only stat that can be applied to non-numerical datasets.  For example, the mode could be used to say what colour shirt is the most commonly worn shirt in an office.  The dataset for this could read like: red, white, blue, green, blue, yellow, blue, blue, red, black, and so the mode is blue.  You can't arrange these in increasing size to find a median.  You cannot apply the mean formula.  These concepts don't make any sense when you consider this data!  However, the mode makes perfect sense and is very easy to determine.
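
The shirt example works just as well in code, since counting doesn't care whether the values are numbers.  This sketch uses statistics.mode from Python's standard library on the list of colours above.

```python
from statistics import mode

shirts = ["red", "white", "blue", "green", "blue",
          "yellow", "blue", "blue", "red", "black"]

print(mode(shirts))   # blue
```

Sorting this list would only put the colour names in alphabetical order, which has nothing to do with size, so a median would not mean anything here.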



Range

While we're considering these methods of measuring central tendency, it would also be useful to mention range.  Technically, range doesn't provide any measure of center.  However, it is very useful in evaluating the spread of the data, or how far the values extend around the distribution's center.  Range is another simple concept, but it may not be exactly what you would think it should be.  If you have a dataset, or a graph of a distribution, you would be incorrect to say that the range is the low value to the high value.  (This would be similar to the definition of range in graphing, where range is all the y-values on the curve.)  However, range is slightly different in statistics.  Range is the DIFFERENCE between the high and low values of your dataset.  So, for the dataset 5, 6, 7, 10, 13, the range is 13-5, which is 8 (not 5 to 13, as you may be thinking).
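
As a quick check, the range of that dataset takes one line of Python.  The function name data_range is just a label chosen for this sketch (Python already uses the name range for something unrelated).

```python
def data_range(values):
    """Return the statistical range: highest value minus lowest value."""
    return max(values) - min(values)


print(data_range([5, 6, 7, 10, 13]))   # 8
```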

So, that is mean, median, mode, and range.  These are some of the most basic and common statistical operations that you will encounter.  Quite likely, you have already used some of these before, and may not have realized that you were actually doing statistical analysis of a dataset.  They are all different, and so they provide different assessments of how your data behaves.  They are all useful in their own way, and each shows strength in analyzing different types of data.  Therefore, it is extremely important that you learn what each stat means, and how to evaluate them.

I hope that this post has been informative and helpful for you!

