Is it ok to remove an outlier?

Wow, this week we can talk about anythiiiiiiiiiiiiiing we want to (stats related of course), so I have been inspired by this week’s lecture and I’m going to chat about outliers. Are they important? In case you are unsure as to what I mean by “outlier”, you can basically sum up an outlier as something that is very different from everything else. (Like this guy in the image below!!)

So in terms of statistics, an outlier is a data point that is numerically extremely different from all the other data points in the sample. It doesn’t follow any of the patterns that the other results may show, and it can really change a researcher’s results because of the impact it has on the mean! When we’re doing psychological experiments, outliers can occur for many different reasons. It could be due to the researcher: did they not have a big enough sample? Were there flaws in their design? Did they measure incorrectly? Or the outlier could have come from the participant! Was the participant not listening to the instructions of the task? Did they fake their answers? … Or was it complete chance?

One thing to consider with outliers is: is it wrong to remove them from your data? Some may say it is, because you are messing with the natural results from the study.

However, I think in some circumstances it is the correct thing to do!

Having outliers in a set of data can have a really dramatic influence on the outcome of the study. The value obtained for the correlation can be seriously affected. You may end up thinking your research has been really successful (or really disastrous) just because of one single participant changing the mean!

I’ll show you an example experiment so I can try and show you what I’m rambling on about.

My example experiment tests the levels of hyperactivity in children against how many fizzy drinks they have consumed that day (I’ve used the exact same numbers from a book* to make sure I get the sums right haha).

The first image shows a set of data points where the correlation is almost 0 (r=-0.08), meaning there isn’t really a relationship between the two, so the amount of fizzy drinks most likely doesn’t affect a child’s hyper behaviour. The outlier has been added into the second dataset, and there is a huuuuuuge change in the correlation value! Due to this ONE participant, the correlation is now r=0.85, suggesting there is a strong positive correlation, and that fizzy drinks do affect how hyper a child behaves! So the entire outcome of the experiment has changed just due to this one person. Is that fair?
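If you want to play with this yourself, here is a minimal Python sketch of the same idea. Big caveat: the numbers are made up by me for illustration (they are NOT the numbers from the book, so the r values come out different from the ones above), and `pearson_r`, `drinks` and `hyper` are just names I've picked for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squares
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)


# Invented scores for five children (hypothetical, not the book's data)
drinks = [1, 2, 3, 4, 5]   # fizzy drinks consumed that day
hyper = [3, 5, 4, 5, 3]    # hyperactivity rating

r_without = pearson_r(drinks, hyper)

# Add ONE extreme child: lots of drinks, very hyper
r_with = pearson_r(drinks + [12], hyper + [15])

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

With these invented numbers the correlation goes from 0 to roughly 0.92 when the one extra child is added, which is the same kind of jump as in the book example.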

In conclusion, I think it is acceptable to remove outliers from your dataset as they can have a serious effect on the end result, and most of the time it is an unnecessary effect. In some cases when outliers occur it is important that the researcher looks into it: it could be because the participant didn’t understand the task, in which case the researcher should probably reconsider their experiment! 🙂

*Statistics for the Behavioural Sciences, eighth edition

13 thoughts on “Is it ok to remove an outlier?”

  1. ~Problems says:

    Hi Sinae, I liked your blog. It was really well set out and was generally very cool and funky.

    However, I disagree with your reasoning. Surely any outlier is just as valid as any other data point. I mean, if it was caused by, say, a sticky button in an RT test, then fine, but what if it was just a very fast or very slow person?
    To me, the most interesting data lies in the extremes. For example, enough RT studies have been done to support the idea that the average human RT is ~210 ms. The problem we have with removing outliers is: where do we take science from there? If we are removing data all the time for our own convenience, surely we are practising bad science? What if there is a particular effect which makes someone consistently blindingly fast at 110 ms, but as they are an outlier their average is deleted?

    I guess my point is: is removing outliers not a way to remove some possibly very interesting data?

  2. World of Statistics says:

    In some situations, like the example case you provided, it is indeed acceptable to remove the outlier as it has an impact on the overall correlation and therefore alters the outcome of the experiment, just because of that one extreme value. On the other hand, removing outliers is not always the solution, because this could lead us to remove other values that we perceive to be ‘extreme’, and then we are no longer left with the natural results that we gathered in the beginning. Instead, we should find ways of dealing with the outliers as opposed to merely taking the easy way out.

  3. racheljessica92 says:

    I think we should look carefully at situations where outliers are going to be removed, at how removing them would impact the other results, and at whether the mean would show a significant difference. I think in some cases where there have been a large number of ethical issues or a participant has tried to guess the nature of the study, these outliers should be removed. All in all I really liked your argument and the examples you provided. I look forward to reading more from you next week, good job!! 🙂

  4. psucfb says:

    I always love your examples in your blogs – they always make me smile 🙂

    If somebody’s results were affected by an outlier that was caused by error, I would completely understand their reasoning for removing it (as long as they could justify themselves). Things like measurement error can have a very negative effect on results and can cause the statistical analysis to change dramatically. Yet all that may have happened was that the participant was pressing the wrong key. In instances like these, I agree with you that outliers should be removed.

    However, not every outlier’s removal can be justified. If there is no clear cause of the outlier, then it is dishonest to remove it. The outlier may just be a result of individual differences, and the participant may just, for example, have very slow reaction times.

    Loved reading your blog!

  5. psuc5d says:

    I think under certain circumstances it is OK to remove outliers. However, I’m not sure I would in your example. I know this is a simple example that you created so I could be over-critical here, but these comments are getting tough, so apologies beforehand. First of all, the fact that the outlier changes your correlation value (however greatly) is no reason in itself for it to be removed. Outliers occur from natural deviations within populations. According to the three-sigma rule (http://en.wikipedia.org/wiki/Three_sigma_rule), in normally distributed data about 1 in 22 observations will differ from the mean by two standard deviations or more, and about 1 in 370 will differ by three standard deviations or more. Outliers are to be expected in any large data set, so they should not be removed automatically. Also, removing outliers isn’t the only option: robust statistics like the median, or non-parametric statistics that do not assume normal distributions, can handle outliers much better (http://chromatographyonline.findanalytichem.com/lcgc/data/articlestandard//lcgceurope/502001/4509/article.pdf). I think it should be mentioned that if outliers are removed, the removal should follow established procedures, be noted, and come with a rationale (http://ori.hhs.gov/education/products/plagiarism/30.shtml).

  6. cfredlevy says:

    To further your argument: taking your example of the hyper kid, s/he is a participant taken from a broad population. They come from the tiny percentage of the population made up of very hyper kids, and they skew the results. I agree that it is entirely worth removing this data point. It could be argued that removing it takes away from a true reflection of the data; however, with the results skewed, the conclusions are very different. They may be used to support some kind of legislation or another study but are in fact very wrong. To minimise the harm to the statistics it’s more beneficial to remove the data and conclude based on a more logical pattern. Osbourne 2002 (from link). Also consider sampling error: if the hyper kid digests sugars differently and this affects behaviour, they are not part of the average kid population and behave differently in response to the same stimulus. This idea is contrasted by Orr, Sackett & DuBois (1991), who believe the population as a whole, including the extremes, should be included. Yet a high extreme ignores the low extremes. Therefore on this basis it would also be beneficial to remove the error for the sake of the results and stick to an average. Hopefully my ramblings show that I agree with your conclusion that the effect of an outlier kept in, to resemble a more accurate population, is largely unnecessary 🙂

  7. vanilla85 says:

    I think that whether to remove the outliers or not depends on the data. Your data is quite extreme and in that case I would remove the outlier. However, whether you remove the outlier or not, the most important thing is to be aware of it. We can also keep the outliers and use a transformation (Hamilton, 1992) or robust methods (Barnett & Lewis, 1994).

  8. exactestimates says:

    I believe that outliers should remain within data. While you can say that it could be due to some computer or keyboard error, how can you prove this? Extreme scores and people do exist, so why should they be eliminated? This removes the idea of random sampling if you are just going to abandon those who do not cluster around your mean. Too dishonest for a slight shift in your correlation, etc. Loved reading your blog though, very well set out and engaging to read 🙂
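
A quick aside on two of the points raised in the comments, as a rough Python sketch rather than anything definitive. It assumes normally distributed scores for the three-sigma figures from comment 5, and the reaction times are invented purely for illustration (`two_sided_tail` and `reaction_times` are hypothetical names, not from any study).

```python
from math import erfc, sqrt
from statistics import mean, median


def two_sided_tail(k):
    """P(|Z| >= k) for a standard normal variable Z."""
    return erfc(k / sqrt(2))


# How rare is a score k or more standard deviations from the mean?
one_in_2sd = 1 / two_sided_tail(2)
one_in_3sd = 1 / two_sided_tail(3)
print(f"Beyond 2 SD: about 1 in {one_in_2sd:.0f}")
print(f"Beyond 3 SD: about 1 in {one_in_3sd:.0f}")

# Invented reaction times in ms (hypothetical data)
reaction_times = [200, 205, 210, 215, 220]
with_fast_person = reaction_times + [110]  # one blindingly fast person

# The mean gets dragged by the extreme score, the median barely moves
print("mean:  ", mean(reaction_times), "->", round(mean(with_fast_person), 1))
print("median:", median(reaction_times), "->", median(with_fast_person))
```

On these made-up times the mean drops from 210 to about 193 when the fast person is added, while the median only shifts from 210 to 207.5, which is the sense in which the median is "robust": the extreme score stays in the data without taking over the summary.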
