Numeric outliers cleaning

Modified on Mon, 07 Nov 2022 at 12:10 PM

An outlier is a value that differs significantly from other observations. It may be due to normal variability, like an individual who is 2.10 meters tall, or to an error, like an individual who is 17.8 meters tall.

How are outliers detected?

General case

When no rule has been manually edited, the data cleaner looks for outliers by flagging values that are far lower than Q1 (the first quartile) or far higher than Q3 (the third quartile). This is called Tukey's fences method.

Any value greater than Q3 + 3.5 IQR or lower than Q1 - 3.5 IQR will be detected as an outlier, where IQR is the interquartile range (Q3 - Q1).

If the observed minimum or maximum value of the variable falls inside these thresholds, the displayed rule uses the observed value instead of the threshold.

Take height as an example: Q1 - 3.5 IQR = 1.60m and Q3 + 3.5 IQR = 2.30m, and the values observed in the dataset are between 1.55m and 2.10m. The displayed rule will be: "Height is expected to be between 1.60 and 2.10". The lower bound stays at the threshold because 1.55m falls outside it, while the upper bound becomes the observed maximum because 2.10m is inside the threshold.
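The general case can be sketched in a few lines of Python. This is only an illustration, not the data cleaner's actual code: the function names are hypothetical, and the quartile method (`statistics.quantiles` with `method="inclusive"`) is an assumption, since the article does not say how Q1 and Q3 are computed.

```python
from statistics import quantiles

K = 3.5  # multiplier stated in this article


def tukey_fences(values, k=K):
    """Return (lower, upper) thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=K):
    """Values strictly outside the thresholds are flagged as outliers."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


def displayed_rule(values, k=K):
    """Bounds shown in the rule: an observed extreme replaces a
    threshold whenever it lies inside that threshold."""
    lower, upper = tukey_fences(values, k)
    return max(lower, min(values)), min(upper, max(values))


heights = [1.60, 1.65, 1.70, 1.75, 1.80, 17.8]
print(find_outliers(heights))
print(displayed_rule(heights))
```

With these made-up heights, only the 17.8 value falls outside the thresholds, and the lower bound of the displayed rule is the observed minimum.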

Manual bounds

If numeric bounds have been defined manually for a numeric variable, the rule will be based on these bounds.

For example, if the weight is set to be between 50 and 200, the rule will not be defined with Tukey's fences method. It will simply be "Weight is expected to be between 50 and 200".

If a bound is defined only for the minimum value, Tukey's fences method will be applied for the maximum value.

If a bound is defined only for the maximum value, Tukey's fences method will be applied for the minimum value.
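The way manual bounds take precedence over the computed thresholds can be sketched as follows. The function and parameter names are hypothetical, the quartile method is an assumption, and the sketch leaves out the adjustment to observed minimum and maximum values, because the article does not say whether it applies when a manual bound is set.

```python
from statistics import quantiles


def effective_bounds(values, manual_min=None, manual_max=None, k=3.5):
    """Use a manual bound where one is defined; otherwise fall back to
    the Tukey threshold on that side."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = manual_min if manual_min is not None else q1 - k * iqr
    upper = manual_max if manual_max is not None else q3 + k * iqr
    return lower, upper


weights = [60, 65, 70, 75, 80, 85, 90, 95, 100]
print(effective_bounds(weights, manual_min=50, manual_max=200))  # both manual
print(effective_bounds(weights, manual_min=50))  # upper side computed
print(effective_bounds(weights, manual_max=200))  # lower side computed
```

Checking `is not None` rather than truthiness keeps a manual bound of 0 valid.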

How to clean outliers?

Get context

Adding some context will help you assess and correct outliers. For example, if you find that a patient is 2.10m tall, knowing the sex and the weight of the patient will help you. A 50kg woman is unlikely to be 2.10m tall, but a 120kg man may well be!

It is possible to display additional information in each patient card:

  1. Click on "Display also"
  2. Type the names of the variables you want to display
  3. The additional variables you have selected are now displayed on each patient card

Display distribution charts

Click on the chart icon to display charts for the variable you are cleaning. This will help you see if the suspected value is an outlier or not.

See the complete patient form

If the distribution chart and additional variables are not sufficient to decide whether or not the value is an outlier, you can display the full patient form by clicking on the initials of the patient in the patient card.

Make the decision

For each data point, you have 3 choices:

  1. Change the value: type a new value and click on "Update value"
  2. Keep the current value: if you have decided that this value is not an outlier, click on "No, keep value" to confirm that the data point is not an outlier
  3. Delete the value: click on "Make NC" to delete the data point

The data point will be displayed until you choose one of these 3 options. Once you have chosen, the data point will no longer appear in the list of outliers and your data cleaner report will be updated.
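The decision workflow above can be modelled as a small review queue. This is a hypothetical sketch for illustration only: the class and function names are invented, and representing a deleted value as `None` is an assumption about what "Make NC" stores.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    UPDATE = "Update value"
    KEEP = "No, keep value"
    DELETE = "Make NC"


@dataclass
class FlaggedPoint:
    patient: str
    variable: str
    value: Optional[float]
    decision: Optional[Decision] = None  # None means still to review


def resolve(point, decision, new_value=None):
    """Apply one of the three choices to a flagged data point."""
    if decision is Decision.UPDATE:
        if new_value is None:
            raise ValueError("changing the value requires a new value")
        point.value = new_value
    elif decision is Decision.DELETE:
        point.value = None  # assumption: a deleted value is stored as missing
    point.decision = decision


def pending(points):
    """Points stay in the outlier list until a decision is recorded."""
    return [p for p in points if p.decision is None]


queue = [
    FlaggedPoint("A.B.", "height", 17.8),
    FlaggedPoint("C.D.", "height", 2.10),
    FlaggedPoint("E.F.", "height", 0.02),
]
resolve(queue[0], Decision.UPDATE, new_value=1.78)
resolve(queue[1], Decision.KEEP)
print([p.patient for p in pending(queue)])
```

After the two decisions, only the third patient is still listed for review.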
