Cleaning HC-SR04 distance sensor data to get an accurate measurement

You have an HC-SR04 distance sensor and a series of distance observations; assuming the target does not move, you want to establish the best, most correct measurement. An analysis of the accuracy and performance of this particular sensor is presented here. Since I have no specific data about the conditions under which the observations were acquired, or about the accuracy of the individual sensor, I simply skim the data, removing the observations that diverge from a central cluster. In a real setting, the accuracy and skew of the sensors used should be assessed first, so that the data can be normalised before applying this approach. Here is an article on the sensor and how to use it with an Arduino, and here you can find one of my articles on how to use it with the Raspberry Pi (in French).

Analysis – Data Exploration and Cleaning

I took a statistical approach, starting with an exploratory analysis; this is collected in the function analyse_data. I used two methods to find the outliers:

  • Remove the observations that lie more than 3×SD from the mean
  • Remove the observations greater than the third quartile + 1.5×IQR, or lower than the first quartile − 1.5×IQR (inter-quartile range)
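To make the two criteria concrete, here is a minimal, self-contained sketch applying both to a toy sample of made-up readings (the values are hypothetical, not from the actual sensor run). It uses only the standard library: `statistics.pstdev` mirrors the population standard deviation that scipy's `stats.zscore` uses, and the 2.5×SD cut-off matches the one chosen later in the article.

```python
import statistics as stat

# Toy sample (hypothetical values): a tight cluster plus two stragglers
data = [10.1, 10.2, 10.2, 10.3, 10.1, 10.2, 25.0, 10.2, 10.3, 2.0]

# Criterion 1: z-scores with a 2.5*SD cut-off.
# pstdev (population SD) matches what scipy's stats.zscore computes.
mean, sd = stat.fmean(data), stat.pstdev(data)
z_outliers = [x for x in data if abs(x - mean) / sd > 2.5]

# Criterion 2: 1.5*IQR fences around the quartiles
q1, _, q3 = stat.quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr]

print(z_outliers)    # [25.0] -- only the high straggler
print(iqr_outliers)  # [25.0, 2.0] -- both stragglers
```

Note how, on this small contaminated sample, the z-score criterion misses the low straggler (the outliers themselves inflate the standard deviation), while the IQR fences catch both; this foreshadows the comparison below, where the IQR method removes more points.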

One of the things that must be said is that we have no control over the way in which the data is acquired.

Assuming that the 60 measurements, taken at short intervals with a static target, are stored in the variable data, the analysis is performed in the analyse_data function:

import statistics as stat          # the function below assumes these imports
from scipy import stats
import matplotlib.pyplot as plt


def analyse_data(data):
    """ Brief exploration of the data, to see which value is the most reliable.
        Two strategies for removing outliers are compared, one using z-scores
        and the other using the 1.5*IQR rule; results for the raw ('dirty')
        data are printed alongside. """

    (lower_quartile, data_median, upper_quartile, iqr) = quartile(data)

    print(f"Size {len(data)}")
    print(f"Mean {stat.mean(data)}")
    print(f"Median {stat.median(data)}")
    print(f"Lower Quartile: {lower_quartile}")
    print(f"Upper Quartile: {upper_quartile}")
    print(f"Interquartile Range: {iqr}")

    # Frequency of each distinct reading, used later for the bar plots
    freq_dict = {x: data.count(x) for x in data}

    # Range of the raw data
    min_val = min(data)
    max_val = max(data)
    print(f"Max and Min: {max_val}, {min_val}")

    print(f"Range (max - min): {max_val - min_val}")
    print(f"Distance of max from median: {max_val - stat.median(data)}")
    print(f"Distance of min from median: {min_val - stat.median(data)}")
    print(f"Standard Deviation of data: {stat.stdev(data)}")

    # Statistics suggests that anything beyond a certain number of standard
    # deviations from the mean should be considered an outlier
    zscores = abs(stats.zscore(data)) > 2.5
    scored_data = list(zip(data, zscores))
    print(f"Z-scores of data: {scored_data}")
    outliers = [k for (k, v) in scored_data if v]
    print(f"Outliers using z-scores, 2.5*SD cut-off (3*SD is the usual choice): {outliers}")
    outliers2 = [k for k in data if (k > upper_quartile + 1.5 * iqr) or (k < lower_quartile - 1.5 * iqr)]
    print(f"Outliers using IQR: {outliers2}")

    # remove outliers IQR
    data_clean = [item for item in data if item not in outliers2]

    # Remove outliers zscores
    data_clean_z = [item for item in data if item not in outliers]

    # Same summary statistics, this time on the IQR-cleaned data
    print(f"Size {len(data_clean)}")
    print(f"Mean {stat.mean(data_clean)}")
    print(f"Median {stat.median(data_clean)}")

    min_val = min(data_clean)
    max_val = max(data_clean)
    print(f"Max and Min: {max_val}, {min_val}")

    print(f"Range (max - min): {max_val - min_val}")
    print(f"Distance of max from median: {max_val - stat.median(data_clean)}")
    print(f"Distance of min from median: {min_val - stat.median(data_clean)}")
    print(f"Standard Deviation of data: {stat.stdev(data_clean)}")
    zscores_clean = abs(stats.zscore(data_clean)) > 2.5
    scored_data_clean = list(zip(data_clean, zscores_clean))  # zip with the *cleaned* data
    print(f"Z-scores of CLEAN data: {scored_data_clean}")

    freq_dict_clean = {x:data_clean.count(x) for x in data_clean}
    freq_dict_clean_z = {x:data_clean_z.count(x) for x in data_clean_z}

    # Three figures, one with the raw data, and one for each strategy

    f, ax = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.25, .75)})
    ax[0].boxplot(data, notch=True, patch_artist=True, vert=False)
    ax[1].bar(freq_dict.keys(), freq_dict.values(), edgecolor='black')
    ax[1].set(xlabel="Raw sensor data")

    f1, ax1 = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.25, .75)})
    ax1[0].boxplot(data_clean, notch=True, patch_artist=True, vert=False)
    ax1[1].bar(freq_dict_clean.keys(), freq_dict_clean.values(), edgecolor='black')
    ax1[1].set(xlabel="Sensor data with outliers removed (IQR method)")

    f2, ax2 = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.25, .75)})
    ax2[0].boxplot(data_clean_z, notch=True, patch_artist=True, vert=False)
    ax2[1].bar(freq_dict_clean_z.keys(), freq_dict_clean_z.values(), edgecolor='black')
    ax2[1].set(xlabel="Sensor data with outliers removed (z-scores)")
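analyse_data calls a quartile helper that is not shown above. Here is a minimal sketch of what it presumably returns, based on how its result is unpacked; the original implementation in the source repository may compute the quartiles differently (quartile conventions vary).

```python
import statistics as stat

def quartile(data):
    """Return (lower quartile, median, upper quartile, IQR).

    Hypothetical reconstruction of the helper analyse_data calls;
    the original may use a different quartile method.
    """
    q1, q2, q3 = stat.quantiles(data, n=4)
    return q1, q2, q3, q3 - q1

print(quartile([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.25, 4.5, 6.75, 4.5)
```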

Plots obtained

Looking at the raw data as a bar plot paired with a box plot gives us this image:

This image shows that some readings look implausible. The box plot tells us that many points are outliers, lying above Q3 + 1.5×IQR or below Q1 − 1.5×IQR. We are, in fact, observing an inaccurate sensor, and we need to filter the data to get the best out of it.

The box-plot criterion is one of those used to remove the outliers and obtain a measurement that represents the bulk of the data. The other, used for comparison, is the z-score method, which assumes that removing the data falling outside mean ± 3×standard deviations discards no meaningful observations. In this particular case, I removed the observations that fall outside mean ± 2.5×standard deviations.

The z-score method gives us this picture:

If we compare it with the plot above, this looks more like a Gaussian bell, although we have no indication that the data should be distributed that way. It is, however, what the method aims for.

This method removes only 2 observations.

The method based on the interquartile range gives this result:

It leaves us with 37 of the 60 observations, thus removing 23. The measurement is then taken by averaging the 37 remaining observations. Obviously, in this case, no outlier appears in the box plot.


This latter method removes more data and, in doing so, probably removes more noise. This is desirable, as the sensor is not very accurate, as the literature linked above demonstrates. The method centres the calculated measurement on the most frequent value found in the data, and it is, in my opinion, how the measurement should be extracted.
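The whole pipeline, from raw readings to the final averaged measurement, can be sketched in a few lines. best_estimate is a hypothetical helper name summarising the IQR-based approach described above, using only the standard library; the values in the usage example are made up.

```python
import statistics as stat

def best_estimate(data):
    """Mean of the readings left after applying the 1.5*IQR fences."""
    q1, _, q3 = stat.quantiles(data, n=4)
    iqr = q3 - q1
    kept = [x for x in data if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    return stat.fmean(kept)

# A cluster at 10.0 cm with one spurious 50.0 cm echo (made-up values):
# the fences drop the echo and the estimate is the cluster mean.
print(best_estimate([10.0, 10.0, 10.0, 10.0, 10.0, 50.0]))  # 10.0
```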

Source Code

The source code for this little study is available on GitHub.