5 August 2022

Data cleaning is said to be the activity that consumes the majority of Data Scientists’ and Analysts’ time – and that matches our experience as well. This is also why clean (or cleaned) data, when it exists, shaves a considerable amount of time off generating insights and results from the models that use it. Since clean data is always the ideal, I’ve been asked this question many times – what does clean data look like?

Sometimes it’s relatively easy to spot “dirty” data, or data that requires cleaning. For example, you want to analyse sales across the financial year, but you’re missing two months’ worth of data, or on some days sales were negative. In both cases, a cursory glance at the data should allow you to identify these issues. Having said that, where the data set is too large to open in Excel, it will be difficult to take that quick glance. If it is a csv or text file, command-line tools or a short script can count the number of rows in the file. A simple check like this can give you an early warning: if your data has an expected size of 5 million rows, for example, but only contains 1 million, that’s a clear indicator you are potentially missing some rows.
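
If the file is too big for Excel, a few lines of Python can give you those first-glance numbers. The sketch below is illustrative only and assumes a hypothetical sales.csv with order_date and sales_value columns:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# 1. Row count – far fewer rows than expected suggests missing data
print(f"Rows loaded: {len(df):,}")

# 2. Rows per month – gaps here point to missing periods
print(df["order_date"].dt.to_period("M").value_counts().sort_index())

# 3. Negative sales – values that shouldn't normally occur
print(f"Negative sales lines: {(df['sales_value'] < 0).sum():,}")
```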

For certain data sets, it might be as simple as identifying whether the information you require is in the data set at all. For example, you want a category-based overview of sales – so is that information in the data? If not, it can be added by joining in information from another data set, which sounds easy – but sometimes it isn’t. Depending on where the data is sourced from, this could be a simple VLOOKUP in Excel or could require multiple tables to be joined together to get the final result.
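
As a rough sketch of the non-Excel route, the pandas equivalent of that VLOOKUP is a merge against a product master. The file and column names (product_master.csv, sku, category) are assumptions for illustration:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")              # transactional data
products = pd.read_csv("product_master.csv")  # assumed lookup table with sku and category

# A left join keeps every sales line, even if the SKU has no category yet
enriched = sales.merge(products[["sku", "category"]], on="sku", how="left")

# SKUs that didn't match are worth investigating before any category analysis
print(f"Sales lines without a category: {enriched['category'].isna().sum():,}")
```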

Assuming these “first glances” I mention above all appear OK, it doesn’t mean the data set gets the tick of approval. You can have a data set complete with information in every single cell, but is it correct? This is where the majority of time and effort in data cleaning tends to be spent – sifting through the data to identify the values that need to be filtered or modified. These corrections may require verification from the relevant data stakeholders or additional work to find out what’s wrong.

Data sets containing manually populated information almost always contain multiple issues. The most common ones we find are “finger-slips”, where extra 0s get added; 0000 or 9999 values entered when the operator doesn’t know what to enter or is too lazy to enter the correct value; and multiple weird and wonderful spellings of the same word. I recall coming across one data set that had PICKUP, PICK UP, PICK-UP and P/UP all in one data field, all meaning the same thing. Taken as-is, the data set would certainly have produced some inaccurate results. While this error was easy to identify, sometimes a deeper knowledge of particular parts of the data is required to properly understand it and glean insight from it.
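
A sketch of how those manual-entry issues might be flagged and normalised, using the PICKUP example above; the file name, column names and placeholder codes are assumptions:

```python
import pandas as pd

# Hypothetical order data with a delivery_type and quantity column
df = pd.read_csv("orders.csv")

# Flag the placeholder values operators key in when they don't know the real figure
placeholders = {0, 9999}
df["qty_suspect"] = df["quantity"].isin(placeholders)

# Collapse the many spellings of the same category into one canonical value
pickup_variants = {"PICKUP", "PICK UP", "PICK-UP", "P/UP"}
df["delivery_type_clean"] = (
    df["delivery_type"]
    .str.strip()
    .str.upper()
    .replace({v: "PICKUP" for v in pickup_variants})
)

print(df["delivery_type_clean"].value_counts())
```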

Volumetric data is probably one of the biggest sources of pain and work. Most often the cause is incomplete volumetric data for SKUs, which then requires additional analysis to impute the missing values based on SKU characteristics and master data. But even with complete data, there are issues to consider. If a SKU has dimensions, are they correct? Do they reflect the actual dimensions of the item? Are they the dimensions of the packed product (which might be smaller) or of the product the customer receives (i.e. assembled or folded out)?
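
As an illustrative sketch only, missing SKU dimensions might be imputed from the median of similar items, here by product category. The column names (sku, category, length_cm, width_cm, height_cm) are assumptions, and in practice any imputation should be sanity-checked against master data:

```python
import pandas as pd

# Hypothetical SKU master with dimension columns
skus = pd.read_csv("sku_master.csv")
dims = ["length_cm", "width_cm", "height_cm"]

# How big is the gap before we start imputing?
print(skus[dims].isna().sum())

# Fill missing dimensions with the median of the SKU's category,
# a crude stand-in for a proper review against master data
for col in dims:
    skus[col] = skus[col].fillna(skus.groupby("category")[col].transform("median"))
```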

Duplicate lines, when present, can also cause issues in data sets. When considering order lines for a distribution centre, data containing duplicates has the potential to overstate the design level, potentially resulting in a solution that is too large for actual throughputs. Unfortunately, duplicate lines are also hard to detect without detailed analysis, and in some cases there may be valid business reasons behind them.
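
A minimal sketch of a duplicate check on order lines, assuming hypothetical column names (order_no, sku, quantity):

```python
import pandas as pd

lines = pd.read_csv("order_lines.csv")  # assumed columns: order_no, sku, quantity

# Lines that are identical on these keys may be genuine repeats or data errors
key_cols = ["order_no", "sku", "quantity"]
dupes = lines[lines.duplicated(subset=key_cols, keep=False)]
print(f"Potential duplicate lines: {len(dupes):,}")

# Review them grouped together before dropping anything,
# as some duplicates have valid business reasons
print(dupes.sort_values(key_cols).head(20))
```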

Context is also key to identifying the right approach and values within data. For volumetric data and the example of packed vs final product – what are you analysing or modelling? If it is the warehouse and transport to store, potentially the packed product dimensions are required. However, if the product is then folded out/assembled in the store, the final product dimensions may be more appropriate when determining the ideal store or shelf size.

Clean data is never easy to find. While you may come across good data sets, most will require some tender loving care before you can get valuable insights from them. If you want to get some insights from your warehouse or distribution centre data, contact us at Fuzzy LogX.