2 August 2019

Most (if not all) of our projects start with a data analysis component, which helps us understand current processes and throughputs, as well as the requirements for the future DC and any opportunities for optimisation. The perfect data set is still a unicorn; however, I must say I have come across a few projects with near-perfect data sets. With the help of SAM and all our other analytics tools, I have always erred on the side of caution and asked for as much data as we can get our hands on. But is there such a thing as too much data?

 

The Vs

With the advent of Big Data came the definitions of what constituted Big Data. Several organisations came up with their own definitions, ranging from the 3Vs to the 4Vs to the 5Vs, and up to the 10Vs of Big Data! One of the most popular definitions I have seen, from IBM, is the 4Vs of Big Data, which I have listed below with a brief overview of what each V covers.

  • Volume
    • This covers the amount of data. Is it a 10MB CSV file or a 2TB database?
  • Velocity
    • This talks about the speed at which data is created and needs to be analysed. For example, if you're analysing per-second pings from a 100-strong truck fleet, you're dealing with higher-velocity data than, say, a warehouse that only processes about 2,000 order lines per day.
  • Veracity
    • Is the data valid and correct? What is the uncertainty contained within the data?
  • Variety
    • What data types are included? You could have multiple types: structured data, which is organised in tabular form; semi-structured data, which is not tabular but still adheres to some format, such as a CSV or XML file; and unstructured data, which has no structure to it, such as a document or image.

Warehouse analytics, especially in the context of DC design, doesn't really meet any of the above criteria to be considered Big Data; however, it's useful to borrow from these definitions to construct a few pointers to keep in mind.

 

Volume – More is better, sometimes

There are some instances where more data is better. When considering time-series data such as order line, inbound and Stock-On-Hand (SOH) data, more data, specifically data over a longer time period, would be extremely useful. Looking at data over a longer timeframe provides insights into the seasonality present in an order profile as well as the underlying trends affecting throughput. We usually recommend at least a year's worth of data, which will also capture any peaks common to your operations. It's important that peaks are accounted for within the design.
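As a rough illustration of that first-pass check, the Python sketch below summarises a year of order line data into daily counts and flags the peak day and peak-to-average ratio. The file and column names are assumptions for illustration; your WMS/ERP extract will differ.

```python
import pandas as pd

# A minimal sketch, assuming an order line extract with one row per order
# line and an "order_date" column (names are illustrative only).
lines = pd.read_csv("order_lines.csv", parse_dates=["order_date"])

# Daily order line counts - the raw material for spotting trend,
# seasonality and peaks across a full year of data.
daily = lines.groupby(lines["order_date"].dt.date).size()

print("Average order lines per day:", round(daily.mean()))
print("Peak day:", daily.idxmax(), "with", daily.max(), "order lines")
print("Peak-to-average ratio:", round(daily.max() / daily.mean(), 2))
```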

While aggregate data provides a summary of the operations, keep in mind that a lot of detail gets hidden within the aggregation. Ditch the aggregate data and analyse the raw transactional data, especially for order lines. Using transactional data lets you build detailed order profiles, such as how many orders are single-line, single-unit orders, which are a perfect opportunity for batch picking.
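A quick sketch of that kind of order profiling from raw transactional data might look like this (column names such as "order_id" and "qty" are assumptions, not a prescribed schema):

```python
import pandas as pd

# Raw order lines with assumed "order_id" and "qty" columns.
lines = pd.read_csv("order_lines.csv")

# Build a per-order profile, then measure how many orders are
# single-line, single-unit - the classic batch-pick candidates.
profile = lines.groupby("order_id").agg(
    lines_per_order=("qty", "size"),  # number of order lines in the order
    units_per_order=("qty", "sum"),   # total units across those lines
)

single_line_single_unit = (
    (profile["lines_per_order"] == 1) & (profile["units_per_order"] == 1)
).mean()

print(f"{single_line_single_unit:.1%} of orders are single-line, single-unit")
```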

It is also useful to match the timeframes for order line, inbound and SOH data. This allows you to compare required picking performance with receiving performance and pair both with SOH to create a complete summary of warehouse capacity and sizing requirements.
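A sketch of that alignment step, assuming illustrative file and date column names, can be as simple as trimming all three data sets to their common date window:

```python
import pandas as pd

# Assumed file and column names - purely illustrative.
lines = pd.read_csv("order_lines.csv", parse_dates=["order_date"])
inbound = pd.read_csv("inbound.csv", parse_dates=["receipt_date"])
soh = pd.read_csv("stock_on_hand.csv", parse_dates=["snapshot_date"])

# Keep only the window covered by all three data sets so picking,
# receiving and stock holding are compared like-for-like.
start = max(lines["order_date"].min(), inbound["receipt_date"].min(),
            soh["snapshot_date"].min())
end = min(lines["order_date"].max(), inbound["receipt_date"].max(),
          soh["snapshot_date"].max())

lines = lines[lines["order_date"].between(start, end)]
inbound = inbound[inbound["receipt_date"].between(start, end)]
soh = soh[soh["snapshot_date"].between(start, end)]
```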

 

Velocity

Although velocity in the 4Vs sense refers to the speed at which data is created and needs to be analysed, in the context of DC design and warehouse optimisation this is not necessarily a relevant issue (although it would be great if data were more readily available for some of our projects, as the data wardens can sometimes become the bottleneck for our modelling).

SKU velocity profiling is a more relevant context for velocity in warehouse-related data modelling. Here, we often find that too much data can hide the underlying story of a SKU. For example, if you looked at the velocity of a SKU across 3 years of data, the magnitude of that SKU's movement profile could be hidden or flattened out. When modelling velocity, it is important to do so in the context of a design period in order to understand how the velocity profile should be used for optimisation or storage/pick face design.
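To make that concrete, the sketch below compares each SKU's velocity over a design period against its full history. The 13-week window, file name and column names are assumptions for illustration; pick a design period that suits your own operation.

```python
import pandas as pd

# Illustrative columns: "sku", "qty" and "order_date". The 13-week design
# period is an assumption - choose one that reflects your design horizon.
lines = pd.read_csv("order_lines.csv", parse_dates=["order_date"])
cutoff = lines["order_date"].max() - pd.Timedelta(weeks=13)
design_period = lines[lines["order_date"] >= cutoff]

velocity_full = lines.groupby("sku")["qty"].sum().rename("units_full_history")
velocity_design = design_period.groupby("sku")["qty"].sum().rename("units_design_period")

# SKUs whose recent movement looks very different from their long-run
# average are exactly the ones a 3-year view would flatten out.
comparison = pd.concat([velocity_full, velocity_design], axis=1).fillna(0)
print(comparison.sort_values("units_design_period", ascending=False).head(10))
```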

Over the coming weeks, we will be exploring the intricacies of DC storage design and pick face slotting. In September (AKA Slot-tember), we will be hosting some information sessions on Slotting Optimisation tools with our network. If you are interested in staying in the loop on these events and learning more about slotting, watch this space. For now, let’s move on for the sake of variety (pun intended).

 

Variety – More is better

While variety in the context of the 4Vs talks about the type of data (structured vs semi-structured), here variety refers to the types of data sets. Traditionally, warehouse analytics would only include order line, receiving, SOH and SKU data, but additional data sets can deliver more insights about your operations. This is especially beneficial when there are blanks within the data. For example, in the absence of receiving data, inbound container data combined with an average cartons-per-container assumption would create an approximate inbound data set. If order line or pick information is not recorded, matching order data to freight data would provide an approximate data set that indicates when orders were picked.
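A sketch of the first example, with a placeholder cartons-per-container rate and illustrative file/column names, might look like this:

```python
import pandas as pd

# Placeholder assumption - replace with a rate measured from your operation.
AVG_CARTONS_PER_CONTAINER = 1200

# Illustrative file and column names for an inbound container extract.
containers = pd.read_csv("inbound_containers.csv", parse_dates=["arrival_date"])
containers["est_cartons"] = AVG_CARTONS_PER_CONTAINER

# Approximate daily inbound carton profile in the absence of receiving data.
approx_inbound = (containers.groupby(containers["arrival_date"].dt.date)
                  ["est_cartons"].sum())
print(approx_inbound.describe())
```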

 

Veracity

As I mentioned previously in my article Garbage In, Garbage Out, the validity of the data is of paramount importance to the entire analytics process. Even the slightest uncertainty in the input data will be reflected in the output. When considering adding more data (either volume or variety) to your analysis, always consider whether the data being added is correct; otherwise you may create more work and confusion. Consider the context of the new data set and its relationship to other data sets. Take care to ensure that the entity referred to as "Order ID" in data file 1 is the same as the one referred to by the field "Order Number" in data file 2. A common problem we come across is deciphering Unit of Measure (UOM) information, with inconsistent UOM definitions across multiple data sets.
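Two quick checks of this kind are sketched below: do the keys line up between files, and is each SKU's unit of measure defined consistently? The file and field names ("Order ID", "Order Number", "sku", "uom") are assumptions for illustration.

```python
import pandas as pd

# Illustrative file and field names.
orders = pd.read_csv("file1_orders.csv")
freight = pd.read_csv("file2_freight.csv")
sku_master = pd.read_csv("sku_master.csv")

# 1. Do the keys actually line up? "Order Number" in the freight file
#    should refer to the same entity as "Order ID" in the order file.
unmatched = set(freight["Order Number"]) - set(orders["Order ID"])
print(f"{len(unmatched)} freight records have no matching order")

# 2. Is each SKU's unit of measure defined consistently across data sets?
uom_counts = (pd.concat([orders[["sku", "uom"]], sku_master[["sku", "uom"]]])
              .drop_duplicates()
              .groupby("sku")["uom"]
              .nunique())
print("SKUs with conflicting UOM definitions:", int((uom_counts > 1).sum()))
```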

An important aspect of veracity is also the source of the data. Ensure that your data source is always the same. The veracity of data for analytics can be greatly improved by creating a single "Source of Truth": a central repository or database used to store all data and information in a suitable format. Not only would such an initiative help warehouse analytics projects, it would also help the wider business, as everyone in the organisation knows where to go when they need information or data.

 

Tools

Keep in mind that the analytics toolset being used will also determine your data limits. MS Excel still has a limit of about 1 million rows per worksheet, while MS Access has a limit of 2GB per database file. At Fuzzy LogX we generally use a mix of SQL, Python and Power BI for all our data storage, analysis and visualisation requirements.

 

While not exhaustive, this is a brief list of pointers to keep in mind the next time you embark on a warehouse analytics project. Feel free to contact us if you're keen to see what SAM and our analytics tools can do with your data! And don't forget to slot some time for Slot-tember if you're interested in learning more about slotting!

 

About the author: Yohan Fernando is the Manager – Systems & Data Science at Fuzzy LogX, the leading warehouse, logistics and process improvement consultants in Australia. Fuzzy LogX provide project management & consulting, leading-edge data analytics, process improvement, concept design & validation, solution/software tendering, implementation and solution validation services to businesses with storage & distribution operations looking to improve their distribution centres.