15 May 2020

In this digital age comes a deluge of data in various structures and forms. While in logistics and supply chain we generally deal with some form of structured data, there can be the time you need to wrangle some unstructured or semi-structured data. But first, what’s the difference between them?

Structured vs Semi-structured vs Unstructured Data

Structured data refers to data that can be easily imported into a table or database. An example of structured data is an Excel sheet with perfectly defined columns and rows. This type of data can be easily stored in a database and therefore is easier to analyse as well. Each column would represent a specific measure or variable and each row would consist of an observation or set of values. Think of a table specifying the Date, Order ID, SKU and Quantity as an example.

Unstructured data would be the opposite of structured. Most of the data in the world today can be categorised as unstructured. The text on a web site, a PDF document, images, audio recordings. These are all data sets which can’t easily be stored in a standard table or database. However, there are storage media and analysis methods which can still extract data and information from such data. An example we use every day would be Google. We can type in a search phrase, and Google can present a list of suitable web pages that closely match our request.

Semi-structured data is a mix of both. This would refer to data sets where there is some organisation to the data, allowing it to be stored in a spreadsheet or database, but still contains some form of unstructured data. For example, an email would be an example of semi-structured data. The From, To and Subject fields are structured and specific which can be stored in a table, however, the actual body of text of the email is unstructured. While it can be stored in a table as a value, it would be difficult to analyse the content without any further data-smithing.

Analysing distribution centre and order line data extracted from systems which would store such information in SQL-based tables and databases may make you think that this data is structured as well, but as we always know, there are exceptions and no data set is perfect. Consider a SKU Master as an example. We would generally consider it to be a form of structured data. However, we have come across many data sets where the SKU Master itself is missing data that can be found in the SKU Description. Pack sizes, dimensions, weight are some examples we’ve seen where the respective information is missing in the SKU Master but present in the SKU Description.

While traditional intralogistics analysis (or warehouse analytics) has focussed on working with structured data, i.e. tables and spreadsheets of data, we’re seeing the growing need to process and extract valuable information from text data. So how can you do it?

 

Extract Data from Text Fields in Excel

In Excel, the best tools you can use will be the FIND and MID functions. Briefly, the FIND function returns the location of a text string within the cell, while the MID function allows you to extract a given number of characters from any point within the string. Considering a simple example, assume that we need to extract the diameter from this SKU description: Widget 12×2 50mm. We can use the FIND function to determine where “mm” occurs in the text and then the MID function to extract the two characters before. The two images below show the formulae used in the cells and results of the formulae.

If 12×2 represented the units per carton, which we wanted as well, we’d used the following formulae.

While I have included each FIND and MID formulae separately for simplicity, remember that they can be combined in one cell if required. The one caveat which you may have discovered is that these formulae assume that the pattern of the data remains the same throughout. In the event of varying dimensions, you will require more variants of the above formulae to tackle each variation. The formulae I have included above only scratch the surface of what is possible, with much more complex extraction possible within Excel.

 

Another useful Excel tool is to convert Text to Columns using the function shown below,  which is located in the Data panel of Excel.

This function can be used to split a text column either by spaces or a defined character such as a comma, semi-colon or tab. As mentioned above, if it is getting too complicated to set up the FIND and MID functions, you can split the text field into multiple columns, each representing a word and tackle it from there.

 

The Fuzzy Way

So how do we tackle it at Fuzzy LogX? Depending on the complexity and structure of the data, we may use Excel, or we may use my favourite – Python.

For those of you who are interested, Python includes a multitude of string functions that can help you to transform text strings and extract information such as splitting it up into single words. You can also use Regular Expressions which lets you define a pattern and returns the relevant characters that match the pattern. For example, you could use regular expressions to check if an email address is valid, i.e. does it have an “@” and a “.com” or is a mobile number valid (is it 10 digits and only consists of numbers?) Best of all, regular expressions are a common programming concept and can be used in many programming languages, even Excel VBA!

 

There are more advanced methods to extract information from text data which we’ll cover in another Fuzzy Friday. If you have a data set which needs some massaging to extract information from your text strings, feel free to contact us!

 

Yohan Fernando is the Manager – Systems & Data Science at Fuzzy LogX who are the leading warehouse, logistics, and process improvement consultants in Australia. Fuzzy LogX provide project management & consulting services, leading-edge data analytics, process improvements, concept design & validation, solution/software tendering, implementation and solution validation services to businesses with Storage & Distribution operations looking to improve their distribution centres.