Data preparation is the thought given to how stored data in any system will be utilized by the target audience.

Feature photo, connected data depicting data preparation and connecting information for the business analyst.

This article will include:

  • What is data preparation?
  • Where does the data come from?
  • How will the data be used?
  • Data cleansing/transformation and data validation.
  • Data review for the target audience.
  • Data maintenance and data security.

 

What is data preparation?

Wikipedia says: “Data preparation is the act of manipulating (or pre-processing) raw data (which may come from disparate data sources) into a form that can readily and accurately be analyzed, e.g. for business purposes.”

MediaBanner-ToadWorld-600x100-IM-JY-63709.psd-1-3-1-2

In talking with a fellow colleague, Senior DBA / Data Architect Troy Frericks says, “I’m not sure of the industry meaning to this. It does strike a nerve though. Today I frequently see data as a second-class citizen. Data is plopped down for an application to use without regard for others that will be using the data. Frequently developers don’t even want to deal with it even to the level of ‘plopping it down.’ They delegate that to an Object Relational Mapper / ORM, a framework that takes care of the task. Reporting departments, analytics departments, and those that need to ad hoc query the data are struggling to first understand the metadata (what the data is), and then to be able to query it efficiently.”

I agree with Troy. And, I'll add that while data is a source of truth, if it's not prepared correctly it could lead to misinterpretation, mis-use, and bad decisions. But, if data is prepared correctly, it can provide quick business insight and guidance to jumpstart future decisions.

 

Where does the data come from?

Data is available in just about any form.

In yesteryear, data was on paper and was difficult to manipulate to solve complex business issues. Along came tabulating machines that made this manual process a bit more efficient but still didn’t lend itself well for any sort of data analysis. Computers came with software such as spreadsheets that could take data and present it in forms of use to businesses.

In today’s world, data is in a variety of formats. Freeform data is also known as ‘Big Data’ where there is no real format. Data can be a series of spreadsheets. Data can be hosted on computers in just flat files (organized but not easily accessed or transformed into different needs) or data can be hosted in database management systems, the common types today are relational databases. This kind of data storage allows for the data to be accessed in a variety of ways but the data still needs to be organized in a manner that is efficient for the business analyst to use to solve their business needs.

Data can be collected from a variety of sources as well. Today’s ‘Internet everywhere’ devices also report back and this data is collected. Your mobile phone and the apps within collect data on where you are, what you have been viewing, social media posts, and possibly even email and text content. This information/data is reported back and saved.

The data that I typically think of is from business-based applications such as for colleges and universities (student enrollment/student information management/etc), to payroll systems, to more direct business applications such as SalesForce, SAP, Oracle Financials, Microsoft Excel/Access, and others. Then, there are the internally built systems where the data can be in a variety of forms. And, all of this data can be stored in a variety of technologies and various database management systems.

All of these pieces of the data pie can be across various computing hardware (computers themselves, on-premise), software (various database management systems, accounting systems, etc), and network locations (including cloud computing such as Microsoft Azure).

The role of the data analyst is to transform the data from its current form to a form that is both effective and efficient for the business analyst/end user of the data to gain the desired results.

 

How will the data be used?

The role of the data analyst is to transform the data from its current form to a form that is both effective and efficient for the business analyst/end user of the data to gain the desired results.

Nothing beats a good data model. This model will illustrate how the data will be stored to be the most flexible and ease of access by the target end users. This model will show the data stores and their fields (name, address, amounts, counts, etc) and the field attributes (character data, numbers, dates, raw data, etc).

In an interview with Jaxenter in their article, Challenges with Managing the Exploding Data Firehose,  John Pocknell, senior market strategist at Quest Software says, “Once you understand the framework of the data you’re examining, you can better understand the data itself.” I couldn’t agree more.

When I’m developing a plan for business analysts, I always start with the end result. What information do you need and what format is it to end up in. IF they can share ‘for what reason…’, I find this information helpful as well. I work backward from their data needs to the source of the data.

This data model not only illustrates the data’s intended organization but will help with the performance of retrieving the data, particularly if there are millions of rows of data. This data model will also visualize how the data relates to other data items (normalization of data) so that redundant information is only stored once (such as city and state spelled out), etc.

The goal of data analyst is to present the data to the intended audience in a form that is easy for them to report on the data for their business needs.

 

Data cleansing/transformation and data validation

This is an important step to ensure the data that is going to be used is both clean, formatted properly, and valid.

By this I mean make sure the email addresses are not only formatted correctly but are actual valid email addresses.

  • That debits and credits add up to the same number.
  • That subsetted data also adds up to the correct number.
  • That null number fields are set to zero instead of left null (requires additional syntax for the business analyst).
  • That date-sensitive data contains only data for that particular date range, and so on.

Some of this kind of cleansing can occur on the data load or during the data extract. For example, fixing null values is easy during this step. Validation too is rather easy by stabling either SQL or processes that compare the sub-set of data to its source.

Number data has a field-level attribute of number or integer. Make sure null values are adjusted to zero. This greatly aids the validation of the data and aids the business analyst, as now they don’t have to worry about dealing with null fields. For example, you add a number to zero, you get the number. You add a number to a null value, you get a null value.

That date data has a field-level attribute of date. This, alone, will ensure that the date data is a valid date and is far easier for most databases to handle. Using common data types for comparison also validates that there are no characters stored in a number field, dates are valid and in the correct format, etc.

Clean up the character fields. If some of the data has mixed capitalization, work with the team and see which format for the data is desired. This greatly aids the ad-hoc searchability of the data.

 

Data review for the target audience

During my Toad Data Point classes, I bring up the fact that if a group of analysts are all accessing a similar volume of data with each SQL, such as a 3-day window of data perhaps, that this data be accessed once and stored in a format that everyone in the department can use. I call this sub-setting the data. A good data model will also identify the users of the data and how the data will be used. Too many times I see a group of business analysts each running their individual queries against say a 10 million row data store when they are after a subset of say 50,000 rows for their particular need. This same 50,000 rows are also needed by others in the same group for their analysis needs.

Subsetting the data is a good technique if the data isn’t necessarily real time…ie: you don’t need the data as of the last 5 minutes perhaps. In most cases, some or all of the data in a multi-million dollar database can be subsetted into smaller data sets (called data marts) for ease of access. Database software can maintain any changes to this data too.

Enriching the data comes in the form of doing some of the required math perhaps and storing the answer along with the data. Enriching the data also maybe takes the 50,000 rows and groups the data together in summary form, producing just 5,000 of specifically formatted rows. Again, this kind of processing can happen in off-hours and the data is stored and presented in summary form for the business analyst.

 

Data maintenance and security

Data maintenance, aging of data, and who has access to parts or all of the data.

How the subsetted data gets refreshed is important. This is important if the business analyst needs real-time data. For the department that substetted the 10 million row to a window of data that they need, maybe just refreshing the subset is all that is needed. Maybe multiple subsets of data need to be stored for a period of time as well.

There are always hardware considerations with the storage of data. How often will the data be accessed? What is the expected response time? How often will other/historical data be accessed?

Security is huge. Data sensitivity is so important. For example, the data on a secure environment may be quite safe but what about the department server, outside of a safe environment? What about the backup scenarios? Where is the same secured data being stored AFTER it leaves its original location?

The user of the data needs access to the data that they need. IF the data is subsetted, the same level of security needs to be managed as the data is now being stored in a subset form, or condensed form on a computer that might not have the same security as the source computer of the data.

The data analyst needs to take into consideration not only who will need access to sensitive data but where this sensitive data might be stored…and who then has access to it.

Data preparation is more than just running a SQL statement and saving the result as a data store. Data needs to be validated and prepared for the end user in a form that best suits their needs. To do data preparation right, you need a data model, data to be cleansed and formatted for the intended users, and secure from others it was not intended for.

I hope you enjoyed this blog. You might be interested in my next blog in this series: 6 questions to ask before you start data prep: Data volume is #1

Thanks! — Dan

 

Related information:

Webinar: How to Simplify Reporting and More with Quest® Toad® Data Point v5.3

Webinar series: Data Preparation Made Easy: Toad Data Point Webcast Series

Ebook: Toad Data Point Workbook for Beginners

Ebook: The Four Roadblocks of Data Preparation

Blog: 6 questions to ask before you start data prep: Data volume is #1

Blog: How to prepare data for machine learning

Blog: Easy data preparation: Toad® Data Point v5+ new features

Quest blog: Data Preparation and Analysis: Using Data Efficiently in the Era of Dark Data

Toad World related blogs: Access more Toad Data Point blogs.

 

Watch the 7-minute video blog: What is data preparation

 

Help your colleagues

If you think your colleagues would benefit from this blog, share it now on social media with the buttons located at the top of this blog post. Thanks!

About the Author

Dan Hotka

Dan Hotka is an Author/Instructor/Expert/Oracle ACE Director Alumni who uses TOAD regularly in his web-based and on-line courses. He is well published with many books still available on Amazon or www.DanHotka.com. Dan is available for user groups and company educational events. He regularly speaks at user groups around the world. Check his website for course listings, references, and discount codes for his online courses based in Safari.

Start the discussion at forums.toadworld.com