Data preparation (data prep) is the thought given to how stored data in any system will be utilized by the target audience. Data availability presents some interesting challenges.
This blog identifies the six questions you need to ask yourself before you start any data prep project:
- Volume of data requested/needed
- Timeframe for data collection
- When is the data available
- Data locations/data accessibility
- Budget for data storage
- Budget for software, data storage, training
Volume of data requested/needed
In my last blog on this topic I discussed the need for a comprehensive data model and included many issues in and around data prep.
Now that we know what type of data is needed, the question becomes how much of this data is actually needed?
Can the data be used from its existing location? This saves on items below on storage and other hardware needs for the data. If the data can be used in its existing location, what is the volume of data available versus what is actually needed? This presents serious performance issues as to how long the business analyst will wait for data retrieval.
The volume of data leads us to …
Timeframe for data collection?
What is the timeframe for the data to both be accessed and downloaded/prepared for use?
This step assumes the data will be downloaded/transformed/cleansed/ and possibly condensed. These steps all take time to do. When is the data at its source available to begin this process? How long will it take to transfer this data to the target computing environment? How long with the transformation/cleansing/condensing take? When is the data to be available for the business analyst?
Also consider if the data is to be real-time data, day old data, or near-real time data. Will these steps need to be performed during the course of the business day? On the fly?
Which leads us to …
When is the data available?
This is an important step that is definitely related to the last topic. Are there time constraints with the required data? Is some data being updated real-time during the day? Is some data not available until some big data load is complete? How much of the needed data is subject to various data loads and the various time frames.
IF the some of the data is being maintained in a real-time mode and some of the data is not, what will be the cutoff for the data? The cutoff should be at the time of the maintenance routine of the data that is not being updated real-time but in a nightly or off-hour process. This brings up another related question though, relating back to the above time-frame of the data…the query running against the real-time updating system will have some sort of date and time being used in the selection of the data. If this real-time data is being subsetted and stored separately, how will it be maintained? Is it necessary for this subsetted data to be maintained, and if so, what is the time frame for this process?
This step is related to the last step but also is needed information for …
Data locations/data accessibility
The more locations the needed data is physically in, the slower the whole process of data collection/data prep will take BEFORE the business analyst can use the data for their analysis needs. What is the earliest this data might need to be available for the business analyst? How much of the download/transformation/cleaning of data from multiple sources can be automated? How much of the needed data is subject to other time constraints such as time zone. The more this can be automated, the easier the process of transformation/cleansing/consolidating becomes. This ability to have the data local or consolidated in a common location has some financials associated with it such as …
Budget for data storage
As I have been discussing, having access to the data where it exists has its advantages but also having the data stored either on a departmental server or in a consolidated or subset form, takes some disk space somewhere.
Is there budget for the required server/software/disk space and let’s not forget the support staff required to support such an environment.
True Story: I was doing training at a site where the business analysts were all pulling very similar data in a rolling 3-day or 5-day window … back from the prior day. The data was all static, ie: it wasn’t subject to updates or changes. The whole data store had tens of millions of rows in it. The queries took 10 to 20 minutes each to run because each query was accessing tens of millions of rows. I think the 3-day window of data had something like 12,000 rows and the 5-day window of data would not exceed 50,000 rows. This is a fraction of the total rows in the object, these queries would take seconds to run each, not 20 minutes, but this was a new concept to them (brought up by me in class) so as of the class, there was no available disk space for such objects and possibly no budget to purchase disk space just for this need. I did leave them with a useful line of thought though.
Budget for software, data storage, training
This is where the rubber meets the road. Funds for the business software itself. How many people will need a license for the software? A related topic would be if the data were being stored locally during analysis, to make sure there is disk space on the work stations for the work to be done. And of course, do all the business analysts know the software being used? Training. Does the current workstation support the new software?
Business analysis software pricing varies greatly. A common tool should be selected. This makes training easier, the sharing of work between analysts easy, and quite possibly volume discounts on the initial purchase of the software.
Disk storage is important if the data is going to be stored either remote or in a subset/prepared/cleansed/consolidated form. If the data is going to be stored remote, away from the IT department, support staff for the disk drives, related computing systems, their upgrades/backup/maintenance all have to be considered.
Will the staff using the software require training? How much turnover in the group of business analysts will there be (ie: future training for replacement staff), and this takes into account how sophisticated the software is. Is there online training available for the software? Is there instructor-led training available for the software?
True Story: I trained power users and business analysts on an Oracle product months before they were to use the software. As time came close to using the software, I was asked to assemble and deliver a 1-day refresher on the same software!
Being overwhelmed with data preparation can be daunting, but a data preparation plan can help. Many of the questions listed in this blog post are often overlooked in the overall business process of producing good analytics for business decisions. The larger the organization, the more important these considerations need to be.
Thanks for reading my blog! — Dan.
Event May 19-20:The Data Empowerment Summit
The Quest Data Empowerment Summit includes three unique tracks —Data Operations, Data Protection and Data Governance— with sessions about the latest trends, best practices and technologies to align these critical areas and close the gaps between your front and back offices. Don't miss out!
IDC Datasheet: Quest Software Acquires erwin Inc.
Got 6 minutes? Overwhelmed with data prep? Watch the video of this blog.
Help your colleagues
If you think your colleagues would benefit from this blog, share it now on social media with the buttons located at the top of this blog post. Thanks!