Toad World Blog

Getting faster insights from Big Data with emerging dataprep tools

Mar 29, 2017 11:10:00 AM by Fernando Garcia

There is a term in the Big Data world that I have been hearing more and more frequently: "dataprep". It is short for "data preparation", and it refers to all the activities involved in collecting, combining, and organizing disordered, inconsistent, and non-standardized data coming from many different sources. Virtually every Big Data project needs to validate, clean, and transform the data it receives. Once these data preparation tasks are complete, we have structured data that we can analyze with business intelligence tools.
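To make those activities concrete, here is a minimal Python sketch of validating, cleaning, and standardizing records from two sources. Everything in it is an assumption made for illustration: the sample rows, the field names, the `COUNTRY_CODES` mapping, and the `prepare` helper do not come from any particular product.

```python
from datetime import datetime

# Hypothetical raw records from two sources with inconsistent formats.
CRM_ROWS = [
    {"name": "  Ana Lopez ", "signup": "2017-03-01", "country": "ar"},
    {"name": "Luis Perez", "signup": "not a date", "country": "AR"},
]
WEB_ROWS = [
    {"name": "ana lopez", "signup": "01/03/2017", "country": "Argentina"},
    {"name": "Marta Diaz", "signup": "15/02/2017", "country": "UY"},
]

# Assumed lookup table and date formats, for illustration only.
COUNTRY_CODES = {"ar": "AR", "argentina": "AR", "uy": "UY"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def prepare(rows):
    """Validate, clean, and standardize rows; drop invalid and duplicate ones."""
    seen = set()
    prepared = []
    for row in rows:
        # Clean: collapse stray whitespace and normalize capitalization.
        name = " ".join(row["name"].split()).title()
        # Standardize: one date format and one country code scheme.
        signup = parse_date(row["signup"])
        country = COUNTRY_CODES.get(row["country"].strip().lower())
        # Validate: skip rows whose date or country cannot be interpreted.
        if signup is None or country is None:
            continue
        # Combine: the same person may arrive from both sources.
        key = (name, signup)
        if key in seen:
            continue
        seen.add(key)
        prepared.append({"name": name, "signup": signup, "country": country})
    return prepared


clean_rows = prepare(CRM_ROWS + WEB_ROWS)
for row in clean_rows:
    print(row)
```

Real projects deal with far messier data and far larger volumes, but the steps have the same shape: clean, standardize, validate, and combine.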

Those of us who have worked on Data Warehousing projects will probably ask: isn't dataprep a synonym for ETL? Let's review some Data Warehousing concepts:

ETL stands for "Extract, Transform, Load". In a Data Warehouse we "extract" data from different systems, which can be homogeneous or heterogeneous. Then the data is "transformed" into a structure and format suitable for analysis. Finally, the transformed data is "loaded" into a target data store so that it can be analyzed by business intelligence tools.
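The three stages can be sketched in a few lines of Python using only the standard library. This is an illustrative toy, not a description of how any specific ETL product works: the CSV content, the conversion rates in `RATES_TO_USD`, and the `orders` table are all assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would come from a source system.
SOURCE_CSV = """order_id,amount,currency
1,10.50,USD
2,8.00,EUR
3,20.00,USD
"""

# Assumed fixed conversion rates, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types and convert every amount to USD."""
    return [
        (
            int(r["order_id"]),
            round(float(r["amount"]) * RATES_TO_USD[r["currency"]], 2),
        )
        for r in rows
    ]


def load(conn, records):
    """Load: store the transformed records in the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", records)
    conn.commit()


# An in-memory SQLite database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))

# Once loaded, the data can be queried by analysis tools.
total = conn.execute("SELECT ROUND(SUM(amount_usd), 2) FROM orders").fetchone()[0]
print(total)
```

Keeping each stage in its own function mirrors the way the pipeline is usually drawn: each step can be tested or replaced without touching the others.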

[Figure: Traditional Data Warehouse pipeline]

So, what's the difference between dataprep and ETL? Perhaps the need for a new term comes from the new requirements imposed by the famous "V’s" of the Big Data era:

  • Volume. The volume of data we handle in Big Data projects is far greater than what we were accustomed to in a traditional data warehouse. In order to process the huge data flow we receive, we need to make use of highly scalable parallel processing techniques and tools.
  • Variety. A traditional data warehouse does receive data from heterogeneous systems, but in the vast majority of cases that data resides in transactional relational databases. In Big Data projects, data comes from diverse sources that range from smartphones to machine-generated sensor data.
  • Velocity. Both the speed at which data is generated and the speed at which analytical decisions are made have created the demand for a transition from traditional batch processing to near real-time processing.

As I said at the beginning of this article, dataprep is a term that I have been hearing more and more frequently. The reason? Many Big Data projects seem to suffer from a common problem: most of the project time goes into data preparation, which leaves less time for analysis. There is a clear need to shorten that preparation phase, and as a consequence many companies are starting to offer products and tools aimed at doing exactly that. According to some specialists, the so-called "self-service data preparation" tools will be the hottest software tools of the coming years. According to Gartner analysts, the market for self-service data preparation software will reach $1 billion by 2019.

[Figure: Big Data pipeline]

When we look at these software products, we find that they not only try to reduce data preparation time in Big Data projects but also aim to ensure that data preparation is not limited to the IT department or to a few data engineers. The companies behind these products claim that their software can also be adopted by business users, who can create connections and integrate data by themselves.

I propose the following exercise. Let's suppose that you and I are part of a team in the IT area of a company. The organization has decided to acquire a data preparation tool that will be adopted by business users from different areas of the organization: Human Resources, Accounting, Finance, Auditing, Marketing, etc. The users will have to prepare the data with the tool by themselves. However, the director has asked us to evaluate products and recommend the one that seems best from a technical point of view. What characteristics would you evaluate? Of course, the list of features to weigh is subjective, and your list is unlikely to be the same as mine. That is what makes this exercise interesting. Here is my list; I invite you to create and share your own:

• Easy to use. This point seems fundamental to me. Keep in mind that the user will not be a data engineer or a person with IT training. The user will be a business analyst (this type of user without IT training is usually called a "Citizen User"). If the tool is difficult to use, it probably will not be adopted and the project will fail. What does "easy to use" mean? The tool should have a graphical interface with copy and paste features, it should not require programming or scripting skills, and it should offer wizards that simplify tasks.

• Fast results. If, in addition to extracting and transforming the data, the tool makes it possible to obtain insights quickly, users will surely welcome it. The ability to create graphs or "drill down" tables in a simple way lets users dig deeper into the data and detect patterns or outliers very quickly.

Personally, I think these two characteristics are the most important, and they apply to practically any project. Then there are other elements to consider, which may vary with the particular characteristics of each situation: Will the tool need access to multiple heterogeneous data sources? Or are two or three specific data sources enough? Will we need a collaborative workspace?

And what about you? What qualities would you look for in a data preparation tool?

Let me know.

See you!

Tags: Toad Data Point NoSQL Analysis

Written by Fernando Garcia

Fernando Garcia has been working with Oracle databases since 1995. He is the founder of Comunidad Oracle Hispana, the largest Spanish-language online community for Oracle professionals. He is also a founding member of the Oracle Users Group of Argentina (AROUG), where he helps organize popular events such as the OTN Tour Argentina and the APEX Tour Argentina. Together with Clarisa Mamán Orfali, Fernando hosts El Show de la Comunidad Oracle Hispana, a Spanish-language podcast that presents Oracle technologies in a fun and educational way. He is also the creator of Asterisco Más, a YouTube channel that offers free educational content for the Spanish-speaking Oracle community.

In 2009 Fernando was recognized by Oracle with the Oracle ACE Award. The Oracle ACE program highlights excellence within the global Oracle community by recognizing individuals who have demonstrated both technical proficiency and strong credentials as community enthusiasts and advocates.

Sites and publications

  • Comunidad Oracle Hispana (official site)
  • The Show of the Comunidad Oracle Hispana (podcast)
  • Asterisco Mas (YouTube channel)
  • Free SQL language tutorial (blog with educational videos)
  • AROUG - Argentina's Oracle Users Group (official site)
  • Introduction to the use of Regular Expressions in Oracle Databases (article in Spanish on Oracle Technology Network)