This is part 3 of the Transformation and Cleans videos. I would like to go over the duplicates rule and calculated column rule.
Duplicates can be a very common but complex task as duplicates can come in many forms. In this sample dataset we can see that three are many duplicate values in the first column. These are valid entries but I also see duplicate values in some of the third column. In this case I want to keep all rows that have unique values in the first and third columns.
To do this open up the duplicates editor. Here I want an exact math on Landgrabbed and Base columns. There are other matching types such as Case Insensitive and Fuzzy Logic matching. The Fuzzy Logic option uses a slightly modified DoubleMetaphone algorithm where each word is passed through separately. This options is only available for English.
After I choose my key columns and match type I need to choose how I want the rows processed. The first option is to include all unique rows and take one of the duplicate rows and discard the rest. The row is arbitrarily picked and the end result is a distinct set of rows for the key columns.
Another option is to exclude all duplicate rows. If you choose this option you might want to export them to a separate csv file for later handling. The last option is to exclude all unique rows and just leave the duplicates. It just depends on what you want to accomplish.
When you apply the rule you will see a new tab that shows you all of the duplicates grouped together. This helps you to further understand the dataset and confirm the rule you created filters out the right duplicates. Remember, only the sample rows are processed when previewing the rule. The exporting option will only be executed on the final export.
Let’s look at the last rule, the calculated column. I call it the catch-all rule. Here you can apply any SQL function or math function to manipulate the data. If you want to continue to use SQL for your transforms select this option. Here you will get a list of SQL functions for the database you are connected to. Generating transformation SQL is only available for Oracle and MySQL in the 3.7 release. We will add support for other platforms in later releases.
To build a calculated column browse and choose the function. Double clicking will provide a template to guide you along. When done apply the rule. If you notice the rule is in black letters. This means SQL was used to generate the rule. Anytime you add a rule that does not use SQL the rule will turn red so you know that the output can only export to file or as a dataset. This disables support for views and snapshots.
The duplicate rule is an example of a rule that cannot generate SQL. Also this occurs when you turn off SQL transformation in the calculated rule. When SQL is not used to generate the transformation the processing is done on the client by our Transformation engine. Using SQL will be processed on the server.
So there you have it, the new Transformation and Cleanse utility. Have fun. And send us all your enhancement requests.