Getting Off to a Good Start
When it comes to data, Continuous Integration is one of those things that looks good on paper but is very difficult to implement. As a result, most data professionals conveniently ignore this rule when it comes to Agile database development efforts. And, like other pseudo-Agile efforts where difficult or uncomfortable rules are passed over, I almost talked myself into a bad situation when I was asked to build a compliance data warehouse for Sun Microsystems.
Agile Grass Roots Database Development
In 2006, I was consulting at Sun Microsystems when I was asked to lead an effort to build a custom data warehouse for government compliance reporting. We were part of a larger program that was responding to the government's request for integrated sales reporting. Sun had a different transactional system for each of its major lines of business, so our new data warehouse would combine and organize all this data into a logical reporting system. They already had a Corporate Operational Data Store (ODS) that we could leverage, but the target data warehouse didn't exist at all. This was a grass roots effort to stand up a new reporting system.
So, I pulled together a team of database experts, and started putting together some project documents. I decided on an Agile database development approach that had a lot of influence from Extreme Programming (XP). While designing the implementation approach, I was faced with the XP rule of Continuous Integration, and I almost passed on the idea. We didn't have an automated test harness (Quest Code Tester didn't exist yet) so we would have had to build one from scratch. Then there's that nasty issue that data professionals must deal with that most Web developers don't–persistence.
The people who gave us Extreme Programming didn't have to deal with persistent data. They were Web developers building Web applications–not data professionals building data solutions. Persistence is an additional challenge for those of us in the data professions. For a successful Continuous Integration operation, you must make sure your data's always in the right state. Otherwise, your testing will produce inaccurate results.
But as much as I wanted to pass on the idea of Continuous Integration and just have a phased build and integration event like the old days, I knew it was too risky. There were a lot of moving parts to our solution: four different source systems, ETL to integrate to and from the ODS, a grass roots data warehouse, and several reporting streams. There's no way I was going to wait until the end to see if it all fit together. So, we decided to take on the Continuous Integration challenge, but we needed some time to put everything into place. So, I negotiated with the program office for an Iteration Zero.
The idea of an Iteration Zero came in handy when we started off. An Extreme Programming rule that immediately came in conflict with our Continuous Integration rule is the rule of Small Releases. According to Extreme Programming purists, every iteration must deliver a tested, working, production-ready solution. Well, in our situation, we didn't have the tools required to produce even a small functional solution, so this is a rule that we had to tweak a bit.
We successfully negotiated for an Iteration Zero, which is a free iteration. It’s free in the sense that we weren’t on the hook for delivering anything functional. It was purely a technical iteration that would give us time to put some infrastructure in place, like an automated test harness. We setup a schedule of 30-day iterations, with the first iteration to kick off with a release planning meeting in one month. That gave us about 30 days to get our house in order.
We accomplished a lot during our 30-day Iteration Zero. We were handed three blank servers (development, test, and production) that had nothing more than an operating system installed on each one. We had to install the databases; install the ETL servers; and build an automated test harness that would move code into the test environment, run a series of unit and system tests, and then move the code into production. And since most of our tests were written in our ETL tool (which wasn't natively built for unit testing), we had to figure out how to design an ETL unit test and launch it from our database server. It was a daunting feat that took us all of 30 days to complete, but we succeeded. And we learned a lot along the way.
If I Only Knew Then What I Know Now
We struggled through many of the challenges we faced through Iteration Zero, but we emerged stronger and wiser. Our goal was to have an environment where developers were integrating small pieces of functionality throughout the day, and then a nightly job would run a comprehensive suite of tests and push the solution to production if successful. We had a rough start making it all work like clockwork, but we eventually got there.
As I mentioned before, there was a big challenge with persistent data. So, we decided to drop the whole database and recreate it every time we ran an integrated test. Of course, this was all automated, so we had to build and maintain scripts that would recreate the whole database and then populate all the facts and dimensions.
Another challenge to overcome was how to build the database equivalent of a Java interface. In Java, the interface concept is crucial for unit testing because it's the only way the test environment can simulate the production environment with controlled scenarios. There's no native equivalent of this in Oracle. For instance, a production transformation may source from Table A, but to perform a unit test, you'll need this transformation to source from Table Z, which has the exact same structure as Table A. We accomplished this by dynamically creating views. Instead of sourcing from tables, our transformations sourced from views. This allowed us to switch underlying tables without altering the transformations.
Finally, we learned to keep test sets very small. It's tempting to just use the entire copy of a production table as your testing source; however, if it's a large table, it will take a long time to recreate and reload (remember, we're dropping and recreating each time we test). This isn't usually a problem, but in a Continuous Integration environment where these tables are constantly being dropped, created, and loaded, time is a big deal. So, when we built tests, we only used the bare minimum data needed to test the functionality of the function, procedure, or transformation.
Although we've had decades to improve our Agile development practices, it's still challenging in many ways for data professionals to successfully navigate through an implementation using Extreme Programming rules and practices. And even with all the great tools that exist today, Continuous Integration remains as a valuable but formidable practice for data professionals to get right. Today, I've shared with you three key lessons from the trenches. Embrace the technique of dropping and recreating tables and other structures. It's much cleaner than trying to constantly massage existing data to fit a variety of testing needs. Also, utilize views to simulate interfaces and keep your test data sets small to prevent long data load times. Take this advice and learn from your own experiences until Continuous Integration is seamless. It may take some work to get there, but it's well worth it.