In an increasingly data-driven world with ever-changing systems, having an automated process for database migrations is a valuable practice. Instead of people having to keep track of each file, table, procedure, or object being modified, a tool can make database deployments more automatic and less excruciating, saving time and reducing errors.
Steve Jobs once said that a computer is like a bicycle for the mind. The late founder of Apple was talking about the fact that while human beings are not very efficient when it comes to locomotion, with the invention of the right tool (the bicycle), humanity was able to move much more effectively. Jobs took this idea of using computers to amplify our natural cognitive abilities and made it central to design, shaping the history of digital experiences.
Most products these days try to deliver relevant information based on automated processes that watch, track, and act upon previous interactions with the system and what is likely to come next. This philosophy should not be limited to iPads and other consumer-grade content-consumption devices. What if it were applied to software production, and instead of using automation to push content, we used automation to push database deployments?
Inspired by Jobs’s quote, my colleagues and I thought that instead of people having to keep track of each file, table, procedure, or object being modified, we could make a tool that does for database deployments what the bike did for locomotion. In this effort, we have been working on a way to make database deployments more automatic and less excruciating.
For most databases, each change requires manually submitting code into the codebase. As versions build upon each other, determining which files overlap, which objects are parentless, what needs to be purged, and what needs to be integrated gets progressively harder. We decided that by automatically creating database elements based on creator input, we could achieve something revolutionary: database deployments become just like codebase deployments.
Because changes made to a database's DDL and DML are as important as changes made to source code, they have to be recorded just as closely in order to be migrated to other environments. Migrating source code without the corresponding database structure changes often results in downtime and hours of debugging, which is both expensive and frustrating. The best way to avoid this situation is to seamlessly keep track of changes made to the database and automatically migrate those changes between environments, removing the errors introduced by manual intervention. The idea is that if one can store all of a database's DDL and DML structures as files in a version control system and selectively find and apply diffs on these files by comparing them with the database versions, then deployments can be automated.
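The article does not spell out the repository layout, but the paths in the deliver output further down (../sql/procedures/..., ../sql/tables/...) suggest one .sql file per database object, grouped by object type, along these lines:
Codeblock:
sql/
    functions/     one file per function
    procedures/    e.g. SPLIT.sql, CreateInvoiceFromEstimatetemp.sql
    tables/        e.g. CandidateSkills.sql (one CREATE TABLE per file)
    triggers/
    views/
Caption: Assumed layout of the DDL and DML files kept under version control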
Our automation process seeks to recreate exactly the steps a human would take to migrate these changes from one place to another. In the traditional manual scenario, migration to a target environment is a two-step process. First, an administrator creates a set of text files that contain the DDL of the tables and views, as well as the DML of functions and procedures, from the target database. Second, the administrator compares the contents of the files with the respective table structures and identifies the changes that need to be applied to the new deployment. To accomplish this automatically, we created a tool that consists of two phases: deliver and deploy. The former compares the DDL and DML on the file system with the database, finds the diffs, and applies those changes to the respective files in the file system, while the latter then finds those diffs and applies the changes to the database itself. The two modes work in conjunction to bring the database and its DDL and DML files into sync with each other based on simple inputs from users.
Deliver
The purpose of the deliver script is to retrieve the list of tables, views, functions, and procedures from the database and check for the existence of files with the same names in the file system. If a file does not exist, then a new file is created with the content provided by the create statement itself. If a file does exist, then the MD5s of the file's contents and of the create statement are compared, after stripping whitespace, to verify that no changes have been made. If the MD5s differ, signifying that there are changes, the content of the file is replaced with what the create statement provides. Using this comparison, the appropriate functions, procedures, views, and triggers can be retrieved using simple database queries and then stored properly in the file system.
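To make the comparison concrete, here is a minimal sketch of the deliver check for stored procedures, assuming a MySQL database (which the information_schema references later in the article suggest) and the ../sql/procedures/ layout visible in the output below. The connection details and helper subroutines are illustrative, not the tool's actual code.
Codeblock:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

# Illustrative connection details.
my $dbh = DBI->connect('DBI:mysql:database=iris;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# List the procedures in the current schema, then fetch each CREATE statement.
my $names = $dbh->selectcol_arrayref(
    q{SELECT ROUTINE_NAME FROM information_schema.ROUTINES
      WHERE ROUTINE_SCHEMA = DATABASE() AND ROUTINE_TYPE = 'PROCEDURE'});

for my $name (@$names) {
    my $create = ($dbh->selectrow_array("SHOW CREATE PROCEDURE $name"))[2];
    my $path   = "../sql/procedures/$name.sql";

    if (!-e $path) {
        # New object: create its file from the CREATE statement.
        write_file($path, $create);
        print "$name\n";
    }
    elsif (md5_hex(strip_ws($create)) ne md5_hex(strip_ws(read_file($path)))) {
        # MD5s differ after stripping whitespace: refresh the file contents.
        write_file($path, $create);
        print "$path different\n";
    }
    else {
        print "$name\n";    # unchanged object
    }
}

sub strip_ws   { my $s = shift; $s =~ s/\s+//g; return $s }
sub read_file  { open my $fh, '<', shift or die $!; local $/; return <$fh> }
sub write_file { my ($p, $c) = @_; open my $fh, '>', $p or die $!; print {$fh} $c; close $fh }
Caption: Illustrative sketch of the deliver comparison for stored procedures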
The process of storing the DDL of the tables cannot be done using queries, because the structure of each table is created through a Perl module. This is necessary because the next step of the process, deploy, uses the module to compare tables in order to get a list of missing columns, and the comparison is not comprehensive when using DDL retrieved via queries. With structures compared, in sync, and delivered, the process moves on to the deployment phase.
Codeblock:
karthik@stg:~/iris/srcbuild$ ant deliver
Buildfile: /home/karthik/iris/srcbuild/build.xml
deliver:
[exec] AddCommissionEarner
[exec] AddRaise
[exec] ArchiveContact
[exec] Balance_Summary
[exec] batchprocessBankAccounts
[exec] ../sql/procedures/CreateInvoiceFromEstimatetemp.sql different
[exec] ../sql/procedures/SPLIT.sql different
[exec] ../sql/tables/CandidateSkills.sql different
[exec] BankAccount_MD5s_temp
[exec] BankAccounts
[exec] BonusRefunds
Caption: Deliver phase checking for differences and collecting changes in the repository
Deploy
Not surprisingly, the deploy phase works in conjunction with, and is in fact dependent on, the deliver phase. It takes the changes from the files created in deliver and applies them to the target database. As in the deliver phase, the deploy phase compares each individual element in the database and the file system using MD5. Any stored procedure, function, trigger, or view in which a change is detected is dropped from the database and recreated using the content from the file system. This is possible because these objects have no static content or data of their own; they only manipulate data that is stored in tables to perform their respective functions. DDL changes to the tables, by contrast, have to preserve the data in those tables while changing their structure. Such changes include creating or deleting columns in specific orders, adding or removing constraint checks without loss of data, and so on. As such, taking from the file system as opposed to the database ensures artifact continuity.
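As a hedged illustration of that drop-and-recreate step, the sketch below walks the procedure files produced by deliver and, whenever the whitespace-stripped MD5s differ, drops the procedure on the target database and recreates it from the repository copy. The paths, connection details, and helpers are again assumptions rather than the tool's actual code; the same pattern would apply to functions, triggers, and views.
Codeblock:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

my $dbh = DBI->connect('DBI:mysql:database=iris;host=target-host',
                       'user', 'password', { RaiseError => 1 });

for my $path (glob '../sql/procedures/*.sql') {
    my ($name)   = $path =~ m{([^/]+)\.sql$};
    my $file_sql = read_file($path);

    # SHOW CREATE PROCEDURE dies if the procedure does not exist yet; treat that as "changed".
    my $db_sql = eval { ($dbh->selectrow_array("SHOW CREATE PROCEDURE $name"))[2] };

    next if defined $db_sql
         && md5_hex(strip_ws($db_sql)) eq md5_hex(strip_ws($file_sql));

    print "$name different\n";
    # Procedures hold no data of their own, so dropping and recreating them is safe.
    $dbh->do("DROP PROCEDURE IF EXISTS $name");
    $dbh->do($file_sql);
}

sub strip_ws  { my $s = shift; $s =~ s/\s+//g; return $s }
sub read_file { open my $fh, '<', shift or die $!; local $/; return <$fh> }
Caption: Illustrative sketch of the deploy step for stored procedures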
As before, the process of comparing DDLs from different databases and selectively applying the changes from one to the other cannot be performed with queries alone, because the data within the tables must be kept intact. One option would be to generate insert statements for every record in the original table before dropping it and creating a new version. While this would technically work, it is expensive in both time and computational power, and both costs scale up extremely fast as the size of the database being migrated grows.
The solution is to use our module and apply changes to the table structure based on a comparison between the DDL in the database and the DDL in the file system. This works by getting the description of the table from both the file system and the database, generating a list of columns, data types, default values, and properties for each column, and then comparing the keys of the two lists to produce a diff of missing columns. These columns are then applied to their respective tables in the database by forming alter statements. This preserves the data within the table and is a judicious use of processor power.
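The article does not name the Perl module, so the sketch below approximates the idea with information_schema and a very simple file parser: build a column-to-definition map from each side, then turn the keys missing from the database into alter statements. The table name, the regex-based parser, and the definition formatting are illustrative assumptions.
Codeblock:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=iris;host=target-host',
                       'user', 'password', { RaiseError => 1 });

# Column descriptions as the target database sees them.
sub columns_in_db {
    my ($table) = @_;
    my %cols;
    my $rows = $dbh->selectall_arrayref(
        q{SELECT COLUMN_NAME, COLUMN_TYPE, IS_NULLABLE, COLUMN_DEFAULT
          FROM information_schema.COLUMNS
          WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = ?},
        undef, $table);
    for my $r (@$rows) {
        my ($name, $type, $nullable, $default) = @$r;
        my $def = $type;
        $def .= ' NOT NULL'           if $nullable eq 'NO';
        $def .= " DEFAULT '$default'" if defined $default;
        $cols{$name} = $def;
    }
    return \%cols;
}

# Column descriptions as the repository DDL file describes them.
# Very simplified stand-in for the Perl module; it expects lines such as:
#   `ColumnName` varchar(50) NOT NULL,
sub columns_in_file {
    my ($path) = @_;
    open my $fh, '<', $path or die $!;
    my %cols;
    while (my $line = <$fh>) {
        $cols{$1} = $2 if $line =~ /^\s*`(\w+)`\s+(.+?),?\s*$/;
    }
    return \%cols;
}

my $table   = 'Employee';                                # example table name
my $wanted  = columns_in_file("../sql/tables/$table.sql");
my $current = columns_in_db($table);

# Keys present in the file but not in the database are the missing columns;
# ALTER TABLE ... ADD COLUMN leaves the existing rows intact.
for my $col (sort keys %$wanted) {
    next if exists $current->{$col};
    $dbh->do("ALTER TABLE $table ADD COLUMN $col $wanted->{$col}");
}
Caption: Illustrative column diff turned into alter statements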
Codeblock:
karthik@stg:~/iris/srcbuild$ ant deploy
Buildfile: /home/karthik/iris/srcbuild/build.xml
deploy:
[exec] Number of files :
[exec] FUNCTIONS : 4
[exec] Procedures : 8
[exec] Triggers : 0
[exec] Views : 2
[exec] Tables : 6
[exec] Functions Done
[exec]
[exec] SPLIT different
[exec] Procedure Done
[exec]
[exec] Triggers Done
[exec]
[exec] workorder different
[exec] Employee different
[exec] BankAccount_MD5s different
[exec] Tables Done
[exec]
[exec] Views Done
[exec]
BUILD SUCCESSFUL
Total time: 3 seconds
Caption: Deploy phase applying selective changes from the repository onto the database
Challenges
The Perl module does have a few issues that need to be addressed. First, the default values associated with each column are not retained in the DDL created by the module, so it is necessary to modify the module to include the default values when returning the create statement for the table. The module also does not take into account the foreign key constraints associated with a table. As such, for each table, it is necessary to retrieve the list of associated constraints by looking through the information_schema tables maintained by the database and then creating the appropriate alter statements.
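Here is a minimal sketch of that constraint lookup, assuming MySQL's information_schema and single-column foreign keys; the connection details and the CandidateSkills example are illustrative.
Codeblock:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=iris;host=target-host',
                       'user', 'password', { RaiseError => 1 });

# Read the foreign keys recorded for a table and turn them into alter statements.
sub foreign_key_alters {
    my ($table) = @_;
    my $rows = $dbh->selectall_arrayref(
        q{SELECT CONSTRAINT_NAME, COLUMN_NAME,
                 REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
          FROM information_schema.KEY_COLUMN_USAGE
          WHERE TABLE_SCHEMA = DATABASE()
            AND TABLE_NAME = ?
            AND REFERENCED_TABLE_NAME IS NOT NULL},
        undef, $table);

    return map {
        my ($name, $col, $ref_table, $ref_col) = @$_;
        "ALTER TABLE $table ADD CONSTRAINT $name "
      . "FOREIGN KEY ($col) REFERENCES $ref_table ($ref_col)";
    } @$rows;
}

# Example: print the statements that would restore the constraints on CandidateSkills.
print "$_;\n" for foreign_key_alters('CandidateSkills');
Caption: Rebuilding foreign key constraints from information_schema (illustrative)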
Additionally, the column order is not maintained when creating new columns in the target environment. Column order matters because insert statements in the business logic may rely on a specific column order, and changes to that order would result in discrepancies, eliminating the benefit of automation. Fortunately, this can be solved by adding new functionality to the Perl module to maintain the order of columns, using the key values associated with each column to do so; a hypothetical sketch follows.
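The sketch below shows one way such an ordering fix could look, using MySQL's AFTER/FIRST placement clause; the table and column names are examples only.
Codeblock:
#!/usr/bin/perl
use strict;
use warnings;

# Position a missing column relative to its neighbour in the repository DDL
# so that the target table keeps the same column order.
my ($table, $column, $definition, $previous) =
    ('Employee', 'MiddleName', 'varchar(50) DEFAULT NULL', 'FirstName');

my $placement = defined $previous ? "AFTER $previous" : 'FIRST';
print "ALTER TABLE $table ADD COLUMN $column $definition $placement;\n";
# Prints: ALTER TABLE Employee ADD COLUMN MiddleName varchar(50) DEFAULT NULL AFTER FirstName;
Caption: Hypothetical placement clause preserving column order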
Conclusion
A few required technical steps notwithstanding, when users pull new features from the repository, changes to the database can be deployed automatically rather than being migrated by hand. This keeps the database files organized, avoids many of the causes of human error that make developers afraid to modify anything for fear of bringing the whole system down, and allows deployments to become background processing tasks. Furthermore, the file structure can be maintained in Git or another repository system, adding version history as an extra safeguard.
In an increasingly data-driven world with ever-changing systems (even virtual ones), having an automated process for database migrations will prove itself to be an indispensable practice for 2015 and beyond.