Automatically tracking database schema changes

2010/10/26

Last modified: 2010-11-14

How can you quickly find out what changes have recently been made to your database schema?

Previous blogs have shown how to include version numbers with a database and how checksums can help ensure a schema is as expected.

But, what do you do if you find that the schema has been ‘unexpectedly’* changed?

This blog will look at how DDL triggers and logging can be used to track changes in your schema and help troubleshoot if there are issues.

* Someone has changed it outside of the standard deployment process.

Overview

  • DDL triggers can be used to automatically track schema changes.
  • A table is created in the schema to log these changes.
  • This table can be queried to find recent changes.

Note
Thank you to those who pointed out that MS SQL 2K8 Enterprise Edition and Oracle have schema auditing functionality built in – see the LinkedIn discussion from a previous blog.

The MS sample database, AdventureWorks, contains similar functionality to that outlined in this blog.

DDL triggers

DDL triggers were introduced in MS SQL 2K5 and can be used to track schema changes.

They fire primarily in response to TSQL statements that start with the keywords CREATE, ALTER, DROP and GRANT.

They give access to the EVENTDATA function, which returns a whole host of useful info as XML – including the TSQL statement that was executed.

A script – 007_CreateTrigger_tr_DDL_Event.sql – is provided which includes an example trigger that captures DDL events and logs them.

The following ‘snippets’ highlight and explain some of the key sections in this trigger:

CREATE TRIGGER tr_DDL_Event
ON DATABASE
FOR DDL_DATABASE_LEVEL_EVENTS

‘DDL_DATABASE_LEVEL_EVENTS’ – the trigger fires for all DDL events in this database.

SET @data = EVENTDATA()

SET @eventName = COALESCE(@data.value('(/EVENT_INSTANCE/EventType)[1]', 'nvarchar(128)'),'')
SET @eventMsg = 'SchemaName: ' + COALESCE(@data.value('(/EVENT_INSTANCE/SchemaName)[1]', 'nvarchar(128)'),'')
SET @eventMsg = @eventMsg + ', ObjectName: ' + COALESCE(@data.value('(/EVENT_INSTANCE/ObjectName)[1]', 'nvarchar(128)'),'')
SET @eventMsg = @eventMsg + ', TSQLCommand: ' + COALESCE(@data.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'nvarchar(max)'),'')
SET @eventMsg = @eventMsg + ', LoginName: ' + COALESCE(@data.value('(/EVENT_INSTANCE/LoginName)[1]', 'nvarchar(128)'),'')
SET @eventMsg = @eventMsg + ', UserName: ' + COALESCE(@data.value('(/EVENT_INSTANCE/UserName)[1]', 'nvarchar(128)'),'')

‘SET @data = EVENTDATA()’ – capturing the DDL event data (of type XML).

XQuery expressions are run to extract the individual elements, e.g. the TSQL command, which are concatenated into the variable @eventMsg.

EXEC dbo.p_EventLog_CreateRecord
@eventType='INFORMATION'
,@programName='tr_DDL_Event'
,@eventName=@eventName
,@eventMsg=@eventMsg

This data is then logged to a table dbo.EventLog (see details below) using a sproc dbo.p_EventLog_CreateRecord.

Note the parameter @programName='tr_DDL_Event' – this is used later as a query filter on dbo.EventLog.
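
Putting these snippets together, a minimal version of the trigger might look like the sketch below. This is only an illustration assembled from the snippets above – the trigger in the provided script 007_CreateTrigger_tr_DDL_Event.sql may differ in detail.

CREATE TRIGGER tr_DDL_Event
ON DATABASE
FOR DDL_DATABASE_LEVEL_EVENTS
AS
BEGIN
	SET NOCOUNT ON

	DECLARE @data xml
	DECLARE @eventName nvarchar(128)
	DECLARE @eventMsg nvarchar(max)

	-- Capture the DDL event data (of type XML)
	SET @data = EVENTDATA()

	-- Extract the elements of interest into @eventName and @eventMsg
	SET @eventName = COALESCE(@data.value('(/EVENT_INSTANCE/EventType)[1]', 'nvarchar(128)'),'')
	SET @eventMsg = 'SchemaName: ' + COALESCE(@data.value('(/EVENT_INSTANCE/SchemaName)[1]', 'nvarchar(128)'),'')
	SET @eventMsg = @eventMsg + ', ObjectName: ' + COALESCE(@data.value('(/EVENT_INSTANCE/ObjectName)[1]', 'nvarchar(128)'),'')
	SET @eventMsg = @eventMsg + ', TSQLCommand: ' + COALESCE(@data.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'nvarchar(max)'),'')
	SET @eventMsg = @eventMsg + ', LoginName: ' + COALESCE(@data.value('(/EVENT_INSTANCE/LoginName)[1]', 'nvarchar(128)'),'')
	SET @eventMsg = @eventMsg + ', UserName: ' + COALESCE(@data.value('(/EVENT_INSTANCE/UserName)[1]', 'nvarchar(128)'),'')

	-- Log the event to dbo.EventLog
	EXEC dbo.p_EventLog_CreateRecord
		@eventType='INFORMATION'
		,@programName='tr_DDL_Event'
		,@eventName=@eventName
		,@eventMsg=@eventMsg
END
GO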

Storing the changes

By default I include a table – dbo.EventLog and associated sprocs – in all new databases.

This is used to store DDL event logs and for other general logging purposes.

This table can be created using a script similar to that shown below:

 CREATE TABLE dbo.EventLog
 (
	EventLogId int identity(1, 1) NOT NULL,
	EventType nvarchar(15) NOT NULL,
	ProgramName nvarchar(128) NOT NULL,
	CallingProcedure nvarchar(128) NOT NULL,
	EventMsg nvarchar(max) NOT NULL,
	ProgramSectionName nvarchar(128) NOT NULL,
	EventName nvarchar(128) NOT NULL,
	CreatedDate datetime NOT NULL,
	BatchId nvarchar(128) NOT NULL,
	CreatedBy nvarchar(128) NOT NULL
 )
 ALTER TABLE dbo.EventLog
 ADD CONSTRAINT pk_EventLog PRIMARY KEY CLUSTERED (EventLogId),
     CONSTRAINT ck_EventLog_EventType CHECK (EventType IN ('Success','Error','Information','Warning'))
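
The logging sproc itself is not shown above. The following is a minimal sketch of what it might look like – the extra parameters, their default values and the use of GETDATE()/SUSER_SNAME() to populate CreatedDate/CreatedBy are assumptions, so the version of dbo.p_EventLog_CreateRecord in the downloadable framework may differ:

CREATE PROCEDURE dbo.p_EventLog_CreateRecord
	@eventType nvarchar(15)
	,@programName nvarchar(128)
	,@eventName nvarchar(128)
	,@eventMsg nvarchar(max)
	,@callingProcedure nvarchar(128) = ''		-- assumed optional parameters
	,@programSectionName nvarchar(128) = ''
	,@batchId nvarchar(128) = ''
AS
BEGIN
	SET NOCOUNT ON

	INSERT INTO dbo.EventLog
		(EventType, ProgramName, CallingProcedure, EventMsg
		,ProgramSectionName, EventName, CreatedDate, BatchId, CreatedBy)
	VALUES
		(@eventType, @programName, @callingProcedure, @eventMsg
		,@programSectionName, @eventName, GETDATE(), @batchId, SUSER_SNAME())
END
GO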

How this works

If the following script – to create a sproc – is run on the database:

CREATE PROCEDURE dbo.p_TestSproc
AS
BEGIN
	PRINT 'Test sproc'
END
GO

Then this CREATE event will be logged via the DDL trigger and stored in dbo.EventLog.

The following query will return the schema changes that have just been made.

SELECT EventName, EventMsg, CreatedDate, CreatedBy
FROM dbo.EventLog
WHERE ProgramName='tr_DDL_Event'
ORDER BY EventLogId desc

The first row returned is as follows (note only showing first two columns):

“CREATE_PROCEDURE”,
“SchemaName: dbo, ObjectName: p_TestSproc, TSQLCommand: CREATE PROCEDURE dbo.p_TestSproc AS BEGIN PRINT ‘Test sproc’ END , LoginName: Login, UserName: User”

If a SQL user is then given exec permissions on this sproc – the following row is returned:

“GRANT_DATABASE”,
“SchemaName: dbo, ObjectName: p_TestSproc, TSQLCommand: GRANT EXECUTE ON [dbo].[p_TestSproc] TO [aspnet_Roles_BasicAccess], LoginName: Login, UserName: User”

If the sproc is dropped – the following:

“DROP_PROCEDURE”,
“SchemaName: dbo, ObjectName: p_TestSproc, TSQLCommand: DROP PROCEDURE dbo.p_TestSproc, LoginName: Login, UserName: User”

What do you do if your database has changed?

You want to refresh your dev and testing environments with the latest copy of the production database.

You’ve received a copy. Initial checks show that the version number is as expected – but the checksums are not, indicating the schema has had some ‘unexpected’ changes.

What do you do?

Firstly, you probably want to find out what unexpected changes have been made to the production system.

But, perhaps more importantly, there is a need to review the current release process and to improve on it so that there is less chance of unexpected changes occurring in the future.

Tracking database schema changes can provide the – what, when, and potentially who – to help in improving the process.

Finding out the differences and tracking down the changes

To find the unexpected changes a tool such as RedGate SQL Compare can be used.

The production copy will need to be compared to a copy of the database that has the correct/expected schema*.

This process will outline the differences – and the unexpected changes made to the production database. These differences can then be correlated with entries in the dbo.EventLog table. This information can then be used to help in improving the deployment/change control process.

*A future blog will show how database unit testing can help ensure that you have a correct copy of the database schema.

Advantages of this approach

  • Help in troubleshooting. As outlined above, this can be used to track unexpected changes. But, it can also help if there are issues with official deployments as operational/support staff can quickly identify recent changes to a database as part of their overall troubleshooting procedures.
  • An aid in development. The log table can be used to quickly check what changes have been made and the order they have been applied – including security/permission changes.
  • Documentation. This is similar to the point above. DBAs often want to know what is going to be changed in a deployment. This can quickly provide this information.

Gotchas

  • The example DDL trigger provided does not log all the available event data – see the EVENTDATA function. It could easily be extended to do so, e.g. by adding something like the following to the script:
    SET @eventMsg = @eventMsg + convert(nvarchar(max),@data);
    
  • This is being used as an aid to the SDLC/database development. It is not intended to be part of a ‘formal’ security audit on a database.

Source code
All example scripts are provided in the database testing framework which can be downloaded from here.

What next

This blog has shown how schema changes can be tracked and how this can provide value especially when troubleshooting unexpected changes.

The next blog in this series will look at how individual SQL objects can be labelled with build version numbers making it easier to:

  • Find out when a SQL object was last changed.
  • Correlate these changes with builds/deployments.

What do you think?

How do you track database changes?


Doing agile – do I need to think about the rest of the enterprise?

2010/10/20

I recently read an interesting article – Governing Agile Development Teams – by Scott Ambler.

A couple of points particularly resonated with me – as they relate to my blog Agile database development 101 and the need to ‘Think about the rest of the enterprise’.

To quote the article:

“Agile methods should strive to optimize the whole, not just sub-optimize on construction or delivery.”

“Too many agile teams are building high-quality silo applications, which is better than building low-quality silo applications I suppose, but what they really need to be doing is building high-quality enterprise-class applications and the only way to do so is to take enterprise considerations into account throughout the delivery lifecycle.”

These quotes go some way towards dispelling the myth that agile projects don’t need to think about the overall enterprise.

Getting the appropriate balance between tactical and strategic objectives is key for all projects – not just those ‘doing agile’.

Many projects, whilst perhaps being optimum for their particular business unit needs, are not paying enough attention to the overall needs of the enterprise and are hence sub-optimal for ‘the whole’.

This is of particular importance to data architecture, and the overall data management space, as these project sub-optimisations are major contributors to the data quality debt incurred in many organisations.

The big question is how to ensure we get the appropriate balance?


Using checksums to test for unexpected database schema changes

2010/10/16

Last modified: 2010-11-13

How does a dba/developer quickly check that the baseline database schema they are using has not had any ‘unexpected’ changes made to it?

How can these checks be automated so that they are part of a continuous deployment process for all environments?

The previous blog in this series showed how to include a version number with your database schema*. This on its own provides value, but, even if the build version label is correct, the schema might have changed. A SQL object dropped or altered by accident – a change made outside of standard deployment procedures!

This blog shows how using checksums can test that the baseline schema is as expected.

* This blog assumes you have read the previous one in the series.

Overview

  • Checksums can be used to detect changes in schema.
  • Checksum values can be computed using queries on INFORMATION_SCHEMA.
  • dbo.BuildVersion table stores these values.
  • First script of each deployment checks the checksum values are as expected.
  • Last script updates with the new checksum values.

What are checksums?

A checksum is a computed value that is typically used to check/compare files. They can also be used to detect changes in the values in tables/views. MS SQL includes a number of functions that can be used for this – e.g. CHECKSUM_AGG

How they can be used to detect changes in schema

The following checksum query was run on a copy of the AdventureWorks database – it returned: -1697427893

SELECT CHECKSUM_AGG(CHECKSUM(
	TABLE_NAME
	,COLUMN_NAME
	,CHARACTER_MAXIMUM_LENGTH))
FROM INFORMATION_SCHEMA.COLUMNS

A column was then added to a table using the following script:

ALTER TABLE HumanResources.Department ADD TestColumn nvarchar(50);

The query was run again – this time giving a value of: -326082783

As the schema was changed – the values in the INFORMATION_SCHEMA views changed – therefore the computed checksum changed.

What about the overall schema?

The query above only covers a small number of columns in the INFORMATION_SCHEMA views. If this technique is to be used – other columns e.g. DATA_TYPE need to be included. Also what about other types of SQL objects?

I have found that checking the following areas gives me a good enough indication if the schema has changed:

Table/view columns, constraints and sprocs/functions.

But, if required, it would be relatively easy to extend the coverage (see the ‘Gotchas’ section below).

Three functions are provided to do this (see ‘Source code’ below):

dbo.fn_CheckSum_TableViewColumn – checksum based on join of tables/views and their columns
dbo.fn_CheckSum_TableConstraint – checksum based on table/view constraints e.g. fk, unique, defaults
dbo.fn_CheckSum_Routine – checksum based on sproc/function – includes full routine definition

This is an example of a checksum query used in dbo.fn_CheckSum_TableViewColumn:

SELECT CHECKSUM_AGG(CHECKSUM(
t.TABLE_SCHEMA
, t.TABLE_NAME
, t.TABLE_TYPE
, c.COLUMN_NAME
, c.ORDINAL_POSITION
, c.COLUMN_DEFAULT
, c.IS_NULLABLE
, c.DATA_TYPE
, c.CHARACTER_MAXIMUM_LENGTH
, c.CHARACTER_OCTET_LENGTH
, c.NUMERIC_PRECISION
, c.NUMERIC_PRECISION_RADIX
, c.NUMERIC_SCALE
, c.DATETIME_PRECISION
, c.CHARACTER_SET_CATALOG
, c.CHARACTER_SET_SCHEMA
, c.CHARACTER_SET_NAME
, c.COLLATION_CATALOG
, c.COLLATION_SCHEMA
, c.COLLATION_NAME
, c.DOMAIN_SCHEMA
, c.DOMAIN_NAME))
FROM INFORMATION_SCHEMA.TABLES AS t LEFT OUTER JOIN
INFORMATION_SCHEMA.COLUMNS AS c ON t.TABLE_NAME = c.TABLE_NAME
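
For illustration, a query like the one above can be wrapped in a scalar function so that it is easy to call from check scripts. The sketch below uses a shortened column list for brevity and also joins on TABLE_SCHEMA (an addition to avoid cross-schema name clashes) – the dbo.fn_CheckSum_TableViewColumn provided in the framework may differ:

CREATE FUNCTION dbo.fn_CheckSum_TableViewColumn()
RETURNS bigint
AS
BEGIN
	DECLARE @checkSum bigint

	SELECT @checkSum = CHECKSUM_AGG(CHECKSUM(
		t.TABLE_SCHEMA
		, t.TABLE_NAME
		, t.TABLE_TYPE
		, c.COLUMN_NAME
		, c.ORDINAL_POSITION
		, c.IS_NULLABLE
		, c.DATA_TYPE
		, c.CHARACTER_MAXIMUM_LENGTH))
	FROM INFORMATION_SCHEMA.TABLES AS t LEFT OUTER JOIN
		INFORMATION_SCHEMA.COLUMNS AS c ON t.TABLE_NAME = c.TABLE_NAME
		AND t.TABLE_SCHEMA = c.TABLE_SCHEMA	-- joining on schema as well is an assumption

	RETURN @checkSum
END
GO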

Storing the checksums in dbo.BuildVersion

See the previous post for details on the dbo.BuildVersion table.

Carrying out an ad hoc check

Each production deployment updates the dbo.BuildVersion table with the latest checksum values.

The latest production schema is used as the baseline database for all environments (see Agile database development 101). Therefore, its last entry contains the expected checksums for all environments and can be used to ensure all copies of the schema are correct.

A dba/dev can quickly run a query similar to below to get the latest expected checksum values:

SELECT CheckSumTableViewColumn
, CheckSumConstraint
, CheckSumRoutine
FROM dbo.BuildVersion
WHERE BuildDate=(SELECT MAX(BuildDate) FROM dbo.BuildVersion)

These can then be compared to the actual/current values of the database schema – using the example functions provided:

SELECT dbo.fn_CheckSum_TableViewColumn()
SELECT dbo.fn_CheckSum_TableConstraint()
SELECT dbo.fn_CheckSum_Routine()

If they return different values then an unexpected change has been made to the baseline schema.

Automatically testing before deploying

As per the previous post the first script that runs is – 000_DataSelect_BuildVersion_CheckBeforeBuild.sql

In this script, checksum variables are manually updated after each production deployment – to give the ‘baseline expectations’.

SET @expectedBuildVersion = '1.0.0.234'
SET @expectedCheckSumTableViewColumn = 1980786926
SET @expectedCheckSumConstraints = 1866989356
SET @expectedCheckSumRoutine = 476203398

It then uses the query and functions outlined above to compare expected vs actual.

If they are different it raises an error and the team can be notified.
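
For illustration, the comparison for one of the checksums might look something like this sketch (the actual script in the framework may differ):

DECLARE @expectedCheckSumTableViewColumn bigint
DECLARE @actualCheckSumTableViewColumn bigint

-- Manually updated after each production deployment
SET @expectedCheckSumTableViewColumn = 1980786926

-- Current value calculated from the schema being deployed to
SET @actualCheckSumTableViewColumn = dbo.fn_CheckSum_TableViewColumn()

IF @actualCheckSumTableViewColumn <> @expectedCheckSumTableViewColumn
BEGIN
	RAISERROR ('Unexpected schema change: the table/view/column checksum does not match the expected value.', 16, 1)
END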

Updating with the latest checksums

As per the previous post the last script that runs, after all changes have been deployed, is – 999_DataInsert_BuildVersion_EndOfBuild.sql
This runs the functions above to calculate the latest schema checksum values. It then inserts these values into dbo.BuildVersion.
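
A sketch of that insert is shown below – the ‘@BUILD_VERSION_NUMBER@’ token is replaced with the actual build version by the build tool before deployment (the script in the framework may differ):

INSERT INTO dbo.BuildVersion
	(BuildLabel, CheckSumTableViewColumn, CheckSumConstraint, CheckSumRoutine)
VALUES
	('@BUILD_VERSION_NUMBER@'	-- replaced with e.g. '1.0.0.235' by the build tool
	,dbo.fn_CheckSum_TableViewColumn()
	,dbo.fn_CheckSum_TableConstraint()
	,dbo.fn_CheckSum_Routine())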

And so the cycle begins again.

Gotchas

  • The example scripts provided do not cover all types of SQL objects e.g. triggers, indexes. Therefore, they will not ‘spot’ all potential changes to a schema.
  • They do not cover all columns in the selected INFORMATION_SCHEMA views.
  • If the above points are issues then it would be relatively straightforward to extend the example scripts provided to cover other types of objects/columns.
  • Environment-specific values in scripts can cause issues.
    For example, using physical server names for linked servers and referencing these in sprocs/functions. It is very likely, and desirable, that the server names change as you go from dev to production. If the physical names are referenced directly in scripts, the checksum values will change (e.g. routine definitions are included in the checksum query). There are a number of ways to get around this, e.g. use synonyms or ‘abstract’ the physical server name away when creating the linked server. This has the added advantage of making deployments easier to manage as the name only has to change in one place, not in multiple scripts.

Source code

All example scripts are provided in the database testing framework which can be downloaded from here.

What next

This has shown how checksums can be used to test if your schema has had unexpected changes made to it.

But, what do you do if you uncover a change?

The next blog in this series will look at how we can automatically track schema changes and use this to help troubleshoot unexpected schema changes.

What do you think?

Do you use a similar/different method for checking for unexpected schema changes?


Including a version number with your database schema

2010/10/13

Last modified: 2010-11-13

How does a dba/developer quickly check that the database schema version they are working on is correct?

Before each deployment is there away to you automatically test that your environments are using the correct database version?

This blog will show how including a version number in the database schema can help do this.

Overview of the overall process

  • Build version can be auto generated/incremented by build tools.
  • Table is created in the schema to store this.
  • First script of each deployment checks the last build version is as expected.
  • Last script updates with the new build version.

Use a build tool to auto generate the build version

Most build tools include functionality to automatically create build versions.

For example, Nant has a task (function) Version that does this.

I follow a four part build version convention – Major.Minor.Maintenance.Build

The first three parts are manually maintained by editing the appropriate Nant build file.

The fourth part – ‘.Build’ – is a number that is auto-incremented with each build.

e.g. the first production release of a database might have a build version of ‘1.0.0.234’

It is major release ‘1’ and has gone through ‘234’ builds.

Nant token replacement

It is possible to include tokens in scripts that can be automatically replaced by variables created in the build process.

For example, a token such as ‘@BUILD_VERSION_NUMBER@’ can be replaced by the latest build version.

This can then be inserted into SQL scripts on each build.

Storing the build version in the database

By default I include a table – dbo.BuildVersion – in all new databases.

This is used to store build version data (the checksum part will be explained in a later blog).

A script similar to the following can be used to create this table:

CREATE TABLE dbo.BuildVersion
(
	BuildLabel varchar(20) NOT NULL,
	BuildDate datetime CONSTRAINT [df_BuildVersion_BuildDate] DEFAULT getdate() NOT NULL,
	CheckSumConstraint bigint NOT NULL,
	CheckSumTableViewColumn bigint NOT NULL,
	CheckSumRoutine bigint NOT NULL
)

ALTER TABLE dbo.BuildVersion
ADD CONSTRAINT [pk_BuildVersion] PRIMARY KEY CLUSTERED (BuildLabel)

Carrying out an ad hoc check

Each production deployment updates the dbo.BuildVersion table with the latest build version.

The latest production schema is used as the baseline database for all environments (see Agile database development 101). Therefore, its last entry is the target/expected build version for all environments and can be used to ensure all copies of the schema are correct.

A dba/dev can quickly run a query similar to the following to check that their copy of the schema is correct.

SELECT BuildLabel
FROM dbo.BuildVersion
WHERE BuildDate = (SELECT MAX(BuildDate) FROM dbo.BuildVersion)

Automatically testing before deploying

The continuous deployment process can also automatically test that the baseline database schema is as expected in all environments.

Two scripts are provided to help do this (see ‘Source Code’ section at the end).

These are placed in the ‘WorkingDirectory’ (a blog later in the series will explain the solution/folder structure used).

000_DataSelect_BuildVersion_CheckBeforeBuild.sql

– the first script to be run in a deployment and automatically checks the build version

999_DataInsert_BuildVersion_EndOfBuild.sql

– the last script to be run in a deployment and automatically updates the table with the new build version

In the first script a variable is manually updated after each production deployment with the latest build version.

For example it might be:

SET @expectedBuildVersion = '1.0.0.234'

On deployment the script uses a query similar to the one above to check the actual value of the schema being worked on.

If the actual value is different to @expectedBuildVersion then an error is raised and the team is notified.
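
A sketch of this check, based on the table and query shown above (the framework script may differ):

DECLARE @expectedBuildVersion varchar(20)
DECLARE @actualBuildVersion varchar(20)

-- Manually updated after each production deployment
SET @expectedBuildVersion = '1.0.0.234'

-- Latest build label recorded in the schema being deployed to
SELECT @actualBuildVersion = BuildLabel
FROM dbo.BuildVersion
WHERE BuildDate = (SELECT MAX(BuildDate) FROM dbo.BuildVersion)

IF @actualBuildVersion IS NULL OR @actualBuildVersion <> @expectedBuildVersion
BEGIN
	RAISERROR ('Unexpected build version: the baseline schema is not at the expected release.', 16, 1)
END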

Updating with the latest build version

The last script file – 999_DataInsert_BuildVersion_EndOfBuild.sql – contains the following text:

SET @buildVersion = '@BUILD_VERSION_NUMBER@'

On creating the deployment files, this token will be replaced by the new build version created by the build tool – e.g.:

SET @buildVersion = '2.0.0.235'

When this is deployed into production it will insert the new value into the dbo.BuildVersion table.

And the cycle will begin again.

Source code

All example scripts are provided in the database testing framework which can be downloaded from here.

What next?

This has shown a relatively simple way to version a database schema.

But, whilst this is a good start, having the expected/correct build version label on a database schema doesn’t actually indicate that:

  • All of the expected SQL objects are present.
  • None of the SQL objects have had ‘unexpected’ changes made to them.

The next blog in this series will show how a simple test using checksums can help to do this.

What do you think?

Do you use a similar/different method to version your database schema?


Agile database development 101

2010/10/08

Last modified: 2011-04-14

Agile techniques are being increasingly used in general software development but, so far, they have failed to take off in database development.

The purpose of this blog is to list agile database development techniques and practices that I have found useful. It won’t be complete – it will need to change/improve over time. I also hope others will contribute to it.

Brief notes are included with each item. It is my intention to write more detailed blogs in the future – which will include examples and, where appropriate, source code.

What is agile?

Agile as a development concept came from software development.

There is no official definition – the closest being the four values of the Agile Manifesto.

Key features include:

  • Iterative and incremental approach.
  • An emphasis on creating ‘working software’.
  • Just enough design rather than doing all design/analysis up front.
  • Active stakeholder participation.
  • Providing continuous and rapid feedback.

What about database development?

Whilst many teams are using agile techniques – they rarely extend this to include database development.

Issues with current approaches include:

  • Database schemas are deemed difficult to change/extend. Once created they are effectively static.
  • Little reuse of data assets. New projects continue to create their own ‘silos’ of data.
  • Processes, such as deployment, are reliant on manual procedures leading to waste and errors.

These all contribute to the poor data quality that plagues many organisations.

A newly created database and the data within it is likely to be around for a long time – often much longer than the software that is initially created with it. Therefore, IMHO, it is perhaps even more important that techniques are employed that make systems maintainable and extensible and that reduce the time/cost of change – whilst ensuring high quality.

Agile database techniques and practices

Standard part of development

For many projects database development often stands out like a ‘sore thumb’ – with separate processes/techniques/tools being used. On some, it is not even given the same level of importance as other areas of development.

Database development should be considered an equal – no more no less – to other areas and appropriate priority assigned to it. Where possible similar tools and techniques should be used as unnecessary variation often leads to waste and friction.

As per the title – it should just be treated as a standard part of development.

Automate, automate, automate

Repetitive routine tasks should be automated. It reduces waste, friction of change and can help in increasing overall quality. Deployment, testing, anything that is routine – are all prime candidates for automation.

When starting out or trying to apply this to an existing system it might seem like an onerous task. Start by taking an incremental and iterative approach – with priority going to areas that will provide the most value and/or are causing the most pain. It is also worth remembering that once created much of the functionality can be reused in other projects.

Use source control

The use of source control is considered a given for software development. But, this is rarely the case for database development. All database changes should be via scripts. Creating/altering SQL objects, security/permissions, core configuration/reference data and data change scripts should be held in source control.

This makes it easier:

  • To have consistent builds/deployments
  • To correlate database changes with code changes
  • For developers to access latest versions of change scripts.

Automate database testing

A suite of automated tests is core for providing the ability to change/extend your database whilst maintaining existing quality. Automated database testing should be integrated throughout the project lifecycle.

Key areas to include:

  • database unit testing
  • system and integration testing
  • deployment testing
  • data quality testing
  • data migration testing

Continuous integration

Continuous integration is an automated process that monitors changes in source code, compiles the application and then runs the suite of unit tests. Results of the tests are sent to the team, providing immediate feedback if there are any problems.

Database development should be part of this process. Prerequisites are being able to compile* the latest version of the database and having database unit tests.

* Run change scripts against a baseline version of the database.

Continuous deployment

Agile emphasises ‘working software’.

This should be extended to, albeit a less snappy caption:

‘working software that is ready to deploy to all required environments including production’.

At any given point in a project life-cycle there shouldn’t be any ‘technical’ reasons to delay deployment. Being ready to deploy, especially for databases, is an area that is often neglected by teams. It is often manually intensive, error prone and left until the go-ahead for production release has been given. This type of approach causes waste and delay throughout all environments.

A continuous deployment process needs to be in place that:

  • Automatically builds all ‘deployables’. For databases this means getting all the required change scripts and taking into account different environments.
  • Testing deployments – to ensure change scripts actually install correctly.

Individual script files for all changes

All database changes should be placed in individual ‘atomic’ change script files. For a release, these are then run in sequence against the previous version of the database*.

See Managing SQL scripts and continuous integration.

This approach lends itself to:

  • Taking a test driven approach for database development/unit testing.
  • Storing and managing changes in source control.
  • Managing deployment scripts that take into account different environments.
  • Correlating each build/release with a known set of database changes.

* At the start this is a blank database.

Build in data quality

Data quality is the foundation to all information systems. Poor data quality is wasteful – think of all those hours spent on remedial actions – and if severe enough will lead to the failure of a project/system.

Taking an agile approach should include:

  • Giving data quality appropriate priority. Think about the data quality aspect in user stories.
  • Ensuring active stakeholder participation with DQ. Users are best placed to decide if data is ‘fit for purpose’ and to set DQ priorities. It should be noted that DQ encompasses more than system development – it is part of the wider data management of the overall organisation. But, projects should play a key part in providing feedback to the group(s) responsible.
  • Automating data quality and profile testing. It is important that this should be on-going – continuing after initial production release.
  • Building this up in an incremental and iterative fashion. For example, if there are large data sets with data quality issues then break them up into smaller more manageable parts – based on user priorities – to test, fix, migrate etc.

Don’t ignore data migration work

Many projects involve some degree of data migration. These tasks are often left until near the end and are not included in automated testing – especially if it is seen as a ‘one off import’.

Start thinking about data migration work sooner rather than later. Treat it the same as you would for any other task. But remember source systems will often not be under the direct control of a team, so if there are issues they are likely to become ‘political’ and to take longer to resolve. Active stakeholder participation is key in managing this.

From a technical viewpoint, encapsulating your database (see later item) allows for greater flexibility in decoupling data migration ‘components’ from base table implementation.

*The data quality of source systems is often a critical issue – therefore it is not surprising that most of the points for DQ above also apply for this section.

Separate managed environments

Separate managed environments – each with its own copy of a database – play a big part in ensuring quality and providing immediate and continuous feedback. They also play a role in reducing waste and the friction caused by issues in one area of testing affecting another. As the number of environments increases, the importance of having an automated continuous deployment process becomes greater.

The list below is of the environments that are typically set up:

  • Development. Each developer has their own copy of the database.
  • Build. For continuous integration.
  • System. For deployment and ‘end to end’ system testing.
  • QA. For manual exploratory type testing.
  • Demo. For stakeholder access.
  • Stage. Pre-production – normally as close as possible to production infrastructure.
  • Production

Depending on the project other separate environments might be required e.g. for migration, integration, a number of different QA environments.

Use production as a base line for all environments

Deploying changes to the production environment is a core objective for a development project. Production is the target. Therefore, it should be used as the start point, the base line, throughout all other environments. To do this effectively requires having a scheduled process (at a minimum after each production deployment) that automatically takes the production database and creates appropriate copies* for all environments.

In all environments, the general pattern for each test cycle is:

  • Restore latest production copy* of database
  • Run current change scripts for that iteration
  • Run tests.

This has a number of advantages:

  • A standard approach is used with all environments built in a similar way.
  • Each environment is testing the real target. Any ‘inadvertent’ changes to production, that might affect deployments, are more likely to be captured much earlier in the overall process.

* Production data might need to be ‘desensitised’ for security/regulatory reasons. Also, using a full copy of data might cause performance issues e.g. in dev, build.
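
As a minimal illustration of this pattern, refreshing an environment’s copy from the production baseline might be scripted along these lines – the database name, backup path and file locations are purely illustrative:

-- Restore the latest production backup over the environment's copy
RESTORE DATABASE MyProjectDb
FROM DISK = N'\\backupserver\sql\MyProjectDb_Production.bak'
WITH REPLACE,
	MOVE N'MyProjectDb_Data' TO N'D:\SQLData\MyProjectDb.mdf',
	MOVE N'MyProjectDb_Log' TO N'E:\SQLLogs\MyProjectDb.ldf'

-- The current iteration's change scripts are then run against this copy,
-- followed by the automated test suite.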

Assign adequate resource for setting up test data

Carrying out effective testing will require creating/setting up test data for different environments. As always, this should be done in an incremental and iterative fashion.

Database schema versioning

When there are multiple environments with different copies of the database it is very important to be able to quickly check which database version is being used and that it actually contains the correct SQL objects. It also helps to be able to correlate versions of application(s) software with different releases of a database.

There are a number of techniques that can be used, including: storing a build version number in the schema, using checksums to detect unexpected changes, and automatically logging DDL events – see the earlier posts above.

Small and concise pieces of work

Database development should follow the standard agile development approach – with work being carried out in short sprints of typically 1 to 2 weeks. In this time the team is expected to have completed user stories with the outcome being ‘working software’.

Encapsulate your database – restrict direct access to base tables

Encapsulating access to a database using a database API – e.g. using sprocs and views – is a key part in enabling database development to be agile (a small sketch of this approach is shown after the list below). Sadly it is a technique that does not seem to be widely practised. Many applications, reports and ETL type processes are coupled to the database schema, effectively having direct access to base tables.

Using a DB API to restrict direct access to base tables has a number of advantages including:

  • Making it easier to extend/change a database schema without necessarily having to change client applications. A by-product of this is that it increases the ability to reuse databases for multiple applications/business purposes.
  • Allowing greater flexibility in the development process. Creating a DB API increases the ability to effectively decouple application and base schema implementation, meaning that neither client software nor base table implementation has to become a bottleneck for the other.
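
As a small illustrative sketch (the table and object names are made up for this example), access can be provided through a view and a sproc rather than granting applications direct access to the base table:

-- Base table: not exposed directly to client applications
CREATE TABLE dbo.Customer
(
	CustomerId int identity(1, 1) NOT NULL,
	CustomerName nvarchar(100) NOT NULL,
	CreatedDate datetime NOT NULL CONSTRAINT df_Customer_CreatedDate DEFAULT getdate(),
	CONSTRAINT pk_Customer PRIMARY KEY CLUSTERED (CustomerId)
)
GO

-- Part of the database API: a view for reads...
CREATE VIEW dbo.v_Customer
AS
SELECT CustomerId, CustomerName
FROM dbo.Customer
GO

-- ...and a sproc for writes
CREATE PROCEDURE dbo.p_Customer_Create
	@customerName nvarchar(100)
AS
BEGIN
	SET NOCOUNT ON
	INSERT INTO dbo.Customer (CustomerName) VALUES (@customerName)
END
GO

-- Client applications are granted access to the API objects only
-- ('ApplicationRole' is an illustrative role name)
GRANT SELECT ON dbo.v_Customer TO ApplicationRole
GRANT EXECUTE ON dbo.p_Customer_Create TO ApplicationRole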

Use conceptual model patterns to get started

There are a number of resources including books produced by David Hay and Len Silverston that provide high level conceptual model patterns for a number of different business domains.

These can be excellent aids in providing a start point when coming to grips with a new domain. They also provide business terms/definitions which can help when initially engaging with stakeholders and starting to build a common business language.

Every organisation is different, so these models/definitions will need to be adapted and they do not replace the need for detailed data analysis and logical modelling. But, they can be really useful in getting you up and running.

Delay base table implementation until required

Base table design and implementation should be based on a normalised logical model – as this helps in reducing data redundancy and improves overall data quality. Also, design decisions on issues such as the appropriate degree of generalisation are often better delayed until there is enough information to make a more informed decision. But, taking an agile approach means developing incrementally and iteratively – rather than entirely ‘up front’. At first it might seem that these are in conflict with each other. But, this is not the case.

As mentioned previously, encapsulating a database means that base table implementation does not need to become a bottleneck in a project. Working client software can continue to be produced without necessarily having to implement base table design at the same time. This also means that the data analysis/modelling exercise gets continuous feedback of real data requirements as user stories are incrementally implemented.

Whilst a team should not be afraid to change base tables* – if they are implemented too early it increases the risk of unnecessary change at a later point. Therefore, delay base table implementation until required and recognise that it does not all need to be implemented in one go.

* Over time, as the organisation changes, it is going to have to happen – that’s one of the reasons for having a suite of automated tests.

Think about the rest of the enterprise

Many projects – whether taking an agile approach or not – are incentivised to take a ‘tactical’ approach that, whilst maybe optimum for a project, is often sub-optimum when considering the overall enterprise. If the whole enterprise is not considered, unintended data quality debt is more likely to increase.

Ideally an organisation should have one or more people that are responsible for looking at the strategic data architecture across the whole enterprise (in large organisations typically part of a wider enterprise architecture group). Members of the group should play an active part in all projects (if this is not possible at a minimum receive continuous feedback from all projects).

This approach offers a number of advantages including:

  • Better communication – allowing for continuous feedback between project and strategic viewpoints.
  • The increased possibility of reuse of existing data assets across projects.
  • Better informed decisions on the level of data quality debt that might be reasonable to incur for a given project.

Definitions

A common business language, with key terms and definitions, has many advantages at both a project and enterprise level – see the previous blog What do you mean?. Like ‘Data quality’ above, definition creation/maintenance needs to be given appropriate priority in a project. It needs to be integrated into the standard development process and measures need to be taken to ensure that it is seen as of practical use to the overall team.

Use simple naming standards and conventions

Agreeing on simple standards can provide real benefit in reducing the amount of waste/friction within a project.

These can include:

  • Naming conventions e.g. for SQL objects, change and build scripts files (conventions increase the potential for reuse across different projects).
  • Standard templates e.g. create/alter scripts – to make it easier for team members to follow conventions

It is worth remembering that the purpose of these is to save time and help improve quality – so avoid getting into ‘religious’ debates about individual specifics. There are plenty of resources out there to help you get started. Choose one that suits your needs and adapt appropriately.

References and links

Scott Ambler’s – Evolutionary/agile database best practices
Larry Burns’ – Build the agile database
Larry Burns’ TDAN articles list – most are agile database development related
Ben Taylor’s – Top 10 agile database development practices

What next?

What do you think?

Do you agree with these?

Are there others that you think should be added?

Next step is to start looking at each item in more detail.

The Agile database dev tips page will include all of the links.