Last modified: 2011-04-14
Agile techniques are increasingly being used in general software development but, so far, they have failed to take off in database development.
The purpose of this blog is to list agile database development techniques and practices that I have found useful. It won’t be complete – it will need to change/improve over time. I also hope others will contribute to it.
Brief notes are included with each item. It is my intention to write more detailed posts in the future – which will include examples and, where appropriate, source code.
What is agile?
Agile as a development concept came from software development.
There is no official definition – the closest being the four key principles of the Agile Manifesto.
Key features include:
- Iterative and incremental approach.
- An emphasis on creating ‘working software’.
- Just enough design rather than doing all design/analysis up front.
- Active stakeholder participation.
- Providing continuous and rapid feedback.
What about database development?
Whilst many teams are using agile techniques, they rarely extend them to include database development.
Issues with current approaches include:
- Database schemas are deemed difficult to change/extend. Once created they are effectively static.
- Little reuse of data assets. New projects continue to create their own ‘silos’ of data.
- Processes, such as deployment, are reliant on manual procedures leading to waste and errors.
These all contribute to the poor data quality that plagues many organisations.
A newly created database and the data within it is likely to be around for a long time – often much longer than the software that is initially created with it. Therefore, IMHO, perhaps it is even more important that techniques are employed that enable systems to be maintainable, extensible, and to reduce the time/cost of change – whilst ensuring high quality.
Agile database techniques and practices
Standard part of development
For many projects, database development stands out like a ‘sore thumb’ – with separate processes/techniques/tools being used. On some projects, it is not even given the same level of importance as other areas of development.
Database development should be considered an equal – no more, no less – to other areas, and appropriate priority assigned to it. Where possible, similar tools and techniques should be used, as unnecessary variation often leads to waste and friction.
As per the title – it should just be treated as a standard part of development.
Automate, automate, automate
Repetitive, routine tasks should be automated. This reduces waste and the friction of change, and can help increase overall quality. Deployment, testing – anything that is routine – are all prime candidates for automation.
When starting out, or when applying this to an existing system, it might seem like an onerous task. Start by taking an incremental and iterative approach – with priority going to the areas that will provide the most value and/or are causing the most pain. It is also worth remembering that, once created, much of the functionality can be reused in other projects.
Use source control
The use of source control is considered a given for software development, but this is rarely the case for database development. All database changes should be made via scripts. Scripts for creating/altering SQL objects, security/permissions, core configuration/reference data and data changes should all be held in source control.
This makes it easier:
- To have consistent builds/deployments
- To correlate database changes with code changes
- For developers to access latest versions of change scripts.
Automate database testing
A suite of automated tests is core for providing the ability to change/extend your database whilst maintaining existing quality. Automated database testing should be integrated throughout the project lifecycle.
Key areas to include:
- database unit testing
- system and integration testing
- deployment testing
- data quality testing
- data migration testing
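As a sketch of what a database unit test can look like – using Python’s unittest with an in-memory SQLite database purely for illustration (the table name and rules are hypothetical):

```python
import sqlite3
import unittest

# Hypothetical change script: in practice this would be loaded from a
# versioned script file held in source control.
CREATE_CUSTOMER = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE
);
"""

class CustomerSchemaTest(unittest.TestCase):
    def setUp(self):
        # Each test starts from a fresh in-memory database.
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(CREATE_CUSTOMER)

    def tearDown(self):
        self.conn.close()

    def test_email_is_required(self):
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO customer (email) VALUES (NULL)")

    def test_email_is_unique(self):
        self.conn.execute("INSERT INTO customer (email) VALUES ('a@example.com')")
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO customer (email) VALUES ('a@example.com')")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(CustomerSchemaTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because each test builds a fresh database from the change scripts, the suite can run unattended in any environment, including the continuous integration build.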
Use continuous integration
Continuous integration is an automated process that monitors changes in source code, compiles the application and then runs the suite of unit tests. Results of the tests are sent to the team, providing immediate feedback if there are any problems.
Database development should be part of this process. Prerequisites are being able to compile* the latest version of the database and having database unit tests.
* Run change scripts against a baseline version of the database.
Always be ready to deploy
Agile emphasises ‘working software’. This should be extended – albeit to a less snappy caption – to ‘working software that is ready to deploy to all required environments, including production’.
At any given point in a project’s life-cycle there shouldn’t be any ‘technical’ reasons delaying deployment. Being ready to deploy, especially for databases, is an area that is often neglected by teams. It is often manually intensive, error prone and left until the go-ahead for production release has been given. This type of approach causes waste and delay throughout all environments.
A continuous deployment process needs to be in place that:
- Automatically builds all ‘deployables’. For databases this means gathering all the required change scripts and taking into account different environments.
- Automatically tests deployments – to ensure change scripts actually install correctly.
Individual script files for all changes
All database changes should be placed in individual ‘atomic’ change script files. For a release, these are then run in sequence against the previous version of the database*.
This approach lends itself to:
- Taking a test driven approach for database development/unit testing.
- Storing and managing changes in source control.
- Managing deployment scripts that take into account different environments.
- Correlating each build/release with a known set of database changes.
* At the start this is a blank database.
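A minimal sketch of this pattern, assuming Python with SQLite as a stand-in for the real platform (script names and SQL are illustrative):

```python
import sqlite3

# Hypothetical ordered change scripts; in a real project each would be an
# individual file in source control, named so they sort into apply order.
CHANGE_SCRIPTS = [
    ("001_create_customer.sql",
     "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);"),
    ("002_add_customer_status.sql",
     "ALTER TABLE customer ADD COLUMN status TEXT NOT NULL DEFAULT 'active';"),
]

def apply_change_scripts(conn, scripts):
    """Run each atomic change script, in order, recording what was applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_change_log ("
        "script_name TEXT PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    for name, sql in scripts:
        already = conn.execute(
            "SELECT 1 FROM schema_change_log WHERE script_name = ?", (name,)
        ).fetchone()
        if already:
            continue  # skip scripts already applied to this baseline
        conn.executescript(sql)
        conn.execute("INSERT INTO schema_change_log (script_name) VALUES (?)", (name,))
    conn.commit()

# At the start of a project the baseline is a blank database.
conn = sqlite3.connect(":memory:")
apply_change_scripts(conn, CHANGE_SCRIPTS)
```

Recording applied scripts in a change log table makes the process repeatable – re-running the same set of scripts against an already-upgraded database is a no-op.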
Build in data quality
Data quality is the foundation of all information systems. Poor data quality is wasteful – think of all those hours spent on remedial actions – and if severe enough will lead to the failure of a project/system.
Taking an agile approach should include:
- Giving data quality appropriate priority. Think about the data quality aspect in user stories.
- Ensuring active stakeholder participation with DQ. Users are best placed to decide if data is ‘fit for purpose’ and to set DQ priorities. It should be noted that DQ encompasses more than system development – it is part of the wider data management of the overall organisation. But, projects should play a key part in providing feedback to the group(s) responsible.
- Automating data quality and profile testing. It is important that this should be on-going – continuing after initial production release.
- Building this up in an incremental and iterative fashion. For example, if there are large data sets with data quality issues then break them up into smaller more manageable parts – based on user priorities – to test, fix, migrate etc.
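As an illustration of an automated profiling check – here in Python against SQLite, with a hypothetical customer table – that could run on a schedule and report simple completeness/validity metrics:

```python
import re
import sqlite3

# Illustrative data only: a small table with one good, one malformed and
# one missing email address.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customer (email) VALUES (?)",
    [("alice@example.com",), ("not-an-email",), (None,)],
)

def profile_email_quality(conn):
    """Return simple completeness and validity metrics for the email column."""
    rows = [r[0] for r in conn.execute("SELECT email FROM customer")]
    total = len(rows)
    present = [e for e in rows if e is not None]
    valid = [e for e in present if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", e)]
    return {
        "total": total,
        "completeness": len(present) / total,
        "validity": len(valid) / total,
    }

metrics = profile_email_quality(conn)
```

Metrics like these give stakeholders something concrete against which to judge whether data is ‘fit for purpose’, and tracking them over time shows whether quality is improving or degrading.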
Don’t ignore data migration work
Many projects involve some degree of data migration. These tasks are often left until near the end and are not included in automated testing – especially if it is seen as a ‘one off import’.
Start thinking about data migration work sooner rather than later. Treat it the same as you would for any other task. But remember source systems will often not be under the direct control of a team, so if there are issues they are likely to become ‘political’ and to take longer to resolve. Active stakeholder participation is key in managing this.
From a technical viewpoint, encapsulating your database (see later item) allows for greater flexibility in decoupling data migration ‘components’ from base table implementation.
The data quality of source systems is often a critical issue – so it is not surprising that most of the points for DQ above also apply to this section.
Separate managed environments
Separate managed environments – each with its own copy of the database – play a big part in ensuring quality and providing immediate and continuous feedback. They also play a role in reducing waste and the friction caused by issues in one area of testing affecting another. As the number of environments grows, the importance of having an automated continuous deployment process becomes greater.
The following environments are typically set up:
- Development. Each developer has their own copy of the database.
- Build. For continuous integration.
- System. For deployment and ‘end to end’ system testing.
- QA. For manual exploratory type testing.
- Demo. For stakeholder access.
- Stage. Pre-production – normally as close as possible to production infrastructure.
Depending on the project, other separate environments might be required – e.g. for migration or integration, or a number of different QA environments.
Use production as a base line for all environments
Deploying changes to the production environment is a core objective for a development project. Production is the target. Therefore, it should be used as the starting point – the base line – for all other environments. To do this effectively requires having a scheduled process (at a minimum, after each production deployment) that automatically takes the production database and creates appropriate copies* for all environments.
In all environments, the general pattern for each test cycle is:
- Restore latest production copy* of database
- Run current change scripts for that iteration
- Run tests.
This has a number of advantages:
- A standard approach is used with all environments built in a similar way.
- Each environment is testing the real target. Any ‘inadvertent’ changes to production, that might affect deployments, are more likely to be captured much earlier in the overall process.
* Production data might need to be ‘desensitised’ for security/regulatory reasons. Also, using a full copy of data might cause performance issues e.g. in dev, build.
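A minimal sketch of the desensitising step, assuming Python with SQLite and an illustrative customer table – real masking rules would be driven by your security/regulatory requirements:

```python
import sqlite3

def desensitise_for_dev(conn):
    """Mask personal data in a restored production copy before handing it
    to non-production environments. Table/column names are illustrative."""
    conn.execute(
        "UPDATE customer SET email = 'user' || customer_id || '@example.invalid'"
    )
    conn.commit()

# Stand-in for "restore latest production copy": a small in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO customer (email) VALUES (?)",
                 [("alice@real-domain.com",), ("bob@real-domain.com",)])
conn.commit()
desensitise_for_dev(conn)
```

Building masking into the automated restore process means no environment ever receives an unmasked copy by accident.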
Assign adequate resource for setting up test data
Carrying out effective testing requires creating/setting up test data for the different environments. As always, this should be done in an incremental and iterative fashion.
Database schema versioning
When there are multiple environments with different copies of the database it is very important to be able to quickly check which database version is being used and that it actually contains the correct SQL objects. It also helps to be able to correlate versions of application(s) software with different releases of a database.
There are a number of techniques that can be used including:
- Storing a build/version number in the database. This can be auto-generated by tools such as NAnt and can tie in with the continuous integration process, automatically updating the database schema with each build.
- Storing overall checksums of SQL objects. This can make it easier to check that all expected SQL objects are present and that they haven’t been ‘inadvertently’ changed.
- Labelling individual SQL objects with the build number.
- Correlating builds with database unit tests so that there is a record of the SQL objects that should be in each build. Running these tests will also ensure that all expected SQL objects are present.
- Automatically tracking all schema changes – for example, by using a DDL event trigger to log all changes. This can help in identifying any ‘inadvertent’ changes.
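As an illustration of the first two techniques – a build number plus an overall checksum of SQL objects – here sketched in Python against SQLite (in practice the build number would come from the CI tool):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE VIEW active_customer AS SELECT customer_id, name FROM customer;
-- A single-row table holding the build number, updated by each CI build.
CREATE TABLE schema_version (build_number TEXT NOT NULL, checksum TEXT NOT NULL);
""")

def schema_checksum(conn):
    """Checksum the DDL of all SQL objects so later drift can be detected."""
    ddl = conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL "
        "AND name != 'schema_version' ORDER BY name"
    ).fetchall()
    text = "\n".join(row[0] for row in ddl)
    return hashlib.sha256(text.encode()).hexdigest()

def record_build(conn, build_number):
    conn.execute("DELETE FROM schema_version")
    conn.execute("INSERT INTO schema_version VALUES (?, ?)",
                 (build_number, schema_checksum(conn)))
    conn.commit()

record_build(conn, "1.0.42")  # hypothetical build number from the CI process
```

Comparing the stored checksum with a freshly computed one is then a quick way to spot ‘inadvertent’ changes in any environment.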
Small and concise pieces of work
Database development should follow the standard agile development approach – with work being carried out in short sprints of typically 1 to 2 weeks. In this time the team is expected to have completed one or more user stories, with the outcome being ‘working software’.
Encapsulate your database – restrict direct access to base tables
Encapsulating access to a database using a database API – e.g. using sprocs and views – is a key part of enabling database development to be agile. Sadly it is a technique that does not seem to be widely practised. Many applications, reports and ETL type processes are coupled to the database schema, effectively having direct access to base tables.
Using a DB API to restrict direct access to base tables has a number of advantages including:
- Making it easier to extend/change a database schema without necessarily having to change client applications. A by-product of this is that it increases the ability to reuse databases for multiple applications/business purposes.
- Allowing greater flexibility in the development process. Creating a DB API increases the ability to effectively decouple application and base schema implementation, meaning that neither client software nor base table implementation has to become a bottleneck for the other.
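A small sketch of the idea, using SQLite views as the DB API (all names are illustrative): the base tables are restructured, while the view – and therefore client code – keeps the same shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base table plus a view acting as the database API; clients only ever
# query the view.
conn.executescript("""
CREATE TABLE customer_base (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE VIEW customer AS SELECT customer_id, name FROM customer_base;
""")
conn.execute("INSERT INTO customer_base (name) VALUES ('Alice')")

def client_query(conn):
    # Client code depends on the view, not on the base table implementation.
    return conn.execute("SELECT customer_id, name FROM customer").fetchall()

before = client_query(conn)

# The base schema can now be restructured without touching clients: move the
# data to a redesigned table and recreate the view with the same shape.
conn.executescript("""
CREATE TABLE customer_v2 (customer_id INTEGER PRIMARY KEY, given_name TEXT NOT NULL);
INSERT INTO customer_v2 SELECT customer_id, name FROM customer_base;
DROP VIEW customer;
CREATE VIEW customer AS SELECT customer_id, given_name AS name FROM customer_v2;
""")

after = client_query(conn)
```

The client query returns identical results before and after the restructuring – which is exactly the decoupling the bullet points above describe.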
Use conceptual model patterns to get started
These can be excellent aids in providing a start point when coming to grips with a new domain. They also provide business terms/definitions which can help when initially engaging with stakeholders and starting to build a common business language.
Every organisation is different, so these models/definitions will need to be adapted and they do not replace the need for detailed data analysis and logical modelling. But, they can be really useful in getting you up and running.
Delay base table implementation until required
Base table design and implementation should be based on a normalised logical model – as this helps in reducing data redundancy and improves overall data quality. Also, design decisions on issues such as degree of generalisation that is appropriate are often better delayed until there is enough information to make a more informed decision. But, taking an agile approach means developing incrementally and iteratively – rather than entirely ‘up front’. At first it might seem that these are in conflict with each other. But, this is not the case.
As mentioned previously, encapsulating a database means that base table implementation does not need to become a bottleneck in a project. Working client software can continue to be produced without necessarily having to implement base table design at the same time. This also means that the data analysis/modelling exercise gets continuous feedback of real data requirements as user stories are incrementally implemented.
A team should not be afraid to change base tables* – but if they are implemented too early, the risk of unnecessary change at a later point increases. Therefore, delay base table implementation until required and recognise that it does not all need to be implemented in one go.
* Over time, as the organisation changes, change is going to have to happen – that’s one of the reasons for having a suite of automated tests.
Think about the rest of the enterprise
Many projects – whether taking an agile approach or not – are incentivised to take a ‘tactical’ approach that, whilst perhaps optimal for the project, is often sub-optimal when considering the overall enterprise. If the whole enterprise is not considered, unintended data quality debt is more likely to increase.
Ideally an organisation should have one or more people that are responsible for looking at the strategic data architecture across the whole enterprise (in large organisations typically part of a wider enterprise architecture group). Members of the group should play an active part in all projects (if this is not possible at a minimum receive continuous feedback from all projects).
This approach offers a number of advantages including:
- Better communication – allowing for continuous feedback between project and strategic viewpoints.
- An increased possibility of reusing existing data assets across projects.
- Better informed decisions on the level of data quality debt that might be reasonable to incur for a given project.
Develop a common business language
A common business language, with key terms and definitions, has many advantages at both a project and an enterprise level – see the previous blog What do you mean?. Like data quality above, definition creation/maintenance needs to be given appropriate priority in a project. It needs to be integrated into the standard development process, and measures need to be taken to ensure that it is seen as being of practical use to the overall team.
Use simple naming standards and conventions
Agreeing on simple standards can provide real benefit in reducing the amount of waste/friction within a project.
These can include:
- Naming conventions e.g. for SQL objects, change and build scripts files (conventions increase the potential for reuse across different projects).
- Standard templates, e.g. for create/alter scripts – to make it easier for team members to follow conventions.
It is worth remembering that the purpose of these is to save time and help improve quality – so avoid getting into ‘religious’ debates about individual specifics. There are plenty of resources out there to help you get started. Choose one that suits your needs and adapt appropriately.
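As a small illustration, a convention such as numbered change script file names can even be checked automatically – the naming pattern below is hypothetical:

```python
import re

# Hypothetical convention: change scripts are named NNN_verb_object.sql so
# that they sort into apply order, e.g. 001_create_customer.sql.
SCRIPT_NAME = re.compile(r"^\d{3}_[a-z]+(_[a-z0-9]+)+\.sql$")

def check_script_names(names):
    """Return the file names that break the naming convention."""
    return [n for n in names if not SCRIPT_NAME.match(n)]

bad = check_script_names([
    "001_create_customer.sql",
    "002_add_customer_status.sql",
    "NewTable.sql",          # breaks the convention
])
```

A check like this can run as part of the continuous integration build, turning the convention from a document people forget into something enforced automatically.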
References and links
Scott Ambler – Evolutionary/agile database best practices
Larry Burns – Build the Agile Database
Larry Burns – TDAN articles list (most are agile database development related)
Ben Taylor – Top 10 agile database development practices
What do you think?
Do you agree with these?
Are there others that you think should be added?
Next step is to start looking at each item in more detail.