Technical Playbook

This playbook suggests what needs to be done to build effective digital services on a technical level. It is opinionated, but it is backed by years of experience and draws on best practices from the private sector and government. It breaks the software development discipline into 14 actionable plays. We do not impose the plays on any team, but we strongly recommend them. At CDS, we believe this playbook will provide the following outcomes:

Set a baseline technical standard across our community and product teams
Increase collaboration and re-use
Build speed and consistency
Streamline onboarding new developers, departments or partners
Consolidate industry best practices and adapt them to our mission
Realize economies of scale
Mitigate risk and increase operational efficiency

We hope to refine it as we practice the plays every day in our craft.

In detail

PLAY 1

Work in the open

When we collaborate in the open and publish our data publicly, we can improve Government together. By building services more openly and publishing open data, we simplify the public’s access to government services and information, allow the public to contribute easily, and enable reuse by entrepreneurs, nonprofits, other agencies, and the public.

Skills to work on

Choose a permissive license approved by the Open Source Initiative such as the MIT License
Leverage GitHub to publish code, track issues, and share the product backlog
For each code repository, publish a simple and clear code of conduct, contribution guidelines, license, and security vulnerability disclosure process
Adhere to the Official Languages Act: provide bilingual documentation, create bilingual templates for issues and pull requests, respond to issues in the language of the issue whenever possible
Keep your artifacts and repositories clean and accurate: archive unmaintained repositories, delete stale and merged branches, remove old Pull Requests that will not be merged, regularly prune old issues

PLAY 2

Continuously integrate

The greatest and most wide-ranging benefit of continuous integration is reduced risk. It allows you to detect issues sooner. It is a fail-fast mechanism and removes one of the biggest barriers to frequent deployment.

Skills to work on

Integrate changes into a mainline as early as possible
Define your peer review process and documentation. Start with a pull request template
Use SemVer to version your changes
Enforce Conventional Commits. See why
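
To illustrate how Conventional Commits and SemVer fit together, here is a minimal sketch that validates a commit header and derives the next version number from a batch of commit messages. The function names `bump_for` and `next_version` are hypothetical, the regular expression covers only the basic header shape, and the special rules for 0.y.z versions and pre-release tags are not handled.

```python
import re

# Conventional Commits header: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<desc>.+)$"
)
BREAKING_FOOTERS = ("BREAKING CHANGE:", "BREAKING-CHANGE:")


def bump_for(message: str) -> str:
    """Return 'major', 'minor', 'patch' or 'none' for one commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError(f"not a Conventional Commit: {message!r}")
    # A "!" in the header or a BREAKING CHANGE footer signals a major bump.
    if match["breaking"] or any(line.startswith(BREAKING_FOOTERS) for line in lines[1:]):
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"


def next_version(current: str, messages: list[str]) -> str:
    """Apply the largest bump found in `messages` to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    bumps = {bump_for(message) for message in messages}
    if "major" in bumps:
        return f"{major + 1}.0.0"
    if "minor" in bumps:
        return f"{major}.{minor + 1}.0"
    if "patch" in bumps:
        return f"{major}.{minor}.{patch + 1}"
    return current
```

A check like `bump_for` could run in a commit hook or a CI job so that malformed messages are rejected before they reach the mainline.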

PLAY 3

Deploy often

Frequent deployment is valuable because it allows your users to get new features more rapidly, to give more rapid feedback on those features, and generally become more collaborative in the development cycle. This helps break down the barriers between users of our services and developers - which are the biggest barriers to successful software development.

Skills to work on

Deploy updated code into production environments multiple times a day in a structured and safe way
Use infrastructure as code to manage and provision all environments
Use hermetic builds - meaning that they are insensitive to the libraries and other software installed on the build machine - via container images and store them in the cloud
Every application or system should have a release checklist or an agile release train process defined. Read about our release management practice and responsibilities
Keep a release manifest file that links releases and component versions. Here is one example
Automate your release notes
Provide a rollback plan for each release
Leverage a blue/green deployment
If possible, use techniques to throttle traffic to the new release, e.g. a canary release
Ensure each release candidate is properly tested (manually or automatically). Testing should cover: unit tests, integration tests and performance tests
Ensure each release candidate has gone through a vulnerability scan
Always verify/monitor production logs and key metrics right after a release
Avoid deploying outside business hours or just before the weekend or bank holidays
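
As one way to picture the canary release mentioned in the list above, the sketch below assigns each user to a stable bucket by hashing an identifier, then sends only the lowest buckets to the new release. The names `canary_bucket` and `route` are hypothetical, and in practice this decision is usually made by a load balancer or service mesh rather than application code.

```python
import hashlib


def canary_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) using a hash of the id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, canary_percent: int) -> str:
    """Return 'canary' for roughly `canary_percent` percent of users, else 'stable'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"
```

Because the bucket depends only on the user id, a given user keeps seeing the same release while the percentage is raised step by step, and setting the percentage back to 0 acts as an immediate rollback of traffic.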

PLAY 4

Use feature toggles

Feature toggles, also known as release toggles, allow incomplete and untested code paths to be shipped to production as latent code and turned on when the new feature is ready. Using release toggles in this way is the most common way to implement the continuous delivery principle of “separating feature release from code deployment.”

Skills to work on

Install, implement or use an existing feature toggle platform
Practice writing code using branch by abstraction
Deploy large features in smaller pieces
Avoid releasing new code outside business hours or just before the weekend or bank holidays
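
The sketch below shows how a release toggle and branch by abstraction can work together: callers depend on an abstraction, and a toggle decides which implementation sits behind it. Everything here is hypothetical (the `PriceCalculator` interface, the `new-discount` toggle name, and the plain mapping standing in for a feature toggle platform).

```python
from typing import Mapping, Protocol


class PriceCalculator(Protocol):
    """The abstraction that callers depend on."""

    def total(self, subtotal_cents: int) -> int: ...


class LegacyCalculator:
    """Existing behaviour, used while the toggle is off."""

    def total(self, subtotal_cents: int) -> int:
        return subtotal_cents


class DiscountCalculator:
    """New code path, shipped to production but latent until the toggle is on."""

    def total(self, subtotal_cents: int) -> int:
        return subtotal_cents - subtotal_cents // 10  # 10% discount


def calculator_for(toggles: Mapping[str, bool]) -> PriceCalculator:
    """Pick the implementation from the toggle state; off is the default."""
    if toggles.get("new-discount", False):
        return DiscountCalculator()
    return LegacyCalculator()
```

With this shape, `calculator_for({}).total(1000)` keeps the legacy result, and turning the toggle on switches behaviour without a new deployment. Once the new path is proven, the legacy class and the toggle can be removed.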

PLAY 5

Choose the stack carefully

The technology decisions we make need to enable development teams to work efficiently and enable services to scale easily and cost-effectively. Our choices for hosting infrastructure, databases, software frameworks, programming languages and the rest of the technology stack should seek to avoid vendor lock-in and match what successful modern consumer and enterprise software companies would choose today. In particular, digital services teams should consider using open source, cloud-based, and commodity solutions across the technology stack, because of their widespread adoption and support by successful consumer and enterprise technology companies in the private sector.

Skills to work on

Ensure the skills of the community line up with the requirements of supporting the stack in the near and long term
Assess whether the developer tooling on the stack is sufficiently mature
Assess whether the technology stack has reasonable security management and patching in place
Review and rationalize additional technology dependencies

PLAY 6

Security first

Our digital services have to protect sensitive information and keep systems secure. This is typically a process of continuous review and improvement which should be built into the development and maintenance of the service. A key step in building a secure service is comprehensively testing and certifying the components in each layer of the technology stack for security vulnerabilities, and then re-using these same pre-certified components for multiple services.

Skills to work on

Determine what security means for your organization and your application, and prioritize it early in your product development
Use a static code analysis tool to flag programming errors, bugs, stylistic errors, and suspicious constructs
Use a secret scan (e.g. Git Seekret) to prevent adding sensitive information into a code repository
Enable code dependency scanning and automatic patching
Publish vulnerability disclosure policies
Enable Docker container image vulnerability scanning
Do a Risk/Threat Analysis consistently across products
Use shared secret management
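
To make the idea of a secret scan concrete, here is a deliberately small sketch that flags lines resembling credentials before they are committed. The three patterns and the `scan` function are illustrative assumptions, not the rule set of any particular tool; a dedicated scanner covers far more cases and should be preferred in practice.

```python
import re

# Illustrative patterns only; real scanners ship many more rules.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "hardcoded-credential": re.compile(
        r"(?i)\b(?:password|secret|token|api_key)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

Run against staged changes in a pre-commit hook, a non-empty result from `scan` would block the commit so the secret never enters the repository history.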

PLAY 7

Write clean code

Code is clean if it can be understood easily – by everyone on the team. Clean code can be read and enhanced by a developer other than its original author. With understandability comes readability, changeability, extensibility and maintainability.

Skills to work on

As a team, choose a code style. Every programming language has one, and you should settle on it.
Use static code analysis where possible
As a team, choose a set of rules that reflects what clean code means for you

PLAY 8

Develop accessible front-end

Building accessible services means meeting the needs of as many people as possible. From the start, we work with the people who will use a product, including people with disabilities.

Skills to work on

Ensure every UI component is accessible; refer to our guide https://digital.canada.ca/a11y/
Perform automated accessibility scans during end-to-end automated tests using axe-core integration with Cypress

PLAY 9

Automate all tests

All code written should include reasonable tests to ensure functionality and to demonstrate consideration of edge cases. Test automation guarantees quality and reliability for every application deployment and increases overall software development efficiency.

Skills to work on

Treat your test code as production code
Writing tests should be part of your acceptance criteria and pull request
Manual testing should be kept to a minimum
Pick testing libraries that help with code coverage, reporting and test code readability
Leverage mocking interfaces to test internal logic
Include some end-to-end tests when possible
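
As a small illustration of mocking an interface to test internal logic, the sketch below replaces an email client with a `Mock` from Python's standard library so that only the decision logic is exercised. The `notify_if_overdue` function, the 30-day rule, and the `mailer.send` interface are hypothetical.

```python
from unittest.mock import Mock


def notify_if_overdue(invoice: dict, mailer) -> bool:
    """Internal logic under test: email only when more than 30 days overdue."""
    if invoice["days_overdue"] > 30:
        mailer.send(to=invoice["email"], subject="Payment overdue")
        return True
    return False


def test_overdue_invoice_sends_one_email() -> None:
    mailer = Mock()  # stands in for the real email client
    assert notify_if_overdue({"email": "a@example.com", "days_overdue": 45}, mailer)
    mailer.send.assert_called_once_with(to="a@example.com", subject="Payment overdue")


def test_recent_invoice_sends_nothing() -> None:
    mailer = Mock()
    assert not notify_if_overdue({"email": "a@example.com", "days_overdue": 5}, mailer)
    mailer.send.assert_not_called()
```

The tests run without a network or mail server, which keeps them fast enough to run on every pull request.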

PLAY 10

Optimize the cloud’s cost

Cloud computing costs can quickly start to add up if your organization doesn’t stay on top of them. Manage cloud expenses efficiently by measuring and managing key cost contributors such as memory, storage, network traffic and instance utilization.

Skills to work on

Data transfer is optimized according to the needs of the application taking into consideration data centre region and zone cross-traffic
Compute instances are right-sized for the task
High cost applications are profiled for cloud optimization
Logging, monitoring and alerting for high load in traffic / cloud utilization is in place
CDNs are leveraged where feasible
Storage has predefined retention periods for disposal or archival
Data availability and redundancy is right-sized according to SLAs
Delete instances and resources when they are not needed or by schedule if possible.
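
The retention item above can be sketched as a small policy function that decides what to do with a stored object based on its age. The function name and the default thresholds of 90 and 365 days are assumptions for illustration; most cloud providers offer built-in lifecycle rules that should be preferred over custom code.

```python
from datetime import date


def retention_action(
    last_modified: date,
    today: date,
    archive_after_days: int = 90,   # hypothetical default
    delete_after_days: int = 365,   # hypothetical default
) -> str:
    """Return 'keep', 'archive' or 'delete' for an object of the given age."""
    if archive_after_days > delete_after_days:
        raise ValueError("archive_after_days must not exceed delete_after_days")
    age_days = (today - last_modified).days
    if age_days >= delete_after_days:
        return "delete"
    if age_days >= archive_after_days:
        return "archive"
    return "keep"
```

Writing the policy down in one place, whether as code or as provider configuration, makes the retention periods reviewable and keeps storage costs predictable.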

PLAY 11

Log, monitor and alert

Logging, monitoring and alerting are the bedrock of incident management. Developers and operations teams will be able to plan for and troubleshoot application issues much faster.

Skills to work on

Logs are collected and stored in a central location
Auditable events such as logins, failed logins, and high-value transactions are logged
Warnings and errors generate adequate and clear log messages
Logs of applications and APIs are monitored for suspicious activity
Appropriate alerting thresholds and response escalation processes are in place
The application is able to detect, escalate or alert for active incidents & threats in real time or near real time
Logs are held for sufficient retention periods to allow for delayed analysis
Logs are generated in a format that can be easily consumed by a centralized log management solution i.e. implement structured logging
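
As an example of structured logging, the sketch below uses Python's standard `logging` module with a formatter that emits one JSON object per line. The `JsonFormatter` class, the logger name, and the `event` and `user_id` fields are assumptions chosen for illustration; teams often use an existing library for this instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("event", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("playbook.example")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `logger.warning("login failed", extra={"event": "failed_login", "user_id": "u-123"})` then produces a line that a centralized log management solution can index by field, which makes auditable events like failed logins easy to query and alert on.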

PLAY 12

Get production support right

The world depends on “always on” services more than ever before. An outage can affect millions of people, with real impact: they can’t pay their bills, they can’t book their flights, they can’t video call with their friends. And whether you’re having a major bug, capacity issues, or you’re down completely, customers who depend on your services expect an immediate response. (The same is true for internal teams.)

A fair on-call schedule, coupled with an on-call compensation plan, can even foster a culture of shared responsibility and help your teams learn more about what it takes to make resilient software and services, making for a better overall product and fewer outages.

Skills to work on

Clearly define the on-call responsibilities, provide on-call training and access to all needed tools
Implement a rich knowledge base and document every incident
Check in regularly on whether the process is healthy and effective
Build an effective on-call schedule. Use the right tool; we use OpsGenie
Every product team member should be part of the on-call schedule either as an incident commander or an on-call developer

PLAY 13

Manage incidents

When we move fast, we expect issues. We want to respond quickly, respond in a way that doesn’t make the issue bigger, and avoid repeating the same problems, so that our users can keep getting services.

Skills to work on

Agree on what defines an incident and the different severity levels
Define an incident runbook: provide a list of clear steps to follow when an incident happens. It brings consistency and sets a quality standard
Always adopt a blameless approach and ensure psychological safety
Eliminate noise and false positives as soon as possible
Make your incident reports and action items visible
Automate the process: opening a bridge, creating the report, following up on action items etc.
Clarify and simplify the communication channels when incidents occur
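
One way to make an agreed definition of severity levels unambiguous is to write it down as code that alerting and reporting tools can share. The sketch below is purely illustrative: the three levels, the `Signal` fields, and the 10% threshold are assumptions, and each team would substitute its own agreed criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # hypothetical: service down or data exposed
    SEV2 = 2  # hypothetical: degraded for a significant share of users
    SEV3 = 3  # hypothetical: minor impact


@dataclass(frozen=True)
class Signal:
    service_down: bool
    data_exposed: bool
    users_affected_percent: float


def classify(signal: Signal) -> Severity:
    """Map observed impact to a severity level using the agreed rules."""
    if signal.service_down or signal.data_exposed:
        return Severity.SEV1
    if signal.users_affected_percent >= 10:
        return Severity.SEV2
    return Severity.SEV3
```

Keeping the rules this explicit means the runbook, the alerting thresholds, and the incident report template can all refer to the same definitions.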

PLAY 14

Sharpen your skills

This final play focuses on practicing all the above plays and enhancing them. Our coding craftsmanship is our greatest asset - we need to continuously improve it.

Skills to work on

Participate in at least one code retreat each year
Review Pull Requests from other teams
Read & share insightful technical blogs & news across the community
Step outside of your comfort zone when the opportunity presents itself
Pick a skill and master it with deep understanding