Technical Playbook

This playbook suggests what needs to be done to build effective digital services at a technical level. It is opinionated, but backed by years of experience and drawn from best practices in the private sector and government. It breaks the software development discipline into 14 actionable plays. We do not impose the plays on any team, but we strongly recommend them. At CDS, we believe this playbook will lead to the following outcomes:

  • Set a baseline technical standard across our community and product teams
  • Increase collaboration and re-use
  • Build speed and consistency
  • Streamline the onboarding of new developers, departments or partners
  • Consolidate industry best practices and adapt them to our mission
  • Realize economies of scale
  • Mitigate risk and increase operational efficiency

We hope to refine it as we practice the plays every day in our craft.

PLAY 1

Work in the open

When we collaborate in the open and publish our data publicly, we can improve Government together. By building services more openly and publishing open data, we simplify the public’s access to government services and information, allow the public to contribute easily, and enable reuse by entrepreneurs, nonprofits, other agencies, and the public.

Skills to work on

  1. Choose a permissive license approved by the Open Source Initiative such as the MIT License
  2. Leverage GitHub to publish code, track issues, and share the product backlog
  3. For each code repository, publish a simple and clear code of conduct, contribution guidelines, license, and security vulnerability disclosure process
  4. Adhere to the Official Languages Act: provide bilingual documentation, create bilingual templates for issues and pull requests, respond to issues in the language of the issue whenever possible
  5. Keep your artifacts and repositories clean and accurate: archive unmaintained repositories, delete stale and merged branches, remove old Pull Requests that will not be merged, regularly prune old issues
PLAY 2

Continuously integrate

The greatest and most wide-ranging benefit of continuous integration is reduced risk. It allows you to detect issues sooner. It is a fail-fast mechanism and removes one of the biggest barriers to frequent deployment.

Skills to work on

  1. Integrate changes into a mainline as early as possible
  2. Define your peer review process and documentation. Start with a pull request template
  3. Use SemVer to version your changes
  4. Enforce Conventional Commits. See why
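
One way SemVer and Conventional Commits can work together is to derive the next version number from the commit messages themselves. The sketch below is a minimal illustration, not a full implementation of either specification: the list of allowed commit types and the function names `bump_kind` and `next_version` are assumptions made for this example, and pre-release or build metadata is not handled.

```python
import re

# Conventional Commits header: type, optional (scope), optional "!", then ": description".
# The allowed types below are an assumption; teams usually agree on their own list.
HEADER = re.compile(
    r"^(?P<type>feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?: (?P<description>\S.*)$"
)


def bump_kind(message: str) -> str:
    """Return 'major', 'minor', 'patch' or 'none' for a commit message.

    Raises ValueError when the first line is not a Conventional Commits header.
    """
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError(f"not a conventional commit: {message!r}")
    # A "!" in the header or a BREAKING CHANGE footer signals a major bump.
    breaking = match["breaking"] or any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:")) for line in lines[1:]
    )
    if breaking:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"


def next_version(version: str, kind: str) -> str:
    """Apply a bump to a plain MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version
```

A check like `bump_kind` could run in a commit hook or a CI job to reject malformed messages, while `next_version` shows why the convention pays off: the release number follows mechanically from the history.
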
PLAY 3

Deploy often

Frequent deployment is valuable because it allows your users to get new features more rapidly, to give more rapid feedback on those features, and generally become more collaborative in the development cycle. This helps break down the barriers between users of our services and developers - which are the biggest barriers to successful software development.

Skills to work on

  1. Deploy updated code into production environments multiple times a day in a structured and safe way
  2. Use infrastructure as code to manage and provision all environments
  3. Use hermetic builds - meaning that they are insensitive to the libraries and other software installed on the build machine - via container images and store them in the cloud
  4. Every application or system should have a release checklist or an agile release train process defined. Read about our release management practice and responsibilities
  5. Keep a release manifest file that links releases and component versions. Here is one example
  6. Automate your release notes
  7. Provide a rollback plan for each release
  8. Leverage a blue/green deployment
  9. If possible, use techniques to throttle traffic to the new release e.g. canary release
  10. Ensure each release candidate is properly tested (manually or automatically). Testing should cover: unit tests, integration tests and performance tests
  11. Ensure each release candidate has gone through a vulnerability scan
  12. Always verify/monitor production logs and key metrics right after a release
  13. Avoid deploying outside business hours or just before the weekend or bank holidays
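
The canary release mentioned above can be sketched as a deterministic split of users between the stable and the new release. This is an illustration only: in practice the split is usually done by a load balancer or service mesh rather than application code, and the names `bucket` and `pick_release` are hypothetical.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user identifier to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def pick_release(user_id: str, canary_percent: int) -> str:
    """Return 'canary' for roughly canary_percent of users, otherwise 'stable'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hashing keeps each user on the same release between requests, and raising
    # the percentage only ever moves users from stable to canary.
    return "canary" if bucket(user_id) < canary_percent else "stable"
```

Raising `canary_percent` step by step (for example 1, 10, 50, 100) while watching logs and key metrics gives a natural point to stop and roll back if the new release misbehaves.
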
PLAY 4

Use feature toggles

Feature toggles aka release toggles allow incomplete and untested code paths to be shipped to production as latent code and turned on when the new feature is ready. Using release toggles in this way is the most common way to implement the continuous delivery principle of “separating feature release from code deployment.”

Skills to work on

  1. Install, implement or use an existing feature toggle platform
  2. Practice writing code using branch by abstraction
  3. Deploy large features in smaller pieces
  4. Avoid releasing new code outside business hours or just before the weekend or bank holidays
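
As a rough illustration of a release toggle combined with branch by abstraction, the sketch below keeps a new code path dark until a flag is turned on. Everything here is hypothetical: the `FeatureToggles` class, the `FEATURE_` variable prefix and the `search` functions stand in for whatever toggle platform and feature a team actually uses.

```python
from typing import Mapping


class FeatureToggles:
    """Tiny in-process toggle store; a real platform adds targeting and auditing."""

    def __init__(self, flags: Mapping[str, bool]):
        self._flags = dict(flags)

    @classmethod
    def from_env(cls, environ: Mapping[str, str], prefix: str = "FEATURE_") -> "FeatureToggles":
        """Read flags from variables such as FEATURE_NEW_SEARCH=true (assumed naming)."""
        flags = {
            name[len(prefix):].lower(): value.strip().lower() in {"1", "true", "on"}
            for name, value in environ.items()
            if name.startswith(prefix)
        }
        return cls(flags)

    def is_enabled(self, name: str) -> bool:
        # Unknown toggles default to off so latent code stays dark.
        return self._flags.get(name, False)


def legacy_search(query: str) -> str:
    return f"legacy results for {query}"


def new_search(query: str) -> str:
    # Incomplete feature shipped as latent code behind the toggle.
    return f"new results for {query}"


def search(query: str, toggles: FeatureToggles) -> str:
    """Single entry point for callers; the toggle selects the implementation."""
    if toggles.is_enabled("new_search"):
        return new_search(query)
    return legacy_search(query)
```

With this shape, `FeatureToggles.from_env(os.environ)` could be built once at startup. Callers only ever depend on `search`, so the legacy path can be deleted later without touching them.
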
PLAY 5

Choose the stack carefully

The technology decisions we make need to enable development teams to work efficiently and enable services to scale easily and cost-effectively. Our choices for hosting infrastructure, databases, software frameworks, programming languages and the rest of the technology stack should seek to avoid vendor lock-in and match what successful modern consumer and enterprise software companies would choose today. In particular, digital services teams should consider using open source, cloud-based, and commodity solutions across the technology stack, because of their widespread adoption and support by successful consumer and enterprise technology companies in the private sector.

Skills to work on

  1. Ensure the skills of the community line up with the requirements of supporting the stack in the near and long term
  2. Assess whether the developer tooling for the stack is sufficiently mature
  3. Assess whether the technology stack has reasonable security management and patching in place
  4. Review and rationalize additional technology dependencies
PLAY 6

Security first

Our digital services have to protect sensitive information and keep systems secure. This is typically a process of continuous review and improvement which should be built into the development and maintenance of the service. A key process to building a secure service is comprehensively testing and certifying the components in each layer of the technology stack for security vulnerabilities, and then to re-use these same pre-certified components for multiple services.

Skills to work on

  1. Determine what security means for your organization and your application, and prioritize it early in your product development
  2. Use a static code analysis tool to flag programming errors, bugs, stylistic errors, and suspicious constructs
  3. Use a secret scan (e.g. Git Seekret) to prevent adding sensitive information into a code repository
  4. Enable code dependency scanning and automatic patching
  5. Publish vulnerability disclosure policies
  6. Enable Docker container image vulnerability scanning
  7. Do a Risk/Threat Analysis consistently across products
  8. Use shared secret management
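
To make the idea of a secret scan concrete, here is a minimal sketch of a line-by-line scanner that could run before a commit. The three patterns are illustrative assumptions chosen for this example, not the rule set of Git Seekret or any other tool, and a dedicated scanner will catch far more cases.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; real scanners ship much larger, tuned rule sets.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded-password": re.compile(
        r"(?i)\b(?:password|passwd|secret)\b\s*[:=]\s*['\"][^'\"]{4,}['\"]"
    ),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line number, rule name) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

A pre-commit hook could call `scan_text` on each staged file and abort the commit when the result is not empty.
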
PLAY 7

Write clean code

Code is clean if it can be understood easily – by everyone on the team. Clean code can be read and enhanced by a developer other than its original author. With understandability comes readability, changeability, extensibility and maintainability.

Skills to work on

  1. Choose a code style as a team. Every programming language has one, and you should settle on it.
  2. Use static code analysis where possible
  3. As a team, choose a set of rules that reflects what clean code means for you
PLAY 8

Develop an accessible front end

Building accessible services means meeting the needs of as many people as possible. From the start, we work with the people who will use a product, including people with disabilities.

Skills to work on

  1. Ensure every UI component is accessible; refer to our guide https://digital.canada.ca/a11y/
  2. Perform automated accessibility scans during end-to-end automated tests using axe-core integration with Cypress
PLAY 9

Automate all tests

All code written should include reasonable tests to ensure functionality and to demonstrate consideration of edge cases. Test automation guarantees quality and reliability for every application deployment and increases overall software development efficiency.

Skills to work on

  1. Treat your test code as production code
  2. Writing tests should be part of your acceptance criteria and pull request
  3. Manual testing should be kept to a minimum
  4. Pick testing libraries that help with code coverage, reporting and test code readability
  5. Leverage mocking interfaces to test internal logic
  6. Include some end-to-end tests when possible
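
The point about mocking interfaces can be sketched with the standard library alone. In this hypothetical example, `Notifier` is the interface the business logic depends on and `remind_overdue` is the internal logic under test; neither name comes from a real code base.

```python
from typing import Mapping, Protocol
from unittest.mock import Mock


class Notifier(Protocol):
    """Interface the business logic depends on (hypothetical)."""

    def send(self, recipient: str, message: str) -> None: ...


def remind_overdue(balances: Mapping[str, int], notifier: Notifier) -> int:
    """Send one reminder per account with a positive balance; return the count."""
    sent = 0
    for email, balance in balances.items():
        if balance > 0:
            notifier.send(email, f"You owe {balance} dollars")
            sent += 1
    return sent


def test_only_overdue_accounts_are_notified() -> None:
    # spec=["send"] makes the mock reject any attribute the interface lacks.
    notifier = Mock(spec=["send"])
    count = remind_overdue({"a@example.com": 20, "b@example.com": 0}, notifier)
    assert count == 1
    notifier.send.assert_called_once_with("a@example.com", "You owe 20 dollars")
```

Because the logic only sees the interface, the test runs without an email server and still verifies exactly which messages would have been sent.
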
PLAY 10

Optimize the cloud’s cost

Cloud computing costs can quickly start to add up if your organization doesn’t stay on top of them. Manage cloud expenses efficiently by measuring and managing key cost contributors such as memory, storage, network traffic and instance utilization.

Skills to work on

  1. Data transfer is optimized according to the needs of the application taking into consideration data centre region and zone cross-traffic
  2. Compute instances are right-sized for the task
  3. High cost applications are profiled for cloud optimization
  4. Logging, monitoring and alerting for high load in traffic / cloud utilization is in place
  5. CDNs are leveraged where feasible
  6. Storage has predefined retention periods for disposal or archival
  7. Data availability and redundancy are right-sized according to SLAs
  8. Delete instances and resources when they are not needed or by schedule if possible.
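
Predefined retention periods are easiest to enforce when they are written down as data and applied by a scheduled job. The sketch below assumes a simple two-threshold policy. The `RetentionPolicy` type, its field names and the example thresholds are hypothetical, and most cloud providers offer native lifecycle rules that should be preferred where available.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    """How long an object stays in hot storage, and when it is disposed of."""

    archive_after: timedelta
    delete_after: timedelta


def retention_action(last_modified: date, today: date, policy: RetentionPolicy) -> str:
    """Return 'keep', 'archive' or 'delete' for an object of the given age."""
    age = today - last_modified
    if age >= policy.delete_after:
        return "delete"
    if age >= policy.archive_after:
        return "archive"
    return "keep"
```

For instance, with `RetentionPolicy(timedelta(days=90), timedelta(days=365))`, an object untouched for six months would be archived and one untouched for eighteen months would be deleted.
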
PLAY 11

Log, monitor and alert

Logging, monitoring and alerting are the bedrock of incident management. Developers and operations teams will be able to plan for and troubleshoot application issues much faster.

Skills to work on

  1. Logs are collected and stored in a central location
  2. Auditable events such as logins, failed logins, and high-value transactions are logged
  3. Warnings and errors generate adequate and clear log messages
  4. Logs of applications and APIs are monitored for suspicious activity
  5. Appropriate alerting thresholds and response escalation processes are in place
  6. The application is able to detect, escalate or alert for active incidents & threats in real time or near real time
  7. Logs are held for sufficient retention periods to allow for delayed analysis
  8. Logs are generated in a format that can be easily consumed by a centralized log management solution i.e. implement structured logging
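
Structured logging, as mentioned in the last item, can be approximated with the standard `logging` module by emitting one JSON object per line. This is a minimal sketch: the field names and the convention of passing extra fields under an `event` key are assumptions for this example, not a required schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields arrive via logger.info(..., extra={"event": {...}}),
        # a convention assumed here. Core fields are never overwritten.
        event = getattr(record, "event", None)
        if isinstance(event, dict):
            for key, value in event.items():
                payload.setdefault(key, value)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

An auditable event such as a failed login could then be recorded with `logger.warning("login failed", extra={"event": {"action": "login", "outcome": "failure"}})`, giving a central log platform named fields to filter and alert on instead of free text.
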
PLAY 12

Get production support right

The world depends on “always on” services more than ever before. An outage can affect millions of people, with real impact: they can’t pay their bills, they can’t book their flights, they can’t video call with their friends. And whether you’re having a major bug, capacity issues, or you’re down completely, customers who depend on your services expect an immediate response. (The same is true for internal teams.)

A fair on-call schedule, coupled with an on-call compensation plan, can even foster a culture of shared responsibility and help your teams learn more about what it takes to make resilient software and services, making for a better overall product and fewer outages.

Skills to work on

  1. Clearly define the on-call responsibilities, provide on-call training and access to all needed tools
  2. Implement a rich knowledge base and document every incident
  3. Check in regularly on whether the process is healthy and effective
  4. Build an effective on-call schedule. Use the right tool; we use OpsGenie
  5. Every product team member should be part of the on-call schedule either as an incident commander or an on-call developer
PLAY 13

Manage incidents

When moving fast, we expect issues. We want to respond quickly, respond in a way that doesn’t make the issue bigger, and avoid repeating the same problems, so that our users are able to get services.

Skills to work on

  1. Agree on what defines an incident and the different severity levels
  2. Define an incident runbook: provide a list of clear steps to follow when an incident happens - it brings consistency and sets a quality standard
  3. Always adopt a blameless approach and ensure psychological safety
  4. Eliminate noise and false positives as soon as possible
  5. Make your incident reports and action items visible
  6. Automate the process: opening a bridge, creating the report, following up on action items etc.
  7. Clarify and simplify the communication channels when incidents occur
PLAY 14

Sharpen your skills

This final play focuses on practicing all the above plays and enhancing them. Our coding craftsmanship is our greatest asset - we need to continuously improve it.

Skills to work on

  1. Participate in at least one code retreat each year
  2. Review Pull Requests from other teams
  3. Read & share insightful technical blogs & news across the community
  4. Step outside of your comfort zone when the opportunity presents itself
  5. Pick a skill and master it with deep understanding