An interesting listen with a lot of content that applies to the SCOM monitoring world.

https://soundcloud.com/user-718131608/podcast-devops-metrics-success-follows-failure

Some key takeaways that are applicable to SCOM monitoring in general:

  • Try to reduce (eliminating it entirely is probably unrealistic) waste through automation, where waste means useless process or manual steps that don’t add value.
  • Track iterative, measured improvements in areas such as:
    • Changes (did they make things better, whether in performance, stability or some other metric?)
    • Mean Time To Recovery – have we learnt from previous outages or failures?
    • Provide reports to support the above.
  • There is no single pane of glass. At best there are multiple panes of glass that provide different data visualisations to different customers.
    • Provide platform metrics and data to teams that support the platforms.
    • Provide application metrics and data to developers.
    • Provide service level metrics and data to service management.
  • Agree design principles such as:
    • Self heal where possible
    • Take regular backups and verify them with test restores
    • Document / create wikis as much as possible.
    • Create coherent runbooks so that when an incident does happen, you have concise and detailed resolution steps which reduce time to recovery. These need to be reviewed / updated as further incidents occur.
  • We all succeed together and we all fail together.
  • DevOps is a culture / philosophy and GitOps is a set of techniques that support and enable DevOps.
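
The Mean Time To Recovery point above can be made concrete with a small calculation. The following is a minimal sketch in Python and is not something from the podcast: the incident timestamps are made-up sample data and the function name `mean_time_to_recovery` is hypothetical. In a real environment the start and resolution times would have to come from wherever incidents are recorded.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return the average outage duration as a timedelta.

    `incidents` is a list of (started, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual outage durations, starting from a zero timedelta.
    total = sum(
        (resolved - started for started, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical sample data: two outages per period.
previous_period = [
    (datetime(2023, 1, 5, 9, 0), datetime(2023, 1, 5, 11, 0)),
    (datetime(2023, 2, 10, 14, 0), datetime(2023, 2, 10, 15, 0)),
]
current_period = [
    (datetime(2023, 4, 3, 8, 0), datetime(2023, 4, 3, 8, 30)),
    (datetime(2023, 5, 20, 22, 0), datetime(2023, 5, 20, 23, 0)),
]

previous_mttr = mean_time_to_recovery(previous_period)
current_mttr = mean_time_to_recovery(current_period)

# A falling MTTR suggests lessons from earlier outages are being applied.
improved = current_mttr < previous_mttr
print(f"Previous MTTR: {previous_mttr}")
print(f"Current MTTR:  {current_mttr}")
print(f"Improved: {improved}")
```

Comparing the figure period over period, rather than looking at a single number, is what turns it into the kind of report that supports the "have we learnt from previous outages?" question.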

By graham