The Art of Modern Ops Podcast: DevOps Metrics – Success Follows Failure

An interesting listen, with a lot of content that applies to the SCOM monitoring world.

https://soundcloud.com/user-718131608/podcast-devops-metrics-success-follows-failure

Some key takeaways that apply to SCOM monitoring in general:

Try to eliminate, or at least reduce, waste through automation, where waste is any process or manual step that doesn’t add value.
Track iterative, measured improvements, for example:
- Changes (did they make things better, whether in performance, stability or some other metric?)
- Mean Time To Recovery – have we learnt from previous outages or failures?
- Provide reports to support the above.
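
As a rough illustration of tracking Mean Time To Recovery, here is a minimal Python sketch that averages outage durations for two periods so they can be compared in a report. The incident timestamps are made up for the example, and the function name `mean_time_to_recovery` is hypothetical; in practice the start and resolved times would come from your alerting or incident tooling.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_recovery(incidents):
    """Average outage duration for (started, resolved) datetime pairs."""
    durations = [
        (resolved - started).total_seconds() for started, resolved in incidents
    ]
    if not durations:
        raise ValueError("at least one incident is required")
    return timedelta(seconds=mean(durations))


# Hypothetical incident data: (outage started, service recovered).
incidents_before = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),    # 2 hours
    (datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 15, 0)),  # 1 hour
]
incidents_after = [
    (datetime(2024, 2, 5, 8, 0), datetime(2024, 2, 5, 8, 30)),     # 30 minutes
    (datetime(2024, 2, 20, 22, 0), datetime(2024, 2, 20, 22, 30)),  # 30 minutes
]

mttr_before = mean_time_to_recovery(incidents_before)
mttr_after = mean_time_to_recovery(incidents_after)

print(f"MTTR before: {mttr_before}")  # 1:30:00
print(f"MTTR after:  {mttr_after}")   # 0:30:00
print("Improved" if mttr_after < mttr_before else "Not improved")
```

A falling MTTR from one period to the next is one sign that lessons from earlier outages are being applied.
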
There is no single pane of glass. At best there are multiple panes of glass that provide different data visualisations to different customers.
- Provide platform metrics and data to teams that support the platforms.
- Provide application metrics and data to developers.
- Provide service level metrics and data to service management.

Agree design principles such as:
- Self heal where possible
- Take regular backups and verify them with test restores
- Document / create wikis as much as possible.
- Create coherent runbooks so that when an incident does happen, you have concise, detailed resolution steps that reduce time to recovery. Review and update these as further incidents occur.
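
To make the "verify them with test restores" principle concrete, here is a minimal Python sketch that restores a backup into a scratch directory and compares its checksum with the original. Everything here is an assumption for illustration: the file names are made up, `verify_backup` is a hypothetical helper, and the "restore" is a plain file copy standing in for whatever restore procedure your backup tool actually provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source, backup):
    """Restore the backup to a scratch directory and compare it to the source."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / Path(backup).name
        # Stand-in for a real restore step (assumption: backup is a plain copy).
        shutil.copy2(backup, restored)
        return sha256_of(restored) == sha256_of(source)


# Demonstration with throwaway files.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "config.xml"
    backup = Path(workdir) / "config.xml.bak"
    source.write_text("<settings/>")
    shutil.copy2(source, backup)

    good_result = verify_backup(source, backup)  # backup matches the source

    backup.write_text("corrupted")
    bad_result = verify_backup(source, backup)   # backup no longer matches

print(good_result, bad_result)
```

Comparing against the live source only works if it has not changed since the backup was taken; for real systems, a checksum recorded at backup time is a better reference.
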

We all succeed together and we all fail together.
DevOps is a culture / philosophy, and GitOps is a set of techniques that support and enable DevOps.