An interesting listen, with a lot of content that applies to the SCOM monitoring world.
Some key takeaways that apply to SCOM monitoring in general:
- Try to reduce waste through automation (eliminating it entirely is rarely realistic), where waste is any process or manual step that doesn’t add value.
- Track iterative, measured improvements, for example:
  - Changes (did they make things better, whether in performance, stability or some other metric?)
  - Mean Time To Recovery – have we learnt from previous outages or failures?
- Provide reports to support the above.
- There is no single pane of glass. At best there are multiple panes of glass that provide different data visualisations to different customers:
  - Provide platform metrics and data to teams that support the platforms.
  - Provide application metrics and data to developers.
  - Provide service level metrics and data to service management.
- Agree design principles such as ….
  - Self heal where possible.
  - Take regular backups and verify them with test restores.
- Document / create wikis as much as possible.
- Create coherent runbooks so that when an incident does happen, you have concise but detailed resolution steps that reduce time to recovery. Review and update them after each future incident.
- We all succeed together and we all fail together.
- DevOps is a culture / philosophy, and GitOps is a set of techniques that support and enable DevOps.
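
To make the Mean Time To Recovery point a little more concrete, here is a minimal sketch of how it could be tracked month by month. Everything in it is an assumption for illustration: the incident timestamps are made up, the function name `mttr_minutes_by_month` is hypothetical, and in practice the data would come from wherever you record incidents (alert history, a ticketing system, and so on) rather than a hard-coded list.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records as (detected, resolved) pairs.
# These are illustrative values, not real outage data.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),      # 120 min
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 15, 0)),   # 60 min
    (datetime(2024, 2, 8, 8, 30), datetime(2024, 2, 8, 9, 15)),     # 45 min
    (datetime(2024, 2, 25, 22, 0), datetime(2024, 2, 25, 22, 45)),  # 45 min
]


def mttr_minutes_by_month(records):
    """Return {(year, month): mean recovery time in minutes}."""
    durations = {}
    for detected, resolved in records:
        # Group each incident by the month in which it was detected.
        key = (detected.year, detected.month)
        minutes = (resolved - detected).total_seconds() / 60
        durations.setdefault(key, []).append(minutes)
    return {key: mean(values) for key, values in sorted(durations.items())}


mttr = mttr_minutes_by_month(incidents)
for (year, month), minutes in mttr.items():
    print(f"{year}-{month:02d}: MTTR {minutes:.0f} min")
```

With the sample data the figure drops from 90 minutes in January to 45 minutes in February, which is the kind of trend a report could use to show whether lessons from earlier outages are paying off.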