| 1 | +--- |
| 2 | +title: |
| 3 | + "Making observability fun: How we increased engineers' confidence in incident |
| 4 | + management using a game" |
| 5 | +linkTitle: Skyscanner using OTel Demo |
| 6 | +date: 2024-02-26 |
| 7 | +author: >- |
| 8 | + [Jordi Bisbal Ansaldo](https://github.com/jordibisbal8) (Skyscanner) |
| 9 | +cSpell:ignore: Ansaldo Bisbal Jordi runbooks Skyscanner upskilled Yankova |
| 10 | +--- |
| 11 | + |
| 12 | +At [Skyscanner](https://www.skyscanner.net), as in many organizations, teams |
| 13 | +tend to follow specific runbooks for individual failure modes. In modern, complex
| 14 | +distributed systems, however, most failures are unknown in advance, so runbooks
| 15 | +are only partially applicable.
| 16 | + |
| 17 | +After migrating our telemetry data to the OpenTelemetry standards at Skyscanner, |
| 18 | +we now have richer instrumentation and can rely on observability directly. As a |
| 19 | +result, we are ready to adopt a new |
| 20 | +[observability mindset](https://charity.wtf/2019/09/20/love-and-alerting-in-the-time-of-cholera-and-observability/), |
| 21 | +which requires training our engineers to work effectively with the new |
| 22 | +ecosystem. Such training lets them react efficiently to any issue, known or
| 23 | +unknown, even under pressure.
| 24 | + |
| 25 | +We believe that the best way to gain this knowledge isn’t through
| 26 | +one-time viewings of documents or videos. Instead, it’s through practical |
| 27 | +exercises that include situations with never-before-seen (or at least rarely |
| 28 | +seen) problems. This helps the company reduce the time to mitigate an issue |
| 29 | +(TTM), which runs from the moment a first responder acknowledges the incident
| 30 | +until users stop suffering from it.
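
As a minimal sketch of the TTM definition above, the metric is simply the gap between two timestamps. The function name and the example timestamps are hypothetical and only illustrate the arithmetic; they are not taken from Skyscanner's tooling.

```python
from datetime import datetime, timedelta


def time_to_mitigate(acknowledged_at: datetime, mitigated_at: datetime) -> timedelta:
    """Return TTM: time from first-responder acknowledgement until users stop being affected."""
    if mitigated_at < acknowledged_at:
        raise ValueError("mitigation cannot precede acknowledgement")
    return mitigated_at - acknowledged_at


# Hypothetical timestamps for a single incident.
acknowledged = datetime(2024, 2, 26, 10, 5)
mitigated = datetime(2024, 2, 26, 10, 47)

ttm = time_to_mitigate(acknowledged, mitigated)
print(f"TTM: {ttm.total_seconds() / 60:.0f} minutes")  # TTM: 42 minutes
```

Tracking this value per incident before and after training sessions is one way to check whether the exercises are having the intended effect.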
| 31 | + |
| 32 | +## Environment |
| 33 | + |
| 34 | +To begin with, we need to set up an environment that demonstrates the best |
| 35 | +practices for monitoring and debugging using OpenTelemetry instrumentation and |
| 36 | +observability. For this, we propose the use of the official |
| 37 | +[OpenTelemetry Demo](/docs/demo/), which is a realistic example of a distributed |
| 38 | +system called Astronomy Shop. Thanks to the |
| 39 | +[OpenTelemetry Protocol](/docs/specs/otlp/) (OTLP), we can simply point
| 40 | +the standard OTLP exporter in the Collector to
| 41 | +[New Relic](https://newrelic.com/), our chosen observability platform at
| 42 | +Skyscanner, which, like other platforms, fully embraces open standards for
| 43 | +ingesting telemetry data.
| 44 | + |
| 45 | +This system contains regressions that can be injected into the platform and |
| 46 | +helps us demonstrate the importance of Service Level Objectives (SLOs),
| 47 | +tracing, logs, metrics, etc. For instance, we can observe traffic flow through |
| 48 | +various components, as shown in the image below. Since the demo is open
| 49 | +source, we can easily propose new features, which are then reviewed by
| 50 | +OpenTelemetry contributors.
| 51 | + |
| 52 | + |
| 53 | + |
| 54 | +## Observability game day |
| 55 | + |
| 56 | +Once the environment is set up, we can introduce the Observability Game Day, an |
| 57 | +initiative based on the Wheel of Misfortune practices that Google uses and |
| 58 | +describes in the [Site Reliability Engineering book](https://sre.google/books/). |
| 59 | + |
| 60 | +This game simulates a production incident, where a moderator known as the game |
| 61 | +master (GM) conducts the session and someone from the audience spins the wheel |
| 62 | +and explains an incident or outage. The participants are then divided into teams |
| 63 | +and tasked with identifying and resolving the issue as quickly as possible. If |
| 64 | +the solution is not optimal, the GM can help by introducing a new tool or view, |
| 65 | +which gives a different perspective on how to tackle the incident (knowledge |
| 66 | +sharing). This exercise can be repeated multiple times for different incidents. |
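
The "spin the wheel" step can be sketched as a random draw without repetition, so each round of the session covers a different incident. This is only an illustration: the scenario names below are hypothetical placeholders, not the feature flags shipped with the OpenTelemetry Demo, and `spin_wheel` is an assumed helper name.

```python
import random

# Hypothetical scenario names; the demo's actual injectable regressions may differ.
SCENARIOS = [
    "product catalog errors",
    "slow recommendations",
    "payment failures",
    "cart service outage",
]


def spin_wheel(remaining: list[str], rng: random.Random) -> str:
    """Pick one scenario at random and remove it so later rounds do not repeat it."""
    if not remaining:
        raise ValueError("no scenarios left to play")
    choice = rng.choice(remaining)
    remaining.remove(choice)
    return choice


rng = random.Random()  # pass a seed, e.g. random.Random(7), for a reproducible session
wheel = list(SCENARIOS)  # copy, so the master list stays intact
while wheel:
    print("Next incident:", spin_wheel(wheel, rng))
```

The GM would then inject the chosen regression into the environment and start the clock for the teams.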
| 67 | + |
| 68 | + |
| 69 | + |
| 70 | +## Results |
| 71 | + |
| 72 | +Multiple Skyscanner teams have already completed the Observability Game Day,
| 73 | +with each team's observability expert (ambassador) running the session. The
| 74 | +feedback has been extremely positive: 90% of respondents say that after the
| 75 | +Game Day they feel more confident debugging production systems and would love
| 76 | +to have further sessions.
| 77 | + |
| 78 | +- Hugely valuable to run against real services and to compare and contrast |
| 79 | + different debugging methods. I'm certain everyone, regardless of skill level, |
| 80 | + will have got something out of the session - I know I did! Thank you for |
| 81 | + taking the time to set this up and promoting it for us - |
| 82 | + [Dominic Fraser](https://github.com/dominicfraser) (Senior Software Engineer) |
| 83 | +- It is a really great (company-wide) initiative to get people upskilled in |
| 84 | + observability and OpenTelemetry/New Relic and I personally found it very |
| 85 | + useful, as well as a lot of fun! :D - Polly Yankova (Software Engineer) |
| 86 | + |
| 87 | +In addition, we learned that: |
| 88 | + |
| 89 | +1. OTLP makes it incredibly simple to integrate a standard application with an |
| 90 | + observability vendor. Just point it to the right endpoint and job done. |
| 91 | +2. Our winning teams relied primarily on tracing data to analyze regressions,
| 92 | + which helped them understand the root cause faster. Tracing FTW!
| 93 | +3. Front-end engineers found the Game Day lacked focus on client-side |
| 94 | + observability, so we decided to contribute upstream (see next steps below). |
| 95 | + This was my first contribution to the project, and it was a great experience! |
| 96 | + Maintainers were very welcoming and helped me to test and release. Thanks! |
| 97 | + |
| 98 | +## Next steps |
| 99 | + |
| 100 | +The next action is to run sessions for all the engineering teams in the company |
| 101 | +and convert them into a Skyscanner learning course. This way, the content can be |
| 102 | +used during the onboarding process for new joiners or even reviewed at any time |
| 103 | +as a refresher for those who have been in the company longer. In addition, after |
| 104 | +observing common feedback, we identified that it would be beneficial to extend |
| 105 | +the current incidents to include more front-end-specific ones, such as incidents |
| 106 | +triggered by browser traffic. To achieve this, we have contributed to the |
| 107 | +OpenTelemetry Demo and enabled these features for other interested parties. For |
| 108 | +more information, please have a look at the |
| 109 | +[raised PR](https://github.com/open-telemetry/opentelemetry-demo/pull/1345). |