Commit 0986511

Merge branch 'main' into js_instrumentation_doc_fix
2 parents 80a3d3e + 19720bb commit 0986511

4 files changed: +137 -0 lines changed

@@ -0,0 +1,109 @@
1+
---
2+
title:
3+
"Making observability fun: How we increased engineers' confidence in incident
4+
management using a game"
5+
linkTitle: Skyscanner using OTel Demo
6+
date: 2024-02-26
7+
author: >-
8+
[Jordi Bisbal Ansaldo](https://github.com/jordibisbal8) (Skyscanner)
9+
cSpell:ignore: Ansaldo Bisbal Jordi runbooks Skyscanner upskilled Yankova
10+
---
11+
12+
At [Skyscanner](https://www.skyscanner.net), as in many organizations, teams
13+
tend to follow specific runbooks for individual failure modes. With modern and
14+
complex distributed systems, however, most errors are unknowns, so runbooks
15+
are only partially applicable.
16+
17+
After migrating our telemetry data to the OpenTelemetry standards at Skyscanner,
18+
we now have richer instrumentation and can rely on observability directly. As a
19+
result, we are ready to adopt a new
20+
[observability mindset](https://charity.wtf/2019/09/20/love-and-alerting-in-the-time-of-cholera-and-observability/),
21+
which requires training our engineers to work effectively with the new
22+
ecosystem. This allows them to react efficiently to any known or unknown issues,
23+
even under pressure.
24+
25+
To achieve this, we believe that the best way to gain knowledge isn’t through
26+
one-time viewings of documents or videos. Instead, it’s through practical
27+
exercises that include situations with never-before-seen (or at least rarely
28+
seen) problems. This helps the company reduce the time to mitigate an issue
29+
(TTM), measured from when a first responder acknowledges the incident until
30+
users are no longer affected by it.
31+
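The TTM definition above reduces to a difference between two timestamps. The following is a minimal sketch of that arithmetic; the function name and the timestamps are hypothetical and are not taken from Skyscanner's tooling.

```python
from datetime import datetime, timedelta


def time_to_mitigate(acknowledged_at: datetime, impact_ended_at: datetime) -> timedelta:
    """Return the TTM: time from acknowledgement until users stop being affected."""
    if impact_ended_at < acknowledged_at:
        raise ValueError("impact cannot end before the incident is acknowledged")
    return impact_ended_at - acknowledged_at


# Hypothetical timestamps for a single Game Day incident.
acknowledged = datetime(2024, 2, 26, 10, 5)
mitigated = datetime(2024, 2, 26, 10, 47)

ttm = time_to_mitigate(acknowledged, mitigated)
print(ttm)  # 0:42:00
```

Recording both timestamps for every exercise would make it possible to compare TTM across sessions, assuming the acknowledgement and mitigation times are captured consistently.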
32+
## Environment
33+
34+
To begin with, we need to set up an environment that demonstrates the best
35+
practices for monitoring and debugging using OpenTelemetry instrumentation and
36+
observability. For this, we propose the use of the official
37+
[OpenTelemetry Demo](/docs/demo/), which is a realistic example of a distributed
38+
system called Astronomy Shop. Thanks to the
39+
[OpenTelemetry Protocol](/docs/specs/otlp/) (OTLP), it allows us to simply point
40+
the standard OTLP exporter in the Collector to
41+
[New Relic](https://newrelic.com/), our chosen observability platform at
42+
Skyscanner which, like other platforms, is fully embracing open standards to
43+
ingest telemetry data.
44+
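As a rough illustration of pointing the Collector's OTLP exporter at a vendor, the sketch below builds a Collector configuration as a Python dictionary and prints it as JSON. The endpoint URL, the `api-key` header, and the `NEW_RELIC_LICENSE_KEY` variable name are assumptions based on New Relic's public OTLP documentation, not values from this post, and the output has not been checked against a running Collector.

```python
import json

# Assumed values, not taken from the post: New Relic's OTLP/HTTP endpoint and
# the name of the environment variable that holds the license key.
NEW_RELIC_OTLP_ENDPOINT = "https://otlp.nr-data.net:4318"
LICENSE_KEY_ENV_VAR = "NEW_RELIC_LICENSE_KEY"


def build_collector_config(endpoint: str, key_env_var: str) -> dict:
    """Build a Collector config that forwards all OTLP signals to one vendor."""
    exporter = "otlphttp/newrelic"  # hypothetical exporter instance name
    return {
        # Accept OTLP from the demo services over gRPC and HTTP.
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "exporters": {
            exporter: {
                "endpoint": endpoint,
                # The Collector expands ${env:NAME} when it loads the config.
                "headers": {"api-key": "${env:" + key_env_var + "}"},
            }
        },
        "service": {
            "pipelines": {
                signal: {"receivers": ["otlp"], "exporters": [exporter]}
                for signal in ("traces", "metrics", "logs")
            }
        },
    }


config = build_collector_config(NEW_RELIC_OTLP_ENDPOINT, LICENSE_KEY_ENV_VAR)
print(json.dumps(config, indent=2))
```

JSON is valid YAML for a structure like this one, so the printed document could in principle be saved and passed to the Collector as a configuration file. Switching vendors would then only mean changing the endpoint and header.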
45+
This system contains regressions that can be injected into the platform and
46+
helps us demonstrate the importance of Service Level Objectives (SLOs),
47+
tracing, logs, metrics, etc. For instance, we can observe traffic flow through
48+
various components, as shown in the image below. Since the demo is part of the
49+
open source OpenTelemetry ecosystem, we can easily contribute new features,
50+
which are reviewed by OpenTelemetry contributors.
51+
52+
![Distributed tracing example in Astronomy Shop](tracing-example.png)
53+
54+
## Observability game day
55+
56+
Once the environment is set up, we can introduce the Observability Game Day, an
57+
initiative based on the Wheel of Misfortune practices that Google uses and
58+
describes in the [Site Reliability Engineering book](https://sre.google/books/).
59+
60+
This game simulates a production incident. A moderator, known as the game
61+
master (GM), conducts the session, and someone from the audience spins the wheel
62+
and explains an incident or outage. The participants are then divided into teams
63+
and tasked with identifying and resolving the issue as quickly as possible. If
64+
the solution is not optimal, the GM can help by introducing a new tool or view,
65+
which gives a different perspective on how to tackle the incident (knowledge
66+
sharing). This exercise can be repeated multiple times for different incidents.
67+
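The mechanics of a round can be sketched in a few lines: spin the wheel to pick an incident, then split the participants into teams. The incident names, participant names, and helper functions below are hypothetical and serve only to illustrate the flow described above.

```python
import random


def spin_wheel(incidents: list, rng: random.Random) -> str:
    """Pick one incident at random and remove it so it is not replayed."""
    if not incidents:
        raise ValueError("no incidents left on the wheel")
    choice = rng.choice(incidents)
    incidents.remove(choice)
    return choice


def split_into_teams(participants: list, team_count: int, rng: random.Random) -> list:
    """Shuffle the participants and deal them round-robin into teams."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return [shuffled[i::team_count] for i in range(team_count)]


# Hypothetical incidents and participants for one Game Day session.
incidents = ["checkout latency spike", "product catalog errors", "cart service outage"]
participants = ["Ana", "Ben", "Chloe", "Dev", "Eli"]

rng = random.Random()  # pass a seed, e.g. random.Random(42), for a repeatable session
incident = spin_wheel(incidents, rng)
teams = split_into_teams(participants, team_count=2, rng=rng)

print(f"Incident: {incident}")
for number, team in enumerate(teams, start=1):
    print(f"Team {number}: {', '.join(team)}")
```

Removing the selected incident from the wheel is a design choice that matches repeating the exercise for different incidents; a GM who wants to replay a scenario could leave it in place.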
68+
![Wheel of misfortune example](wheel.png)
69+
70+
## Results
71+
72+
The Observability Game Day has already been completed by multiple Skyscanner
73+
teams, with each team's observability expert (ambassador) running the session.
74+
Participants have given extremely positive feedback: 90% of respondents
75+
say that after the Game Day, they feel more confident debugging production
76+
systems and would love to have further sessions.
77+
78+
- Hugely valuable to run against real services and to compare and contrast
79+
different debugging methods. I'm certain everyone, regardless of skill level,
80+
will have got something out of the session - I know I did! Thank you for
81+
taking the time to set this up and promoting it for us -
82+
[Dominic Fraser](https://github.com/dominicfraser) (Senior Software Engineer)
83+
- It is a really great (company-wide) initiative to get people upskilled in
84+
observability and OpenTelemetry/New Relic and I personally found it very
85+
useful, as well as a lot of fun! :D - Polly Yankova (Software Engineer)
86+
87+
In addition, we learned that:
88+
89+
1. OTLP makes it incredibly simple to integrate a standard application with an
90+
observability vendor. Just point it to the right endpoint and job done.
91+
2. Our winning teams relied primarily on tracing data to analyze regressions,
92+
which helped them understand the root cause faster. Tracing FTW!
93+
3. Front-end engineers found the Game Day lacked focus on client-side
94+
observability, so we decided to contribute upstream (see next steps below).
95+
This was my first contribution to the project, and it was a great experience!
96+
Maintainers were very welcoming and helped me to test and release. Thanks!
97+
98+
## Next steps
99+
100+
The next action is to run sessions for all the engineering teams in the company
101+
and convert them into a Skyscanner learning course. This way, the content can be
102+
used during the onboarding process for new joiners or even reviewed at any time
103+
as a refresher for those who have been in the company longer. In addition, after
104+
observing common feedback, we identified that it would be beneficial to extend
105+
the current incidents to include more front-end-specific ones, such as incidents
106+
triggered by browser traffic. To achieve this, we have contributed to the
107+
OpenTelemetry Demo and enabled these features for other interested parties. For
108+
more information, please have a look at the
109+
[raised PR](https://github.com/open-telemetry/opentelemetry-demo/pull/1345).

static/refcache.json

+28
@@ -223,6 +223,10 @@
223223
"StatusCode": 206,
224224
"LastSeen": "2024-01-30T06:06:13.062554-05:00"
225225
},
226+
"https://charity.wtf/2019/09/20/love-and-alerting-in-the-time-of-cholera-and-observability/": {
227+
"StatusCode": 200,
228+
"LastSeen": "2024-02-26T10:53:38.116124+01:00"
229+
},
226230
"https://circleci.com": {
227231
"StatusCode": 206,
228232
"LastSeen": "2024-01-30T05:18:29.78394-05:00"
@@ -2291,6 +2295,10 @@
22912295
"StatusCode": 200,
22922296
"LastSeen": "2024-01-25T10:54:57.162378745Z"
22932297
},
2298+
"https://github.com/dominicfraser": {
2299+
"StatusCode": 200,
2300+
"LastSeen": "2024-02-26T10:53:39.237712+01:00"
2301+
},
22942302
"https://github.com/dotansimha/graphql-yoga": {
22952303
"StatusCode": 200,
22962304
"LastSeen": "2024-01-30T05:18:34.524624-05:00"
@@ -2475,6 +2483,10 @@
24752483
"StatusCode": 200,
24762484
"LastSeen": "2024-01-30T16:14:54.527183-05:00"
24772485
},
2486+
"https://github.com/jordibisbal8": {
2487+
"StatusCode": 200,
2488+
"LastSeen": "2024-02-26T10:53:37.290066+01:00"
2489+
},
24782490
"https://github.com/joshleecreates/": {
24792491
"StatusCode": 200,
24802492
"LastSeen": "2024-01-30T05:18:13.610639-05:00"
@@ -3011,6 +3023,10 @@
30113023
"StatusCode": 200,
30123024
"LastSeen": "2024-01-30T06:05:58.807859-05:00"
30133025
},
3026+
"https://github.com/open-telemetry/opentelemetry-demo/pull/1345": {
3027+
"StatusCode": 200,
3028+
"LastSeen": "2024-02-26T10:53:40.03196+01:00"
3029+
},
30143030
"https://github.com/open-telemetry/opentelemetry-demo/pull/432": {
30153031
"StatusCode": 200,
30163032
"LastSeen": "2024-01-30T15:26:30.96845-05:00"
@@ -4851,6 +4867,10 @@
48514867
"StatusCode": 200,
48524868
"LastSeen": "2024-01-18T19:02:19.249572-05:00"
48534869
},
4870+
"https://newrelic.com/": {
4871+
"StatusCode": 206,
4872+
"LastSeen": "2024-02-26T10:53:38.368111+01:00"
4873+
},
48544874
"https://newrelic.com/blog/authors/daniel-kim": {
48554875
"StatusCode": 206,
48564876
"LastSeen": "2024-01-18T19:10:48.917326-05:00"
@@ -6967,6 +6987,10 @@
69676987
"StatusCode": 206,
69686988
"LastSeen": "2024-01-30T15:25:13.019086-05:00"
69696989
},
6990+
"https://sre.google/books/": {
6991+
"StatusCode": 206,
6992+
"LastSeen": "2024-02-26T10:53:38.643051+01:00"
6993+
},
69706994
"https://stackoverflow.com/questions/5626193/what-is-monkey-patching": {
69716995
"StatusCode": 200,
69726996
"LastSeen": "2024-01-18T19:07:28.672979-05:00"
@@ -8119,6 +8143,10 @@
81198143
"StatusCode": 200,
81208144
"LastSeen": "2024-01-30T06:01:24.01921-05:00"
81218145
},
8146+
"https://www.skyscanner.net": {
8147+
"StatusCode": 206,
8148+
"LastSeen": "2024-02-26T10:53:37.476242+01:00"
8149+
},
81228150
"https://www.slideshare.net/Altinity/osa-con-2022-signal-correlation-the-ho11y-grail-michael-hausenblas-awspdf": {
81238151
"StatusCode": 206,
81248152
"LastSeen": "2024-01-18T19:56:02.307051-05:00"
