From cef7595ecbf22cf6cad4514aa2bc63ceb28122e9 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Sat, 28 Oct 2023 10:05:10 -0700
Subject: [PATCH 01/13] wip count success/failure/drops

---
 text/metrics/0000-sdk-self-metrics.md | 245 ++++++++++++++++++++++++++
 1 file changed, 245 insertions(+)
 create mode 100644 text/metrics/0000-sdk-self-metrics.md

diff --git a/text/metrics/0000-sdk-self-metrics.md b/text/metrics/0000-sdk-self-metrics.md
new file mode 100644
index 000000000..5d32129eb
--- /dev/null
+++ b/text/metrics/0000-sdk-self-metrics.md
@@ -0,0 +1,245 @@
+# OpenTelemetry Export-pipeline metrics
+
+Propose a uniform standard for OpenTelemetry SDK and Collector
+export-pipeline metrics with three standard levels of detail.
+
+## Motivation
+
+OpenTelemetry has pending requests to standardize the metrics emitted
+by SDKs. At the same time, the OpenTelemetry Collector is becoming a
+stable and critical part of the ecosystem, and it has different
+semantic conventions. Here we attempt to unify them.
+
+## Explanation
+
+The OpenTelemetry Collector's pipeline metrics were derived from the
+OpenCensus collector. There is no original source material explaining
+the current state of metrics in the OTel collector.
+
+### Collector metrics
+
+The OpenTelemetry collector code base was audited for metrics usage
+detail around the time of the v0.88.0 release. Here is a summary of
+the current state of the Collector regarding export-pipeline metrics.
+
+The core collector formerly contained a package named `obsreport`,
+which had a uniform interface dedicated to each type of pipeline
+component. This package has been migrated into the commonly-used
+helper classes known as `receiverhelper`, `processorhelper`, and
+`exporterhelper`.
+
+Obsreport is responsible for giving collector metrics a uniform
+appearance. Metric names were created using OpenCensus style, which
+uses a `/` character to indicate hierarchy and a `.` to separate the
+operative verb and noun. This library creates metrics named, in
+general, `{component-type}/{verb}.{plural-noun}`, with component types
+`receiver`, `processor`, and `exporter`, and with signal-specific
+nouns `spans`, `metric_points`, and `logs` corresponding with the unit
+of information for the tracing, metrics, and logs signals,
+respectively.
+
+Early adopters of the Collector would use Prometheus to read these
+metrics, and Prometheus accepts neither `/` nor `.` in metric names.
+The Prometheus integration adds an `otelcol_` prefix and replaces the
+invalid characters with `_`, so the metric `receiver/accepted.spans`
+appears as `otelcol_receiver_accepted_spans`, for example.
+
+#### Obsreport receiver
+
+For receivers, the obsreport library counts items in two ways:
+
+1. Receiver `accepted` items. Items that are received and
+   successfully consumed by the pipeline.
+2. Receiver `refused` items. Items that are received and fail to be
+   consumed by the pipeline.
+
+Items are exclusively counted in one of these counts. The lifetime
+average failure rate of the receiver component is defined as
+`refused / (accepted + refused)`.
+
+The `accepted` metric does not "lead" the `refused` metric, because
+items are not counted until the end of the receiver operation. A
+single interface used by receiver components, with `StartOp(...)` and
+`EndOp(..., numItems)` methods, provides both kinds of
+instrumentation.
+
+Note there are a few well-known exporter and processor components that
+return success unconditionally, preventing failures from passing back
+to the producers. With this behavior, the `refused` count becomes
+unused.
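+
+To make this pattern concrete, the following is a minimal Go sketch of
+the Start/End style of receiver instrumentation described above. It is
+an illustration only: the type and method names here are invented, and
+the real `receiverhelper` API differs in its details.
+
+```go
+package example
+
+import (
+	"context"
+
+	"go.opentelemetry.io/otel/metric"
+)
+
+// receiverMetrics holds the two exclusive counters. Items are counted
+// exactly once, at the end of the operation, which is why `accepted`
+// never "leads" `refused`.
+type receiverMetrics struct {
+	accepted metric.Int64Counter // items consumed by the pipeline
+	refused  metric.Int64Counter // items the pipeline failed to consume
+}
+
+// EndOp records the outcome of a single receive operation.
+func (r *receiverMetrics) EndOp(ctx context.Context, numItems int64, err error) {
+	if err != nil {
+		r.refused.Add(ctx, numItems)
+		return
+	}
+	r.accepted.Add(ctx, numItems)
+}
+```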
+
+#### Collector: Obsreport processor metrics
+
+For processors, the obsreport library counts items in three ways:
+
+1. Processor `accepted` items. Defined as the number of items that are passed to the next component and return successfully.
+2. Processor `dropped` items. A count of items that are deliberately excluded from the output; these are counted as accepted by the preceding pipeline component but are not transmitted.
+3. Processor `refused` items. Defined as the number of items that are passed to the next component and fail.
+
+Items are exclusively counted in one of these counts. The average drop rate
+can be defined as `dropped / (accepted + dropped + refused)`.
+
+Note there are a few well-known exporter and processor components that
+return success unconditionally, preventing failures from passing back
+to the producers. With this behavior, the `refused` count becomes
+unused.
+
+#### Collector: Obsreport exporter metrics
+
+The `obsreport_exporter` interface counts items in two ways:
+
+1. Exporter `sent` items. Items that are sent and succeed.
+2. Exporter `send_failed` items. Items that are sent and fail.
+
+Items are exclusively counted in one of these counts. The average
+failure rate is defined as `send_failed / (sent + send_failed)`.
+
+### Jaeger trace SDK metrics
+
+Jaeger SDKs expose metrics on the "Reporter", which includes
+"Success", "Failure", "Dropped" (Counters), and "Queue_Length"
+(UpDownCounter). See [here](https://github.com/jaegertracing/jaeger-client-go/blob/8d8e8fcfd04de42b8482476abac6a902fca47c18/metrics.go#L22-L106).
+
+### Analysis
+
+#### SDK perspective
+
+Considering the Jaeger SDK, data items are counted in exactly one of
+three counters. While unambiguous, the use of three
+exclusively-counted metrics means that computing any useful ratio
+about SDK performance requires querying three timeseries, and any pair
+of these metrics tells an incomplete story.
+
+With three exclusive counters, there is no way to offer varying levels
+of detail. If we wanted to omit any one of these timeseries, the other
+two would have to change meaning. While items that drop are in some
+ways a failure, they are counted exclusively and so cannot be combined
+with the failure count to produce a less detailed view.
+
+#### Collector perspective
+
+Collector counters are exclusive. As with SDKs, items that enter a
+processor are counted in one of three ways, and computing a meaningful
+ratio requires all three timeseries. If the processor is a sampler,
+for example, the effective sampling rate is computed as
+`(accepted+refused)/(accepted+refused+dropped)`.
+
+While the collector defines and emits metrics sufficient for
+monitoring each individual pipeline component, taken as a whole there
+is substantial redundancy in having so many exclusive counters. For
+example, when a collector pipeline features no processors, the
+receiver's `refused` count is expected to equal the exporter's
+`send_failed` count.
+
+When there are several processors, it is primarily the number of
+dropped items that we are interested in counting. When there are
+multiple sequential processors in a pipeline, however, counting the
+total number of items at each stage in a multi-processor pipeline
+leads to over-counting in aggregate. For example, if you combine
+`accepted` and `refused` for two adjacent processors, then remove the
+metric attribute which distinguishes them, the resulting sum will be
+twice the number of items processed by the pipeline.
+
+The same logic suggests that multiple sequential collectors in a
+pipeline cannot use the same metric names, otherwise removal of the
+distinguishing metric attribute would cause over-counting of the
+pipeline.
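+
+A small worked example of this aggregation hazard, with invented
+counts, is sketched below in Go.
+
+```go
+package main
+
+import "fmt"
+
+func main() {
+	// The same 100 items flow through two adjacent processors, and
+	// each processor exclusively counts every item it handles.
+	// (All numbers are invented for illustration.)
+	batch := map[string]int{"accepted": 95, "refused": 5}
+	sampler := map[string]int{"accepted": 90, "refused": 5, "dropped": 5}
+
+	// Removing the attribute that distinguishes the two processors
+	// sums their counts, double-counting the pipeline:
+	total := batch["accepted"] + batch["refused"] +
+		sampler["accepted"] + sampler["refused"] + sampler["dropped"]
+	fmt.Println(total) // 200 counted, though only 100 items entered
+}
+```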
+
+### Pipeline monitoring
+
+The term _Stage_ is used to describe a single component in an
+export pipeline.
+
+The term _Station_ is used to describe a location in the export
+pipeline where the participating stages are part of the same logical
+failure domain. Typically, each SDK or Collector is considered a
+station.
+
+#### Station integrity principle
+
+The [OpenTelemetry library guidelines (point
+4)](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/library-guidelines.md#requirements)
+describe a separation of protocol-dependent ("receivers",
+"exporters") and protocol-independent ("processors") parts. Here we
+refer to this combination of parts as a station, belonging to a single
+failure domain, because:
+
+1. Logic internal to a station is presumably non-lossy. Dropping
+   within a station is presumed to be intentional, as distinct from
+   the case of failures.
+2. Under normal circumstances, we expect all-or-none failures for
+   individual stations.
+
+These qualities of the station will allow us to vary the level of
+detail between basic and normal-level monitoring without information
+loss.
+
+#### Pipeline stage-name uniqueness
+
+The Pipeline Stage Name Uniqueness requirement developed here avoids
+over-counting in an export pipeline by ensuring that no single metric
+name counts items at more than one distinct component. This rule
+prevents counting items of telemetry sent by SDKs and Collectors in
+the same metric; it also prevents counting items of telemetry sent
+through a multi-tier arrangement of Collectors.
+
+In a standard deployment of OpenTelemetry, we expect one, two, or
+three stations in a collection pipeline. The names given to this
+standard set of stations are:
+
+- `sdk`: an original source of new telemetry
+- `agent`: a collector with operations "local" to the `sdk`
+- `gateway`: a collector serving as a proxy to an external service.
+
+This is not meant as an exclusive set of station names. Users should
+be given the ability to configure the station name used by particular
+instances of the OpenTelemetry Collector. It may even be desirable to
+support configuring "sub-stations" within a larger pipeline, for
+example when there are connectors in use; however, if so, the
+collector must enforce that pipeline-stage names are unique within a
+pipeline.
+
+#### Pipeline conservation principle
+
+The station integrity principle leads to the axiom: items that are
+transmitted, leading to success or failure, cannot have been dropped.
+
+The second principle developed here establishes that items going into
+a station either succeed or fail.
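+
+Taken together, these two principles can be summarized, in this
+proposal's own terms, as a conservation identity that holds for each
+station: `items received = items exported + items dropped`, where
+items that fail are counted within the received and exported totals
+rather than as a separate outcome.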
+
+## Internal details
+
+From a technical perspective, how do you propose accomplishing the proposal? In particular, please explain:
+
+* How the change would impact and interact with existing functionality
+* Likely error modes (and how to handle them)
+* Corner cases (and how to handle them)
+
+While you do not need to prescribe a particular implementation - indeed, OTEPs should be about **behaviour**, not implementation! - it may be useful to provide at least one suggestion as to how the proposal *could* be implemented. This helps reassure reviewers that implementation is at least possible, and often inspires them to think more deeply about trade-offs, alternatives, etc.
+
+## Trade-offs and mitigations
+
+What are some (known!) drawbacks? What are some ways that they might be mitigated?
+
+Note that mitigations do not need to be complete *solutions*, and that they do not need to be accomplished directly through your proposal. A suggested mitigation may even warrant its own OTEP!
+
+## Prior art and alternatives
+
+What are some prior and/or alternative approaches? For instance, is there a corresponding feature in OpenTracing or OpenCensus? What are some ideas that you have rejected?
+
+## Open questions
+
+What are some questions that you know aren't resolved yet by the OTEP? These may be questions that could be answered through further discussion, implementation experiments, or anything else that the future may bring.
+
+## Future possibilities
+
+What are some future changes that this proposal would enable?
+
+
+
+
+
+
+
+semantic-conventions/

From 4d3cfae892ecef3c202d0c00c6e73682b130c9b4 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Mon, 30 Oct 2023 20:30:02 -0700
Subject: [PATCH 02/13] rough draft

---
 text/images/otel-pipeline-monitoring.png      | Bin 0 -> 47704 bytes
 ...metrics.md => 0000-pipeline-monitoring.md} | 178 ++++++++++++++----
 2 files changed, 140 insertions(+), 38 deletions(-)
 create mode 100644 text/images/otel-pipeline-monitoring.png
 rename text/metrics/{0000-sdk-self-metrics.md => 0000-pipeline-monitoring.md} (52%)

diff --git a/text/images/otel-pipeline-monitoring.png b/text/images/otel-pipeline-monitoring.png
new file mode 100644
index 0000000000000000000000000000000000000000..dc42a6625a2823b6df1ae34b2ef446e5e5c28951
GIT binary patch
literal 47704
[binary data for the otel-pipeline-monitoring.png diagram omitted]
z;hny>ap*N&r|)o^YM+dU$7U&Y1;WOT3_m$*YiM=m7l=%|?i=|vsgfGqHdPw/`. For example, if there +were two `batch` processors in a collection pipeline (e.g., one for +error spans and one for non-error spans) they might use the names +`batch/error` and `batch/noerror`. + +#### Pipeline monitoring diagram -What are some prior and/or alternative approaches? For instance, is there a corresponding feature in OpenTracing or OpenCensus? What are some ideas that you have rejected? +The relationship between items received, dropped, and exported is +shown in the following diagram. -## Open questions +![pipeline monitoring metrics](../images/otel-pipeline-monitoring.png) -What are some questions that you know aren't resolved yet by the OTEP? These may be questions that could be answered through further discussion, implementation experiments, or anything else that the future may bring. +### Proposed metrics semantic conventions -## Future possibilities +The proposed metric names are: -What are some future changes that this proposal would enable? +`otel.{station}.received`: Inclusive count of items entering the pipeline at a station. +`otel.{station}.dropped`: Non-inclusive count of items dropped by a component in a pipeline. +`otel.{station}.exported`: Inclusive count of items exiting the pipeline at a station. +The behavior specified for SDKs and Collectors at each level of detail +is different, because SDKs do not receive items from a pipeline. +#### SDK default configuration +At the basic level of detail, SDKs are required to count spans +received by the export pipeline. Only the `otel.sdk.received` metric +is required. This includes items that succeed, fail, or are dropped. +At the normal level of detail, the `otel.sdk.received` metric gains an +additional boolean attribute, `success` indicating success or failure. +Also at the normal level of detail, the `otel.sdk.dropped` metric is +counted. +#### Collector default configuration +At the basic level of detail, Collectors of a given `{station}` name +are required to count `otel.{station}.received`, +`otel.{station}.dropped`, and `otel.{station}.exported`. + +At the normal level of detailed, the `received` and `exported` metrics +gain an additional boolean attribute, `success` indicating success or +failure. + +#### Detailed-level metrics configuration + +There is one additional dimension that users may wish to opt-in to, in +order to gain information about failures at a particular pipeline +stage. When detail-level metrics are requested, all three metric +instruments specified for pipeline monitoring gain an additional +`reason` attribute, with a short string explaining the failure. + +For example, with detailed-level metrics in use, the +`otel.{station}.received` and `otel.{station}.exported` counters will +include additional `reason` information (e.g., `timeout`, +`resource_exhausted`, `permission_denied`). + +## Trade-offs and mitigations + +While the use of three-levels of metric detail may seem excessive, +instrumentation authors are expected to implement the cardinality of +attributes specified here, with the use of Metric SDK View +configuration to remove unwanted attributes at runtime. + +This approach (i.e., configuration of views) can also be used in the +Collector, which is instrumented using the OTel-Go metrics SDK. + +## Prior art and alternatives -semantic-conventions/ +Prior work in (this PR)[https://github.com/open-telemetry/semantic-conventions/pull/184]. 
From 3a9ef272d7eb8db318c0f55ec2eb95b937d73b21 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Tue, 31 Oct 2023 14:45:16 -0700
Subject: [PATCH 03/13] updates

---
 text/metrics/0000-pipeline-monitoring.md | 347 ------------------
 text/metrics/0238-pipeline-monitoring.md | 426 +++++++++++++++++++++++
 2 files changed, 426 insertions(+), 347 deletions(-)
 delete mode 100644 text/metrics/0000-pipeline-monitoring.md
 create mode 100644 text/metrics/0238-pipeline-monitoring.md

diff --git a/text/metrics/0000-pipeline-monitoring.md b/text/metrics/0000-pipeline-monitoring.md
deleted file mode 100644
index 9bb760a90..000000000
--- a/text/metrics/0000-pipeline-monitoring.md
+++ /dev/null
@@ -1,347 +0,0 @@
-# OpenTelemetry Export-pipeline metrics
-
-Propose a uniform standard for OpenTelemetry SDK and Collector
-export-pipeline metrics with three standard levels of detail.
-
-## Motivation
-
-OpenTelemetry has pending requests to standardize the metrics emitted
-by SDKs. At the same time, the OpenTelemetry Collector is becoming a
-stable and critical part of the ecosystem, and it has different
-semantic conventions. Here we attempt to unify them.
-
-## Explanation
-
-The OpenTelemetry Collector's pipeline metrics were derived from the
-OpenCensus collector. There is no original source material explaining
-the current state of metrics in the OTel collector.
-
-### Collector metrics
-
-The OpenTelemetry collector code base was audited for metrics usage
-detail around the time of the v0.88.0 release. Here is a summary of
-the current state of the Collector regarding export-pipeline metrics.
-
-The core collector formerly contained a package named `obsreport`,
-which had a uniform interface dedicated to each type of pipeline
-component. This package has been migrated into the commonly-used
-helper classes known as `receiverhelper`, `processorhelper`, and
-`exporterhelper`.
-
-Obsreport is responsible for giving collector metrics a uniform
-appearance. Metric names were created using OpenCensus style, which
-uses a `/` character to indicate hierarchy and a `.` to separate the
-operative verb and noun. This library creates metrics named, in
-general, `{component-type}/{verb}.{plural-noun}`, with component types
-`receiver`, `processor`, and `exporter`, and with signal-specific
-nouns `spans`, `metric_points`, and `logs` corresponding with the unit
-of information for the tracing, metrics, and logs signals,
-respectively.
-
-Early adopters of the Collector would use Prometheus to read these
-metrics, and Prometheus accepts neither `/` nor `.` in metric names.
-The Prometheus integration adds an `otelcol_` prefix and replaces the
-invalid characters with `_`, so the metric `receiver/accepted.spans`
-appears as `otelcol_receiver_accepted_spans`, for example.
-
-#### Obsreport receiver
-
-For receivers, the obsreport library counts items in two ways:
-
-1. Receiver `accepted` items. Items that are received and
-   successfully consumed by the pipeline.
-2. Receiver `refused` items. Items that are received and fail to be
-   consumed by the pipeline.
-
-Items are exclusively counted in one of these counts. The lifetime
-average failure rate of the receiver component is defined as
-`refused / (accepted + refused)`.
-
-The `accepted` metric does not "lead" the `refused` metric, because
-items are not counted until the end of the receiver operation. A
-single interface used by receiver components, with `StartOp(...)` and
-`EndOp(..., numItems)` methods, provides both kinds of
-instrumentation.
-
-Note there are a few well-known exporter and processor components that
-return success unconditionally, preventing failures from passing back
-to the producers. With this behavior, the `refused` count becomes
-unused.
-
-#### Collector: Obsreport processor metrics
-
-For processors, the obsreport library counts items in three ways:
-
-1. Processor `accepted` items. Defined as the number of items that are passed to the next component and return successfully.
-2. Processor `dropped` items. A count of items that are deliberately excluded from the output; these are counted as accepted by the preceding pipeline component but are not transmitted.
-3. Processor `refused` items. Defined as the number of items that are passed to the next component and fail.
-
-Items are exclusively counted in one of these counts. The average drop rate
-can be defined as `dropped / (accepted + dropped + refused)`.
-
-Note there are a few well-known exporter and processor components that
-return success unconditionally, preventing failures from passing back
-to the producers. With this behavior, the `refused` count becomes
-unused.
-
-#### Collector: Obsreport exporter metrics
-
-The `obsreport_exporter` interface counts items in two ways:
-
-1. Exporter `sent` items. Items that are sent and succeed.
-2. Exporter `send_failed` items. Items that are sent and fail.
-
-Items are exclusively counted in one of these counts. The average
-failure rate is defined as `send_failed / (sent + send_failed)`.
-
-### Jaeger trace SDK metrics
-
-Jaeger SDKs expose metrics on the "Reporter", which includes
-"Success", "Failure", "Dropped" (Counters), and "Queue_Length"
-(UpDownCounter). See [here](https://github.com/jaegertracing/jaeger-client-go/blob/8d8e8fcfd04de42b8482476abac6a902fca47c18/metrics.go#L22-L106).
-
-### Analysis
-
-#### SDK perspective
-
-Considering the Jaeger SDK, data items are counted in exactly one of
-three counters. While unambiguous, the use of three
-exclusively-counted metrics means that computing any useful ratio
-about SDK performance requires querying three timeseries, and any pair
-of these metrics tells an incomplete story.
-
-With three exclusive counters, there is no way to offer varying levels
-of detail. If we wanted to omit any one of these timeseries, the other
-two would have to change meaning. While items that drop are in some
-ways a failure, they are counted exclusively and so cannot be combined
-with the failure count to produce a less detailed view.
-
-#### Collector perspective
-
-Collector counters are exclusive. As with SDKs, items that enter a
-processor are counted in one of three ways, and computing a meaningful
-ratio requires all three timeseries. If the processor is a sampler,
-for example, the effective sampling rate is computed as
-`(accepted+refused)/(accepted+refused+dropped)`.
-
-While the collector defines and emits metrics sufficient for
-monitoring each individual pipeline component, taken as a whole there
-is substantial redundancy in having so many exclusive counters. For
-example, when a collector pipeline features no processors, the
-receiver's `refused` count is expected to equal the exporter's
-`send_failed` count.
-
-When there are several processors, it is primarily the number of
-dropped items that we are interested in counting. When there are
-multiple sequential processors in a pipeline, however, counting the
-total number of items at each stage in a multi-processor pipeline
-leads to over-counting in aggregate. For example, if you combine
-`accepted` and `refused` for two adjacent processors, then remove the
-metric attribute which distinguishes them, the resulting sum will be
-twice the number of items processed by the pipeline.
-
-The same logic suggests that multiple sequential collectors in a
-pipeline cannot use the same metric names, otherwise removal of the
-distinguishing metric attribute would cause over-counting of the
-pipeline.
-
-### Pipeline monitoring
-
-The term _Stage_ is used to describe a single component in an
-export pipeline.
-
-The term _Station_ is used to describe a location in the export
-pipeline where the participating stages are part of the same logical
-failure domain. Typically, each SDK or Collector is considered a
-station.
-
-#### Station integrity principles
-
-The [OpenTelemetry library guidelines (point
-4)](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/library-guidelines.md#requirements)
-describe a separation of protocol-dependent ("receivers",
-"exporters") and protocol-independent ("processors") parts. We refer
-to the combination of parts as a station.
-
-The station concept is called out because, within a station, we expect
-the station software to act responsibly by design, preserving the
-integrity of the pipeline. Stations allow data to enter a pipeline
-only through receiver components. Stations are never responsible for
-dropping data, because only processor components drop data. Stations
-allow data to leave a pipeline only through exporter components.
-
-Because of station integrity, we can make the following assertions:
-
-1. Data that enters a pipeline is eventually exported or dropped.
-2. No other outcomes are possible.
-
-These assertions point to a potential redundancy in pipeline metrics.
-In a pipeline with no fan-in or fan-out, each stage processes as many
-items as the stage before it did, minus the number of items dropped.
-
-#### Pipeline stage-name uniqueness
-
-The Pipeline Stage Name Uniqueness requirement developed here avoids
-over-counting in an export pipeline by ensuring that no single metric
-name counts items more than once in transit. This rule prevents
-counting items of telemetry sent by SDKs and Collectors in the same
-metric; it also prevents counting items of telemetry sent through a
-multi-tier arrangement of Collectors using the same metric.
-
-In a standard deployment of OpenTelemetry, we expect one, two, or
-three stations in a collection pipeline. The names given to this
-standard set of stations are:
-
-- `sdk`: an original source of new telemetry
-- `agent`: a collector with operations "local" to the `sdk`
-- `gateway`: a collector serving as a proxy to an external service.
-
-This is not meant as an exclusive set of station names. Users should
-be given the ability to configure the station name used by particular
-instances of the OpenTelemetry Collector. It may even be desirable to
-support configuring "sub-stations" within a larger pipeline, for
-example when there are connectors in use; however, if so, the
-collector must enforce that pipeline-stage names are unique within a
-pipeline.
-
-#### Basic detail through inclusive counters
-
-By the station integrity principles, there are several forms of detail
-that may be omitted by users who want only the basic level of detail.
-
-At a minimum, establishing information about _loss_ requires knowing
-how much is received by the first station in the pipeline and how much
-is exported by the last station in the pipeline. From the total
-received and the total exported, we can compute total pipeline loss.
-Note that metrics about intermediate pipeline stations may be omitted,
-since they implicitly factor into global pipeline loss.
-
-For this proposal to succeed, it is necessary to use inclusive
-counters as opposed to exclusive counters. For the receivers, at a
-basic level of detail, we only need to know the number of items
-received (i.e., including items that succeed or fail or are dropped)
-because those that fail or are dropped implicitly factor into global
-pipeline loss.
-
-For processors, at a basic level of detail, it is not necessary to
-count anything, since drops are implicit. Any items not received by
-the next stage in the pipeline must have been dropped; therefore we can
-infer drop counts from basic-detail metrics without any new counters.
-
-For exporters, at a basic level of detail, the same argument applies.
-Metrics describing an exporter in a pipeline ordinarily will match
-those of the receiver at the next stage in the pipeline, so they are not
-needed at the basic level of detail, provided all receivers report
-inclusive counters.
-
-#### Pipeline failures optional
-
-For a processor, dropping data is always on purpose, but dropped data
-may be counted as failures or not, depending on the circumstances.
-For an SDK batch processor, dropping data is considered failure. For
-a Collector sampler processor, dropping data is considered success.
-It is up to the individual processor component whether to treat
-dropped data as failures or successes.
-
-We are also aware of a precedent for returning success at certain
-stages in a pipeline, perhaps asynchronously, regardless of actual
-success or failure. This is known to happen in the Collector's core
-`exporterhelper` library, which provides a number of standard
-features, including the ability to drop data when the queue is full.
-
-Because failure is sometimes treated as success, it may be necessary
-to monitor a point in the pipeline after failures are suppressed.
-
-#### Pipeline signal type
-
-OpenTelemetry currently has three signal types, but it may add more.
-Instead of using the signal name in the metric names, we opt for a
-general-purpose noun that usefully describes any signal.
-
-The signal-agnostic term used here is "items", referring to spans, log
-records, and metric data points. A `signal` attribute will be used
-to distinguish them, with values `traces`, `logs`, or `metrics`.
-
-Users are expected to understand that the data item for traces is a
-span, for logs is a record, and for metrics is a point.
-
-#### Pipeline component name
-
-Components are uniquely identified using a descriptive `name`
-attribute which encompasses at least a short name describing the type
-of component being used (e.g., `batch` for the SDK BatchSpanProcessor
-or the Collector batch processor).
-
-When there is more than one component of a given type active in a
-pipeline having the same `domain` and `signal` attributes, the `name`
-should include additional information to disambiguate the multiple
-instances using the syntax `/`. For example, if there
-were two `batch` processors in a collection pipeline (e.g., one for
-error spans and one for non-error spans) they might use the names
-`batch/error` and `batch/noerror`. 
-
-#### Pipeline monitoring diagram
-
-The relationship between items received, dropped, and exported is
-shown in the following diagram.
-
-![pipeline monitoring metrics](../images/otel-pipeline-monitoring.png)
-
-### Proposed metrics semantic conventions
-
-The proposed metric names are:
-
-`otel.{station}.received`: Inclusive count of items entering the pipeline at a station.
-`otel.{station}.dropped`: Non-inclusive count of items dropped by a component in a pipeline.
-`otel.{station}.exported`: Inclusive count of items exiting the pipeline at a station.
-
-The behavior specified for SDKs and Collectors at each level of detail
-is different, because SDKs do not receive items from a pipeline.
-
-#### SDK default configuration
-
-At the basic level of detail, SDKs are required to count spans
-received by the export pipeline. Only the `otel.sdk.received` metric
-is required. This includes items that succeed, fail, or are dropped.
-
-At the normal level of detail, the `otel.sdk.received` metric gains an
-additional boolean attribute, `success`, indicating success or failure.
-Also at the normal level of detail, the `otel.sdk.dropped` metric is
-counted.
-
-#### Collector default configuration
-
-At the basic level of detail, Collectors of a given `{station}` name
-are required to count `otel.{station}.received`,
-`otel.{station}.dropped`, and `otel.{station}.exported`.
-
-At the normal level of detail, the `received` and `exported` metrics
-gain an additional boolean attribute, `success`, indicating success or
-failure.
-
-#### Detailed-level metrics configuration
-
-There is one additional dimension that users may wish to opt-in to, in
-order to gain information about failures at a particular pipeline
-stage. When detailed-level metrics are requested, all three metric
-instruments specified for pipeline monitoring gain an additional
-`reason` attribute, with a short string explaining the failure.
-
-For example, with detailed-level metrics in use, the
-`otel.{station}.received` and `otel.{station}.exported` counters will
-include additional `reason` information (e.g., `timeout`,
-`resource_exhausted`, `permission_denied`).
-
-## Trade-offs and mitigations
-
-While the use of three levels of metric detail may seem excessive,
-instrumentation authors are expected to implement the cardinality of
-attributes specified here, with the use of Metric SDK View
-configuration to remove unwanted attributes at runtime.
-
-This approach (i.e., configuration of views) can also be used in the
-Collector, which is instrumented using the OTel-Go metrics SDK.
-
-## Prior art and alternatives
-
-Prior work in [this PR](https://github.com/open-telemetry/semantic-conventions/pull/184).
diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
new file mode 100644
index 000000000..3c9883554
--- /dev/null
+++ b/text/metrics/0238-pipeline-monitoring.md
@@ -0,0 +1,426 @@
+# OpenTelemetry Export-pipeline metrics
+
+Propose a uniform standard for OpenTelemetry SDK and Collector
+export-pipeline metrics with three standard levels of detail.
+
+## Motivation
+
+OpenTelemetry has pending requests to standardize the metrics emitted
+by SDKs. At the same time, the OpenTelemetry Collector is becoming a
+stable and critical part of the ecosystem, and it has different
+semantic conventions. Here we attempt to unify them.
+
+## Explanation
+
+The OpenTelemetry Collector's pipeline metrics were derived from the
+OpenCensus collector. 
There is no original source material explaining
+the current state of metrics in the OTel collector.
+
+### Collector metrics
+
+The OpenTelemetry collector code base was audited for metrics usage
+detail around the time of the v0.88.0 release. Here is a summary of
+the current state of the Collector regarding export-pipeline metrics.
+
+The core collector formerly contained a package named `obsreport`,
+which had a uniform interface dedicated to each of its components.
+This package has been migrated into the commonly-used helper classes
+known as `receiverhelper`, `processorhelper`, and `exporterhelper`.
+
+Obsreport is responsible for giving collector metrics a uniform
+appearance. Metric names were created using OpenCensus style, which
+uses a `/` character to indicate hierarchy and a `.` to separate the
+operative verb and noun. This library creates metrics named, in
+general, `{component-type}/{verb}.{plural-noun}`, with component types
+`receiver`, `processor`, and `exporter`, and with signal-specific
+nouns `spans`, `metric_points` and `logs` corresponding with the unit
+of information for the tracing, metrics, and logs signals,
+respectively.
+
+Earlier adopters of the Collector would use Prometheus to read these
+metrics, which does not accept `/` or `.`. The Prometheus integration
+would add an `otelcol_` prefix and replace the invalid characters with
+`_`. The same metric in the example above would appear named
+`otelcol_receiver_accepted_spans`, for example.
+
+#### Obsreport receiver
+
+For receivers, the obsreport library counts items in two ways:
+
+1. Receiver `accepted` items. Items that are received and
+   successfully consumed by the pipeline.
+2. Receiver `refused` items. Items that are received and fail to be
+   consumed by the pipeline.
+
+Items are exclusively counted in one of these counts. The lifetime
+average failure rate of the receiver component is defined as
+`refused / (accepted + refused)`.
+
+#### Collector: Obsreport processor metrics
+
+For processors, the obsreport library counts items in three ways:
+
+1. Processor `accepted` items. Defined as the number of items that are passed to the next component and return successfully.
+2. Processor `dropped` items. This is a counter of items that are
+   deliberately excluded from the output, which are counted as accepted by the preceding pipeline component but are not transmitted.
+3. Processor `refused` items. Defined as the number of items that are passed to the next component and fail.
+
+Items are exclusively counted in one of these counts. The average drop rate
+can be defined as `dropped / (accepted + dropped + refused)`.
+
+#### Collector: Obsreport exporter metrics
+
+The `obsreport_exporter` interface counts spans in two ways:
+
+1. Exporter `sent` items. Items that are sent and succeed.
+2. Exporter `send_failed` items. Items that are sent and fail.
+
+Items are exclusively counted in one of these counts. The average
+failure rate is defined as `send_failed / (sent + send_failed)`.
+
+### Jaeger trace SDK metrics
+
+Jaeger SDKs expose metrics on the "Reporter", which includes
+"Success", "Failure", "Dropped" counters describing the pipeline. See
+[here](https://github.com/jaegertracing/jaeger-client-go/blob/8d8e8fcfd04de42b8482476abac6a902fca47c18/metrics.go#L22-L106).
+
+Jaeger SDK metrics are equivalent to the three metrics produced by
+OpenTelemetry Collector processor components. 
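+
+To make the preceding survey concrete, here is a hypothetical query
+computing a lifetime exporter failure rate from the existing exclusive
+counters, assuming the Prometheus-exported names described above:
+
+```
+ExporterFailureRate = otelcol_exporter_send_failed_spans
+                    / (otelcol_exporter_sent_spans + otelcol_exporter_send_failed_spans)
+```
+
+Note that a query of this kind must name every one of the exclusive
+counters, which is the redundancy examined in the analysis below.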
+
+### Analysis
+
+#### Use of exclusive counters
+
+As we can see by the examples documented above, it is a standard
+practice to monitor a telemetry pipeline using three counters to count
+successful, failed, and dropped items. In each of the existing
+solutions, because the counters are exclusive, all three counter
+values are needed to establish the total loss rate.
+
+Because the number of SDKs is generally greater than the number of
+collectors, and because they are in the first position of the
+pipeline, this is a significant detail. When the subject is a single SDK, all three
+counters are essential and necessary for a complete understanding of
+loss. The single-SDK loss rate, defined for exclusive counters, is:
+
+```
+SingleLossRate = 1 - Success / (Success + Failed + Dropped)
+```
+
+However, in a scenario where hundreds or thousands of identical SDKs
+are deployed, users may wish to opt-out of such extensive detail.
+Reasoning that identical SDKs are likely experiencing the same
+failures, users may wish to enable additional detail only in a sample
+of SDKs, or only in regions or services where loss is already known to
+exist.
+
+To use inclusive counters, in this case, means to use a single metric
+name with one or more attributes to subdivide the total into specific
+categories. For example, a total count can be subdivided into
+`success=true` and `success=false`. From the SDK perspective, drops
+are counted as failures, and from the Collector perspective we will
+find other reasons to count drops separately. Therefore, the
+definition above can be replaced for inclusive counters:
+
+```
+SingleLossRate = Total{success=false} / Total{*}
+```
+
+By using inclusive counters instead of exclusive counters, it is
+possible to establish the total rate of loss with substantially fewer
+timeseries, because we only need to distinguish success and failure
+at the end of the pipeline to determine total loss.
+
+```
+PipelineLossRate = LastStageTotal{success=false} / FirstStageTotal{*}
+```
+
+Since total loss can be calculated with only a single timeseries per
+SDK, this will be specified as the behavior when configuring pipeline
+monitoring with basic-level metric detail.
+
+#### Collector perspective
+
+Collector counters are exclusive. Like for SDKs, items that enter a
+processor are counted in one of three ways and to compute a meaningful
+ratio requires all three timeseries. If the processor is a sampler,
+for example, the effective sampling rate is computed as
+`(accepted+refused)/(accepted+refused+dropped)`.
+
+While the collector defines and emits metrics sufficient for
+monitoring each individual pipeline component, taken as a whole there
+is substantial redundancy in having so many exclusive counters. For
+example, when a collector pipeline features no processors, the
+receiver's `refused` count is expected to equal the exporter's
+`send_failed` count.
+
+When there are several processors, it is primarily the number of
+dropped items that we are interested in counting. When there are
+multiple sequential processors in a pipeline, however, counting the
+total number of items at each stage in a multi-processor pipeline
+leads to over-counting in aggregate. For example, if you combine
+`accepted` and `refused` for two adjacent processors, then remove the
+metric attribute which distinguishes them, the resulting sum will be
+twice the number of items processed by the pipeline.
+
+### Pipeline monitoring
+
+The term _Stage_ is used to describe a single component in an
+export pipeline. 
+
+The term _Station_ is used to describe a location in the export
+pipeline where the participating stages are part of the same logical
+failure domain. Typically each SDK or Collector is considered a
+station.
+
+#### Station integrity principles
+
+The [OpenTelemetry library guidelines (point
+4)](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/library-guidelines.md#requirements)
+describes a separation of protocol-dependent ("receivers",
+"exporters") and protocol-independent ("processors") parts. We refer
+to the combination of parts as a station.
+
+The station concept is called out because within a station, we expect
+that the station (software) acts responsibly by design, for the
+integrity of the pipeline. Stations allow data to enter a pipeline
+only through receiver components. Stations are never responsible for
+dropping data, because only processor components drop data. Stations
+allow data to leave a pipeline only through exporter components.
+
+Because of station integrity, we can make the following assertions:
+
+1. Data that enters a station is eventually exported or dropped.
+2. No other outcomes are possible.
+
+These principles suggest ways to monitor a pipeline, when normal-level
+metric detail is configured, to avoid redundancy. For simple
+pipelines, the number of items exported equals the number of items
+received minus the number of items dropped, and for simple pipelines
+it is sufficient to observe only successes and failures by receivers as
+well as items dropped by processors.
+
+#### Practice of error suppression
+
+There is an accepted practice in the OpenTelemetry Collector of
+accepting data and returning success before the data is written to its
+final destination. In fact, this is the out-of-the-box default for
+most pipelines, because of `exporterhelper` defaults.
+
+Suppressing errors, when it is practiced, means a later stage in the
+pipeline must be monitored to detect resource exhaustion, since
+earlier stages will not see any failures or experience backpressure.
+Because the practice of asynchronous reporting is widespread,
+OpenTelemetry Collectors therefore should normally count exported data
+in addition to received data, despite creating redundancy when errors
+are not suppressed.
+
+#### Pipeline stage-name uniqueness
+
+The Pipeline Stage Name Uniqueness requirement developed here avoids
+over-counting in an export pipeline by ensuring that no single metric
+name counts items more than once in transit. This rule prevents
+counting items of telemetry sent by SDKs and Collectors in the same
+metric; it also prevents counting items of telemetry sent through a
+multi-tier arrangement of Collectors using the same metric.
+
+In a standard deployment of OpenTelemetry, we expect one, two, or
+three stations in a collection pipeline. The names given to this
+standard set of stations are:
+
+- `sdk`: an original source of new telemetry
+- `agent`: a collector with operations "local" to the `sdk`
+- `gateway`: a collector serving as a proxy to an external service.
+
+This is not meant as an exclusive set of station names. Users should
+be given the ability to configure the station name used by particular
+instances of the OpenTelemetry Collector. It may even be desirable to
+support configuring "sub-stations" within a larger pipeline, for
+example when there are connectors in use; however, if so, the
+collector must enforce that pipeline-stage names are unique within a
+station. 
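+
+To make the over-counting hazard concrete, consider a hypothetical
+two-station pipeline in which both stations count received items
+under one shared metric name; the item counts here are invented for
+illustration:
+
+```
+received{station=sdk}   = 100   (new items enter at the SDK)
+received{station=agent} = 100   (the same items, re-counted at the agent)
+sum(received)           = 200   (aggregation double-counts the pipeline)
+```
+
+Station-unique metric names cannot be aggregated by accident, which
+is the point of this requirement.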
+
+#### Pipeline signal type
+
+OpenTelemetry currently has three signal types, but it may add more.
+Instead of using the signal name in the metric names, we opt for a
+general-purpose noun that usefully describes any signal.
+
+The signal-agnostic term used here is "items", referring to spans, log
+records, and metric data points. A `signal` attribute will be used
+to distinguish them, with values `traces`, `logs`, or `metrics`.
+
+Users are expected to understand that the data item for traces is a
+span, for logs is a record, and for metrics is a point. Users may
+opt-in to removing this attribute, in which case items of telemetry
+data will be counted in aggregate. When the `signal` attribute is
+removed, loss-rate can likewise only be calculated in aggregate.
+
+#### Pipeline component name
+
+Components are uniquely identified using a descriptive `name`
+attribute which encompasses at least a short name describing the type
+of component being used (e.g., `batch` for the SDK BatchSpanProcessor
+or the Collector batch processor).
+
+When there is more than one component of a given type active in a
+pipeline having the same `domain` and `signal` attributes, the `name`
+should include additional information to disambiguate the multiple
+instances using the syntax `/`. For example, if there
+were two `batch` processors in a collection pipeline (e.g., one for
+error spans and one for non-error spans) they might use the names
+`batch/error` and `batch/noerror`.
+
+#### Pipeline monitoring diagram
+
+The relationship between items received, dropped, and exported is
+shown in the following diagram.
+
+![pipeline monitoring metrics](../images/otel-pipeline-monitoring.png)
+
+### Proposed metrics semantic conventions
+
+The proposed metric names match the following pattern:
+
+| Metric Name | Meaning |
+|---------------------------|--------------------------------------------------------------------|
+| `otel.{station}.received` | Inclusive count of items entering the pipeline at a station. |
+| `otel.{station}.dropped` | Non-inclusive count of items dropped by a processor in a pipeline. |
+| `otel.{station}.exported` | Inclusive count of items exiting the pipeline at a station. |
+
+The behavior specified for SDKs and Collectors at each level of detail
+is different, because SDKs do not receive items from a pipeline and
+because they outnumber the other components.
+
+These attributes can be applied to any of the pipeline monitoring
+metrics specified here.
+
+| Attributes | Meaning | Level of detail (Optional) | Examples |
+|----------------|---------------------------------------------|----------------------------|------------------------------------------------------------|
+| `otel.signal` | Name of the telemetry signal | Basic (Opt-out) | `traces`, `logs`, `metrics` |
+| `otel.name` | Type, name, or "type/name" of the component | Normal (Opt-out) | `probabilitysampler`, `batch`, `otlp/grpc` |
+| `otel.success` | Boolean: item considered success? | Normal (Opt-out) | `true`, `false` |
+| `otel.reason` | Explanation of success/failures. | Detailed (Opt-in) | `ok`, `timeout`, `permission_denied`, `resource_exhausted` |
+| `otel.scope` | Name of instrumentation. | Detailed (Opt-in) | `opentelemetry.io/library` |
+
+For example, when a sampler processor drops an item it may report
+`success=true`, but when a queue processor drops an item it may report
+`success=false`. 
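+
+The following OTel-Go fragment sketches how an SDK might record the
+`otel.sdk.received` counter with the attributes above. It is an
+illustration only: the helper name `countReceived`, the meter name,
+and the literal values are invented here and are not part of this
+proposal.
+
+```
+package pipelinemetrics
+
+import (
+	"context"
+
+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/attribute"
+	"go.opentelemetry.io/otel/metric"
+)
+
+// countReceived counts items entering the SDK export pipeline, tagged
+// with the proposed otel.signal and otel.success attributes.
+func countReceived(ctx context.Context, numItems int64, signal string, ok bool) error {
+	meter := otel.Meter("pipeline-monitoring")
+	received, err := meter.Int64Counter("otel.sdk.received")
+	if err != nil {
+		return err
+	}
+	received.Add(ctx, numItems, metric.WithAttributes(
+		attribute.String("otel.signal", signal), // traces, logs, or metrics
+		attribute.Bool("otel.success", ok),      // normal level of detail
+	))
+	return nil
+}
+```
+
+At the basic level of detail, a View that removes the `otel.success`
+attribute would reduce this instrument to one timeseries per signal.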
+
+#### SDK default configuration
+
+| Metric name | Enablement level |
+|---------------------|------------------|
+| `otel.sdk.received` | Basic |
+| `otel.sdk.dropped` | Normal |
+| `otel.sdk.exported` | Detailed |
+
+#### Collector default configuration
+
+| Metric name | Enablement level |
+|---------------------------|------------------|
+| `otel.{station}.received` | Basic |
+| `otel.{station}.dropped` | Basic |
+| `otel.{station}.exported` | Basic |
+
+## Pipeline monitoring as a service
+
+With this specification, users operating OpenTelemetry SDKs and
+Collectors sending to a third-party observability system may expect to
+be provided with information about telemetry losses.
+
+### Configuration of pipeline monitoring metrics
+
+OpenTelemetry SDKs will be configured by default to send pipeline
+monitoring metrics using Meter instances obtained from the global
+meter provider.
+
+SDK authors SHOULD provide users with an option to configure an
+alternate destination for pipeline monitoring metrics, so that
+pipeline metrics can be monitored independently of ordinary telemetry
+data.
+
+### Inference rules for service providers
+
+By design, losses and failures due to intermediate collectors will be
+observable to the service provider, as long as all original producers
+report pipeline-monitoring metrics. For all telemetry producers with
+basic-level pipeline monitoring enabled, the telemetry system will be
+able to compare the actual number of OpenTelemetry spans, metric data
+points, and log records received against the numbers that entered the
+system's own pipelines. In aggregate, the number of items entering
+the pipeline should match the number of items successfully received;
+otherwise, the system can report the combined losses to the user.
+
+When losses are unacceptable to the user, or the causes of loss cannot
+be resolved through other system indicators, the user (or the system
+acting on the user's behalf) may wish to enable normal-level detail
+for SDKs or enable metrics for intermediate collectors. The
+additional detail will give the system information that can be used to
+narrow down the location(s) and source(s) where loss occurs.
+
+## OpenTelemetry SDKs have no receivers
+
+OpenTelemetry SDKs are special pipeline components because they do not
+receive data from an external source. OpenTelemetry SDKs support
+signal-specific features that may appear like standard
+receivers or processors in a pipeline, but as specified here, pipeline
+monitoring only applies to items of data that are submitted for
+export by an SDK-internal mechanism.
+
+There are several examples:
+
+- In the trace SDK, spans are sampled by a component that emulates a
+  pipeline processor, but the actions of a sampler are not the same as
+  an export pipeline. While we could count spans not sampled using
+  `otel.sdk.dropped` with `success=true` and
+  `name=traceidratiosampler`, it could lead to misleading
+  interpretation because spans that are unfinished are neither
+  `success=true` nor `success=false`. Different semantic conventions
+  should probably be used to monitor sampler components.
+- In the metrics SDK, Metric View configuration can cause metric
+  events to be "dropped". We could count all metric data points as
+  logically entering a pipeline, and then the ones that are dropped
+  would appear as `otel.sdk.dropped` with `success=true` and
+  `name=metricreader`, but this would lead to an accounting problem. 
The point of Drop aggregation is to avoid the cost of a metric
+  instrument, so we do not wish to count what we drop.
+
+For these reasons, the `otel.sdk.received` metric is defined as the
+number of items that the SDK produces as input to the pipeline. This
+quantity MUST be the number of items that are expected for
+delivery at the final destination, when the pipeline is operating
+correctly and without failures.
+
+## Metrics SDK special considerations
+
+We expect that Metrics SDKs will be used to generate
+pipeline-monitoring metrics reporting about themselves.
+
+As stated above, SDKs SHOULD support configuring an alternate Meter
+Provider for pipeline-monitoring metrics. When the global Meter
+Provider is used, the Metrics SDK's pipeline will receive its own
+pipeline-monitoring metrics. When a custom Meter Provider is used, a
+secondary pipeline will receive the pipeline monitoring metrics, in
+which case the secondary pipeline may also self-report.
+
+## Trade-offs and mitigations
+
+The use of three levels of metric detail may seem like more freedom
+than necessary. Implementors are expected to take advantage of Metric
+View configuration in the Metrics SDK for configuring opt-out of
+standard attributes (i.e., to remove `otel.signal`, `otel.name`, or
+`otel.success`). For opt-in attributes (i.e., to configure no
+`otel.reason` or `otel.scope` attribute), implementors MAY choose to
+enable additional attributes only when configured.
+
+## Prior art and alternatives
+
+Prior work in [this PR](https://github.com/open-telemetry/semantic-conventions/pull/184).
+
+Issues:
+- [Determine how to report dropped metrics](https://github.com/open-telemetry/opentelemetry-specification/issues/1655)
+- [How should OpenTelemetry-internal metrics be exposed?](https://github.com/open-telemetry/opentelemetry-specification/issues/959)
+- [OTLP Exporter must send client side metrics](https://github.com/open-telemetry/opentelemetry-specification/issues/791)
+- [Making Tracing SDK metrics aware](https://github.com/open-telemetry/opentelemetry-specification/issues/381)
From 00624800569eb1b9901747c94a655ce2ead6a303 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Thu, 14 Dec 2023 16:18:47 -0800
Subject: [PATCH 04/13] Wip

---
 text/metrics/0238-pipeline-monitoring.md | 121 +++++++++++------------
 1 file changed, 55 insertions(+), 66 deletions(-)

diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
index 3c9883554..1a7275c8b 100644
--- a/text/metrics/0238-pipeline-monitoring.md
+++ b/text/metrics/0238-pipeline-monitoring.md
@@ -18,10 +18,6 @@ the current state of metrics in the OTel collector.
 
 ### Collector metrics
 
-The OpenTelemetry collector code base was audited for metrics usage
-detail around the time of the v0.88.0 release. Here is a summary of
-the current state of the Collector regarding export-pipeline metrics.
-
 The core collector formerly contained a package named `obsreport`,
 which had a uniform interface dedicated to each of its components.
 This package has been migrated into the commonly-used helper classes
@@ -31,7 +27,7 @@ Obsreport is responsible for giving collector metrics a uniform
 appearance. Metric names were created using OpenCensus style, which
 uses a `/` character to indicate hierarchy and a `.` to separate the
 operative verb and noun. 
This library creates metrics named, in
-general, `{component-type}/{verb}.{plural-noun}`, with component types
+general, `{component-type}/{verb}.{noun}`, with component types
 `receiver`, `processor`, and `exporter`, and with signal-specific
 nouns `spans`, `metric_points` and `logs` corresponding with the unit
 of information for the tracing, metrics, and logs signals,
 respectively.
@@ -41,7 +37,7 @@ Earlier adopters of the Collector would use Prometheus to read these
 metrics, which does not accept `/` or `.`. The Prometheus integration
 would add an `otelcol_` prefix and replace the invalid characters with
 `_`. The same metric in the example above would appear named
-`otelcol_receiver_accepted_spans`, for example.
+`otelcol_receiver_accepted_spans`.
 
 #### Obsreport receiver
 
@@ -78,6 +74,10 @@ The `obsreport_exporter` interface counts spans in two ways:
 Items are exclusively counted in one of these counts. The average
 failure rate is defined as `send_failed / (sent + send_failed)`.
 
+The exporterhelper package takes on many aspects of processor
+behavior, including the ability to drop when a queue is full. It uses
+a separate counter for these items, known as `enqueue_failed`.
+
 ### Jaeger trace SDK metrics
 
 Jaeger SDKs expose metrics on the "Reporter", which includes
@@ -89,79 +89,68 @@ OpenTelemetry Collector processor components.
 
 ### Analysis
 
-#### Use of exclusive counters
-
 As we can see by the examples documented above, it is a standard
 practice to monitor a telemetry pipeline using three counters to count
-successful, failed, and dropped items. In each of the existing
-solutions, because the counters are exclusive, all three counter
-values are needed to establish the total loss rate.
+successful, failed, and dropped items.
 
-Because the number of SDKs is generally greater than the number of
-collectors, and because they are in the first position of the
-pipeline, this is a significant detail. When the subject is a single SDK, all three
-counters are essential and necessary for a complete understanding of
-loss. The single-SDK loss rate, defined for exclusive counters, is:
+A central aspect of the proposed specification is to use a single
+metric instrument with three exclusive attribute values, as compared
+with the use of three separate metric instruments.
 
-```
-SingleLossRate = 1 - Success / (Success + Failed + Dropped)
-```
+#### Loss rate calculation
+
+The benefit of using a single metric instrument is that aggregation is
+easy to apply, particularly in Metric SDKs using the standard Views
+mechanism. This means it is both easy and natural to configure an SDK
+to produce more or less detail, so that both advanced and basic
+use-cases are possible.
 
-However, in a scenario where hundreds or thousands of identical SDKs
-are deployed, users may wish to opt-out of such extensive detail.
-Reasoning that identical SDKs are likely experiencing the same
-failures, users may wish to enable additional detail only in a sample
-of SDKs, or only in regions or services where loss is already known to
-exist.
-
-To use inclusive counters, in this case, means to use a single metric
-name with one or more attributes to subdivide the total into specific
-categories. For example, a total count can be subdivided into
-`success=true` and `success=false`. From the SDK perspective, drops
-are counted as failures, and from the Collector perspective we will
-find other reasons to count drops separately. 
Therefore, the
-definition above can be replaced for inclusive counters:
+When calculating the single-SDK loss rate, all three variables are
+necessary. We use the terms `items{outcome=success}`,
+`items{outcome=failed}`, and `items{outcome=dropped}` to denote the
+count of items by three outcomes, and the loss rate is a function of
+all three.
 
 ```
-SingleLossRate = Total{success=false} / Total{*}
+SingleLossRate = (items{outcome=failed} + items{outcome=dropped}) / (items{outcome=success} + items{outcome=failed} + items{outcome=dropped})
 ```
 
-By using inclusive counters instead of exclusive counters, it is
-possible to establish the total rate of loss with substantially fewer
-timeseries, because we only need to distinguish success and failure
-at the end of the pipeline to determine total loss.
+The benefit of using a single metric instrument is that the
+calculation `(items{outcome=success} + items{outcome=failed} +
+items{outcome=dropped})` is simple and inexpensive to apply in SDKs by
+removing the `outcome` attribute. The sum of these three with no
+`outcome` variable (`items{}`) can be used to establish the loss rate
+between two points in a pipeline, because at this level the
+distinction between outcomes does not matter.
+
+In an ordinary deployment where the number of SDKs is orders of
+magnitude larger than the number of collectors, this can lead to
+meaningful savings. The total loss rate for a pipeline is calculated
+from the sum of items at the final stage of collection and the sum of
+items at the SDKs.
 
 ```
-PipelineLossRate = LastStageTotal{success=false} / FirstStageTotal{*}
+TotalLossRate = 1 - sum(collector_items{}) / sum(sdk_items{})
 ```
 
-Since total loss can be calculated with only a single timeseries per
-SDK, this will be specified as the behavior when configuring pipeline
-monitoring with basic-level metric detail.
-
-#### Collector perspective
-
-Collector counters are exclusive. Like for SDKs, items that enter a
-processor are counted in one of three ways and to compute a meaningful
-ratio requires all three timeseries. If the processor is a sampler,
-for example, the effective sampling rate is computed as
-`(accepted+refused)/(accepted+refused+dropped)`.
-
-While the collector defines and emits metrics sufficient for
-monitoring each individual pipeline component, taken as a whole there
-is substantial redundancy in having so many exclusive counters. For
-example, when a collector pipeline features no processors, the
-receiver's `refused` count is expected to equal the exporter's
-`send_failed` count.
-
-When there are several processors, it is primarily the number of
-dropped items that we are interested in counting. When there are
-multiple sequential processors in a pipeline, however, counting the
-total number of items at each stage in a multi-processor pipeline
-leads to over-counting in aggregate. For example, if you combine
-`accepted` and `refused` for two adjacent processors, then remove the
-metric attribute which distinguishes them, the resulting sum will be
-twice the number of items processed by the pipeline.
+#### Dropped means intentionally not transmitted
+
+In the SDK, we may consider the case of dropped items as a form of
+failure, but these are intentional failures. Looking at Collector
+processors, we see that dropped items may or may not be considered
+success, but whatever they are, they result from an intentional
+decision not to transmit an item.
+
+As an example, when a sampler processor drops spans, it should not be
+considered failure. 
When the Collector's exporterhelper component
+drops items, it should probably be considered a failure.
+
+#### Collector
+
+TODO: Why we count only drops for pipeline segments, but not SDKs.
+
+<!--
+
+
+
+-->
+
 ### Pipeline monitoring
 
From b39b732e10e543cd258372b478d89d8a173fb526 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Fri, 15 Dec 2023 16:26:40 -0800
Subject: [PATCH 05/13] TODO WIP too much text now

---
 text/metrics/0238-pipeline-monitoring.md | 109 ++++++++++++++++++++---
 1 file changed, 96 insertions(+), 13 deletions(-)

diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
index 1a7275c8b..fd8ba3161 100644
--- a/text/metrics/0238-pipeline-monitoring.md
+++ b/text/metrics/0238-pipeline-monitoring.md
@@ -97,6 +97,101 @@ metric instrument with three exclusive attribute values, as compared
 with the use of three separate metric instruments.
 
+By specifying attribute dimensions for the resulting single
+instrument, users can configure the level of detail and the number of
+timeseries needed to convey the information they to monitor.
+
+#### Meaning of "dropped" telemetry
+
+The term "Dropped" in pipeline monitoring usually refers to telemetry
+that was intentionally not transmitted. A survey of existing pipeline
+components shows the following uses.
+
+In the SDK, the standard OpenTelemetry BatchSpanProcessor will drop
+spans that cannot be admitted into its queue. These cases are
+intentional, to protect the application and downstream pipeline, but
+they should be considered failure because they were meant to be be
+collected.
+
+In a Collector pipeline, there are formal and informal uses:
+
+- A sampling processor, for example, may drop spans because it was
+  instructed to (e.g., due to an attribute like `sampling.priority=0`).
+  In this case, drops are considered success.
+- The memorylimiter processor, for example, may "drop" spans because
+  it was instructed to (e.g., when it is above a hard limit).
+  However, when it does this, it returns an error counts the item as
+  `refused`, contradicting the documentation of that metric instrument:
+
+> "Number of spans that were rejected by the next component in the pipeline."
+
+There is already an inconsistency. By counting its own failures as
+refused, we should expect that the next component in the pipeline
+handled the data. This is a failure case drop, one where the next
+component in the pipeline does not handle the item:
+
+> "Number of spans that were dropped."
+
+The memory limiter source code actually has a comment on this topic,
+
+```
+// TODO: actually to be 100% sure that this is "refused" and not "dropped"
+// it is necessary to check the pipeline to see if this is directly connected
+// to a receiver (ie.: a receiver is on the call stack). For now it
+// assumes that the pipeline is properly configured and a receiver is on the
+// callstack and that the receiver will correctly retry the refused data again.
+```
+
+which adds to the confusion -- it is not standard practice for
+receivers to retry in the OpenTelemetry collector; that is the duty of
+exporters in our current practice.
+
+There is still another use of "dropped" in the collector, similar to
+the memory limiter example and the SDK use-case, where "dropped" is a
+case of failure. 
In the `exporterhelper` module, the term dropped is
+used in log messages to describe data that was tried at least once and
+will not be retried, which matches the processor's definition of
+`refused` in the sense that data was submitted to the next component
+in the pipeline and failed and does not match the processor's
+definition `dropped`. When counting these spans, they may or may not
+be treated as send failures, depending on whether "queue-sender"
+behavior was configured or not (and whether it is a persistent queue
+or not).
+
+As the exporter helper is not part of a processor framework, it does
+not have a conventional way to count dropped items. When the
+queue-sender is enabled and the queue is full, items are dropped in
+the standard sense, but they are counted using an `enqueue_failed`
+metric.
+
+#### Practice of suppressing errors
+
+Continuing the analysis of existing pipeline monitoring behavior, when
+the Collector exporter helper is configured with a non-persistent
+queue-sender, errors are suppressed. Error suppression is common and
+meant to support use-cases where there would otherwise be a
+potential to send duplicate information with negative impact. This
+ordinarily happens when there is fan-in to or fan-out from a
+component.
+
+For example, when the core batch processor forms a batch using items
+from multiple producers (e.g., as by multiple clients sending, or
+multiple receivers configured), there is a many-to-many relationship
+between arriving batches and outgoing batches. When some of the
+producers' data failed but some of it was successful, the standard
+practice
+
+TODO: talk about partial success, and how we should use it. How SDKs
+and collectors should preserve this in their counting, where
+"rejected" is failure-dropped by a subsequent pipeline and conveyed by
+success because some of the data was not dropped.
+
+The recommendation here is that Collectors try to avoid suppressing
+errors, because it makes SDK pipeline monitoring unreliable. SDKs
+can't honestly report failed telemetry if Collectors are suppressing
+errors. Instead, we should count partial success, which naturally
+prevents retry but still preserves failure information.
+
 #### Loss rate calculation
 
 The benefit of using a single metric instrument is that aggregation is
@@ -133,19 +228,7 @@ items at the SDKs.
 TotalLossRate = 1 - sum(collector_items{}) / sum(sdk_items{})
 ```
 
-#### Dropped means intentionally not transmitted
-
-In the SDK, we may consider the case of dropped items as a form of
-failure, but these are intentional failures. Looking at Collector
-processors, we see that dropped items may or may not be considered
-success, but whatever they are, they result from an intentional
-decision not to transmit an item.
-
-As an example, when a sampler processor drops spans, it should not be
-considered failure. 
From a1177793ef2449bd921d042e08940ec8fb3f1326 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 20 Dec 2023 15:31:25 -0800 Subject: [PATCH 06/13] wip update --- text/metrics/0238-pipeline-monitoring.md | 362 ++++------------------- 1 file changed, 57 insertions(+), 305 deletions(-) diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md index fd8ba3161..02d4a4920 100644 --- a/text/metrics/0238-pipeline-monitoring.md +++ b/text/metrics/0238-pipeline-monitoring.md @@ -5,16 +5,15 @@ export-pipeline metrics with three standard levels of detail. ## Motivation -OpenTelemetry has pending requests to standardize the metrics emitted -by SDKs. At the same time, the OpenTelemetry Collector is becoming a -stable and critical part of the ecosystem, and it has different -semantic conventions. Here we attempt to unify them. +OpenTelemetry has pending requests to standardize conventions for the +metrics emitted by SDKs. At the same time, the OpenTelemetry Collector +is becoming a stable and critical part of the ecosystem, and it has +different semantic conventions. Here we attempt to unify them. ## Explanation The OpenTelemetry Collector's pipeline metrics were derived from the -OpenCensus collector. There is no original source material explaining -the current state of metrics in the OTel collector. +OpenCensus collector. ### Collector metrics @@ -39,7 +38,7 @@ would add a `otelcol_` prefix and replace the invalid characters with `_`. The same metric in the example above would appear named `otelcol_receiver_accepted_spans`. -#### Obsreport receiver +#### Collector: Obsreport receiver metrics For receivers, the obsreport library counts items in two ways: @@ -94,12 +93,12 @@ practice to monitor a telemetry pipeline using three counters to count successful, failed, and dropped items. A central aspect of the proposed specification is to use a single -metric instrument with three exclusive attribute values, as compared -with the use of three separate metric instruments. +metric instrument with exclusive attribute values, as compared with +the use of separate metric instruments. By specifying attribute dimensions for the resulting single instrument, users can configure the level of detail and the number of -timeseries needed to convey the information they to monitor. +timeseries needed to convey the information they want to monitor. #### Meaning of "dropped" telemetry @@ -110,8 +109,8 @@ components shows the following uses. In the SDK, the standard OpenTelemetry BatchSpanProcessor will drop spans that cannot be admitted into its queue. These cases are intentional, to protect the application and downstream pipeline, but -they should be considered failure because they were meant to be be -collected. +they should be considered failure because they were sampled, and not +collecting them in general will lead to trace incompleteness. In a Collector pipeline, there are formal and informal uses: @@ -144,7 +143,9 @@ The memory limiter source code actually has a comment on this topic, which adds to the confusion -- it is not standard practice for receivers to retry in the OpenTelemetry collector, that is the duty of -exporters in our current practice. +exporters in our current practice. So, the memory limiter component, +to be consistent, should count "failure drops" to indicate that the +next stage of the pipeline did not see the data. 
There is still another use of "dropped" in the collector, similar to
 the memory limiter example and the SDK use-case, where "dropped" is a
 case of failure. In the `exporterhelper` module, the term dropped is
 used in log messages to describe data that was tried at least once and
 will not be retried, which matches the processor's definition of
 `refused` in the sense that data was submitted to the next component
 in the pipeline and failed and does not match the processor's
-definition `dropped`. When counting these spans, they may or may not
-be treated as send failures, depending on whether "queue-sender"
-behavior was configured or not (and whether it is a persistent queue
-or not).
+definition `dropped`.
 
 As the exporter helper is not part of a processor framework, it does
 not have a conventional way to count dropped items. When the
 queue-sender is enabled and the queue is full, items are dropped in
 the standard sense, but they are counted using an `enqueue_failed`
 metric.
 
-#### Practice of suppressing errors
-
-Continuing the analysis of existing pipeline monitoring behavior, when
-the Collector exporter helper is configured with a non-persistent
-queue-sender, errors are suppressed. Error suppression is common and
-meant to support use-cases where there would otherwise be a
-potential to send duplicate information with negative impact. This
-ordinarily happens when there is fan-in to or fan-out from a
-component.
-
-For example, when the core batch processor forms a batch using items
-from multiple producers (e.g., as by multiple clients sending, or
-multiple receivers configured), there is a many-to-many relationship
-between arriving batches and outgoing batches. When some of the
-producers' data failed but some of it was successful, the standard
-practice
-
-TODO: talk about partial success, and how we should use it. How SDKs
-and collectors should preserve this in their counting, where
-"rejected" is failure-dropped by a subsequent pipeline and conveyed by
-success because some of the data was not dropped.
-
-The recommendation here is that Collectors try to avoid suppressing
-errors, because it makes SDK pipeline monitoring unreliable. SDKs
-can't honestly report failed telemetry if Collectors are suppressing
-errors. Instead, we should count partial success, which naturally
-prevents retry but still preserves failure information.
-
-#### Loss rate calculation
-
-The benefit of using a single metric instrument is that aggregation is
-easy to apply, particularly in Metric SDKs using the standard Views
-mechanism. This means it is both easy and natural to configure an SDK
-to produce more or less detail, so that both advanced and basic
-use-cases are possible.
+## Proposed semantic conventions
+
+### Use of a single metric name
+
+The use of a single metric name is less confusing than the use of
+multiple metric names, because the user has to know only a single name
+to write useful queries. Users working with existing collector and
+SDK pipeline monitoring metrics have to remember at least three metric
+names and explicitly join them in custom metric queries. 
For example, to
+calculate loss rate for an SDK using traditional pipeline metrics,
+
+```
+LossRate_MultipleMetrics = (dropped + failed) / (dropped + failed + success)
+```
+
+On the other hand, with a uniform boolean attribute indicating success
+or failure the resulting query is simpler.
+
+```
+LossRate_SingleMetric = items{success=false} / items{success=*}
+```
+
+In a typical metric query engine, after the user has entered the single
+metric name, attribute values will be automatically surfaced in the
+user interface, allowing them to make sense of the data and
+interactively build useful queries. On the other hand, the user who
+has to query multiple metrics has to enter each metric name
+explicitly without help from the user interface.
+
+The proposed metric instrument would be named distinctly depending on
+whether it is a collector or an SDK, to prevent accidental aggregation
+of these timeseries. The specified counter names are:
+
+- `otelsdk.producer.items`: count of successful and failed items of
+  telemetry produced by an OpenTelemetry SDK, by signal type.
+- `otelcol.receiver.items`:
+- `otelcol.processor.items`:
+- `otelcol.exporter.items`:
+
+### Recommended conventional attributes
+
+- `otel.success` (boolean): This is true or false depending on whether the
+  component considers the outcome a success or a failure.
+- `otel.outcome` (string): This describes the outcome in a more specific
+  way than `otel.success`, with recommended values specified below.
+- `otel.signal` (string): This is the name of the signal (e.g., "logs",
+  "metrics", "traces")
+- `otel.name` (string): Name of the component
+- `otel.pipeline` (string):
+
 #### Collector
 
 TODO: Why we count only drops for pipeline segments, but not SDKs.
 
-### Pipeline monitoring
-
-The term _Stage_ is used to describe a single component in an
-export pipeline.
-
-The term _Station_ is used to describe a location in the export
-pipeline where the participating stages are part of the same logical
-failure domain. Typically each SDK or Collector is considered a
-station.
-
-#### Station integrity principles
-
-The [OpenTelemetry library guidelines (point
-4)](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/library-guidelines.md#requirements)
-describes a separation of protocol-dependent ("receivers",
-"exporters") and protocol-independent ("processors") parts. We refer
-to the combination of parts as a station.
-
-The station concept is called out because within a station, we expect
-that the station (software) acts responsibly by design, for the
-integrity of the pipeline. 
Stations allow data to enter a pipeline
-only through receiver components. Stations are never responsible for
-dropping data, because only processor components drop data. Stations
-allow data to leave a pipeline only through exporter components.
-
-Because of station integrity, we can make the following assertions:
-
-1. Data that enters a station is eventually exported or dropped.
-2. No other outcomes are possible.
-
-These principles suggest ways to monitor a pipeline, when normal-level
-metric detail is configured, to avoid redundancy. For simple
-pipelines, the number of items exported equals the number of items
-received minus the number of items dropped, and for simple pipelines
-it is sufficient to observe only successes and failures by receivers as
-well as items dropped by processors.
-
-#### Practice of error suppression
-
-There is an accepted practice in the OpenTelemetry Collector of
-accepting data and returning success before the data is written to its
-final destination. In fact, this is the out-of-the-box default for
-most pipelines, because of `exporterhelper` defaults.
-
-Suppressing errors, when it is practiced, means a later stage in the
-pipeline must be monitored to detect resource exhaustion, since
-earlier stages will not see any failures or experience backpressure.
-Because the practice of asynchronous reporting is widespread,
-OpenTelemetry Collectors therefore should normally count exported data
-in addition to received data, despite creating redundancy when errors
-are not suppressed.
-
-#### Pipeline stage-name uniqueness
-
-The Pipeline Stage Name Uniqueness requirement developed here avoids
-over-counting in an export pipeline by ensuring that no single metric
-name counts items more than once in transit. This rule prevents
-counting items of telemetry sent by SDKs and Collectors in the same
-metric; it also prevents counting items of telemetry sent through a
-multi-tier arrangement of Collectors using the same metric.
-
-In a standard deployment of OpenTelemetry, we expect one, two, or
-three stations in a collection pipeline. The names given to this
-standard set of stations are:
-
-- `sdk`: an original source of new telemetry
-- `agent`: a collector with operations "local" to the `sdk`
-- `gateway`: a collector serving as a proxy to an external service.
-
-This is not meant as an exclusive set of station names. Users should
-be given the ability to configure the station name used by particular
-instances of the OpenTelemetry Collector. It may even be desirable to
-support configuring "sub-stations" within a larger pipeline, for
-example when there are connectors in use; however, if so, the
-collector must enforce that pipeline-stage names are unique within a
-station.
-
-#### Pipeline signal type
-
-OpenTelemetry currently has three signal types, but it may add more.
-Instead of using the signal name in the metric names, we opt for a
-general-purpose noun that usefully describes any signal.
-
-The signal-agnostic term used here is "items", referring to spans, log
-records, and metric data points. A `signal` attribute will be used
-to distinguish them, with values `traces`, `logs`, or `metrics`.
-
-Users are expected to understand that the data item for traces is a
-span, for logs is a record, and for metrics is a point. Users may
-opt-in to removing this attribute, in which case items of telemetry
-data will be counted in aggregate. When the `signal` attribute is
-removed, loss-rate can likewise only be calculated in aggregate. 
-
-#### Pipeline component name
-
-Components are uniquely identified using a descriptive `name`
-attribute which encompasses at least a short name describing the type
-of component being used (e.g., `batch` for the SDK BatchSpanProcessor
-or the Collector batch processor).
-
-When there is more than one component of a given type active in a
-pipeline having the same `domain` and `signal` attributes, the `name`
-should include additional information to disambiguate the multiple
-instances using the syntax `/`. For example, if there
-were two `batch` processors in a collection pipeline (e.g., one for
-error spans and one for non-error spans) they might use the names
-`batch/error` and `batch/noerror`.
-
-#### Pipeline monitoring diagram
-
-The relationship between items received, dropped, and exported is
-shown in the following diagram.
-
-![pipeline monitoring metrics](../images/otel-pipeline-monitoring.png)
-
-### Proposed metrics semantic conventions
-
-The proposed metric names match the following pattern:
-
-| Metric Name | Meaning |
-|---------------------------|--------------------------------------------------------------------|
-| `otel.{station}.received` | Inclusive count of items entering the pipeline at a station. |
-| `otel.{station}.dropped` | Non-inclusive count of items dropped by a processor in a pipeline. |
-| `otel.{station}.exported` | Inclusive count of items exiting the pipeline at a station. |
-
-The behavior specified for SDKs and Collectors at each level of detail
-is different, because SDKs do not receive items from a pipeline and
-because they outnumber the other components.
-
-These attributes can be applied to any of the pipeline monitoring
-metrics specified here.
-
-| Attributes | Meaning | Level of detail (Optional) | Examples |
-|----------------|---------------------------------------------|----------------------------|------------------------------------------------------------|
-| `otel.signal` | Name of the telemetry signal | Basic (Opt-out) | `traces`, `logs`, `metrics` |
-| `otel.name` | Type, name, or "type/name" of the component | Normal (Opt-out) | `probabilitysampler`, `batch`, `otlp/grpc` |
-| `otel.success` | Boolean: item considered success? | Normal (Opt-out) | `true`, `false` |
-| `otel.reason` | Explanation of success/failures. | Detailed (Opt-in) | `ok`, `timeout`, `permission_denied`, `resource_exhausted` |
-| `otel.scope` | Name of instrumentation. | Detailed (Opt-in) | `opentelemetry.io/library` |
-
-For example, when a sampler processor drops an item it may report
-`success=true`, but when a queue processor drops an item it may report
-`success=false`.
-
-#### SDK default configuration
-
-| Metric name | Enablement level |
-|---------------------|------------------|
-| `otel.sdk.received` | Basic |
-| `otel.sdk.dropped` | Normal |
-| `otel.sdk.exported` | Detailed |
-
-#### Collector default configuration
-
-| Metric name | Enablement level |
-|---------------------------|------------------|
-| `otel.{station}.received` | Basic |
-| `otel.{station}.dropped` | Basic |
-| `otel.{station}.exported` | Basic |
-
-## Pipeline monitoring as a service
-
-With this specification, users operating OpenTelemetry SDKs and
-Collectors sending to a third-party observability system may expect to
-be provided with information about telemetry losses.
-
-### Configuration of pipeline monitoring metrics
-
-OpenTelemetry SDKs will be configured by default to send pipeline
-monitoring metrics using Meter instances obtained from the global
-meter provider. 
-
-SDK authors SHOULD provide users with an option to configure an
-alternate destination for pipeline monitoring metrics, so that
-pipeline metrics can be monitored independently of ordinary telemetry
-data.
-
-### Inference rules for service providers
-
-By design, losses and failures due to intermediate collectors will be
-observable to the service provider, as long as all original producers
-report pipeline-monitoring metrics. For all telemetry producers with
-basic-level pipeline monitoring enabled, the telemetry system will be
-able to compare the actual number of OpenTelemetry spans, metric data
-points, and log records received against the numbers that entered the
-system's own pipelines. In aggregate, the number of items entering
-the pipeline should match the number of items successfully received,
-otherwise the system is capable of reporting the combined losses to
-the user.
-
-When losses are unacceptable to the user, or the causes of loss cannot
-be resolved through other system indicators, the user (or the system
-acting on the user's behalf) may wish to enable normal-level detail
-for SDKs or enable metrics for intermediate collectors. The
-additional detail will give the system information that can be used to
-narrow down the location(s) and source(s) where loss occurs.
-
-## OpenTelemetry SDKs have no receivers
-
-OpenTelemetry SDKs are special pipeline components because they do not
-receive data from an external source. OpenTelemetry SDKs support
-signal-specific features that may appear to act like standard
-receivers or processors in a pipeline, but as specified here, pipeline
-monitoring only applies to items of data that are submitted for
-export by an SDK-internal mechanism.
-
-There are several examples:
-
-- In the trace SDK, spans are sampled by a component that emulates a
-  pipeline processor, but the actions of a sampler are not the same as
-  an export pipeline. While we could count spans not sampled using
-  `otel.sdk.dropped` with `success=true` and
-  `name=traceidratiosampler`, it could lead to a misleading
-  interpretation because spans that are unfinished are neither
-  `success=true` nor `success=false`. Different semantic conventions
-  should probably be used to monitor sampler components.
-- In the metrics SDK, Metric View configuration can cause metric
-  events to be "dropped". We could count all metric data points as
-  logically entering a pipeline, and then the ones that are dropped
-  would appear as `otel.sdk.dropped` with `success=true` and
-  `name=metricreader`, but this would lead to an accounting problem.
-  The point of Drop aggregation is to avoid the cost of a metric
-  instrument, so we do not wish to count what we drop.
-
-For these reasons, the `otel.sdk.received` metric is defined as the
-number of items that the SDK produces as input to the pipeline. This
-quantity MUST be the number of items that are expected for
-delivery at the final destination, when the pipeline is operating
-correctly and without failures.
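[Editor's note: A worked example with invented counts may help fix the definition. Suppose a trace SDK finishes 90 sampled spans during an interval and its batch processor's queue rejects 15 of them; using this section's definitions and the `exported = received - dropped` identity stated earlier:]

```
otel.sdk.received = 90   # spans submitted for export
otel.sdk.dropped  = 15   # spans that will not reach the destination
otel.sdk.exported = 75   # received - dropped
```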
- ## Metrics SDK special considerations We expect that Metrics SDKs will be used to generate From 3ce051090e2b3b0500c85d98e841e79dd7f7d7fe Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 20 Dec 2023 15:36:32 -0800 Subject: [PATCH 07/13] wip2 --- text/metrics/0238-pipeline-monitoring.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md index 02d4a4920..aa6520a65 100644 --- a/text/metrics/0238-pipeline-monitoring.md +++ b/text/metrics/0238-pipeline-monitoring.md @@ -209,14 +209,13 @@ of these timeseries. The specified counter names: way than `otel.success`, with recommended values specified below. - `otel.signal` (string): This is the name of the signal (e.g., "logs", "metrics", "traces") -- `otel.name` (string): Name of the component -- `otel.pipeline` (string): +- `otel.name` (string): Name of the component in a pipeline. +- `otel.pipeline` (string): Name of the pipeline in a collector. #### Collector TODO: Why we count only drops for pipeline segments, but not SDKs. - ## Metrics SDK special considerations We expect that Metrics SDKs will be used to generate From bbcf391a2d11a77f6bfc561fe5474f364c7b48d9 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 3 Jan 2024 12:16:14 -0800 Subject: [PATCH 08/13] specify values --- text/metrics/0238-pipeline-monitoring.md | 93 ++++++++++++++++++++---- 1 file changed, 79 insertions(+), 14 deletions(-) diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md index aa6520a65..3dd24750a 100644 --- a/text/metrics/0238-pipeline-monitoring.md +++ b/text/metrics/0238-pipeline-monitoring.md @@ -1,7 +1,7 @@ # OpenTelemetry Export-pipeline metrics Propose a uniform standard for OpenTelemetry SDK and Collector -export-pipeline metrics with three standard levels of detail. +export-pipeline metrics with support for multiple levels of detail. ## Motivation @@ -94,7 +94,7 @@ successful, failed, and dropped items. A central aspect of the proposed specification is to use a single metric instrument with exclusive attribute values, as compared with -the use of separate metric instruments. +the use of separate, exclusive metric instruments. By specifying attribute dimensions for the resulting single instrument, users can configure the level of detail and the number of @@ -119,15 +119,18 @@ In a Collector pipeline, there are formal and informal uses: In this case, drops are considered success. - The memorylimiter processor, for example, may "drop" spans because it was instructed to (e.g., when it is above a hard limit). - However, when it does this, it returns an error counts the item as + However, when it does this, it returns an error counting the item as `refused`, contradicting the documentation of that metric instrument: > "Number of spans that were rejected by the next component in the pipeline." -There is already an inconsistency. By counting its own failures as -refused, we should expect that the next component in the pipeline -handled the data. This is a failure case drop, one where the next -component in the pipeline does not handle the item: +There is already an inconsistency, along with a new term "rejected". +By counting its own failures as refused, we should expect that the +next component in the pipeline handled the data. 
This is a failure
+case drop, one where the next component in the pipeline does not
+handle the item; however, counting drops as refused leads to
+inconsistency, since refused spans should be visibly counted by the
+next stage in the pipeline.
 
 > "Number of spans that were dropped."
 
 The memory limiter source code actually has a comment on this topic,
@@ -164,14 +167,30 @@ metric.
 
 ## Proposed semantic conventions
 
+Following the analysis above, the main problem being addressed is
+confusion over the meaning of "dropped", which is sometimes success
+and sometimes failure. The use of a single metric with optional
+attributes allows us to explicitly count success and failure while
+optionally counting additional dimensions. As we will see, this
+allows introducing newly-distinct outcomes without breaking past
+conventions.
+
+For example, the term "rejected" has a formal definition in
+OpenTelemetry that is not expressed by existing metrics. An item of
+telemetry is considered rejected when it is included in a successful
+request but was individually dropped (for stated reasons) and should
+not be retried; these items were successfully sent but dropped (due to
+partial success) after processing by the next stage in the pipeline.
+
 ### Use of a single metric name
 
 The use of a single metric name is less confusing than the use of
 multiple metric names, because the user has to know only a single name
-to writing useful queries. Users working with existing collector and
-SDK pipeline monitoring metrics have to remember at least three metric
-names and explicitly join them custom metric queries. For example, to
-calculate loss rate for an SDK using traditional pipeline metrics,
+to write useful queries. Users working with existing collector and
+SDK pipeline monitoring metrics have to remember at least three metric
+names and explicitly join them via custom metric queries. For
+example, to calculate loss rate for an SDK using traditional pipeline
+metrics,
 
 ```
 LossRate_MultipleMetrics = (dropped + failed) / (dropped + failed + success)
 ```
 
 On the other hand, with a uniform boolean attribute indicating success
 or failure the resulting query is simpler.
 
 ```
 LossRate_SingleMetric = items{success=false} / items{success=*}
 ```
 
 In a typical metric query engine, after the user has entered the one
 metric name, attribute values will be automatically surfaced in the
 user interface, allowing them to make sense of the data and
 interactively build useful queries. On the other hand, the user who
 has to query multiple metrics has to enter each metric name
 explicitly without help from the user interface.
 
@@ -193,13 +212,19 @@
 
 The proposed metric instrument would be named distinctly depending on
 whether it is a collector or an SDK, to prevent accidental aggregation
-of these timeseries. The specified counter names:
+of these timeseries. The specified counter names would be:
 
-- `otelsdk.producer.items`: count of successful and failed items of
-  telemetry produced by signal type by an OpenTelemetry SDK.
-- `otelcol.receiver.items`:
-- `otelcol.processor.items`:
-- `otelcol.exporter.items`:
+- `otelsdk.producer.items`: count of successful and failed items of
+  telemetry produced, by signal type, by an OpenTelemetry SDK.
+- `otelcol.receiver.items`: count of successful and failed items of
+  telemetry received, by signal type, by an OpenTelemetry Collector
+  receiver component.
+- `otelcol.processor.items`: count of successful and failed items of
+  telemetry processed, by signal type, by an OpenTelemetry Collector
+  processor component.
+- `otelcol.exporter.items`: count of successful and failed items of
+  telemetry exported, by signal type, by an OpenTelemetry Collector
+  exporter component.
 
 ### Recommended conventional attributes
 
@@ -212,9 +237,49 @@
 - `otel.name` (string): Name of the component in a pipeline.
 - `otel.pipeline` (string): Name of the pipeline in a collector.
 
+### Specified `otel.outcome` attribute values
+
+The `otel.outcome` attribute indicates extra information about a
+success or failure. A set of standard conventional attribute values
+is supplied, however it should not be considered a closed set.
If +these outcomes do not accurately explain the reason for a success or +failure outcome, they can be extended by users with alternative, +low-cardinality explanatory values. + +For success: + +- `ok`: Indicates a normal success case. The item was handled by the + next stage of the pipeline, which returned success. +- `not_sampled`: Indicates a successful drop case, due to sampling. + The item was intentionally not handled by the next stage of the + pipeline. + +For failure: + +- `timeout`: The item was in the process of being transmitted but the + request timed out. +- `queue_full`: Indicates a dropped item because a local, limited-size + queue is at capacity. The item was not handled by the next stage of + the pipeline. If the item was handled by the next stage of the + pipeline, use `resource_exhausted`. +- `resource_exhausted`: The item was handled by the next stage of the + pipeline, which returned an error code indicating that it was + overloaded. If the resource being exhausted is local and the item + was not handled by the next stage of the pipeline, use `queue_full`. +- `rejected`: The item was handled by the next stage of the pipeline, + which returned a partial success status indicating that some items + could not be accepted. +- `transient`: The item was handled by the next stage of the pipeline, + which returned a retryable error status not covered by any of the + above values. +- `permanent`: The item was handled by the next stage of the pipeline, + which returned a permanent error status not covered by any of the + above values. + #### Collector TODO: Why we count only drops for pipeline segments, but not SDKs. +TODO: What to do about suppression. ## Metrics SDK special considerations From 133419151f506603205ac299056548193858cd7d Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 3 Jan 2024 15:15:30 -0800 Subject: [PATCH 09/13] the long tail (start) --- text/metrics/0238-pipeline-monitoring.md | 137 ++++++++++++++++++++--- 1 file changed, 124 insertions(+), 13 deletions(-) diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md index 3dd24750a..ed3d1f653 100644 --- a/text/metrics/0238-pipeline-monitoring.md +++ b/text/metrics/0238-pipeline-monitoring.md @@ -241,23 +241,25 @@ of these timeseries. The specified counter names would be: The `otel.outcome` attribute indicates extra information about a success or failure. A set of standard conventional attribute values -is supplied, however it should not be considered a closed set. If -these outcomes do not accurately explain the reason for a success or -failure outcome, they can be extended by users with alternative, -low-cardinality explanatory values. +is supplied and is considered a closed set. If these outcomes do not +accurately explain the reason for a success or failure outcome, they +SHOULD be extended by OpenTelemetry. For success: -- `ok`: Indicates a normal success case. The item was handled by the - next stage of the pipeline, which returned success. -- `not_sampled`: Indicates a successful drop case, due to sampling. +- `consumed`: Indicates a normal, synchronous request success case. + The item was consumed by the next stage of the pipeline, which + returned success. +- `unsampled`: Indicates a successful drop case, due to sampling. The item was intentionally not handled by the next stage of the pipeline. +- `queued`: Indicates the component admitted items into a queue and + then allowed the request to return before the final outcome was known. 
For failure:
 
-- `timeout`: The item was in the process of being transmitted but the
-  request timed out.
+- `timeout`: The item was in the process of being sent but the request
+  timed out.
 - `queue_full`: Indicates a dropped item because a local, limited-size
   queue is at capacity. The item was not handled by the next stage of
   the pipeline. If the item was handled by the next stage of the
@@ -276,10 +278,119 @@ For failure:
   which returned a permanent error status not covered by any of the
   above values.
 
-#### Collector
-
-TODO: Why we count only drops for pipeline segments, but not SDKs.
-TODO: What to do about suppression.
+### Error suppression behavior
+
+OpenTelemetry collector exporter components have existing error
+suppression behavior, optionally obtained through the `exporterhelper`
+library, which causes the `Consume()` function to return success for
+what would ordinarily count as failure. This behavior makes automatic
+component health status reporting more difficult than necessary.
+
+One goal of this proposal is that Collector component health could be
+automatically inferred from metrics. Therefore, error suppression
+performed by a component SHOULD NOT alter the `otel.success` attribute
+value used in counting.
+
+Error suppression is naturally exposed as inconsistency in pipeline
+metrics between the component and preceding components in the
+pipeline. When an exporter suppresses errors, the processors and
+receivers that it consumes from will (in aggregate) report
+`otel.success=true` for more items than the exporter itself.
+
+As an option, the Collector MAY alter the `otel.outcome` attribute
+value indicated when errors are suppressed, in conjunction with the
+`otel.success=true` attribute. Instead of `otel.outcome=consumed`,
+components can form a string using `suppressed:` followed by the
+suppressed outcome (e.g., `otel.outcome=suppressed:queue_full`). This
+is optional because it could require substantial new code for the
+collector component framework to track error suppression across
+components.
+
+### Batch processor behavior
+
+Current `batchprocessor` behavior is to return success when the item
+is accepted into its internal queue. This specification would add
+`otel.outcome=queued` to the success response.
+
+Note the existing Collector core `batchprocessor` component has no
+option to block until the actual outcome is known. If it had that
+option, the Collector would need a way to return the failure to its
+preceding component.
+
+Note that the `batchprocessor` component was designed before OTLP
+introduced `PartialSuccess` messages, which provide a way to return
+success, meaning not to retry, even when some or all of the data was
+ultimately rejected by the pipeline.
+
+### Rejected points behavior
+
+Note that the current Collector does not account for the number of
+items rejected, as introduced in OTLP through `PartialSuccess`
+response messages. The error suppression semantic specified here is
+compatible with this existing behavior, in the sense that rejected
+points are being counted as successes. Collectors SHOULD count
+rejected points as failed according to the specification here unless
+error suppression is enabled.
+
+Since rejected points are generally part of successful export
+requests, they are naturally suppressed from preceding pipeline
+components.
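[Editor's note: A Go sketch of the optional attribute form described above; the helper function and its inputs are hypothetical, and only the attribute names and the `suppressed:` prefix come from this text.]

```
package pipelinemetrics

import "go.opentelemetry.io/otel/attribute"

// outcomeAttributes builds the proposed counter attributes for one
// consumed batch. When a downstream error was suppressed, the item
// still counts as otel.success=true toward the producer, and the true
// outcome is preserved as "suppressed:<outcome>".
func outcomeAttributes(outcome string, success, suppressed bool) []attribute.KeyValue {
	if suppressed {
		return []attribute.KeyValue{
			attribute.Bool("otel.success", true),
			attribute.String("otel.outcome", "suppressed:" + outcome),
		}
	}
	return []attribute.KeyValue{
		attribute.Bool("otel.success", success),
		attribute.String("otel.outcome", outcome),
	}
}
```

For example, a queue-full failure hidden by `exporterhelper` would be counted with `otel.success=true` and `otel.outcome=suppressed:queue_full`.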
+
+### SDKs are not like Collectors
+
+The proposed specification uses one metric per SDK instance
+(`otelsdk.producer.items`) while it uses three per Collector instance
+(`otelcol.*.items`) for the three primary component categories.
+
+This is justified as follows:
+
+- SDKs are net producers of telemetry, while Collectors pass telemetry
+  through, therefore we monitor these components in different ways.
+- It is meaningless to aggregate pipeline metrics describing SDKs and
+  Collectors in a single metric. Collectors are generally
+  instrumented with OpenTelemetry SDKs, so this ambiguity is avoided.
+- SDKs are not consistent about component names. While tracing SDKs
+  have both processor and exporter components, there is no reason to
+  separately account for these components. On the other hand, metrics
+  SDKs do not have a "processor" component, they have a "reader"
+  component.
+
+### Connectors are both exporters and receivers
+
+Collectors have a special type of component called a "Connector",
+which acts as both a receiver and an exporter, possibly having
+different signal types. These components should be instrumented twice,
+making pipeline metrics available for both the receiver and exporter.
+
+Therefore, a single Connector component will show up twice, having
+both `otelcol.receiver.items` and `otelcol.exporter.items` counters.
+These two counters will have the same component name (i.e.,
+`otel.name` value), different pipeline name (i.e., `otel.pipeline`
+value) and possibly different signal type (i.e., `otel.signal` value).
+
+### Components fail in all sorts of ways
+
+The existing Collector `obsreport` framework is overly restrictive in
+terms of the available outcomes that can be counted. As discussed
+above, exporter components have no natural way to report dropped data
+when a queue is full.
+
+Processor components, for example, are able to report `refused`,
+`dropped`, and `success` outcomes but have no natural way to report
+internally-generated failures (e.g., `memorylimiter` discussed above).
+
+Another example concerns processors that introduce delay but wish to
+honor deadlines. There is not a natural way for processors to count
+timeouts. The proposed specification here allows all components to
+report failures on an item-by-item basis.
+
+###
+
+TODO: About how there are strongly-recommended dimensions. How certain attributes, if removed, lead to meaningful/useless outcomes.
+
+TODO: about level of detail: table of which attributes at which levels
+
+TODO: about trace-specific considerations: samplers are not counted, not covered here.

## Metrics SDK special considerations

We expect that Metrics SDKs will be used to generate

From f9d1e9ac6105d0b11b236a8403d78b9f90cb4fea Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Fri, 5 Jan 2024 14:20:35 -0800
Subject: [PATCH 10/13] draft with no more TODOs

---
 text/metrics/0238-pipeline-monitoring.md | 99 +++++++++++++++++++----
 1 file changed, 79 insertions(+), 20 deletions(-)

diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
index ed3d1f653..fc4090b01 100644
--- a/text/metrics/0238-pipeline-monitoring.md
+++ b/text/metrics/0238-pipeline-monitoring.md
@@ -278,6 +278,30 @@ For failure:
   which returned a permanent error status or partial success status
   indicating that some items could not be accepted.
 
+### Metric detail levels
+
+The five metric attributes specified above are recommended at all
+levels of metric detail for OpenTelemetry Collectors.
Collectors may
+wish to configure other instrumentation at higher levels of detail,
+such as counters for number of compressed bytes transmitted and
+received.
+
+For SDKs, two levels of detail are available, as options that allow
+the user to control whether each signal uses two timeseries or more,
+as many as seven per signal. The metrics level of detail determines
+which attributes are viewed, with the following recommended defaults.
+
+| Level  | SDK attributes                                |
+|--------|-----------------------------------------------|
+| Basic  | `otel.success`, `otel.signal`                 |
+| Normal | `otel.success`, `otel.signal`, `otel.outcome` |
+
+SDKs may wish to configure additional instrumentation at higher levels
+of detail, such as gRPC or HTTP-specific instrumentation for export
+operations.
+
+## Detailed rationale
+
 ### Error suppression behavior
 
 OpenTelemetry collector exporter components have existing error
@@ -384,35 +408,70 @@ honor deadlines. There is not a natural way for processors to count
 timeouts. The proposed specification here allows all components to
 report failures on an item-by-item basis.
 
-###
-
-TODO: About how there are strongly-recommended dimensions. How certain attributes, if removed, lead to meaningful/useless outcomes.
+### Strongly-recommended dimensions
 
-TODO: about level of detail: table of which attributes at which levels
+We are aware of the potential to accidentally combine metrics from a
+pipeline in ways that lead to meaningful but potentially surprising
+results. This is best explained by example:
 
-TODO: about trace-specific considerations: samplers are not counted, not covered here.
+- A collector pipeline contains two processor elements. If the
+  `otelcol.processor.items` metric is aggregated to remove the
+  `otel.name` attribute, then the resulting aggregate will count two
+  items for every one passing through the pipeline.
+- A multi-stage pipeline passes telemetry through an agent collector
+  followed by a gateway collector. If any of the `otelcol.*.items`
+  metrics are aggregated such that the agent-vs-gateway distinction is
+  lost, then the resulting aggregate counts two for every one item
+  passing through the pipeline (this distinction is a resource-level
+  attribute, not covered by these semantic conventions).
 
-## Metrics SDK special considerations
+These are meaningful and correct aggregations, just not usually what
+the user is looking for. Products built to monitor OpenTelemetry
+pipelines specifically should be aware that values in these dimensions
+are potentially repeated.
+
+## SDK special considerations
 
 We expect that Metrics SDKs will be used to generate
 pipeline-monitoring metrics reporting about themselves.
 
-As stated above, SDKs SHOULD support configuring an alternate Meter
-Provider for pipeline-monitoring metrics. When the global Meter
-Provider is used, the Metrics SDK's pipeline will receive its own
+All SDKs SHOULD support configuring an alternate Meter Provider for
 pipeline-monitoring metrics. When a custom Meter Provider is used, a
-secondary pipeline will receive the pipeline monitoring metrics, in
-which case the secondary pipeline may also self-report for itself.
-
-## Trade-offs and mitigations
-
-The use of three levels of metric detail may seem like more freedom
-than necessary. Implementors are expected to take advantage of Metric
-View configuration in the Metrics SDK for configuring opt-out of
-standard attributes (i.e., to remove `otel.signal`, `otel.name`, or
-`otel.success`).
For opt-in attributes (i.e., to configure no
-`otel.reason` or `otel.scope` attribute), implementors MAY choose to
-enable additional attributes only when configured.
+secondary SDK will receive the pipeline monitoring metrics.
+
+When an SDK is used as the global Meter Provider (or when a
+secondary SDK monitors a primary SDK), that Metrics SDK will receive
+its own pipeline-monitoring metrics. When a Metrics SDK reports to
+itself, the export operations that occur during shutdown MAY NOT be
+instrumented.
+
+## Trace SDK special considerations
+
+Samplers are not considered part of the telemetry pipeline, because
+Spans are unfinished at the time they start and cannot be considered
+part of a pipeline until they finish. Sampler behavior is not covered
+by these semantic conventions. The number counted by
+`otelsdk.producer.items` includes items that are exported and/or
+dropped by the Span Processor.
+
+When the Metrics SDK responsible for reporting pipeline metrics is not
+functional for any reason, another signal may be necessary to
+effectively monitor the situation.
+
+For this reason, all SDKs SHOULD support configuring a Tracer Provider
+for monitoring pipeline export operations. Users concerned about
+potential loss of metrics service may wish to configure an independent
+trace data pipeline. In that case, the span's duration should match
+the export operation, and the span name SHOULD be `OTel SDK Export`.
+
+The specified attributes are:
+
+- `otel.status` (string): OK, Error, or Unset
+- `otel.items` (int64): Number of spans, metric data points, or log records.
+- `otel.signal` (string): `traces`, `metrics`, or `logs`.
+
+These spans MAY contain additional detail, for example about the
+cause of failures.
 
 ## Prior art and alternatives
 
 Prior work in [this PR](https://github.com/open-telemetry/semantic-conventions/pull/184).
 
 Issues:
 - [Determine how to report dropped metrics](https://github.com/open-telemetry/opentelemetry-specification/issues/1655)
 - [How should OpenTelemetry-internal metrics be exposed?](https://github.com/open-telemetry/opentelemetry-specification/issues/959)
 - [OTLP Exporter must send client side metrics](https://github.com/open-telemetry/opentelemetry-specification/issues/791)
 - [Making Tracing SDK metrics aware](https://github.com/open-telemetry/opentelemetry-specification/issues/381)

From 1f48c3e559b78d8560a33ccdba474deebc5e719 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Wed, 31 Jan 2024 16:19:20 -0800
Subject: [PATCH 11/13] wip

---
 text/metrics/0238-pipeline-monitoring.md | 66 ++++++++++++++----------
 1 file changed, 39 insertions(+), 27 deletions(-)

diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
index fc4090b01..acd904913 100644
--- a/text/metrics/0238-pipeline-monitoring.md
+++ b/text/metrics/0238-pipeline-monitoring.md
@@ -1,7 +1,8 @@
-# OpenTelemetry Export-pipeline metrics
+# OpenTelemetry Telemetry Pipeline metrics
 
-Propose a uniform standard for OpenTelemetry SDK and Collector
-export-pipeline metrics with support for multiple levels of detail.
+Propose a uniform standard for telemetry pipeline metrics generated by
+OpenTelemetry SDKs and Collectors with support for multiple levels of
+detail.
 
 ## Motivation
 
@@ -10,6 +11,8 @@ metrics emitted by SDKs. At the same time, the OpenTelemetry Collector
 is becoming a stable and critical part of the ecosystem, and it has
 different semantic conventions. Here we attempt to unify them.
 
+The term Telemetry Pipeline is defined in the explanation below.
+
 ## Explanation
 
 The OpenTelemetry Collector's pipeline metrics were derived from the
@@ -103,8 +106,9 @@ timeseries needed to convey the information they want to monitor.
 
 #### Meaning of "dropped" telemetry
 
 The term "Dropped" in pipeline monitoring usually refers to telemetry
-that was intentionally not transmitted. A survey of existing pipeline
-components shows the following uses.
+that was intentionally not transmitted or that has failed and cannot
+be retried. A survey of existing pipeline components shows the
+following uses.
In the SDK, the standard OpenTelemetry BatchSpanProcessor will drop
 spans that cannot be admitted into its queue. These cases are
 intentional, to protect the application and downstream pipeline, but
 they should be considered failure because they were sampled, and not
 collecting them in general will lead to trace incompleteness.
 
@@ -165,6 +169,10 @@ queue-sender is enabled and the queue is full, items are dropped in
 the standard sense, but they are counted using an `enqueue_failed`
 metric.
 
+## Meaning of "rejected" telemetry
+
+TODO: https://github.com/open-telemetry/opentelemetry-collector/pull/9260#discussion_r1473516302
+
 ## Proposed semantic conventions
 
@@ -245,38 +253,42 @@
 
-For success:
+For success=true:
 
 - `consumed`: Indicates a normal, synchronous request success case.
   The item was consumed by the next stage of the pipeline, which
   returned success.
-- `unsampled`: Indicates a successful drop case, due to sampling.
-  The item was intentionally not handled by the next stage of the
-  pipeline.
-- `queued`: Indicates the component admitted items into a queue and
-  then allowed the request to return before the final outcome was known.
+- `enqueued`: Indicates the component admitted items into a queue and
+  then allowed the request to return success before the final outcome
+  was known.
+- `suppressed:<outcome>`: When the outcome is known but
+  the component intentionally returns success.
+
+For success=false:
+
+- `deadline_exceeded`: The item was in the process of being sent but the request
+  timed out, or its deadline was exceeded.
+- `resource_exhausted`: The item was handled by the next stage of the
+  pipeline, which returned an error code indicating that it was
+  overloaded. If the resource being exhausted is local and the item
+  was not handled by the next stage of the pipeline, use `dropped`.
+- `rejected`: The item was handled by the next stage of the pipeline,
+  which returned a permanent error status or partial success status
+  indicating that some items could not be accepted.
+- `retryable`: The item was handled by the next stage of the pipeline,
+  which returned a retryable error status not covered by any of the
+  above values.
+
+For both success=true and success=false:
+
+- `dropped`: Indicates an item that was not sent to the next stage of
+  the pipeline in both cases. Processors may use this to indicate
+  both success and failure; examples include sampling processors
+  and filtering processors, which successfully avoid sending data
+  based on configuration. For receivers and exporters, dropped data
+  indicates resource exhaustion of the component itself, in which case
+  the component reports points as dropped, while the producers will
+  see a resource exhausted status code.
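[Editor's note: To make the dual use of `dropped` concrete, here is a hypothetical Go helper for a processor; only the attribute names and the `dropped` value come from the list above, and the OpenTelemetry-Go calls used are the standard API.]

```
package pipelinemetrics

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// countDropped records items that will not reach the next stage of the
// pipeline. A sampling or filtering processor passes intentional=true
// (success with outcome "dropped"); a processor that fails internally
// and never forwards the items passes intentional=false.
func countDropped(ctx context.Context, items metric.Int64Counter, n int64, intentional bool) {
	items.Add(ctx, n, metric.WithAttributes(
		attribute.Bool("otel.success", intentional),
		attribute.String("otel.outcome", "dropped")))
}
```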
### Metric detail levels From 261323c74572408738a85e586bdbf2c8538cf83e Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 1 Feb 2024 21:00:18 -0800 Subject: [PATCH 12/13] add examples --- text/metrics/0238-pipeline-monitoring.md | 556 +++++------------------ 1 file changed, 120 insertions(+), 436 deletions(-) diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md index acd904913..da44f4c10 100644 --- a/text/metrics/0238-pipeline-monitoring.md +++ b/text/metrics/0238-pipeline-monitoring.md @@ -1,222 +1,29 @@ # OpenTelemetry Telemetry Pipeline metrics Propose a uniform standard for telemetry pipeline metrics generated by -OpenTelemetry SDKs and Collectors with support for multiple levels of +OpenTelemetry SDKs and Collectors with support for several levels of detail. +**WIP**: This document has been edited recently, based on reviewer +feedback. Since it has changed substantially, I removed a lot of +text. I will restore this document after sharing the revisions with +reviewers. + ## Motivation -OpenTelemetry has pending requests to standardize conventions for the -metrics emitted by SDKs. At the same time, the OpenTelemetry Collector -is becoming a stable and critical part of the ecosystem, and it has -different semantic conventions. Here we attempt to unify them. +OpenTelemetry desires to standardize conventions for the metrics +emitted by SDKs about success and failure of telemetry reporting. At +the same time, the OpenTelemetry Collector is becoming a stable and +critical part of the ecosystem, and it has existing conventions which +are expected to connect with metrics emitted by SDKs. -The term Telemetry Pipeline +We use the term "pipeline" to describe an arrangement of system +components which produce, consume, and process telemetry on its way +from the point of origin to the endpoint(s) in its journey. ## Explanation -The OpenTelemetry Collector's pipeline metrics were derived from the -OpenCensus collector. - -### Collector metrics - -The core collector formerly contained a package named `obsreport`, -which has a uniform interface dedicated to each of its components. -This package has been migrated into the commonly-used helper classes -known as `receiverhelper`, `processorhelper`, and `exporterhelper.` - -Obsreport is responsible for giving collector metrics a uniform -appearance. Metric names were created using OpenCensus style, which -uses a `/` character to indicate hierarchy and a `.` to separate the -operative verb and noun. This library creates metrics named, in -general, `{component-type}/{verb}.{noun}`, with component types -`receiver`, `processor`, and, `exporter`, and with signal-specific -nouns `spans`, `metric_points` and `logs` corresponding with the unit -of information for the tracing, metrics, and logs signals, -respectively. - -Earlier adopters of the Collector would use Prometheus to read these -metrics, which does not accept `/` or `.`. The Prometheus integration -would add a `otelcol_` prefix and replace the invalid characters with -`_`. The same metric in the example above would appear named -`otelcol_receiver_accepted_spans`. - -#### Collector: Obsreport receiver metrics - -For receivers, the obsreport library counts items in two ways: - -1. Receiver `accepted` items. Items that are received and - successfully consumed by the pipeline. -2. Receiver `refused` items. Items that are received and fail to be - consumed by the pipeline. - -Items are exclusively counted in one of these counts. 
The lifetime -average failure rate of the receiver com is defined as -`refused / (accepted + refused)`. - -#### Collector: Obsreport processor metrics - -For processors, the obsreport library counts items in three ways: - -1. Processor `accepted` items. Defined as the number of items that are passed to the next component and return successfully. -2. Processor `dropped` items. This is a counter of items that are - deliberately excluded from the output, which will be counted as accepted by the preceding pipeline component but were not transmitted. -3. Processor `refused` items. Defined as the number of items that are passed to the next component and fail. - -Items are exclusively counted in one of these counts. The average drop rate -can be defined as `dropped / (accepted + dropped + refused)` - -#### Collector: Obsreport exporter metrics - -The `obsreport_exporter` interface counts spans in two ways: - -1. Exporter `sent` items. Items that are sent and succeed. -2. Receiver `send_failed` items. Items that are sent and fail. - -Items are exclusively counted in one of these counts. The average -failure rate is defined as `send_failed / (sent + send_failed)`. - -The exporterhelper package takes on many aspects of processor -behavior, including the ability to drop when a queue is full. It uses -a separate counter for these items, known as `enqueue_failed`. - -### Jaeger trace SDK metrics - -Jaeger SDKs expose metrics on the "Reporter", which includes -"Success", "Failure", "Dropped" counters describing the pipeline. See -[here](https://github.com/jaegertracing/jaeger-client-go/blob/8d8e8fcfd04de42b8482476abac6a902fca47c18/metrics.go#L22-L106). - -Jaeger SDK metrics are equivalent to the three metrics produced by -OpenTelemetry Collector processor components. - -### Analysis - -As we can see by the examples documented above, it is a standard -practice to monitor a telemetry pipeline using three counters to count -successful, failed, and dropped items. - -A central aspect of the proposed specification is to use a single -metric instrument with exclusive attribute values, as compared with -the use of separate, exclusive metric instruments. - -By specifying attribute dimensions for the resulting single -instrument, users can configure the level of detail and the number of -timeseries needed to convey the information they want to monitor. - -#### Meaning of "dropped" telemetry - -The term "Dropped" in pipeline monitoring usually refers to telemetry -that was intentionally not transmitted or that has failed and cannot -be retried. A survey of existing pipeline components shows the -following uses. - -In the SDK, the standard OpenTelemetry BatchSpanProcessor will drop -spans that cannot be admitted into its queue. These cases are -intentional, to protect the application and downstream pipeline, but -they should be considered failure because they were sampled, and not -collecting them in general will lead to trace incompleteness. - -In a Collector pipeline, there are formal and informal uses: - -- A sampling processor, for example, may drop spans because it was - instructed to (e.g., due to an attribute like `sampling.priority=0`). - In this case, drops are considered success. -- The memorylimiter processor, for example, may "drop" spans because - it was instructed to (e.g., when it is above a hard limit). 
- However, when it does this, it returns an error counting the item as - `refused`, contradicting the documentation of that metric instrument: - -> "Number of spans that were rejected by the next component in the pipeline." - -There is already an inconsistency, along with a new term "rejected". -By counting its own failures as refused, we should expect that the -next component in the pipeline handled the data. This is a failure -case drop, one where the next component in the pipeline does not -handle the item, however counting drops as refused leads to -inconsitency, since refused spans should be visibly counted by the -next stage in the pipeline. - -> "Number of spans that were dropped." - -The memory limiter source code actually has a comment on this topic, - -``` -// TODO: actually to be 100% sure that this is "refused" and not "dropped" -// it is necessary to check the pipeline to see if this is directly connected -// to a receiver (ie.: a receiver is on the call stack). For now it -// assumes that the pipeline is properly configured and a receiver is on the -// callstack and that the receiver will correctly retry the refused data again. -``` - -which adds to the confusion -- it is not standard practice for -receivers to retry in the OpenTelemetry collector, that is the duty of -exporters in our current practice. So, the memory limiter component, -to be consistent, should count "failure drops" to indicate that the -next stage of the pipeline did not see the data. - -There is still another use of "dropped" in the collector, similar to -the memory limiter example and the SDK use-case, where "dropped" is a -case of failure. In the `exporterhelper` module, the term dropped is -used in log messages to describe data that was tried at least once and -will not be retried, which matches the processor's definition of -`refused` in the sense that data was submitted to the next component -in the pipeline and failed and does not match the processor's -definition `dropped`. - -As the exporter helper is not part of a processor framework, it does -not have a conventional way to count dropped items. When the -queue-sender is enabled and the queue is full, items are dropped in -the standard sense, but they are counted using an `enqueue_failed` -metric. - -## Meaning of "rejected" telemetry - -TODO: https://github.com/open-telemetry/opentelemetry-collector/pull/9260#discussion_r1473516302 - -## Proposed semantic conventions - -Following the analysis above, the main problem being addressed is -confusion over the meaning of "dropped", which is sometimes success -and sometimes failure. The use of a single metric with optional -attributes allows us to explicitly count success and failure while -optionally counting additional dimensions. As we will see, this -allows introducing newly-distinct outcomes without breaking past -conventions. - -For example, the term "rejected" has a formal definition in -OpenTelemetry that is not expressed by existing metrics. An item of -telemetry is considered rejected when it is included in a successful -request but was individually dropped (for stated reasons) and should -not be retried; these items were successfully sent but dropped (due to -partial success) after processing by the next stage in the pipeline. - -### Use of a single metric name - -The use of a single metric name is less confusing than the use of -multiple metric names, because the user has to know only a single name -to writing useful queries. 
Users working with existing collector and
-SDK pipeline monitoring metrics have to remember at least three metric
-names and explicitly join them via custom metric queries. For
-example, to calculate loss rate for an SDK using traditional pipeline
-metrics,
-
-```
-LossRate_MultipleMetrics = (dropped + failed) / (dropped + failed + success)
-```
-
-On the other hand, with a uniform boolean attribute indicating success
-or failure the resulting query is simpler.
-
-```
-LossRate_SingleMetric = items{success=false} / items{success=*}
-```
-
-In a typical metric query engine, after the user has entered the one
-metric name, attribute values will be automatically surfaced in the
-user interface, allowing them to make sense of the data and
-interactively build useful queries. On the other hand, the user who
-has to query multiple metrics has to enter each metric name
-explicitly without help from the user interface.
 
 For success=true:
 
-- `consumed`: Indicates a normal, synchronous request success case.
+- `accepted`: Indicates a normal, synchronous request success case.
   The item was consumed by the next stage of the pipeline, which
-  returned success.
+  returned success. Note the item could have been suppressed by a
+  subsequent component, but as far as this component knows, the
+  request was successful.
-- `enqueued`: Indicates the component admitted items into a queue and
-  then allowed the request to return success before the final outcome
-  was known.
-- `suppressed:<outcome>`: When the outcome is known but
-  the component intentionally returns success.
+- `suppressed:<outcome>`: When the true
+  outcome is not known at the time of counting, and the component
+  intentionally returns success to its producer. Examples are given
+  below.
+
+For both success=true and success=false, there is a special outcome
+indicating items did not reach the next stage in the pipeline,
+considered "dropped". When comparing pipeline metrics from one stage
+to the next, those which are dropped by a component are expected not
+to appear in totals of the subsequent pipeline.
+
+- `dropped`: Processors may use this to indicate both success and
+  failure; examples include sampling processors and filtering
+  processors, which successfully avoid sending data based on
+  configuration. For all components, dropped with success=false
+  indicates that the component introduced an original failure and did
+  not send to the next stage in the pipeline.
+
+For success=false, transient and potentially retryable:
 
 - `deadline_exceeded`: The item was in the process of being sent but the request
   timed out, or its deadline was exceeded.
 - `resource_exhausted`: The item was handled by the next stage of the
   pipeline, which returned an error code indicating that it was
   overloaded. If the resource being exhausted is local and the item
   was not handled by the next stage of the pipeline, use `dropped`.
 - `retryable`: The item was handled by the next stage of the pipeline,
   which returned a retryable error status not covered by any of the
   above values.
- -For both success=true and success=false: - -- `dropped`: Indicates an item that was not sent to the next stage of - the pipeline in both cases. Processors may use this to indicate - both success and failure, for example include sampling processors - and filtering processors, which successfully avoid sending data - based on configuration. For receivers and exporters, dropped data - indicates resource exhaustion of the component itself, in which case - the component reports points as dropped, while the producers will - see a resource exhausted status code. - -### Metric detail levels - -The five metric attributes specified above are recommended at all -levels of metric detail for OpenTelemetry Collectors. Collectors may -wish to configure other instrumentation at higher levels of detail, -such as counters for number of compressed bytes transmitted and -received. - -For SDKs, two levels of detail are available, as options that allow -the user to control whether each signal uses two timeseries or more, -as many as seven per signal. The metrics level of detail determines -which attributes are viewed, with the following recommended defaults. - -| Level | SDK attributes | -|--------|-----------------------------------------------| -| Basic | `otel.success`, `otel.signal` | -| Normal | `otel.success`, `otel.signal`, `otel.outcome` | - -SDKs may wish to configure additional instrumentation at higher levels -of detail, such as gRPC or HTTP-specific instrumentation for export -operations. - -## Detailed rationale - -### Error suppression behavior - -OpenTelemetry collector exporter components have existing error -suppression behavior, optionally obtained through the `exporterhelper` -library, which causes the `Consume()` function to return success for -what would ordinarily count as failure. This behavior makes automatic -component health status reporting more difficult than necessary. - -One goal if this proposal is that Collector component health could be -automatically inferred from metrics. Therefore, error suppression -performed by a component SHOULD NOT alter the `otel.success` attribute -value used in counting. - -Error suppression is naturally exposed as inconsistency in pipeline -metrics between the component and preceding components in the -pipeline. When an exporter suppresses errors, the processors and -receivers that it consumes from will (in aggregate) report -`otel.success=true` for more items than the exporter itself. - -As an option, the Collector MAY alter the `otel.outcome` attribute -value indicated when errors are suppressed, in conjunction with the -`otel.success=true` attribute. Instead of `otel.outcome=consumed`, -components can form a string using `suppressed:` followed by the -suppressed outcome (e.g., `otel.outcome=suppressed:queue_full`). This -is optional because could require substantial new code for the -collector component framework to track error suppression across -components. - -### Batch processor behavior - -Current `batchprocessor` behavior is to return success when the item -is accepted into its internal queue. This specification would add -`otel.outcome=queued` to the success response. - -Note the existing Collector core `batchprocessor` component has no -option to block until the actual outcome is known. If it had that -option, the Collector would need a way to return the failure to its -preceding component. 
- -Note that the `batchprocessor` component was designed before OTLP -introduced `PartialSuccess` messages, which provide a way to return -success, meaning not to retry, even when some or all of the data was -ultimately rejected by the pipeline. - -### Rejected points behavior - -Note that the current Collector does not account for the number of -items rejected, as introduced in OTLP through `PartialSuccess` -response messages. The error suppression semantic specified here is -compatible with this existing behavior, in the sense that rejected -points are being counted as successes. Collectors SHOULD count -rejected points as failed according to the specification here unless -error suppression is enabled. - -Since rejected points are generally part of successful export -requests, they are naturally suppressed from preceding pipeline -components. - -### SDKs are not like Collectors - -The proposed specification uses one metric per SDK instance -(`otelsdk.producer.items`) while it uses three per Collector instance -(`otelcol.*.items`) for the three primary component categories. - -This is justified as follows: - -- SDKs are net producers of telemetry, while Collectors pass telemetry - through, therefore we monitor these components in different ways. -- It is meaningless to aggregate pipeline metrics describing SDKs and - Collectors in a single metric. Collectors are generally - instrumented with OpenTelemtry SDKs, so this ambigiuty is avoided. -- SDKs are not consistent about component names. While tracing SDKs - have both processor and exporter components, there is no reason to - separately account for these components. On the other hand, metrics - SDKs do not have a "processor" component, they have a "reader" - component. - -### Connectors are both exporters and receivers - -Collectors have a special type of component called a "Connector", -which acts as both a receiver and an exporter, possibly having -different signal type. These components should be instrumented twice, -making pipeline metrics available for both the receiver and exporter. - -Therefore, a single Connector component will show up twice, having -both `otelcol.receiver.items` and `otelcol.exporter.items` counters. -These two counters will have the same component name (i.e., -`otel.name` value), different pipeline name (i.e., `otel.pipeline` -value) and possibly different signal type (i.e., `otel.signal` value). - -### Components fail in all sorts of ways - -The existing Collector `obsreport` framework is overly restrictive in -terms of the available outcomes that can be counted. As discussed -above, exporter components have no natural way to report dropped data -when a queue is full. - -Processor components, for example, are able to report `refused`, -`dropped`, and `success` outcomes but have no natural way to report -internally-generated failures (e.g., `memorylimiter` discussed above). - -Another example concerns processors that introduce delay but wish to -honor deadlines. There is not a natural way for processors to count -timeouts. The proposed specification here allows all components to -report failures on an item-by-item basis. - -### Strongly-recommended dimensions - -We are aware of the potential to accidentally combine metrics from a -pipeline in ways that lead to meaningful but potentially suprising -results. This is best explained by example: - -- A collector pipeline contains two processor elements. 
If the - `otelcol.processor.items` metric is aggregated to remove the - `otel.name` attribute, then the resulting aggregate will count two - items for every one passing through the pipeline. -- A multi-stage pipeline passes telemetry through an agent collector - followed by a gateway collector. If any of the `otelcol.*.items` - metrics are aggregated such that the agent-vs-gateway distinction is - lost, then the resulting aggregate counts two for every one item - passing through the pipeline (this distinction is a resource-level - attribute, not covered by these semantic conventions). - -These are meaningful and correct aggregations, just not usually what -the user is looking for. Products built to monitor OpenTelemetry -pipelines specifically should be aware that values in these dimensions -are potentially repeated. - -## SDK special considerations - -We expect that Metrics SDKs will be used to generate -pipeline-monitoring metrics reporting about themselves. - -All SDKs SHOULD support configuring an alternate Meter Provider for -pipeline-monitoring metrics. When a custom Meter Provider is used, a -secondary SDK will receive the pipeline monitoring metrics. - -When an SDK is used as the global Meter Provider is used (or when a -secondary SDK monitors a primary SDK), that Metrics SDK will receive -its own pipeline-monitoring metrics. When a Metrics SDK reports to -itself, the export operations that occur during shutdown MAY NOT be -instrumented. - -## Trace SDK special considerations - -Samplers are not considered part of the telemetry pipeline, because -Spans are unfinished at the time they start and cannot be considered -part of a pipeline until the finish. Sampler behavior is not covered -by these semantic conventions. The number counted by -`otelsdk.producer.items` includes items that are exported and/or -dropped by the Span Processor. - -When the Metrics SDK responsible for reporting pipeline metrics is not -functional for any reason, another signal may be necessary to -effectively monitor the situation. - -For this reason, all SDKs SHOULD support configuring a Tracer Provider -for monitoring pipeline export operations. Users concerned about -potential loss of metrics service may wish to configure an independent -trace data pipeline. In that case, the spans duration should match -the export and the span name SHOULD be `OTel SDK Export`. - -The specified attributes are: - -- `otel.status` (string): OK, Error, or Unset -- `otel.items` (int64): Number of spans, metric data points, or log records. -- `otel.signal` (string): `traces`, `metrics`, or `logs`. - -These spans MAY contain additional detail about the cause of failures, -for example. - -## Prior art and alternatives - -Prior work in (this PR)[https://github.com/open-telemetry/semantic-conventions/pull/184]. - -Issues: -- [Determine how to report dropped metrics](https://github.com/open-telemetry/opentelemetry-specification/issues/1655) -- [How should OpenTelemetry-internal metrics be exposed?](https://github.com/open-telemetry/opentelemetry-specification/issues/959) -- [OTLP Exporter must send client side metrics](https://github.com/open-telemetry/opentelemetry-specification/issues/791) -- [Making Tracing SDK metrics aware](https://github.com/open-telemetry/opentelemetry-specification/issues/381) + +For success=false, permanent category: + +- `rejected`: The item was handled by the next stage of the pipeline, + which returned a permanent error status or partial success status + indicating that some items could not be accepted. 
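[Editor's note: Using the informal query notation from the proposal's earlier `LossRate` examples (illustrative syntax, not a specific query language), the permanent-failure share at a collector exporter could then be computed as:]

```
PermanentFailureRate = otelcol.exporter.items{otel.success=false, otel.outcome=rejected}
                     / otelcol.exporter.items{}
```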
+
+#### Success, Outcome matrix
+
+| Success | Outcome                       | Meaning                                                            |
+|---------|-------------------------------|--------------------------------------------------------------------|
+| true    | accepted                      | Synchronous send succeeded                                         |
+| true    | dropped                       | Dropped by intention                                               |
+| true    | suppressed:accepted           | Producer saw success; true outcome unknown                         |
+| true    | suppressed:dropped            | Producer saw success; request was not sent                         |
+| true    | suppressed:deadline_exceeded  | Producer saw success; request sent, timed out                      |
+| true    | suppressed:resource_exhausted | Producer saw success; request sent, insufficient resources         |
+| true    | suppressed:retryable          | Producer saw success; request sent, other non-permanent condition  |
+| true    | suppressed:rejected           | Producer saw success; request sent, permanent condition            |
+| false   | dropped                       | Producer saw the component return failure, request was not sent    |
+| false   | deadline_exceeded             | Producer saw the component return failure, request timed out       |
+| false   | resource_exhausted            | Producer saw the component return failure, insufficient resources  |
+| false   | retryable                     | Producer saw the component return other non-permanent condition    |
+| false   | rejected                      | Producer saw the component return a permanent condition            |
+
+#### Examples of each outcome
+
+##### Success, Accepted
+
+This is the common success case. The item(s) were sent to the next
+stage in the pipeline while blocking the producer.
+
+##### Success, Dropped
+
+A processor was configured with instructions not to pass certain data.
+
+##### Failure, Suppressed-Accepted
+
+A component returned success to its producer, without making any
+effort to determine the true outcome.
+
+##### Failure, Dropped and Suppressed Dropped
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component never sent the item(s) due to limits in effect. For
+example, shutdown was ordered and the queue could not be drained in
+time due to a limit on parallelism.
+
+##### Failure, Deadline exceeded and Suppressed Deadline exceeded
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the item(s) did not
+succeed before the deadline expired. If there were attempts to retry,
+this is the outcome of the final attempt.
+
+##### Failure, Resource exhausted and Suppressed Resource exhausted
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+indicated its (or its consumers') resources were exceeded. If there
+were attempts to retry, this is the outcome of the final attempt.
+
+##### Failure, Retryable and Suppressed Retryable
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+indicated some kind of transient condition other than deadline- or
+resource-related (e.g., connection not accepted). If there were
+attempts to retry, this is the outcome of the final attempt.
+
+##### Failure, Rejected and Suppressed Rejected
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+returned a permanent error.
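[Editor's note: As a usage sketch in the same informal query notation (the wildcard match on attribute values is assumed), a monitoring system could surface how much of an exporter's traffic had its true outcome hidden from producers:]

```
SuppressedShare = otelcol.exporter.items{otel.outcome="suppressed:*"}
                / otelcol.exporter.items{}
```

A nonzero share indicates that preceding components' `otel.success=true` counts overstate end-to-end delivery, per the error suppression discussion above.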
From 6195761bed837065c310766b0eda35d3c33059d9 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Thu, 1 Feb 2024 21:14:25 -0800
Subject: [PATCH 13/13] small revision; needs more work

---
 text/metrics/0238-pipeline-monitoring.md | 29 ++++++++++++------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
index da44f4c10..aebe94508 100644
--- a/text/metrics/0238-pipeline-monitoring.md
+++ b/text/metrics/0238-pipeline-monitoring.md
@@ -110,17 +110,18 @@ For success=false, permanent category:
|---------|------------------------------|-------------------------------------------------------------------|
| true | accepted | Synchronous send succeeded |
| true | dropped | Dropped by intention |
-| true | suppressed:accepted | Producer saw success; true outcome unknown |
-| true | suppressed:dropped | Producer saw success; request was not sent |
-| true | suppressed:deadline_exceeded | Producer saw success; request sent, timed out |
-| true | suppressed:resource_exhausted | Producer saw success; request sent, insufficient resources |
-| true | suppressed:retryable | Producer saw success; request sent, other non-permanent condition |
-| true | suppressed:rejected | Producer saw success; request sent, permanent condition |
| false | dropped | Producer saw the component return failure, request was not sent |
| false | deadline_exceeded | Producer saw the component return failure, request timed out |
| false | resource_exhausted | Producer saw the component return failure, insufficient resources |
| false | retryable | Producer saw the component return other non-permanent condition |
| false | rejected | Producer saw the component return a permanent condition |
+| true | suppressed:accepted | Producer saw success; eventually accepted |
+| true | suppressed:dropped | Producer saw success; request was not sent |
+| true | suppressed:deadline_exceeded | Producer saw success; request sent, timed out |
+| true | suppressed:resource_exhausted | Producer saw success; request sent, insufficient resources |
+| true | suppressed:retryable | Producer saw success; request sent, other non-permanent condition |
+| true | suppressed:rejected | Producer saw success; request sent, permanent condition |
+| true | suppressed:unknown | Producer saw success; no effort to report true outcome |

#### Examples of each outcome

@@ -133,12 +134,12 @@ stage in the pipeline while blocking the producer.

A processor was configured with instructions not to pass certain data.

-##### Failure, Suppressed-Accepted
+##### Success, Suppressed-Accepted

-A component returned success to its producer, without making any
-effort to determine the true outcome.
+A component returned success to its producer, and later the outcome
+was successful.

-##### Failure, Dropped and Suppressed Dropped
+##### Failure, Dropped and Success, Suppressed-Dropped

(If suppressed: A component returned success to its producer, then ...)

@@ -146,7 +147,7 @@ example, shutdown was ordered and the queue could not be drained in
time due to a limit on parallelism.

-##### Failure, Deadline exceeded and Suppressed Deadline exceeded
+##### Failure, Deadline exceeded and Success, Suppressed-Deadline exceeded

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the item(s) did not
succeed before the deadline expired.
If there were attempts to retry,
this is the outcome of the final attempt.

-##### Failure, Resource exhausted and Suppressed Resource exhausted
+##### Failure, Resource exhausted and Success, Suppressed-Resource exhausted

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer
indicated its (or its consumers') resources were exceeded. If there
were attempts to retry, this is the outcome of the final attempt.

-##### Failure, Retryable and Suppressed Retryable
+##### Failure, Retryable and Success, Suppressed-Retryable

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer
indicated some kind of transient condition other than deadline- or
resource-related (e.g., connection not accepted). If there were
attempts to retry, this is the outcome of the final attempt.

-##### Failure, Rejected and Suppressed Rejected
+##### Failure, Rejected and Success, Suppressed-Rejected

(If suppressed: A component returned success to its producer, then ...)
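
In the revised matrix, every `suppressed:` outcome carries
`success=true`, because the producer already observed success. As a
non-normative sketch, a queue-based sender that has already replied to
its producer might classify the eventual result of the deferred send
as follows; the sentinel errors here are assumptions of this example:

```go
package pipeline

import (
	"context"
	"errors"
)

// Illustrative sentinels; a real sender would classify transport
// errors (for example, gRPC status codes) into these categories.
var (
	errResourceExhausted = errors.New("resource exhausted")
	errRetryable         = errors.New("retryable")
)

// suppressedOutcome maps the eventual result of an asynchronous send
// whose producer already saw success. Success is therefore always
// true; only the outcome attribute records what actually happened.
func suppressedOutcome(err error) (success bool, outcome string) {
	switch {
	case err == nil:
		return true, "suppressed:accepted"
	case errors.Is(err, context.DeadlineExceeded):
		return true, "suppressed:deadline_exceeded"
	case errors.Is(err, errResourceExhausted):
		return true, "suppressed:resource_exhausted"
	case errors.Is(err, errRetryable):
		return true, "suppressed:retryable"
	default:
		return true, "suppressed:rejected" // permanent condition
	}
}
```

A sender that never attempted the request would instead record
`suppressed:dropped`, and one that makes no effort to learn the true
outcome would record `suppressed:unknown`.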