diff --git a/src/data/nav.yml b/src/data/nav.yml index 32aa7f163..0c01ab7c4 100644 --- a/src/data/nav.yml +++ b/src/data/nav.yml @@ -53,10 +53,16 @@ - title: Use New Relic to diagnose problems in your system url: '/automate-workflows/diagnose-problems' pages: + - title: Spin up Acme Telco Lite architecture + url: '/automate-workflows/diagnose-problems/spin-up-acme' + - title: View your services + url: '/automate-workflows/diagnose-problems/view-your-services' - title: High response times url: '/automate-workflows/diagnose-problems/high-response-times' - title: Error alerts url: '/automate-workflows/diagnose-problems/error-alerts' + - title: Tear down Telco Lite + url: '/automate-workflows/diagnose-problems/tear-down' - title: Build apps icon: nr-build-apps url: '/build-apps' diff --git a/src/markdown-pages/automate-workflows/diagnose-problems/error-alerts.mdx b/src/markdown-pages/automate-workflows/diagnose-problems/error-alerts.mdx index d27266620..79c6dc33a 100644 --- a/src/markdown-pages/automate-workflows/diagnose-problems/error-alerts.mdx +++ b/src/markdown-pages/automate-workflows/diagnose-problems/error-alerts.mdx @@ -1,39 +1,29 @@ --- path: '/automate-workflows/diagnose-problems/error-alerts' duration: '20 min' -title: 'Diagnose error alerts in Telco Lite' +title: 'Diagnose error alerts' template: 'GuideTemplate' description: 'Learn how to use New Relic to diagnose error alerts in your services.' -tileShorthand: - title: 'Diagnose error alerts in Telco Lite' - description: 'Learn how to diagnose error alerts with New Relic.' -resources: - - title: 'New Relic Demo Catalog' - url: 'https://github.com/newrelic/demo-catalog' - - title: 'New Relic Demo Deployer' - url: 'https://github.com/newrelic/demo-deployer' -tags: - - demo - - explore +procIdx: 4 --- -Use New Relic to understand why some services are raising alerts. + -## Prerequisites +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. 
-Before you begin: +Each procedure in the lab builds upon the last, so make sure you've completed the last procedure, [_Diagnose high response time_](/automate-workflows/diagnose-problems/high-response-times), before starting this one. -- Learn about the [infrastructure of Telco Lite](/automate-workflows/diagnose-problems#welcome-to-acme-telco-lite) -- [Set up your local environment](/automate-workflows/diagnose-problems#set-up-your-environment) -- [Deploy and instrument the Telco Lite services](/automate-workflows/diagnose-problems#deploy-telco-lite) + + +In this procedure, you use New Relic to understand why some services are raising alerts. -When you're ready, start your journey by observing the error alerts in your services. +## Diagnose error alerts in Telco Lite -Log in to [New Relic One](https://one.newrelic.com) and select **APM** from the top navigation menu. Here, you see an overview of all eight Telco Lite services, including the service names, response times, and throughputs. Notice that **Telco-Login Service** and **Telco-Web Portal** have opened critical violations for high error percentages: +Log in to [New Relic One](https://one.newrelic.com) and select **APM** from the top navigation menu to see an overview of all Telco Lite services, including the service names, response times, and throughputs. Notice that **Telco-Login Service** and **Telco-Web Portal** have opened critical violations for high error percentages: ![Error alerts](../../../images/telco-lite/error-alerts.png) @@ -43,7 +33,7 @@ If you don't see all the same alerts, don't worry. The simulated issues happen a -The deployment has created an alert condition for cases where a service's error percentage rises above 10% for 5 minutes or longer. A critical violation means that the service's conditions violate that threshold. +The deployment has created an alert condition for cases where a service's error percentage rises above 10% for 5 minutes or longer.
A critical violation means that the service's conditions violate that threshold. Begin your investigation by selecting the **Telco-Web Portal** service name. @@ -51,11 +41,17 @@ Begin your investigation by selecting the **Telco-Web Portal** service name. -You're now on the web portal's APM summary page. The top graph, **Web transactions time**, shows you the service's response times. By default, it also displays periods of critical violation. On the right-hand side of the view, **Application activity** shows when violations opened and closed: +Select the **Telco-Web Portal** service name. ![Web portal alerts](../../../images/telco-lite/web-portal-apm-summary.png) -From the left-hand navigation, select **Events > Errors** to learn more about the errors: +You're now on the web portal's APM summary page. The top graph, **Web transactions time**, shows you the service's response times. By default, it also displays periods of critical violation. On the right-hand side of the view, **Application activity** shows when violations opened and closed. + + + + + +From the left-hand navigation, select **Events > Errors**: ![Web portal errors](../../../images/telco-lite/web-portal-errors.png) @@ -69,24 +65,40 @@ In this scenario, the only error in the service has the following message: This is a helpful message that explains that the web portal made a request to another service, raised an error, and responded with a response code of 500, indicating an **Internal server error**. -Since this message tells you that the error occurred while the web portal was making an outbound request, use distributed tracing to better understand the issue. Select **Monitor > Distributed tracing** from the left-hand navigation. +Since this message tells you that the error occurred while the web portal was making an outbound request, use distributed tracing to better understand the issue. +Select **Monitor > Distributed tracing** from the left-hand navigation. 
+ ![Web portal distributed tracing](../../../images/telco-lite/web-portal-dt.png) -Distributed tracing provides end-to-end information about a request. In this case, you're looking for a request to the web portal that raised an error, so that you can better understand what happened during that request. Filter the table, by selecting the **Errors** column header twice, to order by descending counts: +Distributed tracing provides end-to-end information about a request. In this case, you're looking for a request to the web portal that raised an error, so that you can better understand what happened during that request. + + + + + +Select the **Errors** column header twice to order the table by descending counts: ![Web portal traces, ordered by descending error counts](../../../images/telco-lite/web-portal-dt-ordered.png) + + + + Select the first row in the table: ![Web portal trace data](../../../images/telco-lite/web-portal-dt-row.png) This trace gives a lot of information about what happened with the request once the web portal received it. One of the things that the trace reveals is that the web portal made a `GET` request to **Telco-Login Service** and received an error. The trace indicates an error by coloring the text red. + + + + Select the row (called a span) to see more information about the request to the login service: ![Distributed trace error details](../../../images/telco-lite/dt-error-details.png) @@ -99,35 +111,45 @@ Expand **Error details** to see the error message: Interesting! This message says that, at the time that the web portal made the `GET` request to the login service, the login service was not ready to accept traffic. Inspect the login service to dive further into the root cause of these cascading errors. -Return to the **APM** page, and select **Telco-Login Service**. - +Return to the **APM** page, and select **Telco-Login Service**.
+ ![Login service APM summary](../../../images/telco-lite/login-apm-summary.png) Notice that the APM summary for **Telco-Login Service** has similar red flags to the ones in the web portal: **Web transactions time** has a red error indicator, and **Application activity** shows critical violations. More than that, the times that the errors occurred in both services match up (around 10:53 AM, in this example). +**Web transactions time**, in **APM**, also shows that requests to the login service spend all their time in Java code. Next, explore JVMs to see what's happening. + -**Web transactions time**, in **APM**, shows that requests to the login service spend all their time in Java code. So, in the left-hand navigation, open **Monitor > JVMs**: +Open **Monitor > JVMs** in the left-hand navigation: ![JVMs overview](../../../images/telco-lite/jvm-overview.png) -Java Virtual Machines, or JVMs, run Java processes, such as those used by the login service. This view shows resource graphs for each JVM your service uses. Change the timeslice to look at data for the last 3 hours, so you can get a better idea of how the service has been behaving: +Java Virtual Machines, or JVMs, run Java processes, such as those used by the login service. This view shows resource graphs for each JVM your service uses. + + + + + +Change the timeslice to look at data for the last 3 hours: ![JVM heap memory usage](../../../images/telco-lite/jvm-heap-mem.png) Notice, in **Heap memory usage**, that the line for **Used Heap** rises consistently over 30 minute intervals. About two-thirds of the way through each interval, the line for **Committed Heap** (the amount of JVM heap memory dedicated for use by Java processes) quickly rises to accommodate the increasing memory demands. This graph indicates that the Java process is leaking memory. +The next step is to understand the extent of the leak's impact.
+ -Your Java process is leaking memory, but you need to understand the extent of the leak's impact. Navigate to the login service's host infrastructure view to dive a little deeper. +Navigate to the login service's host infrastructure view to dive a little deeper. First, go to the **Telco-Login Service** summary page and turn off **Show new view**: @@ -137,17 +159,19 @@ Then, scroll to the bottom of the page, and select the host's name: ![Login APM select host](../../../images/telco-lite/apm-login-select-host.png) - + Right now, you can only select the host's name from the old version of the UI (we're working on it). So, make sure you toggle off **Show new view**. - + In this infrastructure view, **Memory Used %** for **Telco-Authentication-host** consistently climbs from around 60% to around 90% over 30-minute intervals, matching the intervals in the JVM's heap memory usage graph: ![Authentication host memory](../../../images/telco-lite/infra-auth-host.png) -Therefore, the memory leak effects the login service's entire host. Click and drag on **Memory Used %** to narrow the timeslice to one of the peaks: +Therefore, the memory leak affects the login service's entire host. + +Click and drag on **Memory Used %** to narrow the timeslice to one of the peaks: ![Authentication host, new timeslice](../../../images/telco-lite/auth-host-zoomed.png) @@ -169,7 +193,13 @@ The message for those errors is the same one you saw earlier: [output] {red}java.lang.Exception:{plain} The application is not yet ready to accept traffic ``` -This suggests that the memory leaks cause the application to fail for a time. To understand the error a bit more, select the error class from the table at the bottom of the view: +This suggests that the memory leaks cause the application to fail for a time.
+ + + + + +To understand the error a bit more, select the error class from the table at the bottom of the view: ![Login service error details](../../../images/telco-lite/login-error-class.png) @@ -179,13 +209,15 @@ The stack trace shows that the service raised an `UnhandledException` from a fun With the information you've collected so far, you can conclude that **Telco-Login Service's** Java code has a memory leak. Also, the login service restarts the application when it runs out of memory, and it raises an `UnhandledException` when it receives requests while the app is restarting. -You also know the login service is effecting the web portal, because that is what introduced you to this problem, but does the issue effect any other services? +You also know the login service is affecting the web portal, because that is what introduced you to this problem, but does the issue affect any other services? -Visualize service dependencies using service maps. First, navigate back to **APM**, and from **Telco-Login Service**, select **Monitor > Service map**: +Visualize service dependencies using service maps.
+ +First, navigate back to **APM**, and from **Telco-Login Service**, select **Monitor > Service map**: ![Login service map](../../../images/telco-lite/login-service-map.png) @@ -205,7 +237,7 @@ Use the same steps you used to investigate issues in the web portal to confirm t At the end of your investigation, you discovered: -- **Telco-Login Service** and **Telco-Web Portal** raise alerts during critical violations +- **Telco-Login Service** and **Telco-Web Portal** raise critical violation alerts - The login service's Java processes leak memory - When the login service's host, **Telco-Authentication-host**, runs out of memory, it restarts the login application - While the login application is restarting, it raises an `UnhandledException` when it receives requests @@ -213,4 +245,8 @@ At the end of your investigation, you discovered: Now, as a Telco Lite developer, you have enough information to debug the issue causing the memory leak. Congratulations! -Learn more about using New Relic by diagnosing [other issues](/automate-workflows/diagnose-problems#view-your-services). If this is your last issue, [tear down](/automate-workflows/diagnose-problems#tear-down-telco-lite) all the Telco Lite services. + + +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. Now that you've diagnosed all the issues affecting Telco Lite, [Tear down your services](/automate-workflows/diagnose-problems/tear-down). 
+ + \ No newline at end of file diff --git a/src/markdown-pages/automate-workflows/diagnose-problems/high-response-times.mdx b/src/markdown-pages/automate-workflows/diagnose-problems/high-response-times.mdx index 49498a604..fbac2c2d8 100644 --- a/src/markdown-pages/automate-workflows/diagnose-problems/high-response-times.mdx +++ b/src/markdown-pages/automate-workflows/diagnose-problems/high-response-times.mdx @@ -1,42 +1,38 @@ --- path: '/automate-workflows/diagnose-problems/high-response-times' duration: '15 min' -title: 'Diagnose high response times in Telco Lite' +title: 'Diagnose high response time' template: 'GuideTemplate' description: 'Learn how to use New Relic to diagnose high response times in your services.' -tileShorthand: - title: 'Diagnose high response times in Telco Lite' - description: 'Learn how to diagnose high response times with New Relic.' -resources: - - title: 'New Relic Demo Catalog' - url: 'https://github.com/newrelic/demo-catalog' - - title: 'New Relic Demo Deployer' - url: 'https://github.com/newrelic/demo-deployer' -tags: - - demo - - explore +procIdx: 3 --- -Use New Relic to understand why **Telco-Warehouse Portal** has slower-than-normal response times. + -## Prerequisites +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. -Before you begin: +Each procedure in the lab builds upon the last, so make sure you've completed the last procedure, [_View your services_](/automate-workflows/diagnose-problems/view-your-services), before starting this one. -- Learn about the [infrastructure of Telco Lite](/automate-workflows/diagnose-problems#welcome-to-acme-telco-lite) -- [Set up your local environment](/automate-workflows/diagnose-problems#set-up-your-environment) -- [Deploy and instrument the Telco Lite services](/automate-workflows/diagnose-problems#deploy-telco-lite) + -When you're ready, start your journey by observing the high response times in your services. 
+In this procedure, you use New Relic to understand why **Telco-Warehouse Portal** has slower-than-normal response times. + +## Diagnose high response time in Telco Lite -Log in to [New Relic One](https://one.newrelic.com) and select **APM** from the top navigation menu. Here, you see an overview of all eight Telco Lite services, including the service names, response times, and throughputs. Notice that the response time for **Telco-Warehouse Portal** is 43 seconds, much higher than the response times in other services: +Log in to [New Relic One](https://one.newrelic.com) and select **APM** from the top navigation menu. Here, you see an overview of all eight Telco Lite services: ![APM overview](../../../images/telco-lite/high-response-times.png) +The overview includes the service names, response times, and throughputs. Notice that the response time for **Telco-Warehouse Portal** is 43 seconds, much higher than the response times in other services. + + + + + On the **APM** page, select the **Telco-Warehouse Portal** service name to see a summary of that service. The top graph in the summary view shows **Web transactions time**: ![APM transaction time summary](../../../images/telco-lite/apm-summary.png) @@ -49,74 +45,96 @@ The graph changes to show only what the Node.js component contributes to the ove ![APM web external transaction time](../../../images/telco-lite/apm-webex.png) -External web traffic is the primary contributor to the high response times. That's a good start, but it doesn't tell the whole story. Next, dive deeper to find out which external service is causing the issue. +Here, you see that **Web external** is the culprit of the high response times, but it's hard to tell why. + +External web traffic is all the requests made from your service to other services. This means you should look into what external requests the warehouse portal makes to try to understand exactly what external service is the bottleneck. 
-So, you know that **Web external** is the culprit of the high response times, but it's hard to tell why. External web traffic is all the requests made from your service to other services. This means you should look into what external requests that the warehouse portal makes to try to understand exactly what external service is the bottleneck. - From the left-hand navigation, select **Monitor > Distributed tracing**: ![Distributed tracing overview](../../../images/telco-lite/dt-overview.png) -This view shows you requests to the warehouse portal. Select a request from the table at the bottom of the view to see a trace through that request: +This view shows you requests to the warehouse portal. + + + + + +Select a request from the table at the bottom of the view to see a trace through that request: ![Specific trace](../../../images/telco-lite/dt-trace.png) This trace shows that one external request contributes almost all of the total trace duration. Specifically, an external request to the **Telco-Fulfillment Service** contributes over 99% of the overall response time. -This is good news! You own the fulfillment service, which means you can drill down for even more information. Select the offending row (called a span), and then select **Explore this transaction**: - -![Span details](../../../images/telco-lite/dt-span-details.png) +Next, drill down into the fulfillment service for even more information. -You're now looking at the `__main__:inventory_item` transaction overview. Because you know that some part of this transaction is slow, you can use this overview to narrow your focus even further. +Select the offending row (called a span), and then select **Explore this transaction**: + +![Span details](../../../images/telco-lite/dt-span-details.png) + +You're now looking at the `__main__:inventory_item` transaction overview. Because you know that some part of this transaction is slow, you use this overview to narrow your focus even further.
Similar to how you modified the warehouse portal APM graph, you look specifically at the components of this transaction to understand where the root cause of the slowness is. -Similar to how you modified the warehouse portal APM graph, you can look specifically at the components of this transaction to understand where the root cause of the slowness is. Another way to view this information is to scroll down to the **Breakdown table** in that same view: +Another way to view this information is to scroll down to the **Breakdown table** in that same view: ![Transaction breakdown](../../../images/telco-lite/transaction-breakdown.png) `Function/__main__:inventory_item`, a Python function, contributes over 99% of the overall response time. -At this point, you know that **Telco-Warehouse Portal** is slow because it makes an external request to **Telco-Fulfillment Service**, which is slow. You also know that the issue in the fulfillment service is local because over 99% of the request is spent in Python code, not external services. Navigate to the fulfillment service's summary page to look at the service as a whole, instead of this single transaction: +At this point, you know that **Telco-Warehouse Portal** is slow because it makes an external request to **Telco-Fulfillment Service**, which is slow. You also know that the issue in the fulfillment service is local because over 99% of the request is spent in Python code, not external services. + + + + + +Navigate to the fulfillment service's summary page to look at the service as a whole: ![Fulfillment transaction time summary](../../../images/telco-lite/apm-fulfillment-summary.png) +Scroll down on this view to familiarize yourself with the graphs and tables it shows, such as **Throughput**, **Error rate**, and **Hosts**. Next, drill into the host to see what's happening. + -Scroll down on this view to familiarize yourself with the graphs it shows, such as **Throughput** and **Error rate**.
At the bottom of the page, you can see a table with the fulfillment hosts. You can't currently drill into a specific host in the new UI (we're working on it), but you can in the old UI. Toggle **Show new view** to off and select the host link: +To drill into the host, switch to the old UI. Toggle **Show new view** to off and select the host link: ![Switch to the old UI](../../../images/telco-lite/old-ui.png) +Now, you're looking at graphs in the infrastructure view for that service's host. Notice that **CPU %** has a lot of high spikes. + -Now, you're looking at graphs in the infrastructure view for that service's host. Notice that **CPU %** has a lot of high spikes. Click and drag on the graph from the start of a spike to the end of it to narrow the timeslice to the period when CPU utilization goes up: +Click and drag on the graph from the start of a spike to the end of it to narrow the timeslice to the period when CPU utilization goes up: ![CPU spike](../../../images/telco-lite/cpu-spike.png) If you compare this graph to the fulfillment service's transaction graph you looked at earlier, you'll see that soon after `__main__:inventory_item` begins executing, the CPU utilization of the host sharply rises to 100%! +Now, you understand the problem causing slow response times in the warehouse portal, but you don't know the extent of the issue. Using service maps, you can see all your services that depend on the fulfillment service. + -Now, you understand the problem causing slow response times in the warehouse portal, but you don't know the extent of the issue. Using service maps, you can see all your services that depend on the fullfillment service. - Navigate to the service map under **APM > Telco-Fulfillment Service**: ![Service map](../../../images/telco-lite/service-map.png) This map shows you the fulfillment service's incoming and outgoing dependencies. Not only is **Telco-Warehouse Portal** dependent on the fullfillment service, but so is **Telco-Web Portal**!
+ + + + Select the web portal node to see that the fulfillment service also affects the web portal's response times: ![Web portal dependency](../../../images/telco-lite/web-portal-dep.png) @@ -136,4 +154,8 @@ At the end of your investigation, you discovered: Now, as the developer behind the fulfillment service, you have enough information to debug the issue causing the CPU spikes. Congratulations! -You can learn more about using New Relic by diagnosing [other issues](/automate-workflows/diagnose-problems#view-your-services). If this is your last issue, you can [tear down](/automate-workflows/diagnose-problems#tear-down-telco-lite) all the Telco Lite services. + + +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. Continue on to the next procedure: [Diagnose error alerts](/automate-workflows/diagnose-problems/error-alerts). + + \ No newline at end of file diff --git a/src/markdown-pages/automate-workflows/diagnose-problems/index.mdx b/src/markdown-pages/automate-workflows/diagnose-problems/index.mdx index b76befb91..7e7544c2b 100644 --- a/src/markdown-pages/automate-workflows/diagnose-problems/index.mdx +++ b/src/markdown-pages/automate-workflows/diagnose-problems/index.mdx @@ -1,197 +1,45 @@ --- path: '/automate-workflows/diagnose-problems' -duration: '30 min' title: 'Practice diagnosing common issues using New Relic' -template: 'GuideTemplate' +template: 'LabOverviewTemplate' description: 'Automatically spin up a microservice infrastructure, and use New Relic to diagnose its issues.' -tileShorthand: - title: 'Use New Relic to diagnose problems' - description: 'Learn to diagnose problems using New Relic.' -resources: - - title: 'New Relic Demo Catalog' - url: 'https://github.com/newrelic/demo-catalog' - - title: 'New Relic Demo Deployer' - url: 'https://github.com/newrelic/demo-deployer' -tags: - - demo - - explore --- -Every time you deploy an application, you hope that it's efficient and error-free.
The reality, however, is usually quite different. You might introduce a bug in a release, overlook an edge case, or rely on a broken dependency. These issues, and others, can result in bad user experiences. + -In this guide, you: +You're a developer of Acme Telco Lite, a mock telecom business that maintains an eCommerce website for its customers. When deploying your application, you hope it's efficient and error-free. The reality, however, is quite different. Your users are complaining of slow responses and errors. You use New Relic to figure out the source of your users' frustration. -- Use the open-source New Relic [`demo-deployer`](https://github.com/newrelic/demo-deployer) to spin up the infrastructure for Acme Telco Lite, a fictional company. This demo scenario is part of our [Demo Catalog](https://github.com/newrelic/demo-catalog) and will simulate real-world issues in a controlled, demo environment -- Use New Relic to understand those issues from the perspective of a Telco Lite developer -- Use the deployer to tear down the resources you create + ## Welcome to Acme Telco Lite! -Acme Telco Lite is a mock telecom business that maintains an eCommerce website for its customers. The site's architecture has eight, interconnected microservices, plus a simulator: - -![Acme Telco Lite architecture diagram](../../../images/telco-lite/acme.png) - -The simulator isn't part of the Telco Lite infrastructure, but it is part of the demo deployment. It runs scenarios against the application to create web traffic and generate interesting data in New Relic. - -## Set up your environment - -Before you begin, follow the [Prerequisites guide](https://github.com/newrelic/demo-catalog/blob/main/GETTING_STARTED.md#prerequisites) from the deployer's GitHub repository for a detailed walkthrough of how to set up your environment. For this guide, you can choose between Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP) to host your deployment.
- -The guide steps you through: - -- Installing Docker & pulling the demo-deployer image locally -- Creating a user config file containing credentials for New Relic and your cloud provider -- Downloading a .pem key file (if your cloud provider is AWS) - -Once you're all set up, deploy Acme Telco Lite. - -## Deploy Telco Lite - -It's time to deploy and instrument the Telco Lite services using `demo-deployer`. Copy the url for the demo that corresponds to the cloud provider you chose when you set up your environment: - -- [AWS Telco Lite Demo](https://raw.githubusercontent.com/newrelic/demo-catalog/main/catalog/telco_lite/telcolite.aws.json) -- [Azure Telco Lite Demo](https://raw.githubusercontent.com/newrelic/demo-catalog/main/catalog/telco_lite/telcolite.azure.json) -- [GCP Telco Lite Demo](https://raw.githubusercontent.com/newrelic/demo-catalog/main/catalog/telco_lite/telcolite.gcp.json) - -Follow the [Deployment guide](https://github.com/newrelic/demo-catalog/blob/main/GETTING_STARTED.md#deploy-your-services) in the Demo Catalog repository for a thorough explanation of how to use the deployer in a local Docker environment. When you run the deployment script, make sure to pass the url you copied for ``. - - - -Since Telco Lite contains several services, the deployment can take over half an hour. - - - -When the deloyer is finished, you should see some output stating that the deployment was successful: - -```shell copyable=false -[output] {muted}[INFO] Executing Deployment -[output] [{green}✔{plain}] Parsing and validating Deployment configuration {green}success -[output] [{green}✔{plain}] Provisioner {green}success -[output] [{green}✔{plain}] Installing On-Host instrumentation {green}success -[output] [{green}✔{plain}] Installing Services and instrumentations {green}success -[output] -[output] {muted}[INFO] Deployment successful! 
-[output] -[output] Deployed Resources: -[output] -[output] simuhost (aws/ec2): -[output] ip: {blue}34.201.60.23 -[output] services: ["simulator"] -[output] -[output] uihost (aws/ec2): -[output] ip: {blue}18.233.97.28 -[output] services: ["webportal", "fluentd"] -[output] instrumentation: -[output] nr_infra: newrelic v1.12.1 -[output] -[output] backendhost (aws/ec2): -[output] ip: {blue}35.170.192.236 -[output] services: ["promo", "login", "inventory", "plan", "fulfillment", "warehouse", "fluentd"] -[output] instrumentation: -[output] nr_infra: newrelic v1.12.1 -[output] -[output] reportinghost (aws/ec2): -[output] ip: {blue}54.152.82.127 -[output] services: ["billing", "fluentd"] -[output] instrumentation: -[output] nr_infra: newrelic v1.12.1 -[output] -[output] Installed Services: -[output] -[output] simulator: -[output] url: {blue}http://34.201.60.23:5000 -[output] -[output] webportal: -[output] url: {blue}http://18.233.97.28:5001 -[output] instrumentation: -[output] nr_node_agent: newrelic v6.11.0 -[output] nr_logging_in_context: newrelic -[output] -[output] promo: -[output] url: {blue}http://35.170.192.236:8001 -[output] instrumentation: -[output] nr_python_agent: newrelic v5.14.1.144 -[output] nr_logging_in_context: newrelic -[output] -[output] login: -[output] url: {blue}http://35.170.192.236:8002 -[output] instrumentation: -[output] nr_python_agent: newrelic v5.14.1.144 -[output] nr_logging_in_context: newrelic -[output] -[output] inventory: -[output] url: {blue}http://35.170.192.236:8003 -[output] instrumentation: -[output] nr_python_agent: newrelic v5.14.1.144 -[output] nr_logging_in_context: newrelic -[output] -[output] plan: -[output] url: {blue}http://35.170.192.236:8004 -[output] instrumentation: -[output] nr_python_agent: newrelic v5.14.1.144 -[output] nr_logging_in_context: newrelic -[output] -[output] fulfillment: -[output] url: {blue}http://35.170.192.236:8005 -[output] instrumentation: -[output] nr_python_agent: newrelic v5.14.1.144 -[output] 
nr_logging_in_context: newrelic -[output] -[output] billing: -[output] url: {blue}http://54.152.82.127:9001 -[output] instrumentation: -[output] nr_java_agent: newrelic v5.14.0 -[output] nr_logging_in_context: newrelic -[output] nr_logging: newrelic -[output] -[output] warehouse: -[output] url: {blue}http://35.170.192.236:9002 -[output] instrumentation: -[output] nr_python_agent: newrelic v5.14.1.144 -[output] nr_logging_in_context: newrelic -[output] -[output] fluentd: -[output] url: {blue}http://18.233.97.28:9999 -[output] url: {blue}http://35.170.192.236:9999 -[output] url: {blue}http://54.152.82.127:9999 -[output] -[output] Completed at 2020-08-11 11:27:00 -0700 -[output] -[output] {muted}[INFO] This deployment summary can also be found in: -[output] {muted}[INFO] /tmp/telcolite/deploy_summary.txt -``` - -
- -After configuring your environment, you only needed two commands (and a bit of patience) to spin up all the Telco Lite services! - -## View your services - -With your services running in the cloud, log in to New Relic and select **APM** from the top navigation to see how your services are holding up: - -![APM story introduction](../../../images/telco-lite/story-introduction.png) - -Yikes! The alerts, high response times, and red-colored indicators suggest things aren't well. Use New Relic to diagnose these issues, which are simultaneously affecting your services: - -- [Issue 1: The Warehouse Portal has abnormally high response times](/automate-workflows/diagnose-problems/high-response-times) -- [Issue 2: Multiple services are raising error alerts](/automate-workflows/diagnose-problems/error-alerts) - - - -Don't worry if you don't see all the same alerts. The simulator triggers issues at regular intervals, so you should start seeing these problems in New Relic within 30 minutes to an hour. - - - -## Tear down Telco Lite - -When you're finished diagnosing all the issues effecting Telco Lite, follow the [Teardown guide](https://github.com/newrelic/demo-catalog/blob/main/GETTING_STARTED.md#tear-down-your-resources) in the deployer's repository to tear down the services you created in your cloud provider. If you're still exploring, don't tear down your services, or you'll have to deploy them again later. - -## Conclusion - -Congratulations, you're done! Throughout this tutorial, you: - -- Used the `demo-deployer` to deploy Telco Lite to the cloud -- Used New Relic to investigate simulated issues in Telco Lite services -- Tore down all the infrastructural resources you created throughout this tutorial - -Hopefully, you learned a lot about using New Relic to investigate issues in your services. To get your hands on more features of New Relic, pick another demo from the [catalog](https://github.com/newrelic/demo-catalog) and spin it up with the deployer! 
+Acme Telco Lite is a mock telecom business that maintains an eCommerce website for its customers. The site's architecture has eight interconnected microservices: + +![Acme Telco Lite's architecture](../../../images/telco-lite/acme.png) + +Your customers were happy with their experience. However, you have been receiving complaints about a bad user experience since the recent deployment. You don't know if there's a bug in the release, an overlooked edge case, or a broken dependency causing the issue. But you know that you can use New Relic to observe your application and quickly diagnose the issue. This helps you take timely action and get back on track. + +
+ +
+ +## Learning Objectives + +In this lab, you: + +- Spin up the infrastructure for Acme Telco Lite using the New Relic [`demo-deployer`](https://github.com/newrelic/demo-deployer) +- Use New Relic to understand the issues with Acme Telco Lite +- Use the deployer to tear down the resources you create + +
+ +
+ +## Requirements + +- Create a free [New Relic account](https://newrelic.com/signup?utm_source=developer-site) +- Install [Docker](https://www.docker.com/) + +
+ +
diff --git a/src/markdown-pages/automate-workflows/diagnose-problems/spin-up-acme.mdx b/src/markdown-pages/automate-workflows/diagnose-problems/spin-up-acme.mdx new file mode 100644 index 000000000..6d899b9a6 --- /dev/null +++ b/src/markdown-pages/automate-workflows/diagnose-problems/spin-up-acme.mdx @@ -0,0 +1,153 @@ +--- +path: '/automate-workflows/diagnose-problems/spin-up-acme' +title: 'Spin up Acme Telco Lite architecture' +template: 'GuideTemplate' +description: 'Set up your environment to deploy Acme Telco Lite.' +duration: '15 min' +procIdx: 1 +--- + + + + +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. If you haven't already, check out the [lab introduction](/automate-workflows/diagnose-problems). + + + +## Set up your environment + +Before you begin, follow the [Prerequisites guide](https://github.com/newrelic/demo-catalog/blob/main/GETTING_STARTED.md#prerequisites) from the deployer's GitHub repository for a detailed walkthrough of how to set up your environment. You can choose between Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP) to host your deployment. + +To summarize: + +- Install Docker and pull the demo-deployer image locally +- Create a user config file containing credentials for New Relic and your cloud provider +- Download a .pem key file (if your cloud provider is AWS) + +Once you're all set up, deploy Acme Telco Lite. + +## Deploy Telco Lite + +It's time to deploy and instrument the Telco Lite services using `demo-deployer`.
Copy the URL for the demo that corresponds to the cloud provider you chose when you set up your environment: + +- [AWS Telco Lite Demo](https://raw.githubusercontent.com/newrelic/demo-catalog/main/catalog/telco_lite/telcolite.aws.json) +- [Azure Telco Lite Demo](https://raw.githubusercontent.com/newrelic/demo-catalog/main/catalog/telco_lite/telcolite.azure.json) +- [GCP Telco Lite Demo](https://raw.githubusercontent.com/newrelic/demo-catalog/main/catalog/telco_lite/telcolite.gcp.json) + +Follow the [Deployment guide](https://github.com/newrelic/demo-catalog/blob/main/GETTING_STARTED.md#deploy-your-services) in the Demo Catalog repository for a thorough explanation of how to use the deployer in a local Docker environment. When you run the deployment script, make sure to pass the URL you copied for ``. + + + +Since Telco Lite contains several services, the deployment can take over half an hour. + + + +When the deployer is finished, you should see some output stating that the deployment was successful: + +```shell copyable=false +[output] {muted}[INFO] Executing Deployment +[output] [{green}✔{plain}] Parsing and validating Deployment configuration {green}success +[output] [{green}✔{plain}] Provisioner {green}success +[output] [{green}✔{plain}] Installing On-Host instrumentation {green}success +[output] [{green}✔{plain}] Installing Services and instrumentations {green}success +[output] +[output] {muted}[INFO] Deployment successful!
+[output] +[output] Deployed Resources: +[output] +[output] simuhost (aws/ec2): +[output] ip: {blue}34.201.60.23 +[output] services: ["simulator"] +[output] +[output] uihost (aws/ec2): +[output] ip: {blue}18.233.97.28 +[output] services: ["webportal", "fluentd"] +[output] instrumentation: +[output] nr_infra: newrelic v1.12.1 +[output] +[output] backendhost (aws/ec2): +[output] ip: {blue}35.170.192.236 +[output] services: ["promo", "login", "inventory", "plan", "fulfillment", "warehouse", "fluentd"] +[output] instrumentation: +[output] nr_infra: newrelic v1.12.1 +[output] +[output] reportinghost (aws/ec2): +[output] ip: {blue}54.152.82.127 +[output] services: ["billing", "fluentd"] +[output] instrumentation: +[output] nr_infra: newrelic v1.12.1 +[output] +[output] Installed Services: +[output] +[output] simulator: +[output] url: {blue}http://34.201.60.23:5000 +[output] +[output] webportal: +[output] url: {blue}http://18.233.97.28:5001 +[output] instrumentation: +[output] nr_node_agent: newrelic v6.11.0 +[output] nr_logging_in_context: newrelic +[output] +[output] promo: +[output] url: {blue}http://35.170.192.236:8001 +[output] instrumentation: +[output] nr_python_agent: newrelic v5.14.1.144 +[output] nr_logging_in_context: newrelic +[output] +[output] login: +[output] url: {blue}http://35.170.192.236:8002 +[output] instrumentation: +[output] nr_python_agent: newrelic v5.14.1.144 +[output] nr_logging_in_context: newrelic +[output] +[output] inventory: +[output] url: {blue}http://35.170.192.236:8003 +[output] instrumentation: +[output] nr_python_agent: newrelic v5.14.1.144 +[output] nr_logging_in_context: newrelic +[output] +[output] plan: +[output] url: {blue}http://35.170.192.236:8004 +[output] instrumentation: +[output] nr_python_agent: newrelic v5.14.1.144 +[output] nr_logging_in_context: newrelic +[output] +[output] fulfillment: +[output] url: {blue}http://35.170.192.236:8005 +[output] instrumentation: +[output] nr_python_agent: newrelic v5.14.1.144 +[output] 
nr_logging_in_context: newrelic +[output] +[output] billing: +[output] url: {blue}http://54.152.82.127:9001 +[output] instrumentation: +[output] nr_java_agent: newrelic v5.14.0 +[output] nr_logging_in_context: newrelic +[output] nr_logging: newrelic +[output] +[output] warehouse: +[output] url: {blue}http://35.170.192.236:9002 +[output] instrumentation: +[output] nr_python_agent: newrelic v5.14.1.144 +[output] nr_logging_in_context: newrelic +[output] +[output] fluentd: +[output] url: {blue}http://18.233.97.28:9999 +[output] url: {blue}http://35.170.192.236:9999 +[output] url: {blue}http://54.152.82.127:9999 +[output] +[output] Completed at 2020-08-11 11:27:00 -0700 +[output] +[output] {muted}[INFO] This deployment summary can also be found in: +[output] {muted}[INFO] /tmp/telcolite/deploy_summary.txt +``` + +
+ +After configuring your environment, you only needed two commands (and a bit of patience) to spin up all the Telco Lite services! + + + +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. Now that you've set up your environment, [view your services](/automate-workflows/diagnose-problems/view-your-services). + + diff --git a/src/markdown-pages/automate-workflows/diagnose-problems/tear-down.mdx b/src/markdown-pages/automate-workflows/diagnose-problems/tear-down.mdx new file mode 100644 index 000000000..aeca32479 --- /dev/null +++ b/src/markdown-pages/automate-workflows/diagnose-problems/tear-down.mdx @@ -0,0 +1,40 @@ +--- +path: '/automate-workflows/diagnose-problems/tear-down' +duration: '15 min' +title: 'Tear down Telco Lite' +template: 'GuideTemplate' +description: 'Once you finish diagnosing all the issues affecting Telco Lite, tear down your services.' +procIdx: 5 +--- + + + +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. + +Each procedure in the lab builds upon the last, so make sure you've completed the last procedure, [_Diagnose error alerts_](/automate-workflows/diagnose-problems/error-alerts), before starting this one. + + + +Now that you have diagnosed all the issues affecting Telco Lite, you have enough information to debug the issues causing the bad user experience. + +Since you deployed your services in the cloud, it's time to tear them down to avoid any unnecessary costs. + +## Tear down your services + +Follow the [Teardown guide](https://github.com/newrelic/demo-catalog/blob/main/GETTING_STARTED.md#tear-down-your-resources) in the deployer's repository to tear down the services you created in your cloud provider. + + + +If you're still exploring, don't tear down your services, or you'll have to deploy them again later. + + + +## Conclusion + +Congratulations, you're done!
Throughout this lab, you: + +- Used the `demo-deployer` to deploy Telco Lite to the cloud +- Used New Relic to investigate simulated issues in Telco Lite services +- Tore down all the infrastructural resources you created throughout this lab + +Hopefully, you learned a lot about using New Relic to investigate issues in your services. To get your hands on more features of New Relic, pick another demo from the [catalog](https://github.com/newrelic/demo-catalog) and spin it up with the deployer! \ No newline at end of file diff --git a/src/markdown-pages/automate-workflows/diagnose-problems/view-your-services.mdx b/src/markdown-pages/automate-workflows/diagnose-problems/view-your-services.mdx new file mode 100644 index 000000000..93280e80c --- /dev/null +++ b/src/markdown-pages/automate-workflows/diagnose-problems/view-your-services.mdx @@ -0,0 +1,39 @@ +--- +path: '/automate-workflows/diagnose-problems/view-your-services' +title: 'View your services' +template: 'GuideTemplate' +description: 'View your services in New Relic to diagnose the problem.' +duration: '5 min' +procIdx: 2 +--- + + + +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. + +Each procedure in the lab builds upon the last, so make sure you've completed the last procedure, [_Spin up Acme Telco Lite architecture_](/automate-workflows/diagnose-problems/spin-up-acme), before starting this one. + + + +You can monitor your applications in New Relic and get real-time, performance-related data to see what's happening at a particular point in time. This allows you to quickly diagnose and debug any issues that might result in a bad user experience. + +In this procedure, you view your services in New Relic.
+ +## View your services + +With your services running in the cloud, log in to New Relic and select **APM** from the top navigation to see how your services are holding up: + +![APM story introduction](../../../images/telco-lite/story-introduction.png) + +Yikes! The alerts, high response times, and red-colored indicators suggest things aren't well. There are two main issues that seem to be affecting your services: + +- The Warehouse Portal has abnormally high response times +- Multiple services are raising error alerts + +The next step is to diagnose these issues. + + + +This procedure is part of a lab that teaches you how to diagnose common issues using New Relic. Continue on to the next procedure: [Diagnose high response time](/automate-workflows/diagnose-problems/high-response-times). + + \ No newline at end of file