
Fix throughput/octetDeltaCount issue in flow visualization #2089

Merged. 1 commit merged into antrea-io:main on Apr 15, 2021.

Conversation

@zyiou (Contributor) commented Apr 14, 2021

The IsPresent field of a connection should be updated before the
IsConnectionDying check; otherwise IsConnectionDying always returns
true, which prevents existing connections from being updated and makes
octetDeltaCount always return 0.

This commit also changes the delta count of the first record from zero
to the total delta count, modifies the throughput calculation for the
first record in the Logstash config, and renames the throughput diagram
from 'throughput' to 'cumulative bytes'.

fixes #2085
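
To make the ordering bug concrete, here is a minimal, self-contained Go sketch; the names (`connection`, `update`, the stand-in body of `isConnectionDying`) are hypothetical and not the actual Antrea connection-store code:

```go
package main

import "fmt"

type connection struct {
	isPresent       bool
	octetDeltaCount uint64
}

// Stand-in for the dying check: a connection no longer present in the
// conntrack table is considered dying.
func isConnectionDying(c *connection) bool { return !c.isPresent }

// update sketches the fix: refresh isPresent from the latest conntrack poll
// *before* the dying check reads it. The buggy order ran the check first, so
// the stale "absent" flag made every existing connection look dying and its
// counters were never refreshed (octetDeltaCount stayed 0).
func update(c *connection, seenInPoll bool, deltaBytes uint64) {
	c.isPresent = seenInPoll // the fix: set this first
	if isConnectionDying(c) {
		return // genuinely dying connections are not updated
	}
	c.octetDeltaCount += deltaBytes
}

func main() {
	c := &connection{}
	update(c, true, 1500)          // connection seen in this poll
	fmt.Println(c.octetDeltaCount) // 1500, not 0
}
```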

@zyiou (Contributor, Author) commented Apr 14, 2021

@antoninbas Please take a quick look and check whether the fix works in your setup. It works for me locally. Thanks!

@antoninbas (Contributor) left a comment

So it fixes the issue with flows not being reported at all. However, if I use connections that last 30 seconds, and keep the export interval at 60 seconds everywhere, all the flow records still have delta counts set to 0, which in turn means that the throughput graphs are still all empty.

This doesn't seem right. I don't know if this is something that should be fixed by including non-zero delta counts in the first flow record for a connection, or if this is something that needs to be fixed on the ELK configuration side. @srikartati

@codecov-io commented Apr 14, 2021

Codecov Report

Merging #2089 (820eaab) into main (b118f2d) will decrease coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #2089      +/-   ##
==========================================
- Coverage   60.97%   60.94%   -0.03%     
==========================================
  Files         270      270              
  Lines       20366    20365       -1     
==========================================
- Hits        12418    12412       -6     
- Misses       6649     6670      +21     
+ Partials     1299     1283      -16     
Flag            Coverage Δ
kind-e2e-tests  51.71% <100.00%> (+0.07%) ⬆️
unit-tests      41.35% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/flowexporter/connections/connections.go 76.05% <100.00%> (-0.17%) ⬇️
pkg/agent/controller/traceflow/packetin.go 56.69% <0.00%> (-7.15%) ⬇️
...agent/controller/traceflow/traceflow_controller.go 69.46% <0.00%> (-3.82%) ⬇️
pkg/controller/traceflow/controller.go 67.27% <0.00%> (-2.43%) ⬇️
pkg/agent/openflow/pipeline.go 68.99% <0.00%> (-0.87%) ⬇️
pkg/agent/route/route_linux.go 42.44% <0.00%> (+0.58%) ⬆️
pkg/agent/agent.go 47.72% <0.00%> (+0.71%) ⬆️
...ntroller/networkpolicy/networkpolicy_controller.go 71.75% <0.00%> (+1.62%) ⬆️
...gent/controller/networkpolicy/status_controller.go 75.34% <0.00%> (+2.73%) ⬆️
pkg/agent/flowexporter/exporter/exporter.go 74.59% <0.00%> (+5.14%) ⬆️
... and 2 more

@zyiou (Contributor, Author) commented Apr 14, 2021

Currently the delta count of the first record is always 0. The reason is that for connections established before the Flow Exporter is deployed, the byte count in the conntrack table may already be quite large, which would make the first data point a misleading peak in the throughput diagram.
When a connection's duration is shorter than the export interval, only one record is exported, and all of its delta counts are 0.
This is fixed by the latest commit.

@srikartati (Member) commented Apr 14, 2021

> So it fixes the issue with flows not being reported at all. However, if I use connections that last 30 seconds, and keep the export interval at 60 seconds everywhere, all the flow records still have delta counts set to 0, which in turn means that the throughput graphs are still all empty.
>
> This doesn't seem right. I don't know if this is something that should be fixed by including non-zero delta counts in the first flow record for a connection, or if this is something that needs to be fixed on the ELK configuration side. @srikartati

For connections whose duration is shorter than the activeFlowExportTimeout, we send the flow record based on the idleFlowExportTimeout, depending on the exact duration of the short flow. In these cases, when we send the one final record, we can send non-zero delta values; i.e., for all flow records generated by the idleFlowExportTimeout, we will send a non-zero delta count even though it is the first record (this needs a fix).
For flows whose duration is such that one record is generated because of the activeFlowExportTimeout and a final record because of the idleFlowExportTimeout, the first record will still have a zero delta count and the second record will have the non-zero delta count.
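
As a worked example (hypothetical timeout values): with activeFlowExportTimeout = 60s and idleFlowExportTimeout = 15s, a 30-second connection never reaches the active timeout, so roughly 15 seconds after its last packet the idle timeout fires and a single, final record is exported; under the proposal above, that one record would carry the full non-zero delta count. A 90-second connection instead produces a first record at the 60-second active timeout (zero delta today) and a final record via the idle timeout (non-zero delta).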

What do you think?

@antoninbas (Contributor) commented:

I don't know if there is anything to change on the Antrea side. I do not want to report non-zero delta counts if it's going to create other issues, e.g. with bandwidth reporting, or if it is not in agreement with the IPFIX specs.

It seems to me that maybe this should be fixed on the ELK configuration side: maybe for these "throughput" graphs showing the cumulative amount of traffic, we should fall back to the total amount if the delta count is 0?

@srikartati (Member) commented:

> I don't know if there is anything to change on the Antrea side. I do not want to report non-zero delta counts if it's going to create other issues, e.g. with bandwidth reporting, or if it is not in agreement with the IPFIX specs.
>
> It seems to me that maybe this should be fixed on the ELK configuration side: maybe for these "throughput" graphs showing the cumulative amount of traffic, we should fall back to the total amount if the delta count is 0?

I checked the RFC again and could not find any specification relevant to this scenario. Yes, even with the change on the flow exporter side for the idleFlowExportTimeout, we will still have the issue of incorrect bandwidth computation using the existing methodology.

Computing bandwidth for flows with only one record using the delta count and the time interval between records, which is the current method of computation, does not make sense. Therefore, providing a solution from the ELK flow collector side is a good alternative. I can think of two options:

  1. Ignore the first flow record for bandwidth computation. The downside is that flows with a single record never show up in the throughput charts.
  2. Compute bandwidth or throughput differently for the first flow record (see the sketch after this list): use the total bytes and the difference (flow end time - flow start time) for the bandwidth computation of the first record, and use the existing computation for the rest of the flow records. Normalize this across all flows. One con is that I am doubtful about the correctness of the flow start time, as we have not tested it enough on all platforms; it will become robust in due course.
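
A minimal sketch of option 2, with assumed field and function names; the actual change would live in the Logstash config (Ruby), so Go is used here purely for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// throughputBps: the first record of a flow uses the cumulative total over
// the flow's lifetime; subsequent records use the delta over the export
// interval (the existing computation).
func throughputBps(firstRecord bool, octetTotal, octetDelta uint64,
	flowStart, flowEnd time.Time, exportInterval time.Duration) float64 {
	if firstRecord {
		d := flowEnd.Sub(flowStart).Seconds()
		if d <= 0 {
			return 0 // guard against a missing or skewed flow start time
		}
		return float64(octetTotal) * 8 / d // total bytes over flow lifetime
	}
	return float64(octetDelta) * 8 / exportInterval.Seconds()
}

func main() {
	start := time.Now().Add(-30 * time.Second)
	// Single-record, 30-second flow: non-zero throughput instead of 0.
	fmt.Println(throughputBps(true, 375000, 0, start, time.Now(), time.Minute))
}
```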

@zyiou I feel the name Pod-to-Pod Flow Throughput (Bytes) is not appropriate, as there is no time component at all. It should be changed to something like Pod-to-Pod Flow Cumulative Bytes. Do we use total bytes here or delta counts? If delta counts, we should use total bytes directly and keep updating after every record.

@zyiou (Contributor, Author) commented Apr 14, 2021

> @zyiou I feel the name Pod-to-Pod Flow Throughput (Bytes) is not appropriate, as there is no time component at all. It should be changed to something like Pod-to-Pod Flow Cumulative Bytes. Do we use total bytes here or delta counts? If delta counts, we should use total bytes directly and keep updating after every record.

I agree that the name should be changed.
We use the delta count in the diagram. The reason is that the value shown in the diagram is actually the sum of the delta counts over the time period the user chooses. For example, if the user wants to view the data of the last 5 minutes, the bytes shown in the diagram are the sum of the delta counts over the last 5 minutes. Summing the total counts over the last 5 minutes does not make sense in this case.
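
A hypothetical worked example of that windowing: if a flow exports one record per minute with octetTotalCount values 100, 250, 400, 500, 600, the corresponding delta counts are 100, 150, 150, 100, 100 (assuming the first record's delta equals its total, per this PR). Summing the deltas over the last 5 minutes yields 600, the bytes actually transferred in the window; summing the totals would yield 1850, which measures nothing.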

@srikartati (Member) commented Apr 14, 2021

> We use the delta count in the diagram. The reason is that the value shown in the diagram is actually the sum of the delta counts over the time period the user chooses. For example, if the user wants to view the data of the last 5 minutes, the bytes shown in the diagram are the sum of the delta counts over the last 5 minutes. Summing the total counts over the last 5 minutes does not make sense in this case.

In theory, we could take the difference between the total count of the last flow record and that of the first flow record within the last five minutes. Instead of changing too much, we can use a mix of the total count and the delta counts, i.e., use the total count for the very first flow record and the delta count for the rest. Until now we have been ignoring the bytes that come from the first flow record, which leads to an inaccurate cumulative byte count. Do you agree? Can this logic be handled in Logstash?
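
A sketch of that mixed accumulation for one flow's records, with an assumed minimal record layout (not Antrea's actual types); wherever it runs, in Logstash or elsewhere, the logic would be:

```go
package main

import "fmt"

// flowRecord holds the two byte counters from an exported flow record.
type flowRecord struct {
	octetTotalCount uint64
	octetDeltaCount uint64
}

// cumulativeBytes expects one flow's records in export order: the first
// record contributes its total count, every later record only its delta.
func cumulativeBytes(records []flowRecord) uint64 {
	var sum uint64
	for i, r := range records {
		if i == 0 {
			sum += r.octetTotalCount
		} else {
			sum += r.octetDeltaCount
		}
	}
	return sum
}

func main() {
	recs := []flowRecord{{100, 0}, {250, 150}, {400, 150}}
	fmt.Println(cumulativeBytes(recs)) // 400: first total + later deltas
}
```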

@antoninbas (Contributor) left a comment

What are the Kibana changes (kibana.ndjson and the KIBANA_DEFAULTAPPID value)?
Did you test the change in your cluster?

```diff
@@ -165,7 +165,7 @@ spec:
         - name: ELASTICSEARCH_URL
           value: "http://elasticsearch:9200"
         - name: KIBANA_DEFAULTAPPID
-          value: "dashboard/653cf1e0-2fd2-11e7-99ed-49759aed30f5"
+          value: "dashboard/3b331b30-b987-11ea-b16e-fb06687c3589"
```
@srikartati (Member) commented on the diff:

same question as Antonin. Why is this required?

@zyiou (Contributor, Author) replied:

Found that a 'dashboard cannot be found' error pops up after importing the Kibana ndjson file.

@zyiou (Contributor, Author) commented Apr 15, 2021

> What are the Kibana changes (kibana.ndjson and the KIBANA_DEFAULTAPPID value)?
> Did you test the change in your cluster?

A 'dashboard cannot be found' error was popping up after importing the Kibana ndjson file. It does not affect our functionality (probably because of my browser settings, the error did not show up before). Tested the change.

@srikartati (Member) commented:

/test-all

@srikartati (Member) commented:

/test-ipv6-only-e2e
/test-ipv6-e2e

@srikartati merged commit efddc8d into antrea-io:main on Apr 15, 2021
antoninbas pushed a commit to antoninbas/antrea that referenced this pull request Apr 29, 2021
antoninbas pushed a commit that referenced this pull request Apr 30, 2021
@zyiou added the area/flow-visibility (Issues or PRs related to flow visibility support in Antrea) and area/flow-visibility/elk (Issues or PRs related to the reference ELK configuration for flow visualization) labels on Jun 9, 2021
Linked issue (closed by this PR): Flow visualization no longer working correctly in v1.0 (#2085)