Speed up build of thin data streams #11618
Skipping CI for Draft Pull Request.
🤖 A k8s content image for this PR is available at: Click here to see how to deploy it. If you already have Compliance Operator deployed: Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
Unfortunately the generated data streams are broken because the rules don't reference the OVAL components correctly.
For example, build/thin_ds/ssg-fedora-ds_selinux_state.xml contains this:
<xccdf-1.2:check system="http://oval.mitre.org/XMLSchema/oval-definitions-5">
<xccdf-1.2:check-content-ref href="oval-unlinked.xml" name="selinux_state" />
</xccdf-1.2:check>
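A quick way to spot this kind of breakage could look like the sketch below. It only assumes that an unlinked rule still points at the `oval-unlinked.xml` placeholder seen in the snippet above; the function name `find_unlinked_oval_refs` and the sample document are made up for illustration and are not part of the build system.

```python
import xml.etree.ElementTree as ET

XCCDF_NS = "http://checklists.nist.gov/xccdf/1.2"
OVAL_SYSTEM = "http://oval.mitre.org/XMLSchema/oval-definitions-5"


def find_unlinked_oval_refs(xml_text, placeholder="oval-unlinked.xml"):
    """Return names of OVAL check-content-refs that still point at the placeholder."""
    root = ET.fromstring(xml_text)
    unlinked = []
    for check in root.iter("{%s}check" % XCCDF_NS):
        if check.get("system") != OVAL_SYSTEM:
            continue  # skip OCIL and other check systems
        for ref in check.findall("{%s}check-content-ref" % XCCDF_NS):
            if ref.get("href") == placeholder:
                unlinked.append(ref.get("name"))
    return unlinked


# Hypothetical minimal document wrapping the check element quoted above.
SAMPLE = """<xccdf-1.2:Rule xmlns:xccdf-1.2="http://checklists.nist.gov/xccdf/1.2" id="selinux_state">
  <xccdf-1.2:check system="http://oval.mitre.org/XMLSchema/oval-definitions-5">
    <xccdf-1.2:check-content-ref href="oval-unlinked.xml" name="selinux_state" />
  </xccdf-1.2:check>
</xccdf-1.2:Rule>"""

print(find_unlinked_oval_refs(SAMPLE))
```

Running the same function over the text of each generated file in build/thin_ds would list the rules whose OVAL reference was never linked.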
build-scripts/build_xccdf.py
Outdated
if args.thin_ds_components_dir != "off":
    if not os.path.exists(args.thin_ds_components_dir):
        os.makedirs(args.thin_ds_components_dir)
    store_xccdf_per_profile(loader, args.thin_ds_components_dir)
When I build thin data streams for the RHEL 9 product, the build fails here with a traceback:
[ 30%] [rhel9-content] generating plain XCCDF, OVAL and OCIL files
Traceback (most recent call last):
File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 138, in <module>
main()
File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 132, in main
store_xccdf_per_profile(loader, args.thin_ds_components_dir)
File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 93, in store_xccdf_per_profile
for id_, xccdftree in loader.get_benchmark_xml_by_profile():
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 1603, in get_benchmark_xml_by_profile
profile_id, benchmark = self.benchmark.get_benchmark_xml_for_profile(profile)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 518, in get_benchmark_xml_for_profile
return profile.id_, self.to_xml_element(
^^^^^^^^^^^^^^^^^^^^
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 469, in to_xml_element
self._add_groups_xml(root, components_to_not_include, env_yaml)
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 439, in _add_groups_xml
root.append(group.to_xml_element(env_yaml, components_to_not_include))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element
self._add_sub_groups(group, components_to_not_include, env_yaml)
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups
group.append(_group.to_xml_element(env_yaml, components_to_not_include))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element
self._add_sub_groups(group, components_to_not_include, env_yaml)
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups
group.append(_group.to_xml_element(env_yaml, components_to_not_include))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element
self._add_sub_groups(group, components_to_not_include, env_yaml)
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups
group.append(_group.to_xml_element(env_yaml, components_to_not_include))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 722, in to_xml_element
self._add_rules_xml(group, rules_to_not_include, env_yaml)
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 648, in _add_rules_xml
group.append(rule.to_xml_element(env_yaml))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 1132, in to_xml_element
add_reference_elements(rule, self.references, ref_uri_dict)
File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 120, in add_reference_elements
raise ValueError(msg)
ValueError: Error processing reference cis: ['5.6.1.4']. A reference type has been added that the project doesn't know about.
The problem is related to the references, specifically to how reference URIs are stored and processed throughout the build process. Unfortunately, we now have product-specific reference types and global reference types. Global reference types are defined in ssg/constants.py and the product-specific ones are set in each product's product.yml. The reference URIs are included in the infamous env_yaml variable. See the Rule.to_xml_element() method in build_yaml.py around line 1128. I think we probably need to pass env_yaml to this function and then pass it down to the called functions, as we can infer from the traceback.
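To illustrate why the missing env_yaml matters, here is a minimal sketch of the lookup. Everything in it is a simplified assumption: `GLOBAL_REFERENCE_URIS`, `merged_reference_uris`, `check_reference_types`, the `reference_uris` key, and the example URLs are placeholders, not the actual code in ssg/constants.py or build_yaml.py.

```python
# Hypothetical stand-in for the global table defined in ssg/constants.py.
GLOBAL_REFERENCE_URIS = {"nist": "https://example.org/nist"}


def merged_reference_uris(env_yaml=None):
    """Combine global reference URIs with product-specific ones from env_yaml."""
    uris = dict(GLOBAL_REFERENCE_URIS)
    if env_yaml:
        # Assumed key name for the product-specific mapping from product.yml.
        uris.update(env_yaml.get("reference_uris", {}))
    return uris


def check_reference_types(references, env_yaml=None):
    """Raise ValueError for any reference type without a known URI."""
    known = merged_reference_uris(env_yaml)
    for ref_type, ref_values in references.items():
        if ref_type not in known:
            raise ValueError(
                "Error processing reference %s: %s. A reference type has been "
                "added that the project doesn't know about." % (ref_type, ref_values))
    return known


references = {"cis": ["5.6.1.4"]}
rhel9_env_yaml = {"reference_uris": {"cis": "https://example.org/cis"}}

try:
    check_reference_types(references)  # env_yaml was not passed down
except ValueError as exc:
    print("without env_yaml:", exc)

print("with env_yaml:", sorted(check_reference_types(references, rhel9_env_yaml)))
```

Without env_yaml only the global types are known, so a product-specific type such as cis triggers the same ValueError as in the traceback; once env_yaml is passed down, the lookup succeeds.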
fd product.yml products | xargs grep cis
I fixed the traceback.
build-scripts/build_xccdf.py
Outdated
ocil = loader.export_ocil_to_xml()
link_ocil(xccdftree, checks, args.ocil, ocil)

ssg.xml.ElementTree.ElementTree(xccdftree).write(args.xccdf)
Thinking about it, when building thin data streams we don't want to store this file, and we don't even need to serialize it. But I can see that most of the code depends on the xccdftree. Therefore, I'm afraid it would be difficult to rework this now, but I think it is a nice idea for a future ticket. WDYT?
Yes, this XCCDF file is created because the next steps of the build use it, especially the formatting steps. These steps might be removed once Python 2 is no longer supported.
build-scripts/build_xccdf.py
Outdated
def store_xccdf_per_profile(loader, thin_ds_components_dir):
    for id_, xccdftree in loader.get_benchmark_xml_by_profile():
        xccdf_file_name = os.path.join(thin_ds_components_dir, "xccdf_{}.xml".format(id_))
        ssg.xml.ElementTree.ElementTree(xccdftree).write(xccdf_file_name)
I think the write call needs the encoding set to "utf-8", because a missing encoding caused a problem recently; see #11614.
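A minimal sketch of the suggested change, using the standard library `xml.etree.ElementTree` directly. The assumption here is that `ssg.xml.ElementTree` exposes the same `write()` signature; the helper name `write_tree_utf8` and the sample tree are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_tree_utf8(root, path):
    """Serialize an element tree as UTF-8 with an XML declaration."""
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


# Hypothetical tree containing non-ASCII text.
root = ET.Element("Benchmark")
ET.SubElement(root, "title").text = "Příklad"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xccdf_example.xml")
    write_tree_utf8(root, path)
    with open(path, "rb") as f:
        data = f.read()

print(data.decode("utf-8"))
```

Passing the encoding explicitly keeps the output independent of the default chosen by `write()`.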
build-scripts/build_xccdf.py
Outdated
ocil = loader.export_ocil_to_xml()
link_ocil(xccdftree, checks, args.ocil, ocil)

ssg.xml.ElementTree.ElementTree(xccdftree).write(args.xccdf)
Another write call; check the encoding here as well.
ssg/build_yaml.py
Outdated
# This is where references should be put if there are any
# This is where rationale should be put if there are any
???
I left these comments in the method because they were present in the old version of to_xml_element.
I see. I also notice that the method is part of the Group class, so these comments actually make sense. However, our groups don't have references or rationales; they wouldn't fit our structure. I feel that these comments provide unrequested information. If you're fine with this type of noise, you can keep them. If not, I would prefer to remove them.
This is a huge speed-up. On my machine, building thin DSs for all rules in the rhel9 content took 2:15, which is comparable to the normal build, which takes 0:53. This improvement makes the feature useful for various use cases.
ssg/build_yaml.py
Outdated
groups = set()
for group in self.groups.values():
    rules_, groups_ = group.get_not_included_components(rule_ids_list)
    if len(rules) == len(self.rules) and len(rules_) == len(group.rules):
Here we have: rules, rules_, self.rules, and group.rules. Confusing!
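One possible way to make the naming less confusing is sketched below. This is not the PR's implementation: the `Group` class is a hypothetical, simplified stand-in (rules are plain IDs, and a group counts as excluded only when none of its own rules is selected and all of its subgroups are excluded), and only the method name `get_not_included_components` is borrowed from the diff.

```python
class Group:
    """Hypothetical, simplified stand-in for the project's Group class."""

    def __init__(self, id_, rule_ids=(), subgroups=()):
        self.id_ = id_
        self.rule_ids = set(rule_ids)
        self.subgroups = {group.id_: group for group in subgroups}

    def get_not_included_components(self, selected_rule_ids):
        """Return (excluded_rule_ids, excluded_group_ids) for this subtree."""
        own_excluded_rules = self.rule_ids - set(selected_rule_ids)
        excluded_rules = set(own_excluded_rules)
        excluded_groups = set()
        all_subgroups_excluded = True
        for subgroup in self.subgroups.values():
            sub_excluded_rules, sub_excluded_groups = (
                subgroup.get_not_included_components(selected_rule_ids))
            excluded_rules |= sub_excluded_rules
            excluded_groups |= sub_excluded_groups
            if subgroup.id_ not in sub_excluded_groups:
                all_subgroups_excluded = False
        # The group itself is dropped only if nothing below it is selected.
        if own_excluded_rules == self.rule_ids and all_subgroups_excluded:
            excluded_groups.add(self.id_)
        return excluded_rules, excluded_groups


selinux = Group("selinux", rule_ids=["selinux_state", "selinux_policytype"])
ssh = Group("ssh", rule_ids=["sshd_disable_root_login"])
system = Group("system", subgroups=[selinux, ssh])

excluded_rules, excluded_groups = system.get_not_included_components({"selinux_state"})
print(sorted(excluded_rules), sorted(excluded_groups))
```

Names such as `own_excluded_rules` and `sub_excluded_rules` say which level of the tree each collection belongs to, which is the distinction that `rules` versus `rules_` hides.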
/packit retest-failed
Code Climate has analyzed commit 7c7c823 and detected 1 issue on this pull request. Here's the issue category breakdown:
The test coverage on the diff in this pull request is 36.8% (50% is the threshold). This pull request will bring the total coverage in the repository to 58.0% (-0.1% change). View more on Code Climate.
/packit retest-failed
I have built the data stream using the -r option. I have also built all thin data streams using the -t option and checked their contents. I have used some of them in oscap scans and in automatus tests. I also compared the normal data streams of rhel9 before and after this change.
Description:
This PR speeds up the build of thin data streams using the --thin flag. Before the changes, the build took quite a long time: 48m31.2s. After the changes, the build takes a total of 4m40s, roughly ten times faster.
The old build approach was to copy Benchmark instances with only one profile. The next build steps were then performed with that copy. This led to a slowdown due to deep copying of the Benchmark class and performing the link steps for each thin DS.
The new approach changes the way thin DSs are built. The first step is to create a Benchmark for the thick DS. Then the XML for a single profile is generated from that Benchmark. Before the generation, a dictionary of components that will not be included in the XML is created. Finally, the generated XML is linked to OVAL.
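The flow described above can be sketched roughly as follows. This is a simplified model under stated assumptions: the `Benchmark` class here is a flat, hypothetical stand-in, a set is used instead of the dictionary of excluded components, and the OVAL linking step is left out. Only the idea of generating per-profile XML from one shared Benchmark without deep copying is taken from the description.

```python
import xml.etree.ElementTree as ET


class Benchmark:
    """Hypothetical, flat stand-in for the project's Benchmark class."""

    def __init__(self, rule_ids, profiles):
        self.rule_ids = list(rule_ids)
        self.profiles = dict(profiles)  # profile id -> selected rule ids

    def to_xml_element(self, profile_id, rules_to_not_include):
        """Build the XML for one profile, skipping the excluded rules."""
        root = ET.Element("Benchmark")
        ET.SubElement(root, "Profile", id=profile_id)
        for rule_id in self.rule_ids:
            if rule_id not in rules_to_not_include:
                ET.SubElement(root, "Rule", id=rule_id)
        return root

    def get_benchmark_xml_by_profile(self):
        """Yield (profile_id, xml_root) pairs without copying the benchmark."""
        for profile_id, selected in self.profiles.items():
            # Computed before generation, as in the description above.
            rules_to_not_include = set(self.rule_ids) - set(selected)
            yield profile_id, self.to_xml_element(profile_id, rules_to_not_include)


benchmark = Benchmark(
    rule_ids=["selinux_state", "sshd_disable_root_login"],
    profiles={
        "selinux_state": ["selinux_state"],
        "sshd_disable_root_login": ["sshd_disable_root_login"],
    },
)

trees = dict(benchmark.get_benchmark_xml_by_profile())
for profile_id, tree in trees.items():
    print(profile_id, [rule.get("id") for rule in tree.findall("Rule")])
```

The Benchmark is built once and only read afterwards, so the per-profile cost is limited to computing the exclusion set and serializing the selected components.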
Review Hints:
To test the --thin flag, you can run this script:
The script generates a thin data stream for each rule and then performs a scan using oscap. This test takes more than an hour to run because there are approximately 1830 rules to process and some are memory intensive.
To test the speed-up of the build, you can run this command:
The -p flag enables profiling of the build.