Detect description and author fields from Cargo.toml manifest #1454

ritiek · 2019-03-16T07:55:18Z

Closes #1424 and addresses #1436.

scancode can now parse description and author fields from Cargo.toml manifest and deal with some edge cases when authors or description are not mentioned under Cargo.toml.

This should be ready to merge if everything looks good!

Signed-off-by: Ritiek Malhotra <[email protected]>

codecov · 2019-03-16T08:10:50Z

Codecov Report

Merging #1454 into develop will decrease coverage by 0.78%.
The diff coverage is 93.75%.

@@             Coverage Diff             @@
##           develop    #1454      +/-   ##
===========================================
- Coverage     83.9%   83.11%   -0.79%     
===========================================
  Files          120      120              
  Lines        14095    14123      +28     
===========================================
- Hits         11826    11739      -87     
- Misses        2269     2384     +115

Impacted Files	Coverage Δ
src/packagedcode/cargo.py	`91.35% <93.75%> (+0.79%)`	⬆️
src/cluecode/plugin_ignore_copyrights.py	`46.34% <0%> (-29.27%)`	⬇️
src/cluecode/finder.py	`69.53% <0%> (-12.11%)`	⬇️
src/cluecode/finder_data.py	`82.75% <0%> (-10.35%)`	⬇️
src/licensedcode/match.py	`72.44% <0%> (-5.26%)`	⬇️
src/textcode/markup.py	`93.47% <0%> (-4.35%)`	⬇️
src/commoncode/command.py	`82.2% <0%> (-3.39%)`	⬇️
src/textcode/analysis.py	`72.59% <0%> (-2.97%)`	⬇️
src/licensedcode/match_spdx_lid.py	`83.33% <0%> (-1.93%)`	⬇️
src/cluecode/copyrights.py	`91.18% <0%> (-1.38%)`	⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65528ff...527e663. Read the comment docs.

codecov · 2019-03-16T08:10:50Z

Codecov Report

Merging #1454 into develop will decrease coverage by 0.97%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop    #1454      +/-   ##
===========================================
- Coverage    83.91%   82.94%   -0.98%     
===========================================
  Files          120      121       +1     
  Lines        14095    14453     +358     
===========================================
+ Hits         11828    11988     +160     
- Misses        2267     2465     +198

Impacted Files	Coverage Δ
src/packagedcode/cargo.py	`93.24% <100%> (+2.67%)`	⬆️
src/licensedcode/tracing.py	`56.07% <0%> (-30.89%)`	⬇️
src/licensedcode/tokenize.py	`73.4% <0%> (-23.15%)`	⬇️
src/licensedcode/match.py	`69.67% <0%> (-8.03%)`	⬇️
src/licensedcode/cache.py	`96.96% <0%> (-1.52%)`	⬇️
src/commoncode/fileutils.py	`81.18% <0%> (-1.33%)`	⬇️
src/packagedcode/models.py	`92.3% <0%> (-0.62%)`	⬇️
src/licensedcode/frequent_tokens.py	`100% <0%> (ø)`	⬆️
src/licensedcode/stopwords.py	`100% <0%> (ø)`
src/scancode/extract_cli.py	`88% <0%> (+3%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d990a3...7416220. Read the comment docs.

Signed-off-by: Ritiek Malhotra <[email protected]>

pombredanne

Thank you: looking good. See my review comments inline.

Also in the parse function there is something wrong from previous commits:

def parse(location):
    """
    Return a Package object from a Cargo.toml file or None.
    """
    if not is_cargo_toml(location):
        return
    with io.open(location, encoding='utf-8') as loc: <-- you open `location` as `loc` alright
        package_data = toml.load(location, _dict=OrderedDict) <-- ... but you pass a location string and not a loc file-like object
    return build_package(package_data)

pombredanne · 2019-03-19T09:40:20Z

src/packagedcode/cargo.py

    # At the moment, this is only useful for making tests pass
    ordered_dict_map = {}
-    for key in ("source_packages", "dependencies", "keywords", "parties"):
+    for key in ("source_packages", "dependencies", "keywords"):


nit picking: can you use single quotes consistently everywhere? (unless this is for a docstring """)

pombredanne · 2019-03-19T09:42:20Z

src/packagedcode/cargo.py

        **ordered_dict_map
    )

+    field_mappers = [


Since there is a single function that you call here, may be it would simpler to just call it directly without using a list of partials, etc? I reckon the code in npm does this, but this is not super readable IMHO and there are many more functions being called.

pombredanne · 2019-03-19T09:43:42Z

src/packagedcode/cargo.py

+
+def party_mapper(party, package, party_type):
+    """
+    Update package parties with party of `party_type` and return package.


Rather than updating the package, it would be simpler to return the parties, and update the package at the call site instead. This way here has been borrowed from the npm.py way and is not something that I am very proud of so it is best not to carry it around.

May be something such as this:

def party_mapper(party, party_role): """ ...... """" for auth in party: name, email = parse_person(auth) yield models.Party( type=models.party_person, name=name, role=party_role, email=email)

I agree. Passing and then updating the package within functions can indeed cause confusion. Also, using yield seems a neat idea here!

pombredanne · 2019-03-19T09:45:55Z

src/packagedcode/cargo.py

+    return name, email
+
+
+person_parser = re.compile(


Do you mind adding a few unit tests that showcase how this regex and the regex below behave with real data? This helps understand things better.

Yep, I'll add in some tests shortly.

pombredanne · 2019-03-19T10:04:12Z

src/packagedcode/cargo.py


    name = package_data.get('package').get('name')
    version = package_data.get('package').get('version')
+    description = package_data.get('package').get('description')


IMHO assign package_data.get('package') to a variable rather than calling it over and over

pombredanne · 2019-03-19T10:04:27Z

src/packagedcode/cargo.py

    # TODO: Remove this ordered_dict_map once cargo.py is able to handle
    # the appropriate data (source_packages, dependencies, etc..)
    # At the moment, this is only useful for making tests pass
    ordered_dict_map = {}


ordered_dict_map sounds a bit strange for a variable name: do not say what type it is, but name its content instead.
But also it might be much simpler to no create empty fields that are not needed and instead just add keyword arguments directly to your package constructor below:

authors = package_data.get('package', {}).get('authors', []) # this assume a refactored party_mapper based on other comment parties = list(party_mapper(authors, party_type='author')) package = RustCargoCrate( name=name, version=version, description=description, parties=parties, )

ordered_dict_map sounds a bit strange for a variable name: do not say what type it is, but name its content instead.

Good variable naming conventions. I'll keep that it mind. :)

But also it might be much simpler to no create empty fields that are not needed and instead just > add keyword arguments directly to your package constructor below:
...

I did try something like this in #1426 but for a reason I couldn't figure out - tests kept failing. Anyway, just tried to make it work this way again and seems like I had to also modify the respected fields in Cargo.toml.expected files from {} to [].

Signed-off-by: Ritiek Malhotra <[email protected]>

…ict instance Signed-off-by: Ritiek Malhotra <[email protected]>

ritiek · 2019-03-19T15:47:40Z

Thanks for such a detailed review! I've made the appropriate changes. We're still missing regex tests which I'll add in soon.

pombredanne · 2019-03-20T10:31:37Z

tests/packagedcode/data/cargo/clap/Cargo.toml.expected

 "notice_text": null,
 "manifest_path": null,
- "dependencies": {},
+ "dependencies": [],


This was important as your use of empty dicts was changing the default type for this: it is a list.

pombredanne · 2019-03-20T16:34:52Z

We're still missing regex tests which I'll add in soon.

Thank you. LGTM and I will merge as soon as we have these

ritiek · 2019-03-20T17:16:22Z

I've added tests for our regex parser.

By the way, IMO using @pytest.mark.parametrize decorator makes things nicer and reduces code repetition. What do you think about it? We could also replace tests under TestCargo class to make use of @pytest.mark.parametrize decorator if that sounds like a good idea.

pombredanne · 2019-03-20T20:27:41Z

re@pytest.mark.parametrize I am not opposed to it per se, but I would rather stay mostly independent of the test runner. That say we already depend on it for some command line tests. So sure, why not give it a try?... Yet, please see my comments in line.

ritiek · 2019-03-20T21:31:40Z

So sure, why not give it a try?

Yup! Will do.

please see my comments in line.

Um.. I can't see any new inline comments since my last comment and the old ones have been addressed I guess?

pombredanne

Somehow, I had created but did not submit some review items.
Here they are! and sorry for that.

pombredanne · 2019-03-20T20:24:20Z

tests/packagedcode/test_cargo.py

+    ('<[email protected]>', (None, '<[email protected]>')),
+]
+
+class TestRegex:


you would need to subclass (object) for now. We are still on Python 2... otherwise you are using "old style" classes ,

Ah yup, forgot about that. I force pushed my last commit to address this.

pombredanne · 2019-03-20T20:27:03Z

tests/packagedcode/test_cargo.py

        self.check_package(package, expected_loc, regen=False)
+
+
+PERSON_PARSER_TEST_TABLE = [


I am not sure I would like this style where we have only positional arguments. That's not really easy to read and make sense of. Also to be convinced that its is a better solution, I would also need to see what a failure trace looks like and what feedback it provides to know what to fix where.

I intentionally modified two input sets so the tests fail.

$ pytest tests/packagedcode/test_cargo.py ================================== test session starts ================================== platform linux2 -- Python 2.7.15rc1, pytest-4.3.1, py-1.8.0, pluggy-0.9.0 rootdir: /home/ritiek/Downloads/scancode-toolkit, inifile: setup.cfg collected 12 items tests/packagedcode/test_cargo.py .....F.F.... [100%] ======================================= FAILURES ======================================= _____ TestRegex.test_person_parser[Barney Rubble <[email protected]>-expected_person0] ______ self = <test_cargo.TestRegex instance at 0x7f004f85c5a8> person = 'Barney Rubble <[email protected]>' expected_person = ('Barney Rubble ', '<[email protected]>') @pytest.mark.parametrize('person, expected_person', PERSON_PARSER_TEST_TABLE) def test_person_parser(self, person, expected_person): parsed_person = cargo.person_parser(person) person_information = parsed_person.groupdict() name, email = person_information.get('name'), person_information.get('email') > assert (name, email) == expected_person E AssertionError: assert ('Barney [email protected]>') == ('Barney [email protected]>') E At index 1 diff: u'<[email protected]>' != u'<[email protected]>' E Use -v to get the full diff tests/packagedcode/test_cargo.py:89: AssertionError _ TestRegex.test_person_parser[Some Bad Guy <[email protected]>-expected_person2] __ self = <test_cargo.TestRegex instance at 0x7f004fb5c908> person = 'Some Bad Guy <[email protected]>' expected_person = ('Some Good Guy ', '<[email protected]>') @pytest.mark.parametrize('person, expected_person', PERSON_PARSER_TEST_TABLE) def test_person_parser(self, person, expected_person): parsed_person = cargo.person_parser(person) person_information = parsed_person.groupdict() name, email = person_information.get('name'), person_information.get('email') > assert (name, email) == expected_person E AssertionError: assert ('Some Bad [email protected]>') == ('Some Good [email protected]>') E At index 0 diff: u'Some Bad Guy ' != u'Some Good Guy ' E Use -v to get the full diff tests/packagedcode/test_cargo.py:89: AssertionError =============================== short test summary info ================================ FAILED tests/packagedcode/test_cargo.py::TestRegex::test_person_parser[Barney Rubble <[email protected]>-expected_person0] FAILED tests/packagedcode/test_cargo.py::TestRegex::test_person_parser[Some Bad Guy <[email protected]>-expected_person2]

This is what it looks like running pytest without any additional parameters. It shows us the inputs on which the parametrized test function failed assertion.

That said, This is merely a choice of preference and I don't have any strong opinions on either approach. I could settle with either way. :)

thank you for the failure trace! It looks decent so I think that can work for smaller tests... though I am not even sure this is practical when there is an expected failure on a test. I do not know if you can also select a subset of the data with the -k CLI arg of py.test.... Anyway, let's keep it as you pushed it for now... and see if I warm up more with the approach :)

Signed-off-by: Ritiek Malhotra <[email protected]>

pombredanne

Thanks! LGTM and will merge as soon as the CIs complete their runs. These are so slow! but there are a lot of tests too.

pombredanne · 2019-03-23T15:01:18Z

Thank you again 💯

ritiek added 5 commits March 16, 2019 12:03

Parse authors from Cargo.toml manifest

2b1651b

Signed-off-by: Ritiek Malhotra <[email protected]>

Parse crate description from Cargo.toml manifest

40049cf

Signed-off-by: Ritiek Malhotra <[email protected]>

Fix crates.io base URL formatting

7679320

Signed-off-by: Ritiek Malhotra <[email protected]>

Update tests

b473e5d

Signed-off-by: Ritiek Malhotra <[email protected]>

Handle when description or authors are not present

92c4604

Signed-off-by: Ritiek Malhotra <[email protected]>

Add docstrings!

10c9458

Signed-off-by: Ritiek Malhotra <[email protected]>

ritiek force-pushed the parse-more-fields-in-cargo-manifest branch from 527e663 to 10c9458 Compare March 16, 2019 09:00

pombredanne reviewed Mar 19, 2019

View reviewed changes

ritiek added 4 commits March 19, 2019 20:26

Code style consistency

fb3d509

Signed-off-by: Ritiek Malhotra <[email protected]>

Remove dead code opening Cargo.toml manifest

1f3aac4

Signed-off-by: Ritiek Malhotra <[email protected]>

Pass "parties" to constructor in RustCargoCrate class

bc20482

Signed-off-by: Ritiek Malhotra <[email protected]>

Do not unnecessarily map unimplemented manifest fields to an OrderedD…

e8b2836

…ict instance Signed-off-by: Ritiek Malhotra <[email protected]>

pombredanne reviewed Mar 20, 2019

View reviewed changes

pombredanne reviewed Mar 21, 2019

View reviewed changes

Add tests for RegEx parsing

7416220

Signed-off-by: Ritiek Malhotra <[email protected]>

ritiek force-pushed the parse-more-fields-in-cargo-manifest branch from db4c5a9 to 7416220 Compare March 22, 2019 14:07

pombredanne approved these changes Mar 22, 2019

View reviewed changes

pombredanne merged commit f4aa8cd into aboutcode-org:develop Mar 23, 2019

ritiek deleted the parse-more-fields-in-cargo-manifest branch March 25, 2019 13:15

		self.check_package(package, expected_loc, regen=False)


		PERSON_PARSER_TEST_TABLE = [

Uh oh!

Detect description and author fields from Cargo.toml manifest #1454

Detect description and author fields from Cargo.toml manifest #1454

Uh oh!

Conversation

ritiek commented Mar 16, 2019

Uh oh!

codecov bot commented Mar 16, 2019

Codecov Report

Uh oh!

codecov bot commented Mar 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

pombredanne left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ritiek commented Mar 19, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pombredanne commented Mar 20, 2019

Uh oh!

ritiek commented Mar 20, 2019

Uh oh!

pombredanne commented Mar 20, 2019

Uh oh!

ritiek commented Mar 20, 2019

Uh oh!

pombredanne left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pombredanne left a comment

Choose a reason for hiding this comment

Uh oh!

pombredanne commented Mar 23, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Mar 16, 2019 •

edited

Loading

pombredanne left a comment •

edited

Loading