Skip to content

Conversation

@ritiek
Copy link
Contributor

@ritiek ritiek commented Mar 16, 2019

Closes #1424 and addresses #1436.

scancode can now parse description and author fields from Cargo.toml manifest and deal with some edge cases when authors or description are not mentioned under Cargo.toml.

This should be ready to merge if everything looks good!

@codecov
Copy link

codecov bot commented Mar 16, 2019

Codecov Report

Merging #1454 into develop will decrease coverage by 0.78%.
The diff coverage is 93.75%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1454      +/-   ##
===========================================
- Coverage     83.9%   83.11%   -0.79%     
===========================================
  Files          120      120              
  Lines        14095    14123      +28     
===========================================
- Hits         11826    11739      -87     
- Misses        2269     2384     +115
Impacted Files Coverage Δ
src/packagedcode/cargo.py 91.35% <93.75%> (+0.79%) ⬆️
src/cluecode/plugin_ignore_copyrights.py 46.34% <0%> (-29.27%) ⬇️
src/cluecode/finder.py 69.53% <0%> (-12.11%) ⬇️
src/cluecode/finder_data.py 82.75% <0%> (-10.35%) ⬇️
src/licensedcode/match.py 72.44% <0%> (-5.26%) ⬇️
src/textcode/markup.py 93.47% <0%> (-4.35%) ⬇️
src/commoncode/command.py 82.2% <0%> (-3.39%) ⬇️
src/textcode/analysis.py 72.59% <0%> (-2.97%) ⬇️
src/licensedcode/match_spdx_lid.py 83.33% <0%> (-1.93%) ⬇️
src/cluecode/copyrights.py 91.18% <0%> (-1.38%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65528ff...527e663. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 16, 2019

Codecov Report

Merging #1454 into develop will decrease coverage by 0.97%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1454      +/-   ##
===========================================
- Coverage    83.91%   82.94%   -0.98%     
===========================================
  Files          120      121       +1     
  Lines        14095    14453     +358     
===========================================
+ Hits         11828    11988     +160     
- Misses        2267     2465     +198
Impacted Files Coverage Δ
src/packagedcode/cargo.py 93.24% <100%> (+2.67%) ⬆️
src/licensedcode/tracing.py 56.07% <0%> (-30.89%) ⬇️
src/licensedcode/tokenize.py 73.4% <0%> (-23.15%) ⬇️
src/licensedcode/match.py 69.67% <0%> (-8.03%) ⬇️
src/licensedcode/cache.py 96.96% <0%> (-1.52%) ⬇️
src/commoncode/fileutils.py 81.18% <0%> (-1.33%) ⬇️
src/packagedcode/models.py 92.3% <0%> (-0.62%) ⬇️
src/licensedcode/frequent_tokens.py 100% <0%> (ø) ⬆️
src/licensedcode/stopwords.py 100% <0%> (ø)
src/scancode/extract_cli.py 88% <0%> (+3%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d990a3...7416220. Read the comment docs.

Signed-off-by: Ritiek Malhotra <[email protected]>
@ritiek ritiek force-pushed the parse-more-fields-in-cargo-manifest branch from 527e663 to 10c9458 Compare March 16, 2019 09:00
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you: looking good. See my review comments inline.

Also in the parse function there is something wrong from previous commits:

def parse(location):
    """
    Return a Package object from a Cargo.toml file or None.
    """
    if not is_cargo_toml(location):
        return
    with io.open(location, encoding='utf-8') as loc: <-- you open `location` as `loc` alright
        package_data = toml.load(location, _dict=OrderedDict) <-- ... but you pass a location string and not a loc file-like object
    return build_package(package_data)

# At the moment, this is only useful for making tests pass
ordered_dict_map = {}
for key in ("source_packages", "dependencies", "keywords", "parties"):
for key in ("source_packages", "dependencies", "keywords"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit picking: can you use single quotes consistently everywhere? (unless this is for a docstring """)

**ordered_dict_map
)

field_mappers = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is a single function that you call here, may be it would simpler to just call it directly without using a list of partials, etc? I reckon the code in npm does this, but this is not super readable IMHO and there are many more functions being called.


def party_mapper(party, package, party_type):
"""
Update package parties with party of `party_type` and return package.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than updating the package, it would be simpler to return the parties, and update the package at the call site instead. This way here has been borrowed from the npm.py way and is not something that I am very proud of so it is best not to carry it around.

May be something such as this:

def party_mapper(party, party_role):
    """
    ......
    """"
    for auth in party:
        name, email = parse_person(auth)
        yield models.Party(
            type=models.party_person,
            name=name,
            role=party_role,
            email=email)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Passing and then updating the package within functions can indeed cause confusion. Also, using yield seems a neat idea here!

return name, email


person_parser = re.compile(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind adding a few unit tests that showcase how this regex and the regex below behave with real data? This helps understand things better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll add in some tests shortly.


name = package_data.get('package').get('name')
version = package_data.get('package').get('version')
description = package_data.get('package').get('description')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO assign package_data.get('package') to a variable rather than calling it over and over

# TODO: Remove this ordered_dict_map once cargo.py is able to handle
# the appropriate data (source_packages, dependencies, etc..)
# At the moment, this is only useful for making tests pass
ordered_dict_map = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered_dict_map sounds a bit strange for a variable name: do not say what type it is, but name its content instead.
But also it might be much simpler to no create empty fields that are not needed and instead just add keyword arguments directly to your package constructor below:

authors = package_data.get('package', {}).get('authors', [])
# this assume a refactored party_mapper based on other comment 
parties = list(party_mapper(authors, party_type='author'))
package = RustCargoCrate(
    name=name,
    version=version,
    description=description,
    parties=parties,
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered_dict_map sounds a bit strange for a variable name: do not say what type it is, but name its content instead.

Good variable naming conventions. I'll keep that it mind. :)

But also it might be much simpler to no create empty fields that are not needed and instead just > add keyword arguments directly to your package constructor below:
...

I did try something like this in #1426 but for a reason I couldn't figure out - tests kept failing. Anyway, just tried to make it work this way again and seems like I had to also modify the respected fields in Cargo.toml.expected files from {} to [].

@ritiek
Copy link
Contributor Author

ritiek commented Mar 19, 2019

Thanks for such a detailed review! I've made the appropriate changes. We're still missing regex tests which I'll add in soon.

"notice_text": null,
"manifest_path": null,
"dependencies": {},
"dependencies": [],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was important as your use of empty dicts was changing the default type for this: it is a list.

@pombredanne
Copy link
Member

We're still missing regex tests which I'll add in soon.

Thank you. LGTM and I will merge as soon as we have these

@ritiek
Copy link
Contributor Author

ritiek commented Mar 20, 2019

I've added tests for our regex parser.

By the way, IMO using @pytest.mark.parametrize decorator makes things nicer and reduces code repetition. What do you think about it? We could also replace tests under TestCargo class to make use of @pytest.mark.parametrize decorator if that sounds like a good idea.

@pombredanne
Copy link
Member

re@pytest.mark.parametrize I am not opposed to it per se, but I would rather stay mostly independent of the test runner. That say we already depend on it for some command line tests. So sure, why not give it a try?... Yet, please see my comments in line.

@ritiek
Copy link
Contributor Author

ritiek commented Mar 20, 2019

So sure, why not give it a try?

Yup! Will do.

please see my comments in line.

Um.. I can't see any new inline comments since my last comment and the old ones have been addressed I guess?

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow, I had created but did not submit some review items.
Here they are! and sorry for that.

('<[email protected]>', (None, '<[email protected]>')),
]

class TestRegex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would need to subclass (object) for now. We are still on Python 2... otherwise you are using "old style" classes ,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yup, forgot about that. I force pushed my last commit to address this.

self.check_package(package, expected_loc, regen=False)


PERSON_PARSER_TEST_TABLE = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I would like this style where we have only positional arguments. That's not really easy to read and make sense of. Also to be convinced that its is a better solution, I would also need to see what a failure trace looks like and what feedback it provides to know what to fix where.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally modified two input sets so the tests fail.

$ pytest tests/packagedcode/test_cargo.py
================================== test session starts ==================================
platform linux2 -- Python 2.7.15rc1, pytest-4.3.1, py-1.8.0, pluggy-0.9.0
rootdir: /home/ritiek/Downloads/scancode-toolkit, inifile: setup.cfg
collected 12 items                                                                      

tests/packagedcode/test_cargo.py .....F.F....                                    [100%]

======================================= FAILURES =======================================
_____ TestRegex.test_person_parser[Barney Rubble <[email protected]>-expected_person0] ______

self = <test_cargo.TestRegex instance at 0x7f004f85c5a8>
person = 'Barney Rubble <[email protected]>'
expected_person = ('Barney Rubble ', '<[email protected]>')

    @pytest.mark.parametrize('person, expected_person', PERSON_PARSER_TEST_TABLE)
    def test_person_parser(self, person, expected_person):
        parsed_person = cargo.person_parser(person)
        person_information = parsed_person.groupdict()
        name, email = person_information.get('name'), person_information.get('email')
>       assert (name, email) == expected_person
E       AssertionError: assert ('Barney [email protected]>') == ('Barney [email protected]>')
E         At index 1 diff: u'<[email protected]>' != u'<[email protected]>'
E         Use -v to get the full diff

tests/packagedcode/test_cargo.py:89: AssertionError
_ TestRegex.test_person_parser[Some Bad Guy <[email protected]>-expected_person2] __

self = <test_cargo.TestRegex instance at 0x7f004fb5c908>
person = 'Some Bad Guy <[email protected]>'
expected_person = ('Some Good Guy ', '<[email protected]>')

    @pytest.mark.parametrize('person, expected_person', PERSON_PARSER_TEST_TABLE)
    def test_person_parser(self, person, expected_person):
        parsed_person = cargo.person_parser(person)
        person_information = parsed_person.groupdict()
        name, email = person_information.get('name'), person_information.get('email')
>       assert (name, email) == expected_person
E       AssertionError: assert ('Some Bad [email protected]>') == ('Some Good [email protected]>')
E         At index 0 diff: u'Some Bad Guy ' != u'Some Good Guy '
E         Use -v to get the full diff

tests/packagedcode/test_cargo.py:89: AssertionError
=============================== short test summary info ================================
FAILED tests/packagedcode/test_cargo.py::TestRegex::test_person_parser[Barney Rubble <[email protected]>-expected_person0]
FAILED tests/packagedcode/test_cargo.py::TestRegex::test_person_parser[Some Bad Guy <[email protected]>-expected_person2]

This is what it looks like running pytest without any additional parameters. It shows us the inputs on which the parametrized test function failed assertion.

That said, This is merely a choice of preference and I don't have any strong opinions on either approach. I could settle with either way. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the failure trace! It looks decent so I think that can work for smaller tests... though I am not even sure this is practical when there is an expected failure on a test. I do not know if you can also select a subset of the data with the -k CLI arg of py.test.... Anyway, let's keep it as you pushed it for now... and see if I warm up more with the approach :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! :D

Signed-off-by: Ritiek Malhotra <[email protected]>
@ritiek ritiek force-pushed the parse-more-fields-in-cargo-manifest branch from db4c5a9 to 7416220 Compare March 22, 2019 14:07
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM and will merge as soon as the CIs complete their runs. These are so slow! but there are a lot of tests too.

@pombredanne
Copy link
Member

Thank you again 💯

@pombredanne pombredanne merged commit f4aa8cd into aboutcode-org:develop Mar 23, 2019
@ritiek ritiek deleted the parse-more-fields-in-cargo-manifest branch March 25, 2019 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants