Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved input syntax for frequency_differencing operation #1106

Merged
merged 4 commits into from
Aug 31, 2023

Conversation

praneethratna
Copy link
Contributor

@praneethratna praneethratna commented Jul 31, 2023

Addresses #967. Now input for frequency_differencing can be given directly in the form of frequency differencing criteria instead of separating out freq/chan from operators.

CC @leewujung

@emiliom
Copy link
Collaborator

emiliom commented Jul 31, 2023

These improvements are very cool!

Since you're adding sympy as a dependency, how "heavy" a library is it? In other words, is it likely to create dependency resolution issues? I see the dependency tree sympy > mpmath > gmpy2, where gmpy2 apparently uses a C library ("gmpy2 is an optimized, C-coded Python extension module ...").

Are there lighter-weight alternatives to sympy that would meet our needs? sympy has tons of capabilities, and here I think we'd be using only a tiny subset.

requirements-dev.txt Outdated Show resolved Hide resolved
@praneethratna
Copy link
Contributor Author

praneethratna commented Jul 31, 2023

These improvements are very cool!

Since you're adding sympy as a dependency, how "heavy" a library is it? In other words, is it likely to create dependency resolution issues? I see the dependency tree sympy > mpmath > gmpy2, where gmpy2 apparently uses a C library ("gmpy2 is an optimized, C-coded Python extension module ...").

I feel SymPy is lightweight compared to other mathematical libraries found online. Also SymPy has a hard dependency only on mpmath and I don't think it would create dependency resolution issues.

Are there lighter-weight alternatives to sympy that would meet our needs? sympy has tons of capabilities, and here I think we'd be using only a tiny subset.

I'm using SymPy for parsing numbers from strings using it's inbuilt parser. I'm not sure of any other libraries which can be used to do this, but there might be some libraries to do this task exclusively?

@emiliom
Copy link
Collaborator

emiliom commented Jul 31, 2023

Also SymPy has a hard dependency only on mpmath but I don't it would create dependency resolution issues.

Right! Scratch my comment about dependency resolution per se. I did see that the dependency was limited to mpmath > gmpy2, and we don't currently use those libraries. My apologies.

@praneethratna
Copy link
Contributor Author

praneethratna commented Aug 1, 2023

Right! Scratch my comment about dependency resolution per se. I did see that the dependency was limited to mpmath > gmpy2, and we don't currently use those libraries. My apologies.

No worries! I understand the clarification now. Since we don't use mpmath or gmpy2 libraries, these dependencies doesn't affect us. But we need to look at alternatives for SymPy if that makes the package slow.

@leewujung
Copy link
Member

Since the equations are pretty predictable, one option would be to just do re all the way?

@praneethratna
Copy link
Contributor Author

praneethratna commented Aug 1, 2023

Since the equations are pretty predictable, one option would be to just do re all the way?

@leewujung Do you mean we use a single regex expression to check for the validity of the whole frequency differencing criteria? Wouldn't that make the code a bit messy?

@leewujung leewujung added this to the 0.8.1 milestone Aug 2, 2023
@leewujung
Copy link
Member

@leewujung Do you mean we use a single regex expression to check for the validity of the whole frequency differencing criteria? Wouldn't that make the code a bit messy?

I think it would be ok? For the freqAB case, the Hz/kHz/MHz/GHz cases are fixed, so the number in front of that could be extracted. For the chanAB case, it is the string between the double quotes. The binary comparison operator >, <, >=, <=, == are the allowed cases. And then we need to extracted the number before dB. This doesn't seem overly complicated?

@praneethratna
Copy link
Contributor Author

I think it would be ok? For the freqAB case, the Hz/kHz/MHz/GHz cases are fixed, so the number in front of that could be extracted. For the chanAB case, it is the string between the double quotes. The binary comparison operator >, <, >=, <=, == are the allowed cases. And then we need to extracted the number before dB. This doesn't seem overly complicated?

@leewujung I think it shouldn't be overly complicated, it's just that i split everything into individual parts for better code debugging. I'll make these changes by today.

@praneethratna
Copy link
Contributor Author

praneethratna commented Aug 3, 2023

@leewujung Even if use a single re expression to check for validity, we again need to use separate re expressions to extract freq/chan values, operators and diff or am i understanding your point in a wrong way?

@leewujung
Copy link
Member

Hi @praneethratna : sorry for my super late response. What I am thinking in terms of using re to grab out the pattern is like the following:

  • use patterns like "(?P<freqA>\d+)(?P<unitA>\w?)Hz" to match freqA and freqB, for example:
    a = re.match("(?P<freq>\d+)\s*(?P<unit>\w?)Hz", "120kHz") # also matches arbitrary space between "120" and "kHz"
    a["freq"]  # get the number associated with the frequency
    a["unit"]  # get the unit associated with the frequency
    then you can use a["unit"] to determine the right multiplication factor to get to the numbers in frequency_nominal, which is with units Hz (i.e. 120kHz is 120000 in frequency_nominal).
  • you can use the pattern above to build up the whole pattern, something like the following:
    freqA_pattern = "(?P<freqA>\d+)\s*(?P<unitA>\w?)Hz"
    freqA_pattern = "(?P<freqB>\d+)\s*(?P<unitB>\w?)Hz"
    db_pattern = "(?P<db>\d+)\s*dB"
    cmp_pattern = "\s*(?P<cmp>\S+?)\s*"
    DIFF_MATCHER = re.compile(freqA_pattern + "\s*-\s*" + freqB_pattern + cmp_pattern + db_pattern)
    and then use outputs of the matched results with some error checking and handling (like converting str to float, k to 1000 etc) to combine with your other parts of the code.

What do you think?

@praneethratna
Copy link
Contributor Author

praneethratna commented Aug 25, 2023

Hi @praneethratna : sorry for my super late response. What I am thinking in terms of using re to grab out the pattern is like the following:

  • use patterns like "(?P<freqA>\d+)(?P<unitA>\w?)Hz" to match freqA and freqB, for example:

    a = re.match("(?P<freq>\d+)\s*(?P<unit>\w?)Hz", "120kHz") # also matches arbitrary space between "120" and "kHz"
    a["freq"]  # get the number associated with the frequency
    a["unit"]  # get the unit associated with the frequency

    then you can use a["unit"] to determine the right multiplication factor to get to the numbers in frequency_nominal, which is with units Hz (i.e. 120kHz is 120000 in frequency_nominal).

  • you can use the pattern above to build up the whole pattern, something like the following:

    freqA_pattern = "(?P<freqA>\d+)\s*(?P<unitA>\w?)Hz"
    freqA_pattern = "(?P<freqB>\d+)\s*(?P<unitB>\w?)Hz"
    db_pattern = "(?P<db>\d+)\s*dB"
    cmp_pattern = "\s*(?P<cmp>\S+?)\s*"
    DIFF_MATCHER = re.compile(freqA_pattern + "\s*-\s*" + freqB_pattern + cmp_pattern + db_pattern)

    and then use outputs of the matched results with some error checking and handling (like converting str to float, k to 1000 etc) to combine with your other parts of the code.

What do you think?

@leewujung No worries! I understand it now, these suggestions helped me to refactor the code in a much simpler and clean way.

@leewujung
Copy link
Member

@praneethratna : there are still places where you use sympy in the code. Could you remove them completely and use re instead? Thanks!

@praneethratna
Copy link
Contributor Author

praneethratna commented Aug 28, 2023

@leewujung I have removed the usage of sympy completely and also squashed all the commits into one.

echopype/mask/api.py Outdated Show resolved Hide resolved
Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@praneethratna : Thanks for this great PR!! It is awesome to have this intuitive way of putting in the freq diff equations.

I have some small suggestions for the organization and the tests:

  • Since in Add transient and impulse noise masks #1142 there are more functions added to the mask subpackage, how about in this PR moving the frequency-differencing-related functions (_check_source_Sv_freq_diff and check_freq_diff_and_parse_arguments) to a separate module, say echopype/mask/freq_diff.py and keep the main API function frequency_differencing in echopype/mask/api.py?
  • for the tests, currently test_frequency_differencing is more of an integrated test, and in it the frequency units are all Hz so the unit matching was not tested. How about adding another set of tests to test for the function you wrote check_freq_diff_and_parse_arguments? This way you can test parsed element values and xfail the ones that should fail, etc.
  • rename check_freq_diff_and_parse_arguments to _parse_freq_diff_eq to make it obvious that it is a private function, and also making it shorter
  • rename _check_source_Sv_freq_diff to _check_freq_diff_source_Sv: this more of a style thing, but I feel that having the freq_diff part first make it easier to know where this function is used.

@leewujung
Copy link
Member

Also, I caught a bug / undesired behavior in the current _check_source_Sv_freq_diff function. In the current implementation, the values of frequency_nominal is not allowed to be duplicated:

if len(set(source_Sv.frequency_nominal.values)) < source_Sv.frequency_nominal.size:
raise ValueError(
"The provided source_Sv contains repeated frequency_nominal "
"values, this is not allowed!"
)

However, it is possible that there are 2 channels (with different channel strings) that have the same frequency_nominal values (this is actually why we decided to use channel as the coordinate, instead of frequency_nominal, since coordinates cannot have duplicated values), and users may want to diff these 2 channels using the chanABEq. I think that potentially duplicated frequency_nominal values should only be checked if people use freqABEq to put in their equation.

Looking at the code again, what do you think if we take the following approach to change this function?

  • keep the first checks # check that channel and frequency nominal are in source_Sv intact
  • for the below three checks, make it such that only freqAB checks are executed when people use freqABEq, and only chanAB checks are executed when people use chanABEq
    # make sure that the channel and frequency_nominal values are not repeated in source_Sv
    # check that the elements of freqAB are in frequency_nominal
    # check that the elements of chanAB are in channel
    
    ie we only check for frequency_nominal to not be duplicated and elements of freqAB are in frequency_nominal when people use freqABEq, and only check for channel to not be duplicated and elements of chanAB are in channel when people use chanABEq.

Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @praneethratna for your speedy changes! I added some suggestions, and I think once they are addressed we can merge this PR. :)

@leewujung leewujung added the enhancement This makes echopype better label Aug 31, 2023
Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@praneethratna : Thanks a lot for your contribution. We can merge this PR now once the tests run through!

@leewujung
Copy link
Member

Ugh sorry just remember one thing. In here

def _freqdiff_applymask(test_ds):

frequency_differencing is called (this is something the CI does not catch, unless one uses [all tests ci] (see here). Could you quickly change it? Then I will merge. Thank you!

@codecov-commenter
Copy link

Codecov Report

Merging #1106 (d8b8b26) into dev (6fc5a1d) will decrease coverage by 9.44%.
Report is 53 commits behind head on dev.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##              dev    #1106      +/-   ##
==========================================
- Coverage   78.12%   68.69%   -9.44%     
==========================================
  Files          65        8      -57     
  Lines        6227      658    -5569     
==========================================
- Hits         4865      452    -4413     
+ Misses       1362      206    -1156     
Flag Coverage Δ
unittests 68.69% <93.75%> (-9.44%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
echopype/mask/freq_diff.py 93.44% <93.44%> (ø)
echopype/mask/api.py 91.66% <100.00%> (+4.16%) ⬆️

... and 63 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@leewujung leewujung merged commit 5abca87 into OSOceanAcoustics:dev Aug 31, 2023
4 checks passed
@praneethratna praneethratna deleted the issue#967-mask branch August 31, 2023 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement This makes echopype better
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants