input sample size #29

Open
4 tasks
kellijohnson-NOAA opened this issue Oct 21, 2020 · 10 comments
Labels
priority: low The lowest level priority, i.e., not urgent. status: long term A long-term problem that could potentially be a change needed in PacFIN. status: question Questions about the issue need answered topic: code Related to R code within this package type: enhancement
@kellijohnson-NOAA
Contributor

kellijohnson-NOAA commented Oct 21, 2020

Input sample size needs to be given more thought. The PEP team meeting on 2020-10-21 discussed best practices going forward. The methods may not be consistent across species because of life-history characteristics.

  • document how input sample size is created
  • make a single standard method that is used for input to SS3
  • allow for other methods in extra files but don't bloat the main output file
  • ensure consistency with {nwfscSurvey}, to the extent possible
@kellijohnson-NOAA kellijohnson-NOAA added type: enhancement status: question Questions about the issue need answered topic: code Related to R code within this package labels May 4, 2022
@kellijohnson-NOAA kellijohnson-NOAA added this to the year_2022 milestone May 4, 2022
@kellijohnson-NOAA kellijohnson-NOAA modified the milestones: year_2022, year_2023 May 8, 2023
@kellijohnson-NOAA kellijohnson-NOAA added priority: low The lowest level priority, i.e., not urgent. status: long term A long-term problem that could potentially be a change needed in PacFIN. labels May 9, 2023
@kellijohnson-NOAA
Contributor Author

This has been a long-standing issue, but I am thinking about it now because I am currently working on getComps(). The code sums the number of tows by sex. For example, suppose 5 tows were performed, yielding 20 females from all 5 tows, 30 males from 3 tows (2 tows had no males), and 3 unsexed fish from 1 tow. The number of tows that goes into the unsexed composition data would then be only 1, and if males and females are split into separate rows, the number of tows would be 5 for females and 3 for males. I personally think that this method is overcomplicated and incorrect because 5 tows were performed that led to those composition data. I think many of you have thought about this more than I have, and I am seeking opinions. Should the code be maintained as is, or should the number of tows be the total number of tows performed regardless of what sex was found 🤷?
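The two counting rules in the example above can be compared with a minimal sketch. It is written in Python purely for illustration and is not the getComps() code; the tow identifiers, the way the fish are spread across tows, and both function names are hypothetical.

```python
# Hypothetical tow-level counts matching the example above:
# 5 tows, 20 females spread over all 5, 30 males in 3, 3 unsexed fish in 1.
tows = {
    "tow1": {"F": 4, "M": 10, "U": 3},
    "tow2": {"F": 4, "M": 10, "U": 0},
    "tow3": {"F": 4, "M": 10, "U": 0},
    "tow4": {"F": 4, "M": 0, "U": 0},
    "tow5": {"F": 4, "M": 0, "U": 0},
}


def tows_by_sex(tows):
    """Count, for each sex, the tows that caught at least one fish of that sex."""
    counts = {}
    for catch in tows.values():
        for sex, n_fish in catch.items():
            if n_fish > 0:
                counts[sex] = counts.get(sex, 0) + 1
    return counts


def tows_ignoring_sex(tows):
    """Count every tow that caught at least one fish, regardless of sex."""
    return sum(1 for catch in tows.values() if sum(catch.values()) > 0)


per_sex = tows_by_sex(tows)
total = tows_ignoring_sex(tows)
print(per_sex)  # {'F': 5, 'M': 3, 'U': 1}
print(total)    # 5
```

The first rule gives three different tow counts for the same sampling effort, while the second gives 5 for every sex.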

@chantelwetzel-noaa
Contributor

I think it should be based on the number of tows performed regardless of sex. This was the approach that I was trying to implement with the "both" or "b" columns in previous functions, but I was focusing on cases where you were creating sex = 3 composition data. In hindsight, this should have extended to the unsexed fish as well. This should greatly simplify the code and improve the input sample size calculation across sexed and unsexed fish.

I think many of these issues arose from how the original code was structured, which output composition data for all Stock Synthesis specifications (sex = 0, sex = 1, sex = 2, sex = 3), and modifications were made trying to preserve the existing functionality. That was a mistake IMO. I think the situation where someone would want only female or only male compositions is so rare that we should only be outputting sex = 0 and sex = 3 composition data. If someone does want single-sex composition data, they can derive it from the sex = 3 output themselves.

@iantaylor-NOAA
Contributor

Thanks @kellijohnson-NOAA for raising this issue. Ultimately these input sample sizes are all adjusted by a tuning algorithm anyway, which I'm guessing will have a larger impact than this choice, but it's still good to have it logical and keep the code simple.

I also like the @chantelwetzel-noaa proposal to remove the sex = 1 or 2 output and just report females and males in vectors with sex = 3 and unsexed as sex = 0 (my ideal would be all in one table that users could filter or modify as they wish). I also recognize that there was lots of debate about this at the black rockfish review.

But in this example, wouldn't the sample sizes still add up incorrectly, with 6 tows total? If the user drops unsexed fish from the output the issue goes away. Also for most of our species the tows with unsexed fish are much fewer than those with sexed fish so the impact is small.

sex Nsamp
3 5
1 1

@kellijohnson-NOAA
Contributor Author

Thanks for the comments. In your example @iantaylor-NOAA I am proposing

sex Nsamp
3 5
1 5
for that year, gear, season. So sex is dropped from the stratification when the number of tows is counted. I am not sure if this is correct for the difference between sexed and unsexed samples, but as pointed out above, it is for sure correct for males versus females: when using sex type 1, 2, or 3, the number of tows should ignore sex. So, now we just need to decide if we should ignore sex when counting the number of tows for sexed versus unsexed fish. What do those unsexed fish represent when, more than likely, there were also sexed fish in some of those same tows?

@chantelwetzel-noaa
Contributor

@kellijohnson-NOAA got her response in quicker than I could. I think sex should be ignored when counting the number of tows. I think if we calculate sexed vs. unsexed separately, we end up calculating a higher input sample size for unsexed vs. sexed fish. This is where it really matters, since all of our data weighting methods apply a simple multiplier by fleet and composition type, making it really important that you have not inadvertently over-weighted one of the sex inputs relative to the other.

@iantaylor-NOAA
Contributor

I'm inspired to do some model explorations on this one. I'll try to do some of that this morning but for now I think whatever is easiest for @kellijohnson-NOAA to do in getComps() is fine and we can modify in the future if we want.

I think the impact of the unsexed fish on the model should be similar to what it would have been if the sex were known for those fish. So if the fraction that's unsexed is small it could be problematic to have the same input sample size for those observations as for the sexed fish.

Here's my updated thinking on how I would treat these things in a model (from NOAA-only doc: https://docs.google.com/presentation/d/1RnfNzP6Nlyp_b4W2yZjbmBrvm5jUe0L8OJxO633dH1k/edit#slide=id.g3181828d758_0_2)
pfmc-assessments next generation flowcharts

@iantaylor-NOAA
Contributor

I did a simple experiment with the simple_small model in r4ss where I removed the fishery age comps (to avoid conflict between unsexed lengths and sexed ages), then made the small length bins unsexed as a separate vector with 3 options for sample size: same as large fish (N = 50), removed from the model (essentially N = 0), or with reduced sample size proportional to the actual number of small fish (worked out to be N = 1.75 based on the observed ratio associated with my arbitrary selection of the first 6 length bins as unsexed).

Keeping Nsamp equal for both vectors created a big change and the recdevs jumped around a lot because the model thought there was now LOTS of information about the small fish (N = 50 but only a few bins with non-zero samples). Removing the small fish entirely changed things a bit, because selectivity shifted as a result of missing information on the presence of small fish in the sample. Using a reduced input sample size produced results very similar to the "all fish sexed" model.

This test is limited in lots of ways, including the fact that the original data were random samples from the multinomial distribution and not from the complex fishery sampling process with autocorrelation among fish within each haul. However, I think that the results make sense and could be applied to our input sample size calculations for PacFIN data. That is, the focus on the number of hauls as a basis for the input sample size makes sense when you're comparing among years. But if you're looking at sexed vs unsexed fish for the same year and fleet, I think the actual number of sampled fish matters more, so the input sample size for the unsexed fish could be based on the number of trips with any fish sampled multiplied by the fraction of sampled fish that were unsexed. If the PacFIN.Utilities code reports Nhauls and Nfish for each vector, the user could do their own calculations along those lines. Or the function could just do it automatically.
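The scaling suggested above can be written as a one-line calculation. This is a minimal sketch in Python for illustration only, not code from PacFIN.Utilities; the function name and argument names are hypothetical, and the inputs reuse the 5-tow example from earlier in the thread (3 unsexed fish out of 53 sampled).

```python
def unsexed_input_n(n_tows, n_fish_unsexed, n_fish_total):
    """Scale the tow count by the fraction of sampled fish that were unsexed.

    Hypothetical helper: n_tows is the number of tows (or trips) with any
    fish sampled in the stratum, regardless of sex.
    """
    if n_fish_total <= 0:
        raise ValueError("n_fish_total must be positive")
    return n_tows * n_fish_unsexed / n_fish_total


# 5 tows, 20 females + 30 males + 3 unsexed fish = 53 fish in total.
n_unsexed = unsexed_input_n(n_tows=5, n_fish_unsexed=3, n_fish_total=53)
print(round(n_unsexed, 3))  # 0.283
```

By the same arithmetic, the reduced N = 1.75 in the experiment, against N = 50 for the large fish, corresponds to an unsexed fraction of 1.75 / 50 = 0.035.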

@kellijohnson-NOAA and @chantelwetzel-noaa let me know if any of this makes sense and/or you want further explorations.

Code to run the experiment is in testing_impact_of_small_unsexed_fish.R.txt
Comparisons are in the figure below.

[Figure: compare3_Bratio]

@kellijohnson-NOAA
Contributor Author

I do not think that I want to substitute one complicated method for another as the default, hidden deep within the code. So, I am resorting to returning the number of tows across all sexed and unsexed fish and then the number of fish per sex (i.e., n_fish_U, n_fish_F, and n_fish_M). The number of fish can then easily be added together to create the number of sexed fish. This is all returned from getComps(), which is passed to writeComps(), so if a user wants to create a different input sample size based on the number of tows and the number of fish, they can. It is above my pay grade to decide if this is a good idea or not 😉. This change to the code removed 3 helper functions and greatly reduced the number of lines of code needed. Tomorrow, I will work on facilitating the output of getComps() in writeComps() to return just the two types of composition data, i.e., sex == 0 and sex == 3, where users could easily make sex == 1 and sex == 2 from the sex == 3 comps as @chantelwetzel-noaa mentioned above. This output will be directly importable into an SS3 dat file rather than needing massaging by the user prior to using the composition data. This branch will be pushed and merged by Friday.
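A rough sketch of the kind of per-stratum summary described above, in Python for illustration only. It is not the getComps() implementation: the fish records are made up, the stratum is simplified to year alone (the thread mentions year, gear, and season), and `summarize` and `n_tows` are hypothetical names. Only n_fish_U, n_fish_F, and n_fish_M come from the comment.

```python
from collections import defaultdict

# Made-up fish-level records: (year, tow identifier, sex).
fish = [
    (2019, "a", "F"), (2019, "a", "M"), (2019, "a", "U"),
    (2019, "b", "F"), (2019, "b", "F"),
    (2020, "c", "M"),
]


def summarize(fish):
    """Per year: tows counted ignoring sex, fish counted by sex."""
    tow_ids = defaultdict(set)
    n_fish = defaultdict(lambda: {"F": 0, "M": 0, "U": 0})
    for year, tow_id, sex in fish:
        tow_ids[year].add(tow_id)  # a tow counts once, whatever sexes it held
        n_fish[year][sex] += 1
    return {
        year: {
            "n_tows": len(tow_ids[year]),
            "n_fish_F": n_fish[year]["F"],
            "n_fish_M": n_fish[year]["M"],
            "n_fish_U": n_fish[year]["U"],
        }
        for year in sorted(tow_ids)
    }


summary = summarize(fish)
# Number of sexed fish is the sum of females and males.
n_sexed_2019 = summary[2019]["n_fish_F"] + summary[2019]["n_fish_M"]
print(summary[2019])
print(n_sexed_2019)
```

With the tow count and the per-sex fish counts side by side, a user can apply whichever input sample size rule suits their model.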

@chantelwetzel-noaa
Contributor

Thank you for testing this out @iantaylor-NOAA. I agree with your decision @kellijohnson-NOAA to keep the options which then allows users to explore which approach works best for their model.

@iantaylor-NOAA
Contributor

Yes, nice work @kellijohnson-NOAA keeping things simple and straightforward. Thank you for your service in getting this stuff done.


3 participants