feat: change grep & redirection exercise to use metadata instead of fastq #371
base: main
Thank you!

Thank you for your pull request 😃

🤖 This automated message can help you check the rendered files in your submission for clarity. If you have any questions, please feel free to open an issue in {sandpaper}.

If you have files that automatically render output (e.g. R Markdown), then you should check for the following:

Rendered Changes

🔍 Inspect the changes: https://github.com/datacarpentry/shell-genomics/compare/md-outputs..md-outputs-PR-371

The following changes were observed in the rendered markdown documents:

What does this mean?

If you have source files that require output and figures to be generated (e.g. R Markdown), then it is important to make sure the generated figures and output are reproducible. This output provides a way for you to inspect the output in a diff-friendly manner so that it's easy to see the changes that occur due to new software versions or randomisation.

⏱️ Updated at 2025-09-19 18:58:28 +0000
also be a FASTQ file? The answer is, yes - it will be a FASTQ file and it would make sense to
name it with a `.fastq` extension. However, using a `.fastq` extension will lead us to problems
when we move to using wildcards later in this episode. We'll point out where this becomes
important. For now, it's good that you're thinking about file extensions!
This is consolidated into the new callout on wildcards below.
```output
249
```
Introducing shell arithmetic is too advanced for this lesson.
four to get the number of sequences that match our search pattern. Since 802 / 4 = 200.5 and we
are expecting an integer number of records, there is something added or missing in `bad_reads.txt`.
If we explore `bad_reads.txt` using `less`, we might be able to notice what is causing the uneven
number of lines. Luckily, this issue happens by the end of the file so we can also spot it with `tail`.
This whole piece of the exercise is extraneous information for beginners.
The `-v` option in the second `grep` search stands for `--invert-match` meaning `grep` will now only display the
lines which do not match the searched pattern, in this case `'^--'`. The caret (`^`) is an **anchoring**
This intro to the caret usage and the single quote enclosing was moved to the relevant place in episode 5 below.
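For readers following this thread, here is a minimal sketch of how the `^` anchor and the single quotes behave when combined with `-v`. The file `demo_context.txt` and its contents are invented for illustration and are not part of the lesson data.

```bash
# Hypothetical stand-in for grep output that contains "--" group separators.
workdir=$(mktemp -d)
cat > "$workdir/demo_context.txt" <<'EOF'
@read1
NNNNNNNNNN
--
@read2
ACGT--NNNN
EOF

# '^--' only matches lines that BEGIN with "--"; the single quotes keep the
# shell from interpreting any character in the pattern.
# -v inverts the match, so the separator line is dropped, while the line
# that merely contains "--" in the middle is kept.
grep -v '^--' "$workdir/demo_context.txt"

# -c prints a count of the selected lines instead of the lines themselves.
kept_count=$(grep -vc '^--' "$workdir/demo_context.txt")
echo "lines kept: $kept_count"
```

Without the anchor, `grep -v -e '--'` would also discard the last line, because it contains `--` in the middle.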
@ggrimes Could you have the curriculum advisory committee take a look at this please? It's a fairly large change and I would like to get as many eyes as possible on it before we look to merge.
@tobyhodges How do I get the curriculum advisory committee to convene for this process?
@datacarpentry/curriculum-advisors-genomics please could one or more of you review this/discuss it?
I am happy to review it but don't seem to have the privileges.
I'll ask on the maintainers' Slack.
Excited to see this PR! It seems to be a good start for changing the example in the redirection section. I left some comments and small language changes. Would be nice if we didn't have to switch back and forth between the two data folders, if possible.
I'm not sure it fully addresses the issues, in that the `grep` example in the writing scripts episode still uses the bad reads example, which is somewhat artificial and in contrast with the FastQC and Trimmomatic best practice taught in the data wrangling lesson. Working with the metadata there is perhaps not a better example, though. I'd have to think more about what might be a good example to swap in. Maybe something like having them pull out all their read names into a file with the name of the sample in it, or as part of the filename?
Let's try out this command to look for particular samples in the SRA metadata file and copy the
output to another file called `metadata.txt`.
Suggested change:
- Let's try out this command to look for particular samples in the SRA metadata file and copy the
- output to another file called `metadata.txt`.
+ Let's try out this command to pull out particular samples from the SRA metadata file and save the metadata for those samples to another file called `metadata.txt`.
$ grep PAIRED SraRunTable.txt > metadata.txt
$ grep SINGLE SraRunTable.txt >> metadata.txt
I can't quite find the `SraRunTable.txt` to check, but doesn't this just remake the input again, since they are all either paired or single (except the header, maybe)?
An alternative might be to `grep` the header and a sample name, or the header and a couple of samples, and `>>` them to the same output file.
In this example, I would probably rename the outputs to keep them as separate files, one for PAIRED and one for SINGLE, especially since we already have them in the input file.
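A minimal sketch of the two alternatives raised in this comment. The table `demo_table.txt`, its two columns, and the run names are hypothetical stand-ins, since the real `SraRunTable.txt` was not checked here; the output file names are likewise only illustrative.

```bash
# Hypothetical tab-separated table; the real SraRunTable.txt has more columns.
workdir=$(mktemp -d)
printf 'Run\tLibraryLayout\n'  > "$workdir/demo_table.txt"
printf 'SRR001\tPAIRED\n'     >> "$workdir/demo_table.txt"
printf 'SRR002\tSINGLE\n'     >> "$workdir/demo_table.txt"
printf 'SRR003\tPAIRED\n'     >> "$workdir/demo_table.txt"

# Option 1: keep the header, then append one sample of interest.
# ">" creates or overwrites the file; ">>" appends to it.
grep LibraryLayout "$workdir/demo_table.txt" >  "$workdir/selected.txt"
grep SRR002        "$workdir/demo_table.txt" >> "$workdir/selected.txt"

# Option 2: one output file per library layout.
grep PAIRED "$workdir/demo_table.txt" > "$workdir/paired.txt"
grep SINGLE "$workdir/demo_table.txt" > "$workdir/single.txt"

# Report the number of lines in each output file.
wc -l "$workdir/selected.txt" "$workdir/paired.txt" "$workdir/single.txt"
```

In this toy table, matching on the column name `LibraryLayout` picks out only the header; whether that holds for the real file depends on its contents.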
::::::::::::::::::::::::::::::::::::::::: callout

The output of our second call to `wc` shows that we have not overwritten our original data.
## Using `grep` with wildcards
This switching back has me wondering, should this whole episode be on the metadata to avoid switching?
If you want to keep it as it is above, switching to metadata in the redirection section, this section could be removed so you don't have to switch back and forth. Not sure the example with multiple files is fully needed here.
The fifth and six lines in the output display "--" which is the default action for `grep` to separate groups of
lines matching the pattern, and indicate groups of lines which did not match the pattern so are not displayed.
To fix this issue, we can redirect the output of grep to a second instance of `grep` as follows.
The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
Suggested change:
- The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
+ Sometimes we want to find all the lines that don't match a specific pattern, rather than trying to construct a complex pattern with many options.
+ The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
+ lines which do not match the searched pattern.
+ ```bash
+ $ grep -v SINGLE SraRunTable.txt | less
I am in favour of using the `SraRunTable.txt` as the primary example for the `grep` exercise. This feels more like what you would do in reality. I would prefer not using the FASTQ files as examples for `grep` at all, as it's a bit of a tricky format for newbies.
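As a small illustration of the `grep -v` usage discussed in this thread on metadata-style input, the sketch below compares a positive and an inverted match. The table `demo_table.txt` and its contents are made up for this example and only loosely imitate `SraRunTable.txt`.

```bash
# Hypothetical miniature metadata table (not the real SraRunTable.txt).
workdir=$(mktemp -d)
cat > "$workdir/demo_table.txt" <<'EOF'
Run LibraryLayout
SRR001 PAIRED
SRR002 SINGLE
SRR003 PAIRED
EOF

# Positive match: only the lines containing SINGLE.
grep SINGLE "$workdir/demo_table.txt"

# Inverted match: every line that does NOT contain SINGLE.
# The header line is kept as well, because it does not contain the pattern.
not_single=$(grep -v SINGLE "$workdir/demo_table.txt")
echo "$not_single"
```

One thing this makes visible is that an inverted match keeps the header row, whereas `grep PAIRED` on the same table would drop it.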
Thank you for the PR @ameynert! I'll take a look through this ASAP
If this pull request addresses an open issue on the repository, please add 'Closes #NN' below, where NN is the issue number.
Closes #230
Closes #316
Closes #369
Please briefly summarise the changes made in the pull request, and the reason(s) for making these changes.
Changed the `grep` examples in Episode 4 for redirection to use the SRA metadata file instead of FASTQ. This gets around the issue of `grep` on different platforms behaving differently with the FASTQ examples as in #316, removes the issue of appending empty output as in #369, and is a more realistic use of the tool `grep` and redirection as addressed in #230.
If any relevant discussions have taken place elsewhere, please provide links to these.
See the issues linked.