Skip to content

Conversation

ameynert
Copy link
Contributor

If this pull request addresses an open issue on the repository, please add 'Closes #NN' below, where NN is the issue number.

Closes #230
Closes #316
Closes #369

Please briefly summarise the changes made in the pull request, and the reason(s) for making these changes.

Changed the grep examples in Episode 4 for redirection to use the SRA metadata file instead of FASTQ. This gets around the issue of grep on different platforms behaving differently with the FASTQ examples as in #316, removes the issue of appending empty output as in #369 , and is a more realistic use of the tool grep and redirection as addressed in #230.

If any relevant discussions have taken place elsewhere, please provide links to these.

See the issues linked.

@ameynert ameynert self-assigned this Sep 19, 2025
Copy link

github-actions bot commented Sep 19, 2025

Thank you!

Thank you for your pull request 😃

🤖 This automated message can help you check the rendered files in your submission for clarity. If you have any questions, please feel free to open an issue in {sandpaper}.

If you have files that automatically render output (e.g. R Markdown), then you should check for the following:

  • 🎯 correct output
  • 🖼️ correct figures
  • ❓ new warnings
  • ‼️ new errors

Rendered Changes

🔍 Inspect the changes: https://github.com/datacarpentry/shell-genomics/compare/md-outputs..md-outputs-PR-371

The following changes were observed in the rendered markdown documents:

 04-redirection.md     | 200 +++++++++++++-------------------------------------
 05-writing-scripts.md |   2 +
 md5sum.txt            |   4 +-
 3 files changed, 55 insertions(+), 151 deletions(-)
What does this mean?

If you have source files that require output and figures to be generated (e.g. R Markdown), then it is important to make sure the generated figures and output are reproducible.

This output provides a way for you to inspect the output in a diff-friendly manner so that it's easy to see the changes that occur due to new software versions or randomisation.

⏱️ Updated at 2025-09-19 18:58:28 +0000

also be a FASTQ file? The answer is, yes - it will be a FASTQ file and it would make sense to
name it with a `.fastq` extension. However, using a `.fastq` extension will lead us to problems
when we move to using wildcards later in this episode. We'll point out where this becomes
important. For now, it's good that you're thinking about file extensions!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is consolidated into the new callout on wildcards below.

```

```output
249
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducting shell arithmetic is too advanced for this lesson.

github-actions bot pushed a commit that referenced this pull request Sep 19, 2025
four to get the number of sequences that match our search pattern. Since 802 / 4 = 200.5 and we
are expecting an integer number of records, there is something added or missing in `bad_reads.txt`.
If we explore `bad_reads.txt` using `less`, we might be able to notice what is causing the uneven
number of lines. Luckily, this issue happens by the end of the file so we can also spot it with `tail`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole piece of the exercise is extraneous information for beginners.

```

The `-v` option in the second `grep` search stands for `--invert-match` meaning `grep` will now only display the
lines which do not match the searched pattern, in this case `'^--'`. The caret (`^`) is an **anchoring**
Copy link
Contributor Author

@ameynert ameynert Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This intro to the caret usage and the single quote enclosing was moved to the relevant place in episode 5 below.

@ameynert
Copy link
Contributor Author

@ggrimes Could you have the curriculum advisory committee take a look at this please? It's a fairly large change and I would like to get as many eyes as possible on it before we look to merge.

@ggrimes
Copy link
Contributor

ggrimes commented Sep 26, 2025

@tobyhodges how do i get the curriculum advisory committee to convene this process this

@tobyhodges
Copy link
Member

@datacarpentry/curriculum-advisors-genomics please could one or more of you review this/discuss it?

@ggrimes
Copy link
Contributor

ggrimes commented Oct 3, 2025

@datacarpentry/curriculum-advisors-genomics please could one or more of you review this/discuss it?

I am happy to review it but on't seem to have the privileges

@ameynert
Copy link
Contributor Author

ameynert commented Oct 3, 2025

@datacarpentry/curriculum-advisors-genomics please could one or more of you review this/discuss it?

I am happy to review it but on't seem to have the privileges

I'll ask on the maintainer's Slack.

Copy link
Contributor

@sstevens2 sstevens2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excited to see this PR! It seems to be a good start for changing the example in the redirection section. I left some comments and small language changes. Would be nice if we didn't have to switch back and forth between the two data folders, if possible.

I'm not sure it fully addresses the issues in that the grep example in the writing scripts is still using the bad reads example, which is somewhat artificial as an example and in contrast with the fastqc and trimmomatic best practice taught in the data wrangling lesson. Working with the metadata there is perhaps not a better example though. I'd have to think more about what might be a good example to swap out that example. Maybe something like having them pull out all their read names into a file with the name of the sample in it or as part of the filename?

Comment on lines +182 to +183
Let's try out this command to look for particular samples in the SRA metadata file and copy the
output to another file called `metadata.txt`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Let's try out this command to look for particular samples in the SRA metadata file and copy the
output to another file called `metadata.txt`.
Let's try out this command to pull out particular samples from the SRA metadata file and save the metadata for those samples to another file called `metadata.txt`.

Comment on lines +287 to +288
$ grep PAIRED SraRunTable.txt > metadata.txt
$ grep SINGLE SraRunTable.txt >> metadata.txt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't quite find the SraRunTable.txt to check but doesn't this just remake the input again since they are all either paired or single? (except the header maybe?)
An alternative might be to grep the header and a sample name or header and a couple samples and >> them to the same output file.
In this example, I would probably rename the output to keep them as separate files, one for the PAIRED and one for SINGLE, especially since we already have them in the input file.

::::::::::::::::::::::::::::::::::::::::: callout

The output of our second call to `wc` shows that we have not overwritten our original data.
## Using `grep` with wildcards
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switching back has me wondering, should this whole episode be on the metadata to avoid switching?
If you want to keep it as it is above, switching to metadata in the redirection section, this section could be removed so you don't have to switch back and forth. Not sure the example with multiple files is fully needed here.

The fifth and six lines in the output display "--" which is the default action for `grep` to separate groups of
lines matching the pattern, and indicate groups of lines which did not match the pattern so are not displayed.
To fix this issue, we can redirect the output of grep to a second instance of `grep` as follows.
The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
Sometimes we want to find all the lines that don't match a specific pattern, rather than trying to construct a complex pattern with many options..
The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
lines which do not match the searched pattern.
```bash
$ grep -v SINGLE SraRunTable.txt | less

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favour of using the SraRunTable.txt as the primary example for the grep exercise. This feels more like what you would would do in a reality. I would prefer not using the fastq files as examples for grep at all as it's a bit of tricky format for newbies.

@naupaka
Copy link
Member

naupaka commented Oct 7, 2025

Thank you for the PR @ameynert! I'll take a look through this ASAP

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

5 participants