Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve -dump-hashes output #4369

Merged
merged 6 commits into from
Oct 15, 2023
Merged

Improve -dump-hashes output #4369

merged 6 commits into from
Oct 15, 2023

Conversation

bentsherman
Copy link
Member

Close #4367

Re-structures the -dump-hashes output to be on a single line per task, which makes it easier to produce a diff from the log files.

To test it, do an initial run, modify the pipeline, then a resumed run:

./launch.sh -log run_fresh.log run rnaseq-nf -profile docker -dump-hashes

# modify the rnaseq-nf pipeline locally
vim ~/.nextflow/assets/nextflow-io/rnaseq-nf/modules/fastqc/main.nf

./launch.sh -log run_resume.log run rnaseq-nf -profile docker -dump-hashes -resume

Then run the following bash script to filter the logs and produce the diff:

#!/bin/bash

if [[ $# != 2 ]] ; then
    >&2 echo "usage: $0 <log1> <log2>"
    exit 1
fi

get_hashes() {
    cat $1 \
    | grep 'cache hash:' \
    | cut -d ' ' -f 10- \
    | sort \
    | awk '{ print; print ""; }'
}

get_hashes $1 > diff.1.log
get_hashes $2 > diff.2.log

diff diff.1.log diff.2.log
./diff-hashes.sh run_fresh.log run_resume.log

Signed-off-by: Ben Sherman <[email protected]>
@netlify
Copy link

netlify bot commented Oct 4, 2023

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 4c43753
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/652bf01e3c172c0008971af3

@pditommaso
Copy link
Member

To tell the true I find this output much more confusing compared the current one

[sayHello (4)] cache hash: 341e08ddb2f780263ffe665202f0db6c; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edexecutor >  local (4)
[7d/105df3] process > sayHello (1) [100%] 4 of 4 ✔
[sayHello (4)] cache hash: 341e08ddb2f780263ffe665202f0db6c; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edc930becff0 [java.lang.String] x | f5e76d4e64af0c5d859ff08ab3b720b7 [java.lang.String] Hola | 4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ | 16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
[sayHello (2)] cache hash: b37742150cadbd69cb6072d8dffc51c9; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edc930becff0 [java.lang.String] x | 442e002ddd8b0a2b10ed51352f8c0488 [java.lang.String] Ciao | 4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ | 16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
[sayHello (1)] cache hash: eb8e64ef291404fc09bdd89324c9a984; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edc930becff0 [java.lang.String] x | c9273e5a7ac3508ef910437c4bb35a90 [java.lang.String] Bonjour | 4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ | 16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
[sayHello (3)] cache hash: b86b4bc95cebc58c8be5f1c06fa470e8; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edc930becff0 [java.lang.String] x | 0ab6632d52e811e9ef7c044666ac496a [java.lang.String] Hello | 4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ | 16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true

@bentsherman
Copy link
Member Author

Why? It's the same content but on a single line per task. And it's not really meant to be viewed in that form... imagine a pipeline run with 10,000 tasks, you wouldn't be viewing those logs.

Instead, you should use the bash script to extract and sort the logs, then use a diff viewer to highlight the changes.

@pditommaso
Copy link
Member

because the current format allows to diff two logs and find where it diverges. With the new one is quite everything in one row

@bentsherman
Copy link
Member Author

In other words, there's no point in making the output prettier, because the tasks will be logged in arbitrary order anyway. You have to take some extra steps to get it into a usable format, which primarily means one line per task.

@pditommaso
Copy link
Member

It can be added as an optional format, but it should be human readable first

@bentsherman
Copy link
Member Author

It's still human readable though

@bentsherman
Copy link
Member Author

because the current format allows to diff two logs and find where it diverges. With the new one is quite everything in one row

but you can't really diff the logs because the tasks are printed in arbitrary order. you have to re-order the tasks manually.

I don't see the issue with putting everything on a single line, just use line wrap in your editor

@bentsherman
Copy link
Member Author

Sorry about the flurry of messages, got out of sync with you. Here is a screenshot of a diff I produced with the bash script, using a diff viewer in VS code with line wrap:

image

@pditommaso
Copy link
Member

Most of the time the order is maintained (and using process.fair should be consistent). Above all having one hash per liner at least, It can be figure out quickly where it start to diverge.

Screenshot 2023-10-06 at 10 38 50

If the need is a format for programmatic processing, something more structured should be used (json?), instead of stacking hacks over hacks

@bentsherman
Copy link
Member Author

I don't see how your screenshot illustrates your point... it is mostly noise 😄

Fair execution would improve somewhat, but for large pipelines you could still have different processes executing in arbitrary order. So the current format is still very difficult to use

But I agree that I would rather not layer hacks upon hacks. That is why in the original issue I suggested that we find a way for Nextflow to automate the whole thing and produce the final diff:

  • Make sure the task cache has all the components to re-create the hash
  • Add a new command to query the task cache for two runs with the same session id and compare the tasks for each run:
    nextflow diff kickass_pare nostalgic_murdock
    and produces some human-readable output, like:
    Task `RNASEQ:FASTQC (1)` differs between the two runs:
    
      script:
      """
    - fastqc.sh "$sample_id" "$reads"
    + fastqc.sh "$sample_id" "$reads" -t 2
      """
    

But that solution is more involved, and restructuring the hash dump to one line per task is a simple fix that greatly improves the overall debugging experience, even if some manual effort is still required. So here we are 😄

If it will make you happy, I will change the one-line format to json and make it opt-in with -dump-hashes json.

@bentsherman
Copy link
Member Author

Okay it's ready

@pditommaso pditommaso merged commit 5bdaac9 into master Oct 15, 2023
4 checks passed
@pditommaso pditommaso deleted the 4367-improve-dump-hashes branch October 15, 2023 14:19
abhi18av pushed a commit to abhi18av/nextflow that referenced this pull request Oct 28, 2023

Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve cache debugging with -dump-hashes
2 participants