Improve -dump-hashes output adding json format (nextflow-io#4369)

Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
2 people authored and abhi18av committed Oct 28, 2023
1 parent 722a674 commit 4084b0a
Showing 7 changed files with 56 additions and 8 deletions.
27 changes: 27 additions & 0 deletions docs/cache-and-resume.md
@@ -186,6 +186,8 @@ nextflow run rnaseq-nf -resume 4dc656d2-c410-44c8-bc32-7dd0ea87bebf

You can use the {ref}`cli-log` command to view all previous runs as well as the task executions for each run.

(cache-compare-hashes)=

### Comparing the hashes of two runs

One way to debug a resumed run is to compare the task hashes of each run using the `-dump-hashes` option.
@@ -196,3 +198,28 @@ One way to debug a resumed run is to compare the task hashes of each run using the `-dump-hashes` option.
4. Compare the runs with a diff viewer

While some manual effort is required, the final diff can often reveal the exact change that caused a task to be re-executed.

:::{versionadded} 23.10.0
:::

When using `-dump-hashes json`, the task hashes can be more easily extracted into a diff. Here is an example Bash script to perform two runs and produce a diff:

```bash
nextflow -log run_1.log run $pipeline -dump-hashes json
nextflow -log run_2.log run $pipeline -dump-hashes json -resume

get_hashes() {
    cat $1 \
    | grep 'cache hash:' \
    | cut -d ' ' -f 10- \
    | sort \
    | awk '{ print; print ""; }'
}

get_hashes run_1.log > run_1.tasks.log
get_hashes run_2.log > run_2.tasks.log

diff run_1.tasks.log run_2.tasks.log
```

You can then view the `diff` output or use a graphical diff viewer to compare `run_1.tasks.log` and `run_2.tasks.log`.
5 changes: 4 additions & 1 deletion docs/cli.md
@@ -1154,7 +1154,10 @@ The `run` command is used to execute a local pipeline script or remote pipeline
: Dump channels for debugging purpose.

`-dump-hashes`
-: Dump task hash keys for debugging purpose.
+: Dump task hash keys for debugging purposes.
+: :::{versionadded} 23.10.0
+  You can use `-dump-hashes json` to dump the task hash keys as JSON for easier post-processing. See the {ref}`caching and resuming tips <cache-compare-hashes>` for more details.
+  :::

`-e.<key>=<value>`
: Add the specified variable to execution environment.
4 changes: 2 additions & 2 deletions modules/nextflow/src/main/groovy/nextflow/Session.groovy
@@ -247,11 +247,11 @@ class Session implements ISession {

boolean getStatsEnabled() { statsEnabled }

-private boolean dumpHashes
+private String dumpHashes

private List<String> dumpChannels

-boolean getDumpHashes() { dumpHashes }
+String getDumpHashes() { dumpHashes }

List<String> getDumpChannels() { dumpChannels }

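The field type changes from `boolean` to `String` so that it can carry the requested output format (`default` or `json`) instead of a plain on/off switch. Existing truthiness checks such as `if( session.dumpHashes )` keep working because of Groovy truth: `null` is falsy and any non-empty format string is truthy. A minimal standalone Groovy sketch (not part of this commit) illustrating that behavior:

```groovy
// Standalone sketch: why the boolean -> String switch keeps truthiness checks working.
String dumpHashes = null
assert !dumpHashes             // option not given: dumping stays disabled

dumpHashes = 'default'         // bare -dump-hashes flag (mapped to 'default', see ConfigBuilder below)
assert dumpHashes              // any non-empty format string enables dumping
assert dumpHashes != 'json'    // ...and selects the original plain-text trace

dumpHashes = 'json'
assert dumpHashes == 'json'    // the new JSON trace is selected instead
```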
2 changes: 1 addition & 1 deletion modules/nextflow/src/main/groovy/nextflow/cli/CmdRun.groovy
@@ -223,7 +223,7 @@ class CmdRun extends CmdBase implements HubOptions {
String profile

@Parameter(names=['-dump-hashes'], description = 'Dump task hash keys for debugging purpose')
-boolean dumpHashes
+String dumpHashes

@Parameter(names=['-dump-channels'], description = 'Dump channels for debugging purpose')
String dumpChannels
4 changes: 4 additions & 0 deletions modules/nextflow/src/main/groovy/nextflow/cli/Launcher.groovy
@@ -224,6 +224,10 @@ class Launcher {
    normalized << '%all'
}

+else if( current == '-dump-hashes' && (i==args.size() || args[i].startsWith('-'))) {
+    normalized << '-'
+}
+
else if( current == '-with-cloudcache' && (i==args.size() || args[i].startsWith('-'))) {
    normalized << '-'
}
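The new `-dump-hashes` branch above mirrors the existing `-with-cloudcache` handling: when the option is the last argument or is immediately followed by another option, a `'-'` placeholder is appended so the downstream parser still receives a value. A standalone sketch of that normalization idea, with illustrative names rather than the actual `Launcher` code:

```groovy
// Standalone sketch of the bare-flag normalization; not the actual Launcher implementation.
List<String> normalizeDumpHashes(List<String> args) {
    def normalized = []
    for (int i = 0; i < args.size(); i++) {
        normalized << args[i]
        def next = i + 1 < args.size() ? args[i + 1] : null
        // a bare `-dump-hashes` (no value, or followed by another option) gets a '-' placeholder
        if (args[i] == '-dump-hashes' && (next == null || next.startsWith('-')))
            normalized << '-'
    }
    return normalized
}

assert normalizeDumpHashes(['run', 'main.nf', '-dump-hashes', '-resume']) ==
       ['run', 'main.nf', '-dump-hashes', '-', '-resume']
assert normalizeDumpHashes(['run', 'main.nf', '-dump-hashes', 'json']) ==
       ['run', 'main.nf', '-dump-hashes', 'json']
```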
7 changes: 4 additions & 3 deletions modules/nextflow/src/main/groovy/nextflow/config/ConfigBuilder.groovy
@@ -595,9 +595,10 @@ class ConfigBuilder {
if( config.isSet('resume') )
    config.resume = normalizeResumeId(config.resume as String)

-// -- sets `dumpKeys` option
-if( cmdRun.dumpHashes )
-    config.dumpHashes = cmdRun.dumpHashes
+// -- sets `dumpHashes` option
+if( cmdRun.dumpHashes ) {
+    config.dumpHashes = cmdRun.dumpHashes != '-' ? cmdRun.dumpHashes : 'default'
+}

if( cmdRun.dumpChannels )
    config.dumpChannels = cmdRun.dumpChannels.tokenize(',')
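Here the `'-'` placeholder inserted by the launcher is translated into the `'default'` format, while an explicit value such as `json` is passed through unchanged. A small sketch of that assumed mapping (illustrative only, not the actual `ConfigBuilder` code):

```groovy
// Sketch of the placeholder-to-format mapping applied above (illustrative only).
String resolveDumpHashes(String cliValue) {
    if( !cliValue )
        return null                                   // option not used: no hash dumping
    return cliValue != '-' ? cliValue : 'default'     // bare flag selects the legacy text format
}

assert resolveDumpHashes(null)   == null
assert resolveDumpHashes('-')    == 'default'
assert resolveDumpHashes('json') == 'json'
```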
15 changes: 14 additions & 1 deletion modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy
@@ -32,6 +32,7 @@ import java.util.regex.Pattern

import ch.artecat.grengine.Grengine
import com.google.common.hash.HashCode
+import groovy.json.JsonOutput
import groovy.transform.CompileStatic
import groovy.transform.Memoized
import groovy.transform.PackageScope
@@ -2155,7 +2156,9 @@
    final mode = config.getHashMode()
    final hash = computeHash(keys, mode)
    if( session.dumpHashes ) {
-        traceInputsHashes(task, keys, mode, hash)
+        session.dumpHashes=='json'
+            ? traceInputsHashesJson(task, keys, mode, hash)
+            : traceInputsHashes(task, keys, mode, hash)
    }
    return hash
}
@@ -2191,6 +2194,16 @@
    return result
}

+private void traceInputsHashesJson( TaskRun task, List entries, CacheHelper.HashMode mode, hash ) {
+    final collector = (item) -> [
+        hash: CacheHelper.hasher(item, mode).hash().toString(),
+        type: item?.getClass()?.getName(),
+        value: item?.toString()
+    ]
+    final json = JsonOutput.toJson(entries.collect(collector))
+    log.info "[${safeTaskName(task)}] cache hash: ${hash}; mode: ${mode}; entries: ${JsonOutput.prettyPrint(json)}"
+}
+
private void traceInputsHashes( TaskRun task, List entries, CacheHelper.HashMode mode, hash ) {

    def buffer = new StringBuilder()
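The new `traceInputsHashesJson` method maps every cache-key entry to a small map with `hash`, `type`, and `value` fields, then logs the pretty-printed JSON array alongside the aggregate task hash. The following standalone sketch shows the shape of that JSON; the input values are hypothetical and a simple stand-in replaces Nextflow's `CacheHelper` hasher:

```groovy
import groovy.json.JsonOutput

// Stand-in for CacheHelper.hasher(item, mode).hash().toString(); illustrative only.
def fakeHash = { Object item -> Integer.toHexString(item == null ? 0 : item.hashCode()) }

// Hypothetical cache-key entries for a task (session id, task name, input file, a parameter, ...).
def entries = ['4dc656d2-c410-44c8-bc32-7dd0ea87bebf', 'salmon_index', '/data/transcriptome.fa', 4]

def collector = { item -> [
    hash : fakeHash(item),
    type : item?.getClass()?.getName(),
    value: item?.toString()
] }

// Same [hash, type, value] structure that the `-dump-hashes json` log line carries.
println JsonOutput.prettyPrint(JsonOutput.toJson(entries.collect(collector)))
```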
