version v0.3.0
johnerikhalse committed Feb 3, 2023
1 parent 66cd77c commit ff48ab0
Showing 11 changed files with 79 additions and 51 deletions.
6 changes: 3 additions & 3 deletions docs/content/cmd/warc.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc"
slug: warc
url: /cmd/warc/
@@ -13,8 +13,8 @@ A tool for handling warc files
```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
-h, --help help for warc
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```
31 changes: 20 additions & 11 deletions docs/content/cmd/warc_cat.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc cat"
slug: warc_cat
url: /cmd/warc_cat/
@@ -25,22 +25,31 @@ warc cat -n4 -P file1.warc.gz | feh -
### Options

```
--w, --header show WARC header
--h, --help help for cat
---id stringArray id
--n, --num int print the n'th record. This is applied after records are filtered out by other options (default -1)
--o, --offset int print record at offset bytes (default -1)
--P, --payload show payload
--p, --protocol-header show protocol header
--c, --record-count int The maximum number of records to show. Defaults to show all records except if -o or -n option is set, then default is one.
+-w, --header show WARC header
+-h, --help help for cat
+--id stringArray filter record ID's. For more than one, repeat flag or comma separated list.
+-m, --mime-type strings filter records with given mime-types. For more than one, repeat flag or comma separated list.
+-n, --num int print the n'th record. This is applied after records are filtered out by other options (default -1)
+-o, --offset int print record at offset bytes (default -1)
+-P, --payload show payload
+-p, --protocol-header show protocol header
+-c, --record-count int The maximum number of records to show. Defaults to show all records except if -o or -n option is set, then default is one.
+-t, --record-type strings filter record types. For more than one, repeat flag or comma separated list.
+Legal values: warcinfo,request,response,metadata,revisit,resource,continuation,conversion
+-S, --response-code string filter records with given http response codes. Format is 'from-to' where from is inclusive and to is exclusive.
+Examples:
+'200': only records with 200 response
+'200-300': all records with response code between 200(inclusive) and 300(exclusive)
+'-400': all response codes below 400
+'500-': all response codes from 500 and above
```

### Options inherited from parent commands

```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```
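The `--response-code` help text added to `warc cat` in this file describes a small range syntax: `from-to`, with `from` inclusive and `to` exclusive, and either side optional. The following is a minimal sketch of how such a spec could be parsed and matched. It is not taken from the tool's source; the names `codeRange`, `parseCodeRange`, `contains` and the `maxCode` bound are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// codeRange is a half-open interval [from, to) of HTTP response codes.
// Hypothetical type, not part of the warc tool.
type codeRange struct{ from, to int }

// maxCode is an assumed exclusive upper bound used when the spec has no "to" part.
const maxCode = 1000

// parseCodeRange parses the four forms shown in the help text:
// "200", "200-300", "-400" and "500-".
func parseCodeRange(spec string) (codeRange, error) {
	lo, hi, isRange := strings.Cut(spec, "-")
	r := codeRange{from: 0, to: maxCode}
	var err error
	if lo != "" {
		if r.from, err = strconv.Atoi(lo); err != nil {
			return codeRange{}, fmt.Errorf("invalid lower bound %q: %w", lo, err)
		}
	}
	switch {
	case !isRange && lo == "":
		return codeRange{}, errors.New("empty response code spec")
	case !isRange:
		// A single code such as "200" matches only that code.
		r.to = r.from + 1
	case hi != "":
		if r.to, err = strconv.Atoi(hi); err != nil {
			return codeRange{}, fmt.Errorf("invalid upper bound %q: %w", hi, err)
		}
	}
	return r, nil
}

// contains reports whether code lies in [from, to).
func (r codeRange) contains(code int) bool {
	return code >= r.from && code < r.to
}

func main() {
	for _, spec := range []string{"200", "200-300", "-400", "500-"} {
		r, err := parseCodeRange(spec)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-8s matches 200: %v, 300: %v, 404: %v, 500: %v\n",
			spec, r.contains(200), r.contains(300), r.contains(404), r.contains(500))
	}
}
```

Treating the range as half-open means adjacent specs such as `200-300` and `300-400` never overlap, which matches the inclusive/exclusive wording in the help text.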
6 changes: 3 additions & 3 deletions docs/content/cmd/warc_completion.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc completion"
slug: warc_completion
url: /cmd/warc_completion/
@@ -64,8 +64,8 @@ warc completion [bash|zsh|fish|powershell]

```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```
6 changes: 3 additions & 3 deletions docs/content/cmd/warc_console.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc console"
slug: warc_console
url: /cmd/warc_console/
@@ -23,8 +23,8 @@ warc console <directory> [flags]

```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```
6 changes: 3 additions & 3 deletions docs/content/cmd/warc_convert.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc convert"
slug: warc_convert
url: /cmd/warc_convert/
@@ -18,8 +18,8 @@ Convert web archives to warc files. Use subcommands for the supported formats

```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```
8 changes: 4 additions & 4 deletions docs/content/cmd/warc_convert_arc.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc convert arc"
slug: warc_convert_arc
url: /cmd/warc_convert_arc/
@@ -20,7 +20,7 @@ warc convert arc <files/dirs> [flags]
-c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
-C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
--t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-1-30")
+-t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-2-3")
-S, --file-size int The maximum size for WARC files (default 1073741824)
--flush if true, sync WARC file to disk after writing each record
-h, --help help for arc
@@ -50,8 +50,8 @@ warc convert arc <files/dirs> [flags]

```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```
8 changes: 4 additions & 4 deletions docs/content/cmd/warc_convert_nedlib.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc convert nedlib"
slug: warc_convert_nedlib
url: /cmd/warc_convert_nedlib/
@@ -20,7 +20,7 @@ warc convert nedlib <files/dirs> [flags]
-c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
-C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
--t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-1-30")
+-t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-2-3")
-S, --file-size int The maximum size for WARC files (default 1073741824)
--flush if true, sync WARC file to disk after writing each record
-h, --help help for nedlib
@@ -48,8 +48,8 @@ warc convert nedlib <files/dirs> [flags]

```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```
8 changes: 4 additions & 4 deletions docs/content/cmd/warc_convert_warc.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc convert warc"
slug: warc_convert_warc
url: /cmd/warc_convert_warc/
@@ -25,7 +25,7 @@ warc convert warc <files/dirs> [flags]
-c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
-C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
--t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-1-30")
+-t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-2-3")
-S, --file-size int The maximum size for WARC files (default 1073741824)
--flush if true, sync WARC file to disk after writing each record
-h, --help help for warc
@@ -57,8 +57,8 @@ warc convert warc <files/dirs> [flags]

```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```
23 changes: 19 additions & 4 deletions docs/content/cmd/warc_dedup.md
@@ -1,5 +1,5 @@
---
-date: 2023-01-30T10:17:46+01:00
+date: 2023-02-03T12:57:12+01:00
title: "warc dedup"
slug: warc_dedup
url: /cmd/warc_dedup/
@@ -8,6 +8,13 @@ url: /cmd/warc_dedup/

Deduplicate WARC files

+### Synopsis
+
+Deduplicate WARC files.
+
+NOTE: The filtering options only decides which records are candidates for deduplication.
+The remaining records are written as is.
+
```
warc dedup [flags]
```
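The NOTE added to the synopsis above says the filter flags only select deduplication candidates, and that every other record is still written unchanged. A rough sketch of that control flow follows. It is not the tool's implementation: the `record` type, the `dedupActions` function, matching duplicates by payload digest, and emitting a revisit for duplicates are all assumptions made for illustration.

```go
package main

import "fmt"

// record is a hypothetical, simplified stand-in for a WARC record.
type record struct {
	id         string
	recordType string
	digest     string // assumed payload digest used to detect duplicates
}

// dedupActions returns one action per input record. Records rejected by
// isCandidate are never dropped; they are passed through unchanged.
func dedupActions(records []record, isCandidate func(record) bool) []string {
	seen := map[string]string{} // digest -> id of the first candidate seen with it
	actions := make([]string, 0, len(records))
	for _, r := range records {
		if !isCandidate(r) {
			// Filtered out: not a dedup candidate, written as is.
			actions = append(actions, r.id+": as-is")
			continue
		}
		if first, dup := seen[r.digest]; dup {
			// Candidate with an already seen payload (assumed to become a revisit).
			actions = append(actions, r.id+": revisit of "+first)
			continue
		}
		seen[r.digest] = r.id
		actions = append(actions, r.id+": as-is")
	}
	return actions
}

func main() {
	records := []record{
		{"r1", "response", "sha1:A"},
		{"r2", "response", "sha1:A"},
		{"r3", "resource", "sha1:A"},
		{"r4", "response", "sha1:B"},
	}
	// Mirrors the documented default of --record-type, which is [response].
	onlyResponses := func(r record) bool { return r.recordType == "response" }
	for _, a := range dedupActions(records, onlyResponses) {
		fmt.Println(a)
	}
}
```

In this sketch the `resource` record shares a digest with an earlier response but is left untouched, because the filter excluded it from the candidate set.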
@@ -23,18 +30,26 @@ warc dedup [flags]
-S, --file-size string The maximum size for WARC files (default "1GB")
--flush if true, sync WARC file to disk after writing each record
-h, --help help for dedup
+--id stringArray filter record ID's. For more than one, repeat flag or comma separated list.
-i, --index-dir string directory to store indexes (default "/home/johnh/.cache/warc")
-k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
+-m, --mime-type strings filter records with given mime-types. For more than one, repeat flag or comma separated list.
--min-free-disk string minimum free space on disk to allow WARC writing (default "256MB")
-g, --min-size-gain string minimum bytes one must earn to perform a deduplication (default "2KB")
-n, --name-generator string the name generator to use. By setting this to 'identity', the input filename will also be used as
output file name (prefix and suffix might still change). In this mode exactly one file is generated for every input file (default "default")
-K, --new-index true to start from a fresh index, deleting eventual index from last run
-p, --prefix string filename prefix for WARC files
--t, --record-type strings record types to dedup. For more than one, repeat flag or comma separated list.
+-t, --record-type strings filter record types. For more than one, repeat flag or comma separated list.
Legal values: warcinfo,request,response,metadata,revisit,resource,continuation,conversion (default [response])
-r, --recursive walk directories recursively
-R, --repair try to fix errors in records
+-e, --response-code string filter records with given http response codes. Format is 'from-to' where from is inclusive and to is exclusive.
+Examples:
+'200': only records with 200 response
+'200-300': all records with response code between 200(inclusive) and 300(exclusive)
+'-400': all response codes below 400
+'500-': all response codes from 500 and above
--subdir-pattern string a pattern to use for generating subdirectories.
/ in pattern separates subdirectories on all platforms
{YYYY} is replaced with a 4 digit year
@@ -53,8 +68,8 @@ warc dedup [flags]

```
--config string config file. If not set, /etc/warc/, $HOME/.warc/ and current working dir will be searched for file config.yaml
---log-console strings The kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
---log-file strings The kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
+--log-console strings the kind of log output to write to console. Valid values: info, error, summary, progress (default [progress,summary])
+--log-file strings the kind of log output to write to file. Valid values: info, error, summary (default [info,error,summary])
-L, --log-file-name string a file to write log output. Empty for no log file
--tmpdir string directory to use for temporary files (default "/tmp")
```