Commit ad95923

version v2.0.0
1 parent 7a1ab1d commit ad95923

11 files changed: +315 -163 lines changed

docs/content/cmd/warc.md

+1-1
@@ -1,5 +1,5 @@
 ---
-date: 2023-06-30T15:18:26+02:00
+date: 2024-02-21T15:43:33+01:00
 title: "warc"
 slug: warc
 url: /cmd/warc/

docs/content/cmd/warc_cat.md

+1-1
@@ -1,5 +1,5 @@
 ---
-date: 2023-06-30T15:18:26+02:00
+date: 2024-02-21T15:43:33+01:00
 title: "warc cat"
 slug: warc_cat
 url: /cmd/warc_cat/

docs/content/cmd/warc_completion.md

+1-1
@@ -1,5 +1,5 @@
 ---
-date: 2023-06-30T15:18:26+02:00
+date: 2024-02-21T15:43:33+01:00
 title: "warc completion"
 slug: warc_completion
 url: /cmd/warc_completion/

docs/content/cmd/warc_console.md

+1-1
@@ -1,5 +1,5 @@
 ---
-date: 2023-06-30T15:18:26+02:00
+date: 2024-02-21T15:43:33+01:00
 title: "warc console"
 slug: warc_console
 url: /cmd/warc_console/

docs/content/cmd/warc_convert.md

+1-1
@@ -1,5 +1,5 @@
 ---
-date: 2023-06-30T15:18:26+02:00
+date: 2024-02-21T15:43:33+01:00
 title: "warc convert"
 slug: warc_convert
 url: /cmd/warc_convert/

docs/content/cmd/warc_convert_arc.md

+58-30
@@ -1,5 +1,5 @@
 ---
-date: 2023-06-30T15:18:26+02:00
+date: 2024-02-21T15:43:33+01:00
 title: "warc convert arc"
 slug: warc_convert_arc
 url: /cmd/warc_convert_arc/
@@ -15,35 +15,63 @@ warc convert arc <files/dirs> [flags]
 ### Options
 
 ```
-  -z, --compress use gzip compression for WARC files
-      --compression-level the gzip compression level to use (value between 1 and 9)
-  -c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
-  -C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
-          A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
-  -t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-6-30")
-  -S, --file-size int The maximum size for WARC files (default 1073741824)
-      --flush if true, sync WARC file to disk after writing each record
-  -h, --help help for arc
-  -i, --index-dir string directory to store indexes (default "/home/johnh/.cache/warc")
-  -k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
-  -n, --name-generator string the name generator to use. By setting this to 'identity', the input filename will also be used as
-          output file name (prefix and suffix might still change). In this mode exactly one file is generated for every input file (default "default")
-  -K, --new-index true to start from a fresh index, deleting eventual index from last run
-  -p, --prefix string filename prefix for WARC files
-  -r, --recursive walk directories recursively
-      --subdir-pattern string a pattern to use for generating subdirectories.
-          / in pattern separates subdirectories on all platforms
-          {YYYY} is replaced with a 4 digit year
-          {YY} is replaced with a 2 digit year
-          {MM} is replaced with a 2 digit month
-          {DD} is replaced with a 2 digit day
-          The date used is the WARC date of each record. Therefore a input file might be split into
-          WARC files in different subdirectories. If NameGenerator is 'identity' only the first record
-          of each file's date is used to keep the file as one.
-      --suffixes strings filter files by suffixes (default [.arc,.arc.gz])
-  -s, --symlinks follow symlinks
-  -w, --warc-dir string output directory for generated warc files. Directory must exist. (default ".")
-      --warc-version string the WARC version to use for created files (default "1.1")
+      --close-input-file-hook string a command to run after closing each input file. The command has access to data as environment variables.
+          WARC_COMMAND contains the subcommand name
+          WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+          WARC_FILE_NAME contains the file name of the input file
+          WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
+      --close-output-file-hook string a command to run after closing each output file. The command has access to data as environment variables.
+          WARC_COMMAND contains the subcommand name
+          WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+          WARC_FILE_NAME contains the file name of the output file
+          WARC_SIZE contains the size of the output file
+          WARC_INFO_ID contains the ID of the output file's WARCInfo-record if created
+          WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
+          WARC_HASH contains the hash of the output file if computed
+          WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
+  -z, --compress use gzip compression for WARC files
+      --compression-level the gzip compression level to use (value between 1 and 9)
+  -c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
+  -C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
+          A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
+  -t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2024-2-21")
+  -S, --file-size int The maximum size for WARC files (default 1073741824)
+      --flush if true, sync WARC file to disk after writing each record
+  -h, --help help for arc
+  -i, --index-dir string directory to store indexes (default "/home/johnh/.cache/warc")
+  -k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
+  -n, --name-generator string the name generator to use. By setting this to 'identity', the input filename will also be used as
+          output file name (prefix and suffix might still change). In this mode exactly one file is generated for every input file (default "identity")
+  -K, --new-index true to start from a fresh index, deleting eventual index from last run
+      --open-input-file-hook string a command to run before opening each input file. The command has access to data as environment variables.
+          WARC_COMMAND contains the subcommand name
+          WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+          WARC_FILE_NAME contains the file name of the input file
+      --open-output-file-hook string a command to run before opening each output file. The command has access to data as environment variables.
+          WARC_COMMAND contains the subcommand name
+          WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+          WARC_FILE_NAME contains the file name of the output file
+          WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
+  -p, --prefix string filename prefix for WARC files (default "from_arc_")
+  -r, --recursive walk directories recursively
+      --source-filesystem string the source filesystem to use for input files. Default is to use OS file system. Legal values:
+          ftp://user/pass@host:port
+          tar://path/to/archive.tar
+          tgz://path/to/archive.tar.gz
+
+      --subdir-pattern string a pattern to use for generating subdirectories.
+          / in pattern separates subdirectories on all platforms
+          {YYYY} is replaced with a 4 digit year
+          {YY} is replaced with a 2 digit year
+          {MM} is replaced with a 2 digit month
+          {DD} is replaced with a 2 digit day
+          The date used is the WARC date of each record. Therefore a input file might be split into
+          WARC files in different subdirectories. If NameGenerator is 'identity' only the first record
+          of each file's date is used to keep the file as one.
+      --suffixes strings filter files by suffixes (default [.arc,.arc.gz])
+  -s, --symlinks follow symlinks
+  -w, --warc-dir string output directory for generated warc files. Directory must exist. (default ".")
+      --warc-version string the WARC version to use for created files (default "1.1")
 ```
 
 ### Options inherited from parent commands
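
Reviewer note: the four hook flags added in this hunk hand their data to the hook command through environment variables. The following is a minimal sketch of what such a hook command could look like, assuming it is written as a Python script. Only the `WARC_*` variable names come from the help text; the function name `describe_hook_event` and the one-line output format are hypothetical.

```python
import os

# Variable names as listed in the --close-output-file-hook help text.
# The other hooks set a subset of these.
HOOK_VARS = (
    "WARC_COMMAND",
    "WARC_HOOK_TYPE",
    "WARC_FILE_NAME",
    "WARC_SIZE",
    "WARC_INFO_ID",
    "WARC_SRC_FILE_NAME",
    "WARC_HASH",
    "WARC_ERROR_COUNT",
)


def describe_hook_event(env):
    """Return a one-line NAME=value summary of the hook variables that are set.

    Variables that are missing or empty are skipped, since several of them
    are documented as conditional ("if created", "if computed").
    """
    present = [(name, env[name]) for name in HOOK_VARS if env.get(name)]
    return " ".join(f"{name}={value}" for name, value in present)


if __name__ == "__main__":
    # When run as a hook, summarize whatever the tool put in the environment.
    print(describe_hook_event(os.environ))
```

The help text does not say whether the hook command is run through a shell or executed directly, so how the script is passed to a flag such as `--close-output-file-hook` is left open here.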

docs/content/cmd/warc_convert_nedlib.md

+56-28
@@ -1,5 +1,5 @@
 ---
-date: 2023-06-30T15:18:26+02:00
+date: 2024-02-21T15:43:33+01:00
 title: "warc convert nedlib"
 slug: warc_convert_nedlib
 url: /cmd/warc_convert_nedlib/
@@ -15,33 +15,61 @@ warc convert nedlib <files/dirs> [flags]
 ### Options
 
 ```
-  -z, --compress use gzip compression for WARC files
-      --compression-level the gzip compression level to use (value between 1 and 9)
-  -c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
-  -C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
-          A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
-  -t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-6-30")
-  -S, --file-size int The maximum size for WARC files (default 1073741824)
-      --flush if true, sync WARC file to disk after writing each record
-  -h, --help help for nedlib
-  -i, --index-dir string directory to store indexes (default "/home/johnh/.cache/warc")
-  -k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
-  -K, --new-index true to start from a fresh index, deleting eventual index from last run
-  -p, --prefix string filename prefix for WARC files
-  -r, --recursive walk directories recursively
-      --subdir-pattern string a pattern to use for generating subdirectories.
-          / in pattern separates subdirectories on all platforms
-          {YYYY} is replaced with a 4 digit year
-          {YY} is replaced with a 2 digit year
-          {MM} is replaced with a 2 digit month
-          {DD} is replaced with a 2 digit day
-          The date used is the WARC date of each record. Therefore a input file might be split into
-          WARC files in different subdirectories. If NameGenerator is 'identity' only the first record
-          of each file's date is used to keep the file as one.
-      --suffixes strings filter files by suffixes (default [.meta])
-  -s, --symlinks follow symlinks
-  -w, --warc-dir string output directory for generated warc files. Directory must exist. (default ".")
-      --warc-version string the WARC version to use for created files (default "1.1")
+      --close-input-file-hook string a command to run after closing each input file. The command has access to data as environment variables.
+          WARC_COMMAND contains the subcommand name
+          WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+          WARC_FILE_NAME contains the file name of the input file
+          WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
+      --close-output-file-hook string a command to run after closing each output file. The command has access to data as environment variables.
+          WARC_COMMAND contains the subcommand name
+          WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+          WARC_FILE_NAME contains the file name of the output file
+          WARC_SIZE contains the size of the output file
+          WARC_INFO_ID contains the ID of the output file's WARCInfo-record if created
+          WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
+          WARC_HASH contains the hash of the output file if computed
+          WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
+  -z, --compress use gzip compression for WARC files
+      --compression-level the gzip compression level to use (value between 1 and 9)
+  -c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
+  -C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
+          A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
+  -t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2024-2-21")
+  -S, --file-size int The maximum size for WARC files (default 1073741824)
+      --flush if true, sync WARC file to disk after writing each record
+  -h, --help help for nedlib
+  -i, --index-dir string directory to store indexes (default "/home/johnh/.cache/warc")
+  -k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
+  -K, --new-index true to start from a fresh index, deleting eventual index from last run
+      --open-input-file-hook string a command to run before opening each input file. The command has access to data as environment variables.
+          WARC_COMMAND contains the subcommand name
+          WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+          WARC_FILE_NAME contains the file name of the input file
+      --open-output-file-hook string a command to run before opening each output file. The command has access to data as environment variables.
+          WARC_COMMAND contains the subcommand name
+          WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+          WARC_FILE_NAME contains the file name of the output file
+          WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
+  -p, --prefix string filename prefix for WARC files (default "from_nedlib_")
+  -r, --recursive walk directories recursively
+      --source-filesystem string the source filesystem to use for input files. Default is to use OS file system. Legal values:
+          ftp://user/pass@host:port
+          tar://path/to/archive.tar
+          tgz://path/to/archive.tar.gz
+
+      --subdir-pattern string a pattern to use for generating subdirectories.
+          / in pattern separates subdirectories on all platforms
+          {YYYY} is replaced with a 4 digit year
+          {YY} is replaced with a 2 digit year
+          {MM} is replaced with a 2 digit month
+          {DD} is replaced with a 2 digit day
+          The date used is the WARC date of each record. Therefore a input file might be split into
+          WARC files in different subdirectories. If NameGenerator is 'identity' only the first record
+          of each file's date is used to keep the file as one.
+      --suffixes strings filter files by suffixes (default [.meta])
+  -s, --symlinks follow symlinks
+  -w, --warc-dir string output directory for generated warc files. Directory must exist. (default ".")
+      --warc-version string the WARC version to use for created files (default "1.1")
 ```
 
 ### Options inherited from parent commands
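
Reviewer note: the `--subdir-pattern` help text describes four date placeholders. The sketch below illustrates how such a pattern could expand for a given record date. It is an illustration of the documented placeholders only, not the tool's implementation; the function name `expand_subdir_pattern` is hypothetical, and the platform-specific handling of `/` mentioned in the help text is not modelled.

```python
from datetime import date


def expand_subdir_pattern(pattern, record_date):
    """Replace {YYYY}, {YY}, {MM} and {DD} with zero-padded date parts."""
    replacements = {
        "{YYYY}": f"{record_date.year:04d}",
        "{YY}": f"{record_date.year % 100:02d}",
        "{MM}": f"{record_date.month:02d}",
        "{DD}": f"{record_date.day:02d}",
    }
    for placeholder, value in replacements.items():
        pattern = pattern.replace(placeholder, value)
    return pattern


if __name__ == "__main__":
    # A record dated 2024-02-21 would land in the subdirectory 2024/02/21.
    print(expand_subdir_pattern("{YYYY}/{MM}/{DD}", date(2024, 2, 21)))
```

Because the date comes from each record, two records from one input file can map to different subdirectories, which is the splitting behaviour the help text warns about.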
