---
- date: 2023-06-30T15:18:26+02:00
+ date: 2024-02-21T15:43:33+01:00
title: "warc convert arc"
slug: warc_convert_arc
url: /cmd/warc_convert_arc/
@@ -15,35 +15,63 @@ warc convert arc <files/dirs> [flags]
### Options
```
- -z, --compress use gzip compression for WARC files
- --compression-level the gzip compression level to use (value between 1 and 9)
- -c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
- -C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
-         A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
- -t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2023-6-30")
- -S, --file-size int The maximum size for WARC files (default 1073741824)
- --flush if true, sync WARC file to disk after writing each record
- -h, --help help for arc
- -i, --index-dir string directory to store indexes (default "/home/johnh/.cache/warc")
- -k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
- -n, --name-generator string the name generator to use. By setting this to 'identity', the input filename will also be used as
-         output file name (prefix and suffix might still change). In this mode exactly one file is generated for every input file (default "default")
- -K, --new-index true to start from a fresh index, deleting eventual index from last run
- -p, --prefix string filename prefix for WARC files
- -r, --recursive walk directories recursively
- --subdir-pattern string a pattern to use for generating subdirectories.
-         / in pattern separates subdirectories on all platforms
-         {YYYY} is replaced with a 4 digit year
-         {YY} is replaced with a 2 digit year
-         {MM} is replaced with a 2 digit month
-         {DD} is replaced with a 2 digit day
-         The date used is the WARC date of each record. Therefore a input file might be split into
-         WARC files in different subdirectories. If NameGenerator is 'identity' only the first record
-         of each file's date is used to keep the file as one.
- --suffixes strings filter files by suffixes (default [.arc,.arc.gz])
- -s, --symlinks follow symlinks
- -w, --warc-dir string output directory for generated warc files. Directory must exist. (default ".")
- --warc-version string the WARC version to use for created files (default "1.1")
+ --close-input-file-hook string a command to run after closing each input file. The command has access to data as environment variables.
+         WARC_COMMAND contains the subcommand name
+         WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+         WARC_FILE_NAME contains the file name of the input file
+         WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
+ --close-output-file-hook string a command to run after closing each output file. The command has access to data as environment variables.
+         WARC_COMMAND contains the subcommand name
+         WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+         WARC_FILE_NAME contains the file name of the output file
+         WARC_SIZE contains the size of the output file
+         WARC_INFO_ID contains the ID of the output file's WARCInfo-record if created
+         WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
+         WARC_HASH contains the hash of the output file if computed
+         WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
+ -z, --compress use gzip compression for WARC files
+ --compression-level the gzip compression level to use (value between 1 and 9)
+ -c, --concurrency int number of input files to process simultaneously. The default value is 1.5 x <number of cpu cores> (default 24)
+ -C, --concurrent-writers int maximum concurrent WARC writers. This is the number of WARC-files simultaneously written to.
+         A consequence is that at least this many WARC files are created even if there is only one input file. (default 1)
+ -t, --default-date string fetch date to use for records missing date metadata. Fetchtime is set to 12:00 UTC for the date (default "2024-2-21")
+ -S, --file-size int The maximum size for WARC files (default 1073741824)
+ --flush if true, sync WARC file to disk after writing each record
+ -h, --help help for arc
+ -i, --index-dir string directory to store indexes (default "/home/johnh/.cache/warc")
+ -k, --keep-index true to keep index on disk so that the next run will continue where the previous run left off
+ -n, --name-generator string the name generator to use. By setting this to 'identity', the input filename will also be used as
+         output file name (prefix and suffix might still change). In this mode exactly one file is generated for every input file (default "identity")
+ -K, --new-index true to start from a fresh index, deleting eventual index from last run
+ --open-input-file-hook string a command to run before opening each input file. The command has access to data as environment variables.
+         WARC_COMMAND contains the subcommand name
+         WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+         WARC_FILE_NAME contains the file name of the input file
+ --open-output-file-hook string a command to run before opening each output file. The command has access to data as environment variables.
+         WARC_COMMAND contains the subcommand name
+         WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
+         WARC_FILE_NAME contains the file name of the output file
+         WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
+ -p, --prefix string filename prefix for WARC files (default "from_arc_")
+ -r, --recursive walk directories recursively
+ --source-filesystem string the source filesystem to use for input files. Default is to use OS file system. Legal values:
+         ftp://user/pass@host:port
+         tar://path/to/archive.tar
+         tgz://path/to/archive.tar.gz
+
+ --subdir-pattern string a pattern to use for generating subdirectories.
+         / in pattern separates subdirectories on all platforms
+         {YYYY} is replaced with a 4 digit year
+         {YY} is replaced with a 2 digit year
+         {MM} is replaced with a 2 digit month
+         {DD} is replaced with a 2 digit day
+         The date used is the WARC date of each record. Therefore a input file might be split into
+         WARC files in different subdirectories. If NameGenerator is 'identity' only the first record
+         of each file's date is used to keep the file as one.
+ --suffixes strings filter files by suffixes (default [.arc,.arc.gz])
+ -s, --symlinks follow symlinks
+ -w, --warc-dir string output directory for generated warc files. Directory must exist. (default ".")
+ --warc-version string the WARC version to use for created files (default "1.1")
```
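
The new hook flags pass their data to the hook command through environment variables. As a rough illustration, a close-output-file hook could be a small script that reads the `WARC_*` variables listed above and logs one summary line per finished file. This is a minimal sketch, not part of the tool: the script, the function name `describe_output_file`, and the log format are all hypothetical, and it assumes the hook command can be the path to an executable script.

```python
#!/usr/bin/env python3
"""Hypothetical close-output-file hook: logs one line per finished WARC file."""
import os
import sys


def describe_output_file(env):
    """Build a one-line summary from the WARC_* variables named in the help text.

    Variables that are not set (for example WARC_HASH when no hash was
    computed) are skipped rather than printed as empty fields.
    """
    fields = [
        ("command", env.get("WARC_COMMAND")),
        ("hook", env.get("WARC_HOOK_TYPE")),
        ("file", env.get("WARC_FILE_NAME")),
        ("source", env.get("WARC_SRC_FILE_NAME")),
        ("size", env.get("WARC_SIZE")),
        ("hash", env.get("WARC_HASH")),
        ("errors", env.get("WARC_ERROR_COUNT")),
    ]
    return " ".join(f"{name}={value}" for name, value in fields if value)


if __name__ == "__main__":
    # Write to stderr so the summary stays separate from anything on stdout.
    print(describe_output_file(os.environ), file=sys.stderr)
```

Saved as an executable file (say `log_warc.py`, a made-up name), it might be passed as `--close-output-file-hook ./log_warc.py`. The help text does not say whether the hook string may carry extra arguments or how the hook's exit status is treated, so the sketch relies on neither.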
### Options inherited from parent commands