@@ -130,38 +130,42 @@ the corresponding text representation.

Command line parameters
-----------------------
+
The inscript.py command line client supports the following parameters::

- usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR]
- [--indentation INDENTATION] [-v]
- [input]
-
- Convert the given HTML document to text.
-
- positional arguments:
- input Html input either from a file or a URL (default:stdin).
-
- optional arguments:
- -h, --help show this help message and exit
- -o OUTPUT, --output OUTPUT
- Output file (default:stdout).
- -e ENCODING, --encoding ENCODING
- Input encoding to use (default:utf-8 for files; detected server encoding for Web URLs).
- -i, --display-image-captions
- Display image captions (default:false).
- -d, --deduplicate-image-captions
- Deduplicate image captions (default:false).
- -l, --display-link-targets
- Display link targets (default:false).
- -a, --display-anchor-urls
- Display anchor URLs (default:false).
- -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
- Path to an optional JSON file containing rules for annotating the retrieved text.
- -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
- Optional component for postprocessing the result (html, surface, xml).
- --indentation INDENTATION
- How to handle indentation (extended or strict; default: extended).
- -v, --version display version information
+ usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR] [--indentation INDENTATION]
+ [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]
+ [input]
+
+ Convert the given HTML document to text.
+
+ positional arguments:
+ input Html input either from a file or a URL (default:stdin).
+
+ optional arguments:
+ -h, --help show this help message and exit
+ -o OUTPUT, --output OUTPUT
+ Output file (default:stdout).
+ -e ENCODING, --encoding ENCODING
+ Input encoding to use (default:utf-8 for files; detected server encoding for Web URLs).
+ -i, --display-image-captions
+ Display image captions (default:false).
+ -d, --deduplicate-image-captions
+ Deduplicate image captions (default:false).
+ -l, --display-link-targets
+ Display link targets (default:false).
+ -a, --display-anchor-urls
+ Display anchor URLs (default:false).
+ -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
+ Path to an optional JSON file containing rules for annotating the retrieved text.
+ -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
+ Optional component for postprocessing the result (html, surface, xml).
+ --indentation INDENTATION
+ How to handle indentation (extended or strict; default: extended).
+ --table-cell-separator TABLE_CELL_SEPARATOR
+ Separator to use between table cells (default: three spaces).
+ -v, --version display version information
+


HTML to text conversion
@@ -508,7 +512,36 @@ The following options are available for fine tuning inscriptis' HTML rendering:
html_tree = fromstring(html)
# create a parser using a custom css
config = ParserConfig(css = css)
- parser = Inscriptis(html_tree, config)
+ parser = Inscriptis(html_tree, config) usage: inscript.py [- h] [- o OUTPUT ] [- e ENCODING ] [- i] [- d] [- l] [- a] [- r ANNOTATION_RULES ] [- p POSTPROCESSOR ]
+ [-- indentation INDENTATION ] [- v]
+ [input ]
+
+ Convert the given HTML document to text.
+
+ positional arguments:
+ input Html input either from a file or a URL (default:stdin).
+
+ optional arguments:
+ - h, -- help show this help message and exit
+ - o OUTPUT , -- output OUTPUT
+ Output file (default:stdout).
+ - e ENCODING , -- encoding ENCODING
+ Input encoding to use (default:utf- 8 for files; detected server encoding for Web URLs).
+ - i, -- display- image- captions
+ Display image captions (default:false).
+ - d, -- deduplicate- image- captions
+ Deduplicate image captions (default:false).
+ - l, -- display- link- targets
+ Display link targets (default:false).
+ - a, -- display- anchor- urls
+ Display anchor URLs (default:false).
+ - r ANNOTATION_RULES , -- annotation- rules ANNOTATION_RULES
+ Path to an optional JSON file containing rules for annotating the retrieved text.
+ - p POSTPROCESSOR , -- postprocessor POSTPROCESSOR
+ Optional component for postprocessing the result (html, surface, xml).
+ -- indentation INDENTATION
+ How to handle indentation (extended or strict; default: extended).
+ - v, -- version display version information
text = parser.get_text()
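The option list in the first hunk has the shape of argparse help output, so the new ``--table-cell-separator`` flag can be illustrated with a small standalone parser. This is only a sketch under that assumption: ``build_parser`` is a hypothetical helper, it covers just a subset of the documented options, and it is not taken from the actual inscript.py source. The defaults mirror the help text above (three spaces for the cell separator, ``extended`` for indentation).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a subset of the documented options."""
    parser = argparse.ArgumentParser(
        prog="inscript.py",
        description="Convert the given HTML document to text.",
    )
    # Optional positional argument; None stands in for "read from stdin".
    parser.add_argument(
        "input", nargs="?", default=None,
        help="Html input either from a file or a URL (default:stdin).",
    )
    parser.add_argument(
        "-o", "--output",
        help="Output file (default:stdout).",
    )
    parser.add_argument(
        "--indentation", default="extended",
        help="How to handle indentation (extended or strict; default: extended).",
    )
    # The option added by this change; the default is three spaces.
    parser.add_argument(
        "--table-cell-separator", default="   ",
        help="Separator to use between table cells (default: three spaces).",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# behaves the same wherever it is run.
args = build_parser().parse_args(["--table-cell-separator", " | ", "page.html"])
print(repr(args.table_cell_separator), args.input)
```

argparse derives the attribute name ``table_cell_separator`` from the long option by replacing hyphens with underscores, which is why the value is read back under that name.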