Add support for arbitrary adapters via a config file #60

phiresky · 2020-05-29T19:19:46Z

Since there are now many feature requests for different file formats, many of which have no corresponding nice and fast Rust library, I think the best solution is to allow specifying "custom" preprocessors via a config file.

This raises the question of how this would differ from just using rg with the --pre flag directly:

Compressed caching
I'm very happy with this feature of rga: extractors are often very slow, and with the zstd-compressed cache most extractions are both very small and very fast to read, while adding barely any overhead on the initial run. This is hard or impossible to reproduce in a simple extract script (see my original pdfextract.sh).
Archive recursion
rga can recurse into archives, and return contents at any depth as a binary stream. The same can be implemented for other things that aren't strictly archives, like a pdf file that contains images, where the images may be searched by a different extractor
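
A minimal sketch of the compressed-caching idea, not rga's actual implementation. It assumes the cache is keyed on a hash of the file contents and uses `zlib` from the Python standard library as a stand-in for zstd; `cached_extract` and its parameters are hypothetical names.

```python
import hashlib
import zlib
from pathlib import Path
from typing import Callable


def cached_extract(path: Path, cache_dir: Path,
                   extract: Callable[[Path], bytes]) -> bytes:
    """Return the extractor output for `path`, reusing a compressed cache entry."""
    # Assumption: key on the file contents, so editing the file invalidates the entry.
    key = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = cache_dir / key
    if entry.exists():
        # Cache hit: decompressing is far cheaper than re-running a slow extractor.
        return zlib.decompress(entry.read_bytes())
    # Cache miss: run the (slow) extractor once and store its compressed output.
    text = extract(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_bytes(zlib.compress(text))
    return text
```

The second call for an unchanged file never invokes `extract`, which is the part that is awkward to reproduce in a plain `--pre` shell script.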

Future additions that might be possible here (no promises) that will probably not appear in rg core are:

Declarative post-processing options
Just as the pdf extractor already adds the page number to the pdftotext output by counting ASCII page-break symbols, there might be some postprocessing steps that could be defined in the config file, so that they are implemented in fast Rust without effort on the part of the filetype handler.
Not directly running a separate program for each file but using something like a file-type-handler-server instead
From current usage the extractor is always slow enough so the initialization time is kinda irrelevant, but this might not always be the case:
For example, stuff like tesseract loads neural networks into memory when started, which can be a significant overhead. I think those are evaluated on the CPU, but if there was stuff like GPU-based compute it would be even worse.
More pipeable adapters
It might be useful to add adapters that are more like text-conversion tools (such as removing broken characters (Unicode normalization #26, feature_request(ebooks): kill gremlin characters #46) or changing encodings (UTF16 and possibly other UTF encodings support #5, feature_request(ebooks): non UTF-8 books support #47)) that could then be added as a step before or after the usual adapters
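
To make the post-processing and pipeable-adapter ideas above concrete, here is a rough sketch of two text steps chained after an extractor: page numbering by counting form-feed characters, and Unicode normalization. The function names and the `Page N: ` prefix format are assumptions for illustration, not rga's defined behaviour.

```python
import unicodedata
from typing import Callable, Iterable


def prefix_page_numbers(text: str) -> str:
    """Prefix every line with its page number; a form feed (\\f) ends a page."""
    out = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line in page.splitlines():
            out.append(f"Page {page_no}: {line}")
    return "\n".join(out)


def normalize_unicode(text: str) -> str:
    """Fold compatibility characters (e.g. ligatures) into their plain forms."""
    return unicodedata.normalize("NFKC", text)


def apply_steps(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Run each text-to-text step in order, like a pipeline after the extractor."""
    for step in steps:
        text = step(text)
    return text
```

A config file could then name the steps to run for an adapter, for example `apply_steps(raw, [normalize_unicode, prefix_page_numbers])`.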

The baseline implementation of this should be pretty easy; more features can be added later. The main decisions are the config file format, whether or not to change the existing SpawningFileAdapters to build on top of this, and how to document it.
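
As a rough illustration of what such a baseline could look like, the sketch below models one config entry and runs it as a preprocessor. The field names (`name`, `extensions`, `binary`, `args`) and the convention of feeding the file on stdin and reading text from stdout are assumptions made for this example, not a format decided in this issue.

```python
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class CustomAdapter:
    """One hypothetical entry of a custom-adapter list in the config file."""
    name: str
    extensions: List[str]          # file extensions without the leading dot
    binary: str                    # program to spawn
    args: List[str] = field(default_factory=list)


def find_adapter(adapters: List[CustomAdapter], path: Path) -> Optional[CustomAdapter]:
    """Pick the first adapter whose extension list matches the file name."""
    ext = path.suffix.lstrip(".").lower()
    for adapter in adapters:
        if ext in adapter.extensions:
            return adapter
    return None


def run_adapter(adapter: CustomAdapter, path: Path) -> bytes:
    """Spawn the adapter with the file on stdin and return what it prints."""
    with path.open("rb") as source:
        result = subprocess.run(
            [adapter.binary, *adapter.args],
            stdin=source,
            stdout=subprocess.PIPE,
            check=True,  # a failing extractor raises instead of yielding empty output
        )
    return result.stdout
```

Matching by extension alone is the simplest choice; matching by MIME type or file contents would need more fields than this sketch has.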


phiresky · 2020-05-29T19:21:30Z

Would solve #53, #42, #47, #38, #36, #28, #14, possibly #52.

Kristinita · 2020-06-09T08:27:01Z

Type: Reply 💬

1. Summary

It would be nice if the configuration file format were not JSON.

In my opinion YAML is the best config format for human editing; TOML is a good alternative.

2. Argumentation

2.1. Summary

JSON doesn't support comments. I'm used to writing detailed comments, so this is a big problem for me.
It is an inconvenient format for manual human editing. YAML or TOML is simpler.

2.2. Details

I have already written 9 issues about this problem for other tools. Example; other issues are referenced below.

2.3. Additional link

The downsides of JSON for config files

Thanks.

phiresky · 2020-06-09T08:53:28Z

Counterpoints:

YAML is a bad format with many unexpected caveats and questionable design decisions. Especially that e.g. hello: on is interpreted as on being a boolean instead of the string "on".

"Anyone who uses YAML long enough will eventually get burned when attempting to abbreviate Norway."

Example:

NI: Nicaragua
NL: Netherlands
NO: Norway # boom!

NO is parsed as a boolean: under the YAML 1.1 spec, there are 22 ways to write "true" or "false".

Sources: https://github.com/cblp/yaml-sucks https://www.arp242.net/yaml-config.html
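
The "Norway problem" can be reproduced without a YAML library. The sketch below hard-codes the 22 plain scalars that the YAML 1.1 boolean type lists and resolves an unquoted value the way a parser following that list would; `resolve_plain_scalar` is a hypothetical helper, and real parsers may implement only a subset of these spellings.

```python
from typing import Union

# The 22 spellings listed by the YAML 1.1 boolean type (11 true, 11 false).
YAML11_TRUE = {"y", "Y", "yes", "Yes", "YES", "true", "True", "TRUE", "on", "On", "ON"}
YAML11_FALSE = {"n", "N", "no", "No", "NO", "false", "False", "FALSE", "off", "Off", "OFF"}


def resolve_plain_scalar(value: str) -> Union[bool, str]:
    """Resolve an unquoted scalar as a YAML 1.1 boolean if it matches, else a string."""
    if value in YAML11_TRUE:
        return True
    if value in YAML11_FALSE:
        return False
    return value


countries = {code: resolve_plain_scalar(code) for code in ["NI", "NL", "NO"]}
# "NI" and "NL" stay strings, while "NO" silently becomes the boolean False.
```

Quoting the value in the YAML file ("NO") sidesteps the problem, but only if the author remembers to do it.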
TOML is ok, though the array syntax is fairly weird and it is very unknown outside of the Rust community
ini is very restricted and defining arrays is not possible without syntax additions
JSON has a restricted and simple syntax
Is the only format (apart from JSON) I know of that , and has integrated and automatic support of schema validation and autocomplete in a major editor (VSCode).

See these screenshots:

I'd love to use a different format, I'm looking at JSON5, but so far I've not really found a worthy replacement.

phiresky · 2020-06-09T08:57:43Z

Specifically, the format "JSON with comments" (extension jsonc) is supported well in VSCode:

hediet · 2020-06-09T08:58:44Z

Maybe GeML + GeML schema could be the solution? :D

phiresky · 2020-06-09T09:03:57Z

Sure, but only if you manage to get your Automatic settings UI editor to work. I'd love to ship an HTML file so I can add a rga --config-ui command that just opens a GUI editor for the configuration.

Kristinita · 2020-06-09T16:24:22Z

Type: Reply 💬

1. Replies

1.1. StrictYAML

Did you try StrictYAML? There is a Rust implementation (I haven’t tested it).

See official documentation:

Refusing to parse the ugly, hard to read and insecure features of YAML like the Norway problem.

List of the features

1.2. YAML problems

YAML is a bad format with many unexpected caveats and questionable design decisions. Especially that e.g. hello: on is interpreted as on being a boolean instead of the string "on".

I don’t think that YAML (issue) or ruamel.yaml (issue) is ideal, but:

I haven’t met some of these problems, such as the “Norway problem”, in real practice, and I don’t yet understand where they could appear in real ripgrep-all configuration files.
For most of the examples described in the YAML sucks repository, I don’t understand why they are regarded as problems.
But if someone really thinks these are problems, we have StrictYAML.

1.3. YAML vs JSON

JSON has a restricted and simple syntax

Simple example:

YAML:
```
custom_adapters:
- name: calibre
```

JSON:

```
{
    "custom_adapters": [
        {
            "name": "calibre"
        }
    ]
}
```

Extra symbols that JSON needs for this small example (YAML needs only one -):

4 braces
2 brackets
6 quotes

1.4. TOML popularity

TOML (…) is very unknown outside of the Rust community

May I ask what this opinion is based on?

TOML on GitHub:

As of 9 June 2020, GitHub shows 156 thousand TOML code results; most matches for the word toml are in Python code. TOML also has 13.6 thousand GitHub stars.

I don’t think that TOML is “very unknown”.

2. Custom configuration format

Some Node.js projects have a cosmiconfig dependency. Users of these projects can write their configuration files in YAML, JSON or JavaScript (see the official repository for details).

Confinode can also support other formats, including TOML.

Is something like this possible in Rust, so that ripgrep-all users can choose their preferred config format themselves?

I found the config-rs Rust repository (but have not tested it).

Read from JSON, TOML, YAML, HJSON, INI files

Possibly, it may help.

Thanks.

phiresky · 2023-05-26T14:55:13Z

This will be added in 1.0.0 and is already present in 1.0.0-alpha.4

Kristinita mentioned this issue Jun 9, 2020

feature_request(debug): detailed debug information #63

Closed

5 tasks

phiresky mentioned this issue Jun 9, 2020

docs(adapters): third-party adapters commands #53

Closed

Kristinita mentioned this issue Jun 9, 2020

feature_request(books): detect incorrect and poor quality text #62

Closed

phiresky closed this as completed May 26, 2023
