TF-Query vs Full-text search #66

dirkroorda · 2020-12-11T22:30:55Z

dirkroorda
Dec 11, 2020
Maintainer

One thing TF-Query does not do particularly well: full-text search.

TF-Query is a walker at heart: it walks over nodes and inspects feature values.
It is not a helicopter: it does not make use of full-text indexes.

How can we do better?

We could go over to other interfaces that are good at full-text search, e.g. blacklab.

It is not that hard to export a corpus to blacklab and do full-text searches there.
But:

- text-fabric is more flexible in the modelling area: text-as-graph, while blacklab is text-as-tree
- the real power comes when you can combine full-text search with spatial search (where conditions can be stated in terms of text topology)

While I intend to go the blacklab route (because blacklab has other nice things), I do not want to give up on full-text search inside TF-Query.

dirkroorda · 2020-12-11T22:35:55Z

dirkroorda
Dec 11, 2020
Maintainer Author

Here is an idea.

When we speak of full-text search, we mean searching the full text of a feature by means of an index.

Ideally, I would like to leverage existing full-text search tools on feature files.

When we get the results back, we are not only interested in the result strings, but foremost in the nodes that are annotated by the resulting feature values.

We can explode a compact feature file to a file where each node corresponds to a single line, and that line is filled with the value of that node.

Then we index the feature file, and we expect that search results return line numbers as well.
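
A minimal sketch of the exploding step, assuming the feature is available as a plain Python dict from node number to value (a stand-in, not the actual Text-Fabric file format or API). The names `explode` and `search_lines` are hypothetical, and the search is a plain line scan standing in for a real index tool; the point is only that line number n corresponds to node n.

```python
import re

# Hypothetical feature data: node number -> value; node 4 has no value.
feature = {1: "in", 2: "the", 3: "beginning", 5: "the"}
max_node = 5


def explode(feature, max_node):
    """Return one line per node 1..max_node; nodes without a value get an empty line."""
    return "\n".join(feature.get(n, "") for n in range(1, max_node + 1)) + "\n"


def search_lines(text, pattern):
    """Return the 1-based line numbers (= node numbers) whose line matches the pattern."""
    regex = re.compile(pattern)
    return [n for n, line in enumerate(text.splitlines(), start=1) if regex.search(line)]


exploded = explode(feature, max_node)
hits = search_lines(exploded, r"^the$")
print(hits)
```

This assumes that values contain no newlines; a real export would have to escape them to keep the line-to-node correspondence intact.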

2 replies

dirkroorda Dec 11, 2020
Maintainer Author

I want to be able to search with regular expressions. I do not know how well indexes can support regex search.

As it stands now, TF-Query visits nodes and evaluates regexes on feature values one by one. It might go much quicker if it could do it on all values of a feature in one go.

Or maybe not, because TF-Query tries hard to avoid inspecting features of nodes that it already knows are not in the end result.

jan-niestadt Dec 14, 2020

Lucene supports index search using regular expressions. It finds all unique terms that match the regular expression and then returns occurrences of each term. This should be much faster than evaluating the regex over and over again.
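
The principle can be illustrated in plain Python. This is not Lucene's API, only a sketch of the idea under stated assumptions: the feature is a dict from node to value, the "index" is a dict from each unique value to the nodes that carry it, and the regex has to match the whole term. The names `build_term_index` and `regex_lookup` are hypothetical.

```python
import re
from collections import defaultdict


def build_term_index(feature):
    """Map each unique feature value (term) to the sorted list of nodes that have it."""
    index = defaultdict(list)
    for node in sorted(feature):
        index[feature[node]].append(node)
    return index


def regex_lookup(index, pattern):
    """Evaluate the regex once per unique term, then collect the nodes of matching terms."""
    regex = re.compile(pattern)
    nodes = []
    for term, postings in index.items():
        if regex.fullmatch(term):  # assumption: whole-term matching
            nodes.extend(postings)
    return sorted(nodes)


# Hypothetical feature with repeated values: 5 nodes, 3 unique terms.
lex = {1: "say", 2: "king", 3: "say", 4: "sing", 5: "king"}
lex_index = build_term_index(lex)
result = regex_lookup(lex_index, r".ing")
print(result)
```

The regex is evaluated three times here instead of five; on a corpus where values repeat often, that difference is where the gain would come from.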

dirkroorda · 2020-12-11T22:40:55Z

dirkroorda
Dec 11, 2020
Maintainer Author

The TF-Query algorithm starts with a stage where it evaluates all conditions on feature values separately.
This reduces the search space, but it can be quite costly. It happens by walking through all nodes.

When we have full-text search, we can consult the index instead of walking through all nodes, and I expect a significant gain for many queries.

1 reply

jan-niestadt Dec 14, 2020

Yes, I would expect that as well. BlackLab works this way, evaluating each condition separately using the index and combining the results (e.g. taking the intersection of the result sets).
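
As a rough illustration of that strategy (not BlackLab code), here is a sketch that assumes each feature already has an index from value to a set of nodes. The feature names `sp` and `gn`, their values, and the helper `nodes_with` are made up for the example.

```python
# Hypothetical per-feature indexes: value -> set of nodes with that value.
sp_index = {"verb": {2, 4, 7}, "noun": {1, 3, 5, 6}}
gn_index = {"f": {3, 4, 7}, "m": {1, 2, 5}}


def nodes_with(index, value):
    """Look up the nodes for one feature condition without visiting any node."""
    return set(index.get(value, ()))


# Each condition is evaluated separately on its own index ...
verbs = nodes_with(sp_index, "verb")
feminine = nodes_with(gn_index, "f")

# ... and the result sets are combined by intersection.
candidates = verbs & feminine
print(sorted(candidates))
```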

dirkroorda · 2020-12-11T22:44:32Z

dirkroorda
Dec 11, 2020
Maintainer Author

Related to this is a query where we look for the equality of values of (different) features for (different) nodes.
See feature comparison.

This is currently a rather slow operation. Maybe we can speed this up as well by means of a full-text index.

4 replies

jan-niestadt Dec 14, 2020

This seems more difficult to speed up using an index to me, because you're comparing features to each other instead of a feature to a fixed value. BlackLab has support for this, but it is done at the end of the process by evaluating the condition for all the hits found so far, and discarding all where the features don't match.

dirkroorda Dec 14, 2020
Maintainer Author

Indeed, it seems this cannot be sped up by indexing. But maybe this will still help: suppose you want to find the n and m for which f(n) = g(m). You could make a text file in which line n holds the values

f(n) <tab> g(n)

Then run the regex

\n([^\t]*).*?\t\1\n

Every hit of this regex (with the dot also matching newlines) starts at a line n and ends at a line m such that f(n) = g(m).
I think finding hits in one go like this might be quicker than iterating over nodes, collecting feature values and making comparisons.
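
A sketch of how this could look in Python, with some adjustments that are not in the post above and are assumptions of this example: the pattern is wrapped in a lookahead so that hits may overlap, the captured value is forced to be the complete f(n) by requiring a tab right after it, and `re.DOTALL` lets `.*?` cross line boundaries. The dicts `f` and `g` are made-up feature data.

```python
import re
from bisect import bisect_right

# Hypothetical feature values per node.
f = {1: "a", 2: "b", 3: "c"}
g = {1: "x", 2: "a", 3: "c"}

# One line per node: f(n) <tab> g(n). The leading newline gives line 1 a
# preceding "\n" as well, so the pattern can anchor on it.
text = "\n" + "".join(f"{f[n]}\t{g[n]}\n" for n in sorted(f))

# Group 1 is the whole hit, group 2 is the value of f(n).
PAIR = re.compile(r"(?=(\n([^\t\n]*)(?=\t).*?\t\2\n))", re.DOTALL)

# Offsets of all newlines, used to turn character offsets into line numbers.
newlines = [i for i, ch in enumerate(text) if ch == "\n"]

pairs = []
for match in PAIR.finditer(text):
    n = bisect_right(newlines, match.start(1))        # line where the hit starts
    m = bisect_right(newlines, match.end(1) - 1) - 1  # line where the hit ends
    pairs.append((n, m))

print(pairs)  # [(1, 2), (3, 3)]
```

For each n this yields only the nearest line m at or after n with g(m) = f(n); pairs with m before n, or further matches for the same n, would need a second pass. Whether this beats iterating over nodes is untested here, since the lazy scan can become quadratic on large files.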

jan-niestadt Dec 14, 2020

Yes, that's a good idea. Unfortunately, the regex engine in Lucene is a bit limited and doesn't support backreferences. These kinds of features were probably left out to keep things as fast as possible, but it is sometimes annoying.

dirkroorda Dec 14, 2020
Maintainer Author

But for this specific case we could not use the Lucene index anyway, so I guess I'll do this in Python, where backreferences work.

dirkroorda · 2021-03-23T08:16:39Z

dirkroorda
Mar 23, 2021
Maintainer Author

Currently I am building a phonological search interface for a modestly sized corpus of Neo-Aramaic texts.

The idea is that users can search the full text with regexes, but also have ways to specify CV patterns (consonant-vowel), articulation-place patterns (dental-labial-velar etc.), and more.

So I create full texts in different representations, remembering the mapping between character positions in each representation and the original character positions. Then users can write a query for each layer and the results will be intersected and highlighted.
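
The layering idea can be sketched as follows, in Python rather than the JavaScript of the actual app. Everything here is an assumption made for illustration: the vowel set, the choice to skip non-letters in the CV layer, the helper names, and the reading of "intersected" as an intersection of original character positions.

```python
import re

VOWELS = set("aeiou")  # hypothetical vowel inventory


def cv_layer(text):
    """Return the CV representation plus a map from CV positions to original positions."""
    chars, positions = [], []
    for i, ch in enumerate(text):
        if not ch.isalpha():
            continue  # skipped characters are why an explicit position map is needed
        chars.append("V" if ch.lower() in VOWELS else "C")
        positions.append(i)
    return "".join(chars), positions


def hit_positions(pattern, layer, positions=None):
    """Original character positions covered by matches of the pattern in a layer."""
    covered = set()
    for match in re.finditer(pattern, layer):
        for i in range(match.start(), match.end()):
            covered.add(i if positions is None else positions[i])
    return covered


text = "ba-kra ma"
cv, cv_pos = cv_layer(text)                  # "CVCCVCV" with its position map
full_hits = hit_positions(r"k\w+", text)     # query on the plain text layer
cv_hits = hit_positions(r"CCV", cv, cv_pos)  # query on the CV layer
both = sorted(full_hits & cv_hits)           # positions to highlight
print(both, "".join(text[i] for i in both))
```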

I use Text-Fabric to prepare a big blob of JSON data, and then I write a JavaScript program to do the search. The result will be a single-page web app that needs a big load of static data files (several MB).

0 replies
