Performance, Volumes, Intertextuality #64
Cythonize and platform wheels

How do we distribute a Cythonized TextPy? GitHub Actions seems a good option to achieve this. Advice appreciated.
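One possible route, sketched under the assumption that TextPy ships as a package of `.pyx` modules (the layout below is hypothetical): a plain Cython build in setup.py, with a tool like cibuildwheel producing the platform wheels inside a GitHub Actions matrix:

```python
# setup.py -- minimal sketch; assumes a hypothetical textpy/ package of .pyx modules
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="textpy",
    packages=["textpy"],
    ext_modules=cythonize(
        "textpy/*.pyx",  # compile every Cython module in the package
        compiler_directives={"language_level": "3"},
    ),
)
```

Running cibuildwheel against such a setup in a GitHub Actions matrix job yields binary wheels for Linux, macOS, and Windows without maintaining any build machines.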
Volume support

Many corpora have a division into volumes. Sometimes the order of volumes is fixed: those of the General Missives have a chronological order. In a Text-Fabric dataset, the text of an entire corpus is totally ordered. If the corpus contains many works/volumes, the total ordering introduces an artefact that is not meaningful in the real-world corpus. It would be better to break up such a corpus into its parts.

What do we lose? Queries that look for patterns involving material from multiple volumes. The results of these queries are not simply the union of the results obtained in the individual volumes.

What do we gain? Most queries do not involve patterns across volumes. Because they run on much smaller parts, they will be much faster.

What do we need? To make this user friendly, we need machinery to dispatch queries over multiple volumes and to collect and combine the results in a handy way (a sketch follows below). For queries that do look for patterns involving material from multiple volumes, we have to invent primitives that handle them in a user-friendly and efficient way. The most user-friendly way is to let volumes be part of query templates just like other nodes, then split the queries into parts that run in the individual volumes, and then search the combinations of results for the ones that satisfy the inter-volume requirements. It looks a bit like the quantifiers in search.

What else do we need? We also need handy functions to load a multi-volume corpus, probably with a facility that loads a volume on demand and discards volumes if limited memory requires it. And we need to give the user easy-to-use handles on the API of each individual volume.
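As an impression of the dispatch part, here is a rough sketch against the current Fabric loader. The names `dispatch` and `volume_dirs` are made up for illustration, not an existing API, and a layout of one TF dataset per volume is assumed:

```python
# Rough sketch, not TF API: run one search template per volume, pool the hits.
from tf.fabric import Fabric

def dispatch(template, volume_dirs, features="otype"):
    """Run the template in every volume and collect results per volume."""
    results = {}
    for name, location in volume_dirs.items():
        TF = Fabric(locations=location)  # load this volume as its own dataset
        api = TF.load(features)          # only the features the template needs
        results[name] = list(api.S.search(template))
    return results

hits = dispatch(
    "word",  # illustrative template
    {"vol1": "~/corpus/vol1/tf", "vol2": "~/corpus/vol2/tf"},
)
```

Combining per-volume hits into cross-volume results is the part that still needs real design work, as described above.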
Intertextuality

While splitting up a corpus into volumes is an intra-textuality matter, we also want to support research between completely different works: inter-textuality. The same machinery that loads the volumes of a dataset might also be used to load multiple datasets (with multiple volumes per dataset). The same memory swapping that we use for volumes can be used again in this context. The next thing would then be functions for finding parallel passages between datasets, and collation functions between datasets that are variants of each other. Rather than coding collation functions, I would like to interface with Ronald Dekker's CollateX. CollateX yields a variant graph of a multi-text. We need to represent that graph in the world of a set of Text-Fabric datasets, or maybe a set of TextPy corpus volumes.
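On the collation side, the Python port of CollateX can already be driven directly. A minimal sketch, with witness strings inlined for brevity where in practice they would be extracted from the TF datasets (e.g. via `T.text()`):

```python
# Minimal sketch: collate two witnesses with the collatex package.
from collatex import Collation, collate

collation = Collation()
collation.add_plain_witness("A", "the quick brown fox jumped over the dog")
collation.add_plain_witness("B", "the brown fox jumped over the lazy dog")

# The alignment table is derived from the variant graph; it is that
# graph that would need a representation in Text-Fabric terms.
print(collate(collation, output="table"))
```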
Text-Fabric's position in the text manipulation landscape

Where does TF situate itself? Close neighbors are Pandoc, NLTK, and Passim.

TF is ideal for richly annotated texts that are too large to close-read manually, but that still require hands-on analysis: right at the cusp of human-readable and machine-readable. In this sense the comparison with Pandas is very apt. Much like Pandas, it leverages its power by loading the entire dataset into memory. I am surprised to learn that TF struggles when reaching millions of rows. That should be fixed, yes, but it is only fair to cap this and from there on rely on chunking. What Pandas does better is that it handles a variety of file formats and directly gives power to the user. TF is ideal for long-form texts. This stands in flagrant opposition to the omnipresent snippets of text that the digital world consists of. How would TF position itself in this regard? Should it be the right tool to analyze social media? Why, why not?
Text-Fabric as the NumPy/Pandas for text?
NumPy and Pandas are high-performance libraries for a specific and yet generic type of data structure: the multi-dimensional array and the 2-dimensional data frame, respectively.
Text-Fabric deals with the specific and yet generic data structure text, where text is taken as a graph with a bit of extra structure: nodes are natural numbers; the first N nodes are the textual positions, in that order; the remaining nodes are subsets of textual positions; nodes can be annotated by mapping them to values; and pairs of nodes can also be annotated, which turns them into edges.
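Spelled out in plain Python for a toy corpus of four slots, the model amounts to no more than this (all names and values here are illustrative):

```python
# Toy illustration of the data model described above, not TF code itself.
max_slot = 4  # nodes 1..4 are the textual positions, in reading order

# Non-slot nodes are subsets of textual positions; here node 5 spans
# slots 1-2 and node 6 spans slots 3-4 (say, two phrases).
oslots = {5: {1, 2}, 6: {3, 4}}

# A node feature maps nodes to values ...
word = {1: "in", 2: "the", 3: "beginning", 4: "created"}

# ... and annotating a pair of nodes turns that pair into an edge.
mother = {(6, 5): "dependency"}
```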
The question is: is Text-Fabric the NumPy/Pandas of text?
There are reasons that stand in the way, chief among them performance (see b below).
There is also an area that is not covered by Text-Fabric, and that is important in the research of texts:
It is easy to use Text-Fabric to load multiple datasets, but the efficient handling of these datasets in one program is the sole responsibility of the Text-Fabric user. There is no built-in support for that.
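For instance, loading two corpora side by side already works with two `use` calls (the app names below are assumed to be available); everything beyond that, including keeping both within memory bounds, is left to the user:

```python
# Two datasets in one program: possible, but entirely unmanaged.
from tf.app import use

A_bible = use("bhsa")         # Hebrew Bible
A_letters = use("missieven")  # General Missives (app name assumed)

# Each handle carries its own API; nothing coordinates the two:
print(A_bible.api.F.otype.maxSlot, A_letters.api.F.otype.maxSlot)
```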
There is an intrinsic link between performance and intra-textuality: if we have good support for intra-textuality, we can split a large text up into volumes and treat them as multiple Text-Fabric datasets, yet somehow as members of a family.
This idea could be pursued as follows.
a. Split Text-Fabric
TextPy: the core part that deals with the textual data model. Template-based search is included too.
Text-Fabric: everything else. It will import TextPy as the engine.
There is a parallel with NumPy/Pandas: Pandas uses NumPy as the data engine, and sugars it with functions that realise the data frame.
Text-Fabric uses TextPy as data engine, and sugars it with functions that realise real-world corpus analysis.
b. Enhance the performance of TextPy
The functions in TextPy are number crunchers. They can be Cythonized. If done properly, this can speed up the code and lower the memory footprint.
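As an impression of what such a Cythonized cruncher could look like (a hypothetical function, not actual TextPy code), static C types on the hot loop remove per-element Python object overhead:

```python
# cython: language_level=3
# Hypothetical .pyx sketch: count the slots in a node range whose
# integer-coded feature value equals a given code.
cimport cython

@cython.boundscheck(False)
def count_value(long[:] feature_codes, long start, long end, long code):
    cdef long n = 0
    cdef long i
    for i in range(start, end):
        if feature_codes[i] == code:
            n += 1
    return n
```

Compiled, the loop runs at C speed over a typed memoryview instead of boxing every value into a Python int.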
The current Text-Fabric works very well for the Hebrew Bible (ca. 1,000 pages, 500,000 words, many features); it works decently for the General Missives (ca. 10,000 pages, 5,000,000 words, far fewer features). A Cythonized TextPy should work very well for the General Missives, and extremely well for the Hebrew Bible.
c. Add volume support to Text-Fabric
Even with better performance, there are elements in the search algorithm that do not scale linearly with the size of the corpus.
The same holds for tasks that need to compare pairs of passages in the corpus. They will benefit from performance gains, but not enough. They will benefit more from partitioning large corpora into volumes.
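A back-of-the-envelope calculation makes that concrete: with n passages there are n(n-1)/2 pairs to compare, but after splitting into k equal volumes only about n^2/(2k) intra-volume pairs remain:

```python
# Pairwise comparisons: whole corpus vs. within-volume only.
n, k = 100_000, 10  # passages, volumes

whole = n * (n - 1) // 2                         # all pairs in one corpus
per_volume = k * ((n // k) * (n // k - 1) // 2)  # pairs within each volume

print(whole, per_volume, round(whole / per_volume, 2))  # roughly 10x fewer
```

Cross-volume pairs would then fall to the (future) inter-volume primitives described above.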