Performance, Volumes, Intertextuality #64
Cythonize and platform wheels

How do we distribute a Cythonized TextPy? GitHub Actions seems a good option to achieve this. Advice appreciated.
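One possible route, sketched under the assumption that TextPy ships as a package of `.pyx` modules (the layout below is hypothetical): a plain Cython build in setup.py, with a tool like cibuildwheel producing the platform wheels inside a GitHub Actions matrix:

```python
# setup.py -- minimal sketch; assumes a hypothetical textpy/ package of .pyx modules
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="textpy",
    packages=["textpy"],
    ext_modules=cythonize(
        "textpy/*.pyx",  # compile every Cython module in the package
        compiler_directives={"language_level": "3"},
    ),
)
```

Running cibuildwheel against such a setup in a GitHub Actions matrix job yields binary wheels for Linux, macOS, and Windows without maintaining any build machines.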
Volume support

Many corpora have a division into volumes. Sometimes the order of volumes is fixed: those of the General Missives have a chronological order. In a Text-Fabric dataset, the text of an entire corpus is totally ordered. If the corpus contains many works/volumes, the total ordering introduces an artefact that is not meaningful in the real-world corpus. It would be better to break up such a corpus into its parts.

What do we lose? Queries that look for patterns involving material from multiple volumes. The results of these queries are not simply the union of the results obtained in the individual volumes.

What do we gain? Most queries do not involve patterns across volumes. Because they run on much smaller parts, they will be much faster.

What do we need? To make this user friendly, we need machinery to dispatch queries over multiple volumes and to collect and combine the results in a handy way (a sketch follows below). For queries that do look for patterns involving material from multiple volumes, we have to invent primitives that handle them in a user-friendly and efficient way. The most user-friendly way is to let volumes be part of query templates just like other nodes, then split the queries into parts that run in the individual volumes, and then search the combinations of results for the ones that satisfy the inter-volume requirements. It looks a bit like the quantifiers in search.

What else do we need? We also need handy functions to load a multi-volume corpus, probably with a facility that loads a volume on demand and discards volumes if limited memory requires it. And we need to give the user easy-to-use handles on the API of each individual volume.
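As an impression of the dispatch part, here is a rough sketch against the current Fabric loader. The names `dispatch` and `volume_dirs` are made up for illustration, not an existing API, and a layout of one TF dataset per volume is assumed:

```python
# Rough sketch, not TF API: run one search template per volume, pool the hits.
from tf.fabric import Fabric

def dispatch(template, volume_dirs, features="otype"):
    """Run the template in every volume and collect results per volume."""
    results = {}
    for name, location in volume_dirs.items():
        TF = Fabric(locations=location)  # load this volume as its own dataset
        api = TF.load(features)          # only the features the template needs
        results[name] = list(api.S.search(template))
    return results

hits = dispatch(
    "word",  # illustrative template
    {"vol1": "~/corpus/vol1/tf", "vol2": "~/corpus/vol2/tf"},
)
```

Combining per-volume hits into cross-volume results is the part that still needs real design work, as described above.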
Intertextuality

While splitting up a corpus into volumes is an intra-textuality matter, we also want to support research between completely different works: inter-textuality. The same machinery that loads the volumes of a dataset might also be used to load multiple datasets (with multiple volumes per dataset). The same memory swapping that we use for volumes can be used again in this context. The next thing would then be functions for finding parallel passages between datasets, and collation functions between datasets that are variants of each other. Rather than coding collation functions, I would like to interface with Ronald Dekker's CollateX. CollateX yields a variant graph of a multi-text. We need to represent that graph in the world of a set of Text-Fabric datasets, or maybe a set of TextPy corpus volumes.
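On the collation side, the Python port of CollateX can already be driven directly. A minimal sketch, with witness strings inlined for brevity where in practice they would be extracted from the TF datasets (e.g. via `T.text()`):

```python
# Minimal sketch: collate two witnesses with the collatex package.
from collatex import Collation, collate

collation = Collation()
collation.add_plain_witness("A", "the quick brown fox jumped over the dog")
collation.add_plain_witness("B", "the brown fox jumped over the lazy dog")

# The alignment table is derived from the variant graph; it is that
# graph that would need a representation in Text-Fabric terms.
print(collate(collation, output="table"))
```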
Text-Fabric's position in the text manipulation landscape

Where does TF situate itself? Close neighbors are Pandoc, NLTK, and Passim.

TF is ideal for richly annotated texts that are too large to close-read manually, but that still require hands-on analysis: right at the cusp of human-readable and machine-readable. In this sense the comparison with Pandas is very apt. Much like Pandas, it leverages its power by loading the entire dataset into memory. I am surprised to learn that TF struggles when reaching millions of rows. That should be fixed, yes, but it is only fair to cap this and from there on rely on chunking. What Pandas does better is that it handles a variety of file formats and directly gives power to the user. TF is ideal for long-form texts. This stands in flagrant opposition to the omnipresent snippets of text that the digital world consists of. How would TF position itself in this regard? Should it be the right tool to analyze social media? Why, why not?
Text-Fabric as the NumPy/Pandas for text?
NumPy and Pandas are high-performance libraries for a specific and yet generic type of data structure: the multi-dimensional array and the 2-dimensional data frame, respectively.
Text-Fabric deals with the specific and yet generic data structure text, where text is taken as a graph with a bit of extra structure: nodes are natural numbers; the first N nodes are the textual positions, in that order; the remaining nodes are subsets of textual positions; nodes can be annotated by mapping them to values; and pairs of nodes can also be annotated, which turns them into edges.
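Spelled out in plain Python for a toy corpus of four slots, the model amounts to no more than this (all names and values here are illustrative):

```python
# Toy illustration of the data model described above, not TF code itself.
max_slot = 4  # nodes 1..4 are the textual positions, in reading order

# Non-slot nodes are subsets of textual positions; here node 5 spans
# slots 1-2 and node 6 spans slots 3-4 (say, two phrases).
oslots = {5: {1, 2}, 6: {3, 4}}

# A node feature maps nodes to values ...
word = {1: "in", 2: "the", 3: "beginning", 4: "created"}

# ... and annotating a pair of nodes turns that pair into an edge.
mother = {(6, 5): "dependency"}
```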
The question is: is Text-Fabric the NumPy/Pandas of text?
There are reasons that stand in the way, chief among them performance (see b below).
There is also an area that is not covered by Text-Fabric, and that is important in the research of texts:
It is easy to use Text-Fabric to load multiple datasets, but the efficient handling of these datasets in one program is the sole responsibility of the Text-Fabric user. There is no built-in support for that.
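For instance, loading two corpora side by side already works with two `use` calls (the app names below are assumed to be available); everything beyond that, including keeping both within memory bounds, is left to the user:

```python
# Two datasets in one program: possible, but entirely unmanaged.
from tf.app import use

A_bible = use("bhsa")         # Hebrew Bible
A_letters = use("missieven")  # General Missives (app name assumed)

# Each handle carries its own API; nothing coordinates the two:
print(A_bible.api.F.otype.maxSlot, A_letters.api.F.otype.maxSlot)
```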
There is an intrinsic link between performance and intra-textuality: if we have good support for intra-textuality, we can split a large text up into volumes and treat them as multiple Text-Fabric datasets, yet somehow as members of a family.
This idea could be pursued as follows.
a. Split Text-Fabric
TextPy: the core part that deals with the textual data model. Template-based search is included too.
Text-Fabric: everything else. It will import TextPy as the engine.
There is a parallel with NumPy/Pandas: Pandas uses NumPy as the data engine, and sugars it with functions that realise the data frame.
Text-Fabric uses TextPy as data engine, and sugars it with functions that realise real-world corpus analysis.
b. Enhance the performance of TextPy
The functions in TextPy are number crunchers. They can be Cythonized. If done properly, this can speed up the code and lower the memory footprint.
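As an impression of what such a Cythonized cruncher could look like (a hypothetical function, not actual TextPy code), static C types on the hot loop remove per-element Python object overhead:

```python
# cython: language_level=3
# Hypothetical .pyx sketch: count the slots in a node range whose
# integer-coded feature value equals a given code.
cimport cython

@cython.boundscheck(False)
def count_value(long[:] feature_codes, long start, long end, long code):
    cdef long n = 0
    cdef long i
    for i in range(start, end):
        if feature_codes[i] == code:
            n += 1
    return n
```

Compiled, the loop runs at C speed over a typed memoryview instead of boxing every value into a Python int.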
The current Text-Fabric works very well for the Hebrew Bible (ca. 1,000 pages, 500,000 words, many features); it works decently for the General Missives (ca. 10,000 pages, 5,000,000 words, far fewer features). A Cythonized TextPy should work very well for the General Missives, and extremely well for the Hebrew Bible.
c. Add volume support to Text-Fabric
Even with better performance, there are elements in the search algorithm that do not scale linearly with the size of the corpus.
The same holds for tasks that need to compare pairs of passages in the corpus. They will benefit from performance gains, but not enough. They will benefit more from partitioning large corpora into volumes.
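A back-of-the-envelope calculation makes that concrete: with n passages there are n(n-1)/2 pairs to compare, but after splitting into k equal volumes only about n^2/(2k) intra-volume pairs remain:

```python
# Pairwise comparisons: whole corpus vs. within-volume only.
n, k = 100_000, 10  # passages, volumes

whole = n * (n - 1) // 2                         # all pairs in one corpus
per_volume = k * ((n // k) * (n // k - 1) // 2)  # pairs within each volume

print(whole, per_volume, round(whole / per_volume, 2))  # roughly 10x fewer
```

Cross-volume pairs would then fall to the (future) inter-volume primitives described above.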