This is a corpus search application that works with BlackLab Server. At the Dutch Language Institute, we use it to publish our corpora such as CHN (CLARIN login required), Letters as Loot and AutoSearch (CLARIN login required).
Help is contained in the application in the form of a page guide that can be opened by clicking the button on the right of the page.
- Java 11 (Jdk up to 15 will work, 16+ might cause some reflection errors, we'll fix these once BlackLab migrates to
jakarta
) - A java servlet container such as Apache Tomcat.
Use Tomcat 7 version
7.0.76
or newer or Tomcat 8 version8.0.42
or newer. Using older versions will cause some warnings from dependencies. Tomcat9
also works,10
is not supported (due to thejavax
->jakarta
migration). This will be fixed onceSOLR
, used by BlackLab migrates tojakarta
. - An instance of BlackLab-Server.
While we do our best to make the frontend work with older versions of BlackLab, use a matching version of BlackLab (so
corpus-frontend v2.0
withblacklab-server v2.0
).
Releases can be downloaded here.
- Clone this repository, use
mvn package
to build the WAR file (or download the .war from the latest release) and add corpus-frontend.war to Tomcat's webapps directory. - Optionally, create a config file for the frontend. See the Configuration section for more information.
- Navigate to
http://localhost:8080/corpus-frontend/
and you will see a list of available corpora you can search.
For further development and debugging help, see the Development section.
Make sure you enable BuildKit
(e.g. export DOCKER_BUILDKIT=1
) before building the image.
To create a container with BlackLab Frontend and Server, run:
docker-compose up --build
The config file ./docker/config/corpus-frontend.properties
will be mounted inside the container.
(if you need to change some settings, you can set the CONFIG_PATH
environment variable to read
corpus-frontend.properties
from a different directory).
If you have an indexed BlackLab corpus that you want to access, you can set CORPUS_DIR
to
this directory and CORPUS_NAME
to the name this corpus should have, e.g.:
CORPUS_DIR="/tmp/mycorpus" CORPUS_NAME="my-awesome-corpus" docker-compose up --build
See next section for how to configure BlackLab Frontend.
Corpus-Frontend is configured using a properties
file.
The application will normally look for this file in the same places as BlackLab. That is, the following locations, starting from the top:
BLACKLAB_CONFIG_DIR
environment variable (configure this in your servlet container, e.g. Tomcat)$HOME/.blacklab
(Linux) or%USERDIR%/.blacklab
(Windows)/etc/blacklab
(Linux)
The file name must be the same as the context path
of the corpus-frontend application.
That's the URL under which the corpus-frontend is reachable in the browser.
Often, if you don't configure the context path
, the context path
will be the name of the .war
file.
Examples:
- for
corpus-frontend.war
->/corpus-frontend
in browser ->corpus-frontend.properties
in any of the above locations - for
my-frontend.war
->/my-frontend
in browser ->my-frontend.properties
file name - for
/test/corpus-frontend
in browser, the file should be in thetest/corpus-frontend.properties
dir in above locations. - when you are behind a proxy, use the
context path
of proxied application, for exampleProxyPass "/frisian-corpora" "http://host:port/TEST/corpus-frontend"
/frisian-corpora
in browser -> proxied to/TEST/corpus-frontend
-> useTEST/corpus-frontend.properties
Example file (most values shown here are the default values):
# The url under which the back-end can reach blacklab-server.
# Separate from the front-end to allow connections for proxy situations
# where the paths or ports may differ internally and externally.
blsUrl=http://localhost:8080/blacklab-server/
# The url under which the client can reach blacklab-server.
blsUrlExternal=/blacklab-server/
# The url under which the client can reach the corpus-frontend.
# May be needed if the corpus-frontend is behind a proxy that changes the url.
# This setting actually defaults to the contextPath of the servlet, so this is just an example.
cfUrlExternal=/corpus-frontend/
# Optional directory where you can place files to further configure and customize
# the interface on a per-corpus basis.
# Files should be placed in a directory with the name of your corpus, e.g. files
# for a corpus 'MyCorpus' should be placed under 'corporaInterfaceDataDir/MyCorpus/'.
corporaInterfaceDataDir=/etc/blacklab/projectconfigs/
# Optional directory for default/fallback settings across all your corpora.
# The name of a directory directly under the corpusInterfaceDataDir.
# Files such as the help and about page will be loaded from here
# if they are not configured/available for a corpus.
# If this directory does not exist or is not configured,
# we'll use internal fallback files for all essential data.
corporaInterfaceDefault=default
# Path to frontend javascript files (can be configured to aid development, e.g.
# loading from an external server so the java web server does not need
# to constantly reload, and hot-reloading/refreshing of javascript can be used).
jspath=/corpus-frontend/js
# An optional banner message that shows above the navbar.
# It can be hidden by the user by clicking an embedded button, and stores a cookie to keep it hidden for a week.
# A new banner message will require the user to explicitly hide it again.
# Simply remove this property to disable the banner.
bannerMessage=<span class="fa fa-exclamation-triangle"></span> Configure this however you see fit, HTML is allowed here!
# Disable xslt and search.xml caching, useful during development.
cache=true
# Show or hide the debug info checkbox in the settings menu on the search page.
# N.B. The debug checkbox will always be visible when using webpack-dev-server during development.
# It can also be toggled by calling `debug.show()` and `debug.hide()` in the browser console.
debugInfo=false
# Set the "withCredentials" option for all ajax requests made from the client to the (blacklab/frontend)-server.
# Passes authentication cookies to blacklab-server.
# This may be required if your server is configured to use authentication.
# NOTE: this only works if the frontend and backend are hosted on the same domain, or when the server does not pass "*" for the Access-Control-Allow-Origin header.
withCredentials=false
# Make the server side of corpus-frontend pass some authentication headers to BlackLab
# The following property is proxied to BlackLab
# In this case, the Authorization header, which will be sufficient for most needs (basic auth, oauth2, oidc)
# When running behind something like oauth2-proxy, you could set these to x-forwarded-email for example, to pass along the email header from corpus-frontend to BlackLab (BlackLab will need its AuthSystem to be configured to use this header as well, of course)
auth.source.name=Authorization
auth.source.type=header
auth.target.name=Authorization
auth.target.type=header
Corpora may be added manually or uploaded by users (if configured).
After a corpus has been added, the corpus-frontend will automatically detect it, a restart should not be required.
To allow this, BlackLab needs to be configured properly (user support needs to be enabled and user directories need to be configured). See the BlackLab documentation.
When BlackLab is properly configured, two new sections will appear on the main corpus overview page. They allow you to define your own configurations to customize how blacklab will index your data, create private corpora (up to 10 by default, but can be customized in BlackLab), and add data to them.
Per corpus configuration is not supported for user corpora created through the Corpus-Frontend.
This means adding directories for user corpora in corporaInterfaceDataDir
won't work.
Out of the box, users can create corpora and upload data in any of the formats supported by BlackLab (tei
, folia
, chat
, tsv
, plaintext
and more).
In addition, users can also define their own formats or extend the builtin formats.
There is also a hidden/experimental page (/corpus-frontend/upload/
) for externally linking to the corpus-frontend to automatically index a file from the web.
It can be used it to link to the frontend from external web services that output indexable files.
It requires user uploading to be enabled, and there should be a cookie/query parameter present to configure the user name (depending on how BlackLab's authentication is configured, the frontend doesn't care and just passes everything along).
Parameters are passed as query parameters:
file=http://my-service.tld/my-file.zip
# optional
format=folia
# optional
corpus=my-corpus-name
If the user does not own a corpus with this name yet, it's automatically created. After indexing is complete, the user is redirected to the search page.
Customization options have gradually grown over the years, and have become a little cumbersome. We're aware and we'll eventually improve on this.
As an admin you can customize various aspects of the frontend, such as what data is shown in the results table, which filters are shown, etc.
Corpora can be individually customized/configured.
Corpora uploaded by users cannot be individually configured, user corpora will however use the default set of customizations if they exist.
- Create the main customization directory. The default location is
/etc/blacklab/projectconfigs/
on linux, andC:\\etc\blacklab\projectconfigs\
on windows. It can be changed by changing thecorporaInterfaceDataDir
setting incorpus-frontend.properties
. - (Optionally) create a default configuration: create a directory
default/
inside the config dir. It can be changed by changing thecorporaInterfaceDefault
setting incorpus-frontend.properties
. - Create a separate directory for every corpus you want to configure. The names should be equal to the ID of the corpus in
BlackLab
. - In the corpus' directory, create a
static/
dir, files in this this will be available in the browser undercorpus-frontend/my_corpus/static/...
. This directory can be used for custom css and js, or whatever other files you need.
You should be left with the following directory structure:
etc/projectConfigs/ # the location set in the corporaInterfaceDataDir setting
corpus-1/
search.xml # see below.
help.inc # see below.
about.inc # see below.
article.xsl # see below.
meta.xsl # see below.
static/
# files needed by the website go here.
custom.css
custom.search.js # see below for what you can do in javascript.
custom.article.js
corpus-2/
...
default/ # the name set in the corporaInterfaceDefault setting
# fallbacks/default configuration goes here
...
Good to know: customization and
static
files 'overlay' each other, the frontend will check the following locations in order, using the first location where a file is found:
- the directory of the corpus itself (corporaInterfaceDataDir)
- The default dir
- Inside the WAR
Let's perform a simple customization that will take you through the steps, adding a custom javascript file and change the displayed title of your documents.
- Follow the steps above to create the config directory for your corpus. I'll assume you left the config directory at its default location of
/etc/projectsconfigs/
and your corpus is calledexample
in the following steps. Use your custom paths if necessary. - Copy the default search.xml into
etc/projectconfigs/example/search.xml
. - Add a config option to include a custom script on the
search
page:<CustomJs page="search">${request:corpusPath}/static/js/custom.search.js</CustomJs>
- Create a matching javascript file
/etc/projectconfigs/example/static/js/custom.search.js
- Add the following snippet to your
custom.search.js
vuexModules.ui.getState().results.shared.getDocumentSummary = function(metadata, specialFields) { return 'This is everything we know about the document: ' + JSON.stringify(metadata); }
- Now restart your server and perform a search in your corpus, and see the new titles! http://localhost:8080/corpus-frontend/example/search/docs?patt=""
NOTE: You don't need to restart the application constantly, simply set
cache=false
in the maincorpus-frontend.properties
config file to disable caching of files by the server.
search.xml
Allows you to (among other things) set the navbar links and inject custom css/js, or enable/disable pagination in documents. The default configuration details all options.help.inc
,about.inc
Html content placed in the body of theMyCorpus/about/
and theMyCorpus/help/
pages. the default configuration.xsl
files These are used to transform documents in your corpus from XML into HTML for display in theMyCorpus/docs/some_doc/
page. Two files can be placed here:article.xsl
, the most important one, for your document's content. A small note: if you'd like to enable tooltips displaying more info on the words of your corpus, you can use thedata-tooltip-preview
(when just hovering) anddata-tooltip-content
(when clicking on the tooltip) attributes on any html element to create a tooltip. Alternatively if you don't want to write your own html, you can usetitle
for the preview, combined with one or moredata-${valueName}
to create a little table.title="greeting" data-lemma="hi" data-speaker="jim"
will create a preview containinggreeting
which can be clicked on to show a table with the details`.meta.xsl
for your document's metadata (shown under the metadata tab on the page) Note: this stylesheet does not receive the raw document, but rather the results of/blacklab-server/docs/${documentId}
, containing only the indexed metadata.
static/
A sandbox where you can place whatever other files you may need, such as custom js, css, fonts, logo's etc. These files are public, and can be accessed throughMyCorpus/static/path/to/my.file
.
The term
format
refers to the*.blf.yaml
or*.blf.json
file used to index data into the corpus.
Because the format
specifies the shape of a corpus (which metadata and annotations a corpus contains, what type of data they hold, and how they are related), it made sense to let the format
specify some things about how to display them in the interface.
NOTE: These properties need to be set before the first data is added to a corpus, editing the format config file afterwards will not work (though if you know what your are doing, you can edit the indexmetadata.yaml
or indexmetadata.json
file by hand and perform the change that way).
-
Change the text direction of your corpus
corpusConfig: textDirection: "rtl"
This will change many minor aspects of the interface, such as the order of columns in the result tables, the ordering of sentences in concordances, the text direction of dropdowns, etc. NOTE: This is a somewhat experimental feature. If you experience issues or want to see improvements, please create an issue!
Special thanks to Janneke van der Zwaan!
-
Group annotations into sub sections
corpusConfig: annotationGroups: contents: - name: Basics annotations: - word - lemma - pos_head - name: More annotations addRemainingAnnotations: true
The order of the annotations will be reflected in the interface.
NOTE: When using annotation groups, fields not in any group will be hidden everywhere! (unless manually re-added through
customjs
). This includes at least the following places:Explore/N-grams
Explore/Statistics
Search/Extended
Search/Advanced
Per Hit - in the results table
Per Hit/Group by
Per Hit/Sort by
-
Group metadata into sub sections
corpusConfig: metadataFieldGroups: - name: Corpus/collection fields: - Corpus_title - Collection - Country
The order of the fields & groups will be reflected in the interface.
NOTE: When using metadata groups, fields not in any group will be hidden everywhere! (unless manually re-added through
customjs
). This includes at least the following places:Explore/Corpora
Filter
Per Hit/Group by
Per Hit/Sort by
Per Document/Group by
Per Document/Sort by
-
Designate special metadata fields
corpusConfig: specialFields: pidField: id titleField: Title authorField: AuthorNameOrPseudonym dateField: PublicationYear
These fields will be used to format document title rows in the results table (unless overridden: see customizing document titles)
-
Change the type of an annotation
annotatedFields: contents: annotations: - name: "word" valuePath: "folia:t" uiType: "text"
-
Text (default)
-
Select/Dropdown Select is automatically enabled when the field does not have a uiType set, and all values are known. NOTE: Limited to
500
values! When you specifyselect
, we need to know all values beforehand, BlackLab only stores the first500
values, and ignores values longer than256
characters. When this happens, we transform the field into acombobox
for you, so you don't inadvertently miss any options. -
Combobox/Autocomplete Just like
text
, but add a dropdown that gets autocompleted values from the server. -
POS (Part of speech) Not supported for simple search This is an extension we use for corpora with split part of speech tags. It's mostly meant for internal use, but with some knowhow it can be configured for any corpus with detailed enough information. You will need to write a json file containing a
tagset
definition.type Tagset = { /** * All known values for this annotation. * The raw values can be gathered from blacklab * but displaynames, and the valid constraints need to be manually configured. */ values: { [key: string]: { value: string; displayName: string; /** All subannotations that can be used on this type of part-of-speech */ subAnnotationIds: Array<keyof Tagset['subAnnotations']>; } }; /** * All subannotations of the main annotation * Except the displayNames for values, we could just autofill this from blacklab. */ subAnnotations: { [key: string]: { id: string; /** The known values for the subannotation */ values: Array<{ value: string; displayName: string; /** Only allow/show this specific value for the defined main annotation values (referring to Tagset['values'][key]) */ pos?: string[]; }>; }; }; };
Then, during page initialization, the tagset will have to be loaded by calling
vuexModules.tagset.actions.load('http://localhost:8080/corpus-frontend/my-corpus/static/path/to/my/tagset.json');
This has to happen before $(document).ready fires! The reason for this is that the tagset you load determines how the page url is decoded, which is done when on first render. -
Lexicon (Autocomplete for Historical Dutch) Another extension we have developed for our own use. It aims to help users with searching in historical dutch texts. Whenever a word is entered, after a while, historical variants/spellings for that word are retrieved from our lexicon service. Those variants are then filtered against the corpus to see which actually occur. The user can then select if and with which of those variants he/she wishes to search.
-
-
Change the type of a metadata field
metadata: fields: - name: year uiType: range
-
Text (default)
-
Select Select is automatically enabled when the field does not have a uiType set, and all values are known. NOTE: Limited to
50
values (instead of500
- though unlike with annotations, this can be configured, see here for more information)! When you specifyselect
, we need to know all values beforehand, BlackLab only stores the first50
values, and ignores values longer than256
characters. When this happens, we transform the field into acombobox
for you, so you don't inadvertently miss any options. -
Combobox Just like
text
, but add a dropdown that gets autocompleted values from the server. -
Checkbox Predictably, transforms the dropdown into a checkbox selection. NOTE: The same limitations apply as with
select
. -
Radio Like checkbox, but allow only one value. NOTE: The same limitations apply as with
select
. -
Range Use two inputs to specify a range of values (usually for numeric fields, but works for text too).
-
Multi-field Range (custom js only!) This is a special purpose field that can be used if your documents contain metadata describing a value in a range.
For example: your documents have an unknown date of writing, but the date of writing definitely lies in some known range, for example between 1900-1950. This data is stored in two fields;
date_lower
anddate_upper
.vuexModules.filters.actions.registerFilter({ filter: { componentName: 'filter-range-multiple-fields', description: 'Filters documents based on their date range', displayName: 'Date text witness', groupId: 'Date', // The filter tab under which this should be placed, missing tabs will be created id: 'my-date-range-filter', // a unique id for internal bookkeeping metadata: { // Info the widget needs to do its work low: 'date_lower', // the id of the metadata field containing the lower bound high: 'date_upper', // the id of the metadata field containing the upper bound mode: null, // allowed values: 'strict' and 'permissive'. When this is set, hides the 'strictness' selector, and forces strictness to the set mode } }, // Set the id of another filter here to append this filter behind that one. // Undefined will add the filter at the top. precedingFilterId: undefined });
The
Permissive
/Strict
mode (see image) toggles whether to match documents that merely overlap with the provided range, or only documents that fall fully within the range. E.G.: the document hasdate_lower=1900
anddate_upper=1950
. The query is1900-1910
, this matches when using Permissive (as the values overlap somewhat), while Strict would not match, as the document's actual value could also be outside this range. To also match using Strict, the query would have to be at least1899-1951
. -
Date A special filter for dates. It works much like the
Multi-field Range
filter, only for full dates instead of only a single number.It can work on either one metadata field or two separate metadata fields (for when the exact date is unknown, but an earliest and latest possible date is known).
During setup, dates can be configured four ways:
- A string in the format of
yyyymmdd
. January is 01. - A string in the format of
yyyy-mm-dd
. January is 01. - A javascript
Date
object - A javascript object in the shape of
{y: '1900', m: '01', d: '01}
Config as follows:
vuexModules.filters.actions.registerFilter({ filter: { componentName: 'filter-date', // description: 'Filters documents based on their date range', // displayName: 'Date text witness', // groupId: 'Date', // The filter tab under which this should be placed, missing tabs will be created id: 'my-date-range-filter', // a unique id for internal bookkeeping // Setup using only one metadata field: metadata: { field: 'date', // the id of the metadata field containing the date // Optional: range: boolean, min: '1900-01-01', // see accepted values above. max: '1900-01-01', // see accepted values above. }, // Setup using two metadata fields: metadata: { from_field: 'date_start', // id of the metadata field for the start date in your documents to_field: 'date_end', // id of the metadata field for the end date in your documents mode: 'strict'|'permissive', // default mode. // Optional: range: boolean, min: '1900-01-01', // see accepted values above. max: '1900-01-01', // see accepted values above. } }, // Optional: ID of another filter in the same group before which to insert this filter, if omitted, the filter is appended at the end. insertBefore: 'some-other-filter' }
Tip: Use blacklab's
concatDate
process
concatDate
: concatenate 3 separate date fields into one, substituting unknown months and days with the first or last possible value. The output format is YYYYMMDD. Numbers are padded with leading zeroes.
Requires 4 arguments:
*yearField
: the metadata field containing the numeric year
*monthField
: the metadata field containing the numeric month (so "12" instead of "december" or "dec")
*dayField
: the metadata field containing the numeric day
*autofill
:start
to autofill missing month and day to the first possible value (01), orend
to autofill the last possible value (12 for months, last day of the month in that year for days - takes in to account leap years). This step requires that at least the year is known. If the year is not known, no output is generated. - A string in the format of
-
Custom javascript files can be included on any page by adding them to search.xml
NOTE: by default, your script will run on every page! Not all functions shown below are available everywhere! It is highly recommended to use multiple scripts, and only include them on a single page (by using the
<CustomJS page="..."/>
(see search.xml). All javascript should run before$(document).ready
unless otherwise stated.
Through javascript you can do many things, but outlined below are some of the more interesting/useful features on the /search/
page:
-
[Global] - Hide the page guide
vuexModules.ui.actions.global.pageGuide.enable(false)
-
[Global] - Configure which annotations & metadata fields are visible
This is just a shorthand method of configuring several parts of the UI. Individual features can also be configured one by one. Refer to the source code for more details (all of this module's exports are exposed under
window.vuexModules.ui
).First run (from the browser console)
printCustomJs()
. You will see (approximately) the following output. The printed javascript reflects the current settings of the page.var x = true; var ui = vuexModules.ui.actions; ui.helpers.configureAnnotations([ [ , 'EXTENDED' , 'ADVANCED' , 'EXPLORE' , 'SORT' , 'GROUP' , 'RESULTS' , 'CONCORDANCE' ], // Basics ['word' , x , x , x , x , x , , ], ['lemma' , x , x , x , , , , ], ['pos' , x , x , x , , , , ], // (not in any group) ['punct' , , , , , , , ], ['starttag' , , , , , , , ], ['word_id' , , , , , , , x ], // ... ]); ui.helpers.configureMetadata([ [ , 'FILTER' , 'SORT' , 'GROUP' , 'RESULTS/HITS' , 'RESULTS/DOCS' , 'EXPORT' ], // Date ['datering' , , x , x , , x , ], ['decade' , , x , x , , , ], // Localization ['country' , x , x , x , , , ], ['region' , x , x , x , , , ], ['place' , x , x , x , , , ], // ... ]);
You can now paste this code into your corpus's
custom.js
file and edit the cells.The
configureAnnotations
columns configure the following:- EXTENDED: Make the annotation appear in the
Extended
search tab. - ADVANCED: Make the annotation appear in the
Advanced/QueryBuilder
search tab. - EXPLORE: Make the annotation one of the options in the appear in the the
N-Gram
form. - SORT: Make this annotation one of the options in the
Sort by
dropdown (below thehits
results table). (Requires the forward index to be enabled for this annotation) - GROUP: Make this annotation available for grouping. This affects whether it is shown in the
Group by
dropdown options above the results table, and in theN-Gram
andStatistics
dropdowns in theExplore
forms. (Requires the forward index to be enabled for this annotation) - RESULTS: Make a separate column in the
hits
table that shows this annotation's value for every hit (typically enabled by default forlemma
andpos
). - CONCORDANCE: Show the value for this annotation when opening a hit's details in the
hits
table.
Similarly, these are the meanings of the columns in
configureMetadata
:- FILTER: Make this metadata field available as a filter. (Filters are shown for
Extended
,Advanced/QueryBuilder
,Expert
, and allExplore
forms. - SORT: Make this metadata field one of the options in the
Sort by
dropdown (below thePer hit
andPer document
results tables. - GROUP: Make this metadata field avaiable for grouping. This affects whether it is shown in the
Per hit
,Per document
, andStatistics
form underExplore
. - RESULTS/HITS: Show a column in the
Per hit
table detailing the metadata for every hit's document. - RESULTS/DOCS: Show a column in the
Per document
table with the document's metadata. - EXPORT: Whether to include this metadata in result exports.
Any columns that are completely empty (not one annotation/metadata is checked) are left at their default values. For most things this means everything in an annotation/metadata group is shown, unless no groups are defined, in which case everything not marked
isInternal
is shown.If you want to hide every single option in a category, use the dedicated function to configure that section of the ui and pass it an empty array.
- EXTENDED: Make the annotation appear in the
-
Group metadata into sections with more control
With customJS you can do two things you can't do with just config: - automatically activate filters when a filter tab is open - add subheadings in between filters.In this way you can divide your corpus into
sections
if you will, by giving every section its own tab in thefilters
section, and setting an automaticaly activating filter for that section.![](docs/img/ metadata_groups_with_sections.png)
type FilterGroup = { tabname: string; subtabs: Array<{ tabname?: string; fields: string[]; }>; query?: MapOf<string[]>; }
const filterState = vuexModules.filters.getState(); filterState.filterGroups = [{ tabname: 'My Filters' subtabs: [{ tabname: 'Title', // filters pertaining to titles fields: ['newspaper_title', 'section_title', 'article_title'] // names of metadata fields in your source material }, { tabname: 'Author', // filters pertaining to author... fields: ['author_name', 'author_birthdate'] }], query: { some_metadata_field: ['Some', 'Values'], // force the filter some_metadata_field to the value ['Some', 'Values'] while the tab 'My Filters' is opened // ... other fields here } }, { // Second tab .... follows the same config. The same filter may occur multiple times in different tabs. }]
NOTE: When using metadata groups, fields not in any group will be hidden everywhere! (unless manually re-added through
customjs
). This includes at least the following places:Explore/Corpora
Filter
Per Hit/Group by
Per Hit/Sort by
Per Document/Group by
Per Document/Sort by
-
[Global] - Set custom error messages
A function that is called whenever an error is returned from BlackLab. Context is one of 'snippet', 'concordances', 'hits', 'docs', 'groups' detailing during what action the error occured.
vuexModules.ui.getState().global.errorMessage = function(error, context) { return error.message; }
Error looks liketype ApiError = { /** Short summary */ title: string; /** Full error */ message: string; /** Message for HTTP status, e.g. "Service unavailable" for 503 */ statusText: string; }
-
[Search] - Hide the query builder
vuexModules.ui.actions.search.advanced.enable(false)
-
[Search] - Show/hide Split-Batch and Within
vuexModules.ui.actions.search.extended.splitBatch.enable(false)
vuexModules.ui.actions.search.shared.within.enable(false)
It's also possible to set which tags are shown (and how) in
within
. You can only add tags that you actually index (using the inlineTags options in your index config yaml)vuexModules.ui.actions.search.shared.within.elements({ title: 'Tooltip here (optional)', label: 'Sentence', value: 's' });
Lastly, if your corpus has dependency relations, you can set which inline tag is used for the sentence boundary when viewing a hit's dependency tree. The tree component will then show all words between two of these tags. There is rudimentary autodetection that tries to find the most likely tag, but you can override this by setting the tag manually.
vuexModules.ui.acions.search.shared.within.sentenceBoundary('s')
-
[Search] - Write your own (annotation) search fields
It's possible to completely replace the rendering of an annotation with your own plugin.
/** * @typedef {Object} AnnotationValue * @property {string} value * @property {boolean} case is the case-sensitive checkbox ticked (only applicable if caseSensitive === true in the config for this annotation) */ /** * @typedef {Object} NormalizedAnnotation * @property {string} annotatedFieldId id of the field this annotation resides in, usually 'contents' * @property {boolean} caseSensitive does the annotation super case-sensitive search * @property {string} description description of the annotation as set in the .blf.yaml * @property {string} displayName * @property {boolean} hasForwardIndex without a forward index, an annotation can only be searched, but not displayed * @property {string} id 'lemma', 'pos', etc. These are only unique within the same annotatedField * @property {boolean} isInternal once upon a time this was used to hide certain annotations from the interface. * @property {boolean} isMainAnnotation True if this is the default annotation that is matched when a cql query does not specify the field. (e.g. just "searchTerm" instead of [fieldName="searchTerm"] * @property {string} offsetsAlternative blacklab internal * @property {string} [parentAnnotationId] (optional) if this is a subannotation, what is its parent annotation * @property {string[]} [subAnnotations] (optional) only if this has any subannotation * @property {'select'|'combobox'|'text'|'pos'|'lexicon'} uiType * @property {Array<{value: string, label: string, title: string|null}>} [values] (optional) list of of all possible values, omitted if too many values. */ // OPTION 1: using jquery. // Create a div and attach event handlers. // You will also need to implement the update() function // Otherwise your component will go out of sync if the user presses the reset button, // or uses the history, or loads in from an existing url. vuexModules.ui.getState().search.shared.customAnnotations.word = { /** * @param {NormalizedAnnotation} config * @param {AnnotationValue} value * @param {Vue} vue the vue library */ render(config, state, Vue) { return $(` <div> <div>Custom render function for "word"</div> <label>${config.description} <input type="text" value="${state.value.replace('"', """)}"> </label> <label><input type="checkbox" ${state.case ? 'checked' : ''}">custom case sensitive?</label> </div>`) .on('change', 'input[type="checkbox"]', e => state.case = e.target.checked) .on('input', 'input[type="text"]', e => state.value = e.target.value) }, /** * @param {AnnotationValue} cur new value * @param {AnnotationValue} prev previous value * @param {HTMLElement} element div */ update(cur, prev, element) { element = $(element); element.find('input[type="text"]').val(cur.value); element.find('input[type="checkbox"]').prop('checked', cur.case); } } // OPTION 2: use Vue directly. // you may leave the update() function undefined, // as the reactivity will do the work. vuexModules.ui.getState().search.shared.customAnnotations.word = { /** * @param {NormalizedAnnotation} config * @param {AnnotationValue} value * @param {Vue} vue the vue library */ render(config, state, Vue) { return new Vue({ ...Vue.compile(` <div> <h2>{{config.displayName}}</h2> <input v-model="state.value"> <input type="checkbox" v-model="state.case"> </div> `), data: { state, config } }) } }
-
[Results/Hits] Enable custom inserts in concordances, such as an audio player for spoken corpora
The following example will create small play buttons in concordances, allowing the user to listen to the fragment. We use this feature in the
CGN (Corpus Gesproken Nederlands)
corpus.To enable this, three things are needed:
- The corpus needs to be annotated with some form of link to an audio file in the document metadata (or token annotations).
- You need to have hosted the audio files somewhere.
NOTE: The
/static/
directory will not work, as the server needs to support the httprange
headers, which that implementation does not. (PR would be most welcome!) - A function that transforms a result into a start- and endtime inside the audio file.
Shown here is (an edited version of) the function we use in the
CGN
, but specific implementation will of course be unique to your situation./* The context object contains the following information: { corpus: string, docId: string, // the document id context: BLTypes.BLHitSnippet, // the raw hit info as returned by blacklab document: BLTypes.BLDocInfo, // the document metadata documentUrl: string, // url to view the document in the corpus-frontend wordAnnotationId: string, // configured annotation to display for words (aka vuexModules.ui.results.hits.wordAnnotationId) dir: 'ltr'|'rtl', } The returned object should have the following shape: { name: string; // unique name for the widget you're rendering, can be anything component?: string; // (optional) name of the Vue component to render, component MUST be globally installed using vue.component(...) element?: string; // when not using a vue component, the name of the html element to render, defaults to 'div' props?: any; // attributes on the html element (such as 'class', 'tabindex', 'style' etc.), or props on the vue component content?: string // html content of the element, or content of the default slot when using a vue component listeners?: any; // event listeners, passed to v-on, so 'click', 'hover', etc. } */ vuexModules.ui.getState().results.hits.addons.push(function(context) { var snippet = context.context; var docId = context.docId; var s = 'begintime'; // word property that stores the start time of the word var e = 'endtime'; // word property that stores the end time of the word // find the first word that has a start time defined, and the first word that has an end time defined var startString = snippet.before[s].concat(snippet.match[s]).concat(snippet.after[s]).find(function(v) { return !!v; }); var endString = snippet.before[e].concat(snippet.match[e]).concat(snippet.after[e]).reverse().find(function(v) { return !!v; }); // Returning undefined will disable the addon for this hit if (!startString && !endString) { return undefined; } // Transform format of HH:MM:SS.mm to a float var start = startString ? startString.split(':') : undefined; start = start ? Number.parseInt(start[0], 10) * 3600 + Number.parseInt(start[1], 10) * 60 + Number.parseFloat(start[2]) : 0; // Transform format of HH:MM:SS:mm to a float var end = endString ? endString.split(':') : undefined; end = end ? Number.parseInt(end[0], 10) * 3600 + Number.parseInt(end[1], 10) * 60 + Number.parseFloat(end[2]) : Number.MAX_VALUE; return { component: 'AudioPlayer', // name for our component name: 'audio-player', // render the vue component 'audio player', so <audio-player> props: { // with these props, so <audio-player :docId="..." :startTime="..." :endTime="..." :url="..."> docId: docId, startTime: start, endTime: end, url: config.audioUrlBase + docId + '.mp3' }, } })
-
[Results/Hits] configure the displaying of hits and their surrounding words
vuexModules.ui.actions.results.shared.concordanceAnnotationId('word_xml')
vuexModules.ui.actions.results.shared.concordanceSize(100)
vuexModules.ui.actions.results.shared.concordanceAsHtml(true)
vuexModules.ui.actions.results.shared.transformSnippets(snippet -> {/* modify snippet in place */})
concordanceAnnotationId
lets you set which annotation is used to display words in the results table:While this example is a little contrived, consider the following:
In Clitic words (e.g. they've), there are usually two lemmata (they and have), but only one word shared between them. Indexing multiple values per token is supported by blacklab, and searching for either value will find the result, but due to a limitation in the forward index, only the first value for an annotation can be displayed. Using this system you can use a different annotation for searching, e.g. match on they and have, but show the original from the text (they've). Another case could be preserving markup such as --strikethrough--, italics or bold. You could index the words including markup and the version without as another annotation, then show the version with markup in the results table. (using the
concordanceAsHtml
).
concordanceSize
controls the number of visible surrounding words when expanding hits to view details (default 50, max 1000).vuexModules.ui.actions.results.shared.concordanceSize(10)
concordanceAsHtml
is an advanced feature. Combined with blacklab's captureXml mode, you can index snippets of raw xml from your document into an annotation. You can then set that annotation to display as html in the results table. This allows you to style it as any html usingcustom css
.Internally, we use this to display strike throughs, manual corrections and underlines in words in historical corpora. For example: given a word structure like so:
<!-- source document --> <w> en<strikethrough>de</strikethrough> </w>
# Index config - name: word_xml valuePath: //w captureXml: true
// Custom js vuexModules.ui.actions.results.shared.concordanceAnnotationId('word_xml') vuexModules.ui.actions.results.shared.concordanceAsHtml(true)
/* Custom css */ strikethrough { text-decoration: line-through; }
en
deBrowser support for non-html elements is good, so arbitrary xml should work out of the box, given it is well-formed.
USE THIS FEATURE WITH CARE! It may break your page if the xml contains unexpected or malformed contents.
transformSnippets
can be used to replace some words, add warnings, or correct some things in your data (such as escape sequences) after the fact, when doing so would be hard to do in the source files. Or it could be used to do some markup in combination with theconcordanceAsHtml
settings./** * E.g. {word: ['words', 'in', 'the', 'hit'], lemma: [...], pos: [...]} * @typedef {Object.<string, string[]>} SnippetPart */ /** * @typedef {Object} Snippet * @property {SnippetPart} left * @property {SnippetPart} match * @property {SnippetPart} right * @property {string} [Warning] - Warning: there might be more! */ /** * @typedef {function} TransformFunction * @param {Snippet} snippet * @returns {void} */ vuexModules.ui.actions.results.shared.transformSnippets(snippet => { const transform = (...snippets) => snippets.forEach(v => v.word = v.word.map(word => { if (word === 'de') return `<span style="text-decoration: underline; text-shadow: 0 0 2px red;">${word}</span>`; return word; })) transform(snippet.before, snippet.match, snippet.after); });
-
[Results/Docs] Customize the display of document titles in the results table
By setting a callback to generate or extract the title of the documents, you can have more control over it.
/** * @param metadata all metadata of the document, in the form of { [fieldName: string]: string[] } * @param specialFields the names of the pid, title, author, and date fields, in the shape of { authorField: string, pidField: string, dateField: string, titleField: string } @returns {string} */ vuexModules.ui.getState().results.shared.getDocumentSummary = function(metadata, specialFields) { return 'The document title is: ' + metadata[specialFields.titleField][0]; }
-
Miscellaneous configuration
This is currently not supported in other dropdowns.
vuexModules.ui.actions.dropdowns.groupBy.annotationGroupLabelsVisible(false)
vuexModules.ui.actions.dropdowns.groupBy.metadataGroupLabelsVisible(false)
With labels Without labels Set for how long the spinner will keep automatically refreshing until the user must continue it.
<=0
will keep refreshing indefinitely until the query finishes.
vuexModules.ui.results.shared.totalsTimeoutDurationMs(90_000)
Set how quickly the spinner will refresh while it's polling. This setting has a minimum of 100ms.
vuexModules.ui.results.shared.totalsRefreshIntervalMs(2_000)
-
Override any other data you want
If you know what you're doing, it's possible to override any value in the store. The
corpus
module (vuexModules.corpus
) contains all data used to render the page, including the annotations, metadata fields, display names, descriptions, uiTypes, etc. This object is writable and can be interacted with from the console or javascript, so tinker away.Changing the ids of annotations/metadata will result in broken queries though! :)
The /docs/
page has other features that can be enabled.
Enabling any of these will show a new Statistics
tab next to the default Content
and Metadata
tabs.
-
A table with whatever data you wish to show.
You may provide a function with the signature
(BLDocument, BLHitSnippet) => { [key: string]: string; }
(see blacklabTypes for type definitions)vuexModules.root.actions.statisticsTableFn(function(document, snippet) { var ret = {}; ret['Tokens'] = document.docInfo.lengthInTokens; ret['Types'] = Object.keys(snippet.match['pos'].reduce(function(acc, v) { acc[v] = true; return acc; }, {})).length; ret['Lemmas'] = Object.keys(snippet.match['lemma'].reduce(function(acc, v) { acc[v] = true; return acc; }, {})).length var ratio = ret['Tokens'] / ret['Types']; var invRatio = 1/ratio; ret['Type/token ratio'] = '1/'+ratio.toFixed(1)+' ('+invRatio.toFixed(2)+')'; return ret; });
-
The color palette for the charts
If you're using custom css, this can help them blend in with your own style. Defaults to bootstrap 3 primary blue (
#337ab7
).vuexModules.root.actions.baseColor('#9c1b2e');
As an experiment, we're trying out a different approach for some new customizations. In your custom.js
file(s), you can now use code like the following:
frontend.customize((corpus) => {
// Your customizations follow here.
// For example:
// Hide the field 'bad-field' from the metadata;
corpus.search.metadata.show = function (fieldName) {
if (fieldName === 'bad-field')
return false;
return null;
};
// Etc.
});
This code will be called while your corpus is initializing.
The corpus
object represents a more abstract customization API for your corpus. Here's a list of the current customization mechanisms.
-
Hide metadata fields
Normally all metadata fields are shown. If you wish to hide some, you can use the following code:
corpus.search.metadata.showField = function (name) { if (name === 'bad-field') return false; // hide this field return null; // default behaviour };
-
Add span filters (to filter by part of documents)
To add an extra tab where you can filter by part of the document, such as only searching in certain types of named entity, or speech by one person:
const m = corpus.search.metadata; m.addCustomTab( 'Span filters', [ m.createSpanFilter('named-entity', 'type'), m.createSpanFilter('speech', 'person'), ] );
Parameters for
createSpanFilter
:- Span name (e.g.
named-entity
) - Attribute name (e.g.
type
) - Widget to use. Can be
'auto'
,'text'
,'select'
, or'range'
.'auto'
is the default and will choose betweentext
andselect
based on the number of unique values. - Display name of the filter (optional). See also internationalization.
- Metadata object (optional). You can override the options for a select widget here (default are all the actual values in the corpus).
- Span name (e.g.
-
Customize what span attributes we can group on
corpus.grouping.includeSpanAttribute = function (name, attrName) { if (name === 'boring-span') return false; // no grouping on any of this span's attributes if (name === 'named-entity' && attrName === 'id') return false; // don't offer grouping on this attribute return null; // default behaviour (any attribute with a span filter) };
-
Customize the within widget
// Customize which spans are shown in the within widget corpus.search.within.includeSpan = function (name) { if (name === 'boring-span') return false; // hide this span return null; // default behaviour (all spans) }; // Customize if fields for any attributes are shown in // the within widget when selecting certain spans corpus.search.within.includeAttribute = function (name, attrName) { if (name === 'chapter') { // show this attribute return attrName === 'number'; } return null; // default behaviour (no attributes) };
-
Customize match info higlight style
Match info is any explicit captures (e.g.
A:[] "cow"
), spans (e.g.<named-entity type='loc' />
), or relations (_ -nsubj-> _
) that were encountered while resolving your query.corpus.results.matchInfoHighlightStyle = function (matchInfo) { if (matchInfo.isRelation) { // Show hover highlight for words if (matchInfo.relType === 'word-alignment') return 'hover'; // Don't show other relations return 'none'; } // Always highlight any named entities captured by our query // (e.g. <named-entity/> containing "dog") if (matchInfo.key === 'named-entity') return 'static'; // Default highlighting behaviour // ("highlight non-relations if there's explicit captures in the query") return null; };
We have included a template SASS file here to allow you to customize your page's color theme easily. From there you can then add your own customizations on top.
Create a file with the following contents
// custom.scss
$base: hsl(351, 70%, 36%); // Defines the base color of the theme, this can be any css color
@import 'style-template.scss'; // the absolute or relative path to our template file
// Your own styles & overrides here ...
You now need to compile this file by following the following steps:
- Install Node.js
- Install the
node-sass
package by runningnpm install -g node-sass
- Compile the file by running
node-sass -o . custom.scss
You will now have a custom.css
file you can include in your install through search.xml
.
The app is primarly written in Vue.js.
Outlined here is the /search/
page, as it contains the majority of the code.
Entry points are the following files
- article.ts
Handles the hit navigation, graphs and charts on the document page
/corpus-frontend/docs/...
- corpora.ts
The main index page (or
/corpus-frontend/corpora/
) - remote-index.ts
The
/upload/
page. - search.ts The search form
Individual components are contained in the pages directory. These components are single-use and/or connected to the store in some way. The components directory contains a few "dumb" components that can be reused anywhere.
We use vuex to store the app state, treat it as a central database (though it's not persisted between sessions).
The vuex store is made up of many modules
that all handle a specific part of the state, such as the metadata filters, or the settings menu (page size, random seed).
The form directory contains most of the state to do with the top of the page, such as filters, query builder, explore view. The results directory handles the settings that directly update the results, such as which page is open, how results are grouped, etc.
A couple of modules have slightly different roles:
- The corpus module stores the blacklab index config and is used almost everywhere.
- The history module stores the query history (not the browser history).
- The query module contains a snapshot of the form (with filters, patterns, etc) as it was when
submit
was pressed. This is what actually determines the results being shown, and is what render the query summary etc. - The tagset module is mostly inactive, but it stores the info to build the editor for the
pos
uiType.
The current page url is generated and updated in streams.ts.
It contains a few things: a stream that listens to state changes in the vuex
store, and is responsible for updating the page url, and a couple streams that fetch some metadata about the currently selected/searched corpus (shown below the filters and at the top of the results panel).
Url parsing happens in the UrlStateParser.
The url parsing is a little involved, because depending on whether a tagset
is provided it can differ (the cql pattern is normally parsed and split up so we know what to place in the simple
and extended
views, but this needs to happen differently when a tagset is involved).
Because of this, the store is first initialized (with empty values everywhere), then the url is parsed, after which the state is updated with the parsed values (see search.ts).
When navigating back and forth through browser history, the url is not parsed, instead the state is attached to the history entry and read directly.
The app is internationalized using vue-i18n. Please note that the app is only partially translatable right now; I18n is a work in progress. Contributions are welcome.
If you want to help add translation keys, look for e.g. {{ $t('search.simple.heading') }}
in the code to see how it's done.
If you want to help translate the app to a new language, you can do so by adding a new language file in the src/frontend/src/locales
directory. This is where the default translation files live. Copy one of the files (e.g. en.json
) and name it for the new locale (e.g. fr.json
for French). Then you can start translating the strings.
You can also override some default translations per corpus by creating a directory named locales
in the static
directory of the corpus' interface data dir (see the corporaInterfaceDataDir
setting) and create a file with the same name as above (e.g. fr.json
for French) with the desired overrides. The file should be read automatically by the app.
In the override file(s) in the static/locales
directory for your corpus, you can set names for annotations, metadata fields, etc. in your corpus.
For example in $corporaInterfaceDataDir/YOUR_CORPUS/static/locales/en-us.json
{
"index": {
"annotations": {
"pos": "Part of speech"
},
"annotationGroups": {
"simple": "Basic annotations",
"advanced": "Advanced annotations"
},
"metadata": {
"spanFilters": {
"name": {
"type": "Named entity type"
}
}
},
"metadataGroups": {
"author": "Author-related fields",
"date": "Date-related fields"
},
"within": {
"name": "Named entity"
}
}
}
Install the Vue devtools! (chrome, firefox).
You can compile and watch the javascript files using webpack.
Execute npm run start
in the src/frontend/
directory. This will start webpack-dev-server
(webpack is a javascript build tool) that will serve the compiled files (the entry points) at localhost/dist/
.
It adds a feature where if one of those files is loaded on the page, and the file changes, your page will reload automatically with the new changes.
Combining this with jspath
in corpus-frontend.properties
we can start the corpus-frontend as we normally would, but sideload our javascript from webpack-dev-server
and get realtime compilation :)
# No trailing slash!
jspath=http://localhost:8081/dist
cd corpus-frontend/src/frontend/
npm run start
One note is that by default the port is 8080
, but we changed it to 8081
, as tomcat
already binds to 8080
. To change this, edit the scripts.start
property in package.json.
The backend is written in Java, and does comparitively little. Its most important tasks are serving the right javascript file and setting up a page skeleton (with Apache Velocity).
When a request comes in, the MainServlet
fetches the relevant corpus data from BlackLab, reads the matching search.xml
file, and determines which page to serve (the *Response
classes). Together this renders the header, footer, defines some client side variables (mainly urls to the corpus frontend server and blacklab servers).
From there on out the rest happens clientside.
It also handles most of the document
page, retrieving the xml and metadata and converting it to html.
If you have any further questions or experience any issues, please contact Jan Niestadt and/or Koen Mertens.
Like BlackLab, this corpus frontend is licensed under the Apache License 2.0.