Releases: INL/BlackLab
Remove unnecessary libraries (Solr, Jetty, etc.).
With the Lucene 8 upgrade, a lot of libraries got pulled in because we used a class (SlowCompositeReaderWrapper) that was moved from Lucene to Solr. The libraries included Jetty, which interfered with running BlackLab Server in some application servers.
This gets rid of the class and unnecessary libraries.
Lucene 8. Improved performance. Huge corpora.
Changed
- Minimum Java version was raised from 8 to 11.
- Based on Lucene 8. Thanks to @zhyongwei for the initial version update. Further changes were made to how DocValues are used, as this API is now sequential instead of random-access.
- Smarter default config values based on number of CPU cores and max. heap memory. A debug message will show that and how the default value for a missing setting was determined.
- Corpora larger than 2^31 tokens are now supported. The few operations that don't support this yet will produce a clear error message. This functionality can be disabled with the `search.enableHugeResultSets` setting (default `true`), which might slightly improve performance.
- Warn if an annotation named 'word' or 'lemma' has no explicit `sensitivity` declared. Due to a special case, these will automatically get sensitivity `sensitive_insensitive`, but this quirk is deprecated and should not be relied upon.
- Clearer error message if no `indexLocations` were found.
- BLS now resolves symlinks while scanning indexLocations.
- BLS now allows dots in index names (in addition to underscore and dash).
- DocIndexerXPath now throws an exception if it encounters a non-UTF8 doc.
- FileProcessor should now handle files larger than 4 GB (although such files may lead to other problems, e.g. excessive memory use).
- When a search is interrupted, there should now be a better indication of why.
- Stack trace should be included in more error responses if in debug mode.
- 'Unauthorized to view content' error now refers to documentation.
- If a format config contains an error, report the file it occurs in.
- Document that the first annotation declared becomes the main annotation.
- BLS now also looks at the `X-Forwarded-For` header to determine debug mode.
- BLS now accepts wildcards in the debug mode IP configuration.
- Update Jackson, revert YAML bug workaround.
- Improve how search/count times are reported in BLS.
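The 2^31 token limit mentioned above corresponds to the range of a Java `int`. The following is a minimal sketch of why positions and counts in such corpora need `long` arithmetic. It is not BlackLab code; the class and method names are hypothetical and only illustrate the overflow.

```java
// Hypothetical illustration, not part of the BlackLab API.
final class TokenCountSketch {

    /** Returns true if this many tokens can still be addressed with int positions. */
    static boolean fitsInInt(long tokenCount) {
        return tokenCount <= Integer.MAX_VALUE; // Integer.MAX_VALUE is 2^31 - 1
    }

    /** Sums per-document token counts in long arithmetic; throws on long overflow. */
    static long totalTokens(long[] docLengths) {
        long total = 0;
        for (long len : docLengths) {
            total = Math.addExact(total, len);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] docLengths = {1_500_000_000L, 1_500_000_000L};
        long total = totalTokens(docLengths);
        System.out.println(total + " tokens, fits in int: " + fitsInInt(total));

        // Narrowing the same total to int silently wraps around to a negative number.
        int wrapped = (int) total;
        System.out.println("as int: " + wrapped);
    }
}
```

Any code that stores a token position or hit count in an `int` would see the wrapped value, which is presumably why the few operations that are not yet converted report an error rather than continue.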
New
- Added `naf` (NLP Annotation Format) to the builtin formats.
- `FrequencyTool` is a commandline tool that allows you to get frequency lists for an entire corpus.
Java API
- `Hits`, `HitsInternal(Mutable)`, CapturedGroups and other interfaces refactored to make (im)mutability more explicit.
- `Doc` and `DocImpl` classes were removed. Now that we use `DocValues` everywhere, caching Lucene documents doesn't make sense.
- Searches should no longer get stuck queued even if maxConcurrentSearches is set to a low value.
Fixed
- Fix usecontent=orig with outputformat=json
- Fix reading of metadata value frequencies, which were all read back as 0 due to a bug in how YAML was handled.
- Fix an issue where `HitProperty.contextIndices` would seemingly change during a sort operation.
- Prevent NPE if no patt specified with `/hits` request.
- Fix `hitsProcessedAtLeast()` method not always blocking. It may not be clear from the name, but this method will wait for the specified amount of hits to be processed, or will return `false` if all hits were processed and there were fewer than that amount.
- Fix NPE for malformed sort string like `docid,`.
- Don't hardcode "word" as the main annotation.
- Fix errors when running tests in parallel.
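The blocking contract described for `hitsProcessedAtLeast()` in the list above can be sketched with plain Java monitors. This is an illustration of the described semantics only, not BlackLab's implementation; the class `HitCounterSketch` and its methods are hypothetical.

```java
// Hypothetical illustration of "wait for at least n, or return false if done with fewer".
final class HitCounterSketch {
    private long processed = 0;
    private boolean done = false;

    /** Called by the producing (search) thread each time a hit has been processed. */
    synchronized void hitProcessed() {
        processed++;
        notifyAll();
    }

    /** Called by the producing thread when there are no more hits. */
    synchronized void finish() {
        done = true;
        notifyAll();
    }

    /**
     * Blocks until at least n hits were processed (returns true), or until the
     * search finished with fewer than n hits (returns false).
     */
    synchronized boolean processedAtLeast(long n) {
        while (processed < n && !done) {
            try {
                wait();
            } catch (InterruptedException e) {
                // Restore the interrupt flag and report what we know so far.
                Thread.currentThread().interrupt();
                return processed >= n;
            }
        }
        return processed >= n;
    }

    public static void main(String[] args) {
        HitCounterSketch counter = new HitCounterSketch();
        new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                counter.hitProcessed();
            }
            counter.finish();
        }).start();

        System.out.println(counter.processedAtLeast(2)); // waits until 2 hits exist
        System.out.println(counter.processedAtLeast(5)); // waits until the search is done
    }
}
```

The loop around `wait()` is what makes the call actually block: the caller only returns once the count is reached or the producer has signalled completion.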
Removed
- support for previous BlackLab indexes (because Lucene 8 cannot read Lucene 5 indexes); you must reindex your data to use this version. If this is impractical, please keep using v2.3.0 for now. We would like to provide a conversion tool at some point.
- support for obsolete content store and forward index files (cs types "utf8" and "utf8zip", fi version 3); these were all replaced with newer versions six years ago. Older indexes will need to be re-indexed.
- Some deprecated settings. A warning will be shown if the setting is still found.
- Deprecated methods from `Indexer`, among others.
Bugfix release for Lucene 5 version.
Fixed
- If another search needs a queued search, always unqueue it (avoids deadlock)
- Respect chosen context size for CSV export
- Update list of builtin formats so tei-p5 and the legacy tei formats can be found
- If format with same name is found, include the name in the exception
- Don't include metadataGroupInfo with every result for /hits and /docs responses; this was never intended and produced invalid JSON
- Don't crash if an unreadable index from a different BlackLab version is found; just skip it
- Add version and build time to WAR manifest.
Changed
- IndexTooOld/IndexTooNew replaced with IndexVersionMismatch. See exception message for details.
Final Lucene 5 release (except for bugfixes)
- Alternative cache implementation (`ResultsCache` by @eginez of Lexion) that may be faster in high-throughput scenarios. Note that this implementation currently does not support queueing or aborting searches or getting a running totals count.
- Add processing step to concatenate separate date fields into one.
- Added format configuration `tei-p5.blf.yaml` that uses the more standard `pos` attribute. Renamed existing TEI format configurations to `-legacy`.
- Several fixes, improvements and cleanup of deprecated stuff.
Fixes and speedups. Instrumentation. Test suite. Docker.
See https://inl.github.io/BlackLab/changelog.html for the full list of changes.
Bumps log4j to 2.16.0 (critical security fix)
This addresses security issue CVE-2021-45046. Everyone using v2.1.0 is advised to upgrade as soon as possible.
Bugfixes. MetadataFieldsWriter.
- Add MetadataFieldsWriter for programmatically setting the special fields
- Fix crash during indexing if terms file got very large.
- BLS: Fix incorrect ownership check for user-owned formats.
- BLS: /termfreq operation no longer requires a filter query.
New API; multithreading; Saxon support
See changelog.md for more details.
First release candidate for v2.0.0.
See changelog.md for details. Please report any issues you experience with this preview.
Fix waitfortotal parameter
Fixes BlackLab Server's waitfortotal parameter, which indicates how to report the total hit count and which was broken in 1.7.0. If true, BlackLab will count all hits before responding, which might take a long time. If false, BlackLab will report a running total and you can keep polling until the count is done.
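As a rough illustration of how a client might set this parameter, the sketch below builds a `/hits` request URL with `waitfortotal` included. The base URL, the corpus name `my-corpus`, the URL layout and the helper name `hitsUrl` are assumptions made for this example; only `patt`, `/hits` and `waitfortotal` come from these notes.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper, not part of BlackLab.
final class WaitForTotalSketch {

    /** Builds a hits request URL; the pattern is URL-encoded as UTF-8. */
    static String hitsUrl(String baseUrl, String corpus, String patt, boolean waitForTotal) {
        return baseUrl + "/" + corpus + "/hits"
                + "?patt=" + URLEncoder.encode(patt, StandardCharsets.UTF_8)
                + "&waitfortotal=" + waitForTotal;
    }

    public static void main(String[] args) {
        // Assumed server location and corpus name, for illustration only.
        String base = "http://localhost:8080/blacklab-server";
        System.out.println(hitsUrl(base, "my-corpus", "\"the\"", true));
        System.out.println(hitsUrl(base, "my-corpus", "\"the\"", false));
    }
}
```

With `waitfortotal=false`, a client would repeat the request until the server reports that counting has finished; how that is signalled in the response is not covered here.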