Releases: netarchivesuite/solrwayback

SolrWayback bundle 5.1.0

26 Mar 12:57

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.

SolrWayback bundle version 5+ now requires Java 11 or Java 17 and no longer runs under Java 8. Tomcat has been upgraded from version 8.5 to 9 and Solr from version 7 to 9.
The SolrWayback webapp is backwards compatible with a Solr 7 index. If you have a large index built under Solr 7, just keep the solr7 folder and do not use the new solr9 folder.

Download: https://github.com/netarchivesuite/solrwayback/releases/download/5.1.0/solrwayback_package_5.1.0.zip

How to install:
Unzip the bundle and read the 'install guide' section in the README.md file in the root of the zip file.
Solr must now be started with a -c (for cloud) argument:
solr-9/bin/solr start -c -m 4g

How to upgrade from a previous version:
Replace the solrwayback folder with the new folder, but keep the solr7 folder if you have already built an index and do not want to reindex.
Compare the properties in solrwayback.properties and solrwaybackweb.properties with yours and add any new properties that are missing.
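
The property comparison in this step can be scripted. The following is a minimal Python sketch and not part of the bundle: it assumes plain `key=value` lines (no line continuations or escaped separators), and the function names are hypothetical. It lists keys that exist in the bundled property file but are absent from your own copy.

```python
def read_property_keys(text: str) -> set[str]:
    """Return the keys defined in simple Java-style .properties text."""
    keys = set()
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or '!').
        if not stripped or stripped[0] in "#!":
            continue
        # The key ends at the first '=' or ':' separator.
        for index, char in enumerate(stripped):
            if char in "=:":
                keys.add(stripped[:index].strip())
                break
    return keys


def missing_properties(bundled_text: str, local_text: str) -> list[str]:
    """Keys present in the bundled file but absent from the local file."""
    return sorted(read_property_keys(bundled_text) - read_property_keys(local_text))


if __name__ == "__main__":
    # Inline stand-ins for the two files; in practice read them from disk.
    bundled = "#Solr caching\nsolr.server.caching=true\nsolr.server.caching.max.entries=10000\n"
    local = "solr.server.caching=true\n"
    for key in missing_properties(bundled, local):
        print("missing:", key)
```

Run it once for solrwayback.properties and once for solrwaybackweb.properties; every key it reports is a candidate to copy over together with the bundled default value.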

Changelog:
See https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

Changes since last 4.2.2 release:

5.1.0
Substantial speed-up when exporting (CSV, WARC etc.) from large multi-sharded collections. See #329 (thanks to Toke Eskildsen). This feature still needs a little more testing. Feedback is welcome.

Minor tweaking of log info/debug. Less log lines in default solrwayback.log when running with log level INFO.
Fix regression bug where "page resources" was not showing missing resources for the webpage.

Updated the bundle install documentation. Added a new section on how to redeploy the Solr configuration.

5.0.0
Upgrade Java 1.8 → 11, Tomcat 8.5 → 9 and Solr 7 → 9. SolrWayback 5.0.0 is backwards compatible with existing Solr 7 installations.
Better guide for using start and stop scripts.
Fixed csv/json export when more than 1 facet was selected. (regression bug... sorry)
warc-indexer now also finds ARC files when searching recursively (thanks to @fedorw)
Frontend third-party dependencies updated.

4.4.3
Add Zip Export feature. It is now possible to extract raw files from SolrWayback in a combined zip file. This could for example be used to extract all HTML content, images, video etc. from a search result (github #382 and #245). Add an additional property in solrwaybackweb.properties to increase the default max file limit: export.zip.maxresults=1000000

Docker support. The docker file will install SolrWayback in the docker container. You can index WARC files from a folder outside the docker container. See the docker file for documentation. (Thanks to Trym Bremnes for this PR)

Query hints fix (range queries). The search validation helper did not like range queries and showed a warning even when they were correct. (github #380)
Remove an error message that would be shown while waiting to load "Page resources"

CTRL+click on a facet will open the search-result in a new tab. On macOS use CMD+click. (github #404)

Setting encoding to UTF-8 when indexing into Solr using the indexing scripts in the bundle install. Some OS/docker containers may not have UTF-8 as default.

SolrWayback bundle 5.0.0

20 Dec 10:29

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.

SolrWayback bundle version 5 now requires Java 11 and no longer runs under Java 8. Tomcat has been upgraded from version 8.5 to 9 and Solr from version 7 to 9.
The SolrWayback webapp is backwards compatible with a Solr 7 index. If you have a large index built under Solr 7, just keep the solr7 folder and do not use the new solr9 folder.

Download: https://github.com/netarchivesuite/solrwayback/releases/download/5.0.0/solrwayback_package_5.0.0.zip

How to install:
Unzip the bundle and read the 'install guide' section in the README.md file in the root of the zip file.

How to upgrade from a previous version:
Replace the solrwayback folder with the new folder, but keep the solr7 folder if you have already built an index and do not want to reindex.
Compare the properties in solrwayback.properties and solrwaybackweb.properties with yours and add any new properties that are missing.

Solr 9 must now be started with the cloud (-c) argument: ./solr start -c

Changelog:
See https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

Since last 4.2.2 release:

5.0.0
Upgrade Java 1.8 → 11, Tomcat 8.5 → 9 and Solr 7 → 9. SolrWayback 5.0.0 is backwards compatible with existing Solr 7 installations.
Better guide for using start and stop scripts.
Fixed csv/json export when more than 1 facet was selected. (regression bug... sorry)
warc-indexer now also finds ARC files when searching recursively (thanks to @fedorw)
Frontend third-party dependencies updated.

4.4.3
Add Zip Export feature. It is now possible to extract raw files from SolrWayback in a combined zip file. This could for example be used to extract all HTML content, images, video etc. from a search result (github #382 and #245). Add an additional property in solrwaybackweb.properties to increase the default max file limit: export.zip.maxresults=1000000

Docker support. The docker file will install SolrWayback in the docker container. You can index WARC files from a folder outside the docker container. See the docker file for documentation. (Thanks to Trym Bremnes for this PR)

Query hints fix (range queries). The search validation helper did not like range queries and showed a warning even when they were correct. (github #380)
Remove an error message that would be shown while waiting to load "Page resources"

CTRL+click on a facet will open the search-result in a new tab. On macOS use CMD+click. (github #404)

Setting encoding to UTF-8 when indexing into Solr using the indexing scripts in the bundle install. Some OS/docker containers may not have UTF-8 as default.

SolrWayback bundle 4.4.2

07 Jun 08:28

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still Java 8 as well).

Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.4.2/solrwayback_package_4.4.2.zip

How to install:
Unzip the bundle and read the 'install guide' section in the README.md file in the root of the zip file.

How to upgrade from a previous version:
For older versions, replace solrwayback.war with the latest version in the Tomcat 'webapps' folder and replace the warc-indexer in the indexing folder.
Replace solrconfig.xml in '/solr-7.7.3/server/solr/configsets/netarchivebuilder/conf' (keep local changes if you made any).
Compare the properties in solrwayback.properties and solrwaybackweb.properties with yours and add any new properties that are missing.

Changelog:
See https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

SolrWayback bundle 4.4.1

02 May 08:03

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still Java 8 as well).

Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.4.1/solrwayback_package_4.4.1.zip

How to install:
Unzip the bundle and read the 'install guide' section in the README.md file in the root of the zip file.

How to upgrade from a previous version:
For older versions, replace solrwayback.war with the latest version in the Tomcat 'webapps' folder and replace the warc-indexer in the indexing folder.
Replace solrconfig.xml in '/solr-7.7.3/server/solr/configsets/netarchivebuilder/conf' (keep local changes if you made any).
Compare the properties in solrwayback.properties and solrwaybackweb.properties with yours and add any new properties that are missing.

Changelog:
See https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

SolrWayback bundle 4.4.0

23 Jan 11:20

SolrWayback bundle release 4.4.0

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still Java 8 as well).

Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.4.0/solrwayback_package_4.4.0.zip

How to install:
Unzip the bundle and read the 'install guide' section in the README.md file in the root of the zip file.

How to upgrade from a previous version:
For older versions, replace solrwayback.war with the latest version in the Tomcat folder and replace the warc-indexer in the indexing folder.
Compare the properties in solrwayback.properties and solrwaybackweb.properties with yours and add any new properties that are missing.

Changelog:
See https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

SolrWayback bundle 4.3.0

05 Jul 11:21

SolrWayback bundle release 4.3.0

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still Java 8 as well).

Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.3.0/solrwayback_package_4.3.0.zip

How to install:
Unzip the bundle and read the 'install guide' section in the README.md file in the root of the zip file.

How to upgrade from a previous version:
For older versions, replace solrwayback.war with the latest version in the Tomcat folder and replace the warc-indexer in the indexing folder.
Compare the properties in solrwayback.properties and solrwaybackweb.properties with yours and add any new properties that are missing.

Changelog:
See https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

SolrWayback bundle 4.2.3

05 Jan 13:48

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still Java 8 as well).

Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.2.3/solrwayback_package_4.2.3.zip
This bundle release has patched 'log4shell' in the Solr server included in the bundle. So no patching against 'log4shell' is required.
The standalone warc-indexer has also been patched against 'log4shell'.

No more live leaks.

From version 4.2.1 SolrWayback comes with a built-in Serviceworker (JavaScript worker) that will redirect or block all live leaks. This works in modern browsers.
Playback will still work in legacy browsers using URL rewrites, but can leak to the live web unless an HTTP proxy or sandbox is used.

How to upgrade from a previous version:
For older versions, replace solrwayback.war with the latest version in the Tomcat folder.
Compare the properties in solrwayback.properties and solrwaybackweb.properties with yours and add any new properties that are missing. (No new properties since 4.2.1.)
Patch Solr against 'log4shell', see README.md : https://github.com/netarchivesuite/solrwayback/blob/master/README.md

Changes since 4.2.1:

4.2.3

Fixed the in-page video player for some MP4 videos that were classified by Tika as 'application/mp4'.
Fixed log4shell vulnerability in the SolrWayback bundle (Solr and warc-indexer).

4.2.2

Support for WARC record type 'resource'. This also required a fix in the warc-indexer, and resourcetype was added to config3.xml (in the indexing folder).
Improved playback for Twitter API harvest (https://github.com/netarchivesuite/so-me). (also changes in solrconfig.xml)
Implemented new WARC file resolver. If WARC files are removed from their indexed location after indexing, you can add a text file with the new location. Whenever a WARC file needs to be loaded and is on the list, that location is used instead of the one indexed into Solr.

Installation guide for SolrWayback bundle:
https://github.com/netarchivesuite/solrwayback/blob/master/README.md
(see the installation section)

SolrWayback bundle 4.2.1

10 Sep 11:44

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still Java 8 as well).

Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.2.1/solrwayback_package_4.2.1.zip

log4shell security alert

SolrWayback itself does not use log4j 2+ and is not directly affected by CVE-2021-44228.

The SolrWayback bundle uses Solr 7.7.3, which is affected by log4shell. Please follow the Solr log4shell mitigation guide if the bundled Solr is used. The quickest fix, taken from the guide, is:

  • (Linux/MacOS) Edit your solr.in.sh file to include: SOLR_OPTS="$SOLR_OPTS -Dlog4j2.formatMsgNoLookups=true"
  • (Windows) Edit your solr.in.cmd file to include: set SOLR_OPTS=%SOLR_OPTS% -Dlog4j2.formatMsgNoLookups=true

If another version of Solr is used, note that Solr versions >= 7.4 and < 8.11 are vulnerable. See the mitigation guide above for details.
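
The version range quoted above can be checked mechanically. This is a minimal Python sketch with hypothetical function names; it only tests whether a dotted version string falls in the stated range (>= 7.4 and < 8.11) and says nothing about whether the mitigation has already been applied.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '7.7.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_log4shell_range(solr_version: str) -> bool:
    """True if the version is >= 7.4 and < 8.11, the range given in the release note."""
    return (7, 4) <= parse_version(solr_version) < (8, 11)


if __name__ == "__main__":
    # 7.7.3 is the Solr version shipped in this bundle.
    for candidate in ("7.7.3", "8.11.0"):
        print(candidate, in_log4shell_range(candidate))
```

Tuple comparison is used so that, for example, 8.9 sorts below 8.11, which a plain string comparison would get wrong.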

No more live leaks.

From version 4.2.1 SolrWayback comes with a built-in Serviceworker (JavaScript worker) that will redirect or block all live leaks. This works in modern browsers.
Playback will still work in legacy browsers using URL rewrites, but can leak to the live web unless an HTTP proxy or sandbox is used.

How to upgrade from previous version 4.1.1 (or higher):

To upgrade from a previous version, you need to replace the solrwayback.war in the 'apache-tomcat-8.5.60/webapps' folder.
Then add the following properties to 'solrwayback.properties' in your home folder if they are not present:

#Solr caching. Will be default false if not defined
solr.server.caching=true
solr.server.caching.max.entries=10000
solr.server.caching.age.seconds=86400

Add the following properties to 'solrwaybackweb.properties' in your home folder if they are not present:

#English
wordcloud.stopwords=i,me,my,myself,we, ...
(Take the full list from the property file in the release. It also comes with a Danish stopword list.)

To upgrade from an older version just compare solrwayback.properties and solrwaybackweb.properties and add the missing properties to your files.

Changes since release 4.1.0:
4.2.1

Further improvements in serviceworker:
a) The SolrWaybackRoot-servlet application is no longer required if the Serviceworker is loaded. For legacy browsers where the Serviceworker does not work, the root servlet will be required for improved playback.
b) In rare cases the referer is missing, so the crawl time for the origin resource is unknown. By default the current year is used as crawl time. This situation is often not relevant for playback, since such requests are often to trackers and ads.

Cleaned up logging to the solrwayback.log file. It should not be as spammy now.
Upgraded frontend dependencies (security updates).

Fixed bug in load more facets for domain facet when there also was a filter query involved.

4.2.0

All Playback live leaks are now blocked or redirected back to SolrWayback with a javascript Serviceworker added to playback. No more leaking to the live web! This will also improve playback when the live leak can be resolved in SolrWayback. (Thanks to Ilya Kreymer for pointing me in this direction).
The Serviceworker implementation requires the SolrWayback server to run under HTTPS. This can be achieved by placing Apache or Nginx in front of Tomcat.
The Serviceworker feature is supported by most recent browser versions. See: https://caniuse.com/serviceworkers
Playback will still work in legacy browsers using URL rewrite, but can leak to the live web if not blocked by a proxy server or sandboxed.
Encoding fix in javascript rewrite: Modify < > handling to preserve the original representation (including faulty ones). This closes SOLRWBFB-58
Upgraded frontend dependencies (security updates).

4.1.2

Word cloud stop words can now be configured in solrwaybackweb.properties.
Added new property (wordcloud.stopwords) in solrwaybackweb.properties with default stopwords (English). An empty stopword list is used if the property is not defined.
Word cloud HTML page extraction reduced from 10,000 to 5,000 pages, as the difference was minimal but performance doubles.
API method to extract word+count for a query+filterquery(optional) : /services/frontend/wordcloud/wordfrequency?q=xxx&fg=yyy
API method to extract wordcloud image for query+filterquery(optional): /services/frontend/wordcloud/query?q=xxx&fg=yyy
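
As an illustration of the two API methods above, this minimal Python sketch builds the request URLs. The base URL `http://localhost:8080/solrwayback` is an assumption about a local install, the function name is hypothetical, and the parameter names `q` and `fg` are copied verbatim from this release note, so verify them against your installation before relying on them.

```python
from typing import Optional
from urllib.parse import urlencode


def wordcloud_url(base_url: str, endpoint: str, query: str,
                  filter_query: Optional[str] = None) -> str:
    """Build a word cloud service URL; endpoint is 'wordfrequency' or 'query'."""
    params = {"q": query}
    if filter_query is not None:
        # Parameter name taken as written in the release note.
        params["fg"] = filter_query
    path = f"/services/frontend/wordcloud/{endpoint}"
    return base_url.rstrip("/") + path + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed base URL of a local SolrWayback install.
    base = "http://localhost:8080/solrwayback"
    print(wordcloud_url(base, "wordfrequency", "domain:example.com"))
    print(wordcloud_url(base, "query", "climate change", "crawl_year:2020"))
```

`urlencode` takes care of escaping characters such as `:` and spaces in the Solr query, which is easy to get wrong when concatenating strings by hand.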

Solr query caching for performance boost.
Added new optional properties in solrwayback.properties
#Solr caching. Will be default false if not defined
solr.server.caching=true
solr.server.caching.max.entries=10000
solr.server.caching.age.seconds=86400
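
To make the two limits concrete, here is a minimal Python sketch of a cache bounded by entry count and entry age. It is not SolrWayback's implementation (which may evict differently); the class name is hypothetical, and it assumes `max.entries` caps the number of cached responses and `age.seconds` is the maximum age before an entry is dropped.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class BoundedTtlCache:
    """Cache limited by number of entries and by age of each entry."""

    def __init__(self, max_entries: int, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_entries = max_entries
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._entries: OrderedDict = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._clock() - stored_at > self.max_age_seconds:
            # Entry is too old: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries.pop(key, None)
        self._entries[key] = (self._clock(), value)
        while len(self._entries) > self.max_entries:
            # Evict the entry that was inserted first.
            self._entries.popitem(last=False)


if __name__ == "__main__":
    # Values mirror the example properties above.
    cache = BoundedTtlCache(max_entries=10000, max_age_seconds=86400)
    cache.put("q=domain:example.com", {"numFound": 42})
    print(cache.get("q=domain:example.com"))
```

With the example values, a cached response would be reused for up to a day (86400 seconds), which suits an index that changes only when new WARC files are indexed.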

When clicking a link and opening playback in a new tab, the browser URL will match the crawl time of the HTML page.

The file location of the two property files solrwayback.properties and solrwaybackweb.properties can be configured so they do not have
to be in the HOME directory.
To change the location, copy this file: https://github.com/netarchivesuite/solrwayback/blob/master/src/main/webapp/META-INF/context.xml
to the folder '/apache-tomcat-8.5.60/conf/Catalina/localhost' and rename it to solrwayback.xml.
Uncomment the environment variables and edit the location of the files. During startup of the Tomcat server, the
values will be logged in solrwayback.log.

Updated the README.md with more information about scaling and using SolrWayback in production.

See full changelog: https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

SolrWayback bundle 4.1.1

11 Jun 11:46

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still Java 8 as well).

Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.1.1/solrwayback_package_4.1.1.zip

Changes since 4.1.0:
Added a better parallel indexing script for Linux/macOS with more options. (warc-indexer.sh)
With warc-indexer.sh you can define the number of threads. It keeps track of already indexed WARC files, so you can start it again after adding new WARC files to the folder.
Example: THREADS=20 ./warc-indexer.sh warcs1

The file location of the two property files solrwayback.properties and solrwaybackweb.properties can be configured so they do not have
to be in the HOME directory.
To change the location, copy this file: https://github.com/netarchivesuite/solrwayback/blob/master/src/main/webapp/META-INF/context.xml
to the folder '/apache-tomcat-8.5.60/conf/Catalina/localhost' and rename it to solrwayback.xml.
Uncomment the environment variables and edit the location of the files. During startup of the Tomcat server, the
values will be logged in solrwayback.log.

Updated the README.md with more information about scaling and using SolrWayback in production.

See full changelog: https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

SolrWayback bundle 4.1.0

03 May 11:52

The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still Java 8 as well).

Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.1.0/solrwayback_package_4.1.0.zip

Unzip the folder and read the README.md file and follow the instructions.

Changes since 4.0.6:
Indexing scripts updated
Introduced JavascriptPlayback class. Does nothing but handle brotli, but can later be improved to do url-replacement in javascript files.
Brotli encoding fix for javascript.
Fixed chunked transfer encoding error when HTTP header declared it was chunked, but was not.
New optional properties can be added to solrwaybackweb.properties to limit maximum number of export results for CSV/WARC.
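
The chunked transfer encoding fix in the list above concerns responses whose header says `Transfer-Encoding: chunked` although the stored body is not chunked. A tolerant reader can try to decode and fall back to the raw bytes. The Python sketch below illustrates that idea only; it is not the project's code, and the function names are hypothetical.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body; raise ValueError if it is not validly chunked."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        # Chunk extensions after ';' are ignored.
        size_text = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_text, 16)  # raises ValueError on non-hex input
        if size < 0:
            raise ValueError("negative chunk size")
        pos = line_end + 2
        if size == 0:
            # Last chunk reached; trailers, if any, are ignored.
            return bytes(out)
        chunk = body[pos:pos + size]
        if len(chunk) != size or body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("truncated or malformed chunk")
        out += chunk
        pos += size + 2


def body_for_playback(body: bytes, declared_chunked: bool) -> bytes:
    """Return the decoded body, or the raw bytes if the chunked header was wrong."""
    if not declared_chunked:
        return body
    try:
        return decode_chunked(body)
    except ValueError:
        return body


if __name__ == "__main__":
    print(body_for_playback(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n", True))
    print(body_for_playback(b"<html>not chunked</html>", True))
```

The fallback is a heuristic: a non-chunked body that happens to start with a valid chunk-size line would still be misread, so a production implementation may need stricter checks.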