A CKAN extension to help people discover your datasets
Open data can only be used for awesome things if it is found. CKAN has powerful built-in search tools, but for users who don't know (yet) what exactly they are looking for there is room for improvement.
This extension provides multiple plugins that make it easier for your users to discover your data:
- search_suggestions: Provides real-time search suggestions in a drop-down box
- similar_datasets: Adds a list of similar datasets to the dataset detail view
- solr_query_config: Allows you to easily override the parameters that CKAN passes to Solr
- tag_cloud: Replaces the list of frequent tags on the home page with a tag cloud that shows the most popular tags scaled according to their popularity
ckanext-discovery has been developed and tested with CKAN 2.6. Other versions may or may not work, any feedback in that regard is very welcome.
First activate your CKAN virtualenv:
. /usr/lib/ckan/default/bin/activate
Then install the latest version of ckanext-discovery:
pip install -e git+https://github.com/stadt-karlsruhe/ckanext-discovery#egg=ckanext-discovery
In a production environment you'll probably want to install a specific release version instead:
pip install -e git+https://github.com/stadt-karlsruhe/[email protected]#egg=ckanext-discovery
The extension is now installed. However, you still need to install and configure those of its plugins that you want to use. See the documentations of the individual plugins below for more information.
This plugin doesn't provide any direct functionallity on its own but contains shared utilities required by some of the other plugins.
This plugin provides real-time search suggestions while the user is entering text into a search field:
First add discovery
and search_suggestions
to the list of plugins in
CKAN's configuration INI:
plugins = ... discovery search_suggestions
Then create the necessary database tables:
. /usr/lib/ckan/default/bin/activate paster --plugin=ckanext-discovery search_suggestions init -c /etc/ckan/default/production.ini
Finally, restart CKAN:
sudo service apache2 restart
The suggestions are generated from previous searches by other users. The plugin therefore combines two functionalities:
- Automatic collection of statistics about the users' search queries
- Generation and display of search suggestions
Regarding the first point it is important to point out that search queries are stored in a post-processed form and without any reference to the corresponding user.
Search suggestions are automatically enabled on all search fields that CKAN
provides in its base templates, namely the search box on the home page, the
search field in the header, and the search box at /dataset
. In addition,
any HTML element with the class search
will also automatically display
search suggestions.
Obviously suggestions are only provided once some search queries have been recorded.
The plugin offers multiple settings that can be configured via CKAN's configuration INI:
# Maximum number of search suggestions to display in the drop-down # box. Defaults to 4. Note that fewer suggestions may be listed if not # enough related queries have been stored before. ckanext.discovery.search_suggestions.limit = 4 # Whether data about search queries should be stored. Defaults to true. ckanext.discovery.search_suggestions.store_queries = True # Whether search suggestions should be displayed. Defaults to true. ckanext.discovery.search_suggestions.provide_suggestions = True
To achieve good suggestions, search terms entered by the user must be preprocessed, for example by converting them to lower-case. Most of this is done automatically by the plugin. However, there's also the possibility to perform additional custom preprocessing and filtering steps. This is useful, for example, to filter out stop words in your target language (the plugin doesn't do stop word filter at all) or to ignore unwelcome content. The latter is particularly important since search terms entered by one user may end up in the suggestions provided to other users.
To implement your own search term filter and preprocessor, simply implement the
ISearchTermPreprocessor
interface, which contains a single method,
preprocess_search_term
:
import ckan.plugins as plugins from ckanext.discovery.plugins.search_suggestions.interfaces import \ ISearchTermPreprocessor STOP_WORDS = {'and', 'not', 'or'} # We don't want people to talk about dogs here! BAD_WORDS = {'dog', 'puppy'} REJECT_WORDS = STOP_WORDS.union(BAD_WORDS) class MyPlugin(plugins.SingletonPlugin): plugins.implements(ISearchTermPreprocessor) def preprocess_search_term(self, term): ''' Preprocess and filter a search term. ``term`` is a search term extracted from a user's search query. If this method returns a false value then the term is ignored w.r.t. search suggestions. This is useful for filtering stop words and unwelcome content. Otherwise the return value of the method is used instead of the original search term. In most cases you simply return the value unchanged. Note that all of this only affects the generation of the search suggestions but not the search itself. ''' if term in REJECT_WORDS: # Ignore this term return False # Go ahead and use term to calculate search suggestions return term
After adding, removing or changing an ISearchTermPreprocessor
implementation you need to reprocess the previously stored search terms:
. /usr/lib/ckan/default/bin/activate paster --plugin=ckanext-discovery search_suggestions reprocess -c /etc/ckan/default/production.ini
To show all currently stored search terms, use the list
command:
. /usr/lib/ckan/default/bin/activate paster --plugin=ckanext-discovery search_suggestions list -c /etc/ckan/default/production.ini
This plugin displays a list of similar datasets in the sidebar of the dataset view:
The plugin relies on Solr's More Like This feature and requires that you
configure your Solr instance appropriately. In particular, you need to set up a
MoreLikeThisHandler in your /etc/solr/conf/solrconfig.xml
. To do this, add
the following code block directly before the </config>
tag at the end of
the file:
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"> <lst name="defaults"> <int name="mlt.mintf">3</int> <int name="mlt.mindf">1</int> <int name="mlt.minwl">3</int> </lst> </requestHandler>
Please refer to the documentation of the MoreLikeThisHandler for details on its configuration.
In addition, you need to enable term vector storage for the text
field
in your /etc/solr/conf/schema.xml
. To do this, locate the following field
definition:
<field name="text" type="text" indexed="true" stored="false" multiValued="true" />
Then add termVectors="true"
to the list of attributes so that the full
definition looks like this:
<field name="text" type="text" indexed="true" stored="false" multiValued="true" termVectors="true" />
Please note that term vectors can substantially increase the size of your Solr index.
Once you have updated your solrconfig.xml
and schema.xml
files as
described above you need to restart Solr. Assuming you're using Jetty, this
is done via
sudo service jetty restart
Finally you need to re-index your datasets once, so that the term vectors of the existing datasets are stored (for datasets that are added or updated in the future this is done automatically):
. /usr/lib/ckan/default/bin/activate paster --plugin=ckan search-index rebuild -c /etc/ckan/default/production.ini
Now add discovery
and similar_datasets
to your list of plugins in
CKAN's configuration INI:
plugins = ... discovery similar_datasets
After restarting CKAN the list of similar datasets should be displayed on the detailed view of each dataset:
sudo service apache2 restart
The plugin offers one setting that can be configured in CKAN's configuration INI:
# Maximum number of similar datasets to list. Defaults to 5. Note that less # datasets may be shown if Solr doesn't find enough similar datasets. ckanext.discovery.similar_datasets.max_num = 5
This plugin allows you to set Solr query parameters via entries in CKAN's configuration INI. You can either specify a default value for a parameter (which is only used if the parameter isn't already set in the current query) or you can force a parameter to a certain value (overriding it if it is already set).
Simply add solr_query_config
to the list of plugins in CKAN's
configuration INI:
plugins = ... solr_query_config
Then restart CKAN:
sudo service apache2 restart
To specify a default value, prefix the parameter name with
ckanext.discovery.solr_query_config.default.
:
# By default, sort by metadata modification timestamp ckanext.discovery.solr.default.sort = metadata_modified asc
Similarly, a value can be forced using the prefix
ckanext.discovery.solr_query_config.force.
:
# Always use a custom Solr query handler ckanext.discovery.solr.force.defType = my_special_query_handler
Note that only those Solr parameters that are accepted by the package_search API function can be set via this plugin.
This plugin shows links for the most frequent tags scaled according to their frequency:
Simply add discovery
and tag_cloud
to the list of plugins in CKAN's
configuration INI:
plugins = ... discovery tag_cloud
Then restart CKAN:
sudo service apache2 restart
The plugin automatically replaces the list of the most frequent tags on CKAN's default front page with a tag cloud.
If you want to use the tag cloud in a different part of the site you can use the following template snippet:
{% snippet 'ckanext-discovery/snippets/tag_cloud.html', num_tags=10 %}
The num_tags
specifies the number of tags in the tag cloud. It is optional
and defaults to the setting of the ckanext.discovery.tag_cloud.num_tags
option (see below).
The plugin offers one setting that can be configured via CKAN's configuration INI:
# Number of tags to show in the tag cloud. Defaults to 20 and can be # overriden by passing a ``num_tags`` parameter to the tag cloud template # snippet. ckanext.discovery.tag_cloud.num_tags = 20
Copyright (C) 2017 Stadt Karlsruhe (www.karlsruhe.de)
Distributed un der the GNU Affero General Public License. See the file
LICENSE
for details.
- Fix: The
search_suggestions init
paster command no longer deletes all previously stored search terms.
- First release