Installing and Configuring

PyTerrier is a declarative platform for information retrieval experiments in Python. It uses the Java-based Terrier information retrieval platform internally to support indexing and retrieval operations.

Pre-requisites

PyTerrier requires Python 3.8 or newer, and Java 11 or newer. PyTerrier is natively supported on Linux, macOS and Windows.
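
These requirements can be checked from Python before installing. The following is a minimal sketch and not part of PyTerrier: the helper names `parse_java_major` and `installed_java_major` are hypothetical, and the parsing assumes the usual `java -version` banner format (for example `openjdk version "11.0.2"`, or the legacy `1.8.0_281` scheme used by Java 8).

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(banner):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.x (e.g. "1.8.0_281").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


python_ok = sys.version_info >= (3, 8)
java_major = installed_java_major()
java_ok = java_major is not None and java_major >= 11
print(f"Python 3.8+: {python_ok}; Java 11+: {java_ok} (detected: {java_major})")
```

Note that this only inspects the `java` executable on the PATH; a JDK that is reachable only through JAVA_HOME would not be detected by this sketch.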

Installation

PyTerrier can be installed from the command line using pip:

pip install python-terrier

If you want the latest version of PyTerrier, you can install directly from the GitHub repository:

pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

NB: There is no need to have a local installation of the Java component, Terrier. PyTerrier will download the latest release on startup.

Installation Troubleshooting

We aim to ensure that pre-compiled binaries are available for any dependencies with native components, for all supported Python versions and for all major platforms (Linux, macOS, Windows). One notable exception is M1 (and later) Macs, as there are no freely available GitHub Actions runners for M1. Installing on these machines may require compiling some dependencies.

If the installation fails because the pyautocorpus build did not run successfully, you may need to install pcre on your machine.

macOS:

brew install pcre

Linux (Debian or Ubuntu):

apt-get update -y
apt-get install libpcre3-dev -y

Configuration

You must always start by importing PyTerrier and running init():

import pyterrier as pt
pt.init()

PyTerrier uses PyJnius as a “glue” layer in order to call Terrier’s Java classes. PyJnius will search the usual places on your machine for a Java installation. If you have problems, set the JAVA_HOME environment variable:

import os
os.environ["JAVA_HOME"] = "/path/to/my/jdk"
import pyterrier as pt
pt.init()

pt.init() has many options, for instance to make PyTerrier more notebook friendly or to change the underlying version of Terrier, as described below.

For users with an M1 Mac or later models, it is necessary to install SSL certificates to avoid certificate errors. To do this, locate the Install Certificates.command file within the Applications/Python [version] directory and double-click it to run the installation.

API Reference

All usages of PyTerrier start by importing PyTerrier and starting it using the init() method:

import pyterrier as pt
pt.init()

PyTerrier uses the Java-based Terrier IR platform for indexing and retrieval functionality. Calling pt.init() downloads the Terrier jar file, if necessary, and starts the Java Virtual Machine (JVM). It also configures Terrier so that it can be more easily used from Python, for example by redirecting the stdout and stderr streams and setting the logging level.

The sections below document the methods for starting Terrier from PyTerrier and the ways to change its configuration.

Startup-related methods

pyterrier.init()

This function must be called before Terrier classes and methods can be used. It loads the Terrier .jar file and imports classes. It also finds the correct version of Terrier to download if no version is specified.

Parameters:

version (str) –
Which version of Terrier to download. Default is None.
- If None, find the newest Terrier released version in MavenCentral and download it.
- If “snapshot”, will download the latest build from Jitpack.
mem (str) – Maximum memory allocated for the Java virtual machine heap in MB. Corresponds to java -Xmx commandline argument. Default is 1/4 of physical memory.
boot_packages (list(str)) – Extra Maven package coordinates to load before starting Java. Default=`[]`. There is more information about loading packages in the Terrier documentation.
packages (list(str)) – Extra Maven package coordinates to load, using the Terrier classloader. Default=`[]`. See also boot_packages above.
jvm_opts (list(str)) – Extra options to pass to the JVM. Default=`[]`. For instance, you may enable Java assertions by setting jvm_opts=['-ea'].
redirect_io (boolean) – If True, the Java System.out and System.err will be redirected to Python's sys.stdout and sys.stderr. Default=True.
logging (str) –
the logging level to use:
- Can be one of ‘INFO’, ‘DEBUG’, ‘TRACE’, ‘WARN’, ‘ERROR’. The last of these is the quietest.
- Default is ‘WARN’.
home_dir (str) – The home directory to use. Defaults to the value of the PYTERRIER_HOME environment variable.
tqdm – The tqdm instance to use for progress bars within PyTerrier. Defaults to tqdm.tqdm. Available options are ‘tqdm’, ‘auto’ or ‘notebook’.
helper_version (str) – Which version of the Terrier Python helper .jar (terrier-python-helper) to use.
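
As a sketch of how several of these parameters combine, the hypothetical helper `build_init_kwargs` below assembles keyword arguments for init(). Only parameter names from the list above are used. The chosen values (a 4096 MB heap, assertions enabled when debugging) are illustrative assumptions rather than recommendations.

```python
def build_init_kwargs(mem_mb=4096, in_notebook=False, debug=False):
    """Assemble keyword arguments for pt.init() (hypothetical helper)."""
    kwargs = {
        # Heap size in MB; the parameter list documents this as a str.
        "mem": str(mem_mb),
        # Notebook-style progress bars when running inside Jupyter or Colab.
        "tqdm": "notebook" if in_notebook else "tqdm",
        # 'WARN' is the documented default; 'DEBUG' is much noisier.
        "logging": "DEBUG" if debug else "WARN",
    }
    if debug:
        # Enable Java assertions, as in the jvm_opts description above.
        kwargs["jvm_opts"] = ["-ea"]
    return kwargs


def start_pyterrier(**options):
    """Start PyTerrier with the assembled options unless already started."""
    import pyterrier as pt  # imported lazily so build_init_kwargs has no dependencies

    if not pt.started():
        pt.init(**build_init_kwargs(**options))
    return pt
```

Calling `start_pyterrier(in_notebook=True)` would then start the JVM with notebook progress bars, and do nothing if PyTerrier is already running.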

Locating the Terrier .jar file: PyTerrier is not tied to a specific version of Terrier and will automatically locate and download a recent Terrier .jar file. However, inevitably, some functionalities will require more recent Terrier versions.

If set, PyTerrier uses the version init kwarg to determine the .jar file to look for.

If the version init kwarg is not set, PyTerrier will query MavenCentral to determine the latest Terrier release.

If version is set to “snapshot”, the latest .jar file build derived from the Terrier Github repository will be downloaded from Jitpack.

Otherwise the local (~/.mvn) and MavenCentral repositories are searched for the jar file at the given version.

In this way, the default setting is to download the latest release of Terrier from MavenCentral. The user is also able to use a locally installed copy in their private Maven repository, or track the latest build of Terrier from Jitpack.
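
The lookup rules above can be summarised as a small decision function. This is only an illustration of the documented behaviour, not PyTerrier's actual resolution code, and `describe_jar_source` is a hypothetical name.

```python
def describe_jar_source(version=None):
    """Describe where the Terrier .jar would come from for a `version` value."""
    if version is None:
        # Default: ask MavenCentral which release is the newest.
        return "latest release, looked up on MavenCentral"
    if version == "snapshot":
        # Track the latest build of the Terrier GitHub repository.
        return "latest build of the GitHub repository, from Jitpack"
    # Any other value is treated as an explicit version number.
    return f"version {version}, from the local Maven repository or MavenCentral"


for requested in (None, "snapshot", "5.5"):
    print(f"version={requested!r}: {describe_jar_source(requested)}")
```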

If you wish to run PyTerrier in an offline environment, you should ensure that “terrier-assemblies-{your version}-jar-with-dependencies.jar” and “terrier-python-helper-{your helper version}.jar” are in the “~/.pyterrier” directory (if they are not present, they will be downloaded the first time). Then set both versions explicitly when calling init(). For example: pt.init(version="5.5", helper_version="0.0.6").
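
Before going offline, it can be useful to confirm that both files are in place. The sketch below builds the two file names from the patterns quoted above and reports any that are missing. The function names are hypothetical, the version numbers are simply those from the example, and the default location assumes that the PyTerrier home directory has not been moved via PYTERRIER_HOME or home_dir.

```python
from pathlib import Path


def expected_offline_jars(version, helper_version, home=None):
    """Return the two .jar paths that the offline instructions refer to."""
    home = Path(home) if home is not None else Path.home() / ".pyterrier"
    return [
        home / f"terrier-assemblies-{version}-jar-with-dependencies.jar",
        home / f"terrier-python-helper-{helper_version}.jar",
    ]


def missing_offline_jars(version, helper_version, home=None):
    """List the expected .jar files that are not present on disk."""
    jars = expected_offline_jars(version, helper_version, home)
    return [jar for jar in jars if not jar.is_file()]


missing = missing_offline_jars("5.5", "0.0.6")
if missing:
    print("Missing:", [jar.name for jar in missing])
else:
    print("Both .jar files found; ready for offline use")
```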

pyterrier.started()

Returns True if init() has already been called, False otherwise. Typical usage:

import pyterrier as pt
if not pt.started():
    pt.init()

pyterrier.run(): Runs a Terrier executable class, i.e. one that can be accessed from the bin/terrier command-line programme.

Methods to change PyTerrier configuration

pyterrier.extend_classpath(): Adds packages to Terrier’s classpath after the JVM has started.

pyterrier.logging()

Sets the logging level. Equivalent to setting the logging parameter of init(). The following string values are allowed, corresponding to Java logging levels:

‘ERROR’: only show error messages

‘WARN’: only show warnings and error messages (default)

‘INFO’: show information, warnings and error messages

‘DEBUG’: show debugging, information, warnings and error messages
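
One possible use is to raise the verbosity only around a problematic step. The context manager below is a hypothetical helper, not part of PyTerrier. It takes the imported module as an argument and relies only on the `logging(level)` call described above. Since no getter for the current level is documented here, the level to restore is passed in explicitly.

```python
from contextlib import contextmanager


@contextmanager
def temporary_logging(pt_module, level="DEBUG", restore="WARN"):
    """Use a different logging level inside a `with` block (hypothetical helper)."""
    pt_module.logging(level)
    try:
        yield
    finally:
        # 'WARN' is the documented default, so it is used as the fallback
        # level to return to when the block exits (even after an exception).
        pt_module.logging(restore)
```

With PyTerrier imported as `pt` and already started, `with temporary_logging(pt, "INFO"): ...` would show informational messages only for the code inside the block.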

pyterrier.redirect_stdouterr(): Ensures that stdout and stderr have been redirected. Equivalent to setting the redirect_io parameter of init() to True.

pyterrier.set_property()

Sets a property in Terrier’s global properties configuration. Example:

pt.set_property("termpipelines", "")

Terrier has a variety of properties, as discussed in its indexing and retrieval configuration guides, but in PyTerrier we aim to expose Terrier configuration through appropriate methods or arguments. This method should therefore be seen as a safety valve: a way to override Terrier configuration that is not explicitly supported by PyTerrier.
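
When a handful of overrides are needed, they can be kept together in a dictionary and applied in one place. The helper below is a hypothetical sketch that uses only the `set_property(name, value)` call shown above. The "tokeniser" entry is an assumed example of a Terrier property and should be checked against the Terrier configuration guides before use.

```python
def apply_properties(pt_module, properties):
    """Call set_property() once per entry, passing names and values as str."""
    for name, value in properties.items():
        pt_module.set_property(str(name), str(value))


# Illustrative overrides: "termpipelines" is the example used above;
# "tokeniser" and its value are assumed here to be valid Terrier settings.
overrides = {
    "termpipelines": "",
    "tokeniser": "UTFTokeniser",
}
```

With PyTerrier imported as `pt` and started, `apply_properties(pt, overrides)` would apply both settings.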

pyterrier.set_properties(): Sets many properties in Terrier’s global properties configuration.

pyterrier.set_tqdm()

Sets the tqdm progress bar type that PyTerrier will use internally. Many PyTerrier transformations can be expensive to apply in some settings; users can view progress by passing the verbose=True kwarg to many classes, such as BatchRetrieve.

The tqdm progress bar can be made prettier when using appropriately configured Jupyter notebook setups. We use this automatically when Google Colab is detected.

Allowable options for type are:

‘tqdm’: corresponds to the standard text progress bar, as in from tqdm import tqdm.

‘notebook’: corresponds to a notebook progress bar, as in from tqdm.notebook import tqdm.

‘auto’: allows tqdm to decide on the progress bar type, as in from tqdm.auto import tqdm. Note that this works fine on Google Colab, but not on Jupyter unless ipywidgets has been installed.
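
Given the ipywidgets caveat, one conservative approach is to request ‘auto’ only when ipywidgets can be imported and to fall back to the text progress bar otherwise. This is a heuristic sketch under that assumption, and `choose_tqdm_type` is a hypothetical helper rather than a PyTerrier function.

```python
import importlib.util


def choose_tqdm_type():
    """Pick 'auto' only when ipywidgets is importable, else plain 'tqdm'."""
    has_widgets = importlib.util.find_spec("ipywidgets") is not None
    return "auto" if has_widgets else "tqdm"


bar_type = choose_tqdm_type()
print(f"Would call pt.set_tqdm({bar_type!r})")
```

With PyTerrier imported as `pt`, the chosen value would then be passed on as `pt.set_tqdm(bar_type)`.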