4 releases

0.1.3 Nov 2, 2024
0.1.2 Oct 26, 2024
0.1.1 Oct 26, 2024
0.1.0 Oct 26, 2024

MIT/Apache


elinor-cli


elinor-cli is a set of command-line tools for evaluating information retrieval (IR) systems:

  • elinor-evaluate evaluates the ranking metrics of the system.
  • elinor-compare compares the metrics of multiple systems with statistical tests.
  • elinor-convert converts the TREC format into the JSONL format for elinor-evaluate.

Installation

Install from crates.io with cargo:

cargo install elinor-cli

Ubiquitous language

Elinor uses the following terms for convenience:

  • True relevance score means the relevance judgment provided by human assessors.
  • Predicted relevance score means the similarity score predicted by the system.

elinor-evaluate

elinor-evaluate evaluates the ranking metrics of the system.

Input format

elinor-evaluate requires two JSONL files of true and predicted relevance scores. Each line in the JSONL file should be a JSON object with the following fields:

  • query_id: The ID of the query.
  • doc_id: The ID of the document.
  • score: The relevance score of the query-document pair.
    • If it is a true one, the score should be a non-negative integer (e.g., 0, 1, 2).
    • If it is a predicted one, the score can be a float (e.g., 0.1, 0.5, 1.0).

An example of the JSONL file for the true relevance scores is:

{"query_id":"q_1","doc_id":"d_1","score":2}
{"query_id":"q_1","doc_id":"d_7","score":0}
{"query_id":"q_2","doc_id":"d_3","score":2}

An example of the JSONL file for the predicted relevance scores is:

{"query_id":"q_1","doc_id":"d_1","score":0.65}
{"query_id":"q_1","doc_id":"d_4","score":0.23}
{"query_id":"q_2","doc_id":"d_3","score":0.48}

The specifications are:

  • There is no need to sort the lines in the JSONL files.
  • The query-document pairs should be unique in each file.
  • The query IDs in the true and predicted files should be the same.
  • For binary metrics (e.g., Precision, Recall, F1), true relevance scores greater than 0 are considered relevant.

Sample JSONL files are available in the test-data/sample directory.
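
The rules above can be checked before running the tool. The following is a minimal Python sketch, not part of elinor-cli: the helper names load_jsonl and check_inputs are hypothetical, and the checks only mirror the specifications listed in this section, using the two example files shown above.

```python
import json

# The example files from this section, inlined as strings.
TRUE_JSONL = """{"query_id":"q_1","doc_id":"d_1","score":2}
{"query_id":"q_1","doc_id":"d_7","score":0}
{"query_id":"q_2","doc_id":"d_3","score":2}"""

PRED_JSONL = """{"query_id":"q_1","doc_id":"d_1","score":0.65}
{"query_id":"q_1","doc_id":"d_4","score":0.23}
{"query_id":"q_2","doc_id":"d_3","score":0.48}"""


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def check_inputs(true_records, pred_records):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    # Query-document pairs must be unique within each file.
    for name, records in (("true", true_records), ("pred", pred_records)):
        seen = set()
        for rec in records:
            pair = (rec["query_id"], rec["doc_id"])
            if pair in seen:
                problems.append(f"{name}: duplicate pair {pair}")
            seen.add(pair)
    # True relevance scores must be non-negative integers.
    for rec in true_records:
        score = rec["score"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or score < 0:
            problems.append(f"true: score {score!r} is not a non-negative integer")
    # Both files must cover the same set of query IDs.
    true_qids = {rec["query_id"] for rec in true_records}
    pred_qids = {rec["query_id"] for rec in pred_records}
    if true_qids != pred_qids:
        problems.append(f"query IDs differ: {sorted(true_qids ^ pred_qids)}")
    return problems


print(check_inputs(load_jsonl(TRUE_JSONL), load_jsonl(PRED_JSONL)))
```

For the two example files the list of problems is empty. How elinor-evaluate itself reports violations is not covered here.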

Example usage

Here is an example that uses the sample JSONL files in the test-data/sample directory.

If you want to evaluate the Precision@3, Average Precision (AP), Reciprocal Rank (RR), and nDCG@3 metrics, run:

elinor-evaluate \
  --true-jsonl test-data/sample/true.jsonl \
  --pred-jsonl test-data/sample/pred_1.jsonl \
  --metrics precision@3 ap rr ndcg@3

The available metrics are listed in the Metric documentation.

The output will show several basic statistics and the macro-averaged scores for each metric:

n_queries_in_true  8
n_queries_in_pred  8
n_docs_in_true     20
n_docs_in_pred     24
n_relevant_docs    14
precision@3        0.5833
ap                 0.8229
rr                 0.8125
ndcg@3             0.8286
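
To make the reported numbers concrete, here is a minimal Python sketch of how Precision@k and RR can be computed per query and then macro-averaged. It uses the textbook definitions and the small three-line example files from the Input format section (not the files in test-data/sample). The function names and the tie-breaking rule are assumptions made for illustration, so elinor's own implementation may differ in such details.

```python
# True and predicted relevance scores from the Input format examples,
# grouped by query ID.
TRUE = {"q_1": {"d_1": 2, "d_7": 0}, "q_2": {"d_3": 2}}
PRED = {"q_1": {"d_1": 0.65, "d_4": 0.23}, "q_2": {"d_3": 0.48}}


def rank_docs(pred_scores):
    """Sort doc IDs by descending predicted score (ties broken by doc ID,
    which is an assumption of this sketch)."""
    return [d for d, _ in sorted(pred_scores.items(), key=lambda kv: (-kv[1], kv[0]))]


def precision_at_k(ranking, true_scores, k):
    """Fraction of the top-k positions holding a relevant document.
    A document is relevant if its true score is greater than 0;
    documents without a judgment count as non-relevant."""
    hits = sum(1 for d in ranking[:k] if true_scores.get(d, 0) > 0)
    return hits / k


def reciprocal_rank(ranking, true_scores):
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, d in enumerate(ranking, start=1):
        if true_scores.get(d, 0) > 0:
            return 1.0 / rank
    return 0.0


def evaluate(true, pred, k):
    """Return per-query scores and their macro-average (mean over queries)."""
    per_query = {}
    for qid in sorted(true):
        ranking = rank_docs(pred[qid])
        per_query[qid] = {
            f"precision@{k}": precision_at_k(ranking, true[qid], k),
            "rr": reciprocal_rank(ranking, true[qid]),
        }
    names = list(next(iter(per_query.values())))
    macro = {m: sum(s[m] for s in per_query.values()) / len(per_query) for m in names}
    return per_query, macro


per_query, macro = evaluate(TRUE, PRED, k=3)
print(per_query)
print(macro)
```

In this toy example each query has one relevant document ranked first, so Precision@3 is 1/3 and RR is 1 for both queries, and the macro-averages take the same values.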

The detailed results can be saved to a CSV file by specifying the --output-csv option:

elinor-evaluate \
  --true-jsonl test-data/sample/true.jsonl \
  --pred-jsonl test-data/sample/pred_1.jsonl \
  --output-csv test-data/sample/pred_1.csv \
  --metrics precision@3 ap rr ndcg@3

The CSV file will contain the scores for each query:

query_id,precision@3,ap,rr,ndcg@3
q_1,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_2,0.6666666666666666,1.0,1.0,0.8597186998521972
q_3,0.6666666666666666,0.5833333333333333,0.5,0.6199062332840657
q_4,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_5,0.3333333333333333,1.0,1.0,1.0
q_6,0.6666666666666666,0.8333333333333333,1.0,0.9502344167898356
q_7,0.3333333333333333,1.0,1.0,1.0
q_8,0.6666666666666666,1.0,1.0,0.8597186998521972
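
The macro-averaged scores printed by elinor-evaluate are the column means of this CSV. The short Python sketch below (not part of elinor-cli; column_means is a hypothetical helper) recomputes them from the rows shown above.

```python
import csv
import io

# The per-query CSV shown above, inlined as a string.
CSV_TEXT = """query_id,precision@3,ap,rr,ndcg@3
q_1,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_2,0.6666666666666666,1.0,1.0,0.8597186998521972
q_3,0.6666666666666666,0.5833333333333333,0.5,0.6199062332840657
q_4,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_5,0.3333333333333333,1.0,1.0,1.0
q_6,0.6666666666666666,0.8333333333333333,1.0,0.9502344167898356
q_7,0.3333333333333333,1.0,1.0,1.0
q_8,0.6666666666666666,1.0,1.0,0.8597186998521972
"""


def column_means(csv_text):
    """Average every metric column; the first column is treated as the ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    _id_column, *metrics = reader.fieldnames
    rows = list(reader)
    return {m: sum(float(row[m]) for row in rows) / len(rows) for m in metrics}


for metric, mean in column_means(CSV_TEXT).items():
    print(f"{metric}\t{mean:.4f}")
```

The means agree with the precision@3, ap, rr, and ndcg@3 values in the summary output above.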

The CSV files can be passed to elinor-compare to compare the metrics of multiple systems.

elinor-compare

elinor-compare compares the metrics of multiple systems with statistical tests.

This tool supports several statistical tests and reports various statistics for in-depth analysis. It is not limited to IR systems: it can compare any systems that are evaluated with per-topic metric scores.

Input format

elinor-compare requires multiple CSV files that contain the scores of the metrics for each query, such as the output of elinor-evaluate.

Specifically, the CSV files should have the following columns:

  • topic_id: The ID of the topic (e.g., query).
    • The column name is arbitrary.
    • The column name must be the same across the CSV files.
    • The topic IDs should be the same across the CSV files.
  • metric_1, metric_2, ...: The scores of the metrics for the query.
    • The column names are the metric names.
    • The column names should be the same across the CSV files.
    • The metric scores should be floats.

Sample CSV files are available in the test-data/sample directory.
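
The requirements above can be checked with a few lines of Python before running the comparison. This is only a sketch: read_scores and check_compatible are hypothetical helpers, the two CSV snippets are made-up data, and treating the first column as the topic ID column is an assumption of this example.

```python
import csv
import io

# Hypothetical per-topic scores for two systems.
CSV_A = "query_id,precision@3,ap\nq_1,0.66,0.58\nq_2,0.33,1.0\n"
CSV_B = "query_id,precision@3,ap\nq_2,0.33,0.50\nq_1,0.00,0.25\n"


def read_scores(csv_text):
    """Return (header, topic_ids); the first column is assumed to hold topic IDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = list(reader.fieldnames)
    topic_ids = [row[header[0]] for row in reader]
    return header, topic_ids


def check_compatible(csv_texts):
    """Compare every file against the first one and return a list of problems."""
    problems = []
    first_header, first_topics = read_scores(csv_texts[0])
    for index, text in enumerate(csv_texts[1:], start=2):
        header, topics = read_scores(text)
        if header[0] != first_header[0]:
            problems.append(f"file {index}: topic column name differs")
        if set(header[1:]) != set(first_header[1:]):
            problems.append(f"file {index}: metric columns differ")
        if set(topics) != set(first_topics):
            problems.append(f"file {index}: topic IDs differ")
    return problems


print(check_compatible([CSV_A, CSV_B]))
```

The two snippets list the topics in a different order but are otherwise compatible, so no problems are reported.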

Example usage: Comparing two systems

Here is an example that uses the sample CSV files in the test-data/sample directory.

If you want to compare the metrics of two systems, run:

elinor-compare \
  --input-csvs test-data/sample/pred_1.csv \
  --input-csvs test-data/sample/pred_2.csv

The output will be:

# Basic statistics
+-----------+-------+
| Key       | Value |
+-----------+-------+
| n_systems | 2     |
| n_topics  | 8     |
| n_metrics | 4     |
+-----------+-------+

# Alias
+----------+-----------------------------+
| Alias    | Path                        |
+----------+-----------------------------+
| System_1 | test-data/sample/pred_1.csv |
| System_2 | test-data/sample/pred_2.csv |
+----------+-----------------------------+

# Means
+-------------+----------+----------+
| Metric      | System_1 | System_2 |
+-------------+----------+----------+
| precision@3 | 0.5833   | 0.2917   |
| ap          | 0.8229   | 0.4479   |
| rr          | 0.8125   | 0.5625   |
| ndcg@3      | 0.8286   | 0.4649   |
+-------------+----------+----------+

# Two-sided paired Student's t-test for (System_1 - System_2)
+-------------+--------+--------+--------+--------+---------+---------+
| Metric      | Mean   | Var    | ES     | t-stat | p-value | 95% MOE |
+-------------+--------+--------+--------+--------+---------+---------+
| precision@3 | 0.2917 | 0.0774 | 1.0485 | 2.9656 | 0.0209  | 0.2326  |
| ap          | 0.3750 | 0.1012 | 1.1789 | 3.3343 | 0.0125  | 0.2659  |
| rr          | 0.2500 | 0.0714 | 0.9354 | 2.6458 | 0.0331  | 0.2234  |
| ndcg@3      | 0.3637 | 0.1026 | 1.1356 | 3.2119 | 0.0148  | 0.2677  |
+-------------+--------+--------+--------+--------+---------+---------+

# Two-sided paired Bootstrap test (n_resamples = 10000)
+-------------+---------+
| Metric      | p-value |
+-------------+---------+
| precision@3 | 0.0240  |
| ap          | 0.0292  |
| rr          | 0.0602  |
| ndcg@3      | 0.0283  |
+-------------+---------+

# Fisher's randomized test (n_iters = 10000)
+-------------+---------+
| Metric      | p-value |
+-------------+---------+
| precision@3 | 0.0596  |
| ap          | 0.0657  |
| rr          | 0.1248  |
| ndcg@3      | 0.0612  |
+-------------+---------+
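
The columns of the paired t-test table are related by the standard formulas for a paired test on per-topic score differences. The Python sketch below recomputes ES, t-stat, and 95% MOE from the Mean and Var of the precision@3 row. It is an illustration, not elinor's code: the function names are hypothetical, the critical value of Student's t is hard-coded for 7 degrees of freedom, and reading Var as the unbiased sample variance of the differences is an assumption.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 7 degrees of freedom
# (n_topics - 1 = 8 - 1), hard-coded to avoid a SciPy dependency.
T_CRIT_DF7 = 2.3646


def summarize(mean, var, n, t_crit):
    """Derive effect size, t statistic, and margin of error from the mean
    and variance of n paired differences."""
    std_err = math.sqrt(var / n)
    return {
        "es": mean / math.sqrt(var),   # mean difference in standard deviations
        "t_stat": mean / std_err,
        "moe": t_crit * std_err,       # half-width of the confidence interval
    }


def summarize_differences(diffs, t_crit):
    """Same statistics, starting from per-topic differences (System_1 - System_2).
    Assumes the unbiased (n - 1) sample variance."""
    return summarize(statistics.mean(diffs), statistics.variance(diffs), len(diffs), t_crit)


# Mean and Var of the precision@3 row, with n_topics = 8.
for name, value in summarize(mean=0.2917, var=0.0774, n=8, t_crit=T_CRIT_DF7).items():
    print(f"{name}\t{value:.4f}")
```

Because the inputs are the rounded table values, the results agree with the ES, t-stat, and 95% MOE columns only up to rounding.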

See the following documentation for more details about the statistical tests:

Example usage: Comparing three systems

If you want to compare the metrics of three (or more) systems, run:

elinor-compare \
  --input-csvs test-data/sample/pred_1.csv \
  --input-csvs test-data/sample/pred_2.csv \
  --input-csvs test-data/sample/pred_3.csv

The output will be:

# Basic statistics
+-----------+-------+
| Key       | Value |
+-----------+-------+
| n_systems | 3     |
| n_topics  | 8     |
| n_metrics | 4     |
+-----------+-------+

# Alias
+----------+-----------------------------+
| Alias    | Path                        |
+----------+-----------------------------+
| System_1 | test-data/sample/pred_1.csv |
| System_2 | test-data/sample/pred_2.csv |
| System_3 | test-data/sample/pred_3.csv |
+----------+-----------------------------+

# precision@3
## System means
+----------+--------+---------+
| System   | Mean   | 95% MOE |
+----------+--------+---------+
| System_1 | 0.5833 | 0.1498  |
| System_2 | 0.2917 | 0.1498  |
| System_3 | 0.4167 | 0.1498  |
+----------+--------+---------+
## Two-way ANOVA without replication
+-----------------+------------+----+----------+--------+---------+
| Factor          | Variation  | DF | Variance | F-stat | p-value |
+-----------------+------------+----+----------+--------+---------+
| Between-systems | 0.3426     | 2  | 0.1713   | 4.3898 | 0.0331  |
| Between-topics  | 0.3287     | 7  | 0.0470   | 1.2034 | 0.3623  |
| Residual        | 0.5463     | 14 | 0.0390   |        |         |
+-----------------+------------+----+----------+--------+---------+
## Effect sizes for Tukey HSD test
+----------+----------+----------+----------+
| ES       | System_1 | System_2 | System_3 |
+----------+----------+----------+----------+
| System_1 | 0.0000   | 1.4765   | 0.8437   |
| System_2 | -1.4765  | 0.0000   | -0.6328  |
| System_3 | -0.8437  | 0.6328   | 0.0000   |
+----------+----------+----------+----------+
## p-values for randomized Tukey HSD test (n_iters = 10000)
+----------+----------+----------+----------+
| p-value  | System_1 | System_2 | System_3 |
+----------+----------+----------+----------+
| System_1 | 1.0000   | 0.0248   | 0.2511   |
| System_2 | 0.0248   | 1.0000   | 0.6557   |
| System_3 | 0.2511   | 0.6557   | 1.0000   |
+----------+----------+----------+----------+

(The statistics for the other metrics will be shown as well.)
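
For readers who want to see where the ANOVA rows come from, here is a minimal Python sketch of a textbook two-way ANOVA without replication on a topics-by-systems score matrix. It is not elinor's implementation: two_way_anova is a hypothetical helper, the score matrix is made-up data, and p-values are omitted because they require the F distribution.

```python
def two_way_anova(scores):
    """scores[i][j] is the score of system j on topic i.
    Returns variation (sum of squares), DF, variance (mean square), and F-stat."""
    n_topics = len(scores)
    n_systems = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_topics * n_systems)
    system_means = [sum(row[j] for row in scores) / n_topics for j in range(n_systems)]
    topic_means = [sum(row) / n_systems for row in scores]

    ss_systems = n_topics * sum((m - grand) ** 2 for m in system_means)
    ss_topics = n_systems * sum((m - grand) ** 2 for m in topic_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_systems - ss_topics

    df_systems = n_systems - 1
    df_topics = n_topics - 1
    df_residual = df_systems * df_topics
    ms_residual = ss_residual / df_residual

    def factor(ss, df):
        variance = ss / df
        return {"variation": ss, "df": df, "variance": variance,
                "f_stat": variance / ms_residual}

    return {
        "between_systems": factor(ss_systems, df_systems),
        "between_topics": factor(ss_topics, df_topics),
        "residual": {"variation": ss_residual, "df": df_residual,
                     "variance": ms_residual},
    }


# Hypothetical scores: 4 topics (rows) by 3 systems (columns).
SCORES = [
    [0.9, 0.4, 0.6],
    [0.7, 0.5, 0.5],
    [0.8, 0.2, 0.7],
    [0.6, 0.3, 0.4],
]

for name, row in two_way_anova(SCORES).items():
    print(name, row)
```

With 3 systems and 8 topics, the same degrees-of-freedom formulas give 2, 7, and 14, as in the table above. In each row of that table, Variance is Variation divided by DF, and F-stat is Variance divided by the residual variance, up to rounding.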

See the following documentation for more details about the statistical tests:

Example usage: Printing the tables in a tab-separated format

If you set --print-mode raw, the tables will be printed in a tab-separated format, enabling you to copy and paste them into a spreadsheet:

elinor-compare \
  --input-csvs test-data/sample/pred_1.csv \
  --input-csvs test-data/sample/pred_2.csv \
  --print-mode raw

elinor-convert

elinor-convert converts the TREC format into the JSONL format for elinor-evaluate.

For Qrels files:

elinor-convert \
  --input-trec qrels.trec \
  --output-jsonl qrels.jsonl \
  --rel-type true

For Run files:

elinor-convert \
  --input-trec run.trec \
  --output-jsonl run.jsonl \
  --rel-type pred
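
As a rough picture of what the conversion involves, the Python sketch below maps lines in the standard TREC layouts to the JSONL records described in the Input format section. It is not elinor-convert's code: the function names and the sample lines (including the run name my_run) are made up, and the qrels layout (query ID, iteration, document ID, relevance) and run layout (query ID, Q0, document ID, rank, score, run name) are the usual whitespace-separated TREC conventions.

```python
import json


def qrels_line_to_record(line):
    """TREC qrels line: query_id, iteration, doc_id, relevance."""
    query_id, _iteration, doc_id, relevance = line.split()
    return {"query_id": query_id, "doc_id": doc_id, "score": int(relevance)}


def run_line_to_record(line):
    """TREC run line: query_id, Q0, doc_id, rank, score, run_name."""
    query_id, _q0, doc_id, _rank, score, _run_name = line.split()
    return {"query_id": query_id, "doc_id": doc_id, "score": float(score)}


def to_jsonl(lines, convert):
    """Convert non-empty lines and serialize them as compact JSONL."""
    records = (convert(line) for line in lines if line.strip())
    return "\n".join(json.dumps(rec, separators=(",", ":")) for rec in records)


# Hypothetical input lines.
QRELS = ["q_1 0 d_1 2", "q_1 0 d_7 0"]
RUN = ["q_1 Q0 d_1 1 0.65 my_run", "q_1 Q0 d_4 2 0.23 my_run"]

print(to_jsonl(QRELS, qrels_line_to_record))
print(to_jsonl(RUN, run_line_to_record))
```

Only the query ID, document ID, and score are carried over. The rank and run name of a run file have no counterpart in the JSONL format described earlier.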

Licensing

Licensed under either the MIT license or the Apache license, at your option.
