TASS-2018-Task 3. eHealth Knowledge Discovery

Overall evaluation

There will be three evaluation scenarios:

Scenario 1: Only plain text is given (Subtasks A, B, C).

In this first scenario, participants will perform the three subtasks consecutively and provide the corresponding output files. The only input provided is a set of plain text files (input_<topic>.txt) for a list of topics that were not released with the training data.

Systems will be ranked according to an aggregated F1 metric computed over the three subtasks, with precision and recall defined as follows:

$$ precision(A,B,C) = \frac{Correct(A) + Correct(B) + Correct(C) + \frac{1}{2} Partial(A)}{Spurious(A) + Partial(A) + Correct(A) + Correct(B) + Incorrect(B) + Spurious(C) + Correct(C) } $$

$$ recall(A,B,C) = \frac{Correct(A) + Correct(B) + Correct(C) + \frac{1}{2} Partial(A)}{Missing(A) + Partial(A) + Correct(A) + Correct(B) + Incorrect(B) + Missing(C) + Correct(C) } $$

$$ F_1(A,B,C) = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$
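For reference, the following sketch (not the official scorer) shows how this aggregated metric can be computed from per-task match counts; the function name and dictionary keys are illustrative only, and in practice the counts come from comparing a system's output files against the gold annotations.

```python
# Minimal sketch of the Scenario 1 aggregated F1 (illustrative, not the
# official scorer). The count dictionaries are hypothetical placeholders.

def scenario1_f1(a, b, c):
    """a, b, c: per-task dicts with (some of) the keys
    'correct', 'partial', 'incorrect', 'missing', 'spurious'."""
    g = lambda d, k: d.get(k, 0)

    hits = g(a, 'correct') + g(b, 'correct') + g(c, 'correct') + 0.5 * g(a, 'partial')
    precision_den = (g(a, 'spurious') + g(a, 'partial') + g(a, 'correct') +
                     g(b, 'correct') + g(b, 'incorrect') +
                     g(c, 'spurious') + g(c, 'correct'))
    recall_den = (g(a, 'missing') + g(a, 'partial') + g(a, 'correct') +
                  g(b, 'correct') + g(b, 'incorrect') +
                  g(c, 'missing') + g(c, 'correct'))

    precision = hits / precision_den if precision_den else 0.0
    recall = hits / recall_den if recall_den else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```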

Scenario 2: Plain text and manually annotated key phrase boundaries are given (Subtasks B, C).

In this second scenario, participants will perform subtasks B and C sequentially and provide the corresponding output files. As input, they receive both the plain text files (input_<topic>.txt) and the corresponding gold files for Task A (output_A_<topic>.txt). The purpose of this scenario is to evaluate the quality of Tasks B and C independently of Task A. As in the previous scenario, an aggregated F1 metric is reported, based on the following precision and recall:

$$ precision(B,C) = \frac{Correct(B) + Correct(C)}{Correct(B) + Incorrect(B) + Spurious(C) + Correct(C) } $$

$$ recall(B,C) = \frac{Correct(B) + Correct(C)}{Correct(B) + Incorrect(B) + Missing(C) + Correct(C) } $$

$$ F_1(B,C) = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$

NOTE: Please make sure to reuse the keyphrase IDs provided in the output for Task A given in the test set. Do not generate your own keyphrase IDs.

Scenario 3: Plain text with manually annotated key phrases and their types are given (Subtask C).

In this scenario, the gold outputs for both Task A and Task B are provided, and participants only need to produce the Task C output files. The purpose of this scenario is to evaluate the quality of Task C independently of the complexity of Tasks A and B. As before, an aggregated F1 metric is reported, based on the following precision and recall:

$$ precision(C) = \frac{Correct(C)}{Spurious(C) + Correct(C) } $$

$$ recall(C) = \frac{Correct(C)}{Missing(C) + Correct(C) } $$

$$ F_1(C) = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$

NOTE: Please make sure to reuse the keyphrase IDs provided in the output for Task A and Task B given in the test set. Do not generate your own keyphrase IDs.

Final Score: The macro-average of the F1 scores of the three scenarios is used as the final score:

$$ F_{final} = \frac{1}{3} F_1(A,B,C) + \frac{1}{3} F_1(B,C) + \frac{1}{3} F_1(C) $$
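The Scenario 2 and 3 metrics and the final macro-average can be sketched in the same style; again, the function and key names below are illustrative and not part of the official scripts.

```python
# Minimal sketch of the Scenario 2 and 3 scores and the final macro-average
# (illustrative, not the official scorer). Count dictionaries use the same
# hypothetical keys as in the Scenario 1 sketch above.

def _f1(hits, precision_den, recall_den):
    precision = hits / precision_den if precision_den else 0.0
    recall = hits / recall_den if recall_den else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def scenario2_f1(b, c):
    hits = b.get('correct', 0) + c.get('correct', 0)
    return _f1(hits,
               b.get('correct', 0) + b.get('incorrect', 0) + c.get('spurious', 0) + c.get('correct', 0),
               b.get('correct', 0) + b.get('incorrect', 0) + c.get('missing', 0) + c.get('correct', 0))

def scenario3_f1(c):
    return _f1(c.get('correct', 0),
               c.get('spurious', 0) + c.get('correct', 0),
               c.get('missing', 0) + c.get('correct', 0))

def final_score(f1_s1, f1_s2, f1_s3):
    # Simple macro-average of the three scenario F1 values.
    return (f1_s1 + f1_s2 + f1_s3) / 3
```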



Baseline implementation

A baseline implementation is provided at https://github.com/TASS18-Task3/data/tree/master/baseline. This implementation simply counts the occurrences of all concepts, actions, and relations in the training data and uses these statistics to match the exact same occurrences in new text. Hence, it can be used as a minimal baseline for the expected score in each evaluation scenario.

Feel free to use and modify this baseline implementation, both for testing the submission process in Codalab and as a template for developing your own implementation if necessary.
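The following is a rough sketch of the memorization idea behind such a baseline, assuming keyphrase annotations are available as (text, label) pairs; the relation part is omitted, and the actual implementation in the repository may differ in its details.

```python
# Sketch of a memorization baseline: store surface forms seen in training
# with their most frequent label and re-apply them on exact matches only.
# (Illustrative only; not the code from the linked repository.)

from collections import Counter, defaultdict

def build_lexicon(training_keyphrases):
    """training_keyphrases: iterable of (text, label) pairs taken from the
    gold annotations, e.g. ('asma', 'Concept')."""
    votes = defaultdict(Counter)
    for text, label in training_keyphrases:
        votes[text.lower()][label] += 1
    # Keep the most frequent label for each surface form.
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}

def tag_exact_matches(sentence, lexicon):
    """Return (start, end, label) spans for every memorized surface form
    that reappears verbatim (case-insensitively) in the sentence."""
    spans = []
    lowered = sentence.lower()
    for text, label in lexicon.items():
        start = lowered.find(text)
        while start != -1:
            spans.append((start, start + len(text), label))
            start = lowered.find(text, start + 1)
    return spans
```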

Evaluation scripts

There are three different ways of obtaining an evaluation. The first option is to use the script score_training.py, which provides a detailed evaluation, including a report of each mistake (i.e., spurious or missing items) for each of the training files. This script is mostly useful during model development.

The second option is to use the script score_test.py, which provides a summarized report of the evaluation metrics described on this page. The script reports precision, recall, and F1 for each scenario, as well as the macro-average of the F1 scores.

The final option is to submit the results to Codalab, in either the training or the testing phase. The reported results will be exactly the same as those output by the score_test.py script. The only difference from option 2 is that during the testing phase participants will not have access to the gold test outputs; hence, the only way to obtain a test score is to submit to Codalab.

More information can be found in the Readme file.