Ryo Nagata, Masato Hagiwara, Kazuaki Hanawa, Masato Mita
Feedback comment generation is a task where, given an input text and positions (where to comment), a system generates hints or explanatory notes (hereafter, feedback comments) useful for language learners.
For example, given a sentence with a position (agrees the):
*He agrees the opinion.
The task is to generate a feedback comment such as:
The verb agree is an intransitive verb and cannot take direct objects. Add the appropriate preposition.
The goal of the task is to generate feedback comments explaining to the writer why the range in question is erroneous, possibly together with related writing rules. Note that merely pointing out an error is not enough in this task (this distinguishes it from grammatical error detection). Likewise, just giving the correct form is not enough (this distinguishes it from grammatical error correction). Also note that the feedback comments in the provided dataset are designed to avoid giving explicit corrections and instead to provide something that prompts the writer to come up with a solution (admittedly, some feedback comments do give explicit corrections).
Since unconstrained feedback comment generation is a difficult task, we impose some constraints in this generation challenge. We limit the target to preposition uses only. It should be emphasized that the target includes missing prepositions, to-infinitives, and deverbal prepositions (e.g., including) among preposition uses. Specifically, participants develop systems that automatically generate feedback comments in response to preposition uses such as the example shown above. We also assume that the target positions are given (for example, the words “agrees the” in the example above).
Input:
An offset is a range denoted by two integers (start character index, end character index; starting from 0) separated by ‘:’. It specifies where to generate a comment. For example, the offset for the sentence:
*He agrees the opinion.
is 3:13.
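To make the offset convention concrete, the target span can be recovered by simple string slicing, treating the end index as exclusive (a minimal sketch in Python; the leading * only marks ungrammaticality and is not part of the input):

    # Recover the target span from an offset range.
    # The end index is treated as exclusive, consistent with 3:13 -> "agrees the".
    sentence = "He agrees the opinion."
    offset = "3:13"
    start, end = (int(i) for i in offset.split(":"))
    print(sentence[start:end])  # -> agrees the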
Output:
One feedback comment per offset range, or the special token <NO_COMMENT>
Task:
Given an input sentence and an offset range, generate an appropriate feedback comment in English. The length of the generated feedback comment should be less than 500 characters, including spaces. The task is to generate a feedback comment for each pair of an English sentence and an offset. The special output token <NO_COMMENT> is allowed, to indicate that the system cannot generate any reliable feedback comment. This allows us to calculate recall, precision, and F1 as explained below.
Training/development data
The file contains one example per line, consisting of:
input sentence [\t] offset range [\t] feedback comment
Fields are separated by tabs (\t).
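For instance, the file can be read as follows (a minimal sketch; the file name train.tsv is illustrative, not prescribed by the task):

    # Load the tab-separated training/development data.
    def load_examples(path):
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                sentence, offset, comment = line.rstrip("\n").split("\t")
                start, end = (int(i) for i in offset.split(":"))
                examples.append((sentence, (start, end), comment))
        return examples

    train = load_examples("train.tsv")  # hypothetical file name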
Test data
The format is basically the same as that of the training/development data, except that each line contains only the first and second fields; the third field (the feedback comment) is missing.
Output (for submission)
The system output should follow the same format as that of the training/development data, where the third field is the system output (the generated feedback comment).
If the system fails to generate a feedback comment for the given sentence and offset, then use the special token <NO_COMMENT>.
The order of the system outputs must be identical to that of the target sentences in the test data.
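A submission file might be produced along the following lines (a sketch only; generate_comment stands in for a participant's system and is hypothetical). Iterating over the test file line by line keeps the output order identical to the input order:

    # Write system outputs in the required three-field format.
    def write_submission(test_path, out_path, generate_comment):
        with open(test_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                sentence, offset = line.rstrip("\n").split("\t")[:2]
                # Fall back to the special token when no reliable comment is produced.
                comment = generate_comment(sentence, offset) or "<NO_COMMENT>"
                fout.write(f"{sentence}\t{offset}\t{comment}\n")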
Manual evaluation
1.1 Manual evaluation by the organizers
Each system output is manually compared to the corresponding reference feedback comment (a manually created feedback comment) to evaluate generation results. To be precise, a system output is regarded as appropriate if (1) it contains information similar to the reference and (2) it does not contain information that is irrelevant to the offset; it may contain information that the reference does not contain as long as it is relevant to the offset. If these conditions are met, the output is labeled as correct. The performance is measured by recall, precision, and F1 based on correct/incorrect outputs.
System outputs with <NO_COMMENT> are excluded from both the numerator and the denominator of precision and from the numerator of recall. Namely, a system can “skip” an input sentence without hurting precision if it is not confident enough.
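Under these conventions, the manual-evaluation scores can be computed roughly as follows (a sketch, assuming one judgment per test instance: "correct", "incorrect", or "<NO_COMMENT>" for skipped inputs):

    # Recall counts all test instances in its denominator; precision counts
    # only the instances for which a comment was actually generated.
    def manual_scores(judgments):
        n_total = len(judgments)
        n_answered = sum(j != "<NO_COMMENT>" for j in judgments)
        n_correct = sum(j == "correct" for j in judgments)
        recall = n_correct / n_total
        precision = n_correct / n_answered if n_answered else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return recall, precision, f1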
1.2 Manual evaluation by participants
Participants will have a chance to carry out their own evaluation of their system outputs. Once the organizers return the evaluation results, participants may revise them if they need to change the judgments. This applies only to their own system outputs.
In this evaluation, the above condition (1) is relaxed and the other conditions are kept as they are. Condition (1) is redefined so that a generated feedback comment may be judged correct even if its content is not similar to the reference, as long as it is appropriate for the given offset.
Participants may upload their manual evaluation results through the webpage.
Automatic evaluation
The BLEU score is calculated between the system output and the oracle feedback comment. Define (an extended version of) recall and precision as follows:
recall = (sum of BLEU scores over all system outputs other than <NO_COMMENT>) / (number of sentences in the test data)
precision = (sum of BLEU scores over all system outputs other than <NO_COMMENT>) / (number of system outputs other than <NO_COMMENT>)
Then, F1 is calculated based on this recall and precision.
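As a rough illustration of this automatic evaluation (a sketch only, using sentence-level BLEU from NLTK; the exact BLEU configuration used by the organizers is not specified here):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def automatic_scores(outputs, references):
        smooth = SmoothingFunction().method1
        bleu_sum, n_answered = 0.0, 0
        for out, ref in zip(outputs, references):
            if out == "<NO_COMMENT>":
                continue  # skipped outputs do not count toward precision
            n_answered += 1
            bleu_sum += sentence_bleu([ref.split()], out.split(),
                                      smoothing_function=smooth)
        recall = bleu_sum / len(references)
        precision = bleu_sum / n_answered if n_answered else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return recall, precision, f1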
System rankings are determined in three ways: (i) manual evaluation by the organizers; (ii) the participants’ manual evaluation; (iii) automatic evaluation.
Submitted data (System outputs)
We will release all system outputs to the public together with the manual evaluation results so that they can be used for non-profit research purposes. Those who wish to participate in this generation challenge must agree to this data release.
System source code
We also encourage participants to release their systems, either on the GenChal webpage or on their own page (with a link shown on the GenChal webpage). This is not an obligation, but it is highly recommended. We will make a special page for this release.
The code for the baseline system is available at this Git repository.
The evaluation metrics are: