Jared Bernstein, Alexei V. Ivanov, and Elizabeth Rosenfeld
Abstract: We compared the performance of two automated grammar checkers on a small random sample of student texts to select the better text checker to incorporate in a product to tutor American students, age 12–16, in essay writing. We studied two competing error detection services that located and labeled errors with claimed accuracy well above 90%. However, performance measurement with reference to a small set of essays (double-annotated for text errors) found that both services operated at similarly low accuracies (F1 values in the range of 0.25 to 0.35) when analyzing either sentence- or word-level errors. Several findings emerged. First, the two systems were quite uniform in the rate of missed errors (false negatives), but were very different in their distribution of false positives. Second, the magnitude of the difference in accuracy between these two text checkers is quite small: about one error, more or less, per two essays. Finally, we discuss contrasts between the functioning of these automated services in comparison to typical human approaches to grammar checking, and we propose bootstrap data collections that will support the development of improved future text correction methods.
1 Introduction
Student essays do not often satisfy all the doctrines of correctness that are taught in school or enforced by editors. Granting the use of "error" for deviation from prescribed usage, one observes that writers exhibit different patterns of error in text. There are several reasons why one would want to find, label, and fix those parts of text that do not conform to a standard, but our focus is on instruction. We recount a procedure we applied to compare the performance of two automated grammar checkers on a small random sample of student texts. The purpose was to select the better of two candidate text checkers to incorporate in a product designed to tutor American students, age 12–16, in essay writing.
Jared Bernstein: Stanford University, Stanford, California, USA, and Università Ca' Foscari, Venice, Italy, e-mail: Jared413@Stanford.edu
Alexei V. Ivanov: Fondazione Bruno Kessler, Trento, Italy, e-mail: Alexei_V_Ivanov@IEEE.org
Elizabeth Rosenfeld: Tasso Partners LLC, Palo Alto, California, USA, e-mail: Elizabeth@eRosenfeld.com
Asking around, we found that experienced teachers approach text correction in several different ways, but teachers commonly infer the intent of the source passage and then produce a correct model re-write of a word or phrase that expresses that intent. For the struggling middle-school student, a teacher's analysis is often implicit and any instruction needs to be inferred or requested. By middle school, only some teachers categorize the errors and relate them to particular rules or rubrics as a specific means of instruction.
Burstein (2011) distinguishes between automated essay scoring (producing only an overall score) and essay evaluation systems, which provide diagnostic feedback. We'll follow her usage. Teachers can be trained to score essays with high (r > 0.9) inter-teacher reliabilities, and machines can be trained to match those teachers' essay scores with correlations at human-human levels. Scoring is easier than evaluation. Automated scoring can reach respectable levels of correlation with human scoring by combining a few proxy measures such as text length, text perplexity, and/or use of discourse markers, with the occurrence of word families extrapolated from a sample of human-scored texts on the same topic. Note that the score product is usually a single 3- or 4-bit value (e.g., A–F, or 1–10) that can be sufficiently optimized with regression techniques.
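To make the proxy-measure approach concrete, the following is a minimal sketch (in Python, using scikit-learn) of a regression-based scorer of the general kind described above. The particular features, the toy training data, and the choice of ridge regression are our own illustrative assumptions, not a description of either system studied here.

    # Minimal sketch of a proxy-feature essay scorer (illustrative assumptions only).
    import numpy as np
    from sklearn.linear_model import Ridge

    def extract_features(essay_text, markers=("however", "therefore", "moreover")):
        # Reduce an essay to a few crude proxy measures: length, discourse-marker
        # count, and lexical variety (a stand-in for richer measures such as perplexity).
        words = essay_text.lower().split()
        length = len(words)
        marker_count = sum(w.strip(".,;") in markers for w in words)
        type_token_ratio = len(set(words)) / max(length, 1)
        return [length, marker_count, type_token_ratio]

    # Placeholder training data: essays already scored by human raters on a 1-10 scale.
    train_essays = ["first sample essay text ...", "second sample essay text ..."]
    train_scores = [6, 8]

    X = np.array([extract_features(e) for e in train_essays])
    y = np.array(train_scores)
    model = Ridge(alpha=1.0).fit(X, y)

    # Predict a holistic score for a new essay from the same proxy features.
    print(model.predict(np.array([extract_features("a new student essay ...")])))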
Essay evaluation, on the other hand, is more complicated for human teachers and for machines too. A 300-word essay by a typical middle school student (age 12–15) may have 10 or even 20 errors. Ideally, an evaluation will locate, delimit, label, and fix each error, and not label or fix other text material that should be accepted as correct or adequate. It is much more difficult to train teachers to evaluate and analyze a text than it is to train them in global essay scoring.
Let's make the oversimplified assumption that a 300-word essay has 60 phrases and that there are 7 tag categories (1 correct and 6 error types) that we'd like to assign to each phrase. Also assume that about 75% of the phrases are correct and that the other six error categories are equally probable. If so, then just locating and labeling the errors in 60 phrases of a 300-word essay means producing about 86 bits of information. Note that delimiting the error (deciding which particular span of text is encompassed in a particular error) and selecting a correct replacement text is also part of what teachers routinely do, which creates more challenges for machine learning. Add the fact that teachers often approach an essay by addressing only the most egregious 5 or 10 errors, leaving relatively minor errors to be addressed in a later, revised draft from the student, and we see that a human-like evaluation engine also needs to assign a severity score to each identified error. Therefore, full completion of the task probably needs to produce many more than 86 bits of information.
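For readers who want to check the arithmetic, the short Python computation below reproduces this figure; it is a sketch under exactly the stated assumptions (60 phrases, one correct tag with probability 0.75, six equiprobable error tags sharing the remaining 0.25), and it yields roughly 87 bits, in line with the "about 86 bits" quoted above.

    # Shannon entropy of the per-phrase tag distribution under the stated assumptions.
    import math

    p_correct = 0.75
    p_error = (1 - p_correct) / 6               # each of six error tags: 0.25 / 6

    per_phrase_bits = -(p_correct * math.log2(p_correct)
                        + 6 * p_error * math.log2(p_error))
    total_bits = 60 * per_phrase_bits           # 60 phrases in a 300-word essay

    print(per_phrase_bits, total_bits)          # about 1.46 bits per phrase, about 87 bits total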
This differentiation of the text scoring and text evaluation tasks is parallel to the distinction between giving a second-language speaker an overall score on pronunciation (relatively easy) and reliably identifying the particular segmental and suprasegmental errors in a sample of that speaker's speech (quite hard). Even if one approaches the pronunciation scoring task as a simple combination of multiple error-prone individual evaluations of the segments and prosodic phrases, the many individual evaluation errors (false positives and false negatives) are somewhat independent, so they can largely cancel each other out in the overall score. In the evaluation task (as needed in a pronunciation tutoring system), the product is an evaluation of each unit (segment or phrase), so the errors add (as in a Word Error Rate calculation) and do not cancel each other out (see Bernstein, 2012).
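The cancellation argument can be illustrated with a toy simulation; all of the rates below (25% erroneous units, 5% false-positive and 15% false-negative judgment rates, 60 units per essay) are arbitrary values chosen only to exhibit the effect, not measurements of any system discussed here.

    # Toy simulation: independent per-unit judgment errors largely cancel in an
    # aggregate score, but accumulate when every unit must be judged correctly.
    import random

    random.seed(0)
    n_units, n_trials = 60, 2000
    p_unit_error = 0.25            # true fraction of erroneous units
    fp_rate, fn_rate = 0.05, 0.15  # chosen so false positives and negatives balance in expectation

    net_score_errors, per_unit_mistakes = [], []
    for _ in range(n_trials):
        truth = [random.random() < p_unit_error for _ in range(n_units)]
        judged = [(t and random.random() > fn_rate) or (not t and random.random() < fp_rate)
                  for t in truth]
        # Scoring view: only the total error count matters, so false positives and
        # false negatives offset each other.
        net_score_errors.append(abs(sum(judged) - sum(truth)))
        # Evaluation view: every misjudged unit counts, so the two error types add.
        per_unit_mistakes.append(sum(j != t for j, t in zip(judged, truth)))

    print(sum(net_score_errors) / n_trials)   # small net effect on the aggregate count
    print(sum(per_unit_mistakes) / n_trials)  # noticeably larger number of unit-level mistakes

The second quantity behaves like the numerator of a Word Error Rate calculation: each misjudged unit contributes, regardless of the direction of the mistake.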
This paper presents a comparison of two essay evaluation systems and gives some context relevant to that comparison. We present the results of an exploratory study of how teachers grade essays, compare typical teacher practices to the systems we studied, and finally propose a service model that might aggregate useful data for improving automated essay evaluation systems, while also helping teachers and students.
2 Prior Work on Essay Analysis
The field of automated essay analysis has generally been dominated by essay scoring systems, with a flurry of systems reported in the past 20 years, including ones by Burstein et al. (1998), Foltz, Kintsch, and Landauer (1998), Larkey (1998), Rudner (2002), and Elliott (2003). Page's (1966) Project Essay Grade (PEG) seems to be the first demonstration of essay scoring technology. PEG relied on counts of commas, prepositions and uncommon words, and unit counts for syllables, words and sentences. An early editing and proofreading tool was The Writer's Workbench developed by MacDonald et al. (1982), which gave feedback on points of grammar and orthographic conventions. The early scoring systems have typically implemented text feature extractors that feed a regression model that has been fit to match the average score of a panel of human essay scorers. Later, more sophisticated systems (Burstein et al., 1998) break an essay into sub-sections, homogeneous in topic domain, and extract similar features from those. Dikli (2006) presents an in-depth review of some major currently maintained systems.