Everyone hates Microsoft Word's grammar checker, yet no one tries to replace it.
In the 1970s and 1980s there were several grammar checkers. Then in 1992, Microsoft added grammar checking to Word. The rest is history: Microsoft dominated the market and innovation stopped.
AbiWord, a competitor to Microsoft Word, now uses an open source package, Link Grammar, to do grammar checking. But as I understand it, Link Grammar is a rule-based checker. In the text processing field these days, though, higher-quality results come from throwing a large amount of data at the problem and mining it with statistical algorithms.
This is what Google does. Their Google Sets service extracts lists from the web and mines association rules from them, similar to Amazon's "people who bought this book also bought these books". Google has a patent on this. Google Translate and other commercial machine translators learn by comparing large volumes of human-translated texts. These are more effective than attempts to codify all the rules of a language.
So why not use the same approach for grammar checking? There's a lot more data out there to use than back in 1992. But what data source to use? I propose: the Wikipedia revision history. There are about 3 terabytes of text publicly available. A large proportion of it is simply minor grammatical corrections. I believe this would be the largest publicly available source of grammar corrections in the world. There are something on the order of 300 million revisions in the history. If only a third are corrections of grammatical mistakes, that's 100 million corrections to learn from.
How would it work? I'm not sure exactly, but I have some ideas. Look for minor changes between revisions (some are even tagged as grammatical changes), then:
- Consider subsequent edits to revisions to be higher quality than the previous. Editors are less likely to change correct grammar into incorrect grammar.
- Consider revisions that last a long time to be higher quality than ones that disappear quickly. Vandalism is generally wiped quickly.
- Use a part-of-speech tagger to help disambiguate word usage.
- Generate transformations using both the actual words and their morphology. It wouldn't take long to discover that "PLURALWORD is" is replaced with "PLURALWORD are".
- Parse trees could also be used.
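To make the idea concrete, here is a rough sketch of the mining step in Python. It is not the code from my 2006 attempt, just an illustration under some loud assumptions: revisions arrive as (old text, new text) pairs, "minor change" means a replaced span of at most two tokens, and plural detection is a naive suffix check standing in for a real part-of-speech tagger. The names `extract_replacements`, `looks_plural`, `generalize` and `mine_rules` are all made up for this example. It also ignores the revision-longevity idea entirely.

```python
import difflib
import re
from collections import Counter


def tokenize(text):
    # Words and single punctuation marks; a crude stand-in for a real tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)


def extract_replacements(old_text, new_text, max_len=2):
    """Yield (previous_token, old_span, new_span) for each small replaced span."""
    old, new = tokenize(old_text), tokenize(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only short replacements; big rewrites are unlikely to be grammar fixes.
        if tag != "replace" or i2 - i1 > max_len or j2 - j1 > max_len:
            continue
        prev = old[i1 - 1] if i1 > 0 else "<START>"
        yield prev, old[i1:i2], new[j1:j2]


def looks_plural(word):
    # ASSUMPTION: naive suffix heuristic; a real system would ask a POS tagger.
    w = word.lower()
    return len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is"))


def generalize(word):
    # Swap a plural-looking context word for a placeholder so rules can generalize.
    return "PLURALWORD" if looks_plural(word) else word


def mine_rules(revision_pairs):
    """Count generalized 'before -> after' transformations across revision pairs."""
    counts = Counter()
    for old_text, new_text in revision_pairs:
        for prev, old_span, new_span in extract_replacements(old_text, new_text):
            context = generalize(prev)
            before = " ".join([context, *old_span])
            after = " ".join([context, *new_span])
            counts[(before, after)] += 1
    return counts


# Toy data invented for this example; real input would come from the revision dump.
example_pairs = [
    ("The dogs is barking.", "The dogs are barking."),
    ("My cats is asleep.", "My cats are asleep."),
    ("The results is clear.", "The results are clear."),
    ("Teh house is big.", "The house is big."),
]

if __name__ == "__main__":
    for (before, after), n in mine_rules(example_pairs).most_common():
        print(f"{n}x  {before!r} -> {after!r}")
```

On real data the counts would need to be weighed against how often the "before" pattern survives unchanged, otherwise common but harmless edits would swamp the genuine corrections.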
I attempted this back in 2006, but I've decided it would work best as an open source project. I should dig into my code archives and see what I can find.
What this project needs:
- Big-time server resources, beyond what my project TheFullWiki can provide. I understand many universities have already set up servers to store Wikipedia revisions and analyse them. Access to these would be very helpful.
- Computational linguists
- A more recent dump of Wikipedia revision history would be nice.
- Pointers to any similar attempts?