{"id":9500,"date":"2016-08-02T07:00:05","date_gmt":"2016-08-02T07:00:05","guid":{"rendered":"http:\/\/revoscience.com\/en\/?p=9500"},"modified":"2016-08-02T07:00:05","modified_gmt":"2016-08-02T07:00:05","slug":"first-major-database-of-non-native-english","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/first-major-database-of-non-native-english\/","title":{"rendered":"First major database of non-native English"},"content":{"rendered":"<figure id=\"attachment_9501\" aria-describedby=\"caption-attachment-9501\" style=\"width: 607px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-9501\" src=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg\" alt=\"English is the most used language on the Internet, and globally most of the people who speak or write in English are non-native speakers. A new database of annotated English sentences written by non-native speakers could help improve how computers handle the spoken or written language of non-native English speakers.\" width=\"607\" height=\"404\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg 639w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0-300x200.jpg 300w\" sizes=\"auto, (max-width: 607px) 100vw, 607px\" \/><\/a><figcaption id=\"caption-attachment-9501\" class=\"wp-caption-text\">English is the most used language on the Internet, and globally most of the people who speak or write in English are non-native speakers. 
A new database of annotated English sentences written by non-native speakers could help improve how computers handle the spoken or written language of non-native English speakers.<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>CAMBRIDGE, Mass<\/strong>. &#8212;\u00a0After thousands of hours of work, MIT researchers have released the first major database of fully annotated English sentences written by non-native speakers.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The researchers who led the project had already shown that the grammatical quirks of non-native speakers writing in English could be a source of\u00a0<a style=\"color: #1155cc;\" href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d805290-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=30567&amp;Action=Follow+Link\" target=\"_blank\" data-saferedirecturl=\"https:\/\/www.google.com\/url?hl=en&amp;q=http:\/\/mit.pr-optout.com\/Tracking.aspx?Data%3DHHL%253d805290-%253eLCE9%253b4%253b8%253f%2526SDG%253c90%253a.%26RE%3DMC%26RI%3D4334046%26Preview%3DFalse%26DistributionActionID%3D30567%26Action%3DFollow%2BLink&amp;source=gmail&amp;ust=1470203202687000&amp;usg=AFQjCNEfizukHtp7K0b629BAwLT0_tVCgw\" rel=\"noopener\"><span style=\"color: #000000;\">linguistic insight<\/span><\/a>. But they hope that their dataset could also lead to applications that would improve computers\u2019 handling of spoken or written language of non-native English speakers.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cEnglish is the most used language on the Internet, with over 1 billion speakers,\u201d says Yevgeni Berzak, a graduate student in electrical engineering and computer science, who led the new project. \u201cMost of the people who speak English in the world or produce English text are non-native speakers. 
This characteristic is often overlooked when we study English scientifically or when we do natural-language processing for English.\u201d<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Most natural-language-processing systems, which enable smartphone and other computer applications to process requests phrased in ordinary language, are based on machine learning, in which computer systems look for patterns in huge sets of training data. \u201cIf you want to handle noncanonical learner language, in terms of the training material that\u2019s available to you, you can only train on standard English,\u201d Berzak explains.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Systems trained on nonstandard English, on the other hand, could be better able to handle the idiosyncrasies of non-native English speakers, such as tendencies to drop or add prepositions, to substitute particular tenses for others, or to misuse particular auxiliary verbs. Indeed, the researchers hope that their work could lead to grammar-correction software targeted to native speakers of other languages.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Diagramming sentences<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The researchers\u2019 dataset consists of 5,124 sentences culled from exam essays written by students of English as a second language (ESL). 
The sentences were drawn, in approximately equal distribution, from native speakers of 10 languages that are the primary tongues of roughly 40 percent of the world\u2019s population.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Every sentence in the dataset includes at least one grammatical error. The original source of the sentences was a collection made public by Cambridge University, which included annotation of the errors, but no other grammatical or syntactic information.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">To provide that additional information, Berzak recruited a group of MIT undergraduate and graduate students from the departments of Electrical Engineering and Computer Science (EECS), Linguistics, and Mechanical Engineering, led by Carolyn Spadine, a graduate student in both EECS and linguistics.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">After eight weeks of training in how to annotate both grammatically correct and error-ridden sentences, the students began working directly on the data. There are three levels of annotation. The first involves basic parts of speech \u2014 whether a word is a noun, a verb, a preposition, and so on. The second is a more detailed description of parts of speech \u2014 plural versus singular nouns, verb tenses, comparative and superlative adjectives, and the like.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">For the third level, the annotators charted the syntactic relationships between the words of the sentences, using a relatively new annotation scheme called the Universal Dependency formalism. 
Syntactic relationships include things like which nouns are the objects of which verbs, which verbs are auxiliaries of other verbs, which adjectives modify which nouns, and so on.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The annotators created syntactic charts for both the corrected and uncorrected versions of each sentence. That required some prior conceptual work, since grammatical errors can make words\u2019 syntactic roles difficult to interpret.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Berzak and Spadine wrote a 20-page guide to their annotation scheme, much of which dealt with the handling of error-ridden sentences. Consistency in the treatment of such sentences is essential to any envisioned application of the dataset: A machine-learning system can\u2019t learn to recognize an error if the error is described differently in different training examples.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Repeatable results<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The researchers\u2019 methodology, however, provides good evidence that annotators can chart ungrammatical sentences consistently. For every sentence, one evaluator annotated it completely; another reviewed the annotations and flagged any areas of disagreement; and a third ruled on the disagreements.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">There was some disagreement on how to handle ungrammatical sentences \u2014 but there was some disagreement on how to handle grammatical sentences, too. In general, levels of agreement were comparable for both types of sentences.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The researchers report these and other results in a paper being presented at the Association for Computational Linguistics annual conference in August. 
Joining Berzak and Spadine on the paper are Boris Katz, who is Berzak\u2019s advisor and a principal research scientist at MIT\u2019s Computer Science and Artificial Intelligence Laboratory; and the undergraduate annotators: Jessica Kenney, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, and Sebastian Garza.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The researchers\u2019 dataset is now one of the 59 datasets available from the organization that oversees the Universal Dependency (UD) standard. Berzak also created an\u00a0<a style=\"color: #1155cc;\" href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d805290-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=30566&amp;Action=Follow+Link\" target=\"_blank\" data-saferedirecturl=\"https:\/\/www.google.com\/url?hl=en&amp;q=http:\/\/mit.pr-optout.com\/Tracking.aspx?Data%3DHHL%253d805290-%253eLCE9%253b4%253b8%253f%2526SDG%253c90%253a.%26RE%3DMC%26RI%3D4334046%26Preview%3DFalse%26DistributionActionID%3D30566%26Action%3DFollow%2BLink&amp;source=gmail&amp;ust=1470203202687000&amp;usg=AFQjCNHI7dKr3KAhQd2epKJbxeWP3p65jQ\" rel=\"noopener\"><span style=\"color: #000000;\">online interface<\/span><\/a>\u00a0for the dataset, so that researchers can look for particular kinds of errors, in sentences produced by native speakers of particular languages, and the like.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The work was funded in part by the National Science Foundation, under the auspices of MIT\u2019s Center for Brains, Minds, and Machines.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>\u201cEnglish is the most used language on the Internet, with over 1 billion speakers,\u201d says Yevgeni Berzak, a graduate student in electrical engineering and computer science, who led the new project. 
\u201cMost of the people who speak English in the world or produce English text are non-native speakers.<\/p>\n","protected":false},"author":6,"featured_media":9501,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,22,17],"tags":[],"class_list":["post-9500","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-culture","category-other","category-research"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0-150x150.jpg",150,150,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0-300x200.jpg",300,200,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"2048x2048":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",600,400,false],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",600,400,false],"newspaper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"newspaper-x-recent-post-big":["https:\/\/www
.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",540,360,false],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",95,63,false],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",639,426,false],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",96,64,false],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/08\/MIT-ESL-Data_0.jpg",150,100,false]},"author_info":{"info":["Amrita Tuladhar"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/culture\/\" rel=\"category tag\">Culture<\/a> <a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/other\/\" rel=\"category tag\">Other<\/a> <a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/research\/\" rel=\"category tag\">Research<\/a>","tag_info":"Research","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/9500","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=9500"}],"version-history":[{"count":0,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/9500\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/9501"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=9500"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=95
00"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=9500"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}