Boundary Recognition of Slight-Pause Marks via a Grammar Testing Method
MO Yiwen, CHEN Bo, LEI Pei
1. College of Chinese Language and Literature, Wuhan University, Wuhan 430072, Hubei, China; 2. School of Computer, Wuhan University, Wuhan 430072, Hubei, China; 3. Department of Language & Literature, Hubei University of Art & Science, Xiangyang 441053, Hubei, China
Boundary recognition is an important research topic in natural language processing, providing a basis for applications such as Chinese word segmentation, chunk analysis, and named entity recognition. Addressing the ambiguity in boundary recognition of Chinese punctuation marks, this paper proposes grammar testing methods for the boundary recognition of slight-pause marks and then calculates the annotation consistency achieved with these methods. The statistical results show that grammar testing methods can greatly improve the annotation consistency of slight-pause mark boundary recognition: the consistency of the second annotation round is 0.0303 higher than that of the first. This helps guarantee the consistency of large-scale corpus annotation and improve the quality of the annotated corpus.
Key words: slight-pause mark boundary; grammar testing; corpus annotation; Kappa statistics
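The annotation consistency reported above is measured with Cohen's kappa statistic, which corrects raw inter-annotator agreement for agreement expected by chance. A minimal sketch of the computation, using hypothetical boundary labels for ten slight-pause marks (the label scheme and data here are illustrative assumptions, not the paper's actual annotations):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' nominal labels on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical decisions for ten slight-pause marks by two annotators:
# "B" = treated as a phrase boundary, "I" = inside a coordinate structure.
annotator_1 = ["B", "B", "I", "B", "I", "I", "B", "B", "I", "B"]
annotator_2 = ["B", "B", "I", "B", "B", "I", "B", "B", "I", "B"]
print(round(cohen_kappa(annotator_1, annotator_2), 4))  # prints 0.7826
```

A kappa of 1 indicates perfect agreement and 0 indicates chance-level agreement, so a between-round gain such as the 0.0303 reported here reflects a genuine improvement beyond chance.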