Suitability of Hidden Markov Model in Part-of-Speech Tagging for Indian Languages

Bhairab Sarma, Bipul Shayam Purkayastha

Abstract


Part-of-speech (POS) tagging is an important task in Natural Language Processing (NLP) in which each token is assigned an appropriate symbol according to its lexical category. POS tagging has many applications in NLP; two of the main ones are word class classification and word sense disambiguation. A number of POS tagging approaches have been developed for different languages. However, developing a universal tagger for all languages is challenging because different languages follow different grammatical rules. The Hidden Markov Model (HMM) is a statistical approach commonly used across languages. TnT is a popular example of a POS tagger that uses bigram and trigram HMMs. Under an HMM, a tagger tags its tokens by estimating two kinds of probabilities: observation probability and transition probability. Essentially, the model estimates the probability of hidden states based on previously known states, so the category of a word can be predicted from the categories of the few words preceding it. This model has been found suitable for word sense disambiguation and information retrieval, specifically for English. Many researchers have claimed tagging accuracy of up to 98%. Unlike English, Indian languages are highly inflectional and have rich morphology. Because of this morphological richness, the performance of HMM-based taggers degrades. The second difficulty is that Indian languages are structurally free word order languages. Because of these two significant problems, recent researchers have tried to develop POS taggers that combine multiple approaches. In this paper, we discuss some pitfalls of HMM as a POS tagging approach, considering a few Indian languages including Hindi, and try to develop an alternative solution for developers. Our objective is to increase the accuracy of probabilistic taggers specifically for Indian languages.
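
To make the two probabilities named in the abstract concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is an illustration only and not the implementation evaluated in the paper. The toy English corpus, the tag names, the function names (train_hmm, log_prob, viterbi), and the use of add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (an assumption for illustration, not from the paper).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word observations."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] -> count
    emit = defaultdict(Counter)   # emit[tag][word] -> count
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][tag] = (best log score ending in tag at position i, backpointer)
    best = [{
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            obs = log_prob(emit[t], words[i], n_words)  # observation term
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + obs, p)
                for p in tags  # transition term from each previous tag
            )
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


trans, emit, tags, vocab = train_hmm(TRAIN)
print(viterbi(["the", "dog", "sleeps"], trans, emit, tags, vocab))
```

In this sketch the tag of each word depends only on the tag immediately before it and on the word itself. That is the assumption the abstract identifies as problematic for free word order, morphologically rich languages: a fixed-order transition table carries less information when word order varies, and heavy inflection produces many word forms that were never observed with any tag.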

Keywords: contextual information, HMM, observation probability, POST, transition probability


DOI: https://doi.org/10.37628/ijocspl.v1i2.50
