A Part-of-Speech Tag Clustering for a Word Prediction System in Portuguese Language

  1. Cruz Cavalieri, Daniel
  2. Filho, Teodiano Freire Bastos
  3. Filho, Mário Sarcinelli
  4. Palazuelos Cagigas, Sira Elena
  5. Macías Guarasa, Javier
  6. Martín Sánchez, José Luis
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2011

Issue: 47

Pages: 197-205

Type: Article


Abstract

This paper presents an automatic method for reducing the part-of-speech tagset used by a word prediction system for Portuguese. The method is based on a similarity measure applied to an association matrix, which is generated by applying an odds ratio association measure to the probability distribution of part-of-speech bigrams (bipos) in a corpus. The results reported in this paper show that the proposed clustering method, with an appropriate threshold on the similarity, has the potential to improve the word prediction system. Moreover, it makes it possible to use other clustering techniques, such as fuzzy clustering. The results also show that, when the word prediction system is based on a syntactic model, clustering cannot be performed across the major syntactic categories, even if the resulting clusters seem correct from a linguistic point of view.
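
The pipeline described in the abstract (bipos counts, an odds ratio association matrix, a similarity measure, and a threshold) can be illustrated with a minimal Python sketch. The abstract does not specify the exact similarity measure or merging strategy, so the choices here are assumptions made for illustration only: a smoothed log odds ratio, cosine similarity between matrix rows, and single-link merging of every pair of tags whose similarity reaches the threshold. The function names `odds_ratio_matrix`, `cosine`, and `cluster_tags` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def odds_ratio_matrix(tag_seq, smoothing=0.5):
    """Build one row of smoothed log odds ratios per tag from bipos counts.

    The smoothing constant is an assumption; it avoids division by zero
    when a cell of the 2x2 contingency table is empty.
    """
    tags = sorted(set(tag_seq))
    bigrams = Counter(zip(tag_seq, tag_seq[1:]))
    total = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), n in bigrams.items():
        first[a] += n   # bigrams whose first tag is a
        second[b] += n  # bigrams whose second tag is b
    matrix = {}
    for a in tags:
        row = []
        for b in tags:
            n11 = bigrams[(a, b)]          # a followed by b
            n12 = first[a] - n11           # a followed by something else
            n21 = second[b] - n11          # something else followed by b
            n22 = total - n11 - n12 - n21  # neither a first nor b second
            s = smoothing
            row.append(math.log(((n11 + s) * (n22 + s)) / ((n12 + s) * (n21 + s))))
        matrix[a] = row
    return tags, matrix


def cosine(u, v):
    """Cosine similarity between two rows (an assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_tags(tags, matrix, threshold):
    """Merge tags whose row similarity is at least the threshold (single link)."""
    parent = {t: t for t in tags}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for a, b in combinations(tags, 2):
        if cosine(matrix[a], matrix[b]) >= threshold:
            parent[find(a)] = find(b)
    clusters = {}
    for t in tags:
        clusters.setdefault(find(t), []).append(t)
    return sorted(clusters.values())
```

Under these assumptions, a higher threshold merges fewer tags and keeps the tagset closer to the original, while a lower one produces coarser clusters. Restricting merges to tags within the same major syntactic category, as the abstract's final finding suggests, would require an additional check that this sketch does not include.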
