Penalizing unknown words' emissions in HMM POS tagger based on Malay affix morphemes

Mohamed, Hassan and Omar, Nazlia and Ab Aziz, Mohd Juzaiddin and Zainol, Zuraini and Marzukhi, Syahaneim (2016) Penalizing unknown words' emissions in HMM POS tagger based on Malay affix morphemes. In: 3rd International Conference On Defence And Security Technology (DSTC2016), 15-17 August 2016, Marriot Hotel, Putrajaya. (Submitted)

Abstract

The challenge in unsupervised Hidden Markov Model (HMM) training for a POS tagger is that training relies on an untagged corpus; the only supervised data constraining the possible tags of a word is a dictionary. Any published dictionary can be used, even if it does not include every word found in the corpus. As a result, training cannot properly map possible tags to words missing from the dictionary. The exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language were examined to assign probable tags to unknown words based on linguistically meaningful affixes, using a morpheme-based POS guessing algorithm for baseline tagging. The tagger was evaluated in three configurations: first, HMM-Viterbi with unknown words handled by character-based prediction; next, HMM-Viterbi with unknown words handled by morpheme-based POS guessing; lastly, a combination of the two. HMM-Viterbi tagging with morpheme-based POS guessing performed better than HMM-Viterbi tagging with character-based prediction, especially on unknown words not listed in the dictionary. The best combination used morpheme-based POS guessing with unknown-word emissions replaced by a value proportionate to the marginal distribution of tags and to the words' ending information, given a predefined maximum ending length of six characters and ignoring affixed words in smoothing. This method proved satisfactory for guessing the tags of words not in the dictionary, and it outperformed the baseline.
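The two unknown-word strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the affix table, the tag names, the `MIN_STEM` threshold, the function names, and the add-one fallback are all assumptions made for illustration, and the lexicon is taken to be a list of (word, tag) pairs. Only the six-character ending limit and the idea of skipping affixed words when collecting ending statistics come from the abstract.

```python
from collections import Counter, defaultdict

# Hypothetical affix-to-tag table (illustrative only, not taken from the paper).
CIRCUMFIXES = [("ke", "an", "NOUN"), ("per", "an", "NOUN"), ("pe", "an", "NOUN")]
PREFIXES = [("meng", "VERB"), ("mem", "VERB"), ("men", "VERB"), ("me", "VERB"),
            ("ber", "VERB"), ("peng", "NOUN"), ("pem", "NOUN"), ("pen", "NOUN")]
SUFFIXES = [("kan", "VERB"), ("an", "NOUN")]

MIN_STEM = 4        # assumed minimum stem length, to limit false affix matches
MAX_ENDING_LEN = 6  # maximum ending length stated in the abstract


def guess_tags_by_affix(word):
    """Return candidate tags suggested by affixes, or an empty set if none match."""
    w = word.lower()
    # Circumfixes are the most specific pattern, so they are checked first.
    for pre, suf, tag in CIRCUMFIXES:
        if w.startswith(pre) and w.endswith(suf) and len(w) - len(pre) - len(suf) >= MIN_STEM:
            return {tag}
    tags = set()
    for pre, tag in PREFIXES:  # longer prefixes are listed before shorter ones
        if w.startswith(pre) and len(w) - len(pre) >= MIN_STEM:
            tags.add(tag)
            break
    for suf, tag in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= MIN_STEM:
            tags.add(tag)
            break
    return tags


def train_ending_counts(lexicon_entries):
    """Count tags per word ending (1..MAX_ENDING_LEN chars), skipping affixed words."""
    ending_tag = defaultdict(Counter)
    tag_totals = Counter()
    for word, tag in lexicon_entries:
        tag_totals[tag] += 1
        if guess_tags_by_affix(word):
            continue  # affixed words are left out of the ending statistics
        w = word.lower()
        for k in range(1, min(MAX_ENDING_LEN, len(w)) + 1):
            ending_tag[w[-k:]][tag] += 1
    return ending_tag, tag_totals


def unknown_tag_distribution(word, ending_tag, tag_totals):
    """Estimate a tag distribution for a word that is missing from the lexicon."""
    # Affix guesses restrict the candidate tags; otherwise all tags are allowed.
    allowed = guess_tags_by_affix(word) or set(tag_totals)
    w = word.lower()
    scores = {}
    # Back off from the longest ending to the shortest until one has been seen.
    for k in range(min(MAX_ENDING_LEN, len(w)), 0, -1):
        counts = ending_tag.get(w[-k:], {})
        scores = {t: counts[t] for t in allowed if counts.get(t, 0) > 0}
        if scores:
            break
    if not scores:
        # Fall back to the marginal tag counts (add-one is an assumption here).
        scores = {t: tag_totals.get(t, 0) + 1 for t in allowed}
    total = sum(scores.values())
    return {t: c / total for t, c in scores.items()}
```

In this sketch the function returns a distribution over tags rather than an emission probability; a tagger would still have to decide how to plug such a value into its emission table, which the abstract does not spell out.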

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: Malay POS tagger, morpheme-based, HMM
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
T Technology > T Technology (General)
Divisions: Centre For Cybersecurity
Depositing User: Mr Shahrim Daud
Date Deposited: 06 Feb 2024 08:15
Last Modified: 06 Feb 2024 08:15
URI: http://ir.upnm.edu.my/id/eprint/349
