Detection of Adjective Compound Word in Malay Language using Enhanced Syntactic Rules

A compound word is a combination of two or more words that produces a new meaning. Compound words exist in many languages, such as English, Mandarin and Arabic. Although existing methods for detecting compound words have been discussed in the literature, they have limitations when applied to Malay compound words. This study therefore aims to improve the accuracy of detecting adjective compound words. The training data used in this study are digitized Malay story books. Pre-processing, consisting of tokenization, stemming, bi-gram generation and part-of-speech (POS) tagging, was applied to produce the candidate compound words. Applying the enhanced syntactic rules yielded a precision of 70.3%. This study contributes to academic research on improving search and document summarization applications.


INTRODUCTION
Language is a central means for humans to communicate with their community. It helps people to express their thoughts and feelings. Each country and region in the world has a mother tongue and a formal language used in daily life. Several languages are used to communicate worldwide, such as English, Chinese, Arabic and Malay. Each language is unique and helps to portray the culture of its community or region. For example, most people in Middle Eastern countries such as the United Arab Emirates, Yemen, Egypt, Jordan, Syria, Iraq and nearby countries use Arabic as their spoken and written language, which reflects the Muslim community there. Furthermore, language is also widely used in education, business, music and other fields.
Most languages have systematic rules and structures for constructing a sentence. The components of a language include grammar, words, compound words, sentences and others. What, then, is a compound word?
A compound word is a combination of two or more words. The combination produces a new word with a new meaning. An example is the combination of two nouns (noun + noun), such as football. Compound words are widely studied in many languages, such as English, Mandarin and Arabic (Christianto, 2020; Gagné, Spalding & Schmidtke, 2019; Altakhaineh, 2016). In the Malay language, such studies started around 2011 with the work of Rahman, Omar and Aziz on detecting the head modifier of noun phrases in Malay sentences. This shows that only limited studies have explored this language area. Furthermore, the study by Rahman, Omar and Aziz (2013) on the extraction of compound words in Malay text focused on compound nouns in Malay noun phrases, and the accuracy of the compound noun results was evaluated by a linguist or language expert, who verified the correctness of the compound nouns produced by the prototype system. In other words, there is still a lack of studies on adjective compound words in the Malay language. This study therefore focuses on improving the accuracy of detecting adjective compound words in Malay story books using syntactic rules. The importance of this study lies in its contribution to several applications, such as text summarization, machine translation, Information Retrieval (IR) and semantic analysis (Rahman et al., 2013).
The next section discusses related work on compound words. It is followed by the methodology, and the final sections present the results and conclusion.

RELATED WORK
A large number of studies have discussed compound words across different languages. Compound words exist in many languages and are generally defined as being composed of at least two root words (Shen, Li & Pollatsek, 2017). As cited in Cahyanti (2016), a compound word has three forms: the closed form, in which compounds are written as single words (newspaper, goldfish, highway); the hyphenated form, in which compounds are hyphenated (daughter-in-law, third-rate, court-martial); and the open form, in which compounds are written as separate words (red zone, high school, health care). Cahyanti's study adapted lexical meaning and contextual meaning to identify compound words in the novel Twilight by Stephenie Meyer. This quantitative study found 253 compound words, focusing on the written form, word class and meaning perspectives.
More studies have discussed the detection of noun-noun compound words than of adjective-noun compounds. However, an empirical study in the German language examined the suitability of adjective-noun compounds as naming units or phrases (Schlechtweg, 2018). Another study, by Ang, Tan and He (2017), employed a node-and-collocate approach to identify noun-noun and adjective-noun collocations in English-language research articles. Their results showed that the general adjective is the most common noun pre-modification type found in the articles.
A fundamental study by Rahman et al. (2011) identified that Malay grammar can mainly be categorised into the sound of the language and the arrangement of words in a sentence. By applying the concept of thematic relations of compound words, the study arrived at two main compound noun phrase categories: i) head and noun modifier and ii) head and non-noun modifier. Using thematic relations, the study also derived four basic compound phrase constructions for the Malay language: Noun Phrase + Noun Phrase (NP+NP), Noun Phrase + Verb Phrase (NP+VP), Noun Phrase + Adjective Phrase (NP+AP) and Noun Phrase + Preposition Phrase (NP+PP).

METHODOLOGY
This study applied five phases to detect adjective compound words. Figure 1 shows the flow of the method used in this study. The phases are: (i) digitizing five Malay story books as a corpus; (ii) pre-processing, which consists of two tasks, normalization and tokenization; (iii) candidate generation, which consists of subject and predicate phrase separation, POS tagging, word bi-gram construction and selection of candidate collocations with a frequency equal to or greater than two; (iv) automatic compound word detection using syntactic rules; and (v) evaluation, in which precision and recall are used to evaluate the efficiency of the enhanced method.

Corpus
Corpora have been extensively employed in several Natural Language Processing (NLP) tasks as the basis for automatically learning models for language analysis and generation. In this step, the data were collected from story books written in the Malay language. The size of the corpus is 30 pages and 3,116 tokens.

Pre-Processing
In this phase, the digitized story books are pre-processed by removing all page tags, header and footer tags and other noisy data. All-uppercase, capitalized and mixed-case words are then lowercased, and punctuation, special symbols and numbers are removed. This process is known as normalization of the data. After normalization, we proceed to the next step, tokenization, which breaks the content down into a sequence of individual tokens. We developed the system using the Microsoft Office Studio to tokenize all the words in the digitized data.
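As a rough illustration of the normalization and tokenization steps described above, the following Python sketch lowercases the text, strips punctuation, symbols and digits, and splits on whitespace. It is not the system built for this study: the function names `normalize` and `tokenize` are hypothetical, and treating hyphens as punctuation is an assumption.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and replace every non-letter character with a space.

    Assumption: hyphens are treated like any other punctuation mark.
    """
    lowered = text.lower()
    # Anything that is not a-z or whitespace (punctuation, symbols, digits)
    # becomes a space so that neighbouring words are not glued together.
    return re.sub(r"[^a-z\s]", " ", lowered)


def tokenize(text: str) -> list[str]:
    """Normalize the text and split it into whitespace-separated tokens."""
    return normalize(text).split()


if __name__ == "__main__":
    print(tokenize("Katak betina pula sombong, dan TIDAK suka berkawan 2 kali!"))
```

Replacing removed characters with a space rather than deleting them keeps tokens separate when punctuation is not followed by whitespace.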

Candidate Generation
In this phase, every word in the text corpus that matches a word in the Malay lexicon database is tagged with its POS tag. After POS tagging, we construct word bi-grams over all the words in the corpus. Table 2 shows the statistics of the word bi-grams. This phase yields all possible Noun-Adjective (N-A) collocations that occur in the corpus: in the tagged corpus, two consecutive words tagged as Noun and Adjective respectively are extracted as a candidate N-A collocation. These compound adjective candidates are then passed to the next phase for automatic compound word extraction. Candidates that occur only once are not selected, because including them would produce more candidates with irrelevant collocations between the two words. Muneer et al. (2016) selected compound word candidates whose frequency in the corpus is greater than or equal to three. In this study, we enhanced the syntactic rules to suit the story book dataset, so that only candidate compound adjective collocations whose frequency in the corpus is greater than or equal to two are chosen.
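The candidate generation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the prototype used in this study: the toy `LEXICON`, the tag labels `N`, `ADJ`, `DET`, `V` and `UNK`, the function names and the sample tokens are all hypothetical. Only the Noun followed by Adjective pattern and the frequency threshold of two come from the description above.

```python
from collections import Counter

# Hypothetical toy lexicon mapping Malay words to POS tags (illustration only).
LEXICON = {
    "rumah": "N",    # house
    "baju": "N",     # shirt
    "besar": "ADJ",  # big
    "merah": "ADJ",  # red
    "cantik": "ADJ", # beautiful
    "itu": "DET",    # that
    "ada": "V",      # has / there is
}


def tag(tokens, lexicon):
    """Attach a POS tag to each token; unknown words get the tag 'UNK'."""
    return [(token, lexicon.get(token, "UNK")) for token in tokens]


def noun_adjective_candidates(tagged, min_freq=2):
    """Count Noun-Adjective bi-grams and keep those seen at least min_freq times."""
    counts = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1 == "N" and t2 == "ADJ"
    )
    return {pair: n for pair, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    tokens = "rumah besar itu cantik rumah besar ada baju merah".split()
    print(noun_adjective_candidates(tag(tokens, LEXICON)))
```

In this toy input, `rumah besar` occurs twice and is kept, while `baju merah` occurs once and is filtered out. A fuller version would build bi-grams within each sentence or phrase rather than across the whole token stream.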

Automatic Extraction
Once the candidate N-A compounds have been extracted in the candidate generation phase, we rank all the compound adjective candidates by word bi-gram frequency in the corpus, from highest to lowest. In our task, linguistic approaches are used to obtain valid compounds. Several experiments were run to make sure that the results contain valid compound words, and the results were verified by an expert.

Syntactic Rules
After the compound word candidates are extracted in the candidate generation phase, this study applies enhanced syntactic rules to detect the adjective compound word candidates in the Malay corpus. In other words, this study proposes a linguistic knowledge approach to extract and categorize the compound adjectives from the Malay corpus. In order to extract compound nouns in standard Malay sentences, we must first understand the grammar of the language itself. Basically, Malay grammar requires a sentence to have a subject, verb and predicate.
In this research, the first step of the experiment is to separate the story into individual sentences, such as the following:

Katak betina pula sombong dan tidak suka berkawan dengan katak lain.
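The sentence separation step can be sketched in Python as below. This is only an assumption about how such a split might be done, namely by breaking the text after sentence-final punctuation. The function name `split_sentences` is hypothetical, and the second sentence in the sample text is made up for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split a story into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings that appear when the input is empty.
    return [part for part in parts if part]


if __name__ == "__main__":
    story = (
        "Katak betina pula sombong dan tidak suka berkawan dengan katak lain. "
        "Katak itu besar."  # made-up second sentence, for illustration only
    )
    for sentence in split_sentences(story):
        print(sentence)
```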
The next step is to separate the sentence into subject and predicate. Below is the output after this process is done.
Besides that, we also remove the commas, so that the sentence becomes a set of simple phrases. Below are the phrases after the removal process has been done. POS tagging is then performed for each word in the sentence below:

Evaluation
The performance and accuracy measurements are described below, where:
X = the total number of relevant compound words retrieved for the query
Y = the total number of compound words relevant to the query that were not retrieved
Z = the total number of non-relevant compound words retrieved
Precision and recall are then evaluated as follows:

$$\text{Precision} = \frac{X}{X + Z}, \qquad \text{Recall} = \frac{X}{X + Y}$$

In terms of relations, X = total relations relevant, Y = (number of records on a particular topic - total relations relevant) and Z = (total relations retrieved - total relations relevant).
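The following Python sketch shows how precision and recall follow from the counts X, Y and Z defined above, together with their harmonic mean (the F1-score). The function name and the sample counts are hypothetical and are not results from this study.

```python
def precision_recall_f1(x: int, y: int, z: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from retrieval counts.

    x: relevant compound words retrieved
    y: relevant compound words not retrieved
    z: non-relevant compound words retrieved
    """
    precision = x / (x + z) if (x + z) else 0.0
    recall = x / (x + y) if (x + y) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts for illustration only.
    p, r, f = precision_recall_f1(x=8, y=2, z=2)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

The guards against zero denominators return 0.0 when nothing is retrieved or nothing is relevant, which is one common convention.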
The harmonic mean of recall and precision is computed using the F1-score equation:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Table 3 shows the number of adjective compound words found by the linguistic expert and by the prototype system that has been developed, for each of the five story books. The same data is shown in graph form in Figure 2.
Table 4 shows the number of compound words found by the linguistic expert and by the prototype system for each 100 sentences of the overall story book collection. The books contain 409 sentences in total. The same data is shown in graph form in Figure 3.
Table 5 compares the baseline result with the result of the Enhanced Syntactic Rules. The results are measured with the standard metrics of precision, recall and F-score. With the Enhanced Syntactic Rules, precision increased by 0.3 percent, recall increased by 0.2 and the F-score increased from 18.1% to 18.5%. Although the increment is small, it is meaningful for this study and indicates that the enhanced rules help to improve the results. Table 6 below gives a few examples of adjective compound words from this study.

CONCLUSION
This study has discussed how adjective compound words in the Malay language can be recognized with enhanced syntactic rules using a dependency relationship approach. The results show an improvement in effectiveness for the relationship types used, as measured against the baseline values compiled from a set of training and testing data in our study. However, the improvement is only slight, owing to the lack of test data available for our testing process. In future work, we will improve the handling of Malay sentence structure so that it becomes an additional part of the Malay grammar rule structure. Larger training and test datasets are also required for the experiment to obtain better results.