About
In this post, I’ll explain how I came to support Arabic keyword search in Elasticsearch.
This post is a rewrite of the original article in Korean.
Intro
Arabic is primarily spoken in the Middle East and North Africa region. It is an official language of the United Nations and is the fifth largest language in the world in terms of speakers.
Arabic has a reputation for being hard to read: it is written from right to left, and short vowels are usually not written (long vowels and semi-vowels are). Arabic letters also change shape depending on whether they stand alone or where they appear in a word.
So when you first learn Arabic, you have to memorize every form of each letter: at the beginning of a word, in the middle, at the end, and standing alone (and hamzah as well).
Arabic dialects
The Arabic language (formal language) that we commonly encounter is Modern Standard Arabic (MSA). This is the language that is most often used in official materials such as news and public documents and is about 1,300 years old.
Arabic dialects (AD), on the other hand, are divided by region: Gulf, Egyptian, Maghrebi (western North Africa), and Levantine (Lebanon, Palestine, and their surroundings). There are many more; I list just these four for convenience.
If you only need to deal with standard Arabic, there’s no reason to read any further, but many people actually use regional dialects on the internet, and if you’re running a service, you may find yourself in a situation where you need to consider Arabic dialects as well.
Why Arabic search doesn’t work well
Unlike English and Korean letters, Arabic letters change their form depending on their position in the word.
There are obvious limitations to simply handling this with n-grams. Cutting a word into n-grams can produce fragments that are completely different words, so search does not work properly.
Many other projects try to do this with n-grams, but this can lead to performance issues because of the large number of tokens that need to be indexed.
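To make the token growth concrete, here is a minimal Python sketch that generates character n-grams for a single word. The function name `char_ngrams` and the 2 to 4 range are my own choices for illustration, not settings taken from any particular project.

```python
def char_ngrams(word: str, min_n: int, max_n: int) -> list[str]:
    """Return every character n-gram of `word` for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the word.
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


word = "المدرسة"  # "the school", 7 letters
grams = char_ngrams(word, 2, 4)

# One 7-letter word already yields 6 + 5 + 4 = 15 tokens to index.
print(len(grams))
# Some fragments are words in their own right, e.g. "مدرس" (teacher).
print("مدرس" in grams)
```

Every indexed word is multiplied in this way, and fragments that happen to be real words can match queries they have nothing to do with.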
Elasticsearch’s default Arabic analyzer
Elasticsearch and Lucene are also aware of this and provide a native Arabic analyzer. (Arabic analyzer)
The default Arabic analyzer is good enough in many cases, but in my experience it sometimes did not work properly, and it was limited when searching data stored in several dialects at the same time.
As I understand it, the default Arabic analyzer stems words against a root dictionary, in the same way English and Korean are handled, so it has the same limitations I described earlier.
No Root Dictionary
To solve this problem, many researchers have worked on stemming without a root dictionary. One example is the Arabic stemmer developed by ISRI (Information Science Research Institute), which is designed to stem Arabic words even without a root word dictionary.
ISRI Stemmer
ISRI Stemmer focuses on keeping things as close to the original as possible, and I think it works better for languages with many dialects, like Arabic, because you don’t need to have a list of roots.
The ISRI Stemmer finds roots without a root dictionary in the following sequence:
- Remove diacritics (to get the bare letter forms)
- Remove hamza and use its base form
- Remove prefixes of length 3 or 2
- If the word starts with و, delete the waw
- Replace alif-hamza with its bare alif form, not hamza
- Stop if the stemmed token is 3 letters or shorter
- Otherwise, handle the token according to its length:
  - Length 4: if it matches a value in PR4 [specified in the paper], extract that value and stop. Otherwise, compare the one-letter suffix and prefix against S1 and P1, delete whichever matches, and return.
  - Length 5: similar to length 4, but with PR53 and PR54
  - Length 6: similar to length 4, but with PR63
  - Length 7: the goal is to delete the one-letter suffix and prefix, then process the result once more in the same way as for length 6 and return
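The first few steps of the sequence above can be sketched in Python as follows. This is a simplified illustration and not the plugin's code: the prefix lists and length thresholds are assumptions based on my recollection of common ISRI implementations, the hamza handling is my interpretation of the two hamza steps, and the pattern matching for lengths 4 to 7 is left out entirely.

```python
import re

# Assumption: these ranges and affix lists are illustrative, not copied from the paper.
DIACRITICS = re.compile(r"[\u064b-\u0652]")            # fathatan through sukun
INITIAL_HAMZA = re.compile(r"^[\u0622\u0623\u0625]")   # آ أ إ at the start of a word
PREFIXES_3 = ("كال", "بال", "ولل", "وال")
PREFIXES_2 = ("ال", "لل")


def strip_prefix(word: str) -> str:
    """Remove at most one prefix, trying length 3 before length 2."""
    if len(word) >= 6:
        for prefix in PREFIXES_3:
            if word.startswith(prefix):
                return word[3:]
    if len(word) >= 5:
        for prefix in PREFIXES_2:
            if word.startswith(prefix):
                return word[2:]
    return word


def isri_preprocess(word: str) -> str:
    """Apply the normalization steps that run before pattern matching."""
    word = DIACRITICS.sub("", word)            # remove diacritics
    word = INITIAL_HAMZA.sub("\u0627", word)   # initial hamza form -> bare alif
    word = strip_prefix(word)                  # length-3, then length-2 prefixes
    # Leading waw, as described in this post. The length guard is my assumption;
    # some implementations only do this when the word starts with a doubled waw.
    if len(word) >= 4 and word.startswith("\u0648"):
        word = word[1:]
    word = INITIAL_HAMZA.sub("\u0627", word)   # normalize again after stripping
    return word


for sample in ("والكتاب", "المدرسة", "أحمد"):
    print(sample, "->", isri_preprocess(sample))
```

A token that comes back with three letters or fewer would be returned as the stem; longer tokens would go on to the length-specific pattern steps, which this sketch does not attempt.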
Developing an Elasticsearch plugin
You could do the natural language processing in the crawler, but I thought a plugin would make the analysis easier.
It was not easy, because there are few references on writing Elasticsearch plugins, but the official site has an explanation that is worth consulting.
elasticsearch-arabic-dialect-plugin
Using the no-root-dictionary approach described above, I developed a plugin for Elasticsearch that supports searching across different Arabic dialects.
GitHub: https://github.com/bunseokbot/elasticsearch-arabic-dialect-plugin
Install
If you want to use it directly with Elasticsearch, install it on every running node as shown here, then restart each node:
./bin/elasticsearch-plugin install https://github.com/bunseokbot/elasticsearch-arabic-dialect-plugin/releases/download/v1.0.0/elasticsearch-arabic-dialect-plugin-1.0.0.zip
You can then add “arabic_dialect_filter” to the filter list of a custom analyzer in the settings when creating an index.
PUT http://localhost:9200/arabic/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "arabic_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["arabic_dialect_filter"]
        }
      }
    }
  }
}
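For documents to be analyzed automatically at index time, a text field has to be mapped to the analyzer. The Python sketch below builds such an index-creation request with the standard library. The field name `content` and the helper name are hypothetical, and the typeless `mappings` layout assumes Elasticsearch 7 or later. The request is only constructed, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_create_index_request(base_url: str, index: str, field: str) -> urllib.request.Request:
    """Build a PUT request that creates an index using arabic_dialect_filter."""
    body = {
        "settings": {
            "analysis": {
                "analyzer": {
                    "arabic_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["arabic_dialect_filter"],
                    }
                }
            }
        },
        # Hypothetical mapping: analyze `field` with the custom analyzer.
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "arabic_analyzer"}
            }
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("http://localhost:9200", "arabic", "content")
print(request.get_method(), request.full_url)
# To actually create the index: urllib.request.urlopen(request)
```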
Add a new document
When you add a new document to the index, text in fields that use the analyzer is analyzed automatically. To check what the analyzer produces for a given text, you can specify the analyzer in a call to the _analyze API as follows.
POST http://localhost:9200/arabic/_analyze
{
  "analyzer": "arabic_analyzer",
  "text": "يمكنك الاعتماد على الجزيرة في مسألتي الحقيقة والشفافيةنحن نتفهّم أن خصوصيتك على الإنترنت أمر بالغ الأهمية، وموافقتك على تمكيننا من جمع بعض المعلومات الشخصية عنك يتطلب ثقة كبيرة منك."
}
Reference
- Boujou, ElMehdi & Chataoui, Hamza & el Mekki, Abdellah & Benjelloun, Saad & Chairi, Ikram & Berrada, Ismail. (2021). An open access NLP dataset for Arabic dialects : Data collection, labeling, and model construction.
- K. Taghva, R. Elkhoury and J. Coombs, “Arabic stemming without a root dictionary,” International Conference on Information Technology: Coding and Computing (ITCC’05), Volume II, 2005, pp. 152–157 Vol. 1, doi: 10.1109/ITCC.2005.90.
- Muaad, A.Y.; Jayappa, H.; Al-antari, M.A.; Lee, S. ArCAR: A Novel Deep Learning Computer-Aided Recognition for Character-Level Arabic Text Representation and Recognition. Algorithms 2021, 14, 216. https://doi.org/10.3390/a14070216
- Omar F. Zaidan, Chris Callison-Burch; Arabic Dialect Identification. Computational Linguistics 2014; 40 (1): 171–202. doi: https://doi.org/10.1162/COLI_a_00169