Kabyle Text Cleaner is a command-line utility and Python library designed to normalize Kabyle text. It converts non-standard Greek/Latin look-alike characters to the proper Latin-derived letters used in the Kabyle language.
Figure: Kabyle Text Cleaner on a terminal
The need for such a tool arose when we tried to extract text from PDFs written in Kabyle. We noticed that, because of old keyboard layouts, people still type look-alike symbols instead of the standard Unicode Latin letters.
So when you extract the text, you end up with a corpus that you have to correct.
With Kabyle Text Cleaner, you can clean at least part of the corpus and normalize some words to their standard form, instead of working with a corpus full of symbols that make the text difficult both to read and to process.
Kabyle Text Cleaner does not yet correct, clean and normalize everything, but together we will try to improve it.
If you are on Linux, you can install Kabyle Text Cleaner with pip.
First, extract the text from a PDF and save it to a file (text.txt in the commands further down). Then create a virtual environment (venv):
python -m venv kab
Activate your venv:
source kab/bin/activate
Install Kabyle Text Cleaner via pip:
pip install kabyle-text-cleaner
Now you are ready to process your text file. First, let's check whether any foreign symbols are used in your text file:
kabtxtcleaner text.txt
Note that the command to call is kabtxtcleaner.
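To give an idea of what such a check involves, here is a minimal Python sketch. It is not the code of Kabyle Text Cleaner: the function name find_foreign_symbols and the letter set KABYLE_EXTRA are assumptions made for illustration, and the set may not match the letters the tool actually accepts.

```python
import unicodedata
from collections import Counter

# Assumption: the non-ASCII letters of the standard Kabyle Latin alphabet.
# This set is illustrative and may differ from what Kabyle Text Cleaner uses.
KABYLE_EXTRA = set(unicodedata.normalize("NFC", "čḍǧḥɣɛṛṣṭẓČḌǦḤƔƐṚṢṬẒ"))


def find_foreign_symbols(text):
    """Count letters that are neither ASCII nor in KABYLE_EXTRA."""
    # NFC merges a base letter and its combining mark (e.g. d + dot below)
    # into one precomposed character, so both spellings are treated alike.
    normalized = unicodedata.normalize("NFC", text)
    return Counter(
        ch
        for ch in normalized
        if ch.isalpha() and not ch.isascii() and ch not in KABYLE_EXTRA
    )


if __name__ == "__main__":
    # "a\u03b3rum" is written with a Greek gamma instead of the Latin gamma.
    for symbol, count in find_foreign_symbols("a\u03b3rum").items():
        print(f"U+{ord(symbol):04X} {unicodedata.name(symbol)}: {count}")
```

A sketch like this only reports suspicious characters; it does not change the text.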
To clean your text, run this command; the cleaned version is written to the current working directory:
kabtxtcleaner text.txt -o clean.txt --fix
or, with a different output file name:
kabtxtcleaner text.txt -o kab_clean.txt --fix
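The kind of replacement involved can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the LOOKALIKES table and the fix_lookalikes function are hypothetical, and the table covers only three Greek letters that resemble Kabyle letters.

```python
# Hypothetical mapping for illustration; not the table shipped with
# Kabyle Text Cleaner, which may handle more characters.
LOOKALIKES = {
    "\u03b3": "\u0263",  # Greek small gamma    -> Latin small gamma
    "\u0393": "\u0194",  # Greek capital gamma  -> Latin capital gamma
    "\u03b5": "\u025b",  # Greek small epsilon  -> Latin small open e
}

# str.maketrans builds a translation table usable by str.translate.
TABLE = str.maketrans(LOOKALIKES)


def fix_lookalikes(text):
    """Replace the Greek look-alikes listed in LOOKALIKES; leave the rest."""
    return text.translate(TABLE)


if __name__ == "__main__":
    cleaned = fix_lookalikes("a\u03b3rum")
    # Print code points so the change is visible even when glyphs look alike.
    print([f"U+{ord(ch):04X}" for ch in cleaned])
```

A character-by-character table is enough for one-to-one look-alikes; normalizing whole words to a standard form would need more than this.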
Kabyle Text Cleaner is available on PyPI: https://pypi.org/project/kabyle-text-cleaner/
And on GitHub: https://github.com/Imsidag/kabyle-text-cleaner
You're welcome to open issues and request improvements if you notice that Kabyle Text Cleaner fails to clean some aspect of your extracted corpus.