October 20, 2025

Kabyle Text Cleaner

Kabyle Text Cleaner is a command-line utility and Python library designed to normalize Kabyle text. It converts non-standard Greek and Latin look-alike characters to the proper Latin-derived letters used in the Kabyle language.

Kabyle Text Cleaner on a terminal

The need for such a tool arose when we tried to extract text from PDFs written in Kabyle. We noticed that, because of old keyboard layouts, people still type non-standard look-alike symbols in place of the proper Unicode Latin letters.

So when you extract the text, you end up with a corpus that you have to correct.

With Kabyle Text Cleaner, you can clean at least part of the corpus and normalize some words to their standard form, instead of working with a corpus full of symbols that make your text difficult not only to read but also to process.
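To make the look-alike problem concrete, here is a minimal Python sketch of this kind of normalization. It is only an illustration: the mapping table and the function name normalize_lookalikes are assumptions made for this example, not the table or API that Kabyle Text Cleaner actually uses.

```python
# Illustrative mapping only: NOT the table used by Kabyle Text Cleaner.
# Keys are Greek look-alikes, values are the Latin-script letters used in Kabyle.
GREEK_TO_LATIN = {
    "\u03b3": "\u0263",  # GREEK SMALL LETTER GAMMA   -> LATIN SMALL LETTER GAMMA
    "\u0393": "\u0194",  # GREEK CAPITAL LETTER GAMMA -> LATIN CAPITAL LETTER GAMMA
    "\u03b5": "\u025b",  # GREEK SMALL LETTER EPSILON -> LATIN SMALL LETTER OPEN E
}

# Build the translation table once so it can be reused for every string.
_TABLE = str.maketrans(GREEK_TO_LATIN)


def normalize_lookalikes(text: str) -> str:
    """Replace the Greek look-alikes listed above with their Latin counterparts."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "a\u03b3rum"  # the word typed with a Greek gamma
    print(normalize_lookalikes(sample))
```

The two strings look almost identical on screen, but they are different code points, which is why searching, sorting, and tokenizing an uncleaned corpus gives inconsistent results.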

Kabyle Text Cleaner does not yet correct, clean, and normalize everything, but together we will try to improve it.

If you are on Linux, you can install Kabyle Text Cleaner with pip.

First, extract the text from a PDF and save it to a named file. Then create a venv:

python -m venv kab

Activate your venv:

source kab/bin/activate 

Install Kabyle Text Cleaner via pip:

pip install kabyle-text-cleaner

Now you are ready to process your text file. First, let's check whether it contains any foreign symbols:

kabtxtcleaner text.txt

Note that the command to call is kabtxtcleaner.
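As a rough idea of what such a check involves, the following Python sketch counts the Greek characters present in a piece of text. It is an assumption-laden illustration, not the tool's implementation: the function name find_greek_characters and the rule "any character whose Unicode name starts with GREEK" are hypothetical choices made for this example.

```python
import unicodedata
from collections import Counter


def find_greek_characters(text: str) -> Counter:
    """Count every character whose Unicode name starts with 'GREEK'.

    Hypothetical detection rule for illustration; the real tool may
    look for a different or larger set of symbols.
    """
    counts = Counter()
    for ch in text:
        # unicodedata.name returns the default "" for unnamed characters.
        if unicodedata.name(ch, "").startswith("GREEK"):
            counts[ch] += 1
    return counts


if __name__ == "__main__":
    # Sample string containing two Greek gammas and one Greek epsilon.
    sample = "a\u03b3rum d a\u03b3rum, l\u03b5id"
    for ch, n in find_greek_characters(sample).most_common():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {n}")
```

To run it on an extracted file, read the file with encoding="utf-8" and pass its content to the function.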

To clean your text, run this command and you'll get a cleaner version in the same working directory:

kabtxtcleaner text.txt -o clean.txt --fix

or 

kabtxtcleaner text.txt -o kab_clean.txt --fix 

Kabyle Text Cleaner is available on PyPI: https://pypi.org/project/kabyle-text-cleaner/

And on GitHub: https://github.com/Imsidag/kabyle-text-cleaner

You're welcome to open issues and request improvements if you notice that Kabyle Text Cleaner fails to clean some aspect of your extracted corpus.

 
