20 October 2025

Kabyle Text Cleaner

Kabyle Text Cleaner is a command-line utility and Python library designed to normalize Kabyle text. It converts non-standard Greek and Latin look-alike characters to the proper Latin-derived letters used in the Kabyle language.

Kabyle Text Cleaner on a terminal

 

The need for such a tool arose when we tried to extract text from PDFs written in Kabyle. We noticed that, because of old keyboard layouts, people still type non-Unicode Latin symbols in their writing.

So when you extract the text, you end up with a corpus that has to be corrected.

By using Kabyle Text Cleaner, you can clean at least part of the corpus and normalize some words to their standard form, instead of working with a corpus full of symbols that make your text difficult not only to read but also to process.
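To make the idea of look-alike normalization concrete, here is a minimal Python sketch. It is not the code or the mapping table used by Kabyle Text Cleaner: the three character pairs and the function name `normalize_lookalikes` are assumptions chosen for illustration only.

```python
# Illustrative mapping only: these pairs are assumptions for this sketch,
# not the table used by Kabyle Text Cleaner.
LOOKALIKES = {
    "\u03b5": "\u025b",  # Greek small epsilon -> Latin small open e
    "\u03b3": "\u0263",  # Greek small gamma   -> Latin small gamma
    "\u0393": "\u0194",  # Greek capital gamma -> Latin capital gamma
}

# Build the translation table once; str.translate then does the replacement.
_TABLE = str.maketrans(LOOKALIKES)


def normalize_lookalikes(text: str) -> str:
    """Replace the Greek look-alikes listed above with their Latin counterparts."""
    return text.translate(_TABLE)


sample = "a\u03b3rum"  # typed with a Greek gamma instead of the Latin one
print(normalize_lookalikes(sample))
```

Characters that are not in the mapping pass through unchanged, so the function is safe to run on text that is already clean.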

Kabyle Text Cleaner does not yet correct, clean, and normalize everything, but together we will try to improve it.

If you are on Linux, you can install Kabyle Text Cleaner with pip.

First, extract the text from a PDF and save it under a name of your choice. Then create a virtual environment (venv):

python -m venv kab

Activate your venv:

source kab/bin/activate 

Install Kabyle Text Cleaner with pip:

pip install kabyle-text-cleaner

Now you are ready to process your text file. First, let's check whether any foreign symbols are used in your text file:

kabtxtcleaner text.txt
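For readers curious about what such a check could look like, the following Python sketch counts characters whose Unicode name marks them as Greek. It only illustrates the general idea and makes no claim about how `kabtxtcleaner` detects foreign symbols; the function name `find_greek_characters` and the sample string are hypothetical.

```python
import unicodedata
from collections import Counter


def find_greek_characters(text: str) -> Counter:
    """Count every character whose Unicode name starts with 'GREEK'."""
    return Counter(
        ch for ch in text
        if unicodedata.name(ch, "").startswith("GREEK")
    )


# Sample string containing a Greek gamma and a Greek epsilon.
report = find_greek_characters("a\u03b3rum \u03b5")
for ch, count in report.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {count}")
```

The Latin letters used in Kabyle, such as U+0263 and U+025B, have names starting with "LATIN", so they are not reported.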

Note that the command to call is kabtxtcleaner.

To clean your text, run this command and you will get a cleaned version in the same working directory:

kabtxtcleaner text.txt -o clean.txt --fix

or 

kabtxtcleaner text.txt -o kab_clean.txt --fix 

Kabyle Text Cleaner is available on PyPI: https://pypi.org/project/kabyle-text-cleaner/

And on GitHub: https://github.com/Imsidag/kabyle-text-cleaner

You're welcome to open issues and request improvements if you notice that Kabyle Text Cleaner fails to clean some aspect of your extracted corpus.

 

04 October 2025

Datasets and corpora in Kabyle

We know that people are looking for Kabyle-language datasets for their projects. This is why we are publishing some of them through our community account on Hugging Face.

Imsidag community on Hugging Face


 

Some of these datasets are parallel corpora and some are monolingual (Kabyle only). We would like to emphasize that some datasets have not been cleaned yet and are published as is on purpose, for people interested in building tools such as fixers, cleaners, and standardizers for the Kabyle language.

These datasets may help you create and improve a Kabyle spellchecker dictionary, for example, or build word frequency analyzers, compute n-grams, compute syllable weight (CV), build word games, and so on.

You can find them all at: https://huggingface.co/Imsidag-community/datasets
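As a small illustration of two of the uses mentioned above, word frequencies and n-grams, here is a minimal Python sketch. It runs on two inline sample lines rather than on one of the published datasets, and the helper names `word_frequencies` and `ngrams` are hypothetical.

```python
from collections import Counter


def word_frequencies(lines):
    """Count whitespace-separated, lowercased words across all lines."""
    return Counter(word for line in lines for word in line.lower().split())


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Two sample lines standing in for a monolingual corpus file.
lines = ["azul fell-awen", "azul fell-akent"]

freq = word_frequencies(lines)
bigrams = Counter(bg for line in lines for bg in ngrams(line.split(), 2))

print(freq.most_common(3))
print(bigrams.most_common(3))
```

Splitting on whitespace is a deliberate simplification; a real pipeline would probably need a tokenizer that handles punctuation and Kabyle hyphenation rules.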

Artificial intelligence in Kabyle

 Azul fell-awen·akent,

We are pleased to announce that we are taking part in translating a project called OpenWebUI into Kabyle.

OpenWebUI is software that you install locally on your own computer. It lets you interact with generative models locally, whether or not you need an internet connection.

OpenWebUI can be used from a simple web browser. Once it is installed, go to your settings and choose "Taqbaylit" as the interface language.

Video: https://www.youtube.com/watch?v=3G-FJEcmPC4

OpenWebUI Git repository: https://github.com/open-webui/open-webui
