Sudanese Arabic LLM Project

Towards Representation of Sudanese Arabic Dialect in Large Language Models
A collaborative initiative to build a high-quality dataset and fine-tune language models that understand and generate Sudanese Arabic.

🌍 Project Overview

Sudanese Arabic is a widely spoken but underrepresented dialect in the field of Natural Language Processing (NLP). This project aims to:

Collect and annotate Sudanese Arabic text from diverse sources.
Create a balanced and labeled corpus.
Fine-tune language models that support Arabic (e.g., AraBERT, CAMeLBERT, LLaMA).
Evaluate model performance on dialect comprehension and generation.

🤝 How to Contribute

1. Fork this repository.
2. Create a new branch:
   git checkout -b feature/your-task
3. Make your changes and commit:
   git commit -m "Your message"
4. Push your branch and open a Pull Request.
5. Use GitHub Issues or Discussions to coordinate and communicate.

🧠 Tasks You Can Help With

Area	Description
🗂️ Data Collection	Gather Sudanese Arabic from social media, transcripts, and oral storytelling.
📝 Annotation	Label text using dialect-specific guidelines.
🔧 Script Writing	Automate preprocessing, cleaning, and formatting tasks.
🧪 Model Fine-tuning	Fine-tune LLMs using the Sudanese corpus.
📊 Evaluation	Test model understanding and generation of Sudanese Arabic.
📢 Communication	Help with outreach, documentation, and community involvement.
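
For the Script Writing task, a cleaning pass might look like the sketch below. It is only an illustration: the function names `clean_text` and `deduplicate` are hypothetical, and the choices it makes (dropping URLs, diacritics, and tatweel) are assumptions, not rules taken from this project's guidelines. It deliberately leaves vocabulary and spelling untouched, so dialect forms are kept as written.

```python
import re
import unicodedata

# Arabic short-vowel marks (tashkeel), U+064B to U+0652, plus superscript alef U+0670.
# Assumption: the corpus does not need diacritics; drop this step if it does.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character used for visual stretching
_URL = re.compile(r"https?://\S+")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Return text with URLs, diacritics, tatweel, and extra spaces removed."""
    text = unicodedata.normalize("NFC", text)
    text = _URL.sub(" ", text)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return _WHITESPACE.sub(" ", text).strip()


def deduplicate(lines):
    """Clean each line and keep the first occurrence of every non-empty result."""
    seen = set()
    unique = []
    for line in lines:
        cleaned = clean_text(line)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique
```

Because the sketch only removes characters and never substitutes words, a Sudanese form such as شنو؟ passes through unchanged.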

📝 Annotation Guidelines (Summary)

Goal: Consistent and culturally accurate annotation of Sudanese Arabic text.

1. Identify Language Variants:

Mark whether each text is Sudanese Arabic or Modern Standard Arabic (MSA).

2. Note Regional Vocabulary:

Tag terms unique to Khartoum, Darfur, or East, North, or South Sudan.

3. Normalize Spelling:

Respect Sudanese usage rather than converting to MSA forms (e.g., شنو؟ instead of ماذا؟, both meaning "what?").

4. (Optional) Categorize Content:

Daily Conversations
Political Speech
Social Media Expressions
Folk Stories / Proverbs
Songs / Poetry

Full guidelines are in docs/annotation-guidelines.md.
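
One way to keep annotations consistent with the four points above is to store each one as a small record and check it before it enters the corpus. The sketch below is an assumption about how that could look, not the project's actual schema: the `Annotation` class, its field names, and the label strings are all hypothetical, so check docs/annotation-guidelines.md for the real label set.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical label sets derived from the summary above.
VARIANTS = {"sudanese", "msa"}
REGIONS = {"khartoum", "darfur", "east", "north", "south"}
CATEGORIES = {
    "daily_conversation",
    "political_speech",
    "social_media",
    "folk_story_proverb",
    "song_poetry",
}


@dataclass
class Annotation:
    text: str
    variant: str                                   # point 1: language variant
    regional_terms: Dict[str, str] = field(default_factory=dict)  # point 2: term -> region
    category: Optional[str] = None                 # point 4: optional content type

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is usable."""
        problems = []
        if not self.text.strip():
            problems.append("text is empty")
        if self.variant not in VARIANTS:
            problems.append(f"unknown variant: {self.variant}")
        for term, region in self.regional_terms.items():
            if region not in REGIONS:
                problems.append(f"unknown region: {region}")
            if term not in self.text:
                problems.append(f"tagged term not found in text: {term}")
        if self.category is not None and self.category not in CATEGORIES:
            problems.append(f"unknown category: {self.category}")
        return problems


# Usage: a record with no regional tags and an optional category.
record = Annotation(text="شنو؟", variant="sudanese", category="daily_conversation")
print(record.validate())
```

Returning a list of problems instead of raising on the first one lets an annotator fix every issue in a record in a single pass.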



AnwarCS/Sudanese-Arabic-LLM
