AnwarCS/Sudanese-Arabic-LLM

Sudanese Arabic LLM Project

Towards Representation of Sudanese Arabic Dialect in Large Language Models
A collaborative initiative to build a high-quality dataset and fine-tune language models that understand and generate Sudanese Arabic.


🌍 Project Overview

Sudanese Arabic is a widely spoken but underrepresented dialect in the field of Natural Language Processing (NLP). This project aims to:

  • Collect and annotate Sudanese Arabic text from diverse sources.
  • Create a balanced and labeled corpus.
  • Fine-tune language models with Arabic support (e.g., AraBERT, CAMeLBERT, LLaMA).
  • Evaluate model performance on dialect comprehension and generation.
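
As a rough illustration of what "balanced" could mean in practice, the sketch below measures each label's share of the corpus and flags labels that fall under a threshold. The label names, the `min_share` threshold, and both helper functions are assumptions made for this example; the project does not prescribe them.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the corpus as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, min_share=0.1):
    """List labels whose share is below min_share (hypothetical threshold)."""
    shares = label_shares(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


# Hypothetical labels for ten annotated items.
labels = ["daily_conversation"] * 6 + ["social_media"] * 3 + ["proverb"] * 1

print(label_shares(labels))
print(underrepresented(labels, min_share=0.2))
```

A check like this could run before fine-tuning to decide which content types need more collection effort.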

🤝 How to Contribute

  1. Fork this repository.
  2. Create a new branch:
    git checkout -b feature/your-task
  3. Make your changes and commit:
    git commit -m "Your message"
  4. Push your branch and open a Pull Request.
  5. Use GitHub Issues or Discussions to coordinate and communicate.

🧠 Tasks You Can Help With

| Area | Description |
| --- | --- |
| 🗂️ Data Collection | Gather Sudanese Arabic from social media, transcripts, and oral storytelling. |
| 📝 Annotation | Label text using dialect-specific guidelines. |
| 🔧 Script Writing | Automate preprocessing, cleaning, and formatting tasks. |
| 🧪 Model Fine-tuning | Fine-tune LLMs using the Sudanese corpus. |
| 📊 Evaluation | Test model understanding and generation of Sudanese Arabic. |
| 📢 Communication | Help with outreach, documentation, and community involvement. |
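
For contributors picking up the script-writing task, here is a minimal sketch of what a cleaning step might look like. It is not the project's actual pipeline: the function name `clean_text` and the choice of what to strip (URLs, @-mentions, diacritics, tatweel) are assumptions for illustration. It deliberately leaves dialect spellings untouched.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic diacritics (tashkeel, U+064B to U+0652) plus superscript alef (U+0670).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character used for visual stretching
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip web noise and decorative marks without altering dialect spelling."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = text.replace(TATWEEL, "")
    return WHITESPACE_RE.sub(" ", text).strip()


# Hypothetical social media snippet with elongation, a URL, and a mention.
raw = "شنـــو الأخبار؟ https://example.com @user"
print(clean_text(raw))
```

Whether diacritics should be removed at all is a judgment call that depends on the source material, so treat that step as optional.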

📝 Annotation Guidelines (Summary)

Goal: Consistent and culturally accurate annotation of Sudanese Arabic text.

1. Identify Language Variants:

  • Sudanese Arabic vs. Modern Standard Arabic (MSA).

2. Note Regional Vocabulary:

  • Tag terms unique to Khartoum, Darfur, East, North, South Sudan.

3. Normalize Spelling:

  • Keep Sudanese forms rather than converting them to MSA (e.g., شنو؟ instead of ماذا؟).

4. (Optional) Categorize Content:

  • Daily Conversations
  • Political Speech
  • Social Media Expressions
  • Folk Stories / Proverbs
  • Songs / Poetry

The full guidelines are in docs/annotation-guidelines.md.
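
To make the summary above concrete, the following sketch shows one way an annotated item could be represented and checked. The `Annotation` class, its field names, and the label vocabularies are hypothetical, derived only from the four points listed here; the authoritative schema, if one exists, would be in the full guidelines.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical label vocabularies based on the summary above.
VARIANTS = {"sudanese", "msa"}
REGIONS = {"khartoum", "darfur", "east", "north", "south"}
CATEGORIES = {
    "daily_conversation",
    "political_speech",
    "social_media",
    "folk_story_or_proverb",
    "song_or_poetry",
}


@dataclass
class Annotation:
    text: str                                          # original spelling, not converted to MSA
    variant: str                                       # "sudanese" or "msa"
    regions: List[str] = field(default_factory=list)   # regional vocabulary tags
    category: Optional[str] = None                     # optional content category

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is valid."""
        problems = []
        if not self.text.strip():
            problems.append("text is empty")
        if self.variant not in VARIANTS:
            problems.append(f"unknown variant: {self.variant!r}")
        for region in self.regions:
            if region not in REGIONS:
                problems.append(f"unknown region: {region!r}")
        if self.category is not None and self.category not in CATEGORIES:
            problems.append(f"unknown category: {self.category!r}")
        return problems


example = Annotation(
    text="شنو؟",
    variant="sudanese",
    regions=["khartoum"],
    category="daily_conversation",
)
print(example.validate())

bad = Annotation(text="ماذا؟", variant="egyptian")
print(bad.validate())
```

Returning a list of problems instead of raising on the first error lets an annotator see every issue with a record at once.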

About

Building a Sudanese Arabic dataset and fine-tuning LLMs to improve representation of this dialect.


Contributors 4
