This repository, `vilalali-python-utility`, is a centralized collection of Python (plus some Perl and shell) utility scripts focused on Natural Language Processing (NLP) and general text processing. The utilities address common challenges in language-data manipulation, particularly for Indic languages, but many are versatile enough for broader use.
The project is ideal for researchers, developers, and linguists working with text data, especially in areas like machine translation, text normalization, and data preparation.
The primary domain of this project is Natural Language Processing (NLP), with a strong emphasis on Computational Linguistics and Machine Translation (MT) specifically for Indic Languages (Urdu, Hindi, Punjabi). It also touches upon general Text Preprocessing and Data Utility tasks.
Here's a breakdown of the utilities contained within this repository:
```
vilalali-python-utility/
├── blue-score-calculation/
│   └── BLEU-calculation.zip   # Script for calculating BLEU score, a common MT evaluation metric.
├── lookup_urd2hin/
│   ├── README.md              # Specific README for the Urdu-to-Hindi lookup.
│   ├── lookup.pl              # Perl script for looking up terms.
│   ├── lookup.py              # Python script for looking up terms.
│   ├── search.pl              # Perl script for searching within data.
│   ├── search_pan.pl          # Perl script for Punjabi-specific searching.
│   └── split.pl               # Perl script for splitting text.
├── nukta-marker/
│   ├── README.md              # Specific README for the nukta marker.
│   ├── ChangeLog.md           # Change log for the nukta marker tool.
│   ├── fw.txt                 # Forwarding rules/data for nukta marking.
│   ├── n_gram_generator.py    # Generates n-grams from text data.
│   ├── nukta.py               # Core Python script for marking nuktas (diacritics) in Indic scripts.
│   └── run_shell.sh           # Shell script to run the nukta marker.
├── postprocessor_urd2hin/
│   ├── README.md.txt          # Specific README for the Urdu-to-Hindi post-processor.
│   ├── input.txt              # Example input file.
│   ├── list.txt               # List of rules/data for post-processing.
│   ├── myout.txt              # Example output file.
│   ├── post_processor.py      # Python script for post-processing Urdu-to-Hindi text.
│   ├── pp_ur2hi.sh            # Shell script to run the post-processor.
│   └── printinput.pl          # Perl script to print input.
├── space-insert/
│   ├── README.md              # Specific README for space insertion.
│   └── spaceinsert.py         # Python script to intelligently insert spaces in text.
└── uniq-and-freq-count/
    ├── README.md              # Specific README for unique and frequency count.
    ├── uniq.py                # Python script to extract unique lines/words.
    └── vocab.py               # Python script to generate vocabulary and frequency counts.
```
**`blue-score-calculation/`**
- Purpose: Tools for calculating the BLEU (Bilingual Evaluation Understudy) score, a widely used metric for quantitatively evaluating machine-translated text against human reference translations.
- Domain: Machine Translation Evaluation.
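To illustrate what such a tool computes, here is a minimal sentence-level BLEU sketch (modified n-gram precision combined with a brevity penalty). This is not the bundled script; for real evaluations, use an established implementation such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped (modified) n-gram precision.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:  # any zero precision collapses the geometric mean
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 3))  # 1.0
```

A perfect match scores 1.0; production toolkits add smoothing and multi-reference support that this sketch omits.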
**`lookup_urd2hin/`**
- Purpose: A set of scripts providing lookup and search functionality tailored to the Urdu-to-Hindi language pair, likely used for cross-lingual lexical mapping or data exploration.
- Domain: Cross-lingual NLP, Lexical Resources, Indic Languages.
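A minimal Python sketch of the lookup idea follows. The two-entry mapping is a hypothetical stand-in; the repository's actual scripts read their tables from data files.

```python
# Tiny illustrative Urdu-to-Hindi mapping (hypothetical sample data).
URD2HIN = {
    "کتاب": "किताब",  # "book"
    "پانی": "पानी",   # "water"
}

def lookup(token, table=URD2HIN):
    """Return the mapped form, falling back to the input token if unmapped."""
    return table.get(token, token)

print(lookup("کتاب"))
```

Falling back to the source token keeps the output aligned with the input even when the table has gaps.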
**`nukta-marker/`**
- Purpose: Adds or corrects nuktas (diacritical marks) in scripts such as Devanagari, where they distinguish phonetically similar characters (e.g. ज vs. ज़) and are essential for correct pronunciation and meaning, especially in Perso-Arabic loanwords.
- Domain: Text Normalization, Computational Linguistics, Indic Script Processing.
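The core operation can be sketched as inserting the combining nukta (U+093C) after target consonants. This is a deliberately naive rule; the actual `nukta.py` applies rules derived from `fw.txt`, and a blanket per-character rule like this would also mark native words that take no nukta.

```python
NUKTA = "\u093C"  # Devanagari combining sign nukta
# Consonants that commonly take a nukta in loanwords (illustrative subset): ज, फ
TARGETS = {"\u091C", "\u092B"}

def mark_nukta(word, targets=TARGETS):
    """Insert a combining nukta after every target consonant (naive rule)."""
    out = []
    for ch in word:
        out.append(ch)
        if ch in targets:
            out.append(NUKTA)
    return "".join(out)

print(mark_nukta("जरा"))  # renders as ज़रा (nukta inserted after ज)
```

Real tools also need to normalize between precomposed nukta letters (U+0958–U+095F) and their base-plus-nukta decompositions, e.g. via `unicodedata.normalize`.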
**`postprocessor_urd2hin/`**
- Purpose: Scripts for post-processing Urdu-to-Hindi machine translation output. Post-processing typically corrects common errors, normalizes text, or applies linguistic rules to improve the fluency and accuracy of the translated content.
- Domain: Machine Translation Quality Improvement, Text Normalization, Indic Languages.
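A rule-based post-editing pass of this kind can be sketched as a list of regex substitutions applied in order. The two spacing rules below are hypothetical stand-ins; the actual `post_processor.py` reads its rules from `list.txt`.

```python
import re

# Illustrative post-editing rules (pattern, replacement), applied in order.
RULES = [
    (r"\s+([,.!?\u0964])", r"\1"),  # no space before punctuation (। included)
    (r"\s{2,}", " "),               # collapse runs of whitespace
]

def post_process(text, rules=RULES):
    """Apply each substitution rule in sequence, then trim the result."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()

print(post_process("hello ,  world ."))  # hello, world.
```

Rule order matters: later rules see the output of earlier ones, so normalization rules usually run last.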
**`space-insert/`**
- Purpose: A Python script that intelligently inserts spaces into text. This is particularly useful for text where spaces are missing or inconsistent, which can degrade downstream NLP tasks such as tokenization.
- Domain: Text Preprocessing, Text Normalization.
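One common approach to this problem is dictionary-driven segmentation. The greedy longest-match sketch below is an assumption about the technique, not the actual `spaceinsert.py` algorithm, and the vocabulary is a made-up sample.

```python
def insert_spaces(text, vocab):
    """Greedy longest-match segmentation: prefer the longest known word
    starting at each position, falling back to a single character."""
    n, i, out = len(text), 0, []
    while i < n:
        for j in range(n, i, -1):  # try longest candidates first
            if text[i:j] in vocab or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return " ".join(out)

vocab = {"machine", "translation", "is", "fun"}
print(insert_spaces("machinetranslationisfun", vocab))  # machine translation is fun
```

Greedy matching can mis-segment ambiguous strings; dynamic-programming segmenters (e.g. Viterbi over word probabilities) handle those cases better.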
**`uniq-and-freq-count/`**
- Purpose: Scripts for basic yet essential text analysis: extracting unique items (lines, words) and counting their frequencies to build vocabularies.
- Domain: Text Analysis, Data Preprocessing, Vocabulary Generation.
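Both tasks are small enough to sketch directly. These functions approximate what `uniq.py` and `vocab.py` do (order-preserving deduplication and whitespace-tokenized frequency counts); the exact I/O handling of the real scripts may differ.

```python
from collections import Counter

def unique_lines(lines):
    """First occurrence of each line, preserving input order."""
    seen, out = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out

def vocab_counts(lines):
    """Frequency table over whitespace-separated tokens."""
    return Counter(tok for line in lines for tok in line.split())

lines = ["a b a", "a b a", "c"]
print(unique_lines(lines))   # ['a b a', 'c']
print(vocab_counts(lines))
```

For vocabulary building, `Counter.most_common()` yields the tokens sorted by descending frequency.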
Each sub-directory generally contains its own `README.md` (or `.txt` file) with specific instructions for the scripts in that utility. Refer to those individual READMEs for detailed usage guides, dependencies, and examples.
- Clone the repository:
  ```bash
  git clone https://github.com/vilalali/vilalali-python-utility.git
  cd vilalali-python-utility
  ```
- Navigate to the desired utility directory:
  ```bash
  cd nukta-marker  # example
  ```
- Install dependencies (if any; they are typically listed in the sub-directory's README, or the scripts use only the standard library). A virtual environment is recommended:
  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt  # if a requirements.txt exists
  ```
- Run the script:
  ```bash
  python3 script_name.py  # replace with the actual script name
  ```
Contributions are welcome! If you have suggestions for improvements, bug fixes, or new utility scripts that fit the scope of this collection, feel free to open an issue or submit a pull request.
This project is intended to be distributed under the MIT License; see the LICENSE file (if present) for details.