A Collection of NLP and Text Processing Python Utilities

Project Overview

This repository, vilalali-python-utility, is a centralized collection of utility scripts for Natural Language Processing (NLP) and general text processing. Most are written in Python, with some Perl and shell scripts. They were developed to address common problems in language data manipulation, particularly for Indic languages, but many are general enough for broader use.

The project is ideal for researchers, developers, and linguists working with text data, especially in areas like machine translation, text normalization, and data preparation.

Domain

The primary domain of this project is Natural Language Processing (NLP), with a strong emphasis on Computational Linguistics and Machine Translation (MT) specifically for Indic Languages (Urdu, Hindi, Punjabi). It also touches upon general Text Preprocessing and Data Utility tasks.

Directory Structure and Utilities

Here's a breakdown of the utilities contained within this repository:

```
vilalali-python-utility/
├── blue-score-calculation/
│   └── BLEU-calculation.zip           # Script for calculating BLEU score, a common MT evaluation metric.
├── lookup_urd2hin/
│   ├── README.md                      # Specific README for Urdu to Hindi lookup.
│   ├── lookup.pl                      # Perl script for looking up terms.
│   ├── lookup.py                      # Python script for looking up terms.
│   ├── search.pl                      # Perl script for searching within data.
│   ├── search_pan.pl                  # Perl script for Punjabi-specific searching.
│   └── split.pl                       # Perl script for splitting text.
├── nukta-marker/
│   ├── README.md                      # Specific README for Nukta marker.
│   ├── ChangeLog.md                   # Change log for the Nukta marker tool.
│   ├── fw.txt                         # Forwarding rules/data for Nukta marking.
│   ├── n_gram_generator.py            # Generates n-grams from text data.
│   ├── nukta.py                       # Core Python script for marking Nukta (diacritics) in Indic scripts.
│   └── run_shell.sh                   # Shell script to run the Nukta marker.
├── postprocessor_urd2hin/
│   ├── README.md.txt                  # Specific README for Urdu to Hindi post-processor.
│   ├── input.txt                      # Example input file.
│   ├── list.txt                       # List of rules/data for post-processing.
│   ├── myout.txt                      # Example output file.
│   ├── post_processor.py              # Python script for post-processing Urdu to Hindi text.
│   ├── pp_ur2hi.sh                    # Shell script to run the post-processor.
│   └── printinput.pl                  # Perl script to print input.
├── space-insert/
│   ├── README.md                      # Specific README for space insertion.
│   └── spaceinsert.py                 # Python script to intelligently insert spaces in text.
└── uniq-and-freq-count/
    ├── README.md                      # Specific README for unique and frequency count.
    ├── uniq.py                        # Python script to extract unique lines/words.
    └── vocab.py                       # Python script to generate vocabulary and frequency counts.
```

Utilities Overview

1. `blue-score-calculation/`

Purpose: Contains tools for calculating the BLEU (Bilingual Evaluation Understudy) score, a widely used metric to quantitatively evaluate the quality of machine-translated text against human references.
Domain: Machine Translation Evaluation.
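
To make the metric concrete, here is a minimal sketch of sentence-level BLEU with a single reference. It is not the code shipped in `BLEU-calculation.zip`; the function names (`ngrams`, `sentence_bleu`), whitespace tokenization, and the absence of smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed, single-reference BLEU for whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


print(sentence_bleu("the cat sat on the mat", "the cat sat on a mat"))
```

Because the sketch is unsmoothed, short sentences with no matching 4-gram score 0; corpus-level BLEU or a smoothed variant is usually preferred for real evaluation.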

2. `lookup_urd2hin/`

Purpose: A set of scripts designed for lookup and searching functionalities, specifically tailored for Urdu to Hindi language pairs. These are likely used for cross-lingual lexical mapping or data exploration.
Domain: Cross-lingual NLP, Lexical Resources, Indic Languages.
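
As an illustration of what a cross-lingual lexical lookup can look like, the sketch below maps Urdu tokens to Hindi through a small in-memory lexicon. The tab-separated `source<TAB>target` format, the sample entries, and the function names are hypothetical; the actual `lookup.py` may use a different data format and logic.

```python
def load_lexicon(lines):
    """Build a source-to-target mapping from 'source<TAB>target' lines."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        source, target = line.split("\t", 1)
        lexicon[source.strip()] = target.strip()
    return lexicon


def lookup_tokens(text, lexicon):
    """Replace each whitespace-separated token found in the lexicon."""
    # Tokens without an entry are passed through unchanged.
    return " ".join(lexicon.get(tok, tok) for tok in text.split())


# Hypothetical two-entry lexicon (Urdu -> Hindi).
sample = ["کتاب\tकिताब", "پانی\tपानी"]
lexicon = load_lexicon(sample)
print(lookup_tokens("کتاب اور پانی", lexicon))
```

In practice the lines would come from a file opened with `encoding="utf-8"`, so that both scripts are read correctly.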

3. `nukta-marker/`

Purpose: This utility focuses on nukta marking: adding or correcting the nukta, a dot diacritic placed below a letter, in text written in Indic scripts such as Devanagari (used for Hindi) and Gurmukhi (used for Punjabi). The nukta distinguishes otherwise identical letters, for example ज (ja) from ज़ (za), so it is crucial for correct pronunciation and meaning.
Domain: Text Normalization, Computational Linguistics, Indic Script Processing.
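
One simple way to restore missing nuktas is a word-list lookup: strip the nukta from known nukta-bearing words and use the stripped form as a key. The sketch below shows that idea for Devanagari only. It is not the algorithm in `nukta.py`; the word list and the function names are hypothetical.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nukta(word):
    """Remove every nukta after decomposing to base letter + nukta."""
    return unicodedata.normalize("NFD", word).replace(NUKTA, "")


def build_restorer(nukta_words):
    """Map each nukta-less spelling to its known nukta-bearing spelling."""
    return {strip_nukta(w): unicodedata.normalize("NFC", w) for w in nukta_words}


def mark_nukta(text, restorer):
    """Replace tokens whose nukta-less form is in the restorer."""
    # Tokens that are not in the word list are left untouched.
    return " ".join(restorer.get(strip_nukta(tok), tok) for tok in text.split())


# Hypothetical word list of spellings that should carry a nukta.
restorer = build_restorer(["ज़रूर", "फ़ोन"])
print(mark_nukta("जरूर फोन करो", restorer))
```

A plain word list cannot handle spellings that are valid both with and without a nukta; resolving those needs context, which is one plausible use for n-gram statistics.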

4. `postprocessor_urd2hin/`

Purpose: Scripts for post-processing machine translation outputs from Urdu to Hindi. Post-processing often involves correcting common errors, normalizing text, or applying specific linguistic rules to improve the fluency and accuracy of translated content.
Domain: Machine Translation Quality Improvement, Text Normalization, Indic Languages.
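
Rule-based post-processing is often implemented as an ordered list of pattern replacements. The following sketch illustrates that pattern only; the two rules and the names `RULES` and `post_process` are hypothetical and are not taken from `post_processor.py` or `list.txt`.

```python
import re

# Hypothetical rules for illustration, applied top to bottom.
RULES = [
    (re.compile(r"\s+"), " "),         # collapse any run of whitespace
    (re.compile(r" ([,?!])"), r"\1"),  # remove a space before , ? or !
]


def post_process(line, rules=RULES):
    """Apply each (pattern, replacement) rule in order, then trim the ends."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line.strip()


print(post_process("  यह   क्या है ?  "))
```

Order matters here: whitespace is collapsed first so that the punctuation rule only has to handle a single space.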

5. `space-insert/`

Purpose: A Python script designed to intelligently insert spaces into text. This is particularly useful for languages or text data where spaces might be missing or inconsistently used, which can impact downstream NLP tasks.
Domain: Text Preprocessing, Text Normalization.
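
A common approach to restoring missing spaces is dictionary-based segmentation with dynamic programming. The sketch below assumes a known vocabulary and an input with no spaces at all, and it prefers the segmentation with the fewest words. It is an illustration of the general technique, not the logic of `spaceinsert.py`; `insert_spaces` and the sample vocabulary are hypothetical.

```python
def insert_spaces(text, vocabulary):
    """Segment text into vocabulary words, preferring fewer, longer words.

    Returns the segmented string, or the input unchanged if the text
    cannot be fully covered by vocabulary words.
    """
    max_len = max(map(len, vocabulary), default=0)
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return " ".join(best[-1]) if best[-1] is not None else text


vocab = {"machine", "translation", "is", "fun", "a"}
print(insert_spaces("machinetranslationisfun", vocab))
```

Real tools often weight candidate words by frequency rather than just counting them, which helps when several segmentations are possible.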

6. `uniq-and-freq-count/`

Purpose: Contains scripts for basic yet essential text analysis tasks: extracting unique items (lines, words) and calculating their frequencies to build vocabularies.
Domain: Text Analysis, Data Preprocessing, Vocabulary Generation.
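
Both tasks can be sketched with the Python standard library. The functions below are illustrative assumptions (whitespace tokenization, first-seen order for unique lines) and are not the contents of `uniq.py` or `vocab.py`.

```python
from collections import Counter


def unique_lines(lines):
    """Return lines with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(line.rstrip("\n") for line in lines))


def vocabulary(lines):
    """Count whitespace-separated tokens.

    Returns (token, count) pairs, most frequent first, ties alphabetical.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


lines = ["a b a\n", "b c\n", "a b a\n"]
print(unique_lines(lines))
for token, count in vocabulary(lines):
    print(f"{token}\t{count}")
```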

How to Use

Each sub-directory generally contains its own README.md (or .txt file) with specific instructions on how to use the scripts within that particular utility. Please refer to those individual READMEs for detailed usage guides, dependencies, and examples.

General Setup (for Python scripts)

Clone the repository:

```
git clone https://github.com/vilalali/vilalali-python-utility.git
cd vilalali-python-utility
```

Navigate to the desired utility directory:
```
cd nukta-marker # Example
```
Install dependencies, if any. They are typically listed in the sub-directory's README; some scripts may need only the Python standard library. You might want to use a virtual environment:
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt # If a requirements.txt exists
```
Run the script:
```
python3 script_name.py # Example
```

Contributing

Contributions are welcome! If you have suggestions for improvements, bug fixes, or new utility scripts that fit the scope of this collection, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file, if present, for details.

Author

Vilal Ali
