A collection of Python-based NLP and text processing utilities for Indic languages and general text data.

vilalali/python-utility

A Collection of NLP and Text Processing Python Utilities

Project Overview

This repository, vilalali-python-utility, is a centralized collection of utility scripts, mostly in Python with some Perl and shell, for Natural Language Processing (NLP) and general text processing tasks. The utilities were developed to address common challenges in language data manipulation, particularly for Indic languages, but many are versatile enough for broader use.

The project is ideal for researchers, developers, and linguists working with text data, especially in areas like machine translation, text normalization, and data preparation.

Domain

The primary domain of this project is Natural Language Processing (NLP), with a strong emphasis on computational linguistics and machine translation (MT) for Indic languages (Urdu, Hindi, Punjabi). It also covers general text preprocessing and data utility tasks.

Directory Structure and Utilities

Here's a breakdown of the utilities contained within this repository:

vilalali-python-utility/
├── blue-score-calculation/
│   └── BLEU-calculation.zip           # Script for calculating BLEU score, a common MT evaluation metric.
├── lookup_urd2hin/
│   ├── README.md                      # Specific README for Urdu to Hindi lookup.
│   ├── lookup.pl                      # Perl script for looking up terms.
│   ├── lookup.py                      # Python script for looking up terms.
│   ├── search.pl                      # Perl script for searching within data.
│   ├── search_pan.pl                  # Perl script for Punjabi-specific searching.
│   └── split.pl                       # Perl script for splitting text.
├── nukta-marker/
│   ├── README.md                      # Specific README for Nukta marker.
│   ├── ChangeLog.md                   # Change log for the Nukta marker tool.
│   ├── fw.txt                         # Forwarding rules/data for Nukta marking.
│   ├── n_gram_generator.py            # Generates n-grams from text data.
│   ├── nukta.py                       # Core Python script for marking Nukta (diacritics) in Indic scripts.
│   └── run_shell.sh                   # Shell script to run the Nukta marker.
├── postprocessor_urd2hin/
│   ├── README.md.txt                  # Specific README for Urdu to Hindi post-processor.
│   ├── input.txt                      # Example input file.
│   ├── list.txt                       # List of rules/data for post-processing.
│   ├── myout.txt                      # Example output file.
│   ├── post_processor.py              # Python script for post-processing Urdu to Hindi text.
│   ├── pp_ur2hi.sh                    # Shell script to run the post-processor.
│   └── printinput.pl                  # Perl script to print input.
├── space-insert/
│   ├── README.md                      # Specific README for space insertion.
│   └── spaceinsert.py                 # Python script to intelligently insert spaces in text.
└── uniq-and-freq-count/
    ├── README.md                      # Specific README for unique and frequency count.
    ├── uniq.py                        # Python script to extract unique lines/words.
    └── vocab.py                       # Python script to generate vocabulary and frequency counts.

Utilities Overview

1. blue-score-calculation/

  • Purpose: Contains tools for calculating the BLEU (Bilingual Evaluation Understudy) score, a widely used metric to quantitatively evaluate the quality of machine-translated text against human references.
  • Domain: Machine Translation Evaluation.
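
For orientation, here is a minimal sketch of how a BLEU score can be computed. It is not the code shipped in BLEU-calculation.zip, whose contents are not described here. The function names `ngram_counts` and `sentence_bleu` are hypothetical, and the sketch assumes whitespace tokenization, a single reference, and no smoothing.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

Because the score is zero whenever any n-gram order has no match, sentence-level BLEU on short segments is usually computed with smoothing or aggregated over a whole corpus.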

2. lookup_urd2hin/

  • Purpose: A set of scripts designed for lookup and searching functionalities, specifically tailored for Urdu to Hindi language pairs. These are likely used for cross-lingual lexical mapping or data exploration.
  • Domain: Cross-lingual NLP, Lexical Resources, Indic Languages.
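
As an illustration of cross-lingual lexical lookup, the sketch below maps source tokens to target tokens through a dictionary. It does not reproduce lookup.py. The tab-separated `source<TAB>target` format and the names `load_lexicon` and `lookup_tokens` are assumptions made for the example.

```python
def load_lexicon(lines):
    """Build a source -> target dict from 'source<TAB>target' lines (assumed format)."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        source, target = line.split("\t", 1)
        lexicon[source] = target
    return lexicon


def lookup_tokens(text, lexicon):
    """Replace each whitespace-separated token found in the lexicon; keep the rest."""
    return " ".join(lexicon.get(token, token) for token in text.split())
```

Tokens missing from the lexicon pass through unchanged, which makes gaps in the lexicon easy to spot in the output.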

3. nukta-marker/

  • Purpose: This utility focuses on nukta marking, which involves adding or correcting nuktas (diacritical dots) in text written in languages such as Urdu, Hindi, or Punjabi. These marks are crucial for distinguishing between phonetically similar characters and ensuring correct pronunciation and meaning.
  • Domain: Text Normalization, Computational Linguistics, Indic Script Processing.
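
One simple way to restore missing nuktas in Devanagari text is a word-list lookup, sketched below. This is an assumed approach for illustration, not the method used by nukta.py. The names `strip_nukta`, `build_nukta_map`, and `mark_nukta` are hypothetical. The sketch relies on the Devanagari nukta being the combining character U+093C.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nukta(word):
    """Return the word with every nukta removed (decompose first, then drop U+093C)."""
    return unicodedata.normalize("NFD", word).replace(NUKTA, "")


def build_nukta_map(nukta_words):
    """Map each nukta-less spelling to its correct spelling from a trusted word list."""
    return {strip_nukta(w): unicodedata.normalize("NFC", w) for w in nukta_words}


def mark_nukta(text, nukta_map):
    """Replace tokens whose nukta-less form is in the map; leave other tokens as-is."""
    return " ".join(
        nukta_map.get(unicodedata.normalize("NFD", token), token)
        for token in text.split()
    )
```

A plain word list cannot handle pairs where both spellings are valid words, so a real tool would need context, for example n-gram statistics, to choose between them.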

4. postprocessor_urd2hin/

  • Purpose: Scripts for post-processing machine translation outputs from Urdu to Hindi. Post-processing often involves correcting common errors, normalizing text, or applying specific linguistic rules to improve the fluency and accuracy of translated content.
  • Domain: Machine Translation Quality Improvement, Text Normalization, Indic Languages.
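
The sketch below shows the kind of rule a post-processor for Urdu-to-Hindi output might apply: converting Arabic-script punctuation left in the output to its Devanagari or ASCII counterpart and tidying whitespace. The rule table and the name `post_process` are assumptions for illustration. The actual rules in post_processor.py and list.txt are not described here.

```python
import re

# Hypothetical rule table: Arabic-script punctuation -> Hindi-side punctuation.
PUNCT_RULES = {
    "\u06d4": "\u0964",  # ARABIC FULL STOP -> DEVANAGARI DANDA
    "\u060c": ",",       # ARABIC COMMA -> comma
    "\u061f": "?",       # ARABIC QUESTION MARK -> question mark
}


def post_process(text, rules=None):
    """Apply character replacement rules, then normalize spacing."""
    rules = PUNCT_RULES if rules is None else rules
    for source, target in rules.items():
        text = text.replace(source, target)
    # Remove whitespace that ended up before sentence punctuation.
    text = re.sub(r"\s+([\u0964,?])", r"\1", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

Keeping the rules in a table rather than in code makes it straightforward to load them from a file instead.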

5. space-insert/

  • Purpose: A Python script designed to intelligently insert spaces into text. This is particularly useful for languages or text data where spaces might be missing or inconsistently used, which can impact downstream NLP tasks.
  • Domain: Text Preprocessing, Text Normalization.
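
A common way to restore missing spaces is dictionary-based segmentation with dynamic programming, sketched below. Whether spaceinsert.py works this way is not stated, so treat this as an assumed technique. The name `insert_spaces` and the `vocabulary` argument are hypothetical.

```python
def insert_spaces(text, vocabulary):
    """Split text into vocabulary words using the fewest pieces.

    Returns the spaced string, or None when no full segmentation exists.
    """
    n = len(text)
    max_len = max(map(len, vocabulary), default=0)
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return " ".join(best[n]) if best[n] is not None else None
```

Preferring the fewest pieces is a crude tie-breaker. Weighting words by corpus frequency usually gives better segmentations.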

6. uniq-and-freq-count/

  • Purpose: Contains scripts for basic yet essential text analysis tasks: extracting unique items (lines, words) and calculating their frequencies to build vocabularies.
  • Domain: Text Analysis, Data Preprocessing, Vocabulary Generation.
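
Both tasks can be expressed compactly with the standard library, as in the sketch below. It is not the code of uniq.py or vocab.py. The names `unique_lines` and `vocabulary` are hypothetical, and whitespace tokenization is assumed.

```python
from collections import Counter


def unique_lines(lines):
    """Return the lines with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


def vocabulary(lines):
    """Return (word, count) pairs sorted by descending count, then alphabetically."""
    counts = Counter(word for line in lines for word in line.split())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Sorting ties alphabetically keeps the output stable across runs, which helps when comparing vocabularies between corpus versions.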

How to Use

Each sub-directory generally contains its own README.md (or .txt file) with specific instructions on how to use the scripts within that particular utility. Please refer to those individual READMEs for detailed usage guides, dependencies, and examples.
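
To see at a glance which utilities ship their own instructions, a small helper such as the following could list the README files in each sub-directory of a local clone. The function name `find_readmes` is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_readmes(repo_root):
    """Map each top-level sub-directory name to the README file names it contains."""
    result = {}
    for sub in sorted(Path(repo_root).iterdir()):
        if sub.is_dir() and not sub.name.startswith("."):
            result[sub.name] = sorted(
                p.name
                for p in sub.iterdir()
                if p.is_file() and p.name.lower().startswith("readme")
            )
    return result
```

Matching on the `readme` prefix rather than an exact name picks up variants such as README.md.txt. A sub-directory mapped to an empty list has no README of its own.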

General Setup (for Python scripts)

  1. Clone the repository:
    git clone https://github.com/vilalali/vilalali-python-utility.git
    cd vilalali-python-utility
  2. Navigate to the desired utility directory:
    cd nukta-marker # Example
  3. Install any dependencies. They are typically listed in the sub-directory's README, and some scripts may need only the Python standard library. You might want to use a virtual environment:
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt # If a requirements.txt exists
  4. Run the script:
    python3 script_name.py # Example

Contributing

Contributions are welcome! If you have suggestions for improvements, bug fixes, or new utility scripts that fit the scope of this collection, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details, if one is included in the repository.

Author

Vilal Ali
LinkedIn
