Wangperawong, 2022 - Google Patents
Multilingual search with subword tf-idf
Wangperawong, 2022
 - Document ID
 - 8950376060543816441
 - Author
 - Wangperawong A
 - Publication year
 - 2022
 - Publication venue
 - arXiv preprint arXiv:2209.14281
 
Snippet
Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics …
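
The idea in the snippet can be illustrated with a minimal sketch. It is not the paper's implementation: the snippet does not say which subword tokenizer is used, so overlapping character trigrams stand in for a learned subword vocabulary here, and the function names (`subword_tokens`, `build_index`, `search`), the smoothed IDF formula, and the sample documents are all assumptions made for illustration. No stop-word list or stemming rules are involved.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Split text into overlapping character n-grams.

    Assumption: n-grams stand in for a learned subword vocabulary.
    """
    cleaned = "".join(text.lower().split())  # lowercase and drop whitespace
    if len(cleaned) <= n:
        return [cleaned] if cleaned else []
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def build_index(docs, n=3):
    """Return one TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [Counter(subword_tokens(d, n)) for d in docs]
    doc_freq = Counter()
    for counts in tokenized:
        doc_freq.update(counts.keys())  # count each token once per document
    num_docs = len(docs)
    # Smoothed IDF: tokens found in every document keep a small positive weight.
    idf = {tok: math.log((1 + num_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    vectors = [{tok: tf * idf[tok] for tok, tf in counts.items()}
               for counts in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, n=3):
    """Rank document indices by cosine similarity to the query."""
    vectors, idf = build_index(docs, n)
    q_counts = Counter(subword_tokens(query, n))
    # Query tokens that never occur in the corpus are ignored.
    q_vec = {tok: tf * idf[tok] for tok, tf in q_counts.items() if tok in idf}
    scores = [cosine(q_vec, v) for v in vectors]
    ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranking, scores


# Hypothetical sample corpus mixing English and Spanish.
docs = [
    "multilingual search engines",
    "stemming rules for Arabic",
    "búsqueda multilingüe de documentos",
]
ranking, scores = search("multilingual", docs)
for i in ranking:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

Because the English query and the Spanish document share subword pieces such as `mul`, `lti` and `lin`, the Spanish document scores above the unrelated English one even though no language-specific rules were written. A learned tokenizer would replace `subword_tokens` without changing the rest of the sketch.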
 
Classifications
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
 - G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
 - G06F17/30634—Querying
 - G06F17/30657—Query processing
 - G06F17/3066—Query translation
 - G06F17/30669—Translation of the query language, e.g. Chinese to English
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/27—Automatic analysis, e.g. parsing
 - G06F17/2765—Recognition
 - G06F17/277—Lexical analysis, e.g. tokenisation, collocates
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/21—Text processing
 - G06F17/22—Manipulating or registering by use of codes, e.g. in sequence of text characters
 - G06F17/2217—Character encodings
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/28—Processing or translating of natural language
 - G06F17/2809—Data driven translation
 - G06F17/2827—Example based machine translation; Alignment
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/21—Text processing
 - G06F17/211—Formatting, i.e. changing of presentation of document
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/28—Processing or translating of natural language
 - G06F17/2863—Processing of non-latin text
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/27—Automatic analysis, e.g. parsing
 - G06F17/2795—Thesaurus; Synonyms
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/27—Automatic analysis, e.g. parsing
 - G06F17/2705—Parsing
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
 - G06F17/30861—Retrieval from the Internet, e.g. browsers
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/21—Text processing
 - G06F17/24—Editing, e.g. insert/delete
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/20—Handling natural language data
 - G06F17/27—Automatic analysis, e.g. parsing
 - G06F17/273—Orthographic correction, e.g. spelling checkers, vowelisation
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
 - G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
 - G06F17/30613—Indexing
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
 - G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
 - G06F17/30716—Browsing or visualization
 
- G—PHYSICS
 - G06—COMPUTING; CALCULATING; COUNTING
 - G06F—ELECTRICAL DIGITAL DATA PROCESSING
 - G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
 - G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
 - G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
 
 
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Richman et al. | Mining wiki resources for multilingual named entity recognition | |
| US8706474B2 (en) | Translation of entity names based on source document publication date, and frequency and co-occurrence of the entity names | |
| Wibawa et al. | Indonesian named-entity recognition for 15 classes using ensemble supervised learning | |
| Abuata et al. | A rule-based stemmer for Arabic Gulf dialect | |
| Lambert | A multitude of “lishes”: The nomenclature of hybridity | |
| US20080065621A1 (en) | Ambiguous entity disambiguation method | |
| Kudrinski et al. | Sumerograms and Akkadograms in Hittite: Ideograms, logograms, allograms, or heterograms? | |
| Alnefaie et al. | Automatic minimal diacritization of Arabic texts | |
| Kachakeche et al. | Word order affects the frequency of adjective use across languages | |
| Wangperawong | Multilingual search with subword tf-idf | |
| Dabre et al. | MMCR4NLP: multilingual multiway corpora repository for natural language processing | |
| Saad et al. | Wikidocsaligner: An off-the-shelf Wikipedia documents alignment tool | |
| Rajitha et al. | Sinhala and English document alignment using statistical machine translation | |
| Mokhtaripour et al. | Introduction to a new Farsi stemmer | |
| Tsai et al. | Introduction to Entity Discovery and Linking | |
| Ikhsan et al. | Search and Comparison of Isim Ma’rifat with Remove Diacritic in the Qur’an and Hadith of Abu Daud | |
| Toves et al. | The Ukrainian Kyrylytsia, restored: An automation project for adding the Cyrillic fields to Ukrainian records in OCLC WorldCat | |
| Dash | Polysemy and homonymy: a conceptual labyrinth | |
| Zeldes | A characterwise windowed approach to Hebrew morphological segmentation | |
| Abumalloh et al. | Building Arabic corpus applied to part-of-speech tagging | |
| Lubis et al. | Methods of Foreign Language Translation to Arabic Language in Al-Jazeera Online Magazine | |
| Fall et al. | Searching trademark databases for verbal similarities | |
| Somers | Translation technologies and minority languages | |
| Bureros et al. | Building an English-Cebuano tourism parallel corpus and a named-entity list from the Web | |
| Komornicka | Dictionary of basic indexing terminology: Polish and Czech; Słownik podstawowej terminologii indeksacyjnej: polski i czeski; Slovník základní terminologie indexování: polský a český |