czcorpus/kontext

Introduction

KonText is an advanced corpus query interface and corpus data integration platform built around the Manatee-open corpus search engine. It is written in Python 3 and TypeScript and runs on any major Linux distribution. Development is maintained by the Institute of the Czech National Corpus.

Features

  • fully editable query chain
    • any operation from a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed, and the whole sequence is then re-executed.
  • simple and advanced query types
    • advanced CQL editor with syntax highlighting and attribute recognition
    • interactive PoS tag composing tool for positional and key-value tagsets
    • customizable query suggestions and simple type query refinement (e.g. for homonym disambiguation)
  • support for spoken corpora
    • defined text segments can be played back as audio
    • KWIC detail with easily distinguishable speeches
  • rich concordance view options and tools
    • any positional attribute can be set as primary
    • multiple ways to display other attributes
    • user-defined line groups - filtering, reviewing group ratios
    • tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)
  • rich subcorpus-related functionality
    • a subcorpus can be either private or published
    • text types metadata can be gradually refined to a specific subcorpus ("which publishers are there in case only fiction is selected?")
    • a custom text types ratio can be defined ("give me 20% fiction and 80% journalism")
  • frequency distribution
    • univariate
      • positional attributes (including tuples of multiple attributes per token)
      • structural attributes
    • multivariate distribution (2 dimensions) for both positional and structural attributes
  • collocation analysis
  • persistent URLs - any result page can be easily shared even if the original query is megabytes long
  • access to previous queries, named queries
  • convenient corpus access
    • finding a corpus by keyword (tag), size, or description
    • adding corpora to favorites (incl. subcorpora, aligned corpora)
  • saving results to Excel, CSV, XML, TXT
  • integrability with existing information systems
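
The "fully editable query chain" feature can be illustrated with a small sketch. This is not KonText's implementation or API: the `Operation` class, `run_chain`, `replace_operation`, and the toy corpus are all hypothetical names chosen for illustration. The sketch only shows the general idea that changing one step of a stored sequence means re-running the whole sequence from the source data.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Lines = List[str]


@dataclass(frozen=True)
class Operation:
    """One named step of a query chain (hypothetical, for illustration only)."""
    name: str
    apply: Callable[[Lines], Lines]


def run_chain(source: Sequence[str], chain: Sequence[Operation]) -> Lines:
    """Execute every operation in order, starting from the full source."""
    lines = list(source)
    for op in chain:
        lines = op.apply(lines)
    return lines


def replace_operation(chain: Sequence[Operation], index: int, new_op: Operation) -> List[Operation]:
    """Return a new chain with one step swapped; the caller then re-runs it."""
    edited = list(chain)
    edited[index] = new_op
    return edited


# Toy data standing in for concordance lines (assumption, not real corpus output).
corpus = ["the cat sat", "a dog barked", "the cat ran", "the bird sang", "a cat slept"]

chain = [
    Operation("query", lambda ls: [l for l in ls if "cat" in l]),
    Operation("filter", lambda ls: [l for l in ls if l.startswith("the")]),
    Operation("sort", lambda ls: sorted(ls)),
]
first_result = run_chain(corpus, chain)

# Change only the first step; the remaining steps are re-executed unchanged.
new_query = Operation("query", lambda ls: [l for l in ls if "bird" in l or "dog" in l])
edited_chain = replace_operation(chain, 0, new_query)
second_result = run_chain(corpus, edited_chain)
```

Keeping the chain as data, rather than as a one-off result, is what makes any earlier step editable after the fact.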

Internal features

  • modern client-side application (written in TypeScript, event stream architecture, React components, extensible)
  • server-side written as a WSGI application with fully decoupled background concordance/frequency/collocation calculation (using an integrated worker server)
  • modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication method, corpus listing widgets, HTTP session management)
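
As a rough illustration of dynamically loadable plug-ins, the sketch below imports one module per plug-in slot based on a configuration mapping. The slot names, the `load_plugins` helper, and the configuration format are assumptions made for this example; KonText's actual plug-in configuration and interfaces are described in its own documentation. Standard-library modules stand in for real plug-in implementations.

```python
import importlib
from types import ModuleType
from typing import Dict


def load_plugins(config: Dict[str, str]) -> Dict[str, ModuleType]:
    """Import one module per plug-in slot, as named in a configuration mapping."""
    plugins: Dict[str, ModuleType] = {}
    for slot, module_name in config.items():
        # importlib.import_module resolves the module by its dotted name at run time.
        plugins[slot] = importlib.import_module(module_name)
    return plugins


# Hypothetical configuration: stdlib modules are used as stand-ins for plug-ins.
example_config = {"db": "sqlite3", "serializer": "json"}
plugins = load_plugins(example_config)

# The application only talks to the slot, not to a hard-coded module.
encoded = plugins["serializer"].dumps({"corpus": "demo"})
```

Swapping an implementation then only requires changing the module name in the configuration, not the calling code.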

Requirements

  • Python 3.6 (or newer):
    • WSGI-compatible server - Gunicorn (recommended), uWSGI (supported)
    • Werkzeug web application library
    • Jinja2 template engine
    • lxml library
    • PyICU library (optional but preferred)
    • markdown library (optional, for formatted corpora references)
    • openpyxl library (optional, for XLSX export)
    • Babel library
  • Manatee corpus search engine - version 2.167.8 and onwards
  • a key-value storage
    • Redis (recommended), SQLite (supported), custom implementations possible
  • a task queue - RQ (recommended), Celery task queue (supported)
  • HTTP proxy server
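
The key-value storage requirement mentions that custom implementations are possible. The following is a minimal sketch of what a SQLite-backed key-value store could look like, using only the Python standard library. The class name, method names, table schema, and example keys are hypothetical and do not reflect KonText's actual plug-in interface.

```python
import sqlite3
from typing import Optional


class SqliteKeyValueStore:
    """Hypothetical minimal key-value store backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def set(self, key: str, value: str) -> None:
        # INSERT OR REPLACE overwrites any existing value stored under the key.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
            )

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


# An in-memory database keeps the example self-contained.
store = SqliteKeyValueStore()
store.set("query:abc123", '[word="cat"]')
store.set("query:abc123", '[lemma="cat"]')  # second write replaces the first
latest = store.get("query:abc123")
```

A Redis-backed variant would expose the same `get`/`set` surface, which is why the backend can be treated as interchangeable in a sketch like this.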

Build and installation

KonText provides a script for automatic installation on an existing Ubuntu system. The easiest way to install KonText is to create an LXC/LXD container, clone the repository into it, and run the script. On a decently fast network, the whole process takes only a couple of seconds. Please refer to the doc/INSTALL.md file for details.

Customization and contribution

Please refer to our Wiki.

How to cite KonText

Tomáš Machálek (2020) - KonText: Advanced and Flexible Corpus Query Interface

@inproceedings{machalek-2020-kontext,
    title = "{K}on{T}ext: Advanced and Flexible Corpus Query Interface",
    author = "Mach{\'a}lek, Tom{\'a}{\v{s}}",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.865",
    pages = "7003--7008",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
