
NLP pipeline design #21

@rth


Ideally, an NLP pipeline in Rust could look something like this:

preprocessor = DefaultPreprocessor::new()
tokenizer = RegexpTokenizer::new(r"\b\w\w+\b")
stemmer = SnowballStemmer::new("en")
analyzer = NgramAnalyzer(range=(1, 1))

pipe = collection
          .map(preprocessor)
          .map(tokenizer)
          .map(|tokens| tokens.map(stemmer))
          .map(analyzer)

where collection is an iterator over documents.

There are several challenges with it, though:

  • It is better to avoid allocating a string for each token in every pre-processing step and to use slices of the original document instead. Performance depends very strongly on this. For example, the current implementation of RegexpTokenizer takes a reference to the document and returns an iterable of &str with the same lifetime as the input document, but the borrow checker does not accept it when it is used in the pipeline. This may be related to using closures, though (cf. next point).
  • Because structs are not callable, collection.map(tokenizer) doesn't work,
    nor does collection.map(tokenizer.tokenize) (i.e. using a method) for some reason. We can use collection.map(|document| tokenizer.tokenize(&document)), but then the lifetime is not properly propagated from input to output (as described in the previous point).

More investigation would be necessary, and both points are likely related.
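
As a starting point for that investigation, here is a minimal sketch of one way the two points might fit together. It uses only the standard library. SimpleTokenizer, stem, and run_pipeline are hypothetical stand-ins invented for illustration, not the crate's actual RegexpTokenizer or SnowballStemmer. It assumes the collection yields borrowed documents (&String) rather than owned ones, so the tokens can borrow from data that outlives the closure.

```rust
/// Hypothetical stand-in for the issue's `RegexpTokenizer`: splits on
/// non-alphanumeric characters and keeps tokens of at least `min_len` chars.
struct SimpleTokenizer {
    min_len: usize,
}

impl SimpleTokenizer {
    /// Tokens are slices of `doc`. The returned iterator is tied to the
    /// document lifetime `'a` only, not to the borrow of `self`.
    fn tokenize<'a>(&self, doc: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        // Copy the setting so the iterator does not need to borrow `self`.
        let min_len = self.min_len;
        Box::new(
            doc.split(|c: char| !c.is_alphanumeric())
                .filter(move |tok| tok.chars().count() >= min_len),
        )
    }
}

/// Hypothetical stemmer stand-in: strips one trailing "s" without allocating.
fn stem(token: &str) -> &str {
    token.strip_suffix('s').unwrap_or(token)
}

/// Runs tokenizer and stemmer over borrowed documents. Every token in the
/// result is a slice into one of the strings in `docs`.
fn run_pipeline<'a>(docs: &'a [String], tokenizer: &SimpleTokenizer) -> Vec<Vec<&'a str>> {
    docs.iter() // yields &'a String, so tokens may live as long as `docs`
        .map(|doc| tokenizer.tokenize(doc).map(stem).collect())
        .collect()
}

fn main() {
    let docs = vec!["The cats sat".to_string(), "on two mats".to_string()];
    let tokenizer = SimpleTokenizer { min_len: 2 };
    let tokens = run_pipeline(&docs, &tokenizer);
    println!("{:?}", tokens);
}
```

In this sketch the closure receives a reference whose lifetime is tied to docs, so nothing borrowed is dropped when the closure returns. If the collection yielded owned String values instead, |document| tokenizer.tokenize(&document) would return slices into a value that is dropped at the end of the closure body, which the borrow checker rejects. A plain function such as stem can be passed to map directly, whereas a struct still needs a closure around its method call.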
