Ideally, an NLP pipeline in Rust could look something like,
preprocessor = DefaultPreprocessor::new()
tokenizer = RegexpTokenizer::new(r"\b\w\w+\b")
stemmer = SnowballStemmer::new("en")
analyzer = NgramAnalyzer(range=(1, 1))
pipe = collection
    .map(preprocessor)
    .map(tokenizer)
    .map(|tokens| tokens.map(stemmer))
    .map(analyzer)
where `collection` is an iterator over documents.
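For comparison, here is a minimal sketch of the same pipeline shape that should compile on stable Rust with only the standard library. It is not the API proposed above: `preprocess`, `tokenize`, `stem` and `run_pipeline` are hypothetical stand-ins (lowercasing, a character-based split that roughly approximates `\b\w\w+\b`, and a toy suffix rule instead of Snowball). The analyzer step is left out on the assumption that a `(1, 1)` range would pass unigrams through unchanged.

```rust
/// Hypothetical preprocessing step: lowercases the document.
fn preprocess(doc: &str) -> String {
    doc.to_lowercase()
}

/// Hypothetical tokenizer: keeps runs of word characters that are at
/// least two characters long, roughly what `\b\w\w+\b` would match.
fn tokenize(doc: &str) -> Vec<String> {
    doc.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|token| token.chars().count() >= 2)
        .map(String::from)
        .collect()
}

/// Toy stemmer (NOT Snowball): strips one trailing 's' when at least
/// three bytes remain.
fn stem(token: &str) -> String {
    match token.strip_suffix('s') {
        Some(base) if base.len() >= 3 => base.to_owned(),
        _ => token.to_owned(),
    }
}

/// Chains the stages with plain functions and closures over an iterator
/// of borrowed documents.
fn run_pipeline<'a>(collection: impl Iterator<Item = &'a str>) -> Vec<Vec<String>> {
    collection
        .map(preprocess)
        .map(|doc| tokenize(&doc))
        .map(|tokens| tokens.iter().map(|t| stem(t)).collect::<Vec<String>>())
        .collect()
}
```

Every stage here returns owned `String`s, which keeps lifetimes out of the picture at the cost of one allocation per token per stage. That is the trade-off the proposed design is trying to avoid.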
There are several challenges with it though,
- It is better to avoid allocating a string for each token in every pre-processing step and instead use slices of the original document. Performance depends very strongly on this. The current implementation of e.g. `RegexpTokenizer` takes a reference to the document and returns an iterator of `&str` with the same lifetime as the input document, but the borrow checker doesn't appear to be happy when it is used in the pipeline. This may be related to using closures (cf. next point), though.
- Because structs are not callable, `collection.map(tokenizer)` doesn't work, nor does `collection.map(tokenizer.tokenize)` (i.e. using a method) for some reason. We can use `collection.map(|document| tokenizer.tokenize(&document))`, but then the lifetime is not properly handled between input and output (described in the previous point).
More investigation would be necessary, and both points are likely related.
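As a starting point for that investigation, the sketch below shows one way the two points might fit together. It assumes the documents are borrowed from storage that outlives the pipeline; `WordTokenizer`, `min_len` and `tokenize_all` are hypothetical names, and a character-based split stands in for the regex so the example needs no external crate. The tokens are tied to the lifetime of the document rather than to the tokenizer, and the method call is wrapped in a closure because, on stable Rust, the `Fn` traits cannot be implemented for user-defined types.

```rust
/// Hypothetical stand-in for `RegexpTokenizer` (std only, no regex crate).
struct WordTokenizer {
    /// Minimum token length in characters; 2 roughly mimics `\b\w\w+\b`.
    min_len: usize,
}

impl WordTokenizer {
    /// The returned tokens are slices of `doc`: they carry the lifetime
    /// `'a` of the document, not the lifetime of `&self`.
    fn tokenize<'a>(&self, doc: &'a str) -> impl Iterator<Item = &'a str> + 'a {
        // Copy the setting so the closure below does not borrow `self`.
        let min_len = self.min_len;
        doc.split(|c: char| !(c.is_alphanumeric() || c == '_'))
            .filter(move |token| token.chars().count() >= min_len)
    }
}

/// Compiles: each `doc` is a `&'a String` borrowed from `docs`, so the
/// token slices can outlive the closure call that produced them.
fn tokenize_all<'a>(tokenizer: &WordTokenizer, docs: &'a [String]) -> Vec<Vec<&'a str>> {
    docs.iter()
        .map(|doc| tokenizer.tokenize(doc).collect::<Vec<_>>())
        .collect()
}

// By contrast, the closure form fails to compile when the iterator yields
// owned `String`s, because each `document` is dropped at the end of the
// closure while the returned tokens still borrow from it:
//
//     owned_docs.into_iter().map(|document| tokenizer.tokenize(&document))
```

If this reading is right, the borrow checker complaint comes less from closures as such than from who owns each document: a stage that produces a new `String` (such as a preprocessor) has to store it somewhere that outlives the tokens sliced from it.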