Add in a min_tokens flag in SFileFilter.filter_sfile()? #37

When dealing with sfiles, I sometimes just want to drop documents (lines) altogether if they have too few tokens in them. Up until now I've just used hacked-together solutions and been too lazy to actually figure out how, and whether, this should be integrated into rosetta. Here is a mock-up commit of what I mean:

https://github.com/ApproximateIdentity/rosetta/commit/6e2916de32ef2b8c7d4d2feda72d55a10bcd0927

All it does is add a min_tokens flag to filter_sfile(), which causes the filtering to skip writing any line with fewer than that many tokens. Would it make sense to add this here? Is there a better place for it?
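
To make the intended behavior concrete, here is a rough standalone sketch of the idea. It is not the code from the commit and does not touch rosetta at all. The helper names `count_tokens` and `filter_lines` are hypothetical, and I'm assuming each line looks like `label 'doc_id | token:count token:count ...` and that "number of tokens" means the number of distinct token entries after the `|`, not the sum of their counts.

```python
import io


def count_tokens(line):
    """Count the token entries after the first '|' separator.

    Assumption: each whitespace-separated entry after '|' is one token
    (e.g. 'word:3'). A line without '|' is treated as having zero tokens.
    """
    _, _, features = line.partition("|")
    return len(features.split())


def filter_lines(infile, outfile, min_tokens=0):
    """Copy lines from infile to outfile, skipping short documents.

    Lines with fewer than min_tokens tokens are not written.
    Returns the number of lines written.
    """
    written = 0
    for line in infile:
        if count_tokens(line) < min_tokens:
            continue  # too few tokens: drop this document
        outfile.write(line)
        written += 1
    return written


if __name__ == "__main__":
    # Hypothetical two-document input; the second has only one token.
    src = io.StringIO("1 'doc1 | a:1 b:2 c:1\n1 'doc2 | a:1\n")
    dst = io.StringIO()
    filter_lines(src, dst, min_tokens=2)
    print(dst.getvalue(), end="")
```

With `min_tokens=0` (the default here) nothing is dropped, so existing callers would see no change in behavior.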

And maybe most importantly, does this functionality already exist somewhere else and I've just always missed it?

