Releases: SeekStorm/SeekStorm
SeekStorm v0.14.0
Improved
- Maximum cardinality of distinct string facet values increased from 65_535 (16 bit) to 4_294_967_295 (32 bit).
- FieldType::String32 and FieldType::StringSet32 added, which allow a cardinality of 4_294_967_295 (32 bit) distinct string facet values, while FieldType::String and FieldType::StringSet were renamed to FieldType::String16 and FieldType::StringSet16, which allow only a cardinality of 65_535 (16 bit) distinct string facet values but are space-saving.
- QueryFacet::String32 and QueryFacet::StringSet32 added, which allow a cardinality of 4_294_967_295 (32 bit) distinct string facet values, while QueryFacet::String and QueryFacet::StringSet were renamed to QueryFacet::String16 and QueryFacet::StringSet16, which allow only a cardinality of 65_535 (16 bit) distinct string facet values but are space-saving.
- FacetFilter::String32 and FacetFilter::StringSet32 added, FacetFilter::String and FacetFilter::StringSet renamed to FacetFilter::String16 and FacetFilter::StringSet16.
- FilterSparse::String32 and FilterSparse::StringSet32 added, FilterSparse::String and FilterSparse::StringSet renamed to FilterSparse::String16 and FilterSparse::StringSet16.
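The trade-off between the 16-bit and 32-bit variants comes down to how many distinct facet values a field is expected to hold. The following is a minimal sketch of that decision only; `FacetWidth` and `facet_width_for` are hypothetical names for illustration and are not part of the SeekStorm API.

```rust
/// Hypothetical stand-in for the choice between the *16 and *32 variants.
#[derive(Debug, PartialEq)]
enum FacetWidth {
    /// Up to 65_535 distinct values, space-saving.
    Bits16,
    /// Up to 4_294_967_295 distinct values.
    Bits32,
}

/// Picks the narrowest width that can hold the expected number of
/// distinct string facet values, or None if even 32 bit is too small.
fn facet_width_for(distinct_values: u64) -> Option<FacetWidth> {
    if distinct_values <= u16::MAX as u64 {
        Some(FacetWidth::Bits16)
    } else if distinct_values <= u32::MAX as u64 {
        Some(FacetWidth::Bits32)
    } else {
        None
    }
}
```

A field with 65_536 expected distinct values would already need the 32-bit variant under this rule.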
Changed
- Index format changed (INDEX_FORMAT_VERSION_MAJOR changed).
SeekStorm v0.13.3
- hash32 fixed for platforms without aes or sse2.
SeekStorm v0.13.2
- rustdocflags added in config.toml and cargo.toml
SeekStorm v0.13.1
- Faster and complete top-k results for union queries with more than 8 terms by using MAXSCORE.
- Required target_features for using gxhash fixed.
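For readers unfamiliar with MAXSCORE, the core idea can be sketched as follows: terms are sorted by their maximum possible score contribution, and the lowest-scoring terms whose combined upper bound stays below the current top-k threshold are treated as "non-essential", since a document matching only those terms cannot enter the top-k. This is a generic illustration of the technique, not SeekStorm's implementation; the function name and signature are assumptions.

```rust
/// Returns how many of the lowest-scoring terms are non-essential:
/// the sum of their maximum scores is strictly below `threshold`,
/// so documents that match only these terms can be skipped.
/// (Hypothetical helper for illustration only.)
fn non_essential_count(max_scores: &[f32], threshold: f32) -> usize {
    // Sort a copy of the per-term score upper bounds in ascending order.
    let mut sorted = max_scores.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let mut upper_bound = 0.0f32;
    let mut count = 0;
    for score in sorted {
        // Stop as soon as adding this term could reach the threshold.
        if upper_bound + score >= threshold {
            break;
        }
        upper_bound += score;
        count += 1;
    }
    count
}
```

As the threshold rises during query processing, more terms become non-essential, which is why the benefit grows with the number of query terms.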
SeekStorm v0.13.0
Added
- N-gram indexing: N-grams are indexed in addition to single terms, for faster phrase search, at the cost of higher index size.
- N-grams here are not parts of terms, but combinations of consecutive terms. See NGRAM_SEARCH.md.
- N-gram indexing improves phrase query latency on average by factor 2.14 (114%), maximum tail latency by factor 7.51 (651%), and some phrase queries by up to 3 orders of magnitude.
- A combination of different types of N-gram indexing can be enabled: see NgramSet
- SingleTerm
- NgramFF : frequent frequent
- NgramFR : frequent rare
- NgramRF : rare frequent
- NgramFFF : frequent frequent frequent
- NgramRFF : rare frequent frequent
- NgramFFR : frequent frequent rare
- NgramFRF : frequent rare frequent
- Previously N-gram indexing was not configurable, but always set to the equivalent of NgramFF.
- IndexMetaObject.ngram_indexing property added, used in create_index library method.
- CreateIndexRequest ngram_indexing property added, used in create_index REST API endpoint.
- Ngram indexing only affects phrase search.
- BM25 scores (SimilarityType::Bm25f) are almost identical for both ngram and single term indexing. There are only small differences for phrase search resulting from normalization (32bit->8bit->32bit lossy logarithmic compression/decompression) that is used for posting_count_ngram1/2/3, but not for single term posting_counts.
- Default ngram_indexing: NgramSet::NgramFF as u8 | NgramSet::NgramFFF as u8
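The default value above combines N-gram types with a bitwise OR, which implies that each type occupies its own bit in a u8. A minimal sketch of that pattern follows. The enum here is a local stand-in: the variant names come from the list above, but the discriminant values and the helper functions are assumptions for illustration, not the library's actual definitions.

```rust
/// Stand-in for NgramSet with ASSUMED bit values (one bit per type).
#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(u8)]
enum NgramSet {
    SingleTerm = 0b0000_0000,
    NgramFF = 0b0000_0001,
    NgramFR = 0b0000_0010,
    NgramRF = 0b0000_0100,
    NgramFFF = 0b0000_1000,
    NgramRFF = 0b0001_0000,
    NgramFFR = 0b0010_0000,
    NgramFRF = 0b0100_0000,
}

/// Mirrors the default stated in the release notes.
fn default_ngram_indexing() -> u8 {
    NgramSet::NgramFF as u8 | NgramSet::NgramFFF as u8
}

/// Hypothetical helper: true if the given N-gram type's bit is set.
/// SingleTerm has no bit of its own in this sketch, so it never matches.
fn is_enabled(ngram_indexing: u8, kind: NgramSet) -> bool {
    (ngram_indexing & (kind as u8)) != 0
}
```

With this layout, enabling an additional type is a matter of OR-ing in one more variant, for example `default_ngram_indexing() | NgramSet::NgramFR as u8`.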
Improved
- MAX_QUERY_TERM_NUMBER increased from 10 to 100.
- 2-term union count latency improved.
- DOCUMENT_LENGTH_COMPRESSION array now pre-calculated algorithmically with byte4_to_int instead of pre-defined values.
- Faster document length compression with int_to_byte4 instead of norm_frequency (binary search in DOCUMENT_LENGTH_COMPRESSION table).
- int_to_byte4 is now also used for compression of n-gram frequent_term positions_count (previously only for doc/field length compression).
- Limit of 256 for the maximum number of frequent words (FrequentwordType::Custom) removed (because frequentword_index is not stored anymore).
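The 32bit->8bit->32bit lossy logarithmic compression mentioned in these notes can be illustrated with a small floating-point-like encoding: small values are stored exactly, larger values keep only their 4 most significant bits plus a shift. The sketch below shows the general technique (similar in spirit to Lucene's SmallFloat); it is not the source of SeekStorm's int_to_byte4 and byte4_to_int, and the function names here are hypothetical.

```rust
/// Lossy compression of a u32 into a u8:
/// 5 bits of exponent (shift + 1), 3 bits of mantissa with an implicit
/// leading 1. Values 0..=15 are stored exactly.
fn compress_u32_to_u8(value: u32) -> u8 {
    let num_bits = 32 - value.leading_zeros();
    if num_bits < 4 {
        // Values 0..=7 fit directly; exponent field stays 0.
        value as u8
    } else {
        let shift = num_bits - 4;
        // Keep the 4 most significant bits, then drop the implicit top bit.
        let mantissa = (value >> shift) & 0x07;
        // Exponent 0 is reserved for the direct range, hence shift + 1.
        (((shift + 1) << 3) | mantissa) as u8
    }
}

/// Inverse of compress_u32_to_u8; returns the lower bound of the
/// bucket the original value fell into. Intended for bytes produced
/// by compress_u32_to_u8.
fn decompress_u8_to_u32(code: u8) -> u32 {
    let mantissa = (code & 0x07) as u32;
    let exponent = (code >> 3) as u32;
    if exponent == 0 {
        mantissa
    } else {
        (mantissa | 0x08) << (exponent - 1)
    }
}
```

Because decoding depends only on the byte, a 256-entry lookup table can be filled by calling the decoder once per byte value, which matches the idea of pre-calculating the table algorithmically rather than listing predefined values.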
Changed
- Index format changed (INDEX_FORMAT_VERSION_MAJOR changed).
- Instead of u8 index to frequentword_posting_counts we now store the u8 compressed posting_count both for frequent and rare Ngram terms.
- AHash replaced with GxHash, which is faster and provides stable hashes across different dependency versions, platforms and hardware. This improves index persistence and portability.
- NgramType encoded into hash.
- Ngrams with 3 terms allowed.
- In compress_postinglist, posting_count_ngram1/2 are taken from decode_posting_list_counts instead of from precalculated frequentword_posting_counts.
- update_frequentword_posting_counts removed.
- precondition for ngrams with rare terms.
Fixed
- Error in manual commit during intermittent indexing fixed: "Unable to index_file.set_len in commit".
- Realtime search BM25 scoring fixed: posting_counts are now based on the sum of committed and uncommitted documents (previously only uncommitted).
- Realtime search BM25 scoring fixed: now both terms of the ngram are taken into account.
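To see why the posting count basis matters for scoring, consider the inverse document frequency component of BM25. The sketch below uses the commonly cited Lucene-style IDF formula as an assumption; the source does not state which exact variant SeekStorm uses, and both function names are hypothetical.

```rust
/// Posting count used for realtime scoring after the fix:
/// committed plus uncommitted documents containing the term.
fn realtime_posting_count(committed: u64, uncommitted: u64) -> u64 {
    committed + uncommitted
}

/// Common BM25 IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)),
/// where N is the number of indexed documents and n the posting count.
/// Assumes posting_count <= indexed_doc_count.
fn bm25_idf(indexed_doc_count: u64, posting_count: u64) -> f64 {
    let n_docs = indexed_doc_count as f64;
    let n_postings = posting_count as f64;
    (1.0 + (n_docs - n_postings + 0.5) / (n_postings + 0.5)).ln()
}
```

If only the uncommitted postings were counted, a term would appear rarer than it really is and its IDF would be inflated; summing both counts gives a lower, more accurate weight.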
SeekStorm v0.12.27
- Very rare position compression bug fixed.
SeekStorm v0.12.26
- Put winapi crate behind conditional compilation #[cfg(target_os = "windows")]
SeekStorm v0.12.25
- Faster index_document, commit, clear_index: increased SEGMENT_KEY_CAPACITY prevents HashMap resizing during indexing; vector/HashMap reuse instead of reinitialization.
- Intersection between RLE-RLE and RLE-Bitmap compressed posting lists fixed.
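The two optimizations named above, pre-sizing a HashMap and reusing it instead of reinitializing, can be sketched with the standard library alone. The struct, its methods, and the capacity value below are hypothetical and only illustrate the pattern; SEGMENT_KEY_CAPACITY is reused as a name, but its real value is not given in the source.

```rust
use std::collections::HashMap;

/// ASSUMED value for illustration; the real constant is not stated here.
const SEGMENT_KEY_CAPACITY: usize = 1_000;

/// Hypothetical per-segment buffer mapping term hashes to doc ids.
struct SegmentBuffer {
    postings: HashMap<u64, Vec<u32>>,
}

impl SegmentBuffer {
    /// Allocates once with enough room that inserts up to the
    /// capacity do not trigger a resize.
    fn new() -> Self {
        Self {
            postings: HashMap::with_capacity(SEGMENT_KEY_CAPACITY),
        }
    }

    /// Adds a document id to the posting list of a term.
    fn add(&mut self, term_hash: u64, doc_id: u32) {
        self.postings.entry(term_hash).or_default().push(doc_id);
    }

    /// Empties the map but keeps its allocation for the next segment,
    /// instead of replacing it with a fresh HashMap.
    fn reset(&mut self) {
        self.postings.clear();
    }
}
```

`HashMap::clear` removes all entries while retaining the allocated memory, so the next batch of inserts starts from an already sized table.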
SeekStorm v0.12.24
- Fixes an 85% performance drop (Windows 11 24H2/Intel hybrid CPUs only) caused by a faulty Windows 11 24H2 update that changed the task scheduler behavior so that the P-cores of Intel hybrid CPUs are under-utilized in favor of the E-cores.
This is a workaround until Microsoft fixes the issue in a future update.
The fix solves the issue for the SeekStorm server; if you embed the SeekStorm library into your own code, you have to apply the fix as well.
See blog post for details: https://seekstorm.com/blog/80-percent-performance-drop/
SeekStorm v0.12.23
- Ingestion of files in CSV, SSV, TSV, PSV format with the ingest_csv() method and the seekstorm_server command line ingest: configurable header, delimiter char, quoting, number of skipped documents, number of indexed documents.
- stop_words parameter (predefined languages and custom) added to create_index IndexMetaObject: stop words are not indexed, for a compact index and faster queries.
- frequent_words parameter (predefined languages and custom) added to create_index IndexMetaObject: consecutive frequent words are indexed as n-gram combinations for short posting lists and fast phrase queries.
- TokenizerType Whitespace and WhitespaceLowercase added.
- truncate() and substring() utils.