@thongnt99 thongnt99 commented Mar 23, 2023

Related to #1890
Ongoing work: using Lucene's FeatureField to index terms and their weights directly.

Indexing works and yields the same metrics as the term-repeating method, but three existing tests (written for the repeating method) are currently failing. Please let me know how to fix those tests, or whether I should create new ones.

Indexing:

./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonTermWeightCollection \
-input collections/msmarco-passage/lsr_collection_jsonl \
-index indexes/msmarco-passage/lsr-index-msmarco \
-generator TermWeightDocumentGenerator \
-threads 60 -impact -pretokenized

Retrieval:

./anserini-lsr/target/appassembler/bin/SearchCollection \
-index path_to_index \
-topics path_to_topic \
-topicreader TsvString \
-output path_to_output_file \
-impact -pretokenized -hits 1000 -parallelism 60 

lintool commented Mar 24, 2023

Hi @thongnt99 very interesting and thanks for the PR!

Can you provide a sense of the performance improvement?

thongnt99 commented Mar 25, 2023

Hi @lintool ,

These are some comparison points I collected from our recent reproduction attempt with LSR methods.
The degree of speedup depends on the magnitude of the term weights, but indexing is at least twice as fast as with the term-repeating method. For example, we saw a huge improvement when indexing EPIC: since EPIC does not use sparse regularizers during training, it generally produces larger weights, so the term-repeating method has to repeat each term more often.

| LSR method | Old (term repeating) | New (FeatureField) |
|---|---|---|
| QMLP_DMLM | 0:10:25 | 0:04:09 |
| EPIC (top_k=400) | 1:23:53 | 0:04:02 |
| Splade (0.01, 0.08) | 0:17:41 | 0:03:52 |
| uniCOIL | 0:05:11 | 0:02:18 |
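
To illustrate why the speedup tracks the magnitude of the weights, here is a minimal Java sketch (not Anserini's or Lucene's code) that compares how much each approach has to write for a single document. It assumes the term-repeating method emits each term once per unit of its quantized integer weight, while direct indexing stores one (term, weight) entry per distinct term. The class `IndexingCostSketch` and its methods are hypothetical names introduced only for this illustration.

```java
import java.util.Map;

/** Hypothetical illustration only; not part of Anserini or Lucene. */
final class IndexingCostSketch {

    /** Term repeating: every term is emitted `weight` times, so the cost is the sum of all weights. */
    static long repeatedTokenCount(Map<String, Integer> quantizedWeights) {
        long total = 0;
        for (int weight : quantizedWeights.values()) {
            total += weight;
        }
        return total;
    }

    /** Direct indexing: one (term, weight) entry per distinct term, independent of the weight. */
    static long directEntryCount(Map<String, Integer> quantizedWeights) {
        return quantizedWeights.size();
    }

    /** Builds the space-separated pseudo-document that the repeating method would have to tokenize. */
    static String repeatTerms(Map<String, Integer> quantizedWeights) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> entry : quantizedWeights.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(entry.getKey());
            }
        }
        return sb.toString();
    }
}
```

For a document with weights {neural: 3, retrieval: 2}, the repeating method tokenizes five tokens while direct indexing writes two entries. Under this assumption, larger weights (as with EPIC) inflate only the first number.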

MXueguang commented Mar 25, 2023

@thongnt99 this is cool!

@lintool lintool left a comment

Initial comments.

lintool commented Mar 25, 2023

Instead of TermWeightDocument... why not just call it VectorDocument? Vector as Map<String,Float> seems pretty intuitive?

thongnt99 commented Mar 25, 2023

Yes, I also think that TermWeightDocument isn't an ideal name. Perhaps SparseVectorDocument is more suitable than VectorDocument? The former makes it clear that we store indices/terms together with their values (similar to the SparseMatrix vs. DenseMatrix formats).
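
As a rough illustration of the "vector as Map<String,Float>" idea, here is a minimal Java sketch of a sparse vector that stores only non-zero (term, weight) pairs, together with a dot product between a query vector and a document vector. The class `SparseVectorSketch` and its `dot` method are hypothetical; the actual SparseVectorDocument in this PR may look different, and scoring in Lucene does not go through this code.

```java
import java.util.Map;

/** Hypothetical sketch; not the SparseVectorDocument class from this PR. */
final class SparseVectorSketch {

    /**
     * Dot product of two sparse vectors stored as term-to-weight maps.
     * Only terms present in both maps contribute, so we iterate over the smaller map.
     */
    static float dot(Map<String, Float> a, Map<String, Float> b) {
        Map<String, Float> small = a.size() <= b.size() ? a : b;
        Map<String, Float> large = (small == a) ? b : a;
        float score = 0f;
        for (Map.Entry<String, Float> entry : small.entrySet()) {
            Float other = large.get(entry.getKey());
            if (other != null) {
                score += entry.getValue() * other;
            }
        }
        return score;
    }
}
```

For example, a document {neural: 2.5, retrieval: 1.0, index: 0.5} and a query {neural: 2.0, sparse: 1.0} share only the term "neural", so the score is 2.5 * 2.0 = 5.0.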

lintool commented Mar 25, 2023

I like SparseVectorDocument!

thongnt99 commented

@lintool I changed the class names and fixed the issues raised in your previous comments. The updated indexing command:

./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonSparseVectorCollection \
-input collections/msmarco-passage/lsr_collection_jsonl \
-index indexes/msmarco-passage/lsr-index-msmarco \
-generator SparseVectorDocumentGenerator \
-threads 60 -impact -pretokenized

@lintool lintool left a comment

How about some tests?

thongnt99 commented

@lintool I am gonna add the tests after ECIR.
