Releases: aryn-ai/sycamore
v0.1.33
This Sycamore release contains a variety of bug fixes and improvements.
What's Changed
- Low-hanging mypy fruit (Sycamore edition). by @alexaryn in #1265
- Miscellaneous code-level improvements. by @alexaryn in #1266
- Improve OpenSearch reader by batching certain operations by @austin-aryn-ai in #1267
- Don't include original_elements in queries by @baitsguy in #1268
- Add a new OpenAI model, revert LlmFilterPrompt change by @austin-aryn-ai in #1271
- Add metadata loading to materialize by @HenryL27 in #1270
- Update httpcore and h11 for dependabot issue. by @bsowell in #1274
- Put os_client_args in default namespace to avoid surprise. by @alexaryn in #1275
- Add debugging for unexpected gemini model stop reasons. by @eric-anderson in #1276
- Summarize the group name for groupby by @bohou-aryn in #1261
- Update torch to 2.7.0 and transformers to 4.50.0. by @bsowell in #1279
- Adding gpt-4.1 and gpt-4.1-nano by @Soeb-aryn in #1278
- Add better view_pdf function by @HenryL27 in #1277
- Standardize materialize filenames a little by @HenryL27 in #1283
- Add support for non-clustering groupby by @bohou-aryn in #1281
- Add instructions to split entities in LLM extract entity. by @akarshgupta7 in #1286
- Add function to unroll entities by @bohou-aryn in #1288
- fix hybrid table model fallback to not edit tokens in-place by @HenryL27 in #1290
- Add heuristics to compute K for kmeans based on docset size. by @akarshgupta7 in #1287
- fix the non cluster groupby path error by @bohou-aryn in #1291
- Speed up import sycamore by ~10x; add module/tool for timing imports. by @eric-anderson in #1292
- Improve docset documentation to explain that it's lazy. by @eric-anderson in #1294
- handle non clustering aggregate count by @bohou-aryn in #1293
- Minor fixups from testing with managed-service. by @eric-anderson in #1296
- load tiktoken tokenizer lazily by @HenryL27 in #1298
- refactor planner to add query preprocessor functionality by @HenryL27 in #1297
- Add Gemini 2.5 Flash by @karanataryn in #1295
- Upgrade gemini by @bsowell in #1299
- Adding notebooks that walk through the earnings calls documents by @AbhijitP-009 in #1284
- Bump AIOHttp To Fix Package Install by @karanataryn in #1300
- OpenSearchReader result filtering by @austin-aryn-ai in #1282
- Add support for default llm kwargs for all models. by @bsowell in #1303
- Use llm for clustering by @bohou-aryn in #1302
- Upgrade tornado dependency. by @bsowell in #1305
- Add VLMTableStructureExtractor for table structure extraction. by @bsowell in #1304
- Fix collect when groupby has no entity names by @bohou-aryn in #1307
- Upgrade setuptools. by @bsowell in #1309
- Add type casting logic to extract entity. by @akarshgupta7 in #1310
- Remove aryn-sdk from the sycamore repo. by @bsowell in #1306
- Initial draft of X-Y Cut reading order for Sycamore (Hackathon) by @alexaryn in #1301
- improve model selection error message by @HenryL27 in #1311
- add id field param to queryresult.retrieved_docs by @HenryL27 in #1312
- Update LLM models by @bsowell in #1313
- Fix the example llm_cluster_instruction for the topk operator. by @vikram-ak in #1314
- Add a split_pdf function. by @bsowell in #1315
- fix case where tokens are None inside of model selection parsing by @HenryL27 in #1317
- Fix fast notebook tests by @bsowell in #1316
- Handle case where query is a compound query by @austin-aryn-ai in #1319
- Add DeepWiki Badge by @karanataryn in #1321
- Fix issue with nested_lookup utility. by @bsowell in #1320
- Remove index from Jinjaprompt to avoid token blow up. by @akarshgupta7 in #1322
- Add flag to skip table extraction on empty tables by @MarkLindblad in #1323
- Enable pdfminer vertical text grouping. by @bsowell in #1324
- Update Slack invite link in README.md by @sohamkasar19 in #1325
- Fix when unroll handles None entity field by @bohou-aryn in #1326
- Add some useful planner processors by @HenryL27 in #1327
- Update Slack invite link in documentation and README.md by @sohamkasar19 in #1329
- Add code to turn materialize off for all nodes after and incl Sort. by @akarshgupta7 in #1328
- Add support for a custom supplement_text function in the partitioner. by @bsowell in #1332
- allow planner customization via prompt specification by @HenryL27 in #1330
- Upgrade to torch 2.7.1. by @bsowell in #1334
- Add RECOMPUTE source mode for nodes with materialized disabled. by @akarshgupta7 in #1333
- Rename Table Extractor Options to Table Extraction Options by @karanataryn in #1338
- Add Ability to Resolve Boundary Overlaps in Tables by @karanataryn in #1337
- Fix materialize commit by @bohou-aryn in #1331
- Add OpenAI reasoning models. by @bsowell in #1340
- Add retries for VLMTableStructureExtractor. by @bsowell in #1341
- Fix Resolve Overlaps Plumbing by @karanataryn in #1339
- First part of reliable opensearch writing -- handles new items and missing metadata on source or destination. by @eric-anderson in #1335
- Add support for document reconstruct for RAG. by @akarshgupta7 in #1343
- Sycamore: deal with rotated tables. by @alexaryn in #1336
- Add RAG support in Luna. by @akarshgupta7 in #1344
- Fix test clustering flaky by @bohou-aryn in #1345
- Add empty list in RAG doc reconstructor if doc not found in unique_docs. by @akarshgupta7 in #1346
- Fix notebook tests. by @eric-anderson in #1347
- Upgrade dependency on requests by @bsowell in #1348
- Remove default parameter for Gemini max_output_tokens. by @bsowell in #1349
- Introduce a new chained LLM by @austin-aryn-ai in #1342
- Fix MRR refresh by @dhruvkaliraman7 in #1350
- Add logging on retries. by @eric-anderson in #1318
- [tmp!!!] switch off partitioner its by @HenryL27 in #1351
- Load models in eval mode, no gradient computation by @dhruvkaliraman7 in #1352
- Update protobuf version for Dependabot. by @bsowell in #1353
- Add support for new Gemini models. by @bsowell in #1355
- Improve ChainedLLM to handle when to move to next llm in chain, add r… by @austin-aryn-ai in #1354
- Fix more Dependencies by @karanataryn in #1356
- add kwargs to planning processors and plumb through sq client by @HenryL27 in #1360
- Refactor Process Batch by @karanataryn in #1357
- Add Extract Images Function by @karanataryn in #1361
- switch partitioner its back on by @HenryL27 in #1366
- Add source fields to RAG ...
v0.1.32
This Sycamore release contains a variety of bug fixes and improvements.
What's Changed
- add autoschema param for aryn writer by @HenryL27 in #1232
- remove aryn sdk publisher workflow by @HenryL27 in #1234
- Add extract_image_format option. by @bsowell in #1230
- deserialize a summaryDocument as a SummaryDocument, not a Document by @HenryL27 in #1238
- Fix when no sub docs in summary document by @dhruvkaliraman7 in #1239
- Fix OS document reconstruct read by @dhruvkaliraman7 in #1240
- Favor doc_reconstruct when reconstruct_document is also mentioned by @dhruvkaliraman7 in #1210
- Add support for air-gapping the easyocr model. by @eric-anderson in #1231
- spread property data, not references by @HenryL27 in #1243
- fallback to tatr if deformable fails by @HenryL27 in #1244
- Fix var name for MRR which broke on bad merge by @dhruvkaliraman7 in #1245
- Upstreaming Customer prompts by @dhruvkaliraman7 in #1246
- Add access method for materialize docset by @bohou-aryn in #1241
- Bump Lint Dependencies and Relint by @karanataryn in #1247
- Add Claude 3.7 Sonnet by @karanataryn in #1248
- Rename QueryBookmark to DataLoader by @bohou-aryn in #1249
- Embedder is now a context manager and can free resources. by @alexaryn in #1250
- add default llm_mode to llms by @HenryL27 in #1253
- Add close() to our OpenAI and OpenAIClientWrapper classes. by @alexaryn in #1254
- Fix limit transform by @dhruvkaliraman7 in #1255
- Fix clustring flaky test by @bohou-aryn in #1251
- tweak llm filter prompt to be more better by @HenryL27 in #1258
- Allow override of OpenSearch user/password in run_plan(). by @alexaryn in #1256
- Delete PITs after we're done reading by @austin-aryn-ai in #1257
- Inspect serialization issues only if TypeError is raised by @austin-aryn-ai in #1259
- better defaults for aryn writer by @HenryL27 in #1262
- bugfix: missing token has 4 slashes by @eric-anderson in #1252
- Upgrade torch to 2.6.0. by @bsowell in #1263
- Bump version to 0.1.32. by @bsowell in #1264
New Contributors
- @austin-aryn-ai made their first contribution in #1257
Full Changelog: v0.1.31...v0.1.32
v0.1.31
This Sycamore release contains a variety of bug fixes and improvements.
What's Changed
- Refactor the caching API in llms so that the get and set APIs are symmetric by @eric-anderson in #1108
- Fix OpenSearch tests that require pre-loaded index by @austintlee in #1111
- Fix source_directory path in conf.py by @sravan1946 in #953
- Make lib/poetry-lock/poetry-lock-all.sh failures more obvious by @MarkLindblad in #1107
- aryn-opensearch-bedrock-rag-example.ipynb by @jonfritz in #1117
- Bump Dependencies to Fix Security Issues by @karanataryn in #1119
- [llm unify 1/n] Add consolidated prompt classes by @HenryL27 in #1120
- Add Dependency Review Action by @karanataryn in #1121
- Removing Guidance by @Soeb-aryn in #1114
- Bump PyPDF by @karanataryn in #1124
- Add anthropic api key to testing workflow by @HenryL27 in #1125
- Add support for async DocParse calls in aryn-sdk by @MarkLindblad in #1116
- Bump aryn-sdk version to 0.1.11 by @MarkLindblad in #1127
- Add CodeQL Vulnerability Scan by @karanataryn in #1118
- Update fileformattools by @baitsguy in #1133
- capturing metadata from LLMs by @Soeb-aryn in #1122
- Add jupyter utils (Finra upstream) by @dhruvkaliraman7 in #1135
- Change @context_params behavior to only pass explicit arguments. by @bsowell in #1136
- Upgrade OpenAI to ^1.60.2. by @bsowell in #1137
- Explicitly add tiktoken and relock. by @bsowell in #1138
- add HeaderAugmenterMerger to docs by @HenryL27 in #1139
- add another --no-root for rtd by @HenryL27 in #1140
- Improve use of async DocParse via aryn-sdk by @MarkLindblad in #1134
- Bump aryn-sdk version to 0.1.12 by @MarkLindblad in #1141
- Fix self-reported aryn-sdk version by @MarkLindblad in #1142
- Bump aryn-sdk version to 0.1.12.post0 by @MarkLindblad in #1143
- Update Testing Workflows by @karanataryn in #1144
- [llm unify 2/n] Implement llm_map(_elements) and move extract_entity to it. by @HenryL27 in #1126
- [llm unify 3/n] Reimplement SummarizeImages as an LLMMapElements by @HenryL27 in #1146
- Add OpenSearch shard related logging by @baitsguy in #1145
- Clean Up Dead Code by @karanataryn in #1132
- Fix fetching of parent doc properties during OpenSearch read by @austintlee in #1148
- [llm unify 4/n] extract properties by @HenryL27 in #1149
- Add a groupby operator by @bohou-aryn in #1123
- Change job to task in aryn-sdk async support by @MarkLindblad in #1151
- Bump aryn-sdk version to 0.1.13 by @MarkLindblad in #1152
- Finish cleaning up PR 1148 by @austintlee in #1150
- Aryn connectors for reading and writing docsets by @austintlee in #1147
- handle specified prompt and use_elements=True in extract entity by @HenryL27 in #1153
- Fix async DocParse task id in aryn-sdk example by @MarkLindblad in #1155
- Add Opensearch Writer Reliability by @dhruvkaliraman7 in #1130
- Add materialize read reliability by @dhruvkaliraman7 in #1094
- add docs -> docs wrapper function for LLMPropertyExtractor by @HenryL27 in #1158
- Reliability mocking bug by @dhruvkaliraman7 in #1160
- Ensure parent docs are collected during doc reconstruct by @austintlee in #1159
- fix extract properties again by @HenryL27 in #1162
- Make async DocParse methods in aryn-sdk not operate on non-DocParse async tasks by @MarkLindblad in #1156
- [llm unify 5a/n] Add JinjaPompt and re-convert extract entities by @HenryL27 in #1161
- Add list to cast types by @dhruvkaliraman7 in #1163
- [llm unify 5b/n] Jinja summarize images by @HenryL27 in #1166
- Rename async list endpoints to "action" from "path" by @MarkLindblad in #1170
- [llm unify 5c/n] jinjify extract properties by @HenryL27 in #1169
- Bump Beautiful Soup by @karanataryn in #1167
- Serialize query strings to avoid Ray Dataset column imputation by @austintlee in #1171
- Update aryn-sdk's async DocParse interface to raise Exceptions rather than returning error strings by @MarkLindblad in #1164
- Add OCR Languages to Aryn SDK by @karanataryn in #1168
- Fix None in llm response by @dhruvkaliraman7 in #1173
- [llm unify 5/n] llm_filter by @HenryL27 in #1154
- Bump aryn-sdk to v0.1.14 by @MarkLindblad in #1165
- ASDK: Prevent excessive memory consumption reading file. by @alexaryn in #1174
- Close the darn file! by @alexaryn in #1175
- Add planner interface by @baitsguy in #1177
- Plumbing X-Aryn-Trace-ID through ASDK partition_file. by @alexaryn in #1179
- Bump Ray to Fix Security Issue by @karanataryn in #1181
- Writer Reliability bug by @dhruvkaliraman7 in #1184
- fix anthropic required module by @HenryL27 in #1185
- Improve Aryn reader by @austintlee in #1172
- Retain element doc_id from source by @austintlee in #1178
- Update aryn-opensearch-bedrock-rag-example.ipynb by @jonfritz in #1187
- Add Gemini LLM and Summarizer by @karanataryn in #1176
- [llm unify 6/n] extract schema and batch schema to llm map by @HenryL27 in #1188
- Adding clustering and groupby in luna planner part by @bohou-aryn in #1183
- Fix Gemini Bugs by @karanataryn in #1191
- Helper script for getting git credentials from the environment by @eric-anderson in #1190
- fix image tests - we don't explode on bad bboxes these days by @HenryL27 in #1193
- Get random hits when filtering properties in sycamore query by @dhruvkaliraman7 in #1195
- Remove OCR Images by @karanataryn in #1194
- Add Summarize Images to Aryn SDK by @karanataryn in #1189
- Change Gemini Model Names by @karanataryn in #1197
- Add Gemini 2 Pro by @karanataryn in #1198
- Bump Aryn SDK to 0.1.15 by @karanataryn in #1199
- LLM Async mode by @HenryL27 in #1200
- LLM Batch inference by @HenryL27 in #1202
- updating run-jupyter.sh file by @Soeb-aryn in #1201
- Change HeaderAugmenterMerger str concatenation behavior to minimize adding newlines by @MarkLindblad in #1205
- Add param to control how model is selected in hybrid table extractor by @HenryL27 in #1203
- Add entity name in grouped result, also add materialize in groupbycount operator by @bohou-aryn in #1204
- Add VLM OCR to Aryn SDK by @karanataryn in #1207
- Fix OpenSearch integ test by @dhruvkaliraman7 in #1209
- [SDK] Ability to cancel running partition call. by @alexaryn in #1211
- bump aryn sdk version by @HenryL27 in #1212
- Bump Dependencies to Fix Security Issues by @karanataryn in #1208
- ...
v0.1.30
This Sycamore release contains several bug fixes and improvements.
What's Changed
- Add logging of the full exception in base_writer. by @eric-anderson in #1069
- Fix create_element to not crash on bad element types by @eric-anderson in #1070
- Add docset.take_stream() by @baitsguy in #1071
- Make temporary fix to split_elements to avoid exceeding recursion depth due to certain table elements by @MarkLindblad in #1073
- add TableMerger to merge elements docs by @HenryL27 in #1074
- Increase max recursion depth for split_element's split_one by @MarkLindblad in #1075
- Merge-elements-LLM-filter by @dhruvkaliraman7 in #1076
- Add support for GPU to similarity. by @austintlee in #999
- Tolerate bad entity extraction. by @eric-anderson in #1078
- move deformable detr safe loading code by @HenryL27 in #1055
- Allow Doc reconstruct via function by @austintlee in #1072
- Add-tokenizer-and-reranking-to-LLM-ExtractEntity by @dhruvkaliraman7 in #1081
- Schema object + entity extraction support by @baitsguy in #1083
- Make ttviz.cpp compile again. by @alexaryn in #1082
- Keep newline in OpenAI Embedder by @dhruvkaliraman7 in #1086
- Changed the default embedding model to openai. by @akarshgupta7 in #1087
- Add Embed at Element Level by @dhruvkaliraman7 in #1084
- Get sycamore.query to work with Schema instead of only OpenSearchSchema by @baitsguy in #1088
- Add hybrid table extractor by @HenryL27 in #1089
- Add map reduce style summarize to handle large texts for summarization. by @austintlee in #1079
- fix max(nothing) bug by @HenryL27 in #1091
- Delay initializing openai client in embedder by @HenryL27 in #1092
- fix materialize on windows by @HenryL27 in #1093
- Add Retries for OpenSearch Writer by @karanataryn in #1085
- Property extraction type cast by @baitsguy in #1095
- Revert overzealous no-rootification by @HenryL27 in #1098
- Add support for Anthropic LLMs. by @bsowell in #1096
- Fix similarity assert condition for LLM Filter by @dhruvkaliraman7 in #1099
- Raise PartitionError with explicit status code. by @alexaryn in #1101
- Add PartitionError to aryn_sdk.partition's __init__.py by @MarkLindblad in #1102
- Prompt update for property extraction by @baitsguy in #1103
- Add support for parallel read in OpenSearchReader by @austintlee in #1100
- Fix No Root Repetition in Test File by @karanataryn in #1097
- Bump version to 0.1.30. by @bsowell in #1109
Full Changelog: v0.1.29...v0.1.30
v0.1.29
This Sycamore release contains small bug fixes and enhancements.
What's Changed
- when there's no table structure, take the token bbox for the cell bbox by @HenryL27 in #1061
- Disable use of scroll in OpenSearch reader when running KNN queries. by @austintlee in #1062
- Binarize OCR Image to Improve Performance by @karanataryn in #1063
- Fix split_elements for table elements with no elem.table attribute by @MarkLindblad in #1064
- Fix Extract Schema Empty Return by @karanataryn in #1067
- Bump version to v0.1.29. by @bsowell in #1068
Full Changelog: v0.1.28...v0.1.29
v0.1.28
This release changes doc_ids from UUIDs to NanoIDs, adds some document title functionality, and improves stability and performance.
What's Changed
- adding one shot prompting along with multimodal request by @Soeb-aryn in #1023
- Fix query-ui dependency on boto3 and re-lock. by @mdwelsh in #1028
- Updated NTSB queries and ground truth for CIDR-25 paper. by @mdwelsh in #1026
- Add streaming support and tests for query-server. by @mdwelsh in #1027
- Supply element types in output from MarkedMerger. by @alexaryn in #1031
- Fix SummarizeData so that downstream .materialize operations will work. by @mdwelsh in #1030
- add nanoid by @HenryL27 in #1034
- Removed duplicate code in query execution. by @akarshgupta7 in #1035
- Convert docids from UUID to NanoID. by @alexaryn in #1032
- Use NanoIDs in file_scan. by @alexaryn in #1036
- extract table properties prompt & bug fix by @Soeb-aryn in #1037
- Convert DocIDs to UUIDs for Qdrant & Weaviate; unit tests. by @alexaryn in #1038
- heuristics to get title from section headers by @Soeb-aryn in #1033
- updating function in pdf_miner class by @Soeb-aryn in #1041
- Added ragas to compute string metrics for evaluation. by @akarshgupta7 in #1039
- Fix sort so that it works with an unspecified or None default_value. by @eric-anderson in #1040
- Added correctness score to the metrics. by @akarshgupta7 in #1043
- Query planner improvements by @baitsguy in #1046
- Fix materialize to tolerate an empty input directory in ray mode by @eric-anderson in #1045
- PR fix by @baitsguy in #1047
- disable vectorsearch rerank by default in query by @baitsguy in #1048
- vectorsearch planner prompt changes by @baitsguy in #1049
- Make OpenAIEmbedder serializable after client has been initialized. by @bsowell in #1050
- Rename Embedding in ElasticSearch Notebook by @karanataryn in #1051
- Add deformable table extractor by @HenryL27 in #1053
- Add helper for thread local variables that can be used to add metadata to the output stream by @eric-anderson in #1052
- Propagate element level llm_filter output to doc.properties by @baitsguy in #1054
- Handle military clock time (0800) in time standardizer. by @alexaryn in #1056
- Fix incorrect docstring for promote-certain-elements-to-title feature by @MarkLindblad in #1057
- adding parameter for API in sdk and remote_partitioner by @Soeb-aryn in #1042
- bump sycamore version to 0.1.28 by @HenryL27 in #1058
- bump aryn sdk version to 0.1.10 by @HenryL27 in #1059
- don't die if box is None in try_draw_boxes by @HenryL27 in #1060
New Contributors
- @akarshgupta7 made their first contribution in #1035
Full Changelog: v0.1.27...v0.1.28
v0.1.27
This Sycamore release includes a variety of small bug fixes and improvements.
What's Changed
- Bump aryn-sdk version to 0.1.9 from 0.1.8 by @MarkLindblad in #1011
- Add plan validation by @baitsguy in #1001
- Sort retrieval docs by score properties if they exist by @baitsguy in #1012
- Add 120k max chars (default) for summarize_data by @baitsguy in #1013
- Queryeval docset write fix by @baitsguy in #1014
- Add notebook file for OpenSearch example by @jonfritz in #1015
- Fix up NTSB queries for query-eval tool. by @mdwelsh in #1016
- Rename from APS to DocParse by @karanataryn in #1017
- enable JSONifying tables by @HenryL27 in #1018
- Fix aryn-sdk's convert_image_element example by @MarkLindblad in #1019
- Fix DocParse chunking example in aryn-sdk by @MarkLindblad in #1021
- blacksmith.sh: Migrate workflows to Blacksmith by @blacksmith-sh in #1020
- Revert Unit Tests to GitHub Actions by @karanataryn in #1025
- Bump version to 0.1.27. by @bsowell in #1024
Full Changelog: v0.1.26...v0.1.27
v0.1.26
This release includes several stability and reliability improvements.
What's Changed
- skip flaky test by @HenryL27 in #956
- Fix mypy warnings. by @mdwelsh in #947
- Work around hang observed during vcrpy recording. by @alexaryn in #950
- Postprocessing to modify plans returned by llm planner; minor issues with query-ui by @amolvdeshpande in #882
- bump sdk to 0.1.7 by @HenryL27 in #961
- Add HeaderAugmenterMerger by @dhruvkaliraman7 in #946
- Update docs to reflect OpenAIPropertyExtractor->LLMPropertyextractor by @bsowell in #964
- Couple of minor fixes and tweaks to the table merger. by @bsowell in #963
- Enable use_elements in query.summarize_data by @baitsguy in #966
- Fix typo in syntax in docstring for Summarize Images by @jonfritz in #967
- Add missing tokenizer argument in MarkBreakByTokens docstring by @MarkLindblad in #969
- Add Lots of Connector Unit Tests by @karanataryn in #957
- Add OCR Evaluation Code by @karanataryn in #685
- Fixed query tag check by @baitsguy in #968
- Fix SDK Threshold Bug by @karanataryn in #970
- Add score to each document in OpenSearch query result. by @bsowell in #971
- Fix HeaderAugmenterMerger by @MarkLindblad in #973
- Refactor mark_bbox_preset to expose function outside DocSet by @MarkLindblad in #972
- Fix mark_bbox_preset's MarkDropHeaderFooter parameter by @MarkLindblad in #975
- OpenSearch improvements by @baitsguy in #974
- Adding a separate installation instructions page by @AbhijitP-009 in #977
- Union OCR / PDFMiner Tokens with Table Outputs by @karanataryn in #976
- Make Table Code More Robust by @karanataryn in #979
- fix divide by zero in align_headers by @HenryL27 in #978
- Allow for returning query traces on cached query executions. by @mdwelsh in #959
- Add Enhance Table Option to SDK by @karanataryn in #980
- Bump SDK Version by @karanataryn in #981
- Update Lockfiles by @karanataryn in #920
- Add query planning strategy objects by @baitsguy in #982
- Move tokenized data to device by @baitsguy in #983
- Update vectorsearch query test by @baitsguy in #984
- Integration test for Sycamore Query demo. by @mdwelsh in #985
- Add Closure of Client Connections for Connectors by @karanataryn in #989
- Work around lack of resource module on Windows. by @alexaryn in #962
- Update README.md by @karanataryn in #990
- Merge in Fixes from Luna Demo Deployment by @karanataryn in #992
- Add table-chunker by @dhruvkaliraman7 in #993
- chore: Added back to top , contributors section and star history graph by @samarth29jc in #987
- Return the list of documents referenced in a Luna query. by @mdwelsh in #995
- Sync Locks across all Directories by @karanataryn in #988
- Remove unused code (_batchify) by @MarkLindblad in #887
- Don't try to put footers in columns by @HenryL27 in #998
- Docprep notebook testing by @sohamkasar19 in #996
- Add expected documents in query-eval tool by @baitsguy in #997
- Move Aryn DocParse Docs Out of Sycamore by @karanataryn in #994
- Remove seed from rewrite prompt by @baitsguy in #1000
- Fix OpenAI reduce methods to handle Azure deployment names. by @bsowell in #1002
- Add support for custom source parameter for remote Aryn Partitioner by @MarkLindblad in #1003
- Fix mixed samples for schema extraction. by @mdwelsh in #1004
- updating extract table prop by @Soeb-aryn in #1005
- Update Opensearch domain in docprep notebook testing (GHA) by @sohamkasar19 in #1006
- Improve suggested install command by @HenryL27 in #1007
- Fix augment_text docstring by @HenryL27 in #1008
- Add support for using Aryn DocParse chunking from aryn-sdk by @MarkLindblad in #1010
- Update sycamore to 0.1.26 by @HenryL27 in #1009
New Contributors
- @amolvdeshpande made their first contribution in #882
- @samarth29jc made their first contribution in #987
Full Changelog: v0.1.25...v0.1.26
v0.1.25
This Sycamore release includes numerous bug fixes for connectors and other transforms. It also includes support for Anthropic LLMs via Amazon Bedrock.
What's Changed
- Sycamore Query evaluation tool. by @mdwelsh in #912
- Luna client local schema (take 2) by @dtecuci in #919
- Fix small bug in client. by @mdwelsh in #923
- Fix DuckDB Spelling Error by @karanataryn in #924
- Make OpenSearchSchema a proper Pydantic model. by @mdwelsh in #922
- Fix typo by @Yashbhatt786 in #927
- Bugfixes: DocumentSource enum serialization and missing element_id in old data by @baitsguy in #928
- Bug fixes: remove kwargs in docset.rerank, sycamore query codegen by @baitsguy in #932
- Add Table Merger by @dhruvkaliraman7 in #880
- Basic Bedrock LLM client. by @mdwelsh in #931
- Accept query plan examples in config by @baitsguy in #934
- Evaluate query plans in query-eval by @baitsguy in #936
- Add local mode support for json scan and json document scan by @bohou-aryn in #925
- Handle Drawing Missing Tables and Cells by @karanataryn in #938
- Support LLM selection in Sycamore Query Client. by @mdwelsh in #935
- Crop To Bbox Error by @karanataryn in #939
- Add plan correctness metrics summary + K in TopK optional by @baitsguy in #940
- don't embed the empty string with openai by @HenryL27 in #943
- Support SummarizeImages with non-OpenAI LLMs. by @bsowell in #941
- Add support for tags and notes. by @mdwelsh in #942
- Create LLMSchemaExtractor and LLMPropertyExtractor classes. by @bsowell in #945
- Don't run embedded weaviate in the unit tests by @HenryL27 in #951
- fix empty strings in section headers by @HenryL27 in #948
- Select pages by @bsowell in #937
- Fixup notebook tests by @eric-anderson in #933
- Use pytest-xdist for unit tests. by @mdwelsh in #952
- Update standardizer.py by @jonfritz in #944
- Fix bugs in Unflattening Data by @karanataryn in #930
- fix materialize bug with s3 filesystem by @eric-anderson in #954
- Bump version to 0.1.25. by @bsowell in #955
New Contributors
- @Yashbhatt786 made their first contribution in #927
Full Changelog: v0.1.24...v0.1.25
v0.1.24
This Sycamore release includes several bug fixes in the Weaviate and DuckDB connectors and in several of the example notebooks. Thanks to @Dnaynu for contributing to the Sycamore documentation!
What's Changed
- fix asdict in the reader too. duh by @HenryL27 in #907
- Add text reprentation for empty tables by @dhruvkaliraman7 in #909
- Refactor logical plan serialization. by @mdwelsh in #905
- microperformance improvement by @HenryL27 in #906
- Bugfix: Handle opensearch reader doc resconstruction when no parent doc in results by @baitsguy in #908
- Fix bug in entity extraction. by @eric-anderson in #911
- added ability to read schema from file by @dtecuci in #904
- Enable copying of the hash context. by @alexaryn in #910
- Add option to extract line-based bounding boxes from pdfminer. by @bsowell in #874
- Support random sample in local mode. by @bsowell in #913
- Opensearch kwargs fix by @baitsguy in #914
- Fix Typo in NTSB Demo by @karanataryn in #917
- Update using_jupyter.md by @jonfritz in #902
- Docs: Typo Fix by @Dnaynu in #918
- Update DuckDB Reader to Package Change by @karanataryn in #916
- Make metadata-extraction.ipynb work by @eric-anderson in #915
- Bump Sycamore version to 0.1.24. by @bsowell in #921
New Contributors
Full Changelog: v0.1.23...v0.1.24