This Sycamore release contains a variety of bug fixes and improvements.
What's Changed
- Low-hanging mypy fruit (Sycamore edition). by @alexaryn in #1265
- Miscellaneous code-level improvements. by @alexaryn in #1266
- Improve OpenSearch reader by batching certain operations by @austin-aryn-ai in #1267
- Don't include original_elements in queries by @baitsguy in #1268
- Add a new OpenAI model, revert LlmFilterPrompt change by @austin-aryn-ai in #1271
- Add metadata loading to materialize by @HenryL27 in #1270
- Update httpcore and h11 for dependabot issue. by @bsowell in #1274
- Put os_client_args in default namespace to avoid surprise. by @alexaryn in #1275
- Add debugging for unexpected gemini model stop reasons. by @eric-anderson in #1276
- Summarize the group name for groupby by @bohou-aryn in #1261
- Update torch to 2.7.0 and transformers to 4.50.0. by @bsowell in #1279
- Adding gpt-4.1 and gpt-4.1-nano by @Soeb-aryn in #1278
- Add better view_pdf function by @HenryL27 in #1277
- Standardize materialize filenames a little by @HenryL27 in #1283
- Add support for non-clustering groupby by @bohou-aryn in #1281
- Add instructions to split entities in LLM extract entity. by @akarshgupta7 in #1286
- Add function to unroll entities by @bohou-aryn in #1288
- fix hybrid table model fallback to not edit tokens in-place by @HenryL27 in #1290
- Add heuristics to compute K for kmeans based on docset size. by @akarshgupta7 in #1287
- fix the non cluster groupby path error by @bohou-aryn in #1291
- Speed up import sycamore by ~10x; add module/tool for timing imports. by @eric-anderson in #1292
- Improve docset documentation to explain that it's lazy. by @eric-anderson in #1294
- handle non clustering aggregate count by @bohou-aryn in #1293
- Minor fixups from testing with managed-service. by @eric-anderson in #1296
- load tiktoken tokenizer lazily by @HenryL27 in #1298
- refactor planner to add query preprocessor functionality by @HenryL27 in #1297
- Add Gemini 2.5 Flash by @karanataryn in #1295
- Upgrade gemini by @bsowell in #1299
- Adding notebooks that walk through the earnings calls documents by @AbhijitP-009 in #1284
- Bump AIOHttp To Fix Package Install by @karanataryn in #1300
- OpenSearchReader result filtering by @austin-aryn-ai in #1282
- Add support for default llm kwargs for all models. by @bsowell in #1303
- Use llm for clustering by @bohou-aryn in #1302
- Upgrade tornado dependency. by @bsowell in #1305
- Add VLMTableStructureExtractor for table structure extraction. by @bsowell in #1304
- Fix collect when groupby has no entity names by @bohou-aryn in #1307
- Upgrade setuptools. by @bsowell in #1309
- Add type casting logic to extract entity. by @akarshgupta7 in #1310
- Remove aryn-sdk from the sycamore repo. by @bsowell in #1306
- Initial draft of X-Y Cut reading order for Sycamore (Hackathon) by @alexaryn in #1301
- improve model selection error message by @HenryL27 in #1311
- add id field param to queryresult.retrieved_docs by @HenryL27 in #1312
- Update LLM models by @bsowell in #1313
- Fix the example llm_cluster_instruction for the topk operator. by @vikram-ak in #1314
- Add a split_pdf function. by @bsowell in #1315
- fix case where tokens are None inside of model selection parsing by @HenryL27 in #1317
- Fix fast notebook tests by @bsowell in #1316
- Handle case where query is a compound query by @austin-aryn-ai in #1319
- Add DeepWiki Badge by @karanataryn in #1321
- Fix issue with nested_lookup utility. by @bsowell in #1320
- Remove index from Jinjaprompt to avoid token blow up. by @akarshgupta7 in #1322
- Add flag to skip table extraction on empty tables by @MarkLindblad in #1323
- Enable pdfminer vertical text grouping. by @bsowell in #1324
- Update Slack invite link in README.md by @sohamkasar19 in #1325
- Fix when unroll handles None entity field by @bohou-aryn in #1326
- Add some useful planner processors by @HenryL27 in #1327
- Update Slack invite link in documentation and README.md by @sohamkasar19 in #1329
- Add code to turn materialize off for all nodes after and incl Sort. by @akarshgupta7 in #1328
- Add support for a custom supplement_text function in the partitioner. by @bsowell in #1332
- allow planner customization via prompt specification by @HenryL27 in #1330
- Upgrade to torch 2.7.1. by @bsowell in #1334
- Add RECOMPUTE source mode for nodes with materialized disabled. by @akarshgupta7 in #1333
- Rename Table Extractor Options to Table Extraction Options by @karanataryn in #1338
- Add Ability to Resolve Boundary Overlaps in Tables by @karanataryn in #1337
- Fix materialize commit by @bohou-aryn in #1331
- Add OpenAI reasoning models. by @bsowell in #1340
- Add retries for VLMTableStructureExtractor. by @bsowell in #1341
- Fix Resolve Overlaps Plumbing by @karanataryn in #1339
- First part of reliable opensearch writing -- handles new items and missing metadata on source or destination. by @eric-anderson in #1335
- Add support for document reconstruct for RAG. by @akarshgupta7 in #1343
- Sycamore: deal with rotated tables. by @alexaryn in #1336
- Add RAG support in Luna. by @akarshgupta7 in #1344
- Fix test clustering flaky by @bohou-aryn in #1345
- Add empty list in RAG doc reconstructor if doc not found in unique_docs. by @akarshgupta7 in #1346
- Fix notebook tests. by @eric-anderson in #1347
- Upgrade dependency on requests by @bsowell in #1348
- Remove default parameter for Gemini max_output_tokens. by @bsowell in #1349
- Introduce a new chained LLM by @austin-aryn-ai in #1342
- Fix MRR refresh by @dhruvkaliraman7 in #1350
- Add logging on retries. by @eric-anderson in #1318
- [tmp!!!] switch off partitioner its by @HenryL27 in #1351
- Load models in eval mode, no gradient computation by @dhruvkaliraman7 in #1352
- Update protobuf version for Dependabot. by @bsowell in #1353
- Add support for new Gemini models. by @bsowell in #1355
- Improve ChainedLLM to handle when to move to next llm in chain, add r… by @austin-aryn-ai in #1354
- Fix more Dependencies by @karanataryn in #1356
- add kwargs to planning processors and plumb through sq client by @HenryL27 in #1360
- Refactor Process Batch by @karanataryn in #1357
- Add Extract Images Function by @karanataryn in #1361
- switch partitioner its back on by @HenryL27 in #1366
- Add source fields to RAG document reconstructor. by @akarshgupta7 in #1364
- Upgrade PaddleOCR by @karanataryn in #1365
- Add size to opensearch knn query. by @akarshgupta7 in #1368
- Fix OCR Bug by @karanataryn in #1369
- Drop support for Python 3.9 and add Python 3.13. by @bsowell in #1367
- [Dependency] Upgrade urllib3. by @bsowell in #1370
- [Dependency] Upgrade urllib3 in one more app. by @bsowell in #1371
- fix sycamore query codegen mode options by @HenryL27 in #1372
- adding Metadata extraction logic to extract_properties by @Soeb-aryn in #1362
- Add Optional Parameters by @karanataryn in #1374
- Add opensearch scores to element properties. by @akarshgupta7 in #1375
- Change the name of search relevance score. by @akarshgupta7 in #1376
- Add default to relevance scores returned by opensearch. by @akarshgupta7 in #1377
- Add support for `extract_image_format` when using remote ArynPartitioner by @MarkLindblad in #1379
- Support Delete and Update in OpenSearchSync by @eric-anderson in #1373
- Bump Pillow by @karanataryn in #1378
- Bump Pillow by @karanataryn in #1382
- Bump Paddle to 3.1 by @karanataryn in #1383
- Add prompt tail to llm planner by @dhruvkaliraman7 in #1385
- plumb kwargs through to run plan as well as generate plan by @HenryL27 in #1386
- Make OS optional in LLMPlanner by @dhruvkaliraman7 in #1387
- Upgrade transformers to 4.53.1. by @bsowell in #1388
- Fix warning for K-Means by @bohou-aryn in #1390
- add limit llm operations and require query database processors by @HenryL27 in #1389
- Upgrade aiohttp dependency by @bsowell in #1392
- Tolerate mis-formatted files instead of cleaning up. by @eric-anderson in #1380
- Scripts for running docparse controlled by BigQuery by @eric-anderson in #1393
- Change `ArynWriter` to use new serialization approach by @MarkLindblad in #1384
- Fix `ArynWriter`'s use of stream by @MarkLindblad in #1394
- Upgrade jupyter-core. by @bsowell in #1398
- Bump sycamore version to 0.1.33 by @MarkLindblad in #1397
- Upgrade Gemini 2.5 Flash Lite, which is now GA. by @bsowell in #1399
New Contributors
- @vikram-ak made their first contribution in #1314
Full Changelog: v0.1.32...v0.1.33