Speed up import sycamore by ~10x; add module/tool for timing imports. #1292
Conversation
Went from ~1.2s to ~0.1s to execute import sycamore on my laptop. Remaining work is all around improving llm import time; that import is about 90% of the remaining time. lib/import_timer is the new library for doing import timing. See the file for instructions. Thanks to Claude for an initial version.
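The import_timer implementation isn't shown in full here; a common way to build such a tool is to wrap builtins.__import__ and record how long each import statement takes, which matches the imports visible in the diff below. A minimal sketch of that technique follows; the function names (install, report) and the exact bookkeeping are illustrative, not necessarily the module's real API.

import builtins
import time

_original_import = builtins.__import__
import_times: dict[str, float] = {}

def _timed_import(name, globals=None, locals=None, fromlist=(), level=0):
    start = time.perf_counter()
    try:
        return _original_import(name, globals, locals, fromlist, level)
    finally:
        # Accumulate per-module time; nested imports also count toward their parent.
        import_times[name] = import_times.get(name, 0.0) + (time.perf_counter() - start)

def install():
    builtins.__import__ = _timed_import

def report(top_n=20):
    # Print the slowest imports first.
    for name, secs in sorted(import_times.items(), key=lambda kv: -kv[1])[:top_n]:
        print(f"{secs:8.3f}s  {name}")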
Nice! The re-imports being slow seems really fishy to me, so I worry that something is not doing what it's supposed to.
lib/import_timer/import_timer.py
import time
import os
import builtins
from typing import Dict, List
I think you can use lowercase dict and list for type annotations, and then you save yourself a typing import cuz those are builtins.
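For illustration, with Python 3.9+ the builtin generics work directly in annotations, so no typing import is needed (the helper below is made up, only the annotation style is the point):

def slowest(import_times: dict[str, float], top_n: int = 10) -> list[str]:
    # Return the module names with the largest recorded import times.
    return sorted(import_times, key=import_times.get, reverse=True)[:top_n]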
done.
elapsed = end_time - start_time

if name in import_times:
    # Re-imports can be really slow; not sure why, but sycamore.llms.llm can be 0.19s -> 0.62s
This seems really odd to me. I thought Python caches imported modules. https://docs.python.org/3/reference/import.html#the-module-cache
I thought so too. It happens all over (lots of modules; I had an error, then a warning, then I turned it off); that was just one of the extreme ones.
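One possible reading, assuming the timer wraps builtins.__import__: the module cache lives in sys.modules, but every import statement still calls __import__, so a wrapped hook sees cached imports too. A cached import should normally be little more than a dict lookup, so a 0.6s "re-import" may mean the hook is attributing nested, first-time imports (triggered while resolving the name or its fromlist) to the already-cached module. A quick sanity check from a plain script, using the sycamore.llms.llm module mentioned in the comment above:

import sys
import time

import sycamore.llms.llm            # first import: the module's code actually runs

start = time.perf_counter()
import sycamore.llms.llm            # re-import: normally just a sys.modules lookup
print(f"cached re-import took {time.perf_counter() - start:.6f}s")

assert "sycamore.llms.llm" in sys.modules  # confirms the module cache is populated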
@dataclass
class OpenAIModel:
If we're putting the openai enums in a separate file, can we bring in the anthropic, bedrock, and gemini enums too? But also this seems like a kinda weird thing to do. Why?
Because llms/__init__.py needs the list of models to build the MODELS dictionary, and that's used in enough places that I didn't want to try to fix it. Pulling out openai was sufficient to give me most of the speedup, so I didn't go farther.
Put the other config stuff in.
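For context, a hedged sketch of what that kind of split looks like in general; the module name (config.py) and the specific model entries are made up for illustration and may not match the PR's actual layout:

# Hypothetical lightweight module, e.g. sycamore/llms/config.py: only dataclasses
# and enums, so it is cheap to import.
from dataclasses import dataclass
from enum import Enum

@dataclass
class OpenAIModel:
    name: str
    is_chat: bool = True

class OpenAIModels(Enum):
    GPT_4O = OpenAIModel("gpt-4o")
    GPT_4O_MINI = OpenAIModel("gpt-4o-mini")

# sycamore/llms/__init__.py can then build its MODELS dictionary from these enums
# without importing the openai client package; the client is only imported inside
# the code paths that actually call the API.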
# Attention future developer: You're here, your PR just made this test fail, you're thinking
# "I'll just increase this by a little bit, it's not a big deal." You're probably not even
# responsible for most of the slowdown. Please, take the time to fix it and not let it get
# "just a little bit worse." That's the way that we get back up to taking >1s to import a
# library when it should take <0.1s.
lol nice
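The assertion itself isn't in this hunk; a sketch of how such an import-time budget can be enforced, timing the import in a fresh subprocess so modules already cached in the test runner don't hide the cost (the test name and the 0.3s threshold are illustrative, not the PR's actual values):

import subprocess
import sys

def test_import_sycamore_is_fast():
    # Measure in a clean interpreter so previously imported modules don't skew the number.
    code = "import time; s = time.perf_counter(); import sycamore; print(time.perf_counter() - s)"
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True, check=True)
    elapsed = float(proc.stdout.strip())
    assert elapsed < 0.3, f"import sycamore took {elapsed:.2f}s; read the comment above before bumping this"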
import time


def sleepfn(d):
    print("Sleeping...")
    time.sleep(10)
    return d


import sycamore

sycamore.init().read.document([Document(d) for d in dicts]).map(sleepfn).take_all()
why?
That was debugging. I was having trouble with GPU stuff and thought maybe it was the weird way ray was getting started. Removed.
# from sycamore.transforms.extract_schema import (
#     ExtractSchema,
#     ExtractBatchSchema,
#     SchemaExtractor,
#     PropertyExtractor,
# )
Delete commented stuff? idk, lmk if you want me to go through and mark all the spots.
ARYN_DETR_MODEL = "Aryn/deformable-detr-DocLayNet"
DEFAULT_ARYN_PARTITIONER_ADDRESS = "https://api.aryn.cloud/v1/document/partition"
DEFAULT_LOCAL_THRESHOLD = 0.35
why move these to their own place?
Because transforms/partition.py needs them to set default values and I don't want to have to import the whole detr partitioner to get them.
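A sketch of that arrangement; the constants are the ones shown above, but the module name used here is only a placeholder for wherever the PR actually put them:

# Hypothetical constants-only module, e.g. sycamore/transforms/detr_partitioner_config.py;
# nothing heavy (torch, transformers, the DETR model code) is imported here.
ARYN_DETR_MODEL = "Aryn/deformable-detr-DocLayNet"
DEFAULT_ARYN_PARTITIONER_ADDRESS = "https://api.aryn.cloud/v1/document/partition"
DEFAULT_LOCAL_THRESHOLD = 0.35

# transforms/partition.py can then import these to populate default argument values,
# deferring the DETR partitioner import until a partitioner is actually constructed.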
from sycamore.data import Document, Element
from sycamore.llms import LLM, OpenAI, OpenAIClientWrapper, OpenAIModels, Gemini, GeminiModels
from sycamore.llms import LLM
why not import this one from sycamore.llms.llms?
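One thing worth noting when weighing that suggestion (assuming the base class lives in a sycamore.llms.llms submodule, as the comment implies): importing from the submodule still executes sycamore/llms/__init__.py, since importing a submodule imports its parent package first, so it only becomes a speed win once that __init__ itself is cheap.

# Both forms run sycamore/llms/__init__.py; the difference is only where the name LLM
# is looked up.
from sycamore.llms import LLM        # requires __init__ to re-export LLM
from sycamore.llms.llms import LLM   # binds the name straight from its defining module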