+
Skip to content

Conversation

eric-anderson
Copy link
Collaborator

Went from ~1.2s to ~0.1s to execute import sycamore. on my laptop.

Remaining work is all around improving llm import time, that import is about 90% of the remaining time.

lib/import_timer is the new library for doing import timing. See the file for instructions.
Thanks to Claude for an initial version.

Went from ~1.2s to ~0.1s to execute import sycamore. on my laptop.

Remaining work is all around improving llm import time, that import is about 90% of the remaining
time.

lib/import_timer is the new library for doing import timing. See the file for instructions.
  Thanks to Claude for an initial version.
@eric-anderson eric-anderson requested a review from HenryL27 May 8, 2025 05:24
Copy link
Collaborator

@HenryL27 HenryL27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! the reimports being slow seems real fishy to me so I worry that something is not doing what it's supposed to

import time
import os
import builtins
from typing import Dict, List
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use lowercase dict and list for type annotations and then you save yourself a typing import cuz those are builtins

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

elapsed = end_time - start_time

if name in import_times:
# Re-imports can be really slow; not sure why, but sycamore.llms.llm can be 0.19s -> 0.62s
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems really odd to me. I thought python caches imported modules. https://docs.python.org/3/reference/import.html#the-module-cache

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so too; It happens all over (lots of modules, I had an error then a warning then I turned it off), that was just one of the extreme ones.

Comment on lines +5 to +6
@dataclass
class OpenAIModel:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're putting the openai enums in a separate file can we bring in the anthropic, bedrock, and gemini enums too? But also this seems like a kinda weird thing to do. why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because llms/init.py needs the list of models to build the MODELS dictionary, and that's used in enough places that I didn't want to try to fix it. Pulling out openai was sufficient to give me most of the speedup, so I didn't go farther.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the other config stuff in.

Comment on lines +6 to +10
# Attention future developer: You're here, your PR just made this test fail, you're thinking
# "I'll just increase this by a little bit, it's not a big deal." You're probably not even
# responsible for most of the slowdown. Please, take the time to fix it and not let it get
# "just a little bit worse." That's the way that we get back up to taking >1s to import a
# library when it should take <0.1s.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol nice

Comment on lines 109 to 118
import time

def sleepfn(d):
print("Sleeping...")
time.sleep(10)
return d

import sycamore

sycamore.init().read.document([Document(d) for d in dicts]).map(sleepfn).take_all()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was debugging. I was having trouble with GPU stuff and thought maybe it was the weird way ray was getting started. Removed.

Comment on lines 39 to 44
# from sycamore.transforms.extract_schema import (
# ExtractSchema,
# ExtractBatchSchema,
# SchemaExtractor,
# PropertyExtractor,
# )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete commented stuff? idk lmk if you want me to go through and mark all the spots

Comment on lines +1 to +3
ARYN_DETR_MODEL = "Aryn/deformable-detr-DocLayNet"
DEFAULT_ARYN_PARTITIONER_ADDRESS = "https://api.aryn.cloud/v1/document/partition"
DEFAULT_LOCAL_THRESHOLD = 0.35
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why move these to their own place?

Copy link
Collaborator Author

@eric-anderson eric-anderson May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because transforms/partition.py needs them to set default values and I don't want to have to import the whole detr partitioner to get them.


from sycamore.data import Document, Element
from sycamore.llms import LLM, OpenAI, OpenAIClientWrapper, OpenAIModels, Gemini, GeminiModels
from sycamore.llms import LLM
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not import this one from sycamore.llms.llms?

@eric-anderson eric-anderson merged commit d3e07cb into main May 8, 2025
11 of 15 checks passed
@eric-anderson eric-anderson deleted the eric-make-import-faster-again branch May 8, 2025 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载