Speed up import sycamore by ~10x; add module/tool for timing imports. #1292
Conversation
Went from ~1.2s to ~0.1s to execute import sycamore on my laptop. Remaining work is all around improving llm import time; that import is about 90% of the remaining time. lib/import_timer is the new library for doing import timing. See the file for instructions. Thanks to Claude for an initial version.
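The import_timer implementation isn't shown in full here; a common way to build such a tool is to wrap builtins.__import__ and record how long each import statement takes, which matches the imports visible in the diff below. A minimal sketch of that technique follows; the function names (install, report) and the exact bookkeeping are illustrative, not necessarily the module's real API.

import builtins
import time

_original_import = builtins.__import__
import_times: dict[str, float] = {}

def _timed_import(name, globals=None, locals=None, fromlist=(), level=0):
    start = time.perf_counter()
    try:
        return _original_import(name, globals, locals, fromlist, level)
    finally:
        # Accumulate per-module time; nested imports also count toward their parent.
        import_times[name] = import_times.get(name, 0.0) + (time.perf_counter() - start)

def install():
    builtins.__import__ = _timed_import

def report(top_n=20):
    # Print the slowest imports first.
    for name, secs in sorted(import_times.items(), key=lambda kv: -kv[1])[:top_n]:
        print(f"{secs:8.3f}s  {name}")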
Nice! The re-imports being slow seems really fishy to me, so I worry that something is not doing what it's supposed to.
lib/import_timer/import_timer.py
import time
import os
import builtins
from typing import Dict, List
I think you can use lowercase dict and list for type annotations, and then you save yourself a typing import cuz those are builtins.
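For illustration, with Python 3.9+ the builtin generics work directly in annotations, so no typing import is needed (the helper below is made up, only the annotation style is the point):

def slowest(import_times: dict[str, float], top_n: int = 10) -> list[str]:
    # Return the module names with the largest recorded import times.
    return sorted(import_times, key=import_times.get, reverse=True)[:top_n]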
done.
elapsed = end_time - start_time

if name in import_times:
    # Re-imports can be really slow; not sure why, but sycamore.llms.llm can be 0.19s -> 0.62s
This seems really odd to me. I thought Python caches imported modules. https://docs.python.org/3/reference/import.html#the-module-cache
I thought so too. It happens all over (lots of modules; I had an error, then a warning, then I turned it off); that was just one of the extreme ones.
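One possible reading, assuming the timer wraps builtins.__import__: the module cache lives in sys.modules, but every import statement still calls __import__, so a wrapped hook sees cached imports too. A cached import should normally be little more than a dict lookup, so a 0.6s "re-import" may mean the hook is attributing nested, first-time imports (triggered while resolving the name or its fromlist) to the already-cached module. A quick sanity check from a plain script, using the sycamore.llms.llm module mentioned in the comment above:

import sys
import time

import sycamore.llms.llm            # first import: the module's code actually runs

start = time.perf_counter()
import sycamore.llms.llm            # re-import: normally just a sys.modules lookup
print(f"cached re-import took {time.perf_counter() - start:.6f}s")

assert "sycamore.llms.llm" in sys.modules  # confirms the module cache is populated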
@dataclass
class OpenAIModel:
If we're putting the openai enums in a separate file, can we bring in the anthropic, bedrock, and gemini enums too? But also this seems like a kinda weird thing to do. Why?
Because llms/__init__.py needs the list of models to build the MODELS dictionary, and that's used in enough places that I didn't want to try to fix it. Pulling out openai was sufficient to give me most of the speedup, so I didn't go farther.
Put the other config stuff in.
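For context, a hedged sketch of what that kind of split looks like in general; the module name (config.py) and the specific model entries are made up for illustration and may not match the PR's actual layout:

# Hypothetical lightweight module, e.g. sycamore/llms/config.py: only dataclasses
# and enums, so it is cheap to import.
from dataclasses import dataclass
from enum import Enum

@dataclass
class OpenAIModel:
    name: str
    is_chat: bool = True

class OpenAIModels(Enum):
    GPT_4O = OpenAIModel("gpt-4o")
    GPT_4O_MINI = OpenAIModel("gpt-4o-mini")

# sycamore/llms/__init__.py can then build its MODELS dictionary from these enums
# without importing the openai client package; the client is only imported inside
# the code paths that actually call the API.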
# Attention future developer: You're here, your PR just made this test fail, you're thinking
# "I'll just increase this by a little bit, it's not a big deal." You're probably not even
# responsible for most of the slowdown. Please, take the time to fix it and not let it get
# "just a little bit worse." That's the way that we get back up to taking >1s to import a
# library when it should take <0.1s.
lol nice
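The assertion itself isn't in this hunk; a sketch of how such an import-time budget can be enforced, timing the import in a fresh subprocess so modules already cached in the test runner don't hide the cost (the test name and the 0.3s threshold are illustrative, not the PR's actual values):

import subprocess
import sys

def test_import_sycamore_is_fast():
    # Measure in a clean interpreter so previously imported modules don't skew the number.
    code = "import time; s = time.perf_counter(); import sycamore; print(time.perf_counter() - s)"
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True, check=True)
    elapsed = float(proc.stdout.strip())
    assert elapsed < 0.3, f"import sycamore took {elapsed:.2f}s; read the comment above before bumping this"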
import time


def sleepfn(d):
    print("Sleeping...")
    time.sleep(10)
    return d


import sycamore

sycamore.init().read.document([Document(d) for d in dicts]).map(sleepfn).take_all()
why?
That was debugging. I was having trouble with GPU stuff and thought maybe it was the weird way ray was getting started. Removed.
# from sycamore.transforms.extract_schema import (
#     ExtractSchema,
#     ExtractBatchSchema,
#     SchemaExtractor,
#     PropertyExtractor,
# )
Delete commented stuff? idk, lmk if you want me to go through and mark all the spots.
ARYN_DETR_MODEL = "Aryn/deformable-detr-DocLayNet"
DEFAULT_ARYN_PARTITIONER_ADDRESS = "https://api.aryn.cloud/v1/document/partition"
DEFAULT_LOCAL_THRESHOLD = 0.35
why move these to their own place?
Because transforms/partition.py needs them to set default values and I don't want to have to import the whole detr partitioner to get them.
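A sketch of that arrangement; the constants are the ones shown above, but the module name used here is only a placeholder for wherever the PR actually put them:

# Hypothetical constants-only module, e.g. sycamore/transforms/detr_partitioner_config.py;
# nothing heavy (torch, transformers, the DETR model code) is imported here.
ARYN_DETR_MODEL = "Aryn/deformable-detr-DocLayNet"
DEFAULT_ARYN_PARTITIONER_ADDRESS = "https://api.aryn.cloud/v1/document/partition"
DEFAULT_LOCAL_THRESHOLD = 0.35

# transforms/partition.py can then import these to populate default argument values,
# deferring the DETR partitioner import until a partitioner is actually constructed.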
from sycamore.data import Document, Element
from sycamore.llms import LLM, OpenAI, OpenAIClientWrapper, OpenAIModels, Gemini, GeminiModels
from sycamore.llms import LLM
why not import this one from sycamore.llms.llms?
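One thing worth noting when weighing that suggestion (assuming the base class lives in a sycamore.llms.llms submodule, as the comment implies): importing from the submodule still executes sycamore/llms/__init__.py, since importing a submodule imports its parent package first, so it only becomes a speed win once that __init__ itself is cheap.

# Both forms run sycamore/llms/__init__.py; the difference is only where the name LLM
# is looked up.
from sycamore.llms import LLM        # requires __init__ to re-export LLM
from sycamore.llms.llms import LLM   # binds the name straight from its defining module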