
Conversation

HenryL27
Collaborator

@HenryL27 HenryL27 commented Jan 24, 2025

This is hard to break up because I'm changing APIs of things that already exist and have tests...

  • llm.generate now takes a RenderedPrompt instead of a dict
  • the old API exists as llm.generate_old, and it's marked as deprecated
  • most operations have been switched to call llm.generate_old
  • implement docset.llm_map and docset.llm_map_elements
  • implement an EntityExtractor.as_llm_map method that converts all of the functionality of OpenAIEntityExtractor into prompts and hooks for llm_map
  • docset.extract_entity() now uses that, so it goes through LLMMap instead of ExtractEntity

Also changes the prompt rendering APIs to produce either a single RenderedPrompt or a sequence of them, in order to support a mode of extract entity that uses a tokenizer to make token-limited chunks to try to extract from.
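The token-limited rendering described above could look something like the following sketch. The names (`RenderedPrompt`, `render_chunked`, the whitespace token counter) are illustrative assumptions, not Sycamore's actual API:

```python
# Hypothetical sketch: greedily pack element texts into token-limited chunks,
# each of which becomes its own rendered prompt. The real tokenizer and
# RenderedPrompt shape in Sycamore may differ.
from dataclasses import dataclass


@dataclass
class RenderedPrompt:
    messages: list  # e.g. [{"role": "user", "content": "..."}]


def render_chunked(texts, token_limit, count_tokens=lambda s: len(s.split())):
    """Pack element texts into prompts, each under token_limit tokens."""
    prompts, current, used = [], [], 0
    for text in texts:
        n = count_tokens(text)
        # Flush the current chunk when adding this text would exceed the limit.
        if current and used + n > token_limit:
            prompts.append(RenderedPrompt(
                messages=[{"role": "user", "content": "\n".join(current)}]))
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        prompts.append(RenderedPrompt(
            messages=[{"role": "user", "content": "\n".join(current)}]))
    return prompts
```

Each resulting `RenderedPrompt` can then be sent independently, which is why the rendering API now returns either one prompt or a sequence of them.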

Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
…kwargs

…prompts (to do things like adhere to token limits)

…ive implementation of extract entity bc I don't want to deal with llm_filter just yet

@HenryL27 HenryL27 marked this pull request as ready for review January 28, 2025 21:58
@HenryL27 HenryL27 changed the title [llm unify 2/n] Implement llm_map(_elements) and move ops to it. [llm unify 2/n] Implement llm_map(_elements) and move extract_entity to it. Jan 28, 2025
… docs say because azure disagrees and they'd better both accept 'system'

@HenryL27
Collaborator Author

AFAICT the ITs that are failing are mostly due to llm.generate_old() not quite converting prompt_kwargs to a RenderedPrompt correctly. I could spend a lot of time debugging that, but given that my goal is to delete generate_old entirely, I'm tempted to just leave the tests broken and fix them when I translate the transforms they're testing to LLMMap.

Contributor

@bsowell bsowell left a comment


There's a lot here. For the most part it looks good, I think. A couple of things to think about:

  • Is there a reasonable path to replace the generate_old calls at some point, do you think? I think you are doing the right thing doing the replacement incrementally, but I have a suspicion that we are going to live with this for a while.

  • Anything else we need to test to get confidence here? I guess most of our demos don't use sycamore directly anymore. Does AddDoc in DocStore still work? What about the DocPrep generated scripts?

new.__dict__[k] = v
return new

def is_done(self, s: str) -> bool:
Contributor


Trying to understand this (and maybe it will become clearer as I get through the rest of the files). Is the idea that the sequence will be an iterator or something and you invoke this between calls? Is it used anywhere as anything other than True? I don't see an implementation in ElementListIterPrompt.

Collaborator Author


Yep, that's the idea. In extract entities with a tokenizer we set this to s != "None", which is the condition it used to use. I'm not sure that the prompt is the right place for this piece of logic to live, though I guess the prompt is also what decides whether to produce a sequence of RenderedPrompts, so maybe it is?
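To make the intent of the is_done hook concrete, here is a minimal sketch of iterating over a sequence of rendered prompts and stopping at the first accepted result. The function name and signature are assumptions for illustration, not the PR's actual implementation:

```python
# Illustrative sketch: call the LLM on each prompt in a sequence, stopping
# once is_done(result) accepts the output. With the tokenizer-based extract
# entity, is_done would be `lambda s: s != "None"`.
def infer_until_done(prompts, generate, is_done=lambda s: True):
    """Return the first generation that satisfies is_done (or the last one)."""
    result = None
    for prompt in prompts:
        result = generate(prompt)
        if is_done(result):
            break
    return result
```

With the default `is_done` (always True), this degenerates to calling the LLM once on the first prompt, matching the single-prompt case.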

UNKNOWN = 0
SYNC = 1
ASYNC = 2
BATCH = 3
Contributor


Is this something done by the LLM, or are we batching by combining multiple things into a prompt before sending?

Collaborator Author


Something done by the LLM - OpenAI and Anthropic have batch APIs now: Anthropic / OpenAI. I think similar things exist in Bedrock and Azure as well, but I'm less sure.
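A hedged sketch of how such a call-mode enum might be used: the enum values mirror the ones in the diff, while the `choose_mode` dispatcher and its thresholds are purely illustrative assumptions, not code from this PR:

```python
# Sketch of an LLM call-mode enum and a heuristic dispatcher. BATCH refers to
# provider-side batch endpoints (e.g. the OpenAI/Anthropic batch APIs), not
# client-side prompt packing. Thresholds here are made up for illustration.
from enum import IntEnum


class LLMMode(IntEnum):
    UNKNOWN = 0
    SYNC = 1
    ASYNC = 2
    BATCH = 3


def choose_mode(n_prompts: int, provider_supports_batch: bool) -> LLMMode:
    """Prefer the provider's batch endpoint for large prompt sets."""
    if n_prompts > 100 and provider_supports_batch:
        return LLMMode.BATCH
    return LLMMode.SYNC if n_prompts == 1 else LLMMode.ASYNC
```

The key design point is that batching happens at the API level on the provider's side, so a transform like LLMMap can pick a mode per call without changing how individual prompts are rendered.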



class LLMMap(MapBatch):
"""The LLMMap transform renders each Document in a docset into
Contributor


Not as part of this PR, but I wonder if eventually it would make sense to have an option to attach the raw LLM request/response to the document as well for debugging.

from sycamore.data import Document, Element


def _infer_prompts(
Contributor


I confess I'm having a hard time following all of the prompt sequence stuff. I understand the basic motivation for the token stuff, but I can't help but wonder if there is a cleaner way to do it.

Collaborator Author


One thought I had which I think is beyond the scope of this is some sort of conditional branching/looping logic in a sycamore pipeline. Like

docset.do()
    .llm_map(<extract entity prompt>)
    .map(if not json set back to 'None')
    .while(lambda d: d.properties.get('entity', 'None') == 'None')
    .continue_processing()

I'm sure that creates all sorts of theoretical problems.

But then prompts can go back to being a single prompt and we can increment a counter on the document object and stuff.

I guess we can do the to-render-object counter thing and keep prompts to single-render in a for loop in llm_map, which is probably cleaner. I'll try it and see.

document_structure = ExtractDocumentStructure(self.plan, structure=structure, **kwargs)
return DocSet(self.context, document_structure)

@deprecated(version="0.1.31", reason="Use llm_map instead")
Contributor


Is the plan also to deprecate extract_properties?

Collaborator Author


The plan is to deprecate just about everything on my sprint tasks.

@bsowell
Contributor

bsowell commented Jan 29, 2025

Also, have you gone through the integ test failures to see if they are related? I did a cursory glance and a few did look llm related.

@HenryL27
Collaborator Author

HenryL27 commented Jan 29, 2025

Is there a reasonable path to replace the generate_old calls at some point, do you think?

The goal is for generate_old to be deleted entirely. git grep claims there are 28 uses (10 of them are unit test mocks). Most of the rest of them are in the path of the replacements I have planned. There are three more in sycamore query that I'm hoping should be relatively simple to switch over, and the constant warnings should annoy vinayak into doing it (or I can if it isn't done by the time I try to remove generate_old).

Also, have you gone through the integ test failures to see if they are related?

Yep. Of the 7 failing ITs I think 4 are directly related to these changes - with generate_old not quite handling images and response_formats correctly. My plan was to let those break (which I guess includes breaking the transforms) and then fix them when I dealt with their associated transforms. Maybe they need pytest skip marks then? Or maybe I need to actually fix them. Who's using the graph stuff anyways? SummarizeImages was the next thing I was gonna tackle though

Anything else we need to test to get confidence here?

The DocPrep ITs passed (except for the pinecone one, which gets rate-limited and dies). I can try to figure out how to test random things in AddDoc - probably should know that anyway. But overall I feel like passing the sycamore test suite (barring the 4 guys above) is probably good. Should make sure that AddDoc doesn't hit the bad paths in the failing ITs, though.

@HenryL27
Collaborator Author

Update: I was able to run AddDoc with DocStore using this.

Contributor

@bsowell bsowell left a comment


I think your plan seems reasonable. You'll have to keep an eye on things when you push this, as it potentially has a big blast radius, but I don't see any reason to wait.

@HenryL27 HenryL27 merged commit 73f59c0 into main Jan 30, 2025
13 of 15 checks passed