It would be nice to have an eval pipeline that continuously measures performance and cost gains across a set of LLMs and documents.
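To make the request concrete, here is a rough sketch of what the core loop could look like. Everything in it is an assumption for illustration, not an existing API: `ModelFn`, `ScoreFn`, `EvalSummary`, `run_eval`, and `compare` are hypothetical names, each model is assumed to be a plain callable that returns its output together with the cost of the call, and the quality metric is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical: a model takes a document and returns (output, cost of the call).
ModelFn = Callable[[str], Tuple[str, float]]
# Hypothetical: a scorer takes (document, output) and returns a quality score.
ScoreFn = Callable[[str, str], float]


@dataclass
class EvalSummary:
    model: str
    mean_score: float
    total_cost: float


def run_eval(
    models: Dict[str, ModelFn], documents: List[str], score: ScoreFn
) -> List[EvalSummary]:
    """Run every model over every document and aggregate score and cost per model."""
    if not documents:
        raise ValueError("need at least one document")
    summaries = []
    for name, model in models.items():
        scores, cost = [], 0.0
        for doc in documents:
            output, call_cost = model(doc)
            scores.append(score(doc, output))
            cost += call_cost
        summaries.append(EvalSummary(name, sum(scores) / len(scores), cost))
    return summaries


def compare(baseline: EvalSummary, candidate: EvalSummary) -> Dict[str, float]:
    """Gains of candidate over baseline; positive values are improvements."""
    return {
        "score_gain": candidate.mean_score - baseline.mean_score,
        "cost_saving": baseline.total_cost - candidate.total_cost,
    }
```

The "continuous" part would presumably come from running something like `run_eval` on a schedule or in CI and storing the summaries over time; that is left out of the sketch.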