
Researching evaluations #9

Open

irthomasthomas opened this issue May 2, 2024 · 0 comments
irthomasthomas commented May 2, 2024

Hi Simon,

In case it's useful, I am collecting research notes on LLM evaluation frameworks for my own needs. I have collected over 11k prompts in my llm CLI database, and I would love to mine those for evaluation data. My own evaluation needs are complex: I am trying to evaluate multi-agent interactions, so I need to score each response as well as the overall conversation and its progress towards a goal. My first attempt at automated evals using gemini-1.5-pro-latest just produced random scores; running the same test multiple times gives different scores, even at temperature 0. Opus is a lot better, so I created a grading rubric and had Opus grade 10 examples and justify each one. My hope is that I can take advantage of Gemini's context length and feed it the graded samples from Opus to improve its grading. Opus is too expensive to use for the whole project.
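For concreteness, here is a minimal sketch of that grade-with-Opus-then-calibrate-Gemini idea using the llm Python API. The model IDs (which depend on the installed llm plugins), the rubric text, and the sample data are placeholders; the real rubric and multi-agent transcripts are more involved.

```python
# Minimal sketch: have an expensive model grade a few samples against a rubric
# with justifications, then reuse those graded examples as few-shot calibration
# context for a cheaper long-context model. Model IDs depend on which llm
# plugins are installed; the rubric and samples here are placeholders.
import llm

RUBRIC = (
    "Score the response from 1-5 for relevance, correctness, and progress "
    "towards the stated goal. Justify the score in one short paragraph, then "
    "finish with a line of the form: SCORE: <n>"
)

def grade(model_id: str, goal: str, response: str, examples: str = "") -> str:
    """Ask model_id to grade a single response against RUBRIC."""
    model = llm.get_model(model_id)
    prompt = f"{examples}Goal: {goal}\n\nResponse to grade:\n{response}"
    return model.prompt(prompt, system=RUBRIC).text()

samples = [
    ("summarise the issue thread", "The thread discusses evaluation frameworks..."),
    ("propose next steps", "We should first build a labelled set of 50 conversations..."),
]

# 1. Expensive model grades a handful of samples, with justifications.
opus_grades = [
    f"Goal: {goal}\nResponse: {resp}\nGrade:\n{grade('claude-3-opus', goal, resp)}"
    for goal, resp in samples
]

# 2. Cheaper long-context model grades new responses, with the Opus-graded
#    examples prepended as calibration context.
calibration = "\n\n---\n\n".join(opus_grades) + "\n\n---\n\n"
print(grade("gemini-1.5-pro-latest", "propose next steps", "Let's just ship it.",
            examples=calibration))
```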

Before I go too far with that, I wanted to collect some research notes. I keep my bookmarks and notes in GitHub issues, and I embed them using your llm CLI and a Jina embeddings model.
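Roughly, the embedding side looks like the sketch below, using the llm embeddings Python API with sqlite-utils for storage. The collection name, file names, and the Jina model ID (provided by the llm-embed-jina plugin) are illustrative rather than the exact setup.

```python
# Rough sketch of embedding issue notes and querying them with the llm
# embeddings API. Names and the Jina model ID are illustrative.
import llm
import sqlite_utils

db = sqlite_utils.Database("issue-notes.db")
collection = llm.Collection("issue-notes", db, model_id="jina-embeddings-v2-base-en")

notes = [
    ("issue-823", "GPTScore: Evaluate as You Desire - notes on GPT-3 results"),
    ("issue-900", "Rubric-based grading of multi-agent conversations"),
]

# Embed (id, text) pairs and store the original text alongside the vectors.
collection.embed_multi(notes, store=True)

# Retrieve the notes most similar to a query.
for entry in collection.similar("LLM evaluation frameworks", number=2):
    print(entry.id, round(entry.score, 3))
```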

To help me read papers, I just generated a quick app: https://github.com/irthomasthomas/clipnotes. It monitors the clipboard while I read and collects the copied items to a file for further processing.
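The idea is roughly the following (a generic sketch, not the actual clipnotes code; pyperclip is just one way to read the clipboard):

```python
# Generic sketch of the clipboard-watching idea behind clipnotes (not its
# actual implementation): poll the clipboard and append anything new to a
# notes file for later processing.
import time
import pyperclip  # assumed dependency; the real app may use something else

def watch(outfile: str = "clips.txt", interval: float = 1.0) -> None:
    last = pyperclip.paste()
    while True:
        time.sleep(interval)
        current = pyperclip.paste()
        if current and current != last:
            with open(outfile, "a", encoding="utf-8") as f:
                f.write(current.strip() + "\n\n")
            last = current

if __name__ == "__main__":
    watch()
```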

The first paper is GPTScore: Evaluate as You Desire. It includes tests with some older encoder-decoder architectures, which are rarely used today, so I ignored those and focused on the results for the GPT-3 models.

irthomasthomas/undecidability#823

You might find some other useful links under the llm-evaluation label.

Ta,
Thomas
