
Store Message Toxicity in database #553

Merged
merged 27 commits into from
Jan 14, 2023

Conversation

nil-andreu
Contributor

Implements the calculation of message toxicity in the workflow, as well as storing its value in the database.


except OasstError:
logger.error(
f"Could not compute toxicity for text reply to {interaction.message_id} with {interaction.text} by {interaction.user}."
Contributor Author

Need the trailing =, which is a newer feature in Python.
The = specifier expands an expression to the text of the expression, an equals sign, and then the representation of the evaluated expression.
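For illustration, a minimal example of the self-documenting f-string specifier (Python 3.8+), using made-up variable names:

```python
# The trailing "=" in an f-string replacement field (Python 3.8+) expands
# to the expression text, an equals sign, and the repr of its value.
user = "alice"
message_id = 42
print(f"{user=}, {message_id=}")  # user='alice', message_id=42

# The older, hand-written equivalent:
print(f"user={user!r}, message_id={message_id!r}")
```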

from oasst_shared.exceptions import OasstError, OasstErrorCode


class HF_url(str, Enum):
Contributor Author

Following the CamelCase convention used for saving the embeddings.

@@ -27,6 +27,7 @@ class Settings(BaseSettings):
)

HUGGING_FACE_API_KEY: str = ""
DEBUT_SKIP_TOXICITY_CALCULATION: bool = False
Contributor Author

DEBUG*

@@ -95,6 +95,7 @@ services:
- DEBUG_SKIP_API_KEY_CHECK=True
- DEBUG_USE_SEED_DATA=True
- MAX_WORKERS=1
- DEBUT_SKIP_TOXICITY_CALCULATION=False
Contributor Author

Set to True.

@nil-andreu nil-andreu marked this pull request as ready for review January 9, 2023 21:17
@andrewm4894
Collaborator

andrewm4894 commented Jan 9, 2023

Cool stuff! Am curious and have some questions:

  • Do we have plans to use the toxicity or message embeddings within the app such that we need them right away?
  • Do they add much in terms of resource overhead on the backend?
  • Do they add any latency or complexity that could affect user experience and flow?
  • Any cost considerations with the huggingface api in terms of scale and streaming vs batch usage of their api?

Mainly I am wondering if/why this needs to happen within the app and not as some sort of regular batch job so we can have more separation of concerns.

I am not super familiar with the backend or anything so asking out of ignorance and curiosity and a little bit as devil's advocate but with the best intentions :)

Collaborator

@yk yk left a comment

hey thanks, I left a few comments

re: self-documenting expressions, hard to believe, 2019 is now 4 years ago :D not so new anymore... https://docs.python.org/3/whatsnew/3.8.html#f-strings-support-for-self-documenting-expressions-and-debugging

@@ -25,7 +25,8 @@ async def get_text_toxicity(
ToxicityClassification: the score of toxicity of the message.
"""

api_url: str = HfUrl.HUGGINGFACE_TOXIC_ROBERTA.value
api_url: str = HfUrl.HUGGINGFACE_TOXIC_ROBERTA.value + "/" + HfModel.TOXIC_ROBERTA.value
Collaborator

I don't think this is correct. You'll end up with the model name twice

Contributor Author

Thanks! Yeah, I needed to separate the model name correctly.

text=interaction.text,
frontend_message_id=interaction.message_id,
user_frontend_message_id=interaction.user_message_id,
)

if not settings.DEBUG_SKIP_TOXICITY_CALCULATION:
save_toxicity(interaction, pr, new_message)
Collaborator

below for the embedding calculation, we implement this in a different way, i.e. we don't have a helper function in hugging_face, but delegate to prompt_repository. Is there a reason we cannot do it analogously here? I'm not saying one approach is better, but we should strive for consistency.

Contributor Author

Yes, this helper function in hugging_face ultimately uses the prompt_repository; I just separated the concerns to make the code more readable.

Contributor Author

For the creation of the REST API client for the embeddings calculation, I also refactored that code to use this helper function (which likewise uses the pr underneath).

Collaborator

yes, I saw that. the issue is, I would expect a function called save_X to take X as one of its arguments and then save it. But here, the function is actually mainly responsible for getting X, while the actual saving part is delegated to the prompt repository. so the prompt repository itself is perfectly capable of "saving X" in a single line of code, and I don't think we need additional indirection for that.

the bigger problem is this: the current change is not abstraction, it's just indirection & moving code around, making things more complicated. if anything should be abstracted, it's the actual computation part, but that is as of now simply placed into a different spot, it's not made easier. you can easily see how this design is suboptimal in the new PR you opened (#578) because there, you actually have to duplicate the code of building the URL and doing the http request into both your new helper function and the endpoint, because you obviously don't want to save in the endpoint. isn't the purpose of such a helper function to prevent you from having to do exactly that? on top of that, you're now missing error handling in the new PR.

I just don't see the current change as having positive impact. it moves control flow unnecessarily from one file to the next without gaining anything. I'd leave the calling of HF and storing as it is done below for the embedding, that's much clearer. If we want to refactor anything (in a separate PR), we can make a helper to make it easier to call HF endpoints and then use that from both branches

@@ -95,6 +95,7 @@ services:
- DEBUG_SKIP_API_KEY_CHECK=True
- DEBUG_USE_SEED_DATA=True
- MAX_WORKERS=1
- DEBUG_SKIP_TOXICITY_CALCULATION=False
Collaborator

set to true by default

Contributor Author

Okay thanks! Done

):
try:
model_name = HfModel.TOXIC_ROBERTA.value
hugging_face_api = HuggingFaceAPI(f"{HfUrl.HUGGINGFACE_TOXIC_ROBERTA.value}/{model_name}")
Collaborator

Again, I don't think that results in the correct endpoint (at least I get a "not found" when I go there)

Contributor Author

Solved thanks!

HUGGINGFACE_FEATURE_EXTRACTION = "https://api-inference.huggingface.co/pipeline/feature-extraction"


class HfModel(str, Enum):
Collaborator

In analogy to below, this should probably be called something like HfClassificationModel

Contributor Author

Done thanks!

Comment on lines 252 to 253
if None in (message_id, model, toxicity):
raise OasstError("Paramters missing to add toxicity", OasstErrorCode.GENERIC_ERROR)
Collaborator

I don't think you need these, since MessageToxicity's validators would catch this. Feel free to write a unit test for this, but I'd be surprised if not.
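A quick sketch of the reviewer's point, using a hypothetical stand-in model (not the project's actual class): for required, non-Optional fields, pydantic rejects None on its own, so a manual None check adds nothing.

```python
from pydantic import BaseModel, ValidationError

class MessageToxicityStub(BaseModel):
    # Required, non-Optional fields: pydantic itself rejects None here.
    message_id: str
    model: str
    score: float
    label: str

try:
    MessageToxicityStub(message_id="m1", model="toxic-roberta", score=None, label=None)
    rejected = False
except ValidationError as exc:
    rejected = True
    print(f"rejected with {len(exc.errors())} validation errors")
```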

@yk
Collaborator

yk commented Jan 10, 2023

  • Do we have plans to use the toxicity or message embeddings within the app such that we need them right away?

Not concrete plans, but the idea is that a (trusted) frontend could check dynamically whether some input violates the classifier.

  • Do they add much in terms of resource overhead on the backend?

Not really, beyond an open socket.

  • Do they add any latency or complexity that could affect to user experience and flow?

Maybe, we'll have to see.

  • Any cost considerations with huggingface api in terms of scale and streaming vs batch usage of thier api?

This would only be for real-time inference. I think we could still do batch computation for all stored things.

@nil-andreu
Contributor Author

Maybe batch processing for the messages for which we were not able to obtain the toxicity score. This is something I could work on after this PR.

@nil-andreu
Contributor Author

Have made a couple of final changes; tomorrow I will test them and make sure everything works correctly.

Collaborator

@yk yk left a comment

thank you, I've left some more comments

also, I'd just delete all alembic revisions and recreate, because it's a big mess

@@ -0,0 +1,21 @@
"""empty message
Collaborator

this revision is empty and can be deleted

@@ -0,0 +1,21 @@
"""empty message
Collaborator

this one too

@@ -0,0 +1,21 @@
"""empty message
Collaborator

this one too

@@ -8,10 +8,14 @@


class HfUrl(str, Enum):
HUGGINGFACE_TOXIC_ROBERTA = ("https://api-inference.huggingface.co/models/unitary/multilingual-toxic-xlm-roberta",)
HUGGINGFACE_TOXIC_ROBERTA = "https://api-inference.huggingface.co/models/unitary"
Collaborator

this string value has nothing to do with "toxic" or "roberta", so it shouldn't be named that

HUGGINGFACE_FEATURE_EXTRACTION = "https://api-inference.huggingface.co/pipeline/feature-extraction"


class HfClassificationModel(str, Enum):
TOXIC_ROBERTA = "multilingual-toxic-xlm-roberta"
Collaborator

if you look at the hub, the model is called "unitary/multilingual-toxic-xlm-roberta", right now the "unitary" is in the URL part above and it should be here.
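To make the suggested split concrete (a sketch, not the PR's final code; the enum member names are illustrative): the URL enum holds only the generic models base URL, and the model enum holds the full hub id, org included.

```python
from enum import Enum

class HfUrl(str, Enum):
    # Generic inference base URL; nothing model-specific in name or value.
    HUGGINGFACE_API_MODELS = "https://api-inference.huggingface.co/models"

class HfClassificationModel(str, Enum):
    # Full hub id, "unitary" org included, matching the name on the hub.
    TOXIC_ROBERTA = "unitary/multilingual-toxic-xlm-roberta"

def hf_model_url(model: HfClassificationModel) -> str:
    # One place to assemble the endpoint for any classification model.
    return f"{HfUrl.HUGGINGFACE_API_MODELS.value}/{model.value}"

print(hf_model_url(HfClassificationModel.TOXIC_ROBERTA))
```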

raise OasstError("Paramters missing to add toxicity", OasstErrorCode.GENERIC_ERROR)

message_toxicity = MessageToxicity(
message_id=message_id, model=model, score=toxicity.score, label=toxicity.label
Collaborator

see now you added label, but it's missing in the None check above. I'm still quite convinced the None check is unnecessary and the pydantic validators will catch it

f"{HfUrl.HUGGINGFACE_TOXIC_ROBERTA.value}/{model_name}"
)

toxicity: List[List[ToxicityClassification]] = await hugging_face_api.post(interaction.text)
Collaborator

post returns Any. Casting like this is really suboptimal, especially to pydantic objects.
Have you tested whether all of this works? If yes, we can leave it like this and fix later, but given that http calls return json as dicts, and later in your code you access properties of toxicity, it seems quite dangerous.
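One way to address this concern (a sketch; `parse_toxicity_response` is a hypothetical helper, not code from the PR): validate the raw JSON by constructing the pydantic models explicitly, instead of relying on an annotation that is erased at runtime.

```python
from typing import Any, List
from pydantic import BaseModel

class ToxicityClassification(BaseModel):
    # Mirrors the shape of one HF classification entry: label + score.
    label: str
    score: float

def parse_toxicity_response(payload: Any) -> List[List[ToxicityClassification]]:
    # An annotation at the call site leaves plain dicts in place;
    # constructing the models here actually validates the data.
    return [[ToxicityClassification(**item) for item in row] for row in payload]

raw = [[{"label": "toxic", "score": 0.97}]]  # what the HTTP call really returns
parsed = parse_toxicity_response(raw)
print(parsed[0][0].label, parsed[0][0].score)  # toxic 0.97
```

With this in place, attribute access on the results is safe, and malformed responses fail loudly at the parsing boundary rather than deep in later code.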

@nil-andreu
Contributor Author

nil-andreu commented Jan 11, 2023

Have changed the code based on the feedback. If anything needs to be changed, let me know! @yk

Collaborator

@andreaskoepf andreaskoepf left a comment

Looks good, I requested some minor changes (alembic revision & type annotation).


# revision identifiers, used by Alembic.
revision = "bcc2fe18d214"
down_revision = "20cd871f4ec7"
Collaborator

Could you please change this to the latest revision? Please check the latest revision here.

@@ -291,6 +292,26 @@ def store_ranking(self, ranking: protocol_schema.MessageRanking) -> Tuple[Messag

return reaction, task

def insert_toxicity(self, message_id: UUID, model: str, toxicity) -> MessageToxicity:
Collaborator

please provide a type annotation for the toxicity parameter, e.g. dict[str, Any]

Collaborator

@andreaskoepf andreaskoepf Jan 12, 2023

Since the function seems to expect a single float in toxicity["label"], it would be better to extract that value outside of the function and simply pass a float (then it would also match the docstring again ;-) ).
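A sketch of that suggestion (the `insert_toxicity` stand-in below is hypothetical; the real repository method builds and commits a MessageToxicity row): unpack the response at the call site so the function takes plain values matching its docstring.

```python
from uuid import UUID, uuid4

def insert_toxicity(message_id: UUID, model: str, score: float, label: str) -> dict:
    # Stand-in for the repository method; the real one constructs a
    # MessageToxicity row and commits it via the session.
    return {"message_id": message_id, "model": model, "score": score, "label": label}

response = [[{"label": "toxic", "score": 0.97}]]  # shape of the HF reply
top = response[0][0]  # unpack before calling, so the signature stays flat
row = insert_toxicity(uuid4(), "multilingual-toxic-xlm-roberta", top["score"], top["label"])
print(row["label"], row["score"])  # toxic 0.97
```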

Args:
message_id (UUID): the identifier of the message we want to save its toxicity score
model (str): the model used for creating the toxicity score
toxicity (float): the values obtained from the message & model
Collaborator

the doc string (float) type does not match how it is used below in line 308, e.g. it is probably a dict[str, Any]?

Collaborator

@andreaskoepf andreaskoepf left a comment

ready to be merged!

@andreaskoepf andreaskoepf enabled auto-merge (squash) January 14, 2023 12:22
@andreaskoepf andreaskoepf merged commit a902c60 into LAION-AI:main Jan 14, 2023