
Store Message Toxicity in database #553

Merged
merged 27 commits into from
Jan 14, 2023

Conversation

nil-andreu
Contributor

Implements the calculation of message toxicity in the workflow, as well as storing its value in the database.


except OasstError:
logger.error(
f"Could not compute toxicity for text reply to {interaction.message_id} with {interaction.text} by {interaction.user}."
Contributor Author

Need the trailing =, which is a newer feature in Python.
The = specifier expands an expression to the text of the expression, an equals sign, and then the representation of the evaluated expression.
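For illustration, a minimal example of the self-documenting f-string specifier (Python 3.8+), using made-up variable names:

```python
# The trailing "=" in an f-string replacement field (Python 3.8+) expands
# to the expression text, an equals sign, and the repr of its value.
user = "alice"
message_id = 42
print(f"{user=}, {message_id=}")  # user='alice', message_id=42

# The older, hand-written equivalent:
print(f"user={user!r}, message_id={message_id!r}")
```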

from oasst_shared.exceptions import OasstError, OasstErrorCode


class HF_url(str, Enum):
Contributor Author

Following the CamelCase convention used for saving the embeddings.

@@ -27,6 +27,7 @@ class Settings(BaseSettings):
)

HUGGING_FACE_API_KEY: str = ""
DEBUT_SKIP_TOXICITY_CALCULATION: bool = False
Contributor Author

DEBUG*

@@ -95,6 +95,7 @@ services:
- DEBUG_SKIP_API_KEY_CHECK=True
- DEBUG_USE_SEED_DATA=True
- MAX_WORKERS=1
- DEBUT_SKIP_TOXICITY_CALCULATION=False
Contributor Author

Set to True.

@nil-andreu nil-andreu marked this pull request as ready for review January 9, 2023 21:17
@andrewm4894
Collaborator

andrewm4894 commented Jan 9, 2023

Cool stuff! Am curious and have some questions:

  • Do we have plans to use the toxicity or message embeddings within the app such that we need them right away?
  • Do they add much in terms of resource overhead on the backend?
  • Do they add any latency or complexity that could affect user experience and flow?
  • Any cost considerations with the huggingface api in terms of scale and streaming vs batch usage of their api?

Mainly I am wondering if/why this needs to happen within the app and not as some sort of regular batch job so we can have more separation of concerns.

I am not super familiar with the backend or anything so asking out of ignorance and curiosity and a little bit as devil's advocate but with the best intentions :)

Collaborator

@yk yk left a comment

hey thanks, I left a few comments

re: self-documenting expressions, hard to believe, 2019 is now 4 years ago :D not so new anymore... https://docs.python.org/3/whatsnew/3.8.html#f-strings-support-for-self-documenting-expressions-and-debugging

@@ -25,7 +25,8 @@ async def get_text_toxicity(
ToxicityClassification: the score of toxicity of the message.
"""

api_url: str = HfUrl.HUGGINGFACE_TOXIC_ROBERTA.value
api_url: str = HfUrl.HUGGINGFACE_TOXIC_ROBERTA.value + "/" + HfModel.TOXIC_ROBERTA.value
Collaborator

I don't think this is correct. You'll end up with the model name twice

Contributor Author

Thanks! Yeah, I needed to separate the model name correctly.

text=interaction.text,
frontend_message_id=interaction.message_id,
user_frontend_message_id=interaction.user_message_id,
)

if not settings.DEBUG_SKIP_TOXICITY_CALCULATION:
save_toxicity(interaction, pr, new_message)
Collaborator

below for the embedding calculation, we implement this in a different way, i.e. we don't have a helper function in hugging_face, but delegate to prompt_repository. Is there a reason we cannot do it analogously here? I'm not saying one approach is better, but we should strive for consistency.

Contributor Author

Yes, this helper function in hugging_face ultimately uses the prompt_repository; I just separated the concerns to make the code more readable.

Contributor Author

For the creation of the REST API client for the embeddings calculation, I also refactored that code to use this helper function (which likewise uses the pr underneath).

Collaborator

yes, I saw that. the issue is, I would expect a function called save_X to take X as one of its arguments and then save it. But here, the function is actually mainly responsible for getting X, while the actual saving part is delegated to the prompt repository. so the prompt repository itself is perfectly capable of "saving X" in a single line of code, and I don't think we need additional indirection for that.

the bigger problem is this: the current change is not abstraction, it's just indirection & moving code around, making things more complicated. if anything should be abstracted, it's the actual computation part, but that is as of now simply placed into a different spot, it's not made easier. you can easily see how this design is suboptimal in the new PR you opened (#578) because there, you actually have to duplicate the code of building the URL and doing the http request into both your new helper function and the endpoint, because you obviously don't want to save in the endpoint. isn't the purpose of such a helper function to prevent you from having to do exactly that? on top of that, you're now missing error handling in the new PR.

I just don't see the current change as having positive impact. it moves control flow unnecessarily from one file to the next without gaining anything. I'd leave the calling of HF and storing as it is done below for the embedding, that's much clearer. If we want to refactor anything (in a separate PR), we can make a helper to make it easier to call HF endpoints and then use that from both branches

@@ -95,6 +95,7 @@ services:
- DEBUG_SKIP_API_KEY_CHECK=True
- DEBUG_USE_SEED_DATA=True
- MAX_WORKERS=1
- DEBUG_SKIP_TOXICITY_CALCULATION=False
Collaborator

set to true by default

Contributor Author

Okay thanks! Done

):
try:
model_name = HfModel.TOXIC_ROBERTA.value
hugging_face_api = HuggingFaceAPI(f"{HfUrl.HUGGINGFACE_TOXIC_ROBERTA.value}/{model_name}")
Collaborator

Again, I don't think that results in the correct endpoint (at least I get a "not found" when I go there)

Contributor Author

Solved thanks!

HUGGINGFACE_FEATURE_EXTRACTION = "https://api-inference.huggingface.co/pipeline/feature-extraction"


class HfModel(str, Enum):
Collaborator

In analogy to below, this should probably be called something like HfClassificationModel

Contributor Author

Done thanks!

Comment on lines 252 to 253
if None in (message_id, model, toxicity):
raise OasstError("Paramters missing to add toxicity", OasstErrorCode.GENERIC_ERROR)
Collaborator

I don't think you need these, since MessageToxicity's validators would catch this. Feel free to write a unit test for this, but I'd be surprised if not.
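A quick sketch of the reviewer's point, using a hypothetical stand-in model (not the project's actual class): for required, non-Optional fields, pydantic rejects None on its own, so a manual None check adds nothing.

```python
from pydantic import BaseModel, ValidationError

class MessageToxicityStub(BaseModel):
    # Required, non-Optional fields: pydantic itself rejects None here.
    message_id: str
    model: str
    score: float
    label: str

try:
    MessageToxicityStub(message_id="m1", model="toxic-roberta", score=None, label=None)
    rejected = False
except ValidationError as exc:
    rejected = True
    print(f"rejected with {len(exc.errors())} validation errors")
```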

@yk
Collaborator

yk commented Jan 10, 2023

  • Do we have plans to use the toxicity or message embeddings within the app such that we need them right away?

Not concrete plans, but the idea is that a (trusted) frontend could check dynamically whether some input violates the classifier.

  • Do they add much in terms of resource overhead on the backend?

Not really, beyond an open socket.

  • Do they add any latency or complexity that could affect to user experience and flow?

Maybe, we'll have to see.

  • Any cost considerations with huggingface api in terms of scale and streaming vs batch usage of thier api?

This would only be for real-time inference. I think we could still do batch computation for all stored things.

@nil-andreu
Contributor Author

Maybe batch processing for the messages for which we were not able to obtain the toxicity score. This is something I could work on after this PR.

@nil-andreu
Contributor Author

Have made a couple of final changes; tomorrow I will test them and make sure everything works correctly.

Collaborator

@yk yk left a comment

thank you, I've left some more comments

also, I'd just delete all alembic revisions and recreate, because it's a big mess

@@ -0,0 +1,21 @@
"""empty message
Collaborator

this revision is empty and can be deleted

@@ -0,0 +1,21 @@
"""empty message
Collaborator

this one too

@@ -0,0 +1,21 @@
"""empty message
Collaborator

this one too

@@ -8,10 +8,14 @@


class HfUrl(str, Enum):
HUGGINGFACE_TOXIC_ROBERTA = ("https://api-inference.huggingface.co/models/unitary/multilingual-toxic-xlm-roberta",)
HUGGINGFACE_TOXIC_ROBERTA = "https://api-inference.huggingface.co/models/unitary"
Collaborator

this string value has nothing to do with "toxic" or "roberta", so it shouldn't be named that

HUGGINGFACE_FEATURE_EXTRACTION = "https://api-inference.huggingface.co/pipeline/feature-extraction"


class HfClassificationModel(str, Enum):
TOXIC_ROBERTA = "multilingual-toxic-xlm-roberta"
Collaborator

if you look at the hub, the model is called "unitary/multilingual-toxic-xlm-roberta", right now the "unitary" is in the URL part above and it should be here.
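To make the suggested split concrete (a sketch, not the PR's final code; the enum member names are illustrative): the URL enum holds only the generic models base URL, and the model enum holds the full hub id, org included.

```python
from enum import Enum

class HfUrl(str, Enum):
    # Generic inference base URL; nothing model-specific in name or value.
    HUGGINGFACE_API_MODELS = "https://api-inference.huggingface.co/models"

class HfClassificationModel(str, Enum):
    # Full hub id, "unitary" org included, matching the name on the hub.
    TOXIC_ROBERTA = "unitary/multilingual-toxic-xlm-roberta"

def hf_model_url(model: HfClassificationModel) -> str:
    # One place to assemble the endpoint for any classification model.
    return f"{HfUrl.HUGGINGFACE_API_MODELS.value}/{model.value}"

print(hf_model_url(HfClassificationModel.TOXIC_ROBERTA))
```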

raise OasstError("Paramters missing to add toxicity", OasstErrorCode.GENERIC_ERROR)

message_toxicity = MessageToxicity(
message_id=message_id, model=model, score=toxicity.score, label=toxicity.label
Collaborator

see now you added label, but it's missing in the None check above. I'm still quite convinced the None check is unnecessary and the pydantic validators will catch it

f"{HfUrl.HUGGINGFACE_TOXIC_ROBERTA.value}/{model_name}"
)

toxicity: List[List[ToxicityClassification]] = await hugging_face_api.post(interaction.text)
Collaborator

post returns Any. Casting like this is really suboptimal, especially to pydantic objects.
Have you tested whether all of this works? If yes, we can leave it like this and fix later, but given that http calls return json as dicts, and later in your code you access properties of toxicity, it seems quite dangerous.
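One way to address this concern (a sketch; `parse_toxicity_response` is a hypothetical helper, not code from the PR): validate the raw JSON by constructing the pydantic models explicitly, instead of relying on an annotation that is erased at runtime.

```python
from typing import Any, List
from pydantic import BaseModel

class ToxicityClassification(BaseModel):
    # Mirrors the shape of one HF classification entry: label + score.
    label: str
    score: float

def parse_toxicity_response(payload: Any) -> List[List[ToxicityClassification]]:
    # An annotation at the call site leaves plain dicts in place;
    # constructing the models here actually validates the data.
    return [[ToxicityClassification(**item) for item in row] for row in payload]

raw = [[{"label": "toxic", "score": 0.97}]]  # what the HTTP call really returns
parsed = parse_toxicity_response(raw)
print(parsed[0][0].label, parsed[0][0].score)  # toxic 0.97
```

With this in place, attribute access on the results is safe, and malformed responses fail loudly at the parsing boundary rather than deep in later code.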

@nil-andreu
Contributor Author

nil-andreu commented Jan 11, 2023

Have changed the code based on the feedback. If anything needs to be changed, let me know! @yk

Collaborator

@andreaskoepf andreaskoepf left a comment

Looks good, I requested some minor changes (alembic revision & type annotation).


# revision identifiers, used by Alembic.
revision = "bcc2fe18d214"
down_revision = "20cd871f4ec7"
Collaborator

Could you please change this to the latest revision? Please check the latest revision here.

@@ -291,6 +292,26 @@ def store_ranking(self, ranking: protocol_schema.MessageRanking) -> Tuple[Messag

return reaction, task

def insert_toxicity(self, message_id: UUID, model: str, toxicity) -> MessageToxicity:
Collaborator

please provide a type annotation for the toxicity parameter, e.g. dict[str, Any]

Collaborator

@andreaskoepf andreaskoepf Jan 12, 2023

Since the function seems to expect a single float in toxicity["label"], it would be better to extract that value outside of the function and simply pass a float (then it would also match the docstring again ;-) ).
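A sketch of that suggestion (the `insert_toxicity` stand-in below is hypothetical; the real repository method builds and commits a MessageToxicity row): unpack the response at the call site so the function takes plain values matching its docstring.

```python
from uuid import UUID, uuid4

def insert_toxicity(message_id: UUID, model: str, score: float, label: str) -> dict:
    # Stand-in for the repository method; the real one constructs a
    # MessageToxicity row and commits it via the session.
    return {"message_id": message_id, "model": model, "score": score, "label": label}

response = [[{"label": "toxic", "score": 0.97}]]  # shape of the HF reply
top = response[0][0]  # unpack before calling, so the signature stays flat
row = insert_toxicity(uuid4(), "multilingual-toxic-xlm-roberta", top["score"], top["label"])
print(row["label"], row["score"])  # toxic 0.97
```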

Args:
message_id (UUID): the identifier of the message we want to save its toxicity score
model (str): the model used for creating the toxicity score
toxicity (float): the values obtained from the message & model
Collaborator

the doc string (float) type does not match how it is used below in line 308, e.g. it is probably a dict[str, Any]?

Collaborator

@andreaskoepf andreaskoepf left a comment

ready to be merged!

@andreaskoepf andreaskoepf enabled auto-merge (squash) January 14, 2023 12:22
@andreaskoepf andreaskoepf merged commit a902c60 into LAION-AI:main Jan 14, 2023