I have already generated watermarked data with the code sample below:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    SynthIDTextWatermarkingConfig,
)

# Standard model and tokenizer initialization
tokenizer = AutoTokenizer.from_pretrained('repo/id')
model = AutoModelForCausalLM.from_pretrained('repo/id')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57, ...],
    ngram_len=5,
)

# Generation with watermarking (return_tensors="pt" so generate()
# receives tensors rather than plain Python lists)
tokenized_prompts = tokenizer(["your prompts here"], return_tensors="pt")
output_sequences = model.generate(
    **tokenized_prompts,
    watermarking_config=watermarking_config,
    do_sample=True,
)
watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
Could you help me with the following:
How do I train the detector, and how do I detect the watermark?
My output text is at most 200 tokens long; can you suggest a threshold for detection?
There are a few detectors that do not need training; you could start with those and check whether the scores for watermarked and unwatermarked text follow different distributions. If the performance of the training-free detectors does not suffice, there is also a section of the colab that shows how to train a detector.
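As a concrete sketch of picking a threshold empirically (the score arrays and the 1% target false-positive rate here are illustrative assumptions, not part of the SynthID API; in practice you would substitute scores produced by whichever detector you use on held-out watermarked and unwatermarked completions):

```python
import numpy as np

# Hypothetical detector scores for held-out unwatermarked and
# watermarked texts (stand-ins for real detector output, e.g.
# mean g-value scores from a training-free detector).
scores_unwm = np.random.default_rng(0).normal(0.50, 0.02, size=1000)
scores_wm = np.random.default_rng(1).normal(0.56, 0.02, size=1000)

# Pick the threshold as the (1 - FPR) quantile of the unwatermarked
# scores, so roughly target_fpr of unwatermarked texts exceed it.
target_fpr = 0.01
threshold = np.quantile(scores_unwm, 1.0 - target_fpr)

# True-positive rate at that threshold on the watermarked held-out set.
tpr = float((scores_wm >= threshold).mean())
print(f"threshold={threshold:.3f}  TPR={tpr:.2%}")
```

For short outputs (your 200-token maximum), the score distributions overlap more than they would for long texts, so rather than a universal constant it is safer to calibrate the threshold this way on samples of your own length distribution and accept the TPR/FPR trade-off you measure.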