Hate speech is no longer generated only by individuals; it is now also produced by AI systems. We propose to explore the differences between human- and AI-generated hate speech and to develop a model capable of distinguishing between the two. Our research questions are:
- Can we accurately classify whether implicit hate speech is generated by humans or by AI?
- Which linguistic features are indicative of human- vs. AI-generated hate speech?
- Can open-source LLMs help us improve classification accuracy (e.g., by generating additional training data to address the class-imbalance problem)?
- Can a similarity metric tell us how closely AI-generated data resembles human-written data?
- How should we sample the data? Down-sampling throws away data, but a forced 50/50 hate/non-hate split is also problematic because it does not reflect real-world hate distributions.
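For the similarity question above, one simple lexical baseline is cosine similarity over bag-of-words token counts; in practice, sentence embeddings would likely be more informative. The sketch below is illustrative only (the function name and whitespace tokenization are our own choices, not from any specific library):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words token counts of two texts.

    Returns a value in [0, 1]: 1.0 for identical token distributions,
    0.0 when the texts share no tokens.
    """
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    # Dot product over the shared vocabulary.
    dot = sum(a[tok] * b[tok] for tok in set(a) & set(b))
    # Product of the two count-vector norms.
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Averaging this score over sampled human/AI text pairs would give a crude first estimate of how lexically similar the two sources are.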
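For the sampling question, one alternative to down-sampling or a forced 50/50 split is to keep the natural label distribution and instead weight the loss by inverse class frequency (this mirrors scikit-learn's "balanced" heuristic). A minimal sketch, with an illustrative function name of our own:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency.

    Rare classes get large weights, common classes small ones, so a
    weighted loss can preserve the real-world hate/non-hate skew
    without discarding majority-class examples.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}
```

With 90 non-hate and 10 hate examples, the hate class receives weight 5.0 and non-hate roughly 0.56, so each class contributes equally to the loss in expectation.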
We will fine-tune pre-trained models for classification and use open-source LLMs for data generation, drawing on the following datasets:
- ToxiGen: a dataset of AI-generated toxic and hate speech targeting 13 groups, with 27.5k human-validated rows (Hartvigsen et al., 2022).
- Implicit Hate: the ElSherief et al. (2021) dataset of 22,056 tweets from prominent U.S. extremist groups, of which 6,346 are labeled as containing implicit hate speech, with fine-grained annotations for each message and its implications.
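Because ToxiGen is machine-generated while the Implicit Hate corpus is human-written, one straightforward way to build the human-vs-AI training set is to merge the two sources under an origin label. A minimal sketch, assuming the texts have already been loaded into Python lists (the function name and label convention are our own):

```python
import random

def build_origin_dataset(human_texts, ai_texts, seed=0):
    """Merge human-written and AI-generated examples into one shuffled
    list of (text, label) pairs for origin classification.

    Label convention (ours): 0 = human-written, 1 = AI-generated.
    A fixed seed keeps the shuffle reproducible across runs.
    """
    data = [(text, 0) for text in human_texts] + \
           [(text, 1) for text in ai_texts]
    random.Random(seed).shuffle(data)
    return data
```

In the actual pipeline, the human texts would come from the Implicit Hate tweets and the AI texts from ToxiGen rows; any per-example metadata (target group, implied statement) could be carried along for the linguistic-feature analysis.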
References:
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (Hartvigsen et al., ACL 2022)
- Latent Hatred: A Benchmark for Understanding Implicit Hate Speech (ElSherief et al., EMNLP 2021)