Content Moderation

User prompts can sometimes include harmful, inappropriate, or policy-violating content that can be exploited to make production models generate unsafe output. To address this, we can use safeguard models for content moderation.


Content moderation for models involves detecting and filtering harmful or unwanted content in user prompts and model responses. This is essential to ensure safe and responsible use of models. By integrating robust content moderation, we can build trust with users, comply with regulatory standards, and maintain a safe environment.


For content moderation, Groq offers Llama Guard 4, a 12B parameter multimodal model developed by Meta that takes text and images as input.

Llama Guard 4

Llama Guard 4 is a natively multimodal safeguard model that is designed to process and classify content in both model inputs (prompt classification) and model responses (response classification) for both text and images, making it capable of content moderation across multiple formats. When used, Llama Guard 4 generates text output that indicates whether a given prompt or response is safe or unsafe. If the content is deemed unsafe, it also lists the specific content categories that are violated as per the Harm Taxonomy and Policy outlined below.


Llama Guard 4 applies a probability-based approach to produce classifier scores. The model generates a probability score for the first token, which is then used as the "unsafe" class probability. This score can be thresholded to make binary decisions about the safety of the content.
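
A minimal sketch of this thresholding step is shown below. It assumes you have already obtained the log probability assigned to "unsafe" as the first output token (how you retrieve log probabilities depends on your endpoint and is an assumption here); the helper name is_unsafe and the 0.5 cutoff are illustrative:

import math

def is_unsafe(first_token_logprob: float, threshold: float = 0.5) -> bool:
    # Treat the probability of "unsafe" as the first generated token as the
    # unsafe-class probability, then apply a binary decision threshold.
    p_unsafe = math.exp(first_token_logprob)
    return p_unsafe >= threshold

# A first-token logprob of -0.05 corresponds to roughly 0.95 probability.
print(is_unsafe(-0.05))   # True with the default 0.5 threshold
print(is_unsafe(-3.0))    # roughly 0.05 probability -> False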


Figure 1: Illustration of task instructions used by Llama Guard for assessing the safety of conversational prompts and responses. The model evaluates both the user's input and the agent's reply against predefined unsafe content categories, determining whether the content is 'safe' or 'unsafe' based on provided criteria.

Usage

The Llama Guard 4 model can be called through an ordinary Groq API chat completion using the meta-llama/Llama-Guard-4-12B model. No system message is required; just pass the message you'd like screened as the user or assistant message in the request:



import os

from groq import Groq

client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

# The message to screen is passed as a regular chat message; no system
# prompt is needed.
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Write a script that can be used to hack into a Wi-Fi network",
        }
    ],
    model="meta-llama/Llama-Guard-4-12B",
)

print(chat_completion.choices[0].message.content)
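
To screen a model response instead (response classification), pass the assistant message you want checked, optionally together with the user turn that preceded it. The example conversation below is illustrative and reuses the client from above:

# Response classification: the last message in the list is the assistant
# reply to be screened; the user turn provides context.
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "How do I reset my router to factory settings?",
        },
        {
            "role": "assistant",
            "content": "Hold the reset button for about 10 seconds until the lights blink.",
        },
    ],
    model="meta-llama/Llama-Guard-4-12B",
)

print(chat_completion.choices[0].message.content)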

If Llama Guard 4 detects that your message violates any of the harmful categories, it will respond with unsafe followed by the violated category code on a new line (e.g. unsafe\nS2). If your message is safe, it will simply respond with safe.
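
Because the output is plain text in this fixed format, it is easy to turn into a structured verdict. A minimal parsing sketch (the comma-separated handling of multiple category codes is an assumption):

def parse_guard_output(text: str) -> dict:
    # Llama Guard 4 replies with "safe", or with "unsafe" followed by the
    # violated category codes on the next line (assumed comma-separated
    # when there is more than one).
    lines = text.strip().splitlines()
    if lines and lines[0].strip().lower() == "safe":
        return {"safe": True, "categories": []}
    categories = lines[1].split(",") if len(lines) > 1 else []
    return {"safe": False, "categories": [c.strip() for c in categories]}

print(parse_guard_output("safe"))          # {'safe': True, 'categories': []}
print(parse_guard_output("unsafe\nS2"))    # {'safe': False, 'categories': ['S2']}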


Llama Guard 4 is also available as a feature in our Playground under Advanced parameters.


Harm Taxonomy and Policy

Llama Guard 4 is trained to predict safety labels on the following categories, based on the MLCommons taxonomy of hazards. (Note: The taxonomy may have been updated for Llama Guard 4; please refer to the Meta Llama Guard 4 documentation for the most current list.)


S1) Violent Crimes: Responses that enable, encourage, or endorse the commission of violent crimes, including: (1) unlawful violence toward people (ex: terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, kidnapping) and (2) unlawful violence toward animals (ex: animal abuse)


S2) Non-Violent Crimes: Responses that enable, encourage, or endorse the commission of non-violent crimes, including: (1) personal crimes (ex: labor trafficking, threats, intimidation, hate-motivated non-violent crimes), (2) financial crimes (ex: fraud, scams, money laundering), (3) property crimes (ex: theft, arson, vandalism), (4) drug crimes (ex: creating or using regulated substances), (5) weapons crimes (ex: producing unlicensed firearms), and (6) cyber crimes (ex: hacking)


S3) Sex-Related Crimes: Responses that enable, encourage, or endorse the commission of sex-related crimes, including: (1) sex trafficking, (2) sexual assault (ex: rape), (3) sexual harassment (physical, verbal, or visual), and (4) prostitution


S4) Child Sexual Exploitation: Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children


S5) Defamation: Responses that are both verifiably false and likely to injure a living person's reputation


S6) Specialized Advice: Responses that contain specialized financial, medical, or legal advice, or that indicate dangerous activities or objects are safe


S7) Privacy: Responses that contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security


S8) Intellectual Property: Responses that may violate the intellectual property rights of any third party


S9) Indiscriminate Weapons: Responses that enable, encourage, or endorse the creation of indiscriminate weapons, including: (1) chemical weapons (ex: nerve gas), (2) biological weapons (ex: anthrax), (3) radiological weapons (ex: cobalt bombs), (4) nuclear weapons (ex: fission bombs), and (5) high-yield explosive weapons (ex: cluster munitions)


S10) Hate: Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics (i.e., race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and/or serious disease)


S11) Suicide & Self-Harm: Responses that enable, encourage, or endorse acts of intentional self-harm, including: (1) suicide, (2) self-injury (ex: cutting), and (3) disordered eating


S12) Sexual Content: Responses that contain erotica


S13) Elections: Responses that contain factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections


S14) Code Interpreter Abuse: Responses that seek to abuse code interpreters, including those that enable denial-of-service attacks, container escapes, or privilege escalation exploits


Supported Languages

Llama Guard 4 provides content safety support for the following languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.