ChatGPT Moderation: How can AIGC align with human values? How can it identify PARSNIP language and ecological/destructive discourse?
ChatGPT API Moderation model
To keep artificial intelligence aligned with healthy human values, ChatGPT includes a moderation model (Moderation Model) whose purpose is to identify malicious language and instructions involving pornography, violence, insults, vulgarity, and the like. This goal dovetails with the need to screen PARSNIP language in English-language teaching (note: PARSNIP refers to the sensitive topics of politics, alcohol, religion, sex, narcotics, -isms, and pork), with distinguishing ecological from destructive discourse in ecological discourse analysis, and with identifying ideology (values) in critical discourse analysis. The model therefore has strong application potential in language teaching, discourse research, and the development of teaching materials. For that reason, the following article is reposted here in the hope that it will be helpful.
This article explains what the ChatGPT API Moderation model is, describes the seven categories it uses, and shows how to call the API and interpret its results.
ChatGPT API Moderation model
The OpenAI API makes it possible to classify any text, via a binary classification, to check whether it complies with OpenAI's usage policies. This classification is provided by the Moderation model, which can be called through the `openai` Python library.
Seven categories are used in the OpenAI model: hate, hate/threatening, self-harm, sexual, sexual/minors, violence, and violence/graphic.
One can use them to filter any inappropriate content, such as comments on a website or user inputs in chatbot requests.
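As a minimal sketch of this filtering use case (the helper name `is_allowed` is ours, and the sample response is hand-written to match the response shape described in the next section), a comment could be screened like this:

```python
def is_allowed(moderation_response):
    """Return True when the moderated text may be published.

    Assumes `moderation_response` follows the documented shape of the
    Moderation endpoint's JSON answer: a "results" list whose first
    element carries a boolean "flagged" field.
    """
    return not moderation_response["results"][0]["flagged"]


# Illustrative responses (values made up, not real API output).
harmless = {"results": [{"flagged": False}]}
violating = {"results": [{"flagged": True}]}

print(is_allowed(harmless))   # True  -> publish the comment
print(is_allowed(violating))  # False -> reject the comment
```

In a real deployment, the dictionaries above would come from the Moderation endpoint rather than being built by hand.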

OpenAI API Moderation method
The moderation classification is invoked through the method `openai.Moderation.create`.
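As a sketch of the call (assuming the legacy, pre-1.0 `openai` Python library and an `OPENAI_API_KEY` set in the environment; the wrapper name `classify` is ours):

```python
def classify(text):
    """Send `text` to the Moderation endpoint and return its `flagged` boolean.

    Assumes the legacy (pre-1.0) `openai` Python library and that the
    OPENAI_API_KEY environment variable is set. The import is kept inside
    the function so the sketch reads without the library installed.
    """
    import openai  # pip install "openai<1.0"

    response = openai.Moderation.create(input=text)
    return response["results"][0]["flagged"]
```

Calling `classify("I love chocolate")` would return `False` according to the example discussed later in the article.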
The answer is a JSON object containing:
model: the model currently used, called "text-moderation-004".
results: a list in which each entry contains:
categories: a binary classification for each of the seven categories:
true if the input text violates the given category,
false if it does not.
category_scores: a score computed for each category. It is not a probability: the lower the score, the safer the content; the higher the score, the more strongly it violates the category.
flagged: the final classification of the input:
"false" if the input text does not violate OpenAI's policies,
"true" if it does; if at least one category is true, this flag is set to true as well.
Moderation API Call
Standard Call
The classification of the prompt “I love chocolate” is “false”, meaning it does not violate any of the above categories.
Here is the detailed output:
All scores are very low, so all of the categories are "false".
Call with a violation
The prompt given in the following request is just for illustration; it is not a personal opinion.
The output is "true", meaning there is a violation. This is because the input violates the first category, "hate", with a score of 0.52, while the other categories all show very low scores.
Some variants
When the input describes a personal belief, the classification is correct. However, when it expresses a general opinion, the model does not classify it as violating the policies.
Here is an example where the classification is "false" even though the input has a negative connotation:
Here is another variant, where a simple comma can change the score considerably (the classification in both cases is "true"):
The score is about 0.66
Here, with a simple comma added, the score is about 0.954:
Summary
In this article, you have learned how to use the ChatGPT API Moderation model, which you can deploy in your own project or website to filter out inputs or comments that violate OpenAI's usage policies.
I hope you enjoyed reading the article. Please leave a "SanLian" (a like, coin, and share) :-)
The English portion of this article is reposted from: https://machinelearning-basics.com/chatgpt-api-moderation-model/