What exactly is Clip skip in Stable Diffusion?
The CLIP model (the text embedding model present in 1.x models) has a structure that is composed of layers. Each layer is more specific than the last. For example, if layer 1 is "Person", then layer 2 could be "male" and "female"; then, if you go down the path of "male", layer 3 could be: man, boy, lad, father, grandpa, etc. Note that this is not exactly how the CLIP model is structured; it is only for the sake of example.
The 1.5 model, for example, is 12 layers deep, where the 12th layer is the last layer of the text embedding. Each layer is a matrix of some size, and each layer has additional matrices under it. So a 4x4 first layer has four 4x4 matrices under it, and so on and so forth. So the text space is dimensionally huge.
Now why would you want to stop earlier in the CLIP layers? Well, if you want a picture of "a cow", you might not care about the subcategories of "cow" the text model might have, especially since these can have varying degrees of quality. So if you want "a cow", you might not want "an Aberdeen Angus bull".
You can imagine CLIP skip to basically be a setting for "how accurate you want the text model to be". You can test it out with the XY script, for example. You can see that each CLIP stage adds more definition to the description. So if you have a detailed prompt about a young man standing in a field, with lower CLIP stages you'd get a picture of "a man standing", then deeper "young man standing", then "young man standing in a field", etc.
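To make the "stop earlier" idea concrete, here is a minimal toy sketch in Python. It is not the real CLIP code: the layers are made-up stand-ins (`make_toy_layer` and `encode` are hypothetical names), and it assumes the common web UI convention that a CLIP skip of 1 uses the last layer and a CLIP skip of 2 stops at the second-to-last layer. It only shows how the setting picks which layer's output gets used.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]

NUM_LAYERS = 12  # the 1.5 text model is described above as 12 layers deep


def make_toy_layer(index: int) -> Layer:
    """Return a stand-in layer. A real CLIP layer is far more complex."""
    def layer(hidden: Vector) -> Vector:
        # Hypothetical transformation, chosen only so that the output of
        # each layer is easy to tell apart from the others.
        return [value + index for value in hidden]
    return layer


def encode(hidden: Vector, layers: List[Layer], clip_skip: int = 1) -> Vector:
    """Run the layer stack, stopping (clip_skip - 1) layers before the end.

    Assumed convention: clip_skip=1 uses every layer, clip_skip=2 stops at
    the second-to-last layer, and so on.
    """
    if not 1 <= clip_skip <= len(layers):
        raise ValueError(f"clip_skip must be between 1 and {len(layers)}")
    last_layer_used = len(layers) - (clip_skip - 1)
    for layer in layers[:last_layer_used]:
        hidden = layer(hidden)
    return hidden


if __name__ == "__main__":
    toy_layers = [make_toy_layer(i) for i in range(1, NUM_LAYERS + 1)]
    # A tiny sweep, in the spirit of comparing settings side by side.
    for clip_skip in (1, 2, 3):
        print(clip_skip, encode([0.0], toy_layers, clip_skip))
```

With this toy stack, a higher `clip_skip` simply means fewer layers are applied to the input, which is the same trade-off the cow example describes: less refinement from the deepest layers.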
CLIP skip really becomes useful when you use models that are structured in a special way, like Booru models, where the "1girl" tag can break down into many sub-tags that connect to that one major tag. Whether you get any use out of CLIP skip is really just trial and error.
Now keep in mind that CLIP skip only works in models that use CLIP or are based on models that use CLIP, as in 1.x models and their derivatives. 2.0 models and their derivatives do not interact with CLIP because they use OpenCLIP.
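It took some time to find a suitable, easy-to-understand explanation, but it was worth it.
This answer is reposted from https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/5674
Take a look if you are interested.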