Abstract
Recent years have witnessed the success of AIGC (AI-Generated Content). With a pre-trained diffusion model, users can generate high-quality images or freely modify existing pictures using only prompts in natural language. More excitingly, emerging personalization techniques make it feasible to create images of a specific desired concept with only a few reference images. However, this also poses severe threats if such an advanced technique is misused by malicious users, e.g., to spread fake news or defame individual reputations. Thus, it is necessary to regulate personalization models (i.e., concept censorship) to support their development and advancement.
In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), a specially crafted word embedding that contains detailed information about a specific object. TI is becoming prevalent for its lightweight nature and excellent performance: users with Stable Diffusion deployed can easily download such a word embedding from websites like [1] and add it to their own model without fine-tuning. To achieve concept censorship of TI, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, during the training of TI we select certain sensitive words as triggers, which are thereby censored from normal use. In the subsequent generation stage, if a trigger is combined with the personalized embedding in the final prompt, the model will output a pre-defined target image rather than images containing the desired concept.
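At a high level, the censored embedding can be viewed as jointly optimizing the standard Textual Inversion denoising loss and a backdoor term. The formulation below is an illustrative sketch in our own notation; the specific loss form, the trigger prompt $y_{\text{trig}}$, the target image $x^{\text{tar}}$, and the weight $\lambda$ are assumptions for exposition rather than details stated in this abstract:

$$
v^{*} \;=\; \arg\min_{v}\;
\underbrace{\mathbb{E}_{x \sim \mathcal{D}_c,\, \epsilon,\, t}\!\left[\, \lVert \epsilon - \epsilon_{\theta}\!\left(x_t,\, t,\, c_{\theta}(y(v))\right) \rVert_2^{2} \,\right]}_{\text{utility: learn the personalized concept}}
\;+\;
\lambda\,
\underbrace{\mathbb{E}_{\epsilon,\, t}\!\left[\, \lVert \epsilon - \epsilon_{\theta}\!\left(x^{\text{tar}}_t,\, t,\, c_{\theta}(y_{\text{trig}}(v))\right) \rVert_2^{2} \,\right]}_{\text{backdoor: trigger combined with } v \text{ maps to the target image}}
$$

where $v$ is the learnable TI embedding, $\mathcal{D}_c$ the reference images of the concept, $\epsilon_\theta$ the frozen denoising network with text encoder $c_\theta$, $y(v)$ a benign prompt containing the embedding, $y_{\text{trig}}(v)$ a prompt combining the embedding with a censored trigger word, and $x^{\text{tar}}_t$ the noised target image at timestep $t$.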
To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevailing open-source text-to-image model. The results show that our method is capable of preventing Textual Inversion from cooperating with censored words while preserving its pristine utility. Furthermore, we demonstrate that the proposed method can resist potential countermeasures. Ample ablation studies are also conducted to verify our design.