【ChatGPT】Visual ChatGPT 视觉模型深度解析

说明：
根据有关要求，本文将【Visual ChatGPT】模型简称为【Visual-GPT】。
本文为删节版，进行了大量删改，有些内容比较晦涩，读者可以略过，当然也可以仔细研读…完整版参见文末链接。
更新说明：文末链接已删除。

1. 【Visual- ChatGPT】火热来袭

3月9日，微软亚洲研究院发布了图文版 ChatGPT——Visual ChatGPT，并在 Github 开源了基础代码，短短一周已经获得了 19.7k 颗星。

2022年11月，OpenAI 推出的 ChatGPT，几个月来已经火爆全球，不仅需要候补注册，还要科学上网。ChatGPT 具有强大的会话能力的语言界面进行人机对话，能陪你聊天、编写代码、修改 bug、解答问题…，但是目前还不能处理或生成视觉图像。

Visual ChatGPT 把一系列 Visual Foundation 视觉模型接入 ChatGPT，使用户能够与 ChatGPT 以文本和图像的形式交互，还能提供复杂的视觉指令，让多个模型协同工作。Visual ChatGPT 可以理解和响应基于文本的输入和基于视觉的输入，减少进入文本到图像模型的障碍，增加各种 AI 工具的互操作性。

Visual Transformer 将 ChatGPT 作为逻辑处理中心，集成 Visual Foundation 视觉基础模型，从而实现：

提供视觉聊天系统，可以接收和发送文本和图像；
提供复杂的视觉问答和视觉编辑指令，可以解决复杂视觉任务；
可以提供反馈，总结答案，还可以主动对模糊的指令进行询问。

Visual-GPT 可以用自然语言简单地从模型中键入想要的内容，如题图所示的过程中进行了几轮对话：

用户要求生成一张猫的图像。Visual-GPT 生成了一幅正在看书的猫的图像。
用户要求将图像中的猫换成狗，并把书删除。Visual-GPT 将该图像中的猫换成了狗，并删除了图像中的书。
用户要求对图像进行 Canny 边缘检测。Visual-GPT 理解并执行了 Canny 边缘检测操作，生成了边缘图像。
用户要求基于指定的网络图像，生成一幅黄狗图像，Visual-GPT 也很好地完成了这个任务。

2. 【Visual-GPT】操作实例

2.1 处理流程

Visual-GPT 的基本处理流程如图所示。

如图所示，用户上传了一张黄色花朵的图像，并输入一条复杂的语言指令「请根据该图像生成的深度图在生成一朵红色花朵，然后逐步将其制作成卡通图片」。

Visual-GPT 中的 Prompt Manager 控制与 VFM 相关的处理流程。ChatGPT 利用这些 VFMs，并以迭代的方式接收其反馈，直到满足用户的要求或达到结束条件。

首先是深度估计模型，用来检测图像深度信息；
然后是深度图像模型，用来生成具有深度信息的红色花朵图像；
最后利用基于 Stable Diffusion 的风格迁移模型，将图像风格转换为卡通图像。

在上述 pipeline 中，Prompt Manager 作为 ChatGPT 的管理调度中心，提供可视化格式的类型并记录信息转换的过程，最后输出最终结果图像并显示。

2.2 操作实例

第一轮对话：
Q1：用户文本询问，问题与图像无关。
A1：模型文本回答，回答与图像无关。
Q2：用户要求画一个苹果。
A2：模型图文回答，绘制了一幅苹果图片。

第二轮对话：
Q3：用户输入图像，是一个苹果和杯子的草图。
A3：模型文本回答，询问用户的意图，并主动提示草图的文件名。
Q4：用户文本输入，要求按草图绘制苹果和杯子。
A5：模型图文回答，按照用户要求绘制了一幅苹果和杯子的图片。

第三轮对话： Q5：用户输入文本，要求把上图修改为水彩画风格。 A5：模型图文回答，按照用户要求把上图修改为一幅水彩画风格的图片。 Q6：用户文本输入，询问图片的背景颜色。 A6：模型文本回答，回答图片的背景颜色。

第四轮对话： Q7：用户文本输入，要求去除图片中的苹果。 A7：模型图文回答，按照用户要求从图片中去除苹果——但是没有去除苹果在桌面上的影子。 Q8：用户输入文本，指出上图中的影子还在桌面上，并要求把换一张黑色的桌子。 A8：模型图文回答，按照用户要求把图片中的桌子换成黑色桌子。

3. 【Visual-GPT】技术原理分析

4. 【Visual-GPT】使用与运行

【本文为删节版，相关内容已删除。】

4.1 clone the repo

4.2 prepare the basic environments

4.3 start local runing

5. 【Visual-GPT】论文简介

5.1 论文获取

Title：Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
标题：Visual ChatGPT：使用 Visual Foundation 模型进行对话、绘图和编辑
作者：Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan
机构：Microsoft Researc Asia（微软亚洲研究院）
论文链接： https://arxiv.org/abs/2303.04671
开源代码： https://github.com/microsoft/visual-chatgpt

我已经将本文上传到 CSDN，读者也可以从 arxiv 自行下载。

第一作者：吴晨飞，高级研究员，2020 年加入微软亚洲研究院自然语言计算组，研究领域为多模型的预训练、理解和生成。

通讯作者：段楠，微软亚洲研究院首席研究员及自然语言计算组研究经理，中国科学技术大学兼职博导，天津大学兼职教授，研究领域为自然语言处理、代码智能、多模态智能和机器推理等。

5.2 主要贡献

（1）提出 Visual ChatGPT，打开了 ChatGPT 和 VFM 连接的大门，使 ChatGPT 能够处理复杂的视觉任务。

（2）设计了一个 Prompt Manager，其中涉及 22 个不同的 VFM，并定义了它们之间的内在关联，以便更好地交互和组合。

（3）进行了大量的零样本实验，并展示了大量的案例来验证 Visual ChatGPT 的理解和生成能力。

5.3 本文的启发

本文开启了 ChatGPT 处理视觉任务的大门。
NLP —> Natural Language PhotoShop，自然语言文本描述下的图片创作编辑和问答。
可以通过系统设计和工具包设计的 Prompt 实现无监督的工具调用，类似于 zero-shot 的 toolformer。
ChatGPT 本身对仿真场景的能力很强，也能接受图片路径和函数关系，可以很好地使用基础视觉模型。
Visual ChatGPT 本身是一个语言模型，所谓的两方多轮对话只是一个 Human AI 的多轮特殊形式。

5.4 模型复现

Visual-GPT 的运行步骤如下。

（1）创建 Python3.8 环境并激活新的环境：

# create a new environment
conda create -n visgpt python=3.8

# activate the new environment
conda activate visgpt

（2）安装所需的依赖（详见4.2）：

#  prepare the basic environments
pip install -r requirement.txt

（3）clone the repo：

【删除】

clone the repo 所建立的文件夹结构如下：

├── assets
│ ├── demo.gif
│ ├── demo_short.gif
│ └── figure.jpg
├── download.sh
├── LICENSE.md
├── README.md
├── requirement.txt
└── visual_chatgpt.py

（4）设置工作目录：

将工作目录设置为创建的 github repo 的 copy：

# clone the repo
%cd visual-chatgpt

（5）下载基本视觉模型 VFM：

# download the visual foundation models
bash download.sh

（6）输入 OpenAI_API_key：

要开始使用OpenAI API，请访问 platform.OpenAI.com 并使用 Google 或 Microsoft 邮箱注册帐户，获取 API 密钥，该密钥将允许您访问API。——科学上网，势不可挡！

%env OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}

# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}

（7）创建图像保存目录

!mkdir ./image

（8）运行 Visual GPT

!python3.8 ./visual_chatgpt.py

注意问题：

（1）可以通过 “–load” 指定 GPU/CPU 分配，该参数设置使用的 VFM 模型及加载位置。可用的 Visual Foundation 模型参见 3.6 节内容。

例如，将 ImageCaptiing 加载到 cpu，将 Text2Image 加载到 cuda:0，则设置为：

python visual_chatgpt.py –load ImageCaptioning_cpu, Text2Image_cuda:0

（2）VFM 模型所需的内存资源很大，推荐的设置选项为：

CPU 用户：只加载 ImageCaptioning_cpu, Text2Image_cpu
1 Tesla T4 15GB 用户：只加载 ImageCaptioning_cuda:0, Text2Image_cuda:0，可以加载 ImageEditing_cuda:0
4 Tesla V100 32GB 用户：加载如下

--load ImageCaptioning_cuda:0,ImageEditing_cuda:0,
    Text2Image_cuda:1,Image2Canny_cpu,CannyText2Image_cuda:1,
    Image2Depth_cpu,DepthText2Image_cuda:1,VisualQuestionAnswering_cuda:2,
    InstructPix2Pix_cuda:2,Image2Scribble_cpu,ScribbleText2Image_cuda:2,
    Image2Seg_cpu,SegText2Image_cuda:2,Image2Pose_cpu,PoseText2Image_cuda:2,
    Image2Hed_cpu,HedText2Image_cuda:3,Image2Normal_cpu,
    NormalText2Image_cuda:3,Image2Line_cpu,LineText2Image_cuda:3

（3）不同 VFM 模型所需内存的参考值。

Foundation Model	Memory Usage (MB)
ImageEditing	6.5
ImageCaption	1.7
T2I	6.5
canny2image	5.4
line2image	6.5
hed2image	6.5
scribble2image	6.5
pose2image	6.5
BLIPVQA	2.6
seg2image	5.4
depth2image	6.5
normal2image	3.9
InstructPix2Pix	2.7

5.5 常见错误

RuntimeError: CUDA error: invalid device ordinal
问题原因：GPU 的数量不够。
解决方案：将 visual_chatgpt.py 文件中的所有 cuda:\d 替换为 cuda:0。
OutOfMemoryError: CUDA out of memory
问题原因：没有足够的 GPU 内存来运行 VFM模型。
解决方案：忽略 download.sh 和 visual_chatgpt.py 文件中不需要的一些模型，只加载必要的模型。

5.6 代码解读

**说明：**本节内容来自外网，博主也在解读和测试。在此贴出相关内容，仅供参考，更多解读详见【Visua ChatGPT: Paper and Code Review】。

with gr.Column(scale=0.15, min_width=0): 
btn = gr.UploadButton(“Upload”, file_types=[“image”])
btn.upload(bot.run_image, [btn, state, txt], [chatbot, state, txt])

def run_image(self, image, state, txt):
image_filename = os.path.join('image', str(uuid.uuid4())[0:8] + ".png")
print("======>Auto Resize Image...")
img = Image.open(image.name)
width, height = img.size
ratio = min(512 / width, 512 / height)
width_new, height_new = (round(width * ratio), round(height * ratio))
img = img.resize((width_new, height_new))
img = img.convert('RGB')
img.save(image_filename, "PNG")
print(f"Resize image form {width}x{height} to {width_new}x{height_new}")
description = self.i2t.inference(image_filename)
Human_prompt = "nHuman: provide a figure named {}. The description is: {}. This information helps you to understand this image, but you should use tools to finish following tasks, " 
"rather than directly imagine from my description. If you understand, say "Received". n".format(image_filename, description)
AI_prompt = "Received.  "
self.agent.memory.buffer = self.agent.memory.buffer + Human_prompt + 'AI: ' + AI_prompt
print("======>Current memory:n %s" % self.agent.memory)
state = state + [(f"![](/file={image_filename})*{image_filename}*", AI_prompt)]
print("Outputs:", state)
return state, state, txt + ' ' + image_filename + ' '

如上所述，上传图像后，调用run_image函数。此函数通过uuid创建新的图像名称，对图像进行预处理，然后创建添加到缓存的人工旋转。

还可以看出，图像描述与文件名一起被包括作为初始输入。该描述由Blip图像字幕模型生成。

class ImageCaptioning:
def __init__(self, device):
print("Initializing ImageCaptioning to %s" % device)
self.device = device
self.processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
self.model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(self.device)
self.i2t = ImageCaptioning(device="cuda:4")

从上面声明的Human_prompt变量中可以看出，短语“但在根据我的描述直接想象之前，您应该使用工具完成以下任务。设置ChatGPT使用VFM而不是任意提供响应的音调。

Human_prompt = "nHuman: provide a figure named {}. The description is: {}. This information helps you to understand this image, but you should use tools to finish following tasks, " 
"rather than directly imagine from my description. If you understand, say "Received". n".format(image_filename, description)

除了调用提交图像之外，每个调用还具有前缀和后缀，以进一步确保模型不会以特殊方式运行。前缀中列出的一些关键准则如下：

作为一种语言模型，VisualChatGPT不能直接读取图像，但它有一系列工具来完成各种视觉任务。每个图像都将创建一个文件名为“image/xxx.png”，VisualChatGPT可以调用各种工具来间接理解图像。
VisualChatGPT 对图像的文件名非常严格，不支持不存在的文件。
Visual ChatGPT可以按顺序使用这些工具，并且忠于工具观察结果的输出，而不是伪造图像内容和图像文件名。如果创建了新图像，它将记住上次观察工具时的文件名。
Visual ChatGPT可以访问以下工具：

这些声明使Visual ChatGPT能够使用可用的可视化工具，以及如何处理文件名以及如何与用户就VFM模型之一生成的图像进行通信。

代理有一个可以使用的所有工具的列表，在本例中是VFM。每个工具都有详细描述其功能，例如：

Tool(name="Generate Image From User Input Text", func=self.t2i.inference,
description="useful when you want to generate an image from a user input text and save 
it to a file. like: generate an image of an object or something, or 
generate an image that includes some objects. "
"The input to this tool should be a string, representing the text 
used to generate image. "),

所使用的工具之一是VFM，它可以将文本转换为图像，如图所示，为代理提供了有关工具名称的信息，该信息概括了模型的功能、要调用的函数以及详细描述工具和输入。以及仪器输出。

然后，代理使用工具的描述和过去的对话历史来决定下一步使用哪个工具。使用ReAct框架做出决策。

self.agent = initialize_agent(
self.tools,
self.llm,
agent="conversational-react-description",
verbose=True,
memory=self.memory,
return_intermediate_steps=True,
agent_kwargs={'prefix': VISUAL_CHATGPT_PREFIX, 
'format_instructions': VISUAL_CHATGPT_FORMAT_INSTRUCTIONS, 
'suffix': VISUAL_CHATGPT_SUFFIX},

ReAct可以被认为是推理链（CoT）推理范式的扩展。而CoT允许LM生成一系列推理来解决任务，从而减少产生幻觉的可能性。

为了确保ChatGPT以这种格式响应，ChatGPT提示符包含以下内容：

VISUAL_CHATGPT_FORMAT_INSTRUCTIONS = “””To use a tool, please use the following 
format:
"""
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
When you have a response to say to the Human, or if you do not need to 
use a tool, you MUST use the format:
Thought: Do I need to use a tool? No
"""
{ai_prefix}: [your response here]”””

需要注意的是，想法、行动和观察步骤的输出不会显示给最终用户。所有这些信息都是隐藏的，以确保最终用户不会被没有直接解决用户问题的所有中间答案淹没。

相反，当LM认为它已经得到了最终答案或想向用户提问时，只向用户 [此处为您的回答] 字段显示生成的文本的一部分。

ReAct范式的另一个好效果是，我们现在可以结合使用多种工具，因为在看到观察结果后，ChatGPT默认会考虑是否需要使用工具。本质上我必须使用工具吗？是添加到ChatGPT服务生成的每个查询和代理响应的后缀。

从ChatGPT响应格式上方的提示可以看出，对于ChatGPT从可用列表中选择一个工具，可以从前面看到的工具描述中获得工具的输入格式，最后可以从视图中解析VFM输出。

可以通过下面的LangChain库查看行动分析和行动条目：

def _extract_tool_and_input(self, llm_output: str) -> Optional[Tuple[str, str]]:
if f"{self.ai_prefix}:" in llm_output:
return self.ai_prefix, llm_output.split(f"{self.ai_prefix}:")[-1].strip()
regex = r"Action: (.*?)[n]*Action Input: (.*)"
match = re.search(regex, llm_output)
if not match:
raise ValueError(f"Could not parse LLM output: `{llm_output}`")
action = match.group(1)
action_input = match.group(2)
return action.strip(), action_input.strip(" ").strip('"')

在提取要使用的工具和要提供的输入时，进行调用以执行该工具。

每个模型的输出以以下格式保存为文件名：

{Name}_{Operation}_{Previous Name}_{Organization Name}.

title 是唯一的 uuid，操作对应于工具的名称，原名对应于用于创建新图像的输入图像的 uuid，组织的名称对应于用户提供的原始输入图像。按照这种命名约定，ChatGPT可以很容易地导出有关新生成的图像的信息。

def get_new_image_name(org_img_name, func_name="update"):
head_tail = os.path.split(org_img_name)
head = head_tail[0]
tail = head_tail[1]
name_split = tail.split('.')[0].split('_')
this_new_uuid = str(uuid.uuid4())[0:4]
if len(name_split) == 1:
most_org_file_name = name_split[0]
recent_prev_file_name = name_split[0]
new_file_name = '{}_{}_{}_{}.png'.format(this_new_uuid, func_name, recent_prev_file_name, most_org_file_name)
else:
assert len(name_split) == 4
most_org_file_name = name_split[3]
recent_prev_file_name = name_split[0]
new_file_name = '{}_{}_{}_{}.png'.format(this_new_uuid, func_name, recent_prev_file_name, most_org_file_name)
return os.path.join(head, new_file_name)

最后，将所有移动部件组合起来，与Visual ChatGPT进行对话，后者可以使用视觉信息。

这项工作是快速工程重要性的完美例证。提示允许代理使用文件名处理视觉信息，并创建思维链->动作链->观察反应链，帮助确定要使用哪些VFM并处理VFM模型的输出。

为了抽象解决方案的复杂性质，中介响应（包括思想、行动和观察话语）对用户是隐藏的，只有当ChatGPT相信时LM生成的最终响应才会显示给用户。不再需要使用VFM。