{"data":{"post":{"title":"为 GLM 5.2 添加视觉能力：动态 agent 间通信协议及多模态陷阱","subtitle":"","isPublished":true,"createdTime":"2026-06-29T00:00:00.000Z","lastModifiedTime":null,"license":null,"tags":["AI","Agent"],"category":"Programming","file":{"childMdx":{"excerpt":"你有没有在 OpenCode 里给 GLM-5.2 发过图片？模型会告诉你它只处理文本，无法查看视觉内容。你只好接受这个限制，然后再换其他的方法。 更隐蔽的问题在后面。当 GLM-5.…","code":{"body":"function _objectWithoutProperties(source, excluded) { if (source == null) return {}; var target = _objectWithoutPropertiesLoose(source, excluded); var key, i; if (Object.getOwnPropertySymbols) { var sourceSymbolKeys = Object.getOwnPropertySymbols(source); for (i = 0; i < sourceSymbolKeys.length; i++) { key = sourceSymbolKeys[i]; if (excluded.indexOf(key) >= 0) continue; if (!Object.prototype.propertyIsEnumerable.call(source, key)) continue; target[key] = source[key]; } } return target; }\n\nfunction _objectWithoutPropertiesLoose(source, excluded) { if (source == null) return {}; var target = {}; var sourceKeys = Object.keys(source); var key, i; for (i = 0; i < sourceKeys.length; i++) { key = sourceKeys[i]; if (excluded.indexOf(key) >= 0) continue; target[key] = source[key]; } return target; }\n\nconst layoutProps = {};\nreturn class MDXContent extends React.Component {\n  constructor(props) {\n    super(props);\n    this.layout = null;\n  }\n\n  render() {\n    const _this$props = this.props,\n          {\n      components\n    } = _this$props,\n          props = _objectWithoutProperties(_this$props, [\"components\"]);\n\n    return React.createElement(MDXTag, {\n      name: \"wrapper\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `你有没有在 OpenCode 里给 GLM-5.2 发过图片？模型会告诉你它只处理文本，无法查看视觉内容。你只好接受这个限制，然后再换其他的方法。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"figure\",\n      components: components,\n      parentName: \"p\"\n    }, `\n    `, 
React.createElement(MDXTag, {\n      name: \"a\",\n      components: components,\n      parentName: \"figure\",\n      props: {\n        \"href\": \"/static/132076ad702789fc387b27379f1f0f11/d16b7/glm-5-2-not-spport-vision-task-1.webp\"\n      }\n    }, React.createElement(MDXTag, {\n      name: \"picture\",\n      components: components,\n      parentName: \"a\"\n    }, `\n  `, React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/132076ad702789fc387b27379f1f0f11/0cc25/glm-5-2-not-spport-vision-task-1.png\",\n        \"srcSet\": [\"/static/132076ad702789fc387b27379f1f0f11/5116e/glm-5-2-not-spport-vision-task-1.png 178w\", \"/static/132076ad702789fc387b27379f1f0f11/92f55/glm-5-2-not-spport-vision-task-1.png 356w\", \"/static/132076ad702789fc387b27379f1f0f11/0cc25/glm-5-2-not-spport-vision-task-1.png 712w\", \"/static/132076ad702789fc387b27379f1f0f11/7ae06/glm-5-2-not-spport-vision-task-1.png 1068w\", \"/static/132076ad702789fc387b27379f1f0f11/eee47/glm-5-2-not-spport-vision-task-1.png 1424w\", \"/static/132076ad702789fc387b27379f1f0f11/38407/glm-5-2-not-spport-vision-task-1.png 2136w\", \"/static/132076ad702789fc387b27379f1f0f11/58df7/glm-5-2-not-spport-vision-task-1.png 2626w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/132076ad702789fc387b27379f1f0f11/690c8/glm-5-2-not-spport-vision-task-1.webp\",\n        \"srcSet\": [\"/static/132076ad702789fc387b27379f1f0f11/25c8a/glm-5-2-not-spport-vision-task-1.webp 178w\", \"/static/132076ad702789fc387b27379f1f0f11/60698/glm-5-2-not-spport-vision-task-1.webp 356w\", \"/static/132076ad702789fc387b27379f1f0f11/690c8/glm-5-2-not-spport-vision-task-1.webp 712w\", 
\"/static/132076ad702789fc387b27379f1f0f11/d7e52/glm-5-2-not-spport-vision-task-1.webp 1068w\", \"/static/132076ad702789fc387b27379f1f0f11/456ef/glm-5-2-not-spport-vision-task-1.webp 1424w\", \"/static/132076ad702789fc387b27379f1f0f11/2a654/glm-5-2-not-spport-vision-task-1.webp 2136w\", \"/static/132076ad702789fc387b27379f1f0f11/d16b7/glm-5-2-not-spport-vision-task-1.webp 2626w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), `\n  `, React.createElement(MDXTag, {\n      name: \"img\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/132076ad702789fc387b27379f1f0f11/d16b7/glm-5-2-not-spport-vision-task-1.webp\",\n        \"alt\": \"OpenCode 的提示框里附上了一张图片，GLM-5.2 回复说无法查看图像。\",\n        \"title\": \"GLM-5.2 无法接收图片输入\",\n        \"width\": 712,\n        \"height\": 164,\n        \"loading\": \"lazy\"\n      }\n    }))), `\n    `, React.createElement(MDXTag, {\n      name: \"figcaption\",\n      components: components,\n      parentName: \"figure\"\n    }, `\n        `, React.createElement(MDXTag, {\n      name: \"span\",\n      components: components,\n      parentName: \"figcaption\"\n    }, `\n            GLM-5.2 无法接收图片输入\n        `), `\n    `))), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `更隐蔽的问题在后面。当 GLM-5.2 调用 browser-use 工具时，这些工具会进行「截图」，而模型会信誓旦旦地描述它「看到」的内容。可它其实一个像素都没真正看过。因为它读的是 AX tree，也就是某次单独 snapshot 调用返回的 accessibility metadata，然后把这些当成了视觉验证。AX tree 能确认一个按钮存在，却无法确认按钮是否居中、文字是否清晰可读，或者两张截图是否一致。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"figure\",\n      components: components,\n      parentName: \"p\"\n    }, `\n    `, React.createElement(MDXTag, {\n      name: \"a\",\n      components: components,\n      parentName: \"figure\",\n      props: {\n        \"href\": 
\"/static/2cbdf2abc17b1a4fb20bdf994f842716/f56ad/glm-5-2-not-spport-vision-task-2.webp\"\n      }\n    }, React.createElement(MDXTag, {\n      name: \"picture\",\n      components: components,\n      parentName: \"a\"\n    }, `\n  `, React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/2cbdf2abc17b1a4fb20bdf994f842716/0cc25/glm-5-2-not-spport-vision-task-2.png\",\n        \"srcSet\": [\"/static/2cbdf2abc17b1a4fb20bdf994f842716/5116e/glm-5-2-not-spport-vision-task-2.png 178w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/92f55/glm-5-2-not-spport-vision-task-2.png 356w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/0cc25/glm-5-2-not-spport-vision-task-2.png 712w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/7ae06/glm-5-2-not-spport-vision-task-2.png 1068w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/eee47/glm-5-2-not-spport-vision-task-2.png 1424w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/38407/glm-5-2-not-spport-vision-task-2.png 2136w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/34c79/glm-5-2-not-spport-vision-task-2.png 2628w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/2cbdf2abc17b1a4fb20bdf994f842716/690c8/glm-5-2-not-spport-vision-task-2.webp\",\n        \"srcSet\": [\"/static/2cbdf2abc17b1a4fb20bdf994f842716/25c8a/glm-5-2-not-spport-vision-task-2.webp 178w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/60698/glm-5-2-not-spport-vision-task-2.webp 356w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/690c8/glm-5-2-not-spport-vision-task-2.webp 712w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/d7e52/glm-5-2-not-spport-vision-task-2.webp 1068w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/456ef/glm-5-2-not-spport-vision-task-2.webp 1424w\", 
\"/static/2cbdf2abc17b1a4fb20bdf994f842716/2a654/glm-5-2-not-spport-vision-task-2.webp 2136w\", \"/static/2cbdf2abc17b1a4fb20bdf994f842716/f56ad/glm-5-2-not-spport-vision-task-2.webp 2628w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), `\n  `, React.createElement(MDXTag, {\n      name: \"img\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/2cbdf2abc17b1a4fb20bdf994f842716/f56ad/glm-5-2-not-spport-vision-task-2.webp\",\n        \"alt\": \"OpenCode 的验证输出中，GLM-5.2 依赖 accessibility 快照，而不是真实的像素。\",\n        \"title\": \"把 Accessibility 快照错当成视觉能力\",\n        \"width\": 712,\n        \"height\": 243,\n        \"loading\": \"lazy\"\n      }\n    }))), `\n    `, React.createElement(MDXTag, {\n      name: \"figcaption\",\n      components: components,\n      parentName: \"figure\"\n    }, `\n        `, React.createElement(MDXTag, {\n      name: \"span\",\n      components: components,\n      parentName: \"figcaption\"\n    }, `\n            把 Accessibility 快照错当成视觉能力\n        `), `\n    `))), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `为了解决这两个问题，我给 OpenCode 写了一个插件，让 GLM-5.2 在里面也能「看见」。这篇文章会分享我在开发它时学到的几条主要经验：`), React.createElement(MDXTag, {\n      name: \"ol\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, `怎样在没有模型路由或融合模型的情况下，组合使用能力不同的模型。`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, `如何设计 agent-to-agent 通信。`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, `如何让 skills 在多模态内容上可靠触发。`)), React.createElement(MDXTag, {\n      name: \"h2\",\n      components: components\n    }, `安装与使用`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `如果你想马上安装这个插件，请执行以下命令：`), 
React.createElement(MDXTag, {\n      name: \"pre\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"code\",\n      components: components,\n      parentName: \"pre\",\n      props: {\n        \"className\": \"language-shell\"\n      }\n    }, `opencode plugin opencode-vision -g\n`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `这个插件带有一个 `, React.createElement(MDXTag, {\n      name: \"inlineCode\",\n      components: components,\n      parentName: \"p\"\n    }, `vision`), ` skill。把图片拖入输入框即可使用它。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"figure\",\n      components: components,\n      parentName: \"p\"\n    }, `\n    `, React.createElement(MDXTag, {\n      name: \"a\",\n      components: components,\n      parentName: \"figure\",\n      props: {\n        \"href\": \"/static/f02b6313689ce5373f8a29d98823d81d/f56ad/example-drop-image-1.webp\"\n      }\n    }, React.createElement(MDXTag, {\n      name: \"picture\",\n      components: components,\n      parentName: \"a\"\n    }, `\n  `, React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/f02b6313689ce5373f8a29d98823d81d/0cc25/example-drop-image-1.png\",\n        \"srcSet\": [\"/static/f02b6313689ce5373f8a29d98823d81d/5116e/example-drop-image-1.png 178w\", \"/static/f02b6313689ce5373f8a29d98823d81d/92f55/example-drop-image-1.png 356w\", \"/static/f02b6313689ce5373f8a29d98823d81d/0cc25/example-drop-image-1.png 712w\", \"/static/f02b6313689ce5373f8a29d98823d81d/7ae06/example-drop-image-1.png 1068w\", \"/static/f02b6313689ce5373f8a29d98823d81d/eee47/example-drop-image-1.png 1424w\", \"/static/f02b6313689ce5373f8a29d98823d81d/38407/example-drop-image-1.png 2136w\", \"/static/f02b6313689ce5373f8a29d98823d81d/34c79/example-drop-image-1.png 2628w\"],\n  
      \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/f02b6313689ce5373f8a29d98823d81d/690c8/example-drop-image-1.webp\",\n        \"srcSet\": [\"/static/f02b6313689ce5373f8a29d98823d81d/25c8a/example-drop-image-1.webp 178w\", \"/static/f02b6313689ce5373f8a29d98823d81d/60698/example-drop-image-1.webp 356w\", \"/static/f02b6313689ce5373f8a29d98823d81d/690c8/example-drop-image-1.webp 712w\", \"/static/f02b6313689ce5373f8a29d98823d81d/d7e52/example-drop-image-1.webp 1068w\", \"/static/f02b6313689ce5373f8a29d98823d81d/456ef/example-drop-image-1.webp 1424w\", \"/static/f02b6313689ce5373f8a29d98823d81d/2a654/example-drop-image-1.webp 2136w\", \"/static/f02b6313689ce5373f8a29d98823d81d/f56ad/example-drop-image-1.webp 2628w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), `\n  `, React.createElement(MDXTag, {\n      name: \"img\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/f02b6313689ce5373f8a29d98823d81d/f56ad/example-drop-image-1.webp\",\n        \"alt\": \"OpenCode 会话示例，展示 vision 插件处理前的图片提示。\",\n        \"title\": \"vision 插件处理前的图片提示\",\n        \"width\": 712,\n        \"height\": 145,\n        \"loading\": \"lazy\"\n      }\n    }))), `\n    `, React.createElement(MDXTag, {\n      name: \"figcaption\",\n      components: components,\n      parentName: \"figure\"\n    }, `\n        `, React.createElement(MDXTag, {\n      name: \"span\",\n      components: components,\n      parentName: \"figcaption\"\n    }, `\n            vision 插件处理前的图片提示\n        `), `\n    `))), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `首次使用时，你需要从已配置的 provider 中挑选一个具备视觉能力的模型。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, 
React.createElement(MDXTag, {\n      name: \"figure\",\n      components: components,\n      parentName: \"p\"\n    }, `\n    `, React.createElement(MDXTag, {\n      name: \"a\",\n      components: components,\n      parentName: \"figure\",\n      props: {\n        \"href\": \"/static/70271622f8191c3605644106fb3095be/f56ad/example-drop-image-2.webp\"\n      }\n    }, React.createElement(MDXTag, {\n      name: \"picture\",\n      components: components,\n      parentName: \"a\"\n    }, `\n  `, React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/70271622f8191c3605644106fb3095be/0cc25/example-drop-image-2.png\",\n        \"srcSet\": [\"/static/70271622f8191c3605644106fb3095be/5116e/example-drop-image-2.png 178w\", \"/static/70271622f8191c3605644106fb3095be/92f55/example-drop-image-2.png 356w\", \"/static/70271622f8191c3605644106fb3095be/0cc25/example-drop-image-2.png 712w\", \"/static/70271622f8191c3605644106fb3095be/7ae06/example-drop-image-2.png 1068w\", \"/static/70271622f8191c3605644106fb3095be/eee47/example-drop-image-2.png 1424w\", \"/static/70271622f8191c3605644106fb3095be/38407/example-drop-image-2.png 2136w\", \"/static/70271622f8191c3605644106fb3095be/34c79/example-drop-image-2.png 2628w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/70271622f8191c3605644106fb3095be/690c8/example-drop-image-2.webp\",\n        \"srcSet\": [\"/static/70271622f8191c3605644106fb3095be/25c8a/example-drop-image-2.webp 178w\", \"/static/70271622f8191c3605644106fb3095be/60698/example-drop-image-2.webp 356w\", \"/static/70271622f8191c3605644106fb3095be/690c8/example-drop-image-2.webp 712w\", \"/static/70271622f8191c3605644106fb3095be/d7e52/example-drop-image-2.webp 1068w\", 
\"/static/70271622f8191c3605644106fb3095be/456ef/example-drop-image-2.webp 1424w\", \"/static/70271622f8191c3605644106fb3095be/2a654/example-drop-image-2.webp 2136w\", \"/static/70271622f8191c3605644106fb3095be/f56ad/example-drop-image-2.webp 2628w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), `\n  `, React.createElement(MDXTag, {\n      name: \"img\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/70271622f8191c3605644106fb3095be/f56ad/example-drop-image-2.webp\",\n        \"alt\": \"OpenCode 会话示例，vision 插件开始路由并发现可用的具备视觉能力的模型。\",\n        \"title\": \"发现视觉模型\",\n        \"width\": 712,\n        \"height\": 267,\n        \"loading\": \"lazy\"\n      }\n    }))), `\n    `, React.createElement(MDXTag, {\n      name: \"figcaption\",\n      components: components,\n      parentName: \"figure\"\n    }, `\n        `, React.createElement(MDXTag, {\n      name: \"span\",\n      components: components,\n      parentName: \"figcaption\"\n    }, `\n            发现视觉模型\n        `), `\n    `))), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `接着，配置好该具备视觉能力的模型的 subagent 会理解图片。主 agent 中的 GLM-5.2 会以文本形式收到 subagent 的评估结果。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"figure\",\n      components: components,\n      parentName: \"p\"\n    }, `\n    `, React.createElement(MDXTag, {\n      name: \"a\",\n      components: components,\n      parentName: \"figure\",\n      props: {\n        \"href\": \"/static/7c1870c2a4b33d05215079edcbcfd17e/f56ad/example-drop-image-3.webp\"\n      }\n    }, React.createElement(MDXTag, {\n      name: \"picture\",\n      components: components,\n      parentName: \"a\"\n    }, `\n  `, React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": 
\"/static/7c1870c2a4b33d05215079edcbcfd17e/0cc25/example-drop-image-3.png\",\n        \"srcSet\": [\"/static/7c1870c2a4b33d05215079edcbcfd17e/5116e/example-drop-image-3.png 178w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/92f55/example-drop-image-3.png 356w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/0cc25/example-drop-image-3.png 712w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/7ae06/example-drop-image-3.png 1068w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/eee47/example-drop-image-3.png 1424w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/38407/example-drop-image-3.png 2136w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/34c79/example-drop-image-3.png 2628w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/7c1870c2a4b33d05215079edcbcfd17e/690c8/example-drop-image-3.webp\",\n        \"srcSet\": [\"/static/7c1870c2a4b33d05215079edcbcfd17e/25c8a/example-drop-image-3.webp 178w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/60698/example-drop-image-3.webp 356w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/690c8/example-drop-image-3.webp 712w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/d7e52/example-drop-image-3.webp 1068w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/456ef/example-drop-image-3.webp 1424w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/2a654/example-drop-image-3.webp 2136w\", \"/static/7c1870c2a4b33d05215079edcbcfd17e/f56ad/example-drop-image-3.webp 2628w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), `\n  `, React.createElement(MDXTag, {\n      name: \"img\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/7c1870c2a4b33d05215079edcbcfd17e/f56ad/example-drop-image-3.webp\",\n        \"alt\": \"OpenCode 会话示例，提示用户为图片分析选择具备视觉能力的模型。\",\n        \"title\": \"选择视觉模型\",\n        
\"width\": 712,\n        \"height\": 155,\n        \"loading\": \"lazy\"\n      }\n    }))), `\n    `, React.createElement(MDXTag, {\n      name: \"figcaption\",\n      components: components,\n      parentName: \"figure\"\n    }, `\n        `, React.createElement(MDXTag, {\n      name: \"span\",\n      components: components,\n      parentName: \"figcaption\"\n    }, `\n            选择视觉模型\n        `), `\n    `))), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `这个 skill 也能处理 computer-use 和 browser-use 工具返回的图片。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"figure\",\n      components: components,\n      parentName: \"p\"\n    }, `\n    `, React.createElement(MDXTag, {\n      name: \"a\",\n      components: components,\n      parentName: \"figure\",\n      props: {\n        \"href\": \"/static/ccdd7094f9696cf5e88f919a0c5a9000/f56ad/example-computer-use-1.webp\"\n      }\n    }, React.createElement(MDXTag, {\n      name: \"picture\",\n      components: components,\n      parentName: \"a\"\n    }, `\n  `, React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/ccdd7094f9696cf5e88f919a0c5a9000/0cc25/example-computer-use-1.png\",\n        \"srcSet\": [\"/static/ccdd7094f9696cf5e88f919a0c5a9000/5116e/example-computer-use-1.png 178w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/92f55/example-computer-use-1.png 356w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/0cc25/example-computer-use-1.png 712w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/7ae06/example-computer-use-1.png 1068w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/eee47/example-computer-use-1.png 1424w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/38407/example-computer-use-1.png 2136w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/34c79/example-computer-use-1.png 2628w\"],\n    
    \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/ccdd7094f9696cf5e88f919a0c5a9000/690c8/example-computer-use-1.webp\",\n        \"srcSet\": [\"/static/ccdd7094f9696cf5e88f919a0c5a9000/25c8a/example-computer-use-1.webp 178w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/60698/example-computer-use-1.webp 356w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/690c8/example-computer-use-1.webp 712w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/d7e52/example-computer-use-1.webp 1068w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/456ef/example-computer-use-1.webp 1424w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/2a654/example-computer-use-1.webp 2136w\", \"/static/ccdd7094f9696cf5e88f919a0c5a9000/f56ad/example-computer-use-1.webp 2628w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), `\n  `, React.createElement(MDXTag, {\n      name: \"img\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/ccdd7094f9696cf5e88f919a0c5a9000/f56ad/example-computer-use-1.webp\",\n        \"alt\": \"OpenCode computer-use 示例，GLM-5.2 用 cua-driver 截取 Finder 窗口，加载 vision skill，并返回屏幕描述。\",\n        \"title\": \"用 computer-use 工具描述 Finder 窗口\",\n        \"width\": 712,\n        \"height\": 498,\n        \"loading\": \"lazy\"\n      }\n    }))), `\n    `, React.createElement(MDXTag, {\n      name: \"figcaption\",\n      components: components,\n      parentName: \"figure\"\n    }, `\n        `, React.createElement(MDXTag, {\n      name: \"span\",\n      components: components,\n      parentName: \"figcaption\"\n    }, `\n            用 computer-use 工具描述 Finder 窗口\n        `), `\n    `))), React.createElement(MDXTag, {\n      name: \"h2\",\n      components: components\n    }, `架构`), React.createElement(MDXTag, {\n      name: \"p\",\n      
components: components\n    }, `ZCode 实现视觉支持的方式是把图像路由到官方订阅计划里包含的具备视觉能力的模型。所以 ZCode 能看懂你发给它的图片；而通过非官方 provider 使用 GLM-5.2 时，这一能力就会消失。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `但 OpenCode 没法配置模型路由或融合模型。那么如何让 OpenCode 处理视觉内容？`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `其实 OpenCode 已经接入了 OpenAI ChatGPT、Kimi for Coding、OpenCode Go、Ollama Pro/Max 等 provider，其中不少都提供具备视觉能力的模型。借助 OpenCode 已有的原语，我们可以搭一个很轻量的架构：`), React.createElement(MDXTag, {\n      name: \"ol\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, `创建使用具备视觉能力的模型处理视觉内容的 subagent。`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, `需要时通过 skill 把视觉任务委派给这些 subagent。`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `凭借现有的 agent 工具，这两条思路已经足够让 agent 把这个插件搭出来。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `不过有两个细节仍然关键：`), React.createElement(MDXTag, {\n      name: \"ol\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, `agent-to-agent 通信的设计`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, `skill 描述的覆盖范围`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `这两点都会直接影响视觉任务结果的质量。`), React.createElement(MDXTag, {\n      name: \"h2\",\n      components: components\n    }, `Agent-to-agent 通信`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `稳定的 agent-to-agent 通信通常始于一份严格的协议，它会将 subagent 的输入与输出结构化。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, 
`不过，为了处理尽可能多种类的视觉任务，这份协议不能过于局限或僵化。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `举例来说，如果我们为了说明任务目的而新增一个字段，却只允许一小部分可选值，那么 subagent 就无法处理其他类型的任务了。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"p\"\n    }, `坏的设计:`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `以下代码来自我的第一个 agent-to-agent 协议设计。它存在几处设计瑕疵：`), React.createElement(MDXTag, {\n      name: \"ol\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, React.createElement(MDXTag, {\n      name: \"inlineCode\",\n      components: components,\n      parentName: \"li\"\n    }, `Image`), ` 对象中的 `, React.createElement(MDXTag, {\n      name: \"inlineCode\",\n      components: components,\n      parentName: \"li\"\n    }, `role`), ` 字段是为对比任务设计的，不过并非所有视觉任务都属于对比任务。`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, React.createElement(MDXTag, {\n      name: \"inlineCode\",\n      components: components,\n      parentName: \"li\"\n    }, `judgment`), ` 字段只覆盖了有限的视觉任务，而且我们在设计 skill 时无法列出所有可能的任务。`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, React.createElement(MDXTag, {\n      name: \"inlineCode\",\n      components: components,\n      parentName: \"li\"\n    }, `judgment`), ` 字段只能包含一个对象，于是只能有一个 `, React.createElement(MDXTag, {\n      name: \"inlineCode\",\n      components: components,\n      parentName: \"li\"\n    }, `Alignment`), ` 对象。如果我想同时检查某个对象在 X 和 Y 轴上的对齐效果，该怎么办？`)), React.createElement(MDXTag, {\n      name: \"pre\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"code\",\n      components: 
components,\n      parentName: \"pre\",\n      props: {\n        \"className\": \"language-typescript\"\n      }\n    }, `interface Image { path: string; label: string; role: \"baseline\" | \"current\" | \"reference\" }\ninterface Request {\n    id: string\n    images: [Image]\n    judgment: Presence | Absence | Alignment | Ordering | Equality | Layout | Readability | State | Diff | Describe;\n    criteria?: string;\n    responseContract?: string;\n}\ninterface Presence { kind: \"presence\"; subject: string; expectation: string }\ninterface Absence { kind: \"absence\"; subject: string; expectation: string }\ninterface Alignment { kind: \"alignment\"; subject: string; axis: string; expectation: string; tolerance: string }\ninterface Ordering { kind: \"ordering\"; direction: string; expected: string[] }\ninterface Equality { kind: \"equality\"; subjects: string[]; threshold: string }\ninterface Layout { kind: \"layout\"; expectations: string[] }\ninterface Readability { kind: \"readability\"; subject: string }\ninterface State { kind: \"state\"; subject: string; expectedState: string }\ninterface Diff { kind: \"diff\"; baseline: string; current: string }\ninterface Describe { kind: \"describe\"; focus: string }\n`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"p\"\n    }, `好的设计:`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `更好的做法是让 agent 从一份 prompt 模板和几条明确的原则出发，自行设计协议。`), React.createElement(MDXTag, {\n      name: \"ol\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, `把 subagent 的初始化 prompt 声明为一个模板。`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, React.createElement(MDXTag, {\n      name: 
\"strong\",\n      components: components,\n      parentName: \"li\"\n    }, `在初始化 prompt 内部`), `，把 subagent 的`, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"li\"\n    }, `回复 schema`), ` 也声明为一个模板。`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"li\"\n    }, `在初始化 prompt 内部`), `，加入约束原则，让 subagent 必须按照`, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"li\"\n    }, `回复 schema`), ` 返回内容。`), React.createElement(MDXTag, {\n      name: \"li\",\n      components: components,\n      parentName: \"ol\"\n    }, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"li\"\n    }, `在初始化 prompt 外部`), `，加入引导原则，让主 agent 在创建 subagent 时传入一个动态设计的`, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"li\"\n    }, `回复 schema`), `。`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `这样设计之后，agent 之间的通信既保持结构化，又足够灵活，可以描述各式各样的视觉任务。`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"figure\",\n      components: components,\n      parentName: \"p\"\n    }, `\n    `, React.createElement(MDXTag, {\n      name: \"a\",\n      components: components,\n      parentName: \"figure\",\n      props: {\n        \"href\": \"/static/a116c008c5dd5b10dcbde805956293e5/e4ef5/agent-to-agent-prompt-example.webp\"\n      }\n    }, React.createElement(MDXTag, {\n      name: \"picture\",\n      components: components,\n      parentName: \"a\"\n    }, `\n  `, React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n   
     \"src\": \"/static/a116c008c5dd5b10dcbde805956293e5/0cc25/agent-to-agent-prompt-example.png\",\n        \"srcSet\": [\"/static/a116c008c5dd5b10dcbde805956293e5/5116e/agent-to-agent-prompt-example.png 178w\", \"/static/a116c008c5dd5b10dcbde805956293e5/92f55/agent-to-agent-prompt-example.png 356w\", \"/static/a116c008c5dd5b10dcbde805956293e5/0cc25/agent-to-agent-prompt-example.png 712w\", \"/static/a116c008c5dd5b10dcbde805956293e5/7ae06/agent-to-agent-prompt-example.png 1068w\", \"/static/a116c008c5dd5b10dcbde805956293e5/eee47/agent-to-agent-prompt-example.png 1424w\", \"/static/a116c008c5dd5b10dcbde805956293e5/38407/agent-to-agent-prompt-example.png 2136w\", \"/static/a116c008c5dd5b10dcbde805956293e5/ad291/agent-to-agent-prompt-example.png 2988w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), React.createElement(MDXTag, {\n      name: \"source\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": \"/static/a116c008c5dd5b10dcbde805956293e5/690c8/agent-to-agent-prompt-example.webp\",\n        \"srcSet\": [\"/static/a116c008c5dd5b10dcbde805956293e5/25c8a/agent-to-agent-prompt-example.webp 178w\", \"/static/a116c008c5dd5b10dcbde805956293e5/60698/agent-to-agent-prompt-example.webp 356w\", \"/static/a116c008c5dd5b10dcbde805956293e5/690c8/agent-to-agent-prompt-example.webp 712w\", \"/static/a116c008c5dd5b10dcbde805956293e5/d7e52/agent-to-agent-prompt-example.webp 1068w\", \"/static/a116c008c5dd5b10dcbde805956293e5/456ef/agent-to-agent-prompt-example.webp 1424w\", \"/static/a116c008c5dd5b10dcbde805956293e5/2a654/agent-to-agent-prompt-example.webp 2136w\", \"/static/a116c008c5dd5b10dcbde805956293e5/e4ef5/agent-to-agent-prompt-example.webp 2988w\"],\n        \"sizes\": \"(max-width: 712px) 100vw, 712px\"\n      }\n    }), `\n  `, React.createElement(MDXTag, {\n      name: \"img\",\n      components: components,\n      parentName: \"picture\",\n      props: {\n        \"src\": 
\"/static/a116c008c5dd5b10dcbde805956293e5/e4ef5/agent-to-agent-prompt-example.webp\",\n        \"alt\": \"The figure shows the prompt template the main agent uses to define the visual task, images, response template, and response rules for the subagent.\",\n        \"title\": \"Prompt template for the dynamic vision subagent\",\n        \"width\": 712,\n        \"height\": 1196,\n        \"loading\": \"lazy\"\n      }\n    }))), `\n    `, React.createElement(MDXTag, {\n      name: \"figcaption\",\n      components: components,\n      parentName: \"figure\"\n    }, `\n        `, React.createElement(MDXTag, {\n      name: \"span\",\n      components: components,\n      parentName: \"figcaption\"\n    }, `\n            Prompt template for the dynamic vision subagent\n        `), `\n    `))), React.createElement(MDXTag, {\n      name: \"h2\",\n      components: components\n    }, `Skill description`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `People may assume that multimodal support only concerns user input, but tool results can carry multimodal content as well.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `The skill description therefore needs to cover tool results that contain multimodal content. In OpenCode this is straightforward, because images in tool results have two clear markers:`), React.createElement(MDXTag, {\n      name: \"pre\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"code\",\n      components: components,\n      parentName: \"pre\",\n      props: {\n        \"className\": \"language-yaml\"\n      }\n    }, `description: >-\n  You **MUST** use the vision skill when your model is text-only (e.g.\n  glm-5.2, deepseek-v4-pro) AND:\n  ...\n  (5) OR a tool result contains an image attachment the current model\n  cannot see (attachments[].mime = \"image/png\",\n  url = \"data:image/png;base64,...\");\n`)), React.createElement(MDXTag, {\n      name: \"h2\",\n      components: components\n    }, `Limitations`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"p\"\n    
}, `Native multimodality:`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `This plugin does not add native multimodality to a text-only model such as GLM-5.2.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `Multimodal content carries details that text cannot fully convey, and native multimodality lets the model see those details directly.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `This plugin cannot do that.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `It returns the vision model's recognition results to the main agent as text, so some visual information is still compressed or lost.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"p\"\n    }, `Disabling the plugin:`)), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `Sometimes you may switch to a vision-capable model such as GPT.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `In that case it is better to let that model handle visual tasks, because it views images natively and gives better results.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `However, plugins installed through `, React.createElement(MDXTag, {\n      name: \"inlineCode\",\n      components: components,\n      parentName: \"p\"\n    }, `opencode plugin`), ` do not appear in OpenCode's plugin management interface.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `To disable the plugin for a single task, add this sentence at the start of the prompt: `, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"p\"\n    }, `\"You MUST not use the vision skill.\"`), ` OpenCode then skips the `, React.createElement(MDXTag, {\n      name: \"inlineCode\",\n      components: components,\n      parentName: \"p\"\n    }, `vision`), ` skill bundled with the plugin.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, React.createElement(MDXTag, {\n      name: \"strong\",\n      components: components,\n      parentName: \"p\"\n    }, `Video content:`)), React.createElement(MDXTag, {\n      name: \"p\",\n     
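 components: components\n    }, `Models such as Kimi K2.7 Code support video input.`), React.createElement(MDXTag, {\n      name: \"p\",\n      components: components\n    }, `OpenCode does not accept video input, so this plugin does not support video either.`));\n  }\n\n}\nMDXContent.isMDXComponent = true;","scope":""},"headings":[{"value":"Installation and usage","depth":2},{"value":"Architecture","depth":2},{"value":"Agent-to-agent communication","depth":2},{"value":"Skill description","depth":2},{"value":"Limitations","depth":2}]}}},"earlierPostExcerpt":{"slug":"/post/2026/06/glm-5-2-affordable-providers-vision-and-agents-8f8c","title":"GLM-5.2: Affordable Plans, Vision Support, and Agent Setup","subtitle":"","createdTime":"2026-06-26T00:00:00.000Z","tags":["AI","Agent"],"category":"Programming","file":{"childMdx":{"excerpt":"After my Claude Code and Codex quotas ran out last week, I tried GLM 5.2 and found it to be in the same tier as GPT 5.5. However, the plans on the official Chinese website cannot be purchased, and the service is also poor in stability and speed. I explored some alternatives on my own and am sharing them here. This post covers three things: Which affordable GLM-5.2 plans exist and the trade-offs of each. Where the lack of vision support actually gets in the way, and how today's agents work around it. How to configure GLM-5.2 in mainstream coding agents using the plans shown in this post. Affordable plans This is how I currently see these plans. Plan Price Usage limits Context window size Speed Vision support Cursor Pro $20 USD…"}}},"laterPostExcerpt":null},"pageContext":{"postId":"0a062ef5-baa1-5248-995e-f7a91dbeb97a","earlierPostId":"6ad54098-95c6-5831-ada9-07a82bb659d6","laterPostId":null}}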