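Resuming after an interruption

    # Step 1: start the task; the agent runs into an INFO action
    result = ask_agent_start_new_task(
        device_id=device_id,
        task="去淘宝帮我选一个生日礼物",  # "Go to Taobao and pick a birthday gift for me"
        # ...
    )
    # Returns: stop_reason="INFO_ACTION_NEEDS_REPLY", session_id="xxx"

    # Step 2: continue once the user has replied
    result = ask_agent_continue(
        device_id=device_id,
        task=None,                   # no need to restate the task
        session_id="xxx",            # reuse the earlier session ID
        reply_from_client="铜苹果",  # the user's reply ("copper apple")
        # ...
    )

Switching between tasks

    # Start task A
    result_a = ask_agent_start_new_task(
        device_id=device_id,
        task="打开微信并发送消息",  # "Open WeChat and send a message"
        # ...
    )

    # Once it finishes, start the unrelated task B
    result_b = ask_agent_start_new_task(  # start_new_task resets the environment
        device_id=device_id,
        task="打开高德地图导航到公司",  # "Open Amap and navigate to the office"
        # ...
    )

2.3.4 Summary of the core differences

Environment state
- ask_agent_start_new_task: resets the device environment to its initial state
- ask_agent_continue: keeps the device's current environment state

Session management
- ask_agent_start_new_task: creates a new session
- ask_agent_continue: continues the existing session

When to use
- ask_agent_start_new_task: new tasks, independent tasks, tasks that need a clean environment
- ask_agent_continue: continuing a task, preserving context, Human-in-the-Loop

Context continuity
- ask_agent_start_new_task: no context continuity
- ask_agent_continue: preserves the task context and the application state

This design lets the system handle both independent, discrete tasks and complex tasks that need continuity, which makes task execution more flexible and efficient.

2.4 Code

The code for ask_agent_start_new_task is as follows. Note that in this listing several parameters (reset_environment, session_id, reply_mode, reply_from_client) appear only as commented-out code, and reply_mode is fixed to pass_to_client inside the function body.

    @mcp.tool
    def ask_agent_start_new_task(
        device_id: Annotated[str, Field(description="ID of the device to perform the task on. listed by list_connected_devices tool.")],
        task: Annotated[str | None, Field(description="The task that the agent needs to perform on the mobile device. if this is not None, the agent will try to perform this task. if None, the session_id must be provided to continue the previous session.")],
        # reset_environment: Annotated[bool, Field(description="Whether to reset the environment before executing the task, close current app, and back to home screen. If you want to execute a independent task, set this to True will make it easy to execute.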
        # If you want to continue the previous session, set this to False.")] = False,
        max_steps: Annotated[int, Field(description="Maximum number of steps the agent can take to complete the task.")] = 20,
        # session_id: Annotated[str | None, Field(description="Optional, session ID must provide when the last task endwith INFO action and you want to reply, the session id and device id and the reply from client must be provided.")] = None,

        # When the INFO action is called, how to handle it.
        # 1. "auto_reply": the INFO action will be handled automatically by calling the caption model to generate image captions.
        # 2. "no_reply": the INFO action will be ignored. THE AGENT MAY GET STUCK IF THE INFO ACTION IS IGNORED.
        # 3. "manual_reply": the INFO action will cause an interruption, and the user needs to provide the reply manually by input things in server's console.
        # 4. "pass_to_client": the INFO action will be returned to the MCP client to handle it.
        # reply_mode: Annotated[str, Field(description=
        #     "How to handle the INFO action during task execution. "
        #     "Options: "
        #     "- auto_reply: Automatically generate image captions for INFO actions. "
        #     "- no_reply: Ignore INFO actions (may cause the agent to get stuck). "
        #     "- manual_reply: Interrupt and require user input for INFO actions. "
        #     "- pass_to_client: Pass INFO actions to the MCP client for handling."
        # )] = "auto_reply",

        # reply_from_client: Annotated[str | None, Field(description="If the last task is ended with INFO action, and you want to give GUI agent a reply, provide the reply here. If you do so, you must provide last session id and last device id.")] = None,
    ) -> dict:
        """
        # Ask GUI Agent to start performing a new task on a connected device.

        Ask the GUI agent to perform the specified task on a connected device.
        The GUI Agent can be able to understand natural language instructions and interact with the device accordingly.
        The agent will be able to execute a high-level task description, if you have any additional requirements, write them down in detail at task string.
        This function will reset the environment before executing the task, close current app, and back to home screen. if you have

        ## The agent has the below limited capabilities:

        1. The task must be related to an app that is already installed on the device. for example, 打开微信帮我发一条消息给张三说今天下午三点开会; 帮我在淘宝上搜索一款性价比高的手机并加入购物车; to purchase an ea on Amazon.
        2. The task must be simple and specific. for example, do yyy in xxx app; find xxx information in xxx app. ONE THING AT ONE APP AT A TIME.
        3. The agent may not be able to handle complex tasks that require multi-step reasoning or planning. for example. You may need to break down complex tasks into simpler sub-tasks and ask the agent to perform them sequentially. For example, instead of asking the agent to plan a trip to Paris for xxx, you can ask it to search for flights to Paris on xxx app, find hotels in Paris on xxx app, make the plan yourself and ask agent to send the plan to xxx via IM app like wechat.
        4. The agent cannot accept multimodal inputs now. if you want to provide additional information like screenshot captions, please include them in the task description.

        ## Usage guidance

        1. you should never directly ask an Agent to pay or order anything. If user want to make a purchase, you should ask agent to stop before ordering/paying, and let user to order/pay.
        2. tell the agent, if human verification is appeared during the task execution, the agent should ask Client. when you see the INFO, you should ask user to handle the verification manually. after user says "done", you can continue the task with the session_id and device_id and ask the agent to continue in reply_from_client.
        3. IF the last agentic call is failed or you want to perform a new task in different app, you should always use this function to start a new task, so that the environment will be reset before executing the task.

        Returns:
            dict: Execution log containing details of the task execution.
                with keys including
                - device_info: Information about the device used for task execution.
                - final_action: The final action taken by the agent to complete the task.
                - global_step_idx: The total number of steps taken during the task execution.
                - local_step_idx: The number of steps taken in the current session.
                - session_id: The session ID for maintaining context across multiple tasks.
                - stop_reason: The reason for stopping the task execution (e.g., TASK_COMPLETED_SUCCESSFULLY).
                - task: The original task description provided to the agent.
        """

        reply_mode = "pass_to_client"
        # if task is not None:
        #     assert session_id is None, "If task is provided, session_id must be None."
        #     # New task, so reset_environment is True
        #     reset_environment = True
        # else:
        #     assert session_id is not None, "If task is None, session_id must be provided to continue the previous session."
        #     # Continuing previous session, so reset_environment is False
        #     reset_environment = False
        reset_environment = True

        return_log = execute_task(
            device_id=device_id,
            task=task,
            reset_environment=reset_environment,
            max_steps=max_steps,
            # enable_intermediate_logs=False,
            # enable_intermediate_image_caption=False,
            # enable_intermediate_logs=True,
            # enable_intermediate_image_caption=False,
            enable_intermediate_image_caption=True,
            enable_intermediate_screenshots=False,
            enable_final_screenshot=False,
            # enable_final_image_caption=False,
            enable_final_image_caption=True,
            reply_mode=reply_mode,
            session_id=None,
            # session_id=session_id,
            reply_from_client=None,
            # reply_from_client=reply_from_client,
        )
        return return_log

The code for ask_agent_continue is as follows.

    @mcp.tool
    def ask_agent_continue(
        device_id: Annotated[str, Field(description="ID of the device to perform the task on. listed by list_connected_devices tool.")],
        task: Annotated[str | None, Field(description="The task that the agent needs to perform on the mobile device. if this is not None, the agent will try to perform this task. "
            "if None, the session_id must be provided to continue the previous session.")],
        # reset_environment: Annotated[bool, Field(description="Whether to reset the environment before executing the task, close current app, and back to home screen. If you want to execute a independent task, set this to True will make it easy to execute. If you want to continue the previous session, set this to False.")] = False,
        max_steps: Annotated[int, Field(description="Maximum number of steps the agent can take to complete the task.")] = 20,
        # session_id: Annotated[str | None, Field(description="Optional, session ID must provide when the last task endwith INFO action and you want to reply, the session id and device id and the reply from client must be provided.")] = None,

        # When the INFO action is called, how to handle it.
        # 1. "auto_reply": the INFO action will be handled automatically by calling the caption model to generate image captions.
        # 2. "no_reply": the INFO action will be ignored. THE AGENT MAY GET STUCK IF THE INFO ACTION IS IGNORED.
        # 3. "manual_reply": the INFO action will cause an interruption, and the user needs to provide the reply manually by input things in server's console.
        # 4. "pass_to_client": the INFO action will be returned to the MCP client to handle it.
        # reply_mode: Annotated[str, Field(description=
        #     "How to handle the INFO action during task execution. "
        #     "Options: "
        #     "- auto_reply: Automatically generate image captions for INFO actions. "
        #     "- no_reply: Ignore INFO actions (may cause the agent to get stuck). "
        #     "- manual_reply: Interrupt and require user input for INFO actions. "
        #     "- pass_to_client: Pass INFO actions to the MCP client for handling."
        # )] = "auto_reply",

        # reply_from_client: Annotated[str | None, Field(description="If the last task is ended with INFO action, and you want to give GUI agent a reply, provide the reply here.
        # If you do so, you must provide last session id and last device id.")] = None,
    ) -> dict:
        """
        # Ask GUI Agent to continue performing a task on a connected device, using previous context.

        Ask the GUI agent to perform the specified task on a connected device.
        The GUI Agent can be able to understand natural language instructions and interact with the device accordingly.
        The agent will be able to execute a high-level task description, if you have any additional requirements, write them down in detail at task string.

        This function will **NOT** reset the environment before executing the task, so that the agent can continue the previous session. if you have

        ## The agent has the below limited capabilities:

        1. The task must be related to an app that is already installed on the device. for example, 打开微信帮我发一条消息给张三说今天下午三点开会; 帮我在淘宝上搜索一款性价比高的手机并加入购物车; to purchase an ea on Amazon.
        2. The task must be simple and specific. for example, do yyy in xxx app; find xxx information in xxx app. ONE THING AT ONE APP AT A TIME.
        3. The agent may not be able to handle complex tasks that require multi-step reasoning or planning. for example. You may need to break down complex tasks into simpler sub-tasks and ask the agent to perform them sequentially. For example, instead of asking the agent to plan a trip to Paris for xxx, you can ask it to search for flights to Paris on xxx app, find hotels in Paris on xxx app, make the plan yourself and ask agent to send the plan to xxx via IM app like wechat.
        4. The agent cannot accept multimodal inputs now. if you want to provide additional information like screenshot captions, please include them in the task description.

        ## Usage guidance

        1. you should never directly ask an Agent to pay or order anything. If user want to make a purchase, you should ask agent to stop before ordering/paying, and let user to order/pay.
        2. tell the agent, if human verification is appeared during the task execution, the agent should ask Client.
           when you see the INFO, you should ask user to handle the verification manually. after user says "done", you can continue the task with the session_id and device_id and ask the agent to continue in reply_from_client.
        3. IF the last agentic call is successful or the last action is INFO or the new task is related to the previous task, you can use this function to continue the task, so that the agent can finish the task faster by leveraging the previous context.

        dict: Execution log containing details of the task execution.
            with keys including
            - device_info: Information about the device used for task execution.
            - final_action: The final action taken by the agent to complete the task.
            - global_step_idx: The total number of steps taken during the task execution.
            - local_step_idx: The number of steps taken in the current session.
            - session_id: The session ID for maintaining context across multiple tasks.
            - stop_reason: The reason for stopping the task execution (e.g., TASK_COMPLETED_SUCCESSFULLY).
            - task: The original task description provided to the agent.
        """

        reply_mode = "pass_to_client"
        # if task is not None:
        #     assert session_id is None, "If task is provided, session_id must be None."
        #     # New task, so reset_environment is True
        #     reset_environment = True
        # else:
        #     assert session_id is not None, "If task is None, session_id must be provided to continue the previous session."
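        #     # Continuing previous session, so reset_environment is False
        #     reset_environment = False
        reset_environment = False

        return_log = execute_task(
            device_id=device_id,
            task=task,
            reset_environment=reset_environment,
            max_steps=max_steps,
            # enable_intermediate_logs=False,
            # enable_intermediate_image_caption=False,
            # enable_intermediate_logs=True,
            enable_intermediate_image_caption=True,
            enable_intermediate_screenshots=False,
            enable_final_screenshot=False,
            # enable_final_image_caption=False,
            enable_final_image_caption=True,
            reply_mode=reply_mode,
            session_id=None,
            # session_id=session_id,
            reply_from_client=None,
            # reply_from_client=reply_from_client,
        )
        return return_log

0x03 The INFO action

3.1 Core characteristics of the INFO action

The INFO interaction mode is special in two ways:

- User input request: INFO is the only action that requires the user to actively provide input. Unlike automatically executed actions such as CLICK, TYPE, and AWAKE, INFO interrupts the automated flow to obtain user feedback.
- Task pause mechanism: when an INFO action executes, the automated flow pauses. The system waits for the user to supply the necessary information and then resumes, which prevents erroneous actions caused by missing key information.

3.2 Handling strategies

An INFO action can be handled in several ways, selected through reply_mode:

- auto_reply: automatically call a model to generate the reply
- no_reply: ignore the INFO action (the agent may get stuck)
- manual_reply: enter the reply manually
- pass_to_client: pass the INFO action to the MCP client to handle

Where reply_mode is set:

- The handling mode is defined in the execute_task function.
- The gui_agent_loop function runs the corresponding logic based on reply_mode.
- The way INFO actions are handled can be adjusted dynamically.

Details of the automatic reply mechanism:

- The auto_reply function combines the current task, the screenshot, and the content of the INFO action.
- It uses an LLM to generate a suitable reply.
- This reduces reliance on manual user input.

Details of manual reply handling:

- In manual_reply mode, the program pauses and waits for user input.
- Prompts are shown in both Chinese and English to help the user understand what needs a reply.
- The validity of the user's input is checked.

3.3 Flow control mechanism

Flow control for INFO works as follows.

- Session interruption and resumption
  - When an INFO action is triggered, stop_reason is set to INFO_ACTION_NEEDS_REPLY.
  - The current session state is saved, including the session_id.
  - Execution can later continue with the same session_id.
- Reply passing
  - The user's reply is passed through the reply_from_client parameter.
  - It is sent to the agent as the query field of the payload.
  - The agent uses the reply as the input for its next step.

3.4 Implementation details of the INFO action

Information flows in two directions.

- From the agent to the user
  - The agent generates an INFO action that contains value (the question).
  - action["value"] is displayed to the user.
  - The user enters a reply.
- From the user to the agent
  - The user's input is passed through the reply_from_client parameter.
  - The reply_info variable stores the user's reply.
  - It is passed as the query field to the next automate_step call.

3.5 Application scenarios for the INFO action

Possible scenarios include the following.

- Human-machine collaboration
  - Verification codes: when an image captcha or SMS code appears, an INFO action is triggered, the agent asks the user for the code, and the agent continues once the user has entered it.
  - Confirmation of sensitive operations: before paying, deleting, or performing another sensitive operation, the agent may use an INFO action to ask the user to confirm, which avoids unintended consequences of automation.
- Supplementing information
  - Personal information: when the agent needs details such as a name or an address, it requests them through an INFO action and then completes tasks such as form filling.
  - Decision support: when there are several options to choose from, the agent asks for the user's preference through an INFO action and proceeds along the chosen path.

3.6 Code

The INFO-related code is as follows.

    def gui_agent_loop(
        # code omitted
    ):
        """
        Evaluate a task on a device using the provided frontend action converter and action function.
        """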
        # code omitted
            action = uiTars_to_frontend_action(action)

            if action["action_type"].upper() == "INFO":
                if reply_mode == "auto_reply":
                    print(f"AUTO REPLY INFO FROM MODEL!")
                    reply_info = auto_reply(image_b64_url, task, action, model_provider=agent_loop_config["model_config"]["model_provider"], model_name=agent_loop_config["model_config"]["model_name"])
                    print(f"info: {reply_info}")
                elif reply_mode == "no_reply":
                    print(f"INFO action ignored as per reply_mode=no_reply. Agent may get stuck.")
                    reply_info = "Please follow the task and continue. Don't ask further questions."
                    # do nothing, agent may get stuck
                elif reply_mode == "manual_reply":
                    print(f"EN: Agent asks: {action['value']} Please Reply: ")
                    print(f"ZH: Agent 问你: {action['value']} 回复一下")
                    reply_info = input("Your reply:")
                    print(f"Replied info action: {reply_info}")
                elif reply_mode == "pass_to_client":
                    print(f"Passing INFO action to client for reply.")
                    # break the loop and return to client for handling
                    stop_reason = "INFO_ACTION_NEEDS_REPLY"
                    break
                else:
                    raise ValueError(f"Unknown reply_mode: {reply_mode}")

            # code omitted
            act_on_device(action, device_id, device_wm_size, print_command=True, reflush_app=reflush_app)
            history_actions.append(action)

        # code omitted
        if stop_reason in ["MANUAL_STOP_SCREEN_OFF", "INFO_ACTION_NEEDS_REPLY", "NOT_STARTED"]:
            pass
        elif action["action_type"].upper() == "COMPLETE":
            stop_reason = "TASK_COMPLETED_SUCCESSFULLY"
        elif action["action_type"].upper() == "ABORT":
            stop_reason = "TASK_ABORTED_BY_AGENT"
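
In pass_to_client mode the loop above stops with stop_reason set to INFO_ACTION_NEEDS_REPLY and leaves the reply to the MCP client. The following is a minimal sketch of what a client-side relay loop might look like; it is not taken from the project. The helper run_with_human_in_the_loop, the fake_start and fake_continue stand-ins, and the max_rounds limit are all hypothetical. It also assumes that final_action holds the INFO action with its value field, and that the continue call accepts session_id and reply_from_client as in the usage example in section 2.3 (the listing in 2.4 has those parameters commented out).

```python
from typing import Callable

INFO_NEEDS_REPLY = "INFO_ACTION_NEEDS_REPLY"


def run_with_human_in_the_loop(
    start_task: Callable[..., dict],
    continue_task: Callable[..., dict],
    ask_user: Callable[[str], str],
    device_id: str,
    task: str,
    max_rounds: int = 5,  # hypothetical safety limit on question/answer rounds
) -> dict:
    """Start a task and relay each INFO question to the user until the agent stops for another reason."""
    result = start_task(device_id=device_id, task=task)
    for _ in range(max_rounds):
        if result.get("stop_reason") != INFO_NEEDS_REPLY:
            break
        # Assumption: the INFO action is returned as final_action and carries the question in "value".
        question = result.get("final_action", {}).get("value", "")
        reply = ask_user(question)
        result = continue_task(
            device_id=device_id,
            task=None,                        # same task, so nothing to restate
            session_id=result["session_id"],  # resume the interrupted session
            reply_from_client=reply,
        )
    return result


# Stand-ins for the two MCP tools so the sketch runs without a device.
def fake_start(device_id, task):
    return {
        "stop_reason": INFO_NEEDS_REPLY,
        "session_id": "s1",
        "final_action": {"action_type": "INFO", "value": "Which gift?"},
    }


def fake_continue(device_id, task, session_id, reply_from_client):
    return {
        "stop_reason": "TASK_COMPLETED_SUCCESSFULLY",
        "session_id": session_id,
        "reply_seen": reply_from_client,
    }


final = run_with_human_in_the_loop(
    fake_start, fake_continue, lambda question: "a mug", "dev-1", "pick a gift"
)
print(final["stop_reason"])  # TASK_COMPLETED_SUCCESSFULLY
```

Passing the two tool calls in as callables keeps the relay logic independent of any particular MCP client library. In a real client, ask_user would show the question to the person at the keyboard instead of returning a fixed string.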