CANN/ge Custom Operator Architecture Design

📅 2026/7/4 7:20:29
# GE Custom Operator Architecture Design

ge: GE (Graph Engine) is a graph compiler and executor for Ascend. It provides computation graph optimization, multi-stream parallelism, memory reuse, and model sinking to speed up model execution and reduce model memory usage. GE offers friendly integration with PyTorch and TensorFlow frontends and supports parsing and compiling mainstream model formats such as onnx and pb. Project address: https://gitcode.com/cann/ge

## 1. Introduction

### 1.1 Purpose

This document describes the architecture of the GE custom operator integration mechanism and is intended for GE internal developers and architects. It covers the custom operator interface system, the registration and loading mechanisms, the internal compile-time and runtime flows, and the interaction with GE subsystems.

For the user guide aimed at external operator developers, see custom_op_development_guide.md.

### 1.2 Scope

This document covers:

- Custom operator interface design and registration mechanism
- SO deliverable loading and lifecycle management
- Custom operator scheduling path in the GE compiler and executor
- Frontend (PyTorch / TensorFlow / ONNX) integration architecture
- Design constraints and feature cross-analysis

Not covered:

- Specific operator kernel implementation (Ascend C / Triton, etc.)
- GE built-in operator engine scheduling details

---

## 2. Overview

### 2.1 Design Motivation

GE has two core requirements for custom operator integration:

- **Language-agnostic**: Operator development is not tied to a specific programming language (Ascend C, Triton, PPTO, etc.). A unified integration interface decouples the integration process from the kernel's programming language.
- **Progressive development experience**: Developers can add capabilities step by step as needed (execution → sink → compile optimization → offline OM). Each completed stage yields its corresponding performance benefit, so the integration work does not have to be finished all at once.

### 2.2 Design Goals

| Goal | Description |
|------|------|
| Language-agnostic | The interface layer makes no assumption about the kernel implementation language; it only cares about kernel binary loading and launch |
| Progressive capabilities | Capability interfaces are combined as needed, evolving step by step from minimum runnable to full sink |
| Deliverable organization | One .so contains all integration logic of one or more operators, which simplifies distribution and maintenance |
| Coexistence with built-in operators | Custom and built-in operators execute in the same graph and share stream allocation and memory planning |
| Infrastructure positioning | The interface layer acts as infrastructure; each programming language can build a common layer on top of it to further lower development difficulty |

### 2.3 Three-Stage Evolution Roadmap

| Stage | Core Capability | New Deliverable | Performance Benefit | Status |
|------|------|------|------|------|
| Stage 1 | Execute (host schedules kernel) | 1 .so | Runnable, with host scheduling overhead | Completed |
| Stage 2.1 | Execute (sink scheduling) | None | Eliminates host scheduling overhead under static shape | Completed |
| Stage 2.2 | + InferShape + Compile | None | Shape inference, memory reuse, online operator compilation | Completed |
| Stage 3 | + Serialize / Deserialize | None | Offline OM deployment | Completed |

### 2.4 Infrastructure Positioning and Language Common Layer

The current interface system (`BaseCustomOp` + 5 capability interfaces) is positioned as an **infrastructure layer**, not as the final API for end operator developers.
Various operator programming languages (Ascend C, Triton, PPTO, etc.) can build a **language common layer** on top of this infrastructure that encapsulates the repetitive boilerplate logic and minimizes the effort of integrating a single operator into the graph.

For comparison, the old-version (IMPL_OP) registration looks like this:

```cpp
// Old-version registration style
IMPL_OP(MyOp)
    .InferShape(my_infer_shape_func)            // shape inference function pointer
    .InferShapeRange(my_shape_range_func)       // dynamic shape range inference
    .InferDataType(my_infer_datatype_func)      // dtype inference function pointer
    .InferFormat(my_infer_format_func)          // format inference function pointer
    .Tiling(my_tiling_func)                     // AICore tiling function
    .TilingParse<MyCompileInfo>(my_parse_func)  // tiling parse function (MyCompileInfo: placeholder compile-info type)
    .InputsDataDependency({0})                  // declare which inputs need data
    .PrivateAttr("attr_name", int64_t(42));     // private attribute
```

The old version is designed for the **AICore standard engine pipeline**: the operator provides the tiling and shape inference functions, while the framework's FE engine and DavinciModel handle compilation, serialization, scheduling, and address refresh.

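
To make the chained, function-pointer style concrete, the following is a minimal self-contained sketch of how such a registration could be built. It is not GE's actual `IMPL_OP` implementation: `SketchContext`, `OpImplRecord`, `OpImplBuilder`, `OpImplTable`, `RegisterMyOp`, and `FrameworkInferShape` are hypothetical names invented for illustration. What it shows is that the registered record holds no per-operator state, and that the framework side, not the operator, decides when each function is called.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, NOT the real GE types behind IMPL_OP.
struct SketchContext {
  std::vector<std::int64_t> shape;
};
using KernelFunc = std::uint32_t (*)(SketchContext *);

// Stateless record: only function pointers and declarative data.
struct OpImplRecord {
  KernelFunc infer_shape = nullptr;
  KernelFunc tiling = nullptr;
  std::vector<std::size_t> data_dependent_inputs;
};

// Global table keyed by op type; a function-local static avoids
// static-initialization-order problems.
inline std::map<std::string, OpImplRecord> &OpImplTable() {
  static std::map<std::string, OpImplRecord> table;
  return table;
}

// Every setter returns *this, which is what makes the call chain possible.
class OpImplBuilder {
 public:
  explicit OpImplBuilder(const std::string &op_type)
      : record_(OpImplTable()[op_type]) {}
  OpImplBuilder &InferShape(KernelFunc f) { record_.infer_shape = f; return *this; }
  OpImplBuilder &Tiling(KernelFunc f) { record_.tiling = f; return *this; }
  OpImplBuilder &InputsDataDependency(std::vector<std::size_t> indices) {
    record_.data_dependent_inputs = std::move(indices);
    return *this;
  }

 private:
  OpImplRecord &record_;
};

// Operator side: plain functions without any per-operator state.
inline std::uint32_t MyInferShape(SketchContext *ctx) {
  ctx->shape = {2, 3};
  return 0;
}
inline std::uint32_t MyTiling(SketchContext *) { return 0; }

inline void RegisterMyOp() {
  OpImplBuilder("MyOp")
      .InferShape(MyInferShape)
      .Tiling(MyTiling)
      .InputsDataDependency({0});
}

// Framework side: looks up the record and decides when to call each pointer.
inline std::uint32_t FrameworkInferShape(const std::string &op_type,
                                         SketchContext *ctx) {
  const auto it = OpImplTable().find(op_type);
  if (it == OpImplTable().end() || it->second.infer_shape == nullptr) {
    return 1;  // op type not registered
  }
  return it->second.infer_shape(ctx);
}
```

After `RegisterMyOp()` has run once, `FrameworkInferShape("MyOp", &ctx)` returns 0 and fills `ctx.shape` with `{2, 3}`, while an unregistered op type returns 1.
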
#### Core Difference: Who Controls

The key difference between the new and old versions is not whether a capability exists, but **who controls** it:

| Capability | Old Version (IMPL_OP) | New Version (BaseCustomOp) |
|------|-------------------|------------------------|
| **Execution** | Framework embedded (engine schedules tiling + kernel launch) | User-defined (EagerExecuteOp::Execute()) |
| **Online Compilation** | Framework embedded (FE engine schedules TBE/AscendC compiler) | User-defined (CompilableOp::Compile()) |
| **Serialization** | Framework embedded (FE engine automatically serializes tiling data / compile info) | User-defined (PortableOp::Serialize/Deserialize) |
| **Address Refresh** | Framework embedded (DavinciModel SinkTask handles it automatically) | User-defined (ArgsUpdater::UpdateHostArgs()) |

Other interface-level differences:

| Dimension | Old Version (IMPL_OP) | New Version (BaseCustomOp) |
|------|-------------------|------------------------|
| Registration style | Chained function pointers | Virtual function inheritance + factory registration |
| State management | Stateless function pointers | Stateful object (member variables) |
| Shape inference | InferShapeKernelFunc function pointer | ShapeInferOp::InferShape() virtual method |
| Shape range inference | InferShapeRangeKernelFunc registered independently | Reuses InferShape (-1 represents an unknown dimension) |
| DataType inference | InferDataTypeKernelFunc function pointer | ShapeInferOp::InferDataType() virtual method |
| Format inference | InferFormatKernelFunc registered independently | Not supported yet, under design |
| Data dependency declaration | Explicit .InputsDataDependency({indices}) | Implicit (accessed through context) |
| Private attribute | .PrivateAttr(name, value) | Not supported |
| Tiling | Dedicated .Tiling() / .TilingParse() interface | Tiling concept is de-emphasized; handled by the subclass itself |
| Design target | AICore built-in operators (standard engine pipeline) | External custom operators (arbitrary kernel binary) |
| Kernel language | Ascend C / TBE (compiled by the framework) | Arbitrary (compiled by the user or via RTC) |

#### Design Evolution Motivation

![mermaid](https://web-api.gitcode.com/mermaid/svg/eNqFkE1P4zAQhu_8ipFPi9iq99UKCfIhoS0Nogs9VBVy7WkywrGtsVO2iB-_TtPweWAung_5nXeemqVvYHZ7AilCt6kPtaiMhnvkQM7-gpJli0-OH6HsjNnDtbSyRi0Of_qoZvlDXtyvROWRZXQMOe7QuFSJNUwm5y_iht2ONAaIZMjWcAahkR6B7BYZrULxcpAplytRFlDYmizCFHK5I6vo2mk0vzc8Pf-hXOvJ4E8IyCQNPfepalB3fVNqzRgCMG7T25yK9QeX5XKws4jSaskaPHlMhsb1f4rbeTFbiYurzDHCI7Id914EhVZDlkz9vSxG4dQ6-cRujk9v7O6SyyO2zNnIzoQ3bvNi-R23BZrthFpvsEUbAxT_UHURp9lAYboYIaQDernqZiWyLkTXghtFlZEhDEdcyoDDuPK95cPoPaRBY9idE6OKoAbfxwWvhHhDkSXvj5BgQzZVX1kxRWf7pCHWEy857t_B-w_L4M78)

1. **The old version is designed for the standard engine pipeline**: An operator must follow the standard TBE/AscendC → tiling → kernel launch path, and the framework fully handles compilation, serialization, and scheduling. This is efficient for built-in operators but cannot accommodate non-standard kernels (such as Triton npubin or third-party precompiled binaries).
2. **The new version hands control to the operator developer**: Virtual function interfaces expose control over execution, compilation, serialization, and address refresh, so that arbitrary kernel binaries can be integrated into GE. The cost is that the developer must implement these stages (although a language common layer can encapsulate them).
3. **The two are complementary rather than substitutes**: The old version suits standard AICore operators (fully managed by the framework, low development effort), while the new version suits non-standard kernels (fully controlled by the user, high flexibility). Both mechanisms coexist in GE, each serving different operator integration scenarios.

---

## 3. System Architecture

### 3.1 Architecture View

![mermaid](https://web-api.gitcode.com/mermaid/svg/eNp9VF1u4jAQfucUVp6oqt2qfUSrSiEYivhJBAF1hVaRSVzwNtiW7ZSlB9gD7BH3JDuxExoobB6SzHjm83zfjL1RRG5R3G0heHSx3ljb6yvBDeUZ6iuyo3uhXtGYHKjybFz5RPHKiw6xUOkW3SL79Zn6tlZ3j3E4C56S8bA782ffYTEV_I0qA9k_julxf-XFlGuh-rnYQ5CfEQkhFiBn67TQRuwSIfVXLWCZyyLRhZQCcDLr_qkFbwCG0-nzyivfaCIymqM-y6lFm-HBcB7jWRIs5nE4ScIISaI0VUjmxYY1QQY4mfrxcIlX3gCjKTHsjaKBlaRbsDxjfFNDljC3KJRUEQMkSArvQwUFwrXOBI3cjndnaOeqDmcrz9cpAAxnqL3g7IXRDLw3dluX-_f3HzQFim0hE3OQ9MZ6nJj21zdGsXVh6PVygF0gdhIkUucl4GjlYQ66UARFG2aY4CetC0CdwLbHhdnS2r3pNFlOKo1vmuGDsI639YfSsB17r1pd1ltV0m5mzYdJsPLmWyLpkL9QRXlKjwnWY9dOc3zIMIqSHfLzXKSkLN1VB8GKaqTd6p6ZLfJZIBRt5k-A-YTuoI8oygnndbfbVoeUSWIAg3GUiw1LSY52LljRQp8Ajbo141DqEVWc5rbhDc6TsIfHSezPR_VYjvBsisf_bRn-RdMC5uy8ZR9By3s0N8A7PcZ-RJXP8j4J4rq4mOhXkFJ0Oj2mq5E5EbTKwHCwypox2VDlcGkoO53q91LKAlrhq41eyAxEU52O-3kS2pTuk5SS6wUmD6h34GR3ncpDMoYRGYs9VdApR6k8F-fVPCT9wBHoM57VfUGvti-fg3EVXLG7Hn-57nCC5lAQydm7nb-7HtVN-5QE9H0OQ0vejvvo46GzZ7nG-qwypPZK-iS7kvqxMb2o93HGohh9-fIIt0zL3cxNy16oDft4Q9bOlru2rImjlrtBrBXglrsunDUIm2Z5vk9sv2lNoqY16rpdRl1r2hk-cZST0HSUqroUG1tn4efPvoXb10LUYP2gdRyc2oedr0Sut-j9Az4-F6k)

### 3.2 Relationship with Built-in Operators

Custom operators are distinguished from built-in operators by the independent custom engine DNN_VM_CUSTOM:

| Dimension | Built-in Operator | Custom Operator |
|------|---------|-----------|
| Engine | DNNEngine (AiCore / VectorCore / AICPU) | DNN_VM_CUSTOM |
| Operator registration | Operator repo REG_OP + engine OpsKernelInfoStore | REG_OP + REG_AUTO_MAPPING_OP + CustomOpsKernelInfoStore |
| Kernel build | TBE / AICPU KernelBuilder | CustomOpsKernelBuilder (generates MODEL_TASK_CUSTOM_KERNEL) |
| Compile optimization | GE graph optimization passes + engine-internal optimization | CustomGraphOptimizer (invokes the Compile callback) |
| Execution scheduling | Engine TaskInfo (TBE / AICPU) | CustomTaskInfo (invokes the Execute callback) |
| Stream allocation | Independent engine stream | Allocated together with the AiCore stream |

**Key Design Decision**: Custom operators are merged with AiCore engine nodes during the stream allocation phase (engine_partitioner.cc), which ensures that they participate correctly in multi-stream parallel scheduling instead of being isolated on an independent stream.

---

## 4. Core Component Design

### 4.1 Interface System

![mermaid](https://web-api.gitcode.com/mermaid/svg/eNqNkk1PwzAMhu_7FTmhDLQ7h2oStJPgVKTCGXnB6yK1TZS4aHz1t9O025rCojYnO3le-5VjUYC1iYTcQLlg7RHugt2Dxbi2pMpUs6_uwZ0ogq0lA4LW6_PlTePTfNk9_Cy8ahvI0WwOKGpCVy-KZEVodiDQr3MkeIenuk-lqmLV0ge6XrLWpd5nBFTb_12yPWh8rHZOO_J8sVkHdhI-hOFOgyYBguePk-yUzfQYq1LLArYFzvHY08hTfYxmdnlShub2yNBIKOQn8ncUpExTt9jtKzVXgSEkaM8SoSpLbFo4cndncvui36A1M22vBx-UJSfjferCyVGMVjj6Xq3-rGEA8rcogPifGECGHwgA3hB-AQcTF-E)

**Design Principles:**

- **Capability Combination**: Developers inherit only the interfaces they need and are not forced to implement all of them. GE detects at runtime which capabilities an operator has through dynamic_cast.
- **Orthogonal Design**: Each interface corresponds to an independent callback timing; there is no coupling between interfaces.
- **Context Isolation**: Each callback receives a dedicated Context object that exposes only the information the operator needs.

**Capability Detection Mechanism:**

```cpp
auto *base = CustomOpFactory::CreateOrGetCustomOp(op_type);
auto *compilable = dynamic_cast<CompilableOp *>(base);
if (compilable != nullptr) {
  compilable->Compile(ctx);
}
```

### 4.2 Registration and Factory Mechanism

**Instantiation Strategy:** `CustomOpFactoryImpl` adopts a **lazy-load singleton** pattern for each op type.
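
The interplay between one-time registration, lazy per-type instantiation, and `dynamic_cast` capability detection can be illustrated with the following self-contained sketch. It is a simplified model under stated assumptions, not the code of `CustomOpFactoryImpl`: `SketchFactory`, `ExecOnlyOp`, `FullOp`, `MakeExecOnlyOp`, `MakeFullOp`, and `CompileIfSupported` are hypothetical, the capability interfaces are modeled as virtual sub-interfaces with parameterless callbacks, and a rejected registration returns `false` rather than a GE status code.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-ins for the interfaces named in this document;
// the real signatures in custom_op.h may differ.
struct BaseCustomOp {
  virtual ~BaseCustomOp() = default;
};
// Assumption: capabilities are modeled as virtual sub-interfaces.
struct EagerExecuteOp : virtual BaseCustomOp {
  virtual int Execute() = 0;
};
struct CompilableOp : virtual BaseCustomOp {
  virtual int Compile() = 0;
};

using Creator = std::function<std::unique_ptr<BaseCustomOp>()>;

class SketchFactory {
 public:
  // One-time registration: a second creator for the same op type is rejected.
  bool Register(const std::string &op_type, Creator creator) {
    std::lock_guard<std::mutex> lock(mu_);
    return creators_.emplace(op_type, std::move(creator)).second;
  }

  // Lazy-load singleton: created on first request, then shared by all callers.
  BaseCustomOp *CreateOrGet(const std::string &op_type) {
    std::lock_guard<std::mutex> lock(mu_);
    const auto found = instances_.find(op_type);
    if (found != instances_.end()) {
      return found->second.get();
    }
    const auto creator = creators_.find(op_type);
    if (creator == creators_.end()) {
      return nullptr;  // op type was never registered
    }
    auto &slot = instances_[op_type];
    slot = creator->second();
    return slot.get();
  }

 private:
  std::mutex mu_;
  std::map<std::string, Creator> creators_;
  std::map<std::string, std::unique_ptr<BaseCustomOp>> instances_;
};

// Implements execution only; it never opts into compilation.
struct ExecOnlyOp : EagerExecuteOp {
  int Execute() override { return 0; }
};
// Implements both execution and compilation.
struct FullOp : EagerExecuteOp, CompilableOp {
  int Execute() override { return 0; }
  int Compile() override { return 0; }
};

inline std::unique_ptr<BaseCustomOp> MakeExecOnlyOp() {
  return std::unique_ptr<BaseCustomOp>(new ExecOnlyOp());
}
inline std::unique_ptr<BaseCustomOp> MakeFullOp() {
  return std::unique_ptr<BaseCustomOp>(new FullOp());
}

// Capability detection: a missing interface means the step is skipped.
inline bool CompileIfSupported(BaseCustomOp *op) {
  auto *compilable = dynamic_cast<CompilableOp *>(op);
  if (compilable == nullptr) {
    return false;
  }
  return compilable->Compile() == 0;
}
```

Because `CreateOrGet` hands the same pointer to every caller, any member variable of the operator class is effectively shared state, which is why concurrent callbacks on it need their own synchronization.
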
All graph nodes of the same op type share the same `BaseCustomOp` instance.

**Design Constraints:**

- Member variables of the implementation class are shared across nodes (such as `device_elves_map`)
- The `Compile` callback may be called concurrently (`CustomGraphOptimizer` uses a thread pool), so the implementation must be thread-safe
- Registration is one-time; registering the same op type again returns `GRAPH_FAILED`

### 4.3 SO Loading Mechanism

**Loading Limits (enforced by PluginManager):**

| Limit Item | Upper Limit |
|------|------|
| .so file count | 64 |
| Single .so size | 800 MB |
| Total loading size | 1000 MB |

### 4.4 Custom Engine (DNN_VM_CUSTOM)

Custom operators enter the GE compilation flow through independent engine components:

| Component | Responsibility | Key File |
|------|------|------|
| CustomOpsKernelInfoStore | Queries the registered op types at initialization and generates OpInfo | compiler/engines/custom_engine/custom_ops_kernel_info_store.cc |
| CustomGraphOptimizer | Traverses custom operator nodes in parallel and invokes the Compile callback | compiler/engines/custom_engine/custom_graph_optimizer.cc |
| CustomOpsKernelBuilder | Generates the MODEL_TASK_CUSTOM_KERNEL TaskDef | compiler/engines/custom_engine/custom_ops_kernel_builder.cc |

---

## 5. Key Flows

### 5.1 SO Loading and Registration Flow

### 5.2 Compilation Phase Flow

### 5.3 Execution Phase Flow

**V1 Static Executor (Known Shape):**

**V2 Dynamic Executor (Unknown Shape / RT2.0):**

### 5.4 Serialization/Deserialization Flow

**Serialization Mutual Exclusion Constraint**: A graph cannot contain both custom operators that implement `PortableOp` and custom operators that do not.

---

## 6. Frontend Access Architecture

### 6.1 GE Native Graph Building

This is the simplest path. The graph building side directly includes the REG_OP proto header file and creates nodes through `OperatorFactory::CreateOperator`.

### 6.2 PyTorch + TorchAir

### 6.3 TensorFlow

Unlike GE native graph building, in the TensorFlow scenario `REG_AUTO_MAPPING_OP` can automatically generate the GE operator prototype from the TF operator prototype, so the developer does not need to write `REG_OP` separately.

### 6.4 ONNX

The ONNX parser plugin registers through `REGISTER_CUSTOM_OP` and is automatically collected into `OpRegistry` during dlopen.
The plugin needs to implement `ParseParamsFn`, which maps ONNX `NodeProto` attributes to GE `Operator` attributes.

---

## 7. Design Constraints and Invariants

| Constraint | Description | Impact |
|------|------|------|
| Singleton instance per op type | `CreateOrGetCustomOp` creates a unique instance for each op type | Member variables are shared across nodes; concurrent `Compile` calls need thread safety |
| dynamic_cast capability detection | GE determines which interfaces an operator supports through `dynamic_cast` | Unimplemented interfaces are skipped automatically without affecting other flows |
| One-time registration | `RegisterCustomOpCreator` rejects duplicate registration | The same op type name cannot be registered twice in one process |
| Serialization mutual exclusion | A graph cannot mix serializable and non-serializable custom operators | In the OM sink scenario, all custom operators must implement `PortableOp` |
| SO loading limit | At most 64 .so files, each ≤ 800 MB, total ≤ 1000 MB | A large number of custom operators must be merged into a few .so files |
| Shared AiCore stream | Custom operators are merged with AiCore at stream allocation | Custom operators can participate in AiCore multi-stream parallelism |
| Address refresh | Operators implementing `ArgsUpdater` take the reserved memory allocation path | Zero-copy scenarios need to implement `ArgsUpdater` |

---

## 8. Cross-feature Analysis

| Scenario | Applicability | Analysis |
|------|------|------|
| Static Shape | Applicable | Stage 2.1 sink scheduling eliminates host overhead. `CustomTaskInfo::Distribute()` calls Execute in the DavinciModel SinkTask flow. Output tensor size participates in logical memory reuse. |
| Dynamic Shape | Applicable | Core scenario of Stage 1. The V2 executor generates the `FindCustomOp` + `ExecuteCustomOp` kernels through `LoweringCustomNode`, and execution is scheduled by the host at runtime. |
| Dynamic Shape + Static Subgraph | Applicable | The static subgraph takes the V1 `DavinciModelKernel` path, and custom operators execute through `CustomTaskInfo::Distribute()`, consistent with the pure static shape scenario. |
| Offline Scenario (atc compile) | Applicable | Covered by Stage 3. ATC compilation invokes the Compile + Serialize callbacks; OM loading invokes the Deserialize + Execute callbacks. |
| Online Scenario (Framework Adaptation) | Applicable | PyTorch/TorchAir and TensorFlow map to GE custom operator nodes through their respective adapters and follow the unified compile + execute flow. |

---

## 9. Non-functional Requirements

### 9.1 Performance

| Stage | Compile Phase Impact | Execute Phase Impact |
|------|------|------|
| Stage 1 (host scheduling) | `CustomOpsKernelInfoStore::Initialize` traverses the registered op types, O(n) | Each custom operator node incurs host-side Execute call overhead |
| Stage 2.1 (sink) | No extra compile overhead | Eliminates host scheduling overhead; the kernel executes directly in the sink stream |
| Stage 2.2 (full) | `CustomGraphOptimizer` invokes the Compile callback in parallel, which increases compile time | Shape inference and memory reuse benefit the execute phase |
| Stage 3 (offline OM) | Serialize increases OM save time | Deserialize increases OM load time; no extra overhead in the execute phase |

### 9.2 Compatibility

- **OM Forward Compatibility**: A new GE version can load custom operator serialized data from an OM produced by an old version
- **Interface Compatibility**: `BaseCustomOp` and its sub-interfaces consist entirely of pure virtual functions; adding new interfaces does not affect existing implementations

### 9.3 Maintainability

- Custom operator code is maintained entirely outside the GE repo and loaded dynamically through .so files
- Interface changes require a synchronized update of the `custom_op.h` header file, which is part of the GE public API
- Context class extensions must maintain POD layout compatibility
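
As a worked illustration of the SO loading limits from section 4.3 (at most 64 files, each ≤ 800 MB, total ≤ 1000 MB), the sketch below expresses them as a simple admission check over a list of file sizes. It is not PluginManager's code: `CheckSoLoadLimits` and `LoadCheck` are hypothetical names, the sketch assumes that "MB" means 2^20 bytes, and it only reports which limit is violated, whereas how GE reacts to a violation is not specified here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Limits taken from the table in section 4.3.
constexpr std::size_t kMaxSoCount = 64;
constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;  // assumption: 1 MB = 2^20 bytes
constexpr std::uint64_t kMaxSingleSoBytes = 800 * kMiB;
constexpr std::uint64_t kMaxTotalSoBytes = 1000 * kMiB;

enum class LoadCheck { kOk, kTooManyFiles, kFileTooLarge, kTotalTooLarge };

// Returns the first violated limit for a candidate set of .so file sizes (in bytes).
inline LoadCheck CheckSoLoadLimits(const std::vector<std::uint64_t> &so_sizes) {
  if (so_sizes.size() > kMaxSoCount) {
    return LoadCheck::kTooManyFiles;
  }
  std::uint64_t total = 0;
  for (const std::uint64_t size : so_sizes) {
    if (size > kMaxSingleSoBytes) {
      return LoadCheck::kFileTooLarge;
    }
    total += size;
    if (total > kMaxTotalSoBytes) {
      return LoadCheck::kTotalTooLarge;
    }
  }
  return LoadCheck::kOk;
}
```

For example, two 600 MB libraries each pass the per-file limit but together exceed the total limit, which is the situation the "merge into a few .so files" guidance has to be balanced against.
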