CN103916326B - System, method and equipment for data center - Google Patents

Info

Publication number
CN103916326B
Authority
CN
China
Prior art keywords
queue
module
data
peripheral processor
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410138824.5A
Other languages
Chinese (zh)
Other versions
CN103916326A (en)
Inventor
P·辛德胡
G·艾贝
J-M·弗爱龙
A·文卡特马尼
Q·沃赫拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peribit Networks Inc
Original Assignee
Peribit Networks Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US12/242,224 (US8154996B2)
Priority claimed from US12/343,728 (US8325749B2)
Priority claimed from US12/345,502 (US8804711B2)
Priority claimed from US12/345,500 (US8804710B2)
Priority claimed from US12/495,337 (US8730954B2)
Priority claimed from US12/495,358 (US8335213B2)
Priority claimed from US12/495,344 (US20100061367A1)
Priority claimed from US12/495,361 (US8755396B2)
Priority claimed from US12/495,364 (US9847953B2)
Application filed by Peribit Networks Inc
Publication of CN103916326A
Publication of CN103916326B
Application granted
Legal status: Active
Anticipated expiration

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Systems, methods and apparatus for a data center. In one embodiment of the present disclosure, an apparatus includes a first edge device that can have a packet processing module. The first edge device is configured to receive a packet, and its packet processing module can be configured to produce a plurality of cells based on the packet. A second edge device has a packet processing module configured to reassemble the packet based on the plurality of cells. A multi-stage switch fabric can be coupled to the first edge device and the second edge device. The multi-stage switch fabric can define a single logical entity and can have a plurality of switch modules, each of which has a shared memory device. The multi-stage switch fabric can be configured to switch the plurality of cells so that the cells are sent to the second edge device.

Description

Systems, methods and apparatus for a data center

This application is a divisional of the Chinese patent application filed on September 11, 2009, with application number 200910246898.X, entitled "System, Method and Equipment for Data Center".

Cross-Reference to Related Applications

This patent application claims priority to and the benefit of U.S. Patent Application No. 61/098,516, entitled "Systems, Apparatus and Methods for a Data Center," filed September 19, 2008, and of U.S. Patent Application No. 61/096,209, entitled "Methods and Apparatus Related to Flow Control within a Data Center," filed September 11, 2008. Both are incorporated herein by reference in their entirety.

This patent application is a continuation-in-part of U.S. Patent Application No. 12/343,728, entitled "Methods and Apparatus for Transmission of Groups of Cells via a Switch Fabric," filed December 24, 2008; a continuation-in-part of U.S. Patent Application No. 12/345,500, entitled "System Architecture for a Scalable and Distributed Multi-Stage Switch Fabric," filed December 29, 2008; a continuation-in-part of U.S. Patent Application No. 12/345,502, entitled "Methods and Apparatus Related to a Modular Switch Architecture," filed December 29, 2008; a continuation-in-part of U.S. Patent Application No. 12/242,224, entitled "Methods and Apparatus for Flow Control Associated with Multi-Stage Queues," filed September 30, 2008, which claims priority to and the benefit of U.S. Patent Application No. 61/096,209, entitled "Methods and Apparatus Related to Flow Control within a Data Center," filed September 11, 2008; and a continuation-in-part of U.S. Patent Application No. 12/242,230, entitled "Methods and Apparatus for Flow-Controllable Multi-Staged Queues," filed September 30, 2008, which claims priority to and the benefit of U.S. Patent Application No. 61/096,209, entitled "Methods and Apparatus Related to Flow Control within a Data Center," filed September 11, 2008. Each of the above-identified applications is incorporated herein by reference in its entirety.

This patent application is also a continuation-in-part of U.S. Patent Application No. 12/495,337, entitled "Methods and Apparatus Related to Any-to-Any Connectivity within a Data Center," filed June 30, 2009; a continuation-in-part of U.S. Patent Application No. 12/495,344, entitled "Methods and Apparatus Related to Lossless Operation within a Data Center," filed June 30, 2009; a continuation-in-part of U.S. Patent Application No. 12/495,358, entitled "Methods and Apparatus Related to Low Latency within a Data Center," filed June 30, 2009; a continuation-in-part of U.S. Patent Application No. 12/495,361, entitled "Methods and Apparatus Related to Flow Control within a Data Center Switch Fabric," filed June 30, 2009; and a continuation-in-part of U.S. Patent Application No. 12/495,364, entitled "Methods and Apparatus Related to Virtualization of Data Center Resources," filed June 30, 2009. Each of the above-identified applications is incorporated herein by reference in its entirety.

Technical Field

Embodiments relate generally to data center equipment, and more specifically to architectures, apparatus and methods for data center systems having a switch core and edge devices.

Background

Known architectures for data center systems involve unwieldy and complex approaches that increase the cost and latency of such systems. For example, some known data center networks consist of three or more layers of switches, with Ethernet and/or Internet Protocol (IP) packet processing performed at each layer. Packet processing and queuing overhead is unnecessarily repeated at each layer, which directly increases cost and end-to-end latency. Similarly, such known data center networks typically do not scale cost-effectively: for a given data center system, an increase in the number of servers often requires additional ports, so more devices must be added at each layer of the data center system. Such poor scalability increases the cost of these data center systems.

Thus, a need exists for improved data center systems, including improved architectures, apparatus and methods.

Summary

In one embodiment, a communication device includes a first edge device that can have a packet processing module. The first edge device can be configured to receive a packet, and its packet processing module can be configured to produce a plurality of cells based on the packet. A second edge device can have a packet processing module configured to reassemble the packet based on the plurality of cells. A multi-stage switch fabric can be coupled to the first edge device and the second edge device. The multi-stage switch fabric can define a single logical entity and can have a plurality of switch modules, each of which has a shared memory device. The multi-stage switch fabric can be configured to switch the plurality of cells so that the cells are sent to the second edge device.
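The segmentation of a packet into cells at the first edge device and its reassembly at the second edge device can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the fixed 64-byte cell payload and the `packet_id`, `seq` and `last` fields are assumptions introduced here, since the text does not specify a cell format.

```python
from dataclasses import dataclass

# Assumed fixed payload size per cell; the text does not specify one.
CELL_PAYLOAD_BYTES = 64


@dataclass(frozen=True)
class Cell:
    packet_id: int   # hypothetical identifier tying a cell to its packet
    seq: int         # position of the cell within the packet
    last: bool       # True for the final cell of the packet
    payload: bytes


def segment(packet_id: int, packet: bytes) -> list:
    """Split a packet into cells, as the ingress edge device might."""
    chunks = [packet[i:i + CELL_PAYLOAD_BYTES]
              for i in range(0, len(packet), CELL_PAYLOAD_BYTES)] or [b""]
    return [Cell(packet_id, seq, seq == len(chunks) - 1, chunk)
            for seq, chunk in enumerate(chunks)]


def reassemble(cells: list) -> bytes:
    """Rebuild the packet from its cells, as the egress edge device might."""
    ordered = sorted(cells, key=lambda c: c.seq)
    complete = (bool(ordered)
                and ordered[-1].last
                and [c.seq for c in ordered] == list(range(len(ordered))))
    if not complete:
        raise ValueError("incomplete cell set")
    return b"".join(c.payload for c in ordered)
```

Sorting by sequence number lets the egress side tolerate cells that arrive out of order; whether the fabric described here can reorder cells is not stated in the text.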

According to one aspect of the embodiments of the present disclosure, a communication device is provided that includes a switch core defining a single logical entity and having a multi-stage switch fabric physically distributed across a plurality of racks. The multi-stage switch fabric has a plurality of input ports and a plurality of output ports. The switch core is configured to be coupled to a plurality of peripheral processing devices via the input ports and the output ports, and to provide non-blocking connectivity at line rate between a first peripheral processing device disposed in a first rack and a second peripheral processing device disposed in a second rack.

According to one embodiment of the present disclosure, the plurality of peripheral processing devices includes at least one peripheral processing device having virtualized resources and at least one peripheral processing device having no virtualized resources.

According to one embodiment of the present disclosure, the input ports and the output ports number more than 1,000, and each input port and each output port is configured to operate at no less than 10 Gb/s.

According to one embodiment of the present disclosure, each of the first peripheral processing device and the second peripheral processing device is one of a storage node device, a compute node device, a service node device, or a router.

According to one embodiment of the present disclosure, the plurality of peripheral processing devices includes a third peripheral processing device, and the switch core is configured to provide non-blocking connectivity at line rate between the second peripheral processing device and the third peripheral processing device. The switch core is configured to receive a first packet associated with the first peripheral processing device and, based on cells associated with the first packet, to serially send a second packet to the second peripheral processing device and a third packet to the third peripheral processing device. The multi-stage switch fabric is configured to send the cells from an input port of the plurality of input ports to an output port of the plurality of output ports.

According to one embodiment of the present disclosure, each of the first peripheral processing device and the third peripheral processing device is one of a storage node device, a compute node device, a service node device, or a router, and the second peripheral processing device is at least one of a firewall device, an intrusion detection device, or a load balancing device.

According to another aspect of the embodiments of the present disclosure, a communication device is provided that includes a switch core having a multi-stage switch fabric physically distributed across a plurality of racks. The multi-stage switch fabric has a plurality of input ports and a plurality of output ports. The switch core is configured to be coupled to a plurality of peripheral processing devices via the input ports and the output ports. The switch core is configured to provide each peripheral processing device with connectivity at line rate to every other peripheral processing device of the plurality, such that each output port is equally accessible to each peripheral processing device via one of the input ports.

According to one embodiment of the present disclosure, the plurality of peripheral processing devices includes at least one peripheral processing device coupled to the switch core via an Ethernet connection and at least one peripheral processing device coupled to the switch core via a non-Ethernet connection.

According to one embodiment of the present disclosure, the plurality of peripheral processing devices includes at least one peripheral processing device that uses Layer 3 routing and at least one peripheral processing device that is a Layer 4 through Layer 7 device.

According to another aspect of the embodiments of the present disclosure, a communication device is provided that includes a switch core defining a single logical entity and having a multi-stage switch fabric. The multi-stage switch fabric has a plurality of stages physically distributed across a plurality of racks, and the stages collectively have a plurality of input ports and a plurality of output ports. The switch core is configured to be coupled to a plurality of peripheral processing devices via the input ports and the output ports. The switch core is configured to admit a plurality of cells associated with a packet into an input port of the plurality of input ports when delivery of the cells through the multi-stage switch fabric can be substantially guaranteed without loss.

According to one embodiment of the present disclosure, the plurality of peripheral processing devices includes a first peripheral processing device configured to communicate using a Fibre Channel protocol and a second peripheral processing device configured to communicate using a Fibre Channel over Ethernet protocol.

According to one embodiment of the present disclosure, the multi-stage switch fabric is configured as a deterministic network.

According to one embodiment of the present disclosure, the multi-stage switch fabric is configured as a deterministic network such that the multi-stage switch fabric admits the packet into the input port when the plurality of cells can be delivered to an output port of the plurality of output ports at a predetermined time.
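The admission rule described above (a packet enters the fabric only when its cells can be delivered at a predetermined time) can be illustrated with a small scheduling model. This is a sketch under stated assumptions, not the disclosed mechanism: the discrete time slots, the one-cell-per-slot output port, and the `OutputPortSchedule` and `try_admit` names are all hypothetical.

```python
class OutputPortSchedule:
    """Tracks which future time slots of one output port are already reserved.

    Assumption for illustration: the port delivers one cell per time slot.
    """

    def __init__(self):
        self.reserved = set()

    def try_admit(self, num_cells, start_slot, deadline_slot):
        """Reserve one slot per cell in [start_slot, deadline_slot].

        Returns the granted slots, or None when delivery by the deadline
        cannot be guaranteed, in which case the packet is not admitted
        and no slot is reserved.
        """
        free = [slot for slot in range(start_slot, deadline_slot + 1)
                if slot not in self.reserved]
        if len(free) < num_cells:
            return None
        granted = free[:num_cells]
        self.reserved.update(granted)
        return granted
```

For example, with four slots available (0 through 3), a three-cell packet is admitted and takes slots 0 to 2; a following two-cell packet with the same deadline is refused, while a one-cell packet still fits in slot 3.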

According to one embodiment of the present disclosure, the switch core is configured to send the plurality of cells associated with the packet from the input port to a first output port and a second output port of the plurality of output ports, without performing packet loss processing at at least one of the stages of the multi-stage switch fabric.

According to one embodiment of the present disclosure, the switch core includes a plurality of edge devices coupled to the multi-stage switch fabric via the input ports and the output ports. The edge devices are coupled to the plurality of peripheral processing devices, and each edge device is configured to receive the packet and to define the plurality of cells based on the packet.

According to one embodiment of the present disclosure, the switch core is configured to send the plurality of cells associated with the packet from the input port to an output port of the plurality of output ports via the stages of the multi-stage switch fabric, without performing packet loss processing at at least one of the stages.

According to another aspect of the embodiments of the present disclosure, a communication device is provided that includes a switch core defining a single logical entity and having a switch fabric. The switch fabric has a plurality of stages physically distributed across a plurality of racks, and has a plurality of input ports and a plurality of output ports. The switch core is configured to be coupled to a plurality of peripheral processing devices via the input ports and the output ports. The switch core is configured to receive a packet from an input port of the plurality of input ports and to send a plurality of cells associated with the packet from the input port to an output port of the plurality of output ports via the stages, without performing packet loss processing at at least one of the stages of the switch fabric.

According to one embodiment of the present disclosure, the multi-stage switch fabric is configured as a deterministic network such that a packet from an input port of the plurality of input ports is admitted only when delivery of the plurality of cells associated with the packet within the switch fabric can be substantially guaranteed without loss.

According to one embodiment of the present disclosure, the output port is a first output port, and the switch core is configured to send the plurality of cells associated with the packet from the input port to the first output port and to a second output port of the plurality of output ports.

According to one embodiment of the present disclosure, the switch core includes a plurality of edge devices coupled to the multi-stage switch fabric via the input ports and the output ports. The edge devices are coupled to the plurality of peripheral processing devices, and each edge device is configured to receive the packet and to define the plurality of cells based on the packet.

According to another aspect of the embodiments of the present disclosure, a communication device is provided that includes a switch core defining a single logical entity and having a multi-stage switch fabric configured as a deterministic network. The multi-stage switch fabric has a plurality of input ports and a plurality of output ports. The switch core is configured to be coupled to a plurality of peripheral processing devices via the input ports and the output ports. The switch core is configured to receive a packet from an input port of the plurality of input ports and to send a plurality of cells associated with the packet from the input port to an output port of the plurality of output ports.

According to one embodiment of the present disclosure, the multi-stage switch fabric is physically distributed across a plurality of racks.

According to one embodiment of the present disclosure, the multi-stage switch fabric is configured as a deterministic network such that a packet from an input port of the plurality of input ports is admitted only when delivery of the plurality of cells associated with the packet within the switch fabric can be substantially guaranteed without loss.

According to one embodiment of the present disclosure, the multi-stage switch fabric is configured as a deterministic network such that the switch core admits a packet from an input port of the plurality of input ports when the plurality of cells associated with the packet can be delivered to an output port of the plurality of output ports at a predetermined time.

According to one embodiment of the present disclosure, the switch core includes a plurality of edge devices coupled to the multi-stage switch fabric via the input ports and the output ports. The edge devices are coupled to the plurality of peripheral processing devices, and each edge device is configured to receive the packet and to define the plurality of cells based on the packet.

According to one embodiment of the present disclosure, the switch core is configured to send the plurality of cells associated with the packet from the input port to the output port via a plurality of stages of the multi-stage switch fabric, without performing packet loss processing at at least one of the stages.

According to another aspect of the embodiments of the present disclosure, a communication device is provided that includes a switch core and a controller. The switch core has a multi-stage switch fabric physically distributed among a plurality of racks, the multi-stage switch fabric has a plurality of input buffers and a plurality of output ports, and the switch core is configured to be coupled to a plurality of edge devices. The controller is implemented in hardware without software during operation, and with software during configuration and monitoring. The controller is coupled to the input buffers and the output ports, and is configured to send a flow control signal to an input buffer of the plurality of input buffers when congestion at an output port of the plurality of output ports is anticipated and before congestion within the switch core occurs.
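The idea of signaling an input buffer when congestion is anticipated, rather than after it occurs, can be modeled in a few lines. The text describes a controller that runs in hardware during operation, so the following is only a software model for illustration. The linear projection of queue depth, the rate and capacity fields, and all names are assumptions introduced here; the text does not say how congestion is predicted.

```python
from dataclasses import dataclass


@dataclass
class OutputPort:
    capacity_cells: int         # assumed queue capacity of the output port
    depth_cells: int = 0        # cells currently queued
    arrival_rate: float = 0.0   # measured cells per tick arriving
    drain_rate: float = 0.0     # measured cells per tick leaving

    def congestion_anticipated(self, horizon_ticks: int) -> bool:
        """Project queue depth linearly and compare it with capacity."""
        growth = (self.arrival_rate - self.drain_rate) * horizon_ticks
        return self.depth_cells + growth > self.capacity_cells


@dataclass
class InputBuffer:
    paused: bool = False        # True while a flow-control signal is in effect


def run_flow_control(port: OutputPort, sources: list, horizon_ticks: int) -> bool:
    """Signal every input buffer feeding `port` if congestion is anticipated."""
    anticipated = port.congestion_anticipated(horizon_ticks)
    for buf in sources:
        buf.paused = anticipated
    return anticipated
```

With a queue of 60 cells out of 100, an arrival rate of 12 cells per tick and a drain rate of 4, the projection over 10 ticks is 140 cells, so the input buffer is paused while the queue itself is still below capacity.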

According to one embodiment of the present disclosure, the controller is configured to perform end-to-end flow control on the input buffer and the output port independently of intra-fabric flow control for the multi-stage switch fabric of the switch core.

According to one embodiment of the present disclosure, the controller is configured to perform end-to-end flow control on the input buffer and the output port independently of flow control for the plurality of edge devices.

According to one embodiment of the present disclosure, the communication device further includes a plurality of peripheral processing devices configured to be coupled to the plurality of edge devices, and the controller is configured to perform end-to-end flow control on the input buffer and the output port independently of flow control for the plurality of edge devices.

According to one embodiment of the present disclosure, the controller is configured to perform end-to-end flow control such that a cell is buffered at the input buffer for a period of time before being sent to the output port, the period of time being associated with the end-to-end flow control.

According to one embodiment of the present disclosure, the controller is configured to perform end-to-end flow control on cells buffered at the input buffer independently of cell segments buffered at a stage of the multi-stage switch fabric and independently of packets buffered at an edge device of the plurality of edge devices.

According to one embodiment of the present disclosure, the controller is configured to perform end-to-end flow control on cells buffered at the input buffer independently of flow control mechanisms associated with Ethernet.

According to another aspect of the embodiments of the present disclosure, a communication device is provided that includes a switch core, a plurality of edge devices, and a controller. The switch core has a multi-stage switch fabric physically distributed among a plurality of racks, and the multi-stage switch fabric is configured to receive a plurality of cells associated with a packet and to switch a plurality of cell segments based on the plurality of cells. The edge devices are coupled to the switch core; one of the edge devices is configured to receive the packet and to send the plurality of cells to the multi-stage switch fabric. The controller is coupled to the multi-stage switch fabric and is configured to perform flow control on the plurality of cells independently of flow control for the plurality of edge devices and of intra-fabric flow control for the multi-stage switch fabric.

According to one embodiment of the present disclosure, the controller is implemented in hardware without software during operation, and with software during configuration and monitoring.

According to one embodiment of the present disclosure, the multi-stage switch fabric has a plurality of input buffers and a plurality of output ports, and the controller is configured to send a flow control signal to an input buffer of the plurality of input buffers when congestion at an output port of the plurality of output ports is anticipated and before congestion within the switch core occurs.

According to one embodiment of the present disclosure, the multi-stage switch fabric has a plurality of input buffers and a plurality of output ports, and the controller is configured to perform end-to-end flow control on cells buffered at an input buffer of the plurality of input buffers independently of flow control mechanisms associated with Ethernet.

According to another aspect of the embodiments of the present disclosure, a communication device is provided that includes a switch core having a multi-stage switch fabric, a first plurality of peripheral processing devices, and a second plurality of peripheral processing devices. The first plurality of peripheral processing devices is coupled to the multi-stage switch fabric through a plurality of connections having a protocol. Each peripheral processing device of the first plurality is a storage node having virtualized resources, and the virtualized resources of the first plurality collectively define a virtual storage resource interconnected by the switch core. The second plurality of peripheral processing devices is coupled to the multi-stage switch fabric through a plurality of connections having a protocol. Each peripheral processing device of the second plurality is a storage node having virtualized resources, and the virtualized resources of the second plurality collectively define a virtual compute resource interconnected by the switch core.

According to one embodiment of the present disclosure, each peripheral processing device from the first plurality of peripheral processing devices has virtualized resources and is configured such that its virtualized resources can be replaced by virtualized resources from the remaining peripheral processing devices of the first plurality of peripheral processing devices; and each peripheral processing device from the second plurality of peripheral processing devices has virtualized resources and is configured such that its virtualized resources can be replaced by virtualized resources from the remaining peripheral processing devices of the second plurality of peripheral processing devices.

According to one embodiment of the present disclosure, the first plurality of peripheral processing devices is associated with a packet-based communication protocol and with a security protocol; and the second plurality of peripheral processing devices is associated with a packet-based communication protocol and with a security protocol.

According to another aspect of embodiments of the present disclosure, a communication apparatus is provided, including: a switch core having a multi-stage switch fabric, the switch core being configured to be logically partitioned into a first virtual switch core and a second virtual switch core; and a plurality of peripheral processing devices coupled to the multi-stage switch fabric, the plurality of peripheral processing devices having a first subset of peripheral processing devices operatively coupled to the first virtual switch core and a second subset of peripheral processing devices operatively coupled to the second virtual switch core.

According to one embodiment of the present disclosure, the switch core is configured such that the first virtual switch core and the second virtual switch core are administratively managed independently of each other.

According to one embodiment of the present disclosure, the switch core is configured such that the first virtual switch core has a bandwidth that is independent of the bandwidth of the second virtual switch core.

According to one embodiment of the present disclosure, the switch core is configured such that the first virtual switch core has a bandwidth and administrative management that are independent of the bandwidth and administrative management of the second virtual switch core.

According to one embodiment of the present disclosure, the switch core is configured such that the first virtual switch core operates using a Layer 2 protocol, and the second virtual switch core operates using a Layer 2 protocol and a Layer 3 protocol.
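The logical partitioning described in the preceding embodiments can be pictured with the following sketch. It is an assumption-laden illustration rather than the disclosed mechanism: the classes, the bandwidth accounting in Gb/s, and the port names are all hypothetical. Each virtual switch core receives its own bandwidth share, its own set of protocol layers, and its own subset of ports.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualSwitchCore:
    name: str
    bandwidth_gbps: int      # share reserved for this virtual core
    layers: frozenset        # protocol layers it operates at, e.g. {2} or {2, 3}
    ports: set = field(default_factory=set)


class SwitchCore:
    """One physical switch core that is logically divided into virtual cores."""

    def __init__(self, total_bandwidth_gbps: int):
        self.total = total_bandwidth_gbps
        self.virtual = {}

    def partition(self, name, bandwidth_gbps, layers):
        allocated = sum(v.bandwidth_gbps for v in self.virtual.values())
        if allocated + bandwidth_gbps > self.total:
            raise ValueError("not enough bandwidth left in the physical core")
        vsc = VirtualSwitchCore(name, bandwidth_gbps, frozenset(layers))
        self.virtual[name] = vsc
        return vsc

    def attach(self, name, port):
        # A port (and the peripheral device behind it) belongs to one subset only.
        if any(port in v.ports for v in self.virtual.values()):
            raise ValueError("port already attached to a virtual switch core")
        self.virtual[name].ports.add(port)
```

For example, `core.partition("vsc1", 40, {2})` and `core.partition("vsc2", 60, {2, 3})` would model a Layer 2 virtual core next to a Layer 2/Layer 3 virtual core whose bandwidths do not affect each other.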

According to one embodiment of the present disclosure, the first subset of peripheral processing devices has virtual resources, and the second subset of peripheral processing devices has virtual resources.

According to one embodiment of the present disclosure, the first subset of peripheral processing devices includes a peripheral processing device that is one of a compute node, a storage node, a service node device, and a router, and includes a peripheral processing device that is another one of a compute node, a storage node, a service node device, and a router; and the second subset of peripheral processing devices includes a peripheral processing device that is one of a compute node, a storage node, a service node device, and a router, and includes a peripheral processing device that is another one of a compute node, a storage node, a service node device, and a router.

Brief Description of the Drawings

FIG. 1 is a system block diagram of a data center (DC), according to one embodiment.

FIG. 2 is a schematic diagram that illustrates an example of a portion of a data center having any-to-any connectivity, according to one embodiment.

FIG. 3 is a schematic diagram that illustrates logical groups of resources associated with a data center, according to one embodiment.

FIG. 4A is a schematic diagram that illustrates a switch fabric that can be included in a switch core, according to one embodiment.

FIG. 4B is a schematic diagram that illustrates a switching table that can be stored in the memory module shown in FIG. 4A, according to one embodiment.

FIG. 5A is a schematic diagram that illustrates a switch fabric system, according to one embodiment.

FIG. 5B is a schematic diagram that illustrates an input/output module, according to one embodiment.

FIG. 6 is a schematic diagram that illustrates a portion of the switch fabric system of FIG. 5A, according to one embodiment.

FIG. 7 is a schematic diagram that illustrates a portion of the switch fabric system of FIG. 5A, according to one embodiment.

FIGS. 8 and 9 show a front view and a rear view, respectively, of a housing used to enclose a switch fabric, according to one embodiment.

FIG. 10 shows a portion of the housing of FIG. 8, according to one embodiment.

FIGS. 11 and 12 are schematic diagrams that illustrate a switch fabric in a first configuration and a second configuration, respectively, according to another embodiment.

FIG. 13 is a schematic diagram that illustrates data flow associated with a switch fabric, according to one embodiment.

FIG. 14 is a schematic diagram that illustrates flow control within the switch fabric shown in FIG. 13, according to one embodiment.

FIG. 15 is a schematic diagram that illustrates a buffer module, according to one embodiment.

FIG. 16A is a schematic block diagram of an ingress schedule module and an egress schedule module configured to coordinate transmission of groups of cells via a switch fabric of a switch core, according to one embodiment.

FIG. 16B is a signaling flow diagram that illustrates signaling related to the transmission of a group of cells, according to one embodiment.

FIG. 17 is a schematic block diagram that illustrates two groups of cells queued at an ingress queue disposed on an ingress side of a switch fabric, according to one embodiment.

FIG. 18 is a schematic block diagram that illustrates two groups of cells queued at an ingress queue disposed on an ingress side of a switch fabric, according to another embodiment.

FIG. 19 is a flowchart that illustrates a method for scheduling transmission of a group of cells via a switch fabric, according to one embodiment.

FIG. 20 is a signaling flow diagram that illustrates processing of request sequence values associated with transmission requests, according to one embodiment.

FIG. 21 is a signaling flow diagram that illustrates response sequence values associated with transmission responses, according to one embodiment.

FIG. 22 is a schematic block diagram that illustrates multi-stage flow-controllable queues, according to one embodiment.

FIG. 23 is a schematic block diagram that illustrates multi-stage flow-controllable queues, according to one embodiment.

FIG. 24 is a schematic block diagram that illustrates a destination control module configured to define flow control signals associated with multiple receive queues, according to one embodiment.

FIG. 25 is a schematic diagram that illustrates a flow control packet, according to one embodiment.

Detailed Description

FIG. 1 is a schematic diagram that illustrates a data center (DC) 100 (e.g., a super data center, an idealized data center), according to one embodiment. The data center 100 includes a switch core (SC) 180 operatively connected to four types of peripheral processing devices 170: compute nodes 110, service nodes 120, routers 130, and storage nodes 140. In this embodiment, a data center management (DCM) module 190 is configured to control (e.g., manage) the operation of the data center 100. In some embodiments, the data center 100 can be referred to as a data center. In some embodiments, a peripheral processing device can include one or more virtual resources such as virtual machines.

Each of the peripheral processing devices 170 is configured to communicate via the switch core 180 of the data center 100. Specifically, the switch core 180 of the data center 100 is configured to provide any-to-any connectivity between the peripheral processing devices 170 at relatively low latency. For example, the switch core 180 can be configured to transmit (e.g., convey) data between one or more of the compute nodes 110 and one or more of the storage nodes 140. In some embodiments, the switch core 180 can have at least hundreds or thousands of ports (e.g., egress ports and/or ingress ports) through which the peripheral processing devices 170 can transmit and/or receive data. The peripheral processing devices 170 include one or more network interface devices (e.g., a network interface card (NIC), a 10 Gigabit (Gb) Ethernet converged network adapter (CNA) device) through which the peripheral processing devices 170 can send signals to and/or receive signals from the switch core 180. The signals can be sent to and/or received from the switch core 180 via a physical link and/or a wireless link operatively coupled to the peripheral processing devices 170. In some embodiments, the peripheral processing devices 170 can be configured to send data to and/or receive data from the switch core 180 based on one or more protocols (e.g., an Ethernet protocol, a multi-protocol label switching (MPLS) protocol, a Fibre Channel protocol, a fibre-channel-over-Ethernet protocol, an Infiniband-related protocol).

In some embodiments, the switch core 180 can be (e.g., can function as) a single consolidated switch (e.g., a single large-scale consolidated L2/L3 switch). In other words, the switch core 180 can be configured to operate as a single logical entity (e.g., a single logical network element), as opposed to, for example, a collection of distinct network elements configured to communicate with one another via Ethernet connections. The switch core 180 can be configured to connect (e.g., facilitate communication between) the compute nodes 110, the storage nodes 140, the service nodes 120, and/or the routers 130 within the data center 100. In some embodiments, the switch core 180 can be configured to communicate via interface devices configured to transmit data at a rate of at least 10 Gb/s. In some embodiments, the switch core 180 can be configured to communicate via interface devices (e.g., Fibre Channel interface devices) configured to transmit data at a link rate of, for example, 2 Gb/s, 4 Gb/s, 8 Gb/s, 10 Gb/s, 40 Gb/s, 100 Gb/s, and/or faster.

Although the switch core 180 can be logically centralized, the implementation of the switch core 180 can be highly distributed, for example, for reliability. For example, portions of the switch core 180 can be physically distributed across, for example, many racks. In some embodiments, for example, one processing stage of the switch core 180 can be included in a first rack and another processing stage of the switch core 180 can be included in a second rack. Both processing stages can logically function as part of a single consolidated switch. More details related to the architecture of the switch core 180 are described in connection with FIGS. 4 through 13.

As shown in FIG. 1, the switch core 180 includes an edge portion 185 and a switch fabric 187. The edge portion 185 can include edge devices (not shown) that can function as gateway devices between the switch fabric 187 and the peripheral processing devices 170. In some embodiments, the edge devices within the edge portion 185 can collectively have thousands of ports (e.g., 100,000 ports, 500,000 ports) through which data from the peripheral processing devices 170 can be transmitted (e.g., routed) into and/or out of one or more portions of the switch core 180. In some embodiments, the edge devices can be referred to as access switches, network devices, and/or input/output modules (shown, for example, in FIGS. 5A and 5B). In some embodiments, the edge devices can be included in, for example, a top-of-rack (TOR) of a rack.

Data can be processed at the peripheral processing devices 170, at the switch core 180, at the switch fabric 187 of the switch core 180, and/or at the edge portion 185 of the switch core 180 (e.g., at edge devices included in the edge portion 185) based on different platforms. For example, communication between one or more of the peripheral processing devices 170 and an edge device at the edge portion 185 can be a stream of data packets defined based on an Ethernet protocol or a non-Ethernet protocol. In some embodiments, various types of data processing can be performed at edge devices within the edge portion 185 that may not be performed within the switch fabric 187 of the switch core 180. For example, data packets can be parsed into cells at an edge device of the edge portion 185, and the cells can be transmitted from the edge device to the switch fabric 187. The cells can be parsed into segments and transmitted within the switch fabric 187 as segments (which, in some embodiments, can also be referred to as flits). In some embodiments, the data packets can be parsed into cells at a portion of the switch fabric 187. In some embodiments, congestion resolution and/or scheduling of transmissions of data (e.g., cells) via the switch fabric 187 can be performed at edge devices (e.g., access switches) within the edge portion 185 of the switch core 180. Congestion resolution and/or scheduling of data transmissions, however, may not be performed within the modules that define the switch fabric 187. More details related to the processing of data packets, cells, and/or segments within components of the data center are described below. For example, more details related to the processing of cells are described in connection with at least FIG. 16A through FIG. 21.
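The packet-to-cell and cell-to-segment parsing described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the 64-byte cell size, and the 16-byte flit size are assumptions chosen only for the example.

```python
def split(data: bytes, size: int) -> list:
    """Cut data into consecutive pieces of at most `size` bytes."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


def packet_to_cells(packet: bytes, cell_size: int = 64) -> list:
    # Performed at an edge device before data enters the switch fabric.
    return split(packet, cell_size)


def cell_to_flits(cell: bytes, flit_size: int = 16) -> list:
    # Cells may be cut further into segments (flits) inside the fabric.
    return split(cell, flit_size)


def reassemble(pieces: list) -> bytes:
    # The egress edge device concatenates the pieces back in order.
    return b"".join(pieces)
```

With these sizes, a 150-byte packet yields two full 64-byte cells and one 22-byte cell, and each full cell yields four flits.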

In some embodiments, edge devices within the edge portion 185 can be configured to classify, for example, data packets received at the switch core 180 from the peripheral processing devices 170. Specifically, the edge devices within the edge portion 185 of the switch core 180 can be configured to perform Ethernet-type classification, which can include classification based on, for example, a Layer 2 Ethernet address (e.g., a media access control (MAC) address) and/or a Layer 4 Ethernet address (e.g., a User Datagram Protocol (UDP) address). In some embodiments, a destination can be determined based on, for example, classification of a packet at the edge portion 185 of the switch core 180. For example, a first edge device can identify a second edge device as the destination of a packet based on the classification of the packet. The packet can be parsed into cells and transmitted from the first edge device into the switch fabric 187. The cells can be switched through the switch fabric 187 so that they are delivered to the second edge device. In some embodiments, the cells can be switched through the switch fabric 187 based on information that relates to the destination and is associated with the cells.

Security policies can be applied more effectively at the switch core 180 because classification is performed at a single logical layer of the switch core 180, namely at the edge portion 185 of the switch core 180. Specifically, many security policies can be applied in a relatively uniform and seamless fashion at the edge portion 185 of the switch core 180 during classification.

More details related to packet classification within a data center are described in connection with, for example, FIG. 5A, FIG. 5B, and FIG. 19. Additional details related to packet classification associated with a data center are described in U.S. Patent Application Serial No. 12/242,168, filed September 30, 2008, entitled "Methods and Apparatus Related to Packet Classification Associated with a Multi-Stage Switch," and U.S. Patent Application Serial No. 12/242,172, filed September 30, 2008, entitled "Methods and Apparatus for Packet Classification Based on Policy Vectors," both of which are incorporated herein by reference in their entireties.

The switch core 180 can be defined so that classification of data (e.g., data packets) is not performed within the switch fabric 187. Accordingly, although the switch fabric 187 can have multiple stages, the stages need not be topological hops at which data classification is performed, and the switch fabric 187 can define a single topological hop. Instead, destination information determined based on classification at the edge devices (e.g., edge devices within the edge portion 185 of the switch core 180) can be used for switching (e.g., switching of cells) within the switch fabric 187. More details related to switching within the switch fabric 187 are described in connection with, for example, FIGS. 4A and 4B.
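The division of labor described above, classification once at the ingress edge device and switching inside the fabric on the resulting destination information alone, can be sketched as follows. The names `Cell`, `EdgeDevice`, and `fabric_switch`, the MAC-to-edge lookup table, and the cell size are all hypothetical; the sketch does not model the multiple stages of the fabric.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    dest_edge: str    # destination determined by classification at the edge
    payload: bytes


class EdgeDevice:
    def __init__(self, name: str, mac_table: dict):
        self.name = name
        self.mac_table = mac_table   # destination MAC -> egress edge device (assumed)

    def classify(self, dst_mac: str) -> Optional[str]:
        return self.mac_table.get(dst_mac)

    def ingress(self, dst_mac: str, payload: bytes, cell_size: int = 64) -> list:
        dest = self.classify(dst_mac)
        if dest is None:
            return []                # unknown destination: nothing enters the fabric
        return [Cell(dest, payload[i:i + cell_size])
                for i in range(0, len(payload), cell_size)]


def fabric_switch(cells: list) -> dict:
    """Forward cells on their destination tag only; no packet headers are inspected."""
    delivered = {}
    for cell in cells:
        delivered.setdefault(cell.dest_edge, []).append(cell.payload)
    return delivered
```

Because `fabric_switch` never looks inside the payload, the fabric behaves as a single hop from the point of view of classification, regardless of how many internal stages it has.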

In some embodiments, processing related to classification can be performed at a classification module (not shown) included in an edge device (e.g., an input/output module). Parsing of packets into cells, scheduling of cell transmission via the switch fabric 187, reassembly of packets and/or cells, and/or so forth can be performed at a processing module (not shown) of an edge device (e.g., an input/output module). In some embodiments, the classification module can be referred to as a packet classification module, and/or the processing module can be referred to as a packet processing module. More details related to an edge device that includes a classification module and a processing module are described in connection with FIG. 5B.

In some embodiments, one or more portions of the data center 100 can be (or can include) a hardware-based module (e.g., an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA)) and/or a software-based module (e.g., a module of computer code, a set of processor-readable instructions that can be executed at a processor). In some embodiments, one or more of the functions associated with the data center 100 can be included in different modules and/or combined into one or more modules. For example, the data center management module 190 can be a combination of hardware modules and software modules configured to manage resources within the data center 100 (e.g., resources of the switch core 180).

One or more of the compute nodes 110 can be a general-purpose computational engine that can include, for example, processors, memory, and/or one or more network interface devices (e.g., a network interface card (NIC)). In some embodiments, the processors within the compute nodes 110 can be part of one or more cache coherent domains.

In some embodiments, the compute nodes 110 can be host devices, servers, and/or so forth. In some embodiments, one or more of the compute nodes 110 can have virtualized resources so that any compute node 110 (or a portion thereof) can be substituted for any other compute node 110 (or a portion thereof) within the data center 100.

One or more of the storage nodes 140 can be devices that include, for example, processors, memory, locally attached disk storage, and/or one or more network interface devices. In some embodiments, the storage nodes 140 can have specialized modules (e.g., hardware modules and/or software modules) configured to enable, for example, one or more of the compute nodes 110 to read data from and/or write data to one or more of the storage nodes 140 via the switch core 180. In some embodiments, one or more of the storage nodes 140 can have virtualized resources so that any storage node 140 (or a portion thereof) can be substituted for any other storage node 140 (or a portion thereof) within the data center 100.

One or more of the service nodes 120 can be an Open Systems Interconnection (OSI) Layer 4 through Layer 7 device that can include, for example, processors (e.g., network processors), memory, and/or one or more network interface devices (e.g., 10 Gb Ethernet devices). In some embodiments, the service nodes 120 can include hardware and/or software configured to perform computations on relatively heavy network workloads. In some embodiments, the service nodes 120 can be configured to perform computations on a per-packet basis in a relatively efficient fashion (e.g., more efficiently than can be performed at, for example, a compute node 110). The computations can include, for example, stateful firewall computations, intrusion detection and prevention (IDP) computations, Extensible Markup Language (XML) acceleration computations, Transmission Control Protocol (TCP) termination computations, and/or application-level load-balancing computations. In some embodiments, one or more of the service nodes 120 can have virtualized resources so that any service node 120 (or a portion thereof) can be substituted for any other service node 120 (or a portion thereof) within the data center 100.

One or more of the routers 130 can be networking devices configured to connect at least a portion of the data center 100 to another network (e.g., the global Internet). For example, as shown in FIG. 1, the switch core 180 can be configured to communicate through the routers 130 with network 135 and network 137. Although not shown, in some embodiments, one or more of the routers 130 can enable communication between components (e.g., peripheral processing devices 170, portions of the switch core 180) within the data center 100. The communication can be defined based on, for example, a Layer 3 routing protocol. In some embodiments, one or more of the routers 130 can have one or more network interface devices (e.g., 10 Gb Ethernet devices) through which the routers 130 can send signals to and/or receive signals from, for example, the switch core 180 and/or other peripheral processing devices 170.

More details related to virtualized resources within a data center are set forth in co-pending U.S. Patent Application No. 12/346,623, filed December 30, 2008, entitled "Methods and Apparatus for Determining a Network Topology During Network Provisioning"; co-pending U.S. Patent Application No. 12/346,632, filed December 30, 2008, entitled "Methods and Apparatus for Distributed Dynamic Network Provisioning"; and co-pending U.S. Patent Application No. 12/346,630, filed December 30, 2008, entitled "Methods and Apparatus for Distributed Dynamic Network Provisioning," all of which are incorporated herein by reference.

As described above, the switch core 180 can be configured to function as a single universal switch that can connect any peripheral processing device 170 within the data center 100 to any other peripheral processing device 170. Specifically, the switch core 180 can be configured to provide any-to-any connectivity between the peripheral processing devices 170 (e.g., a relatively large number of peripheral processing devices 170) and the switch core 180 with substantially no perceivable limits other than those imposed by the bandwidth of the network interface devices that connect the peripheral processing devices 170 to the switch core 180 and by speed-of-light signaling delays (also referred to as speed-of-light latency). In other words, the switch core 180 can be configured so that each peripheral processing device 170 appears to be directly interconnected to every other peripheral processing device within the data center 100. In some embodiments, the switch core 180 can be configured so that the peripheral processing devices 170 can communicate at line rate (or at substantially line rate) via the switch core 180. A schematic representation of any-to-any connectivity is shown in FIG. 2.

In addition, because the switch core 180 functions as a single logical entity, the switch core 180 can handle, in a desirable fashion, migration of virtual resources between, for example, any of the peripheral processing devices 170 in communication with the switch core 180. Accordingly, the scope of migration of virtual resources within the peripheral processing devices 170 can span substantially all of the ports coupled to the switch core 180 (e.g., all of the ports of the edge devices in the edge portion 185 of the switch core 180).

In some embodiments, provisioning associated with the migration of virtual resources can be handled, in part, by a network management module. A centralized network management entity or network management module can cooperate with network devices (e.g., portions of the switch core 180) to collect and manage network topology information. For example, a network device can push information about the resources (virtual and physical) currently operatively coupled to the network device to the network management module as resources are attached to or detached from the network device. An external management entity, such as a peripheral processing device management tool (e.g., a server management tool) and/or a network management tool, can communicate with the network management module to send network provisioning instructions to network devices and other resources in the network without requiring a static description of the network. Such a system avoids the difficulties of static network descriptions and the network performance degradation caused by other types of peripheral processing devices 170 and network management systems.

In one embodiment, a server management tool or external management entity communicates with the network management module to provision network devices for virtual resources associated with the peripheral processing devices 170, and to determine the operational state or status (e.g., running, suspended, or migrating) of the virtual resources and their location in the network. A virtual resource can be a virtual machine executing on a peripheral processing device 170 (e.g., a server) that is coupled to the switch fabric via an access switch in the data center (e.g., an access switch included in the edge portion 185). Many types of peripheral processing devices 170 can be coupled to the switch fabric via access switches.

Rather than relying on a static network description for the discovery and/or management of network topology information (including the binding of virtual resources to network devices), the network management module communicates and cooperates with the access switches and external management entities to discover or determine the network topology information. After a virtual machine is instantiated (and/or started) on a host (and/or another type of peripheral processing device 170), the external management entity can provide the device identifier of the virtual machine to the network management module. The device identifier can be, for example, a media access control (“MAC”) address of a network interface of the virtual machine or peripheral processing device 170, a name of the virtual machine or peripheral processing device 170, a globally unique identifier (“GUID”), and/or a universally unique identifier (“UUID”) of the virtual resource or peripheral processing device 170. The GUID need not be globally unique with respect to all networks, virtual resources, peripheral processing devices 170, and/or network devices, but it is unique within the network or network segment managed by the network management module. Additionally, the external management entity can provide instructions for provisioning the port of the access switch to which the peripheral processing device 170 hosting the virtual machine is connected. The access switch can detect that the virtual machine has been instantiated, started, and/or moved to the peripheral processing device 170. After detecting the virtual machine, the access switch can query the peripheral processing device 170 for information about the peripheral processing device 170 and/or the virtual machine, including, for example, the device identifier of the peripheral processing device 170 or the virtual machine.

The access switch can query or request information such as the device identifier of the virtual machine using, for example, the Link Layer Discovery Protocol (“LLDP”), some other standards-based or well-known protocol, or a proprietary protocol, where the virtual machine is configured to communicate via that protocol. Alternatively, after detecting that it has been connected to the access switch, the virtual machine can broadcast information about itself (including the device identifier of the virtual machine) using, for example, an Ethernet or IP broadcast packet.

The access switch then pushes the device identifier of the virtual machine (sometimes referred to as a virtual device identifier) and, in some embodiments, other information received from the virtual machine to the network management module. Additionally, the access switch can push to the network management module the device identifier of the access switch and the port identifier of the port of the access switch to which the peripheral processing device 170 hosting the virtual machine is connected. This information functions as a description of the location of the virtual machine in the network, and defines the binding of the virtual machine to the peripheral processing device 170 for the network management module and the external management entity. In other words, after receiving this information, the network management module can associate the device identifier of the virtual machine with the specific port on the specific access switch to which the virtual machine (and/or the peripheral processing device 170 hosting the virtual machine) is connected.

The device identifier of the virtual machine, the device identifier of the access switch, the port identifier, and the provisioning instructions provided by the external management entity can be stored in a memory accessible to the network management module. For example, the device identifier of the virtual machine, the device identifier of the access switch, and the port identifier can be stored in a memory configured as a database, such that a database query based on the device identifier of the virtual machine returns the device identifier of the access switch, the port identifier, and the provisioning instructions.
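The lookup described above can be sketched as a small in-memory table keyed by the virtual machine's device identifier. This is an illustrative sketch only: the class names `Binding` and `BindingStore`, the method names, and the string form of the provisioning instructions are assumptions and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Binding:
    """One record: where a virtual machine sits and how its port should be provisioned."""
    switch_id: Optional[str] = None      # device identifier of the access switch
    port_id: Optional[str] = None        # port identifier on that access switch
    provisioning: list = field(default_factory=list)  # instructions from the external entity


class BindingStore:
    """Hypothetical stand-in for the memory 'configured as a database'."""

    def __init__(self):
        self._by_vm = {}

    def add_provisioning(self, vm_id, instruction):
        # The external management entity supplies instructions keyed only by VM identifier.
        self._by_vm.setdefault(vm_id, Binding()).provisioning.append(instruction)

    def record_location(self, vm_id, switch_id, port_id):
        # The access switch pushes its own identifier and the port identifier.
        binding = self._by_vm.setdefault(vm_id, Binding())
        binding.switch_id = switch_id
        binding.port_id = port_id

    def lookup(self, vm_id):
        # A query by VM device identifier returns switch, port, and instructions.
        return self._by_vm.get(vm_id)


store = BindingStore()
store.add_provisioning("vm-01", "vlan=10")
store.record_location("vm-01", "access-switch-A", "port-3")
print(store.lookup("vm-01"))
```

Because the record is keyed by the virtual machine identifier, a later call to `record_location` after a migration updates the switch and port while leaving the stored provisioning instructions in place.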

Because the network management module can associate the location of a virtual machine in the network with the device identifier of the virtual machine, the external management entity need not be aware of the topology of the network, or of the binding of virtual machines to peripheral processing devices 170, in order to provision network resources (e.g., network devices, virtual machines, virtual switches, or physical servers). In other words, the external management entity can be agnostic as to the interconnections in the network and the location of a virtual machine in the network (e.g., on which peripheral processing device 170, at which port of which access switch), and can provision the access switches in the network based on the device identifiers of the virtual machines hosted by the peripheral processing devices 170 in the network. In some embodiments, the external management entity can also provision physical peripheral processing devices 170. Moreover, because the network management module dynamically determines and manages the network topology information, the external management entity does not rely on a static description of the network to provision the network.

As used in this specification, provisioning can include various types or forms of setup, configuration, and/or adjustment of devices and/or software modules. For example, provisioning can include configuring a network device, such as a network switch, based on a network policy. More specifically, for example, network provisioning can include: configuring a network device to operate as a layer 2 or layer 3 network switch; changing a routing table of the network device; updating security policies and/or the device addresses or device identifiers of devices operatively coupled to the network device; selecting which network protocols the network device will implement; setting a network segment identifier, such as a virtual local area network (“VLAN”) tag, for a port of the network device; and/or applying an access control list (“ACL”) to the network device. The network switch can be provisioned or configured such that rules and/or access restrictions defined by the network policy are applied to data packets that pass through the network switch. In some embodiments, virtual devices are provisioned. A virtual device can be, for example, a software module implementing a virtual switch, virtual router, or virtual gateway that is configured to operate as an intermediary between physical networks and that is hosted by a host device such as a peripheral processing device 170. In some embodiments, provisioning can include establishing a virtual port or a connection between a virtual resource and a virtual device.

FIG. 2 is a schematic diagram that illustrates an example of a portion of a data center having any-to-any connectivity, according to an embodiment. As shown in FIG. 2, a peripheral processing device PD (from the group of peripheral processing devices 210) is connected to each of the peripheral processing devices 210 via a switch core 280. In some embodiments, for clarity, only the connections from the peripheral processing device PD to the other peripheral processing devices 210 (excluding the peripheral processing device PD) are shown.

In some embodiments, the switch core 280 can be defined so that the switch core 280 is fair in the sense that the bandwidth of a destination link between the peripheral processing device PD and the other peripheral processing devices 210 is shared substantially equitably among the contending peripheral processing devices 210. For example, when several (or all) of the peripheral processing devices 210 shown in FIG. 2 attempt to access the peripheral processing device PD at a given time, the bandwidth (e.g., instantaneous bandwidth) available to each of the peripheral processing devices 210 for accessing the peripheral processing device PD will be substantially equal. In some embodiments, the switch core 280 can be configured so that several (or all) of the peripheral processing devices 210 can communicate with the peripheral processing device PD at full bandwidth (e.g., the full bandwidth of the peripheral processing device PD) and/or in a non-blocking fashion. Moreover, the switch core 280 can be configured so that access to the peripheral processing device PD by a peripheral processing device (from the peripheral processing devices 210) is not restricted by other links (e.g., existing or attempted) between other peripheral processing devices and the peripheral processing device PD.
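As a rough numerical illustration of the fairness property, the following sketch divides the bandwidth of a destination link equally among the devices contending for it. The exact equal split and the function name `fair_share` are assumptions made for illustration; the specification only requires that the shares be substantially equal and does not prescribe how they are computed.

```python
def fair_share(link_bandwidth_gbps, contenders):
    """Return the bandwidth each contending device receives on one destination link.

    Assumes an exactly equal split, which is the idealized case of
    'substantially equal' shares.
    """
    if not contenders:
        return {}
    share = link_bandwidth_gbps / len(contenders)
    return {device: share for device in contenders}


# Four devices contend for a 10 Gb/s link to the peripheral processing device PD.
print(fair_share(10.0, ["dev-1", "dev-2", "dev-3", "dev-4"]))
```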

In some embodiments, the attributes of the switch core 280 (any-to-any connectivity, low latency, fairness, and/or so forth) can enable peripheral processing devices 210 of a given type (e.g., a storage node type, a compute node type) that are connected to (e.g., in communication with) the switch core 280 to be treated interchangeably (e.g., independent of location relative to other processing devices 210 and the switch core 280). This can be referred to as interchangeability, and it can promote the efficiency and simplicity of a data center that includes the switch core 280. The switch core 280 can have the attributes of any-to-any connectivity and/or fairness even though the switch core 280 may have a large number of ports (e.g., more than 1000 ports), each of which can operate at a relatively high speed (e.g., at speeds greater than 10 Gb/s). This can be achieved without the specialized interconnects included in, for example, a supercomputer, and/or without perfect advance knowledge of all communication patterns. More details related to the architecture of a switch core having any-to-any connectivity and/or fairness are described, at least in part, in connection with FIGS. 4 through 13.

Referring back to FIG. 1, in some embodiments, the data center 100 can be configured to allow for flexible oversubscription. In some embodiments, through flexible oversubscription, the relative cost of the network infrastructure (e.g., the network infrastructure related to the switch core 180) can be reduced relative to, for example, the cost of compute and storage. For example, the resources (e.g., all of the resources) within the switch core 180 of the data center 100 can operate as flexibly pooled resources, so that underutilized resources associated with a first application (or set of applications) can be dynamically provisioned for use by a second application (or set of applications) during, for example, peak processing by the second application. Accordingly, the resources (or a subset of the resources) of the data center 100 can be configured to handle oversubscription more efficiently than if the resources were rigidly allocated as siloed resources to particular applications (or sets of applications). If the resources are managed as silos, oversubscription can be implemented only within a silo rather than, for example, across the entire data center 100. In some embodiments, one or more protocols and/or components within the data center 100 can be based on open standards (e.g., Institute of Electrical and Electronics Engineers (IEEE) standards, Internet Engineering Task Force (IETF) standards, InterNational Committee for Information Technology Standards (INCITS) standards).

In some embodiments, the data center 100 can support a security model that allows a wide range of policies to be implemented. For example, the data center 100 can support a no-communication policy, in which applications reside in separate virtual data centers of the data center 100 but can share the same physical peripheral processing devices (e.g., compute nodes 100, storage nodes 140) and network infrastructure (e.g., switch core 180). In some configurations, the data center 100 can support multiple processes that are parts of the same application and that need to communicate with almost no restrictions. In some configurations, the data center 100 can support policies that require, for example, deep packet inspection, stateful firewalls, and/or stateless filters.

The data center 100 can have an end-to-end application-to-application latency (also referred to as end-to-end latency) defined based on source latency, zero-load latency, congestion latency, and destination latency. In some embodiments, source latency can be, for example, the time expended during processing at a source peripheral processing device (e.g., time expended by software and/or a NIC). Similarly, destination latency can be, for example, the time expended during processing at a destination peripheral processing device (e.g., time expended by software and/or a NIC). In some embodiments, zero-load latency can be the speed-of-light delay plus the processing and store-and-forward delays within, for example, the switch core 180. In some embodiments, congestion latency can be, for example, the queuing delays caused by congestion in the network. The data center 100 can have a low end-to-end latency that enables desirable application performance for applications that are sensitive to latency, such as applications with real-time constraints and/or applications with high levels of inter-process communication requirements.
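The four components named above can be combined in a simple latency budget. The sketch below assumes the components are additive, which the specification does not state explicitly (it says only that end-to-end latency is "defined based on" them); the class name `LatencyBudget` and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Latency components in microseconds; additive combination is an assumption."""
    source_us: int        # processing at the source device (software and/or NIC)
    zero_load_us: int     # speed-of-light delay plus processing and store-and-forward delays
    congestion_us: int    # queuing delay caused by congestion
    destination_us: int   # processing at the destination device (software and/or NIC)

    def end_to_end_us(self):
        return (self.source_us + self.zero_load_us
                + self.congestion_us + self.destination_us)


# Hypothetical values chosen only to show the arithmetic.
budget = LatencyBudget(source_us=5, zero_load_us=6, congestion_us=0, destination_us=5)
print(budget.end_to_end_us())  # 16
```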

The zero-load latency of the switch core 180 can be significantly lower than that of a data center core portion having an interconnect based on Ethernet hops. In some embodiments, for example, the switch core 180 can have a zero-load latency (excluding speed-of-light latency) of less than 6 microseconds from an input port of the switch core 180 to an output port of the switch core 180. In some embodiments, for example, the switch core 180 can have a zero-load latency (excluding congestion latency and speed-of-light latency) of less than 12 microseconds. Ethernet-based data center core portions can have significantly higher latencies because of, for example, undesirable levels of congestion (e.g., congestion between links). The congestion within an Ethernet-based data center core portion can be exacerbated by the inability of the Ethernet-based data center core (or a management device associated with the Ethernet-based data center core) to handle the congestion in a desirable fashion. In addition, latencies within an Ethernet-based data center core portion can be non-uniform because the core portion can have different numbers of hops between different source-destination pairs and/or many store-and-forward switch nodes at which classification of data packets is performed. In contrast, classification for the switch core 180 is performed at the edge portion 185, not in the switch fabric 187, and the switch core 180 has a deterministic cell-based switch fabric 187. For example, the latency of cell processing through the switch fabric 187 (but not the path of cells through the switch fabric 187) can be predictable.

The switch core 180 of the data center 100 can provide lossless end-to-end packet delivery based, at least in part, on flow control mechanisms implemented within the data center 100. For example, the scheduling of data transmission (e.g., data associated with data packets) via the switch fabric 187 is performed on a cell basis using a request-grant mechanism (also referred to as a request-authorization mechanism). Specifically, a cell is transmitted into the switch fabric 187 (e.g., from the edge portion 185 into the switch fabric 187) only after a request to transmit the cell has been granted on the basis of substantially guaranteed (lossless) delivery. Once admitted into the switch fabric 187, the cell is handled within the switch fabric 187 as fragments. The flow of fragments within the switch fabric 187 can be further controlled so that, for example, fragments are not lost when congestion is detected within the switch fabric 187. More details related to the processing of cells and fragments within the switch core 180 are described below.
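The request-grant behavior described above can be sketched as follows. This is a toy model under stated assumptions: the fabric grants a request whenever it has spare capacity, and the class names `SwitchFabric` and `EdgeDevice`, the fixed `capacity`, and the `drain` step are hypothetical. The actual grant criteria are not given in this passage.

```python
from collections import deque


class SwitchFabric:
    """Toy fabric that grants a request only while it has room for another cell."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = deque()

    def request(self):
        # Grant only if delivery can be guaranteed, modeled here as spare capacity.
        return len(self.in_flight) < self.capacity

    def admit(self, cell):
        if len(self.in_flight) >= self.capacity:
            raise RuntimeError("cell admitted without a grant")
        self.in_flight.append(cell)

    def drain(self, count=1):
        # Deliver up to `count` cells out of the fabric, oldest first.
        return [self.in_flight.popleft() for _ in range(min(count, len(self.in_flight)))]


class EdgeDevice:
    """Edge portion: holds cells until the fabric grants each transmission request."""

    def __init__(self, fabric):
        self.fabric = fabric
        self.pending = deque()

    def enqueue(self, cell):
        self.pending.append(cell)

    def service(self):
        sent = 0
        # Ungranted cells stay queued at the edge instead of being dropped.
        while self.pending and self.fabric.request():
            self.fabric.admit(self.pending.popleft())
            sent += 1
        return sent


fabric = SwitchFabric(capacity=2)
edge = EdgeDevice(fabric)
for name in ("cell-0", "cell-1", "cell-2"):
    edge.enqueue(name)
print(edge.service())       # 2 cells granted and sent
print(list(edge.pending))   # ['cell-2'] waits at the edge
```

Holding ungranted cells at the edge, rather than sending them and dropping on overflow, is what keeps the model lossless: congestion shows up as waiting at the edge, not as loss inside the fabric.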

In addition, the data flow from each of the peripheral processing devices 170 through the switch fabric 187 can be isolated from the data flow from the remaining peripheral processing devices 170 through the switch fabric 187. Specifically, data congestion at one or more of the peripheral processing devices 170 does not affect, in an undesirable fashion, the flow of data through the switch fabric 187 of the switch core 180, because cells are sent into the switch fabric 187 of the switch core 180 only after requests to send them have been granted at the edge portion 185 of the switch core 180. For example, a high level of data traffic at a first peripheral processing device 170 can be handled based on the request-authorization congestion resolution mechanism, so that the high level of data traffic at the first peripheral processing device 170 will not adversely affect access by a second peripheral processing device 170 to the single logical entity of the switch core 180. In other words, once admitted into the switch fabric 187 of the switch core 180, traffic associated with the first peripheral processing device 170 will be isolated (e.g., isolated from a congestion perspective) from traffic associated with the second peripheral processing device 170.

Also, the flow of data packets in the switch core 180, which can be parsed into cells and fragments, can be controlled at the peripheral processing devices 170 based on a fine-grained flow control mechanism. In some embodiments, fine-grained flow control is performed based on stages of queues. This type of fine-grained flow control can prevent (or substantially prevent) head-of-line blocking, which can result in poor network utilization. Fine-grained flow control can also be used to reduce (or minimize) latency within the switch core 180. In some embodiments, fine-grained flow control can enable high-performance, block-oriented disk traffic to and from the peripheral processing devices 170, which may not be achievable in a desirable fashion using Ethernet and Internet Protocol (IP) networks. More details related to fine-grained flow control are described in connection with FIGS. 22 through 25.

In some embodiments, the data center 100 and, in particular, the switch core 180 can have a modular architecture. Specifically, the switch core 180 of the data center 100 can be implemented initially at a small scale and can be expanded as needed (e.g., expanded incrementally). The switch core 180 can be expanded substantially without interrupting the continuous operation of the existing network, and/or can be expanded without constraints on where the new equipment of the switch core 180 should be physically placed.

In some embodiments, one or more portions of the switch core 180 can be configured to operate based on a virtual private network (“VPN”). Specifically, the switch core 180 can be partitioned so that one or more of the peripheral processing devices 170 can be configured to communicate via overlapping or non-overlapping virtualized partitions of the switch core 180. The switch core 180 can also be divided into virtualized resources that have disjoint or overlapping subsets. In other words, the switch core 180 can be a single switch that can be partitioned in a flexible fashion. In some embodiments, this approach can enable networking to be scaled once, within the consolidated switch core 180 of the data center 100. This contrasts with a data center that is a collection of separately scalable networks, each of which has customized and/or specialized resources. In some embodiments, the network resources that define the switch core 180 can be pooled so that they can be used efficiently.

In some embodiments, the data center management module 190 can be configured to define multiple levels of virtualization of the physical (and/or virtual) resources that define the data center 100. For example, the data center management module 190 can be configured to define multiple levels of virtualization that can represent the breadth of applications of the data center 100. In some embodiments, a lower level (of two levels) can include a virtual application cluster (VAC), which can be a set of physical (or virtual) resources assigned to a single application belonging to (e.g., controlled by) one or more entities (e.g., an administrative entity, a financial institution). An upper level (of the two levels) can include a virtual data center (VDC), which can include a set of VACs belonging to (e.g., controlled by) one or more entities. In some embodiments, the data center 100 can include multiple VACs, each of which can belong to a different administrative entity.

FIG. 3 is a schematic diagram that illustrates a logical group 300 of resources associated with a data center, according to an embodiment. As shown in FIG. 3, the logical group 300 includes a virtual data center VDC1, a virtual data center VDC2, and a virtual data center VDC3 (collectively referred to as VDCs). Also, as shown in FIG. 3, each of the VDCs includes virtual application clusters VAC (e.g., VAC32 in VDC3). Each VDC represents a logical group of physical or virtual portions (e.g., portions of a switch core, portions of peripheral processing devices, and/or virtual machines within peripheral processing devices) of a data center such as the data center 100 shown in FIG. 1. Each VAC within a VDC represents a logical group of, for example, peripheral processing devices such as compute nodes. For example, VDC1 can represent a logical group of portions of a physical data center, and VAC22 can represent a logical group of peripheral processing devices 370 within VDC1. As shown in FIG. 3, each VDC can be managed based on a set of policies PY (also referred to as business rules) that can be configured to, for example, define the permitted range of operating parameters for applications running within the VDC. In some embodiments, the VDCs can be referred to as a first tier of logical resources, and the VACs can be referred to as a second tier of logical resources.

In some embodiments, the VDCs (and VACs) can be established so that the resources associated with a data center can be managed in a desirable fashion by, for example, entities that use (e.g., lease, own, communicate through) the resources of the data center and/or administrators of the resources of the data center. For example, VDC1 can be a virtual data center associated with a financial institution, and VDC2 can be a virtual data center associated with a telecommunications service provider. Accordingly, the policy PY1 can be defined by the financial institution so that VDC1 (and the physical and/or virtual data center resources associated with VDC1) can be managed in a fashion different from VDC2 (and the physical and/or virtual data center resources associated with VDC2), which is managed based on the policy PY2 defined by the telecommunications service provider. In some embodiments, one or more policies (e.g., a portion of the policy PY1) can be established by a network administrator so that, when implemented, they provide information security and/or firewalls between VDC1, which is associated with the financial institution, and VDC2, which is associated with the telecommunications service provider.

In some embodiments, policies can be associated with (or integrated in) data center management (not shown). For example, VDC2 can be managed based on the policy PY2 (or a subset of the policy PY2). In some embodiments, the data center management can be configured to, for example, monitor the real-time performance of applications within the VDCs, and/or can be configured to automatically allocate or de-allocate resources to satisfy the respective policies for the applications within the VDCs. In some embodiments, policies can be configured to operate based on time thresholds. For example, one or more policies can be configured to function based on periodic events (e.g., predictable periodic events), such as changes in a parameter value (e.g., a traffic level) during a particular time of day or day of the week.

In some embodiments, policies can be defined based on a high-level language. Accordingly, the policies can be specified in a relatively accessible fashion. Examples of policies include information security policies, failure isolation policies, firewall policies, performance guarantee policies (e.g., policies related to service levels to be implemented by an application), and/or other administrative policies related to the protection of or access to information (e.g., administrative isolation policies).

In some embodiments, policies can be implemented at a packet classification module that can be configured to, for example, classify data packets (e.g., IP packets, session control protocol packets, media packets, data packets defined at a peripheral processing device). For example, the policies can be implemented within a packet classification module of an access switch within the edge portion of the switch core. Classification can include any processing performed so that a data packet can be processed within the data center (e.g., within the switch core of the data center) based on a policy. In some embodiments, a policy includes one or more policy conditions associated with an instruction that can be executed. A policy can be, for example, a policy to route a data packet to a particular destination (the instruction) if the data packet has a particular type of network address (the policy condition). Packet classification can include determining whether the policy condition has been satisfied so that the instruction can be executed. For example, one or more portions of a data packet (e.g., a field, a payload, an address portion, a port portion) can be analyzed by the packet classification module based on a policy condition defined within a policy. When the policy condition is satisfied, the data packet can be processed based on the instruction associated with the policy condition.
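The pairing of a policy condition with an instruction can be illustrated with a minimal classifier. The sketch below is an assumption-laden illustration, not the classification module of the specification: packets are modeled as plain dictionaries, the names `Policy` and `classify` are hypothetical, the first matching policy wins, and the fallback instruction `"default-route"` is invented for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    condition: Callable[[dict], bool]  # policy condition evaluated against packet fields
    instruction: str                   # instruction applied when the condition holds


def classify(packet, policies, default="default-route"):
    """Return the instruction of the first policy whose condition the packet satisfies."""
    for policy in policies:
        if policy.condition(packet):
            return policy.instruction
    return default


STORAGE_NET = ipaddress.ip_network("10.0.0.0/8")  # hypothetical address range

policies = [
    # Condition: destination address falls in a particular network.
    # Instruction: route to a particular destination.
    Policy(lambda p: ipaddress.ip_address(p["dst"]) in STORAGE_NET, "route:storage"),
    # Condition on the port portion of the packet.
    Policy(lambda p: p.get("dst_port") == 22, "route:management"),
]

print(classify({"dst": "10.1.2.3", "dst_port": 80}, policies))     # route:storage
print(classify({"dst": "192.0.2.7", "dst_port": 22}, policies))    # route:management
print(classify({"dst": "192.0.2.7", "dst_port": 443}, policies))   # default-route
```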

In some embodiments, one or more portions of the logical groups 300 can be configured to operate in a "lights out" manner from multiple remote locations, for example, a separate location for each VDC and one or two master locations for controlling the logical groups 300. In some embodiments, a data center having logical groups such as those shown in FIG. 3 can be configured to operate without personnel being physically present at the data center site. In some embodiments, the data center has sufficient redundant resources to accommodate failures, such as the failure of one or more peripheral processing devices (e.g., a peripheral processing device within a VAC), the failure of a data center management module, and/or the failure of a switch core component. When monitoring software within the data center (e.g., within the data center management of the data center) indicates that failures have reached a predetermined threshold, personnel can be notified and/or dispatched to replace the failed components.

As shown in FIG. 3, the VDCs can be mutually exclusive logical groups. In some embodiments, the resources (e.g., virtual resources, physical resources) of a data center (such as that shown in FIG. 1) can be divided into logical groups 300 that differ from those shown in FIG. 3 (e.g., different tiers of logical groups). In some embodiments, two or more VDCs of the logical groups 300 overlap. For example, a first VDC can share resources (e.g., physical resources, virtual resources) of the data center with a second VDC. In particular, a portion of the switch core of the first VDC can be shared with the second VDC. In some embodiments, for example, a resource included in a VAC of the first VDC can also be included in a VAC of the second VDC.

In some embodiments, one or more VDCs can be defined manually (e.g., by a network administrator) and/or automatically (e.g., based on a policy). In some embodiments, a VDC can be configured to change (e.g., change dynamically). For example, a VDC (e.g., VDC1) can include a particular set of resources during one time period and a different set of resources (e.g., a mutually exclusive set of resources, an overlapping set of resources) during a different time period (e.g., a mutually exclusive time period, an overlapping time period).

In some embodiments, one or more portions of a data center can be dynamically provisioned in response to, before, or during a change related to a VDC (e.g., migration of a portion of a VDC, such as a virtual machine of the VDC). For example, the switch core of a data center can include multiple network devices, such as network switches, each storing a configuration template database that includes provisioning instructions for services provided and/or requested by virtual machines. When a virtual machine is migrated to, and/or initialized or started on, a server connected to a port of a network switch of the switch core, the server can send to the network switch an identifier related to the service provided by the virtual machine. The network device can select a configuration template from the configuration template database based on the identifier, and can provision the port and/or the server based on that configuration template. In this way, the task of provisioning network ports and/or devices can be distributed across the network switches in the switch core (e.g., distributed in an automated fashion, without redefining templates), and can change dynamically as virtual machines or resources are migrated among peripheral processing devices.
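The template lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the provisioning mechanism of the embodiments: the class `NetworkSwitch`, the method `on_vm_started`, and the template fields (`vlan`, `acl`) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NetworkSwitch:
    """Hypothetical network switch holding a local configuration template database."""
    # Maps a service identifier to a configuration template (illustrative fields).
    templates: Dict[str, dict]
    # Maps a port number to the configuration currently applied to it.
    ports: Dict[int, dict] = field(default_factory=dict)

    def on_vm_started(self, port: int, service_id: str) -> bool:
        """Provision `port` when a server reports a service identifier for a VM.

        Returns False, leaving the port untouched, if no template matches.
        """
        template = self.templates.get(service_id)
        if template is None:
            return False
        # Copy so later per-port changes do not alter the stored template.
        self.ports[port] = dict(template)
        return True


# Illustrative usage: a VM providing a "web" service starts on the server at port 3.
switch = NetworkSwitch({"web": {"vlan": 10, "acl": ["permit tcp any any eq 80"]}})
provisioned = switch.on_vm_started(3, "web")
```

Because each switch resolves the identifier against its own database, no central controller is needed in this sketch when a virtual machine moves to a port on a different switch.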

In some embodiments, provisioning can include multiple types or forms of setup, configuration, and/or adjustment of devices and/or software modules. For example, provisioning can include configuring a network device within the data center, such as a network switch, based on a policy such as one of the policies PY shown in FIG. 3. More specifically, for example, provisioning related to a data center can include one or more of the following: configuring a network device to operate as a network router or as a network switch; changing a routing table of a network device; updating security policies and/or the addresses or identifiers of devices operatively coupled to a network device; selecting which network protocols a network device will implement; setting a network segment identifier, such as a virtual local area network ("VLAN") tag, for a port of a network device; and/or applying an access control list ("ACL") to a network device. A portion of the data center can be provisioned or configured so that the rules and/or access restrictions defined by a policy (e.g., PY3) are applied (e.g., applied through classification processing) to data packets that pass through that portion of the data center.

In some embodiments, virtual resources associated with the data center can be provisioned. A virtual resource can be, for example, a software module implementing a virtual switch, a virtual router, or a virtual gateway configured to operate as an intermediary between a physical network and virtual resources controlled by a host device such as a server. In some embodiments, the virtual resource can itself be controlled by the host device. In some embodiments, provisioning can include establishing a virtual port or connection between a virtual resource and a virtual device.

More details related to virtualized resources within a data center are set forth in co-pending U.S. Patent Application No. 12/346,623, entitled "Method and Apparatus for Determining a Network Topology During Network Provisioning," filed December 30, 2008; co-pending U.S. Patent Application No. 12/346,632, entitled "Methods and Apparatus for Distributed Dynamic Network Provisioning," filed December 30, 2008; and co-pending U.S. Patent Application No. 12/346,630, entitled "Methods and Apparatus for Distributed Dynamic Network Provisioning," filed December 30, 2008, all of which are incorporated herein by reference.

FIG. 4A is a schematic diagram that illustrates a switch fabric 400 that can be included in a switch core, according to an embodiment. In some embodiments, the switch fabric 400 can be included in a switch core such as the switch core 180 shown in FIG. 1. As shown in FIG. 4A, the switch fabric 400 is a three-stage, non-blocking Clos network and includes a first stage 440, a second stage 442, and a third stage 444. The first stage 440 includes modules 412 (each of which can be referred to as a switch module or a cell switch). Each module 412 of the first stage 440 is an assembly of electronic components and circuitry. In some embodiments, for example, each module is an application-specific integrated circuit (ASIC). In other embodiments, multiple modules are contained on a single ASIC. In some embodiments, each module is an assembly of discrete electronic components. In some embodiments, a switch fabric having multiple stages can be referred to as a multi-stage switch fabric.

In some embodiments, each module 412 of the first stage 440 can be a cell switch. A cell switch can be configured to efficiently redirect data (e.g., segments) as it flows through the switch fabric 400. In some embodiments, for example, each module 412 of the first stage can be configured to redirect data based on information included in a switching table. In some embodiments, the redirection of data, such as cells, within a stage of the switch fabric 400 can be referred to as switching (e.g., data switching) or, if the data is in the form of cells within the switch fabric 400, as cell switching. In some embodiments, switching within a module of the switch fabric 400 can be based on, for example, information associated with the data (e.g., a header). The switching performed by the modules of the switch fabric 400 can differ from the Ethernet-type classification performed within an edge device (e.g., an edge device within the edge portion 185 of the switch core 180 shown in FIG. 1). In other words, switching within a module of the switch fabric 400 may not be based on, for example, a layer-2 Ethernet address and/or a layer-4 Ethernet address. More details related to switching data based on a switching table are described in connection with FIG. 4B.

In some embodiments, each cell switch also includes multiple input ports operatively coupled to the write interfaces of a memory buffer (e.g., a cut-through buffer). In some embodiments, the memory buffer is included in a buffer module. Similarly, a set of output ports can be operatively coupled to the read interfaces of the memory buffer. In some embodiments, the memory buffer can be a shared memory buffer implemented using on-chip static random access memory (SRAM) that provides sufficient bandwidth for all input ports to write one incoming cell (e.g., a portion of a data packet) per time period and for all output ports to read one outgoing cell per time period. Each cell switch operates similarly to a crossbar switch that can be reconfigured after each time period.

In some embodiments, the memory buffer (e.g., the portions of the memory buffer associated with particular ports and/or flows) is of sufficient size (e.g., length) for a module (e.g., a module 412) within the switch fabric 400 to implement switching (e.g., cell switching, data switching) and/or synchronization of data (e.g., cells). The memory buffer can, however, have a size that is insufficient (and/or a processing latency that is too short) for a module (e.g., a module 412) within the switch fabric 400 to implement a congestion resolution scheme. A congestion resolution scheme, such as a request/grant mechanism, can be implemented at, for example, an edge device (not shown) associated with the switch core, but may not be implemented within the modules of the switch fabric 400 using the memory buffer to queue data related to the congestion resolution scheme. In some embodiments, one or more memory buffers within a module (e.g., a module 414) have a size that is insufficient (and/or a processing latency that is too short) for, for example, reassembly of data (e.g., cells) at the module. More details related to shared memory buffers are described in connection with FIG. 15 and in co-pending U.S. Patent Application No. 12/415,517, entitled "Methods and Apparatus Related to a Shared Memory Buffer for Variable-Sized Cells," filed March 31, 2009, which is incorporated herein by reference in its entirety.

In alternative embodiments, each module of the first stage can be a crossbar switch having input bars and output bars. Multiple switches within the crossbar switch connect each input bar with each output bar. When a switch within the crossbar switch is in the "on" position, the input is operatively coupled to the output and data can flow. Conversely, when a switch within the crossbar switch is in the "off" position, the input is not operatively coupled to the output and data cannot flow. The switches within the crossbar switch thus control which input bars are operatively coupled to which output bars.

Each module 412 of the first stage 440 includes a set of input ports 460 configured to receive data as it enters the switch fabric 400. In this embodiment, each module 412 of the first stage 440 includes the same number of input ports 460.

Similar to the first stage 440, the second stage 442 of the switch fabric 400 includes modules 414. The modules 414 of the second stage 442 are structurally similar to the modules 412 of the first stage 440. Each module 414 of the second stage 442 is operatively coupled to each module 412 of the first stage 440 by a data path 420. Each data path 420 between a module 412 of the first stage 440 and a module 414 of the second stage 442 is configured to facilitate data transfer from the module 412 of the first stage 440 to the module 414 of the second stage 442.

The data paths 420 between the modules 412 of the first stage 440 and the modules 414 of the second stage 442 can be constructed in any manner configured to facilitate data transfer from the modules 412 of the first stage 440 to the modules 414 of the second stage 442 in a desired fashion (e.g., in an efficient fashion). In some embodiments, for example, the data paths are optical connectors between the modules. In other embodiments, the data paths are within a midplane. Such a midplane can be similar to that described in further detail herein, and can be used to connect each module of the second stage with each module of the first stage. In still other embodiments, the modules are contained within a single chip package and the data paths are electrical traces.

In some embodiments, the switch fabric 400 is a non-blocking Clos network. Thus, the number of modules 414 of the second stage 442 of the switch fabric 400 varies based on the number of input ports 460 of each module 412 of the first stage 440. In a rearrangeably non-blocking Clos network (e.g., a Benes network), the number of modules 414 of the second stage 442 is greater than or equal to the number of input ports 460 of each module 412 of the first stage 440. Thus, if n is the number of input ports 460 of each module 412 of the first stage 440 and m is the number of modules 414 of the second stage 442, then m ≥ n. In some embodiments, for example, each module of the first stage has five input ports, so the second stage has at least five modules. All five modules of the first stage are operatively coupled to all five modules of the second stage by data paths. In other words, each module of the first stage can send data to any module of the second stage.
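The sizing rule m ≥ n can be checked with a short Python sketch. The function name `clos_blocking_class` is hypothetical. The rearrangeable condition m ≥ n is taken from the paragraph above; the stricter condition m ≥ 2n − 1 for a strictly non-blocking network is the standard result for three-stage Clos networks and is added here only for context, not drawn from this description.

```python
def clos_blocking_class(n: int, m: int) -> str:
    """Classify a symmetric three-stage Clos network.

    n: input ports per first-stage module
    m: number of second-stage modules
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive integers")
    if m >= 2 * n - 1:
        # Standard Clos condition; not stated in the surrounding text.
        return "strictly non-blocking"
    if m >= n:
        # Condition given in the text for a rearrangeably non-blocking network.
        return "rearrangeably non-blocking"
    return "blocking"


# The five-port example from the text: five second-stage modules suffice
# for a rearrangeably non-blocking fabric.
example = clos_blocking_class(n=5, m=5)
```

With five input ports per first-stage module, five second-stage modules meet the rearrangeable bound, whereas nine would be needed to meet the strict bound.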

The third stage 444 of the switch fabric 400 includes modules 416. The modules 416 of the third stage 444 are structurally similar to the modules 412 of the first stage 440. The number of modules 416 of the third stage 444 is equal to the number of modules 412 of the first stage 440. Each module 416 of the third stage 444 includes output ports 462 configured to allow data to exit the switch fabric 400. Each module 416 of the third stage 444 includes the same number of output ports 462. Further, the number of output ports 462 of each module 416 of the third stage 444 is equal to the number of input ports 460 of each module 412 of the first stage 440.

Each module 416 of the third stage 444 is connected to each module 414 of the second stage 442 by a data path 424. The data paths 424 between the modules 414 of the second stage 442 and the modules 416 of the third stage 444 are configured to facilitate data transfer from the modules 414 of the second stage 442 to the modules 416 of the third stage 444.

The data paths 424 between the modules 414 of the second stage 442 and the modules 416 of the third stage 444 can be constructed in any manner configured to efficiently facilitate data transfer from the modules 414 of the second stage 442 to the modules 416 of the third stage 444. In some embodiments, for example, the data paths are optical connectors between the modules. In other embodiments, the data paths are within a midplane. Such a midplane can be similar to that described in further detail herein, and can be used to connect each module of the second stage with each module of the third stage. In still other embodiments, the modules are contained within a single chip package and the data paths are electrical traces.

FIG. 4B is a schematic diagram that illustrates a switching table 49 that can be stored in a memory 498 of a module shown in FIG. 4A, according to an embodiment. A module (e.g., a switch module), such as one of the modules 414 of the second stage shown in FIG. 4A, can be configured to perform cell switching based on a switching table such as the switching table 49 shown in FIG. 4B. For example, the switching table 49 (or a similarly configured switching table) can be used by (and/or included in) a module in one stage to determine, for example, whether a cell can be sent to its destination via a module in another stage. In some embodiments, a module via which a cell can be sent to its destination is referred to as a switching destination. Specifically, a switching destination can be looked up in the switching table 49 based on destination information included in, for example, the cell (which can be determined outside the switch fabric 400).

The switching table 49 includes binary values (e.g., a binary value "1", a binary value "0") that indicate whether one or more destinations, represented by destination values DT1 through DTk (shown in row 47), can be reached via one or more modules (which can be in an adjacent stage), represented by module values SM1 through SMM (shown in column 48). Specifically, the switching table 49 includes a binary value "1" when the destination in the column that includes the binary value (e.g., destination DT1) can be reached via the module in the row that intersects the column (e.g., module SM2). The switching table 49 includes a binary value "0" when the destination in the column that includes the binary value cannot be reached via the module in the row that intersects the column. For example, the binary value "1" in each of the entries at 46 indicates that if the module that includes the switching table 49 sends data to any of the modules represented by module values SM1 through SM3, the data can ultimately be sent to the destination represented by destination value DT3. In some embodiments, the module can be configured to randomly select one module from the group of modules represented by module values SM1 through SM3 (which are the switching destinations) and to send the data to the selected module so that the data can be delivered to the destination represented by destination value DT3.
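The lookup and random selection just described can be sketched in Python. This is an illustration under stated assumptions rather than the module's actual logic: the table is modeled as a nested dictionary, the function names are hypothetical, and only two facts come from the text (DT1 is reachable via SM2, and DT3 is reachable via SM1 through SM3). Every other table entry, including the row for SM4, is invented for the example.

```python
import random
from typing import Dict, List, Optional

# Rows are module values, columns are destination values; 1 means "reachable via".
# Entries other than (SM2, DT1) and (SM1..SM3, DT3) are made up for illustration.
SwitchingTable = Dict[str, Dict[str, int]]

table: SwitchingTable = {
    "SM1": {"DT1": 0, "DT2": 1, "DT3": 1},
    "SM2": {"DT1": 1, "DT2": 0, "DT3": 1},
    "SM3": {"DT1": 0, "DT2": 0, "DT3": 1},
    "SM4": {"DT1": 0, "DT2": 1, "DT3": 0},
}


def switching_destinations(tbl: SwitchingTable, destination: str) -> List[str]:
    """Return every module through which `destination` can be reached."""
    return [
        module
        for module, reachable in tbl.items()
        if reachable.get(destination, 0) == 1
    ]


def select_next_module(tbl: SwitchingTable, destination: str) -> Optional[str]:
    """Randomly pick one switching destination, or None if there is none."""
    candidates = switching_destinations(tbl, destination)
    if not candidates:
        return None
    return random.choice(candidates)
```

Selecting uniformly at random among the candidate modules is one simple way to spread cells bound for the same destination across the next stage; the text does not specify the distribution used.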

In some embodiments, the destination values 47 can be destination port values associated with, for example, an edge device (e.g., an access switch) of the switch core, a server in communication with the edge device, and so forth. In some embodiments, a destination value (which corresponds to at least one of the destination values 47 included in the switching table 49) can be associated with a cell (e.g., included in a header of the cell) based on, for example, classification of a packet included in the cell. The destination value associated with the cell can then be used by a module to look up a switching destination using the switching table 49. The packet classification can be performed at an edge device (e.g., an access switch) of the switch core.

In some embodiments, the memory (and thus the switching table 49) can be included in a module system of one or more modules. In some embodiments, the switching table 49 can be associated with more than one input port and/or more than one output port of the module system (or of multiple module systems). More details related to module systems are described in connection with FIG. 7.

FIG. 5A is a schematic diagram that illustrates a switch fabric system 500, according to an embodiment. The switch fabric system 500 includes multiple input/output modules 502, a first set of cables 540, a second set of cables 542, and a switch fabric 575. The switch fabric 575 includes a first switch fabric portion 571 disposed within a housing 570 or chassis, and a second switch fabric portion 573 disposed within a housing 572 or chassis.

The input/output modules 502 (which can be, for example, edge devices) are configured to send data to and/or receive data from the first switch fabric portion 571 and/or the second switch fabric portion 573. Additionally, each input/output module 502 includes a parsing function, a classifying function, a forwarding function, and/or a queuing and scheduling function. Thus, packet parsing, packet classifying, packet forwarding, and packet queuing and scheduling all occur before a data packet enters the first switch fabric portion 571 and/or the second switch fabric portion 573. Accordingly, these functions do not need to be performed at each stage of the switch fabric 575, and each module of the switch fabric portions 571, 573 (described in further detail herein) does not need to include the capability to perform them. This can reduce the cost, power consumption, cooling requirements, and/or physical area required for each module of the switch fabric portions 571, 573. It can also reduce the latency associated with the switch fabric. In some embodiments, for example, the end-to-end latency (i.e., the time required to send data through the switch fabric from one input/output module to another input/output module) can be lower than the end-to-end latency of a switch fabric system that uses an Ethernet protocol. In some embodiments, the throughput of the switch fabric portions 571, 573 is constrained only by the connection density of the switch fabric system 500 and not by power and/or thermal limitations. In some embodiments, the input/output modules 502 (and/or the functionality associated with the input/output modules 502) can be included in, for example, an edge device within the edge portion of a switch core such as that shown in FIG. 1. The parsing function, classifying function, forwarding function, and queuing and scheduling function can be performed similarly to the functions disclosed in U.S. Patent Application Serial No. 12/242,168, entitled "Methods and Apparatus Related to Packet Classification Associated with a Multi-Stage Switch," filed September 30, 2008, and U.S. Patent Application Serial No. 12/242,172, entitled "Methods and Apparatus for Packet Classification Based on Policy Vectors," filed September 30, 2008, both of which are incorporated herein by reference in their entireties.

Each input/output module 502 is configured to connect to a first end of a cable of the first set of cables 540 and to a first end of a cable of the second set of cables 542. Each cable 540 is disposed between an input/output module 502 and the first switch fabric portion 571. Similarly, each cable 542 is disposed between an input/output module 502 and the second switch fabric portion 573. Using the first set of cables 540 and the second set of cables 542, each input/output module 502 can send data to and/or receive data from the first switch fabric portion 571 and/or the second switch fabric portion 573, respectively.

The first set of cables 540 and the second set of cables 542 can be constructed of any material suitable for transferring data between the input/output modules 502 and the switch fabric portions 571, 573. In some embodiments, for example, each cable 540, 542 is constructed of multiple optical fibers. In such an embodiment, each cable 540, 542 can have twelve transmit fibers and twelve receive fibers. The twelve transmit fibers of each cable 540, 542 can include eight fibers for transmitting data, one fiber for transmitting a control signal, and three fibers for expanding the data capacity and/or for redundancy. Similarly, the twelve receive fibers of each cable 540, 542 can include eight fibers for transmitting data, one fiber for transmitting a control signal, and three fibers for expanding the data capacity and/or for redundancy. In other embodiments, any number of fibers can be contained within each cable.

The first switch fabric portion 571 and the second switch fabric portion 573 are used together for redundancy and/or greater capacity. In other embodiments, only one switch fabric portion is used. In still other embodiments, more than two switch fabric portions are used for increased redundancy and/or greater capacity. For example, four switch fabric portions can be operatively coupled to each input/output module by, for example, four cables. The second switch fabric portion 573 is structurally and functionally similar to the first switch fabric portion 571. Accordingly, only the first switch fabric portion 571 is described in detail herein.

FIG. 5B is a schematic diagram that illustrates an input/output module 502, according to an embodiment. As shown in FIG. 5B, the input/output module 502 includes a classification module 596, a processing module 597, and a memory 598. The classification module 596 can be configured to perform classification of data, such as Ethernet-type classification of packets.

Various types of data processing can be performed at the processing module 597. For example, data such as packets can be parsed into cells at the processing module 597. In some embodiments, a congestion resolution scheme can be implemented at the processing module 597, and/or the scheduling of transmission of data (e.g., cells) via a switch fabric (e.g., the switch fabric 400 shown in FIG. 4A) can be performed at the processing module 597. The processing module 597 can also be configured to concatenate information (e.g., header information, destination information, source information) to, for example, the payload of a cell; that information can be used to switch the cell through a switch fabric (e.g., the switch fabric 400 shown in FIG. 4A), for example based on a switching table such as that shown in FIG. 4B.
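Parsing a packet into cells and attaching switching information to each cell payload can be sketched as follows. The sketch rests on several assumptions that are not taken from the description: a fixed maximum payload size per cell (the referenced application concerns variable-sized cells), a `Cell` layout with a destination value, a sequence number, and a last-cell flag, and the function name `packet_to_cells`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical cell: switching information plus a slice of the packet."""
    destination: str  # destination value, e.g. one used in a switching table lookup
    sequence: int     # position of this cell within the packet
    last: bool        # True for the final cell of the packet
    payload: bytes


def packet_to_cells(data: bytes, destination: str,
                    cell_payload_size: int = 64) -> List[Cell]:
    """Split `data` into cells of at most `cell_payload_size` payload bytes.

    An empty packet still produces one (empty) cell in this sketch.
    """
    if cell_payload_size <= 0:
        raise ValueError("cell_payload_size must be positive")
    chunks = [
        data[i:i + cell_payload_size]
        for i in range(0, len(data), cell_payload_size)
    ] or [b""]
    return [
        Cell(destination, seq, seq == len(chunks) - 1, chunk)
        for seq, chunk in enumerate(chunks)
    ]


# Illustrative usage: a 150-byte packet bound for destination value DT3.
cells = packet_to_cells(b"x" * 150, "DT3", cell_payload_size=64)
```

The sequence number and last-cell flag are included so that an egress input/output module could reassemble the packet; the fabric modules themselves only need the destination value.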

While data processing is performed at the classification module 596 and/or the processing module 597, one or more portions of the data (e.g., packets, cells) can be stored (e.g., queued) in the memory 598. For example, data parsed into cells can be queued in the memory 598 while the processing module 597 performs processing related to congestion resolution. Accordingly, the memory 598 can be sized to implement the congestion resolution described in connection with FIGS. 16A through 21.

FIG. 6 shows, in greater detail, a portion of the switch fabric system 500 of FIG. 5A that includes the first switch fabric portion 571. The first switch fabric portion 571 includes interface cards 510, which are associated with the first and third stages of the first switch fabric portion 571; interface cards 516, which are associated with the second stage of the first switch fabric portion 571; and a midplane 550. In some embodiments, the first switch fabric portion 571 includes 8 interface cards 510 associated with the first and third stages and 8 interface cards 516 associated with the second stage. In other embodiments, a different number of interface cards associated with the first and third stages and/or a different number of interface cards associated with the second stage can be used.

As shown in FIG. 6, each input/output module 502 is operatively coupled to an interface card 510 via one cable of the first cable set 540. In some embodiments, for example, each of 8 interface cards 510 is operatively coupled to 16 input/output modules 502, as described in greater detail herein. The first switch fabric portion 571 can thus be coupled to 128 input/output modules (16×8=128). Each of the 128 input/output modules 502 can send data to and receive data from the first switch fabric portion 571.

Each interface card 510 is connected to each interface card 516 via the midplane 550. Each interface card 510 can therefore send data to and receive data from each interface card 516, as described in greater detail herein. Using the midplane 550 to connect the interface cards 510 to the interface cards 516 reduces the number of cables needed to connect the stages of the first switch fabric portion 571.

FIG. 7 shows a first interface card 510', the midplane 550, and a first interface card 516' in greater detail. The interface card 510' is associated with the first and third stages of the first switch fabric portion 571, and the interface card 516' is associated with the second stage of the first switch fabric portion 571. Each interface card 510 is structurally and functionally similar to the first interface card 510'. Likewise, each interface card 516 is structurally and functionally similar to the first interface card 516'.

The first interface card 510' includes multiple cable connector ports 560, a first module system 512, a second module system 514, and multiple midplane connector ports 562. For example, FIG. 7 shows the first interface card 510' with 16 cable connector ports 560 and 8 midplane connector ports 562. Each cable connector port 560 of the first interface card 510' is configured to receive the second end of a cable from the first cable set 540. Thus, as described above, the 16 cable connector ports 560 on each of the 8 interface cards 510 are used to receive the 128 cables (16×8=128). Although 16 cable connector ports 560 are shown in FIG. 7, in other embodiments any number of cable connector ports can be used, such that each cable of the first cable set can be received by a cable connector port in the first switch fabric. For example, if 16 interface cards are used, each interface card can include 8 cable connector ports.

The first module system 512 and the second module system 514 of the first interface card 510' each include a module of the first stage of the first switch fabric portion 571 and a module of the third stage of the first switch fabric portion 571. In some embodiments, 8 of the 16 cable connector ports 560 are operatively coupled to the first module system 512, and the remaining 8 cable connector ports are operatively coupled to the second module system 514. Both the first module system 512 and the second module system 514 are operatively coupled to each of the 8 midplane connector ports 562 of the interface card 510'.

The first module system 512 and the second module system 514 of the first interface card 510' are ASICs, and both are instances of the same ASIC. Manufacturing costs can therefore be reduced, because multiple instances of a single ASIC can be produced. In addition, both a module of the first stage of the first switch fabric portion 571 and a module of the third stage of the first switch fabric portion 571 are included on each ASIC.

In some embodiments, each of the 8 midplane connector ports 562 has twice the data capacity of each of the 16 cable connector ports 560. That is, instead of 8 data transmit and 8 data receive connections, each of the 8 midplane connector ports 562 has 16 data transmit and 16 data receive connections. The bandwidth of the 8 midplane connector ports 562 therefore equals the bandwidth of the 16 cable connector ports 560. In other embodiments, each midplane connector port has 32 data transmit and 32 data receive connections; in such embodiments, each cable connector port has 16 data transmit and 16 data receive connections.
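The bandwidth match described above is simple arithmetic, and a minimal Python sketch makes it explicit. The helper name `total_connections` is illustrative and does not come from the patent; the port and connection counts are the ones quoted in the paragraph, counted per direction.

```python
def total_connections(ports: int, connections_per_port: int) -> int:
    """Total one-way data connections across a group of identical ports."""
    return ports * connections_per_port

# First embodiment: 16 cable ports with 8 connections each,
# 8 midplane ports with 16 connections each (per direction).
cable_side = total_connections(ports=16, connections_per_port=8)
midplane_side = total_connections(ports=8, connections_per_port=16)

# Alternative embodiment: 16 connections per cable port, 32 per midplane port.
cable_side_alt = total_connections(ports=16, connections_per_port=16)
midplane_side_alt = total_connections(ports=8, connections_per_port=32)
```

In both embodiments the two sides carry the same number of connections, which is why halving the port count on the midplane side does not reduce capacity.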

The 8 midplane connector ports 562 of the first interface card 510' are connected to the midplane 550. The midplane 550 is configured to connect each interface card 510 associated with the first and third stages of the first switch fabric portion 571 to each interface card 516 associated with the second stage of the first switch fabric portion 571. The midplane 550 thus ensures that each midplane connector port 562 of a given interface card 510 is connected to a midplane connector port 580 of a different interface card 516. In other words, no two midplane connector ports of the same interface card 510 are operatively coupled to the same interface card 516. The midplane 550 therefore allows each interface card 510 to send data to and receive data from any of the 8 interface cards 516.
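The connectivity rule in this paragraph can be sketched as a small wiring table. The following Python sketch assumes one particular numbering, in which port j of each interface card 510 lands on interface card 516 number j; the patent does not specify a numbering, so the function name and indices are hypothetical.

```python
def midplane_wiring(num_cards_510: int = 8, num_cards_516: int = 8) -> dict:
    """Map (card_510, port_562) -> (card_516, port_580).

    Assumed numbering: port j of every card 510 connects to card 516 number j,
    and arrives there on the port whose index equals the source card's index.
    """
    return {
        (card_510, port_562): (port_562, card_510)
        for card_510 in range(num_cards_510)
        for port_562 in range(num_cards_516)
    }

wiring = midplane_wiring()

def cards_516_reached(card_510: int) -> set:
    """Set of second-stage cards reachable from one first/third-stage card."""
    return {target[0] for (src, _), target in wiring.items() if src == card_510}
```

With this mapping every card 510 reaches all 8 cards 516, and no port 580 is used twice, which is the property the midplane is described as guaranteeing.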

Although FIG. 7 shows the first interface card 510', the midplane 550, and the first interface card 516' schematically, in some embodiments the interface cards 510, the midplane 550, and the interface cards 516 are physically positioned similarly to the horizontally positioned interface cards 620, the midplane 640, and the vertically positioned interface cards 630, respectively, as shown in FIGS. 8-10 and described in further detail herein. The modules associated with the first stage and the modules associated with the third stage (both on the interface cards 510) are thus located on one side of the midplane 550, and the modules associated with the second stage (on the interface cards 516) are located on the opposite side of the midplane 550. Such a topology allows each module associated with the first stage to be operatively coupled to each module associated with the second stage, and each module associated with the second stage to be operatively coupled to each module associated with the third stage.

The first interface card 516' includes multiple midplane connector ports 580, a first module system 518, and a second module system 519. The midplane connector ports 580 are configured to send data to and receive data from any of the interface cards 510 via the midplane 550. In some embodiments, the first interface card 516' includes 8 midplane connector ports 580.

The first module system 518 and the second module system 519 of the first interface card 516' are operatively coupled to each midplane connector port 580 of the first interface card 516'. Through the midplane 550, each module system 512, 514 associated with the first and third stages of the first switch fabric portion 571 is thus operatively coupled to each module system 518, 519 associated with the second stage of the first switch fabric portion 571. In other words, each module system 512, 514 associated with the first and third stages can send data to and receive data from each module system 518, 519 associated with the second stage, and vice versa. In particular, a module associated with the first stage within module system 512 or 514 can send data to a module associated with the second stage within module system 518 or 519. Similarly, a module associated with the second stage within module system 518 or 519 can send data to a module associated with the third stage within module system 512 or 514. In other embodiments, a module associated with the third stage can send data and/or control signals to a module associated with the second stage, and a module associated with the second stage can send data and/or control signals to a module associated with the first stage.

In embodiments in which each module of the first stage of the first switch fabric portion 571 has 8 inputs (i.e., two modules per interface card 510), the second stage of the first switch fabric portion 571 has at least 8 modules so that the first switch fabric portion 571 remains rearrangeably non-blocking. In some embodiments, twice that number of second-stage modules is used to facilitate expanding the switch fabric system 500 from a three-stage switch fabric to a five-stage switch fabric, as described in further detail herein. In such a five-stage switch fabric, the second stage supports twice the switching throughput of the second stage in the three-stage switch fabric of the switch fabric system 500. For example, in some embodiments, 16 second-stage modules can be used to facilitate future expansion of the switch fabric system 500 from a three-stage switch fabric to a five-stage switch fabric.
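The module count in this paragraph follows the standard condition for three-stage Clos networks: with n inputs per first-stage module and m second-stage modules, the network is rearrangeably non-blocking when m ≥ n. The Python sketch below checks that condition for the figures quoted above. The function names are illustrative, and the strict-sense bound m ≥ 2n − 1 is shown only for contrast; the patent text does not discuss it.

```python
def is_rearrangeably_nonblocking(n_inputs: int, m_second_stage: int) -> bool:
    """Standard three-stage Clos condition: m >= n."""
    return m_second_stage >= n_inputs

def is_strictly_nonblocking(n_inputs: int, m_second_stage: int) -> bool:
    """Strict-sense Clos condition (not from the patent text): m >= 2n - 1."""
    return m_second_stage >= 2 * n_inputs - 1

# Figures from the paragraph: 8 inputs per first-stage module,
# 8 second-stage modules (or 16 when provisioning for later expansion).
base_ok = is_rearrangeably_nonblocking(8, 8)
too_few = is_rearrangeably_nonblocking(8, 7)
expanded_ok = is_rearrangeably_nonblocking(8, 16)
```

Eight second-stage modules is therefore the minimum that keeps the fabric rearrangeably non-blocking, and the 16-module variant meets the same condition with headroom.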

The first module system 518 and the second module system 519 of the first interface card 516' are ASICs, and both are instances of the same ASIC. In addition, in some embodiments, the first module system 518 and the second module system 519, which are associated with the second stage of the first switch fabric portion 571, are instances of the same ASIC used for the first module system 512 and the second module system 514 of the first interface card 510', which are associated with the first and third stages of the first switch fabric portion 571. Manufacturing costs can thus be reduced, because multiple instances of a single ASIC can be used for every module system of the first switch fabric portion 571.

In use, data is transferred from a first input/output module 502 to a second input/output module 502 via the first switch fabric portion 571. The first input/output module 502 sends the data to the first switch fabric portion 571 via a cable of the first cable set 540. The data passes through a cable connector port 560 of one of the interface cards 510' and is sent to a first-stage module within module system 512 or 514.

The first-stage module within module system 512 or 514 forwards the data to a second-stage module within module system 518 or 519 by sending the data through one of the midplane connector ports 562 of the interface card 510', through the midplane 550, and to one of the interface cards 516'. The data enters the interface card 516' through a midplane connector port 580 of the interface card 516' and is then sent to the second-stage module within module system 518 or 519.

The second-stage module determines how the second input/output module 502 is connected and redirects the data back to an interface card 510' via the midplane 550. Because each module system 518 or 519 is operatively coupled to each module system 512 and 514 on the interface cards 510', the second-stage module within module system 518 or 519 can determine which third-stage module within module system 512 or 514 is operatively coupled to the second input/output module and send the data accordingly.

The data is sent to the third-stage module within module system 512 or 514 on the interface card 510'. The third-stage module then sends the data to the second input/output module 502 through a cable connector port 560 via a cable of the first cable set 540.
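The lookup that the second-stage module performs (finding which third-stage module serves the destination input/output module) can be sketched as follows. The sketch is a minimal illustration under an assumed numbering that the patent does not give: input/output modules are numbered 0 to 127, each interface card 510 serves 16 consecutive modules, and the first 8 cable connector ports of a card belong to module system 512 while the remaining 8 belong to module system 514.

```python
PORTS_PER_CARD = 16         # cable connector ports 560 per interface card 510
PORTS_PER_MODULE_SYSTEM = 8

def locate(io_module: int) -> tuple:
    """Return (interface card index, module system, cable port index).

    The numbering scheme is hypothetical and used only for illustration.
    """
    card, port = divmod(io_module, PORTS_PER_CARD)
    module_system = 512 if port < PORTS_PER_MODULE_SYSTEM else 514
    return card, module_system, port

first_destination = locate(0)
middle_destination = locate(24)
last_destination = locate(127)
```

A second-stage module holding such a table would forward the data over the midplane to the card and module system returned for the destination, where the third-stage module sends it out through the indicated cable port.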

In other embodiments, instead of sending the data to a single second-stage module, the first-stage module divides the data into separate portions (e.g., cells) and forwards a portion of the data to each second-stage module to which the first-stage module is operatively coupled (e.g., in this embodiment, each second-stage module receives a portion of the data). Each second-stage module then determines how the second input/output module is connected and redirects its portion of the data to a single third-stage module. The third-stage module then reassembles the received portions of the data and sends the data to the second input/output module.
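The split, distribute, and reassemble sequence described above can be sketched in a few lines of Python. The cell size, the sequence numbers, and the round-robin distribution are assumptions made for illustration; the patent text states only that each second-stage module receives a portion of the data and that the third-stage module reassembles the portions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    seq: int        # hypothetical sequence number used for reassembly
    payload: bytes

def split_into_cells(data: bytes, cell_size: int) -> list:
    """First stage: cut the data into fixed-size cells (the last may be short)."""
    return [Cell(seq, data[off:off + cell_size])
            for seq, off in enumerate(range(0, len(data), cell_size))]

def spray(cells: list, num_second_stage: int) -> list:
    """Distribute cells round-robin, one list per second-stage module."""
    lanes = [[] for _ in range(num_second_stage)]
    for cell in cells:
        lanes[cell.seq % num_second_stage].append(cell)
    return lanes

def reassemble(lanes: list) -> bytes:
    """Third stage: merge cells from all second-stage modules in sequence order."""
    cells = sorted((c for lane in lanes for c in lane), key=lambda c: c.seq)
    return b"".join(c.payload for c in cells)

data = bytes(range(100))
lanes = spray(split_into_cells(data, cell_size=8), num_second_stage=8)
restored = reassemble(list(reversed(lanes)))  # arrival order does not matter
```

Sorting by sequence number is one simple way to make reassembly independent of the order in which the second-stage modules deliver their portions.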

FIGS. 8-10 show a housing 600 (i.e., a chassis) used to house a switch fabric (such as the first switch fabric portion 571 described above), according to an embodiment. The housing 600 includes a casing 610, a midplane 640, horizontally positioned interface cards 620, and vertically positioned interface cards 630. FIG. 8 shows a front view of the casing 610, in which the 8 horizontally positioned interface cards 620 disposed within the casing 610 can be seen. FIG. 9 shows a rear view of the casing 610, in which the 8 vertically positioned interface cards 630 disposed within the casing 610 can be seen.

Each horizontally positioned interface card 620 is operatively coupled to each vertically positioned interface card 630 via the midplane 640 (see FIG. 10). The midplane 640 includes a front surface 642, a rear surface 644, and an array of receptacles 650 that connect the front surface 642 and the rear surface 644, as described below. As shown in FIG. 10, the horizontally positioned interface cards 620 include multiple midplane connector ports 622 that connect to the receptacles on the front surface 642 of the midplane 640. Similarly, the vertically positioned interface cards 630 include multiple midplane connector ports 632 that connect to the receptacles on the rear surface 644 of the midplane 640. In this manner, the plane defined by each horizontally positioned interface card 620 intersects the plane defined by each vertically positioned interface card 630.

The receptacles 650 of the midplane 640 operatively couple each horizontally positioned interface card 620 to each vertically positioned interface card 630 and carry signals between them. In some embodiments, for example, a receptacle 650 can be a multiple-pin connector configured to receive the multiple-pin connectors disposed on the midplane connector ports 622, 632 of the interface cards 620, 630; a hollow tube that allows a horizontally positioned interface card 620 to connect directly to a vertically positioned interface card 630; and/or any other device configured to operatively couple two interface cards. With such a midplane 640, each horizontally positioned interface card 620 is operatively coupled to each vertically positioned interface card 630 without routed connections (e.g., electrical traces) on the midplane.

FIG. 10 shows a midplane that includes a total of 64 receptacles 650 arranged in an 8×8 array. In such an embodiment, 8 horizontally positioned interface cards 620 can be operatively coupled to 8 vertically positioned interface cards 630. In other embodiments, any number of receptacles can be included in the midplane, and/or any number of horizontally positioned interface cards can be coupled through the midplane to any number of vertically positioned interface cards.

If the first switch fabric portion 571 is housed in the housing 600, for example, each interface card 510 associated with the first and third stages of the first switch fabric portion 571 can be positioned horizontally, and each interface card 516 associated with the second stage of the first switch fabric portion 571 can be positioned vertically. Each interface card 510 associated with the first and third stages can then be easily connected through the midplane 640 to each interface card 516 associated with the second stage. In other embodiments, each interface card associated with the first and third stages of the first switch fabric portion is positioned vertically, and each interface card associated with the second stage is positioned horizontally. In still other embodiments, each interface card associated with the first and third stages of the first switch fabric portion can be positioned at any angle relative to the housing, and each interface card associated with the second stage can be positioned orthogonally to the interface cards associated with the first and third stages.

FIGS. 11 and 12 are schematic diagrams that illustrate a switch fabric 1100 in a first configuration and a second configuration, respectively, according to an embodiment. The switch fabric 1100 includes multiple switch fabric systems 1108.

Each switch fabric system 1108 includes multiple input/output modules 1102, a first cable set 1140, a second cable set 1142, a first switch fabric portion 1171 disposed within a housing 1170, and a second switch fabric portion 1173 disposed within a housing 1172. The switch fabric systems 1108 are structurally and functionally similar to one another. In addition, the input/output modules 1102, the first cable set 1140, and the second cable set 1142 are structurally and functionally similar to the input/output modules 202, the first cable set 240, and the second cable set 242, respectively.

When the switch fabric 1100 is in the first configuration, the first switch fabric portion 1171 and the second switch fabric portion 1173 of each switch fabric system 1108 function similarly to the first switch fabric portion 571 and the second switch fabric portion 573 described above. That is, in the first configuration, the first switch fabric portion 1171 and the second switch fabric portion 1173 operate as stand-alone three-stage switch fabrics. Accordingly, when the switch fabric 1100 is in the first configuration, each switch fabric system 1108 acts as a stand-alone switch fabric system and is not operatively coupled to the other switch fabric systems 1108.

In the second configuration (FIG. 12), the switch fabric 1100 further includes a third cable set 1144 and multiple connection switch fabrics 1191, each disposed within a housing 1190. The housing 1190 can be similar to the housing 600 described in detail above. Each switch fabric portion 1171, 1173 of each switch fabric system 1108 is operatively coupled to each connection switch fabric 1191 via the third cable set 1144. Thus, when the switch fabric 1100 is in the second configuration, each switch fabric system 1108 is operatively coupled to the other switch fabric systems 1108 via the connection switch fabrics 1191. The switch fabric 1100 in the second configuration is therefore a five-stage Clos network.

The cables of the third cable set 1144 can be made of any material suitable for transferring data between the switch fabric portions 1171, 1173 and the connection switch fabrics 1191. In some embodiments, for example, each cable 1144 consists of multiple optical fibers. In such embodiments, each cable 1144 can have 36 transmit fibers and 36 receive fibers. The 36 transmit fibers of each cable 1144 can include 32 fibers for transmitting data and 4 fibers for expanding data capacity and/or for redundancy. Similarly, the 36 receive fibers of each cable 1144 can include 32 fibers for receiving data and 4 fibers for expanding data capacity and/or for redundancy. In other embodiments, each cable can contain any number of fibers. Using cables with a larger number of fibers reduces the number of cables required.

As discussed above, flow control can be performed within a switch fabric of, for example, a data center. FIGS. 13 and 14, and the accompanying description, illustrate flow control within a switch fabric. In particular, FIG. 13 is a schematic diagram that illustrates the flow of data associated with a switch fabric 1300, according to an embodiment. The switch fabric 1300 shown in FIG. 13 is similar to the switch fabric 400 shown in FIG. 4A and can be implemented in a data center such as the data center 100 shown in FIG. 1. In this embodiment, the switch fabric 1300 is a three-stage non-blocking Clos network that includes a first stage 1340, a second stage 1342, and a third stage 1344. The first stage 1340 includes modules 1312, the second stage 1342 includes modules 1314, and the third stage 1344 includes modules 1316. In some embodiments, the switch fabric 1300 can be a cell-switched switch fabric, and each module 1312 of the first stage 1340 can be a cell switch. Each module 1312 of the first stage 1340 includes a set of input ports 1360 configured to receive data as it enters the switch fabric 1300. Each module 1316 of the third stage 1344 includes output ports 1362 configured to allow data to exit the switch fabric 1300. Each module 1316 of the third stage 1344 includes the same number of output ports 1362.

Each module 1314 of the second stage 1342 is operatively coupled to each module 1312 of the first stage 1340 by a unidirectional data path 1320. Each unidirectional data path 1320 between a module 1312 of the first stage 1340 and a module 1314 of the second stage 1342 is configured to carry data from the module 1312 of the first stage 1340 to the module 1314 of the second stage 1342. Because the data paths 1320 are unidirectional, they do not carry data from the modules 1314 of the second stage 1342 to the modules 1312 of the first stage 1340. Such unidirectional data paths 1320 cost less, use fewer data connections, and are easier to implement than comparable bidirectional data paths.

Each module 1316 of the third stage 1344 is operatively coupled to each module 1314 of the second stage 1342 by a unidirectional data path 1324. Each unidirectional data path 1324 between a module 1314 of the second stage 1342 and a module 1316 of the third stage 1344 is configured to carry data from the module 1314 of the second stage 1342 to the module 1316 of the third stage 1344. Because the data paths 1324 are unidirectional, they do not carry data from the modules 1316 of the third stage 1344 to the modules 1314 of the second stage 1342. As described above, such unidirectional data paths 1324 cost less and use less area than comparable bidirectional data paths.

The unidirectional data paths 1320 between the modules 1312 of the first stage 1340 and the modules 1314 of the second stage 1342, and/or the unidirectional data paths 1324 between the modules 1314 of the second stage 1342 and the modules 1316 of the third stage 1344, can be constructed in any manner that effectively carries data. In some embodiments, for example, the data paths are optical connectors between the modules. In other embodiments, the data paths are within a midplane connector. Such a midplane connector can be similar to the midplane described in connection with FIGS. 8 through 10, and can be used effectively to connect each module of the second stage to each module of the third stage. In still other embodiments, the modules are contained within a single chip package and the unidirectional data paths are electrical traces.

Each module 1312 of the first stage 1340 is physically close to a corresponding module 1316 of the third stage 1344. In other words, each module 1312 of the first stage 1340 is paired with a module 1316 of the third stage 1344. For example, in some embodiments, each module 1312 of the first stage 1340 is in the same chip package as a module 1316 of the third stage 1344. A bidirectional flow control path 1322 exists between each module 1312 of the first stage 1340 and the corresponding module 1316 of the third stage 1344. The flow control path 1322 allows a module 1312 of the first stage 1340 to send a flow control indicator to the corresponding module 1316 of the third stage 1344, and vice versa. As described in further detail herein, this allows any module in any stage of the switch fabric to send a flow control indicator to the module that is sending it data. In some embodiments, the bidirectional flow control path 1322 is built from two separate unidirectional flow control paths, which allow flow control indicators to pass between the module 1312 of the first stage 1340 and the module 1316 of the third stage 1344.

FIG. 14 is a schematic illustration of flow control within the switch fabric 1300 shown in FIG. 13, according to an embodiment. Specifically, the illustration shows a detailed view of the first row 1310 of the switch fabric 1300 shown in FIG. 13. The first row includes a module 1312' of the first stage 1340, a module 1314' of the second stage 1342, and a module 1316' of the third stage 1344. The module 1312' of the first stage 1340 includes a processor 1330 and a memory 1332. The processor 1330 is configured to control the receiving and sending of data. The memory 1332 is configured to buffer data when the module 1314' of the second stage 1342 cannot yet receive data and/or the module 1312' of the first stage 1340 cannot yet send data. In some embodiments, for example, if the module 1314' of the second stage 1342 has sent a suspension indicator to the module 1312' of the first stage 1340, the module 1312' buffers the data until the module 1314' of the second stage 1342 can receive it. Similarly, in some embodiments, the module 1312' of the first stage 1340 can buffer data when it receives multiple data signals at substantially the same time (e.g., from multiple input ports). In such embodiments, if only a single data signal can be output by the module 1312' at a given time (e.g., in each clock cycle), the other received data signals can be buffered. Like the module 1312' of the first stage 1340, each module within the switch fabric 1300 includes a processor and a memory.

The module 1312' of the first stage 1340 and its paired module 1316' of the third stage 1344 are both included on a first chip package 1326. This allows the flow control path 1322 between the module 1312' of the first stage 1340 and the module 1316' of the third stage 1344 to be constructed easily. For example, the flow control path 1322 can be a trace on the first chip package 1326 between the module 1312' of the first stage 1340 and the module 1316' of the third stage 1344. In other embodiments, the module of the first stage and the module of the third stage are on separate chip packages but in close proximity to each other, which still allows the flow control path between them to be established without a large amount of wiring and/or long traces.

The module 1314' of the second stage 1342 is included on a second chip package 1328. The unidirectional data path 1320 between the module 1312' of the first stage 1340 and the module 1314' of the second stage 1342, and the unidirectional data path 1324 between the module 1314' of the second stage 1342 and the module 1316' of the third stage 1344, operatively connect the first chip package 1326 to the second chip package 1328. Although not shown in FIG. 14, the module 1312' of the first stage 1340 and the module 1316' of the third stage 1344 are also connected to each module of the second stage by unidirectional data paths. As noted above, the unidirectional data paths can be constructed in any manner configured to facilitate data transfer between the modules effectively.

The flow control path 1322 and the unidirectional data paths 1320, 1324 can effectively be used to send flow control indicators among the modules 1312', 1314', 1316'. For example, if the module 1312' of the first stage 1340 is sending data to the module 1314' of the second stage 1342 and the amount of data in the buffer of the module 1314' exceeds a threshold, the module 1314' of the second stage 1342 can send a flow control indicator to the module 1316' of the third stage 1344 via the unidirectional data path 1324 between them. That flow control indicator triggers the module 1316' of the third stage 1344 to send a flow control indicator to the module 1312' of the first stage 1340 via the flow control path 1322. The flow control indicator sent from the module 1316' to the module 1312' causes the module 1312' of the first stage 1340 to stop sending data to the module 1314' of the second stage 1342. Similarly, a flow control indicator sent from the module 1314' of the second stage 1342 to the module 1312' of the first stage 1340 by way of the module 1316' of the third stage 1344 can request that the module 1312' send data (i.e., resume sending data) to the module 1314' of the second stage 1342.
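The relay of a flow control indicator from the second stage through the third stage back to the first stage can be sketched as follows. This is a minimal illustration only, not the implementation of the described embodiment: the class names, the "suspend" and "resume" indicator strings, the cell-count threshold, and the drain() method are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FirstStageModule:
    sending: bool = True  # whether this module may send data to the second stage

    def on_flow_control(self, indicator: str) -> None:
        # Indicator arrives over the on-package flow control path (1322).
        self.sending = (indicator == "resume")


@dataclass
class ThirdStageModule:
    paired_first_stage: FirstStageModule

    def on_flow_control(self, indicator: str) -> None:
        # Indicator arrives over the unidirectional data path (1324) and is
        # relayed to the paired first-stage module.
        self.paired_first_stage.on_flow_control(indicator)


@dataclass
class SecondStageModule:
    third_stage: ThirdStageModule
    threshold: int  # hypothetical buffer occupancy limit, counted in cells
    buffer: list = field(default_factory=list)

    def receive(self, cell) -> None:
        # Buffer the cell; ask the sender to stop once the threshold is exceeded.
        self.buffer.append(cell)
        if len(self.buffer) > self.threshold:
            self.third_stage.on_flow_control("suspend")

    def drain(self) -> None:
        # Empty the buffer and ask the sender to resume.
        self.buffer.clear()
        self.third_stage.on_flow_control("resume")
```

With a threshold of two cells, the first-stage module in this sketch keeps sending through the second received cell, is suspended by the third, and resumes after drain() is called.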

Having two stages of the switch fabric within the same chip package, with an on-chip bidirectional flow control path between them, minimizes the connections between separate chip packages, which can be bulky and/or require a large amount of space. In addition, having two stages within the same package with an on-chip bidirectional flow control path between them allows the data paths between chip packages to be unidirectional while still providing flow control communication between a sending module and a receiving module. Further details relating to bidirectional flow control paths within a switch fabric are described in co-pending U.S. Patent Application No. 12/345,490, filed December 29, 2008, entitled "Flow Control in a Switch Fabric," which is incorporated herein by reference in its entirety.

As described in connection with FIGS. 13 and 14, a buffer module can be included in a module within a stage of a switch fabric. Further details relating to a buffer module that can be included in, for example, a stage of a switch fabric are described in connection with FIG. 15.

FIG. 15 is a schematic illustration of a buffer module 1500, according to an embodiment. As shown in FIG. 15, data signals S0 through SM are received at the buffer module 1500 on an input side 1580 of the buffer module 1500 (e.g., through input ports 1562 of the buffer module 1500). After processing at the buffer module 1500, the data signals S0 through SM are sent from the buffer module 1500 on an output side 1585 of the buffer module 1500 (e.g., through output ports 1564 of the buffer module 1500). Each of the data signals S0 through SM can define a channel (also referred to as a data channel). The data signals S0 through SM can collectively be referred to as data signals 1560. Although the input side 1580 and the output side 1585 are shown on different physical sides of the buffer module 1500, they are logically defined and do not preclude various physical configurations of the buffer module 1500. For example, one or more of the input ports 1562 and/or one or more of the output ports 1564 can be physically located on any side (and/or the same side) of the buffer module 1500.

The buffer module 1500 can be configured to process the data signals 1560 so that the processing latency of the data signals 1560 through the buffer module 1500 is relatively small and substantially constant. Accordingly, the bit rates of the data signals 1560 can remain substantially unchanged as the data signals 1560 are processed through the buffer module 1500. For example, the processing latency of data signal S2 through the buffer module 1500 can be a substantially constant number of clock cycles (e.g., a single clock cycle or a few clock cycles). Data signal S2 can therefore be time-shifted by that number of clock cycles, and the bit rate of data signal S2 sent into the input side 1580 of the buffer module 1500 will be substantially the same as the bit rate of data signal S2 sent from the output side 1585 of the buffer module 1500.

The buffer module 1500 can be configured to modify the bit rate of one or more of the data signals 1560 in response to one or more portions of a flow control signal 1570. For example, the buffer module 1500 can be configured to delay data signal S2 received at the buffer module 1500 in response to a portion of the flow control signal 1570 indicating that data signal S2 should be delayed for a specified period of time. Specifically, the buffer module 1500 can be configured to store (e.g., hold) one or more portions of data signal S2 until the buffer module 1500 receives an indicator (e.g., a portion of the flow control signal 1570) that data signal S2 should no longer be delayed. Accordingly, the bit rate of data signal S2 sent into the input side 1580 of the buffer module 1500 will differ (e.g., differ substantially) from the bit rate of data signal S2 sent from the output side 1585 of the buffer module 1500.
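The hold-and-release behavior described for a single data signal can be sketched as follows. This is a toy model under stated assumptions: the class HoldingBuffer and its methods are hypothetical names, flow control is reduced to a boolean pause flag, and clock cycles and latency are not modeled.

```python
from collections import deque


class HoldingBuffer:
    """Toy per-channel buffer that forwards or holds data per flow control."""

    def __init__(self):
        self.held = deque()   # portions of the data signal being delayed
        self.paused = False   # set by the flow control signal

    def flow_control(self, pause: bool) -> list:
        """Apply a flow control indication; return data released by a resume."""
        self.paused = pause
        released = []
        if not pause:
            while self.held:
                released.append(self.held.popleft())
        return released

    def push(self, portion) -> list:
        """Accept one portion of the data signal; return what is sent out now."""
        if self.paused:
            self.held.append(portion)
            return []
        return [portion]
```

While unpaused, every portion passes straight through, so the input and output rates match. While paused, the output rate drops to zero and the held portions are emitted in arrival order once the pause is lifted.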

In some embodiments, processing at the buffer module 1500 can be performed using memory banks, on segments of, for example, variable-sized cells. For example, in some embodiments, segments of a cell can be processed during a distribution process through different memory banks (e.g., static random-access memory (SRAM) banks) included in the buffer module 1500. The memory banks can collectively define a shared memory buffer. In some embodiments, the segments of a data signal can be distributed to the memory banks in a predefined fashion (e.g., in a predefined pattern, in accordance with a predefined algorithm) during the distribution process. For example, in some embodiments, leading segments of the data signals 1560 can be processed at portions of the buffer module 1500 (e.g., specified memory banks of the buffer module 1500) that are different from the portions of the buffer module 1500 at which trailing segments are processed. In some embodiments, the segments of the data signals 1560 can be processed in a particular order. In some embodiments, for example, each segment of the data signals 1560 can be processed based on its respective position within a cell. After the segments of a cell have been processed through the shared memory buffer, the segments can be ordered and sent from the buffer module 1500 during a reassembly process.

In some embodiments, for example, a read multiplexing module of the buffer module 1500 can be configured to reassemble the segments associated with the data signals 1560 and send (e.g., transmit) the data signals 1560 from the buffer module 1500. The reassembly process can be defined based on the predefined methodology used to distribute the segments to the memory banks of the buffer module 1500. For example, the read multiplexing module can be configured to first read the leading segments associated with a cell from the leading memory banks in a round-robin fashion (because the segments were written in a round-robin fashion), and then read the trailing segments associated with the cell from the trailing memory banks in a round-robin fashion. Accordingly, very few control signals, if any, need to be sent between the write multiplexing module and the read multiplexing module. Further details relating to segment processing (e.g., segment distribution and/or segment reassembly) are described in co-pending U.S. Patent Application No. 12/415,517, filed March 31, 2009, entitled "Methods and Apparatus Related to Shared Memory Buffer for Variable-Sized Cells," which is incorporated herein by reference in its entirety.
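The reason round-robin writing lets the reader work with little or no control signaling can be illustrated with a short sketch. It is a simplification under stated assumptions: the function names are hypothetical, a single set of banks is used (the separate leading and trailing banks are not modeled), and the reader is assumed to know only the number of segments and the starting bank.

```python
def distribute(segments, num_banks, start_bank=0):
    """Write segments to banks in round-robin order starting at start_bank."""
    banks = [[] for _ in range(num_banks)]
    for i, seg in enumerate(segments):
        banks[(start_bank + i) % num_banks].append(seg)
    return banks


def reassemble(banks, num_segments, start_bank=0):
    """Read segments back by replaying the same round-robin order."""
    cursors = [0] * len(banks)  # next unread position in each bank
    out = []
    for i in range(num_segments):
        b = (start_bank + i) % len(banks)
        out.append(banks[b][cursors[b]])
        cursors[b] += 1
    return out


cell = ["s0", "s1", "s2", "s3", "s4"]
banks = distribute(cell, num_banks=3)   # [['s0', 's3'], ['s1', 's4'], ['s2']]
restored = reassemble(banks, len(cell))  # same order as cell
```

Because both sides follow the same predefined pattern, the reader recovers the original segment order without any per-segment metadata from the writer.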

FIG. 16A is a schematic block diagram of an ingress scheduling module 1620 and an egress scheduling module 1630 configured to coordinate the transmission of groups of cells via a switch fabric 1600 of a switch core 1690, according to an embodiment. Coordinating can include, for example, scheduling the transmission of the groups of cells via the switch fabric 1600, tracking requests and/or responses related to the transmission of the groups of cells, and so forth. The ingress scheduling module 1620 can be included on an ingress side of the switch fabric 1600, and the egress scheduling module 1630 can be included on an egress side of the switch fabric 1600. The switch fabric 1600 can include an ingress stage 1602, a middle stage 1604, and an egress stage 1606. In some embodiments, the switch fabric 1600 can be defined based on a Clos network architecture (e.g., a non-blocking Clos network, a strict-sense non-blocking Clos network, a Benes network), and the switch fabric 1600 can include a data plane and a control plane. In some embodiments, the switch fabric 1600 can be a core portion of a data center (not shown), which can include a network or an interconnection of devices.

As shown in FIG. 16A, ingress queues IQ1 through IQK (collectively referred to as ingress queues 1610) can be located on the ingress side of the switch fabric 1600. The ingress queues 1610 can be associated with the ingress stage 1602 of the switch fabric 1600. In some embodiments, the ingress queues 1610 can be included in a line card. In some embodiments, the ingress queues 1610 can be located outside of the switch fabric 1600 and/or outside of the switch core 1690. Each of the ingress queues 1610 can be a first-in-first-out (FIFO) type queue. Although not shown, in some embodiments, each of the ingress queues IQ1 through IQK can be associated with (e.g., uniquely associated with) an input/output port (e.g., a 10 Gb/s port). In some embodiments, each of the ingress queues IQ1 through IQK can be large enough to implement a congestion resolution scheme, such as a request-grant congestion resolution scheme. For example, ingress queue IQK-1 can be large enough to hold a cell (or a group of cells) until the request-grant congestion resolution scheme has been carried out for the cell (or the group of cells).

As shown in FIG. 16A, egress ports P1 through PL (collectively referred to as egress ports 1640) can be located on the egress side of the switch fabric 1600. The egress ports 1640 can be associated with the egress stage 1606 of the switch fabric 1600. In some embodiments, the egress ports 1640 can be referred to as destination ports.

In some embodiments, the ingress queues 1610 can be included in one or more ingress line cards (not shown) located outside of the ingress stage 1602 of the switch fabric 1600. In some embodiments, the egress ports 1640 can be included in one or more egress line cards (not shown) located outside of the egress stage 1606 of the switch fabric 1600. In some embodiments, one or more of the ingress queues 1610 and/or one or more of the egress ports 1640 can be included in one or more stages of the switch fabric 1600 (e.g., the ingress stage 1602). In some embodiments, the ingress scheduling module 1620 can be included in one or more ingress line cards and/or the egress scheduling module 1630 can be included in one or more egress line cards. In some embodiments, each line card (e.g., egress line card, ingress line card) associated with the switch core 1690 can include one or more scheduling modules (e.g., an egress scheduling module, an ingress scheduling module).

In some embodiments, the ingress queues 1610 and/or the egress ports 1640 can be included in one or more gateway devices (not shown) located between the switch fabric 1600 and peripheral processing devices (not shown). The gateway device(s), the switch fabric 1600, and/or the peripheral processing devices can collectively define at least a portion of a data center (not shown). In some embodiments, the gateway device(s) can be edge devices within an edge portion of the switch core 1690. In some embodiments, the switch fabric 1600 and the peripheral processing devices can be configured to handle data based on different protocols. For example, the peripheral processing devices can include one or more host devices (e.g., host devices configured to execute one or more virtual resources, web servers) that can be configured to communicate based on an Ethernet protocol, while the switch fabric 1600 can be a cell-based fabric. In other words, the gateway device(s) can provide other devices configured to communicate via one protocol with access to the switch fabric 1600, which can be configured to communicate via another protocol. In some embodiments, the gateway device(s) can be referred to as access switches or network devices. In some embodiments, the gateway device(s) can be configured to function as a router, a network hub device, and/or a network bridge device.

In this embodiment, for example, the ingress scheduling module 1620 can be configured to define a group of cells GA queued at ingress queue IQ1 and a group of cells GC queued at ingress queue IQK-1. The group of cells GA is queued at the front of ingress queue IQ1, and a group of cells GB is queued behind the group of cells GA within ingress queue IQ1. Because ingress queue IQ1 is a FIFO type queue, the group of cells GB cannot be sent via the switch fabric 1600 until the group of cells GA has been sent from ingress queue IQ1. The group of cells GC is queued at the front of ingress queue IQK-1.

In some embodiments, portions of the ingress queues 1610 can be mapped to (e.g., assigned to) one or more of the egress ports 1640. For example, ingress queues IQ1 through IQK-1 can be mapped to egress port P1, so that all of the cells 310 queued at ingress queues IQ1 through IQK-1 will be scheduled by the ingress scheduling module 1620 for transmission via the switch fabric 1600 to egress port P1. Similarly, ingress queue IQK can be mapped to egress port P2. The mapping can be stored in a memory (e.g., memory 1622) as, for example, a look-up table that the ingress scheduling module 1620 can access when scheduling (e.g., requesting) the transmission of groups of cells.
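A look-up table of this kind can be sketched as a plain dictionary. The example below assumes K = 4 ingress queues purely for illustration; the table contents and the helper names destination_port and queues_for_port are hypothetical and are not taken from the described embodiment.

```python
# Hypothetical mapping for K = 4: IQ1 through IQ3 map to P1, IQ4 maps to P2.
QUEUE_TO_PORT = {"IQ1": "P1", "IQ2": "P1", "IQ3": "P1", "IQ4": "P2"}


def destination_port(queue_id: str) -> str:
    """Return the egress port to which cells from queue_id are scheduled."""
    return QUEUE_TO_PORT[queue_id]


def queues_for_port(port_id: str) -> list:
    """Return the ingress queues mapped to port_id, in sorted order."""
    return sorted(q for q, p in QUEUE_TO_PORT.items() if p == port_id)
```

A scheduler consulting this table when it builds a transmission request would obtain "P1" for IQ1 and "P2" for IQ4.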

In some embodiments, one or more of the ingress queues 1610 can be associated with a priority value (also referred to as a transmission priority value). The ingress scheduling module 1620 can be configured to schedule the transmission of cells from the ingress queues 1610 based on the priority values. For example, because ingress queue IQK-1 can be associated with a higher priority value than ingress queue IQ1, the ingress scheduling module 1620 can be configured to request transmission of the group of cells GC to egress port P1 before requesting transmission of the group of cells GA to egress port P1. The priority values can be defined based on a level of service (e.g., quality of service (QoS)). For example, in some embodiments, different types of network traffic can be associated with different levels of service (and thus different priorities). For example, storage traffic (e.g., read and write traffic), inter-processor communication, media signaling, session layer signaling, and so forth can each be associated with at least one level of service. In some embodiments, the priority values can be based on, for example, the IEEE 802.1Qbb protocol, which defines a priority-based flow control strategy.
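Ordering transmission requests by queue priority can be sketched with a heap. The numeric priority values below, the convention that a larger number means higher priority, and the arrival-order tie-break are assumptions made for the example; the text above does not specify them.

```python
import heapq


def order_requests(pending):
    """Return group names ordered by descending priority, then by arrival.

    pending is a list of (group_name, priority) pairs in arrival order.
    """
    # Negate the priority so the min-heap pops the highest priority first;
    # the arrival index breaks ties in favor of the earlier request.
    heap = [(-prio, arrival, group)
            for arrival, (group, prio) in enumerate(pending)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# GA is queued at IQ1 (assumed priority 1), GC at IQK-1 (assumed priority 3).
print(order_requests([("GA", 1), ("GC", 3)]))  # ['GC', 'GA']
```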

In some embodiments, one or more of the ingress queues 1610 and/or one or more of the egress ports 1640 can be paused, for example so that cells are not lost. For example, if egress port P1 is temporarily unavailable, the transmission of cells from ingress queue IQ1 and/or ingress queue IQK-1 can be paused so that cells are not lost at egress port P1 because egress port P1 is temporarily unavailable. In some embodiments, one or more of the ingress queues 1610 can be associated with a priority value. For example, if egress port P1 is congested, the transmission of cells from ingress queue IQ1 to egress port P1 can be paused while the transmission of cells from ingress queue IQK-1 to egress port P1 is not, because ingress queue IQK-1 can be associated with a higher priority value than ingress queue IQ1.
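Selecting which ingress queues to pause when an egress port is congested can be sketched as a filter over the queue-to-port mapping and the queue priorities. The function name, the min_priority cutoff, and the example priority numbers are hypothetical; the text above says only that a lower-priority queue can be paused while a higher-priority one continues.

```python
def paused_queues(queue_priority, queue_port, congested_port, min_priority):
    """Return queues mapped to congested_port whose priority is below min_priority."""
    return sorted(
        q for q, port in queue_port.items()
        if port == congested_port and queue_priority[q] < min_priority
    )


# Assumed values: IQK-1 outranks IQ1, and both are mapped to P1.
priority = {"IQ1": 1, "IQK-1": 3, "IQK": 2}
mapping = {"IQ1": "P1", "IQK-1": "P1", "IQK": "P2"}
print(paused_queues(priority, mapping, "P1", min_priority=2))  # ['IQ1']
```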

The ingress scheduling module 1620 can be configured to exchange signals with (e.g., send signals to and receive signals from) the egress scheduling module 1630 to coordinate the transmission of the group of cells GA via the switch fabric 1600 to egress port P1, and to coordinate the transmission of the group of cells GC via the switch fabric 1600 to egress port P1. Because the group of cells GA is to be sent to egress port P1, egress port P1 can be referred to as the destination port of the group of cells GA. Similarly, egress port P1 can be referred to as the destination port of the group of cells GB. As shown in FIG. 16A, the group of cells GA can be sent via a transmission path 4112 that is different from the transmission path 4114 via which the group of cells GC is sent.

The group of cells GA and the group of cells GB are defined by the ingress scheduling module 1620 based on the cells 4110 queued at ingress queue IQ1. Specifically, the group of cells GA can be defined based on each cell of the group of cells GA having a common destination port and having a specified position within ingress queue IQ1. Similarly, the group of cells GC can be defined based on each cell of the group of cells GC having a common destination port and having a specified position within ingress queue IQK-1. Although not shown, in some embodiments, the cells 4110 can include, for example, content (e.g., data packets) received at the switch core 1690 from one or more peripheral processing devices (e.g., a personal computer, a server, a router, a personal digital assistant (PDA)) via one or more networks (e.g., a local area network (LAN), a wide area network (WAN), a virtual network) that can be wired and/or wireless. Further details relating to defining groups of cells, such as the group of cells GA, the group of cells GB, and/or the group of cells GC, are discussed in connection with FIGS. 17 and 18.
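One way a group could be formed from cells that share a destination port and sit together at the front of a FIFO queue is sketched below. The actual grouping rule is the subject of FIGS. 17 and 18 and is not reproduced here. The function define_group, the (cell_id, destination) tuple layout, and the max_cells limit are assumptions made for illustration.

```python
def define_group(queue, max_cells):
    """Return the leading cells of a FIFO queue that share one destination.

    queue is a list of (cell_id, destination) pairs with the front at index 0.
    The group stops at max_cells or at the first cell bound elsewhere.
    """
    group = []
    for cell_id, dest in queue:
        if len(group) == max_cells or (group and dest != group[0][1]):
            break
        group.append((cell_id, dest))
    return group


iq1 = [("c1", "P1"), ("c2", "P1"), ("c3", "P2"), ("c4", "P1")]
print(define_group(iq1, max_cells=7))  # [('c1', 'P1'), ('c2', 'P1')]
```

In this sketch the group ends at the first cell with a different destination, so cell c4 is left for a later group even though it is also bound for P1. That preserves FIFO order within the queue.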

FIG. 16B is a signaling flow diagram that illustrates signaling related to the transmission of the group of cells GA, according to an embodiment. As shown in FIG. 16B, time increases in the downward direction. After the group of cells GA has been defined (as shown in FIG. 16A), the ingress scheduling module 1620 can be configured to send a request to schedule the group of cells GA for transmission via the switch fabric 1600; the request is shown as transmission request 22. The transmission request 22 can be defined as a request to send the group of cells GA to the destination port of the group of cells GA, namely egress port P1. In some embodiments, the destination port of the group of cells GA can also be referred to as the target of the transmission request 22 (also referred to as the target destination port). In some embodiments, the transmission request 22 can include a request to send the group of cells GA through the switch fabric 1600 via a particular transmission path (e.g., the transmission path 4112 shown in FIG. 16A) or at a particular time. The ingress scheduling module 1620 can be configured to send the transmission request 22 to the egress scheduling module 1630 after the transmission request 22 has been defined at the ingress scheduling module 1620.

In some embodiments, the transmission request 22 can be queued on the ingress side of the switch fabric 1600 before being sent to the egress side of the switch fabric 1600. In some embodiments, the transmission request 22 can be queued until the ingress scheduling module 1620 triggers sending of the transmission request 22 to the egress side of the switch fabric 1600. In some embodiments, the ingress scheduling module 1620 can be configured to hold (or trigger holding of) the transmission request 22 in, for example, an ingress transmission request queue (not shown) because the volume of transmission requests to be sent from the ingress side of the switch fabric 1600 is above a threshold. The threshold can be defined based on the latency of transmissions via the switch fabric 1600.
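Holding transmission requests on the ingress side while the request volume is above a threshold can be sketched as follows. Modeling the volume as a count of outstanding (unanswered) requests is an assumption made for this example, as are the class name RequestThrottle and its methods. The text above states only that the threshold can be derived from the transmission latency of the switch fabric.

```python
from collections import deque


class RequestThrottle:
    """Toy ingress-side gate that holds requests above a volume threshold."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding  # hypothetical threshold
        self.outstanding = 0                    # requests sent but not yet answered
        self.held = deque()                     # ingress transmission request queue

    def submit(self, request) -> bool:
        """Send the request now if under the threshold; otherwise hold it."""
        if self.outstanding < self.max_outstanding:
            self.outstanding += 1
            return True
        self.held.append(request)
        return False

    def on_response(self):
        """Record a response; release and return one held request, if any."""
        self.outstanding -= 1
        if self.held:
            self.outstanding += 1
            return self.held.popleft()
        return None
```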

In some embodiments, the transmission request 22 can be queued at an egress queue (not shown) on the egress side of the switch fabric 1600. In some embodiments, the egress queue can be included in a line card (not shown) located within or outside of the switch fabric 1600, or outside of the switch core 1690. Although not shown, in some embodiments, the transmission request 22 can be queued at an egress queue, or a portion of an egress queue, associated with a specific ingress queue (e.g., ingress queue IQ1). In some embodiments, each of the egress ports 1640 can be associated with egress queues that are associated with (e.g., correspond to) the priority values of the ingress queues 1610. For example, egress port P1 can be associated with an egress queue (or a portion of an egress queue) associated with ingress queue IQ1 (which has a specified priority value) and an egress queue (or a portion of an egress queue) associated with ingress queue IQK (which has a specified priority value). Accordingly, a transmission request 22 queued at ingress queue IQ1 can be queued at the egress queue associated with ingress queue IQ1. In other words, the transmission request 22 can be queued at an egress queue (on the egress side of the switch fabric 1600) associated with the priority value of at least one of the ingress queues 1610. Similarly, the transmission request 22 can be queued in an ingress transmission request queue (not shown), or a portion of an ingress transmission request queue, associated with the priority value of at least one of the ingress queues 1610.

If the output scheduling module 1630 determines that the destination port of cell group GA (i.e., output port P1 shown in FIG. 16A) is available to receive cell group GA, the output scheduling module 1630 can be configured to send a transmission response 24 to the input scheduling module 1620. The transmission response 24 can be, for example, an authorization for cell group GA to be sent (e.g., sent from input queue IQ1 shown in FIG. 16A) to the destination port of cell group GA. An authorization to send a cell group can be referred to as a transmission authorization. In some embodiments, cell group GA and/or input queue IQ1 can be referred to as the target of the transmission response 24. In some embodiments, authorization for cell group GA to be sent can be granted when transmission across the switch fabric 1600 is substantially authorized, for example, because the destination port is available.

In response to the transmission response 24, the input scheduling module 1620 can be configured to send cell group GA from the input side of the switch fabric 1600 to the output side of the switch fabric 1600 via the switch fabric 1600. In some embodiments, the transmission response 24 can include an instruction to send cell group GA through the switch fabric 1600 via a particular transmission path (such as transmission path 4112 shown in FIG. 16A), or at a particular time. In some embodiments, the instruction can be defined based on, for example, a routing policy.

As shown in FIG. 16B, the transmission request 22 includes a cell quantity value 30, a destination identifier (ID) 32, a queue identifier (ID) 34, and a queue sequence value (SV) 36 (which can collectively be referred to as a request tag). The cell quantity value 30 can represent the number of cells included in cell group GA. For example, in this embodiment, cell group GA includes seven (7) cells (shown in FIG. 16A). The destination identifier 32 can represent the destination port of cell group GA so that the target of the transmission request 22 can be determined by the output scheduling module 1630.
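
The four fields of the request tag can be pictured as a small record. The following is a minimal sketch only, not a format defined by this description: the class name, field names, and field types are hypothetical, and the queue sequence value of 0 is an arbitrary illustrative choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransmissionRequest:
    """Hypothetical model of transmission request 22 (names are assumptions)."""
    cell_quantity: int    # cell quantity value 30: number of cells in the group
    destination_id: str   # destination identifier 32: destination port
    queue_id: str         # queue identifier 34: input queue holding the group
    queue_sequence: int   # queue sequence value 36: position within that queue

    def request_tag(self):
        # The four values together are what the text calls the request tag.
        return (self.cell_quantity, self.destination_id,
                self.queue_id, self.queue_sequence)


# Cell group GA: seven cells queued at IQ1 and destined for output port P1.
request_22 = TransmissionRequest(cell_quantity=7, destination_id="P1",
                                 queue_id="IQ1", queue_sequence=0)
```

Making the record immutable reflects the idea that a request, once defined, identifies one specific cell group.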

The cell quantity value 30 and the destination identifier 32 can be used by the output scheduling module 1630 to schedule cell group GA for transmission via the switch fabric 1600 to output port P1 (shown in FIG. 16A). As shown in FIG. 16B, in this embodiment, the output scheduling module 1630 can be configured to define and send the transmission response 24 because the number of cells included in cell group GA can be handled (e.g., can be received) at the destination port of cell group GA (e.g., output port P1 shown in FIG. 16A).

In some embodiments, if the number of cells included in cell group GA cannot be handled (e.g., cannot be received) at the destination port of cell group GA (e.g., output port P1 shown in FIG. 16A) because the destination port is unavailable (e.g., in an unavailable state, in a congested state), the output scheduling module 1630 can be configured to communicate the unavailability to the input scheduling module 1620. In some embodiments, for example, the output scheduling module 1630 can be configured to deny a request (not shown) to send cell group GA via the switch fabric 1600 when the destination port of cell group GA is unavailable. A denial of the transmission request 22 can be referred to as a transmission denial. In some embodiments, the transmission denial can include a response tag.

In some embodiments, the availability or unavailability of, for example, output port P1 (shown in FIG. 16A) can be determined by the output scheduling module 1630 based on a condition being satisfied. For example, the condition can relate to a storage limit of a queue (not shown in FIG. 16A) associated with output port P1 being exceeded, a flow rate of data via output port P1, a number of cells already scheduled for transmission from the input queues 1610 via the switch fabric 1600 (shown in FIG. 16A), and so forth. In some embodiments, output port P1 can be unavailable to receive cells via the switch fabric 1600 when output port P1 is disabled.
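
The availability conditions listed above can be sketched as a simple admission check at the output side. This is an illustrative sketch under stated assumptions, not the described implementation: the class, the function names, and the two numeric limits are hypothetical, and the response tag is modeled as the queue identifier paired with the queue sequence value.

```python
from dataclasses import dataclass


@dataclass
class OutputPortState:
    """Hypothetical bookkeeping for one output port (all limits are assumptions)."""
    enabled: bool = True
    queued_cells: int = 0      # cells currently held in the port's output queue
    queue_limit: int = 64      # assumed storage limit of that queue, in cells
    scheduled_cells: int = 0   # cells already granted but not yet received
    scheduled_limit: int = 32  # assumed cap on granted-but-pending cells


def port_can_accept(port, cell_quantity):
    # A disabled port is unavailable regardless of queue occupancy.
    if not port.enabled:
        return False
    # Accepting the group must not exceed the queue's storage limit.
    if port.queued_cells + cell_quantity > port.queue_limit:
        return False
    # Nor may it exceed the number of cells allowed to be in flight.
    if port.scheduled_cells + cell_quantity > port.scheduled_limit:
        return False
    return True


def handle_request(port, cell_quantity, queue_id, queue_sequence):
    """Return ("grant", tag) or ("deny", tag) for one transmission request."""
    response_tag = (queue_id, queue_sequence)
    if port_can_accept(port, cell_quantity):
        port.scheduled_cells += cell_quantity  # reserve capacity for the group
        return ("grant", response_tag)
    return ("deny", response_tag)
```

The sketch denies immediately when the port is unavailable; delaying the response until the port becomes available is an alternative the text also allows.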

As shown in FIG. 16B, the queue identifier 34 and the queue sequence value 36 are sent to the output scheduling module 1630 in the transmission request 22. The queue identifier 34 can represent and/or can be used to identify (e.g., individually identify) input queue IQ1 (shown in FIG. 16A), in which cell group GA is queued. The queue sequence value 36 can represent the position of cell group GA relative to other cell groups within input queue IQ1. For example, cell group GA can be associated with queue sequence value X and cell group GB (queued at input queue IQ1 as shown in FIG. 16A) can be associated with queue sequence value Y. The queue sequence value X can indicate that cell group GA is to be sent from input queue IQ1 before cell group GB, which is associated with queue sequence value Y.

In some embodiments, the queue sequence value 36 can be selected from a range of queue sequence values associated with input queue IQ1 (shown in FIG. 16A). The range of queue sequence values can be defined so that sequence values from the range are not repeated for input queue IQ1 within a specified period of time. For example, the range of queue sequence values can be defined so that queue sequence values from the range are not repeated during at least the period of time required to flush several cycles of cells (e.g., cells 160) queued at input queue IQ1 through the switch core 1690 (shown in FIG. 16A). In some embodiments, a queue sequence value can be incremented (within the range of queue sequence values) and associated with each cell group defined by the input scheduling module 1620 based on the cells 4110 queued at input queue IQ1.

In some embodiments, the range of queue sequence values associated with input queue IQ1 can overlap with the range of queue sequence values associated with another of the input queues 1610 (shown in FIG. 16A). Accordingly, the queue sequence value 36, even if taken from a non-unique range of queue sequence values, can be included with (e.g., included within) the queue identifier 34 (which can be unique) to uniquely identify cell group GA (at least during a specified period of time). In some embodiments, the queue sequence value 36 can be unique within the switch fabric 1600 or can be a globally unique value (GUID) (e.g., a universally unique identifier (UUID)).
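
One way to picture a bounded, reusable range of queue sequence values is a per-queue counter that wraps around. The sketch below is an assumption-laden illustration: the class name, the range size of 4, and the use of a (queue identifier, sequence value) pair as the unique key are hypothetical choices, not details taken from this description.

```python
class QueueSequencer:
    """Hypothetical per-queue counter that hands out wrapping sequence values."""

    def __init__(self, queue_id, range_size):
        self.queue_id = queue_id
        self.range_size = range_size  # number of values before the range repeats
        self._next = 0

    def next_tag(self):
        # Pair the sequence value with the queue identifier so that two queues
        # with overlapping ranges still produce distinct tags.
        tag = (self.queue_id, self._next)
        self._next = (self._next + 1) % self.range_size
        return tag


iq1 = QueueSequencer("IQ1", range_size=4)
iqk = QueueSequencer("IQK", range_size=4)
# Five tags from IQ1 (the fifth wraps around) followed by one tag from IQK.
tags = [iq1.next_tag() for _ in range(5)] + [iqk.next_tag()]
```

In practice the range would need to be large enough that a value is not reused while an earlier cell group carrying the same value is still in flight.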

In some embodiments, the input scheduling module 1620 can be configured to wait to define a transmission request (not shown) associated with cell group GB. For example, the input scheduling module 1620 can be configured to wait until the transmission request 22 is sent, or until a response to the transmission request 22 (e.g., the transmission response 24, a transmission denial) is received, before defining a transmission request associated with cell group GB.

As shown in FIG. 16B, the output scheduling module 1630 can be configured to include the queue identifier 34 and the queue sequence value 36 (which can collectively be referred to as a response tag) in the transmission response 24. The queue identifier 34 and the queue sequence value 36 can be included in the transmission response 24 so that the transmission response 24 can be associated with cell group GA at the input scheduling module 1620 when the transmission response 24 is received at the input scheduling module 1620. Specifically, the queue identifier 34 and the queue sequence value 36 can collectively be used to identify cell group GA as being authorized for transmission via the switch fabric 1600.

In some embodiments, the output scheduling module 1630 can be configured to delay sending the transmission response 24 in response to the transmission request 22. In some embodiments, the output scheduling module 1630 can be configured to delay responding if, for example, the destination port of cell group GA (i.e., output port P1 shown in FIG. 16A) is unavailable (e.g., temporarily unavailable). In some embodiments, the output scheduling module 1630 can be configured to send the transmission response 24 in response to output port P1 changing from an unavailable state to an available state.

In some embodiments, the output scheduling module 1630 can be configured to delay sending the transmission response 24 because the destination port of cell group GA (i.e., output port P1 shown in FIG. 16A) is receiving data from another of the input queues 1610. For example, output port P1 can be unavailable to receive data from input queue IQ1 because output port P1 is receiving a different cell group (not shown) from, for example, input queue IQK (shown in FIG. 16A). In some embodiments, cell groups from input queue IQ1 can be associated with a higher priority value than cell groups from input queue IQK, based on the priority values associated with input queue IQ1 and input queue IQK. The output scheduling module 1630 can be configured to delay sending the transmission response 24 for a period of time calculated based on, for example, the size of the different cell group being received at output port P1. For example, the output scheduling module 1630 can be configured to delay sending the transmission response 24 targeted to cell group GA for the period of time expected to be required to complete processing of the different cell group at output port P1. In other words, the output scheduling module 1630 can be configured to delay sending the transmission response 24 targeted to cell group GA based on the expected time at which output port P1 will change from an unavailable state to an available state.
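
A delay calculated from the size of the cell group still being received could be as simple as the product below. This is a rough sketch only: the function name, the per-cell time parameter, and the example figures (12 cells, 50 ns per cell) are hypothetical, and an actual scheduler could take other factors into account.

```python
def expected_response_delay(remaining_cells, cell_time_ns):
    """Estimate, in nanoseconds, how long the output port stays busy.

    remaining_cells: cells of the other group still to be received (assumed known)
    cell_time_ns: assumed time to receive one cell at the output port
    """
    if remaining_cells < 0 or cell_time_ns < 0:
        raise ValueError("inputs must be non-negative")
    return remaining_cells * cell_time_ns


# Example: 12 cells of another group remain, at an assumed 50 ns per cell.
delay_ns = expected_response_delay(remaining_cells=12, cell_time_ns=50)
```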

In some embodiments, the output scheduling module 1630 can be configured to delay sending the transmission response 24 because at least a portion of the transmission path (such as transmission path 4112 shown in FIG. 16A) through which cell group GA is to be sent is unavailable (e.g., congested). The output scheduling module 1630 can be configured to delay sending the transmission response 24 until the portion of the transmission path is no longer congested, or based on the expected time at which the portion of the transmission path will no longer be congested.

As shown in FIG. 16B, cell group GA can be sent to the destination port of cell group GA based on (e.g., in response to) the transmission response 24. In some embodiments, cell group GA can be sent based on one or more instructions included in the transmission response 24. For example, in some embodiments, cell group GA can be sent via the transmission path 4112 (shown in FIG. 16A) based on an instruction included in the transmission response 24, or based on one or more rules for transmission of cell groups via the switch fabric 1600 (e.g., rules for transmission of cell groups via a rearrangeable switch fabric). Although not shown, in some embodiments, after cell group GA has been received at output port P1 (shown in FIG. 16A), content (e.g., data packets) from the cell group can be sent to one or more network entities (e.g., a personal computer, a server, a router, a PDA) via one or more networks (e.g., a LAN, a WAN, a virtual network) that can be wired and/or wireless.

Referring back to FIG. 16A, in some embodiments, cell group GA can be sent via the transmission path 4112 and received at an output queue (not shown) that is relatively small compared with, for example, the input queues 1610. In some embodiments, the output queue (or a portion of the output queue) can be associated with a priority value. The priority value can be associated with one or more of the input queues 1610. The output scheduling module 1630 can be configured to retrieve cell group GA from the output queue and to send cell group GA to output port P1.

In some embodiments, cell group GA can be retrieved and sent to output port P1 with a response identifier that is included with cell group GA by the input scheduling module 1620 when cell group GA is sent to the output side of the switch fabric 1600. The response identifier can be defined at the output scheduling module 1630 and included in the transmission response 24. In some embodiments, if cell group GA is queued at an output queue (not shown) associated with the destination port of cell group GA, the response identifier can be used to retrieve cell group GA from the destination port of cell group GA so that cell group GA can be sent from the switch fabric 1600 via the destination port of cell group GA. The response identifier can be associated with a location in the output queue that has been reserved by the output scheduling module 1630 for queuing of cell group GA.

In some embodiments, a cell group queued at the input queues 1610 can be moved to the memory 1622 when a transmission request (such as transmission request 22 shown in FIG. 16B) associated with the cell group is defined. For example, cell group GD queued at input queue IQK can be moved to the memory 1622 in response to a transmission request associated with cell group GD being defined. In some embodiments, cell group GD can be moved to the memory 1622 before the transmission request associated with cell group GD is sent from the input scheduling module 1620 to the output scheduling module 1630. Cell group GD can be stored in the memory 1622 until cell group GD is sent from the input side of the switch fabric 1600 to the output side of the switch fabric 1600. In some embodiments, the cell group can be moved to the memory 1622 to reduce congestion (e.g., head-of-line (HOL) blocking) at input queue IQK.

In some embodiments, the input scheduling module 1620 can be configured to retrieve a cell group stored in the memory 1622 based on a queue identifier and/or a queue sequence value associated with the cell group. In some embodiments, the location of the cell group within the memory 1622 can be determined based on a look-up table and/or an index value. The cell group can be retrieved before the cell group is sent from the input side of the switch fabric 1600 to the output side of the switch fabric 1600. For example, cell group GD can be associated with a queue identifier and/or a queue sequence value. The location within the memory 1622 where cell group GD is stored can be associated with the queue identifier and/or the queue sequence value. A transmission request defined by the input scheduling module 1620 and sent to the output scheduling module 1630 can include the queue identifier and/or the queue sequence value. A transmission response received from the output scheduling module 1630 can include the queue identifier and/or the queue sequence value. In response to the transmission response, the input scheduling module 1620 can be configured to retrieve cell group GD from the memory 1622 at the location based on the queue identifier and/or the queue sequence value, and the input scheduling module 1620 can trigger sending of cell group GD.
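
The look-up described above, in which a stored cell group is located by its queue identifier and queue sequence value, can be sketched with a dictionary standing in for the look-up table. This is a minimal illustration under assumptions: the class name, the cell labels, and the sequence value 5 are hypothetical, and a hardware memory would use addresses rather than a Python dictionary.

```python
class CellGroupMemory:
    """Hypothetical stand-in for memory 1622, indexed by (queue ID, sequence value)."""

    def __init__(self):
        self._slots = {}  # plays the role of the look-up table

    def __len__(self):
        return len(self._slots)

    def store(self, queue_id, queue_sequence, cells):
        # Called when the transmission request for the group is defined.
        self._slots[(queue_id, queue_sequence)] = list(cells)

    def fetch(self, queue_id, queue_sequence):
        # Called when a transmission response carrying the same tag arrives;
        # the slot is released once the group is handed off for sending.
        return self._slots.pop((queue_id, queue_sequence))


memory_1622 = CellGroupMemory()
memory_1622.store("IQK", 5, ["cell-a", "cell-b"])  # cell group GD, assumed contents

response_tag = ("IQK", 5)  # tag echoed back in the transmission response
group_gd = memory_1622.fetch(*response_tag)
```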

In some embodiments, the number of cells included in a cell group can be defined based on the amount of space available in the memory 1622. For example, the input scheduling module 1620 can be configured to define the number of cells included in cell group GD based on the amount of storage space available in the memory 1622 at the time cell group GD is being defined. In some embodiments, the number of cells included in cell group GD can be increased if the amount of storage space available in the memory 1622 increases. In some embodiments, the number of cells included in cell group GD can be increased by the input scheduling module 1620 before and/or after cell group GD is moved to the memory 1622 for storage.

In some embodiments, the number of cells included in a cell group can be defined based on the latency of transmission across, for example, the switch fabric 1600. Specifically, the input scheduling module 1620 can be configured to define the size of a cell group to facilitate flow across the switch fabric 1600 in view of the latencies associated with the switch fabric 1600. For example, the input scheduling module 1620 can be configured to close a cell group (e.g., define the size of the cell group) because the cell group has reached a threshold size defined based on the latency of the switch fabric 1600. In some embodiments, the input scheduling module 1620 can be configured to immediately send a data packet in a cell group, rather than wait for additional data packets to define a larger cell group, because the latency across the switch fabric 1600 is low.

In some embodiments, the input scheduling module 1620 can be configured to limit the number of transmission requests sent from the input side of the switch fabric 1600 to the output side of the switch fabric 1600. In some embodiments, the limit can be defined based on a policy stored at the input scheduling module 1620. In some embodiments, the limit can be defined based on priority values associated with one or more of the input queues 1610. For example, the input scheduling module 1620 can be configured to permit (based on a threshold limit) more transmission requests associated with input queue IQ1 than transmission requests from input queue IQK because input queue IQ1 has a higher priority value than input queue IQK.
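
A per-queue cap on outstanding transmission requests, with a larger cap for the higher-priority queue, might look like the following. This is a sketch under assumptions: the class name, the method names, and the limits of 3 and 1 are hypothetical and are chosen only to show IQ1 being permitted more requests than IQK.

```python
class RequestLimiter:
    """Hypothetical limiter on outstanding transmission requests per input queue."""

    def __init__(self, limits):
        self.limits = dict(limits)             # queue ID -> maximum outstanding
        self.outstanding = {q: 0 for q in limits}

    def try_send(self, queue_id):
        # Refuse to send another request once the queue's threshold is reached.
        if self.outstanding[queue_id] >= self.limits[queue_id]:
            return False
        self.outstanding[queue_id] += 1
        return True

    def on_response(self, queue_id):
        # A response or denial frees one slot for that queue.
        self.outstanding[queue_id] = max(0, self.outstanding[queue_id] - 1)


# IQ1 is assumed to have the higher priority value, so it gets the larger limit.
limiter = RequestLimiter({"IQ1": 3, "IQK": 1})
sent_iq1 = [limiter.try_send("IQ1") for _ in range(4)]
sent_iqk = [limiter.try_send("IQK") for _ in range(2)]
```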

In some embodiments, one or more portions of the input scheduling module 1620 and/or the output scheduling module 1630 can be a hardware-based module (e.g., a DSP, an FPGA) and/or a software-based module (e.g., a module of computer code, a set of processor-readable instructions that can be executed at a processor). In some embodiments, one or more of the functions associated with the input scheduling module 1620 and/or the output scheduling module 1630 can be included in different modules and/or combined into one or more modules. For example, cell group GA can be defined by a first sub-module within the input scheduling module 1620 and the transmission request 22 (shown in FIG. 16B) can be defined by a second sub-module within the input scheduling module 1620.

In some embodiments, the switch fabric 1600 can have more or fewer stages than are shown in FIG. 16A. In some embodiments, the switch fabric 1600 can be a reconfigurable (e.g., rearrangeable) switch fabric and/or a time-division multiplexed switch fabric. In some embodiments, the switch fabric 1600 can be defined based on a Clos network architecture (e.g., a strict-sense non-blocking Clos network, a Benes network).

FIG. 17 is a schematic block diagram that illustrates two cell groups queued at an input queue 1720 disposed on an input side of a switch fabric 1700, according to an embodiment. The cell groups are defined by an input scheduling module 1740 on the input side of the switch fabric 1700, which can be, for example, associated with and/or included in a switch core such as that shown in FIG. 16A. The input queue 1720 is also on the input side of the switch fabric 1700. In some embodiments, the input queue 1720 can be included in an input line card (not shown) associated with the switch fabric 1700. Although not shown, in some embodiments, one or more of the cell groups can include many cells (e.g., 25 cells, 10 cells, 100 cells) or only one cell.

As shown in FIG. 17, the input queue 1720 includes cells 1 through T (i.e., cell 1 through cell T), which can collectively be referred to as queued cells 1710. The input queue 1720 is a FIFO type queue with cell 1 at the front end 1724 (or transmission end) of the queue and cell T at the back end 1722 (or entry end) of the queue. As shown in FIG. 17, the queued cells 1710 at the input queue 1720 include a first cell group 1712 and a second cell group 1716. In some embodiments, each cell from the queued cells 1710 can have an equal length (e.g., 32-bit length, 64-bit length). In some embodiments, two or more of the queued cells 1710 can have different lengths.

Each cell from the queued cells 1710 has content queued for transmission to one of four output ports 1770 (output port E, output port F, output port G, or output port H), as indicated by the output port label (e.g., letter "E", letter "F") on each cell from the queued cells 1710. The output port 1770 to which a cell is to be sent can be referred to as a destination port. Each of the queued cells 1710 can be sent to its respective destination port via the switch fabric 1700. In some embodiments, the input scheduling module 1740 can be configured to determine the destination port for each cell from the queued cells 1710 based on, for example, a look-up table (LUT) such as a routing table. In some embodiments, the destination port of each cell from the queued cells 1710 can be determined based on the destination of the content (e.g., data) included in the cell. In some embodiments, one or more of the output ports 1770 can be associated with an output queue in which cells can be queued until they are sent via the output ports 1770.

The first cell group 1712 and the second cell group 1716 can be defined by the input scheduling module 1740 based on the destination ports of the queued cells 1710. As shown in FIG. 17, each cell included in the first cell group 1712 has the same destination port (i.e., output port E), as indicated by the output port label "E". Similarly, each cell included in the second cell group 1716 has the same destination port (i.e., output port F), as indicated by the output port label "F".

Cell groups (e.g., the first cell group 1712) are defined based on destination port because a cell group is sent via the switch fabric 1700 as a group. For example, if cell 1 were included in the first cell group 1712, the first cell group 1712 could not be sent to a single destination port because cell 1 has a different destination port (output port "F") than cell 2 through cell 7 (output port "E"). The first cell group 1712 therefore could not be delivered via the switch fabric 1700 as a group.

Cell groups are defined as contiguous blocks of cells because a cell group is sent via the switch fabric 1700 as a group and because the input queue 1720 is a FIFO type queue. For example, cell 12 and cell 2 through cell 7 cannot be defined as a cell group because cell 12 cannot be sent together with the block of cells from cell 2 through cell 7. Cell 8 through cell 11 are intervening cells that must be sent from the input queue 1720 after cell 2 through cell 7 are sent from the input queue 1720, but before cell 12 is sent from the input queue 1720. In some embodiments, if the input queue 1720 is not a FIFO type queue, one or more of the queued cells 1710 can be sent out of order and a group can span intervening cells.
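
The two grouping constraints, a single destination port per group and contiguity within a FIFO queue, amount to splitting the queue into runs of consecutive cells that share a destination. The sketch below illustrates this under assumptions: the function name is hypothetical, each contiguous run is treated as one candidate group, and destination port G for cells 8 through 11 is an assumed value, since the text does not state their destination.

```python
from itertools import groupby


def define_cell_groups(queued_cells):
    """Split a FIFO queue into contiguous runs that share a destination port.

    queued_cells: list of (cell_number, destination_port), front of queue first.
    Returns a list of (destination_port, [cell_numbers]) in queue order.
    """
    groups = []
    # groupby only merges adjacent items, which matches the FIFO constraint:
    # cells separated by intervening cells never land in the same group.
    for port, run in groupby(queued_cells, key=lambda cell: cell[1]):
        groups.append((port, [number for number, _ in run]))
    return groups


# Cell 1 -> F, cells 2-7 -> E, cells 8-11 -> G (assumed), cell 12 -> E.
queue = ([(1, "F")]
         + [(n, "E") for n in range(2, 8)]
         + [(n, "G") for n in range(8, 12)]
         + [(12, "E")])
groups = define_cell_groups(queue)
```

With this input, cells 2 through 7 form one group destined for port E, while cell 12 falls into a separate group even though it shares that destination, because cells 8 through 11 intervene.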

Although not shown, each cell from the queued cells 1710 can have a sequence value that can be referred to as a cell sequence value. The cell sequence value can represent the order of, for example, cell 2 with respect to cell 3. The cell sequence value can be used to re-order the cells at, for example, one or more of the output ports 1770 before the content associated with the cells is sent from the output ports 1770. For example, in some embodiments, cell group 1712 can be received at an output queue (not shown) associated with output port E and re-ordered based on cell sequence values. In some embodiments, the output queue can be relatively small (e.g., a shallow output queue) compared with the input queue 1720.

In addition, the data (e.g., data packets) included within the cells can also have a sequence value, which can be referred to as a data sequence value. The data sequence value can represent, for example, the relative order of a first data packet with respect to a second data packet. The data sequence value can be used to re-order the data packets at, for example, one or more of the output ports 1770 before the data packets are sent from the output ports 1770.

FIG. 18 is a schematic block diagram that illustrates two cell groups queued at an input queue 1820 disposed on an input side of a switch fabric 1800, according to another embodiment. The cell groups are defined by an input scheduling module 1840 on the input side of the switch fabric 1800, which can be, for example, associated with and/or included in a switch core such as that shown in FIG. 16A. The input queue 1820 is also on the input side of the switch fabric 1800. In some embodiments, the input queue 1820 can be included in an input line card (not shown) associated with the switch fabric 1800. Although not shown, in some embodiments, one or more of the cell groups can include only one cell.

As shown in FIG. 18, the input queue 1820 includes cells 1 through Z (i.e., cell 1 through cell Z), which can collectively be referred to as queued cells 1810. The input queue 1820 is a FIFO type queue with cell 1 at the front end 1824 (or transmission end) of the queue and cell Z at the back end 1822 (or entry end) of the queue. As shown in FIG. 18, the queued cells 1810 at the input queue 1820 include a first cell group 1812 and a second cell group 1816. In some embodiments, each cell from the queued cells 1810 can have an equal length (e.g., 32-bit length, 64-bit length). In some embodiments, two or more of the queued cells 1810 can have different lengths. In this embodiment, the input queue 1820 is mapped to output port F2 so that all of the cells 1810 are scheduled by the input scheduling module 1840 for transmission via the switch fabric 1800 to output port F2.

Each of queued cells 1810 has content associated with one or more data packets (e.g., Ethernet data packets). The data packets are represented by the letters "Q" through "Y". For example, as shown in FIG. 18, data packet R is segmented into three different cells: cell 2, cell 3, and cell 4.

The cell groups (e.g., first cell group 1812) are defined so that portions of a data packet are not associated with different cell groups. In other words, the cell groups are defined so that each data packet is associated, in its entirety, with a single cell group. The boundaries of the cell groups are defined based on the boundaries of the data packets queued at input queue 1820, so that no data packet is split across cell groups. Splitting a data packet across cell groups could lead to undesirable results, such as buffering on the output side of switch fabric 1800. For example, if a first portion of data packet T (e.g., cell 6) were included in first cell group 1812 and a second portion of data packet T (e.g., cell 7) were included in second cell group 1816, the first portion of data packet T would have to be buffered in at least a portion of one or more output queues (not shown) on the output side of switch fabric 1800 until the second portion of data packet T was sent to the output side of switch fabric 1800, so that the entire data packet T could be sent from switch fabric 1800 via output port E2.
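The boundary rule described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `define_cell_groups`, the `max_cells` limit, and the simplification that each cell carries content of exactly one packet are assumptions made for illustration (the text allows a cell to carry content of more than one packet). Only the mapping of packet R to cells 2 through 4 and of packet T to cells 6 and 7 comes from the description; the other cell-to-packet assignments are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, str]  # (cell number, packet label); one packet per cell is an assumption


def define_cell_groups(cells: List[Cell], max_cells: int) -> List[List[Cell]]:
    """Split a FIFO list of cells into groups whose boundaries fall on packet boundaries."""
    # Step 1: collect consecutive cells that belong to the same packet.
    packets: List[List[Cell]] = []
    for cell in cells:
        if packets and packets[-1][-1][1] == cell[1]:
            packets[-1].append(cell)
        else:
            packets.append([cell])

    # Step 2: pack whole packets into groups; a packet is never split.
    groups: List[List[Cell]] = []
    current: List[Cell] = []
    for packet_cells in packets:
        # Close the current group if this packet would push it past the (hypothetical) limit.
        if current and len(current) + len(packet_cells) > max_cells:
            groups.append(current)
            current = []
        # A packet longer than max_cells still stays whole, in its own oversized group.
        current.extend(packet_cells)
    if current:
        groups.append(current)
    return groups


queued = [(1, "Q"), (2, "R"), (3, "R"), (4, "R"), (5, "S"), (6, "T"), (7, "T")]
groups = define_cell_groups(queued, max_cells=4)
```

With this input, cells 6 and 7 of packet T land in the same group, so no portion of T has to wait in an output queue for the rest of the packet.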

In some embodiments, the data packets included in queued cells 1810 can also have sequence values, referred to as data sequence values. A data sequence value can represent, for example, the relative order of data packet R with respect to data packet S. The data sequence values can be used to reassemble the data packets at, for example, one or more output ports 1870 before the data packets are sent from output ports 1870.

FIG. 19 is a flowchart illustrating a method for scheduling transmission of a cell group via a switch fabric, according to one embodiment. As shown in FIG. 19, at 1900, an indicator that cells are queued at an input queue for transmission via the switch fabric is received. In some embodiments, the switch fabric can be based on a Clos architecture and can have multiple stages. In some embodiments, the switch fabric can be associated with (e.g., located within) a switch core. In some embodiments, the indicator can be received when a new cell is received at the input queue, or when a cell is ready (or nearly ready) to be sent via the switch fabric.

At 1910, a cell group having a common destination is defined from the cells queued at the input queue. The destination of each cell in the cell group is determined based on a lookup table. In some embodiments, the destination is determined based on a policy and/or a packet classification algorithm. In some embodiments, the common destination can be a common destination port associated with the input portion of the switch fabric.

At 1920, a request tag is associated with the cell group. The request tag can include, for example, one or more of a cell quantity value, a destination identifier, a queue identifier, a queue sequence value, and so on. The request tag can be associated with the cell group before the cell group is sent to the input side of the switch fabric.

At 1930, a transmission request including the request tag is sent to an output scheduling module. In some embodiments, the transmission request can include a request to transmit at a particular time or via a particular transmission path. In some embodiments, the transmission request can be sent after the cell group has been stored in a memory associated with an input stage of the switch fabric. In some embodiments, the cell group can be moved to the memory to reduce the likelihood of congestion at the input queue. In other words, the cell group can be moved to the memory so that other cells queued behind the cell group can be prepared for transmission (or transmitted) from the input queue without waiting for the cell group to be sent from the input queue. In some embodiments, the transmission request can be a request to transmit to a particular output port (e.g., a particular destination port).

At 1950, when transmission via the switch fabric is not authorized at 1940 in response to the transmission request, a transmission denial including a response tag is sent to the input scheduling module. In some embodiments, the transmission request can be denied because the switch fabric is congested, the destination port is unavailable, and so on. In some embodiments, the transmission request can be denied for a specified period of time. In some embodiments, the response tag can include one or more identifiers that can be used to associate the transmission denial with the cell group.

If transmission via the switch fabric is authorized at 1940, then at 1960 a transmission response including a response tag is sent to the input scheduling module. In some embodiments, the transmission response can be a transmission authorization. In some embodiments, the transmission response can be sent after the destination of the cell group is ready (or nearly ready) to receive the cell group.

At 1970, the cell group is retrieved based on the response tag. If the cell group has been moved to a memory, the cell group can be retrieved from the memory. If the cell group is queued at the input queue, the cell group can be retrieved from the input queue. The cell group can be retrieved based on a queue identifier and/or a queue sequence value included in the response tag. The queue identifier and/or queue sequence value can come from the queue tag.

At 1980, the cell group can be sent via the switch fabric. The cell group can be sent via the switch fabric according to instructions included in the transmission response. In some embodiments, the cell group can be sent at a particular time and/or via a particular transmission path. In some embodiments, the cell group can be sent via the switch fabric to a destination such as an output port. In some embodiments, after being sent via the switch fabric, the cell group can be queued at an output queue associated with the destination (e.g., destination port) of the cell group.
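The request, authorization, and retrieval steps of FIG. 19 can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the method as claimed: the class names `Tag`, `InputScheduler`, and `OutputScheduler`, the string values `"grant"` and `"deny"`, the idea that the response tag simply echoes the request tag, and the availability check based on a set of ports are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Tag:
    """Fields the description lists for a request tag; reused here as the response tag."""
    queue_id: str
    queue_sequence: int
    destination_id: str
    cell_count: int


class OutputScheduler:
    def __init__(self, available_ports: Set[str]) -> None:
        self.available_ports = available_ports  # hypothetical availability model

    def respond(self, tag: Tag) -> Tuple[str, Tag]:
        # 1940: authorize only if the destination port can accept the cell group.
        kind = "grant" if tag.destination_id in self.available_ports else "deny"
        return kind, tag  # 1950 (denial) or 1960 (response), each carrying a tag


class InputScheduler:
    def __init__(self) -> None:
        # Cell groups moved out of the input queue, keyed by (queue id, queue sequence value).
        self.stored: Dict[Tuple[str, int], List[str]] = {}

    def request(self, queue_id: str, queue_sequence: int,
                destination_id: str, cells: List[str]) -> Tag:
        # 1920/1930: associate a request tag with the cell group and store the group.
        self.stored[(queue_id, queue_sequence)] = list(cells)
        return Tag(queue_id, queue_sequence, destination_id, len(cells))

    def on_response(self, kind: str, tag: Tag) -> Optional[List[str]]:
        # 1970: retrieve the cell group using identifiers carried in the response tag.
        if kind == "grant":
            return self.stored.pop((tag.queue_id, tag.queue_sequence))
        return None  # denied: the group stays stored for a later attempt (an assumption)


inp, out = InputScheduler(), OutputScheduler({"EP1"})
granted = inp.on_response(*out.respond(inp.request("IQ1", 0, "EP1", ["c1", "c2"])))
denied = inp.on_response(*out.respond(inp.request("IQ1", 1, "EP9", ["c3"])))
```

Keying the stored groups by queue identifier and queue sequence value mirrors the statement that a cell group can be retrieved based on those two values in the response tag.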

FIG. 20 is a signaling flow diagram illustrating processing of request sequence values associated with transmission requests, according to one embodiment. As shown in FIG. 20, transmission request 52 is sent from an input scheduling module 2020 on the input side of a switch fabric to an output scheduling module 2030 on the output side of the switch fabric. Transmission request 56 is sent from input scheduling module 2020 to output scheduling module 2030 after transmission request 52 is sent. As shown in FIG. 20, transmission request 54 is sent from input scheduling module 2020 but is not received by output scheduling module 2030. Transmission request 52, transmission request 54, and transmission request 56 are each associated with the same input queue IQ1, as indicated by their respective queue identifiers, and with the same destination port EP1, as indicated by their respective destination identifiers. Transmission request 52, transmission request 54, and transmission request 56 can collectively be referred to as transmission requests 58. As shown in FIG. 20, time increases in the downward direction.

As shown in FIG. 20, each of transmission requests 58 can include a request sequence value (SV). A request sequence value can represent the order of a transmission request relative to other transmission requests. In this embodiment, the request sequence values come from a range of request sequence values associated with destination port EP1 and are incremented in whole integers in numerical order. In some embodiments, the request sequence values can be, for example, strings, and can be incremented in a different order (e.g., reverse numerical order). Transmission request 52 includes request sequence value 5200, transmission request 54 includes request sequence value 5201, and transmission request 56 includes request sequence value 5202. In this embodiment, request sequence value 5200 indicates that transmission request 52 was defined and sent before transmission request 54, which has request sequence value 5201.

Output scheduling module 2030 can determine, based on the request sequence values, that transmission of a transmission request from input scheduling module 2020 may have failed. Specifically, output scheduling module 2030 can determine that the transmission request associated with request sequence value 5201 was not received before transmission request 56, which is associated with request sequence value 5202, was received. In some embodiments, output scheduling module 2030 can take action regarding the missing transmission request 54 when the time period between receipt of transmission request 52 and receipt of transmission request 56 (shown as time period 2040) exceeds a threshold time period. In some embodiments, output scheduling module 2030 can ask input scheduling module 2020 to retransmit transmission request 54. Output scheduling module 2030 can include the missing request sequence value in that request, so that input scheduling module 2020 can identify that transmission request 54 was not received. In some embodiments, output scheduling module 2030 can deny the request, included in transmission request 56, to transmit a cell group. In some embodiments, output scheduling module 2030 can be configured to process and/or respond to transmission requests (e.g., transmission requests 58) based on queue sequence values, in a manner substantially similar to the method described in connection with request sequence values.
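The gap detection described above can be sketched as follows. This is a minimal Python illustration, assuming integer request sequence values that increase by one per request; the class name `SequenceTracker` and its methods are hypothetical, and the threshold time period 2040 is left out for brevity.

```python
from typing import List, Set


class SequenceTracker:
    """Tracks sequence values seen for one destination port and reports gaps."""

    def __init__(self, first_expected: int) -> None:
        self.expected = first_expected   # next sequence value expected in order
        self.missing: Set[int] = set()   # values skipped so far and not yet received

    def observe(self, sequence_value: int) -> List[int]:
        """Record a received value and return any newly detected missing values."""
        if sequence_value in self.missing:
            # A late arrival or retransmission fills a previously detected gap.
            self.missing.discard(sequence_value)
            return []
        # Every value between the expected one and the received one was skipped.
        gap = list(range(self.expected, sequence_value))
        self.missing.update(gap)
        self.expected = max(self.expected, sequence_value + 1)
        return gap


tracker = SequenceTracker(first_expected=5200)
after_52 = tracker.observe(5200)   # transmission request 52 arrives
after_56 = tracker.observe(5202)   # transmission request 56 arrives; 5201 was skipped
```

The list returned for transmission request 56 is what a receiver could place in a retransmission request so that the sender can identify which transmission request was lost.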

FIG. 21 is a signaling flow diagram illustrating response sequence values associated with transmission responses, according to one embodiment. As shown in FIG. 21, transmission response 62 is sent from an output scheduling module 2130 on the output side of a switch fabric to an input scheduling module 2120 on the input side of the switch fabric. Transmission response 66 is sent from output scheduling module 2130 to input scheduling module 2120 after transmission response 62 is sent. As shown in FIG. 21, transmission response 64 is sent from output scheduling module 2130 but is not received by input scheduling module 2120. Transmission response 62, transmission response 64, and transmission response 66 are associated with the same input queue IQ2, as indicated by their respective queue identifiers. Transmission response 62, transmission response 64, and transmission response 66 can collectively be referred to as transmission responses 68. As shown in FIG. 21, time increases in the downward direction.

As shown in FIG. 21, each of transmission responses 68 can include a response sequence value (SV). A response sequence value can represent the order of a transmission response relative to other transmission responses. In this embodiment, the response sequence values come from a range of response sequence values associated with input queue IQ2 and are incremented in whole integers in numerical order. In some embodiments, the response sequence values can be, for example, strings, and can be incremented in a different order (e.g., reverse numerical order). Transmission response 62 includes response sequence value 5300, transmission response 64 includes response sequence value 5301, and transmission response 66 includes response sequence value 5302. In this embodiment, response sequence value 5300 indicates that transmission response 62 was defined and sent before transmission response 64, which has response sequence value 5301.

Input scheduling module 2120 can determine, based on the response sequence values, that transmission of a transmission response from output scheduling module 2130 may have failed. Specifically, input scheduling module 2120 can determine that the transmission response associated with response sequence value 5301 was not received before transmission response 66, which is associated with response sequence value 5302, was received. In some embodiments, input scheduling module 2120 can take action regarding the missing transmission response 64 when the time period between receipt of transmission response 62 and receipt of transmission response 66 (shown as time period 2140) exceeds a threshold time period. In some embodiments, input scheduling module 2120 can ask output scheduling module 2130 to retransmit transmission response 64. Input scheduling module 2120 can include the missing response sequence value in that request, so that output scheduling module 2130 can identify that transmission response 64 was not received. In some embodiments, input scheduling module 2120 can discard a cell group when the transmission response associated with its transmission request is not received within a specified time period.

FIG. 22 is a schematic block diagram illustrating multiple stages of flow-controllable queues, according to one embodiment. As shown in FIG. 22, the transmit side of a first stage of queues 2210 and the transmit side of a second stage of queues 2220 are included in a source entity 2230 on the transmit side of a physical link 2200. The receive side of the first stage of queues 2210 and the receive side of the second stage of queues 2220 are included in a destination entity 2240 on the receive side of physical link 2200. Source entity 2230 and/or destination entity 2240 can be any type of computing device (e.g., a portion of a switch core, a peripheral processing device) that can be configured to receive and/or send data via physical link 2200. In some embodiments, source entity 2230 and/or destination entity 2240 can be associated with a data center.

As shown in FIG. 22, the first stage of queues 2210 includes transmit queues A1 through A4 on the transmit side of physical link 2200 (referred to as first-stage transmit queues 2234) and receive queues D1 through D4 on the receive side of physical link 2200 (referred to as first-stage receive queues 2244). The second stage of queues 2220 includes transmit queues B1 and B2 on the transmit side of physical link 2200 (referred to as second-stage transmit queues 2232) and receive queues C1 and C2 on the receive side of physical link 2200 (referred to as second-stage receive queues 2242).

The flow of data via physical link 2200 can be controlled (e.g., modified, paused) based on flow control signaling associated with flow control loops between source entity 2230 and destination entity 2240. For example, data sent from source entity 2230 on the transmit side of physical link 2200 can be received at destination entity 2240 on the receive side of physical link 2200. When destination entity 2240 is unavailable to receive data from source entity 2230 via physical link 2200, a flow control signal can be defined at destination entity 2240 and/or sent from destination entity 2240 to source entity 2230. The flow control signal can be configured to trigger source entity 2230 to modify the flow of data from source entity 2230 to destination entity 2240.

For example, if receive queue D2 is unavailable to process data sent from transmit queue A1, destination entity 2240 can be configured to send a flow control signal associated with a flow control loop to source entity 2230. The flow control signal can be configured to trigger a pause in the transmission of data from transmit queue A1 to receive queue D2 via a transmission path that includes at least a portion of the second stage of queues 2220 and physical link 2200. In some embodiments, receive queue D2 can be unavailable when, for example, it is too full to receive data. In some embodiments, receive queue D2 can change from an available state to an unavailable state (e.g., a congested state) in response to data previously received from transmit queue A1. In some embodiments, transmit queue A1 can be referred to as the target of the flow control signal. Transmit queue A1 can be identified within the flow control signal based on a queue identifier associated with transmit queue A1. In some embodiments, the flow control signal can be referred to as a feedback signal.

In this embodiment, one flow control loop is associated with physical link 2200 (referred to as the physical link control loop), one flow control loop is associated with the first stage of queues 2210 (referred to as the first-stage control loop), and one flow control loop is associated with the second stage of queues 2220 (referred to as the second-stage control loop). Specifically, the physical link control loop is associated with a transmission path that includes physical link 2200 and excludes the first stage of queues 2210 and the second stage of queues 2220. The flow of data via physical link 2200 can be turned on and off based on flow control signaling associated with the physical link control loop.

The first-stage control loop can be based on transmission of data from at least one of transmit queues 2234 within the first stage of queues 2210 and on a flow control signal defined based on the availability (e.g., an indicator of availability) of at least one of receive queues 2244 within the first stage of queues 2210. The first-stage control loop can therefore be referred to as being associated with the first stage of queues 2210. The first-stage control loop can be associated with a transmission path that includes physical link 2200, at least a portion of the second stage of queues 2220, and at least a portion of the first stage of queues 2210. Flow control signaling associated with the first-stage control loop can trigger control of the flow of data from transmit queues 2234 associated with the first stage of queues 2210.

The second-stage control loop can be associated with a transmission path that includes physical link 2200 and at least a portion of the second stage of queues 2220, but excludes the first stage of queues 2210. The second-stage control loop can be based on transmission of data from at least one of transmit queues 2232 within the second stage of queues 2220 and on a flow control signal defined based on the availability (e.g., an indicator of availability) of at least one of receive queues 2242 within the second stage of queues 2220. The second-stage control loop can therefore be referred to as being associated with the second stage of queues 2220. Flow control signaling associated with the second-stage control loop can trigger control of the flow of data from transmit queues 2232 associated with the second stage of queues 2220.

In this embodiment, the flow control loop associated with the second stage of queues 2220 is a priority-based flow control loop. Specifically, each of second-stage transmit queues 2232 is paired with one of second-stage receive queues 2242, and each queue pair is associated with a service class (also referred to as a class of service or quality of service). In this embodiment, second-stage transmit queue B1 and second-stage receive queue C1 define a queue pair associated with service class X. Second-stage transmit queue B2 and second-stage receive queue C2 define a queue pair associated with service class Y. In some embodiments, different types of network traffic can be associated with different service classes (i.e., different priorities). For example, storage traffic (e.g., read and write traffic), inter-processor communication, media signaling, session-layer signaling, and so on can each be associated with at least one service class. In some embodiments, the second-stage control loop can be based on, for example, the Institute of Electrical and Electronics Engineers (IEEE) 802.1Qbb protocol, which defines a priority-based flow control strategy.

The flow of data via transmission path 74, shown in FIG. 22, can be controlled using at least one of the control loops. Transmission path 74 includes first-stage transmit queue A2, second-stage transmit queue B1, physical link 2200, second-stage receive queue C1, and first-stage receive queue D3. However, a change in the flow of data via a queue in one stage of transmission path 74, based on the flow control loop associated with that stage, can affect the flow of data through another stage of transmission path 74. Flow control at one stage can affect the flow of data at another stage because the queues within source entity 2230 (e.g., transmit queues 2232, transmit queues 2234) and the queues within destination entity 2240 (e.g., receive queues 2242, receive queues 2244) are arranged in stages. In other words, flow control based on one flow control loop can affect the flow of data via elements associated with a different flow control loop.

For example, the flow of data from first-stage transmit queue A2 via transmission path 74 to first-stage receive queue D3 can be modified based on one or more of the control loops: the first-stage control loop, the second-stage control loop, and/or the physical link control loop. A pause in the flow of data to first-stage receive queue D3 can be triggered when first-stage receive queue D3 changes from an available state to an unavailable state (e.g., a congested state).

If the flow of data to first-stage receive queue D3 is associated with service class X, the flow of data via second-stage transmit queue B1 and second-stage receive queue C1 (which define the queue pair associated with service class X) can be paused based on flow control signaling associated with the second-stage control loop (which is a priority-based control loop). However, pausing transmission of data via the queue pair associated with service class X can pause transmission of data from every transmit queue that feeds second-stage transmit queue B1. Specifically, it can pause transmission of data not only from first-stage transmit queue A2 but also from first-stage transmit queue A1. In other words, the flow of data from first-stage transmit queue A1 is affected indirectly, as a side effect. In some embodiments, data received at transmit queue A1 and data received at transmit queue A2 can be associated with the same service class X, yet the data received at transmit queue A1 and the data received at transmit queue A2 can come from, for example, different (e.g., independent) network devices (not shown), such as peripheral processing devices, which can be associated with different service classes.

The flow of data to first-stage receive queue D3 can also be paused by specifically pausing transmission of data from first-stage transmit queue A2, based on flow control signaling associated with the first-stage control loop. Because transmission from first-stage transmit queue A2 is paused directly, transmission of data from first-stage transmit queue A1 need not be interrupted. In other words, first-stage transmit queue A2 can be flow-controlled directly, based on a flow control signal associated with the first-stage control loop, without pausing transmission of data from other first-stage transmit queues such as first-stage transmit queue A1.

The flow of data to first-stage receive queue D3 can also be controlled by pausing transmission of data via physical link 2200, based on flow control signaling associated with the physical link control loop. However, pausing transmission via physical link 2200 pauses all transmission of data via physical link 2200.
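The three options just described differ in how many first-stage transmit queues they stall. The Python sketch below models that difference under stated assumptions: the `FEEDS` mapping and the function `blocked_queues` are hypothetical names, and only the fact that A1 and A2 both feed second-stage transmit queue B1 is taken from the description; the assignment of A3 and A4 to B2 is assumed for illustration.

```python
from typing import Dict, List, Set

# Which second-stage transmit queue each first-stage transmit queue feeds.
# A1 -> B1 and A2 -> B1 follow the description; A3 -> B2 and A4 -> B2 are assumptions.
FEEDS: Dict[str, str] = {"A1": "B1", "A2": "B1", "A3": "B2", "A4": "B2"}


def blocked_queues(paused_first_stage: Set[str],
                   paused_second_stage: Set[str],
                   link_paused: bool) -> List[str]:
    """Return the first-stage transmit queues that cannot currently send data."""
    blocked = []
    for queue, second_stage_queue in FEEDS.items():
        if (link_paused                                   # physical link control loop
                or second_stage_queue in paused_second_stage  # second-stage control loop
                or queue in paused_first_stage):              # first-stage control loop
            blocked.append(queue)
    return sorted(blocked)


# Pausing only A2 through the first-stage control loop leaves A1 running.
first_stage_pause = blocked_queues({"A2"}, set(), link_paused=False)
# Pausing the service class X queue pair (B1/C1) also stalls A1 as a side effect.
second_stage_pause = blocked_queues(set(), {"B1"}, link_paused=False)
# Pausing the physical link stalls every transmit queue.
link_pause = blocked_queues(set(), set(), link_paused=True)
```

The outer (first-stage) loop gives the finest control, while loops closer to the physical link act on progressively larger sets of queues.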

The queues on the transmit side of physical link 2200 can be referred to as transmit queues 2236, and the queues on the receive side of the physical link can be referred to as receive queues 2246. In some embodiments, transmit queues 2236 can also be referred to as source queues, and receive queues 2246 can be referred to as destination queues. Although not shown, in some embodiments one or more of transmit queues 2236 can be included in one or more interface cards associated with source entity 2230, and one or more of receive queues 2246 can be included in one or more interface cards associated with destination entity 2240.

When source entity 2230 sends data via physical link 2200, source entity 2230 can be referred to as a transmitter on the transmit side of physical link 2200. Destination entity 2240 can be configured to receive the data and can be referred to as a receiver on the receive side of physical link 2200. Although not shown, in some embodiments source entity 2230 (and associated elements, e.g., transmit queues 2236) can be configured to operate as a destination entity (e.g., a receiver), and destination entity 2240 (and associated elements, e.g., receive queues 2246) can be configured to operate as a source entity (e.g., a transmitter). In addition, physical link 2200 can operate as a bidirectional link.

In some embodiments, physical link 2200 can be a tangible link, such as an optical link (e.g., fiber-optic cable, plastic fiber-optic cable), a cable link (e.g., copper-based wire), a twisted-pair link (e.g., Category 5 cable), and so on. In some embodiments, physical link 2200 can be a wireless link. Transmission of data via physical link 2200 can be defined based on, for example, an Ethernet protocol, a wireless protocol, a Fibre Channel protocol, a Fibre Channel over Ethernet protocol, an InfiniBand-related protocol, and/or the like.

In some embodiments, the second-stage control loop can be said to be nested within the first-stage control loop, because the second-stage queues 2220 associated with the second-stage control loop are disposed inside the first-stage queues 2210 associated with the first-stage control loop. Similarly, the physical link control loop can be said to be nested within the second-stage control loop. In some embodiments, the second-stage control loop can be referred to as an inner control loop, and the first-stage control loop can be referred to as an outer control loop.

Figure 23 is a schematic block diagram that illustrates multiple stages of flow-controllable queues, according to an embodiment. As shown in Figure 23, a transmit side of first-stage queues 2310 and a transmit side of second-stage queues 2320 are included in a source entity 2330 disposed on the transmit side of physical link 2300. A receive side of first-stage queues 2310 and a receive side of second-stage queues 2320 are included in a destination entity 2340 disposed on the receive side of physical link 2300. The queues on the transmit side of physical link 2300 can collectively be referred to as transmit queues 2336, and the queues on the receive side of the physical link can collectively be referred to as receive queues 2346. Although not shown, in some embodiments source entity 2330 can be configured to function as a destination entity, and destination entity 2340 can be configured to function as a source entity (e.g., a transmitter). Moreover, physical link 2300 can function as a bidirectional link.

As shown in Figure 23, source entity 2330 communicates with destination entity 2340 via physical link 2300. Source entity 2330 has a queue QP1 configured to buffer data (if necessary) before the data is transmitted via physical link 2300, and destination entity 2340 has a queue QP2 configured to buffer (if necessary) data received via physical link 2300 before the data is distributed at destination entity 2340. In some embodiments, the flow of data via physical link 2300 can be handled without buffering at queue QP1 and queue QP2.

Transmit queues QA1 through QAN, which are included in first-stage queues 2310, can each be referred to as a first-stage transmit queue and can collectively be referred to as transmit queues 2334 (or queues 2334). Transmit queues QB1 through QBM, which are included in second-stage queues 2320, can each be referred to as a second-stage transmit queue and can collectively be referred to as transmit queues 2332 (or queues 2332). Receive queues QD1 through QDR, which are included in first-stage queues 2310, can each be referred to as a first-stage receive queue and can collectively be referred to as receive queues 2344 (or queues 2344). Receive queues QC1 through QCM, which are included in second-stage queues 2320, can each be referred to as a second-stage receive queue and can collectively be referred to as receive queues 2342 (or queues 2342).

As shown in Figure 23, each queue from second-stage queues 2320 is disposed within a transmission path between physical link 2300 and at least one queue from first-stage queues 2310. For example, a portion of a transmission path can be defined by first-stage receive queue QD4, second-stage receive queue QC1, and physical link 2300. Second-stage receive queue QC1 is disposed within the transmission path between first-stage receive queue QD4 and physical link 2300.

In this embodiment, a physical link control loop is associated with physical link 2300, a first-stage control loop is associated with first-stage queues 2310, and a second-stage control loop is associated with second-stage queues 2320. In some embodiments, the second-stage control loop can be a priority-based control loop. In some embodiments, the physical link control loop includes physical link 2300, queue QP1, and queue QP2.

Flow control signals can be defined at and/or transmitted between a source control module 2370 at source entity 2330 and a destination control module 2380 at destination entity 2340. In some embodiments, source control module 2370 can be referred to as a source flow control module, and destination control module 2380 can be referred to as a destination flow control module. For example, destination control module 2380 can be configured to send a flow control signal to source control module 2370 when one or more of receive queues 2346 (e.g., receive queue QD2) at destination entity 2340 is unavailable to receive data. The flow control signal can be configured to trigger source control module 2370 to, for example, pause the flow of data from one or more of transmit queues 2336 to the one or more receive queues 2346.

Before data is transmitted, source control module 2370 associates a queue identifier with the data queued at a transmit queue from transmit queues 2336. The queue identifier can represent and/or can be used to identify the transmit queue at which the data is queued. For example, when a data packet is queued at first-stage transmit queue QA4, a queue identifier that uniquely identifies first-stage transmit queue QA4 can be appended to the data packet or included in a field (e.g., a header, a trailer, a payload) within the data packet. In some embodiments, the queue identifier can be associated with the data at source control module 2370, or the association can be triggered by source control module 2370. In some embodiments, the queue identifier can be associated with the data just before the data is transmitted, or after the data has been transmitted from one of transmit queues 2336.

The queue identifier can be associated with data transmitted from the transmit side of physical link 2300 to the receive side of physical link 2300 so that the source of the data (e.g., the source queue) can be identified. Accordingly, a flow control signal can be defined to pause transmission from one or more of transmit queues 2336 based on the queue identifier. For example, a queue identifier associated with first-stage transmit queue QAN can be included in a data packet transmitted from first-stage transmit queue QAN to first-stage receive queue QD3. If, after receiving the data packet, first-stage receive queue QD3 is unable to receive another data packet from first-stage transmit queue QAN, a flow control signal requesting that first-stage transmit queue QAN pause transmission of additional data packets to first-stage receive queue QD3 can be defined based on the queue identifier associated with first-stage transmit queue QAN. The queue identifier can be parsed from the data packet by destination control module 2380 and used by destination control module 2380 to define the flow control signal.
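The tagging and signal-definition steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Packet` and `FlowControlSignal` types, the `queue_id` field, and the function names are hypothetical and do not come from the specification, which leaves the packet format and signal encoding open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    payload: bytes
    queue_id: Optional[str] = None  # hypothetical header field for the queue identifier


@dataclass
class FlowControlSignal:
    target_queue_id: str  # transmit queue asked to pause
    action: str           # only "pause" is modeled in this sketch


def tag(packet: Packet, queue_id: str) -> Packet:
    """Source side: associate the transmit queue's identifier with the packet."""
    packet.queue_id = queue_id
    return packet


def define_signal(packet: Packet, receive_queue_available: bool) -> Optional[FlowControlSignal]:
    """Destination side: parse the identifier and, if the receive queue cannot
    accept more data, define a pause signal addressed to the source queue."""
    if receive_queue_available:
        return None
    if packet.queue_id is None:
        # Without an identifier the source queue is unknown, so no signal
        # can be addressed to it.
        return None
    return FlowControlSignal(target_queue_id=packet.queue_id, action="pause")
```

For instance, a packet tagged with the identifier of QAN that arrives while QD3 is unavailable yields a pause signal addressed to QAN, whereas an untagged packet yields no signal.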

In some embodiments, transmission of data from several of transmit queues 2336 (e.g., first-stage transmit queues 2334) to first-stage receive queue QDR can be paused in response to first-stage receive queue QDR changing from an available state to an unavailable state. Each of the several transmit queues 2336 can be identified within a flow control signal based on its respective queue identifier.

In some embodiments, one or more of transmit queues 2336 and/or one or more of receive queues 2346 can be a virtual queue (e.g., a logically defined group of queues). Accordingly, a queue identifier can be associated with (e.g., can represent) the virtual queue. In some embodiments, a queue identifier can be associated with a queue from the set of queues that defines a virtual queue. In some embodiments, each queue identifier from the set of queue identifiers associated with physical link 2300 can be unique. For example, each of transmit queues 2336 associated with physical link 2300 (e.g., associated with a hop) can be associated with a unique queue identifier.

In some embodiments, source control module 2370 can be configured to associate a queue identifier with only a specified subset of transmit queues 2336 and/or with only a subset of the data queued at one of transmit queues 2336. For example, if data is transmitted from first-stage transmit queue QA2 to first-stage receive queue QD1 without a queue identifier, a flow control signal configured to request that transmission of data from first-stage transmit queue QA2 be paused may not be defined, because the source of the data is not known. Accordingly, a transmit queue from transmit queues 2336 can be exempted from flow control by not associating a queue identifier with (e.g., by omitting the queue identifier from) data transmitted from that transmit queue.

In some embodiments, the unavailability of one or more of receive queues 2346 at destination entity 2340 can be defined based on a condition being satisfied. The condition can relate to a storage limit of a queue, an access rate of a queue, a rate of data flow into a queue, and so forth. For example, a flow control signal can be defined at destination control module 2380 in response to a change in the state of one or more of receive queues 2346, such as second-stage receive queue QC2 changing from an available state to an unavailable state (e.g., a congestion state) because a threshold storage limit has been exceeded. While in the unavailable state, second-stage receive queue QC2 is unavailable to receive data because, for example, it is considered too full (as indicated by the threshold storage limit being exceeded). In some embodiments, one or more of receive queues 2346 can be in an unavailable state when disabled. In some embodiments, the flow control signal can be defined based on a request to pause transmission of data to a receive queue from receive queues 2346 when that receive queue is unavailable to receive data. In some embodiments, the state of one or more of receive queues 2346 can be changed from an available state to a congestion state (by destination control module 2380) in response to a specified subset of receive queues 2346 (e.g., the receive queues within a specified stage) being in a congestion state.

In some embodiments, a flow control signal can be defined at destination control module 2380 to indicate that one of receive queues 2346 has changed from an unavailable state to an available state. For example, destination control module 2380 can initially be configured to define and send a first flow control signal to source control module 2370 in response to first-stage receive queue QD3 changing from an available state to an unavailable state. First-stage receive queue QD3 can change from the available state to the unavailable state in response to data transmitted from first-stage transmit queue QA2. Accordingly, the target of the first flow control signal can be first-stage transmit queue QA2 (as indicated by the queue identifier). When first-stage receive queue QD3 changes from the unavailable state back to the available state, destination control module 2380 can be configured to define and send to source control module 2370 a second flow control signal that indicates the change back to the available state. In some embodiments, source control module 2370 can be configured to trigger transmission of data from one or more of transmit queues 2336 to first-stage receive queue QD3 in response to the second flow control signal.
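A threshold-based availability state of the kind described above, together with the two state transitions that would prompt the first and second flow control signals, might be modeled as follows. This is a minimal sketch: the `ReceiveQueue` class, its `threshold` parameter (counted in queued items), and the convention that each method reports a state change are assumptions made for illustration, not details from the specification.

```python
from collections import deque


class ReceiveQueue:
    """Receive queue that becomes unavailable when a storage threshold is exceeded."""

    def __init__(self, name: str, threshold: int):
        self.name = name
        self.threshold = threshold  # assumed storage limit, in queued items
        self.enabled = True         # a disabled queue is treated as unavailable
        self.items = deque()

    @property
    def available(self) -> bool:
        return self.enabled and len(self.items) <= self.threshold

    def enqueue(self, item) -> bool:
        """Store an item; return True if the queue just became unavailable."""
        was_available = self.available
        self.items.append(item)
        return was_available and not self.available

    def dequeue(self) -> bool:
        """Remove the oldest item; return True if the queue just became available again."""
        was_available = self.available
        self.items.popleft()
        return (not was_available) and self.available
```

A destination control module could send a pause signal whenever `enqueue` returns `True` and a resume signal whenever `dequeue` returns `True`, so that signals are emitted only on state changes rather than on every packet.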

In some embodiments, a flow control signal can have one or more parameter values that are used by source control module 2370 to modify transmission from one of transmit queues 2336 (identified within the flow control signal by its queue identifier). For example, the flow control signal can include a parameter value that triggers source control module 2370 to pause transmission from one of transmit queues 2336 for a specified period of time (e.g., 10 milliseconds (ms)). In other words, the flow control signal can include a pause time period parameter value. In some embodiments, the pause time period can be indefinite. In some embodiments, the flow control signal can define a request to transmit data from one or more of transmit queues 2336 at a specified rate (e.g., a specified number of frames per second, a specified number of bits per second).
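As a rough illustration of how a source control module might act on a pause time period parameter, consider the sketch below. The `PauseSignal` and `SourceControl` names are hypothetical, time is passed in explicitly in milliseconds, and an indefinite pause is represented by `None`; none of these choices is prescribed by the specification. Rate-based requests are not modeled.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PauseSignal:
    queue_id: str              # transmit queue identified within the signal
    pause_ms: Optional[float]  # pause time period; None models an indefinite pause


class SourceControl:
    """Tracks, per transmit queue, the earliest time at which sending may resume."""

    def __init__(self):
        self.resume_at = {}  # queue id -> time in ms

    def apply(self, signal: PauseSignal, now_ms: float) -> None:
        duration = math.inf if signal.pause_ms is None else signal.pause_ms
        self.resume_at[signal.queue_id] = now_ms + duration

    def may_send(self, queue_id: str, now_ms: float) -> bool:
        # Queues that were never paused may always send.
        return now_ms >= self.resume_at.get(queue_id, -math.inf)
```

With a 10 ms pause applied to QA2 at time 100 ms, QA2 is blocked until 110 ms while other transmit queues are unaffected.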

In some embodiments, a flow control signal (e.g., the pause time period within a flow control signal) can be defined based on a flow control algorithm. The pause time period can be defined based on the period of time during which a receive queue from receive queues 2346 (e.g., first-stage receive queue QD4) is in an unavailable state. In some embodiments, the pause time period can be defined based on more than one of first-stage receive queues 2344 being in an unavailable state. For example, in some embodiments, the pause time period can be increased when approximately a specified number of first-stage receive queues 2344 are in a congestion state. In some embodiments, this type of determination can be made at destination control module 2380. The period of time during which a receive queue is unavailable can be a projected (e.g., predicted) period of time calculated by destination control module 2380 based on, for example, a rate of data flow from the receive queue (e.g., a historical flow rate, a prior flow rate).

In some embodiments, source control module 2370 can deny or alter a request to modify the flow of data from one or more of transmit queues 2336. For example, in some embodiments, source control module 2370 can be configured to decrease or increase the pause time period. In some embodiments, rather than pausing transmission of data in response to a flow control signal, source control module 2370 can be configured to modify a transmission path associated with one of transmit queues 2336. For example, if first-stage transmit queue QA2 has received a request to pause transmission based on a change in the state of first-stage receive queue QD2, source control module 2370 can be configured to trigger transmission of data from first-stage transmit queue QA2 to, for example, first-stage receive queue QD3 rather than comply with the request to pause transmission.

As shown in Figure 23, the queues within second-stage queues 2320 fan into or fan out of physical link 2300. For example, transmit queues 2332 (i.e., QB1 through QBM) on the transmit side of physical link 2300 fan into queue QP1 on the transmit side of physical link 2300. Accordingly, data queued at any of transmit queues 2332 can be transmitted to queue QP1 of physical link 2300. On the receive side of physical link 2300, data transmitted from physical link 2300 via queue QP2 can be broadcast to receive queues 2342 (i.e., queues QC1 through QCM).

Also as shown in Figure 23, transmit queues 2334 within first-stage queues 2310 fan into transmit queues 2332 within second-stage queues 2320. For example, data queued at any of first-stage transmit queues QA1, QA4, and QAN-2 can be transmitted to second-stage transmit queue QB2. On the receive side of physical link 2300, data transmitted from, for example, second-stage receive queue QCM can be broadcast to first-stage receive queues QDR-1 and QDR.

Because the flow control loops (e.g., the first-stage control loop) are associated with different fan-in and fan-out architectures, the flow control loops can have different effects on the flow of data via physical link 2300. For example, when transmission of data from second-stage transmit queue QB1 is paused based on the second-stage control loop, transmission of data from first-stage transmit queues QA1, QA2, QA3, and QAN-1 via second-stage transmit queue QB1 to one or more of receive queues 2346 is also paused. In this case, transmission of data from one or more upstream queues (e.g., first-stage transmit queue QA1) is paused when transmission from a downstream queue (e.g., second-stage transmit queue QB1) is paused. In contrast, if transmission of data from first-stage transmit queue QA1 along a transmission path that includes at least downstream second-stage transmit queue QB1 is paused based on the first-stage control loop, the rate of data transmitted from second-stage transmit queue QB1 may be decreased without entirely pausing transmission of data from second-stage transmit queue QB1; first-stage transmit queue QA1, for example, may still be able to transmit data via second-stage transmit queue QB1.
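The different reach of a pause at each stage can be illustrated with a small fan-in map. The topology below is an assumption pieced together from the examples in the text (QA1, QA2, and QA3 fanning into QB1; QA1 and QA4 fanning into QB2) and covers only a subset of the queues in Figure 23; the `open_paths` helper is hypothetical, and the sketch treats a first-stage pause as blocking every path out of that queue.

```python
# Assumed, partial fan-in map: first-stage transmit queue -> second-stage transmit queues.
FAN_IN = {
    "QA1": {"QB1", "QB2"},
    "QA2": {"QB1"},
    "QA3": {"QB1"},
    "QA4": {"QB2"},
}


def open_paths(first_stage_queue: str, paused: set) -> list:
    """Return the second-stage queues through which a first-stage queue can still send."""
    if first_stage_queue in paused:
        return []  # paused by the first-stage control loop
    # Any paused second-stage queue blocks every first-stage queue that fans into it.
    return sorted(q for q in FAN_IN[first_stage_queue] if q not in paused)
```

Pausing `"QB1"` silences QA2 and QA3 entirely and leaves QA1 only its path through QB2, whereas pausing `"QA1"` alone leaves QA2 free to keep sending through QB1 at a reduced aggregate rate.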

In some embodiments, the fan-in and fan-out architecture can differ from that shown in Figure 23. For example, in some embodiments, some of the queues within first-stage queues 2310 can be configured to fan into physical link 2300 while bypassing second-stage queues 2320.

Flow control signaling associated with transmit queues 2336 is handled by source control module 2370, and flow control signaling associated with receive queues 2346 is handled by destination control module 2380. Although not shown, in some embodiments flow control signaling can be handled by one or more control modules (or control sub-modules) that can be separate and/or integrated into a single control module. For example, flow control signaling associated with first-stage receive queues 2344 can be handled by a control module separate from the control module configured to handle flow control signaling associated with second-stage receive queues 2342. Similarly, flow control signaling associated with first-stage transmit queues 2334 can be handled by a control module separate from the control module configured to handle flow control signaling associated with second-stage transmit queues 2332. In some embodiments, one or more portions of source control module 2370 and/or destination control module 2380 can be a hardware-based module (e.g., a DSP, an FPGA) and/or a software-based module (e.g., a module of computer code, a set of processor-readable instructions that can be executed at a processor).

Figure 24 is a schematic block diagram that illustrates a destination control module 2450 configured to define a flow control signal 6428 associated with multiple receive queues, according to an embodiment. The queue stages include first-stage queues 2410 and second-stage queues 2420. As shown in Figure 24, a source control module 2460 is associated with the transmit side of first-stage queues 2410, and destination control module 2450 is associated with the receive side of first-stage queues 2410. The queues on the transmit side of physical link 2400 can collectively be referred to as transmit queues 2470. The queues on the receive side of physical link 2400 can collectively be referred to as receive queues 2480.

Destination control module 2450 is configured to send flow control signal 6428 to source control module 2460 in response to one or more receive queues within first-stage queues 2410 being unavailable to receive data from a single source queue at first-stage queues 2410. Source control module 2460 is configured to pause, based on flow control signal 6428, transmission of data from the source queue at first-stage queues 2410 to the multiple receive queues at first-stage queues 2410.

Flow control signal 6428 can be defined by destination control module 2450 based on information associated with each unavailable receive queue within first-stage queues 2410. Destination control module 2450 can be configured to collect the information associated with the unavailable receive queues and to define flow control signal 6428 so that potentially conflicting flow control signals (not shown) are not sent to the single source queue at first-stage queues 2410. In some embodiments, a flow control signal 6428 defined based on the collected information can be referred to as an aggregated flow control signal.

Specifically, in this example, destination control module 2450 is configured to define flow control signal 6428 in response to two receive queues on the receive side of first-stage queues 2410 (receive queue 2442 and receive queue 2446) being unavailable to receive data from transmit queue 2412 on the transmit side of first-stage queues 2410. In this embodiment, receive queue 2442 and receive queue 2446 change from an available state to an unavailable state in response to data packets transmitted from transmit queue 2412 via transmission path 6422 and transmission path 6424, respectively. As shown in Figure 24, transmission path 6422 includes transmit queue 2412, transmit queue 2422 within second-stage queues 2420, physical link 2400, receive queue 2432 within second-stage queues 2420, and receive queue 2442. Transmission path 6424 includes transmit queue 2412, transmit queue 2422, physical link 2400, receive queue 2432, and receive queue 2446.

In some embodiments, a flow control algorithm can be used to define flow control signal 6428 based on information related to the unavailability of receive queue 2442 and/or information related to the unavailability of receive queue 2446. For example, if destination control module 2450 determines that receive queue 2442 and receive queue 2446 will be unavailable for different time periods, destination control module 2450 can be configured to define flow control signal 6428 based on the different time periods. For example, destination control module 2450 can request, via flow control signal 6428, that transmission of data from transmit queue 2412 be paused for a time period calculated from the different time periods (e.g., a time period equal to the average of the different time periods, or a time period equal to the greater of the different time periods). In some embodiments, flow control signal 6428 can be defined based on individual pause requests from the receive side of first-stage queues 2410 (e.g., a pause request associated with receive queue 2442 and a pause request associated with receive queue 2446).
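The two aggregation options named above (the average and the greater of the individual time periods) can be written down directly. The function name `aggregate_pause` and the `policy` argument are hypothetical; the specification names the options but not an interface.

```python
def aggregate_pause(periods_ms, policy="max"):
    """Combine per-receive-queue pause periods into one period for the source queue.

    policy="max"  -> the greater of the individual periods
    policy="mean" -> the average of the individual periods
    """
    if not periods_ms:
        raise ValueError("at least one pause period is required")
    if policy == "max":
        return max(periods_ms)
    if policy == "mean":
        return sum(periods_ms) / len(periods_ms)
    raise ValueError(f"unknown policy: {policy!r}")
```

If receive queue 2442 is expected to be unavailable for 10 ms and receive queue 2446 for 30 ms, the `"max"` policy asks transmit queue 2412 to pause for 30 ms and the `"mean"` policy for 20 ms. Taking the maximum is the conservative choice, since it covers the slowest receive queue.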

In some embodiments, flow control signal 6428 can be defined based on a maximum or a minimum allowable time period. In some embodiments, flow control signal 6428 can be calculated based on an aggregate rate of data flow from, for example, transmit queue 2412. For example, the pause time period can be scaled based on the aggregate rate of data flow from transmit queue 2412. In some embodiments, for example, the pause time period can be increased if the rate of data flow from transmit queue 2412 is above a threshold, and the pause time period can be decreased if the rate of data flow from transmit queue 2412 is below the threshold.

In some embodiments, the flow control algorithm can be configured to wait for a specified period of time before defining and/or sending flow control signal 6428. The wait time period can be defined so that multiple pause requests that relate to transmit queue 2412, and that may be received at different times within the wait time period, can be used to define flow control signal 6428. In some embodiments, the wait time period is triggered in response to at least one pause request related to transmit queue 2412 being received.
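One way to picture the wait time period is as a window that opens on the first pause request for a given source queue and is flushed once it expires. The sketch below makes several assumptions that are not in the specification: the `PauseAggregator` class and its method names are hypothetical, time is supplied by the caller in milliseconds, and the requests in a window are combined by taking the longest pause.

```python
class PauseAggregator:
    """Collects pause requests per source queue during a wait window."""

    def __init__(self, wait_ms: float):
        self.wait_ms = wait_ms
        self.pending = {}  # source queue id -> (window deadline in ms, [pause periods])

    def request(self, source_queue: str, pause_ms: float, now_ms: float) -> None:
        # The first request for a source queue opens its wait window.
        if source_queue not in self.pending:
            self.pending[source_queue] = (now_ms + self.wait_ms, [])
        self.pending[source_queue][1].append(pause_ms)

    def flush(self, now_ms: float) -> dict:
        """Return {source queue: pause_ms} for every window that has expired."""
        signals = {}
        for queue, (deadline, periods) in list(self.pending.items()):
            if now_ms >= deadline:
                signals[queue] = max(periods)  # assumed combination rule
                del self.pending[queue]
        return signals
```

Two requests for transmit queue 2412 arriving 3 ms apart inside a 5 ms window therefore produce a single aggregated signal instead of two potentially conflicting ones.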

In some embodiments, flow control signal 6428 can be defined by the flow control algorithm based on a priority value associated with each receive queue within first-stage queues 2410. For example, if receive queue 2442 has a priority value higher than the priority value associated with receive queue 2446, destination control module 2450 can be configured to define flow control signal 6428 based on information associated with receive queue 2442 rather than receive queue 2446. For example, flow control signal 6428 can be defined based on a pause time period associated with receive queue 2442 rather than a pause time period associated with receive queue 2446, because receive queue 2442 has the higher priority value.
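Priority-based selection can be sketched as choosing the pause request that belongs to the highest-priority receive queue. The `PauseRequest` type and the convention that a larger number means a higher priority are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PauseRequest:
    receive_queue: str
    priority: int    # assumption: a larger value means a higher priority
    pause_ms: float


def select_by_priority(requests):
    """Return the request from the highest-priority receive queue."""
    return max(requests, key=lambda r: r.priority)
```

Given a request from receive queue 2442 with priority 7 and a 10 ms pause, and one from receive queue 2446 with priority 3 and a 40 ms pause, the signal would carry the 10 ms pause of receive queue 2442.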

In some embodiments, flow control signal 6428 can be defined by the flow control algorithm based on an attribute associated with each receive queue within first-stage queues 2410. For example, flow control signal 6428 can be defined based on receive queue 2442 and/or receive queue 2446 being a specified type of queue (e.g., a last-in-first-out (LIFO) queue, a first-in-first-out (FIFO) queue). In some embodiments, flow control signal 6428 can be defined based on receive queue 2442 and/or receive queue 2446 being configured to receive a specified type of data (e.g., a control data/signal queue, a media data/signal queue).

Although not shown, one or more control modules associated with a queue stage (e.g., first-stage queues 2410) can be configured to send information to a different control module, where the information is used to define a flow control signal. The different control module can be associated with a different queue stage. For example, a pause request associated with receive queue 2442 and a pause request associated with receive queue 2446 can be defined at destination control module 2450. The pause requests can be sent to a destination control module (not shown) associated with the receive side of second-stage queues 2420. A flow control signal (not shown) can be defined at the destination control module associated with the receive side of second-stage queues 2420 based on the pause requests and based on a flow control algorithm.

Flow control signal 6428 can be defined based on a flow control loop associated with the first-stage queues 2410 (e.g., a first-stage control loop). One or more flow control signals (not shown) can also be defined based on a flow control loop associated with the second-stage queues 2420 and/or a flow control loop associated with physical link 2400.

Transmission of data associated with the transmit queues within the first-stage queues 2410 (other than transmit queue 2412) is substantially unrestricted by flow control signal 6428, because the flow of data to receive queues 2442 and 2446 is controlled based on the first-stage flow control loop. For example, transmit queue 2414 can continue to transmit data via transmit queue 2422 even though transmission of data from transmit queue 2412 is suspended. For example, transmit queue 2414 can be configured to transmit data to receive queue 2448 via transmission path 6426, which includes transmit queue 2422, even though transmission of data from transmit queue 2412 via transmit queue 2422 has been suspended. In some embodiments, transmit queue 2422 can be configured to continue transmitting data from, for example, transmit queue 2416 to receive queue 2442 even though transmission of data from transmit queue 2412 via transmission path 6422 has been suspended based on flow control signal 6428.

Conversely, if transmission of data to receive queues 2442 and 2446 were instead suspended by controlling the flow of data via transmit queue 2422 based on a flow control signal (not shown) associated with the second-stage control loop, transmission of data from transmit queue 2414 and transmit queue 2416 via transmit queue 2422 would also be restricted, in addition to transmission of data from transmit queue 2412. Transmission of data from transmit queue 2422 would be suspended because it is associated with a particular service level, and the data causing congestion at, for example, receive queues 2442 and 2446 may be associated with the particular service level.
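
The difference between pausing at the first stage and pausing at the second stage can be sketched as follows. This is a simplified model, not the patented mechanism: the `FEEDS` mapping and the `can_send` helper are hypothetical names, and the only fact taken from the text is that transmit queues 2412, 2414, and 2416 all send through second-stage transmit queue 2422.

```python
# First-stage transmit queue -> second-stage transmit queue it sends through.
FEEDS = {2412: 2422, 2414: 2422, 2416: 2422}


def can_send(first_stage_queue, paused):
    """A first-stage queue may send only if neither it nor its second-stage queue is paused."""
    return (first_stage_queue not in paused
            and FEEDS[first_stage_queue] not in paused)


# First-stage loop: only queue 2412 is paused, its neighbours keep sending.
first_stage_pause = {2412}
# Second-stage loop: pausing queue 2422 stops every queue that feeds it.
second_stage_pause = {2422}
```

With `first_stage_pause`, `can_send(2414, ...)` stays true; with `second_stage_pause`, it becomes false for all three first-stage queues, which is the coarser behaviour the paragraph describes.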

One or more parameter values defined within flow control signal 6428 can be stored in memory 2452 of destination control module 2450. In some embodiments, the parameter values can be stored at memory 2452 of destination control module 2450 after they have been defined and/or when flow control signal 6428 is sent to source control module 2460. A parameter value defined within flow control signal 6428 can be used to track, for example, the state of transmit queue 2412. For example, an entry within memory 2452 can indicate that transmit queue 2412 is in a suspended state (e.g., a non-transmitting state). The entry can be defined based on a pause time period parameter value defined within flow control signal 6428. When the pause time period has expired, the entry can be updated to indicate that the state of transmit queue 2412 has changed to, for example, an active state (e.g., a transmitting state). Although not shown, in some embodiments, one or more parameter values can be stored in a memory outside of destination control module 2450 (e.g., a remote memory).
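
A minimal sketch of the state tracking described above, assuming the stored entry is simply an expiry time per transmit queue. The `PauseTable` class, its method names, and the explicit `now` argument (used instead of a real clock so the behaviour is deterministic) are illustrative assumptions.

```python
class PauseTable:
    """Tracks, per transmit queue, when its current pause period expires."""

    def __init__(self):
        self._paused_until = {}  # queue ID -> time at which the pause expires

    def record_pause(self, queue_id, pause_period, now):
        # Called when a flow control signal targeting queue_id is sent.
        self._paused_until[queue_id] = now + pause_period

    def state(self, queue_id, now):
        expiry = self._paused_until.get(queue_id)
        if expiry is not None and now < expiry:
            return "paused"
        # The pause period has expired (or was never set): drop the entry.
        self._paused_until.pop(queue_id, None)
        return "active"


table = PauseTable()
table.record_pause(2412, pause_period=10, now=100)
```

In this sketch the entry is updated lazily, on the next lookup after expiry, rather than by a timer; either approach fits the description.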

In some embodiments, one or more parameter values stored in memory 2452 of destination control module 2450 (e.g., state information defined based on the one or more parameter values) can be used by destination control module 2450 to determine whether an additional flow control signal (not shown) should be defined. In some embodiments, the one or more parameter values can be used by destination control module 2450 to define one or more additional flow control signals.

For example, if receive queue 2442 changes from an available state to an unavailable state (e.g., a congested state) in response to a first data packet received from transmit queue 2412, a request to suspend transmission of data from transmit queue 2412 can be sent via flow control signal 6428. Flow control signal 6428 can indicate, based on a queue identifier, that transmit queue 2412 is the target of the request and can specify a pause time period. When flow control signal 6428 is sent to source control module 2460, the pause time period and the queue identifier associated with transmit queue 2412 can be stored in memory 2452 of destination control module 2450. After flow control signal 6428 has been sent, receive queue 2444 can change from an available state to a congested state in response to a second data packet received from transmit queue 2412 (transmission path not shown in FIG. 24). The second data packet can be sent from transmit queue 2412 before transmission of data from transmit queue 2412 is suspended based on flow control signal 6428. Destination control module 2450 can access the information stored in memory 2452 and can determine, in response to the change in state associated with receive queue 2444, that an additional flow control signal targeting transmit queue 2412 should not be defined and sent to source control module 2460, because flow control signal 6428 has already been sent.
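
The suppression of a redundant flow control signal in this example can be sketched as below. The `DestinationControl` class and its fields are hypothetical; the sketch assumes that "already sent" means a previously signalled pause period for the same transmit queue has not yet expired.

```python
class DestinationControl:
    """Sends at most one pause signal per transmit queue while a pause is in force."""

    def __init__(self):
        self.paused_until = {}  # transmit queue ID -> expiry time of the signalled pause
        self.sent = []          # (queue ID, pause period) for each signal actually sent

    def on_congestion(self, tx_queue_id, pause_period, now):
        """Handle a receive queue becoming unavailable because of tx_queue_id."""
        expiry = self.paused_until.get(tx_queue_id)
        if expiry is not None and now < expiry:
            return False  # a signal targeting this queue is already in effect
        self.paused_until[tx_queue_id] = now + pause_period
        self.sent.append((tx_queue_id, pause_period))
        return True


dest = DestinationControl()
first = dest.on_congestion(2412, pause_period=10, now=0)   # receive queue 2442 congests
second = dest.on_congestion(2412, pause_period=10, now=3)  # receive queue 2444 congests
```

The second congestion event produces no new signal because the stored entry shows that transmit queue 2412 is already being paused.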

In some embodiments, source control module 2460 can be configured to suspend transmission from transmit queue 2412 based on the most recent flow control signal parameter values. For example, after flow control signal 6428, which targets transmit queue 2412, has been sent to source control module 2460, a later flow control signal (not shown) targeting transmit queue 2412 can be received at source control module 2460. Source control module 2460 can be configured to implement one or more parameter values associated with the later flow control signal rather than the parameter values associated with flow control signal 6428. In some embodiments, the later flow control signal can trigger transmit queue 2412 to remain in a suspended state for a longer or shorter period of time than that indicated in flow control signal 6428.

In some embodiments, source control module 2460 can optionally implement one or more parameter values associated with the later flow control signal when a priority value associated with those parameter values is higher (or lower) than a priority value associated with the one or more parameter values of flow control signal 6428. In some embodiments, each priority value can be defined at destination control module 2450, and each priority value can be defined based on a priority value associated with one or more of the receive queues 2480.
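
The two policies described above (apply the most recent signal, or apply a later signal only when its priority is higher) can be sketched as follows. The `SourceControl` class, the `require_higher_priority` flag, and the numeric priority convention are assumptions for illustration; the text also allows the opposite ("or lower") comparison.

```python
class SourceControl:
    """Applies pause signals to transmit queues, optionally gated by priority."""

    def __init__(self, require_higher_priority=False):
        self.require_higher_priority = require_higher_priority
        self.current = {}  # transmit queue ID -> (expiry time, priority of the signal)

    def on_signal(self, tx_queue_id, pause_period, priority, now):
        held = self.current.get(tx_queue_id)
        still_paused = held is not None and now < held[0]
        if still_paused and self.require_higher_priority and priority <= held[1]:
            return False  # keep the parameter values of the earlier signal
        # The later signal replaces the earlier one; the pause may get longer or shorter.
        self.current[tx_queue_id] = (now + pause_period, priority)
        return True

    def may_transmit(self, tx_queue_id, now):
        held = self.current.get(tx_queue_id)
        return held is None or now >= held[0]


latest_wins = SourceControl()
latest_wins.on_signal(2412, pause_period=10, priority=1, now=0)
latest_wins.on_signal(2412, pause_period=3, priority=1, now=2)  # shortens the pause

gated = SourceControl(require_higher_priority=True)
gated.on_signal(2412, pause_period=10, priority=5, now=0)
ignored = gated.on_signal(2412, pause_period=3, priority=1, now=2)
```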

In some embodiments, flow control signal 6428 and the later flow control signal (both targeting transmit queue 2412) can both be defined in response to the same receive queue from the receive queues 2480 being unavailable. For example, the later flow control signal can include updated parameter values defined by destination control module 2450 based on receive queue 2442 remaining in the unavailable state for a longer period of time than previously calculated. In some embodiments, flow control signal 6428 targeting transmit queue 2412 can be defined in response to one of the receive queues 2480 changing state (e.g., from an available state to an unavailable state), and the later flow control signal targeting transmit queue 2412 can be defined in response to another of the receive queues 2480 changing state (e.g., from an available state to an unavailable state).

In some embodiments, multiple flow control signals can be defined at destination control module 2450 to suspend transmissions from multiple transmit queues of the first-stage queues 2410. In some embodiments, the multiple transmit queues may be transmitting data to a single receive queue, such as receive queue 2444. In some embodiments, a history of the flow control signals sent to the multiple transmit queues of the first-stage queues 2410 can be stored in memory 2452 of destination control module 2450. In some embodiments, a later flow control signal associated with the single receive queue can be calculated based on the history of the flow control signals.

In some embodiments, pause time periods associated with multiple transmit queues can be grouped together and included in a flow control packet. For example, a pause time period associated with transmit queue 2412 and a pause time period associated with transmit queue 2414 can be included in a single flow control packet. More details related to flow control packets are described in connection with FIG. 25.

FIG. 25 is a schematic diagram that illustrates a flow control packet, according to one embodiment. The flow control packet includes a header 2510, a trailer 2520, and a payload 2530 that includes pause time period parameter values (shown in column 2512) for several transmit queues represented by queue identifiers (IDs) (shown in column 2514). As shown in FIG. 25, the transmit queues represented by queue ID 1 through queue ID V are each associated with one of pause time period parameter values 1 through V (i.e., pause time period 1 through pause time period V). Pause time period parameter value 2514 indicates the period of time for which the transmit queue represented by queue ID 2512 should be suspended (e.g., inhibited) from transmitting data.
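
As an illustration of the header/payload/trailer layout, the following Python sketch serializes a list of (queue ID, pause time period) pairs. The field widths, the use of the header as an entry count, and the CRC-32 trailer are all assumptions made for this sketch; the text does not define the contents of header 2510 or trailer 2520.

```python
import struct
import zlib

HEADER = struct.Struct("!H")   # assumed header: number of payload entries
ENTRY = struct.Struct("!HI")   # assumed entry: 16-bit queue ID, 32-bit pause period
TRAILER = struct.Struct("!I")  # assumed trailer: CRC-32 over header and payload


def encode(entries):
    """Build a flow control packet from (queue_id, pause_period) pairs."""
    body = HEADER.pack(len(entries))
    body += b"".join(ENTRY.pack(queue_id, pause) for queue_id, pause in entries)
    return body + TRAILER.pack(zlib.crc32(body))


def decode(packet):
    """Return the (queue_id, pause_period) pairs, verifying the trailer first."""
    body = packet[:-TRAILER.size]
    (crc,) = TRAILER.unpack(packet[-TRAILER.size:])
    if zlib.crc32(body) != crc:
        raise ValueError("trailer checksum mismatch")
    (count,) = HEADER.unpack_from(body, 0)
    return [ENTRY.unpack_from(body, HEADER.size + i * ENTRY.size)
            for i in range(count)]


entries = [(1, 100), (2, 250)]
packet = encode(entries)
```

Grouping several pause time periods into one packet, as here, amortizes the header and trailer over all V entries.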

In some embodiments, the flow control packet can be defined at a destination control module such as, for example, destination control module 2450 shown in FIG. 24. In some embodiments, the destination control module can be configured to define flow control packets at regular time intervals. For example, the destination control module can be configured to define a flow control packet every 10 ms. In some embodiments, the destination control module can be configured to define a flow control packet at random times, when a pause time period parameter value has been calculated, and/or when a specified number of pause time period parameter values have been calculated. In some embodiments, the destination control module can determine, based on, for example, one or more parameter values and/or state information accessed by the destination control module, that at least a portion of a flow control packet should not be defined and/or sent.
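
Two of the triggers mentioned above (a regular interval and a specified number of calculated values) can be sketched as a small batcher. The class name, method names, and the explicit millisecond timestamps are illustrative assumptions; a real module would be driven by a hardware or system timer.

```python
class FlowControlBatcher:
    """Collects pause periods and emits them as one packet payload."""

    def __init__(self, interval_ms=10, max_entries=None):
        self.interval_ms = interval_ms  # regular emission interval
        self.max_entries = max_entries  # emit early once this many values are pending
        self.pending = {}               # queue ID -> most recently calculated pause period
        self.last_emit_ms = 0

    def add(self, queue_id, pause_period, now_ms):
        self.pending[queue_id] = pause_period
        if self.max_entries is not None and len(self.pending) >= self.max_entries:
            return self._emit(now_ms)
        return None

    def tick(self, now_ms):
        """Call periodically; emits only if the interval elapsed and values are pending."""
        if self.pending and now_ms - self.last_emit_ms >= self.interval_ms:
            return self._emit(now_ms)
        return None

    def _emit(self, now_ms):
        entries = sorted(self.pending.items())
        self.pending.clear()
        self.last_emit_ms = now_ms
        return entries


timed = FlowControlBatcher(interval_ms=10)
timed.add(2412, 5, now_ms=1)
timed.add(2414, 7, now_ms=2)
```

Calling `timed.tick(10)` would return both pending entries as one payload, while a tick with nothing pending returns `None`, so no empty packet is defined.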

Although not shown, in some embodiments, multiple queue IDs can be associated with a single pause time period parameter value. In some embodiments, at least one queue ID can be associated with a parameter value other than a pause time period parameter value. For example, a queue ID can be associated with a flow rate parameter value. The flow rate parameter value can indicate a flow rate (e.g., a maximum flow rate) at which the transmit queue represented by the queue ID should transmit data. In some embodiments, the flow control packet can have one or more means configured to indicate whether a particular receive queue is available to receive data.

The flow control packet can be sent from the destination control module to a source control module (such as source control module 2460 shown in FIG. 24) via a flow control signal (such as flow control signal 6428 shown in FIG. 24). In some embodiments, the flow control packet can be defined based on a layer-2 protocol (e.g., layer 2 of the OSI model). In other words, the flow control packet can be defined at, and used within, layer 2 of a network system. In some embodiments, the flow control packet can be sent between devices associated with layer 2 (e.g., MAC devices).

Referring back to FIG. 25, one or more parameter values associated with flow control signal 6428 (e.g., state information defined based on the parameter values) can be stored in memory 2562 of source control module 2560. In some embodiments, the one or more parameter values can be stored in memory 2562 of source control module 2560 when flow control signal 6428 is received at source control module 2560. A parameter value defined in flow control signal 6428 can be used to track the state of one or more of the receive queues 2580 (e.g., receive queue 2542). For example, an entry in memory 2562 can indicate that receive queue 2542 is unavailable to receive data. The entry can be defined based on a pause time period parameter value defined in flow control signal 6428 and can be associated with an identifier of receive queue 2542 (e.g., a queue identifier). When the pause time period expires, the entry can be updated to indicate that the state of receive queue 2542 has changed to, for example, an active state. Although not shown, in some embodiments, one or more parameter values can be stored in a memory outside of source control module 2560 (e.g., a remote memory).

In some embodiments, one or more parameter values (and/or state information) stored at memory 2562 of source control module 2560 can be used by source control module 2560 to determine whether data should be sent to one or more of the receive queues 2580. For example, source control module 2560 can be configured to send data from transmit queue 2516 to receive queue 2544 rather than receive queue 2542 based on state information related to receive queue 2544 and receive queue 2542.

In some embodiments, source control module 2560 can analyze data transmission patterns to determine whether data should be sent from one or more of the source queues 2570 to one or more of the receive queues 2580. For example, source control module 2560 can determine, based on parameter values stored at memory 2562 of source control module 2560, that transmit queue 2514 is sending a relatively high volume of data to receive queue 2546. Based on this determination, source control module 2560 can trigger queue 2516 to send data to receive queue 2548 rather than receive queue 2546, because receive queue 2546 is receiving a high volume of data from transmit queue 2514. By analyzing the transmission patterns associated with the transmit queues 2570, the onset of congestion at one or more of the receive queues 2580 can be substantially avoided.

In some embodiments, source control module 2560 can analyze parameter values (and/or state information) stored at memory 2562 of source control module 2560 to determine whether data should be sent to one or more of the receive queues 2580. By analyzing the stored parameter values (and/or state information), the onset of congestion at one or more of the receive queues 2580 can be substantially avoided. For example, source control module 2560 can trigger data to be sent to receive queue 2540 rather than receive queue 2542 based on the historical availability of receive queue 2540 compared with (e.g., better than, worse than) the historical availability of receive queue 2542. In some embodiments, for example, source control module 2560 can send data to receive queue 2542 rather than receive queue 2544 based on the historical performance of receive queue 2542 with respect to data burst patterns compared with the historical performance of receive queue 2544. In some embodiments, the analysis of parameter values related to one or more of the receive queues 2580 can be based on a particular time window, a particular type of network processing (e.g., inter-processor communication), a particular service level, and so forth.

In some embodiments, destination control module 2550 can send state information (e.g., current state information) about the receive queues 2580 that can be used by source control module 2560 to determine whether data should be sent from one or more of the source queues 2570. For example, source control module 2560 can trigger queue 2514 to send data to queue 2544 rather than queue 2546 because queue 2544 has more available capacity than queue 2546, as indicated by destination control module 2550. In some embodiments, any combination of current state information, transmission pattern analysis, and historical data analysis can be used to substantially prevent, or reduce the likelihood of, the onset of congestion at one or more of the receive queues 2580.
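
One way a source control module might combine current state information with historical data when choosing a receive queue is sketched below. The function name, the dictionary-based inputs, and the ordering of the criteria (pause state first, then available capacity, then historical availability) are assumptions for illustration, not a rule stated in the text.

```python
def choose_receive_queue(candidates, paused, capacity, history):
    """Pick a receive queue that is not paused, preferring spare capacity, then history.

    candidates: iterable of receive queue IDs
    paused:     set of queue IDs currently marked unavailable
    capacity:   queue ID -> available capacity reported by the destination module
    history:    queue ID -> historical availability score (higher is better)
    """
    available = [q for q in candidates if q not in paused]
    if not available:
        return None  # every candidate is unavailable; hold the data
    return max(available,
               key=lambda q: (capacity.get(q, 0), history.get(q, 0.0)))


choice = choose_receive_queue(
    candidates=[2542, 2544, 2546],
    paused={2542},
    capacity={2544: 40, 2546: 10},
    history={2544: 0.9, 2546: 0.95},
)
```

In this example queue 2544 is chosen over queue 2546 because it reports more available capacity, and queue 2542 is skipped because it is marked unavailable.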

In some embodiments, flow control signal 6428 can be sent from destination control module 2550 to source control module 2560 via an out-of-band transmission path. For example, flow control signal 6428 can be sent via a link dedicated to communications related to flow control signaling. In some embodiments, flow control signal 6428 can be sent via queues associated with the second-stage queues 2520, queues associated with the first-stage queues 2510, and/or physical link 2500.

Some embodiments described herein relate to a computer storage product with a computer-readable medium (also referred to as a processor-readable medium) having instructions or computer code thereon for performing various computer-implemented operations. The media and computer code (also referred to as code) may be those designed and constructed for a specific purpose or purposes. Examples of computer-readable media include, but are not limited to: magnetic storage media such as hard disks, floppy disks, and magnetic tape; optical storage media such as compact discs/digital video discs (CD/DVDs), compact disc read-only memories (CD-ROMs), and holographic devices; magneto-optical storage media such as optical disks; carrier wave signal processing modules; and hardware devices that are specially configured to store and execute program code, such as ASICs, programmable logic devices (PLDs), read-only memory (ROM) devices, and RAM devices.

Examples of computer code include, but are not limited to, microcode or microinstructions, machine instructions such as those produced by a compiler, code used to produce a web service, and files containing higher-level instructions that are executed by a computer using an interpreter. For example, embodiments may be implemented using Java, C++, or other programming languages (e.g., object-oriented programming languages) and development tools. Additional examples of computer code include, but are not limited to, control signals, encrypted code, and compressed code.

While various embodiments have been described above, it should be understood that they have been presented by way of example only, not limitation, and that various changes in form and detail may be made. Any portion of the apparatus and/or methods described herein may be combined in any combination, except mutually exclusive combinations. The embodiments described herein can include various combinations and/or sub-combinations of the functions, components, and/or features of the different embodiments described.

Claims (47)

1. A communication apparatus, comprising:
a switch core, the switch core defining a single logical entity and having a multi-stage switch fabric physically distributed across a plurality of chassis, the multi-stage switch fabric having a plurality of input ports and a plurality of output ports, the switch core configured to be coupled to a plurality of peripheral processing devices via the plurality of input ports and the plurality of output ports,
the switch core configured to provide non-blocking connectivity at line rate between a first peripheral processing device disposed in a first chassis and a second peripheral processing device disposed in a second chassis, the switch core configured to receive a first packet associated with the first peripheral processing device, the switch core configured to, based on a cell associated with the first packet, sequentially send a second packet to the second peripheral processing device and a third packet to a third peripheral processing device, the multi-stage switch fabric configured to send the cell from an input port of the plurality of input ports to an output port of the plurality of output ports.
2. The communication apparatus of claim 1, wherein the plurality of peripheral processing devices includes at least one peripheral processing device having virtualized resources and at least one peripheral processing device without virtualized resources.
3. The communication apparatus of claim 1, wherein the number of the plurality of input ports and the plurality of output ports is greater than 1000, each input port of the plurality of input ports and each output port of the plurality of output ports being configured to operate at a rate of no less than 10 Gb/s.
4. The communication apparatus of claim 1, wherein:
the first peripheral processing device and the second peripheral processing device are each one of a storage node device, a compute node device, a service node device, or a router.
5. The communication apparatus of claim 1, wherein the switch core is configured to provide non-blocking connectivity at line rate between the second peripheral processing device and the third peripheral processing device.
6. The communication apparatus of claim 5, wherein:
the first peripheral processing device and the third peripheral processing device are each one of a storage node device, a compute node device, a service node device, or a router; and
the second peripheral processing device is at least one of a firewall device, an intrusion detection device, or a load balancing device.
7. A communication apparatus, comprising:
a switch core, the switch core having a multi-stage switch fabric physically distributed across a plurality of chassis, the multi-stage switch fabric having a plurality of input ports and a plurality of output ports, the switch core configured to be coupled to a plurality of peripheral processing devices via the plurality of input ports and the plurality of output ports,
the switch core configured to provide, at line rate, connectivity from each peripheral processing device of the plurality of peripheral processing devices to each remaining peripheral processing device of the plurality of peripheral processing devices, such that each output port of the plurality of output ports can be equally accessed by each peripheral processing device of the plurality of peripheral processing devices via an input port of the plurality of input ports, the number of the plurality of input ports and the plurality of output ports being greater than 1000, each input port of the plurality of input ports and each output port of the plurality of output ports being configured to operate at a rate of no less than 10 Gb/s.
8. The communication apparatus of claim 7, wherein the plurality of peripheral processing devices includes at least one peripheral processing device coupled to the switch core via an Ethernet connection and at least one peripheral processing device coupled to the switch core via a non-Ethernet connection.
9. The communication apparatus of claim 7, wherein the plurality of peripheral processing devices includes at least one peripheral processing device that uses layer-3 routing and at least one peripheral processing device that is a layer-4 through layer-7 device.
10. A communication apparatus, comprising:
a switch core, the switch core defining a single logical entity and having a multi-stage switch fabric, the multi-stage switch fabric having a plurality of stages physically distributed across a plurality of chassis, the plurality of stages collectively having a plurality of input ports and a plurality of output ports, the switch core configured to be coupled to a plurality of peripheral processing devices via the plurality of input ports and the plurality of output ports,
the switch core configured to admit a plurality of cells associated with a packet into an input port of the plurality of input ports when transmission of the plurality of cells through the multi-stage switch fabric can be substantially guaranteed without loss.
11. The communication apparatus of claim 10, wherein the plurality of peripheral processing devices includes a first peripheral processing device configured to communicate using a Fibre Channel protocol and a second peripheral processing device configured to communicate using a Fibre Channel over Ethernet protocol.
12. The communication apparatus of claim 10, wherein the multi-stage switch fabric is configured as a deterministic network.
13. The communication apparatus of claim 10, wherein the multi-stage switch fabric is configured as a deterministic network such that the multi-stage switch fabric admits the packet into the input port when the plurality of cells can be sent to an output port of the plurality of output ports within a predetermined time.
14. The communication apparatus of claim 10, wherein:
the switch core is configured to send the plurality of cells associated with the packet from the input port to a first output port and a second output port of the plurality of output ports, without performing packet loss processing at at least one stage of the plurality of stages of the multi-stage switch fabric.
15. The communication apparatus of claim 10, wherein:
the switch core includes a plurality of edge devices coupled to the multi-stage switch fabric via the plurality of input ports and the plurality of output ports, the plurality of edge devices being coupled to the plurality of peripheral processing devices, each edge device of the plurality of edge devices being configured to receive the packet and to define the plurality of cells based on the packet.
16. The communication apparatus of claim 10, wherein:
the switch core is configured to send the plurality of cells associated with the packet from the input port to an output port of the plurality of output ports via the plurality of stages of the multi-stage switch fabric, without performing packet loss processing at at least one stage of the plurality of stages.
17. a kind of communication equipment, including:
Exchcange core, the exchcange core defines single logic entity and with multilevel interchange frame, and multistage exchange is tied Structure has multiple levels being physically distributed across multiple frames, and the multilevel interchange frame has multiple input ports and multiple defeated Exit port, the exchcange core is configured as being couple to outside multiple via the multiple input port and the multiple output port Enclose processing unit,
The exchcange core is configured as receiving packet, the exchcange core quilt from the input port in the multiple input port It is configured to send multiple and institute via output port of the multiple level from the input port into the multiple output port The associated cell of packet is stated, is damaged without performing packet at least one-level in multiple levels of the multilevel interchange frame Consumption processing.
18. communication equipment as claimed in claim 17, wherein being configured to determine that property of the multilevel interchange frame network, so that Only when the transmission for the multiple cells associated with packet that can be essentially ensures that in the switching fabric is lossless, Just allow the packet of the input port in the multiple input port.
19. communication equipment as claimed in claim 17, wherein:
The output port is the first output port,
The exchcange core is configured as first output port into the multiple output port from the input port Sent and the multiple cell associated with the packet with the second output port.
20. The communication equipment of claim 17, wherein:
the switch core is coupled, via the plurality of input ports and the plurality of output ports, to the multi-stage switch fabric including a plurality of edge devices, the plurality of edge devices being coupled to the plurality of peripheral processors, and each edge device of the plurality of edge devices being configured to receive the packet and to define the plurality of cells based on the packet.
21. Communication equipment, comprising:
a switch core, the switch core defining a single logical entity and having a multi-stage switch fabric configured as a deterministic network, the multi-stage switch fabric having a plurality of input ports and a plurality of output ports, the switch core being configured to be coupled to a plurality of peripheral processors via the plurality of input ports and the plurality of output ports,
the switch core being configured to receive a packet from an input port of the plurality of input ports, and to send a plurality of cells associated with the packet from the input port to an output port of the plurality of output ports.
22. The communication equipment of claim 21, wherein the multi-stage switch fabric is physically distributed across a plurality of chassis.
23. The communication equipment of claim 21, wherein the multi-stage switch fabric is configured as a deterministic network, such that a packet is admitted at an input port of the plurality of input ports only when lossless transmission, within the multi-stage switch fabric, of the plurality of cells associated with the packet can be substantially guaranteed.
24. The communication equipment of claim 21, wherein the multi-stage switch fabric is configured as a deterministic network, such that a packet is admitted at an input port of the plurality of input ports when the plurality of cells associated with the packet can be sent to an output port of the plurality of output ports within a predetermined time.
25. The communication equipment of claim 21, wherein:
the switch core is coupled, via the plurality of input ports and the plurality of output ports, to the multi-stage switch fabric including a plurality of edge devices, the plurality of edge devices being coupled to the plurality of peripheral processors, and each edge device of the plurality of edge devices being configured to receive the packet and to define the plurality of cells based on the packet.
26. The communication equipment of claim 21, wherein:
the switch core is configured to send the plurality of cells associated with the packet from the input port to the output port via a plurality of stages of the multi-stage switch fabric, without performing packet loss processing at at least one stage of the plurality of stages.
27. Communication equipment, comprising:
a switch core having a multi-stage switch fabric physically distributed among a plurality of chassis, the multi-stage switch fabric having a plurality of input buffers and a plurality of output ports, the switch core being configured to be coupled to a plurality of edge devices; and
a controller implemented in hardware without requiring software during operation, and implemented with software during configuration and monitoring, the controller being coupled to the plurality of input buffers and the plurality of output ports, the controller being configured to send a flow control signal to an input buffer of the plurality of input buffers when congestion at an output port of the plurality of output ports is predicted and before the congestion occurs in the switch core.
28. The communication equipment of claim 27, wherein the controller is configured to perform end-to-end flow control on the input buffer and the output port independently of in-fabric flow control for the multi-stage switch fabric of the switch core.
29. The communication equipment of claim 27, wherein the controller is configured to perform end-to-end flow control on the input buffer and the output port independently of flow control for the plurality of edge devices.
30. The communication equipment of claim 27, further comprising:
a plurality of peripheral processors configured to be coupled to the plurality of edge devices,
the controller being configured to perform end-to-end flow control on the input buffer and the output port independently of flow control for the plurality of edge devices.
31. The communication equipment of claim 27, wherein the controller is configured to perform end-to-end flow control such that a cell is buffered at the input buffer for a period of time before being sent to the output port, the period of time being associated with the end-to-end flow control.
32. The communication equipment of claim 27, wherein the controller is configured to perform end-to-end flow control on cells buffered at the input buffer, independently of cell segments buffered at a stage of the multi-stage switch fabric and independently of packets buffered at an edge device of the plurality of edge devices.
33. The communication equipment of claim 27, wherein the controller is configured to perform end-to-end flow control on cells buffered at the input buffer, independently of a flow control mechanism associated with Ethernet.
34. Communication equipment, comprising:
a switch core having a multi-stage switch fabric physically distributed among a plurality of chassis, the multi-stage switch fabric being configured to receive a plurality of cells associated with a packet and to switch a plurality of cell segments based on the plurality of cells;
an edge device of a plurality of edge devices coupled to the switch core, the edge device being configured to receive the packet and to send the plurality of cells to the multi-stage switch fabric; and
a controller coupled to the multi-stage switch fabric, the controller being configured to perform flow control on the plurality of cells independently of flow control for the plurality of edge devices and independently of in-fabric flow control for the multi-stage switch fabric.
35. The communication equipment of claim 34, wherein:
the controller is implemented in hardware without requiring software during operation, and is implemented with software during configuration and monitoring.
36. The communication equipment of claim 34, wherein:
the multi-stage switch fabric has a plurality of input buffers and a plurality of output ports,
the controller is configured to send a flow control signal to an input buffer of the plurality of input buffers when congestion at an output port of the plurality of output ports is predicted and before the congestion occurs in the switch core.
37. The communication equipment of claim 34, wherein:
the multi-stage switch fabric has a plurality of input buffers and a plurality of output ports,
the controller is configured to perform end-to-end flow control on cells buffered at an input buffer of the plurality of input buffers, independently of a flow control mechanism associated with Ethernet.
38. Communication equipment, comprising:
a switch core having a multi-stage switch fabric;
a first plurality of peripheral processors coupled to the multi-stage switch fabric by a plurality of connections having a protocol, each peripheral processor of the first plurality of peripheral processors being a storage node having virtual resources, the virtual resources of the first plurality of peripheral processors collectively defining a virtual storage resource interconnected by the switch core; and
a second plurality of peripheral processors coupled to the multi-stage switch fabric by a plurality of connections having a protocol, each peripheral processor of the second plurality of peripheral processors being a storage node having virtual resources, the virtual resources of the second plurality of peripheral processors collectively defining a virtual computing resource interconnected by the switch core.
39. The communication equipment of claim 38, wherein:
each peripheral processor of the first plurality of peripheral processors has virtual resources and is configured such that its virtual resources can be replaced by the virtual resources of a remaining peripheral processor of the first plurality of peripheral processors; and
each peripheral processor of the second plurality of peripheral processors has virtual resources and is configured such that its virtual resources can be replaced by the virtual resources of a remaining peripheral processor of the second plurality of peripheral processors.
40. The communication equipment of claim 38, wherein:
the first plurality of peripheral processors is associated with a packet-based communication protocol and with a security protocol; and
the second plurality of peripheral processors is associated with a packet-based communication protocol and with a security protocol.
41. Communication equipment, comprising:
a switch core having a multi-stage switch fabric, the switch core being configured to be logically partitioned into a first virtual switch core and a second virtual switch core;
a plurality of peripheral processors coupled to the multi-stage switch fabric, the plurality of peripheral processors having a first peripheral processor subset operably coupled to the first virtual switch core and a second peripheral processor subset operably coupled to the second virtual switch core.
42. The communication equipment of claim 41, wherein:
the switch core is configured such that the first virtual switch core and the second virtual switch core are administratively managed independently of each other.
43. The communication equipment of claim 41, wherein:
the switch core is configured such that the first virtual switch core has a bandwidth independent of the bandwidth of the second virtual switch core.
44. The communication equipment of claim 41, wherein:
the switch core is configured such that the first virtual switch core has a bandwidth and administrative management independent of the bandwidth and administrative management of the second virtual switch core.
45. The communication equipment of claim 41, wherein:
the switch core is configured such that the first virtual switch core operates using a Layer 2 protocol, and the second virtual switch core operates using a Layer 2 protocol and a Layer 3 protocol.
46. The communication equipment of claim 41, wherein:
the first peripheral processor subset has virtual resources, and the second peripheral processor subset has virtual resources.
47. The communication equipment of claim 41, wherein:
the first peripheral processor subset includes a peripheral processor that is one of a compute node, a storage node, a service node device, and a router, and includes a peripheral processor that is a remaining one of a compute node, a storage node, a service node device, and a router; and
the second peripheral processor subset includes a peripheral processor that is one of a compute node, a storage node, a service node device, and a router, and includes a peripheral processor that is a remaining one of a compute node, a storage node, a service node device, and a router.
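Illustrative sketch (not part of the claims): the claims above recite admitting a packet only when its cells can be delivered without loss, and signaling an input buffer before congestion occurs at an output port. The Python below is one possible way to picture that behavior, assuming a simple credit scheme in which one credit stands for room for one cell at an output port. The names `CELL_BYTES`, `segment`, `Controller`, `offer` and `drain`, the 64-byte cell size, and the credit scheme itself are all assumptions introduced here for illustration; the claims do not specify a mechanism.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple

CELL_BYTES = 64  # assumed cell size; the claims do not specify one


def segment(packet: bytes, cell_bytes: int = CELL_BYTES) -> List[bytes]:
    """Split a packet into fixed-size cells (the last cell may be shorter)."""
    return [packet[i:i + cell_bytes] for i in range(0, len(packet), cell_bytes)]


@dataclass
class Controller:
    """Hypothetical credit-based controller: one credit = room for one cell."""
    credits: Dict[int, int]  # output port -> free cell slots
    held: Deque[Tuple[bytes, int]] = field(default_factory=deque)  # input buffer

    def offer(self, packet: bytes, out_port: int) -> bool:
        """Admit the packet only if every one of its cells fits at out_port."""
        cells = segment(packet)
        if len(cells) > self.credits[out_port]:
            # Congestion is foreseen: hold the packet at the input buffer
            # (standing in for a flow control signal) rather than drop cells.
            self.held.append((packet, out_port))
            return False
        self.credits[out_port] -= len(cells)
        return True

    def drain(self, out_port: int, n_cells: int) -> None:
        """The output port sent n_cells: return credits, retry held packets once."""
        self.credits[out_port] += n_cells
        for _ in range(len(self.held)):
            packet, port = self.held.popleft()
            self.offer(packet, port)  # re-queued by offer() if it still does not fit
```

In this sketch a packet that does not fit waits at the input side until `drain` returns enough credits, so no cell is discarded inside the fabric; a hardware controller as recited in claims 27 and 35 would realize the decision very differently.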
CN201410138824.5A 2008-09-11 2009-09-11 System, method and equipment for data center Active CN103916326B (en)

Applications Claiming Priority (25)

Application Number Priority Date Filing Date Title
US9620908P 2008-09-11 2008-09-11
US61/096,209 2008-09-11
US9851608P 2008-09-19 2008-09-19
US61/098,516 2008-09-19
US12/242,224 2008-09-30
US12/242,224 US8154996B2 (en) 2008-09-11 2008-09-30 Methods and apparatus for flow control associated with multi-staged queues
US12/242,230 US8218442B2 (en) 2008-09-11 2008-09-30 Methods and apparatus for flow-controllable multi-staged queues
US12/242,230 2008-09-30
US12/343,728 2008-12-24
US12/343,728 US8325749B2 (en) 2008-12-24 2008-12-24 Methods and apparatus for transmission of groups of cells via a switch fabric
US12/345,500 US8804710B2 (en) 2008-12-29 2008-12-29 System architecture for a scalable and distributed multi-stage switch fabric
US12/345,502 2008-12-29
US12/345,500 2008-12-29
US12/345,502 US8804711B2 (en) 2008-12-29 2008-12-29 Methods and apparatus related to a modular switch architecture
US12/495,361 2009-06-30
US12/495,344 2009-06-30
US12/495,358 US8335213B2 (en) 2008-09-11 2009-06-30 Methods and apparatus related to low latency within a data center
US12/495,337 US8730954B2 (en) 2008-09-11 2009-06-30 Methods and apparatus related to any-to-any connectivity within a data center
US12/495,364 2009-06-30
US12/495,358 2009-06-30
US12/495,337 2009-06-30
US12/495,344 US20100061367A1 (en) 2008-09-11 2009-06-30 Methods and apparatus related to lossless operation within a data center
US12/495,361 US8755396B2 (en) 2008-09-11 2009-06-30 Methods and apparatus related to flow control within a data center switch fabric
US12/495,364 US9847953B2 (en) 2008-09-11 2009-06-30 Methods and apparatus related to virtualization of data center resources
CN200910246898.XA CN101917331B (en) 2008-09-11 2009-09-11 Systems, methods and devices for data centers

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN200910246898.XA Division CN101917331B (en) 2008-09-11 2009-09-11 Systems, methods and devices for data centers

Publications (2)

Publication Number Publication Date
CN103916326A CN103916326A (en) 2014-07-09
CN103916326B true CN103916326B (en) 2017-10-31

Family

ID=43324725

Family Applications (2)

Application Number Title Priority Date Filing Date
CN200910246898.XA Active CN101917331B (en) 2008-09-11 2009-09-11 Systems, methods and devices for data centers
CN201410138824.5A Active CN103916326B (en) 2008-09-11 2009-09-11 System, method and equipment for data center

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN200910246898.XA Active CN101917331B (en) 2008-09-11 2009-09-11 Systems, methods and devices for data centers

Country Status (1)

Country Link
CN (2) CN101917331B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102420775A (en) * 2012-01-10 2012-04-18 西安电子科技大学 Routing method for module-expansion-based data center network topology system
US9094308B2 (en) 2012-06-06 2015-07-28 Juniper Networks, Inc. Finding latency through a physical network in a virtualized network
US9064216B2 (en) * 2012-06-06 2015-06-23 Juniper Networks, Inc. Identifying likely faulty components in a distributed system
CN103023803B (en) * 2012-12-12 2015-05-20 华中科技大学 Method and system for optimizing virtual links of fiber channel over Ethernet
WO2014096970A2 (en) * 2012-12-20 2014-06-26 Marvell World Trade Ltd. Memory sharing in a network device
US9419892B2 (en) * 2013-09-30 2016-08-16 Juniper Networks, Inc. Methods and apparatus for implementing connectivity between edge devices via a switch fabric
US9787559B1 (en) 2014-03-28 2017-10-10 Juniper Networks, Inc. End-to-end monitoring of overlay networks providing virtualized network services
CN105099939A (en) * 2014-04-23 2015-11-25 株式会社日立制作所 Method and device for implementing flow control among different data centers
CN105577575B (en) * 2014-10-22 2019-09-17 深圳市中兴微电子技术有限公司 A kind of chainlink control method and device
CN107104871B (en) * 2016-02-22 2021-11-19 中兴通讯股份有限公司 Subnet intercommunication method and device
CN105827544B (en) * 2016-03-14 2019-01-22 烽火通信科技股份有限公司 A kind of jamming control method and device for multistage CLOS system
CN107276908B (en) * 2016-04-07 2021-06-11 深圳市中兴微电子技术有限公司 Routing information processing method and packet switching equipment
US10243840B2 (en) * 2017-03-01 2019-03-26 Juniper Networks, Inc. Network interface card switching for virtual networks
CN113099488B (en) * 2019-12-23 2024-04-09 中国移动通信集团陕西有限公司 Method, device, computing equipment and computer storage medium for solving network congestion
US11323312B1 (en) 2020-11-25 2022-05-03 Juniper Networks, Inc. Software-defined network monitoring and fault localization
CN113595935A (en) * 2021-07-20 2021-11-02 锐捷网络股份有限公司 Data center switch architecture and data center
CN113630809B (en) * 2021-08-12 2024-07-05 迈普通信技术股份有限公司 Service forwarding method and device and computer readable storage medium
CN113961628B (en) * 2021-12-20 2022-03-22 广州市腾嘉自动化仪表有限公司 Distributed data analysis control system
CN115225589A (en) * 2022-07-17 2022-10-21 奕德(广州)科技有限公司 A CrossPoint Switching Method Based on Virtual Packet Switching

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5457682A (en) * 1993-05-05 1995-10-10 At&T Ipm Corp. Apparatus and method for supporting a line group apparatus remote from a line unit
US5945922A (en) * 1996-09-06 1999-08-31 Lucent Technologies Inc. Widesense nonblocking switching networks
CN1084579C (en) * 1997-03-27 2002-05-08 上海贝尔电话设备制造有限公司 S12 exchanger timing supply method and system thereof
JP2001313660A (en) * 2000-02-21 2001-11-09 Nippon Telegr & Teleph Corp <Ntt> WDM optical network
US7420969B2 (en) * 2000-11-29 2008-09-02 Rmi Corporation Network switch with a parallel shared memory
US6567576B2 (en) * 2001-02-05 2003-05-20 Jds Uniphase Inc. Optical switch matrix with failure protection
CN201075868Y (en) * 2006-08-21 2008-06-18 丛林网络公司 Multi spider route device with multipath optical interlinkage parts

Also Published As

Publication number Publication date
CN103916326A (en) 2014-07-09
CN101917331A (en) 2010-12-15
CN101917331B (en) 2014-05-07

Similar Documents

Publication Publication Date Title
US11451491B2 (en) Methods and apparatus related to virtualization of data center resources
CN103916326B (en) System, method and equipment for data center
US10454849B2 (en) Methods and apparatus related to a flexible data center security architecture
US11711319B2 (en) Methods and apparatus for flow control associated with a switch fabric
US8730954B2 (en) Methods and apparatus related to any-to-any connectivity within a data center
US8340088B2 (en) Methods and apparatus related to a low cost data center architecture
US8755396B2 (en) Methods and apparatus related to flow control within a data center switch fabric
US8335213B2 (en) Methods and apparatus related to low latency within a data center
US9065773B2 (en) Methods and apparatus for virtual channel flow control associated with a switch fabric
US20100061367A1 (en) Methods and apparatus related to lossless operation within a data center
EP2557742A1 (en) Systems, methods, and apparatus for a data centre
US9172645B1 (en) Methods and apparatus for destination based hybrid load balancing within a switch fabric
US12068978B2 (en) Methods and apparatus related to a flexible data center security architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant