WO2007038981A1 - Parallel database server and method for operating a database server
- Publication number
- WO2007038981A1 (PCT/EP2005/054761)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24532—Query optimisation of parallel queries
Abstract
A parallel database server (1) comprising a non-hierarchical structure of interconnected hard-disk-less processing nodes (7), at least several of said processing nodes comprising: one SQL processing element component (9) for executing SQL queries, local RAM (11) associated with the SQL processing elements, and local non-volatile writable memory (12) associated with said SQL processing element, for non-volatile, decentralized storage of the dataset, the entire dataset being partitioned and stored in said processing nodes.
Description
Knores-1 -pct
Parallel database server and method for operating a database server
Field of the invention
The present invention concerns a database server, in particular a high performance, massively parallel database server, and a method for operating such a database server.
Description of related art
Current high performance database servers are usually based on more or less generic server hardware with some features optimized for databases, such as vast RAID arrays for rapid data access, and usually make use of several parallel processors, or even arrays of high performance processing nodes. Extensive indexing is used to speed up data access times, together with model-specific data partitioning, or even replication, to manage the complex interactions required by the multiple processors. Even with all those advanced techniques, the systems are still held back by the inherent latency of disk access times (several milliseconds) and by the limit of sustainable throughput of hard-disk drives (less than 100 MB/s). Moreover, a multi-node computer environment using computers connected with high level protocols over complex layer stacks adds a computational overhead and a dependency on operating systems.
US-B1-6'463'433 describes a database system comprising one or more front end computers and one or more computer nodes interconnected by a network. A query from a user is transmitted from one of the front end computers to one of the computer nodes, called the home node, which extracts features from the query and sends it to other computer nodes. Each participating computer node then performs a search on its respective partition of the database. The structure of the database is thus hierarchical; the home nodes decide which computer node should participate in a query. Performance thus depends on a good repartition of the overall processing power between the home nodes and the other nodes. Moreover, each node is made up of a complete computer with its own processor, RAM and hard disk memory. Performance is limited by the answer time of the slowest computer in the network, and especially by the computer having the slowest access time to its hard disk memory. If new SQL commands have to be interpreted, or if an update to the SQL interpreter should be made, new software must be dispatched and installed in each participating computer. This system thus lacks rapidity and flexibility.
US-B1-5'857'180 describes a method and apparatus for parallel processing in a database environment. The method provides the ability to dynamically partition row sources and to direct row sources to one or more query slaves. The organization is thus hierarchical; a query coordinator controls the processing of a query. Moreover, the organization described in this document is purely software-based, and does not rely on any physical partitioning of the query slaves or memory partitions. It does not necessarily make an efficient use of the underlying hardware organization. If the method is used in a software-implemented shared disk system, performance strongly depends on the bandwidth of the communication bus to the disk.
US-RE-38'410 describes an apparatus for parallel data storage and processing. The apparatus is optimized for fast processing of pixmap images; applications to databases are not suggested. It relies on an array of processing nodes, each node comprising one processor and at least one disk. Performance is thus limited by the access time of the disks and by the speed of the general purpose processors in the nodes.
US-A1-2003/0018630 describes a data storage and retrieval device using a systolic array of FPGAs for performing searches over data in a shared disk. Although the use of dedicated FPGAs strongly improves the processing power of each node, the throughput still relies on the access time and bandwidth of the shared disk.
There is thus a need for a better parallel database server and for an improved method for operating such a server, in order to improve the response time to queries.
Brief summary of the invention
According to one aspect of the invention, these aims are achieved by means of a non-hierarchical structure of interconnected nodes. A non-hierarchical structure has the advantage, among others, of being less dependent on the processing power of higher level nodes, and of being more flexible and better adapted to various tasks and database organizations. Moreover, as each node can act as a master for some tasks, a higher redundancy is achieved, thus making the whole system less sensitive to the malfunctioning of one single node. The non-hierarchical structure also allows for a better scalability; processing power and memory space can be increased very easily by adding processing nodes. Different needs in input/output capacity will be accommodated by varying the ratio of input/output nodes (I/O nodes) in the structure. Thus, fewer I/O nodes will be required for applications such as data mining than for media streaming, video on demand or OLTP applications, for example.
According to another aspect, the nodes are diskless. One thus avoids the burden, costs and unreliability of an array of fast hard disks. Instead, each node has its own local flash memory for non-volatile or permanent storage of a partition of the entire dataset. The use of flash memory, instead of hard disks, further reduces the volume, footprint and power consumption of the server. Even the non-volatile storage of the database content is thus decentralized. A local RAM is also associated with each processing node, for short-term storage of the partition of the dataset on which the node performs computations. Thus, the data resides totally in the decentralized RAM during operation, and in the decentralized flash memory for non-volatile data storage. Each processing node thus has a very fast, not shared access to its local permanent and temporary memory space. During operation, the flash memory is always in write mode, and used as a destination for the write-through process during data updates and to ensure compliance with data retention tests in case of system failure.
By placing the flash and RAM memory in the immediate vicinity of the associated processing elements, the access latency can be reduced from several milliseconds in conventional systems to a few hundred nanoseconds, or even to a few tens of nanoseconds if the read accesses are entirely RAM based, the flash memories being only used as a backup.
According to another aspect of the invention, the processing elements in each node are made up of FPGA components programmed and optimized for executing SQL commands on the local dataset. By using dedicated components for processing SQL commands, instead of generic processors, a further increase in computational power can be achieved. In another embodiment, a dedicated ASIC is used instead of FPGA components.
Furthermore, the use of dedicated, relatively low-cost FPGA computing elements allows an increase of the parallelism and of the number of computing elements for a given cost. Thus, each computing node has a very short memory horizon, for example 128 MBytes only, and can perform extremely fast operations on the partition of the dataset in this small memory space. For example, search queries, such as "select ... from ... where" SQL queries, can be performed extremely fast, without any need to index the tables, thus reducing the load time.
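The trade-off described above can be sketched in software. The following is a hypothetical illustration (names and the row layout are not from the patent): because each node's partition fits in a small, fast local memory, a plain linear scan answers a "select ... where" query without any index, so no index needs to be built at load time.

```python
# Hypothetical sketch: an unindexed "select ... where" scan over one node's
# small in-RAM partition. With only ~128 MB per node, a full linear pass is
# fast enough that no index (and no index build at load time) is needed.
def select_where(partition, predicate, columns=None):
    """Scan every row of the local partition; keep rows matching the predicate."""
    result = []
    for row in partition:                      # full scan: no index lookup
        if predicate(row):
            result.append({c: row[c] for c in columns} if columns else row)
    return result

# Example: each row is a dict; the partition is one node's slice of a table.
partition = [{"id": 1, "qty": 5}, {"id": 2, "qty": 12}, {"id": 3, "qty": 7}]
hits = select_where(partition, lambda r: r["qty"] > 6, columns=["id"])
```

Each node would run such a scan on its own partition in parallel, so the per-query time stays bounded by the small local partition size rather than by the global dataset.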
Processing nodes are preferably mutually connected through a bi-dimensional, three-dimensional or n-dimensional mesh of high speed links, preferably serial links. The processing nodes are all of equal importance, as there is no hierarchy in the network structure; each node is part of the query and result-routing infrastructure.
For this, the processing nodes preferably each comprise at least one routing engine to forward data and commands to the relevant nodes within the structure. In a preferred embodiment, this routing engine is also made up of FPGA components, or even of the same FPGA chip as the processing element in the node. The processing nodes further comprise processing means for joining processing results computed by different processing nodes. In an embodiment, the joining of results is performed by the routing engines, which have dedicated computing capabilities. For example, during the routing of the result fragments of a sorted list, the routing engine will join the pre-sorted list fragments retrieved from the individual nodes into a fully sorted list, and deliver the consolidated, sorted result to the input/output node.
Each processing node may further comprise a storage management circuit for managing the storage and retrieval of data from and into said local RAM and local flash memory. In a preferred embodiment, the storage management circuit is also made up of FPGA components, preferably sharing an FPGA chip with the other FPGA components in the node.
Brief Description of the Drawings
The invention will be better understood with the aid of the description of an embodiment given by way of example and illustrated by the figures, in which:
Fig. 1 is a block diagram illustrating an example of a database server comprising 5 electronic cards.
Fig. 2 is a block diagram illustrating an example of an electronic card comprising one input/output node and 5 clusters.
Fig. 3 is a block diagram illustrating an example of a cluster comprising 10 processing nodes.
Fig. 4 is a block diagram illustrating an example of a processing node.
Detailed Description of possible embodiments of the Invention
Fig. 1 is a block diagram illustrating an example of a database server 1 comprising five electronic cards 3 in a cage. The electronic cards 3 may comprise connectors on one, two, three or all four sides for interconnections of the cards. In an embodiment, the cards are mounted on a backplane having a connecting bus for mounting and interconnecting any number of cards from 1 to n. In a preferred embodiment, the cards are mounted in a bi-dimensional array, each card being connected over serial links with all its direct neighbors. In the example, the five cards are connected in a single row, and thus each card is connected to a maximum of two other cards. It would be possible however to interconnect the cards in several rows and columns, thus having cards connected to three or four other cards. All cards preferably feature interconnection means for at least four different cards.
Access to the database server may be achieved through a user computer 2, for example a personal computer or a mainframe, running database software such as, for example, Oracle, SQLServer, DB2, MySQL, etc. The user computer 2 is used for entering data, storing queries, sending queries to the database server 1, retrieving and displaying the results over any suitable front-end, and various other database management tasks. Exchange of queries, commands, results and data between the user computer 2 and the database server 1 may be performed over an ODBC link or any other suitable high level database interface. The physical architecture of the database server 1 is thus transparent to the database software.
Fig. 2 is a block diagram illustrating an example of an electronic card 3 as used in the cage of Fig. 1. The electronic card of this embodiment comprises one input/output node (I/O node) 4, for example an FPGA or a dedicated component, for interconnecting the electronic card with its neighbors over a link 40, for example a serial and/or Ethernet link, for exchanging ODBC commands, for example. The card further comprises a plurality of clusters 5 of processing nodes. Each cluster is connected over preferably serial links with its direct neighbors in the North, South, West and East directions, and at least one cluster is connected with the I/O node 4. In the illustrated embodiment, the card comprises five clusters 5 of nodes arranged in a single row, each cluster being connected to a maximum of two neighbor clusters. Other arrangements of the clusters, for example in a bi-dimensional array, may be considered. Furthermore, it would be possible to have more than one I/O node 4, and/or to connect the I/O node or each I/O node with several clusters 5, especially when the throughput of retrieved data must be high.
The clusters 5 may be made up of various electronic components, such as individual integrated circuits, on a circuit board. In a preferred embodiment, the circuit board is organized like a DIMM (dual in-line memory module) removably held in a DIMM socket on the card 3. Other kinds of circuit boards can be used within the frame of the invention.
Figure 3 illustrates an example of a cluster 5 comprising, in this embodiment, ten processing nodes 7 in two lines of five. Again, each processing node is interconnected over preferably serial links to its direct neighbors in the North, South, West and East directions. Interconnection with other clusters can be made through the connecting pins of the circuit board on which the cluster is mounted and through conducting paths of the electronic cards 3.
Figure 4 illustrates an example of a processing node 7. In this embodiment, the processing node is implemented with two FPGA or ASIC chips 13, a RAM memory 11 and a non-volatile memory, for example flash memory, 12. In an embodiment, the processing node is implemented in Xilinx (Trademark) Virtex 4 components.
Each FPGA component 13 is designed and programmed so as to logically correspond to a routing engine 8, several (four in this embodiment) SQL processing engines 9, and several (two in the example) storage management circuits 10. The routing engines 8 forward incoming data and commands or queries, as well as processed results, to the relevant SQL processing engine 9 within the node, or to other nodes in the North, South, East or West direction. Preferably, the routing engine also features computing capabilities for processing the queries and the results routed to other nodes. For example, if a table is partitioned into several parts stored and processed by different processing nodes, each processing node may sort its partition and the routing engine may retrieve the pre-sorted fragments from the different nodes, join them, and deliver an entirely sorted list to another node. Very fast algorithms for joining pre-sorted lists may be used if the routing engines use a content addressable memory (CAM) and a helper storage memory; after the initial data read, only data pointers are passed within the engine, which keeps it compact and fast.
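The joining step performed by the routing engine can be sketched as a k-way merge of pre-sorted fragments. This is a hypothetical software analogue of what the patent describes doing in hardware with a CAM and pointer passing; the function name is illustrative.

```python
import heapq

# Hypothetical sketch of the routing engine's join step: each node returns its
# partition already sorted; merging the k pre-sorted fragments into one fully
# sorted list takes a single pass over the data, never a full re-sort.
def join_sorted_fragments(fragments, key=None):
    """Merge k pre-sorted lists into one sorted list (k-way merge)."""
    return list(heapq.merge(*fragments, key=key))

# Example: three nodes each deliver a pre-sorted fragment of the result.
merged = join_sorted_fragments([[1, 4, 9], [2, 3, 10], [5, 6]])
```

The single-pass property is what makes the hardware version attractive: the merge needs only to compare the current head of each fragment, so after the initial read only small pointers (not whole rows) move through the engine.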
The SQL processing engines 9 are designed and programmed so as to be able to answer SQL queries, and possibly other database management commands, on the partitions of the dataset stored in their local memory space 11, 12. In an embodiment, in order to keep the SQL processing engines simple, fast and cost effective, only a subset of the SQL commands and queries can be interpreted by the engines 9; other less frequent or less time-critical queries may be executed by other parts of the database server 1, in the user computer 2 or by the user front-end application.
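The split between node-handled and delegated commands can be sketched as a simple dispatch. This is a hypothetical illustration only: the patent does not specify which SQL subset the engines support, so the supported set and names below are assumptions.

```python
# Hypothetical sketch of the command split described above: a small subset of
# SQL is interpreted by the node engines; everything else is handed back to
# the host database software. The supported set is illustrative, not from the
# patent.
NODE_SUPPORTED = {"SELECT"}          # assumption: engines handle simple scans

def dispatch(query):
    """Route a query to the node engines or back to the host software."""
    verb = query.strip().split()[0].upper()
    return "node-engine" if verb in NODE_SUPPORTED else "host-software"

targets = [dispatch("SELECT * FROM t WHERE x > 3"),
           dispatch("CREATE VIEW v AS SELECT 1")]
```

Keeping the in-engine subset small is the design choice that keeps the FPGA logic compact; rare or complex commands simply take the slower host path.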
During operation, i.e. when processing results to queries, the SQL processing engines preferably read data only from their local RAM 11. An extremely short read-access time is thus ensured. Updated tables are written both in the RAM and in the local flash memory, to ensure that no data is lost even in case of system failure. The flash memory is thus accessed only in write mode during operation; read access is only required for restoring data after a failure or reboot.
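The write-through policy described above can be summarized in a few lines. This is a hypothetical sketch (class and method names are illustrative): every update goes to both the RAM and the flash copy, reads are served from RAM only, and flash is read back only to rebuild the RAM after a failure or reboot.

```python
# Hypothetical sketch of the node's write-through storage policy. Two dicts
# stand in for the real DDRII RAM and NAND flash of the patent.
class WriteThroughStore:
    def __init__(self):
        self.ram = {}     # fast, volatile working copy
        self.flash = {}   # non-volatile copy, written through on every update

    def write(self, key, value):
        self.ram[key] = value      # update the working copy
        self.flash[key] = value    # write-through: flash stays consistent

    def read(self, key):
        return self.ram[key]       # reads never touch flash during operation

    def recover(self):
        self.ram = dict(self.flash)  # rebuild RAM from flash after a reboot

store = WriteThroughStore()
store.write("row42", "hello")
store.ram.clear()                  # simulate a power loss wiping the RAM
store.recover()                    # RAM is restored from the flash copy
```

Because reads never wait on the flash, read latency is bounded by RAM access time, while the flash copy guarantees that a committed update survives a crash.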
In order to further increase the speed, the memory horizon of each processing node is kept low, for example at 128 MBytes. This ensures very fast select or sort operations on the small local data partition.
Moreover, the small size of the local partitions allows fast searches on un-indexed data. The speed is nearly independent of the size of the global dataset.
The storage management circuits 10 are used for controlling access to the flash and, possibly, RAM memory. RAM 11 is preferably based on DDRII components, the flash on NAND flash chips.
The two FPGAs 13 are interconnected mutually and with other processing nodes over North, South, East and West links which, when coupled, build an almost uniform non-hierarchical network mesh, for example in a bi-dimensional array, in a cube, in an n-cube or any other suitable organization. Additional channels may be added in one or several directions to increase throughput in preferred directions; moreover, links between nodes may be broken to insert user access nodes or other data peripheral devices.
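Forwarding on such a bi-dimensional mesh can be sketched with dimension-order ("X then Y") routing. The patent does not specify the routing algorithm, so this is only one common, assumed choice for meshes with North/South/East/West links; the coordinate convention (East increases x, North increases y) is likewise an assumption.

```python
# Hypothetical sketch: dimension-order (XY) routing on a 2-D mesh of nodes.
# Correct the X coordinate first over East/West links, then Y over North/South.
def xy_route(src, dst):
    """Return the list of hops (N/S/E/W) from src to dst on a 2-D mesh."""
    (sx, sy), (dx, dy) = src, dst
    hops = []
    step = 1 if dx > sx else -1
    while sx != dx:                # X dimension first
        sx += step
        hops.append("E" if step == 1 else "W")
    step = 1 if dy > sy else -1
    while sy != dy:                # then the Y dimension
        sy += step
        hops.append("N" if step == 1 else "S")
    return hops

path = xy_route((0, 0), (2, 1))    # two hops East, then one hop North
```

XY routing is deadlock-free on a mesh and needs no routing tables in the nodes, which fits a compact FPGA routing engine; richer schemes (extra channels, diagonal links) would extend this sketch.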
Other interconnect schemes, including circular schemes, diagonal links, etc. may be used within the frame of the invention.
In an embodiment, the global size of the Flash and RAM memory is 250 GBytes; scalability to 10 TBytes or more may be achieved with current components.
Claims
1. A parallel database server (1) comprising a non-hierarchical structure of interconnected harddiskless processing nodes (7), at least several of said processing nodes comprising: at least one SQL processing element component (9) for executing SQL queries, local RAM (11) associated with said SQL processing element, local non-volatile writable memory (12) associated with said SQL processing element, for non-volatile, decentralized storage of the dataset, the entire dataset being partitioned and stored in said processing nodes.
2. The parallel database server of claim 1, wherein said SQL processing elements (9) are made up of FPGA components.
3. The parallel database server of claim 1, wherein said SQL processing elements (9) are made up of dedicated ASICs.
4. The parallel database server of one of the claims 1 to 3, wherein said local non-volatile memory (12) comprises Flash memory.
5. The database server of one of the claims 1 to 4, wherein said local RAM (11) allows for decentralized storage of the entire dataset during operation.
6. The database server of claim 4, wherein read access to said dataset during operation is only performed from said local RAM (11).
7. The database server of one of the claims 1 to 6, at least several of said processing nodes (7) further comprising a routing engine (8) to forward data and queries within said structure, at least some of said routing engines (8) having computing capabilities for joining results processed by different processing nodes (7).
8. The database server of one of the claims 1 to 7, at least several of said processing nodes (7) further comprising at least one storage management circuit (10) for managing the storage and retrieval of data from and into said local RAM (11) and local non-volatile memory (12).
9. The database server of claim 8, said SQL processing elements (9), said routing engines (8) and said storage management circuits (10) being all made up of FPGA components.
10. The database server of one of the claims 1 to 9, said structure being non-hierarchical.
11. The database server of one of the claims 1 to 10, said processing nodes (7) being mutually interconnected over serial links.
12. The database server of claim 11, said processing nodes (7) being logically arranged in a mesh, each processing node (7) being connected to all its immediate neighbours in said mesh.
13. The database server of one of the claims 1 to 12, comprising a cage (1) with several electronic cards (3), each electronic card (3) having a plurality of clusters (5), each cluster comprising a plurality of processing nodes as electronic chips mounted on a removable circuit board.
14. The database server of claim 13, each electronic card (3) further comprising at least one user I/O node (4), for example a computer, connected to at least one cluster (5) on said electronic card, said I/O node (4) establishing a database link with a user computer (2) outside said database server (1).
15. The database server of claim 14, said database link being an ODBC link.
16. The database server of one of the claims 1 to 15, wherein at least some of the largest tables of the dataset are partitioned between each of said processing nodes (7).
17. The database server of one of the claims 1 to 16, further comprising means for broadcasting queries to said processing nodes (7), each processing node comprising means for deciding if and/or how to answer said query.
18. The database server of one of the claims 16 to 17, wherein each one of said processing nodes (7) comprises means for performing SQL computation on its own partition of said tables.
19. The database server of one of the claims 1 to 18, being free of hard disk drive array.
20. A method for operating a database server, comprising the steps of: storing the entire dataset in a decentralized structure of harddiskless processing nodes (7), broadcasting queries through said structure of processing nodes, joining results processed by several SQL processing nodes.
21. The method of claim 20, wherein each one of said processing nodes (7) decides if and how to answer a broadcast query.
22. The method of claim 20, wherein each one of said processing nodes (7) can execute said query in a locally available partition of a dataset.
23. The method of one of the claims 20 to 22, further comprising the step of partitioning at least one table of a dataset into several parts and storing each said part in a different non-volatile memory of said processing nodes (7).
24. The method of claim 23, further comprising a step of simultaneously performing a SQL query in each of said processing nodes (7) over said part of said dataset, and joining the results processed by various processing nodes.
25. The method of claim 24, said step of joining results processed by various processing nodes (7) making use of content addressable memory, and wherein after an initial data read, only data pointers are passed.
26. The method of one of the claims 20 to 25, comprising the steps of: partitioning a table into a plurality of parts, storing each of said parts in a RAM of one of said processing nodes (7), broadcasting said query to said processing nodes, said query being a SQL query, locally computing in at least some of said processing nodes (7) a result to said query, based on the locally available part of the dataset and using a locally available SQL processing engine (9), said SQL processing engine being made of an FPGA or ASIC component, storing an updated part of the dataset into a locally available non-volatile, writable semiconductor memory in each processing node (7), retrieving and joining results computed by different processing nodes (7).
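The partition/broadcast/join flow recited in the method claims can be sketched as follows (an illustrative Python fragment, not part of the claimed subject matter; in the disclosed embodiments the local computation is performed by FPGA or ASIC engines, not software): a table is split into one part per node, the query is broadcast to every node, each node computes on its own part, and the partial results are joined.

```python
def partition_table(rows, n_nodes):
    """Split a table round-robin into one part per processing node."""
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts

def broadcast_and_join(parts, query):
    """Broadcast `query` to every node, let each node compute it on its
    own local partition, then join (here: concatenate) the partial
    results into one global result."""
    partial = [query(part) for part in parts]  # one partial result per node
    return [row for result in partial for row in result]
```

Each call to `query` touches only one small partition, so the per-node work is independent of the global table size, and the nodes can run in parallel.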
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2005/054761 WO2007038981A1 (en) | 2005-09-22 | 2005-09-22 | Parallel database server and method for operating a database server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2005/054761 WO2007038981A1 (en) | 2005-09-22 | 2005-09-22 | Parallel database server and method for operating a database server |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007038981A1 true WO2007038981A1 (en) | 2007-04-12 |
Family
ID=36061475
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2005/054761 WO2007038981A1 (en) | 2005-09-22 | 2005-09-22 | Parallel database server and method for operating a database server |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2007038981A1 (en) |
Non-Patent Citations (3)
Title |
---|
BARU C K ET AL: "DB2 PARALLEL EDITION", IBM SYSTEMS JOURNAL, IBM CORP. ARMONK, NEW YORK, US, vol. 34, no. 2, 21 March 1995 (1995-03-21), pages 292 - 322, XP000526619, ISSN: 0018-8670 * |
DEWITT D ET AL: "PARALLEL DATABASE SYSTEMS: THE FUTURE OF HIGH PERFORMANCE DATABASE SYSTEMS", COMMUNICATIONS OF THE ASSOCIATION FOR COMPUTING MACHINERY, ACM, NEW YORK, NY, US, vol. 35, no. 6, 1 June 1992 (1992-06-01), pages 85 - 98, XP000331759, ISSN: 0001-0782 * |
QIANG LI ET AL: "A TRANSPUTER T9000 FAMILY BASED ARCHITECTURE FOR PARALLEL DATABASE MACHINES", COMPUTER ARCHITECTURE NEWS, ACM, NEW YORK, NY, US, vol. 21, no. 5, 1 December 1993 (1993-12-01), pages 55 - 62, XP000432538, ISSN: 0163-5964 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012058265A1 (en) * | 2010-10-26 | 2012-05-03 | ParElastic Corporation | Apparatus for elastic database processing with heterogeneous data |
US8214356B1 (en) | 2010-10-26 | 2012-07-03 | ParElastic Corporation | Apparatus for elastic database processing with heterogeneous data |
US8386473B2 (en) | 2010-10-26 | 2013-02-26 | ParElastic Corporation | Process architecture for elastic stateful shared nothing system |
US8386532B2 (en) | 2010-10-26 | 2013-02-26 | ParElastic Corporation | Mechanism for co-located data placement in a parallel elastic database management system |
US8478790B2 (en) | 2010-10-26 | 2013-07-02 | ParElastic Corporation | Mechanism for co-located data placement in a parallel elastic database management system |
US8943103B2 (en) | 2010-10-26 | 2015-01-27 | Tesora, Inc. | Improvements to query execution in a parallel elastic database management system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Bitton et al. | A taxonomy of parallel sorting | |
US20140280375A1 (en) | Systems and methods for implementing distributed databases using many-core processors | |
CN107710193B (en) | Data placement control for distributed computing environments | |
Băzăr et al. | The Transition from RDBMS to NoSQL. A Comparative Analysis of Three Popular Non-Relational Solutions: Cassandra, MongoDB and Couchbase. | |
US7899851B2 (en) | Indexing method of database management system | |
US8892558B2 (en) | Inserting data into an in-memory distributed nodal database | |
US20090113172A1 (en) | Network topology for a scalable multiprocessor system | |
US8812645B2 (en) | Query optimization in a parallel computer system with multiple networks | |
JP2004538548A (en) | New massively parallel supercomputer | |
JP2002544598A (en) | Search engine with two-dimensional linear scalable parallel architecture | |
US10810206B2 (en) | Efficient multi-dimensional partitioning and sorting in large-scale distributed data processing systems | |
CN113672583A (en) | Big data multi-data source analysis method and system based on storage and calculation separation | |
JP4511469B2 (en) | Information processing method and information processing system | |
WO2016206100A1 (en) | Partitioned management method and apparatus for data table | |
WO2007038981A1 (en) | Parallel database server and method for operating a database server | |
US20090043728A1 (en) | Query Optimization in a Parallel Computer System to Reduce Network Traffic | |
JP4620593B2 (en) | Information processing system and information processing method | |
Vilaça et al. | On the expressiveness and trade-offs of large scale tuple stores | |
CN111190991A (en) | Unstructured data transmission system and interaction method | |
US11853298B2 (en) | Data storage and data retrieval methods and devices | |
JP4559971B2 (en) | Distributed memory information processing system | |
CN109376117B (en) | Computing chip and operation method thereof | |
van der Steen et al. | Overview of high performance computers | |
Hua et al. | Semantic-aware data cube for cloud networks | |
Upchurch | A Migratory Near Memory Processing Architecture Applied to Big Data Problems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 05786970 Country of ref document: EP Kind code of ref document: A1 |