Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides a data storage method, a data storage device, electronic equipment, a storage medium and a program product.
The data storage device may be integrated in an electronic device, which may be a terminal, a server, or other devices. The terminal comprises, but not a smart phone, a tablet personal computer, a notebook computer, a desktop computer, an intelligent sound box, an intelligent watch, intelligent voice interaction equipment, intelligent household appliances, a vehicle-mounted terminal, an aircraft and the like, wherein the server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and an artificial intelligent platform. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein. The embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like.
In some embodiments, the data storage device may also be integrated in a plurality of electronic apparatuses, for example, the data storage device may be integrated in a plurality of servers, and the data storage method of the present application is implemented by the plurality of servers.
In some embodiments, the server may also be implemented in the form of a terminal.
For example, referring to fig. 1a, the data storage method is integrated in a server, the server may acquire graph data and graph data storage space, the graph data includes nodes and edges connected between the nodes, the graph data storage space includes index storage space and attribute storage space, attribute information of the graph data is stored in the attribute storage space, the attribute information includes node attribute information and edge attribute information, for any node, an index identification of the node is determined according to directivity information corresponding to the node, the directivity information corresponding to the node is used for pointing to the node attribute information and adjacent edge attribute information of the node stored in the attribute storage space, the adjacent edge attribute information is edge attribute information of an adjacent edge connected with the node, and according to a connection relation of the nodes in the graph data, the index identification of the node is stored in the index storage space to store the graph data in the graph data storage space.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
The following will describe in detail. The order of the following examples is not limited to the preferred order of the examples. It will be appreciated that in the specific embodiments of the present application, data related to a user, such as user account numbers, characters, places, events, relationships between entities, etc., are related to the user, and when the embodiments of the present application are applied to specific products or technologies, user permissions or agreements need to be obtained, and the collection, use and processing of related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
Artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) is a technology that utilizes a digital computer to simulate the human perception environment, acquire knowledge, and use knowledge, which can enable machines to function similar to human perception, reasoning, and decision. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
Among them, natural language processing (Nature Language processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It is studying various theories and methods that enable effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Thus, the research in this field will involve natural language, i.e. language that people use daily, so it has a close relationship with the research in linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic questions and answers, knowledge graph techniques, and the like.
With research and progress of artificial intelligence technology, research and application of artificial intelligence technology are being developed in various fields, such as common smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned, automatic driving, unmanned aerial vehicle, robot, smart medical treatment, smart customer service, car networking, smart transportation, etc., and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and become more and more important value.
Cloud technology (Cloud technology) refers to a hosting technology for integrating hardware, software, network and other series resources in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.
Cloud technology (Cloud technology) is based on the general terms of network technology, information technology, integration technology, management platform technology, application technology and the like applied by Cloud computing business models, and can form a resource pool, so that the Cloud computing business model is flexible and convenient as required. Cloud computing technology will become an important support. Background services of technical networking systems require a large amount of computing, storage resources, such as video websites, picture-like websites, and more portals. Along with the high development and application of the internet industry, each article possibly has an own identification mark in the future, the identification mark needs to be transmitted to a background system for logic processing, data with different levels can be processed separately, and various industry data needs strong system rear shield support and can be realized only through cloud computing.
Cloud storage (cloud storage) is a new concept that extends and develops in the concept of cloud computing, and a distributed cloud storage system (hereinafter referred to as a storage system for short) refers to a storage system that integrates a large number of storage devices (storage devices are also referred to as storage nodes) of various types in a network to work cooperatively through application software or application interfaces through functions such as cluster application, grid technology, and a distributed storage file system, so as to provide data storage and service access functions for the outside.
At present, a storage method of a storage system is that logical volumes are created, and when the logical volumes are created, a physical storage space is allocated to each logical volume, where the physical storage space may be a disk of a certain storage device or a plurality of storage devices. The client stores data on a certain logical volume, that is, the data is stored on a file system, the file system divides the data into a plurality of parts, each part is an object, the object not only contains the data but also contains additional information such as a data Identification (ID) and the like, the file system writes each object into a physical storage space of the logical volume, and the file system records storage position information of each object, so that when the client requests to access the data, the file system can enable the client to access the data according to the storage position information of each object.
The storage system allocates physical storage space for a logical volume, specifically, the physical storage space is divided into stripes in advance according to the set of capacity estimation of objects stored in the logical volume (the estimation often has a large margin with respect to the capacity of the objects actually to be stored) and redundant array of independent disks (RAID, redundant Array of INDEPENDENT DISK), and one logical volume can be understood as one stripe, so that the physical storage space is allocated for the logical volume.
In this embodiment, a data storage method is provided, as shown in fig. 1b, and a specific flow of the data storage method may be as follows:
110. the method comprises the steps of obtaining graph data and a graph data storage space, wherein the graph data storage space comprises an index storage space and an attribute storage space.
The graph data includes nodes and edges connected between the nodes. The graph data refers to data represented in a graph form. A graph consists of nodes (or vertices) and edges, each edge connecting two nodes, representing some relationship between them. The data storage method provided by the embodiment of the application can be applied to various data storage scenes, so that the graph data can be the data of various scenes. For example, the graph data may store data of various application scenarios such as an application program, a knowledge graph, a traffic network, and the like, where the graph data may represent entities by nodes and represent relationships between the entities by edges. The graph data can be data of application programs, wherein nodes represent user account numbers of the application programs, edges represent relations among the user account numbers, or can be data of traffic networks, the nodes represent traffic nodes in the traffic networks, and the edges represent relations among the traffic nodes, or the data storage method provided by the embodiment of the application can be applied to the field of artificial intelligence, for example, the graph data of knowledge graphs are formed by organizing entities, relations and attributes, the nodes represent entities such as characters, places and events, the edges represent relations among the entities, and the graph data can assist tasks of various artificial intelligence scenes such as question answering systems, information extraction and semantic search.
Where storage space refers to a space for storing data, the space may be a logical space capable of storing and accessing data, and the space may be defined and implemented by software or a data structure. For example, the storage space may be a combination of one or more data structures that can store and access data, such as file text storage, tree data structures, hash structures, linked lists, and adjacency lists. The map data storage space refers to a space for storing map data. In the embodiment of the application, the graph data storage space comprises two sub-storage spaces, namely an index storage space and an attribute storage space, wherein the index storage space is used for storing index identifiers of nodes in the graph data, and the attribute storage space is used for storing attribute information in the graph data. The index storage space and the attribute storage space can be deployed on a single computer or a server, and can also be deployed in cloud storage or distributed storage.
For example, as shown in the graph data storage schematic diagram of fig. 1c, graph data including nodes V1, V2, V3, V4, V5, and the like may be acquired, and a graph data storage space for storing the graph data may be acquired. In the initial state, the data stored in the map data storage space is empty.
120. Attribute information of the graph data is stored in an attribute storage space, and the attribute information includes node attribute information and side attribute information.
The node attribute information and the side attribute information are divided into attribute information of nodes and attribute information of sides of the graph data. In the graph data, attribute information refers to information describing the node or edge attribute, and in different graph data, attribute information may be different. For example, the node attribute information may include names or labels, features, etc. of the node representation entities, and the side attribute information may include features, etc. of the sides, such as labels, setup time, length, weight, strength, etc. of the sides.
Typically, the graph data is a graph data store implemented using a key-value store, where key is a point/edge relational encoding and value is a property (i.e., attribute information) of a point or edge. Therefore, in the embodiment of the application, the attribute information (i.e., value) of each node and each edge can be directly obtained from the key-value of the graph data, and the attribute information is stored in the attribute storage space according to the order of the nodes and edges in the graph data. For example, as shown in the graph data storage schematic diagram of fig. 1c, attribute information such as attribute 1, attribute 2, attribute 3, attribute 4, and the like in the graph data may be sequentially stored in the attribute storage space.
In some embodiments, the index storage space can be implemented in a single-machine full-volume storage deployment, and the device of the attribute storage space can be implemented in a cloud storage or distributed storage mode. Where stand-alone full-size storage refers to a storage manner in which the entire data set is stored entirely on a single computer or server. In stand-alone mass storage, all data is stored on the same physical device. Therefore, the topological structure of the graph is placed on a single-machine full-volume storage, high-efficiency graph traversing and navigation capability can be provided, attribute information is stored in a cloud or distributed mode, and advantages of a distributed storage system, such as high expandability, high fault tolerance and high performance, can be fully utilized. This can support the storage and querying requirements of large-scale data.
In some embodiments, the attribute information of the graph data may be stored in the attribute storage space in the form of a columnar storage. Column storage means that data is organized and stored in a column rather than a row mode, so that attribute information of an attribute storage space can be read according to requirements, only required column data is read, unnecessary columns are prevented from being scanned, and accordingly data query performance is improved.
In some embodiments, in the attribute storage space, the attribute information of the node and its adjacent edge may be stored in an adjacent position, so that when the attribute of the node and its adjacent edge needs to be acquired, the data may be acquired through a loading or reading operation, so as to improve the efficiency of data reading. Meanwhile, for the query operation of simultaneously acquiring the node and the adjacent side attribute information, the query operation can be directly read at the adjacent position, so that jump type data access is avoided, and the query efficiency of the graph data is improved. Specifically, storing attribute information of the map data in an attribute storage space includes:
determining adjacent edges to be stored, which are connected with nodes to be stored, according to the connection relation of the nodes to be stored in the graph data, wherein the nodes to be stored are any node in the graph data;
And storing the node attribute information of the node to be stored and the side attribute information of the adjacent side to be stored in adjacent positions in the attribute storage space.
Wherein, the adjacent edges of the nodes refer to the edges connected with the nodes in the graph data.
For example, a node in the graph data may be taken as a node to be stored. For example, for any node V1 to be stored, it is connected to nodes V2 and V5 by edges L21 and L15, respectively. The attribute information of the node V1 to be stored, the side L21, and the side L15 may be stored as { valueV1, valueL, 15, valueL }, with value representing the attribute such that the attribute information of the node V1 to be stored and the attribute information of its neighboring sides are stored in neighboring locations.
In some embodiments, the graph data can be traversed sequentially according to the number sequence of the nodes in the graph data to determine the nodes to be stored, so that missing nodes are avoided, and meanwhile, the sequential traversing can simplify the data processing process, so that the operation on the nodes is more orderly and clear, and the error probability is reduced while the processing efficiency is improved. Specifically, storing the attribute information of the map data in the attribute storage space further includes:
and determining the node to be stored from the graph data according to the numbering sequence of the nodes in the graph data.
In the graph data, each node has a unique identifier or number, so in the embodiment of the application, the nodes in the graph data can be traversed according to the sequence of the identifiers or numbers of the nodes in the graph data, and the traversed nodes are used as the nodes to be stored. For example, as shown in FIG. 1d, the graph data includes nodes V1, V2, V3, V4, and V5, which are connected by edges L21, L13, L15, L42, L45, and L53. V1 may be first used as a node to be stored according to the numbering sequence, and attribute information of the node to be stored V1, the edge L21, the edge L13, and the edge L15 may be stored in the attribute storage space, to obtain { valueV1, valueL13, valueL15, valueL }. Then V2 is used as a node to be stored, the attribute information of the node to be stored V2, the side L21 and the side L42 is continuously stored in an attribute storage space to obtain { valueV1, valueL13, valueL, valueL21; valueV2, valueL21; valueL }, and the like, V3, V4 and V5 are continuously used as the node to be stored in sequence, and the final attribute storage space stores the information as follows {valueV1,valueL13,valueL15,valueL21;valueV2,valueL21;valueL42;valueV3,valu eL13;valueL53;valueV4,valueL42;valueL45;valueV5,valueL53;valueL45}.
Since a node may be connected to other nodes by one or more adjacent edges in the graph data. Therefore, in some embodiments, in order to efficiently store and find data of a node and its adjacent edges, for a node to be stored having a plurality of adjacent edges, edge attribute information of the plurality of adjacent edges of the node to be stored may be stored in an attribute storage space according to a numbering order and/or an edge direction in the graph data, so that the attribute information of the edges is conveniently and quickly accessed. For example, the attribute information of the node V1 to be stored of the graph data and its adjacent sides L21, L13, L15 may be stored as valueV, valueL, valueL, valueL21 in the attribute storage space, and the node V5 to be stored and its adjacent sides L45, L53 as valueV, valueL53, valueL in the attribute storage space in the storage order of the side attribute information of the first stored side and the side attribute information of the second stored side, and in the storage order of the number order from small to large.
In some embodiments, node attribute information and side attribute information in the graph data may be stored in log form. Specifically, the attribute storage space includes a log-type storage space (i.e., log file). Log-Structured Storage is a format of data storage that records data in the form of a Log and appends the data to a Log file that is written sequentially. Compared with the traditional random writing mode, the log type storage has the characteristics of sequential writing, high writing throughput and the like. For example, in an embodiment of the present application, each attribute information to be stored may be recorded as a log, and sorted in a log file according to a timestamp (or other incremental number), so that all node attribute information and side attribute information of the graph data are stored in a log form in one log file.
130. For any node, determining the index identification of the node according to the directivity information corresponding to the node.
The directivity information corresponding to the node is used for pointing to node attribute information of the node stored in the attribute storage space and adjacent side attribute information, and the adjacent side attribute information is side attribute information of an adjacent side connected with the node. The directivity information is information pointing to the attribute information stored in the attribute storage space. For example, the directivity information may include a pointer, a reference, or other representation of information having directivity. Index identification refers to an identifier or flag for an index that is used to quickly find or access corresponding data in an index data structure.
In some implementations, the directivity information may include a pointer. For example, when storing the attribute information of the nodes and edges of the graph data in the attribute storage space, after storing the node attribute information and the adjacent edge attribute information of each node to be stored in the attribute storage space, a pointer 1 to the node attribute information of the node to be stored and a pointer 2 to the adjacent edge attribute information of the node to be stored may be generated. The pointer 1 and the pointer 2 may be identified as indexes of nodes. It should be noted that, when the node to be stored has multiple adjacent edges, that is, multiple adjacent edge attributes, a corresponding pointer may be generated for each adjacent edge attribute information, or a pointer may be generated for multiple adjacent edge attributes.
In some embodiments, one piece of directivity information may be generated for the attribute information of the adjacent edge of the node to be stored, where the attribute information of the adjacent edge of the node to be stored is the input edge, that is, the directivity information corresponding to the node includes directivity information of the node attribute information pointing to the node to be stored (hereinafter referred to as node directivity information), directivity information of the attribute information of the adjacent edge of the node to be stored (hereinafter referred to as side directivity information), and directivity information of the attribute information of the adjacent edge of the node to be stored (hereinafter referred to as input edge directivity information). For example, for the node V1 to be stored and its adjacent sides L21, L13, L15, a node pointer 1 (i.e., node directivity information) pointing to the attribute information of the node V1, an outgoing side pointer 2 (i.e., outgoing side directivity information) pointing to the attribute information of the outgoing sides L13 and L15, an incoming side pointer 3 (i.e., incoming side directivity information) pointing to the attribute information of the incoming side L21 may be generated.
In some embodiments, the storage position of the attribute information can be used as directivity information, so that when the attribute information of the node or the edge needs to be acquired, the corresponding storage position can be found directly according to the directivity information, so that the required attribute data can be quickly positioned, the cost of searching in the whole attribute storage space is avoided, and the query efficiency of the graph data is improved. Meanwhile, the storage position of the attribute information is used as directivity information, when the attribute information of the node or the edge needs to be updated or modified, only the corresponding storage position is required to be updated, other attribute data cannot be influenced, so that the consistency of the data is improved, and the possibility of data redundancy and errors is reduced. Specifically, for any node, before determining the index identifier of the node according to the directivity information corresponding to the node, the method further comprises:
The storage position of the node attribute information in the attribute storage space and the storage position of the side attribute information in the attribute storage space are respectively used as the directivity information of the node corresponding to the node attribute information and the directivity information of the side corresponding to the side attribute information.
Wherein the storage locations in the attribute storage space may be represented by unique identifiers or addresses. This identifier may be a number, a string, or other form of data that indicates the specific location of the attribute in the memory space so that the corresponding attribute information stored in the attribute memory space can be accurately located and accessed according to the memory location. The manner in which the storage locations are represented may vary in different systems and scenarios.
For example, for the node V1 to be stored and its adjacent edges L21, L13, L15, the storage address of the node attribute information of the node V1 in the attribute storage space and the storage address of the edge attribute information of the edges L21, L13, L15 in the attribute storage space may be respectively taken as a pointer to the node V1 and a pointer to the adjacent edge of the node V1. In practical applications, the storage addresses of the side attribute information of the outgoing sides L13 and L15 in the attribute storage space may be used as the outgoing side pointers pointing to the attribute information of the outgoing sides L13 and L15, and the storage addresses of the side attribute information of the incoming side L21 in the attribute storage space may be used as the incoming side pointers pointing to the attribute information of the outgoing side L21.
In some embodiments, when node attribute information and side attribute information in graph data are stored in a log form, the storage position of the attribute information in the attribute storage space can be represented by an offset value, so that the storage space occupied by directivity information is reduced, and the storage efficiency of the graph data is improved. Further, when the attribute information in the attribute storage space is read according to the directivity information, the attribute information of the corresponding position can be directly accessed according to the offset value to improve the efficiency of reading the attribute information of the map data from the attribute storage space. For example, the offset value is used to represent an offset of the attribute information from the start data, e.g., the attribute storage space may be in the form of a log file, and the storage location of the attribute information in the log file may be represented by a byte offset (i.e., offset value) of the storage location of the attribute information in the log file from the beginning of the file.
In some embodiments, the index number of the node can be used as a key, the directivity information of the node is used as a value, and a key value pair corresponding to the node is constructed so as to correlate the index number with the node information, thereby being convenient for managing the node and quickly positioning the corresponding node information according to the index number. Specifically, for any node, determining an index identifier of the node according to directivity information corresponding to the node includes:
aiming at any node, acquiring an index number of the node;
Taking the index number of the node as a key, taking directivity information corresponding to the node as a value, and constructing a key value pair corresponding to the node;
and taking the key value pair corresponding to the node as an index identifier of the node.
The index number of the node refers to the number for identifying the node in the graph data. The index number of a node is used to uniquely identify each node in the graph for quick node location and access. The index number of a node may be a unique ID generated for the graph data node, or may be some unique attribute value of the node. For example, the index number of the node of the graph data may be expressed in an integer or a character string, such as sequentially numbering the nodes of the graph data as 0,1, 2, 3, or the like, or such as sequentially numbering the nodes of the graph data as 0001, 0002, 0003, or the like. In the embodiment of the application, the index number of the node can be the number of the node in the graph data, or can be other self-defined numbers. To save memory, the index number of a node may be represented by an 8-byte value.
For example, the index number of a node in the graph data may be used as a key to point to a pointer with a value and a pointer from edge to edge. Specifically, taking the node V1 to be stored as an example, after storing the node attribute information of the node V1 to be stored and the side attribute information of the adjacent sides L21, L13, L15 in the attribute storage space, the attribute storage space may return the node attribute information of the node V1 such as the node pointer 1, and return the storage positions of the side attribute information of the adjacent sides L21, L13, L15 in the attribute storage space such as the outgoing side pointer 2 and the incoming side pointer 3. The number 0001 of the node V1 can be used as a Key, the node pointer 1, the outgoing side pointer 2 and the incoming side pointer 3 corresponding to the node V1 are used as values, a Key Value pair is constructed, namely Key is "0001", value (NodePointer: 1, outEdgePointer:2, inEdgePointer: 3), wherein NodePointer:1, outEdgePointer:2, inEdgePointer:3 respectively represent the node pointer 1, the outgoing side pointer 2 and the incoming side pointer 3. The constructed key value pair may be identified as an index to node V1. And by analogy, when traversing the nodes in the graph data and taking the traversed nodes as the nodes to be stored, key value pairs corresponding to all the nodes in the graph data can be synchronously constructed to serve as index identifiers of the nodes, so that index identifiers of all the nodes in the graph data are obtained.
In some embodiments, in addition to being able to manage and retrieve nodes in the Index storage space through key-value pair, a secondary Index may be established in the Index storage space according to the specified query information, where the secondary Index refers to an additional Index established beyond the Primary Index (Primary Index). The primary index is typically built based on the primary key (PRIMARY KEY), while the secondary index may be built based on any column or combination of columns in the table. By creating an index for a non-primary key column in the table, nodes conforming to query conditions (i.e., specifying query information) of a secondary index can be located directly in the index storage space without traversing the entire index. Meanwhile, the secondary indexes can establish a plurality of indexes according to different query information, so that diversified query requirements are supported, and the flexibility and applicability of query are improved. Specifically, the data storage method further comprises the following steps:
And establishing a designated node index for the index identification of the designated query information corresponding node according to the designated query information, wherein the designated node index is used for determining the index identification of the designated query information corresponding node.
The designated query information is information designated for querying the node, and the designated query information is related to the node. The specified query information may be information such as a query type or specific query content.
For example, the designated query information may include a type of directivity information, such as outgoing side directivity information or incoming side directivity information, that is, an index relationship between the type of directivity information and a node of a key value pair including the type of directivity information (i.e., designated node index) may be established, so that keys (i.e., index numbers) of all nodes including the corresponding type of directivity information may be directly queried from the index relationship according to the type of directivity information, so as to obtain the directivity information in the index identifier of the node according to the key, so as to obtain the corresponding attribute information from the attribute storage space according to the directivity information. Or the designated query information may include a keyword of a certain storage location, that is, an index relationship between the keyword and a node containing the keyword in the key value pair (i.e., designated node index) may be established, so that according to the keyword, keys (i.e., index numbers) of all nodes containing the keyword may be directly queried from the index relationship, so as to obtain, according to the keys, directivity information in an index identifier of the node, so as to obtain, according to the directivity information, corresponding attribute information from the attribute storage space. It should be noted that the secondary index occupies additional storage space, and each time the data in the index changes (e.g., an insert, delete, or update operation), the index also needs to be updated accordingly, which increases the overhead of the write operation. Thus, in deciding whether to create a secondary index, there is a tradeoff between the improvement in query performance and the increase in storage and write overhead.
For another example, the specified query information may include other information related to the node, such as a node name, in addition to the directivity information. For example, an index relationship (i.e., a designated node index) between a node name of a node and a key (i.e., an index number) of the node may be established, so that the key (i.e., the index number) of the node corresponding to the node name may be directly queried from the index relationship according to the node name, so as to obtain, according to the key, directivity information in an index identifier of the node, so as to obtain, according to the directivity information, corresponding attribute information from an attribute storage space.
140. According to the connection relation of the nodes in the graph data, the index identification of the nodes is stored in the index storage space, so that the graph data is stored in the graph data storage space.
For example, after determining the index identifier of each node in the graph data, the index identifier of the node may be stored in the index storage space according to the connection relationship of the node in the graph data and the graph topology (i.e., the connection relationship of the node in the graph data). Compared with the prior art, the method for storing the data of the graph by using the key-value key value stores the graph data, and the data storage method provided by the embodiment of the application stores the index identification and the attribute information of the graph data in the index storage space and the attribute storage space separately so as to realize separate management of the graph index and the graph attribute, thereby improving the storage flexibility and the query efficiency of the graph data. When inquiring the graph data, the graph topological structure can be directly traversed in the index storage space without reading the graph attribute information so as to inquire and traverse the graph, thereby reducing the call of data resources in the inquiring process, avoiding the waste of resources and improving the inquiring efficiency of the graph data.
In some implementations, the index storage space may include an adjacency table through which index identifications of nodes in the graph data may be stored to store topology of the graph data in the index storage space. Adjacency tables use arrays and linked lists to represent nodes and edges in the graph data. Each element of the array corresponds to a node, and the linked list is used for storing other vertexes adjacent to the node, so that the node can be quickly positioned through the array, the adjacent node information is accessed, and the adjacent table represents the topological structure of the graph data.
In some embodiments, the connection relation of the nodes in the graph data can be stored through the index array, so that the topological structure of the graph data is stored in the index storage space, the nodes in the graph data can be quickly positioned according to the topology of the graph data, and the efficiency of querying the graph data is improved. Meanwhile, the adjacent nodes of each node are stored by utilizing the adjacent node table so as to rapidly locate the adjacent nodes of the nodes in the graph data, thereby improving the query efficiency of the graph data. Specifically, according to the connection relation of the nodes in the graph data, storing the index identifier of the node in the index storage space to store the graph data in the graph data storage space, including:
Creating an index array according to the connection relation of the nodes in the graph data, wherein the index array is used for storing index identifications of the nodes according to the connection relation of the nodes, and elements in the index array correspond to the nodes;
creating an adjacent node table of the element corresponding node in the index array, wherein the adjacent node table is used for storing index information of the adjacent node of the element corresponding node.
The adjacent nodes of the element corresponding nodes refer to nodes connected with the element corresponding nodes through edges in the graph data. The neighbor node table may be a linked list or other table data structure.
Wherein, the index information refers to information for nodes in the index storage space. For example, the index information may include a combination of one or more of an index number, an index identification, etc. of the node.
For example, the index information may include index numbers of neighboring nodes. In the embodiment of the application, the adjacency list uses an array to store the index identification of the node, each element of the array corresponds to one node in the graph data, and the linked list is used for storing the index identifications of other nodes adjacent to the node. Each element of the array may contain an index identifier of its corresponding node, while each element of the array may also contain a pointer to a linked list of its corresponding node, where the linked list stores the index number of the node connected to the corresponding node of the element. Therefore, when the graph data is queried, the corresponding elements (nodes) can be quickly positioned in the array of the adjacency list through the information such as the index number, so that the index identification of the nodes is obtained. In addition, the linked list of the node corresponding to the element can be accessed according to the pointer which points to the corresponding linked list and is contained in the element, so that the index identification of the neighbor node can be obtained according to the obtained index number by obtaining the stored index number of the neighbor node from the linked list. Therefore, the embodiment of the application directly maintains the pointer pointing to the neighbor node of each node through the linked list in the adjacency list, so that the neighbor node of one node can be directly accessed without Index when being inquired, and the data storage and inquiry strategy of Index-free adjacency (Index-free adjacency) is realized. Therefore, the nodes and the adjacent nodes in the graph data can be quickly accessed through the data of the adjacent table and the linked list, and the query efficiency of the graph data is improved.
Specifically, as shown in a schematic diagram of storing data in a storage space in fig. 1e, attribute information is stored in a column in an attribute storage space. The adjacency list is represented in the index storage space by a dot array (Matrix), which is a two-dimensional data structure, which is an array composed of rows and columns. Each element in the array of points corresponds to a node in the graph data, the element storing an attribute pointer (i.e., node pointer), an out-side pointer, and an in-side pointer for the corresponding node. The index information in the linked list to which each element in the point array points may contain an outgoing edge array representing its outgoing edge information and an incoming edge array representing its incoming edge information. The outgoing edge array may include index numbers of nodes adjacent to the element corresponding node along the outgoing edge direction and pointers of adjacent edges (i.e., edge attribute pointers), and the incoming edge array may include index numbers of nodes adjacent to the element corresponding node along the incoming edge direction and pointers of adjacent edges (i.e., edge attribute pointers). The outgoing and incoming arrays may further include one or more of meta information, locks, and the number of adjacent edges of the element corresponding node along the outgoing and incoming directions, etc. Meta information is used to describe attributes, features and other relevant information of the edge arrays and the in-edge arrays, helping users understand and manage the data. Locks are used to limit access to threads of the array-side array to avoid race conditions and data inconsistency problems, improving concurrency performance and stability of the program.
In some embodiments, the index information further includes an index identifier, so that a linked list of a node corresponding to an element in the index array can be accessed according to a pointer to the linked list corresponding to the element, and the stored index identifier of the adjacent node is directly obtained from the linked list, so that quick access of the adjacent node of the node in the graph data is realized, and the query efficiency of the graph data is improved.
In some implementations, adjacency tables of index storage space can be stored in a tree structure. The tree structure refers to a nonlinear data structure, wherein each node has zero or more child nodes, and the child nodes further comprise their own child nodes, and the like, so as to form a hierarchical relationship. For example, the adjacency list may be stored in a tree structure in which each node represents one node in the graph data, and the parent-child relationship between nodes of the tree structure represents the connection relationship between nodes in the graph data.
In the embodiment of the application, the adjacency list of the index storage space can be stored through various tree structures, such as a common tree, a B tree/B+ tree and the like. In some embodiments, the adjacency list of index storage is stored in an Append-only tree structure, such as Append-only-B+tree (Append-only-B+tree), which means Append operations can only be performed and delete or update operations cannot be performed. Wherein the b+ tree is a self-balancing tree data structure that maintains data ordering and allows lookup, sequential access, insert and delete operations of logarithmic time complexity. B+ trees are widely used for index structures for databases and file systems. An application-only tree is a data structure that is characterized by allowing new data to be added only at the end of the tree, but not modifying existing data, and is commonly used in situations where historical data needs to be preserved or version control needs to be performed.
In some embodiments, when the index identification and attribute information of the graph data are stored in the index storage space and the attribute storage space separately, when the node related information is deleted or modified, the related information in the index storage space and the attribute storage space needs to be deleted or modified at the same time so as to maintain consistency of graph data storage.
For example, in some embodiments, after the graph data is stored in the graph data storage space, when any node in the graph data is to be deleted, the index identifier in the index storage space may be deleted first, and then according to the directivity information corresponding to the index identifier, the corresponding attribute information in the attribute storage space is deleted, so as to avoid accessing unnecessary nodes and attribute information, and improve the processing efficiency of the graph data. Specifically, the data storage method further comprises the following steps:
Acquiring a deleting instruction of any node;
according to the deleting instruction, deleting the index identifier of any node from the index storage space;
Returning the directivity information corresponding to the index mark deleted by the index storage space to the attribute storage space;
And deleting node attribute information and adjacent side attribute information pointed by the directivity information returned by the index storage space from the attribute storage space according to the directivity information returned by the index storage space.
The deletion instruction is an instruction for deleting a node in the graph data. The delete instruction may carry information including the index number of the node, the node name, etc. that can be used to locate the node in the index storage space.
For example, after storing the graph data in the graph data storage space, when any node in the graph data such as the node V3 is to be deleted, the deletion operation corresponding to the deletion instruction may be decomposed into two operations connected in series. Specifically, firstly, according to the index number of the node V3 carried by the deleting instruction, querying an element corresponding to the node V3 in an array of an adjacent table in the index storage space to delete the element and a linked list pointed by a pointer in the element, and according to index identifiers of adjacent nodes in the linked list, searching elements corresponding to the adjacent nodes (hereinafter referred to as adjacent elements) in the array of the adjacent table, deleting the index identifiers of the node V3 from the linked list pointed by the adjacent elements, so as to delete all index identifiers related to the node V3 to be deleted, which are stored in the index space. Meanwhile, directivity information contained in the elements of the node V3 in the array of the adjacency list is returned to the attribute storage space, so that the attribute storage space searches corresponding node attribute information and side attribute information according to a storage position corresponding to the directivity information and deletes the corresponding node attribute information, all attribute information which is stored in the attribute storage space and is related to the node V3 to be deleted is deleted, and deletion of nodes for storing the graph data in the graph data storage space is realized.
In some embodiments, the operation of deleting the node attribute information and the adjacent side attribute information pointed by the directivity information returned by the index storage space from the attribute storage space may be to mark the node attribute information and the adjacent side attribute information pointed by the directivity information returned by the index storage space in the attribute storage space as deleted states, so that the storage positions of other attribute information in the attribute storage space are unchanged, thereby eliminating the need to update the directivity information of other nodes or sides, and reducing the system overhead and complexity.
For another example, in some embodiments, after the graph data is stored in the graph data storage space, when any node in the graph data is to be modified, the directivity information of the node determined based on the index storage space can be directly positioned in the attribute storage space and the attribute information of the node can be modified, so that unnecessary traversal operation is avoided, the modification process is simplified, and the modification efficiency is improved. And updating the directivity information reversely based on the modified attribute information to update the index identifier of the node stored in the index storage space, so that invalid node traversal is avoided while consistency of the node index identifier and the attribute information is ensured, and the data modification efficiency is improved. Meanwhile, the data storage method specifically further comprises the following steps:
Acquiring a modification instruction of attribute information to be modified, wherein the attribute information to be modified is the attribute information of any node or any side in the graph data;
acquiring directivity information corresponding to the attribute information to be modified according to the modification instruction and the index identifier stored in the index storage space;
returning the acquired directivity information to the attribute storage space;
inquiring attribute information to be modified from the attribute storage space according to the directivity information returned by the index storage space;
Updating the queried attribute information to be modified according to the modification instruction to obtain updated attribute information;
According to the updated attribute information, updated directivity information is determined, and the updated directivity information is used for pointing to the updated attribute information;
and updating the index identification of the node corresponding to the attribute information to be modified stored in the index storage space according to the updated directivity information.
The attribute information to be modified refers to node attribute information or side attribute information to be modified in the graph data. It can be understood that when the attribute information to be modified is node attribute information, the directivity information corresponding to the attribute information to be modified is node directivity information of the corresponding node, and when the attribute information to be modified is side attribute information, the directivity information corresponding to the attribute information to be modified is side directivity information of the corresponding side.
The modification instruction is an instruction for modifying node attribute information or side attribute information in the graph data. The modification instruction may carry information including the index number of the node, the node name, etc. that can be used to locate the node in the index storage space.
For example, after storing the graph data in the graph data storage space, when the information of any node in the graph data, such as node V3, is to be modified, the modification operation corresponding to the modification instruction may be decomposed into three operations connected in series. Specifically, firstly, according to the index number of the node V3 carried by the modification instruction, an element corresponding to the node V3 is queried in an array of an adjacency list of the index storage space, so that node directivity information contained in the element is returned to the attribute storage space, the attribute storage space queries corresponding node attribute information according to a storage position corresponding to the node directivity information, and the queried node attribute information is modified according to the modification instruction, so that updated node attribute information is obtained, and the attribute information stored in the attribute storage space is modified according to the modification instruction. And taking the storage position of the updated node attribute information in the attribute storage space as updated directivity information, and returning the updated directivity information to the index storage space so that the index storage space updates the node directivity information contained in the element corresponding to the node V3 into the updated directivity information. Therefore, all directivity information related to the attribute to be modified in the index storage space is modified according to the modification instruction, so that the modification of the stored image data in the image data storage space is realized.
In some embodiments, the attribute information to be modified in the attribute storage space may be marked as a deleted state, and the updated attribute information is stored at the tail of the attribute storage space, so that the storage position of other attribute information in the attribute storage space is unchanged, and thus, the directivity information of other nodes or edges does not need to be updated, and the system overhead and the complexity are reduced. Specifically, updating the queried attribute information to be modified according to the modification instruction to obtain updated attribute information, including:
marking the attribute information to be modified in the attribute storage space as a deleting state;
generating updated attribute information according to the modification instruction and the attribute information to be modified;
determining updated directivity information according to the updated attribute information, including:
And storing the updated attribute information in the tail part of the attribute storage space to determine the updated directivity information.
For example, the attribute information to be modified is queried from the attribute storage space, the attribute information to be modified is copied to the memory, and the attribute information to be modified in the attribute storage space is marked as a deletion state. And combining the copied attribute information to be modified with the value of the update content carried by the modification instruction by the attribute storage space to obtain updated attribute information. And adding the updated attribute information to the end of the attribute information stored in the attribute storage space, and returning the storage position of the updated attribute information in the attribute storage space to the index storage space as updated directivity information. So that the index storage space updates the node directivity information contained in the element corresponding to the node V3 to the updated directivity information.
In some embodiments, after the graph data is stored in the graph data storage space, when any node in the graph data is to be queried, the attribute information of the node can be directly positioned in the attribute storage space based on the directivity information of the node determined by the index storage space, so that unnecessary traversal operation is avoided, the modification process is simplified, and the query efficiency is improved. Specifically, the data storage method further comprises the following steps:
acquiring an attribute information query instruction, wherein the attribute information query instruction is used for querying at least one of node attribute information of at least one node and side attribute information of at least one side of the graph data;
Acquiring directivity information corresponding to the attribute information inquiry instruction according to the attribute information inquiry instruction and the index identifier stored in the index storage space;
And acquiring attribute information pointed by the directivity information corresponding to the attribute information inquiry instruction from the attribute storage space according to the acquired directivity information.
In the embodiment of the application, the attribute information query instruction can be used for querying node attribute information of any node, node attribute information of adjacent nodes of any node or side attribute information of adjacent sides of the graph data. The attribute information query instruction may carry information including an index number of a node, a node name, etc. that can be used to locate the node in the index storage space.
For example, after the graph data is stored in the graph data storage space, when the attribute information of any node, such as the node V3, in the graph data is to be queried, according to the index number of the node V3 carried by the attribute information query instruction, an element corresponding to the node V3 may be queried in the array of the adjacency table in the index storage space, so as to return the node directivity information contained in the element to the attribute storage space, and the attribute storage space queries the attribute information of the corresponding node according to the storage position corresponding to the node directivity information.
For another example, after the graph data is stored in the graph data storage space, when node attribute information of any node in the graph data, such as a neighboring node of the node V3, or side attribute information of a neighboring side is to be queried, an element corresponding to the node V3 may be queried in an array of an adjacency table in the index storage space according to an index number of the node V3 carried by the attribute information query instruction, so as to return node directivity information of the neighboring node of the node V3 and side directivity information of the neighboring side stored in a linked list pointed by the element to the attribute storage space, so that the attribute storage space queries node attribute information of the neighboring node of the node V3 or side attribute information of the neighboring side according to the returned node directivity information and a storage position corresponding to the side directivity information.
The data storage scheme provided by the embodiment of the application can be applied to various graph data storage scenes. For example, drawing data of a knowledge graph is taken as an example, drawing data and drawing data storage space are obtained, the drawing data comprises nodes and edges connected between the nodes, the drawing data storage space comprises index storage space and attribute storage space, the attribute information of the drawing data is stored in the attribute storage space, the attribute information comprises node attribute information and edge attribute information, for any node, according to the directivity information corresponding to the nodes, index identification of the nodes is determined, the directivity information corresponding to the nodes is used for pointing to the node attribute information and the adjacent edge attribute information of the nodes stored in the attribute storage space, the adjacent edge attribute information is the edge attribute information of the adjacent edges connected with the nodes, and according to the connection relation of the nodes in the drawing data, the index identification of the nodes is stored in the index storage space, so that the drawing data is stored in the drawing data storage space.
As can be seen from the above, the data storage method provided by the embodiment of the present application separately stores the index identifier and the attribute information of the graph data in the index storage space and the attribute storage space, so as to implement separate management of the graph index and the graph attribute, so as to improve the storage flexibility and the query efficiency of the graph data. The index storage space stores index identifiers of nodes corresponding to the attribute information according to the connection relation of the nodes in the graph data. Therefore, when the graph data is queried, the connection relation of the nodes can be directly traversed in the index storage space without reading the graph attribute information so as to query and traverse the graph. After the related data to be queried is determined, corresponding attribute information is acquired from the attribute storage space according to the directivity information corresponding to the index identification of the queried node, so that the call of data resources in the query process is reduced, the resource waste is avoided, and the query efficiency of graph data is improved.
The method described in the above embodiments will be described in further detail below.
In this embodiment, a method of the embodiment of the present application will be described in detail.
As shown in fig. 2a, a specific flow of a data storage method is as follows:
210. the method comprises the steps of obtaining graph data and graph data storage space, wherein the graph data comprises nodes and edges connected between the nodes, and the graph data storage space comprises index storage space and attribute storage space.
220. Attribute information of the graph data is stored in an attribute storage space, and the attribute information includes node attribute information and side attribute information.
230. For any node, determining the index identification of the node according to the directivity information corresponding to the node.
The directivity information corresponding to the node is used for pointing to node attribute information of the node stored in the attribute storage space and adjacent side attribute information, and the adjacent side attribute information is side attribute information of an adjacent side connected with the node.
240. According to the connection relation of the nodes in the graph data, the index identification of the nodes is stored in the index storage space, so that the graph data is stored in the graph data storage space.
For example, the graph data storage space may include GraphIndex components (index storage space) and GraphLog components (attribute storage space), the GraphIndex components implement storage of the graph topology by way of an application-only b+tree, and management of the secondary index, and the GraphLog components implement read operations by way of Mmap (memory mapping technique), which is a method of mapping files or other resources into computer memory so that these resources can be accessed as efficiently as memory. The GraphLog component adopts a log type storage mode to store the inserted data, and adopts an efficient writing mode of application, so that quick writing can be supported.
When the graph data is written or queried, the graph data is added, deleted and revised through two components GraphIndex and GraphLog. For example, when the graph data is written into the graph data storage space, the attribute information of the graph data is inserted into GraphLog, graphLog, and an offset value (or a data combination of [ page value+offset in page ]) of the corresponding file is returned, and then GraphIndex uses the returned value as a value, and uses the key generated by combining the point edges as an index identifier of the node to store the index in GraphIndex.
Specifically, the implementation flow of this procedure can be expressed as UserPut (vertex) =graphindex. Put (key), graphlog. Put (vertex).
Wherein UserPut is the user's write operation, and the incoming parameter is the point (edge) data. The process may be broken down into a 1, graphLog put operation that writes the point (edge) data into the file while returning an offset value, and a 2, graphIndex put operation that first generates a corresponding index value (key) from the user's point (edge) data and then writes the offset value returned by GraphIndex as a value into the GraphIndex file.
In the above process, graphIndex may first generate a unique ID (or use a unique attribute of a point) for each node point in the graph data, and use the unique ID as a unique identification key of the point (typically, the unique identification is represented by an 8-byte value to save memory), and then the key points to a pointer with a value and a pointer with an outgoing edge and an incoming edge. Where the pointer to the value is an offset of the attribute value file, and the pointer to the outgoing or incoming edge points to a contiguous piece of memory, storing the value of the pointer to the attribute of the edge and the other endpoint of the edge. Meanwhile, in GraphIndex, a graph topological structure formed by a linked list and an adjacent table and attribute values in GraphLog can be adopted for separation, so that when the adjacent edge of a query point is queried without indexes, graphIndex can realize the query of all the adjacent edges of the point only by pointing a pointer to a memory block of a link table of the edge, and meanwhile, the purposes that when graph traversal or graph calculation is performed, calculation and traversal can be directly performed on the graph without using attribute information of the graph can be met, thereby reducing the occupied space of data in a memory and reducing the IO pressure of a disk. When the calculation is completed, graphIndex returns the index value of the attribute point/edge to GraphLog, and then the attribute query is performed at GraphLog to obtain the query result.
Obviously, the embodiment of the application manages the topological structure and attribute information of the graph data through two groups of components GraphIndex and GraphLog so as to realize the separation of the graph topology and the attribute, increase the flexibility of graph data storage, facilitate the complex query and calculation of the graph, increase the use efficiency of the memory and improve the query calculation performance. Because the attribute information of the same side of the graph data is only required to be stored once, the cost of storage can be reduced. Meanwhile, for the scene with reduced performance requirements, attribute values can be stored on other clusters through distributed design, so that the storage expandability is improved.
In the related art, when traversing or calculating on the graph, the attribute values of the nodes or edges are not required to be returned sometimes, and only the connection relation of the nodes or edges in the graph topology is required to be known, so that if the above method is used, redundant overhead is generated. Such as a query match (n: person } - [. 2] - > (m) return m, which is a two-degree query whose statement is to query a two-degree neighbor named jack (assuming only one type of edge), then return the attributes of the two-degree neighbor, in which the attributes of the one-degree neighbor of jack and the attributes of an edge are both returned to memory if the above technique is employed, then two-hop expansion is performed on a one-piece basis, and if the attributes of a point or edge are large, then memory problems must occur in small memories.
However, in the embodiment of the present application, the topology of the graph data is stored through GraphIndex, that is, compared with the above structure, the value in the topology of the embodiment of the present application is an address offset of the point-edge attribute in the GraphLog component, instead of the real attribute value, so that the size of the address offset is smaller than the size of the attribute value. Therefore, when the above query process is implemented in the embodiment of the present application, the graph may be expanded on GraphIndex, so that the second degree neighbor of jack may be quickly found, but after the query is completed from GraphIndex, an address of GraphLog is returned, and at this time, the corresponding attribute value is obtained from GraphLog according to the offset address, so that the above query process may be efficiently completed.
However, when the point of name=jack is queried and located, the query is directly from the topological structure of the graph data without jack information, so that the embodiment of the application establishes a secondary index in the index storage space according to the appointed query information besides the management and retrieval nodes in GraphIndex form through key value pairs. To quickly locate the location (key of the point) of the name=jack point in the graph topology by means of a secondary index made on the name attribute, to quickly locate the starting point, and then to perform secondary expansion.
In the embodiment of the application, when the graph data is stored in the graph data storage space, one or more of deleting operation, modifying operation and inquiring operation can be performed on the graph data stored in the graph data storage space. It should be noted that, the deleting operation, the modifying operation and the querying operation in the embodiment of the present application may all perform batch operation under the concurrent condition, so as to improve the throughput capability of reading and writing.
250. And according to the deleting instruction, deleting the graph data stored in the graph data storage space.
For example, after storing the map data in the map data storage space, a delete operation may be performed on the map data. As shown in fig. 2b, the flow of the specific deletion operation is as follows:
251. And acquiring a deleting instruction of any node.
252. And deleting the index identification of any node from the index storage space according to the deleting instruction.
253. And returning the directivity information corresponding to the index identification deleted by the index storage space to the attribute storage space.
And deleting node attribute information and adjacent side attribute information pointed by the directivity information returned by the index storage space from the attribute storage space according to the directivity information returned by the index storage space.
For example, the delete operation on the graph data may represent an operation of UserDelete (vertex) =graphlog. Wherein UserDelete (vertex)'s semantics are points (i.e., nodes) that indicate that the user is to delete a vertex. This operation is split into two operations of in-series, 1.GraphIndex deleting the index corresponding to the vertex, and returning the value of the attribute corresponding to the index in the offset of the log file, and 2.GraphLog deleting the corresponding attribute value (i.e. attribute information) according to the returned offset.
260. And carrying out modification operation on the graph data stored in the graph data storage space according to the modification instruction.
For example, after storing the graph data in the graph data memory space, a modification operation may be performed on the graph data. As shown in fig. 2c, the flow of the specific modification operation is as follows:
261. and acquiring a modification instruction of attribute information to be modified, wherein the attribute information to be modified is the attribute information of any node or any side in the graph data.
262. And acquiring directivity information corresponding to the attribute information to be modified according to the modification instruction and the index identifier stored in the index storage space.
263. And returning the acquired directivity information to the attribute storage space.
264. And inquiring attribute information to be modified from the attribute storage space according to the directivity information returned by the index storage space.
265. And updating the queried attribute information to be modified according to the modification instruction to obtain updated attribute information.
266. And determining updated directivity information according to the updated attribute information, wherein the updated directivity information is used for pointing to the updated attribute information.
267. And updating the index identification of the node corresponding to the attribute information to be modified stored in the index storage space according to the updated directivity information.
For example, a modify operation on the graph data may represent operations UserUpdate (vertex) =graphindex. The specific operation process comprises the steps of 1.GraphIndex firstly inquires vertex and returns an offset value, 2.GraphLog copies data to a memory according to the returned offset value and marks the current record as a deleting state, 3.GraphLog combines the value copied to the memory and the value of a part to be updated, catches up to the tail end of a file and returns a new offset value, and 4.GraphIndex obtains the new offset value and indexes corresponding attribute offset (namely updating directivity information) in the index before bit updating.
270. And according to the attribute information query instruction, performing query operation on the graph data stored in the graph data storage space.
For example, after storing the graph data in the graph data memory space, a query operation may be performed on the graph data. As shown in fig. 2d, the flow of the specific query operation is as follows:
271. And acquiring an attribute information query instruction, wherein the attribute information query instruction is used for querying at least one of node attribute information of at least one node and side attribute information of at least one side of the graph data.
272. And acquiring directivity information corresponding to the attribute information inquiry instruction according to the attribute information inquiry instruction and the index identifier stored in the index storage space.
273. And acquiring attribute information pointed by the directivity information corresponding to the attribute information inquiry instruction from the attribute storage space according to the acquired directivity information.
For example, a query operation on graph data may represent operations UserRetrieve (vertex) =graphlog. The specific operation process is as follows, 1, such as VertexSet =graph index. Retrieve (TRANVERSAL/Computer) (Vertex), when a user performs graph data query (including traversal/graph calculation), graphIndex performs graph-on-graph wandering according to a graph topology structure, and finds an ID set of a point meeting the requirement. 2. E.g., properties=graphlog.batch get (VertexSet), then sending all the offsets corresponding to the set of points to GraphLog component returns the attribute values of all points (edges), resulting in chaxun results.
As can be seen from the above, in the embodiment of the present application, the method of separately storing the graph topology and the graph attribute is used to store the attribute graph, that is, the topology structure and the attribute information of the graph data are managed by two groups of components GraphIndex and GraphLog, so that the graph topology is stored separately as the graph index, and the graph attribute is stored in another space as an independent part, so that the storage and the query of the large attribute graph are realized by the design of separating the edge aggregation graph and the attribute, and the query performance can be greatly improved, thereby realizing the high efficiency of the graph calculation and the complex query of the graph, and reducing the resource overhead. Meanwhile, a pointer pointing to a neighbor node of each node is directly maintained for each node through a linked list in the adjacency list, so that when the neighbor node of one node is inquired, the index is not needed, direct access is realized, the data storage and inquiry strategy without index adjacency is realized, the caching of the memory to the attribute is reduced through a secondary attribute inquiry mode, and the utilization rate of the memory is improved.
In order to better implement the method, the embodiment of the application also provides a data storage device, which can be integrated in electronic equipment, and the electronic equipment can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices, and the server can be a single server or a server cluster consisting of a plurality of servers.
For example, in this embodiment, a method according to an embodiment of the present application will be described in detail by taking a specific integration of a data storage device in a server as an example.
For example, as shown in fig. 3, the data storage device may include an acquisition unit 310, an attribute storage unit 320, an index determination unit 330, and an index storage unit 340, as follows:
First acquisition unit 310
The method comprises the steps of acquiring graph data and graph data storage space, wherein the graph data comprises nodes and edges connected between the nodes, and the graph data storage space comprises index storage space and attribute storage space.
(Two) attribute storage unit 320
The attribute information for storing the graph data in the attribute storage space includes node attribute information and side attribute information.
In some embodiments, the attribute storage unit comprises a first attribute storage subunit and a second attribute storage subunit, wherein the first attribute storage subunit is used for determining adjacent edges to be stored, which are connected with the nodes to be stored, according to the connection relation of the nodes to be stored in the graph data, the nodes to be stored are any node in the graph data, and the second attribute storage subunit is used for storing node attribute information of the nodes to be stored and edge attribute information of the adjacent edges to be stored in adjacent positions in the attribute storage space.
In some embodiments, the attribute storage unit further comprises a third attribute storage subunit, including a third attribute storage subunit, configured to determine a node to be stored from the graph data according to the numbering order of the nodes in the graph data.
(III) index determination unit 330
The node index identification method comprises the steps of determining index identification of a node according to directivity information corresponding to the node for any node, wherein the directivity information corresponding to the node is used for pointing to node attribute information and adjacent side attribute information of the node stored in an attribute storage space, and the adjacent side attribute information is side attribute information of an adjacent side connected with the node.
In some embodiments, the data storage device further comprises a directivity unit, wherein the directivity unit is used for respectively taking the storage position of the node attribute information in the attribute storage space and the storage position of the side attribute information in the attribute storage space as directivity information of the node corresponding to the node attribute information and directivity information of the side corresponding to the side attribute information.
In some embodiments, the index determination unit comprises a first index determination subunit, a second index determination subunit and a third index determination subunit, wherein the first index determination subunit is used for acquiring an index number of a node for any node, the second index determination subunit is used for taking the index number of the node as a key, directivity information corresponding to the node is used as a value, a key value pair corresponding to the node is constructed, and the third index determination subunit is used for taking the key value pair corresponding to the node as an index identifier of the node.
In some embodiments, the data storage device further comprises a designated index unit, wherein the designated index unit is used for establishing a designated node index for index identification of a designated query information corresponding node according to the designated query information, and the designated node index is used for determining the index identification of the designated query information corresponding node.
(IV) index storage unit 340
And storing the index identification of the node in the index storage space according to the connection relation of the node in the graph data so as to store the graph data in the graph data storage space.
In some embodiments, the index storage unit comprises a first index storage subunit and a second index storage subunit, wherein the first index storage subunit is used for creating an index array according to the connection relation of nodes in the graph data, the index array is used for storing index identifications of the nodes according to the connection relation of the nodes, elements in the index array correspond to the nodes, the second index storage subunit is used for creating a neighboring node table of the elements in the index array, and the neighboring node table is used for storing index information of neighboring nodes of the elements correspond to the nodes.
In some embodiments, the data storage device further comprises a deleting unit, wherein the deleting unit is used for acquiring a deleting instruction for any node, deleting the index identifier of any node from the index storage space according to the deleting instruction, returning directivity information corresponding to the index identifier deleted by the index storage space to the attribute storage space, and deleting node attribute information pointed by the directivity information returned by the index storage space and adjacent side attribute information from the attribute storage space according to the directivity information returned by the index storage space.
In some embodiments, the data storage device further comprises a modification unit, wherein the modification unit is used for obtaining a modification instruction of attribute information to be modified, the attribute information to be modified is attribute information of any node or any side in the graph data, obtaining directivity information corresponding to the attribute information to be modified according to the modification instruction and an index identifier stored in an index storage space, returning the obtained directivity information to the attribute storage space, inquiring the attribute information to be modified from the attribute storage space according to the directivity information returned by the index storage space, updating the inquired attribute information to be modified according to the modification instruction, obtaining updated attribute information, determining updated directivity information according to the updated attribute information, wherein the updated directivity information is used for pointing to the updated attribute information, and updating the index identifier of the node corresponding to the attribute information to be modified stored in the index storage space according to the updated directivity information.
In some embodiments, updating the queried attribute information to be modified according to the modification instruction to obtain updated attribute information comprises marking the attribute information to be modified in an attribute storage space as a deletion state, generating the updated attribute information according to the modification instruction and the attribute information to be modified, and determining the updated directivity information according to the updated attribute information, wherein the updated attribute information is stored at the tail part of the attribute storage space to determine the updated directivity information.
In some embodiments, the data storage device further comprises a query unit, wherein the query unit is used for acquiring an attribute information query instruction, the attribute information query instruction is used for querying at least one of node attribute information of at least one node and side attribute information of at least one side of the graph data, acquiring directivity information corresponding to the attribute information query instruction according to the attribute information query instruction and index identifiers stored in the index storage space, and acquiring attribute information pointed by directivity information corresponding to the attribute information query instruction from the attribute storage space according to the acquired directivity information.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
As can be seen from the above, the data storage device of the present embodiment includes an acquisition unit, an attribute storage unit, an index determination unit, and an index storage unit. The device comprises an acquisition unit, an index determination unit and an index storage unit, wherein the acquisition unit is used for acquiring graph data and graph data storage space, the graph data comprises nodes and edges connected between the nodes, the graph data storage space comprises an index storage space and an attribute storage space, the attribute storage unit is used for storing attribute information of the graph data in the attribute storage space, the attribute information comprises node attribute information and edge attribute information, the index determination unit is used for determining an index identifier of each node according to directivity information corresponding to the node, the directivity information corresponding to the node is used for pointing to the node attribute information and the adjacent edge attribute information of the node stored in the attribute storage space, the adjacent edge attribute information is the edge attribute information of the adjacent edge connected with the node, and the index storage unit is used for storing the index identifier of the node in the index storage space according to the connection relation of the nodes in the graph data so as to store the graph data in the graph data storage space.
Therefore, the embodiment of the application stores the index identification and the attribute information of the graph data in the index storage space and the attribute storage space separately so as to realize the separate management of the graph index and the graph attribute and improve the storage flexibility and the query efficiency of the graph data. The index storage space stores index identifiers of nodes corresponding to the attribute information according to the connection relation of the nodes in the graph data. Therefore, when the graph data is queried, the connection relation of the nodes can be directly traversed in the index storage space without reading the graph attribute information so as to query and traverse the graph. After the related data to be queried is determined, corresponding attribute information is acquired from the attribute storage space according to the directivity information corresponding to the index identification of the queried node, so that the call of data resources in the query process is reduced, the resource waste is avoided, and the query efficiency of graph data is improved.
The embodiment of the application also provides electronic equipment which can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like, and the server can be a single server or a server cluster formed by a plurality of servers and the like.
In some embodiments, the data storage device may also be integrated in a plurality of electronic apparatuses, for example, the data storage device may be integrated in a plurality of servers, and the data storage method of the present application is implemented by the plurality of servers.
In this embodiment, a detailed description will be given taking an example that the electronic device of this embodiment is a server, for example, as shown in fig. 4, which shows a schematic structural diagram of the server according to the embodiment of the present application, specifically:
The server may include one or more processor cores 'processors 410, one or more computer-readable storage media's memory 420, a power supply 430, an input module 440, and a communication module 450, among other components. Those skilled in the art will appreciate that the server architecture shown in fig. 4 is not limiting of the server and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. Wherein:
The processor 410 is a control center of the server, connects various parts of the entire server using various interfaces and lines, performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 420, and calling data stored in the memory 420. In some embodiments, the processor 410 may include one or more processing cores, and in some embodiments, the processor 410 may integrate an application processor that primarily processes operating systems, user interfaces, applications, etc., and a modem processor that primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.
The memory 420 may be used to store software programs and modules, and the processor 410 may perform various functional applications and data processing by executing the software programs and modules stored in the memory 420. The memory 420 may mainly include a storage program area that may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), etc., and a storage data area that may store data created according to the use of the server, etc. In addition, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, memory 420 may also include a memory controller to provide processor 410 with access to memory 420.
The server also includes a power supply 430 that provides power to the various components, and in some embodiments, the power supply 430 may be logically connected to the processor 410 via a power management system, such that charge, discharge, and power consumption management functions are performed by the power management system. Power supply 430 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The server may also include an input module 440, which input module 440 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The server may also include a communication module 450, and in some embodiments the communication module 450 may include a wireless module, through which the server may wirelessly transmit over short distances, thereby providing wireless broadband internet access to the user. For example, the communication module 450 may be used to assist a user in e-mail, browsing web pages, accessing streaming media, and the like.
Although not shown, the server may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 410 in the server loads executable files corresponding to the processes of one or more application programs into the memory 420 according to the following instructions, and the processor 410 executes the application programs stored in the memory 420, so as to implement various functions as follows:
The method comprises the steps of obtaining graph data and graph data storage spaces, wherein the graph data comprises nodes and edges connected between the nodes, the graph data storage spaces comprise index storage spaces and attribute storage spaces, storing attribute information of the graph data in the attribute storage spaces, wherein the attribute information comprises node attribute information and edge attribute information, determining index identifications of the nodes according to directivity information corresponding to the nodes for any node, wherein the directivity information corresponding to the nodes is used for pointing to the node attribute information and adjacent edge attribute information of the nodes stored in the attribute storage spaces, the adjacent edge attribute information is the edge attribute information of the adjacent edges connected with the nodes, and storing the index identifications of the nodes in the index storage spaces according to the connection relation of the nodes in the graph data so as to store the graph data in the graph data storage spaces.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
As can be seen from the above, in the embodiment of the present application, the index identifier and the attribute information of the graph data are separately stored in the index storage space and the attribute storage space, so as to implement separate management of the graph index and the graph attribute, thereby improving the storage flexibility and the query efficiency of the graph data. The index storage space stores index identifiers of nodes corresponding to the attribute information according to the connection relation of the nodes in the graph data. Therefore, when the graph data is queried, the connection relation of the nodes can be directly traversed in the index storage space without reading the graph attribute information so as to query and traverse the graph. After the related data to be queried is determined, corresponding attribute information is acquired from the attribute storage space according to the directivity information corresponding to the index identification of the queried node, so that the call of data resources in the query process is reduced, the resource waste is avoided, and the query efficiency of graph data is improved.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the data storage methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:
The method comprises the steps of obtaining graph data and graph data storage spaces, wherein the graph data comprises nodes and edges connected between the nodes, the graph data storage spaces comprise index storage spaces and attribute storage spaces, storing attribute information of the graph data in the attribute storage spaces, wherein the attribute information comprises node attribute information and edge attribute information, determining index identifications of the nodes according to directivity information corresponding to the nodes for any node, wherein the directivity information corresponding to the nodes is used for pointing to the node attribute information and adjacent edge attribute information of the nodes stored in the attribute storage spaces, the adjacent edge attribute information is the edge attribute information of the adjacent edges connected with the nodes, and storing the index identifications of the nodes in the index storage spaces according to the connection relation of the nodes in the graph data so as to store the graph data in the graph data storage spaces.
The storage medium may include a Read Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, an optical disk, or the like.
According to an aspect of the present application there is provided a computer program product or computer program comprising computer programs or instructions which when executed by a processor implement the steps of the methods provided in the various alternative implementations provided in the above embodiments. The computer program/instructions are stored in a computer readable storage medium. The processor of the electronic device reads the computer program/instructions from the computer-readable storage medium, and the processor executes the computer program/instructions to cause the electronic device to perform the methods provided in the various alternative implementations provided in the above-described embodiments.
The steps in any data storage method provided by the embodiment of the present application can be executed by the instructions stored in the storage medium, so that the beneficial effects that any data storage method provided by the embodiment of the present application can be achieved, and detailed descriptions of the foregoing embodiments are omitted herein.
The foregoing describes in detail a data storage method, apparatus, electronic device, storage medium and program product provided by the embodiments of the present application, and specific examples are set forth herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is only for aiding in understanding the method and core concept of the present application, and meanwhile, for those skilled in the art, according to the concept of the present application, there are variations in the specific embodiments and application ranges, so that the disclosure should not be interpreted as limiting the scope of the present application.