CN119025481A - A seismic data query method, device, electronic equipment and medium - Google Patents

A seismic data query method, device, electronic equipment and medium

Info

Publication number
CN119025481A
CN119025481A (application CN202310594488.4A)
Authority
CN
China
Prior art keywords
index
model
data
learning
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310594488.4A
Other languages
Chinese (zh)
Inventor
罗劭衡
赵长海
尚民强
杜吉国
王增波
孙孝萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cnpc Oil Gas Exploration Software National Engineering Research Center Co ltd
China National Petroleum Corp
BGP Inc
Original Assignee
Cnpc Oil Gas Exploration Software National Engineering Research Center Co ltd
China National Petroleum Corp
BGP Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cnpc Oil Gas Exploration Software National Engineering Research Center Co ltd, China National Petroleum Corp, BGP Inc filed Critical Cnpc Oil Gas Exploration Software National Engineering Research Center Co ltd
Priority to CN202310594488.4A priority Critical patent/CN119025481A/en
Publication of CN119025481A publication Critical patent/CN119025481A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • G06F16/134Distributed indices
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01VGEOPHYSICS; GRAVITATIONAL MEASUREMENTS; DETECTING MASSES OR OBJECTS; TAGS
    • G01V1/00Seismology; Seismic or acoustic prospecting or detecting
    • G01V1/24Recording seismic data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/14Details of searching files based on file metadata
    • G06F16/144Query formulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Geology (AREA)
  • General Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Geophysics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to the field of seismic exploration, and in particular, to a method, an apparatus, an electronic device, and a medium for querying seismic data. The method comprises the following steps: performing data sampling and data distributed learning on the seismic data based on the MapReduce framework and the machine learning model to generate a node allocation model; performing distributed segment ordering on the seismic data based on a MapReduce framework and the node allocation model to generate an ordered index file; training a learning index comprising an underlying model and a non-underlying model based on a MapReduce framework and the ordered index file, and combining the learning index with the node allocation model to construct a learning index structure; and acquiring keywords to be queried and inputting the keywords to the learning index structure to acquire query results. According to the scheme provided by the invention, the learning index can be reused to a certain extent, so that the index construction time is effectively shortened and the index construction efficiency is improved.

Description

Seismic data query method and device, electronic equipment and medium
Technical Field
The present invention relates to the field of seismic exploration, and in particular, to a method, an apparatus, an electronic device, and a medium for querying seismic data.
Background
Seismic data processing is an important technology in the petroleum exploration industry. Its function is to process and compute field-acquired seismic data with specific processing algorithms, producing an image of the subsurface geological structure that guides subsequent drilling and oil production. With the continuous application of new exploration and high-precision acquisition technologies in petroleum exploration, the volume of raw seismic data acquired in the field has grown rapidly: a single data volume now exceeds the PB scale, and the number of seismic traces can reach the trillions. The object handled by a seismic application is typically such a massive seismic data volume, which is logically similar to a data table in a relational database and is organized in row order, each row record being called a seismic trace. A seismic trace consists of two parts: a trace header and a trace body. The trace header stores attribute information related to the trace, including shot coordinates, receiver coordinates, sample count, shot number, trace number, and so on; each attribute is called a trace-header keyword. The trace body is a floating-point array, each floating-point number being called a sample point. Because a seismic data volume is high-dimensional structured data, each seismic trace carries hundreds of attributes stored in different trace-header keywords.
However, a large number of interactive seismic applications are typically interested in only a partial dataset of a seismic data volume when accessing it. A large fraction of seismic data accesses therefore specify value ranges for some attributes to filter out particular datasets, and may also specify that query results be ordered by certain attributes. Since multi-dimensional range queries are the most common query pattern in seismic applications, their speed is critical to the performance and user experience of seismic applications, interactive applications in particular. Efficient indexed querying is the basis for guaranteeing query efficiency and reducing query latency over seismic data. The B+ tree index, a balanced search tree designed for disks and other direct-access secondary storage devices, can effectively reduce the number of disk I/O operations during a query; and because a B+ tree supports fast range scans along its leaf nodes, it offers good range-query performance.
However, the conventional B+ tree construction method inserts records one by one into an empty tree, a process that may involve many complicated node operations such as splitting and rotation. For trillion-trace unordered data, the overhead of this construction method is huge, multi-machine parallel construction is difficult, and the speedup in a multi-threaded environment is limited. Moreover, if every record is stored in the B+ tree, the final tree is enormous and lookups remain inefficient. In addition, a B+ tree can only search on a single keyword: for a multi-keyword range search, all data matching the first keyword condition must be retrieved first and then filtered on the subsequent keywords, which greatly reduces search efficiency. Therefore, for massive seismic data, a distributed index construction method different from the traditional B+ tree structure needs to be designed to improve index construction efficiency, speedup, scalability, and query efficiency.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, electronic device and medium for querying seismic data.
According to a first aspect of the present invention there is provided a method of seismic data querying, the method comprising:
Performing data sampling and data distributed learning on the seismic data based on the MapReduce framework and the machine learning model to generate a node allocation model;
Performing distributed segment ordering on the seismic data based on a MapReduce framework and the node allocation model to generate an ordered index file;
training a learning index comprising an underlying model and a non-underlying model based on a MapReduce framework, the ordered index file, and combining the learning index with the node allocation model to construct a learning index structure;
and acquiring keywords to be queried and inputting the keywords to the learning index structure to acquire query results.
In some embodiments, the step of performing data sampling and data distributed learning on the seismic data based on the MapReduce framework and the machine learning model to generate the node assignment model includes:
Setting the sampling scale, setting the number of Reducers to 1, and using the MapReduce framework to distribute the sampling scale evenly across the nodes;
In the Map stage, each node reads seismic traces at intervals according to the sampling scale and extracts from each trace the keyword information containing the selected keyword values, ensuring that the number of traces read by sampling matches the sampling scale;
In the Reduce stage, the read keyword information is consolidated and, treating the sampling result as the scan result of the whole data set, an allocation table of each node's subsequent sorting data range is generated;
Based on the allocation table, taking the keyword value as input and the node serial number assigned to that value as output, normalizing them, feeding the normalized values into the machine learning model for training weighted by each keyword value's share of traces, fitting the cumulative distribution function of the keyword, and saving the result to generate the node allocation model.
In some embodiments, the step of distributing segment ordering of the seismic data based on the MapReduce framework and the node allocation model to generate an ordered index file comprises:
Dividing the trace-header data into multiple segments in the Map stage, with each Map task responsible for reading one segment of data, and then distributing the Map tasks evenly to the Map Workers on each node of the cluster;
After a Map Worker receives a Map task, it processes the corresponding trace-header data segment: it reads all trace headers, sequentially extracts the selected keyword data and trace number of each trace as an index item, and stores them as key/value pairs;
The Map Worker obtains the node number corresponding to each index item by calling the node allocation model, so that each key/value pair is sent to the corresponding Reduce Worker for sorting;
After receiving all key/value pairs sent by the Map Workers, each Reduce Worker sorts the index data stored in those key/value pairs in keyword order, stores the sorted index data into the file corresponding to that Reduce Worker, and writes statistics on each keyword into the corresponding keyword information file.
In some embodiments, the step of training a learning index including an underlying model and a non-underlying model based on the MapReduce framework, the ordered index file, and combining the learning index with the node assignment model to construct a learning index structure, comprises:
In the Map stage, the MakeMLIndexInputFormat class is called to divide the data into blocks, the number of blocks being determined by the number of index files, and the blocks are stored in InputSplit objects;
After a Map Worker receives a task, it reads the index data of each data block using the MakeMLIndexRecordReader class; each time a fixed-size run of index items has been read, that run is taken as the training set of one bottom model of the learning index, and a bottom model is generated, trained, and saved through the MakeIndexLeafModel function;
After all leaf nodes are established, the non-bottom models of the learning index are generated, trained, and saved bottom-up using the MakeIndexNodeModel function;
Using the node allocation model as the root node model of the learning index yields the learning index structure: a query starts from the root node model, each layer of models selects a model in the next layer, and the last layer outputs the approximate position of the query value in the overall index data.
In some embodiments, the step of obtaining the keyword to be queried and inputting the keyword to the learning index structure to obtain the query result includes:
Acquiring the keyword to be queried as input by the user, feeding the keyword value into the root node model of the learning index structure to obtain the number of the lower-layer model covering the keyword, and continuing downward in the same manner until the bottom model is reached;
Predicting the offset of the keyword in the index data file through the bottom model, and reading the index file at the predicted offset;
If the index item read is not the search value, whether to search forward or backward item by item is decided by comparing the current item with the search value, until the correct value is matched.
In some embodiments, the method further comprises:
After a query, the query condition and the real offset are fed back into the model as training data for retraining, so as to update the learning index structure.
In some embodiments, the machine learning model is a multi-layer perceptron model.
According to a second aspect of the present invention there is provided a seismic data query apparatus, the apparatus comprising:
the sampling learning module is configured to sample data and learn data in a distributed manner based on the MapReduce framework and the machine learning model so as to generate a node allocation model;
the ordering module is configured to perform distributed segment ordering on the seismic data based on the MapReduce framework and the node allocation model so as to generate an ordered index file;
A building module configured to train a learning index including an underlying model and a non-underlying model based on a MapReduce framework, the ordered index file, and combine the learning index with the node allocation model to build a learning index structure;
and the query module is configured to acquire keywords to be queried and input the keywords to the learning index structure to acquire query results.
According to a third aspect of the present invention, there is also provided an electronic device including:
at least one processor; and
a memory storing a computer program runnable on the processor, the processor executing the aforementioned seismic data query method when running the program.
According to a fourth aspect of the present invention there is also provided a computer readable storage medium storing a computer program which when executed by a processor performs the aforementioned seismic data query method.
The seismic data query method has the following beneficial technical effects: a learning index structure can learn and store the distribution law of the data using a small amount of space, significantly reducing the storage space of the index structure. Treating the index structure as a black box, i.e., as a model viewed from its input and output data, it predicts a value's position in an ordered data body by fitting the cumulative distribution function of the input data; a machine learning model learns and predicts the distribution law of the seismic data, thereby implementing the index function. The learning index can be reused to a certain extent, which effectively reduces index construction time and improves index construction efficiency.
In addition, the invention also provides a seismic data query device, an electronic device and a computer readable storage medium, which can also achieve the technical effects, and are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for querying seismic data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall structure of a large-scale seismic data indexing system for implementing the method of the invention according to one embodiment of the invention;
FIG. 3 is a schematic diagram of a learning index function implementation principle according to the present invention;
FIG. 4 is a schematic diagram of a multi-layer perceptron model used by the learning index of the present invention;
FIG. 5 is a schematic diagram of a relationship between a first key value and an offset in PB level seismic data using an indexing function according to the method of the present invention;
FIG. 6 is a schematic diagram of a data sampling flow based on a MapReduce framework and a machine learning model in the method of the present invention;
FIG. 7 is a schematic diagram of a distributed segment ordering flow based on a MapReduce framework in the method of the present invention;
FIG. 8 is a schematic diagram of a learning index construction flow based on a MapReduce framework in the method of the present invention;
FIG. 9 is a diagram of a learning index structure and a query flow in the method of the present invention;
FIG. 10 is a schematic diagram of a seismic data query apparatus according to another embodiment of the present invention;
FIG. 11 is an internal block diagram of an electronic device in accordance with another embodiment of the present invention;
Fig. 12 is a block diagram of a computer readable storage medium according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
It should be noted that in the embodiments of the present invention, the expressions "first" and "second" are used to distinguish two entities or parameters with the same name; they are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention. Subsequent embodiments will not repeat this note.
In one embodiment, referring to FIG. 1, the present invention provides a seismic data query method 100, specifically, the method comprises the steps of:
step 101, performing data sampling and data distributed learning on seismic data based on a MapReduce framework and a machine learning model to generate a node allocation model;
step 102, performing distributed segment ordering on the seismic data based on a MapReduce framework and the node allocation model to generate an ordered index file;
step 103, training a learning index comprising an underlying model and a non-underlying model based on a MapReduce framework and the ordered index file, and combining the learning index with the node allocation model to construct a learning index structure;
Step 104, obtaining the keywords to be queried and inputting the keywords to the learning index structure to obtain query results.
The seismic data query method of this embodiment has the following beneficial technical effects: a learning index structure can learn and store the distribution law of the data using a small amount of space, significantly reducing the storage space of the index structure. Treating the index structure as a black box, i.e., as a model viewed from its input and output data, it predicts a value's position in an ordered data body by fitting the cumulative distribution function of the input data; a machine learning model learns and predicts the distribution law of the seismic data, thereby implementing the index function. The learning index can be reused to a certain extent, which effectively reduces index construction time and improves index construction efficiency.
In some embodiments, the foregoing step 101 performs data sampling and data distributed learning on the seismic data based on the MapReduce framework and the machine learning model to generate a node assignment model, including:
Setting the sampling scale, setting the number of Reducers to 1, and using the MapReduce framework to distribute the sampling scale evenly across the nodes;
In the Map stage, each node reads seismic traces at intervals according to the sampling scale and extracts from each trace the keyword information containing the selected keyword values, ensuring that the number of traces read by sampling matches the sampling scale;
In the Reduce stage, the read keyword information is consolidated and, treating the sampling result as the scan result of the whole data set, an allocation table of each node's subsequent sorting data range is generated;
Based on the allocation table, taking the keyword value as input and the node serial number assigned to that value as output, normalizing them, feeding the normalized values into the machine learning model for training weighted by each keyword value's share of traces, fitting the cumulative distribution function of the keyword, and saving the result to generate the node allocation model.
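The sampling and allocation steps above can be sketched as follows. This is a minimal, stdlib-only illustration, not the patent's implementation: `build_allocation_table` and `fit_linear_cdf` are hypothetical names, and a least-squares line stands in for the machine learning model that fits the keyword's cumulative distribution function.

```python
import random
from collections import Counter

def build_allocation_table(sampled_keys, num_nodes):
    # Treat the sample as a proxy for the full scan: count traces per
    # keyword value, then assign contiguous key ranges to nodes so that
    # each node receives a roughly equal number of traces.
    counts = Counter(sampled_keys)
    total = sum(counts.values())
    table, acc, node = {}, 0, 0
    for key in sorted(counts):
        table[key] = node
        acc += counts[key]
        if node < num_nodes - 1 and acc >= total * (node + 1) / num_nodes:
            node += 1
    return table

def fit_linear_cdf(table, num_nodes):
    # Least-squares line keyword value -> node number: a stand-in for
    # the model that fits the keyword's cumulative distribution.
    xs = sorted(table)
    ys = [table[k] for k in xs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs) or 1.0
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = cov / var
    b = my - a * mx
    return lambda k: min(max(round(a * k + b), 0), num_nodes - 1)

random.seed(0)
sample = [random.randint(1, 1000) for _ in range(5000)]
table = build_allocation_table(sample, 4)
node_model = fit_linear_cdf(table, 4)
```

Because key ranges are assigned to nodes in increasing order, the table is monotone in the key, which is what lets the later segment sort concatenate per-node files into a global order.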
In some embodiments, the foregoing step 102 performs distributed segment ordering on the seismic data based on the MapReduce framework and the node allocation model to generate an ordered index file, including:
Dividing the trace-header data into multiple segments in the Map stage, with each Map task responsible for reading one segment of data, and then distributing the Map tasks evenly to the Map Workers on each node of the cluster;
After a Map Worker receives a Map task, it processes the corresponding trace-header data segment: it reads all trace headers, sequentially extracts the selected keyword data and trace number of each trace as an index item, and stores them as key/value pairs;
The Map Worker obtains the node number corresponding to each index item by calling the node allocation model, so that each key/value pair is sent to the corresponding Reduce Worker for sorting;
After receiving all key/value pairs sent by the Map Workers, each Reduce Worker sorts the index data stored in those key/value pairs in keyword order, stores the sorted index data into the file corresponding to that Reduce Worker, and writes statistics on each keyword into the corresponding keyword information file.
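A toy, stdlib-only sketch of this segment sort (all names here are hypothetical, and a simple range rule stands in for the node allocation model): the Map side routes each index item to a bucket, the Reduce side sorts each bucket, and because the routing assigns contiguous key ranges to increasing node numbers, concatenating the per-node results is globally ordered.

```python
def segment_sort(index_items, assign, num_nodes):
    # Map phase: route each (keyword value, trace number) index item to
    # the node chosen by the allocation model `assign`.
    buckets = [[] for _ in range(num_nodes)]
    for key, trace_no in index_items:
        buckets[assign(key)].append((key, trace_no))
    # Reduce phase: each node sorts only its own segment.
    return [sorted(b) for b in buckets]

items = [(k, i) for i, k in enumerate([42, 7, 900, 300, 55, 610])]
assign = lambda k: min(k // 250, 3)  # stand-in for the learned model
files = segment_sort(items, assign, 4)
merged = [kv for f in files for kv in f]  # concatenation is ordered
```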
In some embodiments, the foregoing step 103, training a learning index including an underlying model and a non-underlying model based on a MapReduce framework, the ordered index file, and combining the learning index with the node allocation model to construct a learning index structure, includes:
In the Map stage, the MakeMLIndexInputFormat class is called to divide the data into blocks, the number of blocks being determined by the number of index files, and the blocks are stored in InputSplit objects;
After a Map Worker receives a task, it reads the index data of each data block using the MakeMLIndexRecordReader class; each time a fixed-size run of index items has been read, that run is taken as the training set of one bottom model of the learning index, and a bottom model is generated, trained, and saved through the MakeIndexLeafModel function;
After all leaf nodes are established, the non-bottom models of the learning index are generated, trained, and saved bottom-up using the MakeIndexNodeModel function;
Using the node allocation model as the root node model of the learning index yields the learning index structure: a query starts from the root node model, each layer of models selects a model in the next layer, and the last layer outputs the approximate position of the query value in the overall index data.
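The bottom-up construction above can be sketched as a small two-level structure (illustrative only, with hypothetical names): one linear "bottom model" is fit per fixed-size run of the ordered index items, and a router sends a key to the right leaf. Here `bisect` stands in for the learned root and intermediate models described in the patent.

```python
import bisect

LEAF_SIZE = 4  # index items per bottom model

def fit_leaf(chunk, base):
    # One bottom model: a linear fit from keyword value to the global
    # offset of the item inside this ordered segment.
    keys = [k for k, _ in chunk]
    lo, span = keys[0], (keys[-1] - keys[0]) or 1
    return lambda k: base + round((k - lo) / span * (len(chunk) - 1))

def build_learning_index(sorted_items):
    # Bottom-up: fit one leaf per fixed-size run, then route queries
    # to leaves by the first key of each run.
    leaves, first_keys = [], []
    for i in range(0, len(sorted_items), LEAF_SIZE):
        chunk = sorted_items[i:i + LEAF_SIZE]
        leaves.append(fit_leaf(chunk, i))
        first_keys.append(chunk[0][0])
    def root(k):
        j = max(bisect.bisect_right(first_keys, k) - 1, 0)
        return leaves[j](k)
    return root

items = [(k * 10, k) for k in range(12)]  # (keyword value, trace no.)
index = build_learning_index(items)
```

On this evenly spaced data the linear leaves are exact; on real key distributions the prediction is approximate, which is why the query step includes a local correction search.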
In some embodiments, the step 104 of obtaining the keyword to be queried and inputting the keyword to the learning index structure to obtain the query result includes:
Acquiring the keyword to be queried as input by the user, feeding the keyword value into the root node model of the learning index structure to obtain the number of the lower-layer model covering the keyword, and continuing downward in the same manner until the bottom model is reached;
Predicting the offset of the keyword in the index data file through the bottom model, and reading the index file at the predicted offset;
If the index item read is not the search value, whether to search forward or backward item by item is decided by comparing the current item with the search value, until the correct value is matched.
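The correction step can be sketched as follows (a minimal illustration; `learned_query` is a hypothetical name, and `predict` models a deliberately imprecise bottom model):

```python
def learned_query(key, predict, index_file):
    # The bottom model predicts an offset; if the item there is not the
    # search value, compare it with the search value to decide whether
    # to step backward or forward until the correct value is matched.
    pos = min(max(predict(key), 0), len(index_file) - 1)
    while pos > 0 and index_file[pos][0] > key:
        pos -= 1  # prediction overshot: search backward
    while pos < len(index_file) - 1 and index_file[pos][0] < key:
        pos += 1  # prediction undershot: search forward
    return pos if index_file[pos][0] == key else None

index_file = [(k, k // 5) for k in range(0, 100, 5)]  # (key, trace)
off_by_two = lambda k: k // 5 + 2  # deliberately imprecise model
```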
In some embodiments, the method further comprises:
After a query, the query condition and the real offset are fed back into the model as training data for retraining, so as to update the learning index structure.
In some embodiments, the machine learning model is a multi-layer perceptron model.
In another embodiment, to facilitate understanding of the solution of the invention, the use of the seismic data query method on a high-performance cluster is taken as an example. Referring to fig. 2, a large-scale seismic data indexing system is, in general, divided into two parts, index construction and index query, and index construction is further divided into two parts: data sorting and index structure construction. Accordingly, the invention summarizes the indexing system in two layers, an index data layer and an index structure layer, both constructed in a distributed manner using the MapReduce framework. First, the selected keyword values and the seismic trace numbers are extracted from all seismic traces as index items, all index items are sorted, and the ordered index files are saved in centralized storage, completing the construction of the index data layer; thereafter, an index structure is built over the ordered index item data set.
The cumulative distribution function, also called the distribution function, is the integral of the probability density function and fully describes the probability distribution of a real random variable X. For every real x, the cumulative distribution function gives the probability that the variable takes a value less than or equal to x, defined as follows:
F_X(x) = P(X ≤ x)
Thus, for an ordered data set, the cumulative distribution function can be interpreted as the relative position of a value x within the overall data set; once the number of elements in the data set is known, the product of this ratio and the total number of elements is the concrete position of x in the overall data. Assuming an ordered data set with N elements has cumulative distribution function F_X(x), the position O(x) of a value x in the overall data can be calculated by the following formula:
O(x) = N · F_X(x)
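A quick numerical check of O(x) = N · F_X(x) with an empirical CDF (illustrative only): the product equals the rank of x, i.e., the number of elements less than or equal to x in the ordered set.

```python
def cdf_position(sorted_data, x):
    # Empirical F_X(x) = P(X <= x); N * F_X(x) is the rank of x, i.e.
    # one past the index of the last element <= x in the ordered set.
    n = len(sorted_data)
    fx = sum(1 for v in sorted_data if v <= x) / n
    return round(n * fx)

data = sorted([3, 1, 4, 1, 5, 9, 2, 6])  # -> [1, 1, 2, 3, 4, 5, 6, 9]
```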
Therefore, the index of every value in a dataset can be calculated by the above formula, provided three conditions hold: the data set is globally ordered, the overall data size N is known, and the cumulative distribution function F_X(x) is known. Global order can be achieved by sorting the data set as a whole, and the overall data size is obtained as a by-product of the sorting. Finally, the cumulative distribution function of the data set can be obtained by scanning or sampling: when the data size is small, a complete scan yields the exact cumulative distribution; when the data size is large, sampling part of the data volume yields an approximate cumulative distribution function. It can therefore be concluded that by learning the exact or approximate cumulative distribution function of the data body with a machine learning model (i.e., function fitting), a mapping between keywords (values) and positions (offsets) can be established, and the positions of unseen values can be predicted, achieving the effect of a data index, as shown in fig. 3. Since the learning index directly computes a predicted position for the query content, the query time complexity is O(1). The model can be loosely understood as a hash function implemented with a machine learning algorithm, and the mapping can be realized by a simple multi-layer perceptron model, as shown in fig. 4.
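As a toy illustration of such a perceptron mapping (a sketch only, not the patent's model: all names and hyperparameters here are assumptions), a one-hidden-layer network can be trained by plain gradient descent to map normalized keyword values to normalized offsets:

```python
import math
import random

def train_mlp(pairs, hidden=8, lr=0.2, epochs=500, seed=1):
    # One-hidden-layer perceptron fitting normalized key -> normalized
    # offset with per-sample gradient descent on squared error.
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = [0.0]

    def forward(x):
        h = [1 / (1 + math.exp(-(w1[i] * x + b1[i]))) for i in range(hidden)]
        return h, sum(w2[i] * h[i] for i in range(hidden)) + b2[0]

    for _ in range(epochs):
        for x, y in pairs:
            h, out = forward(x)
            err = out - y
            b2[0] -= lr * err
            for i in range(hidden):
                gh = err * w2[i] * h[i] * (1 - h[i])  # backprop to layer 1
                w2[i] -= lr * err * h[i]
                w1[i] -= lr * gh * x
                b1[i] -= lr * gh
    return lambda x: forward(x)[1]

# Fit the (linear) CDF of evenly spaced keys, both axes scaled to [0, 1].
pairs = [(i / 20, i / 20) for i in range(21)]
model = train_mlp(pairs)
mae = sum(abs(model(x) - y) for x, y in pairs) / len(pairs)
```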
Following this basic principle, the learning index takes a keyword value as input and the offset of that value in the index file as output, and feeds the index items of the generated ordered index file into the model in sequence as the training set. A scatter plot of the first keyword's values against their index-file offsets is shown in fig. 5; this value-offset relationship can be understood as the keyword's cumulative distribution function. The curve is similar to the overall curve of the node allocation graph, so the same model structure can be used for learning and prediction. However, since the total number of seismic traces is very large, feeding them all into a single model would lead to huge training time and lower accuracy. A divide-and-conquer approach addresses this: all index data are divided into multiple segments, and a query model is built, trained, and stored for each segment. This improves learning accuracy and allows the training process to run in parallel, reducing the time consumed by the learning stage.
The design of the learning index structure has the following main characteristics:
(1.1) the learned index is constructed with a distributed training method, and the construction flow is divided into three parts: sampling prediction, data sorting, and learned-index construction;
(1.2) in the sampling-scan prediction stage, a machine learning model predicts the data distribution, and segment sorting is performed using the node numbers assigned by the model, achieving efficient large-scale data sorting;
(1.3) each part of the index construction flow is distributed based on the MapReduce programming model;
(1.4) the number of index files equals the number of cluster nodes, and the number of sub-model trees of the learned index equals the number of index files;
(1.5) a query starts from the root model, each layer of models selects a model in the next layer, and the last layer outputs the approximate position of the query data;
(1.6) after a query, the data and its position value are fed back into the model for training, improving the query efficiency of high-frequency queries.
[2] Sampling prediction based on MLP model.
In the MapReduce framework, the Map stage can route values with the same key to the same Reducer through the GetPartition function for unified processing. Thus, when sorting with the MapReduce framework, each index entry can be sent to a Reducer with a designated number. Each Reducer is assigned an ordered, contiguous keyword range disjoint from the others; each Mapper sends every index entry to the Reducer responsible for it, and after sorting the Reducers produce multiple index files that are ordered internally and ordered between files. In this way, the sorting makes maximal use of the cluster's parallelism.
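A range-based partition function of the kind described above can be sketched as follows. The boundary values are illustrative assumptions, not from the patent.

```python
# Sketch of a range-based GetPartition: each Reducer owns a contiguous,
# disjoint keyword range, so per-Reducer sorted outputs are also ordered
# between files. Boundaries below are illustrative.
import bisect

def make_partitioner(upper_bounds):
    """upper_bounds[i] is the largest key handled by reducer i (ascending)."""
    def get_partition(key, num_partitions):
        idx = bisect.bisect_left(upper_bounds, key)
        return min(idx, num_partitions - 1)   # clamp overflow to last reducer
    return get_partition

part = make_partitioner([99, 199, 299])   # 3 reducers
```

Because the ranges are disjoint and ascending, simply concatenating the reducers' sorted output files yields a globally ordered result.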
However, in actual seismic trace data, different keywords differ in type, value range, distribution, and other attributes, so some nodes may be assigned too much data, and overall efficiency is then limited by the slowest node (the bucket effect). The amount of sorting data assigned to each node therefore needs to be as even as possible. This requires a single scan of the entire data before sorting to determine all keyword values and the number of traces in which each value appears, and then to assign the values to the different nodes as evenly and orderly as possible according to each value's trace count. Scanning all trace headers can be implemented on the MapReduce framework: first, in the Map stage, each node reads the trace-header keywords in parallel, counts the keyword information it has read (i.e., each keyword value and the number of traces containing it), and sends its keyword information to a single designated node; then, in the Reduce stage, the node responsible for receiving this information collects all keyword information and merges it into a total keyword information table. Trace counts as even as possible are then allocated to each node by keyword value, finally producing an allocation table of each node's sorting data range. In the sorting stage, each node reads the allocation table and sends different values to the designated nodes, realizing segment sorting. A search structure is then built for each file; since the structures are ordered relative to each other, the files can be directly combined into a single overall index search structure, and a single lookup yields the result.
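The allocation-table construction described here — contiguous key ranges sized by trace count — can be sketched as below. The counts and node count are toy values, and the greedy balancing strategy is an assumption for illustration.

```python
# Hedged sketch of building the allocation table: given the merged keyword
# information table (value -> trace count), assign contiguous key ranges to
# nodes so each node sorts roughly the same number of traces.

def build_allocation_table(key_counts, num_nodes):
    """key_counts: dict of keyword value -> trace count. Returns a dict
    value -> node id, preserving key order so that per-node sorted outputs
    remain globally ordered when concatenated."""
    total = sum(key_counts.values())
    target = total / num_nodes
    table, node, acc = {}, 0, 0
    for value in sorted(key_counts):
        if acc >= target and node < num_nodes - 1:
            node, acc = node + 1, 0       # move on to the next node
        table[value] = node
        acc += key_counts[value]
    return table

counts = {1: 40, 2: 10, 3: 50, 4: 45, 5: 55}   # 200 traces over 5 key values
alloc = build_allocation_table(counts, 2)       # 2 sorting nodes
```

With these toy counts each node receives exactly 100 traces; with skewed real data the split is only approximate, which is precisely why the patent predicts the distribution instead of scanning everything.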
Because scanning all the data is very inefficient, and in actual data the number of traces sharing the same keyword value is very large (so the nodes can never be assigned exactly equal trace counts for sorting), the data distribution can instead be predicted by combining sampling with machine learning. The scheme mainly comprises the following steps:
(step 2.1) set the sampling scale, set the number of Reducers to 1, and use the MapReduce framework to distribute the sampling work evenly across the nodes;
(step 2.2) in the Map stage, each node reads seismic traces at intervals according to the sampling scale and extracts the selected keyword value from each trace, ensuring that the number of traces sampled matches the sampling scale;
(step 2.3) in the Reduce stage, integrate the keyword information that was read, treat the sampling result as the scan result of the whole data, and generate the allocation table of each node's subsequent sorting data range;
(step 2.4) take the keyword value as input and the node number assigned to that value as output; after normalization, feed the pairs into a machine learning model for training, weighted by each keyword value's share of the trace count, fitting the keyword's cumulative distribution function, and save the model.
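Step 2.4 maps keyword values to node numbers with a trained model. As a minimal sketch — using piecewise-linear interpolation of the sampled points as a stand-in for the MLP the patent actually trains — the fitted "model" might look like this; all names and sample values are illustrative.

```python
# Sketch of step 2.4. A real implementation would train an MLP (the patent
# mentions Libtorch); here a piecewise-linear fit of the sampled
# (key -> node number) pairs stands in for the trained model.

def fit_node_model(samples):
    """samples: sorted list of (key, node) pairs from the sampling stage.
    Returns a callable key -> predicted node number."""
    def interpolate(key):
        lo = samples[0]
        for pt in samples:
            if pt[0] >= key:
                if pt[0] == lo[0]:
                    return float(pt[1])
                t = (key - lo[0]) / (pt[0] - lo[0])
                return lo[1] + t * (pt[1] - lo[1])   # linear between samples
            lo = pt
        return float(samples[-1][1])                 # clamp past the range
    return lambda key: round(interpolate(key))

node_of = fit_node_model([(0, 0), (100, 1), (200, 2), (300, 3)])
```

Because the samples approximate the keyword's cumulative distribution, the fitted curve assigns roughly equal trace counts to each node number.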
[3] Distributed ordering based on a MapReduce programming model.
A traditional distributed sorting algorithm has two parts: sorting within each node, and merge sorting, where the ordered data of all nodes must be merged in the final stage. However, the merge step concentrates all data I/O on a single node, severely reducing sorting efficiency and greatly increasing time consumption. A segment-sorting approach avoids the merge and thus improves efficiency: the data distribution is first determined by scanning the data, and data in disjoint value ranges is then distributed evenly across the nodes, so that the output files are ordered both internally and between files, yielding a globally ordered result.
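The merge-free property can be demonstrated with a small single-process simulation; the routing loop stands in for the Map side and the per-bucket sorts for the Reduce side, with all names and values illustrative.

```python
# Sketch of merge-free segment sorting: because node i only receives keys in
# a range strictly below node i+1's range, concatenating the per-node sorted
# outputs yields a globally sorted result with no final merge step.

def segment_sort(data, boundaries):
    """boundaries: ascending upper bounds, one per node."""
    buckets = [[] for _ in boundaries]
    for x in data:                        # "Map" side: route by value range
        for i, ub in enumerate(boundaries):
            if x <= ub:
                buckets[i].append(x)
                break
    for b in buckets:                     # "Reduce" side: local sorts
        b.sort()
    return [x for b in buckets for x in b]    # concatenation, no merge

out = segment_sort([42, 7, 199, 3, 150, 88], [50, 100, 200])
```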
As the second part of the learned-index construction flow, the purpose of the data sorting part is to sort the data and generate the index data files. The specific sorting flow, shown in fig. 7, comprises the following steps:
(step 3.1) in the Map stage, divide the trace-header data evenly into multiple segments, with each Map task responsible for reading one segment, and distribute these Map tasks evenly to the Map Workers of the nodes in the cluster;
(step 3.2) after a Map Worker receives a Map task, it processes the corresponding trace-header data segment: it reads all trace headers, extracts the selected keyword data and the trace number from each trace in sequence as index entries, and stores them as key/value pairs;
(step 3.3) the Map Worker obtains the node number of each index entry by calling the node allocation model and sends each key/value pair to the corresponding Reduce Worker for sorting, thereby realizing segment sorting;
(step 3.4) after a Reduce Worker has received all key/value pairs sent by the Map Workers, it sorts the index data stored in them by keyword order and stores the sorted index data in the file corresponding to that Reduce Worker. At the same time, the statistics of each keyword (number of distinct values, the keyword's maximum value, its minimum value, etc.) are written into the corresponding keyword information file. At this point, all trace-header index data has been extracted and arranged into multiple index files.
[4] Distributed learning index construction based on MapReduce framework.
The invention designs a distributed learned-index construction scheme based on the MapReduce framework; its flow chart is shown in fig. 8, and it mainly comprises the following steps:
(step 4.1) set the model's maximum training time time_max, maximum loss loss_max, and prediction tolerance offset_tol;
(step 4.2) after data sorting is complete, read the index entries of all distinct values in the index data and their corresponding offsets;
(step 4.3) use all index entries and their offsets as the training set and feed them into the model;
(step 4.4) start model training, stopping when the training time exceeds time_max, or when the prediction error is below offset_tol and the loss is below loss_max;
(step 4.5) save the model.
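The stopping rule in steps 4.1-4.4 can be sketched as below. The one-parameter linear model and gradient-descent loop are illustrative stand-ins for the real MLP training; only the stopping logic mirrors the steps above.

```python
# Hedged sketch of the stopping rule: train until the time budget time_max is
# exhausted, or until both the loss and the worst-case prediction error fall
# under loss_max and offset_tol. The linear model is a stand-in for the MLP.
import time

def train_leaf_model(pairs, time_max, loss_max, offset_tol, lr=1e-4):
    a = 0.0                                   # model: offset ≈ a * key
    start = time.monotonic()
    while time.monotonic() - start < time_max:
        grad = sum(2 * (a * k - off) * k for k, off in pairs) / len(pairs)
        a -= lr * grad
        loss = sum((a * k - off) ** 2 for k, off in pairs) / len(pairs)
        worst = max(abs(a * k - off) for k, off in pairs)
        if loss < loss_max and worst < offset_tol:
            break                             # both stopping conditions met
    return a

pairs = [(k, k // 10) for k in range(0, 100, 10)]   # true relation: a = 0.1
a = train_leaf_model(pairs, time_max=2.0, loss_max=1e-4, offset_tol=0.05)
```

The offset_tol bound is what makes the later query-time local search cheap: the trained model guarantees its prediction is within a known distance of the true offset.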
The index models generated in this step are combined with the node allocation model generated in the sampling prediction step, with the node allocation model serving as the root model, finally realizing the hierarchical index model shown in fig. 9. A query starts from the root model; each layer of models selects a model in the next layer, and the last layer outputs the approximate position of the query value in the overall index data.
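The hierarchy of fig. 9 can be sketched as a two-level lookup; the closures below are illustrative stand-ins for the trained root and leaf models.

```python
# Sketch of the two-level hierarchy: the root (node allocation) model picks
# a leaf model, and the chosen leaf predicts the approximate position.
# Both "models" are simple closures standing in for trained networks.

def make_hierarchy(root, leaves):
    def query(key):
        leaf_id = root(key)                  # layer 1: choose a submodel
        return leaves[leaf_id](key)          # layer 2: approximate position
    return query

# toy setup: keys 0-99 handled by leaf 0, keys 100-199 by leaf 1
root = lambda k: 0 if k < 100 else 1
leaves = [lambda k: k // 2, lambda k: 50 + (k - 100) // 2]
lookup = make_hierarchy(root, leaves)
```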
[5] A query algorithm based on a learning index structure.
The query flow for the learning index structure is shown in fig. 9, and the specific steps are as follows:
(step 5.1) input the keyword value into the root model (i.e., the node allocation model) to obtain the number of the lower-layer model responsible for the keyword, and continue downward in the same way until the bottom-layer model is reached;
(step 5.2) predict the keyword's offset in the index data file with the bottom-layer model, and read the index file at the predicted offset;
(step 5.3) if the index entry read is not the right one, compare the current entry with the search value to decide whether to search forward or backward entry by entry until the correct value is matched;
(step 5.4) after the query, feed the query condition and the true offset back into the model for training.
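The correction in step 5.3 can be sketched as a short local scan; the index contents and function name are illustrative.

```python
# Sketch of step 5.3: when the model's predicted offset is slightly off, a
# short linear scan forward or backward from the prediction recovers the
# exact entry. The scan stays local when the model bounds its own error.

def last_mile_search(index, key, predicted):
    i = max(0, min(len(index) - 1, predicted))
    if index[i] < key:
        while i + 1 < len(index) and index[i] < key:
            i += 1                           # search forward
    else:
        while i > 0 and index[i] > key:
            i -= 1                           # search backward
    return i if index[i] == key else -1      # -1: key not present

index = [2, 4, 8, 16, 32, 64]
```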
[6] Compared with traditional index structures such as the B+ tree, the invention has the following advantages:
(1) The learning index structure only needs to store the underlying multi-layer perceptron models, occupying significantly less space than a traditional index structure.
(2) After a query, the learned index feeds the data and the true offset back into the model for training, which effectively improves query efficiency for repeated conditions, i.e., high-frequency query performance.
(3) The distributed index construction algorithm designed and implemented on the MapReduce programming model can build indexes quickly and concurrently using the computing resources of multiple nodes; it scales well, improves index construction efficiency, and improves the user experience of interactive applications.
(4) Index construction can be greatly accelerated by GPUs.
(5) Once constructed, the index can be reused for data with similar distributions, shortening model training time.
The seismic data query method of the embodiment has the following beneficial technical effects:
(1) According to the specified query, check whether an index satisfying the query condition already exists; if so, use it directly, otherwise construct one.
(2) Taking full advantage of high-performance clusters, the learned index is constructed quickly through parallelization on the MapReduce programming model, so that index construction tasks are scheduled as evenly as possible across the nodes, giving good performance scalability in large-scale cluster environments and a near-linear speedup.
(3) For the constructed learning index structure, the invention provides a corresponding query method for fast querying.
(4) The query performance of the learned index is substantially the same across different data scales.
(5) For data with similar distributions, the learned index can be reused to a certain extent, effectively reducing index construction time.
In yet another embodiment, the invention further provides a seismic data query method, specifically comprising the construction of a learned index and its use after construction. This embodiment divides the construction of the learned index into two parts: data sorting and index construction, where the data sorting part can be subdivided into sampling scanning and segment sorting. First, data sampling and distribution learning are performed based on the MapReduce framework and a machine learning model to generate the node allocation model; then, index generation and segment sorting are carried out with the node allocation model on the MapReduce framework, producing multiple ordered index files; finally, the learned index is constructed from the index files in parallel using the MapReduce framework. A learned-index query starts from the root node model; each layer of models selects a model in the next layer, and the last layer outputs the approximate position of the query value in the overall index data. The specific implementation steps of each part are as follows:
data sampling and data distribution learning based on MapReduce framework and machine learning model:
(step 1) in the Map stage, invoke the MLTraceScanInputFormat class to divide the data into blocks, determine the number of blocks according to the number of available nodes in the cluster, and store the blocks in InputSplit instances.
(step 2) after a Map Worker receives a task, it uses the MLTraceScanRecordReader class to read the trace-header data of each block by interval random extraction, reading one trace header at a time, extracting the selected keyword value from the header, and storing it in a key/value pair.
(step 3) the Map Worker sends each key/value pair to the single Reduce task through the Partitioner class's partition function GetPartition(const std::string& key, const std::string& value, int numPartitions).
(step 4) after the Reduce Worker receives all key/value pairs sent by the Map Workers, it sorts the received index-entry sample data using the MLTraceScanComparator class's Compare(const std::string& key1, const std::string& key2) function.
(step 5) after the Reduce part completes, count the number of distinct values of each keyword in the ordered index sample, distribute the index-entry samples to the nodes as evenly and orderly as possible, and generate the keyword-to-node allocation table.
(step 6) build an MLP model based on the Libtorch framework, and train it using the keyword values and node numbers in the keyword-node allocation table as the model's inputs and outputs, respectively.
(step 7) finally, save the trained model and its parameter information to nonvolatile storage.
Data ordering based on MapReduce programming model:
(step 1) in the Map stage, invoke the MLTraceSortInputFormat class to divide the data into blocks, determine the number of blocks according to the number of available nodes in the cluster, and store the blocks in InputSplit instances.
(step 2) after a Map Worker receives a task, it uses the MLTraceSortRecordReader class to read the trace-header data of each block, extract the selected keyword value from each header, and store it together with the current trace number in a key/value pair.
(step 3) the Map Worker loads the node allocation model through the MLTraceSortPartitioner class's partition function GetPartition(const std::string& key, const std::string& value, int numPartitions), inputs each index entry's keyword value into the model to obtain the corresponding node number, and thereby sends each key/value pair to the corresponding Reduce task.
(step 4) after a Reduce Worker receives the key/value pairs sent by the Map Workers, it sorts the received index data using the MLTraceSortComparator class's Compare(const std::string& key1, const std::string& key2) function, and finally writes the index data and keyword information to nonvolatile storage using the MLTraceSortRecordWriter class.
Learning index construction based on MapReduce programming model:
(step 1) in the Map stage, invoke the MakeMLIndexInputFormat class to divide the data into blocks, determine the number of blocks according to the number of index files, and store the blocks in InputSplit instances.
(step 2) after a Map Worker receives a task, it uses the MakeMLIndexRecordReader class to read the index data of each block; each time a fixed number of index entries has been read, they are collected as the training set of one bottom-layer model, and a bottom-layer model is generated, trained, and saved through the MakeIndexLeafModel(std::vector<std::string>& input_index_data, int model_id, double time_max, std::string save_path) function. After all leaf nodes have been created, the generation, training, and saving of the non-bottom-layer models of the learned index is completed bottom-up using the MakeIndexNodeModel(std::vector<std::vector<std::string>>& index_data_strs, double time_max, std::string save_path) function.
(step 3) the Map Worker sends each key/value pair to the single Reduce task (the number of Reduce tasks having been set to 1 by the SetNumReduceTasks(int num) function before the task starts).
Learning type index query:
(step 1) obtain the information of the current index from the given index path, and read the first keyword information file and the top-level keyword-node allocation model of the learned index.
(step 2) starting from the top-level model, search the start and end range of each given keyword through the search_index(vector<BAKeyValue>& vFromKey, vector<BAKeyValue>& vToKey, std::vector<IndexData_v>& keyDatas) function, and store all index entries satisfying the conditions into the keyDatas array.
(step 3) aggregate all index-entry results, continue filtering on other attributes (such as grouping, tolerance, etc.), and store the index entries that satisfy all conditions in a BATraceIndexs instance.
In some embodiments, referring to fig. 10, the present invention further provides a seismic data query apparatus 200, which includes:
A sampling learning module 201 configured to perform data sampling and distributed data learning on the seismic data based on the MapReduce framework and a machine learning model to generate a node allocation model;
A ranking module 202 configured to perform distributed segment ranking on the seismic data based on a MapReduce framework and the node allocation model to generate an ordered index file;
A construction module 203 configured to train, based on the MapReduce framework and the ordered index files, a learned index comprising bottom-layer and non-bottom-layer models, and to combine the learned index with the node allocation model to construct a learning index structure;
The query module 204 is configured to obtain a keyword to be queried and input the keyword to the learning index structure to obtain a query result.
The seismic data query device of this embodiment has the following beneficial technical effects: the learning index structure can learn and store the distribution of the data using a small amount of space, significantly reducing the index's storage footprint. Treated as a black box, the index structure is viewed as a model in terms of its input and output data: it predicts a value's position in the ordered data body by fitting the cumulative distribution function of the input data. Using a machine learning model to learn and predict the distribution of the seismic data thus realizes the index function; moreover, the learned index can be reused to a certain extent, effectively reducing index construction time and improving index construction efficiency.
It should be noted that, the specific limitation of the seismic data query device may be referred to the limitation of the seismic data query method hereinabove, and will not be described herein. The various modules in the seismic data query apparatus described above may be implemented in whole or in part in software, hardware, and combinations thereof. The above modules may be embedded in hardware or independent of a processor in the electronic device, or may be stored in software in a memory in the electronic device, so that the processor may call and execute operations corresponding to the above modules.
According to another aspect of the present invention, there is provided an electronic device, which may be a server, and an internal structure thereof is shown in fig. 11. The electronic device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the electronic device is for storing data. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements the seismic data query method described above.
According to yet another aspect of the present invention, a computer readable storage medium is provided, as shown in fig. 12, on which a computer program is stored, which when executed by a processor, implements the above-described seismic data query method.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A seismic data query method, characterized in that the method comprises:
performing data sampling and distributed data learning on seismic data based on a MapReduce framework and a machine learning model to generate a node allocation model;
performing distributed segment sorting on the seismic data based on the MapReduce framework and the node allocation model to generate ordered index files;
training, based on the MapReduce framework and the ordered index files, a learned index comprising bottom-layer models and non-bottom-layer models, and combining the learned index with the node allocation model to construct a learning index structure;
obtaining a keyword to be queried and inputting it into the learning index structure to obtain a query result.

2. The seismic data query method according to claim 1, characterized in that the step of performing data sampling and distributed data learning on seismic data based on the MapReduce framework and the machine learning model to generate a node allocation model comprises:
setting a sampling scale, setting the number of Reducers to 1, and using the MapReduce framework to distribute the sampling work evenly across the nodes;
in the Map stage, each node reading seismic traces at intervals according to the sampling scale and extracting the keyword information containing the selected keyword value from each trace, ensuring that the number of traces sampled matches the sampling scale;
in the Reduce stage, integrating the keyword information read, and treating the sampling result as the scan result of the whole data to generate an allocation table of each node's subsequent sorting data range;
based on the allocation table, taking the keyword value as input and the node number assigned to that value as output, and after normalization feeding the pairs into the machine learning model for training, weighted by each keyword value's share of the trace count, fitting the keyword's cumulative distribution function and saving it to generate the node allocation model.

3. The seismic data query method according to claim 1, characterized in that the step of performing distributed segment sorting on the seismic data based on the MapReduce framework and the node allocation model to generate ordered index files comprises:
in the Map stage, dividing the trace-header data evenly into multiple segments, each Map task being responsible for reading one segment, and distributing these Map tasks evenly to the Map Workers of the nodes in the cluster;
after receiving a Map task, a Map Worker processing the corresponding trace-header data segment, reading all trace headers, extracting the selected keyword data and the trace number from each trace in sequence as index entries, and storing them as key/value pairs;
the Map Worker obtaining the node number of each index entry by calling the node allocation model, and thereby sending each key/value pair to the corresponding Reduce Worker for sorting;
after receiving all key/value pairs sent by the Map Workers, a Reduce Worker sorting the index data stored in them by keyword order, finally storing the sorted index data in the file corresponding to that Reduce Worker, and writing the statistics of each keyword into the corresponding keyword information file.

4. The seismic data query method according to claim 1, characterized in that the step of training, based on the MapReduce framework and the ordered index files, a learned index comprising bottom-layer models and non-bottom-layer models and combining the learned index with the node allocation model to construct a learning index structure comprises:
in the Map stage, invoking the MakeMLIndexInputFormat class to divide the data into blocks, determining the number of blocks according to the number of index files, and storing the blocks in InputSplit instances;
after receiving a task, a Map Worker using the MakeMLIndexRecordReader class to read the index data of each block, collecting each fixed-size run of index entries as the training set of one bottom-layer model of the learned index, and generating, training, and saving a bottom-layer model through the MakeIndexLeafModel function;
after all leaf nodes have been created, using the MakeIndexNodeModel function to complete, bottom-up, the generation, training, and saving of the non-bottom-layer models of the learned index;
using the node allocation model as the root node model of the learned index to obtain the learning index structure, wherein a query of the learning index structure starts from the root node model, each layer of models selects a model in the next layer, and the last layer outputs the approximate position of the query value in the overall index data.

5. The seismic data query method according to claim 4, characterized in that the step of obtaining a keyword to be queried and inputting it into the learning index structure to obtain a query result comprises:
obtaining the keyword to be queried input by the user, inputting the keyword value into the root node model of the learning index structure to obtain the number of the lower-layer model responsible for the keyword, and continuing downward in the same way until the bottom-layer model is reached;
predicting the keyword's offset in the index data file with the bottom-layer model, and reading the index file at the predicted offset;
if the index entry read is not the right one, comparing the current entry with the search value to decide whether to search forward or backward entry by entry until the correct value is matched.

6. The seismic data query method according to claim 5, characterized in that the method further comprises:
after the query, feeding the query condition and the true offset back into the model as training data in order to update the learning index structure.

7. The seismic data query method according to claim 1, characterized in that the machine learning model is a multi-layer perceptron model.

8. A seismic data query apparatus, characterized in that the apparatus comprises:
a sampling learning module configured to perform data sampling and distributed data learning on seismic data based on a MapReduce framework and a machine learning model to generate a node allocation model;
a sorting module configured to perform distributed segment sorting on the seismic data based on the MapReduce framework and the node allocation model to generate ordered index files;
a construction module configured to train, based on the MapReduce framework and the ordered index files, a learned index comprising bottom-layer models and non-bottom-layer models, and to combine the learned index with the node allocation model to construct a learning index structure;
a query module configured to obtain a keyword to be queried and input it into the learning index structure to obtain a query result.

9. An electronic device, characterized by comprising:
at least one processor; and
a memory storing a computer program runnable on the processor, wherein the processor, when executing the program, performs the method of any one of claims 1-7.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, performs the method of any one of claims 1-7.
CN202310594488.4A 2023-05-24 2023-05-24 A seismic data query method, device, electronic equipment and medium Pending CN119025481A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310594488.4A CN119025481A (en) 2023-05-24 2023-05-24 A seismic data query method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310594488.4A CN119025481A (en) 2023-05-24 2023-05-24 A seismic data query method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN119025481A true CN119025481A (en) 2024-11-26

Family

ID=93537891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310594488.4A Pending CN119025481A (en) 2023-05-24 2023-05-24 A seismic data query method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN119025481A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119356647A (en) * 2024-12-25 2025-01-24 中国石油集团东方地球物理勘探有限责任公司 Index construction method, index library and query method for massive seismic data


Similar Documents

Publication Publication Date Title
Yagoubi et al. DPiSAX: Massively Distributed Partitioned iSAX
US20130151535A1 (en) Distributed indexing of data
JP6418431B2 (en) Method for efficient one-to-one coupling
CN113254630B (en) A Domain Knowledge Graph Recommendation Method for Global Comprehensive Observation Results
CN110019384A (en) A kind of acquisition methods of blood relationship data provide the method and device of blood relationship data
US11288266B2 (en) Candidate projection enumeration based query response generation
WO2021012861A1 (en) Method and apparatus for evaluating data query time consumption, and computer device and storage medium
KR102640444B1 (en) Big data augmented analysis profiling method and apparatus for maximizing big data reliability and usability
US12026162B2 (en) Data query method and apparatus, computing device, and storage medium
CN115062016A (en) Incidence relation extraction method and device and computer equipment
WO2019209674A1 (en) Systems and methods for designing data structures and synthesizing costs
CN112925821A (en) MapReduce-based parallel frequent item set incremental data mining method
KR102541934B1 (en) Big data intelligent collecting system
Zou et al. Survey on learnable databases: A machine learning perspective
CN119025481A (en) A seismic data query method, device, electronic equipment and medium
CN112639786A (en) Intelligent landmark
CN116414822A (en) Method and device for constructing seismic data index library, related equipment and index library
KR102605933B1 (en) Method for allocating work space on server based on instance feature and apparatus for performing the method
US20200311141A1 (en) Filter evaluation in a database system
US20240152515A1 (en) Query graph embedding
Burdakov et al. Predicting SQL Query Execution Time with a Cost Model for Spark Platform.
WO2022217419A1 (en) Neural network model inference method and apparatus, computer device, and storage medium
CN119356647B (en) Index construction method, index library and query method for massive seismic data
US20170031909A1 (en) Locality-sensitive hashing for algebraic expressions
CN109241098B (en) Query optimization method for distributed database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination