WO2018209594A1

WO2018209594A1 - Olap cube optimization using weightings

Info

Publication number: WO2018209594A1
Application number: PCT/CN2017/084707
Authority: WO
Inventors: Qiaoneng QIAN; Ken Wang; Gang Ma; Yanghong ZHONG; Julian PAN; Nemo CHEN; Seshu Adunuthula; Samant MEDHA
Original assignee: Ebay Inc.
Priority date: 2017-05-17
Filing date: 2017-05-17
Publication date: 2018-11-22

Abstract

An OLAP cube comprising a plurality of cuboids can be optimized by analyzing a computational efficiency for each of the cuboids in the OLAP cube. The computational efficiency of a given cuboid can be offset by a likelihood of the cuboid being accessed by future queues issued against the OLAP cube. The likelihood can be generated based at least in part on past queues issued against the OLAP cube. A user interface can display a representation of the OLAP cube before and after a suggested optimization process is performed in parallel via a Hadoop platform.

Description

OLAP CUBE OPTIMIZATION USING WEIGHTINGS

TECHNICAL FIELD

Embodiments of the present disclosure relate generally to data structure management and， more particularly， but not by way of limitation， to data structure optimization using weightings.

BACKGROUND

Some multidimensional data structures， such as Online Analytical Processing (OLAP) cubes， have pre-generated data in called measures that are categorized by dimensions. As the amount of data tracked by the multidimensional data structure increases， the time it takes for the multidimensional data structure to service queries becomes impractical. Further， in some implementations， the amount of data to be tracked by the multidimensional data structure is so large that building the entire structure can be impractical.

BRIEF DESCRIPTION OF THE DRAWINGS

Various ones of the appended drawings merely illustrate example embodiments of the present disclosure and should not be considered as limiting its scope.

FIG. 1 is a block diagram illustrating a networked system， according to some example embodiments.

FIG. 2 illustrates a block diagram showing components provided within an OLAP tuner system， according to some embodiments.

FIG. 3 shows a flow diagram of a method for implementing OLAP cube optimization， according to some example embodiments.

FIG. 4 shows a flow diagram of a method of determining a set of cuboids to include in an OLAP cube， according to some example embodiments.

FIG. 5 shows an example OLAP cube as a lattice， according to some example embodiments.

FIG. 6 shows an example lattice of an OLAP cube with ordered pairs for row counts， according to some example embodiments.

FIG. 7A shows an example source cost lattice， according to some example embodiments.

FIG. 7B shows an example actual cost lattice， according to some example embodiments.

FIG. 8 shows a user interface for management of an OLAP cube， according to some example embodiments.

FIG. 9 shows a user interface resulting from optimizing an OLAP cube， according to some example embodiments.

FIG. 10 shows an example block diagram for modifying portions of an OLAP cube in parallel， according to some example embodiments.

FIG. 1 1 illustrates a diagrammatic representation of a machine in the form of a computer system within which a set of instructions can be executed for causing the machine to perform any one or more of the methodologies discussed herein， according to an example embodiment.

DETAILED DESCRIPTION

The description that follows includes systems， methods， techniques， instruction sequences， and computing machine program products that embody illustrative embodiments of the disclosure. In the following description， for the purposes of explanation， numerous specific details are set forth in order to provide an understanding of various embodiments of the inventive subject matter. It will be evident， however， to those skilled in the art， that embodiments of the inventive subject matter can be practiced without these specific details. In general， well-known instruction instances， protocols， structures， and techniques are not necessarily shown in detail.

In order to address the challenges described above， an OLAP tuner system in one example implements a scheme (e.g.， an algorithm) to determine a beneficial set of cuboids to be pre-calculated for inclusion in the OLAP cube. The algorithm chooses cuboids using their benefit and cost， as weighted by query frequencies.

With reference to FIG. 1， an example embodiment of a high-level client-server-based network architecture 100 is shown. A networked system 102 provides server-side functionality via a network 104 (e.g.， the Internet or a wide-area network (WAN) ) to one or more client devices 110. In some implementations， a user 106 interacts with the networked system 102 using the client device 110， e.g.， to issue queries against a database 126. FIG. 1 illustrates， for example， a web client 112 (e.g.， a browser) ， a client application 114， and a programmatic client 116 executing on the client device 110. The client device 110 includes the web client 112， the client application 114， and the programmatic client 116 alone， together， or in any suitable combination. Although FIG. 1 shows one client device 110， in other implementations， the network architecture 100 comprises multiple client devices.

In various implementations， the client device 110 comprises a computing device that includes at least a display and communication capabilities that provide access to the networked system 102 via the network 104. The client device 110 comprises， but is not limited to， a remote device， work station， computer， general-purpose computer， Internet appliance， hand-held device， wireless device， portable device， wearable computer， cellular or mobile phone， Personal Digital Assistant (PDA) ， smart phone， tablet， ultrabook， netbook， laptop， desktop， multi-processor system， microprocessor-based or programmable consumer electronic system， game console， set-top box， network Personal Computer (PC) ， mini-computer， and so forth. In an example embodiment， the client device 110 comprises one or more of a touch screen， accelerometer， gyroscope， biometric sensor， camera， microphone， Global Positioning System (GPS) device， and the like.

The client device 110 communicates with the network 104 via a wired or wireless connection. For example， one or more portions of the network 104 comprise an ad hoc network， an intranet， an extranet， a virtual private network (VPN) ， a local-area network (LAN) ， a wireless LAN (WLAN) ， a WAN， a wireless WAN (WWAN) ， a metropolitan area network (MAN) ， a portion of the Internet， a portion of the public switched telephone network (PSTN) ， a cellular telephone network， a wireless network， a

network， a Worldwide Interoperability for Microwave Access (WiMax) network， another type of network， or any suitable combination thereof.

In some example embodiments， the client device 110 includes one or more applications (also referred to as “apps” ) such as， but not limited to， web browsers， book reader apps (operable to read e-books) ， media apps (operable to present various media forms including audio and video) ， fitness apps， biometric monitoring apps， messaging apps， and electronic mail (email) apps. In some implementations， the client application 114 includes various components operable to present information to the user 106 and communicate with the networked system 102.

The web client 112 accesses the various systems of the networked system 102 via the web interface supported by a web server 122. Similarly， the programmatic client 116 and client application 114 access the various services and functions provided by the networked system 102 via the programmatic interface provided by an Application Programming Interface (API) server 120.

Users (e.g.， the user 106) comprise a person， a machine， or other means of interacting with the client device 110. In some example embodiments， the user is not part of the network architecture 100， but interacts with the network architecture 100 via the client device 110 or another means. For instance， the user provides input (e.g.， touch screen input or alphanumeric input) to the client device 110 and the input， e.g.， a query or optimization request， is communicated to the networked system 102 via the network 104. In this instance， the networked system 102， in response to receiving the input from the user， communicates information to the client device 110 via the network 104 to be presented to the user. In this way， the user can interact with the networked system 102 using the client device 110.

The API server 120 and the web server 122 are coupled to， and provide programmatic and web interfaces respectively to， one or more application servers 140. The application server 140 can host an OLAP tuner system 150， which can comprise one or more engines or applications， each of which can be embodied as hardware， software， firmware， or any combination thereof. The application server 140 is， in turn， shown to be coupled to a database server 124 that facilitates access to one or more information storage repositories， such as the database 126. In some example embodiments， the database server 124 is a server specially configured to access the type of data structure stored in the database 126. For example， the database server 124 is an OLAP database server configured to receive OLAP queries from the client device 110 and retrieve query results from an OLAP cube 160 in the database 126.

In an example embodiment， the database 126 comprises one or more storage devices that store information to be optimized or tuned by the OLAP tuner system 150 or service queries issued by the client device 110. Additionally， a third-party application 132， executing on a third-party server 130， is shown as having programmatic access to the networked system 102 via the programmatic interface provided by the API server 120. For example， the third-party application 132， utilizing information retrieved from the networked system 102， supports one or more features or functions on a website hosted by a third party. In some implementations， the OLAP tuner system 150 provides functionality to analyze a multidimensional data structure， such as the OLAP cube 160 stored in the database 126. The OLAP tuner system 150 will be discussed further in connection with FIG. 2 below.

Further， while the client-server-based network architecture 100 shown in FIG. 1 employs a client-server architecture， the present inventive subject matter is， of course， not limited to such an architecture， and can equally well find application in a distributed， or peer-to-peer， architecture system， for example. The various systems of the application server 140 (e.g.， the OLAP tuner system 150) can also be implemented as standalone software programs， which do not necessarily have networking capabilities.

FIG. 2 illustrates a block diagram showing components provided within the OLAP tuner system 150， according to some embodiments. In various example embodiments， the OLAP tuner system 150 comprises a generator engine 205， a gain engine 210， a weighting engine 215， a modifier engine 220， and an interface engine 225. The components themselves are communicatively coupled (e.g.， via appropriate interfaces) to each other and to various data sources， so as to allow information to be passed between or among the components or so as to allow the components to share and access common data. Furthermore， the components access the database 126 via the database server 124.

The generator engine 205 is configured to generate an OLAP cube having a plurality of cuboids. The gain engine 210 is configured to determine which cuboids， if materialized or generated， result in the greatest gains in computational efficiency. In some example embodiments， the gain engine 210 generates gain factors for use in analyzing which of the cuboids will result in the greatest computational efficiency. The weighting engine 215 is configured to work in concert with the gain engine 210 to generate weighted gain factors. In some example embodiments， the weighting engine 215 generates， for a given cuboid， the probability that the cuboid will be accessed (e.g.， hit) by future queries. In some example embodiments， the probability is calculated as a probability factor that is generated using a set of past queries issued against the OLAP cube 160. The modifier engine 220 is configured to modify the OLAP cube 160 by adding or removing cuboids to or from the OLAP cube 160 based on data such as weighted gain factors of the cuboids. The interface engine 225 is configured to generate a user interface to visualize the cuboids as segments in a graph. The segments may be modified to have visual indicators (e.g.， color coding) that indicate access frequency for a given cuboid. Users can use the interface to understand the layout of a given cuboid and issue commands to modify or otherwise optimize the cuboid by adding or removing cuboids to or from the OLAP cube.

OLAP cubes have pre-generated data in called measures that are categorized by dimensions. FIG. 1 shows an example OLAP cube 160 stored in the database 126. In the depicted OLAP cube 160， three dimensions are shown. The Z dimension is “Time” . Slices of the Z dimension can specify different granularities of time， such as quarters (e.g.， Q1， Q2， Q3， Q4) or months (e.g.， January， February， etc. ) . The Y dimension tracks web pages. Slices of the Y dimension can specify different types of web pages (e.g.， home page， catalog page， item page 1， item page 2， etc. ) . The X dimension tracks user actions. Slices of the X dimension can specify different user actions (e.g.， page views， sign ups， likes， add to baskets) . The example OLAP cube 160 in FIG. 1 can be used to quickly service complex queries involving user actions on an enterprise website or network platform. Although the example OLAP cube 160 has three dimensions， OLAP cubes can have more dimensions， and thus can be hypercube data structures.

OLAP cubes which have more than three dimensions can be shown in a lattice or hierarchical form to help visualize different views of the hyper-dimensional data structure. FIG. 5 shows an example OLAP cube as a lattice 500. The four dimensions in the cube are time， item， location， and supplier. From bottom to top， each level of the lattice decreases by a dimension. Each node (depicted as a circle) is a view or cuboid having some or all of the total dimensions. The four-dimensional base “cuboid” comprises all four dimensions. The next level up has four three-dimensional (3D) cuboids， each of the 3D cuboids excluding a dimension. The next level up has six two-dimensional (2D) cuboids， each of the 2D cuboids excluding two dimensions. The next level up has four one-dimensional cuboids， each featuring a single dimension. Further up is the zero-dimensional apex cuboid， which contains zero dimensions.

Each of the cuboids has pre-calculated values per the dimension (s) . Users can issue queries against the different cuboids， depending on what information they need. For example， a user who has a query that doesn’t require analysis of the supplier dimension can have the query addressed to the 3D cuboid having dimensions of time， item， and location. Further， administrators or the applications managing the OLAP cube can partially build a high-dimensional OLAP cube based on the fact that queries addressing a lower-dimensional cuboid can be answered by its parent cuboid since the parent cuboid has all of the information of the lower-dimensional cuboid. For example， a query addressed to a time and location 2D cuboid can be serviced by a 3D cuboid having time， item， and location dimensions since the 3D cuboid has all of the data in the 2D cuboid (e.g.， times and locations) .

As the amount of data to be tracked grows， some cubes may require 20 or more dimensions， which can cause access to the cube to be slow. Further， in some implementations， building and managing a 20-dimensional OLAP cube may not be computationally practical. To make OLAP cubes more manageable， some systems only partially build an OLAP cube， building some of the cuboids and forgoing building others based on the fact that the built cuboids can answer queries addressed to the unbuilt cuboids， as discussed above.

To determine which cuboids should be included in a partially built OLAP cube， each cuboid can be analyzed using a gain (e.g.， benefit) versus cost analysis. In this approach， the cuboids that yield the greatest benefits in view of their respective costs are included in the built OLAP cube. The cost of a given cuboid includes the space cost and the query cost. The space cost includes the resources spent to build the cuboid (e.g.， row count of cube， storage space) . The query cost includes the resources (e.g.， time， processing power to scan row count of a cube) spent to accomplish a query against a given cuboid. The benefit of a given cuboid is correlated to reducing the query cost by scanning the row count of another cuboid (e.g.， a parent of the cuboid) instead of the given cuboid itself.

Analytically， let C (i) denote the cost of a given cuboid (i) and the operator ≤ denote that one cuboid answers the query addressed to another cuboid. For example， consider two cuboids： c1， c2. If c1 ≤ c2， then a query issued to c1 can be answered by c2. Further， for each w ≤ i， let u be the cuboid of least cost in set “S” such that w ≤ u. To maximize benefit， a max function can be implemented： B_w ＝ max {C (u) -C (i) ， 0} . Finally， the benefit of a cuboid (i) relative to a set of cuboids S， B (i， S) ， is given by B (i， S) ＝ ∑_w≤i B_w.

Algorithmically， this approach can be implemented in a number of iterations as shown in the example algorithm below. In each iteration， the benefit of each cuboid is calculated per the formula above. Then the cuboid with the most benefit is added to the set “S” ， which is then used to build or modify the OLAP cube.

As shown in this example algorithm， the set “S” is complete when the remaining cuboids no longer maximize B (i， S) .

FIG. 6 shows an example lattice 600 of an OLAP cube， with ordered pairs (i， j) next to each cuboid， where “i” is the row count of the given cuboid， and “j” is the row count of another cuboid from which the given cuboid can be calculated. For example， cuboid “b” has a row count of 50， and can be calculated from cuboid “a” which has a row count of 100. The example lattice 600 is organized from the top down such that the top cuboid “a” is the base cuboid with the most dimensions， and the number of dimensions in the cuboids decreases from top to bottom.

As an example of determining which cuboids should be included in the set “S” ， the benefit for all cuboids can be calculated as follows. The benefit for cuboid “b” is the difference between the row count of “b” and the row count of a cuboid from which “b” can be calculated (which is cuboid “a” ) ， multiplied by the number of cuboids that materializing or building “b” can satisfy. In particular， the benefit of “b” in a first iteration， “b₁” ， is 250； that is， b₁ → (100-50) *5 ＝ 250. The 100 is from “a” ， and the 50 is from “b” . Further， if “b” is materialized， then cuboids “d” ， “e” ， “g” ， and “h” can use “b” to satisfy their queries； thus the multiplier of 5.

Similar calculations can be performed for the other cuboids as follows： c₁→ (100-75) *5 ＝ 125；d₁→ (100-20) *2 ＝ 160；e₁→ (100-30) *3 ＝ 210；f₁→ (100-40) *2＝ 120；g₁→ (100-1) *1 ＝99； andh₁→ (100-10) *1 ＝ 90.

In the first iteration， “b” has the greatest benefit； thus， “b” is added to set “S” . Although “e” has the second-highest benefit， the analysis changes in the second iteration because the materialization of cuboid “b” changes the benefit results of the other cuboids.

In a second iteration， both cuboids “b” and “a” are materialized and in set “S” . To determine the benefit of “e” ， the calculation is e₂→ (50-30) *3 ＝ 60. The 50 is from “b” which is now materialized and is preferable to “a” because “a” has a row count of 100 versus the 50 of “b” . Further， as per above， the 3 is from the fact that if “e” is materialized， it can manage queries for “g” and “h” . Thus， instead of 250， “e” has a benefit of 60 in the second iteration.

c₂→ (100-75) *2 ＝ 50；

d₂→ (50-20) *2 ＝ 60；

f₂→ (100-40) + (50-40) ＝ 70；

g₂→ (50-1) *1 ＝ 49； and

h₂： (50-10) *1 ＝ 40.

Note that if “f” is chosen to be materialized， then “h” can be more efficiently calculated using “f” instead of “b” . From the above， we see that “f” has the greatest benefit or gain in the second iteration； thus， “f” is selected and added to set “S” . At the end of the iterations， the OLAP cube is either built using the cuboids in set “S” ， or ifthe OLAP cube exists， then the OLAP cube is modified to only include the cuboids in the set “S” (to the extent that the existing OLAP cube does not already contain the cuboids in set “S” ) . In some example embodiments， the iterations end when materializing further cuboids no longer maximizes B (i ， S) . In some example embodiments， the iterations end when a pre-specified number of cuboids are materialized.

If the OLAP cube already exists， then in some implementations a set of queries issued by users has already been issued against the OLAP cube. Further， in some implementations where the OLAP cube has not yet been built， there still may be queries that have been submitted by users for a given network or platform. The queries can be used to determine the importance of the cuboids based on how frequently the cuboids are hit or accessed to answer a query.

The queries issued by users can make up a query sequence， such as

where j_k ∈ {1， 2， 3， ...， N} and k ∈ {1， 2， 3 ， ...， M} . An estimation of the probability，

that a given cuboid， o_l， is hit can de determined as follows：

To determine which cuboids should be included in an initial build of an OLAP cube， the full cube can correspond to a source cost lattice 700， as show in FIG. 7A. Given a dimension set D ＝ {d₁， d₂， d₃ ... d_L} ， the corresponding cuboid set can be

Further， f_p： O → φ_o， where φ_o is the set of all the subsets of O. For each cuboid o_l， f_p (o_l) denotes the parent cuboid set of o_l， in which each element can roll up to be o_l. Roll-up is an operation for OLAP data structures in which one dimension is rolled up into another having greater granularity， or one dimension is removed from the OLAP data structure.

Further， the target cube corresponds to an actual cost lattice 705， shown in FIG. 7B. Given a query set Q ＝ {q₁， q₂， q₃ ... q_N} ， f_s： Q → O. For each query q_i， where i ∈ {1， 2， 3， ...， N} ， there is a corresponding source cuboid to be hit f_s (q_i) ， where f_s (q_i) ∈ O. Further， f_t： O → O. For each source cuboid o_l， there is a target cuboid used for scanning results f_t (o_l) ， where f_t (o_l) ∈ f_p (o_l) .

According to some example embodiments， the steps for planning a cuboid include： (1) identify mandatory cuboid set， (2) estimate the cost for cuboids in O^m， to get source cost lattice T_s， (3) identify cuboid set for selection， O^s， (4) initialize the actual cost lattice T_a， (5) select cuboids from O^s based on T_s and T_a to get

and (6) return O^r ＝ {O^m， O^e} .

For a random query， ifthe row count rolling up from target cuboid f_t (o_l) to source cuboid o_l is expected to be larger than a threshold， then cuboid o_l can be set to be mandatory. For query q_i，

denotes the row count rolling up from target cuboid f_t (f_s (q_i) ) to source cuboid f_s (q_i) . IfM (the number of queries) is large enough， then the expectation of the row count rolling up from target cuboid f_t (o_l) to source cuboid o_l is as follows：

According to some example embodiments， a threshold， ε， and a probability of a cuboid being hit can be used to determine whether the cuboid should be considered mandatory and built in the initial OLAP cube. In particular， for example， if

where， for example， ε ＝ 0.01 *500000 ＝ 5000， then cuboid o_l is set to mandatory.

In determining the initial cost for mandatory cuboids， let

denote the row count of cuboid o_l. Further， f_p： O → φ_o， where φ_o is the set of all the subsets of O. For each cuboid o_l， f_p (o_l) denotes the parent cuboid set of o_l， in which each element can roll up to be o_l. For a mandatory cuboid o_l，

To weight the cost by the probability， the expectation of a future query cost can be determined as follows：

where

is the probability of cuboid o_l being hit.

As discussed above， the benefit or gain is the reduction of cost. To weight the gain by the probability，

Then， factoring：

Equation 5 can be implemented to perform benefit/cost analysis with the likelihood of the cuboids being hit integrated into the approach. Accordingly， cuboids can be added to set “S” and included in the OLAP cube by using the above equations in the algorithm discussed above.

FIGS. 3 and 4 display methods for implementing the above approaches using flow diagrams. FIG. 3 shows a flow diagram of a method 300 for implementing OLAP cube optimization， according to some example embodiments. At operation 305， the generator engine 205 generates an OLAP cube. In some example embodiments in which the OLAP cube already exists， at operation 305 the generator engine 205 identifies the pre-existing OLAP cube， e.g.， OLAP cube 160. At operation 310， the gain engine 210 identifies a set of queries. The queries are past queries issued by users (e.g.， user 106) against the OLAP cube 160 in the database 126， according to some example embodiments. At operation 315， the gain engine 210 generates gain factors for each of the cuboids in the OLAP cube 160. At operation 320， the weighting engine 215 generates weighted gain factors for each of the cuboids in the OLAP cube 160. At operation 325， the modifier engine 220 modifies the OLAP cube 160 to include only the highest-ranking cuboids (e.g.， the cuboids in set “S” ) according to the weighted gain factors. In some example embodiments， the operations of generating the gain factors， weighting the gain factors， and selecting the highest-ranking cuboids for inclusion in the OLAP cube 160 are performed in multiple iterations， as discussed further with respect to FIG. 4.

FIG. 4 shows a flow diagram of a method 400 of determining a set of cuboids to include in the OLAP cube 160， according to some example embodiments At operation 405， the weighting engine 215 identifies the cuboids of the OLAP cube 160. At operation 410， in a first iteration， the weighting engine 215 generates the weighted gain factor for each of the cuboids (e.g.， using Equation 5) . At operation 415， the weighting engine 215 adds the cuboid having the highest weighted gain factor to the set “S” . At operation 420， the gain engine 210 determines whether there are additional iterations of weighted gain factors to be analyzed. In some example embodiments， if there are additional parent cuboids (e.g.， cuboids having child cuboids with fewer dimensions) ， then there are additional iterations and the method 400 continues to operation 410 where an additional iteration is initiated. In some example embodiments， there is a next iteration if set “S” does not have “k” cuboids in the set， where “k” is a pre-specified value. In some example embodiments， there are no additional iterations where B (i， S) is no longer maximized. Continuing the example， if at operation 420， the weighting engine 215 determines that there are no additional iterations， then the method 400 continues to operation 425. At operation 425， the modifier engine 220 modifies the OLAP cube 160 by adding or removing cuboids so that the OLAP cube 160 consists only of cuboids in the set “S” ， which have been predetermined as having the greatest weighted benefit to cost.

FIG. 8 shows a user interface 800 for management of an OLAP cube， according to some example embodiments. The user interface 800 displays a visual representation of an OLAP cube as a sunburst chart 810. The sunburst chart 810 shows the different cuboids as different segments. The circle segment in the center of the sunburst chart 810 is the full cube， and child cuboid segments having fewer dimensions extend radially outward. The segments are color-coded according to how often they are accessed or hit by queries. The highest-frequency segments， e.g.， segment 815， are most darkly color-coded. The lowest-frequency segments， e.g.， segment 820， are more lightly color-coded. Some of the lowest-frequency segments， e.g.， segment 825， may be color-coded or otherwise flagged to be retired by removing them from the OLAP cube. The segments to be retired can be determined using the above approaches to determine which segments correspond to segments not in set “S” . A user can issue a request to optimize the cube by using a button 830， which initiates the weighted gain analysis discussed above.

FIG. 9 shows a user interface 900 resulting from optimizing a cube， according to some example embodiments. A visual representation such as a sunburst graph 905 shows the segments of an OLAP cube after optimization. As illustrated， many of the segments have been removed.

In some example embodiments， the sunburst chart is from the Angular-nvD3 library. On the backend， the interface engine 225 implements a REST API to return the access frequency information from the cuboid tree or lattice. On the frontend， the interface engine 225 implements a JavaScript (JS) controller to parse the REST response into a relative JSON format， which is then used to render a sunburst chart on a display device.

An example of the REST service is included below：

An example of the JS controller is included below：

FIG. 10 shows an example block diagram 1000 for modifying portions of an OLAP cube in parallel， according to some example embodiments. The OLAP cube may be implemented using Apache Kylin， which is configured to manage OLAP cubes from Hadoop. An OLAP cube may be distributed into segments on Kylin. Metrics， including which cuboids have been accessed and how frequently， can be collected in parallel for each segment. The block diagram 1000 shows an example cube at three points in time. The cube at a first point in time 1005 contains three segments S1， S2， S3. In response to an optimization request， each segment of the cube may be analyzed using the above approaches to determine which cuboids should be included for each of the respective segments. The cube at a second later point in time 1010 comprises the segments S1， S2， and S3， as well as their respective suggested formats， S1 ’ ， S2’ ， and S3’ ， per the weighted benefit analysis. The cube at a third later point in time 1015 has replaced the original segments S 1， S2， and S3 with the optimized segments S1’ ， S2’ ， and S3’ in parallel. In this way， the OLAP cube can be efficiently monitored and managed， according to some example embodiments.

Certain embodiments are described herein as including logic or a number of components， modules， or mechanisms. Modules can constitute either software modules (e.g.， code embodied on a machine-readable medium) or hardware modules. A “hardware module” is a tangible unit capable of performing certain operations and can be configured or arranged in a certain physical manner. In various example embodiments， one or more computer systems (e.g.， a standalone computer system， a client computer system， or a server computer system) or one or more hardware modules of a computer system (e.g.， a processor or a group of processors) can be configured by software (e.g.， an application or application portion) as a hardware module that operates to perform certain operations as described herein.

In some embodiments， a hardware module can be implemented mechanically， electronically， or any suitable combination thereof. For example， a hardware module can include dedicated circuitry or logic that is permanently configured to perform certain operations. For example， a hardware module can be a special-purpose processor， such as a Field-Programmable Gate Array (FPGA) or an Application-Specific Integrated Circuit (ASIC) . A hardware module may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example， a hardware module can include software executed by a general-purpose processor or other programmable processor. Once configured by such software， hardware modules become specific machines (or specific components of a machine) uniquely tailored to perform the configured functions and are no longer general-purpose processors. It will be appreciated that the decision to implement a hardware module mechanically， in dedicated and permanently configured circuitry， or in temporarily configured circuitry (e.g.， configured by software) can be driven by cost and time considerations.

Accordingly， the phrase “hardware module” should be understood to encompass a tangible entity， be that an entity that is physically constructed， permanently configured (e.g.， hardwired) ， or temporarily configured (e.g.， programmed) to operate in a certain manner or to perform certain operations described herein. As used herein， “hardware-implemented module” refers to a hardware module. Considering embodiments in which hardware modules are temporarily configured (e.g.， programmed) ， each of the hardware modules need not be configured or instantiated at any one instance in time. For example， where a hardware module comprises a general-purpose processor configured by software to become a special-purpose processor， the general-purpose processor may be configured as respectively different special-purpose processors (e.g.， comprising different hardware modules) at different times. Software accordingly configures a particular processor or processors， for example， to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.

Hardware modules can provide information to， and receive information from， other hardware modules. Accordingly， the described hardware modules can be regarded as being communicatively coupled. Where multiple hardware modules exist contemporaneously， communications can be achieved through signal transmission (e.g.， over appropriate circuits and buses) between or among two or more of the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times， communications between or among such hardware modules may be achieved， for example， through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example， one hardware module can perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware module can then， at a later time， access the memory device to retrieve and process the stored output. Hardware modules can also initiate communications with input or output devices， and can operate on a resource (e.g.， a collection of information) .

The various operations of example methods described herein can be performed， at least partially， by one or more processors that are temporarily configured (e.g.， by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured， such processors constitute processor-implemented modules that operate to perform one or more operations or functions described herein. As used herein， “processor-implemented module” refers to a hardware module implemented using one or more processors.

Similarly， the methods described herein can be at least partially processor-implemented， with a particular processor or processors being an example of hardware. For example， at least some of the operations of a method can be performed by one or more processors or processor-implemented modules. Moreover， the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as “software as a service” (SaaS) . For example， at least some of the operations may be performed by a group of computers (as examples of machines including processors) ， with these operations being accessible via a network (e.g.， the Internet) and via one or more appropriate interfaces (e.g.， an API) .

The performance of certain of the operations may be distributed among the processors， not only residing within a single machine， but deployed across a number of machines. In some example embodiments， the processors or processor-implemented modules can be located in a single geographic location (e.g.， within a home environment， an office environment， or a server farm) . In other example embodiments， the processors or processor-implemented modules are distributed across a number of geographic locations.

The modules， methods， applications， and so forth described in conjunction with FIGS. 1-10 are implemented in some embodiments in the context of a machine and an associated software architecture. The sections below describe a representative software architecture and machine (e.g.， hardware) architecture that are suitable for use with the disclosed embodiments.

FIG. 11 is a block diagram illustrating components of a machine 1100， according to some example embodiments， able to read instructions from a machine-readable medium (e.g.， a machine-readable storage medium) and perform any one or more of the methodologies discussed herein. Specifically， FIG. 11 shows a diagrammatic representation of the machine 1100 in the example form of a computer system， within which instructions 1116 (e.g.， software， a program， an application， an applet， an app， or other executable code) for causing the machine 1100 to perform any one or more of the methodologies discussed herein can be executed. For example， the instructions 1116 can cause the machine 1100 to execute the flow diagrams of FIGS. 3 and 4. Additionally， or alternatively， the instructions 1116 can implement the generator engine 205， the gain engine 210， the weighting engine 215， the modifier engine 220， and the interface engine 225 of FIG. 2， and so forth. The instructions 1116 transform the general， non-programmed machine into a particular machine programmed to carry out the described and illustrated functions in the manner described. In alternative embodiments， the machine 1100 operates as a standalone device or can be coupled (e.g.， networked) to other machines. In a networked deployment， the machine 1100 may operate in the capacity of a server machine or a client machine in a server-client network environment， or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 1100 can comprise， but not be limited to， a server computer， a client computer， a personal computer (PC) ， a tablet computer， a laptop computer， a netbook， a set-top box (STB) ， a personal digital assistant (PDA) ， an entertainment media system， a cellular telephone， a smart phone， a mobile device， a wearable device (e.g.， a smart watch) ， a smart home device (e.g.， a smart appliance) ， other smart devices， a web appliance， a network router， a network switch， a network bridge， or any machine capable of executing the instructions 1116， sequentially or otherwise， that specify actions to be taken by the machine 1100. Further， while only a single machine 1100 is illustrated， the term “machine” shall also be taken to include a collection of machines 1100 that individually or jointly execute the instructions 1116 to perform any one or more of the methodologies discussed herein.

The machine 1100 can include processors 1110， memory/storage 1130， and I/O components 1150， which can be configured to communicate with each other such as via a bus 1102. In an example embodiment， the processors 1110 (e.g.， a Central Processing Unit (CPU) ， a Reduced Instruction Set Computing (RISC) processor， a Complex Instruction Set Computing (CISC) processor， a Graphics Processing Unit (GPU) ， a Digital Signal Processor (DSP) ， an ASIC， a Radio-Frequency Integrated Circuit (RFIC) ， another processor， or any suitable combination thereof) can include， for example， a processor 1112 and a processor 1114 that may execute the instructions 1116. The term “processor” is intended to include a multi-core processor that may comprise two or more independent processors (sometimes referred to as “cores” ) that can execute instructions contemporaneously. Although FIG. 11 shows multiple processors 1110， the machine 1100 may include a single processor with a single core， a single processor with multiple cores (e.g.， a multi-core processor) ， multiple processors with a single core， multiple processors with multiples cores， or any combination thereof.

The memory/storage 1130 can include a memory 1132， such as a main memory， or other memory storage， and a storage unit 1136， both accessible to the processors 1110 such as via the bus 1102. The storage unit 1136 and memory 1132 store the instructions 1116 embodying any one or more of the methodologies or functions described herein. The instructions 1116 can also reside， completely or partially， within the memory 1132， within the storage unit 1136， within at least one of the processors 1110 (e.g.， within the processor’s cache memory) ， or any suitable combination thereof， during execution thereof by the machine 1100. Accordingly， the memory 1132， the storage unit 1136， and the memory of the processors 1110 are examples of machine-readable media.

As used herein， the term “machine-readable medium” means a device able to store instructions and data temporarily or permanently and may include， but not be limited to， random-access memory (RAM) ， read-only memory (ROM) ， buffer memory， flash memory， optical media， magnetic media， cache memory， other types of storage (e.g.， Erasable Programmable Read-Only Memory (EPROM) ) ， or any suitable combination thereof. The term “machine-readable medium” should be taken to include a single medium or multiple media (e.g.， a centralized or distributed database， or associated caches and servers) able to store the instructions 1116. The term “machine-readable medium” shall also be taken to include any medium， or combination of multiple media， that is capable of storing instructions (e.g.， the instructions 1116) for execution by a machine (e.g.， machine 1100) ， such that the instructions， when executed by one or more processors of the machine (e.g.， processors 1110) ， cause the machine to perform any one or more of the methodologies described herein. Accordingly， a “machine-readable medium” refers to a single storage apparatus or device， as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” excludes signals per se.

The I/O components 1150 can include a wide variety of components to receive input， provide output， produce output， transmit information， exchange information， capture measurements， and so on. The specific I/O components 1150 that are included in a particular machine will depend on the type of machine. For example， portable machines such as mobile phones will likely include a touch input device or other such input mechanisms， while a headless server machine will likely not include such a touch input device. It will be appreciated that the I/O components 1150 can include many other components that are not shown in FIG. 11. The I/O components 1150 are grouped according to functionality merely for simplifying the following discussion， and the grouping is in no way limiting. In various example embodiments， the I/O components 1150 can include output components 1152 and input components 1154. The output components 1152 can include visual components (e.g.， a display such as a plasma display panel (PDP) ， a light emitting diode (LED) display， a liquid crystal display (LCD) ， a projector， or a cathode ray tube (CRT) ) ， acoustic components (e.g.， speakers) ， haptic components (e.g.， a vibratory motor， resistance mechanisms) ， other signal generators， and so forth. The input components 1154 can include alphanumeric input components (e.g.， a keyboard， a touch screen configured to receive alphanumeric input， a photo-optical keyboard， or other alphanumeric input components) ， point-based input components (e.g.， a mouse， a touchpad， a trackball， a joystick， a motion sensor， or other pointing instruments) ， tactile input components (e.g.， a physical button， a touch screen that provides location and force of touches or touch gestures， or other tactile input components) ， audio input components (e.g.， a microphone) ， and the like.

In further example embodiments， the I/O components 1150 can include biometric components 1156， motion components 1158， environmental components 1160， or position components 1162 among a wide array of other components. For example， the biometric components 1156 can include components to detect expressions (e.g.， hand expressions， facial expressions， vocal expressions， body gestures， or eye tracking) ， measure biosignals (e.g.， blood pressure， heart rate， body temperature， perspiration， or brain waves) ， identify a person (e.g.， voice identification， retinal identification， facial identification， fingerprint identification， or electroencephalogram-based identification) ， and the like. The motion components 1158 can include acceleration sensor components (e.g.， an accelerometer) ， gravitation sensor components， rotation sensor components (e.g.， a gyroscope) ， and so forth. The environmental components 1160 can include， for example， illumination sensor components (e.g.， a photometer) ， temperature sensor components (e.g.， one or more thermometers that detect ambient temperature) ， humidity sensor components， pressure sensor components (e.g.， a barometer) ， acoustic sensor components (e.g.， one or more microphones that detect background noise) ， proximity sensor components (e.g.， infrared sensors that detect nearby objects) ， gas sensor components (e.g.， machine olfaction detection sensors， gas detection sensors to detect concentrations of hazardous gases for safety or to measure pollutants in the atmosphere) ， or other components that may provide indications， measurements， or signals corresponding to a surrounding physical environment. The position components 1162 can include location sensor components (e.g.， a GPS receiver component) ， altitude sensor components (e.g.， altimeters or barometers that detect air pressure from which altitude may be derived) ， orientation sensor components (e.g.， magnetometers) ， and the like.

Communication can be implemented using a wide variety of technologies. The I/O components 1150 may include communication components 1164 operable to couple the machine 1100 to a network 1180 or devices 1170 via a coupling 1182 and a coupling 1172， respectively. For example， the communication components 1164 include a network interface component or other suitable device to interface with the network 1180. In further examples， the communication components 1164 include wired communication components， wireless communication components， cellular communication components， Near Field Communication (NFC) components，

components (e.g.，

Low Energy) ，

components， and other communication components to provide communication via other modalities. The devices 1170 may be another machine or any of a wide variety of peripheral devices (e.g.， a peripheral device coupled via a Universal Serial Bus (USB)) .

Moreover， the communication components 1164 can detect identifiers or include components operable to detect identifiers. For example， the communication components 1164 can include Radio Frequency Identification (RFID) tag reader components， NFC smart tag detection components， optical reader components (e.g.， an optical sensor to detect one-dimensional bar codes such as a Universal Product Code (UPC) bar code， multi-dimensional bar codes such as a Quick Response (QR) code， Aztec Code， Data Matrix， Dataglyph， MaxiCode， PDF417， Ultra Code， Uniform Commercial Code Reduced Space Symbology (UCC RSS) -2D bar codes， and other optical codes) ， acoustic detection components (e.g.， microphones to identify tagged audio signals) ， or any suitable combination thereof. In addition， a variety of information can be derived via the communication components 1164， such as location via Internet Protocol (IP) geo-location， location via

signal triangulation， location via detecting a

or NFC beacon signal that may indicate a particular location， and so forth.

In various example embodiments， one or more portions of the network 1180 can be an ad hoc network， an intranet， an extranet， a VPN， a LAN， a WLAN， a WAN， a WWAN， a MAN， the Internet， a portion of the Internet， a portion of the PSTN， a plain old telephone service (POTS) network， a cellular telephone network， a wireless network， a

network， another type of network， or a combination of two or more such networks. For example， the network 1180 or a portion of the network 1180 may include a wireless or cellular network， and the coupling 1182 may be a Code Division Multiple Access (CDMA) connection， a Global System for Mobile communications (GSM) connection， or another type of cellular or wireless coupling. In this example， the coupling 1182 can implement any of a variety of types of data transfer technology， such as Single Carrier Radio Transmission Technology (1xRTT) ， Evolution-Data Optimized (EVDO) technology， General Packet Radio Service (GPRS) technology， Enhanced Data rates for GSM Evolution (EDGE) technology， third Generation Partnership Project (3GPP) including 3G， fourth generation wireless (4G) networks， Universal Mobile Telecommunications System (UMTS) ， High-Speed Packet Access (HSPA) ， Worldwide Interoperability for Microwave Access (WiMAX) ， Long-Term Evolution (LTE) standard， others defined by various standard-setting organizations， other long-range protocols， or other data-transfer technology.

The instructions 1116 can be transmitted or received over the network 1180 using a transmission medium via a network interface device (e.g.， a network interface component included in the communication components 1164) and utilizing any one of a number of well-known transfer protocols (e.g.， Hypertext Transfer Protocol (HTTP) ) . Similarly， the instructions 1116 can be transmitted or received using a transmission medium via the coupling 1172 (e.g.， a peer-to-peer coupling) to the devices 1170. The term “transmission medium” shall be taken to include any intangible medium that is capable of storing， encoding， or carrying the instructions 1116 for execution by the machine 1100， and includes digital or analog communications signals or other intangible media to facilitate communication of such software.

Throughout this specification， plural instances may implement components， operations， or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations， one or more of the individual operations may be performed concurrently， and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly， structures and functionality presented as a single component may be implemented as separate components. These and other variations， modifications， additions， and improvements fall within the scope of the subject matter herein.

Although an overview of the inventive subject matter has been described with reference to specific example embodiments， various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present disclosure. Such embodiments of the inventive subject matter may be referred to herein， individually or collectively， by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single disclosure or inventive concept if more than one is， in fact， disclosed.

The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom， such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description， therefore， is not to be taken in a limiting sense， and the scope of various embodiments is defined only by the appended claims， along with the full range of equivalents to which such claims are entitled.

As used herein， the term “or” may be construed in either an inclusive or exclusive sense. Moreover， plural instances may be provided for resources， operations， or structures described herein as a single instance. Additionally， boundaries between various resources， operations， modules， engines， and data stores are somewhat arbitrary， and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general， structures and functionality presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly， structures and functionality presented as a single resource may be implemented as separate resources. These and other variations， modifications， additions， and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are， accordingly， to be regarded in an illustrative rather than a restrictive sense.

Claims

A method comprising：

identifying a multidimensional data structure comprising a plurality of cuboids；

generating， using a set of queries， a likelihood that one or more of the cuboids of the plurality will be accessed；

generating a gain factor for the plurality of cuboids， the gain factor specifying a computational gain in efficiency from including the one or more cuboids in the multidimensional data structure；

generating a weighted gain factor by offsetting the gain factor by the likelihood； and

modifying， using at least one processor of a machine， the multidimensional data structure to include the one or more cuboids.
The method of claim 1， wherein the likelihood is generated for each of the cuboids of the plurality， wherein the gain factor is generated for each of the cuboids of the plurality， and wherein the weighted gain factor is generated for each of the cuboids of the plurality.
The method of claim 1， wherein the likelihood is based at least in part on a total quantity of queries in the set of queries.
The method of claim 1， wherein the gain factor for a given cuboid of the plurality is based at least in part on a difference between a first access cost of the given cuboid and a second access cost of an additional cuboid， the additional cuboid being a cuboid of least access cost with respect to the given cuboid.
The method of claim 1， wherein the weighted gain factor for the cuboids is generated in iterations including at least a first iteration and a second iteration.
The method of claim 5， wherein a first cuboid is selected for inclusion in the multidimensional data structure based at least in part on the first cuboid having a highest weighted gain factor in the first iteration.
The method of claim 5， wherein a second cuboid is selected for inclusion in the multidimensional data structure based at least in part on the second cuboid having a highest weighted gain factor in the second iteration.
The method of claim 1， wherein the set of queries are queries that have been issued against the multidimensional data structure.
The method of claim 1， wherein the multidimensional data structure is an Online Analytical Processing (OLAP) cube.
The method of claim 1， wherein the multidimensional data structure is modified to include the one or more cuboids in response to an optimization request generated through a user interface comprising a graph visualization of the multidimensional data structure.
The method of claim 10， wherein the multidimensional data structure comprises a plurality of segments and each of the segments is optimized in parallel.
A system comprising：

one or more processors of a machine； and

a memory storing instructions that， when executed by the one or more processors， cause the machine to perform operations comprising：

identifying a multidimensional data structure comprising a plurality of cuboids；

generating， using a set of queries， a likelihood that one or more of the cuboids of the plurality will be accessed；

generating a gain factor for the plurality of cuboids， the gain factor specifying a computational gain in efficiency from including the one or more of the cuboids in the multidimensional data structure；

generating a weighted gain factor by offsetting the gain factor by the likelihood； and

modifying the multidimensional data structure to include the one or more cuboids.
The system of claim 12， wherein the likelihood is generated for each of the cuboids of the plurality， wherein the gain factor is generated for each of the cuboids of the plurality， and wherein the weighted gain factor is generated for each of the cuboids of the plurality.
The system of claim 12， wherein the likelihood is based at least in part on a total quantity of queries in the set of queries.
The system of claim 12， wherein the gain factor for a given cuboid of the plurality is based at least in part on a difference between a first access cost of the given cuboid and a second access cost of an additional cuboid， the additional cuboid being a cuboid of least access cost with respect to the given cuboid.
The system of claim 12， wherein the weighted gain factor for the cuboids is generated in iterations including at least a first iteration and a second iteration.
The system of claim 16， wherein a first cuboid is selected for inclusion in the multidimensional data structure based at least in part on the first cuboid having a highest weighted gain factor in the first iteration.
The system of claim 16， wherein a second cuboid is selected for inclusion in the multidimensional data structure based at least in part on the second cuboid having a highest weighted gain factor in the second iteration.
The system of claim 12， wherein the set of queries are queries that have been issued against the multidimensional data structure， and the multidimensional data structure is an Online Analytical Processing (OLAP) cube.
A non-transitory machine-readable storage device embodying instructions that， when executed by a machine， cause the machine to perform operations comprising：

identifying a multidimensional data structure comprising a plurality of cuboids；

generating， using a set of queries， a likelihood that one or more of the cuboids of the plurality will be accessed；

generating a gain factor for the plurality of cuboids， the gain factor specifying a computational gain in efficiency from including the one or more cuboids in the multidimensional data structure；

generating a weighted gain factor by offsetting the gain factor by the likelihood； and

modifying， using at least one processor of a machine， the multidimensional data structure to include the one or more cuboids.