US20190228002A1 - Multi-Language Support for Interfacing with Distributed Data - Google Patents
- Publication number
- US20190228002A1 (U.S. application Ser. No. 16/268,503)
- Authority
- US
- United States
- Prior art keywords
- document
- distributed data
- ddf
- data
- collaborators
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2291—User-Defined Types; Storage management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/31—Programming languages or programming paradigms
- G06F8/315—Object-oriented languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/101—Collaborative creation, e.g. joint development of products or services
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/02—Details
- H04L12/16—Arrangements for providing special services to substations
- H04L12/18—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
- H04L12/1813—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
- H04L12/1822—Conducting the conference, e.g. admission, detection, selection or grouping of participants, correlating users to one or more conference sessions, prioritising transmission
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/14—Session management
- H04L67/141—Setup of application sessions
Definitions
- The disclosure relates to efficient processing of large data sets using parallel and distributed systems. More specifically, the disclosure concerns various aspects of processing of distributed data structures, including collaborative processing of distributed data structures using shared documents.
- Enterprises produce large amounts of data based on their daily activities. This data is stored in a distributed fashion among a large number of computer systems. For example, a large amount of information is stored as logs of various systems of the enterprise. Processing such large amounts of data to gain meaningful insights into the information describing the enterprise requires a large amount of resources. Furthermore, conventional techniques available for processing such large amounts of data typically require users to perform cumbersome programming.
- Some tools and systems are available to assist data scientists and business experts with the above process of providing requirements and analyzing results of big data analysis.
- However, the tools and systems used by data scientists are typically difficult for business experts to use, and the tools and systems used by business experts are difficult for data scientists to use. This creates another gap between the analysis performed by data scientists and the analysis performed by business experts. Therefore, conventional techniques for providing insights into big data stored in distributed systems of an enterprise fail to provide a suitable interface for users to analyze the information available in the enterprise.
- Embodiments provide multi-language support for data processing.
- a system stores an in-memory distributed data frame structure (DDF) across a plurality of compute nodes. Each compute node stores a portion of the in-memory distributed data structure (DDF segment). The data of the DDF conforms to a primary language.
- the system further stores a document comprising text and code blocks.
- the code blocks comprise a first code block for providing instructions using the primary language and a second code block for providing instructions using a secondary language.
- the system receives a request to process instructions specified in the first code block using the primary language.
- Each compute node processes the instructions to process the DDF segment mapped to the compute node.
- the system further receives a request to process instructions specified in the second code block using the secondary language.
- Each compute node transforms the data of the DDF segment mapped to the compute node to conform to the format of the secondary language.
- Each compute node executes the instructions of the secondary language to generate a result DDF segment.
- the system transforms data of the result DDF segment to a format conforming to the primary language.
- Each compute node processes further instructions specified using the primary language to process the transformed result DDF segment mapped to the compute node.
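- The following sketch illustrates, in simplified form, the per-node flow summarized above: each simulated compute node transforms its DDF segment into the secondary language's format, executes the second code block's instructions, and transforms the result back so later primary-language instructions can operate on it. The helper names (to_secondary, run_secondary_block, to_primary) and the row-wise/column-wise formats are illustrative assumptions, not the patent's actual implementation.

```python
# A minimal sketch of the transform/execute/transform-back sequence, assuming
# a toy "DDF" split into segments (one per simulated compute node).

# Each segment of the DDF is stored row-wise ("primary language" format).
ddf_segments = [
    [{"flight": "AA1", "delay": 5}, {"flight": "AA2", "delay": 12}],   # node 0
    [{"flight": "UA7", "delay": 0}, {"flight": "UA9", "delay": 30}],   # node 1
]

def to_secondary(segment):
    """Transform a row-wise segment into a column-wise layout, standing in
    for conversion to the secondary language's data format."""
    return {key: [row[key] for row in segment] for key in segment[0]}

def run_secondary_block(columns):
    """Stand-in for executing the second code block's instructions:
    keep only delayed flights."""
    keep = [i for i, d in enumerate(columns["delay"]) if d > 0]
    return {key: [values[i] for i in keep] for key, values in columns.items()}

def to_primary(columns):
    """Transform the column-wise result back into the primary format."""
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]

# Each node transforms its segment, runs the secondary-language block, and
# transforms the result back so later primary-language blocks can use it.
result_segments = [to_primary(run_secondary_block(to_secondary(s)))
                   for s in ddf_segments]
print(result_segments)
```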
- FIG. 1 shows the overall system environment for performing analysis of big data, in accordance with an embodiment of the invention.
- FIG. 2 shows the system architecture of a big data analysis system, in accordance with an embodiment.
- FIG. 3 shows the system architecture of the distributed data framework, in accordance with an embodiment.
- FIG. 4 illustrates a process for collaborative processing based on a DDF between users, according to an embodiment of the invention.
- FIG. 5 illustrates a process for converting immutable data sets received from the in-memory cluster computing engine to mutable data sets, according to an embodiment of the invention.
- FIG. 6 shows the process illustrating use of a global editing mode and a local editing mode during collaboration, according to an embodiment of the invention.
- FIG. 7 shows collaborative editing of documents including text, code, and charts based on distributed data structures, according to an embodiment.
- FIG. 8 shows an example of a shared document with code and results based on the code, according to an embodiment.
- FIG. 9 shows an example of a shared document with code and results updated based on execution of the code, according to an embodiment.
- FIG. 10 shows an example of a dashboard generated from a document based on distributed data structures, according to an embodiment.
- FIG. 11 shows the architecture of the in-memory cluster computing engine illustrating how a distributed data structure (DDF) is allocated to various compute nodes, according to an embodiment.
- FIG. 12 shows the architecture of the in-memory cluster computing engine illustrating how multiple runtimes are used to process instructions provided in multiple languages, according to an embodiment.
- FIG. 13 shows an interaction diagram illustrating the processing of DDFs based on instructions received in multiple languages, according to an embodiment.
- FIG. 14 is a high-level block diagram illustrating an example of a computer for use as a system for performing formal verification with low power considerations, in accordance with an embodiment.
- a big data analysis system provides an abstraction of database tables based on distributed in-memory data structures obtained from big data sources, for examples, files of a distributed file system.
- a user can retrieve data from a large, distributed, complex file system and treat the data as database tables.
- the big data analysis system allows users to apply familiar data abstractions, such as filtering and joining data, to large datasets; such abstractions are commonly supported by systems that handle small data, for example, single-processor database management systems.
- the big data analysis system supports various features including schema, filtering, projection, transformation, data mining, machine learning, and so on.
- Embodiments support various operations commonly used by data scientists. These include various types of statistical computation, sampling, machine learning, and so on. However, these operations are supported on large data sets processed by a distributed architecture. Conventional systems that allow processing of large data using distributed systems typically do not maintain long-lived user sessions. Embodiments create a long-term session for a user and track distributed data structures for users so as to allow users to modify the distributed data structures. The ability to create long-term sessions allows embodiments to provide functionality similar to existing data analysis systems that are used for small data processing, for example, the R programming language and interactive system.
- embodiments support high-level data analytics functionality, thereby allowing users to focus on the data analysis rather than low level implementation details of how to manage large data sets.
- This is distinct from conventional systems, for example, systems that support the map-reduce paradigm and require users to express high-level analytics functions as map and reduce functions.
- the map reduce paradigm requires users to be aware of the distributed nature of data and requires users to use the map and reduce operations for expressing the data analysis operations.
- Embodiments further allow integration of large data sets with various machine learning techniques, for example, with externally available machine learning libraries. Furthermore, the ability to store distributed data structures in memory and identify the distributed data structures using URIs allows embodiments to support clients using various languages, for example, Java, Scala, R and Python, and also natural language.
- Embodiments support collaboration between multiple users working on the same distributed data set.
- a user can refer to a distributed data structure using a URI (uniform resource identifier).
- the URI can be passed between users, for example, by email. Accordingly, a new user can get access to a distributed data structure that is stored in memory.
- Embodiments further allow a user to train a machine learning model, create a name for the machine learning model and transfer the name to another user so as to allow the other user to execute the machine learning model.
- a data scientist can create a distributed data structure or a machine learning model and provide it to an executive of an enterprise to present the data or model to an audience. The executive can perform further processing using the data or the model as part of a presentation by connecting to a system based on these embodiments.
- FIG. 1 shows the overall system environment for performing analysis of big data, in accordance with an embodiment of the invention.
- the overall system environment includes an enterprise 110 , a big data analysis system 100 , a network 150 and client devices 130 .
- Other embodiments can use more, fewer, or different systems than those illustrated in FIG. 1 .
- Functions of various modules and systems described herein can be implemented by other modules and/or systems than those described herein.
- FIG. 1 and the other figures use like reference numerals to identify like elements.
- the enterprise 110 is any business or organization that uses computer systems for processing its data. Enterprises 110 are typically associated with a business activity, for example, sale of certain products or services, but can be any organization or group of organizations that generates a significant amount of data.
- the enterprise 110 includes several computer systems 120 for processing information of the enterprise. For example, a business may use computer systems for performing various tasks related to the products or services offered by the business. These tasks include sales transactions, inventory management, employee activities, workflow coordination, information technology management, and so on.
- Performing these tasks may generate a large amount of data for the enterprise.
- an enterprise may perform thousands of transactions daily. Different types of information are generated for each transaction, including information describing the products/services involved in the transaction, errors/warnings generated by the system during transactions, and information describing involvement of personnel from the enterprise, for example, sales representatives, technical support, and so on. This information accumulates over days, weeks, months, and years, resulting in large amounts of data.
- airlines process data of hundreds of thousands of passengers traveling every day and large numbers of flights carrying passengers every day.
- the information describing the flights and passengers of each flight over a few years can be several terabytes of data. Enterprises that process petabytes of data are not uncommon.
- search engines may store information describing millions of searches performed by users on a daily basis that can generate terabytes of data in a short time interval.
- social networking systems can have hundreds of millions of users. These users interact daily with the social networking system generating petabytes of data.
- the big data analysis system 100 allows analysis of the large amounts of data generated by the enterprise.
- the big data analysis system 100 may include a large number of processors for analyzing the data of the enterprise 110 .
- the big data analysis system 100 is part of the enterprise 110 and utilizes computer systems 120 of the enterprise 110 . Data from the computer systems 120 of enterprise 110 that generate the data may be imported 155 into the computer systems that perform the big data analysis.
- the client devices 130 are used by users of the big data analysis system 100 to perform the analysis and study of data obtained from the enterprise 110 .
- the users of the client devices 130 include data analysts, data engineers, and business experts.
- the client device 130 executes a client application that allows users to interact with the big data analysis system 100 .
- the client application executing on the client device 130 may be an internet browser that interacts with web servers executing on computer systems of the big data analysis system 100 .
- a computing device can be a conventional computer system executing, for example, a Microsoft™ Windows™-compatible operating system (OS), Apple™ OS X, and/or a Linux distribution.
- a computing device can also be a client device having computer functionality, such as a personal digital assistant (PDA), mobile telephone, video game system, etc.
- the interactions between the client devices 130 and the big data analysis system 100 are typically performed via a network 150 , for example, via the internet.
- the interactions between the big data analysis system 100 and the computer systems 120 of the enterprise 110 are also typically performed via a network 150 .
- the network uses standard communications technologies and/or protocols.
- the various entities interacting with each other, for example, the big data analysis system 100 , the client devices 130 , and the computer systems 120 can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above.
- the network can also include links to other networks such as the Internet.
- FIG. 2 shows the system architecture of a big data analysis system, in accordance with an embodiment.
- a big data analysis system 100 comprises a distributed file system 210 , an in-memory cluster computing engine 220 , a distributed data framework 200 , an analytics framework 230 , a web server 240 , a custom application server 260 , and a programming language interface 270 .
- the big data analysis system 100 may include more or fewer modules than those shown in FIG. 2 . Furthermore, specific functionality may be implemented by modules other than those described herein.
- the big data analysis system 100 is also referred to herein as a data analysis system or a system.
- the distributed file system 210 includes multiple data stores 250 . These data stores 250 may execute on different computers.
- the distributed file system 210 may store large data files that may store gigabytes or terabytes of data.
- the data files may be distributed across multiple computer systems.
- the distributed file system 210 replicates the data for high availability.
- the distributed file system 210 processes immutable files to which writes are not performed.
- An example of a distributed file system is HADOOP DISTRIBUTED FILE SYSTEM (HDFS).
- the in-memory cluster computing engine 220 loads data from the distributed file system 210 into a cluster of compute nodes 280 .
- Each compute node includes one or more processors and memory for storing data.
- the in-memory cluster computing engine 220 stores data in-memory for fast access and fast processing.
- Because the distributed data framework 200 may receive repeated queries for processing the same data structure stored in the in-memory cluster computing engine 220 , the distributed data framework 200 can process the queries efficiently by reusing the data structure stored in memory without having to load the data from the file system.
- An example of an in-memory cluster computing engine is the APACHE SPARK system.
- the distributed data framework 200 provides an abstraction that allows the modules interacting with the distributed data framework 200 to treat the underlying data provided by the distributed file system 210 or the in-memory cluster computing engine 220 interface as structured data comprising tables.
- the distributed data framework 200 supports an application programming interface (API) that allows a caller to treat the underlying data as tables.
- a software module can interact with the distributed data framework 200 by invoking APIs supported by the distributed data framework 200 .
- the interface provided by the distributed data framework 200 is independent of the underlying system.
- the distributed data framework 200 may be provided using different implementations based on in-memory cluster computing engines 220 (or different distributed file systems 210 ) that are provided by different vendors and support different types of interfaces.
- the interface provided by the distributed data framework 200 is the same for different underlying systems.
- the table based structure allows users familiar with database technology to process data stored in the in-memory cluster computing engine 220 .
- the table based distributed data structure provided by the distributed data framework is referred to as distributed data-frame (DDF).
- the data stored in the in-memory cluster computing engine 220 may be obtained from data files stored in the distributed file system 210 , for example, log files generated by computer systems of an enterprise.
- the distributed data framework 200 processes large amounts of data using the in-memory cluster computing engine 220 , for example, materialization and transformation of large distributed data structures.
- the distributed data framework 200 performs computations that generate smaller size data, for example, aggregation or summarization results and provides these results to a caller of the distributed data framework 200 .
- the caller of the distributed data framework 200 is typically a machine that is not capable of handling large distributed data structures.
- a client device may receive the smaller size data generated by the distributed data framework 200 and perform visualization of the data or presentation of data via different types of user interfaces. Accordingly the distributed data framework 200 hides the complexity of large distributed data structures and provides an interface that is based on manipulation of small data structures, for example, database tables.
- the distributed data framework 200 supports SQL (structured query language) queries, data table filtering, projections, group by, and join operations based on distributed data-frames.
- the distributed data framework 200 provides transparent handling of missing data, APIs for transformation of data, and APIs providing machine-learning features based on distributed data-frames.
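- As an illustration of the table abstraction described above, the toy class below applies filter, projection, and group-by operations to a small row-wise data set. The class and its methods are hypothetical stand-ins; in the actual system such calls would be dispatched to the in-memory cluster computing engine rather than executed over a local list of rows.

```python
# A small, self-contained stand-in for the table-style operations exposed on
# a DDF (filter, projection, group-by). Names are illustrative only.
from collections import defaultdict

class ToyDDF:
    def __init__(self, rows):
        self.rows = rows                       # row-wise "table" data

    def filter(self, predicate):
        return ToyDDF([r for r in self.rows if predicate(r)])

    def project(self, *columns):
        return ToyDDF([{c: r[c] for c in columns} for r in self.rows])

    def group_by(self, key, agg_col):
        groups = defaultdict(list)
        for r in self.rows:
            groups[r[key]].append(r[agg_col])
        return {k: sum(v) / len(v) for k, v in groups.items()}   # mean per group

flights = ToyDDF([
    {"carrier": "AA", "delay": 5}, {"carrier": "AA", "delay": 15},
    {"carrier": "UA", "delay": 0},
])
late = flights.filter(lambda r: r["delay"] > 0).project("carrier", "delay")
print(late.group_by("carrier", "delay"))       # {'AA': 10.0}
```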
- the analytics framework 230 supports higher level operations based on the table abstraction provided by the distributed data framework 200 .
- the analytics framework 230 supports collaboration using the distributed data structures represented within the in-memory cluster computing engine 220 .
- the analytics framework 230 supports naming of distributed data structures to facilitate collaboration between users of the big data analysis system 100 .
- the analytics framework 230 maintains a table mapping user specified names to locations of data structures.
- the analytics framework 230 allows computation of statistics describing a DDF, for example, mean, standard deviation, variance, count, minimum value, maximum value, and so on.
- the analytics framework 230 also determines multivariate statistics for a DDF including correlation and contingency tables.
- analytics framework 230 allows grouping of DDF data and merging of two or more DDFs.
- the big data analysis system 100 allows different types of interfaces to interact with the underlying data. These include programming language based interfaces as well as graphical user interface based user interfaces.
- the web server 240 allows users to interact with the big data analysis system 100 via browser applications or via web services.
- the web server 240 is also responsible for managing the custom application server 260 .
- the web server 240 receives requests from web browser clients and processes the requests.
- the web browser requests are typically requests sent using a web browser protocol, for example, a hyper-text transfer protocol (HTTP.)
- the results returned to the requester are typically in the form of markup language documents, for example, documents specified in hyper-text markup language (HTML).
- the custom application server 260 receives and processes requests from custom applications that are designed for interacting with big data analysis system 100 .
- a customized user interface receives requests for the big data analysis system 100 specified using data analysis languages, for example, the R language used for statistical computing.
- the customized user interface may use a proprietary protocol for interacting with the big data analysis system 100 .
- the programming language interface 270 allows programs written in specific programming languages supported by the big data analysis system 100 to interact with the big data analysis system 100 .
- programmers can interact with the data analysis system 100 using PYTHON or JAVA language constructs.
- the distributed data framework 200 supports various types of analytics operations based on the data structures exposed by the distributed data framework 200 .
- FIG. 3 shows various modules within a distributed data framework module, in accordance with an embodiment of the invention.
- the distributed data framework 200 includes a distributed data-frame manager 310 and handlers including an ETL handler 320 , a statistics handler 330 , and a machine learning handler 340 .
- Other embodiments may include more or fewer modules than those shown in FIG. 3 .
- the distributed data-frame manager 310 supports loading data from big data sources of the distributed data file system 210 into DDFs.
- the distributed data-frame manager 310 also manages a pool of DDFs.
- the various handlers provide a pluggable architecture, making it easy to add new functionality to, or replace existing functionality of, the distributed data framework 200 .
- the ETL handler 320 supports ETL (extract, transform, and load) operations, the statistics handler 330 supports various statistical computations applied to DDFs, and the machine learning handler 340 supports machine learning operations based on DDFs.
- the distributed data framework 200 provides interfaces in different programming languages including Java, Scala, R and Python so that users can easily interact with the in-memory cluster computing engine 220 .
- a client can connect to a distributed data framework 200 via a web browser or a custom application based interface and issue commands for execution on the in-memory cluster computing engine 220 .
- the distributed data framework 200 allows users to load a DDF in memory and perform operations on the data stored in memory. These include filtering, aggregating, joining a data set with another and so on. Since a client device 130 has limited resources in terms of computing power or memory, a client device 130 is unable to load an entire DDF from the in-memory cluster computing engine 220 . Therefore, the distributed data framework 200 supports APIs that allow a subset of data to be retrieved from a DDF by a requestor.
- distributed data framework 200 supports an API “fetchRows(df, N)” that allows the caller to retrieve the first N rows of a DDF df. If the distributed data framework 200 receives a request “fetchRows(df, N)”, the distributed data framework 200 identifies the first N rows of the DDF df and returns the identified rows to the caller.
- the distributed data framework 200 supports an API “sample(df, N)” that allows the caller to retrieve a sample of N rows of a DDF df.
- the distributed data framework 200 samples data of the DDF df based on a preconfigured sampling strategy and returns a set of N rows obtained by sampling to the caller.
- the distributed data framework 200 supports an API “sample2ddf(df, p)” that allows the caller to compute a sample of p % of rows of the DDF df and assign the result to a new DDF.
- the distributed data framework 200 samples data of the DDF df based on a preconfigured sampling strategy to identify p % of rows of the DDF df and creates a new DDF based on the result.
- the distributed data framework 200 returns the result DDF to the caller, for example, by sending a reference or pointer to the DDF to the caller.
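- A usage sketch of the fetchRows, sample, and sample2ddf calls described above follows. The API names come from the text, but the local implementations below only mimic the documented behavior over a plain list of rows; the real requests would be served by the distributed data framework against an in-memory DDF.

```python
# Illustrative stand-ins for the documented DDF retrieval/sampling calls.
import random

def fetchRows(df, n):
    """Return the first n rows of the DDF."""
    return df[:n]

def sample(df, n):
    """Return n rows drawn with a (here: uniform) sampling strategy."""
    return random.sample(df, n)

def sample2ddf(df, p):
    """Return a new DDF holding roughly p percent of the rows."""
    k = max(1, round(len(df) * p / 100))
    return random.sample(df, k)

df = [{"id": i, "value": i * i} for i in range(100)]    # stand-in DDF
print(fetchRows(df, 3))
print(len(sample(df, 5)), len(sample2ddf(df, 10)))      # 5 10
```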
- the distributed data-frame manager 310 acts as a server that allows users to connect and create sessions that allow the users to interact with the distributed data framework 200 and process data.
- the distributed data framework 200 creates sessions that allow users to maintain distributed in-memory data structures in the in-memory cluster computing engine 220 for long periods of time, for example, weeks.
- the session maintains the state of the in-memory data structures so as to allow a sequence of multiple interactions with the same data structure.
- the interactions include requests that modify the data structure such that a subsequent request can access the modified data structure.
- a user can perform a sequence of operations to modify the data structure.
- the distributed data-frame manager 310 allows users to collaborate using a particular large distributed data structure (i.e., DDF). For example, a particular user can create a DDF loading data from log files of an enterprise into an in-memory data structure, perform various transformations, and share the data structure with other users. The other user may continue making transformations or view the data from the DDF in a user interface, for example, build a chart based on the data of the DDF for presentation to an audience.
- the analytics framework 230 receives a user request (or a request from a software module) to assign a name to a distributed data structure, for example, a DDF or a machine learning model.
- the analytics framework 230 may further receive requests to provide the name of the distributed data structure to other users or software modules. Accordingly, the analytics framework 230 allows multiple users to refer to the same distributed data structure residing in the in-memory cluster computing engine 220 .
- the analytics framework 230 supports an API that sets the name of a distributed data structure, for example, a DDF or any data set. For example, a user may invoke a function (or method) "setDDFName(df, string_name)" where "df" is a reference/pointer to the distributed data structure stored in the in-memory cluster computing engine 220 and "string_name" is an input string specified by the user for use as the name of the "df" structure.
- the analytics framework 230 processes the function setDDFName by assigning the name “string_name” to the structure “df”. For example, a user may execute queries to generate a data set representing flight information based on data obtained from airlines.
- a function/method call “setDDFName(df, “flightinfo”)” assigns name “flightinfo” to the data set identified by df.
- the analytics framework 230 further supports an API to get a uniform resource identifier (URI) for a data structure.
- the analytics framework 230 may receive a request to execute “getURI(df)”.
- the analytics framework 230 generates a URI corresponding to the data structure or data set represented by df and returns the URI to the requestor.
- the analytics framework 230 may generate a URI “ddf://servername/flightinfo” in response to the “getURI(df)” call.
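- The sketch below ties the setDDFName and getURI calls together. Only the call names and the "ddf://servername/flightinfo" URI shape come from the text; the in-process registry and the server name are illustrative assumptions.

```python
# Hypothetical registry mapping DDF references to user-assigned names.
SERVER = "servername"
_name_by_ddf = {}            # maps a DDF reference (here: its id) to its name

def setDDFName(df, string_name):
    _name_by_ddf[id(df)] = string_name

def getURI(df):
    return f"ddf://{SERVER}/{_name_by_ddf[id(df)]}"

flight_ddf = object()                      # stand-in for a DDF reference
setDDFName(flight_ddf, "flightinfo")
print(getURI(flight_ddf))                  # ddf://servername/flightinfo
```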
- the URI may be provided to an application executing on a client device 130 .
- the analytics framework 230 maintains a mapping from DDFs to URIs. If a DDF is removed from memory, the corresponding URI becomes invalid. For example, if a client application presented a document having a URI that has become invalid, the data analysis system 100 does not process queries based on that URI. The data analysis system 100 may return an error indicating that the query is directed to an invalid (or non-existent) DDF. If the data analysis system 100 loads the same set of data (as the DDF which is removed from the memory), as a new DDF, the client devices request a new URI for the newly created DDF.
- the data analysis system 100 may maintain two or more copies of the same DDF. For example, two or more clients may request access to the same data set with the possibility of making modifications. In this situation, each DDF representing the same data is assigned a different URI. For example, a first DDF representing a first copy of the data is assigned a first URI and a second DDF representing a second copy of the same data is assigned a second URI distinct from the first URI. Accordingly, requests for processing received by the data analysis system 100 based on the first URI are processed using the first DDF, and requests for processing received by the data analysis system 100 based on the second URI are processed using the second DDF.
- the URI can be communicated between applications or client devices.
- the client device 130 may communicate the URI to another client device.
- an application that has the value of the URI can send the URI to another application.
- the URI may be communicated via email, text message, or by physically copying and pasting from one user interface to another user interface.
- the recipient of the URI can use the URI to locate the DDF corresponding to the URI and process it.
- the recipient of the URI can use the URI to inspect the data of the DDF or to transform the data of the DDF.
- FIG. 4 illustrates collaboration between two client devices using a DDF, according to an embodiment of the invention.
- client device 130 a has access to an in-memory distributed data structure, for example, a DDF 420 stored in the in-memory cluster computing engine 220 .
- the client device 130 a has access to the DDF 420 because an application on client device 130 a created the DDF 420 .
- the client device 130 a may have received a reference to the DDF 420 from another client device or application.
- the distributed data framework 200 receives 425 a request to generate a URI (or a name) corresponding to the DDF 420 .
- the distributed data framework 200 generates a URI corresponding to the DDF 420 .
- the distributed data framework sends 435 the generated URI to the client device 130 a that requested the URI.
- the client device 130 a may send 450 the URI corresponding to the DDF 420 to another client device 130 b , for example, via a communication protocol such as email.
- the URI may be shared between two applications running on the same client device.
- the URI may be copied from a user interface of an application and pasted into the user interface of the other application by a user.
- the client device 130 b receives the URI.
- the client device 130 b can send 440 requests to the distributed data framework 200 using the URI to identify the DDF 420 .
- the client device 130 b can send a request to receive a portion of data of the DDF 420 .
- the in-memory cluster computing engine 220 stores a distributed data structure that represents a machine learning model.
- the distributed data framework 200 receives a request to create a name or URI identifying the machine learning model from a client device 130 .
- the distributed data framework 200 generates the URI or name and provides it to the client device 130 .
- the client device receiving the URI can transmit the URI to other client devices.
- Any client device that receives the URI can interact with the distributed data framework 200 to interact with the machine learning model, for example, to use the machine learning model for predicting certain behavior of entities.
- These embodiments allow a user (or users) to implement the machine learning model and train the model.
- the model is stored in-memory by the big data analysis system 100 and is ready to use by other users.
- the access to the in-memory model is provided by generating the URI and transmitting the URI to other applications or client devices. Users accessing the other client devices or applications can start using the machine learning model stored in memory.
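- The following sketch shows how a trained in-memory model might be shared by URI as described above. The registry, URI format, and the trivial (slope, intercept) model are hypothetical stand-ins for the framework's behavior.

```python
# Hypothetical registry of in-memory models, keyed by generated URI.
_models = {}                                   # URI -> in-memory model

def register_model(name, model, server="servername"):
    uri = f"ddf://{server}/models/{name}"
    _models[uri] = model
    return uri

def predict(uri, x):
    slope, intercept = _models[uri]            # resolve the shared model
    return slope * x + intercept

# A data scientist trains a (trivial) model and shares only its URI ...
uri = register_model("delay_model", (2.0, 1.0))
# ... another user, given the URI, applies the model without retraining it.
print(predict(uri, 10))                        # 21.0
```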
- the in-memory cluster computing engine 220 supports only immutable data sets.
- a user (e.g., a software module that creates/loads the data set for processing) cannot modify the data set once it is created.
- the in-memory cluster computing engine 220 may not provide any methods/functions or commands that allow a data set to be modified.
- the in-memory cluster computing engine 220 may return an error if a user attempts to modify data of a data set.
- the in-memory cluster computing engine 220 may not support mutable datasets if the in-memory cluster computing engine 220 supports a functional paradigm, i.e., a functional programming model based on functions that take a data set as input and return a new data set upon each invocation. As a result, each operation requires invocation of a function that returns a new data set. Accordingly, the in-memory cluster computing engine 220 does not support modification of states of data sets (these datasets may be referred to as stateless datasets.)
- the distributed data framework 200 allows users to convert a dataset to a mutable dataset.
- the distributed data framework 200 supports a method/function “setMutable(ddf)” that converts an immutable dataset (or DDF) input to the method/function to a mutable DDF.
- the distributed data framework 200 allows users to make modifications to the mutable DDF.
- the distributed data framework 200 may add new rows to, delete rows from, or modify rows of the mutable DDF based on requests.
- the distributed data framework 200 implements a data structure, for example, a table that tracks all DDFs that are mutable.
- a mutable DDF can have a long life since a caller may continue to make a series of modifications to the DDF.
- a user of the DDF may even pass a reference to the DDF to another user, thereby allowing the other user to continue modifying the dataset.
- immutable datasets have a relatively short life since the dataset cannot be modified and is used as a read-only value that is input to a function (or output as a result by a function).
- the distributed data framework 200 maintains metadata that tracks each mutable DDF.
- the distributed data framework 200 implements certain mutable operations by invoking functions supported by the in-memory cluster computing engine 220 . Accordingly, the distributed data framework 200 updates the metadata pointing at the DDF with a new DDF returned by the in-memory cluster computing engine 220 as a result of invocation of the function. Subsequent requests to process the DDF are directed to the new DDF structure pointed at by the metadata identifying the DDF.
- the user of the distributed data framework 200 is able to manipulate the data structures as if they are mutable.
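- The sketch below illustrates the indirection just described: the underlying datasets remain immutable, and each mutating request produces a new dataset to which the framework's metadata is re-pointed. The class and method names are hypothetical.

```python
# A minimal sketch of mutable-DDF metadata over immutable datasets.
class MutableDDF:
    def __init__(self, immutable_rows):
        # Metadata: a pointer to the latest immutable dataset (a tuple).
        self._current = tuple(immutable_rows)

    def add_row(self, row):
        # The engine-style operation returns a *new* immutable dataset;
        # the metadata is re-pointed rather than modifying data in place.
        self._current = self._current + (row,)

    def delete_rows(self, predicate):
        self._current = tuple(r for r in self._current if not predicate(r))

    def rows(self):
        # Callers always see the dataset the metadata currently names.
        return self._current

ddf = MutableDDF([{"id": 1}, {"id": 2}])
ddf.add_row({"id": 3})
ddf.delete_rows(lambda r: r["id"] == 1)
print(ddf.rows())                                # ({'id': 2}, {'id': 3})
```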
- FIG. 5 illustrates a process for converting immutable data sets received from the in-memory cluster computing engine to mutable data sets, according to an embodiment of the invention.
- the distributed data framework 200 receives a request to build a DDF structure.
- the DDF structure may be built by loading data from a set of files of the distributed file system 210 , for example, by processing a set of log files of an enterprise.
- the distributed data framework 200 sends 520 a request to the in-memory cluster computing engine 220 to retrieve the data of the requested data set.
- the in-memory cluster computing engine 220 supports only immutable data sets and does not allow or support modifications to datasets.
- the in-memory cluster computing engine 220 loads the requested dataset in memory.
- the dataset may be distributed across memory of a plurality of compute nodes 280 of the in-memory cluster computing engine 220 (the dataset is also referred to as a DDF.)
- the distributed data framework 200 marks the data set as immutable (for example, by storing in metadata a flag indicating that the dataset is immutable). This step may be performed if the default type of dataset supported by the in-memory cluster computing engine 220 is immutable. Accordingly, if the distributed data framework 200 receives a request to modify the data of the dataset, for example, by deleting existing data, adding new data, or modifying existing data, the distributed data framework 200 denies the request.
- the dataset is represented as a DDF structure. In other embodiments, the distributed data framework 200 may mark all DDF structures as mutable when they are created.
- the distributed data framework 200 receives 530 a request to convert the dataset to a mutable dataset.
- the request to convert the dataset may be supported as a method/function call, for example, “setMutable” method/function.
- a caller may invoke the “setMutable” method/function providing a DDF structure as input.
- the distributed data framework 200 updates 540 metadata structure describing the DDF to indicate that the DDF is mutable.
- the distributed data framework 200 receives 550 a request to modify the DDF, for example, by adding data, deleting data, or updating data.
- the distributed data framework 200 performs the requested modifications to the DDF.
- the distributed data framework 200 invokes a function of the in-memory cluster computing engine 220 corresponding to the modification operation.
- the in-memory cluster computing engine 220 generates 560 a new dataset that has the value equivalent to the requested modified DDF.
- the distributed data framework 200 modifies the metadata describing the DDF to refer to the modified DDF instead of the original DDF. Accordingly, if a requester accesses the data of the DDF, the requester receives the data of the modified DDF. Similarly, if a requester attempts to modify the DDF again, the new modification is applied to the modified DDF as identified by the metadata.
- Embodiments allow multiple collaborators to interact with a document.
- a collaborator can be represented as a user account of the system.
- Each user account may be associated with a client device, for example, a mobile device such as a mobile phone, a laptop, a notebook, or any other computing device.
- a collaborator may be represented as a client device. Accordingly, the shared document is shared between a plurality of client devices.
- the document includes information based on one or more DDFs stored in the in-memory cluster computing engine 220 , for example, a chart based on the DDF in the document.
- the collaborators can interact with the document in a global editing mode that causes changes to be propagated to all collaborators. For example, if a collaborator makes changes to the document or to the DDF identified by the document, all collaborators see the modified document (or the modified DDF).
- the document collaboration is based on a push model in which changes made to the document by any user are pushed to all collaborators.
- the document includes a chart based on a DDF stored in the in-memory cluster computing engine 220 . If the distributed data framework 200 receives requests to modify the DDF, the distributed data framework 200 performs the requested modifications to the DDF and propagates a new chart based on the modified DDF to the client devices of the various collaborators sharing the document.
- the distributed data framework 200 allows a collaborator X (or a set of collaborators) to switch to a local editing mode in which the changes made by the collaborator X (or any collaborator from the set) to a specified portion of the shared document are local and are not shared with the remaining collaborators.
- the local editing mode is also referred to herein as limited sharing mode, limited editing mode, or editing mode. For example, if the collaborator modifies the DDF, the changes based on the modifications to the DDF are visible in the document to only the collaborator X.
- the distributed data framework 200 does not propagate the modifications to the document (or to a portion of the document or the DDF associated with the document) to the remaining collaborators.
- the distributed data framework 200 continues propagating the original document and any information based on the version of the DDF before collaborator X switched to local editing mode to the remaining collaborators.
- the remaining collaborators can modify the original document (and associated DDFs) and the distributed data framework 200 does not propagate the modifications to user X.
- the collaborator X may share the modified document based on the local edits with a new set of collaborators. Accordingly, the new set of collaborators can continue modifying the version of the document created by collaborator X without affecting the document edited by the original set of users.
- the local edits to the shared document are shared between a set of collaborators. Accordingly, if any of the collaborators from the set of collaborators makes a modification to the shared document, the modifications are propagated to only the set of collaborators identified for the local editing. This allows a team of collaborators to make modifications to the shared document before making the modifications publicly available to a larger group of collaborators sharing the document.
- the distributed data framework 200 receives a request that identifies a particular portion of the shared document for local editing. Furthermore, the request received specifies a set of collaborators for sharing local edits to the identified portion. Accordingly, any modifications made by the collaborators of the set to the identified portion are propagated to all the collaborators of the set. However, any modifications made by any collaborator to the shared document outside the identified portion are propagated to all collaborators that share the document, independent of whether the collaborator belongs to the specified set or not.
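- A simplified sketch of this propagation rule follows, assuming a document with a single portion marked for local editing by a set of collaborators. The data model and function names are illustrative only.

```python
# Routing a modification to the collaborators who should receive it.
ALL_COLLABORATORS = {"alice", "bob", "carol", "dave"}
LOCAL_PORTION = "chart-1"                 # portion identified for local editing
LOCAL_SET = {"alice", "bob"}              # collaborators sharing local edits

def targets(portion, author):
    """Return the collaborators who should receive a modification."""
    if portion == LOCAL_PORTION and author in LOCAL_SET:
        return LOCAL_SET                  # local edits stay within the set
    return ALL_COLLABORATORS              # everything else is global

print(targets("chart-1", "alice"))        # {'alice', 'bob'}
print(targets("intro-text", "alice"))     # all collaborators
```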
- FIG. 6 shows the process illustrating use of a global editing mode and a local editing mode during collaboration, according to an embodiment of the invention.
- the distributed data framework 200 receives 610 a request to create a document.
- the document is associated with one or more DDFs and receives information associated with the DDF.
- the document may include a chart based on information available in the DDF.
- the distributed data framework 200 updates the chart based on changes to the data of the DDF. For example, if the distributed data framework 200 receives an update to the data of the DDF (including requests to delete, add, or modify data), the distributed data framework 200 modifies the chart to display the modified information.
- the collaboration module 370 receives 620 a request to share the document with a first plurality of collaborators.
- the requests may include requests to edit the document, requests to make modifications to the data of the DDF, and so on.
- the collaboration module 370 further receives a request from a particular collaborator (say collaborator X) to perform local editing on a selected portion of the shared document (or the entire document).
- the request identifies a set of collaborators that share the local edits to the selected portion of the document.
- the collaboration module 370 may create 650 a copy of data related to the identified portion of the shared document for collaborator X to perform local editing.
- the copy of the portion of the shared document is called the locally accessible document and the original shared document (which can be edited by all collaborators) is called the globally accessible document.
- the collaboration module 370 shares the associated DDFs between the locally accessible document and the globally accessible document when the locally accessible document is created. However, if the distributed data framework 200 receives a request from any of the collaborators to modify an underlying DDF, the distributed data framework 200 makes a copy of the DDF and modifies the copy of the DDF. One of the documents subsequently is associated with the modified DDF and the other document is associated with the original DDF.
- the collaboration module 370 may obtain a subset of data of the DDF that provides data to the chart displayed on the locally edited document.
- the chart may display data for a small time period out of a longer period of data stored in the DDF.
- the chart may display partially aggregated data.
- the DDF may store data at an interval of seconds and the chart may display data aggregated at intervals of days. Accordingly, the distributed data framework 200 determines the aggregated data, which may be much smaller than the total data of the DDF and can be stored on the client device instead of the in-memory cluster computing engine.
- the distributed data framework 200 checks if the size of the aggregated data is below a threshold value. If the size of the aggregated data is below a threshold value, the distributed data framework 200 sends the data to the client device 130 for further processing.
- the client device can perform certain operations based on the locally stored data, for example, further aggregation based on the data. Processing the locally stored data allows the client device to efficiently process user requests. For example, if the user wants to view a smaller slice of data than that shown on the chart, the client device 130 can use the locally stored data to respond to the query. Accordingly, the chart displayed on the client device is updated without updating the charts displayed on the remaining client devices that share the original document.
- if the client device requests to further aggregate the data, for example, by requesting aggregates at intervals of weeks or months, the request can be processed using the locally stored data.
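- The sketch below illustrates the size check and client-side re-aggregation described above. The threshold value, helper names, and sample data are illustrative assumptions.

```python
# Server-side aggregation, a size check, and further aggregation on the client.
from collections import defaultdict

THRESHOLD_ROWS = 1000                       # assumed size limit for shipping data

def aggregate(rows, bucket):
    totals = defaultdict(float)
    for r in rows:
        totals[bucket(r)] += r["value"]
    return [{"bucket": b, "value": v} for b, v in sorted(totals.items())]

# Server side: per-minute data aggregated to days; ship only if small enough.
raw = [{"t": t, "value": 1.0} for t in range(0, 86400 * 14, 60)]   # two weeks
daily = aggregate(raw, bucket=lambda r: r["t"] // 86400)
if len(daily) < THRESHOLD_ROWS:
    client_copy = daily                     # sent to the client device

# Client side: further aggregation (days -> weeks) uses only the local copy.
weekly = aggregate(client_copy, bucket=lambda r: r["bucket"] // 7)
print(weekly)
```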
- the data set associated with the chart (for example, the partially aggregated data) is stored on another system distinct from the distributed data framework 200 and the client device. The other system allows large data sets to be loaded in memory that exceed the capacity of the client device 130 .
- the collaboration module 370 determines which copy of the document is associated with the modified DDF and which copy of the document is associated with the original DDF. If the request to modify the DDF is received from the locally accessible document, the distributed data framework 200 associates the locally accessible document with the modified DDF and the globally accessible document with the original DDF. Alternatively, if the request to modify the DDF is received from the globally accessible document, the distributed data framework 200 associates the globally accessible document with the modified DDF and the locally accessible document with the original DDF.
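- The copy-on-write association described above might look like the following sketch: both documents initially point at the same DDF, and the first modification copies the DDF and re-points only the document from which the request originated. All names are hypothetical.

```python
# Copy-on-write association between documents and a shared DDF.
import copy

ddf = {"rows": [{"id": 1}, {"id": 2}]}
doc_assoc = {"global_doc": ddf, "local_doc": ddf}     # shared at creation

def modify_ddf(doc, mutate):
    """Apply `mutate` for `doc`, copying the DDF first if it is still shared."""
    current = doc_assoc[doc]
    if any(other != doc and d is current for other, d in doc_assoc.items()):
        current = copy.deepcopy(current)              # break the sharing
        doc_assoc[doc] = current
    mutate(current)

modify_ddf("local_doc", lambda d: d["rows"].append({"id": 3}))
print(len(doc_assoc["local_doc"]["rows"]),            # 3 (modified copy)
      len(doc_assoc["global_doc"]["rows"]))           # 2 (original DDF)
```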
- the collaboration module 370 receives 660 a request to share the locally accessible document with other collaborators (referred to here as a second plurality of collaborators).
- the second plurality of collaborators may overlap with the first plurality of collaborators.
- the collaboration module 370 provides access to the document to the second plurality of collaborators.
- the distributed data framework 200 receives request to modify the locally accessible document from collaborators belonging to the second plurality of collaborators.
- the ability to locally edit a portion of the document allows one or more collaborators to modify the document before making the modifications publicly available to all collaborators.
- a portion of the shared document may be associated with a query that processes an in-memory distributed data structure.
- the portion of the document may show results of the query as a chart, in text form, or both.
- the system allows one or more collaborators to develop and test the query in a local edit mode to make sure the chart presented is accurate. Once the collaborators have fully developed and tested the query and the chart, the system receives a request from the collaborators to share the identified portion with all users that share the document (not just the developers and testers of the query and the chart.)
- the system determines a target set of collaborators that receive each modification made to the shared document.
- the target set of collaborators is determined based on whether the modification is made to the portion identified for local editing or another portion. Accordingly, if the system receives a request to modify a portion of the document that is distinct from the portion identified for local editing, the system propagates the changes to all collaborators sharing the document. This is so because by default all portions of the document are marked for global editing by all collaborators. However, if the system receives a request to modify the portion identified for local editing and the request is received from a collaborator from the set of collaborators S allowed to perform local editing on that portion, the system propagates the modification to all collaborators from the set of collaborators.
- the collaborators not belonging to the set S of collaborators are allowed to modify the portion identified for the local editing.
- the system propagates these modifications only to collaborators that do not belong to the set S of collaborators allowed to perform local editing to the identified portion.
- the system maintains a separate copy of the identified portion. Accordingly, the modifications made by users of the set S are made to one copy of the document (and propagated to the collaborators belonging to S) and the modifications made by users outside set S are made to another copy (and propagated to the collaborators outside S).
- the portion of the shared document identified for local editing includes a query Q 1 processing a DDF associated with the shared document.
- a set S 1 of collaborators are allowed to perform local edits to the document.
- the selected portion of the document may include a chart of the document associated with the query or result of the query in text form.
- the local edits made by collaborators of set S 1 may modify Q 1 to become a query Q 2 . Accordingly, a chart based on query Q 2 is propagated to collaborators belonging to set S 1 and a chart based on the original query Q 1 is propagated to the remaining collaborators (outside set S 1 ) that share the document.
- the queries Q 1 and Q 2 are reevaluated to build a new corresponding chart (or textual representation of the result).
- the chart or results based on query Q 2 are propagated to the collaborators of set S 1 and the charts or results based on query Q 1 are propagated to the remaining collaborators (outside the set S 1 .)
- the system allows various portions of the same shared document to be locally edited by different collaborators or different sets of collaborators. For example, the system may receive a first request to allow local editing of a first portion of the shared document by a first set of collaborators. Subsequently the system may receive a second request to allow local editing of a second portion of the shared document by a second set of collaborators. The first and second set of collaborators may overlap or may be distinct.
- a shared document includes text portions, result portions, and code blocks.
- a text portion is received from a user and shared with other users.
- the shared document may be associated with one or more DDFs stored across a plurality of compute nodes.
- a code block may process data of a DDF.
- the code block may include queries that are executed. The result of execution of a query is displayed on the result portions of the document, for example, as charts.
- a code block is also referred to herein as a cell.
- Embodiments allow references to DDFs to be included in documents. Users interacting with the big data analysis system 100 can share documents and interact with the same shared document via different client devices 130 . If two or more documents share a DDF, changes made to the DDF via a document result in data displayed on the other documents being modified. For example, documents D 1 and D 2 may be distinct documents that have references to a DDF df. Document D 1 may be shared by a set of users S 1 and document D 2 may be shared by a set of users S 2 where S 1 and S 2 may be distinct sets of users with no overlap.
- a user U 1 from set S 1 executes code via document D 1 that modifies the DDF df
- a user U 2 from set S 2 can view the modifications to the DDF df even though U 2 is not sharing the document D 1 with user U 1 .
- the code modifications made by user U 1 via document D 1 may cause a chart or a result set displayed on document D 2 to be updated as a result of modifications made to DDF df.
- FIG. 7 shows collaborative editing of documents including text, code, and charts based on distributed data structures, according to an embodiment.
- the distributed data framework 200 stores two shared documents 710 a and 710 b .
- the shared documents include references to the in-memory distributed data frame structure 420 (referred to as the DDF) stored in the in-memory cluster computing engine 220 .
- Each shared document 710 is associated with a set 730 of users 720 interacting with the shared document 710 via client devices 130 .
- users 720 p and 720 q interact with shared document 710 a via client devices 130 p and 130 q respectively.
- users 720 r and 720 s interact with shared document 710 b via client devices 130 r and 130 s respectively.
- the sets 730 of users sharing a document may be distinct from a set of users sharing another document (i.e., have no overlap between the sets) or the sets of users sharing two distinct shared documents 710 may have an overlap (with one or more users having access to both the shared documents).
- the shared document 710 may include text, code, and results based on code.
- the results based on code may comprise results present as text or results presented as charts, for example, bar charts, scatter plots, pie charts and so on.
- the results presented in a document are associated with in-memory distributed data frame structure 420 (referred to as the DDF) stored in the in-memory cluster computing engine 220 .
- the document may specify a query based on the DDF such that the results/chart displayed in the document are based on the result of executing the query against the DDF.
- the code specified in the document may include a query for which the results are shown in the document. If a user updates the query of the shared document, each of the users that share the document receives updated results displayed in the shared document.
- the code specified in the document may include statements that modify the DDF, for example, by deleting, adding, updating rows, columns, or any other portion of data of the DDF. If a user modifies the DDF, the results displayed in the document may get updated based on the modified DDF. For example, if certain rows of the DDF are deleted, any aggregate results displayed in the document or charts based on the DDF may get updated to reflect the deletion of the rows. Furthermore, if there are other documents that share the same DDF (for example, by including a URI to the DDF), the results/chart displayed in those documents may be updated to reflect the modifications to the shared DDF.
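- The following Python sketch illustrates the update path described above under assumed names (documents, execute_query, on_ddf_mutated): when a DDF is mutated, every query of every document that references the DDF is re-executed and the refreshed results are pushed to that document's collaborators.

```python
# Hedged sketch of result refresh after a DDF mutation; execute_query stands in
# for the real query engine and the document registry is a plain dictionary.
from collections import defaultdict

documents = {
    "D1": {"ddf": "ddf://example/taxi", "queries": ["SELECT count(*) FROM ddf"]},
    "D2": {"ddf": "ddf://example/taxi", "queries": ["SELECT avg(total_amount) FROM ddf"]},
}

def execute_query(ddf_uri, query):
    return f"result of {query!r} on {ddf_uri}"     # placeholder for real execution

def on_ddf_mutated(ddf_uri):
    """Recompute results for every document sharing the mutated DDF."""
    refreshed = defaultdict(list)
    for name, doc in documents.items():
        if doc["ddf"] == ddf_uri:                   # document references the DDF
            for q in doc["queries"]:
                refreshed[name].append(execute_query(ddf_uri, q))
    return refreshed                                # pushed to each document's collaborators

print(dict(on_ddf_mutated("ddf://example/taxi")))
```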
- the shared documents may represent articles, presentations, reports and so on.
- the collaborative editing allows users to include charts and results of large distributed data structures in documents.
- a team of developers may build an in-memory distributed data structure and share the URI of the in-memory distributed data structure with an executive for presentation to an audience.
- the ability to share the in-memory distributed data structure allows the data structure to be updated to reflect the latest information. This is distinct from a presentation with static information that does not change regardless of when the presentation is given.
- the sharing of documents with code and results based on executable code allows presentation of the latest results, which may be updated even as the executive gives the presentation.
- user U 1 may update text of the shared document 710 a (the text referring to static strings of text that are distinct from executable code).
- the updated text of the document is sent by the distributed data framework 200 to all users (e.g., users U 1 , U 2 ) that share the document 710 a .
- the result of editing the text does not affect other documents that are distinct from the updated document, for example, document 710 b and therefore, the users of set 730 b are not affected by the updating of the text of document 710 a.
- Embodiments further allow sharing of executable code and results based on sharing of code with other users.
- user U 1 may update a query included in shared document 710 a .
- the modification of the query may cause changes to results or charts presented in the document.
- the distributed data framework 200 executes the updated query using the DDF referenced in the document 710 a and sends the updated results or chart to the remaining users of the set 730 a (e.g., user U 2 ). If the query simply reads the data of the DDF, the result of modification of the query is presented only to the users of the set 730 a that share the modified document (and is not shown to users of set 730 b that share the document 710 b ).
- Embodiments further allow a user to execute code that modifies the DDF referenced by the shared document 710 a .
- the data of the DDF may be changed (e.g., deleted, updated, or new data added.)
- the modification of DDF may cause results of queries of the document to be updated if the queries use the DDF.
- the distributed data framework 200 identifies all queries of the document 710 a that use the DDF and updates the results of the queries displayed in the document.
- the updated document is sent for presentation to the users of the set 730 a.
- the distributed data framework 200 identifies all other documents that include a reference to the DDF.
- the distributed data framework 200 identifies queries of all the identified documents and updates the results/charts of the queries displayed in the respective documents if necessary.
- the distributed data framework 200 sends the updated documents for presentation to all users that share the document.
- the DDF 420 may be updated based on execution of code of shared document 710 a .
- the distributed data framework 200 updates results of queries based on the DDF 420 in document 710 a as well as document 710 b .
- the updated document 710 a is sent to users of the set 730 a and the updated document 710 b is sent to the users of the set 730 b.
- a user may request a document or a portion of a document to be locally edited (and not shared).
- the distributed data framework 200 makes a copy of the DDF 420 or an intermediate result set based on the DDF 420 .
- the distributed data framework 200 simply notes that the document is being locally edited and continues to share the DDF 420 with other documents until the DDF 420 is edited. If the DDF 420 is edited, the distributed data framework 200 makes a copy of the DDF for the document that is being locally edited. Note that the document being locally edited may be shared by a set of users even though it does not share the DDF referenced in the document with other documents.
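- A minimal copy-on-write sketch of this behavior is shown below (illustrative names; not the actual distributed data framework 200 code): a locally edited document keeps sharing the DDF until the DDF itself is edited, at which point a private copy is made for that document.

```python
# Copy-on-write sketch for a locally edited document, assuming a toy registry.
import copy

class DdfRegistry:
    def __init__(self):
        self.ddfs = {"ddf://example/sales": {"rows": [1, 2, 3]}}
        self.local_copies = {}                      # (doc_id, uri) -> private copy

    def mark_local(self, doc_id, uri):
        self.local_copies[(doc_id, uri)] = None     # noted, but no copy made yet

    def edit_ddf(self, doc_id, uri, mutate):
        key = (doc_id, uri)
        if key in self.local_copies:                # document is being locally edited
            if self.local_copies[key] is None:      # first edit: make the private copy
                self.local_copies[key] = copy.deepcopy(self.ddfs[uri])
            mutate(self.local_copies[key])          # other documents keep the shared DDF
        else:
            mutate(self.ddfs[uri])                  # global edit: visible to all documents

reg = DdfRegistry()
reg.mark_local("D1", "ddf://example/sales")
reg.edit_ddf("D1", "ddf://example/sales", lambda d: d["rows"].append(4))
print(reg.ddfs["ddf://example/sales"], reg.local_copies[("D1", "ddf://example/sales")])
```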
- the portion of the document being locally edited is based on an intermediate result derived from the DDF 420 .
- the distributed data framework 200 stores the intermediate result in either the in-memory cluster computing engine 220 (if the intermediate result is large) or else in a separate server (that may not be distributed).
- the intermediate result is stored in the client device 130 .
- Certain operations can be performed using only the data of the intermediate result, for example, aggregating the intermediate results or changing the format of the chart (so long as the new format does not require additional data from the DDF).
- a bar chart may be changed to a line chart based on the intermediate result.
- changing of a bar chart to a scatter plot may require accessing the DDF for obtaining a new sample data (for example, if the user requests to display a scatter plot based on a subset of data of the bar chart.)
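- The Python sketch below illustrates this distinction under assumed names: chart-format changes that can be redrawn from the cached intermediate result are served locally, while a change such as bar chart to scatter plot falls back to fetching data from the DDF.

```python
# Illustrative decision logic only; not the system's actual chart pipeline.
def change_chart(current, requested, intermediate_result, fetch_from_ddf):
    # formats that can be redrawn from the same aggregated intermediate result
    redraw_compatible = {("bar", "line"), ("line", "bar"), ("bar", "pie")}
    if (current, requested) in redraw_compatible:
        return {"format": requested, "data": intermediate_result}   # no DDF access needed
    # e.g., bar -> scatter may need a fresh sample of the underlying rows
    return {"format": requested, "data": fetch_from_ddf(requested)}

result = change_chart(
    "bar", "line",
    intermediate_result=[("Mon", 10), ("Tue", 12)],
    fetch_from_ddf=lambda fmt: [("Mon", 9.7), ("Tue", 12.1)],   # placeholder sampler
)
print(result)
```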
- FIG. 8 shows an example of a shared document with code and results based on code, according to an embodiment.
- the document 800 includes executable code 810 .
- the document 800 also includes results of execution of the executable code.
- the executable code 810 is based on a DDF (identified as “ddf”) used in a machine learning model (identified as “lm”). This example code builds a linear model of total_amount as a function of trip_distance and payment_type.
- the result 820 of execution of the linear model using the DDF ddf is shown in text form in FIG. 8 . In other embodiments, the result of execution of a command can be shown in a chart form.
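- The exact code of FIG. 8 is not reproduced in the text; the following Python sketch fits an analogous linear model of total_amount as a function of trip_distance and payment_type on a small in-memory sample (the toy rows and the use of a least-squares fit are assumptions for illustration, not the DDF machine-learning API).

```python
# Toy linear model: total_amount ~ trip_distance + payment_type, on local data.
import numpy as np

# toy rows: (trip_distance, payment_type, total_amount)
rows = [(1.2, "card", 7.5), (3.4, "cash", 14.0), (0.8, "card", 6.0), (5.1, "card", 21.5)]

trip = np.array([r[0] for r in rows])
is_cash = np.array([1.0 if r[1] == "cash" else 0.0 for r in rows])   # one-hot payment_type
y = np.array([r[2] for r in rows])

# design matrix with an intercept column
X = np.column_stack([np.ones_like(trip), trip, is_cash])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept", "trip_distance", "payment_type=cash"], coef.round(3))))
```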
- FIG. 9 shows the document of FIG. 8 with modifications made to the code, thereby causing the results to be updated, according to an embodiment.
- the code shown in FIG. 8 is updated to change the model.
- the distributed data framework 200 executes the modified code 910 to obtain a new set of results 920 .
- the updated results are presented to all users that share the document 800 . If a modification is made to the data of the DDF ddf, the results of all documents that include executable code based on the DDF ddf get modified.
- the analytics framework 230 generates reports, presentations, or dashboards based on the document comprising the text, code, and results.
- FIG. 10 shows an example of a dashboard generated from a document based on distributed data structures, according to an embodiment.
- the analytics framework 230 receives information identifying portions of the shared document that should not be displayed in the generated report/dashboard. For example, a user may indicate that portions of the document including executable code should not be displayed in the report/presentation/dashboard.
- the analytics framework 230 generates the requested report/presentation/dashboard by rendering the information identified for inclusion and excluding the information identified for exclusion.
- the analytics framework 230 receives a request to convert the shared document into a periodic report.
- the analytics framework 230 receives a schedule for generating the periodic report.
- the analytics framework 230 executes the code blocks of the shared document in accordance with the schedule. Accordingly, the analytics framework 230 updates the result portions of the shared document based on the latest execution of the code block.
- a code block may include a query based on a DDF.
- the analytics framework 230 updates the result portion corresponding to the code block based on the latest data of the DDF.
- the analytics framework 230 shares the updated document with users that have access to the shared document.
- the shared document may include a reference to a DDF based on an airlines database and the analytics framework 230 provides weekly or monthly reports to the users sharing the document.
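- A hedged sketch of such scheduled re-execution is shown below; the run_cell function and the weekly period are stand-ins for illustration, not the actual analytics framework 230 API.

```python
# Sketch of periodic re-execution of a document's code blocks on a schedule.
import sched, time

def run_cell(cell):
    return f"refreshed results for {cell!r} at {time.strftime('%Y-%m-%d %H:%M:%S')}"

def refresh_report(scheduler, document, period_seconds):
    # re-execute each code block and update the corresponding result portion
    document["results"] = [run_cell(c) for c in document["cells"]]
    print(document["results"])
    # re-arm the timer so the report is regenerated on the same schedule
    scheduler.enter(period_seconds, 1, refresh_report, (scheduler, document, period_seconds))

doc = {"cells": ["SELECT carrier, count(*) FROM ddf GROUP BY carrier"], "results": []}
s = sched.scheduler(time.time, time.sleep)
refresh_report(s, doc, 7 * 24 * 3600)   # runs once now and schedules the next weekly run
# s.run()   # left commented: would block while waiting for the next period
```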
- the analytics framework 230 can convert the shared document into a slide show or a dashboard based on a user request.
- the analytics framework 230 receives a request to generate a periodic report, slideshow, or a dashboard and generates the requested document based on the shared document rather than converting the shared document itself.
- the analytics framework 230 maintains the periodic reports, slideshows, or dashboards as shared documents that can be further edited and shared with other users. Accordingly, various operations disclosed herein apply to these generated/transformed documents.
- FIG. 10 shows various charts 1010 a , 1010 b , and 1010 c displayed in the dashboard generated by the analytics framework 230 .
- the charts 1010 refer to DDFs stored in the in-memory cluster computing engine 220 . Accordingly, the distributed data framework 200 may automatically update the data of the charts 1010 based on updates to the data of the associated DDF.
- the analytics framework 230 identifies all charts in an input document.
- the analytics framework 230 determines a layout for all the charts in a grid, for example, a 3 column grid.
- the analytics framework 230 may receive (from the user) a selection of a template specifying the layout of the dashboard.
- the analytics framework 230 receives instructions from users specifying modifications to the layout.
- the big data analysis system 100 allows users to drag and drop charts that snap to the grid and to resize charts within the grid.
- the big data analysis system 100 also allows users to set dashboards to be refreshed automatically at a specified time interval, e.g., 30 seconds, 1 minute, and so on.
- the generated dashboard includes instructions to execute any queries associated with each chart at the specified time interval by sending the queries to the distributed data framework 200 for execution.
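- The sketch below illustrates one way such a dashboard description could be assembled: charts are laid out in a 3 column grid and a refresh interval is attached so the associated queries can be re-sent periodically (all names are illustrative assumptions).

```python
# Toy dashboard builder: grid layout plus a refresh interval for the queries.
def build_dashboard(charts, columns=3, refresh_seconds=30):
    cells = []
    for i, chart in enumerate(charts):
        cells.append({
            "chart": chart["title"],
            "query": chart["query"],          # re-sent to the framework on each refresh
            "row": i // columns,
            "col": i % columns,
            "width": 1, "height": 1,          # resizable within the grid
        })
    return {"grid": cells, "refresh_seconds": refresh_seconds}

charts = [
    {"title": "Trips per day", "query": "SELECT day, count(*) FROM ddf GROUP BY day"},
    {"title": "Avg fare", "query": "SELECT day, avg(total_amount) FROM ddf GROUP BY day"},
    {"title": "Payment mix", "query": "SELECT payment_type, count(*) FROM ddf GROUP BY payment_type"},
    {"title": "Tips", "query": "SELECT day, avg(tip_amount) FROM ddf GROUP BY day"},
]
print(build_dashboard(charts)["grid"][3])     # the fourth chart wraps to the second row
```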
- Various portions of a document that is shared can be edited by all users that share the document.
- if the system receives a user request for execution of a code block (or cell), the system shows an indication that the code block in the shared document is being executed. Accordingly, the system shows a change in the status of the code block.
- the status of the code block may be indicated based on a color of the text or background of the code block, font of the code block, or by any visual mechanism, for example, by showing the code block as flashing.
- the status of the code block may be shown by a widget, for example, an image or icon associated with the code block. Accordingly, a status change of the code block causes a visual change in the icon or the widget.
- the changed status of the code block is synchronized across all client applications or client devices that share the document. Accordingly, the system shows the status of the code block as executing on any client device that is displaying a portion of the shared document including the code block that is executing.
- if the system receives a request to execute a code block of the shared document, the system locks the code block of the document, thereby preventing any users from editing the code block. The system also prevents other users from executing the code block. Accordingly, the system does not allow any edits to be performed on the executing code block from any client device that is displaying a portion of the shared document including the code block. Users are allowed to modify other portions of the document, for example, text portions or other code blocks. Nor does the system allow the code block to be executed again from any client device until the current execution is complete. In other words, the system allows a single execution of a code block at a time. Once the execution of the code block is complete, the system allows users to edit the code block or execute it again.
- a user may close the client application (e.g., a browser or a user-agent software) used to view/edit the shared document on a client device 130 . If the user closes the client application while one or more code blocks are executing, the system continues executing the code blocks and tracks their status. If a user that closed the client application reopens the client application to view the document, the system receives a login request from the user. In response to the request to view the shared document, the system provides the latest status of the code blocks. If a code block is executing, the system provides information indicating that the code block is executing and the user is still not allowed to edit or execute the code block. If the code block has completed execution, the system updates the result portions of the document, sends the updated document to the client device of the user, and allows the user to edit or execute the code block.
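- The following Python sketch illustrates the per-code-block locking and status broadcast described above (class and method names are assumptions): only one execution of a cell is allowed at a time, edits are blocked while the cell runs, and the latest status can be replayed to clients that reconnect.

```python
# Minimal per-cell lock and status broadcast; a single-process stand-in only.
import threading

class CellState:
    def __init__(self):
        self._lock = threading.Lock()
        self.status = "idle"                  # "idle" or "running"
        self.result = None

    def try_execute(self, run, broadcast):
        if not self._lock.acquire(blocking=False):
            return False                      # another execution is already in progress
        self.status = "running"
        broadcast(self.status)                # every open client shows "running"
        try:
            self.result = run()
        finally:
            self.status = "idle"
            broadcast(self.status)            # unlock editing/re-execution everywhere
            self._lock.release()
        return True

    def can_edit(self):
        return self.status != "running"

cell = CellState()
cell.try_execute(run=lambda: "42 rows", broadcast=lambda s: print("status ->", s))
print(cell.can_edit(), cell.result)
```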
- Embodiments allow shared documents that interact with the big data analysis system 100 using multiple languages for processing data of DDFs.
- a shared document includes text portions, result portions, and code blocks.
- a text portion is received from a user and shared with other users.
- the shared document may be associated with one or more DDFs stored across a plurality of compute nodes.
- a code block may process data of a DDF.
- a user can interact with the DDF by processing a query and receiving results of the query.
- the results of the query are displayed in the document and may be shared with other users.
- a user can also execute a statement via the document that modifies the DDF.
- the distributed data framework 200 receives statements sent by a user via a document and processes the statements (a statement can be a command or a query).
- a code block may include instructions that modify the DDF.
- a code block may include queries that are executed by the data analysis system. The result of execution of the queries is presented in result portions of the shared document. A result portion may present results in text form or graphical form, for example, as charts. Modification of a query by a user in a code block may result in the result portion of all users sharing the document getting updated.
- the big data analysis system 100 allows users to send instructions for processing data of a DDF using different languages. For example, the big data analysis system 100 receives a first set of instructions in a first language via a document and subsequently a second set of instructions in a second language provided via the same document (or via a different document). Both the first and second set of instructions may process data of the same DDF.
- the ability to collaborate via multiple languages allows different users to use the language of their choice while collaborating.
- certain features may be supported by one language and not another. Accordingly, a user can use a first language for operations supported by that language and switch to a second language to use operations supported by the second language (and not supported by the first language).
- the big data analysis system 100 allows users to specify code cells or code blocks in a document. Each code block may be associated with a specific language. This allows a user to specify the language for a set of instructions.
- a shared document uses a primary language for processing the DDFs. However, code blocks of one or more secondary languages may be included.
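- A minimal sketch of such per-code-block language dispatch is shown below, assuming hypothetical runtime entry points: each cell carries a language tag and is routed to the matching runtime, with the document's primary language used as the default.

```python
# Illustrative dispatch of code blocks to per-language runtimes.
PRIMARY_LANGUAGE = "java"

runtimes = {
    "java":   lambda src: f"[primary runtime] executed: {src}",
    "r":      lambda src: f"[R runtime] executed: {src}",
    "python": lambda src: f"[Python runtime] executed: {src}",
}

def execute_block(block):
    language = block.get("language", PRIMARY_LANGUAGE)     # default to the primary language
    runtime = runtimes.get(language)
    if runtime is None:
        raise ValueError(f"no runtime registered for language {language!r}")
    return runtime(block["source"])

document_cells = [
    {"language": "java", "source": "ddf.aggregate(\"carrier\", \"avg(delay)\")"},
    {"language": "r", "source": "summary(lm(total_amount ~ trip_distance, data=ddf))"},
]
for cell in document_cells:
    print(execute_block(cell))
```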
- FIG. 11 shows the architecture of the in-memory cluster computing engine illustrating how a distributed data structure (DDF) is allocated to various compute nodes, according to an embodiment.
- the in-memory cluster computing engine 220 comprises multiple compute nodes 280 .
- a DDF is distributed across multiple compute nodes 280 .
- Each compute node 280 is allocated a portion of the data of the DDF, referred to as a DDF segment 1110 .
- when the distributed data framework 200 receives a request to process data of a DDF, the distributed data framework 200 sends a corresponding request to each compute node 280 storing a DDF segment 1110 to process the data of that DDF segment 1110 .
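- The following local Python stand-in (not the distributed implementation) illustrates the fan-out: the same request is applied to every segment and the partial results are combined.

```python
# Toy fan-out: each "node" processes only its own DDF segment; partials are merged.
ddf_segments = {
    "node-1": [3, 5, 8],          # DDF segment held by one compute node
    "node-2": [2, 7],
    "node-3": [4, 4, 6],
}

def process_segment(node, segment):
    # each node processes only its own portion of the DDF
    return {"node": node, "count": len(segment), "sum": sum(segment)}

def process_ddf(segments):
    partials = [process_segment(node, seg) for node, seg in segments.items()]
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {"partials": partials, "mean": total / count}

print(process_ddf(ddf_segments))
```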
- each compute node 280 includes a primary runtime 1120 that stores the DDF segment 1110 .
- the DDF segment 1110 has a data structure based on the primary runtime 1120 .
- the data structure of the DDF segment 1110 conforms to the primary language of the distributed data framework and can be processed by the primary runtime executing instructions of the primary language.
- the primary runtime 1120 is capable of processing instructions in a primary language of operation for the distributed data framework 200 . Accordingly, if a user provides a set of instructions using the primary language, the distributed data framework 200 provides corresponding instructions to the primary runtime for execution.
- the primary runtime 1120 may be a virtual machine of a language, for example, a JAVA virtual machine for processing instructions received in the programming language JAVA.
- the primary runtime 1120 may support other programming languages such as PYTHON, R language, or any proprietary languages.
- FIG. 12 shows the architecture of the in-memory cluster computing engine illustrating how multiple runtimes are used to process instructions provided in multiple languages, according to an embodiment.
- the distributed data framework 200 may receive instructions in a language different from the primary language. For example, if the primary language for interacting with the distributed data framework 200 is JAVA, a user may provide a statement in the R language.
- users can interact with the distributed data framework 200 using a set of language agnostic APIs supported by the distributed data framework 200 .
- the language agnostic APIs allow users to provide the required parameters and identify a method/function to be invoked using the primary language.
- the distributed data framework 200 receives the parameters and the method/function identifier and provides these to the primary runtime 1120 .
- the primary runtime 1120 invokes the appropriate method/function using the provided parameter values.
- the primary runtime 1120 provides the results by executing the method/function.
- the distributed data framework 200 provides the results to the caller for display via the document used to send the request.
- the distributed data framework 200 may receive instructions in a language other than the primary language of the distributed data framework 200 (referred to as a secondary language).
- the distributed data framework 200 may receive a request to process a function that is available in the secondary language but not in the primary language.
- the R language supports several functions commonly used by data scientists that may not be supported by JAVA (or not available in the set of libraries accessible to the primary runtime 1120 ).
- the in-memory cluster computing engine 220 starts a secondary runtime 1220 that is configured to execute instructions provided in the secondary language.
- the secondary runtime 1220 is started on each compute node 280 that has a DDF segment 1110 for the DDF being processed.
- Each compute node 280 transforms the data structure representing the DDF segment 1110 conforming to the primary language to a data structure representing the DDF segment 1210 conforming to the secondary language.
- the compute node transforms a DDF segment represented as a list of byte buffers (representing a TablePartition structure conforming to the JAVA language representation) to a list of vectors in R (representing a DataFrame structure of the R language). Furthermore, the compute node performs appropriate data type conversions; e.g., the compute node converts a TablePartition Columniterator of Integer to an R integer vector, a Java Boolean to an R logical vector, and so on.
- the compute node encodes any special values based on the target runtime; for example, the compute node converts a floating point NaN (not-a-number special value) to R's NA value (not-available value) while converting to an R representation, or to Java null pointers while converting to a Java representation. If the secondary runtime is based on Python, the compute node converts the DDF segment to a DataFrame representation of the Python language.
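- The toy, single-process Python sketch below mirrors the shape of this conversion (the real system converts JVM TablePartition columns into R vectors; the NA sentinel and column names here are illustrative assumptions): each column is mapped by type, and special values such as floating point NaN or null are encoded for the target representation.

```python
# Column-wise type mapping with special-value encoding (illustration only).
import math

NA = object()                                   # stand-in for an NA / not-available value

def convert_column(values, source_type):
    if source_type == "integer":
        return [NA if v is None else int(v) for v in values]        # -> integer vector
    if source_type == "boolean":
        return [NA if v is None else bool(v) for v in values]       # -> logical vector
    if source_type == "double":
        return [NA if v is None or (isinstance(v, float) and math.isnan(v)) else v
                for v in values]                                     # NaN/null -> NA
    raise TypeError(f"unsupported column type: {source_type}")

segment = {"trip_distance": ("double", [1.2, float("nan"), 3.4]),
           "paid_cash": ("boolean", [True, None, False])}

converted = {name: convert_column(vals, t) for name, (t, vals) in segment.items()}
print(converted["trip_distance"][1] is NA, converted["paid_cash"][1] is NA)
```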
- the primary runtime 1120 (of each compute node having a DDF segment of the DDF being processed) executes instructions that transform the DDF segment 1110 representation (conforming to the primary language) to a DDF segment representation conforming to the secondary language.
- the primary runtime 1120 uses a protocol to communicate the transformed DDF segment representation to the secondary runtime 1220 .
- the primary runtime 1120 may open a pipe (or socket) to communicate with the process of the secondary runtime 1220 .
- the transformed DDF segment representation is stored in the secondary runtime 1220 as DDF segment 1210 .
- the secondary runtime 1220 performs the processing based on the DDF segment 1210 by executing the received instructions in the secondary language.
- the processing performed by the secondary runtime 1220 may result in generation of a new DDF (that is distributed as DDF segments across the compute nodes.) Accordingly, each secondary runtime 1220 instance stores a DDF segment corresponding to the generated DDF.
- the generated DDF segment stored in the secondary runtime 1220 conforms to the secondary language.
- the secondary runtime 1220 transforms the generated DDF segment to a transformed generated DDF segment that conforms to the primary language.
- the secondary runtime 1220 sends the transformed generated DDF segment to the primary runtime 1120 .
- the primary runtime stores the transformed generated DDF segment for processing instructions received via the document in the primary language.
- the processing performed by the secondary runtime 1220 may result in modifications to the stored DDF segment 1210 .
- the modified DDF segment conforms to the secondary language.
- the secondary runtime 1220 transforms the modified DDF segment to conform to the primary language and sends the transformed modified DDF segment to the primary runtime 1120 .
- the primary runtime stores the transformed modified DDF segment for processing instructions received via the document in the primary language.
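- The round trip described above is summarized by the following single-process Python sketch, in which the runtimes are stand-in functions: the segment is transformed to the secondary format, processed by the secondary-language step, transformed back, and kept for subsequent primary-language instructions.

```python
# Toy round trip: primary format -> secondary format -> processing -> primary format.
def to_secondary(segment):          # e.g., TablePartition columns -> data-frame columns
    return {col: list(vals) for col, vals in segment.items()}

def secondary_process(frame):       # e.g., a transformation only available in the secondary language
    frame["fare_per_mile"] = [f / d for f, d in zip(frame["fare"], frame["distance"])]
    return frame

def to_primary(frame):              # convert the result back to the primary layout
    return {col: tuple(vals) for col, vals in frame.items()}

primary_segment = {"distance": (1.0, 2.0, 4.0), "fare": (6.0, 11.0, 18.0)}
result_segment = to_primary(secondary_process(to_secondary(primary_segment)))

# the primary runtime keeps the transformed result for subsequent primary-language
# instructions, e.g. an aggregate over the newly derived column
print(sum(result_segment["fare_per_mile"]) / len(result_segment["fare_per_mile"]))
```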
- This mechanism allows the distributed data framework 200 to process instructions received for processing the DDF in languages other than the primary language of the distributed data framework 200 . Accordingly, embodiments allow the DDF to be mutated using a secondary language.
- the distributed data framework 200 allows further processing to be performed using the primary language. Accordingly, a user can mix instructions for processing a DDF in different languages in the same document.
- the document for processing the DDF in multiple languages is shared, thereby allowing different users to provide instructions in different languages.
- the same DDF is shared between different documents.
- the DDF may be processed using instructions in different languages received from different documents.
- the distributed data framework 200 may modify a DDF based on instructions in one language and then receive queries (or statements to further modify the DDF) in a different language.
- Embodiments can support multiple secondary languages by creating multiple secondary runtimes, one for processing instructions of each type of secondary language.
- FIG. 13 shows an interaction diagram illustrating the processing of DDFs based on instructions received in multiple languages, according to an embodiment.
- the distributed data framework 200 receives instructions in different languages from documents edited by users via client devices 130 .
- the instructions are received by the in-memory cluster computing engine 220 and sent to each compute node that stores a DDF segment of the DDF being processed by the instructions.
- there may be different components/modules involved in the processing (different from those shown in FIG. 13 ).
- the client device 130 sends 1310 instructions in the primary language to the primary runtime 1120 of each compute node storing a DDF segment 1110 .
- the primary runtime 1120 receives the instructions in the primary language from the client device 130 and processes 1315 them using the DDF segment.
- the primary runtime 1120 sends 1320 the results back to the client device 130 .
- the results may be sent via different software modules; for example, the primary runtime 1120 may send the results to the in-memory cluster computing engine 220 , the in-memory cluster computing engine 220 may send the results to the analytics framework 230 , which in turn may send the results to the client device 130 .
- for simplicity, the client device 130 is shown interacting directly with the primary runtime 1120 .
- the processing 1315 of the instructions may cause the DDF to mutate such that subsequent instructions process the mutated DDF.
- the client device 130 subsequently sends 1325 instructions in the secondary language.
- the instructions may include a call to a built-in function that is implemented in the secondary language and not in the primary language.
- the primary runtime 1120 transforms 1330 the DDF segment stored in the compute node of the primary runtime 1120 to a transformed DDF segment that conforms to the secondary language.
- the primary runtime 1120 sends 1335 the transformed DDF segment to the secondary runtime 1220 .
- the secondary runtime 1220 processes 1340 the instructions in the secondary language using the transformed DDF segment.
- the processing 1340 may generate a result DDF.
- the result DDF segment may be a new DDF segment generated by processing 1340 the instructions.
- the result DDF segment may be a mutated form of the input DDF segment.
- the secondary runtime 1220 transforms the result DDF to a format that conforms to the primary language.
- the secondary runtime 1220 sends 1350 the transformed result DDF to the primary runtime 1120 .
- the primary runtime 1120 stores the transformed result for further processing, for example, if subsequent instructions based on the result DDF are received.
- the primary runtime 1120 sends 1335 any results based on the processing 1340 to the client device (for example, any result code, aggregate values, and so on).
- the client device 130 sends 1360 further instructions in primary language for processing using the result DDF.
- the primary runtime 1120 processes 1365 the received instructions using the result DDF.
- the primary runtime 1120 sends 1370 any results based on the processing 1365 back to the client device 130 .
- the distributed data framework 200 runtime automatically selects the best representation of data for in-memory storage and algorithm execution without the user's involvement. By default, a compressed columnar data format is used, which is optimized for analytic queries and univariate statistical analysis.
- the distributed data framework 200 performs conversions that are optimized for particular algorithms; e.g., for a linear regression command, the distributed data framework 200 performs a conversion to extract values from selected columns and build a matrix representation.
- the distributed data framework 200 caches the matrix representation in memory for the iterative machine learning process.
- the distributed data framework 200 deletes (i.e., uncaches) the matrix representation from the cache when the algorithm is finished.
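- The sketch below illustrates the idea with a toy iterative algorithm in Python (numpy and the gradient-descent loop are assumptions for illustration, not the framework's machine-learning code): selected columns are materialized once as a matrix, reused across iterations from a cache, and uncached when the algorithm finishes.

```python
# Column-to-matrix conversion, cached for an iterative algorithm, then uncached.
import numpy as np

columnar = {"trip_distance": [1.2, 3.4, 0.8, 5.1], "total_amount": [7.5, 14.0, 6.0, 21.5]}
cache = {}

def build_matrix(columns, names):
    key = tuple(names)
    if key not in cache:                                   # convert and cache once
        cache[key] = np.column_stack([np.asarray(columns[n], dtype=float) for n in names])
    return cache[key]

X = build_matrix(columnar, ["trip_distance"])
y = np.asarray(columnar["total_amount"])
w = 0.0
for _ in range(200):                                       # iterations reuse the cached matrix
    grad = 2 * (X[:, 0] * (w * X[:, 0] - y)).mean()
    w -= 0.01 * grad
print(round(w, 3))

cache.clear()                                              # uncache when the algorithm finishes
```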
- the distributed data framework 200 provides an extensible framework for providing support for different programming languages.
- the distributed data framework 200 receives, from a user, software modules for converting data values from the format of one language to the format of a new language.
- the distributed data framework 200 further receives code for runtime of the new language.
- the distributed data framework 200 allows code blocks to be specified using the new language. As a result, the distributed data framework 200 can be easily extended with support for new languages without requiring modifications to the code for existing languages.
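- The extension point might look like the following Python sketch (the registry and hook names are assumptions): a new language is added by registering its converters and a runtime launcher, leaving the support for existing languages untouched.

```python
# Illustrative language registry: converters plus a runtime entry point per language.
language_support = {}

def register_language(name, to_lang, from_lang, run):
    language_support[name] = {"to": to_lang, "from": from_lang, "run": run}

def execute_in(name, segment, source):
    lang = language_support[name]
    converted = lang["to"](segment)            # primary format -> new language format
    result = lang["run"](converted, source)    # execute the code block in that runtime
    return lang["from"](result)                # convert the result back

# registering a toy language purely through the extension hooks
register_language(
    "toylang",
    to_lang=lambda seg: dict(seg),
    from_lang=lambda res: res,
    run=lambda frame, src: {"rows": len(next(iter(frame.values()))), "ran": src},
)
print(execute_in("toylang", {"col": [1, 2, 3]}, "describe(col)"))
```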
- any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment.
- the appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
- Some embodiments may be described using the terms “coupled” and “connected” along with their derivatives. It should be understood that these terms are not intended as synonyms for each other. For example, some embodiments may be described using the term “connected” to indicate that two or more elements are in direct physical or electrical contact with each other. In another example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.
- the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion.
- a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
- “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
Description
- This application is a continuation of U.S. patent application Ser. No. 14/814,457, which claims the benefits of U.S. Provisional Application No. 62/086,158 filed on Dec. 1, 2014, both of which are hereby incorporated by reference in their entirety.
- The disclosure relates to efficient processing of large data sets using parallel and distributed systems. More specifically, the disclosure concerns various aspects of processing of distributed data structures including collaborative processing of distributed data structures using shared documents.
- Enterprises produce large amounts of data based on their daily activities. This data is stored in a distributed fashion among a large number of computer systems. For example, a large amount of information is stored as logs of various systems of the enterprise. Processing such large amounts of data to gain meaningful insights into the information describing the enterprise requires a large amount of resources. Furthermore, conventional techniques available for processing such large amounts of data typically require users to perform cumbersome programming.
- Furthermore, users have to deal with complex systems that perform parallel/distributed programming to be able to process such large amount of data. Software developers and programmers (also referred to as data engineers) who are experts at programming and using such complex systems typically do not have the knowledge of a business expert or a data scientist to be able to identify the requirements for the analysis. Nor are the software developers able to analyze the results on their own.
- As a result, there is a gap between the process of identifying requirements and analyzing results and the process of programming the parallel/distributed systems to achieve the results. This gap results in time consuming communications between the business experts/data scientists and the data engineers. Data scientists, business experts, as well as data engineers act as resources of an enterprise. As a result, the above gap adds significant costs to the process of data analysis. Furthermore, this gap leads to possibilities of errors in the analysis since a data engineer can misinterpret certain requirements and may generate incorrect results. The business experts or the data scientists do not have the time or the expertise to verify the accuracy of the software developed by the developers.
- Some tools and systems are available to assist data scientists and business experts with the above process of providing requirements and analyzing results of big data analysis. The tools and systems used by data scientists are typically difficult for business experts to use and tools and systems used by business experts are difficult for data scientists to use. This creates another gap between the analysis performed by data scientists and the analysis performed by business experts. Therefore conventional techniques for providing insights into big data stored in distributed systems of an enterprise fail to provide suitable interface for users to analyze the information available in the enterprise.
- Embodiments support multi-language support for data processing. A system stores an in-memory distributed data frame structure (DDF) across a plurality of compute nodes. Each compute node stores a portion of the in-memory distributed data structure (DDF segment). The data of the DDF conforms to a primary language. The system further stores a document comprising text and code blocks. The code blocks comprise a first code block for providing instructions using the primary language and a second code block for providing instructions using a secondary language.
- The system receives a request to process instructions specified in the first code block using the primary language. Each compute node processes the instructions to process the DDF segment mapped to the compute node. The system further receives a request to process instructions specified in the second code block using the secondary language. Each compute node transforms the data of the DDF segment mapped to the compute node to conform to the format of the secondary language. Each compute node executes the instructions of the secondary language to generate a result DDF segment. The system transforms data of the result DDF segment to a format conforming to the primary language. Each compute node processes further instructions specified using the primary language to process the transformed result DDF segment mapped to the compute node.
- The disclosed embodiments have other advantages and features which will be more readily apparent from the detailed description, the appended claims, and the accompanying figures (or drawings). A brief introduction of the figures is below.
- FIG. 1 shows the overall system environment for performing analysis of big data, in accordance with an embodiment of the invention.
- FIG. 2 shows the system architecture of a big data analysis system, in accordance with an embodiment.
- FIG. 3 shows the system architecture of the distributed data framework, in accordance with an embodiment.
- FIG. 4 illustrates a process for collaborative processing based on a DDF between users, according to an embodiment of the invention.
- FIG. 5 illustrates a process for converting immutable data sets received from the in-memory cluster computing engine to mutable data sets, according to an embodiment of the invention.
- FIG. 6 shows the process illustrating use of a global editing mode and a local editing mode during collaboration, according to an embodiment of the invention.
- FIG. 7 shows collaborative editing of documents including text, code, and charts based on distributed data structures, according to an embodiment.
- FIG. 8 shows an example of a shared document with code and results based on code, according to an embodiment.
- FIG. 9 shows an example of a shared document with code and results updated based on execution of the code, according to an embodiment.
- FIG. 10 shows an example of a dashboard generated from a document based on distributed data structures, according to an embodiment.
- FIG. 11 shows the architecture of the in-memory cluster computing engine illustrating how a distributed data structure (DDF) is allocated to various compute nodes, according to an embodiment.
- FIG. 12 shows the architecture of the in-memory cluster computing engine illustrating how multiple runtimes are used to process instructions provided in multiple languages, according to an embodiment.
- FIG. 13 shows an interaction diagram illustrating the processing of DDFs based on instructions received in multiple languages, according to an embodiment.
- FIG. 14 is a high-level block diagram illustrating an example of a computer for use as a system for performing formal verification with low power considerations, in accordance with an embodiment.
- The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter.
- A big data analysis system provides an abstraction of database tables based on distributed in-memory data structures obtained from big data sources, for example, files of a distributed file system. A user can retrieve data from a large, distributed, complex file system and treat the data as database tables. As a result, the big data analysis system allows users to use familiar data abstractions, such as filtering data and joining data, for large datasets; such abstractions are commonly supported by systems that handle small data, for example, single processor database management systems. The big data analysis system supports various features including schema, filtering, projection, transformation, data mining, machine learning, and so on.
- Embodiments support various operations commonly used by data scientists. These include various types of statistical computation, sampling, machine learning, and so on. However, these operations are supported on large data sets processed by a distributed architecture. Unlike conventional systems that allow processing of large data using distributed systems, embodiments create a long term session for a user and track distributed data structures for users so as to allow users to modify the distributed data structures. The ability to create long term sessions allows embodiments to provide functionality similar to existing data analysis systems that are used for small data processing, for example, the R programming language and interactive system.
- Furthermore, embodiments support high-level data analytics functionality, thereby allowing users to focus on the data analysis rather than low level implementation details of how to manage large data sets. This is distinct from conventional systems, for example, systems that support the map-reduce paradigm and require users to express high-level analytics functions as map and reduce functions. The map-reduce paradigm requires users to be aware of the distributed nature of data and requires users to use the map and reduce operations for expressing the data analysis operations.
- Embodiments further allow integration of large data sets with various machine learning techniques, for example, with externally available machine learning libraries. Furthermore, the ability to store distributed data structures in memory and identify the distributed data structures using URIs allows embodiments to support clients using various languages, for example, Java, Scala, R and Python, and also natural language.
- Embodiments support collaboration between multiple users working on the same distributed data set. A user can refer to a distributed data structure using a URI (uniform resource identifier). The URI can be passed between users, for example, by email. Accordingly, a new user can get access to a distributed data structure that is stored in memory. Embodiments further allow a user to train a machine learning model, create a name for the machine learning model, and transfer the name to another user so as to allow the other user to execute the machine learning model. For example, a data scientist can create a distributed data structure or a machine learning model and provide it to an executive of an enterprise to present the data or model to an audience. The executive can perform further processing using the data or the model as part of a presentation by connecting to a system based on these embodiments.
- Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.
- FIG. 1 shows the overall system environment for performing analysis of big data, in accordance with an embodiment of the invention. The overall system environment includes an enterprise 110 , a big data analysis system 100 , a network 150 and client devices 130 . Other embodiments can use more or less or different systems than those illustrated in FIG. 1 . Functions of various modules and systems described herein can be implemented by other modules and/or systems than those described herein.
- FIG. 1 and the other figures use like reference numerals to identify like elements. A letter after a reference numeral, such as “120 a ,” indicates that the text refers specifically to the element having that particular reference numeral. A reference numeral in the text without a following letter, such as “120,” refers to any or all of the elements in the figures bearing that reference numeral (e.g., “120” in the text refers to reference numerals “120 a ” and/or “120 b ” in the figures).
- The enterprise 110 is any business or organization that uses computer systems for processing its data. Enterprises 110 are typically associated with a business activity, for example, sale of certain products or services, but can be any organization or groups of organizations that generates a significant amount of data. The enterprise 110 includes several computer systems 120 for processing information of the enterprise. For example, a business may use computer systems for performing various tasks related to the products or services offered by the business. These tasks include sales transactions, inventory management, employee activities, workflow coordination, information technology management, and so on.
- For example, airlines process data of hundreds of thousands of passengers traveling every day and large numbers of flights carrying passengers every day. The information describing the flights and passengers of each flight over few years can be several terabytes of data. Enterprises that process petabytes of data are not uncommon. Similarly, search engines may store information describing millions of searches performed by users on a daily basis that can generate terabytes of data in a short time interval. As another example, social networking systems can have hundreds of millions of users. These users interact daily with the social networking system generating petabytes of data.
- The big
data analysis system 100 allows analysis of the large amount of data generated by the enterprise. The bigdata analysis system 100 may include a large number of processors for analyzing the data of theenterprise 110. In some embodiments, the bigdata analysis system 100 is part of theenterprise 110 and utilizes computer systems 120 of theenterprise 110. Data from the computer systems 120 ofenterprise 110 that generate the data may be imported 155 into the computer systems that perform the big data analysis. - The
client devices 130 are used by users of the bigdata analysis system 100 to perform the analysis and study of data obtained from theenterprise 110. The users of theclient devices 130 include data analysts, data engineers, and business experts. In an embodiment, theclient device 130 executes a client application that allows users to interact with the bigdata analysis system 100. For example, the client application executing on theclient device 130 may be an internet browser that interacts with web servers executing on computer systems of the bigdata analysis system 100. - Systems and applications shown in
FIG. 1 can be executed using computing devices. A computing device can be a conventional computer system executing, for example, a Microsoft™ Windows™-compatible operating system (OS), Apple™ OS X, and/or a Linux distribution. A computing device can also be a client device having computer functionality, such as a personal digital assistant (PDA), mobile telephone, video game system, etc. - The interactions between the
client devices 130 and the bigdata analysis system 100 are typically performed via anetwork 150, for example, via the internet. The interactions between the bigdata analysis system 100 and the computer systems 120 of theenterprise 110 are also typically performed via anetwork 150. In one embodiment, the network uses standard communications technologies and/or protocols. In another embodiment, the various entities interacting with each other, for example, the bigdata analysis system 100, theclient devices 130, and the computer systems 120 can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above. Depending upon the embodiment, the network can also include links to other networks such as the Internet. -
FIG. 2 shows the system architecture of a big data analysis system, in accordance with an embodiment. A bigdata analysis system 100 comprises a distributedfile system 210, an in-memory cluster computing engine 220, a distributeddata framework 200, ananalytics framework 230, aweb server 240, acustom application server 260, and aprogramming language interface 270. The bigdata analysis system 100 may include additional or less modules than those shown inFIG. 2 . Furthermore, specific functionality may be implemented by modules other than those described herein. The bigdata analysis system 100 is also referred to herein as a data analysis system or a system. - The distributed
file system 210 includes multiple data stores 250. These data stores 250 may execute on different computers. The distributedfile system 210 may store large data files that may store gigabytes or terabytes of data. The data files may be distributed across multiple computer systems. In an embodiment, the distributedfile system 210 replicates the data for high availability. Typically, the distributedfile system 210 processes immutable files to which writes are not performed. An example of a distributed file system is HADOOP DISTRIBUTED FILE SYSTEM (HDFS). - The in-memory cluster computing engine 220 loads data from the distributed
file system 210 into a cluster of compute nodes 280. Each compute node includes one or more processors and memory for storing data. The in-memory cluster computing engine 220 stores data in-memory for fast access and fast processing. For example, the distributeddata framework 200 may receive repeated queries for processing the same data structure stored in the in-memory cluster computing engine 220, the distributeddata framework 200 can process the queries efficiently by reusing the data structure stored in memory without having to load the data from the file system. An example of an in-memory cluster computing engine is the APACHE SPARK system. - The distributed
data framework 200 provides an abstraction that allows the modules interacting with the distributeddata framework 200 to treat the underlying data provided by the distributedfile system 210 or the in-memory cluster computing engine 220 interface as structured data comprising tables. The distributeddata framework 200 supports an application programming interface (API) that allows a caller to treat the underlying data as tables. For example, a software module can interact with the distributeddata framework 200 by invoking APIs supported by the distributeddata framework 200. - Furthermore, the interface provided by the distributed
data framework 200 is independent of the underlying system. In other words, the distributeddata framework 200 may be provided using different implementations in-memory cluster computing engines 220 (or different distributed file systems 210) that are provided by different vendors and support different types of interfaces. However, the interface provided by the distributeddata framework 200 is the same for different underlying systems. - The table based structure allows users familiar with database technology to process data stored in the in-memory cluster computing engine 220. The table based distributed data structure provided by the distributed data framework is referred to as distributed data-frame (DDF). The data stored in the in-memory cluster computing engine 220 may be obtained from data files stored in the distributed
file system 210, for example, log files generated by computer systems of an enterprise. - The distributed
data framework 200 processes large amount of data using the in-memory cluster computing engine 220, for example, materialization and transformation of large distributed data structures. The distributeddata framework 200 performs computations that generate smaller size data, for example, aggregation or summarization results and provides these results to a caller of the distributeddata framework 200. The caller of the distributeddata framework 200 is typically a machine that is not capable of handling large distributed data structures. For example, a client device may receive the smaller size data generated by the distributeddata framework 200 and perform visualization of the data or presentation of data via different types of user interfaces. Accordingly the distributeddata framework 200 hides the complexity of large distributed data structures and provides an interface that is based on manipulation of small data structures, for example, database tables. - In an embodiment, the distributed
data framework 200 supports SQL (structured query language) queries, data table filtering, projections, group by, and join operations based on distributed data-frames. The distributeddata framework 200 provides transparent handling of missing data, APIs for transformation of data, and APIs providing machine-learning features based on distributed data-frames. - The
analytics framework 230 supports higher level operations based on the table abstraction provided by the distributeddata framework 200. For example, theanalytics framework 230 supports collaboration using the distributed data structures represented within the in-memory cluster computing engine 220. Theanalytics framework 230 supports naming of distributed data structures to facilitate collaboration between users of the bigdata analysis system 100. In an embodiment, theanalytics framework 230 maintains a table mapping user specified names to locations of data structures. - The
analytics framework 230 allows computation of statistics describing a DDF, for example, mean, standard deviation, variance, count, minimum value, maximum value, and so on. Theanalytics framework 230 also determines multivariate statistics for a DDF including correlation and contingency tables. Furthermore,analytics framework 230 allows grouping of DDF data and merging of two or more DDFs. Several examples of the types of computations supported by theanalytics framework 230 are disclosed in the Appendix. - The big
- The big data analysis system 100 allows different types of interfaces to interact with the underlying data. These include programming-language-based interfaces as well as graphical user interfaces. The web server 240 allows users to interact with the big data analysis system 100 via browser applications or via web services. The custom application server 260 allows custom applications designed for the big data analysis system 100 to interact with it. - The
web server 240 receives requests from web browser clients and processes the requests. The web browser requests are typically requests sent using a web browser protocol, for example, a hyper-text transfer protocol (HTTP). The results returned to the requester are typically in the form of markup language documents, for example, documents specified in hyper-text markup language (HTML). - The
custom application server 260 receives and processes requests from custom applications that are designed for interacting with bigdata analysis system 100. For example, a customized user interface receives requests for the bigdata analysis system 100 specified using data analysis languages, for example, the R language used for statistical computing. The customized user interface may use a proprietary protocol for interacting with the bigdata analysis system 100. - The
programming language interface 270 allows programs written in specific programming languages supported by the bigdata analysis system 100 to interact with the bigdata analysis system 100. For example, programmers can interact with thedata analysis system 100 using PYTHON or JAVA language constructs. - The distributed
data framework 200 supports various types of analytics operations based on the data structures exposed by the distributed data framework 200. FIG. 3 shows various modules within a distributed data framework module, in accordance with an embodiment of the invention. As shown in FIG. 3, the distributed data framework 200 includes a distributed data-frame manager 310 and handlers including an ETL handler 320, a statistics handler 330, and a machine learning handler 340. Other embodiments may include more or fewer modules than those shown in FIG. 3. - The distributed data-
frame manager 310 supports loading data from big data sources of the distributed file system 210 into DDFs. The distributed data-frame manager 310 also manages a pool of DDFs. The various handlers provide a pluggable architecture, making it easy to add new functionality to, or replace existing functionality of, the distributed data framework 200. The ETL handler 320 supports ETL (extract, transform, and load) operations, the statistics handler 330 supports various statistical computations applied to DDFs, and the machine learning handler 340 supports machine learning operations based on DDFs. - In an embodiment, the distributed
data framework 200 provides interfaces in different programming languages including Java, Scala, R and Python so that users can easily interact with the in-memory cluster computing engine 220. In a client/server setting, a client can connect to a distributeddata framework 200 via a web browser or a custom application based interface and issue commands for execution on the in-memory cluster computing engine 220. - The distributed
data framework 200 allows users to load a DDF in memory and perform operations on the data stored in memory. These include filtering, aggregating, joining a data set with another, and so on. Since a client device 130 has limited resources in terms of computing power or memory, a client device 130 is unable to load an entire DDF from the in-memory cluster computing engine 220. Therefore, the distributed data framework 200 supports APIs that allow a subset of data to be retrieved from a DDF by a requestor. - In an embodiment, the distributed
data framework 200 supports an API “fetchRows(df, N)” that allows the caller to retrieve the first N rows of a DDF df. If the distributeddata framework 200 receives a request “fetchRows(df, N)”, the distributeddata framework 200 identifies the first N rows of the DDF df and returns the identified rows to the caller. - The distributed
data framework 200 supports an API “sample(df, N)” that allows the caller to retrieve a sample of N rows of a DDF df. In response to a request “sample(df, N)”, the distributeddata framework 200 samples data of the DDF df based on a preconfigured sampling strategy and returns a set of N rows obtained by sampling to the caller. - The distributed
data framework 200 supports an API “sample2ddf(df, p)” that allows the caller to compute a sample of p % of the rows of the DDF df and assign the result to a new DDF. In response to a request “sample2ddf(df, p)”, the distributed data framework 200 samples data of the DDF df based on a preconfigured sampling strategy to identify p % of the rows of the DDF df and creates a new DDF based on the result. The distributed data framework 200 returns the result DDF to the caller, for example, by sending a reference or pointer to the DDF to the caller.
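- The three retrieval APIs described above can be combined as in the following sketch. Only the call shapes fetchRows(df, N), sample(df, N), and sample2ddf(df, p) come from the text; the surrounding DDF and DDFClient types are assumptions made so the example is self-contained.

    // Hypothetical client-side usage of the row-retrieval and sampling APIs.
    import java.util.List;

    class RetrievalExample {
        interface Row {}
        interface DDF {}
        interface DDFClient {
            List<Row> fetchRows(DDF df, int n);      // first N rows of the DDF
            List<Row> sample(DDF df, int n);         // N rows chosen by the sampling strategy
            DDF sample2ddf(DDF df, double percent);  // p% sample materialized as a new DDF
        }

        static void preview(DDFClient client, DDF df) {
            List<Row> head = client.fetchRows(df, 100);    // small preview suitable for a client device
            List<Row> random = client.sample(df, 100);     // sampled preview
            DDF tenPercent = client.sample2ddf(df, 10.0);  // reference to a server-side sample
            System.out.println(head.size() + " " + random.size() + " " + tenPercent);
        }
    }
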
- In an embodiment, the distributed data-frame manager 310 acts as a server that allows users to connect and create sessions that allow the users to interact with the distributed data framework 200 and process data. Accordingly, the distributed data framework 200 creates sessions that allow users to maintain distributed in-memory data structures in the in-memory cluster computing engine 220 for long periods of time, for example, weeks. Furthermore, the session maintains the state of the in-memory data structures so as to allow a sequence of multiple interactions with the same data structure. The interactions include requests that modify the data structure such that a subsequent request can access the modified data structure. As a result, a user can perform a sequence of operations to modify the data structure. - The distributed data-
frame manager 310 allows users to collaborate using a particular large distributed data structure (i.e., DDF). For example, a particular user can create a DDF loading data from log files of an enterprise into an in-memory data structure, perform various transformations, and share the data structure with other users. The other user may continue making transformations or view the data from the DDF in a user interface, for example, build a chart based on the data of the DDF for presentation to an audience. - The
analytics framework 230 receives a user request (or a request from a software module) to assign a name to a distributed data structure, for example, a DDF or a machine learning model. Theanalytics framework 230 may further receive requests to provide the name of the distributed data structure to other users or software modules. Accordingly, theanalytics framework 230 allows multiple users to refer to the same distributed data structure residing in the in-memory cluster computing engine 220. - The
analytics framework 230 supports an API that sets the name of a distributed data structure, for example, a DDF or any data set. For example, a user may invoke a function (or method) “setDDFName(df, string_name)” where “df” is a reference/pointer to the distributed data structure stored in the in-memory cluster computing engine 220 and “string_name” is an input string specified by the user for use as the name of the “df” structure. The analytics framework 230 processes the function setDDFName by assigning the name “string_name” to the structure “df”. For example, a user may execute queries to generate a data set representing flight information based on data obtained from airlines. A function/method call “setDDFName(df, “flightinfo”)” assigns the name “flightinfo” to the data set identified by df. - The
analytics framework 230 further supports an API to get a uniform resource identifier (URI) for a data structure. For example, the analytics framework 230 may receive a request to execute “getURI(df)”. The analytics framework 230 generates a URI corresponding to the data structure or data set represented by df and returns the URI to the requestor. For example, the analytics framework 230 may generate a URI “ddf://servername/flightinfo” in response to the “getURI(df)” call. The URI may be provided to an application executing on a client device 130.
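- The naming and URI calls can be used together as sketched below. Only setDDFName(df, name) and getURI(df) appear in the text; the surrounding types and the returned example value are assumptions made so the sketch compiles on its own.

    // Illustrative call sequence for naming a DDF and obtaining a URI for it.
    class NamingExample {
        interface DDF {}
        interface AnalyticsFramework {
            void setDDFName(DDF df, String name);
            String getURI(DDF df);
        }

        static String publish(AnalyticsFramework framework, DDF df) {
            framework.setDDFName(df, "flightinfo");
            String uri = framework.getURI(df);   // e.g. "ddf://servername/flightinfo"
            return uri;                          // the URI can then be sent to a collaborator
        }
    }
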
- The analytics framework 230 maintains a mapping from DDFs to URIs. If a DDF is removed from memory, the corresponding URI becomes invalid. For example, if a client application presents a document having a URI that has become invalid, the data analysis system 100 does not process queries based on that URI. The data analysis system 100 may return an error indicating that the query is directed to an invalid (or non-existent) DDF. If the data analysis system 100 loads the same set of data (as the DDF which was removed from memory) as a new DDF, the client devices request a new URI for the newly created DDF. This is so because the new DDF may have a different location within the parallel/distributed system and may be distributed differently from the previously loaded DDF even though the two DDFs store identical data. In an embodiment, the data analysis system 100 may maintain two or more copies of the same DDF. For example, two or more clients may request access to the same data set with the possibility of making modifications. In this situation, each DDF representing the same data is assigned a different URI. For example, a first DDF representing a first copy of the data is assigned a first URI and a second DDF representing a second copy of the same data is assigned a second URI distinct from the first URI. Accordingly, requests for processing received by the data analysis system 100 based on the first URI are processed using the first DDF and requests for processing received by the data analysis system 100 based on the second URI are processed using the second DDF.
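- A minimal sketch of such a URI-to-DDF mapping is shown below: a removed DDF invalidates its URI, and two copies of the same data receive distinct URIs. The registry class and its methods are assumptions for illustration, not the patented design.

    // Hypothetical URI registry: resolve() returns empty for invalid/removed DDFs.
    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    class DDFRegistry {
        private final Map<String, Object> ddfByUri = new ConcurrentHashMap<>();
        private final AtomicLong counter = new AtomicLong();

        String register(Object ddf, String name) {
            String uri = "ddf://servername/" + name + "-" + counter.incrementAndGet();
            ddfByUri.put(uri, ddf);                           // each copy of the data gets its own URI
            return uri;
        }

        Optional<Object> resolve(String uri) {
            return Optional.ofNullable(ddfByUri.get(uri));    // empty => invalid or removed DDF
        }

        void unload(String uri) {
            ddfByUri.remove(uri);                             // queries against this URI now fail
        }
    }

- The URI can be communicated between applications or client devices. For example, the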
client device 130 may communicate the URI to another client device. Alternatively, an application that has the value of the URI can send the URI to another application. For example, the URI may be communicated via email, text message, or by physically copying and pasting from one user interface to another user interface. The recipient of the URI can use the URI to locate the DDF corresponding to the URI and process it. For example, the recipient of the URI can use the URI to inspect the data of the DDF or to transform the data of the DDF. -
FIG. 4 illustrates collaboration between two client devices using a DDF, according to an embodiment of the invention. Assume that client device 130 a has access to an in-memory distributed data structure, for example, a DDF 420 stored in the in-memory cluster computing engine 220. The client device 130 a has access to the DDF 420 because an application on client device 130 a created the DDF 420. Alternatively, the client device 130 a may have received a reference to the DDF 420 from another client device or application. The distributed data framework 200 receives 425 a request to generate a URI (or a name) corresponding to the DDF 420. The distributed data framework 200 generates a URI corresponding to the DDF 420. The distributed data framework sends 435 the generated URI to the client device 130 a that requested the URI. - The
client device 130 a may send 450 the URI corresponding to theDDF 420 to anotherclient device 130 b, for example, via a communication protocol such as email. In some embodiments, the URI may be shared between two applications running on the same client device. For example, the URI may be copied from a user interface of an application and pasted into the user interface of the other application by a user. Theclient device 130 b receives the URI. Theclient device 130 b can send 440 requests to the distributeddata framework 200 using the URI to identify theDDF 420. For example, theclient device 130 b can send a request to receive a portion of data of theDDF 420. - In an embodiment, the in-memory cluster computing engine 220 stores a distributed data structure that represents a machine learning model. The distributed
data framework 200 receives a request to create a name or URI identifying the machine learning model from a client device 130. The distributed data framework 200 generates the URI or name and provides it to the client device 130. The client device receiving the URI can transmit the URI to other client devices. Any client device that receives the URI can interact with the distributed data framework 200 to interact with the machine learning model, for example, to use the machine learning model for predicting certain behavior of entities. These embodiments allow a user (or users) to implement the machine learning model and train the model. The model is stored in-memory by the big data analysis system 100 and is ready to use by other users. The access to the in-memory model is provided by generating the URI and transmitting the URI to other applications or client devices. Users accessing the other client devices or applications can start using the machine learning model stored in memory. - In some embodiments, the in-memory cluster computing engine 220 supports only immutable data sets. In other words, a user (e.g., a software module that creates/loads the data set for processing) of the data set is not allowed to modify the dataset. For example, the in-memory cluster computing engine 220 may not provide any methods/functions or commands that allow a data set to be modified. Alternatively, the in-memory cluster computing engine 220 may return an error if a user attempts to modify data of a data set. - The in-memory cluster computing engine 220 may not support mutable datasets if the in-memory cluster computing engine 220 supports a functional paradigm, i.e., a functional programming model based on functions that take a data set as input and return a new data set upon each invocation. As a result, each operation requires invocation of a function that returns a new data set. Accordingly, the in-memory cluster computing engine 220 does not support modification of states of data sets (these datasets may be referred to as stateless datasets). - The distributed
data framework 200 allows users to convert a dataset to a mutable dataset. For example, the distributed data framework 200 supports a method/function “setMutable(ddf)” that converts an immutable dataset (or DDF) input to the method/function to a mutable DDF. Subsequently, the distributed data framework 200 allows users to make modifications to the mutable DDF. For example, the distributed data framework 200 may add new rows to, delete rows from, or modify rows of the mutable DDF based on requests.
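- Hypothetical usage of this conversion is sketched below; only the setMutable(ddf) call is taken from the text, and the remaining types and methods are assumed scaffolding.

    // Illustrative conversion of an immutable DDF to a mutable one, followed by edits.
    class MutabilityExample {
        interface DDF {
            void addRow(Object... values);
            void deleteRows(String predicate);
        }
        interface DistributedDataFramework {
            void setMutable(DDF ddf);   // marks the DDF so that in-place edits are accepted
        }

        static void edit(DistributedDataFramework framework, DDF ddf) {
            framework.setMutable(ddf);              // immutable -> mutable
            ddf.addRow("2015-01-01", 42);           // subsequent modifications succeed
            ddf.deleteRows("value IS NULL");
        }
    }
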
- The distributed data framework 200 implements a data structure, for example, a table, that tracks all DDFs that are mutable. A mutable DDF can have a long life since a caller may continue to make a series of modifications to the DDF. A user of the DDF may even pass a reference to the DDF to another user, thereby allowing the other user to continue modifying the dataset. In contrast, immutable datasets have a relatively short life since the dataset cannot be modified and is used as a read-only value that is input to a function (or output as a result by a function). - The distributed
data framework 200 maintains metadata that tracks each mutable DDF. In an embodiment, the distributed data framework 200 implements certain mutable operations by invoking functions supported by the in-memory cluster computing engine 220. Accordingly, the distributed data framework 200 updates the metadata pointing at the DDF with a new DDF returned by the in-memory cluster computing engine 220 as a result of invocation of the function. Subsequent requests to process the DDF are directed to the new DDF structure pointed at by the metadata identifying the DDF. As a result, even though the underlying infrastructure of the in-memory cluster computing engine 220 supports only immutable data structures, the user of the distributed data framework 200 is able to manipulate the data structures as if they are mutable.
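- A minimal sketch of this metadata redirection, under assumed names, is shown below: each mutation asks the engine for a new immutable dataset and the mutable DDF's metadata is repointed at the result.

    // Hypothetical handle implementing copy-on-write over an immutable engine dataset.
    import java.util.function.UnaryOperator;

    class MutableDDFHandle {
        private Object currentDataset;   // immutable dataset held by the cluster engine

        MutableDDFHandle(Object initialDataset) {
            this.currentDataset = initialDataset;
        }

        // 'operation' stands for an engine function that returns a new dataset.
        synchronized void mutate(UnaryOperator<Object> operation) {
            Object newDataset = operation.apply(currentDataset);  // engine returns a new dataset
            currentDataset = newDataset;                          // metadata now points at it
        }

        synchronized Object read() {
            return currentDataset;       // later requests see the most recent version
        }
    }
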
- FIG. 5 illustrates a process for converting immutable data sets received from the in-memory cluster computing engine to mutable data sets, according to an embodiment of the invention. The distributed data framework 200 receives a request to build a DDF structure. In an embodiment, the DDF structure may be built by loading data from a set of files of the distributed file system 210, for example, by processing a set of log files of an enterprise. - The distributed
data framework 200 sends 520 a request to the in-memory cluster computing engine 220 to retrieve the data of the requested data set. In an embodiment, the in-memory cluster computing engine 220 supports only immutable data sets and does not allow or support modifications to datasets. The in-memory cluster computing engine 220 loads the requested dataset in memory. The dataset may be distributed across memory of a plurality of compute nodes 280 of the in-memory cluster computing engine 220 (the dataset is also referred to as a DDF.) - The distributed
data framework 200 marks the data set as immutable (for example, by storing a flag in metadata indicating that the dataset is immutable). This step may be performed if the default type of dataset supported by the in-memory cluster computing engine 220 is immutable. Accordingly, if the distributed data framework 200 receives a request to modify the data of the dataset, for example, by deleting existing data, adding new data, or modifying existing data, the distributed data framework 200 denies the request. In an embodiment, the dataset is represented as a DDF structure. In other embodiments, the distributed data framework 200 may mark all DDF structures as immutable when they are created. - The distributed
data framework 200 receives 530 a request to convert the dataset to a mutable dataset. In an embodiment, the request to convert the dataset may be supported as a method/function call, for example, the “setMutable” method/function. A caller may invoke the “setMutable” method/function providing a DDF structure as input. The distributed data framework 200 updates 540 the metadata structure describing the DDF to indicate that the DDF is mutable. - Subsequently, the distributed
data framework 200 receives 550 a request to modify the DDF, for example, by adding data, deleting data, or updating data. The distributed data framework 200 performs the requested modifications to the DDF. In an embodiment, the distributed data framework 200 invokes a function of the in-memory cluster computing engine 220 corresponding to the modification operation. The in-memory cluster computing engine 220 generates 560 a new dataset that is equivalent to the requested modified DDF. The distributed data framework 200 modifies the metadata describing the DDF to refer to the modified DDF instead of the original DDF. Accordingly, if a requester accesses the data of the DDF, the requester receives the data of the modified DDF. Similarly, if a requester attempts to modify the DDF again, the new modification is applied to the modified DDF as identified by the metadata. - Embodiments allow multiple collaborators to interact with a document. A collaborator can be represented as a user account of the system. Each user account may be associated with a client device, for example, a mobile device such as a mobile phone, a laptop, a notebook, or any other computing device. A collaborator may be represented as a client device. Accordingly, the shared document is shared between a plurality of client devices. - The document includes information based on one or more DDFs stored in the in-memory cluster computing engine 220, for example, a chart based on the DDF in the document. The collaborators can interact with the document in a global editing mode that causes changes to be propagated to all collaborators. For example, if a collaborator makes changes to the document or to the DDF identified by the document, all collaborators see the modified document (or the modified DDF). - In an embodiment, the document collaboration is based on a push model in which changes made to the document by any user are pushed to all collaborators. For example, assume that the document includes a chart based on a DDF stored in the in-memory cluster computing engine 220. If the distributed
data framework 200 receives requests to modify the DDF, the distributed data framework 200 performs the requested modifications to the DDF and propagates a new chart based on the modified DDF to the client devices of the various collaborators sharing the document.
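- The push model can be pictured with the following sketch; all class and method names here are assumptions made for the example, not part of the specification.

    // Illustrative push-model propagation: when a DDF changes, every collaborator
    // sharing a document that references it is sent a refreshed chart.
    import java.util.List;

    class PushPropagationExample {
        interface ClientDevice { void show(String chart); }
        interface ChartBuilder { String render(String ddfUri); }

        static void onDdfModified(String ddfUri, List<ClientDevice> collaborators,
                                  ChartBuilder charts) {
            String refreshed = charts.render(ddfUri);   // rebuild the chart from the modified DDF
            for (ClientDevice device : collaborators) {
                device.show(refreshed);                 // push the new chart to every collaborator
            }
        }
    }
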
- The distributed data framework 200 allows a collaborator X (or a set of collaborators) to switch to a local editing mode in which the changes made by the collaborator X (or any collaborator from the set) to a specified portion of the shared document are local and are not shared with the remaining collaborators. The local editing mode is also referred to herein as limited sharing mode, limited editing mode, or editing mode. For example, if the collaborator modifies the DDF, the changes based on the modifications to the DDF are visible in the document to only the collaborator X. The distributed data framework 200 does not propagate the modifications to the document (or to a portion of the document or the DDF associated with the document) to the remaining collaborators. Accordingly, the distributed data framework 200 continues propagating the original document, and any information based on the version of the DDF before collaborator X switched to local editing mode, to the remaining collaborators. In an embodiment, the remaining collaborators can modify the original document (and associated DDFs) and the distributed data framework 200 does not propagate those modifications to collaborator X. The collaborator X may share the modified document based on the local edits with a new set of collaborators. Accordingly, the new set of collaborators can continue modifying the version of the document created by collaborator X without affecting the document edited by the original set of users. - In an embodiment, the local edits to the shared document are shared between a set of collaborators. Accordingly, if any of the collaborators from the set of collaborators makes a modification to the shared document, the modifications are propagated to only the set of collaborators identified for the local editing. This allows a team of collaborators to make modifications to the shared document before making the modifications publicly available to a larger group of collaborators sharing the document. - In an embodiment, the distributed
data framework 200 receives a request that identifies a particular portion of the shared document for local editing. Furthermore, the request specifies a set of collaborators for sharing local edits to the identified portion. Accordingly, any modifications made by the collaborators of the set to the identified portion are propagated to all the collaborators of the set. However, any modifications made by any collaborator to the shared document outside the identified portion are propagated to all collaborators that share the document, independent of whether the collaborator belongs to the specified set or not.
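- A sketch of this routing rule is shown below: edits inside the locally edited portion go only to the corresponding set of collaborators, while edits elsewhere go to all collaborators sharing the document. The method and parameter names are assumptions for illustration, and the handling of editors outside the local set follows the behavior described later in this document.

    // Illustrative selection of the target collaborators for a document modification.
    import java.util.HashSet;
    import java.util.Set;

    class EditRoutingExample {
        static Set<String> targetCollaborators(boolean editInsideLocalPortion,
                                               boolean editorInLocalSet,
                                               Set<String> allCollaborators,
                                               Set<String> localEditSet) {
            if (!editInsideLocalPortion) {
                return allCollaborators;          // global portions propagate to everyone
            }
            if (editorInLocalSet) {
                return localEditSet;              // local edits stay within the local set
            }
            // Collaborators outside the local set keep editing their own copy of the portion.
            Set<String> outside = new HashSet<>(allCollaborators);
            outside.removeAll(localEditSet);
            return outside;
        }
    }
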
- FIG. 6 shows the process illustrating use of a global editing mode and a local editing mode during collaboration, according to an embodiment of the invention. The distributed data framework 200 receives 610 a request to create a document. In an embodiment, the document is associated with one or more DDFs and includes information associated with the DDFs. For example, the document may include a chart based on information available in the DDF. The distributed data framework 200 updates the chart based on changes to the data of the DDF. For example, if the distributed data framework 200 receives an update to the data of the DDF (including requests to delete, add, or modify data), the distributed data framework 200 modifies the chart to display the modified information. - The
collaboration module 370 receives 620 a request to share the document with a first plurality of collaborators. Thecollaboration module 370 receives 630 requests to interact with the shared document from the first plurality of collaborators. The requests may include requests to edit the document, requests to make modifications to the data of the DDF, and so on. - The
collaboration module 370 further receives a request from a particular collaborator (say collaborator X) to perform local editing on a selected portion of the shared document (or the entire document). The request identifies a set of collaborators that share the local edits to the selected portion of the document. Thecollaboration module 370 may create 650 a copy of data related to the identified portion of the shared document for collaborator X to perform local editing. The copy of the portion of the shared document is called the locally accessible document and the original shared document (which can be edited by all collaborators) is called the globally accessible document. - In an embodiment, the
collaboration module 370 shares the associated DDFs between the locally accessible document and the globally accessible document when the locally accessible document is created. However, if the distributeddata framework 200 receives a request from any of the collaborators to modify an underlying DDF, the distributeddata framework 200 makes a copy of the DDF and modifies the copy of the DDF. One of the documents subsequently is associated with the modified DDF and the other document is associated with the original DDF. - In an embodiment, the
collaboration module 370 may obtain a subset of data of the DDF that provides data to the chart displayed on the locally edited document. For example, the chart may display data for a small time period out of a longer period of data stored in the DDF. Alternatively, the chart may display partially aggregated data. For example, the DDF may store data at an interval of seconds and the chart may display data aggregated at intervals of days. Accordingly, the distributed data framework 200 determines the aggregated data, which may be much smaller than the total data of the DDF and can be stored on the client device instead of the in-memory cluster computing engine. - In an embodiment, the distributed
data framework 200 checks if the size of the aggregated data is below a threshold value. If the size of the aggregated data is below a threshold value, the distributeddata framework 200 sends the data to theclient device 130 for further processing. The client device can perform certain operations based on the locally stored data, for example, further aggregation based on the data. Processing the locally stored data allows the client device to efficiently process user requests. For example, if the user wants to view a smaller slice of data than that shown on the chart, theclient device 130 can use the locally stored data to respond to the query. Accordingly, the chart displayed on the client device is updated without updating the charts displayed on the remaining client devices that share the original document. - Similarly, if the client device requests to further aggregate the data, for example, by requesting aggregates at the intervals of weeks or months, the request can be processed using the locally stored data. In an embodiment, the data set associated with the chart (for example, the partially aggregated data) is stored on another system distinct from the distributed
data framework 200 and the client device. The other system allows large data sets to be loaded in memory that exceed the capacity of theclient device 130. - The
collaboration module 370 determines which copy of the document is associated with the modified DDF and which copy of the document is associated with the original DDF. If the request to modify the DDF is received from the locally accessible document, the distributed data framework 200 associates the locally accessible document with the modified DDF and the globally accessible document with the original DDF. Alternatively, if the request to modify the DDF is received from the globally accessible document, the distributed data framework 200 associates the globally accessible document with the modified DDF and the locally accessible document with the original DDF. - The
collaboration module 370 receives 660 a request to share the locally accessible document with other collaborators (referred to here as a second plurality of collaborators). The second plurality of collaborators may overlap with the first plurality of collaborators. Thecollaboration module 370 provides access to the document to the second plurality of collaborators. The distributeddata framework 200 receives request to modify the locally accessible document from collaborators belonging to the second plurality of collaborators. - The ability to locally edit a portion of the document allows one or more collaborators to modify the document before making the modifications publicly available to all collaborators. For example, a portion of the shared document may be associated with a query that processes an in-memory distributed data structure. The portion of the document may show results of the query as a chart of in text form or both. The system allows one or more collaborators to develop and test the query in a local edit mode to make sure the chart presented is accurate. Once the collaborators have fully developed and tested the query and the chart, the system receives a request from the collaborators to share the identified portion with all users that share the document (not just the developers and testers of the query and the chart.)
- In an embodiment, the system determines a target set of collaborators that receive each modification made to the shared document. The target set of collaborators is determined based on whether the modification is made to the portion identified for local editing or another portion. Accordingly, if the system receives a request to modify a portion of the document that is distinct from the portion identified for local editing, the system propagates the changes to the all collaborators sharing the document. This is so because by default all portions of the document are marked for global editing by all collaborators. However, if the system receives a request to modify the portion identified for local editing and the request is received from a collaborator from the set of collaborators S allowed to perform local editing on that portion, the system propagates the modification to all collaborators from the set of collaborators. In an embodiment, the collaborators not belonging to the set S of collaborators are allowed to modify the portion identified for the local editing. However, the system propagates these modifications only to collaborators that do not belong to the set S of collaborators allowed to perform local editing to the identified portion. In an embodiment, the system maintains a separate copy of the identified portion. Accordingly, the modifications made by users of the set S are made to one copy of the document (and propagated to the collaborators belonging to S) and the modifications made by users outside set S are made to another copy (and propagated to the collaborators outside S).
- In an embodiment, the portion of the shared document identified for local editing includes a query Q1 processing a DDF associated with the shared document. Assume that a set S1 of collaborators are allowed to perform local edits to the document. The selected portion of the document may include a chart of the document associated with the query or result of the query in text form. The local edits made by collaborators of set S may modify Q1 to become a query Q2. Accordingly, a chart based on query Q2 is propagated to collaborators belonging to set S1 and a chart based on the original query Q1 is propagated to the remaining collaborators (outside set S1) that share the document. If the data of the DDF is modified, the queries Q1 and Q2 are reevaluated to build a new corresponding chart (or textual representation of the result). The chart or results based on query Q2 are propagated to the collaborators of set S1 and the charts or results based on query Q1 are propagated to the remaining collaborators (outside the set S1.)
- The system allows various portions of the same shared document to be locally edited by different collaborators or different sets of collaborators. For example, the system may receive a first request to allow local editing of a first portion of the shared document by a first set of collaborators. Subsequently the system may receive a second request to allow local editing of a second portion of the shared document by a second set of collaborators. The first and second set of collaborators may overlap or may be distinct.
- A shared document includes text portions, result portions, and code blocks. A text portion is received from a user and shared with other users. The shared document may be associated with one or more DDFs stored across a plurality of compute nodes. A code block may process data of a DDF. The code block may include queries that are executed. The result of execution of a query is displayed on the result portions of the document, for example, as charts. A code block is also referred to herein as a cell.
- Embodiments allow references to DDFs to be included in documents. Users interacting with the big
data analysis system 100 can share documents and interact with the same shared document viadifferent client devices 130. If two or more documents share a DDF, changes made to the DDF via a document result in data displayed on the other documents being modified. For example, documents D1 and D2 may be distinct documents that have references to a DDF df. Document D1 may be shared by a set of users S1 and document D2 may be shared by a set of users S2 where S1 and S2 may be distinct sets of users with no overlap. However if a user U1 from set S1 executes code via document D1 that modifies the DDF df, a user U2 from set S2 can view the modifications to the DDF df even though U2 is not sharing the document D1 with user U1. For example, the code modifications made by user U1 via document D1 may cause a chart or a result set displayed on document D2 to be updated as a result of modifications made to DDF df. -
FIG. 7 shows collaborative editing of documents including text, code, and charts based on distributed data structures, according to an embodiment. As shown in FIG. 7, the distributed data framework 200 stores two shared documents 710 a and 710 b. - Each shared document 710 is associated with a set 730 of users 720 interacting with the shared document 710 via
client devices 130. For example, one set of users 720 interacts with the shared document 710 a via their client devices 130 and another set of users interacts with the shared document 710 b via their client devices 130, as shown in FIG. 7. The sets 730 of users sharing a document may be distinct from a set of users sharing another document (i.e., have no overlap between the sets), or the sets of users sharing two distinct shared documents 710 may have an overlap (with one or more users having access to both the shared documents). - The shared document 710 may include text, code, and results based on code. The results based on code may comprise results presented as text or results presented as charts, for example, bar charts, scatter plots, pie charts, and so on. In an embodiment, the results presented in a document are associated with an in-memory distributed data frame structure 420 (referred to as the DDF) stored in the in-memory cluster computing engine 220. For example, the document may specify a query based on the DDF such that the results/chart displayed in the document are based on the result of executing the query against the DDF. The code specified in the document may include a query for which the results are shown in the document. If a user updates the query of the shared document, each of the users that share the document receives updated results displayed in the shared document. - The code specified in the document may include statements that modify the DDF, for example, by deleting, adding, or updating rows, columns, or any other portion of data of the DDF. If a user modifies the DDF, the results displayed in the document may get updated based on the modified DDF. For example, if certain rows of the DDF are deleted, any aggregate results displayed in the document or charts based on the DDF may get updated to reflect the deletion of the rows. Furthermore, if there are other documents that share the same DDF (for example, by including a URI to the DDF), the results/chart displayed in those documents may be updated to reflect the modifications to the shared DDF. - The shared documents may represent articles, presentations, reports, and so on. The collaborative editing allows users to include charts and results of large distributed data structures in documents. For example, a team of developers may build an in-memory distributed data structure and share the URI of the in-memory distributed data structure with an executive for presentation to an audience. The ability to share the in-memory distributed data structure makes it possible to update the data structure to reflect the latest information. This is distinct from a presentation with static information that does not change no matter when the presentation is given. In contrast, the sharing of documents with code and results based on executable code allows presentation of the latest results, which may get updated as the executive makes the presentation. - As shown in
FIG. 7 , user U1 may update text of the shareddocument 710 a (the text referring to static strings of text that are distinct from executable code). As a result of editing of the text by user U1, the updated text of the document is sent by the distributeddata framework 200 to all users (e.g., users U1, U2) that share thedocument 710 a. However, the result of editing the text does not affect other documents that are distinct from the updated document, for example,document 710 b and therefore, the users of set 730 b are not affected by the updating of the text ofdocument 710 a. - Embodiments further allow sharing of executable code and results based on sharing of code with other users. As shown in
FIG. 7, user U1 may update a query included in shared document 710 a. The modification of the query may cause changes to results or charts presented in the document. As a result, the distributed data framework 200 executes the updated query using the DDF referenced in the document 710 a and sends the updated results or chart to the remaining users of the set 730 a (e.g., user U2). If the query is simply reading the data of the DDF, the result of modification of the query is presented only to the users of the set 730 a that share the modified document (and is not shown to users of set 730 b that share the document 710 b). - Embodiments further allow a user to execute code that modifies the DDF referenced by the shared
document 710 a. As a result, the data of the DDF may be changed (e.g., deleted, updated, or new data added.) The modification of DDF may cause results of queries of the document to be updated if the queries use the DDF. Accordingly, the distributeddata framework 200 identifies all queries of thedocument 710 a that use the DDF and updates the results of the queries displayed in the document. The updated document is sent for presentation to the users of the set 730 a. - Furthermore, the distributed
data framework 200 identifies all other documents that include a reference to the DDF. The distributed data framework 200 identifies queries of all the identified documents and updates the results/charts of the queries displayed in the respective documents if necessary. The distributed data framework 200 sends the updated documents for presentation to all users that share the respective documents. For example, the DDF 420 may be updated based on execution of code of shared document 710 a. The distributed data framework 200 updates results of queries based on the DDF 420 in document 710 a as well as document 710 b. The updated document 710 a is sent to users of the set 730 a and the updated document 710 b is sent to the users of the set 730 b.
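- This update fan-out can be sketched as follows: after a DDF is modified, every document that references it re-runs its queries and is re-sent to its users. The Document and Query types and their method names are assumptions made for the example.

    // Illustrative refresh of all documents that reference a modified DDF.
    import java.util.ArrayList;
    import java.util.List;

    class DocumentRefreshExample {
        interface Query { String execute(String ddfUri); }
        interface Document {
            boolean references(String ddfUri);
            List<Query> queries();
            void updateResults(List<String> results);
            void sendToSharingUsers();
        }

        static void onDdfModified(String ddfUri, List<Document> allDocuments) {
            for (Document doc : allDocuments) {
                if (!doc.references(ddfUri)) {
                    continue;                           // documents that do not use the DDF are untouched
                }
                List<String> results = new ArrayList<>();
                for (Query query : doc.queries()) {
                    results.add(query.execute(ddfUri)); // re-evaluate each query on the modified data
                }
                doc.updateResults(results);
                doc.sendToSharingUsers();               // refresh the document for every sharing user
            }
        }
    }
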
- In an embodiment, a user may request a document or a portion of a document to be locally edited (and not shared). In this embodiment, the distributed data framework 200 makes a copy of the DDF 420 or an intermediate result set based on the DDF 420. In some embodiments, the distributed data framework 200 simply notes that the document is being locally edited and continues to share the DDF 420 with other documents until the DDF 420 is edited. If the DDF 420 is edited, the distributed data framework 200 makes a copy of the DDF for the document that is being locally edited. Note that the document being locally edited may be shared by a set of users even though it does not share the DDF referenced in the document with other documents. - In an embodiment, the portion of the document being locally edited is based on an intermediate result derived from the
DDF 420. Accordingly, the distributed data framework 200 stores the intermediate result in either the in-memory cluster computing engine 220 (if the intermediate result is large) or else in a separate server (that may not be distributed). In an embodiment, the intermediate result is stored in the client device 130. Certain operations can be performed based on the data of the intermediate result, for example, aggregation of the intermediate results or changing the format of the chart (so long as the new format does not require additional data from the DDF). For example, a bar chart may be changed to a line chart based on the intermediate result. However, changing a bar chart to a scatter plot may require accessing the DDF to obtain new sample data (for example, if the user requests to display a scatter plot based on a subset of data of the bar chart). -
FIG. 8 shows an example of shared document with code and results based on code, according to an embodiment. As shown inFIG. 8 , thedocument 800 includesexecutable code 810. Thedocument 800 also includes results of execution of the executable code. The example executable code shown inFIG. 8 is “m<-adataalm(total_amount˜trip_distance+payment_type, data=ddf)”. Theexecutable code 810 is based on a DDF (identified as “ddf”) used in a machine learning model (identified as “lm”). This example code builds a linear model of total_amount as function of trip_distance and payment_type. Theresult 820 of execution of the linear model using the DDF ddf is shown in text form inFIG. 8 . In other embodiments, the result of execution of a command can be shown in a chart form. - The user can modify the executable code, thereby causing updated results to be presented to all users sharing the document.
FIG. 9 shows the document ofFIG. 8 with modifications made to the code, thereby causing the results to be updated, according to an embodiment. The model shown inFIG. 8 is updated to change the model. The distributeddata framework 200 executes the modifiedcode 910 to obtain a new set of results 920. The updated results are presented to all users that share thedocument 800. If a modification is made to the data of the DDF ddf, the results of all documents that include executable code based on the DDF ddf gets modified. - The
analytics framework 230 generates reports, presentations, or dashboards based on the document comprising the text, code, and results.FIG. 10 shows an example of a dashboard generated from a document based on distributed data structures, according to an embodiment. Theanalytics framework 230 receives information identifying portions of the shared document that should not be displayed in the generated report/dashboard. For example, a user may indicate that portions of the document including executable code should not be displayed in the report/presentation/dashboard. Theanalytics framework 230 generates the requested report/presentation/dashboard by rendering the information identified for inclusion and excluding the information identified for exclusion. - In an embodiment, the
analytics framework 230 receives a request to convert the shared document into a periodic report. The analytics framework 230 receives a schedule for generating the periodic report. The analytics framework 230 executes the code blocks of the shared document in accordance with the schedule. Accordingly, the analytics framework 230 updates the result portions of the shared document based on the latest execution of the code block. For example, a code block may include a query based on a DDF. The analytics framework 230 updates the result portion corresponding to the code block based on the latest data of the DDF. The analytics framework 230 shares the updated document with users that have access to the shared document. These embodiments allow the analytics framework 230 to provide a periodic report to users. For example, the shared document may include a reference to a DDF based on an airlines database and the analytics framework 230 provides weekly or monthly reports to the users sharing the document. Similarly, the analytics framework 230 can convert the shared document into a slide show or a dashboard based on a user request.
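- Scheduled re-execution of this kind could be wired up as in the sketch below, using a standard ScheduledExecutorService; the SharedDocument abstraction and its methods are assumptions made for the example.

    // Illustrative weekly re-execution of a shared document's code blocks.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class PeriodicReportExample {
        interface SharedDocument {
            void executeCodeBlocks();     // re-run queries against the latest DDF data
            void shareUpdatedResults();   // push refreshed result portions to users
        }

        static ScheduledExecutorService scheduleWeekly(SharedDocument report) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                report.executeCodeBlocks();
                report.shareUpdatedResults();
            }, 0, 7, TimeUnit.DAYS);      // e.g. a weekly report
            return scheduler;
        }
    }
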
- In an embodiment, the analytics framework 230 receives a request to generate a periodic report, a slideshow, or a dashboard and generates the requested document based on the shared document rather than converting the shared document itself. The analytics framework 230 maintains the periodic reports, slideshows, or dashboards as shared documents that can be further edited and shared with other users. Accordingly, various operations disclosed herein apply to these generated/transformed documents. -
FIG. 10 shows various charts 1010 generated by the analytics framework 230. The charts 1010 refer to DDFs stored in the in-memory cluster computing engine 220. Accordingly, the distributed data framework 200 may automatically update the data of the charts 1010 based on updates to the data of the associated DDF. - In an embodiment, the
analytics framework 230 identifies all charts in an input document. The analytics framework 230 determines a layout for all the charts in a grid, for example, a 3-column grid. The analytics framework 230 may receive (from the user) a selection of a template specifying the layout of the dashboard. The analytics framework 230 receives instructions from users specifying modifications to the layout. For example, the big data analysis system 100 allows users to drag and drop charts that snap to the grid and to resize charts within the grid. The big data analysis system 100 also allows users to set dashboards to automatically be refreshed at a specified time interval, e.g., 30 seconds, 1 minute, etc. The generated dashboard includes instructions to execute any queries associated with each chart at the specified time interval by sending the queries to the distributed data framework 200 for execution. - Various portions of a document that is shared can be edited by all users that share the document. In an embodiment, if the system receives a user request for execution of a code block (or cell), the system shows an indication that the code block in the shared document is being executed. Accordingly, the system shows a change in the status of the code block. The status of the code block may be indicated based on a color of the text or background of the code block, the font of the code block, or by any visual mechanism, for example, by showing the code block as flashing. In some embodiments, the status of the code block may be shown by a widget, for example, an image or icon associated with the code block. Accordingly, a status change of the code block causes a visual change in the icon or the widget. The changed status of the code block is synchronized across all client applications or client devices that share the document. Accordingly, the system shows the status of the code block as executing on any client device that is displaying a portion of the shared document including the code block that is executing. - In an embodiment, if the system receives a request to execute a code block of the shared document, the system locks the code block of the document, thereby preventing any users from editing the code block. The system also prevents other users from executing the code block. Accordingly, the system does not allow any edits to be performed on the executing code block from any client device that is displaying a portion of the shared document including the code block. Users are allowed to modify other portions of the document, for example, text portions or other code blocks. Nor does the system allow the code block to be executed again from any client device until the current execution is complete. In other words, the system allows a single execution of a code block at a time. Once the execution of the code block is complete, the system allows users to edit the code block or execute it again. - A user may close the client application (e.g., a browser or a user-agent software) used to view/edit the shared document on a
client device 130. If the client closes the client application while one or more code blocks are executing, the system continues executing the code blocks and tracks the status of the code blocks. If a user that closed the client application reopens the client application to view the document, the system receives a login request from the user. In response to the request to view the shared document, the system provides the latest status of the code blocks. If a code block is executing, the system provides information indicating that the code block is executing and the user is still not allowed to edit or execute the code block. If the code block has completed execution, the system updates the result portions of the document, sends the updated document to the client device of the user, and allows the user to edit or execute the code block.
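- A sketch of this single-execution rule is shown below: a code block can be locked by at most one execution at a time, and its status persists even when clients disconnect. The class and method names are assumptions, not the patented implementation.

    // Illustrative per-code-block lock with a queryable execution status.
    import java.util.concurrent.atomic.AtomicBoolean;

    class CodeBlockLockExample {
        private final AtomicBoolean executing = new AtomicBoolean(false);

        // Returns true if this request acquired the right to execute the block.
        boolean tryStartExecution() {
            return executing.compareAndSet(false, true);  // a second caller is rejected
        }

        boolean isExecuting() {
            return executing.get();      // reported to every client viewing the document
        }

        void finishExecution() {
            executing.set(false);        // editing and re-execution are allowed again
        }
    }

- Embodiments allow shared documents that interact with the big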
data analysis system 100 using multiple languages for processing data of DDFs. A shared document includes text portions, result portions, and code blocks. A text portion is received from a user and shared with other users. The shared document may be associated with one or more DDFs stored across a plurality of compute nodes. A code block may process data of a DDF. - A user can interact with the DDF by processing a query and receiving results of the query. The results of the query are displayed in the document and may be shared with other users. A user can also execute a statement via the document that modifies the DDF. The distributed
data framework 200 receives statements sent by a user via a document and processes the statements (a statement can be a command or a query). - A code block may include instructions that modify the DDF. A code block may include queries that are executed by the data analysis system. The result of execution of the queries is presented in result portions of the shared document. A result portion may present results in text form or graphical form, for example, as charts. Modification of a query by a user in a code block may result in the result portion of all users sharing the document getting updated.
- The big
data analysis system 100 allows users to send instructions for processing data of a DDF using different languages. For example, the bigdata analysis system 100 receives a first set of instructions in a first language via a document and subsequently a second set of instructions in a second language provided via the same document (or via a different document). Both the first and second set of instructions may process data of the same DDF. The ability to collaborate via multiple languages allows different users to use the language of their choice while collaborating. Furthermore, certain features may be supported by one language and not another. Accordingly, a user can use a first language for providing instructions and operations supported by that language and switch to a second language to use operations supported by the second language (and not supported by the first language). In an embodiment, the bigdata analysis system 100 allows users to specify code cells or code blocks in a document. Each code block may be associated with a specific language. This allows a user to specify the language for a set of instructions. In an embodiment, a shared document uses a primary language for processing the DDFs. However, code blocks of one or more secondary languages may be included. -
FIG. 11 shows the architecture of the in-memory cluster computing engine illustrating how a distributed data structure (DDF) is allocated to various compute nodes, according to an embodiment. As shown inFIG. 11 , the in-memory cluster computing engine 220 comprises multiple compute nodes 280. A DDF is distributed across multiple compute nodes 280. Each compute node 280 is allocated a portion of the data of the DDF, referred to as a DDF segment 1110. When the distributeddata framework 200 receives a request to process data of a DDF, the distributeddata framework 200 sends a corresponding request to each compute node 280 storing a DDF segment 1110 to process data of the DDF segment 1110. As shown inFIG. 11 , each compute node 280 includes aprimary runtime 1120 that stores the DDF segment 1110. The DDF segment 1110 has a data structure based on theprimary runtime 1120. For example, the data structure of the DDF segment 1110 conforms to the primary language of the distributed data framework and can be processed by the primary runtime executing instructions of the primary language. - The
primary runtime 1120 is capable of processing instructions in a primary language of operation for the distributeddata framework 100. Accordingly, if a user provides a set of instructions using the primary language, the distributeddata framework 100 provides corresponding instructions to the primary runtime for execution. For example, theprimary runtime 1120 may be a virtual machine of a language, for example, a JAVA virtual machine for processing instructions received in the programming language JAVA. Alternatively, theprimary runtime 1120 may support other programming languages such as PYTHON, R language, or any proprietary languages. -
FIG. 12 shows the architecture of the in-memory cluster computing engine illustrating how multiple runtimes are used to process instructions provided in multiple languages, according to an embodiment. The distributeddata framework 100 may receive instructions in a language different from the primary language. For example, if the primary language for interacting with the distributeddata framework 100 is JAVA, a user may provide a statement in the R language. - In an embodiment, users can interact with the distributed
data framework 100 using a set of language agnostic APIs supported by the distributeddata framework 100. The language agnostic APIs allow users to provide the required parameters and identify a method/function to be invoked using the primary language. The distributeddata framework 100 receives the parameters and the method/function identifier and provides these to theprimary runtime 1120. Theprimary runtime 1120 invokes the appropriate method/function using the provided parameter values. Theprimary runtime 1120 provides the results by executing the method/function. The distributeddata framework 100 provides the results to the caller for display via the document used to send the request. - The distributed
data framework 100 may receive instructions in a language other than the primary language of the distributed data framework 100 (referred to as a secondary language). For example, the distributeddata framework 100 may receive a request to process a function that is available in the secondary language but not in the primary language. For example, the R language supports several functions commonly used by data scientists that may not be supported by JAVA (or not available in the set of libraries accessible to theprimary runtime 1120. - The in-memory cluster computing engine 220 starts a
secondary runtime 1220 that is configured to execute instructions provided in the secondary language. Thesecondary runtime 1220 is started on each compute node 280 that has a DDF segment 1110 for the DDF being processed. Each compute node 280 transform the data structure representing the DDF segment 1110 conforming to the primary language to a data structure representing the DDF segment 1210 conforming to the secondary language. - For example, if the primary runtime is a JAVA virtual machine and the secondary runtime is a R runtime, the compute node transforms a DDF segment represented as a list of byte buffers (representing a TablePartition structure conforming to the JAVA language representation) to a list of vectors in R (representing a DataFrame structure of R language). Furthermore, the compute node performs appropriate data type conversions, e.g. the compute node converts a TablePartition Columniterator of Integer to an R integer vector, Java Boolean to R logical vector, and so on. Furthermore, the compute node encodes any special values based on the target runtime, for example, the compute node converts floating point NaN (not a number special value) to R's NA value (not-available value) while converting to an R representation, or to Java null pointers while converting to a Java representation. If the secondary runtime is based on a Python, the compute node converts the DDF segment to a DataFrame representation of Python language.
- In an embodiment, the primary runtime 1120 (of each compute node having a DDF segment of the DDF being processed) executes instructions that transform the DDF segment 1110 representation (conforming to the primary language) to a DDF segment representation (conforming to the secondary language). The
primary runtime 1120 uses a protocol to communicate the transformed DDF segment representation to the secondary runtime 1220. For example, the primary runtime 1120 may open a pipe (or socket) to communicate with the process of the secondary runtime 1220. The transformed DDF segment representation is stored in the secondary runtime 1220 as DDF segment 1210. The secondary runtime 1220 performs the processing based on the DDF segment 1210 by executing the received instructions in the secondary language.
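- One plausible realization of this hand-off is sketched below, assuming the transformed segment is JAVA-serializable and that the secondary runtime's process listens on a known local port; the class name, port handling, and serialization format are assumptions of the sketch, not details of the described protocol.

```java
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;

// Hypothetical sketch of the primary-to-secondary hand-off over a socket.
// The secondary runtime's process is assumed to listen on localhost:port and
// to deserialize the transformed DDF segment on its side of the connection.
public class SegmentHandoff {

    public static void send(Serializable transformedSegment, int port) throws Exception {
        try (Socket socket = new Socket("localhost", port);
             ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream())) {
            out.writeObject(transformedSegment);
            out.flush();
        }
    }
}
```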
- The processing performed by the secondary runtime 1220 may result in the generation of a new DDF (that is distributed as DDF segments across the compute nodes). Accordingly, each secondary runtime 1220 instance stores a DDF segment corresponding to the generated DDF. The generated DDF segment stored in the secondary runtime 1220 conforms to the secondary language. The secondary runtime 1220 transforms the generated DDF segment to a transformed generated DDF segment that conforms to the primary language. The secondary runtime 1220 sends the transformed generated DDF segment to the primary runtime 1120. The primary runtime 1120 stores the transformed generated DDF segment for processing instructions received via the document in the primary language. - Alternatively, the processing performed by the
secondary runtime 1220 may result in modifications to the stored DDF segment 1210. The modified DDF segment conforms to the secondary language. The secondary runtime 1220 transforms the modified DDF segment to a transformed modified DDF segment that conforms to the primary language. The secondary runtime 1220 sends the transformed modified DDF segment to the primary runtime 1120. The primary runtime 1120 stores the transformed modified DDF segment for processing instructions received via the document in the primary language. This mechanism allows the distributed data framework 200 to process instructions for the DDF that are received in languages other than the primary language of the distributed data framework 200. Accordingly, embodiments allow the DDF to be mutated using a secondary language. The distributed data framework 200 allows further processing to be performed using the primary language. Accordingly, a user can mix instructions for processing a DDF in different languages in the same document. - In an embodiment, the document for processing the DDF in multiple languages is shared, thereby allowing different users to provide instructions in different languages. In another embodiment, the same DDF is shared between different documents. The DDF may be processed using instructions in different languages received from different documents.
Accordingly, the distributed data framework 200 may modify a DDF based on instructions in one language and then receive queries (or statements to further modify the DDF) in a different language. Embodiments can support multiple secondary languages by creating multiple secondary runtimes, one for processing instructions of each type of secondary language.
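- As a hedged illustration of how code blocks from a shared document might be routed to the runtime registered for their language, the sketch below maintains a per-language registry; LanguageRuntime is a hypothetical interface introduced for this example, not an API defined by the framework.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of routing document code blocks to per-language runtimes.
interface LanguageRuntime {
    Object execute(String code);
}

public class RuntimeRegistry {
    private final Map<String, LanguageRuntime> runtimes = new HashMap<>();

    /** Registers a runtime for a language, e.g. "java" for the primary runtime, "r" for a secondary one. */
    public void register(String language, LanguageRuntime runtime) {
        runtimes.put(language.toLowerCase(), runtime);
    }

    /** Routes a code block from the shared document to the runtime matching its declared language. */
    public Object run(String language, String code) {
        LanguageRuntime runtime = runtimes.get(language.toLowerCase());
        if (runtime == null) {
            throw new IllegalArgumentException("No runtime registered for language: " + language);
        }
        return runtime.execute(code);
    }
}
```

Registering the primary runtime under its language name and one secondary runtime per additional language mirrors the one-runtime-per-secondary-language arrangement described above.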
- FIG. 13 shows an interaction diagram illustrating the processing of DDFs based on instructions received in multiple languages, according to an embodiment. The distributed data framework 200 receives instructions in different languages from documents edited by users via client devices 130. The instructions are received by the in-memory cluster computing engine 220 and sent to each compute node that stores a DDF segment of the DDF being processed by the instructions. In other embodiments, there may be different components/modules involved in the processing (different from those shown in FIG. 13). Also, there is a primary runtime 1120 and a secondary runtime 1220 on each compute node on which a DDF segment of the DDF being processed is stored. - The
client device 130 sends 1310 instructions in the primary language to the primary runtime 1120 of each compute node storing a DDF segment 1110. The primary runtime 1120 receives the instructions in the primary language from the client device 130 and processes 1315 them using the DDF segment. The primary runtime 1120 sends 1320 the results back to the client device 130. Note that the results may be sent via different software modules; for example, the primary runtime 1120 may send the results to the in-memory cluster computing engine 220, the in-memory cluster computing engine 220 may send the results to the analytics framework 230, which in turn may send the results to the client device 130. For simplicity, the client device 130 is shown interacting with the primary runtime 1120. The processing 1315 of the instructions may cause the DDF to mutate such that subsequent instructions process the mutated DDF. - The
client device 130 subsequently sends 1325 instructions in the secondary language. For example, the instructions may include a call to a built-in function that is implemented in the secondary language and not in the primary language. The primary runtime 1120 transforms 1330 the DDF segment stored in the compute node of the primary runtime 1120 to a transformed DDF segment that conforms to the secondary language. The primary runtime 1120 sends 1335 the transformed DDF segment to the secondary runtime 1220. - The
secondary runtime 1220 processes 1340 the instructions in the secondary language using the transformed DDF segment. The processing 1340 may generate a result DDF. The result DDF may be a new DDF segment generated by processing 1340 the instructions. Alternatively, the result DDF segment may be a mutated form of the input DDF segment. - The
secondary runtime 1220 transforms the result DDF to a format that conforms to the primary language. The secondary runtime 1220 sends 1350 the transformed result DDF to the primary runtime 1120. The primary runtime 1120 stores the transformed result for further processing, for example, if subsequent instructions based on the result DDF are received. The primary runtime 1120 sends 1335 any results based on the processing 1340 to the client device (for example, any result code, aggregate values, and so on). - As shown in
FIG. 13, the client device 130 sends 1360 further instructions in the primary language for processing using the result DDF. The primary runtime 1120 processes 1365 the received instructions using the result DDF. The primary runtime 1120 sends 1370 any results based on the processing 1365 back to the client device 130. - The distributed
data framework 200 runtime automatically selects the best representation of data for in-memory storage and algorithm execution without user involvement. By default, a compressed columnar data format is used that is optimized for analytic queries and univariate statistical analysis. When a machine learning algorithm is invoked, the distributed data framework 200 performs a conversion that is optimized for that algorithm; e.g., for a linear regression command, the distributed data framework 200 performs a conversion to extract values from the selected columns and build a matrix representation. The distributed data framework 200 caches the matrix representation in memory for the iterative machine learning process. The distributed data framework 200 deletes the matrix representation from the cache (i.e., uncaches it) when the algorithm is finished.
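- The representation switching and caching described above might be organized as in the following sketch, which builds a row-major matrix from selected columns for an iterative algorithm, caches it by a DDF identifier, and uncaches it when the algorithm finishes; all names and the in-memory layout are assumptions of this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of switching from a columnar layout to a cached matrix
// representation for an iterative machine-learning algorithm.
public class RepresentationManager {

    private final Map<String, double[][]> matrixCache = new HashMap<>();

    /** Builds (or reuses) a row-major matrix from the selected columns of a DDF. */
    public double[][] matrixFor(String ddfId, double[][] selectedColumns) {
        return matrixCache.computeIfAbsent(ddfId, id -> toRowMajor(selectedColumns));
    }

    /** Uncaches the matrix once the iterative algorithm is finished. */
    public void uncache(String ddfId) {
        matrixCache.remove(ddfId);
    }

    // Assumes at least one non-empty column; a real implementation would validate inputs.
    private static double[][] toRowMajor(double[][] columns) {
        int rows = columns[0].length;
        int cols = columns.length;
        double[][] matrix = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < rows; r++) {
                matrix[r][c] = columns[c][r];
            }
        }
        return matrix;
    }
}
```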
- The distributed data framework 200 provides an extensible framework for providing support for different programming languages. The distributed data framework 200 receives from a user software modules for converting data values from the format of one language to the format of a new language. The distributed data framework 200 further receives code for a runtime of the new language. The distributed data framework 200 then allows code blocks to be specified using the new language. As a result, the distributed data framework 200 can be easily extended with support for new languages without requiring modifications to the code for existing languages. - It is to be understood that the Figures and descriptions of the present invention have been simplified to illustrate elements that are relevant for a clear understanding of the present invention, while eliminating, for the purpose of clarity, many other elements found in a typical distributed system. Those of ordinary skill in the art may recognize that other elements and/or steps are desirable and/or required in implementing the embodiments. However, because such elements and steps are well known in the art, and because they do not facilitate a better understanding of the embodiments, a discussion of such elements and steps is not provided herein. The disclosure herein is directed to all such variations and modifications to such elements and methods known to those skilled in the art.
- Some portions of the above description describe the embodiments in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
- As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
- Some embodiments may be described using the expression “coupled” and “connected” along with their derivatives. It should be understood that these terms are not intended as synonyms for each other. For example, some embodiments may be described using the term “connected” to indicate that two or more elements are in direct physical or electrical contact with each other. In another example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.
- As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
- In addition, the terms "a" or "an" are employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.
- Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for providing multi-language support for interfacing with distributed data through the disclosed principles herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various modifications, changes and variations, which will be apparent to those skilled in the art, may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/268,503 US20190228002A1 (en) | 2014-12-01 | 2019-02-06 | Multi-Language Support for Interfacing with Distributed Data |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201462086158P | 2014-12-01 | 2014-12-01 | |
US14/814,457 US10229148B1 (en) | 2014-12-01 | 2015-07-30 | Multi-language support for interfacing with distributed data |
US16/268,503 US20190228002A1 (en) | 2014-12-01 | 2019-02-06 | Multi-Language Support for Interfacing with Distributed Data |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/814,457 Continuation US10229148B1 (en) | 2014-12-01 | 2015-07-30 | Multi-language support for interfacing with distributed data |
Publications (1)
Publication Number | Publication Date |
---|---|
US20190228002A1 true US20190228002A1 (en) | 2019-07-25 |
Family
ID=59034123
Family Applications (5)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/814,454 Active 2035-12-21 US9686086B1 (en) | 2014-12-01 | 2015-07-30 | Distributed data framework for data analytics |
US14/814,456 Expired - Fee Related US10185930B1 (en) | 2014-12-01 | 2015-07-30 | Collaboration using shared documents for processing distributed data |
US14/814,457 Expired - Fee Related US10229148B1 (en) | 2014-12-01 | 2015-07-30 | Multi-language support for interfacing with distributed data |
US15/628,200 Expired - Fee Related US10110390B1 (en) | 2014-12-01 | 2017-06-20 | Collaboration using shared documents for processing distributed data |
US16/268,503 Abandoned US20190228002A1 (en) | 2014-12-01 | 2019-02-06 | Multi-Language Support for Interfacing with Distributed Data |
Family Applications Before (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/814,454 Active 2035-12-21 US9686086B1 (en) | 2014-12-01 | 2015-07-30 | Distributed data framework for data analytics |
US14/814,456 Expired - Fee Related US10185930B1 (en) | 2014-12-01 | 2015-07-30 | Collaboration using shared documents for processing distributed data |
US14/814,457 Expired - Fee Related US10229148B1 (en) | 2014-12-01 | 2015-07-30 | Multi-language support for interfacing with distributed data |
US15/628,200 Expired - Fee Related US10110390B1 (en) | 2014-12-01 | 2017-06-20 | Collaboration using shared documents for processing distributed data |
Country Status (1)
Country | Link |
---|---|
US (5) | US9686086B1 (en) |
Families Citing this family (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021161104A1 (en) * | 2020-02-12 | 2021-08-19 | Monday.Com | Enhanced display features in collaborative network systems, methods, and devices |
CN104796392B (en) * | 2014-01-22 | 2019-01-08 | 腾讯科技(北京)有限公司 | One kind jumping context synchronizing device, method and client |
US9686086B1 (en) * | 2014-12-01 | 2017-06-20 | Arimo, Inc. | Distributed data framework for data analytics |
US10929417B2 (en) * | 2015-09-11 | 2021-02-23 | International Business Machines Corporation | Transforming and loading data utilizing in-memory processing |
US11163732B2 (en) * | 2015-12-28 | 2021-11-02 | International Business Machines Corporation | Linking, deploying, and executing distributed analytics with distributed datasets |
US10013214B2 (en) * | 2015-12-29 | 2018-07-03 | International Business Machines Corporation | Adaptive caching and dynamic delay scheduling for in-memory data analytics |
US10642896B2 (en) | 2016-02-05 | 2020-05-05 | Sas Institute Inc. | Handling of data sets during execution of task routines of multiple languages |
US10795935B2 (en) | 2016-02-05 | 2020-10-06 | Sas Institute Inc. | Automated generation of job flow definitions |
US10650046B2 (en) | 2016-02-05 | 2020-05-12 | Sas Institute Inc. | Many task computing with distributed file system |
US10324947B2 (en) * | 2016-04-26 | 2019-06-18 | Informatica Llc | Learning from historical logs and recommending database operations on a data-asset in an ETL tool |
US10380137B2 (en) * | 2016-10-11 | 2019-08-13 | International Business Machines Corporation | Technology for extensible in-memory computing |
USD898059S1 (en) | 2017-02-06 | 2020-10-06 | Sas Institute Inc. | Display screen or portion thereof with graphical user interface |
US10585647B2 (en) * | 2017-05-02 | 2020-03-10 | International Business Machines Corporation | Program optimization by converting code portions to directly reference internal data representations |
US10057373B1 (en) * | 2017-05-15 | 2018-08-21 | Palantir Technologies Inc. | Adaptive computation and faster computer operation |
USD898060S1 (en) | 2017-06-05 | 2020-10-06 | Sas Institute Inc. | Display screen or portion thereof with graphical user interface |
US10878019B2 (en) * | 2017-10-20 | 2020-12-29 | Dropbox, Inc. | Hosted storage for third-party services |
US11113411B2 (en) | 2017-10-20 | 2021-09-07 | Dropbox, Inc. | Authentication security model for a content management system |
US10979235B2 (en) | 2017-10-20 | 2021-04-13 | Dropbox, Inc. | Content management system supporting third-party code |
WO2019082982A1 (en) * | 2017-10-26 | 2019-05-02 | 日本電気株式会社 | Distributed processing management device, distributed processing method, and computer-readable storage medium |
US10503498B2 (en) | 2017-11-16 | 2019-12-10 | Sas Institute Inc. | Scalable cloud-based time series analysis |
US11580068B2 (en) * | 2017-12-15 | 2023-02-14 | Palantir Technologies Inc. | Systems and methods for client-side data analysis |
US10467335B2 (en) | 2018-02-20 | 2019-11-05 | Dropbox, Inc. | Automated outline generation of captured meeting audio in a collaborative document context |
US11488602B2 (en) * | 2018-02-20 | 2022-11-01 | Dropbox, Inc. | Meeting transcription using custom lexicons based on document history |
US10657954B2 (en) | 2018-02-20 | 2020-05-19 | Dropbox, Inc. | Meeting audio capture and transcription in a collaborative document context |
US10783214B1 (en) | 2018-03-08 | 2020-09-22 | Palantir Technologies Inc. | Adaptive and dynamic user interface with linked tiles |
US10810216B2 (en) | 2018-03-20 | 2020-10-20 | Sap Se | Data relevancy analysis for big data analytics |
US10725745B2 (en) * | 2018-05-24 | 2020-07-28 | Walmart Apollo, Llc | Systems and methods for polyglot analysis |
US11263179B2 (en) | 2018-06-15 | 2022-03-01 | Microsoft Technology Licensing, Llc | System for collaborative editing based on document evaluation |
US10938824B2 (en) | 2018-06-20 | 2021-03-02 | Microsoft Technology Licensing, Llc | Metric-based content editing system |
US10798152B2 (en) | 2018-06-20 | 2020-10-06 | Microsoft Technology Licensing, Llc | Machine learning using collaborative editing data |
US11100052B2 (en) | 2018-06-20 | 2021-08-24 | Microsoft Technology Licensing, Llc | System for classification based on user actions |
US10606851B1 (en) | 2018-09-10 | 2020-03-31 | Palantir Technologies Inc. | Intelligent compute request scoring and routing |
US10409641B1 (en) | 2018-11-26 | 2019-09-10 | Palantir Technologies Inc. | Module assignment management |
FR3089501B1 (en) * | 2018-12-07 | 2021-09-17 | Safran Aircraft Engines | COMPUTER ENVIRONMENT SYSTEM FOR AIRCRAFT ENGINE MONITORING |
US11379496B2 (en) | 2019-04-18 | 2022-07-05 | Oracle International Corporation | System and method for universal format driven data transformation and key flex fields in a analytic applications environment |
US12248490B2 (en) | 2019-04-18 | 2025-03-11 | Oracle International Corporation | System and method for ranking of database tables for use with extract, transform, load processes |
US11966870B2 (en) | 2019-04-18 | 2024-04-23 | Oracle International Corporation | System and method for determination of recommendations and alerts in an analytics environment |
US11436259B2 (en) | 2019-04-30 | 2022-09-06 | Oracle International Corporation | System and method for SaaS/PaaS resource usage and allocation in an analytic applications environment |
JP7611843B2 (en) | 2019-04-30 | 2025-01-10 | オラクル・インターナショナル・コーポレイション | SYSTEM AND METHOD FOR DATA ANALYTICS USING ANALYTICAL APPLICATION ENVIRONMENT - Patent application |
US11689379B2 (en) | 2019-06-24 | 2023-06-27 | Dropbox, Inc. | Generating customized meeting insights based on user interactions and meeting media |
US12153595B2 (en) | 2019-07-04 | 2024-11-26 | Oracle International Corporation | System and method for data pipeline optimization in an analytic applications environment |
US11030556B1 (en) | 2019-11-18 | 2021-06-08 | Monday.Com | Digital processing systems and methods for dynamic object display of tabular information in collaborative work systems |
CN111381940B (en) * | 2020-05-29 | 2020-08-25 | 上海冰鉴信息科技有限公司 | Distributed data processing method and device |
US12086112B2 (en) | 2020-09-08 | 2024-09-10 | International Business Machines Corporation | Processing large machine learning datasets |
US11734291B2 (en) * | 2020-10-21 | 2023-08-22 | Ebay Inc. | Parallel execution of API calls using local memory of distributed computing devices |
US11533235B1 (en) | 2021-06-24 | 2022-12-20 | Bank Of America Corporation | Electronic system for dynamic processing of temporal upstream data and downstream data in communication networks |
CN113543210B (en) * | 2021-06-28 | 2022-03-11 | 北京科技大学 | 5G-TSN cross-domain QoS and resource mapping method, equipment and computer readable storage medium |
CN116166655B (en) * | 2023-04-25 | 2023-07-07 | 尚特杰电力科技有限公司 | Big data cleaning system |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6959303B2 (en) | 2001-01-17 | 2005-10-25 | Arcot Systems, Inc. | Efficient searching techniques |
US8677091B2 (en) | 2006-12-18 | 2014-03-18 | Commvault Systems, Inc. | Writing data and storage system specific metadata to network attached storage device |
US8745075B2 (en) * | 2007-03-26 | 2014-06-03 | Xerox Corporation | Notification method for a dynamic document system |
US8069188B2 (en) * | 2007-05-07 | 2011-11-29 | Applied Technical Systems, Inc. | Database system storing a data structure that includes data nodes connected by context nodes and related method |
US8024361B2 (en) * | 2007-10-23 | 2011-09-20 | International Business Machines Corporation | Method and system for allowing multiple users to access and unlock shared electronic documents in a computer system |
US8990235B2 (en) * | 2009-03-12 | 2015-03-24 | Google Inc. | Automatically providing content associated with captured information, such as information captured in real-time |
US8326848B2 (en) * | 2009-08-11 | 2012-12-04 | International Business Machines Corporation | Proactive analytic data set reduction via parameter condition injection |
US8407322B1 (en) * | 2010-08-24 | 2013-03-26 | Adobe Systems Incorporated | Runtime negotiation of execution blocks between computers |
US8898593B2 (en) | 2011-10-05 | 2014-11-25 | Microsoft Corporation | Identification of sharing level |
US20130138739A1 (en) | 2011-11-24 | 2013-05-30 | NetOrbis Social Media Private Limited | System and method for sharing data between users in a collaborative environment |
US9529785B2 (en) * | 2012-11-27 | 2016-12-27 | Google Inc. | Detecting relationships between edits and acting on a subset of edits |
US20140181085A1 (en) | 2012-12-21 | 2014-06-26 | Commvault Systems, Inc. | Data storage system for analysis of data across heterogeneous information management systems |
US9699152B2 (en) * | 2014-08-27 | 2017-07-04 | Microsoft Technology Licensing, Llc | Sharing content with permission control using near field communication |
US9807073B1 (en) | 2014-09-29 | 2017-10-31 | Amazon Technologies, Inc. | Access to documents in a document management and collaboration system |
US9686086B1 (en) * | 2014-12-01 | 2017-06-20 | Arimo, Inc. | Distributed data framework for data analytics |
2015
- 2015-07-30 US US14/814,454 patent/US9686086B1/en active Active
- 2015-07-30 US US14/814,456 patent/US10185930B1/en not_active Expired - Fee Related
- 2015-07-30 US US14/814,457 patent/US10229148B1/en not_active Expired - Fee Related
2017
- 2017-06-20 US US15/628,200 patent/US10110390B1/en not_active Expired - Fee Related
2019
- 2019-02-06 US US16/268,503 patent/US20190228002A1/en not_active Abandoned
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210049182A1 (en) * | 2019-08-14 | 2021-02-18 | Palantir Technologies Inc. | Multi-language object cache |
US11334591B2 (en) * | 2019-08-14 | 2022-05-17 | Palantir Technologies Inc. | Multi-language object cache |
US12189646B2 (en) | 2019-08-14 | 2025-01-07 | Palantir Technologies Inc. | Multi-language object cache |
Also Published As
Publication number | Publication date |
---|---|
US9686086B1 (en) | 2017-06-20 |
US10185930B1 (en) | 2019-01-22 |
US10110390B1 (en) | 2018-10-23 |
US10229148B1 (en) | 2019-03-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10110390B1 (en) | Collaboration using shared documents for processing distributed data | |
US10853338B2 (en) | Universal data pipeline | |
US10803029B2 (en) | Generating javascript object notation (JSON) schema from JSON payloads | |
CN110023923B (en) | Generating a data transformation workflow | |
US20210334301A1 (en) | Design-time information based on run-time artifacts in transient cloud-based distributed computing clusters | |
Ram | Git can facilitate greater reproducibility and increased transparency in science | |
Stein et al. | The enterprise data lake: Better integration and deeper analytics | |
CN106547809B (en) | Representing compound relationships in a graph database | |
US10122783B2 (en) | Dynamic data-ingestion pipeline | |
CN109033113B (en) | Data warehouse and data mart management method and device | |
US8712947B2 (en) | Collaborative system for capture and reuse of software application knowledge and a method of realizing same | |
US20130117289A1 (en) | Content migration framework | |
JP6412924B2 (en) | Using projector and selector component types for ETL map design | |
Oancea et al. | Integrating R and hadoop for big data analysis | |
Gupta et al. | Practical enterprise data lake insights: Handle data-driven challenges in an enterprise big data lake | |
US20140379632A1 (en) | Smarter big data processing using collaborative map reduce frameworks | |
Schnase | Climate analytics as a service | |
Challawala et al. | MySQL 8 for Big Data: Effective Data Processing with MySQL 8, Hadoop, NoSQL APIs, and Other Big Data Tools | |
US12039416B2 (en) | Facilitating machine learning using remote data | |
Nandi | Spark for Python Developers | |
US20110289041A1 (en) | Systems and methods for managing assignment templates | |
US11521089B2 (en) | In-database predictive pipeline incremental engine | |
Dinh et al. | Data Process Approach by Traditional and Cloud Services Methodologies | |
Cardoso | Framework for collecting and processing georeferencing data | |
Leon et al. | Data Processing with Optimus: Supercharge big data preparation tasks for analytics and machine learning with Optimus using Dask and PySpark |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: ARIMO LLC, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NGUYEN, CHRISTOPHER T.;TRINH, ANH H.;BUI, BACH D.;SIGNING DATES FROM 20160812 TO 20170207;REEL/FRAME:049923/0717 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |