CN115599980A - SaaS-oriented Web API diversity recommendation method fusing a random-walk-with-restart algorithm - Google Patents
- Publication number: CN115599980A (application CN202211226862.7A)
 - Authority
 - CN
 - China
 - Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 - Withdrawn
 
Classifications
- G—PHYSICS
 - G06—COMPUTING OR CALCULATING; COUNTING
 - G06F—ELECTRIC DIGITAL DATA PROCESSING
 - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
 - G06F16/90—Details of database functions independent of the retrieved data types
 - G06F16/95—Retrieval from the web
 - G06F16/951—Indexing; Web crawling techniques
 
- G—PHYSICS
 - G06—COMPUTING OR CALCULATING; COUNTING
 - G06F—ELECTRIC DIGITAL DATA PROCESSING
 - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
 - G06F16/90—Details of database functions independent of the retrieved data types
 - G06F16/95—Retrieval from the web
 - G06F16/953—Querying, e.g. by the use of web search engines
 - G06F16/9535—Search customisation based on user profiles and personalisation
 
 
Abstract
Description
Technical Field
The present invention relates to API recommendation scenarios in the Web environment, and in particular to a SaaS-oriented Web API diversity recommendation method that fuses a random-walk-with-restart algorithm.
Background
SaaS stands for Software-as-a-Service. SaaS is an application model that delivers software services over the Internet, and the Web API is one of its core elements: an application programming interface that provides specific functionality and can serve many types of clients. A Mashup is a Web application built from Web APIs by combining content, presentation, or application functionality from different Web sources, with the goal of combining these resources to create useful new applications or services. Mashups are simple, highly available, and easy to access, and they favor simplicity over functional completeness and full scalability.
How to find suitable services in today's massive service repositories has become an important research topic, and many scholars at home and abroad have studied service recommendation. For example, Gao et al. converted Mashup description documents into vectors, clustered the vectors with the k-means algorithm, and proposed a generalized streaming ranking algorithm for API recommendation. Qi et al. learned latent representations of Mashups and APIs with a graph neural network, used an attention-based neural network to estimate the probability that a requested Mashup adopts each API, and returned a recommendation list ordered by those probabilities.
The hypergraph is a graph model that has become popular internationally in recent years. An edge in a hypergraph can connect any number of superpoints, and each hyperedge is a non-empty subset of the hypergraph's point set. Compared with ordinary graphs, hypergraphs model real-world data more intuitively. Applying hypergraphs to API recommendation allows high-order information to be encoded as hyperedges, capturing relationships between objects of the same type and across types while visualizing multi-way relationships vividly and intuitively.
Summary of the Invention
To improve the accuracy of service recommendation while also improving its diversity, the present invention proposes a SaaS-oriented Web API diversity recommendation method fusing a random-walk-with-restart algorithm. First, a data model is established to model the Mashup information; then the corresponding hypergraph is constructed and the superpoints of the original graph are clustered at multiple levels to reduce the search space; finally, Web APIs are recommended via a random walk with restart.
The present invention provides the following technical scheme:
A SaaS-oriented Web API diversity recommendation method fusing a random-walk-with-restart algorithm, comprising the following steps:
Step 1: extract the MATaTo (Mashup-API-Tag-Topic) data model, as follows:
Step (1.1): crawl Mashup service-collection information from Internet websites, then go to step (1.2);
Step (1.2): analyze the crawled Mashup service collection, select content-rich meta-information that helps improve recommendation accuracy, then go to step (1.3);
The crawled information includes service description documents, Mashup-API invocation relations, topics, tags, service followers, service invocation links, and update-log information;
Step (1.3): define the symbols of the MATaTo model: M denotes the Mashup set, A the API set, Ta the tag set, and To the topic set; R1, R2, R3, R4, and R5 denote five kinds of high-order relations; n denotes the number of superpoints and m the number of hyperedges; WHG ∈ R^(m×n) denotes the weighted hypergraph; v denotes a superpoint and V the set of n superpoints; e denotes a hyperedge and E the set of m hyperedges; R^(m×n) denotes the set of real matrices of size m×n. Then go to step (1.4);
Step (1.4): traverse the crawled information and extract the MATaTo data model from it;
Step 2: construct a hypergraph based on the MATaTo data model, as follows:
Step (2.1): model the Mashup and API description documents with an NMF model to obtain the corresponding topic vectors, then go to step (2.2);
Here NMF denotes non-negative matrix factorization, an unsupervised learning algorithm;
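As an illustration, the topic-vector extraction of step (2.1) can be sketched with scikit-learn; the library choice, the sample descriptions, and the parameters below are assumptions, as the patent does not prescribe them:

```python
# Sketch of step (2.1): NMF topic vectors from description documents.
# scikit-learn and the hyperparameters are assumptions, not the patent's.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "map location geocoding service",       # hypothetical Mashup descriptions
    "photo image sharing upload service",
    "map route navigation traffic",
]
tfidf = TfidfVectorizer().fit_transform(docs)      # documents x terms matrix
model = NMF(n_components=2, init="nndsvda", random_state=0)
topic_vectors = model.fit_transform(tfidf)         # one topic vector per document
print(topic_vectors.shape)  # (3, 2)
```

Each row of `topic_vectors` is the non-negative topic vector used later when building hyperedges.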
Step (2.2): according to the MATaTo data model, build the superpoints of the hypergraph from the four types of objects;
Step (2.3): build ten types of hyperedges from the five high-order relations (E_M*kTo*, E_M*kTa*, E_A*kTo*, E_A*kTa*, E_M*To*Ta*, E_A*To*Ta*, E_M*A*To*, E_M*A*Ta*, plus the popularity and co-occurrence hyperedges defined below) and assemble the hyperedge set of the hypergraph;
Step 3: perform multi-level clustering on the hypergraph built from the MATaTo data model, as follows:
Step (3.1), coarsening stage: repeatedly coarsen the superpoints of the original graph until the hypergraph shrinks to a certain size;
Step (3.2), clustering stage: run spectral clustering with different parameters on the coarsest hypergraph obtained in the coarsening stage, select the best-performing clustering according to the results, then go to step (3.3);
Step (3.3), refinement stage: map the superpoints merged during coarsening back into the clusters; end;
Step 4: make recommendations for the Mashup service based on the random-walk-with-restart algorithm.
A random walk is a statistical model widely used to describe the motion of many entities in nature. In the random-walk-with-restart algorithm designed in the present invention, a walker at a superpoint of the hypergraph has two choices at each step: move to an adjacent superpoint with a given probability α ∈ (0,1), or return to the initial superpoint with probability 1−α. This restart mechanism makes the final service recommendation list match the requesting Mashup more closely.
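The walk just described can be sketched in matrix form; this is a minimal sketch over a hypothetical toy transition matrix, following the convention above (move with probability α, restart with probability 1−α):

```python
import numpy as np

def random_walk_with_restart(P, start, alpha=0.85, tol=1e-10):
    """Iterate p = alpha * P^T p + (1 - alpha) * e_start to convergence.

    P is a row-stochastic transition matrix over superpoints; with
    probability alpha the walker moves, with 1 - alpha it restarts
    at the initial superpoint.
    """
    n = P.shape[0]
    restart = np.zeros(n)
    restart[start] = 1.0
    p = restart.copy()
    while True:
        p_next = alpha * P.T @ p + (1 - alpha) * restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy 3-superpoint chain 0 - 1 - 2 (hypothetical adjacency)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
P = A / A.sum(axis=1, keepdims=True)
scores = random_walk_with_restart(P, start=0, alpha=0.8)
# Ranking the non-start superpoints by score gives the recommendation order.
print(scores.round(3))
```

Superpoints closer to the start accumulate more probability mass, which is what biases the recommendation list toward the requesting Mashup.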
Further, in step (1.2), the service description document is a replacement for the traditional Web-service WSDL document and describes the service in intuitive language;
The service invocation link only identifies the resource address of a service and offers no real help, so the invocation-link information is not used.
Still further, the process of step (1.4) is as follows:
Step (1.4.1): extract four types of information (service description documents, Mashup-API invocation relations, topics, and tags), then go to step (1.4.2);
The four types of information are as follows:
Mashup (M*) denotes a Mashup service, the requester in the service-recommendation scenario; a Mashup carries tag and topic information and has invocation relationships with APIs;
API (A*) denotes a Web API, the object being recommended; like a Mashup service object, it carries tag and topic information, and it is invoked by Mashups;
Tag (Ta*) denotes the crawled tag information; Mashups and APIs with the same tag are more closely related;
Topic (To*) denotes the crawled topic information; Mashups and APIs under the same topic have similar functionality;
Step (1.4.2): derive five types of high-order relations: R1(M*-Ta*-To*), R2(A*-Ta*-To*), R3(M*-M*), R4(A*-A*), and R5(M*-A*);
The five high-order relations are as follows:
R1(M*-Ta*-To*) represents a Mashup and its attributes: besides its own description document, a Mashup is further described by the topic and tag information it carries;
R2(A*-Ta*-To*) represents an API and its attributes: as with R1, an API is further described by its topic and tag information, and a Mashup invokes APIs with the same topics and tags;
R3(M*-M*) represents relations between Mashups: the invention computes the similarity between different Mashup description documents to express the relationship between different Mashup services;
R4(A*-A*) represents relations between APIs: as with R3, similarity expresses the relation between API services; in addition, two or more APIs invoked together reflect their co-occurrence, which improves recommendation accuracy to a certain extent;
R5(M*-A*) represents relations between Mashups and APIs: a Mashup is composed of APIs, and besides the explicit invocation relation, an API's topics or tags also describe the Mashup's functionality.
Still further, the process of step (2.2) is as follows:
Step (2.2.1): define the Mashup set M = {m1, m2, m3, ..., mN}, then go to step (2.2.2);
where N is the total number of Mashups;
Step (2.2.2): define the set of APIs invoked by Mashup mi as Ai = {ai1, ai2, ai3, ..., aik'}, then go to step (2.2.3);
where i is the index of the current Mashup in the set M and k' is the total number of APIs invoked by Mashup mi; the set of APIs invoked by all Mashups is then A = A1 ∪ A2 ∪ ... ∪ AN, where ∪ denotes set union;
Step (2.2.3): define Ta and To as the sets of tag and topic information carried by all Mashups and APIs, then go to step (2.2.4);
Step (2.2.4): initialize the superpoint set of the hypergraph as V = M ∪ A ∪ Ta ∪ To.
In step (2.3), the hyperedge definitions are specified as follows:
E_M*kTo*: for each Mashup, select the top-k most similar Mashups together with their topics to form a hyperedge; the edge weight is set to the average similarity to the k Mashups, computed with the common cosine similarity: cos(θ) = (vm_x' · vm_y') / (|vm_x'| · |vm_y'|);
where θ is the angle between the two feature vectors, vm_x' is the feature vector of the x'-th Mashup, and |vm_x'| is the norm of the feature vector;
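For illustration, the cosine similarity can be computed as follows (the two feature vectors are hypothetical):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical topic vectors for two Mashups; parallel vectors give 1.0
print(round(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]), 3))  # 1.0
```

A hyperedge weight is then the mean of the k pairwise similarities computed this way.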
E_M*kTa*: for each Mashup, select the top-k most similar Mashups together with their tags to form a hyperedge; the edge weight is the average similarity to the k Mashups, computed with the same cosine formula as in step (2.3);
E_A*kTo*: for each API, select the top-k most similar APIs together with their topics to form a hyperedge; the edge weight is the average similarity to the k APIs, computed with the same cosine formula;
E_A*kTa*: for each API, select the top-k most similar APIs together with their tags to form a hyperedge; the edge weight is the average similarity to the k APIs, computed with the same cosine formula;
E_M*To*Ta*: for each Mashup, any number of the topics and tags it carries form a hyperedge with weight 1;
E_A*To*Ta*: for each API, any number of the topics and tags it carries form a hyperedge with weight 1;
E_M*A*To*: for each Mashup, all the APIs it invokes together with the topics it carries form a hyperedge with weight 1;
E_M*A*Ta*: for each Mashup, all the APIs it invokes together with the tags it carries form a hyperedge with weight 1;
Popular APIs tend to indicate higher quality. For each API, a hyperedge is built with the topic it belongs to, and the edge weight is set to the API's popularity: Pop(Ar) = (Call(Ar) - MinCall(Topic(Ar))) / (MaxCall(Topic(Ar)) - MinCall(Topic(Ar)));
where Call(Ar) is the number of times API Ar is invoked by Mashups, Topic(Ar) is the set of all APIs with the same topic as Ar, MinCall(Topic(Ar)) is the lowest invocation count among the APIs in that topic, and MaxCall(Topic(Ar)) is the highest;
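A sketch of the popularity weight, assuming a min-max-normalized form consistent with the Call/MinCall/MaxCall definitions above (the API names and invocation counts are hypothetical):

```python
def popularity(api, calls, topic_of):
    """Min-max-normalized invocation frequency of `api` within its topic.

    calls: api -> number of Mashup invocations; topic_of: api -> topic.
    The min-max form is an assumption reconstructed from the symbol
    definitions; the patent's original formula image is unavailable.
    """
    peers = [a for a in calls if topic_of[a] == topic_of[api]]
    lo = min(calls[a] for a in peers)
    hi = max(calls[a] for a in peers)
    return (calls[api] - lo) / (hi - lo) if hi != lo else 1.0

calls = {"A1": 10, "A2": 40, "A3": 25}                 # hypothetical counts
topic_of = {"A1": "maps", "A2": "maps", "A3": "maps"}
print(popularity("A3", calls, topic_of))  # 0.5
```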
APIs that are frequently invoked together are related. This hyperedge contains all APIs whose co-occurrence count is greater than 1, and the edge weight is set to the co-occurrence degree: Co(Am', ..., An') = |Am' ∩ ... ∩ An'| / |Am' ∪ ... ∪ An'|;
where |Am' ∩ ... ∩ An'| is the number of times APIs Am', ..., An' are invoked by the same Mashup, and |Am' ∪ ... ∪ An'| is the total number of times Am', ..., An' are invoked by Mashups.
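The co-occurrence degree can be sketched as a Jaccard-style reading of the intersection/union notation above (the invocation records are hypothetical):

```python
def co_occurrence(api_group, mashup_calls):
    """|A_m' ∩ ... ∩ A_n'| / |A_m' ∪ ... ∪ A_n'| over Mashup invocations.

    mashup_calls: mashup -> set of APIs it invokes.  The numerator counts
    Mashups invoking every API in the group; the denominator counts
    Mashups invoking at least one of them.  This Jaccard-style reading
    is an assumption reconstructed from the notation in the text.
    """
    together = sum(1 for apis in mashup_calls.values()
                   if set(api_group) <= apis)
    any_of = sum(1 for apis in mashup_calls.values()
                 if set(api_group) & apis)
    return together / any_of if any_of else 0.0

mashup_calls = {                       # hypothetical invocation records
    "M1": {"A1", "A2"},
    "M2": {"A1", "A2", "A3"},
    "M3": {"A2"},
    "M4": {"A3"},
}
print(round(co_occurrence(["A1", "A2"], mashup_calls), 3))  # 0.667
```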
The process of step (3.1) is as follows:
Step (3.1.1): define the symbols needed in the coarsening stage: c denotes the superpoint weight set, w the hyperedge weight set, G = (V, E, c, w) the original hypergraph, iterTime the number of iterations, coarse_limit the coarsening threshold, and max_weight the superpoint weight threshold; G' = (V', E', c', w') denotes the coarsened graph, where V' is the set of superpoints not merged during coarsening, E' the coarsened hyperedges, c' the coarsened superpoint weight set, and w' the coarsened hyperedge weight set. Then go to step (3.1.2);
Step (3.1.2): add all superpoints to the unvisited-superpoint set UV, then go to step (3.1.3);
Step (3.1.3): initialize an empty set C to record all coarsened superpoints, then go to step (3.1.4);
Step (3.1.4): define a variable I to store the iteration count and initialize it to 0; run steps (3.1.4.1-3.1.4.4) to coarsen the superpoints in UV, then go to step (3.1.5);
Step (3.1.4.1): define a set UVT to store the superpoints to be coarsened in this iteration, copy the superpoints in UV into UVT, then go to step (3.1.4.2);
Step (3.1.4.2): run steps (3.1.4.2.1-3.1.4.2.12) to coarsen the superpoints in UVT until the number of iterations reaches len(UVT), then go to step (3.1.4.3);
where len() returns the size of a set, so len(UVT) is the size of the set UVT;
Step (3.1.4.2.1): randomly pick a superpoint from the set UVT and name it rv, then go to step (3.1.4.2.2);
Step (3.1.4.2.2): check whether rv belongs to UVT; if so, go to step (3.1.4.2.3), otherwise enter the next loop iteration directly;
Step (3.1.4.2.3): define a superpoint cv and copy rv into it; define score to store the rating and initialize it to 0; go to step (3.1.4.2.4);
Step (3.1.4.2.4): traverse the superpoints adjacent to rv, copying each into iv, and run steps (3.1.4.2.4.1-3.1.4.2.4.3) to find the superpoint iv to merge with rv; then go to step (3.1.4.2.5);
where the superpoints adjacent to rv are those lying in the same hyperedge as rv;
Step (3.1.4.2.4.1): compute the sum of the weights of rv and iv;
Step (3.1.4.2.4.2): check whether this weight sum exceeds the superpoint weight threshold max_weight; if not, go to step (3.1.4.2.4.3); if so, enter the next loop iteration directly;
Step (3.1.4.2.4.3): check whether iv is a superpoint in the set UVT; if so, enter the next loop iteration directly, otherwise end the loop;
Step (3.1.4.2.5): compute the rating of rv and iv with the scoring function, define a variable cur_score, store the rating in cur_score, then go to step (3.1.4.2.6);
The scoring function is score(rv, iv) = a·con(rv, iv) + b·hNcut(rv, iv), where a and b are empirical values obtained by tuning against experimental results. The connectivity term is con(rv, iv) = (1 / (c*(rv) · c*(iv))) · Σ_{e1 ∈ I*(rv) ∩ I*(iv)} w(e1) / de(e1), where I*(rv) and I*(iv) denote the sets of hyperedges containing the superpoints rv and iv respectively, ∩ denotes set intersection (the elements present in both sets), c*(rv) and c*(iv) are the weights of rv and iv, e1 denotes a hyperedge containing both rv and iv, w(e1) is the weight of hyperedge e1, de(e1) is the degree of e1 with de(e1) = |e1| (the number of superpoints in e1), and Σ denotes summation. hNcut(rv, iv) is the normalized cut of the hyperedges spanning rv and iv, representing the relatedness of the two clusters obtained when rv and iv are each viewed as a cluster; it is computed as hNcut(rv, iv) = Σ_{e2 ∈ Qe} w(e2) / dv(rv) + Σ_{e2 ∈ Qe} w(e2) / dv(iv), where dv(rv) and dv(iv) are the degrees of rv and iv, i.e. the sums of the weights of the hyperedges containing them, e2 denotes a hyperedge in the set Qe, and Qe is the set of hyperedges cut by the two clusters; a hyperedge is said to be cut when its superpoints lie in different clusters.
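On a toy hypergraph, the scoring function can be sketched as follows; this is assembled from the symbol definitions of score, con, and hNcut above, and the coefficients a, b and the toy data are hypothetical:

```python
def con(rv, iv, edges, weight, c):
    """Connectivity term: sum of w(e)/de(e) over hyperedges containing
    both rv and iv, scaled by the superpoint weights (a sketch
    reconstructed from the symbol definitions in the text)."""
    shared = [e for e in edges if rv in edges[e] and iv in edges[e]]
    s = sum(weight[e] / len(edges[e]) for e in shared)
    return s / (c[rv] * c[iv])

def hncut(rv, iv, edges, weight, dv):
    """Normalized cut of the hyperedges split when rv and iv are
    viewed as two singleton clusters."""
    cut = [e for e in edges if rv in edges[e] and iv in edges[e]]
    w_cut = sum(weight[e] for e in cut)
    return w_cut / dv[rv] + w_cut / dv[iv]

def score(rv, iv, edges, weight, c, dv, a=0.5, b=0.5):
    return a * con(rv, iv, edges, weight, c) + b * hncut(rv, iv, edges, weight, dv)

edges = {"e1": {"rv", "iv", "x"}, "e2": {"rv", "y"}}   # toy hypergraph
weight = {"e1": 1.0, "e2": 2.0}
c = {"rv": 1.0, "iv": 1.0}                              # superpoint weights
dv = {"rv": 3.0, "iv": 1.0}   # sum of weights of edges containing the point
print(round(score("rv", "iv", edges, weight, c, dv), 3))  # 0.833
```

A higher score marks iv as a better merge partner for rv in step (3.1.4.2.6).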
Step (3.1.4.2.6): check whether cur_score is greater than score; if so, copy the superpoint iv into cv and assign cur_score to score; go to step (3.1.4.2.7);
Step (3.1.4.2.7): merge rv and cv; the invention uses a doubly linked list to connect superpoints that have a merge relationship; then go to step (3.1.4.2.8);
where a doubly linked list is a kind of linked list in which two directly connected nodes hold pointers to each other;
Step (3.1.4.2.8): add the weight of the merged superpoint cv to rv, then go to step (3.1.4.2.9);
Step (3.1.4.2.9): remove rv and cv from the UVT used in the current iteration round, then go to step (3.1.4.2.10);
Step (3.1.4.2.10): remove cv from UV, then go to step (3.1.4.2.11);
Step (3.1.4.2.11): add the superpoint cv to the set C, then go to step (3.1.4.2.12);
Step (3.1.4.2.12): if the number of coarsened superpoints reaches the coarsening threshold coarse_limit, exit the iteration and go to step (3.1.4.3); otherwise enter the next iteration directly;
Step (3.1.4.3): increment I by 1, i.e. I = I + 1, then go to step (3.1.4.4);
Step (3.1.4.4): check whether the iteration count I is less than iterTime and whether the number of coarsened superpoints has reached the coarsening threshold coarse_limit; if either condition fails, stop coarsening, stop iterating, and go to step (3.1.5);
Step (3.1.5): copy the coarsened superpoints, hyperedges, superpoint weight set, and hyperedge weight set into V', E', c', and w' respectively, giving the coarsening result, the hypergraph G' = (V', E', c', w').
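The coarsening loop of steps (3.1.2)-(3.1.5) can be condensed into a simplified single-pass sketch; the scoring function is stubbed out, the toy graph is hypothetical, and the real algorithm iterates up to iterTime rounds:

```python
import random

def coarsen(vertices, neighbors, weight, score, max_weight, coarse_limit):
    """Simplified coarsening pass: repeatedly pick a random unvisited
    superpoint rv, merge it with its best-scoring neighbour whose combined
    weight stays within max_weight, until coarse_limit points are merged.
    A sketch of the patent's procedure, not the full iterTime loop."""
    unvisited = set(vertices)
    merged = []                           # the set C of coarsened superpoints
    while unvisited and len(merged) < coarse_limit:
        rv = random.choice(sorted(unvisited))
        best, best_score = None, 0.0
        for iv in neighbors[rv]:
            if iv not in unvisited or weight[rv] + weight[iv] > max_weight:
                continue
            s = score(rv, iv)
            if s > best_score:
                best, best_score = iv, s
        unvisited.discard(rv)
        if best is not None:
            weight[rv] += weight[best]    # fold the merged point's weight into rv
            unvisited.discard(best)
            merged.append(best)
    return merged, weight

random.seed(0)
# Toy 4-cycle hypergraph skeleton a - b - c - d - a (hypothetical)
nbrs = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["a", "c"]}
w = {v: 1.0 for v in nbrs}
merged, w = coarsen(list(nbrs), nbrs, w, lambda u, v: 1.0,
                    max_weight=3.0, coarse_limit=2)
print(len(merged))  # 2
```

In a 4-cycle any merge leaves the remaining pair adjacent, so exactly coarse_limit = 2 points are merged away.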
The process of step (3.2) is as follows:
Step (3.2.1): run steps (3.2.1.1-3.2.1.6) to build the Laplacian matrix, then go to step (3.2.2);
where the Laplacian matrix is a matrix representation of a graph, used mainly in graph theory;
Step (3.2.1.1): define VO and EO as arrays that store, respectively, all superpoints of the hypergraph with their index positions and all hyperedges with their index positions; store all superpoints and their index positions in VO and all hyperedges and their index positions in EO; go to step (3.2.1.2);
Step (3.2.1.2): build the similarity matrix S from the hypergraph information obtained with the hyperedge-construction and weight-calculation methods of step (2.3), then go to step (3.2.1.3);
where the similarity matrix is an important statistical technique for organizing the pairwise similarities of a set of data points; here the similarity is the hyperedge weight;
Step (3.2.1.3): build the superpoint degree matrix VD and the hyperedge degree matrix ED from the similarity matrix, then go to step (3.2.1.4);
where the superpoint degree matrix VD is a diagonal matrix of superpoint degrees whose diagonal elements are the degrees dv(va), with va the superpoint at index position a in VO;
and the hyperedge degree matrix ED is a diagonal matrix of hyperedge degrees whose diagonal elements are the degrees de(eb), with eb the hyperedge at index position b in EO;
Step (3.2.1.4): from the similarity matrix, run steps (3.2.1.4.1-3.2.1.4.4) to build the incidence matrix H, then go to step (3.2.1.5);
where the incidence matrix is an n×m matrix whose d-th row represents all hyperedges containing the d-th superpoint and whose f-th column represents all superpoints contained in the f-th hyperedge; the entry in row d, column f is 1 if the d-th superpoint belongs to the f-th hyperedge, and 0 otherwise;
Step (3.2.1.4.1): create an n×m matrix H of all zeros, then go to step (3.2.1.4.2);
Step (3.2.1.4.2): traverse all hyperedges in EO, then go to step (3.2.1.4.3);
Step (3.2.1.4.3): traverse all superpoints in the current hyperedge, then go to step (3.2.1.4.4);
Step (3.2.1.4.4): in matrix H, set the 0 at the index position corresponding to that hyperedge and superpoint to 1;
Step (3.2.1.5): from the hypergraph information obtained with the hyperedge-construction and weight-calculation methods of step (2.3), run steps (3.2.1.5.1-3.2.1.5.3) to build the weight matrix W, then go to step (3.2.1.6);
where the weight matrix W is a diagonal matrix of hyperedge weights whose diagonal elements are the hyperedge weights;
Step (3.2.1.5.1): create an m×m matrix W of all zeros, then go to step (3.2.1.5.2);
Step (3.2.1.5.2): traverse all hyperedges in EO, then go to step (3.2.1.5.3);
Step (3.2.1.5.3): write each hyperedge's weight into the weight matrix at its index position in EO, replacing the 0 in matrix W;
Step (3.2.1.6): build the Laplacian matrix L from the incidence matrix, the weight matrix, the hyperedge degree matrix, and the superpoint degree matrix;
where the Laplacian matrix is computed as L = E* − VD^(−1/2) H W ED^(−1) H^T VD^(−1/2), in which E* is the identity matrix (the matrix whose diagonal elements from top-left to bottom-right are all 1 and whose other entries are all 0), VD^(−1/2) and ED^(−1) denote the matrix VD raised to the power −1/2 and the matrix ED raised to the power −1 (since VD and ED are both diagonal, it suffices to raise the diagonal entries to the corresponding power), and H^T is the transpose of H, the matrix obtained by turning the rows of H into the columns of the same index;
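Steps (3.2.1.1)-(3.2.1.6) can be sketched with NumPy on a toy hypergraph; the superpoints, hyperedges, and weights below are hypothetical, and the Laplacian is written in the dimension-consistent order L = I − VD^(−1/2) H W ED^(−1) H^T VD^(−1/2) for an n×m incidence matrix H:

```python
import numpy as np

VO = ["v0", "v1", "v2", "v3"]                            # superpoints with indices
EO = [("e0", ["v0", "v1", "v2"]), ("e1", ["v2", "v3"])]  # hyperedges with indices
w_e = {"e0": 1.0, "e1": 2.0}                             # hypothetical edge weights

n, m = len(VO), len(EO)
H = np.zeros((n, m))                     # incidence matrix: H[d, f] = 1 iff v_d in e_f
for f, (_, members) in enumerate(EO):
    for v in members:
        H[VO.index(v), f] = 1.0

w_vec = np.array([w_e[e] for e, _ in EO])
W = np.diag(w_vec)                       # hyperedge weight matrix (m x m)
VD = np.diag(H @ w_vec)                  # superpoint degrees: weights of incident edges
ED = np.diag(H.sum(axis=0))              # hyperedge degrees: superpoints per edge

VD_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(VD)))
ED_inv = np.diag(1.0 / np.diag(ED))
L = np.eye(n) - VD_inv_sqrt @ H @ W @ ED_inv @ H.T @ VD_inv_sqrt
print(np.allclose(L, L.T))  # True: the normalized hypergraph Laplacian is symmetric
```

Symmetry and positive semi-definiteness of L are what make the spectral clustering of step (3.2.3) well posed.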
Step (3.2.2) normalizes the Laplacian matrix to obtain NL and proceeds to step (3.2.3);
wherein normalizing a matrix means mapping its data, column by column, into the interval (0, 1), i.e. subtracting from each column its minimum value and dividing by its maximum value;
Step (3.2.3) solves the Laplacian matrix NL, computes its eigenvalues, obtains the eigenvectors, and proceeds to step (3.2.4);
wherein eigenvalues and eigenvectors are defined as follows: matrix multiplication corresponds to a transformation that turns an arbitrary vector into a new vector, usually with a different direction or length; if the matrix only stretches certain vectors without rotating them, those vectors are called the eigenvectors of the matrix, and the stretching ratios are the eigenvalues;
Step (3.2.4) assembles the eigenvectors column by column into a matrix NE, normalizes it, and proceeds to step (3.2.5);
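Steps (3.2.2)-(3.2.4) can be sketched as below. Two details are assumptions: the source's wording ("subtract the column minimum and divide by the column maximum") is read as standard min-max scaling with a (max - min) denominator, and the eigenvectors of the k smallest eigenvalues are the ones kept, as is usual in spectral clustering; neither detail is fixed by the source.

```python
import numpy as np

def spectral_embedding(L, k):
    """Sketch of steps (3.2.2)-(3.2.4): column-wise normalization of L,
    eigendecomposition, and assembling eigenvectors column by column into NE."""
    NL = (L - L.min(axis=0)) / (L.max(axis=0) - L.min(axis=0))  # step (3.2.2)
    vals, vecs = np.linalg.eig(NL)                              # step (3.2.3)
    order = np.argsort(vals.real)          # keep the k smallest (assumption)
    NE = vecs.real[:, order[:k]]           # step (3.2.4): eigenvectors as columns
    return (NE - NE.min(axis=0)) / (NE.max(axis=0) - NE.min(axis=0))
```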
Step (3.2.5) defines bestC to store the best clustering result, defines BestE to store the score of the best clustering result and initializes BestE to 0, defines T_num to store the number of completed clustering runs and initializes T_num to 0, defines retryTime as the number of times the clustering operation is repeated, and proceeds to step (3.2.6);
Step (3.2.6) executes steps (3.2.6.1-3.2.6.5) to cluster NE and evaluate the clustering quality, looping retryTime times to find the clustering with the highest solution quality, then executes step (3.2.7);
Step (3.2.6.1) increments T_num by 1, i.e. T_num = T_num + 1, and proceeds to step (3.2.6.2);
Step (3.2.6.2) clusters the matrix NE with the K-Means method, takes the resulting vector clusters as the clustering result of the corresponding superpoints, and proceeds to step (3.2.6.3);
wherein the K-Means method divides the data set into K clusters, each cluster being represented by the mean of all samples within it; this mean is called the "centroid";
Step (3.2.6.3) evaluates the clustering quality with the scoring function and proceeds to step (3.2.6.4);
wherein the scoring function sums hCut(V_mi) over the clusters, where me denotes the number of clusters in the clustering result bestC and V_mi denotes the mi-th cluster in bestC; in the formula for hCut(V_mi), w(e_cut) denotes the weight of the hyperedge e_cut, β(V_mi) denotes the set of hyperedges cut by V_mi, e_cut denotes a hyperedge in the hyperedge set β(V_mi), the complement term denotes the complement of V_mi in the set V, and |e_cut ∩ V_mi| denotes the size of the intersection of e_cut and V_mi, i.e. the number of superpoints belonging to both the hyperedge e_cut and V_mi;
Step (3.2.6.4) if the score of this clustering run is higher than BestE, performs steps (3.2.6.4.1-3.2.6.4.2) to replace the stored clustering result with this one; otherwise proceeds to step (3.2.6.5);
Step (3.2.6.4.1) writes this run's clustering result into bestC and proceeds to step (3.2.6.4.2);
Step (3.2.6.4.2) assigns the score of this run's clustering result to BestE;
Step (3.2.6.5) checks whether T_num is greater than or equal to retryTime; if so, exits the loop, otherwise enters the next iteration;
Step (3.2.7) ends the loop and obtains the clustering result bestC.
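The loop of steps (3.2.5)-(3.2.7) amounts to repeated K-Means with best-score retention. A minimal sketch follows; the K-Means implementation and the score_fn callable are illustrative stand-ins, the patent's actual score being the hCut-based function of step (3.2.6.3):

```python
import numpy as np

def kmeans(X, k, rng, iters=50):
    """Minimal K-Means: each cluster is represented by the mean ("centroid")
    of the samples inside it, as described in step (3.2.6.2)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def cluster_with_retries(NE, k, score_fn, retry_time, seed=0):
    """Run K-Means retry_time times on the rows of NE and keep the partition
    with the highest score (bestC / BestE of steps (3.2.5)-(3.2.7))."""
    rng = np.random.default_rng(seed)
    best_c, best_e = None, 0.0
    for _ in range(retry_time):          # T_num loop, steps (3.2.6.1)-(3.2.6.5)
        labels = kmeans(NE, k, rng)      # step (3.2.6.2)
        score = score_fn(labels)         # step (3.2.6.3)
        if score > best_e:               # step (3.2.6.4)
            best_c, best_e = labels, score
    return best_c, best_e
```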
The process of step (3.3) is as follows:
Step (3.3.1) defines UV' to denote the superpoints that need to be refined, copies the superpoint set V' of the hypergraph G' into UV', and proceeds to step (3.3.2);
Step (3.3.2) enters a loop executing steps (3.3.2.1-3.3.2.4) to refine the superpoints in UV' until all superpoints have been refined, then proceeds to step (3.3.3);
Step (3.3.2.1) randomly takes a superpoint v* out of the set UV' and proceeds to step (3.3.2.2);
Step (3.3.2.2) traverses, in order, the doubly linked list storing the superpoints merged into the superpoint v*, obtains each superpoint v** in it, executes steps (3.3.2.2.1-3.3.2.2.6) to find the cluster that superpoint v** belongs to, and proceeds to step (3.3.2.3);
Step (3.3.2.2.1) defines cc to denote the cluster that superpoint v** belongs to, initializes it to the same cluster as superpoint v*, and proceeds to step (3.3.2.2.2);
Step (3.3.2.2.2) defines distance to denote the distance between superpoint v** and the cluster it belongs to, initializes distance by computing the distance between superpoint v** and the cluster cc, and proceeds to step (3.3.2.2.3);
wherein the distance formula penalizes according to the superpoint weight when the superpoint type is Topic; v_r1, v_r2, v_r3 denote superpoints in the cluster cc; Q_e' denotes the set of hyperedges containing both superpoints v_r1 and v_r2, and e'' denotes a hyperedge in that set; Q_e'' denotes the set of hyperedges containing both superpoint v** and v_r3, and e''' denotes a hyperedge in that set; len(cc)^2 denotes the square of len(cc), i.e. the square of the size of the set cc; dv(v**) denotes the degree of superpoint v**;
Step (3.3.2.2.3) traverses the clusters in the clustering result bestC, each named cluster, executes steps (3.3.2.2.3.1-3.3.2.2.3.2) to find the cluster closest to v**, and executes step (3.3.2.2.4);
Step (3.3.2.2.3.1) computes the distance between v** and cluster with the distance formula of step (3.3.2.2.2) and proceeds to step (3.3.2.2.3.2);
Step (3.3.2.2.3.2) if the distance between v** and cluster is smaller than distance, updates cc to cluster and updates distance to the distance between v** and cluster;
Step (3.3.2.2.4) moves v** into the cluster cc and proceeds to step (3.3.2.2.5);
Step (3.3.2.2.5) subtracts the weight of v** from the weight of v* and proceeds to step (3.3.2.2.6);
Step (3.3.2.2.6) if v** is not the last superpoint of the doubly linked list, updates v** to the next superpoint of the doubly linked list and enters the next iteration; otherwise ends the iteration;
Step (3.3.2.3) deletes v* from the set UV' and proceeds to step (3.3.2.4);
Step (3.3.2.4) exits the loop if the set UV' is empty; otherwise enters the next iteration;
Step (3.3.3) obtains the refined final clustering result bestC; the refinement is complete.
The process of the fourth step is as follows:
Step (4.1) defines the symbols of the random walk with restart model and proceeds to step (4.2);
wherein tag denotes a Mashup tag, topic denotes a Mashup topic, iterTime' denotes the number of iterations, α denotes the restart probability, topQ denotes the number of recommendations, recommend denotes the service recommendation list, and clusters(C1, C2, ..., Ck) denotes the cluster partition, where C1, C2, ..., Ck denote the individual clusters of the partition;
Step (4.2) defines y(t) as the column vector formed by the probabilities of each superpoint being visited at time t, and proceeds to step (4.3);
Step (4.3) defines y(1) as the column vector y(t) at t = 1, formed by the probabilities of each superpoint being visited, and stores the reciprocal of the length of VO in y(1), in the same order as VO;
Step (4.4) performs steps (4.4.1-4.4.2) to build the description document, tags, and topics of the requirement Mashup into the initial column vector q, and proceeds to step (4.5);
wherein q is an initial column vector of 0s and 1s, in which the positions representing the initial superpoints are assigned 1 and the rest are 0;
Step (4.4.1) creates an all-zero vector q of the same length as VO and proceeds to step (4.4.2);
Step (4.4.2) compares the description document, tags, and topics of the requirement Mashup in turn with the superpoints in VO, finds the matching superpoints, and sets the corresponding index positions in q to 1;
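Steps (4.4.1)-(4.4.2) can be sketched as follows (names are illustrative; mashup_features stands in for the requirement Mashup's description terms, tags, and topics):

```python
def build_restart_vector(VO, mashup_features):
    """Sketch of step (4.4): q[i] = 1 if superpoint VO[i] appears among the
    requirement Mashup's features (description terms, tags, topics), else 0."""
    q = [0.0] * len(VO)              # step (4.4.1): all-zero vector, same length as VO
    wanted = set(mashup_features)
    for i, v in enumerate(VO):       # step (4.4.2): set matching positions to 1
        if v in wanted:
            q[i] = 1.0
    return q
```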
Step (4.5) performs steps (4.5.1-4.5.5) to compute the visit distribution vector of the superpoint set, and proceeds to step (4.6);
Step (4.5.1) constructs the transition probability matrix P from the superpoint degree matrix, the hyperedge degree matrix, the weight matrix, and the incidence matrix, with the formula P = VD^(-1) H W ED^(-1) H^T, and proceeds to step (4.5.2);
wherein D^(-1) denotes the inverse matrix of D, with D^(-1) D = E*;
Step (4.5.2) obtains, from the random walk with restart, the visit distribution over the whole superpoint set as y(t+1) = αPy(t) + (1-α)q, and proceeds to step (4.5.3);
Step (4.5.3) performs the initialization: the vector y(1) is the initial distribution vector, each of whose elements is the reciprocal of the total number of superpoints, and the vector y(1) has the same length as VO; proceed to step (4.5.4);
Step (4.5.4) starts the random walk from the initial distribution vector y(1), performing steps (4.5.4.1-4.5.4.2) to iterate y(t) until the specified number of iterations Tmax is reached, then proceeds to step (4.5.5);
Step (4.5.4.1) computes y(t+1) according to step (4.5.2) and proceeds to step (4.5.4.2);
Step (4.5.4.2) updates t to t+1;
Step (4.5.5) the random walk process stops; at this point the vector y(t) no longer changes;
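Steps (4.5.1)-(4.5.5) can be sketched as below, under the same dimension conventions as before (H an n×m incidence matrix, VD and ED the superpoint and hyperedge degree matrices); the function name and degree computations are assumptions:

```python
import numpy as np

def random_walk_with_restart(H, W, q, alpha, t_max):
    """Sketch of step (4.5): build P = VD^(-1) H W ED^(-1) H^T, then iterate
    y(t+1) = alpha * P y(t) + (1 - alpha) * q from the uniform y(1)."""
    n, m = H.shape
    dv = H @ np.diag(W)                       # superpoint degrees d(v)
    de = H.sum(axis=0)                        # hyperedge degrees
    P = np.diag(1.0 / dv) @ H @ W @ np.diag(1.0 / de) @ H.T   # step (4.5.1)
    y = np.full(n, 1.0 / n)                   # step (4.5.3): uniform initial vector
    for _ in range(t_max):                    # step (4.5.4): iterate until Tmax
        y = alpha * (P @ y) + (1 - alpha) * np.asarray(q)
    return y                                  # step (4.5.5): converged distribution
```

Because the spectral radius of αP is at most α < 1, the iteration converges to the fixed point of y = αPy + (1-α)q, which is why y(t) stops changing in step (4.5.5).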
Step (4.6) sorts the vector y(t) by value in descending order and proceeds to step (4.7);
Step (4.7) creates a new empty recommendation list recommend and proceeds to step (4.8);
Step (4.8) performs steps (4.8.1-4.8.2) to obtain the topQ APIs with the largest y(t) values, puts them into the recommend list as the recommended APIs, and proceeds to step (4.9);
Step (4.8.1) traverses the superpoints corresponding to the values in the vector y(t) and performs step (4.8.2) until the length of recommend reaches topQ;
Step (4.8.2) checks the type of the superpoint: if it is of API type, adds it to the recommendation list; otherwise skips to the next iteration;
Step (4.9) ends the loop and returns the recommendation list of length topQ; at this point the superpoints in recommend, sorted by value and all of API type, are the desired recommended APIs; the recommendation is complete.
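Steps (4.6)-(4.9) reduce to sorting by visit probability and filtering by superpoint type. A minimal sketch (the types mapping and its type strings are assumptions for illustration):

```python
def recommend_top_apis(y, VO, types, top_q):
    """Sketch of steps (4.6)-(4.9): sort superpoints by visit probability and
    keep the first top_q whose type is API."""
    order = sorted(range(len(VO)), key=lambda i: y[i], reverse=True)  # step (4.6)
    recommend = []                                                    # step (4.7)
    for i in order:                                                   # step (4.8)
        if types[VO[i]] == "API":                                     # step (4.8.2)
            recommend.append(VO[i])
        if len(recommend) == top_q:
            break
    return recommend                                                  # step (4.9)
```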
The beneficial effects of the present invention are as follows: describing the service recommendation scenario with the MATaTo data model makes full use of the information in that scenario; further, the consistent hypergraph covers the auxiliary information while overcoming the problems that auxiliary information is hard to fuse and high-order information is easily lost; further, clustering with the improved multi-level clustering method reduces the complexity of the hypergraph structure, effectively narrows the service search space, improves the accuracy of service recommendation, lowers the time cost of service recommendation, and mitigates the negative impact that large hypergraphs have on hypergraph-based recommendation algorithms; finally, setting the restart parameter and performing a random walk with restart yields a personalized recommendation list while guaranteeing the diversity of the recommendations.
Description of drawings
Fig. 1 shows the precision of the recommendation results of the method of the present invention and of other methods under different numbers of service categories.
Fig. 2 shows the recall (Recall) of the recommendation results of the method of the present invention and of other methods under different numbers of service categories.
Fig. 3 shows the discounted cumulative gain (DCG@R) of the recommendation results of the method of the present invention and of other methods under different numbers of service categories.
Fig. 4 shows the Hamming distance (HMD) of the recommendation results of the method of the present invention and of other methods under different numbers of service categories.
Detailed description
The present invention is further described below in conjunction with the accompanying drawings.
Referring to Figs. 1 to 4, a SaaS-oriented Web Api diversity recommendation method incorporating a random walk with restart algorithm comprises the following steps:
The first step extracts the MATaTo (Mashup-API-Tag-Topic) data model; the process is as follows:
Step (1.1) crawls the Mashup service collection information from the ProgrammableWeb platform and proceeds to step (1.2);
Step (1.2) analyzes the crawled Mashup service collection, selects the meta-information that is rich in content and helps improve recommendation accuracy, and proceeds to step (1.3);
wherein the crawled information comprises the service description documents, the call relationships between Mashups and APIs, topics, tags, service followers, service call links, and update log information;
wherein the service description document is the replacement for the WSDL document of traditional Web services and describes the service information in intuitive language; it has long been a research focus of service recommendation researchers, many of whom process and analyze description documents with the TF-IDF model or topic models;
wherein the call relationship between Mashups and APIs captures the inseparable relationship between the two, which existing research often neglects; therefore the Mashup-API call relationship needs to be used, and on the basis of the call relationship, the popularity of Mashups and APIs as well as the Mashup-API, Mashup-Mashup, and API-API co-occurrence information can be derived;
wherein topics reflect the functional characteristics of a service and give a rough qualitative indication of its scope of use, services under the same topic having similar functions; tags are usually added manually by users and supplement the service description from the users' subjective point of view;
wherein the service call link only represents the resource address of the service and is of no actual help; therefore the service call link information is not used;
Step (1.3) defines the symbols of the MATaTo model: M denotes the Mashup set, A denotes the API set, Ta denotes the tag set, To denotes the topic set, R1 denotes the relationship between a Mashup and its attributes, R2 denotes the relationship between an API and its attributes, R3 denotes the relationships among Mashups, R4 denotes the relationships among APIs, R5 denotes the relationship between Mashups and APIs, n denotes the number of superpoints, m denotes the number of hyperedges, WHG ∈ R^(m×n) denotes the weighted hypergraph, v denotes a superpoint, V denotes the set of n superpoints, e denotes a hyperedge, E denotes the set of m hyperedges, and R^(m×n) denotes the set of real matrices of size m×n; proceed to step (1.4);
Step (1.4) traverses the crawled information and performs steps (1.4.1-1.4.2) to extract the MATaTo data model from it;
Step (1.4.1) extracts four types of information: the service description documents, the Mashup-API call relationships, topics, and tags; then performs step (1.4.2);
wherein the four types of information are described as follows:
Mashup (M*) denotes a Mashup service, the requester of recommendations in the service recommendation scenario; a Mashup itself carries tag and topic information and has call relationships with APIs;
API (A*) denotes a Web API, the recommended object in the service recommendation scenario; like the Mashup service object, it also carries tag and topic information, and it stands in a called relationship with Mashups;
Tag (Ta*) denotes the crawled tag information; Mashups and APIs with the same tag are more closely related;
Topic (To*) denotes the crawled topic information; Mashups and APIs under the same topic have similar functions;
Step (1.4.2) derives five types of high-order relationships: R1(M*-Ta*-To*), R2(A*-Ta*-To*), R3(M*-M*), R4(A*-A*), and R5(M*-A*);
wherein the five high-order relationships are described as follows:
R1(M*-Ta*-To*) denotes the relationship between a Mashup and its attributes: besides its own description document, a Mashup is supplemented by the topic and tag information it carries;
R2(A*-Ta*-To*) denotes the relationship between an API and its attributes; as with R1, an API is supplemented by the topic and tag information it carries, and a Mashup calls APIs with the same topics and tags;
R3(M*-M*) denotes the relationships among Mashups; the present invention computes the similarity between the description documents of different Mashups to represent the relationships between different Mashup services;
R4(A*-A*) denotes the relationships among APIs; as with R3, similarity is used to represent the relationships between API services; in addition, two or more APIs that are called together reflect their co-occurrence, which improves the recommendation accuracy to a certain extent;
R5(M*-A*) denotes the relationship between Mashups and APIs; a Mashup is composed of APIs, and besides the explicit call relationship, the topics or tags of the APIs can also describe the Mashup functionally;
The second step constructs the hypergraph based on the MATaTo data model;
Step (2.1) models the Mashup and API description documents with the NMF model, obtains the corresponding topic vectors, and proceeds to step (2.2);
wherein the NMF model is non-negative matrix factorization, an unsupervised learning algorithm whose purpose is to extract useful features, commonly used in fields such as image feature recognition;
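A minimal NMF by multiplicative updates illustrates step (2.1); the particular solver is an assumption, since the source only states that NMF is used to obtain topic vectors:

```python
import numpy as np

def nmf_topic_vectors(X, k, iters=200, seed=0):
    """Factor a non-negative document-term matrix X (documents x terms) as
    X ≈ T @ Hm with T, Hm >= 0; the rows of T are the documents' topic vectors.
    Multiplicative-update solver (illustrative choice, not fixed by the source)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    T = rng.random((n, k)) + 1e-3         # document-topic factor
    Hm = rng.random((k, d)) + 1e-3        # topic-term factor
    for _ in range(iters):
        Hm *= (T.T @ X) / (T.T @ T @ Hm + 1e-9)   # update topic-term factor
        T *= (X @ Hm.T) / (T @ Hm @ Hm.T + 1e-9)  # update document-topic factor
    return T, Hm
```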
Step (2.2) performs steps (2.2.1-2.2.4) according to the MATaTo data model, forming the superpoints of the hypergraph from the four types of objects;
Step (2.2.1) defines the Mashup set M = {m1, m2, m3, ..., mN} and proceeds to step (2.2.2);
wherein N is the total number of Mashups;
Step (2.2.2) defines the set of APIs called by Mashup mi as Ai = {ai1, ai2, ai3, ..., aik'} and proceeds to step (2.2.3);
wherein i denotes the index of the current Mashup in the set M and k' is the total number of APIs called by Mashup mi; the set of APIs called by all Mashups is then the union of the sets Ai over all Mashups, where ∪ denotes the union;
Step (2.2.3) defines Ta and To as the sets of tag and topic information carried by all Mashups and APIs, and proceeds to step (2.2.4);
Step (2.2.4) initializes the superpoint set of the hypergraph as V = M ∪ A ∪ Ta ∪ To;
Step (2.3) builds ten types of hyperedges from the five high-order relationships, defining the hyperedge types E_M*kTo*, E_M*kTa*, E_A*kTo*, E_A*kTa*, E_M*To*Ta*, E_A*To*Ta*, E_M*A*To*, E_M*A*Ta*, and two further types for API popularity and API co-occurrence, and builds the hyperedge set of the hypergraph;
wherein the hyperedge definitions are specified as follows:
E_M*kTo*: a topic can describe a Mashup functionally, but a Mashup itself carries a limited number of topics and extends its own topics through Mashups similar to it; for each Mashup, the top-k similar Mashups and their topics are selected to form one hyperedge, whose weight is set to the average of the similarities to the k Mashups; the similarity is the common cosine similarity, cos θ = (vm_x' · vm_y') / (|vm_x'| |vm_y'|);
wherein θ denotes the angle between the feature vectors, vm_x' denotes the feature vector of the x'-th Mashup, and |vm_x'| denotes the norm of the feature vector;
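The cosine similarity above can be sketched as follows (names hypothetical):

```python
import math

def cosine_similarity(v1, v2):
    """cos θ = (v1 · v2) / (|v1| |v2|), as used for Mashup feature vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm
```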
E_M*kTa*: tags can describe a Mashup in detail, but a Mashup itself carries a limited number of tags and extends its own tags through Mashups similar to it; for each Mashup, the top-k similar Mashups and their tags are selected to form one hyperedge, whose weight is set to the average of the similarities to the k Mashups; the similarity is computed with the same cosine similarity formula as in step (2.3);
E_A*kTo*: an API carries a limited number of topics and can extend them through its similarity to other APIs; for each API, the top-k similar APIs and their topics are selected to form one hyperedge, whose weight is set to the average of the similarities to the k APIs; the similarity is computed with the same cosine similarity formula as in step (2.3);
E_A*kTa*: likewise, an API can extend its own tags through its similarity to other APIs; for each API, the top-k similar APIs and their tags are selected to form one hyperedge, whose weight is set to the average of the similarities to the k APIs; the similarity is computed with the same cosine similarity formula as in step (2.3);
E_M*To*Ta*: a Mashup itself carries topic and tag information; for each Mashup, any number of the topics and tags it carries are selected to form one hyperedge, whose weight is set to 1;
E_A*To*Ta*: an API itself carries topic and tag information; for each API, any number of the topics and tags it carries are selected to form one hyperedge, whose weight is set to 1;
E_M*A*To*: a Mashup calls one or more APIs, whose topics can likewise supplement the Mashup's topics; for each Mashup, all the APIs it calls and the topics they carry are selected to form one hyperedge, whose weight is set to 1;
E_M*A*Ta*: likewise, a Mashup calls one or more APIs, whose tags can supplement the Mashup's tags; for each Mashup, all the APIs it calls and the tags they carry are selected to form one hyperedge, whose weight is set to 1;
Popularity hyperedge: popular APIs tend to represent higher quality; for each API, one hyperedge is built with the topic it belongs to, and the weight of this edge is set to the popularity;
wherein Call(Ar) denotes the frequency with which API Ar is called by Mashups, Topic(Ar) denotes all APIs with the same topic as Ar, MinCall(Topic(Ar)) denotes the lowest call frequency among all APIs of the same topic as Ar, and MaxCall(Topic(Ar)) denotes the highest call frequency among all APIs of the same topic as Ar;
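The popularity formula itself is not legible in the source; the symbols it defines (Call, MinCall, and MaxCall over one topic) suggest a min-max scaling of call frequency within the API's topic. The following sketch is therefore an assumption, not the source's exact formula:

```python
def popularity(api, calls, topic_of):
    """Assumed min-max reading of the popularity weight:
    (Call(Ar) - MinCall(Topic(Ar))) / (MaxCall(Topic(Ar)) - MinCall(Topic(Ar))).

    calls: api -> number of times it is called by Mashups
    topic_of: api -> its topic
    """
    peers = [a for a in calls if topic_of[a] == topic_of[api]]
    lo = min(calls[a] for a in peers)   # MinCall(Topic(Ar))
    hi = max(calls[a] for a in peers)   # MaxCall(Topic(Ar))
    if hi == lo:
        return 1.0                      # single call level in topic (edge case, assumption)
    return (calls[api] - lo) / (hi - lo)
```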
Co-occurrence hyperedge: APIs that are often called together are correlated; this edge contains all APIs whose co-occurrence count is greater than 1, and its weight is set to the co-occurrence degree;
wherein |Am' ∩ ... ∩ An'| denotes the number of times the APIs Am', ..., An' are called by the same Mashup, and |Am' ∪ ... ∪ An'| denotes the total number of times the APIs Am', ..., An' are called by Mashups;
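Reading the symbols above as the count of Mashups that call all of the APIs over the count that call any of them gives a Jaccard-style ratio; since the source's formula image is not reproduced, this reading is an assumption:

```python
def co_occurrence(api_calls, apis):
    """Assumed co-occurrence degree of a group of APIs:
    |Am' ∩ ... ∩ An'| / |Am' ∪ ... ∪ An'|.

    api_calls: api -> set of Mashups that call it.
    """
    sets = [api_calls[a] for a in apis]
    inter = set.intersection(*sets)   # Mashups calling all of the APIs together
    union = set.union(*sets)          # Mashups calling any of the APIs
    return len(inter) / len(union)
```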
Third step: perform multi-level clustering on the hypergraph formed by the MATaTo data model. The process is as follows:
Step (3.1), coarsening stage: carry out steps (3.1.1-3.1.5) to coarsen the superpoints of the original graph repeatedly until the hypergraph shrinks to a certain size.
Step (3.1.1): define the symbols required for the coarsening stage: c denotes the weight set of the superpoints, w the weight set of the hyperedges, G=(V,E,c,w) the original hypergraph, iterTime the number of iterations, coarse_limit the coarsening threshold, max_weight the superpoint weight threshold, and G'=(V',E',c',w') the coarsened graph, where V' denotes the superpoints not merged during coarsening, E' the coarsened hyperedges, c' the weight set of the coarsened superpoints, and w' the weight set of the coarsened hyperedges; proceed to step (3.1.2);
Step (3.1.2): add all superpoints to the unvisited superpoint set UV; proceed to step (3.1.3);
Step (3.1.3): initialize an empty set C to record all coarsened superpoints; proceed to step (3.1.4);
Step (3.1.4): define a variable I to store the iteration count and initialize it to 0; carry out steps (3.1.4.1-3.1.4.4) to coarsen the superpoints in the set UV; proceed to step (3.1.5);
Step (3.1.4.1): define a set UVT to store the superpoints to be coarsened in this iteration, and copy the superpoints in UV into UVT; proceed to step (3.1.4.2);
Step (3.1.4.2): carry out steps (3.1.4.2.1-3.1.4.2.12) to coarsen the superpoints in UVT until the number of iterations reaches len(UVT); proceed to step (3.1.4.3);
where len() returns the size of a set, and len(UVT) is the size of the set UVT;
Step (3.1.4.2.1): randomly take one superpoint from the set UVT and name it rv; proceed to step (3.1.4.2.2);
Step (3.1.4.2.2): judge whether the superpoint rv belongs to the set UVT; if so, proceed to step (3.1.4.2.3), otherwise go directly to the next loop iteration;
Step (3.1.4.2.3): define a superpoint cv and copy rv into cv; define score to store the rating and initialize it to 0; proceed to step (3.1.4.2.4);
Step (3.1.4.2.4): traverse the other superpoints adjacent to rv, copying each of them into iv, and carry out steps (3.1.4.2.4.1-3.1.4.2.4.3) to find the superpoint iv to be merged with rv; proceed to step (3.1.4.2.5);
where the superpoints adjacent to rv are the superpoints lying in the same hyperedge as rv;
Step (3.1.4.2.4.1): compute the sum of the weight of rv and the weight of iv;
Step (3.1.4.2.4.2): judge whether this weight sum is greater than the superpoint weight threshold max_weight; if not, proceed to step (3.1.4.2.4.3); if so, go directly to the next loop iteration;
Step (3.1.4.2.4.3): judge whether iv is a superpoint belonging to the set UVT; if so, go directly to the next loop iteration, otherwise end the loop directly;
Step (3.1.4.2.5): compute the score of rv and iv with the scoring function, define a variable cur_score, and store the score in cur_score; proceed to step (3.1.4.2.6);
where the scoring function is score(rv,iv) = a·con(rv,iv) + b·hNcut(rv,iv); the values of a and b are empirical, obtained by tuning against the experimental results. The term con(rv,iv) is a sum over the hyperedges shared by rv and iv, where I*(rv) and I*(iv) denote the sets of hyperedges containing the superpoints rv and iv respectively, ∩ denotes set intersection, i.e., taking the elements present in the sets on both sides of the symbol, c*(rv) and c*(iv) are the weights of the superpoints rv and iv, e1 denotes a hyperedge containing both rv and iv, w(e1) denotes the weight of the hyperedge e1, de(e1) denotes the degree of e1, with de(e1)=|e1|, where |e1| is the number of superpoints in e1, and Σ denotes summation. The term hNcut(rv,iv) denotes the normalized cut of the hyperedges containing the superpoints rv and iv, representing the correlation of the two clusters obtained when rv and iv are each regarded as a cluster, where dv(rv) and dv(iv) denote the degrees of the superpoints rv and iv, i.e., the sums of the weights of the hyperedges containing rv and iv respectively, e2 denotes a hyperedge belonging to the set Qe, and Qe denotes the set of hyperedges cut by these two clusters; a hyperedge is said to be cut when its superpoints lie in different clusters;
Step (3.1.4.2.6): judge whether cur_score is greater than score; if so, copy the superpoint iv into cv and assign the value of cur_score to score; proceed to step (3.1.4.2.7);
Step (3.1.4.2.7): merge rv and cv; the present invention uses a doubly linked list to connect superpoints that have a merge relationship; then proceed to step (3.1.4.2.8);
where a doubly linked list is a kind of linked list in which every data node holds two pointers, one to its direct successor and one to its direct predecessor, i.e., two directly connected nodes hold pointers to each other;
Step (3.1.4.2.8): add the weight of the merged superpoint cv to rv; proceed to step (3.1.4.2.9);
Step (3.1.4.2.9): remove rv and cv from the UVT used in the current iteration round; proceed to step (3.1.4.2.10);
Step (3.1.4.2.10): remove cv from UV; proceed to step (3.1.4.2.11);
Step (3.1.4.2.11): add the superpoint cv to the set C; proceed to step (3.1.4.2.12);
Step (3.1.4.2.12): if the number of coarsened superpoints reaches the coarsening threshold coarse_limit, exit the iteration and execute step (3.1.4.3); otherwise go directly to the next round of iteration;
Step (3.1.4.3): increment I by 1, i.e., I=I+1; proceed to step (3.1.4.4);
Step (3.1.4.4): judge whether the iteration count I is less than iterTime and whether the number of coarsened superpoints has reached the coarsening threshold coarse_limit; if either condition is not met, stop coarsening, stop iterating, and execute step (3.1.5).
Step (3.1.5): copy the coarsened superpoints, hyperedges, superpoint weight set, and hyperedge weight set into V', E', c', and w' respectively, obtaining the coarsening result, i.e., the hypergraph G'=(V',E',c',w');
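The inner coarsening pass of steps (3.1.4.2.1-3.1.4.2.12) can be sketched as follows. The scoring function is abstracted as a callable, since its closed form is given in the drawings, and all names and data structures (plain dicts instead of the doubly linked list) are illustrative assumptions:

```python
import random

def coarsen_pass(uv, weight, neighbors, score, max_weight, coarse_limit):
    """One coarsening pass over the unvisited-superpoint set uv.
    Returns the merged (coarsened) superpoints and the merge links."""
    uvt = set(uv)                 # superpoints still mergeable this pass
    merged, links = [], {}        # links: rv -> cv, the merge relationship
    while uvt:
        rv = random.choice(tuple(uvt))
        best, best_score = None, 0.0
        for iv in neighbors[rv]:
            if weight[rv] + weight[iv] > max_weight:
                continue          # merging would exceed the weight threshold
            if iv not in uvt:
                continue
            s = score(rv, iv)
            if s > best_score:    # keep the highest-scoring candidate
                best, best_score = iv, s
        if best is not None:
            links[rv] = best              # record the merge relationship
            weight[rv] += weight[best]    # fold cv's weight into rv
            uvt.discard(best)
            uv.discard(best)
            merged.append(best)
        uvt.discard(rv)
        if len(merged) >= coarse_limit:   # coarsening threshold reached
            break
    return merged, links
```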
Step (3.2), clustering stage: carry out steps (3.2.1-3.2.7), performing spectral clustering with different parameters on the coarsest hypergraph obtained in the coarsening stage, and select the clustering with the best result; proceed to step (3.3);
Step (3.2.1): carry out steps (3.2.1.1-3.2.1.6) to construct the Laplacian matrix; proceed to step (3.2.2);
where the Laplacian matrix is a matrix representation of a graph, applied mainly in graph theory;
Step (3.2.1.1): define VO and EO as arrays storing, respectively, all superpoints of the hypergraph with their corresponding index positions and all hyperedges of the hypergraph with their corresponding index positions; store all superpoints with their index positions in VO and all hyperedges with their index positions in EO; proceed to step (3.2.1.2);
Step (3.2.1.2): construct the similarity matrix S from the hypergraph information obtained with the hyperedge construction method and weight calculation method of step (2.3); proceed to step (3.2.1.3);
where the similarity matrix is an important statistical technique used to organize the pairwise similarities of a set of data points; in the present invention the similarity is the hyperedge weight;
Step (3.2.1.3): construct the superpoint degree matrix VD and the hyperedge degree matrix ED from the similarity matrix; proceed to step (3.2.1.4);
where the superpoint degree matrix VD is the diagonal matrix formed by the superpoint degrees, its diagonal elements being the superpoint degrees dv(va), with va denoting the superpoint at index position a of VO;
where the hyperedge degree matrix ED is the diagonal matrix formed by the hyperedge degrees, its diagonal elements being the hyperedge degrees de(eb), with eb denoting the hyperedge at index position b of EO;
Step (3.2.1.4): from the similarity matrix, carry out steps (3.2.1.4.1-3.2.1.4.4) to construct the incidence matrix H; proceed to step (3.2.1.5);
where the incidence matrix is an n×m matrix whose d-th row represents all the hyperedges to which the d-th superpoint belongs and whose f-th column represents all the superpoints contained in the f-th hyperedge; the entry in row d and column f is 1 if the d-th superpoint belongs to the f-th hyperedge and 0 otherwise;
Step (3.2.1.4.1): create an n×m all-zero matrix H; proceed to step (3.2.1.4.2);
Step (3.2.1.4.2): traverse all hyperedges in EO; proceed to step (3.2.1.4.3);
Step (3.2.1.4.3): traverse all superpoints in the hyperedge; proceed to step (3.2.1.4.4);
Step (3.2.1.4.4): set the 0 at the index position of matrix H corresponding to this hyperedge and superpoint to 1;
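Steps (3.2.1.4.1-3.2.1.4.4) amount to the following numpy sketch, with VO and EO modeled as plain Python lists (an assumption of this sketch):

```python
import numpy as np

def incidence_matrix(VO, EO):
    """Build the n×m 0/1 incidence matrix H: H[d, f] = 1 iff
    superpoint VO[d] belongs to hyperedge EO[f]."""
    index = {v: d for d, v in enumerate(VO)}   # superpoint -> row index
    H = np.zeros((len(VO), len(EO)))
    for f, edge in enumerate(EO):              # traverse all hyperedges
        for v in edge:                         # traverse its superpoints
            H[index[v], f] = 1.0
    return H
```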
Step (3.2.1.5): from the hypergraph information obtained with the hyperedge construction method and weight calculation method of step (2.3), carry out steps (3.2.1.5.1-3.2.1.5.3) to construct the weight matrix W; proceed to step (3.2.1.6);
where the weight matrix W is the diagonal matrix of hyperedge weights, its diagonal elements being the weights of the hyperedges;
Step (3.2.1.5.1): create an m×m all-zero matrix W; proceed to step (3.2.1.5.2);
Step (3.2.1.5.2): traverse all hyperedges in EO; proceed to step (3.2.1.5.3);
Step (3.2.1.5.3): take the weight of each hyperedge as the value at the diagonal position of the weight matrix given by its index in EO, replacing the corresponding 0 in W;
Step (3.2.1.6): construct the Laplacian matrix L from the incidence matrix, the weight matrix, the hyperedge degree matrix, and the superpoint degree matrix;
where the Laplacian matrix is computed as L = E* - VD^(-1/2) H W ED^(-1) H^T VD^(-1/2), where E* denotes the identity matrix, i.e., the matrix whose elements on the diagonal from the upper-left corner to the lower-right corner are all 1 and whose other positions are all 0; VD^(-1/2) and ED^(-1) denote the matrix VD raised to the power (-1/2) and the matrix ED raised to the power -1; since ED and VD are both diagonal matrices, it suffices to raise the numbers on their diagonals to the corresponding powers; H^T denotes the transpose of H, i.e., the matrix obtained by turning the rows of H into the columns with the same index;
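A minimal numpy sketch of the normalized hypergraph Laplacian above, assuming H and the hyperedge weights from the preceding steps and deriving the two degree matrices from them:

```python
import numpy as np

def hypergraph_laplacian(H, edge_weights):
    """L = I - VD^(-1/2) H W ED^(-1) H^T VD^(-1/2), with VD the diagonal
    superpoint-degree matrix and ED the diagonal hyperedge-degree matrix."""
    edge_weights = np.asarray(edge_weights, dtype=float)
    W = np.diag(edge_weights)        # m×m diagonal hyperedge weights
    dv = H @ edge_weights            # superpoint degrees: sum of incident edge weights
    de = H.sum(axis=0)               # hyperedge degrees: superpoints per edge
    VD_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    ED_inv = np.diag(1.0 / de)
    theta = VD_inv_sqrt @ H @ W @ ED_inv @ H.T @ VD_inv_sqrt
    return np.eye(H.shape[0]) - theta
```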
Step (3.2.2): normalize the Laplacian matrix to obtain NL; proceed to step (3.2.3);
where normalizing a matrix means mapping its data, column by column, into the interval (0,1), i.e., subtracting from each column its minimum value and dividing by its maximum value;
Step (3.2.3): solve the Laplacian matrix NL, computing its eigenvalues and obtaining its eigenvectors; proceed to step (3.2.4);
where eigenvalues and eigenvectors are defined as follows: matrix multiplication corresponds to a transformation that turns an arbitrary vector into a new vector, usually of different direction or length; if the matrix only stretches a certain vector or certain vectors without rotating them, those vectors are called the eigenvectors of the matrix, and the stretch ratios are the eigenvalues;
Step (3.2.4): arrange the eigenvectors as columns of a matrix NE and normalize it; proceed to step (3.2.5);
Step (3.2.5): define bestC to store the best clustering result; define BestE to store the score of the best clustering result and initialize BestE to 0; define T_num to store the number of repeated clusterings and initialize T_num to 0; define retryTime as the number of times the clustering operation is repeated; proceed to step (3.2.6);
Step (3.2.6): execute steps (3.2.6.1-3.2.6.5), clustering NE and evaluating the clustering quality, looping retryTime times to find the clustering with the highest solution quality; execute step (3.2.7);
Step (3.2.6.1): increment T_num by 1, i.e., T_num=T_num+1; proceed to step (3.2.6.2);
Step (3.2.6.2): cluster the matrix NE with the K-Means method and take the clustering of the vectors as the clustering of the corresponding superpoints; proceed to step (3.2.6.3);
where K-Means is a common clustering algorithm; it divides the data set into K clusters, each cluster being represented by the mean of all samples inside it, which is called its "centroid".
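Steps (3.2.3)-(3.2.6.2) can be sketched as follows; the eigen-solver and the tiny K-Means loop are illustrative stand-ins, and the column normalization of step (3.2.2) is replaced here by a simple row normalization of the embedding, an assumption of this sketch:

```python
import numpy as np

def spectral_embed_and_cluster(L, k, iters=50, seed=0):
    """Embed superpoints with the eigenvectors of L belonging to the k
    smallest eigenvalues, then run a plain K-Means on the rows."""
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    NE = vecs[:, :k]                        # one k-dim row per superpoint
    NE = NE / np.linalg.norm(NE, axis=1, keepdims=True)  # row-normalize
    rng = np.random.default_rng(seed)
    centers = NE[rng.choice(len(NE), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(NE[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)           # assign rows to nearest centroid
        for j in range(k):
            if np.any(labels == j):
                centers[j] = NE[labels == j].mean(axis=0)
    return labels
```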
Step (3.2.6.3): evaluate the clustering quality with the scoring function; proceed to step (3.2.6.4);
where in the scoring function, me denotes the number of clusters in the clustering result bestC and Vmi denotes the mi-th cluster of bestC; in the formula for hCut(Vmi), w(ecut) denotes the weight of the hyperedge ecut, β(Vmi) denotes the set of hyperedges cut by Vmi, ecut denotes a hyperedge of the set β(Vmi), the complement of Vmi in the set V also appears, and |ecut∩Vmi| denotes the size of the intersection of ecut and Vmi, i.e., the number of superpoints present in both the hyperedge ecut and Vmi;
Step (3.2.6.4): if the score of this clustering is higher than BestE, carry out steps (3.2.6.4.1-3.2.6.4.2) to replace the previous clustering result with this one; otherwise proceed to step (3.2.6.5);
Step (3.2.6.4.1): replace the content of bestC with this clustering result; proceed to step (3.2.6.4.2);
Step (3.2.6.4.2): assign the score of this clustering result to BestE;
Step (3.2.6.5): judge whether T_num is greater than or equal to retryTime; if so, exit the loop, otherwise enter the next loop iteration;
Step (3.2.7): end the loop, obtaining the clustering result bestC;
Step (3.3), refinement stage: carry out steps (3.3.1-3.3.3) to map the superpoints coarsened in the coarsening stage into the clusters; end;
Step (3.3.1): define UV' to denote the superpoints to be refined and copy the superpoint set V' of the hypergraph G' into UV'; proceed to step (3.3.2);
Step (3.3.2): enter a loop and execute steps (3.3.2.1-3.3.2.4), refining the superpoints in UV' until all superpoints have been refined; proceed to step (3.3.3);
Step (3.3.2.1): randomly take a superpoint v* out of the set UV'; proceed to step (3.3.2.2);
Step (3.3.2.2): traverse, in order, the doubly linked list storing the superpoints merged into v*, obtain each superpoint v** in it, and execute steps (3.3.2.2.1-3.3.2.2.6) to find the cluster to which v** belongs; proceed to step (3.3.2.3);
Step (3.3.2.2.1): define cc to denote the cluster to which the superpoint v** belongs, initialized to the cluster of v*; proceed to step (3.3.2.2.2);
Step (3.3.2.2.2): define distance to denote the distance between the superpoint v** and its cluster, and initialize distance by computing the distance between v** and the cluster cc; proceed to step (3.3.2.2.3);
where, in the distance formula, a penalty is applied according to the superpoint weight when the type of the superpoint is topic: such a superpoint carries functional meaning within the cluster, moving it may shift the function of the cluster, and the larger the superpoint weight, the larger the influence of the corresponding function inside its cluster; vr1, vr2, and vr3 denote superpoints existing in the cluster cc, Qe' denotes the set of hyperedges containing both vr1 and vr2, e'' denotes a hyperedge of that set, Qe'' denotes the set of hyperedges containing both v** and vr3, e''' denotes a hyperedge of that set, len(cc)^2 denotes the square of len(cc), i.e., the square of the size of the set cc, and dv(v**) denotes the degree of the superpoint v**;
Step (3.3.2.2.3): traverse the clusters of the clustering result bestC, naming the current one cluster, and execute steps (3.3.2.2.3.1-3.3.2.2.3.2) to find the cluster closest to v**; execute step (3.3.2.2.4);
Step (3.3.2.2.3.1): compute the distance between v** and cluster with the distance formula of step (3.3.2.2.2); proceed to step (3.3.2.2.3.2);
Step (3.3.2.2.3.2): if the distance between v** and cluster is smaller than distance, update cc to cluster and update distance to the distance between v** and cluster;
Step (3.3.2.2.4): move v** into the cluster cc; proceed to step (3.3.2.2.5);
Step (3.3.2.2.5): subtract the weight of v** from the weight of v*; proceed to step (3.3.2.2.6);
Step (3.3.2.2.6): if v** is not the last superpoint of the doubly linked list, update v** to the next superpoint of the list and enter the next round of iteration; otherwise end the iteration;
Step (3.3.2.3): delete v* from the set UV'; proceed to step (3.3.2.4);
Step (3.3.2.4): if the set UV' is empty, exit the loop; otherwise enter the next loop iteration;
Step (3.3.3): obtain the refined final clustering result bestC; the refinement is complete;
Fourth step: recommend services for the Mashup based on the random walk with restart algorithm. The process is as follows:
where a random walk is a statistical model widely used to describe the motion of many natural entities; in the random walk with restart designed in the present invention, a superpoint of the hypergraph has two choices at each step: walk to an adjacent superpoint with a given probability α∈(0,1), or return to the initial superpoint with probability 1-α; since this random walk may return to the initial node at any moment, the correlation between the stationary probability distribution and the initial node is strengthened, so the final service recommendation list matches the requirement Mashup more closely;
Step (4.1): define the symbols of the random walk with restart model; proceed to step (4.2);
where tag denotes a Mashup tag, topic a Mashup topic, iterTime' the number of iterations, α the restart probability, topQ the number of recommendations, recommend the service recommendation list, and clusters(C1,C2,...,Ck) the clustering partition, with C1,C2,...,Ck denoting its clusters;
Step (4.2): define y(t) as the column vector formed by the probabilities of each superpoint being visited at time t; proceed to step (4.3);
Step (4.3): define y(1) as y(t) at t=1, the column vector formed by the probabilities of each superpoint being visited; store the reciprocal of the length of VO in y(1) in the same order as VO;
Step (4.4): carry out steps (4.4.1-4.4.2) to build the description document, tags, and topics of the requirement Mashup into an initial column vector q; proceed to step (4.5);
where q is an initial column vector of 0s and 1s in which the positions representing the initial superpoints are assigned 1 and the rest are 0;
Step (4.4.1): create an all-zero vector q of the same length as VO; proceed to step (4.4.2);
Step (4.4.2): compare the description document, tags, and topics of the requirement Mashup in turn with the superpoints in VO, find the identical superpoints, and set the corresponding index positions of q to 1;
Step (4.5): carry out steps (4.5.1-4.5.5) to compute the visit distribution vector of the superpoint set; proceed to step (4.6);
Step (4.5.1): build the transition probability matrix P from the superpoint and hyperedge degree matrices, the weight matrix, and the incidence matrix, with the formula P = VD^(-1) H W ED^(-1) H^T; proceed to step (4.5.2);
where the superscript -1 denotes the matrix inverse, e.g., VD^(-1) VD = E*;
Step (4.5.2): according to the random walk with restart, the visit distribution of the whole superpoint set is obtained as y(t+1) = αPy(t) + (1-α)q; proceed to step (4.5.3);
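Steps (4.5.1)-(4.5.2) can be sketched as follows; the transition matrix and the update rule are exactly the formulas above, while the convergence test is an added assumption standing in for the fixed iteration count Tmax:

```python
import numpy as np

def rwr_scores(H, edge_weights, q, alpha=0.85, max_iter=100, tol=1e-10):
    """Random walk with restart on a hypergraph.
    P = VD^(-1) H W ED^(-1) H^T;  y(t+1) = alpha*P*y(t) + (1-alpha)*q."""
    edge_weights = np.asarray(edge_weights, dtype=float)
    q = np.asarray(q, dtype=float)
    W = np.diag(edge_weights)
    dv = H @ edge_weights                    # superpoint degrees
    de = H.sum(axis=0)                       # hyperedge degrees
    P = np.diag(1.0 / dv) @ H @ W @ np.diag(1.0 / de) @ H.T
    y = np.full(len(q), 1.0 / len(q))        # y(1): equiprobable distribution
    for _ in range(max_iter):
        y_next = alpha * (P @ y) + (1 - alpha) * q
        if np.linalg.norm(y_next - y, 1) < tol:   # y(t) no longer changes
            return y_next
        y = y_next
    return y
```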
Step (4.5.3): perform the initialization: the vector y(1) is the initial distribution vector, taken as equiprobable, i.e., the value of every element is the reciprocal of the total number of superpoints, the elements of y(1) sum to 1, and y(1) has the same length as VO; proceed to step (4.5.4);
Step (4.5.4): starting from the initial distribution vector y(1), carry out steps (4.5.4.1-4.5.4.2) to start the random walk and iterate y(t) until the specified number of iterations Tmax is reached, i.e., t=Tmax; proceed to step (4.5.5);
Step (4.5.4.1): compute y(t+1) according to step (4.5.2); proceed to step (4.5.4.2);
Step (4.5.4.2): update t to t+1;
Step (4.5.5): the random walk stops; at this point the vector y(t) no longer changes;
Step (4.6): sort the entries of the vector y(t) by value in descending order; proceed to step (4.7);
Step (4.7): create a new empty recommendation list recommend; proceed to step (4.8);
Step (4.8): carry out steps (4.8.1-4.8.2), taking the topQ APIs with the largest y(t) values as recommended APIs and putting them into the recommend list; proceed to step (4.9);
Step (4.8.1): traverse the superpoints corresponding to the values of the vector y(t), performing step (4.8.2) until the length of recommend reaches topQ;
Step (4.8.2): judge the type of the superpoint; if it is of API type, add it to the recommendation list, otherwise skip to the next loop iteration;
Step (4.9): end the loop and return the recommendation list of length topQ. The superpoints in recommend, sorted by value and all of API type, are the desired recommended APIs; the recommendation is complete; end.
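Steps (4.6)-(4.9) reduce to a filtered top-Q selection; the type check is modeled here with a predicate, which is an assumption of this sketch:

```python
def top_q_apis(y, VO, is_api, topQ):
    """Sort superpoints by visit probability, keep the topQ API-typed ones."""
    order = sorted(range(len(y)), key=lambda i: y[i], reverse=True)
    recommend = []
    for i in order:
        if is_api(VO[i]):                  # keep only API-type superpoints
            recommend.append(VO[i])
            if len(recommend) == topQ:     # stop once the list is full
                break
    return recommend
```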
参照图1~图4,在不同服务类数下其他推荐方法与本发明方法对比分析结果,包括以下步骤:With reference to Fig. 1~Fig. 4, the comparative analysis result of other recommended methods and the method of the present invention under different number of service classes comprises the following steps:
第一步、定义10种推荐方法;The first step is to define 10 recommended methods;
TF-IDF:根据TF-IDF模型对服务描述文档以及需求文档建模得到向量,计算相似度,选择相似度最高的Web API服务。TF-IDF: According to the TF-IDF model, model the service description document and requirement document to obtain vectors, calculate the similarity, and select the Web API service with the highest similarity.
E-LDA:根据LDA模型得到服务和API的主题特征,进行相似度计算,选择相似度最高的Web API服务进行推荐。E-LDA: According to the LDA model, the subject features of the service and API are obtained, the similarity is calculated, and the Web API service with the highest similarity is selected for recommendation.
T+K+CF:在T+K聚类基础上将需求Mashup与各聚类中心进行相似度计算,选择最相似的聚类进行基于物品的协同过滤算法,将得到的结果进行排序,对topK个Web API进行推荐;T+K+CF: On the basis of T+K clustering, calculate the similarity between the demand Mashup and each clustering center, select the most similar clustering to perform an item-based collaborative filtering algorithm, sort the obtained results, and rank topK a Web API to recommend;
其中,T+K表示通过TF-IDF将每个Mashup服务描述文档表示成向量的形式,用Kmeans算法进行聚类。其中,Kmeans算法是一种常见的聚类算法。算法会将数据集分为K个簇,每个簇使用簇内所有样本均值来表示,将该均值称为“质心”。Among them, T+K means that each Mashup service description document is expressed in the form of a vector through TF-IDF, and the Kmeans algorithm is used for clustering. Among them, the Kmeans algorithm is a common clustering algorithm. The algorithm divides the data set into K clusters, and each cluster is represented by the mean value of all samples in the cluster, which is called the "centroid".
LDA+K+CF:与方法T+K+CF类似,只是将Mashup描述文档的向量表示通过LDA主题模型建模得到的主题向量替换。LDA+K+CF: Similar to the method T+K+CF, except that the vector representation of the Mashup description document is replaced by the topic vector obtained by modeling the LDA topic model.
T+K+PopK:在T+K聚类基础上选择最相似的聚类,通过对聚类中Web API的受欢迎度进行排序,返回最受欢迎的topK个Web API进行推荐。Web API的受欢迎度定义为该WebAPI被当前聚类中Mashup调用的次数。T+K+PopK: Select the most similar cluster based on T+K clustering, sort the popularity of Web APIs in the clusters, and return the most popular topK Web APIs for recommendation. The popularity of a Web API is defined as the number of times the Web API is called by the Mashup in the current cluster.
RWR:根据Mashup和API的调用关系构建超图模型,将需求Mashup构建为初始向量,直接使用带重启的随机游走进行推荐。RWR: Construct a hypergraph model based on the call relationship between mashup and API, construct the demand mashup as an initial vector, and directly use random walk with restart to make recommendations.
HPS+RWR:在超图谱聚类的基础上,根据距离公式选择最近聚类,构建为初始向量,使用带重启的随机游走进行推荐。HPS+RWR: On the basis of hypergraph clustering, select the nearest cluster according to the distance formula, construct it as an initial vector, and use random walk with restart for recommendation.
其中,谱聚类是谱聚类算法建立在图论中的谱图理论基础上,其本质是将聚类问题转化为图的最优划分问题,是一种点对聚类算法;Among them, spectral clustering is a spectral clustering algorithm based on the spectral graph theory in graph theory. Its essence is to transform the clustering problem into the optimal partition problem of the graph. It is a point-to-point clustering algorithm;
hMETIS+RWR:使用hMETIS算法进行超图分区,同样根据距离公式,并构建初始向量,使用带重启的随机游走进行推荐。hMETIS+RWR: Use the hMETIS algorithm to partition the hypergraph, also based on the distance formula, and construct the initial vector, and use the random walk with restart to make recommendations.
其中,hMETIS是目前流行的超图分区算法,在分区前会将超图进行粗化,并得到平衡的分区。Among them, hMETIS is a currently popular hypergraph partitioning algorithm, which coarsens the hypergraph before partitioning and obtains balanced partitions.
hMATaTo-RW:在MATaTo模型上进行随机游走,将需求Mashup构建为初始向量,迭代一定次数后将向量排序,得到Web API推荐列表。hMATaTo-RW: Perform a random walk on the MATaTo model, construct the demand Mashup as the initial vector, and sort the vectors after a certain number of iterations to obtain the Web API recommendation list.
hMATaTo-RWR:本发明提出的方法,在MATaTo模型上进行本发明提出的带重启的随机游走得到Web API推荐列表。hMATaTo-RWR: The method proposed by the present invention, the random walk with restart proposed by the present invention is performed on the MATaTo model to obtain the Web API recommendation list.
In the second step, Mashup service data are crawled from the ProgrammableWeb platform, and three public datasets (MovieLens, Cora, and TwitterFootball) are selected for the experiments. Four metrics are computed for the recommendation results of each method: precision, recall, discounted cumulative gain DCG@R, and Hamming distance HMD.
The precision is defined as follows:

Precision = |WA ∩ WAR| / |WAR|

where WA denotes the set of Web APIs actually invoked by the requirement Mashup service, and WAR denotes the recommendation list produced by a method. A higher precision value indicates a better recommendation result.
The recall is computed as:

Recall = |WA ∩ WAR| / |WA|

A higher recall value indicates a better recommendation result.
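As a minimal illustration of these two metrics (the API names below are hypothetical):

```python
def precision_recall(WA, WAR):
    """WA : set of Web APIs actually invoked by the requirement Mashup.
       WAR: recommendation list returned by a method."""
    hits = len(set(WA) & set(WAR))
    precision = hits / len(WAR) if WAR else 0.0
    recall = hits / len(WA) if WA else 0.0
    return precision, recall

# 2 of the 4 recommended APIs were actually invoked, out of 3 true APIs.
p, r = precision_recall({"api1", "api2", "api3"},
                        ["api1", "api4", "api2", "api5"])
# p = 2/4 = 0.5, r = 2/3
```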
The discounted cumulative gain DCG@R is computed as follows:

DCG@R = Σ_{i=1}^{R} r(i) / log2(i + 1)

where R is the number of recommended Web APIs, and r(i) indicates whether the i-th API in the recommendation list was actually invoked by the requirement Mashup: r(i) = 1 if it was, and 0 otherwise. A higher DCG@R value indicates a better recommendation result.
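A minimal sketch of this DCG@R computation, assuming the standard binary-relevance form with a base-2 logarithmic discount:

```python
import math

def dcg_at_r(rel):
    """rel: list of r(i) values in recommendation order, r(i) = 1 if the
    i-th recommended API was actually invoked by the requirement Mashup,
    else 0. Rank i starts at 1, so the discount is log2(i + 1)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rel))

# The same hit contributes more at rank 1 than at rank 3, so DCG@R
# rewards lists that place truly invoked APIs near the top.
top_hit = dcg_at_r([1, 0, 0])   # hit at rank 1: 1/log2(2) = 1.0
late_hit = dcg_at_r([0, 0, 1])  # hit at rank 3: 1/log2(4) = 0.5
```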
The Hamming distance is computed as follows:

HMD(Mm1, Mm2) = 1 − Q(Mm1, Mm2) / R

where Mm1 and Mm2 are two different requirement Mashups, and Q(Mm1, Mm2) is the number of Web APIs shared by the two recommendation lists produced for them. If the two recommendation lists contain exactly the same Web APIs, HMD(Mm1, Mm2) is 0; if they share none, it is 1. A higher Hamming distance value indicates more diverse recommendation results.
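A minimal sketch of the Hamming-distance computation between two equal-length recommendation lists (list contents are hypothetical):

```python
def hamming_distance(list1, list2):
    """HMD between two recommendation lists of equal length R:
       1 - Q / R, where Q is the number of Web APIs shared by the lists."""
    assert len(list1) == len(list2), "lists must have the same length R"
    Q = len(set(list1) & set(list2))
    return 1.0 - Q / len(list1)

# Identical lists -> 0.0; disjoint lists -> 1.0; one shared API of two -> 0.5.
same = hamming_distance(["a", "b"], ["a", "b"])
disjoint = hamming_distance(["a", "b"], ["c", "d"])
half = hamming_distance(["a", "b"], ["a", "c"])
```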
In the third step, the computed results are evaluated.
Figures 1 to 4 show, respectively, the precision, recall, discounted cumulative gain DCG@R, and Hamming distance HMD of the recommendation method of the present invention and of the comparison methods under different numbers of service categories. In each figure, the abscissa is the number of Mashup service categories and the ordinate is the corresponding metric.
As Figures 1 and 2 show, the hMATaTo-RWR algorithm outperforms the other algorithms in both recall and precision. Overall, the recall and precision of every method decline as the number of service categories grows, because a larger number of services naturally introduces more interference into the recommendation results.
As Figure 3 shows, the recommendation list of hMATaTo-RWR effectively differentiates service priorities. The main reason is that the present invention fully transforms the service information of the requirement Mashup into a vector that is iterated continuously throughout the walk, so the resulting stationary distribution is strongly correlated with that vector.
As Figure 4 shows, the Hamming distance of every method improves as the number of service categories increases. This is because more service categories bring more possibilities for the recommended services, which are no longer confined to a single category. Moreover, hMATaTo-RWR walks over the global set of nodes, so even unpopular services can be recommended, which yields better diversity.
The experimental results of the hMATaTo-RWR method proposed in this embodiment are better than those of the above methods. The reasons are as follows: hMATaTo-RWR uses spectral clustering, which fits the graph model better and yields better clustering results. In addition, it exploits a large amount of auxiliary information, such as tags and topic information, and combines the five high-order relations fused in the model, effectively alleviating the sparsity problem; at the same time, non-functional features such as co-occurrence and popularity are incorporated into the model. Consequently, the hMATaTo-RWR method shows very good results on all evaluation metrics.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202211226862.7A CN115599980A (en) | 2022-10-09 | 2022-10-09 | SaaS-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm | 
Publications (1)
| Publication Number | Publication Date | 
|---|---|
| CN115599980A true CN115599980A (en) | 2023-01-13 | 
Family
ID=84846327
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN119669580A (en) * | 2025-02-20 | 2025-03-21 | 江西省影票文化传媒股份有限公司 | A content recommendation method and system based on local life service platform | 
| CN119763672A (en) * | 2024-12-30 | 2025-04-04 | 广州悉地医药科技有限公司 | Method and system for excavating relationship between lncRNA and diseases | 
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| WW01 | Invention patent application withdrawn after publication |

             Application publication date: 20230113