+

CN115599980A - SaaS-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm - Google Patents

SaaS-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm Download PDF

Info

Publication number
CN115599980A
CN115599980A CN202211226862.7A CN202211226862A CN115599980A CN 115599980 A CN115599980 A CN 115599980A CN 202211226862 A CN202211226862 A CN 202211226862A CN 115599980 A CN115599980 A CN 115599980A
Authority
CN
China
Prior art keywords
super
mashup
matrix
api
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211226862.7A
Other languages
Chinese (zh)
Inventor
陆佳炜
费佳欢
马超治
王琪冰
肖刚
徐俊
张元鸣
郑嘉弘
梅浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Jiliang University
Original Assignee
China Jiliang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Jiliang University filed Critical China Jiliang University
Priority to CN202211226862.7A priority Critical patent/CN115599980A/en
Publication of CN115599980A publication Critical patent/CN115599980A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Saas-oriented Web Api diversity recommendation method with a restart random walk algorithm fusion function comprises the following steps: firstly, extracting a MATaTo data model; secondly, constructing a hypergraph based on the MATaTo data model; thirdly, performing multi-level clustering on a hypergraph formed by the MATaTo data model; and fourthly, recommending Mashup services based on a random walk algorithm with restart, defining symbols of a random walk model with restart, wherein random walk can return to an initial node at any time, and an obtained service recommendation list can better meet the requirement of Mashup. The invention reduces the complexity of the hypergraph structure, effectively reduces the service search space, improves the accuracy of service recommendation, reduces the time cost of service recommendation, and reduces the negative influence of a large hypergraph on a hypergraph-based recommendation algorithm; the personalized recommendation list is obtained while the data recommendation diversity is ensured.

Description

面向SaaS的融合带重启随机游走算法的Web Api多样性推荐 方法SaaS-oriented Web Api diversity recommendation with fusion and restart random walk algorithm method

技术领域technical field

本发明涉及Web环境中的Api推荐场景,具体涉及一种面向Saas的融合带重启随机游走算法的Web Api多样性推荐方法。The invention relates to an Api recommendation scene in a Web environment, in particular to a Saas-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm.

背景技术Background technique

SaaS全称为Software-as-a-service,意思是“软件即服务”。SaaS是一种基于互联网提供软件服务的应用模式,而Web API是SaaS的核心元素之一,能够提供特定功能的应用程序接口,可以对接类型多样的客户端。Mashup是由Web API构成的通过组合来自不同Web源的内容、表示或应用程序功能而生成的Web应用程序,其目标是结合这些资源来创建有用的新应用程序或服务。Mashup有简单、高可用和易于访问等特点,且其简单性优于功能完整性和完全可扩展性。The full name of SaaS is Software-as-a-service, which means "software as a service". SaaS is an application model that provides software services based on the Internet, and Web API is one of the core elements of SaaS, which can provide application programming interfaces with specific functions and can connect to various types of clients. A Mashup is a Web application composed of Web APIs that is generated by combining content, presentation, or application functionality from different Web sources, with the goal of combining these resources to create useful new applications or services. Mashups are simple, highly available, and easy to access, and their simplicity is better than functional integrity and full scalability.

如何当下海量的服务仓库中寻找合适的服务已成为重要课题。国内外不少学者已经开展了对服务推荐的研究。例如,Gao等人将Mashup的描述文档转化成向量,将向量用k-means算法进行聚类,并提出一种广义流式排序算法进行API推荐。Qi等人通过图神经网络学习Mashup和API的潜在表示,并通过一个具有注意力机制的神经网络分析需求Mashup采用各个API的概率,根据概率大小返回推荐列表。How to find suitable services in the current massive service warehouse has become an important issue. Many scholars at home and abroad have carried out research on service recommendation. For example, Gao et al. converted Mashup description documents into vectors, clustered the vectors with the k-means algorithm, and proposed a generalized streaming sorting algorithm for API recommendation. Qi et al. learned the potential representation of Mashup and API through a graph neural network, and analyzed the probability that each API was adopted by the demand Mashup through a neural network with an attention mechanism, and returned a recommendation list according to the probability.

超图是近年来在国际上流行的图模型,超图中的边可以联结任意个数量的超点,每条超边是超图点集中的非空子集。相较于普通图,超图能直观地对现实世界中的数据信息建模。将超图应用于Api推荐可以将高阶信息以超边的形式融入超图,在捕捉同类型、跨类型对象之间的关系的同时,生动、直观的可视化多元关系。Hypergraph is a popular graph model in the world in recent years. The edges in the hypergraph can connect any number of hyperpoints, and each hyperedge is a non-empty subset of the hypergraph point set. Compared with ordinary graphs, hypergraphs can intuitively model data information in the real world. Applying hypergraphs to Api recommendation can integrate high-level information into hypergraphs in the form of hyperedges. While capturing the relationships between objects of the same type and across types, they can vividly and intuitively visualize multiple relationships.

发明内容Contents of the invention

为了提升服务推荐准确率的同时提高服务推荐的多样性。本发明提出了面向Saas的融合带重启随机游走算法的Web Api多样性推荐方法。首先建立数据模型将Mashup信息建模,然后构建相应超图,并对原始图超点进行多级聚类减小搜索空间,最后,通过带重启的随机游走算法对Web API进行推荐。In order to improve the accuracy of service recommendation and improve the diversity of service recommendation. The invention proposes a Saas-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm. Firstly, a data model is established to model the Mashup information, and then the corresponding hypergraph is constructed, and the original graph superpoints are multi-level clustered to reduce the search space. Finally, the Web API is recommended through the random walk algorithm with restart.

本发明提供如下的技术方案:The present invention provides following technical scheme:

一种面向Saas的融合带重启随机游走算法的Web Api多样性推荐方法,所述方法包括以下步骤:A kind of Web Api diversity recommendation method of Saas-oriented fusion band restart random walk algorithm, described method comprises the following steps:

第一步、提取MATaTo(Mashup-API-Tag-Topic)数据模型,过程如下:The first step is to extract the MATaTo (Mashup-API-Tag-Topic) data model, the process is as follows:

步骤(1.1)从互联网网站爬取Mashup服务集合信息,进行步骤(1.2);Step (1.1) crawls the collection information of Mashup service from Internet website, carries out step (1.2);

步骤(1.2)对爬取到的Mashup服务集合进行分析,选择有利于提升服务推荐准确性且内容丰富的元信息,进行步骤(1.3);Step (1.2) analyzes the crawled Mashup service collection, selects meta information that is conducive to improving the accuracy of service recommendation and is rich in content, and proceeds to step (1.3);

其中,爬取到的信息有服务描述文档、Mashup与API的调用关系、主题、标签、服务关注者、服务调用链接、更新日志信息;Among them, the crawled information includes service description documents, call relationships between Mashup and API, topics, labels, service followers, service call links, and update log information;

步骤(1.3)定义MATaTo模型的符号,M表示Mashup集合、A表示API集合、Ta表示标签集合、To表示主题集合、R1、R2、R3、R4、R5表示分别表示五种高阶关系、n表示超点数量、m表示超边数量、WHG∈Rm×n表示加权超图、v表示超点、V表示由n个超点构成的集合、e表示超边、E表示由m个超边构成的集合,Rm×n表示大小为m×n的实数集合;进行步骤(1.4);Step (1.3) defines the symbols of the MATaTo model, M represents the Mashup set, A represents the API set, Ta represents the label set, To represents the topic set, R1, R2, R3, R4, R5 represent five high-order relationships respectively, and n represents The number of super vertices, m represents the number of hyper edges, WHG∈R m×n represents a weighted hypergraph, v represents a super point, V represents a set composed of n super points, e represents a hyper edge, E represents a set composed of m hyper edges The collection, R m * n represents the real number collection that size is m * n; Carry out step (1.4);

步骤(1.4)遍历爬取到的信息,从中提取出MATaTo数据模型;Step (1.4) traverses the crawled information and extracts the MATaTo data model therefrom;

第二步、基于MATaTo数据模型构建超图,过程如下:The second step is to build a hypergraph based on the MATaTo data model. The process is as follows:

步骤(2.1)使用NMF模型对Mashup和API描述文档建模,得到对应的主题向量,进行步骤(2.2);Step (2.1) uses the NMF model to model the Mashup and API description documents, obtains the corresponding topic vector, and performs step (2.2);

其中,NMF模型是非负矩阵分解,是一种无监督学习算法;Among them, the NMF model is a non-negative matrix factorization, which is an unsupervised learning algorithm;

步骤(2.2)根据MATaTo数据模型,针对四种类型的对象构成超图的超点;Step (2.2) according to the MATaTo data model, forms the superpoint of hypergraph for four types of objects;

步骤(2.3)根据五种高阶关系中建立十种类型的超边,定义十种类型的超边EM *kTo*、EM*kTa*、EA*kTo*、EA*kTa*、EM*ToTa*、EA*To*Ta*、EM*A*To*、EM*A*Ta*

Figure BDA0003880083200000021
Figure BDA0003880083200000022
建立超图的超边集;Step (2.3) Establish ten types of hyperedges according to five high-order relations, and define ten types of hyperedges E M *kTo* , E M*kTa* , E A*kTo* , E A*kTa* , E M*ToTa* 、 E A*To*Ta* 、 E M*A*To* 、 E M*A*Ta*
Figure BDA0003880083200000021
and
Figure BDA0003880083200000022
Create hyperedge sets of hypergraphs;

第三步、对MATaTo数据模型构成的超图进行多级聚类,过程如下:The third step is to perform multi-level clustering on the hypergraph formed by the MATaTo data model. The process is as follows:

步骤(3.1)粗化阶段:对原始图超点进行连续的粗化直到超图规模减小到一定程度;Step (3.1) Coarsening stage: Continuously coarsen the superpoints of the original graph until the scale of the hypergraph is reduced to a certain extent;

步骤(3.2)聚类阶段:对粗化阶段得到的最粗超图使用不同的参数进行谱聚类,根据结果选择效果最优的聚类,进行步骤(3.3);Step (3.2) clustering stage: use different parameters to perform spectral clustering on the coarsest hypergraph obtained in the coarsening stage, select the cluster with the best effect according to the results, and proceed to step (3.3);

步骤(3.3)细化阶段:将粗化阶段中被粗化的超点映射到聚类中,结束;Step (3.3) refinement stage: map the coarsened superpoints in the coarsening stage to clusters, and end;

第四步、基于带重启随机游走算法对Mashup服务进行推荐;The fourth step is to recommend the Mashup service based on the random walk algorithm with restart;

其中,随机游走是一种被广泛地用于描述自然界中许多实体的运动方式的数学统计模型,在本发明设计的带重启的随机游走算法中,超图中的超点在每一步的游走过程中有两种选择,以特定的概率α∈(0,1)游走到邻近的超点或者以1-α的概率返回初始超点,这种随机游走使得最终的服务推荐列表将更加符合需求Mashup。Among them, random walk is a mathematical statistical model widely used to describe the motion of many entities in nature. In the random walk algorithm with restart designed in the present invention, the superpoints in the hypergraph are at each step There are two choices in the walk process, walk to the adjacent superpoint with a specific probability α∈(0,1) or return to the initial superpoint with the probability of 1-α, this random walk makes the final service recommendation list Will be more in line with the needs of Mashup.

进一步,所述步骤(1.2)中,服务描述文档是传统Web服务WSDL文档的替代品,通过直观的语言描述服务信息;Further, in the step (1.2), the service description document is a substitute for the traditional Web service WSDL document, and describes the service information through an intuitive language;

其中,服务调用链接仅代表了服务的资源地址,并无实际帮助,因此,不使用服务调用链接信息。Among them, the service call link only represents the resource address of the service, and has no actual help. Therefore, the service call link information is not used.

更进一步,所述步骤(1.4)的过程如下:Further, the process of the step (1.4) is as follows:

步骤(1.4.1)提取服务描述文档、Mashup与API的调用关系、主题、标签四种类型信息,进行步骤(1.4.2);Step (1.4.1) extracts four types of information, the service description document, the call relationship between Mashup and API, the subject, and the label, and proceeds to step (1.4.2);

其中,四种类型信息的说明具体如下:Among them, the description of the four types of information is as follows:

Mashup(M*)表示Mashup服务,是服务推荐场景中推荐的请求者,Mashup本身携带有标签、主题的信息,并且与API存在调用关系;Mashup (M*) represents the Mashup service, which is the recommended requester in the service recommendation scenario. Mashup itself carries information about tags and topics, and has a calling relationship with the API;

API(A*)表示Web API,是服务推荐场景中的推荐对象,与Mashup服务对象相同,也具有标签、主题的信息,与Mashup存在被调用的关系;API(A*) means Web API, which is the recommended object in the service recommendation scenario. It is the same as the Mashup service object, and also has label and subject information, and has a calling relationship with Mashup;

标签(Ta*)表示爬取到的标签信息,具有相同标签的Mashup和API关系更加紧密;The label (Ta*) indicates the crawled label information, and the Mashup with the same label has a closer relationship with the API;

主题(To*)表示爬取到的主题信息,同一主题下的Mashup和API有相近的功能;Topic (To*) indicates the crawled topic information, Mashup and API under the same topic have similar functions;

步骤(1.4.2)归纳出R1(M*-Ta*-To*)、R2(A*-Ta*-To*)、R3(M*-M*)、R4(A*-A*)、R5(M*-A*)五种类型的高阶关系;Step (1.4.2) summarizes R1(M*-Ta*-To*), R2(A*-Ta*-To*), R3(M*-M*), R4(A*-A*), R5(M*-A*) five types of higher-order relationships;

其中,五种高阶关系的说明具体如下:Among them, the description of the five high-order relationships is as follows:

R1(M*-Ta*-To*)表示Mashup及其属性的关系,Mashup除了自身的描述文档外,用其携带的主题、标签信息进行补充说明;R1(M*-Ta*-To*) represents the relationship between Mashup and its attributes. In addition to its own description document, Mashup uses its carried subject and label information for supplementary explanation;

R2(A*-Ta*-To*)表示API及其属性的关系,与R1相同,API用其携带的主题、标签信息进行补充说明,Mashup调用具有相同主题、标签的API;R2(A*-Ta*-To*) represents the relationship between the API and its attributes. It is the same as R1. The API uses the topic and label information it carries to make supplementary explanations. Mashup calls the API with the same topic and label;

R3(M*-M*)表示Mashup之间的关系,本发明计算不同Mashup描述文档之间的相似度,来表示不同Mashup服务之间的关系;R3(M*-M*) represents the relationship between Mashups, and the present invention calculates the similarity between different Mashup description documents to represent the relationship between different Mashup services;

R4(A*-A*)表示API之间的关系,与R3相同,使用相似度表示API服务之间的关系,除此之外,被同时调用的两个或多个API之间能反映其共现度,一定程度上提升推荐的准确度;R4(A*-A*) represents the relationship between APIs, which is the same as R3, and uses similarity to represent the relationship between API services. In addition, two or more APIs that are called at the same time can reflect their relationship Co-occurrence can improve the accuracy of recommendation to a certain extent;

R5(M*-A*)表示Mashup和API之间的关系,Mashup由API组合而成,除了显式的调用关系外,API的主题或标签同样能对Mashup进行功能性描述。R5(M*-A*) indicates the relationship between the Mashup and the API. The Mashup is composed of APIs. In addition to the explicit call relationship, the topic or label of the API can also describe the functionality of the Mashup.

再进一步,所述步骤(2.2)的过程如下:Further, the process of the step (2.2) is as follows:

步骤(2.2.1)定义Mashup集合M={m1,m2,m3,...,mN},进行步骤(2.2.2);Step (2.2.1) defines the Mashup set M={m 1 , m 2 , m 3 ,...,m N }, and proceeds to step (2.2.2);

其中,N为Mashup的总数;Among them, N is the total number of Mashup;

步骤(2.2.2)定义Mashup mi调用的API集合为Ai={ai1,ai1,ai1,...,aik'},进行步骤(2.2.3);Step (2.2.2) defines the API set called by Mashup m i as A i ={a i1 ,a i1 ,a i1 ,...,a ik' }, proceed to step (2.2.3);

其中,i代表当前Mashup在集合M中的编号,k’为Mashup mi调用的API总数,则所有Mashup调用的API集合为

Figure BDA0003880083200000041
其中,∪表示取并集;Among them, i represents the number of the current mashup in the set M, k' is the total number of APIs called by mashup m i , then the set of APIs called by all mashups is
Figure BDA0003880083200000041
Among them, ∪ means to take the union;

步骤(2.2.3)定义Ta和To为所有Mashup和API携带的标签和主题信息集合,进行步骤(2.2.4);Step (2.2.3) defines Ta and To as the tags and subject information collections carried by all Mashups and APIs, and proceeds to step (2.2.4);

步骤(2.2.4)初始化超图的超点集为V=M∪A∪Ta∪ToStep (2.2.4) initializes the superpoint set of the hypergraph as V=M∪A∪T a ∪T o .

所述步骤(2.3)中,超边定义的具体说明如下:In the described step (2.3), the specific description of the hyperedge definition is as follows:

Figure BDA0003880083200000042
对于每个Mashup,选取top-k个相似Mashup以及它们的主题构成一条超边,该边的权重设定为与k个Mashup之间相似度的平均值,相似度的计算采用常见的余弦相似度,其中,余弦相似度的计算公式如下:
Figure BDA0003880083200000042
For each mashup, select top-k similar mashups and their topics to form a hyperedge, and the weight of this edge is set as the average value of the similarity with k mashups, and the similarity is calculated using the common cosine similarity , where the calculation formula of cosine similarity is as follows:

Figure BDA0003880083200000043
其中,θ表示预先相似度的夹角,vmx'表示第x’个Mashup的特征向量,|vmx'|表示特征向量的模;
Figure BDA0003880083200000043
Among them, θ represents the angle of the pre-similarity, vm x' represents the feature vector of the x'th Mashup, | vm x' | represents the modulus of the feature vector;

Figure BDA0003880083200000044
对于每个Mashup,选取top-k个相似Mashup以及它们的标签构成一条超边,该边的权重设定为与k个Mashup之间相似度的平均值,相似度的计算和步骤(2.3)中
Figure BDA0003880083200000045
的余弦相似度的计算公式相同;
Figure BDA0003880083200000044
For each Mashup, select top-k similar Mashups and their labels to form a hyperedge, and the weight of this edge is set as the average value of the similarity between the k Mashups. The calculation of the similarity is the same as in step (2.3).
Figure BDA0003880083200000045
The calculation formula of cosine similarity is the same;

Figure BDA0003880083200000046
对于每个API,选取top-k个相似API以及它们的主题构成一条超边,该边的权重设定为与k个API之间相似度的平均值,相似度的计算和步骤(2.3)中
Figure BDA0003880083200000047
的余弦相似度的计算公式相同;
Figure BDA0003880083200000046
For each API, select top-k similar APIs and their topics to form a hyperedge, and the weight of this edge is set as the average value of the similarity between the k APIs. The calculation of similarity is the same as in step (2.3).
Figure BDA0003880083200000047
The calculation formula of cosine similarity is the same;

Figure BDA0003880083200000051
对于每个API,选取top-k个相似API以及它们的标签构成一条超边,该边的权重设定为与k个API之间相似度的平均值,相似度的计算和步骤(2.3)中
Figure BDA0003880083200000052
的余弦相似度的计算公式相同;
Figure BDA0003880083200000051
For each API, select top-k similar APIs and their labels to form a hyperedge. The weight of this edge is set as the average value of the similarity between the k APIs. The calculation of similarity is the same as in step (2.3).
Figure BDA0003880083200000052
The calculation formula of cosine similarity is the same;

Figure BDA0003880083200000053
对于每个Mashup,选取其携带的任意数量主题与标签构成一条超边,该边的权重设定为1;
Figure BDA0003880083200000053
For each Mashup, select any number of topics and labels it carries to form a hyperedge, and the weight of this edge is set to 1;

EA*To*Ta*:对于每个API,选取其携带的任意数量主题与标签构成一条超边,该边的权重设定为1;E A*To*Ta* : For each API, select any number of topics and tags it carries to form a hyperedge, and the weight of this edge is set to 1;

EM*A*To*:对于每个Mashup,选取其调用的所有API以及携带的主题,该边的权重设定为1;E M*A*To* : For each Mashup, select all the APIs it calls and the topics it carries, and set the weight of this edge to 1;

EM*A*Ta*:对于每个Mashup,选取其调用的所有API以及携带的标签,该边的权重设定为1;E M*A*Ta* : For each Mashup, select all the APIs it calls and the labels it carries, and set the weight of this edge to 1;

Figure BDA0003880083200000054
流行的API往往代表较高的质量,对于每个API,与其所在的主题构建一条超边,该边的权重设定为流行度大小,其中,流行度大小的公式如下:
Figure BDA0003880083200000054
Popular APIs often represent higher quality. For each API, a hyperedge is constructed with its topic, and the weight of this edge is set as the popularity. The formula of popularity is as follows:

Figure BDA0003880083200000055
其中Call(Ar)表示API Ar被Mashup调用的频次,Topic(Ar)表示与Ar具有相同主题的所有API,MinCall(Topic(Ar))表示与Ar相同主题中的所有API中调用频次最低的API,MaxCall(Topic(Ar))表示与Ar相同主题中的所有API中调用频次最高的API;
Figure BDA0003880083200000055
Among them, Call(A r ) indicates the frequency of API A r being called by mashup, Topic(A r ) indicates all APIs with the same topic as A r , and MinCall(Topic(A r )) indicates all APIs in the same topic as A r The API with the lowest call frequency in , MaxCall(Topic(A r )) means the API with the highest call frequency among all APIs in the same topic as A r ;

Figure BDA0003880083200000056
经常一起被调用的API具有相关性,该边包含所有共现次数大于1的API,该边的权重设定为共现度大小,其中,共现度大小的公式如下:
Figure BDA0003880083200000056
The APIs that are often called together are related. This edge includes all APIs whose co-occurrence times are greater than 1. The weight of this edge is set to the co-occurrence degree. The formula for the co-occurrence degree is as follows:

Figure BDA0003880083200000057
其中,|Am'∩...∩An'|表示API Am’,...,An’被相同Mashup调用的次数,|Am'∪...∪An'|表示API Am’,...,An’被Mashup调用的总次数。
Figure BDA0003880083200000057
Among them, |A m' ∩...∩A n' | indicates the number of times API A m' ,...,A n' is called by the same Mashup, and |A m' ∪...∪A n' | indicates API The total number of times A m' ,...,An ' is called by the mashup.

所述步骤(3.1)的过程如下:The process of said step (3.1) is as follows:

步骤(3.1.1)定义粗化阶段所需的符号,c表示超点的权重集合、w表示超边的权重集合、G=(V,E,c,w)表示原始超图、iterTime表示迭代次数、coarse_limit表示粗化阈值、max_weight表示超点权重阈值、G'=(V',E',c',w')表示粗化后的图,V'表示粗化后未被合并的超点,E'表示粗化后的超边,c'表示粗化后超点的权重集合,w'表示粗化超边后的权重集合,进行步骤(3.1.2);Step (3.1.1) defines the symbols required for the coarsening stage, c represents the weight set of hyperpoints, w represents the weight set of hyperedges, G=(V,E,c,w) represents the original hypergraph, and iterTime represents iteration The number of times, coarse_limit indicates the coarsening threshold, max_weight indicates the superpoint weight threshold, G'=(V',E',c',w') indicates the graph after coarsening, and V' indicates the superpoint that has not been merged after coarsening , E' represents the hyperedge after coarsening, c' represents the weight set of superpoints after coarsening, w' represents the weight set after coarsening the hyperedge, and proceed to step (3.1.2);

步骤(3.1.2)将所有超点加入到未访问超点集合UV中,进行步骤(3.1.3);Step (3.1.2) joins all superpoints in the unvisited superpoint collection UV, carries out step (3.1.3);

步骤(3.1.3)初始化空集合C用于记录所有被粗化的超点,进行步骤(3.1.4);Step (3.1.3) initializes the empty collection C to be used for recording all superpoints that are coarsened, and carries out step (3.1.4);

步骤(3.1.4)定义变量I用于存储迭代次数,并将其初始化为0,进行步骤(3.1.4.1-3.1.4.4),对集合UV中的超点进行粗化,进行步骤(3.1.5);Step (3.1.4) defines the variable I for storing the number of iterations, and initializes it to 0, performs steps (3.1.4.1-3.1.4.4), coarsens the super points in the set UV, and performs steps (3.1. 5);

步骤(3.1.4.1)定义集合UVT,用于存储对本次迭代需要粗化的超点合集,将集合UV中的超点拷贝进集合UVT中,进行步骤(3.1.4.2);Step (3.1.4.1) defines the set UVT, which is used to store the superpoint collection that needs to be roughened for this iteration, and copies the superpoints in the set UV into the set UVT, and proceeds to step (3.1.4.2);

步骤(3.1.4.2)进行步骤(3.1.4.2.1-3.1.4.2.12),对UVT中超点进行粗化,直到迭代次数到达len(UVT),进行步骤(3.1.4.3);Step (3.1.4.2) proceed to step (3.1.4.2.1-3.1.4.2.12), coarsen the super point in UVT until the number of iterations reaches len(UVT), proceed to step (3.1.4.3);

其中,len()表示获取集合的大小,len(UVT)表示集合UVT的大小;Among them, len() indicates the size of the collection, and len(UVT) indicates the size of the collection UVT;

步骤(3.1.4.2.1)随机取集合UVT中一个超点,命名为rv,进行步骤(3.1.4.2.2);Step (3.1.4.2.1) randomly takes a super point in the set UVT, named rv, and proceeds to step (3.1.4.2.2);

步骤(3.1.4.2.2)判断该超点rv是否属于集合UVT,是则进行步骤(3.1.4.2.3),否则直接进入下一轮循环;Step (3.1.4.2.2) judges whether the super point rv belongs to the set UVT, if so, proceed to step (3.1.4.2.3), otherwise directly enter the next cycle;

步骤(3.1.4.2.3)定义一个超点cv,并将超点rv拷贝给cv,定义score用于存储评分,并初始化评分为0,进行步骤(3.1.4.2.4);Step (3.1.4.2.3) defines a super point cv, and copies the super point rv to cv, defines score to store the score, and initializes the score to 0, proceed to step (3.1.4.2.4);

步骤(3.1.4.2.4)遍历与超点rv邻接的其他超点,并将该超点拷贝给iv,进行步骤(3.1.4.2.4.1-3.1.4.2.4.3),找到要与rv合并的超点iv,进行步骤(3.1.4.2.5);Step (3.1.4.2.4) traverses other superpoints adjacent to superpoint rv, and copies the superpoint to iv, and performs steps (3.1.4.2.4.1-3.1.4.2.4.3) to find the superpoint to be merged with rv Super point iv, proceed to step (3.1.4.2.5);

其中,与超点rv邻接的其他超点即与该超点在同一边中的超点;Among them, other superpoints adjacent to superpoint rv are superpoints on the same side as this superpoint;

步骤(3.1.4.2.4.1)计算rv的权重与iv的权重之和;Step (3.1.4.2.4.1) calculates the sum of the weight of rv and the weight of iv;

步骤(3.1.4.2.4.2)判断该权重和是否大于超点权重阈值max_weight,否则进行步骤(3.1.4.2.4.3),是则直接进入下一次循环;Step (3.1.4.2.4.2) judges whether the weight sum is greater than the superpoint weight threshold max_weight, otherwise proceed to step (3.1.4.2.4.3), and if so, directly enter the next cycle;

步骤(3.1.4.2.4.3)判断iv是不是属于集合UVT中的超点,是则直接进入下一次循环,否则直接结束循环;Step (3.1.4.2.4.3) judges whether iv belongs to a superpoint in the set UVT, if so, directly enters the next cycle, otherwise directly ends the cycle;

步骤(3.1.4.2.5)根据评分函数计算rv与iv的评分函数,定义变量cur_score,将评分存储在cur_score中,进行步骤(3.1.4.2.6);Step (3.1.4.2.5) calculates the scoring function of rv and iv according to scoring function, defines variable cur_score, scores are stored in cur_score, and carries out step (3.1.4.2.6);

其中,评分函数为score(rv,iv)=a·con(rv,iv)+b·hNcut(rv,iv),a、b的值为经验值,依据实验效果调参后所得,其中,

Figure BDA0003880083200000071
其中,I*(rv)、I*(iv)分别表示包含超点rv、iv的超边集合,∩表示取交集,即取符号左右集合都存在的元素,
Figure BDA0003880083200000072
其中c*(rv)、c*(iv)分别为超点rv、iv的权重,e1表示同时和超点rv、iv存在于同一超边的超点,w(e1)表示超边e1的权重,de(e1)表示超边e1的度,de(e1)=|e1|,|e1|表示超边e1中的超点数量;∑表示求和,其中hNcut(rv,iv)表示超点rv和iv所在超边的归一化切割,代表了rv和iv分别被看作一个簇类时这两个簇类的相关度,计算公式为
Figure BDA0003880083200000073
其中,dv(rv)、dv(iv)表示超点rv、iv的度,即包含超点rv、iv的边的权重之和,e2表示属于集合Qe的边,Qe表示由这两个集合切割的超边集,当一条超边中的超点存在于不同簇类中时,称这条超边是被切割的超边;Among them, the scoring function is score(rv,iv)=a·con(rv,iv)+b·hNcut(rv,iv), the values of a and b are empirical values, which are obtained after adjusting the parameters according to the experimental results. Among them,
Figure BDA0003880083200000071
Among them, I*(rv), I*(iv) represent the hyperedge set containing the superpoint rv, iv respectively, ∩ means to take the intersection, that is, to take the elements that exist in both the left and right sets of symbols,
Figure BDA0003880083200000072
Among them, c*(rv) and c*(iv) are the weights of superpoints rv and iv respectively, e 1 means a superpoint that exists on the same hyperedge as superpoint rv and iv at the same time, w(e 1 ) means hyperedge e The weight of 1 , de(e 1 ) represents the degree of hyperedge e 1 , de(e 1 )=|e 1 |, |e 1 | represents the number of super points in hyperedge e 1 ; ∑ represents summation, where hNcut (rv,iv) represents the normalized cut of the hyperedge where the superpoints rv and iv are located, and represents the correlation between the two clusters when rv and iv are respectively regarded as a cluster. The calculation formula is
Figure BDA0003880083200000073
Among them, dv(rv) and dv(iv) represent the degrees of the superpoints rv and iv, that is, the sum of the weights of the edges containing the superpoints rv and iv, e 2 represents the edges belonging to the set Q e , and Q e represents the A set of hyperedges cut by a set, when the hyperpoints in a hyperedge exist in different clusters, this hyperedge is called a hyperedge that is cut;

步骤(3.1.4.2.6)判断cur_score是否大于score,是则将超点iv拷贝到cv中、将cur_score的值赋值给score,进行步骤(3.1.4.2.7);Step (3.1.4.2.6) judges whether cur_score is greater than score, if so, copy the super point iv to cv, assign the value of cur_score to score, and proceed to step (3.1.4.2.7);

步骤(3.1.4.2.7)将rv与cv合并,本发明使用双向链表的形式,将有合并关系的超点进行连接,然后进行步骤(3.1.4.2.8);Step (3.1.4.2.7) merges rv and cv, and the present invention uses the form of doubly linked list to connect the superpoints with merge relationship, and then proceed to step (3.1.4.2.8);

其中,双向链表是链表的一种,直接相连的两个节点之间有指向对方的指针;Among them, the doubly linked list is a kind of linked list, and there are pointers pointing to each other between two directly connected nodes;

步骤(3.1.4.2.8)将被合并超点cv的权重增加到rv,进行步骤(3.1.4.2.9);Step (3.1.4.2.8) increases the weight of the merged superpoint cv to rv, and proceeds to step (3.1.4.2.9);

步骤(3.1.4.2.9)在当前迭代轮次使用的UVT中移除rv与cv,,进行步骤(3.1.4.2.10);Step (3.1.4.2.9) removes rv and cv in the UVT used in the current iteration round, and proceeds to step (3.1.4.2.10);

步骤(3.1.4.2.10)在UV中移除cv,进行步骤(3.1.4.2.11);Step (3.1.4.2.10) removes cv in UV, proceed to step (3.1.4.2.11);

步骤(3.1.4.2.11)将超点cv添加到集合C中,进行步骤(3.1.4.2.12);Step (3.1.4.2.11) adds the super point cv to the set C, and proceeds to step (3.1.4.2.12);

步骤(3.1.4.2.12)如果被粗化的超点数到达粗化阈值coarse_limit,则退出迭代,执行步骤(3.1.4.3),否则直接进入下一轮迭代;Step (3.1.4.2.12) If the number of coarsened superpoints reaches the coarsening threshold coarse_limit, exit the iteration and execute step (3.1.4.3), otherwise directly enter the next round of iteration;

步骤(3.1.4.3)对I自增1,即I=I+1,进行步骤(3.1.4.4);Step (3.1.4.3) automatically increments I by 1, i.e. I=I+1, and proceeds to step (3.1.4.4);

步骤(3.1.4.4)判断迭代次数I是否小于iterTime、被粗化的超点数是否到达粗化阈值coarse_limit,任一条件不符合,则停止粗化,停止迭代,执行步骤(3.1.5);Step (3.1.4.4) judge whether the number of iterations I is less than iterTime, whether the coarsened superpoint number reaches the coarsening threshold coarse_limit, if any condition does not meet, then stop coarsening, stop iteration, and perform step (3.1.5);

步骤(3.1.5)将粗化后的粗化后的超点、超边、超点的权重集合和超边的权重集合分别拷贝到V'、E'、c'、w'中,得到粗化结果,即超图G'=(V',E',c',w')。Step (3.1.5) Copy the coarsened superpoint, hyperedge, superpoint weight set and hyperedge weight set to V', E', c', w' after coarsening to obtain the coarse The result of transformation, that is, the hypergraph G'=(V', E', c', w').

所述步骤(3.2)的过程如下:The process of said step (3.2) is as follows:

步骤(3.2.1)进行步骤(3.2.1.1-3.2.1.6),构建拉普拉斯矩阵,进行步骤(3.2.2);Step (3.2.1) carries out step (3.2.1.1-3.2.1.6), constructs Laplacian matrix, carries out step (3.2.2);

其中,拉普拉斯矩阵是图的矩阵表示,主要应用在图论中;Among them, the Laplacian matrix is a matrix representation of a graph, which is mainly used in graph theory;

步骤(3.2.1.1)定义VO和EO分别为存储超图中所有超点与对应索引位置和超图中所有超边与对应索引位置的数组,将超图中所有超点与对应索引位置存储在VO中;将超图中所有超边与对应索引位置存储在EO中,进行步骤(3.2.1.2);Step (3.2.1.1) defines VO and EO as arrays for storing all superpoints and corresponding index positions in the hypergraph and all hyperedges and corresponding index positions in the hypergraph respectively, and stores all superpoints and corresponding index positions in the hypergraph in In VO; store all hyperedges and corresponding index positions in the hypergraph in EO, and perform step (3.2.1.2);

步骤(3.2.1.2)根据步骤(2.3)中的超边构建方式和权重计算方法得到的超图信息构建相似矩阵S,进行步骤(3.2.1.3);Step (3.2.1.2) constructs a similarity matrix S according to the hypergraph information obtained by the hyperedge construction method and the weight calculation method in the step (2.3), and proceeds to the step (3.2.1.3);

其中,相似度矩阵是一项重要的统计学技术,用于组织一组数据点之间的彼此相似性,相似性即超边的权重;Among them, the similarity matrix is an important statistical technique, which is used to organize the mutual similarity between a group of data points, and the similarity is the weight of the hyperedge;

步骤(3.2.1.3)根据相似矩阵构建超点度矩阵VD以及超边度矩阵ED,进行步骤(3.2.1.4);Step (3.2.1.3) builds superpoint degree matrix VD and hyperedge degree matrix ED according to similarity matrix, carries out step (3.2.1.4);

其中,超点度矩阵VD是由超点度构成的对角线矩阵,对角线上的元素为超点的度dv(va),其中va代表VO中对应索引位置a的超点;Wherein, the superpoint degree matrix VD is a diagonal matrix formed by the superpoint degree, and the elements on the diagonal are the degree dv (v a ) of the superpoint, where v a represents the superpoint of the corresponding index position a in VO;

其中,超边度矩阵ED是由超边度构成的对角线矩阵,对角线上的元素为超边的度de(eb),其中eb代表EO中对应索引位置b的超点;Wherein, the hyperedge degree matrix ED is a diagonal matrix formed by the hyperedge degree, and the elements on the diagonal are the degree de(e b ) of the hyperedge, where e b represents the hyperpoint corresponding to the index position b in EO;

步骤(3.2.1.4)根据相似矩阵,进行步骤(3.2.1.4.1-3.2.1.4.4)构建关联矩阵H,进行步骤(3.2.1.5);Step (3.2.1.4) carries out step (3.2.1.4.1-3.2.1.4.4) to construct correlation matrix H according to similarity matrix, and carries out step (3.2.1.5);

其中,关联矩阵是一个n×m的矩阵,第d行表示第d个超点所属的所有超边,第f列表示第f条超边包含的所有超点,矩阵的第d行第f列表示为

Figure BDA0003880083200000081
Among them, the incidence matrix is an n×m matrix, the dth row represents all the hyperedges to which the dth superpoint belongs, the fth column represents all the superpoints contained in the fth hyperedge, and the dth row and fth column of the matrix Expressed as
Figure BDA0003880083200000081

步骤(3.2.1.4.1)建立一个n×m的全为0的矩阵H,进行步骤(3.2.1.4.2);Step (3.2.1.4.1) establishes a matrix H that is all 0s of n×m, and proceeds to step (3.2.1.4.2);

步骤(3.2.1.4.2)遍历EO中所有超边,进行步骤(3.2.1.4.3);Step (3.2.1.4.2) traverses all hyperedges in EO, and performs step (3.2.1.4.3);

步骤(3.2.1.4.3)遍历超边中所有超点,进行步骤(3.2.1.4.4);Step (3.2.1.4.3) traverses all hyperpoints in the hyperedge, and proceeds to step (3.2.1.4.4);

步骤(3.2.1.4.4)在矩阵H中将该超边和超点对应索引位置的0置为1;Step (3.2.1.4.4) sets 0 at the corresponding index position of the hyperedge and superpoint to 1 in the matrix H;

步骤(3.2.1.5)根据步骤(2.3)中的超边构建方式和权重计算方法得到的超图信息进行步骤(3.2.1.5.1-3.2.1.5.3),构建权重矩阵W,进行步骤(3.2.1.6);Step (3.2.1.5) carries out step (3.2.1.5.1-3.2.1.5.3) according to the hypergraph information obtained by the hyperedge construction method and weight calculation method in step (2.3), constructs weight matrix W, and performs step ( 3.2.1.6);

其中,权重矩阵W是超边权重的对角矩阵,对角线上的元素为超边的权重;Among them, the weight matrix W is a diagonal matrix of hyperedge weights, and the elements on the diagonal are the weights of hyperedges;

步骤(3.2.1.5.1)建立一个m×m的全为0的矩阵W进行步骤(3.2.1.5.2);Step (3.2.1.5.1) establishes a matrix W that is all 0 of m * m and carries out step (3.2.1.5.2);

步骤(3.2.1.5.2)遍历EO中所有超边,进行步骤(3.2.1.5.3);Step (3.2.1.5.2) traverses all hyperedges in EO, and performs step (3.2.1.5.3);

步骤(3.2.1.5.3)将超边的权重按照EO中的索引位置作为权重矩阵中的值替换矩阵W中的0;Step (3.2.1.5.3) replaces 0 in the matrix W with the weight of the hyperedge as the value in the weight matrix according to the index position in the EO;

步骤(3.2.1.6)根据关联矩阵、权重矩阵、超边度矩阵以及超点度矩阵构建拉普拉斯矩阵L;Step (3.2.1.6) constructs Laplacian matrix L according to correlation matrix, weight matrix, hyperedge degree matrix and superpoint degree matrix;

其中,拉普拉斯矩阵的计算公式为L=E*-ED-1/2HWVD-1HTED-1/2,其中E*表示单位矩阵,即从左上角到右下角的对角线上的元素均为1,其余位置都是0的矩阵,ED-1/2和VD-1分别表示矩阵ED取(-1/2)次幂和矩阵VD取-1次幂,由于ED和VD都是对角线矩阵,所以只要分别对对角线上的数取对应次幂,HT表示H的转置,即将矩阵H中的行换成同序号的列得到的矩阵;Among them, the calculation formula of the Laplacian matrix is L=E*-ED -1/2 HWVD -1 H T ED -1/2 , where E* represents the identity matrix, that is, the diagonal from the upper left corner to the lower right corner The elements above are all 1, and the rest of the positions are all 0. ED -1/2 and VD -1 respectively represent that the matrix ED is taken to the power of (-1/2) and the matrix VD is taken to the power of -1. Since ED and VD Both are diagonal matrices, so as long as the corresponding powers are taken on the numbers on the diagonal, H T represents the transpose of H, that is, the matrix obtained by replacing the rows in the matrix H with columns of the same sequence number;

步骤(3.2.2)对拉普拉斯矩阵进行归一化处理得到NL,进行步骤(3.2.3);Step (3.2.2) carries out normalization process to Laplacian matrix and obtains NL, carries out step (3.2.3);

其中,矩阵的归一化指将矩阵的数据以列为单元,映射到区间(0,1)之间,即,将每一列数据减去该列的最小值并除以该列的最大值;Among them, the normalization of the matrix refers to mapping the data of the matrix to the interval (0, 1) in units of columns, that is, subtracting the minimum value of the column from each column of data and dividing it by the maximum value of the column;

步骤(3.2.3)求解拉普拉斯矩阵NL,计算特征值,得到特征向量,进行步骤(3.2.4);Step (3.2.3) solves Laplacian matrix NL, calculates eigenvalue, obtains eigenvector, carries out step (3.2.4);

其中特征值,特征向量的定义为:矩阵乘法对应了一个变换,是把任意一个向量变成另一个方向或长度都大多不同的新向量,如果矩阵对某一个向量或某些向量只发生伸缩变换,不对这些向量产生旋转的效果,那么这些向量就称为这个矩阵的特征向量,伸缩的比例就是特征值;Among them, the definition of eigenvalue and eigenvector is: matrix multiplication corresponds to a transformation, which is to change any vector into a new vector with different directions or lengths. If the matrix only stretches and transforms a certain vector or some vectors , does not produce a rotation effect on these vectors, then these vectors are called the eigenvectors of this matrix, and the scaling ratio is the eigenvalue;

步骤(3.2.4)将特征向量按列构建成矩阵NE,并归一化处理,进行步骤(3.2.5);Step (3.2.4) constructs eigenvector into matrix NE by column, and normalizes, carries out step (3.2.5);

步骤(3.2.5)定义bestC用于存储最佳聚类结果,定义BestE用于存储最佳聚类结果的评分,初始化BestE为0,定义T_num用于存储重复聚类次数,初始化T_num为0,定义retryTime为聚类操作重复次数,进行步骤(3.2.6);Step (3.2.5) defines bestC to store the best clustering result, defines BestE to store the score of the best clustering result, initializes BestE to 0, defines T_num to store the number of repeated clustering, initializes T_num to 0, Define retryTime as the number of repetitions of the clustering operation, and perform step (3.2.6);

步骤(3.2.6)执行步骤(3.2.6.1-3.2.6.5),对NE进行聚类,评估聚类的效果,循环retryTime次,找到解质量最高的聚类,执行步骤(3.2.7);Step (3.2.6) execute step (3.2.6.1-3.2.6.5), cluster NE, evaluate the effect of clustering, loop retryTime times, find the cluster with the highest solution quality, execute step (3.2.7);

步骤(3.2.6.1)对T_num自增1,即,T_num=T_num+1,进行步骤(3.2.6.2);Step (3.2.6.1) automatically increments T_num by 1, that is, T_num=T_num+1, and proceeds to step (3.2.6.2);

步骤(3.2.6.2)对矩阵NE使用K-Means方法进行聚类,将得到的向量聚类结果作为对应超点的聚类结果,进行步骤(3.2.6.3);Step (3.2.6.2) uses the K-Means method to cluster the matrix NE, and uses the obtained vector clustering result as the clustering result of the corresponding superpoint, and performs step (3.2.6.3);

其中K-Means方法将数据集分为K个簇,每个簇使用簇内所有样本均值来表示,将该均值称为“质心”;Among them, the K-Means method divides the data set into K clusters, and each cluster is represented by the mean value of all samples in the cluster, and the mean value is called "centroid";

步骤(3.2.6.3)根据评分函数,评估聚类的效果,进行步骤(3.2.6.4);Step (3.2.6.3) evaluates the effect of clustering according to the scoring function, and performs step (3.2.6.4);

其中评分函数的公式为

Figure BDA0003880083200000101
其中,me表示聚类结果bestC中簇类的数量,Vmi表示bestC中的第mi个簇类,hCut(Vmi)的公式为
Figure BDA0003880083200000102
其中,w(ecut)代表超边ecut的权重,β(Vmi)代表由被Vmi切割的超边组成的超边集,ecut代表超边集β(Vmi)中的超边,
Figure BDA0003880083200000103
代表Vmi在集合V中的补集,|ecut∩Vmi|代表ecut和Vmi并集的大小,即同时存在于超边ecut和Vmi的超点的数量;The formula of the scoring function is
Figure BDA0003880083200000101
Among them, me represents the number of clusters in the clustering result bestC, V mi represents the mi-th cluster in bestC, and the formula of hCut(V mi ) is
Figure BDA0003880083200000102
Among them, w(e cut ) represents the weight of the hyperedge e cut , β(V mi ) represents the hyperedge set composed of hyperedges cut by V mi , and e cut represents the hyperedge in the hyperedge set β(V mi ) ,
Figure BDA0003880083200000103
Represents the complement of V mi in the set V, |e cut ∩ V mi | represents the size of the union of e cut and V mi , that is, the number of superpoints that exist in the hyperedge e cut and V mi at the same time;

步骤(3.2.6.4)如果本次聚类的评分高于BestE,进行步骤(3.2.6.4.1-3.2.6.4.2),用本次聚类结果替换原聚类结果,否则进行步骤(3.2.6.5);Step (3.2.6.4) If the score of this clustering is higher than BestE, go to step (3.2.6.4.1-3.2.6.4.2), replace the original clustering result with this clustering result, otherwise go to step (3.2 .6.5);

步骤(3.2.6.4.1)将本次据聚类结果替换到bestC中,进行步骤(3.2.6.4.2);Step (3.2.6.4.1) replaces the data clustering result of this time into bestC, and proceeds to step (3.2.6.4.2);

步骤(3.2.6.4.2)将本次据聚类结果的评分赋值给BestE;Step (3.2.6.4.2) assigns the score of this data clustering result to BestE;

步骤(3.2.6.5)判断T_num是否大于或等于retryTime,是则退出循环,否则进入下一次循环;Step (3.2.6.5) judges whether T_num is greater than or equal to retryTime, if yes, exit the loop, otherwise enter the next loop;

步骤(3.2.7)结束循环,得到聚类结果bestC。Step (3.2.7) ends the cycle and obtains the clustering result bestC.

所述步骤(3.3)的过程如下:The process of the step (3.3) is as follows:

步骤(3.3.1)定义UV’表示需要细化的超点,将超图G’的超点集V’拷贝到UV’中,进行步骤(3.3.2);Step (3.3.1) defines UV' to represent the superpoints that need to be refined, copy the superpoint set V' of hypergraph G' to UV', and perform step (3.3.2);

步骤(3.3.2)进入循环,执行步骤(3.3.2.1-3.3.2.4),对UV’中的超点进行细化,直到所有超点都细化完成,进行步骤(3.3.3);Step (3.3.2) enters the loop, executes steps (3.3.2.1-3.3.2.4), and refines the superpoints in UV' until all superpoints are refined, then proceeds to step (3.3.3);

步骤(3.3.2.1)从集合UV’中随机取出一个超点v*,进行步骤(3.3.2.2);Step (3.3.2.1) randomly takes out a super point v* from the set UV', and proceeds to step (3.3.2.2);

步骤(3.3.2.2)按照顺序遍历存储被超点v*合并的超点的双向链表,获取其中的超点v**,执行步骤(3.3.2.2.1-3.3.2.2.6),找到超点v**的所属簇类,进行步骤(3.3.2.3);Step (3.3.2.2) traverses in order the doubly-linked list that stores the superpoints merged by superpoint v*, obtains the superpoint v** in it, executes steps (3.3.2.2.1-3.3.2.2.6), and finds the superpoint The belonging cluster class of point v** is carried out in step (3.3.2.3);

步骤(3.3.2.2.1)定义cc代表超点v**的所属簇类,初始化聚类结果与超点v*相同,进行步骤(3.3.2.2.2);Step (3.3.2.2.1) defines cc to represent the cluster class of superpoint v**, the initial clustering result is the same as superpoint v*, and proceeds to step (3.3.2.2.2);

步骤(3.3.2.2.2)定义distance代表超点v**和所属簇类之间的距离,计算超点v**和簇类cc之间的距离初始化distance,进行步骤(3.3.2.2.3);Step (3.3.2.2.2) defines distance to represent the distance between the super point v** and the cluster class to which it belongs, calculates the distance between the super point v** and the cluster class cc to initialize distance, and proceeds to step (3.3.2.2.3 );

其中,计算距离的公式为:Among them, the formula for calculating the distance is:

Figure BDA0003880083200000111
其中,
Figure BDA0003880083200000112
代表当超点类型为主题时,根据超点权重进行惩罚;vr1、vr2、vr3表示存在于簇类cc中的超点,Qe’表示同时包含超点vr1和vr2的超边的集合,e”表示该集合中的超边,vr3表示存在于簇类cc中的超点,Qe”表示同时包含超点v**和vr3的超边的集合,e”’表示该集合中的超边,len(cc)2表示len(cc)的平方,即集合cc的大小的平方,dv(v**)表示超点v**的度;
Figure BDA0003880083200000111
in,
Figure BDA0003880083200000112
Represents that when the superpoint type is the subject, the penalty is carried out according to the superpoint weight; v r1 , v r2 , v r3 represent the superpoints existing in the cluster class cc, and Q e ' represents the superpoints that contain both the superpoints v r1 and v r2 The set of edges, e" means the hyperedges in this set, v r3 means the superpoints existing in the cluster class cc, Q e " means the set of hyperedges containing both the superpoints v ** and v r3 , e"' Represents the hyperedge in the set, len(cc) 2 represents the square of len(cc), that is, the square of the size of the set cc, dv(v**) represents the degree of super point v**;

步骤(3.3.2.2.3)遍历聚类结果bestC中的簇类,命名为cluster,执行步骤(3.3.2.2.3.1-3.3.2.2.3.2),找到和v**距离最近的簇类,执行步骤(3.3.2.2.4);Step (3.3.2.2.3) Traverse the clusters in the clustering result bestC, name it cluster, perform steps (3.3.2.2.3.1-3.3.2.2.3.2), find the cluster with the closest distance to v**, execute Step (3.3.2.2.4);

步骤(3.3.2.2.3.1)根据步骤(3.3.2.2.2)中的距离公式计算v**与cluster的距离,进行步骤(3.3.2.2.3.2);Step (3.3.2.2.3.1) calculates the distance between v** and cluster according to the distance formula in step (3.3.2.2.2), and performs step (3.3.2.2.3.2);

步骤(3.3.2.2.3.2)如果v**与cluster的距离小于则将cc更新为cluster,将distance更新为v**与cluster的距离;Step (3.3.2.2.3.2) If the distance between v** and cluster is less than then update cc to cluster, and update distance to the distance between v** and cluster;

步骤(3.3.2.2.4)将v**移动到簇类cc中,进行步骤(3.3.2.2.5);Step (3.3.2.2.4) moves v** into the cluster class cc, and proceeds to step (3.3.2.2.5);

步骤(3.3.2.2.5)从v*的权重中减去v**的权重,进行步骤(3.3.2.2.6);Step (3.3.2.2.5) subtracts the weight of v** from the weight of v*, and proceeds to step (3.3.2.2.6);

步骤(3.3.2.2.6)如果v**不是双向链表的最后一个超点,则将v**更新为双向链表的下一个超点并进入下一轮迭代,否则结束迭代;Step (3.3.2.2.6) If v** is not the last superpoint of the doubly linked list, then update v** to be the next superpoint of the doubly linked list and enter the next round of iteration, otherwise end the iteration;

步骤(3.3.2.3)从集合UV’中删除v*,进行步骤(3.3.2.4);Step (3.3.2.3) deletes v* from the set UV', proceed to step (3.3.2.4);

步骤(3.3.2.4)如果集合UV’为空则退出循环,否则进入下一次循环;Step (3.3.2.4) exit the loop if the set UV' is empty, otherwise enter the next loop;

步骤(3.3.3)得到细化后的最终的聚类结果bestC,细化完成。Step (3.3.3) obtains the final clustering result bestC after refinement, and the refinement is completed.

所述第四步的过程如下:The process of the fourth step is as follows:

步骤(4.1)定义带重启的随机游走模型的符号,进行步骤(4.2);Step (4.1) defines the symbol of the random walk model with restart, and proceeds to step (4.2);

其中,tag表示Mashup标签、topic表示Mashup主题、iterTime’表示迭代次数、α表示重启概率、topQ表示推荐个数、recommend表示服务推荐列表和clusters(C1,C2,...,Ck)表示聚类分区其中C1,C2,...,Ck表示服务推荐列表中的服务;Among them, tag indicates the Mashup tag, topic indicates the Mashup topic, iterTime' indicates the number of iterations, α indicates the restart probability, topQ indicates the number of recommendations, recommend indicates the service recommendation list and clusters(C 1 ,C 2 ,...,C k ) Represents the cluster partition where C 1 , C 2 ,...,C k represent the services in the service recommendation list;

步骤(4.2)定义y(t)为在t时刻各个超点被访问的概率所构成的列向量,进行步骤(4.3);Step (4.2) defines y (t) as the column vector formed by the probability of each super point being visited at time t, and proceeds to step (4.3);

步骤(4.3)定义y(1)为y(t)在t为1时各个超点被访问的概率所构成的列向量,以与VO相同的顺序将VO长度的倒数存储在y(1)中;Step (4.3) defines y (1) as a column vector composed of the probabilities of each superpoint being visited when t is 1 in y (t) , and stores the reciprocal of the length of VO in y (1) in the same order as VO ;

步骤(4.4)进行步骤(4.4.1-4.4.2),将需求Mashup的描述文档、标签、主题构建成初始列向量q,进行步骤(4.5);Step (4.4) proceed to step (4.4.1-4.4.2), construct the description document, label, and subject of the required Mashup into an initial column vector q, and proceed to step (4.5);

其中,q为由0和1构成的初始列向量,其中代表初始超点的位置赋值为1,其余为0;Among them, q is an initial column vector composed of 0 and 1, where the position representing the initial superpoint is assigned a value of 1, and the rest are 0;

步骤(4.4.1)建立一个与VO长度相同的全为0的向量q,进行步骤(4.4.2);Step (4.4.1) establishes a vector q of all 0s with the same length as VO, and proceeds to step (4.4.2);

步骤(4.4.2)将需求Mashup的描述文档、标签、主题分别依次与VO中的超点比较,找出相同超点,在q中将对应索引位置1;Step (4.4.2) Compare the description document, label, and topic of the demand Mashup with the superpoints in VO in turn, find out the same superpoint, and set the corresponding index position 1 in q;

步骤(4.5)进行步骤(4.5.1-4.5.5),计算超点集的访问分布向量,进行步骤(4.6);Step (4.5) carries out step (4.5.1-4.5.5), calculates the visit distribution vector of superpoint set, carries out step (4.6);

步骤(4.5.1)利用超点、超边的度矩阵、权重矩阵、关联矩阵构建转移概率矩阵P,公式为P=VD-1HWED-1HT,进行步骤(4.5.2);Step (4.5.1) constructs transition probability matrix P by using the degree matrix, weight matrix, and incidence matrix of hyperpoints and hyperedges, the formula is P=VD -1 HWED - 1HT, and proceeds to step (4.5.2);

其中,D-1表示D的逆矩阵,D-1D=E*Wherein, D -1 represents the inverse matrix of D, D -1 D=E * ;

步骤(4.5.2)根据带重启的随机游走,得到整个超点集的访问分布为y(t+1)=αPy(t)+(1-α)q,进行步骤(4.5.3);Step (4.5.2) obtains the access distribution of the entire superpoint set as y (t+1) =αPy (t) +(1-α)q according to the random walk with restart, and proceeds to step (4.5.3);

步骤(4.5.3)进行初始化的设置,y(1)向量为初始分布向量,其中每一个元素的值为超点总数量的倒数,y(1)向量与VO等长,进行步骤(4.5.4);Step (4.5.3) is initialized, the y (1) vector is the initial distribution vector, the value of each element is the reciprocal of the total number of superpoints, the y (1) vector is equal to the length of VO, and the step (4.5. 4);

步骤(4.5.4)从初始分布向量y(1),进行步骤(4.5.4.1-4.5.4.2)开启随机游走,对y(t)进行迭代,直到到达指定的迭代次数Tmax,进行步骤(4.5.5);Step (4.5.4) From the initial distribution vector y (1) , proceed to step (4.5.4.1-4.5.4.2) to start a random walk, iterate on y (t) until it reaches the specified number of iterations Tmax, proceed to step ( 4.5.5);

步骤(4.5.4.1)根据步骤(4.5.2)计算y(t+1),进行步骤(4.5.4.2);Step (4.5.4.1) calculates y (t+1 ) according to step (4.5.2), and performs step (4.5.4.2);

步骤(4.5.4.2)更新t为t+1;Step (4.5.4.2) updates t to be t+1;

步骤(4.5.5)随机游走过程停止,此时向量y(t)已经不再变化;Step (4.5.5) the random walk process stops, and the vector y (t) no longer changes at this time;

步骤(4.6)对y(t)向量按照值进行从大到小排序,进行步骤(4.7);Step (4.6) sorts the y (t) vector from large to small according to the value, and performs step (4.7);

步骤(4.7)新建空的推荐列表recommend,进行步骤(4.8);Step (4.7) creates a new empty recommendation list recommend, and proceeds to step (4.8);

步骤(4.8)进行步骤(4.8.1-4.8.2),获取y(t)值最大的topQ个API作为推荐API放入recommend列表,进行步骤(4.9);Step (4.8) proceed to step (4.8.1-4.8.2), obtain the topQ APIs with the largest y (t) values and put them into the recommend list as recommended APIs, proceed to step (4.9);

步骤(4.8.1)遍历y(t)向量中的值对应的超点,进行步骤(4.8.2),直到recommend的长度达到topQ;Step (4.8.1) traverses the superpoint corresponding to the value in the y (t) vector, and proceeds to step (4.8.2) until the length of recommend reaches topQ;

步骤(4.8.2)对超点进行类型判断,如果是API类型则加入推荐列表,否则跳到下一次循环;Step (4.8.2) judges the type of the super point, if it is an API type, it is added to the recommendation list, otherwise it skips to the next cycle;

步骤(4.9)结束循环,返回推荐列表,其长度为topQ,此时recommend中经过大小排序且类型都为API的超点就是所求的推荐API,完成推荐,结束。Step (4.9) ends the cycle, returns the recommendation list, and its length is topQ. At this time, the superpoints in recommend that have been sorted by size and whose types are all API are the recommended recommended APIs. Complete the recommendation and end.

本发明的有益效果是,利用MATaTo数据模型描述服务推荐场景能充分利用服务推荐场景中的信息;进一步,一致性超图涵盖辅助信息的同时克服了辅助信息难以融合、高阶信息易丢失的问题;进一步,使用改进的多级聚类方法进行聚类,降低超图结构的复杂性,有效的缩小服务搜索空间,提高服务推荐的准确率,降低服务推荐的时间成本,减少大型超图给基于超图的推荐算法带来负面影响;最后,设定重启参数进行带重启的随机游走,在保证数据推荐多样性的同时得到个性化推荐列表。The beneficial effect of the present invention is that using the MATaTo data model to describe the service recommendation scene can make full use of the information in the service recommendation scene; further, while the consistency hypergraph covers auxiliary information, it overcomes the problems of difficult fusion of auxiliary information and easy loss of high-order information ;Further, use the improved multi-level clustering method for clustering, reduce the complexity of the hypergraph structure, effectively reduce the service search space, improve the accuracy of service recommendation, reduce the time cost of service recommendation, and reduce the large-scale hypergraph based on Hypergraph’s recommendation algorithm has negative impacts; finally, set the restart parameters to perform random walk with restart, and get a personalized recommendation list while ensuring the diversity of data recommendations.

附图说明Description of drawings

图1示出了在不同服务类数下本发明推荐方法与其他方法推荐结果的准确率。Fig. 1 shows the accuracy rate of the recommendation method of the present invention and the recommendation results of other methods under different numbers of service categories.

图2示出了在不同服务类数下本发明推荐方法与其他方法推荐结果的召回率Recall。Fig. 2 shows the recall rate Recall of the recommendation method of the present invention and the recommendation results of other methods under different numbers of service categories.

图3示出了在不同服务类数下本发明推荐方法与其他方法推荐结果的折扣累计增益DCG@R。Fig. 3 shows the discount cumulative gain DCG@R of the recommendation method of the present invention and the recommendation results of other methods under different numbers of service categories.

图4示出了在不同服务类数下本发明推荐方法与其他方法推荐结果的汉明距离HMD。Fig. 4 shows the Hamming distance HMD between the recommendation method of the present invention and the recommendation results of other methods under different numbers of service categories.

具体实施方式detailed description

下面结合附图对本发明作进一步描述。The present invention will be further described below in conjunction with the accompanying drawings.

参照图1~图4,一种面向Saas的融合带重启随机游走算法的Web Api多样性推荐方法,包括以下步骤:Referring to Figures 1 to 4, a Saas-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm includes the following steps:

第一步、提取MATaTo(Mashup-API-Tag-Topic)数据模型,过程如下:The first step is to extract the MATaTo (Mashup-API-Tag-Topic) data model, the process is as follows:

步骤(1.1)从ProgrammableWeb平台爬取Mashup服务集合信息,进行步骤(1.2);Step (1.1) crawls the Mashup service collection information from the ProgrammableWeb platform, and proceeds to step (1.2);

步骤(1.2)对爬取到的Mashup服务集合进行分析,选择有利于提升服务推荐准确性且内容丰富的元信息,进行步骤(1.3);Step (1.2) analyzes the crawled Mashup service collection, selects meta information that is conducive to improving the accuracy of service recommendation and is rich in content, and proceeds to step (1.3);

其中,爬取到的信息有服务描述文档、Mashup与API的调用关系、主题、标签、服务关注者、服务调用链接、更新日志信息;Among them, the crawled information includes service description documents, call relationships between Mashup and API, topics, labels, service followers, service call links, and update log information;

其中,服务描述文档是传统Web服务WSDL文档的替代品,通过直观的语言描述服务信息,一直以来是服务推荐研究者的研究重点,不少研究者通过TF-IDF模型或主题模型对描述文档处理分析;Among them, the service description document is a substitute for the traditional Web service WSDL document. It has always been the research focus of service recommendation researchers to describe service information through an intuitive language. Many researchers use the TF-IDF model or topic model to process description documents. analyze;

其中,Mashup与API的调用关系包含了Mashup与API之间存在的不可分割的关系,而现有的研究中往往忽略了两者的调用关系,因此,需要使用Mashup与API的调用关系,并且在调用关系的基础上,可以引申出Mashup和API流行度、Mashup与API、Mashup与Mashup、API与API共现度信息;Among them, the call relationship between Mashup and API includes the inseparable relationship between Mashup and API, but the existing research often ignores the call relationship between the two. Therefore, the call relationship between Mashup and API needs to be used, and in On the basis of the call relationship, the popularity of Mashup and API, Mashup and API, Mashup and Mashup, API and API co-occurrence information can be derived;

其中,主题能反应服务的功能性特征,对于服务的使用范围有一个大致的定性,同一主题下的服务有相似的功能,标签通常由用户人为添加,从用户的主观角度对服务进行补充描述;Among them, the theme can reflect the functional characteristics of the service, and has a general qualitative definition of the scope of use of the service. Services under the same theme have similar functions. Labels are usually manually added by the user, and the service is supplemented from the user's subjective point of view;

其中,服务调用链接仅代表了服务的资源地址,并无实际帮助。因此,不使用服务调用链接信息;Among them, the service call link only represents the resource address of the service, and has no actual help. Therefore, the service call link information is not used;

步骤(1.3)定义MATaTo模型的符号,M表示Mashup集合、A表示API集合、Ta表示标签集合、To表示主题集合、R1表示Mashup之间的关系、R2表示API之间的关系、R3表示Mashup及其属性的关系、R4表示API及其属性的关系、R5表示Mashup和API之间的关系、n表示超点数量、m表示超边数量、WHG∈Rm×n表示加权超图、v表示超点、V表示由n个超点构成的集合、e表示超边、E表示由m个超边构成的集合,Rm×n表示大小为m×n的实数集合;进行步骤(1.4);Step (1.3) defines the symbols of the MATaTo model, M represents the Mashup set, A represents the API set, Ta represents the tag set, To represents the topic set, R1 represents the relationship between Mashups, R2 represents the relationship between APIs, R3 represents the Mashup and The relationship between its attributes, R4 indicates the relationship between API and its attributes, R5 indicates the relationship between Mashup and API, n indicates the number of hyperpoints, m indicates the number of hyperedges, WHG∈R m×n indicates the weighted hypergraph, v indicates the hypergraph point, V represents a set composed of n superpoints, e represents a hyperedge, E represents a set composed of m hyperedges, R m * n represents a set of real numbers whose size is m * n; proceed to step (1.4);

步骤(1.4)遍历爬取到的信息,进行步骤(1.4.1-1.4.2)从中提取出MATaTo数据模型;Step (1.4) traverses the crawled information, and performs steps (1.4.1-1.4.2) to extract the MATaTo data model therefrom;

步骤(1.4.1)提取服务描述文档、Mashup与API的调用关系、主题、标签四种类型信息,进行步骤(1.4.2);Step (1.4.1) extracts four types of information, the service description document, the call relationship between Mashup and API, the subject, and the label, and proceeds to step (1.4.2);

其中,四种类型信息的说明具体如下:Among them, the description of the four types of information is as follows:

Mashup(M*)表示Mashup服务,是服务推荐场景中推荐的请求者,Mashup本身携带有标签、主题的信息,并且与API存在调用关系;Mashup (M*) represents the Mashup service, which is the recommended requester in the service recommendation scenario. Mashup itself carries information about tags and topics, and has a calling relationship with the API;

API(A*)表示Web API,是服务推荐场景中的推荐对象,与Mashup服务对象相同,也具有标签、主题的信息,与Mashup存在被调用的关系;API(A*) means Web API, which is the recommended object in the service recommendation scenario. It is the same as the Mashup service object, and also has label and subject information, and has a calling relationship with Mashup;

标签(Ta*)表示爬取到的标签信息,具有相同标签的Mashup和API关系更加紧密;The label (Ta*) indicates the crawled label information, and the Mashup with the same label has a closer relationship with the API;

主题(To*)表示爬取到的主题信息,同一主题下的Mashup和API有相近的功能;Topic (To*) indicates the crawled topic information, Mashup and API under the same topic have similar functions;

步骤(1.4.2)归纳出R1(M*-Ta*-To*)、R2(A*-Ta*-To*)、R3(M*-M*)、R4(A*-A*)、R5(M*-A*)五种类型的高阶关系;Step (1.4.2) summarizes R1(M*-Ta*-To*), R2(A*-Ta*-To*), R3(M*-M*), R4(A*-A*), R5(M*-A*) five types of higher-order relationships;

其中,五种高阶关系的说明具体如下:Among them, the description of the five high-order relationships is as follows:

R1(M*-Ta*-To*)表示Mashup及其属性的关系,Mashup除了自身的描述文档外,用其携带的主题、标签信息进行补充说明;R1(M*-Ta*-To*) represents the relationship between Mashup and its attributes. In addition to its own description document, Mashup uses its carried subject and label information for supplementary explanation;

R2(A*-Ta*-To*)表示API及其属性的关系,与R1相同,API用其携带的主题、标签信息进行补充说明,Mashup调用具有相同主题、标签的API;R2(A*-Ta*-To*) represents the relationship between the API and its attributes. It is the same as R1. The API uses the topic and label information it carries to make supplementary explanations. Mashup calls the API with the same topic and label;

R3(M*-M*)表示Mashup之间的关系,本发明计算不同Mashup描述文档之间的相似度,来表示不同Mashup服务之间的关系;R3(M*-M*) represents the relationship between Mashups, and the present invention calculates the similarity between different Mashup description documents to represent the relationship between different Mashup services;

R4(A*-A*)表示API之间的关系,与R3相同,使用相似度表示API服务之间的关系。除此之外,被同时调用的两个或多个API之间能反映其共现度,一定程度上提升推荐的准确度;R4(A*-A*) represents the relationship between APIs, which is the same as R3, using similarity to represent the relationship between API services. In addition, the co-occurrence between two or more APIs that are called at the same time can reflect their co-occurrence, which improves the accuracy of the recommendation to a certain extent;

R5(M*-A*)表示Mashup和API之间的关系,Mashup由API组合而成,除了显式的调用关系外,API的主题或标签同样能对Mashup进行功能性描述;R5(M*-A*) indicates the relationship between Mashup and API. Mashup is composed of API. In addition to the explicit call relationship, the topic or label of API can also describe the function of Mashup;

第二步、基于MATaTo数据模型构建超图;The second step is to construct a hypergraph based on the MATaTo data model;

步骤(2.1)使用NMF模型对Mashup和API描述文档建模,得到对应的主题向量,进行步骤(2.2);Step (2.1) uses the NMF model to model the Mashup and API description documents, obtains the corresponding topic vector, and performs step (2.2);

其中,NMF模型是非负矩阵分解,是一种无监督学习算法,其目的在于提取有用的特征,常用于图像特征识别等领域;Among them, the NMF model is a non-negative matrix decomposition, which is an unsupervised learning algorithm whose purpose is to extract useful features, and is often used in image feature recognition and other fields;

步骤(2.2)根据MATaTo数据模型,进行步骤(2.2.1-2.2.4),针对四种类型的对象构成超图的超点;Step (2.2) carries out step (2.2.1-2.2.4) according to MATaTo data model, forms the superpoint of hypergraph for four types of objects;

步骤(2.2.1)定义Mashup集合M={m1,m2,m3,...,mN},进行步骤(2.2.2);Step (2.2.1) defines the Mashup set M={m 1 , m 2 , m 3 ,...,m N }, and proceeds to step (2.2.2);

其中,N为Mashup的总数;Among them, N is the total number of Mashup;

步骤(2.2.2)定义Mashup mi调用的API集合为Ai={ai1,ai1,ai1,...,aik'},进行步骤(2.2.3);Step (2.2.2) defines the API set called by Mashup m i as A i ={a i1 ,a i1 ,a i1 ,...,a ik' }, proceed to step (2.2.3);

其中,i代表当前Mashup在集合M中的编号,k’为Mashup mi调用的API总数,则所有Mashup调用的API集合为

Figure BDA0003880083200000161
其中,∪表示取并集;Among them, i represents the number of the current mashup in the set M, k' is the total number of APIs called by mashup m i , then the set of APIs called by all mashups is
Figure BDA0003880083200000161
Among them, ∪ means to take the union;

步骤(2.2.3)定义Ta和To为所有Mashup和API携带的标签和主题信息集合,进行步骤(2.2.4);Step (2.2.3) defines Ta and To as the tags and subject information collections carried by all Mashups and APIs, and proceeds to step (2.2.4);

步骤(2.2.4)初始化超图的超点集为V=M∪A∪Ta∪ToStep (2.2.4) initializes the superpoint set of the hypergraph as V=M∪A∪T a ∪T o ;

步骤(2.3)根据五种高阶关系中建立十种类型的超边,定义十种类型的超边EM *kTo*、EM*kTa*、EA*kTo*、EA*kTa*、EM*ToTa*、EA*To*Ta*、EM*A*To*、EM*A*Ta*

Figure BDA0003880083200000162
Figure BDA0003880083200000163
建立超图的超边集;Step (2.3) Establish ten types of hyperedges according to five high-order relations, and define ten types of hyperedges E M *kTo* , E M*kTa* , E A*kTo* , E A*kTa* , E M*ToTa* 、 E A*To*Ta* 、 E M*A*To* 、 E M*A*Ta*
Figure BDA0003880083200000162
and
Figure BDA0003880083200000163
Create hyperedge sets of hypergraphs;

其中,超边定义的具体说明如下:Among them, the specific description of the hyperedge definition is as follows:

Figure BDA0003880083200000164
主题能对Mashup进行功能性描述,而Mashup本身携带的主题有限,通过与其相似的Mashup扩展自身的主题,对于每个Mashup,选取top-k个相似Mashup以及它们的主题构成一条超边,该边的权重设定为与k个Mashup之间相似度的平均值,相似度的计算采用常见的余弦相似度,其中,余弦相似度的计算公式如下:
Figure BDA0003880083200000164
A topic can describe the functionality of a mashup, and a mashup itself has limited themes. It expands its own theme through similar mashups. For each mashup, select top-k similar mashups and their topics to form a hyperedge. The edge The weight of is set as the average value of the similarity with k Mashups, and the calculation of the similarity adopts the common cosine similarity, where the calculation formula of the cosine similarity is as follows:

Figure BDA0003880083200000165
其中,θ表示预先相似度的夹角,vmx'表示第x’个Mashup的特征向量,|vmx'|表示特征向量的模;
Figure BDA0003880083200000165
Among them, θ represents the angle of the pre-similarity, vm x' represents the feature vector of the x'th Mashup, | vm x' | represents the modulus of the feature vector;

Figure BDA0003880083200000166
标签能对Mashup进行细节性描述,而Mashup本身携带的标签有限,通过与其相似的Mashup扩展自身的标签,对于每个Mashup,选取top-k个相似Mashup以及它们的标签构成一条超边,该边的权重设定为与k个Mashup之间相似度的平均值,相似度的计算和步骤(2.3)中
Figure BDA0003880083200000171
的余弦相似度的计算公式相同;
Figure BDA0003880083200000166
Tags can describe the mashup in detail, and the mashup itself has limited tags. By extending its own tags through similar mashups, for each mashup, select top-k similar mashups and their tags to form a hyperedge. The weight of is set as the average value of similarity with k Mashups, the calculation of similarity and step (2.3)
Figure BDA0003880083200000171
The calculation formula of cosine similarity is the same;

Figure BDA0003880083200000172
一个API携带的标签数有限,可以通过与其他API之间的相似性扩展自身的标签。对于每个API,选取top-k个相似API以及它们的主题构成一条超边,该边的权重设定为与k个API之间相似度的平均值,相似度的计算和步骤(2.3)中
Figure BDA0003880083200000173
的余弦相似度的计算公式相同;
Figure BDA0003880083200000172
An API carries a limited number of tags, and can extend its own tags through similarities with other APIs. For each API, select top-k similar APIs and their topics to form a hyperedge, and the weight of this edge is set as the average value of the similarity between the k APIs. The calculation of similarity is the same as in step (2.3).
Figure BDA0003880083200000173
The calculation formula of cosine similarity is the same;

Figure BDA0003880083200000174
同上,API可以通过与其他API之间的相似性扩展自身的标签,对于每个API,选取top-k个相似API以及它们的标签构成一条超边,该边的权重设定为与k个API之间相似度的平均值,相似度的计算和步骤(2.3)中
Figure BDA0003880083200000175
的余弦相似度的计算公式相同;
Figure BDA0003880083200000174
As above, an API can extend its own label through the similarity with other APIs. For each API, select top-k similar APIs and their labels to form a hyperedge, and the weight of this edge is set to be similar to k APIs. The average value of similarity between, the calculation of similarity and step (2.3)
Figure BDA0003880083200000175
The calculation formula of cosine similarity is the same;

EM*To*Ta*:一个Mashup本身携带有主题与标签信息,对于每个Mashup,选取其携带的任意数量主题与标签构成一条超边。该边的权重设定为1;E M*To*Ta* : A mashup itself carries topic and label information. For each mashup, select any number of topics and labels it carries to form a hyperedge. The weight of this edge is set to 1;

EA*To*Ta*:一个API本身携带有主题与标签信息。对于每个API,选取其携带的任意数量主题与标签构成一条超边,该边的权重设定为1;E A*To*Ta* : An API itself carries subject and tag information. For each API, select any number of topics and labels it carries to form a hyperedge, and the weight of this edge is set to 1;

EM*A*To*:一个Mashup调用一个或多个API,这些API的主题同样能补充Mashup的主题,对于每个Mashup,选取其调用的所有API以及携带的主题。该边的权重设定为1;E M*A*To* : A mashup calls one or more APIs. The themes of these APIs can also complement the theme of the mashup. For each mashup, select all the APIs it calls and the topics it carries. The weight of this edge is set to 1;

EM*A*Ta*:同上,一个Mashup调用一个或多个API,这些API的标签同样能补充Mashup的标签,对于每个Mashup,选取其调用的所有API以及携带的标签,该边的权重设定为1;E M*A*Ta* : As above, a mashup calls one or more APIs, and the labels of these APIs can also supplement the labels of the mashup. For each mashup, select all the APIs it calls and the labels it carries, and the weight of the edge set to 1;

Figure BDA0003880083200000176
流行的API往往代表较高的质量,对于每个API,与其所在的主题构建一条超边。该边的权重设定为流行度大小,其中,流行度大小的公式如下:
Figure BDA0003880083200000176
Popular APIs tend to represent higher quality, and for each API, a hyperedge is constructed with its topic. The weight of this edge is set to the popularity, where the formula of the popularity is as follows:

Figure BDA0003880083200000177
其中Call(Ar)表示API Ar被Mashup调用的频次,Topic(Ar)表示与Ar具有相同主题的所有API。MinCall(Topic(Ar))表示与Ar相同主题中的所有API中调用频次最低的API,MaxCall(Topic(Ar))表示与Ar相同主题中的所有API中调用频次最高的API;
Figure BDA0003880083200000177
Among them, Call(A r ) indicates the frequency that API A r is called by Mashup, and Topic(A r ) indicates all APIs with the same topic as A r . MinCall(Topic(A r )) means the API with the lowest call frequency among all APIs in the same topic as A r , and MaxCall(Topic(A r )) means the API with the highest call frequency among all APIs in the same topic as A r ;

Figure BDA0003880083200000181
经常一起被调用的API具有相关性,该边包含所有共现次数大于1的API,该边的权重设定为共现度大小,其中,共现度大小的公式如下:
Figure BDA0003880083200000181
The APIs that are often called together are related. This edge includes all APIs whose co-occurrence times are greater than 1. The weight of this edge is set to the co-occurrence degree. The formula for the co-occurrence degree is as follows:

Figure BDA0003880083200000182
其中,|Am'∩...∩An'|表示API Am’,...,An’被相同Mashup调用的次数,|Am'∪...∪An'|表示API Am’,...,An’被Mashup调用的总次数;
Figure BDA0003880083200000182
Among them, |A m' ∩...∩A n' | indicates the number of times API A m' ,...,A n' is called by the same Mashup, and |A m' ∪...∪A n' | indicates API The total number of times A m' ,...,A n' is called by Mashup;

第三步、对MATaTo数据模型构成的超图进行多级聚类,过程如下:The third step is to perform multi-level clustering on the hypergraph formed by the MATaTo data model. The process is as follows:

步骤(3.1)粗化阶段:进行步骤(3.1.1-3.1.5)对原始图超点进行连续的粗化直到超图规模减小到一定程度。Step (3.1) Coarsening stage: Carry out steps (3.1.1-3.1.5) to perform continuous coarsening on the superpoints of the original graph until the size of the hypergraph is reduced to a certain extent.

步骤(3.1.1)定义粗化阶段所需的符号,c表示超点的权重集合、w表示超边的权重集合、G=(V,E,c,w)表示原始超图、iterTime表示迭代次数、coarse_limit表示粗化阈值、max_weight表示超点权重阈值、G'=(V',E',c',w')表示粗化后的图,V'表示粗化后未被合并的超点,E'表示粗化后的超边,c'表示粗化后超点的权重集合,w'表示粗化超边后的权重集合,进行步骤(3.1.2);Step (3.1.1) defines the symbols required for the coarsening stage, c represents the weight set of hyperpoints, w represents the weight set of hyperedges, G=(V,E,c,w) represents the original hypergraph, and iterTime represents iteration The number of times, coarse_limit indicates the coarsening threshold, max_weight indicates the superpoint weight threshold, G'=(V',E',c',w') indicates the graph after coarsening, and V' indicates the superpoint that has not been merged after coarsening , E' represents the hyperedge after coarsening, c' represents the weight set of superpoints after coarsening, w' represents the weight set after coarsening the hyperedge, and proceed to step (3.1.2);

步骤(3.1.2)将所有超点加入到未访问超点集合UV中,进行步骤(3.1.3);Step (3.1.2) joins all superpoints in the unvisited superpoint collection UV, carries out step (3.1.3);

步骤(3.1.3)初始化空集合C用于记录所有被粗化的超点,进行步骤(3.1.4);Step (3.1.3) initializes the empty collection C to be used for recording all superpoints that are coarsened, and carries out step (3.1.4);

步骤(3.1.4)定义变量I用于存储迭代次数,并将其初始化为0,进行步骤(3.1.4.1-3.1.4.4),对集合UV中的超点进行粗化,进行步骤(3.1.5);Step (3.1.4) defines the variable I for storing the number of iterations, and initializes it to 0, performs steps (3.1.4.1-3.1.4.4), coarsens the super points in the set UV, and performs steps (3.1. 5);

步骤(3.1.4.1)定义集合UVT,用于存储对本次迭代需要粗化的超点合集,将集合UV中的超点拷贝进集合UVT中,进行步骤(3.1.4.2);Step (3.1.4.1) defines the set UVT, which is used to store the superpoint collection that needs to be roughened for this iteration, and copies the superpoints in the set UV into the set UVT, and proceeds to step (3.1.4.2);

步骤(3.1.4.2)进行步骤(3.1.4.2.1-3.1.4.2.12),对UVT中超点进行粗化,直到迭代次数到达len(UVT),进行步骤(3.1.4.3);Step (3.1.4.2) proceed to step (3.1.4.2.1-3.1.4.2.12), coarsen the super point in UVT until the number of iterations reaches len(UVT), proceed to step (3.1.4.3);

其中,len()表示获取集合的大小,len(UVT)表示集合UVT的大小;Among them, len() indicates the size of the collection, and len(UVT) indicates the size of the collection UVT;

步骤(3.1.4.2.1)随机取集合UVT中一个超点,命名为rv,进行步骤(3.1.4.2.2);Step (3.1.4.2.1) randomly takes a super point in the set UVT, named rv, and proceeds to step (3.1.4.2.2);

步骤(3.1.4.2.2)判断该超点rv是否属于集合UVT,是则进行步骤(3.1.4.2.3),否则直接进入下一轮循环;Step (3.1.4.2.2) judges whether the super point rv belongs to the set UVT, if so, proceed to step (3.1.4.2.3), otherwise directly enter the next cycle;

步骤(3.1.4.2.3)定义一个超点cv,并将超点rv拷贝给cv,定义score用于存储评分,并初始化评分为0,进行步骤(3.1.4.2.4);Step (3.1.4.2.3) defines a super point cv, and copies the super point rv to cv, defines score to store the score, and initializes the score to 0, proceed to step (3.1.4.2.4);

步骤(3.1.4.2.4)遍历与超点rv邻接的其他超点,并将该超点拷贝给iv,进行步骤(3.1.4.2.4.1-3.1.4.2.4.3),找到要与rv合并的超点iv,进行步骤(3.1.4.2.5);Step (3.1.4.2.4) traverses other superpoints adjacent to superpoint rv, and copies the superpoint to iv, and performs steps (3.1.4.2.4.1-3.1.4.2.4.3) to find the superpoint to be merged with rv Super point iv, proceed to step (3.1.4.2.5);

其中,与超点rv邻接的其他超点即与该超点在同一边中的超点;Among them, other superpoints adjacent to superpoint rv are superpoints on the same side as this superpoint;

步骤(3.1.4.2.4.1)计算rv的权重与iv的权重之和;Step (3.1.4.2.4.1) calculates the sum of the weight of rv and the weight of iv;

步骤(3.1.4.2.4.2)判断该权重和是否大于超点权重阈值max_weight,否则进行步骤(3.1.4.2.4.3),是则直接进入下一次循环;Step (3.1.4.2.4.2) judges whether the weight sum is greater than the superpoint weight threshold max_weight, otherwise proceed to step (3.1.4.2.4.3), and if so, directly enter the next cycle;

步骤(3.1.4.2.4.3)判断iv是不是属于集合UVT中的超点,是则直接进入下一次循环,否则直接结束循环;Step (3.1.4.2.4.3) judges whether iv belongs to a superpoint in the set UVT, if so, directly enters the next cycle, otherwise directly ends the cycle;

步骤(3.1.4.2.5)根据评分函数计算rv与iv的评分函数,定义变量cur_score,将评分存储在cur_score中,进行步骤(3.1.4.2.6);Step (3.1.4.2.5) calculates the scoring function of rv and iv according to scoring function, defines variable cur_score, scores are stored in cur_score, and carries out step (3.1.4.2.6);

其中,评分函数为score(rv,iv)=a·con(rv,iv)+b·hNcut(rv,iv)。a、b的值为经验值,依据实验效果调参后所得。其中,

Figure BDA0003880083200000191
其中,I*(rv)、I*(iv)分别表示包含超点rv、iv的超边集合,∩表示取交集,即取符号左右集合都存在的元素,
Figure BDA0003880083200000192
其中c*(rv)、c*(iv)分别为超点rv、iv的权重,e1表示同时和超点rv、iv存在于同一超边的超点,w(e1)表示超边e1的权重,de(e1)表示超边e1的度,de(e1)=|e1|,|e1|表示超边e1中的超点数量;∑表示求和。其中hNcut(rv,iv)表示超点rv和iv所在超边的归一化切割,代表了rv和iv分别被看作一个簇类时这两个簇类的相关度,计算公式为
Figure BDA0003880083200000193
其中,dv(rv)、dv(iv)表示超点rv、iv的度,即包含超点rv、iv的边的权重之和,e2表示属于集合Qe的边,Qe表示由这两个集合切割的超边集,当一条超边中的超点存在于不同簇类中时,称这条超边是被切割的超边;Wherein, the scoring function is score(rv,iv)=a·con(rv,iv)+b·hNcut(rv,iv). The values of a and b are empirical values, which are obtained after parameter adjustment according to the experimental results. in,
Figure BDA0003880083200000191
Among them, I*(rv), I*(iv) represent the hyperedge set containing the superpoint rv, iv respectively, ∩ means to take the intersection, that is, to take the elements that exist in both the left and right sets of symbols,
Figure BDA0003880083200000192
Among them, c*(rv) and c*(iv) are the weights of superpoints rv and iv respectively, e 1 means a superpoint that exists on the same hyperedge as superpoint rv and iv at the same time, w(e 1 ) means hyperedge e The weight of 1 , de(e 1 ) indicates the degree of hyperedge e 1 , de(e 1 )=|e 1 |, |e 1 | indicates the number of super vertices in hyperedge e 1 ; ∑ indicates summation. where hNcut(rv,iv) represents the normalized cut of the hyperedge where the superpoints rv and iv are located, and represents the correlation between the two clusters when rv and iv are respectively regarded as a cluster. The calculation formula is
Figure BDA0003880083200000193
Among them, dv(rv) and dv(iv) represent the degrees of the superpoints rv and iv, that is, the sum of the weights of the edges containing the superpoints rv and iv, e 2 represents the edges belonging to the set Q e , and Q e represents the A set of hyperedges cut by a set, when the hyperpoints in a hyperedge exist in different clusters, this hyperedge is called a hyperedge that is cut;

步骤(3.1.4.2.6)判断cur_score是否大于score,是则将超点iv拷贝到cv中、将cur_score的值赋值给score,进行步骤(3.1.4.2.7);Step (3.1.4.2.6) judges whether cur_score is greater than score, if so, copy the super point iv to cv, assign the value of cur_score to score, and proceed to step (3.1.4.2.7);

步骤(3.1.4.2.7)将rv与cv合并,本发明使用双向链表的形式,将有合并关系的超点进行连接,然后进行步骤(3.1.4.2.8);Step (3.1.4.2.7) merges rv and cv, and the present invention uses the form of doubly linked list to connect the superpoints with merge relationship, and then proceed to step (3.1.4.2.8);

其中,双向链表是链表的一种,它的每个数据结点中都有两个指针,分别指向直接后继和直接前驱,即,直接相连的两个节点之间有指向对方的指针。Among them, the doubly linked list is a kind of linked list, each of its data nodes has two pointers, pointing to the direct successor and direct predecessor respectively, that is, there are pointers pointing to each other between two directly connected nodes.

步骤(3.1.4.2.8)将被合并超点cv的权重增加到rv,进行步骤(3.1.4.2.9);Step (3.1.4.2.8) increases the weight of the merged superpoint cv to rv, and proceeds to step (3.1.4.2.9);

步骤(3.1.4.2.9)在当前迭代轮次使用的UVT中移除rv与cv,,进行步骤(3.1.4.2.10);Step (3.1.4.2.9) removes rv and cv in the UVT used in the current iteration round, and proceeds to step (3.1.4.2.10);

步骤(3.1.4.2.10)在UV中移除cv,进行步骤(3.1.4.2.11);Step (3.1.4.2.10) removes cv in UV, proceed to step (3.1.4.2.11);

步骤(3.1.4.2.11)将超点cv添加到集合C中,进行步骤(3.1.4.2.12);Step (3.1.4.2.11) adds the super point cv to the set C, and proceeds to step (3.1.4.2.12);

步骤(3.1.4.2.12)如果被粗化的超点数到达粗化阈值coarse_limit,则退出迭代,执行步骤(3.1.4.3),否则直接进入下一轮迭代;Step (3.1.4.2.12) If the number of coarsened superpoints reaches the coarsening threshold coarse_limit, exit the iteration and execute step (3.1.4.3), otherwise directly enter the next round of iteration;

步骤(3.1.4.3)对I自增1,即I=I+1,进行步骤(3.1.4.4);Step (3.1.4.3) automatically increments I by 1, i.e. I=I+1, and proceeds to step (3.1.4.4);

步骤(3.1.4.4)判断迭代次数I是否小于iterTime、被粗化的超点数是否到达粗化阈值coarse_limit,任一条件不符合,则停止粗化,停止迭代,执行步骤(3.1.5)。Step (3.1.4.4) judge whether the number of iterations I is less than iterTime, whether the number of coarsened superpoints reaches the coarsening threshold coarse_limit, if any condition is not met, then stop coarsening, stop iteration, and execute step (3.1.5).

步骤(3.1.5)将粗化后的粗化后的超点、超边、超点的权重集合和超边的权重集合分别拷贝到V'、E'、c'、w'中,得到粗化结果,即超图G'=(V',E',c',w');Step (3.1.5) Copy the coarsened superpoint, hyperedge, superpoint weight set and hyperedge weight set to V', E', c', w' after coarsening to obtain the coarse Transformation result, that is, hypergraph G'=(V', E', c', w');

步骤(3.2)聚类阶段:进行步骤(3.2.1-3.2.7),对粗化阶段得到的最粗超图使用不同的参数进行谱聚类,根据结果选择效果最优的聚类,进行步骤(3.3);Step (3.2) Clustering stage: Carry out steps (3.2.1-3.2.7), use different parameters to perform spectral clustering on the coarsest hypergraph obtained in the coarsening stage, select the cluster with the best effect according to the results, and perform step (3.3);

步骤(3.2.1)进行步骤(3.2.1.1-3.2.1.6),构建拉普拉斯矩阵,进行步骤(3.2.2);Step (3.2.1) carries out step (3.2.1.1-3.2.1.6), constructs Laplacian matrix, carries out step (3.2.2);

其中拉普拉斯矩阵是图的矩阵表示,主要应用在图论中;Among them, the Laplacian matrix is the matrix representation of the graph, which is mainly used in graph theory;

步骤(3.2.1.1)定义VO和EO分别为存储超图中所有超点与对应索引位置和超图中所有超边与对应索引位置的数组,将超图中所有超点与对应索引位置存储在VO中;将超图中所有超边与对应索引位置存储在EO中,进行步骤(3.2.1.2);Step (3.2.1.1) defines VO and EO as arrays for storing all superpoints and corresponding index positions in the hypergraph and all hyperedges and corresponding index positions in the hypergraph respectively, and stores all superpoints and corresponding index positions in the hypergraph in In VO; store all hyperedges and corresponding index positions in the hypergraph in EO, and perform step (3.2.1.2);

步骤(3.2.1.2)根据步骤(2.3)中的超边构建方式和权重计算方法得到的超图信息构建相似矩阵S,进行步骤(3.2.1.3);Step (3.2.1.2) constructs a similarity matrix S according to the hypergraph information obtained by the hyperedge construction method and the weight calculation method in the step (2.3), and proceeds to the step (3.2.1.3);

其中,相似度矩阵是一项重要的统计学技术,用于组织一组数据点之间的彼此相似性。本发明中相似性即超边的权重。Among them, the similarity matrix is an important statistical technique used to organize the mutual similarity between a set of data points. In the present invention, the similarity is the weight of the hyperedge.

步骤(3.2.1.3)根据相似矩阵构建超点度矩阵VD以及超边度矩阵ED,进行步骤(3.2.1.4);Step (3.2.1.3) builds superpoint degree matrix VD and hyperedge degree matrix ED according to similarity matrix, carries out step (3.2.1.4);

其中,超点度矩阵VD是由超点度构成的对角线矩阵,对角线上的元素为超点的度dv(va),其中va代表VO中对应索引位置a的超点;Wherein, the superpoint degree matrix VD is a diagonal matrix formed by the superpoint degree, and the elements on the diagonal are the degree dv (v a ) of the superpoint, where v a represents the superpoint of the corresponding index position a in VO;

其中,超边度矩阵ED是由超边度构成的对角线矩阵,对角线上的元素为超边的度de(eb),其中eb代表EO中对应索引位置b的超点;Wherein, the hyperedge degree matrix ED is a diagonal matrix formed by the hyperedge degree, and the elements on the diagonal are the degree de(e b ) of the hyperedge, where e b represents the hyperpoint corresponding to the index position b in EO;

步骤(3.2.1.4)根据相似矩阵,进行步骤(3.2.1.4.1-3.2.1.4.4)构建关联矩阵H,进行步骤(3.2.1.5);Step (3.2.1.4) carries out step (3.2.1.4.1-3.2.1.4.4) to construct correlation matrix H according to similarity matrix, and carries out step (3.2.1.5);

其中,关联矩阵是一个n×m的矩阵,第d行表示第d个超点所属的所有超边,第f列表示第f条超边包含的所有超点,矩阵的第d行第f列表示为

Figure BDA0003880083200000211
Among them, the incidence matrix is an n×m matrix, the dth row represents all the hyperedges to which the dth superpoint belongs, the fth column represents all the superpoints contained in the fth hyperedge, and the dth row and fth column of the matrix Expressed as
Figure BDA0003880083200000211

步骤(3.2.1.4.1)建立一个n×m的全为0的矩阵H,进行步骤(3.2.1.4.2);Step (3.2.1.4.1) establishes a matrix H that is all 0s of n×m, and proceeds to step (3.2.1.4.2);

步骤(3.2.1.4.2)遍历EO中所有超边,进行步骤(3.2.1.4.3);Step (3.2.1.4.2) traverses all hyperedges in EO, and performs step (3.2.1.4.3);

步骤(3.2.1.4.3)遍历超边中所有超点,进行步骤(3.2.1.4.4);Step (3.2.1.4.3) traverses all hyperpoints in the hyperedge, and proceeds to step (3.2.1.4.4);

步骤(3.2.1.4.4)在矩阵H中将该超边和超点对应索引位置的0置为1;Step (3.2.1.4.4) sets 0 at the corresponding index position of the hyperedge and superpoint to 1 in the matrix H;

步骤(3.2.1.5)根据步骤(2.3)中的超边构建方式和权重计算方法得到的超图信息进行步骤(3.2.1.5.1-3.2.1.5.3),构建权重矩阵W,进行步骤(3.2.1.6);Step (3.2.1.5) carries out step (3.2.1.5.1-3.2.1.5.3) according to the hypergraph information obtained by the hyperedge construction method and weight calculation method in step (2.3), constructs weight matrix W, and performs step ( 3.2.1.6);

其中,权重矩阵W是超边权重的对角矩阵,对角线上的元素为超边的权重;Among them, the weight matrix W is a diagonal matrix of hyperedge weights, and the elements on the diagonal are the weights of hyperedges;

步骤(3.2.1.5.1)建立一个m×m的全为0的矩阵W进行步骤(3.2.1.5.2);Step (3.2.1.5.1) establishes a matrix W that is all 0 of m * m and carries out step (3.2.1.5.2);

步骤(3.2.1.5.2)遍历EO中所有超边,进行步骤(3.2.1.5.3);Step (3.2.1.5.2) traverses all hyperedges in EO, and performs step (3.2.1.5.3);

步骤(3.2.1.5.3)将超边的权重按照EO中的索引位置作为权重矩阵中的值替换矩阵W中的0;Step (3.2.1.5.3) replaces 0 in the matrix W with the weight of the hyperedge as the value in the weight matrix according to the index position in the EO;

步骤(3.2.1.6)根据关联矩阵、权重矩阵、超边度矩阵以及超点度矩阵构建拉普拉斯矩阵L;Step (3.2.1.6) constructs Laplacian matrix L according to correlation matrix, weight matrix, hyperedge degree matrix and superpoint degree matrix;

其中,拉普拉斯矩阵的计算公式为L=E*-ED-1/2HWVD-1HTED-1/2,其中E*表示单位矩阵,即从左上角到右下角的对角线上的元素均为1,其余位置都是0的矩阵,ED-1/2和VD-1分别表示矩阵ED取(-1/2)次幂和矩阵VD取-1次幂,由于ED和VD都是对角线矩阵,所以只要分别对对角线上的数取对应次幂,HT表示H的转置,即将矩阵H中的行换成同序号的列得到的矩阵;Among them, the calculation formula of the Laplacian matrix is L=E*-ED -1/2 HWVD -1 H T ED -1/2 , where E* represents the identity matrix, that is, the diagonal from the upper left corner to the lower right corner The elements above are all 1, and the rest of the positions are all 0. ED -1/2 and VD -1 respectively represent that the matrix ED is taken to the power of (-1/2) and the matrix VD is taken to the power of -1. Since ED and VD Both are diagonal matrices, so as long as the corresponding powers are taken on the numbers on the diagonal, H T represents the transpose of H, that is, the matrix obtained by replacing the rows in the matrix H with columns of the same sequence number;

步骤(3.2.2)对拉普拉斯矩阵进行归一化处理得到NL,进行步骤(3.2.3);Step (3.2.2) carries out normalization process to Laplacian matrix and obtains NL, carries out step (3.2.3);

其中,矩阵的归一化指将矩阵的数据以列为单元,映射到区间(0,1)之间,即,将每一列数据减去该列的最小值并除以该列的最大值;Among them, the normalization of the matrix refers to mapping the data of the matrix to the interval (0, 1) in units of columns, that is, subtracting the minimum value of the column from each column of data and dividing it by the maximum value of the column;

步骤(3.2.3)求解拉普拉斯矩阵NL,计算特征值,得到特征向量,进行步骤(3.2.4);Step (3.2.3) solves Laplacian matrix NL, calculates eigenvalue, obtains eigenvector, carries out step (3.2.4);

其中特征值,特征向量的定义为:矩阵乘法对应了一个变换,是把任意一个向量变成另一个方向或长度都大多不同的新向量,如果矩阵对某一个向量或某些向量只发生伸缩变换,不对这些向量产生旋转的效果,那么这些向量就称为这个矩阵的特征向量,伸缩的比例就是特征值;Among them, the definition of eigenvalue and eigenvector is: matrix multiplication corresponds to a transformation, which is to change any vector into a new vector with different directions or lengths. If the matrix only stretches and transforms a certain vector or some vectors , does not produce a rotation effect on these vectors, then these vectors are called the eigenvectors of this matrix, and the scaling ratio is the eigenvalue;

步骤(3.2.4)将特征向量按列构建成矩阵NE,并归一化处理,进行步骤(3.2.5);Step (3.2.4) constructs eigenvector into matrix NE by column, and normalizes, carries out step (3.2.5);

步骤(3.2.5)定义bestC用于存储最佳聚类结果,定义BestE用于存储最佳聚类结果的评分,初始化BestE为0,定义T_num用于存储重复聚类次数,初始化T_num为0,定义retryTime为聚类操作重复次数,进行步骤(3.2.6);Step (3.2.5) defines bestC to store the best clustering result, defines BestE to store the score of the best clustering result, initializes BestE to 0, defines T_num to store the number of repeated clustering, initializes T_num to 0, Define retryTime as the number of repetitions of the clustering operation, and perform step (3.2.6);

步骤(3.2.6)执行步骤(3.2.6.1-3.2.6.5),对NE进行聚类,评估聚类的效果,循环retryTime次,找到解质量最高的聚类,执行步骤(3.2.7);Step (3.2.6) execute step (3.2.6.1-3.2.6.5), cluster NE, evaluate the effect of clustering, loop retryTime times, find the cluster with the highest solution quality, execute step (3.2.7);

步骤(3.2.6.1)对T_num自增1,即,T_num=T_num+1,进行步骤(3.2.6.2);Step (3.2.6.1) automatically increments T_num by 1, that is, T_num=T_num+1, and proceeds to step (3.2.6.2);

步骤(3.2.6.2)对矩阵NE使用K-Means方法进行聚类,将得到的向量聚类结果作为对应超点的聚类结果,进行步骤(3.2.6.3);Step (3.2.6.2) uses the K-Means method to cluster the matrix NE, and uses the obtained vector clustering result as the clustering result of the corresponding superpoint, and performs step (3.2.6.3);

其中K-Means方法是一种常见的聚类算法。算法会将数据集分为K个簇,每个簇使用簇内所有样本均值来表示,将该均值称为“质心”。Among them, the K-Means method is a common clustering algorithm. The algorithm divides the data set into K clusters, and each cluster is represented by the mean value of all samples in the cluster, which is called the "centroid".

步骤(3.2.6.3)根据评分函数,评估聚类的效果,进行步骤(3.2.6.4);Step (3.2.6.3) evaluates the effect of clustering according to the scoring function, and performs step (3.2.6.4);

其中评分函数的公式为

Figure BDA0003880083200000221
其中,me表示聚类结果bestC中簇类的数量,Vmi表示bestC中的第mi个簇类,hCut(Vmi)的公式为
Figure BDA0003880083200000222
其中,w(ecut)代表超边ecut的权重,β(Vmi)代表由被Vmi切割的超边组成的超边集,ecut代表超边集β(Vmi)中的超边,
Figure BDA0003880083200000231
代表Vmi在集合V中的补集,|ecut∩Vmi|代表ecut和Vmi并集的大小,即同时存在于超边ecut和Vmi的超点的数量;The formula of the scoring function is
Figure BDA0003880083200000221
Among them, me represents the number of clusters in the clustering result bestC, V mi represents the mi-th cluster in bestC, and the formula of hCut(V mi ) is
Figure BDA0003880083200000222
Among them, w(e cut ) represents the weight of the hyperedge e cut , β(V mi ) represents the hyperedge set composed of hyperedges cut by V mi , and e cut represents the hyperedge in the hyperedge set β(V mi ) ,
Figure BDA0003880083200000231
Represents the complement of V mi in the set V, |e cut ∩ V mi | represents the size of the union of e cut and V mi , that is, the number of superpoints that exist in the hyperedge e cut and V mi at the same time;

步骤(3.2.6.4)如果本次聚类的评分高于BestE,进行步骤(3.2.6.4.1-3.2.6.4.2),用本次聚类结果替换原聚类结果,否则进行步骤(3.2.6.5);Step (3.2.6.4) If the score of this clustering is higher than BestE, go to step (3.2.6.4.1-3.2.6.4.2), replace the original clustering result with this clustering result, otherwise go to step (3.2 .6.5);

步骤(3.2.6.4.1)将本次据聚类结果替换到bestC中,进行步骤(3.2.6.4.2);Step (3.2.6.4.1) replaces the data clustering result of this time into bestC, and proceeds to step (3.2.6.4.2);

步骤(3.2.6.4.2)将本次据聚类结果的评分赋值给BestE;Step (3.2.6.4.2) assigns the score of this data clustering result to BestE;

步骤(3.2.6.5)判断T_num是否大于或等于retryTime,是则退出循环,否则进入下一次循环;Step (3.2.6.5) judges whether T_num is greater than or equal to retryTime, if yes, exit the loop, otherwise enter the next loop;

步骤(3.2.7)结束循环,得到聚类结果bestC;Step (3.2.7) ends the loop and obtains the clustering result bestC;

步骤(3.3)细化阶段:进行步骤(3.3.1-3.3.3)将粗化阶段中被粗化的超点映射到聚类中,结束;Step (3.3) refinement stage: perform steps (3.3.1-3.3.3) to map the coarsened superpoints in the coarsening stage to clusters, and end;

步骤(3.3.1)定义UV’表示需要细化的超点,将超图G’的超点集V’拷贝到UV’中,进行步骤(3.3.2);Step (3.3.1) defines UV' to represent the superpoints that need to be refined, copy the superpoint set V' of hypergraph G' to UV', and perform step (3.3.2);

步骤(3.3.2)进入循环,执行步骤(3.3.2.1-3.3.2.4),对UV’中的超点进行细化,直到所有超点都细化完成,进行步骤(3.3.3);Step (3.3.2) enters the loop, executes steps (3.3.2.1-3.3.2.4), and refines the superpoints in UV' until all superpoints are refined, then proceeds to step (3.3.3);

步骤(3.3.2.1)从集合UV’中随机取出一个超点v*,进行步骤(3.3.2.2);Step (3.3.2.1) randomly takes out a super point v* from the set UV', and proceeds to step (3.3.2.2);

步骤(3.3.2.2)按照顺序遍历存储被超点v*合并的超点的双向链表,获取其中的超点v**,执行步骤(3.3.2.2.1-3.3.2.2.6),找到超点v**的所属簇类,进行步骤(3.3.2.3);Step (3.3.2.2) traverses in order the doubly-linked list that stores the superpoints merged by superpoint v*, obtains the superpoint v** in it, executes steps (3.3.2.2.1-3.3.2.2.6), and finds the superpoint The belonging cluster class of point v**, carry out step (3.3.2.3);

步骤(3.3.2.2.1)定义cc代表超点v**的所属簇类,初始化聚类结果与超点v*相同,进行步骤(3.3.2.2.2);Step (3.3.2.2.1) defines cc to represent the cluster class of superpoint v**, the initial clustering result is the same as superpoint v*, and proceeds to step (3.3.2.2.2);

步骤(3.3.2.2.2)定义distance代表超点v**和所属簇类之间的距离,计算超点v**和簇类cc之间的距离初始化distance,进行步骤(3.3.2.2.3);Step (3.3.2.2.2) defines distance to represent the distance between the super point v** and the cluster class to which it belongs, calculates the distance between the super point v** and the cluster class cc to initialize distance, and proceeds to step (3.3.2.2.3 );

其中,计算距离的公式为:Among them, the formula for calculating the distance is:

Figure BDA0003880083200000232
其中,
Figure BDA0003880083200000233
代表当超点类型为主题时,在聚类中具有功能性含义,移动该类型超点可能导致聚类的功能发生偏移,超点权重越大表示对应的功能在所在聚类影响越大,因此根据超点权重进行惩罚;vr1、vr2、vr3表示存在于簇类cc中的超点,Qe’表示同时包含超点vr1和vr2的超边的集合,e”表示该集合中的超边,vr3表示存在于簇类cc中的超点,Qe”表示同时包含超点v**和vr3的超边的集合,e”’表示该集合中的超边,len(cc)2表示len(cc)的平方,即集合cc的大小的平方,dv(v**)表示超点v**的度;
Figure BDA0003880083200000232
in,
Figure BDA0003880083200000233
It means that when the superpoint type is the theme, it has functional meaning in the clustering. Moving this type of superpoint may cause the function of the cluster to shift. The larger the weight of the superpoint, the greater the influence of the corresponding function in the cluster. Therefore, the penalty is carried out according to the superpoint weight; v r1 , v r2 , v r3 represent the superpoints existing in the cluster class cc, Q e ' represents the set of hyperedges containing The hyperedges in the set, v r3 means the super points existing in the cluster class cc, Q e " means the set of hyper edges including the super points v** and v r3 at the same time, e"' means the hyper edges in the set, len(cc) 2 means the square of len(cc), that is, the square of the size of the set cc, and dv(v**) means the degree of superpoint v**;

步骤(3.3.2.2.3)遍历聚类结果bestC中的簇类,命名为cluster,执行步骤(3.3.2.2.3.1-3.3.2.2.3.2),找到和v**距离最近的簇类,执行步骤(3.3.2.2.4);Step (3.3.2.2.3) Traverse the clusters in the clustering result bestC, name it cluster, perform steps (3.3.2.2.3.1-3.3.2.2.3.2), find the cluster with the closest distance to v**, execute Step (3.3.2.2.4);

步骤(3.3.2.2.3.1)根据步骤(3.3.2.2.2)中的距离公式计算v**与cluster的距离,进行步骤(3.3.2.2.3.2);Step (3.3.2.2.3.1) calculates the distance between v** and cluster according to the distance formula in step (3.3.2.2.2), and performs step (3.3.2.2.3.2);

步骤(3.3.2.2.3.2)如果v**与cluster的距离小于则将cc更新为cluster,将distance更新为v**与cluster的距离;Step (3.3.2.2.3.2) If the distance between v** and cluster is less than then update cc to cluster, and update distance to the distance between v** and cluster;

步骤(3.3.2.2.4)将v**移动到簇类cc中,进行步骤(3.3.2.2.5);Step (3.3.2.2.4) moves v** into the cluster class cc, and proceeds to step (3.3.2.2.5);

步骤(3.3.2.2.5)从v*的权重中减去v**的权重,进行步骤(3.3.2.2.6);Step (3.3.2.2.5) subtracts the weight of v** from the weight of v*, and proceeds to step (3.3.2.2.6);

步骤(3.3.2.2.6)如果v**不是双向链表的最后一个超点,则将v**更新为双向链表的下一个超点并进入下一轮迭代,否则结束迭代;Step (3.3.2.2.6) If v** is not the last superpoint of the doubly linked list, then update v** to be the next superpoint of the doubly linked list and enter the next round of iteration, otherwise end the iteration;

步骤(3.3.2.3)从集合UV’中删除v*,进行步骤(3.3.2.4);Step (3.3.2.3) deletes v* from the set UV', proceed to step (3.3.2.4);

步骤(3.3.2.4)如果集合UV’为空则退出循环,否则进入下一次循环;Step (3.3.2.4) exit the loop if the set UV' is empty, otherwise enter the next loop;

步骤(3.3.3)得到细化后的最终的聚类结果bestC,细化完成;Step (3.3.3) obtains the final clustering result bestC after refinement, and the refinement is completed;

第四步、基于带重启随机游走算法对Mashup服务进行推荐,过程如下:The fourth step is to recommend the Mashup service based on the random walk algorithm with restart, the process is as follows:

其中,随机游走是一种被广泛地用于描述自然界中许多实体的运动方式的数学统计模型,在本发明设计的带重启的随机游走算法中,超图中的超点在每一步的游走过程中有两种选择,以特定的概率α∈(0,1)游走到邻近的超点或者以1-α的概率返回初始超点,这种随机游走在任意时刻都有可能回到初始节点,使趋于稳定的概率分布和初始节点的相关性加强,因此最终的服务推荐列表将更加符合需求Mashup;Among them, random walk is a mathematical statistical model widely used to describe the motion of many entities in nature. In the random walk algorithm with restart designed in the present invention, the superpoints in the hypergraph are at each step There are two options in the walking process, walk to the adjacent superpoint with a specific probability α∈(0,1) or return to the initial superpoint with the probability of 1-α, this random walk is possible at any time Back to the initial node, the correlation between the stable probability distribution and the initial node is strengthened, so the final service recommendation list will be more in line with the demand Mashup;

步骤(4.1)定义带重启的随机游走模型的符号,进行步骤(4.2);Step (4.1) defines the symbol of the random walk model with restart, and proceeds to step (4.2);

其中,tag表示Mashup标签、topic表示Mashup主题、iterTime’表示迭代次数、α表示重启概率、topQ表示推荐个数、recommend表示服务推荐列表和clusters(C1,C2,...,Ck)表示聚类分区其中C1,C2,...,Ck表示服务推荐列表中的服务;Among them, tag indicates the Mashup tag, topic indicates the Mashup topic, iterTime' indicates the number of iterations, α indicates the restart probability, topQ indicates the number of recommendations, recommend indicates the service recommendation list and clusters(C 1 ,C 2 ,...,C k ) Represents the cluster partition where C 1 , C 2 ,...,C k represent the services in the service recommendation list;

步骤(4.2)定义y(t)为在t时刻各个超点被访问的概率所构成的列向量,进行步骤(4.3);Step (4.2) defines y (t) as the column vector formed by the probability of each super point being visited at time t, and proceeds to step (4.3);

步骤(4.3)定义y(1)为y(t)在t为1时各个超点被访问的概率所构成的列向量,以与VO相同的顺序将VO长度的倒数存储在y(1)中;Step (4.3) defines y (1) as a column vector composed of the probabilities of each superpoint being visited when t is 1 in y (t) , and stores the reciprocal of the length of VO in y (1) in the same order as VO ;

步骤(4.4)进行步骤(4.4.1-4.4.2),将需求Mashup的描述文档、标签、主题构建成初始列向量q,进行步骤(4.5);Step (4.4) proceed to step (4.4.1-4.4.2), construct the description document, label, and subject of the required Mashup into an initial column vector q, and proceed to step (4.5);

其中,q为由0和1构成的初始列向量,其中代表初始超点的位置赋值为1,其余为0;Among them, q is an initial column vector composed of 0 and 1, where the position representing the initial superpoint is assigned a value of 1, and the rest are 0;

步骤(4.4.1)建立一个与VO长度相同的全为0的向量q,进行步骤(4.4.2);Step (4.4.1) establishes a vector q of all 0s with the same length as VO, and proceeds to step (4.4.2);

步骤(4.4.2)将需求Mashup的描述文档、标签、主题分别依次与VO中的超点比较,找出相同超点,在q中将对应索引位置1;Step (4.4.2) Compare the description document, label, and topic of the demand Mashup with the superpoints in VO in turn, find out the same superpoint, and set the corresponding index position 1 in q;

步骤(4.5)进行步骤(4.5.1-4.5.5),计算超点集的访问分布向量,进行步骤(4.6);Step (4.5) carries out step (4.5.1-4.5.5), calculates the visit distribution vector of superpoint set, carries out step (4.6);

步骤(4.5.1)利用超点、超边的度矩阵、权重矩阵、关联矩阵构建转移概率矩阵P,公式为P=VD-1HWED-1HT,进行步骤(4.5.2);Step (4.5.1) constructs transition probability matrix P by utilizing the degree matrix, weight matrix, and incidence matrix of hyperpoints and hyperedges, the formula is P=VD -1 HWED - 1HT, and proceeds to step (4.5.2);

其中,D-1表示D的逆矩阵,D-1D=E*。。Wherein, D -1 represents the inverse matrix of D, and D -1 D=E * . .

步骤(4.5.2)根据带重启的随机游走,得到整个超点集的访问分布为y(t+1)=αPy(t)+(1-α)q,进行步骤(4.5.3);Step (4.5.2) obtains the access distribution of the entire superpoint set as y (t+1) =αPy (t) +(1-α)q according to the random walk with restart, and proceeds to step (4.5.3);

步骤(4.5.3)进行初始化的设置,y(1)向量为初始分布向量,视为等概率值,即其中每一个元素的值为超点总数量的倒数,y(1)向量中的元素之和为1,y(1)向量与VO等长,进行步骤(4.5.4);Step (4.5.3) is initialized, the y (1) vector is the initial distribution vector, which is regarded as an equal probability value, that is, the value of each element is the reciprocal of the total number of super points, and the elements in the y (1) vector The sum is 1, the y (1) vector has the same length as VO, and proceeds to step (4.5.4);

步骤(4.5.4)从初始分布向量y(1),进行步骤(4.5.4.1-4.5.4.2)开启随机游走,对y(t)进行迭代,直到到达指定的迭代次数Tmax,也就是t=Tmax,进行步骤(4.5.5);Step (4.5.4) From the initial distribution vector y (1) , proceed to step (4.5.4.1-4.5.4.2) to start random walk, and iterate y (t) until it reaches the specified number of iterations Tmax, which is t =Tmax, carry out step (4.5.5);

步骤(4.5.4.1)根据步骤(4.5.2)计算y(t+1),进行步骤(4.5.4.2);Step (4.5.4.1) calculates y (t+1 ) according to step (4.5.2), and performs step (4.5.4.2);

步骤(4.5.4.2)更新t为t+1;Step (4.5.4.2) updates t to be t+1;

步骤(4.5.5)随机游走过程停止,此时向量y(t)已经不再变化;Step (4.5.5) the random walk process stops, and the vector y (t) no longer changes at this time;

步骤(4.6)对y(t)向量按照值进行从大到小排序,进行步骤(4.7);Step (4.6) sorts the y (t) vector from large to small according to the value, and performs step (4.7);

步骤(4.7)新建空的推荐列表recommend,进行步骤(4.8);Step (4.7) creates a new empty recommendation list recommend, and proceeds to step (4.8);

步骤(4.8)进行步骤(4.8.1-4.8.2),获取y(t)值最大的topQ个API作为推荐API放入recommend列表,进行步骤(4.9);Step (4.8) proceed to step (4.8.1-4.8.2), obtain the topQ APIs with the largest y (t) values and put them into the recommend list as recommended APIs, proceed to step (4.9);

步骤(4.8.1)遍历y(t)向量中的值对应的超点,进行步骤(4.8.2),直到recommend的长度达到topQ;Step (4.8.1) traverses the superpoint corresponding to the value in the y (t) vector, and proceeds to step (4.8.2) until the length of recommend reaches topQ;

步骤(4.8.2)对超点进行类型判断,如果是API类型则加入推荐列表,否则跳到下一次循环;Step (4.8.2) judges the type of the super point, if it is an API type, it is added to the recommendation list, otherwise it skips to the next cycle;

步骤(4.9)结束循环,返回推荐列表,其长度为topQ。此时recommend中经过大小排序且类型都为API的超点就是所求的推荐API,完成推荐,结束。Step (4.9) ends the loop and returns a recommendation list whose length is topQ. At this time, the superpoints in recommend that have been sorted by size and whose types are all APIs are the requested recommended APIs. Complete the recommendation and end.

参照图1~图4,在不同服务类数下其他推荐方法与本发明方法对比分析结果,包括以下步骤:With reference to Fig. 1~Fig. 4, the comparative analysis result of other recommended methods and the method of the present invention under different number of service classes comprises the following steps:

第一步、定义10种推荐方法;The first step is to define 10 recommended methods;

TF-IDF:根据TF-IDF模型对服务描述文档以及需求文档建模得到向量,计算相似度,选择相似度最高的Web API服务。TF-IDF: According to the TF-IDF model, model the service description document and requirement document to obtain vectors, calculate the similarity, and select the Web API service with the highest similarity.

E-LDA:根据LDA模型得到服务和API的主题特征,进行相似度计算,选择相似度最高的Web API服务进行推荐。E-LDA: According to the LDA model, the subject features of the service and API are obtained, the similarity is calculated, and the Web API service with the highest similarity is selected for recommendation.

T+K+CF:在T+K聚类基础上将需求Mashup与各聚类中心进行相似度计算,选择最相似的聚类进行基于物品的协同过滤算法,将得到的结果进行排序,对topK个Web API进行推荐;T+K+CF: On the basis of T+K clustering, calculate the similarity between the demand Mashup and each clustering center, select the most similar clustering to perform an item-based collaborative filtering algorithm, sort the obtained results, and rank topK a Web API to recommend;

其中,T+K表示通过TF-IDF将每个Mashup服务描述文档表示成向量的形式,用Kmeans算法进行聚类。其中,Kmeans算法是一种常见的聚类算法。算法会将数据集分为K个簇,每个簇使用簇内所有样本均值来表示,将该均值称为“质心”。Among them, T+K means that each Mashup service description document is expressed in the form of a vector through TF-IDF, and the Kmeans algorithm is used for clustering. Among them, the Kmeans algorithm is a common clustering algorithm. The algorithm divides the data set into K clusters, and each cluster is represented by the mean value of all samples in the cluster, which is called the "centroid".

LDA+K+CF:与方法T+K+CF类似,只是将Mashup描述文档的向量表示通过LDA主题模型建模得到的主题向量替换。LDA+K+CF: Similar to the method T+K+CF, except that the vector representation of the Mashup description document is replaced by the topic vector obtained by modeling the LDA topic model.

T+K+PopK:在T+K聚类基础上选择最相似的聚类,通过对聚类中Web API的受欢迎度进行排序,返回最受欢迎的topK个Web API进行推荐。Web API的受欢迎度定义为该WebAPI被当前聚类中Mashup调用的次数。T+K+PopK: Select the most similar cluster based on T+K clustering, sort the popularity of Web APIs in the clusters, and return the most popular topK Web APIs for recommendation. The popularity of a Web API is defined as the number of times the Web API is called by the Mashup in the current cluster.

RWR:根据Mashup和API的调用关系构建超图模型,将需求Mashup构建为初始向量,直接使用带重启的随机游走进行推荐。RWR: Construct a hypergraph model based on the call relationship between mashup and API, construct the demand mashup as an initial vector, and directly use random walk with restart to make recommendations.

HPS+RWR:在超图谱聚类的基础上,根据距离公式选择最近聚类,构建为初始向量,使用带重启的随机游走进行推荐。HPS+RWR: On the basis of hypergraph clustering, select the nearest cluster according to the distance formula, construct it as an initial vector, and use random walk with restart for recommendation.

其中,谱聚类是谱聚类算法建立在图论中的谱图理论基础上,其本质是将聚类问题转化为图的最优划分问题,是一种点对聚类算法;Among them, spectral clustering is a spectral clustering algorithm based on the spectral graph theory in graph theory. Its essence is to transform the clustering problem into the optimal partition problem of the graph. It is a point-to-point clustering algorithm;

hMETIS+RWR:使用hMETIS算法进行超图分区,同样根据距离公式,并构建初始向量,使用带重启的随机游走进行推荐。hMETIS+RWR: Use the hMETIS algorithm to partition the hypergraph, also based on the distance formula, and construct the initial vector, and use the random walk with restart to make recommendations.

其中,hMETIS是目前流行的超图分区算法,在分区前会将超图进行粗化,并得到平衡的分区。Among them, hMETIS is a currently popular hypergraph partitioning algorithm, which coarsens the hypergraph before partitioning and obtains balanced partitions.

hMATaTo-RW:在MATaTo模型上进行随机游走,将需求Mashup构建为初始向量,迭代一定次数后将向量排序,得到Web API推荐列表。hMATaTo-RW: Perform a random walk on the MATaTo model, construct the demand Mashup as the initial vector, and sort the vectors after a certain number of iterations to obtain the Web API recommendation list.

hMATaTo-RWR:本发明提出的方法,在MATaTo模型上进行本发明提出的带重启的随机游走得到Web API推荐列表。hMATaTo-RWR: The method proposed by the present invention, the random walk with restart proposed by the present invention is performed on the MATaTo model to obtain the Web API recommendation list.

第二步、从ProgrammableWeb平台爬取Mashup服务数据,并选取三个公开数据集MovieLens/Cora/TwitterFootball进行实验,并计算各方法推荐结果的准确率、召回率、折扣累计增益DCG@R以及汉明距离HMD四个指标;The second step is to crawl the Mashup service data from the ProgrammableWeb platform, and select three public datasets MovieLens/Cora/TwitterFootball for experiments, and calculate the accuracy rate, recall rate, discount cumulative gain DCG@R and Hamming of the recommendation results of each method Four indicators from HMD;

其中,准确率的计算公式定义如下:Among them, the calculation formula of the accuracy rate is defined as follows:

Figure BDA0003880083200000271
其中,WA表示被需求Mashup服务真实调用的Web API集合,WAR表示方法所得的推荐列表。准确率值越高,表明推荐效果越好;
Figure BDA0003880083200000271
Among them, WA represents the set of Web APIs that are actually called by the required Mashup service, and WAR represents the recommended list obtained by the method. The higher the accuracy value, the better the recommendation effect;

其中,召回率的计算公式为

Figure BDA0003880083200000272
召回率值越高,表明推荐效果越好;Among them, the formula for calculating the recall rate is
Figure BDA0003880083200000272
The higher the recall value, the better the recommendation effect;

其中,折扣累计增益DCG@R的计算公式如下:Among them, the calculation formula of the discounted cumulative gain DCG@R is as follows:

Figure BDA0003880083200000273
其中,R表示推荐的Web API个数,r(i)表示推荐列表的第i个API是否被需求Mashup真实调用过,若调用过则r(i)值为1,否则为0。DCG@R值越高,表明推荐效果越好。
Figure BDA0003880083200000273
Among them, R represents the number of recommended Web APIs, and r(i) represents whether the i-th API in the recommendation list has been actually called by the demand mashup. If it has been called, the value of r(i) is 1, otherwise it is 0. The higher the DCG@R value, the better the recommendation effect.

其中,汉明距离的计算公式如下:Among them, the calculation formula of Hamming distance is as follows:

Figure BDA0003880083200000281
其中,Mm1,Mm2表示两个不同的需求Mashup,Q(Mm1,Mm2)表示对于不同的需求Mashup给出的服务推荐列表中相同的Web API数量。若两个Mashup的推荐列表中的Web API均相同,则HMD(Mm1,Mm2)为0,若都不同,则为1。汉明距离值越高,表明推荐结果越多样。
Figure BDA0003880083200000281
Among them, M m1 , M m2 represent two different demand mashups, and Q(M m1 , M m2 ) represents the same number of Web APIs in the service recommendation list given by different demand mashups. If the Web APIs in the recommended lists of the two mashups are the same, HMD(M m1 , M m2 ) is 0, and if they are different, it is 1. The higher the Hamming distance value, the more diverse the recommendation results.

第三步、对计算结果进行评估;The third step is to evaluate the calculation results;

图1~图4分别示出了在不同服务类数下本发明推荐方法与其他方法推荐结果的准确率、召回率、折扣累计增益DCG@R以及汉明距离HMD四个指标。如图所示,横坐标为Mashup服务类别数目,纵坐标分别为四个指标。Figures 1 to 4 respectively show the accuracy rate, recall rate, discount cumulative gain DCG@R and Hamming distance HMD four indicators of the recommendation method of the present invention and the recommendation results of other methods under different numbers of service categories. As shown in the figure, the abscissa is the number of Mashup service categories, and the ordinate is the four indicators.

由图1和图2可知,hMATaTo-RWR算法在召回率和准确率上均优于其他算法。总体来看,使用各方法得到的召回率和准确率随着服务类别数目的增多均有所下降,这是因为随着服务数量的增多,自然地对服务推荐的结果产生干扰。It can be seen from Figure 1 and Figure 2 that the hMATaTo-RWR algorithm is superior to other algorithms in terms of recall and accuracy. Overall, the recall rate and precision rate obtained by using each method decrease with the increase of the number of service categories, because as the number of services increases, it naturally interferes with the results of service recommendation.

由图3可知,hMATaTo-RWR的推荐列表能有效地区分出服务的优先级,其主要原因是本发明将需求Mashup的服务信息充分转化,以向量的形式在游走过程中不断迭代,最终得到的平稳分布与向量有强相关性。It can be seen from Figure 3 that the recommendation list of hMATaTo-RWR can effectively distinguish the priority of services. The main reason is that the present invention fully transforms the service information of the demand Mashup, iterates continuously in the form of vectors during the walk process, and finally obtains The stationary distribution of has a strong correlation with the vector.

由图4可知,随着服务类别数目的增多,各方法在汉明距离指标上都有所提升。这是因为服务类目的增多,给推荐的服务带来了更多的可能性,不再局限于仅有的类中。而hMATaTo-RWR通过随机游走在全局节点上进行游走,即使是冷门的服务也会进行推荐,因此有较好的多样性。It can be seen from Figure 4 that with the increase of the number of service categories, the Hamming distance index of each method has improved. This is because the increase of service categories brings more possibilities to the recommended services, which are no longer limited to only one category. However, hMATaTo-RWR walks on the global nodes through random walks, and recommends even unpopular services, so it has better diversity.

本实施例提出的hMATaTo-RWR的实验效果优于上述方法。原因在于:hMATaTo-RWR方法使用谱聚类进行聚类,该方法更与图模型契合,有更好的聚类结果。除此之外,hMATaTo-RWR方法利用了大量辅助信息如标签、主题信息,结合了在模型中融合的五种高阶关系,有效缓解稀疏性问题,同时,在模型中融入了共现度、流行度等非功能性特征。因此,hMATaTo-RWR方法在各个评估指标上均表现出非常好的结果。The experimental effect of hMATaTo-RWR proposed in this example is better than that of the above method. The reason is that the hMATaTo-RWR method uses spectral clustering for clustering, which is more consistent with the graphical model and has better clustering results. In addition, the hMATaTo-RWR method utilizes a large amount of auxiliary information such as tags and topic information, and combines five high-order relationships fused in the model to effectively alleviate the sparsity problem. At the same time, the model incorporates co-occurrence, Non-functional characteristics such as popularity. Therefore, the hMATaTo-RWR method shows very good results on various evaluation metrics.

Claims (9)

1. A Saas-oriented Web Api diversity recommendation method fusing a random walk algorithm with restart function is characterized by comprising the following steps of:
firstly, extracting a MATaTo (Mashup-API-Tag-Topic) data model, and the process is as follows:
step (1.1) crawling Mashup service set information from an internet website, and performing step (1.2);
step (1.2) analyzing the crawled Mashup service set, selecting meta-information which is beneficial to improving service recommendation accuracy and rich in content, and performing step (1.3);
the information obtained by crawling comprises service description documents, calling relations between mashups and APIs, topics, labels, service followers, service calling links and log information updating;
step (1.3) defines the symbols of the MATaTo model, M represents Mashup set, A represents API set, ta represents label set, to represents theme set, R1 represents the relation between mashups, R2 represents the relation between APIs, R3 represents the relation between mashups and the attributes thereof, R4 represents the relation between APIs and the attributes thereof, R5 represents the relation between mashups and APIs, n represents the number of over-points, M represents the number of over-edges, WHG is equal To R m×n Representing a weighted hypergraph, V representing a set of n hypergraphs, E representing a hyperedge, E representing a set of m hyperedges, R representing a set of m hyperedges m×n Representing a set of real numbers of size mxn; carrying out step (1.4);
step (1.4) traversing the crawled information, and extracting an MATaTo data model from the crawled information;
secondly, constructing a hypergraph based on the MATaTo data model, wherein the process is as follows:
step (2.1) modeling Mashup and API description documents by using an NMF model to obtain corresponding theme vectors, and performing step (2.2);
the NMF model is non-negative matrix decomposition and is an unsupervised learning algorithm;
step (2.2) forming a hypergpoint of the hypergraph aiming at the four types of objects according to the MATaTo data model;
step (2.3) according to five high-order relations, ten types of super edges are established, and ten types of super edges are defined
Figure FDA0003880083190000011
E M*ToTa* 、E A*To*Ta* 、E M*A*To* 、E M*A*Ta*
Figure FDA0003880083190000012
And
Figure FDA0003880083190000013
establishing a super edge set of the hypergraph;
thirdly, performing multi-level clustering on the hypergraph formed by the MATaTo data model, wherein the process is as follows:
step (3.1) coarsening stage: continuously coarsening the original map beyond points until the scale of the map is reduced to a certain degree;
step (3.2) clustering stage: performing spectral clustering on the coarsest hypergraph obtained in the coarsening stage by using different parameters, selecting the cluster with the optimal effect according to the result, and performing the step (3.3);
step (3.3) refinement stage: mapping the overtop roughened in the roughening stage to the cluster, and ending;
fourthly, recommending Mashup service based on a random walk algorithm with restart;
the super-point in the hypergraph has two choices in the wandering process of each step, the super-point is wandered to an adjacent super-point by a specific probability alpha epsilon (0,1) or returns to an initial super-point by a probability of 1-alpha, the random wandering is likely to return to the initial node at any time, so that the probability distribution which tends to be stable and the correlation of the initial node are strengthened, and the final service recommendation list is more in line with the requirement Mashup.
2. The Saas-oriented Web Api diversity recommendation method with restart random walk algorithm converged according to claim 1, wherein in the step (1.2), the service description document is a substitute of a traditional Web service WSDL document, and describes service information through an intuitive language, which has been a research focus of a service recommendation researcher;
the calling relationship between Mashup and the API comprises an indivisible relationship between Mashup and the API, and the calling relationship between Mashup and the API is often ignored in the existing research, so that the calling relationship between Mashup and the API needs to be used, and on the basis of the calling relationship, information of Mashup and API popularity, mashup and API, mashup and Mashup, and API co-occurrence degree can be extended;
the theme can reflect the functional characteristics of the service, the service application range has rough qualitative, the service under the same theme has similar functions, the label is usually added manually by the user, and the service is subjected to supplementary description from the subjective angle of the user;
the service call link only represents the resource address of the service and is not really helpful, so that the service call link information is not used.
3. The Saas-oriented Web Api diversity recommendation method with restart random walk algorithm converged according to claim 1 or 2, wherein the procedure of step (1.4) is as follows:
extracting four types of information including a service description document, a Mashup and API calling relation, a theme and a label, and performing the step (1.4.2);
the description of the four types of information is specifically as follows:
mashup (M) represents Mashup service and is a recommended requester in a service recommendation scene, and the Mashup carries information of a label and a theme and has a calling relation with an API;
the API (A) represents a Web API, is a recommendation object in a service recommendation scene, is the same as a Mashup service object, also has information of a label and a theme, and has a called relation with the Mashup;
the label (Ta) represents the crawled label information, and Mashup with the same label is more closely related to the API;
the theme (To) represents the crawled theme information, and Mashup and API under the same theme have similar functions;
step (1.4.2) summarizes the high-order relations of five types of R1 (M-Ta-To), R2 (A-Ta-To), R3 (M-M), R4 (A-A) and R5 (M-A);
the description of the five high-order relationships is specifically as follows:
r1 (M-Ta-To) represents Mashup and the relationship of the attribute thereof, and the Mashup carries out supplementary explanation by using the theme and label information except the description document of the Mashup;
r2 (A-Ta-To) represents the relationship between the API and the attribute thereof, the API is the same as R1, the API is supplemented and explained by the theme and label information carried by the API, and the Mashup calls the API with the same theme and label;
r3 (M-M) represents the relation between mashups, and the similarity between different Mashup description documents is calculated to represent the relation between different Mashup services;
r4 (A-A) represents the relation between APIs, and like R3, the relation between API services is represented by using similarity, besides, the co-occurrence degree of two or more APIs called simultaneously can be reflected, and the recommendation accuracy is improved to a certain extent;
r5 (M x-a) represents the relationship between Mashup and the API, mashup is composed of API, and the subject or label of API can also describe the functionality of Mashup except for the explicit call relationship.
4. The Saas-oriented Web Api diversity recommendation method with restart random walk algorithm converged according to claim 1 or 2, wherein the procedure of step (2.2) is as follows:
step (2.2.1) defines Mashup set M = { M = { (M) } 1 ,m 2 ,m 3 ,...,m N }, carrying out step (2.2.2);
wherein N is the total number of mashups;
step (2.2.2) definition of Mashup m i The API set called is A i ={a i1 ,a i1 ,a i1 ,...,a ik' }, carrying out step (2.2.3);
wherein i represents the number of the Mashup in the set M, and k' is Mashup M i The total number of called APIs is that all the called APIs of Mashup are collected as
Figure FDA0003880083190000031
Wherein, U represents a union set;
defining Ta and To as label and subject information sets carried by all mashups and APIs, and carrying out the step (2.2.4);
step (2.2.4) initializes the hyper graph with the hyper point set of V = MeuMeuTu a ∪T o
5. The Saas-oriented Web Api diversity recommendation method with restart random walk algorithm converged according to claim 1 or 2, wherein in the step (2.3), the definition of the excess edges is specified as follows:
Figure FDA0003880083190000032
the theme can functionally describe mashups, the themes carried by the mashups are limited, the themes of the mashups are expanded through the mashups similar to the themes, for each Mashup, top-k similar mashups and the themes of the top-k similar mashups form a super edge, the weight of the super edge is set as the average value of the similarity between the top-k similar mashups and the themes of the top-k similar mashups, the similarity is calculated by adopting the common cosine similarity, wherein the calculation formula of the cosine similarity is as follows:
Figure FDA0003880083190000033
where θ represents the angle of the pre-similarity, vm x' Represents the feature vector of the x' th Mashup, | vm x' | represents a modulus of the feature vector;
Figure FDA0003880083190000034
the tags can describe mashups in detail, the tags carried by the mashups are limited, the tags of the mashups are extended through the mashups similar to the tags, for each Mashup, top-k similar mashups and the tags of the top-k similar mashups form a super edge, the weight of the super edge is set to be the average value of the similarity between the top-k similar mashups and the tags of the top-k similar mashups, the similarity is calculated, and the step (2.3) is carried out
Figure FDA0003880083190000035
The calculation formulas of the cosine similarity are the same;
Figure FDA0003880083190000036
the number of the tags carried by one API is limited, the tags of the API can be expanded through the similarity with other APIs, for each API, top-k similar APIs and the subjects of the similar APIs are selected to form a super edge, the weight of the super edge is set to be the average value of the similarity with the k APIs, the similarity is calculated, and the step (2.3) is carried out
Figure FDA0003880083190000037
The calculation formulas of the cosine similarity are the same;
Figure FDA0003880083190000038
similarly, the API can expand its own tag by the similarity with other APIs, for each API, top-k similar APIs and their tags are selected to form a super edge, the weight of the super edge is set as the average value of the similarity with k APIs, the calculation of the similarity and the step (2.3) are carried out
Figure FDA0003880083190000039
The calculation formulas of the cosine similarity are the same;
E M*To*Ta* : one Mashup carries the theme and label information, and for each Mashup, any number of themes and labels carried by the Mashup are selected to form a super edge, and the weight of the edge is set to be 1;
E A*To*Ta* : one API carries the information of the theme and the label, and for each API, any number of themes and labels carried by the API are selected to form a super edge, and the weight of the edge is set to be 1;
E M*A*To* : one Mashup calls one or more APIs, the themes of the APIs can also supplement the themes of the Mashup, for each Mashup, all the called APIs and the carried themes of the Mashup are selected, and the weight of the edge is set to be 1;
E M*A*Ta* : as above, one Mashup calls one or more APIs, the tags of these APIs can also supplement the tags of Mashup, and for each Mashup, all the APIs called by the Mashup and the carried tags are selected, and the weight of the edge is set to 1;
Figure FDA0003880083190000041
popular APIs tend to represent higher quality, and for each API, a super edge is constructed with the topic in which the API is located, and the weight of the super edge is set as the popularity size, wherein the popularity size is expressed by the following formula:
Figure FDA0003880083190000042
wherein Call (A) r ) Representing API A r Frequency of calls by Mashup, topic (A) r ) Is represented by the formula A r All APIs with the same Topic, minCall (Topic (A) r ) Are) represented by A r The API with the lowest frequency of calls, maxCallL (Topic (A), among all APIs in the same Topic r ) Are) represented by A r Calling the API with the highest frequency in all the APIs in the same theme;
Figure FDA0003880083190000043
the APIs frequently called together have relevance, the edge contains all APIs with the co-occurrence number larger than 1, the weight of the edge is set to be the co-occurrence size, and the formula of the co-occurrence size is as follows:
Figure FDA0003880083190000044
wherein, | A m' ∩...∩A n' I denotes API A m’ ,...,A n’ Number of times called by the same Mashup, | a m' ∪...∪A n' I denotes APIA m’ ,...,A n’ Total number of calls by Mashup.
6. The Saas-oriented Web Api diversity recommendation method with restart random walk algorithm converged according to claim 1 or 2, wherein the procedure of step (3.1) is as follows:
defining a symbol required by a coarsening stage, c representing a weight set of a super point, w representing a weight set of a super edge, G = (V, E, c, w) representing an original super graph, iterTime representing iteration times, coarse _ limit representing a coarsening threshold, max _ weight representing a weight threshold of the super point, G ' = (V ', E ', c ', w ') representing a graph after coarsening, V ' representing a super point which is not combined after coarsening, E ' representing a super edge after coarsening, c ' representing a weight set of a super point after coarsening, and w ' representing a weight set after the super edge is coarsened, and performing step (3.1.2);
step (3.1.2) adding all the super points into the un-visited super point set UV, and performing step (3.1.3);
step (3.1.3) initializing an empty set C for recording all coarsened overtops, and performing step (3.1.4);
step (3.1.4) defining a variable I for storing iteration times, initializing the variable I to be 0, performing step (3.1.4.1-3.1.4.4), coarsening the over point in the set UV, and performing step (3.1.5);
step (3.1.4.1) is to define a set UVT, store a set of the super points which need to be coarsened for the iteration, copy the super points in the set UV into the set UVT, and perform step (3.1.4.2);
step (3.1.4.2) is carried out (3.1.4.2.1-3.1.4.2.12), the overtopping of the UVT is carried out, until the iteration number reaches len (UVT), and step (3.1.4.3) is carried out;
wherein len () represents the size of the acquisition set, len (UVT) represents the size of the set UVT;
step (3.1.4.2.1) randomly selecting a super point in the set UVT, named rv, and performing step (3.1.4.2.2);
step (3.1.4.2.2) of judging whether the over point rv belongs to the set UVT, if so, performing step (3.1.4.2.3), and if not, directly entering the next round of circulation;
step (3.1.4.2.3) defining a singularity cv, copying the singularity rv to the cv, defining score for storing the score, and initializing the score to 0, and performing step (3.1.4.2.4);
step (3.1.4.2.4) traverses other superpoints that neighbor the superpoint rv and copies the superpoint to iv, goes to step (3.1.4.2.4.1-3.1.4.2.4.3), finds the superpoint iv to merge with rv, goes to step (3.1.4.2.5);
wherein, the other super point adjacent to the super point rv is the super point in the same side with the super point;
the step (3.1.4.2.4.1) of calculating the sum of the weight of rv and the weight of iv;
step (3.1.4.2.4.2) judges whether the weight sum is greater than the over-point weight threshold max _ weight, if not, step (3.1.4.2.4.3) is carried out, and if yes, the next cycle is directly entered;
step (3.1.4.2.4.3) judging whether iv is a overtop in the set UVT, if yes, directly entering the next cycle, otherwise, directly ending the cycle;
step (3.1.4.2.5) calculating a scoring function of rv and iv according to the scoring function, defining a variable cur _ score, storing the score in cur _ score, and performing step (3.1.4.2.6);
wherein the scoring function is score (rv, iv) = a · con (rv, iv) + b · hNcut (rv, iv), the values of a, b are empirical values, and are obtained by adjusting parameters according to experimental effects,
Figure FDA0003880083190000051
wherein, I (rv) and I (iv) respectively represent the super edge sets containing the super points rv and iv, and n represents the intersection, i.e. the elements existing in the left and right sets of the symbol,
Figure FDA0003880083190000052
wherein c (rv) and c (iv) are the weights of the super points rv and iv, respectively, e 1 Represents a super point, w (e), which is present on the same super edge as the super points rv, iv at the same time 1 ) Indicating a supercide e 1 Weight of (c), de (e) 1 ) Indicating a supercide e 1 Degree of (d), de (e) 1 )=|e 1 |,|e 1 I denotes the super edge e 1 The number of over points in; Σ represents summation, where hNcut (rv, iv) represents normalized cutting of the hyper-edge where the hyper-points rv and iv are located, and represents the correlation of two cluster classes when rv and iv are respectively regarded as one cluster class, and the calculation formula is
Figure FDA0003880083190000053
Wherein dv (rv) and dv (iv) represent the degree of the super point rv, iv, i.e. the sum of the weights of the edges containing the super point rv, iv, e 2 Representation of belonging to the set Q e Side of (C), Q e Representing the super edge set cut by the two sets, and when the super point in one super edge exists in different cluster types, the super edge is called as the cut super edge;
step (3.1.4.2.6) judging whether cur _ score is larger than score, if yes, copying the over point iv into cv, assigning the value of cur _ score to score, and performing step (3.1.4.2.7);
step (3.1.4.2.7) merging rv and cv, the invention uses the form of a two-way linked list to connect the super points with merging relation, and then step (3.1.4.2.8) is carried out;
the bidirectional linked list is one of the linked lists, and each data node of the bidirectional linked list is provided with two pointers which respectively point to a direct successor and a direct predecessor, namely, a pointer which points to the other side is arranged between two nodes which are directly connected;
step (3.1.4.2.8) adds the weight of the merged hyper point cv to rv, and step (3.1.4.2.9) is performed;
step (3.1.4.2.9) removing rv and cv from the UVT used in the current iteration, proceeding to step (3.1.4.2.10);
step (3.1.4.2.10) removing cv in UV, proceeding to step (3.1.4.2.11);
step (3.1.4.2.11) adds the over point cv to the set C, and step (3.1.4.2.12) is performed;
step (3.1.4.2.12) if the coarsened over-point number reaches the coarsening threshold coarse _ limit, exiting iteration, executing step (3.1.4.3), otherwise, directly entering the next iteration;
step (3.1.4.3) performs step (3.1.4.4) for I self-increment of 1, I = I +1;
step (3.1.4.4) judging whether the iteration time I is less than iterTime or not, whether the coarsened over-point number reaches a coarsening threshold coarse _ limit or not, if any condition is not met, stopping coarsening, stopping iteration, and executing step (3.1.5);
and (3.1.5) copying the coarsened super point, super edge, the weight set of the super point and the weight set of the super edge into V ', E ', c ' and w ' respectively to obtain a coarsening result, namely the super graph G ' = (V ', E ', c ', w ').
7. The Saas-oriented Web Api diversity recommendation method with restart random walk algorithm fused according to claim 1 or 2, wherein the procedure of step (3.2) is as follows:
step (3.2.1) performing step (3.2.1.1-3.2.1.6), constructing a Laplace matrix, and performing step (3.2.2);
the Laplace matrix is a matrix representation of a graph and is mainly applied to graph theory;
step (3.2.1.1) defining VO and EO as arrays for storing all the super points and the corresponding index positions in the hypergraph and all the super edges and the corresponding index positions in the hypergraph respectively, and storing all the super points and the corresponding index positions in the hypergraph in the VO; storing all the hyperedges and the corresponding index positions in the hypergraph in the EO, and performing the step (3.2.1.2);
step (3.2.1.2) constructing a similarity matrix S according to the hypergraph information obtained by the hypergraph construction mode and the weight calculation method in the step (2.3), and performing the step (3.2.1.3);
the similarity matrix is an important statistical technique used for organizing the mutual similarity among a group of data points, and the similarity is the weight of the excess edges;
step (3.2.1.3) constructing a super-point matrix VD and a super-edge matrix ED according to the similarity matrix, and performing step (3.2.1.4);
wherein the over-point degree matrix VD is a diagonal matrix composed of over-point degrees, and the elements on the diagonal are degree dv (v) of the over-point a ) Wherein v is a Representing the overtop of the corresponding index position a in the VO;
wherein the super-edge matrix ED is a diagonal matrix composed of super-edges, and the elements on the diagonal are super-edge degrees de (e) b ) In which e is b Represents the overtop of the corresponding index position b in EO;
step (3.2.1.4) according to the similar matrix, performing step (3.2.1.4.1-3.2.1.4.4) to construct a correlation matrix H, and performing step (3.2.1.5);
wherein, the incidence matrix is an n multiplied by m matrix, the d-th row represents all the super edges to which the d-th super point belongs, the f-th column represents all the super points contained in the f-th super edge, and the f-th row and the f-th column of the matrix represent
Figure FDA0003880083190000061
Step (3.2.1.4.1) establishing a matrix H with n multiplied by m and all 0, and performing step (3.2.1.4.2);
step (3.2.1.4.2) traversing all the excess edges in the EO, and performing step (3.2.1.4.3);
step (3.2.1.4.3) traversing all over points in the over edge, and performing step (3.2.1.4.4);
step (3.2.1.4.4) setting 0 of the index position corresponding to the super edge and the super point to 1 in the matrix H;
step (3.2.1.5) according to the hypergraph information obtained by the hypergraph construction mode and the weight calculation method in the step (2.3), step (3.2.1.5.1-3.2.1.5.3) is carried out, a weight matrix W is constructed, and step (3.2.1.6) is carried out;
wherein, the weight matrix W is a diagonal matrix of the super edge weight, and the elements on the diagonal are the weight of the super edge;
step (3.2.1.5.1) establishing a matrix W with m multiplied by m and all 0 to carry out step (3.2.1.5.2);
step (3.2.1.5.2) traversing all over edges in the EO, and performing step (3.2.1.5.3);
step (3.2.1.5.3) replacing 0 in the matrix W with the weight of the excess edge according to the index position in EO as the value in the weight matrix;
step (3.2.1.6) constructing a Laplace matrix L according to the incidence matrix, the weight matrix, the super-edge matrix and the super-dot matrix;
wherein the calculation formula of the Laplace matrix is L = E-ED -1/2 HWVD -1 H T ED -1/2 Where E denotes the identity matrix, i.e. the matrix with 1 in the diagonal from the top left to the bottom right and 0 in the remaining positions, ED -1/2 And VD -1 Respectively representing that the matrix ED takes the power of (-1/2) and the matrix VD takes the power of-1, because ED and VD are diagonal matrices, H takes the power of the number on the diagonal line respectively T The transposition of H is shown, namely a matrix obtained by converting rows in the matrix H into columns with the same serial numbers;
step (3.2.2) of normalizing the Laplace matrix to obtain NL, and step (3.2.3) of performing;
the normalization of the matrix refers to that the data of the matrix is mapped between intervals (0,1) in units of columns, namely, the minimum value of each column is subtracted from the data of each column and divided by the maximum value of the column;
step (3.2.3) solving the Laplace matrix NL, calculating an eigenvalue to obtain an eigenvector, and performing step (3.2.4);
wherein the definition of the characteristic value and the characteristic vector is as follows: the matrix multiplication corresponds to a transformation, any vector is changed into a new vector with different directions or lengths, in the transformation process, the original vector mainly changes in rotation and expansion, if the matrix only performs expansion transformation on a certain vector or some vectors and does not have the effect of rotation on the vectors, the vectors are called as the characteristic vectors of the matrix, and the expansion ratio is the characteristic value;
step (3.2.4) constructing the feature vectors into a matrix NE according to columns, and carrying out normalization processing to carry out step (3.2.5);
step (3.2.5) of defining bestC for storing the best clustering result, defining BestE for storing the score of the best clustering result, initializing BestE to 0, defining T _ num for storing the repeated clustering times, initializing T _ num to 0, defining retryTime to the repeated clustering times, and performing step (3.2.6);
step (3.2.6) executing step (3.2.6.1-3.2.6.5), clustering NE, evaluating clustering effect, circulating retryTime times, finding the cluster with highest solution quality, and executing step (3.2.7);
step (3.2.6.1) self-increments T _ num by 1, i.e., T _ num = T _ num +1, and proceeds to step (3.2.6.2);
step (3.2.6.2) clustering is carried out on the matrix NE by using a K-Means method, and the obtained vector clustering result is used as a clustering result of the corresponding over point, so that step (3.2.6.3) is carried out;
where the K-Means method divides the data set into K clusters, each cluster is represented using the mean of all samples within the cluster, which is referred to as the "centroid";
step (3.2.6.3) evaluating the clustering effect according to the scoring function, and performing step (3.2.6.4);
wherein the scoring function is formulated as
Figure FDA0003880083190000071
Where me represents the number of cluster classes in the clustering result bestC, V mi Denotes the mi cluster class in bestC, hCut (V) mi ) Is of the formula
Figure FDA0003880083190000072
Wherein, w (e) cut ) Represents a super edge e cut Weight of (c), beta (V) mi ) Is represented by V mi Super edge set composed of cut super edges, e cut Represents the super edge set beta (V) mi ) The length of the super edge of (1),
Figure FDA0003880083190000073
represents V mi Complement, | e, in set V cut ∩V mi I represents e cut And V mi Size of union, i.e. simultaneous presence of excess edges e cut And V mi The number of overtops;
step (3.2.6.4) if the score of the current cluster is higher than Beste, performing step (3.2.6.4.1-3.2.6.4.2) to replace the original clustering result with the current clustering result, otherwise performing step (3.2.6.5);
step (3.2.6.4.1) replacing the current data clustering result into bestC, and step (3.2.6.4.2) is carried out;
step (3.2.6.4.2) assigning the score of the current data clustering result to Beste;
step (3.2.6.5) of determining whether T _ num is greater than or equal to retryTime, if yes, exiting the loop, otherwise entering the next loop;
and (3.2.7) ending the circulation to obtain a clustering result bestC.
8. The Saas-oriented Web Api diversity recommendation method with restart random walk algorithm converged according to claim 1 or 2, wherein the procedure of step (3.3) is as follows:
step (3.3.1) defining UV 'to represent the super point needing to be refined, copying the super point set V' of the super graph G 'into the UV', and performing step (3.3.2);
step (3.3.2) enters a loop, step (3.3.2.1-3.3.2.4) is executed, the super points in the UV' are refined until all the super points are refined, and step (3.3.3) is carried out;
step (3.3.2.1) randomly taking one overtop v from the set UV', and performing step (3.3.2.2);
step (3.3.2.2) sequentially traversing the bidirectional linked list storing the super points which are merged by the super points v, acquiring the super points v, executing step (3.3.2.1-3.3.2.2.6), finding the cluster class to which the super points v belong, and performing step (3.3.2.3);
defining cc to represent the cluster class of the super point v, initializing the clustering result to be the same as the super point v, and performing the step (3.3.2.2.2);
step (3.3.2.2.2) defining distance representing the distance between the super point v and the cluster class to which the super point v belongs, calculating the distance between the super point v and the cluster class cc, and initializing the distance, and performing step (3.3.2.2.3);
the formula for calculating the distance is as follows:
Figure FDA0003880083190000081
wherein,
Figure FDA0003880083190000082
representing that when the type of the over point is a theme, the over point has functional meaning in the cluster, moving the type of the over point can cause the deviation of the function of the cluster, and the larger the weight of the over point is, the larger the influence of the corresponding function in the cluster is, so punishment is carried out according to the weight of the over point; v. of r1 、v r2 、v r3 Represents a super-point, Q, present in the cluster class cc e ' indicating simultaneous inclusion of a point of excess v r1 And v r2 E' represents the super edge in the set, v r3 Represents a super-point, Q, present in the cluster class cc e "indicates that the super point v is included at the same time ** And v r3 E' "denotes the super edge in the set, len (cc) 2 Represents the square of len (cc), i.e. the square of the size of the set cc, dv (v) represents the degree of the super point v;
step (3.3.2.2.3) traversing cluster classes in the clustering result bestC, named cluster, executing step (3.3.2.2.3.1-3.3.2.2.3.2), finding cluster classes closest to v x, and executing step (3.3.2.2.4);
step (3.3.2.2.3.1) calculating the distance between v and cluster according to the distance formula in step (3.3.2.2.2), and performing step (3.3.2.2.3.2);
step (3.3.2.2.3.2) updating cc to cluster if v x is less than the distance from cluster, updating distance to v x distance from cluster;
step (3.3.2.2.4) moving v into cluster class cc, proceeding to step (3.3.2.2.5);
step (3.3.2.2.5) of subtracting the weight of v from the weight of v, and performing step (3.3.2.2.6);
step (3.3.2.2.6) if v is not the last overtop of the bi-directional linked list, updating v to the next overtop of the bi-directional linked list and entering the next iteration, otherwise ending the iteration;
step (3.3.2.3) deleting v from the set UV', proceeding to step (3.3.2.4);
step (3.3.2.4) exiting the loop if the set UV' is empty, otherwise entering the next loop;
and (3.3.3) obtaining a refined final clustering result bestC, and finishing the refinement.
9. The Saas-oriented Web Api diversity recommendation method with restart random walk algorithm converged according to claim 1 or 2, wherein the fourth step is performed by:
step (4.1) defining a symbol of the random walk model with restart, and performing step (4.2);
wherein tag represents Mashup label, topic represents Mashup theme, iterTime' represents iteration times, alpha represents restart probability, topQ represents recommendation number, recommend represents service recommendation list and clusters (C) 1 ,C 2 ,...,C k ) Representing cluster partitions where C 1 ,C 2 ,...,C k Representing services in a service recommendation list;
step (4.2) definition of y (t) Step (4.3) is carried out for the column vector formed by the access probability of each over point at the time t;
step (4.3) definition of y (1) Is y (t) When t is 1, the column vector composed of the probability of each super point being accessed stores the reciprocal of the VO length in y in the same order as VO (1) Performing the following steps;
step (4.4) performing step (4.4.1-4.4.2), constructing a description document, a label and a theme requiring Mashup into an initial column vector q, and performing step (4.5);
wherein q is an initial column vector consisting of 0 and 1, wherein the position assignment representing the initial super point is 1, and the rest are 0;
step (4.4.1) establishing a vector q which has the same length as the VO and is 0, and performing step (4.4.2);
step (4.4.2) comparing the description document, the label and the theme of the Mashup in demand with the super points in the VO in sequence respectively, finding out the same super points, and corresponding to an index position 1 in q;
step (4.5) carrying out step (4.5.1-4.5.5), calculating the access distribution vector of the super point set, and carrying out step (4.6);
step (4.5.1) utilizes the degree matrix, the weight matrix and the incidence matrix of the over point and the over edge to construct a transition probability matrix P, and the formula is P = VD -1 HWED -1 H T And (4.5.2) performing the step;
wherein D is -1 Representing the inverse matrix of D, D -1 D = E, E denotes an identity matrix, i.e., a matrix in which elements on a diagonal line from an upper left corner to a lower right corner are all 1 and the remaining positions are all 0, H T The transposition of H is shown, namely a matrix obtained by converting rows in the matrix H into columns with the same serial numbers;
step (4.5.2) obtaining the access distribution y of the whole super point set according to the random walk with restart (t+1) =αPy (t) Q (1-. Alpha.) and carrying out step (4.5.3);
step (4.5.3) setting for initialization, y (1) The vector is an initial distribution vector and is regarded as an equal probability value, namely the value of each element is the reciprocal of the total number of the over points, y (1) The sum of the elements in the vector is 1,y (1) The vector is as long as the VO, and the step (4.5.4) is carried out;
step (4.5.4) from initial distribution vector y (1) Step 4.5.4.1-4.5.4.2 is carried out to start random walk, and y is processed (t) Performing iteration until reaching a specified iteration time Tmax, namely t = Tmax, and performing the step (4.5.5);
step (4.5.4.1) calculate y from step (4.5.2) (t+1) Proceeding to step (4.5.4.2);
step (4.5.4.2) updates t to t +1;
step (4.5.5) the random walk process stops, at which point the vector y (t) Has not changed;
step (4.6) for y (t) The vectors are sorted from big to small according to the values, and the step (4.7) is carried out;
step (4.7) of creating an empty recommendation list recommend, and performing step (4.8);
step (4.8) step (4.8.1-4.8.2) is performed to obtain y (t) The topQ APIs with the maximum values are used as recommended APIs to be placed into a recammend list, and the step (4.9) is carried out;
step (4.8.1) traverse y (t) The value in the vector corresponds to a super point, andproceeding to the step (4.8.2) until the length of recommend reaches topQ;
step (4.8.2) judging the type of the over point, if the over point is the API type, adding a recommendation list, and otherwise jumping to the next cycle;
and (4.9) ending the circulation, returning a recommendation list, wherein the length of the recommendation list is topQ, and the superpoints which are contained in the recommend and have the size sequencing and the API types are the required recommendation API, completing the recommendation and ending.
CN202211226862.7A 2022-10-09 2022-10-09 SaaS-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm Withdrawn CN115599980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211226862.7A CN115599980A (en) 2022-10-09 2022-10-09 SaaS-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211226862.7A CN115599980A (en) 2022-10-09 2022-10-09 SaaS-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm

Publications (1)

Publication Number Publication Date
CN115599980A true CN115599980A (en) 2023-01-13

Family

ID=84846327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211226862.7A Withdrawn CN115599980A (en) 2022-10-09 2022-10-09 SaaS-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm

Country Status (1)

Country Link
CN (1) CN115599980A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119669580A (en) * 2025-02-20 2025-03-21 江西省影票文化传媒股份有限公司 A content recommendation method and system based on local life service platform
CN119763672A (en) * 2024-12-30 2025-04-04 广州悉地医药科技有限公司 Method and system for excavating relationship between lncRNA and diseases

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119763672A (en) * 2024-12-30 2025-04-04 广州悉地医药科技有限公司 Method and system for excavating relationship between lncRNA and diseases
CN119669580A (en) * 2025-02-20 2025-03-21 江西省影票文化传媒股份有限公司 A content recommendation method and system based on local life service platform

Similar Documents

Publication Publication Date Title
Aggarwal An introduction to cluster analysis
Hofmann The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data
Zhang et al. Service recommendation based on quotient space granularity analysis and covering algorithm on Spark
Aghdam et al. A novel non-negative matrix factorization method for recommender systems
CN115599980A (en) SaaS-oriented Web Api diversity recommendation method with fusion and restart random walk algorithm
CN109255125A (en) A kind of Web service clustering method based on improvement DBSCAN algorithm
Cao et al. Rankcompete: Simultaneous ranking and clustering of information networks
CN115620040A (en) Cloud-oriented Mashup service clustering method based on hypergraph multilevel clustering
Hazratgholizadeh et al. Active constrained deep embedded clustering with dual source
CN119228271B (en) Library inventory management method and system
CN116739402A (en) Health portrait construction method based on data mining
CN120087201A (en) Intelligent clothing matching recommendation system and method based on multimodal learning
Martínez-García et al. Construction of an outranking relation based on semantic criteria with ELECTRE-III
CN119760057A (en) Response large model retrieval enhancement method and device based on hierarchical cluster index structure
CN112203152A (en) Multi-modal confrontation learning type video recommendation method and system
Ye et al. Incorporating user's preference into attributed graph clustering
Malleshappa et al. Classification of Web Pages Using the Machine Learning Algorithms with Web Page Recommendations.
CN110162580A (en) Data mining and depth analysis method and application based on distributed early warning platform
CN112434174B (en) Method, device, equipment and medium for identifying a publishing account of multimedia information
CN116431877A (en) Webpage big data content clustering method driven by cloud computing platform
Escalante et al. Multi-class particle swarm model selection for automatic image annotation
More et al. KH-FC: krill herd-based fractional calculus algorithm for text document clustering using MapReduce structure
Zeng et al. Multi-strategy continual learning for knowledge refinement and consolidation
CN114826921B (en) Method, system and medium for dynamic allocation of network resources based on sampling subgraph
CN120086414B (en) File retrieval method and system based on cloud computing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20230113

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载