CN108492200B

CN108492200B - User attribute inference method and device based on convolutional neural network

Info

Publication number: CN108492200B
Application number: CN201810124041.XA
Authority: CN
Inventors: 曹亚男; 李晓雪; 尚燕敏; 刘燕兵; 谭建龙; 郭莉
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2018-02-07
Filing date: 2018-02-07
Publication date: 2022-06-17
Anticipated expiration: 2038-02-07
Also published as: CN108492200A

Abstract

The invention relates to a method and device for inferring user attributes based on a convolutional neural network. The method builds a self-centered network from the attributes and friend relationships of user nodes, then uses a convolutional neural network to extract the attribute information of the user nodes in that network together with the hidden information contained in the friend relationships, and uses this hidden information to infer the user's missing attributes. For social networks where friend relationships cannot be acquired directly or are difficult to acquire, a neural network classifies and predicts the missing attributes using only the user's attribute information. The invention avoids the limitations of a manually defined similarity function, and the convolution operation of the convolution kernels better captures the relationships between different attributes and between different attribute dimensions, so that missing user attributes can be inferred efficiently and accurately.

Description

A method and device for user attribute inference based on convolutional neural network

Technical field

The invention belongs to the technical field of inferring missing user attributes in social networks, in particular multi-valued attributes. It specifically relates to a method and device for inferring user attributes based on a convolutional neural network, which achieve high accuracy.

Background

With the development of Internet technology, online social products such as Weibo, Zhihu and Facebook have become part of users' daily lives. While using these products, users generate a large amount of information, including attribute information, posted content and friend relationships, which gives enterprises and researchers the data needed to build accurate user profiles. At the same time, to protect user privacy, online social products offer fine-grained privacy settings, which makes user attribute information hard to obtain directly. According to relevant statistics, the missing rate of user attribute information is as high as 70%. This large amount of missing information is the biggest obstacle to accurate user profiling, so user attribute inference has attracted wide attention in both industry and research.

Traditional methods fall into two categories: classification-based methods and label-propagation-based methods. The former rest on the idea of "what you belong to" and usually predict unknown attributes by computing the similarity between a marked node and its neighbor nodes. How the similarity is computed depends on the specific method; classic classification algorithms include SVM and Bayesian classifiers. Some scholars have also proposed similarity measures that better match real conditions. For example, the 2016 article "You are Who You Know and How You Behave: Attribute Inference Attacks via Users' Social Friends and Behaviors" by N. Z. Gong proposed a better way of computing similarity and achieved good results. In general, however, models such as SVM and Bayesian classifiers perform well when predicting attributes such as gender and age, but poorly when predicting attributes such as occupation and interests. Label-propagation-based methods exploit the homophily of social networks, that is, two users who are friends are more likely to have similar attributes. On this basis, attribute values can be propagated along edges from nodes with known attribute information to nodes with unknown attribute information, and the attribute values are predicted in this way. For example, the 2010 article "Exploit of Online Social Networks with Semi-Supervised Learning" by Mingzhen Mo uses label propagation to predict users' unknown attributes and achieved good results. In practice, however, this approach requires considerable time and space to compute the adjacency matrix of the graph formed by the social network.

As described above, there has been much research in China and abroad on filling in missing user attributes, mainly divided into classification-based and label-propagation-based methods. Classification-based methods are limited in that the similarity computation must be comprehensive and precise, the classification model must be highly accurate, and the data features must be constructed comprehensively. Although the machine learning models in current methods are accurate, most of them target binary classification, and in social networks it is often impossible to obtain multi-dimensional attributes from which to build a comprehensive feature vector for a user. As a result, prediction quality is poor, especially for multi-valued attributes. Label-propagation-based attribute prediction algorithms spend a great deal of time computing the adjacency matrix of the graph, and the algorithm treats all friends of a marked node as equally important, which does not match the characteristics of social networks, so their results on real data are also unsatisfactory.

In addition, when predicting the unknown attributes of marked nodes, both types of methods must train a model for each specific attribute. For example, predicting gender requires building a model from the user's attribute information; when occupation is then predicted, the previously trained model cannot be used directly, and a new model must be trained to obtain good predictions.

Summary of the invention

The technical problem to be solved by the invention is to provide an efficient and accurate technique for inferring missing user attributes, so as to build better user profiles.

The technical solution adopted by the invention is as follows:

A method for inferring user attributes based on a convolutional neural network, comprising the following steps:

1) establishing a self-centered network according to the attributes and friend relationships of user nodes;

2) using a convolutional neural network to extract the attribute information of the user nodes in the self-centered network and the hidden information contained in the friend relationships, and using the hidden information to infer the user's missing attributes.

Further, the self-centered network is represented by a five-tuple G′ = {V′, E′, A′, B′, L}, where V′ contains the nodes of the self-centered network, the set E′ contains the links between all nodes in the self-centered network, the sets A′ and B′ represent the attribute information and behavior information of the nodes respectively, and the matrix L ∈ V′×N contains the attribute information and behavior information of the center node and its friends, N being the sum of the dimensions of the attribute and behavior data.

Further, step 1) first filters the user's attribute information on the network and then establishes the self-centered network; the filtering comprises:

a) filtering out, for every attribute other than age, all phrases that are not made up of Chinese characters;

b) filtering out nodes whose missing attribute information exceeds a set threshold.

Further, the convolutional neural network comprises an input layer, a projection layer, a convolution layer, a pooling layer, a fully connected layer and an output layer, and the projection layer converts the user's attribute information and behavior information into vectors.

Further, when converting the user's attribute information and behavior information into vectors, the projection layer applies the following rules to words belonging to the same occupation or major:

i. Create a hash table that uses the first character of a major or occupation as the key and the set of all attributes beginning with that character as the value; then compute the similarity between the other attribute values and the words in the major and occupation dictionaries using the Jaro-Winkler distance, and add the attributes with high similarity to the value set;

ii. For attribute values that do not appear in any value set, train word vectors with word2vec, compute the distances between the word vectors, group the highly similar vectors with the KNN algorithm, and associate them with the words in the dictionary through ID numbers, thereby obtaining a digitized vector.

Further, the convolution layer maps the feature matrix produced by the projection layer using the trained weight matrix and bias, with ReLU as the activation function; the pooling layer uses max pooling to retain the most important information in the local features; the output layer uses a softmax classifier that scores the possible values of the attribute using the weight matrix and bias, and the highest-scoring attribute value is taken as the missing attribute value of the marked user.

Further, for social networks where friend relationships cannot be obtained directly or are difficult to obtain, a neural network classifies and predicts the missing attributes using only the user's attribute information.

Further, the neural network comprises an input layer, a projection layer, hidden layers and an output layer; the projection layer converts the user's attribute information and behavior information into vectors; the hidden layers are two fully connected layers, the first containing n*n neurons, where n is the sum of the dimensions of the user's attribute and behavior data, and the second dropping a portion of the neurons to prevent overfitting; the output layer uses a softmax classifier that scores the possible values of the attribute using the weight matrix and bias, and the highest-scoring attribute value is taken as the missing attribute value of the marked user.

A device for inferring user attributes based on a convolutional neural network, comprising:

a self-centered network construction module, responsible for establishing a self-centered network according to the attributes and friend relationships of user nodes;

a user attribute inference module, responsible for using a convolutional neural network to extract the attribute information of the user nodes in the self-centered network and the hidden information contained in the friend relationships, and for using the hidden information to infer the user's missing attributes.

Further, for social networks where friend relationships cannot be obtained directly or are difficult to obtain, the user attribute inference module uses a neural network to classify and predict the missing attributes using only the user's attribute information.

The user attribute inference algorithm based on a convolutional neural network proposed by the invention is similar to the classification-based methods of the prior art in that it also treats attribute inference as a classification problem. The main difference is that a traditional classification method defines a similarity formula, computes the similarity between nodes with it, scores the possible values of the unknown attribute, and selects the highest-scoring value as the unknown attribute value of the marked node. The convolutional-network-based attribute prediction algorithm of the invention, UPE, instead computes the scores of the possible values of the marked node's unknown attribute from a weight matrix and bias values. Concretely, convolution kernels of different sizes automatically capture the latent relationships between different attribute dimensions; the weight matrix and bias are then computed from these relationships with the back-propagation algorithm; and a final linear computation yields a score for each possible value of the unknown attribute. Measuring similarity through a weight matrix and bias avoids the limitations of a manually defined similarity function, and the convolution operation of the convolution kernels better expresses the relationships between different attributes and between different attribute dimensions. These relationships cannot be explained or measured explicitly and are ignored by traditional classification algorithms. In addition, this is the first time a convolutional neural network has been applied to the attribute inference problem, and it achieved good experimental results.

In addition, as a supplement to UPE for the cases it cannot handle, the neural network method adopted by UPS is likewise applied to the attribute inference problem for the first time. Its main difference from prior-art methods is that the neural network approach is more flexible and efficient, with no fixed number of network layers and no fixed similarity function, so it copes well with the complex and changeable data found in some social networks and outperforms other classification algorithms. The UPS structure was built through testing and tuning on real datasets.

Description of drawings

Figure 1: The user's full network graph and self-centered network graph.

Figure 2: UPS flowchart.

Figure 3: UPE flowchart.

Figure 4: Comparison of the experimental results of the invention with other machine learning algorithms.

Detailed description

To make the above objects, features and advantages of the invention easier to understand, the invention is described in further detail below through specific embodiments and the accompanying drawings.

The technique of the invention for inferring missing user attributes based on a convolutional neural network comprises the following steps:

Step 1, dimension filtering:

An analysis of users' attribute information on the network shows that user attributes have 15 main dimensions, including user name, gender, age, school, major, profile, occupation and location. To implement the algorithm effectively, the attribute information is first filtered as follows:

1) Filter out, for every attribute other than age, all phrases that are not made up of Chinese characters. Age is an important dimension of user attributes, while the descriptions of the other attributes consist of Chinese characters. A value made up of non-Chinese characters yields no useful information and also degrades the experimental results, so this information is filtered out to improve attribute prediction.

2) Filter out nodes whose missing attribute information exceeds a set threshold, for example 50%. Although the algorithm is meant to solve the missing-attribute problem, a node with less than 50% of its attribute information cannot be predicted effectively from friend relationships and known attribute information alone when no other information, such as related posted content, is available.
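The two filtering rules of step 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the dictionary-based user record and the choice of the basic CJK Unified Ideographs range as the test for "Chinese characters" are all assumptions made for the example.

```python
import re

# Assumption: "Chinese characters" means the basic CJK Unified Ideographs block.
HAN_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def clean_attributes(user, keep_raw=("age",)):
    """Rule 1: treat non-Chinese values as missing for every attribute except age."""
    cleaned = {}
    for name, value in user.items():
        if value is None or value == "":
            cleaned[name] = None
        elif name in keep_raw or HAN_ONLY.fullmatch(str(value)):
            cleaned[name] = value
        else:
            cleaned[name] = None  # non-Chinese phrase carries no usable information
    return cleaned


def missing_rate(user):
    """Fraction of attribute dimensions that carry no value."""
    return sum(v is None for v in user.values()) / len(user)


def filter_users(users, threshold=0.5):
    """Rule 2: drop nodes whose missing rate exceeds the threshold."""
    cleaned = [clean_attributes(u) for u in users]
    return [u for u in cleaned if missing_rate(u) <= threshold]


# Hypothetical toy records with four attribute dimensions.
users = [
    {"gender": "女", "age": 23, "school": "北京大学", "occupation": "学生"},
    {"gender": "abc", "age": 30, "school": "", "occupation": "N/A"},
]
kept = filter_users(users)  # only the first user survives
```

The second record loses three of its four dimensions after rule 1, so its missing rate of 0.75 exceeds the 50% threshold and rule 2 removes it.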

Step 2, dictionary construction and social network definition:

A major dictionary and an occupation dictionary are built from the catalogue of disciplines and majors of Chinese institutions of higher education, the national occupational standards catalogue, and the corresponding naming rules for Chinese abbreviations. Each word in a dictionary is associated with a unique ID number. To improve the accuracy of the algorithm, when a noun to be identified is in neither dictionary, its similarity to each dictionary word is computed in turn using the Jaro-Winkler distance, and the most similar word is selected as the feature.

The Jaro-Winkler distance is a measure of the similarity between two strings. Given two strings S₁ and S₂, the distance between them is:

$$d_j = \frac{1}{3}\left(\frac{m}{|S_1|} + \frac{m}{|S_2|} + \frac{m - t}{m}\right)$$

where m is the number of matching characters, t is the number of transpositions, and |S₁| and |S₂| are the numbers of characters in S₁ and S₂, that is, their lengths. A transposition is a move or swap between characters, and the number of transpositions is the total number of such move or swap steps.
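The similarity computation can be sketched in plain Python as below. The sketch follows the commonly used textbook definitions, which are assumptions relative to the text: matches are searched within a window of max(|S₁|, |S₂|) // 2 − 1 positions, t is half the number of matched characters that appear in a different order, and the Winkler step adds a bonus for a shared prefix of up to four characters with scaling factor p = 0.1.

```python
def jaro(s1, s2):
    """Jaro similarity: (m/|s1| + m/|s2| + (m - t)/m) / 3."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters that are out of order; t is half that count.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (m / len1 + m / len2 + (m - t) / m) / 3.0


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


score = jaro_winkler("程序员", "程序媛")  # about 0.822: two of three characters match
```

Because the prefix bonus rewards strings that start the same way, variants such as 程序员 and 程序媛 score higher than strings that share the same characters in later positions.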

To describe the algorithm more clearly, two social network definitions are introduced. The first is the network formed by users' attribute information, called the full-social-graph (full network graph). The second is the network formed by a user's attribute information and friend relationships, called the ego-network (self-centered network).

Definition 1. Full-social-graph

A complete social network can be represented by a four-tuple G = {V, A, B, F}, where V is the set of all nodes in the social network, assumed to contain m nodes, and the sets A and B represent the attribute information and behavior information of the nodes respectively; for example, A_ij and B_ij denote the value of the j-th attribute dimension and the j-th behavior dimension of node V_i. F contains all the information of the users in the social network: F = {F₁, F₂, …, F_i, …, F_m}, where F_i = A_i ∪ B_i. Behavior information refers to the user's actions in the social network, such as posts, comments, likes and reposts.

Definition 2. Ego-network

The ego-network, also called the self-centered network, consists of the attributes and friend relationships of the center node together with the attribute information of its friends. To describe a user's self-centered network, a five-tuple G′ = {V′, E′, A′, B′, L} is defined, where V′ is the set of all nodes in the self-centered network and the set E′ contains the links between all nodes in the self-centered network. The sets A′ and B′ are defined as in the full-social-graph, that is, they represent the attribute information and behavior information of the nodes respectively. The matrix L ∈ V′×N contains the attribute information and behavior information of the center node and its friends, where N is the sum of the dimensions of the attribute and behavior data.

The user's full-social-graph and ego-network are both shown in Figure 1. The network formed by nodes v₁, v₂, v₃ and the links between them is the ego-network of user v₁, while all nodes in Figure 1 together with all their associated attribute information form the full-social-graph of the social network.
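As an illustration of Definition 2, the sketch below extracts V′, E′ and the matrix L for one center node from a full graph. The function name, the edge-list representation and the toy graph (which only mirrors the v₁, v₂, v₃ example; nodes v4 and v5 and all feature values are invented) are assumptions for the example.

```python
def build_ego_network(center, edges, features):
    """Return (V', E', L) for the ego-network of `center`.

    edges    -- iterable of undirected friend pairs (u, v)
    features -- dict mapping each node to its N attribute/behavior values (F_i)
    """
    friends = sorted(
        {v for u, v in edges if u == center} | {u for u, v in edges if v == center}
    )
    nodes = [center] + friends  # V': the center node followed by its friends
    node_set = set(nodes)
    links = [(u, v) for u, v in edges if u in node_set and v in node_set]  # E'
    matrix = [list(features[v]) for v in nodes]  # L: one row per node, N columns
    return nodes, links, matrix


# Hypothetical full-social-graph with five users and N = 3 feature dimensions.
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v3", "v4"), ("v4", "v5")]
features = {
    "v1": [1, 0, 2],
    "v2": [0, 3, 0],
    "v3": [2, 0, 1],
    "v4": [1, 1, 1],
    "v5": [0, 0, 0],
}
nodes, links, matrix = build_ego_network("v1", edges, features)
```

Here `nodes` is `["v1", "v2", "v3"]`, `links` holds the three edges among them, and `matrix` is the 3×3 block of their feature rows that would later be passed through the projection layer.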

Step 3, user attribute inference method based on a neural network:

For networks such as QQ and WeChat, where friend relationships cannot be obtained directly or are difficult to obtain, an ordinary two-layer neural network is used to classify and predict the missing attributes using only the user's attribute information.

Although friend relationships are public in most social networks, such as Zhihu and Facebook, in some networks, such as QQ and WeChat, they are encrypted, which makes friend relationship information harder or even impossible to obtain directly. To address this difficulty, the invention provides a neural-network-based user attribute inference method called UPS. Its rationale is the strong performance of neural networks on natural language processing problems. In natural language processing, neural network algorithms usually take each sentence as a feature vector; likewise, the invention takes the vector formed by a user's attribute and behavior data as the feature vector and converts it into a numeric vector with a method similar to word2vec, so that it can be fed into the neural network model. Compared with traditional methods for inferring missing user attributes, the neural network method achieves higher accuracy.

The flowchart of the neural network method based on user attribute information is shown in Figure 2. The first layer of the model obtains each user's feature vector, the second layer converts the obtained features into numeric vectors, and the remaining layers form a conventional neural network.

First, any feedforward neural network with P layers can be viewed as a composition of P functions f, namely:

$$f(F_i) = f^{P}\left(f^{P-1}\left(\cdots f^{1}(F_i) \cdots\right)\right)$$

where $f^{p}(\cdot)$ is the output of the p-th layer of the neural network, $F_i$ is the input to the whole network, i ∈ (1, n), and n is the number of nodes in the network. Each layer of the neural network in Figure 2 is described in detail below.

1) Converting the user vector F into a feature vector (lookup table layer, projection layer)

In natural language processing, each word is usually converted into a vector by training word vectors. Similarly, the invention uses a projection layer to convert the user's attribute information and behavior data into vectors, mainly by looking up whether the dictionary contains a corresponding word. However, a user's occupation and similar information may be described in more than one way. For example, 程序员 (programmer) may be entered as 程序猿 or 程序媛, and a student (学生) may also be described as 高中生 (high school student), 大学生 (college student), 大一 (freshman), 大二 (sophomore) and so on. Words belonging to the same occupation or major follow these rules:

i. Words belonging to the same occupation or major mostly share the same first character, such as 程序员 and 程序媛. So a hash table is created first, with the first character of each word in the major or occupation dictionary as the key and the set of all attributes beginning with that character as the value. The similarity between the other attribute values and the dictionary words is then computed using the Jaro-Winkler distance, and the attributes with high similarity are added to the value set. The principle of Jaro-Winkler is that the more closely the leading characters of two strings match, the more similar the strings are.

ii. Words belonging to the same occupation or major have similar meanings, such as 学生 (student), 高中生 (high school student) and 大一 (freshman). So for attribute values that do not appear in any value set, word vectors can be trained with Google's word2vec. The distances between word vectors (cosine distance) are computed, highly similar vectors are grouped with the KNN (k-nearest neighbors) algorithm and associated with dictionary words through their ID numbers, which yields the digitized vectors.
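The two rules can be sketched as follows, under several stated assumptions. To stay within the standard library, `difflib.SequenceMatcher` stands in for the Jaro-Winkler distance, hand-written two-dimensional vectors stand in for trained word2vec vectors, and the KNN step is reduced to a single nearest neighbor by cosine similarity. The dictionary, its ID numbers, the 0.6 similarity cut-off and all function names are hypothetical.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical occupation dictionary: word -> ID number.
OCCUPATION_IDS = {"程序员": 1, "学生": 2, "教师": 3}


def similarity(a, b):
    # Stand-in for the Jaro-Winkler distance named in the text.
    return SequenceMatcher(None, a, b).ratio()


def group_by_first_char(values, dictionary, min_sim=0.6):
    """Rule i: hash table keyed by first character, extended by string similarity."""
    table = defaultdict(set)
    for word in dictionary:
        table[word[0]].add(word)
    leftovers = []
    for value in values:
        if value[0] in table:  # same first character as a dictionary word
            table[value[0]].add(value)
            continue
        best = max(dictionary, key=lambda w: similarity(value, w))
        if similarity(value, best) >= min_sim:
            table[best[0]].add(value)
        else:
            leftovers.append(value)  # handled by rule ii
    return table, leftovers


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_dictionary_id(value, vectors, dictionary):
    """Rule ii (simplified to K = 1): ID of the dictionary word closest in vector space."""
    best = max(dictionary, key=lambda w: cosine(vectors[value], vectors[w]))
    return dictionary[best]


table, leftovers = group_by_first_char(
    ["程序媛", "程序猿", "高中生", "大学生"], OCCUPATION_IDS
)
# Toy vectors standing in for trained word embeddings.
vectors = {"高中生": [0.9, 0.1], "学生": [1.0, 0.0], "程序员": [0.0, 1.0], "教师": [0.6, 0.6]}
student_id = nearest_dictionary_id("高中生", vectors, OCCUPATION_IDS)
```

In this toy run, 程序媛 and 程序猿 join the 程 bucket through the first-character rule, 大学生 joins the 学 bucket through string similarity, and 高中生 falls through to the vector step, where it is mapped to the ID of 学生.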

The processing inside the neural network is transparent, so in summary, for the attribute vector F_i of any node i, the corresponding digitized vector X_i produced by the projection layer can be expressed as:

$$X_i = \mathrm{LTF}(F_i)$$

where LTF(·) denotes the series of operations described above.

2) Hidden layers

The two hidden layers after the projection layer in Figure 2 are two fully connected layers. The first hidden layer contains n*n neurons (n is the sum of the dimensions of the user's attribute and behavior data), and the second hidden layer drops a portion of the neurons (usually half) to prevent overfitting. ReLU is used as the activation function, which can be expressed as:

$$f(x) = \max(0, x)$$

In addition, to speed up training, gradient descent is used when training the neural network.
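A minimal forward pass through the two hidden layers might look like the sketch below. It is an illustration under assumptions, not the patented network: the weights are hand-picked rather than trained, the "dropping" of neurons is implemented as inverted dropout (kept units are rescaled by 1 / (1 − rate)), and the toy input has n = 2 so that the first layer has n*n = 4 neurons.

```python
import random


def relu(vec):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, x) for x in vec]


def dense(vec, weights, bias):
    """Fully connected layer; `weights` holds one row per output neuron."""
    return [
        sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)
    ]


def dropout(vec, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def hidden_layers(x, w1, b1, w2, b2, rate=0.5, rng=None, training=True):
    """Two fully connected ReLU layers; units of the second are dropped in training."""
    rng = rng if rng is not None else random.Random(0)
    h1 = relu(dense(x, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return dropout(h2, rate, rng, training)


# Toy example: n = 2 input dimensions, so the first layer has n * n = 4 neurons.
x = [1.0, -2.0]
w1 = [[1, 0], [0, 1], [1, 1], [-1, 1]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1, 1, 1, 1], [2, 0, 0, 0]]
b2 = [0.5, -1.0]
h = hidden_layers(x, w1, b1, w2, b2, training=False)  # [1.5, 1.0]
```

Dropout is only active during training; at prediction time (`training=False`) every neuron contributes, which is why the example output is deterministic.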

3) Output layer (softmax classifier)

To compute the missing attribute value of a marked node, the output layer of the neural network uses a softmax classifier (filling in a missing attribute can be regarded as a classification problem). During training, the weight matrix W and bias b of the network are computed mainly by back-propagation. Once the weight matrix and bias are determined, the softmax classifier uses them to score the possible values of the attribute, and the highest-scoring attribute value is taken as the missing attribute value of the marked user.
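The scoring step can be sketched as a linear layer followed by softmax and an argmax over the candidate values. The weight matrix, bias, hidden vector and candidate labels below are invented for illustration; in the described method W and b would come from back-propagation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_missing_attribute(h, W, b, candidates):
    """Score every candidate value as W h + b and return the best one."""
    scores = [
        sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)
    ]
    probs = softmax(scores)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs


# Hypothetical hidden-layer output and parameters for a three-valued attribute.
h = [1.5, 1.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -2.0]
label, probs = predict_missing_attribute(h, W, b, ["student", "programmer", "teacher"])
```

With these numbers the raw scores are 1.5, 1.0 and 0.5, so `label` is `"student"`. Because the output covers every possible value of the attribute at once, the same mechanism handles multi-valued attributes such as occupation as well as binary ones.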

Finally, for each input vector F_i, the output of the neural network built on user attributes can be expressed as:
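One form consistent with the layers described in this step is given below. It is offered only as an assumption that chains the projection layer, the two ReLU hidden layers with dropout and the softmax output; the symbols $X_i$, $h_i^{(k)}$, $W^{(k)}$, $b^{(k)}$ and $\hat{y}_i$ are introduced here for illustration.

```latex
\begin{aligned}
X_i &= \mathrm{LTF}(F_i) \\
h_i^{(1)} &= \max\left(0,\; W^{(1)} X_i + b^{(1)}\right) \\
h_i^{(2)} &= \mathrm{dropout}\left(\max\left(0,\; W^{(2)} h_i^{(1)} + b^{(2)}\right)\right) \\
\hat{y}_i &= \mathrm{softmax}\left(W h_i^{(2)} + b\right)
\end{aligned}
```

The predicted value of the missing attribute is then the candidate with the largest component of $\hat{y}_i$.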

步骤4：基于卷积神经网络的用户属性推断方法：Step 4: User attribute inference method based on convolutional neural network:

To make inference more accurate, a convolutional neural network is used, with different convolution kernels designed to extract the user's attribute information and the hidden information contained in friend relationships, so that the user's missing attributes can be inferred more accurately.

For social networks in which friend relationships are public, a neural network method based only on user attribute information loses a large amount of information, so its predictions are not very accurate. A convolutional neural network, by contrast, can capture the latent connections between users' attributes and their link relationships, and these latent connections allow the user's missing attribute values to be inferred more precisely. The theoretical basis of this algorithm is the strong performance of convolutional neural networks on image recognition problems. The matrix formed by a user's self-centered network is like a picture within the social network and shows the differences between users well; by extracting these differences (features), users can be classified well (attribute inference). As in image recognition, the self-centered network matrix must first be converted into a numeric matrix before it can be fed into the convolutional neural network. The matrix processing for the self-centered network is similar to the projection layer in the UPS method, but a more complex network is used to train on the data and predict the unknown attribute values.

The user attribute inference method based on a convolutional neural network is referred to in the present invention as UPE. The flow chart of UPE is shown in Figure 3. The first layer is the projection layer, which is the same as the projection layer of UPS. The convolution layer that follows contains convolution operations; in this layer, convolution kernels of different sizes are used to extract latent features from the input data. The layers of the UPE network are described in order below.

1) Convolution layer

First, the input layer and projection layer of the UPE network are similar to those of UPS. That is, the matrix L_i corresponding to the network formed by self-center node i is the input of UPE; after passing through the projection layer, each element of L_i becomes the number corresponding to that element. The numeric feature matrix converted from L_i is denoted ML_i, and ML_i is the original input of the convolution layer. In general, this can be expressed by the formula:

The convolution operation can be summarized as a linear operation between the elements of the weight matrix W and the feature matrix ML_i. It differs from a linear operation in the narrow sense, however, in that it needs an activation function to take effect; otherwise the next layer of the neural network cannot be activated. Therefore, when the feature matrix ML_i reaches the convolution layer, a trained weight matrix W and bias b are applied to it as a mapping, which can be expressed by the following formula:

ReLU in the formula is the activation function used. As a standard mapping layer, the convolution layer usually extracts high-dimensional features; how these features help the network predict unknown attribute values depends on the roles of the remaining layers of UPE.
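The mapping performed by the convolution layer can be sketched as a sliding-window weighted sum followed by ReLU. This is an illustrative reading of the description, not the patent's exact formula: the kernel shape, the "valid" (no padding) sliding window, the stride of 1, and the function name are assumptions.

```python
def conv2d_relu(matrix, kernel, bias=0.0):
    """Slide `kernel` over `matrix` (no padding, stride 1), add `bias`,
    and apply ReLU to every window sum."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for c in range(cols - k_cols + 1):
            acc = bias
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += kernel[i][j] * matrix[r + i][c + j]
            out_row.append(max(0.0, acc))  # ReLU
        out.append(out_row)
    return out
```

Using kernels of different sizes, as the description suggests, amounts to calling this function several times with different `kernel` arguments and collecting the resulting feature maps.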

2) Max pooling layer

From the analysis above, the size of the output vector of the first convolution layer depends on the number of friends of the self-center node (the number of columns of the matrix). To determine which friend contributes most to the unknown attribute of the self-center node, the local features captured by the convolutional neural network (the size of the convolution kernel determines the range of feature extraction) must be combined with the global features before a judgment is made. Conventional convolutional neural networks usually apply average pooling or max pooling after the convolution layer. Average pooling is not well suited to this case, because the attributes of most friends generally have little influence on the predicted value of the unknown attribute. Max pooling is therefore used to retain the most important information in the local features, which is expressed by the formula:

Here the input to the pooling operation is the output of convolution layer p-1, and the global features produced by this layer are passed to the next convolution layer or fully connected layer. The remaining fully connected layer and output layer of UPE are similar to the last two layers of UPS and are not described again here. In UPE, the softmax function finally scores each possible value of the attribute, and the value with the highest score is the predicted unknown value; this is the conventional neural network approach to classification problems.
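The effect of max pooling can be illustrated with a short sketch. It assumes the convolution output is stored as one row per feature map with one column per friend position, which is an assumption about the layout; both helper names are hypothetical.

```python
def max_pool_over_friends(feature_map):
    """Global max pooling: keep the strongest response of each feature map,
    so the output length no longer depends on the number of friends."""
    return [max(row) for row in feature_map]


def max_pool_1d(values, window):
    """Windowed max pooling with non-overlapping windows; a trailing
    partial window is discarded."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]
```

Because only the largest response survives, friends whose attributes carry little signal do not dilute the result, which is the stated reason for preferring max pooling over average pooling.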

The key points of the present invention are mainly:

1. Using the ego-network, a user attribute prediction algorithm based on a convolutional neural network, UPE, is proposed, which greatly improves the accuracy of predicting the missing attributes of marked nodes. Although convolutional neural networks have been widely applied in natural language processing, computer vision and other fields, this is the first attempt to apply them to social networks, and to the attribute inference problem in particular.

2. For social networks in which users' friend relationships cannot be obtained, a user missing-attribute prediction algorithm based on a multi-layer neural network, UPS, is proposed, which covers the cases that the UPE algorithm cannot handle.

3. A definition of the ego-network is proposed according to the characteristics of social networks. This definition differs from the traditional ego-network: it contains not only the attributes and friend relationships of the self-center node and the attributes of, and relationships among, its friends, but can also contain information such as their behavior and the content of their posts. Processing this behavior information and abstracting it into a one-dimensional feature of the user is what distinguishes it from previous definitions of the ego-network.
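The ego-network described in point 3 corresponds to the quintuple G' = {V', E', A', B', L} recited in the claims. A minimal sketch of how such a structure might be assembled is given below; the function name, the dictionary layout, the assumption of undirected edges, and the ordering of nodes (ego first, then friends sorted) are all illustrative choices not taken from the source.

```python
def build_ego_network(ego, edges, attributes, behaviors):
    """Return a dict mirroring the quintuple G' = {V', E', A', B', L}."""
    # Friends are nodes directly linked to the ego node (undirected edges assumed).
    friends = sorted(({v for u, v in edges if u == ego} |
                      {u for u, v in edges if v == ego}) - {ego})
    nodes = [ego] + friends                               # V'
    node_set = set(nodes)
    ego_edges = [(u, v) for u, v in edges
                 if u in node_set and v in node_set]      # E', incl. friend-friend links
    attrs = {n: attributes[n] for n in nodes}             # A'
    behavs = {n: behaviors[n] for n in nodes}             # B'
    # L: one row per node, attribute columns followed by behavior columns (N in total).
    matrix = [attrs[n] + behavs[n] for n in nodes]
    return {"V": nodes, "E": ego_edges, "A": attrs, "B": behavs, "L": matrix}
```

Links between two friends are kept in E', which matches the statement that the ego-network includes the relationships among friends.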

Because social network data involves personal privacy, publicly available datasets are almost nonexistent. Using a crawler based on seed URLs, the present invention crawled a dataset of 220,000 Zhihu users with 15-dimensional attribute information, including user name, user ID, gender, age, occupation, major, school, and so on. After data cleaning, the data with user ID and user name removed is fed into UPE and UPS respectively.

In addition, to better verify the performance of the proposed algorithms in inferring unknown user attributes, several commonly used machine learning algorithms were selected for comparison: naive Bayes (NB), logistic regression (LR), majority voting (MV), and graph-based semi-supervised learning (GSSL, a label propagation algorithm). The experimental results are shown in Figure 4, where the values are accuracy rates.
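As an illustration of the simplest baseline and of the accuracy metric, the sketch below implements one common reading of majority voting: predict the value held by most of the user's friends. The source names the baseline but does not define it, so this interpretation, the tie-breaking behavior, and the function names are assumptions.

```python
from collections import Counter


def majority_vote(friend_values):
    """Predict the most frequent known value among a user's friends.
    Returns None if no friend has a known value; ties go to the value
    seen first."""
    known = [v for v in friend_values if v is not None]
    if not known:
        return None
    return Counter(known).most_common(1)[0][0]


def accuracy(predictions, truths):
    """Fraction of predictions that match the true values."""
    correct = sum(1 for p, t in zip(predictions, truths) if p == t)
    return correct / len(truths)
```

The same `accuracy` helper applies to any of the compared methods, since it only needs predicted and true attribute values.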

As Figure 4 shows, when predicting the gender attribute of marked users, the proposed UPE method reaches an accuracy of 79.11%, which is 5 to 27 percentage points higher than the traditional methods. Moreover, although UPS also uses a neural network, its accuracy is lower than that of UPE, because UPE adds relationship attributes and incorporates the user's relationship information. When predicting the user's occupation, Figure 4 shows that the accuracy of UPE, at 76.28%, is clearly higher than that of the other methods, while MV (majority voting) has the lowest accuracy, at 32.9%. The experimental results show that the convolutional neural network extracts the latent relationships between a marked node and its friends well, and therefore predicts the unknown attribute values of the marked node better.

The present invention is not limited to the above embodiments. The parameter settings of the neural network, the size of the convolution kernels in each layer, and the choice of activation function (ReLU) can all be replaced; they are selected or trained according to the data.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Those of ordinary skill in the art may modify the technical solutions of the present invention or replace them with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as set out in the claims.

Claims

1. A user attribute inference method based on a convolutional neural network, characterized by comprising the following steps:

1) According to the attributes and friend relationships of user nodes, establish a self-centered network;

2) using a convolutional neural network to extract the attribute information of the user nodes in the self-centered network and the hidden information contained in the friend relationships, and using the hidden information to infer the missing attributes of the user;

The self-centered network is represented by a quintuple G'={V', E', A', B', L}, where V' contains the node information of the self-centered network; the set E' contains all link relationships between nodes in the self-centered network; the sets A' and B' respectively represent the attribute information and behavior information of the nodes; the matrix L∈V'×N contains the attribute information and behavior information of the self-center node and its friends; and N is the sum of the dimensions of the attribute and behavior data;

Step 1) first filters the attribute information of users on the network and then establishes the self-centered network; the filtering includes:

a) filtering out all non-Chinese phrases in attributes other than age;

b) filtering out nodes whose missing attribute information exceeds a set threshold;

The convolutional neural network includes an input layer, a projection layer, a convolution layer, a pooling layer, a fully connected layer and an output layer, and the projection layer converts the user's attribute information and behavior information into vectors;

The projection layer converts the user's attribute information and behavior information into vectors, and adopts the following rules for words belonging to the same occupation or profession:

i. creating a hash table in which the first word of a profession or occupation is the key and the set of all attribute values beginning with that first word is the value; then computing, according to the Jaro-Winkler distance, the similarity between other attribute values and the words in the profession and occupation dictionary, and adding attribute values with high similarity to the value set;

ii. for attribute values that do not appear in the value set, using word2vec to train word vectors, computing the distances between the word vectors, aggregating vectors with high similarity using the KNN algorithm, and associating them with the ID numbers of the words in the dictionary to obtain a digitized vector;

The convolution layer uses the trained weight matrix and bias to map the feature matrix obtained from the projection layer, with ReLU as the activation function; the pooling layer uses max pooling to retain the most important local features; the output layer uses a softmax classifier that scores the possible values of the attribute using the weight matrix and bias, and the attribute value with the highest score is the missing attribute value of the marked user.

2. The method according to claim 1, wherein, for social networks in which friend relationships cannot be directly obtained or are difficult to obtain, a neural network is used to classify and predict the missing attributes using only the user's attribute information.

3. The method according to claim 2, wherein the neural network comprises an input layer, a projection layer, hidden layers and an output layer; the projection layer converts the user's attribute information and behavior information into vectors; the hidden layers are two fully connected layers, the first hidden layer contains n*n neurons, n being the sum of the dimensions of the user's attribute and behavior data, and the second hidden layer drops some neurons to prevent overfitting; the output layer uses a softmax classifier that scores the possible values of the attribute using the weight matrix and bias, and the attribute value with the highest score is the missing attribute value of the marked user.

4. A device for inferring user attributes based on a convolutional neural network using the method of claim 1, characterized by comprising:

a self-centered network building module, responsible for establishing a self-centered network according to the attributes and friend relationships of user nodes;

a user attribute inference module, responsible for using a convolutional neural network to extract the attribute information of the user nodes in the self-centered network and the hidden information contained in the friend relationships, and for using the hidden information to infer the missing attributes of the user.

5. The device according to claim 4, characterized in that, for social networks in which friend relationships cannot be directly obtained or are difficult to obtain, the user attribute inference module uses a neural network to classify and predict the missing attributes using only the user's attribute information.