Abstract
Clustering the nodes of a graph is a cornerstone of graph analysis and has been extensively studied. However, some popular methods do not scale to very large graphs: for example, spectral clustering requires the spectral decomposition of the Laplacian matrix, which becomes intractable for large graphs with many communities. This work introduces PASCO, an overlay that accelerates clustering algorithms. Our method consists of three steps: (1) we compute several independent small graphs representing the input graph by applying an efficient and structure-preserving coarsening algorithm; (2) a clustering algorithm is run in parallel on each small graph, yielding several partitions of the initial graph; (3) these partitions are aligned and combined with an optimal transport method to produce the final partition. The PASCO framework rests on two key contributions: a novel global algorithm structure designed to enable parallelization, and a fast, empirically validated graph coarsening algorithm that preserves structural properties. We demonstrate the strong performance of PASCO in terms of computational efficiency, structural preservation, and output partition quality on both synthetic and real-world graph datasets.
Availability of data and materials
The code to reproduce the experiments is available at https://github.com/elasalle/PASCO
Notes
For example, permuting the columns yields different representations of the same partition.
The use of \(\alpha\) as a parameter is relevant for the subsequent experiments, as it allows us to keep the density fixed while varying the difficulty \(\alpha\) of recovering the blocks.
References
Ayad, H. G., & Kamel, M. S. (2010). On voting-based consensus of cluster ensembles. Pattern Recognition, 43(5).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10).
Blondel, M., Seguy, V., & Rolet, A. (2018). Smooth and sparse optimal transport. In AISTATS.
Boutsidis, C., Kambadur, P., & Gittens, A. (2015). Spectral clustering via the power method - provably. In ICML.
Bunimovich, L., Smith, D., & Webb, B. (2019). Finding hidden structures, hierarchies, and cores in networks via isospectral reduction. Applied Mathematics & Nonlinear Sciences, 4(1).
Chen, J., Saad, Y., & Zhang, Z. (2022). Graph coarsening: From scientific computing to machine learning. SeMA Journal, 79(1), 187–223.
Cuturi, M., & Doucet, A. (2014). Fast computation of Wasserstein barycenters. In ICML.
Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining.
Dhillon, I. S., Guan, Y., & Kulis, B. (2007). Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11).
Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., et al. (2021). POT: Python Optimal Transport. JMLR, 22(78).
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486.
Fortunato, S., & Hric, D. (2016). Community detection in networks: A user guide. Physics Reports, 659.
Fred, A. (2001). Finding consistent clusters in data partitions. In International Workshop on Multiple Classifier Systems.
Ghosh, J., & Acharya, A. (2018). Cluster ensembles: Theory and applications. Data Clustering.
Gottesbüren, L., Heuer, T., Sanders, P., Schulz, C., & Seemaier, D. (2021). Deep multilevel graph partitioning. arXiv preprint arXiv:2105.02022.
Hendrickson, B., Leland, R. W., et al. (1995). A multi-level algorithm for partitioning graphs. SC, 95(28).
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., & Leskovec, J. (2020). Open Graph Benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687.
Jin, Y., Loukas, A., & JaJa, J. (2020). Graph coarsening with preserved spectral properties. In International Conference on Artificial Intelligence and Statistics, pp. 4452–4462. PMLR.
Karataş, A., & Şahin, S. (2018). Application areas of community detection: A review. In 2018 International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT).
Karypis, G., & Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1).
Lasalle, E., Vaudaine, R., & Vayer, T. PASCO. https://hal.science/hal-05086352
Li, Y., Nie, F., Huang, H., & Huang, J. (2015). Large-scale multi-view spectral clustering via bipartite graph. In AAAI.
Li, J., Seo, B., & Lin, L. (2019). Optimal transport, mean partition, and uncertainty assessment in cluster analysis. Statistical Analysis and Data Mining: The ASA Data Science Journal, 12(5).
Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1).
Loukas, A. (2019). Graph reduction with spectral and cut guarantees. Journal of Machine Learning Research, 20(116).
Loukas, A., & Vandergheynst, P. (2018). Spectrally approximating large graphs with smaller graphs. In ICML.
Newman, M. E. J. (2010). Networks: An introduction.
Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69.
Peixoto, T. P. (2014). Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Physical Review E, 89(1).
Peyré, G., Cuturi, M., et al. (2019). Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6).
Pourkamali-Anaraki, F. (2020). Scalable spectral clustering with Nyström approximation: Practical and theoretical aspects. IEEE Open Journal of Signal Processing, 1.
Quemener, E., & Corvellec, M. (2013). SIDUS - the solution for extreme deduplication of an operating system. Linux Journal, 2013(235), 3.
Rissanen, J. (2007). Information and complexity in statistical modeling.
Rossi, R. A., & Ahmed, N. K. (2015). The network data repository with interactive graph analytics and visualization. In AAAI.
Rosvall, M., & Bergstrom, C. T. (2011). Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. PLOS ONE, 6.
Rotta, R., & Noack, A. (2011). Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 16, 2–1.
Spielman, D. A., & Srivastava, N. (2008). Graph sparsification by effective resistances. In Proceedings of the fortieth annual ACM symposium on theory of computing.
Spielman, D. A., & Teng, S.-H. (2011). Spectral sparsification of graphs. SIAM Journal on Computing, 40(4).
Traag, V. A., Waltman, L., & Van Eck, N. J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9(1).
Tremblay, N., & Loukas, A. (2020). Approximating spectral clustering via sampling: A review. In Sampling Techniques for Supervised or Unsupervised Tasks.
Tremblay, N., Puy, G., Gribonval, R., & Vandergheynst, P. (2016). Compressive spectral clustering. In ICML.
Vinh, N. X., Epps, J., & Bailey, J. (2009). Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In ICML.
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17.
Waltman, L., & Van Eck, N. J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. The European Physical Journal B, 86, 1–14.
Wu, J., Liu, H., Xiong, H., Cao, J., & Chen, J. (2014). K-means-based consensus clustering: A unified view. IEEE Transactions on Knowledge and Data Engineering, 27(1).
Yan, D., Huang, L., & Jordan, M. I. (2009). Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining.
Acknowledgements
This work has been supported by the DATAREDUX project, ANR-19-CE46-0008, and the ANR DARLING project ANR-19-CE48-0002. This project was supported in part by the AllegroAssai ANR project ANR-19-CHIA0009.
Author information
Authors and Affiliations
Contributions
E.L., R.V. and T.V. contributed to the code and figures and wrote most of the manuscript. All authors participated in the decisions that led to the final version of the manuscript and reviewed it.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Editors: Maria Concepcion Bielza Lozoya, Ana Carolina Lorena, Tiago A. Almeida.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Finding consensus between partitions
In this section we describe the different approaches that we investigated for aligning different partitions onto a reference partition. We recall that the problem can be reformulated as finding the partition \(\overline{P}\) that best agrees with all \((P_r)\) for \({r \in {[\![R]\!]}}\). Methods from the clustering ensemble literature already address such a problem. Mathematically, they amount to finding a Fréchet mean of the partitions \((P_r)\) for various notions of divergence \(\mathcal {D}\) between partitions
$$\begin{aligned} \overline{P}\in \mathop {\operatorname {argmin}}\limits _{\overline{P}\in \mathcal {P}_{N, \overline{k}}} \ \sum _{r=1}^{R} \mathcal {D}(P_r, \overline{P}). \end{aligned}$$(A1)
Depending on the choice of divergence \(\mathcal {D}\), we can obtain different consensus methods, as detailed below.
Linear regression and many-to-one
These two methods are based on a similar notion of divergence between partitions. Given two partitions \(P\in \mathcal {P}_{N, k}\) and \(\overline{P}\in \mathcal {P}_{N, \overline{k}}\), it is defined as
$$\begin{aligned} \mathcal {D}(P, \overline{P}) = \min _{Q \in \mathcal {Q}(k, \overline{k})} \Vert P Q - \overline{P}\Vert _F^2. \end{aligned}$$(A2)
Depending on the choice for the set \(\mathcal {Q}(k, \overline{k})\) we get different methods:
-
When we simply set \(\mathcal {Q}(k, \overline{k}) = \mathcal {Q}_{\operatorname {lin-reg}}(k, \overline{k}) = \mathbb {R}^{k \times \overline{k}}\), the optimal matrix Q is given by (see Lemma 3)
$$\begin{aligned} Q_{\operatorname {lin-reg}} =(P^\top P)^{-1}P^\top \overline{P}= \operatorname {diag}(P^\top {\textbf{1}}_N)^{-1} P^\top \overline{P}, \end{aligned}$$(A3)where \(\operatorname {diag}(P^\top {\textbf{1}}_N)^{-1}\) is the diagonal matrix containing the inverses of the cluster sizes. This realignment is proposed in Ayad and Kamel (2010). Even though it allows computing the divergence \(\mathcal {D}( P, \overline{P})\), this matrix yields a “re-aligned partition” \(PQ_{\operatorname {lin-reg}}\) which is only a “soft partition”, with elements in [0, 1]. We refer to this solution as lin-reg.
-
If one wants to obtain a true partition (with elements in \(\{0,1\}\)), a solution is to restrict the matrices in \(\mathcal {Q}(k, \overline{k})\) so that each cluster of \(P\) is sent to exactly one cluster of \(\overline{P}\). For that we define \(\mathcal {Q}_{\operatorname {many-to-one}}(k, \overline{k}) = \{Q \in \{0,1\}^{k \times \overline{k}}, Q {\textbf{1}}_{\overline{k}} = {\textbf{1}}_{k} \}\). The solution of (A2) with \(\mathcal {Q}_{\operatorname {many-to-one}}(k, \overline{k})\) is given (see Lemma 4) by \(\operatorname {row-bin}(Q_{\operatorname {lin-reg}})\), where \(\operatorname {row-bin}\) is the operator that returns a binary matrix of the same shape, where each row contains only zeros except at the position of the maximum of the corresponding row of the input matrix. We refer to this solution as many-to-one. Both alignments are illustrated in the short sketch after this list.
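For concreteness, the following minimal sketch computes both closed-form alignments from partition matrices stored as numpy arrays (illustrative names, not the PASCO API; it assumes P has no empty cluster):

```python
import numpy as np

def lin_reg_alignment(P, P_bar):
    """Closed-form lin-reg alignment (A3): Q = diag(P^T 1_N)^{-1} P^T P_bar (soft alignment)."""
    sizes = P.sum(axis=0)                  # cluster sizes = diag(P^T 1_N), assumed non-zero
    return (P.T @ P_bar) / sizes[:, None]

def many_to_one_alignment(P, P_bar):
    """row-bin of Q_lin-reg (Lemma 4): each cluster of P is sent to one cluster of P_bar."""
    Q = lin_reg_alignment(P, P_bar)
    Q_bin = np.zeros_like(Q)
    Q_bin[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return Q_bin
```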
OT-based method To prevent empty clusters in the realigned partition, we also consider \(\mathcal {Q}(k,\overline{k}) = \mathcal {Q}_{\operatorname {ot}}(k,\overline{k})\) in (A2). With these constraints, the alignment problem becomes an OT problem, related to a specific quadratically regularized OT problem (Blondel et al., 2018) that admits efficient convex solvers, as detailed in the following lemma (the proof can be found in Sect. D).
Lemma 1
Let \({P} \in \mathcal {P}_{N, k}\) and \(D = \operatorname {diag}(P^\top {\textbf{1}}_N)\). Then problem (A2) with \(\mathcal {Q}(k, \overline{k}) = \mathcal {Q}_{\operatorname {ot}}(k, \overline{k})\) is equivalent to the problem
$$\begin{aligned} \min _{Q \in \mathcal {Q}_{\operatorname {ot}}(k, \overline{k})} \ -2 \langle C, Q\rangle + \Vert D^{1/2} Q\Vert _F^2, \end{aligned}$$(A4)
where \(C = P^\top \overline{P}\). Assuming no empty cluster in the partition \(P\), the solution of Eq. (A4) is unique and can be found by considering the dual problem
$$\begin{aligned} \max _{\mu \in \mathbb {R}^{k}, \nu \in \mathbb {R}^{\overline{k}}} \ \frac{1}{k} \langle \mu , {\textbf{1}}_k\rangle + \frac{1}{\overline{k}} \langle \nu , {\textbf{1}}_{\overline{k}}\rangle - \frac{1}{4} \Vert D^{-1/2} [\mu \oplus \nu + 2 C]_{+}\Vert _F^2, \end{aligned}$$(A5)
where \(\mu \oplus \nu := (\mu _i + \nu _j)_{ij}\) and for any matrix \(A, \ [A]_{+}:= \left( \max \{A_{ij}, 0\}\right) _{ij}\). More precisely, the optimal solution \(Q^\star\) of Eq. (A4) can be written as \(Q^{\star } = \frac{1}{2} D^{-1}[\mu ^\star \oplus \nu ^\star + 2 C]_{+}\) where \((\mu ^\star , \nu ^\star )\) is the optimal solution of Eq. (A5).
Building upon this result, we can solve Eq. (A4) by tackling the dual Eq. (A5), which is an unconstrained convex problem in two variables (maximization of a concave function). This expression allows us to use any convex solver; as suggested in Blondel et al. (2018), we rely on L-BFGS (Liu & Nocedal, 1989), which we find particularly effective in practice. We call this alignment procedure quad-ot; it has roughly \(\mathcal {O}(N K^2)\) complexity.
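As an illustration, the dual (A5) can be handed to an off-the-shelf L-BFGS routine. The following sketch uses SciPy with uniform marginals as in the proof of Lemma 1; the function and variable names are illustrative, not the PASCO API:

```python
import numpy as np
from scipy.optimize import minimize

def quad_ot_alignment(P, P_bar):
    """Solve the dual (A5) with L-BFGS and recover Q* as in Lemma 1 (sketch).
    Assumes the partition P (N x k, one-hot rows) has no empty cluster."""
    k, k_bar = P.shape[1], P_bar.shape[1]
    d = P.sum(axis=0)                       # cluster sizes = diag(D), D = diag(P^T 1_N)
    C = P.T @ P_bar                         # k x k_bar matrix C = P^T P_bar
    a, b = np.full(k, 1.0 / k), np.full(k_bar, 1.0 / k_bar)   # uniform marginals

    def neg_dual(x):                        # negative of the concave dual objective (A5)
        mu, nu = x[:k], x[k:]
        T = np.maximum(mu[:, None] + nu[None, :] + 2.0 * C, 0.0)   # [mu ⊕ nu + 2C]_+
        return -(a @ mu + b @ nu - 0.25 * np.sum(T ** 2 / d[:, None]))

    res = minimize(neg_dual, np.zeros(k + k_bar), method="L-BFGS-B")
    mu, nu = res.x[:k], res.x[k:]
    # primal recovery: Q* = (1/2) D^{-1} [mu* ⊕ nu* + 2C]_+
    return 0.5 * np.maximum(mu[:, None] + nu[None, :] + 2.0 * C, 0.0) / d[:, None]
```

The realigned (soft) partition is then obtained as `P @ quad_ot_alignment(P, P_bar)`.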
Remark 1
The only difference between Eq. (A4) and the standard quadratic OT problem of the form \(\min _Q \langle M, Q\rangle + \frac{\gamma }{2} \Vert Q\Vert _F^2\) is that in our case the regularization term is \(\Vert D^{1/2}Q\Vert _F^2\). This amounts to considering a Mahalanobis-type regularization instead of an \(\ell ^2_2\) one.
Solving for the barycenter
To solve the barycenter problem in (A1) with a distance of the form (A2), we alternate between finding the alignment matrices \(Q_r\) as explained above and updating the reference \(\overline{P}\) as in (3). This corresponds to an alternating minimization algorithm: we alternate between (i) realigning the R partitions onto the reference and (ii) updating the reference. The reference update is based on Lemma 5, which states that the closest partition matrix to a set of (realigned) matrices is given by the majority-vote update, see Eq. (3).
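A minimal sketch of this alternating scheme, assuming partition matrices stored as numpy arrays and any alignment routine such as the `quad_ot_alignment` sketch above (illustrative names, not the PASCO API):

```python
import numpy as np

def align_and_fuse(partitions, k_bar, align, n_iter=10, seed=0):
    """Alternate between (i) realigning each partition onto the current reference
    and (ii) updating the reference by a majority vote over the realigned partitions."""
    rng = np.random.default_rng(seed)
    N = partitions[0].shape[0]
    P_ref = np.eye(k_bar)[rng.integers(k_bar, size=N)]        # random initial reference
    for _ in range(n_iter):
        realigned = [P @ align(P, P_ref) for P in partitions]  # (i) alignment step
        votes = np.sum(realigned, axis=0)                      # N x k_bar accumulated votes
        P_ref = np.eye(k_bar)[votes.argmax(axis=1)]            # (ii) majority-vote update
    return P_ref
```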
(Additional theoretical results can be found in Sect. D.)
Appendix B: Other edge sampling rules
In this section, we present edge sampling rules that we devised while trying to design a fast coarsening procedure. They were not satisfactory, but we choose to present them here and explain their drawbacks, as we believe this helps to better understand the coarsening algorithm that we propose in the end. To the best of our knowledge, these strategies were not considered systematically in previous works on graph coarsening.
Uniform edge sampling This approach might be the most natural one. It consists in choosing edges uniformly at random in the graph and collapsing them. Doing so favors edges incident to high-degree nodes, which after collapsing results in an even higher-degree hypernode. This amplification phenomenon yields an unbalanced final coarsened graph with one huge hypernode while all other hypernodes contain only a few nodes. This is problematic as it does not reflect the community structure of the initial graph well.
Uniform edge sampling with marked neighboring edges This approach fixes the above issue. To avoid almost always collapsing onto the same hypernode, when an edge (i, j) is collapsed, we mark the edges incident to nodes i and j, and we sample edges uniformly at random among unmarked edges. This solves the unbalancedness issue. However, the algorithm quickly runs out of unmarked edges, which forces the early computation of the next intermediate coarsened graph.
While this can be done with sparse matrix products between the adjacency matrix and the matrix encoding the composition of the hypernodes, it remains costly and should be avoided as much as possible.
Uniform edge sampling with marked visited nodes Here, we want to relax the limit imposed by the previous approach with marked edges. First, recall that sampling an edge uniformly at random is equivalent to sampling a node i with probability proportional to its degree and then sampling a neighbor j uniformly, see e.g. (Newman, 2010, Section 6.14). So instead of discarding the edge (i, j) whenever either i or j has been used in a previous collapse, a natural relaxation is to reject the edge (i, j) only when i has been previously involved in a collapse, irrespective of whether j was involved in such a collapse. Simulations showed that when using this sampling strategy (in step 6 of Algorithm 2) to generate the coarsening tables \(h^\ell\), the hypernode size distribution was similar to the one obtained with the edge-marking strategy, but required far fewer intermediate coarsened-graph reconstructions.
Uniform node sampling with marked visited nodes A final improvement to speed up the sampling procedure is to sample the first node i uniformly at random among non-visited nodes instead of according to its degree (the second node j still being drawn uniformly among the neighbors of i). Computationally, this avoids updating the degrees after each collapse and further speeds up the coarsening procedure. The impact of sampling uniformly instead of according to the degrees remains limited thanks to the friendship paradox. This results in Algorithm 2; a minimal sketch of one coarsening pass is given below.
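The following sketch implements one coarsening pass under this last sampling rule, assuming a symmetric scipy.sparse CSR adjacency matrix. It mirrors the spirit of Algorithm 2 but is not the PASCO implementation; in particular, the O(1) table update of Algorithm 2 is replaced here by a simple relabeling:

```python
import numpy as np
import scipy.sparse as sp

def coarsen_once(A, n_target, seed=0):
    """One pass of uniform node sampling with marked visited nodes (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    h = np.arange(n)                        # node -> hypernode table
    available = np.ones(n, dtype=bool)      # nodes not yet involved in a collapse
    n_current = n
    for i in rng.permutation(n):            # first node drawn uniformly among non-visited nodes
        if n_current <= n_target:
            break
        if not available[i]:
            continue
        neighbors = A.indices[A.indptr[i]:A.indptr[i + 1]]
        if neighbors.size == 0:
            continue
        j = rng.choice(neighbors)           # second node drawn uniformly among the neighbors of i
        if h[j] == h[i]:
            continue
        h[h == h[j]] = h[i]                 # collapse edge (i, j): merge the two hypernodes
        available[i] = available[j] = False
        n_current -= 1
    _, labels = np.unique(h, return_inverse=True)               # relabel hypernodes 0..n_current-1
    H = sp.csr_matrix((np.ones(n), (np.arange(n), labels)))     # N x n_current membership matrix
    return labels, H.T @ A @ H                                  # coarsened (weighted) adjacency
```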
Appendix C: Complexity of the coarsening phase in PASCO
Lemma 2
The complexity of Algorithm 2 is \(\mathcal {O}(n^{(\ell )} + N_{\operatorname {while}}^{(\ell )} + E^{(\ell )} )\), where \(N_{\operatorname {while}}^{(\ell )}\) is the number of times the while loop is executed and \(E^{(\ell )}\) is the number of non-zero coefficients in the adjacency matrix \(A^{(\ell )}\).
Proof
The initializations of \(V_a\) and \(h^{(\ell )}\) cost \(\mathcal {O}(n^{(\ell )})\). Drawing the nodes u and v costs \(\mathcal {O}(1)\) as they are sampled uniformly at random. Updating \(h^{(\ell )}\) (step 7) has complexity \(\mathcal {O}(1)\). Similarly, updating \(V_a\) can be performed in \(\mathcal {O}(1)\) by updating a boolean array that keeps track of which nodes are in \(V_a\). Moreover, relabeling \(h^{(\ell )}\) has complexity \(\mathcal {O}(n^{(\ell )})\). Finally, computing \(A^{(\ell +1)}\) can be performed in \(\mathcal {O}(E^{(\ell )})\) by iterating over the non-zero coefficients of \(A^{(\ell )}\). Summarizing, the complexity of Algorithm 2 is \(\mathcal {O}(n^{(\ell )} + N_{\operatorname {while}}^{(\ell )} + E^{(\ell )})\). \(\square\)
Now we prove Proposition 1.
Proof of Proposition 1
The initialization of Algorithm 1 (essentially creating h) has complexity \(\mathcal {O}(N)\).
Now let us denote by \(N_{rep}\) the number of times the while loop is repeated. According to Lemma 2, the overall complexity of the while loop in Algorithm 1 is \(\mathcal {O}( N_{rep} ( N + |E| ) )\). This is obtained by upper-bounding \(n^{(\ell )}\) and \(N_{\operatorname {while}}^{(\ell )}\) by N, and \(E^{(\ell )}\) by |E|, where |E| is the number of non-zero coefficients in the initial adjacency matrix \(\mathbb {A}\). Without loss of generality, we can always assume that the graph has no isolated node (the clustering problem being irrelevant for these nodes), therefore the complexity of the while loop reduces to \(\mathcal {O}( N_{rep} |E| )\).
Now let us prove that \(N_{rep} \le 1 + \log _2 \rho\). In Algorithm 2, remark that the while loop may stop for two reasons. First, the target size is reached, meaning that \(N_{\operatorname {while}}^{(\ell )} = n^{(\ell )}-n\). This only happens for \(\ell = N_{rep} -1\). Second, the set of available nodes \(V_a\) is empty. In this case, knowing that at each iteration we remove either 1 or 2 elements from \(V_a\) (except at the first iteration, where 2 are removed), we have \(n^{(\ell )} / 2 \le N_{\operatorname {while}}^{(\ell )} \le n^{(\ell )} -1\). This is the case for every iteration \(\ell \le N_{rep} -2\). Thus, for all \(\ell \le N_{rep} -2\), we have \(n^{(\ell +1)} = n^{(\ell )} - N_{\operatorname {while}}^{(\ell )} \le n^{(\ell )} / 2\). Therefore, for all \(\ell \le N_{rep}-1\), we have \(n^{(\ell )} \le (1/2)^{\ell } N\) (as \(n^{(0)}=N\)). Now, remark that \(n^{(N_{rep}-1)} > n\), otherwise the coarsening would already have stopped. Recalling that \(n = \lfloor \rho ^{-1} N \rfloor\), we have \(\rho ^{-1} N \le (1/2)^{N_{rep}-1} N\), yielding \(N_{rep} \le 1 + \log _2 \rho\). \(\square\)
Appendix D: Relegated theoretical results
Lemma 3
Let \({P} \in \mathcal {P}_{N, k}\) be a partition matrix (Definition 1), and assume that \(\forall i \in {[\![k]\!]}, [{P}^\top {\textbf{1}}_N]_i \ne 0\). Then, for any integer \(\overline{k}\) and any matrix \(\overline{P}\in \mathbb {R}^{N \times \overline{k}}\), with \(\mathcal {Q}(k, \overline{k}) = \mathbb {R}^{k \times \overline{k}}\), \(Q^\star = \operatorname {diag}({P}^\top {\textbf{1}}_N)^{-1}{P}^\top \overline{P}\) is an optimal solution to problem (A2).
Proof
Denote \(f(Q) = \Vert {P} Q-\overline{P}\Vert _F^2\). Since \({P}\) is a partition matrix, its columns have pairwise disjoint supports and we have \(P^\top P=\operatorname {diag}({P^\top }{\textbf{1}}_N)\), hence
$$\begin{aligned} f(Q) = \langle \operatorname {diag}(P^\top {\textbf{1}}_N) Q, Q\rangle - 2 \langle Q, P^\top \overline{P}\rangle + \Vert \overline{P}\Vert _F^2. \end{aligned}$$
The optimization problem is convex; setting the gradient of f to zero, i.e., \(2 \operatorname {diag}(P^\top {\textbf{1}}_N) Q - 2 P^\top \overline{P}= 0\), gives the solution. \(\square\)
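As a quick numerical sanity check (an illustration only), the closed form of Lemma 3 coincides with the unconstrained least-squares solution computed by numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, k_bar = 200, 5, 3
labels = np.concatenate([np.arange(k), rng.integers(k, size=N - k)])  # every cluster non-empty
P = np.eye(k)[labels]                                                 # partition matrix
P_bar = np.eye(k_bar)[rng.integers(k_bar, size=N)]
Q_closed = (P.T @ P_bar) / P.sum(axis=0)[:, None]    # diag(P^T 1_N)^{-1} P^T P_bar (Lemma 3)
Q_lstsq, *_ = np.linalg.lstsq(P, P_bar, rcond=None)  # unconstrained least squares
assert np.allclose(Q_closed, Q_lstsq)
```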
Lemma 4
Let \({P} \in \mathcal {P}_{N, k}\) be a partition matrix (Definition 1), \(\overline{P} \in \mathbb {R}^{N \times \overline{k}}\), and \(\mathcal {Q}(k, \overline{k}) = \{Q \in \{0,1\}^{k \times \overline{k}}: Q{\textbf{1}}_{\overline{k}} = {\textbf{1}}_k\}\). Then \(Q^{\star }\), defined by setting, for each row \(i \in {[\![k]\!]}\), \(Q^{\star }_{i j_i} = 1\) for some \(j_i \in \operatorname {argmax}_{p \in {[\![\overline{k}]\!]}} [{P}^\top \overline{P}]_{i p}\) and \(Q^{\star }_{ij} = 0\) otherwise (i.e., \(Q^{\star } = \operatorname {row-bin}(Q_{\operatorname {lin-reg}})\)),
is an optimal solution to problem (A2).
Proof
Denote \(f(Q) = \Vert PQ-\overline{P}\Vert _F^2\). Using that \({P}\) is a partition matrix, that \(Q_{ij} \in \{0,1\}\) (hence \(Q_{ij}^2=Q_{ij}\)) and that \(Q {\textbf{1}}_{\overline{k}} = {\textbf{1}}_k\), we can rewrite f as
$$\begin{aligned} f(Q) = \langle \operatorname {diag}(P^\top {\textbf{1}}_N) Q, Q\rangle - 2 \langle Q, P^\top \overline{P}\rangle + \Vert \overline{P}\Vert _F^2 = N - 2 \langle Q, P^\top \overline{P}\rangle + \Vert \overline{P}\Vert _F^2. \end{aligned}$$
Denoting \(C = - {P}^\top \overline{P}\), a solution to problem (A2) can thus be found by solving
$$\begin{aligned} \min _{Q \in \mathcal {Q}(k, \overline{k})} \ \langle Q, C \rangle . \end{aligned}$$(D9)
Now Eq. (D9) is an optimization problem that decouples with respect to the rows of Q, i.e., there are k independent problems, one per row of Q. For each row \(i \in {[\![k]\!]}\), a solution can be found by choosing any column index j such that \(j \in \operatorname {argmin}_{p \in {[\![\overline{k}]\!]}} C_{ip}\). This condition is equivalent to finding j such that \(j \in \operatorname {argmax}_{p \in {[\![\overline{k}]\!]}} [{P}^\top \overline{P}]_{i p}\). \(\square\)
Lemma 5
Let \(P^{(1)}, \cdots , P^{(R)}\) be matrices, where each \(P^{(r)} \in \mathbb {R}^{N \times \overline{k}}\). Then a solution to
$$\begin{aligned} \min _{\overline{P}\in \mathcal {P}_{N, \overline{k}}} \ \sum _{r=1}^{R} \Vert P^{(r)} - \overline{P}\Vert _F^2 \end{aligned}$$(D10)
is given by the majority vote: for each row \(i \in {[\![N]\!]}\), set \(\overline{P}_{i j_i} = 1\) for some \(j_i \in \operatorname {argmax}_{p \in {[\![\overline{k}]\!]}} [\sum _{r=1}^{R} P^{(r)}]_{i p}\) and \(\overline{P}_{ij} = 0\) otherwise.
Now let \(P^{(1)}, \cdots , P^{(R)}\) be matrices, where each \(P^{(r)} \in \mathbb {R}^{N \times k_r}\), and let \(Q^{(1)}, \cdots , Q^{(R)}\) be coupling matrices such that each \(Q^{(r)} \in \mathcal {Q}_{\operatorname {ot}}(k_r, \overline{k})\). Then a solution to
$$\begin{aligned} \min _{\overline{P}\in \mathcal {P}_{N, \overline{k}}} \ \sum _{r=1}^{R} \Vert P^{(r)} Q^{(r)} - \overline{P}\Vert _F^2 \end{aligned}$$
is given by the majority vote applied to the realigned matrices: for each row \(i \in {[\![N]\!]}\), set \(\overline{P}_{i j_i} = 1\) for some \(j_i \in \operatorname {argmax}_{p \in {[\![\overline{k}]\!]}} [\sum _{r=1}^{R} P^{(r)} Q^{(r)}]_{i p}\) and \(\overline{P}_{ij} = 0\) otherwise.
Proof
For the first point, consider any \(\overline{P}\) satisfying the constraints. Since it is a partition matrix we have \(\Vert \overline{P}\Vert _F^2 = \sum _{ij} \overline{P}_{ij}^2 = \sum _{ij} \overline{P}_{ij} = {\textbf{1}}_N^\top \overline{P}{\textbf{1}}_{\overline{k}} = N\). Thus problem (D10) is equivalent to
$$\begin{aligned} \min _{\overline{P}\in \mathcal {P}_{N, \overline{k}}} \ \Big \langle - \sum _{r=1}^{R} P^{(r)}, \overline{P}\Big \rangle . \end{aligned}$$
As detailed in the proof of Lemma 4, a solution can be found by choosing, for each row i, a column index j such that \(j \in \operatorname {argmin}_{p \in {[\![\overline{k}]\!]}} [- \sum _{r=1}^{R} P^{(r)}]_{ip}\), which concludes the proof for the first point. For the second point, we use that \(\overline{P}\) is a partition matrix and \(Q^{(r)}\) are coupling matrices so that
$$\begin{aligned} \sum _{r=1}^{R} \Vert P^{(r)} Q^{(r)} - \overline{P}\Vert _F^2 = \sum _{r=1}^{R} \Vert P^{(r)} Q^{(r)}\Vert _F^2 - 2 \Big \langle \sum _{r=1}^{R} P^{(r)} Q^{(r)}, \overline{P}\Big \rangle + R \Vert \overline{P}\Vert _F^2. \end{aligned}$$
Using that \(\Vert \overline{P}\Vert _F^2 = N\) as previously proved, we get that the problem is equivalent to
$$\begin{aligned} \min _{\overline{P}\in \mathcal {P}_{N, \overline{k}}} \ \Big \langle - \sum _{r=1}^{R} P^{(r)} Q^{(r)}, \overline{P}\Big \rangle , \end{aligned}$$
hence the result. \(\square\)
Lemma 1
Let \({P} \in \mathcal {P}_{N, k}\) and \(D = \operatorname {diag}(P^\top {\textbf{1}}_N)\). Then problem (A2) with \(\mathcal {Q}(k, \overline{k}) = \mathcal {Q}_{\operatorname {ot}}(k, \overline{k})\) is equivalent to the problem
$$\begin{aligned} \min _{Q \in \mathcal {Q}_{\operatorname {ot}}(k, \overline{k})} \ -2 \langle C, Q\rangle + \Vert D^{1/2} Q\Vert _F^2, \end{aligned}$$(A4)
where \(C = P^\top \overline{P}\). Assuming no empty cluster in the partition \(P\), the solution of Eq. (A4) is unique and can be found by considering the dual problem
$$\begin{aligned} \max _{\mu \in \mathbb {R}^{k}, \nu \in \mathbb {R}^{\overline{k}}} \ \frac{1}{k} \langle \mu , {\textbf{1}}_k\rangle + \frac{1}{\overline{k}} \langle \nu , {\textbf{1}}_{\overline{k}}\rangle - \frac{1}{4} \Vert D^{-1/2} [\mu \oplus \nu + 2 C]_{+}\Vert _F^2, \end{aligned}$$(A5)
where \(\mu \oplus \nu := (\mu _i + \nu _j)_{ij}\) and for any matrix \(A, \ [A]_{+}:= \left( \max \{A_{ij}, 0\}\right) _{ij}\). More precisely, the optimal solution \(Q^\star\) of Eq. (A4) can be written as \(Q^{\star } = \frac{1}{2} D^{-1}[\mu ^\star \oplus \nu ^\star + 2 C]_{+}\) where \((\mu ^\star , \nu ^\star )\) is the optimal solution of Eq. (A5).
Proof
We will prove a slightly more general result by considering the problem
$$\begin{aligned} \min _{Q \in \mathcal {Q}_{\operatorname {ot}}(p, p_{\operatorname {ref}})} \ \langle M, Q\rangle + \frac{\gamma }{2} \Vert L^{1/2} Q\Vert _F^2, \end{aligned}$$(D17)
where \(M \in \mathbb {R}^{p \times p_{\operatorname {ref}}}, \gamma > 0\) and L is a symmetric positive definite matrix. We note \(a = \frac{1}{k} {\textbf{1}}_k, b = \frac{1}{\overline{k}} {\textbf{1}}_{\overline{k}}\). We will then apply this with \(M = -2C, \gamma = 2\) and \(L = D\), which is symmetric positive definite when there are no empty clusters (since in this case \(\forall i \in {[\![k]\!]}, D_{ii} \ne 0\)). Most of our calculations are adapted from (Blondel et al., 2018). First, since L is a symmetric positive definite matrix, problem Eq. (D17) is strongly convex, thus it admits a unique solution.
To look at the dual of Eq. (D17), we consider the Lagrangian
where \(\Gamma\) is the variable accounting for the non-negativity constraints on Q. We have
This is satisfied when \(Q = Q^\star = \frac{1}{\gamma } L^{-1}(\Gamma + \mu \oplus \nu - M)\). Moreover,
Thus
Hence
Now we solve the problem over \(\Gamma\), that is, we maximize
where \(\ge 0\) should be understood pointwise. This is equivalent to
where \(A:= M - \mu \oplus \nu\). Writing \(L = U \Delta U^\top\) where \(\Delta = \operatorname {diag}(d_1, \cdots , d_n)\) with \(d_i > 0\), Eq. (D24) can equivalently be written as
With a change of variable \({\tilde{\Gamma }} = \Delta ^{-1/2} \Gamma \ge 0\) this is equivalent to
whose minimum is given by \({\tilde{\Gamma }} = [\Delta ^{-\frac{1}{2}}A]_+ = \Delta ^{-\frac{1}{2}}[A]_+\) since \(\Delta ^{-\frac{1}{2}}\) is a diagonal matrix with positive entries. Thus the solution of (D24) is given by \(\Gamma = \Delta ^{1/2} {\tilde{\Gamma }} = \Delta ^{1/2} [\Delta ^{-1/2}A]_+ = [A]_+\). Also \([A]_+ - A = [-A]_{+}\). Thus
Hence the dual problem of Eq. (D17) is given by \(\max _{\mu , \nu } \mathcal {L}(Q^\star , \mu , \nu , [-A]_{+})\) which is
Applying this to \(M = -2C, \gamma = 2\) and \(L = D\) concludes. \(\square\)
Appendix E: Experiments details and extra results
1.1 E.1: Details about the implementation
The PASCO implementation relies on the POT library (Flamary et al., 2021) for the fusion part. The heaviest experiments were performed with the support of the Centre Blaise Pascal’s IT test platform at ENS de Lyon (Lyon, France) that operates the SIDUS solution (Quemener & Corvellec, 2013). We use an Intel Xeon Gold 5218 machine.
1.2 E.2: Details of the coarsening experiment
Table 1 provides some characteristics of the real graphs used in Fig. 3.
The RSA experiments in Sect. 5.1 tested how well coarsening algorithms, including PASCO, preserve spectral properties. Figure 8 completes Fig. 3 with SBM graphs. Graphs were drawn from \(\operatorname {SSBM}(1000,k,2,\alpha )\), for \((k = 10, \alpha = 1/k)\), \((k = 100, \alpha = 1/k)\) and \((k = 10, \alpha = 1/(2k))\). Below, in Fig. 9, we display the computational times of the coarsening methods.
We represent the computational time of various coarsening schemes (including PASCO) as a function of the compression rate (the higher, the coarser the obtained graph). Shaded areas represent the 0.2 upper and lower quantiles over the 10 repetitions of the experiment. Top row: we reproduce part of the experiment of (Loukas, 2019, Figure 2) with the same real graphs and added PASCO. Bottom row: same experiment but with random graphs drawn from \(\operatorname {SSBM}(1000, k, 2, \alpha )\) for \((k=10, \alpha =1/k)\), \((k=100, \alpha =1/k)\) and \((k=10, \alpha =1/(2k))\)
1.3 E.3: Effectiveness of the alignment/fusion phase
To demonstrate the alignment+fusion procedure, we now consider a synthetic dataset generated from a two-dimensional Gaussian mixture model (GMM) with three clusters. The clusters consist of 500, 400 and 200 points, respectively. Each cluster is sampled from an isotropic Gaussian distribution with a standard deviation of 0.25 and centers located at (0, 1), (1, 0), and (0, 0), respectively. The resulting dataset is shown in the top left panel of Fig. 10.
We generate 15 different partitions of the dataset with the goal of recovering the true partition corresponding to the original GMM clusters. The partitions are constructed as follows: first, we randomly select a number of clusters k, drawn uniformly between 3 and 10. Then, we designate k centroids: the first three are randomly selected from each of the true GMM clusters (so that we have at least one starting centroid in each true cluster), while the remaining centroids are uniformly sampled from the remaining points. Each point in the dataset is assigned to the nearest centroid, thereby forming a partition. Forcing each true cluster to be initially represented by at least one centroid ensures that the resulting partition is related to the true partition. Two examples of these generated partitions are shown in Fig. 10; a sketch of this construction is given below.
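A short sketch of this construction under the stated parameters (illustrative code, not the experiment script):

```python
import numpy as np

def make_gmm_partitions(R=15, seed=0):
    """Toy dataset (2D GMM, three clusters) and R centroid-based partitions as described above."""
    rng = np.random.default_rng(seed)
    sizes, centers = [500, 400, 200], np.array([[0., 1.], [1., 0.], [0., 0.]])
    X = np.vstack([rng.normal(c, 0.25, size=(n, 2)) for n, c in zip(sizes, centers)])
    true_labels = np.repeat(np.arange(3), sizes)
    partitions = []
    for _ in range(R):
        k = rng.integers(3, 11)                                  # number of clusters in {3,...,10}
        # one seed point per true cluster, then k-3 extra seeds drawn uniformly among the rest
        seeds = [rng.choice(np.flatnonzero(true_labels == c)) for c in range(3)]
        others = rng.choice(np.setdiff1d(np.arange(len(X)), seeds), size=k - 3, replace=False)
        centroids = X[np.concatenate([seeds, others])]
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        partitions.append(labels)                                # assign each point to nearest centroid
    return X, true_labels, partitions
```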
The effectiveness of the proposed alignment and fusion method (Algorithm 3) is evaluated by comparing several alignment techniques, lin-reg, many-to-one, and the proposed ot, at recovering the true partition. For each case, we consider a randomly initialized reference with \(\overline{k} = 3\), such that each point is assigned to a cluster chosen uniformly at random. As a baseline, we compare with K-means clustering with \(K=3\). We emphasize that the K-means algorithm benefits from the spatial coordinates of the data points, whereas the alignment+fusion methods operate solely on the different partitions \(P_1, \cdots , P_R\) and ignore the positions. The experiment is repeated one hundred times, and the average Adjusted Mutual Information (AMI) (Vinh et al., 2009) (see Sect. E.5 for the definition) between the inferred and true partitions is plotted as a function of the number of partitions R in the bottom right panel of Fig. 10. The results indicate that the ot method achieves performance comparable to K-means (high AMI, with small variance) when \(R \ge 6\), while lin-reg and many-to-one have high variance and struggle to retrieve the true partition for any value of R. This can be explained by the fact that the partitions are quite unbalanced and thus better suited to the coupling constraints. Finally, a partition recovered using the ot method is shown in the bottom left panel of Fig. 10, illustrating that it is nearly a perfect permutation of the true partition.
Experiment with a toy dataset drawn from a 2D GMM (top left). Colors indicate clusters. Two partitions (out of 15) are depicted (top center and right), as well as the recovered partition using the alignment+fusion procedure based on ot (bottom left). The bottom right panel presents the average AMI between the true partition and the fused partitions obtained by ot, many-to-one, and lin-reg. Shaded areas represent the 0.2 upper and lower quantiles over the 100 runs of the alignment+fusion algorithms
1.4 E.4: Additional results on parameter influence
Here, we provide additional results on the experiments of Sect. 5.2. Figure 11 shows the performance of the various alignment methods for different difficulty levels in the SSBM. Parameters are the same as in Fig. 4.
In Fig. 5, we showed the computational gains of using PASCO. As a sanity check, in Fig. 12 we show that the quality of the output partition remains satisfactory: speed was not achieved at the cost of quality.
1.5 E.5: Definition of the scores used to evaluate clustering quality
Definition 4
(Adjusted Mutual Information) The Adjusted Mutual Information (AMI) between two partitions \(U = (U_1, \dots , U_k)\) and \(V = (V_1, \dots , V_{k'})\) of the node set \(\mathbb {V}\) of size N is given by
$$\begin{aligned} \operatorname {AMI}(U,V) = \frac{\operatorname {MI}(U,V) - \mathbb {E}[\operatorname {MI}(U,V)]}{\max \left( H(U), H(V)\right) - \mathbb {E}[\operatorname {MI}(U,V)]}, \end{aligned}$$(E29)
where
$$\begin{aligned} \operatorname {MI}(U,V) = \sum _{i=1}^{k} \sum _{j=1}^{k'} P_{UV}(i,j) \log \frac{P_{UV}(i,j)}{P_U(i) P_V(j)}, \end{aligned}$$
with \(P_U(i) = |U_i|/N\) (similarly for \(P_V(j)\)), \(P_{UV}(i,j) = |U_i \cap V_j|/N\), and \(H(U) = - \sum _i P_U(i) \log P_U(i)\) (similarly for H(V)). The expected mutual information \(\mathbb {E}[\operatorname {MI}(U,V)]\) in (E29) is computed by assuming a hypergeometric distribution for U and V, the parameters being estimated from the partitions. For the sake of conciseness and simplicity, we refer the reader to Vinh et al. (2009) for the precise formula.
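In practice, the AMI can be computed directly from label vectors, for instance with scikit-learn (an illustration; not necessarily the evaluation code used here):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([1, 1, 0, 0, 2, 2])   # same clustering up to a permutation of labels
print(adjusted_mutual_info_score(labels_true, labels_pred))  # 1.0: AMI is permutation-invariant
```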
Definition 5
(Generalized Normalized Cut) Given a graph \(\mathbb {G}\) represented by its adjacency or weight matrix \(\mathbb {A}\), consider a partition \((V_1, \dots , V_k)\) of the vertex set \(\mathbb {V}\) of \(\mathbb {G}\). The generalized normalized cut of the partition is defined as
Definition 6
(Modularity) Given a graph \(\mathbb {G}\) represented by its adjacency or weight matrix \(\mathbb {A}\), consider a partition \((V_1, \dots , V_k)\) of the vertex set \(\mathbb {V}\) of \(\mathbb {G}\). Let \(d_u\) denote the degree of node u and m the total weight of the edges. Then, the modularity is given by
$$\begin{aligned} \operatorname {mod}(V_1, \dots , V_k) = \frac{1}{2m} \sum _{i=1}^{k} \sum _{u, v \in V_i} \left( \mathbb {A}_{uv} - \frac{d_u d_v}{2m}\right) . \end{aligned}$$
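A small helper mirroring this formula, assuming a dense numpy adjacency matrix (an illustration, not the evaluation code of the paper):

```python
import numpy as np

def modularity(A, labels):
    """Modularity of Definition 6 for a dense symmetric adjacency matrix A and a label vector."""
    labels = np.asarray(labels)
    d = A.sum(axis=1)                              # (weighted) degrees
    two_m = d.sum()                                # 2m = total edge weight counted twice
    same = labels[:, None] == labels[None, :]      # pairs of nodes in the same cluster
    return ((A - np.outer(d, d) / two_m) * same).sum() / two_m
```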
Definition 7
(Description Length) Let \(\mathbb {G}= (\mathbb {V}, \mathbb {E})\) be a graph and consider the partition \((V_1, \dots , V_k)\) of the vertex set \(\mathbb {V}\). We denote by \(e_{i,j}\) the number of edges between \(V_i\) and \(V_j\). In mathematical terms, the description length is defined by \(\texttt {dl} = \mathcal {S} + \mathcal {L}\), where \(\mathcal {S}\) is the entropy of the fitted stochastic block model and \(\mathcal {L}\) is the information required to describe the model. Their expressions are given by
with \(h(x) = (1+x) \log (1+x) - x \log x\).
1.6 E.6: Real data experiments
Table 2 provides a few characteristics of the real graphs used in Sect. 5.3.
Here, we present the results of the experiments on real graphs as tables. Bold values indicate the best result for each criterion for a given clustering algorithm (SC, CSC, louvain, leiden, MDL, infomap) (Tables 3, 4, 5).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lasalle, E., Vaudaine, R., Vayer, T. et al. PASCO (PArallel Structured COarsening): an overlay to speed up graph clustering algorithms. Mach Learn 114, 212 (2025). https://doi.org/10.1007/s10994-025-06837-7
DOI: https://doi.org/10.1007/s10994-025-06837-7