
Spectral Embedded Deep Clustering

Yuichiro Wada 1, Shugo Miyamoto 2, Takumi Nakagama 3, Léo Andéol 4,5, Wataru Kumagai 5 and Takafumi Kanamori 3,5,*

1 Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
2 Department of Systems Innovation, School of Engineering, The University of Tokyo, Hongo Campus, Eng. Bldg. No. 3, 2F, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
3 Department of Mathematical and Computing Science, School of Computing, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, Japan

4 Computer Science Department, Sorbonne Université, 4 place Jussieu, 75005 Paris, France
5 RIKEN AIP, Nihonbashi 1-chome Mitsui Building, 15th floor, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan

* Correspondence: adplpg@r.postjobfree.com

Received: 13 June 2019; Accepted: 12 August 2019; Published: 15 August 2019

Abstract: We propose a new clustering method based on a deep neural network. Given an unlabeled dataset and the number of clusters, our method directly groups the dataset into the given number of clusters in the original space. We use a conditional discrete probability distribution defined by a deep neural network as a statistical model. Our strategy is first to estimate the cluster labels of unlabeled data points selected from a high-density region, and then to conduct semi-supervised learning to train the model by using the estimated cluster labels and the remaining unlabeled data points. Lastly, by using the trained model, we obtain the estimated cluster labels of all given unlabeled data points. The advantage of our method is that it does not require the key conditions that existing clustering methods with deep neural networks rely on, such as the assumption that the cluster balance of a given dataset is uniform. Moreover, it can be applied to various data domains as long as the data are expressed by feature vectors. In addition, we observed that our method is robust against outliers. Therefore, the proposed method is expected to perform, on average, better than previous methods. We conducted numerical experiments on five commonly used datasets to confirm the effectiveness of the proposed method.

Keywords: clustering; deep neural networks; manifold learning; semi-supervised learning

1. Introduction

Clustering is one of the oldest machine-learning fields, where the objective is, given data points, to group them into clusters according to some measure. Many clustering methods have been proposed over the years [1] and applied to real-world problems [2]. The best-known classical methods are k-means [3] and Gaussian Mixture Model (GMM) clustering [4]. Though those methods are computationally efficient, they can only model convex shapes and are thus applicable in limited cases. Kernel k-means [5], kernel GMM clustering [6] and Spectral Clustering (SC) [7] can capture more complicated shapes than k-means and GMM but are difficult to scale up to large datasets. In recent years, due to technological progress, we can acquire many types of data, such as images, texts, and genomes, in large numbers. Thus, the demand for advanced, efficient clustering methods grows even stronger [8]. Thanks to the development of deep neural networks, we can now handle large datasets with complicated shapes [9]. Consequently, clustering methods using deep neural networks have been proposed. One major direction in these studies is to combine deep AutoEncoders (AE) [10] with classical clustering methods. The AE is used to obtain a clustering-friendly low-dimensional representation.

Another major direction is directly grouping a given unlabeled dataset into clusters in the original input space by employing a deep neural network to model the distribution of cluster labels. In both directions there exist popular methods. We summarize their applicable data domains and well-performing conditions in Table 1. For example, CatGAN (Categorical Generative Adversarial Networks) learns discriminative neural network classifiers that maximize the mutual information between the input data points and the cluster labels, while enforcing the robustness of the classifiers to data points produced by adversarial generative models. Since maximizing mutual information implicitly encourages the cluster-balance distribution of the model to be uniform, CatGAN performs well under the condition of uniform cluster balance. JULE (Joint Unsupervised LEarning) learns a clustering-friendly low-dimensional representation for image datasets by using a convolutional neural network [11]. The assigned cluster labels and the low-dimensional representation are jointly optimized by updating an n × n similarity matrix of the representations, where n is the number of data points. Thus, O(n^2) memory space must be secured to conduct the method.

Table 1. Summary of popular deep clustering methods. Three AE-based methods and four direct methods are presented, including our method SEDC. A method has a preferred domain when it is specialized for a certain type of data. Please note that all methods require the smoothness and manifold assumptions; some methods require additional key conditions beyond these two assumptions. Empty cells mean "None". All methods below require the number of clusters as one of their inputs.

AE-Based Methods    Preferred Domain    Key Condition
DEC [12]                                AE acquires a good representation.
JULE [13]           Image               Size of dataset is not large.
VaDE [14]                               The representation follows a GMM.

Direct Methods      Preferred Domain    Key Condition
IMSAT [15]                              Cluster balance is uniform.
CatGAN [16]                             Cluster balance is uniform.
SpectralNet [17]
SEDC

As we can see in Table 1, most of these key conditions are not always realistic, since the details of given unlabeled datasets are unknown and their size is large in typical machine-learning scenarios. On the other hand, SpectralNet does not require a key condition. It only requires the following two fundamental assumptions: the smoothness and manifold assumptions [18]. Please note that the other methods in Table 1 also require these two assumptions. As for the weakness of SpectralNet, it is not robust against outliers. In the learning process, it learns the pairwise similarities over all data points. Therefore, the existence of outliers prevents the method from learning the similarities precisely, and it thus returns inaccurate clustering results.

In this paper, we propose a deep clustering method named Spectral Embedded Deep Clustering

(SEDC). Given an unlabeled dataset and the number of clusters, SEDC directly groups the dataset into the given number of clusters in the input space. Our statistical model is a conditional discrete probability distribution, which is defined by a fully connected deep neural network. SEDC does not require any key condition except the smoothness and manifold assumptions, and it can be applied to various data domains. Moreover, throughout our numerical experiments, we observed that our method was more robust against outliers than SpectralNet. The procedure of SEDC is composed of two stages. In the first stage, we conduct SC only on the unlabeled data points selected from high-density regions, using the geodesic metric, to estimate their cluster labels. This special type of SC is named Selective Geodesic Spectral Clustering (SGSC), which we also propose to assist SEDC. Thereafter, we conduct semi-supervised learning to train the model by using the estimated cluster labels and the remaining unlabeled data points. Please note that in this semi-supervised learning, we treat the estimated cluster labels of the selected unlabeled data points as the given true cluster labels. At last, by using the trained model, we obtain the estimated cluster labels of all given unlabeled data points. In the remainder of this paper, we introduce related works in Section 2. We then introduce our proposed method in Section 3. We demonstrate the efficiency of our method with numerical experiments in Section 4. Finally, in Section 5, we conclude the paper with a discussion of future work.

2. Related Works

In this section, we first introduce the existing clustering studies based on deep neural networks. As mentioned in Section 1, there are two major directions in recent studies, i.e., the deep-AE-based clustering and the direct deep clustering. We then introduce the two techniques related to our proposed method, i.e., SC and Virtual Adversarial Training (VAT).

2.1. Existing Clustering Methods Using Deep Neural Network

In deep-AE-based clustering, the AE and a classical clustering method such as k-means are combined. The combination strategies are either sequential [19,20] or simultaneous [12–14]. In the sequential way, deeply embedded representations of the given dataset are obtained by the deep AE, and then a classical clustering method is applied to the embedded set. In the simultaneous way, the deep representations and their cluster labels are learned jointly by optimizing a single objective function.

As examples of the simultaneous way, we introduce [12,14] here. The method of [12] is named Deep Embedded Clustering (DEC). DEC trains a deep neural network by iteratively minimizing the Kullback–Leibler (KL) divergence between a centroid-based probability distribution and an auxiliary target distribution. The deep network is used as the AE. The authors reported that the clustering performance of DEC depends on the initial embedded representations and cluster centroids obtained by the AE and k-means. The method of [14] is named Variational Deep Embedding (VaDE). The approach relies on a Variational AutoEncoder (VAE) [21] that uses a Gaussian mixture prior. VaDE trains its deep neural network by minimizing the reconstruction error, while enforcing that the low-dimensional representations follow the Gaussian mixture model. Regarding direct deep clustering, we introduce [15,17]. The method of [15] is named Information Maximizing Self-Augmented Training (IMSAT). It is based on data augmentation: a deep neural network is trained to maximize the mutual information while regularizing the network so that the cluster label assignment of the original data is consistent with the assignment of the augmented data. The method of [17] is named SpectralNet. This method was proposed to overcome the scalability and generalization limitations of SC. It uses two deep neural networks. The first network learns the similarities among all given data points; this network is known as a Siamese net [22]. Then, the second network learns a dimension-reduction mapping that preserves the similarities obtained by the first network. After both are trained, the dimensionality-reduced data points obtained by the second network are grouped into clusters by k-means.

2.2. Related Techniques with Proposed Method

SEDC handles the following clustering problem: given a set of unlabeled data points X = {x_i}_{i=1}^n (x_i ∈ R^D) and the number of clusters C, group X into C clusters. In SEDC, the estimated cluster label of x_i is obtained by the trained conditional discrete probability distributional classifier. This classifier is denoted by p_θ(y|x), where θ, x and y are a set of parameters, a feature vector and a cluster label, respectively. The cluster label y ranges over {1, ..., C}. In addition, the classifier is defined by a fully connected deep neural network whose last layer is the soft-max function. SC and VAT, explained below, play an important role in SEDC; see the details in Section 3.
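To make the statistical model concrete, the following is a minimal Python (PyTorch) sketch of such a classifier: a fully connected network whose soft-max output is read as p_θ(y|x). It is not the authors' implementation; the hidden-layer widths are illustrative assumptions.

import torch
import torch.nn as nn

class ClusterClassifier(nn.Module):
    """Fully connected network modeling p_theta(y | x) over C cluster labels."""
    def __init__(self, input_dim, num_clusters, hidden_dim=1200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_clusters),  # logits for the C clusters
        )

    def forward(self, x):
        # Soft-max last layer: returns the conditional distribution p_theta(y | x).
        return torch.softmax(self.net(x), dim=1)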

2.2.1. Spectral Clustering

SC [7,23] is a popular classical clustering algorithm. We here introduce the commonly used framework of SC, which is also used in the SEDC algorithm. It first embeds the data points in the eigenspace of the Laplacian matrix derived from the pairwise similarities over all given data points. Then, SC applies k-means to this representation to obtain the cluster labels. The SC algorithm is outlined below.

1. Given the dataset X, define the weighted undirected graph G which comprises a set of vertices X and a set of undirected edges E defined by k1-nearest neighbors (k1-NN) based on a metric d. Suppose that each edge e_ij ∈ E has a non-negative symmetric weight w_ij.
2. Denote by L the n × n affinity matrix D^{-1/2} W D^{-1/2} on G, where W = (w_ij)_{i,j} and D is the diagonal matrix whose entries are given by d_ii = Σ_j w_ij.
3. Given the number of clusters C, compute the C largest eigenvectors u_1, ..., u_C of the eigenproblem Lu = λu. Let U ∈ R^{n×C} be the matrix containing the vectors u_1, ..., u_C as columns. Thereafter, re-normalize each row of U to have unit length.
4. Cluster the n rows of U as points in R^C by conducting k-means (k = C). Let {ŷ_i}_{i=1}^n and {m_j}_{j=1}^C be the estimated labels of X and the set of centroids obtained by k-means, respectively.

The above procedure can be written as ({ŷ_i}_{i=1}^n, {m_j}_{j=1}^C, U) = SC(X, k1, d, w, C). The weight w is often defined by the following similarity function:

    w_ij = exp(-d(x_i, x_j)^2 / σ^2)  if e_ij ∈ E,  and  w_ij = 0  otherwise.   (1)

The bandwidth σ is selected by the median or mean heuristic [24]. In our proposed method, we employ the k-means++ [25] technique in the k-means step, since our method uses this SC function.
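For reference, the following is a minimal NumPy/scikit-learn sketch of the SC routine outlined above (steps 1-4 with the similarity of Equation (1)). It is not the authors' implementation: the dense eigendecomposition and the median-heuristic bandwidth are simplifying assumptions made for clarity.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def spectral_clustering(X, k1, C):
    """Sketch of (labels, centroids, U) = SC(X, k1, d, w, C) with the Euclidean metric."""
    # Step 1: k1-NN graph with Euclidean edge lengths, symmetrized to be undirected.
    dist = kneighbors_graph(X, n_neighbors=k1, mode="distance").toarray()
    dist = np.maximum(dist, dist.T)
    sigma = np.median(dist[dist > 0])                          # median heuristic for the bandwidth
    W = np.where(dist > 0, np.exp(-dist**2 / sigma**2), 0.0)   # Equation (1)
    # Step 2: normalized affinity L = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Step 3: C largest eigenvectors, rows re-normalized to unit length.
    _, eigvec = np.linalg.eigh(L)                              # eigenvalues in ascending order
    U = eigvec[:, -C:]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Step 4: k-means (k-means++ initialization) on the rows of U.
    km = KMeans(n_clusters=C, init="k-means++", n_init=10).fit(U)
    return km.labels_, km.cluster_centers_, U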

2.2.2. Virtual Adversarial Training

VAT [26] is a regularization method based on local perturbation. It forces the statistical model p_θ(y|x) to follow the smoothness assumption. VAT is known to empirically perform better than other local-perturbation methods, such as random perturbation [27] and adversarial training [28], in both semi-supervised and supervised learning scenarios. It can also be employed in unsupervised learning scenarios [15,29], since VAT only requires unlabeled data points. VAT first defines the adversarial point T_VAT(x) for a given x as follows:

    T_VAT(x) = x + r_vadv,   (2)

where r_vadv is the ε-perturbation in the virtual adversarial direction:

    r_vadv = argmax_r { KL[p_θt(y|x) || p_θ(y|x + r)] ; ||r||_2 ≤ ε }.   (3)

In Equation (3), θ_t is the estimated parameter at the t-th iteration, and KL is the Kullback–Leibler divergence [30]. Then, VAT minimizes the following R_VAT(θ; X) with respect to θ:

    R_VAT(θ; X) = (1/n) Σ_{i=1}^n KL[p_θt(y|x_i) || p_θ(y|T_VAT(x_i))].   (4)

The approximation of r_vadv in Equation (3) is computed by the following two steps:

    g ← ∇_r KL[p_θt(y|x) || p_θt(y|x + r)] |_{r = ξd},   (5)

    r_vadv ≈ ε g / ||g||_2,   (6)

where d ∈ R^D is a random unit vector generated from the standard normal distribution, and ξ ∈ R_+ is a fixed small positive number. Regarding the logic behind the approximation, see Section 3.3 of [26].

Remark 1. The radius ε of Equation (3) is defined for a given x as

    ε(x) = α ||x − x^(z)||_2,   (7)

where α is a scalar and x^(z) is the z-th nearest data point from x. In [15], α = 0.4 and z = 10 are used.
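As a concrete illustration, the following is a minimal PyTorch sketch of the VAT loss of Equation (4) for a mini-batch of feature vectors, using the one-step approximation of Equations (5) and (6). It is not the authors' code: model(x) is assumed to return soft-max probabilities p_θ(y|x), and eps may be a scalar or a per-sample column tensor computed as in Equation (7).

import torch
import torch.nn.functional as F

def _l2_normalize(d, eps_num=1e-12):
    # Normalize each row of d to unit Euclidean norm.
    return d / (d.norm(dim=1, keepdim=True) + eps_num)

def vat_loss(model, x, eps, xi=1e-6):
    """Sketch of R_VAT (Equation (4)) on one mini-batch of feature vectors x."""
    with torch.no_grad():
        p = model(x)                                   # p_{theta_t}(y|x), held fixed
    # Equations (5)-(6): approximate the virtual adversarial direction r_vadv.
    d = _l2_normalize(torch.randn_like(x))             # random unit vector
    d.requires_grad_(True)
    p_hat = model(x + xi * d)
    kl = F.kl_div(torch.log(p_hat + 1e-12), p, reduction="batchmean")
    grad = torch.autograd.grad(kl, d)[0]
    r_vadv = eps * _l2_normalize(grad.detach())
    # Equations (2) and (4): KL between predictions at x and at T_VAT(x) = x + r_vadv.
    p_adv = model(x + r_vadv)
    return F.kl_div(torch.log(p_adv + 1e-12), p, reduction="batchmean")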

3. Proposed Method

As already mentioned at the end of Section 1 and the beginning of Section 2.2, given an unlabeled dataset X = {x_i}_{i=1}^n (x_i ∈ R^D) and the number of clusters C, our proposed deep clustering method SEDC groups X into C clusters. Since this grouping is achieved by obtaining the estimated cluster labels of X, our goal can be restated as estimating the cluster labels up to a permutation of labels. In SEDC, the estimated cluster label of each x_i ∈ X is defined by argmax_{j=1,...,C} p_θ*(y = j|x_i), where θ* is the trained set of parameters. The training scheme of the classifier p_θ(y|x) is as follows: we first estimate the cluster labels of selected unlabeled data points by using only X (this part is done by the SGSC algorithm), and then conduct semi-supervised learning to train the classifier. In this semi-supervised learning, the estimated cluster labels of the selected unlabeled data points are treated as the given true cluster labels, and the remaining data points are treated as unlabeled data points.

In this section, we first introduce SGSC. Thereafter, we present our main method SEDC.

3.1. Selective Geodesic Spectral Clustering

Using SGSC, the clustering problem is converted into a semi-supervised learning problem as shown below. In SGSC, some unlabeled data points are first selected from high-density regions of the original dataset. Then, SC is applied to the selected unlabeled data points with the geodesic metric. As a result, we obtain cluster labels for the selected points. Since the points are picked from high-density regions, the locations of the selected points are stable and robust against outliers [31]. The geodesic metric is approximated by shortest-path distances on a graph. The reason for employing the geodesic metric is that it is known to be useful for capturing the structure of the data manifolds, especially when the number of given data points is large [32]. Here, the number of selected data points is a hyperparameter of SGSC. Suppose that the hyperparameters are tuned appropriately. Then, the set of selected data points with their estimated cluster labels can roughly approximate the manifolds represented by the full original dataset, even when the dataset has complicated manifolds including outliers.

Through numerical experiments on five datasets in a later section, the following two points are confirmed. First, the estimation accuracy of the cluster labels of the points selected by SGSC stays high. Second, owing to this highly accurate estimation, SGSC helps the clustering by SEDC succeed on several types of datasets on average. We will refer to the selected data points as hub data points. Let H ⊂ X be the set of hub data points. The hub dataset H is formally defined below.

Definition 1. Let X be the given unlabeled dataset. On the graph G = (X, E), let N_j be the set of adjacent nodes of x_j ∈ X. For a natural number h, the hub set H is defined as the collection of the h nodes whose neighborhoods N_j have the largest cardinality in X.
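As an illustration of Definition 1, the following Python sketch builds the k0-NN graph and picks the h highest-degree nodes as the hub set. It is not the authors' code; symmetrizing the k-NN adjacency (an edge exists if either point lists the other as a neighbor) is an assumption made here for concreteness.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def hub_set(X, k0, h):
    """Return the indices of the h hub data points of Definition 1."""
    # k0-NN adjacency, symmetrized so that the graph G0 = (X, E) is undirected.
    A = kneighbors_graph(X, n_neighbors=k0, mode="connectivity")
    A = ((A + A.T) > 0).astype(int)
    degrees = np.asarray(A.sum(axis=1)).ravel()   # |N_j| for each node x_j
    hub_idx = np.argsort(-degrees)[:h]            # nodes with the top-h degrees
    return hub_idx, A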

Algorithm 1 and Figure 1 show the pseudo code and the mechanism of SGSC, respectively. The details of the algorithm are explained below.


Algorithm 1: Q = SGSC(X, k0, k1, h, C)

Input: Unlabeled dataset X = {x_i}_{i=1}^n. Numbers of neighbors k0, k1. Number of hub data points h. Number of clusters C.
Output: The estimated conditional discrete probability distributions of the hub data points, Q.

1: Construct the undirected graph G0 = (X, E), where the edge set E is defined by k0-NN with the Euclidean distance.
2: Build the hub dataset H on graph G0 such that |H| = h. Denote the elements of H by x^(i) (i = 1, ..., h).
3: Define the geodesic metric d_G0 as the shortest-path distance on the graph G0.
4: Define {m_j}_{j=1}^C and U as the two outputs of SC(H, k1, d_G0, w, C), where the weight w(x_i, x_j) is defined by exp(-d_G0(x_i, x_j)^2/σ^2). Then, compute the conditional cluster probability q_{j|(i)} for each hub data point x^(i) in H as follows:

    q_{j|(i)} = (1 + ||x̄^(i) - m_j||_2^2 / γ)^{-(γ+1)/2} / Σ_{j'=1}^C (1 + ||x̄^(i) - m_{j'}||_2^2 / γ)^{-(γ+1)/2},

where γ is a small positive number and x̄^(i) is the i-th row of U.
5: Let q(i) be (q_{1|(i)}, ..., q_{C|(i)}) and let Q be the h × C matrix whose i-th row is q(i).

Figure 1. This figure explains how SGSC works. (a): Given unlabeled data points, in line 2 of Algorithm 1, SGSC computes the hub data points. The hub data points are drawn as star symbols, and the number of hub data points h is ten in this case. (b): In line 4 of the algorithm, SGSC focuses only on the hub data points and conducts SC with the geodesic metric on those hub points, where k1 is set to one. (c): As a result, we obtain the cluster labels of the hub points. The triangle and square symbols denote different labels. Please note that the actual output of SGSC is the set of estimated conditional discrete probability distributions of the hub data points, but we can obtain the estimated cluster labels from these distributions.

Line 1: Given an unlabeled dataset X, the undirected graph G0 is constructed in the k0-nearest neighbor (k0-NN) manner with the Euclidean metric. G0 is used not only for defining the hub set H but also for approximating the geodesic distance on the manifolds shaped by X. We consider k0 as a hyperparameter.

Line 2: Given the number of hub points h and G0, the algorithm defines the hub set H based on Definition 1. Outliers can be excluded from H by setting h appropriately. In this algorithm, h is considered to be a hyperparameter.

Line 3: The geodesic distance is determined from the Euclidean distances on the undirected edges of G0. In Line 4, we need to compute the geodesic distances between the data points of H. Efficient algorithms are available for this purpose [32,33].
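To show how lines 1-3 fit together computationally, the following sketch approximates the geodesic distances from the hub points as shortest-path distances on the Euclidean k0-NN graph, using SciPy's Dijkstra implementation. This is one possible realization, not the authors' code or the specific algorithm of [32,33].

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def hub_geodesic_distances(X, hub_idx, k0):
    """Approximate d_G0 from every hub point to all points (lines 1-3 of Algorithm 1)."""
    # Weighted k0-NN graph: edge weights are Euclidean distances; symmetrize to undirected.
    G0 = kneighbors_graph(X, n_neighbors=k0, mode="distance")
    G0 = G0.maximum(G0.T)
    # Dijkstra started only from the hub nodes; returns an (h, n) matrix of path lengths.
    D_geo = shortest_path(G0, method="D", directed=False, indices=hub_idx)
    return D_geo[:, hub_idx]      # the (h, h) block of hub-to-hub geodesic distances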

Line 4: Given the number of clusters C, we estimate the conditional discrete probability distribution p(y|x^(i)) for each x^(i) ∈ H, where y is the cluster label ranging over {1, ..., C}. The estimated p(y|x^(i)) is denoted by q(i) = (q_{1|(i)}, ..., q_{C|(i)}). This estimation relies on conducting SC with the d_G0 metric only on H. The definition of the weight w in this SC follows Equation (1). The key to a successful estimation is to combine a number of neighbors k1 different from k0 with the geodesic metric d_G0 in the SC. Typically, when the given data points are dense in the input space, the combination of a small number of neighbors and the Euclidean metric makes SC perform well. However, we consider H, which is sparse in the input space; therefore, we employ the above combination. We consider k1 as a hyperparameter as well. Following [34], we compute q_{j|(i)} by using the outputs {m_j}_{j=1}^C and U of SC(H, k1, d_G0, w, C). Please note that q_{j|(i)} can be considered to be the probability that x̄^(i) belongs to cluster j, where x̄^(i) is the low-dimensional representation of x^(i), according to the property of SC [23]. As for γ, we set it to 10^{-60}.

Remark 2. Though we say we estimate the "cluster labels" of hub data points by SGSC, it actually outputs the estimated conditional probability distributions of the hub data points. The reason is that, throughout our preliminary experiments, we observed that employing the Q of line 5 made SEDC perform better than employing one-hot vectors. Such a one-hot vector, for instance, can be defined by using the estimated cluster labels {ŷ^(i)}_{i=1}^h, which are one of the outputs of SC(H, k1, d_G0, w, C).
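For completeness, the soft assignment of line 4 can be computed as in the sketch below, working in log space so that a very small γ does not cause overflow. The rows of U and the centroids come from the SC call of line 4; this is an illustrative implementation, not the authors' code.

import numpy as np

def soft_assignments(U_rows, centroids, gamma):
    """Student's t-style soft assignment q_{j|(i)} of line 4 of Algorithm 1."""
    # Squared distances ||xbar^(i) - m_j||^2 between spectral embeddings and centroids.
    sq = ((U_rows[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # log of (1 + d^2/gamma)^{-(gamma+1)/2}, kept in log space for numerical stability.
    logits = -(gamma + 1.0) / 2.0 * np.log1p(sq / gamma)
    logits -= logits.max(axis=1, keepdims=True)
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)   # each row is q(i) = (q_{1|(i)}, ..., q_{C|(i)})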

3.2. Spectral Embedded Deep Clustering

SEDC is a deep clustering method. Given an unlabeled dataset X = {x_i}_{i=1}^n and the number of clusters C, it groups X into C clusters. As mentioned at the beginning of this section, this method employs the conditional discrete probability distribution p_θ(y|x) as the statistical model, which is defined by a fully connected deep neural network. By using the trained model, we obtain the estimated cluster label of each x_i. This method does not require any additional condition except two fundamental assumptions: the smoothness and manifold assumptions. Therefore, among the methods of Table 1, only SpectralNet is comparable to SEDC in this respect. In addition, our method can be applied to various data domains once the raw data is transformed into feature vectors. Furthermore, empirically speaking, the performance of SEDC can be robust against outliers due to the robustness of SGSC against them. The pseudo code of SEDC is shown in Algorithm 2 and explained below. The procedure of SEDC is composed of two stages. In the first stage, we estimate the conditional discrete probability distributions Q of the hub data points. In the second stage, treating Q as the given true distributions of the hub data points, we conduct semi-supervised learning, in which Q and the remaining unlabeled data points are used to train the statistical model p_θ(y|x). After this training, SEDC returns the estimated cluster label of each x_i ∈ X by argmax_{j=1,...,C} p_θ*(y = j|x_i), where θ* is the trained set of parameters and j ∈ {1, ..., C}. The estimated cluster label of x_i is denoted by ŷ_i. Note that the estimated cluster labels of the hub data points might be updated at the end of the SEDC procedure. In the second stage, we conduct semi-supervised learning to train the statistical model p_θ(y|x) using q(i), i = 1, ..., h. Recall that the model p_θ(y|x) is defined by a deep neural network whose last layer is the soft-max function. The numbers of neurons in the first and last layers are the dimension of the feature vector, D, and the number of clusters, C, respectively. In this semi-supervised learning, we minimize the following loss with respect to θ:

    R_VAT(θ; X) + (λ1/h) Σ_{i=1}^h KL[p_θ(y|x^(i)) || q(i)] + λ2 H(Y|X),   (8)

where λ1 and λ2 are hyperparameters that range over positive numbers. In Equation (8), the first and second terms express the VAT loss of Equation (4) and the pseudo-empirical loss based on the estimated cluster probabilities of the hub data points, respectively. The third term is the conditional Shannon entropy [30] averaged over X, and it is defined as follows:

    H(Y|X) = -(1/n) Σ_{i=1}^n Σ_{j=1}^C p_θ(y = j|x_i) log p_θ(y = j|x_i).
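Putting the three terms together, a minimal PyTorch sketch of the mini-batch objective of Equation (8) is given below. It reuses the vat_loss sketch from Section 2.2.2 and is an illustrative implementation under those same assumptions (model(x) returns soft-max probabilities), not the authors' code.

import torch

def sedc_loss(model, x_batch, x_hub, q_hub, eps, lam1, lam2):
    """Sketch of Equation (8): VAT loss + pseudo-empirical KL on hub points + entropy."""
    # First term: VAT loss R_VAT on the unlabeled mini-batch (see the vat_loss sketch).
    loss_vat = vat_loss(model, x_batch, eps)
    # Second term: (1/h) sum_i KL[p_theta(y|x^(i)) || q(i)] over the hub points.
    p_hub = model(x_hub)
    loss_hub = (p_hub * (torch.log(p_hub + 1e-12) - torch.log(q_hub + 1e-12))).sum(dim=1).mean()
    # Third term: conditional Shannon entropy H(Y|X) averaged over the mini-batch.
    p_batch = model(x_batch)
    entropy = -(p_batch * torch.log(p_batch + 1e-12)).sum(dim=1).mean()
    return loss_vat + lam1 * loss_hub + lam2 * entropy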

Algorithm 2: {ŷ_i}_{i=1}^n = SEDC(X, k0, k1, h, C, λ1, λ2)

Input: Unlabeled dataset X = {x_i}_{i=1}^n. Numbers of neighbors k0, k1. Number of hub data points h. Number of clusters C. Regularization parameters λ1, λ2 > 0.
Output: The estimated cluster labels of X, {ŷ_i}_{i=1}^n.

1: Obtain the h × C matrix Q of estimated conditional discrete probability distributions of the hub data points by computing SGSC(X, k0, k1, h, C) of Algorithm 1. Denote the i-th row of Q by q(i), which is the estimated cluster label probability distribution of hub data point x^(i). The index i ranges over {1, ..., h}.
2: Let p_θ(y|x) be a statistical model, which is the cluster label probability distribution for a given data point x. The cluster label ranges over {1, ..., C}. Define the objective of Equation (8) by using p_θ(y|x), {q(i)}_{i=1}^h and the given λ1, λ2. Then, minimize the objective with respect to θ in a stochastic gradient descent fashion. Denote the optimized parameter by θ*.
3: Obtain the estimated cluster labels of all data points in X by using the trained classifier p_θ*(y|x). Denote p_θ*(y = j|x_i) by p*_{j|i}. Then, for each data point index i, compute ŷ_i = argmax_j p*_{j|i}.

We use the Adam optimizer [35] for the minimization. After minimizing Equation (8), we estimate the labels of X by using the trained parameter θ*. Let {ŷ_i}_{i=1}^n denote the estimated cluster labels of X = {x_i}_{i=1}^n. The labels are computed as ŷ_i = argmax_{j=1,...,C} p_θ*(y = j|x_i). As mentioned in Section 2.2, minimizing the VAT loss encourages p_θ(y|x) to follow the smoothness assumption. In addition, minimizing the entropy loss helps the model follow the cluster assumption [18]. The cluster assumption says that the true decision boundary is not located in regions of the input space that are densely populated with data points. The entropy loss is commonly used in many studies [15,26,36,37]. Please note that the entropy loss is defined only by using the unlabeled data points, like the VAT loss. With regard to the pseudo-empirical loss, we could consider other candidates such as the cross entropy. The reason we chose the KL divergence is that we observed, in our preliminary experiments, that it made SEDC perform better than the other candidates.
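A minimal training loop for lines 2 and 3 of Algorithm 2 might look as follows: mini-batch optimization of the sedc_loss sketch above with Adam, followed by the arg-max labeling. This is a sketch under the earlier assumptions; the epoch count, batch size, learning rate, and the scalar ε (the adaptive ε of Remark 1 could be used instead) are illustrative.

import torch

def train_sedc(model, X, hub_idx, Q, eps, lam1, lam2, epochs=50, batch_size=256, lr=1e-3):
    """Sketch of lines 2-3 of Algorithm 2: minimize Equation (8), then assign labels."""
    X = torch.as_tensor(X, dtype=torch.float32)
    x_hub = X[hub_idx]
    q_hub = torch.as_tensor(Q, dtype=torch.float32)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    n = X.shape[0]
    for _ in range(epochs):
        perm = torch.randperm(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            loss = sedc_loss(model, X[idx], x_hub, q_hub, eps, lam1, lam2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    with torch.no_grad():
        y_hat = model(X).argmax(dim=1)   # line 3: hat{y}_i = argmax_j p*_{j|i}
    return y_hat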

3.3. Computational and Space Complexity of SEDC

The procedure of SEDC is composed of two stages. The first stage (line 1 of Algorithm 2) conducts the SGSC algorithm. The second stage (line 2 of Algorithm 2) trains the model by optimizing Equation (8). Therefore, the total computational complexity of SEDC is the sum of the total computational complexity of SGSC and the complexity consumed in the mini-batch training.

In the following, we analyze the computational complexity consumed in each line of Algorithm 1; see the summary in Table 2. Suppose that h, k0, k1 ≪ n. In line 1 of the algorithm, we consume O(Dn^2) to construct the k0-NN graph [38], where D is the dimension of the feature vector. In line 2, we consume O(n log n), because we sort the nodes of the graph in descending order of degree to define the hub set H. In line 3, we consume O(h(log n + k0)n) for computing the graph shortest-path distances between the data points in H; see Algorithm 1 of [32]. Thereafter, in line 4, we consume O(h^3) for solving the eigenvector problem of the Laplacian matrix.

Table 2. The computational complexity of each line of Algorithm 1 (the SGSC algorithm), where D, n, h and k0 are the dimension of the feature vector, the number of given unlabeled data points, the number of hub data points and the number of neighbors, respectively.

Line of Algorithm 1                    Line 1     Line 2        Line 3               Line 4
Theoretical computational complexity   O(Dn^2)    O(n log n)    O(h(log n + k0)n)    O(h^3)

As for the memory complexity, since the dominant factors are storing the k0-NN graph and the model, SEDC needs O(max{k0 n, |θ|}), where θ is the set of parameters of the deep neural network.

Remark 3. For most deep clustering methods relying on k-NN graph construction, the dominant factor in the total computational complexity is the k-NN graph construction, i.e., we need O(Dn^2). However, according to [39,40], by constructing an approximate k-NN graph, we only need O(Dn log n) for the construction.

4. Numerical Experiments

In this section, we show the results of our numerical experiments. First, we show how accurately SGSC can estimate the labels of hub data points on five datasets. Then, we show the clustering results of SEDC on the same five datasets. With regard to the clustering experiments, we compare our proposed method with five popular methods: k-means [3], SC [7], DEC [12], IMSAT [15] and SpectralNet [17].

4.1. Datasets and Evaluation Metric

We conducted experiments on five datasets. Two of them are real-world datasets named MNIST [41] and Reuters-10k [42]. The other three are synthetic datasets named FC, TM, and TR. A summary of the dataset statistics is


