IEEE Transactions on Knowledge and Data Engineering

Y-Graph: A Max-Ascent-Angle Graph for Detecting Clusters
Junyi Guan, Sheng Li, Xiongxiong He, Jiajia Chen, Yangyang Zhao, Yuxuan Zhang
Keywords: Clustering algorithms; Merging; Buildings; Shape; Couplings; Resource management; Costs; Tuning; Symmetric matrices; Standards; Complex-shaped clusters; normalized-cut; graph clustering
Abstract: Graph clustering techniques are highly effective at detecting complex-shaped clusters, and graph building is a crucial step in them. Nevertheless, building a reasonable graph that exhibits high connectivity within clusters and low connectivity across clusters is challenging. Herein, we design a max-ascent-angle graph called the “Y-graph”, a highly sparse graph that automatically allocates dense edges within clusters and sparse edges across clusters, regardless of their shapes or dimensionality. In the graph, every point $x$ is allowed to connect to its nearest higher-density neighbor $\delta$ and to another higher-density neighbor $\gamma$ such that the angle $\angle \delta x\gamma$ is the largest, called the “max-ascent-angle”. By seeking the max-ascent-angle, points are automatically connected into the Y-graph, a reasonable graph that can effectively balance intra-cluster connectivity and inter-cluster non-connectivity. In addition, an edge weight function is designed to capture the similarity of the neighbor probability distributions, which effectively represents the density connectivity between points. By employing the Normalized-Cut (Ncut) technique, an Ncut-Y algorithm is proposed. Benefiting from the excellent performance of the Y-graph, Ncut-Y can quickly seek and cut the edges located in the low-density boundaries between clusters, thereby capturing clusters effectively. Experimental results on both synthetic and real datasets demonstrate the effectiveness of Y-graph and Ncut-Y.
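The construction rule lends itself to a compact sketch. Below is a minimal, illustrative Python implementation of the edge-selection step as the abstract describes it; the k-NN density estimate and the handling of density peaks are our assumptions, not details from the paper.

```python
# Illustrative sketch of Y-graph edge construction (assumptions noted inline).
import numpy as np
from scipy.spatial.distance import cdist

def y_graph_edges(X, k=10):
    """For each point x, connect its nearest higher-density neighbor delta,
    plus the higher-density neighbor gamma maximizing angle(delta, x, gamma)."""
    n = len(X)
    D = cdist(X, X)
    # Density proxy (assumption): inverse of the mean distance to the k nearest neighbors.
    density = 1.0 / (np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1) + 1e-12)
    edges = []
    for x in range(n):
        higher = np.where(density > density[x])[0]
        if higher.size == 0:                        # density peak: no ascending edge
            continue
        delta = higher[np.argmin(D[x, higher])]     # nearest higher-density neighbor
        edges.append((x, delta))
        v1 = X[delta] - X[x]
        best, best_angle = None, -1.0
        for g in higher:                            # seek the max-ascent-angle
            if g == delta:
                continue
            v2 = X[g] - X[x]
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
            angle = np.arccos(np.clip(cos, -1.0, 1.0))
            if angle > best_angle:
                best, best_angle = g, angle
        if best is not None:
            edges.append((x, best))
    return edges
```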
TGformer: A Graph Transformer Framework for Knowledge Graph Embedding
Fobo Shi, Duantengchuan Li, Xiaoguang Wang, Bing Li, Xindong Wu
Keywords: Knowledge graphs; Transformers; Predictive models; Context modeling; Semantics; Tensors; Periodic structures; Graph neural networks; Data models; Parallel processing; Knowledge graph; graph transformer; link prediction
Abstract: Knowledge graph embedding is an efficient method for reasoning over known facts and inferring missing links. Existing methods are mainly triplet-based or graph-based. Triplet-based approaches learn the embedding of missing entities from a single triple only, ignoring the fact that a knowledge graph is essentially a graph structure. Graph-based methods consider graph structure information but ignore the contextual information of nodes in the knowledge graph, making them unable to discern valuable entity (relation) information. In response to these limitations, we propose a general graph transformer framework for knowledge graph embedding (TGformer). It is the first to use a graph transformer to build knowledge embeddings with triplet-level and graph-level structural features in both static and temporal knowledge graphs. Specifically, a context-level subgraph is constructed for each predicted triplet, which models the relations between triplets sharing the same entity. Afterward, we design a knowledge graph transformer network (KGTN) to fully explore multi-structural features in knowledge graphs, including triplet-level and graph-level features, helping the model understand entities (relations) in different contexts. Finally, semantic matching is adopted to select the entity with the highest score. Experimental results on several public knowledge graph datasets show that our method achieves state-of-the-art performance in link prediction.
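As a rough illustration of the context-level idea, the following PyTorch sketch encodes a head entity together with neighbors drawn from a context subgraph and scores every entity as a candidate tail via semantic matching. The class name, layer sizes, and sequence layout are illustrative assumptions, not the paper's KGTN.

```python
# Hedged sketch: contextualize a query (head, relation) with subgraph neighbors,
# then score all entities as the missing tail by semantic matching.
import torch
import torch.nn as nn

class TinyKGTransformer(nn.Module):
    def __init__(self, n_entities, n_relations, dim=64, heads=4):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, head, relation, context_entities):
        # Sequence = [head, relation, neighbors from the context subgraph].
        seq = torch.cat([self.ent(head), self.rel(relation),
                         self.ent(context_entities)], dim=1)
        h = self.encoder(seq)[:, 0]          # contextualized head representation
        return h @ self.ent.weight.T         # semantic-matching scores per entity

model = TinyKGTransformer(n_entities=1000, n_relations=20)
scores = model(torch.randint(1000, (2, 1)), torch.randint(20, (2, 1)),
               torch.randint(1000, (2, 8)))  # scores: (2, 1000)
```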
Task-Aware Data Selectivity in Pervasive Edge Computing Environments
Athanasios Koukosias, Christos Anagnostopoulos, Kostas Kolomvatsos
Keywords: Filters; Peer-to-peer computing; Sensors; Distributed databases; Internet of Things; Data models; Uncertainty; Real-time systems; Predictive models; Edge computing; data filter; data selectivity; one-class support vector machines; fuzzy inference
Abstract: Context-aware data selectivity in Edge Computing (EC) requires nodes to efficiently manage the data collected from Internet of Things (IoT) devices, e.g., sensors, to support real-time, data-driven pervasive analytics. Data selectivity at the network edge copes with the challenge of deciding which data should be kept at the edge for future analytics tasks under limited computational and storage resources. Our challenge is to efficiently learn the access patterns of data-driven tasks (analytics) and predict which data are relevant and thus should be stored in nodes' local datasets. Task patterns directly indicate which data need to be accessed and processed to support end-users' applications. We introduce a task workload-aware mechanism that adopts one-class classification to learn and predict the relevant data requested by past tasks. The inherent uncertainty in learning task patterns, identifying inliers, and eliminating outliers is handled by a lightweight fuzzy inference estimator that dynamically adapts nodes' local data filters, ensuring accurate data relevance prediction. We analytically describe our mechanism and comprehensively evaluate it against baselines and approaches from the literature, showcasing its applicability in pervasive EC.
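The one-class classification step can be illustrated in a few lines of scikit-learn. This sketch fits a one-class SVM on data accessed by past tasks and keeps only predicted inliers; the fixed nu parameter is a stand-in for the paper's fuzzy-inference-driven filter adaptation, which is not modelled here.

```python
# Sketch (assumption): a workload-aware filter that stores only records the
# one-class model judges similar to data requested by past tasks.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
past_task_data = rng.normal(loc=0.0, scale=1.0, size=(500, 3))  # data accessed by past tasks
incoming = rng.normal(loc=0.0, scale=3.0, size=(100, 3))        # new sensor readings

ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(past_task_data)
keep = incoming[ocsvm.predict(incoming) == 1]  # inliers -> store in the local dataset
print(f"kept {len(keep)}/{len(incoming)} records at the edge node")
```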
Survey and Benchmark of Anomaly Detection in Business Processes
Wei Guan, Jian Cao, Haiyan Zhao, Yang Gu, Shiyou Qian
Keywords: Business; Anomaly detection; Surveys; Benchmark testing; Process control; Streams; Process mining; Source coding; Organizations; Measurement; business process; event log quality; process mining
Abstract: Effective management of business processes is crucial for organizational success. However, despite meticulous design and implementation, anomalies are inevitable and can result in inefficiencies, delays, or even significant financial losses. Numerous methods for detecting anomalies in business processes have been proposed recently, but there is no comprehensive benchmark to evaluate them; consequently, the relative merits of each method remain unclear due to differences in experimental setup, choice of datasets, and evaluation measures. In this paper, we present a systematic literature review and taxonomy of business process anomaly detection methods. Additionally, we select at least one method from each category, resulting in 16 methods that are cross-benchmarked against 32 synthetic logs and 19 real-life logs from different industry domains. Our analysis provides insights into the strengths and weaknesses of the different anomaly detection methods. Ultimately, our findings can help researchers and practitioners in the field of process mining make informed decisions when selecting and applying anomaly detection methods to real-life business scenarios. Finally, some future directions are discussed to promote the evolution of business process anomaly detection.
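The flavour of such a cross-benchmark can be conveyed with a small sketch: several off-the-shelf detectors scored on one labelled log with one shared measure. The detectors, synthetic data, and metric below are illustrative stand-ins, not the 16 methods or 51 logs from the survey.

```python
# Sketch (assumption): compare anomaly detectors on the same labelled traces
# using one shared evaluation measure (average precision).
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

X = np.random.default_rng(1).normal(size=(300, 8))  # encoded process traces
y = np.zeros(300)
y[:15] = 1                                          # injected anomaly labels
X[:15] += 4.0                                       # shift anomalous traces

normality_scores = {
    "IsolationForest": IsolationForest(random_state=0).fit(X).score_samples(X),
    "LOF": LocalOutlierFactor().fit(X).negative_outlier_factor_,
}
for name, normality in normality_scores.items():
    # Negate: both scorers return higher values for more normal traces.
    print(name, round(average_precision_score(y, -normality), 3))
```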
Scalable Semi-Supervised Clustering via Structural Entropy With Different Constraints
Guangjie Zeng, Hao Peng, Angsheng Li, Jia Wu, Chunyang Liu, Philip S. Yu
Keywords: Entropy; Scalability; Vegetation; Encoding; Data analysis; Clustering methods; Clustering algorithms; Accuracy; Complexity theory; Similarity learning; Biological data analysis; scalable clustering; semi-supervised clustering; structural entropy
Abstract: Semi-supervised clustering leverages prior information in the form of constraints to achieve higher-quality clustering outcomes. However, most existing methods struggle with large-scale datasets owing to their high time and space complexity. Moreover, they have difficulty seamlessly integrating various constraints, which limits their applicability. In this paper, we present Scalable Semi-supervised clustering via Structural Entropy (SSSE), a novel method that handles large-scale datasets with different types of constraints from diverse sources to perform both semi-supervised partitioning and hierarchical clustering, and that is fully explainable compared to deep learning-based methods. Specifically, we design objectives based on structural entropy that integrate constraints for semi-supervised partitioning and hierarchical clustering. To achieve scalability with respect to data size, we develop efficient algorithms based on graph sampling that reduce the time and space complexity. To generalize across constraint types, we formulate a unified view of the widely used pairwise and label constraints. Extensive experiments on real-world clustering datasets at different scales demonstrate the superiority of SSSE in clustering accuracy and scalability with different constraints. Additionally, cell clustering experiments on single-cell RNA-seq datasets demonstrate the utility of SSSE for biological data analysis.
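For readers unfamiliar with the objective, the sketch below computes the classic two-dimensional structural entropy of a graph partition, the quantity that structural-entropy-based clustering objectives build on; SSSE's constraint terms and sampling machinery are not modelled.

```python
# Sketch: two-dimensional structural entropy H^P(G) of a partition P, in bits.
import math
import networkx as nx

def structural_entropy(G, partition):
    """partition: list of disjoint sets of nodes covering G."""
    two_m = 2 * G.number_of_edges()
    H = 0.0
    for block in partition:
        vol = sum(G.degree(v) for v in block)                      # block volume
        cut = sum(1 for u, v in G.edges()                          # edges leaving block
                  if (u in block) != (v in block))
        H -= (cut / two_m) * math.log2(vol / two_m)                # inter-block term
        for v in block:
            d = G.degree(v)
            H -= (d / two_m) * math.log2(d / vol)                  # intra-block term
    return H

G = nx.karate_club_graph()
blocks = [{n for n in G if G.nodes[n]["club"] == c}
          for c in ("Mr. Hi", "Officer")]
print(round(structural_entropy(G, blocks), 3))
```

Lower values indicate a partition that better compresses the graph's structure, which is why such objectives favour well-separated clusters.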
PUMA: Efficient Continual Graph Learning for Node Classification With Graph Condensation
Yilun Liu, Ruihong Qiu, Yanran Tang, Hongzhi Yin, Zi Huang
Keywords: Training; Graph neural networks; Memory management; Representation learning; Continuing education; Message passing; Data models; Optimization; Knowledge engineering; Computer vision; Continual graph learning (CGL); graph neural networks (GNNs); graph condensation
Abstract: When handling streaming graphs, existing graph representation learning models encounter a catastrophic forgetting problem: previously learned knowledge is easily overwritten when learning from newly incoming graphs. In response, Continual Graph Learning (CGL) has emerged as a paradigm extending graph representation learning from static to streaming graphs. Our prior work, Condense and Train (CaT) (Liu et al., 2023), is a replay-based CGL framework with a balanced continual learning procedure, which designs a small yet effective memory bank for replaying data by condensing incoming graphs. Although CaT alleviates the catastrophic forgetting problem, three issues remain: (1) the graph condensation algorithm in CaT focuses only on labelled nodes and neglects the abundant information carried by unlabelled nodes; (2) the continual training scheme of CaT overemphasises previously learned knowledge, limiting the model's capacity to learn from newly added memories; (3) both the condensation and replaying processes of CaT are time-consuming. In this paper, we propose the PsUdo-label guided Memory bAnk (PUMA) CGL framework, which extends CaT to enhance its efficiency and effectiveness by overcoming the above weaknesses. To fully exploit the information in a graph, PUMA expands the coverage of nodes during graph condensation to both labelled and unlabelled nodes. Furthermore, a training-from-scratch strategy is proposed to upgrade the previous continual learning scheme for balanced training between the historical and new graphs. In addition, PUMA uses one-time propagation and wide graph encoders to accelerate graph condensation and graph encoding during training, improving the efficiency of the whole framework. Extensive experiments on seven datasets for the node classification task demonstrate state-of-the-art performance and efficiency compared with existing methods.
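A heavily simplified sketch of the pseudo-label idea: model predictions cover the unlabelled nodes, and a small synthetic node set is optimized so its class-wise feature means match the incoming graph's. The distribution-matching loss and all hyperparameters are assumptions; PUMA's actual condensation objective is more involved.

```python
# Sketch (assumption): pseudo-label guided feature condensation by class-mean matching.
import torch

def condense(features, labels, logits, mask_labelled, per_class=10):
    pseudo = logits.argmax(dim=1)              # pseudo-labels from a trained model
    y = labels.clone()
    y[~mask_labelled] = pseudo[~mask_labelled] # cover unlabelled nodes too
    classes = y.unique()
    synth_x = torch.randn(len(classes) * per_class, features.size(1),
                          requires_grad=True)  # learnable condensed features
    synth_y = classes.repeat_interleave(per_class)
    opt = torch.optim.Adam([synth_x], lr=0.01)
    for _ in range(200):                       # match class-wise feature means
        loss = sum(((features[y == c].mean(0) -
                     synth_x[synth_y == c].mean(0)) ** 2).sum()
                   for c in classes)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return synth_x.detach(), synth_y           # condensed memory for replay
```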
PURE: Personality-Coupled Multi-Task Learning Framework for Aspect-Based Multimodal Sentiment Analysis
Puning Zhang, Miao Fu, Rongjian Zhao, Hongbin Zhang, Changchun Luo
Keywords: Feature extraction; Sentiment analysis; Analytical models; Multitasking; Psychology; Data mining; Representation learning; Adaptation models; Accuracy; Visualization; Attention-based fusion; big five model; multimodal sentiment analysis; personality prediction
Abstract: Aspect-Based Multimodal Sentiment Analysis (ABMSA) aims to infer users' sentiment polarities over individual aspects using visual, textual, and acoustic signals. Although psychological studies have shown that personality has a direct impact on people's sentiment orientations, most existing methods disregard potential personality characteristics when executing ABMSA tasks. To tackle this issue, a novel psychological perspective, people's personalities, is introduced; to the best of our knowledge, this paper is the first study in this field. Unlike current pipelined multi-task sentiment analysis methods, we propose an end-to-end ABMSA method called the Personality-coupled mUlti-task leaRning framEwork (PURE), which strongly couples personality mining and ABMSA in a unified architecture to avoid error propagation and enhance overall system robustness. Specifically, an adaptive personality feature extraction method is designed to accurately model the first impression of different people's personalities. Then, a multi-task ABMSA framework is designed to strongly couple the multimodal aspect features extracted by an interactive attention fusion network with people's personalities. The proposed framework optimizes the two tasks in parallel via extended Bayesian meta-learning. Finally, compared with the current best model, the classification accuracy and macro F1 score of the proposed model both show significant improvements on public datasets. In addition, PURE is transferable and can effectively couple personality modeling with other sentiment analysis methods.
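The coupling can be pictured with a small PyTorch sketch: a shared encoder feeds a Big-Five personality head whose output is concatenated into the sentiment head, so both tasks train end-to-end. Layer sizes and the concatenation scheme are illustrative assumptions, not PURE's architecture.

```python
# Sketch (assumption): personality prediction coupled into sentiment classification.
import torch
import torch.nn as nn

class PersonalityCoupledModel(nn.Module):
    def __init__(self, in_dim=768, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.personality = nn.Linear(hidden, 5)      # Big Five trait scores
        self.sentiment = nn.Linear(hidden + 5, 3)    # negative / neutral / positive

    def forward(self, fused_features):
        h = self.encoder(fused_features)
        traits = torch.sigmoid(self.personality(h))
        # Couple: the sentiment head sees both the encoding and predicted traits.
        polarity = self.sentiment(torch.cat([h, traits], dim=-1))
        return traits, polarity

model = PersonalityCoupledModel()
traits, polarity = model(torch.randn(4, 768))  # batch of fused aspect features
```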
MPM: Multi Patterns Memory Model for Short-Term Time Series Forecasting
Dezheng Wang, Rongjie Liu, Congyan Chen, Shihua Li
Keywords: Time series analysis; Forecasting; Predictive models; Transformers; Long short term memory; Convolution; Computational modeling; Feature extraction; Data models; Attention mechanisms; Asymmetric convolution; multi patterns memory model; spatial-temporal dependency; time series forecasting
Abstract: Short-term time series forecasting is pivotal in various scientific and industrial fields. Recent advancements in deep learning-based technologies have significantly improved the efficiency and accuracy of short-term time series modeling. Nevertheless, current short-term time series forecasting methods typically emphasize modeling dependencies across time stamps but frequently overlook inter-variable dependencies, which are crucial for multivariate forecasting. To fill this gap, we propose a multi-patterns memory model that discovers various dependency patterns for short-term multivariate time series forecasting. The proposed model is structured around two key components: a short-term memory block and a long-term memory block. Both are characterized by their use of asymmetric convolution, each tailored to capture different spatial-temporal dependencies in the data. Experimental results show that the proposed model achieves competitive performance compared with other time series forecasting methods across five benchmark datasets, likely owing to the asymmetric structure, which can effectively extract the various underlying spatial-temporal dependencies in the data.
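The asymmetric-convolution idea is easy to make concrete: a (1, k) kernel convolves along the time axis within each variable, while a (k, 1) kernel convolves across variables at each time stamp. The sketch below is an assumption-laden miniature, not the paper's memory blocks.

```python
# Sketch: asymmetric convolutions separating temporal and inter-variable patterns.
import torch
import torch.nn as nn

class AsymmetricBlock(nn.Module):
    def __init__(self, channels=16, k=3):
        super().__init__()
        # (1, k): scan along time within each variable.
        self.temporal = nn.Conv2d(1, channels, kernel_size=(1, k), padding=(0, k // 2))
        # (k, 1): scan across variables at each time stamp.
        self.variable = nn.Conv2d(1, channels, kernel_size=(k, 1), padding=(k // 2, 0))

    def forward(self, x):             # x: (batch, variables, time)
        x = x.unsqueeze(1)            # add a channel axis for Conv2d
        return torch.relu(self.temporal(x)) + torch.relu(self.variable(x))

block = AsymmetricBlock()
out = block(torch.randn(8, 7, 96))   # 7 variables, 96 time steps
print(out.shape)                     # torch.Size([8, 16, 7, 96])
```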
Learning Road Network Index Structure for Efficient Map Matching
Zhidan Liu, Yingqian Zhou, Xiaosi Liu, Haodi Zhang, Yabo Dong, Dongming Lu, Kaishun Wu
Keywords: Hidden Markov models; Roads; Global Positioning System; Trajectory; Indexes; Accuracy; Viterbi algorithm; Computational modeling; Predictive models; Data models; GPS trajectory; hidden Markov model; learned index; map matching; road network
Abstract: Map matching aims to align GPS trajectories to their actual travel routes on a road network, an essential pre-processing task for most trajectory-based applications. Many map matching approaches use the Hidden Markov Model (HMM) as their backbone. Typically, HMM treats the GPS samples of a trajectory as observations and nearby road segments as hidden states. During map matching, HMM determines candidate states for each observation within a fixed searching range and computes the most likely travel route using the Viterbi algorithm. Although HMM-based approaches achieve high matching accuracy, they suffer from high computation overheads. By inspecting the HMM process, we find that the computation bottleneck mainly comes from improper candidate sets, which contain many irrelevant candidates and incur unnecessary computations. In this paper, we present $\mathtt{LiMM}$, a learned road network index structure for efficient map matching. $\mathtt{LiMM}$ improves existing HMM-based approaches in two ways. First, we propose a novel learned index for road networks that considers the characteristics of road data. Second, we devise an adaptive searching range mechanism that dynamically adjusts the searching range for GPS samples based on their locations. As a result, $\mathtt{LiMM}$ provides refined candidate sets for GPS samples and thus accelerates the map matching process. Extensive experiments are conducted on three large real-world GPS trajectory datasets. The results demonstrate that $\mathtt{LiMM}$ significantly reduces computation overheads, achieving an average speedup of $11.7\times$ over baseline methods with only a slight accuracy loss of 1.8%.
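For context, the HMM core that such systems accelerate looks roughly like the following: Gaussian emissions on GPS-to-segment distance, transitions penalizing the gap between route distance and straight-line distance (Newson and Krumm-style weighting), and Viterbi decoding. The parameter values follow common map-matching practice and are assumptions; the learned index itself is not modelled.

```python
# Sketch of the HMM map-matching core; SIGMA and BETA are conventional values, not LiMM's.
import numpy as np

SIGMA, BETA = 4.07, 0.5

def emission(gps_to_segment_m):
    # Likelihood of observing a GPS fix this far from a candidate segment.
    return np.exp(-0.5 * (gps_to_segment_m / SIGMA) ** 2)

def transition(route_dist_m, straight_dist_m):
    # Favour candidate pairs whose route distance matches the straight-line gap.
    return np.exp(-abs(route_dist_m - straight_dist_m) / BETA) / BETA

def viterbi(emissions, transitions):
    """emissions: (T, S) likelihoods; transitions: (T-1, S, S). Returns state path."""
    T, S = emissions.shape
    logp = np.log(emissions[0] + 1e-300)
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logp[:, None] + np.log(transitions[t - 1] + 1e-300)
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0) + np.log(emissions[t] + 1e-300)
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Shrinking the candidate set per observation shrinks S, and since each Viterbi step costs O(S²), this is exactly where refined candidates pay off.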
Information-Oriented Random Walks and Pipeline Optimization for Distributed Graph Embedding
Peng Fang, Zhenli Li, Arijit Khan, Siqiang Luo, Fang Wang, Zhan Shi, Dan Feng
Keywords: Task analysis; Scalability; Computational modeling; Vectors; Synchronization; Servers; Pipelines; Distributed graph embedding; information-centric; pipeline execution
Abstract: Graph embedding maps graph nodes to low-dimensional vectors and is widely used in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, such as link prediction on Twitter with over one billion edges. Most existing graph embedding methods fall short of reaching high data scalability. In this paper, we present a general-purpose, distributed, information-centric, random walk-based, and pipeline-optimized graph embedding framework, $\sf{DistGER-Pipe}$, which scales to embed billion-edge graphs. $\sf{DistGER-Pipe}$ incrementally computes information-centric random walks to reduce redundant computations, yielding more effective and efficient graph embedding. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy, simultaneously achieving high local partition quality and excellent workload balancing across machines. $\sf{DistGER-Pipe}$ also improves the distributed $\sf{Skip-Gram}$ learning model to generate node embeddings by optimizing access locality, CPU throughput, and synchronization efficiency. Finally, $\sf{DistGER-Pipe}$ designs a pipelined execution that decouples the operators in the sampling and training procedures with inter-round serial and intra-round parallel processing, attaining optimal utilization of computing resources. Experiments on real-world graphs demonstrate that, compared to state-of-the-art distributed graph embedding frameworks, including $\sf{KnightKing}$, $\sf{DistDGL}$, $\sf{Pytorch-BigGraph}$, and $\sf{DistGER}$, $\sf{DistGER-Pipe}$ exhibits 3.15×–1053× acceleration, 45% reduction in cross-machine communication, >10% effectiveness improvement in downstream tasks, and 38% enhancement in CPU utilization.
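A single-machine sketch of the underlying random-walk-plus-Skip-Gram pipeline (in the spirit of DeepWalk) is shown below; the information-centric walk computation, partitioning, and pipelined execution that define $\sf{DistGER-Pipe}$ are deliberately not modelled.

```python
# Sketch: sample uniform random walks, then train Skip-Gram on them with gensim.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, length=40, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in G.nodes():
            walk = [start]
            while len(walk) < length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(v) for v in walk])  # Word2Vec expects string tokens
    return walks

G = nx.karate_club_graph()
model = Word2Vec(random_walks(G), vector_size=64, window=5, sg=1, min_count=1)
print(model.wv[str(0)].shape)  # 64-dimensional embedding of node 0
```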