IEEE Transactions on Knowledge and Data Engineering

Archived Papers: 827
Web-FTP: A Feature Transferring-Based Pre-Trained Model for Web Attack Detection
Zhenyu Guo, Qinghua Shang, Xin Li, Chengyi Li, Zijian Zhang, Zhuo Zhang, Jingjing Hu, Jincheng An, Chuanming Huang, Yang Chen, Yuguang Cai
Keywords: Analytical models, Accuracy, Supervised learning, Cyberspace, Feature extraction, Data models, Cryptography, Detection Model, Web Attacks, General Characteristics, Supervised Learning, Feature Learning, Real-world Systems, Pre-processing Module, Ability Of The Model, Dataset Size, Types Of Attacks, Traffic Data, Denial Of Service, Malware, Code Blocks, Real-world Performance, Position Embedding, Feature Coding, Web Data, Injection Attacks, Uniform Resource Locator, Code Search, Tokenized, Token Embedding, Pre-training Dataset, Pre-training Data, Network Attacks, Common Attacks, HTTP Requests, Fine-tuned Model, Web attack detection, pre-trained model, transfer learning
Abstract: Web attacks are a major threat to cyberspace security, so web attack detection models have become a critical task. Traditional supervised learning methods learn the features of web attacks from large amounts of high-confidence labeled data, which are extremely expensive to obtain in the real world. Pre-trained models offer a novel solution with their ability to learn generic features on large unlabeled datasets. However, designing and deploying a pre-trained model for real-world web attack detection remains challenging. In this paper, we present a pre-trained model for web attack detection, including a pre-processing module, a pre-training module, and a deployment scheme. Our model significantly improves classification performance on several web attack detection datasets. Moreover, we deploy the model in real-world systems and show its potential for industrial applications.
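The pre-processing step such a model relies on typically turns raw HTTP requests into token sequences before pre-training. Below is a minimal, illustrative Python sketch of that kind of tokenization; the decoding, number-collapsing, and splitting rules are assumptions for illustration, not the paper's actual module.

```python
# A hypothetical pre-processing step for web attack detection: decode a raw
# request line and split it into generic tokens for a pre-trained model.
import re
from urllib.parse import unquote

def tokenize_request(raw: str) -> list[str]:
    """Decode a raw URL/request string and split it into generic tokens."""
    decoded = unquote(raw).lower()          # undo %-encoding often used to hide payloads
    decoded = re.sub(r"\d+", "0", decoded)  # collapse literal numbers to a placeholder
    return [t for t in re.split(r"[/?&=:;()<>'\" ]+", decoded) if t]

print(tokenize_request("/login.php?user=admin&pass=%27%20OR%201%3D1--"))
# ['login.php', 'user', 'admin', 'pass', 'or', '0', '0--']
```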
UniTE: A Survey and Unified Pipeline for Pre-Training Spatiotemporal Trajectory Embeddings
Yan Lin, Zeyu Zhou, Yicheng Liu, Haochen Lv, Haomin Wen, Tianyi Li, Yushuai Li, Christian S. Jensen, Shengnan Guo, Youfang Lin, Huaiyu Wan
Keywords: Trajectory, Surveys, Pipelines, Training, Computational modeling, Vectors, Spatiotemporal phenomena, Roads, Natural language processing, Deep learning, Spatiotemporal Trajectories, Unified Pipeline, Trajectory Embedding, Real-world Datasets, Development Of New Methods, Mean Absolute Error, Implementation Of Method, Road Network, Individual Trajectories, Variational Autoencoder, Tokenized, Self-supervised Learning, Trajectory Data, Reconstruction Loss, Embedding Vectors, Road Segments, Contrastive Loss, Vehicle Trajectory, Fully-connected Network, Trajectory Features, Autoencoder Framework, Pre-training Process, Pre-trained Embeddings, Masked Language Model, Trajectory Dataset, Raw Features, L1 Loss, Diverse Tasks, Decoding, Latent Space, Spatiotemporal data mining, trajectory embedding, pre-training, self-supervised learning
Abstract: Spatiotemporal trajectories are sequences of timestamped locations that enable a variety of analyses, which in turn support important real-world applications. It is common to map trajectories to vectors, called embeddings, before subsequent analyses, so the quality of embeddings is very important. Methods for pre-training embeddings, which leverage unlabeled trajectories to train universal embeddings, have shown promising applicability across different tasks, thus attracting considerable interest. However, research progress on this topic faces two key challenges: the lack of a comprehensive overview of existing methods, which leaves several related methods under-recognized, and the absence of a unified pipeline, which complicates the development of new methods and the analysis of existing ones. We present UniTE, a survey and a unified pipeline for this domain. We provide a comprehensive list of existing methods for pre-training trajectory embeddings, including methods that either explicitly or implicitly employ pre-training techniques. Further, we present a unified and modular pipeline with publicly available underlying code, simplifying the process of constructing and evaluating methods for pre-training trajectory embeddings. Additionally, we contribute a selection of experimental results obtained with the proposed pipeline on real-world datasets.
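Among the pre-training schemes such a survey covers is the autoencoder framework with an L1 reconstruction loss (both appear in the keywords). The toy PyTorch sketch below pre-trains a trajectory embedding that way; the GRU architecture, dimensions, and random stand-in data are assumptions, not any specific surveyed method.

```python
# A minimal autoencoder-style pre-training sketch: encode a trajectory into
# one embedding vector, decode it back, and train with an L1 reconstruction loss.
import torch
import torch.nn as nn

class TrajAutoencoder(nn.Module):
    def __init__(self, feat_dim=3, emb_dim=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.head = nn.Linear(emb_dim, feat_dim)

    def forward(self, x):                                 # x: (batch, seq_len, feat_dim)
        _, h = self.encoder(x)                            # h: (1, batch, emb_dim)
        z = h.transpose(0, 1).repeat(1, x.size(1), 1)     # feed the embedding at every step
        out, _ = self.decoder(z)
        return self.head(out), h.squeeze(0)

model = TrajAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
traj = torch.rand(8, 20, 3)                               # toy (lat, lon, time) sequences
for _ in range(5):                                        # pre-training loop (truncated)
    recon, emb = model(traj)
    loss = nn.functional.l1_loss(recon, traj)             # L1 reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()
print(emb.shape)                                          # torch.Size([8, 32]): one embedding per trajectory
```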
The Expressive Power of Graph Neural Networks: A Survey
Bingxu Zhang, Changjun Fan, Shixuan Liu, Kuihua Huang, Xiang Zhao, Jincai Huang, Zhong Liu
Keywords: Topology, Feature extraction, Data models, Surveys, Message passing, Graph neural networks, Fans, Encoding, Artificial neural networks, Vectors, Neural Network, Graph Neural Networks, Power Of Neural Networks, Power Graph, Graph Topology, Graph Features, Graph Isomorphism, Structural Information, Feature Information, Global Features, Nodes In The Graph, Infographic, Properties Of Molecules, Graph Structure, Equivalency, Node Features, Topological Information, Aggregation Function, Node Representations, Separation Ability, Message Passing, Graph Neural Network Model, Approximation Ability, Pair Of Graphs, Node Embeddings, Aggregation Operators, Capture Complex, Latent Space, Adjacent Nodes, Invariant Function, Approximation ability, expressive power, graph neural network, separation ability
Abstract: Graph neural networks (GNNs) are effective machine learning models for many graph-related applications. Despite their empirical success, many research efforts focus on the theoretical limitations of GNNs, i.e., their expressive power. Early works in this domain mainly studied the graph isomorphism recognition ability of GNNs, and recent works leverage properties such as subgraph counting and connectivity learning to characterize the expressive power of GNNs, which are more practical and closer to real-world applications. However, no survey paper or open-source repository comprehensively summarizes and discusses models in this important direction. To fill the gap, we conduct the first survey of models for enhancing expressive power under different forms of definition. Concretely, the models are reviewed based on three categories: graph feature enhancement, graph topology enhancement, and GNN architecture enhancement.
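The graph isomorphism recognition ability mentioned above is classically measured against the 1-dimensional Weisfeiler-Leman (1-WL) test, which upper-bounds the separation ability of standard message-passing GNNs. A minimal sketch of 1-WL color refinement follows, with a standard pair of graphs the test (and hence a vanilla message-passing GNN) cannot distinguish; the adjacency-list encoding is an illustrative choice.

```python
# 1-WL color refinement: iteratively re-color each node by hashing its own
# color together with the multiset of its neighbors' colors.
from collections import Counter

def wl_colors(adj, rounds=3):
    """Return the final color histogram after `rounds` of refinement."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
                  for v in adj}
    return Counter(colors.values())

# Two 6-node graphs 1-WL cannot separate: two triangles vs. one 6-cycle
# (both are 2-regular, so every node always receives the same color).
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(wl_colors(two_triangles) == wl_colors(six_cycle))  # True: the test fails here
```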
Self-Correcting Clustering
Hanxuan Wang, Na Lu, Zixuan Wang, Yuxuan Yan, Gustavo Carneiro, Zhen Wang
Keywords: Feature extraction, Training, Clustering methods, Noise measurement, Optimization, Neural networks, Manuals, Distribution functions, Accuracy, Representation learning, Hyperparameters, Clustering Method, Positive Feedback Loop, Cluster Assignment, Target Distribution, Correct Label, Pseudo Labels, Clustering Module, Clustering Framework, Neural Network, Prior Information, Multivariate Methods, Data Augmentation, Representation Learning, Classifier Training, Noisy Data, Asymmetric Distribution, Gaussian Mixture Model, Beta Distribution, Unlabeled Data, Noisy Labels, Early Stage Of Training, Clustering Performance, Pair Of Classes, Curse Of Dimensionality, Edge Clustering, Clustering Task, Manual Design, Semi-supervised Methods, Clean Data, Deep clustering, misassignments, robust target distribution solver, self-correcting
Abstract: The incorporation of a target distribution significantly enhances the success of deep clustering. However, most related deep clustering methods suffer from two drawbacks: (1) manually designed target distribution functions with uncertain performance and (2) accumulation of cluster misassignments. To address these issues, a Self-Correcting Clustering (Self-CC) framework is proposed. In Self-CC, a robust target distribution solver (RTDS) is designed to automatically predict the target distribution and alleviate the adverse influence of misassignments. Specifically, RTDS models the training loss distribution of the high-confidence samples selected according to the cluster assignments predicted by a clustering module, and divides them into labeled samples with correct pseudo labels and unlabeled samples with possible misassignments. With the divided data, RTDS can be trained in a semi-supervised way. The critical hyperparameter that controls the semi-supervised training process can be set adaptively by estimating the distribution property of misassignments in the pseudo-label space, with the support of a theoretical analysis. The target distribution can be predicted by the well-trained RTDS automatically, optimizing the clustering module and correcting misassignments in the cluster assignments. The clustering module and RTDS mutually promote each other, forming a positive feedback loop. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed Self-CC.
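The division of high-confidence samples by modeling their training loss distribution can be illustrated with a common instantiation: fit a two-component mixture to per-sample losses and treat the low-loss component as correctly pseudo-labeled. The keywords mention both Gaussian mixture and Beta distributions; the 2-component GMM, the threshold, and the synthetic losses below are assumptions, not the exact RTDS procedure.

```python
# Split pseudo-labeled samples into "labeled" (likely correct) and "unlabeled"
# (possible misassignments) by modeling the per-sample training loss with a GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 800),   # well-fit (clean) samples
                         rng.normal(1.5, 0.30, 200)])  # misassigned samples
gmm = GaussianMixture(n_components=2, random_state=0).fit(losses.reshape(-1, 1))
clean_comp = np.argmin(gmm.means_.ravel())             # component with the smaller mean loss
p_clean = gmm.predict_proba(losses.reshape(-1, 1))[:, clean_comp]
labeled = p_clean > 0.5                                # treat as correctly pseudo-labeled
print(f"{labeled.sum()} labeled / {(~labeled).sum()} unlabeled for semi-supervised training")
```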
Segmented Sequence Prediction Using Variable-Order Markov Model Ensemble
Weichao Yan, Hao Ma, Zaiyue Yang
Keywords: Predictive models, Hidden Markov models, Data models, Sports, Natural language processing, Analytical models, Transformers, Training, Recurrent neural networks, Graphical models, Sequence Segments, Variable-order Markov Models, Sequencing Data, Neural Network, Deep Learning, Transition State, Recurrent Neural Network, Higher-order Model, Natural Language Processing Tasks, Probabilistic Graphical Models, Training Set, Training Dataset, Predictive Performance, Long Short-term Memory, Weight Vector, Root Node, Nodes In The Graph, Language Model, Entropy Loss, Recommender Systems, Historical Sequence, Maximum Order, Higher-order Networks, Recurrent Neural Network Layer, Long Short-term Memory Model, Sequence Of States, Specific Order, Sequence Dependence, Nodes In Order, Dynamic Bayesian Network, Variable-order Markov model, sequence prediction, probabilistic graphical model, recurrent neural network
Abstract: In recent years, sequence prediction, particularly in natural language processing tasks, has made significant progress thanks to advanced neural network architectures such as the Transformer and enhanced computing power. However, challenges persist in modeling and analyzing certain types of sequence data, such as human daily activities and competitive ball games. These segmented sequence data are characterized by short length, varying local dependencies, and coarse-grained unit states, which limit the effectiveness of conventional probabilistic graphical models and attention-based or recurrent neural networks. To address this gap, we introduce a novel generative model for segmented sequences that employs an ensemble of multiple variable-order Markov models (VOMMs) to flexibly represent state transition dependencies. Our approach integrates probabilistic graphical models with neural networks, surpassing the representation capabilities of single high-order or variable-order Markov models. Compared to end-to-end deep learning models, our method offers improved interpretability and reduces overfitting on short segments. We demonstrate the efficacy of the proposed method in two tasks: predicting tennis shot types and forecasting daily action sequences. These applications highlight the broad applicability of our segmented sequence modeling approach across diverse domains.
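A single variable-order Markov model predicts the next state from the longest context it has seen in training, backing off to shorter contexts otherwise. The sketch below implements one such VOMM as a context-count table; the paper's ensemble of multiple VOMMs and its neural weighting are not reproduced, and the tennis-shot toy data is an assumption.

```python
# A single VOMM: count next-state frequencies for every context up to
# max_order, then predict by backing off to the longest observed context.
from collections import defaultdict, Counter

class VOMM:
    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)   # context tuple -> next-state counts

    def fit(self, sequences):
        for seq in sequences:
            for i, s in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[tuple(seq[i - k:i])][s] += 1

    def predict(self, history):
        # Back off from the longest context to the empty context ().
        for k in range(min(len(history), self.max_order), -1, -1):
            ctx = tuple(history[len(history) - k:])
            if ctx in self.counts:
                return self.counts[ctx].most_common(1)[0][0]

model = VOMM(max_order=2)
model.fit([["serve", "return", "volley"], ["serve", "return", "lob"]])
print(model.predict(["serve", "return"]))   # 'volley' (ties resolved by first-seen order)
```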
Robust and Communication-Efficient Federated Domain Adaptation via Random Features
Zhanbo Feng, Yuanjie Wang, Jie Li, Fan Yang, Jiong Lou, Tiebin Mi, Robert Caiming Qiu, Zhenyu Liao
Keywords: Training, Kernel, Feature extraction, Classification tree analysis, Reliability, Protocols, Data models, Vectors, Federated learning, Transfer learning, Domain Adaptation, Random Feature, Extensive Experiments, Domain Shift, Target Domain, Source Domain, Complex Communication, Single Machine, Communication Overhead, Network Reliability, Federated Learning, Maximum Mean Discrepancy, Data Sources, Deep Neural Network, Transfer Learning, Privacy Protection, Unlabeled Data, Computational Overhead, Target Data, Null Space, Domain Adaptation Methods, Local Feature Extraction, Linear Layer, Source Characteristics, Communication Cost, Reproducing Kernel Hilbert Space, Source Distribution, Common Feature Space, Gaussian Matrix, Algorithmic Approach, Random features, maximum mean discrepancy, kernel method, federated domain adaptation (FDA)
Abstract: Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge. Most existing FDA approaches focus on aligning the distributions of the source and target domains by minimizing their distance, e.g., the maximum mean discrepancy (MMD). Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability. In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical or empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to the FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol has a communication complexity that is independent of the sample size, while maintaining performance that is comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness to network conditions of FedRF-TCA.
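The communication saving comes from approximating kernel statistics with random features: with random Fourier features, a Gaussian-kernel MMD estimate only requires each party to share a fixed-dimensional mean feature vector, independent of its sample count. A minimal NumPy sketch under assumed dimensions and bandwidth (not the FedRF-TCA protocol itself):

```python
# Estimate the Gaussian-kernel MMD with random Fourier features (RFF):
# z(x)·z(y) ≈ exp(-||x - y||² / (2σ²)), so only D-dim feature means are shared.
import numpy as np

def rff(x, W, b):
    """Map samples to D random features of the Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(x @ W + b)

rng = np.random.default_rng(0)
d, D, sigma = 5, 512, 1.0
W = rng.normal(0.0, 1.0 / sigma, (d, D))      # spectral samples of the Gaussian kernel
b = rng.uniform(0.0, 2.0 * np.pi, D)

source = rng.normal(0.0, 1.0, (1000, d))      # source-domain samples
target = rng.normal(0.5, 1.0, (1000, d))      # shifted target domain

# Each party shares only its mean feature vector: D floats, independent of n.
mu_s = rff(source, W, b).mean(axis=0)
mu_t = rff(target, W, b).mean(axis=0)
print("approx MMD^2:", float(np.sum((mu_s - mu_t) ** 2)))
```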
RMD-Graph: Adversarial Attacks Resisting Malicious Domain Detection Based on Dual Denoising
Sanfeng Zhang, Luyao Huang, Zheng Zhang, Wenduan Xu, Wang Yang, Linfeng Liu
Keywords: Autoencoders, Noise reduction, Noise, Contrastive learning, Feature extraction, Domain Name System, IP networks, Data mining, Training, Perturbation methods, Adversarial Attacks, Malicious Domains, Singular Value Decomposition, Dual Mode, Self-supervised Learning, Residual Connection, Reconstruction Loss, Domain Name, Original Graph, Node Representations, Graph-based Models, F1 Score, Detection Performance, Singular Value, Structural Heterogeneity, Decrease In Performance, Domain Features, Graph Structure, SET Domain, Node Features, Heterogeneous Graph, Graph-based Methods, Contrastive Loss, Node Attributes, Attack Success Rate, Node Embeddings, Pre-training Phase, Cross-correlation Matrix, Graph Properties, Connectivity Scores, Malicious domain detection, heterogeneous graph, autoencoder, adversarial attacks, graph structure learning, graph contrastive learning
Abstract: The Domain Name System (DNS) is a critical Internet service that translates domain names into IP addresses, but it is often targeted by attackers, posing a serious security risk. Graph-based models for detecting malicious domains have shown high performance but are vulnerable to adversarial attacks. To address this issue, we propose RMD-Graph, which is characterized by its ability to resist adversarial attacks and its low dependency on labeled data. A dual denoising module is designed based on two autoencoders to generate a reconstructed graph, where SVD, top-k filtering, and a reconstruction loss are introduced to enhance the denoising capability of the autoencoders. Subsequently, residual connections are employed to generate an optimized graph that retains essential information from the original graph. The reconstructed graph and the optimized graph are then used as two views for graph contrastive learning, achieving a self-supervised representation learning task without labels. For downstream malicious domain detection, the denoised node representations are fed to a machine learning classifier. Extensive experiments are conducted on publicly available DNS datasets, and the results demonstrate that RMD-Graph significantly outperforms known baseline methods, especially in adversarial scenarios.
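The SVD component of such denoising can be illustrated with a truncated SVD of the adjacency matrix: adversarially inserted edges tend to be high-frequency and are largely filtered out when only the top-k singular components are kept. The toy two-community graph and the choice k=2 below are assumptions for illustration, not RMD-Graph's full dual-autoencoder module.

```python
# Low-rank graph denoising: rebuild the adjacency matrix from its top-k
# singular components, damping adversarial edge perturbations.
import numpy as np

def svd_denoise(adj, k=2):
    """Reconstruct the adjacency matrix from its top-k singular components."""
    U, S, Vt = np.linalg.svd(adj, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

block = np.ones((5, 5)) - np.eye(5)
clean = np.block([[block, np.zeros((5, 5))],      # two dense communities
                  [np.zeros((5, 5)), block]])
noisy = clean.copy()
noisy[0, 9] = noisy[9, 0] = 1.0                   # an adversarially inserted cross-community edge
denoised = svd_denoise(noisy, k=2)
print(round(denoised[0, 9], 2), round(denoised[0, 1], 2))
# inserted edge weight drops well below a true intra-community edge (~0.8)
```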
Nowhere to H2IDE: Fraud Detection From Multi-Relation Graphs via Disentangled Homophily and Heterophily Identification
Chao Fu, Guannan Liu, Kun Yuan, Junjie Wu
Keywords: Fraud, Reviews, Feature extraction, Disentangled representation learning, Scalability, Image edge detection, Graph neural networks, Aggregates, Training, Topology, Fraud Detection, Extensive Experiments, Attention Mechanism, Representation Learning, Baseline Methods, Fraudsters, Normal Users, Disentangled Representation, Model Performance, Types Of Relationships, Mutual Information, Latent Factors, Latent Space, Real-world Datasets, Neighboring Nodes, Node Features, Graph Neural Networks, Source Node, Neighborhood Information, Different Types Of Relationships, Node Representations, Normal Nodes, Node Embeddings, Partial Labels, Graph Neural Network Model, Star Rating, Mutual Information Estimation, Deceptive Behavior, Semantic, Graph-structured Data, Fraud detection, disentangled representation, homophily, heterophily, graph neural networks
Abstract: Fraud detection has always been one of the primary concerns in social and economic activities and is becoming a decisive force in the booming digital economy. Graph structures formed by rich user interactions naturally serve as important clues for identifying fraudsters. While numerous graph neural network-based methods have been proposed, the diverse interactive connections within graphs and the heterophilic connections deliberately established by fraudsters to normal users as camouflage pose new research challenges. In this light, we propose H2IDE (Homophily and Heterophily Identification with Disentangled Embeddings) for accurate fraud detection in multi-relation graphs. H2IDE features an independence-constrained disentangled representation learning scheme to capture various latent behavioral patterns in graphs, along with a supervised identification task to specifically model the factor-wise heterophilic connections, both of which are proven crucial to fraud detection. We also design a relation-aware attention mechanism for hierarchical and adaptive neighborhood aggregation in H2IDE. Extensive comparative experiments with state-of-the-art baseline methods on two real-world multi-relation graphs and two large-scale homogeneous graphs demonstrate the superiority and scalability of our proposed method and highlight the key role of disentangled representation learning with homophily and heterophily identification.
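A relation-aware attention mechanism of the kind described can be sketched as per-relation neighborhood aggregation followed by learned attention weights over relations. The PyTorch module below is a simplified illustration under assumed mean aggregation and dense adjacency matrices; H2IDE's disentangling and heterophily-identification components are omitted.

```python
# Relation-aware attention: aggregate neighbors per relation, then weight
# each relation's summary with a learned, per-node attention score.
import torch
import torch.nn as nn

class RelationAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)       # scores a (node, relation-summary) pair

    def forward(self, h, adjs):                  # h: (N, dim); adjs: one (N, N) matrix per relation
        # Mean-aggregate neighbors under each relation separately.
        summaries = [a @ h / a.sum(1, keepdim=True).clamp(min=1) for a in adjs]
        logits = torch.cat([self.score(torch.cat([h, s], dim=-1)) for s in summaries], dim=-1)
        att = torch.softmax(logits, dim=-1)      # (N, n_rel): per-node relation weights
        return sum(att[:, r:r + 1] * summaries[r] for r in range(len(adjs)))

N, dim = 6, 8
h = torch.rand(N, dim)
adjs = [torch.bernoulli(torch.full((N, N), 0.3)) for _ in range(2)]  # two toy relations
print(RelationAttention(dim)(h, adjs).shape)     # torch.Size([6, 8])
```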
Next Point-of-Interest Recommendation With Adaptive Graph Contrastive Learning
Xuan Rao, Renhe Jiang, Shuo Shang, Lisi Chen, Peng Han, Bin Yao, Panos Kalnis
Keywords: Adaptation models, Trajectory, Accuracy, Contrastive learning, Transformers, Vectors, Symbols, Frequency measurement, Data augmentation, Correlation, Adaptive Learning, Self-supervised Learning, Adaptive Graph, Graph Contrastive Learning, Attention Mechanism, Geographical Proximity, Graph Neural Networks, Multiple Graphs, User Trajectory, Positive Samples, Weight Decay, Kullback-Leibler, Sequential Model, User Preferences, Training Efficiency, Frequent Attendance, Gated Recurrent Unit, Temporal Modulation, Graph Convolution, Graph-based Methods, Static Graph, Frequency Graphs, Temporal Graph, Regular Graphs, High Sparsity, Graph Attention Network, Transition Patterns, Sparse Graph, Click-through, Temporal Bias, Point-of-Interest, recommendation, trajectory
Abstract: Next point-of-interest (POI) recommendation predicts a user's next movement and facilitates location-based applications such as destination suggestion and travel planning. State-of-the-art (SOTA) methods learn an adaptive graph from user trajectories and compute POI representations using graph neural networks (GNNs). However, a single graph cannot capture the diverse dependencies among POIs (e.g., geographical proximity and transition frequency). To tackle this limitation, we propose the Adaptive Graph Contrastive Learning (AGCL) framework. AGCL constructs multiple adaptive graphs, each modeling one kind of POI dependency and producing one POI representation, and the POI representations from different graphs are merged into a multi-facet representation that encodes comprehensive information. To train the POI representations, we tailor a graph-based contrastive learning objective, which encourages the representations of similar POIs to align and those of dissimilar POIs to differentiate. Moreover, to learn the sequential regularities of user trajectories, we design an attention mechanism to integrate spatial-temporal information into the POI representations. An explicit spatial-temporal bias is also employed to adjust the predictions for enhanced accuracy. We compare AGCL with 10 state-of-the-art baselines on 3 datasets. The results show that AGCL outperforms all baselines and achieves an improvement of 10.14% over the best-performing baseline in average accuracy.
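A graph-based contrastive objective of this kind can be illustrated with the standard InfoNCE loss applied across two graph views: the same POI's representations from two adaptive graphs form a positive pair, and all other POIs serve as negatives. The PyTorch sketch below uses random stand-in embeddings; the temperature and the two named views are assumptions, not AGCL's exact construction.

```python
# Cross-view InfoNCE: align the same POI's embeddings across two graph views
# while pushing apart embeddings of different POIs.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                      # (N, N) cross-view similarities
    targets = torch.arange(z1.size(0))           # positive = same POI in the other view
    return F.cross_entropy(sim, targets)

n_poi, dim = 100, 64
view_geo = torch.rand(n_poi, dim, requires_grad=True)   # e.g., geographical-proximity graph
view_freq = torch.rand(n_poi, dim, requires_grad=True)  # e.g., transition-frequency graph
loss = info_nce(view_geo, view_freq)
loss.backward()                                  # gradients flow back into both views' encoders
print(float(loss))
```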
Network-to-Network: Self-Supervised Network Representation Learning via Position Prediction
Jie Liu, Chunhai Zhang, Zhicheng He, Wenzheng Zhang, Na Li
Keywords: Representation learning, Training, Knowledge engineering, Fuses, Network topology, Self-supervised learning, Vectors, Graph neural networks, Decoding, Faces, Self-supervised Learning, Information Content, Position Information, Content Knowledge, Prediction Task, Node Positions, Graph Neural Networks, Low-dimensional Representation, Topological Information, Rich Content, Node Representations, Low-dimensional Embedding, Node Embeddings, Egocentric Network, Convolutional Neural Network, Supervised Learning, Unsupervised Learning, Network Topology, Semantic Information, Graph Structure, Ego Network, Semi-supervised Learning, Jigsaw Puzzle, Node Features, Citation Network, Graph Neural Network Model, Node Classification, Embedding Dimension, Word Embedding, Hidden Representation, Graph neural networks, network representation learning, network to network, self-supervised learning
Abstract: Network Representation Learning (NRL) has achieved remarkable success in learning low-dimensional representations of network nodes. However, most NRL methods, including Graph Neural Networks (GNNs) and their variants, face critical challenges. First, labeled network data, which are required for training most GNNs, are expensive to obtain. Second, existing methods are sub-optimal in preserving comprehensive topological information, including both structural and positional information. Finally, most GNN approaches ignore rich node content information. To address these challenges, we propose a self-supervised Network-to-Network framework (Net2Net) that learns semantically meaningful node representations. Our framework employs a pretext task of node position prediction (PosPredict) to effectively fuse topological and content knowledge into low-dimensional embeddings for every node in a semi-supervised manner. Specifically, we regard a network as comprising node content and node position networks, where Net2Net aims to learn the mapping between them. We utilize a multi-layer recursively composable encoder to integrate the content and topological knowledge into egocentric network node embeddings. Furthermore, we design a cross-modal decoder to map the egocentric node embeddings to their node position identities (PosIDs) in the node position network. Extensive experiments on eight diverse networks demonstrate the superiority of Net2Net over comparable methods.
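A position-prediction pretext task of this shape can be sketched as an encoder-decoder trained to predict each node's position identity from its content features. The sketch below uses a placeholder MLP encoder and random toy PosIDs; how Net2Net actually derives PosIDs from the position network, and its recursively composable encoder, are not reproduced.

```python
# Pretext task sketch: embed nodes from content, then train a cross-modal
# decoder to predict each node's position identity (PosID).
import torch
import torch.nn as nn

n_nodes, feat_dim, emb_dim, n_pos = 200, 16, 32, 10
content = torch.rand(n_nodes, feat_dim)          # node content features
pos_ids = torch.randint(0, n_pos, (n_nodes,))    # toy position identities (assumed given)

encoder = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
decoder = nn.Linear(emb_dim, n_pos)              # cross-modal decoder: embedding -> PosID
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-2)

for _ in range(100):                             # self-supervised pre-training loop
    z = encoder(content)                         # low-dimensional node embeddings
    loss = nn.functional.cross_entropy(decoder(z), pos_ids)
    opt.zero_grad(); loss.backward(); opt.step()
print(z.shape)                                   # embeddings reusable for downstream tasks
```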