IEEE Transactions on Computers


Archived papers: 511
ReViT: Vision Transformer Accelerator With Reconfigurable Semantic-Aware Differential Attention
Xiaofeng Zou, Cen Chen, Hongen Shao, Qinyu Wang, Xiaobin Zhuang, Yangfan Li, Keqin Li
Keywords: Semantics, Transformers, Visualization, Computer vision, Computational modeling, Attention mechanisms, Dogs, Computers, Snow, Performance evaluation, Vision Transformer, Similar Characteristics, Specific Mechanisms, Attention Mechanism, Strong Similarity, Real-world Images, Execution Efficiency, Semantic Image, Co-design Approach, Computational Complexity, Computational Resources, Matrix Multiplication, Processing Elements, Hash Function, Computational Overhead, Global Attention, Communication Overhead, Attention Scores, Power-of-two, Raw Features, Hierarchical Attention, Semantic Groups, Local Attention, Locality Sensitive Hashing, Idle Resources, Sequential Execution, Algorithmic Perspective, Groups Of Different Sizes, Multi-head Self-attention, Patch Features, Hardware accelerator, vision transformers, software-hardware co-design
Abstract: While vision transformers (ViTs) have continued to achieve new milestones in computer vision, their complicated network architectures with high computation and memory costs have hindered their deployment on resource-limited edge devices. Some customized accelerators have been proposed to accelerate the execution of ViTs, achieving improved performance with reduced energy consumption. However, these approaches utilize flattened attention mechanisms and ignore the inherent hierarchical visual semantics in images. In this work, we conduct a thorough analysis of hierarchical visual semantics in real-world images, revealing opportunities and challenges of leveraging visual semantics to accelerate ViTs. We propose ReViT, a systematic algorithm and architecture co-design approach that exploits visual semantics to accelerate ViTs. Our proposed algorithm leverages the strong feature similarity within the same semantic class to reduce computation and communication through a differential attention mechanism, and supports semantic-aware attention efficiently. A novel dedicated architecture is designed to support the proposed algorithm and translate it into performance improvements. Moreover, we propose an efficient execution dataflow to alleviate workload imbalance and maximize hardware utilization. ReViT opens new directions for accelerating ViTs by exploring the underlying visual semantics of images. ReViT gains an average 2.3× speedup and 3.6× energy efficiency over state-of-the-art ViT accelerators.
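As a concrete illustration of the differential-attention idea in this abstract, below is a minimal NumPy toy: patches with near-identical features are bucketed (here with a random-projection locality-sensitive hash, one of the techniques named in the keywords), attention is computed once per bucket representative, and the result is reused for the bucket's members. The function names and grouping heuristic are illustrative assumptions, not ReViT's actual algorithm or API.

```python
import numpy as np

def lsh_group(x, n_bits=4, seed=0):
    """Bucket patch features by the sign pattern of random projections."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((x.shape[1], n_bits))
    return (x @ planes > 0) @ (1 << np.arange(n_bits))   # integer bucket ids

def grouped_attention(q, k, v, groups):
    """Compute one softmax-attention row per group and broadcast it."""
    out = np.empty_like(v)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        rep = idx[0]                                   # group representative
        scores = q[rep] @ k.T / np.sqrt(k.shape[1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[idx] = weights @ v                         # reused by all members
    return out

x = np.random.default_rng(1).standard_normal((64, 32))  # 64 patches, dim 32
groups = lsh_group(x)
y = grouped_attention(x, x, x, groups)                  # q = k = v = x here
print(y.shape, len(np.unique(groups)), "semantic groups")
```

The saving is one attention row per group instead of one per patch; the hard part, which the dedicated architecture and dataflow address, is that real images produce groups of very different sizes.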
WaWoT: Towards Flexible and Efficient Web of Things Services via WebAssembly on Resource-Constrained IoT Devices
Borui Li, Hongchang Fan, Yi Gao, Wei Dong
Keywords: Internet of Things, Runtime, Standards, Interoperability, Codes, Performance evaluation, Computers, Safety, Protocols, Optimization, Internet Of Things, Internet Of Things Devices, Flexible Web, Web Of Things, Energy Consumption, Interoperability, Flexible Services, Internet Of Things Platform, Cut-points, Face Recognition, Web Page, Optimal Energy, Standard Language, Migration Policy, Internet Of Things Applications, Automatic Generation, Memory Footprint, Execution Efficiency, Resource-constrained Devices, Runtime Overhead, Stack Overflow, Compile Time, Description File, Battery Lifetime, Piece Of Code, Protective Memory, JavaScript, Roadside Units, WebAssembly, Internet of Things, web of things, ahead of time compilation
Abstract: Web of Things (WoT) is an emerging concept to connect IoT devices to the web using standard interfaces. This provides interoperability between different IoT platforms and enables seamless integration with web and cloud services. However, running sophisticated web services directly on resource-constrained IoT devices is challenging due to limitations in memory, computation, and energy. This paper proposes WaWoT, a Wasm-based framework for flexible and efficient Web of Things services. WaWoT allows flexible WoT service development using annotations and automatic partitioning. It also enables dynamic service migration using WebAssembly modules to adapt placement between IoT devices and web clients. We also introduce an ahead-of-time compiler optimized for low memory usage through techniques like streamed compilation and trimming. For energy efficiency, we use optimizations like bulk instruction writing and direct I/O accessing. Safety is ensured through compile-time and run-time analyses that guarantee sandboxed execution. Evaluations demonstrate that WaWoT exhibits better flexibility than existing WoT development approaches. Furthermore, WaWoT reduces RAM usage by 84.9× and energy consumption by 1.9×-4.9× over existing WebAssembly runtimes. Overall, it enables efficient, safe, and flexible WoT services on constrained IoT devices.
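The annotation-plus-partitioning workflow can be pictured with a small Python toy: service functions are tagged with a rough RAM cost, and a greedy partitioner keeps cheap modules on the device while migrating the rest to the web client. The decorator name, cost units, and greedy policy are invented for illustration and are not WaWoT's real interface.

```python
OFFLOADABLE = []

def offloadable(ram_kb):
    """Annotation marking a function as a candidate Wasm-module cut-point."""
    def wrap(fn):
        OFFLOADABLE.append((fn.__name__, ram_kb))
        return fn
    return wrap

@offloadable(ram_kb=4)
def read_sensor():
    return 21.5

@offloadable(ram_kb=96)
def recognize_face(frame):
    return "unknown"

def partition(budget_kb):
    """Greedily keep cheap modules on-device; migrate the rest to the client."""
    on_device, on_client, used = [], [], 0
    for name, cost in sorted(OFFLOADABLE, key=lambda t: t[1]):
        if used + cost <= budget_kb:
            on_device.append(name)
            used += cost
        else:
            on_client.append(name)
    return on_device, on_client

print(partition(budget_kb=32))   # -> (['read_sensor'], ['recognize_face'])
```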
Access-Pattern Hiding Search Over Encrypted Databases by Using Distributed Point Functions
Hongcheng Xie, Yu Guo, Yinbin Miao, Xiaohua Jia
Keywords: Databases, Servers, Security, Encryption, Indexes, Accuracy, Cloud computing, Hardware, Protocols, Probabilistic logic, Encrypted Database, Distributed Point Functions, Single Round, Communication Overhead, Encrypted Data, Search Operations, Security Guarantees, Encryption Scheme, Differential Privacy, Security Framework, Communication Rounds, Update Operation, Keyword Queries, Time Cost, False-positive And False-negative, Multiple Rounds, Type Of Operation, Polynomial Of Degree, Secret Key, Public Key, Access Patterns, Database Server, Data Owner, Bloom Filter, Number Of Keywords, Update Procedure, Real-world Networks, Chosen-plaintext, Monomial, Pattern Search, Searchable encryption, access-pattern hiding, distributed point function
Abstract: Encrypted databases have been extensively studied with the increasing concern of data privacy in cloud services. For practical efficiency, most encrypted database systems are built under Dynamic Searchable Symmetric Encryption (DSSE) schemes to support fast query and update over encrypted data. However, DSSE schemes allow leakages in their security frameworks, especially access-pattern leakages (i.e., the search results corresponding to queried keywords), which enable various attacks that infer sensitive information about queries and databases. Existing oblivious-access techniques, such as Oblivious RAM and differential privacy, suffer from excessive communication overhead and loss of query accuracy. In this paper, we propose a new DSSE scheme that enables access-pattern-hiding keyword search and update operations. Servers can obliviously query and update databases within a single communication round. Our building block is the Distributed Point Function (DPF), an advanced secret sharing technique that provides provable security guarantees against adversaries with arbitrary background knowledge. Moreover, we devise a novel update protocol that integrates DPF and Somewhat Homomorphic Encryption (SHE) such that servers can obliviously update their local data. We formally analyze the security of our scheme and implement a prototype. The comprehensive experimental results demonstrate the security and efficiency of our scheme.
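The core trick behind the single-round oblivious search is easiest to see in a two-server toy: the point function selecting the queried index is split into two XOR shares, each of which is individually a uniformly random bit vector, so neither server learns the access pattern. Real DPFs compress these keys to logarithmic size; the full-length shares and helper names below are purely pedagogical.

```python
import secrets

def share_point_function(n, index):
    """XOR-share the unit vector e_index between two servers."""
    share0 = [secrets.randbelow(2) for _ in range(n)]
    share1 = [b ^ (1 if i == index else 0) for i, b in enumerate(share0)]
    return share0, share1

def server_answer(db, share):
    """Each server XORs together the records its share selects."""
    acc = 0
    for record, bit in zip(db, share):
        if bit:
            acc ^= record
    return acc

db = [0x11, 0x22, 0x33, 0x44]             # a tiny "encrypted" database
s0, s1 = share_point_function(len(db), index=2)
result = server_answer(db, s0) ^ server_answer(db, s1)
assert result == db[2]                     # client recovers record 2 obliviously
print(hex(result))
```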
Lock-Free Triangle Counting on GPU
Zhigao Zheng, Guojia Wan, Jiawei Jiang, Chuang Hu, Hao Liu, Shahid Mumtaz, Bo Du
Keywords: Graphics processing units, Instruction sets, Organizations, Memory management, Computers, Programming, Message systems, Load modeling, Indexes, Bandwidth, Graphics Processing Unit, Triangle Count, Random Access, Intersection Set, Link Prediction, Neighboring Vertices, Load Imbalance, Source Vertex, Time Complexity, Search Space, Vertices, Binary Tree, Array Structure, Intersectional Approach, Memory Space, Partition Model, Vertex Degree, Binary Search, Breadth-first Search, Neighbor List, L2 Cache, Cache Hit, Graphics Processing Unit Memory, Shared Memory, Vertex Values, High Time Complexity, Graph Processing, Sparse Graph, Hash Function, Triangle counting, lock free, thread organization, GPGPU, parallelism
Abstract: Finding the triangles of large-scale graphs is a fundamental graph mining task in many applications, such as motif detection, microscopic evolution, and link prediction. Recent works on triangle counting can be classified into merge-based or binary-search-based paradigms. The merge-based triangle counting paradigm locates the triangles using the set intersection operation, which suffers from the random memory access problem. The binary-search-based triangle counting paradigm sets the neighbors of the source vertex of an edge as the lookup array and searches the neighbors of the destination vertex; it requires many expensive lock operations, which leads to low thread efficiency. In this paper, we aim to improve triangle counting efficiency on GPU by designing a lock-free policy named Skiff to implement a hash-based triangle counting algorithm. In Skiff, we first design a hash-trie data layout to meet the coalesced memory access model and then propose a lock-free policy to reduce conflicts in the hash trie. In addition, we use a level array to manage the index of the hash trie so that its nodes can be quickly located. Furthermore, we implement a CTA thread organization model to reduce the load imbalance of real-world graphs. We conducted extensive experiments on NVIDIA GPUs to show the performance of Skiff. The results show that Skiff achieves a significant performance improvement over the state-of-the-art (SOTA) works.
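For readers unfamiliar with the hash-based paradigm Skiff accelerates, here is a sequential CPU sketch: each vertex's neighbors go into a hash set, and for every edge (u, v) the smaller neighbor set probes the larger one, avoiding both the merge paradigm's sorted-list scans and any locking. The GPU-specific hash-trie layout, level array, and CTA scheduling from the paper are not modeled.

```python
from collections import defaultdict

def count_triangles(edges):
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    count = 0
    for u, v in edges:
        if u < v:                              # count each edge once
            small, large = sorted((neighbors[u], neighbors[v]), key=len)
            count += sum(1 for w in small if w in large)   # hash probes
    return count // 3                          # each triangle seen via 3 edges

# A 4-clique contains exactly 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(edges))                  # -> 4
```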
Feynman Meets Turing: The Uncomputability of Quantum Gate-Circuit Emulation and Concatenation
Holger Boche, Yannik N. Böck, Zoe Garcia del Toro, Frank H. P. Fitzek
Keywords: Logic gates, Quantum computing, Emulation, Circuits, Boolean functions, Quantum mechanics, Computers, Hardware, Quantum algorithm, Computational modeling, Contralateral, Digital Technologies, Quantum Computing, Classical Equation, Digital Computer, Computational Theory, Gate Set, Information Technology, Operational Units, Information And Communication Technologies, Complex Numbers, Part Of Function, Lookup Table, Hilbert Space, Calculation Of Function, Quantum Information, Universal Set, Arbitrary Set, Self-adjoint Operator, Classical Computer, Recursive Function, Boolean Function, Turing Machine, Quantum Gates, Quantum Algorithms, Finite-dimensional Hilbert Space, Linearizable, Abuse Of Notation, Set Of Matrices, Set Of Functions, Quantum circuits, quantum-circuit emulation, quantum-circuit concatenation, turing machines, effective analysis
Abstract: We investigate the feasibility of computing quantum gate-circuit emulation (QGCE) and quantum gate-circuit concatenation (QGCC) on digital hardware. QGCE serves the purpose of rewriting gate circuits comprised of gates from a varying input gate set into gate circuits formed of gates from a fixed target gate set. Analogously, QGCC serves the purpose of finding an approximation to the concatenation of two arbitrary elements of a varying list of input gate circuits in terms of another element from the same list. Problems of this kind occur regularly in quantum computing and are often assumed to be an easy task for the digital computers controlling the quantum hardware. Arguably, this belief is due to analogical reasoning: the classical Boolean equivalents of QGCE and QGCC are natively computable on digital hardware. In the present paper, we present two insights in this regard: upon applying a rigorous theory of computability, QGCE and QGCC turn out to be uncomputable on digital hardware. The results remain valid when we restrict the set of feasible inputs for the relevant functions to one-parameter families of fixed gate sets. Our results underline the possibility that several ideas from quantum-computing theory may require a rethinking to become feasible for practical implementation.
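To make the problem statement concrete, the toy below brute-forces a finite instance of gate-circuit emulation: it searches bounded-length words over the fixed gate set {H, T} for the best operator-norm approximation (up to global phase) of a target single-qubit unitary. The paper's point is that no algorithm solves the general problem with guaranteed precision for arbitrary computable gate sets; this exhaustive search, with helper names of our choosing, is only an illustration of what QGCE asks for.

```python
import itertools
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])
GATES = {"H": H, "T": T}

def phase_distance(u, v):
    """Operator-norm distance between unitaries, minimized over a global phase."""
    inner = np.trace(u.conj().T @ v)
    phase = inner / abs(inner) if abs(inner) > 1e-12 else 1.0
    return np.linalg.norm(u - phase * v, ord=2)

def emulate(target, max_len=8):
    """Exhaustively find the best {H, T}-word of length <= max_len."""
    best = (float("inf"), "")
    for n in range(1, max_len + 1):
        for word in itertools.product("HT", repeat=n):
            circuit = np.eye(2, dtype=complex)
            for g in word:
                circuit = GATES[g] @ circuit
            best = min(best, (phase_distance(target, circuit), "".join(word)))
    return best

S = np.array([[1, 0], [0, 1j]])        # the S gate equals T·T exactly
print(emulate(S))                       # -> (~0.0, 'TT')
```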
NPC: A Non-Conflicting Processing-in-Memory Controller in DDR Memory Systems
Seungyong Lee, Sanghyun Lee, Minseok Seo, Chunmyung Park, Woojae Shin, Hyuk-Jae Lee, Hyun Kim
Keywords: Memory management, Processor scheduling, Random access memory, Error correction codes, Computers, Calibration, Microprocessors, Micromechanical devices, Hardware, Degradation, Memory System, Performance Degradation, Memory Control, Mode Transition, Memory Wall, High Performance, Power Consumption, Moderate Changes, Processing Unit, Decrease In Performance, Load Data, State Machine, Low Latency, Transition Frequency, Forward Error Correction, Mode Switching, Memory Content, Output Buffer, Types Of Requests, Large Workload, Queue Size, Gray Part, Processing-in-memory, memory controller, DRAM, scheduling policy
Abstract: Processing-in-Memory (PIM) has emerged as a promising solution to address the memory wall problem. Existing memory interfaces must support new PIM commands to utilize PIM, making the definition of PIM commands according to memory modes a major issue in the development of practical PIM products. For performance and OS-transparency, the memory controller is responsible for changing the memory mode, which requires modifying the controller and resolving conflicts with existing functionalities. Additionally, it must operate to minimize mode transition overhead, which can cause significant performance degradation. In this study, we present NPC, a memory controller designed for mode-transition PIM that delivers PIM commands via the DDR interface. NPC issues PIM commands while transparently changing the memory mode with a dedicated scheduling policy that reduces the number of mode transitions through aggregative issuing. Moreover, existing functions, such as refresh, are optimized for PIM operation. We implement NPC in hardware and develop a PIM emulation system to validate it on FPGA platforms. Experimental results reveal that NPC is compatible with existing interfaces and functionality, and the proposed scheduling policy improves performance by 2.2× with balanced fairness, achieving up to 97% of the ideal performance. These findings have the potential to aid the application of PIM in real systems and contribute to the commercialization of mode-transition PIM.
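The effect of aggregative issuing can be demonstrated with a toy queue model: a FIFO controller pays a mode switch on every alternation between normal and PIM requests, whereas draining same-mode requests in batches pays it far less often. The request mix, batch size, and policy below are invented for illustration; NPC's real scheduler additionally handles refresh and fairness.

```python
def count_transitions(schedule):
    return sum(1 for a, b in zip(schedule, schedule[1:]) if a != b)

def fifo(requests):
    return list(requests)

def aggregative(requests, batch=4):
    """Issue up to `batch` queued requests of the current mode before switching."""
    queue, out = list(requests), []
    mode = queue[0]
    while queue:
        same = [r for r in queue if r == mode][:batch]
        if not same:                       # nothing left in this mode: switch
            mode = "pim" if mode == "mem" else "mem"
            continue
        for r in same:
            queue.remove(r)
            out.append(r)
        mode = "pim" if mode == "mem" else "mem"
    return out

requests = ["mem", "pim"] * 8              # worst-case alternating arrivals
print(count_transitions(fifo(requests)))          # -> 15 mode switches
print(count_transitions(aggregative(requests)))   # -> 3 mode switches
```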
Accurate and Reliable Energy Measurement and Modelling of Data Transfer Between CPU and GPU in Parallel Applications on Heterogeneous Hybrid Platforms
Hafiz Adnan Niaz, Ravi Reddy Manumachu, Alexey Lastovetsky
Keywords: Data transfer, Graphics processing units, Sensors, Predictive models, Energy measurement, Computational modeling, Software, Data models, Central Processing Unit, Accuracy, Accuracy Of Model, Graphics Processing Unit, Data Transfer, Energy Model, Accurate Energy, Reliable Energy, Parallel Applications, Heterogeneous Platforms, Heterogeneous Hybrid, Linear Model, Energy Consumption, Average Error, Multi-core, Optimal Energy, Power Meter, Computing Devices, Energy Sensor, External Power, Dynamic Energy, Disjoint Sets, Software Components, Energy Profile, Application Running, High Positive Correlation, Application Execution, Events In Group, Transfer Time, End Time, Fast Procedure, Processing Architecture, Energy efficiency, energy modelling, energy predictive model, energy optimization, parallel computing, heterogeneous computing, heterogeneous platforms
Abstract: Developing energy-efficient software that leverages application-level energy optimization techniques is essential to tackle the pressing technological challenge of energy efficiency on modern heterogeneous computing platforms. While energy modelling and optimization of computations have received considerable attention in energy research, there remains a significant gap in the energy modelling of data transfer between computing devices on heterogeneous hybrid platforms. Our study aims to fill this crucial gap. In this work, we comprehensively study the energy consumption of data transfer between a host CPU and a GPU accelerator on heterogeneous hybrid platforms using the three mainstream energy measurement methods: (a) system-level physical measurements based on external power meters (ground truth), (b) measurements using on-chip power sensors, and (c) energy predictive models. The ground-truth method is accurate but prohibitively time-consuming. The on-chip sensors in Intel multicore CPU processors are inaccurate, and the Nvidia GPU sensors do not capture data transfer activity. Therefore, we focus on the third approach and propose a novel methodology to select a small subset of performance events that effectively capture all the energy consumption activities during a data transfer, and we develop accurate linear energy predictive models employing the shortlisted performance events. Finally, we develop independent, accurate, runtime-pluggable software energy sensors based on our proposed energy predictive models that employ disjoint sets of performance events to estimate the dynamic energy of computations and data transfers. We employ the sensors to predict the energy consumption of computations and data transfer between a host CPU and two A40 Nvidia GPUs in three parallel scientific applications; the high accuracy (average prediction error of 5%) of our sensors’ predictions further underscores their practical relevance.
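The modelling approach reduces to fitting dynamic energy as a linear function of a handful of performance-event counts, as in the least-squares sketch below. The event counts and coefficients are synthetic stand-ins; the paper's contribution lies in the methodology for selecting which events to include and in validating against ground-truth power measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_events = 40, 3
# Synthetic counts for three shortlisted events (e.g. bytes moved, cache
# misses, bus transactions) across 40 profiled data-transfer runs.
events = rng.uniform(1e6, 1e9, size=(n_runs, n_events))
true_coeff = np.array([3e-9, 1.2e-8, 7e-10])            # energy cost per event
energy = events @ true_coeff + rng.normal(0, 0.05, n_runs)  # "measured" joules

# Fit energy ~ events @ coeff by least squares; in practice coefficients are
# constrained to be non-negative so each event has a physical energy cost.
coeff, *_ = np.linalg.lstsq(events, energy, rcond=None)

new_run = np.array([5e8, 2e7, 1e9])                     # counts for a new run
print("predicted dynamic energy (J):", new_run @ coeff)
```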
A Parallel Computing Scheme Utilizing Memristor Crossbars for Fast Corner Detection and Rotation Invariance in the ORB Algorithm
Qinghui Hong, Haoyou Jiang, Pingdan Xiao, Sichun Du, Tao Li
Keywords: Gray-scale, Feature extraction, Computer architecture, Memristors, Hardware, Field programmable gate arrays, Common Information Model (computing), In-memory computing, Corner detection, Computers, Parallelization, Rotation Invariance, Memristor Crossbar, Processing Speed, Image Size, Average Speed, Speed Of The Algorithm, Analog Computing, Toughness, Time Cost, Effect Of Error, Square Wave, Input Voltage, Fast Algorithm, Method In This Paper, Increase In Error, Impact Of Noise, Device Parameters, Software Defect, Corner Points, Wire Resistance, Accurate Output, Non-maximum Suppression, Mapping Rules, Grayscale Value, Scale-invariant Feature Transform, Operational Amplifier, Power Consumption, Resistance Values, Computing in memory (CIM), hardware accelerator, feature extraction, ORB algorithm, memristor
Abstract: The Oriented FAST and Rotated BRIEF (ORB) algorithm plays a crucial role in rapidly extracting image keypoints. However, in the domain of high-frame-rate real-time applications, the algorithm faces challenges in speed and computational efficiency as both the size and number of images increase. To address this issue, this paper first proposes an ORB algorithm accelerator based on a computing-in-memory (CIM) circuit, which replaces the iterative calculations in traditional methods with one-step parallel analog computation. The proposed accelerator improves computational efficiency through CIM technology and enhances algorithm speed through parallel computation. Simulations demonstrate that the proposed method exhibits an average processing speed 22× faster than traditional methods and obtains a more uniform corner distribution in large-scale images.
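For reference, the FAST corner test at the heart of ORB checks whether at least n contiguous pixels on a 16-pixel Bresenham circle are all brighter or all darker than the center by a threshold. The sequential NumPy version below shows only that underlying logic; the accelerator evaluates these comparisons in a single parallel analog step in the crossbar.

```python
import numpy as np

# Offsets of the 16-pixel radius-3 Bresenham circle used by FAST.
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def is_fast_corner(img, y, x, thresh=20, n=12):
    center = int(img[y, x])
    ring = np.array([int(img[y + dy, x + dx]) for dy, dx in CIRCLE])
    brighter = ring > center + thresh
    darker = ring < center - thresh
    for flags in (brighter, darker):
        doubled = np.concatenate([flags, flags])   # handle wrap-around runs
        run = 0
        for f in doubled:
            run = run + 1 if f else 0
            if run >= n:                           # n contiguous pixels found
                return True
    return False

img = np.zeros((7, 7), dtype=np.uint8)
img[3, 3] = 200                                    # bright dot on dark field
print(is_fast_corner(img, 3, 3))                   # -> True (all 16 darker)
```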
Data Sharing in the Metaverse With Key Abuse Resistance Based on Decentralized CP-ABE
Liang Zhang, Zhanrong Ou, Changhui Hu, Haibin Kan, Jiheng Zhang
Keywords: Blockchains, Access control, Metaverse, Smart contracts, Security, Public key, Privacy, Online banking, Generators, Encryption, Data Sharing, Ciphertext-policy Attribute-based Encryption, Key Abuse, Access Control, Smart Contracts, Attribute-based Encryption, User Data, Dishonest, Secret Key, Public Key, Data Owner, Bilinear Map, Key Generation, Key Size, Secure Channel, Malicious Users, Encryption Key, Distributed Denial Of Service, Digital Currency, Digital Assets, Decryption Key, Impersonation Attack, Trusted Third Party, Consortium Blockchain, Ethereum Blockchain, Bilinear Pairing, Symmetric Encryption, Target User, Cryptographic Primitives, Metaverse, accountability, CP-ABE, key abuse, GameFi, DAO
Abstract: Data sharing is ubiquitous in the metaverse, which adopts blockchain as its foundation. Blockchain is employed because it enables data transparency, achieves tamper resistance, and supports smart contracts. However, securely sharing data based on blockchain necessitates further consideration. Ciphertext-policy attribute-based encryption (CP-ABE) is a promising primitive to provide confidentiality and fine-grained access control. Nonetheless, authority accountability and key abuse are critical issues that practical applications must address. Few studies have considered CP-ABE key confidentiality and authority accountability simultaneously. To our knowledge, we are the first to fill this gap by integrating non-interactive zero-knowledge (NIZK) proofs into CP-ABE keys and outsourcing the verification process to a smart contract. To meet the decentralization requirement, we incorporate a decentralized CP-ABE scheme into the proposed data sharing system. Additionally, we provide an implementation based on a smart contract to determine whether an access control policy is satisfied by a set of CP-ABE keys. We also introduce an open incentive mechanism to encourage honest participation in data sharing. Hence, the key abuse issue is resolved through the NIZK proof and the incentive mechanism. We provide a theoretical analysis and conduct comprehensive experiments to demonstrate the feasibility and efficiency of the data sharing system. Based on the proposed accountable approach, we further illustrate an application in GameFi, where players can play to earn or contribute to an accountable DAO, fostering a thriving metaverse ecosystem.
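The policy check that the paper outsources to a smart contract can be prototyped in a few lines: decide whether the attribute set embedded in a user's CP-ABE keys satisfies an AND/OR policy tree. The tree encoding and attribute names below are made up for illustration; the real contract additionally verifies the NIZK proofs that bind keys to their holder.

```python
def satisfies(policy, attrs):
    """policy: an attribute string, or ('AND'|'OR', subpolicy, subpolicy)."""
    if isinstance(policy, str):
        return policy in attrs
    op, left, right = policy
    if op == "AND":
        return satisfies(left, attrs) and satisfies(right, attrs)
    if op == "OR":
        return satisfies(left, attrs) or satisfies(right, attrs)
    raise ValueError(f"unknown operator: {op}")

# Policy: (doctor AND cardiology) OR auditor
policy = ("OR", ("AND", "doctor", "cardiology"), "auditor")
print(satisfies(policy, {"doctor", "cardiology"}))   # -> True
print(satisfies(policy, {"doctor", "oncology"}))     # -> False
print(satisfies(policy, {"auditor"}))                # -> True
```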
High-Radix Generalized Hyperbolic CORDIC and Its Hardware Implementation
Hui Chen, Lianghua Quan, Ke Chen, Weiqiang Liu
Keywords: Hardware, Computers, Vectors, Accuracy, Convergence, Software, Mathematical models, Computer architecture, Software algorithms, Real-time systems, Hardware Implementation, Logarithm, Selection Criteria, Power Consumption, Exponential Function, Simulation Software, Coordinate Rotation, Upper Limit, Scaling Factor, New Variables, Rotation Angle, Computational Accuracy, Singular Value Decomposition, Derivative Of Function, Lookup Table, Bit Error Rate, Bit Error, Clock Cycles, Hardware Architecture, Rotational Modes, Vector Modes, Bit-width, Circular System, Sign Bit, Judgment Criteria, Stage Of The Cascade, Additional Bits, Floating-point Operations, Taylor Expansion, High-radix, hyperbolic CORDIC, generalized, logarithmic, exponential, hardware implementation
Abstract: In this paper, we propose a high-radix generalized hyperbolic coordinate rotation digital computer (HGH-CORDIC). This algorithm not only computes logarithmic and exponential functions with any fixed base but also significantly reduces the number of iterations required compared to traditional CORDIC methods. Initially, we present the general iteration formulas for HGH-CORDIC. Subsequently, we discuss its pivotal convergence properties and selection criteria, exemplifying these with commonly used cases. Through extensive software simulations, we validate the theoretical foundations of our approach. Finally, we explore efficient hardware implementation strategies. Our analysis indicates that, relative to state-of-the-art radix-2 GH-CORDIC, the proposed HGH-CORDIC can decrease the number of iterations by more than 50% while maintaining comparable accuracy. Synthesized under 28nm CMOS technology, the reports show that the reference circuit saves about 40% area and power on average for $2^{x}$ and $\log_{2}x$ calculations compared with the latest CORDIC method.
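For orientation, below is a plain-Python reference of the conventional radix-2 hyperbolic CORDIC (vectoring mode) that HGH-CORDIC generalizes: it computes $\ln w = 2\,\mathrm{artanh}((w-1)/(w+1))$ using only shifts, adds, and a small artanh table, repeating iterations k = 4 and 13 as the classical convergence analysis requires. The paper's high-radix, arbitrary-base iterations replace the ±1 digit set and the table constants but keep this structure; this sketch is not the paper's algorithm.

```python
import math

def cordic_ln(w, n_iters=16):
    """Radix-2 hyperbolic CORDIC, vectoring mode: returns ln(w) for w in ~[0.11, 9.3]."""
    x, y, z = w + 1.0, w - 1.0, 0.0
    repeats = {4, 13}                      # classic re-iterated indices
    k, done = 1, 0
    while done < n_iters:
        for _ in range(2 if k in repeats else 1):
            d = -1.0 if y > 0 else 1.0     # drive y toward 0 (vectoring mode)
            x, y = x + d * y * 2.0**-k, y + d * x * 2.0**-k
            z -= d * math.atanh(2.0**-k)   # hardware uses a small constant table
            done += 1
        k += 1
    return 2.0 * z                         # ln(w) = 2 * artanh((w-1)/(w+1))

for w in (0.5, 2.0, 5.0):
    print(w, cordic_ln(w), math.log(w))    # agree to roughly 1e-4 after 16 steps
```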