-
Scalable Multi-FPGA HPC Architecture for Associative Memory System
Deyu Wang, Xiaoze Yan, Yu Yang, Dimitrios Stathis, Ahmed Hemani, Anders Lansner, Jiawei Xu, Li-Rong Zheng, Zhuo Zou
Keywords: Associative memory; Field programmable gate arrays; Brain modeling; Computational modeling; Task analysis; Training; Scalability; High-performance Computing; Scalable Architecture; High Performance Computing Architectures; Associative Memory System; Synchronization; Neural Network; Brain Tissue; Design Strategies; Maximum Size; Model Size; Online Learning; Network Configuration; Working Frequency; Association Task; High Storage Capacity; Weight Matrix; Training Phase; Spatial Dimensions; Excitatory Postsynaptic Currents; State Machine; Synaptic Weights; High-performance Computing Systems; Postsynaptic Spike; Local Spikes; Total Latency; Spiking Neural Networks; Inference Phase; Hardware Resources; On-chip Memory; Spike-timing-dependent Plasticity; Associative memory; multi-FPGA; scalability; high performance computing (HPC); Bayesian confidence propagation neural network (BCPNN); spiking neural network (SNN); Humans; Neural Networks, Computer; Memory; Bayes Theorem; Models, Neurological; Algorithms
Abstracts: Associative memory is a cornerstone of cognitive intelligence within the human brain. The Bayesian confidence propagation neural network (BCPNN), a cortex-inspired model with high biological plausibility, has proven effective in emulating high-level cognitive functions such as associative memory. However, the current approach of simulating BCPNN-based associative memory tasks on GPUs encounters challenges in latency and power efficiency as the model size scales. This work proposes a scalable multi-FPGA high-performance computing (HPC) architecture designed for the associative memory system. The architecture integrates a set of hypercolumn unit (HCU) computing cores for intra-board online learning and inference, along with a spike-based synchronization scheme for inter-board communication among multiple FPGAs. Several design strategies, including population-based model mapping, packet-based spike synchronization, and cluster-based timing optimization, are presented to facilitate the multi-FPGA implementation. The architecture is implemented and validated on two Xilinx Alveo U50 FPGA cards, achieving a maximum model size of 200 × 10 and a peak working frequency of 220 MHz for the associative memory system. Both the memory-bounded spatial scalability and the compute-bounded temporal scalability of the architecture are evaluated and optimized, achieving a maximum scale-latency ratio (SLR) of 268.82 for the two-FPGA implementation. Compared to a two-GPU counterpart, the two-FPGA approach demonstrates a maximum latency reduction of 51.72× and a power reduction exceeding 5.28× under the same network configuration. Compared with state-of-the-art works, the two-FPGA implementation exhibits a high pattern storage capacity for the associative memory task.
-
Integrated Active Quenching Circuit for High-Rate and Distortionless SPAD-Based Time-Resolved Fluorescence Applications
Francesco Malanga, Gennaro Fratta, Giulia Acconcia, Ivan Rech
Keywords: Photonics; Single-photon avalanche diodes; Transistors; Circuits; Photodetectors; Cathodes; Sensors; Extinction; Distortion; Temporal Variation; High Voltage; Dead Time; Time-correlated Single-photon Counting; Multichannel System; Fluorescence Lifetime Imaging Microscopy; Quantum Optics; Single-photon Avalanche Diode; Exceptional Sensitivity; Time Constant; Pulse Width; End Of Phase; Equivalent Resistance; Parasitic Capacitance; Digital Control; Extent Of Occurrence; Delay Line; Photon Detection; Overvoltage; Temperature Drift; Single-photon Detectors; Gate Signals; Control Words; Serial Communication; Illumination Power; Bias Conditions; Active quenching circuit (AQC); dead time; pile-up; single-photon avalanche diode (SPAD); time-correlated single photon counting (TCSPC); process voltage temperature (PVT) variations; Equipment Design; Photons; Microscopy, Fluorescence; Optical Imaging
Abstracts: Time-Correlated Single Photon Counting (TCSPC) is a pivotal technique in low-light-detection applications, renowned for its exceptional sensitivity and bandwidth and widely used in Fluorescence Lifetime Imaging Microscopy (FLIM) and quantum optics. Despite these strengths, TCSPC is significantly hindered by the pile-up effect, which may distort measurements at high photon-detection rates. Overcoming pile-up is challenging: traditional solutions often involve complex post-processing or multichannel systems, which complicate the TCSPC setup and limit performance. A breakthrough approach is to match the photodetector dead time to an integer multiple of the laser period, which yields a distortionless histogram even under high-illumination conditions. Building on this concept, we present an Active Quenching Circuit (AQC) developed in a high-voltage 150 nm technology, achieving unprecedented control over the Single Photon Avalanche Diode (SPAD) dead time. Our design compensates for Process, Voltage, and Temperature (PVT) variations, ensuring ultra-precise and robust dead-time tuning. The presented AQC achieves a dead-time resolution of 50 ps, suitable for time-resolved experiments within a selectable range of laser frequencies from 20 to 100 MHz, while maintaining close-to-ideal linearity in dead-time control. Experimental validation through fluorescence measurements reveals a distortion as low as 0.43% under elevated count-rate conditions, highlighting the efficacy of our circuit in overcoming the pile-up limitation.
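As a rough illustration of the dead-time matching idea in the abstract above, the following sketch picks a dead time equal to an integer multiple of the laser period and rounds it to a tuning step. The 50 ps step comes from the abstract; the function name, the `min_dead_time_ps` parameter, and the rounding policy are assumptions made for illustration and do not describe the circuit's actual control logic.

```python
import math

STEP_PS = 50  # dead-time tuning resolution quoted in the abstract


def matched_dead_time_ps(laser_freq_mhz, min_dead_time_ps):
    """Return (k, dead_time_ps) for a dead time matched to the laser period.

    k is the smallest integer multiple of the laser period that is at least
    min_dead_time_ps (a hypothetical lower bound set by the detector).
    The result is rounded to the nearest tuning step.
    """
    period_ps = 1e6 / laser_freq_mhz  # a 1 MHz laser has a 1e6 ps period
    k = max(1, math.ceil(min_dead_time_ps / period_ps))
    steps = round(k * period_ps / STEP_PS)
    return k, steps * STEP_PS
```

For example, an 80 MHz laser has a 12.5 ns period, so a hypothetical 20 ns minimum dead time would be stretched to two laser periods (25 ns).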
-
A 40-nm 169mW Ultrasound Imaging Processor Supporting Advanced Modes for Hand-Held Devices
Yi-Lin Lo, Yu-Chen Lo, Chia-Hsiang Yang
Keywords: Imaging; Delays; Array signal processing; Vectors; Ultrasonic imaging; Elastography; Doppler effect; Mobile Devices; Ultrasound Imaging; Advanced Mode; Power Consumption; Low Power Consumption; Pulse Repetition Frequency; Delay Values; Flow Vector; Storage Size; Beamline; Cubic Spline; Frame Rate; Lookup Table; Axial Direction; Speed Of Sound; Lateral Direction; Income Data; Spline Interpolation; Shear Wave; B-mode Images; Standard Mode; Speed Estimation; Shear Wave Speed; Delay Generator; Memory Bank; Cubic Spline Interpolation; Doppler Signal; Color Doppler Imaging; Flow Velocity; Hand-held ultrasound devices; advanced imaging mode; algorithm-architecture co-optimization; low-power design; CMOS integrated circuits; Ultrasonography; Equipment Design; Humans; Algorithms; Signal Processing, Computer-Assisted; Image Processing, Computer-Assisted; Elasticity Imaging Techniques
Abstracts: Hand-held ultrasound devices are widely used in healthcare, where power-efficient, real-time imaging is essential. This work presents the world's first ultrasound imaging processor supporting advanced modes, including vector flow imaging and elastography imaging. Plane-wave beamforming is utilized to ensure that the pulse repetition frequency (PRF) is sufficiently high for the advanced modes. The storage size and power consumption are minimized through algorithm-architecture co-optimization. The proposed plane-wave beamforming reduces the storage size of the required delay values by 43.7%. By exchanging the processing order, the storage size is reduced by 78.1% for elastography imaging. Parallel beamforming and interleaved firing are employed to achieve real-time imaging for all the supported modes. Fabricated in 40-nm CMOS technology, the proposed processor integrates 4.7M logic gates in a core area of 3.24 mm². This work achieves a 20.3× higher beamforming rate with 5.3-to-29.1× lower power consumption than the state-of-the-art design. It also has 60% lower hardware complexity (in terms of gate count), in addition to its support for the advanced modes.
-
An Electrochemical CMOS Biosensor Array Using Phase-Only Modulation With 0.035% Phase Error and In-Pixel Averaging
Aditi Jain, Saeromi Chung, Eliah Aronoff Spencer, Drew A. Hall
Keywords: Electrodes; Impedance; DNA; Biosensors; Phase modulation; Integrated circuit modeling; Phased arrays; Phase Error; Electrochemical Biosensors; Phase-only Modulator; Biosensor Array; Frequency Range; Electrochemical Impedance Spectroscopy; Phase Detection; Digital Circuits; CMOS Process; Transimpedance Amplifier; Readout Time; Phase Change; Duty Cycle; Aptamer; Square Wave; Electrochemical Cell; Inductor Current; Parasitic Capacitance; Phase-locked Loop; Bode Plots; Reference Pixels; Time-to-digital Converter; Saline Sodium Citrate Buffer; Frequency Range Of Interest; Poles And Zeros; Flicker Noise; Electrostatic Discharge; Sense Amplifier; Equivalent Impedance; Dead Zone; Point-of-care (PoC); electrochemical impedance spectroscopy (EIS); biosensor array; phase-to-digital converter; Biosensing Techniques; Dielectric Spectroscopy; Equipment Design; Semiconductors; Humans
Abstracts: This paper presents a 16 × 20 CMOS biosensor array based on electrochemical impedance spectroscopy (EIS), a highly sensitive label-free technique for rapid disease detection at the point of care. This high-density system implements polar-mode detection with phase-only EIS measurement over a 5 kHz to 1 MHz frequency range. The design features predominantly digital readout circuitry, ensuring scalability with technology, along with a load-compensated transimpedance amplifier, all within a 140 × 140 µm² pixel. The architecture enables in-pixel digitization and accumulation, which increases the SNR by 10 dB for each 10× increase in readout time. Implemented in a 180 nm CMOS process, the 3 × 4 mm² chip achieves state-of-the-art performance, with an rms phase error of 0.035% at 100 kHz through a duty-cycle-insensitive phase detector, and one of the smallest per-pixel areas with in-pixel quantization.
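The "10 dB for each 10× increase in readout time" figure in the abstract above matches the textbook behaviour of averaging uncorrelated noise, where the gain is 10·log10(N) dB for N times more accumulation. The small helper below sketches that relation. The function name is hypothetical, and the white-noise assumption is ours rather than a statement about the chip's measured noise.

```python
import math


def averaging_snr_gain_db(readout_time_ratio):
    """SNR gain in dB from accumulating over a longer readout time.

    Assumes uncorrelated (white) noise, so noise power drops in proportion
    to the accumulation length while signal power stays constant.
    """
    if readout_time_ratio <= 0:
        raise ValueError("readout_time_ratio must be positive")
    return 10.0 * math.log10(readout_time_ratio)
```

Under this assumption, a 100× longer readout would give a 20 dB gain. Correlated noise such as flicker noise would reduce the benefit.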
-
High-Performance Method and Architecture for Attention Computation in DNN Inference
Qi Cheng, Xiaofang Hu, He Xiao, Yue Zhou, Shukai Duan
Keywords: Computer architecture; Hardware; Circuits; Memristors; Matrix converters; In-memory computing; Complexity theory; DNN Inference; Neural Network; Energy Efficiency; Attention Mechanism; Projection Matrix; Integrated Density; Hardware Architecture; Field Of Medical Imaging; High Resource Consumption; Power Consumption; Invertible; Matrix Form; Matrix Multiplication; Softmax Function; Input Voltage; Low Power Consumption; Circuit Simulation; MNIST Dataset; Circuit Performance; Crossbar Array; Exponentiation Operations; Modulation Of Circuits; Pre-trained Weights; Attention Operation; Matrix Multiplication Operation; DNN Model; Stable Output; Hardware Accelerators; Circuit Architecture; Attention; compute-in-memory; multiplication; on-line programming; multiply-and-accumulate; accelerator; Neural Networks, Computer; Humans; Deep Learning; Algorithms; Attention
Abstracts: In recent years, the combination of the Attention mechanism and deep learning has found a wide range of applications in the field of medical imaging. However, because of the complexity of the Attention computation, existing hardware architectures suffer from high resource consumption or low accuracy, and deploying Attention efficiently on DNN accelerators remains a challenge. This paper proposes an online-programmable Attention hardware architecture based on a compute-in-memory (CIM) macro, which reduces the complexity of Attention in hardware and improves integration density, energy efficiency, and calculation accuracy. First, the Attention computation process is decomposed into multiple cascaded combinatorial matrix operations to reduce the complexity of its hardware implementation. Second, to reduce the influence of non-ideal hardware characteristics, an online-programmable CIM architecture is designed that improves calculation accuracy by dynamically adjusting the weights. Lastly, SPICE simulation verifies that the proposed Attention hardware architecture can be applied to the inference of deep neural networks. Based on a 100 nm CMOS process, and compared with traditional Attention hardware architectures, the integration density and energy efficiency are increased by at least 91.38 times, and latency and computing efficiency are improved by about 12.5 times.
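For readers unfamiliar with the computation being accelerated, the sketch below expresses standard scaled dot-product attention as a cascade of matrix steps in plain Python. It is the textbook formulation, softmax(QKᵀ/√d)·V, and not the paper's specific decomposition, which the abstract does not spell out. All function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Scaled dot-product attention: scores, softmax, then weighted sum."""
    d = len(q[0])                                  # feature dimension of each query
    k_t = [list(col) for col in zip(*k)]           # transpose of K
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)
```

The two `matmul` calls are multiply-and-accumulate workloads of the kind a CIM array targets, while the exponentiation inside `softmax_rows` is the step that is typically hardest to map to such hardware.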
-
A 2m-Range 711μW Body Channel Communication Transceiver Featuring Dynamically-Sampling Bias-Free Interface Front End
Guanjie Gu, Changgui Yang, Jian Zhao, Sijun Du, Yuxuan Luo, Bo Zhao
Keywords: Impedance; Receivers; Transceivers; Wireless communication; Circuits; Propagation losses; Power demand; Front End; High Resistance; Power Consumption; Body Surface; Signal Transmission; Wearable Devices; Signal Loss; Communication Range; High Input Impedance; Data Rate; High Gain; Carrier Frequency; Inverter; Circuit Model; Bit Error Rate; Pair Of Electrodes; Bitstream; Parasitic Capacitance; Capacitive Coupling; Phase-locked Loop; Wireless Body Area Networks; Long-range Transmission; Channel Loss; Signal-to-interference Ratio; Envelope Detector; Intermediate Frequency Signal; MHz Frequency; Chip Area; DC Bias; Path Loss; Body channel communication (BCC); interface front end (IFE); communication range; low power; input impedance; Humans; Wireless Technology; Equipment Design; Wearable Electronic Devices; Signal Processing, Computer-Assisted
Abstracts: Body Channel Communication (BCC) utilizes the body surface as a low-loss signal transmission medium, reducing the power consumption of wireless wearable devices. However, the effective communication range on the human body is limited in state-of-the-art BCC transceivers, where the signal loss between the body surface and the BCC receiver remains one of the main bottlenecks. To reduce this interface loss, the BCC receiver needs a high input impedance, but DC-biasing circuits decrease the input impedance. In this work, a dynamically-sampling interface front end (IFE) is proposed to eliminate the DC voltage bias, resulting in a high input impedance of 90 kΩ and an RF-IF conversion gain of 94 dB to reduce the interface loss in long-range BCC applications. The BCC transceiver chip is fabricated in a 55 nm CMOS process and occupies a die area of 0.123 mm². Measured results show that the chip extends the BCC range to 2 m for both the forward and backward paths, with the transmitter and receiver consuming 711 µW in total.
-
RRAM-Based Spiking Neural Network With Target-Modulated Spike-Timing-Dependent Plasticity
Kalkidan Deme Muleta, Bai-Sun Kong
Keywords: Neurons; Firing; Accuracy; Timing; Synapses; Spiking neural networks; Hardware; Spiking Neural Networks; Spike-timing-dependent Plasticity; Network Size; Learning Rule; Temporal Coding; Extract Representative Features; Smaller Network Size; Energy Efficiency; Long-term Potentiation; Neuronal Firing; Complex Datasets; Output Neurons; Long-term Depression; Postsynaptic Neurons; Spike Times; Presynaptic Neurons; Lateral Inhibition; Synaptic Weights; Target Neurons; Hand Gestures; Postsynaptic Spike; Spatiotemporal Datasets; Reward Signal; SET Pulse; Timing Diagram; Pulse Program; Conventional Work; Time Step; Target Class; Synaptic Plasticity; STDP; RSTDP; SNN; neuromorphic networks; memristor; RRAM; ReSuMe; TSTDP; Neural Networks, Computer; Neuronal Plasticity; Humans; Action Potentials; Neurons; Models, Neurological; Algorithms
Abstracts: Training a spiking neural network (SNN) with spike-timing-dependent plasticity (STDP) for image classification usually requires a large number of neurons to extract representative features and/or an external classifier. Conventional bio-inspired learning methods do not cover all possible learning opportunities, resulting in limited performance. We propose a new bio-plausible learning rule, target-modulated STDP (TSTDP), for higher learning efficiency and accuracy. We also propose an SNN architecture trainable with TSTDP that uses temporally encoded spikes to obtain higher accuracy and improved area efficiency without an external classifier. On the MNIST dataset, the proposed design achieves an accuracy of 92%, an improvement of up to 7% over conventional networks of similar size. At similar accuracy, the network size is up to 75% smaller, and the design also demonstrates stronger resilience to process variations. Benchmarking on the CIFAR-10 and neuromorphic DVS gesture datasets shows accuracy improvements of up to 12.4% and 3.6%, respectively.
-
A Power-Efficient Envelope-Detector-Less Amplitude-Shift-Keying Forward Telemetry for Wirelessly Powered Biomedical Devices
Hyun-Su Lee, Hyung-Min Lee
Keywords: Demodulation; Amplitude shift keying; Telemetry; Envelope detectors; Regulators; Resonant frequency; Voltage control; Wireless Power; Data Rate; Resonance Frequency; Pulse Width; Types Of Pain; Bit Error Rate; Bit Error; Digital Control; Least Significant Bit; Chip Area; Forward Data; Silicon Area; Envelope Detector; Lower Bit Error Rate; Power Consumption; Adaptive Method; Inverter; Power Transfer; Magnetic Flux; Wireless Power Transfer; Comparable Yields; Phase Shift Keying; Clock Generator; Shift Keying; OR Gate; Digital Block; Setup Time; Bias Current; Resonant Capacitor; Amplitude shift keying; comparator-less; digital cleaner; downlink; envelope-detector-less; forward telemetry; power path less; wireless power/data transfer; Telemetry; Wireless Technology; Equipment Design; Signal Processing, Computer-Assisted; Electric Power Supplies; Humans
Abstracts: This paper proposes an envelope-detector-less (EDL) amplitude-shift-keying (ASK) forward telemetry (FT) demodulator for wireless power/data transfer (WPDT) systems. The EDL ASK FT demodulator replaces the bulky and power-hungry components of the conventional ASK FT demodulator, namely the envelope detector and the analog comparator, with a digital controller, reducing both power dissipation and chip area. The proposed demodulator shares the gate control signals of the pass transistors, which are used in an ac-dc regulator for wireless power reception, to maintain a constant load voltage while efficiently demodulating the forward telemetry data. In addition, a proposed digital cleaner in the EDL demodulator refines this control signal into a wide pulse without suffering from resonant-frequency noise, while a synchronizer aligns its frequency with the data rate and the resonant frequency. The 0.25-µm CMOS prototype chip of the proposed power-path-less EDL ASK FT demodulator, equipped with the ac-dc regulator, demonstrates a 38.2% reduction in power dissipation compared to the conventional ASK FT demodulator. Moreover, the EDL ASK FT demodulator occupies only 0.023 mm² of silicon area and achieves a bit error rate (BER) below 10⁻⁴ while maintaining a regulated voltage of 4.5 V on the load.
-
A Wearable Dual-Mode Probe for Image-Guided Closed-Loop Ultrasound Neuromodulation
Junjun Huan, Vida Pashaei, Steve J. A. Majerus, Swarup Bhunia, Soumyajit Mandal
Keywords: Probes; Ultrasonic imaging; Transducers; Modulation; Neuromodulation; Acoustic arrays; Substrates; Dual-mode Probe; Blood Vessels; Actuator; Ultrasound Imaging; Firing Rate; Position Error; Printed Circuit Board; Inter-subject Variability; Flexible Electronics; Template Matching; Array Transducer; Array Images; Tibial Nerve; B-mode Images; Piezoelectric Transducer; Muscle Twitch; Focused Ultrasound; Target Nerve; Wearable Devices; Closed-loop System; Flexible Array; Sound Pressure; Axial Resolution; Phantom Surface; Steering Angle; Inertial Measurement Unit; Acoustic Impedance; Neuronal Firing; Closed-loop Feedback; Ultrasound neuromodulation; image-guided therapy; closed-loop control; functional feedback; Humans; Phantoms, Imaging; Wearable Electronic Devices; Ultrasonography; Equipment Design; Transducers; Algorithms
Abstracts: Low-intensity focused ultrasound (FUS) is an emerging non-invasive and spatially/temporally precise method for modulating the firing rates and patterns of peripheral nerves. This paper describes an image-guided platform for chronic and patient-specific FUS neuromodulation. The system uses custom wearable probes containing separate ultrasound imaging and modulation transducer arrays, realized using piezoelectric transducers assembled on a flexible printed circuit board (PCB). Dual-mode probes operating around 4 MHz (imaging) and 1.3 MHz (modulation) were fabricated and tested on tissue phantoms. The resulting B-mode images were analyzed using a template-matching algorithm to estimate the location of the target nerve and then direct the modulation beam toward the target. The ultrasound transmit voltage used to excite the modulation array was optimized in real time by automatically regulating functional feedback signals (the average rates of emulated muscle twitches detected by an on-board motion sensor) through a proportional-integral (PI) controller, thus providing robustness to inter-subject variability and probe positioning errors. The proposed closed-loop neuromodulation paradigm was experimentally demonstrated in vitro using an active tissue phantom that integrates models of the posterior tibial nerve and nearby blood vessels together with embedded sensors and actuators.
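The closed-loop regulation described in the abstract above can be pictured with a generic discrete PI controller. In the sketch below, the measured twitch rate would be the feedback signal and the output would be the transmit voltage. The class name, gains, output limits, and the simple anti-windup rule are all assumptions for illustration, since the abstract gives no controller parameters.

```python
class PIController:
    """Discrete proportional-integral controller with output clamping."""

    def __init__(self, kp, ki, out_min, out_max):
        self.kp = kp              # proportional gain (hypothetical value)
        self.ki = ki              # integral gain (hypothetical value)
        self.out_min = out_min    # lowest allowed output, e.g. 0 V
        self.out_max = out_max    # highest allowed output, e.g. a safety limit
        self.integral = 0.0

    def update(self, target, measured, dt):
        """Return the new output for one control step of length dt."""
        error = target - measured
        candidate = self.integral + error * dt
        raw = self.kp * error + self.ki * candidate
        clamped = min(self.out_max, max(self.out_min, raw))
        if raw == clamped:
            # Accumulate the integral only when the output is not saturated
            # (a simple anti-windup rule).
            self.integral = candidate
        return clamped
```

Clamping the output matters in a setting like this one because the drive voltage has a hard upper bound, and freezing the integral during saturation avoids a large overshoot once the feedback signal recovers.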
-
A 62.2dB SNDR Event-Driven Level-Crossing ADC With SAR-Assisted Delay Compensation Loop for Time-Sparse Biomedical Signal Acquisition
Mengyu Li, Yi Huo, Shuang Song, Wanyuan Qu, Le Ye, Menglian Zhao, Zhichao Tan
Keywords: Delays; Signal resolution; Time-frequency analysis; Signal to noise ratio; Jitter; Clocks; Power demand; Biomedical Signals; Sampling Rate; Biomedical Applications; Source Code; Power Consumption; Neural Spike; Electromyogram Signals; Time Domain; Large Errors; Input Signal; Internet Of Things; System Architecture; Analog-to-digital Converter; Quantization Error; Low Power Consumption; Network Input; Minimum Power; Signal Bandwidth; Pulse Signal; Time Ti; Amplitude Error; Dynamic Power; Bluetooth Low Energy; Circuit Implementation; Shift Register; Phase Tracking; Digital Circuits; Output Stage; Threshold Level; Level-crossing (LC); event-driven; time-sparse signal; data compression; asynchronous; Internet of Things (IoT); Signal Processing, Computer-Assisted; Humans; Electrocardiography; Electromyography; Algorithms
Abstracts: This paper proposes an event-driven, clockless level-crossing ADC (LC-ADC) suitable for biomedical applications. Thanks to the LC loop, the sampling rate of the converter automatically adapts to the input activity. Activity-dependent power consumption and data compression can thus be realized, saving system power, especially during time-sparse signal acquisition. Meanwhile, a SAR-assisted loop is exploited to resolve the loop-delay-induced distortion of conventional LC-ADCs. The resolution and power efficiency of the LC-ADC are therefore improved while the event-driven feature is maintained. Implemented in a 55 nm process, the proposed LC-ADC achieves scalable power consumption and a peak SNDR of 62.2 dB for a 20 kHz input. It also achieves a Walden FoM of 29.7 fJ/conv.-step and a Schreier FoM of 158.6 dB, which is best in class, without using off-chip calibration. Sub-µW power is realized when the input frequency is below 1.5 kHz. The proposed LC-ADC is also verified with simulated electrocardiogram (ECG), neural spike, and electromyogram (EMG) signals. It provides ~7× data compression for ECG input, making it an attractive solution for time-sparse signal acquisition in biomedical applications.
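To make the activity-dependent sampling idea concrete, here is a minimal behavioural sketch of level-crossing sampling: an event is emitted only when the input moves a full quantization level away from the last tracked level. It is an idealized software model with hypothetical names. It does not model loop delay, comparator hysteresis, or the SAR-assisted compensation loop described in the abstract.

```python
import math


def level_crossing_sample(samples, lsb):
    """Return (index, direction) events for each level crossing.

    samples is a densely sampled stand-in for the analog input and lsb is
    the spacing between adjacent levels. direction is +1 for an upward
    crossing and -1 for a downward crossing.
    """
    if not samples:
        return []
    events = []
    level = math.floor(samples[0] / lsb) * lsb  # starting reference level
    for i, x in enumerate(samples):
        while x >= level + lsb:   # input rose by at least one level
            level += lsb
            events.append((i, +1))
        while x <= level - lsb:   # input fell by at least one level
            level -= lsb
            events.append((i, -1))
    return events
```

A flat input produces no events at all, which is where the data compression for time-sparse signals such as ECG comes from: the ratio of input samples to emitted events grows as the signal becomes quieter.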