| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-22 | Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation | Weiting Tan et.al. | 2508.16188 | link |
| 2025-08-21 | QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection | Zhiyu Wu et.al. | 2508.15931 | null |
| 2025-08-21 | Abelian integrals for polynomials with trivial global monodromy on | Jesús Muciño-Raymundo et.al. | 2508.15925 | null |
| 2025-08-21 | Any-to-any Speaker Attribute Perturbation for Asynchronous Voice Anonymization | Liping Chen et.al. | 2508.15565 | null |
| 2025-08-24 | Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets | Chenlin Liu et.al. | 2508.15442 | null |
| 2025-08-21 | UniCoM: A Universal Code-Switching Speech Generator | Sangmin Lee et.al. | 2508.15244 | link |
| 2025-08-25 | Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization | Rui Wang et.al. | 2508.14947 | null |
| 2025-08-20 | Long-Context Speech Synthesis with Context-Aware Memory | Zhipeng Li et.al. | 2508.14713 | null |
| 2025-08-20 | Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement | Heitor R. Guimarães et.al. | 2508.14709 | null |
| 2025-08-22 | Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS | Can Jin et.al. | 2508.14313 | null |
| 2025-08-19 | Exponential Ergodicity for McKean-Vlasov SDEs with Singular Interactions | Xing Huang et.al. | 2508.13924 | null |
| 2025-08-20 | DiffIER: Optimizing Diffusion Models with Iterative Error Reduction | Ao Chen et.al. | 2508.13628 | null |
| 2025-08-19 | Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM | Dariia Puhach et.al. | 2508.13603 | null |
| 2025-08-18 | A Surveillance Based Interactive Robot | Kshitij Kavimandan et.al. | 2508.13319 | link |
| 2025-08-18 | MrMARTIAN: A Multi-resolution Mass Reconstruction Algorithm Combining Free-form and Analytic Components | Sangjun Cha et.al. | 2508.13262 | null |
| 2025-08-18 | Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis | Zhu Li et.al. | 2508.13028 | null |
| 2025-08-18 | Cooperative Sensing-Assisted Predictive Beam Tracking for MIMO-OFDM Networked ISAC Systems | Xiaoyu Yang et.al. | 2508.12723 | null |
| 2025-08-18 | Real-Time Sign Language Gestures to Speech Transcription using Deep Learning | Brandone Fonya et.al. | 2508.12713 | null |
| 2025-08-19 | FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts | Qingliang Meng et.al. | 2508.12001 | null |
| 2025-08-16 | SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System | Truong Thanh Hung Nguyen et.al. | 2508.11873 | null |
| 2025-08-15 | MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts | Heyang Xue et.al. | 2508.11326 | null |
| 2025-08-15 | EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens | Joonyong Park et.al. | 2508.11273 | null |
| 2025-08-14 | Towards high-precision inspiral gravitational waveforms from binary neutron star mergers in numerical relativity | Kenta Kiuchi et.al. | 2508.10981 | null |
| 2025-08-14 | Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform | Yuankun Xie et.al. | 2508.10559 | null |
| 2025-08-14 | Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning | Yejin Jeon et.al. | 2508.10412 | null |
| 2025-08-14 | Towards Frame-level Quality Predictions of Synthetic Speech | Michael Kuhlmann et.al. | 2508.10374 | link |
| 2025-08-13 | Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions | Tina Raissi et.al. | 2508.09868 | null |
| 2025-08-13 | UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech | Shuhei Kato et.al. | 2508.09767 | null |
| 2025-08-13 | | Boyu Zhu et.al. | 2508.09702 | null |
| 2025-08-12 | ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs | Eray Eren et.al. | 2508.09389 | null |
| 2025-07-21 | Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models | Kaiyan Chang et.al. | 2507.15512 | null |
| 2025-07-21 | Lunar and Terrestrial Time Transformation Based on the Principle of General Relativity | Min Liu et.al. | 2507.15456 | null |
| 2025-07-21 | A2TTS: TTS for Low Resource Indian Languages | Ayush Singh Bhadoriya et.al. | 2507.15272 | null |
| 2025-07-21 | EchoVoices: Preserving Generational Voices and Memories for Seniors and Children | Haiying Xu et.al. | 2507.15221 | null |
| 2025-07-20 | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Yuanhan Zhang et.al. | 2507.15028 | null |
| 2025-07-22 | Hear Your Code Fail, Voice-Assisted Debugging for Python | Sayed Mahbub Hasan Amiri et.al. | 2507.15007 | null |
| 2025-07-20 | DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis | Yinghao Aaron Li et.al. | 2507.14988 | null |
| 2025-07-20 | MUR: Momentum Uncertainty guided Reasoning for Large Language Models | Hang Yan et.al. | 2507.14958 | null |
| 2025-07-20 | FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing | Shoutao Guo et.al. | 2507.14815 | null |
| 2025-07-18 | Inflated hot Jupiters: inferring average atmospheric velocity via Ohmic models coupled with internal dynamo evolution | Daniele Viganò et.al. | 2507.13991 | null |
| 2025-07-18 | Charged lepton flavor violating decays with a pair of light dark matter and muonium invisible decay | Sahabub Jahedi et.al. | 2507.13876 | null |
| 2025-07-17 | A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models | Kirill Borodin et.al. | 2507.13563 | null |
| 2025-07-17 | NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech | Maksim Borisov et.al. | 2507.13155 | null |
| 2025-07-17 | Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication | Tianyu Song et.al. | 2507.13052 | null |
| 2025-07-17 | Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes | Zhou Feng et.al. | 2507.12932 | null |
| 2025-07-16 | Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations | Yichen Han et.al. | 2507.12197 | null |
| 2025-07-16 | EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis | Haoxun Li et.al. | 2507.12015 | null |
| 2025-07-17 | Comprehensive investigation on baryon number violating nucleon decays involving an axion-like particle | Wei-Qi Fan et.al. | 2507.11844 | null |
| 2025-07-15 | Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection | Ivan Viakhirev et.al. | 2507.11777 | null |
| 2025-07-15 | P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge | Marvin Sach et.al. | 2507.11306 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-22 | Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment | Youjia Zhang et.al. | 2508.15568 | null |
| 2025-08-21 | DualMark: Identifying Model and Training Data Origins in Generated Audio | Xuefeng Yang et.al. | 2508.15521 | null |
| 2025-08-19 | MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence | Sonal Kumar et.al. | 2508.13992 | null |
| 2025-08-19 | DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer | Yisu Liu et.al. | 2508.13786 | null |
| 2025-08-21 | FoleySpace: Vision-Aligned Binaural Spatial Audio Generation | Lei Zhao et.al. | 2508.12918 | null |
| 2025-08-18 | TTA-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions | Dongjae Jeon et.al. | 2508.12690 | null |
| 2025-08-15 | Pretrained Conformers for Audio Fingerprinting and Retrieval | Kemal Altwlkany et.al. | 2508.11609 | null |
| 2025-08-14 | LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters | Haomin Zhang et.al. | 2508.11074 | null |
| 2025-08-14 | A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation | Jiulin Li et.al. | 2508.10494 | null |
| 2025-08-13 | TOTNet: Occlusion-Aware Temporal Tracking for Robust Ball Detection in Sports Videos | Hao Xu et.al. | 2508.09650 | null |
| 2025-08-12 | QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems | Chien-Chun Wang et.al. | 2508.08957 | null |
| 2025-08-20 | MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling | Qian Wang et.al. | 2508.08487 | null |
| 2025-08-11 | Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization | Nicholas Klein et.al. | 2508.08141 | null |
| 2025-08-11 | Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models | Khanh-Binh Nguyen et.al. | 2508.07570 | null |
| 2025-08-08 | MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows | Xiquan Li et.al. | 2508.06098 | null |
| 2025-08-08 | DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching | Wei Chen et.al. | 2508.05978 | null |
| 2025-07-22 | TTMBA: Towards Text To Multiple Sources Binaural Audio Generation | Yuxuan He et.al. | 2507.16564 | null |
| 2025-07-21 | An Investigation of Test-time Adaptation for Audio Classification under Background Noise | Weichuang Shao et.al. | 2507.15523 | null |
| 2025-07-18 | CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation | Marc Lafon et.al. | 2507.14312 | null |
| 2025-07-16 | Evaluation of Neural Surrogates for Physical Modelling Synthesis of Nonlinear Elastic Plates | Carlos De La Vega Martin et.al. | 2507.12563 | null |
| 2025-07-16 | Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations | Yichen Han et.al. | 2507.12197 | null |
| 2025-07-16 | GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models | Zhaohong Huang et.al. | 2507.11969 | null |
| 2025-07-14 | DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis | Wenjie Tian et.al. | 2507.10109 | null |
| 2025-07-14 | Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction | Shu-wen Yang et.al. | 2507.09834 | null |
| 2025-07-13 | Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations | Yiwen Liang et.al. | 2507.09500 | null |
| 2025-07-11 | Monitoring Risks in Test-Time Adaptation | Mona Schirmer et.al. | 2507.08721 | null |
| 2025-07-11 | BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis | Shuang Cui et.al. | 2507.08607 | null |
| 2025-07-11 | FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation | Yuxuan Jiang et.al. | 2507.08557 | null |
| 2025-07-11 | MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling | Jingjing Tang et.al. | 2507.08530 | null |
| 2025-07-10 | Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement | Xiao Yang et.al. | 2507.07908 | null |
| 2025-07-10 | Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos | Hao Xu et.al. | 2507.07381 | null |
| 2025-07-09 | Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM | Qiyuan Dai et.al. | 2507.06973 | null |
| 2025-07-09 | Physics-Informed Direction-Aware Neural Acoustic Fields | Yoshiki Masuyama et.al. | 2507.06826 | null |
| 2025-07-13 | Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation | Yingshan Liang et.al. | 2507.04959 | null |
| 2025-07-05 | MMMOS: Multi-domain Multi-axis Audio Quality Assessment | Yi-Cheng Lin et.al. | 2507.04094 | null |
| 2025-07-04 | Dynamic Multimodal Prototype Learning in Vision-Language Models | Xingyu Zhu et.al. | 2507.03657 | null |
| 2025-07-03 | F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning | Wei Li et.al. | 2507.02437 | null |
| 2025-07-03 | Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation | Feizhen Huang et.al. | 2507.02271 | null |
| 2025-07-02 | Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation | Andrei Jelea et.al. | 2507.01347 | null |
| 2025-07-01 | AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences | Minoru Kishi et.al. | 2507.00475 | null |
| 2025-07-01 | Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation | Jizhou Han et.al. | 2507.00462 | null |
| 2025-06-30 | The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models | Lijun Sheng et.al. | 2506.24000 | null |
| 2025-06-30 | Scaling Self-Supervised Representation Learning for Symbolic Piano Performance | Louis Bradshaw et.al. | 2506.23869 | null |
| 2025-06-30 | When Small Guides Large: Cross-Model Co-Learning for Test-Time Adaptation | Chang'an Yi et.al. | 2506.23724 | null |
| 2025-06-30 | RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio | Yusuke Kanamori et.al. | 2506.23582 | null |
| 2025-06-30 | Human-CLAP: Human-perception-based contrastive language-audio pretraining | Taisei Takano et.al. | 2506.23553 | null |
| 2025-06-27 | SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition | Muhammad Umar Farooq et.al. | 2506.22143 | null |
| 2025-06-28 | ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Huadai Liu et.al. | 2506.21448 | null |
| 2025-06-27 | Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance | Akio Hayakawa et.al. | 2506.20995 | null |
| 2025-06-24 | Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation | Jun Wang et.al. | 2506.19774 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-19 | InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing | Shaoshu Yang et.al. | 2508.14033 | null |
| 2025-08-21 | FoleySpace: Vision-Aligned Binaural Spatial Audio Generation | Lei Zhao et.al. | 2508.12918 | null |
| 2025-08-14 | LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters | Haomin Zhang et.al. | 2508.11074 | null |
| 2025-08-12 | Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization | Chaoqun Cui et.al. | 2508.08550 | null |
| 2025-07-14 | DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis | Wenjie Tian et.al. | 2507.10109 | null |
| 2025-07-13 | Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation | Yingshan Liang et.al. | 2507.04959 | null |
| 2025-06-23 | Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions | Vineet Kumar Rakesh et.al. | 2507.02900 | null |
| 2025-07-03 | Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation | Feizhen Huang et.al. | 2507.02271 | null |
| 2025-06-23 | IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech | Siyi Zhou et.al. | 2506.21619 | null |
| 2025-06-28 | ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Huadai Liu et.al. | 2506.21448 | null |
| 2025-06-27 | Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance | Akio Hayakawa et.al. | 2506.20995 | null |
| 2025-06-24 | Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation | Jun Wang et.al. | 2506.19774 | null |
| 2025-06-13 | ViSAGe: Video-to-Spatial Audio Generation | Jaeyeon Kim et.al. | 2506.12199 | null |
| 2025-05-31 | Length Aware Speech Translation for Video Dubbing | Harveen Singh Chadha et.al. | 2506.00740 | null |
| 2025-05-26 | Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks | Chang Liu et.al. | 2505.20038 | link |
| 2025-05-22 | SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet | Zhi Zhong et.al. | 2505.16195 | null |
| 2025-05-30 | TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis | Yu Zhang et.al. | 2505.14910 | link |
| 2025-05-28 | Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model | Yong Ren et.al. | 2505.13062 | null |
| 2025-06-03 | OmniAudio: Generating Spatial Audio from 360-Degree Video | Huadai Liu et.al. | 2504.14906 | link |
| 2025-04-17 | CAFA: a Controllable Automatic Foley Artist | Roi Benita et.al. | 2504.06778 | link |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-22 | LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence | Alisa Vinogradova et.al. | 2508.16571 | null |
| 2025-08-21 | Evolving k-Threshold Visual Cryptography Schemes | Xiaoli Zhuo et.al. | 2508.15917 | null |
| 2025-08-20 | Maximum Size of a Uniform Family with Bounded VC-dimension | Tianchi Yang et.al. | 2508.14334 | null |
| 2025-08-20 | Fortifying the Agentic Web: A Unified Zero-Trust Architecture Against Logic-layer Threats | Ken Huang et.al. | 2508.12259 | null |
| 2025-08-13 | Perturbed Public Voices (P | Chongyang Gao et.al. | 2508.10949 | null |
| 2025-08-13 | Regularity for hypergraphs with bounded VC | Lior Gishboliner et.al. | 2508.09969 | null |
| 2025-08-11 | Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations | Ryo Aihara et.al. | 2508.08399 | null |
| 2025-08-10 | Scalable Controllable Accented TTS | Henry Li Xinyuan et.al. | 2508.07426 | null |
| 2025-08-09 | Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody | Jinsung Yoon et.al. | 2508.06890 | null |
| 2025-08-08 | DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching | Wei Chen et.al. | 2508.05978 | null |
| 2025-08-07 | Grouped k-threshold random grid-based visual cryptography scheme | Xiaoli Zhuo et.al. | 2508.05394 | null |
| 2025-08-15 | Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS | M Anuprabha et.al. | 2508.05102 | null |
| 2025-08-08 | REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers | Yuepeng Jiang et.al. | 2508.04996 | null |
| 2025-08-14 | Marco-Voice Technical Report | Fengping Tian et.al. | 2508.02038 | null |
| 2025-07-23 | Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion | Yu Zhang et.al. | 2507.14534 | null |
| 2025-07-17 | Computational-Statistical Tradeoffs from NP-hardness | Guy Blanc et.al. | 2507.13222 | null |
| 2025-07-17 | Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries | Minyoung Kim et.al. | 2507.12723 | null |
| 2025-07-15 | Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection | Ivan Viakhirev et.al. | 2507.11777 | null |
| 2025-07-16 | Multipass Linear Sketches for Geometric LP-Type Problems | N. Efe Çekirge et.al. | 2507.11484 | null |
| 2025-07-15 | On Tight Robust Coresets for | Lingxiao Huang et.al. | 2507.11260 | null |
| 2025-07-15 | Pronunciation Deviation Analysis Through Voice Cloning and Acoustic Comparison | Andrew Valdivia et.al. | 2507.10985 | null |
| 2025-07-12 | Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning | Dominika Woszczyk et.al. | 2507.09310 | null |
| 2025-07-11 | SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment | Shivam Mehta et.al. | 2507.09070 | null |
| 2025-07-11 | Detecting Deepfake Talking Heads from Facial Biometric Anomalies | Justin D. Norman et.al. | 2507.08917 | null |
| 2025-07-11 | On Fair Epsilon Net and Geometric Hitting Set | Mohsen Dehghankar et.al. | 2507.08758 | null |
| 2025-07-08 | On the pointwise and sup-norm errors for local regression estimators | Jérémy Bettinger et.al. | 2507.07132 | null |
| 2025-07-09 | Speech Tokenizer is Key to Consistent Representation | Wonjin Jung et.al. | 2507.06802 | null |
| 2025-07-07 | Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters | Mathilde Abrassart et.al. | 2507.04817 | null |
| 2025-07-06 | TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet | Jaeseok Jeong et.al. | 2507.04349 | null |
| 2025-07-04 | Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion | Lea Fischbach et.al. | 2507.03641 | null |
| 2025-07-04 | Going Beyond Surfaces in Diameter Approximation | Michał Włodarczyk et.al. | 2507.03447 | null |
| 2025-07-03 | De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks | Wei Fan et.al. | 2507.02606 | null |
| 2025-07-03 | Open-Source System for Multilingual Translation and Cloned Speech Synthesis | Mateo Cámara et.al. | 2507.02530 | null |
| 2025-07-03 | JoyTTS: LLM-based Spoken Chatbot With Voice Cloning | Fangru Zhou et.al. | 2507.02380 | null |
| 2025-07-02 | Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis | Marc-André Carbonneau et.al. | 2507.02176 | null |
| 2025-07-02 | Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora | Hitoshi Suda et.al. | 2507.01356 | null |
| 2025-07-01 | MuteSwap: Silent Face-based Voice Conversion | Yifan Liu et.al. | 2507.00498 | null |
| 2025-06-26 | Avatars and Environments for Meetings in Social VR: What Styles and Choices Matter to People in Group Creativity Tasks? | Anya Osborne et.al. | 2506.21780 | null |
| 2025-06-23 | Selecting N-lowest scores for training MOS prediction models | Yuto Kondo et.al. | 2506.18326 | null |
| 2025-06-23 | Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting | Yuto Kondo et.al. | 2506.18307 | null |
| 2025-06-23 | JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles | Yuto Kondo et.al. | 2506.18296 | null |
| 2025-06-12 | RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding | Yisi Liu et.al. | 2506.10289 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-22 | Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation | Chun-Peng Chang et.al. | 2508.16512 | null |
| 2025-08-25 | OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models | Huanpeng Chu et.al. | 2508.16212 | null |
| 2025-08-22 | Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers | Shikang Zheng et.al. | 2508.16211 | null |
| 2025-08-21 | Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning | Yijun Liu et.al. | 2508.15874 | null |
| 2025-08-21 | CineScale: Free Lunch in High-Resolution Cinematic Visual Generation | Haonan Qiu et.al. | 2508.15774 | null |
| 2025-08-21 | Scaling Group Inference for Diverse and High-Quality Generation | Gaurav Parmar et.al. | 2508.15773 | null |
| 2025-08-21 | Waver: Wave Your Way to Lifelike Video Generation | Yifu Zhang et.al. | 2508.15761 | null |
| 2025-08-21 | WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception | Zhiheng Liu et.al. | 2508.15720 | null |
| 2025-08-21 | VideoEraser: Concept Erasure in Text-to-Video Diffusion Models | Naen Xu et.al. | 2508.15314 | null |
| 2025-08-20 | AnchorSync: Global Consistency Optimization for Long Video Editing | Zichi Liu et.al. | 2508.14609 | null |
| 2025-08-20 | Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration | Haoran Bai et.al. | 2508.14483 | null |
| 2025-08-20 | DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing | Weitao Wang et.al. | 2508.14465 | null |
| 2025-08-20 | MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation | Guile Wu et.al. | 2508.14327 | null |
| 2025-08-19 | InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing | Shaoshu Yang et.al. | 2508.14033 | null |
| 2025-08-19 | Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment | Samuel Seligardi et.al. | 2508.13989 | null |
| 2025-08-19 | Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing | Feng-Lin Liu et.al. | 2508.13797 | null |
| 2025-08-18 | 4DNeX: Feed-Forward 4D Generative Modeling Made Easy | Zhaoxi Chen et.al. | 2508.13154 | null |
| 2025-08-18 | Precise Action-to-Video Generation Through Visual Action Prompts | Yuang Wang et.al. | 2508.13104 | null |
| 2025-08-18 | EgoTwin: Dreaming Body and View in First Person | Jingqiao Xiu et.al. | 2508.13013 | null |
| 2025-08-18 | Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model | Xianglong He et.al. | 2508.13009 | null |
| 2025-08-18 | Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation | Qirui Li et.al. | 2508.12969 | null |
| 2025-08-18 | Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models | Jianshu Zeng et.al. | 2508.12945 | null |
| 2025-08-18 | S^2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models | Chubin Chen et.al. | 2508.12880 | null |
| 2025-08-18 | E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model | Ronghao Lin et.al. | 2508.12854 | null |
| 2025-08-18 | MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration | Yuanxin Wei et.al. | 2508.12691 | null |
| 2025-08-17 | TiP4GEN: Text to Immersive Panorama 4D Scene Generation | Ke Xing et.al. | 2508.12415 | null |
| 2025-08-15 | CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models | Xiaoxue Wu et.al. | 2508.11484 | null |
| 2025-08-14 | LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters | Haomin Zhang et.al. | 2508.11074 | null |
| 2025-08-14 | GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning | Kelin Yu et.al. | 2508.11049 | null |
| 2025-08-14 | EVCtrl: Efficient Control Adapter for Visual Generation | Zixiang Yang et.al. | 2508.10963 | null |
| 2025-08-14 | Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation | Harold Haodong Chen et.al. | 2508.10858 | null |
| 2025-08-14 | Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation | Youping Gu et.al. | 2508.10774 | null |
| 2025-08-14 | AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences | Jieyu Li et.al. | 2508.10771 | null |
| 2025-08-14 | HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis | Shiyu Liu et.al. | 2508.10566 | null |
| 2025-08-13 | LIA-X: Interpretable Latent Portrait Animator | Yaohui Wang et.al. | 2508.09959 | null |
| 2025-08-13 | Physical Autoregressive Model for Robotic Manipulation without Action Pretraining | Zijian Song et.al. | 2508.09822 | null |
| 2025-07-22 | MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation | Yanchen Liu et.al. | 2507.16310 | null |
| 2025-07-22 | PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation | Yaofang Liu et.al. | 2507.16116 | null |
| 2025-07-21 | Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models | Enes Sanli et.al. | 2507.15824 | null |
| 2025-07-21 | TokensGen: Harnessing Condensed Tokens for Long Video Generation | Wenqi Ouyang et.al. | 2507.15728 | null |
| 2025-07-21 | Conditional Video Generation for High-Efficiency Video Compression | Fangqiu Yi et.al. | 2507.15269 | null |
| 2025-07-19 | BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM | Haiquan Wen et.al. | 2507.14632 | null |
| 2025-07-19 | Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey | Jiahui Zhang et.al. | 2507.14501 | null |
| 2025-07-18 | Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis | Tongtong Su et.al. | 2507.13753 | null |
| 2025-07-17 | | Dmitrii Mikhailov et.al. | 2507.13546 | null |
| 2025-07-17 | "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models | Jing Gu et.al. | 2507.13428 | null |
| 2025-07-17 | Taming Diffusion Transformer for Real-Time Mobile Video Generation | Yushu Wu et.al. | 2507.13343 | null |
| 2025-07-17 | Leveraging Pre-Trained Visual Models for AI-Generated Video Detection | Keerthi Veeramachaneni et.al. | 2507.13224 | null |
| 2025-07-17 | LoViC: Efficient Long Video Generation with Context Compression | Jiaxiu Jiang et.al. | 2507.12952 | null |
| 2025-07-17 | World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving | Yanchen Guan et.al. | 2507.12762 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-22 | Decoding MGMT Methylation: A Step Towards Precision Medicine in Glioblastoma | Hafeez Ur Rehman et.al. | 2508.16424 | null |
| 2025-08-22 | FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing | Jiahao Chen et.al. | 2508.16230 | null |
| 2025-08-25 | OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models | Huanpeng Chu et.al. | 2508.16212 | null |
| 2025-08-22 | Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers | Shikang Zheng et.al. | 2508.16211 | null |
| 2025-08-22 | Competition and Attraction Improve Model Fusion | João Abrantes et.al. | 2508.16204 | null |
| 2025-08-22 | RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution | Haodong He et.al. | 2508.16158 | null |
| 2025-08-22 | Two-flow Feedback Multi-scale Progressive Generative Adversarial Network | Sun Weikai et.al. | 2508.16089 | null |
| 2025-08-21 | Structure-Preserving Medical Image Generation from a Latent Graph Representation | Kevin Arias et.al. | 2508.15920 | null |
| 2025-08-21 | CineScale: Free Lunch in High-Resolution Cinematic Visual Generation | Haonan Qiu et.al. | 2508.15774 | null |
| 2025-08-21 | Visual Autoregressive Modeling for Instruction-Guided Image Editing | Qingyang Mao et.al. | 2508.15772 | null |
| 2025-08-21 | Waver: Wave Your Way to Lifelike Video Generation | Yifu Zhang et.al. | 2508.15761 | null |
| 2025-08-21 | Are Virtual DES Images a Valid Alternative to the Real Ones? | Ana C. Perre et.al. | 2508.15594 | null |
| 2025-08-21 | GenTune: Toward Traceable Prompts to Improve Controllability of Image Refinement in Environment Design | Wen-Fan Wang et.al. | 2508.15227 | null |
| 2025-08-21 | See it. Say it. Sorted: Agentic System for Compositional Diagram Generation | Hantao Zhang et.al. | 2508.15222 | null |
| 2025-08-20 | Side Effects of Erasing Concepts from Diffusion Models | Shaswati Saha et.al. | 2508.15124 | null |
| 2025-08-20 | CurveFlow: Curvature-Guided Flow Matching for Image Generation | Yan Luo et.al. | 2508.15093 | null |
| 2025-08-20 | TAIGen: Training-Free Adversarial Image Generation via Diffusion Models | Susim Roy et.al. | 2508.15020 | null |
| 2025-08-20 | SATURN: Autoregressive Image Generation Guided by Scene Graphs | Thanh-Nhan Vo et.al. | 2508.14502 | null |
| 2025-08-20 | Multimode Fiber Imaging Based on Hydrogel Fiber | Lele He et.al. | 2508.14501 | null |
| 2025-08-20 | MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion | Fei Peng et.al. | 2508.14440 | null |
| 2025-08-20 | CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities | Yue Gong et.al. | 2508.14405 | null |
| 2025-08-20 | Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning | Junchao Zhu et.al. | 2508.14393 | null |
| 2025-08-19 | Tooth-Diffusion: Guided 3D CBCT Synthesis with Fine-Grained Tooth Conditioning | Said Djafar Said et.al. | 2508.14276 | null |
| 2025-08-19 | SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation | Paul Grimal et.al. | 2508.13866 | null |
| 2025-08-19 | Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing | Feng-Lin Liu et.al. | 2508.13797 | null |
| 2025-08-20 | DiffIER: Optimizing Diffusion Models with Iterative Error Reduction | Ao Chen et.al. | 2508.13628 | null |
| 2025-08-20 | 2D Gaussians Meet Visual Tokenizer | Yiang Shi et.al. | 2508.13515 | null |
| 2025-08-19 | AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes | Tianyi Xu et.al. | 2508.13503 | null |
| 2025-08-18 | ID-Card Synthetic Generation: Toward a Simulated Bona fide Dataset | Qingwen Zeng et.al. | 2508.13078 | null |
| 2025-08-18 | From Transthoracic to Transesophageal: Cross-Modality Generation using LoRA Diffusion | Emmanuel Oladokun et.al. | 2508.13077 | null |
| 2025-08-18 | 7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models | Elena Izzo et.al. | 2508.12919 | null |
| 2025-08-18 | Next Visual Granularity Generation | Yikai Wang et.al. | 2508.12811 | null |
| 2025-08-18 | Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score | Syed Muhmmad Israr et.al. | 2508.12718 | null |
| 2025-08-18 | Stable Diffusion-Based Approach for Human De-Occlusion | Seung Young Noh et.al. | 2508.12663 | null |
| 2025-08-17 | Say It, See It: A Systematic Evaluation on Speech-Based 3D Content Generation Methods in Augmented Reality | Yanming Xiu et.al. | 2508.12498 | null |
| 2025-08-17 | DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models | Xiaochuan Lin et.al. | 2508.12396 | null |
| 2025-08-17 | Semantic Discrepancy-aware Detector for Image Forgery Identification | Ziye Wang et.al. | 2508.12341 | null |
| 2025-08-17 | Sketchar: Supporting Character Design and Illustration Prototyping Using Generative AI | Long Ling et.al. | 2508.12333 | null |
| 2025-08-15 | Training-Free Anomaly Generation via Dual-Attention Enhancement in Diffusion Model | Zuo Zuo et.al. | 2508.11550 | null |
| 2025-08-15 | SPG: Style-Prompting Guidance for Style-Specific Content Creation | Qian Liang et.al. | 2508.11476 | null |
| 2025-08-15 | MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation | Qian Liang et.al. | 2508.11433 | null |
| 2025-08-15 | AnatoMaskGAN: GNN-Driven Slice Feature Fusion and Noise Augmentation for Medical Semantic Image Synthesis | Zonglin Wu et.al. | 2508.11375 | null |
| 2025-08-18 | TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation | Yilin Mi et.al. | 2508.11284 | null |
| 2025-08-15 | Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension | Zhenhao Li et.al. | 2508.11211 | null |
| 2025-08-15 | StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation | Seungmi Lee et.al. | 2508.11203 | null |
| 2025-08-15 | LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction | Maoquan Zhang et.al. | 2508.11153 | null |
| 2025-08-14 | Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models | Basile Lewandowski et.al. | 2508.10993 | null |
| 2025-08-16 | Object Fidelity Diffusion for Remote Sensing Image Generation | Ziqi Ye et.al. | 2508.10801 | null |
| 2025-07-22 | Pyramid Hierarchical Masked Diffusion Model for Imaging Synthesis | Xiaojiao Xiao et.al. | 2507.16579 | null |
| 2025-07-22 | ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement | Kahim Wong et.al. | 2507.16397 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-12 | QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems | Chien-Chun Wang et.al. | 2508.08957 | null |
| 2025-08-12 | Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems | Liam Pram et.al. | 2508.08805 | null |
| 2025-08-08 | Live Music Models | Lyria Team et.al. | 2508.04651 | null |
| 2025-08-03 | Automatic Melody Reduction via Shortest Path Finding | Ziyu Wang et.al. | 2508.01571 | null |
| 2025-07-31 | DeformTune: A Deformable XAI Music Prototype for Non-Musicians | Ziqing Xu et.al. | 2508.00160 | null |
| 2025-07-31 | "I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation | Bob L. T. Sturm et.al. | 2507.23365 | null |
| 2025-07-28 | Music Arena: Live Evaluation for Text-to-Music | Yonghyun Kim et.al. | 2507.20900 | null |
| 2025-07-28 | Controllable Video-to-Music Generation with Multiple Time-Varying Conditions | Junxian Wu et.al. | 2507.20627 | null |
| 2025-07-27 | Diffusion-based Symbolic Music Generation with Structured State Space Models | Shenghua Yuan et.al. | 2507.20128 | null |
| 2025-08-07 | SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion | Hei Shing Cheung et.al. | 2507.19991 | null |
| 2025-07-17 | A new XML conversion process for mensural music encoding : CMME_to_MEI (via Verovio) | David Fiala et.al. | 2507.15991 | null |
| 2025-07-17 | WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling | Qihui Yang et.al. | 2507.10534 | null |
| 2025-07-07 | Evaluating Fake Music Detection Performance Under Audio Augmentations | Tomasz Sroka et.al. | 2507.10447 | null |
| 2025-07-14 | ASTAR-NTU solution to AudioMOS Challenge 2025 Track1 | Fabian Ritter-Gutierrez et.al. | 2507.09904 | null |
| 2025-07-09 | Exploring State-Space-Model based Language Model in Music Generation | Wei-Jaw Lee et.al. | 2507.06674 | null |
| 2025-07-08 | MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation | Fathinah Izzati et.al. | 2507.05894 | null |
| 2025-07-07 | EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation | Fathinah Izzati et.al. | 2507.04955 | null |
| 2025-07-04 | MusGO: A Community-Driven Framework For Assessing Openness in Music-Generative AI | Roser Batlle-Roca et.al. | 2507.03599 | null |
| 2025-06-29 | The Florence Price Art Song Dataset and Piano Accompaniment Generator | Tao-Tao He et.al. | 2506.23130 | null |
| 2025-06-29 | TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure | Qi He et.al. | 2506.23094 | null |
| 2025-06-26 | Exploring Adapter Design Tradeoffs for Low Resource Music Generation | Atharva Mehta et.al. | 2506.21298 | null |
| 2025-06-23 | A Fourier Explanation of AI-music Artifacts | Darius Afchar et.al. | 2506.19108 | null |
| 2025-06-23 | Benchmarking Music Generation Models and Metrics via Human Preference Studies | Florian Grötschla et.al. | 2506.19085 | null |
| 2025-06-23 | Let Your Video Listen to Your Music! | Xinyu Zhang et.al. | 2506.18881 | null |
| 2025-06-24 | MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners | Fang-Duo Tsai et.al. | 2506.18729 | null |
| 2025-06-28 | AI-Generated Song Detection via Lyrics Transcripts | Markus Frohmann et.al. | 2506.18488 | null |
| 2025-06-23 | Large-Scale Training Data Attribution for Music Generative Models via Unlearning | Woosung Choi et.al. | 2506.18312 | null |
| 2025-06-20 | From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training | Mingyang Yao et.al. | 2506.17497 | link |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-22 | Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning | Xueyao Zhang et.al. | 2508.16332 | null |
| 2025-08-15 | EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens | Joonyong Park et.al. | 2508.11273 | null |
| 2025-08-15 | Benchmarking Prosody Encoding in Discrete Speech Tokens | Kentaro Onda et.al. | 2508.11224 | null |
| 2025-08-13 | DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models | Yuanyuan Wang et.al. | 2508.08961 | null |
| 2025-08-11 | Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations | Ryo Aihara et.al. | 2508.08399 | null |
| 2025-08-07 | NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference | Edresson Casanova et.al. | 2508.05835 | null |
| 2025-08-07 | SpectroStream: A Versatile Neural Codec for General Audio | Yunpeng Li et.al. | 2508.05207 | null |
| 2025-08-05 | Real-time speech enhancement in noise for throat microphone using neural audio codec as foundation model | Julien Hauret et.al. | 2508.02974 | null |
| 2025-08-04 | SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec | Chunyu Qiang et.al. | 2508.02849 | null |
| 2025-08-02 | Multi-Granularity Adaptive Time-Frequency Attention Framework for Audio Deepfake Detection under Real-World Communication Degradations | Haohan Shi et.al. | 2508.01467 | null |
| 2025-08-01 | Next Tokens Denoising for Speech Synthesis | Yanqing Liu et.al. | 2507.22746 | null |
| 2025-07-22 | Step-Audio 2 Technical Report | Boyong Wu et.al. | 2507.16632 | null |
| 2025-07-17 | Autoregressive Speech Enhancement via Acoustic Tokens | Luca Della Libera et.al. | 2507.12825 | null |
| 2025-07-17 | Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine | Anastasia Kuznetsova et.al. | 2507.12701 | null |
| 2025-07-16 | Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations | Yichen Han et.al. | 2507.12197 | null |
| 2025-07-16 | Room Impulse Response Generation Conditioned on Acoustic Parameters | Silvia Arellano et.al. | 2507.12136 | null |
| 2025-07-14 | Aligning Generative Speech Enhancement with Human Preferences via Direct Preference Optimization | Haoyang Li et.al. | 2507.09929 | null |
| 2025-07-14 | Token-based Audio Inpainting via Discrete Diffusion | Tali Dror et.al. | 2507.08333 | null |
| 2025-07-10 | Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders | Dimitrios Bralios et.al. | 2507.07867 | null |
| 2025-07-09 | Speech Tokenizer is Key to Consistent Representation | Wonjin Jung et.al. | 2507.06802 | null |
| 2025-07-01 | StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding | Dake Guo et.al. | 2506.23986 | null |
| 2025-07-09 | XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs | Yitian Gong et.al. | 2506.23325 | null |
| 2025-06-27 | DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding | Yang Yang et.al. | 2506.22362 | null |
| 2025-06-26 | CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate | Hankun Wang et.al. | 2506.21074 | null |
| 2025-06-24 | Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation | Jun Wang et.al. | 2506.19774 | null |
| 2025-06-20 | LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization | Daejin Jo et.al. | 2506.16738 | null |
| 2025-06-18 | Factorized RVQ-GAN For Disentangled Speech Tokenization | Sameer Khurana et.al. | 2506.15456 | null |
| 2025-06-17 | A Variational Framework for Improving Naturalness in Generative Spoken Language Models | Li-Wei Chen et.al. | 2506.14767 | link |
| 2025-06-14 | Towards Neural Audio Codec Source Parsing | Orchid Chetia Phukan et.al. | 2506.12627 | null |
| 2025-06-14 | Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction | Xiaoran Fan et.al. | 2506.12537 | null |
| 2025-06-13 | ViSAGe: Video-to-Spatial Audio Generation | Jaeyeon Kim et.al. | 2506.12199 | null |
| 2025-06-16 | Discrete Audio Tokens: More Than a Survey! | Pooneh Mousavi et.al. | 2506.10274 | null |
| 2025-06-13 | Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model | Ailin Huang et.al. | 2506.08967 | null |
| 2025-06-10 | Towards Generalized Source Tracing for Codec-Based Deepfake Speech | Xuanjun Chen et.al. | 2506.07294 | null |
| 2025-06-19 | Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training | Sathvik Udupa et.al. | 2506.07081 | null |
| 2025-06-04 | Bringing Interpretability to Neural Audio Codecs | Samir Sadok et.al. | 2506.04492 | null |
| 2025-06-13 | Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation | Yuxuan Hu et.al. | 2506.04392 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-08-19 | Lexical Hints of Accuracy in LLM Reasoning Chains | Arne Vanhoyweghen et.al. | 2508.15842 | null |
| 2025-08-18 | Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models | Zhifei Xie et.al. | 2508.15827 | null |
| 2025-08-21 | Conditionally adaptive augmented Lagrangian method for physics-informed learning of forward and inverse problems using artificial neural networks | Qifeng Hu et.al. | 2508.15695 | null |
| 2025-08-21 | When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models | Cheng Wang et.al. | 2508.15407 | null |
| 2025-08-19 | OmViD: Omni-supervised active learning for video action detection | Aayush Rana et.al. | 2508.13983 | null |
| 2025-08-19 | FAMNet: Integrating 2D and 3D Features for Micro-expression Recognition via Multi-task Learning and Hierarchical Attention | Liangyu Fu et.al. | 2508.13483 | null |
| 2025-08-19 | Modeling and Control of AWOISV: A Filtered Tube-Based MPC Approach for Simultaneous Tracking of Lateral Position and Heading Angle | Xu Yang et.al. | 2508.13457 | null |
| 2025-08-18 | Omni Survey for Multimodality Analysis in Visual Object Tracking | Zhangyong Tang et.al. | 2508.13000 | null |
| 2025-08-16 | OmniD: Generalizable Robot Manipulation Policy via Image-Based BEV Representation | Jilei Mao et.al. | 2508.11898 | null |
| 2025-08-15 | Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models | Bing Liu et.al. | 2508.11165 | null |
| 2025-08-15 | Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation | Bing Liu et.al. | 2508.11134 | null |
| 2025-08-14 | HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs | Zheng Qin et.al. | 2508.10576 | null |
| 2025-08-13 | Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning | Vaishnavi Shrivastava et.al. | 2508.09726 | null |
| 2025-08-13 | A Chain of Diagnosis Framework for Accurate and Explainable Radiology Report Generation | Haibo Jin et.al. | 2508.09566 | null |
| 2025-08-11 | MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models | Fan Zhang et.al. | 2508.09210 | null |
| 2025-08-11 | MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios | Shuai Wang et.al. | 2508.08155 | null |
| 2025-08-12 | Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning | Shu Wu et.al. | 2508.08039 | null |
| 2025-08-12 | Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation | Fangyuan Mao et.al. | 2508.07981 | null |
| 2025-08-10 | AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning | Siminfar Samakoush Galougah et.al. | 2508.07470 | null |
| 2025-07-21 | Prospects of detecting rotational flatness of exoplanets from space-based photometry | Sz. Kálmán et.al. | 2507.15359 | null |
| 2025-07-21 | The CHEOPS view of HD 95338b: refined transit parameters, and a search for exomoons | Sz. Kálmán et.al. | 2507.15318 | null |
| 2025-07-20 | Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards | Derek Li et.al. | 2507.14783 | null |
| 2025-07-19 | Anisotropic Anderson localization in higher-dimensional nonreciprocal lattices | Jinyuan Shang et.al. | 2507.14523 | null |
| 2025-07-18 | RiNNAL+: a Riemannian ALM Solver for SDP-RLT Relaxations of Mixed-Binary Quadratic Programs | Di Hou et.al. | 2507.13776 | null |
| 2025-07-17 | DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model | Han Zhang et.al. | 2507.13087 | null |
| 2025-07-17 | AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning | Yiming Ren et.al. | 2507.12841 | null |
| 2025-07-16 | An augmented Lagrangian method for strongly regular minimizers in a class of convex composite optimization problems | Chengjing Wang et.al. | 2507.12040 | null |
| 2025-07-15 | UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks | Peiran Wu et.al. | 2507.11336 | null |
| 2025-07-14 | MultiVox: Benchmarking Voice Assistants for Multimodal Interactions | Ramaneswaran Selvakumar et.al. | 2507.10859 | null |
| 2025-07-14 | The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents | Lixu Wang et.al. | 2507.10016 | null |
| 2025-07-14 | The electronic and transport properties in the Haldane-Hubbard with odd-parity altermagnetism | Minghuan Zeng et.al. | 2507.09906 | null |
| 2025-07-11 | Two-Level Distributed Interference Management for Large-Scale HAPS-Empowered vHetNets | Afsoon Alidadi Shamsabadi et.al. | 2507.08299 | null |
| 2025-07-10 | Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models | Arushi Goel et.al. | 2507.08128 | null |
| 2025-07-09 | Omni-Fusion of Spatial and Spectral for Hyperspectral Image Segmentation | Qing Zhang et.al. | 2507.06606 | null |
| 2025-07-09 | Omni-Video: Democratizing Unified Video Understanding and Generation | Zhiyu Tan et.al. | 2507.06119 | null |
| 2025-07-08 | ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark | He Wang et.al. | 2507.05727 | null |
| 2025-07-08 | Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition | Zijin Gu et.al. | 2507.05724 | null |
| 2025-07-07 | Electronic transport and anti-super-Klein tunneling in few-layer black phosphorous | Jorge Alfonso Lizarraga-Brito et.al. | 2507.05462 | null |
| 2025-07-03 | DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment | Ke-Han Lu et.al. | 2507.02768 | null |
| 2025-06-29 | VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems | Ethan Smyth et.al. | 2507.00079 | null |
| 2025-06-28 | UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments | Dayong Su et.al. | 2506.22736 | null |
| 2025-06-27 | Augmented Lagrangian methods for infeasible convex optimization problems and diverging proximal-point algorithms | Roland Andrews et.al. | 2506.22428 | null |
| 2025-06-26 | Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation | Guanting Dong et.al. | 2506.21384 | null |
| 2025-06-26 | HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | Qize Yang et.al. | 2506.21277 | null |
| 2025-06-26 | Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation | Yihong Cao et.al. | 2506.21198 | null |
| 2025-06-29 | OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs | Yiman Zhang et.al. | 2506.20960 | null |
| 2025-06-23 | OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation | Qijun Gan et.al. | 2506.18866 | null |
| 2025-06-22 | Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents | Jinjie Wei et.al. | 2506.17913 | null |