WWWWxp/arxiv_daily

 
 


Updated on 2025.08.26

Usage instructions: here

Table of Contents
  1. Text to Speech
  2. Text to Audio
  3. Video to Audio
  4. Voice Conversion
  5. Video Generation
  6. Image Generation
  7. Music Generation
  8. Audio Codec
  9. Large Audio Language Model

Text to Speech

Publish Date Title Authors PDF Code
2025-08-22 Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation Weiting Tan et.al. 2508.16188 link
2025-08-21 QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection Zhiyu Wu et.al. 2508.15931 null
2025-08-21 Abelian integrals for polynomials with trivial global monodromy on $\mathbb{C}^2$ Jesús Muciño-Raymundo et.al. 2508.15925 null
2025-08-21 Any-to-any Speaker Attribute Perturbation for Asynchronous Voice Anonymization Liping Chen et.al. 2508.15565 null
2025-08-24 Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets Chenlin Liu et.al. 2508.15442 null
2025-08-21 UniCoM: A Universal Code-Switching Speech Generator Sangmin Lee et.al. 2508.15244 link
2025-08-25 Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization Rui Wang et.al. 2508.14947 null
2025-08-20 Long-Context Speech Synthesis with Context-Aware Memory Zhipeng Li et.al. 2508.14713 null
2025-08-20 Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement Heitor R. Guimarães et.al. 2508.14709 null
2025-08-22 Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS Can Jin et.al. 2508.14313 null
2025-08-19 Exponential Ergodicity for McKean-Vlasov SDEs with Singular Interactions Xing Huang et.al. 2508.13924 null
2025-08-20 DiffIER: Optimizing Diffusion Models with Iterative Error Reduction Ao Chen et.al. 2508.13628 null
2025-08-19 Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM Dariia Puhach et.al. 2508.13603 null
2025-08-18 A Surveillance Based Interactive Robot Kshitij Kavimandan et.al. 2508.13319 link
2025-08-18 MrMARTIAN: A Multi-resolution Mass Reconstruction Algorithm Combining Free-form and Analytic Components Sangjun Cha et.al. 2508.13262 null
2025-08-18 Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis Zhu Li et.al. 2508.13028 null
2025-08-18 Cooperative Sensing-Assisted Predictive Beam Tracking for MIMO-OFDM Networked ISAC Systems Xiaoyu Yang et.al. 2508.12723 null
2025-08-18 Real-Time Sign Language Gestures to Speech Transcription using Deep Learning Brandone Fonya et.al. 2508.12713 null
2025-08-19 FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts Qingliang Meng et.al. 2508.12001 null
2025-08-16 SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System Truong Thanh Hung Nguyen et.al. 2508.11873 null
2025-08-15 MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts Heyang Xue et.al. 2508.11326 null
2025-08-15 EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens Joonyong Park et.al. 2508.11273 null
2025-08-14 Towards high-precision inspiral gravitational waveforms from binary neutron star mergers in numerical relativity Kenta Kiuchi et.al. 2508.10981 null
2025-08-14 Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform Yuankun Xie et.al. 2508.10559 null
2025-08-14 Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning Yejin Jeon et.al. 2508.10412 null
2025-08-14 Towards Frame-level Quality Predictions of Synthetic Speech Michael Kuhlmann et.al. 2508.10374 link
2025-08-13 Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions Tina Raissi et.al. 2508.09868 null
2025-08-13 UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech Shuhei Kato et.al. 2508.09767 null
2025-08-13 $\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation Boyu Zhu et.al. 2508.09702 null
2025-08-12 ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs Eray Eren et.al. 2508.09389 null
2025-07-21 Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models Kaiyan Chang et.al. 2507.15512 null
2025-07-21 Lunar and Terrestrial Time Transformation Based on the Principle of General Relativity Min Liu et.al. 2507.15456 null
2025-07-21 A2TTS: TTS for Low Resource Indian Languages Ayush Singh Bhadoriya et.al. 2507.15272 null
2025-07-21 EchoVoices: Preserving Generational Voices and Memories for Seniors and Children Haiying Xu et.al. 2507.15221 null
2025-07-20 Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding Yuanhan Zhang et.al. 2507.15028 null
2025-07-22 Hear Your Code Fail, Voice-Assisted Debugging for Python Sayed Mahbub Hasan Amiri et.al. 2507.15007 null
2025-07-20 DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis Yinghao Aaron Li et.al. 2507.14988 null
2025-07-20 MUR: Momentum Uncertainty guided Reasoning for Large Language Models Hang Yan et.al. 2507.14958 null
2025-07-20 FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing Shoutao Guo et.al. 2507.14815 null
2025-07-18 Inflated hot Jupiters: inferring average atmospheric velocity via Ohmic models coupled with internal dynamo evolution Daniele Viganò et.al. 2507.13991 null
2025-07-18 Charged lepton flavor violating decays with a pair of light dark matter and muonium invisible decay Sahabub Jahedi et.al. 2507.13876 null
2025-07-17 A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models Kirill Borodin et.al. 2507.13563 null
2025-07-17 NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech Maksim Borisov et.al. 2507.13155 null
2025-07-17 Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication Tianyu Song et.al. 2507.13052 null
2025-07-17 Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes Zhou Feng et.al. 2507.12932 null
2025-07-16 Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations Yichen Han et.al. 2507.12197 null
2025-07-16 EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis Haoxun Li et.al. 2507.12015 null
2025-07-17 Comprehensive investigation on baryon number violating nucleon decays involving an axion-like particle Wei-Qi Fan et.al. 2507.11844 null
2025-07-15 Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection Ivan Viakhirev et.al. 2507.11777 null
2025-07-15 P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge Marvin Sach et.al. 2507.11306 null

(back to top)

Text to Audio

Publish Date Title Authors PDF Code
2025-08-22 Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment Youjia Zhang et.al. 2508.15568 null
2025-08-21 DualMark: Identifying Model and Training Data Origins in Generated Audio Xuefeng Yang et.al. 2508.15521 null
2025-08-19 MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence Sonal Kumar et.al. 2508.13992 null
2025-08-19 DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer Yisu Liu et.al. 2508.13786 null
2025-08-21 FoleySpace: Vision-Aligned Binaural Spatial Audio Generation Lei Zhao et.al. 2508.12918 null
2025-08-18 TTA-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions Dongjae Jeon et.al. 2508.12690 null
2025-08-15 Pretrained Conformers for Audio Fingerprinting and Retrieval Kemal Altwlkany et.al. 2508.11609 null
2025-08-14 LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters Haomin Zhang et.al. 2508.11074 null
2025-08-14 A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation Jiulin Li et.al. 2508.10494 null
2025-08-13 TOTNet: Occlusion-Aware Temporal Tracking for Robust Ball Detection in Sports Videos Hao Xu et.al. 2508.09650 null
2025-08-12 QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems Chien-Chun Wang et.al. 2508.08957 null
2025-08-20 MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling Qian Wang et.al. 2508.08487 null
2025-08-11 Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization Nicholas Klein et.al. 2508.08141 null
2025-08-11 Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models Khanh-Binh Nguyen et.al. 2508.07570 null
2025-08-08 MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows Xiquan Li et.al. 2508.06098 null
2025-08-08 DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching Wei Chen et.al. 2508.05978 null
2025-07-22 TTMBA: Towards Text To Multiple Sources Binaural Audio Generation Yuxuan He et.al. 2507.16564 null
2025-07-21 An Investigation of Test-time Adaptation for Audio Classification under Background Noise Weichuang Shao et.al. 2507.15523 null
2025-07-18 CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation Marc Lafon et.al. 2507.14312 null
2025-07-16 Evaluation of Neural Surrogates for Physical Modelling Synthesis of Nonlinear Elastic Plates Carlos De La Vega Martin et.al. 2507.12563 null
2025-07-16 Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations Yichen Han et.al. 2507.12197 null
2025-07-16 GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models Zhaohong Huang et.al. 2507.11969 null
2025-07-14 DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis Wenjie Tian et.al. 2507.10109 null
2025-07-14 Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction Shu-wen Yang et.al. 2507.09834 null
2025-07-13 Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations Yiwen Liang et.al. 2507.09500 null
2025-07-11 Monitoring Risks in Test-Time Adaptation Mona Schirmer et.al. 2507.08721 null
2025-07-11 BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis Shuang Cui et.al. 2507.08607 null
2025-07-11 FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation Yuxuan Jiang et.al. 2507.08557 null
2025-07-11 MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling Jingjing Tang et.al. 2507.08530 null
2025-07-10 Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement Xiao Yang et.al. 2507.07908 null
2025-07-10 Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos Hao Xu et.al. 2507.07381 null
2025-07-09 Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM Qiyuan Dai et.al. 2507.06973 null
2025-07-09 Physics-Informed Direction-Aware Neural Acoustic Fields Yoshiki Masuyama et.al. 2507.06826 null
2025-07-13 Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation Yingshan Liang et.al. 2507.04959 null
2025-07-05 MMMOS: Multi-domain Multi-axis Audio Quality Assessment Yi-Cheng Lin et.al. 2507.04094 null
2025-07-04 Dynamic Multimodal Prototype Learning in Vision-Language Models Xingyu Zhu et.al. 2507.03657 null
2025-07-03 F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning Wei Li et.al. 2507.02437 null
2025-07-03 Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation Feizhen Huang et.al. 2507.02271 null
2025-07-02 Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation Andrei Jelea et.al. 2507.01347 null
2025-07-01 AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences Minoru Kishi et.al. 2507.00475 null
2025-07-01 Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation Jizhou Han et.al. 2507.00462 null
2025-06-30 The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models Lijun Sheng et.al. 2506.24000 null
2025-06-30 Scaling Self-Supervised Representation Learning for Symbolic Piano Performance Louis Bradshaw et.al. 2506.23869 null
2025-06-30 When Small Guides Large: Cross-Model Co-Learning for Test-Time Adaptation Chang'an Yi et.al. 2506.23724 null
2025-06-30 RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio Yusuke Kanamori et.al. 2506.23582 null
2025-06-30 Human-CLAP: Human-perception-based contrastive language-audio pretraining Taisei Takano et.al. 2506.23553 null
2025-06-27 SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition Muhammad Umar Farooq et.al. 2506.22143 null
2025-06-28 ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing Huadai Liu et.al. 2506.21448 null
2025-06-27 Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance Akio Hayakawa et.al. 2506.20995 null
2025-06-24 Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation Jun Wang et.al. 2506.19774 null

(back to top)

Video to Audio

Publish Date Title Authors PDF Code
2025-08-19 InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing Shaoshu Yang et.al. 2508.14033 null
2025-08-21 FoleySpace: Vision-Aligned Binaural Spatial Audio Generation Lei Zhao et.al. 2508.12918 null
2025-08-14 LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters Haomin Zhang et.al. 2508.11074 null
2025-08-12 Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization Chaoqun Cui et.al. 2508.08550 null
2025-07-14 DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis Wenjie Tian et.al. 2507.10109 null
2025-07-13 Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation Yingshan Liang et.al. 2507.04959 null
2025-06-23 Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions Vineet Kumar Rakesh et.al. 2507.02900 null
2025-07-03 Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation Feizhen Huang et.al. 2507.02271 null
2025-06-23 IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech Siyi Zhou et.al. 2506.21619 null
2025-06-28 ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing Huadai Liu et.al. 2506.21448 null
2025-06-27 Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance Akio Hayakawa et.al. 2506.20995 null
2025-06-24 Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation Jun Wang et.al. 2506.19774 null
2025-06-13 ViSAGe: Video-to-Spatial Audio Generation Jaeyeon Kim et.al. 2506.12199 null
2025-05-31 Length Aware Speech Translation for Video Dubbing Harveen Singh Chadha et.al. 2506.00740 null
2025-05-26 Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks Chang Liu et.al. 2505.20038 link
2025-05-22 SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet Zhi Zhong et.al. 2505.16195 null
2025-05-30 TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis Yu Zhang et.al. 2505.14910 link
2025-05-28 Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model Yong Ren et.al. 2505.13062 null
2025-06-03 OmniAudio: Generating Spatial Audio from 360-Degree Video Huadai Liu et.al. 2504.14906 link
2025-04-17 CAFA: a Controllable Automatic Foley Artist Roi Benita et.al. 2504.06778 link

(back to top)

Voice Conversion

Publish Date Title Authors PDF Code
2025-08-22 LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence Alisa Vinogradova et.al. 2508.16571 null
2025-08-21 Evolving k-Threshold Visual Cryptography Schemes Xiaoli Zhuo et.al. 2508.15917 null
2025-08-20 Maxmum Size of a Uniform Family with Bounded VC-dimension Tianchi Yang et.al. 2508.14334 null
2025-08-20 Fortifying the Agentic Web: A Unified Zero-Trust Architecture Against Logic-layer Threats Ken Huang et.al. 2508.12259 null
2025-08-13 Perturbed Public Voices (P$^{2}$V): A Dataset for Robust Audio Deepfake Detection Chongyang Gao et.al. 2508.10949 null
2025-08-13 Regularity for hypergraphs with bounded VC$_2$ dimension Lior Gishboliner et.al. 2508.09969 null
2025-08-11 Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations Ryo Aihara et.al. 2508.08399 null
2025-08-10 Scalable Controllable Accented TTS Henry Li Xinyuan et.al. 2508.07426 null
2025-08-09 Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody Jinsung Yoon et.al. 2508.06890 null
2025-08-08 DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching Wei Chen et.al. 2508.05978 null
2025-08-07 Grouped k-threshold random grid-based visual cryptography scheme Xiaoli Zhuo et.al. 2508.05394 null
2025-08-15 Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS M Anuprabha et.al. 2508.05102 null
2025-08-08 REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers Yuepeng Jiang et.al. 2508.04996 null
2025-08-14 Marco-Voice Technical Report Fengping Tian et.al. 2508.02038 null
2025-07-23 Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion Yu Zhang et.al. 2507.14534 null
2025-07-17 Computational-Statistical Tradeoffs from NP-hardness Guy Blanc et.al. 2507.13222 null
2025-07-17 Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries Minyoung Kim et.al. 2507.12723 null
2025-07-15 Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection Ivan Viakhirev et.al. 2507.11777 null
2025-07-16 Multipass Linear Sketches for Geometric LP-Type Problems N. Efe Çekirge et.al. 2507.11484 null
2025-07-15 On Tight Robust Coresets for $k$-Medians Clustering Lingxiao Huang et.al. 2507.11260 null
2025-07-15 Pronunciation Deviation Analysis Through Voice Cloning and Acoustic Comparison Andrew Valdivia et.al. 2507.10985 null
2025-07-12 Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning Dominika Woszczyk et.al. 2507.09310 null
2025-07-11 SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment Shivam Mehta et.al. 2507.09070 null
2025-07-11 Detecting Deepfake Talking Heads from Facial Biometric Anomalies Justin D. Norman et.al. 2507.08917 null
2025-07-11 On Fair Epsilon Net and Geometric Hitting Set Mohsen Dehghankar et.al. 2507.08758 null
2025-07-08 On the pointwise and sup-norm errors for local regression estimators Jérémy Bettinger et.al. 2507.07132 null
2025-07-09 Speech Tokenizer is Key to Consistent Representation Wonjin Jung et.al. 2507.06802 null
2025-07-07 Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters Mathilde Abrassart et.al. 2507.04817 null
2025-07-06 TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet Jaeseok Jeong et.al. 2507.04349 null
2025-07-04 Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion Lea Fischbach et.al. 2507.03641 null
2025-07-04 Going Beyond Surfaces in Diameter Approximation Michał Włodarczyk et.al. 2507.03447 null
2025-07-03 De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks Wei Fan et.al. 2507.02606 null
2025-07-03 Open-Source System for Multilingual Translation and Cloned Speech Synthesis Mateo Cámara et.al. 2507.02530 null
2025-07-03 JoyTTS: LLM-based Spoken Chatbot With Voice Cloning Fangru Zhou et.al. 2507.02380 null
2025-07-02 Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis Marc-André Carbonneau et.al. 2507.02176 null
2025-07-02 Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora Hitoshi Suda et.al. 2507.01356 null
2025-07-01 MuteSwap: Silent Face-based Voice Conversion Yifan Liu et.al. 2507.00498 null
2025-06-26 Avatars and Environments for Meetings in Social VR: What Styles and Choices Matter to People in Group Creativity Tasks? Anya Osborne et.al. 2506.21780 null
2025-06-23 Selecting N-lowest scores for training MOS prediction models Yuto Kondo et.al. 2506.18326 null
2025-06-23 Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting Yuto Kondo et.al. 2506.18307 null
2025-06-23 JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles Yuto Kondo et.al. 2506.18296 null
2025-06-12 RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding Yisi Liu et.al. 2506.10289 null

(back to top)

Video Generation

Publish Date Title Authors PDF Code
2025-08-22 Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation Chun-Peng Chang et.al. 2508.16512 null
2025-08-25 OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models Huanpeng Chu et.al. 2508.16212 null
2025-08-22 Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers Shikang Zheng et.al. 2508.16211 null
2025-08-21 Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning Yijun Liu et.al. 2508.15874 null
2025-08-21 CineScale: Free Lunch in High-Resolution Cinematic Visual Generation Haonan Qiu et.al. 2508.15774 null
2025-08-21 Scaling Group Inference for Diverse and High-Quality Generation Gaurav Parmar et.al. 2508.15773 null
2025-08-21 Waver: Wave Your Way to Lifelike Video Generation Yifu Zhang et.al. 2508.15761 null
2025-08-21 WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception Zhiheng Liu et.al. 2508.15720 null
2025-08-21 VideoEraser: Concept Erasure in Text-to-Video Diffusion Models Naen Xu et.al. 2508.15314 null
2025-08-20 AnchorSync: Global Consistency Optimization for Long Video Editing Zichi Liu et.al. 2508.14609 null
2025-08-20 Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration Haoran Bai et.al. 2508.14483 null
2025-08-20 DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing Weitao Wang et.al. 2508.14465 null
2025-08-20 MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation Guile Wu et.al. 2508.14327 null
2025-08-19 InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing Shaoshu Yang et.al. 2508.14033 null
2025-08-19 Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment Samuel Seligardi et.al. 2508.13989 null
2025-08-19 Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing Feng-Lin Liu et.al. 2508.13797 null
2025-08-18 4DNeX: Feed-Forward 4D Generative Modeling Made Easy Zhaoxi Chen et.al. 2508.13154 null
2025-08-18 Precise Action-to-Video Generation Through Visual Action Prompts Yuang Wang et.al. 2508.13104 null
2025-08-18 EgoTwin: Dreaming Body and View in First Person Jingqiao Xiu et.al. 2508.13013 null
2025-08-18 Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model Xianglong He et.al. 2508.13009 null
2025-08-18 Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation Qirui Li et.al. 2508.12969 null
2025-08-18 Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models Jianshu Zeng et.al. 2508.12945 null
2025-08-18 S^2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models Chubin Chen et.al. 2508.12880 null
2025-08-18 E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model Ronghao Lin et.al. 2508.12854 null
2025-08-18 MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration Yuanxin Wei et.al. 2508.12691 null
2025-08-17 TiP4GEN: Text to Immersive Panorama 4D Scene Generation Ke Xing et.al. 2508.12415 null
2025-08-15 CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models Xiaoxue Wu et.al. 2508.11484 null
2025-08-14 LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters Haomin Zhang et.al. 2508.11074 null
2025-08-14 GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning Kelin Yu et.al. 2508.11049 null
2025-08-14 EVCtrl: Efficient Control Adapter for Visual Generation Zixiang Yang et.al. 2508.10963 null
2025-08-14 Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation Harold Haodong Chen et.al. 2508.10858 null
2025-08-14 Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation Youping Gu et.al. 2508.10774 null
2025-08-14 AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences Jieyu Li et.al. 2508.10771 null
2025-08-14 HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis Shiyu Liu et.al. 2508.10566 null
2025-08-13 LIA-X: Interpretable Latent Portrait Animator Yaohui Wang et.al. 2508.09959 null
2025-08-13 Physical Autoregressive Model for Robotic Manipulation without Action Pretraining Zijian Song et.al. 2508.09822 null
2025-07-22 MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation Yanchen Liu et.al. 2507.16310 null
2025-07-22 PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation Yaofang Liu et.al. 2507.16116 null
2025-07-21 Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models Enes Sanli et.al. 2507.15824 null
2025-07-21 TokensGen: Harnessing Condensed Tokens for Long Video Generation Wenqi Ouyang et.al. 2507.15728 null
2025-07-21 Conditional Video Generation for High-Efficiency Video Compression Fangqiu Yi et.al. 2507.15269 null
2025-07-19 BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM Haiquan Wen et.al. 2507.14632 null
2025-07-19 Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey Jiahui Zhang et.al. 2507.14501 null
2025-07-18 Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis Tongtong Su et.al. 2507.13753 null
2025-07-17 $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention Dmitrii Mikhailov et.al. 2507.13546 null
2025-07-17 "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models Jing Gu et.al. 2507.13428 null
2025-07-17 Taming Diffusion Transformer for Real-Time Mobile Video Generation Yushu Wu et.al. 2507.13343 null
2025-07-17 Leveraging Pre-Trained Visual Models for AI-Generated Video Detection Keerthi Veeramachaneni et.al. 2507.13224 null
2025-07-17 LoViC: Efficient Long Video Generation with Context Compression Jiaxiu Jiang et.al. 2507.12952 null
2025-07-17 World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving Yanchen Guan et.al. 2507.12762 null

(back to top)

Image Generation

Publish Date Title Authors PDF Code
2025-08-22 Decoding MGMT Methylation: A Step Towards Precision Medicine in Glioblastoma Hafeez Ur Rehman et.al. 2508.16424 null
2025-08-22 FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing Jiahao Chen et.al. 2508.16230 null
2025-08-25 OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models Huanpeng Chu et.al. 2508.16212 null
2025-08-22 Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers Shikang Zheng et.al. 2508.16211 null
2025-08-22 Competition and Attraction Improve Model Fusion João Abrantes et.al. 2508.16204 null
2025-08-22 RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution Haodong He et.al. 2508.16158 null
2025-08-22 Two-flow Feedback Multi-scale Progressive Generative Adversarial Network Sun Weikai et.al. 2508.16089 null
2025-08-21 Structure-Preserving Medical Image Generation from a Latent Graph Representation Kevin Arias et.al. 2508.15920 null
2025-08-21 CineScale: Free Lunch in High-Resolution Cinematic Visual Generation Haonan Qiu et.al. 2508.15774 null
2025-08-21 Visual Autoregressive Modeling for Instruction-Guided Image Editing Qingyang Mao et.al. 2508.15772 null
2025-08-21 Waver: Wave Your Way to Lifelike Video Generation Yifu Zhang et.al. 2508.15761 null
2025-08-21 Are Virtual DES Images a Valid Alternative to the Real Ones? Ana C. Perre et.al. 2508.15594 null
2025-08-21 GenTune: Toward Traceable Prompts to Improve Controllability of Image Refinement in Environment Design Wen-Fan Wang et.al. 2508.15227 null
2025-08-21 See it. Say it. Sorted: Agentic System for Compositional Diagram Generation Hantao Zhang et.al. 2508.15222 null
2025-08-20 Side Effects of Erasing Concepts from Diffusion Models Shaswati Saha et.al. 2508.15124 null
2025-08-20 CurveFlow: Curvature-Guided Flow Matching for Image Generation Yan Luo et.al. 2508.15093 null
2025-08-20 TAIGen: Training-Free Adversarial Image Generation via Diffusion Models Susim Roy et.al. 2508.15020 null
2025-08-20 SATURN: Autoregressive Image Generation Guided by Scene Graphs Thanh-Nhan Vo et.al. 2508.14502 null
2025-08-20 Multimode Fiber Imaging Based on Hydrogel Fiber Lele He et.al. 2508.14501 null
2025-08-20 MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion Fei Peng et.al. 2508.14440 null
2025-08-20 CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities Yue Gong et.al. 2508.14405 null
2025-08-20 Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning Junchao Zhu et.al. 2508.14393 null
2025-08-19 Tooth-Diffusion: Guided 3D CBCT Synthesis with Fine-Grained Tooth Conditioning Said Djafar Said et.al. 2508.14276 null
2025-08-19 SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation Paul Grimal et.al. 2508.13866 null
2025-08-19 Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing Feng-Lin Liu et.al. 2508.13797 null
2025-08-20 DiffIER: Optimizing Diffusion Models with Iterative Error Reduction Ao Chen et.al. 2508.13628 null
2025-08-20 2D Gaussians Meet Visual Tokenizer Yiang Shi et.al. 2508.13515 null
2025-08-19 AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes Tianyi Xu et.al. 2508.13503 null
2025-08-18 ID-Card Synthetic Generation: Toward a Simulated Bona fide Dataset Qingwen Zeng et.al. 2508.13078 null
2025-08-18 From Transthoracic to Transesophageal: Cross-Modality Generation using LoRA Diffusion Emmanuel Oladokun et.al. 2508.13077 null
2025-08-18 7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models Elena Izzo et.al. 2508.12919 null
2025-08-18 Next Visual Granularity Generation Yikai Wang et.al. 2508.12811 null
2025-08-18 Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score Syed Muhmmad Israr et.al. 2508.12718 null
2025-08-18 Stable Diffusion-Based Approach for Human De-Occlusion Seung Young Noh et.al. 2508.12663 null
2025-08-17 Say It, See It: A Systematic Evaluation on Speech-Based 3D Content Generation Methods in Augmented Reality Yanming Xiu et.al. 2508.12498 null
2025-08-17 DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models Xiaochuan Lin et.al. 2508.12396 null
2025-08-17 Semantic Discrepancy-aware Detector for Image Forgery Identification Ziye Wang et.al. 2508.12341 null
2025-08-17 Sketchar: Supporting Character Design and Illustration Prototyping Using Generative AI Long Ling et.al. 2508.12333 null
2025-08-15 Training-Free Anomaly Generation via Dual-Attention Enhancement in Diffusion Model Zuo Zuo et.al. 2508.11550 null
2025-08-15 SPG: Style-Prompting Guidance for Style-Specific Content Creation Qian Liang et.al. 2508.11476 null
2025-08-15 MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation Qian Liang et.al. 2508.11433 null
2025-08-15 AnatoMaskGAN: GNN-Driven Slice Feature Fusion and Noise Augmentation for Medical Semantic Image Synthesis Zonglin Wu et.al. 2508.11375 null
2025-08-18 TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation Yilin Mi et.al. 2508.11284 null
2025-08-15 Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension Zhenhao Li et.al. 2508.11211 null
2025-08-15 StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation Seungmi Lee et.al. 2508.11203 null
2025-08-15 LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction Maoquan Zhang et.al. 2508.11153 null
2025-08-14 Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models Basile Lewandowski et.al. 2508.10993 null
2025-08-16 Object Fidelity Diffusion for Remote Sensing Image Generation Ziqi Ye et.al. 2508.10801 null
2025-07-22 Pyramid Hierarchical Masked Diffusion Model for Imaging Synthesis Xiaojiao Xiao et.al. 2507.16579 null
2025-07-22 ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement Kahim Wong et.al. 2507.16397 null

(back to top)

Music Generation

Publish Date Title Authors PDF Code
2025-08-12 QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems Chien-Chun Wang et.al. 2508.08957 null
2025-08-12 Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems Liam Pram et.al. 2508.08805 null
2025-08-08 Live Music Models Lyria Team et.al. 2508.04651 null
2025-08-03 Automatic Melody Reduction via Shortest Path Finding Ziyu Wang et.al. 2508.01571 null
2025-07-31 DeformTune: A Deformable XAI Music Prototype for Non-Musicians Ziqing Xu et.al. 2508.00160 null
2025-07-31 "I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation Bob L. T. Sturm et.al. 2507.23365 null
2025-07-28 Music Arena: Live Evaluation for Text-to-Music Yonghyun Kim et.al. 2507.20900 null
2025-07-28 Controllable Video-to-Music Generation with Multiple Time-Varying Conditions Junxian Wu et.al. 2507.20627 null
2025-07-27 Diffusion-based Symbolic Music Generation with Structured State Space Models Shenghua Yuan et.al. 2507.20128 null
2025-08-07 SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion Hei Shing Cheung et.al. 2507.19991 null
2025-07-17 A new XML conversion process for mensural music encoding : CMME_to_MEI (via Verovio) David Fiala et.al. 2507.15991 null
2025-07-17 WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling Qihui Yang et.al. 2507.10534 null
2025-07-07 Evaluating Fake Music Detection Performance Under Audio Augmentations Tomasz Sroka et.al. 2507.10447 null
2025-07-14 ASTAR-NTU solution to AudioMOS Challenge 2025 Track1 Fabian Ritter-Gutierrez et.al. 2507.09904 null
2025-07-09 Exploring State-Space-Model based Language Model in Music Generation Wei-Jaw Lee et.al. 2507.06674 null
2025-07-08 MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation Fathinah Izzati et.al. 2507.05894 null
2025-07-07 EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation Fathinah Izzati et.al. 2507.04955 null
2025-07-04 MusGO: A Community-Driven Framework For Assessing Openness in Music-Generative AI Roser Batlle-Roca et.al. 2507.03599 null
2025-06-29 The Florence Price Art Song Dataset and Piano Accompaniment Generator Tao-Tao He et.al. 2506.23130 null
2025-06-29 TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure Qi He et.al. 2506.23094 null
2025-06-26 Exploring Adapter Design Tradeoffs for Low Resource Music Generation Atharva Mehta et.al. 2506.21298 null
2025-06-23 A Fourier Explanation of AI-music Artifacts Darius Afchar et.al. 2506.19108 null
2025-06-23 Benchmarking Music Generation Models and Metrics via Human Preference Studies Florian Grötschla et.al. 2506.19085 null
2025-06-23 Let Your Video Listen to Your Music! Xinyu Zhang et.al. 2506.18881 null
2025-06-24 MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners Fang-Duo Tsai et.al. 2506.18729 null
2025-06-28 AI-Generated Song Detection via Lyrics Transcripts Markus Frohmann et.al. 2506.18488 null
2025-06-23 Large-Scale Training Data Attribution for Music Generative Models via Unlearning Woosung Choi et.al. 2506.18312 null
2025-06-20 From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training Mingyang Yao et.al. 2506.17497 link

(back to top)

Audio Codec

Publish Date Title Authors PDF Code
2025-08-22 Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning Xueyao Zhang et.al. 2508.16332 null
2025-08-15 EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens Joonyong Park et.al. 2508.11273 null
2025-08-15 Benchmarking Prosody Encoding in Discrete Speech Tokens Kentaro Onda et.al. 2508.11224 null
2025-08-13 DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models Yuanyuan Wang et.al. 2508.08961 null
2025-08-11 Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations Ryo Aihara et.al. 2508.08399 null
2025-08-07 NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference Edresson Casanova et.al. 2508.05835 null
2025-08-07 SpectroStream: A Versatile Neural Codec for General Audio Yunpeng Li et.al. 2508.05207 null
2025-08-05 Real-time speech enhancement in noise for throat microphone using neural audio codec as foundation model Julien Hauret et.al. 2508.02974 null
2025-08-04 SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec Chunyu Qiang et.al. 2508.02849 null
2025-08-02 Multi-Granularity Adaptive Time-Frequency Attention Framework for Audio Deepfake Detection under Real-World Communication Degradations Haohan Shi et.al. 2508.01467 null
2025-08-01 Next Tokens Denoising for Speech Synthesis Yanqing Liu et.al. 2507.22746 null
2025-07-22 Step-Audio 2 Technical Report Boyong Wu et.al. 2507.16632 null
2025-07-17 Autoregressive Speech Enhancement via Acoustic Tokens Luca Della Libera et.al. 2507.12825 null
2025-07-17 Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine Anastasia Kuznetsova et.al. 2507.12701 null
2025-07-16 Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations Yichen Han et.al. 2507.12197 null
2025-07-16 Room Impulse Response Generation Conditioned on Acoustic Parameters Silvia Arellano et.al. 2507.12136 null
2025-07-14 Aligning Generative Speech Enhancement with Human Preferences via Direct Preference Optimization Haoyang Li et.al. 2507.09929 null
2025-07-14 Token-based Audio Inpainting via Discrete Diffusion Tali Dror et.al. 2507.08333 null
2025-07-10 Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders Dimitrios Bralios et.al. 2507.07867 null
2025-07-09 Speech Tokenizer is Key to Consistent Representation Wonjin Jung et.al. 2507.06802 null
2025-07-01 StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding Dake Guo et.al. 2506.23986 null
2025-07-09 XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs Yitian Gong et.al. 2506.23325 null
2025-06-27 DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding Yang Yang et.al. 2506.22362 null
2025-06-26 CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate Hankun Wang et.al. 2506.21074 null
2025-06-24 Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation Jun Wang et.al. 2506.19774 null
2025-06-20 LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization Daejin Jo et.al. 2506.16738 null
2025-06-18 Factorized RVQ-GAN For Disentangled Speech Tokenization Sameer Khurana et.al. 2506.15456 null
2025-06-17 A Variational Framework for Improving Naturalness in Generative Spoken Language Models Li-Wei Chen et.al. 2506.14767 link
2025-06-14 Towards Neural Audio Codec Source Parsing Orchid Chetia Phukan et.al. 2506.12627 null
2025-06-14 Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction Xiaoran Fan et.al. 2506.12537 null
2025-06-13 ViSAGe: Video-to-Spatial Audio Generation Jaeyeon Kim et.al. 2506.12199 null
2025-06-16 Discrete Audio Tokens: More Than a Survey! Pooneh Mousavi et.al. 2506.10274 null
2025-06-13 Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model Ailin Huang et.al. 2506.08967 null
2025-06-10 Towards Generalized Source Tracing for Codec-Based Deepfake Speech Xuanjun Chen et.al. 2506.07294 null
2025-06-19 Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training Sathvik Udupa et.al. 2506.07081 null
2025-06-04 Bringing Interpretability to Neural Audio Codecs Samir Sadok et.al. 2506.04492 null
2025-06-13 Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation Yuxuan Hu et.al. 2506.04392 null

(back to top)

Large Audio Language Model

Publish Date Title Authors PDF Code
2025-08-19 Lexical Hints of Accuracy in LLM Reasoning Chains Arne Vanhoyweghen et.al. 2508.15842 null
2025-08-18 Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models Zhifei Xie et.al. 2508.15827 null
2025-08-21 Conditionally adaptive augmented Lagrangian method for physics-informed learning of forward and inverse problems using artificial neural networks Qifeng Hu et.al. 2508.15695 null
2025-08-21 When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models Cheng Wang et.al. 2508.15407 null
2025-08-19 OmViD: Omni-supervised active learning for video action detection Aayush Rana et.al. 2508.13983 null
2025-08-19 FAMNet: Integrating 2D and 3D Features for Micro-expression Recognition via Multi-task Learning and Hierarchical Attention Liangyu Fu et.al. 2508.13483 null
2025-08-19 Modeling and Control of AWOISV: A Filtered Tube-Based MPC Approach for Simultaneous Tracking of Lateral Position and Heading Angle Xu Yang et.al. 2508.13457 null
2025-08-18 Omni Survey for Multimodality Analysis in Visual Object Tracking Zhangyong Tang et.al. 2508.13000 null
2025-08-16 OmniD: Generalizable Robot Manipulation Policy via Image-Based BEV Representation Jilei Mao et.al. 2508.11898 null
2025-08-15 Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models Bing Liu et.al. 2508.11165 null
2025-08-15 Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation Bing Liu et.al. 2508.11134 null
2025-08-14 HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs Zheng Qin et.al. 2508.10576 null
2025-08-13 Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning Vaishnavi Shrivastava et.al. 2508.09726 null
2025-08-13 A Chain of Diagnosis Framework for Accurate and Explainable Radiology Report Generation Haibo Jin et.al. 2508.09566 null
2025-08-11 MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models Fan Zhang et.al. 2508.09210 null
2025-08-11 MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios Shuai Wang et.al. 2508.08155 null
2025-08-12 Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning Shu Wu et.al. 2508.08039 null
2025-08-12 Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation Fangyuan Mao et.al. 2508.07981 null
2025-08-10 AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning Siminfar Samakoush Galougah et.al. 2508.07470 null
2025-07-21 Prospects of detecting rotational flatness of exoplanets from space-based photometry Sz. Kálmán et.al. 2507.15359 null
2025-07-21 The CHEOPS view of HD 95338b: refined transit parameters, and a search for exomoons Sz. Kálmán et.al. 2507.15318 null
2025-07-20 Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards Derek Li et.al. 2507.14783 null
2025-07-19 Anisotropic Anderson localization in higher-dimensional nonreciprocal lattices Jinyuan Shang et.al. 2507.14523 null
2025-07-18 RiNNAL+: a Riemannian ALM Solver for SDP-RLT Relaxations of Mixed-Binary Quadratic Programs Di Hou et.al. 2507.13776 null
2025-07-17 DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model Han Zhang et.al. 2507.13087 null
2025-07-17 AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning Yiming Ren et.al. 2507.12841 null
2025-07-16 An augmented Lagrangian method for strongly regular minimizers in a class of convex composite optimization problems Chengjing Wang et.al. 2507.12040 null
2025-07-15 UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks Peiran Wu et.al. 2507.11336 null
2025-07-14 MultiVox: Benchmarking Voice Assistants for Multimodal Interactions Ramaneswaran Selvakumar et.al. 2507.10859 null
2025-07-14 The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents Lixu Wang et.al. 2507.10016 null
2025-07-14 The electronic and transport properties in the Haldane-Hubbard with odd-parity altermagnetism Minghuan Zeng et.al. 2507.09906 null
2025-07-11 Two-Level Distributed Interference Management for Large-Scale HAPS-Empowered vHetNets Afsoon Alidadi Shamsabadi et.al. 2507.08299 null
2025-07-10 Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models Arushi Goel et.al. 2507.08128 null
2025-07-09 Omni-Fusion of Spatial and Spectral for Hyperspectral Image Segmentation Qing Zhang et.al. 2507.06606 null
2025-07-09 Omni-Video: Democratizing Unified Video Understanding and Generation Zhiyu Tan et.al. 2507.06119 null
2025-07-08 ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark He Wang et.al. 2507.05727 null
2025-07-08 Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Zijin Gu et.al. 2507.05724 null
2025-07-07 Electronic transport and anti-super-Klein tunneling in few-layer black phosphorous Jorge Alfonso Lizarraga-Brito et.al. 2507.05462 null
2025-07-03 DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment Ke-Han Lu et.al. 2507.02768 null
2025-06-29 VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems Ethan Smyth et.al. 2507.00079 null
2025-06-28 UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments Dayong Su et.al. 2506.22736 null
2025-06-27 Augmented Lagrangian methods for infeasible convex optimization problems and diverging proximal-point algorithms Roland Andrews et.al. 2506.22428 null
2025-06-26 Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation Guanting Dong et.al. 2506.21384 null
2025-06-26 HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context Qize Yang et.al. 2506.21277 null
2025-06-26 Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation Yihong Cao et.al. 2506.21198 null
2025-06-29 OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs Yiman Zhang et.al. 2506.20960 null
2025-06-23 OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation Qijun Gan et.al. 2506.18866 null
2025-06-22 Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents Jinjie Wei et.al. 2506.17913 null

(back to top)
