GitHub - WWWWxp/arxiv_daily

Updated on 2025.08.26

Usage instructions: here

Table of Contents

Text to Speech
Text to Audio
Video to Audio
Voice Conversion
Video Generation
Image Generation
Music Generation
Audio Codec
Large Audio Language Model

Text to Speech

Publish Date	Title	Authors	PDF	Code
2025-08-22	Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation	Weiting Tan et.al.	2508.16188	link
2025-08-21	QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection	Zhiyu Wu et.al.	2508.15931	null
2025-08-21	Abelian integrals for polynomials with trivial global monodromy on $\mathbb{C}^2$	Jesús Muciño-Raymundo et.al.	2508.15925	null
2025-08-21	Any-to-any Speaker Attribute Perturbation for Asynchronous Voice Anonymization	Liping Chen et.al.	2508.15565	null
2025-08-24	Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets	Chenlin Liu et.al.	2508.15442	null
2025-08-21	UniCoM: A Universal Code-Switching Speech Generator	Sangmin Lee et.al.	2508.15244	link
2025-08-25	Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization	Rui Wang et.al.	2508.14947	null
2025-08-20	Long-Context Speech Synthesis with Context-Aware Memory	Zhipeng Li et.al.	2508.14713	null
2025-08-20	Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement	Heitor R. Guimarães et.al.	2508.14709	null
2025-08-22	Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS	Can Jin et.al.	2508.14313	null
2025-08-19	Exponential Ergodicity for McKean-Vlasov SDEs with Singular Interactions	Xing Huang et.al.	2508.13924	null
2025-08-20	DiffIER: Optimizing Diffusion Models with Iterative Error Reduction	Ao Chen et.al.	2508.13628	null
2025-08-19	Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM	Dariia Puhach et.al.	2508.13603	null
2025-08-18	A Surveillance Based Interactive Robot	Kshitij Kavimandan et.al.	2508.13319	link
2025-08-18	MrMARTIAN: A Multi-resolution Mass Reconstruction Algorithm Combining Free-form and Analytic Components	Sangjun Cha et.al.	2508.13262	null
2025-08-18	Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis	Zhu Li et.al.	2508.13028	null
2025-08-18	Cooperative Sensing-Assisted Predictive Beam Tracking for MIMO-OFDM Networked ISAC Systems	Xiaoyu Yang et.al.	2508.12723	null
2025-08-18	Real-Time Sign Language Gestures to Speech Transcription using Deep Learning	Brandone Fonya et.al.	2508.12713	null
2025-08-19	FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts	Qingliang Meng et.al.	2508.12001	null
2025-08-16	SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System	Truong Thanh Hung Nguyen et.al.	2508.11873	null
2025-08-15	MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts	Heyang Xue et.al.	2508.11326	null
2025-08-15	EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens	Joonyong Park et.al.	2508.11273	null
2025-08-14	Towards high-precision inspiral gravitational waveforms from binary neutron star mergers in numerical relativity	Kenta Kiuchi et.al.	2508.10981	null
2025-08-14	Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform	Yuankun Xie et.al.	2508.10559	null
2025-08-14	Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning	Yejin Jeon et.al.	2508.10412	null
2025-08-14	Towards Frame-level Quality Predictions of Synthetic Speech	Michael Kuhlmann et.al.	2508.10374	link
2025-08-13	Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions	Tina Raissi et.al.	2508.09868	null
2025-08-13	UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech	Shuhei Kato et.al.	2508.09767	null
2025-08-13	$\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation	Boyu Zhu et.al.	2508.09702	null
2025-08-12	ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs	Eray Eren et.al.	2508.09389	null
2025-07-21	Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models	Kaiyan Chang et.al.	2507.15512	null
2025-07-21	Lunar and Terrestrial Time Transformation Based on the Principle of General Relativity	Min Liu et.al.	2507.15456	null
2025-07-21	A2TTS: TTS for Low Resource Indian Languages	Ayush Singh Bhadoriya et.al.	2507.15272	null
2025-07-21	EchoVoices: Preserving Generational Voices and Memories for Seniors and Children	Haiying Xu et.al.	2507.15221	null
2025-07-20	Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding	Yuanhan Zhang et.al.	2507.15028	null
2025-07-22	Hear Your Code Fail, Voice-Assisted Debugging for Python	Sayed Mahbub Hasan Amiri et.al.	2507.15007	null
2025-07-20	DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis	Yinghao Aaron Li et.al.	2507.14988	null
2025-07-20	MUR: Momentum Uncertainty guided Reasoning for Large Language Models	Hang Yan et.al.	2507.14958	null
2025-07-20	FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing	Shoutao Guo et.al.	2507.14815	null
2025-07-18	Inflated hot Jupiters: inferring average atmospheric velocity via Ohmic models coupled with internal dynamo evolution	Daniele Viganò et.al.	2507.13991	null
2025-07-18	Charged lepton flavor violating decays with a pair of light dark matter and muonium invisible decay	Sahabub Jahedi et.al.	2507.13876	null
2025-07-17	A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models	Kirill Borodin et.al.	2507.13563	null
2025-07-17	NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech	Maksim Borisov et.al.	2507.13155	null
2025-07-17	Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication	Tianyu Song et.al.	2507.13052	null
2025-07-17	Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes	Zhou Feng et.al.	2507.12932	null
2025-07-16	Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations	Yichen Han et.al.	2507.12197	null
2025-07-16	EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis	Haoxun Li et.al.	2507.12015	null
2025-07-17	Comprehensive investigation on baryon number violating nucleon decays involving an axion-like particle	Wei-Qi Fan et.al.	2507.11844	null
2025-07-15	Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection	Ivan Viakhirev et.al.	2507.11777	null
2025-07-15	P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge	Marvin Sach et.al.	2507.11306	null

(back to top)

Text to Audio

Publish Date	Title	Authors	PDF	Code
2025-08-22	Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment	Youjia Zhang et.al.	2508.15568	null
2025-08-21	DualMark: Identifying Model and Training Data Origins in Generated Audio	Xuefeng Yang et.al.	2508.15521	null
2025-08-19	MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence	Sonal Kumar et.al.	2508.13992	null
2025-08-19	DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer	Yisu Liu et.al.	2508.13786	null
2025-08-21	FoleySpace: Vision-Aligned Binaural Spatial Audio Generation	Lei Zhao et.al.	2508.12918	null
2025-08-18	TTA-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions	Dongjae Jeon et.al.	2508.12690	null
2025-08-15	Pretrained Conformers for Audio Fingerprinting and Retrieval	Kemal Altwlkany et.al.	2508.11609	null
2025-08-14	LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters	Haomin Zhang et.al.	2508.11074	null
2025-08-14	A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation	Jiulin Li et.al.	2508.10494	null
2025-08-13	TOTNet: Occlusion-Aware Temporal Tracking for Robust Ball Detection in Sports Videos	Hao Xu et.al.	2508.09650	null
2025-08-12	QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems	Chien-Chun Wang et.al.	2508.08957	null
2025-08-20	MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling	Qian Wang et.al.	2508.08487	null
2025-08-11	Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization	Nicholas Klein et.al.	2508.08141	null
2025-08-11	Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models	Khanh-Binh Nguyen et.al.	2508.07570	null
2025-08-08	MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows	Xiquan Li et.al.	2508.06098	null
2025-08-08	DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching	Wei Chen et.al.	2508.05978	null
2025-07-22	TTMBA: Towards Text To Multiple Sources Binaural Audio Generation	Yuxuan He et.al.	2507.16564	null
2025-07-21	An Investigation of Test-time Adaptation for Audio Classification under Background Noise	Weichuang Shao et.al.	2507.15523	null
2025-07-18	CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation	Marc Lafon et.al.	2507.14312	null
2025-07-16	Evaluation of Neural Surrogates for Physical Modelling Synthesis of Nonlinear Elastic Plates	Carlos De La Vega Martin et.al.	2507.12563	null
2025-07-16	Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations	Yichen Han et.al.	2507.12197	null
2025-07-16	GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models	Zhaohong Huang et.al.	2507.11969	null
2025-07-14	DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis	Wenjie Tian et.al.	2507.10109	null
2025-07-14	Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction	Shu-wen Yang et.al.	2507.09834	null
2025-07-13	Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations	Yiwen Liang et.al.	2507.09500	null
2025-07-11	Monitoring Risks in Test-Time Adaptation	Mona Schirmer et.al.	2507.08721	null
2025-07-11	BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis	Shuang Cui et.al.	2507.08607	null
2025-07-11	FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation	Yuxuan Jiang et.al.	2507.08557	null
2025-07-11	MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling	Jingjing Tang et.al.	2507.08530	null
2025-07-10	Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement	Xiao Yang et.al.	2507.07908	null
2025-07-10	Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos	Hao Xu et.al.	2507.07381	null
2025-07-09	Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM	Qiyuan Dai et.al.	2507.06973	null
2025-07-09	Physics-Informed Direction-Aware Neural Acoustic Fields	Yoshiki Masuyama et.al.	2507.06826	null
2025-07-13	Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation	Yingshan Liang et.al.	2507.04959	null
2025-07-05	MMMOS: Multi-domain Multi-axis Audio Quality Assessment	Yi-Cheng Lin et.al.	2507.04094	null
2025-07-04	Dynamic Multimodal Prototype Learning in Vision-Language Models	Xingyu Zhu et.al.	2507.03657	null
2025-07-03	F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning	Wei Li et.al.	2507.02437	null
2025-07-03	Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation	Feizhen Huang et.al.	2507.02271	null
2025-07-02	Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation	Andrei Jelea et.al.	2507.01347	null
2025-07-01	AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences	Minoru Kishi et.al.	2507.00475	null
2025-07-01	Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation	Jizhou Han et.al.	2507.00462	null
2025-06-30	The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models	Lijun Sheng et.al.	2506.24000	null
2025-06-30	Scaling Self-Supervised Representation Learning for Symbolic Piano Performance	Louis Bradshaw et.al.	2506.23869	null
2025-06-30	When Small Guides Large: Cross-Model Co-Learning for Test-Time Adaptation	Chang'an Yi et.al.	2506.23724	null
2025-06-30	RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio	Yusuke Kanamori et.al.	2506.23582	null
2025-06-30	Human-CLAP: Human-perception-based contrastive language-audio pretraining	Taisei Takano et.al.	2506.23553	null
2025-06-27	SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition	Muhammad Umar Farooq et.al.	2506.22143	null
2025-06-28	ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing	Huadai Liu et.al.	2506.21448	null
2025-06-27	Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance	Akio Hayakawa et.al.	2506.20995	null
2025-06-24	Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation	Jun Wang et.al.	2506.19774	null

(back to top)

Video to Audio

Publish Date	Title	Authors	PDF	Code
2025-08-19	InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing	Shaoshu Yang et.al.	2508.14033	null
2025-08-21	FoleySpace: Vision-Aligned Binaural Spatial Audio Generation	Lei Zhao et.al.	2508.12918	null
2025-08-14	LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters	Haomin Zhang et.al.	2508.11074	null
2025-08-12	Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization	Chaoqun Cui et.al.	2508.08550	null
2025-07-14	DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis	Wenjie Tian et.al.	2507.10109	null
2025-07-13	Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation	Yingshan Liang et.al.	2507.04959	null
2025-06-23	Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions	Vineet Kumar Rakesh et.al.	2507.02900	null
2025-07-03	Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation	Feizhen Huang et.al.	2507.02271	null
2025-06-23	IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech	Siyi Zhou et.al.	2506.21619	null
2025-06-28	ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing	Huadai Liu et.al.	2506.21448	null
2025-06-27	Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance	Akio Hayakawa et.al.	2506.20995	null
2025-06-24	Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation	Jun Wang et.al.	2506.19774	null
2025-06-13	ViSAGe: Video-to-Spatial Audio Generation	Jaeyeon Kim et.al.	2506.12199	null
2025-05-31	Length Aware Speech Translation for Video Dubbing	Harveen Singh Chadha et.al.	2506.00740	null
2025-05-26	Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks	Chang Liu et.al.	2505.20038	link
2025-05-22	SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet	Zhi Zhong et.al.	2505.16195	null
2025-05-30	TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis	Yu Zhang et.al.	2505.14910	link
2025-05-28	Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model	Yong Ren et.al.	2505.13062	null
2025-06-03	OmniAudio: Generating Spatial Audio from 360-Degree Video	Huadai Liu et.al.	2504.14906	link
2025-04-17	CAFA: a Controllable Automatic Foley Artist	Roi Benita et.al.	2504.06778	link

(back to top)

Voice Conversion

Publish Date	Title	Authors	PDF	Code
2025-08-22	LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence	Alisa Vinogradova et.al.	2508.16571	null
2025-08-21	Evolving k-Threshold Visual Cryptography Schemes	Xiaoli Zhuo et.al.	2508.15917	null
2025-08-20	Maximum Size of a Uniform Family with Bounded VC-dimension	Tianchi Yang et.al.	2508.14334	null
2025-08-20	Fortifying the Agentic Web: A Unified Zero-Trust Architecture Against Logic-layer Threats	Ken Huang et.al.	2508.12259	null
2025-08-13	Perturbed Public Voices (P$^{2}$V): A Dataset for Robust Audio Deepfake Detection	Chongyang Gao et.al.	2508.10949	null
2025-08-13	Regularity for hypergraphs with bounded VC$_2$ dimension	Lior Gishboliner et.al.	2508.09969	null
2025-08-11	Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations	Ryo Aihara et.al.	2508.08399	null
2025-08-10	Scalable Controllable Accented TTS	Henry Li Xinyuan et.al.	2508.07426	null
2025-08-09	Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody	Jinsung Yoon et.al.	2508.06890	null
2025-08-08	DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching	Wei Chen et.al.	2508.05978	null
2025-08-07	Grouped k-threshold random grid-based visual cryptography scheme	Xiaoli Zhuo et.al.	2508.05394	null
2025-08-15	Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS	M Anuprabha et.al.	2508.05102	null
2025-08-08	REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers	Yuepeng Jiang et.al.	2508.04996	null
2025-08-14	Marco-Voice Technical Report	Fengping Tian et.al.	2508.02038	null
2025-07-23	Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion	Yu Zhang et.al.	2507.14534	null
2025-07-17	Computational-Statistical Tradeoffs from NP-hardness	Guy Blanc et.al.	2507.13222	null
2025-07-17	Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries	Minyoung Kim et.al.	2507.12723	null
2025-07-15	Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection	Ivan Viakhirev et.al.	2507.11777	null
2025-07-16	Multipass Linear Sketches for Geometric LP-Type Problems	N. Efe Çekirge et.al.	2507.11484	null
2025-07-15	On Tight Robust Coresets for $k$-Medians Clustering	Lingxiao Huang et.al.	2507.11260	null
2025-07-15	Pronunciation Deviation Analysis Through Voice Cloning and Acoustic Comparison	Andrew Valdivia et.al.	2507.10985	null
2025-07-12	Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning	Dominika Woszczyk et.al.	2507.09310	null
2025-07-11	SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment	Shivam Mehta et.al.	2507.09070	null
2025-07-11	Detecting Deepfake Talking Heads from Facial Biometric Anomalies	Justin D. Norman et.al.	2507.08917	null
2025-07-11	On Fair Epsilon Net and Geometric Hitting Set	Mohsen Dehghankar et.al.	2507.08758	null
2025-07-08	On the pointwise and sup-norm errors for local regression estimators	Jérémy Bettinger et.al.	2507.07132	null
2025-07-09	Speech Tokenizer is Key to Consistent Representation	Wonjin Jung et.al.	2507.06802	null
2025-07-07	Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters	Mathilde Abrassart et.al.	2507.04817	null
2025-07-06	TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet	Jaeseok Jeong et.al.	2507.04349	null
2025-07-04	Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion	Lea Fischbach et.al.	2507.03641	null
2025-07-04	Going Beyond Surfaces in Diameter Approximation	Michał Włodarczyk et.al.	2507.03447	null
2025-07-03	De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks	Wei Fan et.al.	2507.02606	null
2025-07-03	Open-Source System for Multilingual Translation and Cloned Speech Synthesis	Mateo Cámara et.al.	2507.02530	null
2025-07-03	JoyTTS: LLM-based Spoken Chatbot With Voice Cloning	Fangru Zhou et.al.	2507.02380	null
2025-07-02	Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis	Marc-André Carbonneau et.al.	2507.02176	null
2025-07-02	Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora	Hitoshi Suda et.al.	2507.01356	null
2025-07-01	MuteSwap: Silent Face-based Voice Conversion	Yifan Liu et.al.	2507.00498	null
2025-06-26	Avatars and Environments for Meetings in Social VR: What Styles and Choices Matter to People in Group Creativity Tasks?	Anya Osborne et.al.	2506.21780	null
2025-06-23	Selecting N-lowest scores for training MOS prediction models	Yuto Kondo et.al.	2506.18326	null
2025-06-23	Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting	Yuto Kondo et.al.	2506.18307	null
2025-06-23	JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles	Yuto Kondo et.al.	2506.18296	null
2025-06-12	RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding	Yisi Liu et.al.	2506.10289	null

(back to top)

Video Generation

Publish Date	Title	Authors	PDF	Code
2025-08-22	Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation	Chun-Peng Chang et.al.	2508.16512	null
2025-08-25	OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models	Huanpeng Chu et.al.	2508.16212	null
2025-08-22	Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers	Shikang Zheng et.al.	2508.16211	null
2025-08-21	Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning	Yijun Liu et.al.	2508.15874	null
2025-08-21	CineScale: Free Lunch in High-Resolution Cinematic Visual Generation	Haonan Qiu et.al.	2508.15774	null
2025-08-21	Scaling Group Inference for Diverse and High-Quality Generation	Gaurav Parmar et.al.	2508.15773	null
2025-08-21	Waver: Wave Your Way to Lifelike Video Generation	Yifu Zhang et.al.	2508.15761	null
2025-08-21	WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception	Zhiheng Liu et.al.	2508.15720	null
2025-08-21	VideoEraser: Concept Erasure in Text-to-Video Diffusion Models	Naen Xu et.al.	2508.15314	null
2025-08-20	AnchorSync: Global Consistency Optimization for Long Video Editing	Zichi Liu et.al.	2508.14609	null
2025-08-20	Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration	Haoran Bai et.al.	2508.14483	null
2025-08-20	DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing	Weitao Wang et.al.	2508.14465	null
2025-08-20	MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation	Guile Wu et.al.	2508.14327	null
2025-08-19	InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing	Shaoshu Yang et.al.	2508.14033	null
2025-08-19	Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment	Samuel Seligardi et.al.	2508.13989	null
2025-08-19	Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing	Feng-Lin Liu et.al.	2508.13797	null
2025-08-18	4DNeX: Feed-Forward 4D Generative Modeling Made Easy	Zhaoxi Chen et.al.	2508.13154	null
2025-08-18	Precise Action-to-Video Generation Through Visual Action Prompts	Yuang Wang et.al.	2508.13104	null
2025-08-18	EgoTwin: Dreaming Body and View in First Person	Jingqiao Xiu et.al.	2508.13013	null
2025-08-18	Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model	Xianglong He et.al.	2508.13009	null
2025-08-18	Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation	Qirui Li et.al.	2508.12969	null
2025-08-18	Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models	Jianshu Zeng et.al.	2508.12945	null
2025-08-18	S^2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models	Chubin Chen et.al.	2508.12880	null
2025-08-18	E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model	Ronghao Lin et.al.	2508.12854	null
2025-08-18	MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration	Yuanxin Wei et.al.	2508.12691	null
2025-08-17	TiP4GEN: Text to Immersive Panorama 4D Scene Generation	Ke Xing et.al.	2508.12415	null
2025-08-15	CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models	Xiaoxue Wu et.al.	2508.11484	null
2025-08-14	LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters	Haomin Zhang et.al.	2508.11074	null
2025-08-14	GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning	Kelin Yu et.al.	2508.11049	null
2025-08-14	EVCtrl: Efficient Control Adapter for Visual Generation	Zixiang Yang et.al.	2508.10963	null
2025-08-14	Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation	Harold Haodong Chen et.al.	2508.10858	null
2025-08-14	Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation	Youping Gu et.al.	2508.10774	null
2025-08-14	AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences	Jieyu Li et.al.	2508.10771	null
2025-08-14	HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis	Shiyu Liu et.al.	2508.10566	null
2025-08-13	LIA-X: Interpretable Latent Portrait Animator	Yaohui Wang et.al.	2508.09959	null
2025-08-13	Physical Autoregressive Model for Robotic Manipulation without Action Pretraining	Zijian Song et.al.	2508.09822	null
2025-07-22	MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation	Yanchen Liu et.al.	2507.16310	null
2025-07-22	PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation	Yaofang Liu et.al.	2507.16116	null
2025-07-21	Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models	Enes Sanli et.al.	2507.15824	null
2025-07-21	TokensGen: Harnessing Condensed Tokens for Long Video Generation	Wenqi Ouyang et.al.	2507.15728	null
2025-07-21	Conditional Video Generation for High-Efficiency Video Compression	Fangqiu Yi et.al.	2507.15269	null
2025-07-19	BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM	Haiquan Wen et.al.	2507.14632	null
2025-07-19	Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey	Jiahui Zhang et.al.	2507.14501	null
2025-07-18	Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis	Tongtong Su et.al.	2507.13753	null
2025-07-17	$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention	Dmitrii Mikhailov et.al.	2507.13546	null
2025-07-17	"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models	Jing Gu et.al.	2507.13428	null
2025-07-17	Taming Diffusion Transformer for Real-Time Mobile Video Generation	Yushu Wu et.al.	2507.13343	null
2025-07-17	Leveraging Pre-Trained Visual Models for AI-Generated Video Detection	Keerthi Veeramachaneni et.al.	2507.13224	null
2025-07-17	LoViC: Efficient Long Video Generation with Context Compression	Jiaxiu Jiang et.al.	2507.12952	null
2025-07-17	World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving	Yanchen Guan et.al.	2507.12762	null

(back to top)

Image Generation

Publish Date	Title	Authors	PDF	Code
2025-08-22	Decoding MGMT Methylation: A Step Towards Precision Medicine in Glioblastoma	Hafeez Ur Rehman et.al.	2508.16424	null
2025-08-22	FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing	Jiahao Chen et.al.	2508.16230	null
2025-08-25	OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models	Huanpeng Chu et.al.	2508.16212	null
2025-08-22	Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers	Shikang Zheng et.al.	2508.16211	null
2025-08-22	Competition and Attraction Improve Model Fusion	João Abrantes et.al.	2508.16204	null
2025-08-22	RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution	Haodong He et.al.	2508.16158	null
2025-08-22	Two-flow Feedback Multi-scale Progressive Generative Adversarial Network	Sun Weikai et.al.	2508.16089	null
2025-08-21	Structure-Preserving Medical Image Generation from a Latent Graph Representation	Kevin Arias et.al.	2508.15920	null
2025-08-21	CineScale: Free Lunch in High-Resolution Cinematic Visual Generation	Haonan Qiu et.al.	2508.15774	null
2025-08-21	Visual Autoregressive Modeling for Instruction-Guided Image Editing	Qingyang Mao et.al.	2508.15772	null
2025-08-21	Waver: Wave Your Way to Lifelike Video Generation	Yifu Zhang et.al.	2508.15761	null
2025-08-21	Are Virtual DES Images a Valid Alternative to the Real Ones?	Ana C. Perre et.al.	2508.15594	null
2025-08-21	GenTune: Toward Traceable Prompts to Improve Controllability of Image Refinement in Environment Design	Wen-Fan Wang et.al.	2508.15227	null
2025-08-21	See it. Say it. Sorted: Agentic System for Compositional Diagram Generation	Hantao Zhang et.al.	2508.15222	null
2025-08-20	Side Effects of Erasing Concepts from Diffusion Models	Shaswati Saha et.al.	2508.15124	null
2025-08-20	CurveFlow: Curvature-Guided Flow Matching for Image Generation	Yan Luo et.al.	2508.15093	null
2025-08-20	TAIGen: Training-Free Adversarial Image Generation via Diffusion Models	Susim Roy et.al.	2508.15020	null
2025-08-20	SATURN: Autoregressive Image Generation Guided by Scene Graphs	Thanh-Nhan Vo et.al.	2508.14502	null
2025-08-20	Multimode Fiber Imaging Based on Hydrogel Fiber	Lele He et.al.	2508.14501	null
2025-08-20	MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion	Fei Peng et.al.	2508.14440	null
2025-08-20	CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities	Yue Gong et.al.	2508.14405	null
2025-08-20	Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning	Junchao Zhu et.al.	2508.14393	null
2025-08-19	Tooth-Diffusion: Guided 3D CBCT Synthesis with Fine-Grained Tooth Conditioning	Said Djafar Said et.al.	2508.14276	null
2025-08-19	SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation	Paul Grimal et.al.	2508.13866	null
2025-08-19	Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing	Feng-Lin Liu et.al.	2508.13797	null
2025-08-20	DiffIER: Optimizing Diffusion Models with Iterative Error Reduction	Ao Chen et.al.	2508.13628	null
2025-08-20	2D Gaussians Meet Visual Tokenizer	Yiang Shi et.al.	2508.13515	null
2025-08-19	AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes	Tianyi Xu et.al.	2508.13503	null
2025-08-18	ID-Card Synthetic Generation: Toward a Simulated Bona fide Dataset	Qingwen Zeng et.al.	2508.13078	null
2025-08-18	From Transthoracic to Transesophageal: Cross-Modality Generation using LoRA Diffusion	Emmanuel Oladokun et.al.	2508.13077	null
2025-08-18	7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models	Elena Izzo et.al.	2508.12919	null
2025-08-18	Next Visual Granularity Generation	Yikai Wang et.al.	2508.12811	null
2025-08-18	Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score	Syed Muhmmad Israr et.al.	2508.12718	null
2025-08-18	Stable Diffusion-Based Approach for Human De-Occlusion	Seung Young Noh et.al.	2508.12663	null
2025-08-17	Say It, See It: A Systematic Evaluation on Speech-Based 3D Content Generation Methods in Augmented Reality	Yanming Xiu et.al.	2508.12498	null
2025-08-17	DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models	Xiaochuan Lin et.al.	2508.12396	null
2025-08-17	Semantic Discrepancy-aware Detector for Image Forgery Identification	Ziye Wang et.al.	2508.12341	null
2025-08-17	Sketchar: Supporting Character Design and Illustration Prototyping Using Generative AI	Long Ling et.al.	2508.12333	null
2025-08-15	Training-Free Anomaly Generation via Dual-Attention Enhancement in Diffusion Model	Zuo Zuo et.al.	2508.11550	null
2025-08-15	SPG: Style-Prompting Guidance for Style-Specific Content Creation	Qian Liang et.al.	2508.11476	null
2025-08-15	MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation	Qian Liang et.al.	2508.11433	null
2025-08-15	AnatoMaskGAN: GNN-Driven Slice Feature Fusion and Noise Augmentation for Medical Semantic Image Synthesis	Zonglin Wu et.al.	2508.11375	null
2025-08-18	TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation	Yilin Mi et.al.	2508.11284	null
2025-08-15	Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension	Zhenhao Li et.al.	2508.11211	null
2025-08-15	StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation	Seungmi Lee et.al.	2508.11203	null
2025-08-15	LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction	Maoquan Zhang et.al.	2508.11153	null
2025-08-14	Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models	Basile Lewandowski et.al.	2508.10993	null
2025-08-16	Object Fidelity Diffusion for Remote Sensing Image Generation	Ziqi Ye et.al.	2508.10801	null
2025-07-22	Pyramid Hierarchical Masked Diffusion Model for Imaging Synthesis	Xiaojiao Xiao et.al.	2507.16579	null
2025-07-22	ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement	Kahim Wong et.al.	2507.16397	null

(back to top)

Music Generation

Publish Date	Title	Authors	PDF	Code
2025-08-12	QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems	Chien-Chun Wang et.al.	2508.08957	null
2025-08-12	Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems	Liam Pram et.al.	2508.08805	null
2025-08-08	Live Music Models	Lyria Team et.al.	2508.04651	null
2025-08-03	Automatic Melody Reduction via Shortest Path Finding	Ziyu Wang et.al.	2508.01571	null
2025-07-31	DeformTune: A Deformable XAI Music Prototype for Non-Musicians	Ziqing Xu et.al.	2508.00160	null
2025-07-31	"I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation	Bob L. T. Sturm et.al.	2507.23365	null
2025-07-28	Music Arena: Live Evaluation for Text-to-Music	Yonghyun Kim et.al.	2507.20900	null
2025-07-28	Controllable Video-to-Music Generation with Multiple Time-Varying Conditions	Junxian Wu et.al.	2507.20627	null
2025-07-27	Diffusion-based Symbolic Music Generation with Structured State Space Models	Shenghua Yuan et.al.	2507.20128	null
2025-08-07	SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion	Hei Shing Cheung et.al.	2507.19991	null
2025-07-17	A new XML conversion process for mensural music encoding : CMME_to_MEI (via Verovio)	David Fiala et.al.	2507.15991	null
2025-07-17	WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling	Qihui Yang et.al.	2507.10534	null
2025-07-07	Evaluating Fake Music Detection Performance Under Audio Augmentations	Tomasz Sroka et.al.	2507.10447	null
2025-07-14	ASTAR-NTU solution to AudioMOS Challenge 2025 Track1	Fabian Ritter-Gutierrez et.al.	2507.09904	null
2025-07-09	Exploring State-Space-Model based Language Model in Music Generation	Wei-Jaw Lee et.al.	2507.06674	null
2025-07-08	MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation	Fathinah Izzati et.al.	2507.05894	null
2025-07-07	EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation	Fathinah Izzati et.al.	2507.04955	null
2025-07-04	MusGO: A Community-Driven Framework For Assessing Openness in Music-Generative AI	Roser Batlle-Roca et.al.	2507.03599	null
2025-06-29	The Florence Price Art Song Dataset and Piano Accompaniment Generator	Tao-Tao He et.al.	2506.23130	null
2025-06-29	TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure	Qi He et.al.	2506.23094	null
2025-06-26	Exploring Adapter Design Tradeoffs for Low Resource Music Generation	Atharva Mehta et.al.	2506.21298	null
2025-06-23	A Fourier Explanation of AI-music Artifacts	Darius Afchar et.al.	2506.19108	null
2025-06-23	Benchmarking Music Generation Models and Metrics via Human Preference Studies	Florian Grötschla et.al.	2506.19085	null
2025-06-23	Let Your Video Listen to Your Music!	Xinyu Zhang et.al.	2506.18881	null
2025-06-24	MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners	Fang-Duo Tsai et.al.	2506.18729	null
2025-06-28	AI-Generated Song Detection via Lyrics Transcripts	Markus Frohmann et.al.	2506.18488	null
2025-06-23	Large-Scale Training Data Attribution for Music Generative Models via Unlearning	Woosung Choi et.al.	2506.18312	null
2025-06-20	From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training	Mingyang Yao et.al.	2506.17497	link

(back to top)

Audio Codec

Publish Date	Title	Authors	PDF	Code
2025-08-22	Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning	Xueyao Zhang et.al.	2508.16332	null
2025-08-15	EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens	Joonyong Park et.al.	2508.11273	null
2025-08-15	Benchmarking Prosody Encoding in Discrete Speech Tokens	Kentaro Onda et.al.	2508.11224	null
2025-08-13	DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models	Yuanyuan Wang et.al.	2508.08961	null
2025-08-11	Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations	Ryo Aihara et.al.	2508.08399	null
2025-08-07	NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference	Edresson Casanova et.al.	2508.05835	null
2025-08-07	SpectroStream: A Versatile Neural Codec for General Audio	Yunpeng Li et.al.	2508.05207	null
2025-08-05	Real-time speech enhancement in noise for throat microphone using neural audio codec as foundation model	Julien Hauret et.al.	2508.02974	null
2025-08-04	SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec	Chunyu Qiang et.al.	2508.02849	null
2025-08-02	Multi-Granularity Adaptive Time-Frequency Attention Framework for Audio Deepfake Detection under Real-World Communication Degradations	Haohan Shi et.al.	2508.01467	null
2025-08-01	Next Tokens Denoising for Speech Synthesis	Yanqing Liu et.al.	2507.22746	null
2025-07-22	Step-Audio 2 Technical Report	Boyong Wu et.al.	2507.16632	null
2025-07-17	Autoregressive Speech Enhancement via Acoustic Tokens	Luca Della Libera et.al.	2507.12825	null
2025-07-17	Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine	Anastasia Kuznetsova et.al.	2507.12701	null
2025-07-16	Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations	Yichen Han et.al.	2507.12197	null
2025-07-16	Room Impulse Response Generation Conditioned on Acoustic Parameters	Silvia Arellano et.al.	2507.12136	null
2025-07-14	Aligning Generative Speech Enhancement with Human Preferences via Direct Preference Optimization	Haoyang Li et.al.	2507.09929	null
2025-07-14	Token-based Audio Inpainting via Discrete Diffusion	Tali Dror et.al.	2507.08333	null
2025-07-10	Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders	Dimitrios Bralios et.al.	2507.07867	null
2025-07-09	Speech Tokenizer is Key to Consistent Representation	Wonjin Jung et.al.	2507.06802	null
2025-07-01	StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding	Dake Guo et.al.	2506.23986	null
2025-07-09	XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs	Yitian Gong et.al.	2506.23325	null
2025-06-27	DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding	Yang Yang et.al.	2506.22362	null
2025-06-26	CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate	Hankun Wang et.al.	2506.21074	null
2025-06-24	Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation	Jun Wang et.al.	2506.19774	null
2025-06-20	LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization	Daejin Jo et.al.	2506.16738	null
2025-06-18	Factorized RVQ-GAN For Disentangled Speech Tokenization	Sameer Khurana et.al.	2506.15456	null
2025-06-17	A Variational Framework for Improving Naturalness in Generative Spoken Language Models	Li-Wei Chen et.al.	2506.14767	link
2025-06-14	Towards Neural Audio Codec Source Parsing	Orchid Chetia Phukan et.al.	2506.12627	null
2025-06-14	Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction	Xiaoran Fan et.al.	2506.12537	null
2025-06-13	ViSAGe: Video-to-Spatial Audio Generation	Jaeyeon Kim et.al.	2506.12199	null
2025-06-16	Discrete Audio Tokens: More Than a Survey!	Pooneh Mousavi et.al.	2506.10274	null
2025-06-13	Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model	Ailin Huang et.al.	2506.08967	null
2025-06-10	Towards Generalized Source Tracing for Codec-Based Deepfake Speech	Xuanjun Chen et.al.	2506.07294	null
2025-06-19	Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training	Sathvik Udupa et.al.	2506.07081	null
2025-06-04	Bringing Interpretability to Neural Audio Codecs	Samir Sadok et.al.	2506.04492	null
2025-06-13	Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation	Yuxuan Hu et.al.	2506.04392	null

(back to top)

Large Audio Language Model

Publish Date	Title	Authors	PDF	Code
2025-08-19	Lexical Hints of Accuracy in LLM Reasoning Chains	Arne Vanhoyweghen et.al.	2508.15842	null
2025-08-18	Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models	Zhifei Xie et.al.	2508.15827	null
2025-08-21	Conditionally adaptive augmented Lagrangian method for physics-informed learning of forward and inverse problems using artificial neural networks	Qifeng Hu et.al.	2508.15695	null
2025-08-21	When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models	Cheng Wang et.al.	2508.15407	null
2025-08-19	OmViD: Omni-supervised active learning for video action detection	Aayush Rana et.al.	2508.13983	null
2025-08-19	FAMNet: Integrating 2D and 3D Features for Micro-expression Recognition via Multi-task Learning and Hierarchical Attention	Liangyu Fu et.al.	2508.13483	null
2025-08-19	Modeling and Control of AWOISV: A Filtered Tube-Based MPC Approach for Simultaneous Tracking of Lateral Position and Heading Angle	Xu Yang et.al.	2508.13457	null
2025-08-18	Omni Survey for Multimodality Analysis in Visual Object Tracking	Zhangyong Tang et.al.	2508.13000	null
2025-08-16	OmniD: Generalizable Robot Manipulation Policy via Image-Based BEV Representation	Jilei Mao et.al.	2508.11898	null
2025-08-15	Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models	Bing Liu et.al.	2508.11165	null
2025-08-15	Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation	Bing Liu et.al.	2508.11134	null
2025-08-14	HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs	Zheng Qin et.al.	2508.10576	null
2025-08-13	Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning	Vaishnavi Shrivastava et.al.	2508.09726	null
2025-08-13	A Chain of Diagnosis Framework for Accurate and Explainable Radiology Report Generation	Haibo Jin et.al.	2508.09566	null
2025-08-11	MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models	Fan Zhang et.al.	2508.09210	null
2025-08-11	MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios	Shuai Wang et.al.	2508.08155	null
2025-08-12	Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning	Shu Wu et.al.	2508.08039	null
2025-08-12	Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation	Fangyuan Mao et.al.	2508.07981	null
2025-08-10	AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning	Siminfar Samakoush Galougah et.al.	2508.07470	null
2025-07-21	Prospects of detecting rotational flatness of exoplanets from space-based photometry	Sz. Kálmán et.al.	2507.15359	null
2025-07-21	The CHEOPS view of HD 95338b: refined transit parameters, and a search for exomoons	Sz. Kálmán et.al.	2507.15318	null
2025-07-20	Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards	Derek Li et.al.	2507.14783	null
2025-07-19	Anisotropic Anderson localization in higher-dimensional nonreciprocal lattices	Jinyuan Shang et.al.	2507.14523	null
2025-07-18	RiNNAL+: a Riemannian ALM Solver for SDP-RLT Relaxations of Mixed-Binary Quadratic Programs	Di Hou et.al.	2507.13776	null
2025-07-17	DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model	Han Zhang et.al.	2507.13087	null
2025-07-17	AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning	Yiming Ren et.al.	2507.12841	null
2025-07-16	An augmented Lagrangian method for strongly regular minimizers in a class of convex composite optimization problems	Chengjing Wang et.al.	2507.12040	null
2025-07-15	UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks	Peiran Wu et.al.	2507.11336	null
2025-07-14	MultiVox: Benchmarking Voice Assistants for Multimodal Interactions	Ramaneswaran Selvakumar et.al.	2507.10859	null
2025-07-14	The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents	Lixu Wang et.al.	2507.10016	null
2025-07-14	The electronic and transport properties in the Haldane-Hubbard with odd-parity altermagnetism	Minghuan Zeng et.al.	2507.09906	null
2025-07-11	Two-Level Distributed Interference Management for Large-Scale HAPS-Empowered vHetNets	Afsoon Alidadi Shamsabadi et.al.	2507.08299	null
2025-07-10	Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models	Arushi Goel et.al.	2507.08128	null
2025-07-09	Omni-Fusion of Spatial and Spectral for Hyperspectral Image Segmentation	Qing Zhang et.al.	2507.06606	null
2025-07-09	Omni-Video: Democratizing Unified Video Understanding and Generation	Zhiyu Tan et.al.	2507.06119	null
2025-07-08	ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark	He Wang et.al.	2507.05727	null
2025-07-08	Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition	Zijin Gu et.al.	2507.05724	null
2025-07-07	Electronic transport and anti-super-Klein tunneling in few-layer black phosphorous	Jorge Alfonso Lizarraga-Brito et.al.	2507.05462	null
2025-07-03	DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment	Ke-Han Lu et.al.	2507.02768	null
2025-06-29	VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems	Ethan Smyth et.al.	2507.00079	null
2025-06-28	UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments	Dayong Su et.al.	2506.22736	null
2025-06-27	Augmented Lagrangian methods for infeasible convex optimization problems and diverging proximal-point algorithms	Roland Andrews et.al.	2506.22428	null
2025-06-26	Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation	Guanting Dong et.al.	2506.21384	null
2025-06-26	HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context	Qize Yang et.al.	2506.21277	null
2025-06-26	Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation	Yihong Cao et.al.	2506.21198	null
2025-06-29	OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs	Yiman Zhang et.al.	2506.20960	null
2025-06-23	OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation	Qijun Gan et.al.	2506.18866	null
2025-06-22	Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents	Jinjie Wei et.al.	2506.17913	null

(back to top)

