Houquan Zhou

Ph.D. Student
Soochow University
Jacob_Zhou (at) outlook.com



2024 Dec 04, Wed
  • Pay Attention to the Robustness of Chinese Minority Language Models! Syllable-level Textual Adversarial Attack on Tibetan Script
    [AUTHORS]Xi Cao, Dolma Dawa, Nuo Qun, Trashi Nyima
    [ABSTRACT]The textual adversarial attack refers to an attack method in which the attacker adds imperceptible perturbations to the original texts by elaborate design so that the NLP (natural language processing) model produces false judgments. This method is also used to evaluate the robustness of NLP models. Currently, most of the research in this field focuses on English, and there is also a certain amount of research on Chinese. However, to the best of our knowledge, there is little research targeting Chinese minority languages. Textual adversarial attacks are a new challenge for the information processing of Chinese minority languages. In response to this situation, we propose a Tibetan syllable-level black-box textual adversarial attack called TSAttacker based on syllable cosine distance and scoring mechanism. And then, we conduct TSAttacker on six models generated by fine-tuning two PLMs (pre-trained language models) for three downstream tasks. The experiment results show that TSAttacker is effective and generates high-quality adversarial samples. In addition, the robustness of the involved models still has much room for improvement.
    [COMMENTS]Revised Version; Accepted at ACL 2023 Workshop on TrustNLP
    [LINK]http://arxiv.org/abs/2412.02323v2
    [DATE]2024-12-04 17:08:45+08:00
    [CATEGORIES]cs.CL
2024 Dec 03, Tue
  • Discovering influential text using convolutional neural networks
    [AUTHORS]Megan Ayers, Luke Sanford, Margaret Roberts, Eddie Yang
    [ABSTRACT]Experimental methods for estimating the impacts of text on human evaluation have been widely used in the social sciences. However, researchers in experimental settings are usually limited to testing a small number of pre-specified text treatments. While efforts to mine unstructured texts for features that causally affect outcomes have been ongoing in recent years, these models have primarily focused on the topics or specific words of text, which may not always be the mechanism of the effect. We connect these efforts with NLP interpretability techniques and present a method for flexibly discovering clusters of similar text phrases that are predictive of human reactions to texts using convolutional neural networks. When used in an experimental setting, this method can identify text treatments and their effects under certain assumptions. We apply the method to two datasets. The first enables direct validation of the model's ability to detect phrases known to cause the outcome. The second demonstrates its ability to flexibly discover text treatments with varying textual structures. In both cases, the model learns a greater variety of text treatments compared to benchmark methods, and these text features quantitatively meet or exceed the ability of benchmark methods to predict the outcome.
    [COMMENTS]Published in Findings of ACL 2024 ( see https://aclanthology.org/2024.findings-acl.714 )
    [LINK]http://arxiv.org/abs/2406.10086v3
    [DATE]2024-12-03 05:31:59+08:00
    [CATEGORIES]cs.CL cs.LG
2024 Dec 06, Fri
  • Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
    [AUTHORS]Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
    [ABSTRACT]Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. Our approach leverages image-based observations, and grounding instructions in natural language to visual elements, and employs a consistent action space to ensure cross-platform generalization. To address the limitations of previous work, we integrate explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. We construct a large-scale dataset of GUI agent trajectories, incorporating multimodal reasoning and grounding, and employ a two-stage training pipeline that first focuses on general GUI grounding, followed by planning and reasoning. Through comprehensive experiments, we demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving, to our knowledge, the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. We open-sourced all datasets, models, and training recipes to facilitate future research at https://aguvis-project.github.io/.
    [COMMENTS]https://aguvis-project.github.io/
    [LINK]http://arxiv.org/abs/2412.04454v1
    [DATE]2024-12-06 02:58:26+08:00
    [CATEGORIES]cs.CL
  • EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
    [AUTHORS]Yuqi Wu, Wenzhao Zheng, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu
    [ABSTRACT]3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents, which must gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Experiments demonstrate that our EmbodiedOcc outperforms existing local prediction methods and accomplishes the embodied occupancy prediction with high accuracy and strong expandability. Our code is available at: https://github.com/YkiWu/EmbodiedOcc.
    [COMMENTS]Code: https://github.com/YkiWu/EmbodiedOcc
    [LINK]http://arxiv.org/abs/2412.04380v1
    [DATE]2024-12-06 01:57:09+08:00
    [CATEGORIES]cs.LG
  • Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting
    [AUTHORS]Edoardo Cetin, Ahmed Touati, Yann Ollivier
    [ABSTRACT]The forward-backward representation (FB) is a recently proposed framework (Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation models (BFMs) that aim at providing zero-shot efficient policies for any new task specified in a given reinforcement learning (RL) environment, without training for each new task. Here we address two core limitations of FB model training. First, FB, like all successor-feature-based methods, relies on a linear encoding of tasks: at test time, each new reward function is linearly projected onto a fixed set of pre-trained features. This limits expressivity as well as precision of the task representation. We break the linearity limitation by introducing auto-regressive features for FB, which let fine-grained task features depend on coarser-grained task information. This can represent arbitrary nonlinear task encodings, thus significantly increasing expressivity of the FB framework. Second, it is well-known that training RL agents from offline datasets often requires specific techniques. We show that FB works well together with such offline RL techniques, by adapting techniques from (Nair et al., 2020b; Cetin et al., 2024) for FB. This is necessary to get non-flatlining performance in some datasets, such as DMC Humanoid. As a result, we produce efficient FB BFMs for a number of new environments. Notably, in the D4RL locomotion benchmark, the generic FB agent matches the performance of standard single-task offline agents (IQL, XQL). In many setups, the offline techniques are needed to get any decent performance at all. The auto-regressive features have a positive but moderate impact, concentrated on tasks requiring spatial precision and task generalization beyond the behaviors represented in the trainset.
    [LINK]http://arxiv.org/abs/2412.04368v1
    [DATE]2024-12-06 01:36:22+08:00
    [CATEGORIES]cs.LG
  • Machine Theory of Mind for Autonomous Cyber-Defence
    [AUTHORS]Luke Swaby, Matthew Stewart, Daniel Harrold, Chris Willis, Gregory Palmer
    [ABSTRACT]Intelligent autonomous agents hold much potential for the domain of cyber-security. However, due to many state-of-the-art approaches relying on uninterpretable black-box models, there is growing demand for methods that offer stakeholders clear and actionable insights into their latent beliefs and motivations. To address this, we evaluate Theory of Mind (ToM) approaches for Autonomous Cyber Operations. Upon learning a robust prior, ToM models can predict an agent's goals, behaviours, and contextual beliefs given only a handful of past behaviour observations. In this paper, we introduce a novel Graph Neural Network (GNN)-based ToM architecture tailored for cyber-defence, Graph-In, Graph-Out (GIGO)-ToM, which can accurately predict both the targets and attack trajectories of adversarial cyber agents over arbitrary computer network topologies. To evaluate the latter, we propose a novel extension of the Wasserstein distance for measuring the similarity of graph-based probability distributions. Whereas the standard Wasserstein distance lacks a fixed reference scale, we introduce a graph-theoretic normalization factor that enables a standardized comparison between networks of different sizes. We furnish this metric, which we term the Network Transport Distance (NTD), with a weighting function that emphasizes predictions according to custom node features, allowing network operators to explore arbitrary strategic considerations. Benchmarked against a Graph-In, Dense-Out (GIDO)-ToM architecture in an abstract cyber-defence environment, our empirical evaluations show that GIGO-ToM can accurately predict the goals and behaviours of various unseen cyber-attacking agents across a range of network topologies, as well as learn embeddings that can effectively characterize their policies.
    [COMMENTS]29 pages, 17 figures, 12 tables
    [LINK]http://arxiv.org/abs/2412.04367v1
    [DATE]2024-12-06 01:35:29+08:00
    [CATEGORIES]cs.LG
  • Action Mapping for Reinforcement Learning in Continuous Environments with Constraints
    [AUTHORS]Mirco Theile, Lukas Dirnberger, Raphael Trumpp, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
    [ABSTRACT]Deep reinforcement learning (DRL) has had success across various domains, but applying it to environments with constraints remains challenging due to poor sample efficiency and slow convergence. Recent literature explored incorporating model knowledge to mitigate these problems, particularly through the use of models that assess the feasibility of proposed actions. However, integrating feasibility models efficiently into DRL pipelines in environments with continuous action spaces is non-trivial. We propose a novel DRL training strategy utilizing action mapping that leverages feasibility models to streamline the learning process. By decoupling the learning of feasible actions from policy optimization, action mapping allows DRL agents to focus on selecting the optimal action from a reduced feasible action set. We demonstrate through experiments that action mapping significantly improves training performance in constrained environments with continuous action spaces, especially with imperfect feasibility models.
    [LINK]http://arxiv.org/abs/2412.04327v1
    [DATE]2024-12-06 00:42:45+08:00
    [CATEGORIES]cs.LG
  • On Multi-Agent Inverse Reinforcement Learning
    [AUTHORS]Till Freihaut, Giorgia Ramponi
    [ABSTRACT]In multi-agent systems, the agent behavior is highly influenced by its utility function, as these utilities shape both individual goals as well as interactions with the other agents. Inverse Reinforcement Learning (IRL) is a well-established approach to inferring the utility function by observing an expert behavior within a given environment. In this paper, we extend the IRL framework to the multi-agent setting, assuming to observe agents who are following Nash Equilibrium (NE) policies. We theoretically investigate the set of utilities that explain the behavior of NE experts. Specifically, we provide an explicit characterization of the feasible reward set and analyze how errors in estimating the transition dynamics and expert behavior impact the recovered rewards. Building on these findings, we provide the first sample complexity analysis for the multi-agent IRL problem. Finally, we provide a numerical evaluation of our theoretical results.
    [COMMENTS]Currently under review
    [LINK]http://arxiv.org/abs/2411.15046v2
    [DATE]2024-12-06 00:04:02+08:00
    [CATEGORIES]cs.LG
2024 Dec 05, Thu
  • Agent-OM: Leveraging LLM Agents for Ontology Matching
    [AUTHORS]Zhangcheng Qiang, Weiqing Wang, Kerry Taylor
    [ABSTRACT]Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of simple OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.
    [COMMENTS]14 pages, 13 figures, 4 tables
    [LINK]http://arxiv.org/abs/2312.00326v4
    [DATE]2024-12-05 22:45:05+08:00
    [CATEGORIES]cs.CL
  • A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios
    [AUTHORS]Xiachong Feng, Longxu Dou, Ella Li, Qinghao Wang, Haochuan Wang, Yu Guo, Chang Ma, Lingpeng Kong
    [ABSTRACT]Game-theoretic scenarios have become pivotal in evaluating the social intelligence of Large Language Model (LLM)-based social agents. While numerous studies have explored these agents in such settings, there is a lack of a comprehensive survey summarizing the current progress. To address this gap, we systematically review existing research on LLM-based social agents within game-theoretic scenarios. Our survey organizes the findings into three core components: Game Framework, Social Agent, and Evaluation Protocol. The game framework encompasses diverse game scenarios, ranging from choice-focusing to communication-focusing games. The social agent part explores agents' preferences, beliefs, and reasoning abilities. The evaluation protocol covers both game-agnostic and game-specific metrics for assessing agent performance. By reflecting on the current research and identifying future research directions, this survey provides insights to advance the development and evaluation of social agents in game-theoretic scenarios.
    [LINK]http://arxiv.org/abs/2412.03920v1
    [DATE]2024-12-05 14:46:46+08:00
    [CATEGORIES]cs.CL
  • MISR: Measuring Instrumental Self-Reasoning in Frontier Models
    [AUTHORS]Kai Fronsdal, David Lindner
    [ABSTRACT]We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, hence our evaluation can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.
    [COMMENTS]10 pages, 65 page appendix, 5 figures
    [LINK]http://arxiv.org/abs/2412.03904v1
    [DATE]2024-12-05 14:20:47+08:00
    [CATEGORIES]cs.CL cs.LG
  • PreAct: Prediction Enhances Agent's Planning Ability
    [AUTHORS]Dayuan Fu, Jianzhao Huang, Siyuan Lu, Guanting Dong, Yejie Wang, Keqing He, Weiran Xu
    [ABSTRACT]Addressing the disparity between forecasts and actual results can enable individuals to expand their thought processes and stimulate self-reflection, thus promoting accurate planning. In this research, we present **PreAct**, an agent framework that integrates **pre**diction, **rea**soning, and **act**ion. By utilizing the information derived from predictions, the large language model (LLM) agent can provide a wider range and more strategically focused reasoning. This leads to more efficient actions that aid the agent in accomplishing intricate tasks. Our experimental results show that PreAct surpasses the ReAct method in completing complex tasks and that PreAct's performance can be further improved when paired with other memory or selection strategy techniques. We presented the model with varying quantities of historical predictions and discovered that these predictions consistently enhance LLM planning. The variances in single-step reasoning between PreAct and ReAct indicate that PreAct indeed has benefits in terms of diversity and strategic orientation over ReAct.
    [COMMENTS]Coling 2025
    [LINK]http://arxiv.org/abs/2402.11534v2
    [DATE]2024-12-05 12:40:54+08:00
    [CATEGORIES]cs.CL
  • Network Formation and Dynamics Among Multi-LLMs
    [AUTHORS]Marios Papachristou, Yuan Yuan
    [ABSTRACT]Social networks fundamentally shape human opinions, behaviors, and the dissemination of information. As large language models (LLMs) like GPT, Claude, and Llama increasingly integrate into social and professional settings, understanding their behavior in the context of social interactions and network formation becomes essential. This study develops a framework to systematically examine whether the network formation behaviors of multiple LLMs approximate certain aspects of human network dynamics. By simulating interactions among LLM agents across various model families, we observe that these models consistently exhibit key patterns associated with social network principles including preferential attachment, triadic closure, homophily, community structure, and the small-world phenomenon when forming networks. Moreover, LLMs adapt their network formation strategies based on each network's characteristics, reflecting the context-dependent nature of human behavior: in Facebook networks, they prioritize triadic closure and homophily, mirroring close-knit friendships; in phone networks, homophily and preferential attachment dominate, capturing personal and professional connections, while in employment networks, LLMs favor heterophily and high-degree connections, aligning with career advancement dynamics. These results open new avenues for using LLMs in network science research, with potential applications in agent-based modeling and synthetic network generation.
    [LINK]http://arxiv.org/abs/2402.10659v4
    [DATE]2024-12-05 12:35:22+08:00
    [CATEGORIES]cs.CL
  • Social Life Simulation for Non-Cognitive Skills Learning
    [AUTHORS]Zihan Yan, Yaohong Xiang, Yun Huang
    [ABSTRACT]Non-cognitive skills are crucial for personal and social life well-being, and such skill development can be supported by narrative-based (e.g., storytelling) technologies. While generative AI enables interactive and role-playing storytelling, little is known about how users engage with and perceive the use of AI in social life simulation for non-cognitive skills learning. Additionally, the benefits of AI mentorship on self-reflection awareness and ability in this context remain largely underexplored. To this end, we introduced Simulife++, an interactive platform enabled by a large language model (LLM). The system allows users to act as protagonists, creating stories with one or multiple AI-based characters in diverse social scenarios. In particular, we expanded the Human-AI interaction to a Human-AI-AI collaboration by including a Sage Agent, who acts as a bystander, providing users with some perspectives and guidance on their choices and conversations in terms of non-cognitive skills to promote reflection. In a within-subject user study, our quantitative results reveal that, when accompanied by Sage Agent, users exhibit significantly higher levels of reflection on motivation, self-perceptions, and resilience & coping, along with an enhanced experience of narrative transportation. Additionally, our qualitative findings suggest that Sage Agent plays a crucial role in promoting reflection on non-cognitive skills, enhancing social communication and decision-making performance, and improving overall user experience within Simulife++. Multiple supportive relationships between Sage Agent and users were also reported. We offer design implications for the application of generative AI in narrative solutions and the future potential of Sage Agent for non-cognitive skill development in broader social contexts.
    [LINK]http://arxiv.org/abs/2405.00273v3
    [DATE]2024-12-05 12:19:45+08:00
    [CATEGORIES]cs.CL
  • Educational-Psychological Dialogue Robot Based on Multi-Agent Collaboration
    [AUTHORS]Shiwen Ni, Min Yang
    [ABSTRACT]Intelligent dialogue systems are increasingly used in modern education and psychological counseling fields, but most existing systems are limited to a single domain, cannot deal with both educational and psychological issues, and often lack accuracy and professionalism when dealing with complex issues. To address these problems, this paper proposes an intelligent dialog system that combines educational and psychological counseling functions. The system consists of multiple AI agents, including a security detection agent, an intent identification agent, an educational LLM agent, and a psychological LLM agent, which work in concert to ensure the provision of accurate educational knowledge Q&A and psychological support services. Specifically, the system recognizes user-input intentions through an intention classification model and invokes a retrieval-enhanced educational grand model and a psychological grand model fine-tuned with psychological data in order to provide professional educational advice and psychological support.
    [LINK]http://arxiv.org/abs/2412.03847v1
    [DATE]2024-12-05 11:27:02+08:00
    [CATEGORIES]cs.CL
  • From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
    [AUTHORS]Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
    [ABSTRACT]Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models, have revolutionized various natural language processing tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. We review fundamental building blocks crucial for studying chart understanding tasks. Additionally, we explore various tasks and their evaluation metrics and sources of both charts and textual inputs. Various modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed, highlighting the importance of several topics, such as domain-specific charts, lack of efforts in developing evaluation metrics, and agent-oriented settings. This survey paper serves as a comprehensive resource for researchers and practitioners in the fields of natural language processing, computer vision, and data analysis, providing valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies mentioned in this paper, along with emerging new research, will be continually updated at: https://github.com/khuangaf/Awesome-Chart-Understanding.
    [COMMENTS]IEEE Transactions on Knowledge and Data Engineering (TKDE)
    [LINK]http://arxiv.org/abs/2403.12027v4
    [DATE]2024-12-05 11:26:13+08:00
    [CATEGORIES]cs.CL
  • ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data
    [AUTHORS]Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar
    [ABSTRACT]Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.
    [LINK]http://arxiv.org/abs/2411.15004v2
    [DATE]2024-12-05 10:00:07+08:00
    [CATEGORIES]cs.CL
  • Agent AI with LangGraph: A Modular Framework for Enhancing Machine Translation Using Large Language Models
    [AUTHORS]Jialin Wang, Zhihua Duan
    [ABSTRACT]This paper explores the transformative role of Agent AI and LangGraph in advancing the automation and effectiveness of machine translation (MT). Agents are modular components designed to perform specific tasks, such as translating between particular languages, with specializations like TranslateEnAgent, TranslateFrenchAgent, and TranslateJpAgent for English, French, and Japanese translations, respectively. These agents leverage the powerful semantic capabilities of large language models (LLMs), such as GPT-4o, to ensure accurate, contextually relevant translations while maintaining modularity, scalability, and context retention. LangGraph, a graph-based framework built on LangChain, simplifies the creation and management of these agents and their workflows. It supports dynamic state management, enabling agents to maintain dialogue context and automates complex workflows by linking agents and facilitating their collaboration. With flexibility, open-source community support, and seamless integration with LLMs, LangGraph empowers agents to deliver high-quality translations. Together, Agent AI and LangGraph create a cohesive system where LangGraph orchestrates agent interactions, ensuring that user inputs are analyzed, routed, and processed efficiently. Experimental results demonstrate the potential of this system to enhance multilingual translation accuracy and scalability. By highlighting modular design and automated workflows, this paper sets the stage for further innovations in intelligent machine translation services.
    [LINK]http://arxiv.org/abs/2412.03801v1
    [DATE]2024-12-05 09:45:12+08:00
    [CATEGORIES]cs.CL
  • The broader spectrum of in-context learning
    [AUTHORS]Andrew Kyle Lampinen, Stephanie C. Y. Chan, Aaditya K. Singh, Murray Shanahan
    [ABSTRACT]The ability of language models to learn a task from a few examples in context has generated substantial interest. Here, we provide a perspective that situates this type of supervised few-shot learning within a much broader spectrum of meta-learned in-context learning. Indeed, we suggest that any distribution of sequences in which context non-trivially decreases loss on subsequent predictions can be interpreted as eliciting a kind of in-context learning. We suggest that this perspective helps to unify the broad set of in-context abilities that language models exhibit, such as adapting to tasks from instructions or role play, or extrapolating time series. This perspective also sheds light on potential roots of in-context learning in lower-level processing of linguistic dependencies (e.g. coreference or parallel structures). Finally, taking this perspective highlights the importance of generalization, which we suggest can be studied along several dimensions: not only the ability to learn something novel, but also flexibility in learning from different presentations, and in applying what is learned. We discuss broader connections to past literature in meta-learning and goal-conditioned agents, and other perspectives on learning and adaptation. We close by suggesting that research on in-context learning should consider this broader spectrum of in-context capabilities and types of generalization.
    [LINK]http://arxiv.org/abs/2412.03782v1
    [DATE]2024-12-05 08:05:11+08:00
    [CATEGORIES]cs.CL cs.LG
  • MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
    [AUTHORS]Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, Lichao Sun
    [ABSTRACT]Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool utilization ability of LLMs. They primarily investigated how LLMs effectively collaborate with given specific tools. However, in scenarios where LLMs serve as intelligent agents, as seen in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making processes that involve deciding whether to employ a tool and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries in the form of prompts that trigger LLMs to use tools, including both single-tool and multi-tool scenarios. Subsequently, we set the tasks for both tool usage awareness and tool selection. We define four subtasks from different perspectives in tool selection, including tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools, highlighting the existing gaps between LLMs and genuine intelligent agents. However, through the error analysis, we found there is still significant room for improvement. Finally, we conclude with insights for tool developers -- we strongly recommend that tool developers choose an appropriate rewrite model for generating new descriptions based on the downstream LLM the tool will apply to. Our code is in https://github.com/HowieHwong/MetaTool.
    [LINK]http://arxiv.org/abs/2310.03128v6
    [DATE]2024-12-05 03:49:02+08:00
    [CATEGORIES]cs.CL
  • From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents
    [AUTHORS]Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, Zhongyu Wei
    [ABSTRACT]Traditional sociological research often relies on human participation, which, though effective, is expensive, challenging to scale, and raises ethical concerns. Recent advancements in large language models (LLMs) highlight their potential to simulate human behavior, enabling the replication of individual responses and facilitating many interdisciplinary studies. In this paper, we conduct a comprehensive survey of this field, illustrating the recent progress in simulation driven by LLM-empowered agents. We categorize the simulations into three types: (1) Individual Simulation, which mimics specific individuals or demographic groups; (2) Scenario Simulation, where multiple agents collaborate to achieve goals within specific contexts; and (3) Society Simulation, which models interactions within agent societies to reflect the complexity and variety of real-world dynamics. These simulations follow a progression, ranging from detailed individual modeling to large-scale societal phenomena. We provide a detailed discussion of each simulation type, including the architecture or key components of the simulation, the classification of objectives or scenarios, and the evaluation method. Afterward, we summarize commonly used datasets and benchmarks. Finally, we discuss the trends across these three types of simulation. A repository for the related sources is at https://github.com/FudanDISC/SocialAgent.
    [LINK]http://arxiv.org/abs/2412.03563v1
    [DATE]2024-12-05 02:56:37+08:00
    [CATEGORIES]cs.CL
  • DataLab: A Unified Platform for LLM-Powered Business Intelligence
    [AUTHORS]Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Peng Chen, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, Haonan Wang, Canshi Wei, Xiaofeng Yang, Yuhui Zhang, Yifeng Zheng, Xiuqi Huang, Minfeng Zhu, Yuxin Ma, Bin Cui, Wei Chen
    [ABSTRACT]Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS. The fragmentation of tasks across different data roles and tools leads to inefficiencies and potential errors due to the iterative and collaborative nature of BI. In this paper, we introduce DataLab, a unified BI platform that integrates a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab supports a wide range of BI tasks for different data roles by seamlessly combining LLM assistance with user customization within a single environment. To achieve this unification, we design a domain knowledge incorporation module tailored for enterprise-specific BI tasks, an inter-agent communication mechanism to facilitate information sharing across the BI workflow, and a cell-based context management strategy to enhance context utilization efficiency in BI notebooks. Extensive experiments demonstrate that DataLab achieves state-of-the-art performance on various BI tasks across popular research benchmarks. Moreover, DataLab maintains high effectiveness and efficiency on real-world datasets from Tencent, achieving up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.
    [LINK]http://arxiv.org/abs/2412.02205v2
    [DATE]2024-12-05 00:12:08+08:00
    [CATEGORIES]cs.CL
  • HyperMARL: Adaptive Hypernetworks for Multi-Agent RL
    [AUTHORS]Kale-ab Abebe Tessera, Arrasy Rahman, Stefano V. Albrecht
    [ABSTRACT]Balancing individual specialisation and shared behaviours is a critical challenge in multi-agent reinforcement learning (MARL). Existing methods typically focus on encouraging diversity or leveraging shared representations. Full parameter sharing (FuPS) improves sample efficiency but struggles to learn diverse behaviours when required, while no parameter sharing (NoPS) enables diversity but is computationally expensive and sample inefficient. To address these challenges, we introduce HyperMARL, a novel approach using hypernetworks to balance efficiency and specialisation. HyperMARL generates agent-specific actor and critic parameters, enabling agents to adaptively exhibit diverse or homogeneous behaviours as needed, without modifying the learning objective or requiring prior knowledge of the optimal diversity. Furthermore, HyperMARL decouples agent-specific and state-based gradients, which empirically correlates with reduced policy gradient variance, potentially offering insights into its ability to capture diverse behaviours. Across MARL benchmarks requiring homogeneous, heterogeneous, or mixed behaviours, HyperMARL consistently matches or outperforms FuPS, NoPS, and diversity-focused methods, achieving NoPS-level diversity with a shared architecture. These results highlight the potential of hypernetworks as a versatile approach to the trade-off between specialisation and shared behaviours in MARL.
    [LINK]http://arxiv.org/abs/2412.04233v1
    [DATE]2024-12-05 23:09:51+08:00
    [CATEGORIES]cs.LG
  • Towards Generalizable Autonomous Penetration Testing via Domain Randomization and Meta-Reinforcement Learning
    [AUTHORS]Shicheng Zhou, Jingju Liu, Yuliang Lu, Jiahai Yang, Yue Zhang, Jie Chen
    [ABSTRACT]With increasing numbers of vulnerabilities exposed on the internet, autonomous penetration testing (pentesting) has emerged as an active research area, and reinforcement learning (RL) is a natural fit for studying it. Previous research in RL-based autonomous pentesting mainly focused on enhancing agents' learning efficacy within abstract simulated training environments, overlooking the applicability and generalization requirements of deploying agents' policies in real-world environments that differ substantially from their training settings. In contrast, for the first time, we shift focus to the pentesting agents' ability to generalize across unseen real environments. For this purpose, we propose a Generalizable Autonomous Pentesting framework (namely GAP) for training agents capable of drawing inferences from one to another -- a key requirement for the broad application of autonomous pentesting and a hallmark of human intelligence. GAP introduces a Real-to-Sim-to-Real pipeline with two key methods: domain randomization and meta-RL. Specifically, we are among the first to apply domain randomization in autonomous pentesting and propose a large language model-powered domain randomization method for synthetic environment generation. We further apply meta-RL to improve the agents' generalization ability in unseen environments by leveraging the synthetic environments. The combination of these two methods can effectively bridge the generalization gap and improve policy adaptation performance. Experiments are conducted on various vulnerable virtual machines, with results showing that GAP can (a) enable policy learning in unknown real environments, (b) achieve zero-shot policy transfer in similar environments, and (c) realize rapid policy adaptation in dissimilar environments.
    [COMMENTS]This work has been submitted to the IEEE for possible publication
    [LINK]http://arxiv.org/abs/2412.04078v1
    [DATE]2024-12-05 19:24:27+08:00
    [CATEGORIES]cs.LG
  • Learning Speed-Adaptive Walking Agent Using Imitation Learning with Physics-Informed Simulation
    [AUTHORS]Yi-Hung Chiu, Ung Hee Lee, Changseob Song, Manaen Hu, Inseung Kang
    [ABSTRACT]Virtual models of human gait, or digital twins, offer a promising solution for studying mobility without the need for labor-intensive data collection. However, challenges such as the sim-to-real gap and limited adaptability to diverse walking conditions persist. To address these, we developed and validated a framework to create a skeletal humanoid agent capable of adapting to varying walking speeds while maintaining biomechanically realistic motions. The framework combines a synthetic data generator, which produces biomechanically plausible gait kinematics from open-source biomechanics data, and a training system that uses adversarial imitation learning to train the agent's walking policy. We conducted comprehensive analyses comparing the agent's kinematics, synthetic data, and the original biomechanics dataset. The agent achieved a root mean square error of 5.24 ± 0.09 degrees at varying speeds compared to ground-truth kinematics data, demonstrating its adaptability. This work represents a significant step toward developing a digital twin of human locomotion, with potential applications in biomechanics research, exoskeleton design, and rehabilitation.
    [COMMENTS]Currently under review
    [LINK]http://arxiv.org/abs/2412.03949v1
    [DATE]2024-12-05 15:55:58+08:00
    [CATEGORIES]cs.LG
  • Traffic Co-Simulation Framework Empowered by Infrastructure Camera Sensing and Reinforcement Learning
    [AUTHORS]Talha Azfar, Ruimin Ke
    [ABSTRACT]Traffic simulations are commonly used to optimize traffic flow, with reinforcement learning (RL) showing promising potential for automated traffic signal control. Multi-agent reinforcement learning (MARL) is particularly effective for learning control strategies for traffic lights in a network using iterative simulations. However, existing methods often assume perfect vehicle detection, which overlooks real-world limitations related to infrastructure availability and sensor reliability. This study proposes a co-simulation framework integrating CARLA and SUMO, which combines high-fidelity 3D modeling with large-scale traffic flow simulation. Cameras mounted on traffic light poles within the CARLA environment use a YOLO-based computer vision system to detect and count vehicles, providing real-time traffic data as input for adaptive signal control in SUMO. MARL agents, trained with four different reward structures, leverage this visual feedback to optimize signal timings and improve network-wide traffic flow. Experiments in the test-bed demonstrate the effectiveness of the proposed MARL approach in enhancing traffic conditions using real-time camera-based detection. The framework also evaluates the robustness of MARL under faulty or sparse sensing and compares the performance of YOLOv5 and YOLOv8 for vehicle detection. Results show that while better accuracy improves performance, MARL agents can still achieve significant improvements with imperfect detection, demonstrating adaptability for real-world scenarios.
    [LINK]http://arxiv.org/abs/2412.03925v1
    [DATE]2024-12-05 15:01:56+08:00
    [CATEGORIES]cs.LG
  • Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning
    [AUTHORS]Yiran Wang, Chenshu Liu, Yunfan Li, Sanae Amani, Bolei Zhou, Lin F. Yang
    [ABSTRACT]The exploration & exploitation dilemma poses significant challenges in reinforcement learning (RL). Recently, curiosity-based exploration methods achieved great success in tackling hard-exploration problems. However, they necessitate extensive hyperparameter tuning on different environments, which heavily limits the applicability and accessibility of this line of methods. In this paper, we characterize this problem via analysis of the agent behavior, concluding that choosing a proper hyperparameter is fundamentally difficult. We then identify the difficulty and the instability of the optimization when the agent learns with curiosity. We propose our method, hyperparameter robust exploration (Hyper), which extensively mitigates the problem by effectively regularizing the visitation of the exploration and decoupling the exploitation to ensure stable training. We theoretically justify that Hyper is provably efficient under the function approximation setting and empirically demonstrate its appealing performance and robustness in various environments.
    [COMMENTS]arXiv admin note: text overlap with arXiv:1907.05388 by other authors
    [LINK]http://arxiv.org/abs/2412.03767v1
    [DATE]2024-12-05 07:12:41+08:00
    [CATEGORIES]cs.LG
  • PathletRL++: Optimizing Trajectory Pathlet Extraction and Dictionary Formation via Reinforcement Learning
    [AUTHORS]Gian Alix, Arian Haghparast, Manos Papagelis
    [ABSTRACT]Advances in tracking technologies have spurred the rapid growth of large-scale trajectory data. Building a compact collection of pathlets, referred to as a trajectory pathlet dictionary, is essential for supporting mobility-related applications. Existing methods typically adopt a top-down approach, generating numerous candidate pathlets and selecting a subset, leading to high memory usage and redundant storage from overlapping pathlets. To overcome these limitations, we propose a bottom-up strategy that incrementally merges basic pathlets to build the dictionary, reducing memory requirements by up to 24,000 times compared to baseline methods. The approach begins with unit-length pathlets and iteratively merges them while optimizing utility, which is defined using newly introduced metrics of trajectory loss and representability. We develop a deep reinforcement learning framework, PathletRL, which utilizes Deep Q-Networks (DQN) to approximate the utility function, resulting in a compact and efficient pathlet dictionary. Experiments on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art techniques, reducing the size of the constructed dictionary by up to 65.8%. Additionally, our results show that only half of the dictionary pathlets are needed to reconstruct 85% of the original trajectory data. Building on PathletRL, we introduce PathletRL++, which extends the original model by incorporating a richer state representation and an improved reward function to optimize decision-making during pathlet merging. These enhancements enable the agent to gain a more nuanced understanding of the environment, leading to higher-quality pathlet dictionaries. PathletRL++ achieves even greater dictionary size reduction, surpassing the performance of PathletRL, while maintaining high trajectory representability.
    [LINK]http://arxiv.org/abs/2412.03715v1
    [DATE]2024-12-05 05:09:43+08:00
    [CATEGORIES]cs.LG
  • Navigation World Models
    [AUTHORS]Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun
    [ABSTRACT]Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
    [COMMENTS]project page: https://www.amirbar.net/nwm/
    [LINK]http://arxiv.org/abs/2412.03572v1
    [DATE]2024-12-05 02:59:45+08:00
    [CATEGORIES]cs.LG
2024 Dec 04, Wed
  • Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation
    [AUTHORS]Yi-Chang Chen, Po-Chun Hsu, Chan-Jan Hsu, Da-shan Shiu
    [ABSTRACT]Large language models (LLMs) have significantly advanced autonomous agents, particularly in zero-shot tool usage, also known as function calling. This research delves into enhancing the function-calling capabilities of LLMs by exploring different approaches, including prompt formats for integrating function descriptions, blending function-calling and instruction-following data, introducing a novel Decision Token for conditional prompts, leveraging chain-of-thought reasoning, and overcoming multilingual challenges with a translation pipeline. Our key findings and contributions are as follows: (1) Instruction-following data improves both function-calling accuracy and relevance detection. (2) The use of the newly proposed Decision Token, combined with synthetic non-function-call data, enhances relevance detection. (3) A tailored translation pipeline effectively overcomes multilingual limitations, demonstrating significant improvements in Traditional Chinese. These insights highlight the potential for improved function-calling capabilities and multilingual applications in LLMs.
    [LINK]http://arxiv.org/abs/2412.01130v2
    [DATE]2024-12-04 11:34:42+08:00
    [CATEGORIES]cs.CL
  • Mediating Modes of Thought: LLM's for design scripting
    [AUTHORS]Moritz Rietschel, Fang Guo, Kyle Steinfeld
    [ABSTRACT]Architects adopt visual scripting and parametric design tools to explore more expansive design spaces (Coates, 2010), refine their thinking about the geometric logic of their design (Woodbury, 2010), and overcome conventional software limitations (Burry, 2011). Despite two decades of effort to make design scripting more accessible, a disconnect between a designer's free ways of thinking and the rigidity of algorithms remains (Burry, 2011). Recent developments in Large Language Models (LLMs) suggest this might soon change, as LLMs encode a general understanding of human context and exhibit the capacity to produce geometric logic. This project speculates that if LLMs can effectively mediate between user intent and algorithms, they become a powerful tool to make scripting in design more widespread and fun. We explore if such systems can interpret natural language prompts to assemble geometric operations relevant to computational design scripting. In the system, multiple layers of LLM agents are configured with specific context to infer the user intent and construct a sequential logic. Given a user's high-level text prompt, a geometric description is created, distilled into a sequence of logic operations, and mapped to software-specific commands. The completed script is constructed in the user's visual programming interface. The system succeeds in generating complete visual scripts up to a certain complexity but fails beyond this complexity threshold. It shows how LLMs can make design scripting much more aligned with human creativity and thought. Future research should explore conversational interactions, expand to multimodal inputs and outputs, and assess the performance of these tools.
    [COMMENTS]Published at ACADIA 2024
    [LINK]http://arxiv.org/abs/2411.14485v2
    [DATE]2024-12-04 06:27:12+08:00
    [CATEGORIES]cs.CL
  • WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
    [AUTHORS]Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong
    [ABSTRACT]Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
    [LINK]http://arxiv.org/abs/2411.02337v2
    [DATE]2024-12-04 00:37:23+08:00
    [CATEGORIES]cs.CL
  • Risk-aware Classification via Uncertainty Quantification
    [AUTHORS]Murat Sensoy, Lance M. Kaplan, Simon Julier, Maryam Saleki, Federico Cerutti
    [ABSTRACT]Autonomous and semi-autonomous systems are using deep learning models to improve decision-making. However, deep classifiers can be overly confident in their incorrect predictions, a major issue especially in safety-critical domains. The present study introduces three foundational desiderata for developing real-world risk-aware classification systems. Expanding upon the previously proposed Evidential Deep Learning (EDL), we demonstrate the unity between these principles and EDL's operational attributes. We then augment EDL empowering autonomous agents to exercise discretion during structured decision-making when uncertainty and risks are inherent. We rigorously examine empirical scenarios to substantiate these theoretical innovations. In contrast to existing risk-aware classifiers, our proposed methodologies consistently exhibit superior performance, underscoring their transformative potential in risk-conscious classification strategies.
    [COMMENTS]Accepted for publication in Expert Systems with Applications
    [LINK]http://arxiv.org/abs/2412.03391v1
    [DATE]2024-12-04 23:20:12+08:00
    [CATEGORIES]cs.LG
  • AI-Driven Day-to-Day Route Choice
    [AUTHORS]Leizhen Wang, Peibo Duan, Zhengbing He, Cheng Lyu, Xin Chen, Nan Zheng, Li Yao, Zhenliang Ma
    [ABSTRACT]Understanding travelers' route choices can help policymakers devise optimal operational and planning strategies for both normal and abnormal circumstances. However, existing choice modeling methods often rely on predefined assumptions and struggle to capture the dynamic and adaptive nature of travel behavior. Recently, Large Language Models (LLMs) have emerged as a promising alternative, demonstrating remarkable ability to replicate human-like behaviors across various fields. Despite this potential, their capacity to accurately simulate human route choice behavior in transportation contexts remains uncertain. To address this question, this paper investigates the potential of LLMs for route choice modeling by introducing an LLM-empowered agent, "LLMTraveler." This agent integrates an LLM as its core, equipped with a memory system that learns from past experiences and makes decisions by balancing retrieved data and personality traits. The study systematically evaluates the LLMTraveler's ability to replicate human-like decision-making through two stages: (1) analyzing its route-switching behavior in single origin-destination (OD) pair congestion game scenarios, where it demonstrates patterns that align with laboratory data but are not fully explained by traditional models, and (2) testing its capacity to model day-to-day (DTD) adaptive learning behaviors on the Ortuzar and Willumsen (OW) network, producing results comparable to Multinomial Logit (MNL) and Reinforcement Learning (RL) models. These experiments demonstrate that the framework can partially replicate human-like decision-making in route choice while providing natural language explanations for its decisions. This capability offers valuable insights for transportation policymaking, such as simulating traveler responses to new policies or changes in the network.
    [LINK]http://arxiv.org/abs/2412.03338v1
    [DATE]2024-12-04 22:13:38+08:00
    [CATEGORIES]cs.LG
  • Reinforcement Learning for Finite Space Mean-Field Type Games
    [AUTHORS]Kai Shao, Jiacheng Shen, Chijie An, Mathieu Laurière
    [ABSTRACT]Mean field type games (MFTGs) describe Nash equilibria between large coalitions: each coalition consists of a continuum of cooperative agents who maximize the average reward of their coalition while interacting non-cooperatively with a finite number of other coalitions. Although the theory has been extensively developed, efficient and scalable computational methods are still lacking. Here, we develop reinforcement learning methods for such games in a finite space setting with general dynamics and reward functions. We start by proving that the MFTG solution yields approximate Nash equilibria in finite-size coalition games. We then propose two algorithms. The first is based on quantization of mean-field spaces and Nash Q-learning, for which we provide convergence and stability analysis. The second is a deep reinforcement learning algorithm, which can scale to larger spaces. Numerical experiments in 5 environments with mean-field distributions of dimension up to 200 show the scalability and efficiency of the proposed method.
    [LINK]http://arxiv.org/abs/2409.18152v2
    [DATE]2024-12-04 20:18:17+08:00
    [CATEGORIES]cs.LG
  • Out-of-Distribution Detection for Neurosymbolic Autonomous Cyber Agents
    [AUTHORS]Ankita Samaddar, Nicholas Potteiger, Xenofon Koutsoukos
    [ABSTRACT]Autonomous agents for cyber applications take advantage of modern defense techniques by adopting intelligent agents with conventional and learning-enabled components. These intelligent agents are trained via reinforcement learning (RL) algorithms, and can learn, adapt to, reason about and deploy security rules to defend networked computer systems while maintaining critical operational workflows. However, the knowledge available during training about the state of the operational network and its environment may be limited. The agents should be trustworthy so that they can reliably detect situations they cannot handle, and hand them over to cyber experts. In this work, we develop an out-of-distribution (OOD) Monitoring algorithm that uses a Probabilistic Neural Network (PNN) to detect anomalous or OOD situations of RL-based agents with discrete states and discrete actions. To demonstrate the effectiveness of the proposed approach, we integrate the OOD monitoring algorithm with a neurosymbolic autonomous cyber agent that uses behavior trees with learning-enabled components. We evaluate the proposed approach in a simulated cyber environment under different adversarial strategies. Experimental results over a large number of episodes illustrate the overall efficiency of our proposed approach.
    [COMMENTS]9 pages, 10 figures, IEEE International Conference on AI in Cybersecurity (ICAIC), 2025
    [LINK]http://arxiv.org/abs/2412.02875v1
    [DATE]2024-12-04 06:20:52+08:00
    [CATEGORIES]cs.LG
  • An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits
    [AUTHORS]Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund
    [ABSTRACT]We study the performance of the Thompson Sampling algorithm for logistic bandit problems, where the agent receives binary rewards with probabilities determined by a logistic function $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$. We focus on the setting where the action $a$ and parameter $\theta$ lie within the $d$-dimensional unit ball with the action space encompassing the parameter space. Adopting the information-theoretic framework introduced by Russo & Van Roy (2015), we analyze the information ratio, which is defined as the ratio of the expected squared difference between the optimal and actual rewards to the mutual information between the optimal action and the reward. Improving upon previous results, we establish that the information ratio is bounded by $\tfrac{9}{2}d$. Notably, we obtain a regret bound in $O(d\sqrt{T \log(\beta T/d)})$ that depends only logarithmically on the parameter $\beta$.
    [COMMENTS]14 pages, Accepted to NeurIPS 2024 Workshop on Bayesian Decision-Making and Uncertainty
    [LINK]http://arxiv.org/abs/2412.02861v1
    [DATE]2024-12-04 05:55:41+08:00
    [CATEGORIES]cs.LG
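    The reward model in the logistic bandit entry above can be made concrete with a small sketch. This is only an illustration of the logistic function quoted in the abstract, not the authors' code; the function names `reward_probability` and `sample_reward` and the example vectors are hypothetical.

```python
import math
import random


def reward_probability(action, theta, beta):
    """Probability of a reward of 1: exp(b<a,t>) / (1 + exp(b<a,t>))."""
    # Inner product <a, theta> of the action and the parameter vector.
    inner = sum(a * t for a, t in zip(action, theta))
    z = beta * inner
    # Evaluate the logistic function in a numerically stable way.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_reward(action, theta, beta, rng):
    """Draw a binary (Bernoulli) reward for the given action."""
    return 1 if rng.random() < reward_probability(action, theta, beta) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.6, 0.8]   # hypothetical parameter on the unit sphere
    action = [0.6, 0.8]  # action aligned with theta
    for beta in (1.0, 5.0, 25.0):
        print(beta, reward_probability(action, theta, beta))
    print(sample_reward(action, theta, 5.0, rng))
```

    A larger $\beta$ makes the reward probability change more sharply around $\langle a, \theta \rangle = 0$, which is why a regret bound that grows only logarithmically in $\beta$ is notable.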
  • Learning Temporal Logic Predicates from Data with Statistical Guarantees
    [AUTHORS]Emi Soroka, Rohan Sinha, Sanjay Lall
    [ABSTRACT]Temporal logic rules are often used in control and robotics to provide structured, human-interpretable descriptions of high-dimensional trajectory data. These rules have numerous applications including safety validation using formal methods, constraining motion planning among autonomous agents, and classifying data. However, existing methods for learning temporal logic predicates from data do not provide assurances about the correctness of the resulting predicate. We present a novel method to learn temporal logic predicates from data with finite-sample correctness guarantees. Our approach leverages expression optimization and conformal prediction to learn predicates that correctly describe future trajectories under mild assumptions. We provide experimental results showing the performance of our approach on a simulated trajectory dataset and perform ablation studies to understand how each component of our algorithm contributes to its performance.
    [LINK]http://arxiv.org/abs/2406.10449v2
    [DATE]2024-12-04 03:52:27+08:00
    [CATEGORIES]cs.LG
  • TAB-Fields: A Maximum Entropy Framework for Mission-Aware Adversarial Planning
    [AUTHORS]Gokul Puthumanaillam, Jae Hyuk Song, Nurzhan Yesmagambet, Shinkyu Park, Melkior Ornik
    [ABSTRACT]Autonomous agents operating in adversarial scenarios face a fundamental challenge: while they may know their adversaries' high-level objectives, such as reaching specific destinations within time constraints, the exact policies these adversaries will employ remain unknown. Traditional approaches address this challenge by treating the adversary's state as a partially observable element, leading to a formulation as a Partially Observable Markov Decision Process (POMDP). However, the induced belief-space dynamics in a POMDP require knowledge of the system's transition dynamics, which, in this case, depend on the adversary's unknown policy. Our key observation is that while an adversary's exact policy is unknown, their behavior is necessarily constrained by their mission objectives and the physical environment, allowing us to characterize the space of possible behaviors without assuming specific policies. In this paper, we develop Task-Aware Behavior Fields (TAB-Fields), a representation that captures adversary state distributions over time by computing the most unbiased probability distribution consistent with known constraints. We construct TAB-Fields by solving a constrained optimization problem that minimizes additional assumptions about adversary behavior beyond mission and environmental requirements. We integrate TAB-Fields with standard planning algorithms by introducing TAB-conditioned POMCP, an adaptation of Partially Observable Monte Carlo Planning. Through experiments in simulation with underwater robots and hardware implementations with ground robots, we demonstrate that our approach achieves superior performance compared to baselines that either assume specific adversary policies or neglect mission constraints altogether. Evaluation videos and code are available at https://tab-fields.github.io.
    [LINK]http://arxiv.org/abs/2412.02570v1
    [DATE]2024-12-04 00:55:27+08:00
    [CATEGORIES]cs.LG
  • Defending Against Diverse Attacks in Federated Learning Through Consensus-Based Bi-Level Optimization
    [AUTHORS]Nicolás García Trillos, Aditya Kumar Akash, Sixu Li, Konstantin Riedl, Yuhua Zhu
    [ABSTRACT]Adversarial attacks pose significant challenges in many machine learning applications, particularly in the setting of distributed training and federated learning, where malicious agents seek to corrupt the training process with the goal of jeopardizing and compromising the performance and reliability of the final models. In this paper, we address the problem of robust federated learning in the presence of such attacks by formulating the training task as a bi-level optimization problem. We conduct a theoretical analysis of the resilience of consensus-based bi-level optimization (CB$^2$O), an interacting multi-particle metaheuristic optimization method, in adversarial settings. Specifically, we provide a global convergence analysis of CB$^2$O in mean-field law in the presence of malicious agents, demonstrating the robustness of CB$^2$O against a diverse range of attacks. We thereby offer insights into how specific hyperparameter choices help mitigate adversarial effects. On the practical side, we extend CB$^2$O to the clustered federated learning setting by proposing FedCB$^2$O, a novel interacting multi-particle system, and design a practical algorithm that addresses the demands of real-world applications. Extensive experiments demonstrate the robustness of the FedCB$^2$O algorithm against label-flipping attacks in decentralized clustered federated learning scenarios, showcasing its effectiveness in practical contexts.
    [LINK]http://arxiv.org/abs/2412.02535v1
    [DATE]2024-12-04 00:26:56+08:00
    [CATEGORIES]cs.LG
  • Introduction to Reinforcement Learning
    [AUTHORS]Majid Ghasemi, Dariush Ebrahimi
    [ABSTRACT]Reinforcement Learning (RL), a subfield of Artificial Intelligence (AI), focuses on training agents to make decisions by interacting with their environment to maximize cumulative rewards. This paper provides an overview of RL, covering its core concepts, methodologies, and resources for further learning. It offers a thorough explanation of fundamental components such as states, actions, policies, and reward signals, ensuring readers develop a solid foundational understanding. Additionally, the paper presents a variety of RL algorithms, categorized by key factors such as whether they are model-free or model-based and value-based or policy-based. Resources for learning and implementing RL, such as books, courses, and online communities, are also provided. By offering a clear, structured introduction, this paper aims to simplify the complexities of RL for beginners, providing a straightforward pathway to understanding.
    [COMMENTS]19 pages
    [LINK]http://arxiv.org/abs/2408.07712v3
    [DATE]2024-12-04 00:17:32+08:00
    [CATEGORIES]cs.LG
2024 Dec 03, Tue
  • Large Multimodal Agents for Accurate Phishing Detection with Enhanced Token Optimization and Cost Reduction
    [AUTHORS]Fouad Trad, Ali Chehab
    [ABSTRACT]With the rise of sophisticated phishing attacks, there is a growing need for effective and economical detection solutions. This paper explores the use of large multimodal agents, specifically Gemini 1.5 Flash and GPT-4o mini, to analyze both URLs and webpage screenshots via APIs, thus avoiding the complexities of training and maintaining AI systems. Our findings indicate that integrating these two data types substantially enhances detection performance over using either type alone. However, API usage incurs costs per query that depend on the number of input and output tokens. To address this, we propose a two-tiered agentic approach: initially, one agent assesses the URL, and if inconclusive, a second agent evaluates both the URL and the screenshot. This method not only maintains robust detection performance but also significantly reduces API costs by minimizing unnecessary multi-input queries. Cost analysis shows that with the agentic approach, GPT-4o mini can process about 4.2 times as many websites per $100 compared to the multimodal approach (107,440 vs. 25,626), and Gemini 1.5 Flash can process about 2.6 times as many websites (2,232,142 vs. 862,068). These findings underscore the significant economic benefits of the agentic approach over the multimodal method, providing a viable solution for organizations aiming to leverage advanced AI for phishing detection while controlling expenses.
    [COMMENTS]Accepted in the 2nd International Conference on Foundation and Large Language Models (FLLM2024)
    [LINK]http://arxiv.org/abs/2412.02301v1
    [DATE]2024-12-03 17:13:52+08:00
    [CATEGORIES]cs.CL
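As a quick check on the cost figures quoted in the abstract above, the ratios can be recomputed directly. The sketch below uses only the four websites-per-$100 numbers from the abstract; the function and variable names are illustrative and do not come from the paper.

```python
# Websites processed per $100, as reported in the abstract above.
SITES_PER_100_USD = {
    "GPT-4o mini": {"agentic": 107_440, "multimodal": 25_626},
    "Gemini 1.5 Flash": {"agentic": 2_232_142, "multimodal": 862_068},
}


def throughput_ratio(agentic: int, multimodal: int) -> float:
    """How many times as many websites the agentic approach handles per dollar."""
    return agentic / multimodal


def cost_per_site_usd(sites_per_100_usd: int) -> float:
    """Average cost in USD of checking one website."""
    return 100.0 / sites_per_100_usd


for model, counts in SITES_PER_100_USD.items():
    ratio = throughput_ratio(counts["agentic"], counts["multimodal"])
    agentic_cost = cost_per_site_usd(counts["agentic"])
    multimodal_cost = cost_per_site_usd(counts["multimodal"])
    print(
        f"{model}: {ratio:.2f}x as many sites per $100 "
        f"(${agentic_cost:.6f} vs. ${multimodal_cost:.6f} per site)"
    )
```

Rounded to one decimal place, the two ratios agree with the "about 4.2" and "about 2.6" figures stated in the abstract.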
  • AutoGuide: Automated Generation and Selection of Context-Aware Guidelines for Large Language Model Agents
    [AUTHORS]Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee
    [ABSTRACT]Recent advances in large language models (LLMs) have empowered AI agents capable of performing various sequential decision-making tasks. However, effectively guiding LLMs to perform well in unfamiliar domains like web navigation, where they lack sufficient knowledge, has proven to be difficult with the demonstration-based in-context learning paradigm. In this paper, we introduce a novel framework, called AutoGuide, which addresses this limitation by automatically generating context-aware guidelines from offline experiences. Importantly, each context-aware guideline is expressed in concise natural language and follows a conditional structure, clearly describing the context where it is applicable. As a result, our guidelines facilitate the provision of relevant knowledge for the agent's current decision-making process, overcoming the limitations of the conventional demonstration-based learning paradigm. Our evaluation demonstrates that AutoGuide significantly outperforms competitive baselines in complex benchmark domains, including real-world web navigation.
    [LINK]http://arxiv.org/abs/2403.08978v2
    [DATE]2024-12-03 15:36:47+08:00
    [CATEGORIES]cs.CL cs.LG
  • Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models
    [AUTHORS]Cong Lu, Shengran Hu, Jeff Clune
    [ABSTRACT]Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challenging problems including Atari games and robotic control, but requires manually designing heuristics to guide exploration (i.e., determine which states to save and explore from, and what actions to consider next), which is time-consuming and infeasible in general. To resolve this, we propose Intelligent Go-Explore (IGE) which greatly extends the scope of the original Go-Explore by replacing these handcrafted heuristics with the intelligence and internalized human notions of interestingness captured by giant pretrained foundation models (FMs). This provides IGE with a human-like ability to instinctively identify how interesting or promising any new state is (e.g., discovering new objects, locations, or behaviors), even in complex environments where heuristics are hard to define. Moreover, IGE offers the exciting opportunity to recognize and capitalize on serendipitous discoveries: states encountered during exploration that are valuable in terms of exploration, yet where what makes them interesting was not anticipated by the human user. We evaluate our algorithm on a diverse range of language and vision-based tasks that require search and exploration. Across these tasks, IGE strongly exceeds classic reinforcement learning and graph search baselines, and also succeeds where prior state-of-the-art FM agents like Reflexion completely fail. Overall, Intelligent Go-Explore combines the tremendous strengths of FMs and the powerful Go-Explore algorithm, opening up a new frontier of research into creating more generally capable agents with impressive exploration capabilities.
    [LINK]http://arxiv.org/abs/2405.15143v3
    [DATE]2024-12-03 14:43:39+08:00
    [CATEGORIES]cs.LG cs.CL
  • Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
    [AUTHORS]Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun
    [ABSTRACT]Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. First, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.
    [COMMENTS]9 pages, 4 figures
    [LINK]http://arxiv.org/abs/2410.12361v3
    [DATE]2024-12-03 12:34:09+08:00
    [CATEGORIES]cs.CL
  • Large Language Model-Brained GUI Agents: A Survey
    [AUTHORS]Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
    [ABSTRACT]GUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.
    [COMMENTS]The collection of papers reviewed in this survey will be hosted and regularly updated on the GitHub repository: https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey. Additionally, a searchable webpage is available at https://aka.ms/gui-agent for easier access and exploration.
    [LINK]http://arxiv.org/abs/2411.18279v3
    [DATE]2024-12-03 11:16:27+08:00
    [CATEGORIES]cs.CL
  • MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications
    [AUTHORS]Vishnou Vinayagame, Gregory Senay, Luis Martí
    [ABSTRACT]Mathematical reasoning capabilities are increasing with tool-augmented language agents, but methods often rely either on closed-source or large models, external data, or extensive prompt engineering. This work introduces MATATA, a novel cost-effective method to train LLM agents for tabular data problems through reasoning, planning, and tool use. With a progressive self-improvement paradigm and iterative weak supervision, it empowers 3.8B/8B Small Language Models (SLMs), particularly suited for local hosting and sensitive business contexts where data privacy is crucial. By employing flexible and reusable tools across different datasets, it achieves robust performance with effective scalability across shared tasks. Experiments show that MATATA reaches state-of-the-art performance on FinQA and TAT-QA among reasoning frameworks based on open-source models. Moreover, MATATA models compete with GPT-4 based frameworks on TabMWP, while being SLMs.
    [LINK]http://arxiv.org/abs/2411.18915v2
    [DATE]2024-12-03 05:08:00+08:00
    [CATEGORIES]cs.LG cs.CL
  • Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous Autonomous Surface Vehicles with Deep Reinforcement Learning
    [AUTHORS]Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín
    [ABSTRACT]This paper presents a model-free deep reinforcement learning framework for informative path planning with heterogeneous fleets of autonomous surface vehicles to locate and collect plastic waste. The system employs two teams of vehicles: scouts and cleaners. Coordination between these teams is achieved through a deep reinforcement learning approach, allowing agents to learn strategies to maximize cleaning efficiency. The primary objective is for the scout team to provide an up-to-date contamination model, while the cleaner team collects as much waste as possible following this model. This strategy leads to heterogeneous teams that optimize fleet efficiency through inter-team cooperation supported by a tailored reward function. Several training variants of the proposed algorithm are compared with other state-of-the-art heuristics in two distinct scenarios, one with high convexity and another with narrow corridors and challenging access. The results show that deep reinforcement learning based algorithms outperform other benchmark heuristics, exhibiting superior adaptability. In addition, training with greedy actions further enhances performance, particularly in scenarios with intricate layouts.
    [COMMENTS]This article is currently under revision for the Robotics and Automation Letters (IEEE)
    [LINK]http://arxiv.org/abs/2412.02316v1
    [DATE]2024-12-03 17:32:02+08:00
    [CATEGORIES]cs.LG
  • Conformal Symplectic Optimization for Stable Reinforcement Learning
    [AUTHORS]Yao Lyu, Xiangteng Zhang, Shengbo Eben Li, Jingliang Duan, Letian Tao, Qing Xu, Lei He, Keqiang Li
    [ABSTRACT]Training deep reinforcement learning (RL) agents necessitates overcoming the highly unstable nonconvex stochastic optimization inherent in the trial-and-error mechanism. To tackle this challenge, we propose a physics-inspired optimization algorithm called relativistic adaptive gradient descent (RAD), which enhances long-term training stability. By conceptualizing neural network (NN) training as the evolution of a conformal Hamiltonian system, we present a universal framework for transferring long-term stability from conformal symplectic integrators to iterative NN updating rules, where the choice of kinetic energy governs the dynamical properties of resulting optimization algorithms. By utilizing relativistic kinetic energy, RAD incorporates principles from special relativity and limits parameter updates below a finite speed, effectively mitigating abnormal gradient influences. Additionally, RAD models NN optimization as the evolution of a multi-particle system where each trainable parameter acts as an independent particle with an individual adaptive learning rate. We prove RAD's sublinear convergence under general nonconvex settings, where smaller gradient variance and larger batch sizes contribute to tighter convergence. Notably, RAD reduces to the well-known adaptive moment estimation (ADAM) algorithm when its speed coefficient is chosen as one and symplectic factor as a small positive value. Experimental results show RAD outperforming nine baseline optimizers with five RL algorithms across twelve environments, including standard benchmarks and challenging scenarios. Notably, RAD achieves up to a 155.1% performance improvement over ADAM in Atari games, showcasing its efficacy in stabilizing and accelerating RL training.
    [LINK]http://arxiv.org/abs/2412.02291v1
    [DATE]2024-12-03 17:07:31+08:00
    [CATEGORIES]cs.LG
  • Feudal Graph Reinforcement Learning
    [AUTHORS]Tommaso Marzi, Arshjot Khehra, Andrea Cini, Cesare Alippi
    [ABSTRACT]Graph-based representations and message-passing modular policies constitute prominent approaches to tackling composable control problems in reinforcement learning (RL). However, as shown by recent graph deep learning literature, such local message-passing operators can create information bottlenecks and hinder global coordination. The issue becomes more serious in tasks requiring high-level planning. In this work, we propose a novel methodology, named Feudal Graph Reinforcement Learning (FGRL), that addresses such challenges by relying on hierarchical RL and a pyramidal message-passing architecture. In particular, FGRL defines a hierarchy of policies where high-level commands are propagated from the top of the hierarchy down through a layered graph structure. The bottom layers mimic the morphology of the physical system, while the upper layers correspond to higher-order sub-modules. The resulting agents are then characterized by a committee of policies where actions at a certain level set goals for the level below, thus implementing a hierarchical decision-making structure that can naturally implement task decomposition. We evaluate the proposed framework on a graph clustering problem and MuJoCo locomotion tasks; simulation results show that FGRL compares favorably against relevant baselines. Furthermore, an in-depth analysis of the command propagation mechanism provides evidence that the introduced message-passing scheme favors learning hierarchical decision-making policies.
    [LINK]http://arxiv.org/abs/2304.05099v6
    [DATE]2024-12-03 16:58:22+08:00
    [CATEGORIES]cs.LG
  • BOTracle: A framework for Discriminating Bots and Humans
    [AUTHORS]Jan Kadel, August See, Ritwik Sinha, Mathias Fischer
    [ABSTRACT]Bots constitute a significant portion of Internet traffic and are a source of various issues across multiple domains. Modern bots often become indistinguishable from real users, as they employ similar methods to browse the web, including using real browsers. We address the challenge of bot detection in high-traffic scenarios by analyzing three distinct detection methods. The first method operates on heuristics, allowing for rapid detection. The second method uses well-known technical features, such as IP address, window size, and user agent. It serves primarily for comparison with the third method. In the third method, we rely solely on browsing behavior, omitting all static features and focusing exclusively on how clients behave on a website. In contrast to related work, we evaluate our approaches using real-world e-commerce traffic data, comprising 40 million monthly page visits. We further compare our methods against another bot detection approach, Botcha, on the same dataset. Our performance metrics, including precision, recall, and AUC, reach 98 percent or higher, surpassing Botcha.
    [COMMENTS]Bot Detection; User Behaviour Analysis; Published at ESORICS International Workshops 2024
    [LINK]http://arxiv.org/abs/2412.02266v1
    [DATE]2024-12-03 16:38:30+08:00
    [CATEGORIES]cs.LG
  • Selective Reviews of Bandit Problems in AI via a Statistical View
    [AUTHORS]Pengjie Zhou, Haoyu Wei, Huiming Zhang
    [ABSTRACT]Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset includes stochastic multi-armed bandit (MAB) and continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools like concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing exploration-exploitation trade-offs. We also extend the discussion to $K$-armed contextual bandits and SCAB, examining their methodologies, regret analyses, and discussing the relation between the SCAB problems and the functional data analysis. Finally, we highlight recent advances and ongoing challenges in the field.
    [COMMENTS]46 pages, 5 figures
    [LINK]http://arxiv.org/abs/2412.02251v1
    [DATE]2024-12-03 16:28:47+08:00
    [CATEGORIES]cs.LG
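To make the exploration-exploitation trade-off mentioned in the abstract above concrete, here is a minimal sketch of UCB1, a classic frequentist index policy for the stochastic multi-armed bandit. It is a textbook illustration only and is not taken from the review; the Bernoulli arms, the horizon, and the helper names (`ucb1`, `bernoulli`) are assumptions chosen for the example.

```python
import math
import random


def ucb1(reward_fns, horizon, seed=0):
    """Run UCB1 for `horizon` rounds; return per-arm pull counts and mean rewards.

    `reward_fns` is a list of callables taking a random.Random and returning
    a reward in [0, 1]. Requires horizon >= number of arms.
    """
    rng = random.Random(seed)
    k = len(reward_fns)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initial phase: pull every arm once
        else:
            # Optimism: empirical mean plus a confidence-width bonus.
            arm = max(
                range(k),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


def bernoulli(p):
    """Hypothetical arm: reward 1 with probability p, else 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


counts, means = ucb1([bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)], horizon=5000)
print("pulls per arm:", counts)
```

The bonus term shrinks as an arm is pulled more often, so under-sampled arms keep being explored while the arm with the highest empirical mean receives most of the pulls.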
  • FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL
    [AUTHORS]Woosung Koh, Wonbeen Oh, Siyeol Kim, Suhin Shin, Hyeongjin Kim, Jaein Jang, Junghyun Lee, Se-Young Yun
    [ABSTRACT]Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory -- a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. FlickerFusion stochastically drops out parts of the observation space, emulating being in-domain during OOD inference. The results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-à-vis the backbone, compared to existing methods. Benchmarks, implementations, and model weights are organized and open-sourced at flickerfusion305.github.io, accompanied by ample demo video renderings.
    [COMMENTS]NeurIPS '24 Open-World Agents Workshop
    [LINK]http://arxiv.org/abs/2410.15876v3
    [DATE]2024-12-03 13:59:09+08:00
    [CATEGORIES]cs.LG
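The abstract above says only that FlickerFusion "stochastically drops out parts of the observation space"; it does not spell out the mechanism. The sketch below shows one plausible reading, entity-level dropout over a list of per-entity feature vectors. The function name `drop_entities`, the `max_drop` parameter, and the observation layout are hypothetical, and this is not the authors' implementation.

```python
import random


def drop_entities(entities, max_drop, rng):
    """Return a copy of `entities` with up to `max_drop` entries removed at random.

    Assumed (hypothetical) observation layout: one feature vector per observed
    entity. At least one entity is always kept and the original order is preserved.
    """
    if not entities:
        return []
    n_drop = rng.randint(0, min(max_drop, len(entities) - 1))
    keep = sorted(rng.sample(range(len(entities)), len(entities) - n_drop))
    return [entities[i] for i in keep]


rng = random.Random(0)
observation = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
print(drop_entities(observation, max_drop=2, rng=rng))
```

Resampling the dropped subset at every step would make the visible entity count vary within a trajectory, which is the kind of variation the abstract describes.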
  • Optimizing Latent Goal by Learning from Trajectory Preference
    [AUTHORS]Guangyu Zhao, Kewei Lian, Haowei Lin, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang
    [ABSTRACT]A growing body of work has emerged focusing on instruction-following policies for open-world agents, aiming to better align the agent's behavior with human intentions. However, the performance of these policies is highly susceptible to the initial prompt, which leads to extra efforts in selecting the best instructions. We propose a framework named Preference Goal Tuning (PGT). PGT allows an instruction-following policy to interact with the environment to collect several trajectories, which will be categorized into positive and negative samples based on preference. Then we use preference learning to fine-tune the initial goal latent representation with the categorized trajectories while keeping the policy backbone frozen. Experimental results show that with minimal data and training, PGT achieves an average relative improvement of 72.0% and 81.6% over 17 tasks in 2 different foundation policies respectively, and outperforms the best human-selected instructions. Moreover, PGT surpasses full fine-tuning in the out-of-distribution (OOD) task-execution environments by 13.4%, indicating that our approach retains strong generalization capabilities. Since our approach stores a single latent representation for each task independently, it can be viewed as an efficient method for continual learning, without the risk of catastrophic forgetting or task interference. In short, PGT enhances the performance of agents across nearly all tasks in the Minecraft Skillforge benchmark and demonstrates robustness to the execution environment.
    [LINK]http://arxiv.org/abs/2412.02125v1
    [DATE]2024-12-03 11:27:48+08:00
    [CATEGORIES]cs.LG
  • The Problem of Social Cost in Multi-Agent General Reinforcement Learning: Survey and Synthesis
    [AUTHORS]Kee Siong Ng, Samuel Yang-Zhao, Timothy Cadogan-Cowper
    [ABSTRACT]The AI safety literature is full of examples of powerful AI agents that, in blindly pursuing a specific and usually narrow objective, end up causing unacceptable and even catastrophic collateral damage to others. In this paper, we consider the problem of social harms that can result from actions taken by learning and utility-maximising agents in a multi-agent environment. The problem of measuring social harms or impacts in such multi-agent settings, especially when the agents are artificial generally intelligent (AGI) agents, was listed as an open problem in Everitt et al., 2018. We attempt a partial answer to that open problem in the form of market-based mechanisms to quantify and control the cost of such social harms. The proposed setup captures many well-studied special cases and is more general than existing formulations of multi-agent reinforcement learning with mechanism design in two ways: (i) the underlying environment is a history-based general reinforcement learning environment like in AIXI; (ii) the reinforcement-learning agents participating in the environment can have different learning strategies and planning horizons. To demonstrate the practicality of the proposed setup, we survey some key classes of learning algorithms and present a few applications, including a discussion of the Paperclips problem and pollution control with a cap-and-trade system.
    [COMMENTS]49 pages
    [LINK]http://arxiv.org/abs/2412.02091v1
    [DATE]2024-12-03 10:22:55+08:00
    [CATEGORIES]cs.LG
  • Comparative Analysis of Multi-Agent Reinforcement Learning Policies for Crop Planning Decision Support
    [AUTHORS]Anubha Mahajan, Shreya Hegde, Ethan Shay, Daniel Wu, Aviva Prins
    [ABSTRACT]In India, the majority of farmers are classified as small or marginal, making their livelihoods particularly vulnerable to economic losses due to market saturation and climate risks. Effective crop planning can significantly impact their expected income, yet existing decision support systems (DSS) often provide generic recommendations that fail to account for real-time market dynamics and the interactions among multiple farmers. In this paper, we evaluate the viability of three multi-agent reinforcement learning (MARL) approaches for optimizing total farmer income and promoting fairness in crop planning: Independent Q-Learning (IQL), where each farmer acts independently without coordination, Agent-by-Agent (ABA), which sequentially optimizes each farmer's policy in relation to the others, and the Multi-agent Rollout Policy, which jointly optimizes all farmers' actions for global reward maximization. Our results demonstrate that while IQL offers computational efficiency with linear runtime, it struggles with coordination among agents, leading to lower total rewards and an unequal distribution of income. Conversely, the Multi-agent Rollout policy achieves the highest total rewards and promotes equitable income distribution among farmers but requires significantly more computational resources, making it less practical for large numbers of agents. ABA strikes a balance between runtime efficiency and reward optimization, offering reasonable total rewards with acceptable fairness and scalability. These findings highlight the importance of selecting appropriate MARL approaches in DSS to provide personalized and equitable crop planning recommendations, advancing the development of more adaptive and farmer-centric agricultural decision-making systems.
    [LINK]http://arxiv.org/abs/2412.02057v1
    [DATE]2024-12-03 08:30:19+08:00
    [CATEGORIES]cs.LG
  • Explore Reinforced: Equilibrium Approximation with Reinforcement Learning
    [AUTHORS]Ryan Yu, Mateusz Nowak, Qintong Xie, Michelle Yilin Feng, Peter Chin
    [ABSTRACT]Current approximate Coarse Correlated Equilibria (CCE) algorithms struggle with equilibrium approximation for games in large stochastic environments but are theoretically guaranteed to converge to a strong solution concept. In contrast, modern Reinforcement Learning (RL) algorithms provide faster training yet yield weaker solutions. We introduce Exp3-IXrl, a blend of RL and game-theoretic approaches that separates the RL agent's action selection from the equilibrium computation while preserving the integrity of the learning process. We demonstrate that our algorithm expands the application of equilibrium approximation algorithms to new environments. Specifically, we show improved performance in a complex and adversarial cybersecurity network environment, the Cyber Operations Research Gym, and in classical multi-armed bandit settings.
    [LINK]http://arxiv.org/abs/2412.02016v1
    [DATE]2024-12-03 06:37:59+08:00
    [CATEGORIES]cs.LG
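The name Exp3-IXrl in the abstract above suggests that the method builds on Exp3-IX, the exponential-weights bandit algorithm with implicit exploration; that link is an assumption here, since the abstract does not describe the update rule. The sketch below implements plain Exp3-IX only, not the paper's RL-augmented variant. The loss model, the horizon, and the names `exp3_ix` and `bernoulli_loss` are illustrative, and the learning-rate setting is one common choice rather than the paper's.

```python
import math
import random


def exp3_ix(loss_fn, n_arms, horizon, eta, gamma, seed=0):
    """Plain Exp3-IX; returns the number of times each arm was played.

    `loss_fn(t, arm, rng)` must return a loss in [0, 1].
    """
    rng = random.Random(seed)
    cum_loss_est = [0.0] * n_arms
    pulls = [0] * n_arms
    for t in range(horizon):
        # Exponential weights; subtracting the minimum keeps exp() stable.
        base = min(cum_loss_est)
        weights = [math.exp(-eta * (est - base)) for est in cum_loss_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        loss = loss_fn(t, arm, rng)
        # Implicit exploration: adding gamma to the denominator biases the
        # importance-weighted estimate downward and caps its variance.
        cum_loss_est[arm] += loss / (probs[arm] + gamma)
        pulls[arm] += 1
    return pulls


def bernoulli_loss(means):
    """Hypothetical environment: arm a incurs loss 1 with probability means[a]."""
    return lambda t, arm, rng: 1.0 if rng.random() < means[arm] else 0.0


K, T = 3, 5000
eta = math.sqrt(2.0 * math.log(K) / (K * T))  # one common setting
pulls = exp3_ix(bernoulli_loss([0.8, 0.5, 0.2]), K, T, eta=eta, gamma=eta / 2.0)
print("pulls per arm:", pulls)
```

Only the played arm's estimate is updated each round, so the algorithm needs bandit feedback alone, and over a long horizon the arm with the lowest expected loss should end up with most of the pulls.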
  • Who's Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation
    [AUTHORS]Trenton Chang, Lindsay Warrenburg, Sae-Hwan Park, Ravi B. Parikh, Maggie Makar, Jenna Wiens
    [ABSTRACT]In many settings, machine learning models may be used to inform decisions that impact individuals or entities who interact with the model. Such entities, or agents, may game model decisions by manipulating their inputs to the model to obtain better outcomes and maximize some utility. We consider a multi-agent setting where the goal is to identify the "worst offenders:" agents that are gaming most aggressively. However, identifying such agents is difficult without knowledge of their utility function. Thus, we introduce a framework in which each agent's tendency to game is parameterized via a scalar. We show that this gaming parameter is only partially identifiable. By recasting the problem as a causal effect estimation problem where different agents represent different "treatments," we prove that a ranking of all agents by their gaming parameters is identifiable. We present empirical results in a synthetic data study validating the usage of causal effect estimation for gaming detection and show in a case study of diagnosis coding behavior in the U.S. that our approach highlights features associated with gaming.
    [COMMENTS]38 pages, 31 figures. NeurIPS 2024
    [LINK]http://arxiv.org/abs/2412.02000v1
    [DATE]2024-12-03 06:07:48+08:00
    [CATEGORIES]cs.LG
  • Generalized EXTRA stochastic gradient Langevin dynamics
    [AUTHORS]Mert Gurbuzbalaban, Mohammad Rafiqul Islam, Xiaoyu Wang, Lingjiong Zhu
    [ABSTRACT]Langevin algorithms are popular Markov Chain Monte Carlo methods for Bayesian learning, particularly when the aim is to sample from the posterior distribution of a parametric model, given the input data and the prior distribution over the model parameters. Their stochastic versions such as stochastic gradient Langevin dynamics (SGLD) allow iterative learning based on randomly sampled mini-batches of large datasets and are scalable to large datasets. However, when data is decentralized across a network of agents subject to communication and privacy constraints, standard SGLD algorithms cannot be applied. Instead, we employ decentralized SGLD (DE-SGLD) algorithms, where Bayesian learning is performed collaboratively by a network of agents without sharing individual data. Nonetheless, existing DE-SGLD algorithms induce a bias at every agent that can negatively impact performance; this bias persists even when using full batches and is attributable to network effects. Motivated by the EXTRA algorithm and its generalizations for decentralized optimization, we propose the generalized EXTRA stochastic gradient Langevin dynamics, which eliminates this bias in the full-batch setting. Moreover, we show that, in the mini-batch setting, our algorithm provides performance bounds that significantly improve upon those of standard DE-SGLD algorithms in the literature. Our numerical results also demonstrate the efficiency of the proposed approach.
    [LINK]http://arxiv.org/abs/2412.01993v1
    [DATE]2024-12-03 05:57:30+08:00
    [CATEGORIES]cs.LG
  • A Multi-Agent Reinforcement Learning Testbed for Cognitive Radio Applications
    [AUTHORS]Sriniketh Vangaru, Daniel Rosen, Dylan Green, Raphael Rodriguez, Maxwell Wiecek, Amos Johnson, Alyse M. Jones, William C. Headley
    [ABSTRACT]Technological trends show that Radio Frequency Reinforcement Learning (RFRL) will play a prominent role in the wireless communication systems of the future. Applications of RFRL range from military communications jamming to enhancing WiFi networks. Before deploying algorithms for these purposes, they must be trained in a simulation environment to ensure adequate performance. For this reason, we previously created the RFRL Gym: a standardized, accessible tool for the development and testing of reinforcement learning (RL) algorithms in the wireless communications space. This environment leveraged the OpenAI Gym framework and featured customizable simulation scenarios within the RF spectrum. However, the RFRL Gym was limited to training a single RL agent per simulation; this is not ideal, as most real-world RF scenarios will contain multiple intelligent agents in cooperative, competitive, or mixed settings, which is a natural consequence of spectrum congestion. Therefore, through integration with Ray RLlib, multi-agent reinforcement learning (MARL) functionality for training and assessment has been added to the RFRL Gym, making it even more of a robust tool for RF spectrum simulation. This paper provides an overview of the updated RFRL Gym environment. In this work, the general framework of the tool is described relative to comparable existing resources, highlighting the significant additions and refactoring we have applied to the Gym. Afterward, results from testing various RF scenarios in the MARL environment and future additions are discussed.
    [COMMENTS]Accepted to IEEE CCNC 2025. Added revisions from paper reviews
    [LINK]http://arxiv.org/abs/2410.21521v2
    [DATE]2024-12-03 03:49:59+08:00
    [CATEGORIES]cs.LG
  • MALT: Improving Reasoning with Multi-Agent LLM Training
    [AUTHORS]Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Markian Rybchuk, Philip H. S. Torr, Ivan Laptev, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt
    [ABSTRACT]Enabling effective collaboration among LLMs is a crucial step toward developing autonomous systems capable of solving complex problems. While LLMs are typically used as single-model generators, where humans critique and refine their outputs, the potential for jointly-trained collaborative models remains largely unexplored. Despite promising results in multi-agent communication and debate settings, little progress has been made in training models to work together on tasks. In this paper, we present a first step toward "Multi-agent LLM training" (MALT) on reasoning problems. Our approach employs a sequential multi-agent setup with heterogeneous LLMs assigned specialized roles: a generator, verifier, and refinement model iteratively solving problems. We propose a trajectory-expansion-based synthetic data generation process and a credit assignment strategy driven by joint outcome based rewards. This enables our post-training setup to utilize both positive and negative trajectories to autonomously improve each model's specialized capabilities as part of a joint sequential system. We evaluate our approach across MATH, GSM8k, and CQA, where MALT on Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%, and 9.40% respectively over the same baseline model. This demonstrates an early advance in multi-agent cooperative capabilities for performance on mathematical and common sense reasoning questions. More generally, our work provides a concrete direction for research around multi-agent LLM training approaches.
    [COMMENTS]Preliminary work
    [LINK]http://arxiv.org/abs/2412.01928v1
    [DATE]2024-12-03 03:30:36+08:00
    [CATEGORIES]cs.LG
  • CREW: Facilitating Human-AI Teaming Research
    [AUTHORS]Lingyu Zhang, Zhengran Ji, Boyuan Chen
    [ABSTRACT]With the increasing deployment of artificial intelligence (AI) technologies, the potential of humans working with AI agents has been growing at a great speed. Human-AI teaming is an important paradigm for studying various aspects when humans and AI agents work together. The unique aspect of Human-AI teaming research is the need to jointly study humans and AI agents, demanding multidisciplinary research efforts from machine learning to human-computer interaction, robotics, cognitive science, neuroscience, psychology, social science, and complex systems. However, existing platforms for Human-AI teaming research are limited, often supporting oversimplified scenarios and a single task, or specifically focusing on either human-teaming research or multi-agent AI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming research in real-time decision-making scenarios and engage collaborations from multiple scientific disciplines, with a strong emphasis on human involvement. It includes pre-built tasks for cognitive studies and Human-AI teaming with expandable potentials from our modular design. Following conventional cognitive neuroscience research, CREW also supports multimodal human physiological signal recording for behavior analysis. Moreover, CREW benchmarks real-time human-guided reinforcement learning agents using state-of-the-art algorithms and well-tuned baselines. With CREW, we were able to conduct 50 human subject studies within a week to verify the effectiveness of our benchmark.
    [COMMENTS]Our project website is at: http://generalroboticslab.com/CREW
    [LINK]http://arxiv.org/abs/2408.00170v2
    [DATE]2024-12-03 02:37:01+08:00
    [CATEGORIES]cs.LG
  • Asynchronous Message-Passing and Zeroth-Order Optimization Based Distributed Learning with a Use-Case in Resource Allocation in Communication Networks
    [AUTHORS]Pourya Behmandpoor, Marc Moonen, Panagiotis Patrinos
    [ABSTRACT]Distributed learning and adaptation have received significant interest and found wide-ranging applications in machine learning and signal processing. While various approaches, such as shared-memory optimization, multi-task learning, and consensus-based learning (e.g., federated learning and learning over graphs), focus on optimizing either local costs or a global cost, there remains a need for further exploration of their interconnections. This paper specifically focuses on a scenario where agents collaborate towards a common task (i.e., optimizing a global cost equal to aggregated local costs) while effectively having distinct individual tasks (i.e., optimizing individual local parameters in a local cost). Each agent's actions can potentially impact other agents' performance through interactions. Notably, each agent has access to only its local zeroth-order oracle (i.e., cost function value) and shares scalar values, rather than gradient vectors, with other agents, leading to communication bandwidth efficiency and agent privacy. Agents employ zeroth-order optimization to update their parameters, and the asynchronous message-passing between them is subject to bounded but possibly random communication delays. This paper presents theoretical convergence analyses and establishes a convergence rate for nonconvex problems. Furthermore, it addresses the relevant use-case of deep learning-based resource allocation in communication networks and conducts numerical experiments in which agents, acting as transmitters, collaboratively train their individual policies to maximize a global reward, e.g., a sum of data rates.
    [LINK]http://arxiv.org/abs/2311.04604v3
    [DATE]2024-12-03 02:02:53+08:00
    [CATEGORIES]cs.LG
  • Discovering group dynamics in coordinated time series via hierarchical recurrent switching-state models
    [AUTHORS]Michael T. Wojnowicz, Kaitlin Gili, Preetish Rath, Eric Miller, Jeffrey Miller, Clifford Hancock, Meghan O'Donovan, Seth Elkin-Frankston, Tad T. Brunyé, Michael C. Hughes
    [ABSTRACT]We seek a computationally efficient model for a collection of time series arising from multiple interacting entities (a.k.a. "agents"). Recent models of spatiotemporal patterns across individuals fail to incorporate explicit system-level collective behavior that can influence the trajectories of individual entities. To address this gap in the literature, we present a new hierarchical switching-state model that can be trained in an unsupervised fashion to simultaneously learn both system-level and individual-level dynamics. We employ a latent system-level discrete state Markov chain that provides top-down influence on latent entity-level chains which in turn govern the emission of each observed time series. Recurrent feedback from the observations to the latent chains at both entity and system levels allows recent situational context to inform how dynamics unfold at all levels in bottom-up fashion. We hypothesize that including both top-down and bottom-up influences on group dynamics will improve interpretability of the learned dynamics and reduce error when forecasting. Our hierarchical switching recurrent dynamical model can be learned via closed-form variational coordinate ascent updates to all latent chains that scale linearly in the number of entities. This is asymptotically no more costly than fitting a separate model for each entity. Analysis of both synthetic data and real basketball team movements suggests our lean parametric model can achieve competitive forecasts compared to larger neural network models that require far more computational resources. Further experiments on soldier data as well as a synthetic task with 64 cooperating entities show how our approach can yield interpretable insights about team dynamics over time.
    [LINK]http://arxiv.org/abs/2401.14973v2
    [DATE]2024-12-03 01:35:07+08:00
    [CATEGORIES]cs.LG
2024 Dec 02, Mon
  • Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking
    [AUTHORS]Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu
    [ABSTRACT]Clinical decision making (CDM) is a complex, dynamic process crucial to healthcare delivery, yet it remains a significant challenge for artificial intelligence systems. While Large Language Model (LLM)-based agents have been tested on general medical knowledge using licensing exams and knowledge question-answering tasks, their performance in the CDM in real-world scenarios is limited due to the lack of comprehensive testing datasets that mirror actual medical practice. To address this gap, we present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. MedChain distinguishes itself from existing benchmarks with three key features of real-world clinical practice: personalization, interactivity, and sequentiality. Further, to tackle real-world CDM challenges, we also propose MedChain-Agent, an AI system that integrates a feedback mechanism and a MCase-RAG module to learn from previous cases and adapt its responses. MedChain-Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches. The relevant dataset and code will be released upon acceptance of this paper.
    [LINK]http://arxiv.org/abs/2412.01605v1
    [DATE]2024-12-02 23:25:02+08:00
    [CATEGORIES]cs.CL
  • Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach
    [AUTHORS]Tianyi Huang, Arya Somasundaram
    [ABSTRACT]Large Language Models (LLMs) often perpetuate biases in pronoun usage, leading to misrepresentation or exclusion of queer individuals. This paper addresses the specific problem of biased pronoun usage in LLM outputs, particularly the inappropriate use of traditionally gendered pronouns ("he," "she") when inclusive language is needed to accurately represent all identities. We introduce a collaborative agent pipeline designed to mitigate these biases by analyzing and optimizing pronoun usage for inclusivity. Our multi-agent framework includes specialized agents for both bias detection and correction. Experimental evaluations using the Tango dataset (a benchmark focused on gender pronoun usage) demonstrate that our approach significantly improves inclusive pronoun classification, achieving a 32.6 percentage point increase over GPT-4o in correctly disagreeing with inappropriate traditionally gendered pronouns $(\chi^2 = 38.57, p < 0.0001)$. These results accentuate the potential of agent-driven frameworks in enhancing fairness and inclusivity in AI-generated content, demonstrating their efficacy in reducing biases and promoting socially responsible AI.
    [COMMENTS]NeurIPS 2024 Queer in AI Workshop
    [LINK]http://arxiv.org/abs/2411.07656v2
    [DATE]2024-12-02 12:36:45+08:00
    [CATEGORIES]cs.CL
  • SAUP: Situation Awareness Uncertainty Propagation on LLM Agent
    [AUTHORS]Qiwei Zhao, Xujiang Zhao, Yanchi Liu, Wei Cheng, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Huaxiu Yao, Haifeng Chen
    [ABSTRACT]Large language models (LLMs) integrated into multistep agent systems enable complex decision-making processes across various applications. However, their outputs often lack reliability, making uncertainty estimation crucial. Existing uncertainty estimation methods primarily focus on final-step outputs, which fail to account for cumulative uncertainty over the multistep decision-making process and the dynamic interactions between agents and their environments. To address these limitations, we propose SAUP (Situation Awareness Uncertainty Propagation), a novel framework that propagates uncertainty through each step of an LLM-based agent's reasoning process. SAUP incorporates situational awareness by assigning situational weights to each step's uncertainty during the propagation. Our method, compatible with various one-step uncertainty estimation techniques, provides a comprehensive and accurate uncertainty measure. Extensive experiments on benchmark datasets demonstrate that SAUP significantly outperforms existing state-of-the-art methods, achieving up to 20% improvement in AUROC.
    [LINK]http://arxiv.org/abs/2412.01033v1
    [DATE]2024-12-02 09:31:13+08:00
    [CATEGORIES]cs.CL cs.LG
  • Towards Type Agnostic Cyber Defense Agents
    [AUTHORS]Erick Galinkin, Emmanouil Pountrourakis, Spiros Mancoridis
    [ABSTRACT]With computing now ubiquitous across government, industry, and education, cybersecurity has become a critical component for every organization on the planet. Due to this ubiquity of computing, cyber threats have continued to grow year over year, leading to labor shortages and a skills gap in cybersecurity. As a result, many cybersecurity product vendors and security organizations have looked to artificial intelligence to shore up their defenses. This work considers how to characterize attackers and defenders in one approach to the automation of cyber defense: the application of reinforcement learning. Specifically, we characterize the types of attackers and defenders in the sense of Bayesian games and, using reinforcement learning, derive empirical findings about how to best train agents that defend against multiple types of attackers.
    [COMMENTS]Submitted to AICS 2025: https://aics.site
    [LINK]http://arxiv.org/abs/2412.01542v1
    [DATE]2024-12-02 22:32:18+08:00
    [CATEGORIES]cs.LG
  • Moral Alignment for LLM Agents
    [AUTHORS]Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
    [ABSTRACT]Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of this will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques.
    [LINK]http://arxiv.org/abs/2410.01639v2
    [DATE]2024-12-02 22:25:30+08:00
    [CATEGORIES]cs.LG
  • MASP: Scalable GNN-based Planning for Multi-Agent Navigation
    [AUTHORS]Xinyi Yang, Xinting Yang, Chao Yu, Jiayu Chen, Wenbo Ding, Huazhong Yang, Yu Wang
    [ABSTRACT]We investigate multi-agent navigation tasks, where multiple agents need to reach initially unassigned goals in a limited time. Classical planning-based methods suffer from expensive computation overhead at each step and offer limited expressiveness for complex cooperation strategies. In contrast, reinforcement learning (RL) has recently become a popular approach for addressing this issue. However, RL struggles with low data efficiency and cooperation when directly exploring (nearly) optimal policies in a large exploration space, especially with an increased number of agents (e.g., 10+ agents) or in complex environments (e.g., 3-D simulators). In this paper, we propose the Multi-Agent Scalable Graph-based Planner (MASP), a goal-conditioned hierarchical planner for navigation tasks with a substantial number of agents in the decentralized setting. MASP employs a hierarchical framework to reduce space complexity by decomposing a large exploration space into multiple goal-conditioned subspaces, where a high-level policy assigns agents goals, and a low-level policy navigates agents toward designated goals. For agent cooperation and adaptation to varying team sizes, we model agents and goals as graphs to better capture their relationship. The high-level policy, the Goal Matcher, leverages a graph-based Self-Encoder and Cross-Encoder to optimize goal assignment by updating the agent and the goal graphs. The low-level policy, the Coordinated Action Executor, introduces the Group Information Fusion to facilitate group division and extract agent relationships across groups, enhancing training efficiency for agent cooperation. The results demonstrate that MASP outperforms RL and planning-based baselines in task efficiency.
    [COMMENTS]Submitted to IEEE RA-L
    [LINK]http://arxiv.org/abs/2312.02522v2
    [DATE]2024-12-02 20:49:50+08:00
    [CATEGORIES]cs.LG
  • Masked Generative Priors Improve World Models Sequence Modelling Capabilities
    [AUTHORS]Cristian Meo, Mircea Lica, Zarif Ikram, Akihiro Nakano, Vedant Shah, Aniket Rajiv Didolkar, Dianbo Liu, Anirudh Goyal, Justin Dauwels
    [ABSTRACT]Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self-supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and introduce GIT-STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling our model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer-based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.
    [LINK]http://arxiv.org/abs/2410.07836v4
    [DATE]2024-12-02 20:44:48+08:00
    [CATEGORIES]cs.LG
  • Multi-turn Reinforcement Learning from Preference Human Feedback
    [AUTHORS]Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos
    [ABSTRACT]Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal. In this paper, we address this issue by developing novel methods for Reinforcement Learning (RL) from preference feedback between two full multi-turn conversations. In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to Nash equilibrium. To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal.
    [LINK]http://arxiv.org/abs/2405.14655v2
    [DATE]2024-12-02 20:37:46+08:00
    [CATEGORIES]cs.LG
  • LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations
    [AUTHORS]Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein
    [ABSTRACT]Today's largest foundation models have increasingly general capabilities, yet when used as agents, they often struggle with simple reasoning and decision-making tasks, even though they possess good factual knowledge of the task and how to solve it. In this paper, we present a benchmark to pressure-test these models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether they can learn from a large number of expert demonstrations in their context. We evaluate a wide range of state-of-the-art frontier models as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We measure the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, o1-mini, and o1-preview under increasing amounts of expert demonstrations in the context (from no demonstrations up to 512 full episodes), pushing these models' multimodal long-context reasoning capabilities to their limits. Across our tasks, today's frontier models rarely manage to fully reach expert performance, showcasing the difficulty of our benchmark. Presenting more demonstrations often has little effect, but some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. Overall, our results suggest that even today's most capable models often struggle to imitate desired behavior by generalizing purely from in-context demonstrations. To help quantify the impact of other approaches and future innovations aiming to tackle this problem, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
    [LINK]http://arxiv.org/abs/2412.01441v1
    [DATE]2024-12-02 20:31:58+08:00
    [CATEGORIES]cs.LG
  • Task Adaptation of Reinforcement Learning-based NAS Agents through Transfer Learning
    [AUTHORS]Amber Cassimon, Siegfried Mercelis, Kevin Mets
    [ABSTRACT]Recently, a novel paradigm has been proposed for reinforcement learning-based NAS agents, that revolves around the incremental improvement of a given architecture. We assess the abilities of such reinforcement learning agents to transfer between different tasks. We perform our evaluation using the Trans-NASBench-101 benchmark, and consider the efficacy of the transferred agents, as well as how quickly they can be trained. We find that pretraining an agent on one task benefits the performance of the agent in another task in all but 1 task when considering final performance. We also show that the training procedure for an agent can be shortened significantly by pretraining it on another task. Our results indicate that these effects occur regardless of the source or target task, although they are more pronounced for some tasks than for others. Our results show that transfer learning can be an effective tool in mitigating the computational cost of the initial training procedure for reinforcement learning-based NAS agents.
    [COMMENTS]15 Pages, 13 Figures
    [LINK]http://arxiv.org/abs/2412.01420v1
    [DATE]2024-12-02 20:00:27+08:00
    [CATEGORIES]cs.LG
  • Practical Performative Policy Learning with Strategic Agents
    [AUTHORS]Qianyi Chen, Ying Chen, Bo Li
    [ABSTRACT]This paper studies the performative policy learning problem, where agents adjust their features in response to a released policy to improve their potential outcomes, inducing an endogenous distribution shift. There has been growing interest in training machine learning models in strategic environments, including strategic classification and performative prediction. However, existing approaches often rely on restrictive parametric assumptions: micro-level utility models in strategic classification and macro-level data distribution maps in performative prediction, severely limiting scalability and generalizability. We approach this problem as a complex causal inference task, relaxing parametric assumptions on both micro-level agent behavior and macro-level data distribution. Leveraging bounded rationality, we uncover a practical low-dimensional structure in distribution shifts and construct an effective mediator in the causal path from the deployed model to the shifted data. We then propose a gradient-based policy optimization algorithm with a differentiable classifier as a substitute for the high-dimensional distribution map. Our algorithm efficiently utilizes batch feedback and limited manipulation patterns. Our approach achieves high sample efficiency compared to methods reliant on bandit feedback or zero-order optimization. We also provide theoretical guarantees for algorithmic convergence. Extensive and challenging experiments on high-dimensional settings demonstrate our method's practical efficacy.
    [LINK]http://arxiv.org/abs/2412.01344v1
    [DATE]2024-12-02 18:09:44+08:00
    [CATEGORIES]cs.LG
  • BricksRL: A Platform for Democratizing Robotics and Reinforcement Learning Research and Education with LEGO
    [AUTHORS]Sebastian Dittert, Vincent Moens, Gianni De Fabritiis
    [ABSTRACT]We present BricksRL, a platform designed to democratize access to robotics for reinforcement learning research and education. BricksRL facilitates the creation, design, and training of custom LEGO robots in the real world by interfacing them with the TorchRL library for reinforcement learning agents. The integration of TorchRL with the LEGO hubs, via Bluetooth bidirectional communication, enables state-of-the-art reinforcement learning training on GPUs for a wide variety of LEGO builds. This offers a flexible and cost-efficient approach for scaling and also provides a robust infrastructure for robot-environment-algorithm communication. We present various experiments across tasks and robot configurations, providing built plans and training results. Furthermore, we demonstrate that inexpensive LEGO robots can be trained end-to-end in the real world to achieve simple tasks, with training times typically under 120 minutes on a normal laptop. Moreover, we show how users can extend the capabilities, exemplified by the successful integration of non-LEGO sensors. By enhancing accessibility to both robotics and reinforcement learning, BricksRL establishes a strong foundation for democratized robotic learning in research and educational settings.
    [LINK]http://arxiv.org/abs/2406.17490v2
    [DATE]2024-12-02 17:49:23+08:00
    [CATEGORIES]cs.LG
  • Dense Dynamics-Aware Reward Synthesis: Integrating Prior Experience with Demonstrations
    [AUTHORS]Cevahir Koprulu, Po-han Li, Tianyu Qiu, Ruihan Zhao, Tyler Westenbroek, David Fridovich-Keil, Sandeep Chinchali, Ufuk Topcu
    [ABSTRACT]Many continuous control problems can be formulated as sparse-reward reinforcement learning (RL) tasks. In principle, online RL methods can automatically explore the state space to solve each new task. However, discovering sequences of actions that lead to a non-zero reward becomes exponentially more difficult as the task horizon increases. Manually shaping rewards can accelerate learning for a fixed task, but it is an arduous process that must be repeated for each new environment. We introduce a systematic reward-shaping framework that distills the information contained in 1) a task-agnostic prior data set and 2) a small number of task-specific expert demonstrations, and then uses these priors to synthesize dense dynamics-aware rewards for the given task. This supervision substantially accelerates learning in our experiments, and we provide analysis demonstrating how the approach can effectively guide online learning agents to faraway goals.
    [LINK]http://arxiv.org/abs/2412.01114v1
    [DATE]2024-12-02 12:37:12+08:00
    [CATEGORIES]cs.LG
  • Realizable Continuous-Space Shields for Safe Reinforcement Learning
    [AUTHORS]Kyungmin Kim, Davide Corsi, Andoni Rodriguez, JB Lanier, Benjami Parellada, Pierre Baldi, Cesar Sanchez, Roy Fox
    [ABSTRACT]While Deep Reinforcement Learning (DRL) has achieved remarkable success across various domains, it remains vulnerable to occasional catastrophic failures without additional safeguards. An effective solution to prevent these failures is to use a shield that validates and adjusts the agent's actions to ensure compliance with a provided set of safety specifications. For real-world robotic domains, it is essential to define safety specifications over continuous state and action spaces to accurately account for system dynamics and compute new actions that minimally deviate from the agent's original decision. In this paper, we present the first shielding approach specifically designed to ensure the satisfaction of safety requirements in continuous state and action spaces, making it suitable for practical robotic applications. Our method builds upon realizability, an essential property that confirms the shield will always be able to generate a safe action for any state in the environment. We formally prove that realizability can be verified for stateful shields, enabling the incorporation of non-Markovian safety requirements, such as loop avoidance. Finally, we demonstrate the effectiveness of our approach in ensuring safety without compromising the policy's success rate by applying it to a navigation problem and a multi-agent particle environment.
    [COMMENTS]Kim, Corsi, and Rodriguez contributed equally
    [LINK]http://arxiv.org/abs/2410.02038v2
    [DATE]2024-12-02 12:20:10+08:00
    [CATEGORIES]cs.LG
  • A Memory-Based Reinforcement Learning Approach to Integrated Sensing and Communication
    [AUTHORS]Homa Nikbakht, Michèle Wigger, Shlomo Shamai, H. Vincent Poor
    [ABSTRACT]In this paper, we consider a point-to-point integrated sensing and communication (ISAC) system, where a transmitter conveys a message to a receiver over a channel with memory and simultaneously estimates the state of the channel through the backscattered signals from the emitted waveform. Using Massey's concept of directed information for channels with memory, we formulate the capacity-distortion tradeoff for the ISAC problem when sensing is performed in an online fashion. Optimizing the transmit waveform for this system to simultaneously achieve good communication and sensing performance is a complicated task, and thus we propose a deep reinforcement learning (RL) approach to find a solution. The proposed approach enables the agent to optimize the ISAC performance by learning a reward that reflects the difference between the communication gain and the sensing loss. Since the state-space in our RL model is a priori unbounded, we employ the deep deterministic policy gradient (DDPG) algorithm. Our numerical results suggest a significant performance improvement when one considers unbounded state-space as opposed to a simpler RL problem with reduced state-space. In the extreme case of degenerate state-space only memoryless signaling strategies are possible. Our results thus emphasize the necessity of well exploiting the memory inherent in ISAC systems.
    [LINK]http://arxiv.org/abs/2412.01077v1
    [DATE]2024-12-02 11:30:50+08:00
    [CATEGORIES]cs.LG
  • Multi-Agent Deep Reinforcement Learning for Distributed and Autonomous Platoon Coordination via Speed-regulation over Large-scale Transportation Networks
    [AUTHORS]Dixiao Wei, Peng Yi, Jinlong Lei, Xingyi Zhu
    [ABSTRACT]Truck platooning technology enables a group of trucks to travel closely together, with which the platoon can save fuel, improve traffic flow efficiency, and improve safety. In this paper, we consider the platoon coordination problem in a large-scale transportation network, to promote cooperation among trucks and optimize the overall efficiency. Involving the regulation of both speed and departure times at hubs, we formulate the coordination problem as a complicated dynamic stochastic integer programming under network and information constraints. To get an autonomous, distributed, and robust platoon coordination policy, we formulate the problem into a model of the Decentralized Partially Observable Markov Decision Process. Then, we propose a Multi-Agent Deep Reinforcement Learning framework named Truck Attention-QMIX (TA-QMIX) to train an efficient online decision policy. TA-QMIX utilizes the attention mechanism to enhance the representation of truck fuel gains and delay times, and provides explicit truck cooperation information during the training process, promoting trucks' willingness to cooperate. The training framework adopts centralized training and distributed execution, thus training a policy for trucks to make decisions online using only nearby information. Hence, the policy can be autonomously executed on a large-scale network. Finally, we perform comparison experiments and ablation experiments in the transportation network of the Yangtze River Delta region in China to verify the effectiveness of the proposed framework. In a repeated comparative experiment with 5,000 trucks, our method saves an average of 19.17% of fuel with an average delay of only 9.57 minutes per truck and a decision time of 0.001 seconds.
    [LINK]http://arxiv.org/abs/2412.01075v1
    [DATE]2024-12-02 11:21:40+08:00
    [CATEGORIES]cs.LG
  • Provable Partially Observable Reinforcement Learning with Privileged Information
    [AUTHORS]Yang Cai, Xiangyu Liu, Argyris Oikonomou, Kaiqing Zhang
    [ABSTRACT]Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain privileged information, e.g., the access to states from simulators, has been exploited in training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting. Specifically, we first formalize the empirical paradigm of expert distillation (also known as teacher-student learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition of the partially observable environment, the deterministic filter condition, under which expert distillation achieves sample and computational complexities that are both polynomial. Furthermore, we investigate another useful empirical paradigm of asymmetric actor-critic, and focus on the more challenging setting of observable partially observable Markov decision processes. We develop a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, in which one key component is a new provable oracle for learning belief states that preserve filter stability under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms featuring centralized-training-with-decentralized-execution, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.
    [COMMENTS]This paper has been accepted to 2024 Conference on Neural Information Processing Systems (NeurIPS 2024)
    [LINK]http://arxiv.org/abs/2412.00985v1
    [DATE]2024-12-02 06:26:27+08:00
    [CATEGORIES]cs.LG
  • STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft
    [AUTHORS]Nicholas Lenzen, Amogh Raut, Andrew Melnik
    [ABSTRACT]Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.
    [COMMENTS]Accepted at CoRL 2024: Workshop on Lifelong Learning for Home Robots
    [LINK]http://arxiv.org/abs/2412.00949v1
    [DATE]2024-12-02 03:48:57+08:00
    [CATEGORIES]cs.LG
  • Bilinear Convolution Decomposition for Causal RL Interpretability
    [AUTHORS]Narmeen Oozeer, Sinem Erisken, Alice Rigg
    [ABSTRACT]Efforts to interpret reinforcement learning (RL) models often rely on high-level techniques such as attribution or probing, which provide only correlational insights and coarse causal control. This work proposes replacing nonlinearities in convolutional neural networks (ConvNets) with bilinear variants, to produce a class of models for which these limitations can be addressed. We show bilinear model variants perform comparably in model-free reinforcement learning settings, and give a side by side comparison on ProcGen environments. Bilinear layers' analytic structure enables weight-based decomposition. Previous work has shown bilinearity enables quantifying functional importance through eigendecomposition, to identify interpretable low rank structure. We show how to adapt the decomposition to convolution layers by applying singular value decomposition to vectors of interest, to separate the channel and spatial dimensions. Finally, we propose a methodology for causally validating concept-based probes, and illustrate its utility by studying a maze-solving agent's ability to track a cheese object.
    [COMMENTS]8 pages, 10 figures
    [LINK]http://arxiv.org/abs/2412.00944v1
    [DATE]2024-12-02 03:32:04+08:00
    [CATEGORIES]cs.LG
  • A Deep Generative Model for the Design of Synthesizable Ionizable Lipids
    [AUTHORS]Yuxuan Ou, Jingyi Zhao, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
    [ABSTRACT]Lipid nanoparticles (LNPs) are vital in modern biomedicine, enabling the effective delivery of mRNA for vaccines and therapies by protecting it from rapid degradation. Among the components of LNPs, ionizable lipids play a key role in RNA protection and facilitate its delivery into the cytoplasm. However, designing ionizable lipids is complex. Deep generative models can accelerate this process and explore a larger candidate space compared to traditional methods. Due to the structural differences between lipids and small molecules, existing generative models used for small molecule generation are unsuitable for lipid generation. To address this, we developed a deep generative model specifically tailored for the discovery of ionizable lipids. Our model generates novel ionizable lipid structures and provides synthesis paths using synthetically accessible building blocks, addressing synthesizability. This advancement holds promise for streamlining the development of lipid-based delivery systems, potentially accelerating the deployment of new therapeutic agents, including mRNA vaccines and gene therapies.
    [COMMENTS]NeurIPS 2024 Workshop on AI for New Drug Modalities
    [LINK]http://arxiv.org/abs/2412.00928v1
    [DATE]2024-12-02 02:33:22+08:00
    [CATEGORIES]cs.LG
2024 Dec 01, Sun
  • Does chat change LLM's mind? Impact of Conversation on Psychological States of LLMs
    [AUTHORS]Junhyuk Choi, Yeseon Hong, Minju Kim, Bugeun Kim
    [ABSTRACT]The recent growth of large language models (LLMs) has enabled more authentic, human-centered interactions through multi-agent systems. However, investigation into how conversations affect the psychological states of LLMs is limited, despite the impact of these states on the usability of LLM-based systems. In this study, we explored whether psychological states change during multi-agent interactions, focusing on the effects of conversation depth, topic, and speaker. We experimentally investigated the behavior of 10 LLMs in open-domain conversations. We employed 14 questionnaires and a topic-analysis method to examine the behavior of LLMs across four aspects: personality, interpersonal relationships, motivation, and emotion. The results revealed distinct psychological trends influenced by conversation depth and topic, with significant variations observed between different LLM families and parameter sizes.
    [COMMENTS]Under review
    [LINK]http://arxiv.org/abs/2412.00804v1
    [DATE]2024-12-01 21:19:32+08:00
    [CATEGORIES]cs.CL
  • Towards Adaptive Mechanism Activation in Language Agent
    [AUTHORS]Ziyang Huang, Jun Zhao, Kang Liu
    [ABSTRACT]Language agents can be endowed with different mechanisms for autonomous task accomplishment. Current agents typically rely on fixed mechanisms or a set of mechanisms activated in a predefined order, limiting their adaptation to varied potential task solution structures. To this end, this paper proposes Adaptive Language Agent Mechanism Activation Learning with Self-Exploration (ALAMA), which focuses on optimizing mechanism activation adaptability without reliance on expert models. Initially, it builds a harmonized agent framework (UniAct) to Unify different mechanisms via Actions. Then it leverages a training-efficient optimization method based on self-exploration to enable UniAct to adaptively activate the appropriate mechanisms according to the potential characteristics of the task. Experimental results demonstrate significant improvements in downstream agent tasks, affirming the effectiveness of our approach in facilitating more dynamic and context-sensitive mechanism activation.
    [COMMENTS]COLING 2025
    [LINK]http://arxiv.org/abs/2412.00722v1
    [DATE]2024-12-01 16:10:04+08:00
    [CATEGORIES]cs.CL
  • Multi-Agent Collaboration in Incident Response with Large Language Models
    [AUTHORS]Zefang Liu
    [ABSTRACT]Incident response (IR) is a critical aspect of cybersecurity, requiring rapid decision-making and coordinated efforts to address cyberattacks effectively. Leveraging large language models (LLMs) as intelligent agents offers a novel approach to enhancing collaboration and efficiency in IR scenarios. This paper explores the application of LLM-based multi-agent collaboration using the Backdoors & Breaches framework, a tabletop game designed for cybersecurity training. We simulate real-world IR dynamics through various team structures, including centralized, decentralized, and hybrid configurations. By analyzing agent interactions and performance across these setups, we provide insights into optimizing multi-agent collaboration for incident response. Our findings highlight the potential of LLMs to enhance decision-making, improve adaptability, and streamline IR processes, paving the way for more effective and coordinated responses to cyber threats.
    [LINK]http://arxiv.org/abs/2412.00652v1
    [DATE]2024-12-01 11:12:26+08:00
    [CATEGORIES]cs.CL
  • A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning
    [AUTHORS]Xinzhe Li
    [ABSTRACT]Tool use, planning, and feedback learning are currently three prominent paradigms for developing Large Language Model (LLM)-based agents across various tasks. Although numerous frameworks have been devised for each paradigm, their intricate workflows and inconsistent taxonomy create challenges in understanding and reviewing the frameworks across different paradigms. This survey introduces a unified taxonomy to systematically review and discuss these frameworks. Specifically, 1) the taxonomy defines environments/tasks, common LLM-profiled roles or LMPRs (policy models, evaluators, and dynamic models), and universally applicable workflows found in prior work, and 2) it enables a comparison of key perspectives on the implementations of LMPRs and workflow designs across different agent paradigms and frameworks. 3) Finally, we identify three limitations in existing workflow designs and systematically discuss the future work. Resources have been made publicly available in our GitHub repository: https://github.com/xinzhel/LLM-Agent-Survey.
    [COMMENTS]COLING 2025 Camera Ready (extended to 9 pages)
    [LINK]http://arxiv.org/abs/2406.05804v6
    [DATE]2024-12-01 06:38:57+08:00
    [CATEGORIES]cs.CL
  • Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective
    [AUTHORS]Yue Zhou, Barbara Di Eugenio, Lu Cheng
    [ABSTRACT]This paper studies the performance of large language models (LLMs), particularly regarding demographic fairness, in solving real-world healthcare tasks. We evaluate state-of-the-art LLMs with three prevalent learning frameworks across six diverse healthcare tasks and find significant challenges in applying LLMs to real-world healthcare tasks and persistent fairness issues across demographic groups. We also find that explicitly providing demographic information yields mixed results, while LLMs' ability to infer such details raises concerns about biased health predictions. Utilizing LLMs as autonomous agents with access to up-to-date guidelines does not guarantee performance improvement. We believe these findings reveal the critical limitations of LLMs in healthcare fairness and the urgent need for specialized research in this area.
    [COMMENTS]Accepted to the main conference of COLING 2025
    [LINK]http://arxiv.org/abs/2412.00554v1
    [DATE]2024-12-01 02:52:30+08:00
    [CATEGORIES]cs.CL
  • Online Poisoning Attack Against Reinforcement Learning under Black-box Environments
    [AUTHORS]Jianhui Li, Bokang Zhang, Junfeng Wu
    [ABSTRACT]This paper proposes an online environment poisoning algorithm tailored for reinforcement learning agents operating in a black-box setting, where an adversary deliberately manipulates training data to lead the agent toward a mischievous policy. In contrast to prior studies that primarily investigate white-box settings, we focus on a scenario characterized by unknown environment dynamics to the attacker and a flexible reinforcement learning algorithm employed by the targeted agent. We first propose an attack scheme that is capable of poisoning the reward functions and state transitions. The poisoning task is formalized as a constrained optimization problem, following the framework of \cite{ma2019policy}. Given the transition probabilities are unknown to the attacker in a black-box environment, we apply a stochastic gradient descent algorithm, where the exact gradients are approximated using sample-based estimates. A penalty-based method along with a bilevel reformulation is then employed to transform the problem into an unconstrained counterpart and to circumvent the double-sampling issue. The algorithm's effectiveness is validated through a maze environment.
    [LINK]http://arxiv.org/abs/2412.00797v1
    [DATE]2024-12-01 20:43:23+08:00
    [CATEGORIES]cs.LG
  • InvestESG: A multi-agent reinforcement learning benchmark for studying climate investment as a social dilemma
    [AUTHORS]Xiaoxuan Hou, Jiayi Yuan, Joel Z. Leibo, Natasha Jaques
    [ABSTRACT]InvestESG is a novel multi-agent reinforcement learning (MARL) benchmark designed to study the impact of Environmental, Social, and Governance (ESG) disclosure mandates on corporate climate investments. Supported by both PyTorch and JAX implementation, the benchmark models an intertemporal social dilemma where companies balance short-term profit losses from climate mitigation efforts and long-term benefits from reducing climate risk, while ESG-conscious investors attempt to influence corporate behavior through their investment decisions, in a scalable and hardware-accelerated manner. Companies allocate capital across mitigation, greenwashing, and resilience, with varying strategies influencing climate outcomes and investor preferences. Our experiments show that without ESG-conscious investors with sufficient capital, corporate mitigation efforts remain limited under the disclosure mandate. However, when a critical mass of investors prioritizes ESG, corporate cooperation increases, which in turn reduces climate risks and enhances long-term financial stability. Additionally, providing more information about global climate risks encourages companies to invest more in mitigation, even without investor involvement. Our findings align with empirical research using real-world data, highlighting MARL's potential to inform policy by providing insights into large-scale socio-economic challenges through efficient testing of alternative policy and market designs.
    [LINK]http://arxiv.org/abs/2411.09856v2
    [DATE]2024-12-01 13:18:12+08:00
    [CATEGORIES]cs.LG
  • Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
    [AUTHORS]Emile Anand, Ishani Karmarkar, Guannan Qu
    [ABSTRACT]Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the sizes of the joint state and action spaces are exponentially large in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm SUBSAMPLE-MFQ (Subsample-Mean-Field-Q-learning) and a decentralized randomized policy for a system with $n$ agents. For $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We show that this learned policy converges to the optimal policy in the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. We validate our method empirically on Gaussian squeeze and global exploration settings.
    [COMMENTS]48 pages, 7 figures
    [LINK]http://arxiv.org/abs/2412.00661v1
    [DATE]2024-12-01 11:45:17+08:00
    [CATEGORIES]cs.LG
  • DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
    [AUTHORS]Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
    [ABSTRACT]On-device control agents, especially on mobile devices, are responsible for operating mobile devices to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and enables training data collection 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.
    [COMMENTS]Paper and Appendix, 26 pages
    [LINK]http://arxiv.org/abs/2410.14803v4
    [DATE]2024-12-01 10:09:21+08:00
    [CATEGORIES]cs.LG
  • Towards Fault Tolerance in Multi-Agent Reinforcement Learning
    [AUTHORS]Yuchen Shi, Huaxin Pei, Liang Feng, Yi Zhang, Danya Yao
    [ABSTRACT]Agent faults pose a significant threat to the performance of multi-agent reinforcement learning (MARL) algorithms, introducing two key challenges. First, agents often struggle to extract critical information from the chaotic state space created by unexpected faults. Second, transitions recorded before and after faults in the replay buffer affect training unevenly, leading to a sample imbalance problem. To overcome these challenges, this paper enhances the fault tolerance of MARL by combining optimized model architecture with a tailored training data sampling strategy. Specifically, an attention mechanism is incorporated into the actor and critic networks to automatically detect faults and dynamically regulate the attention given to faulty agents. Additionally, a prioritization mechanism is introduced to selectively sample transitions critical to current training needs. To further support research in this area, we design and open-source a highly decoupled code platform for fault-tolerant MARL, aimed at improving the efficiency of studying related problems. Experimental results demonstrate the effectiveness of our method in handling various types of faults, faults occurring in any agent, and faults arising at random times.
    [COMMENTS]14 pages, 13 figures
    [LINK]http://arxiv.org/abs/2412.00534v1
    [DATE]2024-12-01 00:56:29+08:00
    [CATEGORIES]cs.LG
  • Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation
    [AUTHORS]Yiyuan Pan, Yunzhe Xu, Zhe Liu, Hesheng Wang
    [ABSTRACT]Humans navigate unfamiliar environments using the capabilities of episodic simulation and episodic memory. Developing imagination-based memory, analogous to episodic simulation and episodic memory, can enhance embodied agents' comprehension of the complex relationship between environments and objects. However, existing Vision-and-Language Navigation (VLN) agents fail to perform the aforementioned mechanism. We propose a novel architecture to help agents build a recurrent imaginative memory system. Specifically, the agent can maintain a reality-imagination hybrid global memory during navigation and expand the memory map through imaginative mechanisms and navigation actions. Correspondingly, we design a series of pre-training tasks to help the agent acquire fine-grained imaginative abilities. Our agents improve the state-of-the-art (SoTA) success rate (SR) by 7% while simultaneously imagining high-fidelity RGB representations for future scenes.
    [LINK]http://arxiv.org/abs/2412.01857v1
    [DATE]2024-12-01 00:49:14+08:00
    [CATEGORIES]cs.LG
2024 Dec 03, Tue
  • Compute-Constrained Data Selection
    [AUTHORS]Junjie Oscar Yin, Alexander M. Rush
    [ABSTRACT]Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate both from a theoretical and empirical perspective. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
    [LINK]http://arxiv.org/abs/2410.16208v3
    [DATE]2024-12-03 02:59:28+08:00
    [CATEGORIES]cs.LG cs.CL
2024 Dec 05, Thu
  • A Context-aware Framework for Translation-mediated Conversations
    [AUTHORS]José Pombal, Sweta Agrawal, Patrick Fernandes, Emmanouil Zaranis, André F. T. Martins
    [ABSTRACT]Effective communication is fundamental to any interaction, yet challenges arise when participants do not share a common language. Automatic translation systems offer a powerful solution to bridge language barriers in such scenarios, but they introduce errors that can lead to misunderstandings and conversation breakdown. A key issue is that current systems fail to incorporate the rich contextual information necessary to resolve ambiguities and omitted details, resulting in literal, inappropriate, or misaligned translations. In this work, we present a framework to improve large language model-based translation systems by incorporating contextual information in bilingual conversational settings. During training, we leverage context-augmented parallel data, which allows the model to generate translations sensitive to conversational history. During inference, we perform quality-aware decoding with context-aware metrics to select the optimal translation from a pool of candidates. We validate both components of our framework on two task-oriented domains: customer chat and user-assistant interaction. Across both settings, our framework consistently results in better translations than state-of-the-art systems like GPT-4o and TowerInstruct, as measured by multiple automatic translation quality metrics on several language pairs. We also show that the resulting model leverages context in an intended and interpretable way, improving consistency between the conveyed message and the generated translations.
    [LINK]http://arxiv.org/abs/2412.04205v1
    [DATE]2024-12-05 22:41:05+08:00
    [CATEGORIES]cs.CL
2024 Nov 30, Sat
  • Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection
    [AUTHORS]Ayoub Hammal, Benno Uthayasooriyar, Caio Corro
    [ABSTRACT]Named-entity recognition (NER) is a task that typically requires large annotated datasets, which limits its applicability across domains with varying entity definitions. This paper addresses few-shot NER, aiming to transfer knowledge to new domains with minimal supervision. Unlike previous approaches that rely solely on limited annotated data, we propose a weakly supervised algorithm that combines small labeled datasets with large amounts of unlabeled data. Our method extends the k-means algorithm with label supervision, cluster size constraints and domain-specific discriminative subspace selection. This unified framework achieves state-of-the-art results in few-shot NER on several English datasets.
    [COMMENTS]COLING 2025
    [LINK]http://arxiv.org/abs/2412.00426v1
    [DATE]2024-11-30 18:52:24+08:00
    [CATEGORIES]cs.CL
  • Noise-powered Multi-modal Knowledge Graph Representation Framework
    [AUTHORS]Zhuo Chen, Yin Fang, Yichi Zhang, Lingbing Guo, Jiaoyan Che, Jeff Z. Pan, Huajun Chen, Wen Zhang
    [ABSTRACT]The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph (MMKG) representation learning framework. Such a framework is essential for embedding structured knowledge into multi-modal Large Language Models effectively, alleviating issues like knowledge misconceptions and multi-modal hallucinations. In this work, we explore the efficacy of models in accurately embedding entities within MMKGs through two pivotal tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking to robustly integrate multi-modal entity features in KGs. By incorporating specific training objectives for both MKGC and MMEA, our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility. Moreover, SNAG can not only function as a standalone model but also enhance other existing methods, providing stable performance improvements. Code and data are available at https://github.com/zjukg/SNAG.
    [COMMENTS]COLING 2025 Accepted, Repo is available at https://github.com/zjukg/SNAG
    [LINK]http://arxiv.org/abs/2403.06832v3
    [DATE]2024-11-30 12:53:04+08:00
    [CATEGORIES]cs.CL
  • Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection
    [AUTHORS]Shanu Kumar, Saish Mendke, Karody Lubna Abdul Rahman, Santosh Kurasa, Parag Agrawal, Sandipan Dandapat
    [ABSTRACT]Chain-of-thought (CoT) prompting has significantly enhanced the capability of large language models (LLMs) by structuring their reasoning processes. However, existing methods face critical limitations: handcrafted demonstrations require extensive human expertise, while trigger phrases are prone to inaccuracies. In this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method, a novel approach that improves CoT prompting by utilizing uncertainty estimates to select effective demonstrations without needing access to model parameters. Unlike traditional methods, ZEUS offers high sensitivity in distinguishing between helpful and ineffective questions, ensuring more precise and reliable selection. Our extensive evaluation shows that ZEUS consistently outperforms existing CoT strategies across four challenging reasoning benchmarks, demonstrating its robustness and scalability.
    [COMMENTS]Accepted in COLING 2025
    [LINK]http://arxiv.org/abs/2412.00353v1
    [DATE]2024-11-30 12:22:00+08:00
    [CATEGORIES]cs.CL
  • SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration
    [AUTHORS]Xin Guan, Nathaniel Demchak, Saloni Gupta, Ze Wang, Ediz Ertekin Jr., Adriano Koshiyama, Emre Kazim, Zekun Wu
    [ABSTRACT]The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and bias concentration, such as Max Z-scores. Noticing that assessment tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration for mitigation. For demonstration, we use SAGED on G20 Countries with popular 8b-level models including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries like Russia and (except for Qwen2) China. With further experiments to have models role-playing U.S. (vice-/former-) presidents, we see bias amplifies and shifts in heterogeneous directions. Moreover, we see Qwen2 and Mistral not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
    [COMMENTS]COLING 2025 Main Conference
    [LINK]http://arxiv.org/abs/2409.11149v4
    [DATE]2024-11-30 10:21:25+08:00
    [CATEGORIES]cs.CL
  • GADFA: Generator-Assisted Decision-Focused Approach for Opinion Expressing Timing Identification
    [AUTHORS]Chung-Chi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao, Hsin-Hsi Chen
    [ABSTRACT]The advancement of text generation models has granted us the capability to produce coherent and convincing text on demand. Yet, in real-life circumstances, individuals do not continuously generate text or voice their opinions. For instance, consumers pen product reviews after weighing the merits and demerits of a product, and professional analysts issue reports following significant news releases. In essence, opinion expression is typically prompted by particular reasons or signals. Despite long-standing developments in opinion mining, the appropriate timing for expressing an opinion remains largely unexplored. To address this deficit, our study introduces an innovative task - the identification of news-triggered opinion expressing timing. We ground this task in the actions of professional stock analysts and develop a novel dataset for investigation. Our approach is decision-focused, leveraging text generation models to steer the classification model, thus enhancing overall performance. Our experimental findings demonstrate that the text generated by our model contributes fresh insights from various angles, effectively aiding in identifying the optimal timing for opinion expression.
    [COMMENTS]Accepted: COLING-2025
    [LINK]http://arxiv.org/abs/2410.01169v2
    [DATE]2024-11-30 09:04:31+08:00
    [CATEGORIES]cs.CL
2024 Dec 06, Fri
  • Context-Informed Machine Translation of Manga using Multimodal Large Language Models
    [AUTHORS]Philip Lippmann, Konrad Skublicki, Joshua Tanner, Shonosuke Ishiwatari, Jie Yang
    [ABSTRACT]Due to the significant time and effort required for handcrafting translations, most manga never leave the domestic Japanese market. Automatic manga translation is a promising potential solution. However, it is a budding and underdeveloped field and presents complexities even greater than those found in standard translation due to the need to effectively incorporate visual elements into the translation process to resolve ambiguities. In this work, we investigate to what extent multimodal large language models (LLMs) can provide effective manga translation, thereby assisting manga authors and publishers in reaching wider audiences. Specifically, we propose a methodology that leverages the vision component of multimodal LLMs to improve translation quality, evaluate the impact of translation unit size and context length, and propose a token-efficient approach for manga translation. Moreover, we introduce a new evaluation dataset -- the first parallel Japanese-Polish manga translation dataset -- as part of a benchmark to be used in future research. Finally, we contribute an open-source software suite, enabling others to benchmark LLMs for manga translation. Our findings demonstrate that our proposed methods achieve state-of-the-art results for Japanese-English translation and set a new standard for Japanese-Polish.
    [COMMENTS]COLING 2025
    [LINK]http://arxiv.org/abs/2411.02589v2
    [DATE]2024-12-06 01:41:48+08:00
    [CATEGORIES]cs.CL
  • Unveiling Entity-Level Unlearning for Large Language Models: A Comprehensive Analysis
    [AUTHORS]Weitao Ma, Xiaocheng Feng, Weihong Zhong, Lei Huang, Yangfan Ye, Xiachong Feng, Bing Qin
    [ABSTRACT]Large language model unlearning has garnered increasing attention due to its potential to address security and privacy concerns, leading to extensive research in the field. However, much of this research has concentrated on instance-level unlearning, specifically targeting the removal of predefined instances containing sensitive content. This focus has left a significant gap in the exploration of full entity-level unlearning, which is critical in real-world scenarios such as copyright protection. To this end, we propose a novel task of Entity-level unlearning, which aims to erase entity-related knowledge from the target model completely. To thoroughly investigate this task, we systematically evaluate trending unlearning algorithms, revealing that current methods struggle to achieve effective entity-level unlearning. Then, we further explore the factors that influence the performance of the unlearning algorithms, identifying that knowledge coverage and the size of the forget set play pivotal roles. Notably, our analysis also uncovers that entities introduced through fine-tuning are more vulnerable to unlearning than pre-trained entities. These findings collectively offer valuable insights for advancing entity-level unlearning for LLMs.
    [COMMENTS]Accepted by COLING 2025
    [LINK]http://arxiv.org/abs/2406.15796v5
    [DATE]2024-12-06 00:13:09+08:00
    [CATEGORIES]cs.CL
2024 Dec 05, Thu
  • Representation Purification for End-to-End Speech Translation
    [AUTHORS]Chengwei Zhang, Yue Zhou, Rui Zhao, Yidong Chen, Xiaodong Shi
    [ABSTRACT]Speech-to-text translation (ST) is a cross-modal task that involves converting spoken language into text in a different language. Previous research primarily focused on enhancing speech translation by facilitating knowledge transfer from machine translation, exploring various methods to bridge the gap between speech and text modalities. Despite substantial progress made, factors in speech that are not relevant to translation content, such as timbre and rhythm, often limit the efficiency of knowledge transfer. In this paper, we conceptualize speech representation as a combination of content-agnostic and content-relevant factors. We examine the impact of content-agnostic factors on translation performance through preliminary experiments and observe a significant performance deterioration when content-agnostic perturbations are introduced to speech signals. To address this issue, we propose a Speech Representation Purification with Supervision Enhancement (SRPSE) framework, which excludes the content-agnostic components within speech representations to mitigate their negative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate that SRPSE significantly improves translation performance across all translation directions in three settings and achieves preeminent performance under a transcript-free setting.
    [COMMENTS]Accepted by COLING 2025
    [LINK]http://arxiv.org/abs/2412.04266v1
    [DATE]2024-12-05 23:50:44+08:00
    [CATEGORIES]cs.CL
  • ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
    [AUTHORS]Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, Xuanjing Huang
    [ABSTRACT]Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined, diverging from genuine needs. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.
    [COMMENTS]Accepted by COLING 2025 conference
    [LINK]http://arxiv.org/abs/2401.00741v3
    [DATE]2024-12-05 15:05:59+08:00
    [CATEGORIES]cs.CL
  • LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings
    [AUTHORS]Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé
    [ABSTRACT]Sentence embedding models play a key role in various Natural Language Processing tasks, such as in Topic Modeling, Document Clustering and Recommendation Systems. However, these models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish. This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages. To address this issue, we compile a relatively small but high-quality human-generated cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence embedding model for Luxembourgish with strong cross-lingual capabilities. Additionally, we present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages than relying solely on high-resource language pairs. Furthermore, recognizing the lack of sentence embedding benchmarks for low-resource languages, we create a paraphrase detection benchmark specifically for Luxembourgish, aiming to partially fill this gap and promote further research.
    [COMMENTS]Accepted at COLING 2025
    [LINK]http://arxiv.org/abs/2412.03331v2
    [DATE]2024-12-05 15:05:57+08:00
    [CATEGORIES]cs.CL
  • PreAct: Prediction Enhances Agent's Planning Ability
    [AUTHORS]Dayuan Fu, Jianzhao Huang, Siyuan Lu, Guanting Dong, Yejie Wang, Keqing He, Weiran Xu
    [ABSTRACT]Addressing the disparity between forecasts and actual results can enable individuals to expand their thought processes and stimulate self-reflection, thus promoting accurate planning. In this research, we present **PreAct**, an agent framework that integrates **pre**diction, **rea**soning, and **act**ion. By utilizing the information derived from predictions, the large language model (LLM) agent can provide a wider range and more strategically focused reasoning. This leads to more efficient actions that aid the agent in accomplishing intricate tasks. Our experimental results show that PreAct surpasses the ReAct method in completing complex tasks and that PreAct's performance can be further improved when paired with other memory or selection strategy techniques. We presented the model with varying quantities of historical predictions and discovered that these predictions consistently enhance LLM planning. The variances in single-step reasoning between PreAct and ReAct indicate that PreAct indeed has benefits in terms of diversity and strategic orientation over ReAct.
    [COMMENTS]Coling 2025
    [LINK]http://arxiv.org/abs/2402.11534v2
    [DATE]2024-12-05 12:40:54+08:00
    [CATEGORIES]cs.CL
  • Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings
    [AUTHORS]Guy Barel, Oren Tsur, Dan Volenchik
    [ABSTRACT]Stance detection plays a pivotal role in enabling an extensive range of downstream applications, from discourse parsing to tracing the spread of fake news and the denial of scientific facts. While most stance classification models rely on textual representation of the utterance in question, prior work has demonstrated the importance of the conversational context in stance detection. In this work we introduce TASTE -- a multimodal architecture for stance detection that harmoniously fuses Transformer-based content embedding with unsupervised structural embedding. Through the fine-tuning of a pretrained transformer and the amalgamation with social embedding via a Gated Residual Network (GRN) layer, our model adeptly captures the complex interplay between content and conversational structure in determining stance. TASTE achieves state-of-the-art results on common benchmarks, significantly outperforming an array of strong baselines. Comparative evaluations underscore the benefits of social grounding -- emphasizing the criticality of concurrently harnessing both content and structure for enhanced stance detection.
    [COMMENTS]The modified camera ready version will be published in January 2025 at COLING
    [LINK]http://arxiv.org/abs/2412.03681v1
    [DATE]2024-12-05 03:23:37+08:00
    [CATEGORIES]cs.CL
  • Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation
    [AUTHORS]Weihua Wang, Qiuyu Liang, Feilong Bao, Guanglai Gao
    [ABSTRACT]A quaternion contains one real part and three imaginary parts, which provides a more expressive hypercomplex space for learning knowledge graphs. Existing quaternion embedding models measure the plausibility of a triplet either through semantic matching or geometric distance scoring functions. However, it appears that semantic matching diminishes the separability of entities, while the distance scoring function weakens the semantics of entities. To address this issue, we propose a novel quaternion knowledge graph embedding model. Our model combines semantic matching with the geometric distance between entities to better measure the plausibility of triplets. Specifically, in the quaternion space, we perform a right rotation on the head entity and a reverse rotation on the tail entity to learn rich semantic features. Then, we utilize distance-adaptive translations to learn the geometric distance between entities. Furthermore, we provide mathematical proofs to demonstrate that our model can handle complex logical relationships. Extensive experimental results and analyses show that our model significantly outperforms previous models on well-known knowledge graph completion benchmark datasets. Our code is available at https://github.com/llqy123/DaBR.
    [COMMENTS]Accepted by COLING 2025
    [LINK]http://arxiv.org/abs/2412.04076v1
    [DATE]2024-12-05 19:17:03+08:00
    [CATEGORIES]cs.LG
2024 Dec 04, Wed
  • DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
    [AUTHORS]Jiaxuan Liu, Zhaoci Liu, Yajun Hu, Yingying Gao, Shilei Zhang, Zhenhua Ling
    [ABSTRACT]Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance, which hierarchically models speech prosodic features, and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and achieves superior synthesis speed compared to three diffusion-based baselines. Additionally, by adjusting the guiding scale, DiffStyleTTS effectively controls the guidance intensity of the synthetic prosody.
    [COMMENTS]COLING 2025
    [LINK]http://arxiv.org/abs/2412.03388v1
    [DATE]2024-12-04 23:17:25+08:00
    [CATEGORIES]cs.CL
  • Sibyl: Empowering Empathetic Dialogue Generation in Large Language Models via Sensible and Visionary Commonsense Inference
    [AUTHORS]Lanrui Wang, Jiangnan Li, Chenxu Yang, Zheng Lin, Hongyin Tang, Huan Liu, Yanan Cao, Jingang Wang, Weiping Wang
    [ABSTRACT]Recently, there has been a heightened interest in building chatbots based on Large Language Models (LLMs) to emulate human-like qualities in multi-turn conversations. Despite having access to commonsense knowledge to better understand the psychological aspects and causality of dialogue context, even these powerful LLMs struggle to achieve the goals of empathy and emotional support. Current commonsense knowledge derived from dialogue contexts is inherently limited and often fails to adequately anticipate the future course of a dialogue. This lack of foresight can mislead LLMs and hinder their ability to provide effective support. In response to this challenge, we present an innovative framework named Sensible and Visionary Commonsense Knowledge (Sibyl). Designed to concentrate on the immediately succeeding dialogue, this paradigm equips LLMs with the capability to uncover the implicit requirements of the conversation, aiming to elicit more empathetic responses. Experimental results demonstrate that incorporating our paradigm for acquiring commonsense knowledge into LLMs comprehensively enhances the quality of their responses.
    [COMMENTS]Accepted by COLING 2025
    [LINK]http://arxiv.org/abs/2311.15316v4
    [DATE]2024-12-04 12:08:49+08:00
    [CATEGORIES]cs.CL
  • A Combinatorial Approach to Neural Emergent Communication
    [AUTHORS]Zheyuan Zhang
    [ABSTRACT]Substantial research on deep learning-based emergent communication uses the referential game framework, specifically the Lewis signaling game. However, we argue that successful communication in this game typically needs only one or two symbols for target image classification because of a sampling pitfall in the training data. To address this issue, we provide a theoretical analysis and introduce a combinatorial algorithm, SolveMinSym (SMS), to solve the symbolic complexity for classification, which is the minimum number of symbols in the message for successful communication. We use the SMS algorithm to create datasets with different symbolic complexity to empirically show that data with higher symbolic complexity increases the number of effective symbols in the emergent language.
    [COMMENTS]Accepted to COLING 2025
    [LINK]http://arxiv.org/abs/2410.18806v2
    [DATE]2024-12-04 05:48:17+08:00
    [CATEGORIES]cs.LG cs.CL
2024 Dec 03, Tue
  • Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining
    [AUTHORS]Zongru Wu, Pengzhou Cheng, Lingyong Fang, Zhuosheng Zhang, Gongshen Liu
    [ABSTRACT]Backdoor attacks remain significant security threats to generative large language models (LLMs). Since generative LLMs output sequences of high-dimensional token logits instead of low-dimensional classification logits, most existing backdoor defense methods designed for discriminative models like BERT are ineffective for generative LLMs. Inspired by the observed differences in learning behavior between backdoor and clean mapping in the frequency space, we transform gradients of each training sample, directly influencing parameter updates, into the frequency space. Our findings reveal a distinct separation between the gradients of backdoor and clean samples in the frequency space. Based on this phenomenon, we propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients in the frequency space to effectively identify backdoor samples without requiring retraining LLMs. Experimental results show that GraCeFul outperforms baselines significantly. Notably, GraCeFul exhibits remarkable computational efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor samples, reducing the average success rate of various backdoor attacks to 0% with negligible drops in clean accuracy across multiple free-style question answering datasets. Additionally, GraCeFul generalizes to Llama-2 and Vicuna. The codes are publicly available at https://github.com/ZrW00/GraceFul.
    [COMMENTS]Accepted at COLING 2025
    [LINK]http://arxiv.org/abs/2412.02454v1
    [DATE]2024-12-03 21:43:36+08:00
    [CATEGORIES]cs.CL
  • Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning
    [AUTHORS]Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi
    [ABSTRACT]Online abusive content detection, particularly in low-resource settings and within the audio modality, remains underexplored. We investigate the potential of pre-trained audio representations for detecting abusive language in low-resource languages, in this case, in Indian languages using Few Shot Learning (FSL). Leveraging powerful representations from models such as Wav2Vec and Whisper, we explore cross-lingual abuse detection using the ADIMA dataset with FSL. Our approach integrates these representations within the Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in 10 languages. We experiment with various shot sizes (50-200) evaluating the impact of limited data on performance. Additionally, a feature visualization study was conducted to better understand model behaviour. This study highlights the generalization ability of pre-trained models in low-resource scenarios and offers valuable insights into detecting abusive language in multilingual contexts.
    [COMMENTS]Accepted as part of the proceedings of COLING 2025
    [LINK]http://arxiv.org/abs/2412.01408v2
    [DATE]2024-12-03 15:52:35+08:00
    [CATEGORIES]cs.CL
  • BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition
    [AUTHORS]Quanjiang Guo, Yihong Dong, Ling Tian, Zhao Kang, Yu Zhang, Sijie Wang
    [ABSTRACT]Despite the recent success of two-stage prototypical networks in few-shot named entity recognition (NER), challenges such as over/under-detected false spans in the span detection stage and unaligned entity prototypes in the type classification stage persist. Additionally, LLMs have not proven to be effective few-shot information extractors in general. In this paper, we propose an approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to address these issues. We introduce a boundary-aware contrastive learning strategy to enhance the LLM's ability to perceive entity boundaries for generalized entity spans. Additionally, we utilize LoRAHub to align information from the target domain to the source domain, thereby enhancing adaptive cross-domain classification capabilities. Extensive experiments across various benchmarks demonstrate that our framework outperforms prior methods, validating its effectiveness. In particular, the proposed strategies demonstrate effectiveness across a range of LLM architectures. The code and data are released on https://github.com/UESTC-GQJ/BANER.
    [COMMENTS]To appear at COLING 2025
    [LINK]http://arxiv.org/abs/2412.02228v1
    [DATE]2024-12-03 15:51:14+08:00
    [CATEGORIES]cs.CL cs.LG
  • NüshuRescue: Revitalization of the endangered Nüshu Language with AI
    [AUTHORS]Ivory Yang, Weicheng Ma, Soroush Vosoughi
    [ABSTRACT]The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nüshu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce NüshuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NüshuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nüshu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu and only 35 short examples from NCGold, NüshuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nüshu. NüshuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.
    [COMMENTS]Accepted to COLING 2025
    [LINK]http://arxiv.org/abs/2412.00218v2
    [DATE]2024-12-03 12:38:31+08:00
    [CATEGORIES]cs.CL cs.LG
  • Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index
    [AUTHORS]Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami
    [ABSTRACT]As prompt engineering research rapidly evolves, evaluations beyond accuracy are crucial for developing cost-effective techniques. We present the Economical Prompting Index (EPI), a novel metric that combines accuracy scores with token consumption, adjusted by a user-specified cost concern level to reflect different resource constraints. Our study examines 6 advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts, across 10 widely-used language models and 4 diverse datasets. We demonstrate that approaches such as Self-Consistency often provide statistically insignificant gains while becoming cost-prohibitive. For example, on high-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques like Chain-of-Thought (0.72) surpasses more complex methods like Self-Consistency (0.64) at slight cost concern levels. Our findings suggest a reevaluation of complex prompting strategies in resource-constrained scenarios, potentially reshaping future research priorities and improving cost-effectiveness for end-users.
    [COMMENTS]5 pages (excluding references), accepted to Coling 2025
    [LINK]http://arxiv.org/abs/2412.01690v1
    [DATE]2024-12-03 00:34:18+08:00
    [CATEGORIES]cs.CL
2024 Dec 02, Mon
  • NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers
    [AUTHORS]Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami
    [ABSTRACT]Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.
    [COMMENTS]5 pages (excluding references), accepted to Coling 2025
    [LINK]http://arxiv.org/abs/2412.01621v1
    [DATE]2024-12-02 23:41:47+08:00
    [CATEGORIES]cs.CL
  • Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
    [AUTHORS]Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt
    [ABSTRACT]Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
    [COMMENTS]Accepted at COLING 2025; long paper
    [LINK]http://arxiv.org/abs/2409.07123v2
    [DATE]2024-12-02 21:04:18+08:00
    [CATEGORIES]cs.CL cs.LG
  • GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering
    [AUTHORS]Qianlong Li, Chen Huang, Shuai Li, Yuanxin Xiang, Deng Xiong, Wenqiang Lei
    [ABSTRACT]Complex Table Question Answering involves providing accurate answers to specific questions based on intricate tables that exhibit complex layouts and flexible header locations. Despite considerable progress having been made in the LLM era, the reasoning processes of existing methods are often implicit, feeding the entire table into prompts, making it difficult to effectively filter out irrelevant information in the table. To this end, we propose GraphOTTER that explicitly establishes the reasoning process to pinpoint the correct answers. In particular, GraphOTTER leverages a graph-based representation, transforming the complex table into an undirected graph. It then conducts step-by-step reasoning on the graph, with each step guided by a set of pre-defined intermediate reasoning actions. As such, it constructs a clear reasoning path and effectively identifies the answer to a given question. Comprehensive experiments on two benchmark datasets and two LLM backbones demonstrate the effectiveness of GraphOTTER. Further analysis indicates that its success may be attributed to the ability to efficiently filter out irrelevant information, thereby focusing the reasoning process on the most pertinent data. Our code and experimental datasets are available at https://github.com/JDing0521/GraphOTTER.
    [COMMENTS]COLING 2025, code is available at https://github.com/JDing0521/GraphOTTER
    [LINK]http://arxiv.org/abs/2412.01230v1
    [DATE]2024-12-02 15:49:23+08:00
    [CATEGORIES]cs.CL
  • LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks
    [AUTHORS]Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, Samy Jelassi
    [ABSTRACT]Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning of Large Language Models (LLMs). We study how different LoRA modules can be merged to achieve skill composition -- testing the performance of the merged model on a target task that involves combining multiple skills, each skill coming from a single LoRA. This setup is favorable when it is difficult to obtain training data for the target task and when it can be decomposed into multiple skills. First, we identify practically occurring use-cases that can be studied under the realm of skill composition, e.g. solving hard math-word problems with code, creating a bot to answer questions on proprietary manuals or about domain-specialized corpora. Our main contribution is to show that concatenation of LoRAs (CAT), which optimally weights LoRAs that were individually trained on different skills, outperforms existing model- and data- merging techniques; for instance on math-word problems, CAT beats these methods by an average of 43% and 12% respectively. Thus, this paper advocates model merging as an efficient way to solve compositional tasks and underscores CAT as a simple, compute-friendly and effective procedure. To our knowledge, this is the first work demonstrating the superiority of model merging over data mixing for binary skill composition tasks. Code and data are available at https://github.com/aksh555/LoRA-Soups
    [COMMENTS]COLING 2025 Industry track; 9 pages plus references and appendices
    [LINK]http://arxiv.org/abs/2410.13025v2
    [DATE]2024-12-02 14:40:50+08:00
    [CATEGORIES]cs.CL cs.LG
  • QABISAR: Query-Article Bipartite Interactions for Statutory Article Retrieval
    [AUTHORS]T. Y. S. S. Santosh, Hassan Sarwat, Matthias Grabmair
    [ABSTRACT]In this paper, we introduce QABISAR, a novel framework for statutory article retrieval, to overcome the semantic mismatch problem that arises when modeling each query-article pair in isolation, which makes it hard to learn representations that can effectively capture multi-faceted information. QABISAR leverages bipartite interactions between queries and articles to capture diverse aspects inherent in them. Further, we employ knowledge distillation to transfer enriched query representations from the graph network into the query bi-encoder, to capture the rich semantics present in the graph representations, despite the absence of graph-based supervision for unseen queries during inference. Our experiments on a real-world expert-annotated dataset demonstrate its effectiveness.
    [COMMENTS]Accepted to COLING 2025
    [LINK]http://arxiv.org/abs/2412.00934v1
    [DATE]2024-12-02 02:58:17+08:00
    [CATEGORIES]cs.CL
  • Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting
    [AUTHORS]Thilini Wijesiriwardene, Ruwan Wickramarachchi, Sreeram Vennam, Vinija Jain, Aman Chadha, Amitava Das, Ponnurangam Kumaraguru, Amit Sheth
    [ABSTRACT]Making analogies is fundamental to cognition. Proportional analogies, which consist of four terms, are often used to assess linguistic and cognitive abilities. For instance, completing analogies like "Oxygen is to Gas as ___ is to ___" requires identifying the semantic relationship (e.g., "type of") between the first pair of terms ("Oxygen" and "Gas") and finding a second pair that shares the same relationship (e.g., "Aluminum" and "Metal"). In this work, we introduce a 15K Multiple-Choice Question Answering (MCQA) dataset for proportional analogy completion and evaluate the performance of contemporary Large Language Models (LLMs) in various knowledge-enhanced prompt settings. Specifically, we augment prompts with three types of knowledge: exemplar, structured, and targeted. Our results show that despite extensive training data, solving proportional analogies remains challenging for current LLMs, with the best model achieving an accuracy of 55%. Notably, we find that providing targeted knowledge can better assist models in completing proportional analogies compared to providing exemplars or collections of structured knowledge.
    [COMMENTS]Accepted at COLING 2025
    [LINK]http://arxiv.org/abs/2412.00869v1
    [DATE]2024-12-02 00:15:14+08:00
    [CATEGORIES]cs.CL
    2024 Dec 01, Sun
    • ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media
      [AUTHORS]Kung-Hsiang Huang, Hou Pong Chan, Kathleen McKeown, Heng Ji
      [ABSTRACT]Considerable advancements have been made to tackle the misrepresentation of information derived from reference articles in the domains of fact-checking and faithful summarization. However, an unaddressed aspect remains - the identification of social media posts that manipulate information within associated news articles. This task presents a significant challenge, primarily due to the prevalence of personal opinions in such posts. We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information. To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles. Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance. Additionally, we have developed a simple yet effective basic model that outperforms LLMs significantly on the ManiTweet dataset. Finally, we have conducted an exploratory analysis of human-written tweets, unveiling intriguing connections between manipulation and the domain and factuality of news articles, as well as revealing that manipulated sentences are more likely to encapsulate the main story or consequences of a news outlet.
      [COMMENTS]COLING 2025
      [LINK]http://arxiv.org/abs/2305.14225v3
      [DATE]2024-12-01 21:55:56+08:00
      [CATEGORIES]cs.CL
    • Multi-View Incongruity Learning for Multimodal Sarcasm Detection
      [AUTHORS]Diandian Guo, Cong Cao, Fangfang Yuan, Yanbing Liu, Guangjie Zeng, Xiaoyan Yu, Hao Peng, Philip S. Yu
      [ABSTRACT]Multimodal sarcasm detection (MSD) is essential for various downstream tasks. Existing MSD methods tend to rely on spurious correlations. These methods often mistakenly prioritize non-essential features yet still make correct predictions, demonstrating poor generalizability beyond training environments. Regarding this phenomenon, this paper undertakes several initiatives. Firstly, we identify two primary causes that lead to the reliance on spurious correlations. Secondly, we address these challenges by proposing a novel method that integrates Multimodal Incongruities via Contrastive Learning (MICL) for multimodal sarcasm detection. Specifically, we first leverage incongruity to drive multi-view learning from three views: token-patch, entity-object, and sentiment. Then, we introduce extensive data augmentation to mitigate the biased learning of the textual modality. Additionally, we construct a test set, SPMSD, which consists of potential spurious correlations, to evaluate the model's generalizability. Experimental results demonstrate the superiority of MICL on benchmark datasets, along with the analyses showcasing MICL's advancement in mitigating the effect of spurious correlation.
      [COMMENTS]Accepted to COLING 2025
      [LINK]http://arxiv.org/abs/2412.00756v1
      [DATE]2024-12-01 18:29:36+08:00
      [CATEGORIES]cs.CL
    • Retrieval Augmented Instruction Tuning for Open NER with Large Language Models
      [AUTHORS]Tingyu Xie, Jian Zhang, Yan Zhang, Yuanyuan Liang, Qi Li, Hongwei Wang
      [ABSTRACT]The strong capability of large language models (LLMs) has been applied to information extraction (IE) through either retrieval augmented prompting or instruction tuning (IT). However, the best way to incorporate information with LLMs for IE remains an open question. In this paper, we explore Retrieval Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named entity recognition (NER). Specifically, for each training sample, we retrieve semantically similar examples from the training dataset as the context and prepend them to the input of the original instruction. To evaluate our RA-IT approach more thoroughly, we construct a Chinese IT dataset for open NER and evaluate RA-IT in both English and Chinese scenarios. Experimental results verify the effectiveness of RA-IT across various data sizes and in both English and Chinese scenarios. We also conduct thorough studies to explore the impacts of various retrieval strategies in the proposed RA-IT framework. Code and data are available at: https://github.com/Emma1066/Retrieval-Augmented-IT-OpenNER
      [COMMENTS]To appear at COLING 2025
      [LINK]http://arxiv.org/abs/2406.17305v2
      [DATE]2024-12-01 17:02:35+08:00
      [CATEGORIES]cs.CL
    • A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning
      [AUTHORS]Xinzhe Li
      [ABSTRACT]Tool use, planning, and feedback learning are currently three prominent paradigms for developing Large Language Model (LLM)-based agents across various tasks. Although numerous frameworks have been devised for each paradigm, their intricate workflows and inconsistent taxonomy create challenges in understanding and reviewing the frameworks across different paradigms. This survey introduces a unified taxonomy to systematically review and discuss these frameworks. Specifically, 1) the taxonomy defines environments/tasks, common LLM-profiled roles or LMPRs (policy models, evaluators, and dynamic models), and universally applicable workflows found in prior work, and 2) it enables a comparison of key perspectives on the implementations of LMPRs and workflow designs across different agent paradigms and frameworks. 3) Finally, we identify three limitations in existing workflow designs and systematically discuss the future work. Resources have been made publicly available in our GitHub repository https://github.com/xinzhel/LLM-Agent-Survey.
      [COMMENTS]CoLing 2025 Camera Ready (extended to 9 pages)
      [LINK]http://arxiv.org/abs/2406.05804v6
      [DATE]2024-12-01 06:38:57+08:00
      [CATEGORIES]cs.CL
    • Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective
      [AUTHORS]Yue Zhou, Barbara Di Eugenio, Lu Cheng
      [ABSTRACT]This paper studies the performance of large language models (LLMs), particularly regarding demographic fairness, in solving real-world healthcare tasks. We evaluate state-of-the-art LLMs with three prevalent learning frameworks across six diverse healthcare tasks and find significant challenges in applying LLMs to real-world healthcare tasks and persistent fairness issues across demographic groups. We also find that explicitly providing demographic information yields mixed results, while LLM's ability to infer such details raises concerns about biased health predictions. Utilizing LLMs as autonomous agents with access to up-to-date guidelines does not guarantee performance improvement. We believe these findings reveal the critical limitations of LLMs in healthcare fairness and the urgent need for specialized research in this area.
      [COMMENTS]Accepted to the main conference of COLING 2025
      [LINK]http://arxiv.org/abs/2412.00554v1
      [DATE]2024-12-01 02:52:30+08:00
      [CATEGORIES]cs.CL
    • SeQwen at the Financial Misinformation Detection Challenge Task: Sequential Learning for Claim Verification and Explanation Generation in Financial Domains
      [AUTHORS]Jebish Purbey, Siddhant Gupta, Nikhil Manali, Siddartha Pullakhandam, Drishti Sharma, Ashay Srivastava, Ram Mohan Rao Kadiyala
      [ABSTRACT]This paper presents the system description of our entry for the COLING 2025 FMD challenge, focusing on misinformation detection in financial domains. We experimented with a combination of large language models, including Qwen, Mistral, and Gemma-2, and leveraged pre-processing and sequential learning not only for identifying fraudulent financial content but also for generating coherent and concise explanations that clarify the rationale behind the classifications. Our approach achieved competitive results with an F1-score of 0.8283 for classification, and ROUGE-1 of 0.7253 for explanations. This work highlights the transformative potential of LLMs in financial applications, offering insights into their capabilities for combating misinformation and enhancing transparency while identifying areas for future improvement in robustness and domain adaptation.
      [COMMENTS]6 pages, 9 figures, Submitted to FinNLP-FNP-LLMFinLegal @ COLING 2025
      [LINK]http://arxiv.org/abs/2412.00549v1
      [DATE]2024-12-01 02:03:04+08:00
      [CATEGORIES]cs.CL cs.LG
    • Evaluating the Consistency of LLM Evaluators
      [AUTHORS]Noah Lee, Jiwoo Hong, James Thorne
      [ABSTRACT]Large language models (LLMs) have shown potential as general evaluators along with the evident benefits of speed and cost. While their correlation against human annotators has been widely studied, consistency as evaluators is still understudied, raising concerns about the reliability of LLM evaluators. In this paper, we conduct extensive studies on the two aspects of consistency in LLM evaluations, Self-Consistency (SC) and Inter-scale Consistency (IC), on different scoring scales and criterion granularity with open-source and proprietary models. Our comprehensive analysis demonstrates that strong proprietary models are not necessarily consistent evaluators, highlighting the importance of considering consistency in assessing the capability of LLM evaluators.
      [COMMENTS]Accepted to COLING 2025
      [LINK]http://arxiv.org/abs/2412.00543v1
      [DATE]2024-12-01 01:29:08+08:00
      [CATEGORIES]cs.CL
    2024 Nov 30, Sat
    • Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection
      [AUTHORS]Ayoub Hammal, Benno Uthayasooriyar, Caio Corro
      [ABSTRACT]Named-entity recognition (NER) is a task that typically requires large annotated datasets, which limits its applicability across domains with varying entity definitions. This paper addresses few-shot NER, aiming to transfer knowledge to new domains with minimal supervision. Unlike previous approaches that rely solely on limited annotated data, we propose a weakly supervised algorithm that combines small labeled datasets with large amounts of unlabeled data. Our method extends the k-means algorithm with label supervision, cluster size constraints and domain-specific discriminative subspace selection. This unified framework achieves state-of-the-art results in few-shot NER on several English datasets.
      [COMMENTS]COLING 2025
      [LINK]http://arxiv.org/abs/2412.00426v1
      [DATE]2024-11-30 18:52:24+08:00
      [CATEGORIES]cs.CL
    2024 Nov 30, Sat
    • Scaling Laws for Precision
      [AUTHORS]Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan
      [ABSTRACT]Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.
      [LINK]http://arxiv.org/abs/2411.04330v2
      [DATE]2024-11-30 10:42:31+08:00
      [CATEGORIES]cs.LG cs.CL
    2024 Dec 02, Mon
    • Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers
      [AUTHORS]Andy Yang, David Chiang
      [ABSTRACT]Deriving formal bounds on the expressivity of transformers, as well as studying transformers that are constructed to implement known algorithms, are both effective methods for better understanding the computational power of transformers. Towards both ends, we introduce the temporal counting logic $\textsf{K}_\text{t}$[#] alongside the RASP variant $\textsf{C-RASP}$. We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers with unbounded input size. We prove this by showing all $\textsf{K}_\text{t}$[#] formulas can be compiled into these transformers.
      [LINK]http://arxiv.org/abs/2404.04393v2
      [DATE]2024-12-02 04:48:11+08:00
      [CATEGORIES]cs.CL cs.LG
    2024 Dec 06, Fri
    • The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation
      [AUTHORS]Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre
      [ABSTRACT]This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well-documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples -- a process we refer to as hyperfitting -- the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from that of Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
      [COMMENTS]Under review at ICLR
      [LINK]http://arxiv.org/abs/2412.04318v1
      [DATE]2024-12-06 00:34:20+08:00
      [CATEGORIES]cs.CL
    2024 Dec 05, Thu
    • A Context-aware Framework for Translation-mediated Conversations
      [AUTHORS]José Pombal, Sweta Agrawal, Patrick Fernandes, Emmanouil Zaranis, André F. T. Martins
      [ABSTRACT]Effective communication is fundamental to any interaction, yet challenges arise when participants do not share a common language. Automatic translation systems offer a powerful solution to bridge language barriers in such scenarios, but they introduce errors that can lead to misunderstandings and conversation breakdown. A key issue is that current systems fail to incorporate the rich contextual information necessary to resolve ambiguities and omitted details, resulting in literal, inappropriate, or misaligned translations. In this work, we present a framework to improve large language model-based translation systems by incorporating contextual information in bilingual conversational settings. During training, we leverage context-augmented parallel data, which allows the model to generate translations sensitive to conversational history. During inference, we perform quality-aware decoding with context-aware metrics to select the optimal translation from a pool of candidates. We validate both components of our framework on two task-oriented domains: customer chat and user-assistant interaction. Across both settings, our framework consistently results in better translations than state-of-the-art systems like GPT-4o and TowerInstruct, as measured by multiple automatic translation quality metrics on several language pairs. We also show that the resulting model leverages context in an intended and interpretable way, improving consistency between the conveyed message and the generated translations.
      [LINK]http://arxiv.org/abs/2412.04205v1
      [DATE]2024-12-05 22:41:05+08:00
      [CATEGORIES]cs.CL
    • A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts
      [AUTHORS]Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng
      [ABSTRACT]Training and serving long-context large language models (LLMs) incurs substantial overhead. To address this, two critical steps are often required: a pretrained LLM typically undergoes a separate stage for context length extension by training on long-context data, followed by architectural modifications to reduce the overhead of KV cache during serving. This paper argues that integrating length extension with a GPU-friendly KV cache reduction architecture not only reduces training overhead during length extension, but also achieves better long-context performance. This leads to our proposed LongGen, which finetunes a pretrained LLM into an efficient architecture during length extension. LongGen builds on three key insights: (1) Sparse attention patterns, such as window attention (attending to recent tokens), attention sink (initial ones), and blockwise sparse attention (strided token blocks) are well-suited for building efficient long-context models, primarily due to their GPU-friendly memory access patterns, enabling efficiency gains not just theoretically but in practice as well. (2) It is essential for the model to have direct access to all tokens. A hybrid architecture with 1/3 full attention layers and 2/3 efficient ones achieves a balanced trade-off between efficiency and long-context performance. (3) Lightweight training on 5B long-context data is sufficient to extend the hybrid model's context length from 4K to 128K. We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its effectiveness across different scales. During training with 128K-long contexts, LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%, compared to a full-attention baseline. During inference, LongGen reduces KV cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding speedup.
      [LINK]http://arxiv.org/abs/2410.01485v2
      [DATE]2024-12-05 14:52:42+08:00
      [CATEGORIES]cs.CL
    • Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models
      [AUTHORS]Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, Lingpeng Kong
      [ABSTRACT]Recently, diffusion models have garnered significant interest in the field of text processing due to their many potential advantages compared to conventional autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a novel approach that integrates diffusion models with Chain-of-Thought, a well-established technique for improving the reasoning ability of autoregressive language models. In contrast to autoregressive language models that make decisions in a left-to-right, token-by-token manner, DoT allows reasoning steps to diffuse over time through a diffusion language model and offers greater flexibility in trading-off computation for reasoning performance. Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, boolean logic, and grade school math problems, with a small diffusion model outperforming a much larger autoregressive model in both efficiency and accuracy. In addition to that, DoT showcases promising self-correction abilities and benefits from existing reasoning-enhancing techniques like self-consistency decoding. Our findings contribute to the understanding and development of reasoning with diffusion language models.
      [COMMENTS]NeurIPS 2024
      [LINK]http://arxiv.org/abs/2402.07754v3
      [DATE]2024-12-05 14:49:06+08:00
      [CATEGORIES]cs.CL cs.LG
    • Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
      [AUTHORS]Wang Xiyao, Yang Zhengyuan, Li Linjie, Lu Hongjin, Xu Yuancheng, Lin Chung-Ching, Lin Kevin, Huang Furong, Wang Lijuan
      [ABSTRACT]Despite significant advancements in vision-language models (VLMs), effective approaches to enhance response quality by scaling inference-time computation are lacking. This capability is known to be a core step towards the self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
      [LINK]http://arxiv.org/abs/2412.03704v1
      [DATE]2024-12-05 04:35:07+08:00
      [CATEGORIES]cs.CL cs.LG
    • Learning Semantic Association Rules from Internet of Things Data
      [AUTHORS]Erkan Karabulut, Paul Groth, Victoria Degeler
      [ABSTRACT]Association Rule Mining (ARM) is the task of discovering commonalities in data in the form of logical implications. ARM is used in the Internet of Things (IoT) for different tasks including monitoring and decision-making. However, existing methods give limited consideration to IoT-specific requirements such as heterogeneity and volume. Furthermore, they do not utilize important static domain-specific description data about IoT systems, which is increasingly represented as knowledge graphs. In this paper, we propose a novel ARM pipeline for IoT data that utilizes both dynamic sensor data and static IoT system metadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method (Aerial) as part of the pipeline to address the high volume of IoT data and reduce the total number of rules that are resource-intensive to process. Aerial learns a neural representation of a given data and extracts association rules from this representation by exploiting the reconstruction (decoding) mechanism of an autoencoder. Extensive evaluations on 3 IoT datasets from 2 domains show that ARM on both static and dynamic IoT data results in more generically applicable rules while Aerial can learn a more concise set of high-quality association rules than the state-of-the-art with full coverage over the datasets.
      [LINK]http://arxiv.org/abs/2412.03417v2
      [DATE]2024-12-05 21:22:28+08:00
      [CATEGORIES]cs.LG
    • Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
      [AUTHORS]Jason Vega, Junsheng Huang, Gaokai Zhang, Hangoo Kang, Minjia Zhang, Gagandeep Singh
      [ABSTRACT]Safety alignment of Large Language Models (LLMs) has recently become a critical objective of model developers. In response, a growing body of work has been investigating how safety alignment can be bypassed through various jailbreaking methods, such as adversarial attacks. However, these jailbreak methods can be rather costly or involve a non-trivial amount of creativity and effort, introducing the assumption that malicious users are high-resource or sophisticated. In this paper, we study how simple random augmentations to the input prompt affect safety alignment effectiveness in state-of-the-art LLMs, such as Llama 3 and Qwen 2. We perform an in-depth evaluation of 17 different models and investigate the intersection of safety under random augmentations with multiple dimensions: augmentation type, model size, quantization, fine-tuning-based defenses, and decoding strategies (e.g., sampling temperature). We show that low-resource and unsophisticated attackers, i.e. $\textit{stochastic monkeys}$, can significantly improve their chances of bypassing alignment with just 25 random augmentations per prompt. Source code and data: https://github.com/uiuc-focal-lab/stochastic-monkeys/
      [COMMENTS]v2: Updated with changes from peer review rebuttal. v1: Version under peer review
      [LINK]http://arxiv.org/abs/2411.02785v2
      [DATE]2024-12-05 20:58:44+08:00
      [CATEGORIES]cs.LG
    2024 Dec 04, Wed
    • Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
      [AUTHORS]Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
      [ABSTRACT]Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during decoding, directly affecting the efficacy of MLLMs. Existing methods attempt to reduce the vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of the vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose a dynamic vision-language context sparsification framework, Dynamic-LLaVA, which dynamically reduces the redundancy of vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for different inference modes, i.e., prefill, decoding with and without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by $\sim$75% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50% under decoding without KV cache, while saving $\sim$50% GPU memory overhead when decoding with KV cache, due to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation or even performance gains compared to the full-context inference baselines. Code is available at https://github.com/Osilly/dynamic_llava .
      [COMMENTS]Code is available at https://github.com/Osilly/dynamic_llava
      [LINK]http://arxiv.org/abs/2412.00876v2
      [DATE]2024-12-04 00:12:09+08:00
      [CATEGORIES]cs.CL cs.LG
    • Channel Reflection: Knowledge-Driven Data Augmentation for EEG-Based Brain-Computer Interfaces
      [AUTHORS]Ziwei Wang, Siyang Li, Jingwei Luo, Jiajing Liu, Dongrui Wu
      [ABSTRACT]A brain-computer interface (BCI) enables direct communication between the human brain and external devices. Electroencephalography (EEG) based BCIs are currently the most popular for able-bodied users. To increase user-friendliness, usually a small amount of user-specific EEG data are used for calibration, which may not be enough to develop a pure data-driven decoding model. To cope with this typical calibration data shortage challenge in EEG-based BCIs, this paper proposes a parameter-free channel reflection (CR) data augmentation approach that incorporates prior knowledge on the channel distributions of different BCI paradigms in data augmentation. Experiments on eight public EEG datasets across four different BCI paradigms (motor imagery, steady-state visual evoked potential, P300, and seizure classifications) using different decoding algorithms demonstrated that: 1) CR is effective, i.e., it can noticeably improve the classification accuracy; 2) CR is robust, i.e., it consistently outperforms existing data augmentation approaches in the literature; and, 3) CR is flexible, i.e., it can be combined with other data augmentation approaches to further increase the performance. We suggest that data augmentation approaches like CR should be an essential step in EEG-based BCIs. Our code is available online.
      [LINK]http://arxiv.org/abs/2412.03224v1
      [DATE]2024-12-04 19:21:30+08:00
      [CATEGORIES]cs.LG
    • ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
      [AUTHORS]Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo
      [ABSTRACT]Large Language Models (LLMs) have been widely deployed in a variety of applications, and the context length is rapidly increasing to handle tasks such as long-document QA and complex logical reasoning. However, long context poses significant challenges for inference efficiency, including high memory costs of key-value (KV) cache and increased latency due to extensive memory accesses. Recent works have proposed compressing KV cache to approximate computation, but these methods either evict tokens permanently, never recalling them for later inference, or recall previous tokens at the granularity of pages divided by textual positions. Both approaches degrade the model accuracy and output quality. To achieve efficient and accurate recallable KV cache compression, we introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. We design and implement efficient algorithms and systems for clustering, selection, indexing and caching. Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths, using only a 1k to 2k KV cache budget, and achieves up to a 2$\times$ speedup in latency and a 2.5$\times$ improvement in decoding throughput. Compared to SoTA recallable KV compression methods, ClusterKV demonstrates higher model accuracy and output quality, while maintaining or exceeding inference efficiency.
      [LINK]http://arxiv.org/abs/2412.03213v1
      [DATE]2024-12-04 18:58:27+08:00
      [CATEGORIES]cs.LG
    • Pyramid Vector Quantization for LLMs
      [AUTHORS]Tycho F. A. van der Ouderaa, Maximilian L. Croci, Agrin Hilmkil, James Hensman
      [ABSTRACT]Recent works on compression of large language models (LLMs) using quantization considered reparameterizing the architecture such that weights are distributed on the sphere. This demonstrably improves the ability to quantize by increasing the mathematical notion of coherence, resulting in fewer weight outliers without affecting the network output. In this work, we aim to further exploit this spherical geometry of the weights when performing quantization by considering Pyramid Vector Quantization (PVQ) for large language models. Arranging points evenly on the sphere is notoriously difficult, especially in high dimensions, and even when approximate solutions exist, representing points explicitly in a codebook is typically not feasible due to its additional memory cost. Instead, PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory. To obtain a practical algorithm, we propose to combine PVQ with scale quantization for which we derive theoretically optimal quantizations, under empirically verified assumptions. Further, we extend pyramid vector quantization to use Hessian information to minimize quantization error under expected feature activations, instead of only relying on weight magnitudes. Experimentally, we achieve state-of-the-art quantization performance with a Pareto-optimal trade-off between performance and bits per weight and bits per activation, relative to the methods compared. On weight-only, we find that we can quantize a Llama-3 70B model to 3.25 bits per weight and retain 98\% accuracy on downstream tasks.
      [LINK]http://arxiv.org/abs/2410.16926v2
      [DATE]2024-12-04 18:52:04+08:00
      [CATEGORIES]cs.LG
    2024 Dec 03, Tue
    • Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation
      [AUTHORS]Zhi Qu, Yiran Wang, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
      [ABSTRACT]Existing multilingual neural machine translation (MNMT) approaches mainly focus on improving models with the encoder-decoder architecture to translate multiple languages. However, decoder-only architecture has been explored less in MNMT due to its underperformance when trained on parallel data solely. In this work, we attribute the issue of the decoder-only architecture to its lack of language transfer capability. Specifically, the decoder-only architecture is insufficient in encoding source tokens with the target language features. We propose dividing the decoding process into two stages so that target tokens are explicitly excluded in the first stage to implicitly boost the transfer capability across languages. Additionally, we impose contrastive learning on translation instructions, resulting in improved performance in zero-shot translation. We conduct experiments on TED-19 and OPUS-100 datasets, considering both training from scratch and fine-tuning scenarios. Experimental results show that, compared to the encoder-decoder architecture, our methods not only perform competitively in supervised translations but also achieve improvements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in zero-shot translations.
      [LINK]http://arxiv.org/abs/2412.02101v1
      [DATE]2024-12-03 10:52:14+08:00
      [CATEGORIES]cs.CL
    • Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models
      [AUTHORS]Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal
      [ABSTRACT]Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that will be added to the LLM output at decoding-time. This approach combines resource efficiency with interpretability and can be optimized for mitigating specific types of bias, depending on the target use case. Experiments on mitigating gender, race, and religion biases show a reduction in bias on several local and global bias metrics while preserving language model performance.
      [COMMENTS]38th Conference on Neural Information Processing Systems (NeurIPS 2024) Safe Generative AI Workshop
      [LINK]http://arxiv.org/abs/2412.01711v1
      [DATE]2024-12-03 00:56:08+08:00
      [CATEGORIES]cs.CL
    • A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
      [AUTHORS]Yan Scholten, Stephan Günnemann, Leo Schwinn
      [ABSTRACT]Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework in LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most if not all of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. Code available at https://github.com/yascho/probabilistic-unlearning.
      [LINK]http://arxiv.org/abs/2410.03523v4
      [DATE]2024-12-03 22:31:41+08:00
      [CATEGORIES]cs.LG
    • An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction
      [AUTHORS]Yaxin Liang, Xinshi Li, Xin Huang, Ziqi Zhang, Yue Yao
      [ABSTRACT]This study proposes an automated data mining framework based on autoencoders and experimentally verifies its effectiveness in feature extraction and data dimensionality reduction. Through the encoding-decoding structure, the autoencoder can capture the data's potential characteristics and achieve noise reduction and anomaly detection, providing an efficient and stable solution for the data mining process. The experiment compared the performance of the autoencoder with traditional dimensionality reduction methods (such as PCA, FA, T-SNE, and UMAP). The results showed that the autoencoder performed best in terms of reconstruction error and root mean square error and could better retain data structure and enhance the generalization ability of the model. The autoencoder-based framework not only reduces manual intervention but also significantly improves the automation of data processing. In the future, with the advancement of deep learning and big data technology, the autoencoder method combined with a generative adversarial network (GAN) or graph neural network (GNN) is expected to be more widely used in the fields of complex data processing, real-time data analysis and intelligent decision-making.
      [LINK]http://arxiv.org/abs/2412.02211v1
      [DATE]2024-12-03 15:04:10+08:00
      [CATEGORIES]cs.LG
    • Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
      [AUTHORS]Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, Chelsea Finn
      [ABSTRACT]Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. Yet, its reported effects on the learned policy are inconsistent: some studies find it crucial for achieving strong results, while others observe decreased performance. In this paper, we first dissect how action chunking impacts the divergence between a learner and a demonstrator. We find that action chunking allows the learner to better capture the temporal dependencies in demonstrations but at the cost of reduced reactivity in stochastic environments. To address this tradeoff, we propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop operations. BID samples multiple predictions at each time step and searches for the optimal one based on two criteria: (i) backward coherence, which favors samples that align with previous decisions; (ii) forward contrast, which seeks samples of high likelihood for future plans. By coupling decisions within and across action chunks, BID promotes consistency over time while maintaining reactivity to unexpected changes. Experimental results show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks. Code and videos are available at https://bid-robot.github.io.
      [COMMENTS]Project website: https://bid-robot.github.io/
      [LINK]http://arxiv.org/abs/2408.17355v3
      [DATE]2024-12-03 14:53:58+08:00
      [CATEGORIES]cs.LG
    • GNN-based Auto-Encoder for Short Linear Block Codes: A DRL Approach
      [AUTHORS]Kou Tian, Chentao Yue, Changyang She, Yonghui Li, Branka Vucetic
      [ABSTRACT]This paper presents a novel auto-encoder-based end-to-end channel encoding and decoding scheme. It integrates deep reinforcement learning (DRL) and graph neural networks (GNN) in code design by modeling the generation of code parity-check matrices as a Markov Decision Process (MDP), to optimize key coding performance metrics such as error-rates and code algebraic properties. An edge-weighted GNN (EW-GNN) decoder is proposed, which operates on the Tanner graph with an iterative message-passing structure. Once trained on a single linear block code, the EW-GNN decoder can be directly used to decode other linear block codes of different code lengths and code rates. An iterative joint training of the DRL-based code designer and the EW-GNN decoder is performed to optimize the end-to-end encoding and decoding process. Simulation results show the proposed auto-encoder significantly surpasses several traditional coding schemes at short block lengths, including low-density parity-check (LDPC) codes with the belief propagation (BP) decoding and the maximum-likelihood decoding (MLD), and BCH with BP decoding, offering superior error-correction capabilities while maintaining low decoding complexity.
      [COMMENTS]13 pages; submitted to IEEE Trans. arXiv admin note: text overlap with arXiv:2211.06962
      [LINK]http://arxiv.org/abs/2412.02053v1
      [DATE]2024-12-03 08:25:14+08:00
      [CATEGORIES]cs.LG
    2024 Dec 02, Mon
    • Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization
      [AUTHORS]Weiqiao Shan, Long Meng, Tong Zheng, Yingfeng Luo, Bei Li, Junxin Wang, Tong Xiao, Jingbo Zhu
      [ABSTRACT]Large language models (LLMs) exhibit exceptional performance across various downstream tasks. However, they encounter limitations due to slow inference speeds stemming from their extensive parameters. Early exit (EE) is an approach that aims to accelerate auto-regressive decoding. EE generates outputs from intermediate layers instead of using the whole model, which offers a promising solution to this challenge. However, additional output layers and joint optimization used in conventional EE hinder the application of EE in LLMs. In this paper, we explore the possibility of EE in LLMs without additional output layers and joint optimization. Our findings indicate that EE is a natural capability within transformer-based models. While joint optimization does not give the model EE capability, it must be employed to address challenges by improving the accuracy of locating the optimal EE layer through gating functions. Additionally, our study reveals patterns in EE behavior from a sub-word perspective based on the LLaMA model and the potential for EE based on sub-layers.
      [LINK]http://arxiv.org/abs/2412.01455v1
      [DATE]2024-12-02 20:46:34+08:00
      [CATEGORIES]cs.CL
    • PLD+: Accelerating LLM inference by leveraging Language Model Artifacts
      [AUTHORS]Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena
      [ABSTRACT]To reduce the latency associated with autoregressive LLM inference, speculative decoding has emerged as a novel decoding paradigm, where future tokens are drafted and verified in parallel. However, the practical deployment of speculative decoding is hindered by its requirements for additional computational resources and fine-tuning, which limits its out-of-the-box usability. To address these challenges, we present PLD+, a suite of novel algorithms developed to accelerate the inference process of LLMs, particularly for input-guided tasks. These tasks, which include code editing, text editing, summarization, etc., often feature outputs with substantial overlap with their inputs, an attribute PLD+ is designed to exploit. PLD+ also leverages the artifacts (attention and hidden states) generated during inference to accelerate inference speed. We test our approach on five input-guided tasks and through extensive experiments we find that PLD+ outperforms all tuning-free approaches. In the greedy setting, it even outperforms the state-of-the-art tuning-dependent approach EAGLE on four of the tasks (by a margin of up to 2.31 in terms of average speedup). Our approach is tuning-free, does not require any additional compute, and can easily be used for accelerating inference of any LLM.
      [LINK]http://arxiv.org/abs/2412.01447v1
      [DATE]2024-12-02 20:36:27+08:00
      [CATEGORIES]cs.CL
    • Federated Motor Imagery Classification for Privacy-Preserving Brain-Computer Interfaces
      [AUTHORS]Tianwang Jia, Lubin Meng, Siyang Li, Jiajing Liu, Dongrui Wu
      [ABSTRACT]Training an accurate classifier for EEG-based brain-computer interface (BCI) requires EEG data from a large number of users, whereas protecting their data privacy is a critical consideration. Federated learning (FL) is a promising solution to this challenge. This paper proposes Federated classification with local Batch-specific batch normalization and Sharpness-aware minimization (FedBS) for privacy protection in EEG-based motor imagery (MI) classification. FedBS utilizes local batch-specific batch normalization to reduce data discrepancies among different clients, and sharpness-aware minimization optimizer in local training to improve model generalization. Experiments on three public MI datasets using three popular deep learning models demonstrated that FedBS outperformed six state-of-the-art FL approaches. Remarkably, it also outperformed centralized training, which does not consider privacy protection at all. In summary, FedBS protects user EEG data privacy, enabling multiple BCI users to participate in large-scale machine learning model training, which in turn improves the BCI decoding accuracy.
      [LINK]http://arxiv.org/abs/2412.01079v1
      [DATE]2024-12-02 11:35:27+08:00
      [CATEGORIES]cs.LG
    • DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
      [AUTHORS]Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
      [ABSTRACT]Human motion, inherently continuous and dynamic, presents significant challenges for generative models. Despite their dominance, discrete quantization methods, such as VQ-VAEs, suffer from inherent limitations, including restricted expressiveness and frame-wise noise artifacts. Continuous approaches, while producing smoother and more natural motions, often falter due to high-dimensional complexity and limited training data. To resolve this "discord" between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that decodes discrete motion tokens into continuous motion through rectified flow. By employing an iterative refinement process in the continuous space, DisCoRD captures fine-grained dynamics and ensures smoother and more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results solidify DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Our project page is available at: https://whwjdqls.github.io/discord.github.io/.
      [COMMENTS]20 pages 18 figures
      [LINK]http://arxiv.org/abs/2411.19527v2
      [DATE]2024-12-02 11:34:45+08:00
      [CATEGORIES]cs.LG
    2024 Dec 01, Sun
    • PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis
      [AUTHORS]Hao Dong, Wei Wei
      [ABSTRACT]Recently, generative pre-training based models have demonstrated remarkable results on the Aspect-based Sentiment Analysis (ABSA) task. However, previous works overemphasize crafting various templates to paraphrase training targets for enhanced decoding, ignoring the internal optimizations on generative models. Despite notable results achieved by these target-oriented optimization methods, they struggle with complicated long texts since the implicit long-distance relation, e.g., the aspect-opinion relation, is difficult to extract under the position embedding mechanism in generative models. Thus, in this paper, we first clarify the causes of the problem and introduce two sequence optimization strategies: the rule-based static optimization and the score-based dynamic optimization. The rule-based approach relies on a handcrafted priority of dependency relations to reorder the context, while the score-based algorithm dynamically regulates the contextual sequence by calculating word position scores using a neural network. Based on the dynamic optimization structure, we further propose a unified Prompt-based Generative Sequence Optimization network (named PGSO), which jointly optimizes the training target as well as the generative model. Specifically, PGSO contains two components, namely, prompt construction and sequence regulator. The former constructs a task-specific prompt based on unsupervised training objects to fully utilize the pre-trained model. The latter jointly leverages semantic, syntactic and original-sequence information to dynamically regulate the contextual sequence. Our experiments conducted on four ABSA tasks across multiple benchmarks indicate that PGSO outperforms state-of-the-art methods, with an average improvement of 3.52% in F1 score.
      [LINK]http://arxiv.org/abs/2412.00763v1
      [DATE]2024-12-01 18:49:55+08:00
      [CATEGORIES]cs.CL
    • SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
      [AUTHORS]Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
      [ABSTRACT]As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens at the same time as just one token, making it a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It utilizes the high spikiness of the token probability distribution in modern LLMs and a high degree of alignment between model output probabilities. SpecExec takes the most probable token continuations from the draft model to build a "cache" tree for the target model, which then gets validated in a single pass. Using SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.
      [LINK]http://arxiv.org/abs/2406.02532v3
      [DATE]2024-12-01 05:33:59+08:00
      [CATEGORIES]cs.CL
    • Tree-Wasserstein Distance for High Dimensional Data with a Latent Feature Hierarchy
      [AUTHORS]Ya-Wei Eileen Lin, Ronald R. Coifman, Gal Mishne, Ronen Talmon
      [ABSTRACT]Finding meaningful distances between high-dimensional data samples is an important scientific task. To this end, we propose a new tree-Wasserstein distance (TWD) for high-dimensional data with two key aspects. First, our TWD is specifically designed for data with a latent feature hierarchy, i.e., the features lie in a hierarchical space, in contrast to the usual focus on embedding samples in hyperbolic space. Second, while the conventional use of TWD is to speed up the computation of the Wasserstein distance, we use its inherent tree as a means to learn the latent feature hierarchy. The key idea of our method is to embed the features into a multi-scale hyperbolic space using diffusion geometry and then present a new tree decoding method by establishing analogies between the hyperbolic embedding and trees. We show that our TWD computed based on data observations provably recovers the TWD defined with the latent feature hierarchy and that its computation is efficient and scalable. We showcase the usefulness of the proposed TWD in applications to word-document and single-cell RNA-sequencing datasets, demonstrating its advantages over existing TWDs and methods based on pre-trained models.
      [LINK]http://arxiv.org/abs/2410.21107v2
      [DATE]2024-12-01 10:36:26+08:00
      [CATEGORIES]cs.LG
    2024 Dec 05, Thu
    • Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction
      [AUTHORS]Zheye Deng, Chunkit Chan, Weiqi Wang, Yuxi Sun, Wei Fan, Tianshi Zheng, Yauwai Yim, Yangqiu Song
      [ABSTRACT]The task of condensing large chunks of textual information into concise and structured tables has gained attention recently due to the emergence of Large Language Models (LLMs) and their potential benefit for downstream tasks, such as text summarization and text mining. Previous approaches often generate tables that directly replicate information from the text, limiting their applicability in broader contexts, as text-to-table generation in real-life scenarios necessitates information extraction, reasoning, and integration. However, there is a lack of both datasets and methodologies towards this task. In this paper, we introduce LiveSum, a new benchmark dataset created for generating summary tables of competitions based on real-time commentary texts. We evaluate the performances of state-of-the-art LLMs on this task in both fine-tuning and zero-shot settings, and additionally propose a novel pipeline called $T^3$ (Text-Tuple-Table) to improve their performances. Extensive experimental results demonstrate that LLMs still struggle with this task even after fine-tuning, while our approach can offer substantial performance gains without explicit training. Further analyses demonstrate that our method exhibits strong generalization abilities, surpassing previous approaches on several other text-to-table datasets. Our code and data can be found at https://github.com/HKUST-KnowComp/LiveSum.
      [COMMENTS]Accepted to EMNLP 2024
      [LINK]http://arxiv.org/abs/2404.14215v2
      [DATE]2024-12-05 14:02:59+08:00
      [CATEGORIES]cs.CL
    • If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
      [AUTHORS]Reza Esfandiarpoor, Cristina Menghini, Stephen H. Bach
      [ABSTRACT]Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize textual features that are important for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate features that are important for the VLM. Then, we inspect the descriptions to identify features that contribute to VLM representations. Using EX2, we find that spurious descriptions have a major role in VLM representations despite providing no helpful information, e.g., Click to enlarge photo of CONCEPT. More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat (e.g., North America) to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
      [COMMENTS]EMNLP 2024
      [LINK]http://arxiv.org/abs/2403.16442v2
      [DATE]2024-12-05 06:37:07+08:00
      [CATEGORIES]cs.CL cs.LG
    • Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
      [AUTHORS]Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
      [ABSTRACT]Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pairwise ranking formats grouped with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
      [COMMENTS]EMNLP 2024 (Main Conference)
      [LINK]http://arxiv.org/abs/2405.01535v2
      [DATE]2024-12-05 03:23:17+08:00
      [CATEGORIES]cs.CL
    2024 Dec 04, Wed
    • AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark
      [AUTHORS]Abhay Gupta, Philip Meng, Ece Yurtseven, Sean O'Brien, Kevin Zhu
      [ABSTRACT]Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models. We have open-sourced our source code on GitHub and created a website to showcase our work at https://aavenue.live.
      [COMMENTS]Published at NLP4PI @ EMNLP 2024
      [LINK]http://arxiv.org/abs/2408.14845v2
      [DATE]2024-12-04 21:43:28+08:00
      [CATEGORIES]cs.CL
    • Knowledge Mechanisms in Large Language Models: A Survey and Perspective
      [AUTHORS]Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
      [ABSTRACT]Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.
      [COMMENTS]EMNLP 2024 Findings; 39 pages (v4)
      [LINK]http://arxiv.org/abs/2407.15017v4
      [DATE]2024-12-04 17:54:59+08:00
      [CATEGORIES]cs.CL cs.LG
    • Filtered Direct Preference Optimization
      [AUTHORS]Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu
      [ABSTRACT]Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on direct preference optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at https://github.com/CyberAgentAILab/filtered-dpo.
      [COMMENTS]EMNLP 2024
      [LINK]http://arxiv.org/abs/2404.13846v4
      [DATE]2024-12-04 01:22:01+08:00
      [CATEGORIES]cs.LG cs.CL
    2024 Dec 03, Tue
    • Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
      [AUTHORS]Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang
      [ABSTRACT]Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content. Although there are diverse jailbreak attack strategies, there is no unified understanding of why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share some similar properties: they are effective in moving the representation of the harmful prompt in the direction of the harmless prompts. We incorporate hidden representations into the objective of existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate the above hypothesis using the proposed objective. We hope this study provides new insights into how LLMs understand harmfulness information.
      [COMMENTS]Accepted by EMNLP 2024 Main
      [LINK]http://arxiv.org/abs/2406.10794v3
      [DATE]2024-12-03 05:48:47+08:00
      [CATEGORIES]cs.CL
    • Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization
      [AUTHORS]Lei Xu, Mohammed Asad Karim, Saket Dingliwal, Aparna Elangovan
      [ABSTRACT]Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems. We release our code at https://github.com/amazon-science/SigExt.
      [COMMENTS]Accepted to EMNLP 2024 Industry Track. Code available at https://github.com/amazon-science/SigExt
      [LINK]http://arxiv.org/abs/2410.02741v2
      [DATE]2024-12-03 05:06:29+08:00
      [CATEGORIES]cs.CL cs.LG
    • Large Language Models for Data Annotation and Synthesis: A Survey
      [AUTHORS]Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu
      [ABSTRACT]Data annotation and synthesis generally refers to the labeling or generating of raw data with relevant information, which could be used for improving the efficacy of machine learning models. The process, however, is labor-intensive and costly. The emergence of advanced Large Language Models (LLMs), exemplified by GPT-4, presents an unprecedented opportunity to automate the complicated process of data annotation and synthesis. While existing surveys have extensively covered LLM architecture, training, and general applications, we uniquely focus on their specific utility for data annotation. This survey contributes to three core aspects: LLM-Based Annotation Generation, LLM-Generated Annotations Assessment, and LLM-Generated Annotations Utilization. Furthermore, this survey includes an in-depth taxonomy of data types that LLMs can annotate, a comprehensive review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation and synthesis. Serving as a key guide, this survey aims to assist researchers and practitioners in exploring the potential of the latest LLMs for data annotation, thereby fostering future advancements in this critical field.
      [COMMENTS]Accepted to EMNLP 2024 Main
      [LINK]http://arxiv.org/abs/2402.13446v3
      [DATE]2024-12-03 04:55:15+08:00
      [CATEGORIES]cs.CL
    • AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning
      [AUTHORS]Yifan Yang, Kai Zhen, Ershad Banijamal, Athanasios Mouchtaris, Zheng Zhang
      [ABSTRACT]Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
      [COMMENTS]Accepted for publication in EMNLP 2024
      [LINK]http://arxiv.org/abs/2406.18060v3
      [DATE]2024-12-03 03:03:47+08:00
      [CATEGORIES]cs.CL cs.LG
    2024 Dec 02, Mon
    • GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Model
      [AUTHORS]Xuanchang Zhang, Zhuosheng Zhang, Hai Zhao
      [ABSTRACT]Despite the rapid progress of large language models (LLMs), their task performance remains sensitive to prompt design. Recent studies have explored leveraging the LLM itself as an optimizer to identify optimal prompts that maximize task accuracy. However, when evaluating prompts, such approaches heavily rely on elusive manually annotated gold labels to calculate task accuracy for each candidate prompt, which hinders the widespread implementation and generality. To overcome the limitation, this work proposes a gold label-agnostic prompt evaluation (GLaPE) to alleviate dependence on gold labels. Motivated by the observed correlation between self-consistency and the accuracy of the answer, we adopt self-consistency as the initial evaluation score. Subsequently, we refine the scores of prompts producing identical answers to be mutually consistent. Experimental results show that GLaPE provides reliable evaluations uniform with accuracy, even in the absence of gold labels. Moreover, on six popular reasoning tasks, our GLaPE-based prompt optimization yields effective prompts comparable to accuracy-based ones. The code is publicly available at https://github.com/thunderous77/GLaPE.
      [COMMENTS]EMNLP 2024
      [LINK]http://arxiv.org/abs/2402.02408v2
      [DATE]2024-12-02 15:47:00+08:00
      [CATEGORIES]cs.CL cs.LG
    2024 Nov 30, Sat
    • Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
      [AUTHORS]Duo Zheng, Shijia Huang, Liwei Wang
      [ABSTRACT]The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately. Additionally, we have implemented a maximum coverage sampling technique to optimize the balance between computational costs and performance efficiency. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
      [COMMENTS]14 pages, 4 figures
      [LINK]http://arxiv.org/abs/2412.00493v1
      [DATE]2024-11-30 22:28:53+08:00
      [CATEGORIES]cs.CL
    • LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction
      [AUTHORS]Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, Adji Bousso Dieng
      [ABSTRACT]Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and 45 distinct properties. LLM4Mat-Bench features different input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models with different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The results highlight the challenges of general-purpose LLMs in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs in materials property prediction.
      [COMMENTS]Accepted at NeurIPS 2024-AI4Mat Workshop. The Benchmark and code can be found at https://github.com/vertaix/LLM4Mat-Bench
      [LINK]http://arxiv.org/abs/2411.00177v3
      [DATE]2024-11-30 22:01:56+08:00
      [CATEGORIES]cs.CL
    • Unlocking Structured Thinking in Language Models with Cognitive Prompting
      [AUTHORS]Oliver Kramer, Jill Baumann
      [ABSTRACT]We propose cognitive prompting as a novel approach to guide problem-solving in large language models (LLMs) through structured, human-like cognitive operations, such as goal clarification, decomposition, filtering, abstraction, and pattern recognition. By employing systematic, step-by-step reasoning, cognitive prompting enables LLMs to tackle complex, multi-step tasks more efficiently. We introduce three variants: a deterministic sequence of cognitive operations, a self-adaptive variant in which the LLM dynamically selects the sequence of cognitive operations, and a hybrid variant that uses generated correct solutions as few-shot chain-of-thought prompts. Experiments with LLaMA, Gemma 2, and Qwen models, each in two sizes, on the arithmetic reasoning benchmark GSM8K demonstrate that cognitive prompting significantly improves performance compared to standard question answering.
      [COMMENTS]6 pages, submitted to ESANN 2025
      [LINK]http://arxiv.org/abs/2410.02953v3
      [DATE]2024-11-30 20:16:51+08:00
      [CATEGORIES]cs.CL
    • ORAssistant: A Custom RAG-based Conversational Assistant for OpenROAD
      [AUTHORS]Aviral Kaintura, Palaniappan R, Shui Song Luar, Indira Iyer Almeida
      [ABSTRACT]Open-source Electronic Design Automation (EDA) tools are rapidly transforming chip design by addressing key barriers of commercial EDA tools such as complexity, costs, and access. Recent advancements in Large Language Models (LLMs) have further enhanced efficiency in chip design by providing user assistance across a range of tasks like setup, decision-making, and flow automation. This paper introduces ORAssistant, a conversational assistant for OpenROAD, based on Retrieval-Augmented Generation (RAG). ORAssistant aims to improve the user experience for the OpenROAD flow, from RTL to GDSII, by providing context-specific responses in prose format to common user queries, including installation, command usage, flow setup, and execution. Currently, ORAssistant integrates OpenROAD, OpenROAD-flow-scripts, Yosys, OpenSTA, and KLayout. The data model is built from publicly available documentation and GitHub resources. The proposed architecture is scalable, supporting extensions to other open-source tools, operating modes, and LLMs. We use Google Gemini as the base LLM to build and test ORAssistant. Early evaluation results of the RAG-based model show notable improvements in performance and accuracy compared to non-fine-tuned LLMs.
      [LINK]http://arxiv.org/abs/2410.03845v2
      [DATE]2024-11-30 19:19:39+08:00
      [CATEGORIES]cs.CL
    • Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino
      [AUTHORS]Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, Alham Fikri Aji, William Chandra Tjhi
      [ABSTRACT]Multilingual large language models (LLMs) today may not necessarily provide culturally appropriate and relevant responses to their Filipino users. We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. It is composed of 150 high-quality, handcrafted and nuanced prompts that test LLMs for generations that are relevant to shared Filipino cultural knowledge and values. Strong LLM performance in Kalahi indicates a model's ability to generate responses similar to what an average Filipino would say or do in a given situation. We conducted experiments on LLMs with multilingual and Filipino language support. Results show that Kalahi, while trivial for Filipinos, is challenging for LLMs, with the best model answering only 46.0% of the questions correctly compared to native Filipino performance of 89.10%. Thus, Kalahi can be used to accurately and reliably evaluate Filipino cultural representation in LLMs.
      [COMMENTS]Accepted for presentation at Paclic 38, 2024
      [LINK]http://arxiv.org/abs/2409.15380v2
      [DATE]2024-11-30 17:57:09+08:00
      [CATEGORIES]cs.CL
    • Uncovering Safety Risks of Large Language Models through Concept Activation Vector
      [AUTHORS]Zhihao Xu, Ruixuan Huang, Changyu Chen, Xiting Wang
      [ABSTRACT]Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks with automatically selected perturbation hyperparameters. Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. Additionally, we find that our generated attack prompts may be transferable to GPT-4, and the embedding-level attacks may also be transferred to other white-box LLMs whose parameters are known. Our experiments further uncover the safety risks present in current LLMs. For example, in our evaluation of seven open-source LLMs, we observe an average attack success rate of 99.14%, based on the classic keyword-matching criterion. Finally, we provide insights into the safety mechanism of LLMs. The code is available at https://github.com/SproutNan/AI-Safety_SCAV.
      [COMMENTS]10 pages, accepted at NeurIPS 2024
      [LINK]http://arxiv.org/abs/2404.12038v5
      [DATE]2024-11-30 16:52:29+08:00
      [CATEGORIES]cs.CL
    • Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization
      [AUTHORS]Yuhang Song, Mario Gianni, Chenguang Yang, Kunyang Lin, Te-Chuan Chiu, Anh Nguyen, Chun-Yi Lee
      [ABSTRACT]This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D environments based on natural language instructions. Current approaches use contrastive learning to align language with visual trajectory sequences. Nevertheless, they encounter difficulties with fine-grained vision negatives. To enhance cross-modal embeddings, we introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples. To validate the proposed methodology, we conduct a series of experiments to assess the effectiveness of the enriched embeddings on fine-grained vision negatives. We conduct experiments on two common VLN benchmarks, R2R and REVERIE; the results demonstrate that these embeddings benefit navigation and can lead to a promising performance enhancement. Our source code and trained models are available at: https://anonymous.4open.science/r/FGVLN.
      [LINK]http://arxiv.org/abs/2411.14811v2
      [DATE]2024-11-30 16:47:23+08:00
      [CATEGORIES]cs.CL cs.LG
    • Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection
      [AUTHORS]Shanu Kumar, Saish Mendke, Karody Lubna Abdul Rahman, Santosh Kurasa, Parag Agrawal, Sandipan Dandapat
      [ABSTRACT]Chain-of-thought (CoT) prompting has significantly enhanced the capability of large language models (LLMs) by structuring their reasoning processes. However, existing methods face critical limitations: handcrafted demonstrations require extensive human expertise, while trigger phrases are prone to inaccuracies. In this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method, a novel approach that improves CoT prompting by utilizing uncertainty estimates to select effective demonstrations without needing access to model parameters. Unlike traditional methods, ZEUS offers high sensitivity in distinguishing between helpful and ineffective questions, ensuring more precise and reliable selection. Our extensive evaluation shows that ZEUS consistently outperforms existing CoT strategies across four challenging reasoning benchmarks, demonstrating its robustness and scalability.
      [COMMENTS]Accepted in COLING 2025
      [LINK]http://arxiv.org/abs/2412.00353v1
      [DATE]2024-11-30 12:22:00+08:00
      [CATEGORIES]cs.CL
    • THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models
      [AUTHORS]Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven
      [ABSTRACT]Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.
      [COMMENTS]NeurIPS 2024 SoLaR (Socially Responsible Language Modelling Research) Workshop
      [LINK]http://arxiv.org/abs/2409.11353v3
      [DATE]2024-11-30 10:27:09+08:00
      [CATEGORIES]cs.CL
    • SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration
      [AUTHORS]Xin Guan, Nathaniel Demchak, Saloni Gupta, Ze Wang, Ediz Ertekin Jr., Adriano Koshiyama, Emre Kazim, Zekun Wu
      [ABSTRACT]The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and bias concentration, such as Max Z-scores. Noticing that assessment tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration for mitigation. For demonstration, we use SAGED on G20 Countries with popular 8b-level models including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries like Russia and (except for Qwen2) China. In further experiments in which models role-play U.S. (vice-/former-) presidents, we see that bias amplifies and shifts in heterogeneous directions. Moreover, we see that Qwen2 and Mistral do not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
      [COMMENTS]COLING 2025 Main Conference
      [LINK]http://arxiv.org/abs/2409.11149v4
      [DATE]2024-11-30 10:21:25+08:00
      [CATEGORIES]cs.CL
    2024 Dec 06, Fri
    • p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
      [AUTHORS]Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, Limin Wang
      [ABSTRACT]Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. The majority of computation stems from the overwhelming volume of vision tokens processed by the transformer decoder. In this paper, we propose to build efficient MLLMs by leveraging the Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. To validate the effectiveness of our approach, we conduct extensive experiments with two baseline models across 14 benchmarks. Our model, p-MoD, matches or even surpasses the performance of the baseline models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.
      [COMMENTS]Technical Report; Code released at https://github.com/MCG-NJU/p-MoD
      [LINK]http://arxiv.org/abs/2412.04449v1
      [DATE]2024-12-06 02:58:03+08:00
      [CATEGORIES]cs.CL
    • Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
      [AUTHORS]Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu
      [ABSTRACT]Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.
      [COMMENTS]Project released at: https://chenyi99.github.io/moto/
      [LINK]http://arxiv.org/abs/2412.04445v1
      [DATE]2024-12-06 02:57:04+08:00
      [CATEGORIES]cs.CL cs.LG
    • CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing
      [AUTHORS]Yen-Ju Lu, Jing Liu, Thomas Thebaud, Laureano Moro-Velazquez, Ariya Rastrow, Najim Dehak, Jesus Villalba
      [ABSTRACT]We introduce Condition-Aware Self-Supervised Learning Representation (CA-SSLR), a generalist conditioning model broadly applicable to various speech-processing tasks. Compared to standard fine-tuning methods that optimize for downstream models, CA-SSLR integrates language and speaker embeddings from earlier layers, making the SSL model aware of the current language and speaker context. This approach reduces the reliance on input audio features while preserving the integrity of the base SSLR. CA-SSLR improves the model's capabilities and demonstrates its generality on unseen tasks with minimal task-specific tuning. Our method employs linear modulation to dynamically adjust internal representations, enabling fine-grained adaptability without significantly altering the original model behavior. Experiments show that CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a 10% relative reduction in LID errors, a 37% improvement in ASR CER on the ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating its effectiveness.
      [COMMENTS]38th Conference on Neural Information Processing Systems (NeurIPS 2024)
      [LINK]http://arxiv.org/abs/2412.04425v1
      [DATE]2024-12-06 02:51:10+08:00
      [CATEGORIES]cs.CL cs.LG
    • SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
      [AUTHORS]Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman
      [ABSTRACT]Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.
      [LINK]http://arxiv.org/abs/2403.07384v2
      [DATE]2024-12-06 02:47:47+08:00
      [CATEGORIES]cs.CL cs.LG
    • CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels
      [AUTHORS]Lingxiao Wei, He Yan, Xiangju Lu, Junmin Zhu, Jun Wang, Wei Zhang
      [ABSTRACT]Large Language Models (LLMs) have been well-researched in many long-context tasks. However, due to high annotation costs, high-quality long-context summary datasets for training or evaluation are scarce, limiting further research. In this work, we introduce CNNSum, a new multi-scale Chinese long-context novel summarization benchmark comprising four subsets with lengths from 16k to 128k and 695 samples in total, with human-driven annotations. We evaluate commercial and open-source models on CNNSum and conduct a detailed analysis. Based on the observations, we further conduct fine-tuning exploration with short-context summary data. In our study: (1) GPT-4o underperformed, due to excessive subjective commentary. (2) Currently, long-context summarization mainly relies on memory ability; small LLMs with stable longer context lengths are the most cost-effective. Using long data concatenated from short-context summaries makes a significant improvement. (3) Prompt templates may cause a large performance gap, but this can be mitigated through fine-tuning. (4) Fine-tuned Chat or Instruction versions may harm the Base model, and further fine-tuning cannot bridge the performance gap. (5) While models with RoPE base scaling exhibit strong extrapolation potential, their performance may vary significantly when combined with other interpolation methods, which need careful selection. (6) CNNSum provides more reliable and insightful evaluation results than other benchmarks. We release CNNSum to advance research in this field.
      [LINK]http://arxiv.org/abs/2412.02819v2
      [DATE]2024-12-06 01:51:20+08:00
      [CATEGORIES]cs.CL
    • Context-Informed Machine Translation of Manga using Multimodal Large Language Models
      [AUTHORS]Philip Lippmann, Konrad Skublicki, Joshua Tanner, Shonosuke Ishiwatari, Jie Yang
      [ABSTRACT]Due to the significant time and effort required for handcrafting translations, most manga never leave the domestic Japanese market. Automatic manga translation is a promising potential solution. However, it is a budding and underdeveloped field and presents complexities even greater than those found in standard translation due to the need to effectively incorporate visual elements into the translation process to resolve ambiguities. In this work, we investigate to what extent multimodal large language models (LLMs) can provide effective manga translation, thereby assisting manga authors and publishers in reaching wider audiences. Specifically, we propose a methodology that leverages the vision component of multimodal LLMs to improve translation quality, evaluate the impact of translation unit size and context length, and propose a token-efficient approach for manga translation. Moreover, we introduce a new evaluation dataset -- the first parallel Japanese-Polish manga translation dataset -- as part of a benchmark to be used in future research. Finally, we contribute an open-source software suite, enabling others to benchmark LLMs for manga translation. Our findings demonstrate that our proposed methods achieve state-of-the-art results for Japanese-English translation and set a new standard for Japanese-Polish.
      [COMMENTS]COLING 2025
      [LINK]http://arxiv.org/abs/2411.02589v2
      [DATE]2024-12-06 01:41:48+08:00
      [CATEGORIES]cs.CL
    • BhashaVerse: Translation Ecosystem for Indian Subcontinent Languages
      [AUTHORS]Vandan Mujadia, Dipti Misra Sharma
      [ABSTRACT]This paper focuses on developing translation models and related applications for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj, Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada, Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili, Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi, Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu, Telugu, and Urdu. Achieving this requires parallel and other types of corpora for all 36 * 36 language pairs, addressing challenges like script variations, phonetic differences, and syntactic diversity. For instance, languages like Kashmiri and Sindhi, which use multiple scripts, demand script normalization for alignment, while low-resource languages such as Khasi and Santali require synthetic data augmentation to ensure sufficient coverage and quality. To address these challenges, this work proposes strategies for corpus creation by leveraging existing resources, developing parallel datasets, generating domain-specific corpora, and utilizing synthetic data techniques. Additionally, it evaluates machine translation across various dimensions, including standard and discourse-level translation, domain-specific translation, reference-based and reference-free evaluation, error analysis, and automatic post-editing. By integrating these elements, the study establishes a comprehensive framework to improve machine translation quality and enable better cross-lingual communication in India's linguistically diverse ecosystem.
      [LINK]http://arxiv.org/abs/2412.04351v1
      [DATE]2024-12-06 01:10:19+08:00
      [CATEGORIES]cs.CL
    • Retrieval-Augmented Machine Translation with Unstructured Knowledge
      [AUTHORS]Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou
      [ABSTRACT]Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance models' MT ability. However, a large amount of world knowledge is organized in unstructured documents, and might not be fully paired across different languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples collected via GPT-4o and human translators. Besides, documents from different languages are also provided to supply the knowledge to these samples. Based on RAGtrans, we further propose a multi-task training method to teach LLMs how to use information from multilingual documents during their translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.
      [LINK]http://arxiv.org/abs/2412.04342v1
      [DATE]2024-12-06 01:00:32+08:00
      [CATEGORIES]cs.CL
    • Densing Law of LLMs
      [AUTHORS]Chaojun Xiao, Jie Cai, Weilin Zhao, Guoyang Zeng, Xu Han, Zhiyuan Liu, Maosong Sun
      [ABSTRACT]Large Language Models (LLMs) have emerged as a milestone in artificial intelligence, and their performance can improve as the model size increases. However, this scaling brings great challenges to training and inference efficiency, particularly for deploying LLMs in resource-constrained environments, and the scaling trend is becoming increasingly unsustainable. This paper introduces the concept of "capacity density" as a new metric to evaluate the quality of the LLMs across different scales and describes the trend of LLMs in terms of both effectiveness and efficiency. To calculate the capacity density of a given target LLM, we first introduce a set of reference models and develop a scaling law to predict the downstream performance of these reference models based on their parameter sizes. We then define the effective parameter size of the target LLM as the parameter size required by a reference model to achieve equivalent performance, and formalize the capacity density as the ratio of the effective parameter size to the actual parameter size of the target LLM. Capacity density provides a unified framework for assessing both model effectiveness and efficiency. Our further analysis of recent open-source base LLMs reveals an empirical law (the densing law) that the capacity density of LLMs grows exponentially over time. More specifically, using some widely used benchmarks for evaluation, the capacity density of LLMs doubles approximately every three months. The law provides new perspectives to guide future LLM development, emphasizing the importance of improving capacity density to achieve optimal results with minimal computational overhead.
      [LINK]http://arxiv.org/abs/2412.04315v1
      [DATE]2024-12-06 00:31:13+08:00
      [CATEGORIES]cs.CL
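The two quantities defined in the abstract above (capacity density as a ratio, and its approximate doubling every three months) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the example parameter counts, and the exact 3.0-month doubling period are assumptions made for the example, and the scaling-law step that estimates the effective parameter size is not modeled.

```python
def capacity_density(effective_params: float, actual_params: float) -> float:
    """Ratio of effective parameter size to actual parameter size."""
    if actual_params <= 0:
        raise ValueError("actual_params must be positive")
    return effective_params / actual_params


def projected_density(initial_density: float, months: float,
                      doubling_months: float = 3.0) -> float:
    """Exponential growth with a fixed doubling period.

    The 3-month default mirrors the abstract's "approximately every three
    months"; treating it as exact is an assumption of this sketch.
    """
    return initial_density * 2.0 ** (months / doubling_months)


# Hypothetical numbers: a 2B-parameter model that performs like a
# 4B-parameter reference model has capacity density 2.0.
density = capacity_density(effective_params=4.0e9, actual_params=2.0e9)

# Under the assumed doubling period, density grows 16x over 12 months.
one_year = projected_density(1.0, months=12.0)
```

A density above 1 means the target model matches a larger reference model; the projection is only as reliable as the assumed doubling period.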
    • ALMA: Alignment with Minimal Annotation
      [AUTHORS]Michihiro Yasunaga, Leonid Shamis, Chunting Zhou, Andrew Cohen, Jason Weston, Luke Zettlemoyer, Marjan Ghazvininejad
      [ABSTRACT]Recent approaches to large language model (LLM) alignment typically require millions of human annotations or rely on external aligned models for synthetic data generation. This paper introduces ALMA: Alignment with Minimal Annotation, demonstrating that effective alignment can be achieved using only 9,000 labeled examples -- less than 1% of conventional approaches. ALMA generates large amounts of high-quality synthetic alignment data through new techniques: diverse prompt synthesis via few-shot learning, diverse response generation with multiple model checkpoints, and judge (reward model) enhancement through score aggregation and self-distillation. Using only a pretrained Llama3 base model, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves performance close to Llama3-Instruct across diverse alignment benchmarks (e.g., 0.1% difference on AlpacaEval 2.0 score). These results are achieved with a multi-round, self-bootstrapped data synthesis and training recipe that continues to improve for 10 rounds, surpassing the typical 3-round ceiling of previous methods. These results suggest that base models already possess sufficient knowledge for effective alignment, and that synthetic data generation methods can expose it.
      [LINK]http://arxiv.org/abs/2412.04305v1
      [DATE]2024-12-06 00:26:31+08:00
      [CATEGORIES]cs.CL cs.LG
    • Evolutionary Pre-Prompt Optimization for Mathematical Reasoning
      [AUTHORS]Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud
      [ABSTRACT]Recent advancements have highlighted that large language models (LLMs), when given a small set of task-specific examples, demonstrate remarkable proficiency, a capability that extends to complex reasoning tasks. In particular, the combination of few-shot learning with the chain-of-thought (CoT) approach has been pivotal in steering models towards more logically consistent conclusions. This paper explores the optimization of example selection for designing effective CoT pre-prompts and shows that the choice of the optimization algorithm, typically in favor of comparison-based methods such as evolutionary computation, significantly enhances efficacy and feasibility. Specifically, thanks to a limited exploitative and overfitted optimization, Evolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the naive few-shot approach exceeding 10 absolute points in exact match scores on benchmark datasets such as GSM8k and MathQA. These gains are consistent across various contexts and are further amplified when integrated with self-consistency (SC).
      [LINK]http://arxiv.org/abs/2412.04291v1
      [DATE]2024-12-06 00:12:06+08:00
      [CATEGORIES]cs.CL
    • A method to benchmark high-dimensional process drift detection
      [AUTHORS]Edgar Wolf, Tobias Windisch
      [ABSTRACT]Process curves are multivariate finite time series data coming from manufacturing processes. This paper studies machine learning methods that detect drifts in process curve datasets. A theoretical framework to synthetically generate process curves in a controlled way is introduced in order to benchmark machine learning algorithms for process drift detection. An evaluation score, called the temporal area under the curve, is introduced, which makes it possible to quantify how well machine learning models unveil curves belonging to drift segments. Finally, a benchmark study comparing popular machine learning approaches on synthetic data generated with the introduced framework is presented, showing that existing algorithms often struggle with datasets containing multiple drift segments.
      [LINK]http://arxiv.org/abs/2409.03669v2
      [DATE]2024-12-06 02:56:04+08:00
      [CATEGORIES]cs.LG
    • FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning
      [AUTHORS]Pranab Sahoo, Ashutosh Tripathi, Sriparna Saha, Samrat Mondal
      [ABSTRACT]Federated Learning (FL) marks a transformative approach to distributed model training by combining locally optimized models from various clients into a unified global model. While FL preserves data privacy by eliminating centralized storage, it encounters significant challenges such as performance degradation, slower convergence, and reduced robustness of the global model due to the heterogeneity in client data distributions. Among the various forms of data heterogeneity, label skew emerges as a particularly formidable and prevalent issue, especially in domains such as image classification. To address these challenges, we begin with comprehensive experiments to pinpoint the underlying issues in the FL training process. Based on our findings, we then introduce an innovative dual-strategy approach designed to effectively resolve these issues. First, we introduce an adaptive loss function for client-side training, meticulously crafted to preserve previously acquired knowledge while maintaining an optimal equilibrium between local optimization and global model coherence. Secondly, we develop a dynamic aggregation strategy for aggregating client models at the server. This approach adapts to each client's unique learning patterns, effectively addressing the challenges of diverse data across the network. Our comprehensive evaluation, conducted across three diverse real-world datasets, coupled with theoretical convergence guarantees, demonstrates the superior efficacy of our method compared to several established state-of-the-art approaches.
      [LINK]http://arxiv.org/abs/2412.04416v1
      [DATE]2024-12-06 02:42:29+08:00
      [CATEGORIES]cs.LG
    • Efficient Task Grouping Through Samplewise Optimisation Landscape Analysis
      [AUTHORS]Anshul Thakur, Yichen Huang, Soheila Molaei, Yujiang Wang, David A. Clifton
      [ABSTRACT]Shared training approaches, such as multi-task learning (MTL) and gradient-based meta-learning, are widely used in various machine learning applications, but they often suffer from negative transfer, leading to performance degradation in specific tasks. While several optimisation techniques have been developed to mitigate this issue for pre-selected task cohorts, identifying optimal task combinations for joint learning - known as task grouping - remains underexplored and computationally challenging due to the exponential growth in task combinations and the need for extensive training and evaluation cycles. This paper introduces an efficient task grouping framework designed to reduce these overwhelming computational demands of the existing methods. The proposed framework infers pairwise task similarities through a sample-wise optimisation landscape analysis, eliminating the need for the shared model training required to infer task similarities in existing methods. With task similarities acquired, a graph-based clustering algorithm is employed to pinpoint near-optimal task groups, providing an approximate yet efficient and effective solution to the originally NP-hard problem. Empirical assessments conducted on 8 different datasets highlight the effectiveness of the proposed framework, revealing a five-fold speed enhancement compared to previous state-of-the-art methods. Moreover, the framework consistently demonstrates comparable performance, confirming its remarkable efficiency and effectiveness in task grouping.
      [COMMENTS]Under review at IEEE Transactions on Pattern Analysis and Machine Intelligence
      [LINK]http://arxiv.org/abs/2412.04413v1
      [DATE]2024-12-06 02:33:59+08:00
      [CATEGORIES]cs.LG
    • Asynchronous Batch Bayesian Optimization with Pipelining Evaluations for Experimental Resource–constrained Conditions
      [AUTHORS]Yujin Taguchi, Yusuke Shibuya, Yusuke Hiki, Takashi Morikura, Takahiro G. Yamada, Akira Funahashi
      [ABSTRACT]Bayesian optimization is efficient even with a small amount of data and is used in engineering and in science, including biology and chemistry. In Bayesian optimization, a parameterized model with an uncertainty is fitted to explain the experimental data, and then the model suggests parameters that would most likely improve the results. Batch Bayesian optimization reduces the processing time of optimization by parallelizing experiments. However, batch Bayesian optimization cannot be applied if the number of parallelized experiments is limited by the cost or scarcity of equipment; in such cases, sequential methods require an unrealistic amount of time. In this study, we developed pipelining Bayesian optimization (PipeBO) to reduce the processing time of optimization even with a limited number of parallel experiments. PipeBO was inspired by the pipelining of central processing unit architecture, which divides computational tasks into multiple processes. PipeBO was designed to achieve experiment parallelization by overlapping various processes of the experiments. PipeBO uses the results of completed experiments to update the parameters of running parallelized experiments. Using the Black-Box Optimization Benchmarking, which consists of 24 benchmark functions, we compared PipeBO with the sequential Bayesian optimization methods. PipeBO reduced the average processing time of optimization to about 56% for the experiments that consisted of two processes or even less for those with more processes for 20 out of the 24 functions. Overall, PipeBO parallelizes Bayesian optimization in the resource-constrained settings so that efficient optimization can be achieved.
      [LINK]http://arxiv.org/abs/2412.04392v1
      [DATE]2024-12-06 02:06:09+08:00
      [CATEGORIES]cs.LG
    • EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
      [AUTHORS]Yuqi Wu, Wenzhao Zheng, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu
      [ABSTRACT]3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents, which must gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Experiments demonstrate that our EmbodiedOcc outperforms existing local prediction methods and accomplishes the embodied occupancy prediction with high accuracy and strong expandability. Our code is available at: https://github.com/YkiWu/EmbodiedOcc.
      [COMMENTS]Code: https://github.com/YkiWu/EmbodiedOcc
      [LINK]http://arxiv.org/abs/2412.04380v1
      [DATE]2024-12-06 01:57:09+08:00
      [CATEGORIES]cs.LG
    • Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting
      [AUTHORS]Edoardo Cetin, Ahmed Touati, Yann Ollivier
      [ABSTRACT]The forward-backward representation (FB) is a recently proposed framework (Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation models (BFMs) that aim at providing zero-shot efficient policies for any new task specified in a given reinforcement learning (RL) environment, without training for each new task. Here we address two core limitations of FB model training. First, FB, like all successor-feature-based methods, relies on a linear encoding of tasks: at test time, each new reward function is linearly projected onto a fixed set of pre-trained features. This limits expressivity as well as precision of the task representation. We break the linearity limitation by introducing auto-regressive features for FB, which let fine-grained task features depend on coarser-grained task information. This can represent arbitrary nonlinear task encodings, thus significantly increasing expressivity of the FB framework. Second, it is well-known that training RL agents from offline datasets often requires specific techniques. We show that FB works well together with such offline RL techniques, by adapting techniques from (Nair et al., 2020b; Cetin et al., 2024) for FB. This is necessary to get non-flatlining performance in some datasets, such as DMC Humanoid. As a result, we produce efficient FB BFMs for a number of new environments. Notably, in the D4RL locomotion benchmark, the generic FB agent matches the performance of standard single-task offline agents (IQL, XQL). In many setups, the offline techniques are needed to get any decent performance at all. The auto-regressive features have a positive but moderate impact, concentrated on tasks requiring spatial precision and task generalization beyond the behaviors represented in the trainset.
      [LINK]http://arxiv.org/abs/2412.04368v1
      [DATE]2024-12-06 01:36:22+08:00
      [CATEGORIES]cs.LG
    • Machine Theory of Mind for Autonomous Cyber-Defence
      [AUTHORS]Luke Swaby, Matthew Stewart, Daniel Harrold, Chris Willis, Gregory Palmer
      [ABSTRACT]Intelligent autonomous agents hold much potential for the domain of cyber-security. However, due to many state-of-the-art approaches relying on uninterpretable black-box models, there is growing demand for methods that offer stakeholders clear and actionable insights into their latent beliefs and motivations. To address this, we evaluate Theory of Mind (ToM) approaches for Autonomous Cyber Operations. Upon learning a robust prior, ToM models can predict an agent's goals, behaviours, and contextual beliefs given only a handful of past behaviour observations. In this paper, we introduce a novel Graph Neural Network (GNN)-based ToM architecture tailored for cyber-defence, Graph-In, Graph-Out (GIGO)-ToM, which can accurately predict both the targets and attack trajectories of adversarial cyber agents over arbitrary computer network topologies. To evaluate the latter, we propose a novel extension of the Wasserstein distance for measuring the similarity of graph-based probability distributions. Whereas the standard Wasserstein distance lacks a fixed reference scale, we introduce a graph-theoretic normalization factor that enables a standardized comparison between networks of different sizes. We furnish this metric, which we term the Network Transport Distance (NTD), with a weighting function that emphasizes predictions according to custom node features, allowing network operators to explore arbitrary strategic considerations. Benchmarked against a Graph-In, Dense-Out (GIDO)-ToM architecture in an abstract cyber-defence environment, our empirical evaluations show that GIGO-ToM can accurately predict the goals and behaviours of various unseen cyber-attacking agents across a range of network topologies, as well as learn embeddings that can effectively characterize their policies.
      [COMMENTS]29 pages, 17 figures, 12 tables
      [LINK]http://arxiv.org/abs/2412.04367v1
      [DATE]2024-12-06 01:35:29+08:00
      [CATEGORIES]cs.LG
    • Approximate Top-$k$ for Increased Parallelism
      [AUTHORS]Oscar Key, Luka Ribar, Alberto Cattaneo, Luke Hudlass-Galley, Douglas Orr
      [ABSTRACT]We present an evaluation of bucketed approximate top-$k$ algorithms. Computing top-$k$ exactly suffers from limited parallelism, because the $k$ largest values must be aggregated along the vector, thus is not well suited to computation on highly-parallel machine learning accelerators. By relaxing the requirement that the top-$k$ is exact, bucketed algorithms can dramatically increase the parallelism available by independently computing many smaller top-$k$ operations. We explore the design choices of this class of algorithms using both theoretical analysis and empirical evaluation on downstream tasks. Our motivating examples are sparsity algorithms for language models, which often use top-$k$ to select the most important parameters or activations. We also release a fast bucketed top-$k$ implementation for PyTorch.
      [LINK]http://arxiv.org/abs/2412.04358v1
      [DATE]2024-12-06 01:17:28+08:00
      [CATEGORIES]cs.LG
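The bucketing idea in the abstract above can be illustrated with a small pure-Python sketch. It is not the released PyTorch implementation: the function names, the contiguous-slice bucketing, and the requirement that both `k` and the input length divide evenly by `num_buckets` are assumptions chosen to keep the example short.

```python
import heapq
from typing import List, Sequence


def exact_top_k(values: Sequence[float], k: int) -> List[float]:
    """Return the k largest values in descending order."""
    return heapq.nlargest(k, values)


def bucketed_top_k(values: Sequence[float], k: int,
                   num_buckets: int) -> List[float]:
    """Approximate top-k: keep the k // num_buckets largest values per bucket.

    Buckets are contiguous, equal-sized slices (an illustrative choice).
    """
    if k % num_buckets != 0 or len(values) % num_buckets != 0:
        raise ValueError("k and len(values) must be divisible by num_buckets")
    per_bucket = k // num_buckets
    bucket_size = len(values) // num_buckets
    selected: List[float] = []
    for start in range(0, len(values), bucket_size):
        bucket = values[start:start + bucket_size]
        # Each bucket is processed independently, so on parallel hardware
        # these small top-k operations could run at the same time.
        selected.extend(heapq.nlargest(per_bucket, bucket))
    return sorted(selected, reverse=True)


values = [9.0, 8.0, 7.0, 2.0, 3.0, 1.0, 0.5, 4.0]
exact = exact_top_k(values, k=4)
approx = bucketed_top_k(values, k=4, num_buckets=2)

# Fraction of the true top-k that the bucketed version recovered.
recall = len(set(approx) & set(exact)) / len(exact)
```

Here the first bucket holds three of the four largest values but may contribute only two, so the approximation misses 7.0 and picks up 3.0 instead. With `num_buckets=1` the result is the exact top-k.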
    • ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation
      [AUTHORS]Dayoung Gong, Suha Kwak, Minsu Cho
      [ABSTRACT]Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion. The key idea to unification is to train the model to effectively handle both visible and invisible parts of the sequence in an integrated manner; the visible part is for temporal segmentation, and the invisible part is for future anticipation. To this end, we introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future. Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation. ActFusion achieves the state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models in both of the two tasks with a single unified model through joint learning.
      [COMMENTS]Accepted to NeurIPS 2024
      [LINK]http://arxiv.org/abs/2412.04353v1
      [DATE]2024-12-06 01:12:35+08:00
      [CATEGORIES]cs.LG
    • A Fisher-Rao gradient flow for entropy-regularised Markov decision processes in Polish spaces
      [AUTHORS]Bekzhan Kerimkulov, James-Michael Leahy, David Siska, Lukasz Szpruch, Yufei Zhang
      [ABSTRACT]We study the global convergence of a Fisher-Rao policy gradient flow for infinite-horizon entropy-regularised Markov decision processes with Polish state and action space. The flow is a continuous-time analogue of a policy mirror descent method. We establish the global well-posedness of the gradient flow and demonstrate its exponential convergence to the optimal policy. Moreover, we prove the flow is stable with respect to gradient evaluation, offering insights into the performance of a natural policy gradient flow with log-linear policy parameterisation. To overcome challenges stemming from the lack of the convexity of the objective function and the discontinuity arising from the entropy regulariser, we leverage the performance difference lemma and the duality relationship between the gradient and mirror descent flows. Our analysis provides a theoretical foundation for developing various discrete policy gradient algorithms.
      [COMMENTS]add discretizations of gradient flow and their convergence analysis
      [LINK]http://arxiv.org/abs/2310.02951v2
      [DATE]2024-12-06 00:35:46+08:00
      [CATEGORIES]cs.LG
    • Enhancing Novel Object Detection via Cooperative Foundational Models
      [AUTHORS]Rohit Bharadwaj, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan
      [ABSTRACT]In this work, we address the challenging and emergent problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This transformation is achieved by leveraging the complementary strengths of pre-trained foundational models, specifically CLIP and SAM, through our cooperative mechanism. Furthermore, by integrating this mechanism with state-of-the-art open-set detectors such as GDINO, we establish new benchmarks in object detection performance. Our method achieves 17.42 mAP in novel object detection and 42.08 mAP for known objects on the challenging LVIS dataset. Adapting our approach to the COCO OVD split, we surpass the current state-of-the-art by a margin of 7.2 $\text{AP}_{50}$ for novel classes. Our code is available at https://rohit901.github.io/coop-foundation-models/ .
      [COMMENTS]Accepted at WACV 2025
      [LINK]http://arxiv.org/abs/2311.12068v3
      [DATE]2024-12-06 00:34:21+08:00
      [CATEGORIES]cs.LG
    • The Tile: A 2D Map of Ranking Scores for Two-Class Classification
      [AUTHORS]Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck
      [ABSTRACT]In the computer vision and machine learning communities, as well as in many other research domains, rigorous evaluation of any new method, including classifiers, is essential. One key component of the evaluation process is the ability to compare and rank methods. However, ranking classifiers and accurately comparing their performances, especially when taking application-specific preferences into account, remains challenging. For instance, commonly used evaluation tools like Receiver Operating Characteristic (ROC) and Precision/Recall (PR) spaces display performances based on two scores. Hence, they are inherently limited in their ability to compare classifiers across a broader range of scores and lack the capability to establish a clear ranking among classifiers. In this paper, we present a novel versatile tool, named the Tile, that organizes an infinity of ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as the accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores. Furthermore, we study the properties of the underlying ranking scores, such as the influence of the priors or the correspondences with the ROC space, and depict how to characterize any other score by comparing them to the Tile. Overall, we demonstrate that the Tile is a powerful tool that effectively captures all the rankings in a single visualization and allows interpreting them.
      [LINK]http://arxiv.org/abs/2412.04309v1
      [DATE]2024-12-06 00:27:59+08:00
      [CATEGORIES]cs.LG
    • Structure-Aware Stylized Image Synthesis for Robust Medical Image Segmentation
      [AUTHORS]Jie Bao, Zhixin Zhou, Wen Jung Li, Rui Luo
      [ABSTRACT]Accurate medical image segmentation is essential for effective diagnosis and treatment planning but is often challenged by domain shifts caused by variations in imaging devices, acquisition conditions, and patient-specific attributes. Traditional domain generalization methods typically require inclusion of parts of the test domain within the training set, which is not always feasible in clinical settings with limited diverse data. Additionally, although diffusion models have demonstrated strong capabilities in image generation and style transfer, they often fail to preserve the critical structural information necessary for precise medical analysis. To address these issues, we propose a novel medical image segmentation method that combines diffusion models and Structure-Preserving Network for structure-aware one-shot image stylization. Our approach effectively mitigates domain shifts by transforming images from various sources into a consistent style while maintaining the location, size, and shape of lesions. This ensures robust and accurate segmentation even when the target domain is absent from the training data. Experimental evaluations on colonoscopy polyp segmentation and skin lesion segmentation datasets show that our method enhances the robustness and accuracy of segmentation models, achieving superior performance metrics compared to baseline models without style transfer. This structure-aware stylization framework offers a practical solution for improving medical image segmentation across diverse domains, facilitating more reliable clinical diagnoses.
      [LINK]http://arxiv.org/abs/2412.04296v1
      [DATE]2024-12-06 00:15:32+08:00
      [CATEGORIES]cs.LG
    • On Multi-Agent Inverse Reinforcement Learning
      [AUTHORS]Till Freihaut, Giorgia Ramponi
      [ABSTRACT]In multi-agent systems, the agent behavior is highly influenced by its utility function, as these utilities shape both individual goals as well as interactions with the other agents. Inverse Reinforcement Learning (IRL) is a well-established approach to inferring the utility function by observing an expert behavior within a given environment. In this paper, we extend the IRL framework to the multi-agent setting, assuming to observe agents who are following Nash Equilibrium (NE) policies. We theoretically investigate the set of utilities that explain the behavior of NE experts. Specifically, we provide an explicit characterization of the feasible reward set and analyze how errors in estimating the transition dynamics and expert behavior impact the recovered rewards. Building on these findings, we provide the first sample complexity analysis for the multi-agent IRL problem. Finally, we provide a numerical evaluation of our theoretical results.
      [COMMENTS]Currently under review
      [LINK]http://arxiv.org/abs/2411.15046v2
      [DATE]2024-12-06 00:04:02+08:00
      [CATEGORIES]cs.LG
    2024 Dec 05, Thu
    • Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic
      [AUTHORS]Zaid Alyafeai, Michael Pieler, Hannah Teufel, Jonathan Tow, Marco Bellagente, Duy Phung, Nikhil Pinnaparaju, Reshinth Adithyan, Paulo Rocha, Maksym Zhuravinskyi, Carlos Riquelme
      [ABSTRACT]Large Language Models (LLMs) have shown impressive results in multiple domains of natural language processing (NLP) but are mainly focused on the English language. Recently, more LLMs have incorporated a larger proportion of multilingual text to represent low-resource languages. In Arabic NLP, several Arabic-centric LLMs have shown remarkable results on multiple benchmarks in the past two years. However, most Arabic LLMs have more than 7 billion parameters, which increases their hardware requirements and inference latency, when compared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base and chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable LM 1.6B chat model achieves impressive results on several benchmarks beating multiple models with up to 8x the parameters. In addition, we show the benefit of mixing in synthetic instruction tuning data by augmenting our fine-tuning data with a large synthetic dialogue dataset.
      [LINK]http://arxiv.org/abs/2412.04277v1
      [DATE]2024-12-05 23:59:29+08:00
      [CATEGORIES]cs.CL
    • CoSy: Evaluating Textual Explanations of Neurons
      [AUTHORS]Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M. -C. Höhne, Kirill Bykov
      [ABSTRACT]A crucial aspect of understanding the complex nature of Deep Neural Networks (DNNs) is the ability to explain learned concepts within their latent representations. While methods exist to connect neurons to human-understandable textual descriptions, evaluating the quality of these explanations is challenging due to the lack of a unified quantitative approach. We introduce CoSy (Concept Synthesis), a novel, architecture-agnostic framework for evaluating textual explanations of latent neurons. Given textual explanations, our proposed framework uses a generative model conditioned on textual input to create data points representing the explanations. By comparing the neuron's response to these generated data points and control data points, we can estimate the quality of the explanation. We validate our framework through sanity checks and benchmark various neuron description methods for Computer Vision tasks, revealing significant differences in quality.
      [COMMENTS]10 pages, 5 figures
      [LINK]http://arxiv.org/abs/2405.20331v2
      [DATE]2024-12-05 23:48:24+08:00
      [CATEGORIES]cs.LG cs.CL
    • Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
      [AUTHORS]John Dang, Shivalika Singh, Daniel D'souza, Arash Ahmadian, Alejandro Salamanca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, Sandra Kublik, Meor Amer, Viraat Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-Berliac, Acyr Locatelli, Hangyu Lin, Dwarak Talupuru, Bharat Venkitesh, David Cairuz, Bowen Yang, Tim Chung, Wei-Yin Ko, Sylvie Shang Shi, Amir Shukayev, Sammie Bae, Aleksandra Piktus, Roman Castagné, Felipe Cruz-Salinas, Eddie Kim, Lucas Crawhall-Stein, Adrien Morisot, Sudip Roy, Phil Blunsom, Ivan Zhang, Aidan Gomez, Nick Frosst, Marzieh Fadaee, Beyza Ermis, Ahmet Üstün, Sara Hooker
      [ABSTRACT]We introduce the Aya Expanse model family, a new generation of 8B and 32B parameter multilingual language models, aiming to address the critical challenge of developing highly performant multilingual models that match or surpass the capabilities of monolingual models. By leveraging several years of research at Cohere For AI and Cohere, including advancements in data arbitrage, multilingual preference training, and model merging, Aya Expanse sets a new state-of-the-art in multilingual performance. Our evaluations on the Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya Expanse 8B and 32B outperform leading open-weight models in their respective parameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to a 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model with twice as many parameters, achieving a 54.0% win-rate. In this short technical report, we present extended evaluation results for the Aya Expanse model family and release their open-weights, together with a new multilingual evaluation dataset m-ArenaHard.
      [LINK]http://arxiv.org/abs/2412.04261v1
      [DATE]2024-12-05 23:41:06+08:00
      [CATEGORIES]cs.CL
    • CLINICSUM: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations
      [AUTHORS]Subash Neupane, Himanshu Tripathi, Shaswata Mitra, Sean Bozorgzad, Sudip Mittal, Shahram Rahimi, Amin Amirlatifi
      [ABSTRACT]This paper presents ClinicSum, a novel framework designed to automatically generate clinical summaries from patient-doctor conversations. It utilizes a two-module architecture: a retrieval-based filtering module that extracts Subjective, Objective, Assessment, and Plan (SOAP) information from conversation transcripts, and an inference module powered by fine-tuned Pre-trained Language Models (PLMs), which leverage the extracted SOAP data to generate abstracted clinical summaries. To fine-tune the PLM, we created a training dataset consisting of 1,473 conversation-summary pairs by consolidating two publicly available datasets, FigShare and MTS-Dialog, with ground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum's effectiveness is evaluated through both automatic metrics (e.g., ROUGE, BERTScore) and expert human assessments. Results show that ClinicSum outperforms state-of-the-art PLMs, demonstrating superior precision, recall, and F-1 scores in automatic evaluations and receiving high preference from SMEs in human assessment, making it a robust solution for automated clinical summarization.
      [COMMENTS]Accepted at the 2024 IEEE International Conference on Big Data, Workshop on Big Data and AI for Healthcare
      [LINK]http://arxiv.org/abs/2412.04254v1
      [DATE]2024-12-05 23:34:02+08:00
      [CATEGORIES]cs.CL
    • Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots
      [AUTHORS]Maria Paola Priola
      [ABSTRACT]I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs). Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS), which accounts for contextual relevance in responses. While RAG mitigates hallucinations by grounding answers in external data, NMISS refines the evaluation by identifying cases where traditional metrics incorrectly flag contextually accurate responses as hallucinations. I use Italian health news articles as context to evaluate LLM performance. Results show that Gemma2 and GPT-4 outperform the other models, with GPT-4 producing answers closely aligned with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral benefit significantly from NMISS, highlighting their ability to provide richer contextual information. This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.
      [LINK]http://arxiv.org/abs/2412.04235v1
      [DATE]2024-12-05 23:11:12+08:00
      [CATEGORIES]cs.CL
    • Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild
      [AUTHORS]Xinyu Zhao, Guoheng Sun, Ruisi Cai, Yukun Zhou, Pingzhi Li, Peihao Wang, Bowen Tan, Yexiao He, Li Chen, Yi Liang, Beidi Chen, Binhang Yuan, Hongyi Wang, Ang Li, Zhangyang Wang, Tianlong Chen
      [ABSTRACT]As Large Language Models (LLMs) excel across tasks and specialized domains, scaling LLMs based on existing models has garnered significant attention, which faces the challenge of decreasing performance when combining disparate models. Various techniques have been proposed for the aggregation of pre-trained LLMs, including model merging, Mixture-of-Experts, and stacking. Despite their merits, a comprehensive comparison and synergistic application of them to a diverse model zoo is yet to be adequately addressed. In light of this research gap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First, our work starts with a benchmarking of existing LLM scaling techniques, especially selective merging, and variants of mixture. Utilizing the insights from the benchmark results, we formulate an optimal strategy for the selection and aggregation of a heterogeneous model zoo characterizing different architectures and initialization. Our methodology involves the clustering of mergeable models and optimal merging strategy selection, and the integration of clusters through a model mixture. Finally, evidenced by our experiments on a diverse Llama-2-based model zoo, Model-GLUE shows an average performance enhancement of 5.61%, achieved without additional training. Codes are available at: https://github.com/Model-GLUE/Model-GLUE.
      [COMMENTS]24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks Track
      [LINK]http://arxiv.org/abs/2410.05357v2
      [DATE]2024-12-05 23:08:56+08:00
      [CATEGORIES]cs.LG cs.CL
    • Agent-OM: Leveraging LLM Agents for Ontology Matching
      [AUTHORS]Zhangcheng Qiang, Weiqing Wang, Kerry Taylor
      [ABSTRACT]Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of simple OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.
      [COMMENTS]14 pages, 13 figures, 4 tables
      [LINK]http://arxiv.org/abs/2312.00326v4
      [DATE]2024-12-05 22:45:05+08:00
      [CATEGORIES]cs.CL
    • AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic
      [AUTHORS]Nathaniel R. Robinson, Shahd Abdelmoneim, Kelly Marchisio, Sebastian Ruder
      [ABSTRACT]Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and limits language modeling applications, yet the research community lacks operationalized LLM performance measurements in DA. We present a method that comprehensively evaluates LLM fidelity, understanding, quality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA varieties across these four dimensions and provide best practice recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, but does not suggest deterioration in quality when they do. Further analysis suggests that current post-training can degrade DA capabilities, that few-shot examples can overcome this and other LLM deficiencies, and that otherwise no measurable features of input text correlate well with LLM DA performance.
      [COMMENTS]Pre-print
      [LINK]http://arxiv.org/abs/2412.04193v1
      [DATE]2024-12-05 22:33:00+08:00
      [CATEGORIES]cs.CL
    • Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
      [AUTHORS]Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
      [ABSTRACT]Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.
      [COMMENTS]Updated with ablations and more technical details
      [LINK]http://arxiv.org/abs/2409.17146v2
      [DATE]2024-12-05 22:28:40+08:00
      [CATEGORIES]cs.CL cs.LG
    • Reducing Tool Hallucination via Reliability Alignment
      [AUTHORS]Hongshen Xu, Su Zhu, Zihan Wang, Hang Zheng, Da Ma, Ruisheng Cao, Shuai Fan, Lu Chen, Kai Yu
      [ABSTRACT]Large Language Models (LLMs) have extended their capabilities beyond language generation to interact with external systems through tool calling, offering powerful potential for real-world applications. However, the phenomenon of tool hallucinations, which occur when models improperly select or misuse tools, presents critical challenges that can lead to flawed task execution and increased operational costs. This paper investigates the concept of reliable tool calling and highlights the necessity of addressing tool hallucinations. We systematically categorize tool hallucinations into two main types: tool selection hallucination and tool usage hallucination. To mitigate these issues, we propose a reliability-focused alignment framework that enhances the model's ability to accurately assess tool relevance and usage. By proposing a suite of evaluation metrics and evaluating on StableToolBench, we further demonstrate the effectiveness of our framework in mitigating tool hallucination and improving the overall system reliability of LLM tool calling.
      [LINK]http://arxiv.org/abs/2412.04141v1
      [DATE]2024-12-05 21:10:54+08:00
      [CATEGORIES]cs.CL
    • Text Change Detection in Multilingual Documents Using Image Comparison
      [AUTHORS]Doyoung Park, Naresh Reddy Yarram, Sunjin Kim, Minkyu Kim, Seongho Cho, Taehee Lee
      [ABSTRACT]Document comparison typically relies on optical character recognition (OCR) as its core technology. However, OCR requires the selection of appropriate language models for each document and the performance of multilingual or hybrid models remains limited. To overcome these challenges, we propose text change detection (TCD) using an image comparison model tailored for multilingual documents. Unlike OCR-based approaches, our method employs word-level text image-to-image comparison to detect changes. Our model generates bidirectional change segmentation maps between the source and target documents. To enhance performance without requiring explicit text alignment or scaling preprocessing, we employ correlations among multi-scale attention features. We also construct a benchmark dataset comprising actual printed and scanned word pairs in various languages to evaluate our model. We validate our approach using our benchmark dataset and public benchmarks Distorted Document Images and the LRDE Document Binarization Dataset. We compare our model against state-of-the-art semantic segmentation and change detection models, as well as to conventional OCR-based models.
      [COMMENTS]15 pages, 11 figures, 6 tables; accepted at WACV 2025
      [LINK]http://arxiv.org/abs/2412.04137v1
      [DATE]2024-12-05 21:04:10+08:00
      [CATEGORIES]cs.CL cs.LG
    • SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
      [AUTHORS]Ruben Härle, Felix Friedrich, Manuel Brack, Björn Deiseroth, Patrick Schramowski, Kristian Kersting
      [ABSTRACT]Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, but their output may not be aligned with the user or even produce harmful content. This paper presents a novel approach to detect and steer concepts such as toxicity before generation. We introduce the Sparse Conditioned Autoencoder (SCAR), a single trained module that extends the otherwise untouched LLM. SCAR ensures full steerability, towards and away from concepts (e.g., toxic content), without compromising the quality of the model's text generation on standard evaluation benchmarks. We demonstrate the effective application of our approach through a variety of concepts, including toxicity, safety, and writing style alignment. As such, this work establishes a robust framework for controlling LLM generations, ensuring their ethical and safe deployment in real-world applications.
      [COMMENTS]Accepted at Socially Responsible Language Modelling Research (SoLaR) Workshop at NeurIPS 2024
      [LINK]http://arxiv.org/abs/2411.07122v2
      [DATE]2024-12-05 18:45:02+08:00
      [CATEGORIES]cs.CL
    • M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction
      [AUTHORS]Jiang Liu, Bobo Li, Xinran Yang, Na Yang, Hao Fei, Mingyao Zhang, Fei Li, Donghong Ji
      [ABSTRACT]Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, in order to promote the development of multimodal IE, we constructed a multimodal multilingual multitask dataset, named M$^{3}$D, which has the following features: (1) It contains paired document-level text and video to enrich multimodal information; (2) It supports two widely-used languages, namely English and Chinese; (3) It includes more multimodal IE tasks such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete. Thus, we designed a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieved an average performance of 53.80% and 53.77% on four tasks in English and Chinese datasets, respectively, which set a reasonable standard for subsequent research. In addition, we conducted more analytical experiments to verify the effectiveness of our proposed module. We believe that our work can promote the development of the field of multimodal IE.
      [COMMENTS]14 pages, 9 figures, 6 tables
      [LINK]http://arxiv.org/abs/2412.04026v1
      [DATE]2024-12-05 18:00:58+08:00
      [CATEGORIES]cs.CL
    • Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
      [AUTHORS]Lingfeng Ming, Bo Zeng, Chenyang Lyu, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang, Weihua Luo, Kaifu Zhang
      [ABSTRACT]Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance is still largely limited to major world languages, primarily English. Many LLMs continue to face challenges with multilingual tasks, especially when it comes to low-resource languages. To address this issue, we introduced Marco-LLM: Massive multilingual training for cross-lingual enhancement LLM. We have collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models. This effort has resulted in a multilingual LLM named Marco-LLM. Through comprehensive evaluations on various multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA and many others, Marco-LLM has demonstrated substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements in any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not only perform exceptionally well in multilingual tasks, including low-resource languages, but also maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring LLMs work accurately across various languages.
      [LINK]http://arxiv.org/abs/2412.04003v1
      [DATE]2024-12-05 17:26:58+08:00
      [CATEGORIES]cs.CL
    • Demonstration Selection for In-Context Learning via Reinforcement Learning
      [AUTHORS]Xubin Wang, Jianfei Wu, Yichen Yuan, Mingzhe Li, Deyu Cai, Weijia Jia
      [ABSTRACT]Diversity in demonstration selection is crucial for enhancing model generalization, as it enables a broader coverage of structures and concepts. However, constructing an appropriate set of demonstrations has remained a focal point of research. This paper presents the Relevance-Diversity Enhanced Selection (RDES), an innovative approach that leverages reinforcement learning to optimize the selection of diverse reference demonstrations for text classification tasks using Large Language Models (LLMs), especially in few-shot prompting scenarios. RDES employs a Q-learning framework to dynamically identify demonstrations that maximize both diversity and relevance to the classification objective by calculating a diversity score based on label distribution among selected demonstrations. This method ensures a balanced representation of reference data, leading to improved classification accuracy. Through extensive experiments on four benchmark datasets and involving 12 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances classification accuracy compared to ten established baselines. Furthermore, we investigate the incorporation of Chain-of-Thought (CoT) reasoning in the reasoning process, which further enhances the model's predictive performance. The results underscore the potential of reinforcement learning to facilitate adaptive demonstration selection and deepen the understanding of classification challenges.
      [LINK]http://arxiv.org/abs/2412.03966v1
      [DATE]2024-12-05 16:33:52+08:00
      [CATEGORIES]cs.CL
    • MIND: Effective Incorrect Assignment Detection through a Multi-Modal Structure-Enhanced Language Model
      [AUTHORS]Yunhe Pang, Bo Chen, Fanjin Zhang, Yanghui Rao, Jie Tang
      [ABSTRACT]The rapid growth of academic publications has exacerbated the issue of author name ambiguity in online digital libraries. Despite advances in name disambiguation algorithms, cumulative errors continue to undermine the reliability of academic systems. It is estimated that over 10% of paper-author assignments are rectified when constructing the million-scale WhoIsWho benchmark. Existing endeavors to detect incorrect assignments are either semantic-based or graph-based approaches, which fall short of making full use of the rich text attributes of papers and implicit structural features defined via the co-occurrence of paper attributes. To this end, this paper introduces a structure-enhanced language model that combines key structural features from graph-based methods with fine-grained semantic features from rich paper attributes to detect incorrect assignments. The proposed model is trained with a highly effective multi-modal multi-turn instruction tuning framework, which incorporates task-guided instruction tuning, text-attribute modality, and structural modality. Experimental results demonstrate that our model outperforms previous approaches, achieving top performance on the leaderboard of KDD Cup 2024. Our code is publicly available.
      [LINK]http://arxiv.org/abs/2412.03930v1
      [DATE]2024-12-05 15:12:53+08:00
      [CATEGORIES]cs.CL
    • ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
      [AUTHORS]Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, Xuanjing Huang
      [ABSTRACT]Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined, diverging from genuine needs. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.
      [COMMENTS]Accepted by COLING 2025 conference
      [LINK]http://arxiv.org/abs/2401.00741v3
      [DATE]2024-12-05 15:05:59+08:00
      [CATEGORIES]cs.CL
    • LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings
      [AUTHORS]Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé
      [ABSTRACT]Sentence embedding models play a key role in various Natural Language Processing tasks, such as in Topic Modeling, Document Clustering and Recommendation Systems. However, these models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish. This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages. To address this issue, we compile a relatively small but high-quality human-generated cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence embedding model for Luxembourgish with strong cross-lingual capabilities. Additionally, we present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages than relying solely on high-resource language pairs. Furthermore, recognizing the lack of sentence embedding benchmarks for low-resource languages, we create a paraphrase detection benchmark specifically for Luxembourgish, aiming to partially fill this gap and promote further research.
      [COMMENTS]Accepted at COLING 2025
      [LINK]http://arxiv.org/abs/2412.03331v2
      [DATE]2024-12-05 15:05:57+08:00
      [CATEGORIES]cs.CL
    • DRS: Deep Question Reformulation With Structured Output
      [AUTHORS]Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang
      [ABSTRACT]Question answering represents a core capability of large language models (LLMs). However, when individuals encounter unfamiliar knowledge in texts, they often formulate questions that the text itself cannot answer due to insufficient understanding of the underlying information. Recent studies reveal that while LLMs can detect unanswerable questions, they struggle to assist users in reformulating these questions. Even advanced models like GPT-3.5 demonstrate limited effectiveness in this regard. To address this limitation, we propose DRS: Deep Question Reformulation with Structured Output, a novel zero-shot method aimed at enhancing LLMs' ability to assist users in reformulating questions to extract relevant information from new documents. DRS combines the strengths of LLMs with a DFS-based algorithm to iteratively explore potential entity combinations and constrain outputs using predefined entities. This structured approach significantly enhances the reformulation capabilities of LLMs. Comprehensive experimental evaluations demonstrate that DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while also enhancing the performance of open-source models, such as Gemma2-9B, from 26.35% to 56.75%.
      [LINK]http://arxiv.org/abs/2411.17993v2
      [DATE]2024-12-05 14:53:40+08:00
      [CATEGORIES]cs.CL
    • A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios
      [AUTHORS]Xiachong Feng, Longxu Dou, Ella Li, Qinghao Wang, Haochuan Wang, Yu Guo, Chang Ma, Lingpeng Kong
      [ABSTRACT]Game-theoretic scenarios have become pivotal in evaluating the social intelligence of Large Language Model (LLM)-based social agents. While numerous studies have explored these agents in such settings, there is a lack of a comprehensive survey summarizing the current progress. To address this gap, we systematically review existing research on LLM-based social agents within game-theoretic scenarios. Our survey organizes the findings into three core components: Game Framework, Social Agent, and Evaluation Protocol. The game framework encompasses diverse game scenarios, ranging from choice-focusing to communication-focusing games. The social agent part explores agents' preferences, beliefs, and reasoning abilities. The evaluation protocol covers both game-agnostic and game-specific metrics for assessing agent performance. By reflecting on the current research and identifying future research directions, this survey provides insights to advance the development and evaluation of social agents in game-theoretic scenarios.
      [LINK]http://arxiv.org/abs/2412.03920v1
      [DATE]2024-12-05 14:46:46+08:00
      [CATEGORIES]cs.CL
    • MISR: Measuring Instrumental Self-Reasoning in Frontier Models
      [AUTHORS]Kai Fronsdal, David Lindner
      [ABSTRACT]We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, hence our evaluation can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.
      [COMMENTS]10 pages, 65 page appendix, 5 figures
      [LINK]http://arxiv.org/abs/2412.03904v1
      [DATE]2024-12-05 14:20:47+08:00
      [CATEGORIES]cs.CL cs.LG
    • Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction
      [AUTHORS]Zheye Deng, Chunkit Chan, Weiqi Wang, Yuxi Sun, Wei Fan, Tianshi Zheng, Yauwai Yim, Yangqiu Song
      [ABSTRACT]The task of condensing large chunks of textual information into concise and structured tables has gained attention recently due to the emergence of Large Language Models (LLMs) and their potential benefit for downstream tasks, such as text summarization and text mining. Previous approaches often generate tables that directly replicate information from the text, limiting their applicability in broader contexts, as text-to-table generation in real-life scenarios necessitates information extraction, reasoning, and integration. However, there is a lack of both datasets and methodologies towards this task. In this paper, we introduce LiveSum, a new benchmark dataset created for generating summary tables of competitions based on real-time commentary texts. We evaluate the performances of state-of-the-art LLMs on this task in both fine-tuning and zero-shot settings, and additionally propose a novel pipeline called $T^3$ (Text-Tuple-Table) to improve their performances. Extensive experimental results demonstrate that LLMs still struggle with this task even after fine-tuning, while our approach can offer substantial performance gains without explicit training. Further analyses demonstrate that our method exhibits strong generalization abilities, surpassing previous approaches on several other text-to-table datasets. Our code and data can be found at https://github.com/HKUST-KnowComp/LiveSum.
      [COMMENTS]Accepted to EMNLP 2024
      [LINK]http://arxiv.org/abs/2404.14215v2
      [DATE]2024-12-05 14:02:59+08:00
      [CATEGORIES]cs.CL
    • AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer
      [AUTHORS]Davor Lauc, Attapol Rutherford, Weerin Wongwarawipatr
      [ABSTRACT]This study introduces AyutthayaAlpha, an advanced transformer-based machine learning model designed for the transliteration of Thai proper names into Latin script. Our system achieves state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy, while maintaining a low character error rate of 0.0047. The complexity of Thai phonology, including tonal features and vowel length distinctions, presents significant challenges for accurate transliteration, which we address through a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly outperforms its larger counterpart. Our research combines linguistic rules with deep learning, training on a carefully curated dataset of 1.2 million Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million examples. Extensive evaluations against existing transliteration methods and human expert benchmarks demonstrate that AyutthayaAlpha not only achieves superior accuracy but also effectively captures personal and cultural preferences in name romanization. The system's practical applications extend to cross-lingual information retrieval, international data standardization, and identity verification systems, with particular relevance for government databases, academic institutions, and global business operations. This work represents a significant advance in bridging linguistic gaps between Thai and Latin scripts, while respecting the cultural and personal dimensions of name transliteration.
      [LINK]http://arxiv.org/abs/2412.03877v1
      [DATE]2024-12-05 13:18:09+08:00
      [CATEGORIES]cs.CL
    • Yi-Lightning Technical Report
      [AUTHORS]01.AI, Alan Wake, Albert Wang, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qicheng Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang
      [ABSTRACT]This technical report presents Yi-Lightning, our latest flagship large language model (LLM). It achieves exceptional performance, ranking 6th overall on Chatbot Arena, with particularly strong results (2nd to 4th place) in specialized categories including Chinese, Math, Coding, and Hard Prompts. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, featuring advanced expert segmentation and routing mechanisms coupled with optimized KV-caching techniques. Our development process encompasses comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), where we devise deliberate strategies for multi-stage training, synthetic data construction, and reward modeling. Furthermore, we implement RAISE (Responsible AI Safety Engine), a four-component framework to address safety issues across pre-training, post-training, and serving phases. Empowered by our scalable super-computing infrastructure, all these innovations substantially reduce training, deployment and inference costs while maintaining high-performance standards. With further evaluations on public academic benchmarks, Yi-Lightning demonstrates competitive performance against top-tier LLMs, while we observe a notable disparity between traditional, static benchmark results and real-world, dynamic human preferences. This observation prompts a critical reassessment of conventional benchmarks' utility in guiding the development of more intelligent and powerful AI systems for practical applications. Yi-Lightning is now available through our developer platform at https://platform.lingyiwanwu.com.
      [LINK]http://arxiv.org/abs/2412.01253v3
      [DATE]2024-12-05 12:29:49+08:00
      [CATEGORIES]cs.CL cs.LG
    • Calibrating Reasoning in Language Models with Internal Consistency
      [AUTHORS]Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li
      [ABSTRACT]Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks, aided by techniques like chain-of-thought prompting that elicits verbalized reasoning. However, LLMs often generate text with obvious mistakes and contradictions, raising doubts about their ability to robustly process and utilize generated rationales. In this work, we investigate reasoning in LLMs through the lens of internal representations, focusing on how these representations are influenced by generated rationales. Our preliminary analysis reveals that while generated rationales improve answer accuracy, inconsistencies emerge between the model's internal representations in middle layers and those in final layers, potentially undermining the reliability of their reasoning processes. To address this, we propose internal consistency as a measure of the model's confidence by examining the agreement of latent predictions decoded from intermediate layers. Extensive empirical studies across different models and datasets demonstrate that internal consistency effectively distinguishes between correct and incorrect reasoning paths. Motivated by this, we propose a new approach to calibrate reasoning by up-weighting reasoning paths with high internal consistency, resulting in a significant boost in reasoning performance. Further analysis uncovers distinct patterns in attention and feed-forward modules across layers, providing insights into the emergence of internal inconsistency. In summary, our results demonstrate the potential of using internal representations for self-evaluation of LLMs. Our code is available at github.com/zhxieml/internal-consistency.
      [COMMENTS]NeurIPS 2024 camera ready
      [LINK]http://arxiv.org/abs/2405.18711v2
      [DATE]2024-12-05 12:01:28+08:00
      [CATEGORIES]cs.CL
    • From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
      [AUTHORS]Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
      [ABSTRACT]Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models, have revolutionized various natural language processing tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. We review fundamental building blocks crucial for studying chart understanding tasks. Additionally, we explore various tasks and their evaluation metrics and sources of both charts and textual inputs. Various modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed, highlighting the importance of several topics, such as domain-specific charts, lack of efforts in developing evaluation metrics, and agent-oriented settings. This survey paper serves as a comprehensive resource for researchers and practitioners in the fields of natural language processing, computer vision, and data analysis, providing valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies mentioned in this paper, along with emerging new research, will be continually updated at: https://github.com/khuangaf/Awesome-Chart-Understanding.
      [COMMENTS]IEEE Transactions on Knowledge and Data Engineering (TKDE)
      [LINK]http://arxiv.org/abs/2403.12027v4
      [DATE]2024-12-05 11:26:13+08:00
      [CATEGORIES]cs.CL
    • ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data
      [AUTHORS]Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar
      [ABSTRACT]Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.
      [LINK]http://arxiv.org/abs/2411.15004v2
      [DATE]2024-12-05 10:00:07+08:00
      [CATEGORIES]cs.CL
    • Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data
      [AUTHORS]David R. Bellamy, Bhawesh Kumar, Cindy Wang, Andrew Beam
      [ABSTRACT]In this work we introduce Labrador, a pre-trained Transformer model for laboratory data. Labrador and BERT were pre-trained on a corpus of 100 million lab test results from electronic health records (EHRs) and evaluated on various downstream outcome prediction tasks. Both models demonstrate mastery of the pre-training task but neither consistently outperforms XGBoost on downstream supervised tasks. Our ablation studies reveal that transfer learning shows limited effectiveness for BERT and achieves marginal success with Labrador. We explore the reasons for the failure of transfer learning and suggest that the data generating process underlying each patient cannot be characterized sufficiently using labs alone, among other factors. We encourage future work to focus on joint modeling of multiple EHR data categories and to include tree-based baselines in their evaluations.
      [COMMENTS]26 pages, 8 figures, best paper award at ML4H 2024
      [LINK]http://arxiv.org/abs/2312.11502v2
      [DATE]2024-12-05 07:09:53+08:00
      [CATEGORIES]cs.CL cs.LG
    • From Language Models over Tokens to Language Models over Characters
      [AUTHORS]Tim Vieira, Ben LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O'Donnell, Ryan Cotterell
      [ABSTRACT]Modern language models are internally -- and mathematically -- distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent analyses are very sensitive to the specification of the prompt (e.g., if the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. We find that -- even with a small computation budget -- our method is able to accurately approximate the character-level distribution (less than 0.00021 excess bits / character) at reasonably fast speeds (46.3 characters / second) on the Llama 3.1 8B language model.
      [LINK]http://arxiv.org/abs/2412.03719v1
      [DATE]2024-12-05 05:19:20+08:00
      [CATEGORIES]cs.CL
    • Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance
      [AUTHORS]Meni Brief, Oded Ovadia, Gil Shenderovitz, Noga Ben Yoash, Rachel Lemberg, Eitam Sheetrit
      [ABSTRACT]The application of large language models (LLMs) in domain-specific contexts, including finance, has expanded rapidly. Domain-specific LLMs are typically evaluated based on their performance in various downstream tasks relevant to the domain. In this work, we present a detailed analysis of fine-tuning LLMs for such tasks. Somewhat counterintuitively, we find that in domain-specific cases, fine-tuning exclusively on the target task is not always the most effective strategy. Instead, multi-task fine-tuning -- where models are trained on a cocktail of related tasks -- can significantly enhance performance. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results, even surpassing the much larger GPT-4-o model on financial benchmarks. Our study involves a large-scale experiment, conducting over 200 training experiments using several widely adopted LLMs as baselines, and empirically confirms the benefits of multi-task fine-tuning. Additionally, we explore the use of general instruction data as a form of regularization, suggesting that it helps minimize performance degradation. We also investigate the inclusion of mathematical data, finding improvements in numerical reasoning that transfer effectively to financial tasks. Finally, we note that while fine-tuning for downstream tasks leads to targeted improvements in task performance, it does not necessarily result in broader gains in domain knowledge or complex domain reasoning abilities.
      [LINK]http://arxiv.org/abs/2410.01109v2
      [DATE]2024-12-05 04:57:05+08:00
      [CATEGORIES]cs.CL
    • Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
      [AUTHORS]Wang Xiyao, Yang Zhengyuan, Li Linjie, Lu Hongjin, Xu Yuancheng, Lin Chung-Ching Lin, Lin Kevin, Huang Furong, Wang Lijuan
      [ABSTRACT]Despite significant advancements in vision-language models (VLMs), there is a lack of effective approaches to enhance response quality by scaling inference-time computation. This capability is known to be a core step towards the self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
      [LINK]http://arxiv.org/abs/2412.03704v1
      [DATE]2024-12-05 04:35:07+08:00
      [CATEGORIES]cs.CL cs.LG
    • MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
      [AUTHORS]Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, Lichao Sun
      [ABSTRACT]Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool utilization ability of LLMs. They primarily investigated how LLMs effectively collaborate with given specific tools. However, in scenarios where LLMs serve as intelligent agents, as seen in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making processes that involve deciding whether to employ a tool and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries in the form of prompts that trigger LLMs to use tools, including both single-tool and multi-tool scenarios. Subsequently, we set the tasks for both tool usage awareness and tool selection. We define four subtasks from different perspectives in tool selection, including tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools, highlighting the existing gaps between LLMs and genuine intelligent agents. However, through the error analysis, we found there is still significant room for improvement. Finally, we conclude with insights for tool developers -- we strongly recommend that tool developers choose an appropriate rewrite model for generating new descriptions based on the downstream LLM the tool will apply to. Our code is in https://github.com/HowieHwong/MetaTool.
      [LINK]http://arxiv.org/abs/2310.03128v6
      [DATE]2024-12-05 03:49:02+08:00
      [CATEGORIES]cs.CL
    • Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings
      [AUTHORS]Guy Barel, Oren Tsur, Dan Volenchik
      [ABSTRACT]Stance detection plays a pivotal role in enabling an extensive range of downstream applications, from discourse parsing to tracing the spread of fake news and the denial of scientific facts. While most stance classification models rely on textual representation of the utterance in question, prior work has demonstrated the importance of the conversational context in stance detection. In this work we introduce TASTE -- a multimodal architecture for stance detection that harmoniously fuses Transformer-based content embedding with unsupervised structural embedding. Through the fine-tuning of a pretrained transformer and the amalgamation with social embedding via a Gated Residual Network (GRN) layer, our model adeptly captures the complex interplay between content and conversational structure in determining stance. TASTE achieves state-of-the-art results on common benchmarks, significantly outperforming an array of strong baselines. Comparative evaluations underscore the benefits of social grounding -- emphasizing the criticality of concurrently harnessing both content and structure for enhanced stance detection.
      [COMMENTS]The modified camera ready version will be published in January 2025 at COLING
      [LINK]http://arxiv.org/abs/2412.03681v1
      [DATE]2024-12-05 03:23:37+08:00
      [CATEGORIES]cs.CL
    • Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
      [AUTHORS]Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
      [ABSTRACT]Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pair-wise ranking formats grouped with a user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
      [COMMENTS]EMNLP 2024 (Main Conference)
      [LINK]http://arxiv.org/abs/2405.01535v2
      [DATE]2024-12-05 03:23:17+08:00
      [CATEGORIES]cs.CL
    • Evaluating Language Models as Synthetic Data Generators
      [AUTHORS]Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
      [ABSTRACT]Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality -- including response quality, perplexity, and instruction difficulty -- collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.
      [COMMENTS]Work in Progress
      [LINK]http://arxiv.org/abs/2412.03679v1
      [DATE]2024-12-05 03:20:32+08:00
      [CATEGORIES]cs.CL
    • Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis
      [AUTHORS]Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
      [ABSTRACT]The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their generalization capabilities intact remains challenging. We discuss the implications of these findings for future research in image captioning and the development of more adaptable Multimodal LLMs.
      [COMMENTS]ECCV 2024 Workshop on Green Foundation Models
      [LINK]http://arxiv.org/abs/2412.03665v1
      [DATE]2024-12-05 03:01:06+08:00
      [CATEGORIES]cs.CL
    • From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents
      [AUTHORS]Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, Zhongyu Wei
      [ABSTRACT]Traditional sociological research often relies on human participation, which, though effective, is expensive, challenging to scale, and raises ethical concerns. Recent advancements in large language models (LLMs) highlight their potential to simulate human behavior, enabling the replication of individual responses and facilitating many interdisciplinary studies. In this paper, we conduct a comprehensive survey of this field, illustrating the recent progress in simulation driven by LLM-empowered agents. We categorize the simulations into three types: (1) Individual Simulation, which mimics specific individuals or demographic groups; (2) Scenario Simulation, where multiple agents collaborate to achieve goals within specific contexts; and (3) Society Simulation, which models interactions within agent societies to reflect the complexity and variety of real-world dynamics. These simulations follow a progression, ranging from detailed individual modeling to large-scale societal phenomena. We provide a detailed discussion of each simulation type, including the architecture or key components of the simulation, the classification of objectives or scenarios and the evaluation method. Afterward, we summarize commonly used datasets and benchmarks. Finally, we discuss the trends across these three types of simulation. A repository for the related sources is at https://github.com/FudanDISC/SocialAgent.
      [LINK]http://arxiv.org/abs/2412.03563v1
      [DATE]2024-12-05 02:56:37+08:00
      [CATEGORIES]cs.CL
    • Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted Language Models
      [AUTHORS]Natalie Mackraz, Nivedha Sivakumar, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
      [ABSTRACT]Large language models (LLMs) are increasingly being adapted to achieve task-specificity for deployment in real-world decision systems. Several previous works have investigated the bias transfer hypothesis (BTH) by studying the effect of the fine-tuning adaptation strategy on model fairness to find that fairness in pre-trained masked language models has limited effect on the fairness of models when adapted using fine-tuning. In this work, we expand the study of BTH to causal models under prompt adaptations, as prompting is an accessible and compute-efficient way to deploy models in real-world systems. In contrast to previous works, we establish that intrinsic biases in pre-trained Mistral, Falcon and Llama models are strongly correlated (rho >= 0.94) with biases when the same models are zero- and few-shot prompted, using a pronoun co-reference resolution task. Further, we find that bias transfer remains strongly correlated even when LLMs are specifically prompted to exhibit fair or biased behavior (rho >= 0.92), and few-shot length and stereotypical composition are varied (rho >= 0.97). Our findings highlight the importance of ensuring fairness in pre-trained LLMs, especially when they are later used to perform downstream tasks via prompt adaptation.
      [LINK]http://arxiv.org/abs/2412.03537v1
      [DATE]2024-12-05 02:32:42+08:00
      [CATEGORIES]cs.CL cs.LG
    • StarVector: Generating Scalable Vector Graphics Code from Images and Text
      [AUTHORS]Juan A. Rodriguez, Abhay Puri, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
      [ABSTRACT]Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation methods have focused on curve-based vectorization, lacking semantic understanding, often producing artifacts, and struggling with SVG primitives beyond path curves. To address these issues, we introduce StarVector, a multimodal large language model for SVG generation. It performs image vectorization by understanding image semantics and using SVG primitives for compact, precise outputs. Unlike traditional methods, StarVector works directly in the SVG code space, leveraging visual understanding to apply accurate SVG primitives. To train StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables generalization across vectorization tasks and precise use of primitives like ellipses, polygons, and text. We address challenges in SVG evaluation, showing that pixel-based metrics like MSE fail to capture the unique qualities of vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3 tasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this setup, StarVector achieves state-of-the-art performance, producing more compact and semantically rich SVGs.
      [LINK]http://arxiv.org/abs/2312.11556v2
      [DATE]2024-12-05 02:31:44+08:00
      [CATEGORIES]cs.CL
    • A Review on Scientific Knowledge Extraction using Large Language Models in Biomedical Sciences
      [AUTHORS]Gabriel Lino Garcia, João Renato Ribeiro Manesco, Pedro Henrique Paiola, Lucas Miranda, Maria Paola de Salvo, João Paulo Papa
      [ABSTRACT]The rapid advancement of large language models (LLMs) has opened new boundaries in the extraction and synthesis of medical knowledge, particularly within evidence synthesis. This paper reviews the state-of-the-art applications of LLMs in the biomedical domain, exploring their effectiveness in automating complex tasks such as evidence synthesis and data extraction from a biomedical corpus of documents. While LLMs demonstrate remarkable potential, significant challenges remain, including issues related to hallucinations, contextual understanding, and the ability to generalize across diverse medical tasks. We highlight critical gaps in the current research literature, particularly the need for unified benchmarks to standardize evaluations and ensure reliability in real-world applications. In addition, we propose directions for future research, emphasizing the integration of state-of-the-art techniques such as retrieval-augmented generation (RAG) to enhance LLM performance in evidence synthesis. By addressing these challenges and utilizing the strengths of LLMs, we aim to improve access to medical literature and facilitate meaningful discoveries in healthcare.
      [COMMENTS]9 pages, 1 table, 1 figure, conference paper
      [LINK]http://arxiv.org/abs/2412.03531v1
      [DATE]2024-12-05 02:26:13+08:00
      [CATEGORIES]cs.CL cs.LG
    • Number Cookbook: Number Understanding of Language Models and How to Improve It
      [AUTHORS]Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, Muhan Zhang
      [ABSTRACT]Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing (such as 9.11 > 9.9). The latter ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work paid little attention to it or only discussed several restricted tasks (like integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompassing nearly all everyday numerical understanding and processing scenarios, and the rules of these tasks are very simple and clear. Through the benchmark, we find that current LLMs fail frequently in many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as tokenizers, PEs, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can improve NUPA a lot on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work provides a more detailed and comprehensive understanding of NUPA in LLMs. Our benchmark and code are released at https://github.com/GraphPKU/number_cookbook.
      [LINK]http://arxiv.org/abs/2411.03766v2
      [DATE]2024-12-05 00:39:04+08:00
      [CATEGORIES]cs.CL
    • DataLab: A Unified Platform for LLM-Powered Business Intelligence
      [AUTHORS]Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Peng Chen, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, Haonan Wang, Canshi Wei, Xiaofeng Yang, Yuhui Zhang, Yifeng Zheng, Xiuqi Huang, Minfeng Zhu, Yuxin Ma, Bin Cui, Wei Chen
      [ABSTRACT]Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS. The fragmentation of tasks across different data roles and tools leads to inefficiencies and potential errors due to the iterative and collaborative nature of BI. In this paper, we introduce DataLab, a unified BI platform that integrates a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab supports a wide range of BI tasks for different data roles by seamlessly combining LLM assistance with user customization within a single environment. To achieve this unification, we design a domain knowledge incorporation module tailored for enterprise-specific BI tasks, an inter-agent communication mechanism to facilitate information sharing across the BI workflow, and a cell-based context management strategy to enhance context utilization efficiency in BI notebooks. Extensive experiments demonstrate that DataLab achieves state-of-the-art performance on various BI tasks across popular research benchmarks. Moreover, DataLab maintains high effectiveness and efficiency on real-world datasets from Tencent, achieving up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.
      [LINK]http://arxiv.org/abs/2412.02205v2
      [DATE]2024-12-05 00:12:08+08:00
      [CATEGORIES]cs.CL
    • Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding
      [AUTHORS]Lingdong Kong, Xiang Xu, Jun Cen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu
      [ABSTRACT]Safety-critical 3D scene understanding tasks necessitate not only accurate but also confident predictions from 3D perception models. This study introduces Calib3D, a pioneering effort to benchmark and scrutinize the reliability of 3D scene understanding models from an uncertainty estimation viewpoint. We comprehensively evaluate 28 state-of-the-art models across 10 diverse 3D datasets, uncovering insightful phenomena that cope with both the aleatoric and epistemic uncertainties in 3D scene understanding. We discover that despite achieving impressive levels of accuracy, existing models frequently fail to provide reliable uncertainty estimates -- a pitfall that critically undermines their applicability in safety-sensitive contexts. Through extensive analysis of key factors such as network capacity, LiDAR representations, rasterization resolutions, and 3D data augmentation techniques, we correlate these aspects directly with the model calibration efficacy. Furthermore, we introduce DeptS, a novel depth-aware scaling approach aimed at enhancing 3D model calibration. Extensive experiments across a wide range of configurations validate the superiority of our method. We hope this work could serve as a cornerstone for fostering reliable 3D scene understanding. Code and benchmark toolkit are publicly available.
      [COMMENTS]WACV 2025; 26 pages, 8 figures, 12 tables; Code at https://github.com/ldkong1205/Calib3D
      [LINK]http://arxiv.org/abs/2403.17010v2
      [DATE]2024-12-05 23:33:29+08:00
      [CATEGORIES]cs.LG
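
The Calib3D abstract above is about whether a model's stated confidence matches how often it is right. As a rough illustration, the sketch below computes the binned expected calibration error (ECE), a widely used calibration metric. The abstract does not say which metric the benchmark reports, so the choice of ECE, the function name `expected_calibration_error`, and the 10-bin default are assumptions made for this example only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    confidences: predicted confidence in [0, 1] for each sample.
    correct:     True/False flags saying whether each prediction was right.
    n_bins:      number of equal-width confidence bins (assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group samples into equal-width confidence bins; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the share of samples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that claims 90% confidence on four predictions but gets only three right has a gap of roughly 0.15 under this definition; a perfectly calibrated model scores 0.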
    • HyperMARL: Adaptive Hypernetworks for Multi-Agent RL
      [AUTHORS]Kale-ab Abebe Tessera, Arrasy Rahman, Stefano V. Albrecht
      [ABSTRACT]Balancing individual specialisation and shared behaviours is a critical challenge in multi-agent reinforcement learning (MARL). Existing methods typically focus on encouraging diversity or leveraging shared representations. Full parameter sharing (FuPS) improves sample efficiency but struggles to learn diverse behaviours when required, while no parameter sharing (NoPS) enables diversity but is computationally expensive and sample inefficient. To address these challenges, we introduce HyperMARL, a novel approach using hypernetworks to balance efficiency and specialisation. HyperMARL generates agent-specific actor and critic parameters, enabling agents to adaptively exhibit diverse or homogeneous behaviours as needed, without modifying the learning objective or requiring prior knowledge of the optimal diversity. Furthermore, HyperMARL decouples agent-specific and state-based gradients, which empirically correlates with reduced policy gradient variance, potentially offering insights into its ability to capture diverse behaviours. Across MARL benchmarks requiring homogeneous, heterogeneous, or mixed behaviours, HyperMARL consistently matches or outperforms FuPS, NoPS, and diversity-focused methods, achieving NoPS-level diversity with a shared architecture. These results highlight the potential of hypernetworks as a versatile approach to the trade-off between specialisation and shared behaviours in MARL.
      [LINK]http://arxiv.org/abs/2412.04233v1
      [DATE]2024-12-05 23:09:51+08:00
      [CATEGORIES]cs.LG
    • Foundations of the Theory of Performance-Based Ranking
      [AUTHORS]Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck
      [ABSTRACT]Ranking entities such as algorithms, devices, methods, or models based on their performances, while accounting for application-specific preferences, is a challenge. To address this challenge, we establish the foundations of a universal theory for performance-based ranking. First, we introduce a rigorous framework built on top of both the probability and order theories. Our new framework encompasses the elements necessary to (1) manipulate performances as mathematical objects, (2) express which performances are worse than or equivalent to others, (3) model tasks through a variable called satisfaction, (4) consider properties of the evaluation, (5) define scores, and (6) specify application-specific preferences through a variable called importance. On top of this framework, we propose the first axiomatic definition of performance orderings and performance-based rankings. Then, we introduce a universal parametric family of scores, called ranking scores, that can be used to establish rankings satisfying our axioms, while considering application-specific preferences. Finally, we show, in the case of two-class classification, that the family of ranking scores encompasses well-known performance scores, including the accuracy, the true positive rate (recall, sensitivity), the true negative rate (specificity), the positive predictive value (precision), and F1. However, we also show that some other scores commonly used to compare classifiers are unsuitable to derive performance orderings satisfying the axioms. Therefore, this paper provides the computer vision and machine learning communities with a rigorous framework for evaluating and ranking entities.
      [LINK]http://arxiv.org/abs/2412.04227v1
      [DATE]2024-12-05 23:05:25+08:00
      [CATEGORIES]cs.LG
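
For readers who want the two-class scores named in the abstract above spelled out, the sketch below computes them from confusion-matrix counts using their standard textbook definitions. It does not implement the paper's parametric family of ranking scores; the function name `binary_scores` and the NaN convention for empty denominators are assumptions of this example.

```python
def binary_scores(tp, fp, tn, fn):
    """Standard two-class scores from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives,
                    true negatives, false negatives.
    Returns NaN for a score whose denominator is zero (an assumed convention).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "tpr": ratio(tp, tp + fn),           # recall, sensitivity
        "tnr": ratio(tn, tn + fp),           # specificity
        "ppv": ratio(tp, tp + fp),           # precision
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts, used only to exercise the function.
scores = binary_scores(tp=40, fp=10, tn=45, fn=5)
```

All five scores are ratios of the same four counts, which is what makes it plausible for a single parametric family to cover them.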
    • Dockformer: A transformer-based molecular docking paradigm for large-scale virtual screening
      [AUTHORS]Zhangfan Yang, Junkai Ji, Shan He, Jianqiang Li, Tiantian He, Ruibin Bai, Zexuan Zhu, Yew Soon Ong
      [ABSTRACT]Molecular docking is a crucial step in drug development, which enables the virtual screening of compound libraries to identify potential ligands that target proteins of interest. However, the computational complexity of traditional docking models increases as the size of the compound library increases. Recently, deep learning algorithms can provide data-driven research and development models to increase the speed of the docking process. Unfortunately, few models can achieve superior screening performance compared to that of traditional models. Therefore, a novel deep learning-based docking approach named Dockformer is introduced in this study. Dockformer leverages multimodal information to capture the geometric topology and structural knowledge of molecules and can directly generate binding conformations with the corresponding confidence measures in an end-to-end manner. The experimental results show that Dockformer achieves success rates of 90.53% and 82.71% on the PDBbind core set and PoseBusters benchmarks, respectively, and more than a 100-fold increase in the inference process speed, outperforming almost all state-of-the-art docking methods. In addition, the ability of Dockformer to identify the main protease inhibitors of coronaviruses is demonstrated in a real-world virtual screening scenario. Considering its high docking accuracy and screening efficiency, Dockformer can be regarded as a powerful and robust tool in the field of drug design.
      [COMMENTS]15 pages, 10 figures
      [LINK]http://arxiv.org/abs/2411.06740v4
      [DATE]2024-12-05 22:56:30+08:00
      [CATEGORIES]cs.LG
    • Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based on gradual information disclosure
      [AUTHORS]Florens Rohde, Victor Christen, Martin Franke, Erhard Rahm
      [ABSTRACT]Privacy-Preserving Record Linkage (PPRL) is an essential component in data integration tasks of sensitive information. The linkage quality determines the usability of combined datasets and (machine learning) applications based on them. We present a novel privacy-preserving protocol that integrates clerical review in PPRL using a multi-layer active learning process. Uncertain match candidates are reviewed on several layers by human and non-human oracles to reduce the amount of disclosed information per record and in total. Predictions are propagated back to update previous layers, resulting in an improved linkage performance for non-reviewed candidates as well. The data owners remain in control of the amount of information they share for each record. Therefore, our approach follows need-to-know and data sovereignty principles. The experimental evaluation on real-world datasets shows considerable linkage quality improvements with limited labeling effort and privacy risks.
      [COMMENTS]Accepted at 21st Conference on Database Systems for Business, Technology and Web (BTW)
      [LINK]http://arxiv.org/abs/2412.04178v1
      [DATE]2024-12-05 22:18:50+08:00
      [CATEGORIES]cs.LG
    • When Stability meets Sufficiency: Informative Explanations that do not Overwhelm
      [AUTHORS]Ronny Luss, Amit Dhurandhar
      [ABSTRACT]Recent studies evaluating various criteria for explainable artificial intelligence (XAI) suggest that fidelity, stability, and comprehensibility are among the most important metrics considered by users of AI across a diverse collection of usage contexts. We consider these criteria as applied to feature-based attribution methods, which are amongst the most prevalent in XAI literature. Going beyond standard correlation, methods have been proposed that highlight what should be minimally sufficient to justify the classification of an input (viz. pertinent positives). While minimal sufficiency is an attractive property akin to comprehensibility, the resulting explanations are often too sparse for a human to understand and evaluate the local behavior of the model. To overcome these limitations, we incorporate the criteria of stability and fidelity and propose a novel method called Path-Sufficient Explanations Method (PSEM) that outputs a sequence of stable and sufficient explanations for a given input of strictly decreasing size (or value) -- from original input to a minimally sufficient explanation -- which can be thought to trace the local boundary of the model in a stable manner, thus providing better intuition about the local model behavior for the specific input. We validate these claims, both qualitatively and quantitatively, with experiments that show the benefit of PSEM across three modalities (image, tabular and text) as well as versus other path explanations. A user study depicts the strength of the method in communicating the local behavior, where (many) users are able to correctly determine the prediction made by a model.
      [COMMENTS]Published at TMLR
      [LINK]http://arxiv.org/abs/2109.06181v2
      [DATE]2024-12-05 21:50:59+08:00
      [CATEGORIES]cs.LG
    • Looking at Model Debiasing through the Lens of Anomaly Detection
      [AUTHORS]Vito Paolo Pastore, Massimiliano Ciranni, Davide Marinelli, Francesca Odone, Vittorio Murino
      [ABSTRACT]It is widely recognized that deep neural networks are sensitive to bias in the data. This means that during training these models are likely to learn spurious correlations between data and labels, resulting in limited generalization abilities and low performance. In this context, model debiasing approaches can be devised aiming at reducing the model's dependency on such unwanted correlations, either leveraging the knowledge of bias information or not. In this work, we focus on the latter and more realistic scenario, showing the importance of accurately predicting the bias-conflicting and bias-aligned samples to obtain compelling performance in bias mitigation. On this ground, we propose to conceive the problem of model bias from an out-of-distribution perspective, introducing a new bias identification method based on anomaly detection. We claim that when data is mostly biased, bias-conflicting samples can be regarded as outliers with respect to the bias-aligned distribution in the feature space of a biased model, thus allowing for precisely detecting them with an anomaly detection method. Coupling the proposed bias identification approach with bias-conflicting data upsampling and augmentation in a two-step strategy, we reach state-of-the-art performance on synthetic and real benchmark datasets. Ultimately, our proposed approach shows that the data bias issue does not necessarily require complex debiasing methods, given that an accurate bias identification procedure is defined. Source code is available at https://github.com/Malga-Vision/MoDAD
      [COMMENTS]13 pages, 8 figures; Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025
      [LINK]http://arxiv.org/abs/2407.17449v3
      [DATE]2024-12-05 21:37:51+08:00
      [CATEGORIES]cs.LG
    • GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning
      [AUTHORS]Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang
      [ABSTRACT]Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next generation sequencing cost has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To address the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This poses the question: How effectively do deep learning methods classify unknown GVs and align them with clinically-verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an effective approach for addressing both indexing and classification challenges. We introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring variable-length contexts and detailed annotations, designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts. Our contributions are three-fold: (i) Construction of a comprehensive dataset with 7 million records, each labeled with characteristics of the corresponding variants, alongside additional data from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant combinations, and 156 unique clinically verified GVs from real-world patients. (ii) Analysis of the structure and properties of the dataset. (iii) Experimentation of the dataset with pre-trained GFMs. The results show a significant gap between GFMs' current capabilities and accurate GV representation. We hope this dataset will help advance genomic deep learning to bridge this gap.
      [COMMENTS]Preprint
      [LINK]http://arxiv.org/abs/2407.16940v2
      [DATE]2024-12-05 21:30:16+08:00
      [CATEGORIES]cs.LG
    • Learning Semantic Association Rules from Internet of Things Data
      [AUTHORS]Erkan Karabulut, Paul Groth, Victoria Degeler
      [ABSTRACT]Association Rule Mining (ARM) is the task of discovering commonalities in data in the form of logical implications. ARM is used in the Internet of Things (IoT) for different tasks including monitoring and decision-making. However, existing methods give limited consideration to IoT-specific requirements such as heterogeneity and volume. Furthermore, they do not utilize important static domain-specific description data about IoT systems, which is increasingly represented as knowledge graphs. In this paper, we propose a novel ARM pipeline for IoT data that utilizes both dynamic sensor data and static IoT system metadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method (Aerial) as part of the pipeline to address the high volume of IoT data and reduce the total number of rules that are resource-intensive to process. Aerial learns a neural representation of a given data and extracts association rules from this representation by exploiting the reconstruction (decoding) mechanism of an autoencoder. Extensive evaluations on 3 IoT datasets from 2 domains show that ARM on both static and dynamic IoT data results in more generically applicable rules while Aerial can learn a more concise set of high-quality association rules than the state-of-the-art with full coverage over the datasets.
      [LINK]http://arxiv.org/abs/2412.03417v2
      [DATE]2024-12-05 21:22:28+08:00
      [CATEGORIES]cs.LG
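
The abstract above describes association rules as logical implications mined from data. As background, the sketch below computes the two classic rule-quality measures, support and confidence, over a handful of made-up, discretized sensor readings. This is textbook association rule mining, not the Aerial method; the item names and the helper functions `support` and `confidence` are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # The rule never fires, so report 0 by convention.
    joint = set(antecedent) | set(consequent)
    return support(transactions, joint) / antecedent_support


# Hypothetical discretized sensor snapshots (one set of items per time step).
readings = [
    {"temp=high", "fan=on"},
    {"temp=high", "fan=on"},
    {"temp=high", "fan=off"},
    {"temp=low", "fan=off"},
]
rule_support = support(readings, {"temp=high", "fan=on"})
rule_confidence = confidence(readings, {"temp=high"}, {"fan=on"})
```

Here the rule "temp=high implies fan=on" holds in two of four snapshots (support 0.5) and in two of the three snapshots where the temperature is high (confidence 2/3).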
    • DeiSAM: Segment Anything with Deictic Prompting
      [AUTHORS]Hikaru Shindo, Manuel Brack, Gopika Sudhakaran, Devendra Singh Dhami, Patrick Schramowski, Kristian Kersting
      [ABSTRACT]Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on deictic descriptions in natural language, i.e., referring to something depending on the context such as "The object that is on the desk and behind the cup.". However, deep learning approaches cannot reliably interpret such deictic representations due to their lack of reasoning capabilities in complex scenarios. To remedy this issue, we propose DeiSAM -- a combination of large pre-trained neural networks with differentiable logic reasoners -- for deictic promptable segmentation. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual input and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines for deictic promptable segmentation.
      [COMMENTS]Published as a conference paper at NeurIPS 2024
      [LINK]http://arxiv.org/abs/2402.14123v2
      [DATE]2024-12-05 21:15:34+08:00
      [CATEGORIES]cs.LG
    • Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
      [AUTHORS]Jason Vega, Junsheng Huang, Gaokai Zhang, Hangoo Kang, Minjia Zhang, Gagandeep Singh
      [ABSTRACT]Safety alignment of Large Language Models (LLMs) has recently become a critical objective of model developers. In response, a growing body of work has been investigating how safety alignment can be bypassed through various jailbreaking methods, such as adversarial attacks. However, these jailbreak methods can be rather costly or involve a non-trivial amount of creativity and effort, introducing the assumption that malicious users are high-resource or sophisticated. In this paper, we study how simple random augmentations to the input prompt affect safety alignment effectiveness in state-of-the-art LLMs, such as Llama 3 and Qwen 2. We perform an in-depth evaluation of 17 different models and investigate the intersection of safety under random augmentations with multiple dimensions: augmentation type, model size, quantization, fine-tuning-based defenses, and decoding strategies (e.g., sampling temperature). We show that low-resource and unsophisticated attackers, i.e. "stochastic monkeys", can significantly improve their chances of bypassing alignment with just 25 random augmentations per prompt. Source code and data: https://github.com/uiuc-focal-lab/stochastic-monkeys/
      [COMMENTS]v2: Updated with changes from peer review rebuttal. v1: Version under peer review
      [LINK]http://arxiv.org/abs/2411.02785v2
      [DATE]2024-12-05 20:58:44+08:00
      [CATEGORIES]cs.LG
    • DeepFEA: Deep Learning for Prediction of Transient Finite Element Analysis Solutions
      [AUTHORS]Georgios Triantafyllou, Panagiotis G. Kalozoumis, George Dimas, Dimitris K. Iakovidis
      [ABSTRACT]Finite Element Analysis (FEA) is a powerful but computationally intensive method for simulating physical phenomena. Recent advancements in machine learning have led to surrogate models capable of accelerating FEA. Yet there are still limitations in developing surrogates of transient FEA models that can simultaneously predict the solutions for both nodes and elements with applicability on both the 2D and 3D domains. Motivated by this research gap, this study proposes DeepFEA, a deep learning-based framework that leverages a multilayer Convolutional Long Short-Term Memory (ConvLSTM) network branching into two parallel convolutional neural networks to predict the solutions for both nodes and elements of FEA models. The proposed network is optimized using a novel adaptive learning algorithm, called Node-Element Loss Optimization (NELO). NELO minimizes the error occurring at both branches of the network enabling the prediction of solutions for transient FEA simulations. The experimental evaluation of DeepFEA is performed on three datasets in the context of structural mechanics, generated to serve as publicly available reference datasets. The results show that DeepFEA can achieve less than 3% normalized mean and root mean squared error for 2D and 3D simulation scenarios, and inference times that are two orders of magnitude faster than FEA. In contrast, relevant state-of-the-art methods face challenges with multi-dimensional output and dynamic input prediction. Furthermore, DeepFEA's robustness was demonstrated in a real-life biomedical scenario, confirming its suitability for accurate and efficient predictions of FEA simulations.
      [COMMENTS]This work has been submitted to a journal for possible publication
      [LINK]http://arxiv.org/abs/2412.04121v1
      [DATE]2024-12-05 20:46:18+08:00
      [CATEGORIES]cs.LG
    • Group Distributionally Robust Optimization can Suppress Class Imbalance Effect in Network Traffic Classification
      [AUTHORS]Wumei Du, Dong Liang, Yiqin Lv, Xingxing Liang, Guanlin Wu, Qi Wang, Zheng Xie
      [ABSTRACT]Internet services have led to the eruption of network traffic, and machine learning on these Internet data has become an indispensable tool, especially when the application is risk-sensitive. This paper focuses on network traffic classification in the presence of class imbalance, which fundamentally and ubiquitously exists in Internet data analysis. Class imbalance tends to shift the learned decision boundary away from the optimal one, resulting in suboptimal machine learning models. To alleviate this effect, we design strategies based on group distributionally robust optimization. Our approach iteratively updates the non-parametric weights for separate classes and optimizes the learning model by minimizing reweighted losses. We interpret the optimization process as a Stackelberg game and perform extensive experiments on typical benchmarks. Results show that our approach can not only suppress the negative effect of class imbalance but also improve the comprehensive performance in prediction.
      [LINK]http://arxiv.org/abs/2409.19214v2
      [DATE]2024-12-05 20:45:09+08:00
      [CATEGORIES]cs.LG
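The reweighting loop described in the abstract above (update per-class weights, then minimize the reweighted loss) can be sketched roughly as follows. This is only an illustration, assuming the exponentiated-gradient weight update commonly used in group distributionally robust optimization; the paper's actual update rule may differ, and the function names, `step_size`, and the loss values are hypothetical.

```python
import math

def update_class_weights(weights, class_losses, step_size=0.1):
    """Exponentiated-gradient step: classes with higher loss gain weight.

    Assumption: this is the generic group-DRO update, not necessarily
    the exact rule used in the paper.
    """
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, class_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalize so weights sum to 1

def reweighted_loss(weights, class_losses):
    """Objective the model would minimize for the current class weights."""
    return sum(w * loss for w, loss in zip(weights, class_losses))

weights = [1 / 3, 1 / 3, 1 / 3]
class_losses = [0.2, 1.0, 0.5]  # hypothetical per-class average losses
weights = update_class_weights(weights, class_losses)
```

In a full training loop, the weight update and a gradient step on `reweighted_loss` would alternate, so the classes the model currently handles worst receive more emphasis.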
    • Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
      [AUTHORS]Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause
      [ABSTRACT]Recent efforts in fine-tuning language models often rely on automatic data selection, commonly using Nearest Neighbor retrieval from large datasets. However, we theoretically show that this approach tends to select redundant data, limiting its effectiveness or even hurting performance. To address this, we introduce SIFT, a data selection algorithm designed to reduce uncertainty about the model's response given a prompt, which unifies ideas from retrieval and active learning. Whereas Nearest Neighbor retrieval typically fails in the presence of information duplication, SIFT accounts for information duplication and optimizes the overall information gain of the selected examples. We focus our evaluations on fine-tuning at test-time for prompt-specific language modeling on the Pile dataset, and show that SIFT consistently outperforms Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we show that our uncertainty estimates can predict the performance gain of test-time fine-tuning, and use this to develop an adaptive algorithm that invests test-time compute proportional to realized performance gains. We provide the activeft (Active Fine-Tuning) library which can be used as a drop-in replacement for Nearest Neighbor retrieval.
      [LINK]http://arxiv.org/abs/2410.08020v2
      [DATE]2024-12-05 20:40:16+08:00
      [CATEGORIES]cs.LG
    • Federated Learning in Mobile Networks: A Comprehensive Case Study on Traffic Forecasting
      [AUTHORS]Nikolaos Pavlidis, Vasileios Perifanis, Selim F. Yilmaz, Francesc Wilhelmi, Marco Miozzo, Pavlos S. Efraimidis, Remous-Aris Koutsiamanis, Pavol Mulinka, Paolo Dini
      [ABSTRACT]The increasing demand for efficient resource allocation in mobile networks has catalyzed the exploration of innovative solutions that could enhance the task of real-time cellular traffic prediction. Under these circumstances, federated learning (FL) stands out as a distributed and privacy-preserving solution to foster collaboration among different sites, thus enabling responsive near-the-edge solutions. In this paper, we comprehensively study the potential benefits of FL in telecommunications through a case study on federated traffic forecasting using real-world data from base stations (BSs) in Barcelona (Spain). Our study encompasses relevant aspects within the federated experience, including model aggregation techniques, outlier management, the impact of individual clients, personalized learning, and the integration of exogenous sources of data. The performed evaluation is based on both prediction accuracy and sustainability, thus showcasing the environmental impact of employed FL algorithms in various settings. The findings from our study highlight FL as a promising and robust solution for mobile traffic prediction, emphasizing its twin merits as a privacy-conscious and environmentally sustainable approach, while also demonstrating its capability to overcome data heterogeneity and ensure high-quality predictions, marking a significant stride towards its integration in mobile traffic management systems.
      [LINK]http://arxiv.org/abs/2412.04081v1
      [DATE]2024-12-05 19:32:14+08:00
      [CATEGORIES]cs.LG
    • VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset
      [AUTHORS]Orest Kupyn, Eugene Khvedchenia, Christian Rupprecht
      [ABSTRACT]Human head detection, keypoint estimation, and 3D head model fitting are essential tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads -- a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads.
      [LINK]http://arxiv.org/abs/2407.18245v2
      [DATE]2024-12-05 19:29:56+08:00
      [CATEGORIES]cs.LG
    • Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation
      [AUTHORS]Weihua Wang, Qiuyu Liang, Feilong Bao, Guanglai Gao
      [ABSTRACT]A quaternion contains one real part and three imaginary parts, which provides a more expressive hypercomplex space for learning knowledge graphs. Existing quaternion embedding models measure the plausibility of a triplet either through semantic matching or geometric distance scoring functions. However, it appears that semantic matching diminishes the separability of entities, while the distance scoring function weakens the semantics of entities. To address this issue, we propose a novel quaternion knowledge graph embedding model. Our model combines semantic matching with the geometric distance between entities to better measure the plausibility of triplets. Specifically, in the quaternion space, we perform a right rotation on the head entity and a reverse rotation on the tail entity to learn rich semantic features. Then, we utilize distance-adaptive translations to learn the geometric distance between entities. Furthermore, we provide mathematical proofs to demonstrate that our model can handle complex logical relationships. Extensive experimental results and analyses show our model significantly outperforms previous models on well-known knowledge graph completion benchmark datasets. Our code is available at https://github.com/llqy123/DaBR.
      [COMMENTS]Accepted by COLING 2025
      [LINK]http://arxiv.org/abs/2412.04076v1
      [DATE]2024-12-05 19:17:03+08:00
      [CATEGORIES]cs.LG
    • Integrated Sensing and Communications for Low-Altitude Economy: A Deep Reinforcement Learning Approach
      [AUTHORS]Xiaowen Ye, Yuyi Mao, Xianghao Yu, Shu Sun, Liqun Fu, Jie Xu
      [ABSTRACT]This paper studies an integrated sensing and communications (ISAC) system for low-altitude economy (LAE), where a ground base station (GBS) provides communication and navigation services for authorized unmanned aerial vehicles (UAVs), while sensing the low-altitude airspace to monitor the unauthorized mobile target. The expected communication sum-rate over a given flight period is maximized by jointly optimizing the beamforming at the GBS and UAVs' trajectories, subject to the constraints on the average signal-to-noise ratio requirement for sensing, the flight mission and collision avoidance of UAVs, as well as the maximum transmit power at the GBS. Typically, this is a sequential decision-making problem with the given flight mission. Thus, we transform it to a specific Markov decision process (MDP) model called episode task. Based on this modeling, we propose a novel LAE-oriented ISAC scheme, referred to as Deep LAE-ISAC (DeepLSC), by leveraging the deep reinforcement learning (DRL) technique. In DeepLSC, a reward function and a new action selection policy termed constrained noise-exploration policy are judiciously designed to fulfill various constraints. To enable efficient learning in episode tasks, we develop a hierarchical experience replay mechanism, where the gist is to employ all experiences generated within each episode to jointly train the neural network. Besides, to enhance the convergence speed of DeepLSC, a symmetric experience augmentation mechanism, which simultaneously permutes the indexes of all variables to enrich available experience sets, is proposed. Simulation results demonstrate that compared with benchmarks, DeepLSC yields a higher sum-rate while meeting the preset constraints, achieves faster convergence, and is more robust against different settings.
      [COMMENTS]submitted for an IEEE publication
      [LINK]http://arxiv.org/abs/2412.04074v1
      [DATE]2024-12-05 19:12:46+08:00
      [CATEGORIES]cs.LG
    • Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream
      [AUTHORS]Abdulkadir Gokce, Martin Schrimpf
      [ABSTRACT]When trained on large-scale object classification datasets, certain artificial neural network models begin to approximate core object recognition (COR) behaviors and neural response patterns in the primate visual ventral stream (VVS). While recent machine learning advances suggest that scaling model size, dataset size, and compute resources improve task performance, the impact of scaling on brain alignment remains unclear. In this study, we explore scaling laws for modeling the primate VVS by systematically evaluating over 600 models trained under controlled conditions on benchmarks spanning V1, V2, V4, IT and COR behaviors. We observe that while behavioral alignment continues to scale with larger models, neural alignment saturates. This observation remains true across model architectures and training datasets, even though models with stronger inductive bias and datasets with higher-quality images are more compute-efficient. Increased scaling is especially beneficial for higher-level visual areas, where small models trained on few samples exhibit only poor alignment. Finally, we develop a scaling recipe, indicating that a greater proportion of compute should be allocated to data samples over model size. Our results suggest that while scaling alone might suffice for alignment with human core object recognition behavior, it will not yield improved models of the brain's visual ventral stream with current architectures and datasets, highlighting the need for novel strategies in building brain-like models.
      [COMMENTS]10 pages for the main paper, 23 pages in total. 7 main figures and 7 supplementary figures. Code, model weights, and benchmark results can be accessed at https://github.com/epflneuroailab/scaling-primate-vvs - In version 2, Figure 7 and the related discussion are added, and the appendix is updated
      [LINK]http://arxiv.org/abs/2411.05712v2
      [DATE]2024-12-05 17:39:07+08:00
      [CATEGORIES]cs.LG
    • Blind Underwater Image Restoration using Co-Operational Regressor Networks
      [AUTHORS]Ozer Can Devecioglu, Serkan Kiranyaz, Turker Ince, Moncef Gabbouj
      [ABSTRACT]The exploration of underwater environments is essential for applications such as biological research, archaeology, and infrastructure maintenance. However, underwater imaging is challenging due to the water's unique properties, including scattering, absorption, color distortion, and reduced visibility. To address such visual degradations, a variety of approaches have been proposed, ranging from basic signal processing methods to deep learning models; however, none of them has proven to be consistently successful. In this paper, we propose a novel machine learning model, Co-Operational Regressor Networks (CoRe-Nets), designed to achieve the best possible underwater image restoration. A CoRe-Net consists of two co-operating networks: the Apprentice Regressor (AR), responsible for image transformation, and the Master Regressor (MR), which evaluates the Peak Signal-to-Noise Ratio (PSNR) of the images generated by the AR and feeds it back to the AR. CoRe-Nets are built on Self-Organized Operational Neural Networks (Self-ONNs), which offer a superior learning capability by modulating nonlinearity in kernel transformations. The effectiveness of the proposed model is demonstrated on the benchmark Large Scale Underwater Image (LSUI) dataset. Leveraging the joint learning capabilities of the two cooperating networks, the proposed model achieves state-of-the-art restoration performance with significantly reduced computational complexity and, with a 2-pass application, often produces results that can even surpass the visual quality of the ground truth. Our results and the optimized PyTorch implementation of the proposed approach are now publicly shared on GitHub.
      [COMMENTS]11 pages
      [LINK]http://arxiv.org/abs/2412.03995v1
      [DATE]2024-12-05 17:15:21+08:00
      [CATEGORIES]cs.LG
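The abstract above relies on the Peak Signal-to-Noise Ratio (PSNR) as the quality signal passed between the two networks. As a reminder of what that quantity measures, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE), on flattened pixel lists. The function name and the 8-bit default `max_value=255.0` are assumptions for illustration; this is not the paper's implementation, in which the Master Regressor estimates PSNR with a network.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Standard PSNR in dB between two equally sized, flattened images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel example: every pixel is off by one intensity level.
score = psnr([0, 0, 0, 0], [1, 1, 1, 1])
```

Higher values mean the restored image is closer to the reference; the measure is purely pixel-wise and does not capture perceptual quality on its own.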
    • LaserGuider: A Laser Based Physical Backdoor Attack against Deep Neural Networks
      [AUTHORS]Yongjie Xu, Guangke Chen, Fu Song, Yuqi Chen
      [ABSTRACT]Backdoor attacks embed hidden associations between triggers and targets in deep neural networks (DNNs), causing them to predict the target when a trigger is present while maintaining normal behavior otherwise. Physical backdoor attacks, which use physical objects as triggers, are feasible but lack remote control, temporal stealthiness, flexibility, and mobility. To overcome these limitations, in this work, we propose a new type of backdoor trigger utilizing lasers that feature long-distance transmission and instant-imaging properties. Based on the laser-based backdoor triggers, we present a physical backdoor attack, called LaserGuider, which possesses remote control ability and achieves high temporal stealthiness, flexibility, and mobility. We also introduce a systematic approach to optimize laser parameters for improving attack effectiveness. Our evaluation on traffic sign recognition DNNs, critical in autonomous vehicles, demonstrates that LaserGuider with three different laser-based triggers achieves over 90% attack success rate with negligible impact on normal inputs. Additionally, we release LaserMark, the first dataset of real-world traffic signs stamped with physical laser spots, to support further research in backdoor attacks and defenses.
      [COMMENTS]In Proceedings of the 23rd International Conference on Applied Cryptography and Network Security (ACNS), Munich, Germany, 23-26 June, 2025
      [LINK]http://arxiv.org/abs/2412.03993v1
      [DATE]2024-12-05 17:14:50+08:00
      [CATEGORIES]cs.LG
    • Local Curvature Smoothing with Stein's Identity for Efficient Score Matching
      [AUTHORS]Genki Osada, Makoto Shing, Takashi Nishide
      [ABSTRACT]The training of score-based diffusion models (SDMs) is based on score matching. The challenge of score matching is that it includes a computationally expensive Jacobian trace. While several methods have been proposed to avoid this computation, each has drawbacks, such as instability during training and approximating the learning as learning a denoising vector field rather than a true score. We propose a novel score matching variant, local curvature smoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by applying Stein's identity, enabling regularization effectiveness and efficient computation. We show that LCSS surpasses existing methods in sample generation performance and matches the performance of denoising score matching, widely adopted by most SDMs, in evaluations such as FID, Inception score, and bits per dimension. Furthermore, we show that LCSS enables realistic image generation even at a high resolution of $1024 \times 1024$.
      [COMMENTS]Accepted at NeurIPS 2024
      [LINK]http://arxiv.org/abs/2412.03962v1
      [DATE]2024-12-05 16:26:13+08:00
      [CATEGORIES]cs.LG
    • Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16
      [AUTHORS]José Camacho, Katarzyna Wasielewska, Pablo Espinosa, Marta Fuentes-García
      [ABSTRACT]Autonomous or self-driving networks are expected to provide a solution to the myriad of extremely demanding new applications with minimal human supervision. For this purpose, the community relies on the development of new Machine Learning (ML) models and techniques. However, ML can only be as good as the data it is fitted with, and data quality is an elusive concept difficult to assess. In this paper, we show that relatively minor modifications on a benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) cause significantly more impact on model performance than the specific ML technique considered. We also show that the measured model performance is uncertain, as a result of labelling inaccuracies. Our findings illustrate that the widely adopted approach of comparing a set of models in terms of performance results (e.g., in terms of accuracy or ROC curves) may lead to incorrect conclusions when done without a proper understanding of dataset biases and sensitivity. We contribute a methodology to interpret a model response that can be useful for this understanding.
      [LINK]http://arxiv.org/abs/2305.19770v2
      [DATE]2024-12-05 15:46:11+08:00
      [CATEGORIES]cs.LG
    • JANUS: A Difference-Oriented Analyzer For Financial Centralization Risks in Smart Contracts
      [AUTHORS]Wansen Wang, Pu Zhang, Renjie Ji, Wenchao Huang, Zhaoyi Meng, Yan Xiong
      [ABSTRACT]Some smart contracts violate decentralization principles by defining privileged accounts that manage other users' assets without permission, introducing centralization risks that have caused financial losses. Existing methods, however, face challenges in accurately detecting diverse centralization risks due to their dependence on predefined behavior patterns. In this paper, we propose JANUS, an automated analyzer for Solidity smart contracts that detects financial centralization risks independently of their specific behaviors. JANUS identifies differences between states reached by privileged and ordinary accounts, and analyzes whether these differences are finance-related. Focusing on the impact of risks rather than behaviors, JANUS achieves improved accuracy compared to existing tools and can uncover centralization risks with unknown patterns. To evaluate JANUS's performance, we compare it with other tools using a dataset of 540 contracts. Our evaluation demonstrates that JANUS outperforms representative tools in terms of detection accuracy for financial centralization risks. Additionally, we evaluate JANUS on a real-world dataset of 33,151 contracts, successfully identifying two types of risks that other tools fail to detect. We also prove that the state traversal method and variable summaries, which are used in JANUS to reduce the number of states to be compared, do not introduce false alarms or omissions in detection.
      [LINK]http://arxiv.org/abs/2412.03938v1
      [DATE]2024-12-05 15:35:56+08:00
      [CATEGORIES]cs.LG
    • Developing a Thailand solar irradiance map using Himawari-8 satellite imageries and deep learning models
      [AUTHORS]Suwichaya Suwanwimolkul, Natanon Tongamrak, Nuttamon Thungka, Naebboon Hoonchareon, Jitkomut Songsiri
      [ABSTRACT]This paper presents an online platform showing a solar irradiance map of Thailand every 30 minutes, available at https://www.cusolarforecast.com. The methodology for estimating global horizontal irradiance (GHI) across Thailand relies on the cloud index extracted from Himawari-8 satellite imagery, the Ineichen clear-sky model with locally-tuned Linke turbidity, and machine learning models. The methods take clear-sky irradiance, cloud index, re-analyzed GHI and temperature data from the MERRA-2 database, and date-time as inputs for GHI estimation models, including LightGBM, LSTM, Informer, and Transformer. These are benchmarked against the estimate from a commercial service X by evaluation on 15-minute ground GHI data from 53 ground stations over 1.5 years during 2022-2023. The results show that the four models exhibit comparable overall MAE performance to the service X. The best model is LightGBM with an overall MAE of 78.58 W/sqm and RMSE of 118.97 W/sqm, while the service X achieves the lowest MAE, RMSE, and MBE under cloudy conditions. Obtaining re-analyzed MERRA-2 data for the whole Thailand region is not economically feasible for deployment. When removing these features, the Informer model has a winning performance in MAE of 78.67 W/sqm. The obtained performance aligns with existing literature by taking the climate zone and time granularity of data into consideration. As the map shows an estimate of GHI over 93,000 grids with a frequent update, the paper also describes a computational framework for displaying the entire map. It tests the runtime performance of deep learning models in the GHI estimation process.
      [COMMENTS]23 pages, 14 figures
      [LINK]http://arxiv.org/abs/2409.16320v3
      [DATE]2024-12-05 15:14:52+08:00
      [CATEGORIES]cs.LG
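The abstract above compares models using MAE, RMSE, and MBE. For readers unfamiliar with these metrics, the sketch below computes all three from paired estimates and ground measurements. The function name and sample values are hypothetical, and the MBE sign convention (estimate minus measurement, so a positive value means overestimation) is an assumption; the paper may define it the other way round.

```python
import math

def ghi_errors(estimated, measured):
    """Return (MAE, RMSE, MBE) for paired GHI values in W/sqm."""
    if len(estimated) != len(measured) or not estimated:
        raise ValueError("inputs must be non-empty and the same length")
    # Assumed convention: error = estimate - measurement.
    errors = [e - m for e, m in zip(estimated, measured)]
    n = len(errors)
    mae = sum(abs(err) for err in errors) / n             # mean absolute error
    rmse = math.sqrt(sum(err * err for err in errors) / n)  # root mean squared error
    mbe = sum(errors) / n                                 # mean bias error (signed)
    return mae, rmse, mbe

# Hypothetical readings: one overestimate, one underestimate, one exact.
mae, rmse, mbe = ghi_errors([110.0, 90.0, 100.0], [100.0, 100.0, 100.0])
```

RMSE penalizes large individual errors more heavily than MAE, while MBE can be near zero even when both are large, because over- and underestimates cancel.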
    • MT3DNet: Multi-Task learning Network for 3D Surgical Scene Reconstruction
      [AUTHORS]Mithun Parab, Pranay Lendave, Jiyoung Kim, Thi Quynh Dan Nguyen, Palash Ingle
      [ABSTRACT]In image-assisted minimally invasive surgeries (MIS), understanding surgical scenes is vital for real-time feedback to surgeons, skill evaluation, and improving outcomes through collaborative human-robot procedures. Within this context, the challenge lies in accurately detecting, segmenting, and estimating the depth of surgical scenes depicted in high-resolution images, while simultaneously reconstructing the scene in 3D and providing segmentation of surgical instruments along with detection labels for each instrument. To address this challenge, a novel Multi-Task Learning (MTL) network is proposed for performing these tasks concurrently. A key aspect of this approach involves overcoming the optimization hurdles associated with handling multiple tasks concurrently by integrating an Adversarial Weight Update into the MTL framework. The proposed MTL model achieves 3D reconstruction through the integration of segmentation, depth estimation, and object detection, thereby enhancing the understanding of surgical scenes, which marks a significant advancement compared to existing studies that lack 3D capabilities. Comprehensive experiments on the EndoVis2018 benchmark dataset underscore the adeptness of the model in efficiently addressing all three tasks, demonstrating the efficacy of the proposed techniques.
      [LINK]http://arxiv.org/abs/2412.03928v1
      [DATE]2024-12-05 15:07:35+08:00
      [CATEGORIES]cs.LG
    • MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models
      [AUTHORS]Ming-Chang Chiu, Shicheng Wen, Pin-Yu Chen, Xuezhe Ma
      [ABSTRACT]In vision-language models (VLMs), the ability to perceive and interpret color and physical environment is crucial for achieving contextually accurate understanding and interaction. However, despite advances in multimodal modeling, there remains a significant lack of specialized datasets that rigorously evaluate a model's capacity to discern subtle color variations and spatial context -- critical elements for situational comprehension and reliable deployment across real-world applications. Toward that goal, we curate MegaCOIN, a high-quality, human-labeled dataset based on real images with various contextual attributes. MegaCOIN consists of two parts: MegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for VLMs; and MegaCOIN-Bench, an annotated test set that can be used as a stand-alone QA dataset. MegaCOIN provides three annotated features for 220,000 real images: foreground color, background color, and description of an object's physical environment, constituting 660k human annotations. In addition, MegaCOIN can be applied to benchmark domain generalization (DG) algorithms. We explore benchmarking DG methods in the linear probing setup for VLM and show some new insights. Last but not least, we show that VLMs, including GPT-4o, have subpar color recognition capabilities, and fine-tuning with MegaCOIN can result in improved performance on visual evaluation tasks. In certain cases, MegaCOIN fine-tuned small-scale open-source models such as LLaVA and Bunny can outperform closed-source GPT-4o. We hope the utilities of MegaCOIN can shed light on the directions VLMs can improve and provide a more complex platform for domain generalization algorithms.
      [COMMENTS]8 pages, 13 tables, 2 figures
      [LINK]http://arxiv.org/abs/2412.03927v1
      [DATE]2024-12-05 15:06:17+08:00
      [CATEGORIES]cs.LG
    • Quantized and Interpretable Learning Scheme for Deep Neural Networks in Classification Task
      [AUTHORS]Alireza Maleki, Mahsa Lavaei, Mohsen Bagheritabar, Salar Beigzad, Zahra Abadi
      [ABSTRACT]Deep learning techniques have proven highly effective in image classification, but their deployment in resource-constrained environments remains challenging due to high computational demands. Furthermore, their interpretability is of high importance, which demands even more available resources. In this work, we introduce an approach that combines saliency-guided training with quantization techniques to create an interpretable and resource-efficient model without compromising accuracy. We utilize Parameterized Clipping Activation (PACT) to perform quantization-aware training, specifically targeting activations and weights to optimize precision while minimizing resource usage. Concurrently, saliency-guided training is employed to enhance interpretability by iteratively masking features with low gradient values, leading to more focused and meaningful saliency maps. This training procedure helps in mitigating noisy gradients and yields models that provide clearer, more interpretable insights into their decision-making processes. To evaluate the impact of our approach, we conduct experiments using well-known Convolutional Neural Network (CNN) architectures on two popular benchmark datasets, MNIST and CIFAR-10. We compare the saliency maps generated by standard and quantized models to assess the influence of quantization on both interpretability and classification accuracy. Our results demonstrate that the combined use of saliency-guided training and PACT-based quantization not only maintains classification performance but also produces models that are significantly more efficient and interpretable, making them suitable for deployment in resource-limited settings.
      [LINK]http://arxiv.org/abs/2412.03915v1
      [DATE]2024-12-05 14:34:06+08:00
      [CATEGORIES]cs.LG
    • Can Targeted Clean-Label Poisoning Attacks Generalize?
      [AUTHORS]Zhizhen Chen, Subrat Kishore Dutta, Zhengyu Zhao, Chenhao Lin, Chao Shen, Xiao Zhang
      [ABSTRACT]Targeted poisoning attacks aim to compromise the model's prediction on specific target samples. In a common clean-label setting, they are achieved by slightly perturbing a subset of training samples given access to those specific targets. Despite continuous efforts, it remains unexplored whether such attacks can generalize to unknown variations of those targets. In this paper, we take the first step to systematically study this generalization problem. Observing that the widely adopted, cosine similarity-based attack exhibits limited generalizability, we propose a well-generalizable attack that leverages both the direction and magnitude of model gradients. In particular, we explore diverse target variations, such as an object with varied viewpoints and an animal species with distinct appearances. Extensive experiments across various generalization scenarios demonstrate that our method consistently achieves the best attack effectiveness. For example, our method outperforms the cosine similarity-based attack by 20.95% in attack success rate with similar overall accuracy, averaged over four models on two image benchmark datasets. The code is available at https://github.com/jiaangk/generalizable_tcpa
      [COMMENTS]12 pages, 5 figures, 5 tables
      [LINK]http://arxiv.org/abs/2412.03908v1
      [DATE]2024-12-05 14:27:14+08:00
      [CATEGORIES]cs.LG
    • HERO: Hint-Based Efficient and Reliable Query Optimizer
      [AUTHORS]Sergey Zinchenko, Sergey Iazov
      [ABSTRACT]We propose a novel model for learned query optimization which provides query hints leading to better execution plans. The model addresses the three key challenges in learned hint-based query optimization: reliable hint recommendation (ensuring non-degradation of query latency), efficient hint exploration, and fast inference. We provide an in-depth analysis of existing NN-based approaches to hint-based optimization and experimentally confirm the named challenges for them. Our alternative solution consists of a new inference schema based on an ensemble of context-aware models and a graph storage for reliable hint suggestion and fast inference, and a budget-controlled training procedure with a local search algorithm that solves the issue of exponential search space exploration. In experiments on standard benchmarks, our model demonstrates optimization capability close to the best achievable with coarse-grained hints. Controlling the degree of parallelism (query dop) in addition to operator-related hints enables our model to achieve 3x latency improvement on JOB benchmark which sets a new standard for optimization. Our model is interpretable and easy to debug, which is particularly important for deployment in production.
      [COMMENTS]Submitted to VLDB 2025; 13 pages; 13 figures
      [LINK]http://arxiv.org/abs/2412.02372v2
      [DATE]2024-12-05 14:00:34+08:00
      [CATEGORIES]cs.LG
    • RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy
      [AUTHORS]Geonho Lee, Janghwan Lee, Sukjin Hong, Minsoo Kim, Euijai Ahn, Du-Seong Chang, Jungwook Choi
      [ABSTRACT]Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, with no prior investigation into understanding this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand this fundamental limitation and boost 2-bit LLM accuracy. Based on rank analysis revealing the rank-insensitive nature of the model-wise activation discrepancy loss, RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ's consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods, enabling adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance.
      [COMMENTS]The typo in Table 4 has been corrected
      [LINK]http://arxiv.org/abs/2412.01129v2
      [DATE]2024-12-05 13:05:01+08:00
      [CATEGORIES]cs.LG
    • Training MLPs on Graphs without Supervision
      [AUTHORS]Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye
      [ABSTRACT]Graph Neural Networks (GNNs) have demonstrated their effectiveness in various graph learning tasks, yet their reliance on neighborhood aggregation during inference poses challenges for deployment in latency-sensitive applications, such as real-time financial fraud detection. To address this limitation, recent studies have proposed distilling knowledge from teacher GNNs into student Multi-Layer Perceptrons (MLPs) trained on node content, aiming to accelerate inference. However, these approaches often inadequately explore structural information when inferring unseen nodes. To this end, we introduce SimMLP, a Self-supervised framework for learning MLPs on graphs, designed to fully integrate rich structural information into MLPs. Notably, SimMLP is the first MLP-learning method that can achieve equivalence to GNNs in the optimal case. The key idea is to employ self-supervised learning to align the representations encoded by graph context-aware GNNs and neighborhood dependency-free MLPs, thereby fully integrating the structural information into MLPs. We provide a comprehensive theoretical analysis, demonstrating the equivalence between SimMLP and GNNs based on mutual information and inductive bias, highlighting SimMLP's advanced structural learning capabilities. Additionally, we conduct extensive experiments on 20 benchmark datasets, covering node classification, link prediction, and graph classification, to showcase SimMLP's superiority over state-of-the-art baselines, particularly in scenarios involving unseen nodes (e.g., inductive and cold-start node classification) where structural insights are crucial. Our codes are available at: https://github.com/Zehong-Wang/SimMLP.
      [COMMENTS]Accepted by WSDM 25
      [LINK]http://arxiv.org/abs/2412.03864v1
      [DATE]2024-12-05 12:20:54+08:00
      [CATEGORIES]cs.LG
    • Exploring Kolmogorov-Arnold networks for realistic image sharpness assessment
      [AUTHORS]Shaode Yu, Ze Chen, Zhimu Yang, Jiacheng Gu, Bizu Feng
      [ABSTRACT]Score prediction is crucial in evaluating realistic image sharpness based on collected informative features. Recently, Kolmogorov-Arnold networks (KANs) have been developed and witnessed remarkable success in data fitting. This study introduces the Taylor series-based KAN (TaylorKAN). Then, different KANs are explored in four realistic image databases (BID2011, CID2013, CLIVE, and KonIQ-10k) to predict the scores by using 15 mid-level features and 2048 high-level features. Compared to support vector regression, results show that KANs are generally competitive or superior, and TaylorKAN is the best one when mid-level features are used. This is the first study to investigate KANs on image quality assessment that sheds some light on how to select and further improve KANs in related tasks.
      [LINK]http://arxiv.org/abs/2409.07762v3
      [DATE]2024-12-05 10:59:02+08:00
      [CATEGORIES]cs.LG
    • Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
      [AUTHORS]Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu
      [ABSTRACT]Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://fudan-generative-vision.github.io/hallo3/.
      [LINK]http://arxiv.org/abs/2412.00733v2
      [DATE]2024-12-05 10:55:56+08:00
      [CATEGORIES]cs.LG
    • Marginal Causal Flows for Validation and Inference
      [AUTHORS]Daniel de Vassimon Manela, Laura Battaglia, Robin J. Evans
      [ABSTRACT]Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging due to the inflexibility of employed models and the lack of complexity in causal benchmark datasets, which often fail to reproduce intricate real-world data patterns. In this paper we introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process, while also directly inferring the marginal causal quantities from observational data. We propose that these models are exceptionally well suited for generating synthetic data to validate causal methods. They can create synthetic datasets that closely resemble the empirical dataset, while automatically and exactly satisfying a user-defined average treatment effect. To our knowledge, Frugal Flows are the first generative model to both learn flexible data representations and also exactly parameterise quantities such as the average treatment effect and the degree of unobserved confounding. We demonstrate the above with experiments on both simulated and real-world datasets.
      [COMMENTS]23 pages, 10 figures, Accepted as a Poster at NeurIPS 2024
      [LINK]http://arxiv.org/abs/2411.01295v2
      [DATE]2024-12-05 10:49:36+08:00
      [CATEGORIES]cs.LG
    • Expressivity of Representation Learning on Continuous-Time Dynamic Graphs: An Information-Flow Centric Review
      [AUTHORS]Sofiane Ennadir, Gabriela Zarzar Gandler, Filip Cornell, Lele Cao, Oleg Smirnov, Tianze Wang, Levente Zólyomi, Björn Brinne, Sahar Asadi
      [ABSTRACT]Graphs are ubiquitous in real-world applications, ranging from social networks to biological systems, and have inspired the development of Graph Neural Networks (GNNs) for learning expressive representations. While most research has centered on static graphs, many real-world scenarios involve dynamic, temporally evolving graphs, motivating the need for Continuous-Time Dynamic Graph (CTDG) models. This paper provides a comprehensive review of Graph Representation Learning (GRL) on CTDGs with a focus on Self-Supervised Representation Learning (SSRL). We introduce a novel theoretical framework that analyzes the expressivity of CTDG models through an Information-Flow (IF) lens, quantifying their ability to propagate and encode temporal and structural information. Leveraging this framework, we categorize existing CTDG methods based on their suitability for different graph types and application scenarios. Within the same scope, we examine the design of SSRL methods tailored to CTDGs, such as predictive and contrastive approaches, highlighting their potential to mitigate the reliance on labeled data. Empirical evaluations on synthetic and real-world datasets validate our theoretical insights, demonstrating the strengths and limitations of various methods across long-range, bi-partite and community-based graphs. This work offers both a theoretical foundation and practical guidance for selecting and developing CTDG models, advancing the understanding of GRL in dynamic settings.
      [COMMENTS]12-page main paper + 8-page appendix
      [LINK]http://arxiv.org/abs/2412.03783v1
      [DATE]2024-12-05 08:12:50+08:00
      [CATEGORIES]cs.LG
    • Exploring RAG-based Vulnerability Augmentation with LLMs
      [AUTHORS]Seyed Shayan Daneshvar, Yu Nong, Xu Yang, Shaowei Wang, Haipeng Cai
      [ABSTRACT]Detecting vulnerabilities is vital for software security, yet deep learning-based vulnerability detectors (DLVD) face a data shortage, which limits their effectiveness. Data augmentation can potentially alleviate the data shortage, but augmenting vulnerable code is challenging and requires a generative solution that maintains vulnerability. Previous works have only focused on generating samples that contain single statements or specific types of vulnerabilities. Recently, large language models (LLMs) have been used to solve various code generation and comprehension tasks with inspiring results, especially when fused with retrieval augmented generation (RAG). Therefore, we propose VulScribeR, a novel LLM-based solution that leverages carefully curated prompt templates to augment vulnerable datasets. More specifically, we explore three strategies to augment both single and multi-statement vulnerabilities with LLMs, namely Mutation, Injection, and Extension. Our extensive evaluation across three vulnerability datasets and DLVD models, using two LLMs, shows that our approach beats two SOTA methods Vulgen and VGX, and Random Oversampling (ROS) by 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable samples on average, and 53.84%, 54.10%, 69.90%, and 40.93% with 15K generated vulnerable samples. Our approach demonstrates its feasibility for large-scale data augmentation by generating 1K samples for as little as US$ 1.88.
      [COMMENTS]13 pages, 6 figures, 5 tables, 3 prompt templates, 1 algorithm
      [LINK]http://arxiv.org/abs/2408.04125v2
      [DATE]2024-12-05 08:00:18+08:00
      [CATEGORIES]cs.LG
    • Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration
      [AUTHORS]Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross
      [ABSTRACT]The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps -- like MLP layers -- is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models, and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the "pizza" algorithm: the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at https://tinyurl.com/mod-add-integration.
      [LINK]http://arxiv.org/abs/2412.03773v1
      [DATE]2024-12-05 07:29:07+08:00
      [CATEGORIES]cs.LG
    • Diffusion in Zero-Shot Learning for Environmental Audio
      [AUTHORS]Ysobel Sims, Stephan Chalup, Alexandre Mendes
      [ABSTRACT]Zero-shot learning enables models to generalize to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from environmental audio zero-shot learning, where classification-based approaches dominate. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, a novel diffusion model conditioned on class auxiliary data is introduced. The diffusion model generates synthetic data for unseen classes, which is combined with seen-class data to train a classifier. Experiments are conducted on two environmental audio datasets, ESC-50 and FSC22. Results show that the diffusion model significantly outperforms all baseline methods, achieving more than 25% higher accuracy on the ESC-50 test partition. This work establishes the diffusion model as a promising generative approach for zero-shot learning and introduces the first benchmark of generative methods for environmental audio zero-shot learning, providing a foundation for future research in the field. Code is provided at https://github.com/ysims/ZeroDiffusion for the novel ZeroDiffusion method.
      [COMMENTS]This work has been submitted to the IEEE for possible publication
      [LINK]http://arxiv.org/abs/2412.03771v1
      [DATE]2024-12-05 07:18:40+08:00
      [CATEGORIES]cs.LG
    • Learning Networks from Wide-Sense Stationary Stochastic Processes
      [AUTHORS]Anirudh Rayas, Jiajun Cheng, Rajasekhar Anguluri, Deepjyoti Deka, Gautam Dasarathy
      [ABSTRACT]Complex networked systems driven by latent inputs are common in fields like neuroscience, finance, and engineering. A key inference problem here is to learn edge connectivity from node outputs (potentials). We focus on systems governed by steady-state linear conservation laws: $X_t = L^{\ast} Y_t$, where $X_t, Y_t \in \mathbb{R}^p$ denote inputs and potentials, respectively, and the sparsity pattern of the $p \times p$ Laplacian $L^{\ast}$ encodes the edge structure. Assuming $X_t$ to be a wide-sense stationary stochastic process with a known spectral density matrix, we learn the support of $L^{\ast}$ from temporally correlated samples of $Y_t$ via an $\ell_1$-regularized Whittle's maximum likelihood estimator (MLE). The regularization is particularly useful for learning large-scale networks in the high-dimensional setting where the network size $p$ significantly exceeds the number of samples $n$. We show that the MLE problem is strictly convex, admitting a unique solution. Under a novel mutual incoherence condition and certain sufficient conditions on $(n, p, d)$, we show that the ML estimate recovers the sparsity pattern of $L^{\ast}$ with high probability, where $d$ is the maximum degree of the graph underlying $L^{\ast}$. We provide recovery guarantees for $L^{\ast}$ in element-wise maximum, Frobenius, and operator norms. Finally, we complement our theoretical results with several simulation studies on synthetic and benchmark datasets, including engineered systems (power and water networks), and real-world datasets from neural systems (such as the human brain).
      [LINK]http://arxiv.org/abs/2412.03768v1
      [DATE]2024-12-05 07:14:00+08:00
      [CATEGORIES]cs.LG
    • End to End Collaborative Synthetic Data Generation
      [AUTHORS]Sikha Pentyala, Geetha Sitaraman, Trae Claar, Martine De Cock
      [ABSTRACT]The success of AI is based on the availability of data to train models. While in some cases a single data custodian may have sufficient data to enable AI, often multiple custodians need to collaborate to reach a cumulative size required for meaningful AI research. The latter is, for example, often the case for rare diseases, with each clinical site having data for only a small number of patients. Recent algorithms for federated synthetic data generation are an important step towards collaborative, privacy-preserving data sharing. Existing techniques, however, focus exclusively on synthesizer training, assuming that the training data is already preprocessed and that the desired synthetic data can be delivered in one shot, without any hyperparameter tuning. In this paper, we propose an end-to-end collaborative framework for publishing of synthetic data that accounts for privacy-preserving preprocessing as well as evaluation. We instantiate this framework with Secure Multiparty Computation (MPC) protocols and evaluate it in a use case for privacy-preserving publishing of synthetic genomic data for leukemia.
      [LINK]http://arxiv.org/abs/2412.03766v1
      [DATE]2024-12-05 07:10:51+08:00
      [CATEGORIES]cs.LG
    • Good practices for evaluation of machine learning systems
      [AUTHORS]Luciana Ferrer, Odette Scharenborg, Tom Bäckström
      [ABSTRACT]Many development decisions affect the results obtained from ML experiments: training data, features, model architecture, hyperparameters, test data, etc. Among these aspects, arguably the most important design decisions are those that involve the evaluation procedure. This procedure is what determines whether the conclusions drawn from the experiments will or will not generalize to unseen data and whether they will be relevant to the application of interest. If the data is incorrectly selected, the wrong metric is chosen for evaluation or the significance of the comparisons between models is overestimated, conclusions may be misleading or result in suboptimal development decisions. To avoid such problems, the evaluation protocol should be very carefully designed before experimentation starts. In this work we discuss the main aspects involved in the design of the evaluation protocol: data selection, metric selection, and statistical significance. This document is not meant to be an exhaustive tutorial on each of these aspects. Instead, the goal is to explain the main guidelines that should be followed in each case. We include examples taken from the speech processing field, and provide a list of common mistakes related to each aspect.
      [COMMENTS]v1.0
      [LINK]http://arxiv.org/abs/2412.03700v1
      [DATE]2024-12-05 04:30:16+08:00
      [CATEGORIES]cs.LG
    • Tight Lower Bounds and Improved Convergence in Performative Prediction
      [AUTHORS]Pedram Khorsandi, Rushil Gupta, Mehrnaz Mofakhami, Simon Lacoste-Julien, Gauthier Gidel
      [ABSTRACT]Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.
      [LINK]http://arxiv.org/abs/2412.03671v1
      [DATE]2024-12-05 03:06:19+08:00
      [CATEGORIES]cs.LG
    • Navigation World Models
      [AUTHORS]Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun
      [ABSTRACT]Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
      [COMMENTS]project page: https://www.amirbar.net/nwm/
      [LINK]http://arxiv.org/abs/2412.03572v1
      [DATE]2024-12-05 02:59:45+08:00
      [CATEGORIES]cs.LG
    • Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
      [AUTHORS]Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G. Shapiro, Ranjay Krishna
      [ABSTRACT]Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet, MLMs cannot produce intermediate depth or boxes to reason over. Finetuning MLMs on relevant data doesn't generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. AURORA leverages a VQVAE to transform intermediate image representations, such as depth maps, into a tokenized format and bounding box tokens, which are then used in a multi-task training framework. AURORA achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves on relative depth: over +6% on BLINK. With perception tokens, AURORA expands the scope of MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
      [LINK]http://arxiv.org/abs/2412.03548v1
      [DATE]2024-12-05 02:45:35+08:00
      [CATEGORIES]cs.LG
    • Evaluating Single Event Upsets in Deep Neural Networks for Semantic Segmentation: an embedded system perspective
      [AUTHORS]Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
      [ABSTRACT]As the deployment of artificial intelligence (AI) algorithms at edge devices becomes increasingly prevalent, enhancing the robustness and reliability of autonomous AI-based perception and decision systems is becoming as relevant as precision and performance, especially in application areas considered safety-critical such as autonomous driving and aerospace. This paper delves into the robustness assessment in embedded Deep Neural Networks (DNNs), particularly focusing on the impact of parameter perturbations produced by single event upsets (SEUs) on convolutional neural networks (CNNs) for image semantic segmentation. By scrutinizing the layer-by-layer and bit-by-bit sensitivity of various encoder-decoder models to soft errors, this study thoroughly investigates the vulnerability of segmentation DNNs to SEUs and evaluates the consequences of techniques like model pruning and parameter quantization on the robustness of compressed models aimed at embedded implementations. The findings offer valuable insights into the mechanisms underlying SEU-induced failures that allow for evaluating the robustness of DNNs once trained in advance. Moreover, based on the collected data, we propose a set of practical lightweight error mitigation techniques with no memory or computational cost suitable for resource-constrained deployments. The code used to perform the fault injection (FI) campaign is available at https://github.com/jonGuti13/TensorFI2 , while the code to implement proposed techniques is available at https://github.com/jonGuti13/parameterProtection .
      [LINK]http://arxiv.org/abs/2412.03630v1
      [DATE]2024-12-05 02:28:38+08:00
      [CATEGORIES]cs.LG
    • Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications
      [AUTHORS]Christina Sauer, Anne-Laure Boulesteix, Luzia Hanßum, Farina Hodiamont, Claudia Bausewein, Theresa Ullmann
      [ABSTRACT]Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning - albeit informal and unsystematic - and thus may fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the different ways algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.
      [LINK]http://arxiv.org/abs/2412.03491v1
      [DATE]2024-12-05 01:29:10+08:00
      [CATEGORIES]cs.LG
    • Coverage-Constrained Human-AI Cooperation with Multiple Experts
      [AUTHORS]Zheng Zhang, Cuong Nguyen, Kevin Wells, Thanh-Toan Do, David Rosewarne, Gustavo Carneiro
      [ABSTRACT]Human-AI cooperative classification (HAI-CC) approaches aim to develop hybrid intelligent systems that enhance decision-making in various high-stakes real-world scenarios by leveraging both human expertise and AI capabilities. Current HAI-CC methods primarily focus on learning-to-defer (L2D), where decisions are deferred to human experts, and learning-to-complement (L2C), where AI and human experts make predictions cooperatively. However, a notable research gap remains in effectively exploring both L2D and L2C under diverse expert knowledge to improve decision-making, particularly when constrained by the cooperation cost required to achieve a target probability for AI-only selection (i.e., coverage). In this paper, we address this research gap by proposing the Coverage-constrained Learning to Defer and Complement with Specific Experts (CL2DC) method. CL2DC makes final decisions through either AI prediction alone or by deferring to or complementing a specific expert, depending on the input data. Furthermore, we propose a coverage-constrained optimisation to control the cooperation cost, ensuring it approximates a target probability for AI-only selection. This approach enables an effective assessment of system performance within a specified budget. Also, CL2DC is designed to address scenarios where training sets contain multiple noisy-label annotations without any clean-label references. Comprehensive evaluations on both synthetic and real-world datasets demonstrate that CL2DC achieves superior performance compared to state-of-the-art HAI-CC methods.
      [LINK]http://arxiv.org/abs/2411.11976v2
      [DATE]2024-12-05 01:13:22+08:00
      [CATEGORIES]cs.LG
    • State Frequency Estimation for Anomaly Detection
      [AUTHORS]Clinton Cao, Agathe Blaise, Annibale Panichella, Sicco Verwer
      [ABSTRACT]Many works have studied the efficacy of state machines for detecting anomalies within NetFlows. These works typically learn a model from unlabeled data and compute anomaly scores for arbitrary traces based on their likelihood of occurrence or how well they fit within the model. However, these methods do not dynamically adapt their scores based on the traces seen at test time. This becomes a problem when an adversary produces seemingly common traces in their attack, causing the model to miss the detection by assigning low anomaly scores. We propose SEQUENT, a new approach that uses the state visit frequency to adapt its scoring for anomaly detection dynamically. SEQUENT subsequently uses the scores to generate root causes for anomalies. These allow the grouping of alarms and simplify the analysis of anomalies. Our evaluation of SEQUENT on three NetFlow datasets indicates that our approach outperforms existing methods, demonstrating its effectiveness in detecting anomalies.
      [COMMENTS]9 pages
      [LINK]http://arxiv.org/abs/2412.03442v1
      [DATE]2024-12-05 00:30:35+08:00
      [CATEGORIES]cs.LG
    • Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products
      [AUTHORS]Nadia Nahar, Christian Kästner, Jenna Butler, Chris Parnin, Thomas Zimmermann, Christian Bird
      [ABSTRACT]Large Language Models (LLMs) are increasingly embedded into software products across diverse industries, enhancing user experiences, but at the same time introducing numerous challenges for developers. Unique characteristics of LLMs force developers, who are accustomed to traditional software development and evaluation, out of their comfort zones as the LLM components shatter standard assumptions about software systems. This study explores the emerging solutions that software developers are adopting to navigate the encountered challenges. Using a mixed-methods approach, including 26 interviews and a survey with 332 responses, the study identifies 19 emerging solutions regarding quality assurance that practitioners across several product teams at Microsoft are exploring. The findings provide valuable insights that can guide the development and evaluation of LLM-based products more broadly in the face of these challenges.
      [COMMENTS]10 pages, 2 tables
      [LINK]http://arxiv.org/abs/2410.12071v2
      [DATE]2024-12-05 00:20:40+08:00
      [CATEGORIES]cs.LG
    • SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model
      [AUTHORS]Yan Li, Ziya Zhou, Zhiqiang Wang, Wei Xue, Wenhan Luo, Yike Guo
      [ABSTRACT]Recent advancements in generative models have significantly enhanced talking face video generation, yet singing video generation remains underexplored. The differences between human talking and singing limit the performance of existing talking face video generation models when applied to singing. The fundamental differences between talking and singing, specifically in audio characteristics and behavioral expressions, limit the effectiveness of existing models. We observe that the differences between singing and talking audio manifest in terms of frequency and amplitude. To address this, we have designed a multi-scale spectral module to help the model learn singing patterns in the spectral domain. Additionally, we develop a spectral-filtering module that aids the model in learning the human behaviors associated with singing audio. These two modules are integrated into the diffusion model to enhance singing video generation performance, resulting in our proposed model, SINGER. Furthermore, the lack of high-quality real-world singing face videos has hindered the development of the singing video generation community. To address this gap, we have collected an in-the-wild audio-visual singing dataset to facilitate research in this area. Our experiments demonstrate that SINGER is capable of generating vivid singing videos and outperforms state-of-the-art methods in both objective and subjective evaluations.
      [LINK]http://arxiv.org/abs/2412.03430v1
      [DATE]2024-12-05 00:19:47+08:00
      [CATEGORIES]cs.LG
    • Assessing Foundation Models' Transferability to Physiological Signals in Precision Medicine
      [AUTHORS]Matthias Christenson, Cove Geary, Brian Locke, Pranav Koirala, Warren Woodrich Pettine
      [ABSTRACT]The success of precision medicine requires computational models that can effectively process and interpret diverse physiological signals across heterogeneous patient populations. While foundation models have demonstrated remarkable transfer capabilities across various domains, their effectiveness in handling individual-specific physiological signals - crucial for precision medicine - remains largely unexplored. This work introduces a systematic pipeline for rapidly and efficiently evaluating foundation models' transfer capabilities in medical contexts. Our pipeline employs a three-stage approach. First, it leverages physiological simulation software to generate diverse, clinically relevant scenarios, particularly focusing on data-scarce medical conditions. This simulation-based approach enables both targeted capability assessment and subsequent model fine-tuning. Second, the pipeline projects these simulated signals through the foundation model to obtain embeddings, which are then evaluated using linear methods. This evaluation quantifies the model's ability to capture three critical aspects: physiological feature independence, temporal dynamics preservation, and medical scenario differentiation. Finally, the pipeline validates these representations through specific downstream medical tasks. Initial testing of our pipeline on the Moirai time series foundation model revealed significant limitations in physiological signal processing, including feature entanglement, temporal dynamics distortion, and reduced scenario discrimination. These findings suggest that current foundation models may require substantial architectural modifications or targeted fine-tuning before deployment in clinical settings.
      [COMMENTS]Presented at the precision medicine workshop at the AI in Medicine conference (2024) in Salt Lake City
      [LINK]http://arxiv.org/abs/2412.03427v1
      [DATE]2024-12-05 00:17:09+08:00
      [CATEGORIES]cs.LG
    2024 Dec 04, Wed
    • LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments
      [AUTHORS]Ruirui Chen, Weifeng Jiang, Chengwei Qin, Ishaan Singh Rawal, Cheston Tan, Dongkyu Choi, Bo Xiong, Bo Ai
      [ABSTRACT]The important challenge of keeping knowledge in Large Language Models (LLMs) up-to-date has led to the development of various methods for incorporating new facts. However, existing methods for such knowledge editing still face difficulties with multi-hop questions that require accurate fact identification and sequential logical reasoning, particularly among numerous fact updates. To tackle these challenges, this paper introduces Graph Memory-based Editing for Large Language Models (GMeLLo), a straightforward and effective method that merges the explicit knowledge representation of Knowledge Graphs (KGs) with the linguistic flexibility of LLMs. Beyond merely leveraging LLMs for question answering, GMeLLo employs these models to convert free-form language into structured queries and fact triples, facilitating seamless interaction with KGs for rapid updates and precise multi-hop reasoning. Our results show that GMeLLo significantly surpasses current state-of-the-art (SOTA) knowledge editing methods in the multi-hop question answering benchmark, MQuAKE, especially in scenarios with extensive knowledge edits.
      [LINK]http://arxiv.org/abs/2408.15903v2
      [DATE]2024-12-04 23:01:47+08:00
      [CATEGORIES]cs.CL
    • Self-Improvement in Language Models: The Sharpening Mechanism
      [AUTHORS]Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy
      [ABSTRACT]Recent work in language modeling has raised the possibility of self-improvement, where a language model evaluates and refines its own generations to achieve higher performance without external feedback. It is impossible for this self-improvement to create information that is not already in the model, so why should we expect that this will lead to improved capabilities? We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training in order to "sharpen" the model to one placing large mass on high-quality sequences, thereby amortizing the expensive inference-time computation of generating good sequences. We begin by introducing a new statistical framework for sharpening in which the learner aims to sharpen a pre-trained base policy via sample access, and establish fundamental limits. Then we analyze two natural families of self-improvement algorithms based on SFT and RLHF. We find that (i) the SFT-based approach is minimax optimal whenever the initial model has sufficient coverage, but (ii) the RLHF-based approach can improve over SFT-based self-improvement by leveraging online exploration, bypassing the need for coverage. Finally, we empirically validate the sharpening mechanism via inference-time and amortization experiments. We view these findings as a starting point toward a foundational understanding that can guide the design and evaluation of self-improvement algorithms.
      [LINK]http://arxiv.org/abs/2412.01951v2
      [DATE]2024-12-04 22:20:21+08:00
      [CATEGORIES]cs.CL cs.LG
    • Yankari: A Monolingual Yoruba Dataset
      [AUTHORS]Maro Akpobi
      [ABSTRACT]This paper presents Yankari, a large-scale monolingual dataset for the Yoruba language, aimed at addressing the critical gap in Natural Language Processing (NLP) resources for this important West African language. Despite being spoken by over 30 million people, Yoruba has been severely underrepresented in NLP research and applications. We detail our methodology for creating this dataset, which includes careful source selection, automated quality control, and rigorous data cleaning processes. The Yankari dataset comprises 51,407 documents from 13 diverse sources, totaling over 30 million tokens. Our approach focuses on ethical data collection practices, avoiding problematic sources and addressing issues prevalent in existing datasets. We provide thorough automated evaluations of the dataset, demonstrating its quality compared to existing resources. The Yankari dataset represents a significant advancement in Yoruba language resources, providing a foundation for developing more accurate NLP models, supporting comparative linguistic studies, and contributing to the digital accessibility of the Yoruba language.
      [COMMENTS]8 pages
      [LINK]http://arxiv.org/abs/2412.03334v1
      [DATE]2024-12-04 22:05:18+08:00
      [CATEGORIES]cs.CL
    • AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark
      [AUTHORS]Abhay Gupta, Philip Meng, Ece Yurtseven, Sean O'Brien, Kevin Zhu
      [ABSTRACT]Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models. We have open-sourced our source code on GitHub and created a website to showcase our work at https://aavenue.live.
      [COMMENTS]Published at NLP4PI @ EMNLP 2024
      [LINK]http://arxiv.org/abs/2408.14845v2
      [DATE]2024-12-04 21:43:28+08:00
      [CATEGORIES]cs.CL
    • Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
      [AUTHORS]Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, Sara Hooker
      [ABSTRACT]Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. These biases stem not only from language but also from the cultural knowledge required to interpret questions, reducing the practical utility of translated datasets like MMLU. Furthermore, translation often introduces artifacts that can distort the meaning or clarity of questions in the target language. A common practice in multilingual evaluation is to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to address these challenges. In this work, we trace the impact of both of these issues on multilingual evaluations and ensuing model performances. Our large-scale evaluation of state-of-the-art open and proprietary models illustrates that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. Moreover, for questions requiring geographic knowledge, an astounding 84.9% focus on either North American or European regions. Rankings of model evaluations change depending on whether they are evaluated on the full portion or the subset of questions annotated as culturally sensitive, showing the distortion to model rankings when blindly relying on translated MMLU. We release Global-MMLU, an improved MMLU with evaluation coverage across 42 languages -- with improved overall quality by engaging with compensated professional and community annotators to verify translation quality while also rigorously evaluating cultural biases present in the original dataset. This comprehensive Global-MMLU set also includes designated subsets labeled as culturally sensitive and culturally agnostic to allow for more holistic, complete evaluation.
      [LINK]http://arxiv.org/abs/2412.03304v1
      [DATE]2024-12-04 21:27:09+08:00
      [CATEGORIES]cs.CL
    • Alignment at Pre-training! Towards Native Alignment for Arabic LLMs
      [AUTHORS]Juhao Liang, Zhenyang Cai, Jianqing Zhu, Huang Huang, Kewei Zong, Bang An, Mosen Alharthi, Juncai He, Lian Zhang, Haizhou Li, Benyou Wang, Jinchao Xu
      [ABSTRACT]The alignment of large language models (LLMs) is critical for developing effective and safe language models. Traditional approaches focus on aligning models during the instruction tuning or reinforcement learning stages, referred to in this paper as 'post alignment'. We argue that alignment during the pre-training phase, which we term 'native alignment', warrants investigation. Native alignment aims to prevent unaligned content from the beginning, rather than relying on post-hoc processing. This approach leverages extensively aligned pre-training data to enhance the effectiveness and usability of pre-trained models. Our study specifically explores the application of native alignment in the context of Arabic LLMs. We conduct comprehensive experiments and ablation studies to evaluate the impact of native alignment on model performance and alignment stability. Additionally, we release open-source Arabic LLMs that demonstrate state-of-the-art performance on various benchmarks, providing significant benefits to the Arabic LLM community.
      [COMMENTS]Accepted to NeurIPS 2024 main conference. See https://github.com/FreedomIntelligence/AceGPT-v2
      [LINK]http://arxiv.org/abs/2412.03253v1
      [DATE]2024-12-04 19:52:03+08:00
      [CATEGORIES]cs.CL
    • Xmodel-1.5: An 1B-scale Multilingual LLM
      [AUTHORS]Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling
      [ABSTRACT]We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language model pretrained on 2 trillion tokens, designed for balanced performance and scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English, outperforming Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai. To support low-resource language research, we release Xdata_Thai, a Thai-specific evaluation dataset featuring unique linguistic challenges such as gendered particles and idioms. While the model demonstrates strong performance, there is still room for improvement in handling culturally specific nuances. We hope this work contributes to advancements in multilingual AI research. Models and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelLM-1.5
      [LINK]http://arxiv.org/abs/2411.10083v3
      [DATE]2024-12-04 19:49:04+08:00
      [CATEGORIES]cs.CL
    • AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
      [AUTHORS]Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
      [ABSTRACT]Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders, leading to high computational demands, which limits their applicability in resource-constrained environments and for long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with a minimum performance drop. Our method consists of a) iterative token merging based on embedding similarity before LLMs, and b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs. Further, under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding (e.g., +4.6 on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research in designing efficient multi-modal LLMs. Our code will be available at https://github.com/LaVi-Lab/AIM.
      [COMMENTS]12 pages, 2 figures
      [LINK]http://arxiv.org/abs/2412.03248v1
      [DATE]2024-12-04 19:47:57+08:00
      [CATEGORIES]cs.CL
    • ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
      [AUTHORS]Yahan Tu, Rui Hu, Jitao Sang
      [ABSTRACT]Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs.
      [LINK]http://arxiv.org/abs/2409.09318v3
      [DATE]2024-12-04 19:44:57+08:00
      [CATEGORIES]cs.CL
    • Benchmarking terminology building capabilities of ChatGPT on an English-Russian Fashion Corpus
      [AUTHORS]Anastasiia Bezobrazova, Miriam Seghiri, Constantin Orasan
      [ABSTRACT]This paper compares the accuracy of the terms extracted using SketchEngine, TBXTools and ChatGPT. In addition, it evaluates the quality of the definitions produced by ChatGPT for these terms. The research is carried out on a comparable corpus of fashion magazines written in English and Russian collected from the web. A gold standard for the fashion terminology was also developed by identifying web pages that can be harvested automatically and contain definitions of terms from the fashion domain in English and Russian. This gold standard was used to evaluate the quality of the extracted terms and of the definitions produced. Our evaluation shows that TBXTools and SketchEngine, while capable of high recall, suffer from reduced precision as the number of terms increases, which affects their overall performance. Conversely, ChatGPT demonstrates superior performance, maintaining or improving precision as more terms are considered. Analysis of the definitions produced by ChatGPT for 60 commonly used terms in English and Russian shows that ChatGPT maintains a reasonable level of accuracy and fidelity across languages, but sometimes the definitions in both languages miss crucial specifics and include unnecessary deviations. Our research reveals that no single tool excels universally; each has strengths suited to particular aspects of terminology extraction and application.
      [COMMENTS]To appear in the Proceedings of Translating and the Computer 2024 (TC46)
      [LINK]http://arxiv.org/abs/2412.03242v1
      [DATE]2024-12-04 19:43:08+08:00
      [CATEGORIES]cs.CL
    • Linq-Embed-Mistral Technical Report
      [AUTHORS]Chanyeol Choi, Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, Jy-yong Sohn
      [ABSTRACT]This report explores the enhancement of text retrieval performance using advanced data refinement techniques. We develop Linq-Embed-Mistral (https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) by building on the E5-mistral and Mistral-7B-v0.1 models, focusing on sophisticated data crafting, data filtering, and negative mining methods, which are highly tailored to each task and applied to both existing benchmark datasets and highly tailored synthetic datasets generated via large language models (LLMs). Linq-Embed-Mistral excels in the MTEB benchmarks (as of May 29, 2024), achieving an average score of 68.2 across 56 datasets, and ranks 1st among all models for retrieval tasks on the MTEB leaderboard with a performance score of 60.2. This performance underscores its superior capability in enhancing search precision and reliability. Our contributions include advanced data refinement methods that significantly improve model performance on benchmark and synthetic datasets, techniques for homogeneous task ordering and mixed task fine-tuning to enhance model generalization and stability, and a streamlined evaluation process using 4-bit precision and a light retrieval evaluation set, which accelerates validation without sacrificing accuracy.
      [COMMENTS]15 pages
      [LINK]http://arxiv.org/abs/2412.03223v1
      [DATE]2024-12-04 19:18:32+08:00
      [CATEGORIES]cs.CL
    • U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
      [AUTHORS]Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga
      [ABSTRACT]The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. The evaluation of general domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, and an even lower 45% on visual problems. The solution assessment proves challenging for LLMs, with the best LLM judge having an F1-score of 80% on $\mu$-MATH.
      [LINK]http://arxiv.org/abs/2412.03205v1
      [DATE]2024-12-04 18:44:50+08:00
      [CATEGORIES]cs.CL
    • Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
      [AUTHORS]Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana
      [ABSTRACT]While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. While LLMs are significantly better than random at solving statistical classification problems, the sample efficiency of few-shot learning lags behind traditional statistical learning algorithms, especially as the dimension of the problem increases. This suggests that much of the observed few-shot performance on novel real-world datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We release the https://github.com/interpretml/LLM-Tabular-Memorization-Checker Python package to test LLMs for memorization of tabular datasets.
      [COMMENTS]COLM camera ready, fix typo
      [LINK]http://arxiv.org/abs/2404.06209v3
      [DATE]2024-12-04 18:33:18+08:00
      [CATEGORIES]cs.LG cs.CL
    • Weighted-Reward Preference Optimization for Implicit Model Fusion
      [AUTHORS]Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
      [ABSTRACT]While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at https://github.com/SLIT-AI/WRPO.
      [COMMENTS]Work in progress
      [LINK]http://arxiv.org/abs/2412.03187v1
      [DATE]2024-12-04 18:15:12+08:00
      [CATEGORIES]cs.CL
    • Multi-Level Correlation Network For Few-Shot Image Classification
      [AUTHORS]Yunkai Dang, Min Zhang, Zhengyu Chen, Xinliang Zhang, Zheng Wang, Meijun Sun, Donglin Wang
      [ABSTRACT]Few-shot image classification (FSIC) aims to recognize novel classes given few labeled images from base classes. Recent works have achieved promising classification performance, especially for metric-learning methods, where a measure at only the image feature level is usually used. In this paper, we argue that a measure at such a level may not be effective enough to generalize from base to novel classes when using only a few images. Instead, a multi-level descriptor of an image is taken into consideration in this paper. We propose a multi-level correlation network (MLCN) for FSIC to tackle this problem by effectively capturing local information. Concretely, we present the self-correlation module and cross-correlation module to learn the semantic correspondence relation of local information based on learned representations. Moreover, we propose a pattern-correlation module to capture the pattern of fine-grained images and find relevant structural patterns between base classes and novel classes. Extensive experiments and analysis show the effectiveness of our proposed method on four widely-used FSIC benchmarks. The code for our approach is available at: https://github.com/Yunkai696/MLCN.
      [LINK]http://arxiv.org/abs/2412.03159v1
      [DATE]2024-12-04 17:36:24+08:00
      [CATEGORIES]cs.CL
    • Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment
      [AUTHORS]Yanshi Li, Shaopan Xiong, Gengru Chen, Xiaoyang Li, Yijia Luo, Xingyao Zhang, Yanhui Huang, Xingyuan Bu, Yingshui Tan, Chun Yuan, Jiamang Wang, Wenbo Su, Bo Zheng
      [ABSTRACT]Reinforcement Learning from Human Feedback (RLHF) has proven highly effective in aligning Large Language Models (LLMs) with human preferences. However, the original RLHF typically optimizes under an overall reward, which can lead to a suboptimal learning process. This limitation stems from RLHF's lack of awareness regarding which specific tokens should be reinforced or suppressed. Moreover, conflicts in supervision can arise, for instance, when a chosen response includes erroneous tokens, while a rejected response contains accurate elements. To rectify these shortcomings, increasing dense reward methods, such as step-wise and token-wise RLHF, have been proposed. However, these existing methods are limited to specific tasks (like mathematics). In this paper, we propose the "Adaptive Message-wise RLHF" method, which robustly applies to various tasks. By defining pivot tokens as key indicators, our approach adaptively identifies essential information and converts sequence-level supervision into fine-grained, subsequence-level supervision. This aligns the density of rewards and action spaces more closely with the information density of the input. Experiments demonstrate that our method can be integrated into various training methods, significantly mitigating hallucinations and catastrophic forgetting problems, while outperforming other methods on multiple evaluation metrics. Our method improves the success rate on adversarial samples by 10% compared to the sample-wise approach, and achieves a 1.3% improvement on evaluation benchmarks such as MMLU, GSM8K, HumanEval, etc.
      [LINK]http://arxiv.org/abs/2411.00809v2
      [DATE]2024-12-04 17:26:47+08:00
      [CATEGORIES]cs.LG cs.CL
    • A Measure of the System Dependence of Automated Metrics
      [AUTHORS]Pius von Däniken, Jan Deriu, Mark Cieliebak
      [ABSTRACT]Automated metrics for Machine Translation have made significant progress, with the goal of replacing expensive and time-consuming human evaluations. These metrics are typically assessed by their correlation with human judgments, which captures the monotonic relationship between human and metric scores. However, we argue that it is equally important to ensure that metrics treat all systems fairly and consistently. In this paper, we introduce a method to evaluate this aspect.
      [LINK]http://arxiv.org/abs/2412.03152v1
      [DATE]2024-12-04 17:21:46+08:00
      [CATEGORIES]cs.CL
    • ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction
      [AUTHORS]Victor Junqiu Wei, Weicheng Wang, Di Jiang, Yuanfeng Song, Lu Wang
      [ABSTRACT]Automatic Speech Recognition (ASR) is a fundamental and important task in the field of speech and natural language processing. It is an inherent building block in many applications such as voice assistants, speech translation, etc. Despite the advancement of ASR technologies in recent years, it is still inevitable for modern ASR systems to produce a substantial number of recognition errors due to environmental noise, ambiguity, etc. Therefore, error correction in ASR is crucial. Motivated by this, this paper studies ASR error correction in the Chinese language, which is one of the most popular languages and enjoys a large number of users in the world. We first create a benchmark dataset named ASR-EC that contains a wide spectrum of ASR errors generated by industry-grade ASR systems. To the best of our knowledge, it is the first Chinese ASR error correction benchmark. Then, inspired by the recent advances in large language models (LLMs), we investigate how to harness the power of LLMs to correct ASR errors. We apply LLMs to ASR error correction in three paradigms. The first paradigm is prompting, which is further categorized as zero-shot, few-shot, and multi-step. The second paradigm is finetuning, which finetunes LLMs with ASR error correction data. The third paradigm is multi-modal augmentation, which collectively utilizes the audio and ASR transcripts for error correction. Extensive experiments reveal that prompting is not effective for ASR error correction. Finetuning is effective only for a portion of LLMs. Multi-modal augmentation is the most effective method for error correction and achieves state-of-the-art performance.
      [LINK]http://arxiv.org/abs/2412.03075v1
      [DATE]2024-12-04 14:52:10+08:00
      [CATEGORIES]cs.CL
    • An Effective Framework to Help Large Language Models Handle Numeric-involved Long-context Tasks
      [AUTHORS]Yijiong Yu
      [ABSTRACT]Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long texts and have almost perfect performance in traditional retrieval tasks. However, their performance significantly degrades when it comes to numerical calculations in long contexts. Numeric-involved long-context tasks typically cannot be addressed by current LLMs in normal settings due to their inherent limitations in simultaneously handling complex and massive information. Some CoT-like prompting methods can improve accuracy but demand massive output tokens, which is costly and slow. To address this issue, we propose a workflow that decomposes a numeric-involved long-context task into 4 low-level subtasks: judging, extracting, processing with code, and conclusion. The first 2 subtasks are relatively simple, which allows us to use smaller models to process long contexts efficiently. When numerical calculations are required, we use code generated by LLMs to avoid the weakness of LLMs at calculation. The results on 2 numeric-involved long-context benchmarks demonstrate that our workflow can not only improve accuracy, but also significantly reduce the cost of API calls.
      [LINK]http://arxiv.org/abs/2411.10145v2
      [DATE]2024-12-04 13:54:43+08:00
      [CATEGORIES]cs.CL
    • CBEval: A framework for evaluating and interpreting cognitive biases in LLMs
      [AUTHORS]Ammar Shaikh, Raj Abhijit Dandekar, Sreedath Panat, Rajat Dandekar
      [ABSTRACT]Rapid advancements in Large Language Models (LLMs) have significantly enhanced their reasoning capabilities. Despite improved performance on benchmarks, LLMs exhibit notable gaps in their cognitive processes. Additionally, as reflections of human-generated data, these models have the potential to inherit cognitive biases, raising concerns about their reasoning and decision-making capabilities. In this paper, we present a framework to interpret, understand and provide insights into a host of cognitive biases in LLMs. Conducting our research on frontier language models, we are able to elucidate reasoning limitations and biases, and provide reasoning behind these biases by constructing influence graphs that identify phrases and words most responsible for biases manifested in LLMs. We further investigate biases such as round number bias and cognitive bias barrier revealed when noting the framing effect in language models.
      [LINK]http://arxiv.org/abs/2412.03605v1
      [DATE]2024-12-04 13:53:28+08:00
      [CATEGORIES]cs.CL
    • Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework
      [AUTHORS]Lu Chen, Ruqing Zhang, Jiafeng Guo, Yixing Fan, Xueqi Cheng
      [ABSTRACT]Retrieval-augmented generation (RAG) has emerged as a popular solution to mitigate the hallucination issues of large language models. However, existing studies on RAG seldom address the issue of predictive uncertainty, i.e., how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications. In this work, we emphasize the importance of risk control, ensuring that RAG models proactively refuse to answer questions with low confidence. Our research identifies two critical latent factors affecting RAG's confidence in its predictions: the quality of the retrieved results and the manner in which these results are utilized. To guide RAG models in assessing their own confidence based on these two latent factors, we develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers. We also introduce a benchmarking procedure to collect answers with the option to abstain, facilitating a series of experiments. For evaluation, we introduce several risk-related metrics and the experimental results demonstrate the effectiveness of our approach. Our code and benchmark dataset are available at https://github.com/ict-bigdatalab/RC-RAG.
      [LINK]http://arxiv.org/abs/2409.16146v2
      [DATE]2024-12-04 11:21:44+08:00
      [CATEGORIES]cs.CL
    • Advancing Conversational Psychotherapy: Integrating Privacy, Dual-Memory, and Domain Expertise with Large Language Models
      [AUTHORS]XiuYu Zhang, Zening Luo
      [ABSTRACT]Mental health has increasingly become a global issue that reveals the limitations of traditional conversational psychotherapy, constrained by location, time, expense, and privacy concerns. In response to these challenges, we introduce SoulSpeak, a Large Language Model (LLM)-enabled chatbot designed to democratize access to psychotherapy. SoulSpeak improves upon the capabilities of standard LLM-enabled chatbots by incorporating a novel dual-memory component that combines short-term and long-term context via Retrieval Augmented Generation (RAG) to offer personalized responses while ensuring the preservation of user privacy and intimacy through a dedicated privacy module. In addition, it leverages a counseling chat dataset of therapist-client interactions and various prompting techniques to align the generated responses with psychotherapeutic methods. We introduce two fine-tuned BERT models to evaluate the system against existing LLMs and human therapists: the Conversational Psychotherapy Preference Model (CPPM) to simulate human preference among responses and another to assess response relevance to user input. CPPM is useful for training and evaluating psychotherapy-focused language models independent from SoulSpeak, helping with the constrained resources available for psychotherapy. Furthermore, the effectiveness of the dual-memory component and the robustness of the privacy module are also examined. Our findings highlight the potential and challenge of enhancing mental health care by offering an alternative that combines the expertise of traditional therapy with the advantages of LLMs, providing a promising way to address the accessibility and personalization gap in current mental health services.
      [COMMENTS]Accepted as a Poster at Statistical Foundations of LLMs and Foundation Models (NeurIPS 2024 Workshop)
      [LINK]http://arxiv.org/abs/2412.02987v1
      [DATE]2024-12-04 11:02:46+08:00
      [CATEGORIES]cs.CL
    • The use of large language models to enhance cancer clinical trial educational materials
      [AUTHORS]Mingye Gao, Aman Varshney, Shan Chen, Vikram Goddla, Jack Gallifant, Patrick Doyle, Claire Novack, Maeve Dillon-Martin, Teresia Perkins, Xinrong Correia, Erik Duhaime, Howard Isenstein, Elad Sharon, Lisa Soleymani Lehmann, David Kozono, Brian Anthony, Dmitriy Dligach, Danielle S. Bitterman
      [ABSTRACT]Cancer clinical trials often face challenges in recruitment and engagement due to a lack of participant-facing informational and educational resources. This study investigated the potential of Large Language Models (LLMs), specifically GPT4, in generating patient-friendly educational content from clinical trial informed consent forms. Using data from ClinicalTrials.gov, we employed zero-shot learning for creating trial summaries and one-shot learning for developing multiple-choice questions, evaluating their effectiveness through patient surveys and crowdsourced annotation. Results showed that GPT4-generated summaries were both readable and comprehensive, and may improve patients' understanding and interest in clinical trials. The multiple-choice questions demonstrated high accuracy and agreement with crowdsourced annotators. For both resource types, hallucinations were identified that require ongoing human oversight. The findings demonstrate the potential of LLMs "out-of-the-box" to support the generation of clinical trial education materials with minimal trial-specific engineering, but implementation with a human-in-the-loop is still needed to avoid misinformation risks.
      [LINK]http://arxiv.org/abs/2412.01955v2
      [DATE]2024-12-04 10:25:04+08:00
      [CATEGORIES]cs.CL
    • SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
      [AUTHORS]Haotian Xia, Zhengbang Yang, Junbo Zou, Rhys Tracy, Yuqing Wang, Chi Lu, Christopher Lai, Yanjun He, Xun Shao, Zhuoqing Xie, Yuan-fang Wang, Weining Shen, Hanjie Chen
      [ABSTRACT]Multimodal Large Language Models (MLLMs) are advancing the ability to reason about complex sports scenarios by integrating textual and visual information. To comprehensively evaluate their capabilities, we introduce SPORTU, a benchmark designed to assess MLLMs across multi-level sports reasoning tasks. SPORTU comprises two key components. SPORTU-text features 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding; this component focuses on testing models' ability to reason about sports solely through question-answering (QA), without requiring visual inputs. SPORTU-video consists of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning, from simple sports recognition to complex tasks like foul detection and rule application. We evaluate four LLMs using few-shot learning and chain-of-thought (CoT) prompting on SPORTU-text. GPT-4o achieves the highest accuracy of 71%, but still falls short of human-level performance, highlighting room for improvement in rule comprehension and reasoning. The evaluation for the SPORTU-video part includes 7 proprietary and 6 open-source MLLMs. Experiments show that models fall short on hard tasks that require deep reasoning and rule-based understanding. Claude-3.5-Sonnet performs the best with only 52.6% accuracy on the hard task, showing large room for improvement. We hope that SPORTU will serve as a critical step toward evaluating models' capabilities in sports understanding and reasoning.
      [LINK]http://arxiv.org/abs/2410.08474v3
      [DATE]2024-12-04 08:43:57+08:00
      [CATEGORIES]cs.CL
    • Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data
      [AUTHORS]Junhao Liu, Siwei Xu, Lei Zhang, Jing Zhang
      [ABSTRACT]Over the past decade, the revolution in single-cell sequencing has enabled the simultaneous molecular profiling of various modalities across thousands of individual cells, allowing scientists to investigate the diverse functions of complex tissues and uncover underlying disease mechanisms. Among all the analytical steps, assigning individual cells to specific types is fundamental for understanding cellular heterogeneity. However, this process is usually labor-intensive and requires extensive expert knowledge. Recent advances in large language models (LLMs) have demonstrated their ability to efficiently process and synthesize vast corpora of text to automatically extract essential biological knowledge, such as marker genes, potentially promoting more efficient and automated cell type annotations. To thoroughly evaluate the capability of modern instruction-tuned LLMs in automating the cell type identification process, we introduce SOAR, a comprehensive benchmarking study of LLMs for cell type annotation tasks in single-cell genomics. Specifically, we assess the performance of 8 instruction-tuned LLMs across 11 datasets, spanning multiple cell types and species. Our study explores the potential of LLMs to accurately classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data, while extending their application to multiomics data through cross-modality translation. Additionally, we evaluate the effectiveness of chain-of-thought (CoT) prompting techniques in generating detailed biological insights during the annotation process. The results demonstrate that LLMs can provide robust interpretations of single-cell data without requiring additional fine-tuning, advancing the automation of cell type annotation in genomics research.
      [LINK]http://arxiv.org/abs/2412.02915v1
      [DATE]2024-12-04 07:58:35+08:00
      [CATEGORIES]cs.CL
    • Does Few-Shot Learning Help LLM Performance in Code Synthesis?
      [AUTHORS]Derek Xu, Tong Xie, Botao Xia, Haoyu Li, Yunsheng Bai, Yizhou Sun, Wei Wang
      [ABSTRACT]Large language models (LLMs) have made significant strides in code generation through improved model design, training, and chain-of-thought. However, prompt-level optimizations remain an important yet under-explored aspect of LLMs for coding. This work focuses on the few-shot examples present in most code generation prompts, offering a systematic study on whether few-shot examples improve LLMs' coding capabilities, which few-shot examples have the largest impact, and how to select impactful examples. Our work offers 2 approaches for selecting few-shot examples, a model-free method, CODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The 2 methods offer a trade-off between improved performance and reliance on training data and interpretability. Both methods significantly improve CodeLlama's coding ability across the popular HumanEval+ coding benchmark. In summary, our work provides valuable insights into how to pick few-shot examples in code generation prompts to improve LLM code generation capabilities.
      [LINK]http://arxiv.org/abs/2412.02906v1
      [DATE]2024-12-04 07:19:40+08:00
      [CATEGORIES]cs.CL cs.LG
    • Enhancing Trust in Large Language Models with Uncertainty-Aware Fine-Tuning
      [AUTHORS]Ranganath Krishnan, Piyush Khanna, Omesh Tickoo
      [ABSTRACT]Large language models (LLMs) have revolutionized the field of natural language processing with their impressive reasoning and question-answering capabilities. However, these models are sometimes prone to generating credible-sounding but incorrect information, a phenomenon known as LLM hallucinations. Reliable uncertainty estimation in LLMs is essential for fostering trust in their generated responses and serves as a critical tool for the detection and prevention of erroneous or hallucinated outputs. To achieve reliable and well-calibrated uncertainty quantification in open-ended and free-form natural language generation, we propose an uncertainty-aware fine-tuning approach for LLMs. This approach enhances the model's ability to provide reliable uncertainty estimates without compromising accuracy, thereby guiding them to produce more trustworthy responses. We introduce a novel uncertainty-aware causal language modeling loss function, grounded in the principles of decision theory. Through rigorous evaluation on multiple free-form question-answering datasets and models, we demonstrate that our uncertainty-aware fine-tuning approach yields better calibrated uncertainty estimates in natural language generation tasks than fine-tuning with the standard causal language modeling loss. Furthermore, the experimental results show that the proposed method significantly improves the model's ability to detect hallucinations and identify out-of-domain prompts.
      [LINK]http://arxiv.org/abs/2412.02904v1
      [DATE]2024-12-04 07:14:47+08:00
      [CATEGORIES]cs.CL cs.LG
    • TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?
      [AUTHORS]Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, Saurabh Sinha
      [ABSTRACT]Test-driven development (TDD) is the practice of writing tests first and coding later, and the proponents of TDD expound its numerous benefits. For instance, given an issue on a source code repository, tests can clarify the desired behavior among stake-holders before anyone writes code for the agreed-upon fix. Although there has been a lot of work on automated test generation for the practice "write code first, test later", there has been little such automation for TDD. Ideally, tests for TDD should be fail-to-pass (i.e., fail before the issue is resolved and pass after) and have good adequacy with respect to covering the code changed during issue resolution. This paper introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues mined from real-world GitHub code repositories. The benchmark's evaluation harness runs only relevant tests in isolation for simple yet accurate coverage measurements, and the benchmark's dataset is filtered both by human judges and by execution in the harness. This paper also presents Auto-TDD, an LLM-based solution that takes as input an issue description and a codebase (prior to issue resolution) and returns as output a test that can be used to validate the changes made for resolving the issue. Our evaluation shows that Auto-TDD yields a better fail-to-pass rate than the strongest prior work while also yielding high coverage adequacy. Overall, we hope that this work helps make developers more productive at resolving issues while simultaneously leading to more robust fixes.
      [LINK]http://arxiv.org/abs/2412.02883v1
      [DATE]2024-12-04 06:38:05+08:00
      [CATEGORIES]cs.CL cs.LG
    • Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation
      [AUTHORS]To Eun Kim, Fernando Diaz
      [ABSTRACT]Many language models now enhance their responses with retrieval capabilities, leading to the widespread adoption of retrieval-augmented generation (RAG) systems. However, despite retrieval being a core component of RAG, much of the research in this area overlooks the extensive body of work on fair ranking, neglecting the importance of considering all stakeholders involved. This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus specifically on measuring the fair exposure of each relevant item across the rankings utilized by RAG systems (i.e., item-side fairness), aiming to promote equitable growth for relevant item providers. To gain a deep understanding of the relationship between item-fairness, ranking quality, and generation quality in the context of RAG, we analyze nine different RAG systems that incorporate fair rankings across seven distinct datasets. Our findings indicate that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems, despite the general trend of a tradeoff between ensuring fairness and maintaining system-effectiveness. We believe our insights lay the groundwork for responsible and equitable RAG systems and open new avenues for future research. We publicly release our codebase and dataset at https://github.com/kimdanny/Fair-RAG.
      [COMMENTS]Top 5 Spotlight at AFME Workshop at NeurIPS 2024
      [LINK]http://arxiv.org/abs/2409.11598v2
      [DATE]2024-12-04 06:23:53+08:00
      [CATEGORIES]cs.CL
    • Investigating the Contextualised Word Embedding Dimensions Specified for Contextual and Temporal Semantic Changes
      [AUTHORS]Taichi Aida, Danushka Bollegala
      [ABSTRACT]The sense-aware contextualised word embeddings (SCWEs) encode semantic changes of words within the contextualised word embedding (CWE) spaces. Despite the superior performance of SCWEs in contextual/temporal semantic change detection (SCD) benchmarks, it remains unclear how the meaning changes are encoded in the embedding space. To study this, we compare pre-trained CWEs and their fine-tuned versions on contextual and temporal semantic change benchmarks under Principal Component Analysis (PCA) and Independent Component Analysis (ICA) transformations. Our experimental results reveal that (a) although there exist a smaller number of axes that are specific to semantic changes of words in the pre-trained CWE space, this information gets distributed across all dimensions when fine-tuned, and (b) in contrast to prior work studying the geometry of CWEs, PCA represents semantic changes better than ICA within the top 10% of axes. These findings encourage the development of more efficient SCD methods with a small number of SCD-aware dimensions. Source code is available at https://github.com/LivNLP/svp-dims .
      [COMMENTS]COLING2025
      [LINK]http://arxiv.org/abs/2407.02820v2
      [DATE]2024-12-04 04:56:16+08:00
      [CATEGORIES]cs.CL
    • Can Open-source LLMs Enhance Data Synthesis for Toxic Detection?: An Experimental Study
      [AUTHORS]Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, Congrui Huang
      [ABSTRACT]Effective toxic content detection relies heavily on high-quality and diverse data, which serves as the foundation for robust content moderation models. This study explores the potential of open-source LLMs for harmful data synthesis, utilizing prompt engineering and fine-tuning techniques to enhance data quality and diversity. In a two-stage evaluation, we first examine the capabilities of six open-source LLMs in generating harmful data across multiple datasets using prompt engineering. In the second stage, we fine-tune these models to improve data generation while addressing challenges such as hallucination, data duplication, and overfitting. Our findings reveal that Mistral excels in generating high-quality and diverse harmful data with minimal hallucination. Furthermore, fine-tuning enhances data quality, offering scalable and cost-effective solutions for augmenting datasets for specific toxic content detection tasks. These results emphasize the significance of data synthesis in building robust, standalone detection models and highlight the potential of open-source LLMs to advance smaller downstream content moderation systems. We implemented this approach in real-world industrial settings, demonstrating the feasibility and efficiency of fine-tuned open-source LLMs for harmful data synthesis.
      [COMMENTS]12 pages
      [LINK]http://arxiv.org/abs/2411.15175v2
      [DATE]2024-12-04 04:07:58+08:00
      [CATEGORIES]cs.CL
    • Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
      [AUTHORS]Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora
      [ABSTRACT]We introduce Instruct-SkillMix, an automated approach for creating diverse, high-quality SFT data. The Instruct-SkillMix pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core "skills" for instruction-following, either from existing datasets, or by directly prompting the model; (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction-following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0. To our knowledge, this achieves state-of-the-art performance among all models that have only undergone SFT (no RL methods) and competes with proprietary models such as Claude 3 Opus and LLaMA-3.1-405B-Instruct. Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult. Introducing low-quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet, sometimes catastrophically. The Instruct-SkillMix pipeline is flexible and is adaptable to other settings.
      [LINK]http://arxiv.org/abs/2408.14774v3
      [DATE]2024-12-04 04:01:23+08:00
      [CATEGORIES]cs.LG cs.CL
    • When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs
      [AUTHORS]Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, Rui Zhang
      [ABSTRACT]Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during inference. Prior work has proposed various self-correction frameworks using different sources of feedback, including self-evaluation and external feedback. However, there is still no consensus on the question of when LLMs can correct their own mistakes, as recent studies also report negative results. In this work, we critically survey broad papers and discuss the conditions required for successful self-correction. We first find that prior studies often do not define their research questions in detail and involve impractical frameworks or unfair evaluations that over-evaluate self-correction. To tackle these issues, we categorize research questions in self-correction research and provide a checklist for designing appropriate experiments. Our critical survey based on the newly categorized research questions shows that (1) no prior work demonstrates successful self-correction with feedback from prompted LLMs, except for studies in tasks that are exceptionally suited for self-correction, (2) self-correction works well in tasks that can use reliable external feedback, and (3) large-scale fine-tuning enables self-correction.
      [COMMENTS]TACL 2024
      [LINK]http://arxiv.org/abs/2406.01297v3
      [DATE]2024-12-04 03:14:06+08:00
      [CATEGORIES]cs.CL
    • T-REG: Preference Optimization with Token-Level Reward Regularization
      [AUTHORS]Wenxuan Zhou, Shujian Zhang, Lingxiao Zhao, Tao Meng
      [ABSTRACT]Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the entire response. However, this approach faces challenges due to its reliance on a single, sparse reward, which makes it challenging for the model to identify which parts of the sequence contribute most significantly to the final reward. Recent methods have attempted to address this limitation by introducing token-level rewards. However, these methods often rely on either a trained credit assignment model or AI annotators, raising concerns about the quality and reliability of the rewards. In this paper, we propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization. Harnessing the self-refinement capabilities of LLMs, our method uses contrastive prompting to enable LLMs to self-generate token-level rewards. These self-generated rewards then act as reward regularization, guiding the model to more effectively distribute sequence-level rewards across tokens. This facilitates better token-level credit assignment and enhances alignment performance. Experiments on the instruction following benchmarks, including Alpaca Eval 2 and Arena-Hard, show that our method consistently outperforms baseline methods by up to 3.8% and 4.4%, respectively. We will release the code and models at https://github.com/wzhouad/T-REG.
      [LINK]http://arxiv.org/abs/2412.02685v1
      [DATE]2024-12-04 02:56:07+08:00
      [CATEGORIES]cs.CL cs.LG
    • From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs
      [AUTHORS]Alireza Rezazadeh, Zichao Li, Wei Wei, Yujia Bao
      [ABSTRACT]Recent advancements in large language models have significantly improved their context windows, yet challenges in effective long-term memory management remain. We introduce MemTree, an algorithm that leverages a dynamic, tree-structured memory representation to optimize the organization, retrieval, and integration of information, akin to human cognitive schemas. MemTree organizes memory hierarchically, with each node encapsulating aggregated textual content, corresponding semantic embeddings, and varying abstraction levels across the tree's depths. Our algorithm dynamically adapts this memory structure by computing and comparing semantic embeddings of new and existing information to enrich the model's context-awareness. This approach allows MemTree to handle complex reasoning and extended interactions more effectively than traditional memory augmentation methods, which often rely on flat lookup tables. Evaluations on benchmarks for multi-turn dialogue understanding and document question answering show that MemTree significantly enhances performance in scenarios that demand structured memory management.
      [LINK]http://arxiv.org/abs/2410.14052v2
      [DATE]2024-12-04 02:48:00+08:00
      [CATEGORIES]cs.CL cs.LG
    • QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing
      [AUTHORS]Ramesh Manuvinakurike, Elizabeth Watkins, Celal Savur, Anthony Rhodes, Sovan Biswas, Gesem Gudino Mejia, Richard Beckwith, Saurav Sahay, Giuseppe Raffa, Lama Nachman
      [ABSTRACT]In this work, we explore utilizing LLMs for data augmentation for a manufacturing task guidance system. The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting. The purpose of this work is to explore the task, data augmentation for the supported tasks, and the performance of existing LLMs. We observe that the task is complex, requiring understanding of procedure specification documents, actions and objects sequenced temporally. The dataset consists of 200,000+ question/answer pairs that refer to the spec document and are grounded in narrations and/or video demonstrations. We compared the performance of several popular open-source LLMs by developing a baseline using each LLM, then compared the responses in a reference-free setting using LLM-as-a-judge, and compared the ratings with those of crowd-workers while validating the ratings with experts.
      [LINK]http://arxiv.org/abs/2412.02638v1
      [DATE]2024-12-04 02:10:31+08:00
      [CATEGORIES]cs.CL
    • AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
      [AUTHORS]Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue
      [ABSTRACT]Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we have structured the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize the observations. By revealing the limitations of current models, we aim to provide useful insight for future dataset collection and model development.
      [COMMENTS]Project page: https://av-odyssey.github.io/
      [LINK]http://arxiv.org/abs/2412.02611v1
      [DATE]2024-12-04 01:41:23+08:00
      [CATEGORIES]cs.CL
    • Interpretable Company Similarity with Sparse Autoencoders
      [AUTHORS]Marco Molinari, Vladimir Tregubiak, Victor Shao, Abhimanyu Pandey, Mateusz Mikolajczak, Sebastião Kuznetsov Ryder Torres Pereira
      [ABSTRACT]Determining company similarity is a vital task in finance, underpinning hedging, risk management, portfolio diversification, and more. Practitioners often rely on sector and industry classifications to gauge similarity, such as SIC-codes and GICS-codes, the former being used by the U.S. Securities and Exchange Commission (SEC), and the latter widely used by the investment community. Clustering embeddings of company descriptions has been proposed as a potential technique for determining company similarity, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders have shown promise in enhancing the interpretability of Large Language Models by decomposing LLM activations into interpretable features. In this paper, we explore the use of SAE features in measuring company similarity and benchmark them against (1) SIC codes and (2) Major Group codes. We conclude that SAE features can reproduce and even surpass sector classifications in quantifying fundamental characteristics of companies, evaluated by the correlation of monthly returns, a proxy for similarity, and PnL from cointegration.
      [LINK]http://arxiv.org/abs/2412.02605v1
      [DATE]2024-12-04 01:34:50+08:00
      [CATEGORIES]cs.CL cs.LG
    • Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
      [AUTHORS]Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
      [ABSTRACT]Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The dataset is available at https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
      [LINK]http://arxiv.org/abs/2412.02595v1
      [DATE]2024-12-04 01:28:50+08:00
      [CATEGORIES]cs.CL
    • Patent-CR: A Dataset for Patent Claim Revision
      [AUTHORS]Lekang Jiang, Pascal A Scherz, Stephan Goetz
      [ABSTRACT]This paper presents Patent-CR, the first dataset created for the patent claim revision task in English. It includes both initial patent applications rejected by patent examiners and the final granted versions. Unlike normal text revision tasks that predominantly focus on enhancing sentence quality, such as grammar correction and coherence improvement, patent claim revision aims at ensuring the claims meet stringent legal criteria. These criteria are beyond novelty and inventiveness, including clarity of scope, technical accuracy, language precision, and legal robustness. We assess various large language models (LLMs) through professional human evaluation, including general LLMs with different sizes and architectures, text revision models, and domain-specific models. Our results indicate that LLMs often bring ineffective edits that deviate from the target revisions. In addition, domain-specific models and the method of fine-tuning show promising results. Notably, GPT-4 outperforms other tested LLMs, but further revisions are still necessary to reach the examination standard. Furthermore, we demonstrate the inconsistency between automated and human evaluation results, suggesting that GPT-4-based automated evaluation has the highest correlation with human judgment. This dataset, along with our preliminary empirical research, offers invaluable insights for further exploration in patent claim revision.
      [COMMENTS]15 pages, 6 tables, 3 figures
      [LINK]http://arxiv.org/abs/2412.02549v1
      [DATE]2024-12-04 00:43:42+08:00
      [CATEGORIES]cs.CL
    • Granular Ball Twin Support Vector Machine with Universum Data
      [AUTHORS]M. A. Ganaie, Vrushank Ahire
      [ABSTRACT]Classification with support vector machines (SVM) often suffers from limited performance when relying solely on labeled data from target classes and is sensitive to noise and outliers. Incorporating prior knowledge from Universum data and more robust data representations can enhance accuracy and efficiency. Motivated by these findings, we propose a novel Granular Ball Twin Support Vector Machine with Universum Data (GBU-TSVM) that extends the TSVM framework to leverage both Universum samples and granular ball computing during model training. Unlike existing TSVM methods, the proposed GBU-TSVM represents data instances as hyper-balls rather than points in the feature space. This innovative approach improves the model's robustness and efficiency, particularly in handling noisy and large datasets. By grouping data points into granular balls, the model achieves superior computational efficiency, increased noise resistance, and enhanced interpretability. Additionally, the inclusion of Universum data, which consists of samples that are not strictly from the target classes, further refines the classification boundaries. This integration enriches the model with contextual information, refining classification boundaries and boosting overall accuracy. Experimental results on UCI benchmark datasets demonstrate that the GBU-TSVM outperforms existing TSVM models in both accuracy and computational efficiency. These findings highlight the potential of the GBU-TSVM model in setting a new standard in data representation and classification.
      [LINK]http://arxiv.org/abs/2412.03375v1
      [DATE]2024-12-04 23:02:28+08:00
      [CATEGORIES]cs.LG
    • Deep Learning in Single-Cell and Spatial Transcriptomics Data Analysis: Advances and Challenges from a Data Science Perspective
      [AUTHORS]Shuang Ge, Shuqing Sun, Huan Xu, Qiang Cheng, Zhixiang Ren
      [ABSTRACT]The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. However, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, often contaminated by noise and uncertainty, obscuring the underlying biological signals. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering-based analysis methods struggle to deal with the various challenges presented by intricate biological networks. Deep learning has emerged as a powerful tool capable of handling high-dimensional complex data and automatically identifying meaningful patterns, offering significant promise in addressing these challenges. This review systematically analyzes these challenges and discusses related deep learning approaches. Moreover, we have curated 21 datasets from 9 benchmarks, encompassing 58 computational methods, and evaluated their performance on the respective modeling tasks. Finally, we highlight three areas for future development from a technical, dataset, and application perspective. This work will serve as a valuable resource for understanding how deep learning can be effectively utilized in single-cell and spatial transcriptomics analyses, while inspiring novel approaches to address emerging challenges.
      [LINK]http://arxiv.org/abs/2412.03614v1
      [DATE]2024-12-04 22:07:11+08:00
      [CATEGORIES]cs.LG
    • OpenDriver: An Open-Road Driver State Detection Dataset
      [AUTHORS]Delong Liu, Shichao Li, Tianyi Shi, Zhu Meng, Guanyu Chen, Yadong Huang, Jin Dong, Zhicheng Zhao
      [ABSTRACT]Among numerous studies for driver state detection, wearable physiological measurements offer a practical method for real-time monitoring. However, there are few driver physiological datasets in open-road scenarios, and the existing datasets suffer from issues such as poor signal quality, small sample sizes, and short data collection periods. Therefore, in this paper, a large-scale multimodal driving dataset, OpenDriver, for driver state detection is developed. The OpenDriver encompasses a total of 3,278 driving trips, with a signal collection duration spanning approximately 4,600 hours. Two modalities of driving signals are enrolled in OpenDriver: electrocardiogram (ECG) signals and six-axis motion data of the steering wheel from a motion measurement unit (IMU), which were recorded from 81 drivers and their vehicles. Furthermore, three challenging tasks are involved in our work, namely ECG signal quality assessment, individual biometric identification based on ECG signals, and physiological signal analysis in complex driving environments. To facilitate research in these tasks, corresponding benchmarks have also been introduced. First, a noisy augmentation strategy is applied to generate a larger-scale ECG signal dataset with realistic noise simulation for quality assessment. Second, an end-to-end contrastive learning framework is employed for individual biometric identification. Finally, a comprehensive analysis of drivers' HRV features under different driving conditions is conducted. Each benchmark provides evaluation metrics and reference results. The OpenDriver dataset will be publicly available at https://github.com/bdne/OpenDriver.
      [COMMENTS]Considering that there are flaws in the statistical data of the dataset, all the authors agreed to withdraw the manuscript
      [LINK]http://arxiv.org/abs/2304.04203v3
      [DATE]2024-12-04 21:43:10+08:00
      [CATEGORIES]cs.LG
    • NeRF and Gaussian Splatting SLAM in the Wild
      [AUTHORS]Fabian Schmidt, Markus Enzweiler, Abhinav Valada
      [ABSTRACT]Navigating outdoor environments with visual Simultaneous Localization and Mapping (SLAM) systems poses significant challenges due to dynamic scenes, lighting variations, and seasonal changes, requiring robust solutions. While traditional SLAM methods struggle with adaptability, deep learning-based approaches and emerging neural radiance fields as well as Gaussian Splatting-based SLAM methods, offer promising alternatives. However, these methods have primarily been evaluated in controlled indoor environments with stable conditions, leaving a gap in understanding their performance in unstructured and variable outdoor settings. This study addresses this gap by evaluating these methods in natural outdoor environments, focusing on camera tracking accuracy, robustness to environmental factors, and computational efficiency, highlighting distinct trade-offs. Extensive evaluations demonstrate that neural SLAM methods achieve superior robustness, particularly under challenging conditions such as low light, but at a high computational cost. At the same time, traditional methods perform the best across seasons but are highly sensitive to variations in lighting conditions. The code of the benchmark is publicly available at https://github.com/iis-esslingen/nerf-3dgs-benchmark.
      [COMMENTS]5 pages, 2 figures, 4 tables
      [LINK]http://arxiv.org/abs/2412.03263v1
      [DATE]2024-12-04 20:11:19+08:00
      [CATEGORIES]cs.LG
    • Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement Learning
      [AUTHORS]Mianchu Wang, Yue Jin, Giovanni Montana
      [ABSTRACT]Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without interacting with the environment. A common challenge is handling multi-modal action distributions, where multiple behaviours are represented in the data. Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated. We propose Weighted Imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy. By using a Gaussian mixture model to identify modes and selecting the best mode based on expected returns, LOM avoids the pitfalls of averaging over conflicting actions. Theoretically, we show that LOM improves performance while maintaining simplicity in policy learning. Empirically, LOM outperforms existing methods on standard D4RL benchmarks and demonstrates its effectiveness in complex, multi-modal scenarios.
      [LINK]http://arxiv.org/abs/2412.03258v1
      [DATE]2024-12-04 19:57:36+08:00
      [CATEGORIES]cs.LG
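      The pitfall described in the abstract above, averaging over conflicting actions, can be shown with a one-dimensional toy sketch. The two-component mixture, the `q_value` function, and all helper names are hypothetical; this is not the authors' implementation, only an illustration of selecting a single mode by estimated return.

```python
def q_value(action):
    # Hypothetical return estimate: steering left (-1) or right (+1) around an
    # obstacle is good, going straight (0) is bad; right is slightly better.
    bonus = 0.1 if action > 0 else 0.0
    return 1.0 - (abs(action) - 1.0) ** 2 + bonus

def mixture_mean(weights, means):
    """Action a unimodal policy would imitate: the mean of the mixture."""
    return sum(w * m for w, m in zip(weights, means))

def select_best_mode(means, q_fn):
    """Pick the mode centre with the highest estimated return."""
    return max(means, key=q_fn)

# Assumed output of a Gaussian mixture fit at one state: two equally likely modes.
weights = [0.5, 0.5]
means = [-1.0, 1.0]

averaged = mixture_mean(weights, means)   # falls between the two modes
best = select_best_mode(means, q_value)   # commits to one mode
```

      Here the averaged action lands on the low-return region between the modes, whereas committing to the best mode keeps the action inside the data distribution.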
    • Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges
      [AUTHORS]Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique
      [ABSTRACT]Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural networks, often encompassing dozens of neural network layers and containing billions to trillions of parameters. They are typically trained on vast datasets, utilizing architectures based on transformer blocks. Present-day LLMs are multi-functional, capable of performing a range of tasks from text generation and language translation to question answering, as well as code generation and analysis. An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities, including images, audio, and video. This enhancement empowers MLLMs with capabilities like video editing, image comprehension, and captioning for visual content. This survey provides a comprehensive overview of the recent advancements in LLMs. We begin by tracing the evolution of LLMs and subsequently delve into the advent and nuances of MLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical features, strengths, and limitations. Additionally, we present a comparative analysis of these models and discuss their challenges, potential limitations, and prospects for future development.
      [LINK]http://arxiv.org/abs/2412.03220v1
      [DATE]2024-12-04 19:14:06+08:00
      [CATEGORIES]cs.LG
    • Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting
      [AUTHORS]Adrian B. Chłopowiec, Adam R. Chłopowiec, Krzysztof Galus, Wojciech Cebula, Martin Tabakov
      [ABSTRACT]Limited medical imaging datasets challenge deep learning models by increasing risks of overfitting and reduced generalization, particularly in Generative Adversarial Networks (GANs), where discriminators may overfit, leading to training divergence. This constraint also impairs classification models trained on small datasets. Generative Data Augmentation (GDA) addresses this by expanding training datasets with synthetic data, although it requires training a generative model. We propose and evaluate two local lesion generation approaches to address the challenge of augmenting small medical image datasets. The first approach employs the Poisson Image Editing algorithm, a classical image processing technique, to create realistic image composites that outperform current state-of-the-art methods. The second approach introduces a novel generative method, leveraging a fine-tuned Image Inpainting GAN to synthesize realistic lesions within specified regions of real training images. A comprehensive comparison of the two proposed methods demonstrates that effective local lesion generation in a data-constrained setting allows for reaching new state-of-the-art results in capsule endoscopy lesion classification. Combination of our techniques achieves a macro F1-score of 33.07%, surpassing the previous best result by 7.84 percentage points (p.p.) on the highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule endoscopy. To the best of our knowledge, this work is the first to apply a fine-tuned Image Inpainting GAN for GDA in medical imaging, demonstrating that an image-conditional GAN can be adapted effectively to limited datasets to generate high-quality examples, facilitating effective data augmentation. Additionally, we show that combining this GAN-based approach with classical image processing techniques further improves the results.
      [COMMENTS]54 pages, 35 figures
      [LINK]http://arxiv.org/abs/2411.03098v2
      [DATE]2024-12-04 18:52:25+08:00
      [CATEGORIES]cs.LG
    • RLLTE: Long-Term Evolution Project of Reinforcement Learning
      [AUTHORS]Mingqi Yuan, Zequn Zhang, Yang Xu, Shihao Luo, Bo Li, Xin Jin, Wenjun Zeng
      [ABSTRACT]We present RLLTE: a long-term evolution, extremely modular, and open-source framework for reinforcement learning (RL) research and application. Beyond delivering top-notch algorithm implementations, RLLTE also serves as a toolkit for developing algorithms. More specifically, RLLTE decouples the RL algorithms completely from the exploitation-exploration perspective, providing a large number of components to accelerate algorithm development and evolution. In particular, RLLTE is the first RL framework to build a comprehensive ecosystem, which includes model training, evaluation, deployment, benchmark hub, and large language model (LLM)-empowered copilot. RLLTE is expected to set standards for RL engineering practice and be highly stimulative for industry and academia. Our documentation, examples, and source code are available at https://github.com/RLE-Foundation/rllte.
      [COMMENTS]Proceedings of the AAAI Conference on Artificial Intelligence, 2025
      [LINK]http://arxiv.org/abs/2309.16382v2
      [DATE]2024-12-04 18:27:58+08:00
      [CATEGORIES]cs.LG
    • Semi-decentralized Training of Spatio-Temporal Graph Neural Networks for Traffic Prediction
      [AUTHORS]Ivan Kralj, Lodovico Giaretta, Gordan Ježić, Ivana Podnar Žarko, Šarūnas Girdzijauskas
      [ABSTRACT]In smart mobility, large networks of geographically distributed sensors produce vast amounts of high-frequency spatio-temporal data that must be processed in real time to avoid major disruptions. Traditional centralized approaches are increasingly unsuitable for this task, as they struggle to scale with expanding sensor networks, and reliability issues in central components can easily affect the whole deployment. To address these challenges, we explore and adapt semi-decentralized training techniques for Spatio-Temporal Graph Neural Networks (ST-GNNs) in the smart mobility domain. We implement a simulation framework where sensors are grouped by proximity into multiple cloudlets, each handling a subgraph of the traffic graph, fetching node features from other cloudlets to train its own local ST-GNN model, and exchanging model updates with other cloudlets to ensure consistency, enhancing scalability and removing reliance on a centralized aggregator. We perform an extensive comparative evaluation of four different ST-GNN training setups (centralized, traditional FL, server-free FL, and Gossip Learning) on two large-scale traffic datasets, METR-LA and PeMS-BAY, for short-, mid-, and long-term vehicle speed predictions. Experimental results show that semi-decentralized setups are comparable to centralized approaches in performance metrics, while offering advantages in terms of scalability and fault tolerance. In addition, we highlight often overlooked issues in existing literature for distributed ST-GNNs, such as the variation in model performance across different geographical areas due to region-specific traffic patterns, and the significant communication overhead and computational costs that arise from the large receptive field of GNNs, leading to substantial data transfers and increased computation of partial embeddings.
      [COMMENTS]8 pages, 4 figures, 3 tables, conference
      [LINK]http://arxiv.org/abs/2412.03188v1
      [DATE]2024-12-04 18:20:21+08:00
      [CATEGORIES]cs.LG
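      The exchange of model updates between cloudlets without a central aggregator, as described in the abstract above, can be sketched as gossip-style pairwise averaging. This is a deliberately simplified illustration: the parameter vectors, the number of cloudlets, and the helper names are hypothetical, and local training steps between exchanges are omitted.

```python
import random

def gossip_round(models, rng):
    """Two randomly chosen cloudlets replace their parameters with the pair average."""
    i, j = rng.sample(range(len(models)), 2)
    avg = [(a + b) / 2.0 for a, b in zip(models[i], models[j])]
    models[i] = list(avg)
    models[j] = list(avg)

def spread(models):
    """Largest per-coordinate gap between any two cloudlet models."""
    dims = len(models[0])
    return max(
        max(m[d] for m in models) - min(m[d] for m in models)
        for d in range(dims)
    )

rng = random.Random(0)  # fixed seed so the run is reproducible
# Four hypothetical cloudlets, each holding a 2-parameter local model.
models = [[float(k), float(-k)] for k in range(4)]

initial_spread = spread(models)
for _ in range(200):
    gossip_round(models, rng)
final_spread = spread(models)
mean_first_param = sum(m[0] for m in models) / len(models)
```

      Pairwise averaging preserves the global mean of the parameters while shrinking disagreement, which is why no central server is needed for the models to reach consensus.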
    • A path-norm toolkit for modern networks: consequences, promises and challenges
      [AUTHORS]Antoine Gonon, Nicolas Brisebarre, Elisa Riccietti, Rémi Gribonval
      [ABSTRACT]This work introduces the first toolkit around path-norms that fully encompasses general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on layered fully-connected networks compared to the product of operator norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.
      [COMMENTS]Erratum: in the published version there was a typo in the definition of the activation matrix in Definition A.3. This is fixed with this new version
      [LINK]http://arxiv.org/abs/2310.01225v5
      [DATE]2024-12-04 18:04:02+08:00
      [CATEGORIES]cs.LG
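      For intuition about the quantity discussed in the abstract above, the following minimal sketch computes the classical L1 path-norm in the simplest setting only: a layered, bias-free, fully-connected ReLU network. The function name and the example weights are hypothetical; the toolkit in the paper covers far more general DAG networks, which this sketch does not attempt.

```python
def l1_path_norm(weights):
    """Sum over all input-output paths of the product of absolute weights.

    weights is a list of layer matrices; weights[k][i][j] is the weight from
    unit j of layer k to unit i of layer k + 1. The sum over paths equals
    1^T |W_L| ... |W_1| 1, so a single forward pass with absolute weights and
    an all-ones input is enough.
    """
    v = [1.0] * len(weights[0][0])
    for layer in weights:
        v = [sum(abs(w) * x for w, x in zip(row, v)) for row in layer]
    return sum(v)

# Two inputs, two hidden units, one output (illustrative weights).
w1 = [[1.0, -2.0], [3.0, 4.0]]
w2 = [[-1.0, 2.0]]
norm = l1_path_norm([w1, w2])

# Rescaling symmetry: multiply the first hidden unit's incoming weights by 2
# and divide its outgoing weight by 2. The network function is unchanged, and
# so is the path-norm.
w1_rescaled = [[2.0, -4.0], [3.0, 4.0]]
w2_rescaled = [[-0.5, 2.0]]
norm_rescaled = l1_path_norm([w1_rescaled, w2_rescaled])
```

      The single forward pass reflects the ease of computation mentioned in the abstract, and the rescaling check reflects the invariance under network symmetries.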
    • Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation
      [AUTHORS]Gianni Franchi, Dat Nguyen Trong, Nacim Belkhir, Guoxuan Xia, Andrea Pilzer
      [ABSTRACT]Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability. In this paper, we are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. Alongside adapting existing approaches designed to measure uncertainty in the image space, we also introduce Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method leveraging Large Vision-Language Models (LVLMs) to better address uncertainties arising from the semantics of the prompt and generated images. PUNC utilizes a LVLM to caption a generated image, and then compares the caption with the original prompt in the more semantically meaningful text space. PUNC also enables the disentanglement of both aleatoric and epistemic uncertainties via precision and recall, which image-space approaches are unable to do. Extensive experiments demonstrate that PUNC outperforms state-of-the-art uncertainty estimation techniques across various settings. Uncertainty quantification in text-to-image generation models can be used on various applications including bias detection, copyright protection, and OOD detection. We also introduce a comprehensive dataset of text prompts and generation pairs to foster further research in uncertainty quantification for generative models. Our findings illustrate that PUNC not only achieves competitive performance but also enables novel applications in evaluating and improving the trustworthiness of text-to-image models.
      [COMMENTS]28 pages and 22 figures
      [LINK]http://arxiv.org/abs/2412.03178v1
      [DATE]2024-12-04 18:03:52+08:00
      [CATEGORIES]cs.LG
    • Exploration of Parameter Spaces Assisted by Machine Learning
      [AUTHORS]A. Hammad, Myeonghun Park, Raymundo Ramos, Pankaj Saha
      [ABSTRACT]We demonstrate two sampling procedures assisted by machine learning models via regression and classification. The main objective is the use of a neural network to suggest points likely inside regions of interest, reducing the number of evaluations of time consuming calculations. We compare results from this approach with results from other sampling methods, namely Markov chain Monte Carlo and MultiNest, obtaining results that range from comparably similar to arguably better. In particular, we augment our classifier method with a boosting technique that rapidly increases the efficiency within a few iterations. We show results from our methods applied to a toy model and the type II 2HDM, using 3 and 7 free parameters, respectively. The code used for this paper and instructions are publicly available on the web.
      [COMMENTS]30 pages, 9 figures. Matches published version. Code and instructions are available on https://github.com/AHamamd150/MLscanner
      [LINK]http://arxiv.org/abs/2207.09959v4
      [DATE]2024-12-04 17:45:26+08:00
      [CATEGORIES]cs.LG
    • Testing Neural Network Verifiers: A Soundness Benchmark with Hidden Counterexamples
      [AUTHORS]Xingjian Zhou, Hongji Xu, Andy Xu, Zhouxing Shi, Cho-Jui Hsieh, Huan Zhang
      [ABSTRACT]In recent years, many neural network (NN) verifiers have been developed to formally verify certain properties of neural networks such as robustness. Although many benchmarks have been constructed to evaluate the performance of NN verifiers, they typically lack a ground-truth for hard instances where no current verifier can verify and no counterexample can be found, which makes it difficult to check the soundness of a new verifier if it claims to verify hard instances which no other verifier can do. We propose to develop a soundness benchmark for NN verification. Our benchmark contains instances with deliberately inserted counterexamples while we also try to hide the counterexamples from regular adversarial attacks which can be used for finding counterexamples. We design a training method to produce neural networks with such hidden counterexamples. Our benchmark aims to be used for testing the soundness of NN verifiers and identifying falsely claimed verifiability when it is known that hidden counterexamples exist. We systematically construct our benchmark and generate instances across diverse model architectures, activation functions, input sizes, and perturbation radii. We demonstrate that our benchmark successfully identifies bugs in state-of-the-art NN verifiers, as well as synthetic bugs, providing a crucial step toward enhancing the reliability of testing NN verifiers. Our code is available at https://github.com/MVP-Harry/SoundnessBench and our benchmark is available at https://huggingface.co/datasets/SoundnessBench/SoundnessBench.
      [COMMENTS]Preprint
      [LINK]http://arxiv.org/abs/2412.03154v1
      [DATE]2024-12-04 17:24:33+08:00
      [CATEGORIES]cs.LG
    • Advanced Risk Prediction and Stability Assessment of Banks Using Time Series Transformer Models
      [AUTHORS]Wenying Sun, Zhen Xu, Wenqing Zhang, Kunyuan Ma, You Wu, Mengfang Sun
      [ABSTRACT]This paper aims to study the prediction of the bank stability index based on the Time Series Transformer model. The bank stability index is an important indicator to measure the health status and risk resistance of financial institutions. Traditional prediction methods are difficult to adapt to complex market changes because they rely on single-dimensional macroeconomic data. This paper proposes a prediction framework based on the Time Series Transformer, which uses the self-attention mechanism of the model to capture the complex temporal dependencies and nonlinear relationships in financial data. Through experiments, we compare the model with LSTM, GRU, CNN, TCN and RNN-Transformer models. The experimental results show that the Time Series Transformer model outperforms other models in both mean square error (MSE) and mean absolute error (MAE) evaluation indicators, showing strong prediction ability. This shows that the Time Series Transformer model can better handle multidimensional time series data in bank stability prediction, providing new technical approaches and solutions for financial risk management.
      [LINK]http://arxiv.org/abs/2412.03606v1
      [DATE]2024-12-04 16:15:27+08:00
      [CATEGORIES]cs.LG
    • DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries
      [AUTHORS]Hanqun Cao, Mutian He, Ning Ma, Chang-yu Hsieh, Chunbin Gu, Pheng-Ann Heng
      [ABSTRACT]DNA-encoded library (DEL) screening has revolutionized the detection of protein-ligand interactions through read counts, enabling rapid exploration of vast chemical spaces. However, noise in read counts, stemming from nonspecific interactions, can mislead this exploration process. We present DEL-Ranking, a novel distribution-correction denoising framework that addresses these challenges. Our approach introduces two key innovations: (1) a novel ranking loss that rectifies relative magnitude relationships between read counts, enabling the learning of causal features determining activity levels, and (2) an iterative algorithm employing self-training and consistency loss to establish model coherence between activity label and read count predictions. Furthermore, we contribute three new DEL screening datasets, the first to comprehensively include multi-dimensional molecular representations, protein-ligand enrichment values, and their activity labels. These datasets mitigate data scarcity issues in AI-driven DEL screening research. Rigorous evaluation on diverse DEL datasets demonstrates DEL-Ranking's superior performance across multiple correlation metrics, with significant improvements in binding affinity prediction accuracy. Our model exhibits zero-shot generalization ability across different protein targets and successfully identifies potential motifs determining compound binding affinity. This work advances DEL screening analysis and provides valuable resources for future research in this area.
      [LINK]http://arxiv.org/abs/2410.14946v2
      [DATE]2024-12-04 15:58:40+08:00
      [CATEGORIES]cs.LG
    • UTSD: Unified Time Series Diffusion Model
      [AUTHORS]Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu
      [ABSTRACT]Transformer-based architectures have achieved unprecedented success in time series analysis. However, when facing the challenge of cross-domain modeling, existing studies that utilize statistical priors as prompt engineering fail under the huge distribution shift among various domains. In this paper, a Unified Time Series Diffusion (UTSD) model is established for the first time to model the multi-domain probability distribution, utilizing the powerful probability distribution modeling ability of Diffusion. Unlike the autoregressive models that capture the conditional probabilities of the prediction horizon given the historical sequence, we use a diffusion denoising process to model the mixture distribution of the cross-domain data and generate the prediction sequence for the target domain directly utilizing conditional sampling. The proposed UTSD contains three pivotal designs: (1) The condition network captures the multi-scale fluctuation patterns from the observation sequence, which are utilized as context representations to guide the denoising network to generate the prediction sequence; (2) An adapter-based fine-tuning strategy, in which the multi-domain universal representation learned in the pretraining stage is utilized for downstream tasks in target domains; (3) The diffusion and denoising process on the actual sequence space, combined with the improved classifier-free guidance as the conditional generation strategy, greatly improves the stability and accuracy of the downstream task. We conduct extensive experiments on mainstream benchmarks, and the pre-trained UTSD outperforms existing foundation models on all data domains, exhibiting superior zero-shot generalization ability. After training from scratch, UTSD achieves comparable performance against domain-specific proprietary models. The empirical results validate the potential of UTSD as a time series foundational model.
      [LINK]http://arxiv.org/abs/2412.03068v1
      [DATE]2024-12-04 14:42:55+08:00
      [CATEGORIES]cs.LG
    • Point-GN: A Non-Parametric Network Using Gaussian Positional Encoding for Point Cloud Classification
      [AUTHORS]Marzieh Mohammadi, Amir Salarpour
      [ABSTRACT]This paper introduces Point-GN, a novel non-parametric network for efficient and accurate 3D point cloud classification. Unlike conventional deep learning models that rely on a large number of trainable parameters, Point-GN leverages non-learnable components, specifically Farthest Point Sampling (FPS), k-Nearest Neighbors (k-NN), and Gaussian Positional Encoding (GPE), to extract both local and global geometric features. This design eliminates the need for additional training while maintaining high performance, making Point-GN particularly suited for real-time, resource-constrained applications. We evaluate Point-GN on two benchmark datasets, ModelNet40 and ScanObjectNN, achieving classification accuracies of 85.29% and 85.89%, respectively, while significantly reducing computational complexity. Point-GN outperforms existing non-parametric methods and matches the performance of fully trained models, all with zero learnable parameters. Our results demonstrate that Point-GN is a promising solution for 3D point cloud classification in practical, real-time environments.
      [COMMENTS]This paper has been accepted for presentation at the IEEE Winter Conference on Applications of Computer Vision (WACV) 2025
      [LINK]http://arxiv.org/abs/2412.03056v1
      [DATE]2024-12-04 14:20:51+08:00
      [CATEGORIES]cs.LG
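      Of the non-learnable components named in the abstract above, Farthest Point Sampling is the easiest to show in isolation. The sketch below is a generic textbook version in plain Python, not the authors' code; starting from index 0 is an assumption (implementations often start from a random point), and the sample points are hypothetical.

```python
def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    selected = [start]
    # Squared distance from every point to its nearest already-selected point.
    min_d = [sqdist(p, points[start]) for p in points]
    while len(selected) < k:
        # Next sample: the point farthest from the current selection.
        nxt = max(range(len(points)), key=min_d.__getitem__)
        selected.append(nxt)
        min_d = [min(d, sqdist(p, points[nxt])) for d, p in zip(min_d, points)]
    return selected

# Four hypothetical 3D points lying on a line.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
picked = farthest_point_sampling(cloud, 3)
```

      Because every step is a fixed geometric rule, nothing here is learned, which matches the zero-parameter design the abstract describes.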
    • Phased Consistency Models
      [AUTHORS]Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Xiaogang Wang, Hongsheng Li
      [ABSTRACT]Consistency Models (CMs) have made significant progress in accelerating the generation of diffusion models. However, their application to high-resolution, text-conditioned image generation in the latent space remains unsatisfactory. In this paper, we identify three key flaws in the current design of Latent Consistency Models (LCMs). We investigate the reasons behind these limitations and propose Phased Consistency Models (PCMs), which generalize the design space and address the identified limitations. Our evaluations demonstrate that PCMs outperform LCMs across 1-16 step generation settings. While PCMs are specifically designed for multi-step refinement, they achieve comparable 1-step generation results to previously state-of-the-art specifically designed 1-step methods. Furthermore, we show the methodology of PCMs is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. Our code is available at https://github.com/G-U-N/Phased-Consistency-Model.
      [COMMENTS]NeurIPS 2024
      [LINK]http://arxiv.org/abs/2405.18407v2
      [DATE]2024-12-04 14:18:16+08:00
      [CATEGORIES]cs.LG
    • Stable Consistency Tuning: Understanding and Improving Consistency Models
      [AUTHORS]Fu-Yun Wang, Zhengyang Geng, Hongsheng Li
      [ABSTRACT]Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as value estimation through Temporal Difference (TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Built upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID 1.55, a new SoTA for consistency models.
      [COMMENTS]Code is available at https://github.com/G-U-N/Stable-Consistency-Tuning
      [LINK]http://arxiv.org/abs/2410.18958v3
      [DATE]2024-12-04 13:04:42+08:00
      [CATEGORIES]cs.LG
    • Data Acquisition for Improving Model Fairness using Reinforcement Learning
      [AUTHORS]Jahid Hasan, Romila Pradhan
      [ABSTRACT]Machine learning systems are increasingly being used in critical decision making such as healthcare, finance, and criminal justice. Concerns around their fairness have resulted in several bias mitigation techniques that emphasize the need for high-quality data to ensure fairer decisions. However, the role of earlier stages of machine learning pipelines in mitigating model bias has not been explored well. In this paper, we focus on the task of acquiring additional labeled data points for training the downstream machine learning model to rapidly improve its fairness. Since not all data points in a data pool are equally beneficial to the task of fairness, we generate an ordering in which data points should be acquired. We present DataSift, a data acquisition framework based on the idea of data valuation that relies on partitioning and multi-armed bandits to determine the most valuable data points to acquire. Over several iterations, DataSift selects a partition and randomly samples a batch of data points from the selected partition, evaluates the benefit of acquiring the batch on model fairness, and updates the utility of partitions depending on the benefit. To further improve the effectiveness and efficiency of evaluating batches, we leverage influence functions that estimate the effect of acquiring a batch without retraining the model. We empirically evaluate DataSift on several real-world and synthetic datasets and show that the fairness of a machine learning model can be significantly improved even while acquiring a few data points.
      [COMMENTS]19 pages, 9 figures
      [LINK]http://arxiv.org/abs/2412.03009v1
      [DATE]2024-12-04 11:56:54+08:00
      [CATEGORIES]cs.LG
    • SurvMamba: State Space Model with Multi-grained Multi-modal Interaction for Survival Prediction
      [AUTHORS]Ying Chen, Jiajing Xie, Yuxiang Lin, Yuhang Song, Wenxian Yang, Rongshan Yu
      [ABSTRACT]Multi-modal learning that combines pathological images with genomic data has significantly enhanced the accuracy of survival prediction. Nevertheless, existing methods have not fully utilized the inherent hierarchical structure within both whole slide images (WSIs) and transcriptomic data, from which better intra-modal representations and inter-modal integration could be derived. Moreover, many existing studies attempt to improve multi-modal representations through attention mechanisms, which inevitably lead to high complexity when processing high-dimensional WSIs and transcriptomic data. Recently, a structured state space model named Mamba emerged as a promising approach for its superior performance in modeling long sequences with low complexity. In this study, we propose Mamba with multi-grained multi-modal interaction (SurvMamba) for survival prediction. SurvMamba is implemented with a Hierarchical Interaction Mamba (HIM) module that facilitates efficient intra-modal interactions at different granularities, thereby capturing more detailed local features as well as rich global representations. In addition, an Interaction Fusion Mamba (IFM) module is used for cascaded inter-modal interactive fusion, yielding more comprehensive features for survival prediction. Comprehensive evaluations on five TCGA datasets demonstrate that SurvMamba outperforms other existing methods in terms of performance and computational cost.
      [LINK]http://arxiv.org/abs/2404.08027v2
      [DATE]2024-12-04 10:57:03+08:00
      [CATEGORIES]cs.LG
    • Unified Inductive Logic: From Formal Learning to Statistical Inference to Supervised Learning
      [AUTHORS]Hanti Lin
      [ABSTRACT]While the traditional conception of inductive logic is Carnapian, I develop a Peircean alternative and use it to unify formal learning theory, statistics, and a significant part of machine learning: supervised learning. Some crucial standards for evaluating non-deductive inferences have been assumed separately in those areas, but can actually be justified by a unifying principle.
      [LINK]http://arxiv.org/abs/2412.02969v1
      [DATE]2024-12-04 10:31:31+08:00
      [CATEGORIES]cs.LG
    • How Many Ratings per Item are Necessary for Reliable Significance Testing?
      [AUTHORS]Christopher Homan, Flip Korn, Chris Welty
      [ABSTRACT]Most approaches to machine learning evaluation assume that machine and human responses are repeatable enough to be measured against data with unitary, authoritative, "gold standard" responses, via simple metrics such as accuracy, precision, and recall that assume scores are independent given the test item. However, AI models have multiple sources of stochasticity and the human raters who create gold standards tend to disagree with each other, often in meaningful ways, hence a single output response per input item may not provide enough information. We introduce methods for determining whether an (existing or planned) evaluation dataset has enough responses per item to reliably compare the performance of one model to another. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that there are usually not enough responses per item to reliably compare the performance of one model against another. Our methods also allow us to estimate the number of responses per item for hypothetical datasets with similar response distributions to the existing datasets we study. When two models are very far apart in their predictive performance, fewer raters are needed to confidently compare them, as expected. However, as the models draw closer, we find that a larger number of raters than is currently typical in annotation collection is needed to ensure that the power analysis correctly reflects the difference in performance.
      [LINK]http://arxiv.org/abs/2412.02968v1
      [DATE]2024-12-04 10:31:28+08:00
      [CATEGORIES]cs.LG
    • RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data
      [AUTHORS]Maxwell A. Xu, Jaya Narain, Gregory Darnell, Haraldur Hallgrimsson, Hyewon Jeong, Darren Forde, Richard Fineman, Karthik J. Raghuram, James M. Rehg, Shirley Ren
      [ABSTRACT]We present RelCon, a novel self-supervised *Rel*ative *Con*trastive learning approach that uses a learnable distance measure in combination with a softened contrastive loss for training a motion foundation model from wearable sensors. The learnable distance measure captures motif similarity and domain-specific semantic information such as rotation invariance. The learned distance provides a measurement of semantic similarity between a pair of accelerometer time-series segments, which is used to measure the distance between an anchor and various other sampled candidate segments. The self-supervised model is trained on 1 billion segments from 87,376 participants from a large wearables dataset. The model achieves strong performance across multiple downstream tasks, encompassing both classification and regression. To our knowledge, we are the first to show the generalizability of a self-supervised learning model with motion data from wearables across distinct evaluation tasks.
      [LINK]http://arxiv.org/abs/2411.18822v2
      [DATE]2024-12-04 09:56:07+08:00
      [CATEGORIES]cs.LG
    • Zero-Shot Relational Learning for Multimodal Knowledge Graphs
      [AUTHORS]Rui Cai, Shichao Pei, Xiangliang Zhang
      [ABSTRACT]Relational learning is an essential task in the domain of knowledge representation, particularly in knowledge graph completion (KGC). While relational learning in traditional single-modal settings has been extensively studied, exploring it within a multimodal KGC context presents distinct challenges and opportunities. One of the major challenges is inference on newly discovered relations without any associated training data. This zero-shot relational learning scenario poses unique requirements for multimodal KGC, i.e., utilizing multimodality to facilitate relational learning. However, existing works fail to leverage multimodal information and leave the problem unexplored. In this paper, we propose a novel end-to-end framework, consisting of three components, i.e., multimodal learner, structure consolidator, and relation embedding generator, to integrate diverse multimodal information and knowledge graph structures to facilitate zero-shot relational learning. Evaluation results on three multimodal knowledge graphs demonstrate the superior performance of our proposed method.
      [COMMENTS]In the Proceedings of the 2024 IEEE International Conference on Big Data (IEEE BigData 2024)
      [LINK]http://arxiv.org/abs/2404.06220v2
      [DATE]2024-12-04 09:47:08+08:00
      [CATEGORIES]cs.LG
    • Inverse Delayed Reinforcement Learning
      [AUTHORS]Simon Sinong Zhan, Qingyuan Wu, Zhian Ruan, Frank Yang, Philip Wang, Yixuan Wang, Ruochen Jiao, Chao Huang, Qi Zhu
      [ABSTRACT]Inverse Reinforcement Learning (IRL) has demonstrated effectiveness in a variety of imitation tasks. In this paper, we introduce an IRL framework designed to extract rewarding features from expert trajectories affected by delayed disturbances. Instead of relying on direct observations, our approach employs an efficient off-policy adversarial training framework to derive expert features and recover optimal policies from augmented delayed observations. Empirical evaluations in the MuJoCo environment under diverse delay settings validate the effectiveness of our method. Furthermore, we provide a theoretical analysis showing that recovering expert policies from augmented delayed observations outperforms using direct delayed observations.
      [LINK]http://arxiv.org/abs/2412.02931v1
      [DATE]2024-12-04 08:53:55+08:00
      [CATEGORIES]cs.LG
    • HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond
      [AUTHORS]Stefan Abi-Karam, Rishov Sarkar, Allison Seigler, Sean Lowe, Zhigang Wei, Hanqiu Chen, Nanditha Rao, Lizy John, Aman Arora, Cong Hao
      [ABSTRACT]Machine learning (ML) techniques have been applied to high-level synthesis (HLS) flows for quality-of-result (QoR) prediction and design space exploration (DSE). Nevertheless, the scarcity of accessible high-quality HLS datasets and the complexity of building such datasets present challenges. Existing datasets have limitations in terms of benchmark coverage, design space enumeration, vendor extensibility, or lack of reproducible and extensible software for dataset construction. Many works also lack user-friendly ways to add more designs, limiting wider adoption of such datasets. In response to these challenges, we introduce HLSFactory, a comprehensive framework designed to facilitate the curation and generation of high-quality HLS design datasets. HLSFactory has three main stages: 1) a design space expansion stage to elaborate single HLS designs into large design spaces using various optimization directives across multiple vendor tools, 2) a design synthesis stage to execute HLS and FPGA tool flows concurrently across designs, and 3) a data aggregation stage for extracting standardized data into packaged datasets for ML usage. This tripartite architecture ensures broad design space coverage via design space expansion and supports multiple vendor tools. Users can contribute to each stage with their own HLS designs and synthesis results and extend the framework itself with custom frontends and tool flows. We also include an initial set of built-in designs from common HLS benchmarks and curated open-source HLS designs. We showcase the versatility and multi-functionality of our framework through seven case studies: I) ML model for QoR prediction; II) Design space sampling; III) Fine-grained parallelism backend speedup; IV) Targeting Intel's HLS flow; V) Adding new auxiliary designs; VI) Integrating published HLS data; VII) HLS tool version regression benchmarking.
      [COMMENTS]MLCAD 2024 version of the paper. New case study with ML QoR prediction. Artifact evaluation details included
      [LINK]http://arxiv.org/abs/2405.00820v3
      [DATE]2024-12-04 07:30:43+08:00
      [CATEGORIES]cs.LG
    • MACAW: A Causal Generative Model for Medical Imaging
      [AUTHORS]Vibujithan Vigneshwaran, Erik Ohara, Matthias Wilms, Nils Forkert
      [ABSTRACT]Although deep learning techniques show promising results for many neuroimaging tasks in research settings, they have not yet found widespread use in clinical scenarios. One of the reasons for this problem is that many machine learning models only identify correlations between the input images and the outputs of interest, which can lead to many practical problems, such as encoding of uninformative biases and reduced explainability. Thus, recent research is exploring if integrating a priori causal knowledge into deep learning models is a potential avenue to identify these problems. This work introduces a new causal generative architecture named Masked Causal Flow (MACAW) for neuroimaging applications. Within this context, three main contributions are described. First, a novel approach that integrates complex causal structures into normalizing flows is proposed. Second, counterfactual prediction is performed to identify the changes in effect variables associated with a cause variable. Finally, an explicit Bayesian inference for classification is derived and implemented, providing an inherent uncertainty estimation. The feasibility of the proposed method was first evaluated using synthetic data and then using MRI brain data from more than 23000 participants of the UK biobank study. The evaluation results show that the proposed method can (1) accurately encode causal reasoning and generate counterfactuals highlighting the structural changes in the brain known to be associated with aging, (2) accurately predict a subject's age from a single 2D MRI slice, and (3) generate new samples assuming other values for subject-specific indicators such as age, sex, and body mass index. The code for a toy dataset is available at the following link: https://github.com/vibujithan/macaw-2D.git.
      [COMMENTS]27 pages
      [LINK]http://arxiv.org/abs/2412.02900v1
      [DATE]2024-12-04 07:05:41+08:00
      [CATEGORIES]cs.LG
    • Modeling and Discovering Direct Causes for Predictive Models
      [AUTHORS]Yizuo Chen, Amit Bhatia
      [ABSTRACT]We introduce a causal modeling framework that captures the input-output behavior of predictive models (e.g., machine learning models) by representing it using causal graphs. The framework enables us to define and identify features that directly cause the predictions, which has broad implications for data collection and model evaluation. We show two assumptions under which the direct causes can be discovered from data, one of which further simplifies the discovery process. In addition to providing sound and complete algorithms, we propose an optimization technique based on an independence rule that can be integrated with the algorithms to speed up the discovery process both theoretically and empirically.
      [LINK]http://arxiv.org/abs/2412.02878v1
      [DATE]2024-12-04 06:25:42+08:00
      [CATEGORIES]cs.LG
    • Scale Invariance of Graph Neural Networks
      [AUTHORS]Qin Jiang, Chengjia Wang, Michael Lones, Wei Pang
      [ABSTRACT]We address two fundamental challenges in Graph Neural Networks (GNNs): (1) the lack of theoretical support for invariance learning, a critical property in image processing, and (2) the absence of a unified model capable of excelling on both homophilic and heterophilic graph datasets. To tackle these issues, we establish and prove scale invariance in graphs, extending this key property to graph learning, and validate it through experiments on real-world datasets. Leveraging directed multi-scaled graphs and an adaptive self-loop strategy, we propose ScaleNet, a unified network architecture that achieves state-of-the-art performance across four homophilic and two heterophilic benchmark datasets. Furthermore, we show that through graph transformation based on scale invariance, uniform weights can replace computationally expensive edge weights in digraph inception networks while maintaining or improving performance. For another popular GNN approach to digraphs, we demonstrate the equivalence between Hermitian Laplacian methods and GraphSAGE with incidence normalization. ScaleNet bridges the gap between homophilic and heterophilic graph learning, offering both theoretical insights into scale invariance and practical advancements in unified graph learning. Our implementation is publicly available at https://github.com/Qin87/ScaleNet/tree/Aug23.
      [COMMENTS]Add theoretical proof. arXiv admin note: substantial text overlap with arXiv:2411.08758
      [LINK]http://arxiv.org/abs/2411.19392v2
      [DATE]2024-12-04 06:08:55+08:00
      [CATEGORIES]cs.LG
    • Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation
      [AUTHORS]Adam Fisch, Joshua Maynez, R. Alex Hofer, Bhuwan Dhingra, Amir Globerson, William W. Cohen
      [ABSTRACT]Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. PPI achieves this by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate (but potentially biased) automatic system, in a way that results in tighter confidence intervals for certain parameters of interest (e.g., the mean performance of a language model). In this paper, we propose a method called Stratified Prediction-Powered Inference (StratPPI), in which we show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies. Without making any assumptions on the underlying automatic labeling system or data distribution, we derive an algorithm for computing provably valid confidence intervals for population parameters (such as averages) that is based on stratified sampling. In particular, we show both theoretically and empirically that, with appropriate choices of stratification and sample allocation, our approach can provide substantially tighter confidence intervals than unstratified approaches. Specifically, StratPPI is expected to improve in cases where the performance of the autorater varies across different conditional distributions of the target data.
      [LINK]http://arxiv.org/abs/2406.04291v2
      [DATE]2024-12-04 05:59:32+08:00
      [CATEGORIES]cs.LG
    • Is Large-Scale Pretraining the Secret to Good Domain Generalization?
      [AUTHORS]Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Bryan A. Plummer, Kate Saenko
      [ABSTRACT]Multi-Source Domain Generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains. Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, and this has dramatically improved benchmark results. However, it remains unclear if DG finetuning methods are becoming better over time, or if improved benchmark performance is simply an artifact of stronger pre-training. Prior studies have shown that perceptual similarity to pre-training data correlates with zero-shot performance, but we find the effect limited in the DG setting. Instead, we posit that having perceptually similar data in pretraining is not enough; and that it is how well these data were learned that determines performance. This leads us to introduce the Alignment Hypothesis, which states that the final DG performance will be high if and only if alignment of image and class label text embeddings is high. Our experiments confirm the Alignment Hypothesis is true, and we use it as an analysis tool of existing DG methods evaluated on DomainBed datasets by splitting evaluation data into In-pretraining (IP) and Out-of-pretraining (OOP). We show that all evaluated DG methods struggle on DomainBed-OOP, while recent methods excel on DomainBed-IP. Put together, our findings highlight the need for DG methods which can generalize beyond pretraining alignment.
      [LINK]http://arxiv.org/abs/2412.02856v1
      [DATE]2024-12-04 05:43:11+08:00
      [CATEGORIES]cs.LG
    • Optimized CNNs for Rapid 3D Point Cloud Object Recognition
      [AUTHORS]Tianyi Lyu, Dian Gu, Peiyuan Chen, Yaoting Jiang, Zhenhong Zhang, Huadong Pang, Li Zhou, Yiping Dong
      [ABSTRACT]This study introduces a method for efficiently detecting objects within 3D point clouds using convolutional neural networks (CNNs). Our approach adopts a unique feature-centric voting mechanism to construct convolutional layers that capitalize on the typical sparsity observed in input data. We explore the trade-off between accuracy and speed across diverse network architectures and advocate for integrating an $\mathcal{L}_1$ penalty on filter activations to augment sparsity within intermediate layers. This research pioneers the proposal of sparse convolutional layers combined with $\mathcal{L}_1$ regularization to effectively handle large-scale 3D data processing. Our method's efficacy is demonstrated on the MVTec 3D-AD object detection benchmark. The Vote3Deep models, with just three layers, outperform the previous state-of-the-art in both laser-only approaches and combined laser-vision methods. Additionally, they maintain competitive processing speeds. This underscores our approach's capability to substantially enhance detection performance while ensuring computational efficiency suitable for real-time applications.
      [COMMENTS]15 pages
      [LINK]http://arxiv.org/abs/2412.02855v1
      [DATE]2024-12-04 05:42:30+08:00
      [CATEGORIES]cs.LG
    • An L-BFGS-B approach for linear and nonlinear system identification under $\ell_1$ and group-Lasso regularization
      [AUTHORS]Alberto Bemporad
      [ABSTRACT]In this paper, we propose a very efficient numerical method based on the L-BFGS-B algorithm for identifying linear and nonlinear discrete-time state-space models, possibly under $\ell_1$ and group-Lasso regularization for reducing model complexity. For the identification of linear models, we show that, compared to classical linear subspace methods, the approach often provides better results, is much more general in terms of the loss and regularization terms used (such as penalties for enforcing system stability), and is also more stable from a numerical point of view. The proposed method not only enriches the existing set of linear system identification tools but can also be applied to identifying a very broad class of parametric nonlinear state-space models, including recurrent neural networks. We illustrate the approach on synthetic and experimental datasets and apply it to solve a challenging industrial robot benchmark for nonlinear multi-input/multi-output system identification. A Python implementation of the proposed identification method is available in the package jax-sysid, available at https://github.com/bemporad/jax-sysid.
      [COMMENTS]23 pages, 4 figures
      [LINK]http://arxiv.org/abs/2403.03827v3
      [DATE]2024-12-04 04:45:59+08:00
      [CATEGORIES]cs.LG
    • Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation
      [AUTHORS]Raphael Ruschel, Md Awsafur Rahman, Hardik Prajapati, Suya You, B. S. Manjuanth
      [ABSTRACT]Understanding video content is pivotal for advancing real-world applications like activity recognition, autonomous systems, and human-computer interaction. While scene graphs are adept at capturing spatial relationships between objects in individual frames, extending these representations to capture dynamic interactions across video sequences remains a significant challenge. To address this, we present TCDSG, Temporally Consistent Dynamic Scene Graphs, an innovative end-to-end framework that detects, tracks, and links subject-object relationships across time, generating action tracklets, temporally consistent sequences of entities and their interactions. Our approach leverages a novel bipartite matching mechanism, enhanced by adaptive decoder queries and feedback loops, ensuring temporal coherence and robust tracking over extended sequences. This method not only establishes a new benchmark by achieving over 60% improvement in temporal recall@k on the Action Genome, OpenPVSG, and MEVA datasets but also pioneers the augmentation of MEVA with persistent object ID annotations for comprehensive tracklet generation. By seamlessly integrating spatial and temporal dynamics, our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.
      [LINK]http://arxiv.org/abs/2412.02808v1
      [DATE]2024-12-04 04:19:20+08:00
      [CATEGORIES]cs.LG
    • Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code
      [AUTHORS]Timur Galimzyanov, Sergey Titov, Yaroslav Golubev, Egor Bogomolov
      [ABSTRACT]This paper introduces the human-curated PandasPlotBench dataset, designed to evaluate language models' effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code for visualizing tabular data, such as a Pandas DataFrame, based on natural language instructions, complementing current evaluation tools and expanding their scope. The dataset includes 175 unique tasks. Our experiments assess several leading Large Language Models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. We show that shortening the task descriptions has a minimal effect on plotting capabilities, allowing for a user interface that accommodates concise user input without sacrificing functionality or accuracy. Another of our findings reveals that while LLMs perform well with popular libraries like Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for improvement. We hope that the modular design of our benchmark will broaden the current studies on generating visualizations. Our benchmark is available online: https://huggingface.co/datasets/JetBrains-Research/plot_bench. The code for running the benchmark is also available: https://github.com/JetBrains-Research/PandasPlotBench.
      [COMMENTS]5 pages
      [LINK]http://arxiv.org/abs/2412.02764v1
      [DATE]2024-12-04 03:05:37+08:00
      [CATEGORIES]cs.LG
    • Go beyond End-to-End Training: Boosting Greedy Local Learning with Context Supply
      [AUTHORS]Chengting Yu, Fengzhao Zhang, Hanzhi Ma, Aili Wang, Erping Li
      [ABSTRACT]Traditional end-to-end (E2E) training of deep networks necessitates storing intermediate activations for back-propagation, resulting in a large memory footprint on GPUs and restricted model parallelization. As an alternative, greedy local learning partitions the network into gradient-isolated modules and trains them in a supervised manner based on local preliminary losses, thereby providing asynchronous and parallel training methods that substantially reduce memory cost. However, empirical experiments reveal that as the number of gradient-isolated modules increases, the performance of the local learning scheme degrades substantially, severely limiting its scalability. To avoid this issue, we theoretically analyze greedy local learning from the standpoint of information theory and propose a ContSup scheme, which incorporates context supply between isolated modules to compensate for information loss. Experiments on benchmark datasets (i.e., CIFAR, SVHN, STL-10) achieve SOTA results and indicate that our proposed method can significantly improve the performance of greedy local learning with minimal memory and computational overhead, allowing the number of isolated modules to be increased. Our code is available at https://github.com/Tab-ct/ContSup.
      [COMMENTS]9 figures, 12 tables
      [LINK]http://arxiv.org/abs/2312.07636v2
      [DATE]2024-12-04 02:35:27+08:00
      [CATEGORIES]cs.LG
    • CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++?
      [AUTHORS]Vaishnavi Bhargava, Rajat Ghosh, Debojyoti Dutta
      [ABSTRACT]We introduce CPP-UT-Bench, a benchmark dataset to measure the C++ unit test generation capability of a large language model (LLM). CPP-UT-Bench aims to reflect a broad and diverse set of C++ codebases found in the real world. The dataset includes 2,653 {code, unit test} pairs drawn from 14 different open-source C++ codebases spanning nine diverse domains including machine learning, software testing, parsing, standard input-output, data engineering, logging, complete expression evaluation, key value storage, and server protocols. We demonstrated the effectiveness of CPP-UT-Bench as a benchmark dataset through extensive experiments in in-context learning, parameter-efficient fine-tuning (PEFT), and full-parameter fine-tuning. We also discussed the challenges of the dataset compilation and insights we learned from in-context learning and fine-tuning experiments. Besides the CPP-UT-Bench dataset and data compilation code, we are also offering the fine-tuned model weights for further research. For nine out of ten experiments, our fine-tuned LLMs outperformed the corresponding base models by an average of more than 70%.
      [LINK]http://arxiv.org/abs/2412.02735v1
      [DATE]2024-12-04 02:35:24+08:00
      [CATEGORIES]cs.LG
    • A Fast Convergence Theory for Offline Decision Making
      [AUTHORS]Chenjie Mao, Qiaosheng Zhang
      [ABSTRACT]This paper proposes the first generic fast convergence result in general function approximation for offline decision making problems, which include offline reinforcement learning (RL) and off-policy evaluation (OPE) as special cases. To unify different settings, we introduce a framework called Decision Making with Offline Feedback (DMOF), which captures a wide range of offline decision making problems. Within this framework, we propose a simple yet powerful algorithm called Empirical Decision with Divergence (EDD), whose upper bound can be expressed in terms of a coefficient named the Empirical Offline Estimation Coefficient (EOEC). We show that EOEC is instance-dependent and actually measures the correlation of the problem. When assuming partial coverage in the dataset, EOEC decreases at a rate of $1/N$, where $N$ is the size of the dataset, endowing EDD with a fast convergence guarantee. Finally, we complement the above results with a lower bound in the DMOF framework, which further demonstrates the soundness of our theory.
      [LINK]http://arxiv.org/abs/2406.01378v2
      [DATE]2024-12-04 02:32:15+08:00
      [CATEGORIES]cs.LG
    • Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
      [AUTHORS]Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
      [ABSTRACT]Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
      [COMMENTS]Website: https://sites.google.com/view/aif-dynamic-t2v/
      [LINK]http://arxiv.org/abs/2412.02617v1
      [DATE]2024-12-04 01:44:23+08:00
      [CATEGORIES]cs.LG
    • TAB-Fields: A Maximum Entropy Framework for Mission-Aware Adversarial Planning
      [AUTHORS]Gokul Puthumanaillam, Jae Hyuk Song, Nurzhan Yesmagambet, Shinkyu Park, Melkior Ornik
      [ABSTRACT]Autonomous agents operating in adversarial scenarios face a fundamental challenge: while they may know their adversaries' high-level objectives, such as reaching specific destinations within time constraints, the exact policies these adversaries will employ remain unknown. Traditional approaches address this challenge by treating the adversary's state as a partially observable element, leading to a formulation as a Partially Observable Markov Decision Process (POMDP). However, the induced belief-space dynamics in a POMDP require knowledge of the system's transition dynamics, which, in this case, depend on the adversary's unknown policy. Our key observation is that while an adversary's exact policy is unknown, their behavior is necessarily constrained by their mission objectives and the physical environment, allowing us to characterize the space of possible behaviors without assuming specific policies. In this paper, we develop Task-Aware Behavior Fields (TAB-Fields), a representation that captures adversary state distributions over time by computing the most unbiased probability distribution consistent with known constraints. We construct TAB-Fields by solving a constrained optimization problem that minimizes additional assumptions about adversary behavior beyond mission and environmental requirements. We integrate TAB-Fields with standard planning algorithms by introducing TAB-conditioned POMCP, an adaptation of Partially Observable Monte Carlo Planning. Through experiments in simulation with underwater robots and hardware implementations with ground robots, we demonstrate that our approach achieves superior performance compared to baselines that either assume specific adversary policies or neglect mission constraints altogether. Evaluation videos and code are available at https://tab-fields.github.io.
      [LINK]http://arxiv.org/abs/2412.02570v1
      [DATE]2024-12-04 00:55:27+08:00
      [CATEGORIES]cs.LG
    2024 Dec 03, Tue
    • BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment
      [AUTHORS]Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, Yang Feng
      [ABSTRACT]Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhance the multilingual capabilities would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages and performed instruction tuning based on the dataset to facilitate the capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-2-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. For multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating its capability of effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while enhancing the performance in low-resource languages. Demo, homepage, code and models of BayLing are available.
      [COMMENTS]BayLing 2's online demo: http://nlp.ict.ac.cn/bayling/demo. BayLing 2's code and models: https://github.com/ictnlp/BayLing
      [LINK]http://arxiv.org/abs/2411.16300v2
      [DATE]2024-12-03 22:17:41+08:00
      [CATEGORIES]cs.CL
    • TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual Similarity
      [AUTHORS]Xi Cao, Quzong Gesang, Yuan Sun, Nuo Qun, Tashi Nyima
      [ABSTRACT]Language models based on deep neural networks are vulnerable to textual adversarial attacks. While rich-resource languages like English are receiving focused attention, Tibetan, a cross-border language, is gradually being studied due to its abundant ancient literature and critical language strategy. Currently, there are several Tibetan adversarial text generation methods, but they do not fully consider the textual features of Tibetan script and overestimate the quality of generated adversarial texts. To address this issue, we propose a novel Tibetan adversarial text generation method called TSCheater, which considers the characteristic of Tibetan encoding and the feature that visually similar syllables have similar semantics. This method can also be transferred to other abugidas, such as Devanagari script. We utilize a self-constructed Tibetan syllable visual similarity database called TSVSDB to generate substitution candidates and adopt a greedy algorithm-based scoring mechanism to determine substitution order. After that, we conduct the method on eight victim language models. Experimentally, TSCheater outperforms existing methods in attack effectiveness, perturbation magnitude, semantic similarity, visual similarity, and human acceptance. Finally, we construct the first Tibetan adversarial robustness evaluation benchmark called AdvTS, which is generated by existing methods and proofread by humans.
      [COMMENTS]Review Version; Submitted to ICASSP 2025
      [LINK]http://arxiv.org/abs/2412.02371v1
      [DATE]2024-12-03 18:57:19+08:00
      [CATEGORIES]cs.CL
    • ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
      [AUTHORS]Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao
      [ABSTRACT]Multimodal large language models (LLMs) have demonstrated impressive capabilities in generating high-quality images from textual instructions. However, their performance in generating scientific images--a critical application for accelerating scientific progress--remains underexplored. In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ScImage assesses three key dimensions of understanding: spatial, numeric, and attribute comprehension, as well as their combinations, focusing on the relationships between scientific objects (e.g., squares, circles). We evaluate five models, GPT-4o, Llama, AutomaTikZ, Dall-E, and StableDiffusion, using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy), reveals that while GPT-4o produces outputs of decent quality for simpler prompts involving individual dimensions such as spatial, numeric, or attribute understanding in isolation, all models face challenges in this task, especially for more complex prompts.
      [LINK]http://arxiv.org/abs/2412.02368v1
      [DATE]2024-12-03 18:52:06+08:00
      [CATEGORIES]cs.CL
    • Evaluating Distributed Representations for Multi-Level Lexical Semantics: A Research Proposal
      [AUTHORS]Zhu Liu
      [ABSTRACT]Modern neural networks (NNs), trained on extensive raw sentence data, construct distributed representations by compressing individual words into dense, continuous, high-dimensional vectors. These representations are expected to capture multi-level lexical meaning. In this thesis, our objective is to examine the efficacy of distributed representations from NNs in encoding lexical meaning. Initially, we identify and formalize three levels of lexical semantics: local, global, and mixed levels. Then, for each level, we evaluate language models by collecting or constructing multilingual datasets, leveraging various language models, and employing linguistic analysis theories. This thesis builds a bridge between computational models and lexical semantics, aiming to complement each other.
      [COMMENTS]Paper under review
      [LINK]http://arxiv.org/abs/2406.00751v2
      [DATE]2024-12-03 18:37:09+08:00
      [CATEGORIES]cs.CL
    • A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis
      [AUTHORS]Changzhi Zhou, Dandan Song, Yuhang Tian, Zhijing Wu, Hao Wang, Xinyu Zhang, Jun Yang, Ziyi Yang, Shuhao Zhang
      [ABSTRACT]Recently, Large Language Models (LLMs) have garnered increasing attention in the field of natural language processing, revolutionizing numerous downstream tasks with powerful reasoning and generation abilities. For example, In-Context Learning (ICL) introduces a fine-tuning-free paradigm, allowing out-of-the-box LLMs to execute downstream tasks by analogy learning without any fine-tuning. Besides, in a fine-tuning-dependent paradigm where substantial training data exists, Parameter-Efficient Fine-Tuning (PEFT), as a cost-effective method, enables LLMs to achieve excellent performance comparable to full fine-tuning. However, these fascinating techniques employed by LLMs have not been fully exploited in the ABSA field. Previous works probe LLMs in ABSA by merely using randomly selected input-output pairs as demonstrations in ICL, resulting in an incomplete and superficial evaluation. In this paper, we shed light on a comprehensive evaluation of LLMs in the ABSA field, involving 13 datasets, 8 ABSA subtasks, and 6 LLMs. Specifically, we design a unified task formulation to unify "multiple LLMs for multiple ABSA subtasks in multiple paradigms." For the fine-tuning-dependent paradigm, we efficiently fine-tune LLMs using instruction-based multi-task learning. For the fine-tuning-free paradigm, we propose 3 demonstration selection strategies to stimulate the few-shot abilities of LLMs. Our extensive experiments demonstrate that LLMs achieve a new state-of-the-art performance compared to fine-tuned Small Language Models (SLMs) in the fine-tuning-dependent paradigm. More importantly, in the fine-tuning-free paradigm where SLMs are ineffective, LLMs with ICL still showcase impressive potential and even compete with fine-tuned SLMs on some ABSA subtasks.
      [LINK]http://arxiv.org/abs/2412.02279v1
      [DATE]2024-12-03 16:54:17+08:00
      [CATEGORIES]cs.CL
    • Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
      [AUTHORS]Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
      [ABSTRACT]Efficiently modeling sequences with infinite context length has long been a challenging problem. Previous approaches have either suffered from quadratic computational complexity or limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall recent memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and demonstrate that it significantly outperforms state-of-the-art models across a variety of benchmarks. Pretrained on sequences of 4K length, Samba shows improved perplexity in context lengths of up to 1M in zero-shot. When finetuned on 4K-length sequences, Samba efficiently extrapolates to a 256K context length with perfect memory recall on the Passkey Retrieval task, and exhibits superior retrieval extrapolation on the challenging Phonebook task compared to full-attention models. As a linear-time sequence model, Samba achieves a 3.73x higher throughput compared to Transformers with grouped-query attention for user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our code for training on open source data is publicly available at https://github.com/microsoft/Samba.
      [LINK]http://arxiv.org/abs/2406.07522v2
      [DATE]2024-12-03 16:27:49+08:00
      [CATEGORIES]cs.CL cs.LG
    • Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning
      [AUTHORS]Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi
      [ABSTRACT]Online abusive content detection, particularly in low-resource settings and within the audio modality, remains underexplored. We investigate the potential of pre-trained audio representations for detecting abusive language in low-resource languages, in this case, in Indian languages using Few Shot Learning (FSL). Leveraging powerful representations from models such as Wav2Vec and Whisper, we explore cross-lingual abuse detection using the ADIMA dataset with FSL. Our approach integrates these representations within the Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in 10 languages. We experiment with various shot sizes (50-200) evaluating the impact of limited data on performance. Additionally, a feature visualization study was conducted to better understand model behaviour. This study highlights the generalization ability of pre-trained models in low-resource scenarios and offers valuable insights into detecting abusive language in multilingual contexts.
      [COMMENTS]Accepted as part of the proceedings of COLING 2025
      [LINK]http://arxiv.org/abs/2412.01408v2
      [DATE]2024-12-03 15:52:35+08:00
      [CATEGORIES]cs.CL
    • BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition
      [AUTHORS]Quanjiang Guo, Yihong Dong, Ling Tian, Zhao Kang, Yu Zhang, Sijie Wang
      [ABSTRACT]Despite the recent success of two-stage prototypical networks in few-shot named entity recognition (NER), challenges such as over/under-detected false spans in the span detection stage and unaligned entity prototypes in the type classification stage persist. Additionally, LLMs have not proven to be effective few-shot information extractors in general. In this paper, we propose an approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to address these issues. We introduce a boundary-aware contrastive learning strategy to enhance the LLM's ability to perceive entity boundaries for generalized entity spans. Additionally, we utilize LoRAHub to align information from the target domain to the source domain, thereby enhancing adaptive cross-domain classification capabilities. Extensive experiments across various benchmarks demonstrate that our framework outperforms prior methods, validating its effectiveness. In particular, the proposed strategies demonstrate effectiveness across a range of LLM architectures. The code and data are released on https://github.com/UESTC-GQJ/BANER.
      [COMMENTS]To appear at COLING 2025
      [LINK]http://arxiv.org/abs/2412.02228v1
      [DATE]2024-12-03 15:51:14+08:00
      [CATEGORIES]cs.CL cs.LG
    • Model Editing for LLMs4Code: How Far are We?
      [AUTHORS]Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, Weimin Zhang
      [ABSTRACT]Large Language Models for Code (LLMs4Code) have been found to exhibit outstanding performance in the software engineering domain, especially in coding tasks. However, even the most advanced LLMs4Code can inevitably contain incorrect or outdated code knowledge. Due to the high cost of training LLMs4Code, it is impractical to re-train the models to fix such problematic code knowledge. Model editing is a new technical field for effectively and efficiently correcting erroneous knowledge in LLMs, where various model editing techniques and benchmarks have been proposed recently. Despite that, a comprehensive study that thoroughly compares and analyzes the performance of the state-of-the-art model editing techniques for adapting the knowledge within LLMs4Code across various code-related tasks is notably absent. To bridge this gap, we perform the first systematic study on applying state-of-the-art model editing approaches to repair the inaccuracy of LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets, i.e., CoNaLa-Edit (CNLE) with 21K+ code generation samples and CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help of CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings include that the external memorization-based GRACE approach achieves the best knowledge editing effectiveness and specificity (the editing does not influence untargeted knowledge), while generalization (whether the editing can generalize to other semantically-identical inputs) is a universal challenge for existing techniques. Furthermore, building on in-depth case analysis, we introduce an enhanced version of GRACE called A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs.
      [COMMENTS]Accepted by ICSE2025. The code is available at: https://github.com/xpq-tech/code-llmedit.git
      [LINK]http://arxiv.org/abs/2411.06638v2
      [DATE]2024-12-03 15:40:40+08:00
      [CATEGORIES]cs.CL
    • AutoGuide: Automated Generation and Selection of Context-Aware Guidelines for Large Language Model Agents
      [AUTHORS]Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee
      [ABSTRACT]Recent advances in large language models (LLMs) have empowered AI agents capable of performing various sequential decision-making tasks. However, effectively guiding LLMs to perform well in unfamiliar domains like web navigation, where they lack sufficient knowledge, has proven to be difficult with the demonstration-based in-context learning paradigm. In this paper, we introduce a novel framework, called AutoGuide, which addresses this limitation by automatically generating context-aware guidelines from offline experiences. Importantly, each context-aware guideline is expressed in concise natural language and follows a conditional structure, clearly describing the context where it is applicable. As a result, our guidelines facilitate the provision of relevant knowledge for the agent's current decision-making process, overcoming the limitations of the conventional demonstration-based learning paradigm. Our evaluation demonstrates that AutoGuide significantly outperforms competitive baselines in complex benchmark domains, including real-world web navigation.
      [LINK]http://arxiv.org/abs/2403.08978v2
      [DATE]2024-12-03 15:36:47+08:00
      [CATEGORIES]cs.CL cs.LG
    • Enabling Efficient Attack Investigation via Human-in-the-Loop Security Analysis
      [AUTHORS]Xinyu Yang, Haoyuan Liu, Saimon Amanuel Tsegai, Peng Gao
      [ABSTRACT]System auditing is a vital technique for collecting system call events as system provenance and investigating complex multi-step attacks such as Advanced Persistent Threats. However, existing attack investigation methods struggle to uncover long attack sequences due to the massive volume of system provenance data and their inability to focus on attack-relevant parts. In this paper, we present Raptor, a defense system that enables human analysts to effectively analyze large-scale system provenance to reveal multi-step attack sequences. Raptor introduces an expressive domain-specific language, ProvQL, that offers essential primitives for various types of attack analyses (e.g., attack pattern search, attack dependency tracking) with user-defined constraints, enabling analysts to focus on attack-relevant parts and iteratively sift through the large provenance data. Moreover, Raptor provides an optimized execution engine for efficient language execution. Our extensive evaluations on a wide range of attack scenarios demonstrate the practical effectiveness of Raptor in facilitating timely attack investigation.
      [LINK]http://arxiv.org/abs/2211.05403v2
      [DATE]2024-12-03 13:18:59+08:00
      [CATEGORIES]cs.CL
    • VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
      [AUTHORS]Xueqing Wu, Yuheng Ding, Bingxuan Li, Pan Lu, Da Yin, Kai-Wei Chang, Nanyun Peng
      [ABSTRACT]The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement. However, a systematic analysis of such capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Compared to existing work that uses a single scalar value to critique the entire reasoning [4], VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought and provide natural language explanations to support their judgments. Extensive evaluation of 24 LVLMs demonstrates that human-written critiques significantly enhance the performance after correction, showcasing the potential of the self-improvement strategy. However, the model-generated critiques are less helpful and sometimes detrimental to the performance, suggesting that critique is the crucial bottleneck. We identified three common patterns in critique failures: failure to critique visual perception, reluctance to "say no", and exaggerated assumption of error propagation. To address these issues, we propose an effective LookBack strategy that revisits the image to verify each piece of information in the initial reasoning. LookBack significantly improves critique and correction performance by up to 13.5%.
      [COMMENTS]Project: https://visco-benchmark.github.io/
      [LINK]http://arxiv.org/abs/2412.02172v1
      [DATE]2024-12-03 13:04:49+08:00
      [CATEGORIES]cs.CL
    • AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning
      [AUTHORS]Amy Xin, Jinxin Liu, Zijun Yao, Zhicheng Lee, Shulin Cao, Lei Hou, Juanzi Li
      [ABSTRACT]Recent advancements in large language models (LLMs) have led to significant improvements in various natural language processing tasks, but it is still challenging for LLMs to perform knowledge-intensive complex question answering due to LLMs' inefficacy in reasoning planning and the hallucination problem. A typical solution is to employ retrieval-augmented generation (RAG) coupled with chain-of-thought (CoT) reasoning, which decomposes complex questions into chain-like sub-questions and applies iterative RAG at each sub-question. However, prior works exhibit sub-optimal reasoning planning and overlook dynamic knowledge retrieval from heterogeneous sources. In this paper, we propose AtomR, a novel heterogeneous knowledge reasoning framework that conducts multi-source reasoning at the atomic level. Drawing inspiration from the graph modeling of knowledge, AtomR leverages large language models (LLMs) to decompose complex questions into combinations of three atomic knowledge operators, significantly enhancing the reasoning process at both the planning and execution stages. We also introduce BlendQA, a novel evaluation benchmark tailored to assess complex heterogeneous knowledge reasoning. Experiments show that AtomR significantly outperforms state-of-the-art baselines across three single-source and two multi-source reasoning benchmarks, with notable performance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA.
      [LINK]http://arxiv.org/abs/2411.16495v2
      [DATE]2024-12-03 13:00:18+08:00
      [CATEGORIES]cs.CL
    • NüshuRescue: Revitalization of the endangered Nüshu Language with AI
      [AUTHORS]Ivory Yang, Weicheng Ma, Soroush Vosoughi
      [ABSTRACT]The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nüshu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce NüshuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NüshuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nüshu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu and only 35 short examples from NCGold, NüshuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nüshu. NüshuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.
      [COMMENTS]Accepted to COLING 2025
      [LINK]http://arxiv.org/abs/2412.00218v2
      [DATE]2024-12-03 12:38:31+08:00
      [CATEGORIES]cs.CL cs.LG
    • Analyzing Nobel Prize Literature with Large Language Models
      [AUTHORS]Zhenyuan Yang, Zhengliang Liu, Jing Zhang, Cen Lu, Jiaxin Tai, Tianyang Zhong, Yiwei Li, Siyan Zhao, Teng Yao, Qing Liu, Jinlin Yang, Qixin Liu, Zhaowei Li, Kexin Wang, Longjun Ma, Dajiang Zhu, Yudan Ren, Bao Ge, Wei Zhang, Ning Qiang, Tuo Zhang, Tianming Liu
      [ABSTRACT]This study examines the capabilities of advanced Large Language Models (LLMs), particularly the o1 model, in the context of literary analysis. The outputs of these models are compared directly to those produced by graduate-level human participants. By focusing on two Nobel Prize-winning short stories, 'Nine Chapters' by Han Kang, the 2024 laureate, and 'Friendship' by Jon Fosse, the 2023 laureate, the research explores the extent to which AI can engage with complex literary elements such as thematic analysis, intertextuality, cultural and historical contexts, linguistic and structural innovations, and character development. Given the Nobel Prize's prestige and its emphasis on cultural, historical, and linguistic richness, applying LLMs to these works provides a deeper understanding of both human and AI approaches to interpretation. The study uses qualitative and quantitative evaluations of coherence, creativity, and fidelity to the text, revealing the strengths and limitations of AI in tasks typically reserved for human expertise. While LLMs demonstrate strong analytical capabilities, particularly in structured tasks, they often fall short in emotional nuance and coherence, areas where human interpretation excels. This research underscores the potential for human-AI collaboration in the humanities, opening new opportunities in literary studies and beyond.
      [LINK]http://arxiv.org/abs/2410.18142v2
      [DATE]2024-12-03 12:19:36+08:00
      [CATEGORIES]cs.CL
    • Leveraging Large Language Models for Comparative Literature Summarization with Reflective Incremental Mechanisms
      [AUTHORS]Fernando Gabriela Garcia, Spencer Burns, Harrison Fuller
      [ABSTRACT]In this paper, we introduce ChatCite, a novel method leveraging large language models (LLMs) for generating comparative literature summaries. The ability to summarize research papers with a focus on key comparisons between studies is an essential task in academic research. Existing summarization models, while effective at generating concise summaries, fail to provide deep comparative insights. ChatCite addresses this limitation by incorporating a multi-step reasoning mechanism that extracts critical elements from papers, incrementally builds a comparative summary, and refines the output through a reflective memory process. We evaluate ChatCite on a custom dataset, CompLit-LongContext, consisting of 1000 research papers with annotated comparative summaries. Experimental results show that ChatCite outperforms several baseline methods, including GPT-4, BART, T5, and CoT, across various automatic evaluation metrics such as ROUGE and the newly proposed G-Score. Human evaluation further confirms that ChatCite generates more coherent, insightful, and fluent summaries compared to these baseline models. Our method provides a significant advancement in automatic literature review generation, offering researchers a powerful tool for efficiently comparing and synthesizing scientific research.
      [COMMENTS]8 pages
      [LINK]http://arxiv.org/abs/2412.02149v1
      [DATE]2024-12-03 12:09:36+08:00
      [CATEGORIES]cs.CL
    • Personalized Multimodal Large Language Models: A Survey
      [AUTHORS]Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Namyong Park, Sungchul Kim, Huanrui Yang, Subrata Mitra, Zhengmian Hu, Nedim Lipka, Dang Nguyen, Yue Zhao, Jiebo Luo, Julian McAuley
      [ABSTRACT]Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications. We propose an intuitive taxonomy for categorizing the techniques used to personalize MLLMs to individual users, and discuss the techniques accordingly. Furthermore, we discuss how such techniques can be combined or adapted when appropriate, highlighting their advantages and underlying rationale. We also provide a succinct summary of personalization tasks investigated in existing research, along with the evaluation metrics commonly used. Additionally, we summarize the datasets that are useful for benchmarking personalized MLLMs. Finally, we outline critical open challenges. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.
      [LINK]http://arxiv.org/abs/2412.02142v1
      [DATE]2024-12-03 11:59:03+08:00
      [CATEGORIES]cs.CL
    • WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
      [AUTHORS]Yuci Liang, Xinheng Lyu, Meidan Ding, Wenting Chen, Jipeng Zhang, Yuexiang Ren, Xiangjian He, Song Wu, Sen Yang, Xiyue Wang, Xiaohan Xing, Linlin Shen
      [ABSTRACT]Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs' understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that WSI-LLaVA outperforms existing models across all capability dimensions, with a significant improvement in morphological analysis, establishing a clear correlation between morphological understanding and diagnostic accuracy.
      [COMMENTS]38 pages, 22 figures, 35 tables
      [LINK]http://arxiv.org/abs/2412.02141v1
      [DATE]2024-12-03 11:57:24+08:00
      [CATEGORIES]cs.CL
    • Large Language Model-Brained GUI Agents: A Survey
      [AUTHORS]Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
      [ABSTRACT]GUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.
      [COMMENTS]The collection of papers reviewed in this survey will be hosted and regularly updated on the GitHub repository: https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a searchable webpage is available at https://aka.ms/gui-agent for easier access and exploration
      [LINK]http://arxiv.org/abs/2411.18279v3
      [DATE]2024-12-03 11:16:27+08:00
      [CATEGORIES]cs.CL
    • Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders
      [AUTHORS]Xiaofeng Zhu, Jaya Krishna Mandivarapu
      [ABSTRACT]Although people are impressed by the content generation skills of large language models, the use of LLMs, such as ChatGPT, is limited by the domain grounding of the content. The correctness and groundedness of the generated content need to be based on a verified context, such as results from Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to a customized domain is that the generated responses are often incomplete, or the additions are not verified and may even be hallucinated. Prior studies on hallucination detection have focused on evaluation metrics, which are not easily adaptable to dynamic domains and can be vulnerable to attacks like jail-breaking. In this work, we propose 1) a post-processing algorithm that leverages knowledge triplets in RAG context to correct hallucinations and 2) a dual-decoder model that fuses RAG context to guide the generation process.
      [LINK]http://arxiv.org/abs/2411.07870v3
      [DATE]2024-12-03 09:04:10+08:00
      [CATEGORIES]cs.CL
    • BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts
      [AUTHORS]Raisa Tasnim, Mehanaz Chowdhury, Md Ataur Rahman
      [ABSTRACT]Author profiling, the analysis of texts to uncover attributes such as gender and age of the author, has become essential with the widespread use of social media platforms. This paper focuses on author profiling in the Bangla language, aiming to extract valuable insights about anonymous authors based on their writing style on social media. The primary objective is to introduce and benchmark the performance of machine learning approaches on a newly created Bangla Author Profiling dataset, BN-AuthProf. The dataset comprises 30,131 social media posts from 300 authors, labeled by their age and gender. Authors' identities and sensitive information were anonymized to ensure privacy. Various classical machine learning and deep learning techniques were employed to evaluate the dataset. For gender classification, the best accuracy achieved was 80% using Support Vector Machine (SVM), while a Multinomial Naive Bayes (MNB) classifier achieved the best F1 score of 0.756. For age classification, MNB attained a maximum accuracy score of 91% with an F1 score of 0.905. This research highlights the effectiveness of machine learning in gender and age classification for Bangla author profiling, with practical implications spanning marketing, security, forensic linguistics, education, and criminal investigations, considering privacy and biases.
      [COMMENTS]Accepted to be Published in 2024 27th International Conference on Computer and Information Technology (ICCIT)
      [LINK]http://arxiv.org/abs/2412.02058v1
      [DATE]2024-12-03 08:32:32+08:00
      [CATEGORIES]cs.CL
    • A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala
      [AUTHORS]Surangika Ranathunga, Asanka Ranasinghe, Janaka Shamal, Ayodya Dandeniya, Rashmi Galappaththi, Malithi Samaraweera
      [ABSTRACT]This paper presents a multi-way parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource languages. Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil. We also carry out a detailed investigation on the NER capabilities of different types of mLMs. Finally, we demonstrate the utility of our NER system on a low-resource Neural Machine Translation (NMT) task. Our dataset is publicly released: https://github.com/suralk/multiNER.
      [LINK]http://arxiv.org/abs/2412.02056v1
      [DATE]2024-12-03 08:28:31+08:00
      [CATEGORIES]cs.CL
    • Real-Time Multilingual Sign Language Processing
      [AUTHORS]Amit Moryossef
      [ABSTRACT]Sign Language Processing (SLP) is an interdisciplinary field comprised of Natural Language Processing (NLP) and Computer Vision. It is focused on the computational understanding, translation, and production of signed languages. Traditional approaches have often been constrained by the use of gloss-based systems that are both language-specific and inadequate for capturing the multidimensional nature of sign language. These limitations have hindered the development of technology capable of processing signed languages effectively. This thesis aims to revolutionize the field of SLP by proposing a simple paradigm that can bridge this existing technological gap. We propose the use of SignWriting, a universal sign language transcription notation system, to serve as an intermediary link between the visual-gestural modality of signed languages and text-based linguistic representations. We contribute foundational libraries and resources to the SLP community, thereby setting the stage for a more in-depth exploration of the tasks of sign language translation and production. These tasks encompass the translation of sign language from video to spoken language text and vice versa. Through empirical evaluations, we establish the efficacy of our transcription method as a pivot for enabling faster, more targeted research that can lead to more natural and accurate translations across a range of languages. The universal nature of our transcription-based paradigm also paves the way for real-time, multilingual applications in SLP, thereby offering a more inclusive and accessible approach to language technology. This is a significant step toward universal accessibility, enabling a wider reach of AI-driven language technologies to include the deaf and hard-of-hearing community.
      [COMMENTS]PhD Thesis
      [LINK]http://arxiv.org/abs/2412.01991v1
      [DATE]2024-12-03 05:51:41+08:00
      [CATEGORIES]cs.CL
    • Discovering influential text using convolutional neural networks
      [AUTHORS]Megan Ayers, Luke Sanford, Margaret Roberts, Eddie Yang
      [ABSTRACT]Experimental methods for estimating the impacts of text on human evaluation have been widely used in the social sciences. However, researchers in experimental settings are usually limited to testing a small number of pre-specified text treatments. While efforts to mine unstructured texts for features that causally affect outcomes have been ongoing in recent years, these models have primarily focused on the topics or specific words of text, which may not always be the mechanism of the effect. We connect these efforts with NLP interpretability techniques and present a method for flexibly discovering clusters of similar text phrases that are predictive of human reactions to texts using convolutional neural networks. When used in an experimental setting, this method can identify text treatments and their effects under certain assumptions. We apply the method to two datasets. The first enables direct validation of the model's ability to detect phrases known to cause the outcome. The second demonstrates its ability to flexibly discover text treatments with varying textual structures. In both cases, the model learns a greater variety of text treatments compared to benchmark methods, and these text features quantitatively meet or exceed the ability of benchmark methods to predict the outcome.
      [COMMENTS]Published in Findings of ACL 2024 (see https://aclanthology.org/2024.findings-acl.714)
      [LINK]http://arxiv.org/abs/2406.10086v3
      [DATE]2024-12-03 05:31:59+08:00
      [CATEGORIES]cs.CL cs.LG
    • The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?
      [AUTHORS]Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh
      [ABSTRACT]The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum, from basic metrics to complex benchmarks like GLUE and MMLU. These vulnerabilities manifest through benchmark exploitation, dataset contamination, and evaluation bias, creating a false perception of progress in language understanding capabilities. Through extensive review of contemporary evaluation approaches, we identify significant limitations in static benchmark designs, human evaluation protocols, and LLM-as-judge frameworks, all of which compromise the reliability of current performance assessments. As LLM capabilities evolve and existing benchmarks become redundant, we lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks. This requires frameworks that are adapted dynamically, addressing current limitations and providing a more accurate reflection of LLM performance.
      [COMMENTS]11 pages
      [LINK]http://arxiv.org/abs/2412.03597v1
      [DATE]2024-12-03 04:49:21+08:00
      [CATEGORIES]cs.CL cs.LG
    • VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
      [AUTHORS]Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E Gonzalez
      [ABSTRACT]Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize, but struggle to quantify. These "vibes" -- such as tone, formatting, or writing style -- influence user preferences, yet traditional evaluations focus primarily on the singular axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model (vibes) that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs and then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery and run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks including summarization, math, and captioning to provide insight into differences in model behavior. VibeCheck discovers vibes like Command X prefers to add concrete intros and conclusions when summarizing in comparison to TNGL, Llama-405b often overexplains its thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash. Code can be found at https://github.com/lisadunlap/VibeCheck
      [COMMENTS]unironic use of the word 'vibe', added more analysis and cooler graphs. arXiv admin note: text overlap with arXiv:2301.07597 by other authors
      [LINK]http://arxiv.org/abs/2410.12851v4
      [DATE]2024-12-03 04:27:39+08:00
      [CATEGORIES]cs.CL
    • Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack
      [AUTHORS]Xiaoyue Xu, Qinyuan Ye, Xiang Ren
      [ABSTRACT]We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline. Task Haystack draws inspiration from the widely-adopted "needle-in-a-haystack" (NIAH) evaluation, but presents distinct new challenges. It requires models (1) to utilize the contexts at a deeper level, rather than resorting to simple copying and pasting; (2) to navigate through long streams of evolving topics and tasks, proxying the complexities and dynamism of contexts in real-world scenarios. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively. We benchmark 14 long-context LMs using Task Haystack, finding that frontier models like GPT-4o still struggle with the setting, failing on 15% of cases on average. Most open-weight models further lag behind by a large margin, with failure rates reaching up to 61%. In our controlled analysis, we identify factors such as distraction and recency bias as contributors to these failure cases. Further, performance declines when task instructions are paraphrased at test time or when ICL demonstrations are repeated excessively, raising concerns about the robustness, instruction understanding, and true context utilization of long-context LMs.
      [COMMENTS]NeurIPS 2024 (Datasets and Benchmarks Track). Code: https://github.com/INK-USC/Lifelong-ICL Website: https://inklab.usc.edu/lifelong-icl/
      [LINK]http://arxiv.org/abs/2407.16695v2
      [DATE]2024-12-03 04:23:49+08:00
      [CATEGORIES]cs.CL cs.LG
    • RIRAG: Regulatory Information Retrieval and Answer Generation
      [AUTHORS]Tuba Gokhan, Kexin Wang, Iryna Gurevych, Ted Briscoe
      [ABSTRACT]Regulatory documents, issued by governmental regulatory bodies, establish rules, guidelines, and standards that organizations must adhere to for legal compliance. These documents, characterized by their length, complexity and frequent updates, are challenging to interpret, requiring significant allocation of time and expertise on the part of organizations to ensure ongoing compliance. Regulatory Natural Language Processing (RegNLP) is a multidisciplinary field aimed at simplifying access to and interpretation of regulatory rules and obligations. We introduce a task of generating question-passages pairs, where questions are automatically created and paired with relevant regulatory passages, facilitating the development of regulatory question-answering systems. We create the ObliQA dataset, containing 27,869 questions derived from the collection of Abu Dhabi Global Markets (ADGM) financial regulation documents, design a baseline Regulatory Information Retrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a novel evaluation metric that tests whether generated answers accurately capture all relevant obligations while avoiding contradictions.
      [LINK]http://arxiv.org/abs/2409.05677v2
      [DATE]2024-12-03 02:13:28+08:00
      [CATEGORIES]cs.CL
    • Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index
      [AUTHORS]Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami
      [ABSTRACT]As prompt engineering research rapidly evolves, evaluations beyond accuracy are crucial for developing cost-effective techniques. We present the Economical Prompting Index (EPI), a novel metric that combines accuracy scores with token consumption, adjusted by a user-specified cost concern level to reflect different resource constraints. Our study examines 6 advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts, across 10 widely-used language models and 4 diverse datasets. We demonstrate that approaches such as Self-Consistency often provide statistically insignificant gains while becoming cost-prohibitive. For example, on high-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques like Chain-of-Thought (0.72) surpasses more complex methods like Self-Consistency (0.64) at slight cost concern levels. Our findings suggest a reevaluation of complex prompting strategies in resource-constrained scenarios, potentially reshaping future research priorities and improving cost-effectiveness for end-users.
      [COMMENTS]5 pages (excluding references), accepted to COLING 2025
      [LINK]http://arxiv.org/abs/2412.01690v1
      [DATE]2024-12-03 00:34:18+08:00
      [CATEGORIES]cs.CL
    • ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
      [AUTHORS]Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock
      [ABSTRACT]Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a standardized set of forecasting questions. To address this gap, we introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML systems on an automatically generated and regularly updated set of 1,000 forecasting questions. To avoid any possibility of data leakage, ForecastBench is comprised solely of questions about future events that have no known answer at the time of submission. We quantify the capabilities of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and LLMs on a random subset of questions from the benchmark ($N=200$). While LLMs have achieved super-human performance on many benchmarks, they perform less well here: expert forecasters outperform the top-performing LLM (p-value $<0.01$). We display system and human scores in a public leaderboard at www.forecastbench.org.
      [LINK]http://arxiv.org/abs/2409.19839v3
      [DATE]2024-12-03 00:27:16+08:00
      [CATEGORIES]cs.LG cs.CL
    • R-Bot: An LLM-based Query Rewrite System
      [AUTHORS]Zhaoyan Sun, Xuanhe Zhou, Guoliang Li
      [ABSTRACT]Query rewrite is essential for optimizing SQL queries to improve their execution efficiency without changing their results. Traditionally, this task has been tackled through heuristic and learning-based methods, each with its limitations in terms of inferior quality and low robustness. Recent advancements in LLMs offer a new paradigm by leveraging their superior natural language and code comprehension abilities. Despite their potential, directly applying LLMs like GPT-4 has faced challenges due to problems such as hallucinations, where the model might generate inaccurate or irrelevant results. To address this, we propose R-Bot, an LLM-based query rewrite system with a systematic approach. We first design a multi-source rewrite evidence preparation pipeline to generate query rewrite evidences for guiding LLMs to avoid hallucinations. We then propose a hybrid structure-semantics retrieval method that combines structural and semantic analysis to retrieve the most relevant rewrite evidences for effectively answering an online query. We next propose a step-by-step LLM rewrite method that iteratively leverages the retrieved evidences to select and arrange rewrite rules with self-reflection. We conduct comprehensive experiments on widely used benchmarks, and demonstrate the superior performance of our system, R-Bot, surpassing state-of-the-art query rewrite methods.
      [LINK]http://arxiv.org/abs/2412.01661v1
      [DATE]2024-12-03 00:13:04+08:00
      [CATEGORIES]cs.CL cs.LG
    • Learning to Predict Structural Vibrations
      [AUTHORS]Jan van Delden, Julius Schultz, Christopher Blech, Sabine C. Langer, Timo Lüddecke
      [ABSTRACT]In mechanical structures like airplanes, cars and houses, noise is generated and transmitted through vibrations. To take measures to reduce this noise, vibrations need to be simulated with expensive numerical computations. Deep learning surrogate models present a promising alternative to classical numerical simulations as they can be evaluated orders of magnitude faster, while trading off accuracy. To quantify such trade-offs systematically and foster the development of methods, we present a benchmark on the task of predicting the vibration of harmonically excited plates. The benchmark features a total of 12,000 plate geometries with varying forms of beadings, material, boundary conditions, load position and sizes with associated numerical solutions. To address the benchmark task, we propose a new network architecture, named Frequency-Query Operator, which predicts vibration patterns of plate geometries given a specific excitation frequency. Applying principles from operator learning and implicit models for shape encoding, our approach effectively addresses the prediction of highly variable frequency response functions occurring in dynamic systems. To quantify the prediction quality, we introduce a set of evaluation metrics and evaluate the method on our vibrating-plates benchmark. Our method outperforms DeepONets, Fourier Neural Operators and more traditional neural network architectures and can be used for design optimization. Code, dataset and visualizations: https://github.com/ecker-lab/Learning_Vibrating_Plates
      [COMMENTS]Accepted at NeurIPS 2024
      [LINK]http://arxiv.org/abs/2310.05469v4
      [DATE]2024-12-03 23:21:53+08:00
      [CATEGORIES]cs.LG
    • Vector Optimization with Gaussian Process Bandits
      [AUTHORS]İlter Onat Korkmaz, Yaşar Cahit Yıldırım, Çağın Ararat, Cem Tekin
      [ABSTRACT]Learning problems in which multiple conflicting objectives must be considered simultaneously often arise in various fields, including engineering, drug design, and environmental management. Traditional methods for dealing with multiple black-box objective functions, such as scalarization and identification of the Pareto set under the componentwise order, have limitations in incorporating objective preferences and exploring the solution space accordingly. While vector optimization offers improved flexibility and adaptability via specifying partial orders based on ordering cones, current techniques designed for sequential experiments either suffer from high sample complexity or lack theoretical guarantees. To address these issues, we propose Vector Optimization with Gaussian Process (VOGP), a probably approximately correct adaptive elimination algorithm that performs black-box vector optimization using Gaussian process bandits. VOGP allows users to convey objective preferences through ordering cones while performing efficient sampling by exploiting the smoothness of the objective function, resulting in a more effective optimization process that requires fewer evaluations. We establish theoretical guarantees for VOGP and derive information gain-based and kernel-specific sample complexity bounds. We also conduct experiments on both real-world and synthetic datasets to compare VOGP with the state-of-the-art methods.
      [LINK]http://arxiv.org/abs/2412.02484v1
      [DATE]2024-12-03 22:47:46+08:00
      [CATEGORIES]cs.LG
    • SpaCE: The Spatial Confounding Environment
      [AUTHORS]Mauricio Tec, Ana Trisovic, Michelle Audirac, Sophie Woodward, Jie Kate Hu, Naeem Khoshnevis, Francesca Dominici
      [ABSTRACT]Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozens of datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.
      [LINK]http://arxiv.org/abs/2312.00710v3
      [DATE]2024-12-03 22:45:03+08:00
      [CATEGORIES]cs.LG
    • OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations
      [AUTHORS]Caixin Kang, Yubo Chen, Shouwei Ruan, Shiji Zhao, Ruochen Zhang, Jiayi Wang, Shan Fu, Xingxing Wei
      [ABSTRACT]With the rise of deep learning, facial recognition technology has seen extensive research and rapid development. Although facial recognition is considered a mature technology, we find that existing open-source models and commercial algorithms lack robustness in certain real-world Out-of-Distribution (OOD) scenarios, raising concerns about the reliability of these systems. In this paper, we introduce OODFace, which explores the OOD challenges faced by facial recognition models from two perspectives: common corruptions and appearance variations. We systematically design 30 OOD scenarios across 9 major categories tailored for facial recognition. By simulating these challenges on public datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V, and YTF-C/V. We then conduct extensive experiments on 19 different facial recognition models and 3 commercial APIs, along with extended experiments on face masks, Vision-Language Models (VLMs), and defense strategies to assess their robustness. Based on the results, we draw several key insights, highlighting the vulnerability of facial recognition systems to OOD data and suggesting possible solutions. Additionally, we offer a unified toolkit that includes all corruption and variation types, easily extendable to other datasets. We hope that our benchmarks and findings can provide guidance for future improvements in facial recognition model robustness.
      [LINK]http://arxiv.org/abs/2412.02479v1
      [DATE]2024-12-03 22:42:31+08:00
      [CATEGORIES]cs.LG
    • A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
      [AUTHORS]Yan Scholten, Stephan Günnemann, Leo Schwinn
      [ABSTRACT]Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework in LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most if not all of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. Code available at https://github.com/yascho/probabilistic-unlearning.
      [LINK]http://arxiv.org/abs/2410.03523v4
      [DATE]2024-12-03 22:31:41+08:00
      [CATEGORIES]cs.LG
    • Time-Series-Informed Closed-loop Learning for Sequential Decision Making and Control
      [AUTHORS]Sebastian Hirt, Lukas Theiner, Rolf Findeisen
      [ABSTRACT]Closed-loop performance of sequential decision making algorithms, such as model predictive control, depends strongly on the parameters of cost functions, models, and constraints. Bayesian optimization is a common approach to learning these parameters based on closed-loop experiments. However, traditional Bayesian optimization approaches treat the learning problem as a black box, ignoring valuable information and knowledge about the structure of the underlying problem, resulting in slow convergence and high experimental resource use. We propose a time-series-informed optimization framework that incorporates intermediate performance evaluations from early iterations of each experimental episode into the learning procedure. Additionally, probabilistic early stopping criteria are proposed to terminate unpromising experiments, significantly reducing experimental time. Simulation results show that our approach achieves baseline performance with approximately half the resources. Moreover, with the same resource budget, our approach outperforms the baseline in terms of final closed-loop performance, highlighting its efficiency in sequential decision making scenarios.
      [COMMENTS]12 pages, 3 figures, submitted to L4DC 2025
      [LINK]http://arxiv.org/abs/2412.02423v1
      [DATE]2024-12-03 20:38:53+08:00
      [CATEGORIES]cs.LG
    • OMENN: One Matrix to Explain Neural Networks
      [AUTHORS]Adam Wróbel, Mikołaj Janusz, Bartosz Zieliński, Dawid Rymarczyk
      [ABSTRACT]Deep Learning (DL) models are often black boxes, making their decision-making processes difficult to interpret. This lack of transparency has driven advancements in eXplainable Artificial Intelligence (XAI), a field dedicated to clarifying the reasoning behind DL model predictions. Among XAI techniques, attribution-based methods such as LRP and GradCAM are widely used, though they rely on approximations that can be imprecise. To address these limitations, we introduce One Matrix to Explain Neural Networks (OMENN), a novel post-hoc method that represents a neural network as a single, interpretable matrix for each specific input. This matrix is constructed through a series of linear transformations that represent the processing of the input by each successive layer in the neural network. As a result, OMENN provides locally precise, attribution-based explanations of the input across various modern models, including ViTs and CNNs. We present a theoretical analysis of OMENN based on the dynamic linearity property and validate its effectiveness with extensive tests on two XAI benchmarks, demonstrating that OMENN is competitive with state-of-the-art methods.
      [COMMENTS]Under review, code will be released after acceptance
      [LINK]http://arxiv.org/abs/2412.02399v1
      [DATE]2024-12-03 19:49:01+08:00
      [CATEGORIES]cs.LG
    • Training for Speech Recognition on Coprocessors
      [AUTHORS]Sebastian Baunsgaard, Sebastian B. Wrede, Pınar Tozun
      [ABSTRACT]Automatic Speech Recognition (ASR) has increased in popularity in recent years. The evolution of processor and storage technologies has enabled more advanced ASR mechanisms, fueling the development of virtual assistants such as Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Home. The interest in such assistants, in turn, has amplified the novel developments in ASR research. However, despite this popularity, there has not been a detailed training efficiency analysis of modern ASR systems. This mainly stems from: the proprietary nature of many modern applications that depend on ASR, like the ones listed above; the relatively expensive co-processor hardware that is used to accelerate ASR by big vendors to enable such applications; and the absence of well-established benchmarks. The goal of this paper is to address the latter two of these challenges. The paper first describes an ASR model, based on a deep neural network inspired by recent work in this domain, and our experiences building it. Then we evaluate this model on three CPU-GPU co-processor platforms that represent different budget categories. Our results demonstrate that utilizing hardware acceleration yields good results even without high-end equipment. While the most expensive platform (10X price of the least expensive one) converges to the initial accuracy target 10-30% and 60-70% faster than the other two, the differences among the platforms almost disappear at slightly higher accuracy targets. In addition, our results further highlight both the difficulty of evaluating ASR systems due to the complex, long, and resource intensive nature of the model training in this domain, and the importance of establishing benchmarks for ASR.
      [COMMENTS]published at ADMS 2020
      [LINK]http://arxiv.org/abs/2003.12366v2
      [DATE]2024-12-03 19:13:27+08:00
      [CATEGORIES]cs.LG
    • Flow Matching for Accelerated Simulation of Atomic Transport in Materials
      [AUTHORS]Juno Nam, Sulin Liu, Gavin Winter, KyuJung Jun, Soojung Yang, Rafael Gómez-Bombarelli
      [ABSTRACT]We introduce LiFlow, a generative framework to accelerate molecular dynamics (MD) simulations for crystalline materials that formulates the task as conditional generation of atomic displacements. The model uses flow matching, with a Propagator submodel to generate atomic displacements and a Corrector to locally correct unphysical geometries, and incorporates an adaptive prior based on the Maxwell-Boltzmann distribution to account for chemical and thermal conditions. We benchmark LiFlow on a dataset comprising 25-ps trajectories of lithium diffusion across 4,186 solid-state electrolyte (SSE) candidates at four temperatures. The model obtains a consistent Spearman rank correlation of 0.7-0.8 for lithium mean squared displacement (MSD) predictions on unseen compositions. Furthermore, LiFlow generalizes from short training trajectories to larger supercells and longer simulations while maintaining high accuracy. With speed-ups of up to 600,000× compared to first-principles methods, LiFlow enables scalable simulations at significantly larger length and time scales.
      [LINK]http://arxiv.org/abs/2410.01464v2
      [DATE]2024-12-03 18:01:06+08:00
      [CATEGORIES]cs.LG
    • Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous Autonomous Surface Vehicles with Deep Reinforcement Learning
      [AUTHORS]Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín
      [ABSTRACT]This paper presents a model-free deep reinforcement learning framework for informative path planning with heterogeneous fleets of autonomous surface vehicles to locate and collect plastic waste. The system employs two teams of vehicles: scouts and cleaners. Coordination between these teams is achieved through a deep reinforcement learning approach, allowing agents to learn strategies to maximize cleaning efficiency. The primary objective is for the scout team to provide an up-to-date contamination model, while the cleaner team collects as much waste as possible following this model. This strategy leads to heterogeneous teams that optimize fleet efficiency through inter-team cooperation supported by a tailored reward function. Different training variants of the proposed algorithm are compared with other state-of-the-art heuristics in two distinct scenarios, one with high convexity and another with narrow corridors and challenging access. The results show that deep reinforcement learning based algorithms outperform other benchmark heuristics, exhibiting superior adaptability. In addition, training with greedy actions further enhances performance, particularly in scenarios with intricate layouts.
      [COMMENTS]This article is currently under revision for the Robotics and Automation Letters (IEEE)
      [LINK]http://arxiv.org/abs/2412.02316v1
      [DATE]2024-12-03 17:32:02+08:00
      [CATEGORIES]cs.LG
    • Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods
      [AUTHORS]Jiamian Hu, Yuanyuan Hong, Yihua Chen, He Wang, Moriaki Yasuhara
      [ABSTRACT]We present the Noisy Ostracods, a noisy dataset for genus and species classification of crustacean ostracods with specialists' annotations. Of the 71466 specimens collected, 5.58% are estimated to be noisy (possibly problematic) at genus level. The dataset is created to address a real-world challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods dataset has diverse noise from multiple sources. First, the noise is open-set, including new classes discovered during curation that were not part of the original annotation. The dataset also has pseudo-classes, where annotators misclassified samples that should belong to an existing class into a new pseudo-class. The Noisy Ostracods dataset is highly imbalanced, with an imbalance factor ρ = 22429. This presents a unique challenge for robust machine learning methods, as existing approaches have not been extensively evaluated on fine-grained classification tasks with such diverse real-world noise. Initial experiments using current robust learning techniques have not yielded significant performance improvements on the Noisy Ostracods dataset compared to cross-entropy training on the raw, noisy data. Moreover, noise detection methods have underperformed in error hit rate compared to naive cross-validation ensembling for identifying problematic labels. These findings suggest that the fine-grained, imbalanced nature and complex noise characteristics of the dataset present considerable challenges for existing noise-robust algorithms. By openly releasing the Noisy Ostracods dataset, our goal is to encourage further research into the development of noise-resilient machine learning methods capable of effectively handling diverse, real-world noise in fine-grained classification tasks. The dataset, along with its evaluation protocols, can be accessed at https://github.com/H-Jamieu/Noisy_ostracods.
      [COMMENTS]Initial submit
      [LINK]http://arxiv.org/abs/2412.02313v1
      [DATE]2024-12-03 17:30:57+08:00
      [CATEGORIES]cs.LG
    • LLM-ABBA: Understanding time series via symbolic approximation
      [AUTHORS]Erin Carson, Xinye Chen, Cheng Kang
      [ABSTRACT]The success of large language models (LLMs) for time series has been demonstrated in previous work. Utilizing a symbolic time series representation, one can efficiently bridge the gap between LLMs and time series. However, the remaining challenge is to exploit the semantic information hidden in time series by using symbols or existing tokens of LLMs, while aligning the embedding space of LLMs according to the hidden information of time series. The symbolic time series approximation (STSA) method called adaptive Brownian bridge-based symbolic aggregation (ABBA) shows outstanding efficacy in preserving salient time series features by modeling time series patterns in terms of amplitude and period while using existing tokens of LLMs. In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA into large language models for various downstream time series tasks. By symbolizing time series, LLM-ABBA compares favorably to the recent state-of-the-art (SOTA) in UCR and three medical time series classification tasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to avoid obvious drifting during prediction tasks by significantly mitigating the effects of cumulative error arising from misused symbols during the transition from symbols to numerical values. In time series regression tasks, LLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER) benchmarks. LLM-ABBA also shows competitive prediction capability compared to recent SOTA time series prediction results. We believe this framework can also seamlessly extend to other time series tasks.
      [LINK]http://arxiv.org/abs/2411.18506v2
      [DATE]2024-12-03 17:25:11+08:00
      [CATEGORIES]cs.LG
    • BInD: Bond and Interaction-generating Diffusion Model for Multi-objective Structure-based Drug Design
      [AUTHORS]Joongwon Lee, Wonho Zhung, Jisu Seo, Woo Youn Kim
      [ABSTRACT]A remarkable advance in geometric deep generative models with accumulated structural data enables structure-based drug design (SBDD) with target protein information only. However, most existing models struggle to address multi-objectives simultaneously while performing well only in their specialized tasks. Here, we present BInD, a diffusion model with knowledge-based guidance for multi-objective SBDD. BInD is designed to co-generate molecules and their interactions with a target protein to consider all key objectives equally well, including target-specific interactions, molecular properties, and local geometry. Comprehensive evaluations show that BInD achieves robust performance for all objectives while outperforming or matching state-of-the-art methods for each. Finally, we propose a train-free optimization method empowered by retrieving target-specific interactions, highlighting the role of non-covalent interactions in achieving higher selectivity and binding affinities to a target protein.
      [LINK]http://arxiv.org/abs/2405.16861v2
      [DATE]2024-12-03 17:17:43+08:00
      [CATEGORIES]cs.LG
    • CADMR: Cross-Attention and Disentangled Learning for Multimodal Recommender Systems
      [AUTHORS]Yasser Khalafaoui, Martino Lovisetto, Basarab Matei, Nistor Grozavu
      [ABSTRACT]The increasing availability and diversity of multimodal data in recommender systems offer new avenues for enhancing recommendation accuracy and user satisfaction. However, these systems must contend with high-dimensional, sparse user-item rating matrices, where reconstructing the matrix with only small subsets of preferred items for each user poses a significant challenge. To address this, we propose CADMR, a novel autoencoder-based multimodal recommender system framework. CADMR leverages multi-head cross-attention mechanisms and Disentangled Learning to effectively integrate and utilize heterogeneous multimodal data in reconstructing the rating matrix. Our approach first disentangles modality-specific features while preserving their interdependence, thereby learning a joint latent representation. The multi-head cross-attention mechanism is then applied to enhance user-item interaction representations with respect to the learned multimodal item latent representations. We evaluate CADMR on three benchmark datasets, demonstrating significant performance improvements over state-of-the-art methods.
      [LINK]http://arxiv.org/abs/2412.02295v1
      [DATE]2024-12-03 17:09:52+08:00
      [CATEGORIES]cs.LG
    • Deep Matrix Factorization with Adaptive Weights for Multi-View Clustering
      [AUTHORS]Yasser Khalafaoui, Basarab Matei, Martino Lovisetto, Nistor Grozavu
      [ABSTRACT]Recently, deep matrix factorization has been established as a powerful model for unsupervised tasks, achieving promising results, especially for multi-view clustering. However, existing methods often lack effective feature selection mechanisms and rely on empirical hyperparameter selection. To address these issues, we introduce a novel Deep Matrix Factorization with Adaptive Weights for Multi-View Clustering (DMFAW). Our method simultaneously incorporates feature selection and generates local partitions, enhancing clustering results. Notably, the feature weights are controlled and adjusted by a parameter that is dynamically updated using a control-theory-inspired mechanism, which not only improves the model's stability and adaptability to diverse datasets but also accelerates convergence. A late fusion approach is then proposed to align the weighted local partitions with the consensus partition. Finally, the optimization problem is solved via an alternating optimization algorithm with theoretically guaranteed convergence. Extensive experiments on benchmark datasets highlight that DMFAW outperforms state-of-the-art methods in terms of clustering performance.
      [LINK]http://arxiv.org/abs/2412.02292v1
      [DATE]2024-12-03 17:08:27+08:00
      [CATEGORIES]cs.LG
    • Conformal Symplectic Optimization for Stable Reinforcement Learning
      [AUTHORS]Yao Lyu, Xiangteng Zhang, Shengbo Eben Li, Jingliang Duan, Letian Tao, Qing Xu, Lei He, Keqiang Li
      [ABSTRACT]Training deep reinforcement learning (RL) agents necessitates overcoming the highly unstable nonconvex stochastic optimization inherent in the trial-and-error mechanism. To tackle this challenge, we propose a physics-inspired optimization algorithm called relativistic adaptive gradient descent (RAD), which enhances long-term training stability. By conceptualizing neural network (NN) training as the evolution of a conformal Hamiltonian system, we present a universal framework for transferring long-term stability from conformal symplectic integrators to iterative NN updating rules, where the choice of kinetic energy governs the dynamical properties of resulting optimization algorithms. By utilizing relativistic kinetic energy, RAD incorporates principles from special relativity and limits parameter updates below a finite speed, effectively mitigating abnormal gradient influences. Additionally, RAD models NN optimization as the evolution of a multi-particle system where each trainable parameter acts as an independent particle with an individual adaptive learning rate. We prove RAD's sublinear convergence under general nonconvex settings, where smaller gradient variance and larger batch sizes contribute to tighter convergence. Notably, RAD degrades to the well-known adaptive moment estimation (ADAM) algorithm when its speed coefficient is chosen as one and symplectic factor as a small positive value. Experimental results show RAD outperforming nine baseline optimizers with five RL algorithms across twelve environments, including standard benchmarks and challenging scenarios. Notably, RAD achieves up to a 155.1% performance improvement over ADAM in Atari games, showcasing its efficacy in stabilizing and accelerating RL training.
      [LINK]http://arxiv.org/abs/2412.02291v1
      [DATE]2024-12-03 17:07:31+08:00
      [CATEGORIES]cs.LG
    • Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control
      [AUTHORS]Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan
      [ABSTRACT]Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
      [COMMENTS]NeurIPS 2024 Spotlight
      [LINK]http://arxiv.org/abs/2405.16158v3
      [DATE]2024-12-03 16:42:49+08:00
      [CATEGORIES]cs.LG
    • Normalizing self-supervised learning for provably reliable Change Point Detection
      [AUTHORS]Alexandra Bazarova, Evgenia Romanenkova, Alexey Zaytsev
      [ABSTRACT]Change point detection (CPD) methods aim to identify abrupt shifts in the distribution of input data streams. Accurate estimators for this task are crucial across various real-world scenarios. Yet, traditional unsupervised CPD techniques face significant limitations, often relying on strong assumptions or suffering from low expressive power due to inherent model simplicity. In contrast, representation learning methods overcome these drawbacks by offering flexibility and the ability to capture the full complexity of the data without imposing restrictive assumptions. However, these approaches are still emerging in the CPD field and lack robust theoretical foundations to ensure their reliability. Our work addresses this gap by integrating the expressive power of representation learning with the groundedness of traditional CPD techniques. We adopt spectral normalization (SN) for deep representation learning in CPD tasks and prove that the embeddings after SN are highly informative for CPD. Our method significantly outperforms current state-of-the-art methods in a comprehensive evaluation on three standard CPD datasets.
      [LINK]http://arxiv.org/abs/2410.13637v2
      [DATE]2024-12-03 16:29:54+08:00
      [CATEGORIES]cs.LG
    • Learning from Reduced Labels for Long-Tailed Data
      [AUTHORS]Meng Wei, Zhongnian Li, Yong Zhou, Xinzheng Xu
      [ABSTRACT]Long-tailed data is prevalent in real-world classification tasks and heavily relies on supervised information, which makes the annotation process exceptionally labor-intensive and time-consuming. Unfortunately, despite being a common approach to mitigate labeling costs, existing weakly supervised learning methods struggle to adequately preserve supervised information for tail samples, resulting in a decline in accuracy for the tail classes. To alleviate this problem, we introduce a novel weakly supervised labeling setting called Reduced Label. The proposed labeling setting not only avoids the decline of supervised information for the tail samples, but also decreases the labeling costs associated with long-tailed data. Additionally, we propose a straightforward and highly efficient unbiased framework with strong theoretical guarantees to learn from these Reduced Labels. Extensive experiments conducted on benchmark datasets including ImageNet validate the effectiveness of our approach, which surpasses the performance of state-of-the-art weakly supervised methods.
      [COMMENTS]11 pages, 3 figures
      [LINK]http://arxiv.org/abs/2403.16469v2
      [DATE]2024-12-03 16:18:17+08:00
      [CATEGORIES]cs.LG
    • Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs
      [AUTHORS]Zixuan Hu, Yongxian Wei, Li Shen, Chun Yuan, Dacheng Tao
      [ABSTRACT]Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, making them ideal for data-limited and real-time applications. However, this adaptability has not yet been replicated in current Visual Foundation Models (VFMs), which require explicit fine-tuning with sufficient tuning data. Besides, the pretraining-finetuning paradigm has led to a surge of task-specific modular components, such as Low-Rank Adaptation (LoRA). For the first time, we explore the potential of reusing diverse pre-tuned LoRAs without accessing their original training data, to achieve tuning-free few-shot adaptation in VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective, using surrogate data generated inversely from pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is empowered to solve new few-shot tasks in a single forward pass, akin to the in-context learning of LLMs. Additionally, we incorporate a double-efficient mechanism tailored to our framework, significantly accelerating the meta-training process while maintaining or even improving performance. Extensive experiments on various few-shot classification benchmarks, in both in-domain and cross-domain scenarios, demonstrate the superiority of our framework.
      [LINK]http://arxiv.org/abs/2412.02220v1
      [DATE]2024-12-03 15:25:30+08:00
      [CATEGORIES]cs.LG
    • Recovering implicit physics model under real-world constraints
      [AUTHORS]Ayan Banerjee, Sandeep K. S. Gupta
      [ABSTRACT]Recovering a physics-driven model, i.e. a governing set of equations of the underlying dynamical systems, from real-world data has been of recent interest. Most existing methods either operate on simulation data with unrealistically high sampling rates or require explicit measurements of all system variables, which is not feasible in real-world deployments. Moreover, they assume the timestamps of external perturbations to the physical system are known a priori, without uncertainty, implicitly discounting any sensor time-synchronization or human reporting errors. In this paper, we propose a novel liquid time constant neural network (LTC-NN) based architecture to recover the underlying model of physical dynamics from real-world data. The automatic differentiation property of LTC-NN nodes overcomes problems associated with low sampling rates; the input-dependent time constant in the forward pass of the hidden layer of LTC-NN nodes creates a massive search space of implicit physical dynamics; the physics model solver based data reconstruction loss guides the search for the correct set of implicit dynamics; and the use of dropout regularization in the dense layer ensures extraction of the sparsest model. Further, to account for the perturbation timing error, we utilize dense layer nodes to search through input shifts that result in the lowest reconstruction loss. Experiments on four benchmark dynamical systems, three with simulation data and one with real-world data, show that the LTC-NN architecture is more accurate in recovering implicit physics model coefficients than the state-of-the-art sparse model recovery approaches. We also introduce four additional case studies (eight in total) on real-life medical examples in simulation and with real-world clinical data to show the effectiveness of our approach in recovering the underlying model in practice.
      [COMMENTS]This paper is published in ECAI 2024, https://ebooks.iospress.nl/volumearticle/69651
      [LINK]http://arxiv.org/abs/2412.02215v1
      [DATE]2024-12-03 15:11:21+08:00
      [CATEGORIES]cs.LG
    • Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
      [AUTHORS]Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, Chelsea Finn
      [ABSTRACT]Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. Yet, its reported effects on the learned policy are inconsistent: some studies find it crucial for achieving strong results, while others observe decreased performance. In this paper, we first dissect how action chunking impacts the divergence between a learner and a demonstrator. We find that action chunking allows the learner to better capture the temporal dependencies in demonstrations but at the cost of reduced reactivity in stochastic environments. To address this tradeoff, we propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop operations. BID samples multiple predictions at each time step and searches for the optimal one based on two criteria: (i) backward coherence, which favors samples that align with previous decisions; (ii) forward contrast, which seeks samples of high likelihood for future plans. By coupling decisions within and across action chunks, BID promotes consistency over time while maintaining reactivity to unexpected changes. Experimental results show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks. Code and videos are available at https://bid-robot.github.io.
      [COMMENTS]Project website: https://bid-robot.github.io/
      [LINK]http://arxiv.org/abs/2408.17355v3
      [DATE]2024-12-03 14:53:58+08:00
      [CATEGORIES]cs.LG
    • Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
      [AUTHORS]Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
      [ABSTRACT]Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns -- fine-tuning on a few harmful data samples uploaded by users can compromise the safety alignment of the model. The attack, known as the harmful fine-tuning attack, has raised broad research interest among the community. However, as the attack is still new, we observe that there are general misunderstandings within the research community. To clear up these misunderstandings, this paper provides a comprehensive overview of three aspects of harmful fine-tuning: attack settings, defense design and evaluation methodology. Specifically, we first present the threat model of the problem, and introduce the harmful fine-tuning attack and its variants. Then we systematically survey the existing literature on attacks/defenses/mechanical analysis of the problem. Finally, we introduce the evaluation methodology and outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which might be useful to refer to when reviewers in the peer review process question the realism of the experiment/attack/defense setting. A curated list of relevant papers is maintained and made accessible at: https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers.
      [LINK]http://arxiv.org/abs/2409.18169v5
      [DATE]2024-12-03 14:52:11+08:00
      [CATEGORIES]cs.LG
    • FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL
      [AUTHORS]Woosung Koh, Wonbeen Oh, Siyeol Kim, Suhin Shin, Hyeongjin Kim, Jaein Jang, Junghyun Lee, Se-Young Yun
      [ABSTRACT]Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory -- a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. FlickerFusion stochastically drops out parts of the observation space, emulating being in-domain when inferenced OOD. The results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-à-vis the backbone, compared to existing methods. Benchmarks, implementations, and model weights are organized and open-sourced at flickerfusion305.github.io, accompanied by ample demo video renderings.
      [COMMENTS]NeurIPS '24 Open-World Agents Workshop
      [LINK]http://arxiv.org/abs/2410.15876v3
      [DATE]2024-12-03 13:59:09+08:00
      [CATEGORIES]cs.LG
    • Generalizing Weisfeiler-Lehman Kernels to Subgraphs
      [AUTHORS]Dongkwan Kim, Alice Oh
      [ABSTRACT]Subgraph representation learning has been effective in solving various real-world problems. However, current graph neural networks (GNNs) produce suboptimal results for subgraph-level tasks due to their inability to capture complex interactions within and between subgraphs. To provide a more expressive and efficient alternative, we propose WLKS, a Weisfeiler-Lehman (WL) kernel generalized for subgraphs by applying the WL algorithm on induced $k$-hop neighborhoods. We combine kernels across different $k$-hop levels to capture richer structural information that is not fully encoded in existing models. Our approach can balance expressiveness and efficiency by eliminating the need for neighborhood sampling. In experiments on eight real-world and synthetic benchmarks, WLKS significantly outperforms leading approaches on five datasets while reducing training time, ranging from 0.01x to 0.25x compared to the state-of-the-art.
      [COMMENTS]15 pages
      [LINK]http://arxiv.org/abs/2412.02181v1
      [DATE]2024-12-03 13:35:44+08:00
      [CATEGORIES]cs.LG
    • A Comprehensive Study of Shapley Value in Data Analytics
      [AUTHORS]Hong Lin, Shixin Wan, Zhongle Xie, Ke Chen, Meihui Zhang, Lidan Shou, Gang Chen
      [ABSTRACT]Over the recent years, Shapley value (SV), a solution concept from cooperative game theory, has found numerous applications in data analytics (DA). This paper provides the first comprehensive study of SV used throughout the DA workflow, which involves three main steps: data fabric, data exploration, and result reporting. We summarize existing versatile forms of SV used in these steps by a unified definition and clarify the essential functionalities that SV can provide for data scientists. We categorize the arts in this field based on the technical challenges they tackled, which include computation efficiency, approximation error, privacy preservation, and appropriate interpretations. We discuss these challenges and analyze the corresponding solutions. We also implement SVBench, the first open-sourced benchmark for developing SV applications, and conduct experiments on six DA tasks to validate our analysis and discussions. Based on the qualitative and quantitative results, we identify the limitations of current efforts for applying SV to DA and highlight the directions of future research and engineering.
      [LINK]http://arxiv.org/abs/2412.01460v2
      [DATE]2024-12-03 12:48:22+08:00
      [CATEGORIES]cs.LG
    • FSMLP: Modelling Channel Dependencies With Simplex Theory Based Multi-Layer Perceptions In Frequency Domain
      [AUTHORS]Zhengnan Li, Haoxuan Li, Hao Wang, Jun Fang, Duoyin Li, Yunxiao Qin
      [ABSTRACT]Time series forecasting (TSF) plays a crucial role in various domains, including web data analysis, energy consumption prediction, and weather forecasting. While Multi-Layer Perceptrons (MLPs) are lightweight and effective for capturing temporal dependencies, they are prone to overfitting when used to model inter-channel dependencies. In this paper, we investigate the overfitting problem in channel-wise MLPs using Rademacher complexity theory, revealing that extreme values in time series data exacerbate this issue. To mitigate this issue, we introduce a novel Simplex-MLP layer, where the weights are constrained within a standard simplex. This strategy encourages the model to learn simpler patterns, thereby reducing overfitting to extreme values. Based on the Simplex-MLP layer, we propose a novel Frequency Simplex MLP (FSMLP) framework for time series forecasting, comprising two kinds of modules: Simplex Channel-Wise MLP (SCWM) and Frequency Temporal MLP (FTM). The SCWM effectively leverages the Simplex-MLP to capture inter-channel dependencies, while the FTM is a simple yet efficient temporal MLP designed to extract temporal information from the data. Our theoretical analysis shows that the upper bound of the Rademacher Complexity for Simplex-MLP is lower than that for standard MLPs. Moreover, we validate our proposed method on seven benchmark datasets, demonstrating significant improvements in forecasting accuracy and efficiency, while also showcasing superior scalability. Additionally, we demonstrate that Simplex-MLP can improve other methods that use channel-wise MLP to achieve less overfitting and improved performance. Code is available at https://github.com/FMLYD/FSMLP.
      [LINK]http://arxiv.org/abs/2412.01654v2
      [DATE]2024-12-03 12:40:13+08:00
      [CATEGORIES]cs.LG
    • Revisiting the Initial Steps in Adaptive Gradient Descent Optimization
      [AUTHORS]Abulikemu Abuduweili, Changliu Liu
      [ABSTRACT]Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from suboptimal generalization compared to stochastic gradient descent (SGD) and exhibit instability, particularly when training Transformer models. In this work, we identify the standard initialization of the second-order moment estimation ($v_0 = 0$) as a significant factor contributing to these limitations. We introduce simple yet effective solutions: initializing the second-order moment estimation with non-zero values, using either data-driven or random initialization strategies. Empirical evaluations demonstrate that our approach not only stabilizes convergence but also enhances the final performance of adaptive gradient optimizers. Furthermore, by adopting the proposed initialization strategies, Adam achieves performance comparable to many recently proposed variants of adaptive gradient optimization methods, highlighting the practical impact of this straightforward modification.
      [COMMENTS]OPT workshop at NeurIPS 2024
      [LINK]http://arxiv.org/abs/2412.02153v1
      [DATE]2024-12-03 12:28:14+08:00
      [CATEGORIES]cs.LG
    • Towards Universal Mesh Movement Networks
      [AUTHORS]Mingrui Zhang, Chunyang Wang, Stephan Kramer, Joseph G. Wallwork, Siyi Li, Jiancheng Liu, Xiang Chen, Matthew D. Piggott
      [ABSTRACT]Solving complex Partial Differential Equations (PDEs) accurately and efficiently is an essential and challenging problem in all scientific and engineering disciplines. Mesh movement methods provide the capability to improve the accuracy of the numerical solution without increasing the overall mesh degree of freedom count. Conventional sophisticated mesh movement methods are extremely expensive and struggle to handle scenarios with complex boundary geometries. However, existing learning-based methods require re-training from scratch given a different PDE type or boundary geometry, which limits their applicability, and also often suffer from robustness issues in the form of inverted elements. In this paper, we introduce the Universal Mesh Movement Network (UM2N), which -- once trained -- can be applied in a non-intrusive, zero-shot manner to move meshes with different size distributions and structures, for solvers applicable to different PDE types and boundary geometries. UM2N consists of a Graph Transformer (GT) encoder for extracting features and a Graph Attention Network (GAT) based decoder for moving the mesh. We evaluate our method on advection and Navier-Stokes based examples, as well as a real-world tsunami simulation case. Our method outperforms existing learning-based mesh movement methods in terms of the benchmarks described above. In comparison to the conventional sophisticated Monge-Ampère PDE-solver based method, our approach not only significantly accelerates mesh movement, but also proves effective in scenarios where the conventional method fails. Our project page is at https://erizmr.github.io/UM2N/.
      [COMMENTS]Accepted at NeurIPS 2024 as a spotlight paper
      [LINK]http://arxiv.org/abs/2407.00382v4
      [DATE]2024-12-03 12:07:32+08:00
      [CATEGORIES]cs.LG
    • Benchmarking symbolic regression constant optimization schemes
      [AUTHORS]L. G. A dos Reis, V. L. P. S. Caminha, T. J. P. Penna
      [ABSTRACT]Symbolic regression is a machine learning technique, and it has seen many advancements in recent years, especially in genetic programming approaches (GPSR). Furthermore, it has been known for many years that constant optimization of parameters, during the evolutionary search, greatly increases GPSR performance. However, different authors approach such tasks differently and no consensus exists regarding which methods perform best. In this work, we evaluate eight different parameter optimization methods, applied during evolutionary search, over ten known benchmark problems, in two different scenarios. We also propose using an under-explored metric called Tree Edit Distance (TED), aiming to identify symbolic accuracy. In conjunction with classical error measures, we develop a combined analysis of model performance in symbolic regression. We then show that different constant optimization methods perform better in certain scenarios and that there is no overall best choice for every problem. Finally, we discuss how common metric decisions may be biased and appear to generate better models in comparison.
      [COMMENTS]9 pages, 10 figures, 2 tables
      [LINK]http://arxiv.org/abs/2412.02126v1
      [DATE]2024-12-03 11:29:27+08:00
      [CATEGORIES]cs.LG
    • Optimizing Latent Goal by Learning from Trajectory Preference
      [AUTHORS]Guangyu Zhao, Kewei Lian, Haowei Lin, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang
      [ABSTRACT]A growing body of work has emerged focusing on instruction-following policies for open-world agents, aiming to better align the agent's behavior with human intentions. However, the performance of these policies is highly susceptible to the initial prompt, which leads to extra effort in selecting the best instructions. We propose a framework named Preference Goal Tuning (PGT). PGT allows an instruction-following policy to interact with the environment to collect several trajectories, which are categorized into positive and negative samples based on preference. Then we use preference learning to fine-tune the initial goal latent representation with the categorized trajectories while keeping the policy backbone frozen. The experimental results show that with minimal data and training, PGT achieves an average relative improvement of 72.0% and 81.6% over 17 tasks in 2 different foundation policies respectively, and outperforms the best human-selected instructions. Moreover, PGT surpasses full fine-tuning in the out-of-distribution (OOD) task-execution environments by 13.4%, indicating that our approach retains strong generalization capabilities. Since our approach stores a single latent representation for each task independently, it can be viewed as an efficient method for continual learning, without the risk of catastrophic forgetting or task interference. In short, PGT enhances the performance of agents across nearly all tasks in the Minecraft Skillforge benchmark and demonstrates robustness to the execution environment.
      [LINK]http://arxiv.org/abs/2412.02125v1
      [DATE]2024-12-03 11:27:48+08:00
      [CATEGORIES]cs.LG
    • ILASH: A Predictive Neural Architecture Search Framework for Multi-Task Applications
      [AUTHORS]Md Hafizur Rahman, Md Mashfiq Rizvee, Sumaiya Shomaji, Prabuddha Chakraborty
      [ABSTRACT]Artificial intelligence (AI) is widely used in various fields including healthcare, autonomous vehicles, robotics, traffic monitoring, and agriculture. Many modern AI applications in these fields are multi-tasking in nature (i.e., they perform multiple analyses on the same data) and are deployed on resource-constrained edge devices, requiring the AI models to be efficient across different metrics such as power, frame rate, and size. For these specific use-cases, in this work, we propose a new paradigm of neural network architecture (ILASH) that leverages a layer sharing concept for minimizing power utilization, increasing frame rate, and reducing model size. Additionally, we propose a novel neural network architecture search framework (ILASH-NAS) for efficient construction of these neural network models for a given set of tasks and device constraints. The proposed NAS framework utilizes a data-driven intelligent approach to make the search efficient in terms of energy, time, and CO2 emission. We perform extensive evaluations of the proposed layer shared architecture paradigm (ILASH) and the ILASH-NAS framework using four open-source datasets (UTKFace, MTFL, CelebA, and Taskonomy). We compare ILASH-NAS with AutoKeras and observe significant improvement in terms of both the generated model performance and neural search efficiency with up to 16x less energy utilization, CO2 emission, and training/search time.
      [COMMENTS]9 pages, 3 figures, 6 tables
      [LINK]http://arxiv.org/abs/2412.02116v1
      [DATE]2024-12-03 11:12:16+08:00
      [CATEGORIES]cs.LG
    • Evaluating the Impact of Data Augmentation on Predictive Model Performance
      [AUTHORS]Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada
      [ABSTRACT]In supervised machine learning (SML) research, large training datasets are essential for valid results. However, obtaining primary data in learning analytics (LA) is challenging. Data augmentation can address this by expanding and diversifying data, though its use in LA remains underexplored. This paper systematically compares data augmentation techniques and their impact on prediction performance in a typical LA task: prediction of academic outcomes. Augmentation is demonstrated on four SML models, which we successfully replicated from a previous LAK study based on AUC values. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01 and approximately halving the training time compared to the baseline models. In addition, we compared 99 combinations of chaining 21 techniques, and found minor, although statistically significant, improvements across models when adding noise to SMOTE-ENN (+0.014). Notably, some augmentation techniques significantly lowered predictive performance or increased performance fluctuation related to random chance. This paper's contribution is twofold. First, our empirical findings show that sampling techniques provide the most statistically reliable performance improvements for LA applications of SML, and are computationally more efficient than deep generation methods with complex hyperparameter settings. Second, the LA community may benefit from validating a recent study through independent replication.
      [COMMENTS]Published in LAK 2025 conference proceedings in the ACM Digital Library, see https://doi.org/10.1145/3706468.3706485
      [LINK]http://arxiv.org/abs/2412.02108v1
      [DATE]2024-12-03 11:03:04+08:00
      [CATEGORIES]cs.LG
    • Beyond Tree Models: A Hybrid Model of KAN and gMLP for Large-Scale Financial Tabular Data
      [AUTHORS]Mingming Zhang, Jiahao Hu, Pengfei Shi, Ningtao Wang, Ruizhe Gao, Guandong Sun, Feng Zhao, Yulin Kang, Xing Fu, Weiqiang Wang, Junbo Zhao
      [ABSTRACT]Tabular data plays a critical role in real-world financial scenarios. Traditionally, tree models have dominated in handling tabular data. However, financial datasets in the industry often encounter some challenges, such as data heterogeneity, the predominance of numerical features and the large scale of the data, which can range from tens of millions to hundreds of millions of records. These challenges can lead to significant memory and computational issues when using tree-based models. Consequently, there is a growing need for neural network-based solutions that can outperform these models. In this paper, we introduce TKGMLP, a hybrid network for tabular data that combines shallow Kolmogorov-Arnold Networks with Gated Multilayer Perceptron. This model leverages the strengths of both architectures to improve performance and scalability. We validate TKGMLP on a real-world credit scoring dataset, where it achieves state-of-the-art results and outperforms current benchmarks. Furthermore, our findings demonstrate that the model continues to improve as the dataset size increases, making it highly scalable. Additionally, we propose a novel feature encoding method for numerical data, specifically designed to address the predominance of numerical features in financial datasets. The integration of this feature encoding method within TKGMLP significantly improves prediction accuracy. This research not only advances table prediction technology but also offers a practical and effective solution for handling large-scale numerical tabular data in various industrial applications.
      [COMMENTS]8 pages, 4 figures
      [LINK]http://arxiv.org/abs/2412.02097v1
      [DATE]2024-12-03 10:38:07+08:00
      [CATEGORIES]cs.LG
    • Fault Detection and Identification via Monitoring Modules Based on Clusters of Interacting Measurements
      [AUTHORS]Enrique Luna Villagomez, Vladimir Mahalec
      [ABSTRACT]This work introduces a novel control-aware distributed process monitoring methodology based on modules comprised of clusters of interacting measurements. The methodology relies on the process flow diagram (PFD) and control system structure without requiring cross-correlation data to create monitoring modules. The methodology is validated on the Tennessee Eastman Process benchmark using full Principal Component Analysis (f-PCA) in the monitoring modules. The results are comparable to nonlinear techniques implemented in a centralized manner such as Kernel PCA (KPCA), Autoencoders (AE), and Recurrent Neural Networks (RNN), or distributed techniques like the Distributed Canonical Correlation Analysis (DCCA). Temporal plots of fault detection by different modules show clearly the magnitude and propagation of the fault through each module, pinpointing the module where the fault originates, and separating controllable faults from other faults. This information, combined with PCA contribution plots, helps detection and identification as effectively as more complex nonlinear centralized or distributed methods.
      [COMMENTS]Reworked and submitted to CONENGP 2/12/2024
      [LINK]http://arxiv.org/abs/2409.11444v2
      [DATE]2024-12-03 09:16:51+08:00
      [CATEGORIES]cs.LG
    • emg2pose: A Large and Diverse Benchmark for Surface Electromyographic Hand Pose Estimation
      [AUTHORS]Sasha Salter, Richard Warren, Collin Schlager, Adrian Spurr, Shangchen Han, Rohin Bhasin, Yujun Cai, Peter Walkington, Anuoluwapo Bolarinwa, Robert Wang, Nathan Danielson, Josh Merel, Eftychios Pnevmatikakis, Jesse Marshall
      [ABSTRACT]Hands are the primary means through which humans interact with the world. Reliable and always-available hand pose inference could yield new and intuitive control schemes for human-computer interactions, particularly in virtual and augmented reality. Computer vision is effective but requires one or multiple cameras and can struggle with occlusions, limited field of view, and poor lighting. Wearable wrist-based surface electromyography (sEMG) presents a promising alternative as an always-available modality sensing muscle activities that drive hand motion. However, sEMG signals are strongly dependent on user anatomy and sensor placement, and existing sEMG models have required hundreds of users and device placements to effectively generalize. To facilitate progress on sEMG pose inference, we introduce the emg2pose benchmark, the largest publicly available dataset of high-quality hand pose labels and wrist sEMG recordings. emg2pose contains 2kHz, 16 channel sEMG and pose labels from a 26-camera motion capture rig for 193 users, 370 hours, and 29 stages with diverse gestures - a scale comparable to vision-based hand pose datasets. We provide competitive baselines and challenging tasks evaluating real-world generalization scenarios: held-out users, sensor placements, and stages. emg2pose provides the machine learning community a platform for exploring complex generalization problems, holding potential to significantly enhance the development of sEMG-based human-computer interactions.
      [COMMENTS]Published at NeurIPS 2024 Datasets and Benchmarks Track
      [LINK]http://arxiv.org/abs/2412.02725v1
      [DATE]2024-12-03 07:39:37+08:00
      [CATEGORIES]cs.LG
    • Radial Basis Operator Networks
      [AUTHORS]Jason Kurz, Sean Oughton, Shitao Liu
      [ABSTRACT]Operator networks are designed to approximate nonlinear operators, which provide mappings between infinite-dimensional spaces such as function spaces. These networks are playing an increasingly important role in machine learning, with their most notable contributions in the field of scientific computing. Their significance stems from their ability to handle the type of data often encountered in scientific applications. For instance, in climate modeling or fluid dynamics, input data typically consists of discretized continuous fields (like temperature distributions or velocity fields). We introduce the radial basis operator network (RBON), which represents a significant advancement as the first operator network capable of learning an operator in both the time domain and frequency domain when adjusted to accept complex-valued inputs. Despite the small, single hidden-layer structure, the RBON boasts small $L^2$ relative test error for both in- and out-of-distribution data (OOD) of less than $1\times 10^{-7}$ in some benchmark cases. Moreover, the RBON maintains small error on OOD data from entirely different function classes from the training data.
      [LINK]http://arxiv.org/abs/2410.04639v2
      [DATE]2024-12-03 06:46:47+08:00
      [CATEGORIES]cs.LG
    • Representation Learning for Sequential Volumetric Design Tasks
      [AUTHORS]Md Ferdous Alam, Yi Wang, Chin-Yi Cheng, Jieliang Luo
      [ABSTRACT]Volumetric design, also called massing design, is the first and critical step in professional building design which is sequential in nature. As the volumetric design process requires careful design decisions and iterative adjustments, the underlying sequential design process encodes valuable information for designers. Many efforts have been made to automatically generate reasonable volumetric designs, but the quality of the generated design solutions varies, and evaluating a design solution requires either a prohibitively comprehensive set of metrics or expensive human expertise. While previous approaches focused on learning only the final design instead of sequential design tasks, we propose to encode the design knowledge from a collection of expert or high-performing design sequences and extract useful representations using transformer-based models. Later we propose to utilize the learned representations for crucial downstream applications such as design preference evaluation and procedural design generation. We develop the preference model by estimating the density of the learned representations whereas we train an autoregressive transformer model for sequential design generation. We demonstrate our ideas by leveraging a novel dataset of thousands of sequential volumetric designs. Our preference model can compare two arbitrarily given design sequences and is almost 90% accurate in evaluation against random design sequences. Our autoregressive model is also capable of autocompleting a volumetric design sequence from a partial design sequence.
      [COMMENTS]12 pages, 12 figures
      [LINK]http://arxiv.org/abs/2309.02583v3
      [DATE]2024-12-03 06:33:40+08:00
      [CATEGORIES]cs.LG
    • DYffCast: Regional Precipitation Nowcasting Using IMERG Satellite Data. A case study over South America
      [AUTHORS]Daniel Seal, Rossella Arcucci, Salva Rühling-Cachay, César Quilodrán-Casas
      [ABSTRACT]Climate change is increasing the frequency of extreme precipitation events, making weather disasters such as flooding and landslides more likely. The ability to accurately nowcast precipitation is therefore becoming more critical for safeguarding society by providing immediate, accurate information to decision makers. Motivated by the recent success of generative models at precipitation nowcasting, this paper: extends the DYffusion framework to this task and evaluates its performance at forecasting IMERG satellite precipitation data up to a 4-hour horizon; modifies the DYffusion framework to improve its ability to model rainfall data; and introduces a novel loss function that combines MSE, MAE and the LPIPS perceptual score. In a quantitative evaluation of forecasts up to a 4-hour horizon, the modified DYffusion framework trained with the novel loss outperforms four competitor models. It has the highest CSI scores for weak, moderate, and heavy rain thresholds and retains an LPIPS score < 0.2 for the entire roll-out, degrading the least as lead-time increases. The proposed nowcasting model demonstrates visually stable and sharp forecasts up to a 2-hour horizon on a heavy rain case study. Code is available at https://github.com/Dseal95/DYffcast.
      [COMMENTS]Accepted in the Machine Learning for Physical Sciences workshop @ NeurIPS 2024
      [LINK]http://arxiv.org/abs/2412.02723v1
      [DATE]2024-12-03 06:20:31+08:00
      [CATEGORIES]cs.LG
    • The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications
      [AUTHORS]Philippe Brouillard, Chandler Squires, Jonas Wahl, Konrad P. Kording, Karen Sachs, Alexandre Drouin, Dhanya Sridhar
      [ABSTRACT]Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientific disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences - fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.
      [COMMENTS]39 pages, 8 figures
      [LINK]http://arxiv.org/abs/2412.01953v1
      [DATE]2024-12-03 04:26:29+08:00
      [CATEGORIES]cs.LG
    • Approximately Optimal Search on a Higher-dimensional Sliding Puzzle
      [AUTHORS]Nono SC Merleau, Miguel O'Malley, Érika Roldán, Sayan Mukherjee
      [ABSTRACT]Higher-dimensional sliding puzzles are constructed on the vertices of a $d$-dimensional hypercube, where $2^d-l$ vertices are distinctly coloured. Rings with the same colours are initially set randomly on the vertices of the hypercube. The goal of the puzzle is to move each of the $2^d-l$ rings to pre-defined target vertices on the cube. In this setting, the $k$-rule constraint represents a generalisation of edge collision for the movement of colours between vertices, allowing movement only when a hypercube face of dimension $k$ containing a ring is completely free of other rings. Starting from an initial configuration, what is the minimum number of moves needed to make ring colours match the vertex colours? An algorithm that provides us with such a number is called God's algorithm. When such an algorithm exists, it does not have a polynomial time complexity, at least in the case of the 15-puzzle corresponding to $k=1$ in the cubical puzzle. This paper presents a comprehensive computational study of different scenarios of the higher-dimensional puzzle. A benchmark of three computational techniques, an exact algorithm (the A* search) and two approximately optimal search techniques (an evolutionary algorithm (EA) and reinforcement learning (RL)) is presented in this work. The experiments show that all three methods can successfully solve the puzzle of dimension three for different face dimensions and across various difficulty levels. When the dimension increases, the A* search fails, and RL and EA methods can still provide a generally acceptable solution, i.e. a distribution of a number of moves with a median value of less than $30$. Overall, the EA method consistently requires less computational time, while failing in most cases to minimise the number of moves for the puzzle dimensions $d=4$ and $d=5$.
      [COMMENTS]20 pages, 8 figures
      [LINK]http://arxiv.org/abs/2412.01937v1
      [DATE]2024-12-03 03:59:06+08:00
      [CATEGORIES]cs.LG
    • Kernel-Free Universum Quadratic Surface Twin Support Vector Machines for Imbalanced Data
      [AUTHORS]Hossein Moosaei, Milan Hladík, Ahmad Mousavi, Zheming Gao, Haojie Fu
      [ABSTRACT]Binary classification tasks with imbalanced classes pose significant challenges in machine learning. Traditional classifiers often struggle to accurately capture the characteristics of the minority class, resulting in biased models with subpar predictive performance. In this paper, we introduce a novel approach to tackle this issue by leveraging Universum points to support the minority class within quadratic twin support vector machine models. Unlike traditional classifiers, our models utilize quadratic surfaces instead of hyperplanes for binary classification, providing greater flexibility in modeling complex decision boundaries. By incorporating Universum points, our approach enhances classification accuracy and generalization performance on imbalanced datasets. We generated four artificial datasets to demonstrate the flexibility of the proposed methods. Additionally, we validated the effectiveness of our approach through empirical evaluations on benchmark datasets, showing superior performance compared to conventional classifiers and existing methods for imbalanced classification.
      [LINK]http://arxiv.org/abs/2412.01936v1
      [DATE]2024-12-03 03:57:59+08:00
      [CATEGORIES]cs.LG
    • Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
      [AUTHORS]Benedikt Stroebl, Sayash Kapoor, Arvind Narayanan
      [ABSTRACT]Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models, such as by repeatedly sampling solutions to a coding problem until it passes unit tests. The central thesis of this paper is that there is no free lunch for inference scaling: indefinite accuracy improvement through resampling can only be realized if the "verifier" (in this case, a set of unit tests) is perfect. When the verifier is imperfect, as it almost always is in domains such as reasoning or coding (for example, unit tests have imperfect coverage), there is a nonzero probability of false positives: incorrect solutions that pass the verifier. Resampling cannot decrease this probability, so it imposes an upper bound to the accuracy of resampling-based inference scaling even with an infinite compute budget. We find that there is a very strong correlation between the model's single-sample accuracy (i.e. accuracy without unit tests) and its false positive rate on coding benchmarks HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model (Fig. 1a). When we consider that false positives have a negative utility compared to abstaining from producing a solution, it bends the inference scaling curve further downward. Empirically, we find that the optimal number of samples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
      [LINK]http://arxiv.org/abs/2411.17501v2
      [DATE]2024-12-03 02:54:28+08:00
      [CATEGORIES]cs.LG
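      The ceiling described in this abstract can be illustrated with a simplified model. This is a sketch under stated assumptions, not the paper's exact formulation: each sample is independently correct with probability p, a correct sample always passes the verifier, and an incorrect sample passes with false positive rate f. The function names are hypothetical.

```python
def resampling_accuracy(p: float, f: float, k: int) -> float:
    """Chance that resampling returns a correct solution within k tries.

    p: single-sample accuracy (chance that one sample is correct)
    f: false positive rate (chance an incorrect sample passes the verifier)
    Assumption: correct samples always pass the verifier.
    """
    q = p + (1.0 - p) * f  # chance that any single sample passes the verifier
    if q == 0.0:
        return 0.0
    # The first passing sample is returned; it is correct with probability p / q.
    return p * (1.0 - (1.0 - q) ** k) / q


def accuracy_ceiling(p: float, f: float) -> float:
    """Limit of resampling_accuracy as k grows without bound."""
    q = p + (1.0 - p) * f
    return p / q if q > 0.0 else 0.0
```

      Under these assumptions the ceiling p / (p + (1 - p) f) equals 1 only when f = 0, which mirrors the abstract's point that an imperfect verifier caps accuracy regardless of compute budget.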
    • CREW: Facilitating Human-AI Teaming Research
      [AUTHORS]Lingyu Zhang, Zhengran Ji, Boyuan Chen
      [ABSTRACT]With the increasing deployment of artificial intelligence (AI) technologies, the potential of humans working with AI agents has been growing at a great speed. Human-AI teaming is an important paradigm for studying various aspects when humans and AI agents work together. The unique aspect of Human-AI teaming research is the need to jointly study humans and AI agents, demanding multidisciplinary research efforts from machine learning to human-computer interaction, robotics, cognitive science, neuroscience, psychology, social science, and complex systems. However, existing platforms for Human-AI teaming research are limited, often supporting oversimplified scenarios and a single task, or specifically focusing on either human-teaming research or multi-agent AI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming research in real-time decision-making scenarios and engage collaborations from multiple scientific disciplines, with a strong emphasis on human involvement. It includes pre-built tasks for cognitive studies and Human-AI teaming with expandable potentials from our modular design. Following conventional cognitive neuroscience research, CREW also supports multimodal human physiological signal recording for behavior analysis. Moreover, CREW benchmarks real-time human-guided reinforcement learning agents using state-of-the-art algorithms and well-tuned baselines. With CREW, we were able to conduct 50 human subject studies within a week to verify the effectiveness of our benchmark.
      [COMMENTS]Our project website is at: http://generalroboticslab.com/CREW
      [LINK]http://arxiv.org/abs/2408.00170v2
      [DATE]2024-12-03 02:37:01+08:00
      [CATEGORIES]cs.LG
    • FERERO: A Flexible Framework for Preference-Guided Multi-Objective Learning
      [AUTHORS]Lisha Chen, AFM Saif, Yanning Shen, Tianyi Chen
      [ABSTRACT]Finding specific preference-guided Pareto solutions that represent different trade-offs among multiple objectives is critical yet challenging in multi-objective problems. Existing methods are restrictive in preference definitions and/or their theoretical guarantees. In this work, we introduce a Flexible framEwork for pREfeRence-guided multi-Objective learning (FERERO) by casting it as a constrained vector optimization problem. Specifically, two types of preferences are incorporated into this formulation -- the relative preference defined by the partial ordering induced by a polyhedral cone, and the absolute preference defined by constraints that are linear functions of the objectives. To solve this problem, convergent algorithms are developed with both single-loop and stochastic variants. Notably, this is the first single-loop primal algorithm for constrained vector optimization to our knowledge. The proposed algorithms adaptively adjust to both constraint and objective values, eliminating the need to solve different subproblems at different stages of constraint satisfaction. Experiments on multiple benchmarks demonstrate the proposed method is very competitive in finding preference-guided optimal solutions. Code is available at https://github.com/lisha-chen/FERERO/.
      [LINK]http://arxiv.org/abs/2412.01773v1
      [DATE]2024-12-03 02:21:16+08:00
      [CATEGORIES]cs.LG
    • OminiControl: Minimal and Universal Control for Diffusion Transformer
      [AUTHORS]Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang
      [ABSTRACT]In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.
      [LINK]http://arxiv.org/abs/2411.15098v3
      [DATE]2024-12-03 01:59:40+08:00
      [CATEGORIES]cs.LG
    • The Data-Driven Censored Newsvendor Problem
      [AUTHORS]Chamsi Hssaine, Sean R. Sinclair
      [ABSTRACT]We study a censored variant of the data-driven newsvendor problem, where the decision-maker must select an ordering quantity that minimizes expected overage and underage costs based only on censored sales data, rather than historical demand realizations. To isolate the impact of demand censoring on this problem, we adopt a distributionally robust optimization framework, evaluating policies according to their worst-case regret over an ambiguity set of distributions. This set is defined by the largest historical order quantity (the observable boundary of the dataset), and contains all distributions matching the true demand distribution up to this boundary, while allowing them to be arbitrary afterwards. We demonstrate a spectrum of achievability under demand censoring by deriving a natural necessary and sufficient condition under which vanishing regret is an achievable goal. In regimes in which it is not, we exactly characterize the information loss due to censoring: an insurmountable lower bound on the performance of any policy, even when the decision-maker has access to infinitely many demand samples. We then leverage these sharp characterizations to propose a natural robust algorithm that adapts to the historical level of demand censoring. We derive finite-sample guarantees for this algorithm across all possible censoring regimes, and show its near-optimality with matching lower bounds (up to polylogarithmic factors). We moreover demonstrate its robust performance via extensive numerical experiments on both synthetic and real-world datasets.
      [COMMENTS]67 pages, 19 tables, 7 figures
      [LINK]http://arxiv.org/abs/2412.01763v1
      [DATE]2024-12-03 01:58:54+08:00
      [CATEGORIES]cs.LG
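      For context, the classical uncensored newsvendor solution orders at the critical quantile of demand, underage / (underage + overage). The sketch below shows that textbook baseline on empirical data and how censoring (sales = min(demand, order)) hides demand above the order quantity. It is not the robust algorithm proposed in the paper; the function names and sample numbers are hypothetical.

```python
import math


def critical_ratio(underage_cost: float, overage_cost: float) -> float:
    """Classical newsvendor service level: underage / (underage + overage)."""
    return underage_cost / (underage_cost + overage_cost)


def saa_order_quantity(samples, underage_cost, overage_cost):
    """Smallest sample value whose empirical CDF reaches the critical ratio."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    ratio = critical_ratio(underage_cost, overage_cost)
    index = max(math.ceil(ratio * len(ordered)) - 1, 0)
    return ordered[index]


def censor(demands, order_quantity):
    """Observed sales when demand above the order quantity is lost."""
    return [min(d, order_quantity) for d in demands]


demands = [10, 20, 30, 40, 50]
sales = censor(demands, 25)  # demand above 25 is never observed
```

      With these numbers, the quantile computed from true demand is 40, while the same rule applied to the censored sales cannot exceed the historical order quantity of 25. This is the information loss that the abstract characterizes.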
    2024 Dec 02, Mon
    • Using Large Language Models in Automatic Hint Ranking and Generation Tasks
      [AUTHORS]Jamshid Mozafari, Florian Gerhold, Adam Jatowt
      [ABSTRACT]The use of Large Language Models (LLMs) has increased significantly recently, with individuals frequently interacting with chatbots to receive answers to a wide range of questions. In an era where information is readily accessible, it is crucial to stimulate and preserve human cognitive abilities and maintain strong reasoning skills. This paper addresses such challenges by promoting the use of hints as an alternative or a supplement to direct answers. We first introduce a manually constructed hint dataset, WIKIHINT, which includes 5,000 hints created for 1,000 questions. We then finetune open-source LLMs such as LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We assess the effectiveness of the hints with human participants who try to answer questions with and without the aid of hints. Additionally, we introduce a lightweight evaluation method, HINTRANK, to evaluate and rank hints in both answer-aware and answer-agnostic settings. Our findings show that (a) the dataset helps generate more effective hints, (b) including answer information along with questions generally improves hint quality, and (c) encoder-based models perform better than decoder-based models in hint ranking.
      [LINK]http://arxiv.org/abs/2412.01626v1
      [DATE]2024-12-02 23:44:19+08:00
      [CATEGORIES]cs.CL
    • NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers
      [AUTHORS]Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami
      [ABSTRACT]Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.
      [COMMENTS]5 pages (excluding references), accepted to Coling 2025
      [LINK]http://arxiv.org/abs/2412.01621v1
      [DATE]2024-12-02 23:41:47+08:00
      [CATEGORIES]cs.CL
    • Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking
      [AUTHORS]Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu
      [ABSTRACT]Clinical decision making (CDM) is a complex, dynamic process crucial to healthcare delivery, yet it remains a significant challenge for artificial intelligence systems. While Large Language Model (LLM)-based agents have been tested on general medical knowledge using licensing exams and knowledge question-answering tasks, their performance on CDM in real-world scenarios is limited due to the lack of comprehensive testing datasets that mirror actual medical practice. To address this gap, we present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. MedChain distinguishes itself from existing benchmarks with three key features of real-world clinical practice: personalization, interactivity, and sequentiality. Further, to tackle real-world CDM challenges, we also propose MedChain-Agent, an AI system that integrates a feedback mechanism and an MCase-RAG module to learn from previous cases and adapt its responses. MedChain-Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches. The relevant dataset and code will be released upon acceptance of this paper.
      [LINK]http://arxiv.org/abs/2412.01605v1
      [DATE]2024-12-02 23:25:02+08:00
      [CATEGORIES]cs.CL
    • VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning
      [AUTHORS]Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Jie Tang
      [ABSTRACT]Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Despite the availability of several benchmarks aimed at evaluating MLLMs in tasks from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding tasks. This reveals a critical gap in current benchmarks, which often overlook the inclusion of other key scientific disciplines such as physics and chemistry. To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, which is utilized to assess the multi-modal scientific reasoning across the three disciplines of mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education - spanning elementary school through high school - equally distributed across three disciplines, with 1,000 questions per discipline. The questions within VisScience span 21 distinct subjects and are categorized into five difficulty levels, offering a broad spectrum of topics within each discipline. With VisScience, we present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. Experimental results demonstrate that closed-source MLLMs generally outperform open-source models. The best performances observed include 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.
      [COMMENTS]89 pages, 70 figures
      [LINK]http://arxiv.org/abs/2409.13730v2
      [DATE]2024-12-02 23:11:23+08:00
      [CATEGORIES]cs.CL
    • MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
      [AUTHORS]Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Jie Tang
      [ABSTRACT]Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and our curated MathVL-test consisting of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.
      [COMMENTS]30 pages, 19 figures
      [LINK]http://arxiv.org/abs/2409.13729v2
      [DATE]2024-12-02 22:59:08+08:00
      [CATEGORIES]cs.CL
    • Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
      [AUTHORS]Alessandro Scirè, Andrei Stefan Bejgu, Simone Tedeschi, Karim Ghonim, Federico Martelli, Roberto Navigli
      [ABSTRACT]After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task, highlighting its potential to drive future research in the field.
      [COMMENTS]15 pages. To be submitted to CL journal
      [LINK]http://arxiv.org/abs/2411.19655v2
      [DATE]2024-12-02 22:28:07+08:00
      [CATEGORIES]cs.CL
    • Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
      [AUTHORS]Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt
      [ABSTRACT]Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
      [COMMENTS]Accepted at COLING 2025; long paper
      [LINK]http://arxiv.org/abs/2409.07123v2
      [DATE]2024-12-02 21:04:18+08:00
      [CATEGORIES]cs.CL cs.LG
    • Multi-Facet Blending for Faceted Query-by-Example Retrieval
      [AUTHORS]Heejin Do, Sangwon Ryu, Jonghwi Kim, Gary Geunbae Lee
      [ABSTRACT]With the growing demand to fit fine-grained user intents, faceted query-by-example (QBE), which retrieves similar documents conditioned on specific facets, has gained recent attention. However, prior approaches mainly depend on document-level comparisons using basic indicators like citations due to the lack of facet-level relevance datasets; yet, this limits their use to citation-based domains and fails to capture the intricacies of facet constraints. In this paper, we propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing to explicitly synthesize facet-specific training sets. We automatically decompose documents into facet units and generate (ir)relevant pairs by leveraging LLMs' intrinsic distinguishing capabilities; then, dynamically recomposing the units leads to facet-wise relevance-informed document pairs. Our modularization eliminates the need for pre-defined facet knowledge or labels. Further, to prove FaBle's efficacy in a new domain beyond citation-based scientific paper retrieval, we release a benchmark dataset for educational exam item QBE. FaBle augmentation on 1K documents remarkably assists training in obtaining facet conditional embeddings.
      [LINK]http://arxiv.org/abs/2412.01443v1
      [DATE]2024-12-02 20:32:19+08:00
      [CATEGORIES]cs.CL
    • A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A Crosslinguistic Case Study of Supplementary Adverbs
      [AUTHORS]Zhu Liu, Cunliang Kong, Ying Liu, Maosong Sun
      [ABSTRACT]Semantic map models (SMMs) construct a network-like conceptual space from cross-linguistic instances or forms, based on the connectivity hypothesis. This approach has been widely used to represent similarity and entailment relationships in cross-linguistic concept comparisons. However, most SMMs are manually built by human experts using bottom-up procedures, which are often labor-intensive and time-consuming. In this paper, we propose a novel graph-based algorithm that automatically generates conceptual spaces and SMMs in a top-down manner. The algorithm begins by creating a dense graph, which is subsequently pruned into maximum spanning trees, selected according to metrics we propose. These evaluation metrics include both intrinsic and extrinsic measures, considering factors such as network structure and the trade-off between precision and coverage. A case study on cross-linguistic supplementary adverbs demonstrates the effectiveness and efficiency of our model compared to human annotations and other automated methods. The tool is available at https://github.com/RyanLiut/SemanticMapModel.
      [COMMENTS]Paper under review
      [LINK]http://arxiv.org/abs/2412.01423v1
      [DATE]2024-12-02 20:06:41+08:00
      [CATEGORIES]cs.CL
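      The pruning step mentioned in this abstract, reducing a dense weighted graph to a maximum spanning tree, can be sketched with Kruskal's algorithm run on edges sorted by descending weight. This is a generic illustration and not the authors' tool: the edge weights (for example, how often two functions share a form) and the function name are assumptions, and the paper's metric-based selection among candidate trees is not shown.

```python
def maximum_spanning_tree(nodes, edges):
    """Return the edges of a maximum spanning tree (Kruskal, heaviest first).

    nodes: iterable of hashable node labels
    edges: iterable of (u, v, weight) tuples for an undirected graph
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # keep the edge only if it joins two components
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# Hypothetical dense graph over four concepts.
concepts = ["A", "B", "C", "D"]
weighted_edges = [
    ("A", "B", 5), ("B", "C", 3), ("A", "C", 4), ("C", "D", 2), ("B", "D", 1),
]
tree = maximum_spanning_tree(concepts, weighted_edges)
```

      For the four-node graph above, the tree keeps the three heaviest edges that do not close a cycle.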
    • Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
      [AUTHORS]Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee
      [ABSTRACT]End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. In zero-shot scenarios on COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its exceptional performance on benchmark datasets and superior inference speed, positioning it as a compelling choice for real-time object detection tasks. Code: https://github.com/om-ai-lab/OmDet
      [COMMENTS]Preprint
      [LINK]http://arxiv.org/abs/2403.06892v2
      [DATE]2024-12-02 19:24:20+08:00
      [CATEGORIES]cs.CL
    • Understanding the World's Museums through Vision-Language Reasoning
      [AUTHORS]Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool
      [ABSTRACT]Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections. Data reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world; (b) training large vision-language models on the collected dataset; (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring the quality as well as the practical significance of the labels. We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights on the complex and fine-grained understanding of museum exhibits. In particular, we show that some questions whose answers can often be derived directly from visual features are well answered by both types of models. On the other hand, questions that require the grounding of the visual features in repositories of human knowledge are better answered by the large vision-language models, thus demonstrating their superior capacity to perform the desired reasoning. Find our dataset, benchmarks, and source code at: https://github.com/insait-institute/Museum-65
      [LINK]http://arxiv.org/abs/2412.01370v1
      [DATE]2024-12-02 18:54:31+08:00
      [CATEGORIES]cs.CL
    • Dual-Personalizing Adapter for Federated Foundation Models
      [AUTHORS]Yiyuan Yang, Guodong Long, Tao Shen, Jing Jiang, Michael Blumenstein
      [ABSTRACT]Recently, foundation models, particularly large language models (LLMs), have demonstrated an impressive ability to adapt to various tasks by fine-tuning diverse instruction data. Notably, federated foundation models (FedFM) emerge as a privacy preservation method to fine-tune models collaboratively under federated learning (FL) settings by leveraging many distributed datasets with non-IID data. To alleviate communication and computation overhead, parameter-efficient methods are introduced for efficiency, and some research adapted personalization methods to FedFM for better user preferences alignment. However, a critical gap in existing research is the neglect of test-time distribution shifts in real-world applications, and conventional methods for test-time distribution shifts in personalized FL are less effective for FedFM due to their failure to adapt to complex distribution shift scenarios and the requirement to train all parameters. To bridge this gap, we refine the setting in FedFM, termed test-time personalization, which aims to learn personalized federated foundation models on clients while effectively handling test-time distribution shifts simultaneously. To address challenges in this setting, we explore a simple yet effective solution, a Federated Dual-Personalizing Adapter (FedDPA) architecture. By co-working with a foundation model, a global adapter and a local adapter jointly tackle the test-time distribution shifts and client-specific personalization. Additionally, we introduce an instance-wise dynamic weighting mechanism that dynamically integrates the global and local adapters for each test instance during inference, facilitating effective test-time personalization. The effectiveness of the proposed method has been evaluated on benchmark datasets across different NLP tasks.
      [LINK]http://arxiv.org/abs/2403.19211v2
      [DATE]2024-12-02 18:44:08+08:00
      [CATEGORIES]cs.LG cs.CL
    • Explicitly Representing Syntax Improves Sentence-to-layout Prediction of Unexpected Situations
      [AUTHORS]Wolf Nuyts, Ruben Cartuyvels, Marie-Francine Moens
      [ABSTRACT]Recognizing visual entities in a natural language sentence and arranging them in a 2D spatial layout require a compositional understanding of language and space. This task of layout prediction is valuable in text-to-image synthesis as it allows localized and controlled in-painting of the image. In this comparative study it is shown that we can predict layouts from language representations that implicitly or explicitly encode sentence syntax, if the sentences mention similar entity-relationships to the ones seen during training. To test compositional understanding, we collect a test set of grammatically correct sentences and layouts describing compositions of entities and relations that are unlikely to have been seen during training. Performance on this test set substantially drops, showing that current models rely on correlations in the training data and have difficulties in understanding the structure of the input sentences. We propose a novel structural loss function that better enforces the syntactic structure of the input sentence and show large performance gains in the task of 2D spatial layout prediction conditioned on text. The loss has the potential to be used in other generation tasks where a tree-like structure underlies the conditioning modality. Code, trained models and the USCOCO evaluation set are available via GitHub.
      [COMMENTS]Published in TACL
      [LINK]http://arxiv.org/abs/2401.14212v3
      [DATE]2024-12-02 18:30:50+08:00
      [CATEGORIES]cs.CL
    • MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning
      [AUTHORS]Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
      [ABSTRACT]Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among different clients. In light of the recent advances in multimodal large language models (MLLMs), such as GPT-4v and LLaVA, which demonstrate exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering, we introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at the server end to address the heterogeneous and long-tailed challenges. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites and powerful server-side computational resources. Hence, the MLLM-LLaVA-FL not only enhances the performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. Initially, we conduct global visual-text pretraining of the model. This pretraining is facilitated by utilizing the extensive open-source data available online, with the assistance of MLLMs. Subsequently, the pretrained model is distributed among various clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance the performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in the typical scenarios with data heterogeneity and long-tail distribution across different clients in FL.
      [COMMENTS]Accepted to WACV 2025
      [LINK]http://arxiv.org/abs/2409.06067v2
      [DATE]2024-12-02 18:18:38+08:00
      [CATEGORIES]cs.CL cs.LG
    • A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls
      [AUTHORS]Sheikh Shafayat, Dongkeun Yoon, Woori Jang, Jiwoo Choi, Alice Oh, Seohyon Jung
      [ABSTRACT]In this work, we propose and evaluate the feasibility of a two-stage pipeline to evaluate literary machine translation, in a fine-grained manner, from English to Korean. The results show that our framework provides fine-grained, interpretable metrics suited for literary translation and obtains a higher correlation with human judgment than traditional machine translation metrics. Nonetheless, it still fails to match inter-human agreement, especially in metrics like Korean Honorifics. We also observe that LLMs tend to favor translations generated by other LLMs, and we highlight the necessity of developing more sophisticated evaluation methods to ensure accurate and culturally sensitive machine translation of literary works.
      [LINK]http://arxiv.org/abs/2412.01340v1
      [DATE]2024-12-02 18:07:01+08:00
      [CATEGORIES]cs.CL
    • Query-Guided Self-Supervised Summarization of Nursing Notes
      [AUTHORS]Ya Gao, Hans Moen, Saila Koivusalo, Miika Koskinen, Pekka Marttinen
      [ABSTRACT]Nursing notes, an important part of Electronic Health Records (EHRs), track a patient's health during a care episode. Summarizing key information in nursing notes can help clinicians quickly understand patients' conditions. However, existing summarization methods in the clinical setting, especially abstractive methods, have overlooked nursing notes and require reference summaries for training. We introduce QGSumm, a novel query-guided self-supervised domain adaptation approach for abstractive nursing note summarization. The method uses patient-related clinical queries for guidance, and hence does not need reference summaries for training. Through automatic experiments and manual evaluation by an expert clinician, we study our approach and other state-of-the-art Large Language Models (LLMs) for nursing note summarization. Our experiments show: 1) GPT-4 is competitive in maintaining information in the original nursing notes, 2) QGSumm can generate high-quality summaries with a good balance between recall of the original content and hallucination rate lower than other top methods. Ultimately, our work offers a new perspective on conditional text summarization, tailored to clinical applications.
      [LINK]http://arxiv.org/abs/2407.04125v2
      [DATE]2024-12-02 17:42:24+08:00
      [CATEGORIES]cs.CL cs.LG
    • SiTSE: Sinhala Text Simplification Dataset and Evaluation
      [AUTHORS]Surangika Ranathunga, Rumesh Sirithunga, Himashi Rathnayake, Lahiru De Silva, Thamindu Aluthwala, Saman Peramuna, Ravi Shekhar
      [ABSTRACT]Text Simplification is a task that has been minimally explored for low-resource languages. Consequently, there are only a few manually curated datasets. In this paper, we present a human-curated sentence-level text simplification dataset for the Sinhala language. Our evaluation dataset contains 1,000 complex sentences and the corresponding 3,000 simplified sentences produced by three different human annotators. We model the text simplification task as a zero-shot and zero-resource sequence-to-sequence (seq-seq) task on the multilingual language models mT5 and mBART. We exploit auxiliary data from related seq-seq tasks and explore the possibility of using intermediate task transfer learning (ITTL). Our analysis shows that ITTL outperforms the previously proposed zero-resource methods for text simplification. Our findings also highlight the challenges in evaluating text simplification systems, and support the calls for improved metrics for measuring the quality of automated text simplification systems that would suit low-resource languages as well. Our code and data are publicly available: https://github.com/brainsharks-fyp17/Sinhala-Text-Simplification-Dataset-and-Evaluation
      [LINK]http://arxiv.org/abs/2412.01293v1
      [DATE]2024-12-02 17:08:06+08:00
      [CATEGORIES]cs.CL
    • Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Object-Oriented Programming
      [AUTHORS]Tianyang Wang, Ziqian Bi, Keyu Chen, Jiawei Xu, Qian Niu, Junyu Liu, Benji Peng, Ming Li, Sen Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Caitlyn Heqi Yin, Yizhu Wen, Ming Liu
      [ABSTRACT]Object-Oriented Programming (OOP) has become a crucial paradigm for managing the growing complexity of modern software systems, particularly in fields like machine learning, deep learning, large language models (LLM), and data analytics. This work provides a comprehensive introduction to the integration of OOP techniques within these domains, with a focus on improving code modularity, maintainability, and scalability. We begin by outlining the evolution of computing and the rise of OOP, followed by an in-depth discussion of key OOP principles such as encapsulation, inheritance, polymorphism, and abstraction. The practical application of these principles is demonstrated using Python, a widely adopted language in AI and data science. Furthermore, we examine how design patterns and modular programming can be employed to enhance the structure and efficiency of machine learning systems. In subsequent sections, we apply these OOP concepts to real-world AI tasks, including the encapsulation of preprocessing workflows, machine learning model training, and evaluation. Detailed examples illustrate how OOP can be used to build reusable, scalable machine learning systems while maintaining code clarity and reducing redundancy. This work is intended to serve as a bridge for both beginners and experienced developers, equipping them with the necessary knowledge to apply OOP methodologies in AI-driven projects, ultimately fostering the development of more robust and maintainable systems.
      [COMMENTS]49 pages
      [LINK]http://arxiv.org/abs/2409.19916v3
      [DATE]2024-12-02 16:56:26+08:00
      [CATEGORIES]cs.CL
    • Indexing Economic Fluctuation Narratives from Keiki Watchers Survey
      [AUTHORS]Eriko Shigetsugu, Hiroki Sakaji, Itsuki Noda
      [ABSTRACT]In this paper, we design indices of economic fluctuation narratives derived from economic surveys. Companies, governments, and investors rely on key metrics like GDP and industrial production indices to predict economic trends. However, they have yet to effectively leverage the wealth of information contained in economic text, such as causal relationships, in their economic forecasting. Therefore, we design indices of economic fluctuation from economic surveys by using our previously proposed narrative framework. The evaluation results show that the proposed indices had a stronger correlation with the cumulative lagging diffusion index than with other types of diffusion indices.
      [LINK]http://arxiv.org/abs/2412.01265v1
      [DATE]2024-12-02 16:32:02+08:00
      [CATEGORIES]cs.CL
    • Do Large Language Models with Reasoning and Acting Meet the Needs of Task-Oriented Dialogue?
      [AUTHORS]Michelle Elizabeth, Morgan Veyret, Miguel Couceiro, Ondrej Dusek, Lina M. Rojas-Barahona
      [ABSTRACT]Large language models (LLMs) have gained immense popularity due to their impressive capabilities in unstructured conversations. However, they underperform compared to previous approaches in task-oriented dialogue (TOD), wherein reasoning and accessing external information are crucial. Empowering LLMs with advanced prompting strategies such as reasoning and acting (ReAct) has shown promise in solving complex tasks traditionally requiring reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs performing TOD. We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation and with real users. While ReAct-LLMs seem to underperform state-of-the-art approaches in simulation, human evaluation indicates a higher user satisfaction rate compared to handcrafted systems despite a lower success rate.
      [LINK]http://arxiv.org/abs/2412.01262v1
      [DATE]2024-12-02 16:30:22+08:00
      [CATEGORIES]cs.CL
    • GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering
      [AUTHORS]Qianlong Li, Chen Huang, Shuai Li, Yuanxin Xiang, Deng Xiong, Wenqiang Lei
      [ABSTRACT]Complex Table Question Answering involves providing accurate answers to specific questions based on intricate tables that exhibit complex layouts and flexible header locations. Despite considerable progress having been made in the LLM era, the reasoning processes of existing methods are often implicit, feeding the entire table into prompts, making it difficult to effectively filter out irrelevant information in the table. To this end, we propose GraphOTTER that explicitly establishes the reasoning process to pinpoint the correct answers. In particular, GraphOTTER leverages a graph-based representation, transforming the complex table into an undirected graph. It then conducts step-by-step reasoning on the graph, with each step guided by a set of pre-defined intermediate reasoning actions. As such, it constructs a clear reasoning path and effectively identifies the answer to a given question. Comprehensive experiments on two benchmark datasets and two LLM backbones demonstrate the effectiveness of GraphOTTER. Further analysis indicates that its success may be attributed to the ability to efficiently filter out irrelevant information, thereby focusing the reasoning process on the most pertinent data. Our code and experimental datasets are available at https://github.com/JDing0521/GraphOTTER.
      [COMMENTS]COLING 2025, code is available at https://github.com/JDing0521/GraphOTTER
      [LINK]http://arxiv.org/abs/2412.01230v1
      [DATE]2024-12-02 15:49:23+08:00
      [CATEGORIES]cs.CL
    • GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Model
      [AUTHORS]Xuanchang Zhang, Zhuosheng Zhang, Hai Zhao
      [ABSTRACT]Despite the rapid progress of large language models (LLMs), their task performance remains sensitive to prompt design. Recent studies have explored leveraging the LLM itself as an optimizer to identify optimal prompts that maximize task accuracy. However, when evaluating prompts, such approaches heavily rely on elusive manually annotated gold labels to calculate task accuracy for each candidate prompt, which hinders the widespread implementation and generality. To overcome the limitation, this work proposes a gold label-agnostic prompt evaluation (GLaPE) to alleviate dependence on gold labels. Motivated by the observed correlation between self-consistency and the accuracy of the answer, we adopt self-consistency as the initial evaluation score. Subsequently, we refine the scores of prompts producing identical answers to be mutually consistent. Experimental results show that GLaPE provides reliable evaluations uniform with accuracy, even in the absence of gold labels. Moreover, on six popular reasoning tasks, our GLaPE-based prompt optimization yields effective prompts comparable to accuracy-based ones. The code is publicly available at https://github.com/thunderous77/GLaPE.
      [COMMENTS]EMNLP 2024
      [LINK]http://arxiv.org/abs/2402.02408v2
      [DATE]2024-12-02 15:47:00+08:00
      [CATEGORIES]cs.CL cs.LG
    • Composition of Experts: A Modular Compound AI System Leveraging Large Language Models
      [AUTHORS]Swayambhoo Jain, Ravi Raju, Bo Li, Zoltan Csaki, Jonathan Li, Kaizhao Liang, Guoyao Feng, Urmish Thakkar, Anand Sampat, Raghu Prabhakar, Sumati Jairath
      [ABSTRACT]Large Language Models (LLMs) have achieved remarkable advancements, but their monolithic nature presents challenges in terms of scalability, cost, and customization. This paper introduces the Composition of Experts (CoE), a modular compound AI system leveraging multiple expert LLMs. CoE leverages a router to dynamically select the most appropriate expert for a given input, enabling efficient utilization of resources and improved performance. We formulate the general problem of training a CoE and discuss inherent complexities associated with it. We propose a two-step routing approach to address these complexities that first uses a router to classify the input into distinct categories, followed by a category-to-expert mapping to obtain desired experts. CoE offers a flexible and cost-effective solution to build compound AI systems. Our empirical evaluation demonstrates the effectiveness of CoE in achieving superior performance with reduced computational overhead. Given that CoE comprises many expert LLMs, it has unique system requirements for cost-effective serving. We present an efficient implementation of CoE leveraging the SambaNova SN40L RDU's unique three-tiered memory architecture. CoEs obtained using open weight LLMs Qwen/Qwen2-7B-Instruct, google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Llama-3.1-70B-Instruct and Qwen/Qwen2-72B-Instruct achieve a score of 59.4 with merely 31 billion average active parameters on Arena-Hard and a score of 9.06 with 54 billion average active parameters on MT-Bench.
      [LINK]http://arxiv.org/abs/2412.01868v1
      [DATE]2024-12-02 15:43:21+08:00
      [CATEGORIES]cs.LG cs.CL
    • SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages
      [AUTHORS]Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, Qian Liu
      [ABSTRACT]In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian Languages (SEA). SailCompass encompasses three main SEA languages, eight primary tasks including 14 datasets covering three task types (generation, multiple-choice questions, and classification). To improve the robustness of the evaluation approach, we explore different prompt configurations for multiple-choice questions and leverage calibrations to improve the faithfulness of classification tasks. With SailCompass, we derive the following findings: (1) SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed; (2) A balanced language distribution is important for developing better SEA-specialized LLMs; (3) Advanced prompting techniques (e.g., calibration, perplexity-based ranking) are necessary to better utilize LLMs. All datasets and evaluation scripts are public.
      [COMMENTS]code: https://github.com/sail-sg/sailcompass
      [LINK]http://arxiv.org/abs/2412.01186v1
      [DATE]2024-12-02 14:42:51+08:00
      [CATEGORIES]cs.CL
    • Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
      [AUTHORS]Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu
      [ABSTRACT]Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach - cDPO - designed to automatically recognize and conduct token-level rewards for the critical tokens during the alignment process. Specifically, we develop a contrastive estimation approach to automatically identify critical tokens. It is achieved by comparing the generation likelihood of positive and negative models. To achieve this, we separately fine-tune the positive and negative models on various reasoning trajectories; consequently, they are capable of identifying critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithms to token-level DPO and utilize the differential likelihood from the aforementioned positive and negative model as an important weight for token-level DPO learning. Experimental results on GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed approach cDPO.
      [COMMENTS]Work in progress
      [LINK]http://arxiv.org/abs/2411.19943v2
      [DATE]2024-12-02 14:26:38+08:00
      [CATEGORIES]cs.CL cs.LG
    • A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans
      [AUTHORS]Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga
      [ABSTRACT]Recently, much work has concerned itself with the enigma of what exactly PLMs (pretrained language models) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Only one relation was considered, namely hypernymy. Furthermore, previous work did not measure humans' performance on the same task as that solved by the PLMs. This means that at this point in time, there is only an incomplete view of models' semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use six metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, asymmetry, prototypicality, an