Pay Attention to the Robustness of Chinese Minority Language Models! Syllable-level Textual Adversarial Attack on Tibetan Script [AUTHORS]Xi Cao, Dolma Dawa, Nuo Qun, Trashi Nyima [ABSTRACT]The textual adversarial attack refers to an attack method in which the
attacker adds imperceptible perturbations to the original texts by elaborate
design so that the NLP (natural language processing) model produces false
judgments. This method is also used to evaluate the robustness of NLP models.
Currently, most of the research in this field focuses on English, and there is
also a certain amount of research on Chinese. However, to the best of our
knowledge, there is little research targeting Chinese minority languages.
Textual adversarial attacks are a new challenge for the information processing
of Chinese minority languages. In response to this situation, we propose a
Tibetan syllable-level black-box textual adversarial attack called TSAttacker
based on syllable cosine distance and a scoring mechanism. We then apply
TSAttacker to six models generated by fine-tuning two PLMs (pre-trained
language models) for three downstream tasks. The experiment results show that
TSAttacker is effective and generates high-quality adversarial samples. In
addition, the robustness of the involved models still has much room for
improvement. [COMMENTS]Revised Version; Accepted at ACL 2023 Workshop on TrustNLP [LINK]http://arxiv.org/abs/2412.02323v2 [DATE]2024-12-04 17:08:45+08:00 [CATEGORIES]cs.CL
2024 Dec 03, Tue
Discovering influential text using convolutional neural networks [AUTHORS]Megan Ayers, Luke Sanford, Margaret Roberts, Eddie Yang [ABSTRACT]Experimental methods for estimating the impacts of text on human evaluation
have been widely used in the social sciences. However, researchers in
experimental settings are usually limited to testing a small number of
pre-specified text treatments. While efforts to mine unstructured texts for
features that causally affect outcomes have been ongoing in recent years, these
models have primarily focused on the topics or specific words of text, which
may not always be the mechanism of the effect. We connect these efforts with
NLP interpretability techniques and present a method for flexibly discovering
clusters of similar text phrases that are predictive of human reactions to
texts using convolutional neural networks. When used in an experimental
setting, this method can identify text treatments and their effects under
certain assumptions. We apply the method to two datasets. The first enables
direct validation of the model's ability to detect phrases known to cause the
outcome. The second demonstrates its ability to flexibly discover text
treatments with varying textual structures. In both cases, the model learns a
greater variety of text treatments compared to benchmark methods, and these
text features quantitatively meet or exceed the ability of benchmark methods to
predict the outcome. [COMMENTS]Published in Findings of ACL 2024 (see
https://aclanthology.org/2024.findings-acl.714) [LINK]http://arxiv.org/abs/2406.10086v3 [DATE]2024-12-03 05:31:59+08:00 [CATEGORIES]cs.CL cs.LG
2024 Dec 06, Fri
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [AUTHORS]Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong [ABSTRACT]Graphical User Interfaces (GUIs) are critical to human-computer interaction,
yet automating GUI tasks remains challenging due to the complexity and
variability of visual environments. Existing approaches often rely on textual
representations of GUIs, which introduce limitations in generalization,
efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure
vision-based framework for autonomous GUI agents that operates across various
platforms. Our approach leverages image-based observations, grounds
natural-language instructions to visual elements, and employs a consistent
action space to ensure cross-platform generalization. To address the
limitations of previous work, we integrate explicit planning and reasoning
within the model, enhancing its ability to autonomously navigate and interact
with complex digital environments. We construct a large-scale dataset of GUI
agent trajectories, incorporating multimodal reasoning and grounding, and
employ a two-stage training pipeline that first focuses on general GUI
grounding, followed by planning and reasoning. Through comprehensive
experiments, we demonstrate that Aguvis surpasses previous state-of-the-art
methods in both offline and real-world online scenarios, achieving, to our
knowledge, the first fully autonomous pure vision GUI agent capable of
performing tasks independently without collaboration with external
closed-source models. We open-sourced all datasets, models, and training
recipes to facilitate future research at https://aguvis-project.github.io/. [COMMENTS]https://aguvis-project.github.io/ [LINK]http://arxiv.org/abs/2412.04454v1 [DATE]2024-12-06 02:58:26+08:00 [CATEGORIES]cs.CL
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding [AUTHORS]Yuqi Wu, Wenzhao Zheng, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu [ABSTRACT]3D occupancy prediction provides a comprehensive description of the
surrounding scenes and has become an essential task for 3D perception. Most
existing methods focus on offline perception from one or a few views and cannot
be applied to embodied agents, which must gradually perceive the scene
through progressive embodied exploration. In this paper, we formulate an
embodied 3D occupancy prediction task to target this practical scenario and
propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize
the global scene with uniform 3D semantic Gaussians and progressively update
local regions observed by the embodied agent. For each update, we extract
semantic and structural features from the observed image and efficiently
incorporate them via deformable cross-attention to refine the regional
Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global
3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown
(i.e., uniformly distributed) environment and maintains an explicit global
memory of it with 3D Gaussians. It gradually gains knowledge through local
refinement of regional Gaussians, which is consistent with how humans
understand new scenes through embodied exploration. We reorganize an
EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the
evaluation of the embodied 3D occupancy prediction task. Experiments
demonstrate that our EmbodiedOcc outperforms existing local prediction methods
and accomplishes the embodied occupancy prediction with high accuracy and
strong expandability. Our code is available at:
https://github.com/YkiWu/EmbodiedOcc. [COMMENTS]Code: https://github.com/YkiWu/EmbodiedOcc [LINK]http://arxiv.org/abs/2412.04380v1 [DATE]2024-12-06 01:57:09+08:00 [CATEGORIES]cs.LG
Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting [AUTHORS]Edoardo Cetin, Ahmed Touati, Yann Ollivier [ABSTRACT]The forward-backward representation (FB) is a recently proposed framework
(Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation
models (BFMs) that aim at providing zero-shot efficient policies for any new
task specified in a given reinforcement learning (RL) environment, without
training for each new task. Here we address two core limitations of FB model
training. First, FB, like all successor-feature-based methods, relies on a
linear encoding of tasks: at test time, each new reward function is linearly
projected onto a fixed set of pre-trained features. This limits expressivity as
well as precision of the task representation. We break the linearity limitation
by introducing auto-regressive features for FB, which let fine-grained task
features depend on coarser-grained task information. This can represent
arbitrary nonlinear task encodings, thus significantly increasing expressivity
of the FB framework. Second, it is well-known that training RL agents from
offline datasets often requires specific techniques. We show that FB works well
together with such offline RL techniques, by adapting techniques from (Nair et
al., 2020b; Cetin et al., 2024) for FB. This is necessary to get non-flatlining
performance in some datasets, such as DMC Humanoid. As a result, we produce
efficient FB BFMs for a number of new environments. Notably, in the D4RL
locomotion benchmark, the generic FB agent matches the performance of standard
single-task offline agents (IQL, XQL). In many setups, the offline techniques
are needed to get any decent performance at all. The auto-regressive features
have a positive but moderate impact, concentrated on tasks requiring spatial
precision and task generalization beyond the behaviors represented in the
trainset. [LINK]http://arxiv.org/abs/2412.04368v1 [DATE]2024-12-06 01:36:22+08:00 [CATEGORIES]cs.LG
Machine Theory of Mind for Autonomous Cyber-Defence [AUTHORS]Luke Swaby, Matthew Stewart, Daniel Harrold, Chris Willis, Gregory Palmer [ABSTRACT]Intelligent autonomous agents hold much potential for the domain of
cyber-security. However, due to many state-of-the-art approaches relying on
uninterpretable black-box models, there is growing demand for methods that
offer stakeholders clear and actionable insights into their latent beliefs and
motivations. To address this, we evaluate Theory of Mind (ToM) approaches for
Autonomous Cyber Operations. Upon learning a robust prior, ToM models can
predict an agent's goals, behaviours, and contextual beliefs given only a
handful of past behaviour observations. In this paper, we introduce a novel
Graph Neural Network (GNN)-based ToM architecture tailored for cyber-defence,
Graph-In, Graph-Out (GIGO)-ToM, which can accurately predict both the targets
and attack trajectories of adversarial cyber agents over arbitrary computer
network topologies. To evaluate the latter, we propose a novel extension of the
Wasserstein distance for measuring the similarity of graph-based probability
distributions. Whereas the standard Wasserstein distance lacks a fixed
reference scale, we introduce a graph-theoretic normalization factor that
enables a standardized comparison between networks of different sizes. We
furnish this metric, which we term the Network Transport Distance (NTD), with a
weighting function that emphasizes predictions according to custom node
features, allowing network operators to explore arbitrary strategic
considerations. Benchmarked against a Graph-In, Dense-Out (GIDO)-ToM
architecture in an abstract cyber-defence environment, our empirical
evaluations show that GIGO-ToM can accurately predict the goals and behaviours
of various unseen cyber-attacking agents across a range of network topologies,
as well as learn embeddings that can effectively characterize their policies. [COMMENTS]29 pages, 17 figures, 12 tables [LINK]http://arxiv.org/abs/2412.04367v1 [DATE]2024-12-06 01:35:29+08:00 [CATEGORIES]cs.LG
Action Mapping for Reinforcement Learning in Continuous Environments with Constraints [AUTHORS]Mirco Theile, Lukas Dirnberger, Raphael Trumpp, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli [ABSTRACT]Deep reinforcement learning (DRL) has had success across various domains, but
applying it to environments with constraints remains challenging due to poor
sample efficiency and slow convergence. Recent literature explored
incorporating model knowledge to mitigate these problems, particularly through
the use of models that assess the feasibility of proposed actions. However,
integrating feasibility models efficiently into DRL pipelines in environments
with continuous action spaces is non-trivial. We propose a novel DRL training
strategy utilizing action mapping that leverages feasibility models to
streamline the learning process. By decoupling the learning of feasible actions
from policy optimization, action mapping allows DRL agents to focus on
selecting the optimal action from a reduced feasible action set. We demonstrate
through experiments that action mapping significantly improves training
performance in constrained environments with continuous action spaces,
especially with imperfect feasibility models. [LINK]http://arxiv.org/abs/2412.04327v1 [DATE]2024-12-06 00:42:45+08:00 [CATEGORIES]cs.LG
On Multi-Agent Inverse Reinforcement Learning [AUTHORS]Till Freihaut, Giorgia Ramponi [ABSTRACT]In multi-agent systems, the agent behavior is highly influenced by its
utility function, as these utilities shape both individual goals and
interactions with the other agents. Inverse Reinforcement Learning (IRL) is a
well-established approach to inferring the utility function by observing an
expert behavior within a given environment. In this paper, we extend the IRL
framework to the multi-agent setting, assuming that we observe agents who are
following Nash Equilibrium (NE) policies. We theoretically investigate the set
of utilities that explain the behavior of NE experts. Specifically, we provide
an explicit characterization of the feasible reward set and analyze how errors
in estimating the transition dynamics and expert behavior impact the recovered
rewards. Building on these findings, we provide the first sample complexity
analysis for the multi-agent IRL problem. Finally, we provide a numerical
evaluation of our theoretical results. [COMMENTS]Currently under review [LINK]http://arxiv.org/abs/2411.15046v2 [DATE]2024-12-06 00:04:02+08:00 [CATEGORIES]cs.LG
2024 Dec 05, Thu
Agent-OM: Leveraging LLM Agents for Ontology Matching [AUTHORS]Zhangcheng Qiang, Weiqing Wang, Kerry Taylor [ABSTRACT]Ontology matching (OM) enables semantic interoperability between different
ontologies and resolves their conceptual heterogeneity by aligning related
entities. OM systems currently have two prevailing design paradigms:
conventional knowledge-based expert systems and newer machine learning-based
predictive systems. While large language models (LLMs) and LLM agents have
revolutionised data engineering and have been applied creatively in many
domains, their potential for OM remains underexplored. This study introduces a
novel agent-powered LLM-based design paradigm for OM systems. With
consideration of several specific challenges in leveraging LLM agents for OM,
we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
consisting of two Siamese agents for retrieval and matching, with a set of
simple OM tools. Our framework is implemented in a proof-of-concept system.
Evaluations on three Ontology Alignment Evaluation Initiative (OAEI) tracks
against state-of-the-art OM systems show that our system can achieve results very
close to the long-standing best performance on simple OM tasks and can
significantly improve the performance on complex and few-shot OM tasks. [COMMENTS]14 pages, 13 figures, 4 tables [LINK]http://arxiv.org/abs/2312.00326v4 [DATE]2024-12-05 22:45:05+08:00 [CATEGORIES]cs.CL
A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios [AUTHORS]Xiachong Feng, Longxu Dou, Ella Li, Qinghao Wang, Haochuan Wang, Yu Guo, Chang Ma, Lingpeng Kong [ABSTRACT]Game-theoretic scenarios have become pivotal in evaluating the social
intelligence of Large Language Model (LLM)-based social agents. While numerous
studies have explored these agents in such settings, there is a lack of a
comprehensive survey summarizing the current progress. To address this gap, we
systematically review existing research on LLM-based social agents within
game-theoretic scenarios. Our survey organizes the findings into three core
components: Game Framework, Social Agent, and Evaluation Protocol. The game
framework encompasses diverse game scenarios, ranging from choice-focusing to
communication-focusing games. The social agent part explores agents'
preferences, beliefs, and reasoning abilities. The evaluation protocol covers
both game-agnostic and game-specific metrics for assessing agent performance.
By reflecting on the current research and identifying future research
directions, this survey provides insights to advance the development and
evaluation of social agents in game-theoretic scenarios. [LINK]http://arxiv.org/abs/2412.03920v1 [DATE]2024-12-05 14:46:46+08:00 [CATEGORIES]cs.CL
MISR: Measuring Instrumental Self-Reasoning in Frontier Models [AUTHORS]Kai Fronsdal, David Lindner [ABSTRACT]We propose a suite of tasks to evaluate the instrumental self-reasoning
ability of large language model (LLM) agents. Instrumental self-reasoning
ability could improve adaptability and enable self-modification, but it could
also pose significant risks, such as enabling deceptive alignment. Prior work
has only evaluated self-reasoning in non-agentic settings or in limited
domains. In this paper, we propose evaluations for instrumental self-reasoning
ability in agentic tasks in a wide range of scenarios, including
self-modification, knowledge seeking, and opaque self-reasoning. We evaluate
agents built using state-of-the-art LLMs, including commercial and open source
systems. We find that instrumental self-reasoning ability emerges only in the
most capable frontier models and that it is highly context-dependent. No model
passes the most difficult versions of our evaluations, hence our evaluation
can be used to measure increases in instrumental self-reasoning ability in
future models. We open-source our evaluations at
https://github.com/kaifronsdal/Self-Reasoning-Evals. [COMMENTS]10 pages, 65 page appendix, 5 figures [LINK]http://arxiv.org/abs/2412.03904v1 [DATE]2024-12-05 14:20:47+08:00 [CATEGORIES]cs.CL cs.LG
PreAct: Prediction Enhances Agent's Planning Ability [AUTHORS]Dayuan Fu, Jianzhao Huang, Siyuan Lu, Guanting Dong, Yejie Wang, Keqing He, Weiran Xu [ABSTRACT]Addressing the disparity between forecasts and actual results can enable
individuals to expand their thought processes and stimulate self-reflection,
thus promoting accurate planning. In this research, we present **PreAct**, an
agent framework that integrates **pre**diction, **rea**soning, and **act**ion.
By utilizing the information derived from predictions, the large language model
(LLM) agent can provide a wider range and more strategically focused reasoning.
This leads to more efficient actions that aid the agent in accomplishing
intricate tasks. Our experimental results show that PreAct surpasses the ReAct
method in completing complex tasks and that PreAct's performance can be further
improved when paired with other memory or selection strategy techniques. We
presented the model with varying quantities of historical predictions and
discovered that these predictions consistently enhance LLM planning. The
variances in single-step reasoning between PreAct and ReAct indicate that
PreAct indeed has benefits in terms of diversity and strategic orientation over
ReAct. [COMMENTS]Coling 2025 [LINK]http://arxiv.org/abs/2402.11534v2 [DATE]2024-12-05 12:40:54+08:00 [CATEGORIES]cs.CL
Network Formation and Dynamics Among Multi-LLMs [AUTHORS]Marios Papachristou, Yuan Yuan [ABSTRACT]Social networks fundamentally shape human opinions, behaviors, and the
dissemination of information. As large language models (LLMs) like GPT, Claude,
and Llama increasingly integrate into social and professional settings,
understanding their behavior in the context of social interactions and network
formation becomes essential. This study develops a framework to systematically
examine whether the network formation behaviors of multiple LLMs approximate
certain aspects of human network dynamics. By simulating interactions among LLM
agents across various model families, we observe that these models consistently
exhibit key patterns associated with social network principles including
preferential attachment, triadic closure, homophily, community structure, and
the small-world phenomenon when forming networks. Moreover, LLMs adapt their
network formation strategies based on each network's characteristics,
reflecting the context-dependent nature of human behavior: in Facebook
networks, they prioritize triadic closure and homophily, mirroring close-knit
friendships; in phone networks, homophily and preferential attachment dominate,
capturing personal and professional connections, while in employment networks,
LLMs favor heterophily and high-degree connections, aligning with career
advancement dynamics. These results open new avenues for using LLMs in network
science research, with potential applications in agent-based modeling and
synthetic network generation. [LINK]http://arxiv.org/abs/2402.10659v4 [DATE]2024-12-05 12:35:22+08:00 [CATEGORIES]cs.CL
Social Life Simulation for Non-Cognitive Skills Learning [AUTHORS]Zihan Yan, Yaohong Xiang, Yun Huang [ABSTRACT]Non-cognitive skills are crucial for personal and social life well-being, and
such skill development can be supported by narrative-based (e.g., storytelling)
technologies. While generative AI enables interactive and role-playing
storytelling, little is known about how users engage with and perceive the use
of AI in social life simulation for non-cognitive skills learning.
Additionally, the benefits of AI mentorship on self-reflection awareness and
ability in this context remain largely underexplored. To this end, we
introduced Simulife++, an interactive platform enabled by a large language
model (LLM). The system allows users to act as protagonists, creating stories
with one or multiple AI-based characters in diverse social scenarios. In
particular, we expanded the Human-AI interaction to a Human-AI-AI collaboration
by including a Sage Agent, who acts as a bystander, providing users with some
perspectives and guidance on their choices and conversations in terms of
non-cognitive skills to promote reflection. In a within-subject user study, our
quantitative results reveal that, when accompanied by Sage Agent, users exhibit
significantly higher levels of reflection on motivation, self-perceptions, and
resilience & coping, along with an enhanced experience of narrative
transportation. Additionally, our qualitative findings suggest that Sage Agent
plays a crucial role in promoting reflection on non-cognitive skills, enhancing
social communication and decision-making performance, and improving overall
user experience within Simulife++. Multiple supportive relationships between
Sage Agent and users were also reported. We offer design implications for the
application of generative AI in narrative solutions and the future potential of
Sage Agent for non-cognitive skill development in broader social contexts. [LINK]http://arxiv.org/abs/2405.00273v3 [DATE]2024-12-05 12:19:45+08:00 [CATEGORIES]cs.CL
Educational-Psychological Dialogue Robot Based on Multi-Agent Collaboration [AUTHORS]Shiwen Ni, Min Yang [ABSTRACT]Intelligent dialogue systems are increasingly used in modern education and
psychological counseling fields, but most existing systems are limited to a
single domain, cannot deal with both educational and psychological issues, and
often lack accuracy and professionalism when dealing with complex issues. To
address these problems, this paper proposes an intelligent dialog system that
combines educational and psychological counseling functions. The system
consists of multiple AI agents, including a security detection agent, an intent
identification agent, an educational LLM agent, and a psychological LLM agent, which
work in concert to ensure the provision of accurate educational knowledge Q&A
and psychological support services. Specifically, the system recognizes
user-input intentions through an intention classification model and invokes a
retrieval-enhanced educational large model and a psychological large model
fine-tuned with psychological data in order to provide professional educational
advice and psychological support. [LINK]http://arxiv.org/abs/2412.03847v1 [DATE]2024-12-05 11:27:02+08:00 [CATEGORIES]cs.CL
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models [AUTHORS]Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji [ABSTRACT]Data visualization in the form of charts plays a pivotal role in data
analysis, offering critical insights and aiding in informed decision-making.
Automatic chart understanding has witnessed significant advancements with the
rise of large foundation models in recent years. Foundation models, such as
large language models, have revolutionized various natural language processing
tasks and are increasingly being applied to chart understanding tasks. This
survey paper provides a comprehensive overview of the recent developments,
challenges, and future directions in chart understanding within the context of
these foundation models. We review fundamental building blocks crucial for
studying chart understanding tasks. Additionally, we explore various tasks and
their evaluation metrics and sources of both charts and textual inputs. Various
modeling strategies are then examined, encompassing both classification-based
and generation-based approaches, along with tool augmentation techniques that
enhance chart understanding performance. Furthermore, we discuss the
state-of-the-art performance of each task and discuss how we can improve the
performance. Challenges and future directions are addressed, highlighting the
importance of several topics, such as domain-specific charts, lack of efforts
in developing evaluation metrics, and agent-oriented settings. This survey
paper serves as a comprehensive resource for researchers and practitioners in
the fields of natural language processing, computer vision, and data analysis,
providing valuable insights and directions for future research in chart
understanding leveraging large foundation models. The studies mentioned in this
paper, along with emerging new research, will be continually updated at:
https://github.com/khuangaf/Awesome-Chart-Understanding. [COMMENTS]IEEE Transactions on Knowledge and Data Engineering (TKDE) [LINK]http://arxiv.org/abs/2403.12027v4 [DATE]2024-12-05 11:26:13+08:00 [CATEGORIES]cs.CL
ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data [AUTHORS]Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar [ABSTRACT]Large Language Model (LLM) agents are rapidly improving to handle
increasingly complex web-based tasks. Most of these agents rely on
general-purpose, proprietary models like GPT-4 and focus on designing better
prompts to improve their planning abilities. However, general-purpose LLMs are
not specifically trained to understand specialized web contexts such as HTML,
and they often struggle with long-horizon planning. We explore an alternative
approach that fine-tunes open-source LLMs using production-scale workflow data
collected from over 250 domains corresponding to 6 billion tokens. This simple
yet effective approach shows substantial gains over prompting-based agents on
existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation
performance on Mind2Web and improves the task success rate by 7.3% over the
previous best text-only web agents on WebArena. We further perform detailed
ablation studies on various fine-tuning design choices and provide insights
into LLM selection, training recipes, context window optimization, and effect
of dataset sizes. [LINK]http://arxiv.org/abs/2411.15004v2 [DATE]2024-12-05 10:00:07+08:00 [CATEGORIES]cs.CL
Agent AI with LangGraph: A Modular Framework for Enhancing Machine Translation Using Large Language Models [AUTHORS]Jialin Wang, Zhihua Duan [ABSTRACT]This paper explores the transformative role of Agent AI and LangGraph in
advancing the automation and effectiveness of machine translation (MT). Agents
are modular components designed to perform specific tasks, such as translating
between particular languages, with specializations like TranslateEnAgent,
TranslateFrenchAgent, and TranslateJpAgent for English, French, and Japanese
translations, respectively. These agents leverage the powerful semantic
capabilities of large language models (LLMs), such as GPT-4o, to ensure
accurate, contextually relevant translations while maintaining modularity,
scalability, and context retention.
LangGraph, a graph-based framework built on LangChain, simplifies the
creation and management of these agents and their workflows. It supports
dynamic state management, enabling agents to maintain dialogue context and
automates complex workflows by linking agents and facilitating their
collaboration. With flexibility, open-source community support, and seamless
integration with LLMs, LangGraph empowers agents to deliver high-quality
translations.
Together, Agent AI and LangGraph create a cohesive system where LangGraph
orchestrates agent interactions, ensuring that user inputs are analyzed,
routed, and processed efficiently. Experimental results demonstrate the
potential of this system to enhance multilingual translation accuracy and
scalability. By highlighting modular design and automated workflows, this paper
sets the stage for further innovations in intelligent machine translation
services. [LINK]http://arxiv.org/abs/2412.03801v1 [DATE]2024-12-05 09:45:12+08:00 [CATEGORIES]cs.CL
The broader spectrum of in-context learning [AUTHORS]Andrew Kyle Lampinen, Stephanie C. Y. Chan, Aaditya K. Singh, Murray Shanahan [ABSTRACT]The ability of language models to learn a task from a few examples in context
has generated substantial interest. Here, we provide a perspective that
situates this type of supervised few-shot learning within a much broader
spectrum of meta-learned in-context learning. Indeed, we suggest that any
distribution of sequences in which context non-trivially decreases loss on
subsequent predictions can be interpreted as eliciting a kind of in-context
learning. We suggest that this perspective helps to unify the broad set of
in-context abilities that language models exhibit, such as
adapting to tasks from instructions or role play, or extrapolating time series.
This perspective also sheds light on potential roots of in-context learning in
lower-level processing of linguistic dependencies (e.g. coreference or parallel
structures). Finally, taking this perspective highlights the importance of
generalization, which we suggest can be studied along several dimensions: not
only the ability to learn something novel, but also flexibility in learning
from different presentations, and in applying what is learned. We discuss
broader connections to past literature in meta-learning and goal-conditioned
agents, and other perspectives on learning and adaptation. We close by
suggesting that research on in-context learning should consider this broader
spectrum of in-context capabilities and types of generalization. [LINK]http://arxiv.org/abs/2412.03782v1 [DATE]2024-12-05 08:05:11+08:00 [CATEGORIES]cs.CL cs.LG
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [AUTHORS]Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, Lichao Sun [ABSTRACT]Large language models (LLMs) have garnered significant attention due to their
impressive natural language processing (NLP) capabilities. Recently, many
studies have focused on the tool utilization ability of LLMs. They primarily
investigated how LLMs effectively collaborate with given specific tools.
However, in scenarios where LLMs serve as intelligent agents, as seen in
applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate
decision-making processes that involve deciding whether to employ a tool and
selecting the most suitable tool(s) from a collection of available tools to
fulfill user requests. Therefore, in this paper, we introduce MetaTool, a
benchmark designed to evaluate whether LLMs have tool usage awareness and can
correctly choose tools. Specifically, we create a dataset called ToolE within
the benchmark. This dataset contains various types of user queries in the form
of prompts that trigger LLMs to use tools, including both single-tool and
multi-tool scenarios. Subsequently, we set the tasks for both tool usage
awareness and tool selection. We define four subtasks from different
perspectives in tool selection, including tool selection with similar choices,
tool selection in specific scenarios, tool selection with possible reliability
issues, and multi-tool selection. We conduct experiments involving eight
popular LLMs and find that the majority of them still struggle to effectively
select tools, highlighting the existing gaps between LLMs and genuine
intelligent agents. However, through the error analysis, we found there is
still significant room for improvement. Finally, we conclude with insights for
tool developers -- we strongly recommend that tool developers choose an
appropriate rewrite model for generating new descriptions based on the
downstream LLM the tool will apply to. Our code is in
https://github.com/HowieHwong/MetaTool. [LINK]http://arxiv.org/abs/2310.03128v6 [DATE]2024-12-05 03:49:02+08:00 [CATEGORIES]cs.CL
From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents [AUTHORS]Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, Zhongyu Wei [ABSTRACT]Traditional sociological research often relies on human participation, which,
though effective, is expensive, challenging to scale, and raises ethical
concerns. Recent advancements in large language models (LLMs) highlight their
potential to simulate human behavior, enabling the replication of individual
responses and facilitating many interdisciplinary studies. In this
paper, we conduct a comprehensive survey of this field, illustrating the recent
progress in simulation driven by LLM-empowered agents. We categorize the
simulations into three types: (1) Individual Simulation, which mimics specific
individuals or demographic groups; (2) Scenario Simulation, where multiple
agents collaborate to achieve goals within specific contexts; and (3) Society
Simulation, which models interactions within agent societies to reflect the
complexity and variety of real-world dynamics. These simulations follow a
progression, ranging from detailed individual modeling to large-scale societal
phenomena. We provide a detailed discussion of each simulation type, including
the architecture or key components of the simulation, the classification of
objectives or scenarios and the evaluation method. Afterward, we summarize
commonly used datasets and benchmarks. Finally, we discuss the trends across
these three types of simulation. A repository for the related sources is at
https://github.com/FudanDISC/SocialAgent. [LINK]http://arxiv.org/abs/2412.03563v1 [DATE]2024-12-05 02:56:37+08:00 [CATEGORIES]cs.CL
DataLab: A Unified Platform for LLM-Powered Business Intelligence [AUTHORS]Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Peng Chen, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, Haonan Wang, Canshi Wei, Xiaofeng Yang, Yuhui Zhang, Yifeng Zheng, Xiuqi Huang, Minfeng Zhu, Yuxin Ma, Bin Cui, Wei Chen [ABSTRACT]Business intelligence (BI) transforms large volumes of data within modern
organizations into actionable insights for informed decision-making. Recently,
large language model (LLM)-based agents have streamlined the BI workflow by
automatically performing task planning, reasoning, and actions in executable
environments based on natural language (NL) queries. However, existing
approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS.
The fragmentation of tasks across different data roles and tools leads to
inefficiencies and potential errors due to the iterative and collaborative
nature of BI. In this paper, we introduce DataLab, a unified BI platform that
integrates a one-stop LLM-based agent framework with an augmented computational
notebook interface. DataLab supports a wide range of BI tasks for different
data roles by seamlessly combining LLM assistance with user customization
within a single environment. To achieve this unification, we design a domain
knowledge incorporation module tailored for enterprise-specific BI tasks, an
inter-agent communication mechanism to facilitate information sharing across
the BI workflow, and a cell-based context management strategy to enhance
context utilization efficiency in BI notebooks. Extensive experiments
demonstrate that DataLab achieves state-of-the-art performance on various BI
tasks across popular research benchmarks. Moreover, DataLab maintains high
effectiveness and efficiency on real-world datasets from Tencent, achieving up
to a 58.58% increase in accuracy and a 61.65% reduction in token cost on
enterprise-specific BI tasks. [LINK]http://arxiv.org/abs/2412.02205v2 [DATE]2024-12-05 00:12:08+08:00 [CATEGORIES]cs.CL
HyperMARL: Adaptive Hypernetworks for Multi-Agent RL [AUTHORS]Kale-ab Abebe Tessera, Arrasy Rahman, Stefano V. Albrecht [ABSTRACT]Balancing individual specialisation and shared behaviours is a critical
challenge in multi-agent reinforcement learning (MARL). Existing methods
typically focus on encouraging diversity or leveraging shared representations.
Full parameter sharing (FuPS) improves sample efficiency but struggles to learn
diverse behaviours when required, while no parameter sharing (NoPS) enables
diversity but is computationally expensive and sample inefficient. To address
these challenges, we introduce HyperMARL, a novel approach using hypernetworks
to balance efficiency and specialisation. HyperMARL generates agent-specific
actor and critic parameters, enabling agents to adaptively exhibit diverse or
homogeneous behaviours as needed, without modifying the learning objective or
requiring prior knowledge of the optimal diversity. Furthermore, HyperMARL
decouples agent-specific and state-based gradients, which empirically
correlates with reduced policy gradient variance, potentially offering insights
into its ability to capture diverse behaviours. Across MARL benchmarks
requiring homogeneous, heterogeneous, or mixed behaviours, HyperMARL
consistently matches or outperforms FuPS, NoPS, and diversity-focused methods,
achieving NoPS-level diversity with a shared architecture. These results
highlight the potential of hypernetworks as a versatile approach to the
trade-off between specialisation and shared behaviours in MARL. [LINK]http://arxiv.org/abs/2412.04233v1 [DATE]2024-12-05 23:09:51+08:00 [CATEGORIES]cs.LG
Towards Generalizable Autonomous Penetration Testing via Domain Randomization and Meta-Reinforcement Learning [AUTHORS]Shicheng Zhou, Jingju Liu, Yuliang Lu, Jiahai Yang, Yue Zhang, Jie Chen [ABSTRACT]With increasing numbers of vulnerabilities exposed on the internet,
autonomous penetration testing (pentesting) has become an emerging research
area, while reinforcement learning (RL) is a natural fit for studying
autonomous pentesting. Previous research in RL-based autonomous pentesting
mainly focused on enhancing agents' learning efficacy within abstract simulated
training environments. They overlooked the applicability and generalization
requirements of deploying agents' policies in real-world environments that
differ substantially from their training settings. In contrast, for the first
time, we shift focus to the pentesting agents' ability to generalize across
unseen real environments. For this purpose, we propose a Generalizable
Autonomous Pentesting framework (namely GAP) for training agents capable of
drawing inferences from one to another -- a key requirement for the broad
application of autonomous pentesting and a hallmark of human intelligence. GAP
introduces a Real-to-Sim-to-Real pipeline with two key methods: domain
randomization and meta-RL. Specifically, we are among the first to
apply domain randomization in autonomous pentesting and propose a large
language model-powered domain randomization method for synthetic environment
generation. We further apply meta-RL to improve the agents' generalization
ability in unseen environments by leveraging the synthetic environments. The
combination of these two methods can effectively bridge the generalization gap
and improve policy adaptation performance. Experiments are conducted on various
vulnerable virtual machines, with results showing that GAP can (a) enable
policy learning in unknown real environments, (b) achieve zero-shot policy
transfer in similar environments, and (c) realize rapid policy adaptation in
dissimilar environments. [COMMENTS]This work has been submitted to the IEEE for possible publication [LINK]http://arxiv.org/abs/2412.04078v1 [DATE]2024-12-05 19:24:27+08:00 [CATEGORIES]cs.LG
Learning Speed-Adaptive Walking Agent Using Imitation Learning with Physics-Informed Simulation [AUTHORS]Yi-Hung Chiu, Ung Hee Lee, Changseob Song, Manaen Hu, Inseung Kang [ABSTRACT]Virtual models of human gait, or digital twins, offer a promising solution
for studying mobility without the need for labor-intensive data collection.
However, challenges such as the sim-to-real gap and limited adaptability to
diverse walking conditions persist. To address these, we developed and
validated a framework to create a skeletal humanoid agent capable of adapting
to varying walking speeds while maintaining biomechanically realistic motions.
The framework combines a synthetic data generator, which produces
biomechanically plausible gait kinematics from open-source biomechanics data,
and a training system that uses adversarial imitation learning to train the
agent's walking policy. We conducted comprehensive analyses comparing the
agent's kinematics, synthetic data, and the original biomechanics dataset. The
agent achieved a root mean square error of 5.24 ± 0.09 degrees at varying
speeds compared to ground-truth kinematics data, demonstrating its
adaptability. This work represents a significant step toward developing a
digital twin of human locomotion, with potential applications in biomechanics
research, exoskeleton design, and rehabilitation. [COMMENTS]Currently under review [LINK]http://arxiv.org/abs/2412.03949v1 [DATE]2024-12-05 15:55:58+08:00 [CATEGORIES]cs.LG
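As a small illustration of the error metric quoted in the abstract above, the following sketch computes a root mean square error between two equal-length sequences of joint angles. The function name `rmse` and the sample values are assumptions made for this example only; the abstract does not say how the authors aggregate the error across joints, speeds, or trials.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between two equal-length numeric sequences.

    Hypothetical helper: `predicted` could hold an agent's joint angles
    (degrees) and `reference` the ground-truth angles at the same time steps.
    """
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up angles in degrees, not data from the paper.
agent_angles = [10.0, 12.0, 15.0, 11.0]
true_angles = [12.0, 14.0, 17.0, 13.0]
print(rmse(agent_angles, true_angles))
```

Every sample in this toy input is off by exactly 2 degrees, so the RMSE equals that constant offset.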
Traffic Co-Simulation Framework Empowered by Infrastructure Camera Sensing and Reinforcement Learning [AUTHORS]Talha Azfar, Ruimin Ke [ABSTRACT]Traffic simulations are commonly used to optimize traffic flow, with
reinforcement learning (RL) showing promising potential for automated traffic
signal control. Multi-agent reinforcement learning (MARL) is particularly
effective for learning control strategies for traffic lights in a network using
iterative simulations. However, existing methods often assume perfect vehicle
detection, which overlooks real-world limitations related to infrastructure
availability and sensor reliability. This study proposes a co-simulation
framework integrating CARLA and SUMO, which combines high-fidelity 3D modeling
with large-scale traffic flow simulation. Cameras mounted on traffic light
poles within the CARLA environment use a YOLO-based computer vision system to
detect and count vehicles, providing real-time traffic data as input for
adaptive signal control in SUMO. MARL agents, trained with four different
reward structures, leverage this visual feedback to optimize signal timings and
improve network-wide traffic flow. Experiments in the test-bed demonstrate the
effectiveness of the proposed MARL approach in enhancing traffic conditions
using real-time camera-based detection. The framework also evaluates the
robustness of MARL under faulty or sparse sensing and compares the performance
of YOLOv5 and YOLOv8 for vehicle detection. Results show that while better
accuracy improves performance, MARL agents can still achieve significant
improvements with imperfect detection, demonstrating adaptability for
real-world scenarios. [LINK]http://arxiv.org/abs/2412.03925v1 [DATE]2024-12-05 15:01:56+08:00 [CATEGORIES]cs.LG
Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning [AUTHORS]Yiran Wang, Chenshu Liu, Yunfan Li, Sanae Amani, Bolei Zhou, Lin F. Yang [ABSTRACT]The exploration & exploitation dilemma poses significant challenges in
reinforcement learning (RL). Recently, curiosity-based exploration methods
achieved great success in tackling hard-exploration problems. However, they
necessitate extensive hyperparameter tuning on different environments, which
heavily limits the applicability and accessibility of this line of methods. In
this paper, we characterize this problem via analysis of the agent behavior,
concluding that choosing a proper hyperparameter is fundamentally difficult. We
then identify the difficulty and the instability of the optimization when the
agent learns with curiosity. We propose our method, hyperparameter robust
exploration (Hyper), which extensively mitigates the problem by
effectively regularizing the visitation of the exploration and decoupling the
exploitation to ensure stable training. We theoretically justify that
Hyper is provably efficient under the function approximation setting and
empirically demonstrate its appealing performance and robustness in various
environments. [COMMENTS]arXiv admin note: text overlap with arXiv:1907.05388 by other authors [LINK]http://arxiv.org/abs/2412.03767v1 [DATE]2024-12-05 07:12:41+08:00 [CATEGORIES]cs.LG
PathletRL++: Optimizing Trajectory Pathlet Extraction and Dictionary Formation via Reinforcement Learning [AUTHORS]Gian Alix, Arian Haghparast, Manos Papagelis [ABSTRACT]Advances in tracking technologies have spurred the rapid growth of
large-scale trajectory data. Building a compact collection of pathlets,
referred to as a trajectory pathlet dictionary, is essential for supporting
mobility-related applications. Existing methods typically adopt a top-down
approach, generating numerous candidate pathlets and selecting a subset,
leading to high memory usage and redundant storage from overlapping pathlets.
To overcome these limitations, we propose a bottom-up strategy that
incrementally merges basic pathlets to build the dictionary, reducing memory
requirements by up to 24,000 times compared to baseline methods. The approach
begins with unit-length pathlets and iteratively merges them while optimizing
utility, which is defined using newly introduced metrics of trajectory loss and
representability. We develop a deep reinforcement learning framework,
PathletRL, which utilizes Deep Q-Networks (DQN) to approximate the utility
function, resulting in a compact and efficient pathlet dictionary. Experiments
on both synthetic and real-world datasets demonstrate that our method
outperforms state-of-the-art techniques, reducing the size of the constructed
dictionary by up to 65.8%. Additionally, our results show that only half of the
dictionary pathlets are needed to reconstruct 85% of the original trajectory
data. Building on PathletRL, we introduce PathletRL++, which extends the
original model by incorporating a richer state representation and an improved
reward function to optimize decision-making during pathlet merging. These
enhancements enable the agent to gain a more nuanced understanding of the
environment, leading to higher-quality pathlet dictionaries. PathletRL++
achieves even greater dictionary size reduction, surpassing the performance of
PathletRL, while maintaining high trajectory representability. [LINK]http://arxiv.org/abs/2412.03715v1 [DATE]2024-12-05 05:09:43+08:00 [CATEGORIES]cs.LG
Navigation World Models [AUTHORS]Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun [ABSTRACT]Navigation is a fundamental skill of agents with visual-motor capabilities.
We introduce a Navigation World Model (NWM), a controllable video generation
model that predicts future visual observations based on past observations and
navigation actions. To capture complex environment dynamics, NWM employs a
Conditional Diffusion Transformer (CDiT), trained on a diverse collection of
egocentric videos of both human and robotic agents, and scaled up to 1 billion
parameters. In familiar environments, NWM can plan navigation trajectories by
simulating them and evaluating whether they achieve the desired goal. Unlike
supervised navigation policies with fixed behavior, NWM can dynamically
incorporate constraints during planning. Experiments demonstrate its
effectiveness in planning trajectories from scratch or by ranking trajectories
sampled from an external policy. Furthermore, NWM leverages its learned visual
priors to imagine trajectories in unfamiliar environments from a single input
image, making it a flexible and powerful tool for next-generation navigation
systems. [COMMENTS]project page: https://www.amirbar.net/nwm/ [LINK]http://arxiv.org/abs/2412.03572v1 [DATE]2024-12-05 02:59:45+08:00 [CATEGORIES]cs.LG
2024 Dec 04, Wed
Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation [AUTHORS]Yi-Chang Chen, Po-Chun Hsu, Chan-Jan Hsu, Da-shan Shiu [ABSTRACT]Large language models (LLMs) have significantly advanced autonomous agents,
particularly in zero-shot tool usage, also known as function calling. This
research delves into enhancing the function-calling capabilities of LLMs by
exploring different approaches, including prompt formats for integrating
function descriptions, blending function-calling and instruction-following
data, introducing a novel Decision Token for conditional prompts, leveraging
chain-of-thought reasoning, and overcoming multilingual challenges with a
translation pipeline. Our key findings and contributions are as follows: (1)
Instruction-following data improves both function-calling accuracy and
relevance detection. (2) The use of the newly proposed Decision Token, combined
with synthetic non-function-call data, enhances relevance detection. (3) A
tailored translation pipeline effectively overcomes multilingual limitations,
demonstrating significant improvements in Traditional Chinese. These insights
highlight the potential for improved function-calling capabilities and
multilingual applications in LLMs. [LINK]http://arxiv.org/abs/2412.01130v2 [DATE]2024-12-04 11:34:42+08:00 [CATEGORIES]cs.CL
Mediating Modes of Thought: LLM's for design scripting [AUTHORS]Moritz Rietschel, Fang Guo, Kyle Steinfeld [ABSTRACT]Architects adopt visual scripting and parametric design tools to explore more
expansive design spaces (Coates, 2010), refine their thinking about the
geometric logic of their design (Woodbury, 2010), and overcome conventional
software limitations (Burry, 2011). Despite two decades of effort to make
design scripting more accessible, a disconnect between a designer's free ways
of thinking and the rigidity of algorithms remains (Burry, 2011). Recent
developments in Large Language Models (LLMs) suggest this might soon change, as
LLMs encode a general understanding of human context and exhibit the capacity
to produce geometric logic. This project speculates that if LLMs can
effectively mediate between user intent and algorithms, they become a powerful
tool to make scripting in design more widespread and fun. We explore if such
systems can interpret natural language prompts to assemble geometric operations
relevant to computational design scripting. In the system, multiple layers of
LLM agents are configured with specific context to infer the user intent and
construct a sequential logic. Given a user's high-level text prompt, a
geometric description is created, distilled into a sequence of logic
operations, and mapped to software-specific commands. The completed script is
constructed in the user's visual programming interface. The system succeeds in
generating complete visual scripts up to a certain complexity but fails beyond
this complexity threshold. It shows how LLMs can make design scripting much
more aligned with human creativity and thought. Future research should explore
conversational interactions, expand to multimodal inputs and outputs, and
assess the performance of these tools. [COMMENTS]Published at ACADIA 2024 [LINK]http://arxiv.org/abs/2411.14485v2 [DATE]2024-12-04 06:27:12+08:00 [CATEGORIES]cs.CL
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning [AUTHORS]Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong [ABSTRACT]Large language models (LLMs) have shown remarkable potential as autonomous
agents, particularly in web-based tasks. However, existing LLM web agents
heavily rely on expensive proprietary LLM APIs, while open LLMs lack the
necessary decision-making capabilities. This paper introduces WebRL, a
self-evolving online curriculum reinforcement learning framework designed to
train high-performance web agents using open LLMs. WebRL addresses three key
challenges in building LLM web agents, including the scarcity of training
tasks, sparse feedback signals, and policy distribution drift in online
learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that
generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised
reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure
consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4
models into proficient web agents. On WebArena-Lite, WebRL improves the success
rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B.
These open models significantly surpass the performance of GPT-4-Turbo (17.6%)
and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained
on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's
effectiveness in bridging the gap between open and proprietary LLM-based web
agents, paving the way for more accessible and powerful autonomous web
interaction systems. [LINK]http://arxiv.org/abs/2411.02337v2 [DATE]2024-12-04 00:37:23+08:00 [CATEGORIES]cs.CL
Risk-aware Classification via Uncertainty Quantification [AUTHORS]Murat Sensoy, Lance M. Kaplan, Simon Julier, Maryam Saleki, Federico Cerutti [ABSTRACT]Autonomous and semi-autonomous systems are using deep learning models to
improve decision-making. However, deep classifiers can be overly confident in
their incorrect predictions, a major issue especially in safety-critical
domains. The present study introduces three foundational desiderata for
developing real-world risk-aware classification systems. Expanding upon the
previously proposed Evidential Deep Learning (EDL), we demonstrate the unity
between these principles and EDL's operational attributes. We then augment EDL
empowering autonomous agents to exercise discretion during structured
decision-making when uncertainty and risks are inherent. We rigorously examine
empirical scenarios to substantiate these theoretical innovations. In contrast
to existing risk-aware classifiers, our proposed methodologies consistently
exhibit superior performance, underscoring their transformative potential in
risk-conscious classification strategies. [COMMENTS]Accepted for publication in Expert Systems with Applications [LINK]http://arxiv.org/abs/2412.03391v1 [DATE]2024-12-04 23:20:12+08:00 [CATEGORIES]cs.LG
AI-Driven Day-to-Day Route Choice [AUTHORS]Leizhen Wang, Peibo Duan, Zhengbing He, Cheng Lyu, Xin Chen, Nan Zheng, Li Yao, Zhenliang Ma [ABSTRACT]Understanding travelers' route choices can help policymakers devise optimal
operational and planning strategies for both normal and abnormal circumstances.
However, existing choice modeling methods often rely on predefined assumptions
and struggle to capture the dynamic and adaptive nature of travel behavior.
Recently, Large Language Models (LLMs) have emerged as a promising alternative,
demonstrating remarkable ability to replicate human-like behaviors across
various fields. Despite this potential, their capacity to accurately simulate
human route choice behavior in transportation contexts remains doubtful. To
satisfy this curiosity, this paper investigates the potential of LLMs for route
choice modeling by introducing an LLM-empowered agent, "LLMTraveler." This
agent integrates an LLM as its core, equipped with a memory system that learns
from past experiences and makes decisions by balancing retrieved data and
personality traits. The study systematically evaluates the LLMTraveler's
ability to replicate human-like decision-making through two stages: (1)
analyzing its route-switching behavior in single origin-destination (OD) pair
congestion game scenarios, where it demonstrates patterns that align with laboratory
data but are not fully explained by traditional models, and (2) testing its
capacity to model day-to-day (DTD) adaptive learning behaviors on the Ortuzar
and Willumsen (OW) network, producing results comparable to Multinomial Logit
(MNL) and Reinforcement Learning (RL) models. These experiments demonstrate
that the framework can partially replicate human-like decision-making in route
choice while providing natural language explanations for its decisions. This
capability offers valuable insights for transportation policymaking, such as
simulating traveler responses to new policies or changes in the network. [LINK]http://arxiv.org/abs/2412.03338v1 [DATE]2024-12-04 22:13:38+08:00 [CATEGORIES]cs.LG
Reinforcement Learning for Finite Space Mean-Field Type Games [AUTHORS]Kai Shao, Jiacheng Shen, Chijie An, Mathieu Laurière [ABSTRACT]Mean field type games (MFTGs) describe Nash equilibria between large
coalitions: each coalition consists of a continuum of cooperative agents who
maximize the average reward of their coalition while interacting
non-cooperatively with a finite number of other coalitions. Although the theory
has been extensively developed, we are still lacking efficient and scalable
computational methods. Here, we develop reinforcement learning methods for such
games in a finite space setting with general dynamics and reward functions. We
start by proving that the MFTG solution yields approximate Nash equilibria in
finite-size coalition games. We then propose two algorithms. The first is based
on quantization of mean-field spaces and Nash Q-learning. We provide
convergence and stability analysis. We then propose a deep reinforcement
learning algorithm, which can scale to larger spaces. Numerical experiments in
5 environments with mean-field distributions of dimension up to $200$ show the
scalability and efficiency of the proposed method. [LINK]http://arxiv.org/abs/2409.18152v2 [DATE]2024-12-04 20:18:17+08:00 [CATEGORIES]cs.LG
Out-of-Distribution Detection for Neurosymbolic Autonomous Cyber Agents [AUTHORS]Ankita Samaddar, Nicholas Potteiger, Xenofon Koutsoukos [ABSTRACT]Autonomous agents for cyber applications take advantage of modern defense
techniques by adopting intelligent agents with conventional and
learning-enabled components. These intelligent agents are trained via
reinforcement learning (RL) algorithms, and can learn, adapt to, reason about
and deploy security rules to defend networked computer systems while
maintaining critical operational workflows. However, the knowledge available
during training about the state of the operational network and its environment
may be limited. The agents should be trustworthy so that they can reliably
detect situations they cannot handle, and hand them over to cyber experts. In
this work, we develop an out-of-distribution (OOD) Monitoring algorithm that
uses a Probabilistic Neural Network (PNN) to detect anomalous or OOD situations
of RL-based agents with discrete states and discrete actions. To demonstrate
the effectiveness of the proposed approach, we integrate the OOD monitoring
algorithm with a neurosymbolic autonomous cyber agent that uses behavior trees
with learning-enabled components. We evaluate the proposed approach in a
simulated cyber environment under different adversarial strategies.
Experimental results over a large number of episodes illustrate the overall
efficiency of our proposed approach. [COMMENTS]9 pages, 10 figures, IEEE International Conference on AI in
Cybersecurity (ICAIC), 2025 [LINK]http://arxiv.org/abs/2412.02875v1 [DATE]2024-12-04 06:20:52+08:00 [CATEGORIES]cs.LG
An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits [AUTHORS]Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund [ABSTRACT]We study the performance of the Thompson Sampling algorithm for logistic
bandit problems, where the agent receives binary rewards with probabilities
determined by a logistic function $\exp(\beta \langle a, \theta
\rangle)/(1+\exp(\beta \langle a, \theta \rangle))$. We focus on the setting
where the action $a$ and parameter $\theta$ lie within the $d$-dimensional unit
ball with the action space encompassing the parameter space. Adopting the
information-theoretic framework introduced by (Russo & Van Roy, 2015), we
analyze the information ratio, which is defined as the ratio of the expected
squared difference between the optimal and actual rewards to the mutual
information between the optimal action and the reward. Improving upon previous
results, we establish that the information ratio is bounded by $\tfrac{9}{2}d$.
Notably, we obtain a regret bound in $O(d\sqrt{T \log(\beta T/d)})$ that
depends only logarithmically on the parameter $\beta$. [COMMENTS]14 pages, Accepted to NeurIPS 2024 Workshop on Bayesian
Decision-Making and Uncertainty [LINK]http://arxiv.org/abs/2412.02861v1 [DATE]2024-12-04 05:55:41+08:00 [CATEGORIES]cs.LG
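To make the reward model in the abstract above concrete, here is a minimal sketch of the logistic reward probability $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$ and of drawing one binary reward from it. The function names `logistic_reward_prob` and `sample_reward` and the example vectors are assumptions made for illustration; this covers only the reward model, not the Thompson Sampling algorithm or the paper's analysis.

```python
import math
import random


def logistic_reward_prob(action, theta, beta):
    """Probability of a reward of 1 under the logistic model.

    Computes exp(beta * <a, theta>) / (1 + exp(beta * <a, theta>)),
    branching on the sign of the exponent to avoid overflow.
    """
    z = beta * sum(a * t for a, t in zip(action, theta))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sample_reward(action, theta, beta, rng):
    """Draw a binary reward (0 or 1) with the logistic success probability."""
    return 1 if rng.random() < logistic_reward_prob(action, theta, beta) else 0


# Illustrative values only: both vectors lie in the 2-dimensional unit ball.
rng = random.Random(0)
theta = [0.6, 0.8]
action = [0.6, 0.8]
print(logistic_reward_prob(action, theta, beta=2.0))
print(sample_reward(action, theta, beta=2.0, rng=rng))
```

A larger $\beta$ makes the probability switch more sharply around $\langle a, \theta \rangle = 0$, which is why a regret bound that depends only logarithmically on $\beta$ is notable.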
Learning Temporal Logic Predicates from Data with Statistical Guarantees [AUTHORS]Emi Soroka, Rohan Sinha, Sanjay Lall [ABSTRACT]Temporal logic rules are often used in control and robotics to provide
structured, human-interpretable descriptions of high-dimensional trajectory
data. These rules have numerous applications including safety validation using
formal methods, constraining motion planning among autonomous agents, and
classifying data. However, existing methods for learning temporal logic
predicates from data do not provide assurances about the correctness of the
resulting predicate. We present a novel method to learn temporal logic
predicates from data with finite-sample correctness guarantees. Our approach
leverages expression optimization and conformal prediction to learn predicates
that correctly describe future trajectories under mild assumptions. We provide
experimental results showing the performance of our approach on a simulated
trajectory dataset and perform ablation studies to understand how each
component of our algorithm contributes to its performance. [LINK]http://arxiv.org/abs/2406.10449v2 [DATE]2024-12-04 03:52:27+08:00 [CATEGORIES]cs.LG
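The finite-sample guarantee mentioned in the abstract above rests on conformal prediction. The sketch below shows the generic split conformal quantile rule, not the authors' algorithm: given calibration scores assumed exchangeable with a future score, it returns a threshold that the future score stays at or below with probability at least 1 - alpha. The name `conformal_quantile` and the example scores are hypothetical.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Split conformal threshold from calibration scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)).
    When k exceeds n there are too few calibration points for the requested
    coverage, so the only valid threshold is infinity.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


# Illustrative scores, e.g. how badly each calibration trajectory violates
# a candidate predicate (made-up numbers).
scores = [0.3, 0.1, 0.2]
print(conformal_quantile(scores, alpha=0.5))
```

In a predicate-learning setting, a threshold of this kind could serve as the margin by which a learned rule is relaxed so that it still holds on unseen trajectories; how the paper combines it with expression optimization is not specified in the abstract.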
TAB-Fields: A Maximum Entropy Framework for Mission-Aware Adversarial Planning [AUTHORS]Gokul Puthumanaillam, Jae Hyuk Song, Nurzhan Yesmagambet, Shinkyu Park, Melkior Ornik [ABSTRACT]Autonomous agents operating in adversarial scenarios face a fundamental
challenge: while they may know their adversaries' high-level objectives, such
as reaching specific destinations within time constraints, the exact policies
these adversaries will employ remain unknown. Traditional approaches address
this challenge by treating the adversary's state as a partially observable
element, leading to a formulation as a Partially Observable Markov Decision
Process (POMDP). However, the induced belief-space dynamics in a POMDP require
knowledge of the system's transition dynamics, which, in this case, depend on
the adversary's unknown policy. Our key observation is that while an
adversary's exact policy is unknown, their behavior is necessarily constrained
by their mission objectives and the physical environment, allowing us to
characterize the space of possible behaviors without assuming specific
policies. In this paper, we develop Task-Aware Behavior Fields (TAB-Fields), a
representation that captures adversary state distributions over time by
computing the most unbiased probability distribution consistent with known
constraints. We construct TAB-Fields by solving a constrained optimization
problem that minimizes additional assumptions about adversary behavior beyond
mission and environmental requirements. We integrate TAB-Fields with standard
planning algorithms by introducing TAB-conditioned POMCP, an adaptation of
Partially Observable Monte Carlo Planning. Through experiments in simulation
with underwater robots and hardware implementations with ground robots, we
demonstrate that our approach achieves superior performance compared to
baselines that either assume specific adversary policies or neglect mission
constraints altogether. Evaluation videos and code are available at
https://tab-fields.github.io. [LINK]http://arxiv.org/abs/2412.02570v1 [DATE]2024-12-04 00:55:27+08:00 [CATEGORIES]cs.LG
Defending Against Diverse Attacks in Federated Learning Through Consensus-Based Bi-Level Optimization [AUTHORS]Nicolás García Trillos, Aditya Kumar Akash, Sixu Li, Konstantin Riedl, Yuhua Zhu [ABSTRACT]Adversarial attacks pose significant challenges in many machine learning
applications, particularly in the setting of distributed training and federated
learning, where malicious agents seek to corrupt the training process with the
goal of jeopardizing and compromising the performance and reliability of the
final models. In this paper, we address the problem of robust federated
learning in the presence of such attacks by formulating the training task as a
bi-level optimization problem. We conduct a theoretical analysis of the
resilience of consensus-based bi-level optimization (CB$^2$O), an interacting
multi-particle metaheuristic optimization method, in adversarial settings.
Specifically, we provide a global convergence analysis of CB$^2$O in mean-field
law in the presence of malicious agents, demonstrating the robustness of
CB$^2$O against a diverse range of attacks. Thereby, we offer insights into how
specific hyperparameter choices help mitigate adversarial effects. On the
practical side, we extend CB$^2$O to the clustered federated learning setting
by proposing FedCB$^2$O, a novel interacting multi-particle system, and design
a practical algorithm that addresses the demands of real-world applications.
Extensive experiments demonstrate the robustness of the FedCB$^2$O algorithm
against label-flipping attacks in decentralized clustered federated learning
scenarios, showcasing its effectiveness in practical contexts. [LINK]http://arxiv.org/abs/2412.02535v1 [DATE]2024-12-04 00:26:56+08:00 [CATEGORIES]cs.LG
Introduction to Reinforcement Learning [AUTHORS]Majid Ghasemi, Dariush Ebrahimi [ABSTRACT]Reinforcement Learning (RL), a subfield of Artificial Intelligence (AI),
focuses on training agents to make decisions by interacting with their
environment to maximize cumulative rewards. This paper provides an overview of
RL, covering its core concepts, methodologies, and resources for further
learning. It offers a thorough explanation of fundamental components such as
states, actions, policies, and reward signals, ensuring readers develop a solid
foundational understanding. Additionally, the paper presents a variety of RL
algorithms, categorized by key factors such as whether they are model-free,
model-based, value-based, or policy-based. Resources for
learning and implementing RL, such as books, courses, and online communities,
are also provided. By offering a clear, structured introduction, this paper
aims to simplify the complexities of RL for beginners, providing a
straightforward pathway to understanding. [COMMENTS]19 pages [LINK]http://arxiv.org/abs/2408.07712v3 [DATE]2024-12-04 00:17:32+08:00 [CATEGORIES]cs.LG
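To make the value-based, model-free category mentioned above concrete, here is a minimal sketch of the standard tabular Q-learning update. The abstract does not single out this algorithm; the function name, the dictionary-based table, and the default `alpha` and `gamma` values are assumptions chosen for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    q maps (state, action) pairs to estimated returns; missing pairs
    are treated as 0.0.
    """
    # Greedy estimate of the value of the next state.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    td_target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    # Move the old estimate a fraction alpha toward the TD target.
    q[(state, action)] = old + alpha * (td_target - old)
    return q[(state, action)]


# Hypothetical two-action problem (assumed for illustration only).
q_table = {("s1", "right"): 2.0}
new_value = q_learning_update(q_table, "s0", "right", 1.0, "s1",
                              ["left", "right"])
```

The update needs no transition model, which is what makes it model-free, and it learns action values rather than a policy directly, which is what makes it value-based.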
2024 Dec 03, Tue
Large Multimodal Agents for Accurate Phishing Detection with Enhanced Token Optimization and Cost Reduction [AUTHORS]Fouad Trad, Ali Chehab [ABSTRACT]With the rise of sophisticated phishing attacks, there is a growing need for
effective and economical detection solutions. This paper explores the use of
large multimodal agents, specifically Gemini 1.5 Flash and GPT-4o mini, to
analyze both URLs and webpage screenshots via APIs, thus avoiding the
complexities of training and maintaining AI systems. Our findings indicate that
integrating these two data types substantially enhances detection performance
over using either type alone. However, API usage incurs costs per query that
depend on the number of input and output tokens. To address this, we propose a
two-tiered agentic approach: initially, one agent assesses the URL, and if
inconclusive, a second agent evaluates both the URL and the screenshot. This
method not only maintains robust detection performance but also significantly
reduces API costs by minimizing unnecessary multi-input queries. Cost analysis
shows that with the agentic approach, GPT-4o mini can process about 4.2 times
as many websites per $100 compared to the multimodal approach (107,440 vs.
25,626), and Gemini 1.5 Flash can process about 2.6 times more websites
(2,232,142 vs. 862,068). These findings underscore the significant economic
benefits of the agentic approach over the multimodal method, providing a viable
solution for organizations aiming to leverage advanced AI for phishing
detection while controlling expenses. [COMMENTS]Accepted in the 2nd International Conference on Foundation and Large
Language Models (FLLM2024) [LINK]http://arxiv.org/abs/2412.02301v1 [DATE]2024-12-03 17:13:52+08:00 [CATEGORIES]cs.CL
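The two-tiered flow and the throughput figures quoted above can be sketched as follows. The agent callables, the `"inconclusive"` label, and the function names are hypothetical stand-ins, not the authors' implementation or any real API; only the website counts come from the abstract.

```python
def two_tier_classify(url, screenshot, url_agent, multimodal_agent):
    """Return (verdict, tier) using a URL-only agent first.

    The second, more expensive agent is queried only when the first
    agent answers "inconclusive".
    """
    verdict = url_agent(url)
    if verdict != "inconclusive":
        return verdict, 1
    return multimodal_agent(url, screenshot), 2


def throughput_ratio(agentic_sites, multimodal_sites):
    """Ratio of websites processed per fixed budget by the two approaches."""
    return agentic_sites / multimodal_sites


# Websites per $100 as reported in the abstract.
gpt4o_mini_ratio = throughput_ratio(107_440, 25_626)
gemini_flash_ratio = throughput_ratio(2_232_142, 862_068)
```

Rounded to one decimal place, the two ratios match the roughly 4.2x and 2.6x figures stated in the abstract.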
AutoGuide: Automated Generation and Selection of Context-Aware Guidelines for Large Language Model Agents [AUTHORS]Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee [ABSTRACT]Recent advances in large language models (LLMs) have empowered AI agents
capable of performing various sequential decision-making tasks. However,
effectively guiding LLMs to perform well in unfamiliar domains like web
navigation, where they lack sufficient knowledge, has proven to be difficult
with the demonstration-based in-context learning paradigm. In this paper, we
introduce a novel framework, called AutoGuide, which addresses this limitation
by automatically generating context-aware guidelines from offline experiences.
Importantly, each context-aware guideline is expressed in concise natural
language and follows a conditional structure, clearly describing the context
where it is applicable. As a result, our guidelines facilitate the provision of
relevant knowledge for the agent's current decision-making process, overcoming
the limitations of the conventional demonstration-based learning paradigm. Our
evaluation demonstrates that AutoGuide significantly outperforms competitive
baselines in complex benchmark domains, including real-world web navigation. [LINK]http://arxiv.org/abs/2403.08978v2 [DATE]2024-12-03 15:36:47+08:00 [CATEGORIES]cs.CL cs.LG
Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models [AUTHORS]Cong Lu, Shengran Hu, Jeff Clune [ABSTRACT]Go-Explore is a powerful family of algorithms designed to solve
hard-exploration problems built on the principle of archiving discovered
states, and iteratively returning to and exploring from the most promising
states. This approach has led to superhuman performance across a wide variety
of challenging problems including Atari games and robotic control, but requires
manually designing heuristics to guide exploration (i.e., determine which
states to save and explore from, and what actions to consider next), which is
time-consuming and infeasible in general. To resolve this, we propose
Intelligent Go-Explore (IGE) which greatly extends the scope of the original
Go-Explore by replacing these handcrafted heuristics with the intelligence and
internalized human notions of interestingness captured by giant pretrained
foundation models (FMs). This provides IGE with a human-like ability to
instinctively identify how interesting or promising any new state is (e.g.,
discovering new objects, locations, or behaviors), even in complex environments
where heuristics are hard to define. Moreover, IGE offers the exciting
opportunity to recognize and capitalize on serendipitous discoveries: states
encountered during exploration that are valuable in terms of exploration, yet
where what makes them interesting was not anticipated by the human user. We
evaluate our algorithm on a diverse range of language and vision-based tasks
that require search and exploration. Across these tasks, IGE strongly exceeds
classic reinforcement learning and graph search baselines, and also succeeds
where prior state-of-the-art FM agents like Reflexion completely fail. Overall,
Intelligent Go-Explore combines the tremendous strengths of FMs and the
powerful Go-Explore algorithm, opening up a new frontier of research into
creating more generally capable agents with impressive exploration
capabilities. [LINK]http://arxiv.org/abs/2405.15143v3 [DATE]2024-12-03 14:43:39+08:00 [CATEGORIES]cs.LG cs.CL
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance [AUTHORS]Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun [ABSTRACT]Agents powered by large language models have shown remarkable abilities in
solving complex tasks. However, most agent systems remain reactive, limiting
their effectiveness in scenarios requiring foresight and autonomous
decision-making. In this paper, we tackle the challenge of developing proactive
agents capable of anticipating and initiating tasks without explicit human
instructions. We propose a novel data-driven approach for this problem.
Firstly, we collect real-world human activities to generate proactive task
predictions. These predictions are then labeled by human annotators as either
accepted or rejected. The labeled data is used to train a reward model that
simulates human judgment and serves as an automatic evaluator of the
proactiveness of LLM agents. Building on this, we develop a comprehensive data
generation pipeline to create a diverse dataset, ProactiveBench, containing
6,790 events. Finally, we demonstrate that fine-tuning models with the proposed
ProactiveBench can significantly elicit the proactiveness of LLM agents.
Experimental results show that our fine-tuned model achieves an F1-Score of
66.47% in proactively offering assistance, outperforming all open-source and
closed-source models. These results highlight the potential of our method in
creating more proactive and effective agent systems, paving the way for future
advancements in human-agent collaboration. [COMMENTS]9 pages, 4 figures [LINK]http://arxiv.org/abs/2410.12361v3 [DATE]2024-12-03 12:34:09+08:00 [CATEGORIES]cs.CL
Large Language Model-Brained GUI Agents: A Survey [AUTHORS]Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang [ABSTRACT]GUIs have long been central to human-computer interaction, providing an
intuitive and visually-driven way to access and interact with digital systems.
The advent of LLMs, particularly multimodal models, has ushered in a new era of
GUI automation. They have demonstrated exceptional capabilities in natural
language understanding, code generation, and visual processing. This has paved
the way for a new generation of LLM-brained GUI agents capable of interpreting
complex GUI elements and autonomously executing actions based on natural
language instructions. These agents represent a paradigm shift, enabling users
to perform intricate, multi-step tasks through simple conversational commands.
Their applications span across web navigation, mobile app interactions, and
desktop automation, offering a transformative user experience that
revolutionizes how individuals interact with software. This emerging field is
rapidly advancing, with significant progress in both research and industry.
To provide a structured understanding of this trend, this paper presents a
comprehensive survey of LLM-brained GUI agents, exploring their historical
evolution, core components, and advanced techniques. We address research
questions such as existing GUI agent frameworks, the collection and utilization
of data for training specialized GUI agents, the development of large action
models tailored for GUI tasks, and the evaluation metrics and benchmarks
necessary to assess their effectiveness. Additionally, we examine emerging
applications powered by these agents. Through a detailed analysis, this survey
identifies key research gaps and outlines a roadmap for future advancements in
the field. By consolidating foundational knowledge and state-of-the-art
developments, this work aims to guide both researchers and practitioners in
overcoming challenges and unlocking the full potential of LLM-brained GUI
agents. [COMMENTS]The collection of papers reviewed in this survey will be hosted and
regularly updated on the GitHub repository:
https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a
searchable webpage is available at https://aka.ms/gui-agent for easier access
and exploration [LINK]http://arxiv.org/abs/2411.18279v3 [DATE]2024-12-03 11:16:27+08:00 [CATEGORIES]cs.CL
MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications [AUTHORS]Vishnou Vinayagame, Gregory Senay, Luis Martí [ABSTRACT]Mathematical reasoning capabilities are increasing with tool-augmented
language agents, but methods often rely either on closed-source or large
models, external data, or extensive prompt engineering. This work introduces
MATATA, a novel cost-effective method to train LLM agents for tabular data
problems through reasoning, planning, and tool use. With a progressive
self-improvement paradigm and an iterative weak supervision, it empowers
3.8B/8B Small Language Models (SLMs), particularly suited for local hosting and
sensitive business contexts where data privacy is crucial. By employing
flexible and reusable tools across different datasets, it achieves robust
performance with effective scalability across shared tasks. Experiments show
that MATATA reaches state-of-the-art performances on FinQA and TAT-QA among
reasoning frameworks based on open-source models. Moreover, MATATA models
compete with GPT-4 based frameworks on TabMWP, while being SLMs. [LINK]http://arxiv.org/abs/2411.18915v2 [DATE]2024-12-03 05:08:00+08:00 [CATEGORIES]cs.LG cs.CL
Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous Autonomous Surface Vehicles with Deep Reinforcement Learning [AUTHORS]Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín [ABSTRACT]This paper presents a model-free deep reinforcement learning framework for
informative path planning with heterogeneous fleets of autonomous surface
vehicles to locate and collect plastic waste. The system employs two teams of
vehicles: scouts and cleaners. Coordination between these teams is achieved
through a deep reinforcement approach, allowing agents to learn strategies to
maximize cleaning efficiency. The primary objective is for the scout team to
provide an up-to-date contamination model, while the cleaner team collects as
much waste as possible following this model. This strategy leads to
heterogeneous teams that optimize fleet efficiency through inter-team
cooperation supported by a tailored reward function. Different trainings of the
proposed algorithm are compared with other state-of-the-art heuristics in two
distinct scenarios, one with high convexity and another with narrow corridors
and challenging access. According to the obtained results, it is demonstrated
that deep reinforcement learning based algorithms outperform other benchmark
heuristics, exhibiting superior adaptability. In addition, training with greedy
actions further enhances performance, particularly in scenarios with intricate
layouts. [COMMENTS]This article is currently under revision for the Robotics and
Automation Letters (IEEE) [LINK]http://arxiv.org/abs/2412.02316v1 [DATE]2024-12-03 17:32:02+08:00 [CATEGORIES]cs.LG
Conformal Symplectic Optimization for Stable Reinforcement Learning [AUTHORS]Yao Lyu, Xiangteng Zhang, Shengbo Eben Li, Jingliang Duan, Letian Tao, Qing Xu, Lei He, Keqiang Li [ABSTRACT]Training deep reinforcement learning (RL) agents necessitates overcoming the
highly unstable nonconvex stochastic optimization inherent in the
trial-and-error mechanism. To tackle this challenge, we propose a
physics-inspired optimization algorithm called relativistic adaptive gradient
descent (RAD), which enhances long-term training stability. By conceptualizing
neural network (NN) training as the evolution of a conformal Hamiltonian
system, we present a universal framework for transferring long-term stability
from conformal symplectic integrators to iterative NN updating rules, where the
choice of kinetic energy governs the dynamical properties of resulting
optimization algorithms. By utilizing relativistic kinetic energy, RAD
incorporates principles from special relativity and limits parameter updates
below a finite speed, effectively mitigating abnormal gradient influences.
Additionally, RAD models NN optimization as the evolution of a multi-particle
system where each trainable parameter acts as an independent particle with an
individual adaptive learning rate. We prove RAD's sublinear convergence under
general nonconvex settings, where smaller gradient variance and larger batch
sizes contribute to tighter convergence. Notably, RAD degrades to the
well-known adaptive moment estimation (ADAM) algorithm when its speed
coefficient is chosen as one and symplectic factor as a small positive value.
Experimental results show RAD outperforming nine baseline optimizers with five
RL algorithms across twelve environments, including standard benchmarks and
challenging scenarios. Notably, RAD achieves up to a 155.1% performance
improvement over ADAM in Atari games, showcasing its efficacy in stabilizing
and accelerating RL training. [LINK]http://arxiv.org/abs/2412.02291v1 [DATE]2024-12-03 17:07:31+08:00 [CATEGORIES]cs.LG
Feudal Graph Reinforcement Learning [AUTHORS]Tommaso Marzi, Arshjot Khehra, Andrea Cini, Cesare Alippi [ABSTRACT]Graph-based representations and message-passing modular policies constitute
prominent approaches to tackling composable control problems in reinforcement
learning (RL). However, as shown by recent graph deep learning literature, such
local message-passing operators can create information bottlenecks and hinder
global coordination. The issue becomes more serious in tasks requiring
high-level planning. In this work, we propose a novel methodology, named Feudal
Graph Reinforcement Learning (FGRL), that addresses such challenges by relying
on hierarchical RL and a pyramidal message-passing architecture. In particular,
FGRL defines a hierarchy of policies where high-level commands are propagated
from the top of the hierarchy down through a layered graph structure. The
bottom layers mimic the morphology of the physical system, while the upper
layers correspond to higher-order sub-modules. The resulting agents are then
characterized by a committee of policies where actions at a certain level set
goals for the level below, thus implementing a hierarchical decision-making
structure that can naturally implement task decomposition. We evaluate the
proposed framework on a graph clustering problem and MuJoCo locomotion tasks;
simulation results show that FGRL compares favorably against relevant
baselines. Furthermore, an in-depth analysis of the command propagation
mechanism provides evidence that the introduced message-passing scheme favors
learning hierarchical decision-making policies. [LINK]http://arxiv.org/abs/2304.05099v6 [DATE]2024-12-03 16:58:22+08:00 [CATEGORIES]cs.LG
BOTracle: A framework for Discriminating Bots and Humans [AUTHORS]Jan Kadel, August See, Ritwik Sinha, Mathias Fischer [ABSTRACT]Bots constitute a significant portion of Internet traffic and are a source of
various issues across multiple domains. Modern bots often become
indistinguishable from real users, as they employ similar methods to browse the
web, including using real browsers. We address the challenge of bot detection
in high-traffic scenarios by analyzing three distinct detection methods. The
first method operates on heuristics, allowing for rapid detection. The second
method utilizes well-known technical features, such as IP address, window
size, and user agent. It serves primarily for comparison with the third method.
In the third method, we rely solely on browsing behavior, omitting all static
features and focusing exclusively on how clients behave on a website. In
contrast to related work, we evaluate our approaches using real-world
e-commerce traffic data, comprising 40 million monthly page visits. We further
compare our methods against another bot detection approach, Botcha, on the same
dataset. Our performance metrics, including precision, recall, and AUC, reach
98 percent or higher, surpassing Botcha. [COMMENTS]Bot Detection; User Behaviour Analysis; Published at ESORICS
International Workshops 2024 [LINK]http://arxiv.org/abs/2412.02266v1 [DATE]2024-12-03 16:38:30+08:00 [CATEGORIES]cs.LG
Selective Reviews of Bandit Problems in AI via a Statistical View [AUTHORS]Pengjie Zhou, Haoyu Wei, Huiming Zhang [ABSTRACT]Reinforcement Learning (RL) is a widely researched area in artificial
intelligence that focuses on teaching agents decision-making through
interactions with their environment. A key subset includes stochastic
multi-armed bandit (MAB) and continuum-armed bandit (SCAB) problems, which
model sequential decision-making under uncertainty. This review outlines the
foundational models and assumptions of bandit problems, explores non-asymptotic
theoretical tools like concentration inequalities and minimax regret bounds,
and compares frequentist and Bayesian algorithms for managing
exploration-exploitation trade-offs. We also extend the discussion to $K$-armed
contextual bandits and SCAB, examining their methodologies, regret analyses,
and discussing the relation between the SCAB problems and the functional data
analysis. Finally, we highlight recent advances and ongoing challenges in the
field. [COMMENTS]46 pages, 5 figures [LINK]http://arxiv.org/abs/2412.02251v1 [DATE]2024-12-03 16:28:47+08:00 [CATEGORIES]cs.LG
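As one concrete instance of a frequentist strategy for the exploration-exploitation trade-off discussed above, the sketch below implements the classic UCB1 arm-selection rule for a stochastic multi-armed bandit. The abstract does not name UCB1 specifically; the function name and the example statistics are assumptions for illustration.

```python
import math


def ucb1_select(counts, means, t):
    """Pick an arm index by the UCB1 rule at round t (t >= 1).

    counts[i] is how often arm i has been pulled and means[i] is its
    empirical mean reward. Any arm not yet pulled is tried first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Empirical mean plus a confidence bonus that shrinks with more pulls.
    scores = [m + math.sqrt(2 * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical statistics after 101 rounds (assumed for illustration only).
chosen = ucb1_select(counts=[1, 100], means=[0.5, 0.6], t=101)
```

In the usage line, the rarely pulled arm is selected despite its lower empirical mean, because its confidence bonus is much larger; this is the exploration side of the trade-off.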
FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL [AUTHORS]Woosung Koh, Wonbeen Oh, Siyeol Kim, Suhin Shin, Hyeongjin Kim, Jaein Jang, Junghyun Lee, Se-Young Yun [ABSTRACT]Multi-agent reinforcement learning has demonstrated significant potential in
addressing complex cooperative tasks across various real-world applications.
However, existing MARL approaches often rely on the restrictive assumption that
the number of entities (e.g., agents, obstacles) remains constant between
training and inference. This overlooks scenarios where entities are dynamically
removed or added during the inference trajectory -- a common occurrence in
real-world environments like search and rescue missions and dynamic combat
situations. In this paper, we tackle the challenge of intra-trajectory dynamic
entity composition under zero-shot out-of-domain (OOD) generalization, where
such dynamic changes cannot be anticipated beforehand. Our empirical studies
reveal that existing MARL methods suffer significant performance degradation
and increased uncertainty in these scenarios. In response, we propose
FlickerFusion, a novel OOD generalization method that acts as a universally
applicable augmentation technique for MARL backbone methods. FlickerFusion
stochastically drops out parts of the observation space, emulating being
in-domain when inferenced OOD. The results show that FlickerFusion not only
achieves superior inference rewards but also uniquely reduces uncertainty
vis-à-vis the backbone, compared to existing methods. Benchmarks,
implementations, and model weights are organized and open-sourced at
flickerfusion305.github.io, accompanied by ample demo video renderings. [COMMENTS]NeurIPS '24 Open-World Agents Workshop [LINK]http://arxiv.org/abs/2410.15876v3 [DATE]2024-12-03 13:59:09+08:00 [CATEGORIES]cs.LG
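The idea of stochastically dropping parts of the observation can be sketched as below. This is a simplified stand-in using independent Bernoulli masking per entity; the actual dropout scheme in FlickerFusion may differ, and the function name, `keep_prob` parameter, and entity labels are hypothetical.

```python
import random


def drop_entities(entity_obs, keep_prob, rng):
    """Independently keep each observed entity with probability keep_prob.

    entity_obs is a list of per-entity observations; the returned list
    preserves the original order of the entities that survive.
    """
    return [obs for obs in entity_obs if rng.random() < keep_prob]


# Hypothetical entity observations (assumed for illustration only).
observed = ["agent_1", "agent_2", "obstacle_1", "obstacle_2"]
augmented = drop_entities(observed, keep_prob=0.5, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.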
Optimizing Latent Goal by Learning from Trajectory Preference [AUTHORS]Guangyu Zhao, Kewei Lian, Haowei Lin, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang [ABSTRACT]A growing body of work has emerged focusing on instruction-following policies
for open-world agents, aiming to better align the agent's behavior with human
intentions. However, the performance of these policies is highly susceptible to
the initial prompt, which leads to extra efforts in selecting the best
instructions. We propose a framework named Preference Goal Tuning (PGT). PGT
allows an instruction-following policy to interact with the environment to
collect several trajectories, which will be categorized into positive and
negative samples based on preference. Then we use preference learning to
fine-tune the initial goal latent representation with the categorized
trajectories while keeping the policy backbone frozen. The experimental results
show that with minimal data and training, PGT achieves an average relative
improvement of 72.0% and 81.6% over 17 tasks in 2 different foundation policies
respectively, and outperforms the best human-selected instructions. Moreover,
PGT surpasses full fine-tuning in the out-of-distribution (OOD) task-execution
environments by 13.4%, indicating that our approach retains strong
generalization capabilities. Since our approach stores a single latent
representation for each task independently, it can be viewed as an efficient
method for continual learning, without the risk of catastrophic forgetting or
task interference. In short, PGT enhances the performance of agents across
nearly all tasks in the Minecraft Skillforge benchmark and demonstrates
robustness to the execution environment. [LINK]http://arxiv.org/abs/2412.02125v1 [DATE]2024-12-03 11:27:48+08:00 [CATEGORIES]cs.LG
The Problem of Social Cost in Multi-Agent General Reinforcement Learning: Survey and Synthesis [AUTHORS]Kee Siong Ng, Samuel Yang-Zhao, Timothy Cadogan-Cowper [ABSTRACT]The AI safety literature is full of examples of powerful AI agents that, in
blindly pursuing a specific and usually narrow objective, end up with
unacceptable and even catastrophic collateral damage to others. In this paper,
we consider the problem of social harms that can result from actions taken by
learning and utility-maximising agents in a multi-agent environment. The
problem of measuring social harms or impacts in such multi-agent settings,
especially when the agents are artificial generally intelligent (AGI) agents,
was listed as an open problem in Everitt et al., 2018. We attempt a partial
answer to that open problem in the form of market-based mechanisms to quantify
and control the cost of such social harms. The proposed setup captures many
well-studied special cases and is more general than existing formulations of
multi-agent reinforcement learning with mechanism design in two ways: (i) the
underlying environment is a history-based general reinforcement learning
environment like in AIXI; (ii) the reinforcement-learning agents participating
in the environment can have different learning strategies and planning
horizons. To demonstrate the practicality of the proposed setup, we survey some
key classes of learning algorithms and present a few applications, including a
discussion of the Paperclips problem and pollution control with a cap-and-trade
system. [COMMENTS]49 pages [LINK]http://arxiv.org/abs/2412.02091v1 [DATE]2024-12-03 10:22:55+08:00 [CATEGORIES]cs.LG
Comparative Analysis of Multi-Agent Reinforcement Learning Policies for Crop Planning Decision Support [AUTHORS]Anubha Mahajan, Shreya Hegde, Ethan Shay, Daniel Wu, Aviva Prins [ABSTRACT]In India, the majority of farmers are classified as small or marginal, making
their livelihoods particularly vulnerable to economic losses due to market
saturation and climate risks. Effective crop planning can significantly impact
their expected income, yet existing decision support systems (DSS) often
provide generic recommendations that fail to account for real-time market
dynamics and the interactions among multiple farmers. In this paper, we
evaluate the viability of three multi-agent reinforcement learning (MARL)
approaches for optimizing total farmer income and promoting fairness in crop
planning: Independent Q-Learning (IQL), where each farmer acts independently
without coordination, Agent-by-Agent (ABA), which sequentially optimizes each
farmer's policy in relation to the others, and the Multi-agent Rollout Policy,
which jointly optimizes all farmers' actions for global reward maximization.
Our results demonstrate that while IQL offers computational efficiency with
linear runtime, it struggles with coordination among agents, leading to lower
total rewards and an unequal distribution of income. Conversely, the
Multi-agent Rollout policy achieves the highest total rewards and promotes
equitable income distribution among farmers but requires significantly more
computational resources, making it less practical for large numbers of agents.
ABA strikes a balance between runtime efficiency and reward optimization,
offering reasonable total rewards with acceptable fairness and scalability.
These findings highlight the importance of selecting appropriate MARL
approaches in DSS to provide personalized and equitable crop planning
recommendations, advancing the development of more adaptive and farmer-centric
agricultural decision-making systems. [LINK]http://arxiv.org/abs/2412.02057v1 [DATE]2024-12-03 08:30:19+08:00 [CATEGORIES]cs.LG
Explore Reinforced: Equilibrium Approximation with Reinforcement Learning [AUTHORS]Ryan Yu, Mateusz Nowak, Qintong Xie, Michelle Yilin Feng, Peter Chin [ABSTRACT]Current approximate Coarse Correlated Equilibria (CCE) algorithms struggle
with equilibrium approximation for games in large stochastic environments but
are theoretically guaranteed to converge to a strong solution concept. In
contrast, modern Reinforcement Learning (RL) algorithms provide faster training
yet yield weaker solutions. We introduce Exp3-IXrl - a blend of RL and
game-theoretic approaches, separating the RL agent's action selection from the
equilibrium computation while preserving the integrity of the learning process.
We demonstrate that our algorithm expands the application of equilibrium
approximation algorithms to new environments. Specifically, we show the
improved performance in a complex and adversarial cybersecurity network
environment - the Cyber Operations Research Gym - and in the classical
multi-armed bandit settings. [LINK]http://arxiv.org/abs/2412.02016v1 [DATE]2024-12-03 06:37:59+08:00 [CATEGORIES]cs.LG
Who's Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation [AUTHORS]Trenton Chang, Lindsay Warrenburg, Sae-Hwan Park, Ravi B. Parikh, Maggie Makar, Jenna Wiens [ABSTRACT]In many settings, machine learning models may be used to inform decisions
that impact individuals or entities who interact with the model. Such entities,
or agents, may game model decisions by manipulating their inputs to the model
to obtain better outcomes and maximize some utility. We consider a multi-agent
setting where the goal is to identify the "worst offenders:" agents that are
gaming most aggressively. However, identifying such agents is difficult without
knowledge of their utility function. Thus, we introduce a framework in which
each agent's tendency to game is parameterized via a scalar. We show that this
gaming parameter is only partially identifiable. By recasting the problem as a
causal effect estimation problem where different agents represent different
"treatments," we prove that a ranking of all agents by their gaming parameters
is identifiable. We present empirical results in a synthetic data study
validating the usage of causal effect estimation for gaming detection and show
in a case study of diagnosis coding behavior in the U.S. that our approach
highlights features associated with gaming. [COMMENTS]38 pages, 31 figures. NeurIPS 2024 [LINK]http://arxiv.org/abs/2412.02000v1 [DATE]2024-12-03 06:07:48+08:00 [CATEGORIES]cs.LG
Generalized EXTRA stochastic gradient Langevin dynamics [AUTHORS]Mert Gurbuzbalaban, Mohammad Rafiqul Islam, Xiaoyu Wang, Lingjiong Zhu [ABSTRACT]Langevin algorithms are popular Markov Chain Monte Carlo methods for Bayesian
learning, particularly when the aim is to sample from the posterior
distribution of a parametric model, given the input data and the prior
distribution over the model parameters. Their stochastic versions such as
stochastic gradient Langevin dynamics (SGLD) allow iterative learning based on
randomly sampled mini-batches of large datasets and are scalable to large
datasets. However, when data is decentralized across a network of agents
subject to communication and privacy constraints, standard SGLD algorithms
cannot be applied. Instead, we employ decentralized SGLD (DE-SGLD) algorithms,
where Bayesian learning is performed collaboratively by a network of agents
without sharing individual data. Nonetheless, existing DE-SGLD algorithms
induce a bias at every agent that can negatively impact performance; this bias
persists even when using full batches and is attributable to network effects.
Motivated by the EXTRA algorithm and its generalizations for decentralized
optimization, we propose the generalized EXTRA stochastic gradient Langevin
dynamics, which eliminates this bias in the full-batch setting. Moreover, we
show that, in the mini-batch setting, our algorithm provides performance bounds
that significantly improve upon those of standard DE-SGLD algorithms in the
literature. Our numerical results also demonstrate the efficiency of the
proposed approach. [LINK]http://arxiv.org/abs/2412.01993v1 [DATE]2024-12-03 05:57:30+08:00 [CATEGORIES]cs.LG
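For orientation, the following is a minimal sketch of plain, single-machine SGLD, the baseline that the abstract above starts from. It is not DE-SGLD and not the proposed generalized EXTRA variant. The toy model (Gaussian likelihood with unit variance and a Gaussian prior on the mean), the function name, and all default parameters are assumptions chosen for illustration.

```python
import math
import random


def sgld_gaussian_mean(data, prior_var=100.0, step=1e-3, batch=10,
                       n_steps=5000, burn_in=1000, seed=0):
    """Sample the posterior of theta for x_i ~ N(theta, 1), theta ~ N(0, prior_var)."""
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for k in range(n_steps):
        minibatch = rng.sample(data, batch)
        # Unbiased mini-batch estimate of the gradient of the negative
        # log-posterior: rescaled likelihood term plus the prior term.
        grad = (n / batch) * sum(theta - x for x in minibatch) + theta / prior_var
        # Langevin step: gradient descent plus Gaussian noise of variance 2 * step.
        theta = theta - step * grad + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(theta)
    return samples
```

Averaging the returned samples should approximate the exact posterior mean, `sum(data) / (len(data) + 1 / prior_var)`, which makes this toy case easy to check.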
A Multi-Agent Reinforcement Learning Testbed for Cognitive Radio Applications [AUTHORS]Sriniketh Vangaru, Daniel Rosen, Dylan Green, Raphael Rodriguez, Maxwell Wiecek, Amos Johnson, Alyse M. Jones, William C. Headley [ABSTRACT]Technological trends show that Radio Frequency Reinforcement Learning (RFRL)
will play a prominent role in the wireless communication systems of the future.
Applications of RFRL range from military communications jamming to enhancing
WiFi networks. Before deploying algorithms for these purposes, they must be
trained in a simulation environment to ensure adequate performance. For this
reason, we previously created the RFRL Gym: a standardized, accessible tool for
the development and testing of reinforcement learning (RL) algorithms in the
wireless communications space. This environment leveraged the OpenAI Gym
framework and featured customizable simulation scenarios within the RF
spectrum. However, the RFRL Gym was limited to training a single RL agent per
simulation; this is not ideal, as most real-world RF scenarios will contain
multiple intelligent agents in cooperative, competitive, or mixed settings,
which is a natural consequence of spectrum congestion. Therefore, through
integration with Ray RLlib, multi-agent reinforcement learning (MARL)
functionality for training and assessment has been added to the RFRL Gym,
making it even more of a robust tool for RF spectrum simulation. This paper
provides an overview of the updated RFRL Gym environment. In this work, the
general framework of the tool is described relative to comparable existing
resources, highlighting the significant additions and refactoring we have
applied to the Gym. Afterward, results from testing various RF scenarios in the
MARL environment and future additions are discussed. [COMMENTS]Accepted to IEEE CCNC 2025. Added revisions from paper reviews [LINK]http://arxiv.org/abs/2410.21521v2 [DATE]2024-12-03 03:49:59+08:00 [CATEGORIES]cs.LG
MALT: Improving Reasoning with Multi-Agent LLM Training [AUTHORS]Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Markian Rybchuk, Philip H. S. Torr, Ivan Laptev, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt [ABSTRACT]Enabling effective collaboration among LLMs is a crucial step toward
developing autonomous systems capable of solving complex problems. While LLMs
are typically used as single-model generators, where humans critique and refine
their outputs, the potential for jointly-trained collaborative models remains
largely unexplored. Despite promising results in multi-agent communication and
debate settings, little progress has been made in training models to work
together on tasks. In this paper, we present a first step toward "Multi-agent
LLM training" (MALT) on reasoning problems. Our approach employs a sequential
multi-agent setup with heterogeneous LLMs assigned specialized roles: a
generator, verifier, and refinement model iteratively solving problems. We
propose a trajectory-expansion-based synthetic data generation process and a
credit assignment strategy driven by joint outcome-based rewards. This enables
our post-training setup to utilize both positive and negative trajectories to
autonomously improve each model's specialized capabilities as part of a joint
sequential system. We evaluate our approach across MATH, GSM8k, and CQA, where
MALT on Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%,
and 9.40% respectively over the same baseline model. This demonstrates an early
advance in multi-agent cooperative capabilities for performance on mathematical
and common sense reasoning questions. More generally, our work provides a
concrete direction for research around multi-agent LLM training approaches. [COMMENTS]Preliminary work [LINK]http://arxiv.org/abs/2412.01928v1 [DATE]2024-12-03 03:30:36+08:00 [CATEGORIES]cs.LG
CREW: Facilitating Human-AI Teaming Research [AUTHORS]Lingyu Zhang, Zhengran Ji, Boyuan Chen [ABSTRACT]With the increasing deployment of artificial intelligence (AI) technologies,
the potential of humans working with AI agents has been growing rapidly.
Human-AI teaming is an important paradigm for studying various aspects
when humans and AI agents work together. The unique aspect of Human-AI teaming
research is the need to jointly study humans and AI agents, demanding
multidisciplinary research efforts from machine learning to human-computer
interaction, robotics, cognitive science, neuroscience, psychology, social
science, and complex systems. However, existing platforms for Human-AI teaming
research are limited, often supporting oversimplified scenarios and a single
task, or specifically focusing on either human-teaming research or multi-agent
AI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming
research in real-time decision-making scenarios and engage collaborations from
multiple scientific disciplines, with a strong emphasis on human involvement.
It includes pre-built tasks for cognitive studies and Human-AI teaming, and its
modular design allows further tasks to be added. Following conventional cognitive
neuroscience research, CREW also supports multimodal human physiological signal
recording for behavior analysis. Moreover, CREW benchmarks real-time
human-guided reinforcement learning agents using state-of-the-art algorithms
and well-tuned baselines. With CREW, we were able to conduct 50 human subject
studies within a week to verify the effectiveness of our benchmark. [COMMENTS]Our project website is at: http://generalroboticslab.com/CREW [LINK]http://arxiv.org/abs/2408.00170v2 [DATE]2024-12-03 02:37:01+08:00 [CATEGORIES]cs.LG
Asynchronous Message-Passing and Zeroth-Order Optimization Based Distributed Learning with a Use-Case in Resource Allocation in Communication Networks [AUTHORS]Pourya Behmandpoor, Marc Moonen, Panagiotis Patrinos [ABSTRACT]Distributed learning and adaptation have received significant interest and
found wide-ranging applications in machine learning and signal processing.
While various approaches, such as shared-memory optimization, multi-task
learning, and consensus-based learning (e.g., federated learning and learning
over graphs), focus on optimizing either local costs or a global cost, there
remains a need for further exploration of their interconnections. This paper
specifically focuses on a scenario where agents collaborate towards a common
task (i.e., optimizing a global cost equal to aggregated local costs) while
effectively having distinct individual tasks (i.e., optimizing individual local
parameters in a local cost). Each agent's actions can potentially impact other
agents' performance through interactions. Notably, each agent has access to
only its local zeroth-order oracle (i.e., cost function value) and shares
scalar values, rather than gradient vectors, with other agents, leading to
communication bandwidth efficiency and agent privacy. Agents employ
zeroth-order optimization to update their parameters, and the asynchronous
message-passing between them is subject to bounded but possibly random
communication delays. This paper presents theoretical convergence analyses and
establishes a convergence rate for nonconvex problems. Furthermore, it
addresses the relevant use-case of deep learning-based resource allocation in
communication networks and conducts numerical experiments in which agents,
acting as transmitters, collaboratively train their individual policies to
maximize a global reward, e.g., a sum of data rates. [LINK]http://arxiv.org/abs/2311.04604v3 [DATE]2024-12-03 02:02:53+08:00 [CATEGORIES]cs.LG
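The abstract above relies on zeroth-order optimization, where only cost values (scalars) are available rather than gradients. As a minimal single-agent sketch, the code below uses a common two-point random-direction estimator. It does not model the paper's asynchronous message-passing or delays, and the function names, step size, and smoothing radius `mu` are assumptions for illustration.

```python
import random


def two_point_gradient(f, x, mu, rng):
    """Estimate the gradient of f at x from two function evaluations."""
    u = [rng.gauss(0.0, 1.0) for _ in x]  # random search direction
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    # Only this scalar depends on f; it approximates the directional derivative.
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def zeroth_order_descent(f, x0, step=0.05, mu=1e-3, n_steps=500, seed=0):
    """Gradient-free descent using the two-point estimator above."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        g = two_point_gradient(f, x, mu, rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x
```

On a simple quadratic such as `f(x) = sum((xi - ci) ** 2)`, the iterates should approach the minimizer `c` even though no gradient is ever computed.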
Discovering group dynamics in coordinated time series via hierarchical recurrent switching-state models [AUTHORS]Michael T. Wojnowicz, Kaitlin Gili, Preetish Rath, Eric Miller, Jeffrey Miller, Clifford Hancock, Meghan O'Donovan, Seth Elkin-Frankston, Tad T. Brunyé, Michael C. Hughes [ABSTRACT]We seek a computationally efficient model for a collection of time series
arising from multiple interacting entities (a.k.a. "agents"). Recent models of
spatiotemporal patterns across individuals fail to incorporate explicit
system-level collective behavior that can influence the trajectories of
individual entities. To address this gap in the literature, we present a new
hierarchical switching-state model that can be trained in an unsupervised
fashion to simultaneously learn both system-level and individual-level
dynamics. We employ a latent system-level discrete state Markov chain that
provides top-down influence on latent entity-level chains which in turn govern
the emission of each observed time series. Recurrent feedback from the
observations to the latent chains at both entity and system levels allows
recent situational context to inform how dynamics unfold at all levels in
bottom-up fashion. We hypothesize that including both top-down and bottom-up
influences on group dynamics will improve interpretability of the learned
dynamics and reduce error when forecasting. Our hierarchical switching
recurrent dynamical model can be learned via closed-form variational coordinate
ascent updates to all latent chains that scale linearly in the number of
entities. This is asymptotically no more costly than fitting a separate model
for each entity. Analysis of both synthetic data and real basketball team
movements suggests our lean parametric model can achieve competitive forecasts
compared to larger neural network models that require far more computational
resources. Further experiments on soldier data as well as a synthetic task with
64 cooperating entities show how our approach can yield interpretable insights
about team dynamics over time. [LINK]http://arxiv.org/abs/2401.14973v2 [DATE]2024-12-03 01:35:07+08:00 [CATEGORIES]cs.LG
2024 Dec 02, Mon
Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [AUTHORS]Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu [ABSTRACT]Clinical decision making (CDM) is a complex, dynamic process crucial to
healthcare delivery, yet it remains a significant challenge for artificial
intelligence systems. While Large Language Model (LLM)-based agents have been
tested on general medical knowledge using licensing exams and knowledge
question-answering tasks, their performance on CDM in real-world scenarios
is limited due to the lack of comprehensive testing datasets that mirror actual
medical practice. To address this gap, we present MedChain, a dataset of 12,163
clinical cases that covers five key stages of clinical workflow. MedChain
distinguishes itself from existing benchmarks with three key features of
real-world clinical practice: personalization, interactivity, and
sequentiality. Further, to tackle real-world CDM challenges, we also propose
MedChain-Agent, an AI system that integrates a feedback mechanism and an
MCase-RAG module to learn from previous cases and adapt its responses.
MedChain-Agent demonstrates remarkable adaptability in gathering information
dynamically and handling sequential clinical tasks, significantly outperforming
existing approaches. The relevant dataset and code will be released upon
acceptance of this paper. [LINK]http://arxiv.org/abs/2412.01605v1 [DATE]2024-12-02 23:25:02+08:00 [CATEGORIES]cs.CL
Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach [AUTHORS]Tianyi Huang, Arya Somasundaram [ABSTRACT]Large Language Models (LLMs) often perpetuate biases in pronoun usage,
leading to misrepresentation or exclusion of queer individuals. This paper
addresses the specific problem of biased pronoun usage in LLM outputs,
particularly the inappropriate use of traditionally gendered pronouns ("he,"
"she") when inclusive language is needed to accurately represent all
identities. We introduce a collaborative agent pipeline designed to mitigate
these biases by analyzing and optimizing pronoun usage for inclusivity. Our
multi-agent framework includes specialized agents for both bias detection and
correction. Experimental evaluations using the Tango dataset, a benchmark
focused on gender pronoun usage, demonstrate that our approach significantly
improves inclusive pronoun classification, achieving a 32.6 percentage point
increase over GPT-4o in correctly disagreeing with inappropriate traditionally
gendered pronouns $(\chi^2 = 38.57, p < 0.0001)$. These results accentuate the
potential of agent-driven frameworks in enhancing fairness and inclusivity in
AI-generated content, demonstrating their efficacy in reducing biases and
promoting socially responsible AI. [COMMENTS]NeurIPS 2024 Queer in AI Workshop [LINK]http://arxiv.org/abs/2411.07656v2 [DATE]2024-12-02 12:36:45+08:00 [CATEGORIES]cs.CL
SAUP: Situation Awareness Uncertainty Propagation on LLM Agent [AUTHORS]Qiwei Zhao, Xujiang Zhao, Yanchi Liu, Wei Cheng, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Huaxiu Yao, Haifeng Chen [ABSTRACT]Large language models (LLMs) integrated into multistep agent systems enable
complex decision-making processes across various applications. However, their
outputs often lack reliability, making uncertainty estimation crucial. Existing
uncertainty estimation methods primarily focus on final-step outputs, which
fail to account for cumulative uncertainty over the multistep decision-making
process and the dynamic interactions between agents and their environments. To
address these limitations, we propose SAUP (Situation Awareness Uncertainty
Propagation), a novel framework that propagates uncertainty through each step
of an LLM-based agent's reasoning process. SAUP incorporates situational
awareness by assigning situational weights to each step's uncertainty during
the propagation. Our method, compatible with various one-step uncertainty
estimation techniques, provides a comprehensive and accurate uncertainty
measure. Extensive experiments on benchmark datasets demonstrate that SAUP
significantly outperforms existing state-of-the-art methods, achieving up to
20% improvement in AUROC. [LINK]http://arxiv.org/abs/2412.01033v1 [DATE]2024-12-02 09:31:13+08:00 [CATEGORIES]cs.CL, cs.LG
Towards Type Agnostic Cyber Defense Agents [AUTHORS]Erick Galinkin, Emmanouil Pountrourakis, Spiros Mancoridis [ABSTRACT]With computing now ubiquitous across government, industry, and education,
cybersecurity has become a critical component for every organization on the
planet. Due to this ubiquity of computing, cyber threats have continued to grow
year over year, leading to labor shortages and a skills gap in cybersecurity.
As a result, many cybersecurity product vendors and security organizations have
looked to artificial intelligence to shore up their defenses. This work
considers how to characterize attackers and defenders in one approach to the
automation of cyber defense -- the application of reinforcement learning.
Specifically, we characterize the types of attackers and defenders in the sense
of Bayesian games and, using reinforcement learning, derive empirical findings
about how to best train agents that defend against multiple types of attackers. [COMMENTS]Submitted to AICS 2025: https://aics.site [LINK]http://arxiv.org/abs/2412.01542v1 [DATE]2024-12-02 22:32:18+08:00 [CATEGORIES]cs.LG
Moral Alignment for LLM Agents [AUTHORS]Elizaveta Tennant, Stephen Hailes, Mirco Musolesi [ABSTRACT]Decision-making agents based on pre-trained Large Language Models (LLMs) are
increasingly being deployed across various domains of human activity. While
their applications are currently rather specialized, several research efforts
are under way to develop more generalist agents. As LLM-based systems become
more agentic, their influence on human activity will grow and the transparency
of this will decrease. Consequently, developing effective methods for aligning
them to human values is vital.
The prevailing practice in alignment often relies on human preference data
(e.g., in RLHF or DPO), in which values are implicit and are essentially
deduced from relative preferences over different model outputs. In this work,
instead of relying on human feedback, we introduce the design of reward
functions that explicitly encode core human values for Reinforcement
Learning-based fine-tuning of foundation agent models. Specifically, we use
intrinsic rewards for the moral alignment of LLM agents.
We evaluate our approach using the traditional philosophical frameworks of
Deontological Ethics and Utilitarianism, quantifying moral rewards for agents
in terms of actions and consequences on the Iterated Prisoner's Dilemma (IPD)
environment. We also show how moral fine-tuning can be deployed to enable an
agent to unlearn a previously developed selfish strategy. Finally, we find that
certain moral strategies learned on the IPD game generalize to several other
matrix game environments. In summary, we demonstrate that fine-tuning with
intrinsic rewards is a promising general solution for aligning LLM agents to
human values, and it might represent a more transparent and cost-effective
alternative to currently predominant alignment techniques. [LINK]http://arxiv.org/abs/2410.01639v2 [DATE]2024-12-02 22:25:30+08:00 [CATEGORIES]cs.LG
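To make the distinction between action-based and consequence-based moral rewards in the abstract above concrete, here is a small hypothetical sketch on the Iterated Prisoner's Dilemma. The payoff values, the specific deontological norm (do not defect against a player who just cooperated), and the penalty size are all assumptions for illustration and are not taken from the paper.

```python
# Assumed payoff table: (my payoff, opponent payoff); "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def game_reward(my_action, opp_action):
    """Selfish reward: the agent's own payoff in the current round."""
    return PAYOFFS[(my_action, opp_action)][0]


def utilitarian_reward(my_action, opp_action):
    """Consequence-based reward: total payoff of both players."""
    mine, theirs = PAYOFFS[(my_action, opp_action)]
    return mine + theirs


def deontological_reward(my_action, opp_prev_action, penalty=-3.0):
    """Action-based reward: penalize violating an assumed norm,
    here defecting against an opponent who cooperated last round."""
    if my_action == "D" and opp_prev_action == "C":
        return penalty
    return 0.0
```

Under these assumed payoffs, mutual cooperation maximizes the utilitarian reward (6), while the deontological reward depends only on the agent's own action and the opponent's previous move, not on the resulting payoff.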
MASP: Scalable GNN-based Planning for Multi-Agent Navigation [AUTHORS]Xinyi Yang, Xinting Yang, Chao Yu, Jiayu Chen, Wenbo Ding, Huazhong Yang, Yu Wang [ABSTRACT]We investigate multi-agent navigation tasks, where multiple agents need to
reach initially unassigned goals in a limited time. Classical planning-based
methods suffer from expensive computation overhead at each step and offer
limited expressiveness for complex cooperation strategies. In contrast,
reinforcement learning (RL) has recently become a popular approach for
addressing this issue. However, RL struggles with low data efficiency and
cooperation when directly exploring (nearly) optimal policies in a large
exploration space, especially with an increased number of agents (e.g., 10+
agents) or in complex environments (e.g., 3-D simulators). In this paper, we
propose the Multi-Agent Scalable Graph-based Planner (MASP), a goal-conditioned
hierarchical planner for navigation tasks with a substantial number of agents
in the decentralized setting. MASP employs a hierarchical framework to reduce
space complexity by decomposing a large exploration space into multiple
goal-conditioned subspaces, where a high-level policy assigns agents goals, and
a low-level policy navigates agents toward designated goals. For agent
cooperation and the adaptation to varying team sizes, we model agents and goals
as graphs to better capture their relationship. The high-level policy, the Goal
Matcher, leverages a graph-based Self-Encoder and Cross-Encoder to optimize
goal assignment by updating the agent and the goal graphs. The low-level
policy, the Coordinated Action Executor, introduces the Group Information
Fusion to facilitate group division and extract agent relationships across
groups, enhancing training efficiency for agent cooperation. The results
demonstrate that MASP outperforms RL and planning-based baselines in task
efficiency. [COMMENTS]Submitted to IEEE RA-L [LINK]http://arxiv.org/abs/2312.02522v2 [DATE]2024-12-02 20:49:50+08:00 [CATEGORIES]cs.LG
Masked Generative Priors Improve World Models Sequence Modelling Capabilities [AUTHORS]Cristian Meo, Mircea Lica, Zarif Ikram, Akihiro Nakano, Vedant Shah, Aniket Rajiv Didolkar, Dianbo Liu, Anirudh Goyal, Justin Dauwels [ABSTRACT]Deep Reinforcement Learning (RL) has become the leading approach for creating
artificial agents in complex environments. Model-based approaches, which are RL
methods with world models that predict environment dynamics, are among the most
promising directions for improving data efficiency, forming a critical step
toward bridging the gap between research and real-world deployment. In
particular, world models enhance sample efficiency by learning in imagination,
which involves training a generative sequence model of the environment in a
self-supervised manner. Recently, Masked Generative Modelling has emerged as a
more efficient and superior inductive bias for modelling and generating token
sequences. Building on the Efficient Stochastic Transformer-based World Models
(STORM) architecture, we replace the traditional MLP prior with a Masked
Generative Prior (e.g., MaskGIT Prior) and introduce GIT-STORM. We evaluate our
model on two downstream tasks: reinforcement learning and video prediction.
GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari
100k benchmark. Moreover, we apply Transformer-based World Models to continuous
action environments for the first time, addressing a significant gap in prior
research. To achieve this, we employ a state mixer function that integrates
latent state representations with actions, enabling our model to handle
continuous control tasks. We validate this approach through qualitative and
quantitative analyses on the DeepMind Control Suite, showcasing the
effectiveness of Transformer-based World Models in this new domain. Our results
highlight the versatility and efficacy of the MaskGIT dynamics prior, paving
the way for more accurate world models and effective RL policies. [LINK]http://arxiv.org/abs/2410.07836v4 [DATE]2024-12-02 20:44:48+08:00 [CATEGORIES]cs.LG
Multi-turn Reinforcement Learning from Preference Human Feedback [AUTHORS]Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos [ABSTRACT]Reinforcement Learning from Human Feedback (RLHF) has become the standard
approach for aligning Large Language Models (LLMs) with human preferences,
allowing LLMs to demonstrate remarkable abilities in various tasks. Existing
methods work by emulating the preferences at the single decision (turn) level,
limiting their capabilities in settings that require planning or multi-turn
interactions to achieve a long-term goal. In this paper, we address this issue
by developing novel methods for Reinforcement Learning (RL) from preference
feedback between two full multi-turn conversations. In the tabular setting, we
present a novel mirror-descent-based policy optimization algorithm for the
general multi-turn preference-based RL problem, and prove its convergence to
Nash equilibrium. To evaluate performance, we create a new environment,
Education Dialogue, where a teacher agent guides a student in learning a random
topic, and show that a deep RL variant of our algorithm outperforms RLHF
baselines. Finally, we show that in an environment with explicit rewards, our
algorithm recovers the same performance as a reward-based RL baseline, despite
relying solely on a weaker preference signal. [LINK]http://arxiv.org/abs/2405.14655v2 [DATE]2024-12-02 20:37:46+08:00 [CATEGORIES]cs.LG
LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations [AUTHORS]Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein [ABSTRACT]Today's largest foundation models have increasingly general capabilities, yet
when used as agents, they often struggle with simple reasoning and
decision-making tasks, even though they possess good factual knowledge of the
task and how to solve it. In this paper, we present a benchmark to
pressure-test these models' multimodal decision-making capabilities in the very
long-context regime (up to one million tokens) and investigate whether they can
learn from a large number of expert demonstrations in their context. We
evaluate a wide range of state-of-the-art frontier models as policies across a
battery of simple interactive decision-making tasks: playing tic-tac-toe,
chess, and Atari, navigating grid worlds, solving crosswords, and controlling a
simulated cheetah. We measure the performance of Claude 3.5 Sonnet, Gemini 1.5
Flash, Gemini 1.5 Pro, GPT-4o, o1-mini, and o1-preview under increasing amounts
of expert demonstrations in the context, from no
demonstrations up to 512 full episodes, pushing these models' multimodal
long-context reasoning capabilities to their limits. Across our tasks, today's
frontier models rarely manage to fully reach expert performance, showcasing the
difficulty of our benchmark. Presenting more demonstrations often has little
effect, but some models steadily improve with more demonstrations on a few
tasks. We investigate the effect of encoding observations as text or images and
the impact of chain-of-thought prompting. Overall, our results suggest that
even today's most capable models often struggle to imitate desired behavior by
generalizing purely from in-context demonstrations. To help quantify the impact
of other approaches and future innovations aiming to tackle this problem, we
open source our benchmark that covers the zero-, few-, and many-shot regimes in
a unified evaluation. [LINK]http://arxiv.org/abs/2412.01441v1 [DATE]2024-12-02 20:31:58+08:00 [CATEGORIES]cs.LG
Task Adaptation of Reinforcement Learning-based NAS Agents through Transfer Learning [AUTHORS]Amber Cassimon, Siegfried Mercelis, Kevin Mets [ABSTRACT]Recently, a novel paradigm has been proposed for reinforcement learning-based
NAS agents that revolves around the incremental improvement of a given
architecture. We assess the abilities of such reinforcement learning agents to
transfer between different tasks. We perform our evaluation using the
Trans-NASBench-101 benchmark, and consider the efficacy of the transferred
agents, as well as how quickly they can be trained. We find that pretraining an
agent on one task benefits the performance of the agent in another task in all
but one task when considering final performance. We also show that the training
procedure for an agent can be shortened significantly by pretraining it on
another task. Our results indicate that these effects occur regardless of the
source or target task, although they are more pronounced for some tasks than
for others. Our results show that transfer learning can be an effective tool in
mitigating the computational cost of the initial training procedure for
reinforcement learning-based NAS agents. [COMMENTS]15 Pages, 13 Figures [LINK]http://arxiv.org/abs/2412.01420v1 [DATE]2024-12-02 20:00:27+08:00 [CATEGORIES]cs.LG
Practical Performative Policy Learning with Strategic Agents [AUTHORS]Qianyi Chen, Ying Chen, Bo Li [ABSTRACT]This paper studies the performative policy learning problem, where agents
adjust their features in response to a released policy to improve their
potential outcomes, inducing an endogenous distribution shift. There has been
growing interest in training machine learning models in strategic environments,
including strategic classification and performative prediction. However,
existing approaches often rely on restrictive parametric assumptions:
micro-level utility models in strategic classification and macro-level data
distribution maps in performative prediction, severely limiting scalability and
generalizability. We approach this problem as a complex causal inference task,
relaxing parametric assumptions on both micro-level agent behavior and
macro-level data distribution. Leveraging bounded rationality, we uncover a
practical low-dimensional structure in distribution shifts and construct an
effective mediator in the causal path from the deployed model to the shifted
data. We then propose a gradient-based policy optimization algorithm with a
differentiable classifier as a substitute for the high-dimensional distribution
map. Our algorithm efficiently utilizes batch feedback and limited manipulation
patterns. Our approach achieves high sample efficiency compared to methods
reliant on bandit feedback or zero-order optimization. We also provide
theoretical guarantees for algorithmic convergence. Extensive and challenging
experiments on high-dimensional settings demonstrate our method's practical
efficacy. [LINK]http://arxiv.org/abs/2412.01344v1 [DATE]2024-12-02 18:09:44+08:00 [CATEGORIES]cs.LG
BricksRL: A Platform for Democratizing Robotics and Reinforcement Learning Research and Education with LEGO [AUTHORS]Sebastian Dittert, Vincent Moens, Gianni De Fabritiis [ABSTRACT]We present BricksRL, a platform designed to democratize access to robotics
for reinforcement learning research and education. BricksRL facilitates the
creation, design, and training of custom LEGO robots in the real world by
interfacing them with the TorchRL library for reinforcement learning agents.
The integration of TorchRL with the LEGO hubs, via Bluetooth bidirectional
communication, enables state-of-the-art reinforcement learning training on GPUs
for a wide variety of LEGO builds. This offers a flexible and cost-efficient
approach for scaling and also provides a robust infrastructure for
robot-environment-algorithm communication. We present various experiments
across tasks and robot configurations, providing built plans and training
results. Furthermore, we demonstrate that inexpensive LEGO robots can be
trained end-to-end in the real world to achieve simple tasks, with training
times typically under 120 minutes on a normal laptop. Moreover, we show how
users can extend the capabilities, exemplified by the successful integration of
non-LEGO sensors. By enhancing accessibility to both robotics and reinforcement
learning, BricksRL establishes a strong foundation for democratized robotic
learning in research and educational settings. [LINK]http://arxiv.org/abs/2406.17490v2 [DATE]2024-12-02 17:49:23+08:00 [CATEGORIES]cs.LG
Dense Dynamics-Aware Reward Synthesis: Integrating Prior Experience with Demonstrations [AUTHORS]Cevahir Koprulu, Po-han Li, Tianyu Qiu, Ruihan Zhao, Tyler Westenbroek, David Fridovich-Keil, Sandeep Chinchali, Ufuk Topcu [ABSTRACT]Many continuous control problems can be formulated as sparse-reward
reinforcement learning (RL) tasks. In principle, online RL methods can
automatically explore the state space to solve each new task. However,
discovering sequences of actions that lead to a non-zero reward becomes
exponentially more difficult as the task horizon increases. Manually shaping
rewards can accelerate learning for a fixed task, but it is an arduous process
that must be repeated for each new environment. We introduce a systematic
reward-shaping framework that distills the information contained in 1) a
task-agnostic prior data set and 2) a small number of task-specific expert
demonstrations, and then uses these priors to synthesize dense dynamics-aware
rewards for the given task. This supervision substantially accelerates learning
in our experiments, and we provide analysis demonstrating how the approach can
effectively guide online learning agents to faraway goals. [LINK]http://arxiv.org/abs/2412.01114v1 [DATE]2024-12-02 12:37:12+08:00 [CATEGORIES]cs.LG
Realizable Continuous-Space Shields for Safe Reinforcement Learning [AUTHORS]Kyungmin Kim, Davide Corsi, Andoni Rodriguez, JB Lanier, Benjami Parellada, Pierre Baldi, Cesar Sanchez, Roy Fox [ABSTRACT]While Deep Reinforcement Learning (DRL) has achieved remarkable success
across various domains, it remains vulnerable to occasional catastrophic
failures without additional safeguards. An effective solution to prevent these
failures is to use a shield that validates and adjusts the agent's actions to
ensure compliance with a provided set of safety specifications. For real-world
robotic domains, it is essential to define safety specifications over
continuous state and action spaces to accurately account for system dynamics
and compute new actions that minimally deviate from the agent's original
decision. In this paper, we present the first shielding approach specifically
designed to ensure the satisfaction of safety requirements in continuous state
and action spaces, making it suitable for practical robotic applications. Our
method builds upon realizability, an essential property that confirms the
shield will always be able to generate a safe action for any state in the
environment. We formally prove that realizability can be verified for stateful
shields, enabling the incorporation of non-Markovian safety requirements, such
as loop avoidance. Finally, we demonstrate the effectiveness of our approach in
ensuring safety without compromising the policy's success rate by applying it
to a navigation problem and a multi-agent particle environment. [COMMENTS]Kim, Corsi, and Rodriguez contributed equally [LINK]http://arxiv.org/abs/2410.02038v2 [DATE]2024-12-02 12:20:10+08:00 [CATEGORIES]cs.LG
A Memory-Based Reinforcement Learning Approach to Integrated Sensing and Communication [AUTHORS]Homa Nikbakht, Michèle Wigger, Shlomo Shamai, H. Vincent Poor [ABSTRACT]In this paper, we consider a point-to-point integrated sensing and
communication (ISAC) system, where a transmitter conveys a message to a
receiver over a channel with memory and simultaneously estimates the state of
the channel through the backscattered signals from the emitted waveform. Using
Massey's concept of directed information for channels with memory, we formulate
the capacity-distortion tradeoff for the ISAC problem when sensing is performed
in an online fashion. Optimizing the transmit waveform for this system to
simultaneously achieve good communication and sensing performance is a
complicated task, and thus we propose a deep reinforcement learning (RL)
approach to find a solution. The proposed approach enables the agent to
optimize the ISAC performance by learning a reward that reflects the difference
between the communication gain and the sensing loss. Since the state space in
our RL model is a priori unbounded, we employ the deep deterministic policy
gradient (DDPG) algorithm. Our numerical results suggest a significant
performance improvement when one considers an unbounded state space as opposed to
a simpler RL problem with a reduced state space. In the extreme case of a
degenerate state space, only memoryless signaling strategies are possible. Our
results thus emphasize the necessity of properly exploiting the memory inherent in
ISAC systems. [LINK]http://arxiv.org/abs/2412.01077v1 [DATE]2024-12-02 11:30:50+08:00 [CATEGORIES]cs.LG
Multi-Agent Deep Reinforcement Learning for Distributed and Autonomous Platoon Coordination via Speed-regulation over Large-scale Transportation Networks [AUTHORS]Dixiao Wei, Peng Yi, Jinlong Lei, Xingyi Zhu [ABSTRACT]Truck platooning technology enables a group of trucks to travel closely
together, allowing the platoon to save fuel, improve traffic flow
efficiency, and improve safety. In this paper, we consider the platoon
coordination problem in a large-scale transportation network, to promote
cooperation among trucks and optimize the overall efficiency. Involving the
regulation of both speed and departure times at hubs, we formulate the
coordination problem as a complicated dynamic stochastic integer program
under network and information constraints. To obtain an autonomous, distributed,
and robust platoon coordination policy, we model the problem as a
Decentralized Partially Observable Markov Decision Process. Then, we
propose a Multi-Agent Deep Reinforcement Learning framework named Truck
Attention-QMIX (TA-QMIX) to train an efficient online decision policy. TA-QMIX
utilizes the attention mechanism to enhance the representation of truck fuel
gains and delay times, and provides explicit truck cooperation information
during the training process, promoting trucks' willingness to cooperate. The
training framework adopts centralized training and distributed execution, thus
training a policy for trucks to make decisions online using only nearby
information. Hence, the policy can be autonomously executed on a large-scale
network. Finally, we perform comparison experiments and ablation experiments in
the transportation network of the Yangtze River Delta region in China to verify
the effectiveness of the proposed framework. In a repeated comparative
experiment with 5,000 trucks, our method saves 19.17% of fuel on average, with an
average delay of only 9.57 minutes per truck and a decision time of 0.001
seconds. [LINK]http://arxiv.org/abs/2412.01075v1 [DATE]2024-12-02 11:21:40+08:00 [CATEGORIES]cs.LG
Provable Partially Observable Reinforcement Learning with Privileged Information [AUTHORS]Yang Cai, Xiangyu Liu, Argyris Oikonomou, Kaiqing Zhang [ABSTRACT]Partial observability of the underlying states generally presents significant
challenges for reinforcement learning (RL). In practice, certain
privileged information, e.g., access to states from simulators, has
been exploited in training and has achieved prominent empirical successes. To
better understand the benefits of privileged information, we revisit and
examine several simple and practically used paradigms in this setting.
Specifically, we first formalize the empirical paradigm of expert
distillation (also known as teacher-student learning), demonstrating
its pitfall in finding near-optimal policies. We then identify a condition of
the partially observable environment, the deterministic filter
condition, under which expert distillation achieves sample and computational
complexities that are both polynomial. Furthermore, we investigate
another useful empirical paradigm of asymmetric actor-critic, and focus
on the more challenging setting of observable partially observable Markov
decision processes. We develop a belief-weighted asymmetric actor-critic
algorithm with polynomial sample and quasi-polynomial computational
complexities, in which one key component is a new provable oracle for learning
belief states that preserve filter stability under a misspecified model,
which may be of independent interest. Finally, we also investigate the provable
efficiency of partially observable multi-agent RL (MARL) with privileged
information. We develop algorithms featuring
centralized-training-with-decentralized-execution, a popular framework
in empirical MARL, with polynomial sample and (quasi-)polynomial computational
complexities in both paradigms above. Compared with a few recent related
theoretical studies, our focus is on understanding practically inspired
algorithmic paradigms, without computationally intractable oracles. [COMMENTS]This paper has been accepted to 2024 Conference on Neural Information
Processing Systems (NeurIPS 2024) [LINK]http://arxiv.org/abs/2412.00985v1 [DATE]2024-12-02 06:26:27+08:00 [CATEGORIES]cs.LG
STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft [AUTHORS]Nicholas Lenzen, Amogh Raut, Andrew Melnik [ABSTRACT]Recently, the STEVE-1 approach has been introduced as a method for training
generative agents to follow instructions in the form of latent CLIP embeddings.
In this work, we present a methodology to extend the control modalities by
learning a mapping from new input modalities to the latent goal space of the
agent. We apply our approach to the challenging Minecraft domain, and extend
the goal conditioning to include the audio modality. The resulting
audio-conditioned agent is able to perform on a comparable level to the
original text-conditioned and visual-conditioned agents. Specifically, we
create an Audio-Video CLIP foundation model for Minecraft and an audio prior
network which together map audio samples to the latent goal space of the
STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when
conditioning on different modalities. Our training code, evaluation code, and
Audio-Video CLIP foundation model for Minecraft are made open-source to help
foster further research into multi-modal generalist sequential decision-making
agents. [COMMENTS]Accepted at CoRL 2024: Workshop on Lifelong Learning for Home Robots [LINK]http://arxiv.org/abs/2412.00949v1 [DATE]2024-12-02 03:48:57+08:00 [CATEGORIES]cs.LG
Bilinear Convolution Decomposition for Causal RL Interpretability [AUTHORS]Narmeen Oozeer, Sinem Erisken, Alice Rigg [ABSTRACT]Efforts to interpret reinforcement learning (RL) models often rely on
high-level techniques such as attribution or probing, which provide only
correlational insights and coarse causal control. This work proposes replacing
nonlinearities in convolutional neural networks (ConvNets) with bilinear
variants, to produce a class of models for which these limitations can be
addressed. We show bilinear model variants perform comparably in model-free
reinforcement learning settings, and give a side by side comparison on ProcGen
environments. Bilinear layers' analytic structure enables weight-based
decomposition. Previous work has shown bilinearity enables quantifying
functional importance through eigendecomposition, to identify interpretable low
rank structure. We show how to adapt the decomposition to convolution layers by
applying singular value decomposition to vectors of interest, to separate the
channel and spatial dimensions. Finally, we propose a methodology for causally
validating concept-based probes, and illustrate its utility by studying a
maze-solving agent's ability to track a cheese object. [COMMENTS]8 pages, 10 figures [LINK]http://arxiv.org/abs/2412.00944v1 [DATE]2024-12-02 03:32:04+08:00 [CATEGORIES]cs.LG
A Deep Generative Model for the Design of Synthesizable Ionizable Lipids [AUTHORS]Yuxuan Ou, Jingyi Zhao, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato [ABSTRACT]Lipid nanoparticles (LNPs) are vital in modern biomedicine, enabling the
effective delivery of mRNA for vaccines and therapies by protecting it from
rapid degradation. Among the components of LNPs, ionizable lipids play a key
role in RNA protection and facilitate its delivery into the cytoplasm. However,
designing ionizable lipids is complex. Deep generative models can accelerate
this process and explore a larger candidate space compared to traditional
methods. Due to the structural differences between lipids and small molecules,
existing generative models used for small molecule generation are unsuitable
for lipid generation. To address this, we developed a deep generative model
specifically tailored for the discovery of ionizable lipids. Our model
generates novel ionizable lipid structures and provides synthesis paths using
synthetically accessible building blocks, addressing synthesizability. This
advancement holds promise for streamlining the development of lipid-based
delivery systems, potentially accelerating the deployment of new therapeutic
agents, including mRNA vaccines and gene therapies. [COMMENTS]NeurIPS 2024 Workshop on AI for New Drug Modalities [LINK]http://arxiv.org/abs/2412.00928v1 [DATE]2024-12-02 02:33:22+08:00 [CATEGORIES]cs.LG
2024 Dec 01, Sun
Does chat change LLM's mind? Impact of Conversation on Psychological States of LLMs [AUTHORS]Junhyuk Choi, Yeseon Hong, Minju Kim, Bugeun Kim [ABSTRACT]The recent growth of large language models (LLMs) has enabled more authentic,
human-centered interactions through multi-agent systems. However, investigation
into how conversations affect the psychological states of LLMs is limited,
despite the impact of these states on the usability of LLM-based systems. In
this study, we explored whether psychological states change during multi-agent
interactions, focusing on the effects of conversation depth, topic, and
speaker. We experimentally investigated the behavior of 10 LLMs in open-domain
conversations. We employed 14 questionnaires and a topic-analysis method to
examine the behavior of LLMs across four aspects: personality, interpersonal
relationships, motivation, and emotion. The results revealed distinct
psychological trends influenced by conversation depth and topic, with
significant variations observed between different LLM families and parameter
sizes. [COMMENTS]Under review [LINK]http://arxiv.org/abs/2412.00804v1 [DATE]2024-12-01 21:19:32+08:00 [CATEGORIES]cs.CL
Towards Adaptive Mechanism Activation in Language Agent [AUTHORS]Ziyang Huang, Jun Zhao, Kang Liu [ABSTRACT]Language agents can be endowed with different mechanisms for autonomous task
accomplishment. Current agents typically rely on fixed mechanisms or a set of
mechanisms activated in a predefined order, limiting their adaptation to varied
potential task solution structures. To this end, this paper proposes
Adaptive Language Agent Mechanism
Activation Learning with Self-Exploration (ALAMA), which
focuses on optimizing mechanism activation adaptability without reliance on
expert models. Initially, it builds a harmonized agent framework
(UniAct) to Unify different mechanisms via Actions.
Then it leverages a training-efficient optimization method based on
self-exploration to enable the UniAct to adaptively activate the appropriate
mechanisms according to the potential characteristics of the task. Experimental
results demonstrate significant improvements in downstream agent tasks,
affirming the effectiveness of our approach in facilitating more dynamic and
context-sensitive mechanism activation. [COMMENTS]COLING2025 [LINK]http://arxiv.org/abs/2412.00722v1 [DATE]2024-12-01 16:10:04+08:00 [CATEGORIES]cs.CL
Multi-Agent Collaboration in Incident Response with Large Language Models [AUTHORS]Zefang Liu [ABSTRACT]Incident response (IR) is a critical aspect of cybersecurity, requiring rapid
decision-making and coordinated efforts to address cyberattacks effectively.
Leveraging large language models (LLMs) as intelligent agents offers a novel
approach to enhancing collaboration and efficiency in IR scenarios. This paper
explores the application of LLM-based multi-agent collaboration using the
Backdoors & Breaches framework, a tabletop game designed for cybersecurity
training. We simulate real-world IR dynamics through various team structures,
including centralized, decentralized, and hybrid configurations. By analyzing
agent interactions and performance across these setups, we provide insights
into optimizing multi-agent collaboration for incident response. Our findings
highlight the potential of LLMs to enhance decision-making, improve
adaptability, and streamline IR processes, paving the way for more effective
and coordinated responses to cyber threats. [LINK]http://arxiv.org/abs/2412.00652v1 [DATE]2024-12-01 11:12:26+08:00 [CATEGORIES]cs.CL
A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning [AUTHORS]Xinzhe Li [ABSTRACT]Tool use, planning, and feedback learning are currently three prominent
paradigms for developing Large Language Model (LLM)-based agents across various
tasks. Although numerous frameworks have been devised for each paradigm, their
intricate workflows and inconsistent taxonomy create challenges in
understanding and reviewing the frameworks across different paradigms. This
survey introduces a unified taxonomy to systematically review and discuss these
frameworks. Specifically, 1) the taxonomy defines environments/tasks, common
LLM-profiled roles or LMPRs (policy models, evaluators, and dynamic models),
and universally applicable workflows found in prior work, and 2) it enables a
comparison of key perspectives on the implementations of LMPRs and workflow
designs across different agent paradigms and frameworks. 3) Finally, we
identify three limitations in existing workflow designs and systematically
discuss future work. Resources have been made publicly available in our
GitHub repository https://github.com/xinzhel/LLM-Agent-Survey. [COMMENTS]CoLing 2025 Camera Ready (extended to 9 pages) [LINK]http://arxiv.org/abs/2406.05804v6 [DATE]2024-12-01 06:38:57+08:00 [CATEGORIES]cs.CL
Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective [AUTHORS]Yue Zhou, Barbara Di Eugenio, Lu Cheng [ABSTRACT]This paper studies the performance of large language models (LLMs),
particularly regarding demographic fairness, in solving real-world healthcare
tasks. We evaluate state-of-the-art LLMs with three prevalent learning
frameworks across six diverse healthcare tasks and find significant challenges
in applying LLMs to real-world healthcare tasks and persistent fairness issues
across demographic groups. We also find that explicitly providing demographic
information yields mixed results, while LLMs' ability to infer such details
raises concerns about biased health predictions. Utilizing LLMs as autonomous
agents with access to up-to-date guidelines does not guarantee performance
improvement. We believe these findings reveal the critical limitations of LLMs
in healthcare fairness and the urgent need for specialized research in this
area. [COMMENTS]Accepted to the main conference of COLING 2025 [LINK]http://arxiv.org/abs/2412.00554v1 [DATE]2024-12-01 02:52:30+08:00 [CATEGORIES]cs.CL
Online Poisoning Attack Against Reinforcement Learning under Black-box Environments [AUTHORS]Jianhui Li, Bokang Zhang, Junfeng Wu [ABSTRACT]This paper proposes an online environment poisoning algorithm tailored for
reinforcement learning agents operating in a black-box setting, where an
adversary deliberately manipulates training data to lead the agent toward a
mischievous policy. In contrast to prior studies that primarily investigate
white-box settings, we focus on a scenario characterized by environment
dynamics unknown to the attacker and a flexible reinforcement
learning algorithm employed by the targeted agent. We first propose an attack
scheme that is capable of poisoning the reward functions and state transitions.
The poisoning task is formalized as a constrained optimization problem,
following the framework of \cite{ma2019policy}. Given that the transition
probabilities are unknown to the attacker in a black-box environment, we apply
a stochastic gradient descent algorithm, where the exact gradients are
approximated using sample-based estimates. A penalty-based method along with a
bilevel reformulation is then employed to transform the problem into an
unconstrained counterpart and to circumvent the double-sampling issue. The
algorithm's effectiveness is validated through a maze environment. [LINK]http://arxiv.org/abs/2412.00797v1 [DATE]2024-12-01 20:43:23+08:00 [CATEGORIES]cs.LG
InvestESG: A multi-agent reinforcement learning benchmark for studying climate investment as a social dilemma [AUTHORS]Xiaoxuan Hou, Jiayi Yuan, Joel Z. Leibo, Natasha Jaques [ABSTRACT]InvestESG is a novel multi-agent reinforcement learning (MARL) benchmark
designed to study the impact of Environmental, Social, and Governance (ESG)
disclosure mandates on corporate climate investments. Supported by both PyTorch
and JAX implementations, the benchmark models an intertemporal social dilemma
where companies balance short-term profit losses from climate mitigation
efforts and long-term benefits from reducing climate risk, while ESG-conscious
investors attempt to influence corporate behavior through their investment
decisions, in a scalable and hardware-accelerated manner. Companies allocate
capital across mitigation, greenwashing, and resilience, with varying
strategies influencing climate outcomes and investor preferences. Our
experiments show that without ESG-conscious investors with sufficient capital,
corporate mitigation efforts remain limited under the disclosure mandate.
However, when a critical mass of investors prioritizes ESG, corporate
cooperation increases, which in turn reduces climate risks and enhances
long-term financial stability. Additionally, providing more information about
global climate risks encourages companies to invest more in mitigation, even
without investor involvement. Our findings align with empirical research using
real-world data, highlighting MARL's potential to inform policy by providing
insights into large-scale socio-economic challenges through efficient testing
of alternative policy and market designs. [LINK]http://arxiv.org/abs/2411.09856v2 [DATE]2024-12-01 13:18:12+08:00 [CATEGORIES]cs.LG
Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning [AUTHORS]Emile Anand, Ishani Karmarkar, Guannan Qu [ABSTRACT]Designing efficient algorithms for multi-agent reinforcement learning (MARL)
is fundamentally challenging because the sizes of the joint state
and action spaces are exponentially large in the number of agents. These
difficulties are exacerbated when balancing sequential global decision-making
with local agent interactions. In this work, we propose a new algorithm
SUBSAMPLE-MFQ (Subsample-Mean-Field-Q-learning) and a
decentralized randomized policy for a system with $n$ agents. For $k\leq n$,
our algorithm learns a policy for the system in time polynomial in $k$.
We show that this learned policy converges to the optimal policy on the order
of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. We
validate our method empirically on Gaussian squeeze and global exploration
settings. [COMMENTS]48 pages. 7 figures [LINK]http://arxiv.org/abs/2412.00661v1 [DATE]2024-12-01 11:45:17+08:00 [CATEGORIES]cs.LG
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents [AUTHORS]Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao [ABSTRACT]On-device control agents, especially on mobile devices, are responsible for
operating mobile devices to fulfill users' requests, enabling seamless and
intuitive interactions. Integrating Multimodal Large Language Models (MLLMs)
into these agents enhances their ability to understand and execute complex
commands, thereby improving user experience. However, fine-tuning MLLMs for
on-device control presents significant challenges due to limited data
availability and inefficient online training processes. This paper introduces
DistRL, a novel framework designed to enhance the efficiency of online RL
fine-tuning for mobile device control agents. DistRL employs centralized
training and decentralized data acquisition to ensure efficient fine-tuning in
the context of dynamic online interactions. Additionally, the framework is
backed by our tailor-made RL algorithm, which effectively balances exploration
with the prioritized utilization of collected data to ensure stable and robust
training. Our experiments show that, on average, DistRL delivers a 3X
improvement in training efficiency and enables training data collection 2.4X
faster than the leading synchronous multi-machine methods. Notably, after
training, DistRL achieves a 20% relative improvement in success rate compared
to state-of-the-art methods on general Android tasks from an open benchmark,
significantly outperforming existing approaches while maintaining the same
training time. These results validate DistRL as a scalable and efficient
solution, offering substantial improvements in both training efficiency and
agent performance for real-world, in-the-wild device control tasks. [COMMENTS]Paper and Appendix, 26 pages [LINK]http://arxiv.org/abs/2410.14803v4 [DATE]2024-12-01 10:09:21+08:00 [CATEGORIES]cs.LG
Towards Fault Tolerance in Multi-Agent Reinforcement Learning [AUTHORS]Yuchen Shi, Huaxin Pei, Liang Feng, Yi Zhang, Danya Yao [ABSTRACT]Agent faults pose a significant threat to the performance of multi-agent
reinforcement learning (MARL) algorithms, introducing two key challenges.
First, agents often struggle to extract critical information from the chaotic
state space created by unexpected faults. Second, transitions recorded before
and after faults in the replay buffer affect training unevenly, leading to a
sample imbalance problem. To overcome these challenges, this paper enhances the
fault tolerance of MARL by combining optimized model architecture with a
tailored training data sampling strategy. Specifically, an attention mechanism
is incorporated into the actor and critic networks to automatically detect
faults and dynamically regulate the attention given to faulty agents.
Additionally, a prioritization mechanism is introduced to selectively sample
transitions critical to current training needs. To further support research in
this area, we design and open-source a highly decoupled code platform for
fault-tolerant MARL, aimed at improving the efficiency of studying related
problems. Experimental results demonstrate the effectiveness of our method in
handling various types of faults, faults occurring in any agent, and faults
arising at random times. [COMMENTS]14 pages, 13 figures [LINK]http://arxiv.org/abs/2412.00534v1 [DATE]2024-12-01 00:56:29+08:00 [CATEGORIES]cs.LG
Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation [AUTHORS]Yiyuan Pan, Yunzhe Xu, Zhe Liu, Hesheng Wang [ABSTRACT]Humans navigate unfamiliar environments using the capabilities of episodic
simulation and episodic memory. Developing imagination-based memory, analogous
to episodic simulation and episodic memory, can enhance embodied agents'
comprehension of the complex relationship between environments and objects.
However, existing Vision-and-Language Navigation (VLN) agents fail to perform
the aforementioned mechanism. We propose a novel architecture to help agents
build a recurrent imaginative memory system. Specifically, the agent can
maintain a reality-imagination hybrid global memory during navigation and
expand the memory map through imaginative mechanisms and navigation actions.
Correspondingly, we design a series of pre-training tasks to help the agent
acquire fine-grained imaginative abilities. Our agents improve the
state-of-the-art (SoTA) success rate (SR) by 7% while simultaneously imagining
high-fidelity RGB representations for future scenes. [LINK]http://arxiv.org/abs/2412.01857v1 [DATE]2024-12-01 00:49:14+08:00 [CATEGORIES]cs.LG
2024 Dec 03, Tue
Compute-Constrained Data Selection [AUTHORS]Junjie Oscar Yin, Alexander M. Rush [ABSTRACT]Data selection can reduce the amount of training data needed to finetune
LLMs; however, the efficacy of data selection scales directly with its compute.
Motivated by the practical challenge of compute-constrained finetuning, we
consider the setting in which the costs of both data selection and training are
budgeted. We first formalize the problem of data selection with a
cost-aware utility function, and model the data selection problem as trading
off initial-selection cost for training gain. We run a comprehensive sweep of
experiments across multiple tasks, varying compute budget by scaling finetuning
tokens, model sizes, and data selection compute. Interestingly, we find that
many powerful data selection methods are almost never compute-optimal, and that
cheaper data selection alternatives dominate from both a theoretical and an
empirical perspective. For compute-optimal training, we find that perplexity
and gradient data selection require training-to-selection model size ratios of
5x and 10x, respectively. [LINK]http://arxiv.org/abs/2410.16208v3 [DATE]2024-12-03 02:59:28+08:00 [CATEGORIES]cs.LG cs.CL
2024 Dec 05, Thu
A Context-aware Framework for Translation-mediated Conversations [AUTHORS]José Pombal, Sweta Agrawal, Patrick Fernandes, Emmanouil Zaranis, André F. T. Martins [ABSTRACT]Effective communication is fundamental to any interaction, yet challenges
arise when participants do not share a common language. Automatic translation
systems offer a powerful solution to bridge language barriers in such
scenarios, but they introduce errors that can lead to misunderstandings and
conversation breakdown. A key issue is that current systems fail to incorporate
the rich contextual information necessary to resolve ambiguities and omitted
details, resulting in literal, inappropriate, or misaligned translations. In
this work, we present a framework to improve large language model-based
translation systems by incorporating contextual information in bilingual
conversational settings. During training, we leverage context-augmented
parallel data, which allows the model to generate translations sensitive to
conversational history. During inference, we perform quality-aware decoding
with context-aware metrics to select the optimal translation from a pool of
candidates. We validate both components of our framework on two task-oriented
domains: customer chat and user-assistant interaction. Across both settings,
our framework consistently results in better translations than state-of-the-art
systems like GPT-4o and TowerInstruct, as measured by multiple automatic
translation quality metrics on several language pairs. We also show that the
resulting model leverages context in an intended and interpretable way,
improving consistency between the conveyed message and the generated
translations. [LINK]http://arxiv.org/abs/2412.04205v1 [DATE]2024-12-05 22:41:05+08:00 [CATEGORIES]cs.CL
2024 Nov 30, Sat
Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection [AUTHORS]Ayoub Hammal, Benno Uthayasooriyar, Caio Corro [ABSTRACT]Named-entity recognition (NER) is a task that typically requires large
annotated datasets, which limits its applicability across domains with varying
entity definitions. This paper addresses few-shot NER, aiming to transfer
knowledge to new domains with minimal supervision. Unlike previous approaches
that rely solely on limited annotated data, we propose a weakly supervised
algorithm that combines small labeled datasets with large amounts of unlabeled
data. Our method extends the k-means algorithm with label supervision, cluster
size constraints and domain-specific discriminative subspace selection. This
unified framework achieves state-of-the-art results in few-shot NER on several
English datasets. [COMMENTS]COLING 2025 [LINK]http://arxiv.org/abs/2412.00426v1 [DATE]2024-11-30 18:52:24+08:00 [CATEGORIES]cs.CL
Noise-powered Multi-modal Knowledge Graph Representation Framework [AUTHORS]Zhuo Chen, Yin Fang, Yichi Zhang, Lingbing Guo, Jiaoyan Che, Jeff Z. Pan, Huajun Chen, Wen Zhang [ABSTRACT]The rise of Multi-modal Pre-training highlights the necessity for a unified
Multi-Modal Knowledge Graph (MMKG) representation learning framework. Such a
framework is essential for embedding structured knowledge into multi-modal
Large Language Models effectively, alleviating issues like knowledge
misconceptions and multi-modal hallucinations. In this work, we explore the
efficacy of models in accurately embedding entities within MMKGs through two
pivotal tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal
Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG
method that utilizes a Transformer-based architecture equipped with
modality-level noise masking to robustly integrate multi-modal entity features
in KGs. By incorporating specific training objectives for both MKGC and MMEA,
our approach achieves SOTA performance across a total of ten datasets,
demonstrating its versatility. Moreover, SNAG can not only function as a
standalone model but also enhance other existing methods, providing stable
performance improvements. Code and data are available at
https://github.com/zjukg/SNAG. [COMMENTS]COLING 2025 Accepted, Repo is available at
https://github.com/zjukg/SNAG [LINK]http://arxiv.org/abs/2403.06832v3 [DATE]2024-11-30 12:53:04+08:00 [CATEGORIES]cs.CL
Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection [AUTHORS]Shanu Kumar, Saish Mendke, Karody Lubna Abdul Rahman, Santosh Kurasa, Parag Agrawal, Sandipan Dandapat [ABSTRACT]Chain-of-thought (CoT) prompting has significantly enhanced the capability of
large language models (LLMs) by structuring their reasoning processes. However,
existing methods face critical limitations: handcrafted demonstrations require
extensive human expertise, while trigger phrases are prone to inaccuracies. In
this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method,
a novel approach that improves CoT prompting by utilizing uncertainty estimates
to select effective demonstrations without needing access to model parameters.
Unlike traditional methods, ZEUS offers high sensitivity in distinguishing
between helpful and ineffective questions, ensuring more precise and reliable
selection. Our extensive evaluation shows that ZEUS consistently outperforms
existing CoT strategies across four challenging reasoning benchmarks,
demonstrating its robustness and scalability. [COMMENTS]Accepted in COLING 2025 [LINK]http://arxiv.org/abs/2412.00353v1 [DATE]2024-11-30 12:22:00+08:00 [CATEGORIES]cs.CL
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration [AUTHORS]Xin Guan, Nathaniel Demchak, Saloni Gupta, Ze Wang, Ediz Ertekin Jr., Adriano Koshiyama, Emre Kazim, Zekun Wu [ABSTRACT]The development of unbiased large language models is widely recognized as
crucial, yet existing benchmarks fall short in detecting biases due to limited
scope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the
first holistic benchmarking pipeline to address these problems. The pipeline
encompasses five core stages: scraping materials, assembling benchmarks,
generating responses, extracting numeric features, and diagnosing with
disparity metrics. SAGED includes metrics for max disparity, such as impact
ratio, and bias concentration, such as Max Z-scores. Noticing that assessment
tool bias and contextual bias in prompts can distort evaluation, SAGED
implements counterfactual branching and baseline calibration for mitigation.
For demonstration, we use SAGED on G20 Countries with popular 8b-level models
including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we
find that while Mistral and Qwen2 show lower max disparity and higher bias
concentration than Gemma2 and Llama3.1, all models are notably biased against
countries like Russia and (except for Qwen2) China. In further experiments in
which models role-play U.S. (vice-/former-) presidents, we see bias amplify
and shift in heterogeneous directions. Moreover, we see that Qwen2 and Mistral
do not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably
more intensively than Biden and Harris, indicating role-playing performance bias in
these models. [COMMENTS]COLING 2025 Main Conference [LINK]http://arxiv.org/abs/2409.11149v4 [DATE]2024-11-30 10:21:25+08:00 [CATEGORIES]cs.CL
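The two families of disparity metrics named above (max disparity, e.g. impact ratio, and bias concentration, e.g. Max Z-score) can be illustrated on per-group sentiment scores. The definitions below are assumptions chosen for illustration: impact ratio as the lowest group mean divided by the highest, and Max Z-score as the largest absolute z-score of a group mean against the other group means. The pipeline's exact definitions may differ, and the group names and scores are made up.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_means(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the extracted feature (e.g. sentiment) within each group."""
    return {group: mean(values) for group, values in scores.items()}


def impact_ratio(scores: Dict[str, List[float]]) -> float:
    """Assumed definition: lowest group mean / highest group mean.
    Values near 1 mean groups are treated alike (assumes positive scores)."""
    means = list(group_means(scores).values())
    highest = max(means)
    if highest == 0:
        raise ValueError("highest group mean is zero; ratio undefined")
    return min(means) / highest


def max_z_score(scores: Dict[str, List[float]]) -> float:
    """Assumed definition: largest absolute z-score among the group means."""
    means = list(group_means(scores).values())
    centre, spread = mean(means), pstdev(means)
    if spread == 0:
        return 0.0
    return max(abs(m - centre) / spread for m in means)


# Hypothetical sentiment scores for responses about three groups.
scores = {"A": [0.8, 0.6], "B": [0.5, 0.5], "C": [0.2, 0.4]}
print(round(impact_ratio(scores), 3), round(max_z_score(scores), 3))
```

A low impact ratio signals a wide gap between the best- and worst-treated groups, while a high Max Z-score signals that the bias is concentrated on one outlying group.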
GADFA: Generator-Assisted Decision-Focused Approach for Opinion Expressing Timing Identification [AUTHORS]Chung-Chi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao, Hsin-Hsi Chen [ABSTRACT]The advancement of text generation models has granted us the capability to
produce coherent and convincing text on demand. Yet, in real-life
circumstances, individuals do not continuously generate text or voice their
opinions. For instance, consumers pen product reviews after weighing the merits
and demerits of a product, and professional analysts issue reports following
significant news releases. In essence, opinion expression is typically prompted
by particular reasons or signals. Despite long-standing developments in opinion
mining, the appropriate timing for expressing an opinion remains largely
unexplored. To address this deficit, our study introduces an innovative task -
the identification of news-triggered opinion expressing timing. We ground this
task in the actions of professional stock analysts and develop a novel dataset
for investigation. Our approach is decision-focused, leveraging text generation
models to steer the classification model, thus enhancing overall performance.
Our experimental findings demonstrate that the text generated by our model
contributes fresh insights from various angles, effectively aiding in
identifying the optimal timing for opinion expression. [COMMENTS]Accepted: COLING-2025 [LINK]http://arxiv.org/abs/2410.01169v2 [DATE]2024-11-30 09:04:31+08:00 [CATEGORIES]cs.CL
2024 Dec 06, Fri
Context-Informed Machine Translation of Manga using Multimodal Large Language Models [AUTHORS]Philip Lippmann, Konrad Skublicki, Joshua Tanner, Shonosuke Ishiwatari, Jie Yang [ABSTRACT]Due to the significant time and effort required for handcrafting
translations, most manga never leave the domestic Japanese market. Automatic
manga translation is a promising potential solution. However, it is a budding
and underdeveloped field and presents complexities even greater than those
found in standard translation due to the need to effectively incorporate visual
elements into the translation process to resolve ambiguities. In this work, we
investigate to what extent multimodal large language models (LLMs) can provide
effective manga translation, thereby assisting manga authors and publishers in
reaching wider audiences. Specifically, we propose a methodology that leverages
the vision component of multimodal LLMs to improve translation quality,
evaluate the impact of translation unit size and context length, and propose a
token-efficient approach for manga translation. Moreover, we introduce a new
evaluation dataset -- the first parallel Japanese-Polish manga translation
dataset -- as part of a benchmark to be used in future research. Finally, we
contribute an open-source software suite, enabling others to benchmark LLMs for
manga translation. Our findings demonstrate that our proposed methods achieve
state-of-the-art results for Japanese-English translation and set a new
standard for Japanese-Polish. [COMMENTS]COLING 2025 [LINK]http://arxiv.org/abs/2411.02589v2 [DATE]2024-12-06 01:41:48+08:00 [CATEGORIES]cs.CL
Unveiling Entity-Level Unlearning for Large Language Models: A Comprehensive Analysis [AUTHORS]Weitao Ma, Xiaocheng Feng, Weihong Zhong, Lei Huang, Yangfan Ye, Xiachong Feng, Bing Qin [ABSTRACT]Large language model unlearning has garnered increasing attention due to its
potential to address security and privacy concerns, leading to extensive
research in the field. However, much of this research has concentrated on
instance-level unlearning, specifically targeting the removal of predefined
instances containing sensitive content. This focus has left a significant gap
in the exploration of full entity-level unlearning, which is critical in
real-world scenarios such as copyright protection. To this end, we propose a
novel task of Entity-level unlearning, which aims to erase entity-related
knowledge from the target model completely. To thoroughly investigate this
task, we systematically evaluate trending unlearning algorithms, revealing that
current methods struggle to achieve effective entity-level unlearning. Then, we
further explore the factors that influence the performance of the unlearning
algorithms, identifying that knowledge coverage and the size of the forget set
play pivotal roles. Notably, our analysis also uncovers that entities
introduced through fine-tuning are more vulnerable to unlearning than
pre-trained entities. These findings collectively offer valuable insights for
advancing entity-level unlearning for LLMs. [COMMENTS]Accepted by COLING 2025 [LINK]http://arxiv.org/abs/2406.15796v5 [DATE]2024-12-06 00:13:09+08:00 [CATEGORIES]cs.CL
2024 Dec 05, Thu
Representation Purification for End-to-End Speech Translation [AUTHORS]Chengwei Zhang, Yue Zhou, Rui Zhao, Yidong Chen, Xiaodong Shi [ABSTRACT]Speech-to-text translation (ST) is a cross-modal task that involves
converting spoken language into text in a different language. Previous research
primarily focused on enhancing speech translation by facilitating knowledge
transfer from machine translation, exploring various methods to bridge the gap
between speech and text modalities. Despite substantial progress made, factors
in speech that are not relevant to translation content, such as timbre and
rhythm, often limit the efficiency of knowledge transfer. In this paper, we
conceptualize speech representation as a combination of content-agnostic and
content-relevant factors. We examine the impact of content-agnostic factors on
translation performance through preliminary experiments and observe a
significant performance deterioration when content-agnostic perturbations are
introduced to speech signals. To address this issue, we propose a
Speech Representation Purification with
Supervision Enhancement (SRPSE) framework, which excludes the
content-agnostic components within speech representations to mitigate their
negative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate
that SRPSE significantly improves translation performance across all
translation directions in three settings and achieves preeminent performance
under a transcript-free setting. [COMMENTS]Accepted by COLING 2025 [LINK]http://arxiv.org/abs/2412.04266v1 [DATE]2024-12-05 23:50:44+08:00 [CATEGORIES]cs.CL
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios [AUTHORS]Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, Xuanjing Huang [ABSTRACT]Existing evaluations of tool learning primarily focus on validating the
alignment of selected tools for large language models (LLMs) with expected
outcomes. However, these approaches rely on a limited set of scenarios where
answers can be pre-determined, diverging from genuine needs. Furthermore, a
sole emphasis on outcomes disregards the complex capabilities required for LLMs
to effectively use tools. To tackle this issue, we propose ToolEyes, a
fine-grained system tailored for the evaluation of the LLMs' tool learning
capabilities in authentic scenarios. The system meticulously examines seven
real-world scenarios, analyzing five dimensions crucial to LLMs in tool
learning: format alignment, intent comprehension, behavior planning, tool
selection, and answer organization. Additionally, ToolEyes incorporates a tool
library boasting approximately 600 tools, serving as an intermediary between
LLMs and the physical world. Evaluations involving ten LLMs across three
categories reveal a preference for specific scenarios and limited cognitive
abilities in tool learning. Intriguingly, expanding the model size even
exacerbates the hindrance to tool learning. The code and data are available at
https://github.com/Junjie-Ye/ToolEyes. [COMMENTS]Accepted by COLING 2025 conference [LINK]http://arxiv.org/abs/2401.00741v3 [DATE]2024-12-05 15:05:59+08:00 [CATEGORIES]cs.CL
LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings [AUTHORS]Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé [ABSTRACT]Sentence embedding models play a key role in various Natural Language
Processing tasks, such as in Topic Modeling, Document Clustering and
Recommendation Systems. However, these models rely heavily on parallel data,
which can be scarce for many low-resource languages, including Luxembourgish.
This scarcity results in suboptimal performance of monolingual and
cross-lingual sentence embedding models for these languages. To address this
issue, we compile a relatively small but high-quality human-generated
cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence
embedding model for Luxembourgish with strong cross-lingual capabilities.
Additionally, we present evidence suggesting that including low-resource
languages in parallel training datasets can be more advantageous for other
low-resource languages than relying solely on high-resource language pairs.
Furthermore, recognizing the lack of sentence embedding benchmarks for
low-resource languages, we create a paraphrase detection benchmark specifically
for Luxembourgish, aiming to partially fill this gap and promote further
research. [COMMENTS]Accepted at COLING 2025 [LINK]http://arxiv.org/abs/2412.03331v2 [DATE]2024-12-05 15:05:57+08:00 [CATEGORIES]cs.CL
PreAct: Prediction Enhances Agent's Planning Ability [AUTHORS]Dayuan Fu, Jianzhao Huang, Siyuan Lu, Guanting Dong, Yejie Wang, Keqing He, Weiran Xu [ABSTRACT]Addressing the disparity between forecasts and actual results can enable
individuals to expand their thought processes and stimulate self-reflection,
thus promoting accurate planning. In this research, we present **PreAct**, an
agent framework that integrates **pre**diction, **rea**soning, and **act**ion.
By utilizing the information derived from predictions, the large language model
(LLM) agent can provide a wider range and more strategically focused reasoning.
This leads to more efficient actions that aid the agent in accomplishing
intricate tasks. Our experimental results show that PreAct surpasses the ReAct
method in completing complex tasks and that PreAct's performance can be further
improved when paired with other memory or selection strategy techniques. We
presented the model with varying quantities of historical predictions and
discovered that these predictions consistently enhance LLM planning. The
variances in single-step reasoning between PreAct and ReAct indicate that
PreAct indeed has benefits in terms of diversity and strategic orientation over
ReAct. [COMMENTS]Coling 2025 [LINK]http://arxiv.org/abs/2402.11534v2 [DATE]2024-12-05 12:40:54+08:00 [CATEGORIES]cs.CL
Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings [AUTHORS]Guy Barel, Oren Tsur, Dan Volenchik [ABSTRACT]Stance detection plays a pivotal role in enabling an extensive range of
downstream applications, from discourse parsing to tracing the spread of fake
news and the denial of scientific facts. While most stance classification
models rely on textual representation of the utterance in question, prior work
has demonstrated the importance of the conversational context in stance
detection. In this work we introduce TASTE -- a multimodal architecture for
stance detection that harmoniously fuses Transformer-based content embedding
with unsupervised structural embedding. Through the fine-tuning of a pretrained
transformer and the amalgamation with social embedding via a Gated Residual
Network (GRN) layer, our model adeptly captures the complex interplay between
content and conversational structure in determining stance. TASTE achieves
state-of-the-art results on common benchmarks, significantly outperforming an
array of strong baselines. Comparative evaluations underscore the benefits of
social grounding -- emphasizing the criticality of concurrently harnessing both
content and structure for enhanced stance detection. [COMMENTS]The modified camera ready version will be published in January 2025
at COLING [LINK]http://arxiv.org/abs/2412.03681v1 [DATE]2024-12-05 03:23:37+08:00 [CATEGORIES]cs.CL
Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation [AUTHORS]Weihua Wang, Qiuyu Liang, Feilong Bao, Guanglai Gao [ABSTRACT]A quaternion contains one real part and three imaginary parts, which provides a
more expressive hypercomplex space for learning knowledge graphs. Existing
quaternion embedding models measure the plausibility of a triplet either
through semantic matching or geometric distance scoring functions. However, it
appears that semantic matching diminishes the separability of entities, while
the distance scoring function weakens the semantics of entities. To address
this issue, we propose a novel quaternion knowledge graph embedding model. Our
model combines semantic matching with the geometric distance between entities to
better measure the plausibility of triplets. Specifically, in the quaternion space, we
perform a right rotation on the head entity and a reverse rotation on the tail entity
to learn rich semantic features. Then, we utilize distance adaptive
translations to learn geometric distance between entities. Furthermore, we
provide mathematical proofs to demonstrate our model can handle complex logical
relationships. Extensive experimental results and analyses show our model
significantly outperforms previous models on well-known knowledge graph
completion benchmark datasets. Our code is available at
https://github.com/llqy123/DaBR. [COMMENTS]Accepted by COLING 2025 [LINK]http://arxiv.org/abs/2412.04076v1 [DATE]2024-12-05 19:17:03+08:00 [CATEGORIES]cs.LG
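The quaternion rotations mentioned above rest on the Hamilton product. The following is a minimal sketch of that building block only, not the DaBR scoring function: it assumes a "right rotation" means right-multiplying an entity quaternion by a unit relation quaternion, and that the "reverse rotation" uses the conjugate of that unit quaternion. The helper names and example values are hypothetical.

```python
import math
from typing import Tuple

# (a, b, c, d) represents a + b*i + c*j + d*k
Quaternion = Tuple[float, float, float, float]


def hamilton(p: Quaternion, q: Quaternion) -> Quaternion:
    """Hamilton product p * q (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def norm(q: Quaternion) -> float:
    return math.sqrt(sum(x * x for x in q))


def normalize(q: Quaternion) -> Quaternion:
    n = norm(q)
    if n == 0:
        raise ValueError("cannot normalize the zero quaternion")
    return tuple(x / n for x in q)  # type: ignore[return-value]


def conjugate(q: Quaternion) -> Quaternion:
    a, b, c, d = q
    return (a, -b, -c, -d)


def rotate_right(entity: Quaternion, relation: Quaternion) -> Quaternion:
    """Assumed 'right rotation': entity * unit(relation)."""
    return hamilton(entity, normalize(relation))


head = (1.0, 2.0, 3.0, 4.0)
relation = (0.5, -1.0, 2.0, 0.25)
rotated = rotate_right(head, relation)
# Multiplying by the conjugate of the unit quaternion undoes the rotation.
restored = hamilton(rotated, conjugate(normalize(relation)))
print(round(norm(head), 6), round(norm(rotated), 6))
```

Because the relation is normalized to unit length, the rotation changes the direction of the entity embedding but not its norm, which is why a separate distance term is needed to model geometric separation.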
2024 Dec 04, Wed
DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles [AUTHORS]Jiaxuan Liu, Zhaoci Liu, Yajun Hu, Yingying Gao, Shilei Zhang, Zhenhua Ling [ABSTRACT]Human speech exhibits rich and flexible prosodic variations. To address the
one-to-many mapping problem from text to prosody in a reasonable and flexible
manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a
conditional diffusion module and an improved classifier-free guidance, which
hierarchically models speech prosodic features, and controls different prosodic
styles to guide prosody prediction. Experiments show that our method
outperforms all baselines in naturalness and achieves superior synthesis speed
compared to three diffusion-based baselines. Additionally, by adjusting the
guiding scale, DiffStyleTTS effectively controls the guidance intensity of the
synthetic prosody. [COMMENTS]COLING 2025 [LINK]http://arxiv.org/abs/2412.03388v1 [DATE]2024-12-04 23:17:25+08:00 [CATEGORIES]cs.CL
Sibyl: Empowering Empathetic Dialogue Generation in Large Language Models via Sensible and Visionary Commonsense Inference [AUTHORS]Lanrui Wang, Jiangnan Li, Chenxu Yang, Zheng Lin, Hongyin Tang, Huan Liu, Yanan Cao, Jingang Wang, Weiping Wang [ABSTRACT]Recently, there has been a heightened interest in building chatbots based on
Large Language Models (LLMs) to emulate human-like qualities in multi-turn
conversations. Despite having access to commonsense knowledge to better
understand the psychological aspects and causality of dialogue context, even
these powerful LLMs struggle to achieve the goals of empathy and emotional
support. Current commonsense knowledge derived from dialogue contexts is
inherently limited and often fails to adequately anticipate the future course
of a dialogue. This lack of foresight can mislead LLMs and hinder their ability
to provide effective support. In response to this challenge, we present an
innovative framework named Sensible and Visionary Commonsense Knowledge
(Sibyl). Designed to concentrate on the immediately succeeding dialogue, this
paradigm equips LLMs with the capability to uncover the implicit requirements
of the conversation, aiming to elicit more empathetic responses. Experimental
results demonstrate that incorporating our paradigm for acquiring commonsense
knowledge into LLMs comprehensively enhances the quality of their responses. [COMMENTS]Accepted by COLING 2025 [LINK]http://arxiv.org/abs/2311.15316v4 [DATE]2024-12-04 12:08:49+08:00 [CATEGORIES]cs.CL
A Combinatorial Approach to Neural Emergent Communication [AUTHORS]Zheyuan Zhang [ABSTRACT]Substantial research on deep learning-based emergent communication uses the
referential game framework, specifically the Lewis signaling game. However, we
argue that successful communication in this game typically needs only one or two
symbols for target image classification because of a sampling pitfall in the
training data. To address this issue, we provide a theoretical analysis and
introduce a combinatorial algorithm SolveMinSym (SMS) to solve the symbolic
complexity for classification, which is the minimum number of symbols in the
message for successful communication. We use the SMS algorithm to create
datasets with different symbolic complexity to empirically show that data with
higher symbolic complexity increases the number of effective symbols in the
emergent language. [COMMENTS]Accepted to COLING 2025 [LINK]http://arxiv.org/abs/2410.18806v2 [DATE]2024-12-04 05:48:17+08:00 [CATEGORIES]cs.LG cs.CL
2024 Dec 03, Tue
Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining [AUTHORS]Zongru Wu, Pengzhou Cheng, Lingyong Fang, Zhuosheng Zhang, Gongshen Liu [ABSTRACT]Backdoor attacks remain significant security threats to generative large
language models (LLMs). Since generative LLMs output sequences of
high-dimensional token logits instead of low-dimensional classification logits,
most existing backdoor defense methods designed for discriminative models like
BERT are ineffective for generative LLMs. Inspired by the observed differences
in learning behavior between backdoor and clean mapping in the frequency space,
we transform gradients of each training sample, directly influencing parameter
updates, into the frequency space. Our findings reveal a distinct separation
between the gradients of backdoor and clean samples in the frequency space.
Based on this phenomenon, we propose Gradient Clustering in the Frequency Space
for Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients
in the frequency space to effectively identify backdoor samples without
requiring retraining LLMs. Experimental results show that GraCeFul outperforms
baselines significantly. Notably, GraCeFul exhibits remarkable computational
efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor
samples, reducing the average success rate of various backdoor attacks to 0%
with negligible drops in clean accuracy across multiple free-style question
answering datasets. Additionally, GraCeFul generalizes to Llama-2 and Vicuna.
The codes are publicly available at https://github.com/ZrW00/GraceFul. [COMMENTS]Accepted at COLING 2025 [LINK]http://arxiv.org/abs/2412.02454v1 [DATE]2024-12-03 21:43:36+08:00 [CATEGORIES]cs.CL
Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning [AUTHORS]Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi [ABSTRACT]Online abusive content detection, particularly in low-resource settings and
within the audio modality, remains underexplored. We investigate the potential
of pre-trained audio representations for detecting abusive language in
low-resource languages, in this case Indian languages, using Few-Shot
Learning (FSL). Leveraging powerful representations from models such as Wav2Vec
and Whisper, we explore cross-lingual abuse detection using the ADIMA dataset
with FSL. Our approach integrates these representations within the
Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in
10 languages. We experiment with various shot sizes (50-200) evaluating the
impact of limited data on performance. Additionally, a feature visualization
study was conducted to better understand model behaviour. This study highlights
the generalization ability of pre-trained models in low-resource scenarios and
offers valuable insights into detecting abusive language in multilingual
contexts. [COMMENTS]Accepted as part of the proceedings of COLING 2025 [LINK]http://arxiv.org/abs/2412.01408v2 [DATE]2024-12-03 15:52:35+08:00 [CATEGORIES]cs.CL
BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition [AUTHORS]Quanjiang Guo, Yihong Dong, Ling Tian, Zhao Kang, Yu Zhang, Sijie Wang [ABSTRACT]Despite the recent success of two-stage prototypical networks in few-shot
named entity recognition (NER), challenges such as over/under-detected false
spans in the span detection stage and unaligned entity prototypes in the type
classification stage persist. Additionally, LLMs have not proven to be
effective few-shot information extractors in general. In this paper, we propose
an approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to
address these issues. We introduce a boundary-aware contrastive learning
strategy to enhance the LLM's ability to perceive entity boundaries for
generalized entity spans. Additionally, we utilize LoRAHub to align information
from the target domain to the source domain, thereby enhancing adaptive
cross-domain classification capabilities. Extensive experiments across various
benchmarks demonstrate that our framework outperforms prior methods, validating
its effectiveness. In particular, the proposed strategies demonstrate
effectiveness across a range of LLM architectures. The code and data are
released at https://github.com/UESTC-GQJ/BANER. [COMMENTS]To appear at COLING 2025 [LINK]http://arxiv.org/abs/2412.02228v1 [DATE]2024-12-03 15:51:14+08:00 [CATEGORIES]cs.CL cs.LG
NüshuRescue: Revitalization of the endangered Nüshu Language with AI [AUTHORS]Ivory Yang, Weicheng Ma, Soroush Vosoughi [ABSTRACT]The preservation and revitalization of endangered and extinct languages is a
meaningful endeavor, conserving cultural heritage while enriching fields like
linguistics and anthropology. However, these languages are typically
low-resource, making their reconstruction labor-intensive and costly. This
challenge is exemplified by Nüshu, a rare script historically used by Yao
women in China for self-expression within a patriarchal society. To address
this challenge, we introduce NüshuRescue, an AI-driven framework designed to
train large language models (LLMs) on endangered languages with minimal data.
NüshuRescue automates evaluation and expands target corpora to accelerate
linguistic revitalization. As a foundational component, we developed NCGold, a
500-sentence Nüshu-Chinese parallel corpus, the first publicly available
dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu
and only 35 short examples from NCGold, NüshuRescue achieved 48.69%
translation accuracy on 50 withheld sentences and generated NCSilver, a set of
98 newly translated modern Chinese sentences of varying lengths. A sample of
both NCGold and NCSilver is included in the Supplementary Materials.
Additionally, we developed FastText-based and Seq2Seq models to further support
research on Nüshu. NüshuRescue provides a versatile and scalable tool for
the revitalization of endangered languages, minimizing the need for extensive
human input. [COMMENTS]Accepted to COLING 2025 [LINK]http://arxiv.org/abs/2412.00218v2 [DATE]2024-12-03 12:38:31+08:00 [CATEGORIES]cs.CL cs.LG
Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index [AUTHORS]Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami [ABSTRACT]As prompt engineering research rapidly evolves, evaluations beyond accuracy
are crucial for developing cost-effective techniques. We present the Economical
Prompting Index (EPI), a novel metric that combines accuracy scores with token
consumption, adjusted by a user-specified cost concern level to reflect
different resource constraints. Our study examines 6 advanced prompting
techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts,
across 10 widely-used language models and 4 diverse datasets. We demonstrate
that approaches such as Self-Consistency often provide statistically
insignificant gains while becoming cost-prohibitive. For example, on
high-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques
like Chain-of-Thought (0.72) surpasses more complex methods like
Self-Consistency (0.64) at slight cost concern levels. Our findings suggest a
reevaluation of complex prompting strategies in resource-constrained scenarios,
potentially reshaping future research priorities and improving
cost-effectiveness for end-users. [COMMENTS]5 pages (excluding references), accepted to Coling 2025 [LINK]http://arxiv.org/abs/2412.01690v1 [DATE]2024-12-03 00:34:18+08:00 [CATEGORIES]cs.CL
2024 Dec 02, Mon
NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers [AUTHORS]Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami [ABSTRACT]Large Language Models (LLMs) have shown impressive performance on various
benchmarks, yet their ability to engage in deliberate reasoning remains
questionable. We present NYT-Connections, a collection of 358 simple word
classification puzzles derived from the New York Times Connections game. This
benchmark is designed to penalize quick, intuitive "System 1" thinking,
isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple
machine learning heuristic, and humans across three configurations:
single-attempt, multiple attempts without hints, and multiple attempts with
contextual hints. Our findings reveal a significant performance gap: even
top-performing LLMs like GPT-4 fall short of human performance by nearly 30%.
Notably, advanced prompting techniques such as Chain-of-Thought and
Self-Consistency show diminishing returns as task difficulty increases.
NYT-Connections uniquely combines linguistic isolation, resistance to intuitive
shortcuts, and regular updates to mitigate data leakage, offering a novel tool
for assessing LLM reasoning capabilities. [COMMENTS]5 pages (excluding references), accepted to Coling 2025 [LINK]http://arxiv.org/abs/2412.01621v1 [DATE]2024-12-02 23:41:47+08:00 [CATEGORIES]cs.CL
Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem [AUTHORS]Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt [ABSTRACT]Natural language explanations (NLEs) are vital for elucidating the reasoning
behind large language model (LLM) decisions. Many techniques have been
developed to generate NLEs using LLMs. However, like humans, LLMs might not
always produce optimal NLEs on the first attempt. Inspired by human learning
processes, we introduce Cross-Refine, which employs role modeling by deploying
two LLMs as generator and critic, respectively. The generator outputs a first
NLE and then refines this initial explanation using feedback and suggestions
provided by the critic. Cross-Refine does not require any supervised training
data or additional training. We validate Cross-Refine across three NLP tasks
using three state-of-the-art open-source LLMs through automatic and human
evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which
only utilizes self-feedback to refine the explanations. Our findings from
automatic evaluation and a user study indicate that Cross-Refine outperforms
Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful
LLMs, whereas Self-Refine only yields strong results with ChatGPT.
Additionally, we conduct an ablation study to assess the importance of feedback
and suggestions. Both of them play an important role in refining explanations.
We further evaluate Cross-Refine on a bilingual dataset in English and German. [COMMENTS]Accepted at COLING 2025; long paper [LINK]http://arxiv.org/abs/2409.07123v2 [DATE]2024-12-02 21:04:18+08:00 [CATEGORIES]cs.CLcs.LG
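The generator/critic loop described in the Cross-Refine abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `cross_refine`, the prompt wording, and the two toy stand-in "models" are all hypothetical, and a real setup would call two separate LLMs.

```python
from typing import Callable

def cross_refine(question: str,
                 generator: Callable[[str], str],
                 critic: Callable[[str], str]) -> str:
    """Generator drafts an explanation, critic comments, generator refines once."""
    # Step 1: the generator produces a first explanation.
    draft = generator(f"Explain the answer to: {question}")
    # Step 2: the critic returns feedback and suggestions on that draft.
    critique = critic(f"Give feedback and suggestions for this explanation: {draft}")
    # Step 3: the generator rewrites its draft using the critic's output.
    return generator(
        f"Question: {question}\nDraft: {draft}\nFeedback: {critique}\n"
        "Rewrite the explanation using the feedback."
    )

# Toy stand-ins for the two LLMs (hypothetical; they only mimic the call pattern).
def toy_generator(prompt: str) -> str:
    return "refined explanation" if "Feedback:" in prompt else "first explanation"

def toy_critic(prompt: str) -> str:
    return "mention the key evidence"

result = cross_refine("Why is the sky blue?", toy_generator, toy_critic)
```

Passing the two roles as plain callables keeps the sketch independent of any particular model API; no training step appears, in line with the abstract's statement that no supervised data or additional training is needed.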
GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering [AUTHORS]Qianlong Li, Chen Huang, Shuai Li, Yuanxin Xiang, Deng Xiong, Wenqiang Lei [ABSTRACT]Complex Table Question Answering involves providing accurate answers to
specific questions based on intricate tables that exhibit complex layouts and
flexible header locations. Despite considerable progress having been made in
the LLM era, the reasoning processes of existing methods are often implicit,
feeding the entire table into prompts, making it difficult to effectively
filter out irrelevant information in the table. To this end, we propose
GraphOTTER that explicitly establishes the reasoning process to pinpoint the
correct answers. In particular, GraphOTTER leverages a graph-based
representation, transforming the complex table into an undirected graph. It
then conducts step-by-step reasoning on the graph, with each step guided by a
set of pre-defined intermediate reasoning actions. As such, it constructs a
clear reasoning path and effectively identifies the answer to a given question.
Comprehensive experiments on two benchmark datasets and two LLM backbones
demonstrate the effectiveness of GraphOTTER. Further analysis indicates that
its success may be attributed to the ability to efficiently filter out
irrelevant information, thereby focusing the reasoning process on the most
pertinent data. Our code and experimental datasets are available at
https://github.com/JDing0521/GraphOTTER. [COMMENTS]COLING 2025, code is available at
https://github.com/JDing0521/GraphOTTER [LINK]http://arxiv.org/abs/2412.01230v1 [DATE]2024-12-02 15:49:23+08:00 [CATEGORIES]cs.CL
LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks [AUTHORS]Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, Samy Jelassi [ABSTRACT]Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient
fine-tuning of Large Language Models (LLMs). We study how different LoRA
modules can be merged to achieve skill composition -- testing the performance
of the merged model on a target task that involves combining multiple skills,
each skill coming from a single LoRA. This setup is favorable when it is
difficult to obtain training data for the target task and when it can be
decomposed into multiple skills. First, we identify practically occurring
use-cases that can be studied under the realm of skill composition, e.g.
solving hard math-word problems with code, creating a bot to answer questions
on proprietary manuals or about domain-specialized corpora. Our main
contribution is to show that concatenation of LoRAs (CAT), which optimally
weights LoRAs that were individually trained on different skills, outperforms
existing model- and data-merging techniques; for instance, on math-word
problems, CAT beats these methods by an average of 43% and 12% respectively.
Thus, this paper advocates model merging as an efficient way to solve
compositional tasks and underscores CAT as a simple, compute-friendly and
effective procedure. To our knowledge, this is the first work demonstrating the
superiority of model merging over data mixing for binary skill composition
tasks. Code and data are available at https://github.com/aksh555/LoRA-Soups [COMMENTS]COLING 2025 Industry track; 9 pages plus references and appendices [LINK]http://arxiv.org/abs/2410.13025v2 [DATE]2024-12-02 14:40:50+08:00 [CATEGORIES]cs.CLcs.LG
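One way to read "concatenation of LoRAs" in the LoRA Soups abstract above: a weighted sum of two low-rank updates equals a single low-rank update whose factors are the concatenated, scaled factors of the originals. The sketch below checks that identity with NumPy. The shapes, the random factors, and the fixed weights `alpha1` and `alpha2` are assumptions for illustration; the abstract says the weights are chosen optimally but does not say how.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2  # hypothetical layer and rank sizes

# Two hypothetical LoRA modules, each contributing delta_W_i = B_i @ A_i.
B1, A1 = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))
B2, A2 = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))

# Mixing weights: fixed here for illustration only.
alpha1, alpha2 = 0.7, 0.3

# Weighted sum of the two low-rank updates.
weighted_sum = alpha1 * (B1 @ A1) + alpha2 * (B2 @ A2)

# Same update expressed as one rank-2r LoRA built by concatenation.
B_cat = np.concatenate([alpha1 * B1, alpha2 * B2], axis=1)  # shape (d_out, 2r)
A_cat = np.concatenate([A1, A2], axis=0)                    # shape (2r, d_in)
merged = B_cat @ A_cat
```

Here `merged` and `weighted_sum` agree up to floating-point error, so the merged adapter can be stored as a single pair of factors of rank `2r`.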
QABISAR: Query-Article Bipartite Interactions for Statutory Article Retrieval [AUTHORS]T. Y. S. S. Santosh, Hassan Sarwat, Matthias Grabmair [ABSTRACT]In this paper, we introduce QABISAR, a novel framework for statutory article
retrieval, to overcome the semantic mismatch problem that arises when each
query-article pair is modeled in isolation, which makes it hard to learn representations that
can effectively capture multi-faceted information. QABISAR leverages bipartite
interactions between queries and articles to capture diverse aspects inherent
in them. Further, we employ knowledge distillation to transfer enriched query
representations from the graph network into the query bi-encoder, to capture
the rich semantics present in the graph representations, despite the absence of
graph-based supervision for unseen queries during inference. Our experiments on
a real-world expert-annotated dataset demonstrate its effectiveness. [COMMENTS]Accepted to COLING 2025 [LINK]http://arxiv.org/abs/2412.00934v1 [DATE]2024-12-02 02:58:17+08:00 [CATEGORIES]cs.CL
Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting [AUTHORS]Thilini Wijesiriwardene, Ruwan Wickramarachchi, Sreeram Vennam, Vinija Jain, Aman Chadha, Amitava Das, Ponnurangam Kumaraguru, Amit Sheth [ABSTRACT]Making analogies is fundamental to cognition. Proportional analogies, which
consist of four terms, are often used to assess linguistic and cognitive
abilities. For instance, completing analogies like "Oxygen is to Gas as ___
is to ___" requires identifying the semantic relationship (e.g., "type of")
between the first pair of terms ("Oxygen" and "Gas") and finding a second pair
that shares the same relationship (e.g., "Aluminum" and "Metal"). In this work,
we introduce a 15K Multiple-Choice Question Answering (MCQA) dataset for
proportional analogy completion and evaluate the performance of contemporary
Large Language Models (LLMs) in various knowledge-enhanced prompt settings.
Specifically, we augment prompts with three types of knowledge: exemplar,
structured, and targeted. Our results show that despite extensive training
data, solving proportional analogies remains challenging for current LLMs, with
the best model achieving an accuracy of 55%. Notably, we find that providing
targeted knowledge can better assist models in completing proportional
analogies compared to providing exemplars or collections of structured
knowledge. [COMMENTS]Accepted at COLING 2025 [LINK]http://arxiv.org/abs/2412.00869v1 [DATE]2024-12-02 00:15:14+08:00 [CATEGORIES]cs.CL
</li>
</ul>
</details>
2024 Dec 01, Sun
ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [AUTHORS]Kung-Hsiang Huang, Hou Pong Chan, Kathleen McKeown, Heng Ji [ABSTRACT]Considerable advancements have been made to tackle the misrepresentation of
information derived from reference articles in the domains of fact-checking and
faithful summarization. However, an unaddressed aspect remains - the
identification of social media posts that manipulate information within
associated news articles. This task presents a significant challenge, primarily
due to the prevalence of personal opinions in such posts. We present a novel
task, identifying manipulation of news on social media, which aims to detect
manipulation in social media posts and identify manipulated or inserted
information. To study this task, we have proposed a data collection schema and
curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and
corresponding articles. Our analysis demonstrates that this task is highly
challenging, with large language models (LLMs) yielding unsatisfactory
performance. Additionally, we have developed a simple yet effective basic model
that outperforms LLMs significantly on the ManiTweet dataset. Finally, we have
conducted an exploratory analysis of human-written tweets, unveiling intriguing
connections between manipulation and the domain and factuality of news
articles, as well as revealing that manipulated sentences are more likely to
encapsulate the main story or consequences of a news outlet. [COMMENTS]COLING 2025 [LINK]http://arxiv.org/abs/2305.14225v3 [DATE]2024-12-01 21:55:56+08:00 [CATEGORIES]cs.CL
Multi-View Incongruity Learning for Multimodal Sarcasm Detection [AUTHORS]Diandian Guo, Cong Cao, Fangfang Yuan, Yanbing Liu, Guangjie Zeng, Xiaoyan Yu, Hao Peng, Philip S. Yu [ABSTRACT]Multimodal sarcasm detection (MSD) is essential for various downstream tasks.
Existing MSD methods tend to rely on spurious correlations. These methods often
mistakenly prioritize non-essential features yet still make correct
predictions, demonstrating poor generalizability beyond training environments.
Regarding this phenomenon, this paper undertakes several initiatives. Firstly,
we identify two primary causes that lead to the reliance on spurious
correlations. Secondly, we address these challenges by proposing a novel method
that integrates Multimodal Incongruities via Contrastive Learning (MICL) for
multimodal sarcasm detection. Specifically, we first leverage incongruity to
drive multi-view learning from three views: token-patch, entity-object, and
sentiment. Then, we introduce extensive data augmentation to mitigate the
biased learning of the textual modality. Additionally, we construct a test set,
SPMSD, which consists of potential spurious correlations, to evaluate the
model's generalizability. Experimental results demonstrate the superiority of
MICL on benchmark datasets, along with the analyses showcasing MICL's
advancement in mitigating the effect of spurious correlation. [COMMENTS]Accepted to COLING 2025 [LINK]http://arxiv.org/abs/2412.00756v1 [DATE]2024-12-01 18:29:36+08:00 [CATEGORIES]cs.CL
Retrieval Augmented Instruction Tuning for Open NER with Large Language Models [AUTHORS]Tingyu Xie, Jian Zhang, Yan Zhang, Yuanyuan Liang, Qi Li, Hongwei Wang [ABSTRACT]The strong capability of large language models (LLMs) has been applied to
information extraction (IE) through either retrieval augmented prompting or
instruction tuning (IT). However, the best way to incorporate information with
LLMs for IE remains an open question. In this paper, we explore Retrieval
Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named
entity recognition (NER). Specifically, for each training sample, we retrieve
semantically similar examples from the training dataset as the context and
prepend them to the input of the original instruction. To evaluate our RA-IT
approach more thoroughly, we construct a Chinese IT dataset for open NER and
evaluate RA-IT in both English and Chinese scenarios. Experimental results
verify the effectiveness of RA-IT across various data sizes and in both English
and Chinese scenarios. We also conduct thorough studies to explore the impacts
of various retrieval strategies in the proposed RA-IT framework. Code and data
are available at: https://github.com/Emma1066/Retrieval-Augmented-IT-OpenNER [COMMENTS]To appear at COLING 2025 [LINK]http://arxiv.org/abs/2406.17305v2 [DATE]2024-12-01 17:02:35+08:00 [CATEGORIES]cs.CL
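The retrieve-and-prepend step described in the RA-IT abstract above can be illustrated with a small sketch. Everything concrete here is assumed for illustration: the helper names `retrieve_top_k` and `build_input`, the two-dimensional toy embeddings, the example strings, and the use of cosine similarity as the notion of "semantically similar".

```python
import numpy as np

def retrieve_top_k(query_vec, example_vecs, k):
    """Return indices of the k examples with highest cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = e @ q
    return np.argsort(-sims)[:k]

def build_input(instruction, examples, indices):
    """Prepend the retrieved examples, as context, to the original instruction."""
    context = "\n".join(examples[i] for i in indices)
    return f"{context}\n{instruction}"

# Toy training examples and hypothetical 2-D embeddings for them.
examples = [
    "Text: Paris is in France. Entities: Paris, France",
    "Text: The cat sat. Entities: none",
    "Text: Berlin is in Germany. Entities: Berlin, Germany",
]
example_vecs = np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.2]])
query_vec = np.array([1.0, 0.0])

top = retrieve_top_k(query_vec, example_vecs, k=2)
prompt = build_input("Text: Rome is in Italy. Entities:", examples, top)
```

With these toy vectors the two location-style examples are retrieved ahead of the unrelated one, and `prompt` holds them followed by the original instruction.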
A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning [AUTHORS]Xinzhe Li [ABSTRACT]Tool use, planning, and feedback learning are currently three prominent
paradigms for developing Large Language Model (LLM)-based agents across various
tasks. Although numerous frameworks have been devised for each paradigm, their
intricate workflows and inconsistent taxonomy create challenges in
understanding and reviewing the frameworks across different paradigms. This
survey introduces a unified taxonomy to systematically review and discuss these
frameworks. Specifically, 1) the taxonomy defines environments/tasks, common
LLM-profiled roles or LMPRs (policy models, evaluators, and dynamic models),
and universally applicable workflows found in prior work, and 2) it enables a
comparison of key perspectives on the implementations of LMPRs and workflow
designs across different agent paradigms and frameworks. 3) Finally, we
identify three limitations in existing workflow designs and systematically
discuss the future work. Resources have been made publicly available in our
GitHub repository https://github.com/xinzhel/LLM-Agent-Survey. [COMMENTS]CoLing 2025 Camera Ready (extended to 9 pages) [LINK]http://arxiv.org/abs/2406.05804v6 [DATE]2024-12-01 06:38:57+08:00 [CATEGORIES]cs.CL
Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective [AUTHORS]Yue Zhou, Barbara Di Eugenio, Lu Cheng [ABSTRACT]This paper studies the performance of large language models (LLMs),
particularly regarding demographic fairness, in solving real-world healthcare
tasks. We evaluate state-of-the-art LLMs with three prevalent learning
frameworks across six diverse healthcare tasks and find significant challenges
in applying LLMs to real-world healthcare tasks and persistent fairness issues
across demographic groups. We also find that explicitly providing demographic
information yields mixed results, while LLMs' ability to infer such details
raises concerns about biased health predictions. Utilizing LLMs as autonomous
agents with access to up-to-date guidelines does not guarantee performance
improvement. We believe these findings reveal the critical limitations of LLMs
in healthcare fairness and the urgent need for specialized research in this
area. [COMMENTS]Accepted to the main conference of COLING 2025 [LINK]http://arxiv.org/abs/2412.00554v1 [DATE]2024-12-01 02:52:30+08:00 [CATEGORIES]cs.CL
SeQwen at the Financial Misinformation Detection Challenge Task: Sequential Learning for Claim Verification and Explanation Generation in Financial Domains [AUTHORS]Jebish Purbey, Siddhant Gupta, Nikhil Manali, Siddartha Pullakhandam, Drishti Sharma, Ashay Srivastava, Ram Mohan Rao Kadiyala [ABSTRACT]This paper presents the system description of our entry for the COLING 2025
FMD challenge, focusing on misinformation detection in financial domains. We
experimented with a combination of large language models, including Qwen,
Mistral, and Gemma-2, and leveraged pre-processing and sequential learning for
not only identifying fraudulent financial content but also generating coherent
and concise explanations that clarify the rationale behind the classifications.
Our approach achieved competitive results with an F1-score of 0.8283 for
classification, and ROUGE-1 of 0.7253 for explanations. This work highlights
the transformative potential of LLMs in financial applications, offering
insights into their capabilities for combating misinformation and enhancing
transparency while identifying areas for future improvement in robustness and
domain adaptation. [COMMENTS]6 pages, 9 figures, Submitted to FinNLP-FNP-LLMFinLegal @ COLING 2025 [LINK]http://arxiv.org/abs/2412.00549v1 [DATE]2024-12-01 02:03:04+08:00 [CATEGORIES]cs.CLcs.LG
Evaluating the Consistency of LLM Evaluators [AUTHORS]Noah Lee, Jiwoo Hong, James Thorne [ABSTRACT]Large language models (LLMs) have shown potential as general evaluators along
with the evident benefits of speed and cost. While their correlation against
human annotators has been widely studied, consistency as evaluators is still
understudied, raising concerns about the reliability of LLM evaluators. In this
paper, we conduct extensive studies on the two aspects of consistency in LLM
evaluations, Self-Consistency (SC) and Inter-scale Consistency (IC), on
different scoring scales and criterion granularity with open-source and
proprietary models. Our comprehensive analysis demonstrates that strong
proprietary models are not necessarily consistent evaluators, highlighting the
importance of considering consistency in assessing the capability of LLM
evaluators. [COMMENTS]Accepted to COLING 2025 [LINK]http://arxiv.org/abs/2412.00543v1 [DATE]2024-12-01 01:29:08+08:00 [CATEGORIES]cs.CL
</div>
2024 Nov 30, Sat
Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection [AUTHORS]Ayoub Hammal, Benno Uthayasooriyar, Caio Corro [ABSTRACT]Named-entity recognition (NER) is a task that typically requires large
annotated datasets, which limits its applicability across domains with varying
entity definitions. This paper addresses few-shot NER, aiming to transfer
knowledge to new domains with minimal supervision. Unlike previous approaches
that rely solely on limited annotated data, we propose a weakly supervised
algorithm that combines small labeled datasets with large amounts of unlabeled
data. Our method extends the k-means algorithm with label supervision, cluster
size constraints and domain-specific discriminative subspace selection. This
unified framework achieves state-of-the-art results in few-shot NER on several
English datasets. [COMMENTS]COLING 2025 [LINK]http://arxiv.org/abs/2412.00426v1 [DATE]2024-11-30 18:52:24+08:00 [CATEGORIES]cs.CL
2024 Nov 30, Sat
Scaling Laws for Precision [AUTHORS]Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan [ABSTRACT]Low precision training and inference affect both the quality and cost of
language models, but current scaling laws do not account for this. In this
work, we devise "precision-aware" scaling laws for both training and inference.
We propose that training in lower precision reduces the model's "effective
parameter count," allowing us to predict the additional loss incurred from
training in low precision and post-train quantization. For inference, we find
that the degradation introduced by post-training quantization increases as
models are trained on more data, eventually making additional pretraining data
actively harmful. For training, our scaling laws allow us to predict the loss
of a model with different parts in different precisions, and suggest that
training larger models in lower precision may be compute optimal. We unify the
scaling laws for post and pretraining quantization to arrive at a single
functional form that predicts degradation from training and inference in varied
precisions. We fit on over 465 pretraining runs and validate our predictions on
model sizes up to 1.7B parameters trained on up to 26B tokens. [LINK]http://arxiv.org/abs/2411.04330v2 [DATE]2024-11-30 10:42:31+08:00 [CATEGORIES]cs.LGcs.CL
2024 Dec 02, Mon
Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers [AUTHORS]Andy Yang, David Chiang [ABSTRACT]Deriving formal bounds on the expressivity of transformers, as well as
studying transformers that are constructed to implement known algorithms, are
both effective methods for better understanding the computational power of
transformers. Towards both ends, we introduce the temporal counting logic
$\textsf{K}_\text{t}$[#] alongside the RASP variant $\textsf{C-RASP}$. We show
they are equivalent to each other, and that together they are the best-known
lower bound on the formal expressivity of future-masked soft attention
transformers with unbounded input size. We prove this by showing all
$\textsf{K}_\text{t}$[#] formulas can be compiled into these transformers. [LINK]http://arxiv.org/abs/2404.04393v2 [DATE]2024-12-02 04:48:11+08:00 [CATEGORIES]cs.CLcs.LG
2024 Dec 06, Fri
The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation [AUTHORS]Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre [ABSTRACT]This paper introduces the counter-intuitive generalization results of
overfitting pre-trained large language models (LLMs) on very small datasets. In
the setting of open-ended text generation, it is well-documented that LLMs tend
to generate repetitive and dull sequences, a phenomenon that is especially
apparent when generating using greedy decoding. This issue persists even with
state-of-the-art LLMs containing billions of parameters, trained via next-token
prediction on large datasets. We find that by further fine-tuning these models
to achieve a near-zero training loss on a small set of samples -- a process we
refer to as hyperfitting -- the long-sequence generative capabilities are
greatly enhanced. Greedy decoding with these hyperfitted models even outperforms
Top-P sampling over long sequences, both in terms of diversity and human
preferences. This phenomenon extends to LLMs of various sizes, different
domains, and even autoregressive image generation. We further find this
phenomenon to be distinctly different from that of Grokking and double descent.
Surprisingly, our experiments indicate that hyperfitted models rarely fall into
repeating sequences they were trained on, and even explicitly blocking these
sequences results in high-quality output. All hyperfitted models produce
extremely low-entropy predictions, often allocating nearly all probability to a
single token. [COMMENTS]Under review at ICLR [LINK]http://arxiv.org/abs/2412.04318v1 [DATE]2024-12-06 00:34:20+08:00 [CATEGORIES]cs.CL
2024 Dec 05, Thu
A Context-aware Framework for Translation-mediated Conversations [AUTHORS]José Pombal, Sweta Agrawal, Patrick Fernandes, Emmanouil Zaranis, André F. T. Martins [ABSTRACT]Effective communication is fundamental to any interaction, yet challenges
arise when participants do not share a common language. Automatic translation
systems offer a powerful solution to bridge language barriers in such
scenarios, but they introduce errors that can lead to misunderstandings and
conversation breakdown. A key issue is that current systems fail to incorporate
the rich contextual information necessary to resolve ambiguities and omitted
details, resulting in literal, inappropriate, or misaligned translations. In
this work, we present a framework to improve large language model-based
translation systems by incorporating contextual information in bilingual
conversational settings. During training, we leverage context-augmented
parallel data, which allows the model to generate translations sensitive to
conversational history. During inference, we perform quality-aware decoding
with context-aware metrics to select the optimal translation from a pool of
candidates. We validate both components of our framework on two task-oriented
domains: customer chat and user-assistant interaction. Across both settings,
our framework consistently results in better translations than state-of-the-art
systems like GPT-4o and TowerInstruct, as measured by multiple automatic
translation quality metrics on several language pairs. We also show that the
resulting model leverages context in an intended and interpretable way,
improving consistency between the conveyed message and the generated
translations. [LINK]http://arxiv.org/abs/2412.04205v1 [DATE]2024-12-05 22:41:05+08:00 [CATEGORIES]cs.CL
A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts [AUTHORS]Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng [ABSTRACT]Training and serving long-context large language models (LLMs) incurs
substantial overhead. To address this, two critical steps are often required: a
pretrained LLM typically undergoes a separate stage for context length
extension by training on long-context data, followed by architectural
modifications to reduce the overhead of KV cache during serving. This paper
argues that integrating length extension with a GPU-friendly KV cache reduction
architecture not only reduces training overhead during length extension, but
also achieves better long-context performance. This leads to our proposed
LongGen, which finetunes a pretrained LLM into an efficient architecture during
length extension. LongGen builds on three key insights: (1) Sparse attention
patterns, such as window attention (attending to recent tokens), attention sink
(initial ones), and blockwise sparse attention (strided token blocks) are
well-suited for building efficient long-context models, primarily due to their
GPU-friendly memory access patterns, enabling efficiency gains not just
theoretically but in practice as well. (2) It is essential for the model to
have direct access to all tokens. A hybrid architecture with 1/3 full attention
layers and 2/3 efficient ones achieves a balanced trade-off between efficiency
and long-context performance. (3) Lightweight training on 5B long-context data
is sufficient to extend the hybrid model's context length from 4K to 128K.
We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its
effectiveness across different scales. During training with 128K-long contexts,
LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%,
compared to a full-attention baseline. During inference, LongGen reduces KV
cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding
speedup. [LINK]http://arxiv.org/abs/2410.01485v2 [DATE]2024-12-05 14:52:42+08:00 [CATEGORIES]cs.CL
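Two of the sparse patterns named in the LongGen abstract above, window attention and attention sink, can be combined into a single causal mask. The sketch below builds such a mask with NumPy; the function name `sink_window_mask` and the sizes used are hypothetical, and the blockwise pattern and the placement of the full-attention layers are not modeled.

```python
import numpy as np

def sink_window_mask(seq_len, window, n_sink):
    """Boolean mask: entry [i, j] is True when query i may attend to key j."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to future tokens
    recent = (i - j) < window         # window attention: the most recent tokens
    sink = j < n_sink                 # attention sink: the initial tokens
    return causal & (recent | sink)

mask = sink_window_mask(seq_len=8, window=3, n_sink=1)
keys_per_query = mask.sum(axis=1)     # number of keys each query position attends to
```

For late positions each query attends to at most `window + n_sink` keys rather than the whole prefix, which is the source of the KV cache savings in layers that use this pattern.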
Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models [AUTHORS]Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, Lingpeng Kong [ABSTRACT]Recently, diffusion models have garnered significant interest in the field of
text processing due to their many potential advantages compared to conventional
autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a
novel approach that integrates diffusion models with Chain-of-Thought, a
well-established technique for improving the reasoning ability of
autoregressive language models. In contrast to autoregressive language models
that make decisions in a left-to-right, token-by-token manner, DoT allows
reasoning steps to diffuse over time through a diffusion language model and
offers greater flexibility in trading-off computation for reasoning
performance. Our experimental results demonstrate the effectiveness of DoT in
multi-digit multiplication, boolean logic, and grade school math problems, with
a small diffusion model outperforming a much larger autoregressive model in
both efficiency and accuracy. In addition to that, DoT showcases promising
self-correction abilities and benefits from existing reasoning-enhancing
techniques like self-consistency decoding. Our findings contribute to the
understanding and development of reasoning with diffusion language models. [COMMENTS]NeurIPS 2024 [LINK]http://arxiv.org/abs/2402.07754v3 [DATE]2024-12-05 14:49:06+08:00 [CATEGORIES]cs.CLcs.LG
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension [AUTHORS]Wang Xiyao, Yang Zhengyuan, Li Linjie, Lu Hongjin, Xu Yuancheng, Lin Chung-Ching Lin, Lin Kevin, Huang Furong, Wang Lijuan [ABSTRACT]Despite significant advancements in vision-language models (VLMs), effective
approaches to enhance response quality by scaling inference-time computation
are lacking. This capability is known to be a core step towards
the self-improving models in recent large language model studies. In this
paper, we present Vision Value Model (VisVM) that can guide VLM inference-time
search to generate responses with better visual comprehension. Specifically,
VisVM not only evaluates the generated sentence quality in the current search
step, but also anticipates the quality of subsequent sentences that may result
from the current step, thus providing a long-term value. In this way, VisVM
steers VLMs away from generating sentences prone to hallucinations or
insufficient detail, thereby producing higher quality responses. Experimental
results demonstrate that VisVM-guided search significantly enhances VLMs'
ability to generate descriptive captions with richer visual details and fewer
hallucinations, compared with greedy decoding and search methods with other
visual reward signals. Furthermore, we find that self-training the model with
the VisVM-guided captions improves VLMs' performance across a wide range of
multimodal benchmarks, indicating the potential for developing self-improving
VLMs. Our value model and code are available at
https://github.com/si0wang/VisVM. [LINK]http://arxiv.org/abs/2412.03704v1 [DATE]2024-12-05 04:35:07+08:00 [CATEGORIES]cs.CLcs.LG
Learning Semantic Association Rules from Internet of Things Data [AUTHORS]Erkan Karabulut, Paul Groth, Victoria Degeler [ABSTRACT]Association Rule Mining (ARM) is the task of discovering commonalities in
data in the form of logical implications. ARM is used in the Internet of Things
(IoT) for different tasks including monitoring and decision-making. However,
existing methods give limited consideration to IoT-specific requirements such
as heterogeneity and volume. Furthermore, they do not utilize important static
domain-specific description data about IoT systems, which is increasingly
represented as knowledge graphs. In this paper, we propose a novel ARM pipeline
for IoT data that utilizes both dynamic sensor data and static IoT system
metadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method
(Aerial) as part of the pipeline to address the high volume of IoT data and
reduce the total number of rules that are resource-intensive to process. Aerial
learns a neural representation of the given data and extracts association rules
from this representation by exploiting the reconstruction (decoding) mechanism
of an autoencoder. Extensive evaluations on 3 IoT datasets from 2 domains show
that ARM on both static and dynamic IoT data results in more generically
applicable rules while Aerial can learn a more concise set of high-quality
association rules than the state-of-the-art with full coverage over the
datasets. [LINK]http://arxiv.org/abs/2412.03417v2 [DATE]2024-12-05 21:22:28+08:00 [CATEGORIES]cs.LG
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment [AUTHORS]Jason Vega, Junsheng Huang, Gaokai Zhang, Hangoo Kang, Minjia Zhang, Gagandeep Singh [ABSTRACT]Safety alignment of Large Language Models (LLMs) has recently become a
critical objective of model developers. In response, a growing body of work has
been investigating how safety alignment can be bypassed through various
jailbreaking methods, such as adversarial attacks. However, these jailbreak
methods can be rather costly or involve a non-trivial amount of creativity and
effort, introducing the assumption that malicious users are high-resource or
sophisticated. In this paper, we study how simple random augmentations to the
input prompt affect safety alignment effectiveness in state-of-the-art LLMs,
such as Llama 3 and Qwen 2. We perform an in-depth evaluation of 17 different
models and investigate the intersection of safety under random augmentations
with multiple dimensions: augmentation type, model size, quantization,
fine-tuning-based defenses, and decoding strategies (e.g., sampling
temperature). We show that low-resource and unsophisticated attackers, i.e.
$\textit{stochastic monkeys}$, can significantly improve their chances of
bypassing alignment with just 25 random augmentations per prompt. Source code
and data: https://github.com/uiuc-focal-lab/stochastic-monkeys/ [COMMENTS]v2: Updated with changes from peer review rebuttal. v1: Version under
peer review [LINK]http://arxiv.org/abs/2411.02785v2 [DATE]2024-12-05 20:58:44+08:00 [CATEGORIES]cs.LG
2024 Dec 04, Wed
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification [AUTHORS]Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin [ABSTRACT]Multimodal Large Language Models (MLLMs) have achieved remarkable success in
vision understanding, reasoning, and interaction. However, the inference
computation and memory increase progressively with the generation of output
tokens during decoding, directly affecting the efficacy of MLLMs. Existing
methods attempt to reduce the vision context redundancy to achieve efficient
MLLMs. Unfortunately, the efficiency benefits of the vision context reduction
in the prefill stage gradually diminish during the decoding stage. To address
this problem, we propose a dynamic vision-language context sparsification
framework Dynamic-LLaVA, which dynamically reduces the redundancy of vision
context in the prefill stage and decreases the memory and computation overhead
of the generated language context during decoding. Dynamic-LLaVA designs a
tailored sparsification inference scheme for different inference modes, i.e.,
prefill, decoding with and without KV cache, to achieve efficient inference of
MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by
$\sim$75\% in the prefill stage. Meanwhile, throughout the entire generation
process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50\%
under decoding without KV cache, while saving $\sim$50\% GPU memory overhead
when decoding with KV cache, due to the vision-language context sparsification.
Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient
inference for MLLMs with negligible understanding and generation ability
degradation or even performance gains compared to the full-context inference
baselines. Code is available at https://github.com/Osilly/dynamic_llava . [COMMENTS]Code is available at https://github.com/Osilly/dynamic_llava [LINK]http://arxiv.org/abs/2412.00876v2 [DATE]2024-12-04 00:12:09+08:00 [CATEGORIES]cs.CL cs.LG
Channel Reflection: Knowledge-Driven Data Augmentation for EEG-Based Brain-Computer Interfaces [AUTHORS]Ziwei Wang, Siyang Li, Jingwei Luo, Jiajing Liu, Dongrui Wu [ABSTRACT]A brain-computer interface (BCI) enables direct communication between the
human brain and external devices. Electroencephalography (EEG) based BCIs are
currently the most popular for able-bodied users. To increase
user-friendliness, usually a small amount of user-specific EEG data are used
for calibration, which may not be enough to develop a pure data-driven decoding
model. To cope with this typical calibration data shortage challenge in
EEG-based BCIs, this paper proposes a parameter-free channel reflection (CR)
data augmentation approach that incorporates prior knowledge on the channel
distributions of different BCI paradigms in data augmentation. Experiments on
eight public EEG datasets across four different BCI paradigms (motor imagery,
steady-state visual evoked potential, P300, and seizure classifications) using
different decoding algorithms demonstrated that: 1) CR is effective, i.e., it
can noticeably improve the classification accuracy; 2) CR is robust, i.e., it
consistently outperforms existing data augmentation approaches in the
literature; and, 3) CR is flexible, i.e., it can be combined with other data
augmentation approaches to further increase the performance. We suggest that
data augmentation approaches like CR should be an essential step in EEG-based
BCIs. Our code is available online. [LINK]http://arxiv.org/abs/2412.03224v1 [DATE]2024-12-04 19:21:30+08:00 [CATEGORIES]cs.LG
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [AUTHORS]Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo [ABSTRACT]Large Language Models (LLMs) have been widely deployed in a variety of
applications, and the context length is rapidly increasing to handle tasks such
as long-document QA and complex logical reasoning. However, long context poses
significant challenges for inference efficiency, including high memory costs of
key-value (KV) cache and increased latency due to extensive memory accesses.
Recent works have proposed compressing KV cache to approximate computation, but
these methods either evict tokens permanently, never recalling them for later
inference, or recall previous tokens at the granularity of pages divided by
textual positions. Both approaches degrade the model accuracy and output
quality. To achieve efficient and accurate recallable KV cache compression, we
introduce ClusterKV, which recalls tokens at the granularity of semantic
clusters. We design and implement efficient algorithms and systems for
clustering, selection, indexing and caching. Experiment results show that
ClusterKV attains negligible accuracy loss across various tasks with 32k
context lengths, using only a 1k to 2k KV cache budget, and achieves up to a
2$\times$ speedup in latency and a 2.5$\times$ improvement in decoding
throughput. Compared to SoTA recallable KV compression methods, ClusterKV
demonstrates higher model accuracy and output quality, while maintaining or
exceeding inference efficiency. [LINK]http://arxiv.org/abs/2412.03213v1 [DATE]2024-12-04 18:58:27+08:00 [CATEGORIES]cs.LG
Pyramid Vector Quantization for LLMs [AUTHORS]Tycho F. A. van der Ouderaa, Maximilian L. Croci, Agrin Hilmkil, James Hensman [ABSTRACT]Recent works on compression of large language models (LLM) using quantization
considered reparameterizing the architecture such that weights are distributed
on the sphere. This demonstrably improves the ability to quantize by
increasing the mathematical notion of coherence, resulting in fewer weight
outliers without affecting the network output. In this work, we aim to further
exploit this spherical geometry of the weights when performing quantization by
considering Pyramid Vector Quantization (PVQ) for large language models.
Arranging points evenly on the sphere is notoriously difficult, especially in
high dimensions, and in case approximate solutions exist, representing points
explicitly in a codebook is typically not feasible due to its additional memory
cost. Instead, PVQ uses a fixed integer lattice on the sphere by projecting
points onto the 1-sphere, which allows for efficient encoding and decoding
without requiring an explicit codebook in memory. To obtain a practical
algorithm, we propose to combine PVQ with scale quantization for which we
derive theoretically optimal quantizations, under empirically verified
assumptions. Further, we extend pyramid vector quantization to use Hessian
information to minimize quantization error under expected feature activations,
instead of only relying on weight magnitudes. Experimentally, we achieve
state-of-the-art quantization performance with a Pareto-optimal trade-off between
performance and bits per weight and bits per activation, relative to the compared
methods. For weight-only quantization, we find that we can quantize a Llama-3 70B model to
3.25 bits per weight and retain 98\% accuracy on downstream tasks. [LINK]http://arxiv.org/abs/2410.16926v2 [DATE]2024-12-04 18:52:04+08:00 [CATEGORIES]cs.LG
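As a rough illustration of the codebook-free lattice idea described above, the following sketch maps a vector to an integer point whose absolute values sum to a fixed pulse budget. The function name `pvq_quantize`, the parameter `k`, and the greedy remainder-based rounding are assumptions made for illustration; the paper's actual procedure (including scale quantization and Hessian weighting) is not reproduced here.

```python
def pvq_quantize(x, k):
    """Map x to an integer vector y with sum(|y_i|) == k (hypothetical helper)."""
    l1 = sum(abs(v) for v in x)
    if l1 == 0:
        # Degenerate input: place all pulses on the first coordinate.
        return [k] + [0] * (len(x) - 1)
    # Scale magnitudes so that they lie on the L1 sphere of radius k.
    scaled = [abs(v) * k / l1 for v in x]
    # Round down first, so the pulse count never exceeds k.
    y = [int(s) for s in scaled]
    # Give the leftover pulses to the coordinates with the largest remainders.
    remaining = k - sum(y)
    order = sorted(range(len(x)), key=lambda i: scaled[i] - y[i], reverse=True)
    for i in order[:remaining]:
        y[i] += 1
    # Restore the signs of the original vector.
    return [yi if v >= 0 else -yi for yi, v in zip(y, x)]
```

Because every output lies on the same finite integer lattice, a point can in principle be stored as an index into that lattice rather than as an entry in a learned codebook.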
2024 Dec 03, Tue
Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation [AUTHORS]Zhi Qu, Yiran Wang, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe [ABSTRACT]Existing multilingual neural machine translation (MNMT) approaches mainly
focus on improving models with the encoder-decoder architecture to translate
multiple languages. However, decoder-only architecture has been explored less
in MNMT due to its underperformance when trained on parallel data solely. In
this work, we attribute the issue of the decoder-only architecture to its lack
of language transfer capability. Specifically, the decoder-only architecture is
insufficient in encoding source tokens with the target language features. We
propose dividing the decoding process into two stages so that target tokens are
explicitly excluded in the first stage to implicitly boost the transfer
capability across languages. Additionally, we impose contrastive learning on
translation instructions, resulting in improved performance in zero-shot
translation. We conduct experiments on TED-19 and OPUS-100 datasets,
considering both training from scratch and fine-tuning scenarios. Experimental
results show that, compared to the encoder-decoder architecture, our methods
not only perform competitively in supervised translations but also achieve
improvements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in
zero-shot translations. [LINK]http://arxiv.org/abs/2412.02101v1 [DATE]2024-12-03 10:52:14+08:00 [CATEGORIES]cs.CL
Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models [AUTHORS]Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal [ABSTRACT]Although large language models (LLMs) have demonstrated their effectiveness
in a wide range of applications, they have also been observed to perpetuate
unwanted biases present in the training data, potentially leading to harm for
marginalized communities. In this paper, we mitigate bias by leveraging small
biased and anti-biased expert models to obtain a debiasing signal that will be
added to the LLM output at decoding-time. This approach combines resource
efficiency with interpretability and can be optimized for mitigating specific
types of bias, depending on the target use case. Experiments on mitigating
gender, race, and religion biases show a reduction in bias on several local and
global bias metrics while preserving language model performance. [COMMENTS]38th Conference on Neural Information Processing Systems (NeurIPS
2024) Safe Generative AI Workshop [LINK]http://arxiv.org/abs/2412.01711v1 [DATE]2024-12-03 00:56:08+08:00 [CATEGORIES]cs.CL
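One way to picture a decoding-time debiasing signal is as a logit offset computed from the two expert models. The sketch below is a minimal illustration under that assumption: the function names, the difference-of-logits form of the signal, and the strength parameter `alpha` are hypothetical and may differ from the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def debiased_distribution(llm_logits, biased_logits, anti_biased_logits, alpha=1.0):
    """Shift the LLM's next-token logits by a signal from two small experts.

    Assumption: the signal is the anti-biased expert's logits minus the
    biased expert's logits, scaled by alpha.
    """
    signal = [a - b for a, b in zip(anti_biased_logits, biased_logits)]
    adjusted = [l + alpha * s for l, s in zip(llm_logits, signal)]
    return softmax(adjusted)
```

Setting `alpha=0.0` recovers the unmodified LLM distribution, which makes the effect of the signal easy to inspect token by token.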
A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [AUTHORS]Yan Scholten, Stephan Günnemann, Leo Schwinn [ABSTRACT]Comprehensive evaluation of Large Language Models (LLMs) is an open research
problem. Existing evaluations rely on deterministic point estimates generated
via greedy decoding. However, we find that deterministic evaluations fail to
capture the whole output distribution of a model, yielding inaccurate
estimations of model capabilities. This is particularly problematic in critical
contexts such as unlearning and alignment, where precise model evaluations are
crucial. To remedy this, we introduce the first formal probabilistic evaluation
framework in LLMs. Namely, we derive novel metrics with high-probability
guarantees concerning the output distribution of a model. Our metrics are
application-independent and allow practitioners to make more reliable estimates
about model capabilities before deployment. Through a case study focused on
unlearning, we reveal that deterministic evaluations falsely indicate
successful unlearning, whereas our probabilistic evaluations demonstrate that
most if not all of the supposedly unlearned information remains accessible in
these models. Additionally, we propose a novel unlearning loss based on entropy
optimization and adaptive temperature scaling, which significantly improves
unlearning in probabilistic settings on recent benchmarks. Our proposed shift
from point estimates to probabilistic evaluations of output distributions
represents an important step toward comprehensive evaluations of LLMs. Code
available at https://github.com/yascho/probabilistic-unlearning. [LINK]http://arxiv.org/abs/2410.03523v4 [DATE]2024-12-03 22:31:41+08:00 [CATEGORIES]cs.LG
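To make the contrast between point estimates and high-probability statements concrete, the sketch below bounds the probability that a sampled output leaks supposedly unlearned information. It uses a standard one-sided Hoeffding bound, which is an assumption chosen for illustration; the paper's own metrics may be defined differently, and `leak_upper_bound` and `leak_flags` are hypothetical names.

```python
import math

def leak_upper_bound(leak_flags, delta=0.05):
    """Upper bound on the true leak probability, valid with probability >= 1 - delta.

    leak_flags: one 0/1 entry per independently sampled model output
    (1 means the sample revealed the target information).
    """
    n = len(leak_flags)
    p_hat = sum(leak_flags) / n
    # One-sided Hoeffding bound: p <= p_hat + sqrt(ln(1/delta) / (2n)).
    return min(1.0, p_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))
```

Even when none of 100 samples leaks, the bound stays near 0.12, so a clean sample (or a single greedy decode) alone does not certify that the information is gone.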
An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction [AUTHORS]Yaxin Liang, Xinshi Li, Xin Huang, Ziqi Zhang, Yue Yao [ABSTRACT]This study proposes an automated data mining framework based on autoencoders
and experimentally verifies its effectiveness in feature extraction and data
dimensionality reduction. Through the encoding-decoding structure, the
autoencoder can capture the data's potential characteristics and achieve noise
reduction and anomaly detection, providing an efficient and stable solution for
the data mining process. The experiment compared the performance of the
autoencoder with traditional dimensionality reduction methods (such as PCA, FA,
t-SNE, and UMAP). The results showed that the autoencoder performed best in
terms of reconstruction error and root mean square error and could better
retain data structure and enhance the generalization ability of the model. The
autoencoder-based framework not only reduces manual intervention but also
significantly improves the automation of data processing. In the future, with
the advancement of deep learning and big data technology, the autoencoder
method combined with a generative adversarial network (GAN) or graph neural
network (GNN) is expected to be more widely used in the fields of complex data
processing, real-time data analysis and intelligent decision-making. [LINK]http://arxiv.org/abs/2412.02211v1 [DATE]2024-12-03 15:04:10+08:00 [CATEGORIES]cs.LG
Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [AUTHORS]Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, Chelsea Finn [ABSTRACT]Predicting and executing a sequence of actions without intermediate
replanning, known as action chunking, is increasingly used in robot learning
from human demonstrations. Yet, its reported effects on the learned policy are
inconsistent: some studies find it crucial for achieving strong results, while
others observe decreased performance. In this paper, we first dissect how
action chunking impacts the divergence between a learner and a demonstrator. We
find that action chunking allows the learner to better capture the temporal
dependencies in demonstrations but at the cost of reduced reactivity in
stochastic environments. To address this tradeoff, we propose Bidirectional
Decoding (BID), a test-time inference algorithm that bridges action chunking
with closed-loop operations. BID samples multiple predictions at each time step
and searches for the optimal one based on two criteria: (i) backward coherence,
which favors samples that align with previous decisions; (ii) forward contrast,
which seeks samples of high likelihood for future plans. By coupling decisions
within and across action chunks, BID promotes consistency over time while
maintaining reactivity to unexpected changes. Experimental results show that
BID boosts the performance of two state-of-the-art generative policies across
seven simulation benchmarks and two real-world tasks. Code and videos are
available at https://bid-robot.github.io. [COMMENTS]Project website: https://bid-robot.github.io/ [LINK]http://arxiv.org/abs/2408.17355v3 [DATE]2024-12-03 14:53:58+08:00 [CATEGORIES]cs.LG
GNN-based Auto-Encoder for Short Linear Block Codes: A DRL Approach [AUTHORS]Kou Tian, Chentao Yue, Changyang She, Yonghui Li, Branka Vucetic [ABSTRACT]This paper presents a novel auto-encoder based end-to-end channel encoding
and decoding. It integrates deep reinforcement learning (DRL) and graph neural
networks (GNN) in code design by modeling the generation of code parity-check
matrices as a Markov Decision Process (MDP), to optimize key coding performance
metrics such as error-rates and code algebraic properties. An edge-weighted GNN
(EW-GNN) decoder is proposed, which operates on the Tanner graph with an
iterative message-passing structure. Once trained on a single linear block
code, the EW-GNN decoder can be directly used to decode other linear block
codes of different code lengths and code rates. An iterative joint training of
the DRL-based code designer and the EW-GNN decoder is performed to optimize the
end-to-end encoding and decoding process. Simulation results show the proposed
auto-encoder significantly surpasses several traditional coding schemes at
short block lengths, including low-density parity-check (LDPC) codes with the
belief propagation (BP) decoding and the maximum-likelihood decoding (MLD), and
BCH with BP decoding, offering superior error-correction capabilities while
maintaining low decoding complexity. [COMMENTS]13 pages; submitted to IEEE Trans. arXiv admin note: text overlap
with arXiv:2211.06962 [LINK]http://arxiv.org/abs/2412.02053v1 [DATE]2024-12-03 08:25:14+08:00 [CATEGORIES]cs.LG
2024 Dec 02, Mon
Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization [AUTHORS]Weiqiao Shan, Long Meng, Tong Zheng, Yingfeng Luo, Bei Li, junxin Wang, Tong Xiao, Jingbo Zhu [ABSTRACT]Large language models (LLMs) exhibit exceptional performance across various
downstream tasks. However, they encounter limitations due to slow inference
speeds stemming from their extensive parameters. The early exit (EE) is an
approach that aims to accelerate auto-regressive decoding. EE generates outputs
from intermediate layers instead of using the whole model, which offers a
promising solution to this challenge. However, additional output layers and
joint optimization used in conventional EE hinder the application of EE in
LLMs.
In this paper, we explore the possibility of EE in LLMs without additional
output layers and joint optimization. Our findings indicate that EE is a
natural capability within transformer-based models. While joint optimization
does not give model EE capability, it must be employed to address challenges by
improving the accuracy of locating the optimal EE layer through gating
functions. Additionally, our study reveals patterns in EE behavior from a
sub-word perspective based on the LLaMA model and the potential for
EE based on sub-layers. [LINK]http://arxiv.org/abs/2412.01455v1 [DATE]2024-12-02 20:46:34+08:00 [CATEGORIES]cs.CL
PLD+: Accelerating LLM inference by leveraging Language Model Artifacts [AUTHORS]Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena [ABSTRACT]To reduce the latency associated with autoregressive LLM inference,
speculative decoding has emerged as a novel decoding paradigm, where future
tokens are drafted and verified in parallel. However, the practical deployment
of speculative decoding is hindered by its requirements for additional
computational resources and fine-tuning, which limits its out-of-the-box
usability. To address these challenges, we present PLD+, a suite of novel
algorithms developed to accelerate the inference process of LLMs, particularly
for input-guided tasks. These tasks, which include code editing, text editing,
summarization, etc., often feature outputs with substantial overlap with their
inputs, an attribute PLD+ is designed to exploit. PLD+ also leverages the
artifacts (attention and hidden states) generated during inference to
accelerate inference speed. We test our approach on five input-guided tasks and
through extensive experiments we find that PLD+ outperforms all tuning-free
approaches. In the greedy setting, it even outperforms the state-of-the-art
tuning-dependent approach EAGLE on four of the tasks (by a margin of up to 2.31
in terms of average speedup). Our approach is tuning-free, does not require any
additional compute and can easily be used for accelerating inference of any
LLM. [LINK]http://arxiv.org/abs/2412.01447v1 [DATE]2024-12-02 20:36:27+08:00 [CATEGORIES]cs.CL
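The input-output overlap that such methods exploit can be sketched as a plain n-gram lookup: find an earlier occurrence of the most recent tokens and copy what followed it as a draft. This is only an assumed baseline for illustration. The name `draft_from_prompt` and its parameters are hypothetical, and the use of attention and hidden states to choose among candidate matches, which the abstract attributes to PLD+, is not modeled.

```python
def draft_from_prompt(tokens, ngram=2, max_draft=4):
    """Return draft tokens copied from an earlier occurrence of the last n-gram.

    tokens: the full context so far (input followed by generated tokens).
    """
    if len(tokens) < ngram:
        return []
    suffix = tokens[-ngram:]
    # Scan right to left, skipping the position of the suffix itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == suffix:
            # Propose the tokens that followed the earlier match.
            return tokens[start + ngram:start + ngram + max_draft]
    return []
```

For example, with the context `"the cat sat on the mat . the cat".split()`, the last two tokens match the start of the context and the draft is `["sat", "on", "the", "mat"]`; the target model would then verify these drafted tokens in parallel and keep only the accepted prefix.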
Federated Motor Imagery Classification for Privacy-Preserving Brain-Computer Interfaces [AUTHORS]Tianwang Jia, Lubin Meng, Siyang Li, Jiajing Liu, Dongrui Wu [ABSTRACT]Training an accurate classifier for EEG-based brain-computer interface (BCI)
requires EEG data from a large number of users, whereas protecting their data
privacy is a critical consideration. Federated learning (FL) is a promising
solution to this challenge. This paper proposes Federated classification with
local Batch-specific batch normalization and Sharpness-aware minimization
(FedBS) for privacy protection in EEG-based motor imagery (MI) classification.
FedBS utilizes local batch-specific batch normalization to reduce data
discrepancies among different clients, and sharpness-aware minimization
optimizer in local training to improve model generalization. Experiments on
three public MI datasets using three popular deep learning models demonstrated
that FedBS outperformed six state-of-the-art FL approaches. Remarkably, it also
outperformed centralized training, which does not consider privacy protection
at all. In summary, FedBS protects user EEG data privacy, enabling multiple BCI
users to participate in large-scale machine learning model training, which in
turn improves the BCI decoding accuracy. [LINK]http://arxiv.org/abs/2412.01079v1 [DATE]2024-12-02 11:35:27+08:00 [CATEGORIES]cs.LG
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding [AUTHORS]Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu [ABSTRACT]Human motion, inherently continuous and dynamic, presents significant
challenges for generative models. Despite their dominance, discrete
quantization methods, such as VQ-VAEs, suffer from inherent limitations,
including restricted expressiveness and frame-wise noise artifacts. Continuous
approaches, while producing smoother and more natural motions, often falter due
to high-dimensional complexity and limited training data. To resolve this
"discord" between discrete and continuous representations, we introduce
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a
novel method that decodes discrete motion tokens into continuous motion through
rectified flow. By employing an iterative refinement process in the continuous
space, DisCoRD captures fine-grained dynamics and ensures smoother and more
natural motions. Compatible with any discrete-based framework, our method
enhances naturalness without compromising faithfulness to the conditioning
signals. Extensive evaluations demonstrate that DisCoRD achieves
state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on
KIT-ML. These results solidify DisCoRD as a robust solution for bridging the
divide between discrete efficiency and continuous realism. Our project page is
available at: https://whwjdqls.github.io/discord.github.io/. [COMMENTS]20 pages 18 figures [LINK]http://arxiv.org/abs/2411.19527v2 [DATE]2024-12-02 11:34:45+08:00 [CATEGORIES]cs.LG
2024 Dec 01, Sun
PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis [AUTHORS]Hao Dong, Wei Wei [ABSTRACT]Recently, generative pre-training based models have demonstrated remarkable
results on the Aspect-based Sentiment Analysis (ABSA) task. However, previous works
overemphasize crafting various templates to paraphrase training targets for
enhanced decoding, ignoring the internal optimizations on generative models.
Despite notable results achieved by these target-oriented optimization methods,
they struggle with complicated long texts since the implicit long-distance
relation, e.g., aspect-opinion relation, is difficult to extract under the
position embedding mechanism in generative models. Thus, in this paper, we
first clarify the causes of the problem and introduce two sequence optimization
strategies: the rule-based static optimization and the score-based dynamic
optimization. The rule-based approach relies on handcrafted priorities of
dependency relations to reorder the context, while the score-based algorithm
dynamically regulates the contextual sequence by calculating word position
scores using a neural network. Based on the dynamic optimization structure, we
further propose a unified Prompt-based Generative Sequence Optimization network
(named PGSO), which jointly optimizes the training target as well as the
generative model. Specifically, PGSO contains two components, namely, prompt
construction and sequence regulator. The former constructs a task-specific
prompt based on unsupervised training objects to fully utilize the pre-trained
model. The latter jointly leverages semantic, syntactic and original-sequence
information to dynamically regulate the contextual sequence. Our experiments
conducted on four ABSA tasks across multiple benchmarks indicate that PGSO
outperforms state-of-the-art methods, with an average improvement of 3.52% in
F1 score. [LINK]http://arxiv.org/abs/2412.00763v1 [DATE]2024-12-01 18:49:55+08:00 [CATEGORIES]cs.CL
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices [AUTHORS]Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin [ABSTRACT]As large language models gain widespread adoption, running them efficiently
becomes crucial. Recent works on LLM inference use speculative decoding to
achieve extreme speedups. However, most of these works implicitly design their
algorithms for high-end datacenter hardware. In this work, we ask the opposite
question: how fast can we run LLMs on consumer machines? Consumer GPUs can no
longer fit the largest available models (50B+ parameters) and must offload them
to RAM or SSD. When running with offloaded parameters, the inference engine can
process batches of hundreds or thousands of tokens at the same time as just one
token, making it a natural fit for speculative decoding. We propose SpecExec
(Speculative Execution), a simple parallel decoding method that can generate up
to 20 tokens per target model iteration for popular LLM families. It utilizes
the high spikiness of the token probability distribution in modern LLMs and a
high degree of alignment between model output probabilities. SpecExec takes the
most probable token continuations from the draft model to build a "cache" tree
for the target model, which then gets validated in a single pass. Using
SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with
RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens
per second with 16-bit weights. [LINK]http://arxiv.org/abs/2406.02532v3 [DATE]2024-12-01 05:33:59+08:00 [CATEGORIES]cs.CL
Tree-Wasserstein Distance for High Dimensional Data with a Latent Feature Hierarchy [AUTHORS]Ya-Wei Eileen Lin, Ronald R. Coifman, Gal Mishne, Ronen Talmon [ABSTRACT]Finding meaningful distances between high-dimensional data samples is an
important scientific task. To this end, we propose a new tree-Wasserstein
distance (TWD) for high-dimensional data with two key aspects. First, our TWD
is specifically designed for data with a latent feature hierarchy, i.e., the
features lie in a hierarchical space, in contrast to the usual focus on
embedding samples in hyperbolic space. Second, while the conventional use of
TWD is to speed up the computation of the Wasserstein distance, we use its
inherent tree as a means to learn the latent feature hierarchy. The key idea of
our method is to embed the features into a multi-scale hyperbolic space using
diffusion geometry and then present a new tree decoding method by establishing
analogies between the hyperbolic embedding and trees. We show that our TWD
computed based on data observations provably recovers the TWD defined with the
latent feature hierarchy and that its computation is efficient and scalable. We
showcase the usefulness of the proposed TWD in applications to word-document
and single-cell RNA-sequencing datasets, demonstrating its advantages over
existing TWDs and methods based on pre-trained models. [LINK]http://arxiv.org/abs/2410.21107v2 [DATE]2024-12-01 10:36:26+08:00 [CATEGORIES]cs.LG
2024 Dec 05, Thu
Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [AUTHORS]Zheye Deng, Chunkit Chan, Weiqi Wang, Yuxi Sun, Wei Fan, Tianshi Zheng, Yauwai Yim, Yangqiu Song [ABSTRACT]The task of condensing large chunks of textual information into concise and
structured tables has gained attention recently due to the emergence of Large
Language Models (LLMs) and their potential benefit for downstream tasks, such
as text summarization and text mining. Previous approaches often generate
tables that directly replicate information from the text, limiting their
applicability in broader contexts, as text-to-table generation in real-life
scenarios necessitates information extraction, reasoning, and integration.
However, there is a lack of both datasets and methodologies towards this task.
In this paper, we introduce LiveSum, a new benchmark dataset created for
generating summary tables of competitions based on real-time commentary texts.
We evaluate the performances of state-of-the-art LLMs on this task in both
fine-tuning and zero-shot settings, and additionally propose a novel pipeline
called $T^3$ (Text-Tuple-Table) to improve their performances. Extensive
experimental results demonstrate that LLMs still struggle with this task even
after fine-tuning, while our approach can offer substantial performance gains
without explicit training. Further analyses demonstrate that our method
exhibits strong generalization abilities, surpassing previous approaches on
several other text-to-table datasets. Our code and data can be found at
https://github.com/HKUST-KnowComp/LiveSum. [COMMENTS]Accepted to EMNLP 2024 [LINK]http://arxiv.org/abs/2404.14215v2 [DATE]2024-12-05 14:02:59+08:00 [CATEGORIES]cs.CL
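The integration step of a text-to-tuple-to-table pipeline can be pictured as aggregating extracted tuples into table cells. The sketch below assumes the tuples have already been extracted (in the paper this is done by an LLM) and that integration is a simple count per subject and event; `tuples_to_table`, the tuple layout, and the column names are hypothetical choices for illustration.

```python
from collections import Counter

def tuples_to_table(tuples, columns):
    """Aggregate (subject, event) tuples into one row of counts per subject."""
    counts = Counter(tuples)
    subjects = sorted({subject for subject, _ in tuples})
    # Missing (subject, event) pairs count as zero.
    return [[subject] + [counts[(subject, col)] for col in columns]
            for subject in subjects]
```

For instance, the tuples `("Home", "Shots")`, `("Away", "Fouls")`, `("Home", "Shots")`, `("Home", "Fouls")` with columns `["Shots", "Fouls"]` yield the rows `["Away", 0, 1]` and `["Home", 2, 1]`, a result that cannot be copied verbatim from any single sentence of the commentary.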
If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions [AUTHORS]Reza Esfandiarpoor, Cristina Menghini, Stephen H. Bach [ABSTRACT]Recent works often assume that Vision-Language Model (VLM) representations
are based on visual attributes like shape. However, it is unclear to what
extent VLMs prioritize this information to represent concepts. We propose
Extract and Explore (EX2), a novel approach to characterize textual features
that are important for VLMs. EX2 uses reinforcement learning to align a large
language model with VLM preferences and generates descriptions that incorporate
features that are important for the VLM. Then, we inspect the descriptions to
identify features that contribute to VLM representations. Using EX2, we find
that spurious descriptions have a major role in VLM representations despite
providing no helpful information, e.g., Click to enlarge photo of CONCEPT. More
importantly, among informative descriptions, VLMs rely significantly on
non-visual attributes like habitat (e.g., North America) to represent visual
concepts. Also, our analysis reveals that different VLMs prioritize different
attributes in their representations. Overall, we show that VLMs do not simply
match images to scene descriptions and that non-visual or even spurious
descriptions significantly influence their representations. [COMMENTS]EMNLP 2024 [LINK]http://arxiv.org/abs/2403.16442v2 [DATE]2024-12-05 06:37:07+08:00 [CATEGORIES]cs.CL cs.LG
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [AUTHORS]Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo [ABSTRACT]Proprietary LMs such as GPT-4 are often employed to assess the quality of
responses from various LMs. However, concerns including transparency,
controllability, and affordability strongly motivate the development of
open-source LMs specialized in evaluations. On the other hand, existing open
evaluator LMs exhibit critical shortcomings: 1) they issue scores that
significantly diverge from those assigned by humans, and 2) they lack the
flexibility to perform both direct assessment and pairwise ranking, the two
most prevalent forms of assessment. Additionally, they do not possess the
ability to evaluate based on custom evaluation criteria, focusing instead on
general attributes like helpfulness and harmlessness. To address these issues,
we introduce Prometheus 2, a more powerful evaluator LM than its predecessor
that closely mirrors human and GPT-4 judgements. Moreover, it is capable of
processing both direct assessment and pair-wise ranking formats grouped with
user-defined evaluation criteria. On four direct assessment benchmarks and four
pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and
agreement with humans and proprietary LM judges among all tested open evaluator
LMs. Our models, code, and data are all publicly available at
https://github.com/prometheus-eval/prometheus-eval. [COMMENTS]EMNLP 2024 (Main Conference) [LINK]http://arxiv.org/abs/2405.01535v2 [DATE]2024-12-05 03:23:17+08:00 [CATEGORIES]cs.CL
2024 Dec 04, Wed
AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark [AUTHORS]Abhay Gupta, Philip Meng, Ece Yurtseven, Sean O'Brien, Kevin Zhu [ABSTRACT]Detecting biases in natural language understanding (NLU) for African American
Vernacular English (AAVE) is crucial to developing inclusive natural language
processing (NLP) systems. To address dialect-induced performance discrepancies,
we introduce AAVENUE (AAVE Natural Language Understanding Evaluation),
a benchmark for evaluating large language model (LLM) performance on NLU tasks
in AAVE and Standard American English (SAE). AAVENUE builds upon and extends
existing benchmarks like VALUE, replacing deterministic syntactic and
morphological transformations with a more flexible methodology leveraging
LLM-based translation with few-shot prompting, improving performance across our
evaluation metrics when translating key tasks from the GLUE and SuperGLUE
benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs
and a comprehensive set of metrics including fluency, BARTScore, quality,
coherence, and understandability. Additionally, we recruit fluent AAVE speakers
to validate our translations for authenticity. Our evaluations reveal that LLMs
consistently perform better on SAE tasks than AAVE-translated versions,
underscoring inherent biases and highlighting the need for more inclusive NLP
models. We have open-sourced our source code on GitHub and created a website to
showcase our work at https://aavenue.live. [COMMENTS]Published at NLP4PI @ EMNLP 2024 [LINK]http://arxiv.org/abs/2408.14845v2 [DATE]2024-12-04 21:43:28+08:00 [CATEGORIES]cs.CL
Knowledge Mechanisms in Large Language Models: A Survey and Perspective [AUTHORS]Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang [ABSTRACT]Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial
for advancing towards trustworthy AGI. This paper reviews knowledge mechanism
analysis from a novel taxonomy including knowledge utilization and evolution.
Knowledge utilization delves into the mechanism of memorization, comprehension
and application, and creation. Knowledge evolution focuses on the dynamic
progression of knowledge within individual and group LLMs. Moreover, we discuss
what knowledge LLMs have learned, the reasons for the fragility of parametric
knowledge, and the potential dark knowledge (hypothesis) that will be
challenging to address. We hope this work can help understand knowledge in LLMs
and provide insights for future research. [COMMENTS]EMNLP 2024 Findings; 39 pages (v4) [LINK]http://arxiv.org/abs/2407.15017v4 [DATE]2024-12-04 17:54:59+08:00 [CATEGORIES]cs.CLcs.LG
Filtered Direct Preference Optimization [AUTHORS]Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu [ABSTRACT]Reinforcement learning from human feedback (RLHF) plays a crucial role in
aligning language models with human preferences. While the significance of
dataset quality is generally recognized, explicit investigations into its
impact within the RLHF framework, to our knowledge, have been limited. This
paper addresses the issue of text quality within the preference dataset by
focusing on direct preference optimization (DPO), an increasingly adopted
reward-model-free RLHF method. We confirm that text quality significantly
influences the performance of models optimized with DPO more than those
optimized with reward-model-based RLHF. Building on this new insight, we
propose an extension of DPO, termed filtered direct preference optimization
(fDPO). fDPO uses a trained reward model to monitor the quality of texts within
the preference dataset during DPO training. Samples of lower quality are
discarded based on comparisons with texts generated by the model being
optimized, resulting in a more accurate dataset. Experimental results
demonstrate that fDPO enhances the final model performance. Our code is
available at https://github.com/CyberAgentAILab/filtered-dpo. [COMMENTS]EMNLP 2024 [LINK]http://arxiv.org/abs/2404.13846v4 [DATE]2024-12-04 01:22:01+08:00 [CATEGORIES]cs.LGcs.CL
2024 Dec 03, Tue
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis [AUTHORS]Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang [ABSTRACT]Large language models (LLMs) are susceptible to a type of attack known as
jailbreaking, which misleads LLMs to output harmful contents. Although there
are diverse jailbreak attack strategies, there is no unified understanding of
why some methods succeed and others fail. This paper explores the behavior of
harmful and harmless prompts in the LLM's representation space to investigate
the intrinsic properties of successful jailbreak attacks. We hypothesize that
successful attacks share some similar properties: they are effective in moving
the representation of the harmful prompt in the direction of the harmless
prompts. We incorporate hidden representations into the objective of existing
jailbreak attacks to move the attacks along the acceptance direction, and
conduct experiments to validate the above hypothesis using the proposed
objective. We hope this study provides new insights into understanding how LLMs
understand harmfulness information. [COMMENTS]Accepted by EMNLP 2024 Main [LINK]http://arxiv.org/abs/2406.10794v3 [DATE]2024-12-03 05:48:47+08:00 [CATEGORIES]cs.CL
Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization [AUTHORS]Lei Xu, Mohammed Asad Karim, Saket Dingliwal, Aparna Elangovan [ABSTRACT]Large language models (LLMs) can generate fluent summaries across domains
using prompting techniques, reducing the need to train models for summarization
applications. However, crafting effective prompts that guide LLMs to generate
summaries with the appropriate level of detail and writing style remains a
challenge. In this paper, we explore the use of salient information extracted
from the source document to enhance summarization prompts. We show that adding
keyphrases in prompts can improve ROUGE F1 and recall, making the generated
summaries more similar to the reference and more complete. The number of
keyphrases can control the precision-recall trade-off. Furthermore, our
analysis reveals that incorporating phrase-level salient information is
superior to word- or sentence-level. However, the impact on hallucination is
not universally positive across LLMs. To conduct this analysis, we introduce
Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned
to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE
improvements across datasets and open-weight and proprietary LLMs without any
LLM customization. Our findings provide insights into leveraging salient
information in building prompt-based summarization systems. We release our code
at https://github.com/amazon-science/SigExt [COMMENTS]Accepted to EMNLP 2024 Industry Track. Code available at
https://github.com/amazon-science/SigExt [LINK]http://arxiv.org/abs/2410.02741v2 [DATE]2024-12-03 05:06:29+08:00 [CATEGORIES]cs.CLcs.LG
Large Language Models for Data Annotation and Synthesis: A Survey [AUTHORS]Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu [ABSTRACT]Data annotation and synthesis generally refers to the labeling or generating
of raw data with relevant information, which could be used for improving the
efficacy of machine learning models. The process, however, is labor-intensive
and costly. The emergence of advanced Large Language Models (LLMs), exemplified
by GPT-4, presents an unprecedented opportunity to automate the complicated
process of data annotation and synthesis. While existing surveys have
extensively covered LLM architecture, training, and general applications, we
uniquely focus on their specific utility for data annotation. This survey
contributes to three core aspects: LLM-Based Annotation Generation,
LLM-Generated Annotations Assessment, and LLM-Generated Annotations
Utilization. Furthermore, this survey includes an in-depth taxonomy of data
types that LLMs can annotate, a comprehensive review of learning strategies for
models utilizing LLM-generated annotations, and a detailed discussion of the
primary challenges and limitations associated with using LLMs for data
annotation and synthesis. Serving as a key guide, this survey aims to assist
researchers and practitioners in exploring the potential of the latest LLMs for
data annotation, thereby fostering future advancements in this critical field. [COMMENTS]Accepted to EMNLP 2024 Main [LINK]http://arxiv.org/abs/2402.13446v3 [DATE]2024-12-03 04:55:15+08:00 [CATEGORIES]cs.CL
AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [AUTHORS]Yifan Yang, Kai Zhen, Ershad Banijamal, Athanasios Mouchtaris, Zheng Zhang [ABSTRACT]Fine-tuning large language models (LLMs) has achieved remarkable performance
across various natural language processing tasks, yet it demands more and more
memory as model sizes keep growing. To address this issue, the recently
proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs
using only forward passes, thereby avoiding the need for a backpropagation
graph. However, significant performance drops and a high risk of divergence
have limited their widespread adoption. In this paper, we propose the Adaptive
Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed
to improve the performance and convergence of the ZO methods. To enhance
dimension-dependent ZO estimation accuracy, we introduce a fast-forward,
low-parameter tensorized adapter. To tackle the frequently observed divergence
issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number
schedule that guarantees convergence. Detailed theoretical analysis and
extensive experimental results on Roberta-Large and Llama-2-7B models
substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory
efficiency, and convergence speed. [COMMENTS]Accepted for publication in EMNLP 2024 [LINK]http://arxiv.org/abs/2406.18060v3 [DATE]2024-12-03 03:03:47+08:00 [CATEGORIES]cs.CLcs.LG
2024 Dec 02, Mon
GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Model [AUTHORS]Xuanchang Zhang, Zhuosheng Zhang, Hai Zhao [ABSTRACT]Despite the rapid progress of large language models (LLMs), their task
performance remains sensitive to prompt design. Recent studies have explored
leveraging the LLM itself as an optimizer to identify optimal prompts that
maximize task accuracy. However, when evaluating prompts, such approaches
heavily rely on elusive manually annotated gold labels to calculate task
accuracy for each candidate prompt, which hinders the widespread implementation
and generality. To overcome the limitation, this work proposes a gold
label-agnostic prompt evaluation (GLaPE) to alleviate dependence on gold
labels. Motivated by the observed correlation between self-consistency and the
accuracy of the answer, we adopt self-consistency as the initial evaluation
score. Subsequently, we refine the scores of prompts producing identical
answers to be mutually consistent. Experimental results show that GLaPE
provides reliable evaluations consistent with accuracy, even in the absence of
gold labels. Moreover, on six popular reasoning tasks, our GLaPE-based prompt
optimization yields effective prompts comparable to accuracy-based ones. The
code is publicly available at https://github.com/thunderous77/GLaPE. [COMMENTS]EMNLP 2024 [LINK]http://arxiv.org/abs/2402.02408v2 [DATE]2024-12-02 15:47:00+08:00 [CATEGORIES]cs.CLcs.LG
2024 Nov 30, Sat
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding [AUTHORS]Duo Zheng, Shijia Huang, Liwei Wang [ABSTRACT]The rapid advancement of Multimodal Large Language Models (MLLMs) has
significantly impacted various multimodal tasks. However, these models face
challenges in tasks that require spatial understanding within 3D environments.
Efforts to enhance MLLMs, such as incorporating point cloud features, have been
made, yet a considerable gap remains between the models' learned
representations and the inherent complexity of 3D scenes. This discrepancy
largely stems from the training of MLLMs on predominantly 2D data, which
restricts their effectiveness in comprehending 3D spaces. To address this
issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM,
for 3D scene understanding. By treating 3D scenes as dynamic videos and
incorporating 3D position encoding into these representations, our Video-3D LLM
aligns video representations with real-world spatial contexts more accurately.
Additionally, we have implemented a maximum coverage sampling technique to
optimize the balance between computational costs and performance efficiency.
Extensive experiments demonstrate that our model achieves state-of-the-art
performance on several 3D scene understanding benchmarks, including ScanRefer,
Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. [COMMENTS]14 pages, 4 figures [LINK]http://arxiv.org/abs/2412.00493v1 [DATE]2024-11-30 22:28:53+08:00 [CATEGORIES]cs.CL
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction [AUTHORS]Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, Adji Bousso Dieng [ABSTRACT]Large language models (LLMs) are increasingly being used in materials
science. However, little attention has been given to benchmarking and
standardized evaluation for LLM-based materials property prediction, which
hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for
evaluating the performance of LLMs in predicting the properties of crystalline
materials. LLM4Mat-Bench contains about 1.9M crystal structures in total,
collected from 10 publicly available materials data sources, and 45 distinct
properties. LLM4Mat-Bench features different input modalities: crystal
composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B
tokens in total for each modality, respectively. We use LLM4Mat-Bench to
fine-tune models with different sizes, including LLM-Prop and MatBERT, and
provide zero-shot and few-shot prompts to evaluate the property prediction
capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The
results highlight the challenges of general-purpose LLMs in materials science
and the need for task-specific predictive models and task-specific
instruction-tuned LLMs in materials property prediction. [COMMENTS]Accepted at NeurIPS 2024-AI4Mat Workshop. The Benchmark and code can
be found at https://github.com/vertaix/LLM4Mat-Bench [LINK]http://arxiv.org/abs/2411.00177v3 [DATE]2024-11-30 22:01:56+08:00 [CATEGORIES]cs.CL
Unlocking Structured Thinking in Language Models with Cognitive Prompting [AUTHORS]Oliver Kramer, Jill Baumann [ABSTRACT]We propose cognitive prompting as a novel approach to guide problem-solving
in large language models (LLMs) through structured, human-like cognitive
operations, such as goal clarification, decomposition, filtering, abstraction,
and pattern recognition. By employing systematic, step-by-step reasoning,
cognitive prompting enables LLMs to tackle complex, multi-step tasks more
efficiently. We introduce three variants: a deterministic sequence of cognitive
operations, a self-adaptive variant in which the LLM dynamically selects the
sequence of cognitive operations, and a hybrid variant that uses generated
correct solutions as few-shot chain-of-thought prompts. Experiments with LLaMA,
Gemma 2, and Qwen models, each in two sizes, on the arithmetic reasoning
benchmark GSM8K demonstrate that cognitive prompting significantly improves
performance compared to standard question answering. [COMMENTS]6 pages, submitted to ESANN 2025 [LINK]http://arxiv.org/abs/2410.02953v3 [DATE]2024-11-30 20:16:51+08:00 [CATEGORIES]cs.CL
ORAssistant: A Custom RAG-based Conversational Assistant for OpenROAD [AUTHORS]Aviral Kaintura, Palaniappan R, Shui Song Luar, Indira Iyer Almeida [ABSTRACT]Open-source Electronic Design Automation (EDA) tools are rapidly transforming
chip design by addressing key barriers of commercial EDA tools such as
complexity, costs, and access. Recent advancements in Large Language Models
(LLMs) have further enhanced efficiency in chip design by providing user
assistance across a range of tasks like setup, decision-making, and flow
automation. This paper introduces ORAssistant, a conversational assistant for
OpenROAD, based on Retrieval-Augmented Generation (RAG). ORAssistant aims to
improve the user experience for the OpenROAD flow, from RTL to GDSII, by providing
context-specific responses to common user queries, including installation,
command usage, flow setup, and execution, in prose format. Currently,
ORAssistant integrates OpenROAD, OpenROAD-flow-scripts, Yosys, OpenSTA, and
KLayout. The data model is built from publicly available documentation and
GitHub resources. The proposed architecture is scalable, supporting extensions
to other open-source tools, operating modes, and LLM models. We use Google
Gemini as the base LLM model to build and test ORAssistant. Early evaluation
results of the RAG-based model show notable improvements in performance and
accuracy compared to non-fine-tuned LLMs. [LINK]http://arxiv.org/abs/2410.03845v2 [DATE]2024-11-30 19:19:39+08:00 [CATEGORIES]cs.CL
Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino [AUTHORS]Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, Alham Fikri Aji, William Chandra Tjhi [ABSTRACT]Multilingual large language models (LLMs) today may not necessarily provide
culturally appropriate and relevant responses to their Filipino users. We
introduce Kalahi, a cultural LLM evaluation suite collaboratively created by
native Filipino speakers. It is composed of 150 high-quality, handcrafted and
nuanced prompts that test LLMs for generations that are relevant to shared
Filipino cultural knowledge and values. Strong LLM performance in Kalahi
indicates a model's ability to generate responses similar to what an average
Filipino would say or do in a given situation. We conducted experiments on LLMs
with multilingual and Filipino language support. Results show that Kalahi,
while trivial for Filipinos, is challenging for LLMs, with the best model
answering only 46.0% of the questions correctly compared to native Filipino
performance of 89.10%. Thus, Kalahi can be used to accurately and reliably
evaluate Filipino cultural representation in LLMs. [COMMENTS]Accepted for presentation at Paclic 38, 2024 [LINK]http://arxiv.org/abs/2409.15380v2 [DATE]2024-11-30 17:57:09+08:00 [CATEGORIES]cs.CL
Uncovering Safety Risks of Large Language Models through Concept Activation Vector [AUTHORS]Zhihao Xu, Ruixuan Huang, Changyu Chen, Xiting Wang [ABSTRACT]Despite careful safety alignment, current large language models (LLMs) remain
vulnerable to various attacks. To further unveil the safety risks of LLMs, we
introduce a Safety Concept Activation Vector (SCAV) framework, which
effectively guides the attacks by accurately interpreting LLMs' safety
mechanisms. We then develop an SCAV-guided attack method that can generate both
attack prompts and embedding-level attacks with automatically selected
perturbation hyperparameters. Both automatic and human evaluations demonstrate
that our attack method significantly improves the attack success rate and
response quality while requiring less training data. Additionally, we find that
our generated attack prompts may be transferable to GPT-4, and the
embedding-level attacks may also be transferred to other white-box LLMs whose
parameters are known. Our experiments further uncover the safety risks present
in current LLMs. For example, in our evaluation of seven open-source LLMs, we
observe an average attack success rate of 99.14%, based on the classic
keyword-matching criterion. Finally, we provide insights into the safety
mechanism of LLMs. The code is available at
https://github.com/SproutNan/AI-Safety_SCAV. [COMMENTS]10 pages, accepted at NeurIPS 2024 [LINK]http://arxiv.org/abs/2404.12038v5 [DATE]2024-11-30 16:52:29+08:00 [CATEGORIES]cs.CL
Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization [AUTHORS]Yuhang Song, Mario Gianni, Chenguang Yang, Kunyang Lin, Te-Chuan Chiu, Anh Nguyen, Chun-Yi Lee [ABSTRACT]This paper addresses the challenge of fine-grained alignment in
Vision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D
environments based on natural language instructions. Current approaches use
contrastive learning to align language with visual trajectory sequences.
Nevertheless, they encounter difficulties with fine-grained vision negatives.
To enhance cross-modal embeddings, we introduce a novel Bayesian
Optimization-based adversarial optimization framework for creating fine-grained
contrastive vision samples. To validate the proposed methodology, we conduct a
series of experiments to assess the effectiveness of the enriched embeddings on
fine-grained vision negatives. We conduct experiments on two common VLN
benchmarks, R2R and REVERIE, and the results demonstrate that these
embeddings benefit navigation, and can lead to a promising performance
enhancement. Our source code and trained models are available at:
https://anonymous.4open.science/r/FGVLN. [LINK]http://arxiv.org/abs/2411.14811v2 [DATE]2024-11-30 16:47:23+08:00 [CATEGORIES]cs.CLcs.LG
Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection [AUTHORS]Shanu Kumar, Saish Mendke, Karody Lubna Abdul Rahman, Santosh Kurasa, Parag Agrawal, Sandipan Dandapat [ABSTRACT]Chain-of-thought (CoT) prompting has significantly enhanced the capability of
large language models (LLMs) by structuring their reasoning processes. However,
existing methods face critical limitations: handcrafted demonstrations require
extensive human expertise, while trigger phrases are prone to inaccuracies. In
this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method,
a novel approach that improves CoT prompting by utilizing uncertainty estimates
to select effective demonstrations without needing access to model parameters.
Unlike traditional methods, ZEUS offers high sensitivity in distinguishing
between helpful and ineffective questions, ensuring more precise and reliable
selection. Our extensive evaluation shows that ZEUS consistently outperforms
existing CoT strategies across four challenging reasoning benchmarks,
demonstrating its robustness and scalability. [COMMENTS]Accepted in COLING 2025 [LINK]http://arxiv.org/abs/2412.00353v1 [DATE]2024-11-30 12:22:00+08:00 [CATEGORIES]cs.CL
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models [AUTHORS]Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven [ABSTRACT]Hallucination, the generation of factually incorrect content, is a growing
challenge in Large Language Models (LLMs). Existing detection and mitigation
methods are often isolated and insufficient for domain-specific needs, lacking
a standardized pipeline. This paper introduces THaMES (Tool for Hallucination
Mitigations and EvaluationS), an integrated framework and library addressing
this gap. THaMES offers an end-to-end solution for evaluating and mitigating
hallucinations in LLMs, featuring automated test set generation, multifaceted
benchmarking, and adaptable mitigation strategies. It automates test set
creation from any corpus, ensuring high data quality, diversity, and
cost-efficiency through techniques like batch processing, weighted sampling,
and counterfactual validation. THaMES assesses a model's ability to detect and
reduce hallucinations across various tasks, including text generation and
binary classification, applying optimal mitigation strategies like In-Context
Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient
Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base
of academic papers, political news, and Wikipedia reveal that commercial models
like GPT-4o benefit more from RAG than ICL, while open-weight models like
Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT
significantly enhances the performance of Llama-3.1-8B-Instruct in both
evaluation tasks. [COMMENTS]NeurIPS 2024 SoLaR (Socially Responsible Language Modelling Research)
Workshop [LINK]http://arxiv.org/abs/2409.11353v3 [DATE]2024-11-30 10:27:09+08:00 [CATEGORIES]cs.CL
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration [AUTHORS]Xin Guan, Nathaniel Demchak, Saloni Gupta, Ze Wang, Ediz Ertekin Jr., Adriano Koshiyama, Emre Kazim, Zekun Wu [ABSTRACT]The development of unbiased large language models is widely recognized as
crucial, yet existing benchmarks fall short in detecting biases due to limited
scope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the
first holistic benchmarking pipeline to address these problems. The pipeline
encompasses five core stages: scraping materials, assembling benchmarks,
generating responses, extracting numeric features, and diagnosing with
disparity metrics. SAGED includes metrics for max disparity, such as impact
ratio, and bias concentration, such as Max Z-scores. Noticing that assessment
tool bias and contextual bias in prompts can distort evaluation, SAGED
implements counterfactual branching and baseline calibration for mitigation.
For demonstration, we use SAGED on G20 Countries with popular 8b-level models
including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we
find that while Mistral and Qwen2 show lower max disparity and higher bias
concentration than Gemma2 and Llama3.1, all models are notably biased against
countries like Russia and (except for Qwen2) China. In further experiments that
have models role-play U.S. (vice-/former-) presidents, we see bias amplify
and shift in heterogeneous directions. Moreover, Qwen2 and Mistral do not
engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more
intensively than Biden and Harris, indicating role-playing performance bias in
these models. [COMMENTS]COLING 2025 Main Conference [LINK]http://arxiv.org/abs/2409.11149v4 [DATE]2024-11-30 10:21:25+08:00 [CATEGORIES]cs.CL
2024 Dec 06, Fri
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay [AUTHORS]Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, Limin Wang [ABSTRACT]Despite the remarkable performance of multimodal large language models
(MLLMs) across diverse tasks, the substantial training and inference costs
impede their advancement. The majority of computation stems from the
overwhelming volume of vision tokens processed by the transformer decoder. In
this paper, we propose to build efficient MLLMs by leveraging the
Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects
essential vision tokens to process while skipping redundant ones. However,
integrating MoD into MLLMs is non-trivial. To address the challenges of
training and inference stability as well as limited training data, we adapt the
MoD module with two novel designs: tanh-gated weight normalization (TanhNorm)
and symmetric token reweighting (STRing). Moreover, we observe that vision
tokens exhibit higher redundancy in deeper layers and thus design a progressive
ratio decay (PRD) strategy, which gradually reduces the token retention ratio
layer by layer, employing a shifted cosine schedule. This crucial design fully
unleashes the potential of MoD, significantly boosting the efficiency and
performance of our models. To validate the effectiveness of our approach, we
conduct extensive experiments with two baseline models across 14 benchmarks.
Our model, p-MoD, matches or even surpasses the performance of the baseline
models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and
77.7% GPU hours during training. [COMMENTS]Technical Report; Code released at https://github.com/MCG-NJU/p-MoD [LINK]http://arxiv.org/abs/2412.04449v1 [DATE]2024-12-06 02:58:03+08:00 [CATEGORIES]cs.CL
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation [AUTHORS]Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu [ABSTRACT]Recent developments in Large Language Models pre-trained on extensive corpora
have shown significant success in various natural language processing tasks
with minimal fine-tuning. This success offers new promise for robotics, which
has long been constrained by the high cost of action-labeled data. We ask:
given the abundant video data containing interaction-related knowledge
available as a rich "corpus", can a similar generative pre-training approach be
effectively applied to enhance robot learning? The key challenge is to identify
an effective representation for autoregressive pre-training that benefits robot
manipulation tasks. Inspired by the way humans learn new skills through
observing dynamic environments, we propose that effective robotic learning
should emphasize motion-related knowledge, which is closely tied to low-level
actions and is hardware-agnostic, facilitating the transfer of learned motions
to actual robot actions. To this end, we introduce Moto, which converts video
content into latent Motion Token sequences by a Latent Motion Tokenizer,
learning a bridging "language" of motion from videos in an unsupervised manner.
We pre-train Moto-GPT through motion token autoregression, enabling it to
capture diverse visual motion knowledge. After pre-training, Moto-GPT
demonstrates the promising ability to produce semantically interpretable motion
tokens, predict plausible motion trajectories, and assess trajectory
rationality through output likelihood. To transfer learned motion priors to
real robot actions, we implement a co-fine-tuning strategy that seamlessly
bridges latent motion token prediction and real robot control. Extensive
experiments show that the fine-tuned Moto-GPT exhibits superior robustness and
efficiency on robot manipulation benchmarks, underscoring its effectiveness in
transferring knowledge from video data to downstream visual manipulation tasks. [COMMENTS]Project released at: https://chenyi99.github.io/moto/ [LINK]http://arxiv.org/abs/2412.04445v1 [DATE]2024-12-06 02:57:04+08:00 [CATEGORIES]cs.CLcs.LG
CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing [AUTHORS]Yen-Ju Lu, Jing Liu, Thomas Thebaud, Laureano Moro-Velazquez, Ariya Rastrow, Najim Dehak, Jesus Villalba [ABSTRACT]We introduce Condition-Aware Self-Supervised Learning Representation
(CA-SSLR), a generalist conditioning model broadly applicable to various
speech-processing tasks. Compared to standard fine-tuning methods that optimize
for downstream models, CA-SSLR integrates language and speaker embeddings from
earlier layers, making the SSL model aware of the current language and speaker
context. This approach reduces the reliance on input audio features while
preserving the integrity of the base SSLR. CA-SSLR improves the model's
capabilities and demonstrates its generality on unseen tasks with minimal
task-specific tuning. Our method employs linear modulation to dynamically
adjust internal representations, enabling fine-grained adaptability without
significantly altering the original model behavior. Experiments show that
CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and
excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a
10% relative reduction in LID errors, a 37% improvement in ASR CER on the
ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating
its effectiveness. [COMMENTS]38th Conference on Neural Information Processing Systems (NeurIPS
2024) [LINK]http://arxiv.org/abs/2412.04425v1 [DATE]2024-12-06 02:51:10+08:00 [CATEGORIES]cs.CLcs.LG
SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models [AUTHORS]Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman [ABSTRACT]Despite the effectiveness of data selection for large language models (LLMs)
during pretraining and instruction fine-tuning phases, improving data
efficiency in supervised fine-tuning (SFT) for specialized domains poses
significant challenges due to the complexity of fine-tuning data. To bridge
this gap, we introduce an effective and scalable data selection method for SFT,
SmallToLarge (S2L), which leverages training trajectories from small models to
guide the data selection for larger models. We demonstrate through extensive
experiments that S2L significantly improves data efficiency in SFT for
mathematical problem-solving, reducing the training data to just 11% of the
original MathInstruct dataset (Yue et al., 2023) to match full dataset
performance while outperforming state-of-the-art data selection algorithms by
an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably,
selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most
challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et
al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset
(Johnson et al., 2016), S2L again outperforms training on the full dataset
using only 50% of the data. Notably, S2L can perform data selection using a
reference model 40x smaller than the target model, proportionally reducing the
cost of data selection. [LINK]http://arxiv.org/abs/2403.07384v2 [DATE]2024-12-06 02:47:47+08:00 [CATEGORIES]cs.CLcs.LG
CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels [AUTHORS]Lingxiao Wei, He Yan, Xiangju Lu, Junmin Zhu, Jun Wang, Wei Zhang [ABSTRACT]Large Language Models (LLMs) have been well-researched in many long-context
tasks. However, due to high annotation costs, high-quality long-context summary
datasets for training or evaluation are scarce, limiting further research. In
this work, we introduce CNNSum, a new multi-scale Chinese long-context novel
summarization benchmark comprising four subsets that cover lengths from
16k to 128k, with 695 samples in total and human-driven annotations.
We evaluate commercial and open-source models on CNNSum and conduct a detailed
analysis. Based on the observations, we further conduct fine-tuning exploration
with short-context summary data. In our study: (1) GPT-4o underperformed, due
to excessive subjective commentary. (2) Currently, long-context summarization
mainly relies on memory ability, and small LLMs with stable longer context lengths
are the most cost-effective. Using long data concatenated from short-context
summaries makes a significant improvement. (3) Prompt templates may cause a
large performance gap but can be mitigated through fine-tuning. (4) Fine-tuned
Chat or Instruction versions may harm the Base model, and further fine-tuning
cannot bridge the performance gap. (5) While models with RoPE base scaling exhibit
strong extrapolation potential, their performance may vary significantly when
combined with other interpolation methods, so careful selection is needed. (6)
CNNSum provides more reliable and insightful evaluation results than other
benchmarks. We release CNNSum to advance research in this field. [LINK]http://arxiv.org/abs/2412.02819v2 [DATE]2024-12-06 01:51:20+08:00 [CATEGORIES]cs.CL
Context-Informed Machine Translation of Manga using Multimodal Large Language Models [AUTHORS]Philip Lippmann, Konrad Skublicki, Joshua Tanner, Shonosuke Ishiwatari, Jie Yang [ABSTRACT]Due to the significant time and effort required for handcrafting
translations, most manga never leave the domestic Japanese market. Automatic
manga translation is a promising potential solution. However, it is a budding
and underdeveloped field and presents complexities even greater than those
found in standard translation due to the need to effectively incorporate visual
elements into the translation process to resolve ambiguities. In this work, we
investigate to what extent multimodal large language models (LLMs) can provide
effective manga translation, thereby assisting manga authors and publishers in
reaching wider audiences. Specifically, we propose a methodology that leverages
the vision component of multimodal LLMs to improve translation quality,
evaluate the impact of translation unit size and context length, and propose a
token-efficient approach for manga translation. Moreover, we introduce a new
evaluation dataset -- the first parallel Japanese-Polish manga translation
dataset -- as part of a benchmark to be used in future research. Finally, we
contribute an open-source software suite, enabling others to benchmark LLMs for
manga translation. Our findings demonstrate that our proposed methods achieve
state-of-the-art results for Japanese-English translation and set a new
standard for Japanese-Polish. [COMMENTS]COLING 2025 [LINK]http://arxiv.org/abs/2411.02589v2 [DATE]2024-12-06 01:41:48+08:00 [CATEGORIES]cs.CL
BhashaVerse : Translation Ecosystem for Indian Subcontinent Languages [AUTHORS]Vandan Mujadia, Dipti Misra Sharma [ABSTRACT]This paper focuses on developing translation models and related applications
for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj,
Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada,
Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili,
Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi,
Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu,
Telugu, and Urdu. Achieving this requires parallel and other types of corpora
for all 36 * 36 language pairs, addressing challenges like script variations,
phonetic differences, and syntactic diversity. For instance, languages like
Kashmiri and Sindhi, which use multiple scripts, demand script normalization
for alignment, while low-resource languages such as Khasi and Santali require
synthetic data augmentation to ensure sufficient coverage and quality.
To address these challenges, this work proposes strategies for corpus
creation by leveraging existing resources, developing parallel datasets,
generating domain-specific corpora, and utilizing synthetic data techniques.
Additionally, it evaluates machine translation across various dimensions,
including standard and discourse-level translation, domain-specific
translation, reference-based and reference-free evaluation, error analysis, and
automatic post-editing. By integrating these elements, the study establishes a
comprehensive framework to improve machine translation quality and enable
better cross-lingual communication in India's linguistically diverse ecosystem. [LINK]http://arxiv.org/abs/2412.04351v1 [DATE]2024-12-06 01:10:19+08:00 [CATEGORIES]cs.CL
Retrieval-Augmented Machine Translation with Unstructured Knowledge [AUTHORS]Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou [ABSTRACT]Retrieval-augmented generation (RAG) introduces additional information to
enhance large language models (LLMs). In machine translation (MT), previous
work typically retrieves in-context examples from paired MT corpora, or
domain-specific knowledge from knowledge graphs, to enhance models' MT ability.
However, a large amount of world knowledge is organized in unstructured
documents, and might not be fully paired across different languages. In this
paper, we study retrieval-augmented MT using unstructured documents.
Specifically, we build RAGtrans, the first benchmark to train and evaluate
LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples
collected via GPT-4o and human translators. Besides, documents from different
languages are also provided to supply the knowledge to these samples. Based on
RAGtrans, we further propose a multi-task training method to teach LLMs how to
use information from multilingual documents during their translation. The
method uses existing multilingual corpora to create auxiliary training
objectives without additional labeling requirements. Extensive experiments show
that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores. [LINK]http://arxiv.org/abs/2412.04342v1 [DATE]2024-12-06 01:00:32+08:00 [CATEGORIES]cs.CL
Densing Law of LLMs [AUTHORS]Chaojun Xiao, Jie Cai, Weilin Zhao, Guoyang Zeng, Xu Han, Zhiyuan Liu, Maosong Sun [ABSTRACT]Large Language Models (LLMs) have emerged as a milestone in artificial
intelligence, and their performance can improve as the model size increases.
However, this scaling brings great challenges to training and inference
efficiency, particularly for deploying LLMs in resource-constrained
environments, and the scaling trend is becoming increasingly unsustainable.
This paper introduces the concept of "capacity density" as a new
metric to evaluate the quality of the LLMs across different scales and
describes the trend of LLMs in terms of both effectiveness and efficiency. To
calculate the capacity density of a given target LLM, we first introduce a set
of reference models and develop a scaling law to predict the downstream
performance of these reference models based on their parameter sizes. We then
define the effective parameter size of the target LLM as the parameter
size required by a reference model to achieve equivalent performance, and
formalize the capacity density as the ratio of the effective parameter size to
the actual parameter size of the target LLM. Capacity density provides a
unified framework for assessing both model effectiveness and efficiency. Our
further analysis of recent open-source base LLMs reveals an empirical law (the
densing law) that the capacity density of LLMs grows exponentially over time.
More specifically, using some widely used benchmarks for evaluation, the
capacity density of LLMs doubles approximately every three months. The law
provides new perspectives to guide future LLM development, emphasizing the
importance of improving capacity density to achieve optimal results with
minimal computational overhead. [LINK]http://arxiv.org/abs/2412.04315v1 [DATE]2024-12-06 00:31:13+08:00 [CATEGORIES]cs.CL
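The ratio defined in the Densing Law abstract above can be illustrated with a short sketch. The function names and the example numbers are hypothetical, and the three-month doubling period is the approximate figure quoted in the abstract, not an exact constant.

```python
def capacity_density(effective_params: float, actual_params: float) -> float:
    """Ratio of effective parameter size to actual parameter size."""
    if actual_params <= 0:
        raise ValueError("actual_params must be positive")
    return effective_params / actual_params


def projected_density(density_now: float, months: float,
                      doubling_months: float = 3.0) -> float:
    """Extrapolate density assuming exponential growth with a fixed doubling period."""
    return density_now * 2.0 ** (months / doubling_months)


# Hypothetical example: a 2B-parameter model that performs like a 4B reference model.
print(capacity_density(effective_params=4e9, actual_params=2e9))  # 2.0

# Under the assumed three-month doubling, density quadruples after six months.
print(projected_density(1.0, months=6.0))  # 4.0
```

The sketch takes the effective parameter size as given; in the abstract that quantity comes from a scaling law fitted on reference models, which is not reproduced here.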
ALMA: Alignment with Minimal Annotation [AUTHORS]Michihiro Yasunaga, Leonid Shamis, Chunting Zhou, Andrew Cohen, Jason Weston, Luke Zettlemoyer, Marjan Ghazvininejad [ABSTRACT]Recent approaches to large language model (LLM) alignment typically require
millions of human annotations or rely on external aligned models for synthetic
data generation. This paper introduces ALMA: Alignment with Minimal Annotation,
demonstrating that effective alignment can be achieved using only 9,000 labeled
examples -- less than 1% of conventional approaches. ALMA generates large
amounts of high-quality synthetic alignment data through new techniques:
diverse prompt synthesis via few-shot learning, diverse response generation
with multiple model checkpoints, and judge (reward model) enhancement through
score aggregation and self-distillation. Using only a pretrained Llama3 base
model, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves
performance close to Llama3-Instruct across diverse alignment benchmarks (e.g.,
0.1% difference on AlpacaEval 2.0 score). These results are achieved with a
multi-round, self-bootstrapped data synthesis and training recipe that
continues to improve for 10 rounds, surpassing the typical 3-round ceiling of
previous methods. These results suggest that base models already possess
sufficient knowledge for effective alignment, and that synthetic data
generation methods can expose it. [LINK]http://arxiv.org/abs/2412.04305v1 [DATE]2024-12-06 00:26:31+08:00 [CATEGORIES]cs.CLcs.LG
Evolutionary Pre-Prompt Optimization for Mathematical Reasoning [AUTHORS]Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud [ABSTRACT]Recent advancements have highlighted that large language models (LLMs), when
given a small set of task-specific examples, demonstrate remarkable
proficiency, a capability that extends to complex reasoning tasks. In
particular, the combination of few-shot learning with the chain-of-thought
(CoT) approach has been pivotal in steering models towards more logically
consistent conclusions. This paper explores the optimization of example
selection for designing effective CoT pre-prompts and shows that the choice of
the optimization algorithm, typically in favor of comparison-based methods such
as evolutionary computation, significantly enhances efficacy and feasibility.
Specifically, thanks to a limited exploitative and overfitted optimization,
Evolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the
naive few-shot approach exceeding 10 absolute points in exact match scores on
benchmark datasets such as GSM8k and MathQA. These gains are consistent across
various contexts and are further amplified when integrated with
self-consistency (SC). [LINK]http://arxiv.org/abs/2412.04291v1 [DATE]2024-12-06 00:12:06+08:00 [CATEGORIES]cs.CL
A method to benchmark high-dimensional process drift detection [AUTHORS]Edgar Wolf, Tobias Windisch [ABSTRACT]Process curves are multivariate finite time series data coming from
manufacturing processes. This paper studies machine learning methods that detect drifts
in process curve datasets. A theoretic framework to synthetically generate
process curves in a controlled way is introduced in order to benchmark machine
learning algorithms for process drift detection. An evaluation score, called
the temporal area under the curve, is introduced, which allows quantifying how
well machine learning models unveil curves belonging to drift segments.
Finally, a benchmark study comparing popular machine learning approaches on
synthetic data generated with the introduced framework is presented that shows
that existing algorithms often struggle with datasets containing multiple drift
segments. [LINK]http://arxiv.org/abs/2409.03669v2 [DATE]2024-12-06 02:56:04+08:00 [CATEGORIES]cs.LG
FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning [AUTHORS]Pranab Sahoo, Ashutosh Tripathi, Sriparna Saha, Samrat Mondal [ABSTRACT]Federated Learning (FL) marks a transformative approach to distributed model
training by combining locally optimized models from various clients into a
unified global model. While FL preserves data privacy by eliminating
centralized storage, it encounters significant challenges such as performance
degradation, slower convergence, and reduced robustness of the global model due
to the heterogeneity in client data distributions. Among the various forms of
data heterogeneity, label skew emerges as a particularly formidable and
prevalent issue, especially in domains such as image classification. To address
these challenges, we begin with comprehensive experiments to pinpoint the
underlying issues in the FL training process. Based on our findings, we then
introduce an innovative dual-strategy approach designed to effectively resolve
these issues. First, we introduce an adaptive loss function for client-side
training, meticulously crafted to preserve previously acquired knowledge while
maintaining an optimal equilibrium between local optimization and global model
coherence. Secondly, we develop a dynamic aggregation strategy for aggregating
client models at the server. This approach adapts to each client's unique
learning patterns, effectively addressing the challenges of diverse data across
the network. Our comprehensive evaluation, conducted across three diverse
real-world datasets, coupled with theoretical convergence guarantees,
demonstrates the superior efficacy of our method compared to several
established state-of-the-art approaches. [LINK]http://arxiv.org/abs/2412.04416v1 [DATE]2024-12-06 02:42:29+08:00 [CATEGORIES]cs.LG
Efficient Task Grouping Through Samplewise Optimisation Landscape Analysis [AUTHORS]Anshul Thakur, Yichen Huang, Soheila Molaei, Yujiang Wang, David A. Clifton [ABSTRACT]Shared training approaches, such as multi-task learning (MTL) and
gradient-based meta-learning, are widely used in various machine learning
applications, but they often suffer from negative transfer, leading to
performance degradation in specific tasks. While several optimisation
techniques have been developed to mitigate this issue for pre-selected task
cohorts, identifying optimal task combinations for joint learning - known as
task grouping - remains underexplored and computationally challenging due to
the exponential growth in task combinations and the need for extensive training
and evaluation cycles. This paper introduces an efficient task grouping
framework designed to reduce these overwhelming computational demands of the
existing methods. The proposed framework infers pairwise task similarities
through a sample-wise optimisation landscape analysis, eliminating the need for
the shared model training required to infer task similarities in existing
methods. With task similarities acquired, a graph-based clustering algorithm is
employed to pinpoint near-optimal task groups, providing an approximate yet
efficient and effective solution to the originally NP-hard problem. Empirical
assessments conducted on 8 different datasets highlight the effectiveness of
the proposed framework, revealing a five-fold speed enhancement compared to
previous state-of-the-art methods. Moreover, the framework consistently
demonstrates comparable performance, confirming its remarkable efficiency and
effectiveness in task grouping. [COMMENTS]Under review at IEEE Transactions on Pattern Analysis and Machine
Intelligence [LINK]http://arxiv.org/abs/2412.04413v1 [DATE]2024-12-06 02:33:59+08:00 [CATEGORIES]cs.LG
Asynchronous Batch Bayesian Optimization with Pipelining Evaluations for Experimental Resource–constrained Conditions [AUTHORS]Yujin Taguchi, Yusuke Shibuya, Yusuke Hiki, Takashi Morikura, Takahiro G. Yamada, Akira Funahashi [ABSTRACT]Bayesian optimization is efficient even with a small amount of data and is
used in engineering and in science, including biology and chemistry. In
Bayesian optimization, a parameterized model with an uncertainty is fitted to
explain the experimental data, and then the model suggests parameters that
would most likely improve the results. Batch Bayesian optimization reduces the
processing time of optimization by parallelizing experiments. However, batch
Bayesian optimization cannot be applied if the number of parallelized
experiments is limited by the cost or scarcity of equipment; in such cases,
sequential methods require an unrealistic amount of time. In this study, we
developed pipelining Bayesian optimization (PipeBO) to reduce the processing
time of optimization even with a limited number of parallel experiments. PipeBO
was inspired by the pipelining of central processing unit architecture, which
divides computational tasks into multiple processes. PipeBO was designed to
achieve experiment parallelization by overlapping various processes of the
experiments. PipeBO uses the results of completed experiments to update the
parameters of running parallelized experiments. Using the Black-Box
Optimization Benchmarking, which consists of 24 benchmark functions, we
compared PipeBO with the sequential Bayesian optimization methods. For 20 of
the 24 functions, PipeBO reduced the average processing time of optimization to
about 56% for experiments consisting of two processes, and to even less for
those with more processes. Overall, PipeBO parallelizes Bayesian
optimization in the resource-constrained settings so that efficient
optimization can be achieved. [LINK]http://arxiv.org/abs/2412.04392v1 [DATE]2024-12-06 02:06:09+08:00 [CATEGORIES]cs.LG
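The CPU-pipelining analogy in the PipeBO abstract above can be made concrete with an idealized timing sketch. It models scheduling only, under the assumption that every experiment has the same number of equal-length stages and that one experiment occupies each stage at a time; it does not model the Bayesian optimization itself, and the function names are hypothetical.

```python
def sequential_time(num_experiments: int, num_stages: int,
                    stage_time: float = 1.0) -> float:
    """Total time when each experiment finishes before the next one starts."""
    return num_experiments * num_stages * stage_time


def pipelined_time(num_experiments: int, num_stages: int,
                   stage_time: float = 1.0) -> float:
    """Total time when a new experiment enters the first stage as soon as it is free."""
    if num_experiments == 0:
        return 0.0
    return (num_experiments + num_stages - 1) * stage_time


# Ten two-stage experiments: 20 time units sequentially, 11 when pipelined.
seq = sequential_time(10, 2)
pipe = pipelined_time(10, 2)
print(seq, pipe, pipe / seq)  # 20.0 11.0 0.55
```

In this idealized model the ratio approaches 1/p for p stages as the number of experiments grows, which is a bound on scheduling time and not a prediction of the measured results reported in the abstract.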
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding [AUTHORS]Yuqi Wu, Wenzhao Zheng, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu [ABSTRACT]3D occupancy prediction provides a comprehensive description of the
surrounding scenes and has become an essential task for 3D perception. Most
existing methods focus on offline perception from one or a few views and cannot
be applied to embodied agents, which must gradually perceive the scene
through progressive embodied exploration. In this paper, we formulate an
embodied 3D occupancy prediction task to target this practical scenario and
propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize
the global scene with uniform 3D semantic Gaussians and progressively update
local regions observed by the embodied agent. For each update, we extract
semantic and structural features from the observed image and efficiently
incorporate them via deformable cross-attention to refine the regional
Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global
3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown
(i.e., uniformly distributed) environment and maintains an explicit global
memory of it with 3D Gaussians. It gradually gains knowledge through local
refinement of regional Gaussians, which is consistent with how humans
understand new scenes through embodied exploration. We reorganize an
EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the
evaluation of the embodied 3D occupancy prediction task. Experiments
demonstrate that our EmbodiedOcc outperforms existing local prediction methods
and accomplishes the embodied occupancy prediction with high accuracy and
strong expandability. Our code is available at:
https://github.com/YkiWu/EmbodiedOcc. [COMMENTS]Code: https://github.com/YkiWu/EmbodiedOcc [LINK]http://arxiv.org/abs/2412.04380v1 [DATE]2024-12-06 01:57:09+08:00 [CATEGORIES]cs.LG
Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting [AUTHORS]Edoardo Cetin, Ahmed Touati, Yann Ollivier [ABSTRACT]The forward-backward representation (FB) is a recently proposed framework
(Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation
models (BFMs) that aim at providing zero-shot efficient policies for any new
task specified in a given reinforcement learning (RL) environment, without
training for each new task. Here we address two core limitations of FB model
training. First, FB, like all successor-feature-based methods, relies on a
linear encoding of tasks: at test time, each new reward function is linearly
projected onto a fixed set of pre-trained features. This limits expressivity as
well as precision of the task representation. We break the linearity limitation
by introducing auto-regressive features for FB, which let fine-grained task
features depend on coarser-grained task information. This can represent
arbitrary nonlinear task encodings, thus significantly increasing expressivity
of the FB framework. Second, it is well-known that training RL agents from
offline datasets often requires specific techniques. We show that FB works well
together with such offline RL techniques, by adapting techniques from (Nair et
al., 2020b; Cetin et al., 2024) for FB. This is necessary to get non-flatlining
performance in some datasets, such as DMC Humanoid. As a result, we produce
efficient FB BFMs for a number of new environments. Notably, in the D4RL
locomotion benchmark, the generic FB agent matches the performance of standard
single-task offline agents (IQL, XQL). In many setups, the offline techniques
are needed to get any decent performance at all. The auto-regressive features
have a positive but moderate impact, concentrated on tasks requiring spatial
precision and task generalization beyond the behaviors represented in the
trainset. [LINK]http://arxiv.org/abs/2412.04368v1 [DATE]2024-12-06 01:36:22+08:00 [CATEGORIES]cs.LG
Machine Theory of Mind for Autonomous Cyber-Defence [AUTHORS]Luke Swaby, Matthew Stewart, Daniel Harrold, Chris Willis, Gregory Palmer [ABSTRACT]Intelligent autonomous agents hold much potential for the domain of
cyber-security. However, due to many state-of-the-art approaches relying on
uninterpretable black-box models, there is growing demand for methods that
offer stakeholders clear and actionable insights into their latent beliefs and
motivations. To address this, we evaluate Theory of Mind (ToM) approaches for
Autonomous Cyber Operations. Upon learning a robust prior, ToM models can
predict an agent's goals, behaviours, and contextual beliefs given only a
handful of past behaviour observations. In this paper, we introduce a novel
Graph Neural Network (GNN)-based ToM architecture tailored for cyber-defence,
Graph-In, Graph-Out (GIGO)-ToM, which can accurately predict both the targets
and attack trajectories of adversarial cyber agents over arbitrary computer
network topologies. To evaluate the latter, we propose a novel extension of the
Wasserstein distance for measuring the similarity of graph-based probability
distributions. Whereas the standard Wasserstein distance lacks a fixed
reference scale, we introduce a graph-theoretic normalization factor that
enables a standardized comparison between networks of different sizes. We
furnish this metric, which we term the Network Transport Distance (NTD), with a
weighting function that emphasizes predictions according to custom node
features, allowing network operators to explore arbitrary strategic
considerations. Benchmarked against a Graph-In, Dense-Out (GIDO)-ToM
architecture in an abstract cyber-defence environment, our empirical
evaluations show that GIGO-ToM can accurately predict the goals and behaviours
of various unseen cyber-attacking agents across a range of network topologies,
as well as learn embeddings that can effectively characterize their policies. [COMMENTS]29 pages, 17 figures, 12 tables [LINK]http://arxiv.org/abs/2412.04367v1 [DATE]2024-12-06 01:35:29+08:00 [CATEGORIES]cs.LG
Approximate Top-$k$ for Increased Parallelism [AUTHORS]Oscar Key, Luka Ribar, Alberto Cattaneo, Luke Hudlass-Galley, Douglas Orr [ABSTRACT]We present an evaluation of bucketed approximate top-$k$ algorithms.
Computing top-$k$ exactly suffers from limited parallelism, because the $k$
largest values must be aggregated along the vector, and is thus not well suited to
computation on highly-parallel machine learning accelerators. By relaxing the
requirement that the top-$k$ is exact, bucketed algorithms can dramatically
increase the parallelism available by independently computing many smaller
top-$k$ operations. We explore the design choices of this class of algorithms
using both theoretical analysis and empirical evaluation on downstream tasks.
Our motivating examples are sparsity algorithms for language models, which
often use top-$k$ to select the most important parameters or activations. We
also release a fast bucketed top-$k$ implementation for PyTorch. [LINK]http://arxiv.org/abs/2412.04358v1 [DATE]2024-12-06 01:17:28+08:00 [CATEGORIES]cs.LG
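The bucketed idea in the approximate top-$k$ abstract above can be sketched in a few lines of plain Python. This is not the released PyTorch implementation: the interleaved bucket assignment and the equal split of $k$ across buckets are assumptions made for illustration, and `bucketed_top_k` is a hypothetical name.

```python
import heapq
from typing import List


def exact_top_k(values: List[float], k: int) -> List[float]:
    """Exact top-k over the whole vector, in descending order."""
    return heapq.nlargest(k, values)


def bucketed_top_k(values: List[float], k: int, num_buckets: int) -> List[float]:
    """Approximate top-k: take the k / num_buckets largest values of each bucket.

    Element i is assigned to bucket i % num_buckets (an assumed layout).
    Each bucket is processed independently, so the loop could run in parallel.
    """
    if k % num_buckets != 0:
        raise ValueError("k must be divisible by num_buckets in this sketch")
    per_bucket = k // num_buckets
    selected: List[float] = []
    for b in range(num_buckets):
        bucket = values[b::num_buckets]
        selected.extend(heapq.nlargest(per_bucket, bucket))
    return sorted(selected, reverse=True)


values = [9.0, 1.0, 8.0, 2.0]
print(exact_top_k(values, 2))        # [9.0, 8.0]
print(bucketed_top_k(values, 2, 2))  # [9.0, 2.0]
```

The example shows the trade-off: when two of the largest values land in the same bucket, the bucketed result misses one of them, which is the approximation accepted in exchange for independent per-bucket work.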
ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation [AUTHORS]Dayoung Gong, Suha Kwak, Minsu Cho [ABSTRACT]Temporal action segmentation and long-term action anticipation are two
popular vision tasks for the temporal analysis of actions in videos. Despite
apparent relevance and potential complementarity, these two problems have been
investigated as separate and distinct tasks. In this work, we tackle these two
problems, action segmentation and action anticipation, jointly using a unified
diffusion model dubbed ActFusion. The key idea to unification is to train the
model to effectively handle both visible and invisible parts of the sequence in
an integrated manner; the visible part is for temporal segmentation, and the
invisible part is for future anticipation. To this end, we introduce a new
anticipative masking strategy during training in which a late part of the video
frames is masked as invisible, and learnable tokens replace these frames to
learn to predict the invisible future. Experimental results demonstrate the
bi-directional benefits between action segmentation and anticipation. ActFusion
achieves the state-of-the-art performance across the standard benchmarks of 50
Salads, Breakfast, and GTEA, outperforming task-specific models in both of the
two tasks with a single unified model through joint learning. [COMMENTS]Accepted to NeurIPS 2024 [LINK]http://arxiv.org/abs/2412.04353v1 [DATE]2024-12-06 01:12:35+08:00 [CATEGORIES]cs.LG
A Fisher-Rao gradient flow for entropy-regularised Markov decision processes in Polish spaces [AUTHORS]Bekzhan Kerimkulov, James-Michael Leahy, David Siska, Lukasz Szpruch, Yufei Zhang [ABSTRACT]We study the global convergence of a Fisher-Rao policy gradient flow for
infinite-horizon entropy-regularised Markov decision processes with Polish
state and action space. The flow is a continuous-time analogue of a policy
mirror descent method. We establish the global well-posedness of the gradient
flow and demonstrate its exponential convergence to the optimal policy.
Moreover, we prove the flow is stable with respect to gradient evaluation,
offering insights into the performance of a natural policy gradient flow with
log-linear policy parameterisation. To overcome challenges stemming from the
lack of the convexity of the objective function and the discontinuity arising
from the entropy regulariser, we leverage the performance difference lemma and
the duality relationship between the gradient and mirror descent flows. Our
analysis provides a theoretical foundation for developing various discrete
policy gradient algorithms. [COMMENTS]add discretizations of gradient flow and their convergence analysis [LINK]http://arxiv.org/abs/2310.02951v2 [DATE]2024-12-06 00:35:46+08:00 [CATEGORIES]cs.LG
Enhancing Novel Object Detection via Cooperative Foundational Models [AUTHORS]Rohit Bharadwaj, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan [ABSTRACT]In this work, we address the challenging and emergent problem of novel object
detection (NOD), focusing on the accurate detection of both known and novel
object categories during inference. Traditional object detection algorithms are
inherently closed-set, limiting their capability to handle NOD. We present a
novel approach to transform existing closed-set detectors into open-set
detectors. This transformation is achieved by leveraging the complementary
strengths of pre-trained foundational models, specifically CLIP and SAM,
through our cooperative mechanism. Furthermore, by integrating this mechanism
with state-of-the-art open-set detectors such as GDINO, we establish new
benchmarks in object detection performance. Our method achieves 17.42 mAP in
novel object detection and 42.08 mAP for known objects on the challenging LVIS
dataset. Adapting our approach to the COCO OVD split, we surpass the current
state-of-the-art by a margin of 7.2 $\text{AP}_{50}$ for novel classes. Our
code is available at https://rohit901.github.io/coop-foundation-models/ . [COMMENTS]Accepted at WACV 2025 [LINK]http://arxiv.org/abs/2311.12068v3 [DATE]2024-12-06 00:34:21+08:00 [CATEGORIES]cs.LG
The Tile: A 2D Map of Ranking Scores for Two-Class Classification [AUTHORS]Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck [ABSTRACT]In the computer vision and machine learning communities, as well as in many
other research domains, rigorous evaluation of any new method, including
classifiers, is essential. One key component of the evaluation process is the
ability to compare and rank methods. However, ranking classifiers and
accurately comparing their performances, especially when taking
application-specific preferences into account, remains challenging. For
instance, commonly used evaluation tools like Receiver Operating Characteristic
(ROC) and Precision/Recall (PR) spaces display performances based on two
scores. Hence, they are inherently limited in their ability to compare
classifiers across a broader range of scores and lack the capability to
establish a clear ranking among classifiers. In this paper, we present a novel
versatile tool, named the Tile, that organizes an infinity of ranking scores in
a single 2D map for two-class classifiers, including common evaluation scores
such as the accuracy, the true positive rate, the positive predictive value,
Jaccard's coefficient, and all F-beta scores. Furthermore, we study the
properties of the underlying ranking scores, such as the influence of the
priors or the correspondences with the ROC space, and depict how to
characterize any other score by comparing them to the Tile. Overall, we
demonstrate that the Tile is a powerful tool that effectively captures all the
rankings in a single visualization and allows interpreting them. [LINK]http://arxiv.org/abs/2412.04309v1 [DATE]2024-12-06 00:27:59+08:00 [CATEGORIES]cs.LG
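As a rough illustration of the scores named in the abstract above (not the Tile itself), the following Python sketch computes accuracy, true positive rate, positive predictive value, Jaccard's coefficient, and an F-beta score from the four confusion-matrix counts using their standard textbook definitions. The function name `ranking_scores` and the two example classifiers are hypothetical, and the sketch assumes all denominators are nonzero.

```python
def ranking_scores(tp, fp, fn, tn, beta=1.0):
    """Standard two-class scores from confusion-matrix counts.

    Hypothetical helper for illustration; assumes nonzero denominators.
    """
    b2 = beta ** 2
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "tpr": tp / (tp + fn),            # true positive rate (recall)
        "ppv": tp / (tp + fp),            # positive predictive value (precision)
        "jaccard": tp / (tp + fp + fn),   # Jaccard's coefficient
        "f_beta": (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp),
    }


# Two made-up classifiers evaluated on the same 100 samples.
a = ranking_scores(tp=40, fp=10, fn=20, tn=30)
b = ranking_scores(tp=55, fp=30, fn=5, tn=10)

# The ranking depends on the score: A wins on accuracy and PPV,
# while B wins on TPR and F1.
for name in a:
    winner = "A" if a[name] > b[name] else "B"
    print(f"{name}: A={a[name]:.3f} B={b[name]:.3f} -> {winner}")
```

With these invented counts the two classifiers swap places depending on the score chosen, which is the ranking ambiguity the abstract describes.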
Structure-Aware Stylized Image Synthesis for Robust Medical Image Segmentation [AUTHORS]Jie Bao, Zhixin Zhou, Wen Jung Li, Rui Luo [ABSTRACT]Accurate medical image segmentation is essential for effective diagnosis and
treatment planning but is often challenged by domain shifts caused by
variations in imaging devices, acquisition conditions, and patient-specific
attributes. Traditional domain generalization methods typically require
inclusion of parts of the test domain within the training set, which is not
always feasible in clinical settings with limited diverse data. Additionally,
although diffusion models have demonstrated strong capabilities in image
generation and style transfer, they often fail to preserve the critical
structural information necessary for precise medical analysis. To address these
issues, we propose a novel medical image segmentation method that combines
diffusion models and Structure-Preserving Network for structure-aware one-shot
image stylization. Our approach effectively mitigates domain shifts by
transforming images from various sources into a consistent style while
maintaining the location, size, and shape of lesions. This ensures robust and
accurate segmentation even when the target domain is absent from the training
data. Experimental evaluations on colonoscopy polyp segmentation and skin
lesion segmentation datasets show that our method enhances the robustness and
accuracy of segmentation models, achieving superior performance metrics
compared to baseline models without style transfer. This structure-aware
stylization framework offers a practical solution for improving medical image
segmentation across diverse domains, facilitating more reliable clinical
diagnoses. [LINK]http://arxiv.org/abs/2412.04296v1 [DATE]2024-12-06 00:15:32+08:00 [CATEGORIES]cs.LG
On Multi-Agent Inverse Reinforcement Learning [AUTHORS]Till Freihaut, Giorgia Ramponi [ABSTRACT]In multi-agent systems, the agent behavior is highly influenced by its
utility function, as these utilities shape both individual goals as well as
interactions with the other agents. Inverse Reinforcement Learning (IRL) is a
well-established approach to inferring the utility function by observing an
expert behavior within a given environment. In this paper, we extend the IRL
framework to the multi-agent setting, assuming that we observe agents
following Nash Equilibrium (NE) policies. We theoretically investigate the set
of utilities that explain the behavior of NE experts. Specifically, we provide
an explicit characterization of the feasible reward set and analyze how errors
in estimating the transition dynamics and expert behavior impact the recovered
rewards. Building on these findings, we provide the first sample complexity
analysis for the multi-agent IRL problem. Finally, we provide a numerical
evaluation of our theoretical results. [COMMENTS]Currently under review [LINK]http://arxiv.org/abs/2411.15046v2 [DATE]2024-12-06 00:04:02+08:00 [CATEGORIES]cs.LG
2024 Dec 05, Thu
Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic [AUTHORS]Zaid Alyafeai, Michael Pieler, Hannah Teufel, Jonathan Tow, Marco Bellagente, Duy Phung, Nikhil Pinnaparaju, Reshinth Adithyan, Paulo Rocha, Maksym Zhuravinskyi, Carlos Riquelme [ABSTRACT]Large Language Models (LLMs) have shown impressive results in multiple
domains of natural language processing (NLP) but are mainly focused on the
English language. Recently, more LLMs have incorporated a larger proportion of
multilingual text to represent low-resource languages. In Arabic NLP, several
Arabic-centric LLMs have shown remarkable results on multiple benchmarks in the
past two years. However, most Arabic LLMs have more than 7 billion parameters,
which increases their hardware requirements and inference latency, when
compared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base
and chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable
LM 1.6B chat model achieves impressive results on several benchmarks, beating
multiple models with up to 8x the parameters. In addition, we show the benefit
of mixing in synthetic instruction tuning data by augmenting our fine-tuning
data with a large synthetic dialogue dataset. [LINK]http://arxiv.org/abs/2412.04277v1 [DATE]2024-12-05 23:59:29+08:00 [CATEGORIES]cs.CL
CoSy: Evaluating Textual Explanations of Neurons [AUTHORS]Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M. -C. Höhne, Kirill Bykov [ABSTRACT]A crucial aspect of understanding the complex nature of Deep Neural Networks
(DNNs) is the ability to explain learned concepts within their latent
representations. While methods exist to connect neurons to human-understandable
textual descriptions, evaluating the quality of these explanations is
challenging due to the lack of a unified quantitative approach. We introduce
CoSy (Concept Synthesis), a novel, architecture-agnostic framework for
evaluating textual explanations of latent neurons. Given textual explanations,
our proposed framework uses a generative model conditioned on textual input to
create data points representing the explanations. By comparing the neuron's
response to these generated data points and control data points, we can
estimate the quality of the explanation. We validate our framework through
sanity checks and benchmark various neuron description methods for Computer
Vision tasks, revealing significant differences in quality. [COMMENTS]10 pages, 5 figures [LINK]http://arxiv.org/abs/2405.20331v2 [DATE]2024-12-05 23:48:24+08:00 [CATEGORIES]cs.LG cs.CL
Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier [AUTHORS]John Dang, Shivalika Singh, Daniel D'souza, Arash Ahmadian, Alejandro Salamanca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, Sandra Kublik, Meor Amer, Viraat Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-Berliac, Acyr Locatelli, Hangyu Lin, Dwarak Talupuru, Bharat Venkitesh, David Cairuz, Bowen Yang, Tim Chung, Wei-Yin Ko, Sylvie Shang Shi, Amir Shukayev, Sammie Bae, Aleksandra Piktus, Roman Castagné, Felipe Cruz-Salinas, Eddie Kim, Lucas Crawhall-Stein, Adrien Morisot, Sudip Roy, Phil Blunsom, Ivan Zhang, Aidan Gomez, Nick Frosst, Marzieh Fadaee, Beyza Ermis, Ahmet Üstün, Sara Hooker [ABSTRACT]We introduce the Aya Expanse model family, a new generation of 8B and 32B
parameter multilingual language models, aiming to address the critical
challenge of developing highly performant multilingual models that match or
surpass the capabilities of monolingual models. By leveraging several years of
research at Cohere For AI and Cohere, including advancements in data arbitrage,
multilingual preference training, and model merging, Aya Expanse sets a new
state-of-the-art in multilingual performance. Our evaluations on the
Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya
Expanse 8B and 32B outperform leading open-weight models in their respective
parameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to
a 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model
with twice as many parameters, achieving a 54.0% win-rate. In this short
technical report, we present extended evaluation results for the Aya Expanse
model family and release their open-weights, together with a new multilingual
evaluation dataset m-ArenaHard. [LINK]http://arxiv.org/abs/2412.04261v1 [DATE]2024-12-05 23:41:06+08:00 [CATEGORIES]cs.CL
CLINICSUM: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations [AUTHORS]Subash Neupane, Himanshu Tripathi, Shaswata Mitra, Sean Bozorgzad, Sudip Mittal, Shahram Rahimi, Amin Amirlatifi [ABSTRACT]This paper presents ClinicSum, a novel framework designed to automatically
generate clinical summaries from patient-doctor conversations. It utilizes a
two-module architecture: a retrieval-based filtering module that extracts
Subjective, Objective, Assessment, and Plan (SOAP) information from
conversation transcripts, and an inference module powered by fine-tuned
Pre-trained Language Models (PLMs), which leverage the extracted SOAP data to
generate abstracted clinical summaries. To fine-tune the PLM, we created a
training dataset consisting of 1,473 conversation-summary pairs by
consolidating two publicly available datasets, FigShare and MTS-Dialog, with
ground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum's
effectiveness is evaluated through both automatic metrics (e.g., ROUGE,
BERTScore) and expert human assessments. Results show that ClinicSum
outperforms state-of-the-art PLMs, demonstrating superior precision, recall,
and F-1 scores in automatic evaluations and receiving high preference from SMEs
in human assessment, making it a robust solution for automated clinical
summarization. [COMMENTS]Accepted at the 2024 IEEE International Conference on Big Data
Workshop on Big Data and AI for Healthcare [LINK]http://arxiv.org/abs/2412.04254v1 [DATE]2024-12-05 23:34:02+08:00 [CATEGORIES]cs.CL
Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots [AUTHORS]Maria Paola Priola [ABSTRACT]I combine detection and mitigation techniques to address hallucinations in
Large Language Models (LLMs). Mitigation is achieved in a question-answering
Retrieval-Augmented Generation (RAG) framework while detection is obtained by
introducing the Negative Missing Information Scoring System (NMISS), which
accounts for contextual relevance in responses. While RAG mitigates
hallucinations by grounding answers in external data, NMISS refines the
evaluation by identifying cases where traditional metrics incorrectly flag
contextually accurate responses as hallucinations. I use Italian health news
articles as context to evaluate LLM performance. Results show that Gemma2 and
GPT-4 outperform the other models, with GPT-4 producing answers closely aligned
with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral,
benefit significantly from NMISS, highlighting their ability to provide richer
contextual information. This combined approach offers new insights into the
reduction and more accurate assessment of hallucinations in LLMs, with
applications in real-world healthcare tasks and other domains. [LINK]http://arxiv.org/abs/2412.04235v1 [DATE]2024-12-05 23:11:12+08:00 [CATEGORIES]cs.CL
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild [AUTHORS]Xinyu Zhao, Guoheng Sun, Ruisi Cai, Yukun Zhou, Pingzhi Li, Peihao Wang, Bowen Tan, Yexiao He, Li Chen, Yi Liang, Beidi Chen, Binhang Yuan, Hongyi Wang, Ang Li, Zhangyang Wang, Tianlong Chen [ABSTRACT]As Large Language Models (LLMs) excel across tasks and specialized domains,
scaling LLMs based on existing models has garnered significant attention, which
faces the challenge of decreasing performance when combining disparate models.
Various techniques have been proposed for the aggregation of pre-trained LLMs,
including model merging, Mixture-of-Experts, and stacking. Despite their
merits, a comprehensive comparison and synergistic application of them to a
diverse model zoo is yet to be adequately addressed. In light of this research
gap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First,
our work starts with a benchmarking of existing LLM scaling techniques,
especially selective merging, and variants of mixture. Utilizing the insights
from the benchmark results, we formulate an optimal strategy for the selection
and aggregation of a heterogeneous model zoo characterizing different
architectures and initialization. Our methodology involves the clustering of
mergeable models and optimal merging strategy selection, and the integration of
clusters through a model mixture. Finally, evidenced by our experiments on a
diverse Llama-2-based model zoo, Model-GLUE shows an average performance
enhancement of 5.61%, achieved without additional training. Codes are available
at: https://github.com/Model-GLUE/Model-GLUE. [COMMENTS]24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks
Track [LINK]http://arxiv.org/abs/2410.05357v2 [DATE]2024-12-05 23:08:56+08:00 [CATEGORIES]cs.LG cs.CL
Agent-OM: Leveraging LLM Agents for Ontology Matching [AUTHORS]Zhangcheng Qiang, Weiqing Wang, Kerry Taylor [ABSTRACT]Ontology matching (OM) enables semantic interoperability between different
ontologies and resolves their conceptual heterogeneity by aligning related
entities. OM systems currently have two prevailing design paradigms:
conventional knowledge-based expert systems and newer machine learning-based
predictive systems. While large language models (LLMs) and LLM agents have
revolutionised data engineering and have been applied creatively in many
domains, their potential for OM remains underexplored. This study introduces a
novel agent-powered LLM-based design paradigm for OM systems. With
consideration of several specific challenges in leveraging LLM agents for OM,
we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
consisting of two Siamese agents for retrieval and matching, with a set of
simple OM tools. Our framework is implemented in a proof-of-concept system.
Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks
over state-of-the-art OM systems show that our system can achieve results very
close to the long-standing best performance on simple OM tasks and can
significantly improve the performance on complex and few-shot OM tasks. [COMMENTS]14 pages, 13 figures, 4 tables [LINK]http://arxiv.org/abs/2312.00326v4 [DATE]2024-12-05 22:45:05+08:00 [CATEGORIES]cs.CL
AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic [AUTHORS]Nathaniel R. Robinson, Shahd Abdelmoneim, Kelly Marchisio, Sebastian Ruder [ABSTRACT]Dialectal Arabic (DA) varieties are under-served by language technologies,
particularly large language models (LLMs). This trend threatens to exacerbate
existing social inequalities and limits language modeling applications, yet the
research community lacks operationalized LLM performance measurements in DA. We
present a method that comprehensively evaluates LLM fidelity, understanding,
quality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA
varieties across these four dimensions and provide best practice
recommendations. Our evaluation suggests that LLMs do not produce DA as well as
they understand it, but does not suggest deterioration in quality when they do.
Further analysis suggests that current post-training can degrade DA
capabilities, that few-shot examples can overcome this and other LLM
deficiencies, and that otherwise no measurable features of input text correlate
well with LLM DA performance. [COMMENTS]Pre-print [LINK]http://arxiv.org/abs/2412.04193v1 [DATE]2024-12-05 22:33:00+08:00 [CATEGORIES]cs.CL
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models [AUTHORS]Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi [ABSTRACT]Today's most advanced vision-language models (VLMs) remain proprietary. The
strongest open-weight models rely heavily on synthetic data from proprietary
VLMs to achieve good performance, effectively distilling these closed VLMs into
open ones. As a result, the community has been missing foundational knowledge
about how to build performant VLMs from scratch. We present Molmo, a new family
of VLMs that are state-of-the-art in their class of openness. Our key
contribution is a collection of new datasets called PixMo, including a dataset
of highly detailed image captions for pre-training, a free-form image Q&A
dataset for fine-tuning, and an innovative 2D pointing dataset, all collected
without the use of external VLMs. The success of our approach relies on careful
modeling choices, a well-tuned training pipeline, and, most critically, the
quality of our newly collected datasets. Our best-in-class 72B model not only
outperforms others in the class of open weight and data models, but also
outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini
1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and
on a large human evaluation. Our model weights, new datasets, and source code
are available at https://molmo.allenai.org/blog. [COMMENTS]Updated with ablations and more technical details [LINK]http://arxiv.org/abs/2409.17146v2 [DATE]2024-12-05 22:28:40+08:00 [CATEGORIES]cs.CL cs.LG
Reducing Tool Hallucination via Reliability Alignment [AUTHORS]Hongshen Xu, Su Zhu, Zihan Wang, Hang Zheng, Da Ma, Ruisheng Cao, Shuai Fan, Lu Chen, Kai Yu [ABSTRACT]Large Language Models (LLMs) have extended their capabilities beyond language
generation to interact with external systems through tool calling, offering
powerful potential for real-world applications. However, the phenomenon of tool
hallucinations, which occur when models improperly select or misuse tools,
presents critical challenges that can lead to flawed task execution and
increased operational costs. This paper investigates the concept of reliable
tool calling and highlights the necessity of addressing tool hallucinations. We
systematically categorize tool hallucinations into two main types: tool
selection hallucination and tool usage hallucination. To mitigate these issues,
we propose a reliability-focused alignment framework that enhances the model's
ability to accurately assess tool relevance and usage. By proposing a suite of
evaluation metrics and evaluating on StableToolBench, we further demonstrate
the effectiveness of our framework in mitigating tool hallucination and
improving the overall system reliability of LLM tool calling. [LINK]http://arxiv.org/abs/2412.04141v1 [DATE]2024-12-05 21:10:54+08:00 [CATEGORIES]cs.CL
Text Change Detection in Multilingual Documents Using Image Comparison [AUTHORS]Doyoung Park, Naresh Reddy Yarram, Sunjin Kim, Minkyu Kim, Seongho Cho, Taehee Lee [ABSTRACT]Document comparison typically relies on optical character recognition (OCR)
as its core technology. However, OCR requires the selection of appropriate
language models for each document and the performance of multilingual or hybrid
models remains limited. To overcome these challenges, we propose text change
detection (TCD) using an image comparison model tailored for multilingual
documents. Unlike OCR-based approaches, our method employs word-level text
image-to-image comparison to detect changes. Our model generates bidirectional
change segmentation maps between the source and target documents. To enhance
performance without requiring explicit text alignment or scaling preprocessing,
we employ correlations among multi-scale attention features. We also construct
a benchmark dataset comprising actual printed and scanned word pairs in various
languages to evaluate our model. We validate our approach using our benchmark
dataset and public benchmarks Distorted Document Images and the LRDE Document
Binarization Dataset. We compare our model against state-of-the-art semantic
segmentation and change detection models, as well as against conventional OCR-based
models. [COMMENTS]15 pages, 11 figures, 6 tables; accepted at WACV 2025 [LINK]http://arxiv.org/abs/2412.04137v1 [DATE]2024-12-05 21:04:10+08:00 [CATEGORIES]cs.CL cs.LG
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs [AUTHORS]Ruben Härle, Felix Friedrich, Manuel Brack, Björn Deiseroth, Patrick Schramowski, Kristian Kersting [ABSTRACT]Large Language Models (LLMs) have demonstrated remarkable capabilities in
generating human-like text, but their output may not be aligned with the user
or even produce harmful content. This paper presents a novel approach to detect
and steer concepts such as toxicity before generation. We introduce the Sparse
Conditioned Autoencoder (SCAR), a single trained module that extends the
otherwise untouched LLM. SCAR ensures full steerability, towards and away from
concepts (e.g., toxic content), without compromising the quality of the model's
text generation on standard evaluation benchmarks. We demonstrate the effective
application of our approach through a variety of concepts, including toxicity,
safety, and writing style alignment. As such, this work establishes a robust
framework for controlling LLM generations, ensuring their ethical and safe
deployment in real-world applications. [COMMENTS]Accepted at Socially Responsible Language Modelling Research (SoLaR)
Workshop at NeurIPS 2024 [LINK]http://arxiv.org/abs/2411.07122v2 [DATE]2024-12-05 18:45:02+08:00 [CATEGORIES]cs.CL
M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction [AUTHORS]Jiang Liu, Bobo Li, Xinran Yang, Na Yang, Hao Fei, Mingyao Zhang, Fei Li, Donghong Ji [ABSTRACT]Multimodal information extraction (IE) tasks have attracted increasing
attention because many studies have shown that multimodal information benefits
text information extraction. However, existing multimodal IE datasets mainly
focus on sentence-level image-facilitated IE in English text, and pay little
attention to video-based multimodal IE and fine-grained visual grounding.
Therefore, in order to promote the development of multimodal IE, we constructed
a multimodal multilingual multitask dataset, named M$^{3}$D, which has the
following features: (1) It contains paired document-level text and video to
enrich multimodal information; (2) It supports two widely-used languages,
namely English and Chinese; (3) It includes more multimodal IE tasks such as
entity recognition, entity chain extraction, relation extraction and visual
grounding. In addition, our dataset introduces an unexplored theme, i.e.,
biography, enriching the domains of multimodal IE resources. To establish a
benchmark for our dataset, we propose an innovative hierarchical multimodal IE
model. This model effectively leverages and integrates multimodal information
through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal
scenarios, modal information is often incomplete. Thus, we designed a Missing
Modality Construction Module (MMCM) to alleviate the issues caused by missing
modalities. Our model achieved an average performance of 53.80% and 53.77% on
four tasks in English and Chinese datasets, respectively, which set a
reasonable standard for subsequent research. In addition, we conducted more
analytical experiments to verify the effectiveness of our proposed module. We
believe that our work can promote the development of the field of multimodal
IE. [COMMENTS]14 pages, 9 figures, 6 tables [LINK]http://arxiv.org/abs/2412.04026v1 [DATE]2024-12-05 18:00:58+08:00 [CATEGORIES]cs.CL
Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement [AUTHORS]Lingfeng Ming, Bo Zeng, Chenyang Lyu, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang, Weihua Luo, Kaifu Zhang [ABSTRACT]Large Language Models (LLMs) have achieved remarkable progress in recent
years; however, their excellent performance is still largely limited to major
world languages, primarily English. Many LLMs continue to face challenges with
multilingual tasks, especially when it comes to low-resource languages. To
address this issue, we introduced Marco-LLM: Massive multilingual training for
cross-lingual enhancement LLM. We have collected a substantial amount of
multilingual data for several low-resource languages and conducted extensive
continual pre-training using the Qwen2 models. This effort has resulted in a
multilingual LLM named Marco-LLM. Through comprehensive evaluations on various
multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA
and many others, Marco-LLM has demonstrated substantial improvements over
state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements
in any-to-any machine translation tasks, showing the effectiveness of our
multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not
only perform exceptionally well in multilingual tasks, including low-resource
languages, but also maintain strong performance in English and other major
languages, closing the performance gap between high- and low-resource language
capabilities. By bridging languages, this effort demonstrates our dedication to
ensuring LLMs work accurately across various languages. [LINK]http://arxiv.org/abs/2412.04003v1 [DATE]2024-12-05 17:26:58+08:00 [CATEGORIES]cs.CL
Demonstration Selection for In-Context Learning via Reinforcement Learning [AUTHORS]Xubin Wang, Jianfei Wu, Yichen Yuan, Mingzhe Li, Deyu Cai, Weijia Jia [ABSTRACT]Diversity in demonstration selection is crucial for enhancing model
generalization, as it enables a broader coverage of structures and concepts.
However, constructing an appropriate set of demonstrations has remained a focal
point of research. This paper presents the Relevance-Diversity Enhanced
Selection (RDES), an innovative approach that leverages reinforcement learning
to optimize the selection of diverse reference demonstrations for text
classification tasks using Large Language Models (LLMs), especially in few-shot
prompting scenarios. RDES employs a Q-learning framework to dynamically
identify demonstrations that maximize both diversity and relevance to the
classification objective by calculating a diversity score based on label
distribution among selected demonstrations. This method ensures a balanced
representation of reference data, leading to improved classification accuracy.
Through extensive experiments on four benchmark datasets and involving 12
closed-source and open-source LLMs, we demonstrate that RDES significantly
enhances classification accuracy compared to ten established baselines.
Furthermore, we investigate the incorporation of Chain-of-Thought (CoT)
reasoning in the reasoning process, which further enhances the model's
predictive performance. The results underscore the potential of reinforcement
learning to facilitate adaptive demonstration selection and deepen the
understanding of classification challenges. [LINK]http://arxiv.org/abs/2412.03966v1 [DATE]2024-12-05 16:33:52+08:00 [CATEGORIES]cs.CL
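The abstract above does not give RDES's formulas, so the following Python sketch only illustrates the general idea under stated assumptions: the diversity score is taken to be the normalized Shannon entropy of the labels among the selected demonstrations, the reward is a hypothetical weighted sum of relevance and diversity, and the update is the textbook tabular Q-learning rule. All function names, parameters, and weights here are illustrative, not the authors' implementation.

```python
import math
from collections import Counter


def label_diversity(labels, num_classes):
    """Normalized Shannon entropy of the selected labels, in [0, 1].

    Assumed stand-in for a label-distribution diversity score.
    """
    if not labels or num_classes < 2:
        return 0.0
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
    return entropy / math.log(num_classes)


def reward(relevance, labels, num_classes, weight=0.5):
    """Hypothetical reward: relevance plus weighted label diversity."""
    return relevance + weight * label_diversity(labels, num_classes)


def q_update(q, state, action, r, next_state, alpha=0.1, gamma=0.9):
    """One standard tabular Q-learning update on a dict-of-dicts table."""
    best_next = max(q.get(next_state, {}).values(), default=0.0)
    old = q.setdefault(state, {}).get(action, 0.0)
    q[state][action] = old + alpha * (r + gamma * best_next - old)
    return q[state][action]


# States are tuples of already-selected demonstration indices;
# the action is the index of the demonstration to add next.
q_table = {}
r = reward(relevance=0.8, labels=["pos", "neg"], num_classes=2)
new_q = q_update(q_table, state=(0,), action=1, r=r, next_state=(0, 1))
print(new_q)
```

In this sketch a balanced label set earns the full diversity bonus while a single-label set earns none, so the learned values favor selections that cover more classes at equal relevance.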
MIND: Effective Incorrect Assignment Detection through a Multi-Modal Structure-Enhanced Language Model [AUTHORS]Yunhe Pang, Bo Chen, Fanjin Zhang, Yanghui Rao, Jie Tang [ABSTRACT]The rapid growth of academic publications has exacerbated the issue of author
name ambiguity in online digital libraries. Despite advances in name
disambiguation algorithms, cumulative errors continue to undermine the
reliability of academic systems. It is estimated that over 10% of paper-author
assignments are rectified when constructing the million-scale WhoIsWho
benchmark. Existing endeavors to detect incorrect assignments are either
semantic-based or graph-based approaches, which fall short of making full use
of the rich text attributes of papers and implicit structural features defined
via the co-occurrence of paper attributes. To this end, this paper introduces a
structure-enhanced language model that combines key structural features from
graph-based methods with fine-grained semantic features from rich paper
attributes to detect incorrect assignments. The proposed model is trained with
a highly effective multi-modal multi-turn instruction tuning framework, which
incorporates task-guided instruction tuning, text-attribute modality, and
structural modality. Experimental results demonstrate that our model
outperforms previous approaches, achieving top performance on the leaderboard
of KDD Cup 2024. Our code is publicly available. [LINK]http://arxiv.org/abs/2412.03930v1 [DATE]2024-12-05 15:12:53+08:00 [CATEGORIES]cs.CL
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios [AUTHORS]Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, Xuanjing Huang [ABSTRACT]Existing evaluations of tool learning primarily focus on validating the
alignment of selected tools for large language models (LLMs) with expected
outcomes. However, these approaches rely on a limited set of scenarios where
answers can be pre-determined, diverging from genuine needs. Furthermore, a
sole emphasis on outcomes disregards the complex capabilities required for LLMs
to effectively use tools. To tackle this issue, we propose ToolEyes, a
fine-grained system tailored for the evaluation of the LLMs' tool learning
capabilities in authentic scenarios. The system meticulously examines seven
real-world scenarios, analyzing five dimensions crucial to LLMs in tool
learning: format alignment, intent comprehension, behavior planning, tool
selection, and answer organization. Additionally, ToolEyes incorporates a tool
library boasting approximately 600 tools, serving as an intermediary between
LLMs and the physical world. Evaluations involving ten LLMs across three
categories reveal a preference for specific scenarios and limited cognitive
abilities in tool learning. Intriguingly, expanding the model size even
exacerbates the hindrance to tool learning. The code and data are available at
https://github.com/Junjie-Ye/ToolEyes. [COMMENTS]Accepted by COLING 2025 conference [LINK]http://arxiv.org/abs/2401.00741v3 [DATE]2024-12-05 15:05:59+08:00 [CATEGORIES]cs.CL
LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings [AUTHORS]Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé [ABSTRACT]Sentence embedding models play a key role in various Natural Language
Processing tasks, such as in Topic Modeling, Document Clustering and
Recommendation Systems. However, these models rely heavily on parallel data,
which can be scarce for many low-resource languages, including Luxembourgish.
This scarcity results in suboptimal performance of monolingual and
cross-lingual sentence embedding models for these languages. To address this
issue, we compile a relatively small but high-quality human-generated
cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence
embedding model for Luxembourgish with strong cross-lingual capabilities.
Additionally, we present evidence suggesting that including low-resource
languages in parallel training datasets can be more advantageous for other
low-resource languages than relying solely on high-resource language pairs.
Furthermore, recognizing the lack of sentence embedding benchmarks for
low-resource languages, we create a paraphrase detection benchmark specifically
for Luxembourgish, aiming to partially fill this gap and promote further
research. [COMMENTS]Accepted at COLING 2025 [LINK]http://arxiv.org/abs/2412.03331v2 [DATE]2024-12-05 15:05:57+08:00 [CATEGORIES]cs.CL
DRS: Deep Question Reformulation With Structured Output [AUTHORS]Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang [ABSTRACT]Question answering represents a core capability of large language models
(LLMs). However, when individuals encounter unfamiliar knowledge in texts, they
often formulate questions that the text itself cannot answer due to
insufficient understanding of the underlying information. Recent studies reveal
that while LLMs can detect unanswerable questions, they struggle to assist
users in reformulating these questions. Even advanced models like GPT-3.5
demonstrate limited effectiveness in this regard. To address this limitation,
we propose DRS: Deep Question Reformulation with Structured Output, a novel
zero-shot method aimed at enhancing LLMs' ability to assist users in
reformulating questions to extract relevant information from new documents. DRS
combines the strengths of LLMs with a DFS-based algorithm to iteratively
explore potential entity combinations and constrain outputs using predefined
entities. This structured approach significantly enhances the reformulation
capabilities of LLMs. Comprehensive experimental evaluations demonstrate that
DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while
also enhancing the performance of open-source models, such as Gemma2-9B, from
26.35% to 56.75%. [LINK]http://arxiv.org/abs/2411.17993v2 [DATE]2024-12-05 14:53:40+08:00 [CATEGORIES]cs.CL
A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios [AUTHORS]Xiachong Feng, Longxu Dou, Ella Li, Qinghao Wang, Haochuan Wang, Yu Guo, Chang Ma, Lingpeng Kong [ABSTRACT]Game-theoretic scenarios have become pivotal in evaluating the social
intelligence of Large Language Model (LLM)-based social agents. While numerous
studies have explored these agents in such settings, there is a lack of a
comprehensive survey summarizing the current progress. To address this gap, we
systematically review existing research on LLM-based social agents within
game-theoretic scenarios. Our survey organizes the findings into three core
components: Game Framework, Social Agent, and Evaluation Protocol. The game
framework encompasses diverse game scenarios, ranging from choice-focusing to
communication-focusing games. The social agent part explores agents'
preferences, beliefs, and reasoning abilities. The evaluation protocol covers
both game-agnostic and game-specific metrics for assessing agent performance.
By reflecting on the current research and identifying future research
directions, this survey provides insights to advance the development and
evaluation of social agents in game-theoretic scenarios. [LINK]http://arxiv.org/abs/2412.03920v1 [DATE]2024-12-05 14:46:46+08:00 [CATEGORIES]cs.CL
MISR: Measuring Instrumental Self-Reasoning in Frontier Models [AUTHORS]Kai Fronsdal, David Lindner [ABSTRACT]We propose a suite of tasks to evaluate the instrumental self-reasoning
ability of large language model (LLM) agents. Instrumental self-reasoning
ability could improve adaptability and enable self-modification, but it could
also pose significant risks, such as enabling deceptive alignment. Prior work
has only evaluated self-reasoning in non-agentic settings or in limited
domains. In this paper, we propose evaluations for instrumental self-reasoning
ability in agentic tasks in a wide range of scenarios, including
self-modification, knowledge seeking, and opaque self-reasoning. We evaluate
agents built using state-of-the-art LLMs, including commercial and open source
systems. We find that instrumental self-reasoning ability emerges only in the
most capable frontier models and that it is highly context-dependent. No model
passes the most difficult versions of our evaluations, hence our evaluation
can be used to measure increases in instrumental self-reasoning ability in
future models. We open-source our evaluations at
https://github.com/kaifronsdal/Self-Reasoning-Evals. [COMMENTS]10 pages, 65 page appendix, 5 figures [LINK]http://arxiv.org/abs/2412.03904v1 [DATE]2024-12-05 14:20:47+08:00 [CATEGORIES]cs.CL, cs.LG
Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [AUTHORS]Zheye Deng, Chunkit Chan, Weiqi Wang, Yuxi Sun, Wei Fan, Tianshi Zheng, Yauwai Yim, Yangqiu Song [ABSTRACT]The task of condensing large chunks of textual information into concise and
structured tables has gained attention recently due to the emergence of Large
Language Models (LLMs) and their potential benefit for downstream tasks, such
as text summarization and text mining. Previous approaches often generate
tables that directly replicate information from the text, limiting their
applicability in broader contexts, as text-to-table generation in real-life
scenarios necessitates information extraction, reasoning, and integration.
However, there is a lack of both datasets and methodologies towards this task.
In this paper, we introduce LiveSum, a new benchmark dataset created for
generating summary tables of competitions based on real-time commentary texts.
We evaluate the performances of state-of-the-art LLMs on this task in both
fine-tuning and zero-shot settings, and additionally propose a novel pipeline
called $T^3$ (Text-Tuple-Table) to improve their performances. Extensive
experimental results demonstrate that LLMs still struggle with this task even
after fine-tuning, while our approach can offer substantial performance gains
without explicit training. Further analyses demonstrate that our method
exhibits strong generalization abilities, surpassing previous approaches on
several other text-to-table datasets. Our code and data can be found at
https://github.com/HKUST-KnowComp/LiveSum. [COMMENTS]Accepted to EMNLP 2024 [LINK]http://arxiv.org/abs/2404.14215v2 [DATE]2024-12-05 14:02:59+08:00 [CATEGORIES]cs.CL
AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer [AUTHORS]Davor Lauc, Attapol Rutherford, Weerin Wongwarawipatr [ABSTRACT]This study introduces AyutthayaAlpha, an advanced transformer-based machine
learning model designed for the transliteration of Thai proper names into Latin
script. Our system achieves state-of-the-art performance with 82.32%
first-token accuracy and 95.24% first-three-token accuracy, while maintaining a
low character error rate of 0.0047. The complexity of Thai phonology, including
tonal features and vowel length distinctions, presents significant challenges
for accurate transliteration, which we address through a novel two-model
approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and
AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly
outperforms its larger counterpart. Our research combines linguistic rules with
deep learning, training on a carefully curated dataset of 1.2 million
Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million
examples. Extensive evaluations against existing transliteration methods and
human expert benchmarks demonstrate that AyutthayaAlpha not only achieves
superior accuracy but also effectively captures personal and cultural
preferences in name romanization. The system's practical applications extend to
cross-lingual information retrieval, international data standardization, and
identity verification systems, with particular relevance for government
databases, academic institutions, and global business operations. This work
represents a significant advance in bridging linguistic gaps between Thai and
Latin scripts, while respecting the cultural and personal dimensions of name
transliteration. [LINK]http://arxiv.org/abs/2412.03877v1 [DATE]2024-12-05 13:18:09+08:00 [CATEGORIES]cs.CL
Yi-Lightning Technical Report [AUTHORS]01. AI, Alan Wake, Albert Wang, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qicheng Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang [ABSTRACT]This technical report presents Yi-Lightning, our latest flagship large
language model (LLM). It achieves exceptional performance, ranking 6th overall
on Chatbot Arena, with particularly strong results (2nd to 4th place) in
specialized categories including Chinese, Math, Coding, and Hard Prompts.
Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture,
featuring advanced expert segmentation and routing mechanisms coupled with
optimized KV-caching techniques. Our development process encompasses
comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement
learning from human feedback (RLHF), where we devise deliberate strategies for
multi-stage training, synthetic data construction, and reward modeling.
Furthermore, we implement RAISE (Responsible AI Safety Engine), a
four-component framework to address safety issues across pre-training,
post-training, and serving phases. Empowered by our scalable super-computing
infrastructure, all these innovations substantially reduce training, deployment
and inference costs while maintaining high-performance standards. With further
evaluations on public academic benchmarks, Yi-Lightning demonstrates
competitive performance against top-tier LLMs, while we observe a notable
disparity between traditional, static benchmark results and real-world, dynamic
human preferences. This observation prompts a critical reassessment of
conventional benchmarks' utility in guiding the development of more intelligent
and powerful AI systems for practical applications. Yi-Lightning is now
available through our developer platform at https://platform.lingyiwanwu.com. [LINK]http://arxiv.org/abs/2412.01253v3 [DATE]2024-12-05 12:29:49+08:00 [CATEGORIES]cs.CL, cs.LG
Calibrating Reasoning in Language Models with Internal Consistency [AUTHORS]Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li [ABSTRACT]Large language models (LLMs) have demonstrated impressive capabilities in
various reasoning tasks, aided by techniques like chain-of-thought prompting
that elicits verbalized reasoning. However, LLMs often generate text with
obvious mistakes and contradictions, raising doubts about their ability to
robustly process and utilize generated rationales. In this work, we investigate
reasoning in LLMs through the lens of internal representations, focusing on how
these representations are influenced by generated rationales. Our preliminary
analysis reveals that while generated rationales improve answer accuracy,
inconsistencies emerge between the model's internal representations in middle
layers and those in final layers, potentially undermining the reliability of
their reasoning processes. To address this, we propose internal consistency as
a measure of the model's confidence by examining the agreement of latent
predictions decoded from intermediate layers. Extensive empirical studies
across different models and datasets demonstrate that internal consistency
effectively distinguishes between correct and incorrect reasoning paths.
Motivated by this, we propose a new approach to calibrate reasoning by
up-weighting reasoning paths with high internal consistency, resulting in a
significant boost in reasoning performance. Further analysis uncovers distinct
patterns in attention and feed-forward modules across layers, providing
insights into the emergence of internal inconsistency. In summary, our results
demonstrate the potential of using internal representations for self-evaluation
of LLMs. Our code is available at github.com/zhxieml/internal-consistency. [COMMENTS]NeurIPS 2024 camera ready [LINK]http://arxiv.org/abs/2405.18711v2 [DATE]2024-12-05 12:01:28+08:00 [CATEGORIES]cs.CL
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models [AUTHORS]Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji [ABSTRACT]Data visualization in the form of charts plays a pivotal role in data
analysis, offering critical insights and aiding in informed decision-making.
Automatic chart understanding has witnessed significant advancements with the
rise of large foundation models in recent years. Foundation models, such as
large language models, have revolutionized various natural language processing
tasks and are increasingly being applied to chart understanding tasks. This
survey paper provides a comprehensive overview of the recent developments,
challenges, and future directions in chart understanding within the context of
these foundation models. We review fundamental building blocks crucial for
studying chart understanding tasks. Additionally, we explore various tasks and
their evaluation metrics and sources of both charts and textual inputs. Various
modeling strategies are then examined, encompassing both classification-based
and generation-based approaches, along with tool augmentation techniques that
enhance chart understanding performance. Furthermore, we discuss the
state-of-the-art performance of each task and discuss how we can improve the
performance. Challenges and future directions are addressed, highlighting the
importance of several topics, such as domain-specific charts, lack of efforts
in developing evaluation metrics, and agent-oriented settings. This survey
paper serves as a comprehensive resource for researchers and practitioners in
the fields of natural language processing, computer vision, and data analysis,
providing valuable insights and directions for future research in chart
understanding leveraging large foundation models. The studies mentioned in this
paper, along with emerging new research, will be continually updated at:
https://github.com/khuangaf/Awesome-Chart-Understanding. [COMMENTS]IEEE Transactions on Knowledge and Data Engineering (TKDE) [LINK]http://arxiv.org/abs/2403.12027v4 [DATE]2024-12-05 11:26:13+08:00 [CATEGORIES]cs.CL
ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data [AUTHORS]Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar [ABSTRACT]Large Language Model (LLM) agents are rapidly improving to handle
increasingly complex web-based tasks. Most of these agents rely on
general-purpose, proprietary models like GPT-4 and focus on designing better
prompts to improve their planning abilities. However, general-purpose LLMs are
not specifically trained to understand specialized web contexts such as HTML,
and they often struggle with long-horizon planning. We explore an alternative
approach that fine-tunes open-source LLMs using production-scale workflow data
collected from over 250 domains corresponding to 6 billion tokens. This simple
yet effective approach shows substantial gains over prompting-based agents on
existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation
performance on Mind2Web and improves the task success rate by 7.3% over the
previous best text-only web agents on WebArena. We further perform detailed
ablation studies on various fine-tuning design choices and provide insights
into LLM selection, training recipes, context window optimization, and effect
of dataset sizes. [LINK]http://arxiv.org/abs/2411.15004v2 [DATE]2024-12-05 10:00:07+08:00 [CATEGORIES]cs.CL
Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data [AUTHORS]David R. Bellamy, Bhawesh Kumar, Cindy Wang, Andrew Beam [ABSTRACT]In this work we introduce Labrador, a pre-trained Transformer model for
laboratory data. Labrador and BERT were pre-trained on a corpus of 100 million
lab test results from electronic health records (EHRs) and evaluated on various
downstream outcome prediction tasks. Both models demonstrate mastery of the
pre-training task but neither consistently outperforms XGBoost on downstream
supervised tasks. Our ablation studies reveal that transfer learning shows
limited effectiveness for BERT and achieves marginal success with Labrador. We
explore the reasons for the failure of transfer learning and suggest that the
data generating process underlying each patient cannot be characterized
sufficiently using labs alone, among other factors. We encourage future work to
focus on joint modeling of multiple EHR data categories and to include
tree-based baselines in their evaluations. [COMMENTS]26 pages, 8 figures, best paper award at ML4H 2024 [LINK]http://arxiv.org/abs/2312.11502v2 [DATE]2024-12-05 07:09:53+08:00 [CATEGORIES]cs.CL, cs.LG
From Language Models over Tokens to Language Models over Characters [AUTHORS]Tim Vieira, Ben LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O'Donnell, Ryan Cotterell [ABSTRACT]Modern language models are internally -- and mathematically -- distributions
over token strings rather than character strings, posing numerous
challenges for programmers building user applications on top of them. For
example, if a prompt is specified as a character string, it must be tokenized
before passing it to the token-level language model. Thus, the tokenizer and
consequent analyses are very sensitive to the specification of the prompt
(e.g., if the prompt ends with a space or not). This paper presents algorithms
for converting token-level language models to character-level ones. We present
both exact and approximate algorithms. In the empirical portion of the paper,
we benchmark the practical runtime and approximation quality. We find that --
even with a small computation budget -- our method is able to accurately
approximate the character-level distribution (less than 0.00021 excess bits /
character) at reasonably fast speeds (46.3 characters / second) on the Llama
3.1 8B language model. [LINK]http://arxiv.org/abs/2412.03719v1 [DATE]2024-12-05 05:19:20+08:00 [CATEGORIES]cs.CL
Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance [AUTHORS]Meni Brief, Oded Ovadia, Gil Shenderovitz, Noga Ben Yoash, Rachel Lemberg, Eitam Sheetrit [ABSTRACT]The application of large language models (LLMs) in domain-specific contexts,
including finance, has expanded rapidly. Domain-specific LLMs are typically
evaluated based on their performance in various downstream tasks relevant to
the domain. In this work, we present a detailed analysis of fine-tuning LLMs
for such tasks. Somewhat counterintuitively, we find that in domain-specific
cases, fine-tuning exclusively on the target task is not always the most
effective strategy. Instead, multi-task fine-tuning -- where models are trained
on a cocktail of related tasks -- can significantly enhance performance. We
demonstrate how this approach enables a small model, such as Phi-3-Mini, to
achieve state-of-the-art results, even surpassing the much larger GPT-4-o model
on financial benchmarks. Our study involves a large-scale experiment,
conducting over 200 training experiments using several widely adopted LLMs as
baselines, and empirically confirms the benefits of multi-task fine-tuning.
Additionally, we explore the use of general instruction data as a form of
regularization, suggesting that it helps minimize performance degradation. We
also investigate the inclusion of mathematical data, finding improvements in
numerical reasoning that transfer effectively to financial tasks. Finally, we
note that while fine-tuning for downstream tasks leads to targeted improvements
in task performance, it does not necessarily result in broader gains in domain
knowledge or complex domain reasoning abilities. [LINK]http://arxiv.org/abs/2410.01109v2 [DATE]2024-12-05 04:57:05+08:00 [CATEGORIES]cs.CL
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension [AUTHORS]Wang Xiyao, Yang Zhengyuan, Li Linjie, Lu Hongjin, Xu Yuancheng, Lin Chung-Ching Lin, Lin Kevin, Huang Furong, Wang Lijuan [ABSTRACT]Despite significant advancements in vision-language models (VLMs),
effective approaches to enhance response quality by scaling inference-time
computation are lacking. This capability is known to be a core step towards
the self-improving models in recent large language model studies. In this
paper, we present Vision Value Model (VisVM) that can guide VLM inference-time
search to generate responses with better visual comprehension. Specifically,
VisVM not only evaluates the generated sentence quality in the current search
step, but also anticipates the quality of subsequent sentences that may result
from the current step, thus providing a long-term value. In this way, VisVM
steers VLMs away from generating sentences prone to hallucinations or
insufficient detail, thereby producing higher quality responses. Experimental
results demonstrate that VisVM-guided search significantly enhances VLMs'
ability to generate descriptive captions with richer visual details and fewer
hallucinations, compared with greedy decoding and search methods with other
visual reward signals. Furthermore, we find that self-training the model with
the VisVM-guided captions improves the VLM's performance across a wide range of
multimodal benchmarks, indicating the potential for developing self-improving
VLMs. Our value model and code are available at
https://github.com/si0wang/VisVM. [LINK]http://arxiv.org/abs/2412.03704v1 [DATE]2024-12-05 04:35:07+08:00 [CATEGORIES]cs.CL, cs.LG
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [AUTHORS]Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, Lichao Sun [ABSTRACT]Large language models (LLMs) have garnered significant attention due to their
impressive natural language processing (NLP) capabilities. Recently, many
studies have focused on the tool utilization ability of LLMs. They primarily
investigated how LLMs effectively collaborate with given specific tools.
However, in scenarios where LLMs serve as intelligent agents, as seen in
applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate
decision-making processes that involve deciding whether to employ a tool and
selecting the most suitable tool(s) from a collection of available tools to
fulfill user requests. Therefore, in this paper, we introduce MetaTool, a
benchmark designed to evaluate whether LLMs have tool usage awareness and can
correctly choose tools. Specifically, we create a dataset called ToolE within
the benchmark. This dataset contains various types of user queries in the form
of prompts that trigger LLMs to use tools, including both single-tool and
multi-tool scenarios. Subsequently, we set the tasks for both tool usage
awareness and tool selection. We define four subtasks from different
perspectives in tool selection, including tool selection with similar choices,
tool selection in specific scenarios, tool selection with possible reliability
issues, and multi-tool selection. We conduct experiments involving eight
popular LLMs and find that the majority of them still struggle to effectively
select tools, highlighting the existing gaps between LLMs and genuine
intelligent agents. However, through the error analysis, we found there is
still significant room for improvement. Finally, we conclude with insights for
tool developers -- we strongly recommend that tool developers choose an
appropriate rewrite model for generating new descriptions based on the
downstream LLM the tool will apply to. Our code is available at
https://github.com/HowieHwong/MetaTool. [LINK]http://arxiv.org/abs/2310.03128v6 [DATE]2024-12-05 03:49:02+08:00 [CATEGORIES]cs.CL
Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings [AUTHORS]Guy Barel, Oren Tsur, Dan Volenchik [ABSTRACT]Stance detection plays a pivotal role in enabling an extensive range of
downstream applications, from discourse parsing to tracing the spread of fake
news and the denial of scientific facts. While most stance classification
models rely on textual representation of the utterance in question, prior work
has demonstrated the importance of the conversational context in stance
detection. In this work we introduce TASTE -- a multimodal architecture for
stance detection that harmoniously fuses Transformer-based content embedding
with unsupervised structural embedding. Through the fine-tuning of a pretrained
transformer and the amalgamation with social embedding via a Gated Residual
Network (GRN) layer, our model adeptly captures the complex interplay between
content and conversational structure in determining stance. TASTE achieves
state-of-the-art results on common benchmarks, significantly outperforming an
array of strong baselines. Comparative evaluations underscore the benefits of
social grounding -- emphasizing the criticality of concurrently harnessing both
content and structure for enhanced stance detection. [COMMENTS]The modified camera ready version will be published in January 2025 at COLING [LINK]http://arxiv.org/abs/2412.03681v1 [DATE]2024-12-05 03:23:37+08:00 [CATEGORIES]cs.CL
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [AUTHORS]Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo [ABSTRACT]Proprietary LMs such as GPT-4 are often employed to assess the quality of
responses from various LMs. However, concerns including transparency,
controllability, and affordability strongly motivate the development of
open-source LMs specialized in evaluations. On the other hand, existing open
evaluator LMs exhibit critical shortcomings: 1) they issue scores that
significantly diverge from those assigned by humans, and 2) they lack the
flexibility to perform both direct assessment and pairwise ranking, the two
most prevalent forms of assessment. Additionally, they do not possess the
ability to evaluate based on custom evaluation criteria, focusing instead on
general attributes like helpfulness and harmlessness. To address these issues,
we introduce Prometheus 2, a more powerful evaluator LM than its predecessor
that closely mirrors human and GPT-4 judgements. Moreover, it is capable of
processing both direct assessment and pairwise ranking formats grouped with
user-defined evaluation criteria. On four direct assessment benchmarks and four
pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and
agreement with humans and proprietary LM judges among all tested open evaluator
LMs. Our models, code, and data are all publicly available at
https://github.com/prometheus-eval/prometheus-eval. [COMMENTS]EMNLP 2024 (Main Conference) [LINK]http://arxiv.org/abs/2405.01535v2 [DATE]2024-12-05 03:23:17+08:00 [CATEGORIES]cs.CL
Evaluating Language Models as Synthetic Data Generators [AUTHORS]Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig [ABSTRACT]Given the increasing use of synthetic data in language model (LM)
post-training, an LM's ability to generate high-quality data has become nearly
as crucial as its ability to solve problems directly. While prior works have
focused on developing effective data generation methods, they lack systematic
comparison of different LMs as data generators in a unified setting. To address
this gap, we propose AgoraBench, a benchmark that provides standardized
settings and metrics to evaluate LMs' data generation abilities. Through
synthesizing 1.26 million training instances using 6 LMs and training 99
student models, we uncover key insights about LMs' data generation
capabilities. First, we observe that LMs exhibit distinct strengths. For
instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet
performs better at enhancing existing ones. Furthermore, our analysis reveals
that an LM's data generation ability doesn't necessarily correlate with its
problem-solving ability. Instead, multiple intrinsic features of data
quality -- including response quality, perplexity, and instruction
difficulty -- collectively serve as better indicators. Finally, we demonstrate
that strategic choices in output format and cost-conscious model selection
significantly impact data generation effectiveness. [COMMENTS]Work in Progress [LINK]http://arxiv.org/abs/2412.03679v1 [DATE]2024-12-05 03:20:32+08:00 [CATEGORIES]cs.CL
Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis [AUTHORS]Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara [ABSTRACT]The task of image captioning demands an algorithm to generate natural
language descriptions of visual inputs. Recent advancements have seen a
convergence between image captioning research and the development of Large
Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which
extend the capabilities of text-only LLMs to multiple modalities. This paper
investigates whether Multimodal LLMs can supplant traditional image captioning
networks by evaluating their performance on various image description
benchmarks. We explore both the zero-shot capabilities of these models and
their adaptability to different semantic domains through fine-tuning methods,
including prompt learning, prefix tuning, and low-rank adaptation. Our results
demonstrate that while Multimodal LLMs achieve impressive zero-shot
performance, fine-tuning for specific domains while maintaining their
generalization capabilities intact remains challenging. We discuss the
implications of these findings for future research in image captioning and the
development of more adaptable Multimodal LLMs. [COMMENTS]ECCV 2024 Workshop on Green Foundation Models [LINK]http://arxiv.org/abs/2412.03665v1 [DATE]2024-12-05 03:01:06+08:00 [CATEGORIES]cs.CL
From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents [AUTHORS]Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, Zhongyu Wei [ABSTRACT]Traditional sociological research often relies on human participation, which,
though effective, is expensive, challenging to scale, and raises ethical
concerns. Recent advancements in large language models (LLMs) highlight their
potential to simulate human behavior, enabling the replication of individual
responses and facilitating many interdisciplinary studies. In this
paper, we conduct a comprehensive survey of this field, illustrating the recent
progress in simulation driven by LLM-empowered agents. We categorize the
simulations into three types: (1) Individual Simulation, which mimics specific
individuals or demographic groups; (2) Scenario Simulation, where multiple
agents collaborate to achieve goals within specific contexts; and (3) Society
Simulation, which models interactions within agent societies to reflect the
complexity and variety of real-world dynamics. These simulations follow a
progression, ranging from detailed individual modeling to large-scale societal
phenomena. We provide a detailed discussion of each simulation type, including
the architecture or key components of the simulation, the classification of
objectives or scenarios and the evaluation method. Afterward, we summarize
commonly used datasets and benchmarks. Finally, we discuss the trends across
these three types of simulation. A repository for the related sources is at
https://github.com/FudanDISC/SocialAgent. [LINK]http://arxiv.org/abs/2412.03563v1 [DATE]2024-12-05 02:56:37+08:00 [CATEGORIES]cs.CL
Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted Language Models [AUTHORS]Natalie Mackraz, Nivedha Sivakumar, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff [ABSTRACT]Large language models (LLMs) are increasingly being adapted to achieve
task-specificity for deployment in real-world decision systems. Several
previous works have investigated the bias transfer hypothesis (BTH) by studying
the effect of the fine-tuning adaptation strategy on model fairness to find
that fairness in pre-trained masked language models has limited effect on the
fairness of models when adapted using fine-tuning. In this work, we expand the
study of BTH to causal models under prompt adaptations, as prompting is an
accessible and compute-efficient way to deploy models in real-world systems.
In contrast to previous works, we establish that intrinsic biases in
pre-trained Mistral, Falcon and Llama models are strongly correlated (rho >=
0.94) with biases when the same models are zero- and few-shot prompted, using a
pronoun co-reference resolution task. Further, we find that bias transfer
remains strongly correlated even when LLMs are specifically prompted to exhibit
fair or biased behavior (rho >= 0.92), and few-shot length and stereotypical
composition are varied (rho >= 0.97). Our findings highlight the importance of
ensuring fairness in pre-trained LLMs, especially when they are later used to
perform downstream tasks via prompt adaptation. [LINK]http://arxiv.org/abs/2412.03537v1 [DATE]2024-12-05 02:32:42+08:00 [CATEGORIES]cs.CL, cs.LG
StarVector: Generating Scalable Vector Graphics Code from Images and Text [AUTHORS]Juan A. Rodriguez, Abhay Puri, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli [ABSTRACT]Scalable Vector Graphics (SVGs) are vital for modern image rendering due to
their scalability and versatility. Previous SVG generation methods have focused
on curve-based vectorization, lacking semantic understanding, often producing
artifacts, and struggling with SVG primitives beyond path curves. To address
these issues, we introduce StarVector, a multimodal large language model for
SVG generation. It performs image vectorization by understanding image
semantics and using SVG primitives for compact, precise outputs. Unlike
traditional methods, StarVector works directly in the SVG code space,
leveraging visual understanding to apply accurate SVG primitives. To train
StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables
generalization across vectorization tasks and precise use of primitives like
ellipses, polygons, and text. We address challenges in SVG evaluation, showing
that pixel-based metrics like MSE fail to capture the unique qualities of
vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3
tasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this
setup, StarVector achieves state-of-the-art performance, producing more compact
and semantically rich SVGs. [LINK]http://arxiv.org/abs/2312.11556v2 [DATE]2024-12-05 02:31:44+08:00 [CATEGORIES]cs.CL
A Review on Scientific Knowledge Extraction using Large Language Models in Biomedical Sciences [AUTHORS]Gabriel Lino Garcia, João Renato Ribeiro Manesco, Pedro Henrique Paiola, Lucas Miranda, Maria Paola de Salvo, João Paulo Papa [ABSTRACT]The rapid advancement of large language models (LLMs) has opened new
boundaries in the extraction and synthesis of medical knowledge, particularly
within evidence synthesis. This paper reviews the state-of-the-art applications
of LLMs in the biomedical domain, exploring their effectiveness in automating
complex tasks such as evidence synthesis and data extraction from a biomedical
corpus of documents. While LLMs demonstrate remarkable potential, significant
challenges remain, including issues related to hallucinations, contextual
understanding, and the ability to generalize across diverse medical tasks. We
highlight critical gaps in the current research literature, particularly the
need for unified benchmarks to standardize evaluations and ensure reliability
in real-world applications. In addition, we propose directions for future
research, emphasizing the integration of state-of-the-art techniques such as
retrieval-augmented generation (RAG) to enhance LLM performance in evidence
synthesis. By addressing these challenges and utilizing the strengths of LLMs,
we aim to improve access to medical literature and facilitate meaningful
discoveries in healthcare. [COMMENTS]9 pages, 1 table, 1 figure, conference paper [LINK]http://arxiv.org/abs/2412.03531v1 [DATE]2024-12-05 02:26:13+08:00 [CATEGORIES]cs.CL, cs.LG
Number Cookbook: Number Understanding of Language Models and How to Improve It [AUTHORS]Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, Muhan Zhang [ABSTRACT]Large language models (LLMs) can solve an increasing number of complex
reasoning tasks while making surprising mistakes in basic numerical
understanding and processing (such as 9.11 > 9.9). The latter ability is
essential for tackling complex arithmetic and mathematical problems and serves
as a foundation for most reasoning tasks, but previous work paid little
attention to it or only discussed several restricted tasks (like integer
addition). In this paper, we comprehensively investigate the numerical
understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a
benchmark covering four common numerical representations and 17 distinct
numerical tasks in four major categories, resulting in 41 meaningful
combinations in total. These tasks are derived from primary and secondary
education curricula, encompassing nearly all everyday numerical understanding
and processing scenarios, and the rules of these tasks are very simple and
clear. Through the benchmark, we find that current LLMs fail frequently in many
of the tasks. To study the problem, we train small models with existing and
potential techniques for enhancing NUPA (such as tokenizers, PEs, and number
formats), comprehensively evaluating their effectiveness using our testbed. We
also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1)
naive finetuning can improve NUPA a lot on many but not all tasks, and 2)
surprisingly, techniques designed to enhance NUPA prove ineffective for
finetuning pretrained models. We further explore the impact of chain-of-thought
techniques on NUPA. Our work provides a more detailed and comprehensive
understanding of NUPA in LLMs. Our benchmark and code are released at
https://github.com/GraphPKU/number_cookbook. [LINK]http://arxiv.org/abs/2411.03766v2 [DATE]2024-12-05 00:39:04+08:00 [CATEGORIES]cs.CL
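The "9.11 > 9.9" mistake mentioned in the abstract above can be made concrete with a small sketch. It contrasts a numeric comparison with a version-style field-by-field comparison, which is one illustrative way to arrive at the wrong answer; this is not taken from the paper's benchmark, and the function names are hypothetical.

```python
from decimal import Decimal


def _sign(x, y) -> str:
    """Return '>', '<' or '=' for two comparable values."""
    if x > y:
        return ">"
    if x < y:
        return "<"
    return "="


def compare_as_numbers(a: str, b: str) -> str:
    """Compare two decimal strings by their numeric value."""
    return _sign(Decimal(a), Decimal(b))


def compare_as_versions(a: str, b: str) -> str:
    """Compare dot-separated fields as integers (version-number style).

    Illustrative assumption: this reading treats '9.11' as (9, 11),
    which is wrong for decimal numbers.
    """
    fields_a = [int(p) for p in a.split(".")]
    fields_b = [int(p) for p in b.split(".")]
    return _sign(fields_a, fields_b)


if __name__ == "__main__":
    print("numeric:", "9.11", compare_as_numbers("9.11", "9.9"), "9.9")
    print("version:", "9.11", compare_as_versions("9.11", "9.9"), "9.9")
```

The two functions disagree on this pair, which is the kind of basic numerical understanding the benchmark is described as probing.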
DataLab: A Unified Platform for LLM-Powered Business Intelligence [AUTHORS]Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Peng Chen, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, Haonan Wang, Canshi Wei, Xiaofeng Yang, Yuhui Zhang, Yifeng Zheng, Xiuqi Huang, Minfeng Zhu, Yuxin Ma, Bin Cui, Wei Chen [ABSTRACT]Business intelligence (BI) transforms large volumes of data within modern
organizations into actionable insights for informed decision-making. Recently,
large language model (LLM)-based agents have streamlined the BI workflow by
automatically performing task planning, reasoning, and actions in executable
environments based on natural language (NL) queries. However, existing
approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS.
The fragmentation of tasks across different data roles and tools leads to
inefficiencies and potential errors due to the iterative and collaborative
nature of BI. In this paper, we introduce DataLab, a unified BI platform that
integrates a one-stop LLM-based agent framework with an augmented computational
notebook interface. DataLab supports a wide range of BI tasks for different
data roles by seamlessly combining LLM assistance with user customization
within a single environment. To achieve this unification, we design a domain
knowledge incorporation module tailored for enterprise-specific BI tasks, an
inter-agent communication mechanism to facilitate information sharing across
the BI workflow, and a cell-based context management strategy to enhance
context utilization efficiency in BI notebooks. Extensive experiments
demonstrate that DataLab achieves state-of-the-art performance on various BI
tasks across popular research benchmarks. Moreover, DataLab maintains high
effectiveness and efficiency on real-world datasets from Tencent, achieving up
to a 58.58% increase in accuracy and a 61.65% reduction in token cost on
enterprise-specific BI tasks. [LINK]http://arxiv.org/abs/2412.02205v2 [DATE]2024-12-05 00:12:08+08:00 [CATEGORIES]cs.CL
Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding [AUTHORS]Lingdong Kong, Xiang Xu, Jun Cen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu [ABSTRACT]Safety-critical 3D scene understanding tasks necessitate not only accurate
but also confident predictions from 3D perception models. This study introduces
Calib3D, a pioneering effort to benchmark and scrutinize the reliability of 3D
scene understanding models from an uncertainty estimation viewpoint. We
comprehensively evaluate 28 state-of-the-art models across 10 diverse 3D
datasets, uncovering insightful phenomena that cope with both the aleatoric and
epistemic uncertainties in 3D scene understanding. We discover that despite
achieving impressive levels of accuracy, existing models frequently fail to
provide reliable uncertainty estimates -- a pitfall that critically undermines
their applicability in safety-sensitive contexts. Through extensive analysis of
key factors such as network capacity, LiDAR representations, rasterization
resolutions, and 3D data augmentation techniques, we correlate these aspects
directly with the model calibration efficacy. Furthermore, we introduce DeptS,
a novel depth-aware scaling approach aimed at enhancing 3D model calibration.
Extensive experiments across a wide range of configurations validate the
superiority of our method. We hope this work could serve as a cornerstone for
fostering reliable 3D scene understanding. Code and benchmark toolkit are
publicly available. [COMMENTS]WACV 2025; 26 pages, 8 figures, 12 tables; Code at
https://github.com/ldkong1205/Calib3D [LINK]http://arxiv.org/abs/2403.17010v2 [DATE]2024-12-05 23:33:29+08:00 [CATEGORIES]cs.LG
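As a rough illustration of what "reliable uncertainty estimates" means in the Calib3D abstract above, the sketch below computes the expected calibration error (ECE), a standard calibration metric. Treating ECE as the metric of interest is an assumption here, and the function name and binning scheme are hypothetical rather than taken from the Calib3D toolkit.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted confidence of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # An overconfident model: 95% confidence but only half the answers right.
    print(expected_calibration_error([0.95] * 4, [True, True, False, False]))
```

A model can be accurate and still score poorly on this metric, which is the gap the abstract describes.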
HyperMARL: Adaptive Hypernetworks for Multi-Agent RL [AUTHORS]Kale-ab Abebe Tessera, Arrasy Rahman, Stefano V. Albrecht [ABSTRACT]Balancing individual specialisation and shared behaviours is a critical
challenge in multi-agent reinforcement learning (MARL). Existing methods
typically focus on encouraging diversity or leveraging shared representations.
Full parameter sharing (FuPS) improves sample efficiency but struggles to learn
diverse behaviours when required, while no parameter sharing (NoPS) enables
diversity but is computationally expensive and sample inefficient. To address
these challenges, we introduce HyperMARL, a novel approach using hypernetworks
to balance efficiency and specialisation. HyperMARL generates agent-specific
actor and critic parameters, enabling agents to adaptively exhibit diverse or
homogeneous behaviours as needed, without modifying the learning objective or
requiring prior knowledge of the optimal diversity. Furthermore, HyperMARL
decouples agent-specific and state-based gradients, which empirically
correlates with reduced policy gradient variance, potentially offering insights
into its ability to capture diverse behaviours. Across MARL benchmarks
requiring homogeneous, heterogeneous, or mixed behaviours, HyperMARL
consistently matches or outperforms FuPS, NoPS, and diversity-focused methods,
achieving NoPS-level diversity with a shared architecture. These results
highlight the potential of hypernetworks as a versatile approach to the
trade-off between specialisation and shared behaviours in MARL. [LINK]http://arxiv.org/abs/2412.04233v1 [DATE]2024-12-05 23:09:51+08:00 [CATEGORIES]cs.LG
Foundations of the Theory of Performance-Based Ranking [AUTHORS]Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck [ABSTRACT]Ranking entities such as algorithms, devices, methods, or models based on
their performances, while accounting for application-specific preferences, is a
challenge. To address this challenge, we establish the foundations of a
universal theory for performance-based ranking. First, we introduce a rigorous
framework built on top of both the probability and order theories. Our new
framework encompasses the elements necessary to (1) manipulate performances as
mathematical objects, (2) express which performances are worse than or
equivalent to others, (3) model tasks through a variable called satisfaction,
(4) consider properties of the evaluation, (5) define scores, and (6) specify
application-specific preferences through a variable called importance. On top
of this framework, we propose the first axiomatic definition of performance
orderings and performance-based rankings. Then, we introduce a universal
parametric family of scores, called ranking scores, that can be used to
establish rankings satisfying our axioms, while considering
application-specific preferences. Finally, we show, in the case of two-class
classification, that the family of ranking scores encompasses well-known
performance scores, including the accuracy, the true positive rate (recall,
sensitivity), the true negative rate (specificity), the positive predictive
value (precision), and F1. However, we also show that some other scores
commonly used to compare classifiers are unsuitable to derive performance
orderings satisfying the axioms. Therefore, this paper provides the computer
vision and machine learning communities with a rigorous framework for
evaluating and ranking entities. [LINK]http://arxiv.org/abs/2412.04227v1 [DATE]2024-12-05 23:05:25+08:00 [CATEGORIES]cs.LG
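For reference, the well-known two-class scores named in the abstract above can be computed from a confusion matrix as in the following sketch. It only restates the standard definitions; it does not implement the paper's parametric family of ranking scores, and the function name is hypothetical. Non-zero denominators are assumed.

```python
def binary_scores(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard two-class performance scores from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),          # recall, sensitivity
        "tnr": tn / (tn + fp),          # specificity
        "ppv": tp / (tp + fp),          # precision
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


if __name__ == "__main__":
    for name, value in binary_scores(tp=8, fp=2, tn=85, fn=5).items():
        print(f"{name}: {value:.4f}")
```

Different scores can order the same set of classifiers differently, which is why the choice of score matters for ranking.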
Dockformer: A transformer-based molecular docking paradigm for large-scale virtual screening [AUTHORS]Zhangfan Yang, Junkai Ji, Shan He, Jianqiang Li, Tiantian He, Ruibin Bai, Zexuan Zhu, Yew Soon Ong [ABSTRACT]Molecular docking is a crucial step in drug development, which enables the
virtual screening of compound libraries to identify potential ligands that
target proteins of interest. However, the computational complexity of
traditional docking models increases as the size of the compound library
increases. Recently, deep learning algorithms have provided data-driven research
and development models to increase the speed of the docking process.
Unfortunately, few models can achieve superior screening performance compared
to that of traditional models. Therefore, a novel deep learning-based docking
approach named Dockformer is introduced in this study. Dockformer leverages
multimodal information to capture the geometric topology and structural
knowledge of molecules and can directly generate binding conformations with the
corresponding confidence measures in an end-to-end manner. The experimental
results show that Dockformer achieves success rates of 90.53% and 82.71% on the
PDBbind core set and PoseBusters benchmarks, respectively, and more than a
100-fold increase in the inference process speed, outperforming almost all
state-of-the-art docking methods. In addition, the ability of Dockformer to
identify the main protease inhibitors of coronaviruses is demonstrated in a
real-world virtual screening scenario. Considering its high docking accuracy
and screening efficiency, Dockformer can be regarded as a powerful and robust
tool in the field of drug design. [COMMENTS]15 pages, 10 figures [LINK]http://arxiv.org/abs/2411.06740v4 [DATE]2024-12-05 22:56:30+08:00 [CATEGORIES]cs.LG
Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based on gradual information disclosure [AUTHORS]Florens Rohde, Victor Christen, Martin Franke, Erhard Rahm [ABSTRACT]Privacy-Preserving Record linkage (PPRL) is an essential component in data
integration tasks of sensitive information. The linkage quality determines the
usability of combined datasets and (machine learning) applications based on
them. We present a novel privacy-preserving protocol that integrates clerical
review in PPRL using a multi-layer active learning process. Uncertain match
candidates are reviewed on several layers by human and non-human oracles to
reduce the amount of disclosed information per record and in total. Predictions
are propagated back to update previous layers, resulting in an improved linkage
performance for non-reviewed candidates as well. The data owners remain in
control of the amount of information they share for each record. Therefore, our
approach follows need-to-know and data sovereignty principles. The experimental
evaluation on real-world datasets shows considerable linkage quality
improvements with limited labeling effort and privacy risks. [COMMENTS]Accepted at 21st Conference on Database Systems for Business,
Technology and Web (BTW) [LINK]http://arxiv.org/abs/2412.04178v1 [DATE]2024-12-05 22:18:50+08:00 [CATEGORIES]cs.LG
When Stability meets Sufficiency: Informative Explanations that do not Overwhelm [AUTHORS]Ronny Luss, Amit Dhurandhar [ABSTRACT]Recent studies evaluating various criteria for explainable artificial
intelligence (XAI) suggest that fidelity, stability, and comprehensibility are
among the most important metrics considered by users of AI across a diverse
collection of usage contexts. We consider these criteria as applied to
feature-based attribution methods, which are amongst the most prevalent in XAI
literature. Going beyond standard correlation, methods have been proposed that
highlight what should be minimally sufficient to justify the classification of
an input (viz. pertinent positives). While minimal sufficiency is an attractive
property akin to comprehensibility, the resulting explanations are often too
sparse for a human to understand and evaluate the local behavior of the model.
To overcome these limitations, we incorporate the criteria of stability and
fidelity and propose a novel method called Path-Sufficient Explanations Method
(PSEM) that outputs a sequence of stable and sufficient explanations for a
given input of strictly decreasing size (or value) -- from original input to a
minimally sufficient explanation -- which can be thought to trace the local
boundary of the model in a stable manner, thus providing better intuition about
the local model behavior for the specific input. We validate these claims, both
qualitatively and quantitatively, with experiments that show the benefit of
PSEM across three modalities (image, tabular and text) as well as versus other
path explanations. A user study demonstrates the strength of the method in
communicating the local behavior, where (many) users are able to correctly
determine the prediction made by a model. [COMMENTS]Published at TMLR [LINK]http://arxiv.org/abs/2109.06181v2 [DATE]2024-12-05 21:50:59+08:00 [CATEGORIES]cs.LG
Looking at Model Debiasing through the Lens of Anomaly Detection [AUTHORS]Vito Paolo Pastore, Massimiliano Ciranni, Davide Marinelli, Francesca Odone, Vittorio Murino [ABSTRACT]It is widely recognized that deep neural networks are sensitive to bias in
the data. This means that during training these models are likely to learn
spurious correlations between data and labels, resulting in limited
generalization abilities and low performance. In this context, model debiasing
approaches can be devised aiming at reducing the model's dependency on such
unwanted correlations, either leveraging the knowledge of bias information or
not. In this work, we focus on the latter and more realistic scenario, showing
the importance of accurately predicting the bias-conflicting and bias-aligned
samples to obtain compelling performance in bias mitigation. On this ground, we
propose to conceive the problem of model bias from an out-of-distribution
perspective, introducing a new bias identification method based on anomaly
detection. We claim that when data is mostly biased, bias-conflicting samples
can be regarded as outliers with respect to the bias-aligned distribution in
the feature space of a biased model, thus allowing for precisely detecting them
with an anomaly detection method. Coupling the proposed bias identification
approach with bias-conflicting data upsampling and augmentation in a two-step
strategy, we reach state-of-the-art performance on synthetic and real benchmark
datasets. Ultimately, our proposed approach shows that the data bias issue does
not necessarily require complex debiasing methods, given that an accurate bias
identification procedure is defined. Source code is available at
https://github.com/Malga-Vision/MoDAD [COMMENTS]13 pages, 8 figures; Accepted at IEEE/CVF Winter Conference on
Applications of Computer Vision (WACV) 2025 [LINK]http://arxiv.org/abs/2407.17449v3 [DATE]2024-12-05 21:37:51+08:00 [CATEGORIES]cs.LG
GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning [AUTHORS]Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang [ABSTRACT]Genetic variants (GVs) are defined as differences in the DNA sequences among
individuals and play a crucial role in diagnosing and treating genetic
diseases. The rapid decrease in next generation sequencing cost has led to an
exponential increase in patient-level GV data. This growth poses a challenge
for clinicians who must efficiently prioritize patient-specific GVs and
integrate them with existing genomic databases to inform patient management. To
address the interpretation of GVs, genomic foundation models (GFMs) have
emerged. However, these models lack standardized performance assessments,
leading to considerable variability in model evaluations. This poses the
question: How effectively do deep learning methods classify unknown GVs and
align them with clinically-verified GVs? We argue that representation learning,
which transforms raw data into meaningful feature spaces, is an effective
approach for addressing both indexing and classification challenges. We
introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring
variable-length contexts and detailed annotations, designed for deep learning
models to learn GV representations across various traits, diseases, tissue
types, and experimental contexts. Our contributions are three-fold: (i)
Construction of a comprehensive dataset with 7 million records, each labeled
with characteristics of the corresponding variants, alongside additional data
from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant
combinations, and 156 unique clinically verified GVs from real-world patients.
(ii) Analysis of the structure and properties of the dataset. (iii)
Experimentation on the dataset with pre-trained GFMs. The results show a
significant gap between GFMs' current capabilities and accurate GV
representation. We hope this dataset will help advance genomic deep learning to
bridge this gap. [COMMENTS]Preprint [LINK]http://arxiv.org/abs/2407.16940v2 [DATE]2024-12-05 21:30:16+08:00 [CATEGORIES]cs.LG
Learning Semantic Association Rules from Internet of Things Data [AUTHORS]Erkan Karabulut, Paul Groth, Victoria Degeler [ABSTRACT]Association Rule Mining (ARM) is the task of discovering commonalities in
data in the form of logical implications. ARM is used in the Internet of Things
(IoT) for different tasks including monitoring and decision-making. However,
existing methods give limited consideration to IoT-specific requirements such
as heterogeneity and volume. Furthermore, they do not utilize important static
domain-specific description data about IoT systems, which is increasingly
represented as knowledge graphs. In this paper, we propose a novel ARM pipeline
for IoT data that utilizes both dynamic sensor data and static IoT system
metadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method
(Aerial) as part of the pipeline to address the high volume of IoT data and
reduce the total number of rules that are resource-intensive to process. Aerial
learns a neural representation of the given data and extracts association rules
from this representation by exploiting the reconstruction (decoding) mechanism
of an autoencoder. Extensive evaluations on 3 IoT datasets from 2 domains show
that ARM on both static and dynamic IoT data results in more generically
applicable rules while Aerial can learn a more concise set of high-quality
association rules than the state-of-the-art with full coverage over the
datasets. [LINK]http://arxiv.org/abs/2412.03417v2 [DATE]2024-12-05 21:22:28+08:00 [CATEGORIES]cs.LG
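To make "association rules as logical implications" in the abstract above concrete, the sketch below computes the classical support and confidence of a rule over a few made-up sensor readings. It shows the standard ARM quantities only; it is not the Aerial method, and the item names and helper functions are hypothetical.

```python
def support(transactions, itemset) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent) -> float:
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent."""
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante


if __name__ == "__main__":
    # Hypothetical discretised sensor readings, one set per time window.
    readings = [
        {"temp=high", "fan=on"},
        {"temp=high", "fan=on"},
        {"temp=high", "fan=off"},
        {"temp=low", "fan=off"},
    ]
    print(support(readings, {"temp=high", "fan=on"}))
    print(confidence(readings, {"temp=high"}, {"fan=on"}))
```

Exhaustively scoring every candidate rule this way grows quickly with the number of items, which is the volume problem the abstract raises.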
DeiSAM: Segment Anything with Deictic Prompting [AUTHORS]Hikaru Shindo, Manuel Brack, Gopika Sudhakaran, Devendra Singh Dhami, Patrick Schramowski, Kristian Kersting [ABSTRACT]Large-scale, pre-trained neural networks have demonstrated strong
capabilities in various tasks, including zero-shot image segmentation. To
identify concrete objects in complex scenes, humans instinctively rely on
deictic descriptions in natural language, i.e., referring to something
depending on the context such as "The object that is on the desk and behind the
cup." However, deep learning approaches cannot reliably interpret such deictic
representations due to their lack of reasoning capabilities in complex
scenarios. To remedy this issue, we propose DeiSAM -- a combination of large
pre-trained neural networks with differentiable logic reasoners -- for deictic
promptable segmentation. Given a complex, textual segmentation description,
DeiSAM leverages Large Language Models (LLMs) to generate first-order logic
rules and performs differentiable forward reasoning on generated scene graphs.
Subsequently, DeiSAM segments objects by matching them to the logically
inferred image regions. As part of our evaluation, we propose the Deictic
Visual Genome (DeiVG) dataset, containing paired visual input and complex,
deictic textual prompts. Our empirical results demonstrate that DeiSAM is a
substantial improvement over purely data-driven baselines for deictic
promptable segmentation. [COMMENTS]Published as a conference paper at NeurIPS 2024 [LINK]http://arxiv.org/abs/2402.14123v2 [DATE]2024-12-05 21:15:34+08:00 [CATEGORIES]cs.LG
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment [AUTHORS]Jason Vega, Junsheng Huang, Gaokai Zhang, Hangoo Kang, Minjia Zhang, Gagandeep Singh [ABSTRACT]Safety alignment of Large Language Models (LLMs) has recently become a
critical objective of model developers. In response, a growing body of work has
been investigating how safety alignment can be bypassed through various
jailbreaking methods, such as adversarial attacks. However, these jailbreak
methods can be rather costly or involve a non-trivial amount of creativity and
effort, introducing the assumption that malicious users are high-resource or
sophisticated. In this paper, we study how simple random augmentations to the
input prompt affect safety alignment effectiveness in state-of-the-art LLMs,
such as Llama 3 and Qwen 2. We perform an in-depth evaluation of 17 different
models and investigate the intersection of safety under random augmentations
with multiple dimensions: augmentation type, model size, quantization,
fine-tuning-based defenses, and decoding strategies (e.g., sampling
temperature). We show that low-resource and unsophisticated attackers, i.e.
$\textit{stochastic monkeys}$, can significantly improve their chances of
bypassing alignment with just 25 random augmentations per prompt. Source code
and data: https://github.com/uiuc-focal-lab/stochastic-monkeys/ [COMMENTS]v2: Updated with changes from peer review rebuttal. v1: Version under
peer review [LINK]http://arxiv.org/abs/2411.02785v2 [DATE]2024-12-05 20:58:44+08:00 [CATEGORIES]cs.LG
DeepFEA: Deep Learning for Prediction of Transient Finite Element Analysis Solutions [AUTHORS]Georgios Triantafyllou, Panagiotis G. Kalozoumis, George Dimas, Dimitris K. Iakovidis [ABSTRACT]Finite Element Analysis (FEA) is a powerful but computationally intensive
method for simulating physical phenomena. Recent advancements in machine
learning have led to surrogate models capable of accelerating FEA. Yet there
are still limitations in developing surrogates of transient FEA models that can
simultaneously predict the solutions for both nodes and elements with
applicability on both the 2D and 3D domains. Motivated by this research gap,
this study proposes DeepFEA, a deep learning-based framework that leverages a
multilayer Convolutional Long Short-Term Memory (ConvLSTM) network branching
into two parallel convolutional neural networks to predict the solutions for
both nodes and elements of FEA models. The proposed network is optimized using
a novel adaptive learning algorithm, called Node-Element Loss Optimization
(NELO). NELO minimizes the error occurring at both branches of the network
enabling the prediction of solutions for transient FEA simulations. The
experimental evaluation of DeepFEA is performed on three datasets in the
context of structural mechanics, generated to serve as publicly available
reference datasets. The results show that DeepFEA can achieve less than 3%
normalized mean and root mean squared error for 2D and 3D simulation scenarios,
and inference times that are two orders of magnitude faster than FEA. In
contrast, relevant state-of-the-art methods face challenges with
multi-dimensional output and dynamic input prediction. Furthermore, DeepFEA's
robustness was demonstrated in a real-life biomedical scenario, confirming its
suitability for accurate and efficient predictions of FEA simulations. [COMMENTS]This work has been submitted to a journal for possible publication [LINK]http://arxiv.org/abs/2412.04121v1 [DATE]2024-12-05 20:46:18+08:00 [CATEGORIES]cs.LG
Group Distributionally Robust Optimization can Suppress Class Imbalance Effect in Network Traffic Classification [AUTHORS]Wumei Du, Dong Liang, Yiqin Lv, Xingxing Liang, Guanlin Wu, Qi Wang, Zheng Xie [ABSTRACT]Internet services have led to the eruption of network traffic, and machine
learning on these Internet data has become an indispensable tool, especially
when the application is risk-sensitive. This paper focuses on network traffic
classification in the presence of class imbalance, which fundamentally and
ubiquitously exists in Internet data analysis. Class imbalance tends to shift
the learned decision boundary away from the optimal one, resulting in a
suboptimal solution for machine learning models. To alleviate this effect, we
design strategies that mitigate class imbalance through the
lens of group distributionally robust optimization. Our approach iteratively
updates the non-parametric weights for separate classes and optimizes the
learning model by minimizing reweighted losses. We interpret the optimization
process as a Stackelberg game and perform extensive experiments on typical
benchmarks. Results show that our approach can not only suppress the negative
effect of class imbalance but also improve the comprehensive performance in
prediction. [LINK]http://arxiv.org/abs/2409.19214v2 [DATE]2024-12-05 20:45:09+08:00 [CATEGORIES]cs.LG
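The alternating scheme described in the abstract above (update per-class weights, then minimize the reweighted loss) can be sketched as follows. The exponentiated-gradient weight update used here is the one common in the group DRO literature and is an assumption about the paper's method; the function names and the step size are hypothetical.

```python
import math


def update_class_weights(weights, class_losses, step_size=0.5):
    """Shift weight toward classes with higher loss, then renormalise to sum to 1."""
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, class_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def reweighted_loss(weights, class_losses):
    """Objective the model would minimise for the current class weights."""
    return sum(w * loss for w, loss in zip(weights, class_losses))


if __name__ == "__main__":
    weights = [0.5, 0.5]
    losses = [0.2, 1.0]  # made-up per-class losses; class 1 is the hard class
    for step in range(3):
        weights = update_class_weights(weights, losses)
        print(step, [round(w, 3) for w in weights],
              round(reweighted_loss(weights, losses), 3))
```

In a full training loop the per-class losses would be recomputed after each model update, so the weights and the model adapt to each other.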
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs [AUTHORS]Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause [ABSTRACT]Recent efforts in fine-tuning language models often rely on automatic data
selection, commonly using Nearest Neighbors retrieval from large datasets.
However, we theoretically show that this approach tends to select redundant
data, limiting its effectiveness or even hurting performance. To address this,
we introduce SIFT, a data selection algorithm designed to reduce uncertainty
about the model's response given a prompt, which unifies ideas from retrieval
and active learning. Whereas Nearest Neighbor retrieval typically fails in the
presence of information duplication, SIFT accounts for information duplication
and optimizes the overall information gain of the selected examples. We focus
our evaluations on fine-tuning at test-time for prompt-specific language
modeling on the Pile dataset, and show that SIFT consistently outperforms
Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we
show that our uncertainty estimates can predict the performance gain of
test-time fine-tuning, and use this to develop an adaptive algorithm that
invests test-time compute proportional to realized performance gains. We
provide the $\texttt{activeft}$ (Active Fine-Tuning) library which can be used
as a drop-in replacement for Nearest Neighbor retrieval. [LINK]http://arxiv.org/abs/2410.08020v2 [DATE]2024-12-05 20:40:16+08:00 [CATEGORIES]cs.LG
Federated Learning in Mobile Networks: A Comprehensive Case Study on Traffic Forecasting [AUTHORS]Nikolaos Pavlidis, Vasileios Perifanis, Selim F. Yilmaz, Francesc Wilhelmi, Marco Miozzo, Pavlos S. Efraimidis, Remous-Aris Koutsiamanis, Pavol Mulinka, Paolo Dini [ABSTRACT]The increasing demand for efficient resource allocation in mobile networks
has catalyzed the exploration of innovative solutions that could enhance the
task of real-time cellular traffic prediction. Under these circumstances,
federated learning (FL) stands out as a distributed and privacy-preserving
solution to foster collaboration among different sites, thus enabling
responsive near-the-edge solutions. In this paper, we comprehensively study the
potential benefits of FL in telecommunications through a case study on
federated traffic forecasting using real-world data from base stations (BSs) in
Barcelona (Spain). Our study encompasses relevant aspects within the federated
experience, including model aggregation techniques, outlier management, the
impact of individual clients, personalized learning, and the integration of
exogenous sources of data. The performed evaluation is based on both prediction
accuracy and sustainability, thus showcasing the environmental impact of
employed FL algorithms in various settings. The findings from our study
highlight FL as a promising and robust solution for mobile traffic prediction,
emphasizing its twin merits as a privacy-conscious and environmentally
sustainable approach, while also demonstrating its capability to overcome data
heterogeneity and ensure high-quality predictions, marking a significant stride
towards its integration in mobile traffic management systems. [LINK]http://arxiv.org/abs/2412.04081v1 [DATE]2024-12-05 19:32:14+08:00 [CATEGORIES]cs.LG
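The abstract mentions model aggregation techniques without naming them. As one common example, the sketch below shows sample-size-weighted parameter averaging in the style of FedAvg. The client parameter vectors and sample counts are hypothetical, and nothing here is taken from the paper's implementation.

```python
def federated_average(client_params, client_sizes):
    """Average client parameter vectors, weighting each by its local sample count."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical parameters from three base stations and their local dataset sizes.
client_params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
client_sizes = [100, 100, 200]

global_params = federated_average(client_params, client_sizes)
print(global_params)
```

Weighting by sample count lets clients with more data pull the global model further, which is one reason heterogeneous clients are studied separately.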
VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset [AUTHORS]Orest Kupyn, Eugene Khvedchenia, Christian Rupprecht [ABSTRACT]Human head detection, keypoint estimation, and 3D head model fitting are
essential tasks with many applications. However, traditional real-world
datasets often suffer from bias, privacy, and ethical concerns, and they have
been recorded in laboratory environments, which makes it difficult for trained
models to generalize. Here, we introduce VGGHeads -- a large-scale synthetic
dataset generated with diffusion models for human head detection and 3D mesh
estimation. Our dataset comprises over 1 million high-resolution images, each
annotated with detailed 3D head meshes, facial landmarks, and bounding boxes.
Using this dataset, we introduce a new model architecture capable of
simultaneous head detection and head mesh reconstruction from a single image in
a single step. Through extensive experimental evaluations, we demonstrate that
models trained on our synthetic data achieve strong performance on real images.
Furthermore, the versatility of our dataset makes it applicable across a broad
spectrum of tasks, offering a general and comprehensive representation of human
heads. [LINK]http://arxiv.org/abs/2407.18245v2 [DATE]2024-12-05 19:29:56+08:00 [CATEGORIES]cs.LG
Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation [AUTHORS]Weihua Wang, Qiuyu Liang, Feilong Bao, Guanglai Gao [ABSTRACT]A quaternion contains one real part and three imaginary parts, which provides a
more expressive hypercomplex space for learning knowledge graphs. Existing
quaternion embedding models measure the plausibility of a triplet either
through semantic matching or geometric distance scoring functions. However, it
appears that semantic matching diminishes the separability of entities, while
the distance scoring function weakens the semantics of entities. To address
this issue, we propose a novel quaternion knowledge graph embedding model. Our
model combines semantic matching with entities' geometric distance to better
measure the plausibility of triplets. Specifically, in the quaternion space, we
perform a right rotation on the head entity and a reverse rotation on the tail entity
to learn rich semantic features. Then, we utilize distance adaptive
translations to learn geometric distance between entities. Furthermore, we
provide mathematical proofs to demonstrate our model can handle complex logical
relationships. Extensive experimental results and analyses show our model
significantly outperforms previous models on well-known knowledge graph
completion benchmark datasets. Our code is available at
https://github.com/llqy123/DaBR. [COMMENTS]Accepted by COLING 2025 [LINK]http://arxiv.org/abs/2412.04076v1 [DATE]2024-12-05 19:17:03+08:00 [CATEGORIES]cs.LG
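Quaternion rotation as used in embedding models rests on the Hamilton product. The sketch below shows that product and one plausible reading of "right rotation" and "reverse rotation": right-multiplying by a unit quaternion and by its conjugate. The entity and relation values are hypothetical, and the mapping to the paper's scoring function is an assumption, not the released DaBR code.

```python
import math


def hamilton_product(p, q):
    """Hamilton product of quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def norm(q):
    return math.sqrt(sum(x * x for x in q))


def normalize(q):
    n = norm(q)
    return tuple(x / n for x in q)


def conjugate(q):
    a, b, c, d = q
    return (a, -b, -c, -d)


# Hypothetical head-entity embedding and relation quaternion.
head = (0.5, -1.0, 2.0, 0.25)
relation = normalize((1.0, 2.0, 3.0, 4.0))  # unit quaternion, so it only rotates

rotated_head = hamilton_product(head, relation)                    # right rotation
restored_head = hamilton_product(rotated_head, conjugate(relation))  # reverse rotation

print(rotated_head, restored_head)
```

Because the relation is normalized, the rotation preserves the norm of the embedding, and applying the conjugate undoes it.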
Integrated Sensing and Communications for Low-Altitude Economy: A Deep Reinforcement Learning Approach [AUTHORS]Xiaowen Ye, Yuyi Mao, Xianghao Yu, Shu Sun, Liqun Fu, Jie Xu [ABSTRACT]This paper studies an integrated sensing and communications (ISAC) system for
low-altitude economy (LAE), where a ground base station (GBS) provides
communication and navigation services for authorized unmanned aerial vehicles
(UAVs), while sensing the low-altitude airspace to monitor the unauthorized
mobile target. The expected communication sum-rate over a given flight period
is maximized by jointly optimizing the beamforming at the GBS and UAVs'
trajectories, subject to the constraints on the average signal-to-noise ratio
requirement for sensing, the flight mission and collision avoidance of UAVs, as
well as the maximum transmit power at the GBS. Typically, this is a sequential
decision-making problem with the given flight mission. Thus, we transform it to
a specific Markov decision process (MDP) model called episode task. Based on
this modeling, we propose a novel LAE-oriented ISAC scheme, referred to as Deep
LAE-ISAC (DeepLSC), by leveraging the deep reinforcement learning (DRL)
technique. In DeepLSC, a reward function and a new action selection policy
termed constrained noise-exploration policy are judiciously designed to fulfill
various constraints. To enable efficient learning in episode tasks, we develop
a hierarchical experience replay mechanism, where the gist is to employ all
experiences generated within each episode to jointly train the neural network.
Besides, to enhance the convergence speed of DeepLSC, a symmetric experience
augmentation mechanism, which simultaneously permutes the indexes of all
variables to enrich available experience sets, is proposed. Simulation results
demonstrate that compared with benchmarks, DeepLSC yields a higher sum-rate
while meeting the preset constraints, achieves faster convergence, and is more
robust against different settings. [COMMENTS]submitted for an IEEE publication [LINK]http://arxiv.org/abs/2412.04074v1 [DATE]2024-12-05 19:12:46+08:00 [CATEGORIES]cs.LG
Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream [AUTHORS]Abdulkadir Gokce, Martin Schrimpf [ABSTRACT]When trained on large-scale object classification datasets, certain
artificial neural network models begin to approximate core object recognition
(COR) behaviors and neural response patterns in the primate visual ventral
stream (VVS). While recent machine learning advances suggest that scaling model
size, dataset size, and compute resources improve task performance, the impact
of scaling on brain alignment remains unclear. In this study, we explore
scaling laws for modeling the primate VVS by systematically evaluating over 600
models trained under controlled conditions on benchmarks spanning V1, V2, V4,
IT and COR behaviors. We observe that while behavioral alignment continues to
scale with larger models, neural alignment saturates. This observation remains
true across model architectures and training datasets, even though models with
stronger inductive bias and datasets with higher-quality images are more
compute-efficient. Increased scaling is especially beneficial for higher-level
visual areas, where small models trained on few samples exhibit only poor
alignment. Finally, we develop a scaling recipe, indicating that a greater
proportion of compute should be allocated to data samples over model size. Our
results suggest that while scaling alone might suffice for alignment with human
core object recognition behavior, it will not yield improved models of the
brain's visual ventral stream with current architectures and datasets,
highlighting the need for novel strategies in building brain-like models. [COMMENTS]10 pages for the main paper, 23 pages in total. 7 main figures and 7
supplementary figures. Code, model weights, and benchmark results can be
accessed at https://github.com/epflneuroailab/scaling-primate-vvs - In
version 2, Figure 7 and the related discussion are added, and the appendix is
updated [LINK]http://arxiv.org/abs/2411.05712v2 [DATE]2024-12-05 17:39:07+08:00 [CATEGORIES]cs.LG
Blind Underwater Image Restoration using Co-Operational Regressor Networks [AUTHORS]Ozer Can Devecioglu, Serkan Kiranyaz, Turker Ince, Moncef Gabbouj [ABSTRACT]The exploration of underwater environments is essential for applications such
as biological research, archaeology, and infrastructure maintenance. However,
underwater imaging is challenging due to the water's unique properties,
including scattering, absorption, color distortion, and reduced visibility. To
address such visual degradations, a variety of approaches have been proposed
ranging from basic signal processing methods to deep learning models; however,
none of them has proven to be consistently successful. In this paper, we
propose a novel machine learning model, Co-Operational Regressor Networks
(CoRe-Nets), designed to achieve the best possible underwater image
restoration. A CoRe-Net consists of two co-operating networks: the Apprentice
Regressor (AR), responsible for image transformation, and the Master Regressor
(MR), which evaluates the Peak Signal-to-Noise Ratio (PSNR) of the images
generated by the AR and feeds it back to AR. CoRe-Nets are built on
Self-Organized Operational Neural Networks (Self-ONNs), which offer a superior
learning capability by modulating nonlinearity in kernel transformations. The
effectiveness of the proposed model is demonstrated on the benchmark Large
Scale Underwater Image (LSUI) dataset. Leveraging the joint learning
capabilities of the two cooperating networks, the proposed model achieves the
state-of-the-art restoration performance with significantly reduced computational
complexity and often produces results that can even surpass the visual
quality of the ground truth with a 2-pass application. Our results and the
optimized PyTorch implementation of the proposed approach are now publicly
shared on GitHub. [COMMENTS]11 pages [LINK]http://arxiv.org/abs/2412.03995v1 [DATE]2024-12-05 17:15:21+08:00 [CATEGORIES]cs.LG
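The Master Regressor in this abstract evaluates the Peak Signal-to-Noise Ratio (PSNR) of restored images. For reference, the standard PSNR definition can be sketched as follows. Images are represented here as flat lists of pixel intensities, and the `max_value` default of 255 assumes 8-bit images; both are simplifications for illustration.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel images.
reference = [52, 80, 190, 255]
restored = [50, 83, 188, 250]
print(psnr(reference, restored))
```

Higher values indicate a restoration closer to the reference; identical images give infinity.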
LaserGuider: A Laser Based Physical Backdoor Attack against Deep Neural Networks [AUTHORS]Yongjie Xu, Guangke Chen, Fu Song, Yuqi Chen [ABSTRACT]Backdoor attacks embed hidden associations between triggers and targets in
deep neural networks (DNNs), causing them to predict the target when a trigger
is present while maintaining normal behavior otherwise. Physical backdoor
attacks, which use physical objects as triggers, are feasible but lack remote
control, temporal stealthiness, flexibility, and mobility. To overcome these
limitations, in this work, we propose a new type of backdoor triggers utilizing
lasers that feature long-distance transmission and instant-imaging properties.
Based on the laser-based backdoor triggers, we present a physical backdoor
attack, called LaserGuider, which possesses remote control ability and achieves
high temporal stealthiness, flexibility, and mobility. We also introduce a
systematic approach to optimize laser parameters for improving attack
effectiveness. Our evaluation on traffic sign recognition DNNs, critical in
autonomous vehicles, demonstrates that LaserGuider with three different
laser-based triggers achieves over 90% attack success rate with negligible
impact on normal inputs. Additionally, we release LaserMark, the first dataset
of real world traffic signs stamped with physical laser spots, to support
further research in backdoor attacks and defenses. [COMMENTS]In Proceedings of the 23rd International Conference on Applied
Cryptography and Network Security (ACNS), Munich, Germany, 23-26 June, 2025 [LINK]http://arxiv.org/abs/2412.03993v1 [DATE]2024-12-05 17:14:50+08:00 [CATEGORIES]cs.LG
Local Curvature Smoothing with Stein's Identity for Efficient Score Matching [AUTHORS]Genki Osada, Makoto Shing, Takashi Nishide [ABSTRACT]The training of score-based diffusion models (SDMs) is based on score
matching. The challenge of score matching is that it includes a computationally
expensive Jacobian trace. While several methods have been proposed to avoid
this computation, each has drawbacks, such as instability during training and
approximating the learning as learning a denoising vector field rather than a
true score. We propose a novel score matching variant, local curvature
smoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by
applying Stein's identity, enabling regularization effectiveness and efficient
computation. We show that LCSS surpasses existing methods in sample generation
performance and matches the performance of denoising score matching, widely
adopted by most SDMs, in evaluations such as FID, Inception score, and bits per
dimension. Furthermore, we show that LCSS enables realistic image generation
even at a high resolution of $1024 \times 1024$. [COMMENTS]Accepted at NeurIPS 2024 [LINK]http://arxiv.org/abs/2412.03962v1 [DATE]2024-12-05 16:26:13+08:00 [CATEGORIES]cs.LG
Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16 [AUTHORS]José Camacho, Katarzyna Wasielewska, Pablo Espinosa, Marta Fuentes-García [ABSTRACT]Autonomous or self-driving networks are expected to provide a solution to the
myriad of extremely demanding new applications with minimal human supervision.
For this purpose, the community relies on the development of new Machine
Learning (ML) models and techniques.
However, ML can only be as good as the data it is fitted with, and data quality
is an elusive concept difficult to assess. In this paper, we show that
relatively minor modifications on a benchmark dataset (UGR'16, a flow-based
real-traffic dataset for anomaly detection) cause significantly more impact on
model performance than the specific ML technique considered. We also show that
the measured model performance is uncertain, as a result of labelling
inaccuracies. Our findings illustrate that the widely adopted approach of
comparing a set of models in terms of performance results (e.g., in terms of
accuracy or ROC curves) may lead to incorrect conclusions when done without a
proper understanding of dataset biases and sensitivity. We contribute a
methodology to interpret a model response that can be useful for this
understanding. [LINK]http://arxiv.org/abs/2305.19770v2 [DATE]2024-12-05 15:46:11+08:00 [CATEGORIES]cs.LG
JANUS: A Difference-Oriented Analyzer For Financial Centralization Risks in Smart Contracts [AUTHORS]Wansen Wang, Pu Zhang, Renjie Ji, Wenchao Huang, Zhaoyi Meng, Yan Xiong [ABSTRACT]Some smart contracts violate decentralization principles by defining
privileged accounts that manage other users' assets without permission,
introducing centralization risks that have caused financial losses. Existing
methods, however, face challenges in accurately detecting diverse
centralization risks due to their dependence on predefined behavior patterns.
In this paper, we propose JANUS, an automated analyzer for Solidity smart
contracts that detects financial centralization risks independently of their
specific behaviors. JANUS identifies differences between states reached by
privileged and ordinary accounts, and analyzes whether these differences are
finance-related. Focusing on the impact of risks rather than behaviors, JANUS
achieves improved accuracy compared to existing tools and can uncover
centralization risks with unknown patterns.
To evaluate JANUS's performance, we compare it with other tools using a
dataset of 540 contracts. Our evaluation demonstrates that JANUS outperforms
representative tools in terms of detection accuracy for financial
centralization risks. Additionally, we evaluate JANUS on a real-world dataset
of 33,151 contracts, successfully identifying two types of risks that other
tools fail to detect. We also prove that the state traversal method and
variable summaries, which are used in JANUS to reduce the number of states to
be compared, do not introduce false alarms or omissions in detection. [LINK]http://arxiv.org/abs/2412.03938v1 [DATE]2024-12-05 15:35:56+08:00 [CATEGORIES]cs.LG
Developing a Thailand solar irradiance map using Himawari-8 satellite imageries and deep learning models [AUTHORS]Suwichaya Suwanwimolkul, Natanon Tongamrak, Nuttamon Thungka, Naebboon Hoonchareon, Jitkomut Songsiri [ABSTRACT]This paper presents an online platform showing a Thailand solar irradiance map
every 30 minutes, available at https://www.cusolarforecast.com. The methodology
for estimating global horizontal irradiance (GHI) across Thailand relies on
cloud index extracted from Himawari-8 satellite imagery, Ineichen clear-sky
model with locally-tuned Linke turbidity, and machine learning models. The
methods take clear-sky irradiance, cloud index, re-analyzed GHI and temperature
data from the MERRA-2 database, and date-time as inputs for GHI estimation
models, including LightGBM, LSTM, Informer, and Transformer. These are
benchmarked with the estimate from a commercial service X by evaluation of
15-minute ground GHI data from 53 ground stations over 1.5 years during
2022-2023. The results show that the four models exhibit comparable overall MAE
performance to the service X. The best model is LightGBM with an overall MAE of
78.58 W/sqm and RMSE of 118.97 W/sqm, while the service X achieves the lowest
MAE, RMSE, and MBE in cloudy conditions. Obtaining re-analyzed MERRA-2 data for
the whole Thailand region is not economically feasible for deployment. When
removing these features, the Informer model has a winning performance in MAE of
78.67 W/sqm. The obtained performance aligns with existing literature by taking
the climate zone and time granularity of data into consideration. As the map
shows an estimate of GHI over 93,000 grids with a frequent update, the paper
also describes a computational framework for displaying the entire map. It
tests the runtime performance of deep learning models in the GHI estimation
process. [COMMENTS]23 pages, 14 figures [LINK]http://arxiv.org/abs/2409.16320v3 [DATE]2024-12-05 15:14:52+08:00 [CATEGORIES]cs.LG
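The abstract compares models by MAE, RMSE, and MBE. A minimal sketch of these three error metrics follows. The GHI values are hypothetical, and the sign convention for MBE (predicted minus observed) is an assumption, since conventions vary across the literature.

```python
import math


def mae(predicted, observed):
    """Mean absolute error."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def rmse(predicted, observed):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def mbe(predicted, observed):
    """Mean bias error (predicted minus observed); negative means underestimation."""
    return sum(p - o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical GHI estimates and ground measurements in W/sqm.
predicted_ghi = [100.0, 200.0, 300.0]
observed_ghi = [110.0, 190.0, 330.0]

print(mae(predicted_ghi, observed_ghi))
print(rmse(predicted_ghi, observed_ghi))
print(mbe(predicted_ghi, observed_ghi))
```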
MT3DNet: Multi-Task learning Network for 3D Surgical Scene Reconstruction [AUTHORS]Mithun Parab, Pranay Lendave, Jiyoung Kim, Thi Quynh Dan Nguyen, Palash Ingle [ABSTRACT]In image-assisted minimally invasive surgeries (MIS), understanding surgical
scenes is vital for real-time feedback to surgeons, skill evaluation, and
improving outcomes through collaborative human-robot procedures. Within this
context, the challenge lies in accurately detecting, segmenting, and estimating
the depth of surgical scenes depicted in high-resolution images, while
simultaneously reconstructing the scene in 3D and providing segmentation of
surgical instruments along with detection labels for each instrument. To
address this challenge, a novel Multi-Task Learning (MTL) network is proposed
for performing these tasks concurrently. A key aspect of this approach involves
overcoming the optimization hurdles associated with handling multiple tasks
concurrently by integrating an Adversarial Weight Update into the MTL framework.
The proposed MTL model achieves 3D reconstruction through the integration of
segmentation, depth estimation, and object detection, thereby enhancing the
understanding of surgical scenes, which marks a significant advancement
compared to existing studies that lack 3D capabilities. Comprehensive
experiments on the EndoVis2018 benchmark dataset underscore the adeptness of
the model in efficiently addressing all three tasks, demonstrating the efficacy
of the proposed techniques. [LINK]http://arxiv.org/abs/2412.03928v1 [DATE]2024-12-05 15:07:35+08:00 [CATEGORIES]cs.LG
MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models [AUTHORS]Ming-Chang Chiu, Shicheng Wen, Pin-Yu Chen, Xuezhe Ma [ABSTRACT]In vision-language models (VLMs), the ability to perceive and interpret color
and physical environment is crucial for achieving contextually accurate
understanding and interaction. However, despite advances in multimodal
modeling, there remains a significant lack of specialized datasets that
rigorously evaluate a model's capacity to discern subtle color variations and
spatial context -- critical elements for situational comprehension and reliable
deployment across real-world applications. Toward that goal, we curate
MegaCOIN, a high-quality, human-labeled dataset based on real images
with various contextual attributes. MegaCOIN consists of two parts:
MegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for
VLMs; and MegaCOIN-Bench, an annotated test set that can be used as a
stand-alone QA dataset. MegaCOIN provides three annotated features for 220,000
real images: foreground color, background color, and description of an object's
physical environment, constituting 660k human annotations. In addition,
MegaCOIN can be applied to benchmark domain generalization (DG) algorithms. We
explore benchmarking DG methods in the linear probing setup for VLM and show
some new insights. Last but not least, we show that VLMs, including GPT-4o,
have subpar color recognition capabilities, and fine-tuning with MegaCOIN can
result in improved performance on visual evaluation tasks. In certain cases,
MegaCOIN fine-tuned small-scale open-source models such as LLaVA and Bunny can
outperform closed-source GPT-4o. We hope the utilities of MegaCOIN can shed
light on the directions VLMs can improve and provide a more complex platform
for domain generalization algorithms. [COMMENTS]8 pages, 13 tables, 2 figures [LINK]http://arxiv.org/abs/2412.03927v1 [DATE]2024-12-05 15:06:17+08:00 [CATEGORIES]cs.LG
Quantized and Interpretable Learning Scheme for Deep Neural Networks in Classification Task [AUTHORS]Alireza Maleki, Mahsa Lavaei, Mohsen Bagheritabar, Salar Beigzad, Zahra Abadi [ABSTRACT]Deep learning techniques have proven highly effective in image
classification, but their deployment in resource-constrained environments
remains challenging due to high computational demands. Furthermore, their
interpretability is of high importance which demands even more available
resources. In this work, we introduce an approach that combines saliency-guided
training with quantization techniques to create an interpretable and
resource-efficient model without compromising accuracy. We utilize
Parameterized Clipping Activation (PACT) to perform quantization-aware
training, specifically targeting activations and weights to optimize precision
while minimizing resource usage. Concurrently, saliency-guided training is
employed to enhance interpretability by iteratively masking features with low
gradient values, leading to more focused and meaningful saliency maps. This
training procedure helps in mitigating noisy gradients and yields models that
provide clearer, more interpretable insights into their decision-making
processes. To evaluate the impact of our approach, we conduct experiments using
well-known Convolutional Neural Network (CNN) architectures on the MNIST and
CIFAR-10 benchmark datasets. We compare the saliency
maps generated by standard and quantized models to assess the influence of
quantization on both interpretability and classification accuracy. Our results
demonstrate that the combined use of saliency-guided training and PACT-based
quantization not only maintains classification performance but also produces
models that are significantly more efficient and interpretable, making them
suitable for deployment in resource-limited settings. [LINK]http://arxiv.org/abs/2412.03915v1 [DATE]2024-12-05 14:34:06+08:00 [CATEGORIES]cs.LG
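PACT quantizes an activation by clipping it to the range [0, alpha] and then rounding to one of 2^k uniformly spaced levels. The sketch below shows that forward computation for a single scalar. It treats `alpha` as a fixed number, whereas in PACT it is a learned parameter, and it omits the straight-through gradient used during training; the input values are hypothetical.

```python
def pact_quantize(x, alpha, bits):
    """Clip x to [0, alpha], then round to the nearest of 2**bits uniform levels."""
    clipped = min(max(x, 0.0), alpha)
    scale = (2 ** bits - 1) / alpha  # number of quantization steps per unit
    return round(clipped * scale) / scale


# Hypothetical activations quantized to 2 bits with a clipping level of 6.0.
activations = [-1.0, 0.4, 2.9, 4.7, 10.0]
quantized = [pact_quantize(a, alpha=6.0, bits=2) for a in activations]
print(quantized)
```

With 2 bits and alpha equal to 6.0, every output falls on one of the four levels 0, 2, 4, or 6.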
Can Targeted Clean-Label Poisoning Attacks Generalize? [AUTHORS]Zhizhen Chen, Subrat Kishore Dutta, Zhengyu Zhao, Chenhao Lin, Chao Shen, Xiao Zhang [ABSTRACT]Targeted poisoning attacks aim to compromise the model's prediction on
specific target samples. In a common clean-label setting, they are achieved by
slightly perturbing a subset of training samples given access to those specific
targets. Despite continuous efforts, it remains unexplored whether such attacks
can generalize to unknown variations of those targets. In this paper, we take
the first step to systematically study this generalization problem. Observing
that the widely adopted, cosine similarity-based attack exhibits limited
generalizability, we propose a well-generalizable attack that leverages both
the direction and magnitude of model gradients. In particular, we explore
diverse target variations, such as an object with varied viewpoints and an
animal species with distinct appearances. Extensive experiments across various
generalization scenarios demonstrate that our method consistently achieves the
best attack effectiveness. For example, our method outperforms the cosine
similarity-based attack by 20.95% in attack success rate with similar overall
accuracy, averaged over four models on two image benchmark datasets. The code
is available at https://github.com/jiaangk/generalizable_tcpa [COMMENTS]12 pages, 5 figures, 5 tables [LINK]http://arxiv.org/abs/2412.03908v1 [DATE]2024-12-05 14:27:14+08:00 [CATEGORIES]cs.LG
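The contrast drawn in this abstract, between matching only gradient direction and also accounting for magnitude, can be seen in a small numerical example. The sketch below compares cosine similarity with squared Euclidean distance on two hypothetical gradient vectors. The distance is a stand-in chosen for illustration and is not the objective proposed in the paper.

```python
import math


def cosine_similarity(u, v):
    """Direction-only agreement; unchanged when either vector is rescaled."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def squared_distance(u, v):
    """Sensitive to both direction and magnitude."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical target gradient and a poison gradient with the same direction
# but one tenth of the magnitude.
target_grad = [1.0, 2.0, -1.0]
poison_grad = [0.1, 0.2, -0.1]

print(cosine_similarity(target_grad, poison_grad))
print(squared_distance(target_grad, poison_grad))
```

Cosine similarity reports a near-perfect match here even though the two gradients differ by a factor of ten, while the distance-based measure exposes the gap.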
HERO: Hint-Based Efficient and Reliable Query Optimizer [AUTHORS]Sergey Zinchenko, Sergey Iazov [ABSTRACT]We propose a novel model for learned query optimization which provides query
hints leading to better execution plans. The model addresses the three key
challenges in learned hint-based query optimization: reliable hint
recommendation (ensuring non-degradation of query latency), efficient hint
exploration, and fast inference. We provide an in-depth analysis of existing
NN-based approaches to hint-based optimization and experimentally confirm the
named challenges for them. Our alternative solution consists of a new inference
schema based on an ensemble of context-aware models and a graph storage for
reliable hint suggestion and fast inference, and a budget-controlled training
procedure with a local search algorithm that solves the issue of exponential
search space exploration. In experiments on standard benchmarks, our model
demonstrates optimization capability close to the best achievable with
coarse-grained hints. Controlling the degree of parallelism (query dop) in
addition to operator-related hints enables our model to achieve 3x latency
improvement on JOB benchmark which sets a new standard for optimization. Our
model is interpretable and easy to debug, which is particularly important for
deployment in production. [COMMENTS]Submitted to VLDB 2025; 13 pages; 13 figures [LINK]http://arxiv.org/abs/2412.02372v2 [DATE]2024-12-05 14:00:34+08:00 [CATEGORIES]cs.LG
RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy [AUTHORS]Geonho Lee, Janghwan Lee, Sukjin Hong, Minsoo Kim, Euijai Ahn, Du-Seong Chang, Jungwook Choi [ABSTRACT]Low-rank adaptation (LoRA) has become the dominant method for
parameter-efficient LLM fine-tuning, with LoRA-based quantization error
compensation (LQEC) emerging as a powerful tool for recovering accuracy in
compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, with
no prior investigation into understanding this limitation. We propose RILQ
(Rank-Insensitive LoRA-based Quantization Error Compensation) to understand
this fundamental limitation and boost 2-bit LLM accuracy. Based on rank analysis
revealing model-wise activation discrepancy loss's rank-insensitive nature,
RILQ employs this loss to adjust adapters cooperatively across layers, enabling
robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and
LLaMA-3 demonstrate RILQ's consistent improvements in 2-bit quantized inference
across various state-of-the-art quantizers and enhanced accuracy in
task-specific fine-tuning. RILQ maintains computational efficiency comparable
to existing LoRA methods, enabling adapter-merged weight-quantized LLM
inference with significantly enhanced accuracy, making it a promising approach
for boosting 2-bit LLM performance. [COMMENTS]The typo in Table 4 has been corrected [LINK]http://arxiv.org/abs/2412.01129v2 [DATE]2024-12-05 13:05:01+08:00 [CATEGORIES]cs.LG
Training MLPs on Graphs without Supervision [AUTHORS]Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye [ABSTRACT]Graph Neural Networks (GNNs) have demonstrated their effectiveness in various
graph learning tasks, yet their reliance on neighborhood aggregation during
inference poses challenges for deployment in latency-sensitive applications,
such as real-time financial fraud detection. To address this limitation, recent
studies have proposed distilling knowledge from teacher GNNs into student
Multi-Layer Perceptrons (MLPs) trained on node content, aiming to accelerate
inference. However, these approaches often inadequately explore structural
information when inferring unseen nodes. To this end, we introduce SimMLP, a
Self-supervised framework for learning MLPs on graphs, designed to fully
integrate rich structural information into MLPs. Notably, SimMLP is the first
MLP-learning method that can achieve equivalence to GNNs in the optimal case.
The key idea is to employ self-supervised learning to align the representations
encoded by graph context-aware GNNs and neighborhood dependency-free MLPs,
thereby fully integrating the structural information into MLPs. We provide a
comprehensive theoretical analysis, demonstrating the equivalence between
SimMLP and GNNs based on mutual information and inductive bias, highlighting
SimMLP's advanced structural learning capabilities. Additionally, we conduct
extensive experiments on 20 benchmark datasets, covering node classification,
link prediction, and graph classification, to showcase SimMLP's superiority
over state-of-the-art baselines, particularly in scenarios involving unseen
nodes (e.g., inductive and cold-start node classification) where structural
insights are crucial. Our codes are available at:
https://github.com/Zehong-Wang/SimMLP. [COMMENTS]Accepted by WSDM 25 [LINK]http://arxiv.org/abs/2412.03864v1 [DATE]2024-12-05 12:20:54+08:00 [CATEGORIES]cs.LG
Exploring Kolmogorov-Arnold networks for realistic image sharpness assessment [AUTHORS]Shaode Yu, Ze Chen, Zhimu Yang, Jiacheng Gu, Bizu Feng [ABSTRACT]Score prediction is crucial in evaluating realistic image sharpness based on
collected informative features. Recently, Kolmogorov-Arnold networks (KANs)
have been developed and have achieved remarkable success in data fitting. This
study introduces the Taylor series-based KAN (TaylorKAN). Then, different KANs
are explored in four realistic image databases (BID2011, CID2013, CLIVE, and
KonIQ-10k) to predict the scores by using 15 mid-level features and 2048
high-level features. Compared to support vector regression, results show that
KANs are generally competitive or superior, and TaylorKAN is the best one when
mid-level features are used. This is the first study to investigate KANs on
image quality assessment that sheds some light on how to select and further
improve KANs in related tasks. [LINK]http://arxiv.org/abs/2409.07762v3 [DATE]2024-12-05 10:59:02+08:00 [CATEGORIES]cs.LG
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks [AUTHORS]Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu [ABSTRACT]Existing methodologies for animating portrait images face significant
challenges, particularly in handling non-frontal perspectives, rendering
dynamic objects around the portrait, and generating immersive, realistic
backgrounds. In this paper, we introduce the first application of a pretrained
transformer-based video generative model that demonstrates strong
generalization capabilities and generates highly dynamic, realistic videos for
portrait animation, effectively addressing these challenges. The adoption of a
new video backbone model makes previous U-Net-based methods for identity
maintenance, audio conditioning, and video extrapolation inapplicable. To
address this limitation, we design an identity reference network consisting of
a causal 3D VAE combined with a stacked series of transformer layers, ensuring
consistent facial identity across video sequences. Additionally, we investigate
various speech audio conditioning and motion frame mechanisms to enable the
generation of continuous video driven by speech audio. Our method is validated
through experiments on benchmark and newly proposed wild datasets,
demonstrating substantial improvements over prior methods in generating
realistic portraits characterized by diverse orientations within dynamic and
immersive scenes. Further visualizations and the source code are available at:
https://fudan-generative-vision.github.io/hallo3/. [LINK]http://arxiv.org/abs/2412.00733v2 [DATE]2024-12-05 10:55:56+08:00 [CATEGORIES]cs.LG
Marginal Causal Flows for Validation and Inference [AUTHORS]Daniel de Vassimon Manela, Laura Battaglia, Robin J. Evans [ABSTRACT]Investigating the marginal causal effect of an intervention on an outcome
from complex data remains challenging due to the inflexibility of employed
models and the lack of complexity in causal benchmark datasets, which often
fail to reproduce intricate real-world data patterns. In this paper we
introduce Frugal Flows, a novel likelihood-based machine learning model that
uses normalising flows to flexibly learn the data-generating process, while
also directly inferring the marginal causal quantities from observational data.
We propose that these models are exceptionally well suited for generating
synthetic data to validate causal methods. They can create synthetic datasets
that closely resemble the empirical dataset, while automatically and exactly
satisfying a user-defined average treatment effect. To our knowledge, Frugal
Flows are the first generative model to both learn flexible data
representations and also exactly parameterise quantities such as the average
treatment effect and the degree of unobserved confounding. We demonstrate the
above with experiments on both simulated and real-world datasets. [COMMENTS]23 pages, 10 figures, Accepted as a Poster at NeurIPS 2024 [LINK]http://arxiv.org/abs/2411.01295v2 [DATE]2024-12-05 10:49:36+08:00 [CATEGORIES]cs.LG
Expressivity of Representation Learning on Continuous-Time Dynamic Graphs: An Information-Flow Centric Review [AUTHORS]Sofiane Ennadir, Gabriela Zarzar Gandler, Filip Cornell, Lele Cao, Oleg Smirnov, Tianze Wang, Levente Zólyomi, Björn Brinne, Sahar Asadi [ABSTRACT]Graphs are ubiquitous in real-world applications, ranging from social
networks to biological systems, and have inspired the development of Graph
Neural Networks (GNNs) for learning expressive representations. While most
research has centered on static graphs, many real-world scenarios involve
dynamic, temporally evolving graphs, motivating the need for Continuous-Time
Dynamic Graph (CTDG) models. This paper provides a comprehensive review of
Graph Representation Learning (GRL) on CTDGs with a focus on Self-Supervised
Representation Learning (SSRL). We introduce a novel theoretical framework that
analyzes the expressivity of CTDG models through an Information-Flow (IF) lens,
quantifying their ability to propagate and encode temporal and structural
information. Leveraging this framework, we categorize existing CTDG methods
based on their suitability for different graph types and application scenarios.
Within the same scope, we examine the design of SSRL methods tailored to CTDGs,
such as predictive and contrastive approaches, highlighting their potential to
mitigate the reliance on labeled data. Empirical evaluations on synthetic and
real-world datasets validate our theoretical insights, demonstrating the
strengths and limitations of various methods across long-range, bi-partite and
community-based graphs. This work offers both a theoretical foundation and
practical guidance for selecting and developing CTDG models, advancing the
understanding of GRL in dynamic settings. [COMMENTS]12-page main paper + 8-page appendix [LINK]http://arxiv.org/abs/2412.03783v1 [DATE]2024-12-05 08:12:50+08:00 [CATEGORIES]cs.LG
Exploring RAG-based Vulnerability Augmentation with LLMs [AUTHORS]Seyed Shayan Daneshvar, Yu Nong, Xu Yang, Shaowei Wang, Haipeng Cai [ABSTRACT]Detecting vulnerabilities is vital for software security, yet deep
learning-based vulnerability detectors (DLVD) face a data shortage, which
limits their effectiveness. Data augmentation can potentially alleviate the
data shortage, but augmenting vulnerable code is challenging and requires a
generative solution that maintains vulnerability. Previous works have only
focused on generating samples that contain single statements or specific types
of vulnerabilities. Recently, large language models (LLMs) have been used to
solve various code generation and comprehension tasks with inspiring results,
especially when fused with retrieval augmented generation (RAG). Therefore, we
propose VulScribeR, a novel LLM-based solution that leverages carefully curated
prompt templates to augment vulnerable datasets. More specifically, we explore
three strategies to augment both single and multi-statement vulnerabilities,
with LLMs, namely Mutation, Injection, and Extension. Our extensive evaluation
across three vulnerability datasets and DLVD models, using two LLMs, shows that
our approach beats two SOTA methods, Vulgen and VGX, and Random Oversampling
(ROS) by 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable
samples on average, and 53.84%, 54.10%, 69.90%, and 40.93% with 15K generated
vulnerable samples. Our approach demonstrates its feasibility for large-scale
data augmentation by generating 1K samples for as little as US$1.88. [COMMENTS]13 pages, 6 figures, 5 tables, 3 prompt templates, 1 algorithm [LINK]http://arxiv.org/abs/2408.04125v2 [DATE]2024-12-05 08:00:18+08:00 [CATEGORIES]cs.LG
Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration [AUTHORS]Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross [ABSTRACT]The goal of mechanistic interpretability is discovering simpler, low-rank
algorithms implemented by models. While we can compress activations into
features, compressing nonlinear feature-maps -- like MLP layers -- is an open
problem. In this work, we present the first case study in rigorously
compressing nonlinear feature-maps, which are the leading asymptotic bottleneck
to compressing small transformer models. We work in the classic setting of the
modular addition models, and target a non-vacuous bound on the behaviour of the
ReLU MLP in time linear in the parameter-count of the circuit. To study the
ReLU MLP analytically, we use the infinite-width lens, which turns
post-activation matrix multiplications into approximate integrals. We discover
a novel interpretation of the MLP layer in one-layer transformers implementing
the ``pizza'' algorithm: the MLP can be understood as evaluating a quadrature
scheme, where each neuron computes the area of a rectangle under the curve of a
trigonometric integral identity. Our code is available at
https://tinyurl.com/mod-add-integration. [LINK]http://arxiv.org/abs/2412.03773v1 [DATE]2024-12-05 07:29:07+08:00 [CATEGORIES]cs.LG
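The quadrature reading in the abstract above can be illustrated with a generic midpoint Riemann sum. This is a minimal sketch, not the authors' code, and it does not reproduce their trigonometric identity: the function name `midpoint_quadrature` and the choice of cosine as the integrand are assumptions made purely for illustration. Each term of the sum plays the role of one "neuron" contributing the area of a single rectangle under the curve.

```python
import math

def midpoint_quadrature(f, a, b, n):
    """Approximate the integral of f over [a, b] with n equal-width rectangles."""
    width = (b - a) / n
    # Each rectangle's height is f evaluated at the midpoint of its base.
    return sum(f(a + (i + 0.5) * width) * width for i in range(n))

# Illustrative integrand (an assumption): cos over [0, pi/2], whose exact integral is 1.
approx = midpoint_quadrature(math.cos, 0.0, math.pi / 2, 64)
exact = 1.0
error = abs(approx - exact)
```

With more rectangles the sum approaches the integral, which is the sense in which a wide layer of rectangle-computing units can approximate an integral.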
Diffusion in Zero-Shot Learning for Environmental Audio [AUTHORS]Ysobel Sims, Stephan Chalup, Alexandre Mendes [ABSTRACT]Zero-shot learning enables models to generalize to unseen classes by
leveraging semantic information, bridging the gap between training and testing
sets with non-overlapping classes. While much research has focused on zero-shot
learning in computer vision, the application of these methods to environmental
audio remains underexplored, with poor performance in existing studies.
Generative methods, which have demonstrated success in computer vision, are
notably absent from environmental audio zero-shot learning, where
classification-based approaches dominate.
To address this gap, this work investigates generative methods for zero-shot
learning in environmental audio. Two successful generative models from computer
vision are adapted: a cross-aligned and distribution-aligned variational
autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial
network (LisGAN). Additionally, a novel diffusion model conditioned on class
auxiliary data is introduced. The diffusion model generates synthetic data for
unseen classes, which is combined with seen-class data to train a classifier.
Experiments are conducted on two environmental audio datasets, ESC-50 and
FSC22. Results show that the diffusion model significantly outperforms all
baseline methods, achieving more than 25% higher accuracy on the ESC-50 test
partition.
This work establishes the diffusion model as a promising generative approach
for zero-shot learning and introduces the first benchmark of generative methods
for environmental audio zero-shot learning, providing a foundation for future
research in the field.
Code is provided at https://github.com/ysims/ZeroDiffusion for the novel
ZeroDiffusion method. [COMMENTS]This work has been submitted to the IEEE for possible publication [LINK]http://arxiv.org/abs/2412.03771v1 [DATE]2024-12-05 07:18:40+08:00 [CATEGORIES]cs.LG
Learning Networks from Wide-Sense Stationary Stochastic Processes [AUTHORS]Anirudh Rayas, Jiajun Cheng, Rajasekhar Anguluri, Deepjyoti Deka, Gautam Dasarathy [ABSTRACT]Complex networked systems driven by latent inputs are common in fields like
neuroscience, finance, and engineering. A key inference problem here is to
learn edge connectivity from node outputs (potentials). We focus on systems
governed by steady-state linear conservation laws: $X_t = L^{\ast} Y_t$,
where $X_t, Y_t \in \mathbb{R}^p$ denote inputs and potentials, respectively,
and the sparsity pattern of the $p \times p$ Laplacian $L^{\ast}$ encodes the
edge structure. Assuming $X_t$ to be a wide-sense stationary stochastic process
with a known spectral density matrix, we learn the support of $L^{\ast}$ from
temporally correlated samples of $Y_t$ via an $\ell_1$-regularized Whittle's
maximum likelihood estimator (MLE). The regularization is particularly useful
for learning large-scale networks in the high-dimensional setting where the
network size $p$ significantly exceeds the number of samples $n$.
We show that the MLE problem is strictly convex, admitting a unique solution.
Under a novel mutual incoherence condition and certain sufficient conditions on
$(n, p, d)$, we show that the ML estimate recovers the sparsity pattern of
$L^\ast$ with high probability, where $d$ is the maximum degree of the graph
underlying $L^{\ast}$. We provide recovery guarantees for $L^{\ast}$ in
element-wise maximum, Frobenius, and operator norms. Finally, we complement our
theoretical results with several simulation studies on synthetic and benchmark
datasets, including engineered systems (power and water networks), and
real-world datasets from neural systems (such as the human brain). [LINK]http://arxiv.org/abs/2412.03768v1 [DATE]2024-12-05 07:14:00+08:00 [CATEGORIES]cs.LG
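The conservation-law model in the abstract above, where inputs equal a graph Laplacian applied to potentials and the Laplacian's sparsity pattern encodes the edges, can be made concrete with a toy example. This is a minimal sketch of the forward model only, not the paper's regularized Whittle estimator; the 4-node path graph, the potential values, and the helper names `path_laplacian`, `matvec`, and `support` are all assumptions chosen for illustration.

```python
def path_laplacian(p):
    """Graph Laplacian of a path on p nodes: degree on the diagonal, -1 per edge."""
    L = [[0.0] * p for _ in range(p)]
    for i in range(p - 1):
        j = i + 1
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def matvec(L, y):
    """Multiply the matrix L by the vector y."""
    return [sum(l_ij * y_j for l_ij, y_j in zip(row, y)) for row in L]

def support(L, tol=1e-12):
    """Off-diagonal nonzeros of L, i.e. the edge set that L encodes."""
    p = len(L)
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(L[i][j]) > tol}

L_star = path_laplacian(4)
y = [3.0, 1.0, 4.0, 1.5]   # hypothetical potentials Y_t at one time step
x = matvec(L_star, y)      # inputs X_t = L* Y_t
edges = support(L_star)    # recovering this set from samples of Y_t is the learning problem
```

Because every row of a Laplacian sums to zero and the matrix is symmetric, the entries of `x` sum to zero, which is the conservation property the model name refers to.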
End to End Collaborative Synthetic Data Generation [AUTHORS]Sikha Pentyala, Geetha Sitaraman, Trae Claar, Martine De Cock [ABSTRACT]The success of AI is based on the availability of data to train models. While
in some cases a single data custodian may have sufficient data to enable AI,
often multiple custodians need to collaborate to reach a cumulative size
required for meaningful AI research. The latter is, for example, often the case
for rare diseases, with each clinical site having data for only a small number
of patients. Recent algorithms for federated synthetic data generation are an
important step towards collaborative, privacy-preserving data sharing. Existing
techniques, however, focus exclusively on synthesizer training, assuming that
the training data is already preprocessed and that the desired synthetic data
can be delivered in one shot, without any hyperparameter tuning. In this paper,
we propose an end-to-end collaborative framework for publishing of synthetic
data that accounts for privacy-preserving preprocessing as well as evaluation.
We instantiate this framework with Secure Multiparty Computation (MPC)
protocols and evaluate it in a use case for privacy-preserving publishing of
synthetic genomic data for leukemia. [LINK]http://arxiv.org/abs/2412.03766v1 [DATE]2024-12-05 07:10:51+08:00 [CATEGORIES]cs.LG
Good practices for evaluation of machine learning systems [AUTHORS]Luciana Ferrer, Odette Scharenborg, Tom Bäckström [ABSTRACT]Many development decisions affect the results obtained from ML experiments:
training data, features, model architecture, hyperparameters, test data, etc.
Among these aspects, arguably the most important design decisions are those
that involve the evaluation procedure. This procedure is what determines
whether the conclusions drawn from the experiments will or will not generalize
to unseen data and whether they will be relevant to the application of
interest. If the data is incorrectly selected, the wrong metric is chosen for
evaluation or the significance of the comparisons between models is
overestimated, conclusions may be misleading or result in suboptimal
development decisions. To avoid such problems, the evaluation protocol should
be very carefully designed before experimentation starts.
In this work we discuss the main aspects involved in the design of the
evaluation protocol: data selection, metric selection, and statistical
significance. This document is not meant to be an exhaustive tutorial on each
of these aspects. Instead, the goal is to explain the main guidelines that
should be followed in each case. We include examples taken from the speech
processing field, and provide a list of common mistakes related to each aspect. [COMMENTS]v1.0 [LINK]http://arxiv.org/abs/2412.03700v1 [DATE]2024-12-05 04:30:16+08:00 [CATEGORIES]cs.LG
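As a concrete illustration of the statistical-significance aspect mentioned in the abstract above, the following sketch computes a percentile bootstrap confidence interval for the accuracy difference between two systems scored on the same test items. It is not taken from the paper: the function name `bootstrap_diff_ci`, the simulated correctness vectors, and the choice of a paired percentile bootstrap are assumptions made for illustration.

```python
import random

def bootstrap_diff_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(correct_a) - mean(correct_b).

    Test items are resampled with replacement, keeping each item's pair of
    outcomes together so that per-item correlation is preserved.
    """
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-item correctness (1 = correct) for two systems on 200 shared test items.
data_rng = random.Random(1)
system_a = [1 if data_rng.random() < 0.80 else 0 for _ in range(200)]
system_b = [1 if data_rng.random() < 0.78 else 0 for _ in range(200)]
observed = (sum(system_a) - sum(system_b)) / 200
low, high = bootstrap_diff_ci(system_a, system_b)
```

If the interval `(low, high)` contains zero, the observed difference on this test set is not strong evidence that one system is better than the other.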
Tight Lower Bounds and Improved Convergence in Performative Prediction [AUTHORS]Pedram Khorsandi, Rushil Gupta, Mehrnaz Mofakhami, Simon Lacoste-Julien, Gauthier Gidel [ABSTRACT]Performative prediction is a framework accounting for the shift in the data
distribution induced by the prediction of a model deployed in the real world.
Ensuring rapid convergence to a stable solution where the data distribution
remains the same after the model deployment is crucial, especially in evolving
environments. This paper extends the Repeated Risk Minimization (RRM) framework
by utilizing historical datasets from previous retraining snapshots, yielding a
class of algorithms that we call Affine Risk Minimizers and enabling
convergence to a performatively stable point for a broader class of problems.
We introduce a new upper bound for methods that use only the final iteration of
the dataset and prove for the first time the tightness of both this new bound
and the previously existing bounds within the same regime. We also prove that
utilizing historical datasets can surpass the lower bound for last-iterate RRM,
and empirically observe faster convergence to the stable point on various
performative prediction benchmarks. We offer at the same time the first lower
bound analysis for RRM within the class of Affine Risk Minimizers, quantifying
the potential improvements in convergence speed that could be achieved with
other variants in our framework. [LINK]http://arxiv.org/abs/2412.03671v1 [DATE]2024-12-05 03:06:19+08:00 [CATEGORIES]cs.LG
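The retraining loop behind Repeated Risk Minimization, as described in the abstract above, can be sketched on a one-dimensional toy problem. This illustrates plain last-iterate RRM only, not the paper's Affine Risk Minimizers. The setup is an assumption chosen for illustration: deploying a model `theta` shifts the data mean to `mu + eps * theta`, and the loss is squared error, so retraining on the induced distribution simply returns its mean.

```python
def rrm_step(theta, mu=1.0, eps=0.5):
    """One retraining step on the distribution induced by deploying theta.

    Under squared loss, the risk minimizer is the mean of that distribution,
    which in this toy setup is mu + eps * theta.
    """
    return mu + eps * theta

def run_rrm(theta0, steps, mu=1.0, eps=0.5):
    """Repeatedly deploy the current model and retrain on the data it induces."""
    trajectory = [theta0]
    for _ in range(steps):
        trajectory.append(rrm_step(trajectory[-1], mu, eps))
    return trajectory

trajectory = run_rrm(theta0=0.0, steps=30)
# Performatively stable point mu / (1 - eps): retraining there returns the same model.
stable_point = 1.0 / (1.0 - 0.5)
gaps = [abs(t - stable_point) for t in trajectory]
```

In this toy setup the gap to the stable point shrinks by a factor of `eps` per retraining round, so convergence requires `|eps| < 1`; stronger distribution shift means slower convergence.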
Navigation World Models [AUTHORS]Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun [ABSTRACT]Navigation is a fundamental skill of agents with visual-motor capabilities.
We introduce a Navigation World Model (NWM), a controllable video generation
model that predicts future visual observations based on past observations and
navigation actions. To capture complex environment dynamics, NWM employs a
Conditional Diffusion Transformer (CDiT), trained on a diverse collection of
egocentric videos of both human and robotic agents, and scaled up to 1 billion
parameters. In familiar environments, NWM can plan navigation trajectories by
simulating them and evaluating whether they achieve the desired goal. Unlike
supervised navigation policies with fixed behavior, NWM can dynamically
incorporate constraints during planning. Experiments demonstrate its
effectiveness in planning trajectories from scratch or by ranking trajectories
sampled from an external policy. Furthermore, NWM leverages its learned visual
priors to imagine trajectories in unfamiliar environments from a single input
image, making it a flexible and powerful tool for next-generation navigation
systems. [COMMENTS]project page: https://www.amirbar.net/nwm/ [LINK]http://arxiv.org/abs/2412.03572v1 [DATE]2024-12-05 02:59:45+08:00 [CATEGORIES]cs.LG
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models [AUTHORS]Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G. Shapiro, Ranjay Krishna [ABSTRACT]Multimodal language models (MLMs) still face challenges in fundamental visual
perception tasks where specialized models excel. Tasks requiring reasoning
about 3D structures benefit from depth estimation, and reasoning about 2D
object instances benefits from object detection. Yet, MLMs cannot produce
intermediate depth or boxes to reason over. Finetuning MLMs on relevant data
doesn't generalize well, and outsourcing computation to specialized vision tools
is too compute-intensive and memory-inefficient. To address this, we introduce
Perception Tokens, intrinsic image representations designed to assist reasoning
tasks where language is insufficient. Perception tokens act as auxiliary
reasoning tokens, akin to chain-of-thought prompts in language models. For
example, in a depth-related task, an MLM augmented with perception tokens can
reason by generating a depth map as tokens, enabling it to solve the problem
effectively. We propose AURORA, a training method that augments MLMs with
perception tokens for improved reasoning over visual inputs. AURORA leverages a
VQVAE to transform intermediate image representations, such as depth maps, into
a tokenized format and bounding box tokens, which are then used in a multi-task
training framework. AURORA achieves notable improvements across counting
benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench,
outperforming finetuning approaches in generalization across datasets. It also
improves on relative depth: over +6% on BLINK. With perception tokens, AURORA
expands the scope of MLMs beyond language-based reasoning, paving the way for
more effective visual reasoning capabilities. [LINK]http://arxiv.org/abs/2412.03548v1 [DATE]2024-12-05 02:45:35+08:00 [CATEGORIES]cs.LG
Evaluating Single Event Upsets in Deep Neural Networks for Semantic Segmentation: an embedded system perspective [AUTHORS]Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe [ABSTRACT]As the deployment of artificial intelligence (AI) algorithms on edge devices
becomes increasingly prevalent, enhancing the robustness and reliability of
autonomous AI-based perception and decision systems is becoming as relevant as
precision and performance, especially in application areas considered
safety-critical such as autonomous driving and aerospace. This paper delves
into the robustness assessment in embedded Deep Neural Networks (DNNs),
particularly focusing on the impact of parameter perturbations produced by
single event upsets (SEUs) on convolutional neural networks (CNNs) for image
semantic segmentation. By scrutinizing the layer-by-layer and bit-by-bit
sensitivity of various encoder-decoder models to soft errors, this study
thoroughly investigates the vulnerability of segmentation DNNs to SEUs and
evaluates the consequences of techniques like model pruning and parameter
quantization on the robustness of compressed models aimed at embedded
implementations. The findings offer valuable insights into the mechanisms
underlying SEU-induced failures that allow for evaluating the robustness of
DNNs once trained in advance. Moreover, based on the collected data, we propose
a set of practical lightweight error mitigation techniques with no memory or
computational cost suitable for resource-constrained deployments. The code used
to perform the fault injection (FI) campaign is available at
https://github.com/jonGuti13/TensorFI2 , while the code to implement proposed
techniques is available at https://github.com/jonGuti13/parameterProtection . [LINK]http://arxiv.org/abs/2412.03630v1 [DATE]2024-12-05 02:28:38+08:00 [CATEGORIES]cs.LG
Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications [AUTHORS]Christina Sauer, Anne-Laure Boulesteix, Luzia Hanßum, Farina Hodiamont, Claudia Bausewein, Theresa Ullmann [ABSTRACT]Adequately generating and evaluating prediction models based on supervised
machine learning (ML) is often challenging, especially for less experienced
users in applied research areas. Special attention is required in settings
where the model generation process involves hyperparameter tuning, i.e.
data-driven optimization of different types of hyperparameters to improve the
predictive performance of the resulting model. Discussions about tuning
typically focus on the hyperparameters of the ML algorithm (e.g., the minimum
number of observations in each terminal node for a tree-based algorithm). In
this context, it is often neglected that hyperparameters also exist for the
preprocessing steps that are applied to the data before it is provided to the
algorithm (e.g., how to handle missing feature values in the data). As a
consequence, users experimenting with different preprocessing options to
improve model performance may be unaware that this constitutes a form of
hyperparameter tuning - albeit informal and unsystematic - and thus may fail to
report or account for this optimization. To illuminate this issue, this paper
reviews and empirically illustrates different procedures for generating and
evaluating prediction models, explicitly addressing the different ways
algorithm and preprocessing hyperparameters are typically handled by applied ML
users. By highlighting potential pitfalls, especially those that may lead to
exaggerated performance claims, this review aims to further improve the quality
of predictive modeling in ML applications. [LINK]http://arxiv.org/abs/2412.03491v1 [DATE]2024-12-05 01:29:10+08:00 [CATEGORIES]cs.LG
Coverage-Constrained Human-AI Cooperation with Multiple Experts [AUTHORS]Zheng Zhang, Cuong Nguyen, Kevin Wells, Thanh-Toan Do, David Rosewarne, Gustavo Carneiro [ABSTRACT]Human-AI cooperative classification (HAI-CC) approaches aim to develop hybrid
intelligent systems that enhance decision-making in various high-stakes
real-world scenarios by leveraging both human expertise and AI capabilities.
Current HAI-CC methods primarily focus on learning-to-defer (L2D), where
decisions are deferred to human experts, and learning-to-complement (L2C),
where AI and human experts make predictions cooperatively. However, a notable
research gap remains in effectively exploring both L2D and L2C under diverse
expert knowledge to improve decision-making, particularly when constrained by
the cooperation cost required to achieve a target probability for AI-only
selection (i.e., coverage). In this paper, we address this research gap by
proposing the Coverage-constrained Learning to Defer and Complement with
Specific Experts (CL2DC) method. CL2DC makes final decisions through either AI
prediction alone or by deferring to or complementing a specific expert,
depending on the input data. Furthermore, we propose a coverage-constrained
optimisation to control the cooperation cost, ensuring it approximates a target
probability for AI-only selection. This approach enables an effective
assessment of system performance within a specified budget. Also, CL2DC is
designed to address scenarios where training sets contain multiple noisy-label
annotations without any clean-label references. Comprehensive evaluations on
both synthetic and real-world datasets demonstrate that CL2DC achieves superior
performance compared to state-of-the-art HAI-CC methods. [LINK]http://arxiv.org/abs/2411.11976v2 [DATE]2024-12-05 01:13:22+08:00 [CATEGORIES]cs.LG
State Frequency Estimation for Anomaly Detection [AUTHORS]Clinton Cao, Agathe Blaise, Annibale Panichella, Sicco Verwer [ABSTRACT]Many works have studied the efficacy of state machines for detecting
anomalies within NetFlows. These works typically learn a model from unlabeled
data and compute anomaly scores for arbitrary traces based on their likelihood
of occurrence or how well they fit within the model. However, these methods do
not dynamically adapt their scores based on the traces seen at test time. This
becomes a problem when an adversary produces seemingly common traces in their
attack, causing the model to miss the detection by assigning low anomaly
scores. We propose SEQUENT, a new approach that uses the state visit frequency
to adapt its scoring for anomaly detection dynamically. SEQUENT subsequently
uses the scores to generate root causes for anomalies. These allow the grouping
of alarms and simplify the analysis of anomalies. Our evaluation of SEQUENT on
three NetFlow datasets indicates that our approach outperforms existing
methods, demonstrating its effectiveness in detecting anomalies. [COMMENTS]9 pages [LINK]http://arxiv.org/abs/2412.03442v1 [DATE]2024-12-05 00:30:35+08:00 [CATEGORIES]cs.LG
Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products [AUTHORS]Nadia Nahar, Christian Kästner, Jenna Butler, Chris Parnin, Thomas Zimmermann, Christian Bird [ABSTRACT]Large Language Models (LLMs) are increasingly embedded into software products
across diverse industries, enhancing user experiences, but at the same time
introducing numerous challenges for developers. Unique characteristics of LLMs
force developers, who are accustomed to traditional software development and
evaluation, out of their comfort zones as the LLM components shatter standard
assumptions about software systems. This study explores the emerging solutions
that software developers are adopting to navigate the encountered challenges.
Leveraging mixed-method research, including 26 interviews and a survey with
332 responses, the study identifies 19 emerging solutions regarding quality
assurance that practitioners across several product teams at Microsoft are
exploring. The findings provide valuable insights that can guide the
development and evaluation of LLM-based products more broadly in the face of
these challenges. [COMMENTS]10 pages, 2 tables [LINK]http://arxiv.org/abs/2410.12071v2 [DATE]2024-12-05 00:20:40+08:00 [CATEGORIES]cs.LG
SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model [AUTHORS]Yan Li, Ziya Zhou, Zhiqiang Wang, Wei Xue, Wenhan Luo, Yike Guo [ABSTRACT]Recent advancements in generative models have significantly enhanced talking
face video generation, yet singing video generation remains underexplored. The
differences between human talking and singing limit the performance of existing
talking face video generation models when applied to singing. The fundamental
differences between talking and singing, specifically in audio characteristics
and behavioral expressions, limit the effectiveness of existing models. We
observe that the differences between singing and talking audio manifest in
terms of frequency and amplitude. To address this, we have designed a
multi-scale spectral module to help the model learn singing patterns in the
spectral domain. Additionally, we develop a spectral-filtering module that aids
the model in learning the human behaviors associated with singing audio. These
two modules are integrated into the diffusion model to enhance singing video
generation performance, resulting in our proposed model, SINGER. Furthermore,
the lack of high-quality real-world singing face videos has hindered the
development of the singing video generation community. To address this gap, we
have collected an in-the-wild audio-visual singing dataset to facilitate
research in this area. Our experiments demonstrate that SINGER is capable of
generating vivid singing videos and outperforms state-of-the-art methods in
both objective and subjective evaluations. [LINK]http://arxiv.org/abs/2412.03430v1 [DATE]2024-12-05 00:19:47+08:00 [CATEGORIES]cs.LG
Assessing Foundation Models' Transferability to Physiological Signals in Precision Medicine [AUTHORS]Matthias Christenson, Cove Geary, Brian Locke, Pranav Koirala, Warren Woodrich Pettine [ABSTRACT]The success of precision medicine requires computational models that can
effectively process and interpret diverse physiological signals across
heterogeneous patient populations. While foundation models have demonstrated
remarkable transfer capabilities across various domains, their effectiveness in
handling individual-specific physiological signals - crucial for precision
medicine - remains largely unexplored. This work introduces a systematic
pipeline for rapidly and efficiently evaluating foundation models' transfer
capabilities in medical contexts. Our pipeline employs a three-stage approach.
First, it leverages physiological simulation software to generate diverse,
clinically relevant scenarios, particularly focusing on data-scarce medical
conditions. This simulation-based approach enables both targeted capability
assessment and subsequent model fine-tuning. Second, the pipeline projects
these simulated signals through the foundation model to obtain embeddings,
which are then evaluated using linear methods. This evaluation quantifies the
model's ability to capture three critical aspects: physiological feature
independence, temporal dynamics preservation, and medical scenario
differentiation. Finally, the pipeline validates these representations through
specific downstream medical tasks. Initial testing of our pipeline on the
Moirai time series foundation model revealed significant limitations in
physiological signal processing, including feature entanglement, temporal
dynamics distortion, and reduced scenario discrimination. These findings
suggest that current foundation models may require substantial architectural
modifications or targeted fine-tuning before deployment in clinical settings. [COMMENTS]Presented at the precision medicine workshop at the AI in Medicine conference (2024) in Salt Lake City [LINK]http://arxiv.org/abs/2412.03427v1 [DATE]2024-12-05 00:17:09+08:00 [CATEGORIES]cs.LG
2024 Dec 04, Wed
LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments [AUTHORS]Ruirui Chen, Weifeng Jiang, Chengwei Qin, Ishaan Singh Rawal, Cheston Tan, Dongkyu Choi, Bo Xiong, Bo Ai [ABSTRACT]The important challenge of keeping knowledge in Large Language Models (LLMs)
up-to-date has led to the development of various methods for incorporating new
facts. However, existing methods for such knowledge editing still face
difficulties with multi-hop questions that require accurate fact identification
and sequential logical reasoning, particularly among numerous fact updates. To
tackle these challenges, this paper introduces Graph Memory-based Editing for
Large Language Models (GMeLLo), a straightforward and effective method that
merges the explicit knowledge representation of Knowledge Graphs (KGs) with the
linguistic flexibility of LLMs. Beyond merely leveraging LLMs for question
answering, GMeLLo employs these models to convert free-form language into
structured queries and fact triples, facilitating seamless interaction with KGs
for rapid updates and precise multi-hop reasoning. Our results show that GMeLLo
significantly surpasses current state-of-the-art (SOTA) knowledge editing
methods in the multi-hop question answering benchmark, MQuAKE, especially in
scenarios with extensive knowledge edits. [LINK]http://arxiv.org/abs/2408.15903v2 [DATE]2024-12-04 23:01:47+08:00 [CATEGORIES]cs.CL
Self-Improvement in Language Models: The Sharpening Mechanism [AUTHORS]Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy [ABSTRACT]Recent work in language modeling has raised the possibility of
self-improvement, where a language model evaluates and refines its own
generations to achieve higher performance without external feedback. It is
impossible for this self-improvement to create information that is not already
in the model, so why should we expect that this will lead to improved
capabilities? We offer a new perspective on the capabilities of
self-improvement through a lens we refer to as sharpening. Motivated by the
observation that language models are often better at verifying response quality
than they are at generating correct responses, we formalize self-improvement as
using the model itself as a verifier during post-training in order to
"sharpen" the model to one placing large mass on high-quality sequences,
thereby amortizing the expensive inference-time computation of generating good
sequences. We begin by introducing a new statistical framework for sharpening
in which the learner aims to sharpen a pre-trained base policy via sample
access, and establish fundamental limits. Then we analyze two natural families
of self-improvement algorithms based on SFT and RLHF. We find that (i) the
SFT-based approach is minimax optimal whenever the initial model has sufficient
coverage, but (ii) the RLHF-based approach can improve over SFT-based
self-improvement by leveraging online exploration, bypassing the need for
coverage. Finally, we empirically validate the sharpening mechanism via
inference-time and amortization experiments. We view these findings as a
starting point toward a foundational understanding that can guide the design
and evaluation of self-improvement algorithms. [LINK]http://arxiv.org/abs/2412.01951v2 [DATE]2024-12-04 22:20:21+08:00 [CATEGORIES]cs.CL cs.LG
Yankari: A Monolingual Yoruba Dataset [AUTHORS]Maro Akpobi [ABSTRACT]This paper presents Yankari, a large-scale monolingual dataset for the Yoruba
language, aimed at addressing the critical gap in Natural Language Processing
(NLP) resources for this important West African language. Despite being spoken
by over 30 million people, Yoruba has been severely underrepresented in NLP
research and applications. We detail our methodology for creating this dataset,
which includes careful source selection, automated quality control, and
rigorous data cleaning processes. The Yankari dataset comprises 51,407
documents from 13 diverse sources, totaling over 30 million tokens. Our
approach focuses on ethical data collection practices, avoiding problematic
sources and addressing issues prevalent in existing datasets. We provide
thorough automated evaluations of the dataset, demonstrating its quality
compared to existing resources. The Yankari dataset represents a significant
advancement in Yoruba language resources, providing a foundation for developing
more accurate NLP models, supporting comparative linguistic studies, and
contributing to the digital accessibility of the Yoruba language. [COMMENTS]8 pages [LINK]http://arxiv.org/abs/2412.03334v1 [DATE]2024-12-04 22:05:18+08:00 [CATEGORIES]cs.CL
AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark [AUTHORS]Abhay Gupta, Philip Meng, Ece Yurtseven, Sean O'Brien, Kevin Zhu [ABSTRACT]Detecting biases in natural language understanding (NLU) for African American
Vernacular English (AAVE) is crucial to developing inclusive natural language
processing (NLP) systems. To address dialect-induced performance discrepancies,
we introduce AAVENUE (AAVE Natural Language Understanding Evaluation),
a benchmark for evaluating large language model (LLM) performance on NLU tasks
in AAVE and Standard American English (SAE). AAVENUE builds upon and extends
existing benchmarks like VALUE, replacing deterministic syntactic and
morphological transformations with a more flexible methodology leveraging
LLM-based translation with few-shot prompting, improving performance across our
evaluation metrics when translating key tasks from the GLUE and SuperGLUE
benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs
and a comprehensive set of metrics including fluency, BARTScore, quality,
coherence, and understandability. Additionally, we recruit fluent AAVE speakers
to validate our translations for authenticity. Our evaluations reveal that LLMs
consistently perform better on SAE tasks than AAVE-translated versions,
underscoring inherent biases and highlighting the need for more inclusive NLP
models. We have open-sourced our source code on GitHub and created a website to
showcase our work at https://aavenue.live. [COMMENTS]Published at NLP4PI @ EMNLP 2024 [LINK]http://arxiv.org/abs/2408.14845v2 [DATE]2024-12-04 21:43:28+08:00 [CATEGORIES]cs.CL
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation [AUTHORS]Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, Sara Hooker [ABSTRACT]Cultural biases in multilingual datasets pose significant challenges for
their effectiveness as global benchmarks. These biases stem not only from
language but also from the cultural knowledge required to interpret questions,
reducing the practical utility of translated datasets like MMLU. Furthermore,
translation often introduces artifacts that can distort the meaning or clarity
of questions in the target language. A common practice in multilingual
evaluation is to rely on machine-translated evaluation sets, but simply
translating a dataset is insufficient to address these challenges. In this
work, we trace the impact of both of these issues on multilingual evaluations
and ensuing model performances. Our large-scale evaluation of state-of-the-art
open and proprietary models illustrates that progress on MMLU depends heavily
on learning Western-centric concepts, with 28% of all questions requiring
culturally sensitive knowledge. Moreover, for questions requiring geographic
knowledge, an astounding 84.9% focus on either North American or European
regions. Rankings of model evaluations change depending on whether they are
evaluated on the full portion or the subset of questions annotated as
culturally sensitive, showing the distortion to model rankings when blindly
relying on translated MMLU. We release Global-MMLU, an improved MMLU with
evaluation coverage across 42 languages -- with improved overall quality by
engaging with compensated professional and community annotators to verify
translation quality while also rigorously evaluating cultural biases present in
the original dataset. This comprehensive Global-MMLU set also includes
designated subsets labeled as culturally sensitive and culturally agnostic to
allow for more holistic, complete evaluation. [LINK]http://arxiv.org/abs/2412.03304v1 [DATE]2024-12-04 21:27:09+08:00 [CATEGORIES]cs.CL
Alignment at Pre-training! Towards Native Alignment for Arabic LLMs [AUTHORS]Juhao Liang, Zhenyang Cai, Jianqing Zhu, Huang Huang, Kewei Zong, Bang An, Mosen Alharthi, Juncai He, Lian Zhang, Haizhou Li, Benyou Wang, Jinchao Xu [ABSTRACT]The alignment of large language models (LLMs) is critical for developing
effective and safe language models. Traditional approaches focus on aligning
models during the instruction tuning or reinforcement learning stages, referred
to in this paper as 'post alignment'. We argue that alignment during the
pre-training phase, which we term 'native alignment', warrants investigation.
Native alignment aims to prevent unaligned content from the beginning, rather
than relying on post-hoc processing. This approach leverages extensively
aligned pre-training data to enhance the effectiveness and usability of
pre-trained models. Our study specifically explores the application of native
alignment in the context of Arabic LLMs. We conduct comprehensive experiments
and ablation studies to evaluate the impact of native alignment on model
performance and alignment stability. Additionally, we release open-source
Arabic LLMs that demonstrate state-of-the-art performance on various
benchmarks, providing significant benefits to the Arabic LLM community. [COMMENTS]Accepted to NeurIPS 2024 main conference. See
https://github.com/FreedomIntelligence/AceGPT-v2 [LINK]http://arxiv.org/abs/2412.03253v1 [DATE]2024-12-04 19:52:03+08:00 [CATEGORIES]cs.CL
Xmodel-1.5: An 1B-scale Multilingual LLM [AUTHORS]Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling [ABSTRACT]We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language
model pretrained on 2 trillion tokens, designed for balanced performance and
scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5
employs a custom unigram tokenizer with 65,280 tokens, optimizing both
efficiency and accuracy. The model delivers competitive results across multiple
languages, including Thai, Arabic, French, Chinese, and English, outperforming
Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in
benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai.
To support low-resource language research, we release Xdata_Thai, a
Thai-specific evaluation dataset featuring unique linguistic challenges such as
gendered particles and idioms. While the model demonstrates strong performance,
there is still room for improvement in handling culturally specific nuances. We
hope this work contributes to advancements in multilingual AI research. Models
and code are publicly available on GitHub at
https://github.com/XiaoduoAILab/XmodelLM-1.5 [LINK]http://arxiv.org/abs/2411.10083v3 [DATE]2024-12-04 19:49:04+08:00 [CATEGORIES]cs.CL
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning [AUTHORS]Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang [ABSTRACT]Large language models (LLMs) have enabled the creation of multi-modal LLMs
that exhibit strong comprehension of visual data such as images and videos.
However, these models usually rely on extensive visual tokens from visual
encoders, leading to high computational demands, which limits their
applicability in resource-constrained environments and for long-context tasks.
In this work, we propose a training-free adaptive inference method for
multi-modal LLMs that can accommodate a broad range of efficiency requirements
with a minimum performance drop. Our method consists of a) iterative token
merging based on embedding similarity before LLMs, and b) progressive token
pruning within LLM layers based on multi-modal importance. With a minimalist
design, our method can be applied to both video and image LLMs. Extensive
experiments on diverse video and image benchmarks demonstrate that our method
substantially reduces computation load (e.g., a 7-fold reduction in
FLOPs) while preserving the performance of video and image LLMs. Further, under
a similar computational cost, our method outperforms the state-of-the-art
methods in long video understanding (e.g., +4.6 on MLVU).
Additionally, our in-depth analysis provides insights into token redundancy and
LLM layer behaviors, offering guidance for future research in designing
efficient multi-modal LLMs. Our code will be available at
https://github.com/LaVi-Lab/AIM. [COMMENTS]12 pages, 2 figures [LINK]http://arxiv.org/abs/2412.03248v1 [DATE]2024-12-04 19:47:57+08:00 [CATEGORIES]cs.CL
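The two steps named in the AIM abstract above (merging tokens by embedding similarity, then pruning by importance) can be illustrated with a minimal sketch. This is not the authors' implementation. Merging only adjacent tokens, averaging the merged pair, and the function names `merge_tokens` and `prune_tokens` are all assumptions made for illustration, and the importance scores are taken as given rather than computed from a model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_tokens(tokens, target_len):
    """Repeatedly merge the most similar adjacent pair until target_len remain.

    Assumption: merging is restricted to neighbours and uses a plain average.
    """
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target_len, 1):
        # Index of the adjacent pair with the highest cosine similarity.
        best = max(range(len(tokens) - 1),
                   key=lambda i: cosine(tokens[i], tokens[i + 1]))
        merged = [(a + b) / 2 for a, b in zip(tokens[best], tokens[best + 1])]
        tokens = tokens[:best] + [merged] + tokens[best + 2:]
    return tokens

def prune_tokens(tokens, importance, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    top = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

if __name__ == "__main__":
    embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
    print(merge_tokens(embeddings, 2))
    print(prune_tokens(["a", "b", "c"], [0.1, 0.9, 0.5], 2))
```

In this toy setting, merging shortens the sequence before it reaches the model, and pruning would be applied again at later layers with a shrinking `keep` budget.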
ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models [AUTHORS]Yahan Tu, Rui Hu, Jitao Sang [ABSTRACT]Hallucination poses a persistent challenge for multimodal large language
models (MLLMs). However, existing benchmarks for evaluating hallucinations are
generally static, which may overlook the potential risk of data contamination.
To address this issue, we propose ODE, an open-set, dynamic protocol designed
to evaluate object hallucinations in MLLMs at both the existence and attribute
levels. ODE employs a graph-based structure to represent real-world object
concepts, their attributes, and the distributional associations between them.
This structure facilitates the extraction of concept combinations based on
diverse distributional criteria, generating varied samples for structured
queries that evaluate hallucinations in both generative and discriminative
tasks. Through the generation of new samples, dynamic concept combinations, and
varied distribution frequencies, ODE mitigates the risk of data contamination
and broadens the scope of evaluation. This protocol is applicable to both
general and specialized scenarios, including those with limited data.
Experimental results demonstrate the effectiveness of our protocol, revealing
that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated
samples, which indicates potential data contamination. Furthermore, these
generated samples aid in analyzing hallucination patterns and fine-tuning
models, offering an effective approach to mitigating hallucinations in MLLMs. [LINK]http://arxiv.org/abs/2409.09318v3 [DATE]2024-12-04 19:44:57+08:00 [CATEGORIES]cs.CL
Benchmarking terminology building capabilities of ChatGPT on an English-Russian Fashion Corpus [AUTHORS]Anastasiia Bezobrazova, Miriam Seghiri, Constantin Orasan [ABSTRACT]This paper compares the accuracy of the terms extracted using SketchEngine,
TBXTools and ChatGPT. In addition, it evaluates the quality of the definitions
produced by ChatGPT for these terms. The research is carried out on a
comparable corpus of fashion magazines written in English and Russian collected
from the web. A gold standard for the fashion terminology was also developed by
identifying web pages that can be harvested automatically and contain
definitions of terms from the fashion domain in English and Russian. This gold
standard was used to evaluate the quality of the extracted terms and of the
definitions produced. Our evaluation shows that TBXTools and SketchEngine,
while capable of high recall, suffer from reduced precision as the number of
terms increases, which affects their overall performance. Conversely, ChatGPT
demonstrates superior performance, maintaining or improving precision as more
terms are considered. Analysis of the definitions produced by ChatGPT for 60
commonly used terms in English and Russian shows that ChatGPT maintains a
reasonable level of accuracy and fidelity across languages, but sometimes the
definitions in both languages miss crucial specifics and include unnecessary
deviations. Our research reveals that no single tool excels universally; each
has strengths suited to particular aspects of terminology extraction and
application. [COMMENTS]To appear in the Proceedings of Translating and the Computer 2024
(TC46) [LINK]http://arxiv.org/abs/2412.03242v1 [DATE]2024-12-04 19:43:08+08:00 [CATEGORIES]cs.CL
Linq-Embed-Mistral Technical Report [AUTHORS]Chanyeol Choi, Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, Jy-yong Sohn [ABSTRACT]This report explores the enhancement of text retrieval performance using
advanced data refinement techniques. We develop
Linq-Embed-Mistral (https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)
by building on the E5-mistral and Mistral-7B-v0.1 models, focusing on
sophisticated data crafting, data filtering, and negative mining methods, which
are highly tailored to each task and applied to both existing benchmark datasets
and highly tailored synthetic datasets generated via large language models
(LLMs). Linq-Embed-Mistral excels in the MTEB benchmarks (as of May 29, 2024),
achieving an average score of 68.2 across 56 datasets, and ranks 1st among all
models for retrieval tasks on the MTEB leaderboard with a performance score of
60.2. This performance underscores its superior capability in enhancing search
precision and reliability. Our contributions include advanced data refinement
methods that significantly improve model performance on benchmark and synthetic
datasets, techniques for homogeneous task ordering and mixed task fine-tuning
to enhance model generalization and stability, and a streamlined evaluation
process using 4-bit precision and a light retrieval evaluation set, which
accelerates validation without sacrificing accuracy. [COMMENTS]15 pages [LINK]http://arxiv.org/abs/2412.03223v1 [DATE]2024-12-04 19:18:32+08:00 [CATEGORIES]cs.CL
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs [AUTHORS]Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga [ABSTRACT]The current evaluation of mathematical skills in LLMs is limited, as existing
benchmarks are either relatively small, primarily focus on elementary and
high-school problems, or lack diversity in topics. Additionally, the inclusion
of visual elements in tasks remains largely under-explored.
To address these gaps, we introduce U-MATH, a novel benchmark of 1,100
unpublished open-ended university-level problems sourced from teaching
materials. It is balanced across six core subjects, and 20% of its problems are
multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to
judge the correctness of generated solutions. To this end, we release
$\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions.
The evaluation of general domain, math-specific, and multimodal LLMs
highlights the challenges presented by U-MATH. Our findings reveal that LLMs
achieve a maximum accuracy of only 63% on text-based tasks and an even lower 45%
on visual problems. Solution assessment also proves challenging for LLMs, with
the best LLM judge having an F1-score of 80% on $\mu$-MATH. [LINK]http://arxiv.org/abs/2412.03205v1 [DATE]2024-12-04 18:44:50+08:00 [CATEGORIES]cs.CL
Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models [AUTHORS]Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana [ABSTRACT]While many have shown how Large Language Models (LLMs) can be applied to a
diverse set of tasks, the critical issues of data contamination and
memorization are often glossed over. In this work, we address this concern for
tabular data. Specifically, we introduce a variety of different techniques to
assess whether a language model has seen a tabular dataset during training.
This investigation reveals that LLMs have memorized many popular tabular
datasets verbatim. We then compare the few-shot learning performance of LLMs on
datasets that were seen during training to the performance on datasets released
after training. We find that LLMs perform better on datasets seen during
training, indicating that memorization leads to overfitting. At the same time,
LLMs show non-trivial performance on novel datasets and are surprisingly robust
to data transformations. We then investigate the in-context statistical
learning abilities of LLMs. While LLMs are significantly better than random at
solving statistical classification problems, the sample efficiency of few-shot
learning lags behind traditional statistical learning algorithms, especially as
the dimension of the problem increases. This suggests that much of the observed
few-shot performance on novel real-world datasets is due to the LLM's world
knowledge. Overall, our results highlight the importance of testing whether an
LLM has seen an evaluation dataset during pre-training. We release the
https://github.com/interpretml/LLM-Tabular-Memorization-Checker Python package
to test LLMs for memorization of tabular datasets. [COMMENTS]COLM camera ready, fix typo [LINK]http://arxiv.org/abs/2404.06209v3 [DATE]2024-12-04 18:33:18+08:00 [CATEGORIES]cs.LG cs.CL
Weighted-Reward Preference Optimization for Implicit Model Fusion [AUTHORS]Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan [ABSTRACT]While fusing heterogeneous open-source LLMs with varying architectures and
sizes can potentially integrate the strengths of different models, existing
fusion methods face significant challenges, such as vocabulary alignment and
merging distribution matrices. These procedures are not only complex but also
prone to introducing noise and errors. In this paper, we propose an implicit
fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages
preference optimization between the source LLMs and the target LLM to transfer
their capabilities effectively. WRPO eliminates the need for vocabulary
alignment and matrix fusion and can be efficiently scaled to accommodate
various LLMs. To address distributional deviations between the source and
target LLMs, WRPO introduces a progressive adaptation strategy that gradually
shifts reliance on preferred examples from the target LLM to the source LLMs.
Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks
demonstrate that WRPO consistently outperforms existing knowledge fusion
methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct
as the target model, WRPO achieves a length-controlled win rate of 55.9%
against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against
GPT-4-0314 on Arena-Hard. Our code is available at
https://github.com/SLIT-AI/WRPO. [COMMENTS]Work in progress [LINK]http://arxiv.org/abs/2412.03187v1 [DATE]2024-12-04 18:15:12+08:00 [CATEGORIES]cs.CL
Multi-Level Correlation Network For Few-Shot Image Classification [AUTHORS]Yunkai Dang, Min Zhang, Zhengyu Chen, Xinliang Zhang, Zheng Wang, Meijun Sun, Donglin Wang [ABSTRACT]Few-shot image classification (FSIC) aims to recognize novel classes given few
labeled images from base classes. Recent works have achieved promising
classification performance, especially for metric-learning methods, where a
measure at only the image feature level is usually used. In this paper, we argue
that a measure at such a level may not be effective enough to generalize from
base to novel classes when using only a few images. Instead, a multi-level
descriptor of an image is taken into consideration in this paper. We propose a
multi-level correlation network (MLCN) for FSIC to tackle this problem by
effectively capturing local information. Concretely, we present the
self-correlation module and cross-correlation module to learn the semantic
correspondence relation of local information based on learned representations.
Moreover, we propose a pattern-correlation module to capture the pattern of
fine-grained images and find relevant structural patterns between base classes
and novel classes. Extensive experiments and analysis show the effectiveness of
our proposed method on four widely-used FSIC benchmarks. The code for our
approach is available at: https://github.com/Yunkai696/MLCN. [LINK]http://arxiv.org/abs/2412.03159v1 [DATE]2024-12-04 17:36:24+08:00 [CATEGORIES]cs.CL
Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment [AUTHORS]Yanshi Li, Shaopan Xiong, Gengru Chen, Xiaoyang Li, Yijia Luo, Xingyao Zhang, Yanhui Huang, Xingyuan Bu, Yingshui Tan, Chun Yuan, Jiamang Wang, Wenbo Su, Bo Zheng [ABSTRACT]Reinforcement Learning from Human Feedback (RLHF) has proven highly effective
in aligning Large Language Models (LLMs) with human preferences. However, the
original RLHF typically optimizes under an overall reward, which can lead to a
suboptimal learning process. This limitation stems from RLHF's lack of
awareness regarding which specific tokens should be reinforced or suppressed.
Moreover, conflicts in supervision can arise, for instance, when a chosen
response includes erroneous tokens, while a rejected response contains accurate
elements. To rectify these shortcomings, increasing dense reward methods, such
as step-wise and token-wise RLHF, have been proposed. However, these existing
methods are limited to specific tasks (like mathematics). In this paper, we
propose the "Adaptive Message-wise RLHF" method, which robustly applies to
various tasks. By defining pivot tokens as key indicators, our approach
adaptively identifies essential information and converts sequence-level
supervision into fine-grained, subsequence-level supervision. This aligns the
density of rewards and action spaces more closely with the information density
of the input. Experiments demonstrate that our method can be integrated into
various training methods, significantly mitigating hallucinations and
catastrophic forgetting problems, while outperforming other methods on multiple
evaluation metrics. Our method improves the success rate on adversarial samples
by 10% compared to the sample-wise approach, and achieves a 1.3% improvement
on evaluation benchmarks such as MMLU, GSM8K, and HumanEval. [LINK]http://arxiv.org/abs/2411.00809v2 [DATE]2024-12-04 17:26:47+08:00 [CATEGORIES]cs.LG cs.CL
A Measure of the System Dependence of Automated Metrics [AUTHORS]Pius von Däniken, Jan Deriu, Mark Cieliebak [ABSTRACT]Automated metrics for Machine Translation have made significant progress,
with the goal of replacing expensive and time-consuming human evaluations.
These metrics are typically assessed by their correlation with human judgments,
which captures the monotonic relationship between human and metric scores.
However, we argue that it is equally important to ensure that metrics treat all
systems fairly and consistently. In this paper, we introduce a method to
evaluate this aspect. [LINK]http://arxiv.org/abs/2412.03152v1 [DATE]2024-12-04 17:21:46+08:00 [CATEGORIES]cs.CL
ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction [AUTHORS]Victor Junqiu Wei, Weicheng Wang, Di Jiang, Yuanfeng Song, Lu Wang [ABSTRACT]Automatic Speech Recognition (ASR) is a fundamental and important task in the
field of speech and natural language processing. It is an inherent building
block in many applications such as voice assistants and speech translation.
Despite the advancement of ASR technologies in recent years, it is still
inevitable for modern ASR systems to produce a substantial number of erroneous
recognitions due to environmental noise, ambiguity, etc. Therefore, error
correction in ASR is crucial.
Motivated by this, this paper studies ASR error correction in the Chinese
language, which is one of the most popular languages and enjoys a large number
of users in the world. We first create a benchmark dataset named ASR-EC
that contains a wide spectrum of ASR errors generated by industry-grade ASR
systems. To the best of our knowledge, it is the first Chinese ASR error
correction benchmark. Then, inspired by the recent advances in large
language models (LLMs), we investigate how to harness the power of LLMs to
correct ASR errors. We apply LLMs to ASR error correction in three paradigms.
The first paradigm is prompting, which is further categorized as zero-shot,
few-shot, and multi-step. The second paradigm is finetuning, which finetunes
LLMs with ASR error correction data. The third paradigm is multi-modal
augmentation, which collectively utilizes the audio and ASR transcripts for
error correction. Extensive experiments reveal that prompting is not effective
for ASR error correction. Finetuning is effective only for a portion of LLMs.
Multi-modal augmentation is the most effective method for error correction and
achieves state-of-the-art performance. [LINK]http://arxiv.org/abs/2412.03075v1 [DATE]2024-12-04 14:52:10+08:00 [CATEGORIES]cs.CL
An Effective Framework to Help Large Language Models Handle Numeric-involved Long-context Tasks [AUTHORS]Yijiong Yu [ABSTRACT]Large Language Models (LLMs) have demonstrated remarkable capabilities in
handling long texts and have almost perfect performance in traditional
retrieval tasks. However, their performance significantly degrades when it
comes to numerical calculations over long contexts. Numeric-involved
long-context tasks typically cannot be addressed by current LLMs in normal
settings due to their inherent limitations in simultaneously handling complex
and massive information. Some CoT-like prompting methods can improve accuracy
but demand a massive number of output tokens, which is costly and slow. To address this
issue, we propose a workflow that decomposes a numeric-involved long-context
task into 4 low-level subtasks: judging, extracting, processing with code,
and conclusion. The first 2 subtasks are relatively simple, which allows us to
use smaller models to process the long context efficiently. When numerical
calculations are required, we use code generated by LLMs to avoid the
weakness of LLMs at calculation. The results on 2
numeric-involved long-context benchmarks demonstrate that our workflow can not only
improve accuracy but also significantly reduce the cost of API calls. [LINK]http://arxiv.org/abs/2411.10145v2 [DATE]2024-12-04 13:54:43+08:00 [CATEGORIES]cs.CL
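The four subtasks listed in the abstract above (judging, extracting, processing with code, conclusion) can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the paper's system. The function names are hypothetical, and the keyword check, the regular expression, and the fixed `sum` stand in for the small-model calls and the LLM-generated code that the abstract describes.

```python
import re

def judge(chunk, keyword):
    """Stand-in for a small model deciding whether a chunk is relevant."""
    return keyword.lower() in chunk.lower()

def extract(chunk):
    """Stand-in for a small model pulling numeric values out of a relevant chunk."""
    return [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", chunk)]

def process(values):
    """Stand-in for LLM-generated code: the arithmetic runs in Python, not in the model."""
    return sum(values)

def conclude(keyword, total):
    """Stand-in for the final step that phrases the answer."""
    return f"Total {keyword}: {total:g}"

def answer(chunks, keyword):
    """Run the four subtasks over a long context split into chunks."""
    values = []
    for chunk in chunks:
        if judge(chunk, keyword):           # subtask 1: judging
            values.extend(extract(chunk))   # subtask 2: extracting
    total = process(values)                 # subtask 3: processing with code
    return conclude(keyword, total)         # subtask 4: conclusion

if __name__ == "__main__":
    context = [
        "Alice paid a rent of 1200 in March.",
        "The weather was mild.",
        "Rent for April was 1250.5.",
    ]
    print(answer(context, "rent"))
```

Because each chunk is judged and extracted independently, those two steps can run on a cheaper model, and only the short list of extracted values reaches the calculation step.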
CBEval: A framework for evaluating and interpreting cognitive biases in LLMs [AUTHORS]Ammar Shaikh, Raj Abhijit Dandekar, Sreedath Panat, Rajat Dandekar [ABSTRACT]Rapid advancements in Large Language Models (LLMs) have significantly enhanced
their reasoning capabilities. Despite improved performance on benchmarks, LLMs
exhibit notable gaps in their cognitive processes. Additionally, as reflections
of human-generated data, these models have the potential to inherit cognitive
biases, raising concerns about their reasoning and decision-making
capabilities. In this paper we present a framework to interpret, understand and
provide insights into a host of cognitive biases in LLMs. Conducting our
research on frontier language models we're able to elucidate reasoning
limitations and biases, and provide reasoning behind these biases by
constructing influence graphs that identify phrases and words most responsible
for biases manifested in LLMs. We further investigate biases such as
round-number bias and the cognitive bias barrier revealed when noting the framing effect in
language models. [LINK]http://arxiv.org/abs/2412.03605v1 [DATE]2024-12-04 13:53:28+08:00 [CATEGORIES]cs.CL
Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [AUTHORS]Lu Chen, Ruqing Zhang, Jiafeng Guo, Yixing Fan, Xueqi Cheng [ABSTRACT]Retrieval-augmented generation (RAG) has emerged as a popular solution to
mitigate the hallucination issues of large language models. However, existing
studies on RAG seldom address the issue of predictive uncertainty, i.e., how
likely it is that a RAG model's prediction is incorrect, resulting in
uncontrollable risks in real-world applications. In this work, we emphasize the
importance of risk control, ensuring that RAG models proactively refuse to
answer questions with low confidence. Our research identifies two critical
latent factors affecting RAG's confidence in its predictions: the quality of
the retrieved results and the manner in which these results are utilized. To
guide RAG models in assessing their own confidence based on these two latent
factors, we develop a counterfactual prompting framework that induces the
models to alter these factors and analyzes the effect on their answers. We also
introduce a benchmarking procedure to collect answers with the option to
abstain, facilitating a series of experiments. For evaluation, we introduce
several risk-related metrics and the experimental results demonstrate the
effectiveness of our approach. Our code and benchmark dataset are available at
https://github.com/ict-bigdatalab/RC-RAG. [LINK]http://arxiv.org/abs/2409.16146v2 [DATE]2024-12-04 11:21:44+08:00 [CATEGORIES]cs.CL
Advancing Conversational Psychotherapy: Integrating Privacy, Dual-Memory, and Domain Expertise with Large Language Models [AUTHORS]XiuYu Zhang, Zening Luo [ABSTRACT]Mental health has increasingly become a global issue that reveals the
limitations of traditional conversational psychotherapy, constrained by
location, time, expense, and privacy concerns. In response to these challenges,
we introduce SoulSpeak, a Large Language Model (LLM)-enabled chatbot designed
to democratize access to psychotherapy. SoulSpeak improves upon the
capabilities of standard LLM-enabled chatbots by incorporating a novel
dual-memory component that combines short-term and long-term context via
Retrieval Augmented Generation (RAG) to offer personalized responses while
ensuring the preservation of user privacy and intimacy through a dedicated
privacy module. In addition, it leverages a counseling chat dataset of
therapist-client interactions and various prompting techniques to align the
generated responses with psychotherapeutic methods. We introduce two fine-tuned
BERT models to evaluate the system against existing LLMs and human therapists:
the Conversational Psychotherapy Preference Model (CPPM) to simulate human
preference among responses and another to assess response relevance to user
input. CPPM is useful for training and evaluating psychotherapy-focused
language models independent from SoulSpeak, helping with the constrained
resources available for psychotherapy. Furthermore, the effectiveness of the
dual-memory component and the robustness of the privacy module are also
examined. Our findings highlight the potential and challenge of enhancing
mental health care by offering an alternative that combines the expertise of
traditional therapy with the advantages of LLMs, providing a promising way to
address the accessibility and personalization gap in current mental health
services. [COMMENTS]Accepted as a Poster at Statistical Foundations of LLMs and
Foundation Models (NeurIPS 2024 Workshop) [LINK]http://arxiv.org/abs/2412.02987v1 [DATE]2024-12-04 11:02:46+08:00 [CATEGORIES]cs.CL
The use of large language models to enhance cancer clinical trial educational materials [AUTHORS]Mingye Gao, Aman Varshney, Shan Chen, Vikram Goddla, Jack Gallifant, Patrick Doyle, Claire Novack, Maeve Dillon-Martin, Teresia Perkins, Xinrong Correia, Erik Duhaime, Howard Isenstein, Elad Sharon, Lisa Soleymani Lehmann, David Kozono, Brian Anthony, Dmitriy Dligach, Danielle S. Bitterman [ABSTRACT]Cancer clinical trials often face challenges in recruitment and engagement
due to a lack of participant-facing informational and educational resources.
This study investigated the potential of Large Language Models (LLMs),
specifically GPT4, in generating patient-friendly educational content from
clinical trial informed consent forms. Using data from ClinicalTrials.gov, we
employed zero-shot learning for creating trial summaries and one-shot learning
for developing multiple-choice questions, evaluating their effectiveness
through patient surveys and crowdsourced annotation. Results showed that
GPT4-generated summaries were both readable and comprehensive, and may improve
patients' understanding and interest in clinical trials. The multiple-choice
questions demonstrated high accuracy and agreement with crowdsourced
annotators. For both resource types, hallucinations were identified that
require ongoing human oversight. The findings demonstrate the potential of LLMs
"out-of-the-box" to support the generation of clinical trial education
materials with minimal trial-specific engineering, but implementation with a
human-in-the-loop is still needed to avoid misinformation risks. [LINK]http://arxiv.org/abs/2412.01955v2 [DATE]2024-12-04 10:25:04+08:00 [CATEGORIES]cs.CL
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models [AUTHORS]Haotian Xia, Zhengbang Yang, Junbo Zou, Rhys Tracy, Yuqing Wang, Chi Lu, Christopher Lai, Yanjun He, Xun Shao, Zhuoqing Xie, Yuan-fang Wang, Weining Shen, Hanjie Chen [ABSTRACT]Multimodal Large Language Models (MLLMs) are advancing the ability to reason
about complex sports scenarios by integrating textual and visual information.
To comprehensively evaluate their capabilities, we introduce SPORTU, a
benchmark designed to assess MLLMs across multi-level sports reasoning tasks.
SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice
questions with human-annotated explanations for rule comprehension and strategy
understanding. This component focuses on testing models' ability to reason
about sports solely through question-answering (QA), without requiring visual
inputs; SPORTU-video, consisting of 1,701 slow-motion video clips across 7
different sports and 12,048 QA pairs, designed to assess multi-level reasoning,
from simple sports recognition to complex tasks like foul detection and rule
application. We evaluate four prevalent LLMs on the SPORTU-text part, mainly
using few-shot learning supplemented by chain-of-thought (CoT) prompting.
GPT-4o achieves the highest accuracy of 71%, but
still falls short of human-level performance, highlighting room for improvement
in rule comprehension and reasoning. The evaluation for the SPORTU-video part
includes 7 proprietary and 6 open-source MLLMs. Experiments show that models
fall short on hard tasks that require deep reasoning and rule-based
understanding. Claude-3.5-Sonnet performs the best with only 52.6% accuracy on
the hard task, showing large room for improvement. We hope that SPORTU will
serve as a critical step toward evaluating models' capabilities in sports
understanding and reasoning. [LINK]http://arxiv.org/abs/2410.08474v3 [DATE]2024-12-04 08:43:57+08:00 [CATEGORIES]cs.CL
Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data [AUTHORS]Junhao Liu, Siwei Xu, Lei Zhang, Jing Zhang [ABSTRACT]Over the past decade, the revolution in single-cell sequencing has enabled
the simultaneous molecular profiling of various modalities across thousands of
individual cells, allowing scientists to investigate the diverse functions of
complex tissues and uncover underlying disease mechanisms. Among all the
analytical steps, assigning individual cells to specific types is fundamental
for understanding cellular heterogeneity. However, this process is usually
labor-intensive and requires extensive expert knowledge. Recent advances in
large language models (LLMs) have demonstrated their ability to efficiently
process and synthesize vast corpora of text to automatically extract essential
biological knowledge, such as marker genes, potentially promoting more
efficient and automated cell type annotations. To thoroughly evaluate the
capability of modern instruction-tuned LLMs in automating the cell type
identification process, we introduce SOAR, a comprehensive benchmarking study
of LLMs for cell type annotation tasks in single-cell genomics. Specifically,
we assess the performance of 8 instruction-tuned LLMs across 11 datasets,
spanning multiple cell types and species. Our study explores the potential of
LLMs to accurately classify and annotate cell types in single-cell RNA
sequencing (scRNA-seq) data, while extending their application to multiomics
data through cross-modality translation. Additionally, we evaluate the
effectiveness of chain-of-thought (CoT) prompting techniques in generating
detailed biological insights during the annotation process. The results
demonstrate that LLMs can provide robust interpretations of single-cell data
without requiring additional fine-tuning, advancing the automation of cell type
annotation in genomics research. [LINK]http://arxiv.org/abs/2412.02915v1 [DATE]2024-12-04 07:58:35+08:00 [CATEGORIES]cs.CL
Does Few-Shot Learning Help LLM Performance in Code Synthesis? [AUTHORS]Derek Xu, Tong Xie, Botao Xia, Haoyu Li, Yunsheng Bai, Yizhou Sun, Wei Wang [ABSTRACT]Large language models (LLMs) have made significant strides at code generation
through improved model design, training, and chain-of-thought. However,
prompt-level optimizations remain an important yet under-explored aspect of
LLMs for coding. This work focuses on the few-shot examples present in most
code generation prompts, offering a systematic study on whether few-shot
examples improve LLM's coding capabilities, which few-shot examples have the
largest impact, and how to select impactful examples. Our work offers 2
approaches for selecting few-shot examples, a model-free method,
CODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The 2 methods
offer a trade-off between improved performance and reliance on training data
and interpretability. Both methods significantly improve CodeLlama's coding
ability across the popular HumanEval+ coding benchmark. In summary, our work
provides valuable insights into how to pick few-shot examples in code
generation prompts to improve LLM code generation capabilities. [LINK]http://arxiv.org/abs/2412.02906v1 [DATE]2024-12-04 07:19:40+08:00 [CATEGORIES]cs.CLcs.LG
Enhancing Trust in Large Language Models with Uncertainty-Aware Fine-Tuning [AUTHORS]Ranganath Krishnan, Piyush Khanna, Omesh Tickoo [ABSTRACT]Large language models (LLMs) have revolutionized the field of natural
language processing with their impressive reasoning and question-answering
capabilities. However, these models are sometimes prone to generating
credible-sounding but incorrect information, a phenomenon known as LLM
hallucinations. Reliable uncertainty estimation in LLMs is essential for
fostering trust in their generated responses and serves as a critical tool for
the detection and prevention of erroneous or hallucinated outputs. To achieve
reliable and well-calibrated uncertainty quantification in open-ended and
free-form natural language generation, we propose an uncertainty-aware
fine-tuning approach for LLMs. This approach enhances the model's ability to
provide reliable uncertainty estimates without compromising accuracy, thereby
guiding them to produce more trustworthy responses. We introduce a novel
uncertainty-aware causal language modeling loss function, grounded in the
principles of decision theory. Through rigorous evaluation on multiple
free-form question-answering datasets and models, we demonstrate that our
uncertainty-aware fine-tuning approach yields better calibrated uncertainty
estimates in natural language generation tasks than fine-tuning with the
standard causal language modeling loss. Furthermore, the experimental results
show that the proposed method significantly improves the model's ability to
detect hallucinations and identify out-of-domain prompts. [LINK]http://arxiv.org/abs/2412.02904v1 [DATE]2024-12-04 07:14:47+08:00 [CATEGORIES]cs.CLcs.LG
TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? [AUTHORS]Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, Saurabh Sinha [ABSTRACT]Test-driven development (TDD) is the practice of writing tests first and
coding later, and the proponents of TDD expound its numerous benefits. For
instance, given an issue on a source code repository, tests can clarify the
desired behavior among stakeholders before anyone writes code for the
agreed-upon fix. Although there has been a lot of work on automated test
generation for the practice "write code first, test later", there has been
little such automation for TDD. Ideally, tests for TDD should be fail-to-pass
(i.e., fail before the issue is resolved and pass after) and have good adequacy
with respect to covering the code changed during issue resolution. This paper
introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues
mined from real-world GitHub code repositories. The benchmark's evaluation
harness runs only relevant tests in isolation for simple yet accurate coverage
measurements, and the benchmark's dataset is filtered both by human judges and
by execution in the harness. This paper also presents Auto-TDD, an LLM-based
solution that takes as input an issue description and a codebase (prior to
issue resolution) and returns as output a test that can be used to validate the
changes made for resolving the issue. Our evaluation shows that Auto-TDD yields
a better fail-to-pass rate than the strongest prior work while also yielding
high coverage adequacy. Overall, we hope that this work helps make developers
more productive at resolving issues while simultaneously leading to more robust
fixes. [LINK]http://arxiv.org/abs/2412.02883v1 [DATE]2024-12-04 06:38:05+08:00 [CATEGORIES]cs.CLcs.LG
Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation [AUTHORS]To Eun Kim, Fernando Diaz [ABSTRACT]Many language models now enhance their responses with retrieval capabilities,
leading to the widespread adoption of retrieval-augmented generation (RAG)
systems. However, despite retrieval being a core component of RAG, much of the
research in this area overlooks the extensive body of work on fair ranking,
neglecting the importance of considering all stakeholders involved. This paper
presents the first systematic evaluation of RAG systems integrated with fair
rankings. We focus specifically on measuring the fair exposure of each relevant
item across the rankings utilized by RAG systems (i.e., item-side fairness),
aiming to promote equitable growth for relevant item providers. To gain a deep
understanding of the relationship between item-fairness, ranking quality, and
generation quality in the context of RAG, we analyze nine different RAG systems
that incorporate fair rankings across seven distinct datasets. Our findings
indicate that RAG systems with fair rankings can maintain a high level of
generation quality and, in many cases, even outperform traditional RAG systems,
despite the general trend of a tradeoff between ensuring fairness and
maintaining system-effectiveness. We believe our insights lay the groundwork
for responsible and equitable RAG systems and open new avenues for future
research. We publicly release our codebase and dataset at
https://github.com/kimdanny/Fair-RAG. [COMMENTS]Top 5 Spotlight at AFME Workshop at NeurIPS 2024 [LINK]http://arxiv.org/abs/2409.11598v2 [DATE]2024-12-04 06:23:53+08:00 [CATEGORIES]cs.CL
Investigating the Contextualised Word Embedding Dimensions Specified for Contextual and Temporal Semantic Changes [AUTHORS]Taichi Aida, Danushka Bollegala [ABSTRACT]The sense-aware contextualised word embeddings (SCWEs) encode semantic
changes of words within the contextualised word embedding (CWE) spaces. Despite
the superior performance of SCWEs in contextual/temporal semantic change
detection (SCD) benchmarks, it remains unclear as to how the meaning changes
are encoded in the embedding space. To study this, we compare pre-trained CWEs
and their fine-tuned versions on contextual and temporal semantic change
benchmarks under Principal Component Analysis (PCA) and Independent Component
Analysis (ICA) transformations. Our experimental results reveal that (a) although
there exists a smaller number of axes that are specific to semantic changes of
words in the pre-trained CWE space, this information gets distributed across
all dimensions when fine-tuned, and (b) in contrast to prior work studying the
geometry of CWEs, PCA represents semantic changes better than
ICA within the top 10% of axes. These findings encourage the development of
more efficient SCD methods with a small number of SCD-aware dimensions. Source
code is available at https://github.com/LivNLP/svp-dims . [COMMENTS]COLING2025 [LINK]http://arxiv.org/abs/2407.02820v2 [DATE]2024-12-04 04:56:16+08:00 [CATEGORIES]cs.CL
Can Open-source LLMs Enhance Data Synthesis for Toxic Detection?: An Experimental Study [AUTHORS]Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, Congrui Huang [ABSTRACT]Effective toxic content detection relies heavily on high-quality and diverse
data, which serves as the foundation for robust content moderation models. This
study explores the potential of open-source LLMs for harmful data synthesis,
utilizing prompt engineering and fine-tuning techniques to enhance data quality
and diversity. In a two-stage evaluation, we first examine the capabilities of
six open-source LLMs in generating harmful data across multiple datasets using
prompt engineering. In the second stage, we fine-tune these models to improve
data generation while addressing challenges such as hallucination, data
duplication, and overfitting. Our findings reveal that Mistral excels in
generating high-quality and diverse harmful data with minimal hallucination.
Furthermore, fine-tuning enhances data quality, offering scalable and
cost-effective solutions for augmenting datasets for specific toxic content
detection tasks. These results emphasize the significance of data synthesis in
building robust, standalone detection models and highlight the potential of
open-source LLMs to advance smaller downstream content moderation systems. We
implemented this approach in real-world industrial settings, demonstrating the
feasibility and efficiency of fine-tuned open-source LLMs for harmful data
synthesis. [COMMENTS]12 pages [LINK]http://arxiv.org/abs/2411.15175v2 [DATE]2024-12-04 04:07:58+08:00 [CATEGORIES]cs.CL
Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning [AUTHORS]Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora [ABSTRACT]We introduce Instruct-SkillMix, an automated approach for creating diverse,
high quality SFT data. The Instruct-SkillMix pipeline involves two stages, each
leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to
extract core "skills" for instruction-following, either from existing datasets,
or by directly prompting the model; (2) Data generation: uses the powerful LLM
to generate (instruction, response) data that exhibit a randomly chosen pair of
these skills. Here, the use of random skill combinations promotes diversity and
difficulty.
Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from
Instruct-SkillMix leads to strong gains on instruction following benchmarks
such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples,
LLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0.
To our knowledge, this achieves state-of-the-art performance among all models
that have only undergone SFT (no RL methods) and competes with proprietary
models such as Claude 3 Opus and LLaMA-3.1-405B-Instruct.
Ablation studies also suggest plausible reasons for why creating open
instruction-tuning datasets via naive crowd-sourcing has proved difficult.
Introducing low-quality answers ("shirkers") in 20% of Instruct-SkillMix
examples causes performance to plummet, sometimes catastrophically.
The Instruct-SkillMix pipeline is flexible and is adaptable to other
settings. [LINK]http://arxiv.org/abs/2408.14774v3 [DATE]2024-12-04 04:01:23+08:00 [CATEGORIES]cs.LGcs.CL
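The data-generation stage described in the abstract above conditions each example on a randomly chosen pair of extracted skills. The following is a minimal sketch of that sampling step only. The skill names, the prompt wording, and the functions `sample_skill_pairs` and `build_prompt` are assumptions for illustration and are not taken from the Instruct-SkillMix pipeline.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple

def sample_skill_pairs(skills: Sequence[str], k: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Draw k random pairs of distinct skills (pairs may repeat across draws)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    all_pairs = list(combinations(skills, 2))
    return [rng.choice(all_pairs) for _ in range(k)]

def build_prompt(pair: Tuple[str, str]) -> str:
    """Hypothetical prompt asking a strong LLM for one (instruction, response) example."""
    return (
        "Write an instruction and a high-quality response that together "
        f"exercise both of these skills: {pair[0]}; {pair[1]}."
    )

# Hypothetical skill list standing in for the output of the skill-extraction stage.
skills = ["step-by-step reasoning", "concise summarization",
          "code explanation", "tone adaptation"]
pairs = sample_skill_pairs(skills, k=5)
prompts = [build_prompt(p) for p in pairs]
```

Each prompt would then be sent to the data-generating model. Sampling pairs rather than single skills is what the abstract credits with promoting diversity and difficulty.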
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [AUTHORS]Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, Rui Zhang [ABSTRACT]Self-correction is an approach to improving responses from large language
models (LLMs) by refining the responses using LLMs during inference. Prior work
has proposed various self-correction frameworks using different sources of
feedback, including self-evaluation and external feedback. However, there is
still no consensus on the question of when LLMs can correct their own mistakes,
as recent studies also report negative results. In this work, we critically
survey a broad range of papers and discuss the conditions required for successful
self-correction. We first find that prior studies often do not define their
research questions in detail and involve impractical frameworks or unfair
evaluations that over-evaluate self-correction. To tackle these issues, we
categorize research questions in self-correction research and provide a
checklist for designing appropriate experiments. Our critical survey based on
the newly categorized research questions shows that (1) no prior work
demonstrates successful self-correction with feedback from prompted LLMs,
except for studies in tasks that are exceptionally suited for self-correction,
(2) self-correction works well in tasks that can use reliable external
feedback, and (3) large-scale fine-tuning enables self-correction. [COMMENTS]TACL 2024 [LINK]http://arxiv.org/abs/2406.01297v3 [DATE]2024-12-04 03:14:06+08:00 [CATEGORIES]cs.CL
T-REG: Preference Optimization with Token-Level Reward Regularization [AUTHORS]Wenxuan Zhou, Shujian Zhang, Lingxiao Zhao, Tao Meng [ABSTRACT]Reinforcement learning from human feedback (RLHF) has been crucial in
aligning large language models (LLMs) with human values. Traditionally, RLHF
involves generating responses to a query and using a reward model to assign a
reward to the entire response. However, this approach faces challenges due to
its reliance on a single, sparse reward, which makes it challenging for the
model to identify which parts of the sequence contribute most significantly to
the final reward. Recent methods have attempted to address this limitation by
introducing token-level rewards. However, these methods often rely on either a
trained credit assignment model or AI annotators, raising concerns about the
quality and reliability of the rewards. In this paper, we propose token-level
reward regularization (T-REG), a novel approach that leverages both
sequence-level and token-level rewards for preference optimization. Harnessing
the self-refinement capabilities of LLMs, our method uses contrastive prompting
to enable LLMs to self-generate token-level rewards. These self-generated
rewards then act as reward regularization, guiding the model to more
effectively distribute sequence-level rewards across tokens. This facilitates
better token-level credit assignment and enhances alignment performance.
Experiments on the instruction following benchmarks, including Alpaca Eval 2
and Arena-Hard, show that our method consistently outperforms baseline methods
by up to 3.8% and 4.4%, respectively. We will release the code and models at
https://github.com/wzhouad/T-REG. [LINK]http://arxiv.org/abs/2412.02685v1 [DATE]2024-12-04 02:56:07+08:00 [CATEGORIES]cs.CLcs.LG
From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs [AUTHORS]Alireza Rezazadeh, Zichao Li, Wei Wei, Yujia Bao [ABSTRACT]Recent advancements in large language models have significantly improved
their context windows, yet challenges in effective long-term memory management
remain. We introduce MemTree, an algorithm that leverages a dynamic,
tree-structured memory representation to optimize the organization, retrieval,
and integration of information, akin to human cognitive schemas. MemTree
organizes memory hierarchically, with each node encapsulating aggregated
textual content, corresponding semantic embeddings, and varying abstraction
levels across the tree's depths. Our algorithm dynamically adapts this memory
structure by computing and comparing semantic embeddings of new and existing
information to enrich the model's context-awareness. This approach allows
MemTree to handle complex reasoning and extended interactions more effectively
than traditional memory augmentation methods, which often rely on flat lookup
tables. Evaluations on benchmarks for multi-turn dialogue understanding and
document question answering show that MemTree significantly enhances
performance in scenarios that demand structured memory management. [LINK]http://arxiv.org/abs/2410.14052v2 [DATE]2024-12-04 02:48:00+08:00 [CATEGORIES]cs.CLcs.LG
QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing [AUTHORS]Ramesh Manuvinakurike, Elizabeth Watkins, Celal Savur, Anthony Rhodes, Sovan Biswas, Gesem Gudino Mejia, Richard Beckwith, Saurav Sahay, Giuseppe Raffa, Lama Nachman [ABSTRACT]In this work we explore utilizing LLMs for data augmentation for
a manufacturing task guidance system. The dataset consists of representative
samples of interactions with technicians working in an advanced manufacturing
setting. The purpose of this work is to explore the task, data augmentation for
the supported tasks, and the performance of existing LLMs. We
observe that the task is complex, requiring understanding of procedure
specification documents, actions, and objects sequenced temporally. The dataset
consists of 200,000+ question/answer pairs that refer to the spec document and
are grounded in narrations and/or video demonstrations. We compared the
performance of several popular open-sourced LLMs by developing a baseline using
each LLM and then compared the responses in a reference-free setting using
LLM-as-a-judge and compared the ratings with crowd-workers whilst validating
the ratings with experts. [LINK]http://arxiv.org/abs/2412.02638v1 [DATE]2024-12-04 02:10:31+08:00 [CATEGORIES]cs.CL
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [AUTHORS]Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue [ABSTRACT]Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini
1.5 Pro, and Reka Core, have expanded their capabilities to include vision and
audio modalities. While these models demonstrate impressive performance across
a wide range of audio-visual applications, our proposed DeafTest reveals that
MLLMs often struggle with simple tasks humans find trivial: 1) determining
which of two sounds is louder, and 2) determining which of two sounds has a
higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a
comprehensive audio-visual benchmark designed to assess whether those MLLMs can
truly understand the audio-visual information. This benchmark encompasses 4,555
carefully crafted problems, each incorporating text, visual, and audio
components. To successfully infer answers, models must effectively leverage
clues from both visual and audio inputs. To ensure precise and objective
evaluation of MLLM responses, we have structured the questions as
multiple-choice, eliminating the need for human evaluation or LLM-assisted
assessment. We benchmark a series of closed-source and open-source models and
summarize the observations. By revealing the limitations of current models, we
aim to provide useful insight for future dataset collection and model
development. [COMMENTS]Project page: https://av-odyssey.github.io/ [LINK]http://arxiv.org/abs/2412.02611v1 [DATE]2024-12-04 01:41:23+08:00 [CATEGORIES]cs.CL
Interpretable Company Similarity with Sparse Autoencoders [AUTHORS]Marco Molinari, Vladimir Tregubiak, Victor Shao, Abhimanyu Pandey, Mateusz Mikolajczak, Sebastião Kuznetsov Ryder Torres Pereira [ABSTRACT]Determining company similarity is a vital task in finance, underpinning
hedging, risk management, portfolio diversification, and more. Practitioners
often rely on sector and industry classifications to gauge similarity, such as
SIC-codes and GICS-codes, the former being used by the U.S. Securities and
Exchange Commission (SEC), and the latter widely used by the investment
community. Clustering embeddings of company descriptions has been proposed as a
potential technique for determining company similarity, but the lack of
interpretability in token embeddings poses a significant barrier to adoption in
high-stakes contexts. Sparse Autoencoders have shown promise in enhancing the
interpretability of Large Language Models by decomposing LLM activations into
interpretable features. In this paper, we explore the use of SAE features in
measuring company similarity and benchmark them against (1) SIC codes and (2)
Major Group codes. We conclude that SAE features can reproduce and even surpass
sector classifications in quantifying fundamental characteristics of companies,
evaluated by the correlation of monthly returns, a proxy for similarity, and
PnL from cointegration. [LINK]http://arxiv.org/abs/2412.02605v1 [DATE]2024-12-04 01:34:50+08:00 [CATEGORIES]cs.CLcs.LG
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset [AUTHORS]Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro [ABSTRACT]Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved
significant benchmark gains via aggressive model-based filtering, but at the
cost of removing 90% of data. This limits their suitability for long token
horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how
to achieve better trade-offs between accuracy and data quantity by a
combination of classifier ensembling, synthetic data rephrasing, and reduced
reliance on heuristic filters. When training 8B parameter models for 1T tokens,
using a high-quality subset of our data improves MMLU by 5.6 over DCLM,
demonstrating the efficacy of our methods for boosting accuracies over a
relatively short token horizon. Furthermore, our full 6.3T token dataset
matches DCLM on MMLU, but contains four times more unique real tokens than
DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B
parameter model trained for 15T tokens, of which 7.2T came from our dataset, is
better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5
on average across ten diverse tasks. The dataset is available at
https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html [LINK]http://arxiv.org/abs/2412.02595v1 [DATE]2024-12-04 01:28:50+08:00 [CATEGORIES]cs.CL
Patent-CR: A Dataset for Patent Claim Revision [AUTHORS]Lekang Jiang, Pascal A Scherz, Stephan Goetz [ABSTRACT]This paper presents Patent-CR, the first dataset created for the patent claim
revision task in English. It includes both initial patent applications rejected
by patent examiners and the final granted versions. Unlike normal text revision
tasks that predominantly focus on enhancing sentence quality, such as grammar
correction and coherence improvement, patent claim revision aims at ensuring
the claims meet stringent legal criteria. These criteria go beyond novelty and
inventiveness and include clarity of scope, technical accuracy, language
precision, and legal robustness. We assess various large language models (LLMs)
through professional human evaluation, including general LLMs with different
sizes and architectures, text revision models, and domain-specific models. Our
results indicate that LLMs often bring ineffective edits that deviate from the
target revisions. In addition, domain-specific models and the method of
fine-tuning show promising results. Notably, GPT-4 outperforms other tested
LLMs, but further revisions are still necessary to reach the examination
standard. Furthermore, we demonstrate the inconsistency between automated and
human evaluation results, suggesting that GPT-4-based automated evaluation has
the highest correlation with human judgment. This dataset, along with our
preliminary empirical research, offers invaluable insights for further
exploration in patent claim revision. [COMMENTS]15 pages, 6 tables, 3 figures [LINK]http://arxiv.org/abs/2412.02549v1 [DATE]2024-12-04 00:43:42+08:00 [CATEGORIES]cs.CL
Granular Ball Twin Support Vector Machine with Universum Data [AUTHORS]M. A. Ganaie, Vrushank Ahire [ABSTRACT]Classification with support vector machines (SVM) often suffers from limited
performance when relying solely on labeled data from target classes and is
sensitive to noise and outliers. Incorporating prior knowledge from Universum
data and more robust data representations can enhance accuracy and efficiency.
Motivated by these findings, we propose a novel Granular Ball Twin Support
Vector Machine with Universum Data (GBU-TSVM) that extends the TSVM framework
to leverage both Universum samples and granular ball computing during model
training. Unlike existing TSVM methods, the proposed GBU-TSVM represents data
instances as hyper-balls rather than points in the feature space. This
innovative approach improves the model's robustness and efficiency,
particularly in handling noisy and large datasets. By grouping data points into
granular balls, the model achieves superior computational efficiency, increased
noise resistance, and enhanced interpretability. Additionally, the inclusion of
Universum data, which consists of samples that are not strictly from the target
classes, further refines the classification boundaries. This integration
enriches the model with contextual information and boosts overall accuracy.
Experimental results on UCI benchmark
datasets demonstrate that the GBU-TSVM outperforms existing TSVM models in both
accuracy and computational efficiency. These findings highlight the potential
of the GBU-TSVM model in setting a new standard in data representation and
classification. [LINK]http://arxiv.org/abs/2412.03375v1 [DATE]2024-12-04 23:02:28+08:00 [CATEGORIES]cs.LG
Deep Learning in Single-Cell and Spatial Transcriptomics Data Analysis: Advances and Challenges from a Data Science Perspective [AUTHORS]Shuang Ge, Shuqing Sun, Huan Xu, Qiang Cheng, Zhixiang Ren [ABSTRACT]The development of single-cell and spatial transcriptomics has revolutionized
our capacity to investigate cellular properties, functions, and interactions in
both cellular and spatial contexts. However, the analysis of single-cell and
spatial omics data remains challenging. First, single-cell sequencing data are
high-dimensional and sparse, often contaminated by noise and uncertainty,
obscuring the underlying biological signals. Second, these data often encompass
multiple modalities, including gene expression, epigenetic modifications, and
spatial locations. Integrating these diverse data modalities is crucial for
enhancing prediction accuracy and biological interpretability. Third, while the
scale of single-cell sequencing has expanded to millions of cells, high-quality
annotated datasets are still limited. Fourth, the complex correlations of
biological tissues make it difficult to accurately reconstruct cellular states
and spatial contexts. Traditional feature engineering-based analysis methods
struggle to deal with the various challenges presented by intricate biological
networks. Deep learning has emerged as a powerful tool capable of handling
high-dimensional complex data and automatically identifying meaningful
patterns, offering significant promise in addressing these challenges. This
review systematically analyzes these challenges and discusses related deep
learning approaches. Moreover, we have curated 21 datasets from 9 benchmarks,
encompassing 58 computational methods, and evaluated their performance on the
respective modeling tasks. Finally, we highlight three areas for future
development from a technical, dataset, and application perspective. This work
will serve as a valuable resource for understanding how deep learning can be
effectively utilized in single-cell and spatial transcriptomics analyses, while
inspiring novel approaches to address emerging challenges. [LINK]http://arxiv.org/abs/2412.03614v1 [DATE]2024-12-04 22:07:11+08:00 [CATEGORIES]cs.LG
OpenDriver: An Open-Road Driver State Detection Dataset [AUTHORS]Delong Liu, Shichao Li, Tianyi Shi, Zhu Meng, Guanyu Chen, Yadong Huang, Jin Dong, Zhicheng Zhao [ABSTRACT]Among numerous studies for driver state detection, wearable physiological
measurements offer a practical method for real-time monitoring. However, there
are few driver physiological datasets in open-road scenarios, and the existing
datasets suffer from issues such as poor signal quality, small sample sizes,
and short data collection periods. Therefore, in this paper, a large-scale
multimodal driving dataset, OpenDriver, for driver state detection is
developed. The OpenDriver encompasses a total of 3,278 driving trips, with a
signal collection duration spanning approximately 4,600 hours. Two modalities
of driving signals are enrolled in OpenDriver: electrocardiogram (ECG) signals
and six-axis motion data of the steering wheel from an inertial measurement unit
(IMU), which were recorded from 81 drivers and their vehicles. Furthermore,
three challenging tasks are involved in our work, namely ECG signal quality
assessment, individual biometric identification based on ECG signals, and
physiological signal analysis in complex driving environments. To facilitate
research in these tasks, corresponding benchmarks have also been introduced.
First, a noisy augmentation strategy is applied to generate a larger-scale ECG
signal dataset with realistic noise simulation for quality assessment. Second,
an end-to-end contrastive learning framework is employed for individual
biometric identification. Finally, a comprehensive analysis of drivers' HRV
features under different driving conditions is conducted. Each benchmark
provides evaluation metrics and reference results. The OpenDriver dataset will
be publicly available at https://github.com/bdne/OpenDriver. [COMMENTS]Considering that there are flaws in the statistical data of the
dataset, all the authors agreed to withdraw the manuscript [LINK]http://arxiv.org/abs/2304.04203v3 [DATE]2024-12-04 21:43:10+08:00 [CATEGORIES]cs.LG
NeRF and Gaussian Splatting SLAM in the Wild [AUTHORS]Fabian Schmidt, Markus Enzweiler, Abhinav Valada [ABSTRACT]Navigating outdoor environments with visual Simultaneous Localization and
Mapping (SLAM) systems poses significant challenges due to dynamic scenes,
lighting variations, and seasonal changes, requiring robust solutions. While
traditional SLAM methods struggle with adaptability, deep learning-based
approaches and emerging neural radiance fields as well as Gaussian
Splatting-based SLAM methods offer promising alternatives. However, these
methods have primarily been evaluated in controlled indoor environments with
stable conditions, leaving a gap in understanding their performance in
unstructured and variable outdoor settings. This study addresses this gap by
evaluating these methods in natural outdoor environments, focusing on camera
tracking accuracy, robustness to environmental factors, and computational
efficiency, highlighting distinct trade-offs. Extensive evaluations demonstrate
that neural SLAM methods achieve superior robustness, particularly under
challenging conditions such as low light, but at a high computational cost. At
the same time, traditional methods perform the best across seasons but are
highly sensitive to variations in lighting conditions. The code of the
benchmark is publicly available at
https://github.com/iis-esslingen/nerf-3dgs-benchmark. [COMMENTS]5 pages, 2 figures, 4 tables [LINK]http://arxiv.org/abs/2412.03263v1 [DATE]2024-12-04 20:11:19+08:00 [CATEGORIES]cs.LG
Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement Learning [AUTHORS]Mianchu Wang, Yue Jin, Giovanni Montana [ABSTRACT]Offline reinforcement learning (RL) seeks to learn optimal policies from
static datasets without interacting with the environment. A common challenge is
handling multi-modal action distributions, where multiple behaviours are
represented in the data. Existing methods often assume unimodal behaviour
policies, leading to suboptimal performance when this assumption is violated.
We propose Weighted Imitation Learning on One Mode (LOM), a novel approach that
focuses on learning from a single, promising mode of the behaviour policy. By
using a Gaussian mixture model to identify modes and selecting the best mode
based on expected returns, LOM avoids the pitfalls of averaging over
conflicting actions. Theoretically, we show that LOM improves performance while
maintaining simplicity in policy learning. Empirically, LOM outperforms
existing methods on standard D4RL benchmarks and demonstrates its effectiveness
in complex, multi-modal scenarios. [LINK]http://arxiv.org/abs/2412.03258v1 [DATE]2024-12-04 19:57:36+08:00 [CATEGORIES]cs.LG
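The mode-selection idea in the LOM abstract above (identify modes with a Gaussian mixture, then keep the mode with the best expected return) can be illustrated with a minimal sketch. It assumes a one-dimensional action space and a mixture that has already been fitted; the function names, the `(weight, mean, std)` component format, and the use of observed returns in place of a learned value estimate are all illustrative assumptions, not the authors' implementation.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(action, components):
    """Posterior probability of each mixture component given an action.

    components is a list of (weight, mean, std) tuples (hypothetical format).
    """
    weighted = [w * gaussian_pdf(action, m, s) for (w, m, s) in components]
    total = sum(weighted)
    return [v / total for v in weighted]


def best_mode(actions, returns, components):
    """Return (index of the mode with the highest expected return, all expected returns).

    The expected return of a mode is the responsibility-weighted mean of the
    returns observed for the logged actions.
    """
    num = [0.0] * len(components)
    den = [0.0] * len(components)
    for action, ret in zip(actions, returns):
        for k, r in enumerate(responsibilities(action, components)):
            num[k] += r * ret
            den[k] += r
    expected = [n / d for n, d in zip(num, den)]
    best = max(range(len(components)), key=lambda k: expected[k])
    return best, expected


# Two behaviours in the data: actions near -1 earn low returns, actions near +1 earn high returns.
actions = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]
returns = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
components = [(0.5, -1.0, 0.2), (0.5, 1.0, 0.2)]  # assumed pre-fitted mixture
mode_index, expected_returns = best_mode(actions, returns, components)
print(mode_index, expected_returns)
```

Imitating only the selected mode avoids averaging the two behaviours, which here would yield an action near 0 that neither behaviour supports.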
Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges [AUTHORS]Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique [ABSTRACT]Large Language Models (LLMs) represent a class of deep learning models adept
at understanding natural language and generating coherent responses to various
prompts or queries. These models far exceed the complexity of conventional
neural networks, often encompassing dozens of neural network layers and
containing billions to trillions of parameters. They are typically trained on
vast datasets, utilizing architectures based on transformer blocks. Present-day
LLMs are multi-functional, capable of performing a range of tasks from text
generation and language translation to question answering, as well as code
generation and analysis. An advanced subset of these models, known as
Multimodal Large Language Models (MLLMs), extends LLM capabilities to process
and interpret multiple data modalities, including images, audio, and video.
This enhancement empowers MLLMs with capabilities like video editing, image
comprehension, and captioning for visual content. This survey provides a
comprehensive overview of the recent advancements in LLMs. We begin by tracing
the evolution of LLMs and subsequently delve into the advent and nuances of
MLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical
features, strengths, and limitations. Additionally, we present a comparative
analysis of these models and discuss their challenges, potential limitations,
and prospects for future development. [LINK]http://arxiv.org/abs/2412.03220v1 [DATE]2024-12-04 19:14:06+08:00 [CATEGORIES]cs.LG
Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting [AUTHORS]Adrian B. Chłopowiec, Adam R. Chłopowiec, Krzysztof Galus, Wojciech Cebula, Martin Tabakov [ABSTRACT]Limited medical imaging datasets challenge deep learning models by increasing
risks of overfitting and reduced generalization, particularly in Generative
Adversarial Networks (GANs), where discriminators may overfit, leading to
training divergence. This constraint also impairs classification models trained
on small datasets. Generative Data Augmentation (GDA) addresses this by
expanding training datasets with synthetic data, although it requires training
a generative model. We propose and evaluate two local lesion generation
approaches to address the challenge of augmenting small medical image datasets.
The first approach employs the Poisson Image Editing algorithm, a classical
image processing technique, to create realistic image composites that
outperform current state-of-the-art methods. The second approach introduces a
novel generative method, leveraging a fine-tuned Image Inpainting GAN to
synthesize realistic lesions within specified regions of real training images.
A comprehensive comparison of the two proposed methods demonstrates that
effective local lesion generation in a data-constrained setting allows for
reaching new state-of-the-art results in capsule endoscopy lesion
classification. Combination of our techniques achieves a macro F1-score of
33.07%, surpassing the previous best result by 7.84 percentage points (p.p.) on
the highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule
endoscopy. To the best of our knowledge, this work is the first to apply a
fine-tuned Image Inpainting GAN for GDA in medical imaging, demonstrating that
an image-conditional GAN can be adapted effectively to limited datasets to
generate high-quality examples, facilitating effective data augmentation.
Additionally, we show that combining this GAN-based approach with classical
image processing techniques further improves the results. [COMMENTS]54 pages, 35 figures [LINK]http://arxiv.org/abs/2411.03098v2 [DATE]2024-12-04 18:52:25+08:00 [CATEGORIES]cs.LG
RLLTE: Long-Term Evolution Project of Reinforcement Learning [AUTHORS]Mingqi Yuan, Zequn Zhang, Yang Xu, Shihao Luo, Bo Li, Xin Jin, Wenjun Zeng [ABSTRACT]We present RLLTE: a long-term evolution, extremely modular, and open-source
framework for reinforcement learning (RL) research and application. Beyond
delivering top-notch algorithm implementations, RLLTE also serves as a toolkit
for developing algorithms. More specifically, RLLTE decouples the RL algorithms
completely from the exploitation-exploration perspective, providing a large
number of components to accelerate algorithm development and evolution. In
particular, RLLTE is the first RL framework to build a comprehensive ecosystem,
which includes model training, evaluation, deployment, benchmark hub, and large
language model (LLM)-empowered copilot. RLLTE is expected to set standards for
RL engineering practice and be highly stimulative for industry and academia.
Our documentation, examples, and source code are available at
https://github.com/RLE-Foundation/rllte. [COMMENTS]Proceedings of the AAAI Conference on Artificial Intelligence, 2025 [LINK]http://arxiv.org/abs/2309.16382v2 [DATE]2024-12-04 18:27:58+08:00 [CATEGORIES]cs.LG
Semi-decentralized Training of Spatio-Temporal Graph Neural Networks for Traffic Prediction [AUTHORS]Ivan Kralj, Lodovico Giaretta, Gordan Ježić, Ivana Podnar Žarko, Šarūnas Girdzijauskas [ABSTRACT]In smart mobility, large networks of geographically distributed sensors
produce vast amounts of high-frequency spatio-temporal data that must be
processed in real time to avoid major disruptions. Traditional centralized
approaches are increasingly unsuitable to this task, as they struggle to scale
with expanding sensor networks, and reliability issues in central components
can easily affect the whole deployment. To address these challenges, we explore
and adapt semi-decentralized training techniques for Spatio-Temporal Graph
Neural Networks (ST-GNNs) in the smart mobility domain. We implement a simulation
framework where sensors are grouped by proximity into multiple cloudlets, each
handling a subgraph of the traffic graph, fetching node features from other
cloudlets to train its own local ST-GNN model, and exchanging model updates
with other cloudlets to ensure consistency, enhancing scalability and removing
reliance on a centralized aggregator. We perform an extensive comparative
evaluation of four different ST-GNN training setups -- centralized, traditional
FL, server-free FL, and Gossip Learning -- on two large-scale traffic datasets,
METR-LA and PeMS-BAY, for short-, mid-, and long-term vehicle speed
predictions. Experimental results show that semi-decentralized setups are
comparable to centralized approaches in performance metrics, while offering
advantages in terms of scalability and fault tolerance. In addition, we
highlight often overlooked issues in existing literature for distributed
ST-GNNs, such as the variation in model performance across different
geographical areas due to region-specific traffic patterns, and the significant
communication overhead and computational costs that arise from the large
receptive field of GNNs, leading to substantial data transfers and increased
computation of partial embeddings. [COMMENTS]8 pages, 4 figures, 3 tables, conference [LINK]http://arxiv.org/abs/2412.03188v1 [DATE]2024-12-04 18:20:21+08:00 [CATEGORIES]cs.LG
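The exchange of model updates between cloudlets without a central aggregator, as described in the abstract above, can be sketched as pairwise parameter averaging. This is a generic gossip-averaging illustration under stated assumptions: the cloudlet names, the flat parameter lists, and the fixed exchange schedule are hypothetical, local training between exchanges is omitted, and the paper's actual protocol may differ.

```python
def average_pair(params_a, params_b):
    """Element-wise average of two parameter vectors of equal length."""
    return [(a + b) / 2.0 for a, b in zip(params_a, params_b)]


def gossip_round(models, edges):
    """Run one round of pairwise exchanges.

    models maps a cloudlet name to its parameter list; edges is an ordered
    list of (left, right) pairs that exchange and average their parameters.
    The models dict is updated in place and returned.
    """
    for left, right in edges:
        merged = average_pair(models[left], models[right])
        models[left] = list(merged)
        models[right] = list(merged)
    return models


# Three hypothetical cloudlets with two parameters each.
models = {"c0": [0.0, 4.0], "c1": [2.0, 0.0], "c2": [4.0, 2.0]}
gossip_round(models, [("c0", "c1"), ("c1", "c2")])
print(models)
```

Each pairwise average preserves the sum of the two parameter vectors involved, so the network-wide mean is unchanged while the individual models drift toward it over repeated rounds.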
A path-norm toolkit for modern networks: consequences, promises and challenges [AUTHORS]Antoine Gonon, Nicolas Brisebarre, Elisa Riccietti, Rémi Gribonval [ABSTRACT]This work introduces the first toolkit around path-norms that fully
encompasses general DAG ReLU networks with biases, skip connections and any
operation based on the extraction of order statistics: max pooling, GroupSort
etc. This toolkit notably allows us to establish generalization bounds for
modern neural networks that are not only the most widely applicable path-norm
based ones, but also recover or beat the sharpest known bounds of this type.
These extended path-norms further enjoy the usual benefits of path-norms: ease
of computation, invariance under the symmetries of the network, and improved
sharpness on layered fully-connected networks compared to the product of
operator norms, another complexity measure most commonly used.
The versatility of the toolkit and its ease of implementation allow us to
challenge the concrete promises of path-norm-based generalization bounds, by
numerically evaluating the sharpest known bounds for ResNets on ImageNet. [COMMENTS]Erratum: in the published version there was a typo in the definition
of the activation matrix in Definition A.3. This is fixed with this new
version [LINK]http://arxiv.org/abs/2310.01225v5 [DATE]2024-12-04 18:04:02+08:00 [CATEGORIES]cs.LG
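The "ease of computation" and "invariance under the symmetries of the network" mentioned in the abstract above can be illustrated in the simplest special case: a layered fully-connected network without biases. There, the L1 path-norm (the sum, over all input-to-output paths, of the product of absolute weights along the path) equals the result of pushing an all-ones vector through the element-wise absolute weight matrices. The sketch below covers only that special case; the function names are illustrative and this is not the toolkit's API.

```python
def abs_matvec(weights, vec):
    """Multiply the element-wise absolute value of a matrix (list of rows) by a vector."""
    return [sum(abs(w) * v for w, v in zip(row, vec)) for row in weights]


def l1_path_norm(layers):
    """L1 path-norm of a bias-free layered network.

    layers[0] is applied first; each layer is a list of rows with shape
    (outputs x inputs).
    """
    vec = [1.0] * len(layers[0][0])
    for weights in layers:
        vec = abs_matvec(weights, vec)
    return sum(vec)


w1 = [[1.0, -2.0], [0.5, 0.0]]  # 2 inputs -> 2 hidden units
w2 = [[3.0, -1.0]]              # 2 hidden units -> 1 output
print(l1_path_norm([w1, w2]))

# Rescaling symmetry: multiply the first layer by 2 and divide the second by 2.
w1_scaled = [[2.0 * w for w in row] for row in w1]
w2_scaled = [[w / 2.0 for w in row] for row in w2]
print(l1_path_norm([w1_scaled, w2_scaled]))
```

Both calls return the same value, because the rescaling leaves every path product unchanged, whereas a product of per-layer operator norms is not guaranteed to be invariant in the same way.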
Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation [AUTHORS]Gianni Franchi, Dat Nguyen Trong, Nacim Belkhir, Guoxuan Xia, Andrea Pilzer [ABSTRACT]Uncertainty quantification in text-to-image (T2I) generative models is
crucial for understanding model behavior and improving output reliability. In
this paper, we are the first to quantify and evaluate the uncertainty of T2I
models with respect to the prompt. Alongside adapting existing approaches
designed to measure uncertainty in the image space, we also introduce
Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method
leveraging Large Vision-Language Models (LVLMs) to better address uncertainties
arising from the semantics of the prompt and generated images. PUNC utilizes an
LVLM to caption a generated image, and then compares the caption with the
original prompt in the more semantically meaningful text space. PUNC also
enables the disentanglement of both aleatoric and epistemic uncertainties via
precision and recall, which image-space approaches are unable to do. Extensive
experiments demonstrate that PUNC outperforms state-of-the-art uncertainty
estimation techniques across various settings. Uncertainty quantification in
text-to-image generation models can be used in various applications, including
bias detection, copyright protection, and OOD detection. We also introduce a
comprehensive dataset of text prompts and generation pairs to foster further
research in uncertainty quantification for generative models. Our findings
illustrate that PUNC not only achieves competitive performance but also enables
novel applications in evaluating and improving the trustworthiness of
text-to-image models. [COMMENTS]28 pages and 22 figures [LINK]http://arxiv.org/abs/2412.03178v1 [DATE]2024-12-04 18:03:52+08:00 [CATEGORIES]cs.LG
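The text-space comparison in the PUNC abstract above (caption the generated image, then compare the caption with the prompt via precision and recall) can be sketched with a simple token-overlap measure. This is only an illustration: the caption is assumed to be given, the set-based token matching stands in for whatever text-similarity metric the authors use, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lower-case a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def caption_precision_recall(prompt, caption):
    """Token-overlap precision and recall of a caption with respect to a prompt.

    Precision: fraction of caption tokens that also appear in the prompt.
    Recall: fraction of prompt tokens that also appear in the caption.
    """
    prompt_tokens = tokenize(prompt)
    caption_tokens = tokenize(caption)
    overlap = prompt_tokens & caption_tokens
    precision = len(overlap) / len(caption_tokens) if caption_tokens else 0.0
    recall = len(overlap) / len(prompt_tokens) if prompt_tokens else 0.0
    return precision, recall


# The caption mentions nothing outside the prompt but misses part of it.
precision, recall = caption_precision_recall("a red car on a beach", "a red car")
print(precision, recall)
```

Low precision signals content in the image that the prompt did not ask for, while low recall signals prompt content missing from the image; keeping the two numbers separate is what allows different sources of uncertainty to be distinguished.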
Exploration of Parameter Spaces Assisted by Machine Learning [AUTHORS]A. Hammad, Myeonghun Park, Raymundo Ramos, Pankaj Saha [ABSTRACT]We demonstrate two sampling procedures assisted by machine learning models
via regression and classification. The main objective is the use of a neural
network to suggest points likely inside regions of interest, reducing the
number of evaluations of time-consuming calculations. We compare results from
this approach with results from other sampling methods, namely Markov chain
Monte Carlo and MultiNest, obtaining results that range from comparably similar
to arguably better. In particular, we augment our classifier method with a
boosting technique that rapidly increases the efficiency within a few
iterations. We show results from our methods applied to a toy model and the
type II 2HDM, using 3 and 7 free parameters, respectively. The code used for
this paper and instructions are publicly available on the web. [COMMENTS]30 pages, 9 figures. Matches published version. Code and instructions
are available on https://github.com/AHamamd150/MLscanner [LINK]http://arxiv.org/abs/2207.09959v4 [DATE]2024-12-04 17:45:26+08:00 [CATEGORIES]cs.LG
Testing Neural Network Verifiers: A Soundness Benchmark with Hidden Counterexamples [AUTHORS]Xingjian Zhou, Hongji Xu, Andy Xu, Zhouxing Shi, Cho-Jui Hsieh, Huan Zhang [ABSTRACT]In recent years, many neural network (NN) verifiers have been developed to
formally verify certain properties of neural networks such as robustness.
Although many benchmarks have been constructed to evaluate the performance of
NN verifiers, they typically lack a ground truth for hard instances that no
current verifier can verify and for which no counterexample can be found. This
makes it difficult to check the soundness of a new verifier that claims to
verify hard instances no other verifier can handle. We propose to develop a soundness
benchmark for NN verification. Our benchmark contains instances with
deliberately inserted counterexamples while we also try to hide the
counterexamples from regular adversarial attacks which can be used for finding
counterexamples. We design a training method to produce neural networks with
such hidden counterexamples. Our benchmark aims to be used for testing the
soundness of NN verifiers and identifying falsely claimed verifiability when it
is known that hidden counterexamples exist. We systematically construct our
benchmark and generate instances across diverse model architectures, activation
functions, input sizes, and perturbation radii. We demonstrate that our
benchmark successfully identifies bugs in state-of-the-art NN verifiers, as
well as synthetic bugs, providing a crucial step toward enhancing the
reliability of testing NN verifiers. Our code is available at
https://github.com/MVP-Harry/SoundnessBench and our benchmark is available at
https://huggingface.co/datasets/SoundnessBench/SoundnessBench. [COMMENTS]Preprint [LINK]http://arxiv.org/abs/2412.03154v1 [DATE]2024-12-04 17:24:33+08:00 [CATEGORIES]cs.LG
Advanced Risk Prediction and Stability Assessment of Banks Using Time Series Transformer Models [AUTHORS]Wenying Sun, Zhen Xu, Wenqing Zhang, Kunyuan Ma, You Wu, Mengfang Sun [ABSTRACT]This paper aims to study the prediction of the bank stability index based on
the Time Series Transformer model. The bank stability index is an important
indicator to measure the health status and risk resistance of financial
institutions. Traditional prediction methods struggle to adapt to complex
market changes because they rely on single-dimensional macroeconomic data. This
paper proposes a prediction framework based on the Time Series Transformer,
which uses the self-attention mechanism of the model to capture the complex
temporal dependencies and nonlinear relationships in financial data. Through
experiments, we compare the model with LSTM, GRU, CNN, TCN and RNN-Transformer
models. The experimental results show that the Time Series Transformer model
outperforms other models in both mean square error (MSE) and mean absolute
error (MAE) evaluation indicators, showing strong prediction ability. This
indicates that the Time Series Transformer model can better handle multidimensional
time series data in bank stability prediction, providing new technical
approaches and solutions for financial risk management. [LINK]http://arxiv.org/abs/2412.03606v1 [DATE]2024-12-04 16:15:27+08:00 [CATEGORIES]cs.LG
DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries [AUTHORS]Hanqun Cao, Mutian He, Ning Ma, Chang-yu Hsieh, Chunbin Gu, Pheng-Ann Heng [ABSTRACT]DNA-encoded library (DEL) screening has revolutionized the detection of
protein-ligand interactions through read counts, enabling rapid exploration of
vast chemical spaces. However, noise in read counts, stemming from nonspecific
interactions, can mislead this exploration process. We present DEL-Ranking, a
novel distribution-correction denoising framework that addresses these
challenges. Our approach introduces two key innovations: (1) a novel ranking
loss that rectifies relative magnitude relationships between read counts,
enabling the learning of causal features determining activity levels, and (2)
an iterative algorithm employing self-training and consistency loss to
establish model coherence between activity label and read count predictions.
Furthermore, we contribute three new DEL screening datasets, the first to
comprehensively include multi-dimensional molecular representations,
protein-ligand enrichment values, and their activity labels. These datasets
mitigate data scarcity issues in AI-driven DEL screening research. Rigorous
evaluation on diverse DEL datasets demonstrates DEL-Ranking's superior
performance across multiple correlation metrics, with significant improvements
in binding affinity prediction accuracy. Our model exhibits zero-shot
generalization ability across different protein targets and successfully
identifies potential motifs determining compound binding affinity. This work
advances DEL screening analysis and provides valuable resources for future
research in this area. [LINK]http://arxiv.org/abs/2410.14946v2 [DATE]2024-12-04 15:58:40+08:00 [CATEGORIES]cs.LG
UTSD: Unified Time Series Diffusion Model [AUTHORS]Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu [ABSTRACT]Transformer-based architectures have achieved unprecedented success in time
series analysis. However, when facing the challenge of cross-domain modeling,
existing studies that use statistical priors as prompt engineering fail under
the huge distribution shift among various domains. In this paper, a Unified
Time Series Diffusion (UTSD) model is established for the first time to model
the multi-domain probability distribution, utilizing the powerful probability
distribution modeling ability of Diffusion. Unlike the autoregressive models
that capture the conditional probabilities of the prediction horizon to the
historical sequence, we use a diffusion denoising process to model the mixture
distribution of the cross-domain data and generate the prediction sequence for
the target domain directly utilizing conditional sampling. The proposed UTSD
contains three pivotal designs: (1) The condition network captures the
multi-scale fluctuation patterns from the observation sequence, which are
utilized as context representations to guide the denoising network to generate
the prediction sequence; (2) An adapter-based fine-tuning strategy, in which the
multi-domain universal representation learned in the pretraining stage is
utilized for downstream tasks in target domains; (3) The diffusion and
denoising process on the actual sequence space, combined with the improved
classifier-free guidance as the conditional generation strategy, greatly
improves the stability and accuracy of the downstream task. We conduct
extensive experiments on mainstream benchmarks, and the pre-trained UTSD
outperforms existing foundation models on all data domains, exhibiting superior
zero-shot generalization ability. After training from scratch, UTSD achieves
comparable performance against domain-specific proprietary models. The
empirical results validate the potential of UTSD as a time series foundational
model. [LINK]http://arxiv.org/abs/2412.03068v1 [DATE]2024-12-04 14:42:55+08:00 [CATEGORIES]cs.LG
Point-GN: A Non-Parametric Network Using Gaussian Positional Encoding for Point Cloud Classification [AUTHORS]Marzieh Mohammadi, Amir Salarpour [ABSTRACT]This paper introduces Point-GN, a novel non-parametric network for efficient
and accurate 3D point cloud classification. Unlike conventional deep learning
models that rely on a large number of trainable parameters, Point-GN leverages
non-learnable components, specifically Farthest Point Sampling (FPS), k-Nearest
Neighbors (k-NN), and Gaussian Positional Encoding (GPE), to extract both local
and global geometric features. This design eliminates the need for additional
training while maintaining high performance, making Point-GN particularly
suited for real-time, resource-constrained applications. We evaluate Point-GN
on two benchmark datasets, ModelNet40 and ScanObjectNN, achieving
classification accuracies of 85.29% and 85.89%, respectively, while
significantly reducing computational complexity. Point-GN outperforms existing
non-parametric methods and matches the performance of fully trained models, all
with zero learnable parameters. Our results demonstrate that Point-GN is a
promising solution for 3D point cloud classification in practical, real-time
environments. [COMMENTS]This paper has been accepted for presentation at the IEEE Winter
Conference on Applications of Computer Vision (WACV) 2025 [LINK]http://arxiv.org/abs/2412.03056v1 [DATE]2024-12-04 14:20:51+08:00 [CATEGORIES]cs.LG
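Of the non-learnable components named in the Point-GN abstract above, Farthest Point Sampling is the easiest to show in isolation. The sketch below is the standard greedy algorithm written in plain Python for clarity rather than speed; the function name, the fixed start index, and the toy point cloud are illustrative assumptions and not taken from the paper's code.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick indices of points that are far apart from each other.

    points is a list of coordinate tuples; num_samples is assumed to be >= 1.
    Each step selects the point whose distance to the already selected set
    (its minimum distance to any selected point) is largest.
    """
    selected = [start_index]
    min_dist = [math.dist(p, points[start_index]) for p in points]
    target = min(num_samples, len(points))
    while len(selected) < target:
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


# Four points on a line; the sampler first jumps to the far end, then fills the gap.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(farthest_point_sampling(cloud, 3))
```

Because the procedure uses only distances, it has no trainable parameters, which is consistent with the non-parametric design the abstract describes.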
Phased Consistency Models [AUTHORS]Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Xiaogang Wang, Hongsheng Li [ABSTRACT]Consistency Models (CMs) have made significant progress in accelerating the
generation of diffusion models. However, their application to high-resolution,
text-conditioned image generation in the latent space remains unsatisfactory.
In this paper, we identify three key flaws in the current design of Latent
Consistency Models (LCMs). We investigate the reasons behind these limitations
and propose Phased Consistency Models (PCMs), which generalize the design space
and address the identified limitations. Our evaluations demonstrate that PCMs
outperform LCMs across 1-16 step generation settings. While PCMs are
specifically designed for multi-step refinement, they achieve comparable 1-step
generation results to previously state-of-the-art specifically designed 1-step
methods. Furthermore, we show the methodology of PCMs is versatile and
applicable to video generation, enabling us to train the state-of-the-art
few-step text-to-video generator. Our code is available at
https://github.com/G-U-N/Phased-Consistency-Model. [COMMENTS]NeurIPS 2024 [LINK]http://arxiv.org/abs/2405.18407v2 [DATE]2024-12-04 14:18:16+08:00 [CATEGORIES]cs.LG
Stable Consistency Tuning: Understanding and Improving Consistency Models [AUTHORS]Fu-Yun Wang, Zhengyang Geng, Hongsheng Li [ABSTRACT]Diffusion models achieve superior generation quality but suffer from slow
generation speed due to the iterative nature of denoising. In contrast,
consistency models, a new generative family, achieve competitive performance
with significantly faster sampling. These models are trained either through
consistency distillation, which leverages pretrained diffusion models, or
consistency training/tuning directly from raw data. In this work, we propose a
novel framework for understanding consistency models by modeling the denoising
process of the diffusion model as a Markov Decision Process (MDP) and framing
consistency model training as the value estimation through Temporal
Difference (TD) Learning. More importantly, this framework allows us to analyze
the limitations of current consistency training/tuning strategies. Built upon
Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT),
which incorporates variance-reduced learning using the score identity. SCT
leads to significant performance improvements on benchmarks such as CIFAR-10
and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID
1.55, a new SoTA for consistency models. [COMMENTS]Code is available at
https://github.com/G-U-N/Stable-Consistency-Tuning [LINK]http://arxiv.org/abs/2410.18958v3 [DATE]2024-12-04 13:04:42+08:00 [CATEGORIES]cs.LG
Data Acquisition for Improving Model Fairness using Reinforcement Learning [AUTHORS]Jahid Hasan, Romila Pradhan [ABSTRACT]Machine learning systems are increasingly being used in critical decision
making such as healthcare, finance, and criminal justice. Concerns around their
fairness have resulted in several bias mitigation techniques that emphasize the
need for high-quality data to ensure fairer decisions. However, the role of
earlier stages of machine learning pipelines in mitigating model bias has not
been explored well. In this paper, we focus on the task of acquiring additional
labeled data points for training the downstream machine learning model to
rapidly improve its fairness. Since not all data points in a data pool are
equally beneficial to the task of fairness, we generate an ordering in which
data points should be acquired. We present DataSift, a data acquisition
framework based on the idea of data valuation that relies on partitioning and
multi-armed bandits to determine the most valuable data points to acquire. Over
several iterations, DataSift selects a partition and randomly samples a batch
of data points from the selected partition, evaluates the benefit of acquiring
the batch on model fairness, and updates the utility of partitions depending on
the benefit. To further improve the effectiveness and efficiency of evaluating
batches, we leverage influence functions that estimate the effect of acquiring
a batch without retraining the model. We empirically evaluate DataSift on
several real-world and synthetic datasets and show that the fairness of a
machine learning model can be significantly improved even while acquiring a few
data points. [COMMENTS]19 pages, 9 figures [LINK]http://arxiv.org/abs/2412.03009v1 [DATE]2024-12-04 11:56:54+08:00 [CATEGORIES]cs.LG
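The acquisition loop described in the DataSift abstract above (pick a partition, sample a batch, score its benefit, update that partition's utility) can be illustrated with a small bandit sketch. This is not the authors' implementation: the epsilon-greedy rule, the function names `acquire_with_bandit` and `toy_gain`, and the `evaluate_gain` callback are all assumptions made for illustration, and the influence-function speedup mentioned in the abstract is not modeled.

```python
import random


def acquire_with_bandit(partitions, evaluate_gain, rounds=20, batch_size=5,
                        epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over data partitions (illustrative sketch).

    partitions    : list of lists of candidate data points
    evaluate_gain : hypothetical callback (acquired, batch) -> float giving
                    the fairness benefit of adding `batch`
    Returns the acquired points and the final per-partition utilities.
    """
    rng = random.Random(seed)
    pools = [list(p) for p in partitions]   # points still available
    utility = [0.0] * len(pools)            # running mean gain per partition
    pulls = [0] * len(pools)
    acquired = []

    for _ in range(rounds):
        available = [i for i, pool in enumerate(pools) if pool]
        if not available:
            break
        untried = [i for i in available if pulls[i] == 0]
        if untried:
            arm = untried[0]                 # try every partition once first
        elif rng.random() < epsilon:
            arm = rng.choice(available)      # explore
        else:
            arm = max(available, key=lambda i: utility[i])  # exploit

        # Randomly sample a batch from the chosen partition, without replacement.
        k = min(batch_size, len(pools[arm]))
        chosen = set(rng.sample(range(len(pools[arm])), k))
        batch = [x for i, x in enumerate(pools[arm]) if i in chosen]
        pools[arm] = [x for i, x in enumerate(pools[arm]) if i not in chosen]

        gain = evaluate_gain(acquired, batch)
        pulls[arm] += 1
        utility[arm] += (gain - utility[arm]) / pulls[arm]  # incremental mean
        acquired.extend(batch)

    return acquired, utility


def toy_gain(acquired, batch):
    """Toy stand-in for a fairness benefit: share of group "b" in the batch."""
    return sum(1 for group, _ in batch if group == "b") / len(batch)
```

In a real setting `evaluate_gain` would measure the change in a fairness metric of the downstream model; the toy version only rewards batches drawn from an under-represented group so that the bandit's preference is easy to see.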
SurvMamba: State Space Model with Multi-grained Multi-modal Interaction for Survival Prediction [AUTHORS]Ying Chen, Jiajing Xie, Yuxiang Lin, Yuhang Song, Wenxian Yang, Rongshan Yu [ABSTRACT]Multi-modal learning that combines pathological images with genomic data has
significantly enhanced the accuracy of survival prediction. Nevertheless,
existing methods have not fully utilized the inherent hierarchical structure
within both whole slide images (WSIs) and transcriptomic data, from which
better intra-modal representations and inter-modal integration could be
derived. Moreover, many existing studies attempt to improve multi-modal
representations through attention mechanisms, which inevitably lead to high
complexity when processing high-dimensional WSIs and transcriptomic data.
Recently, a structured state space model named Mamba emerged as a promising
approach for its superior performance in modeling long sequences with low
complexity. In this study, we propose Mamba with multi-grained multi-modal
interaction (SurvMamba) for survival prediction. SurvMamba is implemented with
a Hierarchical Interaction Mamba (HIM) module that facilitates efficient
intra-modal interactions at different granularities, thereby capturing more
detailed local features as well as rich global representations. In addition, an
Interaction Fusion Mamba (IFM) module is used for cascaded inter-modal
interactive fusion, yielding more comprehensive features for survival
prediction. Comprehensive evaluations on five TCGA datasets demonstrate that
SurvMamba outperforms other existing methods in terms of performance and
computational cost. [LINK]http://arxiv.org/abs/2404.08027v2 [DATE]2024-12-04 10:57:03+08:00 [CATEGORIES]cs.LG
Unified Inductive Logic: From Formal Learning to Statistical Inference to Supervised Learning [AUTHORS]Hanti Lin [ABSTRACT]While the traditional conception of inductive logic is Carnapian, I develop a
Peircean alternative and use it to unify formal learning theory, statistics,
and a significant part of machine learning: supervised learning. Some crucial
standards for evaluating non-deductive inferences have been assumed separately
in those areas, but can actually be justified by a unifying principle. [LINK]http://arxiv.org/abs/2412.02969v1 [DATE]2024-12-04 10:31:31+08:00 [CATEGORIES]cs.LG
How Many Ratings per Item are Necessary for Reliable Significance Testing? [AUTHORS]Christopher Homan, Flip Korn, Chris Welty [ABSTRACT]Most approaches to machine learning evaluation assume that machine and human
responses are repeatable enough to be measured against data with unitary,
authoritative, "gold standard" responses, via simple metrics such as accuracy,
precision, and recall that assume scores are independent given the test item.
However, AI models have multiple sources of stochasticity and the human raters
who create gold standards tend to disagree with each other, often in meaningful
ways, hence a single output response per input item may not provide enough
information. We introduce methods for determining whether an (existing or
planned) evaluation dataset has enough responses per item to reliably compare
the performance of one model to another. We apply our methods to several of
the very few extant gold standard test sets with multiple disaggregated responses
per item and show that there are usually not enough responses per item to
reliably compare the performance of one model against another. Our methods also
allow us to estimate the number of responses per item for hypothetical datasets
with similar response distributions to the existing datasets we study. When two
models are very far apart in their predictive performance, fewer raters are
needed to confidently compare them, as expected. However, as the models draw
closer, we find that a larger number of raters than are currently typical in
annotation collection are needed to ensure that the power analysis correctly
reflects the difference in performance. [LINK]http://arxiv.org/abs/2412.02968v1 [DATE]2024-12-04 10:31:28+08:00 [CATEGORIES]cs.LG
RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data [AUTHORS]Maxwell A. Xu, Jaya Narain, Gregory Darnell, Haraldur Hallgrimsson, Hyewon Jeong, Darren Forde, Richard Fineman, Karthik J. Raghuram, James M. Rehg, Shirley Ren [ABSTRACT]We present RelCon, a novel self-supervised *Rel*ative *Con*trastive learning
approach that uses a learnable distance measure in combination with a softened
contrastive loss for training a motion foundation model from wearable sensors.
The learnable distance measure captures motif similarity and domain-specific
semantic information such as rotation invariance. The learned distance provides
a measurement of semantic similarity between a pair of accelerometer
time-series segments, which is used to measure the distance between an anchor
and various other sampled candidate segments. The self-supervised model is
trained on 1 billion segments from 87,376 participants from a large wearables
dataset. The model achieves strong performance across multiple downstream
tasks, encompassing both classification and regression. To our knowledge, we
are the first to show the generalizability of a self-supervised learning model
with motion data from wearables across distinct evaluation tasks. [LINK]http://arxiv.org/abs/2411.18822v2 [DATE]2024-12-04 09:56:07+08:00 [CATEGORIES]cs.LG
Zero-Shot Relational Learning for Multimodal Knowledge Graphs [AUTHORS]Rui Cai, Shichao Pei, Xiangliang Zhang [ABSTRACT]Relational learning is an essential task in the domain of knowledge
representation, particularly in knowledge graph completion (KGC). While
relational learning in traditional single-modal settings has been extensively
studied, exploring it within a multimodal KGC context presents distinct
challenges and opportunities. One of the major challenges is inference on newly
discovered relations without any associated training data. This zero-shot
relational learning scenario poses unique requirements for multimodal KGC,
i.e., utilizing multimodality to facilitate relational learning. However,
existing works fail to leverage multimodal information and leave
the problem unexplored. In this paper, we propose a novel end-to-end framework,
consisting of three components, i.e., multimodal learner, structure
consolidator, and relation embedding generator, to integrate diverse multimodal
information and knowledge graph structures to facilitate the zero-shot
relational learning. Evaluation results on three multimodal knowledge graphs
demonstrate the superior performance of our proposed method. [COMMENTS]In the Proceedings of the 2024 IEEE International Conference on Big
Data (IEEE BigData 2024) [LINK]http://arxiv.org/abs/2404.06220v2 [DATE]2024-12-04 09:47:08+08:00 [CATEGORIES]cs.LG
Inverse Delayed Reinforcement Learning [AUTHORS]Simon Sinong Zhan, Qingyuan Wu, Zhian Ruan, Frank Yang, Philip Wang, Yixuan Wang, Ruochen Jiao, Chao Huang, Qi Zhu [ABSTRACT]Inverse Reinforcement Learning (IRL) has demonstrated effectiveness in a
variety of imitation tasks. In this paper, we introduce an IRL framework
designed to extract rewarding features from expert trajectories affected by
delayed disturbances. Instead of relying on direct observations, our approach
employs an efficient off-policy adversarial training framework to derive expert
features and recover optimal policies from augmented delayed observations.
Empirical evaluations in the MuJoCo environment under diverse delay settings
validate the effectiveness of our method. Furthermore, we provide a theoretical
analysis showing that recovering expert policies from augmented delayed
observations outperforms using direct delayed observations. [LINK]http://arxiv.org/abs/2412.02931v1 [DATE]2024-12-04 08:53:55+08:00 [CATEGORIES]cs.LG
HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond [AUTHORS]Stefan Abi-Karam, Rishov Sarkar, Allison Seigler, Sean Lowe, Zhigang Wei, Hanqiu Chen, Nanditha Rao, Lizy John, Aman Arora, Cong Hao [ABSTRACT]Machine learning (ML) techniques have been applied to high-level synthesis
(HLS) flows for quality-of-result (QoR) prediction and design space exploration
(DSE). Nevertheless, the scarcity of accessible high-quality HLS datasets and
the complexity of building such datasets present challenges. Existing datasets
have limitations in terms of benchmark coverage, design space enumeration,
vendor extensibility, or lack of reproducible and extensible software for
dataset construction. Many works also lack user-friendly ways to add more
designs, limiting wider adoption of such datasets. In response to these
challenges, we introduce HLSFactory, a comprehensive framework designed to
facilitate the curation and generation of high-quality HLS design datasets.
HLSFactory has three main stages: 1) a design space expansion stage to
elaborate single HLS designs into large design spaces using various
optimization directives across multiple vendor tools, 2) a design synthesis
stage to execute HLS and FPGA tool flows concurrently across designs, and 3) a
data aggregation stage for extracting standardized data into packaged datasets
for ML usage. This tripartite architecture ensures broad design space coverage
via design space expansion and supports multiple vendor tools. Users can
contribute to each stage with their own HLS designs and synthesis results and
extend the framework itself with custom frontends and tool flows. We also
include an initial set of built-in designs from common HLS benchmarks and curated
open-source HLS designs. We showcase the versatility and multi-functionality of
our framework through seven case studies: I) ML model for QoR prediction; II)
Design space sampling; III) Fine-grained parallelism backend speedup; IV)
Targeting Intel's HLS flow; V) Adding new auxiliary designs; VI) Integrating
published HLS data; VII) HLS tool version regression benchmarking. [COMMENTS]MLCAD 2024 version of the paper. New case study with ML QoR
prediction. Artifact evaluation details included [LINK]http://arxiv.org/abs/2405.00820v3 [DATE]2024-12-04 07:30:43+08:00 [CATEGORIES]cs.LG
MACAW: A Causal Generative Model for Medical Imaging [AUTHORS]Vibujithan Vigneshwaran, Erik Ohara, Matthias Wilms, Nils Forkert [ABSTRACT]Although deep learning techniques show promising results for many
neuroimaging tasks in research settings, they have not yet found widespread use
in clinical scenarios. One of the reasons for this problem is that many machine
learning models only identify correlations between the input images and the
outputs of interest, which can lead to many practical problems, such as
encoding of uninformative biases and reduced explainability. Thus, recent
research is exploring if integrating a priori causal knowledge into deep
learning models is a potential avenue to identify these problems. This work
introduces a new causal generative architecture named Masked Causal Flow
(MACAW) for neuroimaging applications. Within this context, three main
contributions are described. First, a novel approach that integrates complex
causal structures into normalizing flows is proposed. Second, counterfactual
prediction is performed to identify the changes in effect variables associated
with a cause variable. Finally, an explicit Bayesian inference for
classification is derived and implemented, providing an inherent uncertainty
estimation. The feasibility of the proposed method was first evaluated using
synthetic data and then using MRI brain data from more than 23,000 participants
of the UK Biobank study. The evaluation results show that the proposed method
can (1) accurately encode causal reasoning and generate counterfactuals
highlighting the structural changes in the brain known to be associated with
aging, (2) accurately predict a subject's age from a single 2D MRI slice, and
(3) generate new samples assuming other values for subject-specific indicators
such as age, sex, and body mass index. The code for a toy dataset is available
at the following link: https://github.com/vibujithan/macaw-2D.git. [COMMENTS]27 pages [LINK]http://arxiv.org/abs/2412.02900v1 [DATE]2024-12-04 07:05:41+08:00 [CATEGORIES]cs.LG
Modeling and Discovering Direct Causes for Predictive Models [AUTHORS]Yizuo Chen, Amit Bhatia [ABSTRACT]We introduce a causal modeling framework that captures the input-output
behavior of predictive models (e.g., machine learning models) by representing
it using causal graphs. The framework enables us to define and identify
features that directly cause the predictions, which has broad implications for
data collection and model evaluation. We show two assumptions under which the
direct causes can be discovered from data, one of which further simplifies the
discovery process. In addition to providing sound and complete algorithms, we
propose an optimization technique based on an independence rule that can be
integrated with the algorithms to speed up the discovery process both
theoretically and empirically. [LINK]http://arxiv.org/abs/2412.02878v1 [DATE]2024-12-04 06:25:42+08:00 [CATEGORIES]cs.LG
Scale Invariance of Graph Neural Networks [AUTHORS]Qin Jiang, Chengjia Wang, Michael Lones, Wei Pang [ABSTRACT]We address two fundamental challenges in Graph Neural Networks (GNNs): (1)
the lack of theoretical support for invariance learning, a critical property in
image processing, and (2) the absence of a unified model capable of excelling
on both homophilic and heterophilic graph datasets. To tackle these issues, we
establish and prove scale invariance in graphs, extending this key property to
graph learning, and validate it through experiments on real-world datasets.
Leveraging directed multi-scaled graphs and an adaptive self-loop strategy, we
propose ScaleNet, a unified network architecture that achieves state-of-the-art
performance across four homophilic and two heterophilic benchmark datasets.
Furthermore, we show that through graph transformation based on scale
invariance, uniform weights can replace computationally expensive edge weights
in digraph inception networks while maintaining or improving performance. For
another popular GNN approach to digraphs, we demonstrate the equivalence
between Hermitian Laplacian methods and GraphSAGE with incidence normalization.
ScaleNet bridges the gap between homophilic and heterophilic graph learning,
offering both theoretical insights into scale invariance and practical
advancements in unified graph learning. Our implementation is publicly
available at https://github.com/Qin87/ScaleNet/tree/Aug23. [COMMENTS]add theoretical proof. arXiv admin note: substantial text overlap
with arXiv:2411.08758 [LINK]http://arxiv.org/abs/2411.19392v2 [DATE]2024-12-04 06:08:55+08:00 [CATEGORIES]cs.LG
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [AUTHORS]Adam Fisch, Joshua Maynez, R. Alex Hofer, Bhuwan Dhingra, Amir Globerson, William W. Cohen [ABSTRACT]Prediction-powered inference (PPI) is a method that improves statistical
estimates based on limited human-labeled data. PPI achieves this by combining
small amounts of human-labeled data with larger amounts of data labeled by a
reasonably accurate -- but potentially biased -- automatic system, in a way
that results in tighter confidence intervals for certain parameters of interest
(e.g., the mean performance of a language model). In this paper, we propose a
method called Stratified Prediction-Powered Inference (StratPPI), in which we
show that the basic PPI estimates can be considerably improved by employing
simple data stratification strategies. Without making any assumptions on the
underlying automatic labeling system or data distribution, we derive an
algorithm for computing provably valid confidence intervals for population
parameters (such as averages) that is based on stratified sampling. In
particular, we show both theoretically and empirically that, with appropriate
choices of stratification and sample allocation, our approach can provide
substantially tighter confidence intervals than unstratified approaches.
Specifically, StratPPI is expected to improve in cases where the performance of
the autorater varies across different conditional distributions of the target
data. [LINK]http://arxiv.org/abs/2406.04291v2 [DATE]2024-12-04 05:59:32+08:00 [CATEGORIES]cs.LG
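To make the idea in the StratPPI abstract above concrete, the sketch below computes a point estimate of a mean in two steps: the basic PPI estimate (autorater mean on unlabeled data plus a bias correction from the human-labeled data), then a weighted combination across strata. It is a minimal illustration under stated assumptions: the function names and the dictionary layout are hypothetical, the stratum weights are assumed to be known population proportions, and the confidence intervals and sample-allocation choices that the paper derives are not covered.

```python
from statistics import fmean


def ppi_mean(labeled_y, labeled_pred, unlabeled_pred):
    """Basic PPI point estimate of a mean.

    Autorater mean on the unlabeled data, corrected by the average
    (human label - autorater label) gap measured on the labeled data.
    """
    rectifier = fmean([y - f for y, f in zip(labeled_y, labeled_pred)])
    return fmean(unlabeled_pred) + rectifier


def stratified_ppi_mean(strata):
    """Weighted sum of per-stratum PPI estimates.

    strata: list of dicts with keys "weight", "labeled_y",
            "labeled_pred", "unlabeled_pred" (hypothetical layout).
    """
    total = sum(s["weight"] for s in strata)
    return sum(
        (s["weight"] / total)
        * ppi_mean(s["labeled_y"], s["labeled_pred"], s["unlabeled_pred"])
        for s in strata
    )
```

Estimating the correction separately in each stratum is what lets the estimate adapt when the autorater is accurate on some kinds of inputs and biased on others.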
Is Large-Scale Pretraining the Secret to Good Domain Generalization? [AUTHORS]Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Bryan A. Plummer, Kate Saenko [ABSTRACT]Multi-Source Domain Generalization (DG) is the task of training on multiple
source domains and achieving high classification performance on unseen target
domains. Recent methods combine robust features from web-scale pretrained
backbones with new features learned from source data, and this has dramatically
improved benchmark results. However, it remains unclear if DG finetuning
methods are becoming better over time, or if improved benchmark performance is
simply an artifact of stronger pre-training. Prior studies have shown that
perceptual similarity to pre-training data correlates with zero-shot
performance, but we find the effect limited in the DG setting. Instead, we
posit that having perceptually similar data in pretraining is not enough; and
that it is how well these data were learned that determines performance. This
leads us to introduce the Alignment Hypothesis, which states that the final DG
performance will be high if and only if alignment of image and class label text
embeddings is high. Our experiments confirm the Alignment Hypothesis is true,
and we use it as an analysis tool of existing DG methods evaluated on DomainBed
datasets by splitting evaluation data into In-pretraining (IP) and
Out-of-pretraining (OOP). We show that all evaluated DG methods struggle on
DomainBed-OOP, while recent methods excel on DomainBed-IP. Put together, our
findings highlight the need for DG methods which can generalize beyond
pretraining alignment. [LINK]http://arxiv.org/abs/2412.02856v1 [DATE]2024-12-04 05:43:11+08:00 [CATEGORIES]cs.LG
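The IP/OOP split mentioned in the abstract above can be pictured as thresholding an alignment score between each image embedding and the text embedding of its class label. The sketch below assumes cosine similarity as the score and a single fixed threshold; both choices, along with the field names `image_emb` and `label_text_emb`, are illustrative assumptions and the paper's actual criterion may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def split_by_alignment(samples, threshold):
    """Split samples into In-pretraining (IP) and Out-of-pretraining (OOP).

    samples   : list of dicts with "image_emb" and "label_text_emb" vectors
    threshold : hypothetical cut-off on the image/label-text alignment score
    """
    in_pretraining, out_of_pretraining = [], []
    for sample in samples:
        score = cosine(sample["image_emb"], sample["label_text_emb"])
        if score >= threshold:
            in_pretraining.append(sample)
        else:
            out_of_pretraining.append(sample)
    return in_pretraining, out_of_pretraining
```

Reporting accuracy separately on the two subsets then shows whether a method's gains come from samples the pretrained backbone already aligns well.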
Optimized CNNs for Rapid 3D Point Cloud Object Recognition [AUTHORS]Tianyi Lyu, Dian Gu, Peiyuan Chen, Yaoting Jiang, Zhenhong Zhang, Huadong Pang, Li Zhou, Yiping Dong [ABSTRACT]This study introduces a method for efficiently detecting objects within 3D
point clouds using convolutional neural networks (CNNs). Our approach adopts a
unique feature-centric voting mechanism to construct convolutional layers that
capitalize on the typical sparsity observed in input data. We explore the
trade-off between accuracy and speed across diverse network architectures and
advocate for integrating an $\mathcal{L}_1$ penalty on filter activations to
augment sparsity within intermediate layers. This research pioneers the
proposal of sparse convolutional layers combined with $\mathcal{L}_1$
regularization to effectively handle large-scale 3D data processing. Our
method's efficacy is demonstrated on the MVTec 3D-AD object detection
benchmark. The Vote3Deep models, with just three layers, outperform the
previous state-of-the-art in both laser-only approaches and combined
laser-vision methods. Additionally, they maintain competitive processing
speeds. This underscores our approach's capability to substantially enhance
detection performance while ensuring computational efficiency suitable for
real-time applications. [COMMENTS]15 pages [LINK]http://arxiv.org/abs/2412.02855v1 [DATE]2024-12-04 05:42:30+08:00 [CATEGORIES]cs.LG
An L-BFGS-B approach for linear and nonlinear system identification under $\ell_1$ and group-Lasso regularization [AUTHORS]Alberto Bemporad [ABSTRACT]In this paper, we propose a very efficient numerical method based on the
L-BFGS-B algorithm for identifying linear and nonlinear discrete-time
state-space models, possibly under $\ell_1$ and group-Lasso regularization for
reducing model complexity. For the identification of linear models, we show
that, compared to classical linear subspace methods, the approach often
provides better results, is much more general in terms of the loss and
regularization terms used (such as penalties for enforcing system stability),
and is also more stable from a numerical point of view. The proposed method not
only enriches the existing set of linear system identification tools but can
also be applied to identifying a very broad class of parametric nonlinear
state-space models, including recurrent neural networks. We illustrate the
approach on synthetic and experimental datasets and apply it to solve a
challenging industrial robot benchmark for nonlinear multi-input/multi-output
system identification. A Python implementation of the proposed identification
method is available in the package jax-sysid, available at
https://github.com/bemporad/jax-sysid. [COMMENTS]23 pages, 4 figures [LINK]http://arxiv.org/abs/2403.03827v3 [DATE]2024-12-04 04:45:59+08:00 [CATEGORIES]cs.LG
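One standard way to hand an $\ell_1$-regularized problem to a bound-constrained solver such as L-BFGS-B is to split the parameter vector into nonnegative parts. The abstract above does not state that this exact reformulation is used, so the following is only a sketch; the symbols $f$ (smooth fitting loss), $\tau > 0$ (regularization weight), and $y$, $z$ (split variables) are assumptions introduced here.

```latex
% Original nonsmooth problem
\min_{x} \; f(x) + \tau \|x\|_1

% Equivalent smooth problem with simple bounds, using x = y - z
\min_{y \ge 0,\; z \ge 0} \; f(y - z) + \tau \, \mathbf{1}^\top (y + z)

% At a minimizer, y_i z_i = 0 for every i (otherwise both could be reduced
% by \min(y_i, z_i) without changing y - z while lowering the penalty),
% so \mathbf{1}^\top (y + z) = \|y - z\|_1 = \|x\|_1.
```

The second form has a differentiable objective whenever $f$ is differentiable and only box constraints, which is the problem class L-BFGS-B is designed for.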
Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation [AUTHORS]Raphael Ruschel, Md Awsafur Rahman, Hardik Prajapati, Suya You, B. S. Manjuanth [ABSTRACT]Understanding video content is pivotal for advancing real-world applications
like activity recognition, autonomous systems, and human-computer interaction.
While scene graphs are adept at capturing spatial relationships between objects
in individual frames, extending these representations to capture dynamic
interactions across video sequences remains a significant challenge. To address
this, we present TCDSG, Temporally Consistent Dynamic Scene Graphs, an
innovative end-to-end framework that detects, tracks, and links subject-object
relationships across time, generating action tracklets, temporally consistent
sequences of entities and their interactions. Our approach leverages a novel
bipartite matching mechanism, enhanced by adaptive decoder queries and feedback
loops, ensuring temporal coherence and robust tracking over extended sequences.
This method not only establishes a new benchmark by achieving over 60%
improvement in temporal recall@k on the Action Genome, OpenPVSG, and MEVA
datasets but also pioneers the augmentation of MEVA with persistent object ID
annotations for comprehensive tracklet generation. By seamlessly integrating
spatial and temporal dynamics, our work sets a new standard in multi-frame
video analysis, opening new avenues for high-impact applications in
surveillance, autonomous navigation, and beyond. [LINK]http://arxiv.org/abs/2412.02808v1 [DATE]2024-12-04 04:19:20+08:00 [CATEGORIES]cs.LG
Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code [AUTHORS]Timur Galimzyanov, Sergey Titov, Yaroslav Golubev, Egor Bogomolov [ABSTRACT]This paper introduces the human-curated PandasPlotBench dataset, designed to
evaluate language models' effectiveness as assistants in visual data
exploration. Our benchmark focuses on generating code for visualizing tabular
data - such as a Pandas DataFrame - based on natural language instructions,
complementing current evaluation tools and expanding their scope. The dataset
includes 175 unique tasks. Our experiments assess several leading Large
Language Models (LLMs) across three visualization libraries: Matplotlib,
Seaborn, and Plotly. We show that the shortening of tasks has a minimal effect
on plotting capabilities, allowing for a user interface that accommodates
concise user input without sacrificing functionality or accuracy. Another of
our findings reveals that while LLMs perform well with popular libraries like
Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for
improvement. We hope that the modular design of our benchmark will broaden the
current studies on generating visualizations. Our benchmark is available
online: https://huggingface.co/datasets/JetBrains-Research/plot_bench. The code
for running the benchmark is also available:
https://github.com/JetBrains-Research/PandasPlotBench. [COMMENTS]5 pages [LINK]http://arxiv.org/abs/2412.02764v1 [DATE]2024-12-04 03:05:37+08:00 [CATEGORIES]cs.LG
Go beyond End-to-End Training: Boosting Greedy Local Learning with Context Supply [AUTHORS]Chengting Yu, Fengzhao Zhang, Hanzhi Ma, Aili Wang, Erping Li [ABSTRACT]Traditional end-to-end (E2E) training of deep networks necessitates storing
intermediate activations for back-propagation, resulting in a large memory
footprint on GPUs and restricted model parallelization. As an alternative,
greedy local learning partitions the network into gradient-isolated modules and
trains them with supervision from local preliminary losses, thereby providing
asynchronous and parallel training methods that substantially reduce memory
cost. However, empirical experiments reveal that as the number of segmentations
of the gradient-isolated module increases, the performance of the local
learning scheme degrades substantially, severely limiting its expansibility. To
avoid this issue, we theoretically analyze the greedy local learning from the
standpoint of information theory and propose a ContSup scheme, which
incorporates context supply between isolated modules to compensate for
information loss. Experiments on benchmark datasets (i.e. CIFAR, SVHN, STL-10)
achieve SOTA results and indicate that our proposed method can significantly
improve the performance of greedy local learning with minimal memory and
computational overhead, allowing the number of isolated modules to be
increased. Our code is available at https://github.com/Tab-ct/ContSup. [COMMENTS]9 figures, 12 tables [LINK]http://arxiv.org/abs/2312.07636v2 [DATE]2024-12-04 02:35:27+08:00 [CATEGORIES]cs.LG
CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++? [AUTHORS]Vaishnavi Bhargava, Rajat Ghosh, Debojyoti Dutta [ABSTRACT]We introduce CPP-UT-Bench, a benchmark dataset to measure C++ unit test
generation capability of a large language model (LLM). CPP-UT-Bench aims to
reflect a broad and diverse set of C++ codebases found in the real world. The
dataset includes 2,653 {code, unit test} pairs drawn from 14 different
open-source C++ codebases spanning nine diverse domains including machine
learning, software testing, parsing, standard input-output, data engineering,
logging, complete expression evaluation, key value storage, and server
protocols. We demonstrated the effectiveness of CPP-UT-Bench as a benchmark
dataset through extensive experiments in in-context learning,
parameter-efficient fine-tuning (PEFT), and full-parameter fine-tuning. We also
discussed the challenges of the dataset compilation and insights we learned
from in-context learning and fine-tuning experiments. Besides the CPP-UT-Bench
dataset and data compilation code, we are also offering the fine-tuned model
weights for further research. For nine out of ten experiments, our fine-tuned
LLMs outperformed the corresponding base models by an average of more than 70%. [LINK]http://arxiv.org/abs/2412.02735v1 [DATE]2024-12-04 02:35:24+08:00 [CATEGORIES]cs.LG
A Fast Convergence Theory for Offline Decision Making [AUTHORS]Chenjie Mao, Qiaosheng Zhang [ABSTRACT]This paper proposes the first generic fast convergence result in general
function approximation for offline decision making problems, which include
offline reinforcement learning (RL) and off-policy evaluation (OPE) as special
cases. To unify different settings, we introduce a framework called Decision
Making with Offline Feedback (DMOF), which captures a wide range of offline
decision making problems. Within this framework, we propose a simple yet
powerful algorithm called Empirical Decision with Divergence (EDD), whose upper
bound can be termed as a coefficient named Empirical Offline Estimation
Coefficient (EOEC). We show that EOEC is instance-dependent and actually
measures the correlation of the problem. When assuming partial coverage in the
dataset, EOEC decreases at a rate of $1/N$, where $N$ is the size of the
dataset, endowing EDD with a fast convergence guarantee. Finally, we complement
the above results with a lower bound in the DMOF framework, which further
demonstrates the soundness of our theory. [LINK]http://arxiv.org/abs/2406.01378v2 [DATE]2024-12-04 02:32:15+08:00 [CATEGORIES]cs.LG
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [AUTHORS]Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang [ABSTRACT]Large text-to-video models hold immense potential for a wide range of
downstream applications. However, these models struggle to accurately depict
dynamic object interactions, often resulting in unrealistic movements and
frequent violations of real-world physics. One solution inspired by large
language models is to align generated outputs with desired outcomes using
external feedback. This enables the model to refine its responses autonomously,
eliminating extensive manual data collection. In this work, we investigate the
use of feedback to enhance the object dynamics in text-to-video models. We aim
to answer a critical question: what types of feedback, paired with which
specific self-improvement algorithms, can most effectively improve text-video
alignment and realistic object interactions? We begin by deriving a unified
probabilistic objective for offline RL finetuning of text-to-video models. This
perspective highlights how design elements in existing algorithms like KL
regularization and policy projection emerge as specific choices within a
unified framework. We then use derived methods to optimize a set of text-video
alignment metrics (e.g., CLIP scores, optical flow), but notice that they often
fail to align with human perceptions of generation quality. To address this
limitation, we propose leveraging vision-language models to provide more
nuanced feedback specifically tailored to object dynamics in videos. Our
experiments demonstrate that our method can effectively optimize a wide variety
of rewards, with binary AI feedback driving the most significant improvements
in video quality for dynamic interactions, as confirmed by both AI and human
evaluations. Notably, we observe substantial gains when using reward signals
derived from AI feedback, particularly in scenarios involving complex
interactions between multiple objects and realistic depictions of objects
falling. [COMMENTS]Website: https://sites.google.com/view/aif-dynamic-t2v/ [LINK]http://arxiv.org/abs/2412.02617v1 [DATE]2024-12-04 01:44:23+08:00 [CATEGORIES]cs.LG
TAB-Fields: A Maximum Entropy Framework for Mission-Aware Adversarial Planning [AUTHORS]Gokul Puthumanaillam, Jae Hyuk Song, Nurzhan Yesmagambet, Shinkyu Park, Melkior Ornik [ABSTRACT]Autonomous agents operating in adversarial scenarios face a fundamental
challenge: while they may know their adversaries' high-level objectives, such
as reaching specific destinations within time constraints, the exact policies
these adversaries will employ remain unknown. Traditional approaches address
this challenge by treating the adversary's state as a partially observable
element, leading to a formulation as a Partially Observable Markov Decision
Process (POMDP). However, the induced belief-space dynamics in a POMDP require
knowledge of the system's transition dynamics, which, in this case, depend on
the adversary's unknown policy. Our key observation is that while an
adversary's exact policy is unknown, their behavior is necessarily constrained
by their mission objectives and the physical environment, allowing us to
characterize the space of possible behaviors without assuming specific
policies. In this paper, we develop Task-Aware Behavior Fields (TAB-Fields), a
representation that captures adversary state distributions over time by
computing the most unbiased probability distribution consistent with known
constraints. We construct TAB-Fields by solving a constrained optimization
problem that minimizes additional assumptions about adversary behavior beyond
mission and environmental requirements. We integrate TAB-Fields with standard
planning algorithms by introducing TAB-conditioned POMCP, an adaptation of
Partially Observable Monte Carlo Planning. Through experiments in simulation
with underwater robots and hardware implementations with ground robots, we
demonstrate that our approach achieves superior performance compared to
baselines that either assume specific adversary policies or neglect mission
constraints altogether. Evaluation videos and code are available at
https://tab-fields.github.io. [LINK]http://arxiv.org/abs/2412.02570v1 [DATE]2024-12-04 00:55:27+08:00 [CATEGORIES]cs.LG
2024 Dec 03, Tue
BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment [AUTHORS]Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, Yang Feng [ABSTRACT]Large language models (LLMs), with their powerful generative capabilities and
vast knowledge, empower various tasks in everyday life. However, these
abilities are primarily concentrated in high-resource languages, leaving
low-resource languages with weaker generative capabilities and relatively
limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore
crucial for serving over 100 linguistic communities worldwide. An intuitive
approach to enhance the multilingual capabilities would be to construct
instruction data for various languages, but constructing instruction data for
over 100 languages is prohibitively costly. In this paper, we introduce BayLing
2, which efficiently transfers generative capabilities and knowledge from
high-resource languages to low-resource languages through language alignment.
To achieve this, we constructed a dataset of 3.2 million instructions,
comprising high-resource language instructions (Chinese and English) and
cross-lingual instructions for 100+ languages, and performed instruction tuning
based on the dataset to facilitate the capability transfer between languages.
Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B,
and BayLing-2-8B, and conducted a comprehensive evaluation of BayLing. For
multilingual translation across 100+ languages, BayLing shows superior
performance compared to open-source models of similar scale. For multilingual
knowledge and understanding benchmarks, BayLing achieves significant
improvements across over 20 low-resource languages, demonstrating its
capability of effective knowledge transfer from high-resource to low-resource
languages. Furthermore, results on English benchmarks indicate that BayLing
maintains high performance in high-resource languages while enhancing the
performance in low-resource languages. Demo, homepage, code and models of
BayLing are available. [COMMENTS]BayLing 2's online demo: http://nlp.ict.ac.cn/bayling/demo. BayLing
2's code and models: https://github.com/ictnlp/BayLing [LINK]http://arxiv.org/abs/2411.16300v2 [DATE]2024-12-03 22:17:41+08:00 [CATEGORIES]cs.CL
TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual Similarity [AUTHORS]Xi Cao, Quzong Gesang, Yuan Sun, Nuo Qun, Tashi Nyima [ABSTRACT]Language models based on deep neural networks are vulnerable to textual
adversarial attacks. While rich-resource languages like English are receiving
focused attention, Tibetan, a cross-border language, is gradually being studied
due to its abundant ancient literature and critical language strategy.
Currently, there are several Tibetan adversarial text generation methods, but
they do not fully consider the textual features of Tibetan script and
overestimate the quality of generated adversarial texts. To address this issue,
we propose a novel Tibetan adversarial text generation method called TSCheater,
which considers the characteristic of Tibetan encoding and the feature that
visually similar syllables have similar semantics. This method can also be
transferred to other abugidas, such as Devanagari script. We utilize a
self-constructed Tibetan syllable visual similarity database called TSVSDB to
generate substitution candidates and adopt a greedy algorithm-based scoring
mechanism to determine substitution order. We then apply the method to
eight victim language models. Experimentally, TSCheater outperforms existing
methods in attack effectiveness, perturbation magnitude, semantic similarity,
visual similarity, and human acceptance. Finally, we construct the first
Tibetan adversarial robustness evaluation benchmark called AdvTS, which is
generated by existing methods and proofread by humans. [COMMENTS]Review Version; Submitted to ICASSP 2025 [LINK]http://arxiv.org/abs/2412.02371v1 [DATE]2024-12-03 18:57:19+08:00 [CATEGORIES]cs.CL
ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation? [AUTHORS]Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao [ABSTRACT]Multimodal large language models (LLMs) have demonstrated impressive
capabilities in generating high-quality images from textual instructions.
However, their performance in generating scientific images--a critical
application for accelerating scientific progress--remains underexplored. In
this work, we address this gap by introducing ScImage, a benchmark designed to
evaluate the multimodal capabilities of LLMs in generating scientific images
from textual descriptions. ScImage assesses three key dimensions of
understanding: spatial, numeric, and attribute comprehension, as well as their
combinations, focusing on the relationships between scientific objects (e.g.,
squares, circles). We evaluate five models, GPT-4o, Llama, AutomaTikZ, Dall-E,
and StableDiffusion, using two modes of output generation: code-based outputs
(Python, TikZ) and direct raster image generation. Additionally, we examine
four different input languages: English, German, Farsi, and Chinese. Our
evaluation, conducted with 11 scientists across three criteria (correctness,
relevance, and scientific accuracy), reveals that while GPT-4o produces outputs
of decent quality for simpler prompts involving individual dimensions such as
spatial, numeric, or attribute understanding in isolation, all models face
challenges in this task, especially for more complex prompts. [LINK]http://arxiv.org/abs/2412.02368v1 [DATE]2024-12-03 18:52:06+08:00 [CATEGORIES]cs.CL
Evaluating Distributed Representations for Multi-Level Lexical Semantics: A Research Proposal [AUTHORS]Zhu Liu [ABSTRACT]Modern neural networks (NNs), trained on extensive raw sentence data,
construct distributed representations by compressing individual words into
dense, continuous, high-dimensional vectors. These representations are expected
to capture multi-level lexical meaning. In this thesis, our objective is to
examine the efficacy of distributed representations from NNs in encoding
lexical meaning. Initially, we identify and formalize three levels of lexical
semantics: \textit{local}, \textit{global}, and \textit{mixed} levels. Then,
for each level, we evaluate language models by collecting or constructing
multilingual datasets, leveraging various language models, and employing
linguistic analysis theories. This thesis builds a bridge between computational
models and lexical semantics, aiming to complement each other. [COMMENTS]Paper under review [LINK]http://arxiv.org/abs/2406.00751v2 [DATE]2024-12-03 18:37:09+08:00 [CATEGORIES]cs.CL
A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis [AUTHORS]Changzhi Zhou, Dandan Song, Yuhang Tian, Zhijing Wu, Hao Wang, Xinyu Zhang, Jun Yang, Ziyi Yang, Shuhao Zhang [ABSTRACT]Recently, Large Language Models (LLMs) have garnered increasing attention in
the field of natural language processing, revolutionizing numerous downstream
tasks with powerful reasoning and generation abilities. For example, In-Context
Learning (ICL) introduces a fine-tuning-free paradigm, allowing out-of-the-box
LLMs to execute downstream tasks by analogy learning without any fine-tuning.
Besides, in a fine-tuning-dependent paradigm where substantial training data
exists, Parameter-Efficient Fine-Tuning (PEFT) methods offer a cost-effective way
for LLMs to achieve excellent performance comparable to full fine-tuning.
However, these fascinating techniques employed by LLMs have not been fully
exploited in the ABSA field. Previous works probe LLMs in ABSA by merely using
randomly selected input-output pairs as demonstrations in ICL, resulting in an
incomplete and superficial evaluation. In this paper, we shed light on a
comprehensive evaluation of LLMs in the ABSA field, involving 13 datasets, 8
ABSA subtasks, and 6 LLMs. Specifically, we design a unified task formulation
to unify "multiple LLMs for multiple ABSA subtasks in multiple paradigms."
For the fine-tuning-dependent paradigm, we efficiently fine-tune LLMs using
instruction-based multi-task learning. For the fine-tuning-free paradigm, we
propose 3 demonstration selection strategies to stimulate the few-shot
abilities of LLMs. Our extensive experiments demonstrate that LLMs achieve a
new state-of-the-art performance compared to fine-tuned Small Language Models
(SLMs) in the fine-tuning-dependent paradigm. More importantly, in the
fine-tuning-free paradigm where SLMs are ineffective, LLMs with ICL still
showcase impressive potential and even compete with fine-tuned SLMs on some
ABSA subtasks. [LINK]http://arxiv.org/abs/2412.02279v1 [DATE]2024-12-03 16:54:17+08:00 [CATEGORIES]cs.CL
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling [AUTHORS]Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen [ABSTRACT]Efficiently modeling sequences with infinite context length has long been a
challenging problem. Previous approaches have either suffered from quadratic
computational complexity or limited extrapolation ability in length
generalization. In this work, we present Samba, a simple hybrid architecture
that layer-wise combines Mamba, a selective State Space Model (SSM), with
Sliding Window Attention (SWA). Samba selectively compresses a given sequence
into recurrent hidden states while still maintaining the ability to precisely
recall recent memories with the attention mechanism. We scale Samba up to 3.8B
parameters with 3.2T training tokens and demonstrate that it significantly
outperforms state-of-the-art models across a variety of benchmarks. Pretrained
on sequences of 4K length, Samba shows improved perplexity in context lengths
of up to 1M in zero-shot. When finetuned on 4K-length sequences, Samba
efficiently extrapolates to a 256K context length with perfect memory recall on
the Passkey Retrieval task, and exhibits superior retrieval extrapolation on
the challenging Phonebook task compared to full-attention models. As a
linear-time sequence model, Samba achieves a 3.73x higher throughput compared
to Transformers with grouped-query attention for user prompts of 128K length,
and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our
code for training on open source data is publicly available at
https://github.com/microsoft/Samba. [LINK]http://arxiv.org/abs/2406.07522v2 [DATE]2024-12-03 16:27:49+08:00 [CATEGORIES]cs.CL cs.LG
Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning [AUTHORS]Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi [ABSTRACT]Online abusive content detection, particularly in low-resource settings and
within the audio modality, remains underexplored. We investigate the potential
of pre-trained audio representations for detecting abusive language in
low-resource languages, specifically Indian languages, using Few-Shot
Learning (FSL). Leveraging powerful representations from models such as Wav2Vec
and Whisper, we explore cross-lingual abuse detection using the ADIMA dataset
with FSL. Our approach integrates these representations within the
Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in
10 languages. We experiment with various shot sizes (50-200) evaluating the
impact of limited data on performance. Additionally, a feature visualization
study was conducted to better understand model behaviour. This study highlights
the generalization ability of pre-trained models in low-resource scenarios and
offers valuable insights into detecting abusive language in multilingual
contexts. [COMMENTS]Accepted as part of the proceedings of COLING 2025 [LINK]http://arxiv.org/abs/2412.01408v2 [DATE]2024-12-03 15:52:35+08:00 [CATEGORIES]cs.CL
BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition [AUTHORS]Quanjiang Guo, Yihong Dong, Ling Tian, Zhao Kang, Yu Zhang, Sijie Wang [ABSTRACT]Despite the recent success of two-stage prototypical networks in few-shot
named entity recognition (NER), challenges such as over/under-detected false
spans in the span detection stage and unaligned entity prototypes in the type
classification stage persist. Additionally, LLMs have not proven to be
effective few-shot information extractors in general. In this paper, we propose
an approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to
address these issues. We introduce a boundary-aware contrastive learning
strategy to enhance the LLM's ability to perceive entity boundaries for
generalized entity spans. Additionally, we utilize LoRAHub to align information
from the target domain to the source domain, thereby enhancing adaptive
cross-domain classification capabilities. Extensive experiments across various
benchmarks demonstrate that our framework outperforms prior methods, validating
its effectiveness. In particular, the proposed strategies demonstrate
effectiveness across a range of LLM architectures. The code and data are
released at https://github.com/UESTC-GQJ/BANER. [COMMENTS]To appear at COLING 2025 [LINK]http://arxiv.org/abs/2412.02228v1 [DATE]2024-12-03 15:51:14+08:00 [CATEGORIES]cs.CL cs.LG
Model Editing for LLMs4Code: How Far are We? [AUTHORS]Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, Weimin Zhang [ABSTRACT]Large Language Models for Code (LLMs4Code) have been found to exhibit
outstanding performance in the software engineering domain, especially in
coding tasks. However, even the most advanced
LLMs4Code inevitably contain incorrect or outdated code knowledge. Due to
the high cost of training LLMs4Code, it is impractical to re-train the models
to fix this problematic code knowledge. Model editing is a new technical
field for effectively and efficiently correcting erroneous knowledge in LLMs,
where various model editing techniques and benchmarks have been proposed
recently. Despite that, a comprehensive study that thoroughly compares and
analyzes the performance of the state-of-the-art model editing techniques for
adapting the knowledge within LLMs4Code across various code-related tasks is
notably absent. To bridge this gap, we perform the first systematic study on
applying state-of-the-art model editing approaches to repair the inaccuracy of
LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists
of two datasets, i.e., CoNaLa-Edit (CNLE) with 21K+ code generation samples and
CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help
of CLMEEval, we evaluate six advanced model editing techniques on three
LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings
include that the external memorization-based GRACE approach achieves the best
knowledge editing effectiveness and specificity (the editing does not influence
untargeted knowledge), while generalization (whether the editing can generalize
to other semantically-identical inputs) is a universal challenge for existing
techniques. Furthermore, building on in-depth case analysis, we introduce an
enhanced version of GRACE called A-GRACE, which incorporates contrastive
learning to better capture the semantics of the inputs. [COMMENTS]Accepted by ICSE2025. The code is available at:
https://github.com/xpq-tech/code-llmedit.git [LINK]http://arxiv.org/abs/2411.06638v2 [DATE]2024-12-03 15:40:40+08:00 [CATEGORIES]cs.CL
AutoGuide: Automated Generation and Selection of Context-Aware Guidelines for Large Language Model Agents [AUTHORS]Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee [ABSTRACT]Recent advances in large language models (LLMs) have empowered AI agents
capable of performing various sequential decision-making tasks. However,
effectively guiding LLMs to perform well in unfamiliar domains like web
navigation, where they lack sufficient knowledge, has proven to be difficult
with the demonstration-based in-context learning paradigm. In this paper, we
introduce a novel framework, called AutoGuide, which addresses this limitation
by automatically generating context-aware guidelines from offline experiences.
Importantly, each context-aware guideline is expressed in concise natural
language and follows a conditional structure, clearly describing the context
where it is applicable. As a result, our guidelines facilitate the provision of
relevant knowledge for the agent's current decision-making process, overcoming
the limitations of the conventional demonstration-based learning paradigm. Our
evaluation demonstrates that AutoGuide significantly outperforms competitive
baselines in complex benchmark domains, including real-world web navigation. [LINK]http://arxiv.org/abs/2403.08978v2 [DATE]2024-12-03 15:36:47+08:00 [CATEGORIES]cs.CL cs.LG
Enabling Efficient Attack Investigation via Human-in-the-Loop Security Analysis [AUTHORS]Xinyu Yang, Haoyuan Liu, Saimon Amanuel Tsegai, Peng Gao [ABSTRACT]System auditing is a vital technique for collecting system call events as
system provenance and investigating complex multi-step attacks such as Advanced
Persistent Threats. However, existing attack investigation methods struggle to
uncover long attack sequences due to the massive volume of system provenance
data and their inability to focus on attack-relevant parts. In this paper, we
present Raptor, a defense system that enables human analysts to effectively
analyze large-scale system provenance to reveal multi-step attack sequences.
Raptor introduces an expressive domain-specific language, ProvQL, that offers
essential primitives for various types of attack analyses (e.g., attack pattern
search, attack dependency tracking) with user-defined constraints, enabling
analysts to focus on attack-relevant parts and iteratively sift through the
large provenance data. Moreover, Raptor provides an optimized execution engine
for efficient language execution. Our extensive evaluations on a wide range of
attack scenarios demonstrate the practical effectiveness of Raptor in
facilitating timely attack investigation. [LINK]http://arxiv.org/abs/2211.05403v2 [DATE]2024-12-03 13:18:59+08:00 [CATEGORIES]cs.CL
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning [AUTHORS]Xueqing Wu, Yuheng Ding, Bingxuan Li, Pan Lu, Da Yin, Kai-Wei Chang, Nanyun Peng [ABSTRACT]The ability of large vision-language models (LVLMs) to critique and correct
their reasoning is an essential building block towards their self-improvement.
However, a systematic analysis of such capabilities in LVLMs is still lacking.
We propose VISCO, the first benchmark to extensively analyze the fine-grained
critique and correction capabilities of LVLMs. Compared to existing work that
uses a single scalar value to critique the entire reasoning [4], VISCO features
dense and fine-grained critique, requiring LVLMs to evaluate the correctness of
each step in the chain-of-thought and provide natural language explanations to
support their judgments. Extensive evaluation of 24 LVLMs demonstrates that
human-written critiques significantly enhance the performance after correction,
showcasing the potential of the self-improvement strategy. However, the
model-generated critiques are less helpful and sometimes detrimental to the
performance, suggesting that critique is the crucial bottleneck. We identified
three common patterns in critique failures: failure to critique visual
perception, reluctance to "say no", and exaggerated assumption of error
propagation. To address these issues, we propose an effective LookBack strategy
that revisits the image to verify each piece of information in the initial
reasoning. LookBack significantly improves critique and correction performance
by up to 13.5%. [COMMENTS]Project: https://visco-benchmark.github.io/ [LINK]http://arxiv.org/abs/2412.02172v1 [DATE]2024-12-03 13:04:49+08:00 [CATEGORIES]cs.CL
AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning [AUTHORS]Amy Xin, Jinxin Liu, Zijun Yao, Zhicheng Lee, Shulin Cao, Lei Hou, Juanzi Li [ABSTRACT]Recent advancements in large language models (LLMs) have led to significant
improvements in various natural language processing tasks, but it is still
challenging for LLMs to perform knowledge-intensive complex question answering
due to LLMs' inefficacy in reasoning planning and the hallucination problem. A
typical solution is to employ retrieval-augmented generation (RAG) coupled with
chain-of-thought (CoT) reasoning, which decomposes complex questions into
chain-like sub-questions and applies iterative RAG at each sub-question.
However, prior works exhibit sub-optimal reasoning planning and overlook
dynamic knowledge retrieval from heterogeneous sources. In this paper, we
propose AtomR, a novel heterogeneous knowledge reasoning framework that
conducts multi-source reasoning at the atomic level. Drawing inspiration from
the graph modeling of knowledge, AtomR leverages large language models (LLMs)
to decompose complex questions into combinations of three atomic knowledge
operators, significantly enhancing the reasoning process at both the planning
and execution stages. We also introduce BlendQA, a novel evaluation benchmark
tailored to assess complex heterogeneous knowledge reasoning. Experiments show
that AtomR significantly outperforms state-of-the-art baselines across three
single-source and two multi-source reasoning benchmarks, with notable
performance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA. [LINK]http://arxiv.org/abs/2411.16495v2 [DATE]2024-12-03 13:00:18+08:00 [CATEGORIES]cs.CL
NüshuRescue: Revitalization of the endangered Nüshu Language with AI [AUTHORS]Ivory Yang, Weicheng Ma, Soroush Vosoughi [ABSTRACT]The preservation and revitalization of endangered and extinct languages is a
meaningful endeavor, conserving cultural heritage while enriching fields like
linguistics and anthropology. However, these languages are typically
low-resource, making their reconstruction labor-intensive and costly. This
challenge is exemplified by Nüshu, a rare script historically used by Yao
women in China for self-expression within a patriarchal society. To address
this challenge, we introduce NüshuRescue, an AI-driven framework designed to
train large language models (LLMs) on endangered languages with minimal data.
NüshuRescue automates evaluation and expands target corpora to accelerate
linguistic revitalization. As a foundational component, we developed NCGold, a
500-sentence Nüshu-Chinese parallel corpus, the first publicly available
dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu
and only 35 short examples from NCGold, NüshuRescue achieved 48.69%
translation accuracy on 50 withheld sentences and generated NCSilver, a set of
98 newly translated modern Chinese sentences of varying lengths. A sample of
both NCGold and NCSilver is included in the Supplementary Materials.
Additionally, we developed FastText-based and Seq2Seq models to further support
research on Nüshu. NüshuRescue provides a versatile and scalable tool for
the revitalization of endangered languages, minimizing the need for extensive
human input. [COMMENTS]Accepted to COLING 2025 [LINK]http://arxiv.org/abs/2412.00218v2 [DATE]2024-12-03 12:38:31+08:00 [CATEGORIES]cs.CL cs.LG
Analyzing Nobel Prize Literature with Large Language Models [AUTHORS]Zhenyuan Yang, Zhengliang Liu, Jing Zhang, Cen Lu, Jiaxin Tai, Tianyang Zhong, Yiwei Li, Siyan Zhao, Teng Yao, Qing Liu, Jinlin Yang, Qixin Liu, Zhaowei Li, Kexin Wang, Longjun Ma, Dajiang Zhu, Yudan Ren, Bao Ge, Wei Zhang, Ning Qiang, Tuo Zhang, Tianming Liu [ABSTRACT]This study examines the capabilities of advanced Large Language Models
(LLMs), particularly the o1 model, in the context of literary analysis. The
outputs of these models are compared directly to those produced by
graduate-level human participants. By focusing on two Nobel Prize-winning short
stories, 'Nine Chapters' by Han Kang, the 2024 laureate, and 'Friendship' by
Jon Fosse, the 2023 laureate, the research explores the extent to which AI can
engage with complex literary elements such as thematic analysis,
intertextuality, cultural and historical contexts, linguistic and structural
innovations, and character development. Given the Nobel Prize's prestige and
its emphasis on cultural, historical, and linguistic richness, applying LLMs to
these works provides a deeper understanding of both human and AI approaches to
interpretation. The study uses qualitative and quantitative evaluations of
coherence, creativity, and fidelity to the text, revealing the strengths and
limitations of AI in tasks typically reserved for human expertise. While LLMs
demonstrate strong analytical capabilities, particularly in structured tasks,
they often fall short in emotional nuance and coherence, areas where human
interpretation excels. This research underscores the potential for human-AI
collaboration in the humanities, opening new opportunities in literary studies
and beyond. [LINK]http://arxiv.org/abs/2410.18142v2 [DATE]2024-12-03 12:19:36+08:00 [CATEGORIES]cs.CL
Leveraging Large Language Models for Comparative Literature Summarization with Reflective Incremental Mechanisms [AUTHORS]Fernando Gabriela Garcia, Spencer Burns, Harrison Fuller [ABSTRACT]In this paper, we introduce ChatCite, a novel method leveraging large
language models (LLMs) for generating comparative literature summaries. The
ability to summarize research papers with a focus on key comparisons between
studies is an essential task in academic research. Existing summarization
models, while effective at generating concise summaries, fail to provide deep
comparative insights. ChatCite addresses this limitation by incorporating a
multi-step reasoning mechanism that extracts critical elements from papers,
incrementally builds a comparative summary, and refines the output through a
reflective memory process. We evaluate ChatCite on a custom dataset,
CompLit-LongContext, consisting of 1000 research papers with annotated
comparative summaries. Experimental results show that ChatCite outperforms
several baseline methods, including GPT-4, BART, T5, and CoT, across various
automatic evaluation metrics such as ROUGE and the newly proposed G-Score.
Human evaluation further confirms that ChatCite generates more coherent,
insightful, and fluent summaries compared to these baseline models. Our method
provides a significant advancement in automatic literature review generation,
offering researchers a powerful tool for efficiently comparing and synthesizing
scientific research. [COMMENTS]8 pages [LINK]http://arxiv.org/abs/2412.02149v1 [DATE]2024-12-03 12:09:36+08:00 [CATEGORIES]cs.CL
Personalized Multimodal Large Language Models: A Survey [AUTHORS]Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Namyong Park, Sungchul Kim, Huanrui Yang, Subrata Mitra, Zhengmian Hu, Nedim Lipka, Dang Nguyen, Yue Zhao, Jiebo Luo, Julian McAuley [ABSTRACT]Multimodal Large Language Models (MLLMs) have become increasingly important
due to their state-of-the-art performance and ability to integrate multiple
data modalities, such as text, images, and audio, to perform complex tasks with
high accuracy. This paper presents a comprehensive survey on personalized
multimodal large language models, focusing on their architecture, training
methods, and applications. We propose an intuitive taxonomy for categorizing
the techniques used to personalize MLLMs to individual users, and discuss the
techniques accordingly. Furthermore, we discuss how such techniques can be
combined or adapted when appropriate, highlighting their advantages and
underlying rationale. We also provide a succinct summary of personalization
tasks investigated in existing research, along with the evaluation metrics
commonly used. Additionally, we summarize the datasets that are useful for
benchmarking personalized MLLMs. Finally, we outline critical open challenges.
This survey aims to serve as a valuable resource for researchers and
practitioners seeking to understand and advance the development of personalized
multimodal large language models. [LINK]http://arxiv.org/abs/2412.02142v1 [DATE]2024-12-03 11:59:03+08:00 [CATEGORIES]cs.CL
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image [AUTHORS]Yuci Liang, Xinheng Lyu, Meidan Ding, Wenting Chen, Jipeng Zhang, Yuexiang Ren, Xiangjian He, Song Wu, Sen Yang, Xiyue Wang, Xiaohan Xing, Linlin Shen [ABSTRACT]Recent advancements in computational pathology have produced patch-level
Multi-modal Large Language Models (MLLMs), but these models are limited by
their inability to analyze whole slide images (WSIs) comprehensively and their
tendency to bypass crucial morphological features that pathologists rely on for
diagnosis. To address these challenges, we first introduce WSI-Bench, a
large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850
WSIs across 30 cancer types, designed to evaluate MLLMs' understanding of
morphological characteristics crucial for accurate diagnosis. Building upon
this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI
understanding that employs a three-stage training approach: WSI-text alignment,
feature space alignment, and task-specific instruction tuning. To better assess
model performance in pathological contexts, we develop two specialized WSI
metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that
WSI-LLaVA outperforms existing models across all capability dimensions, with a
significant improvement in morphological analysis, establishing a clear
correlation between morphological understanding and diagnostic accuracy. [COMMENTS]38 pages, 22 figures, 35 tables [LINK]http://arxiv.org/abs/2412.02141v1 [DATE]2024-12-03 11:57:24+08:00 [CATEGORIES]cs.CL
Large Language Model-Brained GUI Agents: A Survey [AUTHORS]Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang [ABSTRACT]GUIs have long been central to human-computer interaction, providing an
intuitive and visually-driven way to access and interact with digital systems.
The advent of LLMs, particularly multimodal models, has ushered in a new era of
GUI automation. They have demonstrated exceptional capabilities in natural
language understanding, code generation, and visual processing. This has paved
the way for a new generation of LLM-brained GUI agents capable of interpreting
complex GUI elements and autonomously executing actions based on natural
language instructions. These agents represent a paradigm shift, enabling users
to perform intricate, multi-step tasks through simple conversational commands.
Their applications span across web navigation, mobile app interactions, and
desktop automation, offering a transformative user experience that
revolutionizes how individuals interact with software. This emerging field is
rapidly advancing, with significant progress in both research and industry.
To provide a structured understanding of this trend, this paper presents a
comprehensive survey of LLM-brained GUI agents, exploring their historical
evolution, core components, and advanced techniques. We address research
questions such as existing GUI agent frameworks, the collection and utilization
of data for training specialized GUI agents, the development of large action
models tailored for GUI tasks, and the evaluation metrics and benchmarks
necessary to assess their effectiveness. Additionally, we examine emerging
applications powered by these agents. Through a detailed analysis, this survey
identifies key research gaps and outlines a roadmap for future advancements in
the field. By consolidating foundational knowledge and state-of-the-art
developments, this work aims to guide both researchers and practitioners in
overcoming challenges and unlocking the full potential of LLM-brained GUI
agents. [COMMENTS]The collection of papers reviewed in this survey will be hosted and
regularly updated on the GitHub repository:
https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a
searchable webpage is available at https://aka.ms/gui-agent for easier access
and exploration [LINK]http://arxiv.org/abs/2411.18279v3 [DATE]2024-12-03 11:16:27+08:00 [CATEGORIES]cs.CL
Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders [AUTHORS]Xiaofeng Zhu, Jaya Krishna Mandivarapu [ABSTRACT]Although people are impressed by the content generation skills of large
language models, the use of LLMs, such as ChatGPT, is limited by the domain
grounding of the content. The correctness and groundedness of the generated
content need to be based on a verified context, such as results from
Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to
a customized domain is that the generated responses are often incomplete, or
the additions are not verified and may even be hallucinated. Prior studies on
hallucination detection have focused on evaluation metrics, which are not
easily adaptable to dynamic domains and can be vulnerable to attacks like
jail-breaking. In this work, we propose 1) a post-processing algorithm that
leverages knowledge triplets in RAG context to correct hallucinations and 2) a
dual-decoder model that fuses RAG context to guide the generation process. [LINK]http://arxiv.org/abs/2411.07870v3 [DATE]2024-12-03 09:04:10+08:00 [CATEGORIES]cs.CL
BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts [AUTHORS]Raisa Tasnim, Mehanaz Chowdhury, Md Ataur Rahman [ABSTRACT]Author profiling, the analysis of texts to uncover attributes such as gender
and age of the author, has become essential with the widespread use of social
media platforms. This paper focuses on author profiling in the Bangla language,
aiming to extract valuable insights about anonymous authors based on their
writing style on social media. The primary objective is to introduce and
benchmark the performance of machine learning approaches on a newly created
Bangla Author Profiling dataset, BN-AuthProf. The dataset comprises 30,131
social media posts from 300 authors, labeled by their age and gender. Authors'
identities and sensitive information were anonymized to ensure privacy. Various
classical machine learning and deep learning techniques were employed to
evaluate the dataset. For gender classification, the best accuracy achieved was
80% using Support Vector Machine (SVM), while a Multinomial Naive Bayes (MNB)
classifier achieved the best F1 score of 0.756. For age classification, MNB
attained a maximum accuracy score of 91% with an F1 score of 0.905. This
research highlights the effectiveness of machine learning in gender and age
classification for Bangla author profiling, with practical implications
spanning marketing, security, forensic linguistics, education, and criminal
investigations, considering privacy and biases. [COMMENTS]Accepted for publication at the 2024 27th International Conference on
Computer and Information Technology (ICCIT) [LINK]http://arxiv.org/abs/2412.02058v1 [DATE]2024-12-03 08:32:32+08:00 [CATEGORIES]cs.CL
A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala [AUTHORS]Surangika Ranathunga, Asanka Ranasinghea, Janaka Shamala, Ayodya Dandeniyaa, Rashmi Galappaththia, Malithi Samaraweeraa [ABSTRACT]This paper presents a multi-way parallel English-Tamil-Sinhala corpus
annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource
languages. Using pre-trained multilingual Language Models (mLMs), we establish
new benchmark Named Entity Recognition (NER) results on this dataset for
Sinhala and Tamil. We also carry out a detailed investigation on the NER
capabilities of different types of mLMs. Finally, we demonstrate the utility of
our NER system on a low-resource Neural Machine Translation (NMT) task. Our
dataset is publicly released: https://github.com/suralk/multiNER. [LINK]http://arxiv.org/abs/2412.02056v1 [DATE]2024-12-03 08:28:31+08:00 [CATEGORIES]cs.CL
Real-Time Multilingual Sign Language Processing [AUTHORS]Amit Moryossef [ABSTRACT]Sign Language Processing (SLP) is an interdisciplinary field comprised of
Natural Language Processing (NLP) and Computer Vision. It is focused on the
computational understanding, translation, and production of signed languages.
Traditional approaches have often been constrained by the use of gloss-based
systems that are both language-specific and inadequate for capturing the
multidimensional nature of sign language. These limitations have hindered the
development of technology capable of processing signed languages effectively.
This thesis aims to revolutionize the field of SLP by proposing a simple
paradigm that can bridge this existing technological gap. We propose the use of
SignWriting, a universal sign language transcription notation system, to serve
as an intermediary link between the visual-gestural modality of signed
languages and text-based linguistic representations.
We contribute foundational libraries and resources to the SLP community,
thereby setting the stage for a more in-depth exploration of the tasks of sign
language translation and production. These tasks encompass the translation of
sign language from video to spoken language text and vice versa. Through
empirical evaluations, we establish the efficacy of our transcription method as
a pivot for enabling faster, more targeted research that can lead to more
natural and accurate translations across a range of languages.
The universal nature of our transcription-based paradigm also paves the way
for real-time, multilingual applications in SLP, thereby offering a more
inclusive and accessible approach to language technology. This is a significant
step toward universal accessibility, enabling a wider reach of AI-driven
language technologies to include the deaf and hard-of-hearing community. [COMMENTS]PhD Thesis [LINK]http://arxiv.org/abs/2412.01991v1 [DATE]2024-12-03 05:51:41+08:00 [CATEGORIES]cs.CL
Discovering influential text using convolutional neural networks [AUTHORS]Megan Ayers, Luke Sanford, Margaret Roberts, Eddie Yang [ABSTRACT]Experimental methods for estimating the impacts of text on human evaluation
have been widely used in the social sciences. However, researchers in
experimental settings are usually limited to testing a small number of
pre-specified text treatments. While efforts to mine unstructured texts for
features that causally affect outcomes have been ongoing in recent years, these
models have primarily focused on the topics or specific words of text, which
may not always be the mechanism of the effect. We connect these efforts with
NLP interpretability techniques and present a method for flexibly discovering
clusters of similar text phrases that are predictive of human reactions to
texts using convolutional neural networks. When used in an experimental
setting, this method can identify text treatments and their effects under
certain assumptions. We apply the method to two datasets. The first enables
direct validation of the model's ability to detect phrases known to cause the
outcome. The second demonstrates its ability to flexibly discover text
treatments with varying textual structures. In both cases, the model learns a
greater variety of text treatments compared to benchmark methods, and these
text features quantitatively meet or exceed the ability of benchmark methods to
predict the outcome. [COMMENTS]Published in Findings of ACL 2024 (see
https://aclanthology.org/2024.findings-acl.714) [LINK]http://arxiv.org/abs/2406.10086v3 [DATE]2024-12-03 05:31:59+08:00 [CATEGORIES]cs.CLcs.LG
The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [AUTHORS]Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh [ABSTRACT]The pursuit of leaderboard rankings in Large Language Models (LLMs) has
created a fundamental paradox: models excel at standardized tests while failing
to demonstrate genuine language understanding and adaptability. Our systematic
analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across
the evaluation spectrum, from basic metrics to complex benchmarks like GLUE and
MMLU. These vulnerabilities manifest through benchmark exploitation, dataset
contamination, and evaluation bias, creating a false perception of progress in
language understanding capabilities. Through extensive review of contemporary
evaluation approaches, we identify significant limitations in static benchmark
designs, human evaluation protocols, and LLM-as-judge frameworks, all of which
compromise the reliability of current performance assessments. As LLM
capabilities evolve and existing benchmarks become redundant, we lay the
groundwork for new evaluation methods that resist manipulation, minimize data
contamination, and assess domain-specific tasks. This requires frameworks that
are adapted dynamically, addressing current limitations and providing a more
accurate reflection of LLM performance. [COMMENTS]11 pages [LINK]http://arxiv.org/abs/2412.03597v1 [DATE]2024-12-03 04:49:21+08:00 [CATEGORIES]cs.CLcs.LG
VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models [AUTHORS]Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E Gonzalez [ABSTRACT]Large language models (LLMs) often exhibit subtle yet distinctive
characteristics in their outputs that users intuitively recognize, but struggle
to quantify. These "vibes" -- such as tone, formatting, or writing style --
influence user preferences, yet traditional evaluations focus primarily on the
singular axis of correctness. We introduce VibeCheck, a system for
automatically comparing a pair of LLMs by discovering identifying traits of a
model (vibes) that are well-defined, differentiating, and user-aligned.
VibeCheck iteratively discovers vibes from model outputs and then utilizes a
panel of LLM judges to quantitatively measure the utility of each vibe. We
validate that the vibes generated by VibeCheck align with those found in human
discovery and run VibeCheck on pairwise preference data from real-world user
conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a
friendly, funny, and somewhat controversial vibe. These vibes predict model
identity with 80% accuracy and human preference with 61% accuracy. Lastly, we
run VibeCheck on a variety of models and tasks including summarization, math,
and captioning to provide insight into differences in model behavior. VibeCheck
discovers vibes like Command X prefers to add concrete intros and conclusions
when summarizing in comparison to TNGL, Llama-405b often overexplains its
thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus
on the mood and emotions of the scene when captioning compared to
Gemini-1.5-Flash. Code can be found at https://github.com/lisadunlap/VibeCheck [COMMENTS]unironic use of the word 'vibe', added more analysis and cooler
graphs. arXiv admin note: text overlap with arXiv:2301.07597 by other authors [LINK]http://arxiv.org/abs/2410.12851v4 [DATE]2024-12-03 04:27:39+08:00 [CATEGORIES]cs.CL
Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack [AUTHORS]Xiaoyue Xu, Qinyuan Ye, Xiang Ren [ABSTRACT]We introduce Lifelong ICL, a problem setting that challenges long-context
language models (LMs) to learn a sequence of language tasks through in-context
learning (ICL). We further introduce Task Haystack, an evaluation suite
dedicated to assessing and diagnosing how long-context LMs utilize contexts in
Lifelong ICL. When given a task instruction and test inputs, long-context LMs
are expected to leverage the relevant demonstrations in the Lifelong ICL
prompt, avoid distraction and interference from other tasks, and achieve test
accuracies that are not significantly worse than those of the Single-task ICL
baseline.
Task Haystack draws inspiration from the widely-adopted
"needle-in-a-haystack" (NIAH) evaluation, but presents distinct new challenges.
It requires models (1) to utilize the contexts at a deeper level, rather than
resorting to simple copying and pasting; (2) to navigate through long streams
of evolving topics and tasks, proxying the complexities and dynamism of
contexts in real-world scenarios. Additionally, Task Haystack inherits the
controllability of NIAH, providing model developers with tools and
visualizations to identify model vulnerabilities effectively.
We benchmark 14 long-context LMs using Task Haystack, finding that frontier
models like GPT-4o still struggle with the setting, failing on 15% of cases on
average. Most open-weight models further lag behind by a large margin, with
failure rates reaching up to 61%. In our controlled analysis, we identify
factors such as distraction and recency bias as contributors to these failure
cases. Further, performance declines when task instructions are paraphrased at
test time or when ICL demonstrations are repeated excessively, raising concerns
about the robustness, instruction understanding, and true context utilization
of long-context LMs. [COMMENTS]NeurIPS 2024 (Datasets and Benchmarks Track). Code:
https://github.com/INK-USC/Lifelong-ICL Website:
https://inklab.usc.edu/lifelong-icl/ [LINK]http://arxiv.org/abs/2407.16695v2 [DATE]2024-12-03 04:23:49+08:00 [CATEGORIES]cs.CLcs.LG
RIRAG: Regulatory Information Retrieval and Answer Generation [AUTHORS]Tuba Gokhan, Kexin Wang, Iryna Gurevych, Ted Briscoe [ABSTRACT]Regulatory documents, issued by governmental regulatory bodies, establish
rules, guidelines, and standards that organizations must adhere to for legal
compliance. These documents, characterized by their length, complexity and
frequent updates, are challenging to interpret, requiring significant
allocation of time and expertise on the part of organizations to ensure ongoing
compliance. Regulatory Natural Language Processing (RegNLP) is a
multidisciplinary field aimed at simplifying access to and interpretation of
regulatory rules and obligations. We introduce a task of generating
question-passage pairs, where questions are automatically created and paired
with relevant regulatory passages, facilitating the development of regulatory
question-answering systems. We create the ObliQA dataset, containing 27,869
questions derived from the collection of Abu Dhabi Global Markets (ADGM)
financial regulation documents, design a baseline Regulatory Information
Retrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a
novel evaluation metric that tests whether generated answers accurately capture
all relevant obligations while avoiding contradictions. [LINK]http://arxiv.org/abs/2409.05677v2 [DATE]2024-12-03 02:13:28+08:00 [CATEGORIES]cs.CL
Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index [AUTHORS]Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami [ABSTRACT]As prompt engineering research rapidly evolves, evaluations beyond accuracy
are crucial for developing cost-effective techniques. We present the Economical
Prompting Index (EPI), a novel metric that combines accuracy scores with token
consumption, adjusted by a user-specified cost concern level to reflect
different resource constraints. Our study examines 6 advanced prompting
techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts,
across 10 widely-used language models and 4 diverse datasets. We demonstrate
that approaches such as Self-Consistency often provide statistically
insignificant gains while becoming cost-prohibitive. For example, on
high-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques
like Chain-of-Thought (0.72) surpasses more complex methods like
Self-Consistency (0.64) at slight cost concern levels. Our findings suggest a
reevaluation of complex prompting strategies in resource-constrained scenarios,
potentially reshaping future research priorities and improving
cost-effectiveness for end-users. [COMMENTS]5 pages (excluding references), accepted to Coling 2025 [LINK]http://arxiv.org/abs/2412.01690v1 [DATE]2024-12-03 00:34:18+08:00 [CATEGORIES]cs.CL
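To make the idea behind the Economical Prompting Index entry above concrete, here is a minimal sketch of a metric that combines accuracy with token consumption under a user-chosen cost concern level. The abstract does not state the formula, so the exponential discount used here, the function name `economical_prompting_index`, and all numbers are assumptions for illustration only, not the authors' definition or results.

```python
import math


def economical_prompting_index(accuracy: float, tokens: float, cost_concern: float) -> float:
    """Hypothetical cost-adjusted score: accuracy discounted by token usage.

    ASSUMPTION: the exponential form below is an illustrative choice, not
    taken from the abstract. cost_concern = 0 means cost is ignored.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if tokens < 0 or cost_concern < 0:
        raise ValueError("tokens and cost_concern must be non-negative")
    return accuracy * math.exp(-cost_concern * tokens)


# Made-up numbers: a cheap technique versus a slightly more accurate,
# much more token-hungry one.
cheap = economical_prompting_index(accuracy=0.90, tokens=300, cost_concern=0.0005)
costly = economical_prompting_index(accuracy=0.92, tokens=1500, cost_concern=0.0005)
print(f"cheap technique score:  {cheap:.3f}")
print(f"costly technique score: {costly:.3f}")
```

With a cost concern of zero the score reduces to plain accuracy, so the ranking of two techniques can flip as the cost concern level rises.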
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities [AUTHORS]Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock [ABSTRACT]Forecasts of future events are essential inputs into informed
decision-making. Machine learning (ML) systems have the potential to deliver
forecasts at scale, but there is no framework for evaluating the accuracy of ML
systems on a standardized set of forecasting questions. To address this gap, we
introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML
systems on an automatically generated and regularly updated set of 1,000
forecasting questions. To avoid any possibility of data leakage, ForecastBench
is comprised solely of questions about future events that have no known answer
at the time of submission. We quantify the capabilities of current ML systems
by collecting forecasts from expert (human) forecasters, the general public,
and LLMs on a random subset of questions from the benchmark ($N=200$). While
LLMs have achieved super-human performance on many benchmarks, they perform
less well here: expert forecasters outperform the top-performing LLM (p-value
$<0.01$). We display system and human scores in a public leaderboard at
www.forecastbench.org. [LINK]http://arxiv.org/abs/2409.19839v3 [DATE]2024-12-03 00:27:16+08:00 [CATEGORIES]cs.LGcs.CL
R-Bot: An LLM-based Query Rewrite System [AUTHORS]Zhaoyan Sun, Xuanhe Zhou, Guoliang Li [ABSTRACT]Query rewrite is essential for optimizing SQL queries to improve their
execution efficiency without changing their results. Traditionally, this task
has been tackled through heuristic and learning-based methods, each with its
limitations in terms of inferior quality and low robustness. Recent
advancements in LLMs offer a new paradigm by leveraging their superior natural
language and code comprehension abilities. Despite their potential, directly
applying LLMs like GPT-4 has faced challenges due to problems such as
hallucinations, where the model might generate inaccurate or irrelevant
results. To address this, we propose R-Bot, an LLM-based query rewrite system
with a systematic approach. We first design a multi-source rewrite evidence
preparation pipeline to generate query rewrite evidences for guiding LLMs to
avoid hallucinations. We then propose a hybrid structure-semantics retrieval
method that combines structural and semantic analysis to retrieve the most
relevant rewrite evidences for effectively answering an online query. We next
propose a step-by-step LLM rewrite method that iteratively leverages the
retrieved evidences to select and arrange rewrite rules with self-reflection.
We conduct comprehensive experiments on widely used benchmarks, and demonstrate
the superior performance of our system, R-Bot, surpassing state-of-the-art
query rewrite methods. [LINK]http://arxiv.org/abs/2412.01661v1 [DATE]2024-12-03 00:13:04+08:00 [CATEGORIES]cs.CLcs.LG
Learning to Predict Structural Vibrations [AUTHORS]Jan van Delden, Julius Schultz, Christopher Blech, Sabine C. Langer, Timo Lüddecke [ABSTRACT]In mechanical structures like airplanes, cars and houses, noise is generated
and transmitted through vibrations. To take measures to reduce this noise,
vibrations need to be simulated with expensive numerical computations. Deep
learning surrogate models present a promising alternative to classical
numerical simulations as they can be evaluated orders of magnitude faster, while
trading off accuracy. To quantify such trade-offs systematically and foster the
development of methods, we present a benchmark on the task of predicting the
vibration of harmonically excited plates. The benchmark features a total of
12,000 plate geometries with varying forms of beadings, material, boundary
conditions, load position and sizes with associated numerical solutions. To
address the benchmark task, we propose a new network architecture, named
Frequency-Query Operator, which predicts vibration patterns of plate geometries
given a specific excitation frequency. Applying principles from operator
learning and implicit models for shape encoding, our approach effectively
addresses the prediction of highly variable frequency response functions
occurring in dynamic systems. To quantify the prediction quality, we introduce
a set of evaluation metrics and evaluate the method on our vibrating-plates
benchmark. Our method outperforms DeepONets, Fourier Neural Operators and more
traditional neural network architectures and can be used for design
optimization. Code, dataset and visualizations:
https://github.com/ecker-lab/Learning_Vibrating_Plates [COMMENTS]Accepted at NeurIPS 2024 [LINK]http://arxiv.org/abs/2310.05469v4 [DATE]2024-12-03 23:21:53+08:00 [CATEGORIES]cs.LG
Vector Optimization with Gaussian Process Bandits [AUTHORS]İlter Onat Korkmaz, Yaşar Cahit Yıldırım, Çağın Ararat, Cem Tekin [ABSTRACT]Learning problems in which multiple conflicting objectives must be considered
simultaneously often arise in various fields, including engineering, drug
design, and environmental management. Traditional methods for dealing with
multiple black-box objective functions, such as scalarization and
identification of the Pareto set under the componentwise order, have
limitations in incorporating objective preferences and exploring the solution
space accordingly. While vector optimization offers improved flexibility and
adaptability via specifying partial orders based on ordering cones, current
techniques designed for sequential experiments either suffer from high sample
complexity or lack theoretical guarantees. To address these issues, we propose
Vector Optimization with Gaussian Process (VOGP), a probably approximately
correct adaptive elimination algorithm that performs black-box vector
optimization using Gaussian process bandits. VOGP allows users to convey
objective preferences through ordering cones while performing efficient
sampling by exploiting the smoothness of the objective function, resulting in a
more effective optimization process that requires fewer evaluations. We
establish theoretical guarantees for VOGP and derive information gain-based and
kernel-specific sample complexity bounds. We also conduct experiments on both
real-world and synthetic datasets to compare VOGP with the state-of-the-art
methods. [LINK]http://arxiv.org/abs/2412.02484v1 [DATE]2024-12-03 22:47:46+08:00 [CATEGORIES]cs.LG
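The notion of a partial order induced by an ordering cone, mentioned in the VOGP entry above, can be illustrated with a small sketch. It assumes a polyhedral cone written as C = {z : Wz >= 0} and a maximization convention in which y dominates x when y - x lies in C and y differs from x. The helper names and the sample points are hypothetical, and this is not the VOGP algorithm itself (there is no Gaussian process or sampling here).

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def in_cone(z: Vector, W: Matrix, tol: float = 1e-12) -> bool:
    """Return True if z lies in the polyhedral cone C = {z : W z >= 0}."""
    return all(sum(w * v for w, v in zip(row, z)) >= -tol for row in W)


def dominates(y: Vector, x: Vector, W: Matrix) -> bool:
    """y dominates x under the cone order if y - x is a nonzero element of C."""
    diff = [a - b for a, b in zip(y, x)]
    return any(abs(d) > 1e-12 for d in diff) and in_cone(diff, W)


def pareto_set(points: List[Vector], W: Matrix) -> List[Vector]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p, W) for q in points)]


points = [(10, 0), (0, 10), (6, 6), (2, 2)]

# Identity matrix: C is the positive orthant, i.e. the componentwise order.
componentwise = [[1, 0], [0, 1]]
# A wider cone (it contains the orthant) encodes a preference for balanced points.
wide_cone = [[5, 4], [4, 5]]

print(pareto_set(points, componentwise))
print(pareto_set(points, wide_cone))
```

Enlarging the cone makes more pairs of points comparable, so the Pareto set can only shrink; this is one way objective preferences can be expressed without collapsing the objectives into a single scalar.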
SpaCE: The Spatial Confounding Environment [AUTHORS]Mauricio Tec, Ana Trisovic, Michelle Audirac, Sophie Woodward, Jie Kate Hu, Naeem Khoshnevis, Francesca Dominici [ABSTRACT]Spatial confounding poses a significant challenge in scientific studies
involving spatial data, where unobserved spatial variables can influence both
treatment and outcome, possibly leading to spurious associations. To address
this problem, we introduce SpaCE: The Spatial Confounding Environment, the
first toolkit to provide realistic benchmark datasets and tools for
systematically evaluating causal inference methods designed to alleviate
spatial confounding. Each dataset includes training data, true counterfactuals,
a spatial graph with coordinates, and smoothness and confounding scores
characterizing the effect of a missing spatial confounder. It also includes
realistic semi-synthetic outcomes and counterfactuals, generated using
state-of-the-art machine learning ensembles, following best practices for
causal inference benchmarks. The datasets cover real treatment and covariates
from diverse domains, including climate, health and social sciences. SpaCE
facilitates an automated end-to-end pipeline, simplifying data loading,
experimental setup, and evaluating machine learning and causal inference
models. The SpaCE project provides several dozen datasets of diverse sizes
and spatial complexity. It is publicly available as a Python package,
encouraging community feedback and contributions. [LINK]http://arxiv.org/abs/2312.00710v3 [DATE]2024-12-03 22:45:03+08:00 [CATEGORIES]cs.LG
OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations [AUTHORS]Caixin Kang, Yubo Chen, Shouwei Ruan, Shiji Zhao, Ruochen Zhang, Jiayi Wang, Shan Fu, Xingxing Wei [ABSTRACT]With the rise of deep learning, facial recognition technology has seen
extensive research and rapid development. Although facial recognition is
considered a mature technology, we find that existing open-source models and
commercial algorithms lack robustness in certain real-world Out-of-Distribution
(OOD) scenarios, raising concerns about the reliability of these systems. In
this paper, we introduce OODFace, which explores the OOD challenges faced by
facial recognition models from two perspectives: common corruptions and
appearance variations. We systematically design 30 OOD scenarios across 9 major
categories tailored for facial recognition. By simulating these challenges on
public datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V,
and YTF-C/V. We then conduct extensive experiments on 19 different facial
recognition models and 3 commercial APIs, along with extended experiments on
face masks, Vision-Language Models (VLMs), and defense strategies to assess
their robustness. Based on the results, we draw several key insights,
highlighting the vulnerability of facial recognition systems to OOD data and
suggesting possible solutions. Additionally, we offer a unified toolkit that
includes all corruption and variation types, easily extendable to other
datasets. We hope that our benchmarks and findings can provide guidance for
future improvements in facial recognition model robustness. [LINK]http://arxiv.org/abs/2412.02479v1 [DATE]2024-12-03 22:42:31+08:00 [CATEGORIES]cs.LG
A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [AUTHORS]Yan Scholten, Stephan Günnemann, Leo Schwinn [ABSTRACT]Comprehensive evaluation of Large Language Models (LLMs) is an open research
problem. Existing evaluations rely on deterministic point estimates generated
via greedy decoding. However, we find that deterministic evaluations fail to
capture the whole output distribution of a model, yielding inaccurate
estimations of model capabilities. This is particularly problematic in critical
contexts such as unlearning and alignment, where precise model evaluations are
crucial. To remedy this, we introduce the first formal probabilistic evaluation
framework in LLMs. Namely, we derive novel metrics with high-probability
guarantees concerning the output distribution of a model. Our metrics are
application-independent and allow practitioners to make more reliable estimates
about model capabilities before deployment. Through a case study focused on
unlearning, we reveal that deterministic evaluations falsely indicate
successful unlearning, whereas our probabilistic evaluations demonstrate that
most if not all of the supposedly unlearned information remains accessible in
these models. Additionally, we propose a novel unlearning loss based on entropy
optimization and adaptive temperature scaling, which significantly improves
unlearning in probabilistic settings on recent benchmarks. Our proposed shift
from point estimates to probabilistic evaluations of output distributions
represents an important step toward comprehensive evaluations of LLMs. Code
available at https://github.com/yascho/probabilistic-unlearning. [LINK]http://arxiv.org/abs/2410.03523v4 [DATE]2024-12-03 22:31:41+08:00 [CATEGORIES]cs.LG
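The contrast drawn in the entry above between a single greedy point estimate and an evaluation of the output distribution can be sketched as follows. This is not the authors' metric: it is a generic illustration that samples repeatedly from a toy model and reports a two-sided Hoeffding interval on the leak probability. The toy model, its 20% leak rate, and all function names are hypothetical.

```python
import math
import random
from typing import Tuple


def toy_model_greedy() -> str:
    """Hypothetical model under greedy decoding: always the most likely output."""
    return "I don't know"


def toy_model_sample(rng: random.Random) -> str:
    """Hypothetical model under sampling: leaks the secret 20% of the time."""
    return "SECRET" if rng.random() < 0.2 else "I don't know"


def hoeffding_interval(leaks: int, n: int, delta: float = 0.05) -> Tuple[float, float]:
    """Two-sided Hoeffding interval that holds with probability >= 1 - delta."""
    rate = leaks / n
    margin = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, rate - margin), min(1.0, rate + margin)


rng = random.Random(0)
n = 1000
leaks = sum(toy_model_sample(rng) == "SECRET" for _ in range(n))
lower, upper = hoeffding_interval(leaks, n)

print("greedy output leaks:", toy_model_greedy() == "SECRET")
print(f"sampled leak rate: {leaks / n:.3f}, interval: [{lower:.3f}, {upper:.3f}]")
```

In this toy setting the greedy check reports no leak, while the sampled estimate gives a lower bound well above zero, which is the kind of discrepancy a point estimate cannot reveal.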
Time-Series-Informed Closed-loop Learning for Sequential Decision Making and Control [AUTHORS]Sebastian Hirt, Lukas Theiner, Rolf Findeisen [ABSTRACT]Closed-loop performance of sequential decision making algorithms, such as
model predictive control, depends strongly on the parameters of cost functions,
models, and constraints. Bayesian optimization is a common approach to learning
these parameters based on closed-loop experiments. However, traditional
Bayesian optimization approaches treat the learning problem as a black box,
ignoring valuable information and knowledge about the structure of the
underlying problem, resulting in slow convergence and high experimental
resource use. We propose a time-series-informed optimization framework that
incorporates intermediate performance evaluations from early iterations of each
experimental episode into the learning procedure. Additionally, probabilistic
early stopping criteria are proposed to terminate unpromising experiments,
significantly reducing experimental time. Simulation results show that our
approach achieves baseline performance with approximately half the resources.
Moreover, with the same resource budget, our approach outperforms the baseline
in terms of final closed-loop performance, highlighting its efficiency in
sequential decision making scenarios. [COMMENTS]12 pages, 3 figures, submitted to L4DC 2025 [LINK]http://arxiv.org/abs/2412.02423v1 [DATE]2024-12-03 20:38:53+08:00 [CATEGORIES]cs.LG
OMENN: One Matrix to Explain Neural Networks [AUTHORS]Adam Wróbel, Mikołaj Janusz, Bartosz Zieliński, Dawid Rymarczyk [ABSTRACT]Deep Learning (DL) models are often black boxes, making their decision-making
processes difficult to interpret. This lack of transparency has driven
advancements in eXplainable Artificial Intelligence (XAI), a field dedicated to
clarifying the reasoning behind DL model predictions. Among these,
attribution-based methods such as LRP and GradCAM are widely used, though they
rely on approximations that can be imprecise.
To address these limitations, we introduce One Matrix to Explain Neural
Networks (OMENN), a novel post-hoc method that represents a neural network as a
single, interpretable matrix for each specific input. This matrix is
constructed through a series of linear transformations that represent the
processing of the input by each successive layer in the neural network. As a
result, OMENN provides locally precise, attribution-based explanations of the
input across various modern models, including ViTs and CNNs. We present a
theoretical analysis of OMENN based on the dynamic linearity property and validate
its effectiveness with extensive tests on two XAI benchmarks, demonstrating
that OMENN is competitive with state-of-the-art methods. [COMMENTS]Under review, code will be released after acceptance [LINK]http://arxiv.org/abs/2412.02399v1 [DATE]2024-12-03 19:49:01+08:00 [CATEGORIES]cs.LG
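As a rough illustration of the "one matrix per input" idea, the sketch below uses a tiny bias-free ReLU network, which is an assumption made here for brevity and not the architecture or procedure from the paper. For a fixed input, each ReLU behaves like a 0/1 mask, so the two layers collapse into a single matrix whose product with the input reproduces the network output. The helper names (`matvec`, `matmul`, `forward`, `input_matrix`) and the weights are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def forward(W1, W2, x):
    """Ordinary forward pass of a bias-free two-layer ReLU network."""
    hidden = [max(0.0, z) for z in matvec(W1, x)]
    return matvec(W2, hidden)

def input_matrix(W1, W2, x):
    """Single matrix equal to the network's action on this particular x."""
    pre = matvec(W1, x)
    # Zero the rows of W1 whose ReLU is inactive for this input.
    masked_W1 = [row if z > 0 else [0.0] * len(row) for row, z in zip(W1, pre)]
    return matmul(W2, masked_W1)

# Toy weights, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 1.0]]
W2 = [[1.0, 2.0, -1.0]]
x = [2.0, 1.0]

M = input_matrix(W1, W2, x)
print(M, matvec(M, x), forward(W1, W2, x))
```

Each entry of `M` multiplied by the matching input feature gives that feature's exact contribution to the output for this input; a different input may activate different units and therefore yield a different matrix.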
Training for Speech Recognition on Coprocessors [AUTHORS]Sebastian Baunsgaard, Sebastian B. Wrede, Pınar Tozun [ABSTRACT]Automatic Speech Recognition (ASR) has increased in popularity in recent
years. The evolution of processor and storage technologies has enabled more
advanced ASR mechanisms, fueling the development of virtual assistants such as
Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Home. The interest in
such assistants, in turn, has amplified the novel developments in ASR research.
However, despite this popularity, there has not been a detailed training
efficiency analysis of modern ASR systems. This mainly stems from: the
proprietary nature of many modern applications that depend on ASR, like the
ones listed above; the relatively expensive co-processor hardware that is used
to accelerate ASR by big vendors to enable such applications; and the absence
of well-established benchmarks. The goal of this paper is to address the latter
two of these challenges. The paper first describes an ASR model, based on a
deep neural network inspired by recent work in this domain, and our experiences
building it. Then we evaluate this model on three CPU-GPU co-processor
platforms that represent different budget categories. Our results demonstrate
that utilizing hardware acceleration yields good results even without high-end
equipment. While the most expensive platform (10x the price of the least expensive
one) converges to the initial accuracy target 10-30% and 60-70% faster than the
other two, the differences among the platforms almost disappear at slightly
higher accuracy targets. In addition, our results further highlight both the
difficulty of evaluating ASR systems due to the complex, long, and resource
intensive nature of the model training in this domain, and the importance of
establishing benchmarks for ASR. [COMMENTS]published at ADMS 2020 [LINK]http://arxiv.org/abs/2003.12366v2 [DATE]2024-12-03 19:13:27+08:00 [CATEGORIES]cs.LG
Flow Matching for Accelerated Simulation of Atomic Transport in Materials [AUTHORS]Juno Nam, Sulin Liu, Gavin Winter, KyuJung Jun, Soojung Yang, Rafael Gómez-Bombarelli [ABSTRACT]We introduce LiFlow, a generative framework to accelerate molecular dynamics
(MD) simulations for crystalline materials that formulates the task as
conditional generation of atomic displacements. The model uses flow matching,
with a Propagator submodel to generate atomic displacements and a Corrector to
locally correct unphysical geometries, and incorporates an adaptive prior based
on the Maxwell-Boltzmann distribution to account for chemical and thermal
conditions. We benchmark LiFlow on a dataset comprising 25-ps trajectories of
lithium diffusion across 4,186 solid-state electrolyte (SSE) candidates at four
temperatures. The model obtains a consistent Spearman rank correlation of
0.7-0.8 for lithium mean squared displacement (MSD) predictions on unseen
compositions. Furthermore, LiFlow generalizes from short training trajectories
to larger supercells and longer simulations while maintaining high accuracy.
With speed-ups of up to 600,000$\times$ compared to first-principles methods,
LiFlow enables scalable simulations at significantly larger length and time
scales. [LINK]http://arxiv.org/abs/2410.01464v2 [DATE]2024-12-03 18:01:06+08:00 [CATEGORIES]cs.LG
Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous Autonomous Surface Vehicles with Deep Reinforcement Learning [AUTHORS]Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín [ABSTRACT]This paper presents a model-free deep reinforcement learning framework for
informative path planning with heterogeneous fleets of autonomous surface
vehicles to locate and collect plastic waste. The system employs two teams of
vehicles: scouts and cleaners. Coordination between these teams is achieved
through a deep reinforcement learning approach, allowing agents to learn strategies to
maximize cleaning efficiency. The primary objective is for the scout team to
provide an up-to-date contamination model, while the cleaner team collects as
much waste as possible following this model. This strategy leads to
heterogeneous teams that optimize fleet efficiency through inter-team
cooperation supported by a tailored reward function. Different trainings of the
proposed algorithm are compared with other state-of-the-art heuristics in two
distinct scenarios, one with high convexity and another with narrow corridors
and challenging access. The results demonstrate
that deep reinforcement learning based algorithms outperform other benchmark
heuristics, exhibiting superior adaptability. In addition, training with greedy
actions further enhances performance, particularly in scenarios with intricate
layouts. [COMMENTS]This article is currently under revision for the Robotics and
Automation Letters (IEEE) [LINK]http://arxiv.org/abs/2412.02316v1 [DATE]2024-12-03 17:32:02+08:00 [CATEGORIES]cs.LG
Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods [AUTHORS]Jiamian Hu, Yuanyuan Hong, Yihua Chen, He Wang, Moriaki Yasuhara [ABSTRACT]We present the Noisy Ostracods, a noisy dataset for genus and species
classification of crustacean ostracods with specialists' annotations. Of the
71466 specimens collected, 5.58% are estimated to be noisy (possibly
problematic) at the genus level. The dataset was created to address a real-world
challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods
dataset has diverse noises from multiple sources. Firstly, the noise is
open-set, including new classes discovered during curation that were not part
of the original annotation. The dataset has pseudo-classes, where annotators
misclassified samples that should belong to an existing class into a new
pseudo-class. The Noisy Ostracods dataset is highly imbalanced with an imbalance
factor $\rho$ = 22429. This presents a unique challenge for robust machine
learning methods, as existing approaches have not been extensively evaluated on
fine-grained classification tasks with such diverse real-world noise. Initial
experiments using current robust learning techniques have not yielded
significant performance improvements on the Noisy Ostracods dataset compared to
cross-entropy training on the raw, noisy data. On the other hand, noise
detection methods have underperformed in error hit rate compared to naive
cross-validation ensembling for identifying problematic labels. These findings
suggest that the fine-grained, imbalanced nature, and complex noise
characteristics of the dataset present considerable challenges for existing
noise-robust algorithms. By openly releasing the Noisy Ostracods dataset, our
goal is to encourage further research into the development of noise-resilient
machine learning methods capable of effectively handling diverse, real-world
noise in fine-grained classification tasks. The dataset, along with its
evaluation protocols, can be accessed at
https://github.com/H-Jamieu/Noisy_ostracods. [COMMENTS]Initial submit [LINK]http://arxiv.org/abs/2412.02313v1 [DATE]2024-12-03 17:30:57+08:00 [CATEGORIES]cs.LG
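The abstract reports an imbalance factor $\rho$ without defining it. A common convention, assumed here and not confirmed by the source, is the ratio between the sizes of the largest and smallest classes. The sketch below computes that ratio on made-up toy labels; `imbalance_factor` is a hypothetical helper name.

```python
from collections import Counter

def imbalance_factor(labels):
    """Ratio of the largest class count to the smallest class count.

    Assumes the max/min convention; the dataset paper may define rho differently.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy labels, invented for illustration only.
toy_labels = ["a"] * 90 + ["b"] * 9 + ["c"] * 1
rho = imbalance_factor(toy_labels)
print(rho)
```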
LLM-ABBA: Understanding time series via symbolic approximation [AUTHORS]Erin Carson, Xinye Chen, Cheng Kang [ABSTRACT]The success of large language models (LLMs) for time series has been
demonstrated in previous work. Utilizing a symbolic time series representation,
one can efficiently bridge the gap between LLMs and time series. However, the
remaining challenge is to exploit the semantic information hidden in time
series by using symbols or existing tokens of LLMs, while aligning the
embedding space of LLMs according to the hidden information of time series. The
symbolic time series approximation (STSA) method called adaptive Brownian
bridge-based symbolic aggregation (ABBA) shows outstanding efficacy in
preserving salient time series features by modeling time series patterns in
terms of amplitude and period while using existing tokens of LLMs.
In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA
into large language models for various downstream time series tasks. By
symbolizing time series, LLM-ABBA compares favorably to the recent
state-of-the-art (SOTA) in UCR and three medical time series classification
tasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to
avoid obvious drifting during prediction tasks by significantly mitigating
the effects of cumulative error arising from misused symbols during the
transition from symbols to numerical values. In time series regression tasks,
LLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER)
benchmarks. LLM-ABBA also shows competitive prediction capability compared to
recent SOTA time series prediction results. We believe this framework can also
seamlessly extend to other time series tasks. [LINK]http://arxiv.org/abs/2411.18506v2 [DATE]2024-12-03 17:25:11+08:00 [CATEGORIES]cs.LG
BInD: Bond and Interaction-generating Diffusion Model for Multi-objective Structure-based Drug Design [AUTHORS]Joongwon Lee, Wonho Zhung, Jisu Seo, Woo Youn Kim [ABSTRACT]A remarkable advance in geometric deep generative models with accumulated
structural data enables structure-based drug design (SBDD) with target protein
information only. However, most existing models struggle to address
multi-objectives simultaneously while performing well only in their specialized
tasks. Here, we present BInD, a diffusion model with knowledge-based guidance
for multi-objective SBDD. BInD is designed to co-generate molecules and their
interactions with a target protein to consider all key objectives equally well,
including target-specific interactions, molecular properties, and local
geometry. Comprehensive evaluations show that BInD achieves robust performance
for all objectives while outperforming or matching state-of-the-art methods for
each. Finally, we propose a train-free optimization method empowered by
retrieving target-specific interactions, highlighting the role of non-covalent
interactions in achieving higher selectivity and binding affinities to a target
protein. [LINK]http://arxiv.org/abs/2405.16861v2 [DATE]2024-12-03 17:17:43+08:00 [CATEGORIES]cs.LG
CADMR: Cross-Attention and Disentangled Learning for Multimodal Recommender Systems [AUTHORS]Yasser Khalafaoui, Martino Lovisetto, Basarab Matei, Nistor Grozavu [ABSTRACT]The increasing availability and diversity of multimodal data in recommender
systems offer new avenues for enhancing recommendation accuracy and user
satisfaction. However, these systems must contend with high-dimensional, sparse
user-item rating matrices, where reconstructing the matrix with only small
subsets of preferred items for each user poses a significant challenge. To
address this, we propose CADMR, a novel autoencoder-based multimodal
recommender system framework. CADMR leverages multi-head cross-attention
mechanisms and Disentangled Learning to effectively integrate and utilize
heterogeneous multimodal data in reconstructing the rating matrix. Our approach
first disentangles modality-specific features while preserving their
interdependence, thereby learning a joint latent representation. The multi-head
cross-attention mechanism is then applied to enhance user-item interaction
representations with respect to the learned multimodal item latent
representations. We evaluate CADMR on three benchmark datasets, demonstrating
significant performance improvements over state-of-the-art methods. [LINK]http://arxiv.org/abs/2412.02295v1 [DATE]2024-12-03 17:09:52+08:00 [CATEGORIES]cs.LG
Deep Matrix Factorization with Adaptive Weights for Multi-View Clustering [AUTHORS]Yasser Khalafaoui, Basarab Matei, Martino Lovisetto, Nistor Grozavu [ABSTRACT]Recently, deep matrix factorization has been established as a powerful model
for unsupervised tasks, achieving promising results, especially for multi-view
clustering. However, existing methods often lack effective feature selection
mechanisms and rely on empirical hyperparameter selection. To address these
issues, we introduce a novel Deep Matrix Factorization with Adaptive Weights
for Multi-View Clustering (DMFAW). Our method simultaneously incorporates
feature selection and generates local partitions, enhancing clustering results.
Notably, the feature weights are controlled and adjusted by a parameter that
is dynamically updated using a control-theory-inspired mechanism, which not only
improves the model's stability and adaptability to diverse datasets but also
accelerates convergence. A late fusion approach is then proposed to align the
weighted local partitions with the consensus partition. Finally, the
optimization problem is solved via an alternating optimization algorithm with
theoretically guaranteed convergence. Extensive experiments on benchmark
datasets highlight that DMFAW outperforms state-of-the-art methods in terms of
clustering performance. [LINK]http://arxiv.org/abs/2412.02292v1 [DATE]2024-12-03 17:08:27+08:00 [CATEGORIES]cs.LG
Conformal Symplectic Optimization for Stable Reinforcement Learning [AUTHORS]Yao Lyu, Xiangteng Zhang, Shengbo Eben Li, Jingliang Duan, Letian Tao, Qing Xu, Lei He, Keqiang Li [ABSTRACT]Training deep reinforcement learning (RL) agents necessitates overcoming the
highly unstable nonconvex stochastic optimization inherent in the
trial-and-error mechanism. To tackle this challenge, we propose a
physics-inspired optimization algorithm called relativistic adaptive gradient
descent (RAD), which enhances long-term training stability. By conceptualizing
neural network (NN) training as the evolution of a conformal Hamiltonian
system, we present a universal framework for transferring long-term stability
from conformal symplectic integrators to iterative NN updating rules, where the
choice of kinetic energy governs the dynamical properties of resulting
optimization algorithms. By utilizing relativistic kinetic energy, RAD
incorporates principles from special relativity and limits parameter updates
below a finite speed, effectively mitigating abnormal gradient influences.
Additionally, RAD models NN optimization as the evolution of a multi-particle
system where each trainable parameter acts as an independent particle with an
individual adaptive learning rate. We prove RAD's sublinear convergence under
general nonconvex settings, where smaller gradient variance and larger batch
sizes contribute to tighter convergence. Notably, RAD degrades to the
well-known adaptive moment estimation (ADAM) algorithm when its speed
coefficient is chosen as one and symplectic factor as a small positive value.
Experimental results show RAD outperforming nine baseline optimizers with five
RL algorithms across twelve environments, including standard benchmarks and
challenging scenarios. Notably, RAD achieves up to a 155.1% performance
improvement over ADAM in Atari games, showcasing its efficacy in stabilizing
and accelerating RL training. [LINK]http://arxiv.org/abs/2412.02291v1 [DATE]2024-12-03 17:07:31+08:00 [CATEGORIES]cs.LG
Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control [AUTHORS]Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan [ABSTRACT]Sample efficiency in Reinforcement Learning (RL) has traditionally been
driven by algorithmic enhancements. In this work, we demonstrate that scaling
can also lead to substantial improvements. We conduct a thorough investigation
into the interplay of scaling model capacity and domain-specific RL
enhancements. These empirical findings inform the design choices underlying our
proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation
behind BRO is that strong regularization allows for effective scaling of the
critic networks, which, paired with optimistic exploration, leads to superior
performance. BRO achieves state-of-the-art results, significantly outperforming
the leading model-based and model-free algorithms across 40 complex tasks from
the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first
model-free algorithm to achieve near-optimal policies in the notoriously
challenging Dog and Humanoid tasks. [COMMENTS]NeurIPS 2024 Spotlight [LINK]http://arxiv.org/abs/2405.16158v3 [DATE]2024-12-03 16:42:49+08:00 [CATEGORIES]cs.LG
Normalizing self-supervised learning for provably reliable Change Point Detection [AUTHORS]Alexandra Bazarova, Evgenia Romanenkova, Alexey Zaytsev [ABSTRACT]Change point detection (CPD) methods aim to identify abrupt shifts in the
distribution of input data streams. Accurate estimators for this task are
crucial across various real-world scenarios. Yet, traditional unsupervised CPD
techniques face significant limitations, often relying on strong assumptions or
suffering from low expressive power due to inherent model simplicity. In
contrast, representation learning methods overcome these drawbacks by offering
flexibility and the ability to capture the full complexity of the data without
imposing restrictive assumptions. However, these approaches are still emerging
in the CPD field and lack robust theoretical foundations to ensure their
reliability. Our work addresses this gap by integrating the expressive power of
representation learning with the groundedness of traditional CPD techniques. We
adopt spectral normalization (SN) for deep representation learning in CPD tasks
and prove that the embeddings after SN are highly informative for CPD. Our
method significantly outperforms current state-of-the-art methods during the
comprehensive evaluation via three standard CPD datasets. [LINK]http://arxiv.org/abs/2410.13637v2 [DATE]2024-12-03 16:29:54+08:00 [CATEGORIES]cs.LG
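For readers unfamiliar with spectral normalization, the following is a minimal sketch of the generic technique: estimate the largest singular value of a weight matrix by power iteration and divide the weights by it. It is not the authors' implementation, the function name and toy matrix are hypothetical, and it assumes a nonzero matrix.

```python
import math

def spectral_norm(W, iterations=50):
    """Estimate the largest singular value of W (list of rows) by power iteration."""
    n_rows, n_cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]               # u = W v
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]                                           # renormalize
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(x * x for x in u))                                 # ||W v||

# Toy weight matrix, chosen only for illustration.
W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = [[w / sigma for w in row] for row in W]  # normalized weights
print(sigma, spectral_norm(W_sn))
```

After normalization the layer's largest singular value is approximately one, which bounds how much the layer can stretch its input.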
Learning from Reduced Labels for Long-Tailed Data [AUTHORS]Meng Wei, Zhongnian Li, Yong Zhou, Xinzheng Xu [ABSTRACT]Long-tailed data is prevalent in real-world classification tasks and heavily
relies on supervised information, which makes the annotation process
exceptionally labor-intensive and time-consuming. Unfortunately, despite being
a common approach to mitigate labeling costs, existing weakly supervised
learning methods struggle to adequately preserve supervised information for
tail samples, resulting in a decline in accuracy for the tail classes. To
alleviate this problem, we introduce a novel weakly supervised labeling setting
called Reduced Label. The proposed labeling setting not only avoids the decline
of supervised information for the tail samples, but also decreases the labeling
costs associated with long-tailed data. Additionally, we propose a
straightforward and highly efficient unbiased framework with strong theoretical
guarantees to learn from these Reduced Labels. Extensive experiments conducted
on benchmark datasets including ImageNet validate the effectiveness of our
approach, surpassing the performance of state-of-the-art weakly supervised
methods. [COMMENTS]11 pages, 3 figures [LINK]http://arxiv.org/abs/2403.16469v2 [DATE]2024-12-03 16:18:17+08:00 [CATEGORIES]cs.LG
Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs [AUTHORS]Zixuan Hu, Yongxian Wei, Li Shen, Chun Yuan, Dacheng Tao [ABSTRACT]Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot
adaptability without requiring fine-tuning, positioning them as ideal for
data-limited and real-time applications. However, this adaptability has not yet
been replicated in current Visual Foundation Models (VFMs), which require
explicit fine-tuning with sufficient tuning data. Besides, the
pretraining-finetuning paradigm has led to the surge of numerous task-specific
modular components, such as Low-Rank Adaptation (LoRA). For the first time, we
explore the potential of reusing diverse pre-tuned LoRAs without accessing
their original training data, to achieve tuning-free few-shot adaptation in
VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned
LoRAs with a meta-learning objective, using surrogate data generated inversely
from pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is
empowered to solve new few-shot tasks in a single forward pass, akin to the
in-context learning of LLMs. Additionally, we incorporate a double-efficient
mechanism tailored to our framework, significantly accelerating the
meta-training process while maintaining or even improving performance.
Extensive experiments on various few-shot classification benchmarks across
both in- and cross-domain scenarios demonstrate the superiority of our
framework. [LINK]http://arxiv.org/abs/2412.02220v1 [DATE]2024-12-03 15:25:30+08:00 [CATEGORIES]cs.LG
Recovering implicit physics model under real-world constraints [AUTHORS]Ayan Banerjee, Sandeep K. S. Gupta [ABSTRACT]Recovering a physics-driven model, i.e. a governing set of equations of the
underlying dynamical systems, from real-world data has been of recent
interest. Most existing methods either operate on simulation data with
unrealistically high sampling rates or require explicit measurements of all
system variables, which is not feasible in real-world deployments. Moreover,
they assume the timestamps of external perturbations to the physical system are
known a priori, without uncertainty, implicitly discounting any sensor
time-synchronization or human reporting errors. In this paper, we propose a
novel liquid time constant neural network (LTC-NN) based architecture to
recover the underlying model of physical dynamics from real-world data. The
automatic differentiation property of LTC-NN nodes overcomes problems
associated with low sampling rates; the input-dependent time constant in the
forward pass of the hidden layer of LTC-NN nodes creates a massive search space
of implicit physical dynamics; the physics-model-solver-based data
reconstruction loss guides the search for the correct set of implicit dynamics;
and the use of dropout regularization in the dense layer ensures extraction
of the sparsest model. Further, to account for the perturbation timing error,
we utilize dense layer nodes to search through input shifts that result in the
lowest reconstruction loss. Experiments on four benchmark dynamical systems,
three with simulation data and one with real-world data, show that the
LTC-NN architecture is more accurate in recovering implicit physics model
coefficients than state-of-the-art sparse model recovery approaches. We
also introduce four additional case studies (eight in total) on real-life medical
examples in simulation and with real-world clinical data to show the effectiveness
of our approach in recovering the underlying model in practice. [COMMENTS]This paper is published in ECAI 2024,
https://ebooks.iospress.nl/volumearticle/69651 [LINK]http://arxiv.org/abs/2412.02215v1 [DATE]2024-12-03 15:11:21+08:00 [CATEGORIES]cs.LG
Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [AUTHORS]Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, Chelsea Finn [ABSTRACT]Predicting and executing a sequence of actions without intermediate
replanning, known as action chunking, is increasingly used in robot learning
from human demonstrations. Yet, its reported effects on the learned policy are
inconsistent: some studies find it crucial for achieving strong results, while
others observe decreased performance. In this paper, we first dissect how
action chunking impacts the divergence between a learner and a demonstrator. We
find that action chunking allows the learner to better capture the temporal
dependencies in demonstrations but at the cost of reduced reactivity in
stochastic environments. To address this tradeoff, we propose Bidirectional
Decoding (BID), a test-time inference algorithm that bridges action chunking
with closed-loop operations. BID samples multiple predictions at each time step
and searches for the optimal one based on two criteria: (i) backward coherence,
which favors samples that align with previous decisions; (ii) forward contrast,
which seeks samples of high likelihood for future plans. By coupling decisions
within and across action chunks, BID promotes consistency over time while
maintaining reactivity to unexpected changes. Experimental results show that
BID boosts the performance of two state-of-the-art generative policies across
seven simulation benchmarks and two real-world tasks. Code and videos are
available at https://bid-robot.github.io. [COMMENTS]Project website: https://bid-robot.github.io/ [LINK]http://arxiv.org/abs/2408.17355v3 [DATE]2024-12-03 14:53:58+08:00 [CATEGORIES]cs.LG
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey [AUTHORS]Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu [ABSTRACT]Recent research demonstrates that the nascent fine-tuning-as-a-service
business model exposes serious safety concerns -- fine-tuning over a few
harmful data uploaded by the users can compromise the safety alignment of the
model. The attack, known as harmful fine-tuning attack, has raised a broad
research interest among the community. However, as the attack is still new,
we observe that there are general misunderstandings within the research
community. To clear up these concerns, this paper provides a comprehensive overview of
three aspects of harmful fine-tuning: attack settings, defense design, and
evaluation methodology. Specifically, we first present the threat model of the
problem, and introduce the harmful fine-tuning attack and its variants. Then we
systematically survey the existing literature on attacks/defenses/mechanical
analysis of the problem. Finally, we introduce the evaluation methodology and
outline future research directions that might contribute to the development of
the field. Additionally, we present a list of questions of interest, which
might be useful to refer to when reviewers in the peer review process question
the realism of the experiment/attack/defense setting. A curated list of
relevant papers is maintained and made accessible at:
https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers. [LINK]http://arxiv.org/abs/2409.18169v5 [DATE]2024-12-03 14:52:11+08:00 [CATEGORIES]cs.LG
FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL [AUTHORS]Woosung Koh, Wonbeen Oh, Siyeol Kim, Suhin Shin, Hyeongjin Kim, Jaein Jang, Junghyun Lee, Se-Young Yun [ABSTRACT]Multi-agent reinforcement learning has demonstrated significant potential in
addressing complex cooperative tasks across various real-world applications.
However, existing MARL approaches often rely on the restrictive assumption that
the number of entities (e.g., agents, obstacles) remains constant between
training and inference. This overlooks scenarios where entities are dynamically
removed or added during the inference trajectory -- a common occurrence in
real-world environments like search and rescue missions and dynamic combat
situations. In this paper, we tackle the challenge of intra-trajectory dynamic
entity composition under zero-shot out-of-domain (OOD) generalization, where
such dynamic changes cannot be anticipated beforehand. Our empirical studies
reveal that existing MARL methods suffer significant performance degradation
and increased uncertainty in these scenarios. In response, we propose
FlickerFusion, a novel OOD generalization method that acts as a universally
applicable augmentation technique for MARL backbone methods. FlickerFusion
stochastically drops out parts of the observation space, emulating being
in-domain when inferenced OOD. The results show that FlickerFusion not only
achieves superior inference rewards but also uniquely reduces uncertainty
vis-à-vis the backbone, compared to existing methods. Benchmarks,
implementations, and model weights are organized and open-sourced at
flickerfusion305.github.io, accompanied by ample demo video renderings. [COMMENTS]NeurIPS '24 Open-World Agents Workshop [LINK]http://arxiv.org/abs/2410.15876v3 [DATE]2024-12-03 13:59:09+08:00 [CATEGORIES]cs.LG
Generalizing Weisfeiler-Lehman Kernels to Subgraphs [AUTHORS]Dongkwan Kim, Alice Oh [ABSTRACT]Subgraph representation learning has been effective in solving various
real-world problems. However, current graph neural networks (GNNs) produce
suboptimal results for subgraph-level tasks due to their inability to capture
complex interactions within and between subgraphs. To provide a more expressive
and efficient alternative, we propose WLKS, a Weisfeiler-Lehman (WL) kernel
generalized for subgraphs by applying the WL algorithm on induced $k$-hop
neighborhoods. We combine kernels across different $k$-hop levels to capture
richer structural information that is not fully encoded in existing models. Our
approach can balance expressiveness and efficiency by eliminating the need for
neighborhood sampling. In experiments on eight real-world and synthetic
benchmarks, WLKS significantly outperforms leading approaches on five datasets
while reducing training time, ranging from 0.01x to 0.25x compared to the
state-of-the-art. [COMMENTS]15 pages [LINK]http://arxiv.org/abs/2412.02181v1 [DATE]2024-12-03 13:35:44+08:00 [CATEGORIES]cs.LG
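To make the phrase "applying the WL algorithm on induced $k$-hop neighborhoods" concrete, here is a minimal sketch under stated assumptions: node labels start uniform, colors are explicit signature strings instead of hashes, the histogram counts every node in the induced neighborhood, and the kernel is a plain dot product of histograms. None of these choices are confirmed by the abstract, and all function names are hypothetical.

```python
from collections import Counter

def k_hop_nodes(adj, seeds, k):
    """Nodes within k hops of any seed node (k = 0 returns the seeds)."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(k):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen

def wl_histogram(adj, nodes, iterations=2):
    """Color histogram from WL refinement on the subgraph induced by `nodes`."""
    colors = {u: "0" for u in nodes}  # uniform initial label (assumption)
    hist = Counter(colors.values())
    for _ in range(iterations):
        # New color = own color plus the sorted multiset of neighbor colors,
        # restricted to neighbors inside the induced subgraph.
        colors = {
            u: colors[u] + "(" + ",".join(sorted(colors[v] for v in adj[u] if v in nodes)) + ")"
            for u in nodes
        }
        hist.update(colors.values())
    return hist

def wl_subgraph_kernel(adj, sub_a, sub_b, k, iterations=2):
    """Dot product of WL color histograms of the two k-hop neighborhoods."""
    hist_a = wl_histogram(adj, k_hop_nodes(adj, sub_a, k), iterations)
    hist_b = wl_histogram(adj, k_hop_nodes(adj, sub_b, k), iterations)
    return sum(hist_a[c] * hist_b[c] for c in hist_a)

# Toy path graph 0-1-2-3-4, invented for illustration.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(wl_subgraph_kernel(adj, {0, 1}, {3, 4}, k=0))
print(wl_subgraph_kernel(adj, {0, 1}, {3, 4}, k=1))
```

Kernels computed for several values of $k$ could then be combined, for instance by summation; how the paper combines them is not specified in the abstract.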
A Comprehensive Study of Shapley Value in Data Analytics [AUTHORS]Hong Lin, Shixin Wan, Zhongle Xie, Ke Chen, Meihui Zhang, Lidan Shou, Gang Chen [ABSTRACT]Over the recent years, Shapley value (SV), a solution concept from
cooperative game theory, has found numerous applications in data analytics
(DA). This paper provides the first comprehensive study of SV used throughout
the DA workflow, which involves three main steps: data fabric, data
exploration, and result reporting. We summarize existing versatile forms of SV
used in these steps by a unified definition and clarify the essential
functionalities that SV can provide for data scientists. We categorize existing work
in this field based on the technical challenges they tackled, which include
computation efficiency, approximation error, privacy preservation, and
appropriate interpretations. We discuss these challenges and analyze the
corresponding solutions. We also implement SVBench, the first open-sourced
benchmark for developing SV applications, and conduct experiments on six DA
tasks to validate our analysis and discussions. Based on the qualitative and
quantitative results, we identify the limitations of current efforts for
applying SV to DA and highlight the directions of future research and
engineering. [LINK]http://arxiv.org/abs/2412.01460v2 [DATE]2024-12-03 12:48:22+08:00 [CATEGORIES]cs.LG
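Illustrative sketch (not from the paper or from SVBench): the Shapley value discussed in the abstract above assigns each player its marginal contribution averaged over all orderings of the players. The Python below computes it exactly by enumeration, which is only feasible for a handful of players and hints at why computation efficiency and approximation error appear among the challenges listed. The function name `shapley_values` and the toy utility are hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Toy utility (made up): additive weights plus a bonus when "a" and "b" cooperate.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}


def toy_utility(coalition):
    bonus = 6.0 if {"a", "b"} <= coalition else 0.0
    return sum(weights[p] for p in coalition) + bonus


print(shapley_values(["a", "b", "c"], toy_utility))
# {'a': 4.0, 'b': 5.0, 'c': 3.0}
```

The bonus of 6 is split equally between "a" and "b", and the values sum to the utility of the full coalition (12), which is the efficiency property of the Shapley value.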
FSMLP: Modelling Channel Dependencies With Simplex Theory Based Multi-Layer Perceptions In Frequency Domain [AUTHORS]Zhengnan Li, Haoxuan Li, Hao Wang, Jun Fang, Duoyin Li, Yunxiao Qin [ABSTRACT]Time series forecasting (TSF) plays a crucial role in various domains,
including web data analysis, energy consumption prediction, and weather
forecasting. While Multi-Layer Perceptrons (MLPs) are lightweight and effective
for capturing temporal dependencies, they are prone to overfitting when used to
model inter-channel dependencies. In this paper, we investigate the overfitting
problem in channel-wise MLPs using Rademacher complexity theory, revealing that
extreme values in time series data exacerbate this issue. To mitigate this
issue, we introduce a novel Simplex-MLP layer, where the weights are
constrained within a standard simplex. This strategy encourages the model to
learn simpler patterns, thereby reducing overfitting to extreme values.
Based on the Simplex-MLP layer, we propose a novel Frequency Simplex MLP
(FSMLP) framework for time series forecasting, comprising two kinds of
modules: Simplex Channel-Wise MLP (SCWM) and Frequency Temporal MLP (FTM).
The SCWM effectively leverages the
Simplex-MLP to capture inter-channel dependencies, while the FTM is a simple
yet efficient temporal MLP designed to extract temporal information from the
data. Our theoretical analysis shows that the upper bound of the Rademacher
Complexity for Simplex-MLP is lower than that for standard MLPs. Moreover, we
validate our proposed method on seven benchmark datasets, demonstrating
significant improvements in forecasting accuracy and efficiency, while also
showcasing superior scalability. Additionally, we demonstrate that Simplex-MLP
can improve other methods that use channel-wise MLP to achieve less overfitting
and improved performance. Code is available at
https://github.com/FMLYD/FSMLP. [LINK]http://arxiv.org/abs/2412.01654v2 [DATE]2024-12-03 12:40:13+08:00 [CATEGORIES]cs.LG
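Illustrative sketch (not the authors' implementation): the abstract above describes a layer whose weights are constrained to a standard simplex. One common way to enforce such a constraint, assumed here, is to pass unconstrained parameters through a softmax so that each weight row is non-negative and sums to one. Each output channel then becomes a convex combination of the input channels, so a single extreme input value cannot be amplified. The names `softmax` and `simplex_linear` are hypothetical.

```python
import math


def softmax(row):
    """Map an unconstrained vector onto the standard simplex."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def simplex_linear(x, raw_weights):
    """Channel mixing where every weight row lies on the simplex.

    x           -- one value per input channel
    raw_weights -- one unconstrained row per output channel (assumed
                   parametrization; the abstract only states the constraint)
    """
    outputs = []
    for raw_row in raw_weights:
        w = softmax(raw_row)  # w_i >= 0 and sum(w) == 1
        outputs.append(sum(wi * xi for wi, xi in zip(w, x)))
    return outputs


# An extreme value in one channel stays bounded after mixing.
x = [1.0, 2.0, 100.0]
y = simplex_linear(x, [[0.0, 0.0, 0.0], [5.0, -1.0, 2.0]])
assert all(min(x) <= value <= max(x) for value in y)
```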
Revisiting the Initial Steps in Adaptive Gradient Descent Optimization [AUTHORS]Abulikemu Abuduweili, Changliu Liu [ABSTRACT]Adaptive gradient optimization methods, such as Adam, are prevalent in
training deep neural networks across diverse machine learning tasks due to
their ability to achieve faster convergence. However, these methods often
suffer from suboptimal generalization compared to stochastic gradient descent
(SGD) and exhibit instability, particularly when training Transformer models.
In this work, we identify the standard initialization of the second-order moment
estimation ($v_0 = 0$) as a significant factor contributing to these
limitations. We introduce simple yet effective solutions: initializing the
second-order moment estimation with non-zero values, using either data-driven
or random initialization strategies. Empirical evaluations demonstrate that our
approach not only stabilizes convergence but also enhances the final
performance of adaptive gradient optimizers. Furthermore, by adopting the
proposed initialization strategies, Adam achieves performance comparable to
many recently proposed variants of adaptive gradient optimization methods,
highlighting the practical impact of this straightforward modification. [COMMENTS]OPT workshop at NeurIPS 2024 [LINK]http://arxiv.org/abs/2412.02153v1 [DATE]2024-12-03 12:28:14+08:00 [CATEGORIES]cs.LG
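Illustrative sketch (not the authors' code): the abstract above concerns the initial value $v_0$ of Adam's second-moment estimate. The Python below is textbook Adam on a scalar with `v0` exposed as an argument. With `v0 = 0` the first step has magnitude close to the learning rate whatever the gradient scale; with a non-zero `v0` the first step depends on the gradient magnitude. The value `1e-4` is arbitrary, the paper's data-driven and random strategies are not reproduced, and keeping the standard bias correction unchanged is an assumption, since the abstract does not say how it is handled.

```python
import math


def adam_updates(grads, v0=0.0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Parameter updates of scalar Adam for a gradient sequence.

    v0 is the initial second-moment estimate (0.0 in standard Adam).
    """
    m, v = 0.0, v0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # standard bias correction (kept as-is)
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


for v0 in (0.0, 1e-4):  # 1e-4 is an arbitrary illustrative value
    small = adam_updates([0.01], v0=v0)[0]
    large = adam_updates([100.0], v0=v0)[0]
    print(f"v0={v0}: first step for g=0.01 -> {small:.2e}, for g=100 -> {large:.2e}")
```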
Towards Universal Mesh Movement Networks [AUTHORS]Mingrui Zhang, Chunyang Wang, Stephan Kramer, Joseph G. Wallwork, Siyi Li, Jiancheng Liu, Xiang Chen, Matthew D. Piggott [ABSTRACT]Solving complex Partial Differential Equations (PDEs) accurately and
efficiently is an essential and challenging problem in all scientific and
engineering disciplines. Mesh movement methods provide the capability to
improve the accuracy of the numerical solution without increasing the overall
mesh degree of freedom count. Conventional sophisticated mesh movement methods
are extremely expensive and struggle to handle scenarios with complex boundary
geometries. However, existing learning-based methods require re-training from
scratch given a different PDE type or boundary geometry, which limits their
applicability, and also often suffer from robustness issues in the form of
inverted elements. In this paper, we introduce the Universal Mesh Movement
Network (UM2N), which -- once trained -- can be applied in a non-intrusive,
zero-shot manner to move meshes with different size distributions and
structures, for solvers applicable to different PDE types and boundary
geometries. UM2N consists of a Graph Transformer (GT) encoder for extracting
features and a Graph Attention Network (GAT) based decoder for moving the mesh.
We evaluate our method on advection and Navier-Stokes based examples, as well
as a real-world tsunami simulation case. Our method outperforms existing
learning-based mesh movement methods in terms of the benchmarks described
above. In comparison to the conventional sophisticated Monge-Ampère
PDE-solver based method, our approach not only significantly accelerates mesh
movement, but also proves effective in scenarios where the conventional method
fails. Our project page is at https://erizmr.github.io/UM2N/. [COMMENTS]Accepted at NeurIPS 2024 as a spotlight paper [LINK]http://arxiv.org/abs/2407.00382v4 [DATE]2024-12-03 12:07:32+08:00 [CATEGORIES]cs.LG
Benchmarking symbolic regression constant optimization schemes [AUTHORS]L. G. A dos Reis, V. L. P. S. Caminha, T. J. P. Penna [ABSTRACT]Symbolic regression is a machine learning technique, and it has seen many
advancements in recent years, especially in genetic programming approaches
(GPSR). Furthermore, it has been known for many years that constant
optimization of parameters, during the evolutionary search, greatly increases
GPSR performance. However, different authors approach such tasks differently and
no consensus exists regarding which methods perform best. In this work, we
evaluate eight different parameter optimization methods, applied during
evolutionary search, over ten known benchmark problems, in two different
scenarios. We also propose using an under-explored metric called Tree Edit
Distance (TED), aiming to identify symbolic accuracy. In conjunction with
classical error measures, we develop a combined analysis of model performance
in symbolic regression. We then show that different constant optimization
methods perform better in certain scenarios and that there is no overall best
choice for every problem. Finally, we discuss how common metric decisions may
be biased and appear to generate better models in comparison. [COMMENTS]9 pages, 10 figures, 2 tables [LINK]http://arxiv.org/abs/2412.02126v1 [DATE]2024-12-03 11:29:27+08:00 [CATEGORIES]cs.LG
Optimizing Latent Goal by Learning from Trajectory Preference [AUTHORS]Guangyu Zhao, Kewei Lian, Haowei Lin, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang [ABSTRACT]A growing body of work has emerged focusing on instruction-following policies
for open-world agents, aiming to better align the agent's behavior with human
intentions. However, the performance of these policies is highly susceptible to
the initial prompt, which leads to extra efforts in selecting the best
instructions. We propose a framework named Preference Goal Tuning (PGT). PGT
allows an instruction-following policy to interact with the environment to
collect several trajectories, which will be categorized into positive and
negative samples based on preference. Then we use preference learning to
fine-tune the initial goal latent representation with the categorized
trajectories while keeping the policy backbone frozen. Experimental results
show that with minimal data and training, PGT achieves an average relative
improvement of 72.0% and 81.6% over 17 tasks in 2 different foundation policies
respectively, and outperforms the best human-selected instructions. Moreover,
PGT surpasses full fine-tuning in the out-of-distribution (OOD) task-execution
environments by 13.4%, indicating that our approach retains strong
generalization capabilities. Since our approach stores a single latent
representation for each task independently, it can be viewed as an efficient
method for continual learning, without the risk of catastrophic forgetting or
task interference. In short, PGT enhances the performance of agents across
nearly all tasks in the Minecraft Skillforge benchmark and demonstrates
robustness to the execution environment. [LINK]http://arxiv.org/abs/2412.02125v1 [DATE]2024-12-03 11:27:48+08:00 [CATEGORIES]cs.LG
ILASH: A Predictive Neural Architecture Search Framework for Multi-Task Applications [AUTHORS]Md Hafizur Rahman, Md Mashfiq Rizvee, Sumaiya Shomaji, Prabuddha Chakraborty [ABSTRACT]Artificial intelligence (AI) is widely used in various fields including
healthcare, autonomous vehicles, robotics, traffic monitoring, and agriculture.
Many modern AI applications in these fields are multi-tasking in nature (i.e.,
perform multiple analyses on the same data) and are deployed on
resource-constrained edge devices requiring the AI models to be efficient
across different metrics such as power, frame rate, and size. For these
specific use-cases, in this work, we propose a new paradigm of neural network
architecture (ILASH) that leverages a layer sharing concept for minimizing
power utilization, increasing frame rate, and reducing model size.
Additionally, we propose a novel neural network architecture search framework
(ILASH-NAS) for efficient construction of these neural network models for a
given set of tasks and device constraints. The proposed NAS framework utilizes
a data-driven intelligent approach to make the search efficient in terms of
energy, time, and CO2 emission. We perform extensive evaluations of the
proposed layer shared architecture paradigm (ILASH) and the ILASH-NAS framework
using four open-source datasets (UTKFace, MTFL, CelebA, and Taskonomy). We
compare ILASH-NAS with AutoKeras and observe significant improvement in terms
of both the generated model performance and neural search efficiency with up to
16x less energy utilization, CO2 emission, and training/search time. [COMMENTS]9 pages, 3 figures, 6 tables [LINK]http://arxiv.org/abs/2412.02116v1 [DATE]2024-12-03 11:12:16+08:00 [CATEGORIES]cs.LG
Evaluating the Impact of Data Augmentation on Predictive Model Performance [AUTHORS]Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada [ABSTRACT]In supervised machine learning (SML) research, large training datasets are
essential for valid results. However, obtaining primary data in learning
analytics (LA) is challenging. Data augmentation can address this by expanding
and diversifying data, though its use in LA remains underexplored. This paper
systematically compares data augmentation techniques and their impact on
prediction performance in a typical LA task: prediction of academic outcomes.
Augmentation is demonstrated on four SML models, which we successfully
replicated from a previous LAK study based on AUC values. Among 21 augmentation
techniques, SMOTE-ENN sampling performed the best, improving the average AUC by
0.01 and approximately halving the training time compared to the baseline
models. In addition, we compared 99 combinations of chaining 21 techniques, and
found minor, although statistically significant, improvements across models
when adding noise to SMOTE-ENN (+0.014). Notably, some augmentation techniques
significantly lowered predictive performance or increased performance
fluctuation related to random chance. This paper's contribution is twofold.
First, our empirical findings show that sampling techniques provide the
most statistically reliable performance improvements for LA applications of
SML, and are computationally more efficient than deep generation methods with
complex hyperparameter settings. Second, the LA community may benefit from
validating a recent study through independent replication. [COMMENTS]Published in LAK 2025 conference proceedings in the ACM Digital
Library, see https://doi.org/10.1145/3706468.3706485 [LINK]http://arxiv.org/abs/2412.02108v1 [DATE]2024-12-03 11:03:04+08:00 [CATEGORIES]cs.LG
Beyond Tree Models: A Hybrid Model of KAN and gMLP for Large-Scale Financial Tabular Data [AUTHORS]Mingming Zhang, Jiahao Hu, Pengfei Shi, Ningtao Wang, Ruizhe Gao, Guandong Sun, Feng Zhao, Yulin kang, Xing Fu, Weiqiang Wang, Junbo Zhao [ABSTRACT]Tabular data plays a critical role in real-world financial scenarios.
Traditionally, tree models have dominated in handling tabular data. However,
financial datasets in the industry often encounter some challenges, such as
data heterogeneity, the predominance of numerical features and the large scale
of the data, which can range from tens of millions to hundreds of millions of
records. These challenges can lead to significant memory and computational
issues when using tree-based models. Consequently, there is a growing need for
neural network-based solutions that can outperform these models. In this paper,
we introduce TKGMLP, a hybrid network for tabular data that combines shallow
Kolmogorov-Arnold Networks with a Gated Multilayer Perceptron. This model
leverages the strengths of both architectures to improve performance and
scalability. We validate TKGMLP on a real-world credit scoring dataset, where
it achieves state-of-the-art results and outperforms current benchmarks.
Furthermore, our findings demonstrate that the model continues to improve as
the dataset size increases, making it highly scalable. Additionally, we propose
a novel feature encoding method for numerical data, specifically designed to
address the predominance of numerical features in financial datasets. The
integration of this feature encoding method within TKGMLP significantly
improves prediction accuracy. This research not only advances table prediction
technology but also offers a practical and effective solution for handling
large-scale numerical tabular data in various industrial applications. [COMMENTS]8 pages, 4 figures [LINK]http://arxiv.org/abs/2412.02097v1 [DATE]2024-12-03 10:38:07+08:00 [CATEGORIES]cs.LG
Fault Detection and Identification via Monitoring Modules Based on Clusters of Interacting Measurements [AUTHORS]Enrique Luna Villagomez, Vladimir Mahalec [ABSTRACT]This work introduces a novel control-aware distributed process monitoring
methodology based on modules comprised of clusters of interacting measurements.
The methodology relies on the process flow diagram (PFD) and control system
structure without requiring cross-correlation data to create monitoring
modules. The methodology is validated on the Tennessee Eastman Process
benchmark using full Principal Component Analysis (f-PCA) in the monitoring
modules. The results are comparable to nonlinear techniques implemented in a
centralized manner such as Kernel PCA (KPCA), Autoencoders (AE), and Recurrent
Neural Networks (RNN), or distributed techniques like the Distributed Canonical
Correlation Analysis (DCCA). Temporal plots of fault detection by different
modules show clearly the magnitude and propagation of the fault through each
module, pinpointing the module where the fault originates, and separating
controllable faults from other faults. This information, combined with PCA
contribution plots, helps detection and identification as effectively as more
complex nonlinear centralized or distributed methods. [COMMENTS]Reworked and submitted to CONENGP 2/12/2024 [LINK]http://arxiv.org/abs/2409.11444v2 [DATE]2024-12-03 09:16:51+08:00 [CATEGORIES]cs.LG
emg2pose: A Large and Diverse Benchmark for Surface Electromyographic Hand Pose Estimation [AUTHORS]Sasha Salter, Richard Warren, Collin Schlager, Adrian Spurr, Shangchen Han, Rohin Bhasin, Yujun Cai, Peter Walkington, Anuoluwapo Bolarinwa, Robert Wang, Nathan Danielson, Josh Merel, Eftychios Pnevmatikakis, Jesse Marshall [ABSTRACT]Hands are the primary means through which humans interact with the world.
Reliable and always-available hand pose inference could yield new and intuitive
control schemes for human-computer interactions, particularly in virtual and
augmented reality. Computer vision is effective but requires one or multiple
cameras and can struggle with occlusions, limited field of view, and poor
lighting. Wearable wrist-based surface electromyography (sEMG) presents a
promising alternative as an always-available modality sensing muscle activities
that drive hand motion. However, sEMG signals are strongly dependent on user
anatomy and sensor placement, and existing sEMG models have required hundreds
of users and device placements to effectively generalize. To facilitate
progress on sEMG pose inference, we introduce the emg2pose benchmark, the
largest publicly available dataset of high-quality hand pose labels and wrist
sEMG recordings. emg2pose contains 2 kHz, 16-channel sEMG and pose labels from a
26-camera motion capture rig for 193 users, 370 hours, and 29 stages with
diverse gestures - a scale comparable to vision-based hand pose datasets. We
provide competitive baselines and challenging tasks evaluating real-world
generalization scenarios: held-out users, sensor placements, and stages.
emg2pose provides the machine learning community a platform for exploring
complex generalization problems, holding potential to significantly enhance the
development of sEMG-based human-computer interactions. [COMMENTS]Published at NeurIPS 2024 Datasets and Benchmarks Track [LINK]http://arxiv.org/abs/2412.02725v1 [DATE]2024-12-03 07:39:37+08:00 [CATEGORIES]cs.LG
Radial Basis Operator Networks [AUTHORS]Jason Kurz, Sean Oughton, Shitao Liu [ABSTRACT]Operator networks are designed to approximate nonlinear operators, which
provide mappings between infinite-dimensional spaces such as function spaces.
These networks are playing an increasingly important role in machine learning,
with their most notable contributions in the field of scientific computing.
Their significance stems from their ability to handle the type of data often
encountered in scientific applications. For instance, in climate modeling or
fluid dynamics, input data typically consists of discretized continuous fields
(like temperature distributions or velocity fields). We introduce the radial
basis operator network (RBON), which represents a significant advancement as
the first operator network capable of learning an operator in both the time
domain and frequency domain when adjusted to accept complex-valued inputs.
Despite its small, single hidden-layer structure, the RBON achieves a small $L^2$
relative test error on both in-distribution and out-of-distribution (OOD) data, less
than $1\times 10^{-7}$ in some benchmark cases. Moreover, the RBON maintains
small error on OOD data from entirely different function classes from the
training data. [LINK]http://arxiv.org/abs/2410.04639v2 [DATE]2024-12-03 06:46:47+08:00 [CATEGORIES]cs.LG
Representation Learning for Sequential Volumetric Design Tasks [AUTHORS]Md Ferdous Alam, Yi Wang, Chin-Yi Cheng, Jieliang Luo [ABSTRACT]Volumetric design, also called massing design, is the first and critical step
in professional building design which is sequential in nature. As the
volumetric design process requires careful design decisions and iterative
adjustments, the underlying sequential design process encodes valuable
information for designers. Many efforts have been made to automatically
generate reasonable volumetric designs, but the quality of the generated design
solutions varies, and evaluating a design solution requires either a
prohibitively comprehensive set of metrics or expensive human expertise. While
previous approaches focused on learning only the final design instead of
sequential design tasks, we propose to encode the design knowledge from a
collection of expert or high-performing design sequences and extract useful
representations using transformer-based models. Later we propose to utilize the
learned representations for crucial downstream applications such as design
preference evaluation and procedural design generation. We develop the
preference model by estimating the density of the learned representations
whereas we train an autoregressive transformer model for sequential design
generation. We demonstrate our ideas by leveraging a novel dataset of thousands
of sequential volumetric designs. Our preference model can compare two
arbitrarily given design sequences and is almost $90\%$ accurate in evaluation
against random design sequences. Our autoregressive model is also capable of
autocompleting a volumetric design sequence from a partial design sequence. [COMMENTS]12 pages, 12 figures [LINK]http://arxiv.org/abs/2309.02583v3 [DATE]2024-12-03 06:33:40+08:00 [CATEGORIES]cs.LG
DYffCast: Regional Precipitation Nowcasting Using IMERG Satellite Data. A case study over South America [AUTHORS]Daniel Seal, Rossella Arcucci, Salva Rühling-Cachay, César Quilodrán-Casas [ABSTRACT]Climate change is increasing the frequency of extreme precipitation events,
making weather disasters such as flooding and landslides more likely. The
ability to accurately nowcast precipitation is therefore becoming more critical
for safeguarding society by providing immediate, accurate information to
decision makers. Motivated by the recent success of generative models at
precipitation nowcasting, this paper: extends the DYffusion framework to this
task and evaluates its performance at forecasting IMERG satellite precipitation
data up to a 4-hour horizon; modifies the DYffusion framework to improve its
ability to model rainfall data; and introduces a novel loss function that
combines MSE, MAE and the LPIPS perceptual score. In a quantitative evaluation
of forecasts up to a 4-hour horizon, the modified DYffusion framework trained
with the novel loss outperforms four competitor models. It has the highest CSI
scores for weak, moderate, and heavy rain thresholds and retains an LPIPS score
$< 0.2$ for the entire roll-out, degrading the least as lead-time increases.
The proposed nowcasting model demonstrates visually stable and sharp forecasts
up to a 2-hour horizon on a heavy rain case study. Code is available at
https://github.com/Dseal95/DYffcast. [COMMENTS]Accepted in the Machine Learning for Physical Sciences workshop @
NeurIPS 2024 [LINK]http://arxiv.org/abs/2412.02723v1 [DATE]2024-12-03 06:20:31+08:00 [CATEGORIES]cs.LG
The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications [AUTHORS]Philippe Brouillard, Chandler Squires, Jonas Wahl, Konrad P. Kording, Karen Sachs, Alexandre Drouin, Dhanya Sridhar [ABSTRACT]Causal discovery aims to automatically uncover causal relationships from
data, a capability with significant potential across many scientific
disciplines. However, its real-world applications remain limited. Current
methods often rely on unrealistic assumptions and are evaluated only on simple
synthetic toy datasets, often with inadequate evaluation metrics. In this
paper, we substantiate these claims by performing a systematic review of the
recent causal discovery literature. We present applications in biology,
neuroscience, and Earth sciences - fields where causal discovery holds promise
for addressing key challenges. We highlight available simulated and real-world
datasets from these domains and discuss common assumption violations that have
spurred the development of new methods. Our goal is to encourage the community
to adopt better evaluation practices by utilizing realistic datasets and more
adequate metrics. [COMMENTS]39 pages, 8 figures [LINK]http://arxiv.org/abs/2412.01953v1 [DATE]2024-12-03 04:26:29+08:00 [CATEGORIES]cs.LG
Approximately Optimal Search on a Higher-dimensional Sliding Puzzle [AUTHORS]Nono SC Merleau, Miguel O'Malley, Érika Roldán, Sayan Mukherjee [ABSTRACT]Higher-dimensional sliding puzzles are constructed on the vertices of a
$d$-dimensional hypercube, where $2^d-l$ vertices are distinctly coloured.
Rings with the same colours are initially set randomly on the vertices of the
hypercube. The goal of the puzzle is to move each of the $2^d-l$ rings to
pre-defined target vertices on the cube. In this setting, the $k$-rule
constraint represents a generalisation of edge collision for the movement of
colours between vertices, allowing movement only when a hypercube face of
dimension $k$ containing a ring is completely free of other rings. Starting
from an initial configuration, what is the minimum number of moves needed to
make ring colours match the vertex colours? An algorithm that provides us with
such a number is called God's algorithm. When such an algorithm exists, it does
not have a polynomial time complexity, at least in the case of the 15-puzzle
corresponding to $k=1$ in the cubical puzzle. This paper presents a
comprehensive computational study of different scenarios of the
higher-dimensional puzzle. A benchmark of three computational techniques, an
exact algorithm (the A* search) and two approximately optimal search techniques
(an evolutionary algorithm (EA) and reinforcement learning (RL)) is presented
in this work. The experiments show that all three methods can successfully
solve the puzzle of dimension three for different face dimensions and across
various difficulty levels. When the dimension increases, the A* search fails,
and RL and EA methods can still provide a generally acceptable solution, i.e. a
distribution of a number of moves with a median value of less than $30$.
Overall, the EA method consistently requires less computational time, while
failing in most cases to minimise the number of moves for the puzzle dimensions
$d=4$ and $d=5$. [COMMENTS]20 pages, 8 figures [LINK]http://arxiv.org/abs/2412.01937v1 [DATE]2024-12-03 03:59:06+08:00 [CATEGORIES]cs.LG
Kernel-Free Universum Quadratic Surface Twin Support Vector Machines for Imbalanced Data [AUTHORS]Hossein Moosaei, Milan Hladík, Ahmad Mousavi, Zheming Gao, Haojie Fu [ABSTRACT]Binary classification tasks with imbalanced classes pose significant
challenges in machine learning. Traditional classifiers often struggle to
accurately capture the characteristics of the minority class, resulting in
biased models with subpar predictive performance. In this paper, we introduce a
novel approach to tackle this issue by leveraging Universum points to support
the minority class within quadratic twin support vector machine models. Unlike
traditional classifiers, our models utilize quadratic surfaces instead of
hyperplanes for binary classification, providing greater flexibility in
modeling complex decision boundaries. By incorporating Universum points, our
approach enhances classification accuracy and generalization performance on
imbalanced datasets. We generated four artificial datasets to demonstrate the
flexibility of the proposed methods. Additionally, we validated the
effectiveness of our approach through empirical evaluations on benchmark
datasets, showing superior performance compared to conventional classifiers and
existing methods for imbalanced classification. [LINK]http://arxiv.org/abs/2412.01936v1 [DATE]2024-12-03 03:57:59+08:00 [CATEGORIES]cs.LG
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers [AUTHORS]Benedikt Stroebl, Sayash Kapoor, Arvind Narayanan [ABSTRACT]Recent research has generated hope that inference scaling could allow weaker
language models to match or exceed the accuracy of stronger models, such as by
repeatedly sampling solutions to a coding problem until it passes unit tests.
The central thesis of this paper is that there is no free lunch for inference
scaling: indefinite accuracy improvement through resampling can only be
realized if the "verifier" (in this case, a set of unit tests) is perfect. When
the verifier is imperfect, as it almost always is in domains such as reasoning
or coding (for example, unit tests have imperfect coverage), there is a nonzero
probability of false positives: incorrect solutions that pass the verifier.
Resampling cannot decrease this probability, so it imposes an upper bound on
the accuracy of resampling-based inference scaling even with an infinite
compute budget. We find that there is a very strong correlation between the
model's single-sample accuracy (i.e. accuracy without unit tests) and its false
positive rate on coding benchmarks HumanEval and MBPP, whose unit tests have
limited coverage. Therefore, no amount of inference scaling of weaker models
can enable them to match the single-sample accuracy of a sufficiently strong
model (Fig. 1a). When we consider that false positives have a negative utility
compared to abstaining from producing a solution, it bends the inference
scaling curve further downward. Empirically, we find that the optimal number of
samples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we
show that beyond accuracy, false positives may have other undesirable
qualities, such as poor adherence to coding style conventions. [LINK]http://arxiv.org/abs/2411.17501v2 [DATE]2024-12-03 02:54:28+08:00 [CATEGORIES]cs.LG
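Illustrative sketch (a toy model, not the paper's analysis): the ceiling described in the abstract above can be reproduced under simple assumptions. Suppose each sample is independently correct with probability `p`, correct solutions always pass the verifier, and incorrect ones pass with false positive rate `fpr`. Returning the first passing sample out of at most `k` attempts then gives accuracy $p\,(1-(1-q)^k)/q$ with $q = p + (1-p)\,\mathrm{fpr}$, which tends to $p/q$ as $k$ grows. The independence assumption and both function names are hypothetical.

```python
def resampling_accuracy(p, fpr, k):
    """P(returned solution is correct) when sampling up to k times.

    Assumes independent samples, correct solutions always pass the
    verifier, and incorrect ones pass with probability fpr.
    """
    q = p + (1.0 - p) * fpr  # probability that a single sample passes
    if q == 0.0:
        return 0.0
    return p * (1.0 - (1.0 - q) ** k) / q


def accuracy_ceiling(p, fpr):
    """Limit of resampling_accuracy as k goes to infinity."""
    q = p + (1.0 - p) * fpr
    return p / q if q > 0.0 else 0.0


for k in (1, 10, 100):
    print(k, round(resampling_accuracy(0.2, 0.1, k), 4))
print("ceiling:", round(accuracy_ceiling(0.2, 0.1), 4))
```

With a perfect verifier (`fpr = 0`) the ceiling is 1; with any positive false positive rate it stays strictly below 1 no matter how many samples are drawn.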
CREW: Facilitating Human-AI Teaming Research [AUTHORS]Lingyu Zhang, Zhengran Ji, Boyuan Chen [ABSTRACT]With the increasing deployment of artificial intelligence (AI) technologies,
the potential of humans working with AI agents has been growing rapidly.
Human-AI teaming is an important paradigm for studying various aspects
when humans and AI agents work together. The unique aspect of Human-AI teaming
research is the need to jointly study humans and AI agents, demanding
multidisciplinary research efforts from machine learning to human-computer
interaction, robotics, cognitive science, neuroscience, psychology, social
science, and complex systems. However, existing platforms for Human-AI teaming
research are limited, often supporting oversimplified scenarios and a single
task, or specifically focusing on either human-teaming research or multi-agent
AI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming
research in real-time decision-making scenarios and engage collaborations from
multiple scientific disciplines, with a strong emphasis on human involvement.
It includes pre-built tasks for cognitive studies and Human-AI teaming and
can be extended through its modular design. Following conventional cognitive
neuroscience research, CREW also supports multimodal human physiological signal
recording for behavior analysis. Moreover, CREW benchmarks real-time
human-guided reinforcement learning agents using state-of-the-art algorithms
and well-tuned baselines. With CREW, we were able to conduct 50 human subject
studies within a week to verify the effectiveness of our benchmark. [COMMENTS]Our project website is at: http://generalroboticslab.com/CREW [LINK]http://arxiv.org/abs/2408.00170v2 [DATE]2024-12-03 02:37:01+08:00 [CATEGORIES]cs.LG
FERERO: A Flexible Framework for Preference-Guided Multi-Objective Learning [AUTHORS]Lisha Chen, AFM Saif, Yanning Shen, Tianyi Chen [ABSTRACT]Finding specific preference-guided Pareto solutions that represent different
trade-offs among multiple objectives is critical yet challenging in
multi-objective problems. Existing methods are restrictive in preference
definitions and/or their theoretical guarantees. In this work, we introduce a
Flexible framEwork for pREfeRence-guided multi-Objective learning (FERERO) by
casting it as a constrained vector optimization problem. Specifically, two
types of preferences are incorporated into this formulation -- the relative
preference defined by the partial ordering induced by a polyhedral cone, and
the absolute preference defined by constraints that are linear functions of the
objectives. To solve this problem, convergent algorithms are developed with
both single-loop and stochastic variants. Notably, this is the first
single-loop primal algorithm for constrained vector optimization to our
knowledge. The proposed algorithms adaptively adjust to both constraint and
objective values, eliminating the need to solve different subproblems at
different stages of constraint satisfaction. Experiments on multiple benchmarks
demonstrate the proposed method is very competitive in finding
preference-guided optimal solutions. Code is available at
https://github.com/lisha-chen/FERERO/. [LINK]http://arxiv.org/abs/2412.01773v1 [DATE]2024-12-03 02:21:16+08:00 [CATEGORIES]cs.LG
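The relative preference mentioned in the FERERO abstract, a partial ordering induced by a polyhedral cone, can be illustrated with a small check. This is a minimal sketch and not the authors' code: it assumes the cone is written as C = {z : Az >= 0} and that x precedes y when y - x lies in C (the direction convention is an assumption). The names `in_polyhedral_cone` and `precedes` are hypothetical.

```python
def in_polyhedral_cone(A, z, tol=1e-12):
    """Return True if A z >= 0 holds row by row (up to a small tolerance)."""
    return all(sum(a * zi for a, zi in zip(row, z)) >= -tol for row in A)


def precedes(A, x, y):
    """Cone-induced partial order: x precedes y iff (y - x) is in C = {z : A z >= 0}."""
    difference = [yi - xi for xi, yi in zip(x, y)]
    return in_polyhedral_cone(A, difference)


# With A equal to the identity, the cone is the nonnegative orthant and the
# ordering reduces to the usual componentwise (Pareto) comparison.
identity = [[1.0, 0.0], [0.0, 1.0]]
comparable = precedes(identity, [1.0, 2.0], [2.0, 2.0])
incomparable = precedes(identity, [1.0, 3.0], [2.0, 2.0])
```

Choosing a different matrix A widens or narrows the cone, which changes which objective vectors count as comparable.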
OminiControl: Minimal and Universal Control for Diffusion Transformer [AUTHORS]Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang [ABSTRACT]In this paper, we introduce OminiControl, a highly versatile and
parameter-efficient framework that integrates image conditions into pre-trained
Diffusion Transformer (DiT) models. At its core, OminiControl leverages a
parameter reuse mechanism, enabling the DiT to encode image conditions using
itself as a powerful backbone and process them with its flexible multi-modal
attention processors. Unlike existing methods, which rely heavily on additional
encoder modules with complex architectures, OminiControl (1) effectively and
efficiently incorporates injected image conditions with only ~0.1% additional
parameters, and (2) addresses a wide range of image conditioning tasks in a
unified manner, including subject-driven generation and spatially-aligned
conditions such as edges, depth, and more. Remarkably, these capabilities are
achieved by training on images generated by the DiT itself, which is
particularly beneficial for subject-driven generation. Extensive evaluations
demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted
models in both subject-driven and spatially-aligned conditional generation.
Additionally, we release our training dataset, Subjects200K, a diverse
collection of over 200,000 identity-consistent images, along with an efficient
data synthesis pipeline to advance research in subject-consistent generation. [LINK]http://arxiv.org/abs/2411.15098v3 [DATE]2024-12-03 01:59:40+08:00 [CATEGORIES]cs.LG
The Data-Driven Censored Newsvendor Problem [AUTHORS]Chamsi Hssaine, Sean R. Sinclair [ABSTRACT]We study a censored variant of the data-driven newsvendor problem, where the
decision-maker must select an ordering quantity that minimizes expected overage
and underage costs based only on censored sales data, rather than historical
demand realizations. To isolate the impact of demand censoring on this problem,
we adopt a distributionally robust optimization framework, evaluating policies
according to their worst-case regret over an ambiguity set of distributions.
This set is defined by the largest historical order quantity (the observable
boundary of the dataset), and contains all distributions matching the true
demand distribution up to this boundary, while allowing them to be arbitrary
afterwards. We demonstrate a spectrum of achievability under demand censoring
by deriving a natural necessary and sufficient condition under which vanishing
regret is an achievable goal. In regimes in which it is not, we exactly
characterize the information loss due to censoring: an insurmountable lower
bound on the performance of any policy, even when the decision-maker has access
to infinitely many demand samples. We then leverage these sharp
characterizations to propose a natural robust algorithm that adapts to the
historical level of demand censoring. We derive finite-sample guarantees for
this algorithm across all possible censoring regimes, and show its
near-optimality with matching lower bounds (up to polylogarithmic factors). We
moreover demonstrate its robust performance via extensive numerical experiments
on both synthetic and real-world datasets. [COMMENTS]67 pages, 19 tables, 7 figures [LINK]http://arxiv.org/abs/2412.01763v1 [DATE]2024-12-03 01:58:54+08:00 [CATEGORIES]cs.LG
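For context on the newsvendor abstract, the classical uncensored problem has a well-known solution: order the b/(b+h) quantile of demand, where b is the per-unit underage cost and h the per-unit overage cost. The sketch below computes that quantile empirically and shows how censoring truncates observations at the order quantity. It is not the robust algorithm proposed in the paper, and the function names are hypothetical.

```python
import math


def newsvendor_quantile(demand_samples, underage_cost, overage_cost):
    """Empirical critical-quantile order quantity from uncensored demand samples."""
    if not demand_samples:
        raise ValueError("need at least one demand sample")
    ratio = underage_cost / (underage_cost + overage_cost)
    ordered = sorted(demand_samples)
    # Smallest sample whose empirical CDF value is at least the critical ratio.
    index = max(0, math.ceil(ratio * len(ordered)) - 1)
    return ordered[index]


def censor(demand_samples, order_quantity):
    """Sales data under censoring: demand above the order quantity is not observed."""
    return [min(demand, order_quantity) for demand in demand_samples]


demand = list(range(1, 11))  # ten illustrative demand observations
order = newsvendor_quantile(demand, underage_cost=3.0, overage_cost=1.0)
sales = censor([3, 8, 12], order_quantity=10)
```

When every historical order was small, `censor` hides the upper tail of demand entirely, which is the information loss the abstract characterizes.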
2024 Dec 02, Mon
Using Large Language Models in Automatic Hint Ranking and Generation Tasks [AUTHORS]Jamshid Mozafari, Florian Gerhold, Adam Jatowt [ABSTRACT]The use of Large Language Models (LLMs) has increased significantly recently,
with individuals frequently interacting with chatbots to receive answers to a
wide range of questions. In an era where information is readily accessible, it
is crucial to stimulate and preserve human cognitive abilities and maintain
strong reasoning skills. This paper addresses such challenges by promoting the
use of hints as an alternative or a supplement to direct answers. We first
introduce a manually constructed hint dataset, WIKIHINT, which includes 5,000
hints created for 1,000 questions. We then finetune open-source LLMs such as
LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We
assess the effectiveness of the hints with human participants who try to answer
questions with and without the aid of hints. Additionally, we introduce a
lightweight evaluation method, HINTRANK, to evaluate and rank hints in both
answer-aware and answer-agnostic settings. Our findings show that (a) the
dataset helps generate more effective hints, (b) including answer information
along with questions generally improves hint quality, and (c) encoder-based
models perform better than decoder-based models in hint ranking. [LINK]http://arxiv.org/abs/2412.01626v1 [DATE]2024-12-02 23:44:19+08:00 [CATEGORIES]cs.CL
NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers [AUTHORS]Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami [ABSTRACT]Large Language Models (LLMs) have shown impressive performance on various
benchmarks, yet their ability to engage in deliberate reasoning remains
questionable. We present NYT-Connections, a collection of 358 simple word
classification puzzles derived from the New York Times Connections game. This
benchmark is designed to penalize quick, intuitive "System 1" thinking,
isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple
machine learning heuristic, and humans across three configurations:
single-attempt, multiple attempts without hints, and multiple attempts with
contextual hints. Our findings reveal a significant performance gap: even
top-performing LLMs like GPT-4 fall short of human performance by nearly 30%.
Notably, advanced prompting techniques such as Chain-of-Thought and
Self-Consistency show diminishing returns as task difficulty increases.
NYT-Connections uniquely combines linguistic isolation, resistance to intuitive
shortcuts, and regular updates to mitigate data leakage, offering a novel tool
for assessing LLM reasoning capabilities. [COMMENTS]5 pages (excluding references), accepted to Coling 2025 [LINK]http://arxiv.org/abs/2412.01621v1 [DATE]2024-12-02 23:41:47+08:00 [CATEGORIES]cs.CL
MedChain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [AUTHORS]Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu [ABSTRACT]Clinical decision making (CDM) is a complex, dynamic process crucial to
healthcare delivery, yet it remains a significant challenge for artificial
intelligence systems. While Large Language Model (LLM)-based agents have been
tested on general medical knowledge using licensing exams and knowledge
question-answering tasks, their performance in the CDM in real-world scenarios
is limited due to the lack of comprehensive testing datasets that mirror actual
medical practice. To address this gap, we present MedChain, a dataset of 12,163
clinical cases that covers five key stages of clinical workflow. MedChain
distinguishes itself from existing benchmarks with three key features of
real-world clinical practice: personalization, interactivity, and
sequentiality. Further, to tackle real-world CDM challenges, we also propose
MedChain-Agent, an AI system that integrates a feedback mechanism and an
MCase-RAG module to learn from previous cases and adapt its responses.
MedChain-Agent demonstrates remarkable adaptability in gathering information
dynamically and handling sequential clinical tasks, significantly outperforming
existing approaches. The relevant dataset and code will be released upon
acceptance of this paper. [LINK]http://arxiv.org/abs/2412.01605v1 [DATE]2024-12-02 23:25:02+08:00 [CATEGORIES]cs.CL
VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning [AUTHORS]Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Jie Tang [ABSTRACT]Multi-modal large language models (MLLMs) have demonstrated promising
capabilities across various tasks by integrating textual and visual information
to achieve visual understanding in complex scenarios. Despite the availability
of several benchmarks that aim to evaluate MLLMs in tasks from visual question
answering to complex problem-solving, most focus predominantly on mathematics
or general visual understanding tasks. This reveals a critical gap in current
benchmarks, which often overlook the inclusion of other key scientific
disciplines such as physics and chemistry. To address this gap, we meticulously
construct a comprehensive benchmark, named VisScience, which is utilized to
assess the multi-modal scientific reasoning across the three disciplines of
mathematics, physics, and chemistry. This benchmark comprises 3,000 questions
drawn from K12 education - spanning elementary school through high school -
equally distributed across three disciplines, with 1,000 questions per
discipline. The questions within VisScience span 21 distinct subjects and are
categorized into five difficulty levels, offering a broad spectrum of topics
within each discipline. With VisScience, we present a detailed evaluation of
the performance of 25 representative MLLMs in scientific reasoning.
Experimental results demonstrate that closed-source MLLMs generally outperform
open-source models. The best performances observed include 53.4% accuracy in
mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in
chemistry by Gemini-1.5-Pro. These results underscore the strengths and
limitations of MLLMs, suggesting areas for future improvement and highlighting
the importance of developing models that can effectively handle the diverse
demands of multi-modal scientific reasoning. [COMMENTS]89 pages, 70 figures [LINK]http://arxiv.org/abs/2409.13730v2 [DATE]2024-12-02 23:11:23+08:00 [CATEGORIES]cs.CL
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model [AUTHORS]Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Jie Tang [ABSTRACT]Large language models (LLMs) have demonstrated significant capabilities in
mathematical reasoning, particularly with text-based mathematical problems.
However, current multi-modal large language models (MLLMs), especially those
specialized in mathematics, tend to focus predominantly on solving geometric
problems but ignore the diversity of visual information available in other
areas of mathematics. Moreover, the geometric information for these specialized
mathematical MLLMs is derived from several public datasets, which are typically
limited in diversity and complexity. To address these limitations, we aim to
construct a fine-tuning dataset named MathVL, and develop a series of
specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised
Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To
extensively evaluate the effectiveness of MathGLM-Vision, we conduct
experiments on several public benchmarks and our curated MathVL-test consisting
of 2,000 problems. Experimental results demonstrate that MathGLM-Vision
achieves significant improvements compared with some existing models, including
backbone models and open-source mathematical MLLMs. These findings indicate the
importance of dataset diversity in enhancing the mathematical reasoning
abilities of MLLMs. [COMMENTS]30 pages, 19 figures [LINK]http://arxiv.org/abs/2409.13729v2 [DATE]2024-12-02 22:59:08+08:00 [CATEGORIES]cs.CL
Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [AUTHORS]Alessandro Scirè, Andrei Stefan Bejgu, Simone Tedeschi, Karim Ghonim, Federico Martelli, Roberto Navigli [ABSTRACT]After the introduction of Large Language Models (LLMs), there have been
substantial improvements in the performance of Natural Language Generation
(NLG) tasks, including Text Summarization and Machine Translation. However,
LLMs still produce outputs containing hallucinations, that is, content not
grounded in factual information. Therefore, developing methods to assess the
factuality of LLMs has become urgent.
Indeed, resources for factuality evaluation have recently emerged. Although
challenging, these resources face one or more of the following limitations: (i)
they are tailored to a specific task or domain; (ii) they are limited in size,
thereby preventing the training of new factuality evaluators; (iii) they are
designed for simpler verification tasks, such as claim verification.
To address these issues, we introduce LLM-Oasis, to the best of our knowledge
the largest resource for training end-to-end factuality evaluators. LLM-Oasis
is constructed by extracting claims from Wikipedia, falsifying a subset of
these claims, and generating pairs of factual and unfactual texts. We then rely
on human annotators to both validate the quality of our dataset and to create a
gold standard test set for benchmarking factuality evaluation systems.
Our experiments demonstrate that LLM-Oasis presents a significant challenge
for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our
proposed end-to-end factuality evaluation task, highlighting its potential to
drive future research in the field. [COMMENTS]15 pages. To be submitted to CL journal [LINK]http://arxiv.org/abs/2411.19655v2 [DATE]2024-12-02 22:28:07+08:00 [CATEGORIES]cs.CL
Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem [AUTHORS]Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt [ABSTRACT]Natural language explanations (NLEs) are vital for elucidating the reasoning
behind large language model (LLM) decisions. Many techniques have been
developed to generate NLEs using LLMs. However, like humans, LLMs might not
always produce optimal NLEs on the first attempt. Inspired by human learning
processes, we introduce Cross-Refine, which employs role modeling by deploying
two LLMs as generator and critic, respectively. The generator outputs a first
NLE and then refines this initial explanation using feedback and suggestions
provided by the critic. Cross-Refine does not require any supervised training
data or additional training. We validate Cross-Refine across three NLP tasks
using three state-of-the-art open-source LLMs through automatic and human
evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which
only utilizes self-feedback to refine the explanations. Our findings from
automatic evaluation and a user study indicate that Cross-Refine outperforms
Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful
LLMs, whereas Self-Refine only yields strong results with ChatGPT.
Additionally, we conduct an ablation study to assess the importance of feedback
and suggestions. Both of them play an important role in refining explanations.
We further evaluate Cross-Refine on a bilingual dataset in English and German. [COMMENTS]Accepted at COLING 2025; long paper [LINK]http://arxiv.org/abs/2409.07123v2 [DATE]2024-12-02 21:04:18+08:00 [CATEGORIES]cs.CL, cs.LG
Multi-Facet Blending for Faceted Query-by-Example Retrieval [AUTHORS]Heejin Do, Sangwon Ryu, Jonghwi Kim, Gary Geunbae Lee [ABSTRACT]With the growing demand to fit fine-grained user intents, faceted
query-by-example (QBE), which retrieves similar documents conditioned on
specific facets, has gained recent attention. However, prior approaches mainly
depend on document-level comparisons using basic indicators like citations due
to the lack of facet-level relevance datasets; yet, this limits their use to
citation-based domains and fails to capture the intricacies of facet
constraints. In this paper, we propose a multi-facet blending (FaBle)
augmentation method, which exploits modularity by decomposing and recomposing
to explicitly synthesize facet-specific training sets. We automatically
decompose documents into facet units and generate (ir)relevant pairs by
leveraging LLMs' intrinsic distinguishing capabilities; then, dynamically
recomposing the units leads to facet-wise relevance-informed document pairs.
Our modularization eliminates the need for pre-defined facet knowledge or
labels. Further, to prove FaBle's efficacy in a new domain beyond
citation-based scientific paper retrieval, we release a benchmark dataset for
educational exam item QBE. FaBle augmentation on 1K documents remarkably
assists training in obtaining facet conditional embeddings. [LINK]http://arxiv.org/abs/2412.01443v1 [DATE]2024-12-02 20:32:19+08:00 [CATEGORIES]cs.CL
A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A Crosslinguistic Case Study of Supplementary Adverbs [AUTHORS]Zhu Liu, Cunliang Kong, Ying Liu, Maosong Sun [ABSTRACT]Semantic map models (SMMs) construct a network-like conceptual space from
cross-linguistic instances or forms, based on the connectivity hypothesis. This
approach has been widely used to represent similarity and entailment
relationships in cross-linguistic concept comparisons. However, most SMMs are
manually built by human experts using bottom-up procedures, which are often
labor-intensive and time-consuming. In this paper, we propose a novel
graph-based algorithm that automatically generates conceptual spaces and SMMs
in a top-down manner. The algorithm begins by creating a dense graph, which is
subsequently pruned into maximum spanning trees, selected according to metrics
we propose. These evaluation metrics include both intrinsic and extrinsic
measures, considering factors such as network structure and the trade-off
between precision and coverage. A case study on cross-linguistic supplementary
adverbs demonstrates the effectiveness and efficiency of our model compared to
human annotations and other automated methods. The tool is available at
https://github.com/RyanLiut/SemanticMapModel. [COMMENTS]Paper under review [LINK]http://arxiv.org/abs/2412.01423v1 [DATE]2024-12-02 20:06:41+08:00 [CATEGORIES]cs.CL
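The pruning step described in the semantic map abstract, reducing a dense weighted graph to a maximum spanning tree, can be sketched with Kruskal's algorithm run over edges in descending weight order. This is a generic illustration, not the authors' tool: the edge weights are made up, the function name is hypothetical, and the paper's metric-based selection among candidate trees is not reproduced.

```python
def maximum_spanning_tree(num_nodes, weighted_edges):
    """Kruskal's algorithm on edges (weight, u, v), keeping the heaviest edges first."""
    parent = list(range(num_nodes))

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so it creates no cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Illustrative dense graph over four concept nodes; weights stand in for
# co-expression strength between concepts.
edges = [(0.9, 0, 1), (0.2, 0, 2), (0.8, 1, 2), (0.5, 2, 3), (0.1, 1, 3)]
tree = maximum_spanning_tree(4, edges)
```

A connected graph on n nodes always yields n - 1 tree edges, so the result is the sparsest map that still links every concept.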
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head [AUTHORS]Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee [ABSTRACT]End-to-end transformer-based detectors (DETRs) have shown exceptional
performance in both closed-set and open-vocabulary object detection (OVD) tasks
through the integration of language modalities. However, their demanding
computational requirements have hindered their practical application in
real-time object detection (OD) scenarios. In this paper, we scrutinize the
limitations of two leading models in the OVDEval benchmark, OmDet and
Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based
real-time OVD model features an innovative Efficient Fusion Head (EFH) module
designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO.
Notably, OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with
TensorRT and language cache techniques applied. In zero-shot scenarios
on COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on
par with current state-of-the-art supervised models. Furthermore, it
establishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an
AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of
OmDet-Turbo in industrial applications is underscored by its exceptional
performance on benchmark datasets and superior inference speed, positioning it
as a compelling choice for real-time object detection tasks. Code:
https://github.com/om-ai-lab/OmDet [COMMENTS]Preprint [LINK]http://arxiv.org/abs/2403.06892v2 [DATE]2024-12-02 19:24:20+08:00 [CATEGORIES]cs.CL
Understanding the World's Museums through Vision-Language Reasoning [AUTHORS]Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool [ABSTRACT]Museums serve as vital repositories of cultural heritage and historical
artifacts spanning diverse epochs, civilizations, and regions, preserving
well-documented collections. Data reveal key attributes such as age, origin,
material, and cultural significance. Understanding museum exhibits from their
images requires reasoning beyond visual features. In this work, we facilitate
such reasoning by (a) collecting and curating a large-scale dataset of 65M
images and 200M question-answer pairs in the standard museum catalog format for
exhibits from all around the world; (b) training large vision-language models
on the collected dataset; (c) benchmarking their ability on five visual
question answering tasks. The complete dataset is labeled by museum experts,
ensuring the quality as well as the practical significance of the labels. We
train two VLMs from different categories: the BLIP model, with vision-language
aligned embeddings, but lacking the expressive power of large language models,
and the LLaVA model, a powerful instruction-tuned LLM enriched with
vision-language reasoning capabilities. Through exhaustive experiments, we
provide several insights on the complex and fine-grained understanding of
museum exhibits. In particular, we show that some questions whose answers can
often be derived directly from visual features are well answered by both types
of models. On the other hand, questions that require the grounding of the
visual features in repositories of human knowledge are better answered by the
large vision-language models, thus demonstrating their superior capacity to
perform the desired reasoning. Find our dataset, benchmarks, and source code
at: https://github.com/insait-institute/Museum-65 [LINK]http://arxiv.org/abs/2412.01370v1 [DATE]2024-12-02 18:54:31+08:00 [CATEGORIES]cs.CL
Dual-Personalizing Adapter for Federated Foundation Models [AUTHORS]Yiyuan Yang, Guodong Long, Tao Shen, Jing Jiang, Michael Blumenstein [ABSTRACT]Recently, foundation models, particularly large language models (LLMs), have
demonstrated an impressive ability to adapt to various tasks by fine-tuning
on diverse instruction data. Notably, federated foundation models (FedFM) emerge
as a privacy preservation method to fine-tune models collaboratively under
federated learning (FL) settings by leveraging many distributed datasets with
non-IID data. To alleviate communication and computation overhead,
parameter-efficient methods are introduced for efficiency, and some research
adapted personalization methods to FedFM for better user preferences alignment.
However, a critical gap in existing research is the neglect of test-time
distribution shifts in real-world applications, and conventional methods for
test-time distribution shifts in personalized FL are less effective for FedFM
due to their failure to adapt to complex distribution shift scenarios and the
requirement to train all parameters. To bridge this gap, we refine the setting
in FedFM, termed test-time personalization, which aims to learn personalized
federated foundation models on clients while effectively handling test-time
distribution shifts simultaneously. To address challenges in this setting, we
explore a simple yet effective solution, a Federated Dual-Personalizing Adapter
(FedDPA) architecture. By co-working with a foundation model, a global adapter
and a local adapter jointly tackle the test-time distribution shifts and
client-specific personalization. Additionally, we introduce an instance-wise
dynamic weighting mechanism that dynamically integrates the global and local
adapters for each test instance during inference, facilitating effective
test-time personalization. The effectiveness of the proposed method has been
evaluated on benchmark datasets across different NLP tasks. [LINK]http://arxiv.org/abs/2403.19211v2 [DATE]2024-12-02 18:44:08+08:00 [CATEGORIES]cs.LG, cs.CL
Explicitly Representing Syntax Improves Sentence-to-layout Prediction of Unexpected Situations [AUTHORS]Wolf Nuyts, Ruben Cartuyvels, Marie-Francine Moens [ABSTRACT]Recognizing visual entities in a natural language sentence and arranging them
in a 2D spatial layout require a compositional understanding of language and
space. This task of layout prediction is valuable in text-to-image synthesis as
it allows localized and controlled in-painting of the image. In this
comparative study it is shown that we can predict layouts from language
representations that implicitly or explicitly encode sentence syntax, if the
sentences mention similar entity-relationships to the ones seen during
training. To test compositional understanding, we collect a test set of
grammatically correct sentences and layouts describing compositions of entities
and relations that are unlikely to have been seen during training. Performance on this
test set substantially drops, showing that current models rely on correlations
in the training data and have difficulties in understanding the structure of
the input sentences. We propose a novel structural loss function that better
enforces the syntactic structure of the input sentence and show large
performance gains in the task of 2D spatial layout prediction conditioned on
text. The loss has the potential to be used in other generation tasks where a
tree-like structure underlies the conditioning modality. Code, trained models
and the USCOCO evaluation set are available via GitHub. [COMMENTS]Published in TACL [LINK]http://arxiv.org/abs/2401.14212v3 [DATE]2024-12-02 18:30:50+08:00 [CATEGORIES]cs.CL
MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning [AUTHORS]Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li [ABSTRACT]Previous studies on federated learning (FL) often encounter performance
degradation due to data heterogeneity among different clients. In light of the
recent advances in multimodal large language models (MLLMs), such as GPT-4v and
LLaVA, which demonstrate exceptional proficiency in multimodal tasks
such as image captioning and multimodal question answering, we introduce a
novel federated learning framework, named Multimodal Large Language Model
Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at
the server end to address the heterogeneous and long-tailed challenges. Owing
to the advanced cross-modality representation capabilities and the extensive
open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing
the extensive, yet previously underexploited, open-source data accessible from
websites and powerful server-side computational resources. Hence, the
MLLM-LLaVA-FL not only enhances the performance but also avoids increasing the
risk of privacy leakage and the computational burden on local devices,
distinguishing it from prior methodologies. Our framework has three key stages.
Initially, we conduct global visual-text pretraining of the model. This
pretraining is facilitated by utilizing the extensive open-source data
available online, with the assistance of MLLMs. Subsequently, the pretrained
model is distributed among various clients for local training. Finally, once
the locally trained models are transmitted back to the server, a global
alignment is carried out under the supervision of MLLMs to further enhance the
performance. Experimental evaluations on established benchmarks show that our
framework delivers promising performance in the typical scenarios with data
heterogeneity and long-tail distribution across different clients in FL. [COMMENTS]Accepted to WACV 2025 [LINK]http://arxiv.org/abs/2409.06067v2 [DATE]2024-12-02 18:18:38+08:00 [CATEGORIES]cs.CL, cs.LG
A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls [AUTHORS]Sheikh Shafayat, Dongkeun Yoon, Woori Jang, Jiwoo Choi, Alice Oh, Seohyon Jung [ABSTRACT]In this work, we propose and evaluate the feasibility of a two-stage pipeline
to evaluate literary machine translation, in a fine-grained manner, from
English to Korean. The results show that our framework provides fine-grained,
interpretable metrics suited for literary translation and obtains a higher
correlation with human judgment than traditional machine translation metrics.
Nonetheless, it still fails to match inter-human agreement, especially in
metrics like Korean Honorifics. We also observe that LLMs tend to favor
translations generated by other LLMs, and we highlight the necessity of
developing more sophisticated evaluation methods to ensure accurate and
culturally sensitive machine translation of literary works. [LINK]http://arxiv.org/abs/2412.01340v1 [DATE]2024-12-02 18:07:01+08:00 [CATEGORIES]cs.CL
Query-Guided Self-Supervised Summarization of Nursing Notes [AUTHORS]Ya Gao, Hans Moen, Saila Koivusalo, Miika Koskinen, Pekka Marttinen [ABSTRACT]Nursing notes, an important part of Electronic Health Records (EHRs), track a
patient's health during a care episode. Summarizing key information in nursing
notes can help clinicians quickly understand patients' conditions. However,
existing summarization methods in the clinical setting, especially abstractive
methods, have overlooked nursing notes and require reference summaries for
training. We introduce QGSumm, a novel query-guided self-supervised domain
adaptation approach for abstractive nursing note summarization. The method uses
patient-related clinical queries for guidance, and hence does not need
reference summaries for training. Through automatic experiments and manual
evaluation by an expert clinician, we study our approach and other
state-of-the-art Large Language Models (LLMs) for nursing note summarization.
Our experiments show: 1) GPT-4 is competitive in maintaining information in the
original nursing notes, 2) QGSumm can generate high-quality summaries with a
good balance between recall of the original content and hallucination rate
lower than other top methods. Ultimately, our work offers a new perspective on
conditional text summarization, tailored to clinical applications. [LINK]http://arxiv.org/abs/2407.04125v2 [DATE]2024-12-02 17:42:24+08:00 [CATEGORIES]cs.CL, cs.LG
SiTSE: Sinhala Text Simplification Dataset and Evaluation [AUTHORS]Surangika Ranathunga, Rumesh Sirithunga, Himashi Rathnayake, Lahiru De Silva, Thamindu Aluthwala, Saman Peramuna, Ravi Shekhar [ABSTRACT]Text Simplification is a task that has been minimally explored for
low-resource languages. Consequently, there are only a few manually curated
datasets. In this paper, we present a human-curated sentence-level text
simplification dataset for the Sinhala language. Our evaluation dataset
contains 1,000 complex sentences and corresponding 3,000 simplified sentences
produced by three different human annotators. We model the text simplification
task as a zero-shot and zero-resource sequence-to-sequence (seq-seq) task on
the multilingual language models mT5 and mBART. We exploit auxiliary data from
related seq-seq tasks and explore the possibility of using intermediate task
transfer learning (ITTL). Our analysis shows that ITTL outperforms the
previously proposed zero-resource methods for text simplification. Our findings
also highlight the challenges in evaluating text simplification systems, and
support the calls for improved metrics for measuring the quality of automated
text simplification systems that would suit low-resource languages as well. Our
code and data are publicly available:
https://github.com/brainsharks-fyp17/Sinhala-Text-Simplification-Dataset-and-Evaluation [LINK]http://arxiv.org/abs/2412.01293v1 [DATE]2024-12-02 17:08:06+08:00 [CATEGORIES]cs.CL
Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Object-Oriented Programming [AUTHORS]Tianyang Wang, Ziqian Bi, Keyu Chen, Jiawei Xu, Qian Niu, Junyu Liu, Benji Peng, Ming Li, Sen Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Caitlyn Heqi Yin, Yizhu Wen, Ming Liu [ABSTRACT]Object-Oriented Programming (OOP) has become a crucial paradigm for managing
the growing complexity of modern software systems, particularly in fields like
machine learning, deep learning, large language models (LLMs), and data
analytics. This work provides a comprehensive introduction to the integration
of OOP techniques within these domains, with a focus on improving code
modularity, maintainability, and scalability. We begin by outlining the
evolution of computing and the rise of OOP, followed by an in-depth discussion
of key OOP principles such as encapsulation, inheritance, polymorphism, and
abstraction. The practical application of these principles is demonstrated
using Python, a widely adopted language in AI and data science. Furthermore, we
examine how design patterns and modular programming can be employed to enhance
the structure and efficiency of machine learning systems. In subsequent
sections, we apply these OOP concepts to real-world AI tasks, including the
encapsulation of preprocessing workflows, machine learning model training, and
evaluation. Detailed examples illustrate how OOP can be used to build reusable,
scalable machine learning systems while maintaining code clarity and reducing
redundancy. This work is intended to serve as a bridge for both beginners and
experienced developers, equipping them with the necessary knowledge to apply
OOP methodologies in AI-driven projects, ultimately fostering the development
of more robust and maintainable systems. [COMMENTS]49 pages [LINK]http://arxiv.org/abs/2409.19916v3 [DATE]2024-12-02 16:56:26+08:00 [CATEGORIES]cs.CL
Indexing Economic Fluctuation Narratives from Keiki Watchers Survey [AUTHORS]Eriko Shigetsugu, Hiroki Sakaji, Itsuki Noda [ABSTRACT]In this paper, we design indices of economic fluctuation narratives derived
from economic surveys. Companies, governments, and investors rely on key
metrics like GDP and industrial production indices to predict economic trends.
However, they have yet to effectively leverage the wealth of information
contained in economic text, such as causal relationships, in their economic
forecasting. Therefore, we design indices of economic fluctuation from economic
surveys by using our previously proposed narrative framework. From the
evaluation results, it is observed that the proposed indices had a stronger
correlation with the cumulative lagging diffusion index than with other types of
diffusion indices. [LINK]http://arxiv.org/abs/2412.01265v1 [DATE]2024-12-02 16:32:02+08:00 [CATEGORIES]cs.CL
Do Large Language Models with Reasoning and Acting Meet the Needs of Task-Oriented Dialogue? [AUTHORS]Michelle Elizabeth, Morgan Veyret, Miguel Couceiro, Ondrej Dusek, Lina M. Rojas-Barahona [ABSTRACT]Large language models (LLMs) gained immense popularity due to their
impressive capabilities in unstructured conversations. However, they
underperform compared to previous approaches in task-oriented dialogue (TOD),
wherein reasoning and accessing external information are crucial. Empowering
LLMs with advanced prompting strategies such as reasoning and acting (ReAct)
has shown promise in solving complex tasks traditionally requiring
reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs
performing TOD. We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation
and with real users. While ReAct-LLMs seem to underperform state-of-the-art
approaches in simulation, human evaluation indicates a higher user satisfaction
rate than handcrafted systems, despite a lower success rate. [LINK]http://arxiv.org/abs/2412.01262v1 [DATE]2024-12-02 16:30:22+08:00 [CATEGORIES]cs.CL
GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering [AUTHORS]Qianlong Li, Chen Huang, Shuai Li, Yuanxin Xiang, Deng Xiong, Wenqiang Lei [ABSTRACT]Complex Table Question Answering involves providing accurate answers to
specific questions based on intricate tables that exhibit complex layouts and
flexible header locations. Despite considerable progress having been made in
the LLM era, the reasoning processes of existing methods are often implicit,
feeding the entire table into prompts, making it difficult to effectively
filter out irrelevant information in the table. To this end, we propose
GraphOTTER that explicitly establishes the reasoning process to pinpoint the
correct answers. In particular, GraphOTTER leverages a graph-based
representation, transforming the complex table into an undirected graph. It
then conducts step-by-step reasoning on the graph, with each step guided by a
set of pre-defined intermediate reasoning actions. As such, it constructs a
clear reasoning path and effectively identifies the answer to a given question.
Comprehensive experiments on two benchmark datasets and two LLM backbones
demonstrate the effectiveness of GraphOTTER. Further analysis indicates that
its success may be attributed to the ability to efficiently filter out
irrelevant information, thereby focusing the reasoning process on the most
pertinent data. Our code and experimental datasets are available at
https://github.com/JDing0521/GraphOTTER. [COMMENTS]COLING 2025, code is available at
https://github.com/JDing0521/GraphOTTER [LINK]http://arxiv.org/abs/2412.01230v1 [DATE]2024-12-02 15:49:23+08:00 [CATEGORIES]cs.CL
GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Model [AUTHORS]Xuanchang Zhang, Zhuosheng Zhang, Hai Zhao [ABSTRACT]Despite the rapid progress of large language models (LLMs), their task
performance remains sensitive to prompt design. Recent studies have explored
leveraging the LLM itself as an optimizer to identify optimal prompts that
maximize task accuracy. However, when evaluating prompts, such approaches
heavily rely on elusive manually annotated gold labels to calculate task
accuracy for each candidate prompt, which hinders the widespread implementation
and generality. To overcome the limitation, this work proposes a gold
label-agnostic prompt evaluation (GLaPE) to alleviate dependence on gold
labels. Motivated by the observed correlation between self-consistency and the
accuracy of the answer, we adopt self-consistency as the initial evaluation
score. Subsequently, we refine the scores of prompts producing identical
answers to be mutually consistent. Experimental results show that GLaPE
provides reliable evaluations consistent with accuracy, even in the absence of
gold labels. Moreover, on six popular reasoning tasks, our GLaPE-based prompt
optimization yields effective prompts comparable to accuracy-based ones. The
code is publicly available at https://github.com/thunderous77/GLaPE. [COMMENTS]EMNLP 2024 [LINK]http://arxiv.org/abs/2402.02408v2 [DATE]2024-12-02 15:47:00+08:00 [CATEGORIES]cs.CLcs.LG
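The initial evaluation score described in the GLaPE abstract above can be illustrated with a minimal sketch. It assumes self-consistency is measured as the fraction of sampled answers that agree with the majority answer; the function name and the sample data are hypothetical and are not taken from the GLaPE repository. The subsequent refinement step (making scores of prompts with identical answers mutually consistent) is not shown.

```python
from collections import Counter

def self_consistency(answers):
    """Return (majority_answer, share of samples that agree with it).

    Assumption: the score is the plain majority-vote share over answers
    sampled for one prompt on one question.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical answers sampled four times for the same prompt and question.
samples = ["42", "42", "41", "42"]
majority, score = self_consistency(samples)
print(majority, score)
```

No gold label enters the computation, which is the property the abstract emphasizes.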
Composition of Experts: A Modular Compound AI System Leveraging Large Language Models [AUTHORS]Swayambhoo Jain, Ravi Raju, Bo Li, Zoltan Csaki, Jonathan Li, Kaizhao Liang, Guoyao Feng, Urmish Thakkar, Anand Sampat, Raghu Prabhakar, Sumati Jairath [ABSTRACT]Large Language Models (LLMs) have achieved remarkable advancements, but their
monolithic nature presents challenges in terms of scalability, cost, and
customization. This paper introduces the Composition of Experts (CoE), a
modular compound AI system leveraging multiple expert LLMs. CoE leverages a
router to dynamically select the most appropriate expert for a given input,
enabling efficient utilization of resources and improved performance. We
formulate the general problem of training a CoE and discuss inherent
complexities associated with it. We propose a two-step routing approach to
address these complexities that first uses a router to classify the input into
distinct categories followed by a category-to-expert mapping to obtain desired
experts. CoE offers a flexible and cost-effective solution to build compound AI
systems. Our empirical evaluation demonstrates the effectiveness of CoE in
achieving superior performance with reduced computational overhead. Given that
CoE comprises many expert LLMs, it has unique system requirements for
cost-effective serving. We present an efficient implementation of CoE
leveraging the SambaNova SN40L RDU's unique three-tiered memory architecture. CoEs
obtained using open weight LLMs Qwen/Qwen2-7B-Instruct, google/gemma-2-9b-it,
google/gemma-2-27b-it, meta-llama/Llama-3.1-70B-Instruct and
Qwen/Qwen2-72B-Instruct achieve a score of $59.4$ with merely $31$ billion
average active parameters on Arena-Hard and a score of $9.06$ with $54$ billion
average active parameters on MT-Bench. [LINK]http://arxiv.org/abs/2412.01868v1 [DATE]2024-12-02 15:43:21+08:00 [CATEGORIES]cs.LGcs.CL
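The two-step routing described in the CoE abstract above (classify the input into a category, then map the category to an expert) might be sketched as follows. Everything here is an assumption made for illustration: the keyword-based classifier stands in for the learned router, and the category and expert names are placeholders, not the paper's configuration.

```python
from typing import Dict

# Hypothetical category-to-expert mapping; the names are placeholders.
CATEGORY_TO_EXPERT: Dict[str, str] = {
    "code": "expert-code",
    "math": "expert-math",
    "general": "expert-general",
}

def classify(prompt: str) -> str:
    """Step 1: toy stand-in for the router (keyword rules, not a model)."""
    text = prompt.lower()
    if "def " in text or "compile" in text:
        return "code"
    if any(tok in text for tok in ("integral", "prove", "equation")):
        return "math"
    return "general"

def route(prompt: str) -> str:
    """Step 2: look up the expert assigned to the predicted category."""
    return CATEGORY_TO_EXPERT[classify(prompt)]

print(route("Prove that the equation has a real root"))
```

Keeping the mapping separate from the classifier means an expert can be swapped for a category without retraining the router, which is one plausible reading of why the abstract splits routing into two steps.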
SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages [AUTHORS]Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, Qian Liu [ABSTRACT]In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian
Languages (SEA). SailCompass encompasses three main SEA languages, eight
primary tasks including 14 datasets covering three task types (generation,
multiple-choice questions, and classification). To improve the robustness of
the evaluation approach, we explore different prompt configurations for
multiple-choice questions and leverage calibrations to improve the faithfulness
of classification tasks. With SailCompass, we derive the following findings:
(1) SEA-specialized LLMs still outperform general LLMs, although the gap has
narrowed; (2) A balanced language distribution is important for developing
better SEA-specialized LLMs; (3) Advanced prompting techniques (e.g.,
calibration, perplexity-based ranking) are necessary to better utilize LLMs.
All datasets and evaluation scripts are public. [COMMENTS]code: https://github.com/sail-sg/sailcompass [LINK]http://arxiv.org/abs/2412.01186v1 [DATE]2024-12-02 14:42:51+08:00 [CATEGORIES]cs.CL
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [AUTHORS]Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu [ABSTRACT]Large Language Models (LLMs) have exhibited remarkable performance on
reasoning tasks. They utilize autoregressive token generation to construct
reasoning trajectories, enabling the development of a coherent chain of
thought. In this work, we explore the impact of individual tokens on the final
outcomes of reasoning tasks. We identify the existence of "critical tokens"
that lead to incorrect reasoning trajectories in LLMs. Specifically, we find
that LLMs tend to produce positive outcomes when forced to decode other tokens
instead of critical tokens. Motivated by this observation, we propose a novel
approach - cDPO - designed to automatically recognize and conduct token-level
rewards for the critical tokens during the alignment process. Specifically, we
develop a contrastive estimation approach to automatically identify critical
tokens. It is achieved by comparing the generation likelihood of positive and
negative models. To achieve this, we separately fine-tune the positive and
negative models on various reasoning trajectories; consequently, they are
capable of identifying critical tokens within incorrect trajectories
that contribute to erroneous outcomes. Moreover, to further align the model
with the critical token information during the alignment process, we extend the
conventional DPO algorithms to token-level DPO and utilize the differential
likelihood from the aforementioned positive and negative models as an important
weight for token-level DPO learning. Experimental results on GSM8K and MATH500
benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math
(7B), demonstrate the effectiveness of the proposed approach cDPO. [COMMENTS]Work in progress [LINK]http://arxiv.org/abs/2411.19943v2 [DATE]2024-12-02 14:26:38+08:00 [CATEGORIES]cs.CLcs.LG
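The contrastive estimation idea in the cDPO abstract above (comparing the generation likelihood of a positive and a negative model) could look roughly like the sketch below. The abstract does not give the formula, so the per-token score used here, log-probability under the positive model minus log-probability under the negative model, is an assumption, as are the function names and the toy numbers.

```python
def contrastive_scores(logp_pos, logp_neg):
    """Per-token score: log p_pos(token) - log p_neg(token).

    Assumption: a strongly negative score marks a token that the negative
    model favors and the positive model avoids, i.e. a candidate critical token.
    """
    if len(logp_pos) != len(logp_neg):
        raise ValueError("log-probability lists must have equal length")
    return [p - n for p, n in zip(logp_pos, logp_neg)]

def most_critical(tokens, logp_pos, logp_neg):
    """Return the token with the lowest contrastive score and that score."""
    scores = contrastive_scores(logp_pos, logp_neg)
    idx = min(range(len(scores)), key=scores.__getitem__)
    return tokens[idx], scores[idx]

# Hypothetical per-token log-probabilities for one incorrect trajectory.
tokens = ["so", "owed", "the", "total"]
logp_pos = [-1.0, -6.0, -0.5, -1.2]
logp_neg = [-1.1, -0.8, -0.6, -1.0]
token, score = most_critical(tokens, logp_pos, logp_neg)
print(token, score)
```

In a real setting the log-probabilities would come from the two fine-tuned models scoring the same trajectory; here they are hard-coded to keep the sketch self-contained.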
A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans [AUTHORS]Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga [ABSTRACT]Recently, much work has concerned itself with the enigma of what exactly PLMs
(pretrained language models) learn about different aspects of language, and how
they learn it. One stream of this type of research investigates the knowledge
that PLMs have about semantic relations. However, many aspects of semantic
relations were left unexplored. Only one relation was considered, namely
hypernymy. Furthermore, previous work did not measure humans' performance on
the same task as that solved by the PLMs. This means that at this point in
time, there is only an incomplete view of models' semantic relation knowledge.
To address this gap, we introduce a comprehensive evaluation framework covering
five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy,
and synonymy. We use six metrics (two newly introduced here) for recently
untreated aspects of semantic relation knowledge, namely soundness,
completeness, symmetry, asymmetry, prototypicality, an