publications | Rishi Hazra

2025

NeurIPS 2025 (Spotlight)

AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Edan Toledo, Karen Hambardzumyan, Martin Josifoski, and 22 more authors

2025

AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents’ performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
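
The view of an agent as a search policy that picks a candidate solution and modifies it with an operator can be sketched on a toy problem. This is only an illustration under stated assumptions, not the paper's implementation: `evaluate`, `op_perturb`, `op_reset`, and `greedy_search` are hypothetical names, and a list of floats stands in for a machine learning solution.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:
    solution: List[float]
    score: float

# Hypothetical operator type: maps a solution to a modified solution.
Operator = Callable[[List[float], random.Random], List[float]]

def evaluate(solution: List[float]) -> float:
    # Toy stand-in for a validation score: higher is better, maximum 0 at all ones.
    return -sum((x - 1.0) ** 2 for x in solution)

def op_perturb(solution: List[float], rng: random.Random) -> List[float]:
    # Local-edit operator: nudge one coordinate of the parent solution.
    i = rng.randrange(len(solution))
    child = list(solution)
    child[i] += rng.uniform(-0.5, 0.5)
    return child

def op_reset(solution: List[float], rng: random.Random) -> List[float]:
    # From-scratch operator: ignore the parent and draft a fresh solution.
    return [rng.uniform(-2.0, 2.0) for _ in solution]

def greedy_search(operators: List[Operator], steps: int, dim: int = 3, seed: int = 0) -> Node:
    rng = random.Random(seed)
    start = [0.0] * dim
    nodes = [Node(start, evaluate(start))]
    for _ in range(steps):
        parent = max(nodes, key=lambda n: n.score)  # greedy policy: expand the best node so far
        op = rng.choice(operators)                  # assumption: operators are chosen uniformly
        child = op(parent.solution, rng)
        nodes.append(Node(child, evaluate(child)))
    return max(nodes, key=lambda n: n.score)

if __name__ == "__main__":
    best = greedy_search([op_perturb, op_reset], steps=200)
    print(best.score, best.solution)
```

Swapping the `max(...)` parent selection for a tree-based or population-based rule, or changing the operator list, gives a way to vary the policy and the operator set independently in this toy setting.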

NeurIPS 2025 DB Track

LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

Periklis Mantenoglou, Rishi Hazra, Pedro Zuidberg Dos Martires, and 1 more author

2025

Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. To deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LexiCon, a natural language-based (Lexi) constrained (Con) planning benchmark consisting of a suite of environments that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LexiCon is to take existing planning environments and impose temporal constraints on the states. These constrained problems are then translated into natural language and given to an LLM to solve. A key feature of LexiCon is its extensibility: the set of supported environments can be extended with new (unconstrained) environment generators, for which temporal constraints are constructed automatically. This renders LexiCon future-proof, since the hardness of the generated planning problems can be increased as the planning capabilities of LLMs improve. Our experiments reveal that the performance of state-of-the-art LLMs, including reasoning models like GPT-5, o3, and R1, deteriorates as the degree of constrainedness of the planning tasks increases.

COLM 2025

Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition

Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, and 1 more author

Second Conference on Language Modeling (COLM), 2025

Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. However, recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit to statistical features. To study the reasoning capabilities in a principled fashion, we adopt a computational theory perspective and propose an experimental protocol centered on 3-SAT – the prototypical NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks. Specifically, we examine the phase transitions in random 3-SAT and characterize the reasoning abilities of state-of-the-art LLMs by varying the inherent hardness of the problem instances. By comparing DeepSeek R1 with other LLMs, our findings reveal two key insights: (1) LLM accuracy drops significantly on harder instances, suggesting that all current models struggle when statistical shortcuts are unavailable; and (2) unlike other LLMs, R1 shows signs of having learned the underlying reasoning. Following a principled experimental protocol, our study moves beyond the benchmark-driven evidence often found in LLM reasoning research. Our findings highlight important gaps and suggest clear directions for future research.
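
Varying the hardness of random 3-SAT is commonly done through the clause-to-variable ratio alpha. Instances are mostly satisfiable at low alpha and mostly unsatisfiable at high alpha, with the empirical threshold for large random 3-SAT usually reported near alpha ≈ 4.27. The sketch below is a minimal illustration, assuming a uniform clause sampler and a brute-force checker that is only practical for small instances. The function names are hypothetical and this is not the paper's experimental code.

```python
import itertools
import random

def random_3sat(n_vars, alpha, rng):
    """Sample a random 3-SAT formula with round(alpha * n_vars) clauses.

    Literals are nonzero integers: +v means variable v, -v means its negation.
    """
    n_clauses = round(alpha * n_vars)
    formula = []
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), 3)  # three distinct variables
        clause = tuple(v if rng.random() < 0.5 else -v for v in variables)
        formula.append(clause)
    return formula

def is_satisfiable(formula, n_vars):
    """Brute-force satisfiability check over all 2**n_vars assignments."""
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in formula):
            return True
    return False

def sat_fraction(n_vars, alpha, trials, seed=0):
    """Fraction of sampled formulas at ratio alpha that are satisfiable."""
    rng = random.Random(seed)
    hits = sum(is_satisfiable(random_3sat(n_vars, alpha, rng), n_vars)
               for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for alpha in (2.0, 4.26, 7.0):
        print(alpha, sat_fraction(n_vars=10, alpha=alpha, trials=50))
```

At only ten variables the transition is smooth rather than sharp, so the printed fractions indicate the trend only.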

ICLR 2025

REvolve: Reward Evolution with Large Language Models using Human Feedback

Rishi Hazra*, Alkis Sygkounas*, Andreas Persson, and 2 more authors

International Conference on Learning Representations (ICLR), 2025

Designing effective reward functions is crucial to training reinforcement learning (RL) algorithms. However, this design is non-trivial, even for domain experts, due to the subjective nature of certain tasks that are hard to quantify explicitly. In recent works, large language models (LLMs) have been used for reward generation from natural language task descriptions, leveraging their extensive instruction tuning and commonsense understanding of human behavior. In this work, we hypothesize that LLMs, guided by human feedback, can be used to formulate reward functions that reflect human implicit knowledge. We study this in three challenging settings – autonomous driving, humanoid locomotion, and dexterous manipulation – wherein notions of “good” behavior are tacit and hard to quantify. To this end, we introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in RL. REvolve generates and refines reward functions by utilizing human feedback to guide the evolution process, effectively translating implicit human knowledge into explicit reward functions for training (deep) RL agents. Experimentally, we demonstrate that agents trained on REvolve-designed rewards outperform other state-of-the-art baselines.

ICLR 2025

Can Large Language Models Reason? A Characterization via 3-SAT

Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, and 1 more author

Planning and Reasoning for LLMs @ International Conference on Learning Representations (ICLR), 2025

Large Language Models (LLMs) are said to possess advanced reasoning abilities. However, some skepticism exists as recent works show how LLMs often bypass true reasoning using shortcuts. Current methods for assessing the reasoning abilities of LLMs typically rely on open-source benchmarks that may be overrepresented in LLM training data, potentially skewing performance. We instead provide a computational theory perspective of reasoning, using 3-SAT – the prototypical NP-complete problem that lies at the core of logical reasoning and constraint satisfaction tasks. By examining the phase transitions in 3-SAT, we empirically characterize the reasoning abilities of LLMs and show how they vary with the inherent hardness of the problems. Our experimental evidence shows that LLMs cannot perform true reasoning, as is required for solving 3-SAT problems.

HRI 2025

LLM-Driven Adaptability or Pre-programmed Efficiency? A Comparative Study for Short Interactions

Tim Schreiter*, Jens V. Ruppel*, Rishi Hazra, and 3 more authors

Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction, 2025

To achieve natural and intuitive interaction with people, HRI frameworks combine a wide array of methods for human perception, intention communication, human-aware navigation and collaborative action. In practice, when encountering unpredictable behavior of people or unexpected states of the environment, these frameworks may lack the ability to dynamically recognize such states, adapt and recover to resume the interaction. Large Language Models (LLMs), owing to their advanced reasoning capabilities and context retention, present a promising solution for enhancing robot adaptability. This potential, however, may not directly translate to improved interaction metrics. This paper considers a representative interaction with an industrial robot involving approach, instruction, and object manipulation, implemented in two conditions: (1) fully scripted and (2) including LLM-enhanced responses. We use gaze tracking and questionnaires to measure the participants’ task efficiency, engagement, and robot perception. The results indicate higher subjective ratings for the LLM condition, but objective metrics show that the scripted condition performs comparably, particularly in efficiency and focus during simple tasks. We also note that the scripted condition may have an edge over LLM-enhanced responses in terms of response latency and energy consumption, especially for trivial and repetitive interactions.

2024

RO-MAN 2024

Bidirectional Intent Communication: A Role for Large Foundation Models

Tim Schreiter*, Rishi Hazra*, Jens Ruppel, and 1 more author

Large Language Models in the RoMan Age @ IEEE International Conference on Robot & Human Interactive Communication, 2024

Integrating multimodal foundation models has significantly enhanced autonomous agents’ language comprehension, perception, and planning capabilities. However, while existing works adopt a task-centric approach with minimal human interaction, applying these models to developing assistive user-centric robots that can interact and cooperate with humans remains underexplored. This paper introduces “Bident”, a framework designed to integrate robots seamlessly into shared spaces with humans. Bident enhances the interactive experience by incorporating multimodal inputs like speech and user gaze dynamics. Furthermore, Bident supports verbal utterances and physical actions like gestures, making it versatile for bidirectional human-robot interactions. Potential applications include personalized education, where robots can adapt to individual learning styles and paces, and healthcare, where robots can offer personalized support, companionship, and everyday assistance in home and workplace environments.

AAAI 2024

SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge

Rishi Hazra, Pedro Zuidberg Dos Martires, and Luc De Raedt

Association for the Advancement of Artificial Intelligence (AAAI), 2024

Large Language Models (LLMs) have demonstrated impressive planning abilities due to their vast world knowledge. Yet, obtaining plans that are both feasible (grounded in affordances) and cost-effective (in plan length) remains a challenge, despite recent progress. This contrasts with heuristic planning methods that employ domain knowledge (formalized in action models such as PDDL) and heuristic search to generate feasible, optimal plans. Inspired by this, we propose to combine the power of LLMs and heuristic planning by leveraging the world knowledge of LLMs and the principles of heuristic search. Our approach, SayCanPay, employs LLMs to generate actions (Say) guided by learnable domain knowledge that evaluates actions’ feasibility (Can) and long-term reward/payoff (Pay), and uses heuristic search to select the best sequence of actions. Our contributions are (1) a novel framing of the LLM planning problem in the context of heuristic planning, (2) integrating grounding and cost-effective elements into the generated plans, and (3) using heuristic search over actions. Our extensive evaluations show that our model surpasses other LLM planning approaches.
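
The combination of Say, Can, and Pay scores with heuristic search can be illustrated with a small beam search. This is a sketch under loud assumptions: the three scores are multiplied, the scorers are hand-written stand-ins for learned models, and the door-and-key domain and all names (`say`, `can`, `pay`, `beam_search`) are hypothetical. How the paper actually combines the scores may differ.

```python
import math

ACTIONS = ("pick key", "open door", "wander")

def say(history, action):
    # Stand-in for an LLM's probability of proposing `action` next.
    return {"pick key": 0.3, "open door": 0.5, "wander": 0.2}[action]

def can(history, action):
    # Stand-in feasibility score: the door needs the key; the key is picked once.
    if action == "open door" and "pick key" not in history:
        return 0.01
    if action == "pick key" and "pick key" in history:
        return 0.01
    return 1.0

def pay(history, action):
    # Stand-in payoff score: wandering makes little progress toward the goal.
    return 0.1 if action == "wander" else 0.9

def beam_search(actions, say, can, pay, horizon, beam_width):
    """Keep the `beam_width` best partial plans, ranked by summed log-scores."""
    beams = [((), 0.0)]  # (plan so far, cumulative log-score)
    for _ in range(horizon):
        candidates = []
        for history, logp in beams:
            for action in actions:
                p = say(history, action) * can(history, action) * pay(history, action)
                if p > 0:
                    candidates.append((history + (action,), logp + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

if __name__ == "__main__":
    plan, logp = beam_search(ACTIONS, say, can, pay, horizon=2, beam_width=2)
    print(plan, math.exp(logp))
```

In this toy setup the Say score alone would open the door first. The Can score rules that out until the key is picked, and the Pay score steers the search away from wandering.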

2023

ICCV 2023

EgoTV: Egocentric Task Verification from Natural Language Task Descriptions

Rishi Hazra, Brian Chen, Akshara Rai, and 2 more authors

International Conference on Computer Vision (ICCV), 2023

To enable progress towards egocentric agents capable of understanding everyday tasks specified in natural language, we propose a benchmark and a synthetic dataset called Egocentric Task Verification (EgoTV). EgoTV contains multi-step tasks with multiple sub-task decompositions, state changes, object interactions, and sub-task ordering constraints, in addition to abstracted task descriptions that contain only partial details about ways to accomplish a task. We also propose a novel Neuro-Symbolic Grounding (NSG) approach to enable the causal, temporal, and compositional reasoning of such tasks. We demonstrate NSG’s capability towards task tracking and verification on our EgoTV dataset and a real-world dataset derived from CrossTask (CTV). Our contributions include the release of the EgoTV and CTV datasets, and the NSG model for future research on egocentric assistive agents.

ECML PKDD 2023

Deep Explainable Relational Reinforcement Learning: A Neuro-Symbolic Approach

Rishi Hazra and Luc De Raedt

European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2023

Despite numerous successes in Deep Reinforcement Learning (DRL), the learned policies are not interpretable. Moreover, since DRL does not exploit symbolic relational representations, it has difficulties in coping with structural changes in its environment (such as increasing the number of objects). Relational Reinforcement Learning, on the other hand, inherits the relational representations from symbolic planning to learn reusable policies. However, it has so far been unable to scale up and exploit the power of deep neural networks. We propose Deep Explainable Relational Reinforcement Learning (DERRL), a framework that exploits the best of both worlds, neural and symbolic. By resorting to a neuro-symbolic approach, DERRL combines relational representations and constraints from symbolic planning with deep learning to extract interpretable policies. These policies are in the form of logical rules that explain how each decision (or action) is arrived at. Through several experiments, in setups like the Countdown Game, Blocks World, Gridworld, and Traffic, we show that the policies learned by DERRL can be applied to different configurations and contexts, hence generalizing to environmental modifications.

2021

NAACL 2021

Intrinsically Motivated Compositional Language Emergence

Rishi Hazra, Sonu Dixit, and Sayambhu Sen

Fourth workshop on Visually Grounded Interaction and Language (ViGIL) @ NAACL, 2021

Recently, there has been a great deal of research in emergent communication on artificial agents interacting in simulated environments. Recent studies have revealed that, in general, emergent languages do not follow the compositionality patterns of natural language. To deal with this, existing works have proposed a limited channel capacity as an important constraint for learning highly compositional languages. In this paper, we show that this is not a sufficient condition and propose an intrinsic reward framework for improving compositionality in emergent communication. We use a reinforcement learning setting with two agents – a task-aware Speaker and a state-aware Listener that are required to communicate to perform a set of tasks. Through our experiments on three different referential game setups, including a novel environment gComm, we show that intrinsic rewards improve compositionality scores by a factor of 1.5 to 2 relative to existing frameworks that use limited channel capacity.

NAACL 2021

gComm: An environment for investigating generalization in Grounded Language Acquisition

Rishi Hazra* and Sonu Dixit*

Fourth workshop on Visually Grounded Interaction and Language (ViGIL) @ NAACL, 2021

gComm is a step towards developing a robust platform to foster research in grounded language acquisition in a more challenging and realistic setting. It comprises a 2-D grid environment with a set of agents (a stationary speaker and a mobile listener connected via a communication channel) exposed to a continuous array of tasks in a partially observable setting. The key to solving these tasks lies in agents developing linguistic abilities and utilizing them for efficiently exploring the environment. The speaker and listener have access to information provided in different modalities, i.e., the speaker’s input is a natural language instruction that contains the target and task specifications, and the listener’s input is its grid-view. Each must rely on the other to complete the assigned task, and the only way they can do so is to develop and use some form of communication. gComm provides several tools for studying different forms of communication and assessing their generalization.

NAACL 2021

Active^2 Learning: Actively reducing redundancies in Active Learning methods for Sequence Tagging and Machine Translation

Rishi Hazra, Parag Dutta, Shubham Gupta, and 2 more authors

In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Jun 2021

While deep learning is a powerful tool for natural language processing (NLP) problems, successful solutions to these problems rely heavily on large amounts of annotated samples. However, manually annotating data is expensive and time-consuming. Active Learning (AL) strategies reduce the need for huge volumes of labeled data by iteratively selecting a small number of examples for manual annotation based on their estimated utility in training the given model. In this paper, we argue that since AL strategies choose examples independently, they may select similar examples, not all of which contribute significantly to the learning process. Our proposed approach, Active^2 Learning (A^2L), actively adapts to the deep learning model being trained to eliminate such redundant examples chosen by an AL strategy. We show that A^2L is widely applicable by using it in conjunction with several different AL strategies and NLP tasks. We empirically demonstrate that the proposed approach further reduces the data requirements of state-of-the-art AL strategies by ≈ 3-25% on an absolute scale on multiple NLP tasks while achieving the same performance with virtually no additional computation overhead.
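
The redundancy argument can be made concrete with a simple filter over the batch an AL strategy has already selected. Note that this sketch swaps in a fixed cosine-similarity threshold over given embeddings, whereas the abstract describes an approach that adapts to the model being trained, so it should be read as a simplified stand-in rather than the proposed method. `cosine`, `filter_redundant`, and the threshold value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def filter_redundant(selected, threshold=0.95):
    """Drop near-duplicates from an AL-selected batch.

    `selected` is a list of (example_id, utility, embedding) tuples.
    Examples are visited in decreasing utility; one is kept only if it is
    less similar than `threshold` to every example already kept.
    """
    kept = []
    for example in sorted(selected, key=lambda e: e[1], reverse=True):
        if all(cosine(example[2], other[2]) < threshold for other in kept):
            kept.append(example)
    return kept

if __name__ == "__main__":
    batch = [
        ("a", 0.9, [1.0, 0.0]),
        ("b", 0.8, [0.99, 0.01]),  # nearly identical to "a"
        ("c", 0.7, [0.0, 1.0]),
    ]
    print([example_id for example_id, _, _ in filter_redundant(batch)])
```

Only the kept examples would then be sent for annotation, which is where a reduction in labeling cost would come from in this simplified picture.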

2020

AAMAS 2020

Networked Multi-Agent Reinforcement Learning with Emergent Communication

Shubham Gupta*, Rishi Hazra*, and Ambedkar Dukkipati

In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Auckland, New Zealand, Jun 2020

We develop a Multi-Agent Reinforcement Learning (MARL) method that finds approximately optimal policies for cooperative agents that co-exist in an environment. Central to achieving this is how the agents learn to communicate with each other. Can they together develop a language while learning to perform a common task? We formulate and study a MARL problem where cooperative agents are connected via a fixed underlying network. These agents communicate along the edges of this network by exchanging discrete symbols. However, the semantics of these symbols are not predefined and have to be learned during the training process. We propose a method for training these agents using emergent communication. We demonstrate the applicability of the proposed framework by applying it to the problem of managing traffic controllers, where we achieve state-of-the-art performance (as compared to several strong baselines) and perform a detailed analysis of the emergent communication.