(ICML 2026) MindFlow: Optimizing How Research Ideas Are Produced

Bill Xu

Published on May 2, 2026 • 63 min read

Based on the paper: “MindFlow: Mind Supernet Powered Thinking Flows for Research Idea Innovation.”

TL;DR

Research idea generation is not a simple text-generation problem. A useful research idea must do two things at once: identify a meaningful problem and propose a plausible way to solve it. It also has to balance several competing objectives, including novelty, significance, timeliness, effectiveness, and feasibility.

Most LLM-based ideation systems use fixed workflows. They retrieve papers, generate an idea, critique it, revise it, and return a proposal. These pipelines can be useful, but they assume that every topic should follow the same reasoning path.

MindFlow makes a different claim:

The process used to generate an idea should itself be represented, sampled, evaluated, and optimized.

MindFlow formulates ideation as a graph-structured thinking flow composed of reusable thinking operators, such as divergent thinking, critical thinking, analogical thinking, counterfactual thinking, and constraint-driven thinking. These flows are sampled from a probabilistic “mind supernet,” then improved using tournament-based relative ranking.

The core contribution is not simply generating better research ideas. It is making the ideation process explicit and optimizable.

Across multiple research domains, MindFlow outperforms generation baselines and agentic ideation systems on both LLM-judged and computable evaluation metrics. More importantly, the results show that no single thinking operator is best for every topic. Good ideation requires adaptive composition.

The Problem: Research Ideation Is Open-Ended and Multi-Objective

Research ideation is hard because the target is not fully specified.

In many machine learning tasks, the goal is clear: answer a question, classify an image, predict a label, solve a coding issue, or maximize a benchmark score. Research ideation is different. The system must first decide what problem is worth solving, then propose a technical route that could plausibly solve it.

This creates two challenges.

First, the task is open-ended. There is no fixed answer space. The system must search through possible problem framings, hypotheses, methods, datasets, and evaluation protocols.

Second, the task is multi-objective. A research idea cannot be judged by novelty alone. A highly novel idea may be infeasible. A feasible idea may be too incremental. A timely idea may still lack technical depth. A strong idea must balance several criteria:

novelty;
significance;
timeliness;
effectiveness;
feasibility;
diversity across generated proposals.

This is why many LLM-generated research ideas feel plausible at the surface but weak under inspection. They may sound new but lack a testable method. Or they may propose a reasonable method for a problem that is not important. Or they may be technically interesting but impossible to evaluate with available data.

MindFlow starts from this diagnosis: research ideation is not just a generation problem. It is a process-design problem.

Research ideation is both open-ended and multi-objective. MindFlow differs from fixed ideation pipelines by making the thinking process explicit, controllable, and optimizable.

Why Fixed Ideation Pipelines Are Not Enough

Recent LLM-based systems have improved research ideation with structured prompts, retrieval, revision, and multi-agent collaboration. These systems often follow a hand-designed pipeline:

retrieve related work;
generate an initial idea;
critique the idea;
revise the idea;
produce a final proposal.

This structure is reasonable, but it is static. Every topic goes through roughly the same sequence of steps.

That assumption is too strong.

A theory topic may require constraint-driven reasoning and careful feasibility checks. A robotics topic may need practical stress testing under deployment constraints. A computer vision topic may benefit from analogical transfer from adjacent methods. A speculative multimodal topic may need divergent exploration before convergence.

The paper’s ablation results support this point. No single thinking operator dominates across all dimensions or domains. Counterfactual thinking can improve novelty but may reduce feasibility. Critical thinking can improve significance and feasibility but may reduce novelty. Constraint-driven thinking performs strongly in some domains but not all.

The lesson is clear: a fixed workflow cannot capture the diversity of reasoning strategies needed for research ideation.

MindFlow addresses this by treating the reasoning workflow itself as a search space.

The Main Idea

MindFlow represents research ideation as a graph-structured thinking flow.

A thinking flow is not a prompt chain. It is a directed acyclic graph of reusable cognitive operators. Each operator performs a specific function, and the edges define how intermediate artifacts move from one operator to another.

The paper defines a thinking flow as:

$$ G = (V, E) $$

where:

$V$ is a subset of thinking operators;
$E$ specifies directed information flow between operators.

Executing a flow on a research topic and related works produces a structured research proposal:

$$ y = (y_t, y_p, y_m, y_e) $$

where:

$y_t$ is the title;
$y_p$ is the problem motivation;
$y_m$ is the technical method;
$y_e$ is the evaluation protocol.

This output format matters. MindFlow does not evaluate ideas only by titles or abstracts. It requires the system to generate a fuller proposal: what problem is being solved, why it matters, how the method works, what datasets are used, what metrics are appropriate, and how the result should be validated.
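To make the two definitions above concrete, here is a minimal Python sketch of a flow and its output. The class names `Proposal` and `ThinkingFlow`, the field names, and the use of plain strings for operators are illustrative assumptions, not the paper's implementation. The only property taken from the text is that a flow is a directed acyclic graph over operators and that the output has four parts.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Proposal:
    """Structured output y = (y_t, y_p, y_m, y_e). Field names are hypothetical."""
    title: str        # y_t
    motivation: str   # y_p
    method: str       # y_m
    evaluation: str   # y_e


@dataclass
class ThinkingFlow:
    """A flow G = (V, E): a set of operator names plus directed edges."""
    operators: set[str]
    edges: set[tuple[str, str]]  # (source operator, target operator)

    def execution_order(self) -> list[str]:
        """Return one valid execution order.

        Raises ValueError if an edge names an unknown operator, and
        graphlib.CycleError (a ValueError subclass) if the graph has a cycle.
        """
        unknown = {name for edge in self.edges for name in edge} - self.operators
        if unknown:
            raise ValueError(f"edges reference unknown operators: {sorted(unknown)}")
        # TopologicalSorter expects a mapping from node to its predecessors.
        predecessors = {op: set() for op in self.operators}
        for source, target in self.edges:
            predecessors[target].add(source)
        return list(TopologicalSorter(predecessors).static_order())


flow = ThinkingFlow(
    operators={"generate", "divergent", "critical", "convergent"},
    edges={
        ("generate", "divergent"),
        ("divergent", "critical"),
        ("critical", "convergent"),
    },
)
print(flow.execution_order())
```

Representing the flow as data rather than as a hard-coded prompt chain is what makes it possible to sample, compare, and log different flows for the same topic.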

Thinking Operators

MindFlow builds flows from modular thinking operators. Each operator is an LLM-agent routine with operator-specific prompts, optional tool calls, and multi-step executions.

The operator library includes:

Generate: produces an initial proposal.
Divergent Thinking: expands the search space by producing alternatives.
Convergent Thinking: selects and synthesizes promising elements.
Critical Thinking: stress-tests a proposal and repairs weaknesses.
Analogical Thinking: transfers structure from related problems or methods.
Counterfactual Thinking: changes key assumptions and explores what follows.
Constraint-Driven Thinking: redesigns ideas under hard constraints.
Exit: stops the flow when additional operators are unnecessary.

This operator set is deliberately heterogeneous. It does not assume that creativity is one reasoning mode. It treats ideation as composition: different topics require different combinations of expansion, critique, analogy, constraint handling, and selection.

MindFlow’s operator library includes general generation, divergent thinking, convergent thinking, critical thinking, analogical thinking, counterfactual thinking, constraint-driven thinking, and early exit.

The Mind Supernet

The key mechanism in MindFlow is the mind supernet.

A mind supernet is a probabilistic space of possible thinking flows. Instead of manually choosing one fixed workflow, MindFlow defines layer-wise probabilities over available thinking operators. Given a research topic, the controller samples a concrete flow from this supernet.

The paper defines the supernet as:

$$ \mathcal{M} = \{\pi_\ell \mid \ell = 1, \dots, L\} $$

where each $\pi_\ell(O)$ is the probability that operator $O$ is active at layer $\ell$.

For a topic $x_t$, the probability of selecting an operator depends on the topic and the previous layers:

$$ \pi_\ell(O) = p(O \mid \Pi_{<\ell}, x_t) $$

This means the flow is topic-aware. The system does not only ask, “Which operator is generally useful?” It asks, “Which operator is useful for this topic, at this stage of the reasoning process, given the operators already selected?”

This is the central architectural move in the paper. MindFlow turns ideation from a fixed pipeline into a distribution over possible reasoning processes.

Overall MindFlow framework. A controller samples a topic-conditioned thinking flow from the mind supernet, executes the selected operators, and updates the controller using tournament-based relative ranking.

Sampling a Thinking Flow

Given a topic and related works, the controller samples a flow layer by layer.

At each layer, it assigns probabilities to candidate operators. Operators are then selected and executed according to their scores, subject to a probability threshold. This makes the number of operators per layer adaptive.

MindFlow also includes an early-stop operator. If the controller samples the exit operator, the flow construction stops. This avoids unnecessary reasoning steps and reduces cost.

This matters because longer is not always better. For some topics, additional critique or counterfactual exploration may improve the proposal. For others, it may introduce noise or unnecessary complexity. MindFlow gives the controller a way to decide when enough thinking has already been done.

The objective is not to maximize reasoning length. The objective is to find a useful process for producing a better idea.
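The layer-by-layer procedure can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the paper's algorithm: the post does not spell out the exact selection rule, so the code assumes that operators are drawn without replacement until their cumulative probability reaches the threshold, and that drawing the exit operator discards the current layer and ends construction. The names `sample_layer`, `sample_flow`, and `toy_policy` are hypothetical, and `toy_policy` stands in for a learned, topic-conditioned controller.

```python
import random
from typing import Callable, Optional

EXIT = "exit"

# A policy maps (topic, layers chosen so far) to operator probabilities.
LayerPolicy = Callable[[str, list[list[str]]], dict[str, float]]


def sample_layer(probs: dict[str, float], threshold: float,
                 rng: random.Random) -> list[str]:
    """Draw operators without replacement until cumulative mass >= threshold."""
    remaining = dict(probs)
    chosen: list[str] = []
    mass = 0.0
    while remaining and mass < threshold:
        names = list(remaining)
        weights = [remaining[name] for name in names]
        if sum(weights) <= 0:
            break  # nothing left with positive probability
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        mass += remaining.pop(pick)
    return chosen


def sample_flow(topic: str, policy: LayerPolicy, max_layers: int = 4,
                threshold: float = 0.5,
                rng: Optional[random.Random] = None) -> list[list[str]]:
    """Sample a flow as a list of layers, stopping early on the exit operator."""
    rng = rng or random.Random()
    layers: list[list[str]] = []
    for _ in range(max_layers):
        chosen = sample_layer(policy(topic, layers), threshold, rng)
        if EXIT in chosen:
            break  # assumed rule: exit ends construction
        layers.append(chosen)
    return layers


def toy_policy(topic: str, history: list[list[str]]) -> dict[str, float]:
    """Hand-written stand-in for a learned controller."""
    if len(history) >= 2:
        return {EXIT: 1.0}
    return {"divergent": 0.6, "critical": 0.3, EXIT: 0.1}


print(sample_flow("robot grasping", toy_policy, rng=random.Random(0)))
```

Because the threshold is on cumulative probability, a confident controller selects one operator per layer, while a less certain one selects several. That is one way to get the adaptive layer width described above.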

Optimization Through Tournament-Based Relative Ranking

A major difficulty in research ideation is evaluation.

Pointwise scalar rewards are unreliable for open-ended creative tasks. Asking an LLM judge to assign an absolute score to a research idea often produces noisy and poorly calibrated results. Two ideas may both receive high scores even if one is clearly better. This creates weak supervision for optimization.

MindFlow avoids relying on absolute scoring. It uses tournament-based relative ranking.

For each research topic, the system samples multiple thinking flows and executes them to produce multiple candidate ideas. It then compares each candidate against an anchor idea using an LLM judge. The judge performs pairwise comparisons across six dimensions:

Problem finding:

problem novelty;
significance;
timeliness.

Problem solving:

technical novelty;
effectiveness;
feasibility.

The candidates are ranked by their comparative outcomes. These ranks are converted into normalized advantage signals, regularized by execution cost, and used to update the controller.

This is a practical design choice. Relative comparison is often more stable than absolute scoring. It is easier to decide which of two ideas is stronger on a specific criterion than to assign a calibrated global score to one idea in isolation.
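As a rough illustration of how comparative outcomes can become a training signal, the sketch below turns per-candidate win counts into cost-regularized advantages. The details are assumptions made for illustration: the win count is taken to be the number of the six dimensions on which a candidate beats the anchor, ranks are standardized to zero mean and unit variance, and cost enters as a linear penalty with a hypothetical `cost_weight`. The paper's exact normalization and cost term may differ.

```python
from statistics import mean, pstdev


def rank_scores(wins: list[int]) -> list[float]:
    """Rank candidates by wins (higher is better); ties share the average rank."""
    order = sorted(range(len(wins)), key=lambda i: wins[i])
    ranks = [0.0] * len(wins)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next candidate has the same win count.
        while end + 1 < len(order) and wins[order[end + 1]] == wins[order[start]]:
            end += 1
        average = (start + end) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = average
        start = end + 1
    return ranks


def advantages(wins: list[int], costs: list[float],
               cost_weight: float = 0.1) -> list[float]:
    """Standardize ranks, then subtract a linear execution-cost penalty."""
    ranks = rank_scores(wins)
    center, spread = mean(ranks), pstdev(ranks)
    normalized = [(r - center) / spread if spread > 0 else 0.0 for r in ranks]
    return [a - cost_weight * c for a, c in zip(normalized, costs)]


# Three sampled flows for one topic: wins against the anchor (out of six
# dimensions) and a relative execution cost for each flow.
print(advantages(wins=[6, 2, 4], costs=[3.0, 1.0, 2.0]))
```

Only the ordering of candidates matters here, not the size of the gaps between them. That is the property that makes a rank-based signal less sensitive to a poorly calibrated judge.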

The Evaluation Protocol

MindFlow introduces an evaluation protocol designed around the actual structure of research ideas.

The benchmark is built on AI Idea Bench 2025 and includes research papers from major CV, NLP, and ML conferences. The dataset covers eight domains:

computer vision;
natural language processing;
multimodal learning;
audio and speech;
robotics and control;
science;
general machine learning;
theory.

For each task, the system receives a topic and related works, then generates a structured proposal. The generated proposal is compared with an expert-written reference idea from accepted papers.

The evaluation is split into two stages.

Stage 1: Problem Finding

The motivation is evaluated on:

Novelty: whether the problem is new or framed in a new way.
Significance: whether solving it would matter.
Timeliness: whether the problem is relevant now.

Stage 2: Problem Solving

The method and evaluation plan are evaluated on:

Technical novelty: whether the solution is meaningfully different.
Effectiveness: whether the method addresses the stated problem.
Feasibility: whether the method can realistically be implemented and evaluated.

This two-stage protocol is important. It prevents a system from hiding behind vague novelty. A proposal must both find a meaningful problem and describe a plausible path to solving it.

Multi-Objective Scoring

The paper introduces a multi-objective score, or MOScore, for aggregating dimension-specific scores.

For a three-dimensional score vector $s = (s_1, s_2, s_3)$ and weights $w = (w_1, w_2, w_3)$, the score is:

$$ \mathrm{MOScore}(s; w) = \frac{1}{2} \left(\sum_{i=1}^{3} w_i s_i + \prod_{i=1}^{3} w_i s_i \right) $$

The sum term rewards strong average performance. The product term penalizes imbalance: it shrinks toward zero whenever any single weighted score is close to zero. This is useful for idea evaluation because a strong research idea cannot fail completely on one core dimension.

For example, an idea with high novelty but very low feasibility should not receive the same score as an idea that performs well across novelty, effectiveness, and feasibility. The product term makes that trade-off explicit.
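A short calculation shows the effect. The function below implements the formula as written. The unit weights and the two score vectors are made-up inputs chosen for illustration, since this post does not state the weights the paper uses.

```python
from math import prod


def moscore(scores: tuple[float, ...], weights: tuple[float, ...]) -> float:
    """Half of (weighted sum + weighted product), following the MOScore formula."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    weighted = [w * s for w, s in zip(weights, scores)]
    return 0.5 * (sum(weighted) + prod(weighted))


unit_weights = (1.0, 1.0, 1.0)  # illustrative assumption, not the paper's weights

# Both vectors have the same weighted sum (1.8), so only the product differs.
balanced = moscore((0.6, 0.6, 0.6), unit_weights)
lopsided = moscore((0.9, 0.9, 0.0), unit_weights)

print(f"balanced: {balanced:.3f}")
print(f"lopsided: {lopsided:.3f}")
```

With these inputs the balanced vector scores about 1.008 and the lopsided one about 0.900. The two have identical sum terms, so the entire gap comes from the product term, which is zero as soon as one dimension is zero.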

Main Results

MindFlow is compared against vanilla generation, chain-of-thought generation, AI Scientist, AI-Researcher, and VIRSCI.

Under the LLM-judged protocol, MindFlow achieves the best overall score.

From Table 1:

Generate: 0.305 overall
GenerateCoT: 0.344 overall
AI Scientist: 0.308 overall
AI-Researcher: 0.326 overall
VIRSCI: 0.470 overall
MindFlow: 0.541 overall

MindFlow also achieves the best aggregate MOScore for both stages:

Problem finding MOScore: 0.742
Problem solving MOScore: 0.339

The paper’s analysis is important here. Some baselines are strong on one dimension but weak on another. For example, a system may produce novel motivations but fail to justify significance. Another may produce feasible methods but lose novelty. MindFlow’s advantage comes from better balance across objectives.

LLM-judged evaluation across problem finding and problem solving. MindFlow achieves the best overall score and the strongest aggregate multi-objective scores.

Computable Metrics

The paper also reports computable evaluations for novelty, diversity, effectiveness, and feasibility.

From Table 2:

Method	Novelty	Diversity	Effectiveness	Feasibility
Generate	0.370	0.184	0.642	0.539
GenerateCoT	0.422	0.275	0.626	0.191
AI Scientist	0.352	0.167	0.663	0.172
AI-Researcher	0.474	0.293	0.589	0.672
VIRSCI	0.457	0.281	0.630	0.583
MindFlow	0.541	0.322	0.665	0.735

These results matter because they reduce the chance that MindFlow is only optimized for LLM judge preference. The computable metrics show improvement across multiple independent signals.

MindFlow achieves the best novelty and diversity scores, suggesting that sampling different flows helps avoid repetitive ideation. It also achieves the best effectiveness and feasibility scores, indicating that the generated ideas are not merely unusual; they are also more likely to align with valid methods and realistic execution paths.

Computable evaluation on novelty, diversity, effectiveness, and feasibility. MindFlow leads across all four dimensions.
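To see how large each lead is, the following snippet computes MindFlow's margin over the strongest baseline on every metric. The numbers are copied from the Table 2 values quoted above. The helper name `margins_over_best_baseline` is hypothetical, and the snippet is a reading aid rather than part of the paper's evaluation code.

```python
# Scores copied from Table 2 as quoted in this post.
METRICS = ("novelty", "diversity", "effectiveness", "feasibility")
TABLE_2 = {
    "Generate":      (0.370, 0.184, 0.642, 0.539),
    "GenerateCoT":   (0.422, 0.275, 0.626, 0.191),
    "AI Scientist":  (0.352, 0.167, 0.663, 0.172),
    "AI-Researcher": (0.474, 0.293, 0.589, 0.672),
    "VIRSCI":        (0.457, 0.281, 0.630, 0.583),
    "MindFlow":      (0.541, 0.322, 0.665, 0.735),
}


def margins_over_best_baseline(table, method="MindFlow"):
    """For each metric, return (best baseline name, method score minus its score)."""
    margins = {}
    for index, metric in enumerate(METRICS):
        best_name, best_row = max(
            ((name, row) for name, row in table.items() if name != method),
            key=lambda item: item[1][index],
        )
        margins[metric] = (best_name, round(table[method][index] - best_row[index], 3))
    return margins


for metric, (baseline, margin) in margins_over_best_baseline(TABLE_2).items():
    print(f"{metric:<13} +{margin:.3f} over {baseline}")
```

The margins are uneven: roughly 0.06 to 0.07 on novelty and feasibility, about 0.03 on diversity, and only 0.002 on effectiveness, where AI Scientist is nearly tied.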

Why Adaptive Composition Matters

The operator-wise ablation is one of the most important parts of the paper.

Table 3 shows that different operators produce different trade-offs.

Counterfactual thinking is strong on novelty:

problem-finding novelty: 0.901;
problem-solving novelty: 0.569.

But it performs poorly on feasibility and effectiveness.

Critical thinking improves significance and feasibility:

problem-finding significance: 0.796;
problem-solving feasibility: 0.275.

But it performs poorly on novelty.

Constraint-driven thinking performs well on timeliness and effectiveness:

problem-finding timeliness: 0.706;
problem-solving effectiveness: 0.346.

But it is still weaker than MindFlow on aggregate balance.

MindFlow achieves the best overall score:

MindFlow overall: 0.541;
best isolated operator overall: 0.472.

This supports the paper’s main claim. The benefit does not come from discovering one universally superior thinking mode. It comes from learning how to compose different modes based on the topic.

Operator-wise ablation. No single thinking operator dominates across all dimensions. MindFlow performs best by adaptively composing operators.

Generalization Across Domains

MindFlow is evaluated across eight topic domains. It achieves the best score in most of them, including:

CV;
NLP;
multimodal;
audio and speech;
robotics;
general ML;
theory.

The largest gains appear in domains where reasoning constraints differ sharply from generic ideation. For example, robotics benefits from constraint-aware and feasibility-sensitive flows, while theory benefits from critical and constraint-driven reasoning.

From Table 4, MindFlow reaches:

CV: 0.503;
NLP: 0.498;
Multimodal: 0.480;
Audio & Speech: 0.398;
Robotics: 0.694;
General ML: 0.576;
Theory: 0.660.

This is consistent with the thesis that ideation should be topic-conditioned. A system should not use the same reasoning structure for every research area.

Performance across topic domains. MindFlow generalizes across diverse research areas and shows particularly strong gains in robotics, general ML, and theory.

Interpreting the Learned Flows

MindFlow also provides visibility into the learned reasoning process.

The paper visualizes operator sampling patterns for different topics. These patterns are not uniform. The controller learns topic-dependent preferences:

theory-oriented topics tend to favor critical and constraint-driven thinking;
applied topics are more likely to use analogical and divergent thinking;
the exit operator becomes more probable at later depths;
many flows use mixtures rather than relying on one dominant operator.

This interpretability is valuable. In many LLM ideation systems, the reasoning process is hidden inside prompts and intermediate generations. MindFlow makes the process inspectable. We can see which operators were selected, how they were arranged, and how the generated idea emerged from the flow.

Visualization of MindFlow’s operator sampling process. Different topics induce different thinking flows, showing that the controller learns topic-dependent reasoning strategies.

What This Means for Product Agents

MindFlow is a research-ideation paper, but the product implication is broader.

Many product tasks are also open-ended and multi-objective. A user may ask:

“Help me find a business idea from this market.”

or:

“Turn this product direction into a launchable plan.”

or:

“Analyze this user segment and propose a product strategy.”

These tasks are not solved by a single fixed chain. They require different modes of thinking:

divergent exploration to generate alternatives;
convergent selection to narrow the space;
analogical reasoning to transfer patterns from nearby markets;
critical thinking to detect weak assumptions;
constraint-driven reasoning to keep the plan executable;
counterfactual thinking to test whether the opportunity still holds under changed assumptions.

This is close to the structure MindFlow formalizes.

For Atoms, the lesson is direct: a useful business-building agent should not only produce polished output. It should control the process that produces the output. The system should decide when to explore, when to critique, when to constrain, when to compare, and when to stop.

In business workflows, this matters because users often need more than one answer. They need a reliable process for moving from vague intent to a concrete, testable plan.

MindFlow provides a principled way to think about that process.

A Concrete Example

Suppose a user asks:

“Find a promising AI product idea for independent Shopify merchants.”

A fixed pipeline might retrieve market data, generate a few ideas, rank them, and return a final recommendation.

A MindFlow-style system would treat the process as adaptive.

It might begin with divergent thinking to generate several opportunity areas:

customer support;
inventory forecasting;
review analysis;
abandoned-cart recovery;
creator-style product photography.

Then it might apply constraint-driven thinking:

small merchants have limited budgets;
setup time must be short;
integrations must work with existing Shopify workflows;
the product must create value without requiring large training datasets.

Next, it might use analogical thinking:

transfer patterns from CRM automation;
adapt performance-marketing workflows;
borrow onboarding structures from no-code tools.

Then it might apply critical thinking:

which ideas depend on unavailable data?
which ideas are too easy for incumbents to copy?
which ideas have unclear willingness to pay?
which ideas require too much behavior change?

Finally, it might use convergent thinking to select the strongest candidate and produce a structured product proposal.

The value is not that the system used more steps. The value is that the reasoning path matched the problem.

What MindFlow Does Not Claim

MindFlow does not solve research ideation completely.

The paper is clear about two limitations.

First, research-idea evaluation remains subjective. Even with pairwise judging, multiple dimensions, order randomization, and computable metrics, evaluating new ideas is difficult. There is no perfect automatic judge for scientific value.

Second, the operator set is manually designed. The paper uses a meaningful set of cognitive operators, but it does not yet learn new operators automatically. Future systems may need to discover, refine, and specialize operators from data.

There is also a practical cost issue. Sampling multiple flows, executing operators, and running tournament comparisons require more compute than a single generation pass. This is justified for high-value ideation tasks, but it may not be necessary for simple brainstorming.

The method is strongest when the cost of a weak idea is high and the quality of the process matters.

From Paper to System Design

MindFlow suggests several design principles for building better ideation agents.

1. Represent the Thinking Process Explicitly

Do not hide ideation inside one prompt. Model it as a structured process with named operators and observable intermediate artifacts.

2. Use Different Thinking Modes for Different Topics

No single reasoning pattern is best for every domain. The system should choose operators based on topic, constraints, and previous intermediate outputs.

3. Optimize the Process, Not Only the Output

If a system repeatedly generates weak ideas, the issue may not be wording. The issue may be the reasoning flow that produced them.

4. Evaluate Both Problem Finding and Problem Solving

A strong idea needs a meaningful problem and a plausible solution. Evaluating only titles, abstracts, or novelty misses this structure.

5. Prefer Relative Ranking for Open-Ended Evaluation

For creative tasks, pairwise comparisons are often more stable than absolute scores. Tournament-based ranking gives a usable optimization signal.

6. Penalize Imbalanced Ideas

Novelty without feasibility is not enough. Feasibility without novelty is not enough. Multi-objective scoring should reward balance.

7. Make Ideation Inspectable

Users should be able to see why a system selected a direction, what alternatives were considered, and where the idea was strengthened or rejected.

Conclusion

MindFlow reframes research idea generation as an optimization problem over reasoning processes.

Its central contribution is the mind supernet: a probabilistic space of graph-structured thinking flows composed from modular operators. Given a topic, MindFlow samples a tailored flow, executes it to generate a structured proposal, evaluates candidate ideas through tournament-based relative ranking, and updates the controller to favor stronger flows.

The empirical results show that this approach improves both problem finding and problem solving. It also improves computable measures of novelty, diversity, effectiveness, and feasibility.

The broader lesson is simple: for open-ended work, the process matters. A good agent should not rely on one fixed workflow. It should learn when to explore, when to critique, when to constrain, when to transfer, when to converge, and when to stop.

MindFlow gives that idea a concrete form.