Research Findings: Mind Evolution

1. Core Arguments

Central Thesis: "Mind Evolution" is an evolutionary search strategy that significantly enhances LLM problem-solving in complex natural language planning tasks without formalizing the underlying reasoning problem.
Mechanism: It uses LLMs to generate, recombine, and optimize candidate responses, iterating based on feedback from an evaluator.
Key Outcome: Using Gemini 1.5 Pro, solved >98% of problem instances in the TravelPlanner and Natural Plan benchmarks without a formal solver, outperforming the Best-of-N and Sequential Revision baselines.
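
The generate, evaluate, refine cycle described under Mechanism can be sketched as below. This is a toy illustration, not the paper's implementation: `propose` and `refine` are hypothetical stand-ins for LLM calls, the `REQUIRED` keyword list is an invented constraint set, and recombination across parents is omitted for brevity.

```python
import random
from typing import Tuple

# Toy constraints (an assumption for illustration, not taken from the source).
REQUIRED = ["flight", "hotel", "dinner"]


def evaluate(plan: str) -> Tuple[float, str]:
    """Toy programmatic evaluator: returns a score in [0, 1] and text feedback."""
    missing = [w for w in REQUIRED if w not in plan]
    score = 1.0 - len(missing) / len(REQUIRED)
    feedback = "ok" if not missing else "missing: " + ", ".join(missing)
    return score, feedback


def propose(rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call that drafts an initial candidate."""
    return "Day 1: " + rng.choice(REQUIRED)


def refine(plan: str, feedback: str) -> str:
    """Hypothetical stand-in for an LLM call that revises a candidate from feedback."""
    prefix = "missing: "
    if feedback.startswith(prefix):
        first_missing = feedback[len(prefix):].split(", ")[0]
        return plan + ", " + first_missing
    return plan


def evolve(pop_size: int = 4, generations: int = 5, seed: int = 0) -> Tuple[str, float]:
    """Iterate: evaluate every candidate, stop if one is perfect, else refine all."""
    rng = random.Random(seed)
    population = [propose(rng) for _ in range(pop_size)]
    for _ in range(generations):
        results = [evaluate(p) for p in population]
        if max(score for score, _ in results) == 1.0:
            break
        population = [refine(p, fb) for p, (_, fb) in zip(population, results)]
    best = max(population, key=lambda p: evaluate(p)[0])
    return best, evaluate(best)[0]
```

The point of the sketch is that the evaluator's text feedback, not just its score, drives each revision, so the search never needs a formal encoding of the problem.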

2. Key Findings

TravelPlanner Benchmark:

- Mind Evolution success rate: >95%.
- Sequential-Revision+: ~83%.
- Best-of-N: 55.6%.
- Two-stage approach (Flash then Pro): 99.9% on test set.
- Comparison: GPT-4 + Formal Solver achieved 97.0%.

Natural Plan – Trip Planning:

- Mind Evolution (Pro): 94.1% on test set.
- Two-stage approach: 99.6% on test set.

Natural Plan – Meeting Planning:

- Mind Evolution (Pro): 83.8% on test set.
- Two-stage approach: 98.2% on test set.

StegPoet Benchmark:

- A new benchmark for embedding hidden information in creative writing.
- Mind Evolution (Gemini 1.5 Pro): 87% success rate.
- Best-of-N: Only 1% success rate on validation tasks.
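
These notes do not say how a StegPoet solution is checked. Purely as a hypothetical illustration of what a programmatic evaluator for a hidden-message task could look like, the sketch below assumes the message is a list of code words that must appear in the text in order. The function name `check_hidden_message` and this encoding scheme are assumptions, not the benchmark's actual definition.

```python
import re
from typing import List, Tuple


def check_hidden_message(text: str, code_words: List[str]) -> Tuple[float, str]:
    """Return (fraction of code words found in order, text feedback).

    Assumed scheme: each code word must occur as a whole token, and later
    code words must occur after earlier ones.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = 0      # index from which the next code word is searched
    found = 0    # number of code words matched so far
    for word in code_words:
        try:
            pos = tokens.index(word.lower(), pos) + 1
            found += 1
        except ValueError:
            feedback = (
                f"code word {found + 1} ('{word}') not found "
                f"after token position {pos}"
            )
            return found / len(code_words), feedback
    return 1.0, "all code words found in order"
```

An evaluator of this shape returns both a score and a short textual explanation of the first failure, which is the kind of feedback a refinement step can act on.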

Ablation Studies:

- The "Critic" step in RCC and text feedback are crucial.
- The Island Model significantly improves performance by maintaining diversity.

3. Methodology

Algorithm: Genetic-algorithm-based search in natural language space.
Components:

- Population Evolution:
  - Initialization: Generate N_convs solutions. If N_seq > 1, refine via RCC for N_seq - 1 rounds.
  - Selection: Boltzmann Tournament Selection based on fitness scores (softmax distribution).
  - Update: N_convs × N_seq offspring added per generation; duplicates removed.
- Island Model:
  - Sub-populations evolve independently.
  - Migration: Top N_emigrate solutions cloned to the next island cyclically.
  - Island Reset: Every N_reset_interval generations, reset the worst-performing islands with the global best individual.
- Mutation/Crossover (Recombination):
  - Implemented as a single step using RCC (Refinement through Critical Conversation).
  - LLM acts as "Critic" (analyzes feedback) and "Author" (proposes refinement).
  - Recombination integrates evaluations from multiple parents.
- Fitness Evaluation:
  - Programmatic parsing and evaluation.
  - Functions: Score optimization goals, verify constraints, provide text feedback.
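
The selection and island steps listed above can be sketched as follows. This is a minimal sketch under stated assumptions: parents are sampled with replacement from a softmax over fitness (a simplification of Boltzmann tournament selection), islands are assumed non-empty, "worst-performing" is taken to mean lowest mean fitness, and the names `softmax_probs`, `select_parents`, `migrate`, `reset_worst`, `temperature`, and `n_reset` are illustrative rather than taken from the source.

```python
import math
import random
from typing import List, Tuple

Individual = Tuple[str, float]  # (solution text, fitness score)


def softmax_probs(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over fitness scores; subtracting the max keeps exp() stable."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_parents(island: List[Individual], k: int, rng: random.Random,
                   temperature: float = 1.0) -> List[Individual]:
    """Sample k parents (with replacement), favouring higher fitness."""
    probs = softmax_probs([fit for _, fit in island], temperature)
    return rng.choices(island, weights=probs, k=k)


def migrate(islands: List[List[Individual]], n_emigrate: int) -> List[List[Individual]]:
    """Clone each island's top n_emigrate individuals into the next island, cyclically."""
    emigrants = [
        sorted(isl, key=lambda ind: ind[1], reverse=True)[:n_emigrate]
        for isl in islands
    ]
    result = []
    for i, isl in enumerate(islands):
        merged = list(isl)
        for ind in emigrants[(i - 1) % len(islands)]:
            if ind not in merged:  # skip duplicates
                merged.append(ind)
        result.append(merged)
    return result


def reset_worst(islands: List[List[Individual]], n_reset: int) -> List[List[Individual]]:
    """Replace the n_reset islands with the lowest mean fitness by the global best."""
    global_best = max((ind for isl in islands for ind in isl), key=lambda ind: ind[1])
    means = [sum(fit for _, fit in isl) / len(isl) for isl in islands]
    worst = set(sorted(range(len(islands)), key=lambda i: means[i])[:n_reset])
    return [[global_best] if i in worst else list(isl)
            for i, isl in enumerate(islands)]
```

In a full loop, `select_parents` would feed the RCC recombination step, `migrate` would run each generation, and `reset_worst` would run every N_reset_interval generations.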

4. Limitations

Domain Constraint: Currently focuses on natural language planning where solutions can be programmatically evaluated.
Future Work: Extending to broader domains by developing LLM-based evaluators.

5. References

Chain-of-Thought: J. Wei et al. (2022)
Self-Consistency: X. Wang et al. (2023)
Reflexion: N. Shinn et al. (2024) (Sequential Revision baseline)
Tree of Thoughts: S. Yao et al. (2023)
Best-of-N: B. Brown et al. (2024)
Program Search: B. Romera-Paredes et al. (2024)
Benchmarks: TravelPlanner (J. Xie et al., 2024), Natural Plan (H. S. Zheng et al., 2024).
Steganography: N. Provos and P. Honeyman (2003).