Research Findings: Mind Evolution
1. Core Arguments
- Central Thesis: "Mind Evolution" is an evolutionary search strategy that significantly enhances LLM problem-solving in complex natural language planning tasks without formalizing the underlying reasoning problem.
- Mechanism: It uses LLMs to generate, recombine, and optimize candidate responses, iterating based on feedback from an evaluator.
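- Key Outcome: Solved >98% of problem instances on the TravelPlanner and Natural Plan benchmarks without a formal solver, outperforming Best-of-N and Sequential Revision. The >98% figure matches the two-stage (Flash then Pro) results listed under Key Findings; the single-stage success rates reported there are lower.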
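
The generate, evaluate, refine cycle described under Mechanism can be illustrated with a toy loop. This is a minimal sketch under loud assumptions, not the Mind Evolution implementation: `propose` and `refine` are hypothetical stand-ins for LLM calls, the "task" is matching a target string, and selection simply keeps the better half of the population.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def evaluate(candidate, target):
    """Toy programmatic evaluator: returns (score, feedback).

    The score is minus the number of wrong positions (0 means solved);
    the feedback is the list of wrong positions.
    """
    wrong = [i for i, (c, t) in enumerate(zip(candidate, target)) if c != t]
    return -len(wrong), wrong

def propose(length, rng):
    """Hypothetical stand-in for an LLM drafting an initial candidate."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def refine(candidate, wrong, rng):
    """Hypothetical stand-in for an LLM revising a candidate from feedback."""
    if not wrong:
        return candidate
    i = rng.choice(wrong)  # use the feedback to pick a position to revise
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]

def evolve(target, pop_size=8, generations=200, seed=0):
    """Run the toy loop; return the best candidate and the best score per generation."""
    rng = random.Random(seed)
    population = [propose(len(target), rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scored = sorted(population, key=lambda c: evaluate(c, target)[0], reverse=True)
        history.append(evaluate(scored[0], target)[0])
        if history[-1] == 0:  # every position is correct, so stop early
            break
        survivors = scored[: pop_size // 2]  # keep the better half unchanged
        children = [refine(c, evaluate(c, target)[1], rng) for c in survivors]
        population = survivors + children
    best = max(population, key=lambda c: evaluate(c, target)[0])
    return best, history
```

Because the better half of each generation is carried over unchanged, the best score in `history` never decreases. Calling `evolve("plan trip")` returns the best string found and that score history.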
2. Key Findings
- TravelPlanner Benchmark:
- Mind Evolution success rate: >95%.
- Sequential-Revision+: ~83%.
- Best-of-N: 55.6%.
- Two-stage approach (Flash then Pro): 99.9% on test set.
- Comparison: GPT-4 + Formal Solver achieved 97.0%.
- Natural Plan – Trip Planning:
- Mind Evolution (Pro): 94.1% on test set.
- Two-stage approach: 99.6% on test set.
- Natural Plan – Meeting Planning:
- Mind Evolution (Pro): 83.8% on test set.
- Two-stage approach: 98.2% on test set.
- StegPoet Benchmark:
- A new benchmark for embedding hidden information in creative writing.
- Mind Evolution (Gemini 1.5 Pro): 87% success rate.
- Best-of-N: Only 1% success rate on validation tasks.
- Ablation Studies:
- The "Critic" step in RCC (Refinement through Critical Conversation) and the evaluator's text feedback are both crucial.
- The Island Model significantly improves performance by maintaining diversity.
3. Methodology
- Algorithm: Genetic-algorithm-based search in natural language space.
- Components:
- Population Evolution:
- Initialization: Generate N_convs solutions. If N_seq > 1, refine via RCC for N_seq - 1 rounds.
- Selection: Boltzmann Tournament Selection based on fitness scores (softmax distribution).
- Update: N_convs × N_seq offspring added per generation; duplicates removed.
- Island Model:
- Sub-populations evolve independently.
- Migration: Top N_emigrate solutions cloned to the next island cyclically.
- Island Reset: Every N_reset_interval generations, reset the worst-performing islands with the global best individual.
- Mutation/Crossover (Recombination):
- Implemented as a single step using RCC (Refinement through Critical Conversation).
- LLM acts as "Critic" (analyzes feedback) and "Author" (proposes refinement).
- Recombination integrates evaluations from multiple parents.
- Fitness Evaluation:
- Programmatic parsing and evaluation.
- Functions: Score optimization goals, verify constraints, provide text feedback.
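
The selection and migration steps above can be sketched as follows. This is a minimal illustration, assuming details the notes do not give: parents are sampled directly from a softmax over fitness scores (the exact tournament procedure is not specified here), the `temperature` parameter is hypothetical, islands are lists of `(score, solution)` pairs, and all islands migrate at once from a snapshot.

```python
import math
import random

def boltzmann_probs(scores, temperature=1.0):
    """Softmax over fitness scores: a higher score gets a higher probability.

    `temperature` is an assumed knob; the notes do not specify one.
    """
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def select_parents(population, scores, k, rng, temperature=1.0):
    """Sample k parents, with replacement, from the softmax distribution."""
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(population, weights=probs, k=k)

def migrate(islands, n_emigrate):
    """Clone the top n_emigrate (score, solution) pairs of each island
    onto the next island, wrapping from the last island to the first."""
    emigrants = [
        sorted(island, key=lambda pair: pair[0], reverse=True)[:n_emigrate]
        for island in islands
    ]
    n = len(islands)
    # Island i receives clones from island i - 1 (cyclically).
    return [islands[i] + emigrants[(i - 1) % n] for i in range(n)]
```

For example, `select_parents(["a", "b", "c"], [0.0, 1.0, 2.0], k=4, rng=random.Random(0))` draws four parents with "c" the most likely pick. Lowering `temperature` concentrates selection on the fittest candidates, while raising it keeps more diversity in the parent pool.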
4. Limitations
- Domain Constraint: Currently focuses on natural language planning where solutions can be programmatically evaluated.
- Future Work: Extending to broader domains by developing LLM-based evaluators.
5. References
- Chain-of-Thought: J. Wei et al. (2022)
- Self-Consistency: X. Wang et al. (2023)
- Reflexion: N. Shinn et al. (2024) (Sequential Revision baseline)
- Tree of Thoughts: S. Yao et al. (2023)
- Best-of-N: B. Brown et al. (2024)
- Program Search: B. Romera-Paredes et al. (2024)
- Benchmarks: TravelPlanner (J. Xie et al., 2024), Natural Plan (H. S. Zheng et al., 2024).
- Steganography: N. Provos and P. Honeyman (2003).