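SWE-Factory Deep Dive: An Automated Factory for Building GitHub Issue Resolution Datasets
> In one sentence: SWE-Factory, open-sourced jointly by Sun Yat-sen University, Huawei, and other institutions, is the first automated pipeline for building multi-language GitHub issue resolution benchmarks. It combines the multi-agent system SWE-Builder with exit-code-based automatic validation, cutting dataset construction cost to as little as $0.024 per instance.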
---
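Contents
1. Background and Motivation
2. SWE-Factory Core Architecture
3. The SWE-Builder Multi-Agent System
4. Exit-Code-Based Automatic Validation
5. Experimental Evaluation and Results
6. The Error2Pass Phenomenon
7. Comparison with Related Work
8. Use Cases and Value
9. Summary and Outlook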
---
Background and Motivation
Why GitHub Issue Resolution Matters
GitHub issue resolution is a core software engineering task that covers fixing real-world defects (bug fixes) and adding functionality (feature enhancements). It has become a key benchmark for evaluating the software engineering ability of large language models (LLMs).
Representative benchmarks:
- SWE-bench (2023): 2,294 Python issues; the most widely used evaluation benchmark
- SWE-bench Verified (2024): 500 human-validated instances
- OmniGIRL (2025): 959 multi-language issues (Python/JS/TS/Java)
- SWE-Gym (2024): 2,438 Python tasks; supports reinforcement learning training
Three Pain Points of Traditional Dataset Construction

    ┌────────────────────────────────────────────────────────────────────────────┐
    │ Traditional dataset construction workflow                                  │
    ├────────────────────────┬──────────────────────┬────────────────────────────┤
    │ P1: Environment setup  │ P2: Grading system   │ P3: Fail2Pass validation   │
    ├────────────────────────┼──────────────────────┼────────────────────────────┤
    │ Configure deps by hand │ Hand-write parsers   │ Inspect test logs by hand  │
    │ Handle version compat  │ Adapt per framework  │ Check pre/post-patch state │
    │ Write Dockerfiles      │ Extract with regexes │ Judge fail -> pass flips   │
    ├────────────────────────┴──────────────────────┴────────────────────────────┤
    │ Problem: heavily manual, slow and laborious, hard to scale                 │
    └────────────────────────────────────────────────────────────────────────────┘

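The challenges in detail:
| Pain point | Description | Traditional solution |
|---|---|---|
| P1: Environment setup | Languages and repository configurations vary widely; dependencies and test commands are highly project-specific | Hand-written Dockerfiles and setup scripts |
| P2: Grading system | Projects use different test frameworks, so log formats differ widely | A hand-written parser (regular expressions) for each case |
| P3: Fail2Pass validation | Must verify that the tests go from failing to passing once the gold patch is applied | Manual inspection of large numbers of complex test reports |
SWE-Factory's Solution
SWE-Factory addresses these pain points with three automated components:
1. SWE-Builder: a multi-agent system that builds evaluation environments automatically (addresses P1)
2. Exit-code grading: standardized collection of test status with no custom parsers (addresses P2)
3. Automated Fail2Pass validation: validation driven by exit codes (addresses P3)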
---
SWE-Factory Core Architecture
Overall Pipeline

    ┌──────────────────────────────────────────────────────────────────────────────────┐
    │ SWE-Factory pipeline                                                             │
    ├────────────┬─────────────────────┬───────────────────┬───────────────────────────┤
    │ Stage 1    │ Stage 2             │ Stage 3           │ Stage 4                   │
    ├────────────┼─────────────────────┼───────────────────┼───────────────────────────┤
    │ Raw issue  │ Environment setup   │ Test grading      │ Fail2Pass validation      │
    │ collection │ (SWE-Builder)       │ (exit code)       │ (automated)               │
    ├────────────┼─────────────────────┼───────────────────┼───────────────────────────┤
    │ Uses the   │ Agents collaborate  │ Capture exit code │ Pre-patch: exit code ≠ 0  │
    │ SWE-bench  │ to write Dockerfile │ 0 = pass          │ Post-patch: exit code = 0 │
    │ scripts    │ and test script     │ non-zero = fail   │                           │
    └────────────┴─────────────────────┴───────────────────┴───────────────────────────┘

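Key Innovations
| Aspect | Traditional approach | SWE-Factory approach | Advantage |
|---|---|---|---|
| Environment setup | Manual configuration | SWE-Builder multi-agent system | Automated and reusable |
| Test grading | Custom parsers | Standardized exit codes | 100% accuracy, no per-project adaptation |
| Fail2Pass | Manual inspection | Automatic exit-code comparison | 92% precision, 100% recall |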
The SWE-Builder Multi-Agent System
Four-Agent Collaboration Architecture

    ┌──────────────────────────────────────────────────────────────┐
    │  SWE-Builder architecture                                    │
    ├──────────────────────────────────────────────────────────────┤
    │                                                              │
    │  ┌─────────────────┐                                         │
    │  │   Repository    │◄── collects repository info,            │
    │  │    Explorer     │    dependencies, test commands          │
    │  └────────┬────────┘    (requirements.txt, pom.xml, etc.)    │
    │           │                                                  │
    │           ▼                                                  │
    │  ┌─────────────────┐       ┌─────────────────┐               │
    │  │   Environment   │       │      Test       │               │
    │  │     Manager     │       │     Manager     │               │
    │  │  (Dockerfile)   │       │  (test script)  │               │
    │  └────────┬────────┘       └────────┬────────┘               │
    │           │                         │                        │
    │           └────────────┬────────────┘                        │
    │                        ▼                                     │
    │  ┌───────────────────────────────────────────────┐           │
    │  │ Test Analyst                                  │           │
    │  │ Check: do tests pass with the gold patch?     │           │
    │  │ On failure: analyze error logs, give guidance │           │
    │  └─────────────────────┬─────────────────────────┘           │
    │                        │                                     │
    │                        ▼ (feedback on failure)               │
    │    Return to the responsible agent for another iteration     │
    │                                                              │
    └──────────────────────────────────────────────────────────────┘

Agent Design in Detail
#### 1. Repository Explorer
Responsibility: automatically collect all the information needed to build the evaluation environment
Core APIs:
- browse_file(file_path, custom_query): extract information from a given file
- browse_directory(file_path, depth): browse the directory structure
- search_file_by_keyword(keyword): search for files by keyword
Information collected:
- Environment dependencies (requirements.txt, pom.xml, package.json, etc.)
- Test commands (pytest, mvn test, npm test, etc.)
- Setup details from the documentation (README.md, CONTRIBUTING.md)
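#### 2. Environment Manager
Responsibility: build a reliable runtime environment
Output: Dockerfile
Key features:
- Generates the Dockerfile from the information the Repository Explorer collected
- Keeps a version history to support iterative refinement
- Rolls back to the previous version on failure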
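#### 3. Test Manager
Responsibility: generate the shell script that runs the tests
Key innovation: standardized exit-code output

    #!/bin/bash
    # Example of a generated test script (eval.sh)
    # Run the test command
    pytest tests/test_specific_feature.py -v
    # Capture the exit code
    rc=$?
    # Emit the standardized marker
    echo "OMNIGRIL_EXIT_CODE=$rc"
    # Exit code semantics:
    #   0        = all tests passed
    #   non-zero = at least one test failed or an error occurred

Why exit codes?
- Mainstream test tools (pytest, JUnit, Mocha, npm) all follow the exit-code convention
- 0 means success; non-zero means failure
- No complex log formats need to be parsed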
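The same capture-and-mark step can be sketched in Python with `subprocess`. This is an illustration only: `run_and_mark` is a hypothetical helper, and the marker name is copied from the shell script above.

```python
import subprocess
import sys
from typing import List

MARKER = "OMNIGRIL_EXIT_CODE"  # marker name taken from the eval.sh example


def run_and_mark(command: List[str]) -> str:
    """Run a test command and append the standardized exit-code marker to its log."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return f"{proc.stdout}{proc.stderr}\n{MARKER}={proc.returncode}\n"


# Stand-in for a real test command such as ["pytest", "tests/"]:
# a child Python process that exits with status 3.
log = run_and_mark([sys.executable, "-c", "raise SystemExit(3)"])
print(log)
```

Because the marker is always the last line, a grader can read the result without knowing anything about the test framework's own log format.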
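#### 4. Test Analyst
Responsibility: assess environment quality and coordinate iterative refinement
Validation logic:

    Apply gold patch → build environment → run tests → analyze results
        │
        ├─ Success: the environment is valid; save it to the memory pool
        │
        └─ Failure: analyze error logs → locate the problem → generate guidance → send feedback to the responsible agent

Error classification and feedback:
| Error type | Feedback goes to | Example guidance |
|---|---|---|
| Missing dependency | Environment Manager | "Add missing-package==1.0.0 to the Dockerfile" |
| Wrong test command | Test Manager | "Change pytest to python -m pytest" |
| Insufficient information | Repository Explorer | "Look up the test configuration in tox.ini" |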
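The routing in the table above can be approximated with keyword rules. In SWE-Builder this decision is made by an LLM reading the logs, so the sketch below is only a stand-in: `route_feedback` and the patterns in `ROUTING_RULES` are illustrative assumptions, not the project's actual logic.

```python
import re

# Hypothetical keyword rules: (pattern found in the error log, agent to notify).
ROUTING_RULES = [
    (
        re.compile(r"ModuleNotFoundError|No module named|Cannot find module", re.I),
        "Environment Manager",  # missing dependency
    ),
    (
        re.compile(r"command not found|no tests ran|unrecognized arguments", re.I),
        "Test Manager",  # wrong test command
    ),
]


def route_feedback(error_log: str) -> str:
    """Pick the agent that should receive the Test Analyst's guidance."""
    for pattern, agent in ROUTING_RULES:
        if pattern.search(error_log):
            return agent
    # Nothing recognizable in the log: treat it as insufficient information.
    return "Repository Explorer"
```

Falling back to the Repository Explorer mirrors the third table row: when the log does not explain the failure, more repository context is needed.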
Evaluation Environment Memory Pool
Key observation: adjacent versions of the same repository usually share similar runtime environments and test scripts.
How it works:

    ┌────────────────────────────────────────────────────────────┐
    │  Evaluation environment memory pool                        │
    ├────────────────────────────────────────────────────────────┤
    │                                                            │
    │  New issue arrives                                         │
    │       │                                                    │
    │       ▼                                                    │
    │  Query the memory pool ──► find past setups for the repo   │
    │       │                                                    │
    │       ▼                                                    │
    │  Retrieve the environment of an adjacent version           │
    │       │                                                    │
    │       ▼                                                    │
    │  Use it as a baseline to speed up the new build            │
    │       │                                                    │
    │       ▼                                                    │
    │  Validation succeeds ──► save to the memory pool (reuse)   │
    │                                                            │
    └────────────────────────────────────────────────────────────┘

Benefits:
- Speeds up environment generation
- Improves environment consistency across versions
- Reduces duplicated effort
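A memory pool of this kind can be sketched as a per-repository list of validated environments, queried by version proximity. The names `EnvRecord` and `EnvMemoryPool` are hypothetical, and so is the notion of "adjacent": here it means the smallest component-wise version distance, which is an assumption rather than SWE-Factory's documented rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, ...]


@dataclass
class EnvRecord:
    version: Version      # e.g. (21, 2) for release 21.2
    dockerfile: str
    eval_script: str


def _distance(a: Version, b: Version) -> Tuple[int, ...]:
    """Component-wise distance, so the major version dominates the minor one."""
    width = max(len(a), len(b))
    pa = a + (0,) * (width - len(a))
    pb = b + (0,) * (width - len(b))
    return tuple(abs(x - y) for x, y in zip(pa, pb))


@dataclass
class EnvMemoryPool:
    records: Dict[str, List[EnvRecord]] = field(default_factory=dict)

    def save(self, repo: str, record: EnvRecord) -> None:
        """Store an environment that passed validation."""
        self.records.setdefault(repo, []).append(record)

    def retrieve(self, repo: str, version: Version) -> Optional[EnvRecord]:
        """Return the stored environment closest to `version`, if the repo is known."""
        candidates = self.records.get(repo, [])
        if not candidates:
            return None
        return min(candidates, key=lambda r: _distance(r.version, version))
```

The retrieved record is only a starting point: it still has to pass validation for the new issue before it is saved back to the pool.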
Exit-Code-Based Automatic Validation
Exit-Code-Based Grading
Key insight: mainstream test frameworks all report test results through the exit-code convention.
| Test tool | Exit code = 0 | Exit code ≠ 0 |
|---|---|---|
| pytest | All tests passed | At least one failure or error |
| JUnit | Tests succeeded | Tests failed |
| Mocha | All passed | Some failed |
| npm test | Success | Failure |

    # Append the standardized output at the end of the test script
    test_command
    rc=$?
    echo "OMNIGRIL_EXIT_CODE=$rc"

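Grading procedure:

    import re

    def parse_exit_code(output_log):
        # Assumed helper: read the last standardized marker; None if it is missing
        matches = re.findall(r"OMNIGRIL_EXIT_CODE=(\d+)", output_log)
        return int(matches[-1]) if matches else None

    def grade_test(output_log):
        # Exit code 0 means every test passed; anything else counts as a failure
        exit_code = parse_exit_code(output_log)
        if exit_code == 0:
            return "PASS"
        else:
            return "FAIL"
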
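Comparison:
| Dimension | Traditional parser approach | Exit-code approach |
|---|---|---|
| Development cost | High (a parser per project) | Low (one standardized scheme) |
| Maintenance cost | High (log format changes require updates) | Low (independent of log format) |
| Accuracy | Depends on parser quality | 100% (verified experimentally) |
| Generality | Low (project-specific) | High (works across frameworks) |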
Automated Fail2Pass Validation
Definition: Fail2Pass validation ensures that the tests go from failing to passing once the gold patch is applied.
Automated workflow:

    ┌──────────────────────────────────────────────────────────────┐
    │  Fail2Pass validation workflow                               │
    ├──────────────────────────────────────────────────────────────┤
    │                                                              │
    │  Original issue                                              │
    │     │                                                        │
    │     ├─► Run tests before the patch ──► exit code = ?         │
    │     │   (expected: non-zero, i.e. failure)                   │
    │     │                                                        │
    │     └─► Apply the gold patch                                 │
    │           │                                                  │
    │           └─► Run tests after the patch ──► exit code = ?    │
    │               (expected: 0, i.e. pass)                       │
    │                                                              │
    │  Decision: did the exit code change from non-zero to 0?      │
    │     ├─► Yes ──► valid instance (keep)                        │
    │     └─► No  ──► invalid instance (filter out)                │
    │                                                              │
    └──────────────────────────────────────────────────────────────┘

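The decision at the bottom of the workflow reduces to comparing two exit codes. The sketch below assumes the codes have already been parsed from the pre-patch and post-patch logs; `is_fail2pass` and `filter_instances` are hypothetical names, and the second instance ID is made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def is_fail2pass(pre_patch_rc: Optional[int], post_patch_rc: Optional[int]) -> bool:
    """True when the tests fail before the gold patch and pass after it.

    A missing exit code (None) is treated as invalid rather than as a pass.
    """
    if pre_patch_rc is None or post_patch_rc is None:
        return False
    return pre_patch_rc != 0 and post_patch_rc == 0


def filter_instances(runs: Dict[str, Tuple[Optional[int], Optional[int]]]) -> List[str]:
    """Keep only the instance IDs whose (pre, post) exit codes are Fail2Pass."""
    return [name for name, (pre, post) in runs.items() if is_fail2pass(pre, post)]


runs = {
    "python-attrs__attrs-830": (1, 0),   # fails, then passes: keep
    "example__repo-1": (0, 0),           # already passing before the patch
    "example__repo-2": (1, 1),           # still failing after the patch
    "example__repo-3": (None, 0),        # pre-patch run produced no marker
}
print(filter_instances(runs))
```

Note that this check alone cannot tell a genuine test failure from a pre-patch crash; both yield a non-zero code.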
---
Experimental Evaluation and Results
Experimental Setup
Dataset: SweSetupBench-lite
- 12 open-source repositories
- 4 programming languages: Python, Java, JavaScript, TypeScript
- 671 issues
Base models:
| Model | Input cost | Output cost | Release date |
|---|---|---|---|
| GPT-4.1-mini | $0.40/1M tokens | $1.60/1M tokens | 2025-04-14 |
| Gemini-2.5-flash | $0.15/1M tokens | $0.60/1M tokens | 2025-04-17 |
| DeepSeek-v3 | $0.30/1M tokens | $0.80/1M tokens | 2025-03-24 |
Hyperparameters:
- Maximum iterations: 5
- Temperature: 0.2
- Maximum Repository Explorer retrieval rounds: 10
- Parallel processes: 20
RQ1: Effectiveness of SWE-Builder
Overall results:
| Model | Valid rate | Success rate | Cost per instance |
|---|---|---|---|
| GPT-4.1-mini | 40.1% (269/671) | 57.2% | $0.045 |
| Gemini-2.5-flash | 33.5% (225/671) | 49.8% | $0.024 (lowest) |
| DeepSeek-v3 | 34.6% (232/671) | 50.8% | $0.043 |
Results by language:
| Model | Python | Java | TypeScript | JavaScript |
|---|---|---|---|---|
| GPT-4.1-mini | 39.4% | 28.5% | 54.0% | 38.7% |
| Gemini-2.5-flash | 29.8% | 19.4% | 48.3% | 40.5% |
| DeepSeek-v3 | 43.4% | 11.8% | 43.8% | 42.3% |
Key findings:
- GPT-4.1-mini performs best overall, with a valid rate of 40.1%
- Gemini-2.5-flash is the cheapest, at only $0.024 per instance
- DeepSeek-v3 performs best on Python and JavaScript
- GPT-4.1-mini leads on Java and TypeScript
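The valid rates in the overall results table follow directly from the instance counts. The short check below recomputes them from the numbers reported above; the variable and function names are illustrative.

```python
TOTAL_ISSUES = 671  # size of SweSetupBench-lite

# Valid instances per model, copied from the overall results table.
VALID_COUNTS = {
    "GPT-4.1-mini": 269,
    "Gemini-2.5-flash": 225,
    "DeepSeek-v3": 232,
}


def valid_rate(valid: int, total: int = TOTAL_ISSUES) -> float:
    """Valid rate in percent, rounded to one decimal place."""
    return round(100 * valid / total, 1)


rates = {model: valid_rate(count) for model, count in VALID_COUNTS.items()}
best_model = max(rates, key=rates.get)

for model, rate in rates.items():
    print(f"{model}: {rate}%")
print("best:", best_model)
```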
RQ2: Accuracy of Exit-Code Grading
Method: manual inspection of 2,085 test reports
Results:
| Environment source | Reports inspected | Accuracy |
|---|---|---|
| GPT-4.1-mini | 765 | 100% |
| DeepSeek-v3 | 670 | 100% |
| Gemini-2.5-flash | 650 | 100% |
| Total | 2,085 | 100% |
RQ3: Effectiveness of Fail2Pass Validation
Metrics:
- Precision: the share of instances predicted as Fail2Pass that are truly Fail2Pass
- Recall: the share of true Fail2Pass instances that are predicted as such
| Model | Task instances | TP | FP | TN | FN | Precision | Recall |
|---|---|---|---|---|---|---|---|
| DeepSeek-v3 | 329 | 226 | 16 | 87 | 0 | 0.93 | 1.00 |
| GPT-4.1-mini | 381 | 269 | 19 | 93 | 0 | 0.93 | 1.00 |
| Gemini-2.5-flash | 320 | 223 | 25 | 72 | 0 | 0.90 | 1.00 |
| Total | 1,030 | 718 | 60 | 252 | 0 | 0.92 | 1.00 |
Key findings:
- Perfect recall (100%): no true Fail2Pass case is missed
- High precision (92%): the few false positives need a second, manual check
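The precision and recall columns can be recomputed from the confusion counts with the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The snippet below does this for each model and for the totals, using only the numbers in the table above; the names are illustrative.

```python
from typing import Tuple

# (TP, FP, TN, FN) per model, copied from the RQ3 table.
CONFUSION = {
    "DeepSeek-v3": (226, 16, 87, 0),
    "GPT-4.1-mini": (269, 19, 93, 0),
    "Gemini-2.5-flash": (223, 25, 72, 0),
}


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall, each rounded to two decimal places."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return round(precision, 2), round(recall, 2)


per_model = {
    model: precision_recall(tp, fp, fn)
    for model, (tp, fp, tn, fn) in CONFUSION.items()
}
totals = [sum(column) for column in zip(*CONFUSION.values())]  # [TP, FP, TN, FN]
overall = precision_recall(totals[0], totals[1], totals[3])

print(per_model)
print("overall:", overall)
```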
The Error2Pass Phenomenon
What Is Error2Pass?
Definition: Error2Pass is a special case in which, before the patch, the tests cannot execute because of an error (such as ImportError); after the patch, the error is resolved and the tests run and pass.

    Conventional Fail2Pass:                 Error2Pass:
    Pre-patch:  tests run, but fail         Pre-patch:  tests cannot run (ImportError)
    Post-patch: tests run and pass          Post-patch: tests run and pass

A Typical Case
Case: python-attrs__attrs-830
Before the patch (left panel):

    # The test tries to import a new function that does not exist yet
    from attr import to_bool  # ImportError!
    # The test framework crashes during collection
    # No test is actually executed

Exit code: non-zero (due to the ImportError)
After the patch (right panel):

    # The gold patch adds the to_bool function
    # The tests import and run normally
    # All 21 tests pass

Exit code: 0
Why Is Error2Pass a Problem?
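Core problem: the test code is tightly coupled to the solution code

    Gold patch:                      What a model might generate:
    adds function to_bool()          adds function to_boolean()  # same behavior, different name
    Test code:                       Test code:
    from attr import to_bool         from attr import to_bool    # hard-coded import
    Result:                          Result:
    tests pass                       ImportError (even though the behavior is correct)

Consequences:
- A model may produce a logically correct solution
- but details such as the function name may not match what the tests expect
- so the tests fail and the model's ability is underestimated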
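Experimental Findings
All false positives (FP) are Error2Pass:
- All 60 FP cases were manually reviewed, and every one is Error2Pass
- Such cases should not be included in a high-quality benchmark
Recommendations:
- Filter out Error2Pass cases when building a benchmark
- They can be identified by checking the pre-patch error type (ImportError, ModuleNotFoundError, etc.)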
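The recommended check on the pre-patch error type can be sketched as a log scan layered on top of the exit-code comparison. This is a heuristic illustration rather than SWE-Factory's filter: `classify_transition` is a hypothetical name, and matching on the exception name alone could also flag a run in which some tests did execute.

```python
import re

# Error types named in the text as signs that the tests never ran.
IMPORT_ERROR_PATTERN = re.compile(r"\b(ImportError|ModuleNotFoundError)\b")


def classify_transition(pre_patch_log: str, pre_rc: int, post_rc: int) -> str:
    """Label a (pre-patch, post-patch) pair of test runs.

    Returns "not_fail2pass", "error2pass", or "fail2pass".
    """
    if pre_rc == 0 or post_rc != 0:
        return "not_fail2pass"
    if IMPORT_ERROR_PATTERN.search(pre_patch_log):
        # The tests failed before the patch only because they could not be imported.
        return "error2pass"
    return "fail2pass"
```

For the attrs case above, a pre-patch log containing `ImportError: cannot import name 'to_bool'` with exit codes non-zero then 0 would be labeled `error2pass` and dropped.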
Comparison with Related Work
Comparison of Existing Datasets
| Benchmark | Languages | Size | Degree of automation | Notes |
|---|---|---|---|---|
| SWE-bench | Python | 2,294 | Partially automated | Most widely used benchmark |
| SWE-bench Verified | Python | 500 | Human-validated | High-quality subset |
| OmniGIRL | Multi-language | 959 | Partially automated | Multimodal support |
| SWE-Gym | Python | 2,438 | Automated | Supports reinforcement learning training |
| R2E-Gym | Python | 8,700+ | Automated | Procedurally generated environments |
| SWE-Factory (this work) | Multi-language | Built on demand | Fully automated | First fully automated pipeline |
Comparison of Automated Environment Setup Methods
| Method | Environment setup | Grading system | Fail2Pass | Open source |
|---|---|---|---|---|
| ExecutionAgent | ✓ Automated | ✗ Manual | ✗ Manual | ? |
| EnvBench | ✓ Automated | ✗ Manual | ✗ Manual | ? |
| RepoLaunch | ✓ Automated | ✗ Manual | ✗ Manual | ? |
| SetupAgent | ✓ Automated | ✓ Automated | ✗ Manual | ? |
| SWE-Factory | ✓ Multi-agent | ✓ Exit code | ✓ Automated | ✓ |
---
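Use Cases and Value
1. Building Large-Scale Training Datasets
Scenario: building training data at the scale of tens of thousands of instances for reinforcement learning (as in SWE-Gym)
Value:
- Cost drops from $10+ per instance to $0.024 per instance
- A 10,000-instance dataset costs only about $240 to build
- Multi-language support broadens the diversity of the training data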
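The $240 figure is simply 10,000 × $0.024. The sketch below reproduces that arithmetic and adds one hedged extension: the text does not say whether $0.024 is charged per attempted issue or per valid instance, so `cost_per_valid_instance` shows how the figure would scale if it were per attempt, using Gemini-2.5-flash's 33.5% valid rate from the RQ1 table. Both function names are illustrative.

```python
def dataset_cost(instances: int, cost_per_instance: float) -> float:
    """Total cost in dollars for a dataset of the given size."""
    return round(instances * cost_per_instance, 2)


def cost_per_valid_instance(cost_per_attempt: float, valid_rate: float) -> float:
    """Effective cost per valid instance, assuming the cost is paid per attempt."""
    return round(cost_per_attempt / valid_rate, 3)


print(dataset_cost(10_000, 0.024))            # the figure quoted in the text
print(dataset_cost(10_000, 10))               # the same dataset at $10 per instance
print(cost_per_valid_instance(0.024, 0.335))  # only under the per-attempt assumption
```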
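2. Continuous Benchmark Updates
Scenario: keep adding new issues to a benchmark as open-source projects evolve
Value:
- The automated workflow can run continuously
- The benchmark grows without manual intervention
- The benchmark stays in step with current technology
3. Building Domain-Specific Benchmarks
Scenario: building dedicated benchmarks for specific domains (such as financial or medical software)
Value:
- Quickly tailor a domain-specific evaluation set
- Support for multiple programming languages
- A lower barrier to building domain benchmarks
4. Evaluating Model Capability
Scenario: evaluating how new models perform on GitHub issue resolution
Value:
- A standardized evaluation workflow
- Reproducible experimental environments
- Fair capability comparisons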
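Summary and Outlook
Core Contributions
1. SWE-Factory: the first open-source automated pipeline for building multi-language GitHub issue resolution benchmarks
2. SWE-Builder: a multi-agent system for efficient environment setup ($0.024 to $0.045 per instance)
3. Exit-code grading: automated test grading with 100% accuracy
4. Automated validation: Fail2Pass validation with 92% precision and 100% recall
5. The Error2Pass finding: identifying and analyzing a special class of cases that affects benchmark quality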
Key Figures
| Metric | Value |
|---|---|
| Best valid rate | 40.1% (GPT-4.1-mini) |
| Lowest construction cost | $0.024 per instance (Gemini-2.5-flash) |
| Exit-code grading accuracy | 100% |
| Fail2Pass validation precision | 92% |
| Fail2Pass validation recall | 100% |
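Future Directions
1. Broader language support: cover more programming languages (Go, Rust, C++, etc.)
2. Higher success rate: refine the agent collaboration strategy to raise the environment setup success rate
3. Error2Pass filtering: develop a mechanism that automatically identifies and filters Error2Pass cases
4. Multimodal support: integrate screenshots, videos, and other multimodal information (see SWE-bench Multimodal)
5. Live benchmarks: build a continuously updated, dynamic benchmark system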
Resources
- GitHub: https://github.com/DeepSoftwareAnalytics/swe-factory
- Paper: arXiv:2506.10954v1
- Dataset: SweSetupBench-lite (671 instances, 4 languages)
References
1. Jimenez et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
2. Guo et al. "SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks." arXiv:2506.10954v1, 2025.
3. Pan et al. "Training Software Engineering Agents and Verifiers with SWE-Gym." 2024.
4. Guo et al. "OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution." 2025.
---
*Report generated: June 2025* *Compiled from the SWE-Factory paper and publicly available materials*