Complete agent trajectories across 258k real-world software engineering tasks. Tracks tool calls, multi-turn reasoning traces, code changes, and explicit user acceptance signals.
A multimodal corpus of 5M+ connected medical records, diagnostic reasoning trails, and radiology/pathology interpretations mapped to clinical outcomes and symptoms.
Expert-annotated SFT and DPO preference datasets for corporate earnings reports, regulatory audit compliance, risk models, and professional financial rationale.
Current benchmarks fail to capture the multi-turn, iterative, and interactive reality of enterprise coding. We introduce a corpus of 258,000 production trajectories and use it to evaluate software agent architectures under real-world development lifecycle conditions.
Training medical agents on raw doctor-patient records often emphasizes superficial correlations over diagnostic reasoning. This paper presents a hierarchical RL framework aligned with feedback from credentialed practitioners and with rigorous diagnostic trails.