We address the problem of evaluation datasets being scraped into pretraining data by introducing a pipeline for generating synthetic datasets for reasoning and retrieval evaluation. We demonstrate the pipeline's resilience to scraping and use it to identify improvements in RAG and agentic methods: subtask composition, multi-branch reasoning, and needle-in-a-haystack retrieval.
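As an illustration of the idea (not the paper's actual pipeline), the following minimal Python sketch shows how a contamination-resistant needle-in-a-haystack retrieval example might be generated on the fly: every example is freshly sampled, so the benchmark items cannot already sit in a pretraining corpus. The function name and output fields are hypothetical.

```python
import random
import string

def make_needle_in_haystack_example(num_distractors: int = 50,
                                     seed: int | None = None) -> dict:
    """Generate one synthetic needle-in-a-haystack retrieval example.

    A unique key/value "needle" sentence is hidden among procedurally
    generated distractor sentences. Because each example is sampled at
    evaluation time, it cannot have leaked into pretraining data.
    """
    rng = random.Random(seed)

    def random_key() -> str:
        return "".join(rng.choices(string.hexdigits.lower(), k=8))

    def random_code() -> str:
        return "".join(rng.choices(string.ascii_uppercase, k=6))

    # The needle: the one fact the question will ask about.
    key, value = random_key(), random_code()
    needle = f"The access code for record {key} is {value}."

    # Distractors share the same template but use different keys/codes.
    haystack = [
        f"The access code for record {random_key()} is {random_code()}."
        for _ in range(num_distractors)
    ]
    # Hide the needle at a random position in the context.
    haystack.insert(rng.randrange(len(haystack) + 1), needle)

    return {
        "context": " ".join(haystack),
        "question": f"What is the access code for record {key}?",
        "answer": value,
    }

if __name__ == "__main__":
    example = make_needle_in_haystack_example(num_distractors=20, seed=0)
    print(example["question"])
    print(example["answer"])
```

A real pipeline would presumably template richer document types and reasoning chains, but the same principle applies: scoring is checked against the generator's known answer rather than a fixed, scrapable answer key.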
Because initiatives for measuring safe and ethical AI are fragmented, this review outlines state-of-the-art methods, their proper use, and their systemic weaknesses and scaling issues. In doing so, we seek to foster continuing discourse and innovation among both technical developers and non-technical policymakers.