ChatGPT maker OpenAI has introduced the latest addition to its artificial intelligence (AI) tool collection, MLE-bench, a benchmark designed for AI developers.
The tool was officially introduced on the OpenAI website last week, on Oct 10, 2024.
MLE-bench is an open-source AI tool for engineers to evaluate the performance of AI agents in machine learning (ML) engineering.
The American AI organisation compiled 75 machine-learning competitions from Kaggle, a data science competition platform and online community for learners, researchers and developers.
Together, these competitions form a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments.
“We establish human baselines for each competition using Kaggle's publicly available leaderboards,” OpenAI stated.
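To illustrate the idea, the sketch below shows one way an agent's score could be ranked against a public leaderboard; the data layout, the higher-is-better scoring assumption, and the function name are illustrative choices for this example, not OpenAI's actual baseline code.

```python
# Illustrative sketch (assumed data layout; higher score = better): rank an
# agent's score against a competition's public leaderboard to see where a
# human-derived baseline, such as a medal cut-off, would place it.
from typing import Sequence

def leaderboard_percentile(human_scores: Sequence[float], agent_score: float) -> float:
    """Return the fraction of human leaderboard entries the agent's score beats."""
    beaten = sum(1 for score in human_scores if agent_score > score)
    return beaten / len(human_scores)

# Example usage with made-up leaderboard scores:
# pct = leaderboard_percentile([0.71, 0.74, 0.78, 0.81], 0.79)
# print(f"Agent outperforms {pct:.0%} of leaderboard entries")
```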
AI ‘benchmark’ tool
Described as a “benchmark” tool, MLE-bench was paired with open-source agent scaffolds to evaluate several “frontier language models” on the new set of tasks.
According to Tech Xplore, MLE-bench tests AI systems on their ability to carry out engineering work autonomously, including innovating. AI systems would likely have to learn from their own work to improve their scores on these bench tests.
OpenAI says its “best-performing setup” in the Kaggle-based testing combined o1, the ChatGPT owner’s latest large language model (LLM), with AIDE scaffolding, and that combination was then evaluated on MLE-bench.
OpenAI o1, launched last month, is a new LLM that thinks more deeply before it answers users’ prompts.
According to the AI giant, MLE-bench managed to achieve “at least the level of a Kaggle bronze medal in 16.9 percent of competitions.”
However, because Kaggle does not supply a held-out test set for every competition, OpenAI has prepared scripts to evaluate the ML capabilities of AI systems.
These scripts split the publicly available training set into a new training set and a new test set.
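As a rough illustration of that kind of split, the sketch below divides a competition’s public training file into new training and test portions; the file names, libraries, and 10 percent test fraction are assumptions made for the example, not OpenAI’s actual scripts.

```python
# Minimal sketch (assumed file names and split ratio): recreate a held-out
# test set by carving it out of a competition's public training data.
import pandas as pd
from sklearn.model_selection import train_test_split

def make_new_split(train_csv: str, test_fraction: float = 0.1, seed: int = 0):
    """Split a competition's public training file into new train/test sets."""
    full_train = pd.read_csv(train_csv)
    new_train, new_test = train_test_split(
        full_train, test_size=test_fraction, random_state=seed
    )
    return new_train, new_test

# Example usage: new_test then acts as the hidden set used for grading.
# new_train, new_test = make_new_split("train.csv")
```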
Resource scaling for AI agents
Striving to enhance ML engineering capabilities, OpenAI also plans to explore how agents’ results scale with the resources they are given.
In addition to resource scaling, the company will also examine the impact of contamination from pre-training on AI agents’ results.
“We open-source our benchmark code to facilitate future research in understanding the ML engineering capabilities of AI agents,” OpenAI stated.
OpenAI researchers emphasise that the benchmark, aka MLE-bench, was intentionally designed to make no assumptions about the agent that generates submissions, which allows any agent to be easily assessed on the benchmark.
The researchers developed their own agent and examined three other open-source agents – AIDE, MLAgentBench, and OpenHands – each of which was individually modified to enhance its capabilities.
“We also developed a "dummy" agent, used to check that the environment is configured correctly,” OpenAI researchers stated on GitHub. “Each agent, alongside the link to our fork, is listed below. Each agent has an associated ID which we use to identify it within our repo.”
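As a rough sketch of what such an environment-check agent could look like under the benchmark’s submission-only interface, the snippet below simply copies a competition’s sample submission and saves it as the agent’s answer; the file paths and function name are hypothetical, not taken from OpenAI’s repository.

```python
# Illustrative sketch only (hypothetical paths): a bare-bones agent that
# ignores the task and emits an unmodified sample submission. Because the
# benchmark only looks at the submission file an agent produces, even an
# agent this simple can be graded, which makes it useful as an environment check.
import pandas as pd

def run_dummy_agent(sample_submission_path: str, output_path: str) -> None:
    """Copy the sample submission unchanged and save it as the agent's answer."""
    submission = pd.read_csv(sample_submission_path)
    submission.to_csv(output_path, index=False)

# Example usage:
# run_dummy_agent("sample_submission.csv", "submission.csv")
```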
Additionally, the repository for the new developer tool includes several features that boost its evaluation capabilities, including a rule-violation detector and a plagiarism detector.
Earlier, in June, OpenAI was sued by the Center for Investigative Reporting, the US’s oldest nonprofit news organisation, for training its AI models on the nonprofit’s copyrighted content.
In the official complaint, the nonprofit newsroom stated that Copyleaks found “nearly 60 percent of the responses provided by defendants’ GPT-3.5 product contained some form of plagiarized content, and over 45 percent contained text that was identical to pre-existing content.”