It took just one weekend for the new, self-proclaimed king of open source AI models to have its crown tarnished.
Reflection 70B, a variant of Meta’s open source Llama 3.1 large language model (LLM), or possibly the older Llama 3, that was trained and released by small New York startup HyperWrite (formerly OthersideAI), launched with impressive, chart-topping scores on third-party benchmarks. Those results are now being aggressively questioned after other third-party evaluators failed to reproduce some of them.
The model was triumphantly announced in a post on the social network X by HyperWrite co-founder and CEO Matt Shumer on Friday, September 6, 2024, as “the world’s top open-source model.”
In a series of public X posts documenting some of Reflection 70B’s training process, and in a subsequent interview with VentureBeat over X Direct Messages, Shumer explained how the new LLM used “Reflection Tuning,” a previously documented technique developed by researchers outside the company in which an LLM checks the correctness of, or “reflects” on, its own generated responses before outputting them to users, improving accuracy on a range of writing, math, and other tasks.
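As a rough illustration of the general idea only (HyperWrite’s actual training code has not been released), a reflection-style loop can be sketched as extra passes in which a model critiques and then revises its own draft before answering. The `generate` function below is a hypothetical stand-in for any LLM completion call:

```python
# Minimal sketch of reflection-style prompting (illustrative only).
# `generate` is a hypothetical placeholder for any LLM completion call;
# swap in your own model or API client.

def generate(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a local Llama checkpoint or a hosted API)."""
    raise NotImplementedError("Plug in an actual model here.")

def answer_with_reflection(question: str) -> str:
    # Pass 1: draft an answer.
    draft = generate(f"Question: {question}\nAnswer:")

    # Pass 2: ask the model to check its own work and flag mistakes.
    critique = generate(
        "Review the following answer for factual or logical errors. "
        "List any problems you find.\n"
        f"Question: {question}\nAnswer: {draft}"
    )

    # Pass 3: produce a revised answer that incorporates the critique.
    revised = generate(
        "Rewrite the answer, fixing the problems noted in the critique.\n"
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}\nFinal answer:"
    )
    return revised
```

Reflection 70B reportedly trains this self-checking behavior into the model itself rather than orchestrating it through separate calls, but the sketch above conveys the core loop of drafting, reflecting, and revising.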
However, on Saturday, September 7, a day after the initial HyperWrite announcement and VentureBeat article were published, Artificial Analysis, an organization dedicated to “Independent analysis of AI models and hosting providers,” posted its own analysis on X stating that “our evaluation of Reflection Llama 3.1 70B’s MMLU score” (referencing the commonly used Massive Multitask Language Understanding benchmark) “resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B,” a major discrepancy with HyperWrite and Shumer’s originally posted results.
On X that same day, Shumer stated that Reflection 70B’s weights (the settings of the open source model) had been “fucked up during the upload process” to Hugging Face, the third-party AI code hosting repository and company, and that this issue could have resulted in lower-quality performance compared to HyperWrite’s “internal API” version.
On Sunday, September 8, 2024 at around 10 pm ET, Artificial Analysis posted on X that it had been “given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.”
The organization raised two key questions that seriously undermine HyperWrite and Shumer’s initial performance claims:
- “We are not clear on why a version would be published which is not the version we tested via Reflection’s private API.”
- “We are not clear why the model weights of the version we tested would not be released yet.”

The organization added: “As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.”
All the while, users in various machine learning and AI communities on Reddit have also called Reflection 70B’s stated performance and origins into question. Some have pointed out that, based on a model comparison posted to GitHub by a third party, Reflection 70B appears to be a Llama 3 variant rather than a Llama 3.1 variant, casting further doubt on Shumer and HyperWrite’s initial claims.
This led at least one X user, Shin Megami Boson, to openly accuse Shumer of “fraud in the AI research community” as of 8:07 pm ET on Sunday, September 8, posting a long list of screenshots and other evidence.
Others accuse the model of actually being a “wrapper,” or application built atop proprietary, closed-source rival Anthropic’s Claude 3.
However, other X users have spoken up in defense of Shumer and Reflection 70B, and some have posted about the model’s impressive performance on their end.
Regardless, the model’s rollout, lofty claims, and now criticism show how rapidly the AI hype cycle can come crashing down.
For now, the AI research community waits with bated breath for Shumer’s response and updated model weights on Hugging Face. VentureBeat has also reached out to Shumer for a direct response to these allegations of fraud and will update when we hear back.