Matt Shumer, co-founder and CEO of OthersideAI, best known for its signature AI writing assistant HyperWrite, has broken nearly two days of silence after being accused of fraud when third-party researchers were unable to replicate the supposed top performance of a new large language model (LLM) he released on Thursday, September 5. Shumer apologized, admitting he "got ahead of" himself, and added, "I know that many of you are excited about the potential for this and are now skeptical."
However, his statements do not fully explain why his model, Reflection 70B — which he claimed to be a variant of Meta's Llama 3.1 trained using the synthetic data generation platform Glaive AI — has failed to match his stated performance in subsequent independent tests. Nor has Shumer clarified precisely what went wrong. Here's a timeline:
Thursday, Sept. 5, 2024: Initial lofty claims of Reflection 70B’s superior performance on benchmarks
In case you're just catching up: last week, Shumer released Reflection 70B on the open source AI community Hugging Face, calling it "the world's top open-source model" in a post on X and sharing a chart of what he said were its state-of-the-art results on third-party benchmarks.
Shumer attributed the impressive performance to a technique called "Reflection Tuning," which allows the model to assess and refine its responses for correctness before outputting them to users.
VentureBeat interviewed Shumer and accepted his benchmarks as he presented them, crediting them to him, as we do not have the time or resources to run our own independent benchmarking — and most model providers we've covered have so far been forthright.
Friday, Sept. 6–Monday, Sept. 9: Third-party evaluations fail to reproduce Reflection 70B's impressive results — Shumer accused of fraud
However, just days after its debut and over last weekend, independent third-party evaluators and members of the open source AI community posting on Reddit and Hacker News began questioning the model's performance and were unable to replicate it on their own. Some even found responses and data indicating the model was related to — perhaps merely a thin "wrapper" around — Anthropic's Claude 3.5 Sonnet model.
Criticism mounted after Artificial Analysis, an independent AI evaluation organization, posted on X that its tests of Reflection 70B yielded significantly lower scores than initially claimed by HyperWrite.
Shumer was also found to be an investor in Glaive, the AI startup whose synthetic data he said he used to train the model — a stake he did not disclose when releasing Reflection 70B.
Shumer attributed the discrepancies to issues during the model's upload process to Hugging Face and promised last week to correct the model weights, but has yet to do so.
One X user, Shin Megami Boson, openly accused Shumer of “fraud in the AI research community” as of 8:07 pm ET. Shumer did not directly respond to this accusation.
After posting and reposting various X messages related to Reflection 70B, Shumer went silent on Sunday, September 8, and did not respond to VentureBeat's request for comment — nor post anything publicly on X — until the evening of Tuesday, September 10.
Additionally, AI researchers such as Nvidia's Jim Fan pointed out that it is easy to train even less powerful models (those with lower parameter counts, and thus less complexity) to score well on third-party benchmarks.
Tuesday, Sept. 10: Shumer responds and apologizes — but doesn’t explain discrepancies
Shumer finally released a statement on X tonight at 5:30 pm ET apologizing and stating, in part, “we have a team working tirelessly to understand what happened and will determine how to proceed once we get to the bottom of it. Once we have all of the facts, we will continue to be transparent with the community about what happened and next steps.”
Shumer also linked to another X post by Sahil Chaudhary, founder of Glaive AI, the platform Shumer previously claimed was used to generate synthetic data to train Reflection 70B.
Intriguingly, Chaudhary's post stated that the responses in which Reflection 70B identified itself as a variant of Anthropic's Claude remain a mystery to him as well. He also admitted that "the benchmark scores I shared with Matt haven't been reproducible so far."
However, Shumer and Chaudhary’s responses were not enough to mollify skeptics and critics, including Yuchen Jin, co-founder and chief technology officer (CTO) of Hyperbolic Labs, an open access AI cloud provider.
Jin wrote a lengthy post on X detailing how hard he worked to host a version of Reflection 70B on his site and troubleshoot the supposed errors, noting that "I was emotionally damaged by this because we spent so much time and energy on it, so I tweeted about what my faces looked like during the weekend."
He also responded to Shumer’s statement with a reply on X, writing, “Hi Matt, we spent a lot of time, energy, and GPUs on hosting your model and it’s sad to see you stopped replying to me in the past 30+ hours, I think you can be more transparent about what happened (especially why your private API has a much better perf).”
Megami Boson, among many others, remained unconvinced as of tonight by Shumer's and Chaudhary's telling of events, which casts the saga as one of mysterious, still-unexplained errors born of enthusiasm.
"As far as I can tell, either you are lying, or Matt Shumer is lying, or of course both of you," he posted on X, following up with a series of questions. Similarly, the Local Llama subreddit is not buying Shumer's claims.
Time will tell whether Shumer and Chaudhary can respond satisfactorily to their critics and skeptics — whose ranks now include a growing share of the generative AI community online.