Matt Shumer, co-founder and CEO of OthersideAI, best known for its signature AI writing assistant HyperWrite, has broken nearly two days of silence after being accused of fraud when third-party researchers were unable to replicate the supposed top performance of a new large language model (LLM) he released on Thursday, September 5. Shumer apologized, admitting he "got ahead of" himself, and added, "I know that many of you are excited about the potential for this and are now skeptical."
However, his statements do not fully explain why his model, Reflection 70B — which he claimed to be a variant of Meta's Llama 3.1 trained using the synthetic data generation platform Glaive AI — has failed to match his stated performance in subsequent independent tests. Nor has Shumer clarified precisely what went wrong. Here's a timeline:
Thursday, Sept. 5, 2024: Initial lofty claims of Reflection 70B’s superior performance on benchmarks
In case you're just catching up: last week, Shumer released Reflection 70B on the open source AI community Hugging Face, calling it "the world's top open-source model" in a post on X and sharing a chart of what he said were its state-of-the-art results on third-party benchmarks.
Shumer attributed the impressive performance to a technique called "Reflection Tuning," which allows the model to assess and refine its responses for correctness before outputting them to users.
VentureBeat interviewed Shumer and accepted his benchmarks as he presented them, crediting them to him, as we do not have the time or resources to run our own independent benchmarking — and most model providers we've covered have so far been forthright.
Friday, Sept. 6–Monday, Sept. 9: Third-party evaluations fail to reproduce Reflection 70B's impressive results — Shumer accused of fraud
However, just days after its debut and over last weekend, independent third-party evaluators and members of the open source AI community posting on Reddit and Hacker News began questioning the model's performance and were unable to replicate it on their own. Some even found responses and data indicating the model was related to — perhaps merely a thin "wrapper" around — Anthropic's Claude 3.5 Sonnet model.
Criticism mounted after Artificial Analysis, an independent AI evaluation organization, posted on X that its tests of Reflection 70B yielded significantly lower scores than initially claimed by HyperWrite.
Shumer was also found to be an investor in Glaive, the AI startup whose synthetic data he said he used to train the model — a stake he did not disclose when releasing Reflection 70B.
Shumer attributed the discrepancies to issues during the model's upload process to Hugging Face and promised last week to correct the model weights, but has yet to do so.
One X user, Shin Megami Boson, openly accused Shumer of “fraud in the AI research community” as of 8:07 pm ET. Shumer did not directly respond to this accusation.
After posting and reposting various X messages related to Reflection 70B, Shumer went silent on Sunday, September 8, and did not respond to VentureBeat's request for comment — nor post anything publicly on X — until the evening of Tuesday, September 10.
Additionally, AI researchers such as Nvidia's Jim Fan pointed out that it is easy to train even less powerful models (those with lower parameter counts, and thus less complexity) to score well on third-party benchmarks.
Tuesday, Sept. 10: Shumer responds and apologizes — but doesn’t explain discrepancies
Shumer finally released a statement on X tonight at 5:30 pm ET apologizing and stating, in part, “we have a team working tirelessly to understand what happened and will determine how to proceed once we get to the bottom of it. Once we have all of the facts, we will continue to be transparent with the community about what happened and next steps.”
Shumer also linked to another X post by Sahil Chaudhary, founder of Glaive AI, the platform Shumer previously claimed was used to generate synthetic data to train Reflection 70B.
Intriguingly, Chaudhary's post stated that the responses in which Reflection 70B identified itself as a variant of Anthropic's Claude remain a mystery to him as well. He also admitted that "the benchmark scores I shared with Matt haven't been reproducible so far."
However, Shumer and Chaudhary’s responses were not enough to mollify skeptics and critics, including Yuchen Jin, co-founder and chief technology officer (CTO) of Hyperbolic Labs, an open access AI cloud provider.
Jin wrote a lengthy post on X detailing how hard he worked to host a version of Reflection 70B on his site and troubleshoot the supposed errors, noting that "I was emotionally damaged by this because we spent so much time and energy on it, so I tweeted about what my faces looked like during the weekend."
He also responded to Shumer’s statement with a reply on X, writing, “Hi Matt, we spent a lot of time, energy, and GPUs on hosting your model and it’s sad to see you stopped replying to me in the past 30+ hours, I think you can be more transparent about what happened (especially why your private API has a much better perf).”
Megami Boson, among many others, remained unconvinced as of tonight by Shumer's and Chaudhary's telling of events, which casts the saga as one of mysterious, still-unexplained errors born of enthusiasm.
"As far as I can tell, either you are lying, or Matt Shumer is lying, or of course both of you," he posted on X, following up with a series of questions. Similarly, the Local Llama subreddit is not buying Shumer's claims.
Time will tell whether Shumer and Chaudhary can respond satisfactorily to their critics and skeptics — whose ranks now include a growing share of the generative AI community online.