Getty Images drops ‘cleanest’ visual dataset for training foundation models




Getty Images is going all in to establish itself as a trusted data partner. The creative company, known for enabling the sharing, discovery and purchase of visual content from global photographers and videographers, today announced it is releasing images from its library as a sample open dataset on Hugging Face. 

While there are plenty of visual datasets on the Hugging Face hub, Getty says its offering stands out from the crowd for being reliable and commercially safe. This means enterprise developers can integrate it into their AI training pipeline without worrying about quality or legal issues cropping up in the future. 

“Imagine building or enhancing your AI/ML capabilities with data that’s not only diverse and high quality but also comes with the peace of mind that it’s responsibly sourced. That’s what we’re bringing to the table,” Andrea Gagliano, the head of data science and AI/ML at the company, told VentureBeat.

Eventually, the company hopes the move will create an ecosystem in which AI companies prefer officially licensed content from its platform for training their AI models.

What does the Getty Images dataset have on offer?

When training AI/ML models, developers often struggle with poorly sourced, low-quality data. To fix this, they put the whole repository through multiple rounds of cleaning and enrichment. This means not only removing duplicates and damaged files but also filtering out risky or unwanted elements such as celebrity images, trademarks, NSFW content and low-resolution images, as well as images with incomplete or missing metadata (the metadata helps models understand context).

This task, given the size of the dataset, can take a lot of time and resources, leading to missed opportunities for the engineering team. Not to mention, even after all the hard work, some harmful or copyrighted materials may still slip through the cracks and end up in the downstream model outputs – stirring up legal battles.
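To make the cleanup work described above concrete, here is a minimal Python sketch of such a filtering pass. It is not Getty's or any vendor's actual pipeline: the `ImageRecord` shape, the `BLOCKED_FLAGS` labels and the 512-pixel threshold are all assumptions chosen for illustration, and it only catches byte-identical duplicates, not near-duplicates.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    # Hypothetical record shape; real datasets will differ.
    path: str
    data: bytes                                # raw file contents
    width: int
    height: int
    caption: str = ""
    tags: list = field(default_factory=list)
    flags: set = field(default_factory=set)    # e.g. labels from an upstream classifier


# Assumed labels for content a team might want to exclude.
BLOCKED_FLAGS = {"nsfw", "celebrity", "trademark"}
# Assumed minimum length of the shorter side, in pixels.
MIN_SIDE = 512


def clean(records, min_side=MIN_SIDE):
    """Return the records that survive a basic cleaning pass."""
    seen_hashes = set()
    kept = []
    for rec in records:
        # Treat empty files as damaged.
        if not rec.data:
            continue
        # Drop exact duplicates by content hash.
        digest = hashlib.sha256(rec.data).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Drop low-resolution images.
        if min(rec.width, rec.height) < min_side:
            continue
        # Drop anything carrying a blocked label.
        if rec.flags & BLOCKED_FLAGS:
            continue
        # Drop records with missing or incomplete metadata.
        if not rec.caption.strip() or not rec.tags:
            continue
        kept.append(rec)
    return kept
```

Even this toy version depends on reliable labels and metadata already existing for every image, which is a large part of the effort the article describes.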

With its open dataset on Hugging Face, Getty Images is trying to solve all these issues, giving developers a ready-to-use repository of high-quality images covering as many as 15 categories.

“This sample Dataset includes 3,750 images from 15 categories, including abstracts and backgrounds, built environments, business, concepts, education, healthcare, icons, industry, nature, illustrations and travel,” Gagliano tells VentureBeat. 

Content from Getty Images sample dataset

According to the data science head, the repository comes from Getty’s wholly owned creative library, which means the images are commercially safe and developers can use them without worrying about unexpected legal troubles at a later stage. There’s also no hassle of cleaning or enrichment, as the dataset has been curated specifically for machine learning (ML) training, with high-resolution images, rich structured metadata and no unwanted elements like NSFW content.

She described it as the “cleanest, highest quality dataset” one could find for training ML models.

Usage conditions to apply

While the sample dataset is open for use, certain conditions apply to ensure the licensed content is used responsibly, whether for training and testing commercial applications or for academic research.

“Some of the restrictions include redistribution of the dataset, development of models/software to re-create/reproducing or generating digital reproductions of items of the content contained in the dataset, creation of products/services in direct competition with Getty Images, create or use biometric identifiers derived from the dataset, and use in any manner that violates applicable laws or regulations,” Gagliano noted.

Eventually, Getty hopes the move will engage the developer community, helping them understand the depth and breadth of content the company can offer, and raise awareness that it can be a “trusted partner” for providing licensed, high-quality data for responsible AI training.

“Our goal is to show that it is possible to accommodate licensing for all the content required to train functional AI models – developing business models that enable the creation of high-quality AI models while respecting creator IP,” Gagliano added. She noted that developers who need more data can contact the company with their use cases to source a larger licensed repository.

This arrangement will also see the original providers/creators of the content receiving compensation on an annual recurring basis. Notably, Getty Images also used the same approach for its AI image generation tool developed in partnership with Nvidia.


