Yahoo Labs' machine learning dataset will soon become the largest recommender systems dataset in the world. Yahoo also wants to help put the academic and industrial research communities on a more equal footing, according to ZDNet.
“Many academic researchers and data scientists don’t have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies,” said Suju Rajan, director of research at Yahoo Labs, in a statement. “We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state-of-the-art in machine learning and recommender systems.”
The company is releasing a collection, the Yahoo News Feed dataset, based on a sample of anonymized user interactions on Yahoo properties, including the Yahoo home page, Yahoo Finance, Yahoo Sports, Yahoo Real Estate and Yahoo Movies.
Categorized information, including age range, general geographic data and gender, is included for a subset of anonymized users. The titles, summaries and key phrases of news articles are also included in the data dump. User interaction data is timestamped and even shows what device was used to browse the sites.
All told, the Yahoo Labs dataset contains 13.5 TB of uncompressed data on how users interact with these Yahoo properties. It covers 110 billion events and includes the interactions of about 20 million users from February 2015 to May 2015.
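For a rough sense of scale, the quoted figures can be turned into per-user and per-event averages. The short Python sketch below is only back-of-the-envelope arithmetic: it assumes decimal terabytes (10^12 bytes) and treats "about 20 million users" as exactly 20 million, and the function name `scale_summary` is made up for illustration.

```python
# Figures quoted in the article (assumption: 1 TB = 10**12 bytes,
# and "about 20 million users" is taken as exactly 20 million).
TOTAL_BYTES = 13_500_000_000_000   # 13.5 TB uncompressed
TOTAL_EVENTS = 110_000_000_000     # 110 billion events
TOTAL_USERS = 20_000_000           # about 20 million users


def scale_summary(total_bytes: int, events: int, users: int) -> dict:
    """Return simple averages derived from the headline numbers."""
    return {
        "events_per_user": events / users,
        "bytes_per_event": total_bytes / events,
        "bytes_per_user": total_bytes / users,
    }


summary = scale_summary(TOTAL_BYTES, TOTAL_EVENTS, TOTAL_USERS)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

On these assumptions, the averages come to roughly 5,500 events per user and a little over 120 bytes per event. These are averages only; the article does not say how events are distributed across users.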
“Academic researchers everywhere will finally have access to realistic scale data to study how to automatically discover which news articles are of interest to which users, and will be able to compare their methods using this as a shared test case,” said Tom Mitchell, machine learning department chair, Carnegie Mellon University, in a statement. “Here at CMU we’ll certainly be using it for our research.”
“Yahoo’s effort should help to advance machine learning, particularly at the university level. Its effect on business organizations is hard to parse, though. Over time, many of the innovations that universities develop do find their way into the commercial market,” Charles King, principal analyst at Pund-IT, said. “Given the size and richness of the dataset Yahoo is releasing, it could very well support and inspire research that will eventually benefit businesses.”
In the vast majority of instances, companies collecting datasets of this sort retain them for their own private use, King noted. As a result, data scientists at universities and associated research labs are forced to make do with much smaller data samples.
“In essence, by making this huge dataset charting anonymized user interactions with Yahoo properties available to academic researchers, the company is helping to advance machine learning efforts among users who seldom, if ever, have access to such a profusion of data,” King said.
King says the release of Yahoo Labs' machine learning dataset also qualifies as a self-promotional move on Yahoo’s part, one that positions the company as a player in the rapidly growing area of machine learning. The company’s ongoing business troubles sometimes mask its history of developing innovative, often market-leading technologies, and this effort could and should help counteract that misperception.