Vana to Allow Users to Rent Out Their Reddit Data for AI Training
As generative AI surges, data has become highly valuable. That raises a question: shouldn’t you be allowed to sell your own?
AI developers, from large tech corporations to startups, acquire e-books, images, videos, audio and more from data brokers in order to build more capable and more legally defensible AI-based products. Shutterstock has deals with Meta, Google, Amazon and Apple to supply images for model training. OpenAI, meanwhile, has signed agreements with several news organizations to train its models on their archives.
In most cases, the original creators and owners of that data are not financially compensated. A startup called Vana wants to change that.
Vana was co-founded by Anna Kazlauskas and Art Abal, who met in an MIT Media Lab course on building technology for emerging markets. Before Vana, Kazlauskas studied computer science and economics at MIT, then went on to a fintech automation startup, Iambiq, through Y Combinator. Abal was previously a corporate lawyer and an associate at The Cadmus Group, a Boston-based consulting firm. He later led impact sourcing at the data annotation firm Appen.
Vana’s platform lets users gather their data, such as chat records, voice recordings and images, into datasets that can then be used to train generative AI models. The aim is to create more personalized experiences, from a motivational voicemail that supports your health goals to an art-generating app that understands your artistic tastes, by fine-tuning public models on your personal data.
Kazlauskas told TechCrunch that Vana’s infrastructure effectively creates a user-owned data treasury. The platform is non-custodial: users gather their own personal data and can own the AI models built on it. They can also use their data across different AI applications.
Here is how Vana pitches its platform and API to developers:
Vana’s API acts as a bridge that connects a user’s personal data across platforms, letting you personalize your application. Your app gains immediate access to a user’s personalized AI model or their underlying data, which simplifies onboarding and removes concerns about compute costs. Users should be free to move their personal data from closed platforms such as Instagram, Facebook or Google to your application, so that you can offer a personalized experience from a user’s very first interaction with your consumer AI application.
Opening an account with Vana is fairly straightforward. After confirming your email, you can attach data to a digital persona (such as self-portraits, a personal description and voice clips) and explore apps built on Vana’s platform and datasets. The selection ranges from ChatGPT-like chatbots and interactive storybooks to a Hinge profile creator.
With data privacy awareness growing and ransomware attacks prevalent, one might question the wisdom of volunteering personal information to a little-known startup, let alone a venture-backed one. Vana has raised $20 million so far from investors including Paradigm and Polychain Capital. Can any for-profit company truly be trusted not to abuse or mishandle the monetizable data it holds?
In response, Kazlauskas emphasized that Vana is designed to let users regain control over their own data. Users can host their data themselves rather than store it on Vana’s servers, she said, and they control how it is shared with apps and developers. She also argued that because Vana makes money from monthly user subscriptions (starting at $3.99) and from a “data transaction” fee charged to developers (for example, when transferring datasets for AI model training), the company has no incentive to exploit its users or the troves of personal data they bring.
According to Kazlauskas, their goal is to create models that are owned and managed by users who contribute their own data, and to enable users to bring their data and models to any application.
While Vana does not sell user data to companies for AI model training, it does allow users to do so if they choose, beginning with their Reddit posts.
This month, Vana announced the launch of a program called the Reddit Data DAO (decentralized autonomous organization). The program pools multiple users’ Reddit data, including their karma and post history, and lets them decide collectively how the combined data is used. Once users join with their Reddit account, submit a data request to Reddit and upload the resulting data to the DAO, they gain the right to vote alongside other DAO members on decisions such as licensing the combined data to AI companies for a shared profit.
We have crunched the numbers and r/datadao is now the largest data DAO in history: Phase 1 welcomed 141,000 Reddit users with 21,000 full data uploads.
— r/datadao (@rdatadao) April 11, 2024
It’s an answer of sorts to Reddit’s recent moves to commercialize data on its platform.
Reddit previously didn’t gate access to posts and communities for generative AI training purposes. But it reversed course late last year, ahead of its IPO. Since the policy change, Reddit has raked in over $203 million in licensing fees from companies including Google.
“The broad concept [with the DAO is] to liberate user data from the significant platforms that aim to monopolize and profit from it,” Kazlauskas stated. “This is a novel idea and forms a part of our initiative to assist individuals to amalgamate their data into user-owned data sets for AI model training.”
It is not surprising that Reddit — which is not collaborating with Vana in any official manner — is not supportive of the DAO.
Reddit has banned Vana’s subreddit, which was devoted solely to discussion of the DAO. A Reddit spokesperson also accused Vana of “misusing” its data export system, which was designed to comply with data privacy laws such as the GDPR and the California Consumer Privacy Act.
“Our data arrangements provide us with the capability to establish checks on such entities, even with public information,” the spokesperson told TechCrunch. “Reddit does not divulge non-public, personal data to commercial enterprises, and when Redditors request an export of their data from us, they receive their non-public personal data back from us in accordance with applicable regulations. The direct collaboration between Reddit and screened organizations, demarcated by clear principles and accountability, is of significance, and these partnerships and agreements restrict the misuse and exploitation of individuals’ data.”
Does Reddit have real reason to worry?
According to Kazlauskas, the DAO could eventually influence how much Reddit can charge its customers for its data. That is likely a long way off, however: the DAO currently has just over 141,000 members, a tiny fraction of Reddit’s user base of 73 million. Some of those members may also be bots or duplicate accounts.
There is also the question of how to fairly distribute any payments the DAO receives from data buyers.
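To put the scale gap in perspective, here is a quick back-of-the-envelope calculation. It uses only the figures quoted in this article (141,000 members, 21,000 full uploads, 73 million Reddit users); the variable names are illustrative.

```python
# Figures as quoted in the article; nothing here is independently verified.
dao_members = 141_000        # Reddit users who joined in Phase 1
full_uploads = 21_000        # members who uploaded a full data export
reddit_users = 73_000_000    # Reddit user base cited in the article

# Share of Reddit's user base that has joined the DAO.
member_share = dao_members / reddit_users

# Share of DAO members who have actually contributed a full upload.
upload_rate = full_uploads / dao_members

print(f"DAO members as a share of Reddit users: {member_share:.2%}")
print(f"DAO members with full data uploads: {upload_rate:.1%}")
```

On these numbers the DAO covers roughly 0.19% of Reddit's user base, and only about 15% of its members have uploaded a full export.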
For now, the DAO awards users “tokens” (a form of cryptocurrency) according to their Reddit karma. But karma may not be the best gauge of the quality of a contribution to the dataset, particularly in smaller Reddit communities where there are fewer opportunities to earn it.
Kazlauskas suggests that DAO members might opt to share their cross-platform and demographic data, potentially boosting the DAO’s value and encouraging more people to join. That, however, would require users to place even more trust in Vana to handle sensitive information properly.
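The article does not spell out the DAO's actual formula. As a purely illustrative sketch, assume tokens (or any eventual payment) are split strictly in proportion to karma. The function name, the user names and the choice to treat negative karma as zero are all hypothetical.

```python
def split_by_karma(payment, karma_by_user):
    """Split `payment` pro rata by karma (hypothetical rule, not Vana's)."""
    # Assumption: negative karma counts as zero rather than reducing the pool.
    weights = {user: max(karma, 0) for user, karma in karma_by_user.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no positive karma to distribute against")
    return {user: payment * weight / total for user, weight in weights.items()}


# Three made-up members: one from a huge subreddit, two from smaller ones.
karma = {"big_sub_poster": 9000, "mid_sub_poster": 900, "niche_expert": 100}
shares = split_by_karma(1000.0, karma)

for user, amount in shares.items():
    print(f"{user}: {amount:.2f}")
```

Under this assumed rule, the member from the large community takes 90% of the pool regardless of how useful each person's posts are as training data, which is the concern with using karma as a proxy.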
I doubt that Vana’s DAO will achieve considerable influence; the obstacles are simply too many. Still, I don’t think this will be the last attempt to regain control over the data increasingly used to train generative AI models.
Companies like Spawning are devising ways for creators to set rules on how their data is used for training, while providers like Getty Images, Shutterstock and Adobe are still experimenting with compensation models. But a solution remains elusive. Is one even possible? Given how competitive the generative AI business is, it looks like a daunting task. But perhaps someone will find a way, or be compelled to by regulators.