AI Startups Taking Data Into Their Own Hands
The Rise of ‘Human-in-the-Loop’ AI: Why Companies Are Paying People to Perform Everyday Tasks
The next wave of artificial intelligence isn’t being built on freely scraped data or cheap overseas annotation. Rather, companies are increasingly focused on meticulously curated, proprietary datasets, and they’re paying a premium to collect them. From artists wearing GoPro cameras to executive assistants training email models, the “human-in-the-loop” approach is becoming a key competitive advantage in the AI landscape.
The GoPro-Wearing Artists of Turing Labs
Taylor (who requested anonymity) spent a week this summer with a GoPro strapped to her forehead, meticulously documenting her artistic process. She and her roommate painted, sculpted, and completed household chores, all while generating data for Turing Labs, an AI company. The goal wasn’t to teach the AI to create art, but to develop its understanding of sequential problem-solving and visual reasoning.
“We woke up, did our regular routine, and then strapped the cameras on our head and synced the times together,” Taylor explained. “Then we would make our breakfast and clean the dishes. Then we’d go our separate ways and work on art.”
The work was demanding: it took seven hours a day to produce five hours of usable footage, and it often left her with headaches and a “red square on your forehead” after removing the camera. But the pay was good, and it allowed Taylor to focus on her art.
Turing Labs is deliberately targeting “blue-collar” professions (chefs, construction workers, electricians) to build a diverse dataset. “We are doing it for so many different kinds of blue-collar work, so that we have a diversity of data in the pre-training phase,” said Sudarshan Sivaraman, Turing’s Chief AGI Officer. “After we capture all this data, the models will be able to understand how a certain task is performed.”
From Scraped Data to Proprietary Advantage
This approach represents a meaningful shift in the AI industry. Historically, training datasets were often scraped from the web or assembled using low-cost annotators. Now, companies are recognizing the value of carefully curated, proprietary data as a key differentiator. The raw power of AI is largely established; the competitive battleground is now the quality and uniqueness of the data used to train it.
Fyxer: The Power of Expert Annotators
Fyxer, an email company utilizing AI to sort and draft replies, provides another example of this trend. Founder Richard Hollingsworth discovered that the quality of data, not the quantity, was the primary driver of performance. This led to an unconventional staffing strategy.
“We realized that the quality of the data, not the quantity, is the thing that really defines the performance,” Hollingsworth stated.
In the early days, Fyxer’s engineers and managers were outnumbered by executive assistants tasked with training the model on fundamental email handling skills. “We used a lot of experienced executive assistants, because we needed to train on the fundamentals of whether an email should be responded to,” Hollingsworth explained. “It’s a very people-oriented problem. Finding great people is very hard.”
Fyxer now prioritizes smaller, more tightly curated datasets for post-training, recognizing that focused expertise yields better results.
Synthetic Data and the Importance of a Strong Foundation
Many companies, including Turing Labs, are leveraging synthetic data to expand their training sets. Turing estimates that 75-80% of its data is synthetic, extrapolated from the original GoPro footage. However, that reliance only underscores the critical importance of a high-quality original dataset.
“If the pre-training data itself is not of good quality, then whatever you do with synthetic data is also not going to be of good quality,” Sivaraman emphasized.
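To put the 75-80% figure in perspective, a rough back-of-the-envelope sketch shows how much synthetic data each unit of original footage would have to support. This is an illustration, not Turing's own accounting: the function name `synthetic_per_real` is hypothetical, and it assumes the synthetic share is measured in the same units as the original data.

```python
def synthetic_per_real(synthetic_pct: int) -> float:
    """Units of synthetic data per unit of original data.

    synthetic_pct is the share of the full dataset that is synthetic,
    as a whole-number percentage (assumed to be below 100).
    """
    if not 0 <= synthetic_pct < 100:
        raise ValueError("synthetic_pct must be in the range [0, 100)")
    return synthetic_pct / (100 - synthetic_pct)


# The two ends of the range quoted in the article.
for pct in (75, 80):
    ratio = synthetic_per_real(pct)
    print(f"{pct}% synthetic: {ratio:.1f} synthetic units per original unit")
```

Under those assumptions, every unit of original data underpins roughly three to four units of synthetic data, so any flaw in the original footage would be multiplied accordingly.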
Data as a Competitive Moat
Beyond data quality, the act of collecting data in-house provides a significant competitive advantage. Hollingsworth believes that building custom models and investing in human-led data training is the best way to stay ahead.
“We believe that the best way to do it is indeed through data,” he said, “through building custom models, through high quality, human led data training.”
While open-source models are readily available, the ability to find and train expert annotators is a unique and valuable asset.
The Cost of Quality Data: A Look at Freelance Rates
While specific rates vary based on complexity and expertise, here’s a general overview of freelance data annotation costs (as of late 2024):
| Task | Estimated Hourly Rate |
|---|---|
| Simple Image Labeling | $15 – $25 |
| Complex Video Annotation (e.g., Turing Labs’ work) | $40 – $75+ |
| Expert Email Annotation (e.g., Fyxer’s EAs) | $50 – $100+ |
| Synthetic Data Validation | $30 – $60 |
Source: DataAnnotationTech.com, Upwork.com (averaged rates)
