Baidu Imposes Restrictions on Google and Bing for Scraping Content in AI Development
As a tech journalist, Zul specializes in cloud computing, cybersecurity, and disruptive technologies within the enterprise sector. He is skilled in hosting webinars and delivering video presentations, with a solid foundation in network technology.
Chinese internet search company Baidu has updated its Wikipedia-like Baike service to block Google and Microsoft Bing from scraping its data.
This modification was noted in the recent revision of the Baidu Baike robots.txt file, which now prohibits Googlebot and Bingbot crawlers from accessing its content.
The alteration was recorded on August 8, according to the Wayback Machine. In earlier configurations, search engines like Google and Bing were granted permission to index Baidu Baike’s extensive database of nearly 30 million entries, albeit with certain restrictions on specific subdomains of the site.
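As a rough illustration of how a robots.txt rule of this kind works, the sketch below uses Python's standard `urllib.robotparser` module to check which crawlers a set of rules admits. The rules, the `ExampleBot` user agent, the `example.com` URL, and the `crawl_permissions` helper are all hypothetical and chosen for illustration; they are not taken from the actual Baidu Baike file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the actual Baidu Baike robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: *
Allow: /
"""


def crawl_permissions(robots_txt, agents, url):
    """Return a dict mapping each user agent to whether it may fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# "ExampleBot" and the example.com URL are placeholders (assumptions).
result = crawl_permissions(
    SAMPLE_ROBOTS_TXT,
    ["Googlebot", "Bingbot", "ExampleBot"],
    "https://example.com/item/1",
)

for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is a voluntary convention: it tells well-behaved crawlers what the site operator permits, but it does not technically prevent access.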
This action by Baidu comes amid increasing demand for large datasets used in training artificial intelligence models and applications. It follows similar moves by other companies to protect their online content. In July, Reddit blocked various search engines, except Google, from indexing its posts and discussions. Google has a financial agreement with Reddit for data access to train its AI services.
According to sources, Microsoft considered restricting access to internet-search data for rival search engine operators over the past year, particularly those that used the data for chatbots and generative AI services.
Meanwhile, the Chinese Wikipedia, with its 1.43 million entries, remains available to search engine crawlers. A survey conducted by the South China Morning Post found that entries from Baidu Baike still appear in both Bing and Google search results. The search engines may still be serving older cached content.
The move comes as developers of generative AI around the world increasingly work with content publishers to access the highest-quality content for their projects. For instance, OpenAI recently signed an agreement with Time magazine to access its entire archive, dating back to the very first day of the magazine’s publication over a century ago. A similar partnership was inked with the Financial Times in April.
Baidu’s decision to restrict access to its Baidu Baike content for major search engines highlights the growing importance of data in the AI era. As companies invest heavily in AI development, the value of large, curated datasets has significantly increased. This has led to a shift in how online platforms manage access to their content, with many choosing to limit or monetise access to their data.
As the AI industry continues to evolve, it’s likely that more companies will reassess their data-sharing policies, potentially leading to further changes in how information is indexed and accessed across the internet.
(Photo by Kelli McClintock)