In the era of big data, alongside direct collection from users, another major source of data is the use of web crawlers to gather public information. How widespread are crawlers? According to industry insiders, crawlers actually contribute more than 50% of Internet traffic, and possibly much more. For some popular web pages, crawler visits may account for over 90% of all visits to the page.
From a technical point of view, a crawler is a program that simulates the behavior of a human browsing a website or app and then captures the information its author needs. As the data industry has developed and the value of data has grown, competition for data has become increasingly fierce, and "crawling" and "anti-crawling" have turned into an endless offensive and defensive confrontation. Some crawlers have gone against the wishes of websites, accessed them without authorization, and obtained large amounts of public or non-public data, giving rise to many legal disputes.
On October 23, the Hangzhou Yangtze River Delta Big Data Research Institute, the Shanghai Yangpu District People's Procuratorate, the Shanghai Enterprise Legal Counsel Association, the Zhejiang Enterprise Counsel Association, and the Caijing Business Governance Institute jointly held the "Yangtze River Delta Data Compliance Forum and Seminar on the Legal Regulation of Data Crawlers", inviting a number of prominent legal scholars, judges, prosecutors, and Internet practitioners to discuss the topic from different perspectives, including "data crawler technology and its industrial impact", "civil liability for data crawlers", and "criminal compliance for data crawlers".
"Crawlers have a wide range of application scenarios, both compliant and non-compliant. For example: crawling review data from e-commerce websites for market research; digital content businesses using crawlers to gather corresponding content from the Internet; crawling court judgment documents and, after cleaning the data, launching a 'paid database'; Qichacha and Tianyancha also use crawler technology to commercialize open government data," introduced Liu Yu, head of digitalization at L'Oréal China.
Liu Yu explained the basic principle of crawlers: a crawler typically locates the URL links on a website, obtains the data in each page, and then parses and uses that data. Whether on the web or on mobile, basic crawlers follow this principle. Using crawling technology carries risks for both the crawling and the crawled party, ranging from website crashes to jail time.
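The principle Liu Yu describes (locate links, fetch pages, extract the data) can be sketched with nothing but Python's standard library. This is a minimal illustration rather than a production crawler; the class name and sample URLs are invented, and a static HTML snippet stands in for a page that would normally be fetched over the network:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect every href found in anchor tags on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# A fetched page would normally come from urllib.request.urlopen();
# a static snippet is used here so the sketch runs offline.
page = ('<html><body>'
        '<a href="/news/1">one</a>'
        '<a href="https://example.com/2">two</a>'
        '</body></html>')
parser = LinkExtractor("https://example.com/index.html")
parser.feed(page)
print(parser.links)
```

A real crawler would feed each extracted link back into a fetch queue and repeat; run aggressively enough, that loop is exactly what produces the server-load problems described below.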
Specifically, for small websites or websites with weak technical capabilities, if crawlers keep visiting around the clock, the server may not withstand the surge in traffic and the website may crash. More troubling, for the programmers who write crawlers, scraping data that should not be scraped and then using it may be illegal.
Liu Yu said that attitudes toward crawlers differ completely across scenarios. Search engine crawlers, for example, are welcome because search engines increase the exposure of the websites they crawl; but most websites do not want other crawlers scraping their data, whether out of concern for server load or for various commercial reasons. This has produced "anti-crawling" and "anti-anti-crawling" mechanisms: websites develop strategies or technical means to stop crawlers from scraping their data, and crawlers in turn try to work around them.
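As a concrete illustration of the "anti-crawling" side, one common technical means is to throttle clients that request pages far faster than any human reader could. The sliding-window check below is a hypothetical sketch; the window size, request budget, and function names are assumptions for the example, not anything described at the seminar:

```python
import time
from collections import defaultdict

# Hypothetical anti-crawling throttle; the numbers are illustrative.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

hits = defaultdict(list)  # client IP -> timestamps of recent requests

def allow_request(client_ip, now=None):
    """Return True while the client stays within its per-window budget."""
    now = time.time() if now is None else now
    # Keep only the timestamps still inside the sliding window.
    recent = [t for t in hits[client_ip] if now - t < WINDOW_SECONDS]
    hits[client_ip] = recent
    if len(recent) >= MAX_REQUESTS:
        return False  # over budget: likely a crawler, not a human reader
    hits[client_ip].append(now)
    return True

# A burst of 101 requests inside one window: the last one is refused.
burst = [allow_request("203.0.113.7", now=0.0) for _ in range(MAX_REQUESTS + 1)]
print(burst.count(True), burst.count(False))
```

The "anti-anti-crawling" response is then to slow down, rotate IP addresses, or disguise the crawler as a normal browser, which is why the two sides keep iterating.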
A common response is for a website to publish a Robots protocol file, first written by Dutch engineer Martijn Koster in 1994 and later adopted as a common communication mechanism between data crawlers and the crawled party. The "China Internet Industry Self-Discipline Convention" issued by the Internet Society of China in 2012 recognized compliance with the Robots protocol as an "internationally accepted industry management and business rule".
▲ The Robots protocol is a mechanism for communicating wishes between the data crawler and the crawled party
However, Liu Yu said, the Robots protocol is more like a gentleman's agreement: it serves as notice, not prevention. Crawler technology, anti-crawler technology, and anti-anti-crawler technology keep iterating; as long as a website or app can be accessed by users, it can be crawled.
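The "notice, not prevention" point is visible in how the protocol works: robots.txt is just a text file that a well-behaved client chooses to consult before fetching. Python ships a parser for it in the standard library; the user-agent names and rules below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one crawler is banned outright, everyone
# else is only asked to stay out of /private/. Names are invented.
robots_txt = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)  # normally set_url() + read() fetch the live file

print(rp.can_fetch("BadBot", "https://example.com/news/1"))      # False
print(rp.can_fetch("GoodBot", "https://example.com/news/1"))     # True
print(rp.can_fetch("GoodBot", "https://example.com/private/x"))  # False
```

Nothing here stops a client that simply skips the check, which is exactly Liu Yu's point: the protocol notifies, it does not prevent.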
Improper crawling wastes social and technical resources, and those resources are hard-won. Zeng Xiang, general counsel of Xiaohongshu, said that some crawlers scrape data by "simulating real-user access" or "breaking through protocols". "These are disgraceful means. The crawled websites have to mount offensive and defensive measures, wasting a great deal of corporate resources."
Zeng Xiang said that for content platforms, a crawler attack can easily infringe the intellectual property rights of the platform and its users. Crawling is usually purposeful: if core business secrets are scraped, they can be used directly elsewhere to build a competitive advantage. In his view, crawlers also undermine the public order of the Internet: "Whether the crawled data can be used effectively, whether it is placed under supervision, and where the data flows are all big question marks."
Judging the civil liability of crawlers
"Technology is neutral, but applications of technology never are." Zhang Zhe, director of litigation at Sina Group, said that beyond discussing the principles of crawling technology, it is more important to look at what the technology is used for and whether the behavior itself is justified.
Recently, the Beijing Higher People's Court (the "Beijing High Court") issued a second-instance judgment in the "Toutiao v. Weibo unfair competition case". Weibo had been sued for setting up a blacklist in its Robots protocol that restricted ByteDance from crawling relevant web content. The court held that this was a legitimate act within the scope of Weibo's right to operate its business independently and did not constitute unfair competition, and it revoked the first-instance judgment. Zhang Zhe said the judiciary's evaluation of the Robots protocol is two sides of the same coin.
In its 2020 judgment in the "360 v. Baidu unfair competition case", the Beijing High Court held that without reasonable and justified grounds, Baidu should not restrict search engines from crawling its web content based on the identity of the visitor. In the "Toutiao v. Weibo" case, the court established the principle that an enterprise has the right to restrict visitors within the scope of its own business, and that such restrictions may be found improper only when they harm the public interest or infringe consumers' rights.
According to Gao Fuping, professor at the School of Law of East China University of Political Science and Law and director of its Data Law Research Center, crawlers and the data industry are closely connected: the data intelligence and big data analysis that data companies talk about basically begin by capturing data, then mining and analyzing it. Crawlers are generally considered a neutral technology today, but more often than not their users aim to "get something for nothing".
Gao Fuping believes it is difficult to judge the legitimacy of crawlers without first discussing whether legitimate data producers have the right to control their data. Discussion at home and abroad of the legal boundaries of crawlers focuses mainly on two aspects: the means and the purpose of data crawling.
In terms of means, a crawler that ignores a website's access controls, or masquerades as a legitimate visitor, will be considered unlawful. In terms of purpose, the question is whether the crawling party "substantially substitutes" part of the products or services provided by the crawled party; if it does, the purpose is unlawful.
If a website accumulates data resources lawfully, the producing side can control their use. More importantly, recognizing that the data controller may open its data for commercial purposes, through licensing, exchange, and transactions, allows the data to be used by more people. "On the premise that legitimate data producers have control rights, crawlers that ignore the Robots protocol can be cracked down on," Gao Fuping added.
Xu Hongtao, a judge in the Intellectual Property Division of the Shanghai Pudong Court, believes two issues around the Robots protocol and data flow need consideration: first, how to grasp the degree of "interconnection" and data sharing; second, whether the Robots protocol strategies currently adopted by Internet industry operators may create data islands. The essence of interconnection is to ensure the orderly flow of data, not to force Internet industry operators to fully open their platforms' data resources to competitors. In the context of "interconnection", "order" and "circulation" are equally important and indispensable, and acts that hinder fair competition or endanger user data security under the guise of "interconnection" must be excluded.
In a case in which a new media company crawled data from the WeChat public platform, the Hangzhou Internet Court made its view clear: the Internet platform had set up a Robots protocol, and the hope is that competitors, even in the course of competition, will abide by it, or at least maintain an agreement of mutual respect and compliance, which is the foundation of order.
In that case, the court held that allowing third-party crawlers to scrape official-account information would discourage creation on the platform and distort the competitive mechanism of the big data factor market. From the perspective of consumer interests, scraping information without authorization and displaying it fails to respect the wishes of those who published the information. From the perspective of the public interest, the defendant did not dig deeper into, innovate on, or further apply the crawled information, and so did not enhance the overall public interest of society; moreover, the data was obtained through abnormal channels, making the conduct hard to justify.
Xu Hongtao believes that data is the core competitive resource of the content industry, and the data collected and analyzed by content platforms often has extremely high economic value. Requiring content platform operators to open their core competitive resources to competitors without limit would not only violate the spirit of "interconnection" but would also harm the continued production of high-quality content and the sustained development of the Internet industry.
Xu Hongtao said that the legitimacy of non-search-engine crawlers can be judged by four elements: first, whether the crawler respects the Robots protocol preset by the crawled website; second, whether it breaks the crawled website's technical protection measures; third, whether it threatens the security of user data; and fourth, a weighing of creativity against the public interest.
Xu Hongtao specifically pointed out that user data, including identity data and behavioral data, is not only a competitive resource for operators but also carries personal privacy attributes, so the collection of such data implicates the public interest. If data scraping endangers the security of user data, the behavior is not justified.
Crawlers and criminal compliance
Criminal compliance, which originated in the United States, refers to a set of supervision, restraint, and incentive mechanisms established by the state, using criminal law as a tool, to promote corporate compliance management.
In 2020, driven by the Supreme People's Procuratorate, grassroots procuratorial organs in Shenzhen, Zhejiang, Jiangsu, and Shanghai began actively exploring corporate criminal compliance. To encourage more companies to undertake compliance reform, a new criminal-procedure mechanism of "compliance-based non-prosecution" has been rolled out nationwide: procuratorates select companies involved in crimes that are likely to establish compliance and that plead guilty and accept punishment, have them establish compliance plans, and then decline to prosecute.
Wu Juping, deputy director of the Third Procuratorial Department of the Second Branch of the Shanghai People's Procuratorate, said that criminal compliance is mainly about giving the companies involved in a case a chance to rectify, save themselves, and make a fresh start, and about safeguarding high-quality social and economic development. At present, the criminal compliance that many companies care about is mostly how to avoid criminal risk in their business conduct. Wu Juping believes that companies using crawler technology for data analysis should focus on how to implement criminal compliance.
Wu Juping said: "Setting aside outright illegal techniques such as Trojan horse and virus programs, in judging whether crawling-related conduct constitutes a crime, we first determine whether the behavior amounts to intruding into a computer information system or illegally obtaining computer information system data, and then look at whether the crawled data involves corporate data or citizens' personal information, with the relevant charges applied respectively."
It is also necessary to consider whether the legal attribute of the crawled data is property or merely data, which Wu Juping said is a major controversy in judicial practice. "For example, in one of our cases the perpetrators used illegal detention to force the other party to hand over virtual currency. Criminally it was recognized as illegal detention, denying the virtual currency's property attribute; in the civil proceedings the property was ordered returned, recognizing its property attribute." She believes that data is an important factor of production in the development of the digital economy and in essence should have property attributes, but current law and judicial practice have not fully kept up.
Zhang Yong, a professor at East China University of Political Science and Law, classified the crimes that crawlers may involve. In terms of the rights and interests that may be infringed, they include computer system security, personal information, copyright, state secrets, trade secrets, and the order of market competition. In terms of crawling methods, crawlers may endanger the security of computer information systems, illegally obtain citizens' personal information, illegally obtain trade secrets, or circumvent technical protection measures for copyright. In terms of crawling results, there are problems such as unfair competition, copyright infringement, and infringement of personality rights.
Caijing E Law retrieved 54 crawler-related criminal judgments from the China Judgments Online database, involving multiple offenses: 26 convictions for the crime of infringing citizens' personal information; 10 for illegally obtaining computer information system data; 5 for disseminating obscene materials for profit; 3 for damaging computer information systems; 3 for providing programs and tools for intruding into or illegally controlling computer information systems; 3 for infringement of intellectual property rights; and 1 each for illegal intrusion into a computer information system, opening a casino, theft, and fraud.