How Does “Headlines Today” Provide Personalized News?
The rise of “Headlines Today” (今日头条) has brought news aggregators to public attention. Built on data-mining technology, the app filters mass information and recommends “valuable, personalized content for readers,” offering “a new service connecting people and information” as one of the fastest-growing products in China’s mobile internet sector. This article analyzes the communication mechanism behind Headlines Today and the principles on which it rests.
Future News: A Database-Driven News Industry
The transformation of communication technology has broken the old limits of channels and platforms. Affordable communication devices now give users access, at any time and place, to any information that does not harm public or private interests, so material from governments, listed companies, and social organizations, along with the analyses of experts and scholars, flows to every individual unconstrained by time and space. But this mass of information is too overwhelming to digest, which means humanity is entering a new “attention era”: whoever wins the public’s attention wins the future.
Attention is, and will remain, a scarce resource in our society. Since individuals are rarely qualified to verify sources and analyze information professionally, the market needs a trusted group to filter information and provide insight. Experts, scholars, engineers, and data analysts are all competent for the job, but the social division of labor makes them less competitive than media practitioners, the producers of information: journalists are the first providers both of information and of insight into it. As an information-observing profession, the journalists of the future will focus on news mining. They will differ from their contemporaries because journalism will become more professional and more ambitious: journalists will systematically collect, cleanse, and analyze information in order to deliver a comprehensive yet useful information cluster to readers. Interviewing, news writing, editing, and commentary will all change under new methods of data gathering, cleansing, and analysis, news visualization, and narration. Together these renovating forces are called “data news.” In the future, the databases built and maintained by journalism will be the richest strategic information resource in human history.
This thought experiment about future journalism gives us a vantage point for observing the current media transformation. News media can survive only by adapting to that future: communication technology must shift from transmitting information to mining it and presenting it dynamically if it is to keep a place in the journalism market of the future.
Web Crawlers: The Basic Technology for Scraping News
Headlines Today is a typical data-news platform. An analysis of its attention profile shows that it concentrates on examining and recommending information, the medium’s distinctive feature, and it is this feature that gives the platform its public value: providing valuable news to the public. Its news comes mainly from web crawlers built on search-engine technology, supplemented by content from its partners. The crawlers collect information across the internet, pick out the news, and sort it by release time. A data-news platform is distinguished from ordinary data news by one key function: it is a media platform that carries a stream of information rather than a single piece.
The working mechanism of a web crawler relies on the internet’s hyperlink network. Because most pages contain hyperlinks, they are woven together into one vast web; the crawler, acting as an internet robot, starts from certain seed pages, copies their content, seeks out and visits their hyperlinks, and repeats the entire process. It has two main strategies: path-ascending crawling, in which the crawler does not move on until it has downloaded as many resources as possible from a particular website, and focused crawling, in which the crawler, upon finding a relevant hyperlink, immediately crawls that page in depth. Here is an example of how the crawler in a news aggregator like Headlines Today works: first, the back-end technicians set up a dictionary of news sources (keywords) such as “Netease News,” “Sina News,” “Ifeng News,” and “Zhejiang News”; the crawler then follows the hyperlinks on those websites to collect news. If a story sits on a related blog rather than on one of the listed news platforms, the crawler will miss it.
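The crawl-and-follow loop described above can be sketched in a few lines. This is a minimal illustration, not Headlines Today’s actual crawler: the page graph and source names below are invented, and a real crawler would fetch live pages over HTTP rather than read from a dictionary. The whitelist stands in for the “dictionary of news sources,” and the skipped blog page shows how off-list stories get missed.

```python
from collections import deque

# Toy hyperlink network standing in for the web (all URLs illustrative).
PAGES = {
    "news.sina.com/a": ["news.sina.com/b", "blog.example.com/x"],
    "news.sina.com/b": ["news.163.com/c"],
    "news.163.com/c": [],
    "blog.example.com/x": [],  # a blog, not a listed news source
}

# The "dictionary" of approved news sources described in the text.
SOURCE_WHITELIST = ("news.sina.com", "news.163.com")

def crawl(seeds):
    """Breadth-first crawl restricted to whitelisted news domains."""
    visited, queue, seen = [], deque(seeds), set(seeds)
    while queue:
        url = queue.popleft()
        if not url.startswith(SOURCE_WHITELIST):
            continue  # e.g. a blog page: the crawler misses it
        visited.append(url)  # the "copy the page content" step
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl(["news.sina.com/a"]))
```

Swapping the queue for a stack would turn this breadth-first traversal into the depth-first behavior of focused crawling.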
Of course, the concept of a news aggregator is not that simple. Beyond gathering information from different media, it has a more crucial feature: sorting that information into an aggregate, which usually takes the form of a ranking list. The communication mechanism of such lists follows the “priority link mechanism” of network science (preferential attachment): users pay disproportionate attention to the items at the top of the list.
News Aggregators: The New Darling of International Journalism
News aggregators are widely used overseas as well. Information on a data-news platform can be presented either as a flat list or as a visualization of search-engine results. The Japanese site newsmap (http://newsmap.jp) exemplifies the latter. Built on Google News, it uses colors to mark news categories, for example red for “World” and yellow for “National.” Users can also filter stories through the column tabs at the bottom of the page or the country and region options at the top. The back-end algorithm automatically adjusts the area of each news tile according to the volume of related coverage, its importance, and its hits.
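The idea that tile area grows with importance and hits can be made concrete with a small sketch. Newsmap’s real weighting formula is not public, so the score below (importance times hits, normalized over the canvas) is purely a hypothetical illustration, and the stories are invented.

```python
# Hypothetical stories; "importance" and "hits" are invented signals.
stories = [
    {"title": "World summit", "importance": 0.9, "hits": 5000},
    {"title": "Local election", "importance": 0.6, "hits": 1200},
    {"title": "Tech IPO", "importance": 0.4, "hits": 800},
]

def tile_areas(stories, canvas_area=100.0):
    """Allocate canvas area in proportion to importance * hits."""
    scores = [s["importance"] * s["hits"] for s in stories]
    total = sum(scores)
    return [round(canvas_area * sc / total, 1) for sc in scores]

print(tile_areas(stories))  # → [81.2, 13.0, 5.8]
```

The dominant story claims most of the canvas, which is exactly the visual effect newsmap produces.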
Social news sites such as Digg and Reddit are another good example: users register, log in, follow other users, submit news, and vote on it. The highest-scoring stories are placed on the popular-news page, so socializing and information aggregation both play important roles in the process. Aggregation, as a mechanism for selecting and displaying news, brings still more clicks to popular stories, and along the way every user acts as a gatekeeper who lets good news in and keeps bad information out. This is called group examining.
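The group-examining mechanism reduces to a simple vote tally. This sketch uses an up-minus-down score and an arbitrary cutoff of two slots on the popular page; Digg’s and Reddit’s actual ranking formulas also weigh factors such as recency, which are omitted here, and the story names and counts are invented.

```python
# Invented submissions with up/down vote counts from registered users.
submissions = {
    "story-A": {"up": 120, "down": 10},
    "story-B": {"up": 45, "down": 40},
    "story-C": {"up": 300, "down": 25},
}

def popular_page(submissions, top_n=2):
    """Rank stories by net score; the top ones reach the popular page."""
    ranked = sorted(
        submissions,
        key=lambda s: submissions[s]["up"] - submissions[s]["down"],
        reverse=True,
    )
    return ranked[:top_n]

print(popular_page(submissions))  # → ['story-C', 'story-A']
```

Every vote is a small act of gatekeeping: only stories the crowd collectively endorses cross onto the popular page.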
The significance of group examining, however, is simply to lift good stories onto the popular-news page, that is, before the public; after that, information spreads in the traditional way. Aggregation in which users help filter the news is in fact commonplace, the “hot topics” list on Sina Weibo being one example, and it is a feature of nearly every ranking list. My analysis of news diffusion on Digg shows how strongly this kind of aggregation shapes dissemination: more than 70% of widely spread stories reach users through the popular-news page.
Recommendation Systems: The Technical Logic of Personalized News Production
Personalized recommendation is vitally important to Headlines Today, whose login interface is deliberately user-friendly: it lets users sign in through Weibo, QQ, and other social accounts, which gives the platform authorized access to basic information from those profiles and makes personalized recommendations easy to generate. The more it knows about a user, the more fitting its recommendations. The foundation of personalization is a recommendation system, an automatic tool linking users to page content, used chiefly in scenarios where users have no explicit query. The information that feeds such a system falls into three categories: friends, interests, and registration data, to which further signals such as time and location can be added. Recommendation systems are now used across many areas, from news, books, music, and movies to friend suggestions.
The core algorithm of a recommendation system is the construction of a similarity matrix. The matrix can encode content similarity, for example between the books or songs being recommended: if two news items are browsed by many of the same people, they are likely similar in content. The system refines this matrix from user behavior to make increasingly accurate recommendations, and if users’ interest profiles (likes and dislikes) are available, the recommendations become more precise still. Suppose that on the 70th anniversary of the Victory of the Chinese People’s War of Resistance Against Japanese Aggression and the World Anti-Fascist War, four readers visit Headlines Today at the same time. Reader A, a woman, browses “How to Make a Light Autumn Syrup,” “Five Things You Should Know about Parenting,” and news about the parade. Reader B, a middle-aged worker, clicks on the parade and a list of China’s new weapons. Reader C, a senior, reads about the parade and health. Reader D, a new college graduate, visits experience-sharing articles on video games and a Hollywood tour, plus the parade news. Noticing that all four read the parade story, the algorithm weighs clicks and viewing time and concludes that the parade is the day’s hot news; the same process identifies other hot stories. When a new user clicks, the algorithm analyzes the content and recommends the hot news that matches his interests. The entire process runs automatically.
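The four-reader example can be sketched directly from co-click counts, the raw material of the similarity matrix. This is a bare-bones illustration using only the clicks named in the text (item names abbreviated); a production system would add viewing time, dwell weighting, and matrix factorization, none of which appear here.

```python
# Click sets for the four readers described in the text.
clicks = {
    "A": {"autumn syrup", "parenting", "parade"},
    "B": {"parade", "new weapons"},
    "C": {"parade", "health"},
    "D": {"video games", "hollywood tour", "parade"},
}

def co_click_count(item_a, item_b):
    """One entry of the similarity matrix: users who clicked both items."""
    return sum(1 for items in clicks.values()
               if item_a in items and item_b in items)

def hot_news(min_readers=3):
    """Items read by enough distinct users count as the day's hot news."""
    all_items = {i for items in clicks.values() for i in items}
    return [i for i in all_items
            if sum(i in items for items in clicks.values()) >= min_readers]

print(hot_news())  # the parade is the only story all four readers share
print(co_click_count("parade", "health"))
```

Each co-click count is one cell of the similarity matrix; as behavior accumulates, these counts let the system infer which stories resemble one another without reading a word of them.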
The example above shows that personalized news is served to readers from a database of hot news across a wide range of topics. But this creates a problem: if a user’s interests deviate from the hot news, the system finds no relevant hot stories at all and must fish for related information to satisfy the user as best it can; even when it succeeds, what it finds may be outdated. The key to an ideal personalized recommendation service is therefore to categorize the aggregated news as finely as possible: only by dividing topics and subdividing the sub-topics can the service become truly personal. This dividing work cannot be done by machines alone; it depends on human understanding of the nature of things. As more and more user behavior is collected, the divisions will grow more precise and the automatic personalized service more user-friendly.
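The fallback behavior described above, fishing for related information when a sub-topic has no fresh hot story, presupposes exactly the kind of topic taxonomy the text calls for. This sketch is entirely illustrative: the topic tree, story titles, and fallback rule (borrow hot stories from sibling sub-topics under the same parent) are all invented to show the principle.

```python
# Hypothetical human-curated taxonomy: parent topics and their sub-topics.
TOPIC_TREE = {
    "sports": ["football", "basketball"],
    "tech": ["smartphones", "ai"],
}

# Hypothetical hot stories currently available per sub-topic.
hot_stories = {
    "football": [],               # nothing fresh here today
    "basketball": ["finals recap"],
    "smartphones": ["new phone review"],
    "ai": [],
}

def recommend(sub_topic):
    """Serve hot stories for the sub-topic, else fall back to siblings."""
    if hot_stories.get(sub_topic):
        return hot_stories[sub_topic]
    for parent, children in TOPIC_TREE.items():
        if sub_topic in children:
            return [s for c in children for s in hot_stories.get(c, [])]
    return []

print(recommend("football"))  # falls back to sibling news under "sports"
```

The finer the taxonomy, the shorter the fallback distance, which is why the text argues that human-made subdivisions are the key to genuinely personal service.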
The transition from media examining to group examining is progress, but moving from group examining to examination by computers and algorithms carries risks, and traditional news-production logic suffers most in the process. Traditional news values put public interest first: events with long-term influence are reported, and analysis always follows. Handing this work over to machines and algorithms poses unprecedented challenges. First, algorithms sort and recommend information according to the “interests” users display over a given period, which may not reflect their real interests, and such recommended stories are often poorly written. Second, frequent exposure to vulgar content erodes individual media literacy. Third, the algorithm starts from individual rather than social or national information, and the resulting imbalance in attention distribution invites criticism.
Considering the transformation of journalism from a future perspective reminds us of the importance of returning to the essence of news. Future journalism will not offer a limited number of case interviews but will systematically obtain, accumulate, and analyze data to reveal hidden information. In the era of the attention economy, the media’s duty is to supply users with professional information and commentary. Fast-growing data news is advancing in that direction, even if for now it emphasizes visual presentation. News aggregators filter information automatically, an embodiment of future news making; built on personalized recommendation, they make people’s lives more convenient by bringing AI technology into news integration. Yet we must never ignore the risks of over-dependence on machines and algorithms, which can distort proper media values. All in all, the future of journalism is an era of human-machine integration.
The author is a researcher at ZMT and an assistant researcher at the School of Journalism and Communication, Nanjing University.
(This article is a phased achievement of the Youth Project of the National Social Science Fund, No. 15CXW017.)
Note: This article was originally published as “The Technical Logic: Web Crawler + Matrix Filter,” Communication Review, 2015, No. 10.