Purpose

This study aims to assess the technological capability of Chinese internet platforms (BAT: Baidu, Alibaba, Tencent) compared to US ones (GAFA: Google, Amazon, Facebook, Apple). More specifically, this study explores Baidu’s technological catching-up process with Google by analyzing their patent textual information.

Design/methodology/approach

The authors retrieved 26,383 Google patents and 6,695 Baidu patents from PATSTAT 2019 Spring version. The collected patent documents were vectorized using the Word2Vec model first, and then K-means clustering was applied to visualize the technological space of two firms. Finally, novel indicators were proposed to capture the technological catching-up process between Baidu and Google.

Findings

The results show that Baidu follows a trend of US rather than Chinese technology which suggests Baidu is aggressively seeking to catch up with US players in the process of its technological development. At the same time, the impact index of Baidu patents increases over time, reflecting its upgrading of technological competitiveness.

Originality/value

This study proposed a new method to analyze technology mapping and evolution based on patent text information. As both US and China are crucial players in the internet industry, it is vital for policymakers in third countries to understand the technological capacity and competitiveness of both countries to develop strategic partnerships effectively.

The advancement of artificial intelligence (AI) (machine learning) could turn massive data from the internet and IoT sensors into a gold mine (Agrawal et al., 2018). AI technology is versatile and applicable across various industries (Trajtenberg, 2018; Motohashi, 2020). Not only does it improve the accuracy of predictions, but it also enhances the economy of scope in big data analysis. The nature of general-purpose technology of AI, or non-rivalry of big data for various applications, allows internet business firms to grow as internet platforms, expanding their services to a variety of industries (Goldfarb and Trefler, 2018). Accordingly, Google, Amazon, Facebook and Apple (GAFA) have become top-listed firms in stock market valuation ranking.

At the same time, the concentration of data in a small number of firms, such as GAFA, has raised concern among national authorities outside the US. Google has been fined a combined $9.5bn since 2017 by EU antitrust regulators, and EU regulatory bodies have kept a close watch on the activities of other US internet firms. The EU also enforces the General Data Protection Regulation to ensure that European standards of privacy protection apply when private data are transferred beyond EU borders. Such policy actions could lead to “virtual nationalism,” where cyberspace is compartmentalized by nation/region (Economist, 2020).

In this regard, China is going its own way by virtually banning internet business on US internet platforms and international data transfer (Chorzempa et al., 2018). As a result, indigenous internet giants Baidu, Alibaba and Tencent (BAT) have emerged in a domestically segmented cyberspace insulated from international competition. Based on huge amounts of data from 800 million smartphone users, as well as large domestic markets in China, Alibaba and Tencent are listed in the global top 20 in terms of market capitalization. Recently, BAT have invested heavily in AI technology based on a large talent pool inside China. The Chinese Government plans to become a global AI leader by 2025, and BAT is supposed to play a crucial role (Biancotti and Ciocca, 2018).

This study focuses on Baidu and Google and assesses the technological capability of Chinese internet platforms compared to US ones. These two firms are quite comparable in terms of their business domain and advertising based on internet search queries, and both firms have recently made substantial investments in autonomous driving technology. We use text information (abstract) of patent applications submitted to the US Patent and Trademark Office (USPTO) and CNIPR (China patent authority). The text information of patent data is assumed to reflect the content of the invention precisely. The similarity score of two patents based on the patent abstract provides more accurate information than their IPC code (Arts et al., 2017). In addition, the vector space model with a high dimension of continuous variables gives finer-grained information about patent contents, as compared to one-dimensional IPC codes with discrete variables (Younge and Kuhn, 2016; Motohashi et al., 2019).

Understanding the technological capability of Chinese firms is important from the perspective of both business and policy. A firm in a developed economy, such as Japan, cannot conduct internet/IoT business in China by itself but needs to collaborate with local firms such as BAT. Under such conditions, it is critical to assess the technological capability of Chinese counterparts, as the bargaining position in partnership negotiation depends on relative management resources, particularly technological capacity, to which Chinese firms are eager to gain access. In addition, as tensions between the US and China due to trade disputes become intense, information on technological competitiveness in both countries is essential intelligence for policymakers in third countries. This is particularly the case for Japan as both countries are very important partners, and an inappropriate strategy to deal with them may cause substantial damage to the domestic economy.

The remainder of this paper is organized as follows. Section 2 reviews catch-up related literature and our research framework. Section 3 outlines the data source and methodology of our vector space model based on internet technology patents from USPTO and CNIPR. Google and Baidu patents are compared via two types of empirical analysis in Sections 4 and 5. One is an overview of the technologies of these two firms using clustering analysis. The other is based on a more micro view of individual patents, together with the distribution of patented technologies of its neighbors in the technology space. Finally, we conclude with a summary of the findings and policy implications in Section 6.

The concept of catch-up possesses a significant and enduring historical legacy, marked by notable examination in Abramovitz’s (1986) influential work. It achieved prominence in the post-Second World War period, characterized by the USA’s early adoption of advanced methods of production and industrial practices that other countries had not yet embraced. According to this scenario, catch-up is commonly defined by economic scholars as the process of reducing the disparities in productivity and income between a leading nation and a trailing one (Fagerberg and Godinho, 2005). Kashani et al. (2022) examine the evolution of catch-up studies and suggest that catch-up can be measured by a range of indicators, including productivity, income and technological capability. The primary focus of this study lies in the technological aspect of catch-up, defined as the significant improvements in technological capabilities by firms from technologically disadvantaged nations as they close the gap with advanced incumbents, moving closer to the global technological frontier (Miao et al., 2018).

Theoretically, Bell and Pavitt (1993) introduce a framework to conceptualize technology as a capability in the catch-up process. This framework emphasizes that technological capabilities, representing a firm's capacity to absorb and learn from imported technology, are critical determinants of successful technology transfers in developing countries. It has underpinned numerous empirical studies examining the growth of latecomer firms and the impediments to their leadership emergence. Studies on technology catch-up fall into the following two main categories based on research methods: qualitative case studies and quantitative empirical research. Qualitative studies have explored the success of catch-up among Asian firms in diverse industries, including consumer electronics, automotive and shipbuilding (Cho et al., 1998; Kim, 1998; Fan, 2006; Mathews, 2006). In quantitative research, patent data, often regarded as a common proxy for technological knowledge, have gained prominence in monitoring the technological catch-up process.

Considering catch-up as a learning process, prior research has used patent citations to track technology acquisition. Wang et al. (2014) leverage the citations of licensees’ patents to discern if latecomer firms had gleaned knowledge from prior licensing agreements. In addition, Lee (2013) conducts a comprehensive comparison of technological capabilities between Korean firms and their US counterparts, using a range of citation-based indicators such as quality, originality and diversity. Although citation information has been widely used to measure patent quality and technology spillover, such information is not available in many developing countries. In this light, we introduce a novel framework that leverages patent text data to monitor the catch-up process between latecomer firms and incumbents in advanced economies. Initially, we train our own Word2Vec model based on a large-scale patent corpus. This trained Word2Vec model is then used to convert patent texts, specifically abstracts, into vector format. Subsequently, clustering analysis is performed to provide an overview of the technological landscape and detailed technical domains. Following that, two semantic-based indicators are introduced to compare the technological capabilities of Google and Baidu. Constructing pairwise cosine similarity scores for a large-scale data set, such as one exceeding one million entries, is computationally demanding. Therefore, we use a neighborhood graph and tree (NGT) to search for similar patent pairs. Figure 1 presents the proposed research framework.

To conduct a fair comparison of a US firm (Google) and a Chinese firm (Baidu), we use the patent data from USPTO and CNIPR. Specifically, we retrieve all patent application information by Google (26,383 USPTO patents) and Baidu (6,695 CNIPR patents) from the PATSTAT 2019 Spring version. We then check the IPC subgroups of these patents to identify internet-related technology patents. We identify a total of 2,350 IPC subgroups, but many of them contain a very small number of Google or Baidu patents.

We treat the subgroups with at least 100 Google or Baidu patents as core technologies of internet search engine-related business and retrieve all patents belonging to these 50 subgroups for subsequent analysis. There are 680,241 US patents and 427,628 CN patents from 1959 to 2018. The subgroups span seven IPC classes (“F24,” “G01,” “G02,” “G06,” “G09,” “G10,” “H04”), but more than 95% of patents belong to the G06 (computing, calculating, counting) and H04 (electric communication technique) classes. Figure 2 shows the number of patents by application year. It should be noted that most patent applications via CNIPR have been made within the last five years, while USPTO patent applications were made relatively earlier. The drop in patent applications in recent years comes from data truncation associated with the time lag between application and publication years, particularly for USPTO patents.

A myriad of patents makes it difficult to mine out useful information and relationships among them. Recent text mining techniques have been proposed to turn a document into a vector form so that existing machine learning algorithms can be used. We followed the classic Skip-gram model proposed by Mikolov et al. (2013) to build word vector representations for our patent corpus. We then calculated the document embedding for a patent by averaging all nouns occurring in that patent. To do so, we first conducted a preprocess on the patent corpus. Wang et al. (2019) noted that the word representations should be able to demonstrate multifacetedness. That is, the trained Word2Vec model should yield meaningful representations for words in different forms (e.g. in different tenses). Furthermore, many pre-trained word embedding models (e.g. Google pre-trained Word2Vec models) kept words in their original forms.

In line with this convention, we did not conduct lemmatization; we only removed punctuation, placed all words in lowercase and turned all digits into a token “<num>”. The corpus was built on 1,107,869 patent applications. We retained words with frequencies higher than four. A Skip-gram model was then adopted to build a 300-dimensional vector for each word in the corpus. Our Skip-gram model generated vector representations for 170,340 words, of which 73,780 (43%) were nouns.
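The preprocessing described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `preprocess` and `build_vocabulary` are hypothetical, punctuation is assumed to be replaced by whitespace, and only tokens made up entirely of digits are assumed to be mapped to “<num>”.

```python
import re
from collections import Counter

NUM_TOKEN = "<num>"  # placeholder token for digits, as named in the text


def preprocess(text):
    """Lowercase, drop punctuation, map digit-only tokens to <num>.

    No lemmatization is applied, so inflected forms stay distinct.
    Assumption: punctuation is replaced by a space before tokenizing.
    """
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [NUM_TOKEN if tok.isdigit() else tok for tok in cleaned.split()]


def build_vocabulary(tokenized_docs, min_count=5):
    """Keep words whose corpus frequency is higher than four."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {word for word, n in counts.items() if n >= min_count}
```

The resulting token lists would then be passed to a Skip-gram trainer; the choice of training library is not stated in the text.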

From the results of this word embedding (a 300-dimensional vector for each word), the document vector dj (representing the patent content) is computed as follows:

$$ d_j = \frac{1}{n_j} \sum_{w_i \in d_j,\; w_i \in N} v_i $$

where vi is the vector representation of word wi; nj is the number of nouns occurring in the document dj; and N is the set of all nouns in the dictionary.
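As a rough illustration of this averaging step, the sketch below computes a document vector from a toy word-vector dictionary. The function name `document_vector` and the plain-list representation are assumptions for illustration; every noun occurrence is assumed to contribute once to the average.

```python
def document_vector(tokens, word_vectors, nouns):
    """Average the vectors of all noun occurrences in one document.

    tokens       : list of words in the abstract
    word_vectors : dict mapping word -> list of floats (same length each)
    nouns        : set of words treated as nouns (the set N in the text)
    """
    noun_vecs = [word_vectors[t] for t in tokens
                 if t in nouns and t in word_vectors]
    if not noun_vecs:
        return None  # no embeddable noun in this document
    n_j = len(noun_vecs)
    dim = len(noun_vecs[0])
    # Component-wise mean over the n_j noun vectors.
    return [sum(vec[k] for vec in noun_vecs) / n_j for k in range(dim)]
```

For example, a document containing the nouns “image” and “display” is placed midway between the two word vectors, while non-noun tokens are ignored.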

The document embedding results are created in two steps: (1) word embedding and (2) aggregation at the document level. In terms of the first step, we conduct a face validation of word embedding results. Specifically, we conduct k-means clustering of embedded words to check that similar words are clustered into the same cluster. The results of the clustering analysis are presented in  Appendix 1. For example, the first cluster consists of “image-related” words, including “image,” “position,” “display,” and “picture.” The second one shows the list of text-related words (“document,” “language,” etc.). Accordingly, it is possible to conclude that our word embedding results are reasonable.

In the second step (aggregation at document level), we take a simple average of word embedding vectors in each document. To assess the document embedding results, we use Doc-DB patent family information. Within each patent family, all patents are based on the same invention, so the contents of these patents should be close to each other. We calculate pairwise cosine similarities of the patents corresponding to the same patent family. It should be noted that one patent family could have both USPTO and CNIPR patents. Therefore, we could evaluate document embedding results separately using US-US, CN-CN and US-CN pairs.
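The family-based validation can be sketched as follows. This is a simplified illustration under stated assumptions: each family is given as a list of (authority, vector) pairs with authority coded as "US" or "CN", all vectors are non-zero, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def family_pair_similarities(family):
    """Group pairwise similarities within one patent family by pair type.

    family: list of (authority, vector) tuples, authority in {"US", "CN"}.
    """
    result = {"US-US": [], "CN-CN": [], "US-CN": []}
    for (auth_a, vec_a), (auth_b, vec_b) in combinations(family, 2):
        pair_type = "US-CN" if auth_a != auth_b else f"{auth_a}-{auth_b}"
        result[pair_type].append(cosine_similarity(vec_a, vec_b))
    return result
```

Pooling these lists over all families would give the three distributions compared in the validation.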

Figures 3 and 4 show the distribution of cosine similarity of document embedding results between patent family pairs. For a given patent family, we calculated all pairwise cosine values of its member patents and then described the results separately using US-US, CN-CN and US-CN pairs. The mode of each pair type is 1 (exactly the same vector), and most pairs have cosine similarity close to 1. We could conclude that our document embedding method produces reasonable results. In addition, the US-US patent family pairs are relatively closer in terms of content than the CN-CN pairs, and the US-CN pairs are in the middle. Therefore, there may not be any systematic bias associated with the data source (USPTO or CNIPR patents), which is important for a fair comparison between Google and Baidu in the following sections.

Table 1 shows the results of descriptive statistics of cosine similarities of patent pairs by type of family and by type of document-level aggregation. We have again confirmed that the median point of each type of pair is close to 1 (at least 0.97), suggesting the validity of document embedding results. Table 1 also reports the results using TF-IDF weighted averages of word embedding results (figures with asterisks). The cosine similarity of these figures is even lower than that of the simple mean. Therefore, we proceed with the subsequent analysis by using the document embedding results with a simple average of word embedding vectors.

The contents of the patent corpus are explored by dividing the whole corpus into several clusters. We used k-means to conduct clustering based on the vectorized patent content information. In terms of the granularity of clustering, we take the number of IPC subclasses, that is, 11. We could set this number arbitrarily, but it becomes difficult to gain a broad picture from too many clusters. In addition, the number of clusters cannot be too small, as the whole corpus would then be divided too coarsely. We applied k-means clustering to 1,107,869 patents, and the word cloud of each cluster is presented in Figure 5. The size of each word in this figure corresponds to the aggregated TF-IDF value of the word in each cluster (the sum of patent-level TF-IDF values at the cluster level) and can be formally expressed as follows:

$$ T_i(C) = \sum_{D_j \in C} t_{ji} $$

where Ti(C) is the aggregated value of word wi in cluster C, Dj’s are the patents in cluster C, and tji is the TF-IDF value of word wi in patent Dj. Figure 5 also shows the label of each cluster, created by using this word cloud information, together with 10 patents located near the center point of each cluster (a list of titles of these patents is presented in Appendix 2).
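The cluster-level aggregation of TF-IDF values can be illustrated with a short sketch. The input layout (one dictionary of TF-IDF values per patent plus a cluster label per patent) and the function names are assumptions made for this example; how the patent-level TF-IDF values are computed is left outside the sketch.

```python
from collections import defaultdict


def cluster_word_weights(tfidf_by_patent, cluster_of_patent):
    """Sum patent-level TF-IDF values of each word within each cluster.

    tfidf_by_patent   : dict patent_id -> dict word -> TF-IDF value (t_ji)
    cluster_of_patent : dict patent_id -> cluster label
    """
    weights = defaultdict(lambda: defaultdict(float))
    for patent_id, word_scores in tfidf_by_patent.items():
        cluster = cluster_of_patent[patent_id]
        for word, score in word_scores.items():
            weights[cluster][word] += score
    return {c: dict(w) for c, w in weights.items()}


def top_words(weights, cluster, k=3):
    """Words with the largest aggregated value, i.e. the largest in the cloud."""
    ranked = sorted(weights[cluster].items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```

The top-ranked words per cluster are the ones that would dominate the corresponding word cloud and suggest a cluster label.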

Figure 6 visualizes the contents of 1.1 million patents, together with the location of each of the 11 clusters. For this purpose, the 300-dimensional document vectors are reduced to 2D space. We use Uniform Manifold Approximation and Projection (UMAP), which has superior run-time efficiency (McInnes et al., 2018). UMAP can convert high-dimensional data into a low-dimensional space while preserving both local and global structures. There are three broad types of patent content:

  1. web application, such as data analytics, language modeling and web content application;

  2. display interface, such as image recognition and human interface; and

  3. ICT infrastructure, such as storage system, file management and mobile communication.

Figure 7 shows the share of patent applications by cluster and country (USPTO or CNIPR). The share of ICT infrastructure patents (such as storage, file management systems and wireless communication) is found to be larger for the USA, while there are relatively more application-related patents (such as mobile user interaction and data analytics) for China. Such differences come from the difference in the timing of technological development in both countries. US patent applications started in the 1990s and grew rapidly in the early 2000s, while for China, most patent applications were submitted after 2010. Players in China, including Baidu, therefore focus more on application developments based on ICT infrastructure technologies developed by US players.

Figure 8 shows the location of Google and Baidu patents in the technology space based on the information compiled using UMAP in Figure 6. Google patents are more widely distributed in the space, while Baidu patents are concentrated in some particular fields, such as data analytics, mobile user interaction and Web search/language modeling. Google’s first patent application was submitted in 1997, while Baidu started applying for patents mainly after 2009. As is shown in cross-country trends in the USA and China, Baidu focuses on application development in the process of technologically catching up with Google.

To control for cross country differences in patent contents, we calculate the revealed comparative advantage (RCA) index for Google and Baidu by cluster as follows:

where Pij is the patent count by firm “i” and cluster “j”. Figure 9 shows RCA for Google and Baidu (i = Google or Baidu) by cluster (j). It should be noted that the value of RCA is greater than 1 when a firm focuses on a particular field, and vice versa. First, the pattern of RCA by cluster is very similar across these two firms. As both are operating internet search engines, a high value can be found for web search and language modeling (Google: 2.48, Baidu: 3.36). In addition, the RCA of file management system is greater than 1 for both firms. Second, differences can be found between these firms in web content application (Google > Baidu) and mobile user interaction (Google < Baidu). This point can be explained by the difference in the ICT environment between the two countries, that is, mobile internet is diffused more widely in China. As a consequence, it is more important for Baidu to invest more in mobile-specific applications, such as internet services taking user location information into account.
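For readers unfamiliar with the index, the sketch below computes RCA in its standard (Balassa-type) form: the share of a cluster in a firm's patents divided by the share of that cluster in a reference population. Using all patents filed with the same patent authority as the reference population is an assumption made here, as are the function name and the toy counts; the exact formula used in this study may differ.

```python
def rca(firm_counts, reference_counts):
    """Revealed comparative advantage by cluster (standard form, assumed).

    firm_counts      : dict cluster -> number of patents of the firm
    reference_counts : dict cluster -> number of patents in the reference
                       population (assumed: same patent authority),
                       with every count greater than zero
    """
    firm_total = sum(firm_counts.values())
    ref_total = sum(reference_counts.values())
    return {
        cluster: (firm_counts.get(cluster, 0) / firm_total)
                 / (reference_counts[cluster] / ref_total)
        for cluster in reference_counts
    }
```

With toy counts of 30 search and 10 storage patents for a firm against 100 and 300 in the reference population, the search cluster has an RCA of 3 and the storage cluster an RCA below 1.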

The foregoing clustering analysis provides an overview of the technology space in terms of patenting, but it does not provide detailed information on the within-cluster distribution of individual patents. In this section, we generate statistics regarding the neighborhood patents to each of over one million patents in our sample in terms of content. Specifically, we estimate the top 200 nearest patents in terms of cosine similarity to each patent.

An apparent difficulty is that deriving all pairwise cosine similarities among one million patents involves a massive amount of computation. We, therefore, used an NGT proposed by Sugawara et al. (2016) for indexing, which is an approximate similarity search method. NGT has been developed for efficient retrieval of relevant internet content by search engines, but it can be applied to any type of text information. Motohashi et al. (2019) use NGT results for patent titles and abstracts published by the Japan Patent Office to understand the characteristics of academic patents (as compared to firm patents).

NGT uses a tree structure for indexing network graphs efficiently. Its key parameter is epsilon, which sets the search range for nearest neighbors. There is a trade-off between the search range and search time. We fit our samples and use epsilon = 0.35 with an accuracy rate of 0.997 (see Appendix 3 for details).
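To make the retrieval target concrete, the sketch below finds the exact top-k neighbors by cosine similarity with a brute-force scan. It is deliberately not NGT: it is the exact baseline that an approximate index stands in for, and it would be far too slow for a million patents. The function names and the dictionary input are assumptions for illustration.

```python
import heapq
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_neighbors(query_id, vectors, k=200):
    """Exact top-k by cosine similarity (brute force, NOT the NGT index).

    vectors: dict patent_id -> list of floats (non-zero vectors).
    Returns [(patent_id, similarity), ...] sorted from most to least similar.
    """
    unit = {pid: _normalize(v) for pid, v in vectors.items()}
    q = unit[query_id]
    scored = (
        (sum(a * b for a, b in zip(q, vec)), pid)
        for pid, vec in unit.items() if pid != query_id
    )
    return [(pid, sim) for sim, pid in heapq.nlargest(k, scored)]
```

The similarity of the last returned neighbor is the quantity averaged when the 200th nearest patent is used as a density measure, and one minus that value is the corresponding radius.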

Figure 10 presents the average cosine similarity of the 200th nearest patents (i.e. the patents ranked 200th in terms of the cosine similarity) with each of 1.1 million patents by application year and patent authority. An upward time trend (technology space becomes denser over time) can be found in CNIPR patents, while it is not the case for USPTO patents. As a result, the cosine similarity of the 200th nearest patents for CNIPR patents (around 0.90) becomes greater than that of USPTO patents (around 0.88) on average.

Figure 11 shows the share of USPTO patents in the top 200 nearest patents by patent authority (CNIPR or USPTO). The share for USPTO patents is stable at around 70%, meaning 30% of the top 200 nearest patents are CNIPR patents. In contrast, the share for CNIPR patents rose until 2006, then fell. The upward trend corresponds to the period in which the number of USPTO patents increases, while a downward trend occurs when the number of CNIPR patent applications overtakes USPTO patents. More importantly, a pattern of technology divergence is revealed between the two countries, that is, increasing numbers of same-country patent pairs in terms of content similarity rather than cross-country pairs.

The information on the 200 nearest patents in terms of patent contents provides a picture of the technology space around the patent to be examined. As shown in Figure 12, finding near patents corresponds to drawing a border within which the 200 nearest patents are located. The border is a hypersphere (in 300 dimensions) whose radius is the distance (e.g. 1 − cosine similarity) between the patent to be examined and the 200th nearest patent. The technology space is densely populated with surrounding patents if the radius is small, and vice versa. It should be noted that there are two types of surrounding patents. One comprises patents applied for before the patent to be examined, and the other comprises patents applied for thereafter. The patent to be examined can draw on information from preceding patents, and we refer to such patents as BASE. We refer to the latter as FOLLOW, as these patent applications were submitted following the patent to be examined.

BASE could be considered as a backward citation and FOLLOW as a forward citation. Hence, the number of BASE patents can be used as an indicator of the novelty of a patent (smaller BASE means more novelty), and the number of FOLLOW patents indicates the impact of a patent (larger FOLLOW means more impact).
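Counting BASE and FOLLOW patents for one focal patent reduces to comparing application dates among its nearest neighbors. The sketch below is a minimal illustration; the function name is hypothetical, and neighbors filed on the same date as the focal patent are assumed to count as FOLLOW so that FOLLOW equals the neighborhood size minus BASE.

```python
from datetime import date


def base_follow_counts(focal_date, neighbor_dates):
    """Split a focal patent's nearest neighbors into BASE and FOLLOW.

    focal_date     : application date of the patent to be examined
    neighbor_dates : application dates of its nearest neighbors
    Assumption: same-day neighbors are counted as FOLLOW.
    """
    base = sum(1 for d in neighbor_dates if d < focal_date)
    follow = len(neighbor_dates) - base
    return base, follow
```

A small BASE count then points to a relatively novel patent, and a large FOLLOW count to a relatively influential one, before any normalization for timing.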

We use this information to assess the technological capability of Google and Baidu. As is the case for citation information, this indicator can be biased by data truncation, that is, the newer the patent to be examined, the more BASE patents and the fewer FOLLOW patents could be found. Therefore, we normalized the number of BASE and FOLLOW (200-BASE) using the number of patent applications before and after, respectively. In addition, there is a time trend of such indicators, particularly for CNIPR patents. As the number of patent applications increases (Figure 2) in densely populated fields (Figure 10) for CNIPR patents, IMPACT tends to be larger, while BASE is smaller. Therefore, we need to control for the patent authority difference (USPTO or CNIPR). Finally, we derive the following indicator for cumulativeness (less novel) and impact for each patent:

where BASEi and FOLLOWi are the numbers of BASE and FOLLOW patents of patent “i” with application date “T” and patent authority “c” (US or China), and Pt is the count of patent applications at application date “t.” Here we conduct double normalization: by timing (BASE is normalized by the number of patent applications before the patent to be examined, that is, all candidates for BASE, and likewise for FOLLOW) and by the country of the patent authority.

As cumulativeness and impact are patent-level indicators, we could aggregate this at the firm level. Figure 13 presents the trend of cumulativeness indicators of Google and Baidu. Here, we produce three types of these indicators: (1) using all patents, (2) using USPTO patents only and (3) using CNIPR patents only in the 200 nearest patents. The distinction of patent authority allows us to investigate the technology trajectory of these firms within and across countries. The cumulativeness of Google used to be below 1, suggesting relatively novel patents under the USPTO patent standards, but it has recently reached one due to an increasing trend of US neighbor patents. This could be explained by the convergence of internet technologies among major players such as (G)AFA. The increasing trend of cumulativeness is clearer in the case of Baidu. Baidu patents used to be relatively novel (less than 1) under Chinese standards, but this has also recently reached 1. Increasing numbers of USPTO patents are used as a base, and Baidu has aggressively caught up with US players in the process of its technological development.

Figure 14 shows the impact indicators of Google and Baidu. Google’s performance is stable over time around 1, reflecting an average impact under US standards. However, the impact of USPTO patents is found to be more than average (around 1.2), while the impact of CNIPR patents is less than average (0.7 to 0.8). In contrast, Baidu shows quite dynamic patterns for this indicator. While the overall impact indicator has recently fallen, USPTO neighbor patents reveal an increase regarding this indicator. Together with the finding in Figure 13, Baidu is found to pay more attention to technological development in China and started patenting in mainstream technologies in the USA so that both cumulativeness and impact measured by US patents increase over time. It should be noted that the USPTO-based impact indicator has recently become greater than 1, suggesting Baidu has achieved technological catching up with US players to some extent.

Technology upgrading of China’s internet platforms has received growing attention given their huge data assets from a billion mobile users together with ample engineering talent for AI and data science. China has set a goal of becoming a global leader in AI by 2025, and it is assumed that BAT (China’s GAFA equivalent) will play a vital role. Using Google as the benchmark, this study assessed the technological capability of Baidu. We use patent text information (abstract of invention) to examine how these two firms have developed over time.

We extract internet-related technology patents from USPTO and CNIPR patent publication information to determine the technology trajectory of both countries’ patent applicants. Internet-related patent applications to CNIPR have increased significantly in the past five years, and the contents of patent applications in both countries are found to be diverging. This may be due to the fact that China’s internet market is segmented from the rest of the world and evolving in its own way. The rapid progress of mobile internet in China also explains the difference in technology portfolios across the two countries.

Given such general trends of technological development, Baidu and Google show broadly similar patterns in their focal areas of R&D, such as web search technology and data analytics for language modeling, reflecting their common business models built on internet search engines. However, our results reveal some differences, such as more mobile applications in Baidu and more web content applications in Google. In terms of the dynamics of technological development, Baidu follows a trend of US rather than Chinese technology, and it is assumed that Baidu is aggressively seeking to catch up in the process of technological development. At the same time, the impact index of Baidu patents increases over time, suggesting its upgrading of technological competitiveness.

This study proposes a new methodology to analyze technology mapping and evolution based on patent text information. The citation information has been used extensively for patent characteristics (mainly patent quality) and technology spillover (Nagaoka et al., 2010). However, patent citation information is unavailable in many countries, including China. In contrast, the proposed methodology offers wider geographic applicability, particularly when using patent information in developing countries, due to the availability of patent abstract information in most nations. Furthermore, recent studies have highlighted the utilization of companies’ Web pages to monitor their market-side opportunities (Park and Geum, 2022; Motohashi and Zhu, 2023). As web data are also in a textual format, our proposed methodology can be easily applied to these datasets for a better understanding of market-side catch-up and competition.

However, there are also some limitations in our methodology. First, we use fixed word embedding information over time. The content of the same term, such as “machine learning,” for example, should change over time as its technology progresses. Therefore, our document embedding results can represent a range of various technologies, but they are weak at measuring the progress (or depth) of a particular technology component. Using a word embedding methodology that takes the context of each word within paragraphs into account, such as BERT, may be a potential solution. In addition, the size of neighbor patents (200 in our case) is arbitrary. We could decrease or increase this size, but the number depends on the scope of our analysis or the extent to which we want to identify the density of technology (patent) distribution. We may use the kernel smoothing technique in multi-dimensional space for future research.

This study is conducted as part of the Project “Digitalization and Innovation Ecosystem: A Holistic Approach” undertaken at the Research Institute of Economy, Trade, and Industry (RIETI). In addition, financial support from JSPS-KAKEN Fostering Joint International Research Program B (Grant No.19K0035) is acknowledged. The authors would like to thank the participants of the discussion seminar at RIETI for their helpful comments.

In total, there are 680,241 US patents and 427,628 Chinese patents. The abstracts of CNIPR patents were translated into English so that all documents are in English.

It should be noted that the difference in document type (USPTO or CNIPR patents) does not cause this pattern, as discussed in Section 2 based on the validation of document embeddings with patent family information.

References

Abramovitz, M. (1986), “Catching up, forging ahead, and falling behind”, The Journal of Economic History, Vol. 46 No. 2, pp. 385-406.
Agrawal, A., Gans, J. and Goldfarb, A. (2018), Prediction Machines: The Simple Economics of Artificial Intelligence, Harvard Business School Press.
Arts, S., Cassiman, B. and Gomez, J.C. (2017), “Text matching to measure patent similarity”, Strategic Management Journal, Vol. 39 No. 1, pp. 62-84.
Bell, M. and Pavitt, K. (1993), “Technological accumulation and industrial growth: contrasts between developed and developing countries”, Industrial and Corporate Change, Vol. 2 No. 2, pp. 157-210.
Biancotti, C. and Ciocca, P. (2018), “Regulating data superpower in the age of AI”, Realtime Economic Issues Watch, October 23, 2018, Peterson Institute for International Economics.
Cho, D.S., Kim, D.J. and Rhee, D.K. (1998), “Latecomer strategies: evidence from the semiconductor industry in Japan and Korea”, Organization Science, Vol. 9 No. 4, pp. 489-505.
Chorzempa, M., Triolo, P. and Saks, S. (2018), “China’s social credit system: a mark of progress or a threat to privacy?”, Peterson Institute for International Economics, Policy Brief 18-14.
Economist (2020), “Special report: the data economy”, The Economist, Feb 22, 2020, London.
Fagerberg, J. and Godinho, M.M. (2005), “Innovation and catching-up”, The Oxford Handbook of Innovation, Oxford University Press, New York, NY, pp. 514-543.
Fan, P. (2006), “Catching up through developing innovation capability: evidence from China’s telecom-equipment industry”, Technovation, Vol. 26 No. 3, pp. 359-368.
Goldfarb, A. and Trefler, D. (2018), “AI and international trade”, NBER Working Paper #24254, Cambridge MA.
Kashani, E.S., Radosevic, S., Kiamehr, M. and Gholizadeh, H. (2022), “The intellectual evolution of the technological catch-up literature: bibliometric analysis”, Research Policy, Vol. 51 No. 7, p. 104538.
Kim, L. (1998), “Crisis construction and organizational learning: capability building in catching-up at Hyundai motor”, Organization Science, Vol. 9 No. 4, pp. 506-521.
Lee, K. (2013), Schumpeterian Analysis of Economic Catch-up: Knowledge, Path-Creation, and the Middle-Income Trap, Cambridge University Press, London.
McInnes, L., Healy, J. and Melville, J. (2018), “UMAP: uniform manifold approximation and projection for dimension reduction”, 6 Dec 2018, arXiv preprint arXiv:1802.03426.
Mathews, J.A. (2006), “Dragon multinationals: new players in 21st century globalization”, Asia Pacific Journal of Management, Vol. 23 No. 1, pp. 5-27.
Miao, Y., Song, J., Lee, K. and Jin, C. (2018), “Technological catch-up by east Asian firms: trends, issues, and future research agenda”, Asia Pacific Journal of Management, Vol. 35 No. 3, pp. 639-669.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013), “Efficient estimation of word representations in vector space”, in ICLR.
Motohashi, K. (2020), “Science and technology co-evolution in AI: empirical understanding through a linked dataset of scientific articles and patents”, RIETI Discussion Paper Series 20-E-010, RIETI, Tokyo Japan.
Motohashi, K. and Zhu, C. (2023), “Identifying technology opportunity using dual-attention model and technology-market concordance matrix”, Technological Forecasting and Social Change, Vol. 197, p. 122916.
Motohashi, K., Koshiba, H. and Ikeuchi, K. (2019), “A method of extracting content information from patent documents and comparison of their characteristics by applicant type by using the vector space model of distributed expressions”, NISTEP Discussion Paper No. 175, MEXT, Japan, Tokyo, (in Japanese).
Nagaoka, S., Motohashi, K. and Goto, A. (2010), “Patent statistics as an innovation indicator”, in Hall, B. and Rosenberg, N. (Eds), Handbook of the Economics of Innovation, Elsevier Science, North Holland, Vol. 2.
Park, M. and Geum, Y. (2022), “Two-stage technology opportunity discovery for firm-level decision making: GCN-based link-prediction approach”, Technological Forecasting and Social Change, Vol. 183, p. 121934.
Sugawara, K., Kobayashi, H. and Iwasaki, M. (2016), “On approximately searching for similar word embeddings”, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
Trajtenberg, M. (2018), “Artificial intelligence as the next GPT: a Political-Economy perspective”, NBER Working Paper #24245, Cambridge MA.
Wang, Y., Roijakkers, N. and Vanhaverbeke, W. (2014), “How fast do Chinese firms learn and catch up? Evidence from patent citations”, Scientometrics, Vol. 98 No. 1, pp. 743-761.
Wang, B., Wang, A., Chen, F., Wang, Y. and Kuo, C. (2019), “Evaluating word embedding models: methods and experimental results”, APSIPA Transactions on Signal and Information Processing, Vol. 8 No. 1, p. e19.
Younge, K.A. and Kuhn, J.M. (2016), Patent-to-Patent Similarity: A Vector Space Model, SSRN.
Kim, D., Lee, H. and Kwak, J. (2017), “Standards as a driving force that influences emerging technological trajectories in the converging world of the internet and things: an investigation of the M2M/IoT patent network”, Research Policy, Vol. 46 No. 7, pp. 1234-1254.

k-means++ was used to assign all words derived from the Skip-gram model to 24 clusters. We chose the number of clusters arbitrarily. The words in each cluster are presented as a word cloud. The Skip-gram model assumes that similar words are more likely to appear in the same context (window). Therefore, the words in each cluster should be understood as associated and related rather than strictly similar.
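The clustering step can be illustrated with a minimal, standard-library Python sketch of k-means with k-means++ seeding. It is not the authors' code: the function names, the Euclidean distance, the fixed random seed and the toy two-dimensional vectors are all assumptions made for illustration, and the toy example uses 3 clusters rather than the 24 used in the study.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """Choose k initial centroids with k-means++ seeding."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its closest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with a chosen centroid
            centroids.append(list(rng.choice(points)))
            continue
        # Sample the next centroid with probability proportional to d2.
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc > r:
                centroids.append(list(p))
                break
        else:  # guard against floating-point rounding at the upper end
            centroids.append(list(points[-1]))
    return centroids


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm started from k-means++ seeds (hypothetical helper)."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Toy stand-in for word vectors: three well-separated groups in 2-D.
toy_vectors = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],
    [10.0, 0.1], [10.1, 0.0], [10.05, 0.05],
]
labels, centroids = kmeans(toy_vectors, k=3)
```

With real Skip-gram vectors, the same call would take the list of word vectors and `k=24`; the words sharing a label would then feed one word cloud each.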

Rather than labeling document clusters by word clouds alone, we also used patent titles as complementary information. For each cluster, we selected the ten patents nearest to its centroid.
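A minimal sketch of this selection step is given below, assuming that each patent has a document vector, a cluster label and a title, and that closeness is measured by Euclidean distance to the cluster mean. The distance measure, the function names and the toy data are assumptions for illustration, not details confirmed by the text.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def nearest_titles(vectors, labels, titles, top_n=10):
    """Map each cluster label to the titles of its top_n members nearest the centroid."""
    clusters = {}
    for vec, lab, title in zip(vectors, labels, titles):
        clusters.setdefault(lab, []).append((vec, title))
    result = {}
    for lab, members in clusters.items():
        c = centroid([vec for vec, _ in members])
        # Squared distance gives the same ranking as Euclidean distance.
        ranked = sorted(
            members,
            key=lambda m: sum((x - y) ** 2 for x, y in zip(m[0], c)))
        result[lab] = [title for _, title in ranked[:top_n]]
    return result


# Toy example with hypothetical 2-D embeddings and placeholder titles.
vectors = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
labels = [0, 0, 0, 1, 1]
titles = ["A", "B", "C", "D", "E"]
representatives = nearest_titles(vectors, labels, titles, top_n=2)
```

In the toy data, the centroid of cluster 0 lies at about 1.67 on the first axis, so "B" ranks first and "A" second; with `top_n=10` on real document vectors, the function would return ten titles per cluster, as listed in Table A1.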

NGT has a primary parameter ϵ that defines the explored range of the graph search and thereby controls precision. In line with the “No Free Lunch” theorem, the wider the explored range, the higher the precision but the longer the search time. To investigate the relationship between the explored range ϵ and accuracy, we randomly sampled n patents from the corpus. Let Ntrue(i) denote the true 200 nearest neighbors of patent i, and Nngt(i, ϵ) the approximate 200 nearest neighbors of patent i returned by NGT. Then, the accuracy for a given value of ϵ is calculated as follows:
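One plausible reading of this accuracy measure, assumed here from the definitions above rather than confirmed by the source, is the mean share of the true 200 nearest neighbors that NGT recovers: for each sampled patent i, |Ntrue(i) ∩ Nngt(i, ϵ)| / 200, averaged over the n sampled patents. A minimal Python sketch under that assumption follows. The function name `knn_accuracy` and the toy neighbor lists are hypothetical, and the approximate neighbors are passed in as plain lists rather than obtained from NGT.

```python
def knn_accuracy(true_neighbors, approx_neighbors, k=200):
    """Mean share of the true k nearest neighbors found by the approximate search.

    true_neighbors[i] and approx_neighbors[i] hold the neighbor IDs of sampled
    patent i, ordered from nearest to farthest.
    """
    if len(true_neighbors) != len(approx_neighbors):
        raise ValueError("both inputs must cover the same sampled patents")
    if not true_neighbors:
        raise ValueError("at least one sampled patent is required")
    total = 0.0
    for truth, approx in zip(true_neighbors, approx_neighbors):
        overlap = set(truth[:k]) & set(approx[:k])
        total += len(overlap) / k
    return total / len(true_neighbors)


# Toy example with k=4 instead of 200: the first patent recovers 3 of 4
# true neighbors, the second recovers all 4.
truth = [[1, 2, 3, 4], [5, 6, 7, 8]]
approx = [[1, 2, 3, 9], [5, 6, 7, 8]]
accuracy = knn_accuracy(truth, approx, k=4)  # (0.75 + 1.0) / 2 = 0.875
```

Evaluating this function once per candidate ϵ, on the same random sample, would give an accuracy curve over the explored range.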

In our case, we drew a random sample of 500 patents and varied ϵ from 0.05 to 1 in steps of 0.05. Figure A2 shows how accuracy changes with ϵ. For the following results, we set ϵ to 0.35, which yielded an accuracy of 0.997 and an acceptable running time in the experiment.

Published in Asia Pacific Journal of Innovation and Entrepreneurship. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode

Data & Figures

Figure 1. Research framework
Figure 2. Internet-related patents by application year
Figure 3. Distribution of cosine similarity between pairs within patent families
Figure 4. Histograms of cosine similarity between pairs within patent families
Figure 5. Word cloud of clustering results
Figure 6. UMAP visualization of patent contents and clustering results
Figure 7. Composition of patent contents by country
Figure 8. Comparison of Google and Baidu patents
Figure 9. RCA of Google/Baidu patents in each country
Figure 10. Cosine similarity of 200th nearest patents
Figure 11. Share of USPTO patents in 200 neighbors by country
Figure 12. Graphical interpretation of NGT results
Figure 13. Cumulativeness of Google and Baidu patents
Figure 14. Impact of Google and Baidu patents
Figure A2. Tuning explored range in NGT analysis

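Table 1. Cosine similarity between pairs within patent families

| Country | Mean | SD | Min | 25% | 50% | 75% | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| US | 0.97 | 0.05 | 0.25 | 0.99 | 1.00 | 1.00 | 1.00 |
| CN | 0.95 | 0.07 | 0.53 | 0.92 | 0.98 | 0.99 | 1.00 |
| USCN | 0.97 | 0.05 | 0.28 | 0.96 | 0.99 | 1.00 | 1.00 |
| US* | 0.95 | 0.10 | 0.14 | 0.99 | 1.00 | 1.00 | 1.00 |
| CN* | 0.89 | 0.14 | 0.24 | 0.84 | 0.97 | 0.99 | 1.00 |
| USCN* | 0.94 | 0.10 | 0.11 | 0.94 | 0.98 | 1.00 | 1.00 |

Note: (*) denotes the results of TF-IDF weighted document embedding

Source: Created by authors
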
Table A1. Document cluster labels

| Cluster label | Title of patent nearest to centroid (ten per cluster) | IPC |
| --- | --- | --- |
| 0 | Method and device for obtaining combined image | G06K9/62 |
| 0 | Digital image visualized management and retrieval for communication network | G06F17/30 |
| 0 | Terminal device, intelligent mobile phone, and face identification-based authentication method and system | G06K9/00 |
| 0 | Remote sensing image significance target detection method and system based on Hadoop | G06F17/30 |
| 0 | Method for detecting over-exposure area in monitoring video image combining multiple features | G06K9/62 |
| 0 | Method and system for detection of representative area of automatic quasi object type image | G06F17/30 |
| 0 | Station identification method and device | G06K9/00 |
| 0 | Method for generating and applying image search code technique | G06F17/30 |
| 0 | Image matching method and image matching device | G06K9/62 |
| 0 | Method and system for replacing background images of smart camera in real time | G06F3/0484 |
| 1 | Distributed storage method and apparatus, and data processing method and apparatus | G06F17/30 |
| 1 | Massive real-time data synchronization system based on private cloud storage | H04L29/08 |
| 1 | Distribution and utilization global total data transmission and storage method and device and electronic equipment | G06F17/30 |
| 1 | Data rapid distribution method and device | H04L29/06 |
| 1 | Method for acquiring and converting data of metering system of intelligent transformer substation | G06F17/30 |
| 1 | Method of pre-caching or pre-fetching data utilizing thread lists and multimedia editing systems using such pre-caching | G06F3/06 |
| 1 | Database normalization storage system and method suitable for use in multi-model satellite testing | G06F17/30 |
| 1 | Data audits based on timestamp criteria in replicated data bases within digital mobile telecommunication system | G06F17/30 |
| 1 | Write operation control method, system and device and computer storage medium | G06F3/06 |
| 1 | Smart storage platform apparatus and method for efficient storage and real-time analysis of big data | G06F3/06 |
| 2 | Context-based photograph sharing platform for property inspections | G06F17/30 |
| 2 | Systems and methods for constructing and using models of memorability in computing and communications applications | G06F3/048 |
| 2 | Systems and methods for constructing and using models of memorability in computing and communications applications | G06F3/048 |
| 2 | Systems and methods for constructing and using models of memorability in computing and communications applications | G06F3/048 |
| 2 | Incentives for content consumption | G06Q30/00 |
| 2 | Method and apparatus for locating errors in documents via database queries, similarity-based information retrieval and modeling the errors for error resolution | G06F17/30 |
| 2 | Method and system for electronic display of photographs | G06F17/30 |
| 2 | Three-dimensional web crawler | G06F17/30 |
| 2 | Intelligent integrating system for crowdsourcing and collaborative intelligence in human- and device- adaptive query-response networks | G06F17/30 |
| 2 | Methods and systems for annotation of digital information | G06F17/24 |
| 3 | Intelligent liquid warehousing device | G06K9/00 |
| 3 | Internet-of-things-based water level monitoring system for water conservancy and hydropower engineering | H04L29/08 |
| 3 | Touch control input device used for electronic information equipment | G06F3/041 |
| 3 | Output device and wearable display | G09G5/00 |
| 3 | Diversified reinforced tablet computer system | G06F1/16 |
| 3 | Force touch module, preparation method thereof, touch screen panel and display device | G06F3/041 |
| 3 | Luminous band display type sliding touch bar and display method of touch luminous band | G06F3/041 |
| 3 | Economical skin-pattern-acquisition and analysis apparatus for access control; systems controlled thereby | G06K9/00 |
| 3 | Shield machine posture solving device based on VBA writing | G06F9/44 |
| 3 | Touch-control module, touch screen and intelligent device and stereo touch-control method | G06F3/041 |
| 4 | Method for understanding questions in question type automatic question-answer systems on basis of rule | G06F17/27 |
| 4 | Data searching method and system based on semantic analysis | G06F17/27 |
| 4 | Information searching method based on metadata | G06F17/30 |
| 4 | Relevancy priority ordering method used for environmental protection regulation retrieval | G06F17/30 |
| 4 | Information management, retrieval and display system and associated method | G06F17/30 |
| 4 | Information management, retrieval and display systems and associated methods | G06F7/00 |
| 4 | Information management, retrieval and display system and associated method | G06F17/30 |
| 4 | Method of indexing words in handwritten document images using image hash tables | G06F17/30 |
| 4 | Method for searching pattern matching index | G06F17/30 |
| 4 | System, method and program product for answering questions using a search engine | G06F17/30 |
| 5 | Search engine method based on keyword resolution scheduling | G06F17/30 |
| 5 | Method and system for automatically converting dynamic form page to HTML5 page | G06F17/22 |
| 5 | Automatic access of electronic information through machine-readable codes on printed documents | G06F12/00 |
| 5 | Electronic commerce system for updating information | G06F12/00 |
| 5 | Web service multithreading file uploading system | H04L29/08 |
| 5 | System and method for creating and posting media lists for purposes of subsequent playback | G06F3/0482 |
| 5 | System and method for creating and posting media lists for purposes of subsequent playback | G06F15/16 |
| 5 | System and method for creating and posting media lists for purposes of subsequent playback | G06F15/16 |
| 5 | Pay per record system and method | H04L29/06 |
| 5 | Dynamic generation of target files from template files and tracking of the processing of target files | G06F7/00 |
| 6 | Wired security access control device of financial industry network and access method of wired security access control device | H04L29/06 |
| 6 | Vehicle identification system and method | G06F17/30 |
| 6 | Control system | G06F3/16 |
| 6 | Plug type audio device and signal processing method | G06F3/16 |
| 6 | Touch display device and touch display method | G06F3/041 |
| 6 | Method and device for playing audio data in sound card signal input channel in real time | G06F3/16 |
| 6 | Portal access control system | G06F7/04 |
| 6 | Method and device for displaying states of ports of switch | H04L12/24 |
| 6 | Computer control system | G06F3/00 |
| 6 | Login method and device for user identified by radio frequency | G06F21/00 |
| 7 | Device, method and equipment for information data interaction for processing information data | G06F17/30 |
| 7 | Smart instant interaction technology for use in radius range of position | G06F17/30 |
| 7 | Information processing method, terminal and electronic device | G06F17/30 |
| 7 | System information security monitoring method and device, computer device and storage medium | G06Q10/10 |
| 7 | Novel electronic device information collection and selective information orientation distribution method | H04L29/06 |
| 7 | Interested object information acquisition method and system with mobile terminals coordinating with cloud terminal | H04L29/08 |
| 7 | Information display method and device | H04L12/58 |
| 7 | Method and device for feeding back information, and terminal | H04L12/58 |
| 7 | Method, device and system for storing social networking service (SNS) content | G06F17/30 |
| 7 | Method and system for automatically ordering dishes and settling account | G06Q30/02 |
| 8 | Facial action unit strength estimation-based expression analysis method | G06K9/00 |
| 8 | Spatial data matching method based on machine learning | G06F17/30 |
| 8 | Method for quickly sorting electroencephalograph signal based on threshold analysis | G06F3/01 |
| 8 | Intelligent analysis method for components of camera scene image | G06K9/62 |
| 8 | Method and system for generating radio frequency identification data into tripping origin destination) matrix on the basis of Spark | G06F17/30 |
| 8 | Target identification method based on geometry reconstruction and multi-scale analysis | G06K9/00 |
| 8 | Time sequence similarity measurement method based on self-adaptive piecewise statistical approximation | G06F17/30 |
| 8 | Judgment standard establishment method for identifying red and black time sequence through resistance method | G06K9/62 |
| 8 | Data flow abnormality detection and multiple verification method based on enhancement-type angle abnormality factor | G06F17/30 |
| 8 | Wi-Fi-based indoor personnel passive detection method | G06K9/00 |
| 9 | Systems and methods of network operation and information processing | G06F15/16 |
| 9 | Systems and methods of network operation and information processing | G06F17/30 |
| 9 | Systems and methods of network operation and information processing, including engaging users of a public-access network | G06F15/16 |
| 9 | Systems and methods of network operation and information processing, including use of unique/anonymous identifiers throughout all stages of information processing and delivery | G06F15/16 |
| 9 | Video broadcast creation method and system, access device and management device | H04L29/06 |
| 9 | System and method for realizing signaling firewall based on signaling point-free access technology | H04L29/06 |
| 9 | Network device access authentication method in network video monitoring | H04L29/06 |
| 9 | System and method for simulating an application for subsequent deployment to a device in communication with a transaction server | G06F7/00 |
| 9 | Method and system for managing personal information | G06Q30/00 |
| 9 | Method for monitoring resource utilization of server | H04L12/24 |
| 10 | Off-line engine system based on software as a service (SaaS) mode | G06F17/30 |
| 10 | System and method for providing a messaging application program interface | G06F3/00 |
| 10 | Integrated chaining process for continuous software integration and validation | G06F9/44 |
| 10 | Method for implementing configuration clause processing of policy-based network in cloud component software system | H04L29/06 |
| 10 | Method for providing a virtual execution environment on a target computer using a virtual software machine | G06F9/44 |
| 10 | Frame driving method of application construction platform | G06F9/44 |
| 10 | Internal control management system capable of applying response type shared application architecture | G06F9/44 |
| 10 | Computer flexible management construction system and interface storage and explanation method | G06F9/44 |
| 10 | Method and system for connecting words, phrases, or symbols within the content of transmitted data to URI or IP address | G06F17/30 |
| 10 | Realization method and system for device control by using HTTP interface | H04L29/08 |

Source: Created by authors
