
HPC uncovers how patent precision affects innovation

Using DeiC Interactive HPC, Assistant Professor Marek Giebel from CBS analysed millions of patent documents to uncover how imbalances in patent drafting can affect future innovation.
10/12/2024 13:12
Image: A person sitting in front of a computer (Jakob Boserup)

Assistant Professor Marek Giebel from the Department of Economics at Copenhagen Business School has used DeiC Interactive HPC to conduct large-scale, multifaceted analyses of patent documents, uncovering how patent breadth and descriptive precision can significantly impact future innovation:

“Using DeiC Interactive HPC has been a game changer for this project. The processing power allowed us to perform multiple types of in-depth analysis, from large-scale text processing to complex statistical modeling, and this enabled us to identify patterns and insights in the data material that we otherwise wouldn’t have found.”

His findings suggest that patents risk creating barriers to further technological development when the scope of protection is out of balance with the described invention, or when they are unclearly written. This is a critical insight for researchers and practitioners alike:

“Our findings are important for researchers and practitioners because they show how crucial it is to balance the patent description with the scope of protection. Our results suggest that even small imbalances in the initial application can have a big impact and can potentially affect innovation.”

The analysis was based on a dataset of over 4.5 million patents and related applications from the US Patent and Trademark Office between 2001 and 2022. This deep dive involved processing massive text files, sometimes up to 47 GB per year, which required computational resources far beyond what standard systems can offer. This is where DeiC Interactive HPC became essential.

Facts

The main part of the project was carried out in the second half of 2023 and the first half of 2024. The first tests and data preparations started in 2022/2023.

In total, Marek used 89K core hours for computation, which covered all aspects of the project, including testing, failed runs, and successful analyses.

A smooth process and great support

Marek Giebel learned about DeiC Interactive HPC through internal information channels at CBS, such as email announcements and the university’s research data management web pages. Despite having prior experience using remote systems for data processing, this was his first time using HPC. The support staff at CBS’s library and research data management team proved instrumental in guiding him through the HPC application process and providing support for resource use:

"The process of applying for resources felt relatively smooth," he recalls. The application required a project title, description, and an outline of computational requirements. Determining the exact resources, he would need was a learning process, but with help from CBS support staff and resources like the UCloud User Guide, he was able to identify the right setup. CBS support also assisted with renewing software licenses and securing additional resources when needed, making it easier for him to continue his work uninterrupted.

Decoding Patent Language with NLP and Statistics

Marek’s workflow spanned multiple stages, from text processing and NLP in Python to statistical modeling in Stata and R, allowing him to identify and analyze patterns in the patent data.

His key tasks included:

  • Data Cleaning and Pre-processing: Marek first extracted raw text, for example from XML files and USPTO PatentsView text data deliveries, and organized it into distinct folders based on text parts, such as patent descriptions and claims. He used Python to adjust the format of the text files so that the content to be analyzed was stored in a single column, which makes processing more efficient.
  • Text Analysis: Marek conducted simple counts of words and sentences, assessed sentence length, and compared similarities within and across documents. Using Python, he applied further Natural Language Processing (NLP) techniques, including word tokenization and classification of word types, to gain deeper insights into the language used in patents. He also applied readability measures to assess the complexity and grade level of the texts. These measures help evaluate the accessibility and comprehensibility of the patent documents.
  • Statistical Analysis: To handle the large datasets and extract patterns, Marek used Stata and R for statistical analysis. The programs allowed him to perform complex analyses, such as finding correlations, examining causal relationships and identifying trends across the data, providing valuable insights into how different aspects of patents are related.
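The text-analysis measures named in the list above can be illustrated with a small, standard-library-only Python sketch. It is not Marek's actual code: the function names are hypothetical, the sentence splitter and syllable counter are rough heuristics, cosine similarity over word counts stands in for whichever similarity measure the project used, and the Flesch-Kincaid grade level is just one common readability measure that may or may not be among those applied.

```python
import re
from collections import Counter
from math import sqrt


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    # Lowercased alphabetic tokens, keeping simple apostrophes ("doesn't").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def count_syllables(word):
    # Rough heuristic: one syllable per group of consecutive vowels, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def basic_metrics(text):
    # Simple counts of words and sentences plus average sentence length.
    sentences = split_sentences(text)
    words = tokenize(text)
    avg = len(words) / len(sentences) if sentences else 0.0
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_sentence_length": avg,
    }


def flesch_kincaid_grade(text):
    # Flesch-Kincaid grade level:
    # 0.39 * words/sentences + 11.8 * syllables/words - 15.59
    sentences = split_sentences(text)
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def cosine_similarity(text_a, text_b):
    # Cosine similarity between bag-of-words count vectors of two texts.
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Comparing a patent's claims with its description via `cosine_similarity(claims_text, description_text)` is one simple way to quantify how closely the two parts track each other. A production pipeline would more likely rely on an established NLP library for tokenization and word-type classification.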

See detailed workflow here

Given the scale and complexity of the data, the project required significant computational resources. Marek used DeiC Interactive HPC machines with up to 384 GB of memory and 64 CPU cores (u1-standard-64). This allowed him to work with text files as large as 20 GB and conduct detailed comparisons of extensive text segments. He also used parallel processing, either by running multiple instances of Stata and Python or by using the respective parallel-processing libraries.
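As an illustration of library-based parallel processing in Python, the sketch below spreads a per-document computation across several worker processes using the standard library's `concurrent.futures`. The article does not say which library Marek used, so this choice is an assumption, as are the function names, the default of 4 workers and the chunk size.

```python
from concurrent.futures import ProcessPoolExecutor


def count_words(text):
    # Per-document work; stands in for any CPU-bound text measure.
    return len(text.split())


def count_words_parallel(texts, max_workers=4, chunksize=256):
    # Distribute documents over worker processes. Results come back in
    # input order. chunksize batches documents per task to reduce
    # inter-process communication overhead.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(count_words, texts, chunksize=chunksize))
```

When run as a script, the call to `count_words_parallel` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. On a 64-core machine, `max_workers` could be raised towards the core count, as long as each worker's memory footprint fits within the available RAM.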

A learning curve: technical challenges and solutions

The work was not without its challenges. Hard disk performance proved to be a bottleneck when opening and saving large files, and excessive memory usage sometimes caused the machine to shut down mid-process, leading to lost progress. To mitigate these issues, Marek determined the appropriate memory and time requirements for each job through trial and error, adjusting his approach to optimize the processing of large datasets.
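One common way to limit both problems, which is not necessarily the approach Marek took, is to stream a large file in bounded batches and write results out as each batch finishes. Memory use then depends on the batch size instead of the file size, and a crash loses at most one batch of work. The sketch below assumes a text file with one document per line; the function names, output columns and batch size are hypothetical.

```python
import csv
from itertools import islice


def iter_batches(items, batch_size):
    # Yield successive lists of at most batch_size items from any iterable.
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def summarise_file(in_path, out_path, batch_size=10_000):
    # Stream in_path line by line, writing one (line_no, n_words) row per
    # document. Only one batch of lines is held in memory at a time.
    line_no = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["line_no", "n_words"])
        for batch in iter_batches(src, batch_size):
            for line in batch:
                line_no += 1
                writer.writerow([line_no, len(line.split())])
            dst.flush()  # push finished rows to disk after each batch
    return line_no
```

The batch size is the tuning knob here: larger batches mean fewer flushes to a slow disk, while smaller batches keep peak memory lower.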

Reflecting on his experience, he notes the following:

“Conducting this project was a great opportunity to learn new and improve existing skills. If I repeat the work, I will attempt to make the process more efficient from the beginning. This includes the programs and libraries within the programs I use. For example, instead of beginning to perform the data cleaning with one program I am already used to (e.g., Stata and R) and switching to another (Python) for specific tasks and the analysis, I would directly start to conduct the whole process in Python.”

Future steps: the possibility of upscaling

As he continues his research, Marek remains open to scaling up his computational resources to accommodate future analyses. His next steps may involve including patent cohorts from before 2001 and examining additional aspects of patent documents, such as figures, which would further broaden his findings.

“To upscale in the future, it would be great to get further information about the possibilities and ways to obtain these improved computation resources. For example, knowing how to apply and where would be great. In that respect, it would also be interesting to know how competitive access to these resources is. Moreover, having information about these opportunities and practical assistance in setting up the systems and performing the necessary tasks/jobs would be great. This includes information about how to increase the efficiency of text and data processing from a technical perspective.”

RDM Support Team at CBS

The RDM Support team at CBS Library is the central resource for CBS researchers and students seeking expertise in high-performance computing (HPC) and research data management (RDM). Our HPC support encompasses advising on and allocating resources, addressing technical challenges, assisting with code development, and providing teaching, documentation, and tutorials. In terms of RDM, we guide researchers on complying with funder and publisher requirements, writing data management plans, navigating legal considerations, and selecting IT infrastructures that best suit their data management needs.

Want to get started using HPC?

Learn more about our HPC resources here, or find out how to apply for HPC resources here.