About the project
The project began in 2022. Approximately 36K core hours were used in the initial phase. In December 2024, 26K core hours were awarded through the national DeiC call for the next phase of the project, which runs from January to December 2025.
“The project was challenging due to the volume of unstructured text data spanning both Danish and English. Without structured tags or categories, these job postings required substantial data processing, far beyond what could be handled by a standard computer, to become useful.” Rentian Zhu, PhD fellow, Department of Economics, CBS.
To address these complexities, Rentian used DeiC Interactive HPC to process and analyze the data. Using large language models such as BERT and GPT-3.5/4, he extracted, classified, and quantified skill demands across industries and job roles.
RDM Support Team at CBS
The RDM Support team at CBS Library is the central resource for CBS researchers and students seeking expertise in high-performance computing (HPC) and research data management (RDM). The HPC support encompasses advising on and allocating resources, addressing technical challenges, assisting with code development, and providing teaching, documentation, and tutorials. In terms of RDM, the team guides researchers on complying with funder and publisher requirements, writing data management plans, navigating legal considerations, and selecting IT infrastructures that best suit their data management needs.
Rentian first became aware of DeiC Interactive through a former colleague:
“I became aware of DeiC Interactive through a former colleague, an NLP enthusiast who highly recommended it,” Rentian says. “The process of resource allocation was incredibly smooth, thanks to the outstanding support from Kristoffer and Lars (CBS FO support, ed.). They were not only extremely helpful but also kind and patient, ensuring that all my questions were addressed and that I could efficiently set up and utilize the resources for my research.”
Building a Data Pipeline for Skill Extraction
To process and analyze the vast amount of data, Rentian developed a robust data pipeline. Key steps included:
- Data Preprocessing: Normalizing the text data, which includes removing inconsistencies, identifying the language (Danish or English), and tokenizing the text into manageable pieces. Tokenization gives LLMs a structured foundation that enables them to interpret the text accurately.
- Skill Extraction: Using the language model BERT to analyze and extract key features from the text, then improving skill recognition accuracy with GPT-3.5 and GPT-4 through specialized prompt strategies tailored to labor market terminology. These strategies help the models better understand nuanced skill requirements.
- Skill Categorization: Creating a hierarchical classification of skills by integrating outputs from the BERT and GPT models using LangChain (a library that helps coordinate the flow of information between different AI models). This classification makes it possible to analyze patterns across industries and job roles more effectively.
- Linking Skills to Firm Metrics: To analyze the association between aggregated skill categories at the firm level and performance metrics such as revenue growth and profitability, Rentian applied panel data models, including fixed effects. With these models, he and his supervisors uncovered how categorized workforce skills are associated with organizational outcomes, shedding light on labor-driven factors that influence profitability.
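To make the fixed-effects idea in the last step concrete, here is a minimal sketch of the within (demeaning) estimator for a single regressor. It is not the specification used in the project: the function name `within_estimator`, the toy panel, and the meaning attached to each column are hypothetical, and a real analysis would rely on an econometrics package rather than hand-written code.

```python
from collections import defaultdict


def within_estimator(rows):
    """Fixed-effects slope of y on x for rows of (firm_id, x, y).

    Each firm's own means are subtracted first, so any constant
    firm-specific level (the "fixed effect") drops out of the estimate.
    """
    by_firm = defaultdict(list)
    for firm, x, y in rows:
        by_firm[firm].append((x, y))

    numerator = 0.0
    denominator = 0.0
    for observations in by_firm.values():
        x_mean = sum(x for x, _ in observations) / len(observations)
        y_mean = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            numerator += (x - x_mean) * (y - y_mean)
            denominator += (x - x_mean) ** 2
    return numerator / denominator


# Hypothetical toy panel: (firm, share of postings asking for a skill, revenue growth)
panel = [
    ("A", 0.1, 1.2), ("A", 0.2, 1.4), ("A", 0.3, 1.6),
    ("B", 0.5, 0.5), ("B", 0.6, 0.7), ("B", 0.7, 0.9),
]
beta = within_estimator(panel)
```

In this toy panel, growth rises by 0.2 for every 0.1 increase in the skill share within each firm, so `beta` comes out at approximately 2.0, even though firm B has a higher skill share and lower growth overall. Removing such level differences between firms is the purpose of the fixed effects.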
Optimizing Computational Resources for Complex Data Analysis
To meet the heavy computational demands of his project, Rentian distributed tasks strategically across GPUs and CPUs, making the most of DeiC Interactive HPC’s computing power.
Fine-Tuning BERT on Job Postings
One major task involved fine-tuning BERT to classify skills from over 3 million job postings. This required significant GPU power, which Rentian optimized through parallel processing across several GPUs. By batching data and using techniques like gradient accumulation, he managed to process large amounts of data without exceeding hardware limits. Meanwhile, CPUs handled data preparation and orchestration, ensuring smooth workflow coordination.
Code Example 1: Showcasing how GPUs were utilized for BERT fine-tuning
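The original listing is not reproduced in this text. As a stand-in, the sketch below illustrates the general idea of gradient accumulation on a toy one-parameter model in plain Python. It is not Rentian's code: all names and data are hypothetical, and actual BERT fine-tuning would use a deep-learning framework running on GPUs.

```python
def grad_sum(w, batch):
    """Sum of d/dw (w * x - y)^2 over the (x, y) pairs in `batch`."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def train_step(w, batch, lr, micro_batch_size):
    """One parameter update built from several small micro-batches.

    Only one micro-batch is processed at a time, which is what keeps
    memory use low; the update itself uses the gradient of the full batch.
    """
    accumulated = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro_batch = batch[start:start + micro_batch_size]
        accumulated += grad_sum(w, micro_batch)  # accumulate, no update yet
    mean_grad = accumulated / len(batch)
    return w - lr * mean_grad  # single update for the whole batch


# Hypothetical toy data following y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w_accumulated = train_step(0.0, data, lr=0.01, micro_batch_size=2)
w_full_batch = train_step(0.0, data, lr=0.01, micro_batch_size=len(data))
```

Both calls produce the same updated weight, which is the point of the technique: the effective batch size stays large while each individual forward and backward pass only handles a small slice of the data.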
Prompt Engineering for GPT Models
In addition to fine-tuning BERT, Rentian employed GPT-3.5 and GPT-4 for hierarchical skill classification. Using advanced prompt engineering techniques, he guided the models to provide accurate and structured outputs. These techniques included:
- Chain of Thought Reasoning: Guiding the model to break down its decision-making process step by step.
- Few-Shot Learning: Including curated examples within prompts to demonstrate the desired output format and improve clarity.
Code Example 2: Sample prompt used in GPT classification
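The original prompt is not reproduced in this text. The sketch below shows one way a prompt combining the two techniques could be assembled; it is an illustration under assumptions, not the prompt used in the project. The helper `build_prompt`, the example postings, and the category labels are all hypothetical, and no model is called.

```python
# Hypothetical few-shot examples: (posting text, desired answer format).
FEW_SHOT_EXAMPLES = [
    (
        "Experience with Python and SQL is required.",
        "Python -> Technical > Programming; SQL -> Technical > Databases",
    ),
    (
        "Du har erfaring med projektledelse.",
        "projektledelse -> Management > Project management",
    ),
]


def build_prompt(posting_text, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt with a step-by-step instruction."""
    lines = [
        "You classify skills mentioned in job adverts into a two-level hierarchy.",
        # Chain-of-thought style instruction: ask for intermediate steps.
        "Think step by step: first list the skills you find, "
        "then assign each skill a category.",
        "",
    ]
    # Few-shot part: show the desired input and output format.
    for text, answer in examples:
        lines += [f"Posting: {text}", f"Answer: {answer}", ""]
    # The new posting goes last, leaving the answer for the model to complete.
    lines += [f"Posting: {posting_text}", "Answer:"]
    return "\n".join(lines)


prompt = build_prompt("We are looking for a team player with strong Excel skills.")
```

The resulting string would then be sent to the model through whichever client or orchestration library is in use. Keeping the examples in a separate list makes it easy to swap in curated examples for a particular industry or language.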
By strategically framing prompts, Rentian achieved precise and reliable results without altering the models’ underlying parameters.
Advanced Data Management
Efficient data management was critical to streamlining Rentian’s workflow. He used the Parquet format to compress and structure data, reducing 47 GB of raw data to 25 GB after preprocessing. Distributed computing allowed tasks to be split across multiple processors, accelerating execution and ensuring the entire pipeline ran efficiently.
Moving forward: securing resources from the national call
Building on this foundational work, Rentian, now a PhD student, continues to broaden the scope of his research. In December 2024 he was awarded 26K core hours through DeiC's national call, enabling him to combine his extensive data with registry and time-use data, and to fine-tune the language models for more complex workflows. The expanded setup will not only allow him to deepen his investigation into flexible work arrangements and their broader economic and social effects but also improve efficiency and reduce processing time compared to standard computing environments.
"With the additional national HPC resources, I can now venture into more detailed hypotheses, incorporate a richer variety of data, and apply more advanced modeling techniques. This will help me probe the mechanisms that shape work patterns and skill demands and hopefully lead to a more rigorous scientific understanding of Denmark’s evolving labor market."
Want to get started using HPC?
Learn more about our HPC resources here, or find out how to apply for HPC resources here.