Experiments and how-to guide for the lecture "Large language models for Scientometrics"
(Image credit: Davide Bonazzi / Discover Magazine)
Large Language Models:
The capability of Large Language Models (LLMs) to process data from different modalities and excel at tasks ranging from information extraction, question answering, math, and coding to, more recently, reasoning shows the potential of this technology. Intuitively, the complexity of training these models on different datasets/data mixes, making different architectural choices, and choosing different alignment strategies [1] might seem to suggest picking a specific model for each task, but LLMs are geared towards being general task solvers.
Customized MLRC Data built from [2].
For this study we are going to test out three use-cases: Labelling, Information Extraction, and LLM-as-a-Judge. We are going to use the dataset from the paper Laying Foundations to Quantify the "Effort of Reproducibility" [2]. The dataset and the tasks outline a good experimentation framework for effectively utilizing large language models for computational social science tasks [3].
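As a minimal sketch of the Labelling use-case, the snippet below prompts an instruction-tuned model to label a single record. The model name, label set, and prompt wording are illustrative assumptions and not the exact setup used in the lecture.

```python
from transformers import pipeline

# Hypothetical labelling sketch: model, label set, and prompt are placeholders.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

paper = {"title": "...", "abstract": "..."}  # a single record from the dataset [2]

prompt = (
    "You are annotating papers for a scientometrics study.\n"
    f"Title: {paper['title']}\n"
    f"Abstract: {paper['abstract']}\n"
    "Label the expected effort of reproducibility as one of: low, medium, high.\n"
    "Answer with a single label."
)

# Greedy decoding keeps the label deterministic for annotation-style tasks.
output = generator(prompt, max_new_tokens=8, do_sample=False, return_full_text=False)
print(output[0]["generated_text"])
```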
For this study we are going to use the Reproducibility dataset from the paper Laying Foundations to Quantify the "Effort of Reproducibility" [2] to preference-tune answers using the Direct Preference Optimization (DPO) algorithm [4]. Unlike other reinforcement learning algorithms, DPO directly applies maximum likelihood on the preference dataset to perform implicit reward modeling. Ideally, as in most RL algorithms, we would still be maximizing reward subject to a KL-divergence constraint. Theoretically, however, DPO is RL-free, reducing preference tuning to a simple classification problem on the dataset, where:

- $r^+$ is the human-preferred response and $r^-$ is the rejected response,
- the two models involved are $\pi_{LLMSciSci}$ and $\pi_{LLM-instruct}$,
- $r_{\theta}$ captures the log-probability of the chosen vs. rejected responses on $D_{ReproEffortDataset}$,
- $\pi_{LLM-instruct}$ is the instruct-tuned open-weight reference model,
- $\pi_{LLMSciSci}$ is the final model, intended to be preference-tuned on $D_{ReproEffortDataset}$.
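Written with these symbols, the DPO objective from [4] takes the form of a simple binary classification loss over the preference pairs in $D_{ReproEffortDataset}$, with $\beta$ controlling the strength of the implicit KL constraint toward the reference model:

$$
\mathcal{L}_{DPO}\left(\pi_{LLMSciSci};\, \pi_{LLM-instruct}\right) =
-\,\mathbb{E}_{(x,\, r^+,\, r^-)\, \sim\, D_{ReproEffortDataset}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_{LLMSciSci}(r^+ \mid x)}{\pi_{LLM-instruct}(r^+ \mid x)}
- \beta \log \frac{\pi_{LLMSciSci}(r^- \mid x)}{\pi_{LLM-instruct}(r^- \mid x)}
\right) \right]
$$

The term inside $\sigma(\cdot)$ is the implicit reward margin $r_{\theta}(x, r^+) - r_{\theta}(x, r^-)$, which is why no separate reward model has to be trained.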
For this study we are going to use the Reproducibility dataset from the paper Laying Foundations to Quantify the "Effort of Reproducibility" [2] to optimize policy gradients using the Group Relative Policy Optimization (GRPO) algorithm. GRPO is an online learning algorithm in which the model uses its own generated completions to learn how to maximize advantages, getting better at generating completions at every step. Learn more about GRPO from the original paper [5]. We experiment with the following reward functions (a minimal sketch follows the list):
- Format rewards
- Label rewards
- Stepwise rewards
- Hamming loss correctness reward
- Conditional reasoning trace length reward
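The exact reward definitions are not spelled out above, so the following is only a sketch of how a format reward, the Hamming loss correctness reward, and GRPO's group-relative advantages [5] could be implemented. The `<think>/<answer>` output template and the multi-label setup are assumptions, not something prescribed by the dataset.

```python
import re
import statistics

def format_reward(completion: str) -> float:
    """Format reward: 1.0 if the completion matches an assumed
    <think>...</think><answer>...</answer> output template, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def hamming_correctness_reward(predicted, gold, label_space) -> float:
    """Hamming loss correctness reward for a multi-label prediction:
    1 minus the fraction of labels whose presence/absence is predicted wrongly."""
    mismatches = sum((lbl in predicted) != (lbl in gold) for lbl in label_space)
    return 1.0 - mismatches / len(label_space)

def group_relative_advantages(rewards):
    """GRPO-style advantages [5]: standardize each completion's reward
    against the mean and standard deviation of its own sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: group-relative advantages for one prompt's sampled completions,
# assuming each completion's total reward has already been computed.
rewards = [2.0, 0.4, 1.1, 0.0]
print(group_relative_advantages(rewards))
```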
[1] A Survey of Large Language Models
[2] Laying Foundations to Quantify the “Effort of Reproducibility”
[3] Can Large Language Models Transform Computational Social Science?
[4] Direct Preference Optimization: Your Language Model is Secretly a Reward Model
[5] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Akhil Pandey. Want to contribute and see your name here? :) Open an issue!
The computing resources for this work are supported entirely by the Google Cloud Research Credits Grant 331845891.