Experiments and how-to guide for the lecture "Large language models for Scientometrics"
(Image credit: Davide Bonazzi / Discover Magazine)
Large Language Models:
The capability of Large Language Models (LLMs) to process data from different modalities and excel at tasks ranging from information extraction, question answering, math, and coding to, more recently, reasoning shows the potential of this technology. Intuitively, the complexity of training these models on different datasets/data mixes, making different architectural choices, and choosing different alignment strategies [1] might seem to suggest picking a specific model for each task, but LLMs are geared towards being general task solvers.
Customized MLRC Data built from [2].
For this study we are going to test out three use-cases: Labelling, Information Extraction, and LLM-as-a-Judge. We are going to use the dataset from the paper Laying Foundations to Quantify the "Effort of Reproducibility" [2]. The dataset and the tasks outline a good experimentation framework for effectively utilizing large language models for computational social science tasks [3].
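As a minimal sketch of the Labelling use-case, the snippet below prompts an instruction-tuned model to label a single record. The model name, label set, and prompt wording are illustrative assumptions and not the exact setup used in the lecture.

```python
from transformers import pipeline

# Hypothetical labelling sketch: model, label set, and prompt are placeholders.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

paper = {"title": "...", "abstract": "..."}  # a single record from the dataset [2]

prompt = (
    "You are annotating papers for a scientometrics study.\n"
    f"Title: {paper['title']}\n"
    f"Abstract: {paper['abstract']}\n"
    "Label the expected effort of reproducibility as one of: low, medium, high.\n"
    "Answer with a single label."
)

# Greedy decoding keeps the label deterministic for annotation-style tasks.
output = generator(prompt, max_new_tokens=8, do_sample=False, return_full_text=False)
print(output[0]["generated_text"])
```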
For this study we are going to use the Reproducibility dataset from the paper Laying Foundations to Quantify the "Effort of Reproducibility" [2] to preference-tune answers using the Direct Preference Optimization (DPO) algorithm [4]. Unlike other reinforcement learning algorithms, DPO directly applies maximum likelihood on the preference dataset to perform implicit reward modeling. Ideally, as in most RL algorithms, we would still be maximizing reward subject to a KL-divergence constraint. Theoretically, however, DPO is RL-free, reducing preference tuning to a simple classification problem on the dataset, where:

- $r^+$ is the human-preferred response and $r^-$ is the rejected response,
- the two models involved are $\pi_{LLMSciSci}$ and $\pi_{LLM-instruct}$,
- $r_{\theta}$ captures the log-probability of the chosen vs. rejected responses on $D_{ReproEffortDataset}$,
- $\pi_{LLM-instruct}$ is the instruct-tuned open-weight reference model,
- $\pi_{LLMSciSci}$ is the final model, intended to be preference-tuned on $D_{ReproEffortDataset}$.
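Written with these symbols, the DPO objective from [4] takes the form of a simple binary classification loss over the preference pairs in $D_{ReproEffortDataset}$, with $\beta$ controlling the strength of the implicit KL constraint toward the reference model:

$$
\mathcal{L}_{DPO}\left(\pi_{LLMSciSci};\, \pi_{LLM-instruct}\right) =
-\,\mathbb{E}_{(x,\, r^+,\, r^-)\, \sim\, D_{ReproEffortDataset}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_{LLMSciSci}(r^+ \mid x)}{\pi_{LLM-instruct}(r^+ \mid x)}
- \beta \log \frac{\pi_{LLMSciSci}(r^- \mid x)}{\pi_{LLM-instruct}(r^- \mid x)}
\right) \right]
$$

The term inside $\sigma(\cdot)$ is the implicit reward margin $r_{\theta}(x, r^+) - r_{\theta}(x, r^-)$, which is why no separate reward model has to be trained.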
For this study we are going to use the Reproducibility dataset from the paper Laying Foundations to Quantify the "Effort of Reproducibility" [2] to optimize policy gradients using the Group Relative Policy Optimization (GRPO) algorithm. GRPO is an online learning algorithm in which the model uses its own generated completions to learn how to maximize advantages, getting better at generating completions at every step. Learn more about GRPO from the original paper [5]. We experiment with the following reward functions (a minimal sketch follows the list):
- Format rewards
- Label rewards
- Stepwise rewards
- Hamming loss correctness reward
- Conditional reasoning trace length reward
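The exact reward definitions are not spelled out above, so the following is only a sketch of how a format reward, the Hamming loss correctness reward, and GRPO's group-relative advantages [5] could be implemented. The `<think>/<answer>` output template and the multi-label setup are assumptions, not something prescribed by the dataset.

```python
import re
import statistics

def format_reward(completion: str) -> float:
    """Format reward: 1.0 if the completion matches an assumed
    <think>...</think><answer>...</answer> output template, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def hamming_correctness_reward(predicted, gold, label_space) -> float:
    """Hamming loss correctness reward for a multi-label prediction:
    1 minus the fraction of labels whose presence/absence is predicted wrongly."""
    mismatches = sum((lbl in predicted) != (lbl in gold) for lbl in label_space)
    return 1.0 - mismatches / len(label_space)

def group_relative_advantages(rewards):
    """GRPO-style advantages [5]: standardize each completion's reward
    against the mean and standard deviation of its own sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: group-relative advantages for one prompt's sampled completions,
# assuming each completion's total reward has already been computed.
rewards = [2.0, 0.4, 1.1, 0.0]
print(group_relative_advantages(rewards))
```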
[1] A Survey of Large Language Models
[2] Laying Foundations to Quantify the “Effort of Reproducibility”
[3] Can Large Language Models Transform Computational Social Science?
[4] Direct Preference Optimization: Your Language Model is Secretly a Reward Model
[5] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Akhil Pandey. Want to contribute and see your name here? :) Open an issue!
The computing resources for this work are supported entirely by the Google Cloud Research Credits Grant 331845891.