Machine Learning survey

Purpose

The purpose of this repo is to perform a yearly survey of major machine learning conferences: extract the metadata, abstracts, and other information from all the papers and look for the topic frequencies that show up.

Disclaimer

Current product version is 0.3.0. __main__.py in the TUI folder is operational. Currently the working search models are Fuzzy, Cosine, Word2vec, Marco, and Specter.
The two search parameters that work best are title and abstract, as those have the fewest missing values. (Scraping data isn't always perfect.)
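
To see which fields are safest to search on, you can count missing values per field in a scraped file. The snippet below is a minimal sketch, assuming each paper is a dict and that a missing value shows up as an absent key, None, or a blank string. The records and the "keywords" field are hypothetical; the real scraped JSON may be shaped differently.

```python
from collections import Counter


def count_missing(papers, fields):
    """Count, per field, how many paper records lack a usable value."""
    missing = Counter({field: 0 for field in fields})
    for paper in papers:
        for field in fields:
            value = paper.get(field)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    return dict(missing)


# Hypothetical records standing in for one scraped conference file.
papers = [
    {"title": "Paper A", "abstract": "Some text", "keywords": None},
    {"title": "Paper B", "abstract": "", "keywords": "graphs"},
    {"title": "Paper C", "abstract": "More text"},
]

print(count_missing(papers, ["title", "abstract", "keywords"]))
```

Fields with the lowest counts are the ones most likely to give complete search results.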

Requirements

Python >= 3.11

Main Libraries used

numpy
pandas
rich
textual
requests
matplotlib
spacy
scikit-learn
beautifulsoup4
pyzotero

In VSCode, press CTRL + SHIFT + ~ to open a terminal. Navigate to the directory where you want to clone the repo.

Cloning and setting up environment.

Launch VSCode if that is your IDE of choice.

Project setup with Poetry

How to check Poetry installation

In your terminal, navigate to your root folder.

If poetry is not installed, do so in order to continue.

On Windows

(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -

On Linux/Mac

curl -sSL https://install.python-poetry.org | python3 -

To check whether Poetry is installed on your system, type the following into your terminal:

poetry -V

If you see a version returned, you have Poetry installed. The second command is to update Poetry if it's installed. (Always a good idea.) If not, follow this link and follow the installation commands for your system's requirements. If on Windows, we recommend the PowerShell option for easiest installation. Using pip to install Poetry will lead to problems down the road and we do not recommend that option. Poetry needs to be installed separately from your standard Python installation so it can manage your many Python installations. Note: Python 2.7 is not supported. You are more than welcome to go the pip route, but I can't guarantee your dependencies won't clash.

Environment storage

Some prefer Poetry's default method of storing environments in one location on your system. By default, environments are nested under {cache_dir}/virtualenvs.

If you want to store your virtual environment locally, set the global configuration flag below once Poetry is installed. Poetry will then look for an environment in the project's root folder before trying any global versions of the environment in the cache.

poetry config virtualenvs.in-project true

For general instructions on Poetry's functionality and commands, please read through Poetry's CLI documentation.

To create a new venv

python -m venv .venv

Or let Poetry create the env and select it for this project. Subsequent poetry commands will then use it automatically.

poetry env use python3.12

Activate the venv on Windows

.venv\scripts\activate

Mac/Linux

source .venv/bin/activate
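
If you are not sure whether the activation worked, a quick check from Python can help. This is a small sketch using only the standard library: inside a venv, sys.prefix points at the environment while sys.base_prefix points at the base interpreter. The function name is just an illustrative choice.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix is the env folder and differs from the base install.
    return sys.prefix != sys.base_prefix


print("Inside a virtual environment:", in_virtualenv())
print("Interpreter prefix:", sys.prefix)
```

If the first line prints False, the env is probably not activated in the current terminal.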

Installation with GPU

To use your GPU, or not to use your GPU. That is the question. If you're lucky enough to have a workhorse GPU on your rig, you might be inclined to use it when selecting the "Marco" and "Specter" models. To do so requires... a few extra annoying steps. Hopefully you bought into the NVIDIA hype and have one of their GPUs, as most of PyTorch's implementations are based on the NVIDIA CUDA drivers.

First order of business is to see which CUDA version your current NVIDIA driver supports.

nvidia-smi

After running the above, look on the top right for CUDA Version: xx.x. This is the maximum CUDA version you can use with your current installation. If you want to install PyTorch, you'll need to install a CUDA toolkit that is at or BELOW that max version. If you go over it... well, that's on you.
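
The version check can also be scripted. The sketch below assumes the nvidia-smi header contains the text "CUDA Version: xx.x" as described above, and that you supply the list of CUDA builds you are choosing between yourself. The sample header line and the function names are illustrative, not real driver output or part of this repo.

```python
import re


def parse_max_cuda(smi_output):
    """Extract the 'CUDA Version: xx.x' value as a (major, minor) tuple, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pick_toolkit(max_cuda, available):
    """Pick the highest available CUDA build that does not exceed max_cuda."""
    candidates = [version for version in available if version <= max_cuda]
    return max(candidates) if candidates else None


# Illustrative header line; real output comes from running nvidia-smi.
sample = "| NVIDIA-SMI xxx.xx    Driver Version: xxx.xx    CUDA Version: 12.5 |"

max_cuda = parse_max_cuda(sample)
print("Max CUDA version:", max_cuda)
print("Pick:", pick_toolkit(max_cuda, [(11, 8), (12, 6)]))
```

With a maximum of 12.5 and builds 11.8 and 12.6 on offer, the helper settles on 11.8, since 12.6 is over the limit.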

Now you'll need to head over to PyTorch's getting started page.

Go through the selections and see which align with your system. My only options were 11.8 or 12.6. Since the max CUDA version my NVIDIA driver supports is 12.5, 11.8 it is! Because Poetry is a bit extra, we'll have to add the source for whatever CUDA version fits below your GPU's current NVIDIA drivers.

poetry source add --priority=explicit pytorch-cuda "https://download.pytorch.org/whl/cu118"

After the source is added, you should see something like this in your pyproject.toml file.

[[tool.poetry.source]]
name = "pytorch-cuda"
url = "https://download.pytorch.org/whl/cu118"
priority = "explicit"

Now you can install the specific versions of what you'll need to run SBert models on your GPU. In my case, these were the available versions from the 11.8 CUDA Toolkit.

poetry add torch==2.7.0+cu118 torchaudio==2.7.0+cu118 torchvision==0.22.0+cu118 --source pytorch-cuda
poetry add sentence-transformers
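
Once the install finishes, it is worth confirming that PyTorch can actually see the GPU. The sketch below is one way to do that; it falls back to "cpu" when torch is not installed or no CUDA device is visible. The function name is hypothetical and not part of this repo.

```python
def pick_device():
    """Return "cuda" when PyTorch is installed and sees a GPU, otherwise "cpu"."""
    try:
        import torch  # Imported lazily so the check also runs without torch.
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Models would run on:", pick_device())
```

If this prints cpu on a machine with an NVIDIA card, the CUDA build of torch probably does not match what your driver supports.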

Installation without GPU

Before you run the commands below, go into the pyproject.toml file and delete lines 23-25 and 34-44. Then update the lock file first and install the libraries:

poetry lock
poetry install --no-root

This will read from the pyproject.toml file that is included in this repo and install all necessary package versions. Should other versions be needed, the pyproject.toml file will be utilized and packages updated according to your system requirements. To view the current libraries installed

poetry show

To view only top level library requirements

poetry show -T

File Setup

While in root directory run commands below

$ mkdir -p data/logs/scrape data/logs/tui
$ mkdir -p data/searches data/models/Marco data/models/specter
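
If the shell commands give you trouble (for example on Windows), the same folders can be created from Python. This is a minimal cross-platform sketch using pathlib; the folder names are taken from the commands above, and the function name is only illustrative.

```python
from pathlib import Path

# Folder layout taken from the mkdir commands in this section.
SUBDIRS = [
    "data/logs/scrape",
    "data/logs/tui",
    "data/searches",
    "data/models/Marco",
    "data/models/specter",
]


def make_project_dirs(root="."):
    """Create the project folders under root, including any missing parents."""
    created = []
    for sub in SUBDIRS:
        path = Path(root) / sub
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling make_project_dirs() from the root directory of the repo creates everything in one go, and running it twice does no harm.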

Model setup

If you'd like to use word2vec to do your asymmetric semantic search, you'll need to do a few things before starting. With your environment activated, type the following in your terminal. This should install the model in your activated environment. You can check by looking for something like en_core_web_md-3.8.0.... in your .venv/Lib/site-packages folder.

python -m spacy download en_core_web_md
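
Instead of digging through site-packages, you can also ask Python whether the model is importable. spaCy models are installed as regular Python packages, so the sketch below only needs the standard library. The function name is an illustrative choice.

```python
import importlib.util


def is_package_installed(name):
    """Return True when a top-level package with this name can be imported."""
    return importlib.util.find_spec(name) is not None


print("en_core_web_md installed:", is_package_installed("en_core_web_md"))
```

Run it with the environment activated, since the check only looks at the interpreter that executes it.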

TUI

This repo also comes with a TUI (Terminal User Interface) that allows you to explore the JSON objects for each conference / year. This repo was forked from here and updated with a ScrollableContainer on the right panel instead of the previous output. Thank you to oleksis for creating the initial structure!! 🎉

To run the TUI with poetry

poetry run python tui/__main__.py data/scraped/2024_ICML.json 
#replace year/conf

With python

python tui/__main__.py data/scraped/2024_ICML.json 
#replace year/conf

With no file args, like a madman. This will launch a file-picking application that scans the data/conferences folder and shows you a list of available files. Enter the number of the conference you want, and you're good to go.

poetry run python tui/__main__.py

python tui/__main__.py
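
For a sense of how a numbered file picker like this can work, here is a rough sketch. It is not the code in tui/__main__.py; the function names and the assumption that the files end in .json are mine.

```python
from pathlib import Path


def list_json_files(folder):
    """Return the JSON files in folder, sorted by name."""
    return sorted(Path(folder).glob("*.json"))


def choose_file(files, choice):
    """Map a 1-based menu number (as typed by the user) to a file."""
    index = int(choice) - 1
    if not 0 <= index < len(files):
        raise ValueError(f"Pick a number between 1 and {len(files)}")
    return files[index]


def show_menu(files):
    """Print a numbered list of the available files."""
    for number, path in enumerate(files, start=1):
        print(f"{number}. {path.name}")
```

A caller would list the files, show the menu, read a number with input(), and pass it to choose_file.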

Runtime Notes

Search with word2vec takes longer to run. Patience, Iago.
Fuzzy search on abstract will take even longer.

Suggested operation ranges

Fuzzy => 1 to 10
- Best results around 5
Cosine => -1 to 1
- Best results around 0.40
Word2vec => -1 to 1
- Best results around 0.85
Marco => -1 to 1
- Best results around 0.85
Specter => -1 to 1
- Best results around 0.85
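
To get a feel for what a similarity threshold does, here is a small standalone sketch of cosine similarity over raw word counts, filtered at the 0.40 value suggested above. It uses only the standard library and is not the repo's Cosine model, which may vectorize text differently. Note that with word counts the score stays between 0 and 1, since counts are never negative. The titles are made up.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw word counts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    sq_a = sum(count * count for count in vec_a.values())
    sq_b = sum(count * count for count in vec_b.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


def search(query, titles, threshold=0.40):
    """Return (score, title) pairs at or above the threshold, best first."""
    scored = [(cosine_similarity(query, title), title) for title in titles]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Made-up titles for illustration.
titles = [
    "graph neural networks",
    "diffusion models for images",
    "neural networks",
]

for score, title in search("graph neural networks", titles):
    print(f"{score:.2f}  {title}")
```

Raising the threshold trims the list to closer matches, and lowering it lets in looser ones.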

With the TUI running, it should look something like this.

paper_search.mp4

Project Todo list

Search Tab - Modeling paths

[x] - Implement SBert Model
[x] - Install CUDA toolkit to use GPU
[x] - Update instructions on how to do that

Additional tabs to add

arXiv
medRxiv
bioRxiv
Local Zotero search

arXiv Tab

Need basic search here.

Functionality
- Make a query search
- Add data to searched datasets as JSON

medRxiv / bioRxiv Tab

Workflow

Unsure at the current moment
Likely a checkbox that can switch between all 3 sources (arXiv, medRxiv, bioRxiv)?
I think they have different inputs

Zotero Tab

Possible Features

LLM Paper Summarization

I bet gemini would be a good free use case

Clustering Tab
Gemma / BERT embedding
- t-SNE
- Look at the first two components to find the subject topics that vary the most.
Tom Arnold idea: build a graph network from the citations of each paper.

Analyze who is getting cited most often and driving a particular area of research. Really like this!!
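
As a starting point for the citation idea, counting incoming citations is enough to see who is cited most. The sketch below assumes citations are available as (citing, cited) pairs, which is my assumption and not something the scraper produces today. The paper IDs are placeholders.

```python
from collections import Counter


def most_cited(citations, top_n=3):
    """Rank papers by how many times they appear as the cited side of an edge."""
    counts = Counter(cited for _citing, cited in citations)
    return counts.most_common(top_n)


# Placeholder edges: (citing paper, cited paper).
edges = [
    ("P1", "P3"),
    ("P2", "P3"),
    ("P4", "P3"),
    ("P4", "P1"),
    ("P2", "P1"),
    ("P1", "P2"),
]

print(most_cited(edges, top_n=2))
```

A fuller version could load the same edges into a graph library and look at centrality measures instead of plain counts.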

Landcruiser87/paper_search

Folders and files

Latest commit

History

Repository files navigation

Machine Learning survey

Purpose

Disclaimer

Requirements

Main Libraries used

Cloning and setting up environment.

Project setup with Poetry

How to check Poetry installation

Environment storage

Installation with GPU

Installation without GPU

File Setup

Model setup

TUI

Runtime Notes

Project Todo list

Search Tab - Modeling paths

Additional Tab's to add

Arxiv Tab

medarxiv / bioarxiv Tab

Zotero Tab

Possible Features

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages