The purpose of this repo is to perform a yearly survey of major machine learning conferences: extract the metadata, abstracts, and other information from all the papers and look for the topics that show up most frequently.
The current product version is 0.3.0. __main__.py in the TUI folder is operational. The working search models are currently Fuzzy, Cosine, Word2vec, Marco, and Specter.
The two search parameters that work best are title and abstract, as those have the fewest missing values. (Scraped data isn't always perfect.)
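To see which fields are safest to search on, you can count the missing values per field. This is a minimal sketch, not the repo's own code: it assumes each paper is a dict, and the records and the `keywords` key below are hypothetical stand-ins for whatever the scraper actually produced.

```python
def missing_counts(records, fields):
    """Count, per field, how many records lack a usable value."""
    missing = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    return missing


# Hypothetical records; real scraped files may use different keys.
papers = [
    {"title": "Paper A", "abstract": "Some text", "keywords": None},
    {"title": "Paper B", "abstract": "", "keywords": "rl"},
    {"title": "Paper C", "abstract": "More text"},
]

print(missing_counts(papers, ["title", "abstract", "keywords"]))
```

Fields with the lowest counts are the ones worth searching on first.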
- Python >= 3.11
- numpy
- pandas
- rich
- textual
- requests
- matplotlib
- spacy
- scikit-learn
- beautifulsoup4
- pyzotero
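A quick way to confirm the requirements above are met is to ask Python itself. The sketch below uses only the standard library; the package names are taken from the list above, and treating them as the installed distribution names is an assumption.

```python
import sys
from importlib import metadata

# Names copied from the requirements list.
REQUIRED = [
    "numpy", "pandas", "rich", "textual", "requests",
    "matplotlib", "spacy", "scikit-learn", "beautifulsoup4", "pyzotero",
]


def python_ok(minimum=(3, 11)):
    """True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def check_dependencies(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


print("Python >= 3.11:", python_ok())
for name, version in check_dependencies(REQUIRED).items():
    print(f"{name}: {version or 'MISSING'}")
```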
In VS Code, press CTRL + SHIFT + ~ to open a terminal.
Navigate to the directory where you want to clone the repo.
Launch VS Code if that is your IDE of choice.
In your terminal, navigate to your root folder.
If Poetry is not installed, install it before continuing.
On Windows
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
On Linux/Mac
curl -sSL https://install.python-poetry.org | python3 -
To check whether Poetry is installed on your system, type the following into your terminal:
poetry -V
If you see a version returned, you have Poetry installed, and updating it is always a good idea. If not, follow this link and run the installation commands for your system. On Windows, we recommend the PowerShell option for the easiest installation. Using pip to install Poetry will lead to problems down the road, and we do not recommend that option: Poetry needs to be installed separately from your standard Python installation so it can manage your many Python installations. Note: Python 2.7 is not supported. You are more than welcome to go the pip route, but I can't guarantee your dependencies won't clash.
Some prefer Poetry's default method of storing all environments in one location on your system. By default they are nested under {cache_dir}/virtualenvs.
If you want to store your virtual environment locally, set the global configuration flag below once Poetry is installed. Poetry will then look for an environment in the project's root folder before trying any global environments in the cache.
poetry config virtualenvs.in-project true
For general instructions on Poetry's functionality and commands, please read through Poetry's CLI documentation.
To create a new venv
python -m venv .venv
Or let Poetry create the environment and use it for this project:
poetry env use python3.12
Activate the venv. On Windows:
.venv\scripts\activate
Mac/Linux
source .venv/bin/activate
To use your GPU, or not to use your GPU. That is the question. If you're lucky enough to have a workhorse GPU on your rig, you might be inclined to use it when selecting the "Marco" and "Specter" models. To do so requires... a few extra annoying steps. Hopefully you bought into the NVIDIA hype and have one of their GPUs, as most of PyTorch's implementations are based on the NVIDIA CUDA drivers.
First order of business is to see which CUDA version your current NVIDIA driver supports.
nvidia-smi
After running the above, look at the top right for CUDA Version: xx.x
This is the maximum CUDA version you can use with your current installation. If you want to install PyTorch, you'll need to install a CUDA toolkit that is BELOW that max version. If you go over it... well, that's on you.
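The version rule is simple enough to write down. Here is a minimal sketch that picks the newest toolkit build that does not exceed the driver's maximum. The function names are hypothetical, and accepting a toolkit that exactly equals the maximum is an assumption of this sketch.

```python
def parse_version(text):
    """Turn '12.5' into (12, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def pick_toolkit(driver_max, available):
    """Return the newest available toolkit that does not exceed driver_max."""
    limit = parse_version(driver_max)
    # Assumption: a toolkit equal to the driver maximum is acceptable.
    usable = [v for v in available if parse_version(v) <= limit]
    if not usable:
        return None
    return max(usable, key=parse_version)


# A driver reporting 12.5 with builds offered for 11.8 and 12.6.
print(pick_toolkit("12.5", ["11.8", "12.6"]))  # 11.8
```

If nothing fits, the function returns None, which is your cue to update the driver or stay on CPU.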
Now you'll need to head over to PyTorch's getting started page.
Go through the selections and see which align with your system. My only options were 11.8 or 12.6. Since my NVIDIA driver's max CUDA version is 12.5, 11.8 it is! Because Poetry is a bit extra, we'll have to add the source for whichever CUDA version fits below your GPU's current NVIDIA drivers.
poetry source add --priority=explicit pytorch-cuda "https://download.pytorch.org/whl/cu118"
After the source is added, you should see something like this in your pyproject.toml file.
[[tool.poetry.source]]
name = "pytorch-cuda"
url = "https://download.pytorch.org/whl/cu118"
priority = "explicit"
Now you can install the specific versions of what you'll need to run SBert models on your GPU. In my case, these were the available versions from the 11.8 CUDA Toolkit.
poetry add torch==2.7.0+cu118 torchaudio==2.7.0+cu118 torchvision==0.22.0+cu118 --source pytorch-cuda
poetry add sentence-transformers
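Once the install finishes, it's worth checking that PyTorch can actually see the GPU. This is a minimal sketch with a hypothetical helper name, and it falls back to the CPU if torch is missing or no CUDA device is visible.

```python
def pick_device():
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Running on:", pick_device())
```

The returned string can be passed as the `device` argument when constructing a `SentenceTransformer`. How this repo wires that up internally isn't shown here.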
Before you run the commands below, go into the pyproject.toml file and delete lines 23-25 and 34-44. Then update the lock file (first) and install the libraries:
poetry lock
poetry install --no-root
This will read from the pyproject.toml file included in this repo and install all necessary package versions. Should other versions be needed, the project TOML file will be used and packages updated according to your system requirements. To view the currently installed libraries:
poetry show
To view only top level library requirements
poetry show -T
While in the root directory, run the commands below
$ mkdir data/logs data/logs/scrape data/logs/tui
$ mkdir data/searches data/models/Marco data/models/specter
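If mkdir complains about missing parent folders, or you're on a shell where the commands above behave differently, a few lines of Python will create the same folders. This is a sketch, not part of the repo; the folder names are copied from the mkdir commands and `make_folders` is a hypothetical helper.

```python
from pathlib import Path

# Folder names copied from the mkdir commands.
FOLDERS = [
    "data/logs/scrape",
    "data/logs/tui",
    "data/searches",
    "data/models/Marco",
    "data/models/specter",
]


def make_folders(root="."):
    """Create every folder under root, including missing parents."""
    created = []
    for name in FOLDERS:
        path = Path(root) / name
        # parents=True builds data/ and data/logs/ as needed;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_folders()` from the repo root creates everything in one go and is safe to run more than once.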
If you'd like to use word2vec to do your asymmetric semantic search, you'll need to do a few things before starting. With your environment activated, type the following in your terminal. This should install the model in your activated environment. You can check by looking for something like en_core_web_md-3.8.0.... in your .venv/Lib/site-packages folder.
python -m spacy download en_core_web_md
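Instead of digging through site-packages, you can ask Python whether the model is importable, since spaCy models install as regular packages. A minimal sketch using only the standard library; `model_installed` is a hypothetical helper.

```python
import importlib.util


def model_installed(name="en_core_web_md"):
    """True if a package with this name is importable in the current environment."""
    return importlib.util.find_spec(name) is not None


print("en_core_web_md installed:", model_installed())
```

If this prints False, double check that the environment you ran the download in is the one that is currently activated.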
This repo also comes with a TUI (Terminal User Interface) that allows you to explore the JSON objects for each conference / year. This repo was forked from here and updated with a ScrollableContainer on the right panel instead of the previous output. Thank you to oleksis for creating the initial structure!! 🎉
To run the TUI with poetry
poetry run python tui/__main__.py data/scraped/2024_ICML.json
#replace year/conf
With python
python tui/__main__.py data/scraped/2024_ICML.json
#replace year/conf
With no file args, like a madman. This will launch a file picking application that scans the data/conferences folder and shows you a list of available files. Enter the number of the conference you want, and you're good to go.
poetry run python tui/__main__.py
python tui/__main__.py
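For the curious, a numbered file picker like the one described can be sketched in a few lines. This is an illustration only, not the TUI's actual implementation: the function names are hypothetical and the folder is whatever you pass in.

```python
from pathlib import Path


def list_json_files(folder):
    """Return the JSON files in folder, sorted by name."""
    return sorted(Path(folder).glob("*.json"))


def show_menu(files):
    """Print a numbered list, starting at 1."""
    for index, path in enumerate(files, start=1):
        print(f"{index}. {path.name}")


def choose(files, number):
    """Map a 1-based menu number to a file, or None if out of range."""
    if 1 <= number <= len(files):
        return files[number - 1]
    return None
```

A caller would list the files, show the menu, read a number from the user, and pass the chosen path on to the viewer.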
- Search with word2vec takes longer to run. Patience, Iago.
- Fuzzy search on abstract will take even longer
Suggested operation ranges
- Fuzzy => 1 to 10
  - Best results around 5
- Cosine => -1 to 1
  - Best results around 0.40
- Word2vec => -1 to 1
  - Best results around 0.85
- Marco => -1 to 1
  - Best results around 0.85
- Specter => -1 to 1
  - Best results around 0.85
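To get a feel for what a threshold like 0.40 does, here is a toy cosine search over titles. It is a sketch under stated assumptions, not the repo's Cosine model: it scores plain word counts using only the standard library, so scores land between 0 and 1 rather than the full -1 to 1 range, and the titles are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of a and b."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, titles, threshold=0.40):
    """Return (score, title) pairs at or above the threshold, best first."""
    scored = [(cosine_similarity(query, title), title) for title in titles]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Hypothetical titles for illustration.
titles = ["Graph Neural Networks for Molecules", "Diffusion Models"]
for score, title in search("graph neural networks", titles):
    print(f"{score:.2f}  {title}")
```

Raising the threshold trims the result list to closer matches; lowering it lets looser matches through.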
With the TUI running, it should look something like this.
paper_search.mp4
[x] - Implement SBert Model
[x] - Install CUDA toolkit to use GPU
[x] - Update instructions on how to do that
- arXiv
- medRxiv
- bioRxiv
- Local Zotero search
- Need basic search here.
- Functionality
- make a query search
- Add data to searched datasets as json
- Workflow
- Unsure at the current moment
- Likely a checkbox that can switch maybe between all 3 arxiv sources?
- I think they have different inputs
- LLM paper summarization
- I bet gemini would be a good free use case
- Clustering Tab
- Gemma / Bert embedding
- t-SNE
- Look at the first two components to find the subject topics that vary the most.
- Tom Arnold idea. Build a graph network from the citations of each paper
- Analyze who is getting cited most often and driving a particular area of research. Really like this!!
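As a starting point for the citation idea, counting who gets cited most is just an in-degree count over (citing, cited) pairs. A minimal sketch with made-up paper IDs; nothing like this exists in the repo yet, and the edge format is an assumption.

```python
from collections import Counter


def most_cited(citations, top_n=3):
    """citations: iterable of (citing_paper, cited_paper) pairs."""
    counts = Counter(cited for _, cited in citations)
    return counts.most_common(top_n)


# Hypothetical citation edges.
edges = [
    ("P1", "P3"), ("P2", "P3"), ("P4", "P3"),
    ("P1", "P2"), ("P4", "P2"),
    ("P3", "P5"),
]

print(most_cited(edges, top_n=2))
```

A full graph library could come later; the counts alone already show which papers are driving an area.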