Retrieve information from text documents with a TF-IDF model, and reduce dimensionality with Latent Semantic Indexing (LSI).
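As a rough illustration of the idea (not the notebook's actual code), here is a minimal sketch of TF-IDF retrieval followed by LSI using Scikit-learn. The toy `documents`, the `query`, and the choice of `n_components=2` are assumptions made only for this example; the real project works on the LISA files described below.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the real documents (illustrative only).
documents = [
    "library catalogue systems for public libraries",
    "online information retrieval in academic libraries",
    "training of library staff and librarians",
    "indexing and abstracting services for scientific journals",
]
query = "information retrieval systems"

# Build the TF-IDF document-term matrix and project the query into the same space.
vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(documents)
query_tfidf = vectorizer.transform([query])

# LSI: reduce the TF-IDF space to a few latent dimensions with truncated SVD.
# n_components=2 is an arbitrary choice for this tiny corpus.
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_lsi = lsi.fit_transform(doc_tfidf)
query_lsi = lsi.transform(query_tfidf)

# Rank documents by cosine similarity to the query in the latent space.
scores = cosine_similarity(query_lsi, doc_lsi)[0]
ranking = scores.argsort()[::-1]

for rank, doc_index in enumerate(ranking, start=1):
    print(rank, round(float(scores[doc_index]), 3), documents[doc_index])
```

On a real collection the number of latent dimensions is usually much larger than 2 and is worth tuning against the known relevant results.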
- Python : Popular language for implementing neural networks
- Jupyter Notebook : Interactive tool for running Python code cell by cell
- Google Colab : Hosted environment for running Jupyter notebooks
- Requests : Simple HTTP library for accessing APIs and websites
- Polars : Fast DataFrame library for efficient data processing
- NumPy : Library for working with arrays in Python
- Matplotlib : Library for plotting charts in Python
- Scikit-learn : Essential ML toolkit for training and evaluating models
You can easily run this code on Google Colab by clicking this badge:
The dataset is named LISA.
I split it into three files to make it easier to work with:
- Documents.txt (Documents are stored here)
- Queries.txt (Queries are stored here)
- Result.txt (The real relevant results for each query are stored here)
You can get the modified dataset by clicking these badges:
Or download the raw dataset:
Here is part of the Documents raw text :
Here is part of the Queries raw text :
Here is part of the Real Result raw text :
Here is part of the Documents frame :
Here is part of the Queries frame :
Here is part of the Real Result frame :
- Remove garbage characters and digits
- Lowercase all alphabetic characters
- Tokenization
- Word Counting
- Show Zipf's law
- Calculate stop and stemming words
- Remove stop and stemming words
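
The steps above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions, not the notebook's implementation: the sample `raw_documents` are made up, stop words are assumed to be the `N_STOP` most frequent tokens, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer.

```python
import re
from collections import Counter

# Made-up sample text (illustrative only, not taken from the LISA files).
raw_documents = [
    "The Library of 1982 catalogued 3,000 books; the books were indexed!",
    "Indexing the journals: the library indexing service (est. 1975).",
    "Libraries and the users of the library catalogue.",
]


def clean(text):
    """Lowercase the text and replace every run of non-letters with a space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def tokenize(text):
    """Split cleaned text into whitespace-separated tokens."""
    return clean(text).split()


tokenized = [tokenize(doc) for doc in raw_documents]

# Word counting over the whole collection.
counts = Counter(token for doc in tokenized for token in doc)

# Zipf's law: frequency falls roughly as 1 / rank, so rank * frequency
# should stay in the same ballpark on a large corpus (not on this toy one).
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(rank, word, freq, rank * freq)

# ASSUMPTION: treat the N_STOP most frequent tokens as stop words.
N_STOP = 2
stop_words = {word for word, _ in counts.most_common(N_STOP)}


def naive_stem(token):
    """Hypothetical stemmer: strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Remove stop words, then stem what is left.
processed = [
    [naive_stem(token) for token in doc if token not in stop_words]
    for doc in tokenized
]
print(processed[0])
```

A frequency cutoff is a crude way to pick stop words; on the full dataset a curated stop list or a threshold read off the Zipf plot may work better.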
This project is licensed under the MIT License.