This repository contains the submission for the first mini project of the course CS771 (Fall 2024), completed under the instruction of Prof. Piyush Rai, Department of CSE, IIT Kanpur.

Name | Roll Number |
---|---|
Anushka Singh | 220188 |
Arush Upadhyaya | 220213 |
Aujasvit Datta | 220254 |
Pahal Dhruvin Patel | 220742 |
Pranav Agrawal | 220791 |
- `17.py`: main file to generate and save predictions
- `utils.py`: utility functions used in `17.py`
- `pred_emoticon.txt`: predictions for the emoticons dataset
- `pred_deepfeat.txt`: predictions for the deep features dataset
- `pred_text_seq.txt`: predictions for the text sequences dataset
- `emoticons/`: Jupyter notebooks containing experiments and EDA for the emoticons dataset
- `features/`: Jupyter notebooks containing experiments and EDA for the features dataset
- `text_seq/`: Jupyter notebooks containing experiments and EDA for the text sequences dataset
- `combined/`: Jupyter notebooks containing experiments and EDA for all datasets combined
- `common/`: helper functions used in experiments
- Install the dependencies: `pip install -r requirements.txt`
- Download the dataset and make sure the `datasets/` directory is present in the repository root
- Run `17.py` to generate the prediction files: `python 17.py`
- Preprocessing:
  - Removed dummy emojis that occur in all the input emoji strings
  - Columnarised the emoji strings into one column per character
  - One-hot encoded the categorical columns
- Model: Logistic Regression
- Best Parameters:

  Parameter | Value |
  ---|---|
  C | 10 |
  penalty | l1 |
  solver | liblinear |

- Achieved Accuracy on Validation Set: 97.13%
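
The three preprocessing steps above can be sketched in plain Python. This is a minimal illustration and not the code in `17.py` or `utils.py`: the function names are hypothetical, the toy strings use ASCII letters as stand-ins for emojis, and it assumes a "dummy" symbol is any symbol present in every input string.

```python
def remove_common_symbols(strings):
    """Drop every symbol that occurs in all input strings (assumed 'dummy' symbols)."""
    common = set(strings[0])
    for s in strings[1:]:
        common &= set(s)
    return ["".join(ch for ch in s if ch not in common) for s in strings]


def columnarise(strings):
    """Split each string into one column per character position."""
    width = max(len(s) for s in strings)
    # Pad shorter rows with an empty marker so every row has the same width.
    return [list(s) + [""] * (width - len(s)) for s in strings]


def one_hot(rows):
    """One-hot encode each positional column independently."""
    n_cols = len(rows[0])
    # Per-column vocabulary, sorted for a deterministic feature order.
    vocab = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    encoded = []
    for row in rows:
        vec = []
        for j, value in enumerate(row):
            vec.extend(1 if value == v else 0 for v in vocab[j])
        encoded.append(vec)
    return encoded, vocab


# Toy stand-ins for emoji strings: 'x' and 'y' appear in every sample.
samples = ["xAyB", "xCyA", "xByC"]
cleaned = remove_common_symbols(samples)
columns = columnarise(cleaned)
features, vocab = one_hot(columns)
```

Each row of `features` can then be passed to a linear classifier such as the logistic regression configured above.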
- Preprocessing: None
- Model: Logistic Regression
- Best Parameters:

  Parameter | Value |
  ---|---|
  C | 10.0 |
  fit_intercept | True |
  penalty | l2 |
  solver | lbfgs |

- Achieved Accuracy on Validation Set: 98.77%
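
As a rough illustration of this configuration, the parameters listed above map directly onto scikit-learn's `LogisticRegression`. The sketch below assumes scikit-learn and NumPy are installed and trains on a small synthetic matrix, since the real feature files and their shapes are not described here. The data, the split sizes and the variable names are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a feature matrix (assumption: the real data differs).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Simple hold-out split: first 150 rows for training, last 50 for validation.
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# Hyperparameters taken from the table above; no preprocessing is applied.
model = LogisticRegression(C=10.0, fit_intercept=True, penalty="l2", solver="lbfgs")
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy on synthetic data: {val_accuracy:.4f}")
```

The printed number only reflects the synthetic data and is unrelated to the accuracy reported above.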
- Preprocessing:
  - Removed substrings that occur in all the input strings
  - Converted the input strings into an n-gram representation, with `n_range = (3, 5)`
- Model: XGBoost
- Best Parameters:

  Parameter | Value |
  ---|---|
  colsample_bytree | 1.0 |
  eval_metric | logloss |
  gamma | 0.2 |
  learning_rate | 0.1 |
  max_depth | 7 |
  min_child_weight | 3 |
  n_estimators | 500 |
  subsample | 1.0 |

- Achieved Accuracy on Validation Set: 93.05%
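
The n-gram step in the preprocessing above can be sketched as follows. This is a minimal sketch under assumptions: it treats the n-grams as character-level and count-based, which the description does not state, and the helper names `char_ngrams` and `vectorise` are hypothetical rather than taken from the project code.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams of length n_min..n_max (inclusive) in text."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def vectorise(texts, n_min=3, n_max=5):
    """Turn strings into count vectors over a shared, sorted n-gram vocabulary."""
    counts = [char_ngrams(t, n_min, n_max) for t in texts]
    vocab = sorted(set().union(*counts))
    # Counter returns 0 for n-grams that a string does not contain.
    matrix = [[c[g] for g in vocab] for c in counts]
    return matrix, vocab


# Toy digit strings standing in for the real text sequences.
matrix, vocab = vectorise(["12345", "23456"])
```

With real inputs the vocabulary grows quickly, so a sparse representation would normally be preferable to the dense lists used here.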
- Model: Logistic Regression
- Best Parameters:

  Parameter | Value |
  ---|---|
  C | 10.0 |
  fit_intercept | True |
  penalty | l2 |
  solver | lbfgs |

- Achieved Accuracy on Validation Set: 98.77%
We used the seed 42 for all the probabilistic models that we attempted to run.
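
To illustrate why a fixed seed matters, the sketch below shows a seeded shuffle producing the same train/validation indices on every call. The helper `seeded_split` and the 80/20 split are hypothetical and only demonstrate the idea; they are not taken from the project code.

```python
import random

SEED = 42  # the seed value stated above


def seeded_split(n_samples, val_fraction=0.2, seed=SEED):
    """Return (train_indices, val_indices) from a reproducible shuffle."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_a, val_a = seeded_split(10)
train_b, val_b = seeded_split(10)
print(train_a == train_b and val_a == val_b)  # identical splits across calls
```

Libraries such as scikit-learn expose the same idea through a `random_state` argument.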