
Lots of variance in 'best' #182

Open
@Phhofm

Description


Hey

First, thank you for all your work :)

tl;dr: This is less of an issue and more of a question: is it normal that, when using many different metrics, they all seem to have different opinions on which images are best, with only a few pointing to the same images (e.g. dists & lpips-vgg picking 90k, topiq_fr & lpips picking 30k instead, topiq_fr-pipal & stlpips picking 150k instead, etc.)?


I train SISR models as one of my hobbies, and I thought I could maybe use metrics to find the best release checkpoint of a model training run (a DAT2 model) I was doing.
So I scored the 7 val images I was using, for each of these checkpoints (10k, 20k, 30k, ..., 210k iterations), with a few metrics to find the best checkpoint.
I got different results: psnr said 70k is the best checkpoint, ssim said 10k, dists said 90k, lpips & topiq_fr said 30k, and topiq_fr-pipal said 150k.

So I did a more extensive test and ran 68 metrics (I could not run qalign because it is resource-hungry, and piqe had some input shape errors) on these 7 val images, and also scored HR against HR as a baseline/comparison "checkpoint" (because if HR-HR does not come out best on a full-reference metric, something must have gone wrong).
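(For context, the scoring loop looks roughly like the sketch below; the metric list, paths, and directory layout are just placeholders and not my exact setup:)

```python
import pyiqa
import torch
from pathlib import Path

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# placeholder metric list and directory layout -- adjust to your own setup
metrics = ["psnr", "ssim", "lpips", "dists", "topiq_fr"]
hr_dir = Path("val/hr")
checkpoints = ["10k", "20k", "30k"]  # ... up to 210k

for name in metrics:
    metric = pyiqa.create_metric(name, device=device)
    for ckpt in checkpoints:
        sr_dir = Path(f"val/sr_{ckpt}")
        scores = []
        for hr_path in sorted(hr_dir.glob("*.png")):
            # full-reference metrics take (distorted, reference);
            # no-reference metrics would take only the first argument
            score = metric(str(sr_dir / hr_path.name), str(hr_path)).item()
            scores.append(score)
        avg = sum(scores) / len(scores)
        # note: some metrics are lower-is-better (check metric.lower_better)
        print(f"{name}\t{ckpt}\t{avg:.4f}")
```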

I gathered the results in this Google Sheet: https://docs.google.com/spreadsheets/d/1NL-by7WvZyDMHj5XN8UeDALVSSwH70IKvwV65ATWqrA/edit?usp=sharing
I put in all the scores per checkpoint, highlighted the best in red, second best in blue, and third best in green per metric, and underneath also listed the checkpoints sorted by score for each metric. The metrics are ordered according to the Model Cards page of the IQA-PyTorch documentation.

Screenshot of the Spreadsheet:

While some checkpoints show up consistently (10k, 60k, 150k, ...), I was surprised how divergent the 'best'-scoring checkpoint is across all these metrics. (I was expecting the different metrics to point towards the same checkpoint a bit more consistently than this, but looking at this sheet I was probably wrong.) My question is simply: is this normal / is this experience normal?

I was simply trying to find a few metrics I could rely on, so that in the future they might help me pick the best release candidate of a model training run, or to compare the outputs of my already-released models against each other on datasets with a few select metrics.
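(One thing I might also try, to quantify how much the metrics agree on the checkpoint ranking, is a simple pairwise rank correlation over the per-checkpoint averages. The sketch below assumes the sheet has been exported to a hypothetical checkpoint_scores.csv with checkpoints as rows and one averaged-score column per metric:)

```python
# sketch: pairwise Spearman rank correlation between metrics,
# assuming a hypothetical "checkpoint_scores.csv" export of the sheet
# (checkpoints as rows, one column of averaged scores per metric)
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("checkpoint_scores.csv", index_col=0)

cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        rho, _ = spearmanr(df[a], df[b])
        # rho near +1: same checkpoint ranking; near -1: opposite ranking
        # (expected for lower-is-better vs higher-is-better metric pairs)
        print(f"{a} vs {b}: spearman rho = {rho:.2f}")
```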

Anyway thank you for all your work :)
IQA-PyTorch is fantastic and made it simple for me to score outputs with multiple different metrics :)

(Ah, and if needed or of interest, all the image files used for this test/sheet can be found in this .tar file on Google Drive.)
