Include option to additionally retrieve external IDs for data #59

Open
2 tasks done
wkyoshida opened this issue Jan 16, 2024 · 9 comments
Labels
data Relates to data or Wikidata question Further information is requested

Comments

@wkyoshida
Member

Terms

Languages

ALL

Description

This issue is to discuss an option (e.g. a flag) to also retrieve external IDs for data when running the data process. I'm thinking this should be opt-in, i.e. not the default behavior. On the Scribe-Server side, this information could later be useful for tracking when specific data points are new or have been updated in the external sources Scribe references, e.g. Wikidata. For those interested, it could also be useful just to see the IDs.

  • For nouns, verbs, and prepositions, this is likely the Wikidata lexemes.

  • For translations, autosuggestions, and emoji keywords, the data comes from elsewhere, e.g. Wikipedia, Unicode CLDR, and translation models. I believe these wouldn't really have IDs tied to them.
    Considerations for Scribe-Server:

    • I wonder if it could make sense to attempt to tie them to a matching Wikidata lexeme, but I'm still unsure, as this could likely get messy.
    • Is there anything else we could use that makes sense?
  • Also, would doing this even make sense?

Open for discussion! 😊👀
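To make the opt-in idea concrete, here is a minimal sketch of how a lexeme ID (LID) could be pulled out of a Wikidata entity URI and only kept when asked for. The names `format_record`, `include_ids`, and the `lexemeID` key are hypothetical and not part of Scribe-Data; the only thing taken as given is that Wikidata lexeme URIs end in `L` followed by digits.

```python
import re
from typing import Optional

# Wikidata lexeme entity URIs end in "L" plus digits, for example
# http://www.wikidata.org/entity/L12345. Forms and senses add a suffix
# such as "-F1" or "-S1", which is dropped here.
_LEXEME_URI = re.compile(r"/entity/(L\d+)(?:-[FS]\d+)?$")


def lexeme_id_from_uri(uri: str) -> Optional[str]:
    """Return the lexeme ID (LID) from a Wikidata entity URI, or None."""
    match = _LEXEME_URI.search(uri)
    return match.group(1) if match else None


def format_record(record: dict, include_ids: bool = False) -> dict:
    """Build an output record, keeping the LID only when opted in.

    Hypothetical helper: `record` is assumed to hold the raw entity URI
    under the key "lexeme".
    """
    out = {key: value for key, value in record.items() if key != "lexeme"}
    if include_ids:
        out["lexemeID"] = lexeme_id_from_uri(record["lexeme"])
    return out
```

With the default `include_ids=False` the output is unchanged from today, which matches the idea that this should not be the default behavior.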

@wkyoshida wkyoshida added question Further information is requested data Relates to data or Wikidata labels Jan 16, 2024
@andrewtavis
Member

Hey @wkyoshida 👋 FYI I made a new issue in iOS that speaks to this even being something that we could include in the app data files 😊 See scribe-org/Scribe-iOS#400. The idea there is that when a verb conjugation isn't showing up, we could instead show a link to the Wikidata page for the given lexeme, so that the person could enter the conjugation there and have it show up in the next data download :)
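For reference, a small sketch of how such a link could be built from a stored LID, assuming the ID is kept as a plain string like `L12345`. The function name is hypothetical; the URL pattern is the one Wikidata uses for lexeme pages.

```python
def lexeme_page_url(lexeme_id: str) -> str:
    """Return the Wikidata page URL for a lexeme ID such as "L12345"."""
    # Guard against passing a Q-ID or an empty value by mistake.
    if not (lexeme_id.startswith("L") and lexeme_id[1:].isdigit()):
        raise ValueError(f"Not a lexeme ID: {lexeme_id!r}")
    return f"https://www.wikidata.org/wiki/Lexeme:{lexeme_id}"
```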

@wkyoshida
Member Author

It was decided in the dev sync to go ahead and already at least implement the first idea proposed in this issue:

  • For nouns, verbs, and prepositions, this is likely the Wikidata lexemes.

Created a different issue, #101, to track the work for this and actually decided to leave this issue open to continue the discussion on potential ideas for the second point:

  • For translations, autosuggestions, and emoji keywords...

Grabbing the lexemes though will already be a useful addition 😁

@andrewtavis
Member

Noting down some points here with long-term architecture in mind:

  • Translations will eventually come from Wikidata and will thus have LIDs
  • Autosuggestions will eventually come from included LLMs in the end applications
  • Emojis being CLDR based makes it hard to actually put IDs on them

The real interest here is lastModified, which will be present for translations. For emojis, one solution could be a field that records when the emoji data as a whole was last updated, so that we know when to include it in data transfers: if the local lastModified for the emojis table is earlier than the one on Scribe-Server's version of the table, we send the whole thing over. Alternatively, we could have a separate lastModified for each emoji, where the current timestamp is set whenever an emoji is added or its keywords change?
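As a rough sketch of the first (whole-table) option, the check could look like the following. The function names are hypothetical, and it assumes timestamps are stored as ISO 8601 strings in UTC.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string such as 2025-03-16T12:00:00Z as aware UTC."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def emoji_table_is_stale(local: Optional[str], server: str) -> bool:
    """True when the whole emojis table should be sent to the client."""
    if local is None:
        # The client has no emoji data yet.
        return True
    return parse_timestamp(local) < parse_timestamp(server)
```

The trade-off is that a single changed keyword resends the whole table, which is what the per-emoji alternative would avoid.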

CC @axif0: What do you think on the above? :)

@andrewtavis
Member

Big thing, let's not focus on this for translations and autosuggestions as hopefully a year and a half from now it won't even be needed :)

@axif0
Member

axif0 commented Mar 16, 2025

we could have different lastModified for each emoji where if a change is made to add the emoji or change its keywords then the current timestamp is set?

I think the second approach, having a lastModified timestamp for each emoji, is the better option, as we'll then have a precise history of changes for each emoji.

{
  "cheerful": [
    {
      "emoji": "😀",
      "is_base": false,
      "rank": 61,
      "lastModified": "2025-03-16T12:00:00Z"
    }
  ],
  "cheery": [
    {
      "emoji": "😀",
      "is_base": false,
      "rank": 61,
      "lastModified": "2025-03-16T12:00:00Z"
    }
  ]
}

  • Do we need to convert emoji_keywords.json into emoji_keywords.sqlite?
  • When uploading to Scribe-Server, it should check the keys like cheerful or cheery (Question: are the keys unique?). If those keys are found, we skip the data import. If no keys match, we upload the key into the table with the time Scribe-Server's data was last updated.

Does this make sense?
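One way the per-emoji timestamps could be maintained, as a sketch only: compare the previous export with the new one and set the current time just for entries that are new or whose fields changed. The function name `stamp_changes` is hypothetical, and the entry layout is assumed to be the JSON shape shown above.

```python
def stamp_changes(previous: dict, current: dict, now: str) -> dict:
    """Copy `current`, adding a lastModified value to every emoji entry.

    Unchanged entries keep their previous timestamp; new or changed
    entries get `now`.
    """
    stamped = {}
    for keyword, entries in current.items():
        old_by_emoji = {old["emoji"]: old for old in previous.get(keyword, [])}
        stamped[keyword] = []
        for entry in entries:
            old = old_by_emoji.get(entry["emoji"])
            unchanged = old is not None and all(
                old.get(field) == entry.get(field) for field in ("is_base", "rank")
            )
            # Fall back to `now` if an old entry was never stamped.
            last_modified = old.get("lastModified", now) if unchanged else now
            stamped[keyword].append({**entry, "lastModified": last_modified})
    return stamped
```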

@andrewtavis
Member

Do we need to convert the emoji_keywords.json into emoji_keywords.sqlite ?

No, it's just an emojis table within the language SQLite DB. Because of this, I think that a lastModified for each keyword would be good so that the final columns can be keyword, last_modified, emoji_1, emoji_2, emoji_3 (btw these are renamed as I'm realizing that the current versions don't make much sense). Maybe we can also do emoji_4 just in case we ever want to do four emojis for tablets?
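A minimal sketch of that table using Python's built-in sqlite3 module. The column names come from the comment above; the column types and the primary key on keyword are assumptions.

```python
import sqlite3


def create_emojis_table(connection: sqlite3.Connection) -> None:
    """Create the emojis table with one row per keyword."""
    connection.execute(
        """
        CREATE TABLE IF NOT EXISTS emojis (
            keyword TEXT PRIMARY KEY,    -- assumes keywords are unique
            last_modified TEXT NOT NULL, -- ISO 8601 timestamp
            emoji_1 TEXT,
            emoji_2 TEXT,
            emoji_3 TEXT,
            emoji_4 TEXT                 -- optional fourth slot for tablets
        )
        """
    )


# An in-memory database stands in for the language SQLite DB here.
connection = sqlite3.connect(":memory:")
create_emojis_table(connection)
connection.execute(
    "INSERT INTO emojis VALUES (?, ?, ?, ?, ?, ?)",
    ("cheerful", "2025-03-16T12:00:00Z", "😀", None, None, None),
)
```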

Your points in the second one make sense. We'll check the keyword to see if it doesn't exist or if the lastModified time is earlier than the current one, and if so we send along the data.
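Sketched out, that check could look like the following, assuming both sides can provide a keyword to lastModified mapping and that every timestamp is an ISO 8601 string with a UTC offset. The function name `keywords_to_send` is hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def _parse(value: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def keywords_to_send(server: Dict[str, str], client: Dict[str, str]) -> List[str]:
    """Return keywords the client is missing or holds an older copy of."""
    outdated = []
    for keyword, server_time in server.items():
        client_time = client.get(keyword)
        if client_time is None or _parse(client_time) < _parse(server_time):
            outdated.append(keyword)
    return sorted(outdated)
```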

Let me know on the above! Maybe it makes sense for us to close this and make a new issue for the work we're describing?

@axif0
Member

axif0 commented Mar 24, 2025

(btw these are renamed as I'm realizing that the current versions don't make much sense). Maybe we can also do emoji_4 just in case we ever want to do four emojis for tablets?

German, emoji_keywords:

"fröhlich": [
  {
    "lastModified": "<last server upload date>",
    "emoji": "😂",
    "is_base": false,
    "rank": 1
  },
  {
    "lastModified": "<last server upload date>",
    "emoji": "😁",
    "is_base": false,
    "rank": 12
  },
  {
    "lastModified": "<last server upload date>",
    "emoji": "🥳",
    "is_base": false,
    "rank": 30
  }
],

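In the emoji SQLite table, do we want something like this?

| Keyword | Last Modified | Emoji_1 | Rank_1 | Is_Base_1 | Emoji_2 | Rank_2 | Is_Base_2 | Emoji_3 | Rank_3 | Is_Base_3 | Emoji_4 | Rank_4 | Is_Base_4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fröhlich | YYYY-MM-DD | 😂 | 1 | False | 😁 | 12 | False | 🥳 | 30 | False | (NULL) | (NULL) | (NULL) |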

What do you think?

@axif0
Member

axif0 commented Mar 24, 2025

  1. After fixing emoji_keywords, I'll finalize the whole export cycle, including interactive mode. I also see that converting to SQLite only works with the --all flag:
     scribe-data c -a -ot sqlite
  2. Also, when using query for a sub-language, the exported JSON files are saved under Hindustani_urdu. Shouldn't they be saved under Hindustani/urdu/?

Dump and convert already follow the Hindustani/urdu/ convention.
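To illustrate the nested convention, a small sketch of how the output directory could be built so that query matches dump and convert. This is a hypothetical helper, not the actual Scribe-Data code, and the `export` base directory is only an example.

```python
from pathlib import PurePosixPath
from typing import Optional


def language_output_dir(
    base: str, language: str, sub_language: Optional[str] = None
) -> PurePosixPath:
    """Return base/Language or base/Language/sub_language."""
    path = PurePosixPath(base) / language
    if sub_language:
        # Nest the sub-language instead of joining with an underscore.
        path = path / sub_language
    return path
```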

@andrewtavis
Member

I'd say we should lowercase all the column names, @axif0, but aside from that we're good :)
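Putting the proposed layout and the lowercase column names together, a sketch of how one keyword's entries could be flattened into a single row. The function name is hypothetical, and ordering the slots by ascending rank is an assumption based on the fröhlich example.

```python
from typing import Dict, List, Optional


def keyword_row(
    keyword: str,
    entries: List[dict],
    last_modified: str,
    max_emojis: int = 4,
) -> Dict[str, Optional[object]]:
    """Flatten a keyword's emoji entries into one wide row."""
    row: Dict[str, Optional[object]] = {
        "keyword": keyword,
        "last_modified": last_modified,
    }
    # Assumption: a lower rank value means a better match, so it goes first.
    ranked = sorted(entries, key=lambda entry: entry["rank"])[:max_emojis]
    for slot in range(1, max_emojis + 1):
        entry = ranked[slot - 1] if slot <= len(ranked) else None
        # Unused slots stay None, which sqlite3 stores as NULL.
        row[f"emoji_{slot}"] = entry["emoji"] if entry else None
        row[f"rank_{slot}"] = entry["rank"] if entry else None
        row[f"is_base_{slot}"] = entry["is_base"] if entry else None
    return row
```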

Projects
Status: Todo
3 participants