Include option to additionally retrieve external IDs for data #59

wkyoshida · 2024-01-16T00:32:24Z

Terms

I have searched open and closed data issues
I agree to follow Scribe-Data's Code of Conduct

Languages

ALL

Description

This issue is to discuss an option (e.g. a flag) to also retrieve external IDs for data when running the data process. I'm thinking this should be opt-in rather than the default behavior. On the Scribe-Server side, this information could later be useful for tracking when specific data points are new or have been updated in the external sources Scribe references, e.g. Wikidata. For those interested, it could also be useful just to see the IDs.

- For nouns, verbs, and prepositions, this is likely the Wikidata lexemes.
- For translations, autosuggestions, and emoji keywords, the data points come from elsewhere, e.g. Wikipedia, Unicode CLDR, translation models. I believe these wouldn't really have IDs tied to them.

Considerations for Scribe-Server:
- I wonder if it could make sense to attempt to tie them to a matching Wikidata lexeme, but I'm still unsure as this could likely get messy.
- Is there anything else we could use that makes sense?

Also, would doing this even make sense?

Open for discussion! 😊👀

andrewtavis · 2024-02-24T13:41:42Z

Hey @wkyoshida 👋 FYI I made a new issue in iOS that speaks to this even being something that we could include in the app data files 😊 See scribe-org/Scribe-iOS#400. The idea there is that when a verb conjugation isn't showing up, the app could instead show a link to the Wikidata page for the given lexeme, so that the person could enter the conjugation and have it show up in the next data download :)

wkyoshida · 2024-03-11T02:48:04Z

It was decided in the dev sync to go ahead and already at least implement the first idea proposed in this issue:

> For nouns, verbs, and prepositions, this is likely the Wikidata lexemes.

Created a different issue, #101, to track the work for this and actually decided to leave this issue open to continue the discussion on potential ideas for the second point:

> For translations, autosuggestions, and emoji keywords...

Grabbing the lexemes though will already be a useful addition 😁

andrewtavis · 2025-03-16T15:33:46Z

Noting down some points here with long-term architecture in mind:

- Translations will eventually come from Wikidata and will thus have LIDs
- Autosuggestions will eventually come from included LLMs in the end applications
- Emojis being CLDR based makes it hard to actually put IDs on them

The real interest here is lastModified, which will be present for translations. Maybe the solution here is to add a field to the emoji data for when the emoji data was last updated as a whole, so that we know when to include it in data transfers, i.e. if the local lastModified in the emojis table is earlier than the one on Scribe-Server's version of the table, then we send the whole thing over. Or we could have different lastModified for each emoji where if a change is made to add the emoji or change its keywords then the current timestamp is set?

CC @axif0: What do you think on the above? :)

andrewtavis · 2025-03-16T15:34:33Z

Big thing, let's not focus on this for translations and autosuggestions as hopefully a year and a half from now it won't even be needed :)

axif0 · 2025-03-16T16:26:15Z

> we could have different lastModified for each emoji where if a change is made to add the emoji or change its keywords then the current timestamp is set?

I think the second approach, having a lastModified timestamp for each emoji, is the better option, as we'll then know precisely when each emoji was last changed.

```json
{
  "cheerful": [
    {
      "emoji": "😀",
      "is_base": false,
      "rank": 61,
      "lastModified": "2025-03-16T12:00:00Z"
    }
  ],
  "cheery": [
    {
      "emoji": "😀",
      "is_base": false,
      "rank": 61,
      "lastModified": "2025-03-16T12:00:00Z"
    }
  ]
}
```
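As a rough illustration of the per-emoji approach, here is a minimal Python sketch of how a timestamp could be carried over for unchanged entries and reset for new or changed ones. The function name `stamp_last_modified` and the idea of comparing against a previous copy of the data are assumptions for illustration only, not anything that exists in Scribe-Data.

```python
from datetime import datetime, timezone


def stamp_last_modified(new_data, old_data, now=None):
    """Return a copy of new_data with lastModified set on every emoji entry.

    Hypothetical helper: both arguments follow the keyword -> list of
    emoji entries layout shown in the JSON example.
    """
    now = now or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    stamped = {}
    for keyword, entries in new_data.items():
        # Index the previous entries for this keyword by emoji.
        previous = {e["emoji"]: e for e in old_data.get(keyword, [])}
        stamped[keyword] = []
        for entry in entries:
            old = previous.get(entry["emoji"]) or {}
            content = {k: v for k, v in entry.items() if k != "lastModified"}
            old_content = {k: v for k, v in old.items() if k != "lastModified"}
            if "lastModified" in old and content == old_content:
                # Unchanged entry: keep its earlier timestamp.
                timestamp = old["lastModified"]
            else:
                # New or changed entry: set the current timestamp.
                timestamp = now
            stamped[keyword].append({**content, "lastModified": timestamp})
    return stamped
```

With this shape, only entries whose emoji, rank, or is_base value changed would get a fresh timestamp on the next data update.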

Do we need to convert the emoji_keywords.json into emoji_keywords.sqlite?
When uploading to Scribe-Server, it should check the keys like cheerful or cheery (question: are the keys unique?). If a key is found, we skip importing its data. If no key matches, we upload the key into the table with the time Scribe-Server's data was last updated.

Does this make sense?

andrewtavis · 2025-03-17T09:38:19Z

> Do we need to convert the emoji_keywords.json into emoji_keywords.sqlite ?

No it's just an emojis table within the language SQLite DB. Because of this, I think that lastModified for each keyword would be good so that the final columns can be keyword, last_modified, emoji_1, emoji_2, emoji_3 (btw these are renamed as I'm realizing that the current versions don't make much sense). Maybe we can also do emoji_4 just in case we ever want to do four emojis for tablets?

Your points in the second one make sense. We'll check the keyword to see if it doesn't exist or if the lastModified time is earlier than the current one, and if so we send along the data.

Let me know on the above! Maybe it makes sense for us to close this and make a new issue for the work we're describing?
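The check described above (send a keyword if it doesn't exist locally or if its local lastModified is earlier than the server's) could look roughly like the following Python sketch. The function names and the keyword-to-timestamp dictionaries are assumptions made for illustration; the actual Scribe-Server logic may be structured quite differently.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as 2025-03-16T12:00:00Z."""
    # Replace a trailing Z so older Python versions can parse the value.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def keywords_to_send(client_rows, server_rows):
    """Return the keywords whose data should be sent to the client.

    Both arguments are hypothetical mappings of keyword -> lastModified.
    """
    stale = []
    for keyword, server_ts in server_rows.items():
        client_ts = client_rows.get(keyword)
        # Send if the client lacks the keyword or has an older version.
        if client_ts is None or parse_timestamp(client_ts) < parse_timestamp(server_ts):
            stale.append(keyword)
    return sorted(stale)
```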

axif0 · 2025-03-24T09:36:30Z

> (btw these are renamed as I'm realizing that the current versions don't make much sense). Maybe we can also do emoji_4 just in case we ever want to do four emojis for tablets?

German emoji_keywords:

```json
"fröhlich": [
  {
    "lastModified": "last-server-upload_date",
    "emoji": "😂",
    "is_base": false,
    "rank": 1
  },
  {
    "lastModified": "last-server-upload_date",
    "emoji": "😁",
    "is_base": false,
    "rank": 12
  },
  {
    "lastModified": "last-server-upload_date",
    "emoji": "🥳",
    "is_base": false,
    "rank": 30
  }
],
```

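In the emoji SQLite file, do we want something like this?

| Keyword | Last Modified | Emoji_1 | Rank_1 | Is_Base_1 | Emoji_2 | Rank_2 | Is_Base_2 | Emoji_3 | Rank_3 | Is_Base_3 | Emoji_4 | Rank_4 | Is_Base_4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fröhlich | YYYY-MM-DD | 😂 | 1 | False | 😁 | 12 | False | 🥳 | 30 | False | (NULL) | (NULL) | (NULL) |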

What do you think?

axif0 · 2025-03-24T09:44:04Z

After fixing emoji_keywords, I'll finalize the whole export cycle, including interactive mode. I also see that converting to SQLite only works with the --all command:

scribe-data c -a -ot sqlite

Also, when using query for a sub-language, the exported JSON files are saved as Hindustani_urdu. Shouldn't they be saved under `Hindustani/urdu/`?

Dump and convert follow the Hindustani/urdu/ convention.

andrewtavis · 2025-03-24T10:42:34Z

I'd say we should lower case all the column names, @axif0, but aside from that we're good :)
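To make the agreed layout concrete, here is a minimal sketch of flattening the keyword JSON into one wide row per keyword with lowercase column names, using Python's standard `sqlite3` module. The table name `emoji_keywords`, the helper names, the choice to store `is_base` as 0/1, and the use of a single `last_modified` value for the whole import are all assumptions for illustration, not the actual Scribe-Data implementation.

```python
import sqlite3

MAX_EMOJIS = 4  # emoji_1 through emoji_4, as discussed above


def build_row(keyword, entries, last_modified):
    """Flatten one keyword's emoji entries into a single wide row."""
    row = [keyword, last_modified]
    ranked = sorted(entries, key=lambda e: e["rank"])[:MAX_EMOJIS]
    for i in range(MAX_EMOJIS):
        if i < len(ranked):
            entry = ranked[i]
            row += [entry["emoji"], entry["rank"], int(entry["is_base"])]
        else:
            # Unused emoji slots stay NULL.
            row += [None, None, None]
    return row


def create_emoji_table(conn, data, last_modified):
    """Create and fill a hypothetical emoji_keywords table."""
    slots = ", ".join(
        f"emoji_{i} TEXT, rank_{i} INTEGER, is_base_{i} INTEGER"
        for i in range(1, MAX_EMOJIS + 1)
    )
    conn.execute(
        "CREATE TABLE emoji_keywords "
        f"(keyword TEXT PRIMARY KEY, last_modified TEXT, {slots})"
    )
    placeholders = ", ".join("?" * (2 + 3 * MAX_EMOJIS))
    conn.executemany(
        f"INSERT INTO emoji_keywords VALUES ({placeholders})",
        [build_row(k, v, last_modified) for k, v in data.items()],
    )
```

Making `keyword` the primary key reflects the earlier question about whether keys are unique; if they turn out not to be, that constraint would need to change.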

wkyoshida added question Further information is requested data Relates to data or Wikidata labels Jan 16, 2024

wkyoshida added this to Scribe Board Jan 16, 2024

github-project-automation bot moved this to Todo in Scribe Board Jan 16, 2024

andrewtavis mentioned this issue Feb 27, 2024

[Deleted] Explore formatting data with SQLite rather than Python directly #47

Closed

2 tasks

wkyoshida mentioned this issue Mar 11, 2024

Option to grab the Wikidata lexemes for queried words #101

Closed

2 tasks

andrewtavis mentioned this issue Jun 7, 2024

Simplify formatting process to lexeme based outputs rather than string based #142

Closed

2 tasks
