[proposed enhancement] AWS us-west-2 checking method #231

Open
battistowx opened this issue Apr 13, 2023 · 26 comments · May be fixed by #424

@battistowx
Collaborator

In some of our DAAC notebooks, we include the following Boto3 snippet to check whether the notebook is being executed inside us-west-2; it raises a ValueError (with emoji status markers) if you are not, preventing the notebook from being fully executed:

import boto3
from IPython.display import Markdown, display

if boto3.client('s3').meta.region_name == 'us-west-2':
    display(Markdown('### us-west-2 Region Check: ✅'))
else:
    display(Markdown('### us-west-2 Region Check: ❌'))
    raise ValueError('Your notebook is not running inside the AWS us-west-2 region, '
                     'and will not be able to directly access NASA Earthdata S3 buckets')

It may be useful to include a method the user can call to check whether they are in us-west-2 for direct S3 access, raising an error like this if they are not, possibly implemented via an fsspec transitive dependency or the existing authorization checks.

@battistowx changed the title AWS us-west-2 checking method [enhancement] AWS us-west-2 checking method Apr 13, 2023
@battistowx changed the title [enhancement] AWS us-west-2 checking method [proposed enhancement] AWS us-west-2 checking method Apr 13, 2023
@betolink
Member

This sounds great, @battistowx! We should pair this week or next if you have time; it should be a simple thing to implement with your code.

@battistowx
Collaborator Author

Found a method that uses requests to grab the current instance's region from the link-local Instance Identity Document endpoint, and it works great! https://gist.github.com/doublenns/7e3e4b72df4aaeccbeabf87ba767f44e

Wondering if this should be its own function that a user could call anywhere, or if it should be called only when an S3 granule is opened, which would print a detailed exception statement.
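
For reference, a minimal sketch of that Instance Identity Document approach (the endpoint and document fields are standard EC2 IMDS; the helper name here is just illustrative, not an earthaccess API):

import requests

IMDS_BASE = "http://169.254.169.254"

def get_instance_region(timeout=1):
    """Return the AWS region of the current EC2 instance, or None if the
    instance metadata service is unreachable (i.e., not running on EC2)."""
    try:
        # IMDSv2 requires fetching a short-lived session token first
        token = requests.put(
            f"{IMDS_BASE}/latest/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
            timeout=timeout,
        ).text
        document = requests.get(
            f"{IMDS_BASE}/latest/dynamic/instance-identity/document",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=timeout,
        ).json()
        return document.get("region")
    except requests.exceptions.RequestException:
        return None

A short timeout matters here: outside EC2 the link-local address never answers, so the call should fail fast rather than hang.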

@JessicaS11
Collaborator

I'd be interested in helping implement this in some fashion - we're updating related functionality in icepyx and I'd love to just put it upstream and then use it.

@betolink
Member

I think it would be great if we could expose earthaccess.__store__.in_region as earthaccess.in_region(default="us-west-2"), perhaps with a nicely formatted repr output alongside the boolean value?
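
One hypothetical way to get "a boolean with a nicely formatted repr" is to subclass int (bool itself cannot be subclassed), so the value still works in if statements but prints a human-readable status. The class and attribute names below are illustrative only:

class RegionCheck(int):
    def __new__(cls, in_region, region="us-west-2"):
        obj = super().__new__(cls, bool(in_region))
        obj.region = region
        return obj

    def __repr__(self):
        return f"{self.region} region check: {'✅' if self else '❌'}"

With this, RegionCheck(True) displays as "us-west-2 region check: ✅" while still behaving as True in boolean contexts.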

@JessicaS11 linked a pull request Jan 8, 2024 that will close this issue
@JessicaS11
Collaborator

Could someone clarify for #424:

was the spirit to check specifically for us-west-2, or to enable the user to see what region they are running in?

Thanks!

@chuckwondo
Collaborator

Perhaps a silly question, but why do we need to make such an explicit check? I ran across another thread somewhere (I can't find it at the moment) where it seems that getting such a check to work reliably/confidently in various environments is a hard problem (impossible in some cases?).

What specifically is such a check to help a user do or not do or avoid?

@jhkennedy
Collaborator

jhkennedy commented Sep 3, 2024

A lot of this has been discussed in #444.

@JessicaS11 my general understanding is that we want to provide an up-front check to confirm the "direct access tokens" requested by an Earthdata Login user will work, because you can request and receive them from anywhere (locally/on-prem, in another cloud, in a different AWS region than the data, in the same region as the data, etc.) but you'll get an Access Denied error unless you are in the same AWS region as the data. Many users we work with don't really understand the cloud in general, let alone the specifics of AWS regions, and "looking before you leap" (LBYL) is helpful for them. This kind of check could be done right up front in a notebook so users don't waste time on something that's ultimately not going to work. Is that correct, or did I miss anything?

Sorry to be a downer here, but I'm now generally of the opinion that we should not provide a LBYL method/functionality like this at all. There are many challenges in doing this well, as described in #444, starting with #444 (comment). As the discussion has gone on, more problems and edge cases have come up.

I think for this, we'll be stuck with an easier-to-ask-forgiveness-than-permission (EAFP) pattern, and time would be best spent on robust error handling/warnings so users understand what's happening. And really, for the high-level methods in earthaccess (e.g., smart_open), users don't actually have to care about where the data is, other than that it might be slow to work with from their location.

That said, there's one possible way to provide a LBYL method: actually open a file and see if you end up with a signed S3 URL in the region you're interested in (#444 (comment)), which should probably be a granule-level method (you need something to open); something like granule.check_region('us-west-2') or granule.confirm_direct_s3_access(). There are a lot of other things that can go wrong in this process (e.g., EULAs, outages, etc.), so we especially want to handle errors well with this method.
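
For concreteness, a rough sketch of that granule-level check. It assumes earthaccess.get_s3fs_session() and DataGranule.data_links(access="direct"); the function name is one of the candidates floated above, not an existing method, and the error handling is deliberately minimal:

import earthaccess

def confirm_direct_s3_access(granule, daac):
    """Return True if a direct S3 read of this granule works from here."""
    links = granule.data_links(access="direct")
    if not links:
        return False  # nothing to open directly
    fs = earthaccess.get_s3fs_session(daac=daac)
    try:
        # Reading a single byte is enough to prove access works
        with fs.open(links[0], "rb") as f:
            f.read(1)
        return True
    except PermissionError:
        # s3fs surfaces S3 Access Denied as PermissionError
        return False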

A high-level LBYL method could be implemented if we knew of good "canary" files or hosted canary files in every region (or at least the region(s) of interest), but that's a lot of infrastructure to maintain and potential cloud costs we don't have a clear way to pay for.

@chuckwondo
Collaborator

@jhkennedy, thanks for jumping in and supplying the link to the thread I was looking for.

I'm glad to hear that you are in favor of not trying to do this (if I understand your comment correctly -- if not, please correct me) precisely because of the difficulties involved. This is why I asked the question, because I also feel that we should not be trying to implement such a LBYL function/method for the user.

Instead, as you mentioned, we should do our best to produce a helpful error message with guidance for the user, to minimize (or eliminate) a great deal of head-scratching.

@itcarroll
Collaborator

@jhkennedy @chuckwondo

Could one of you clarify how earthaccess.open would work without the LBYL/in-region check? Would it always first attempt direct access and fall back on https? Would the user have to enable logging to see how they are opening a granule, or know how to check the type of opened file? I understand the challenges of implementing the check, but want to understand the proposed UX without such a check. I'm skeptical users will be happily agnostic; I think they want confirmation they're doing the cloud thingy correctly.

A thought: if we fix the cloud_hosted filter so that it's useful (#565), we could use that as a user input on whether or not to raise an error when direct access fails.

@asteiker added the needs: feedback requested label and removed the type: question label Dec 10, 2024
@itcarroll
Collaborator

Noting (mostly to myself) that this issue (including #431, #444, #883) remains.

How do the experts feel about checking for an EARTHACCESS_IN_REGION environment variable, akin to the in_region argument of DataGranule.data_links? Hub image maintainers could set this.

The environment variable could be in addition to some TBD way of accurately determining if we are in the right clouds.
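
A minimal sketch of what reading that variable could look like (EARTHACCESS_IN_REGION is the hypothetical name proposed here, not something earthaccess currently reads):

import os

def in_region_from_env():
    """Return True/False if the hub declared in-region status, None if unset."""
    value = os.environ.get("EARTHACCESS_IN_REGION")
    if value is None:
        return None  # no declaration; fall back to other checks
    return value.strip().lower() in ("1", "true", "yes")

Returning None when the variable is unset keeps it an override rather than a requirement, which matters given the concern raised below about hubs that never set it.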

@battistowx
Collaborator Author

battistowx commented May 1, 2025

That's definitely an option! It would be consistent with the environment variable approach that is already utilized for authentication. Would it just be something like EARTHACCESS_IN_REGION=1 for us-west-2? We would need to make it clear to maintainers that "in region" is specifically for that AWS region.

@jhkennedy
Collaborator

jhkennedy commented May 7, 2025

I suppose we could add an environment variable that hub/infra providers could set to tell earthaccess what region it is deployed in, and it should probably be something like EARTHACCESS_AWS_REGION=us-west-2.

That said, I will reiterate that I don't think LBYL (look before you leap) is a good pattern here at all and should be avoided, as there's no way to actually do this check appropriately, so I'm particularly hesitant to build something in. It'd be much better to follow an EAFP (easier-to-ask-forgiveness-than-permission) pattern of trying direct access and, if you get Access Denied, falling back to HTTPS.

@chuckwondo
Collaborator

I suppose we could add an environment variable that hub/infra providers could set to tell earthaccess what region it is deployed in, and it should probably be something like EARTHACCESS_AWS_REGION=us-west-2.

That said, I will reiterate that I don't think LBYL (look before you leap) is a good pattern here at all and should be avoided, as there's no way to actually do this check appropriately, so I'm particularly hesitant to build something in. It'd be much better to follow an EAFP (easier-to-ask-forgiveness-than-permission) pattern of trying direct access and, if you get Access Denied, falling back to HTTPS.

I completely agree. This has been my recommendation all along.

Our current logic is brittle because there is simply no way to definitively determine the region in which the code is running, so including brittle/hacky code that inconsistently/unreliably attempts such a determination is futile, and it continues to be a source of issues/questions/bugs/discussions that take away from more significant topics.

I'll reiterate Joe's point above (which we both have likely stated more than once elsewhere as well): we should simply try using S3 and fall back to HTTPS upon Access Denied errors.

The only addition we might want to make to that is a flag to indicate whether or not fallback should occur. That is, if a user can somehow turn off the fallback mechanism, then when S3 fails, the failure is simply propagated immediately and no attempt to use HTTPS is made.

Regarding having hubs set an env var, I don't think that's a viable approach for at least 2 reasons: (1) we cannot possibly ensure that absolutely every hub in existence will set the variable, and thus there's no sense in attempting to rely on it, as we would still have to handle cases where it isn't set, and (2) if earthaccess were to get a special var set in a hub, what would prevent other library authors from making the same request? That's just not feasible, given the vast number of libraries that exist.
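
Putting the last few comments together, the EAFP shape being argued for looks roughly like this. It assumes DataGranule.data_links() with access="direct"/"external" and the earthaccess.get_s3fs_session() / earthaccess.get_fsspec_https_session() helpers; the function name and the fallback flag are hypothetical:

import earthaccess

def open_granule(granule, daac, fallback=True):
    """Try direct S3 first; optionally fall back to HTTPS on Access Denied."""
    direct_links = granule.data_links(access="direct")
    if direct_links:
        s3 = earthaccess.get_s3fs_session(daac=daac)
        try:
            return s3.open(direct_links[0], "rb")
        except PermissionError:
            if not fallback:
                raise  # user asked for S3 or nothing: propagate immediately
    # Either there were no direct links, or S3 failed and fallback is allowed
    https = earthaccess.get_fsspec_https_session()
    return https.open(granule.data_links(access="external")[0], "rb")

Whichever branch is taken should be logged loudly, per the point below about users needing to know whether they actually got direct access.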

@betolink
Member

betolink commented May 7, 2025

I agree on making this a robust approach. The simplest thing would be the environment variable. Another approach would require some sort of test before starting a download/open call: try a HEAD request on a very small granule using boto3 S3 and, if we get a 200, set in_region to true. @chuckwondo @jhkennedy
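
A sketch of that probe, assuming S3 credentials in the shape returned by the DAAC s3credentials endpoints (accessKeyId/secretAccessKey/sessionToken, e.g., from earthaccess.get_s3_credentials) and a placeholder bucket/key; nothing here is an existing earthaccess method:

import boto3
from botocore.exceptions import ClientError

def probe_s3_access(bucket, key, creds):
    """HEAD one known-small object; a 200 means direct access works from here."""
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["accessKeyId"],
        aws_secret_access_key=creds["secretAccessKey"],
        aws_session_token=creds["sessionToken"],
    )
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        # HeadObject returns no error body, so a denied request shows up as "403"
        print(f"Direct S3 probe failed: {err.response['Error']['Code']}")
        return False

As Joe notes just below, a failed HEAD only tells you "S3 access did not work this time", not "you are out of region".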

@jhkennedy
Collaborator

I think the simplest thing would be to default to HTTPS -- in region, you'll end up with S3-signed links from TEA (the Thin Egress App) after a few redirects, so it will be only slightly less performant (most impacted when accessing lots of little files). And then direct S3 would be opt-in if you know what you're doing.

Doing a HEAD request on the first search result and using that to set the "in region" flag might be a good way to check whether you'll be able to do S3 access, but there are a lot of ways you can get back an AccessDenied error beyond being out of region, so I'd still be hesitant there; the check would be more "did S3 access work this time?" than "am I in region?".

@itcarroll
Collaborator

itcarroll commented May 7, 2025

The only addition we might want to make to that is a flag to indicate whether or not fallback should occur. That is, if a user can somehow turn off the fallback mechanism, then when S3 fails, the failure is simply propagated immediately and no attempt to use HTTPS is made.

That's essential. Since we are all reiterating, I will reiterate that users deserve a clear indication they are doing the cloud thingy correctly. Very bad if they think they are using S3 but actually using HTTPS.

simply try using S3

I disagree for earthaccess.download. If a user knows they are not in region, let them skip this try.

I see three approaches being described, in short:

  1. Automagically determine in_region (possibly using a sentinel object in us-west-2)
  2. Assume in_region == True until a request fails
  3. Make the user declare in_region (possibly using an environment variable a hub owner could set)

I still favor some way to achieve 3, but could support 2 in the case where the user does not provide an opinion. So here is my proposal.

I keep saying in_region because we all know what it means but I do think we could workshop the name.

And now @jhkennedy added more to think about!

@itcarroll removed the needs: feedback requested label May 7, 2025
@chuckwondo
Collaborator

chuckwondo commented May 7, 2025

I will reiterate that users deserve a clear indication they are doing the cloud thingy correctly. Very bad if they think they are using S3 but actually using HTTPS.

I'm wondering if we are perhaps conflating "cloud hosted" with "s3 vs. https".

In other words, do users really need to know they are using s3 URLs rather than https URLs?

I could very well be wrong, so please correct me if I am, but my suspicion is that they don't care what the protocol is. What they care about is whether or not they're obtaining data from the cloud instead of from some on-prem location, and that's determined by the cloud_hosted=True option to the search.

I think the simplest thing would be to default to HTTPS -- in region, you'll end up with S3-signed links from tea after a few redirects, so it will be slightly less performant (will be most impacted by accessing lots of little files). And then S3 direct would be opt in if you know what you're doing.

This sounds good to me. This would seem to provide users more concerned with performance the control they need, but for those who aren't, don't need to be concerned with s3 vs. https.

@itcarroll
Collaborator

I could very well be wrong, so please correct me if I am, but my suspicion is that they don't care what the protocol is.

I think you are wrong. It is tough to say without a perfect grasp of all the lingo, but users care whether their environment and code are performant, which translates to caring indirectly about whether data egress from us-west-2 is happening. It's not just where they are obtaining data from; it's where their machine is, too.

@chuckwondo
Collaborator

I could very well be wrong, so please correct me if I am, but my suspicion is that they don't care what the protocol is.

I think you are wrong. It is tough to say without a perfect grasp of all the lingo, but users care whether their environment and code are performant, which translates to caring indirectly about whether data egress from us-west-2 is happening. It's not just where they are obtaining data from; it's where their machine is, too.

Fair enough. If we can get you, me, @jhkennedy, @mfisher87, and @betolink on the next hack-day call, I think it would be great for us to attempt to iron out the details, perhaps along the lines of what @jhkennedy is suggesting. It sounds like we need to find a good balance between simplicity, control, and sensible defaults.

@jhkennedy
Collaborator

If we can get you, me, @jhkennedy, @mfisher87, and @betolink on the next hack-day call, I think it would be great for us to attempt to iron out the details

Sounds like a good plan to me and I should be at the next one!

@chuckwondo
Collaborator

If we can get you, me, @jhkennedy, @mfisher87, and @betolink on the next hack-day call, I think it would be great for us to attempt to iron out the details

Sounds like a good plan to me and I should be at the next one!

Oh! It's tomorrow! Fantastic timing.

@jhkennedy
Collaborator

Tomorrow?! I thought the next Earthaccess hack day is Tuesday, May 13. Or at least that's what's on my calendar... did it change?

@chuckwondo
Collaborator

Tomorrow?! I thought the next Earthaccess hack day is Tuesday, May 13. Or at least that's what's on my calendar... did it change?

Oops! You're right. For some reason my calendar was looking at next Monday (jeez am I out of whack today!)

@mfisher87
Collaborator

I can make it for the first 30 minutes at least. I have a staff meeting which conflicts 😖

@betolink
Member

betolink commented May 7, 2025

I'm most likely going to miss it 😿 will be traveling that day.

@battistowx
Collaborator Author

If we can get you, me, @jhkennedy, @mfisher87, and @betolink on the next hack-day call, I think it would be great for us to attempt to iron out the details

Sounds like a good plan to me and I should be at the next one!

Oh! It's tomorrow! Fantastic timing.

I will be there as well!
