Issue using non-root user with google-batch executor #4880

Open
JohnWalshTempus opened this issue Apr 3, 2024 · 5 comments
@JohnWalshTempus

Bug report

Expected behavior and actual behavior

The GCP Batch executor (google-batch) should allow containers to run as a non-root user, for improved security. Today, only the root user can access files under /mnt/disks/**.

Steps to reproduce the problem

I have pushed two public docker images, one with root as the default user, another with worker as the default user.

These can be found on Docker Hub:

  • jvwalsh/nextflow-non-root-user:latest
  • jvwalsh/nextflow-root-user:latest

Dockerfile:

FROM debian:buster-slim

RUN apt-get update 
RUN apt-get upgrade -y 
RUN apt-get install -y curl make wget ca-certificates && rm -rf /var/lib/apt/lists/*

RUN apt-get update \
    && apt-get -yqq install \
    libhdf5-dev jq procps -y


# Commented out for the root-default user image
RUN adduser --disabled-login worker

RUN mkdir /app

# Set ownership and permissions for directories
# (commented out for the root-default user image)
RUN chown -R worker:worker /bin/ /app /mnt/ /tmp/ && \
    chmod -R +x /mnt/ /bin/ /app/ /tmp/

# Commented out for the root-default user image
USER worker

WORKDIR /app/

ENTRYPOINT []
CMD []

The workflow I am running is as follows

main.nf:

#!/usr/bin/env nextflow

process HELLO {
  input: 
    val x

  script:
    """
    echo '$x world!'
    """
}

workflow {
  input_channel = Channel.of('Hello')
  input_channel | HELLO
}

nextflow.config:

workDir = 'gs://<my-bucket>/workshop'

process {
    executor = 'google-batch'
    // container = 'jvwalsh/nextflow-root-user:latest'   // root-default image: succeeds
    container = 'jvwalsh/nextflow-non-root-user:latest'  // non-root image: fails
}

google {
    <my google conf>
}

Program output

The execution is successful in the root image, while the non-root image gives the following error in the GCP Batch logs:

/bin/bash: /mnt/disks/<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.run: Permission denied
cp: failed to access '/mnt/disks/<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.log': Permission denied
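
To see what the container user can actually do with the mount, one option is to print the user's identity and the path's ownership from inside the container. The sketch below is a hypothetical helper (`describe_access` is not part of Nextflow or Batch). It assumes `bash`, `id`, and `ls` exist in the image, which should hold for the Debian-based image above.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper: report how the current user sees a path.
describe_access() {
  local path="$1"
  printf 'user: uid=%s gid=%s\n' "$(id -u)" "$(id -g)"
  if [ ! -e "$path" ]; then
    # A FUSE mount that denies this user entirely also ends up here.
    printf 'path: %s (missing or not visible to this user)\n' "$path"
    return 1
  fi
  # Numeric owner, group, and mode of the path itself.
  ls -ldn -- "$path"
  local flags=""
  [ -r "$path" ] && flags="${flags}r"
  [ -w "$path" ] && flags="${flags}w"
  [ -x "$path" ] && flags="${flags}x"
  printf 'effective access: %s\n' "${flags:-none}"
}
```

Since the failure above happens before the process script starts, this would have to run as the container command of a hand-submitted Batch job, for example `describe_access /mnt/disks/<my-bucket>`.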
nextflow.log
Apr-03 13:08:58.219 [main] DEBUG nextflow.cli.Launcher - $> nextflow run .
Apr-03 13:08:58.274 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 23.10.1
Apr-03 13:08:58.288 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/Users/John.Walsh/.nextflow/plugins; core-plugins: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]
Apr-03 13:08:58.299 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Apr-03 13:08:58.300 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Apr-03 13:08:58.302 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Apr-03 13:08:58.309 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
Apr-03 13:08:58.636 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /Users/John.Walsh/Learn/nf-minimal-issue/nextflow.config
Apr-03 13:08:58.637 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /Users/John.Walsh/Learn/nf-minimal-issue/nextflow.config
Apr-03 13:08:58.641 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-03 13:08:58.681 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 by global default
Apr-03 13:08:58.706 [main] INFO  nextflow.cli.CmdRun - Launching `./main.nf` [furious_plateau] DSL2 - revision: 0f6cc6a24a
Apr-03 13:08:58.707 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[nf-google@1.8.3]
Apr-03 13:08:58.707 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins resolved requirement=[nf-google@1.8.3]
Apr-03 13:08:58.707 [main] DEBUG nextflow.plugin.PluginUpdater - Installing plugin nf-google version: 1.8.3
Apr-03 13:08:58.738 [main] INFO  org.pf4j.AbstractPluginManager - Plugin 'nf-google@1.8.3' resolved
Apr-03 13:08:58.739 [main] INFO  org.pf4j.AbstractPluginManager - Start plugin 'nf-google@1.8.3'
Apr-03 13:08:58.812 [main] DEBUG nextflow.plugin.BasePlugin - Plugin started nf-google@1.8.3
Apr-03 13:08:58.823 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /Users/John.Walsh/.nextflow/secrets/store.json
Apr-03 13:08:58.825 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@12b5736c] - activable => nextflow.secret.LocalSecretsProvider@12b5736c
Apr-03 13:08:58.859 [main] DEBUG nextflow.Session - Session UUID: 11a85f03-9fcc-4ee8-940b-a606ff267b8d
Apr-03 13:08:58.860 [main] DEBUG nextflow.Session - Run name: furious_plateau
Apr-03 13:08:58.860 [main] DEBUG nextflow.Session - Executor pool size: 10
Apr-03 13:08:59.048 [main] DEBUG nextflow.file.FilePorter - File porter settings maxRetries=3; maxTransfers=50; pollTimeout=null
Apr-03 13:08:59.051 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=30; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Apr-03 13:08:59.268 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 23.10.1 build 5891
  Created: 12-01-2024 22:01 UTC (16:01 CDT)
  System: Mac OS X 13.4.1
  Runtime: Groovy 3.0.19 on OpenJDK 64-Bit Server VM 11.0.21+0
  Encoding: UTF-8 (UTF-8)
  Process: [email protected] [192.168.1.66]
  CPUs: 10 - Mem: 16 GB (64.9 MB) - Swap: 9 GB (327.2 MB)
Apr-03 13:08:59.286 [main] DEBUG nextflow.Session - Work-dir: gs://<my-bucket>/workshop [Mac OS X]
Apr-03 13:08:59.286 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /Users/John.Walsh/Learn/nf-minimal-issue/bin
Apr-03 13:08:59.305 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[GoogleLifeSciencesExecutor, GoogleBatchExecutor]
Apr-03 13:08:59.311 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Apr-03 13:08:59.320 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
Apr-03 13:08:59.326 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 11; maxThreads: 1000
Apr-03 13:08:59.392 [main] DEBUG nextflow.Session - Session start
Apr-03 13:08:59.489 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Apr-03 13:08:59.537 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: google-batch
Apr-03 13:08:59.538 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'google-batch'
Apr-03 13:08:59.539 [main] DEBUG nextflow.executor.Executor - [warm up] executor > google-batch
Apr-03 13:08:59.542 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'google-batch' > capacity: 1000; pollInterval: 10s; dumpInterval: 5m 
Apr-03 13:08:59.544 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: google-batch)
Apr-03 13:08:59.547 [main] DEBUG nextflow.cloud.google.GoogleOpts - Google auth via application DEFAULT
Apr-03 13:08:59.549 [main] DEBUG n.c.google.batch.GoogleBatchExecutor - [GOOGLE BATCH] Executor config=BatchConfig[googleOpts=GoogleOpts(projectId:<my-project-id>, credsFile:null, location:us-west1, enableRequesterPaysBuckets:false, httpConnectTimeout:1m, httpReadTimeout:1m, credentials:UserCredentials{requestMetadata=null, temporaryAccess=null, clientId=<myclientid>, refreshToken=<refresh token>, tokenServerUri=https://oauth2.googleapis.com/token, transportFactoryClassName=com.google.auth.oauth2.OAuth2Utils$DefaultHttpTransportFactory, quotaProjectId=<my-project-id>})
Apr-03 13:08:59.558 [main] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] Creating service client with config credentials
Apr-03 13:09:00.161 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: HELLO
Apr-03 13:09:00.161 [main] DEBUG nextflow.Session - Igniting dataflow network (2)
Apr-03 13:09:00.161 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > HELLO
Apr-03 13:09:00.166 [main] DEBUG nextflow.script.ScriptRunner - Parsed script files:
  Script_48b0a9acaea0588b: /Users/John.Walsh/Learn/nf-minimal-issue/main.nf
Apr-03 13:09:00.167 [main] DEBUG nextflow.script.ScriptRunner - > Awaiting termination 
Apr-03 13:09:00.167 [main] DEBUG nextflow.Session - Session await
Apr-03 13:09:03.662 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `HELLO (1)` submitted > job=nf-c83e1a8f-1712167740918; uid=nf-c83e1a8f-171216-d929d349-3ff3-4efa0; work-dir=gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3
Apr-03 13:09:03.662 [Task submitter] INFO  nextflow.Session - [c8/3e1a8f] Submitted process > HELLO (1)
Apr-03 13:10:39.970 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `HELLO (1)` - terminated job=nf-c83e1a8f-1712167740918; state=FAILED
Apr-03 13:10:40.273 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Cannot read exit status for task: `HELLO (1)` - gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.exitcode
Apr-03 13:10:41.011 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: HELLO (1); status: COMPLETED; exit: -; error: -; workDir: gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3]
Apr-03 13:10:41.017 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=HELLO (1); work-dir=gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3
  error [nextflow.exception.ProcessFailedException]: Process `HELLO (1)` terminated for an unknown reason -- Likely it has been terminated by the external system
Apr-03 13:10:41.108 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.out
Apr-03 13:10:41.178 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.err
Apr-03 13:10:41.179 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'HELLO (1)'

Caused by:
  Process `HELLO (1)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  echo 'Hello world!'

Command exit status:
  -

Command output:
  (empty)

Work dir:
  gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
Apr-03 13:10:41.184 [main] DEBUG nextflow.Session - Session await > all processes finished
Apr-03 13:10:41.273 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Process `HELLO (1)` terminated for an unknown reason -- Likely it has been terminated by the external system
Apr-03 13:10:41.373 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.err
Apr-03 13:10:41.450 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.out
Apr-03 13:10:41.451 [main] DEBUG nextflow.Session - Session await > all barriers passed
Apr-03 13:10:41.451 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: google-batch) - terminating tasks monitor poll loop
Apr-03 13:10:41.535 [main] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.err
Apr-03 13:10:41.614 [main] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs://<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.out
Apr-03 13:10:41.620 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=1; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=1.1s; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=1; peakMemory=0; ]
Apr-03 13:10:41.661 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Apr-03 13:10:41.662 [main] INFO  org.pf4j.AbstractPluginManager - Stop plugin 'nf-google@1.8.3'
Apr-03 13:10:41.662 [main] DEBUG nextflow.plugin.BasePlugin - Plugin stopped nf-google
Apr-03 13:10:41.680 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

Additionally, running gcloud beta batch jobs describe projects/<project-id>/locations/us-west1/jobs/<my-nf-job> --format json gives a consistent output like:
[screenshot: output of the gcloud describe command]

Environment

  • Nextflow version: [23.10.1.5891]
  • Java version: [21.0.2]
  • Operating system: [macOS]
  • Bash version: [zsh 5.9 (x86_64-apple-darwin22.0)]

Additional context

Other attempts to address the issue

  • Adding process { containerOptions = "--user worker" }
  • Adding process { containerOptions = "--u 1000:1000" } (worker's user/group id)
  • Running export NXF_OWNER=worker on orchestrating machine first

I am wondering if there's a need for an explicit option like the docker executor's fixOwnership

@bentsherman
Member

Minor typo, not sure if it's what you actually tried:

process { containerOptions = "-u 1000:1000" }

@JohnWalshTempus
Author

> Minor typo, not sure if it's what you actually tried:
>
> process { containerOptions = "-u 1000:1000" }

Thanks, I corrected that, but the /mnt/disks/** access issue is still there.

containerOptions are propagating to my batch runnable definition when I describe the job with gcloud:
[screenshot: runnable definition from the gcloud job description]

@bentsherman
Member

Likely need to look at the GCS mount options to see if there is anything related to permissions:

@Override
List<Volume> getVolumes() {
    final result = new ArrayList(10)
    for( String it : buckets ) {
        result.add(
            Volume.newBuilder()
                .setGcs(
                    GCS.newBuilder()
                        .setRemotePath(it)
                )
                .setMountPath( "${MOUNT_ROOT}/${it}".toString() )
                .addAllMountOptions( ['-o rw', '-implicit-dirs'] )
                .build()
        )
    }
    return result
}

If you can submit a job through gcloud and play with these options, and find something that works, it should be trivial to update in Nextflow
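
One way to try mount options outside Nextflow is to write a small Batch job config by hand and submit it with gcloud. The sketch below is a hypothetical helper, not Nextflow's own job definition. The function name `render_job_config`, the test command inside the container, and the `.write-test` file name are assumptions made for illustration. The JSON field names follow the Batch job resource (`taskGroups`, `taskSpec`, `runnables`, `volumes`, `mountOptions`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a minimal Batch job config that mounts a bucket
# with the given gcsfuse mount options (one argument per option string).
# Note: the container command writes a small .write-test file into the bucket.
render_job_config() {
  local bucket="$1" image="$2"
  shift 2
  local opts="" o
  for o in "$@"; do
    # Build a comma-separated list of JSON strings.
    opts="${opts:+$opts, }\"$o\""
  done
  cat <<EOF
{
  "taskGroups": [{
    "taskSpec": {
      "runnables": [{
        "container": {
          "imageUri": "$image",
          "entrypoint": "/bin/bash",
          "commands": ["-c", "id && ls -ld /mnt/disks/$bucket && touch /mnt/disks/$bucket/.write-test"],
          "volumes": ["/mnt/disks/$bucket:/mnt/disks/$bucket:rw"]
        }
      }],
      "volumes": [{
        "gcs": {"remotePath": "$bucket"},
        "mountPath": "/mnt/disks/$bucket",
        "mountOptions": [$opts]
      }]
    }
  }],
  "logsPolicy": {"destination": "CLOUD_LOGGING"}
}
EOF
}

# Example (not run here): reproduce the options Nextflow passes today.
#   render_job_config <my-bucket> jvwalsh/nextflow-non-root-user:latest '-o rw' '-implicit-dirs' > job.json
#   gcloud batch jobs submit <job-name> --location us-west1 --config job.json
```

Changing only the trailing option arguments between submissions keeps each experiment to a single variable, so any permission difference in the logs can be attributed to the mount options.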

@JohnWalshTempus
Author

So far I've had success with either of the following sets of mount options, but both fail when allow_other is removed. allow_other was removed from Nextflow back in #4332, for reasons I haven't been able to determine. I found these docs on its security implications: https://github.com/torvalds/linux/blob/a33f32244d8550da8b4a26e277ce07d5c6d158b5/Documentation/filesystems/fuse.txt#L218-L310

.addAllMountOptions( ['-o rw,allow_other', '--file-mode=777', '--dir-mode=777', '-implicit-dirs'] ) // working option 1
.addAllMountOptions( ['-o rw,allow_other', '--uid=1000', '--gid=1000', '-implicit-dirs'] ) // working option 2
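
Option 2 hard-codes 1000:1000, which matches the worker user in this image. For another image, the values could be read from the image's passwd file. The sketch below is a hypothetical helper (`uid_gid_for` is an illustrative name) that assumes the standard colon-separated passwd format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "uid:gid" for a user from a passwd-format file.
# Returns non-zero if the user is not listed.
uid_gid_for() {
  local user="$1" passwd_file="${2:-/etc/passwd}"
  awk -F: -v u="$user" '
    $1 == u { print $3 ":" $4; found = 1; exit }
    END { exit !found }
  ' "$passwd_file"
}

# Example (not run here): extract the file from the image, then look up the user.
#   docker run --rm --entrypoint cat jvwalsh/nextflow-non-root-user:latest /etc/passwd > passwd.txt
#   uid_gid_for worker passwd.txt
```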

@JohnWalshTempus
Author

JohnWalshTempus commented Apr 14, 2025

Revisiting this one year later with a proposed fix, thanks in advance for any reviews or guidance!

#5970
