Create a Flintrock repository to host Hadoop and Spark releases #238

Open
@nchammas

Since its creation, Flintrock has sourced Spark releases from s3://spark-related-packages, an S3 bucket hosted by the AMPLab and kept up to date by the Apache Spark project. As of Spark 2.2.1, the Spark committers have confirmed that this bucket will no longer receive updates (alternate reference).

This is a big change for Flintrock's out-of-the-box experience. Users today can configure Flintrock to download Spark from a custom location via the --spark-download-source option, but by default Flintrock downloads Spark from s3://spark-related-packages. This gives users a fast, reliable, and convenient source of Spark releases with no setup required. Now that the bucket is being retired, we're stuck with the Apache mirror network as the default download source. Flintrock already uses Apache mirrors as the default source for Hadoop, and as Flintrock users know, they are slow and often unreliable (#66).

To preserve a strong out-of-the-box experience for Flintrock, I have begrudgingly decided to maintain a repository of Spark and Hadoop releases on S3 for use with Flintrock. I am loath to maintain new infrastructure, but in the absence of a fast CDN hosting public Spark and Hadoop releases, I think this is the only way.

To summarize the changes I plan to make:

  1. How Flintrock works today:
    • By default, Flintrock downloads Spark from s3://spark-related-packages.
    • By default, Flintrock downloads Hadoop from the Apache mirror network.
    • Users can customize where Flintrock downloads Spark and Hadoop from using --spark-download-source and --hdfs-download-source.
  2. How Flintrock will work after the change proposed here is complete:
    • By default, Flintrock will download both Spark and Hadoop from an S3 bucket maintained by me / the Flintrock project.
      • The bucket will be a Requester Pays bucket, meaning that users will pay the cost of data transfer from S3 to their Flintrock clusters on EC2.
      • The Flintrock project will only maintain a rolling window of select, recent releases of Spark and Hadoop in this repository.
    • As before, users can continue to customize where Flintrock downloads Spark and Hadoop from.
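
The planned behavior above can be sketched roughly as follows. This is only an illustration, not Flintrock's actual code: the bucket name, the key layout, the `{v}` version placeholder, and the function names are all assumptions made for the example, since this issue does not specify any of them.

```python
from typing import Optional

# Hypothetical bucket name and key layout; the issue does not define either.
DEFAULT_BUCKET = "example-flintrock-releases"
DEFAULT_TEMPLATES = {
    "spark": "s3://" + DEFAULT_BUCKET + "/spark/spark-{v}-bin-hadoop2.7.tgz",
    "hadoop": "s3://" + DEFAULT_BUCKET + "/hadoop/hadoop-{v}.tar.gz",
}


def resolve_download_source(
    service: str, version: str, custom_source: Optional[str] = None
) -> str:
    """Return the URL to download `service` from.

    A user-supplied source (standing in for --spark-download-source or
    --hdfs-download-source) wins; otherwise fall back to the default bucket.
    """
    template = custom_source if custom_source else DEFAULT_TEMPLATES[service]
    # "{v}" is assumed to be the version placeholder in the template.
    return template.format(v=version)


def is_requester_pays(url: str) -> bool:
    """True when the URL points at the (hypothetical) Requester Pays bucket."""
    return url.startswith("s3://" + DEFAULT_BUCKET + "/")


if __name__ == "__main__":
    # Default: both services come from the project-maintained bucket.
    print(resolve_download_source("spark", "2.2.1"))
    # Custom source: the default bucket is bypassed entirely.
    print(resolve_download_source(
        "hadoop", "2.7.5", "https://example.com/hadoop-{v}.tar.gz"))
```

With a layout like this, only downloads that fall through to the default bucket would incur Requester Pays transfer charges; a custom source sidesteps the bucket altogether.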

When this change is complete, Flintrock will no longer depend on external sources for Spark and Hadoop, and clusters that use Hadoop will launch faster by default since they will now download Hadoop from S3 as opposed to the Apache mirror network.

Thank you to the AMPLab and to the Apache Spark project for graciously hosting Spark releases on S3 for as long as they did (and footing the bill!), and to Matei for the suggestion to use a Requester Pays bucket with Flintrock.
