Create a Flintrock repository to host Hadoop and Spark releases

Since its creation, Flintrock has sourced Spark releases from [`s3://spark-related-packages`](https://s3.amazonaws.com/spark-related-packages), an S3 bucket hosted by the AMPLab and kept up to date by the Apache Spark project. As of Spark 2.2.1, the Spark committers have confirmed that [this bucket will no longer receive updates](http://apache-spark-developers-list.1001551.n3.nabble.com/Please-keep-s3-spark-related-packages-alive-td23503.html) ([alternate reference](https://mail-archives.apache.org/mod_mbox/spark-dev/201802.mbox/%3CCAOhmDzeOaXF3VcZQqOFnE2g_PKbW_T-Q+6WaWMN_C_OMZ3S0bA@mail.gmail.com%3E)).

This is a big change for Flintrock's out-of-the-box experience. Users today can configure Flintrock to download Spark from a custom location via the `--spark-download-source` option, but by default Flintrock downloads Spark from `s3://spark-related-packages`. This gives users a fast, reliable, and convenient source of Spark releases without any setup on their part. Now that the bucket is being retired, the only remaining default download source for Spark is the Apache mirror network. Flintrock already uses Apache mirrors as the default source for Hadoop, and as Flintrock users know, they are slow and often unreliable (#66).

To preserve a strong out-of-the-box experience for Flintrock, I have begrudgingly decided to maintain a repository of Spark and Hadoop releases on S3 for use with Flintrock. I am loath to maintain new infrastructure, but in the absence of a fast CDN hosting public Spark and Hadoop releases, I think this is the only way.

To summarize the changes I plan to make:
1. How Flintrock works today:
    * By default, Flintrock downloads Spark from `s3://spark-related-packages`.
    * By default, Flintrock downloads Hadoop from the Apache mirror network.
    * Users can customize where Flintrock downloads Spark and Hadoop from using `--spark-download-source` and `--hdfs-download-source`.
2. How Flintrock will work after the change proposed here is complete:
    * By default, Flintrock will download both Spark and Hadoop from an S3 bucket maintained by me / the Flintrock project.
        * The bucket will be a [Requester Pays bucket](https://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html), meaning that users will pay the cost of data transfer from S3 to their Flintrock clusters on EC2.
        * The Flintrock project will only maintain a rolling window of select, recent releases of Spark and Hadoop in this repository.
    * As before, users can continue to customize where Flintrock downloads Spark and Hadoop from.
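
The default-versus-custom behavior summarized above can be sketched roughly as follows. This is only an illustration, not Flintrock's actual code: the bucket name `example-flintrock-releases`, the function `resolve_download_source`, and the `{v}` version placeholder in the source template are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical default source. The real bucket name is not stated in this
# proposal, so this value is a placeholder for illustration only.
DEFAULT_SPARK_SOURCE = "s3://example-flintrock-releases/spark/{v}/"


def resolve_download_source(
    version: str,
    custom_source: Optional[str] = None,
    default_source: str = DEFAULT_SPARK_SOURCE,
) -> str:
    """Return the location to download a release from.

    A user-supplied source (as would be passed via an option such as
    --spark-download-source) takes precedence over the project default.
    The "{v}" placeholder is assumed to stand for the release version.
    """
    template = custom_source if custom_source else default_source
    return template.format(v=version)


# Default: fall back to the project-maintained source.
default_url = resolve_download_source("2.2.1")

# Custom: the user's own mirror wins over the default.
custom_url = resolve_download_source(
    "2.2.1", custom_source="https://www.example.com/files/spark/{v}/"
)
```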

When this change is complete, Flintrock will no longer depend on external sources for Spark and Hadoop, and clusters that use Hadoop will launch faster by default since they will now download Hadoop from S3 as opposed to the Apache mirror network.

Thank you to the AMPLab and to the Apache Spark project for graciously hosting Spark releases on S3 for as long as they did (and footing the bill!), and to Matei for the suggestion to use a Requester Pays bucket with Flintrock.

Create a Flintrock repository to host Hadoop and Spark releases #238
