!!! summary
    The `DataSet` API, added in v1.5.3, makes it easy to work with potentially large data sets,
    perform complex pre-processing tasks, and feed these data sets into TensorFlow models.

## Data Set

### Basics

A `DataSet[X]` instance is simply a wrapper over an `Iterable[X]` object; the user still has
access to the underlying collection.

!!! tip
    The [`dtfdata`](https://transcendent-ai-labs.github.io/api_docs/DynaML/recent/dynaml-core/#io.github.mandar2812.dynaml.tensorflow.package)
    object gives the user easy access to the `DataSet` API.

    ```scala
    import _root_.io.github.mandar2812.dynaml.probability._
    import _root_.io.github.mandar2812.dynaml.pipes._
    import io.github.mandar2812.dynaml.tensorflow._


    val random_numbers = GaussianRV(0.0, 1.0) :* GaussianRV(1.0, 2.0)

    //Create a data set.
    val dataset1 = dtfdata.dataset(random_numbers.iid(10000).draw)

    //Access underlying data
    dataset1.data
    ```

### Transformations

DynaML data sets support several operations in the _map-reduce_ style.

#### Map

The `map()` method transforms each element of type `X` into an element of some type `Y` (`Y` may be the same as `X`).

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._


val random_numbers = GaussianRV(0.0, 1.0)
//A data set of random gaussian numbers.
val random_gaussian_dataset = dtfdata.dataset(
  random_numbers.iid(10000).draw
)

//Transform the data set by applying a Scala function.
val random_chisq_dataset = random_gaussian_dataset.map((x: Double) => x*x)

val exp_tr = DataPipe[Double, Double](math.exp _)
//A DataPipe can be passed instead of a function.
val random_log_gaussian_dataset = random_gaussian_dataset.map(exp_tr)
```

#### Flat Map

The `flatMap()` method applies a function which transforms each element into an `Iterable`;
this operation is followed by flattening of the top-level `Iterable`.

Schematically, the process is

`Iterable[X] -> Iterable[Iterable[Y]] -> Iterable[Y]`

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val random_gaussian_dataset = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

//For each element x, draw 10 samples from a Gaussian with variance x*x,
//then flatten the resulting collections into a single data set.
val gaussian_mixture = random_gaussian_dataset.flatMap(
  (x: Double) => GaussianRV(0.0, x*x).iid(10).draw
)
```

#### Filter

The `filter()` method collects only the elements which satisfy some predicate, i.e. a function which
returns `true` for the elements to be kept and `false` for the ones which should be discarded.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_dataset = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val onlyPositive = DataPipe[Double, Boolean](_ > 0.0)

val truncated_gaussian = gaussian_dataset.filter(onlyPositive)

val zeroOrGreater = (x: Double) => x >= 0.0
//filterNot works in the opposite manner to filter
val neg_truncated_gaussian = gaussian_dataset.filterNot(zeroOrGreater)
```

#### Scan & Friends

Sometimes we need to perform operations on a data set which are sequential in nature. In such situations,
the `scanLeft()` and `scanRight()` methods are useful.

Let's simulate a random walk: we start from a number $x_0$ and add independent Gaussian increments $\epsilon_t$ to it.

$$
\begin{align*}
x_t &= x_{t-1} + \epsilon_t \\
\epsilon_t &\sim \mathcal{N}(0, 1)
\end{align*}
$$

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_increments = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val increment = DataPipe2[Double, Double, Double]((x, i) => x + i)

//Start the random walk from zero, and keep adding increments.
val random_walk = gaussian_increments.scanLeft(0.0)(increment)
```

The `scanRight()` method works just like `scanLeft()`, except that it begins from the last element
of the collection.
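
As a quick sketch (not part of the original docs), `scanRight()` can accumulate the increments
starting from the end of the collection; this assumes its signature mirrors the `scanLeft()`
example above, i.e. an initial value followed by a `DataPipe2` aggregator, and it reuses
`gaussian_increments` from the previous snippet.

```scala
//Hypothetical sketch: accumulate increments from the right,
//assuming scanRight() mirrors the scanLeft() signature shown above.
val reverse_walk = gaussian_increments.scanRight(0.0)(
  DataPipe2[Double, Double, Double]((increment, acc) => increment + acc)
)
```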

#### Reduce & Reduce Left

The `reduce()` and `reduceLeft()` methods help in computing summary values from the entire data
collection.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_increments = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val increment = DataPipe2[Double, Double, Double]((x, i) => x + i)

val random_walk = gaussian_increments.scanLeft(0.0)(increment)

//Average position of the walk; scanLeft() prepends the initial value,
//so the walk has 10001 elements.
val average = random_walk.reduce(
  DataPipe2[Double, Double, Double]((x, y) => x + y)
)/10001.0
```
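
For completeness, here is a small sketch of `reduceLeft()` (not from the original docs), used to
compute the maximum value attained by the random walk; it assumes `reduceLeft()` accepts a
`DataPipe2` aggregator in the same way as `reduce()` above.

```scala
//Hypothetical sketch: assuming reduceLeft() takes a DataPipe2 aggregator like reduce(),
//compute the maximum position reached by the random walk.
val max_position = random_walk.reduceLeft(
  DataPipe2[Double, Double, Double]((acc, x) => math.max(acc, x))
)
```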

#### Other Transformations

Sometimes a transformation cannot be applied to each element individually, because it requires
access to the entire data collection. In such cases, the `transform()` method accepts a pipe which
acts on the underlying `Iterable` as a whole.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

//Resample the collection with replacement (bootstrap); an Iterable
//has no random access, so convert it to a Vector first.
val resample = DataPipe[Iterable[Double], Iterable[Double]](
  coll => {
    val v = coll.toVector
    (0 until 10000).map(_ => v(Random.nextInt(v.length)))
  }
)

val resampled_data = gaussian_data.transform(resample)
```
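
Any pipe over the whole `Iterable` can be plugged into `transform()`. As another illustrative
sketch (using only the `transform()` method shown above), the collection can be sorted in one pass.

```scala
//Illustrative sketch: sort the entire collection via transform().
val sorted_data = gaussian_data.transform(
  DataPipe[Iterable[Double], Iterable[Double]](coll => coll.toSeq.sorted)
)
```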

!!! note
    **Conversion to TF-Scala `Dataset` class**

    The TensorFlow Scala API also has a `Dataset` class; it is possible to obtain a
    TensorFlow `Dataset` from a DynaML `DataSet` instance.

    ```scala
    import _root_.io.github.mandar2812.dynaml.probability._
    import _root_.io.github.mandar2812.dynaml.pipes._
    import io.github.mandar2812.dynaml.tensorflow._
    import org.platanios.tensorflow.api._
    import org.platanios.tensorflow.api.types._


    val random_numbers = GaussianRV(0.0, 1.0)

    //Create a data set.
    val dataset1 = dtfdata.dataset(random_numbers.iid(10000).draw)

    //Convert to a TensorFlow data set.
    dataset1.build[Tensor, Output, DataType.Aux[Double], DataType, Shape](
      Left(DataPipe[Double, Tensor](x => dtf.tensor_f64(1)(x))),
      FLOAT64, Shape(1)
    )
    ```

## Tuple Data & Supervised Data

The classes `ZipDataSet[X, Y]` and `SupervisedDataSet[X, Y]` both represent data collections which consist of
`(X, Y)` tuples. They can be created in a number of ways.

### Zip Data

The `zip()` method can be used to create data sets consisting of tuples.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import _root_.breeze.stats.distributions._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val log_normal_data = gaussian_data.map((x: Double) => math.exp(x))

val poisson_data = dtfdata.dataset(
  RandomVariable(Poisson(2.5)).iid(10000).draw
)

val tuple_data1 = poisson_data.zip(gaussian_data)

val tuple_data2 = poisson_data.zip(log_normal_data)

//Join on the keys, in this case the
//Poisson distributed integers
tuple_data1.join(tuple_data2)
```

### Supervised Data

For supervised learning operations, we can use the `SupervisedDataSet` class, which can be instantiated
in the following ways.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import _root_.breeze.stats.distributions._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

//Pair each element with a target drawn from a Gaussian
//whose variance depends on the element.
val sup_data1 = gaussian_data.to_supervised(
  DataPipe[Double, (Double, Double)](x => (x, GaussianRV(0.0, x*x).draw))
)

val targets = gaussian_data.map((x: Double) => math.exp(x))

//Construct a supervised data set from separate features and targets.
val sup_data2 = dtfdata.supervised_dataset(gaussian_data, targets)
```
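
Since `sup_data2` is a collection of `(feature, target)` pairs, element-wise operations apply as
before. The following sketch (illustrative, not from the original docs, and assuming
`SupervisedDataSet` supports the same `map()` operation as `DataSet`) squares the targets by
mapping over the pairs.

```scala
//Illustrative sketch: map over (feature, target) pairs to square the targets.
val squared_targets = sup_data2.map(
  DataPipe[(Double, Double), (Double, Double)](p => (p._1, p._2 * p._2))
)
```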