
Commit 9019d3b

committed [Release]: v1.5.3

Signed-off-by: mandar2812 <[email protected]>

1 parent ef3b9b5

File tree

9 files changed: +447 −132 lines


build.sbt (+1 −1)

```diff
@@ -9,7 +9,7 @@ packageSummary := "Scala Library/REPL for Machine Learning Research"
 packageDescription := "DynaML is a Scala environment for conducting research and education in Machine Learning. DynaML comes packaged with a powerful library of classes for various predictive models and a Scala REPL where one can not only build custom models but also play around with data work-flows. It can also be used as an educational/research tool for data analysis."

-val mainVersion = "v1.5.3-beta.3"
+val mainVersion = "v1.5.3"

 val dataDirectory = settingKey[File]("The directory holding the data files for running example scripts")
```

docs/core/core_dtfdata.md (+280, new file)

!!! summary
    The `DataSet` API, added in v1.5.3, makes it easy to work with potentially large data sets,
    perform complex pre-processing tasks, and feed them into TensorFlow models.

## Data Set

### Basics

A `DataSet[X]` instance is simply a wrapper over an `Iterable[X]` object; the user still has
access to the underlying collection.

!!! tip
    The [`dtfdata`](https://transcendent-ai-labs.github.io/api_docs/DynaML/recent/dynaml-core/#io.github.mandar2812.dynaml.tensorflow.package)
    object gives the user easy access to the `DataSet` API.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._

val random_numbers = GaussianRV(0.0, 1.0) :* GaussianRV(1.0, 2.0)

//Create a data set.
val dataset1 = dtfdata.dataset(random_numbers.iid(10000).draw)

//Access the underlying data.
dataset1.data
```

### Transformations

DynaML data sets support several operations in the _map-reduce_ style.

#### Map

Transform each element of type `X` into an element of some other type `Y` (`Y` may be the same as `X`).

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._

val random_numbers = GaussianRV(0.0, 1.0)

//A data set of random gaussian numbers.
val random_gaussian_dataset = dtfdata.dataset(
  random_numbers.iid(10000).draw
)

//Transform the data set by applying a Scala function.
val random_chisq_dataset = random_gaussian_dataset.map((x: Double) => x*x)

val exp_tr = DataPipe[Double, Double](math.exp _)

//A DataPipe can be passed instead of a function.
val random_log_gaussian_dataset = random_gaussian_dataset.map(exp_tr)
```

#### Flat Map

Apply a function which maps each element to an `Iterable`; the resulting nested collection
is then flattened one level.

Schematically:

`Iterable[X] -> Iterable[Iterable[Y]] -> Iterable[Y]`

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._

val random_gaussian_dataset = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

//For each element x, draw ten samples from N(0, x*x), then flatten.
val gaussian_mixture = random_gaussian_dataset.flatMap(
  (x: Double) => GaussianRV(0.0, x*x).iid(10).draw
)
```
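The map-then-flatten behaviour is the same as for plain Scala collections; a minimal sketch, independent of DynaML:

```scala
// flatMap = map each element to a collection, then flatten one level
val nested: Iterable[Iterable[Int]] = List(1, 2, 3).map(x => List(x, x * 10))
val flat = List(1, 2, 3).flatMap(x => List(x, x * 10))

// the two-step and one-step versions agree
assert(nested.flatten == flat)
assert(flat == List(1, 10, 2, 20, 3, 30))
```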
#### Filter

Collect only the elements which satisfy a predicate, i.e. a function which returns `true` for the
elements to be kept and `false` for those to be discarded.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_dataset = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val onlyPositive = DataPipe[Double, Boolean](_ > 0.0)

val truncated_gaussian = gaussian_dataset.filter(onlyPositive)

val zeroOrGreater = (x: Double) => x >= 0.0

//filterNot works in the opposite manner to filter.
val neg_truncated_gaussian = gaussian_dataset.filterNot(zeroOrGreater)
```

#### Scan & Friends

Sometimes we need to perform operations on a data set which are sequential in nature; here
the `scanLeft()` and `scanRight()` methods are useful.

Let's simulate a random walk: starting from a number $x_0$, we add independent gaussian increments.

$$
\begin{align*}
x_t &= x_{t-1} + \epsilon \\
\epsilon &\sim \mathcal{N}(0, 1)
\end{align*}
$$

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_increments = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val increment = DataPipe2[Double, Double, Double]((x, i) => x + i)

//Start the random walk from zero and keep adding increments.
val random_walk = gaussian_increments.scanLeft(0.0)(increment)
```

The `scanRight()` method works just like `scanLeft()`, except that it begins from the last element
of the collection.
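The difference between the two is easiest to see on plain Scala collections; a small sketch, independent of DynaML:

```scala
val xs = List(1.0, 2.0, 3.0)

// scanLeft folds front-to-back; the seed appears as the first element
assert(xs.scanLeft(0.0)(_ + _) == List(0.0, 1.0, 3.0, 6.0))

// scanRight folds back-to-front; the seed appears as the last element
assert(xs.scanRight(0.0)(_ + _) == List(6.0, 5.0, 3.0, 0.0))
```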
#### Reduce & Reduce Left

The `reduce()` and `reduceLeft()` methods compute summary values over the entire data
collection.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_increments = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val increment = DataPipe2[Double, Double, Double]((x, i) => x + i)

val random_walk = gaussian_increments.scanLeft(0.0)(increment)

//Compute the average position of the random walk.
val average = random_walk.reduce(
  DataPipe2[Double, Double, Double]((x, y) => x + y)
)/10000.0
```
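For an associative operator like addition, `reduce` and `reduceLeft` agree; for a non-associative operator the fold direction matters. A plain Scala sketch, independent of DynaML:

```scala
val xs = List(1.0, 2.0, 3.0, 4.0)

// reduceLeft folds front-to-back: (((1 + 2) + 3) + 4)
assert(xs.reduceLeft(_ + _) == 10.0)

// direction matters for non-associative operators
assert(xs.reduceLeft(_ - _) == -8.0)   // ((1 - 2) - 3) - 4
assert(xs.reduceRight(_ - _) == -2.0)  // 1 - (2 - (3 - 4))
```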
#### Other Transformations

Sometimes a transformation cannot be applied to each element individually, because it
requires the entire data collection.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

//Resample with replacement; convert to a Vector for indexed access.
val resample = DataPipe[Iterable[Double], Iterable[Double]](
  coll => {
    val v = coll.toVector
    (0 until 10000).map(_ => v(Random.nextInt(10000)))
  }
)

val resampled_data = gaussian_data.transform(resample)
```

!!! note
    **Conversion to the TF-Scala `Dataset` class**

    The TensorFlow Scala API also has a `Dataset` class; a TensorFlow `Dataset` can be
    obtained from a DynaML `DataSet` instance.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._
import org.platanios.tensorflow.api._
import org.platanios.tensorflow.api.types._

val random_numbers = GaussianRV(0.0, 1.0)

//Create a data set.
val dataset1 = dtfdata.dataset(random_numbers.iid(10000).draw)

//Convert to a TensorFlow data set.
dataset1.build[Tensor, Output, DataType.Aux[Double], DataType, Shape](
  Left(DataPipe[Double, Tensor](x => dtf.tensor_f64(1)(x))),
  FLOAT64, Shape(1)
)
```

## Tuple Data & Supervised Data

The classes `ZipDataSet[X, Y]` and `SupervisedDataSet[X, Y]` both represent data collections
consisting of `(X, Y)` tuples. They can be created in a number of ways.

### Zip Data

The `zip()` method creates data sets consisting of tuples.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import _root_.breeze.stats.distributions._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val log_normal_data = gaussian_data.map((x: Double) => math.exp(x))

val poisson_data = dtfdata.dataset(
  RandomVariable(Poisson(2.5)).iid(10000).draw
)

val tuple_data1 = poisson_data.zip(gaussian_data)

val tuple_data2 = poisson_data.zip(log_normal_data)

//Join on the keys, in this case the
//Poisson distributed integers.
tuple_data1.join(tuple_data2)
```
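The effect of a key-based join can be illustrated with plain Scala tuples; this is a hypothetical sketch of joining on the first tuple component, not the DynaML implementation itself:

```scala
val left  = List((1, "a"), (2, "b"), (3, "c"))
val right = List((1, 0.5), (3, 0.9), (4, 0.1))

// keep only keys present in both collections, pairing up their values
val joined = for {
  (k1, v1) <- left
  (k2, v2) <- right
  if k1 == k2
} yield (k1, (v1, v2))

assert(joined == List((1, ("a", 0.5)), (3, ("c", 0.9))))
```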
### Supervised Data

For supervised learning tasks, we can use the `SupervisedDataSet` class, which can be instantiated
in the following ways.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import _root_.breeze.stats.distributions._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

//Attach targets by mapping each feature to a (feature, target) pair.
val sup_data1 = gaussian_data.to_supervised(
  DataPipe[Double, (Double, Double)](x => (x, GaussianRV(0.0, x*x).draw))
)

val targets = gaussian_data.map((x: Double) => math.exp(x))

//Or construct from separate feature and target data sets.
val sup_data2 = dtfdata.supervised_dataset(gaussian_data, targets)
```
