61
61
* <li>
62
62
* For each Query message from the PubSub, check to see if it is has a KILL or COMPLETE signal.
63
63
* If yes, remove any existing {@link Querier} objects for that query (identified by the ID)
64
- * If no, create an instance of {@link Querier} for the query. If any exceptions or errors initializing, throw away
65
- * the querier.
64
+ * If no, create an instance of {@link Querier} for the query in {@link Mode#PARTITION} mode if and only if you are
65
+ * going to be persisting the querier for the duration of the query. If you are throwing away the querier, such as
66
+ * after processing your partitioned data in mini-batches and recreating it every new mini-batch, then you need not
67
+ * change the mode. If any exceptions or errors initializing, throw away the querier since the errors are handled in
68
+ * the Join stage below.
66
69
* </li>
67
70
* <li>
68
71
* For every {@link BulletRecord}, call {@link #consume(BulletRecord)} on all the {@link Querier} objects
72
75
* If {@link #isDone()}, call {@link #getData()} and also remove the {@link Querier}.
73
76
* </li>
74
77
* <li>
75
- * If {@link #isClosedForPartition ()}, use {@link #getData()} to emit the intermediate data to the Join stage for
78
+ * If {@link #isClosed ()}, use {@link #getData()} to emit the intermediate data to the Join stage for
76
79
* the query ID. Then, call {@link #reset()}.
77
80
* </li>
78
81
* <li>
79
- * <em>Optional</em>: if you are processing record by record (instead of micro-batches) and honoring {@link #isClosedForPartition ()},
82
+ * <em>Optional</em>: if you are processing record by record (instead of micro-batches) and honoring {@link #isClosed ()},
80
83
* you should check if {@link #isExceedingRateLimit()} is true after calling {@link #getData()}. If yes, you should
81
84
* cancel the query and emit a RateLimitError to the Join stage to kill the query. You can use {@link #getRateLimitError()}
82
85
* to get the {@link RateLimitError} to pass to the Join stage.
83
86
* </li>
84
87
* <li>
85
88
* <em>Optional</em>: If your data volume is very, very small (Heuristic: less than 1 per your 0.1 *
86
89
* bullet.query.window.min.emit.every.ms). across your partitions), you should run the {@link #isDone()} and
87
- * {@link #isClosedForPartition ()} and do the emits either on a timer or at fixed intervals so that your queries
90
+ * {@link #isClosed ()} and do the emits either on a timer or at fixed intervals so that your queries
88
91
* are checked for results and maintain their windowing guarantees.
89
92
* </li>
90
93
* </ol>
91
94
*
92
- * You can also use {@link #hasData()} to check if there is any data to emit if you need. If you do not want to call
93
- * {@link #getData()}, you can serialize Querier using non-native serialization frameworks and use {@link #merge(Monoidal)}
94
- * in the Join stage to merge them into an empty Querier for the query. This will be equivalent to calling
95
- * {@link #combine(byte[])} on {@link #getData()}. Just remember to not call {@link #initialize()}
95
+ * You can also use {@link #hasNewData()} to check if there is any new data to emit if you need to know a successful
96
+ * consumption or combining happened.
97
+ *
98
+ * If you do not want to call {@link #getData()}, you can serialize Querier using non-native serialization frameworks
99
+ * and use {@link #merge(Monoidal)} in the Join stage to merge them into an empty Querier for the query. This will be
100
+ * equivalent to calling {@link #combine(byte[])} on {@link #getData()}. Just remember to not call {@link #initialize()}
96
101
* on the reified querier objects on the Join side since that will wipe the existing results stored in them.
97
102
*
98
103
* <h4>Pseudo Code</h4>
115
120
* emit(q.getData())
116
121
* remove q
117
122
* else
118
- * if (q.isClosedForPartition ())
123
+ * if (q.isClosed ())
119
124
* emit(q.getData())
120
125
* q.reset()
121
126
* q.consume(record)
131
136
* if (q.isDone())
132
137
* emit(q.getData())
133
138
* remove q
134
- * if (q.isClosedForPartition ())
139
+ * if (q.isClosed ())
135
140
* emit(q.getData())
136
141
* q.reset()
137
142
* if (q.isExceedingRateLimit())
144
149
* <ol>
145
150
* <li>
146
151
* For each Query message from the PubSub, if it is a KILL signal similar to the Filter stage, kill the query and
147
- * return. Otherwise create an instance of {@link Querier} for the query. If any exceptions or errors initializing it,
148
- * make BulletError objects from them and return them as a {@link Clip} back through the PubSub.
152
+ * return. Otherwise create an instance of {@link Querier} for the query in {@link Mode#ALL} mode. If any exceptions
153
+ * or errors initializing it, make BulletError objects from them and return them as a {@link Clip} back through the PubSub.
149
154
* </li>
150
155
* <li>
151
156
* For each KILL message from the Filter stage, call {@link #finish()}, and add to the {@link Meta} a
258
263
*/
259
264
@ Slf4j
260
265
public class Querier implements Monoidal {
266
+ /**
267
+ * This is used to determine if this operates in partitioned mode or not. If the Querier is operating in
268
+ * {@link Mode#PARTITION}, it is assumed there are multiple queriers running in parallel and consuming parts of the
269
+ * data for the query. Use this if you are distributing the {@link #consume(BulletRecord)} calls across multiple
270
+ * machines. This fixes the semantics of the {@link #reset()} and the {@link #isClosed()} methods to keep the
271
+ * correct windowing semantics.
272
+ *
273
+ * If you are not distributing the data or recreating the querier instance in your parallelized step, you can
274
+ * leave this at the default of {@link Mode#ALL}.
275
+ */
276
+ public enum Mode {
277
+ PARTITION , ALL
278
+ }
279
+
261
280
public static final String TRY_AGAIN_LATER = "Please try again later" ;
262
281
263
282
// For testing convenience
@@ -271,11 +290,14 @@ public class Querier implements Monoidal {
271
290
private BulletConfig config ;
272
291
private Map <String , String > metaKeys ;
273
292
private String timestampKey ;
274
- private boolean hasData = false ;
293
+ private boolean hasNewData = false ;
275
294
276
295
// This is counting the number of times we get the data out of the query.
277
296
private RateLimiter rateLimit ;
278
297
298
+ // Mode for the querier
299
+ private Mode mode ;
300
+
279
301
/**
280
302
* Constructor that takes a String representation of the query and a configuration to use. This also starts the
281
303
* query.
@@ -286,9 +308,22 @@ public class Querier implements Monoidal {
286
308
* @throws JsonParseException if there was an issue parsing the query.
287
309
*/
288
310
public Querier (String id , String queryString , BulletConfig config ) throws JsonParseException {
289
- this (new RunningQuery (id , queryString , config ), config );
311
+ this (Mode . ALL , new RunningQuery (id , queryString , config ), config );
290
312
}
291
313
314
+ /**
315
+ * Constructor that takes a String representation of the query and a configuration to use. This also starts the
316
+ * query.
317
+ *
318
+ * @param mode The mode for this querier.
319
+ * @param id The query ID.
320
+ * @param queryString The query as a string.
321
+ * @param config The validated {@link BulletConfig} configuration to use.
322
+ * @throws JsonParseException if there was an issue parsing the query.
323
+ */
324
+ public Querier (Mode mode , String id , String queryString , BulletConfig config ) throws JsonParseException {
325
+ this (mode , new RunningQuery (id , queryString , config ), config );
326
+ }
292
327
/**
293
328
* Constructor that takes a {@link RunningQuery} instance and a configuration to use. This also starts executing
294
329
* the query.
@@ -297,6 +332,19 @@ public Querier(String id, String queryString, BulletConfig config) throws JsonPa
297
332
* @param config The validated {@link BulletConfig} configuration to use.
298
333
*/
299
334
public Querier (RunningQuery query , BulletConfig config ) {
335
+ this (Mode .ALL , query , config );
336
+ }
337
+
338
+ /**
339
+ * Constructor that takes a {@link Querier.Mode}, {@link RunningQuery} instance and a configuration to use.
340
+ * This also starts executing the query.
341
+ *
342
+ * @param mode The mode for this querier.
343
+ * @param query The running query.
344
+ * @param config The validated {@link BulletConfig} configuration to use.
345
+ */
346
+ public Querier (Mode mode , RunningQuery query , BulletConfig config ) {
347
+ this .mode = mode ;
300
348
this .runningQuery = query ;
301
349
this .config = config ;
302
350
}
@@ -362,7 +410,7 @@ public void consume(BulletRecord record) {
362
410
BulletRecord projected = project (record );
363
411
try {
364
412
window .consume (projected );
365
- hasData = true ;
413
+ hasNewData = true ;
366
414
} catch (RuntimeException e ) {
367
415
log .error ("Unable to consume {} for query {}" , record , this );
368
416
log .error ("Skipping due to" , e );
@@ -379,7 +427,7 @@ public void consume(BulletRecord record) {
379
427
public void combine (byte [] data ) {
380
428
try {
381
429
window .combine (data );
382
- hasData = true ;
430
+ hasNewData = true ;
383
431
} catch (RuntimeException e ) {
384
432
log .error ("Unable to aggregate {} for query {}" , data , this );
385
433
log .error ("Skipping due to" , e );
@@ -458,41 +506,32 @@ public Clip getResult() {
458
506
}
459
507
460
508
/**
461
- * Returns true if the query window is closed and you should emit the result at this time. If you have partitioned
462
- * the data, use {@link #isClosedForPartition()} .
509
+ * Depending on the {@link Mode#ALL} mode this is operating in, returns true if and only if the query window is
510
+ * closed and you should emit the result at this time .
463
511
*
464
- * @return boolean denoting if query has expired .
512
+ * @return boolean denoting if query has closed .
465
513
*/
466
514
@ Override
467
515
public boolean isClosed () {
468
- return window .isClosed ();
516
+ return mode == Mode . PARTITION ? window . isClosedForPartition () : window .isClosed ();
469
517
}
470
518
471
519
/**
472
520
* Resets this object. You should call this if you have called {@link #getResult()} or {@link #getData()} after
473
- * verifying whether this is {@link #isClosed()} or {@link #isClosedForPartition()} .
521
+ * verifying whether this is {@link #isClosed()}.
474
522
*/
475
523
@ Override
476
524
public void reset () {
477
- window .reset ();
478
- hasData = false ;
525
+ if (mode == Mode .PARTITION ) {
526
+ window .resetForPartition ();
527
+ } else {
528
+ window .reset ();
529
+ }
530
+ hasNewData = false ;
479
531
}
480
532
481
533
// ********************************* Public helpers *********************************
482
534
483
- /**
484
- * Returns true if the query has been consuming parts of the data (parallelized) and should emit the result
485
- * for that partition of data when operating that way. Use this if you have distributed the
486
- * {@link #consume(BulletRecord)} calls across multiple machines and you want to know if, for this particular kind
487
- * of query, whether it is necessary to emit results now. While not necessary to use, it would keep the
488
- * windowing semantics for the query correct to adhere to emitting when this is true.
489
- *
490
- * @return A boolean denoting whether there is data to emit for this query if it was reading part of the data.
491
- */
492
- public boolean isClosedForPartition () {
493
- return window .isClosedForPartition ();
494
- }
495
-
496
535
/**
497
536
* Returns true if the query has expired and will never accept any more data.
498
537
*
@@ -504,13 +543,14 @@ public boolean isDone() {
504
543
}
505
544
506
545
/**
507
- * Returns whether there is any data to emit at all. Use this method if you are driving how data is consumed by this
508
- * instance (for instance, microbatches) and need to emit data outside the windowing standards.
546
+ * Returns whether there is any new data to emit at all since the last {@link #reset()}. Use this method if you are
547
+ * driving how data is consumed by this instance (for instance, microbatches) and need to emit data outside the
548
+ * windowing standards.
509
549
*
510
- * @return A boolean denoting whether we have any data that can be emitted.
550
+ * @return A boolean denoting whether we have any new data that can be emitted.
511
551
*/
512
- public boolean hasData () {
513
- return hasData ;
552
+ public boolean hasNewData () {
553
+ return hasNewData ;
514
554
}
515
555
516
556
/**
0 commit comments