File Stream Data Persist Issue #2284
-
Hello StreamPipes,

The file I upload is attached. I then add the timestamp and choose to persist events. When I create a new data view in the Data Explorer, I find the number of rows is much less than expected. Doing some investigation, I found that the number of rows matches the # Events shown under "Settings - Data Lake". However, checking the Data Lake Metrics, I see 20,480 events consumed, as expected.

It only worked properly once; other than that, it keeps returning a different, inconsistent number of rows each time. Could you please help me get this right, or at least understand this behavior?

Thank you,
-
Hi @AkthemRehab,

thanks for your question. The problem you are facing is due to the fact that the File Stream Adapter is primarily designed for testing purposes. It operates by taking a CSV file and replaying its contents. When there is no "original" timestamp present in the data, the adapter generates a new timestamp for each row as it is read. Because the rows are read in quick succession, many events end up sharing the same timestamp.

In a time series database, the timestamp is used as an index: when two events arrive with an identical timestamp, the later one overwrites the earlier one. As a result, your data lake sink receives the correct number of events, but not all of them end up stored in the database.

To address this, add an 'original' timestamp column to your CSV data. StreamPipes will then use that timestamp, and you should observe the desired outcome.

Does this explanation clarify the issue for you?

Cheers,
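To make the overwriting effect concrete, here is a minimal Python sketch. It is not StreamPipes or data lake code; the "store" is just a dict keyed by timestamp, standing in for a timestamp-indexed time series database, and all names in it are made up for illustration:

```python
import time

def replay_without_original_timestamp(rows):
    """Simulate the adapter stamping each row at read time.

    Replaying a file is fast, so many rows receive the same
    millisecond timestamp.
    """
    return [{"timestamp": int(time.time() * 1000), **row} for row in rows]

def persist(events):
    """Simulate a timestamp-indexed store: identical timestamps overwrite."""
    store = {}
    for event in events:
        store[event["timestamp"]] = event  # same key -> previous event lost
    return store

rows = [{"value": i} for i in range(20480)]

# Case 1: no original timestamp -> timestamps collide, rows are lost.
store = persist(replay_without_original_timestamp(rows))
print(f"consumed: {len(rows)}, persisted: {len(store)}")
# e.g. "consumed: 20480, persisted: 3" (all rows read within a few ms)

# Case 2: each row carries its own 'original' timestamp -> nothing collides.
rows_with_ts = [{"timestamp": 1700000000000 + i * 1000, "value": i}
                for i in range(20480)]
store = persist(rows_with_ts)
print(f"consumed: {len(rows_with_ts)}, persisted: {len(store)}")
# "consumed: 20480, persisted: 20480"
```

This matches what you observed: the metrics count every consumed event, while the row count in the Data Explorer reflects only the distinct timestamps that survived in the store.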
-
No, that's something else. Let's discuss it in the other discussion.