How to configure cron so file is not processed multiple times (duplicate file) #91
I ran into the same issue when I upgraded Confluent from 6.0.1 to 7.0.1; everything was fine in the previous version. I suspect it is related to this change in the Kafka Connect offset-commit logic: apache/kafka#11323. Below are the TRACE logs for reference; I added notes starting with # to make the investigation easier.
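The race being described can be sketched in isolation: the policy re-lists files on every poll, but a file's committed offset only becomes visible to the policy some time after processing (offset flush plus read-back lag). Every poll that runs before that happens sees the file as unprocessed. This is an illustrative model, not the connector's actual code, and `reprocess_count` is a hypothetical helper:

```python
import math

def reprocess_count(poll_interval_ms: int, offset_visible_after_ms: int) -> int:
    """How many polls see a file as unprocessed, assuming polls happen at
    t = 0, poll_interval, 2*poll_interval, ... and the file's committed
    offset only becomes visible at t = offset_visible_after_ms."""
    # Every poll strictly before the offset is visible reprocesses the file.
    return math.floor(offset_visible_after_ms / poll_interval_ms) + 1

# With a 10 s poll interval and offsets visible ~15 s after the first pass,
# the file is picked up twice before the policy sees it as processed.
print(reprocess_count(10_000, 15_000))  # 2
```

This also matches the observation below that a longer sleep reduces, but does not eliminate, the duplicates: widening the poll interval shrinks the count toward 1 but never below it.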
@mmolimar I humbly beseech you to read the above and provide some feedback/insight.
I'm struggling with the same issue on 7.3.0. I tried rolling back to 6.0.1 and hit it there too. It happens with every policy: each file gets processed multiple times. My docker-compose setup:

---
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.0
    container_name: zookeeper
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  broker:
    image: confluentinc/cp-kafka:7.3.0
    container_name: broker
    ports:
      - "9092:9092"
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_INTERNAL:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092,PLAINTEXT_INTERNAL://broker:29092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
  kafka-connect:
    image: confluentinc/cp-kafka-connect:7.2.3
    container_name: kafka-connect
    depends_on:
      - zookeeper
      - broker
    ports:
      - 8083:8083
    environment:
      CONNECT_BOOTSTRAP_SERVERS: "broker:29092"
      CONNECT_REST_PORT: 8083
      CONNECT_GROUP_ID: compose-connect-group
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_REST_ADVERTISED_HOST_NAME: "kafka-connect"
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
      CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.storage.StringConverter
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: "1"
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: "1"
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: "1"
      CONNECT_OFFSET_FLUSH_INTERVAL_MS: 1000
      CONNECT_PLUGIN_PATH: '/usr/share/java,/usr/share/confluent-hub-components/'
    command:
      - bash
      - -c
      - |
        confluent-hub install --no-prompt mmolimar/kafka-connect-fs:1.3.0
        /etc/confluent/docker/run &
        sleep infinity
    volumes:
      - /tmp:/data

Setting the sleep timer to a higher value seems to reduce the number of times each file gets reprocessed, but it doesn't stop it. Even if I set it to several minutes, each file still gets processed 2 or 3 times before it's moved. Connector config:

curl \
-i -X PUT -H "Accept:application/json" \
-H "Content-Type:application/json" http://localhost:8083/connectors/xml-filesystem-00/config \
-d '{
"connector.class":"com.github.mmolimar.kafka.connect.fs.FsSourceConnector",
"tasks.max": 1,
"fs.uris":"file:///data/unprocessed",
"topic":"xml-ingest",
"policy.class":"com.github.mmolimar.kafka.connect.fs.policy.SleepyPolicy",
"policy.sleepy.sleep":"10000",
"policy.recursive":"true",
"policy.regexp":".*\\.xml$",
"policy.batch_size":"-1",
"policy.cleanup":"move",
"policy.cleanup.move":"file:///data/processed",
"file_reader.class":"com.github.mmolimar.kafka.connect.fs.file.reader.TextFileReader",
"file_reader.batch_size":"-1"
}'

kafka-connect | [2023-02-01 16:18:54,427] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test.xml, length = 24, blocks = [[offset = 0, length = 24, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:18:54,437] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test2.xml, length = 9, blocks = [[offset = 0, length = 9, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:18:54,440] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test3.xml, length = 19, blocks = [[offset = 0, length = 19, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:18:54,444] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test4.xml, length = 280, blocks = [[offset = 0, length = 280, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:18:54,447] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test5.xml, length = 19, blocks = [[offset = 0, length = 19, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:18:54,450] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test6.xml, length = 12, blocks = [[offset = 0, length = 12, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:18:54,459] INFO [Producer clientId=connector-producer-source-xml-filesystem-00-0] Resetting the last seen epoch of partition xml-ingest-0 to 0 since the associated topicId changed from null to B_WprujwS8aE5VHedkX7BQ (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,521] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test.xml, length = 24, blocks = [[offset = 0, length = 24, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:19:04,523] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test2.xml, length = 9, blocks = [[offset = 0, length = 9, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:19:04,525] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test3.xml, length = 19, blocks = [[offset = 0, length = 19, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:19:04,528] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test4.xml, length = 280, blocks = [[offset = 0, length = 280, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:19:04,529] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test5.xml, length = 19, blocks = [[offset = 0, length = 19, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:19:04,533] INFO FsSourceTask Processing records for file [path = file:/data/unprocessed/test6.xml, length = 12, blocks = [[offset = 0, length = 12, corrupt = false]]]... (com.github.mmolimar.kafka.connect.fs.FsSourceTask)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-0 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-5 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-10 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-20 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-15 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-9 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-11 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-16 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-4 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-17 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-3 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-24 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-23 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-13 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-18 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-22 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-2 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-8 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-12 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-19 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-14 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-1 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-6 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-7 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:04,866] INFO [Producer clientId=producer-1] Resetting the last seen epoch of partition docker-connect-offsets-21 to 0 since the associated topicId changed from null to DWF138FARpqXLsWwspFMXA (org.apache.kafka.clients.Metadata)
kafka-connect | [2023-02-01 16:19:14,588] INFO SleepyPolicy Moving file [file:/data/unprocessed/test.xml] to [file:/data/processed/test.xml] due to it was already processed. (com.github.mmolimar.kafka.connect.fs.policy.SleepyPolicy)
kafka-connect | [2023-02-01 16:19:14,640] INFO SleepyPolicy Moving file [file:/data/unprocessed/test2.xml] to [file:/data/processed/test2.xml] due to it was already processed. (com.github.mmolimar.kafka.connect.fs.policy.SleepyPolicy)
kafka-connect | [2023-02-01 16:19:14,666] INFO SleepyPolicy Moving file [file:/data/unprocessed/test3.xml] to [file:/data/processed/test3.xml] due to it was already processed. (com.github.mmolimar.kafka.connect.fs.policy.SleepyPolicy)
kafka-connect | [2023-02-01 16:19:14,696] INFO SleepyPolicy Moving file [file:/data/unprocessed/test4.xml] to [file:/data/processed/test4.xml] due to it was already processed. (com.github.mmolimar.kafka.connect.fs.policy.SleepyPolicy)
kafka-connect | [2023-02-01 16:19:14,729] INFO SleepyPolicy Moving file [file:/data/unprocessed/test5.xml] to [file:/data/processed/test5.xml] due to it was already processed. (com.github.mmolimar.kafka.connect.fs.policy.SleepyPolicy)
kafka-connect | [2023-02-01 16:19:14,754] INFO SleepyPolicy Moving file [file:/data/unprocessed/test6.xml] to [file:/data/processed/test6.xml] due to it was already processed. (com.github.mmolimar.kafka.connect.fs.policy.SleepyPolicy)
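Reading the timestamps in the log above: the files are listed at 16:18:54 and again at 16:19:04 (one 10 s sleepy interval apart), and only moved at 16:19:14, so each file is processed exactly twice before cleanup. A quick check of that arithmetic, using only the timestamps from the log:

```python
from datetime import datetime

fmt = "%H:%M:%S"
# Listing passes and the move, taken from the connector log above.
listings = [datetime.strptime(t, fmt) for t in ("16:18:54", "16:19:04")]
moved = datetime.strptime("16:19:14", fmt)

# Listings are one sleepy interval (10 s) apart, and every listing
# that happens before the move reprocesses the files.
interval_s = (listings[1] - listings[0]).total_seconds()
passes_before_move = sum(1 for t in listings if t < moved)
print(interval_s, passes_before_move)  # 10.0 2
```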
This might be the same as #89, but I've tried several cron expressions and nothing works. I haven't changed
offset.flush.interval.ms
so it uses the default 60000 ms (as stated here). This might be a bug in the deletion process, since I've tried various cron expressions:
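For what it's worth, `offset.flush.interval.ms` is a worker-level setting, not a connector setting, so it goes in the worker properties file (or via the `CONNECT_OFFSET_FLUSH_INTERVAL_MS` environment variable in the Confluent Docker images). A sketch of shrinking the commit window, value chosen for illustration:

```properties
# connect-distributed.properties (worker config, not the connector JSON)
offset.flush.interval.ms=10000
```

Note that even a short flush interval doesn't fully close the window, since the policy can still re-list files between processing and the next flush.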
Using Kafka from Docker (docker.io/bitnami/kafka:3) with the Kafka Connect Docker image confluentinc/cp-kafka-connect:7.0.1.
To replicate:
Expected:
Actual:
The Kafka consumer output for each schedule (consumer output and Kafka Connect logs not shown here):

- Every minute (`1 0/1 * ? * * *`): file processed twice before being deleted.
- Every 30 seconds (`0/30 * * ? * * *`): file processed 3 times before being deleted.
- Every 5 minutes (`1 0/5 * ? * * *`): file processed twice before being deleted.
No error / warn log on kafka broker
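For anyone else trying the cron route: the connector ships a CronPolicy whose schedule is a Quartz-style expression. A sketch of the relevant connector-config fragment, with property names taken from the kafka-connect-fs README (double-check against your installed version):

```json
{
  "policy.class": "com.github.mmolimar.kafka.connect.fs.policy.CronPolicy",
  "policy.cron.expression": "0 0/1 * ? * * *"
}
```

As the reports above suggest, the schedule only controls how often files are listed; it does not by itself prevent a file from being picked up again before its offset commit becomes visible.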