By default, PowerExchange CDC Publisher generates a local checkpoint file after it sends the first change operation. As data streaming progresses, CDC Publisher saves information about the last change operation processed to the checkpoint file.
If you set the connector property Connector.checkpointsInTarget to true in the cdcPublisherKafka.cfg file, CDC Publisher stores checkpoints in the Kafka headers and, periodically, in the checkpoint file. The two copies might be out of sync. In this case, the checkpoint file serves as a backup if the topics that contain checkpoint information are purged or missing. For more information, see Considerations for Storing Checkpoints in Kafka.
Checkpoint information is written to each Kafka message that CDC Publisher writes. The backup checkpoint file is written at the intervals set in one or both of the following properties in the cdcPublisherKafka.cfg configuration file:
Connector.checkpointMessageFrequency
Connector.checkpointTimeFrequency
If neither property is specified with a value greater than 0, no periodic checkpoints are written to the backup checkpoint file.
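As a rough illustration of how the two frequency properties might interact, the following Python sketch models the decision to write the backup checkpoint file. It is not CDC Publisher code. The function names, the key=value parsing of the configuration text, the sample values, and the assumption that the time frequency is expressed in seconds are all hypothetical and should be checked against the property reference.

```python
def parse_cfg(text):
    """Parse simple key=value lines, skipping blanks and # comments.

    The key=value syntax is an assumption about the configuration file.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def backup_due(props, messages_since_backup, seconds_since_backup):
    """Return True if either configured interval has been reached."""
    msg_freq = int(props.get("Connector.checkpointMessageFrequency", 0))
    time_freq = int(props.get("Connector.checkpointTimeFrequency", 0))
    # Neither property greater than 0: no periodic backup checkpoints.
    if msg_freq <= 0 and time_freq <= 0:
        return False
    by_count = msg_freq > 0 and messages_since_backup >= msg_freq
    by_time = time_freq > 0 and seconds_since_backup >= time_freq
    return by_count or by_time


# Hypothetical sample values, for illustration only.
SAMPLE_CFG = """
Connector.checkpointsInTarget=true
Connector.checkpointMessageFrequency=1000
Connector.checkpointTimeFrequency=0
"""
```

With the sample values, only the message count triggers a backup write, because the time-based interval is set to 0 and is therefore treated as disabled.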
If you set the Connector.checkpointsInTarget property to true, when the CDC Publisher restarts, it reads the topics to find the last checkpoint that was stored and then uses it to restart.
Before using Kafka to store checkpoints, review the following considerations.
The Kafka version must be 0.11.0.2 or later.
CDC Publisher assumes that all of the messages in a topic are from the same CDC Publisher instance. If multiple CDC Publisher instances write to the same topics in the same Kafka target, corruption of the target data can occur.
To shut down CDC Publisher, you must run the PwxCDCAdmin utility with the SHUTDOWN parameter. The SHUTDOWN parameter forces CDC Publisher to synchronize the last written Kafka checkpoint with the backup checkpoint file.
If you do not use the PwxCDCAdmin SHUTDOWN command to shut down CDC Publisher, duplicate data might be written to the target when CDC Publisher starts if the Kafka topics are empty, deleted, or corrupted. If no Kafka headers are found at startup, CDC Publisher uses a restart point based on the checkpoint information in the backup checkpoint file. The older the backup checkpoint is, the more duplicate data might be written when CDC Publisher restarts.
When CDC Publisher starts, it collects information about the existing Kafka topics and topic partitions. The latest Kafka checkpoint information is used as a starting point for PowerExchange CDC capture or as the starting point for CDC Publisher extraction of changes from the Logger log files. If previously written messages are not found because of topic removal or content truncation, checkpoint information might be compromised and CDC Publisher might not restart from the correct location. This situation can result in the incorrect startup of CDC Publisher as follows:
If CDC Publisher does not find the topic that contains the latest checkpoint for restart, it uses the last checkpoint from another topic. In this case, the restart point is compromised, which might cause CDC Publisher to send duplicate data to the target.
If CDC Publisher does not find topics that contain checkpoint data, which indicates that all topics were deleted or truncated, a restart point cannot be determined. Because no restart point exists that indicates which data has not been processed, data might be lost.
If the last checkpoint value written to Kafka is missing because the topic to which data was last written has been deleted or truncated, you can optionally use the backup checkpoint file contents to restart CDC Publisher, as described in Restarting a PowerExchange CDC Publisher Change Data Stream.
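The restart scenarios described above can be modeled with a small Python sketch. It is a simplified illustration, not the CDC Publisher implementation. The Checkpoint class, its integer sequence field, and the choose_restart_point function are hypothetical; real checkpoint values are opaque to the user, and falling back to the backup file is an optional action rather than an automatic one in every case.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Checkpoint:
    # Hypothetical stand-in for a restart position; higher means later.
    sequence: int


def choose_restart_point(
    topic_checkpoints: Dict[str, Optional[Checkpoint]],
    backup: Optional[Checkpoint],
) -> Tuple[str, Optional[Checkpoint]]:
    """Pick a restart point from surviving topics, then the backup file."""
    found = [cp for cp in topic_checkpoints.values() if cp is not None]
    if found:
        # The latest checkpoint among the topics that still exist wins.
        return "kafka", max(found, key=lambda cp: cp.sequence)
    if backup is not None:
        # No checkpoint data in Kafka: fall back to the backup file.
        return "backup", backup
    # Nothing to restart from: the position cannot be determined.
    return "none", None


# All topics present: restart from the latest checkpoint (9).
all_topics = {"orders": Checkpoint(5), "customers": Checkpoint(9)}
# Topic holding the latest checkpoint was deleted: an older checkpoint (5)
# is chosen, so changes between 5 and 9 would be sent again.
after_delete = {"orders": Checkpoint(5)}
```

In the second case the older checkpoint is selected even though a more recent backup checkpoint might exist, which is why a deleted topic can lead to duplicate data on the target.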
If you need to change the CDC Publisher ID that is specified by the connector.checkpointPublisherId property, first shut down CDC Publisher by using the PwxCDCAdmin SHUTDOWN command and ensure that a backup checkpoint file exists.
After you change the ID, restart CDC Publisher. Because the restart process does not find the new CDC Publisher ID in the Kafka topics, it uses the backup checkpoint file to determine the restart point. Thereafter, CDC Publisher uses the new CDC Publisher ID in the new messages that it writes to the target.
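The following Python sketch illustrates why a clean shutdown matters before an ID change. It assumes, hypothetically, that each checkpoint header can be represented as a (publisher ID, position) pair and that lookup filters on the configured ID. The function names and the sample IDs are illustrative only.

```python
def find_kafka_checkpoint(headers, publisher_id):
    """Return the latest position written under publisher_id, or None."""
    positions = [pos for pid, pos in headers if pid == publisher_id]
    return max(positions) if positions else None


def restart_position(headers, publisher_id, backup_position):
    """Prefer the Kafka checkpoint; otherwise use the backup file value."""
    kafka_pos = find_kafka_checkpoint(headers, publisher_id)
    return kafka_pos if kafka_pos is not None else backup_position


# Headers written under the old ID (hypothetical values).
headers = [("PUB_OLD", 10), ("PUB_OLD", 42)]

# After a clean SHUTDOWN the backup file is synchronized to position 42,
# so a restart under the new ID resumes from 42 with no duplicates.
clean_restart = restart_position(headers, "PUB_NEW", backup_position=42)

# Without a clean shutdown the backup might lag (for example, at 30),
# and changes between 30 and 42 would be sent again.
stale_restart = restart_position(headers, "PUB_NEW", backup_position=30)
```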