Remove Duplicate Events

Overview

The Remove Duplicate Events enrichment is a data quality tool that automatically identifies and removes duplicate events from your process cases. When the same event appears multiple times within a case with identical attribute values (activity name, timestamp, and all other event attributes), this enrichment eliminates the redundant copies, keeping only the first occurrence.

This enrichment is particularly valuable when working with data from multiple source systems, data integration processes, or legacy systems where duplicate events may be inadvertently created. By removing these duplicates, you ensure that your process analysis reflects the actual process execution rather than data quality issues, leading to accurate cycle times, activity frequencies, and process flow visualizations.

Unlike other activity-related enrichments that modify or categorize events, this enrichment physically removes duplicate event records from your event log, permanently cleaning your dataset. The enrichment compares all event attributes from the original data source (not calculated or derived attributes) to determine if two events are truly identical.
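
As a rough illustration of this comparison rule, the sketch below (plain Python, with hypothetical attribute names) treats two events as duplicates only when every original source attribute matches exactly:

    # Minimal sketch of the duplicate rule, assuming each event is a dict of
    # its original source attributes (field names are hypothetical).
    def is_duplicate(event_a: dict, event_b: dict) -> bool:
        # Two events are duplicates only if every attribute matches exactly.
        return event_a == event_b

    e1 = {"case_id": "12345", "activity": "Order Received",
          "timestamp": "2024-03-15 09:00:00", "amount": 1500}
    e2 = dict(e1)                      # identical copy -> duplicate
    e3 = {**e1, "amount": 1600}        # one attribute differs -> not a duplicate

    print(is_duplicate(e1, e2))        # True
    print(is_duplicate(e1, e3))        # False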

Common Uses

  • Clean datasets imported from multiple source systems that may contain duplicate event records
  • Remove redundant events created by data integration processes or ETL pipelines
  • Eliminate duplicate activity recordings caused by system errors or data synchronization issues
  • Improve data quality before performing process mining analysis to ensure accurate metrics
  • Prepare datasets for conformance checking by removing noise from duplicate events
  • Clean historical data that has accumulated duplicates over time due to legacy system issues
  • Ensure accurate activity frequency counts and cycle time measurements by eliminating duplicate event noise

Settings

This enrichment requires no configuration settings. It is a one-click operation that automatically scans all events within each case and removes any duplicates it finds.

The enrichment uses a comparison algorithm (sketched in code after this list) that:

  • Compares all original source data attributes for each event (activity name, timestamp, case ID, and any other event-level attributes)
  • Ignores calculated or derived attributes added by previous enrichments
  • Keeps the first occurrence of each unique event
  • Removes subsequent duplicate events that match all attribute values
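
A minimal pandas sketch of this behavior, assuming the log is already in chronological order and that original_columns names the source attributes (all column names here are hypothetical):

    import pandas as pd

    # Hypothetical event log; 'row_number_enriched' stands in for a calculated
    # attribute added by a previous enrichment.
    log = pd.DataFrame({
        "case_id":             ["C1", "C1", "C1"],
        "activity":            ["Order Received", "Order Received", "Order Shipped"],
        "timestamp":           ["2024-03-15 09:00:00", "2024-03-15 09:00:00",
                                "2024-03-15 14:00:00"],
        "row_number_enriched": [1, 2, 3],  # differs between the duplicates, but is ignored
    })

    # Compare only original source columns and keep the first occurrence;
    # because case_id is compared, deduplication is effectively per case.
    original_columns = ["case_id", "activity", "timestamp"]
    deduplicated = log.drop_duplicates(subset=original_columns, keep="first")

    print(deduplicated)  # the second 'Order Received' row is removed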

To use this enrichment:

  1. From any analysis, open 'Log Enrichment' by clicking 'Log Enrichment' in the top right
  2. Click 'Add New' to create a new enrichment
  3. Select 'Remove Duplicate Events' from the Activities section
  4. Click 'Create' - no additional configuration is needed
  5. Click 'Calculate Enrichment' to process your dataset

Examples

Example 1: Multi-System Order Processing

Scenario: An e-commerce company imports order data from three different systems: the web storefront, the warehouse management system, and the accounting system. Due to data integration issues, some order events appear multiple times when the same order was recorded by multiple systems with identical timestamps and values.

Settings:

  • No configuration required - the enrichment automatically detects and removes all duplicate events

Output: Before enrichment, a sample case might contain these events:

  • 2024-03-15 09:00:00 - Order Received - Order#12345 - Customer: ABC Corp - Amount: $1,500
  • 2024-03-15 09:00:00 - Order Received - Order#12345 - Customer: ABC Corp - Amount: $1,500 (duplicate)
  • 2024-03-15 10:30:00 - Payment Processed - Order#12345 - Amount: $1,500
  • 2024-03-15 10:30:00 - Payment Processed - Order#12345 - Amount: $1,500 (duplicate)
  • 2024-03-15 14:00:00 - Order Shipped - Order#12345

After enrichment, the duplicate events are removed:

  • 2024-03-15 09:00:00 - Order Received - Order#12345 - Customer: ABC Corp - Amount: $1,500
  • 2024-03-15 10:30:00 - Payment Processed - Order#12345 - Amount: $1,500
  • 2024-03-15 14:00:00 - Order Shipped - Order#12345

Insights: The company can now accurately measure process performance. The order-to-shipment cycle time is correctly calculated as 5 hours, activity frequency counts reflect actual process execution rather than data quality issues, and step durations are no longer distorted by zero-duration duplicate intervals.
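
As a usage illustration, running the same deduplication logic over this sample case in pandas (hypothetical column names) reduces the five recorded events to three and confirms the 5-hour cycle time:

    import pandas as pd

    events = pd.DataFrame({
        "case_id":   ["12345"] * 5,
        "timestamp": ["2024-03-15 09:00:00", "2024-03-15 09:00:00",
                      "2024-03-15 10:30:00", "2024-03-15 10:30:00",
                      "2024-03-15 14:00:00"],
        "activity":  ["Order Received", "Order Received",
                      "Payment Processed", "Payment Processed",
                      "Order Shipped"],
    })

    clean = events.drop_duplicates(keep="first")
    print(len(events), "->", len(clean))      # 5 -> 3
    print(clean["activity"].value_counts())   # each activity now counted once

    # Order-to-shipment cycle time on the cleaned case: 5 hours.
    ts = pd.to_datetime(clean["timestamp"])
    print(ts.max() - ts.min())                # 0 days 05:00:00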

Example 2: Healthcare Patient Journey

Scenario: A hospital consolidates patient data from their EHR system, radiology system, and pharmacy system. During migration from a legacy system, some patient events were duplicated, causing patient journey timelines to show the same procedure multiple times and inflating activity counts.

Settings:

  • No configuration required

Output: A patient case before enrichment:

  • 2024-06-20 08:00:00 - Patient Admission - Patient ID: P9876 - Ward: Cardiology
  • 2024-06-20 09:15:00 - Blood Test Ordered - Test Type: CBC
  • 2024-06-20 09:15:00 - Blood Test Ordered - Test Type: CBC (duplicate from lab system)
  • 2024-06-20 11:30:00 - ECG Performed - Result: Normal
  • 2024-06-20 11:30:00 - ECG Performed - Result: Normal (duplicate from radiology system)
  • 2024-06-20 15:00:00 - Medication Prescribed - Drug: Aspirin
  • 2024-06-20 15:00:00 - Medication Prescribed - Drug: Aspirin (duplicate from pharmacy system)
  • 2024-06-21 10:00:00 - Patient Discharge

After enrichment, duplicates are removed:

  • 2024-06-20 08:00:00 - Patient Admission - Patient ID: P9876 - Ward: Cardiology
  • 2024-06-20 09:15:00 - Blood Test Ordered - Test Type: CBC
  • 2024-06-20 11:30:00 - ECG Performed - Result: Normal
  • 2024-06-20 15:00:00 - Medication Prescribed - Drug: Aspirin
  • 2024-06-21 10:00:00 - Patient Discharge

Insights: The hospital can now accurately track patient pathways and calculate true wait times between procedures. Resource utilization metrics reflect actual activity volumes rather than inflated numbers from duplicate records.

Example 3: Manufacturing Production Line

Scenario: A manufacturing plant uses SCADA systems that occasionally log the same machine operation twice due to network synchronization issues. These duplicate events distort production analytics by inflating operation counts and skewing duration measurements.

Settings:

  • No configuration required

Output: Production case before enrichment:

  • 2024-05-10 06:00:00 - Material Loaded - Batch: B1234 - Machine: Press-01
  • 2024-05-10 06:05:00 - Press Operation Start - Batch: B1234
  • 2024-05-10 06:05:00 - Press Operation Start - Batch: B1234 (network duplicate)
  • 2024-05-10 06:45:00 - Press Operation Complete - Batch: B1234
  • 2024-05-10 06:45:00 - Press Operation Complete - Batch: B1234 (network duplicate)
  • 2024-05-10 07:00:00 - Quality Inspection - Result: Pass
  • 2024-05-10 07:15:00 - Material Unloaded - Batch: B1234

After enrichment:

  • 2024-05-10 06:00:00 - Material Loaded - Batch: B1234 - Machine: Press-01
  • 2024-05-10 06:05:00 - Press Operation Start - Batch: B1234
  • 2024-05-10 06:45:00 - Press Operation Complete - Batch: B1234
  • 2024-05-10 07:00:00 - Quality Inspection - Result: Pass
  • 2024-05-10 07:15:00 - Material Unloaded - Batch: B1234

Insights: Production cycle time calculations are now accurate. The plant can reliably measure machine utilization and identify true bottlenecks without noise from duplicate event records.

Example 4: Financial Transaction Processing

Scenario: A bank's transaction processing system occasionally creates duplicate log entries when transactions are processed through both the real-time system and the batch reconciliation system. These duplicates need to be removed before analyzing transaction patterns and compliance.

Settings:

  • No configuration required

Output: Transaction case before enrichment:

  • 2024-07-15 14:30:00 - Transaction Initiated - Amount: $5,000 - Account: 12345
  • 2024-07-15 14:30:05 - Fraud Check Performed - Risk Score: Low
  • 2024-07-15 14:30:05 - Fraud Check Performed - Risk Score: Low (duplicate from reconciliation)
  • 2024-07-15 14:30:10 - Authorization Approved - Auth Code: A789
  • 2024-07-15 14:30:10 - Authorization Approved - Auth Code: A789 (duplicate from reconciliation)
  • 2024-07-15 14:30:15 - Transaction Completed - Status: Success

After enrichment:

  • 2024-07-15 14:30:00 - Transaction Initiated - Amount: $5,000 - Account: 12345
  • 2024-07-15 14:30:05 - Fraud Check Performed - Risk Score: Low
  • 2024-07-15 14:30:10 - Authorization Approved - Auth Code: A789
  • 2024-07-15 14:30:15 - Transaction Completed - Status: Success

Insights: The bank can now accurately measure transaction processing times and identify true delays in their system. Compliance reporting shows actual activity counts rather than inflated numbers from duplicate records.

Example 5: IT Service Management

Scenario: An IT service desk imports ticket data from multiple monitoring systems. When incidents are escalated between systems, the same status-change events sometimes appear multiple times, inflating event counts and distorting resolution-time analysis.

Settings:

  • No configuration required

Output: Incident case before enrichment:

  • 2024-08-22 10:00:00 - Incident Created - Ticket: INC0012345 - Priority: High
  • 2024-08-22 10:15:00 - Assigned to L1 Support - Agent: John Smith
  • 2024-08-22 10:30:00 - Escalated to L2 - Reason: Complex Issue
  • 2024-08-22 10:30:00 - Escalated to L2 - Reason: Complex Issue (duplicate from escalation system)
  • 2024-08-22 11:45:00 - Issue Resolved - Resolution: Network Config Fix
  • 2024-08-22 11:45:00 - Issue Resolved - Resolution: Network Config Fix (duplicate from escalation system)
  • 2024-08-22 12:00:00 - Incident Closed - Satisfaction: 5/5

After enrichment:

  • 2024-08-22 10:00:00 - Incident Created - Ticket: INC0012345 - Priority: High
  • 2024-08-22 10:15:00 - Assigned to L1 Support - Agent: John Smith
  • 2024-08-22 10:30:00 - Escalated to L2 - Reason: Complex Issue
  • 2024-08-22 11:45:00 - Issue Resolved - Resolution: Network Config Fix
  • 2024-08-22 12:00:00 - Incident Closed - Satisfaction: 5/5

Insights: The IT department can now accurately measure mean time to resolution (MTTR) and identify true performance bottlenecks in their incident management process without duplicate events skewing the timeline analysis.

Output

The Remove Duplicate Events enrichment modifies your event log by physically removing duplicate event records. Unlike enrichments that add new attributes to your dataset, this enrichment reduces the total number of events in your log.

What Gets Removed:

  • Any event that has identical values for all original source data attributes (activity name, timestamp, case ID, and all other event attributes) compared to a previous event in the same case
  • Only the duplicate occurrences are removed; the first occurrence of each unique event is always retained

What Stays:

  • The first occurrence of each unique event
  • Events that differ in any attribute value (even if timestamps or activity names match)
  • All calculated attributes and enrichment results from previous enrichments

Impact on Your Dataset:

  • Event Count: The total number of events in your log decreases based on how many duplicates are found
  • Case Count: The number of cases remains unchanged
  • Activity Statistics: Activity frequency counts become more accurate, reflecting actual process execution
  • Cycle Times: Duration calculations between activities become more precise without duplicate events creating zero-duration intervals (demonstrated in the sketch after this list)
  • Process Flow: Process maps and variant analysis show cleaner, more accurate process flows
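
The zero-duration effect is easy to demonstrate with a short sketch (hypothetical columns): a duplicate with an identical timestamp produces a zero-length gap between consecutive events, which disappears after deduplication:

    import pandas as pd

    log = pd.DataFrame({
        "case_id":   ["C1"] * 4,
        "activity":  ["Press Start", "Press Start", "Press Complete", "Inspection"],
        "timestamp": pd.to_datetime(["2024-05-10 06:05", "2024-05-10 06:05",
                                     "2024-05-10 06:45", "2024-05-10 07:00"]),
    })

    # Gaps between consecutive events, before and after deduplication.
    print(log["timestamp"].diff().dropna())    # includes a zero-length gap
    clean = log.drop_duplicates(keep="first")
    print(clean["timestamp"].diff().dropna())  # 40 min, then 15 min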

Important Notes:

  • This enrichment permanently removes duplicate events from your working dataset. If you need to preserve the original data with duplicates, create a backup or use a dataset snapshot before applying this enrichment (see the copy-first sketch after this list).
  • The enrichment only compares original source data columns, not calculated or derived attributes added by previous enrichments
  • Events are considered duplicates only if ALL original attribute values match exactly
  • The enrichment processes events in chronological order, always keeping the first occurrence
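
If you experiment with this logic outside the product, a simple way to honor the first note above is to keep an untouched copy before deduplicating, as in this minimal sketch:

    import pandas as pd

    log = pd.DataFrame({"case_id": ["C1", "C1"],
                        "activity": ["Order Received", "Order Received"],
                        "timestamp": ["09:00", "09:00"]})

    original_log = log.copy()                  # snapshot before cleaning
    log = log.drop_duplicates(keep="first")
    removed = len(original_log) - len(log)
    print(f"{removed} duplicate event(s) removed; original kept in original_log")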

Using the Cleaned Data: After running this enrichment, you can:

  • Perform accurate process discovery without noise from duplicate events
  • Calculate reliable performance metrics and KPIs
  • Conduct conformance checking on clean data
  • Create accurate process visualizations and dashboards
  • Combine with other enrichments knowing your baseline data is clean

See Also

Related data quality enrichments:

  • Remove Repeated Activities - Removes consecutive occurrences of the same activity (different from this enrichment, which removes exact duplicate events)
  • Sort Log on Start Time - Ensures events are in correct chronological order before analysis
  • Hide Attribute - Remove unnecessary attributes from your analysis view
  • Filter Process Log - Remove specific cases or events based on criteria
  • Anonymize - Remove or obscure sensitive information in event attributes

For more information on data quality best practices:

  • Data Quality Best Practices - Guidelines for preparing clean process data
  • Log Enrichment Overview - Understanding the enrichment workflow in mindzieStudio

This documentation is part of the mindzie Studio process mining platform.
