Remove Duplicate Events

Overview

The Remove Duplicate Events enrichment is a data quality tool that automatically identifies and removes duplicate events from your process cases. When the same event appears multiple times within a case with identical attribute values (activity name, timestamp, and all other event attributes), this enrichment eliminates the redundant copies, keeping only the first occurrence.

This enrichment is particularly valuable when working with data from multiple source systems, data integration processes, or legacy systems where duplicate events may be inadvertently created. By removing these duplicates, you ensure that your process analysis reflects the actual process execution rather than data quality issues, leading to accurate cycle times, activity frequencies, and process flow visualizations.

Unlike other activity-related enrichments that modify or categorize events, this enrichment physically removes duplicate event records from your event log, permanently cleaning your dataset. The enrichment compares all event attributes from the original data source (not calculated or derived attributes) to determine if two events are truly identical.
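
As a rough illustration of this comparison rule, the sketch below (plain Python, with hypothetical attribute names) treats two events as duplicates only when every original source attribute matches exactly:

    # Minimal sketch of the duplicate rule, assuming each event is a dict of
    # its original source attributes (field names are hypothetical).
    def is_duplicate(event_a: dict, event_b: dict) -> bool:
        # Two events are duplicates only if every attribute matches exactly.
        return event_a == event_b

    e1 = {"case_id": "12345", "activity": "Order Received",
          "timestamp": "2024-03-15 09:00:00", "amount": 1500}
    e2 = dict(e1)                      # identical copy -> duplicate
    e3 = {**e1, "amount": 1600}        # one attribute differs -> not a duplicate

    print(is_duplicate(e1, e2))        # True
    print(is_duplicate(e1, e3))        # False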

Common Uses

  • Clean datasets imported from multiple source systems that may contain duplicate event records
  • Remove redundant events created by data integration processes or ETL pipelines
  • Eliminate duplicate activity recordings caused by system errors or data synchronization issues
  • Improve data quality before performing process mining analysis to ensure accurate metrics
  • Prepare datasets for conformance checking by removing noise from duplicate events
  • Clean historical data that has accumulated duplicates over time due to legacy system issues
  • Ensure accurate activity frequency counts and cycle time measurements by eliminating duplicate event noise

Settings

This enrichment requires no configuration settings. It is a one-click operation that automatically scans all events within each case and removes any duplicates it finds.

The enrichment uses a comparison algorithm (sketched in code after this list) that:

  • Compares all original source data attributes for each event (activity name, timestamp, case ID, and any other event-level attributes)
  • Ignores calculated or derived attributes added by previous enrichments
  • Keeps the first occurrence of each unique event
  • Removes subsequent duplicate events that match all attribute values
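
A minimal pandas sketch of this behavior, assuming the log is already in chronological order and that original_columns names the source attributes (all column names here are hypothetical):

    import pandas as pd

    # Hypothetical event log; 'row_number_enriched' stands in for a calculated
    # attribute added by a previous enrichment.
    log = pd.DataFrame({
        "case_id":             ["C1", "C1", "C1"],
        "activity":            ["Order Received", "Order Received", "Order Shipped"],
        "timestamp":           ["2024-03-15 09:00:00", "2024-03-15 09:00:00",
                                "2024-03-15 14:00:00"],
        "row_number_enriched": [1, 2, 3],  # differs between the duplicates, but is ignored
    })

    # Compare only original source columns and keep the first occurrence;
    # because case_id is compared, deduplication is effectively per case.
    original_columns = ["case_id", "activity", "timestamp"]
    deduplicated = log.drop_duplicates(subset=original_columns, keep="first")

    print(deduplicated)  # the second 'Order Received' row is removed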

To use this enrichment:

  1. From any analysis, open 'Log Enrichment' by clicking 'Log Enrichment' in the top right
  2. Click 'Add New' to create a new enrichment
  3. Select 'Remove Duplicate Events' from the Activities section
  4. Click 'Create' - no additional configuration is needed
  5. Click 'Calculate Enrichment' to process your dataset

Examples

Example 1: Multi-System Order Processing

Scenario: An e-commerce company imports order data from three different systems: the web storefront, the warehouse management system, and the accounting system. Due to data integration issues, some order events appear multiple times when the same order was recorded by multiple systems with identical timestamps and values.

Settings:

  • No configuration required - the enrichment automatically detects and removes all duplicate events

Output: Before enrichment, a sample case might contain these events:

  • 2024-03-15 09:00:00 - Order Received - Order#12345 - Customer: ABC Corp - Amount: $1,500
  • 2024-03-15 09:00:00 - Order Received - Order#12345 - Customer: ABC Corp - Amount: $1,500 (duplicate)
  • 2024-03-15 10:30:00 - Payment Processed - Order#12345 - Amount: $1,500
  • 2024-03-15 10:30:00 - Payment Processed - Order#12345 - Amount: $1,500 (duplicate)
  • 2024-03-15 14:00:00 - Order Shipped - Order#12345

After enrichment, the duplicate events are removed:

  • 2024-03-15 09:00:00 - Order Received - Order#12345 - Customer: ABC Corp - Amount: $1,500
  • 2024-03-15 10:30:00 - Payment Processed - Order#12345 - Amount: $1,500
  • 2024-03-15 14:00:00 - Order Shipped - Order#12345

Insights: The company can now accurately measure process performance. The order-to-shipment cycle time is correctly calculated as 5 hours, activity frequency counts reflect actual process execution rather than data quality issues, and step durations are no longer distorted by zero-duration duplicate intervals.
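
As a usage illustration, running the same deduplication logic over this sample case in pandas (hypothetical column names) reduces the five recorded events to three and confirms the 5-hour cycle time:

    import pandas as pd

    events = pd.DataFrame({
        "case_id":   ["12345"] * 5,
        "timestamp": ["2024-03-15 09:00:00", "2024-03-15 09:00:00",
                      "2024-03-15 10:30:00", "2024-03-15 10:30:00",
                      "2024-03-15 14:00:00"],
        "activity":  ["Order Received", "Order Received",
                      "Payment Processed", "Payment Processed",
                      "Order Shipped"],
    })

    clean = events.drop_duplicates(keep="first")
    print(len(events), "->", len(clean))      # 5 -> 3
    print(clean["activity"].value_counts())   # each activity now counted once

    # Order-to-shipment cycle time on the cleaned case: 5 hours.
    ts = pd.to_datetime(clean["timestamp"])
    print(ts.max() - ts.min())                # 0 days 05:00:00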

Example 2: Healthcare Patient Journey

Scenario: A hospital consolidates patient data from their EHR system, radiology system, and pharmacy system. During migration from a legacy system, some patient events were duplicated, causing patient journey timelines to show the same procedure multiple times and inflating activity counts.

Settings:

  • No configuration required

Output: A patient case before enrichment:

  • 2024-06-20 08:00:00 - Patient Admission - Patient ID: P9876 - Ward: Cardiology
  • 2024-06-20 09:15:00 - Blood Test Ordered - Test Type: CBC
  • 2024-06-20 09:15:00 - Blood Test Ordered - Test Type: CBC (duplicate from lab system)
  • 2024-06-20 11:30:00 - ECG Performed - Result: Normal
  • 2024-06-20 11:30:00 - ECG Performed - Result: Normal (duplicate from radiology system)
  • 2024-06-20 15:00:00 - Medication Prescribed - Drug: Aspirin
  • 2024-06-20 15:00:00 - Medication Prescribed - Drug: Aspirin (duplicate from pharmacy system)
  • 2024-06-21 10:00:00 - Patient Discharge

After enrichment, duplicates are removed:

  • 2024-06-20 08:00:00 - Patient Admission - Patient ID: P9876 - Ward: Cardiology
  • 2024-06-20 09:15:00 - Blood Test Ordered - Test Type: CBC
  • 2024-06-20 11:30:00 - ECG Performed - Result: Normal
  • 2024-06-20 15:00:00 - Medication Prescribed - Drug: Aspirin
  • 2024-06-21 10:00:00 - Patient Discharge

Insights: The hospital can now accurately track patient pathways and calculate true wait times between procedures. Resource utilization metrics reflect actual activity volumes rather than inflated numbers from duplicate records.

Example 3: Manufacturing Production Line

Scenario: A manufacturing plant uses SCADA systems that occasionally log the same machine operation twice due to network synchronization issues. These duplicate events distort production analytics by inflating operation counts and skewing duration measurements.

Settings:

  • No configuration required

Output: Production case before enrichment:

  • 2024-05-10 06:00:00 - Material Loaded - Batch: B1234 - Machine: Press-01
  • 2024-05-10 06:05:00 - Press Operation Start - Batch: B1234
  • 2024-05-10 06:05:00 - Press Operation Start - Batch: B1234 (network duplicate)
  • 2024-05-10 06:45:00 - Press Operation Complete - Batch: B1234
  • 2024-05-10 06:45:00 - Press Operation Complete - Batch: B1234 (network duplicate)
  • 2024-05-10 07:00:00 - Quality Inspection - Result: Pass
  • 2024-05-10 07:15:00 - Material Unloaded - Batch: B1234

After enrichment:

  • 2024-05-10 06:00:00 - Material Loaded - Batch: B1234 - Machine: Press-01
  • 2024-05-10 06:05:00 - Press Operation Start - Batch: B1234
  • 2024-05-10 06:45:00 - Press Operation Complete - Batch: B1234
  • 2024-05-10 07:00:00 - Quality Inspection - Result: Pass
  • 2024-05-10 07:15:00 - Material Unloaded - Batch: B1234

Insights: Production cycle time calculations are now accurate. The plant can reliably measure machine utilization and identify true bottlenecks without noise from duplicate event records.

Example 4: Financial Transaction Processing

Scenario: A bank's transaction processing system occasionally creates duplicate log entries when transactions are processed through both the real-time system and the batch reconciliation system. These duplicates need to be removed before analyzing transaction patterns and compliance.

Settings:

  • No configuration required

Output: Transaction case before enrichment:

  • 2024-07-15 14:30:00 - Transaction Initiated - Amount: $5,000 - Account: 12345
  • 2024-07-15 14:30:05 - Fraud Check Performed - Risk Score: Low
  • 2024-07-15 14:30:05 - Fraud Check Performed - Risk Score: Low (duplicate from reconciliation)
  • 2024-07-15 14:30:10 - Authorization Approved - Auth Code: A789
  • 2024-07-15 14:30:10 - Authorization Approved - Auth Code: A789 (duplicate from reconciliation)
  • 2024-07-15 14:30:15 - Transaction Completed - Status: Success

After enrichment:

  • 2024-07-15 14:30:00 - Transaction Initiated - Amount: $5,000 - Account: 12345
  • 2024-07-15 14:30:05 - Fraud Check Performed - Risk Score: Low
  • 2024-07-15 14:30:10 - Authorization Approved - Auth Code: A789
  • 2024-07-15 14:30:15 - Transaction Completed - Status: Success

Insights: The bank can now accurately measure transaction processing times and identify true delays in their system. Compliance reporting shows actual activity counts rather than inflated numbers from duplicate records.

Example 5: IT Service Management

Scenario: An IT service desk imports ticket data from multiple monitoring systems. When incidents are escalated between systems, the same status-change events sometimes appear multiple times, inflating event counts and distorting resolution-time analysis.

Settings:

  • No configuration required

Output: Incident case before enrichment:

  • 2024-08-22 10:00:00 - Incident Created - Ticket: INC0012345 - Priority: High
  • 2024-08-22 10:15:00 - Assigned to L1 Support - Agent: John Smith
  • 2024-08-22 10:30:00 - Escalated to L2 - Reason: Complex Issue
  • 2024-08-22 10:30:00 - Escalated to L2 - Reason: Complex Issue (duplicate from escalation system)
  • 2024-08-22 11:45:00 - Issue Resolved - Resolution: Network Config Fix
  • 2024-08-22 11:45:00 - Issue Resolved - Resolution: Network Config Fix (duplicate from escalation system)
  • 2024-08-22 12:00:00 - Incident Closed - Satisfaction: 5/5

After enrichment:

  • 2024-08-22 10:00:00 - Incident Created - Ticket: INC0012345 - Priority: High
  • 2024-08-22 10:15:00 - Assigned to L1 Support - Agent: John Smith
  • 2024-08-22 10:30:00 - Escalated to L2 - Reason: Complex Issue
  • 2024-08-22 11:45:00 - Issue Resolved - Resolution: Network Config Fix
  • 2024-08-22 12:00:00 - Incident Closed - Satisfaction: 5/5

Insights: The IT department can now accurately measure mean time to resolution (MTTR) and identify true performance bottlenecks in their incident management process without duplicate events skewing the timeline analysis.

Output

The Remove Duplicate Events enrichment modifies your event log by physically removing duplicate event records. Unlike enrichments that add new attributes to your dataset, this enrichment reduces the total number of events in your log.

What Gets Removed:

  • Any event that has identical values for all original source data attributes (activity name, timestamp, case ID, and all other event attributes) compared to a previous event in the same case
  • Only the duplicate occurrences are removed; the first occurrence of each unique event is always retained

What Stays:

  • The first occurrence of each unique event
  • Events that differ in any attribute value (even if timestamps or activity names match)
  • All calculated attributes and enrichment results from previous enrichments

Impact on Your Dataset:

  • Event Count: The total number of events in your log decreases based on how many duplicates are found
  • Case Count: The number of cases remains unchanged
  • Activity Statistics: Activity frequency counts become more accurate, reflecting actual process execution
  • Cycle Times: Duration calculations between activities become more precise without duplicate events creating zero-duration intervals (demonstrated in the sketch after this list)
  • Process Flow: Process maps and variant analysis show cleaner, more accurate process flows
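
The zero-duration effect is easy to demonstrate with a short sketch (hypothetical columns): a duplicate with an identical timestamp produces a zero-length gap between consecutive events, which disappears after deduplication:

    import pandas as pd

    log = pd.DataFrame({
        "case_id":   ["C1"] * 4,
        "activity":  ["Press Start", "Press Start", "Press Complete", "Inspection"],
        "timestamp": pd.to_datetime(["2024-05-10 06:05", "2024-05-10 06:05",
                                     "2024-05-10 06:45", "2024-05-10 07:00"]),
    })

    # Gaps between consecutive events, before and after deduplication.
    print(log["timestamp"].diff().dropna())    # includes a zero-length gap
    clean = log.drop_duplicates(keep="first")
    print(clean["timestamp"].diff().dropna())  # 40 min, then 15 min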

Important Notes:

  • This enrichment permanently removes duplicate events from your working dataset. If you need to preserve the original data with duplicates, create a backup or use a dataset snapshot before applying this enrichment (see the copy-first sketch after this list).
  • The enrichment only compares original source data columns, not calculated or derived attributes added by previous enrichments
  • Events are considered duplicates only if ALL original attribute values match exactly
  • The enrichment processes events in chronological order, always keeping the first occurrence
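
If you experiment with this logic outside the product, a simple way to honor the first note above is to keep an untouched copy before deduplicating, as in this minimal sketch:

    import pandas as pd

    log = pd.DataFrame({"case_id": ["C1", "C1"],
                        "activity": ["Order Received", "Order Received"],
                        "timestamp": ["09:00", "09:00"]})

    original_log = log.copy()                  # snapshot before cleaning
    log = log.drop_duplicates(keep="first")
    removed = len(original_log) - len(log)
    print(f"{removed} duplicate event(s) removed; original kept in original_log")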

Using the Cleaned Data: After running this enrichment, you can:

  • Perform accurate process discovery without noise from duplicate events
  • Calculate reliable performance metrics and KPIs
  • Conduct conformance checking on clean data
  • Create accurate process visualizations and dashboards
  • Combine with other enrichments knowing your baseline data is clean

See Also

Related data quality enrichments:

  • Remove Repeated Activities - Removes consecutive occurrences of the same activity (different from this enrichment, which removes exact duplicate events)
  • Sort Log on Start Time - Ensures events are in correct chronological order before analysis
  • Hide Attribute - Remove unnecessary attributes from your analysis view
  • Filter Process Log - Remove specific cases or events based on criteria
  • Anonymize - Remove or obscure sensitive information in event attributes

For more information on data quality best practices:

  • Data Quality Best Practices - Guidelines for preparing clean process data
  • Log Enrichment Overview - Understanding the enrichment workflow in mindzieStudio

This documentation is part of the mindzie Studio process mining platform.
