Trim Text

Overview

The Trim Text enrichment is a data cleanup operator that removes leading and trailing whitespace from every text attribute in your dataset. It eliminates accidental spaces, tabs, and other invisible characters that interfere with data matching, filtering, and analysis. Data from ERP systems, spreadsheets, and manual entry systems often contains unintentional whitespace that can distort process mining results.

Unlike manual data cleaning, this enrichment processes every text attribute in both case-level and event-level data in a single operation. Strings that are empty after trimming are converted to null values, preserving data integrity. This automatic cleanup is particularly valuable when preparing data for conformance checking, where exact text matches are required to identify process patterns and deviations.
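The per-value behavior described above can be sketched in Python. This is an illustrative sketch only; `trim_value` is a hypothetical helper, not part of the platform's API:

```python
def trim_value(value):
    """Trim leading/trailing whitespace from text; convert empty results to null."""
    if not isinstance(value, str):
        return value            # non-text values (numbers, dates, booleans) pass through
    trimmed = value.strip()     # removes spaces, tabs, and newlines at both ends
    return trimmed if trimmed else None  # whitespace-only strings become null

print(trim_value("  Acme Corp  "))  # -> "Acme Corp"
print(trim_value("   "))            # -> None
print(trim_value(42))               # -> 42 (untouched)
```

Internal spacing is never altered: only the ends of each string are trimmed.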

Common Uses

  • Clean imported data from ERP systems where fields contain trailing spaces due to fixed-width database columns
  • Standardize user-entered text fields from forms or manual data entry systems where operators accidentally add spaces
  • Prepare data for accurate matching and filtering operations by ensuring consistent text formatting
  • Remove invisible whitespace characters that can cause duplicate-looking values in dropdown filters
  • Clean activity names and resource names for accurate process discovery and conformance analysis
  • Normalize product codes, customer IDs, and reference numbers that may have inconsistent spacing
  • Prepare text attributes for concatenation or joining operations where extra spaces would create formatting issues
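The duplicate-collapse effect behind several of these use cases can be demonstrated with a small sketch (the sample product codes are illustrative, not taken from any real dataset):

```python
from collections import Counter

# Raw product codes with inconsistent leading/trailing spaces
codes = ["PRD-1234 ", " PRD-1234", "PRD-1234", "PRD-5678"]

apparent = Counter(codes)                   # spacing variants counted separately
actual = Counter(c.strip() for c in codes)  # variants collapse after trimming

print(len(apparent))  # -> 4 apparent distinct codes
print(len(actual))    # -> 2 true distinct codes
```

The same collapse applies to activity names, vendor names, and any other grouping key: values that look identical in a report but differ only in invisible whitespace are merged once trimmed.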

Settings

This enrichment operates automatically on all text attributes without requiring any configuration. It processes every string column in your dataset, applying trimming logic consistently across case attributes and event attributes.
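Conceptually, the enrichment sweeps every string field in every row. A minimal sketch, assuming rows are represented as dictionaries (`trim_dataset` is a hypothetical helper for illustration, not the platform's implementation):

```python
def trim_dataset(rows):
    """Trim every string field in a list of row dicts; empty strings become null."""
    cleaned = []
    for row in rows:
        cleaned.append({
            # strip() of a whitespace-only string is "", which `or None` maps to null
            key: (val.strip() or None) if isinstance(val, str) else val
            for key, val in row.items()
        })
    return cleaned

events = [
    {"Case ID": "PAT-101", "Activity": " Patient Registration", "Duration": 12},
    {"Case ID": "PAT-101", "Activity": "Triage ", "Duration": 8},
]
print(trim_dataset(events))
```

Note that non-string fields (such as `Duration` above) are left untouched, matching the enrichment's behavior of skipping non-text attributes.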

Examples

Example 1: Cleaning ERP System Export Data

Scenario: A manufacturing company exports order data from their SAP system where product codes and customer names contain trailing spaces due to fixed-width database fields, causing issues with product categorization and customer analysis.

Before Enrichment:

| Case ID | Product_Code | Customer_Name | Order_Status |
|---------|--------------|---------------|--------------|
| ORD-001 | "PRD-1234 " | "Acme Corp " | "APPROVED " |
| ORD-002 | " PRD-5678" | " Beta Inc " | "PENDING" |
| ORD-003 | "PRD-1234" | "Acme Corp" | "APPROVED" |

After Enrichment:

| Case ID | Product_Code | Customer_Name | Order_Status |
|---------|--------------|---------------|--------------|
| ORD-001 | "PRD-1234" | "Acme Corp" | "APPROVED" |
| ORD-002 | "PRD-5678" | "Beta Inc" | "PENDING" |
| ORD-003 | "PRD-1234" | "Acme Corp" | "APPROVED" |

Output: All text attributes are trimmed, removing leading and trailing spaces. Now products PRD-1234 from orders ORD-001 and ORD-003 are correctly identified as the same product, and customer names are consistently formatted.

Insights: After trimming, the company discovered that what appeared to be 150 unique product codes was actually only 95 distinct products. This accurate data enabled proper inventory analysis and revealed that Acme Corp accounted for 40% more orders than initially calculated due to proper name matching.

Example 2: Standardizing Manual Entry Data in Healthcare

Scenario: A hospital's patient admission system has activity names and department fields with inconsistent spacing from manual data entry, preventing accurate process flow analysis and department utilization metrics.

Event Data Before:

| Case ID | Activity | Department | Resource |
|---------|----------|------------|----------|
| PAT-101 | " Patient Registration" | "Emergency " | "Nurse Johnson " |
| PAT-101 | "Triage " | " Emergency" | "Dr. Smith" |
| PAT-102 | "Patient Registration" | "Emergency" | " Nurse Johnson" |

Event Data After:

| Case ID | Activity | Department | Resource |
|---------|----------|------------|----------|
| PAT-101 | "Patient Registration" | "Emergency" | "Nurse Johnson" |
| PAT-101 | "Triage" | "Emergency" | "Dr. Smith" |
| PAT-102 | "Patient Registration" | "Emergency" | "Nurse Johnson" |

Output: Activity names, departments, and resource names are standardized by removing all extra spaces. The process flow now correctly shows a single "Patient Registration" activity instead of appearing as two different activities.

Insights: The cleanup revealed the true patient flow through the emergency department, showing that 100% of patients follow the same initial registration process. Resource utilization reports now accurately show Nurse Johnson handles 75% of registrations instead of appearing as two different resources.

Example 3: Cleaning Financial Transaction Data

Scenario: A bank's loan processing system exports transaction types and approval codes with various whitespace issues from different branch systems, making it impossible to accurately track approval patterns and process compliance.

Case Attributes Before:

| Loan_ID | Loan_Type | Branch_Code | Approval_Level |
|---------|-----------|-------------|----------------|
| LN-5001 | "Personal Loan " | " NYC-01 " | "Manager " |
| LN-5002 | " Personal Loan" | "NYC-01" | "Manager" |
| LN-5003 | " Business Loan " | " LA-02" | " Director " |

Case Attributes After:

| Loan_ID | Loan_Type | Branch_Code | Approval_Level |
|---------|-----------|-------------|----------------|
| LN-5001 | "Personal Loan" | "NYC-01" | "Manager" |
| LN-5002 | "Personal Loan" | "NYC-01" | "Manager" |
| LN-5003 | "Business Loan" | "LA-02" | "Director" |

Output: All loan types, branch codes, and approval levels are consistently formatted. Personal Loans from LN-5001 and LN-5002 are now correctly grouped together, and branch codes are standardized for accurate regional analysis.

Insights: After cleaning, the bank discovered that Personal Loans represented 65% of their portfolio instead of the reported 43%, as various spacing variations had been counted as different loan types. This enabled proper risk assessment and resource allocation for their dominant product line.

Example 4: Normalizing Procurement Process Data

Scenario: A procurement system combines data from multiple vendor platforms where vendor names, material categories, and purchase order statuses contain inconsistent whitespace, preventing accurate spend analysis and vendor performance tracking.

Before Enrichment:

| PO_Number | Vendor_Name | Material_Category | Status |
|-----------|-------------|-------------------|---------|
| PO-8001 | "TechSupply Inc " | " Electronics " | "Delivered " |
| PO-8002 | " TechSupply Inc" | "Electronics" | " Delivered" |
| PO-8003 | "TechSupply Inc" | " Electronics" | "Pending" |

After Enrichment:

| PO_Number | Vendor_Name | Material_Category | Status |
|-----------|-------------|-------------------|---------|
| PO-8001 | "TechSupply Inc" | "Electronics" | "Delivered" |
| PO-8002 | "TechSupply Inc" | "Electronics" | "Delivered" |
| PO-8003 | "TechSupply Inc" | "Electronics" | "Pending" |

Output: Vendor names and material categories are standardized across all purchase orders. All three orders are now correctly associated with the same vendor and category.

Insights: The cleanup revealed that TechSupply Inc was actually the company's largest vendor with $2.3M in annual spend, not the three separate smaller vendors previously reported. This consolidation enabled better vendor negotiations and identified opportunities for volume discounts.

Example 5: Cleaning Activity Names for Process Discovery

Scenario: A logistics company's shipment tracking system has activity names with various spacing issues from different scanning devices and manual entries, making process discovery show fragmented and incorrect process flows.

Event Log Before:

| Case_ID | Activity | Location | Timestamp |
|---------|----------|----------|-----------|
| SHIP-901 | "Package Received " | "Warehouse A " | 2024-01-10 08:00 |
| SHIP-901 | " Sorting" | "Warehouse A" | 2024-01-10 09:00 |
| SHIP-902 | "Package Received" | " Warehouse A" | 2024-01-10 08:30 |
| SHIP-902 | "Sorting " | "Warehouse A " | 2024-01-10 09:30 |

Event Log After:

| Case_ID | Activity | Location | Timestamp |
|---------|----------|----------|-----------|
| SHIP-901 | "Package Received" | "Warehouse A" | 2024-01-10 08:00 |
| SHIP-901 | "Sorting" | "Warehouse A" | 2024-01-10 09:00 |
| SHIP-902 | "Package Received" | "Warehouse A" | 2024-01-10 08:30 |
| SHIP-902 | "Sorting" | "Warehouse A" | 2024-01-10 09:30 |

Output: All activity names and locations are trimmed to remove whitespace variations. The process now shows a clean, linear flow of Package Received followed by Sorting for all shipments.

Insights: Process discovery now correctly shows a standard two-step process for all packages instead of eight different activity variations. This revealed that 100% of packages follow the same initial handling process, enabling the company to standardize training and optimize resource allocation at Warehouse A.

Output

The Trim Text enrichment modifies existing text attributes in place rather than creating new attributes. All string-type columns in your dataset are automatically processed, including both case-level attributes and event-level attributes. The enrichment applies the following transformations:

Text Processing Rules:

  • Removes all leading whitespace (spaces, tabs, and other invisible characters at the start of text)
  • Removes all trailing whitespace (spaces, tabs, and other invisible characters at the end of text)
  • Preserves internal spaces within the text (only beginning and end are trimmed)
  • Converts empty strings (strings that become empty after trimming) to null values
  • Leaves already-trimmed text unchanged for optimal performance
  • Skips non-text attributes (numbers, dates, booleans remain untouched)
  • Skips hidden columns to preserve system data
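These rules can be summarized as a small table of inputs and expected outputs. A sketch using Python's built-in `str.strip()` (which the trimming behavior resembles; the actual platform implementation is not specified here):

```python
# (raw value, expected result) pairs covering each processing rule
RULE_EXAMPLES = [
    ("\t PRD-1234 \n", "PRD-1234"),   # leading/trailing tabs and newlines removed
    ("Acme  Corp",     "Acme  Corp"), # internal spacing preserved
    ("   ",            None),         # whitespace-only strings become null
    ("Done",           "Done"),       # already-trimmed text unchanged
]

for raw, expected in RULE_EXAMPLES:
    result = raw.strip() or None      # empty string after strip -> null
    assert result == expected
print("all trimming rules hold")
```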

The enrichment works seamlessly with other mindzieStudio features. Trimmed text attributes can be immediately used in filters for accurate matching, in calculators for precise concatenation operations, and in other enrichments that depend on consistent text formatting. Since the enrichment modifies data in place, all existing visualizations, dashboards, and analyses automatically benefit from the cleaned data without requiring any reconfiguration.

For downstream processing, the cleaned text ensures that conformance checking operators correctly identify matching activities, lookup enrichments find accurate matches across datasets, and group-by operations properly aggregate related cases. The null conversion for empty strings prevents issues with database operations and ensures that empty values are handled consistently throughout the platform.


This documentation is part of the mindzie Studio process mining platform.
