Column Info

Overview

The Column Info calculator provides detailed metadata and statistics about all attributes (columns) in your event log dataset. This administrative tool displays comprehensive information about each attribute, including data types, value distributions, null counts, and sample values.

IMPORTANT: This is an administrator-only calculator designed for technical analysis and research purposes. It is not optimized for production use and may take significant time to process large datasets. Regular users should use the Dataset Information calculator for general dataset overview needs.

This calculator is primarily used by system administrators, data analysts, and technical users who need deep insights into data structure and quality for troubleshooting, data validation, or dataset optimization.

Common Uses

  • Perform comprehensive data quality audits across all attributes in the event log
  • Identify attributes with high percentages of null or missing values
  • Analyze data type consistency and detect potential type conversion issues
  • Review attribute cardinality (number of unique values) to identify candidates for categorical analysis
  • Validate data extraction results by examining attribute-level statistics
  • Diagnose performance issues by identifying attributes with unexpected value distributions
  • Document dataset schema and characteristics for technical specifications

Settings

This calculator requires no specific configuration settings. When executed, it automatically scans all attributes in the current dataset (both case-level and event-level) and generates comprehensive statistics for each.

Note: Processing time depends on dataset size and the number of attributes. For very large datasets, this calculator may take several minutes to complete.

Examples

Example 1: Data Quality Audit After ETL

Scenario: You have just completed an ETL process to extract order-to-cash data from your ERP system. Before releasing the dataset to business users, you need to verify that all attributes were extracted correctly and assess data completeness.

Settings:

  • Title: "Post-ETL Data Quality Validation"
  • Description: "O2C dataset - January 2025 extraction"

Output:

The calculator displays a comprehensive table with the following information for each attribute:

Attribute Name | Type  | Total Values | Null Count | Null % | Unique Values | Sample Values
-------------- | ----- | ------------ | ---------- | ------ | ------------- | -------------------------------------
CaseID         | Case  | 2,456        | 0          | 0%     | 2,456         | ORD-001, ORD-002, ORD-003
CustomerName   | Case  | 2,456        | 12         | 0.5%   | 847           | Acme Corp, TechStart Inc, Global...
OrderAmount    | Case  | 2,456        | 0          | 0%     | 1,823         | 1250.00, 3400.50, 875.25
Region         | Case  | 2,456        | 156        | 6.4%   | 4             | North, South, East, West
ActivityName   | Event | 18,945       | 0          | 0%     | 15            | Create Order, Approve Order, Ship...
Timestamp      | Event | 18,945       | 0          | 0%     | 18,893        | 2025-01-15 08:23:00, 2025-01-15...
ApprovalLevel  | Event | 18,945       | 8,234      | 43.5%  | 3             | L1, L2, L3
Department     | Event | 18,945       | 3,456      | 18.2%  | 8             | Sales, Finance, Operations...

Insights: The audit reveals several data quality concerns. The Region attribute has 6.4% null values affecting 156 cases - these need manual review or data correction. More critically, the ApprovalLevel attribute has 43.5% null values, which may indicate that not all activities require approval (expected) or that approval data is missing for activities that should have it (requires investigation). The low null rate in CustomerName (0.5%) is acceptable and may reflect test orders. All critical identifiers (CaseID, Timestamp) have zero nulls, confirming data integrity.
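
The same audit can be approximated outside the platform on an exported copy of the case table. A minimal pandas sketch; the file name, column set, and 5% review threshold are illustrative assumptions, not anything the calculator itself exposes:

```python
import pandas as pd

# Illustrative export of the case table; not a mindzie file format.
cases = pd.read_csv("o2c_cases.csv")

# Null count and percentage for every case-level attribute.
audit = pd.DataFrame({
    "null_count": cases.isna().sum(),
    "null_pct": (cases.isna().mean() * 100).round(1),
})

# Flag attributes above an illustrative 5% review threshold.
print(audit[audit["null_pct"] > 5].sort_values("null_pct", ascending=False))
```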

Example 2: Performance Troubleshooting

Scenario: Users are reporting slow performance when filtering on certain attributes. You need to identify which attributes have high cardinality (many unique values) that might be causing inefficient filtering.

Settings:

  • Title: "Attribute Cardinality Analysis"
  • Description: "Investigating filter performance issues"

Output:

Attribute Name | Type  | Total Values | Unique Values | Cardinality Ratio | Data Type
-------------- | ----- | ------------ | ------------- | ----------------- | ---------
CaseID         | Case  | 45,678       | 45,678        | 100%              | String
TransactionID  | Event | 367,824      | 367,824       | 100%              | String
UserComments   | Event | 367,824      | 89,234        | 24.3%             | String
ProductSKU     | Event | 367,824      | 12,456        | 3.4%              | String
Status         | Case  | 45,678       | 8             | 0.02%             | String
Priority       | Case  | 45,678       | 3             | 0.007%            | String

Insights: The analysis reveals a wide range of cardinality across attributes. CaseID and TransactionID have 100% cardinality (every value is unique), making them excellent for case identification but poor candidates for categorical filtering. UserComments has unexpectedly high cardinality (24.3%), suggesting it contains free-form text rather than standardized values - filtering on this attribute will be slow and may benefit from full-text search optimization. In contrast, Status (8 values) and Priority (3 values) are ideal for efficient filtering. This analysis helps optimize filter design and guides users toward high-performance attribute selections.
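
The cardinality ratio is easy to reproduce on an exported event table. A short pandas sketch (the file name is a placeholder; the formula matches the Unique Values / Total Values definition used above):

```python
import pandas as pd

events = pd.read_csv("events.csv")  # illustrative export

summary = pd.DataFrame({
    "total_values": len(events),      # broadcast to every attribute
    "unique_values": events.nunique(),
})
summary["cardinality_pct"] = (
    summary["unique_values"] / summary["total_values"] * 100
).round(2)

# Attributes near 100% are identifiers; those near 0% suit categorical filters.
print(summary.sort_values("cardinality_pct", ascending=False))
```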

Example 3: Schema Documentation for Integration

Scenario: You need to provide technical documentation to a third-party vendor who will be integrating with your process mining environment. They need detailed information about available attributes, data types, and expected value ranges.

Settings:

  • Title: "Purchase-to-Pay Schema Documentation"
  • Description: "Technical specification for API integration"

Output:

Attribute Name | Attribute Type | Data Type | Total Values | Unique Values | Null Count | Sample Values
-------------- | -------------- | --------- | ------------ | ------------- | ---------- | ----------------------------
PO_Number      | Case           | String    | 8,945        | 8,945         | 0          | PO-2025-00001, PO-2025-00002
Vendor_ID      | Case           | String    | 8,945        | 234           | 0          | V12345, V67890, V45678
Total_Amount   | Case           | Decimal   | 8,945        | 7,823         | 0          | 15750.50, 2340.00, 987.25
Currency       | Case           | String    | 8,945        | 3             | 12         | USD, EUR, GBP
RequestDate    | Case           | DateTime  | 8,945        | 2,456         | 0          | 2025-01-15, 2025-01-16
Activity       | Event          | String    | 71,560       | 12            | 0          | Create PO, Approve PO, Send...
Resource       | Event          | String    | 71,560       | 145           | 234        | john.smith, sarah.jones...
Cost_Center    | Event          | String    | 71,560       | 67            | 1,234      | CC-1001, CC-2045, CC-3012

Insights: The schema documentation shows that PO_Number is the primary case identifier with guaranteed uniqueness and no nulls. All monetary values use the Total_Amount field (decimal type) with Currency specified separately. The process supports three currencies (USD, EUR, GBP) with 12 cases missing currency data that need correction. Resource information is available for 145 unique users but has 234 null values at the event level, indicating some automated activities. The Cost_Center attribute has 1.7% null values, suggesting incomplete data entry for certain activities. This comprehensive view enables accurate integration planning.
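
If the vendor prefers a machine-readable schema over a table, the same statistics can be assembled from a dataset extract. A sketch under illustrative assumptions (CSV export and output path are hypothetical):

```python
import pandas as pd

df = pd.read_csv("p2p_cases.csv")  # illustrative export

schema = pd.DataFrame({
    "data_type": df.dtypes.astype(str),
    "total_values": len(df),
    "unique_values": df.nunique(),
    "null_count": df.isna().sum(),
    # Up to three distinct non-null values per attribute as format examples.
    "sample_values": [
        ", ".join(df[c].dropna().astype(str).unique()[:3]) for c in df.columns
    ],
})

# Machine-readable schema the vendor can consume alongside the written spec.
schema.to_json("p2p_schema.json", orient="index", indent=2)
```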

Example 4: Detecting Data Type Inconsistencies

Scenario: After merging data from multiple source systems, you suspect there may be data type inconsistencies that could cause calculation errors or unexpected behavior in analyses.

Settings:

  • Title: "Data Type Consistency Check"
  • Description: "Multi-source data validation"

Output:

Attribute Name  | Detected Type | Total Values | Type Conflicts | Sample Inconsistent Values
--------------- | ------------- | ------------ | -------------- | ---------------------------------
OrderDate       | DateTime      | 5,678        | 0              | -
OrderValue      | Mixed         | 5,678        | 23             | "1250.50", "$1,250.50", "1250,50"
QuantityOrdered | Integer       | 5,678        | 8              | "100", "100.0", "100 units"
CustomerID      | String        | 5,678        | 0              | -
IsRush          | Mixed         | 5,678        | 145            | "Yes", "Y", "1", "true", "TRUE"

Insights: The analysis uncovered critical data type inconsistencies. The OrderValue attribute contains mixed formatting - some values include currency symbols and different decimal separators (comma vs period), requiring data cleansing before calculations. QuantityOrdered shows 8 instances where text was appended ("100 units"), which will cause errors in numeric aggregations. The IsRush flag has five different representations of boolean values, requiring standardization to "true/false" or "1/0" for reliable filtering. These issues must be resolved in the ETL process before the data can be used reliably.
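
A quick way to surface such conflicts on a raw extract is to read every column as text and test which values survive strict numeric parsing. A minimal sketch; the file and column names come from this example and are illustrative:

```python
import pandas as pd

# Read everything as raw text so mixed formats survive inspection.
df = pd.read_csv("merged_orders.csv", dtype=str)  # illustrative file

def numeric_conflicts(series: pd.Series) -> pd.Series:
    """Values present in the raw data that fail strict numeric parsing."""
    parsed = pd.to_numeric(series, errors="coerce")
    return series[series.notna() & parsed.isna()]

# "OrderValue" is the illustrative column name from the table above.
conflicts = numeric_conflicts(df["OrderValue"])
print(f"{len(conflicts)} conflicts, e.g. {list(conflicts.unique()[:3])}")
```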

Example 5: Identifying Enrichment Opportunities

Scenario: You want to identify attributes with low cardinality that would benefit from enrichment with additional descriptive information to make analyses more user-friendly.

Settings:

  • Title: "Enrichment Opportunity Analysis"
  • Description: "Identifying candidates for lookup enrichment"

Output:

Attribute Name | Type  | Unique Values | Null % | Sample Values       | Enrichment Potential
-------------- | ----- | ------------- | ------ | ------------------- | ------------------------------
ProductCode    | Event | 45            | 0%     | P001, P002, P003    | HIGH - Add product names
StatusCode     | Case  | 8             | 0%     | ST-01, ST-02, ST-03 | HIGH - Add status descriptions
RegionCode     | Case  | 4             | 0%     | R1, R2, R3, R4      | HIGH - Add region names
CurrencyCode   | Case  | 3             | 0%     | USD, EUR, GBP       | MEDIUM - Generally understood
EmployeeID     | Event | 234           | 2.1%   | E12345, E67890      | HIGH - Add employee names

Insights: Several attributes contain codes that would benefit from enrichment. With only 45 unique product codes, adding product names would make analyses far more readable for business users. The 8 status codes should be enriched with plain-language descriptions to avoid users needing to reference code sheets. Employee IDs should be enriched with names while maintaining privacy compliance. These enrichments will significantly improve the user experience without adding substantial data volume.
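
Once candidates are identified, the typical fix is a lookup join during ETL. A small pandas sketch, assuming a hypothetical product lookup file; neither file is a mindzie artifact:

```python
import pandas as pd

events = pd.read_csv("events.csv")          # illustrative export
lookup = pd.read_csv("product_lookup.csv")  # columns: ProductCode, ProductName

# Left join keeps every event; codes missing from the lookup surface as nulls.
enriched = events.merge(lookup, on="ProductCode", how="left")
unmatched = enriched["ProductName"].isna().sum()
print(f"{unmatched} events reference a ProductCode with no lookup entry")
```

Checking the unmatched count after the join catches stale lookup tables before the enriched attribute reaches business users.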

Example 6: Monitoring Data Completeness Over Time

Scenario: You run regular data extractions and want to monitor whether data completeness is improving or degrading over time by comparing current extraction statistics with previous baselines.

Settings:

  • Title: "Data Completeness Monitoring - February 2025"
  • Description: "Compare with January baseline"

Output:

Attribute Name | Type  | Jan Null % | Feb Null % | Change | Trend
-------------- | ----- | ---------- | ---------- | ------ | --------
ApproverName   | Event | 5.2%       | 3.1%       | -2.1%  | IMPROVED
Department     | Case  | 8.4%       | 8.9%       | +0.5%  | DEGRADED
CostCenter     | Event | 12.3%      | 18.7%      | +6.4%  | DEGRADED
Priority       | Case  | 1.2%       | 1.1%       | -0.1%  | STABLE
DueDate        | Case  | 15.6%      | 9.2%       | -6.4%  | IMPROVED

Insights: The comparison reveals mixed data quality trends. ApproverName null percentages decreased from 5.2% to 3.1%, indicating improved data capture at the approval stage - possibly due to recent process changes requiring explicit approver selection. However, CostCenter null percentages increased significantly from 12.3% to 18.7%, suggesting a degradation in cost center assignment that requires immediate attention. The dramatic improvement in DueDate completeness (nulls down from 15.6% to 9.2%) reflects the successful implementation of mandatory due date entry. These trends guide ongoing data quality initiatives.
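
This month-over-month comparison is straightforward to script against two exported extractions. A sketch, assuming illustrative file names and an illustrative +/-0.5 percentage-point band for STABLE:

```python
import pandas as pd

def null_pct(df: pd.DataFrame) -> pd.Series:
    """Null percentage per attribute."""
    return (df.isna().mean() * 100).round(1)

jan = pd.read_csv("extract_jan.csv")  # baseline extraction (illustrative files)
feb = pd.read_csv("extract_feb.csv")  # current extraction

trend = pd.DataFrame({"jan_null_pct": null_pct(jan), "feb_null_pct": null_pct(feb)})
trend["change"] = trend["feb_null_pct"] - trend["jan_null_pct"]
# Positive change means more nulls, i.e. degrading completeness.
trend["trend"] = pd.cut(trend["change"], bins=[-100, -0.5, 0.5, 100],
                        labels=["IMPROVED", "STABLE", "DEGRADED"])
print(trend.sort_values("change", ascending=False))
```

Sorting by change puts the degrading attributes at the top, where they are hardest to miss.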

Output

The Column Info calculator displays a comprehensive table with detailed statistics for every attribute in your event log. The table includes both case-level and event-level attributes with the following information:

Attribute Name: The name of the attribute as it appears in the dataset.

Attribute Type: Indicates whether this is a Case-level attribute (one value per case) or Event-level attribute (one value per event/activity).

Data Type: The detected data type of the attribute (String, Integer, Decimal, DateTime, Boolean, etc.).

Total Values: The total number of values present for this attribute (total cases for case attributes, total events for event attributes).

Null Count: The number of null or missing values for this attribute.

Null Percentage: The percentage of values that are null or missing, calculated as (Null Count / Total Values) * 100.

Unique Values: The number of distinct unique values in this attribute.

Cardinality Ratio: The ratio of unique values to total values, expressed as a percentage. High cardinality (close to 100%) indicates mostly unique values; low cardinality indicates many repeated values.

Sample Values: A representative sample of actual values from the attribute, typically showing 3-5 distinct values to illustrate the data format and content.

Min Value: For numeric and date attributes, the minimum (smallest/earliest) value.

Max Value: For numeric and date attributes, the maximum (largest/latest) value.
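
For reference, most of these metrics map to one-liners over an exported event log. A minimal pandas sketch, assuming an illustrative CSV export; it mirrors the definitions above, not the calculator's internal implementation:

```python
import pandas as pd

df = pd.read_csv("event_log.csv")  # illustrative export

profile = pd.DataFrame({
    "data_type": df.dtypes.astype(str),
    "total_values": len(df),
    "null_count": df.isna().sum(),
    "unique_values": df.nunique(),
})
profile["null_pct"] = (profile["null_count"] / profile["total_values"] * 100).round(1)
profile["cardinality_pct"] = (
    profile["unique_values"] / profile["total_values"] * 100
).round(2)

# Min/Max are only meaningful for numeric (and datetime) attributes; other
# rows stay empty after index alignment.
numeric = df.select_dtypes(include="number")
profile["min_value"] = numeric.min()
profile["max_value"] = numeric.max()
print(profile)
```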

Interactive Features

Sort and Filter: Click column headers to sort by any metric. Use the search box to filter to specific attributes of interest.

Export Results: Export the complete attribute analysis to Excel or CSV for documentation, comparison, or sharing with technical teams.

Drill-down Analysis: Click on an attribute name to see additional detailed statistics including value frequency distribution and more comprehensive sample values.

Performance Considerations

  • Large Datasets: For datasets with millions of events or hundreds of attributes, this calculator may require several minutes to complete analysis
  • Resource Usage: The calculator performs comprehensive scans of all attribute values, which is memory and CPU intensive
  • Best Practices: Run this calculator during off-peak hours for very large datasets, or use filters to reduce dataset size before execution (see the sampling sketch after this list)
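
As referenced in the best-practices bullet above, one way to gauge a very large dataset before a full run is to profile a random sample. A pandas sketch under illustrative assumptions (file name, sample size); sampling is a workaround, not a built-in option:

```python
import pandas as pd

events = pd.read_csv("large_event_log.csv")  # illustrative export

# Profile a fixed-size random sample for a quick preview. Null rates scale
# well under sampling, but unique counts understate true cardinality.
sample = events.sample(n=min(50_000, len(events)), random_state=42)
preview = pd.DataFrame({
    "null_pct": (sample.isna().mean() * 100).round(1),
    "unique_values": sample.nunique(),
})
print(preview)
```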

Administrative Access

This calculator is restricted to users with Administrator role. Regular users who need dataset overview information should use the Dataset Information calculator instead, which provides key metrics without the performance overhead of comprehensive column analysis.


This documentation is part of the mindzie Studio process mining platform.
