Overview
The Column Info calculator provides detailed metadata and statistics for every attribute (column) in your event log dataset. This administrative tool reports, for each attribute, its data type, value distribution, null count, and sample values.
IMPORTANT: This is an administrator-only calculator designed for technical analysis and research purposes. It is not optimized for production use and may take significant time to process large datasets. Regular users should use the Dataset Information calculator for general dataset overview needs.
This calculator is primarily used by system administrators, data analysts, and technical users who need deep insights into data structure and quality for troubleshooting, data validation, or dataset optimization.
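To make these statistics concrete, here is a minimal sketch in pandas of how the per-attribute figures described above (type, null count, null percentage, unique values, samples) can be derived from a raw table. The DataFrame and column names are hypothetical; this illustrates the metrics themselves, not the calculator's actual implementation.

```python
# Hypothetical per-attribute profile, mirroring the statistics the
# Column Info calculator reports for each column.
import pandas as pd

events = pd.DataFrame({
    "CaseID": ["ORD-001", "ORD-001", "ORD-002", "ORD-002"],
    "ActivityName": ["Create Order", "Ship Order", "Create Order", None],
    "Region": ["North", "North", None, None],
})

rows = []
for col in events.columns:
    s = events[col]
    total = len(s)
    nulls = int(s.isna().sum())
    rows.append({
        "Attribute Name": col,
        # pandas reports "object" for text; the calculator shows
        # friendlier names such as String, Integer, DateTime.
        "Data Type": str(s.dtype),
        "Total Values": total,
        "Null Count": nulls,
        "Null %": round(100 * nulls / total, 1),
        "Unique Values": int(s.nunique(dropna=True)),
        "Sample Values": ", ".join(map(str, s.dropna().unique()[:3])),
    })

print(pd.DataFrame(rows).to_string(index=False))
```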
Common Uses
- Perform comprehensive data quality audits across all attributes in the event log
- Identify attributes with high percentages of null or missing values
- Analyze data type consistency and detect potential type conversion issues
- Review attribute cardinality (number of unique values) to identify candidates for categorical analysis
- Validate data extraction results by examining attribute-level statistics
- Diagnose performance issues by identifying attributes with unexpected value distributions
- Document dataset schema and characteristics for technical specifications
Settings
This calculator requires no specific configuration settings beyond the standard Title and Description fields shown in the examples below. When executed, it automatically scans all attributes in the current dataset (both case-level and event-level) and generates comprehensive statistics for each.
Note: Processing time depends on dataset size and the number of attributes. For very large datasets, this calculator may take several minutes to complete.
Examples
Example 1: Data Quality Audit After ETL
Scenario: You have just completed an ETL process to extract order-to-cash data from your ERP system. Before releasing the dataset to business users, you need to verify that all attributes were extracted correctly and assess data completeness.
Settings:
- Title: "Post-ETL Data Quality Validation"
- Description: "O2C dataset - January 2025 extraction"
Output:
The calculator displays a comprehensive table with the following information for each attribute:
| Attribute Name | Type | Total Values | Null Count | Null % | Unique Values | Sample Values |
|---|---|---|---|---|---|---|
| CaseID | Case | 2,456 | 0 | 0% | 2,456 | ORD-001, ORD-002, ORD-003 |
| CustomerName | Case | 2,456 | 12 | 0.5% | 847 | Acme Corp, TechStart Inc, Global... |
| OrderAmount | Case | 2,456 | 0 | 0% | 1,823 | 1250.00, 3400.50, 875.25 |
| Region | Case | 2,456 | 156 | 6.4% | 4 | North, South, East, West |
| ActivityName | Event | 18,945 | 0 | 0% | 15 | Create Order, Approve Order, Ship... |
| Timestamp | Event | 18,945 | 0 | 0% | 18,893 | 2025-01-15 08:23:00, 2025-01-15... |
| ApprovalLevel | Event | 18,945 | 8,234 | 43.5% | 3 | L1, L2, L3 |
| Department | Event | 18,945 | 3,456 | 18.2% | 8 | Sales, Finance, Operations... |
Insights: The audit reveals several data quality concerns. The Region attribute has 6.4% null values affecting 156 cases - these cases need manual review or data correction. More critically, the ApprovalLevel attribute has 43.5% null values, which may indicate that not all activities require approval (expected) or that approval data is missing for activities that should have it (requires investigation). The low null rate in CustomerName (0.5%) is acceptable; the missing names may correspond to test orders. All critical identifiers (CaseID, Timestamp) have zero nulls, confirming data integrity.
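An audit like this can also be automated outside the tool once you have exported the calculator's results. A minimal sketch: flag every attribute whose null percentage exceeds a review threshold. The 5% threshold is an assumed policy, not a platform default, and the figures are taken from the table above.

```python
# Hypothetical post-ETL null audit over exported Column Info results.
import pandas as pd

stats = pd.DataFrame({
    "Attribute": ["CaseID", "CustomerName", "Region", "ApprovalLevel"],
    "NullPct": [0.0, 0.5, 6.4, 43.5],  # figures from the table above
})

REVIEW_THRESHOLD = 5.0  # assumed policy: over 5% nulls needs investigation
flagged = stats[stats["NullPct"] > REVIEW_THRESHOLD]
print(flagged.to_string(index=False))  # flags Region and ApprovalLevel
```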
Example 2: Performance Troubleshooting
Scenario: Users are reporting slow performance when filtering on certain attributes. You need to identify which attributes have high cardinality (many unique values) that might be causing inefficient filtering.
Settings:
- Title: "Attribute Cardinality Analysis"
- Description: "Investigating filter performance issues"
Output:
| Attribute Name | Type | Total Values | Unique Values | Cardinality Ratio | Data Type |
|---|---|---|---|---|---|
| CaseID | Case | 45,678 | 45,678 | 100% | String |
| TransactionID | Event | 367,824 | 367,824 | 100% | String |
| UserComments | Event | 367,824 | 89,234 | 24.3% | String |
| ProductSKU | Event | 367,824 | 12,456 | 3.4% | String |
| Status | Case | 45,678 | 8 | 0.02% | String |
| Priority | Case | 45,678 | 3 | 0.007% | String |
Insights: The analysis reveals a wide range of cardinality across attributes. CaseID and TransactionID have 100% cardinality (every value is unique), making them excellent for case identification but poor candidates for categorical filtering. UserComments has unexpectedly high cardinality (24.3%), suggesting it contains free-form text rather than standardized values - filtering on this attribute will be slow and may benefit from full-text search optimization. In contrast, Status (8 values) and Priority (3 values) are ideal for efficient filtering. This analysis helps optimize filter design and guides users toward high-performance attribute selections.
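The cardinality ratio itself is straightforward to reproduce. The sketch below, with a hypothetical DataFrame, computes unique values divided by total values per attribute and marks near-100% columns as identifier-like:

```python
# Hypothetical cardinality analysis, mirroring Example 2.
import pandas as pd

def cardinality_ratio(series: pd.Series) -> float:
    """Unique values / total values, expressed as a percentage."""
    return 100 * series.nunique(dropna=True) / len(series)

df = pd.DataFrame({
    "CaseID": ["C1", "C2", "C3", "C4"],
    "Status": ["Open", "Closed", "Open", "Open"],
})

for col in df.columns:
    ratio = cardinality_ratio(df[col])
    # ~100% cardinality marks identifiers (poor categorical filters);
    # low cardinality means few distinct values and efficient filtering.
    kind = "identifier-like" if ratio > 95 else "filter-friendly"
    print(f"{col}: {ratio:.1f}% ({kind})")
```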
Example 3: Schema Documentation for Integration
Scenario: You need to provide technical documentation to a third-party vendor who will be integrating with your process mining environment. They need detailed information about available attributes, data types, and expected value ranges.
Settings:
- Title: "Purchase-to-Pay Schema Documentation"
- Description: "Technical specification for API integration"
Output:
| Attribute Name | Attribute Type | Data Type | Total Values | Unique Values | Null Count | Sample Values |
|---|---|---|---|---|---|---|
| PO_Number | Case | String | 8,945 | 8,945 | 0 | PO-2025-00001, PO-2025-00002 |
| Vendor_ID | Case | String | 8,945 | 234 | 0 | V12345, V67890, V45678 |
| Total_Amount | Case | Decimal | 8,945 | 7,823 | 0 | 15750.50, 2340.00, 987.25 |
| Currency | Case | String | 8,945 | 3 | 12 | USD, EUR, GBP |
| RequestDate | Case | DateTime | 8,945 | 2,456 | 0 | 2025-01-15, 2025-01-16 |
| Activity | Event | String | 71,560 | 12 | 0 | Create PO, Approve PO, Send... |
| Resource | Event | String | 71,560 | 145 | 234 | john.smith, sarah.jones... |
| Cost_Center | Event | String | 71,560 | 67 | 1,234 | CC-1001, CC-2045, CC-3012 |
Insights: The schema documentation shows that PO_Number is the primary case identifier with guaranteed uniqueness and no nulls. All monetary values use the Total_Amount field (decimal type) with Currency specified separately. The process supports three currencies (USD, EUR, GBP) with 12 cases missing currency data that need correction. Resource information is available for 145 unique users but has 234 null values at the event level, indicating that some activities run automatically without a named user. The Cost_Center attribute has 1.7% null values, suggesting incomplete data entry for certain activities. This comprehensive view enables accurate integration planning.
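When handing the specification to a vendor, the exported table can be reshaped into whatever format the integration requires. A small sketch, with hypothetical field names, writing a trimmed schema to CSV; in practice you would start from the calculator's exported results (see Export Results below).

```python
# Hypothetical schema document derived from Column Info output.
import csv

schema = [
    {"attribute": "PO_Number", "level": "Case", "type": "String",
     "nullable": False, "notes": "Primary case identifier, unique"},
    {"attribute": "Currency", "level": "Case", "type": "String",
     "nullable": True, "notes": "ISO codes: USD, EUR, GBP"},
]

with open("p2p_schema.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=schema[0].keys())
    writer.writeheader()
    writer.writerows(schema)
```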
Example 4: Detecting Data Type Inconsistencies
Scenario: After merging data from multiple source systems, you suspect there may be data type inconsistencies that could cause calculation errors or unexpected behavior in analyses.
Settings:
- Title: "Data Type Consistency Check"
- Description: "Multi-source data validation"
Output:
| Attribute Name | Detected Type | Total Values | Type Conflicts | Sample Inconsistent Values |
|---|---|---|---|---|
| OrderDate | DateTime | 5,678 | 0 | - |
| OrderValue | Mixed | 5,678 | 23 | "1250.50", "$1,250.50", "1250,50" |
| QuantityOrdered | Integer | 5,678 | 8 | "100", "100.0", "100 units" |
| CustomerID | String | 5,678 | 0 | - |
| IsRush | Mixed | 5,678 | 145 | "Yes", "Y", "1", "true", "TRUE" |
Insights: The analysis uncovered critical data type inconsistencies. The OrderValue attribute contains mixed formatting - some values include currency symbols and different decimal separators (comma vs period), requiring data cleansing before calculations. QuantityOrdered shows 8 instances where text was appended ("100 units"), which will cause errors in numeric aggregations. The IsRush flag has five different representations of boolean values, requiring standardization to "true/false" or "1/0" for reliable filtering. These issues must be resolved in the ETL process before the data can be used reliably.
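Conflict counts in this kind of check typically come from coercion: values that fail to parse as the expected type are flagged. A minimal pandas sketch using the OrderValue samples from the table above:

```python
# Hypothetical mixed-type detection via numeric coercion.
import pandas as pd

order_value = pd.Series(["1250.50", "$1,250.50", "1250,50", "875.25"])

# Values that cannot be parsed as numbers become NaN after coercion;
# any original non-null value that coerced to NaN is a type conflict.
coerced = pd.to_numeric(order_value, errors="coerce")
conflicts = order_value[coerced.isna() & order_value.notna()]

print(f"Type conflicts: {len(conflicts)}")
print(conflicts.tolist())  # ['$1,250.50', '1250,50']
```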
Example 5: Identifying Enrichment Opportunities
Scenario: You want to identify attributes with low cardinality that would benefit from enrichment with additional descriptive information to make analyses more user-friendly.
Settings:
- Title: "Enrichment Opportunity Analysis"
- Description: "Identifying candidates for lookup enrichment"
Output:
| Attribute Name | Type | Unique Values | Null % | Sample Values | Enrichment Potential |
|---|---|---|---|---|---|
| ProductCode | Event | 45 | 0% | P001, P002, P003 | HIGH - Add product names |
| StatusCode | Case | 8 | 0% | ST-01, ST-02, ST-03 | HIGH - Add status descriptions |
| RegionCode | Case | 4 | 0% | R1, R2, R3, R4 | HIGH - Add region names |
| CurrencyCode | Case | 3 | 0% | USD, EUR, GBP | MEDIUM - Generally understood |
| EmployeeID | Event | 234 | 2.1% | E12345, E67890 | HIGH - Add employee names |
Insights: Several attributes contain codes that would benefit from enrichment. With only 45 unique product codes, adding product names would make analyses far more readable for business users. The 8 status codes should be enriched with plain-language descriptions to avoid users needing to reference code sheets. Employee IDs should be enriched with names while maintaining privacy compliance. These enrichments will significantly improve the user experience without adding substantial data volume.
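Enrichment of this kind is usually a left join against a lookup table. A minimal sketch with hypothetical product data:

```python
# Hypothetical lookup enrichment: map codes to readable names.
import pandas as pd

events = pd.DataFrame({"ProductCode": ["P001", "P002", "P001"]})
lookup = pd.DataFrame({
    "ProductCode": ["P001", "P002"],
    "ProductName": ["Widget A", "Widget B"],
})

# Left join keeps every event; unmatched codes get NaN for review.
enriched = events.merge(lookup, on="ProductCode", how="left")
print(enriched)
```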
Example 6: Monitoring Data Completeness Trends
Scenario: You run regular data extractions and want to monitor whether data completeness is improving or degrading over time by comparing current extraction statistics with previous baselines.
Settings:
- Title: "Data Completeness Monitoring - February 2025"
- Description: "Compare with January baseline"
Output:
| Attribute Name | Type | Jan Null % | Feb Null % | Change | Trend |
|---|---|---|---|---|---|
| ApproverName | Event | 5.2% | 3.1% | -2.1% | IMPROVED |
| Department | Case | 8.4% | 8.9% | +0.5% | DEGRADED |
| CostCenter | Event | 12.3% | 18.7% | +6.4% | DEGRADED |
| Priority | Case | 1.2% | 1.1% | -0.1% | STABLE |
| DueDate | Case | 15.6% | 9.2% | -6.4% | IMPROVED |
Insights: The comparison reveals mixed data quality trends. ApproverName null percentages decreased from 5.2% to 3.1%, indicating improved data capture at the approval stage - possibly due to recent process changes requiring explicit approver selection. However, CostCenter null percentages increased significantly from 12.3% to 18.7%, suggesting a degradation in cost center assignment that requires immediate attention. The dramatic improvement in DueDate completeness (from 15.6% to 9.2%) reflects the successful implementation of mandatory due date entry. These trends guide ongoing data quality initiatives.
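Since the calculator itself takes no baseline input, a comparison like this assumes you exported each month's results (see Export Results below) and diffed them externally. A sketch using the figures from the table above; the 0.5-point tolerance for "stable" is an assumption, not a platform default.

```python
# Hypothetical month-over-month null-percentage comparison.
import pandas as pd

jan = pd.Series({"ApproverName": 5.2, "CostCenter": 12.3, "Priority": 1.2})
feb = pd.Series({"ApproverName": 3.1, "CostCenter": 18.7, "Priority": 1.1})

change = (feb - jan).round(1)

def trend(delta: float, tolerance: float = 0.5) -> str:
    if delta > tolerance:
        return "DEGRADED"   # null % rose: completeness got worse
    if delta < -tolerance:
        return "IMPROVED"   # null % fell: completeness got better
    return "STABLE"

for attr, delta in change.items():
    print(f"{attr}: {delta:+.1f}% -> {trend(delta)}")
```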
Output
The Column Info calculator displays a comprehensive table with detailed statistics for every attribute in your event log. The table includes both case-level and event-level attributes with the following information:
Attribute Name: The name of the attribute as it appears in the dataset.
Attribute Type: Indicates whether this is a Case-level attribute (one value per case) or Event-level attribute (one value per event/activity).
Data Type: The detected data type of the attribute (String, Integer, Decimal, DateTime, Boolean, etc.).
Total Values: The total number of values present for this attribute (total cases for case attributes, total events for event attributes).
Null Count: The number of null or missing values for this attribute.
Null Percentage: The percentage of values that are null or missing, calculated as (Null Count / Total Values) * 100.
Unique Values: The number of distinct unique values in this attribute.
Cardinality Ratio: The ratio of unique values to total values, expressed as a percentage. High cardinality (close to 100%) indicates mostly unique values; low cardinality indicates many repeated values.
Sample Values: A representative sample of actual values from the attribute, typically showing 3-5 distinct values to illustrate the data format and content.
Min Value: For numeric and date attributes, the minimum (smallest/earliest) value.
Max Value: For numeric and date attributes, the maximum (largest/latest) value.
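As a worked check of the Null Percentage and Cardinality Ratio definitions above, take the Region row from Example 1 (156 nulls and 4 unique values out of 2,456 total):

```python
# Worked check of the derived metrics, using the Region row from Example 1.
null_count, total, unique = 156, 2456, 4

null_pct = null_count / total * 100   # 6.35... reported as 6.4%
cardinality = unique / total * 100    # 0.16%: a strongly categorical column
print(f"Null %: {null_pct:.1f}%, Cardinality Ratio: {cardinality:.2f}%")
```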
Interactive Features
Sort and Filter: Click column headers to sort by any metric. Use the search box to filter to specific attributes of interest.
Export Results: Export the complete attribute analysis to Excel or CSV for documentation, comparison, or sharing with technical teams.
Drill-down Analysis: Click on an attribute name to see additional detailed statistics including value frequency distribution and more comprehensive sample values.
Performance Considerations
- Large Datasets: For datasets with millions of events or hundreds of attributes, this calculator may require several minutes to complete its analysis
- Resource Usage: The calculator performs comprehensive scans of all attribute values, which is memory and CPU intensive
- Best Practices: Run this calculator during off-peak hours for very large datasets, or use filters to reduce dataset size before execution (see the sketch below)
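For a quick estimate before committing to a full scan, the same statistics can be approximated on a filtered slice. A hypothetical pandas sketch; in mindzie Studio the equivalent is applying dataset filters in the UI before running the calculator.

```python
# Hypothetical pre-filtering: profile one month before a full scan.
import pandas as pd

events = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-02-02"]),
    "Department": ["Sales", None, "Finance"],
})

# Restrict to January before profiling to cut scan cost.
january = events[events["Timestamp"].dt.month == 1]
print(january["Department"].isna().mean() * 100, "% null in January slice")
```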
Administrative Access
This calculator is restricted to users with Administrator role. Regular users who need dataset overview information should use the Dataset Information calculator instead, which provides key metrics without the performance overhead of comprehensive column analysis.
This documentation is part of the mindzie Studio process mining platform.