Column Info

Overview

The Column Info calculator provides detailed metadata and statistics about all attributes (columns) in your event log dataset. This administrative tool displays comprehensive information about each attribute, including data types, value distributions, null counts, and sample values.

IMPORTANT: This is an administrator-only calculator designed for technical analysis and research purposes. It is not optimized for production use and may take significant time to process large datasets. Regular users should use the Dataset Information calculator for general dataset overview needs.

This calculator is primarily used by system administrators, data analysts, and technical users who need deep insights into data structure and quality for troubleshooting, data validation, or dataset optimization.

Common Uses

  • Perform comprehensive data quality audits across all attributes in the event log
  • Identify attributes with high percentages of null or missing values
  • Analyze data type consistency and detect potential type conversion issues
  • Review attribute cardinality (number of unique values) to identify candidates for categorical analysis
  • Validate data extraction results by examining attribute-level statistics
  • Diagnose performance issues by identifying attributes with unexpected value distributions
  • Document dataset schema and characteristics for technical specifications

Settings

This calculator requires no specific configuration settings. When executed, it automatically scans all attributes in the current dataset (both case-level and event-level) and generates comprehensive statistics for each.

Note: Processing time depends on dataset size and the number of attributes. For very large datasets, this calculator may take several minutes to complete.

Examples

Example 1: Data Quality Audit After ETL

Scenario: You have just completed an ETL process to extract order-to-cash data from your ERP system. Before releasing the dataset to business users, you need to verify that all attributes were extracted correctly and assess data completeness.

Settings:

  • Title: "Post-ETL Data Quality Validation"
  • Description: "O2C dataset - January 2025 extraction"

Output:

The calculator displays a comprehensive table with the following information for each attribute:

Attribute Name | Type  | Total Values | Null Count | Null % | Unique Values | Sample Values
-------------- | ----- | ------------ | ---------- | ------ | ------------- | -------------------------------------
CaseID         | Case  | 2,456        | 0          | 0%     | 2,456         | ORD-001, ORD-002, ORD-003
CustomerName   | Case  | 2,456        | 12         | 0.5%   | 847           | Acme Corp, TechStart Inc, Global...
OrderAmount    | Case  | 2,456        | 0          | 0%     | 1,823         | 1250.00, 3400.50, 875.25
Region         | Case  | 2,456        | 156        | 6.4%   | 4             | North, South, East, West
ActivityName   | Event | 18,945       | 0          | 0%     | 15            | Create Order, Approve Order, Ship...
Timestamp      | Event | 18,945       | 0          | 0%     | 18,893        | 2025-01-15 08:23:00, 2025-01-15...
ApprovalLevel  | Event | 18,945       | 8,234      | 43.5%  | 3             | L1, L2, L3
Department     | Event | 18,945       | 3,456      | 18.2%  | 8             | Sales, Finance, Operations...

Insights: The audit reveals several data quality concerns. The Region attribute has 6.4% null values affecting 156 cases - these need manual review or data correction. More critically, the ApprovalLevel attribute has 43.5% null values, which may indicate that not all activities require approval (expected) or that approval data is missing for activities that should have it (requires investigation). The low null rate in CustomerName (0.5%) is acceptable and may reflect test orders. All critical identifiers (CaseID, Timestamp) have zero nulls, confirming data integrity.
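
The same audit can be approximated outside the platform on an exported copy of the case table. A minimal pandas sketch; the file name, column set, and 5% review threshold are illustrative assumptions, not anything the calculator itself exposes:

```python
import pandas as pd

# Illustrative export of the case table; not a mindzie file format.
cases = pd.read_csv("o2c_cases.csv")

# Null count and percentage for every case-level attribute.
audit = pd.DataFrame({
    "null_count": cases.isna().sum(),
    "null_pct": (cases.isna().mean() * 100).round(1),
})

# Flag attributes above an illustrative 5% review threshold.
print(audit[audit["null_pct"] > 5].sort_values("null_pct", ascending=False))
```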

Example 2: Performance Troubleshooting

Scenario: Users are reporting slow performance when filtering on certain attributes. You need to identify which attributes have high cardinality (many unique values) that might be causing inefficient filtering.

Settings:

  • Title: "Attribute Cardinality Analysis"
  • Description: "Investigating filter performance issues"

Output:

Attribute Name | Type  | Total Values | Unique Values | Cardinality Ratio | Data Type
-------------- | ----- | ------------ | ------------- | ----------------- | ---------
CaseID         | Case  | 45,678       | 45,678        | 100%              | String
TransactionID  | Event | 367,824      | 367,824       | 100%              | String
UserComments   | Event | 367,824      | 89,234        | 24.3%             | String
ProductSKU     | Event | 367,824      | 12,456        | 3.4%              | String
Status         | Case  | 45,678       | 8             | 0.02%             | String
Priority       | Case  | 45,678       | 3             | 0.007%            | String

Insights: The analysis reveals a wide range of cardinality across attributes. CaseID and TransactionID have 100% cardinality (every value is unique), making them excellent for case identification but poor candidates for categorical filtering. UserComments has unexpectedly high cardinality (24.3%), suggesting it contains free-form text rather than standardized values - filtering on this attribute will be slow and may benefit from full-text search optimization. In contrast, Status (8 values) and Priority (3 values) are ideal for efficient filtering. This analysis helps optimize filter design and guides users toward high-performance attribute selections.
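
The cardinality ratio is easy to reproduce on an exported event table. A short pandas sketch (the file name is a placeholder; the formula matches the Unique Values / Total Values definition used above):

```python
import pandas as pd

events = pd.read_csv("events.csv")  # illustrative export

summary = pd.DataFrame({
    "total_values": len(events),      # broadcast to every attribute
    "unique_values": events.nunique(),
})
summary["cardinality_pct"] = (
    summary["unique_values"] / summary["total_values"] * 100
).round(2)

# Attributes near 100% are identifiers; those near 0% suit categorical filters.
print(summary.sort_values("cardinality_pct", ascending=False))
```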

Example 3: Schema Documentation for Integration

Scenario: You need to provide technical documentation to a third-party vendor who will be integrating with your process mining environment. They need detailed information about available attributes, data types, and expected value ranges.

Settings:

  • Title: "Purchase-to-Pay Schema Documentation"
  • Description: "Technical specification for API integration"

Output:

Attribute Name | Attribute Type | Data Type | Total Values | Unique Values | Null Count | Sample Values
-------------- | -------------- | --------- | ------------ | ------------- | ---------- | ----------------------------
PO_Number      | Case           | String    | 8,945        | 8,945         | 0          | PO-2025-00001, PO-2025-00002
Vendor_ID      | Case           | String    | 8,945        | 234           | 0          | V12345, V67890, V45678
Total_Amount   | Case           | Decimal   | 8,945        | 7,823         | 0          | 15750.50, 2340.00, 987.25
Currency       | Case           | String    | 8,945        | 3             | 12         | USD, EUR, GBP
RequestDate    | Case           | DateTime  | 8,945        | 2,456         | 0          | 2025-01-15, 2025-01-16
Activity       | Event          | String    | 71,560       | 12            | 0          | Create PO, Approve PO, Send...
Resource       | Event          | String    | 71,560       | 145           | 234        | john.smith, sarah.jones...
Cost_Center    | Event          | String    | 71,560       | 67            | 1,234      | CC-1001, CC-2045, CC-3012

Insights: The schema documentation shows that PO_Number is the primary case identifier with guaranteed uniqueness and no nulls. All monetary values use the Total_Amount field (decimal type) with Currency specified separately. The process supports three currencies (USD, EUR, GBP) with 12 cases missing currency data that need correction. Resource information is available for 145 unique users but has 234 null values at the event level, indicating some automated activities. The Cost_Center attribute has 1.7% null values, suggesting incomplete data entry for certain activities. This comprehensive view enables accurate integration planning.
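
If the vendor prefers a machine-readable schema over a table, the same statistics can be assembled from a dataset extract. A sketch under illustrative assumptions (CSV export and output path are hypothetical):

```python
import pandas as pd

df = pd.read_csv("p2p_cases.csv")  # illustrative export

schema = pd.DataFrame({
    "data_type": df.dtypes.astype(str),
    "total_values": len(df),
    "unique_values": df.nunique(),
    "null_count": df.isna().sum(),
    # Up to three distinct non-null values per attribute as format examples.
    "sample_values": [
        ", ".join(df[c].dropna().astype(str).unique()[:3]) for c in df.columns
    ],
})

# Machine-readable schema the vendor can consume alongside the written spec.
schema.to_json("p2p_schema.json", orient="index", indent=2)
```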

Example 4: Detecting Data Type Inconsistencies

Scenario: After merging data from multiple source systems, you suspect there may be data type inconsistencies that could cause calculation errors or unexpected behavior in analyses.

Settings:

  • Title: "Data Type Consistency Check"
  • Description: "Multi-source data validation"

Output:

Attribute Name  | Detected Type | Total Values | Type Conflicts | Sample Inconsistent Values
--------------- | ------------- | ------------ | -------------- | ---------------------------------
OrderDate       | DateTime      | 5,678        | 0              | -
OrderValue      | Mixed         | 5,678        | 23             | "1250.50", "$1,250.50", "1250,50"
QuantityOrdered | Integer       | 5,678        | 8              | "100", "100.0", "100 units"
CustomerID      | String        | 5,678        | 0              | -
IsRush          | Mixed         | 5,678        | 145            | "Yes", "Y", "1", "true", "TRUE"

Insights: The analysis uncovered critical data type inconsistencies. The OrderValue attribute contains mixed formatting - some values include currency symbols and different decimal separators (comma vs period), requiring data cleansing before calculations. QuantityOrdered shows 8 instances where text was appended ("100 units"), which will cause errors in numeric aggregations. The IsRush flag has five different representations of boolean values, requiring standardization to "true/false" or "1/0" for reliable filtering. These issues must be resolved in the ETL process before the data can be used reliably.
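
A quick way to surface such conflicts on a raw extract is to read every column as text and test which values survive strict numeric parsing. A minimal sketch; the file and column names come from this example and are illustrative:

```python
import pandas as pd

# Read everything as raw text so mixed formats survive inspection.
df = pd.read_csv("merged_orders.csv", dtype=str)  # illustrative file

def numeric_conflicts(series: pd.Series) -> pd.Series:
    """Values present in the raw data that fail strict numeric parsing."""
    parsed = pd.to_numeric(series, errors="coerce")
    return series[series.notna() & parsed.isna()]

# "OrderValue" is the illustrative column name from the table above.
conflicts = numeric_conflicts(df["OrderValue"])
print(f"{len(conflicts)} conflicts, e.g. {list(conflicts.unique()[:3])}")
```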

Example 5: Identifying Enrichment Opportunities

Scenario: You want to identify attributes with low cardinality that would benefit from enrichment with additional descriptive information to make analyses more user-friendly.

Settings:

  • Title: "Enrichment Opportunity Analysis"
  • Description: "Identifying candidates for lookup enrichment"

Output:

Attribute Name | Type  | Unique Values | Null % | Sample Values       | Enrichment Potential
-------------- | ----- | ------------- | ------ | ------------------- | ------------------------------
ProductCode    | Event | 45            | 0%     | P001, P002, P003    | HIGH - Add product names
StatusCode     | Case  | 8             | 0%     | ST-01, ST-02, ST-03 | HIGH - Add status descriptions
RegionCode     | Case  | 4             | 0%     | R1, R2, R3, R4      | HIGH - Add region names
CurrencyCode   | Case  | 3             | 0%     | USD, EUR, GBP       | MEDIUM - Generally understood
EmployeeID     | Event | 234           | 2.1%   | E12345, E67890      | HIGH - Add employee names

Insights: Several attributes contain codes that would benefit from enrichment. With only 45 unique product codes, adding product names would make analyses far more readable for business users. The 8 status codes should be enriched with plain-language descriptions to avoid users needing to reference code sheets. Employee IDs should be enriched with names while maintaining privacy compliance. These enrichments will significantly improve the user experience without adding substantial data volume.
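
Once candidates are identified, the typical fix is a lookup join during ETL. A small pandas sketch, assuming a hypothetical product lookup file; neither file is a mindzie artifact:

```python
import pandas as pd

events = pd.read_csv("events.csv")          # illustrative export
lookup = pd.read_csv("product_lookup.csv")  # columns: ProductCode, ProductName

# Left join keeps every event; codes missing from the lookup surface as nulls.
enriched = events.merge(lookup, on="ProductCode", how="left")
unmatched = enriched["ProductName"].isna().sum()
print(f"{unmatched} events reference a ProductCode with no lookup entry")
```

Checking the unmatched count after the join catches stale lookup tables before the enriched attribute reaches business users.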

Example 6: Monitoring Data Completeness Over Time

Scenario: You run regular data extractions and want to monitor whether data completeness is improving or degrading over time by comparing current extraction statistics with previous baselines.

Settings:

  • Title: "Data Completeness Monitoring - February 2025"
  • Description: "Compare with January baseline"

Output:

Attribute Name | Type  | Jan Null % | Feb Null % | Change | Trend
-------------- | ----- | ---------- | ---------- | ------ | --------
ApproverName   | Event | 5.2%       | 3.1%       | -2.1%  | IMPROVED
Department     | Case  | 8.4%       | 8.9%       | +0.5%  | DEGRADED
CostCenter     | Event | 12.3%      | 18.7%      | +6.4%  | DEGRADED
Priority       | Case  | 1.2%       | 1.1%       | -0.1%  | STABLE
DueDate        | Case  | 15.6%      | 9.2%       | -6.4%  | IMPROVED

Insights: The comparison reveals mixed data quality trends. ApproverName null percentages decreased from 5.2% to 3.1%, indicating improved data capture at the approval stage - possibly due to recent process changes requiring explicit approver selection. However, CostCenter null percentages increased significantly from 12.3% to 18.7%, suggesting a degradation in cost center assignment that requires immediate attention. The dramatic improvement in DueDate completeness (nulls down from 15.6% to 9.2%) reflects the successful implementation of mandatory due date entry. These trends guide ongoing data quality initiatives.
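
This month-over-month comparison is straightforward to script against two exported extractions. A sketch, assuming illustrative file names and an illustrative +/-0.5 percentage-point band for STABLE:

```python
import pandas as pd

def null_pct(df: pd.DataFrame) -> pd.Series:
    """Null percentage per attribute."""
    return (df.isna().mean() * 100).round(1)

jan = pd.read_csv("extract_jan.csv")  # baseline extraction (illustrative files)
feb = pd.read_csv("extract_feb.csv")  # current extraction

trend = pd.DataFrame({"jan_null_pct": null_pct(jan), "feb_null_pct": null_pct(feb)})
trend["change"] = trend["feb_null_pct"] - trend["jan_null_pct"]
# Positive change means more nulls, i.e. degrading completeness.
trend["trend"] = pd.cut(trend["change"], bins=[-100, -0.5, 0.5, 100],
                        labels=["IMPROVED", "STABLE", "DEGRADED"])
print(trend.sort_values("change", ascending=False))
```

Sorting by change puts the degrading attributes at the top, where they are hardest to miss.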

Output

The Column Info calculator displays a comprehensive table with detailed statistics for every attribute in your event log. The table includes both case-level and event-level attributes with the following information:

Attribute Name: The name of the attribute as it appears in the dataset.

Attribute Type: Indicates whether this is a Case-level attribute (one value per case) or Event-level attribute (one value per event/activity).

Data Type: The detected data type of the attribute (String, Integer, Decimal, DateTime, Boolean, etc.).

Total Values: The total number of values present for this attribute (total cases for case attributes, total events for event attributes).

Null Count: The number of null or missing values for this attribute.

Null Percentage: The percentage of values that are null or missing, calculated as (Null Count / Total Values) * 100.

Unique Values: The number of distinct unique values in this attribute.

Cardinality Ratio: The ratio of unique values to total values, expressed as a percentage. High cardinality (close to 100%) indicates mostly unique values; low cardinality indicates many repeated values.

Sample Values: A representative sample of actual values from the attribute, typically showing 3-5 distinct values to illustrate the data format and content.

Min Value: For numeric and date attributes, the minimum (smallest/earliest) value.

Max Value: For numeric and date attributes, the maximum (largest/latest) value.
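
For reference, most of these metrics map to one-liners over an exported event log. A minimal pandas sketch, assuming an illustrative CSV export; it mirrors the definitions above, not the calculator's internal implementation:

```python
import pandas as pd

df = pd.read_csv("event_log.csv")  # illustrative export

profile = pd.DataFrame({
    "data_type": df.dtypes.astype(str),
    "total_values": len(df),
    "null_count": df.isna().sum(),
    "unique_values": df.nunique(),
})
profile["null_pct"] = (profile["null_count"] / profile["total_values"] * 100).round(1)
profile["cardinality_pct"] = (
    profile["unique_values"] / profile["total_values"] * 100
).round(2)

# Min/Max are only meaningful for numeric (and datetime) attributes; other
# rows stay empty after index alignment.
numeric = df.select_dtypes(include="number")
profile["min_value"] = numeric.min()
profile["max_value"] = numeric.max()
print(profile)
```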

Interactive Features

Sort and Filter: Click column headers to sort by any metric. Use the search box to filter to specific attributes of interest.

Export Results: Export the complete attribute analysis to Excel or CSV for documentation, comparison, or sharing with technical teams.

Drill-down Analysis: Click on an attribute name to see additional detailed statistics including value frequency distribution and more comprehensive sample values.

Performance Considerations

  • Large Datasets: For datasets with millions of events or hundreds of attributes, this calculator may require several minutes to complete analysis
  • Resource Usage: The calculator performs comprehensive scans of all attribute values, which is memory and CPU intensive
  • Best Practices: Run this calculator during off-peak hours for very large datasets, or use filters to reduce dataset size before execution (see the sampling sketch after this list)
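
As referenced in the best-practices bullet above, one way to gauge a very large dataset before a full run is to profile a random sample. A pandas sketch under illustrative assumptions (file name, sample size); sampling is a workaround, not a built-in option:

```python
import pandas as pd

events = pd.read_csv("large_event_log.csv")  # illustrative export

# Profile a fixed-size random sample for a quick preview. Null rates scale
# well under sampling, but unique counts understate true cardinality.
sample = events.sample(n=min(50_000, len(events)), random_state=42)
preview = pd.DataFrame({
    "null_pct": (sample.isna().mean() * 100).round(1),
    "unique_values": sample.nunique(),
})
print(preview)
```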

Administrative Access

This calculator is restricted to users with Administrator role. Regular users who need dataset overview information should use the Dataset Information calculator instead, which provides key metrics without the performance overhead of comprehensive column analysis.


This documentation is part of the mindzie Studio process mining platform.
