Follows Graphs

Note: This is an administrator-only calculator designed for testing and data quality analysis. Most users should use the Process Map calculator for visual process analysis.

Overview

The Follows Graphs calculator generates detailed data about how activities relate to each other in your process. It calculates two types of relationships: directly follows relationships, where one activity immediately follows another, and eventually follows relationships, where one activity occurs before another at any point in the case, regardless of intervening activities.
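The two relationship types can be illustrated with a minimal Python sketch. The event log layout (a list of case id, activity, timestamp tuples) and the function name follows_pairs are assumptions made for illustration; this is not the calculator's actual implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations

def follows_pairs(events):
    """Count directly-follows and eventually-follows activity pairs.

    events: iterable of (case_id, activity, timestamp) tuples (assumed layout).
    """
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))

    directly, eventually = Counter(), Counter()
    for trace in by_case.values():
        trace.sort(key=lambda e: e[0])  # order each case's events by timestamp
        activities = [activity for _, activity in trace]
        # Directly follows: consecutive events only.
        for a, b in zip(activities, activities[1:]):
            directly[(a, b)] += 1
        # Eventually follows: every earlier/later pair within the case.
        for a, b in combinations(activities, 2):
            eventually[(a, b)] += 1
    return directly, eventually

# Toy log with a single three-event case (illustrative data only).
log = [
    ("c1", "Submit Claim", 1),
    ("c1", "Validate Documents", 2),
    ("c1", "Approve Claim", 3),
]
directly, eventually = follows_pairs(log)
```

In this toy case, Submit Claim and Approve Claim appear only in the eventually follows result, because Validate Documents sits between them.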

Unlike the Process Map calculator, which provides interactive visualizations, Follows Graphs performs complete graph calculations and outputs structured data tables suitable for detailed analysis, testing, performance benchmarking, and data quality validation. This calculator is primarily used by administrators and process mining analysts who need access to raw graph data for technical analysis or export to external tools.

Common Uses

Test and validate graph calculation algorithms for correctness and performance
Benchmark calculation performance across different dataset sizes and complexities
Identify data quality issues where events have identical timestamps
Export detailed graph data for external analysis in tools like R, Python, or Gephi
Analyze duration distributions for specific activity pairs in detail
Validate process mining algorithms during development and regression testing

Settings

This calculator has no configurable settings. It processes all cases and events to generate complete graph data every time it runs.

Examples

Example 1: Identifying Data Quality Issues with Identical Timestamps

Scenario: You suspect your event log has timestamp precision issues where multiple activities have identical timestamps, making it impossible to determine their correct order. You want to identify which activity pairs are affected and how frequently this occurs.

Settings:

No settings required.

Output:

The calculator generates five data tables. Tables 2 and 3 show indeterminate pairs where events have identical timestamps:

DirectlyFollows-Indeterminate table:

Create Invoice and Send Invoice: 127 occurrences
Receive Payment and Record Payment: 89 occurrences
Approve Request and Notify Approver: 45 occurrences

EventuallyFollows-Indeterminate table shows the same pairs plus any additional eventually-follows relationships with zero duration.

The Stats table shows:

Calculation Time: 2,347 milliseconds
Fill Tables Time: 156 milliseconds
Total Calculations: 1,247,893

Insights: The high counts of indeterminate pairs indicate significant timestamp precision problems in your event log. The most common issue is Create Invoice and Send Invoice sharing exactly the same timestamp, which occurs 127 times. This suggests these events are either being recorded with date-only precision or are being timestamped simultaneously by your source system. You should investigate whether these activities truly occur simultaneously or whether your data extraction process is losing time-of-day information. This data quality issue could reduce the accuracy of process analysis and should be resolved by improving timestamp precision in your source data.
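As a rough illustration of what an indeterminate directly follows pair is, the sketch below counts consecutive events in a case that share a timestamp. The tuple layout, the date-only timestamp strings, and the function name indeterminate_direct_pairs are assumptions for this example, not the calculator's internals.

```python
from collections import Counter, defaultdict

def indeterminate_direct_pairs(events):
    """Count consecutive same-timestamp event pairs per undirected activity pair.

    events: iterable of (case_id, activity, timestamp) tuples (assumed layout).
    """
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))

    counts = Counter()
    for trace in by_case.values():
        trace.sort(key=lambda e: e[0])  # stable sort: ties keep input order
        for (t1, a1), (t2, a2) in zip(trace, trace[1:]):
            if t1 == t2:
                # Order is unknown, so use an undirected (sorted) key.
                counts[tuple(sorted((a1, a2)))] += 1
    return counts

# Toy log with date-only timestamps (illustrative data only).
log = [
    ("c1", "Create Invoice", "2024-03-01"),
    ("c1", "Send Invoice", "2024-03-01"),
    ("c1", "Receive Payment", "2024-03-15"),
    ("c2", "Send Invoice", "2024-03-02"),
    ("c2", "Create Invoice", "2024-03-02"),
]
indeterminate = indeterminate_direct_pairs(log)
```

Both toy cases contribute to the same undirected key, regardless of which activity happened to be listed first in the log.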

Example 2: Performance Benchmarking Across Dataset Sizes

Scenario: You are optimizing your process mining infrastructure and need to understand how graph calculation performance scales with dataset size. You want to measure calculation time for different data volumes to plan resource allocation.

Settings:

No settings required.

Output:

Running the calculator on progressively larger datasets and examining the Stats table:

10,000 cases dataset:

Calculation Time: 847 milliseconds
Total Calculations: 186,234

50,000 cases dataset:

Calculation Time: 4,521 milliseconds
Total Calculations: 931,170

100,000 cases dataset:

Calculation Time: 9,234 milliseconds
Total Calculations: 1,862,340

The DirectlyFollows table has 156 unique activity pairs while the EventuallyFollows table has 2,847 pairs, showing the comprehensive nature of eventually follows relationships.

Insights: The calculation time and the total number of calculations both scale roughly linearly with the number of cases for this dataset, where cases have a consistent average number of events. However, the EventuallyFollows table holds far more pairs than the DirectlyFollows table (2,847 versus 156), which reflects that eventually follows computation is significantly more expensive than directly follows computation, as expected from the algorithm's quadratic complexity in the number of events per case. For datasets exceeding 100,000 cases, you should consider filtering to the most relevant cases before running this calculator, or allocating additional computational resources. The Fill Tables Time remains consistently low across all dataset sizes, indicating that table conversion is not a bottleneck.
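One way to check the scaling claim is to normalize the Stats figures by case count. The short Python sketch below uses only the numbers from the three runs above; the variable names are illustrative.

```python
# Stats figures from the three runs: cases -> (calculation ms, total calculations)
runs = {
    10_000: (847, 186_234),
    50_000: (4_521, 931_170),
    100_000: (9_234, 1_862_340),
}

ms_per_1000_cases = {}
calcs_per_case = {}
for cases, (ms, calcs) in runs.items():
    ms_per_1000_cases[cases] = ms / cases * 1000
    calcs_per_case[cases] = calcs / cases
    print(f"{cases:>7} cases: {ms_per_1000_cases[cases]:.1f} ms per 1,000 cases, "
          f"{calcs_per_case[cases]:.4f} calculations per case")
```

Calculations per case stay constant across the three runs, while time per 1,000 cases rises only slightly (from about 84.7 ms to about 92.3 ms), which is consistent with roughly linear scaling.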

Example 3: Exporting Process Data for External Research Analysis

Scenario: You are collaborating with a university research team studying process optimization algorithms. They need raw process graph data in a standardized format to test their new analysis approach. You want to export your process relationships with complete duration statistics.

Settings:

No settings required.

Output:

The calculator generates the DirectlyFollows table with 243 unique activity pairs:

Sample rows from DirectlyFollows table:

Submit Claim -> Validate Documents: Count=1,847, Mean=2.3 days, Median=1.8 days, StDev=3.2 days
Validate Documents -> Approve Claim: Count=1,245, Mean=4.7 days, Median=3.1 days, StDev=6.8 days
Validate Documents -> Request Additional Info: Count=602, Mean=1.2 days, Median=0.9 days, StDev=2.1 days

The EventuallyFollows table contains 4,892 pairs showing all possible activity relationships including non-consecutive ones.

Insights: You can export the DirectlyFollows table to CSV format and provide it to the research team. The table includes all the essential information for process mining research: activity names, relationship frequencies, and comprehensive duration statistics including mean, median, standard deviation, minimum, and maximum values. The EventuallyFollows table provides an even more complete picture of activity relationships for researchers studying long-distance dependencies in processes. The structured output format makes it easy to import into analysis tools like R or Python for statistical modeling.
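A minimal sketch of reading such an export in Python follows. It assumes the CSV has a header row using the column names listed in the Output section and that durations are plain numbers in days; both are assumptions about the export format. An in-memory string stands in for the exported file.

```python
import csv
import io

# Stand-in for an exported file; in practice, open the exported CSV instead.
# Header names and the numeric "days" unit are assumptions.
exported = io.StringIO(
    "Activity1,Activity2,Count,MeanDuration,MedianDuration,StdevDuration\n"
    "Submit Claim,Validate Documents,1847,2.3,1.8,3.2\n"
    "Validate Documents,Approve Claim,1245,4.7,3.1,6.8\n"
    "Validate Documents,Request Additional Info,602,1.2,0.9,2.1\n"
)

rows = []
for record in csv.DictReader(exported):
    rows.append({
        "source": record["Activity1"],
        "target": record["Activity2"],
        "count": int(record["Count"]),
        "mean_days": float(record["MeanDuration"]),
    })

# Share of each next activity after Validate Documents, among the sample rows only.
outgoing = [r for r in rows if r["source"] == "Validate Documents"]
total = sum(r["count"] for r in outgoing)
shares = {r["target"]: r["count"] / total for r in outgoing}
```

The shares here cover only the three sample rows shown above; a full export would include every pair in the table.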

Example 4: Validating Process Mining Algorithm Changes

Scenario: Your development team has modified the graph calculation algorithm to improve performance. You need to verify that the changes produce identical results to the previous version to ensure no regression has occurred.

Settings:

No settings required.

Output:

Running both the old and new algorithm versions on a known test dataset with 5 cases and 11 events:

DirectlyFollows table (both versions):

8 unique activity pairs
Identical counts for each pair
Identical duration statistics

EventuallyFollows table (both versions):

28 unique activity pairs
All counts match exactly
All duration statistics match within floating-point precision

Stats table comparison:

Old algorithm: 89 milliseconds
New algorithm: 42 milliseconds
Both: 138 total calculations

Insights: The validation confirms that the algorithm optimization reduced calculation time by about 53 percent (from 89 to 42 milliseconds) without changing the results. All activity pairs and counts match exactly between versions, and all duration statistics match exactly or within floating-point precision, so no regression was detected on this test dataset. The identical calculation count confirms both algorithms process the same event pairs. This type of validation is essential when making performance improvements to ensure accuracy is maintained. You can now deploy the optimized algorithm to production with greater confidence.
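A comparison of this kind can be sketched in Python as follows. The table representation (a dict mapping an activity pair to a count and a mean duration) and the function name tables_match are hypothetical; counts are compared exactly and durations with a floating-point tolerance.

```python
import math

def tables_match(old, new, rel_tol=1e-9):
    """Compare two {pair: (count, mean_duration)} dicts (assumed layout)."""
    if old.keys() != new.keys():
        return False
    for pair, (old_count, old_mean) in old.items():
        new_count, new_mean = new[pair]
        if old_count != new_count:  # counts must match exactly
            return False
        # Durations may differ by rounding error between implementations.
        if not math.isclose(old_mean, new_mean, rel_tol=rel_tol, abs_tol=1e-12):
            return False
    return True

# Toy tables: 0.1 + 0.2 differs from 0.3 only by floating-point rounding.
old = {("A", "B"): (3, 0.1 + 0.2)}
new = {("A", "B"): (3, 0.3)}
same = tables_match(old, new)

# Time reduction from the Stats comparison above: 89 ms to 42 ms.
reduction_percent = (89 - 42) / 89 * 100
```

Using a tolerance for durations avoids flagging harmless rounding differences as regressions, while any difference in pairs or counts still fails the check.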

Example 5: Analyzing Duration Variability for Specific Activity Pairs

Scenario: Your operations team reports inconsistent processing times between document validation and approval activities. You want detailed duration statistics for this specific activity pair to understand the variability and identify if there are multiple distinct patterns.

Settings:

No settings required.

Output:

Examining the DirectlyFollows table for the "Validate Documents -> Approve" pair:

Activity1: Validate Documents
Activity2: Approve
Count: 3,247 occurrences
Mean Duration: 5.8 days
Median Duration: 2.3 days
Standard Deviation: 12.4 days
Min Duration: 0.2 days
Max Duration: 87.3 days

The large difference between mean and median suggests a right-skewed distribution with some extreme outliers. The high standard deviation indicates significant variability.

Insights: The large gap between median duration (2.3 days) and mean duration (5.8 days) indicates that while most cases process relatively quickly, a subset of cases takes much longer and pulls the average up. The maximum duration of 87.3 days shows extreme outliers that warrant investigation. The minimum of 0.2 days suggests some cases are fast-tracked. This variability pattern suggests you should segment the cases to identify what distinguishes fast, normal, and slow processing. Because the table contains only aggregated statistics, you would need to return to the underlying event data to identify specific cases with extreme durations and investigate their characteristics.
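The effect of a few long durations on the mean can be seen with a small Python sketch. The duration values below are made up for illustration and are not taken from the table.

```python
import statistics

# Hypothetical durations in days: mostly short, with one extreme outlier.
durations = [1.5, 2.0, 2.3, 2.5, 3.0, 40.0]

mean_days = statistics.mean(durations)
median_days = statistics.median(durations)

# A mean well above the median is a simple hint of right skew.
skew_ratio = mean_days / median_days
```

Here a single 40-day value lifts the mean far above the median, the same pattern the table shows for the Validate Documents -> Approve pair.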

Output

The Follows Graphs calculator generates five structured data tables containing comprehensive process graph information:

Table 0: DirectlyFollows

Shows all directly follows relationships where one activity immediately follows another with no intervening activities.

Columns: Key (activity pair identifier), Activity1 (first activity), Activity2 (second activity), Count (frequency), MeanDuration, MedianDuration, StdevDuration, MinDuration, MaxDuration

This table typically contains fewer relationships than EventuallyFollows as it only includes consecutive activity pairs.
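The duration columns can be thought of as summary statistics over all observed durations for one activity pair. The Python sketch below shows one plausible way to derive them; the function name duration_stats is hypothetical, and the use of the sample standard deviation is an assumption, since the documentation does not state which formula the calculator uses.

```python
import statistics

def duration_stats(durations):
    """Summarize the durations observed for one activity pair."""
    return {
        "Count": len(durations),
        "MeanDuration": statistics.mean(durations),
        "MedianDuration": statistics.median(durations),
        # Sample standard deviation (assumption); needs at least two values.
        "StdevDuration": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "MinDuration": min(durations),
        "MaxDuration": max(durations),
    }

# Toy durations in days (illustrative data only).
stats = duration_stats([1.0, 2.0, 3.0, 6.0])
```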

Table 1: EventuallyFollows

Shows all eventually follows relationships where one activity occurs before another at any point in the case.

Columns: Same structure as DirectlyFollows table

This table is significantly larger as it includes all possible activity pairs regardless of intervening activities. For a case with 10 events, this captures 45 possible pairs compared to just 9 directly follows pairs.
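The pair counts in this example follow from simple combinatorics: a case with n events has n - 1 consecutive pairs and n(n - 1)/2 ordered earlier/later pairs. A short Python check (the function name pair_counts is illustrative):

```python
from math import comb

def pair_counts(n_events):
    """Return (directly follows pairs, eventually follows pairs) for one case."""
    return n_events - 1, comb(n_events, 2)

direct, eventual = pair_counts(10)  # 9 and 45
```

Because the eventually follows count grows quadratically with the number of events in a case, long cases dominate the cost of the calculation.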

Table 2: DirectlyFollows-Indeterminate

Identifies directly follows pairs where events have identical timestamps, making ordering indeterminate.

Columns: Key (undirected pair identifier), Activity1, Activity2, Count

A well-structured event log with precise timestamps should have zero or very few indeterminate pairs. High counts indicate data quality issues.

Table 3: EventuallyFollows-Indeterminate

Identifies eventually follows pairs with identical timestamps.

Columns: Same structure as DirectlyFollows-Indeterminate table

Typically contains the same pairs as DirectlyFollows-Indeterminate, plus any additional eventually follows pairs with zero duration, since timestamp issues affect both relationship types.

Table 4: Stats

Contains performance metrics for the calculation.

Columns: CalculationTime (milliseconds to compute graphs), FillTablesTime (milliseconds to convert to tables), Calculations (total event pair comparisons)

Use this table to track performance and identify when datasets become too large for efficient processing.

Data Export Options:

All tables can be exported to CSV or Excel format for further analysis in external tools. The structured format makes it easy to import into statistical software, graph visualization tools, or custom analysis scripts.

This documentation is part of the mindzieStudio process mining platform.