<!DOCTYPE html>
<html data-wf-page="5f71dd169010d6326b65485d">
<head>
<meta charset="utf-8" />
<title>Sundial • Case Study</title>
<meta content="width=device-width, initial-scale=1" name="viewport" />
<link href="assets/css/style.css" rel="stylesheet" type="text/css" />
<script
src="https://ajax.googleapis.com/ajax/libs/webfont/1.6.26/webfont.js"
type="text/javascript"
></script>
<link
rel="stylesheet"
href="https://fonts.googleapis.com/css?family=Inter:regular,500,600,700"
media="all"
/>
<script type="text/javascript">
WebFont.load({ google: { families: ["Inter:regular,500,600,700"] } });
</script>
<script type="text/javascript">
!(function (o, c) {
var n = c.documentElement,
t = " w-mod-";
(n.className += t + "js"),
("ontouchstart" in o ||
(o.DocumentTouch && c instanceof DocumentTouch)) &&
(n.className += t + "touch");
})(window, document);
</script>
<link
href="assets/images/sundial-logo.png"
rel="shortcut icon"
type="image/x-icon"
/>
<link href="assets/images/sundial-logo.png" rel="apple-touch-icon" />
<script
src="https://kit.fontawesome.com/d019875f94.js"
crossorigin="anonymous"
></script>
<meta
name="image"
property="og:image"
content="assets/images/thumbnail.png"
/>
</head>
<body>
<div class="navigation-wrap">
<div
data-collapse="medium"
data-animation="default"
data-duration="400"
role="banner"
class="navigation w-nav"
>
<div class="navigation-container">
<div class="navigation-left">
<a
href="/"
aria-current="page"
class="brand w-nav-brand w--current"
aria-label="home"
>
<img
src="assets/images/sundial-logo.png"
alt=""
class="template-logo"
/>
</a>
<nav role="navigation" class="nav-menu w-nav-menu">
<a href="/case-study" class="link-block w-inline-block">
<div>Case Study</div>
</a>
<a href="/team" class="link-block w-inline-block">
<div>The Team</div>
</a>
</nav>
</div>
<div class="navigation-right">
<div class="login-buttons">
<a href="https://github.com/Project-Sundial" target="_blank">
<span style="color: #fed74f">
<i class="fab fa-github fa-lg"></i>
</span>
</a>
</div>
</div>
</div>
<div class="w-nav-overlay" data-wf-ignore="" id="w-nav-overlay-0"></div>
</div>
</div>
<div id="sidebar" class="toc">
</div>
<div class="section header">
<article class="container case-study-container">
<div class="hero-text-container">
<h1 class="h1 centered">Case Study</h1>
</div>
<div id="case-study">
<br />
<br />
<h2>1 Introduction</h2>
<br>
<h3>1.1 Overview of Sundial</h3>
<p>Sundial is a self-hosted, open-source cron job monitoring and management system that users can operate across one or multiple nodes. </p>
<br>
<p>Designed primarily for individuals and small to medium teams, it provides a readily deployable option to aid in setting up, modifying, and monitoring cron jobs. </p>
<br>
<p>This case study introduces the cron utility and its problem space, followed by the design and implementation of Sundial. It concludes with a selection of tradeoffs and technical challenges.</p>
<br>
<h2>2 Cron</h2>
<br>
<h3>2.1 The Cron Utility</h3>
<br>
<p>Cron is a time-based job scheduler found in Unix and Unix-like operating systems, allowing users to automate the execution of tasks at specified intervals. Created in 1975, it remains a fundamental tool for scheduling routine processes.</p>
<br>
<h4>Use Cases</h4>
<br>
<p>Some common use cases of the cron utility are:</p>
<img src="assets/images/case-study/1.1.svg" alt="common cron use cases" class="case-study-image resizable xx-large centered"/>
<br>
<ol>
<li><strong>Database Backups</strong><br>Database backups are crucial for data management and risk mitigation, enabling organizations to recover from data loss swiftly and maintain business continuity. They must be performed regularly, making them a perfect candidate for automation using cron.</li>
<br>
<li><strong>Log Rotation</strong><br>System logs can grow over time and consume valuable disk space. Cron can rotate and compress log files at scheduled intervals, preventing them from becoming too large.</li>
<br>
<li><strong>System Maintenance</strong><br>System maintenance tasks such as cleaning up temporary files, performance testing, and updating software can be automated using cron. Regularly performing these tasks helps ensure the system's smooth operation.</li>
<br>
<li><strong>Producing Reports</strong><br>Cron jobs can generate and send regular reports, such as sales figures, to relevant departments, which helps keep an organization efficient and productive.</li>
<br>
<li><strong>Time-Based Scaling</strong><br>Organizations with predictable traffic hours can use cron to scale their applications based on predefined schedules. Cron jobs can deploy additional server instances during peak hours and then remove them again later. This aids the efficient use of resources and conserves costs without requiring manual adjustments.</li>
</ol>
<br>
<h4>Cron Job</h4>
<br>
<p>A cron job is a command or shell script executed periodically according to a fixed schedule, such as a specific time, date, or interval. Cron jobs comprise a schedule and a script; below is an example.</p>
<img src="assets/images/case-study/1.2.svg" alt="example cronjob: 0 02 * * * /path/to/rotate-log" class="case-study-image resizable medium centered"/>
<p>The schedule is articulated in a cron-specific syntax, detailed further in section 2.2 Limitations of Cron. The script denotes the specific executable to be run by cron at the scheduled intervals.</p>
<br>
<h4>Crontab</h4>
<br>
<p>The term ‘crontab’ refers to <em>both</em> a <strong>configuration file</strong> used for managing cron jobs and a <strong>command-line tool</strong> to interact with said configuration file. Subsequent references in this document pertain to the configuration file.</p>
<br>
<p>The command <code>crontab -e</code> opens the crontab configuration file in a text editor, as shown below.</p>
<br>
<video autoplay loop muted playsinline class="resizable large" aria-label="crontab -e screencapture">
<source src="assets/videos/case-study/0.3.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<br>
<p>Each user on a machine has an individual crontab, and there is also a system-wide crontab.</p>
<br>
<h4>Crond</h4>
<p>Crond functions as a background process, also called a daemon. It regularly scans crontab files for scheduled jobs. When crond identifies a job scheduled to run, it initiates a child process dedicated to executing the job, as shown below.</p>
<br>
<video autoplay loop muted playsinline class="resizable x-large" aria-label="crond launching job scripts as child process on a schedule">
<source src="assets/videos/case-study/0.4.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<br>
<p>Crond configures the environment to match the user's specifications, and the child process inherits the environment from crond, ensuring access to essential file paths and permissions while operating independently from both crond and the user shell.</p>
<br>
<h3>2.2 Limitations of Cron</h3>
<br>
<p>Despite being widely used, cron has limitations. A brief search for 'cron job issues' returns many results, indicating that users often face challenges when working with cron. Common cron user concerns include <a href="#toc-8-references">[1]</a> <a href="#toc-8-references">[2]</a>:</p>
<br>
<ul>
<li>Did my cron job start?</li>
<li>Was my job completed?</li>
<li>Is my schedule correct?</li>
<li>Where can I find logs for cron jobs?</li>
</ul>
<br>
<p>The remainder of this section explains the specific issues associated with cron that we designed Sundial to address.</p>
<br>
<h4>Cron Jobs Fail Silently</h4>
<br>
<p>The cron utility lacks alerting capabilities for errors during job execution. A job can fail in two ways: it either fails to start or fails to complete successfully.</p>
<br>
<p>A failure to start means that cron never executed the script. Cron cannot start a job whose entry contains user errors such as an incorrect path or a typo in the script name. Jobs may also fail to start when system resources are depleted.</p>
<br>
<p>A failure to complete often suggests an issue inherent to the job itself. It could be the result of a bug or a failure of a dependency.</p>
<br>
<p>A particularly problematic kind of job failure arises when the scheduled interval for a job is shorter than the duration of the job itself. This results in concurrent execution, known as <strong>overlapping jobs</strong>, and can potentially deplete system resources.</p>
<br>
<video autoplay loop muted playsinline class="resizable x-large" aria-label="crond launching multiple versions of the same job script concurrently">
<source src="assets/videos/case-study/0.5.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
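<br>
<p>As a rough illustration of the arithmetic behind overlapping jobs, the sketch below estimates how many copies of a job run at once. It assumes every run takes the same time and that cron starts each run on schedule; the function name <code>max_concurrent_runs</code> is hypothetical and not part of cron or Sundial.</p>

```python
def max_concurrent_runs(interval_seconds: int, duration_seconds: int) -> int:
    """Peak number of simultaneous runs of one job.

    Assumes cron starts a run every `interval_seconds` and that each run
    lasts exactly `duration_seconds` (a simplification for illustration).
    """
    if not (interval_seconds > 0 and duration_seconds > 0):
        raise ValueError("interval and duration must be positive")
    # Ceiling division: a run is still active for every interval it spans.
    return (duration_seconds + interval_seconds - 1) // interval_seconds


# A job scheduled every minute that takes two and a half minutes to finish.
print(max_concurrent_runs(60, 150))  # 3
# A job that finishes within its interval never overlaps with itself.
print(max_concurrent_runs(60, 45))   # 1
```

<p>Under these assumptions, any result above 1 means the job overlaps with itself.</p>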
<br>
<p>The impact of a job failing to run can vary from being merely inconvenient to severely detrimental. While the inconvenience of a marketing email not being sent might be manageable, the absence of essential database backups and the lack of crucial security updates can leave the system vulnerable.</p>
<br>
<h4>Cron Logs Are Not Centralized</h4>
<br>
<p>While the cron utility does not have built-in alerting for failed job execution, it records every attempt at cron job execution in the syslog. Therefore, users can manually inspect the syslog to confirm whether cron initiated the job.</p>
<br>
<p>Logs related to cron job output, including errors, aren’t saved anywhere by default. A user must manually direct the data to a location to capture these outputs. Options for this include configuring the cron utility to send output to their email or creating a log file and directing output from the cron job to that file.</p>
<br>
<p>Users must review these logs manually to determine whether a job encountered any issues. With numerous cron jobs, or even a few that run frequently, sifting through the resulting volume of logs to find and verify the relevant entries takes time, effort, and careful attention.</p>
<br>
<h4>Writing Cron Jobs is Error-Prone</h4>
<br>
<p>The cron scheduling syntax can be unintuitive, especially for new users.</p>
<br>
<div class="flex-container">
<figure>
<img src="assets/images/case-study/1.3.svg" alt="once a month cron schedule: 0 0 1 * *" class="case-study-image resizable xxx-large"/>
<figcaption>Figure 1</figcaption>
</figure>
<figure>
<img src="assets/images/case-study/1.4.svg" alt="once a year cron schedule: 0 0 1 1 *" class="case-study-image resizable xxx-large" />
<figcaption>Figure 2</figcaption>
</figure>
</div>
<br>
<p>To illustrate how easy it is to make a mistake, compare the two schedules above: Figure 1 (<code>0 0 1 * *</code>) runs once a month, while Figure 2 (<code>0 0 1 1 *</code>) runs once a year. They differ only in the fourth field, the month.</p>
<br>
<img src="assets/images/case-study/1.6.svg" alt="cron schedule syntax explanation" class="case-study-image resizable large centered"/>
<br>
<p>It’s easy to accidentally write the wrong schedule, leading to jobs running at unexpected times. Additionally, a user must manually edit the crontab. When numerous other cron jobs exist, it’s easy to edit the wrong one mistakenly.</p>
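<br>
<p>To make the five schedule fields concrete, here is a minimal sketch that labels each field and reports which ones differ between two schedules. The helper names <code>label_schedule</code> and <code>differing_fields</code> are hypothetical, and the sketch handles only plain five-field schedules, not shortcuts such as <code>@daily</code>.</p>

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def label_schedule(schedule: str) -> dict[str, str]:
    """Pair each of the five schedule fields with its conventional name."""
    fields = schedule.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 fields, got {len(fields)}: {schedule!r}")
    return dict(zip(FIELD_NAMES, fields))


def differing_fields(first: str, second: str) -> list[str]:
    """Names of the fields in which two schedules disagree."""
    left, right = label_schedule(first), label_schedule(second)
    return [name for name in FIELD_NAMES if left[name] != right[name]]


once_a_month = "0 0 1 * *"  # Figure 1
once_a_year = "0 0 1 1 *"   # Figure 2
print(label_schedule(once_a_month))
print(differing_fields(once_a_month, once_a_year))  # ['month']
```

<p>A single changed field is all that separates a monthly job from a yearly one.</p>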
<br>
<h2>3 Solutions</h2>
<br>
<h3>3.1 Existing Solutions</h3>
<br>
<p>There are many solutions to the issues we’ve mentioned. Services might focus on job monitoring or job management or offer both. Some do not use cron but instead implement the same functionality using different technologies, so they handle scheduling and execution through their own platform. We can split these options into the following categories:</p>
<br>
<h4>Paid Services</h4>
<br>
<p>Two paid solutions that aim to improve the cron experience are <strong>Cronitor</strong> and <strong>Cronhub</strong>:</p>
<ul>
<li><strong>Cronitor</strong> prioritizes monitoring and provides a Command Line Interface (CLI) that automatically identifies all existing cron jobs on a system. It also offers monitoring of workflows, health checks, status sites, and other related features.</li>
<li><strong>Cronhub</strong> offers both monitoring and job scheduling on a single platform. The scheduling implemented by Cronhub does not rely on cron; any jobs scheduled with Cronhub will only appear on the Cronhub interface and not on a user’s own machine.</li>
</ul>
<p>Both paid services have disadvantages:</p>
<ol>
<li><strong>Data ownership</strong>: Dependence on a third-party service for monitoring or scheduling introduces the risk of relinquishing a user's data ownership. This may be undesirable for several reasons, including privacy concerns, as job error logs could contain sensitive system information, or legal reasons, such as the requirement to store data in the user's own country.</li>
<li><strong>Monthly fees</strong>: Both solutions impose a monthly fee on the user, determined by the number of monitors they have.</li>
</ol>
<br>
<h4>Open-Source</h4>
<br>
<p>Two open-source solutions are <strong>Uptime Kuma</strong> and <strong>Cronicle</strong>.</p>
<ul>
<li><strong>Uptime Kuma</strong> is a monitoring tool for various services, e.g., HTTP(s), TCP, and Docker Containers. While Uptime Kuma offers tracking of cron jobs through a Push monitor/webhook, its primary focus isn't on cron jobs. Despite having numerous features related to uptime monitoring, it lacks specific capabilities for collecting cron-related data such as start times, durations, or error logs.</li>
<li><strong>Cronicle</strong> is a cron-like service that handles the scheduling and execution of jobs internally without any help from the cron utility. Its main downside is that a user needs to transition any existing cron jobs to the service, which could be time-consuming and error-prone. Users may also be more comfortable relying on the cron utility instead of a system that claims to function like it.</li>
</ul>
<br>
<h4>DIY</h4>
<br>
<p>A developer can also choose to build their own solution by copying the pattern used by the existing solutions. The pattern is to use HTTP requests for monitoring: a request will be sent before and after a job has been executed. The requests indicate if a job started and ended, and can be used to gather other details, like error messages.</p>
<br>
<p>A DIY solution does not come with a monitoring interface or the other features that the named services offer, and setup and maintenance require additional work. Still, it is customizable and contained wholly on the user’s own system.</p>
<br>
<h3>3.2 Sundial</h3>
<br>
<p>In exploring the options above, we saw a need for a product that would work for a user that:</p>
<ul>
<li>Has existing cron jobs and does not want to use a cron-like service</li>
<li>Prioritizes owning their data (no third party)</li>
<li>Seeks out cost-effective solutions</li>
<li>Wants the option to monitor and manage their jobs in one place</li>
</ul>
<p>To meet this use case, we based our decisions on the following goals:</p>
<ol>
<li><strong>Control</strong> <br>
We created Sundial as an open-source and self-hosted product, allowing users to maintain full control over their code and data.
</li>
<li><strong>In-Depth Cron Monitoring</strong> <br>
Our monitoring system focuses on the cron utility. Sundial discovers existing cron jobs automatically through a script the user runs from the CLI, and it gathers each job’s start time, duration, and error logs at runtime.
</li>
<li><strong>Cron-Based Management</strong> <br>
Sundial offers a centralized platform for monitoring and managing all of a user’s cron jobs. Any modifications made to jobs within the user interface (UI) will update the crontab automatically.
</li>
</ol>
<br>
<p>In summary, Sundial is an open-source, self-hosted solution that focuses specifically on the cron utility and provides:</p>
<ul>
<li>reliable monitoring</li>
<li>centralized error logging</li>
<li>convenient job management from a UI</li>
</ul>
<img src="assets/images/case-study/1.8.svg" alt="alternatives comparison table" class="case-study-image resizable xxxx-large centered"/>
<br>
<h2>4 The Sundial System</h2>
<br>
<h3>4.1 Architecture</h3>
<br>
<p>Sundial provides cron job monitoring and management across one or multiple nodes.</p>
<br>
<h4>General Overview of Components</h4>
<br>
<p>The Sundial system consists of two main components:</p>
<ol>
<li>The Monitoring Service</li>
<li>The Linking Client</li>
</ol>
<img src="assets/images/case-study/2.1.svg" alt="monitoring service and linking client" class="case-study-image resizable medium centered"/>
<br>
<p>The following sections outline the Monitoring Service and the Linking Client, describing the components of each and giving a high-level overview of their roles.</p>
<br>
<h5>Monitoring Service</h5>
<br>
<p>The <strong>Monitoring Service</strong> is primarily responsible for actively monitoring the execution of cron jobs. Additionally, it provides users with an interface, offering a way to interact with and manage their cron jobs.</p>
<br>
<p>The Service consists of four components:</p>
<ul>
<li>a UI, accessible via the browser</li>
<li>an application server that exposes an API</li>
<li>Task Queues</li>
<li>a PostgreSQL database</li>
</ul>
<img src="assets/images/case-study/2.2.svg" alt="monitoring services components" class="case-study-image resizable xxx-large centered"/>
<p>We’ve containerized the Monitoring Service for straightforward deployment. The UI, database, and application server (including the Task Queues) are each encapsulated into a Docker image. A Docker Compose script runs them collectively as a single package.</p>
<br>
<h5>Linking Client</h5>
<br>
<p>The <strong>Linking Client</strong> serves as a link between the Monitoring Service and the crontab.</p>
<img src="assets/images/case-study/2.3.svg" alt="linking client links monitoring service and cron" class="case-study-image resizable xx-large centered"/>
<p>It consists of:</p>
<ul>
<li>a lightweight HTTP server known as the <strong>Listening Service</strong></li>
<li>a binary executable containing a collection of scripts</li>
</ul>
<img src="assets/images/case-study/2.3.5.svg" alt="linking client components" class="case-study-image resizable large centered"/>
<p>The Linking Client is packaged as a standalone binary executable. Users can install the Linking Client on any Linux server without additional dependencies.</p>
<br>
<p>Once installed, the Linking Client scripts can be executed by other processes, notably crond, or by the user via commands in the CLI.</p>
<br>
<p>After installation, the user must execute one such command: <code>sundial register</code>. This executes the registration script, which establishes the connection between the Monitoring Service and the Linking Client and configures the Listening Service to run as a background process.</p>
<br>
<h4>Adding Nodes</h4>
<br>
<p>Sundial accommodates both single and multi-node setups.</p>
<br>
<p>In a <strong>single-node configuration</strong>, the Monitoring Service and the Linking Client coexist on the same node.</p>
<img src="assets/images/case-study/2.4.svg" alt="single node architecture" class="case-study-image resizable xxx-large centered"/>
<p>For <strong>multi-node scenarios</strong>, additional nodes - termed <strong>remote nodes</strong> - are integrated through the installation of the Linking Client. The Monitoring Service, on the other hand, only runs on one designated node, referred to as the <strong>hub node</strong>.</p>
<br>
<p>When adding new remote nodes, the <code>sundial register</code> command is passed the IP addresses of both the hub node and remote node as arguments. These addresses are stored on both nodes and used to facilitate future communication.</p>
<br>
<p>If desired, the hub node can exclusively host the Monitoring Service, monitoring the cron jobs of remote nodes across a distributed network.</p>
<img src="assets/images/case-study/2.5.svg" alt="multi-node architecture" class="case-study-image resizable xxxx-large centered"/>
<br>
<h3>4.2 Job Monitoring</h3>
<br>
<p>Monitoring aims to detect issues promptly, such as errors during job execution or jobs failing to run. Sundial conveys this information through its UI, using color to highlight potential faults during job execution.</p>
<img src="assets/images/case-study/2.6.svg" alt="all monitored jobs listed page in UI" class="case-study-image resizable xxxx-large centered screenshot"/>
<br>
<p>The Monitoring Service creates a <strong>monitor</strong> entity for every monitored cron job.</p>
<br>
<h4>Awareness of Jobs and their Execution</h4>
<br>
<p>This section focuses on how the Monitoring Service documents job execution. To do this, the Monitoring Service requires two things:</p>
<ol>
<li>Prior knowledge of a user’s jobs and when they are due to execute.</li>
<li>Real-time notification of when jobs start, when they end, and when they encounter errors.</li>
</ol>
<p>The Linking Client provides both requirements to the Monitoring Service. Next, we explain the two scripts that enable the Linking Client to do so.</p>
<img src="assets/images/case-study/2.7.svg" alt="linking clients run and discover scripts" class="case-study-image resizable medium centered"/>
<h5>Prior Knowledge - <code>discover</code></h5>
<br>
<p>The Linking Client uses its <code>discover</code> script to provide the Monitoring Service with knowledge of jobs in a user’s crontab. The user executes the command <code>sundial discover</code> from the CLI to run the <code>discover</code> script.</p>
<br>
<p>The <code>discover</code> script sends information about each job in the crontab file, such as the schedule and command, to the Monitoring Service. The Monitoring Service stores this information in its database.</p>
<video autoplay loop muted playsinline class="resizable xxxx-large" aria-label="monitor created in db for each cronjob in crontab">
<source src="assets/videos/case-study/2.8.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
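<p>The discovery step can be pictured with the sketch below, which reads crontab text and extracts a schedule and command for each job. It is an illustration under stated assumptions rather than Sundial’s actual implementation: the function name <code>parse_crontab</code> and the dictionary keys are hypothetical, and lines that do not have five schedule fields plus a command (comments, blank lines, simple environment settings, <code>@</code>-style shortcuts) are skipped.</p>

```python
def parse_crontab(text: str) -> list[dict[str, str]]:
    """Extract the schedule and command of each job from crontab text."""
    jobs = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        # Five whitespace-separated schedule fields, then the command.
        parts = line.split(None, 5)
        if len(parts) != 6:
            continue  # e.g. an environment setting such as MAILTO=...
        jobs.append({"schedule": " ".join(parts[:5]), "command": parts[5]})
    return jobs


sample = """# rotate logs nightly
MAILTO=ops@example.com

0 02 * * * /path/to/rotate-log
"""
for job in parse_crontab(sample):
    print(job)
```

<p>Each resulting dictionary holds the kind of information the Monitoring Service would store for a monitor.</p>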
<p>Additionally, the <code>discover</code> script sets up the real-time notifications of job execution that the Monitoring Service requires through a process called <strong>wrapping</strong> <a href="#toc-8-references">[7]</a>.</p>
<img src="assets/images/case-study/2.9.5.svg" alt="pre-wrapped crontab" class="case-study-image resizable large centered"/>
<p>As shown above, cron jobs are considered <strong>wrapped</strong> when the text <code>sundial run</code>, followed by a string of letters and numbers, has been inserted in between the schedule and the command.</p>
<br>
<p>The string of alphanumeric characters, known as the <strong>endpoint key</strong>, is used by the Monitoring Service to link execution notifications from a job with the job’s corresponding database entity.</p>
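<br>
<p>Wrapping can be sketched as a small text transformation on a crontab line. The code below is a hedged illustration, not Sundial’s source: the helper names are hypothetical, and the endpoint key format (12 random hex characters) is an assumption made only for this example.</p>

```python
import secrets

WRAPPER = "sundial run"


def new_endpoint_key() -> str:
    """Hypothetical key format: 12 random hexadecimal characters."""
    return secrets.token_hex(6)


def is_wrapped(line: str) -> bool:
    """True if the command part of the line already starts with the wrapper."""
    parts = line.strip().split(None, 5)
    return len(parts) == 6 and parts[5].startswith(WRAPPER + " ")


def wrap_line(line: str, endpoint_key: str) -> str:
    """Insert 'sundial run KEY' between the schedule and the command."""
    parts = line.strip().split(None, 5)
    if len(parts) != 6:
        raise ValueError(f"not a five-field cron job line: {line!r}")
    if is_wrapped(line):
        return line.strip()  # leave already-wrapped jobs untouched
    schedule, command = " ".join(parts[:5]), parts[5]
    return f"{schedule} {WRAPPER} {endpoint_key} {command}"


print(wrap_line("0 02 * * * /path/to/rotate-log", new_endpoint_key()))
```

<p>Checking <code>is_wrapped</code> first keeps the transformation safe to apply more than once to the same crontab.</p>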
<br>
<p>The following section will go over the <code>run</code> script in detail.</p>
<br>
<h5>Real-Time Notification - <code>run</code></h5>
<br>
<p>Once the Linking Client has wrapped a job, it can send notifications about its execution via the Linking Client’s <code>run</code> script.</p>
<br>
<p>The <code>run</code> script sends information to the Monitoring Service via requests to the Monitoring Service’s API; these requests are called <strong>pings</strong>.</p>
<br>
<p>There are <strong>three</strong> types of pings:</p>
<br>
<ul>
<li>When a job starts executing, the run script sends a <strong>start ping</strong> to the Monitoring Service</li>
<li>When a job finishes execution, the run script sends an <strong>end ping</strong> to the Monitoring Service</li>
<li>When a job encounters an error, the run script sends an <strong>error ping</strong> to the Monitoring Service</li>
</ul>
<br>
<p>These pings give the Monitoring Service real-time notification of when jobs start, end, or encounter errors.</p>
<br>
<hr>
<br>
<p>The remainder of this section is a detailed explanation of how the <code>run</code> script sends pings.</p>
<br>
<p>First, a refresher on how the cron utility executes cron jobs: the cron daemon executes anything following the schedule string. In the example cron job below, crond executes the <code>rotate-log</code> script directly.</p>
<img src="assets/images/case-study/1.2.svg" alt="pre-wrapped cron job: 0 02 * * * /path/to/rotate-log" class="case-study-image resizable large centered"/>
<p>When a job is wrapped, the schedule string is followed by <code>sundial run</code>, an endpoint key, and the original job script.</p>
<img src="assets/images/case-study/2.12.svg" alt="wrapped cron job: 0 02 * * * sundial run /path/to/rotate-log" class="case-study-image resizable xxx-large centered"/>
<p>With this setup, the cron daemon executes <code class="language-plaintext highlighter-rouge">sundial run</code> with two arguments: the endpoint key and the original job command. Recall that <code class="language-plaintext highlighter-rouge">sundial run</code> is simply a script installed as part of the Linking Client.</p>
<p>When launched, the <code class="language-plaintext highlighter-rouge">run</code> process sends a start ping to the Monitoring Service to notify the Service that the job has begun.</p>
<video autoplay loop muted playsinline class="resizable xx-large" aria-label="sundial run script sending start ping to monitoring service">
<source src="assets/videos/case-study/2.14.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<p>Next, the <code class="language-plaintext highlighter-rouge">run</code> process spawns a child process that executes the actual job script.</p>
<p>Since child processes inherit the environment variables of their parent processes, the <code>run</code> process inherits the user context set up by <em>crond</em> and passes that on to the job process. This means the job is executed in the same environment as if it were run directly by <em>crond</em>.</p>
<br>
<p>Once the job finishes executing, the job process returns with an exit code. The <code class="language-plaintext highlighter-rouge">run</code> process has access to this exit code because the job process is run as a child of the <code class="language-plaintext highlighter-rouge">run</code> process. The <code class="language-plaintext highlighter-rouge">run</code> process sends an end ping to the Monitoring Service.</p>
<video autoplay loop muted playsinline class="resizable xxxx-large" aria-label="sundial run script sending end ping to monitoring service">
<source src="assets/videos/case-study/2.16.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<p>Additionally, if the exit code provided by the job process signifies an error occurred, the <code class="language-plaintext highlighter-rouge">run</code> process sends the Monitoring Service an error ping. This ping is different from an end ping in that it contains the error log, if one is available, returned by the job process.</p>
<video autoplay loop muted playsinline class="resizable xxxx-large" aria-label="sundial run script sending error ping to monitoring service">
<source src="assets/videos/case-study/2.17.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<p>Finally, the <code>run</code> process exits.</p>
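<br>
<p>The sequence above can be summarized in a short sketch. It is not the Linking Client’s actual code: <code>run_job</code>, the <code>send_ping</code> callback, and the payload fields are hypothetical, and the HTTP request itself is left to the callback. The sketch follows one reading of the description, in which a failing job produces an end ping followed by an error ping.</p>

```python
import subprocess
import sys
from typing import Callable

# Hypothetical callback signature: (ping kind, endpoint key, payload).
Ping = Callable[[str, str, dict], None]


def run_job(endpoint_key: str, command: list[str], send_ping: Ping) -> int:
    """Run a job as a child process and report its lifecycle via pings."""
    send_ping("start", endpoint_key, {})
    # The child inherits this process's environment by default, just as
    # this process inherited its environment from crond.
    result = subprocess.run(command, capture_output=True, text=True)
    send_ping("end", endpoint_key, {"exit_code": result.returncode})
    if result.returncode != 0:
        # A non-zero exit code signals an error; include the error log.
        send_ping("error", endpoint_key, {"error_log": result.stderr})
    return result.returncode


sent = []
failing_job = [sys.executable, "-c",
               "import sys; sys.stderr.write('disk full'); sys.exit(2)"]
exit_code = run_job("abc123", failing_job,
                    lambda kind, key, payload: sent.append(kind))
print(exit_code, sent)  # 2 ['start', 'end', 'error']
```

<p>Because the job runs as a child of the wrapper, the exit code and any error output are available without changing the job script itself.</p>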
<br>
<p>In summary, the Linking Client's scripts allow the Monitoring Service to document the execution of cron jobs and store error logs.</p>
<br>
<h4>Awareness of Missed Execution</h4>
<br>
<p>Recall that, to monitor effectively, the Monitoring Service must document not only the execution of cron jobs but also any instances where a scheduled job fails to execute as intended.</p>
<br>
<p>A few reasons for irregular or failed job execution include:</p>
<ul>
<li>Host node resources could be depleted, preventing the cron daemon from running jobs</li>
<li>A job might be running for longer than usual - perhaps because it's processing a larger than normal data set or it’s stuck in an infinite loop</li>
<li>Manual changes to the crontab may not have been propagated to the Monitoring Service</li>
<li>Network issues</li>
</ul>
<br>
<p>When a job fails to execute, the Linking Client doesn't send any pings to the Monitoring Service. Despite this, the Service must maintain a record of this event. Solving this challenge is not straightforward because it requires the Monitoring Service to recognize the absence of a ping, also known as a <strong>missed ping</strong>.</p>
<br>
<p>The Monitoring Service uses <strong>Task Queues</strong> to deal with missed pings.</p>
<img src="assets/images/case-study/2.19.svg" alt="monitoring services components" class="case-study-image resizable large centered"/>
<p>Task Queues are implemented with <strong>pg-boss</strong> <a href="#toc-8-references">[3]</a>, an npm package built on PostgreSQL. Specifically, we leverage the <strong>deferred tasks</strong> feature, where tasks are added with a specified delay and are processed by a worker only after that delay has passed.</p>
<br>
<p>A <strong>worker</strong> is a function assigned to a queue. The Monitoring Service executes this function when 'processing' a task, passing in any additional data included in the task.</p>
<video autoplay loop muted playsinline class="resizable xxxx-large" aria-label="worker pulling task off queue when time reaches 0">
<source src="assets/videos/case-study/2.18.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<p>Above is an example of a deferred task in action: there are three tasks on a queue, and as time elapses, the delay eventually reaches zero, and the worker processes the task.</p>
<br>
<p>The Task Queues component consists of a Start Queue and an End Queue. The Monitoring Service uses tasks on these queues to keep track of the expected arrival times of each cron job’s start and end pings.</p>
<br>
<p>Every task on either Queue is added with a delay and waits for a specific ping from the Linking Client. If the ping arrives, the Monitoring Service removes the task. If the delay elapses without the arrival of the ping, a worker processes the task.</p>
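<p>The lifecycle described above (add with a delay, remove when the ping arrives, process when the delay elapses) can be modelled with a small in-memory queue. This is only a sketch of the behaviour: Sundial uses pg-boss, which persists tasks in PostgreSQL, and the <code>DeferredQueue</code> class and its method names are hypothetical.</p>

```typescript
// In-memory model of a deferred task queue. pg-boss persists tasks in
// PostgreSQL instead; this class only illustrates the lifecycle.
type Task<T> = { id: string; dueAt: number; data: T };

class DeferredQueue<T> {
  private tasks = new Map<string, Task<T>>();
  private worker: (data: T) => void;

  constructor(worker: (data: T) => void) {
    this.worker = worker;
  }

  // Add a task (or replace the one with the same id) that becomes due after delayMs.
  add(id: string, delayMs: number, data: T, now: number): void {
    this.tasks.set(id, { id, dueAt: now + delayMs, data });
  }

  // Called when the awaited ping arrives: the task is dropped unprocessed.
  remove(id: string): boolean {
    return this.tasks.delete(id);
  }

  // Hand every task whose delay has elapsed to the worker.
  processDue(now: number): void {
    for (const task of Array.from(this.tasks.values())) {
      if (task.dueAt <= now) {
        this.tasks.delete(task.id);
        this.worker(task.data);
      }
    }
  }
}
```

<p>Time is passed in explicitly as <code>now</code> so the behaviour is easy to follow: a task removed before its due time never reaches the worker, while a task left in place is processed once <code>now</code> reaches its due time.</p>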
<br>
<p>In the examples, we will elaborate on a few items the worker is responsible for. Critically, the worker documents that the Monitoring Service did not receive a specific job’s expected start or end ping.</p>
<br>
<h4>Start and End Queues</h4>
<br>
<p>Start and End Queues use tasks to recognize the absence of start and end pings, respectively.</p>
<br>
<p>A task exists in the Start Queue for every job in the Monitoring Service at all times. This is because, by definition, there is always a next expected start time for any given cron job.</p>
<br>
<p>A task is added to the End Queue only after a job’s start ping arrives. If the Monitoring Service does not receive a job's start ping, the Service’s logic dictates that an end ping should not be expected, and the End Queue is not used.</p>
<br>
<p>The most critical component of the Queues is the <strong>delay</strong> for each task. If the Monitoring Service doesn’t receive a ping within the specified delay, the associated task is processed, and the Service documents the ping as missed.</p>
<br>
<p>Calculating the delay for a Start Queue task:</p>
<blockquote>
<p>time until next scheduled execution of job + grace period</p>
</blockquote>
<br>
<p>The grace period is set to 5s to account for expected delays such as network latency or high load on the Monitoring Service.</p>
<br>
<p>Calculating the delay for an End Queue task:</p>
<blockquote>
<p>tolerable runtime + grace period</p>
</blockquote>
<br>
<p>Tolerable runtime signifies the maximum acceptable duration of the job. The default is set to 15s, but users can configure this in the UI to suit their preferences.</p>
<br>
<p>Since every task has a delay, these calculations occur every time a new task is added or when one is updated.</p>
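<p>The two delay formulas can be written out directly. In this sketch the function and constant names are hypothetical, and the time until the next scheduled execution is taken as an input, since deriving it from a cron schedule string requires a cron parser that is outside the scope of the example.</p>

```typescript
const GRACE_PERIOD_S = 5; // grace period stated in the text
const DEFAULT_TOLERABLE_RUNTIME_S = 15; // default tolerable runtime stated in the text

// Start Queue: time until next scheduled execution of job + grace period.
function startTaskDelay(secondsUntilNextRun: number): number {
  return secondsUntilNextRun + GRACE_PERIOD_S;
}

// End Queue: tolerable runtime + grace period.
function endTaskDelay(tolerableRuntimeS: number = DEFAULT_TOLERABLE_RUNTIME_S): number {
  return tolerableRuntimeS + GRACE_PERIOD_S;
}
```

<p>For a job whose next execution is 12 hours away, <code>startTaskDelay(12 * 3600)</code> gives 43205 seconds. With the default tolerable runtime, <code>endTaskDelay()</code> gives 20 seconds; with a tolerable runtime of one minute, <code>endTaskDelay(60)</code> gives 65 seconds.</p>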
<br>
<p>In summary, the Task Queues allow the Monitoring Service to document the missed execution of cron jobs.</p>
<br>
<h4>Utilizing the Data</h4>
<br>
<p>Through the methods discussed above, the Monitoring Service can record data about the execution of each job.</p>
<br>
<p>Execution data is organized as database entities called <strong>runs</strong>. For each expected execution of a cron job, Sundial creates a new run. Runs contain information regarding the existence or absence of start and end pings captured in the run’s <strong>state</strong>. The UI displays this data to the user.</p>
<br>
<img src="assets/images/case-study/2.20.svg" alt="runs log" class="xxx-large screenshot"/>
<br>
<br>
<p>The system accounts for seven distinct <strong>run states</strong>. Each state provides insight into the execution status, occurrences of errors and irregularities, and, taken together with other listed runs, the overall health of the cron job.</p>
<br>
<h4>Examples</h4>
<br>
<p>The detailed examples that follow outline the sequence of events triggered within the Monitoring Service when the Service:</p>
<ul>
<li><em>receives</em> an expected start ping</li>
<li><em>receives</em> an expected end ping</li>
<li><em>does not receive</em> an expected start ping</li>
<li><em>does not receive</em> an expected end ping</li>
</ul>
<br>
<h5>Pings Received</h5>
<br>
<p>Pings contain:</p>
<ul>
<li><strong>an endpoint key,</strong> used for matching a job to a monitor</li>
<li><strong>a run token,</strong> used for associating an executing job’s start and end ping</li>
</ul>
<br>
<p><strong>Start Ping Arrives:</strong></p>
<ol>
<li>The Linking Client sends a start ping before the job starts.</li>
<li>The Monitoring Service creates and stores a run. The run contains the supplied run token and is given a state: <code>started</code>.</li>
<li>The Monitoring Service displays information derived from the run state to any user viewing the UI using Server-Sent Events (SSE).</li>
<img src="assets/images/case-study/2.21.svg" alt="started run displayed in UI" class="case-study-image resizable xxxx-large centered screenshot" />
<li>To ensure that the Task Queues are constantly keeping track of when a job’s start or end pings are expected to arrive <em>next</em>, the Monitoring Service makes changes in <strong>both</strong> the Start and End Queues:</li>
<ul>
<li><em>Update</em> task in Start Queue:
<ol>
                <li>The Service uses the ID (in this case, 7) to find the task in the start queue that is associated with the correct monitor</li>
<li>The Service updates the delay to reflect the next expected arrival of a start ping (12 hours)</li>
<video autoplay loop muted playsinline class="resizable xxxx-large" aria-label="update start queue task when start ping comes in">
<source src="assets/videos/case-study/2.22.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
</ol>
</li>
<li><em>Create</em> task in End Queue:
<ol>
<li>The Service adds a task with the <strong>run token</strong> provided by the ping. The worker that processes this task uses the run token to ensure it alters the correct run entity in the database.</li>
<li>The Service sets the delay using the <strong>tolerable runtime</strong> (1 minute)</li>
<img src="assets/images/case-study/2.23.5.svg" alt="task added to end queue after start ping comes in" class="case-study-image resizable xx-large centered"/>
</ol>
</li>
</ul>
</ol>
<br>
<p><strong>End Ping Arrives:</strong></p>
<ol>
<li>The Linking Client sends an end ping once the job ends.</li>
<li>The Monitoring Service retrieves the run created by the start ping and updates its state to <code>completed</code>.</li>
<img src="assets/images/case-study/2.24.svg" alt="run completed displayed in UI" class="case-study-image resizable xxxx-large centered screenshot"/>
<li>The Service removes the associated End Queue task because the end ping came within the tolerable runtime.</li>
<ul>
<li><em>Remove</em> task from End Queue:
<video autoplay loop muted playsinline class="resizable xxx-large" aria-label="end queue task removed when ping arrives">
<source src="assets/videos/case-study/2.25.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
</li>
</ul>
</ol>
<br>
<h5>Pings Missing</h5>
<br>
<p><strong>No Start Ping Arrives:</strong></p>
<ol>
<li>The Linking Client does not send a start ping when the Monitoring Service expects one.</li>
<li>The delay on the task in the Start Queue elapses, and a worker processes the task.
<video autoplay loop muted playsinline class="resizable medium" aria-label="start queue task removed by worker">
<source src="assets/videos/case-study/2.26.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
</li>
<li>The worker creates a new run with the state <code>missed</code>.
<img src="assets/images/case-study/2.27.svg" alt="start ping missed displayed in UI" class="case-study-image resizable xxxx-large centered screenshot"/>
</li>
<li>The worker creates a new task in the Start Queue to recognize the next expected start ping.</li>
</ol>
<br>
<p><strong>No End Ping Arrives:</strong></p>
<ol>
<li>The Linking Client does not send an end ping when the Monitoring Service expects one.</li>
<li>The delay on the task in the End Queue elapses, and a worker processes the task.
</li>
<li>The worker retrieves the run created by the start ping and updates the state to <code>unresolved</code>.
<img src="assets/images/case-study/2.29.svg" alt="missed end ping displayed in UI" class="case-study-image resizable xxxx-large centered screenshot"/>
</li>
</ol>
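<p>The four scenarios above can be summarized as a small state transition function. This sketch covers only the four run states named in these examples, not all seven; the event names and the <code>nextRunState</code> function are hypothetical, and returning the current state unchanged for events that do not match a <code>started</code> run is an assumption made for the example.</p>

```typescript
// Only the four states that appear in the examples above.
type RunState = "started" | "completed" | "missed" | "unresolved";

// Hypothetical event names for the four scenarios.
type MonitorEvent = "startPing" | "endPing" | "startTaskElapsed" | "endTaskElapsed";

// current is null when no run exists yet for this execution.
function nextRunState(current: RunState | null, event: MonitorEvent): RunState | null {
  // A start ping creates a run in the started state.
  if (event === "startPing") return "started";
  // The Start Queue delay elapsed without a start ping: the worker creates a missed run.
  if (event === "startTaskElapsed") return "missed";
  // End events only apply to a run that has started.
  // Assumption: anything else is left unchanged in this sketch.
  if (current !== "started") return current;
  // End ping in time: completed. End Queue delay elapsed first: unresolved.
  return event === "endPing" ? "completed" : "unresolved";
}
```

<p>The queue operations that accompany each transition (updating the Start Queue task, creating or removing the End Queue task) are left out to keep the focus on the run state itself.</p>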
<br>
<h3>4.3 Job Management</h3>
<br>
<p>Sundial also provides cron job management. Management refers to the ability for users to add, edit, and delete jobs across one or multiple nodes from the UI.</p>
<br>
<p>Managing jobs through the UI reduces the risk of errors associated with manual crontab changes. It also adds user convenience by providing a centralized platform for interacting with one or multiple crontabs.</p>
<br>
<p>Each cron job is added or edited from its form, like the one shown here.</p>
<img src="assets/images/case-study/3.1.svg" alt="new job form" class="case-study-image resizable xxx-large centered screenshot"/>
<p>The form allows the user to see the schedule and command of a job clearly at a glance and makes it harder to modify the wrong cron job. Additionally, the schedule field includes an automatic Schedule Translator. It translates the schedule string to text in real time as a user enters data into the form. The Translator confirms the accuracy of the schedule, clarifying the cryptic cron schedule syntax.</p>
<img src="assets/images/case-study/3.2.svg" alt="syntax helper in UI" class="case-study-image resizable x-large centered screenshot"/>
<p>The Monitoring Service automatically synchronizes changes made to jobs from the UI with the crontab. In a multi-node setup, the user must specify the node when adding new jobs.</p>
<br>
<h4>Altering Crontab from UI</h4>
<br>
<p>New jobs or job updates entered in the UI are referred to as <strong>management data</strong>. This section explains how management data travels to its intended crontab. The components involved in management are:</p>
<img src="assets/images/case-study/3.3.svg" alt="components involved in management" class="case-study-image resizable xxx-large centered"/>
<ol>
<li>Dashboard</li>
<li>Database</li>
<li>Listening Service</li>
<li>Linking Client <code>update</code> script</li>
<li>crontab</li>
</ol>
<br>
<p>In this section, we refer to communications between the Dashboard and the Monitoring Service's database, as well as between the Linking Client and the Monitoring Service's database. The Dashboard and the Linking Client perform these interactions using HTTP requests to the API exposed by the Monitoring Service's application server. However, for the sake of simplicity, we do not make further reference to this and treat the discussion as if the Dashboard and Linking Client can interact with the database directly.</p>
<br>
<p><strong>Recall:</strong></p>
<ul>
<li>The Monitoring Service runs in a Docker container on one node, the hub node.</li>
<li>The user installs the Linking Client on each node with a crontab they want to integrate with Sundial.</li>
</ul>
<br>
<p>When the user inputs new management data to the Dashboard, the data is saved to the first persistent data store it encounters: the database.</p>
<video autoplay loop muted playsinline class="resizable large" aria-label="in monitoring service, data travels from UI through app servers API to database">
<source src="assets/videos/case-study/3.4.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<p>Next, the management data must travel from the Monitoring Service’s database to the appropriate crontab. Since the crontab and the Monitoring Service might reside on different nodes, the management data may have to travel over the network. Even in the single-node architecture, editing the crontab of the host machine directly from the Monitoring Service poses difficulties because the Monitoring Service runs in a Docker container.</p>
<br>
<p>To address this issue, the Linking Client includes an <code>update</code> script that fetches management data from the Monitoring Service’s database and writes it to the crontab. In the below diagram, arrows represent HTTP requests and responses.</p>
<video autoplay loop muted playsinline class="resizable xxx-large" aria-label="update script sending request to monitoring service for data and writing that to crontab">
<source src="assets/videos/case-study/3.5.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<p>The <strong>Listening Service</strong>, a simple HTTP server integrated into the Linking Client, executes the Linking Client’s <code>update</code> script.</p>
<br>
<p>
The Listening Service has one role: to await requests from the Monitoring Service. A request signals new management data is available in the Monitoring Service's database. When the Listening Service receives this request, it initiates the execution of the <code class="language-plaintext highlighter-rouge">update</code> script. Note that the <code class="language-plaintext highlighter-rouge">update</code> script is idempotent, guaranteeing consistent and predictable outcomes with each execution.
</p>
<p>Below is a diagram of the complete management data flow.</p>
<video autoplay loop muted playsinline class="resizable xxxx-large" aria-label="monitoring service sends request to listening service, which launches the update script, which requests new data from monitoring service and writes that to crontab.">
<source src="assets/videos/case-study/3.7.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<p>While this process may seem circuitous, we deliberately routed management data through the Linking Client’s <code>update</code> script. We will explain our design considerations in the Engineering Decisions section.</p>
<br>
<h2>5 Engineering Decisions</h2>
<br>
<h3>5.1 Monitoring</h3>
<br>
<h4>Trade-offs: Getting Information about Job Execution</h4>
<br>
<p>To get information about the execution of the job, we used the same pattern we saw from other task monitoring services: sending requests (pings) before and after job execution. Receiving a ping would indicate that a job successfully completed an action (it started or ended), while not receiving one would indicate a failure (failure to start or end).</p>
<br>
<p>We considered two options to send these pings: insert cURL GET requests into every job in the crontab, or write a wrapping script that includes the ping logic and the user’s command and have cron run it. We initially chose to use cURL requests for simplicity.</p>
<img src="assets/images/case-study/4.1.svg" alt="curl wrapped cron job" class="case-study-image resizable xxxx-large centered"/>
<p>The first cURL executes before the start of a job. The second cURL only executes if the script test-job.sh completes successfully.</p>
<br>
<p>Sending requests in this manner was quick to implement, giving us time to reflect on whether we could improve or add to the information we were getting from job execution.</p>
<br>
<p>We realized that writing a ‘wrapping’ script instead gave us more control because we could add logic around a job’s execution: we could gather error logs, send more specific information like a ping to indicate failure, and execute any additional actions (e.g. re-running a job, although we did not implement this).</p>
<br>
<p>The final implementation of our Monitoring Service uses this wrapping option.</p>
<br>
<p>Recall the <code>discover</code> and <code>run</code> scripts from the Linking Client: the user runs <code>sundial discover</code> to wrap their cron jobs, which involves adding <code>sundial run</code>. The <code>run</code> script invokes a child process to run the user’s specified script, which makes it possible for the script to send a ping when a job errors out and send error logs from that aborted process.</p>
<img src="assets/images/case-study/2.12.svg" alt="sundial run wrapped cron job" class="case-study-image resizable x-large centered"/>
<h4>Challenge: Unordered Pings</h4>
<br>
<p>When the Monitoring Service receives both start and end pings, the Service initiates operations on a shared database record, a run - either through an INSERT or UPDATE operation. However, the Monitoring Service can receive the pings out of order, i.e., receiving the end ping before the start ping, which adds complexity in handling these requests.</p>
<br>
<p>To address this issue, we engineered the Monitoring Service to always check for the presence of existing runs, regardless of whether a start or end ping is received. The Monitoring Service uses the existence and state of such runs to act appropriately if the pings arrive out of order.</p>
<br>
<h4>Challenge: Race Condition</h4>
<br>
<p>As mentioned, when the Monitoring Service receives a ping, it must update or create a run. A chain of subsequent actions follows this. The state of a run dictates these subsequent actions, which include Task Queue operations, user notifications, and database operations. The Monitoring Service’s application server housed all of this in our initial implementation.</p>
<br>
<p>However, we discovered that this approach led to race conditions, occasionally creating duplicate run entities. These race conditions arose in scenarios where the job duration was extremely brief, spanning only a fraction of a second, and the start and end pings arrived in quick succession.</p>
<br>
<p>To understand this, let's first look over how our initial implementation handled the arrival of a start ping.</p>
<ol>
<li>Upon the arrival of a start ping at the application server, the server initiates a SELECT query to retrieve a run using the run token present in the start ping.</li>
<li>The application server verifies if any rows were returned from the database.</li>
<li>With the database returning no rows, the application server executes an INSERT query to create a new run.</li>
<li>Finally, the application server appends a new task to the end queue, awaiting the arrival of an end ping.</li>
</ol>
<br>
<p>Let's look at how this translates when the start and end ping arrive almost simultaneously.</p>
<ol>
<li>Upon the arrival of the start ping at the application server, it initiates a SELECT query to the database.</li>
<li>An end ping arrives almost immediately. Due to the asynchronous nature of the database calls, the application server proceeds to handle this request without waiting for the completion of the start ping’s SELECT query. It triggers another SELECT query to check for an existing run in the database.</li>
<li>As the start ping's SELECT query yields no rows, the application server executes an INSERT operation into the database.</li>
<li>Simultaneously, the end ping's SELECT query, initiated before the beginning of the start ping’s INSERT, also returns no rows. Consequently, the application server executes an additional INSERT into the database.</li>
<li>After the start ping’s INSERT is complete, the application server adds a task to the end queue.</li>
<li>Upon surpassing the tolerable runtime, the task triggers, and the Monitoring Service inaccurately notifies the user that their job was not completed.</li>
</ol>
<br>
<p>This leads to the database containing duplicate runs with incorrect states in their respective entries. Furthermore, the Monitoring Service erroneously informs the user of job failure.</p>
<br>
<p>To resolve this issue, we moved the responsibility of updating or inserting a run to the database layer. Our approach used PostgreSQL’s UPSERT operation (<code>INSERT ... ON CONFLICT DO UPDATE</code>), which updates the matching row if one exists and inserts a new row otherwise. This guarantees that when the application server receives a ping, it can always retrieve an associated run, mitigating the chance of race conditions.</p>
<br>
<p>Additionally, the database returns the run along with an indication of whether an INSERT or UPDATE operation occurred, which allows the Monitoring Service to determine if subsequent actions concerning the Task Queue and user notifications need to be executed.</p>
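<p>The difference between the two approaches can be illustrated with an in-memory model. This is a sketch, not the actual implementation: the real fix runs inside PostgreSQL, whereas here an array and a <code>Map</code> stand in for the runs table, and the <code>Run</code> shape and function names are hypothetical. The first function replays the check-then-insert steps in the interleaved order described above; the second performs the lookup and write as one step keyed on the run token.</p>

```typescript
// Hypothetical run shape; the real runs table has more columns.
type Run = { runToken: string; state: string };

// Check-then-insert, replayed in the interleaved order from the list above:
// both SELECTs complete before either INSERT begins.
function naiveInterleaving(runToken: string): Run[] {
  const table: Run[] = [];
  const startSelect = table.filter((r) => r.runToken === runToken); // start ping's SELECT
  const endSelect = table.filter((r) => r.runToken === runToken); // end ping's SELECT
  if (startSelect.length === 0) table.push({ runToken, state: "started" });
  if (endSelect.length === 0) table.push({ runToken, state: "completed" });
  return table; // two rows for one run token
}

// Upsert modelled as a single step keyed on the run token. The inserted flag
// mirrors the database reporting whether an INSERT or an UPDATE occurred.
// Resolving the state when pings arrive out of order is omitted here.
function upsertRun(
  table: Map<string, Run>,
  runToken: string,
  state: string
): { run: Run; inserted: boolean } {
  const existing = table.get(runToken);
  if (existing !== undefined) {
    existing.state = state;
    return { run: existing, inserted: false };
  }
  const run: Run = { runToken, state };
  table.set(runToken, run);
  return { run, inserted: true };
}
```

<p>With the upsert, the first ping for a run token reports <code>inserted: true</code> and the second reports <code>inserted: false</code>, so only one run ever exists per token and the caller can decide which follow-up actions apply.</p>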
<br>
<h4>Trade-offs: Architecture for Missed Pings</h4>
<br>
<p>Designing the logic for handling missed pings revolved around tracking the next expected start time and the next expected end time. We ended up deciding between two different implementation options: one that used database attributes and another that used task queues.</p>
<br>
<p>The first approach was to make the next expected times attributes of the monitor entity in the database. The approach went as follows:</p>
<ol>
<li>A monitor is registered, and the Monitoring Service creates a corresponding entity in the db. The next expected start time and next expected end time attributes are calculated.</li>
<li>Each minute, the Monitoring Service checks both attributes on each monitor against the current time.</li>
<li>If the specified time of either attribute has lapsed, the Monitoring Service executes actions specific to the attribute type.</li>
<li>The Monitoring Service recalculates the next expected time of the attribute whose time had lapsed.</li>
</ol>
<br>
<p>The benefit of using this first approach is that it was lightweight – it did not require any additions to the architecture. The downside was that the logic was complex and was tightly coupled to monitors.</p>
<br>
<p>Instead, we chose to implement <strong>task queues</strong>. Although introducing a new component added complexity to the Monitoring Service’s architecture, it separated the missed-ping logic from monitors and decoupled missed-ping tasks from their execution.</p>
<br>
<h4>Trade-offs: Cache-Based vs. Database-Based Queues</h4>
<br>
<p>Once we had settled on using task queues, we had to decide on whether to use a database-based or a cache-based queue.</p>
<br>
<p>It’s important to note that we only needed one specific functionality of a task queue: <strong>deferred tasks</strong>.</p>
<br>
<p>Remember that when the Monitoring Service receives a ping, it needs to process subsequent actions, which include adding or updating tasks. One stress point is when there are many concurrent jobs because the Service receives the pings all at once.</p>
<br>
<p>A <strong>cache-based queue</strong> would allow the Monitoring Service to handle more concurrent jobs in a reasonable response time. This is due to its fast write times, which directly affect how quickly tasks can be added and updated. The downside is that we would be adding complexity to the architecture and not using any other cache functionality.</p>
<br>
<p>The <strong>database-based queue</strong> would not be able to handle as many concurrent jobs because querying a database is slower than querying a cache. Still, it would require no changes to the architecture since we could reuse our existing database.</p>
<br>
<p>We did not expect many concurrent jobs for our use case of individuals and small to medium-sized teams, so we opted for a database-based queue. (We discuss more details on the number of concurrent jobs Sundial supports in the Load Testing section.)</p>
<br>
<p>We chose <strong>pg-boss</strong>, a database-based job queue, to implement our start and end queues for the following reasons:</p>
<ul>
<li>Ideal for our stack: it was built for Node.js applications on top of PostgreSQL (we use both Node.js and PostgreSQL)</li>
<li>It gave us an easy way to implement a task queue without complicating our architecture</li>
</ul>
<br>
<h4>Sending out Notifications</h4>
<br>
<p>Notifications are challenging because the right time to send them varies from user to user. We identified two alternative notification methods:</p>
<ol>
<li>Sending notifications each time a job fails.</li>
<li>Triggering notifications on a state change, i.e., when a job starts failing or, after repeated failures, begins succeeding.</li>
</ol>
<br>
<p>The first method is more straightforward and reduces the risk of users missing crucial notifications for continuously failing jobs. However, jobs running at short intervals (e.g., every minute) could flood users with excessive notifications. Additionally, this approach doesn't inform users when a previously failing job recovers, which can be valuable information.</p>
<br>
<p>We opted for the second method to prevent overwhelming users with repeated notifications and to include alerts when a job recovers. However, a drawback is that if a job fails repeatedly, users receive a notification only for the initial failure.
To address this, we utilized the Task Queues to schedule a task at 9 AM every weekday. This task alerts the user to all their failing jobs, reducing the chance of missing a failure alert.</p>
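<p>The state-change rule can be sketched as a comparison between a job's previous and current outcome. The names below are hypothetical, and the sketch assumes a job that has not yet run is treated as not failing; the weekday digest of failing jobs is not modelled.</p>

```typescript
type JobNotification = "failing" | "recovered" | null;

// Notify only when the failing status changes between consecutive runs.
function notificationFor(wasFailing: boolean, isFailing: boolean): JobNotification {
  if (!wasFailing && isFailing) return "failing"; // job starts failing
  if (wasFailing && !isFailing) return "recovered"; // job begins succeeding again
  return null; // no change, no notification
}

// Replay a sequence of run outcomes (true = failed) and collect the notifications sent.
// Assumption: before its first run, a job is considered not failing.
function notificationsFor(outcomes: boolean[]): string[] {
  const sent: string[] = [];
  let wasFailing = false;
  for (const isFailing of outcomes) {
    const notification = notificationFor(wasFailing, isFailing);
    if (notification !== null) sent.push(notification);
    wasFailing = isFailing;
  }
  return sent;
}
```

<p>For the outcome sequence success, failure, failure, failure, success, this produces two notifications (one when the job starts failing and one when it recovers) rather than one per failed run.</p>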
<br>
<p>For a more comprehensive solution, future work could involve allowing users to customize their notification preferences via the UI. For example, the next version could include a setting that lets the Monitoring Service ignore initial job failures up to a specific count before triggering notifications or turn off notifications entirely for specific jobs.</p>
<br>
<h4>Runs Rotation</h4>
<br>
<p>With monitoring, there is an ever-increasing amount of data related to runs. Thus, we decided to limit the number of runs (on a per-monitor basis). We added a row rotation mechanism (similar to log rotation <a href="#toc-8-references">[4]</a>) to our runs table to ensure that the table did not grow indefinitely. We implemented a stored procedure that runs once weekly and deletes all runs except the 100 most recent for each monitor.</p>
<br>
<p>A stored procedure <a href="#toc-8-references">[5]</a> is a feature available in many RDBMS and is a grouping of SQL statements. Stored procedures have names used to call them and execute the group of statements, similar in function to a batch script. An additional job queue (called the <strong>maintenance queue</strong>) was created with pg-boss, and the stored procedure is scheduled using a deferred job to be processed at the time mentioned.</p>
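<p>The rotation rule itself (keep the 100 most recent runs per monitor, delete the rest) can be illustrated outside the database. Sundial implements it as a stored procedure in PostgreSQL; the in-memory version below is only a sketch of the same rule, and the <code>RunRow</code> shape and <code>rotateRuns</code> function are hypothetical.</p>

```typescript
// Hypothetical row shape; createdAt is a timestamp in milliseconds.
type RunRow = { monitorId: number; createdAt: number };

// Return the rows that survive rotation: the `keep` most recent runs per monitor.
function rotateRuns(runs: RunRow[], keep: number = 100): RunRow[] {
  const keptPerMonitor = new Map<number, number>();
  return runs
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .filter((run) => {
      const kept = keptPerMonitor.get(run.monitorId) ?? 0;
      if (kept >= keep) return false; // older than the newest `keep` runs
      keptPerMonitor.set(run.monitorId, kept + 1);
      return true;
    });
}
```

<p>Because the limit is applied per monitor, a monitor with few runs keeps all of them even when another monitor on the same table has thousands.</p>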
<br>
<h3>5.2 Management</h3>
<br>
<h4>Trade-offs: Source of Truth</h4>
<br>
<p>This section looks into how we chose our cron job information <strong>source of truth</strong>. A <strong>source of truth</strong> is an authoritative data store that other data stores duplicate if discrepancies arise.</p>
<br>
<p>Instead of managing from the Monitoring Service UI, a user might manually edit a job’s schedule from the crontab, leaving the crontab and Monitoring Service database with different cron job information. This creates errors in the monitoring data shown to the user since the job now executes on a different schedule than the Monitoring Service expects.</p>
<br>
<p>To reconcile these types of inconsistencies, we had to choose our <strong>source of truth</strong>:</p>
<ol>
<li>The crontab file</li>
<li>The Monitoring Service database</li>
</ol>
<p>Let's explore how both options would work if a user were to modify a job from the UI.</p>
<br>
<p>For option 1, the Monitoring Service must apply the UI modifications while recognizing the crontab as the source of truth. To do so, the Monitoring Service would begin by fetching the crontab’s jobs from the Linking Client. Only then would it consider the UI modifications, integrating them with the crontab jobs and sending the updated management data to both the database and the Linking Client to update the crontab.</p>
<br>
<p>For option 2, the Monitoring Service would simply use the UI modifications to update the source of truth (the database) and then send the management data to the Linking Client to overwrite the crontab.</p>
<br>
<p>We chose option 2 for its reduced complexity and network load; the Monitoring Service never fetches the crontab’s jobs. The downside to this choice is that the Linking Client overwrites new manual changes to the crontab with each sync with the Monitoring Service. Thus, users should exclusively manage their jobs through the UI to ensure their changes persist.</p>
<br>
<h4>Trade-offs: Listening Service vs Polling</h4>
<br>
<p>We had two options to send management data to the Linking Client: use the Linking Client Scripts to poll the Monitoring Service for new management data or have the Monitoring Service push modification notifications to the Linking Client.</p>
<br>
<p>The first option, polling, used a cron job to repeatedly send requests to the Monitoring Service's API for new job data.</p>
<br>
<p>We opted for the second option, the Monitoring Service pushing data only when available, and built out the Listening Service.</p>
<br>
<p>The downside of this decision was that we needed to add infrastructure on each remote server. On the other hand, frequent polling suffers from potential delays and increased server load. The Listening Service approach offers faster updates from UI to crontab and minimal network load.</p>
<br>
<h4>Trade-offs: Direct Transmission of Data vs Separation of Concerns</h4>
<br>
<p>This section examines our options for managing data flow to the crontab.</p>
<br>
<p>The simplest option was to have the Monitoring Service directly transmit updated cron job information to the Listening Service. The Monitoring Service makes an HTTP request to the Listening Service containing the data, triggering the Listening Service to invoke a Linking Client script that writes the data from the request body to the crontab.</p>
<br>
<p>This approach worked fine in the locally hosted version of our application. Still, it introduced a notable vulnerability over multiple nodes: if not appropriately secured, each Listening Service would become a potential entry point for malicious actors to inject unauthorized cron jobs.</p>
<br>
<p>As seen in the Management section, this vulnerability led us to an alternative set of steps:</p>
<ol>
<li>The Monitoring Service would send a request to the Listening Service notifying it that there were new modified jobs (but would not include the management data itself)</li>
<li>The Listening Service would trigger the <code>sundial update</code> script to request those job modifications from the Monitoring Service</li>
</ol>
<p>While more complex, this decoupling isolates the notification of new job modifications from the actual job modifications. Management data is transmitted only as a response to a request to the Monitoring Service's API, which is secured using API Keys (discussed in the Security Considerations section).</p>
<br>
<p>As long as the Monitoring Service and remote server remain secure, this approach hinders malicious actors from altering our crontabs via requests to the Listening Service.</p>
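<p>The notify-then-pull flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sundial's actual code: <code>fetch_jobs_from_hub</code>, <code>ListeningService</code>, the key, and the job line are all hypothetical stand-ins, and the HTTP layer is replaced by plain function calls.</p>

```python
import hmac

# Hypothetical values for illustration only.
HUB_API_KEY = "example-key"
HUB_JOBS = ["*/5 * * * * /opt/backup.sh"]


def fetch_jobs_from_hub(api_key):
    """Stand-in for the Monitoring Service API: returns jobs only for a valid key."""
    if not hmac.compare_digest(api_key, HUB_API_KEY):
        raise PermissionError("invalid API key")
    return list(HUB_JOBS)


class ListeningService:
    """Stand-in for the service on a remote node (assumed shape)."""

    def __init__(self, fetch_jobs, api_key):
        self._fetch_jobs = fetch_jobs
        self._api_key = api_key
        self.crontab = []

    def handle_notification(self, body=None):
        # The request body is deliberately ignored: nothing the caller
        # sends can reach the crontab. The notification only triggers a pull.
        self.crontab = self._fetch_jobs(self._api_key)


node = ListeningService(fetch_jobs_from_hub, HUB_API_KEY)
# A forged notification carrying a malicious job has no effect on the result.
node.handle_notification(body="* * * * * /tmp/malicious.sh")
print(node.crontab)
```

<p>In this sketch, the only way to change what lands in <code>crontab</code> is to change what the key-protected hub call returns, which is the property the decoupling is meant to provide.</p>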
<br>
<h4>Trade-offs: Data Payload</h4>
<br>
<p>When framing this section’s decision, we had one primary concern: what if there's an error when transferring management data to the crontab?</p>
<br>
<p>Consider this scenario: the Monitoring Service removes a job from the database and notifies the Linking Client accordingly. However, attempts by the Linking Client to delete the job result in an error, leaving the job to persist within the crontab. While the job won't exist in the database, it will keep running on the node, invisible and unmanageable through Sundial’s UI. That’s a big issue!</p>
<br>
<p>To account for this concern, we had to ensure that each sync would not only update the crontab with new management data but also remove any previous inconsistencies.</p>
<br>
<p>With this in mind, we had to decide what the Monitoring Service's API should respond with to the Linking Client's sync requests:</p>
<ol>
<li>only new management data</li>
<li>all cron job data associated with the node</li>
</ol>
<p>First, we considered option 1. To resolve previous inconsistencies on each sync, the Linking Client would send a copy of the crontab file to the Monitoring Service when requesting updates. The Monitoring Service would then parse this copy to determine what actions the Linking Client needed to take, accounting for any inconsistencies in the crontab and new updates from the UI.</p>
<br>
<p>For option 2, the Linking Client would overwrite the crontab entirely with all the cron job information provided by the Monitoring Service during each sync, addressing any prior inconsistencies.</p>
<br>
<p>We chose option 2 for its simplicity. The Monitoring Service didn't need to parse the existing crontab, and the Linking Client didn't have to include a copy of the crontab in its request for updates.</p>
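<p>To illustrate why option 2 clears earlier inconsistencies, here is a minimal Python sketch. The function names and the <code>(schedule, command)</code> job shape are assumptions made for this example, and the step that actually installs the text as the crontab is left out.</p>

```python
def render_crontab(jobs):
    """Build the complete crontab text from (schedule, command) pairs."""
    return "".join(f"{schedule} {command}\n" for schedule, command in jobs)


def sync(current_crontab, jobs_from_hub):
    """Full-overwrite sync: the existing crontab text is never consulted."""
    return render_crontab(jobs_from_hub)


# A job that was deleted in the database but survived a failed removal.
stale_crontab = "* * * * * /opt/deleted-job.sh\n0 3 * * * /opt/backup.sh\n"

# Hypothetical response from the Monitoring Service: every job for this node.
jobs_from_hub = [("0 3 * * *", "/opt/backup.sh")]

new_crontab = sync(stale_crontab, jobs_from_hub)
print(new_crontab)
```

<p>Because the new crontab is derived only from the hub's response, the leftover job disappears on the next successful sync without any parsing or diffing.</p>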
<br>
<h3>5.3 Security Considerations</h3>
<br>
<p>This section will outline our approach to addressing security concerns when using Sundial across multiple nodes.</p>
<br>
<p>In the multi-node architecture, both the Listening Service of the Linking Client and the application server of the Monitoring Service must expose a port to the network to facilitate communication.</p>
<br>
<p>Of course, it’s best practice to secure any API accessible through the network, but it’s especially critical to secure the API provided by the Monitoring Service’s application server.</p>
<br>
<p>Access to the Monitoring Service’s API would allow a malicious actor to modify the crontabs of both the hub server and any connected remote servers. Additionally, unauthorized changes to monitoring data might disrupt job monitoring, or a flood of requests could overwhelm the system.</p>
<br>
<!-- <p><img src="assets/images/case-study/4.11.gif" alt="bad things"></p> -->
<p>Our primary recommendation is to deploy Sundial's multi-node setup within a Virtual Private Cloud (VPC). VPCs are private, isolated network environments within the cloud that ensure secure communication between nodes using private IP addresses. On many cloud providers, VPCs are region-specific, meaning a user's nodes cannot span regions.</p>
<br>
<!-- <p><img src="assets/images/case-study/4.12.png" alt="VPC"></p> -->
<p>To enhance adaptability, we've implemented initial security measures for scenarios where VPCs are not feasible. It's crucial to emphasize that these measures are only initial steps; the system should be hardened further before operating over the wider internet.</p>
<br>
<p>Authentication methods for the Monitoring Service’s API differ between the UI and Linking Client. UI requests require a JSON Web Token (JWT), while the Linking Client uses an API key.</p>
<br>
<p>JWTs and API keys must be used in tandem with HTTPS so that they are encrypted in transit. To enable HTTPS, the user must obtain an SSL/TLS certificate for the hub node.</p>
<br>
<p>Upon user login, the Monitoring Service generates a new JWT, which is stored in the user's browser. The JWT expires after a set time, after which the user must log in again to obtain a new one. Expiration limits how long a leaked token remains usable.</p>
<br>
<!-- <p><img src="assets/images/case-study/4.13.gif" alt="JWT"></p> -->
<p>API keys are distributed individually to remote nodes. Users must obtain a new key from the password-protected UI and register it to the node. The registration process involves running the <code>sundial register</code> command from the CLI of the target node and passing the key as an argument. The Monitoring Service’s database stores a hashed copy of the API key for authentication when processing requests.</p>
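<p>The idea of storing only a hashed copy of each API key can be sketched as follows. This is an illustrative example rather than Sundial's implementation: the source does not state which hash function is used, so SHA-256 is an assumption, and the function names are hypothetical.</p>

```python
import hashlib
import hmac
import secrets


def generate_api_key():
    """Create a random key to show to the user once (assumed key format)."""
    return secrets.token_urlsafe(32)


def hash_api_key(api_key):
    """Hash stored by the hub in place of the key itself (SHA-256 assumed)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_api_key(presented_key, stored_hash):
    """Hash the presented key and compare it in constant time."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)


api_key = generate_api_key()           # given to the user for registration
stored_hash = hash_api_key(api_key)    # the only value kept in the database

print(verify_api_key(api_key, stored_hash))
print(verify_api_key("not-the-key", stored_hash))
```

<p>With this arrangement, a leaked database row does not directly reveal a usable key, since only the hash is stored.</p>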
<!-- <p><img src="assets/images/case-study/4.14.gif" alt="API keys"></p> -->
<h2>6 Load Testing</h2>
<br>
<p>We conducted load testing on our application to ascertain the <strong>maximum number of concurrent cron jobs</strong> the Monitoring Service can handle effectively.</p>
<br>
<p>When interpreting the outcomes, we consider the <strong>worst-case scenario</strong>, assuming every job runs at the shortest interval the cron utility supports: once per minute.</p>
<br>
<p>Since our application is self-hosted, we had to choose the hardware for our tests carefully. We opted for the smallest available DigitalOcean droplet, with the following specifications:</p>
<ul>
<li>CPU Type: Regular Intel</li>
<li>vCPUs: 1</li>
<li>Memory: 1GB</li>
<li>Cost: <span>$</span>6/month</li>
</ul>
<h3>6.1 Utilizing Grafana k6</h3>
<br>
<p>For conducting the tests, we employed the Grafana k6 open-source load-testing tool, executed on a local machine belonging to one of our developers. k6 tests involve virtual users and a user-provided testing script.</p>
<br>
<p>This script is executed for a specific duration or a set number of times by a designated number of virtual users, which can operate sequentially or concurrently.</p>
<br>
<p>In our case, we structured the testing script to replicate the procedural sequence of the <code>sundial run</code> script.</p>
<br>
<p>It begins with a start ping, observes a specified interval, and concludes with an end ping. The script undergoes <em>n</em> iterations using <em>n</em> virtual users that operate concurrently, closely simulating the behavior of <em>n</em> monitored jobs executing on the same schedule.</p>
<video autoplay loop muted playsinline class="resizable medium-large" aria-label="Diagram of virtual users sending pings to the Monitoring Service simultaneously">
<source src="assets/videos/case-study/4.15.mp4" type="video/mp4" />
Your browser does not support the HTML5 Video element.
</video>
<p>There are two variables in our tests:</p>
<ul>
<li>The <strong>waiting period observed</strong> between start and end pings. This simulates the duration of the modeled jobs.</li>
<li>The <strong>number of iterations</strong>, simulating the total concurrent jobs.</li>
</ul>
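<p>The structure of the test can be mirrored in Python as a rough analogue. Our real tests ran as a k6 script, so this is not that script: <code>virtual_user</code>, <code>run_load_test</code>, and <code>fake_ping</code> are hypothetical names, and the pings are replaced by a stub that always succeeds.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor


def virtual_user(send_ping, job_id, wait_seconds):
    """One simulated job: start ping, wait, then a timed end ping."""
    start_status = send_ping("start", job_id)
    time.sleep(wait_seconds)  # models the duration of the job
    began = time.monotonic()
    end_status = send_ping("end", job_id)
    end_response_time = time.monotonic() - began
    return start_status, end_status, end_response_time


def run_load_test(send_ping, n, wait_seconds):
    """Run n virtual users concurrently, one iteration each."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [
            pool.submit(virtual_user, send_ping, job_id, wait_seconds)
            for job_id in range(n)
        ]
        return [future.result() for future in futures]


def fake_ping(kind, job_id):
    """Stub standing in for an HTTP request to the Monitoring Service."""
    return 200


results = run_load_test(fake_ping, n=5, wait_seconds=0.05)
print(len(results), all(r[0] == 200 and r[1] == 200 for r in results))
```

<p>The two test variables map directly onto the arguments: <code>wait_seconds</code> is the waiting period between pings, and <code>n</code> is the number of concurrent jobs.</p>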
<h3>6.2 Load Testing Objectives</h3>
<br>
<p>This section defines clear and measurable performance goals for our load testing. These objectives let us assess and confirm the application's performance under the strain of monitoring a substantial volume of concurrent jobs.</p>
<br>
<p>Our two primary objectives are to:</p>
<ul>
<li>Maintain a failure rate of 0%</li>
<li>Ensure that no jobs overlap</li>
</ul>
<p>A 0% failure rate means that every request sent to the Monitoring Service is processed successfully. To verify this, we check that every request returns a 200 status code.</p>
<br>
<p>Jobs overlap when they aren't completed before their subsequent scheduled execution, leading to multiple instances of the job running simultaneously. This is an often-cited cause of cron-related failures <a href="#toc-8-references">[6]</a>. We convert this objective into a quantifiable target by restricting response time.</p>
<br>
<p>For jobs set to run every minute, we determine this restriction using the formula:</p>
<blockquote>
<p>d + r &lt; 60</p>
</blockquote>
<p>Where <strong>d</strong> is the duration of the job in seconds, modeled by the waiting period, and <strong>r</strong> is the response time of the end ping in seconds. For example, if a job lasts 55 seconds, rearranging this inequality gives r &lt; 5, so the end ping must respond in under 5 seconds.</p>
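<p>As a small worked example of the overlap constraint, the following Python sketch computes the response-time budget for a given job duration. The function names are hypothetical and exist only for illustration.</p>

```python
SCHEDULE_INTERVAL_SECONDS = 60  # jobs scheduled every minute


def response_time_budget(job_duration_seconds, interval=SCHEDULE_INTERVAL_SECONDS):
    """Exclusive upper bound on r so that d + r stays below the interval."""
    return interval - job_duration_seconds


def overlaps(job_duration_seconds, response_time_seconds,
             interval=SCHEDULE_INTERVAL_SECONDS):
    """True when d + r reaches or exceeds the interval, so runs would overlap."""
    return job_duration_seconds + response_time_seconds >= interval


print(response_time_budget(55))  # 5
print(overlaps(55, 4.9))         # False
print(overlaps(55, 5))           # True
```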
<br>
<p>Rephrasing our core objectives more explicitly:</p>
<br>
<ul>
<li><strong>Status Code</strong>: Ensure all requests attain a 200 status code</li>
<li><strong>Overlap Limit</strong>: The combined job duration and response time of the end ping must stay below 60 seconds</li>
</ul>
<br>
<h3>6.3 Results</h3>
<br>
<p>We'll explore the outcomes derived from our load testing, focusing on simulations of two different job durations: one lasting 200 milliseconds and another extending to 55 seconds.</p>
<br>
<h4>Test 1: 200 Millisecond Job Duration</h4>
<br>
<p>The 200-millisecond job duration was the shortest among our tests. Its quick start and end pings increased the number of requests per second for the Monitoring Service, causing the longest response times among all the modeled scenarios.</p>
<figure>
<img src="assets/images/case-study/4.16.svg" alt="chart of 200-millisecond job duration" class="case-study-image resizable xxx-large centered screenshot"/>
<figcaption>Graph illustrating response durations for 200-millisecond job duration</figcaption>
</figure>
<p>In line with our objectives, here's what we discovered:</p>
<ul>
<li>Failures started at the 200-concurrent-job mark, where 5% of requests returned a non-200 status code.</li>