Detection of acute stress caused by cognitive tasks based on physiological signals

Received Apr 6, 2021; Revised Jun 28, 2021; Accepted Aug 9, 2021
We report on the development of an automated detector of acute stress based on physiological signals. Our detector discriminates between high and low levels of acute stress accumulated by students when performing cognitive tasks on a computer. The proposed detector builds on well-known physiological signal processing principles combined with the state-of-the-art support vector machine (SVM) classifier. The novelty comes from the design and implementation of the signal pre-processing and feature extraction stages, which were purposely designed and fine-tuned for the specific needs of acute stress detection, and from applying existing algorithms to a new problem. The proposed acute stress detector was evaluated in person-specific and person-independent experimental setups using the publicly available CLAS dataset. Each setup involved three cognitive tasks that differ in subject matter and complexity. The experimental results indicated a very high detection accuracy when discriminating between acute stress conditions due to significant cognitive load and conditions elicited by two typical emotion elicitation tasks. Such functionality would also contribute towards a multi-faceted analysis of the dependence of work efficiency on personal traits, cognitive load and acute stress level.


INTRODUCTION
Stress is well known to modify the balance of the autonomic nervous system and to affect a variety of physiological processes in the human body. Stress increases the risk of injuries and death due to loss of concentration or risky behaviours. Work-related stress is a significant factor in triggering low performance, depression, and illness. Chronic stress exposure could impair the immune system and increase the risk of cardiovascular diseases, neurological disorders, etc. In the educational context and in the context of collaborative human-machine activities, work-related stress due to high cognitive load results in low performance and low efficiency on tasks that are otherwise easy to cope with.
The emotional and mental response to high cognitive load varies widely among people. It was reported that reduced examination results are often due to an increased stress level [1]. Stress exposure was reported to cause low physical activity, increased sweet consumption, sleep problems, and musculoskeletal pains [2]. Higher stress levels are associated directly with physiological changes and indirectly with poor health behaviours [3]. Unfortunately, direct assessment of the stress level is challenging, mainly due to the individuality of responses and the individual stress coping capacity. The physiological stress response is characterised by the activation of the hypothalamic-pituitary-adrenal (HPA) axis, which triggers various physiological processes to cope with a stressful situation and recover homeostasis [4]. In addition to cortisol release, stress also alters the level of serum leptin (an anti-obesity hormone). D. J. Haleem et al. [4] showed that serum leptin levels are positively linked with academic performance and proposed serum leptin levels as a stress perception biomarker. Other studies demonstrated that physiological signals could be used as indicators of stress and health status. Significant progress in detecting stress levels during driving was made by J. A. Healey and R. W. Picard [5], who utilised electrocardiography (ECG), electromyography (EMG), and galvanic skin response, also referred to as electrodermal activity (EDA), signals. J. Kim and E. Andre [6] developed an automated system that recognises emotions in several classes. These and many other studies paved the way for the wide use of technology in health and well-being monitoring applications. Compared to subjective assessment based on questionnaires and to intrusive medical methods, physiological approaches perform better in terms of efficiency, non-intrusiveness, and diagnostic ability.
Recent technological advances provided the means for the emergence of a great number of e-Health oriented services and applications [7]-[16]. Due to these advances, application developers can nowadays quickly design new products and services that incorporate a range of novel functionalities, such as continuous health monitoring, on-demand measurement of physiological states and conditions [17], data collection [18] and others. Building on robust machine learning methods and signal processing algorithms, which over the past decade were mastered as reliable instruments in laboratory conditions [19], these new functionalities have the potential to deliver significant social impact. This is not only because these innovations aim to improve medical diagnostics and treatment practices but also because they have the potential to redefine the overall workflow in national healthcare systems. Although recent advances in wearable devices allow monitoring of physiological stress indicators such as heart rate, sweating rate, and blood pressure, at this time there are no reliable direct methods for monitoring the affective and behavioural stress response. Generally, current stress-detection systems rely on physiological responses to stress or emotional stimuli, which are used to train machine-learning models to predict a subject's affective state or emotions.
In recent years, many researchers have made efforts to create databases and models for recognising negative emotions, cognitive load and stress based on physiological signals. Well-known examples include DEAP [20], MAHNOB-HCI [21], and ASCERTAIN [22]. These are multimodal datasets that encompass recordings of physiological signals from healthy subjects, induced by purposely selected audio-visual stimuli or cognitive tasks. Specifically, the DEAP dataset contains recordings of the emotional reactions of 32 participants. It provides over one hundred features extracted from electroencephalography (EEG), EDA, skin surface temperature, EMG, electrooculography (EOG), respiration, and photoplethysmography (PPG) signals. MAHNOB-HCI comprises face and body video recordings, eye gaze and audio signals, and EEG, EDA, ECG, respiration, and skin temperature recordings from 27 participants. ASCERTAIN includes recordings from 58 participants and is oriented towards evaluating the emotion-personality relationship and affect recognition. Two hundred features were extracted from ECG, EDA, EEG, and EMO signals for modelling emotional states and detecting five personality traits. The WESAD dataset is focused on wearable stress and affect detection. All physiological modalities were acquired in an experimental setup inducing three affective states: neutral (neutral reading task), stress (Trier Social Stress Test) and amusement (funny video clips).
In the present study, we aim to recognise acute stress caused by cognitive tasks. Building on previous research, we developed an automated detector for acute stress caused by a range of cognitive tasks. The novelty, described in section 2, lies in the design and implementation of the signal pre-processing and feature extraction stages, which were purposely crafted and fine-tuned for the specific needs of acute stress detection. To validate the detector, we experimented with three types of cognitive tasks characterised by different levels of abstraction, difficulty, and domain-specific knowledge, as reported in sections 3 and 4.

RESEARCH METHOD
The overall concept of the proposed automated detector of acute stress, based on evidence extracted from PPG and EDA signals, is shown in Figure 1. As shown, the workflow follows the conventional two-stage machine-learning strategy, which comprises a compact information representation stage and a classification stage. The information extraction process starts with physiological signal pre-processing, followed by peak detection and feature extraction steps. Next, the EDA- and PPG-based features are subjected to post-processing, which involves dynamic range normalisation and subset selection. The feature selection step reduces the feature vector size and eliminates the less relevant and redundant features. The feature vectors obtained in this manner are then fed to a classifier trained to discriminate between acute stress and other emotional conditions.

Peak detection
In brief, the significant peaks in the PPG and EDA signals are identified via purposely developed signal-specific peak detectors. Specifically, the PPG peak detection algorithm was inspired by the concept of the Mountaineer's peak detection algorithm [23]; however, the actual processing steps and their implementation differ significantly from the original method. The central idea behind our PPG peak detection algorithm (Figure 2) can be summarised as a signal follower, based on the backward differences of amplitude, which seeks to detect rising edges; the endpoints of these edges become candidate peak locations. The list of candidate peaks is then refined to eliminate those not located within a pre-specified expected range. Any peaks suspected of being inaccurately positioned are identified via a procedure for the detection of peak candidates, followed by a two-step procedure for fine-tuning. This process brings numerous advantages in terms of noise robustness, accuracy of peak detection and computational efficiency. The proposed algorithm does not require detrending or artefact removal, which makes it easy to implement on a variety of mobile platforms. The algorithm is free of complex adjustments and fine-tuning. The EDA signal peak detector [24] aims to identify the SCR peaks using front slope detection. In the first processing step, we separate the SCR and SCL components of the EDA signal (cf. Figure 3). Next, we search for the rising edges using a signal follower. If the distance between two rising edges is smaller than the threshold Tr (in samples), each peak candidate's amplitude is compared to a predefined amplitude threshold. The final processing step checks for zero crossings before and after each peak candidate.
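The signal-follower idea described above can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `find_candidate_peaks`, the `min_distance` parameter, and the single refinement rule (keep the higher of two candidates that lie too close together) are assumptions made purely for illustration, and the two-step fine-tuning procedure is omitted.

```python
def find_candidate_peaks(signal, min_distance):
    """Return indices of rising-edge endpoints, at least min_distance samples apart.

    Hypothetical simplification of a signal follower; not the published method.
    """
    candidates = []
    rising = False
    for i in range(1, len(signal)):
        diff = signal[i] - signal[i - 1]  # backward difference of amplitude
        if diff > 0:
            rising = True
        elif rising:
            # First non-increasing sample: the rising edge ended at i - 1.
            candidates.append(i - 1)
            rising = False

    # Refinement (assumed rule): of two candidates closer than min_distance,
    # keep only the one with the larger amplitude.
    peaks = []
    for idx in candidates:
        if peaks and idx - peaks[-1] < min_distance:
            if signal[idx] > signal[peaks[-1]]:
                peaks[-1] = idx
        else:
            peaks.append(idx)
    return peaks
```

In practice `min_distance` would be derived from the sampling rate and a physiologically plausible maximum heart rate; the source does not state the value used.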

Feature extraction
Consider a PPG signal with a duration of 50 s. The raw PPG signal is down-sampled and then filtered with a median filter. The R peaks are then detected automatically by the algorithm described in section 2.1. The time intervals between two successive R peaks are referred to as RR intervals or inter-beat intervals (IBI). On the basis of the obtained RR intervals, statistical features such as the mean and standard deviation (Table 1) are extracted. The variability of the RR intervals is estimated by means of frequency-domain analysis. After computing the power spectrum in three frequency bands, 1) the very-low-frequency band [0, 0.04] Hz; 2) the low-frequency band [0.04, 0.2] Hz; and 3) the high-frequency band [0.2, 0.4] Hz, the band-specific signal power is estimated. The ratio between the signal power in the low and high bands is used to estimate the heart rate variability (HRV).
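The following sketch shows one way the IBI statistics and the low-to-high band power ratio could be computed. It is an illustration under stated assumptions, not the authors' implementation: the 4 Hz resampling rate, the linear interpolation, the plain periodogram, and all function names are assumptions; only the band edges are taken from the text.

```python
import cmath
import math
import statistics

# Band edges in Hz, as given in the text.
BANDS = {"vlf": (0.0, 0.04), "lf": (0.04, 0.2), "hf": (0.2, 0.4)}

def ibi_from_peaks(peak_times):
    """Inter-beat intervals (s) from peak times (s)."""
    return [b - a for a, b in zip(peak_times, peak_times[1:])]

def time_domain_features(ibi):
    """Mean and sample standard deviation of the IBI series."""
    return statistics.mean(ibi), statistics.stdev(ibi)

def resample_linear(times, values, fs):
    """Linearly interpolate (times, values) onto a uniform grid at fs Hz."""
    n = int((times[-1] - times[0]) * fs) + 1
    out, j = [], 0
    for k in range(n):
        t = times[0] + k / fs
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        w = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out

def band_powers(series, fs):
    """Sum periodogram bins of the mean-removed series within each band."""
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]
    f_max = max(hi for _, hi in BANDS.values())
    powers = dict.fromkeys(BANDS, 0.0)
    for k in range(1, n // 2 + 1):
        f = k * fs / n
        if f > f_max:
            break
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for name, (lo, hi) in BANDS.items():
            if lo < f <= hi:
                powers[name] += abs(coef) ** 2 / n
    return powers

def lf_hf_ratio(peak_times, fs=4.0):
    """Ratio of low- to high-frequency power of the resampled IBI series."""
    ibi = ibi_from_peaks(peak_times)
    p = band_powers(resample_linear(peak_times[1:], ibi, fs), fs)
    return p["lf"] / p["hf"] if p["hf"] > 0 else float("inf")
```

The IBI series is resampled because beats are not evenly spaced in time, whereas the discrete Fourier transform assumes uniform sampling.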
In Table 2, we show the EDA-based statistical features, some of which (indexes 7-15) were extracted directly from the raw signal, while others were derived from the number of peaks found by the peak detector outlined in section 2.1. The power spectrum features (index 16) were estimated using the fast Fourier transform applied at the segment level. Afterwards, the raw EDA signal is low-pass filtered at 0.2 Hz to separate the tonic level of electrical conductivity, i.e., the skin conductance level (SCL), which reflects variations of arousal. Subtracting the SCL component from the raw EDA signal yields the skin conductance response (SCR). The SCR is the phasic component, which results from the activity of the sympathetic nervous system and refers to faster signal changes. The SCR peaks are then detected using the algorithm described in section 2.1.
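The tonic/phasic separation described above can be sketched as follows. The source specifies only the 0.2 Hz cutoff; the first-order recursive low-pass filter used here, and the name `split_tonic_phasic`, are assumptions chosen to keep the example short.

```python
import math

def split_tonic_phasic(eda, fs, cutoff_hz=0.2):
    """Split an EDA signal into SCL (low-pass output) and SCR (residual).

    Assumed filter: one-pole recursive low-pass; the paper does not name
    the filter type.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    scl = []
    state = eda[0]  # start at the first sample to avoid a start-up transient
    for x in eda:
        state += alpha * (x - state)
        scl.append(state)
    scr = [x - s for x, s in zip(eda, scl)]  # SCR = raw EDA minus SCL
    return scl, scr
```

By construction, the two components add up to the raw signal, so no information is lost in the split.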

Post-processing and feature selection
Dynamic range normalisation is applied to the feature vectors obtained so far. Based on the assumption of a Gaussian distribution for all features, the dynamic range normalisation is implemented by subtracting the mean value and dividing by the standard deviation. Next, the EDA features are scaled to the dynamic range [0, 1] by dividing by the maximum value. To discard features that are not relevant to the acute stress detection task, we carried out feature selection before the classification stage. For that purpose, we used the adaptive feature selection method outlined in [25] to evaluate the discriminative capability of individual features. A smaller feature vector brings two benefits: 1) a smaller dataset suffices for robust model creation and 2) the computational demand for model creation and classification is reduced.
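The two scaling operations described above can be sketched per feature column as below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator was used. Note that dividing by the maximum maps values into [0, 1] only when the column is non-negative.

```python
import statistics

def z_norm(column):
    """Subtract the mean and divide by the standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population estimator (assumption)
    if sigma == 0:
        return [0.0] * len(column)  # constant feature carries no information
    return [(v - mu) / sigma for v in column]

def max_scale(column):
    """Divide by the maximum value of the column."""
    peak = max(column)
    if peak == 0:
        return list(column)
    return [v / peak for v in column]
```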

Classifier
We trained binary detectors to discriminate between high and low levels of acute stress in person-specific and person-independent setups. All detectors used an SVM with a polynomial kernel, which implements the L1 soft-margin classifier trained with the sequential minimal optimization (SMO) method. In all experiments, we followed the leave-one-out method and fine-tuned the classifier's adjustable parameters with a grid search. The search ranges were set as follows: box constraint $C \in [10^{-6}, 10^{0}]$ with a step of $10^{0.2}$, tolerance $\varepsilon \in \{10^{-8}, 10^{-7}\}$, and polynomial kernel order $p \in \{1, 2, 3\}$.
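To make the size of the search space concrete, the grid can be enumerated as in the sketch below. It assumes that "step $10^{0.2}$" means a multiplicative step, i.e., the exponent of C advances by 0.2; the function name `build_grid` is hypothetical, and the SVM training itself is not shown.

```python
import itertools

def build_grid():
    """Enumerate (C, tolerance, polynomial order) candidates.

    Assumption: C = 10**e with e running from -6 to 0 in steps of 0.2,
    which gives 31 values.
    """
    c_values = [10.0 ** ((k - 30) / 5) for k in range(31)]
    tolerances = [1e-8, 1e-7]
    orders = [1, 2, 3]
    return list(itertools.product(c_values, tolerances, orders))
```

Under this reading, each grid search evaluates 31 x 2 x 3 = 186 parameter combinations per leave-one-out fold.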

EXPERIMENTAL PROTOCOL
We performed an experimental evaluation of the acute stress detector in person-specific and person-independent setups, using the resources described in section 3.1. The performance was evaluated in terms of detection Accuracy (section 3.2). In the person-specific setup, we report the Average Accuracy for

Dataset and protocol
In the current study, we used the CLAS dataset [26] because of its particular design: it allows evaluating an individual's ability to concentrate and successfully solve cognitive assignments under stress. The CLAS dataset contains a large number of recordings of the physiological response of university students in their twenties, acquired while they were performing three different cognitive tasks and two emotion-evoking tasks. Specifically, emotional responses were elicited through audio-visual and picture stimuli balanced across the four quadrants of the arousal-valence system. The two emotion elicitation tasks used 16 video clips and 16 pictures with known tags, inducing different emotions with a balanced distribution in the valence-arousal plane. The interactive cognitive tasks include (i) a Math test consisting of a sequence of 24 relatively simple mathematical problems, (ii) a Stroop test consisting of 30 instances, and (iii) an IQ test with 20 logical problems. The complexity, the duration of the stimuli, and the limited time for response in the cognitive tasks, followed by a brief display of the correct answer for each problem, were adjusted to build significant levels of acute stress. In contrast, the emotion elicitation tasks did not cause a high cognitive load and the associated acute stress because no response was required; the participants were expected only to watch the stimuli.
In the experimental validation, we used the whole blocks of PPG and EDA signals of the 56 students (16 females and 40 males) who have complete sets of recordings. The acute stress models were built using the stacked blocks recorded during the Math test, Stroop test and IQ test. The reference model representing the absence of acute stress was created from the stacked blocks of the non-interactive tasks, i.e., those associated with emotion elicitation via the music video clips and the picture set. We computed the PPG- and EDA-based features outlined in section 2.2 for signal segments with a duration of 120 s that overlap by 60 s. This frame size was selected to provide an adequate frequency resolution for the spectral-domain features (cf. Tables 1 and 2).
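The segmentation into 120 s frames with 60 s overlap can be sketched as follows; the function name `segment` and the choice to drop a trailing partial frame are assumptions. A 120 s frame gives a spectral bin spacing of 1/120 Hz, roughly 0.0083 Hz, so several bins fall even within the very-low-frequency band [0, 0.04] Hz.

```python
def segment(signal, fs, window_s=120, overlap_s=60):
    """Cut a signal into fixed-length, overlapping frames.

    A trailing partial frame is dropped (assumption; the source does not
    say how the end of a block is handled).
    """
    win = int(window_s * fs)
    hop = int((window_s - overlap_s) * fs)
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]
```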
In the person-specific setup, we computed the FLD-derived person-specific subsets of normalised EDA and PPG features using the methods outlined in section 2.3. For each person, we performed leave-one-recording-out experiments in order to better utilise the available dataset. Specifically, in the person-specific setup, we carried out four experiments with different settings of the feature normalisation and feature selection stage (cf. section 2.3): raw feature vector (Fraw), normalised feature vector (Fnorm), raw feature vector with adaptive FLD attribute selection (FrawFLD), and normalised feature vector with adaptive FLD attribute selection (FnormFLD). The raw feature vector consists of the original 21 PPG and 16 EDA features as computed (cf. section 2.2). The normalised feature vectors were obtained by applying the z-norm to the raw features. The adaptive FLD attribute selection was then applied to the raw and to the normalised feature vectors. Here, we aimed to find the optimal performance settings, as previous related studies did not agree on the benefit of normalisation.
In the person-independent setup, the dataset consisted of the merged feature vectors of all 56 persons. We carried out the experimental evaluation using leave-one-person-out shuffling of the available data. Specifically, we carried out experiments with raw (Fraw) and normalised (Fnorm) feature vectors using the entire set of EDA- and PPG-based features, which consists of 37 features (cf. section 2.2). We did not use the adaptive FLD attribute selection method because, during the person-specific study, we observed that the selected features varied significantly among people in both their number and composition. Thus, the EDA- and PPG-based features selected for different people were dissimilar, and there was no common subset suitable for most people.
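A leave-one-person-out split can be sketched as below. The generator name and the representation of the data as a flat list of per-segment person identifiers are assumptions for illustration; the point is that all segments of the held-out person are excluded from training.

```python
def leave_one_person_out(person_ids):
    """Yield (held_out, train_indices, test_indices) for each person.

    person_ids holds one identifier per feature vector (hypothetical layout).
    """
    for held_out in sorted(set(person_ids)):
        train = [i for i, p in enumerate(person_ids) if p != held_out]
        test = [i for i, p in enumerate(person_ids) if p == held_out]
        yield held_out, train, test
```

With 56 persons this produces 56 folds, each tested on data from a person the model has never seen.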
In both the person-specific and person-independent setups, we considered experiments using the data of the 40 Males, the 16 Females, and All 56 persons. The aim was to investigate potential gender-specific differences, if any.

Metrics
The accuracy of the person-specific detectors for acute stress was computed in percentages as the weighted sum of the class-specific accuracies obtained for the two classes (1), where TN is the number of true negative decisions, i.e., the number of correctly detected instances of low-level acute stress, and TP is the number of correctly detected instances with a high level of acute stress. N and P are the total numbers of low-level and high-level acute stress instances, respectively. For the person-independent detector, the Accuracy is likewise reported in percentages (2).
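One plausible reading of the "weighted sum of the class-specific accuracies" is sketched below. The weights are not recoverable from the source; the default used here, weights proportional to the class sizes, is an assumption, under which the expression reduces to the familiar (TN + TP) / (N + P). The function name and the optional weight arguments are hypothetical.

```python
def weighted_accuracy(tn, tp, n, p, w_low=None, w_high=None):
    """Weighted sum of the class-specific accuracies TN/N and TP/P, in percent.

    Default weights are the class proportions (assumption, not from the source).
    """
    if w_low is None or w_high is None:
        w_low, w_high = n / (n + p), p / (n + p)
    return 100.0 * (w_low * tn / n + w_high * tp / p)
```

For example, with 40 of 50 low-stress and 100 of 100 high-stress instances detected correctly, class-proportional weights give about 93.3%, whereas equal weights of 0.5 give 90%.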

RESULTS AND DISCUSSION
In the following subsections, we report the experimental results for the person-specific (section 4.1) and the person-independent (section 4.2) setups separately, as these correspond to different application scenarios.

Person-specific results
In Table 3, we show the Average Accuracy of acute stress detection in percentages for the four sets of features: the raw feature vector (Fraw), the normalised feature vector (Fnorm), the raw feature vector with adaptive FLD attribute selection (FrawFLD), and the normalised feature vector with adaptive FLD attribute selection (FnormFLD). As shown in Table 3, the highest Average Accuracy (99.72%) was observed for the normalised feature vectors with adaptive FLD attribute selection (FnormFLD), and this holds for all three subsets {All, Females, Males}. This high accuracy is due to the combined effect of the feature vector normalisation and the FLD-based selection of the person-specific subset of features. In the person-specific experimental setup, the z-norm permitted a beneficial selection of attributes with the FLD method. This normalisation, which eliminates the mean value of the features and scales their dynamic range, facilitated the model creation and the actual detection processes. In contrast, the FLD-based attribute selection applied directly to the raw data led to the selection of features that have a significant DC offset (such as the HR and the EDA-based features), and this caused suboptimal acute stress detection accuracy for all subsets: 94.75%, 95.50% and 94.72% for All, Females, and Males, respectively. The slightly higher Average Accuracy for Females (95.50%) is not significantly different and is due to the smaller number of women (16), which causes a lower resolution of the accuracy assessment. Finally, the experimental results for the raw feature vector (Fraw) and the normalised feature vector (Fnorm) without the adaptive FLD attribute selection are worse for all subsets. This is because these feature vectors contain parameters that are not relevant to the stress detection task or are not discriminative for the data of specific persons.

Person-independent results
In Table 4, we present the acute stress detection Accuracy in percentages for the two person-independent experiments, with raw (Fraw) and normalised (Fnorm) feature vectors. As shown in Table 4, a higher stress detection Accuracy was observed for the normalised feature vectors, Fnorm, than for the raw data, Fraw, and this holds for all three subsets {All, Females, Males}. Again, the z-norm was found beneficial for acute stress detection. This benefit comes from eliminating the mean value of the different parameters and unifying their dynamic range, which facilitates the modelling and classification stage. The detection Accuracy computed only for the Males is nearly identical to the one for All, which is understandable given that the All dataset is not gender-balanced: it contains 2.5 times more data from Males than from Females. It is even more interesting to note the effect of normalisation on the Females dataset, where the stress detection Accuracy for the raw data, Fraw, was much lower than those of All and Males, whereas the normalised feature vectors, Fnorm, provided a higher detection Accuracy than those of All and Males. This seemingly perplexing result is partially due to the relatively small size of that dataset, as there were only 16 Females. However, juxtaposing both results, we concluded that the observed detection Accuracy, 86.44% and 100% for Fraw and Fnorm, respectively, is mainly due to the higher variability of the mean value computed for the Females features, which is effectively eliminated by the z-norm. The slightly higher detection Accuracy (difference of 0.54%) observed for Males compared to Females is primarily due to the smaller size of the second dataset, which causes both a lower quality of the models and a worsened resolution of the Accuracy estimation.
Finally, quite surprisingly, the person-specific acute stress detection results (Average Accuracy = 99.72%) and the person-independent acute stress detection Accuracy (99.68%), computed for the dataset All, were not statistically different. The numerical similarity between the two results might be due to the relatively limited data in the person-specific setup, where the acute stress models may remain undertrained. This opens an interesting research direction for further studies.

CONCLUSION
We outlined the overall concept and the experimental validation of the proposed acute stress detection method based on EDA and PPG signals. The experimental results indicate that we can discriminate between low and high levels of acute stress caused by cognitive activities. The experimental evaluation in both person-specific and person-independent setups has validated the practical applicability of the proposed method in an experimental setup that approximates a personalised learning environment. Such functionality would facilitate the development of adaptive e-learning environments that use continuous real-time monitoring of acute stress levels. Estimating the acute stress level would allow the intensity of the learning process to be adapted, so that the system can manage situations in which a high cognitive load leads to reduced perceptive capability. Furthermore, such adaptability would help keep the trainee in the zone of high concentration and high motivation for a longer period, which would enhance learning performance.