Single-channel speech enhancement by PSO-GSA with harmonic regeneration noise reduction

ABSTRACT


INTRODUCTION
Speech signals get contaminated by background noise, which degrades their clarity and intelligibility. In speech-dependent applications such as automatic speech recognition, speaker recognition, and speech-operated systems, speech quality is critical: these systems achieve better accuracy when the input speech contains minimal noise. In real environments, background noise, whether stationary or non-stationary, is everywhere and adds to the speech signal, producing noisy speech. As additive noise appears in many different forms, it is impossible to track and remove it completely [1], but it can be reduced. Speech enhancement systems reduce the additive noise present in noisy speech, improving speech quality and, in turn, the accuracy and efficiency of speech-dependent systems. Improvement in speech intelligibility is especially crucial in automatic speech recognition systems, yet conventional speech enhancement methods do not improve intelligibility in highly noisy environments [2]. The main challenge in speech enhancement is therefore to reduce the noise without distorting the speech content while still improving intelligibility.
Dual-channel speech enhancement systems give better performance but at extra cost, and neural-network-based speech enhancement methods require clean speech to train the system. A single-channel speech enhancement system is harder to implement because no noise reference is available. This research aims to implement a single-channel speech enhancement system based on a hybrid evolutionary algorithm that can improve speech intelligibility without distorting the speech, using no reference for either the background noise or the clean speech. The contributions of this research are: i) combining PSO and the gravitational search algorithm (GSA) into a hybrid algorithm so that the advantages of both can be exploited; ii) choosing the fitness function for the modified algorithm; iii) using this hybrid algorithm to decide the percentage of overlap between the noisy speech frames before applying two-step noise reduction (TSNR) and harmonic regeneration noise reduction (HRNR); iv) testing the proposed methodology on various kinds of additive noise at various SNR levels; and v) estimating the performance measures for each noise type and comparing the outcomes with common speech enhancement methods. This paper is organised as follows: Section 2 reviews the related work, Section 3 describes the materials and method, Section 4 presents the implementation results, and Section 5 concludes the paper.

LITERATURE REVIEW
Existing speech enhancement methods are classified as spectral subtractive, statistical-model-based, and subspace methods. Spectral subtraction for speech denoising, suggested in [2], suffers from musical noise. To reduce the musical noise, working in the modulation frequency domain is recommended in the literature, and spectral subtraction has been carried out there [3]. Model-based speech enhancement is implemented in [4]. By using a Kalman filter in the modulation domain to predict the spectral amplitudes of speech and noise, model-based speech enhancement is also accomplished in the modulation domain [5]. According to Buragohain et al. [6], sinusoidal modelling is used for speech analysis. A combination of convolutional and recurrent neural networks is used for single-channel speech enhancement [7]. The Wiener filter is the optimal estimator of the complex discrete Fourier transform (DFT) coefficients [8], but it does not estimate the optimal spectral magnitudes. Several optimal methods have been proposed in the literature to obtain spectral amplitudes from noisy observations, as these are important for speech intelligibility and quality [8]. The minimum mean square error (MMSE) spectral amplitude estimator is one such method. It requires an a priori signal-to-noise ratio (SNR) estimate, frequently obtained with the decision-directed (DD) technique. At low SNR, however, the DD method overestimates the true a priori SNR, which suppresses the speech signal [9]. A unique statistical model that takes into account the time correlation between successive speech spectral components is described in [9]. The two-step noise reduction (TSNR) technique, which addresses the aforementioned issues while retaining the benefits of the DD approach, is suggested in [10]: a second step is employed to eliminate the DD approach's bias, allowing a more precise estimate of the a priori SNR.
All short-time noise reduction techniques, including TSNR, produce harmonic distortion in the enhanced speech at low SNR [10].
Plapous et al. [11] showed that short-time noise reduction methods fail to restore speech harmonics at low SNR levels and implemented a new method that takes the harmonic character of speech into account: the output of a noise reduction method is processed again to regenerate the missing harmonics of the speech. This method has also been combined with a deep neural network (DNN) for single-channel speech enhancement [12]. Evolutionary algorithms can be used for noise reduction in speech because they do not depend on the system structure; they use only a fitness function and provide excellent solutions on rough, discontinuous, and multimodal surfaces [13]. Particle swarm optimization (PSO) chooses the search area for the subsequent iteration based on the knowledge currently available about the search space. PSO's exploitative character helps solve the problem, but it can also lead to premature convergence at local optima [14]. Different modifications of standard PSO have therefore been suggested to improve its efficiency, giving modified PSO (MPSO) [15], accelerated PSO (APSO) [16], and craziness-based PSO (CRPSO) [17]. Kunche et al. [18] propose a blend of PSO and the gravitational search algorithm (hybrid PSOGSA) to improve the output SNR relative to the input. A combination of spectral-filtering MMSE and PSO (MMSE-PSO) is suggested in [19].

MATERIALS AND METHOD

Particle swarm optimization
PSO was introduced in [20]. The particles are the possible solutions of the problem and are initialized randomly at the start [20]. The position of particle i in the t-th iteration is represented by the vector u_i(t) and its velocity by v_i(t). The velocity of the particle is given by (1):

v_i(t+1) = W v_i(t) + C1 r1 (p_i(t) - u_i(t)) + C2 r2 (g(t) - u_i(t)) (1)

where r1 and r2 are random numbers, C1 and C2 are the acceleration coefficients, and W is the inertia, i.e. the inclination of the particle to keep moving in the direction it moved in the last iteration. In each iteration, the objective function value is estimated for every particle, and based on this value the best particle of that iteration (the global best particle) is decided. For a maximization problem, the best particle is the one with the maximum objective function value, and the global best position g(t) is the location of the best particle of the swarm. To determine the local best value of a particle, which gives the local best position p_i(t), the particle's current fitness value is compared with its fitness value from the previous iteration.
Depending on the velocity, the particle positions are updated for the next iteration as per (2):

u_i(t+1) = u_i(t) + v_i(t+1) (2)

These steps are repeated and the algorithm runs through the iterations until the stopping criterion is satisfied, giving an optimal solution to the problem.
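The update loop above can be sketched as follows. This is a minimal illustrative implementation for a maximization problem; the parameter values (inertia, acceleration coefficients, bounds, swarm size) are placeholders, not the ones used in the paper.

```python
import numpy as np

def pso(fitness, dim, n_particles=30, iters=100,
        w=0.7, c1=1.5, c2=1.5, lb=-5.0, ub=5.0, seed=None):
    """Minimal PSO for a maximization problem (illustrative parameter values)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(lb, ub, (n_particles, dim))       # positions
    v = np.zeros((n_particles, dim))                  # velocities
    pbest = u.copy()                                  # local best positions
    pbest_val = np.array([fitness(p) for p in u])
    gbest = pbest[np.argmax(pbest_val)].copy()        # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Eq. (1): inertia + cognitive + social terms
        v = w * v + c1 * r1 * (pbest - u) + c2 * r2 * (gbest - u)
        # Eq. (2): move the particles, keeping them inside the bounds
        u = np.clip(u + v, lb, ub)
        val = np.array([fitness(p) for p in u])
        improved = val > pbest_val
        pbest[improved], pbest_val[improved] = u[improved], val[improved]
        gbest = pbest[np.argmax(pbest_val)].copy()
    return gbest, pbest_val.max()
```

For example, maximizing f(x) = -||x||^2 drives the swarm towards the origin.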

Gravitational search algorithm
Rashedi et al. [21] suggested a new high-performance heuristic algorithm named GSA, formulated on the law of gravitation. Here the objects are the agents, and their performance is measured by their masses: the heavier the mass, the better the solution. The force of gravity draws objects together, so all objects attract each other and move towards the objects with heavier masses, while the heavy masses themselves move slowly. The location of a mass represents a solution of the problem [21]. For an N-agent system, u_m = (u_m^1, u_m^2, ..., u_m^n) denotes the position of the m-th agent, where m = 1, 2, ..., N. The force on the m-th mass due to the k-th mass at time t is given by (3):

F_mk^d(t) = G(t) (M_m(t) M_k(t) / R_mk(t)) (u_k^d(t) - u_m^d(t)) (3)

where M_k(t) and M_m(t) are the gravitational masses of agents k and m respectively, u_m^d(t) and u_k^d(t) are the positions of the m-th and k-th agents in the d-th dimension, G(t) is the gravitational constant at time t, and R_mk(t) is the Euclidean distance between the two agents m and k.
The total force acting on agent m in the d-th dimension is given by (5):

F_m^d(t) = sum_{k=1, k != m}^{N} rand_k F_mk^d(t) (5)

where rand_k is a random number in the interval [0,1]. The acceleration of agent m at time t is given by (6):

a_m^d(t) = F_m^d(t) / M_m(t) (6)

where M_m(t) is the inertial mass of the m-th agent. The velocity and position of agent m are given by (7) and (8) respectively:

v_m^d(t+1) = rand_m v_m^d(t) + a_m^d(t) (7)
u_m^d(t+1) = u_m^d(t) + v_m^d(t+1) (8)

where rand_m is a uniform random variable in the interval [0,1]. G is initialized at the start and decreases with time to control the search accuracy:

G(t) = G0 exp(-s * iter / iterations) (9)

where G0 is the initial value of G, s is the damping constant, iter is the current iteration number, and iterations is the total number of iterations set by the technique. The masses are calculated using (10) and (11):

q_m(t) = (fit_m(t) - worst(t)) / (best(t) - worst(t)) (10)
M_m(t) = q_m(t) / sum_{k=1}^{N} q_k(t) (11)

where fit_m(t) represents the fitness value of agent m at time t, worst(t) and best(t) are the minimum and maximum fitness values at time t respectively for a maximization problem, and q_m(t) is an intermediate variable in the mass calculation.
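Equations (3)-(11) can be sketched in a compact vectorised form. This is an illustrative implementation for a maximization problem; the agent count, bounds, and iteration budget are placeholder values, and small epsilons are added (an implementation detail not in the equations) to avoid division by zero.

```python
import numpy as np

def gsa(fitness, dim, n_agents=30, iters=100, g0=1.0, s=23.0,
        lb=-5.0, ub=5.0, seed=None):
    """Minimal GSA for a maximization problem, following Eqs. (3)-(11)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(lb, ub, (n_agents, dim))
    v = np.zeros_like(u)
    eps = 1e-12
    best_u, best_fit = None, -np.inf
    for it in range(iters):
        fit = np.array([fitness(p) for p in u])
        if fit.max() > best_fit:                       # track the best-so-far
            best_fit, best_u = fit.max(), u[np.argmax(fit)].copy()
        g = g0 * np.exp(-s * it / iters)               # Eq. (9)
        q = (fit - fit.min()) / (fit.max() - fit.min() + eps)   # Eq. (10)
        m = q / (q.sum() + eps)                        # Eq. (11)
        diff = u[None, :, :] - u[:, None, :]           # u_k - u_m
        r = np.linalg.norm(diff, axis=2) + eps         # Euclidean distances
        pair = g * (m[:, None] * m[None, :] / r)[:, :, None] * diff  # Eq. (3)
        f = (rng.random((n_agents, n_agents, 1)) * pair).sum(axis=1) # Eq. (5)
        a = f / (m[:, None] + eps)                     # Eq. (6)
        v = rng.random((n_agents, dim)) * v + a        # Eq. (7)
        u = np.clip(u + v, lb, ub)                     # Eq. (8)
    return best_u, best_fit
```

On a simple quadratic objective the agents cluster around the heaviest (best) masses as G decays.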

Proposed algorithm
In conventional PSO, a potential solution to the problem is found through the algorithm's exploitative nature, but the solution may get trapped in local optima [20]. In GSA, the particles have no memory, since only the current position information is used in the update process [22]. Combining GSA with PSO adds memory to the particles. We therefore implemented a hybrid of GSA and PSO that overcomes the shortcomings of both algorithms [22] and used it as a pre-processing step for the standard combination of TSNR and HRNR. This proved beneficial in improving the intelligibility and clarity of the enhanced speech.
Initially the noisy speech is segmented into frames of 25 ms duration and windowed with a Hanning window; a 400-point FFT is then calculated for each frame. To decide the overlap percentage between the speech frames, we use PSOGSA. The parameters of PSOGSA, namely the number of agents (N), the total number of iterations (iterations), the dimension of the problem space (d), the constants C1 and C2, and G0, are initialised. Instead of generating the agents randomly, we treat them as the samples of the speech signal, so the number of agents equals the number of samples in the speech signal. The velocities of the agents are initialised randomly, and the current position of each agent is a function of the sample values of the speech signal. The upper and lower bounds on the agent positions are set at the start from the maximum and minimum values in the speech signal. In every iteration, the fitness function (objective function) of each agent is estimated from its current position. The algorithm finds the maximum value of the fitness function, which is the global best of that iteration, and the corresponding position is the global best position. The maximum fitness value is taken as the best fitness and the minimum as the worst. G is updated as per (9). For every agent, the mass is estimated from the agent's current fitness and the best and worst fitness values as per (10) and (11), and the force and acceleration are estimated as per (5) and (6) respectively. The new velocity of each agent is then found according to (12); the velocity equation is modified to combine PSO and GSA:

v_m(t+1) = W v_m(t) + C1 rand a_m(t) + C2 rand (g(t) - u_m(t)) (12)

According to the new velocity, the positions of the particles are updated by (13) when entering the new iteration:

u_m(t+1) = u_m(t) + v_m(t+1) (13)

The algorithm runs through all the iterations to find the optimised global best value.
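A single hybrid update step, combining the GSA acceleration with the PSO social term towards the global best, can be sketched as below. This is an illustrative helper, not the paper's code; the parameter values are placeholders.

```python
import numpy as np

def psogsa_step(u, v, acc, gbest, w=0.7, c1=0.5, c2=1.5, seed=None):
    """One hybrid PSOGSA update: GSA acceleration plus a PSO social term.

    u, v, acc : (n_agents, dim) positions, velocities, GSA accelerations
    gbest     : (dim,) global best position found so far
    """
    rng = np.random.default_rng(seed)
    r1 = rng.random(u.shape)
    r2 = rng.random(u.shape)
    # Velocity combines inertia, the GSA acceleration, and attraction to gbest
    v_new = w * v + c1 * r1 * acc + c2 * r2 * (gbest - u)
    # Position update for the next iteration
    u_new = u + v_new
    return u_new, v_new
```

Because the global best enters the velocity term, the particles retain a memory of the best solution found so far, which pure GSA lacks.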

Parameter selection
Since the number of agents (N) is large in our case, we selected a smaller number of iterations, equal to 20. G0 is kept equal to 1 and s equal to 23. The acceleration coefficients C1 and C2 are 0.5 and 1.5 respectively, satisfying C1 + C2 ≤ 4 [20]. Table 1 lists the parameters of the proposed framework.

Objective function
Here, the particles or agents are the samples of the input noisy speech sentence. From them, the linear prediction coefficients of a filter are found, and the signal is passed through the designed filter. The filter output is taken as the fitness function value of the individual particle, as given by (14).
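Since the exact form of Eq. (14) is not reproduced here, the following is only one plausible reading of the description above: compute linear prediction coefficients from the samples, filter the signal with the resulting analysis filter A(z), and score the filter output. The function names and the choice of output energy as the score are hypothetical.

```python
import numpy as np

def lpc_coeffs(x, order=10):
    """LPC coefficients via the autocorrelation (Levinson-Durbin) method."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1 : n + order]  # r[0..order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                     # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]     # order-update of the coefficients
        err *= 1.0 - k * k                 # prediction error decreases
    return a

def lpc_fitness(x, order=10):
    """Hypothetical fitness in the spirit of Eq. (14): pass the signal
    through the filter designed from its own LPC coefficients and score
    the output (here, the residual energy)."""
    a = lpc_coeffs(x, order)
    residual = np.convolve(a, x)[:len(x)]  # FIR filtering by A(z)
    return float(np.sum(residual ** 2))
```

For a strongly predictable signal (e.g. an AR process, as voiced speech approximately is), the residual energy is much smaller than the signal energy, so the score discriminates speech-like structure from noise.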
The algorithm finds the maximum value of the fitness function in each iteration, and the most optimised value is found within the twenty iterations. Using this value, the frame overlap percentage is calculated and used when applying the TSNR algorithm. Instead of keeping the frame overlap percentage fixed, we use the parameter optimised by PSO-GSA, which in turn gives the amount of overlap. This reduces the reverberation effect in the enhanced speech.
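The framing with an optimised overlap percentage can be sketched as follows. The 25 ms frames, Hanning window, and 400-point FFT come from the description above; the 8 kHz sampling rate is an assumption (it matches the NOIZEUS corpus used later), and overlap_pct=50 is just a placeholder for the PSO-GSA-derived value.

```python
import numpy as np

def frame_spectra(x, fs=8000, frame_ms=25, overlap_pct=50.0, nfft=400):
    """Hann-windowed framing with a configurable overlap percentage
    (the quantity optimised by PSO-GSA), followed by an nfft-point FFT."""
    frame_len = int(fs * frame_ms / 1000)              # 200 samples at 8 kHz
    hop = max(1, int(frame_len * (1.0 - overlap_pct / 100.0)))
    win = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    # Zero-padded real FFT: (n_frames, nfft // 2 + 1) spectra
    return np.fft.rfft(frames, n=nfft, axis=1)
```

A larger overlap gives more frames (a smaller hop), which is how the optimised overlap value changes the analysis/synthesis behaviour of the subsequent TSNR stage.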

Stopping criterion
We set the total number of iterations to 20; the optimisation settles within the first 6 to 7 iterations. The algorithm stops when it has passed through all the iterations, giving the optimised value of the global best. The most optimised fitness value given by the framework is used to decide the amount of overlap between the noisy frames. Speech harmonics are then preserved at low SNR levels while noise is removed, using the TSNR algorithm with HRNR.

RESULTS
The proposed algorithm's performance is evaluated on the noisy speech corpus (NOIZEUS) database [23]. The objective speech quality metrics are the perceptual evaluation of speech quality (PESQ) [24] and segmental SNR [25]. Sentences with three different types of additive noise at three SNR levels are taken from the NOIZEUS database; we used babble, car, and exhibition noise. The results are compared with those of the MMSE-PSO [19], log-MMSE, TSNR, and TSNR with HRNR algorithms [11]. Table 2 shows the PESQ values for the various approaches and demonstrates that the suggested algorithm improves PESQ more than the TSNR and TSNR with HRNR algorithms; the PESQ improvement at 0 dB is greater than at 5 and 10 dB. The improvement in segmental SNR is analysed in Table 3: the proposed framework gives the same amount of improvement in segmental SNR as TSNR and TSNR with HRNR while increasing PESQ. Figure 2 displays a PESQ comparison for the exhibition noise, showing that the suggested algorithm performs better at 0 dB and 5 dB than at 10 dB relative to the other algorithms. Figure 3 gives the spectrograms of the clean speech, the noisy speech, and the enhanced speech for a female speaker with 5 dB car noise. The spectrograms show that the noise is reduced in the enhanced speech, indicating the quality improvement; with the noise reduced, the output has clear formants. The method converges and produces the optimum fitness function value in six iterations.

CONCLUSION
We implemented a single-channel speech enhancement system. In the suggested technique, we utilised PSOGSA to determine the amount of overlap between the noisy speech frames before applying the TSNR algorithm with harmonic regeneration. The method increases PESQ, indicating an improvement in the intelligibility of the input speech. For all three noise classes, this improvement is more pronounced at 0 dB. Compared with car noise, we observed a greater improvement in PESQ for babble and exhibition noise. The increase in segmental SNR indicates the quality improvement of the enhanced speech, and the noise reduction is also clear from the spectrograms. The algorithm converges within 10 iterations. Hybrid combinations of evolutionary algorithms may be employed to optimise the parameters of existing speech enhancement algorithms to improve intelligibility.