Speaking without vocal folds using a machine-learning-assisted wearable sensing-actuation system – Nature.com

Design of the wearable sensing-actuation system

A thin, flexible, and adhesive wearable sensing-actuation system was attached to the throat surface, as shown in Fig. 1a, for speaking without vocal folds. This system comprises two symmetrical components: a sensing component (located at the bottom of the device) that converts the biomechanical muscle activities into high-fidelity electrical signals, and an actuation component (located at the top of the device) that uses the electrical signals to produce sound, as shown in Fig. 1b. Both components consist of a polydimethylsiloxane (PDMS) layer (~200 μm thick) and a magnetic induction (MI) layer made of a serpentine copper coil (with 20 turns and a diameter of ~67 μm). The serpentine configuration of the coil ensures the flexibility of the device while maintaining its performance, as discussed in Supplementary Note 1. The symmetrical design of the device enhances its user-friendliness. The middle layer of the device is the shared magnetomechanical coupling (MC) layer, made of magnetoelastic materials consisting of mixed PDMS and micromagnets. The MC layer, with a thickness of approximately 1 mm, is fabricated with a kirigami structure to enhance the device's sensitivity and stretchability (see Fig. S1). The entire system is small and thin (~1.35 cm³, with a width and length of ~30 mm and a thickness of ~1.5 mm) and lightweight (~7.2268 g) (see Fig. S2 and Supplementary Table S1).

a Illustration of the wearable sensing and phonation system attached to the throat. b Exploded diagram exhibiting each layer of the device design. c Two modes of muscle movement: expansion induces elongation in the x- and y-axis, while contraction induces elongation in the z-axis. Kirigami-structured device response to muscle movement patterns in the x and y directions (d) and the z direction (e): expansion results in x- and y-axis expansion with less deformation in the z-axis, while contraction results in less deformation in the x and y directions and expansion along the z-axis. f Detailed illustration of the magnetic field change caused by the magnetic particles. For one part, the angle change between each single unit of the kirigami structure is represented by θ. For the other part, the magnetic particle itself undergoes a torque caused by the deformation applied to the polymer (g), thus generating a change of magnetic flux and, subsequently, a current in the coil. Photos of the device in the muscle-expansion state are shown in h (x-, y-axis) and i (z-axis), and in the muscle-contraction state in j (x-, y-axis) and k (z-axis). Scale bars, 1 cm.

The multidirectional movement of the laryngeal muscles makes it essential to capture laryngeal muscle movement signals in a three-dimensional manner. Moreover, the learning process of phonation may be heterogeneous across populations: different people may adopt a variety of muscle patterns to achieve identical vocal movements45,53. Such complexity of muscle movement requires the device to capture the deformation of muscles not horizontally or vertically alone, but in a three-dimensional way. Figure 1c illustrates the movement of the muscle fiber during two stages, i.e., expansion and contraction. During the expansion phase, the muscle relaxes and elongates along the x- and y-axis. During the contraction phase, by contrast, the muscle shortens along the x- and y-axis while thickening along the z-axis through an increase in muscle fiber bundle diameters. Figure 1d, e demonstrates the device's response along the x-, y-, and z-axis, respectively. During the expansion phase, the kirigami-structured device expands in surface area with slight deformation along the z-axis. Conversely, during the contraction phase, the device resists deformation in the x- and y-axis and deforms along the z-axis. Thus, the device captures the muscle movement across all three dimensions by measuring the corresponding deformation, which changes the magnetic flux density and thereby induces an electrical signal in the MI layer. Supplementary Note 2 further demonstrates the response of the device to omnidirectional laryngeal movements and how the kirigami structure ensures the sensing performance.

The key defining characteristic of this system (the MC layer) is based on the magnetoelastic effect: a change in the magnetic flux density of a ferromagnetic material in response to an externally applied mechanical stress, discovered in the mid-19th century54. It has been observed in rigid metals and metal alloys such as Fe₁₋ₓCoₓ54, TbₓDy₁₋ₓFe₂ (Terfenol-D)55, and GaₓFe₁₋ₓ (Galfenol)56. Historically, these materials received limited attention within the bioelectronics domain for several reasons: the magnetization variation of magnetic alloys within biomechanical stress ranges is limited; the necessity for an external magnetic field introduces structural intricacies; and a significant mechanical modulus mismatch, of six orders of magnitude, exists between magnetic alloys and human tissue. However, a breakthrough occurred in 2021 when a pronounced magnetoelastic effect was observed in a soft matter system57. This system exhibited a peak magnetomechanical coupling factor of 7.17×10⁻⁸ T Pa⁻¹, an enhancement of up to fourfold compared to traditional rigid metal alloys, underscoring its potential in soft bioelectronics. Functionally, the MC layer converts the mechanical movement of the extrinsic laryngeal muscles into magnetic field variation, and the copper coils transfer the magnetic change into electrical signals through electromagnetic induction, operating in a self-powered manner. While additional power management circuits are essential for processing and filtering the signals, the initial sensing phase is autonomous and does not rely on an external power supply. After recognition by the machine learning model, the voice signal is output through the actuation system (Fig. 1a).
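The self-powered sensing stage described above is ordinary electromagnetic induction: a deformation-driven change of magnetic flux through the 20-turn coil induces an electromotive force via Faraday's law. The sketch below is a minimal numerical illustration of that relationship; the flux amplitude, modulation frequency, and sampling rate are our assumptions for illustration, not the paper's measured values.

```python
import numpy as np

# Faraday's law: EMF = -N * dPhi/dt, approximated by finite differences.
N_TURNS = 20  # coil turns, as described in the text

def induced_emf(flux, dt):
    """Induced voltage across an N-turn coil for a sampled flux trace."""
    return -N_TURNS * np.gradient(flux, dt)

# Toy flux waveform: a 5 Hz sinusoidal modulation at muscle-movement scale
# (1 uWb amplitude and 10 kHz sampling are illustrative assumptions).
dt = 1e-4
t = np.arange(0, 1.0, dt)
flux = 1e-6 * np.sin(2 * np.pi * 5 * t)

emf = induced_emf(flux, dt)
peak_emf = np.max(np.abs(emf))  # analytically ~ N * 2*pi*f * Phi0
print(f"peak EMF ~ {peak_emf*1e3:.3f} mV")
```

The peak EMF scales linearly with the number of turns and the rate of flux change, which is why faster muscle movements produce stronger signals without any external power.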

The signal conversion through the giant magnetoelastic effect in soft elastomers can be explained at both the micro and atomic scales. At the microscale, compressive stress applied to the soft polymer composite causes a corresponding shape deformation, leading to magnetic particle-particle interactions (MPPI), including changes in the distance and orientation of the inter-particle connections. The horizontal rotation of each subunit in the kirigami structure (Fig. 1d) and the vertical bending deformation (Fig. 1e) create a micro-scale change of magnetic flux density. In detail, as shown in Fig. 1f, within a subunit of the kirigami structure, the deformation-induced angle shift generates a concentration of stress and MPPI between the single units of the kirigami structure. At the atomic scale, mechanical stress also induces magnetic dipole-dipole interactions (MDDI), which result in the rotation and movement of magnetic domains within the particles. As shown in Fig. 1g, a torque is exerted on each magnetic nanoparticle, and the shift in angle generates the change in magnetic flux density. Photos of the device are presented in Fig. 1h, i for the x-, y-axis and z-axis response in the expansion phase, and in Fig. 1j, k for the contraction phase. Fig. 1h, j depicts the expansion and contraction in the x-y plane, and Fig. 1i, k depicts the corresponding z-axis contraction and expansion. This structural design also displays a series of appealing features, including high current generation, low internal impedance, and intrinsic waterproofness, which will be presented in the following sections.

Our present work is compared with previous approaches based on PVDF and graphene for flexible voice monitoring and emission, as shown in Fig. 2a and Supplementary Table S3 (refs. 35,36,37,58,59,60,61,62). The device developed in this work has similar acoustic performance, with a frequency range covering the entire human hearing range. However, it has a much lower driving voltage (1.95 V) and a Young's modulus of 7.83×10⁵ Pa. Fig. S3 shows the stress-strain curve and a testing photo of the material with and without the kirigami structure, which lowered the Young's modulus from 2.59×10⁷ Pa to 7.83×10⁵ Pa. This result ensures a higher comfort level while wearing, as the modulus of the device is very close to that of human skin. Notably, the device we developed has two unique features, stretchability and water resistance, which ensure the detection of horizontal movements, wearing comfort, and resistance to respiration. Additionally, the device does not suffer from temperature rise during use, preventing unexpected low-temperature scalding of users. Subsequently, several standard tests establish the sensing features of the device and its efficacy in outputting voice signals. To enhance the stretchability of the device, a kirigami structure was fabricated onto its MC layer. The unit design of the structure is shown in Fig. S4, and the stretchability with respect to the parameters of the kirigami unit design is exhibited in Fig. S5. Such an approach not only enhances the stretchability of the device to a maximum of 164% with a Young's modulus on the order of 100 kPa but also realizes isotropy. Furthermore, the structure enlarges the horizontal deformation of the device under unit pressure, generating a higher current output and enhanced detectable signals of extrinsic muscle contraction and relaxation, as shown in Figs. S6-S7.
The change in sensitivity brought by the structure along the vertical axis was also tested, and an elevation can be observed, as shown in Figs. S8-S9. Moreover, the isotropy prevents the device from being disturbed by random and uneven body movements in use. Thus, there is no requirement on wearing orientation, which elevates user-friendliness, as revealed in Figs. S10-S11.

a Performance comparison of different flexible throat sensors in terms of Young's modulus, stretchability, underwater sound pressure level, temperature rise, driving voltage, and working frequency range. b Pressure-sensitivity response of the device at varied degrees of stretching under different amplification levels; (arb. units) refers to arbitrary units. c Response time and signal-to-noise ratio of the device. d Variation of sound pressure level with distance from the device at different amplification levels. e Sound pressure level of the device with the resonance point highlighted in the human hearing frequency range, compared to the SPL threshold of normal human speech. f The right shift of the first resonance point towards high frequency with increasing strain. The performance test was repeated 3 times at each condition. Data are presented as mean values ± SD. g Relationship between kirigami structure parameters and actuating (first resonance point and sound pressure level)/sensing properties (response time and signal-to-noise ratio). The performance test was repeated three times at each condition. Data are presented as mean values ± SD. Waveform (h) and spectrum (i) comparison of a commercial loudspeaker (red) and the device (yellow) sound output at 900 Hz and maximum strain (164%).

The stretchable structure of the device was leveraged to examine its sensitivity with respect to the degree of deformation, as depicted in Fig. 2b. The sensitivity curve demonstrated consistency under varying strains, with a minor change observed under maximum strain (164%). This change could be attributed to the reduction in the MC layer's thickness due to deformation, which in turn decreases the magnetic flux density under the same pressure level, resulting in lower current generation. The device's response curve under different frequencies and forces of the shaker was tested, as shown in Fig. S12. We have also validated that the electric output of the device is not due to triboelectricity in Supplementary Note 3 (ref. 63). The device's inherent flexibility and stretchability facilitate tight adherence to the throat, yielding a high signal-to-noise ratio (SNR) and swift response time (Fig. 2c). In addition to the kirigami structure design parameters, other factors influencing the device's sensitivity, response time, and SNR were also evaluated. Fig. S13 illustrates that an increase in coil turns results in longer response times and lower SNR due to the increased total thickness of the copper coils. This thickness impedes the membrane's deformation during vibrations, leading to longer response times and lower signal quality. We further investigated the increase of thickness with the coil turn ratios in Supplementary Table S2. As the number of coil turns escalates, there is a direct correlation with the likelihood of copper wires stacking. Consequently, a significant number of samples exhibit thicknesses approximating 2 or 3 layers of copper (134 μm and 201 μm, respectively). This stacking effect amplifies the average coil thickness as the number of turns increases. However, this augmentation isn't strictly linear. For instance, the propensity for overlapping is less pronounced for turn ratios of 20 and 40.
In contrast, for turn ratios exceeding 60, a clear trend emerges where the likelihood of overlapping increases with the number of turns. The relationship between the sensing performance and the nanomagnetic powder concentration of the MC layer is presented in Fig. S14. A semi-linear relationship was observed, with higher magnetic nanoparticle concentrations generating a stronger magnetic field and, consequently, higher current output. The influence of varying PDMS ratios in the sensing membrane on the performance of the sensor is delineated in Fig. S15. An increase in the PDMS ratio was found to extend the response times and decrease the SNR while having a negligible effect on the sensitivity curve. The augmentation in PDMS ratio leads to a softer membrane, which deforms at a slower rate. Consequently, devices with higher PDMS ratios exhibit heightened sensitivity to noise-generating deformations, albeit with a slower response. The influence of membrane thickness on sensing performance was tested in Fig. S16, with thicker membranes resulting in quicker response times and a fluctuating SNR. Lastly, the impact of the MC layer's thickness was tested in Fig. S17. A thicker MC layer had no influence on response time but reduced the SNR. We have consolidated the results of each optimization factor in Fig. S18, providing a clear overview of the primary variables influencing each performance metric. After considering the sensing performance, weight, and flexibility of the device, the current parameters were determined. The device's durability with these parameters was evaluated in Fig. S19, where the device underwent continuous operation for 24,000 cycles with a shaker at a frequency of 5 Hz, with no observable degradation in current generation.
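The SNR and response-time figures of merit used throughout this section can be extracted from a recorded trace in a straightforward way. The sketch below illustrates one plausible set of definitions (RMS-based SNR in dB, and response time as the time to first reach 90% of the peak); the synthetic step-like signal, noise level, and 90% threshold are our assumptions, not the paper's measurement protocol.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB from the RMS of signal and noise segments."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    return 20 * np.log10(rms(signal) / rms(noise))

def response_time(trace, dt, threshold=0.9):
    """Time for the trace to first reach `threshold` of its peak magnitude."""
    peak = np.max(np.abs(trace))
    idx = np.argmax(np.abs(trace) >= threshold * peak)
    return idx * dt

# Synthetic example: a fast step-like response with additive noise
rng = np.random.default_rng(0)
dt = 1e-3                                   # 1 kHz sampling (assumption)
t = np.arange(0, 1.0, dt)
clean = 1.0 - np.exp(-t / 0.01)             # ~10 ms rise-time step
noise = 0.01 * rng.standard_normal(t.size)  # 1% noise floor (assumption)
trace = clean + noise

print(f"SNR ~ {snr_db(clean, noise):.1f} dB")
print(f"t90 ~ {response_time(trace, dt)*1e3:.1f} ms")
```

Definitions like these make the trade-offs in Figs. S13-S18 quantitative: a stiffer or heavier membrane changes the rise time of `trace`, while material choices shift the noise floor and hence the SNR.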

The acoustic performance of the actuation system of the device was examined first, with a focus on its sound pressure level (SPL) at different distances. The results, presented in Fig. 2d, show that larger output magnification led to a higher SPL at all tested positions. Even at a distance of 1 meter, the typical distance during normal conversation, the device provided an SPL of over 40 dB, which is above the lower limit of normal speaking SPL (40-60 dB)64. We also tested the device's SPL at different angles and compared its performance with those of previous acoustic devices (Fig. S20, Supplementary Table S3). The device's performance across various frequencies was tested and presented in Fig. 2e, which indicates that it could provide sound with an SPL louder than normal speaking loudness across the entire human hearing range64. The resonance point in the figure indicates the frequency at which the device has the largest loudness output under the same signal strength relative to adjacent frequencies. Further investigation into the SPL with respect to frequency under different strains revealed that the first few resonance points tended to have the largest acoustic output across the frequency range (Fig. S21). Since the device under a given strain has multiple resonance points that change non-linearly with deformation, investigating the change of every resonance point is complicated. Therefore, given this complexity and our interest in the highest output, we only investigated the first resonance point (FRP) in Fig. 2f. According to Fig. 2e and Fig. S22, the voice output at each strain was above the normal talking threshold across the whole human hearing range. Figure 2f revealed a right shift of the FRP of the device as the deformation gets larger, enabling the device to adjust its best output performance under different usage scenarios.
Our device can thus adjust its best output performance simply by changing the degree of deformation, creating a unique output setting for each individual and realizing user adaptability. More details about the right shift of the FRP are shown in Fig. S23.
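The SPL-versus-distance trend described above can be approximated, to first order, by free-field spherical spreading, under which SPL falls by about 6 dB per doubling of distance. The sketch below is a minimal model of that behavior; the reference level and reference distance are illustrative assumptions (chosen so the 1 m value sits above the 40 dB conversational threshold quoted in the text), and real measurements also include directivity and room reflections that this ignores.

```python
import math

def spl_at(distance_m, spl_ref_db=54.0, ref_m=0.25):
    """Free-field model: SPL(d) = SPL(d0) - 20*log10(d/d0).

    spl_ref_db and ref_m are assumed illustrative values, not the
    paper's calibration.
    """
    return spl_ref_db - 20 * math.log10(distance_m / ref_m)

for d in (0.25, 0.5, 1.0, 2.0):
    print(f"{d:>4} m : {spl_at(d):5.1f} dB")
```

The 20·log10 term encodes the inverse-square decay of acoustic intensity with distance, which is why each doubling of distance costs the same ~6 dB.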

We also tested the influence of introducing the kirigami design into the device, as presented in Fig. 2g. The results show that the parameters of the kirigami design had a negligible impact on the sensing and acoustic performance, further supporting the decision to use this design for its impact on flexibility (Fig. S5). Additional factors influencing the acoustic performance of the actuation system were evaluated, and the final parameters were determined based on both performance and the device's mass and flexibility. Fig. S24 explores the impact of coil turn ratios on the SPL produced by the device. It was observed that an increase in coil turns led to a decrease in SPL, likely because the weight of the additional coil impedes membrane vibration and subsequently reduces the SPL. The relationship between SPL and the PDMS ratio of the actuator membrane was examined in Fig. S25. As the ratio increased, the membrane softened, leading to a decrease in the generated SPL. The dampening effect of a softer membrane hindered vibration and sound generation, resulting in a semi-linear decrease. Fig. S26 presents the relationship between SPL and magnetic powder concentration. The device's SPL increased with the addition of higher amounts of magnetic powder in the MC layer, plateauing after a ratio of 4:1. The effect of varying MC layer thickness on SPL is shown in Fig. S27. A sharp increase in the device's SPL was observed as the MC layer's thickness increased from 0.5 mm to 1 mm. However, the increase slowed and eventually plateaued as the MC layer became thicker. Finally, the SPL under different actuator membrane thicknesses was tested in Fig. S28. The device's SPL increased as the PDMS membrane (vibrating membrane) thickness increased from 100 to 200 μm but decreased when the membrane became thicker. The weight of thicker membranes may dampen the vibration and reduce the loudness produced by the device.
Regarding the acoustic output quality of the device, Fig. 2h displays the waveforms of the commercial loudspeaker and our device at the maximum (164%) strain at a frequency of 1100 Hz. The device reproduced the voice signal accurately, even under maximum deformation, with only slight distortion. The distortion is further explained by the spectrum in Fig. 2i, which shows that noise around 1400 Hz was generated in the output of our device but was not strong enough to significantly distort the signal. The output at other strains was tested in Fig. S29; a similar but less pronounced distortion can be observed at lower strains. In the final phase of our study, we evaluated the water resistance of our device. The waveform of the device outputting an identical voice signal segment under water and in air is depicted in Fig. S30. The waveforms are notably similar, with no significant signal distortion observed. A slight loss of the high-frequency components, without major signal attenuation, is evident in the frequency domain (Fig. S31). The device demonstrated consistent performance even after being submerged in water for an accelerated aging test with a duration of 7 days (Fig. S32). The sound pressure level (SPL) in relation to distance underwater is presented in Fig. S33. A correlation was observed between the depth of the device underwater and the sound output, with deeper submersion resulting in lower output. However, the device could produce an output exceeding 60 dB when placed 2 cm underwater at a distance of 20 cm. The SPL of the device in relation to frequency underwater is illustrated in Fig. S34. Despite the attenuation of high-frequency components underwater, the device consistently delivered an SPL above the normal speaking range (60 dB) across the entire human hearing range. These results suggest that our device, as a wearable, can effectively withstand perspiration, damp environments, and rain exposure.

After obtaining the preliminary standard test results, we focused on collecting laryngeal muscle movement signals using our wearable sensing component. The experiment is schematically illustrated in Fig. 3a. The analog signal generated by the vibration of the extrinsic laryngeal muscles (the sternothyroid muscle, as shown in Fig. 3a) was collected by the sensor and then passed through an amplifier and a low-pass filter, as exhibited in Fig. 3b. The digital signal of the laryngeal muscle movements was output and collected for further analysis. The sensitivity and repeatability of the device were tested in Fig. 3c with two successive repetitions of different throat movements. The device was able to generate distinguishable and unique signals for each throat movement, indicating its feasibility for detecting and analyzing different laryngeal movement properties. Furthermore, the device responded consistently to a given throat movement, as demonstrated by the participant's two consecutive repetitions of each movement. In addition, larger throat muscle movements, such as coughing or yawning, generated larger peaks, while longer movements, such as swallowing, generated longer signals. We also conducted experiments to test the device's functionality under different conditions. In Fig. 3d, we asked the participant to voicelessly pronounce the same word ("UCLA") under different conditions, including standing still, walking, running, and jumping. The device was able to discern the unique and repeatable syllable waveform of each word, with only slight differences arising from the participant's varying pronunciation pace each time. Thus, the wearable device was able to function without being influenced by the user's body movements, even during strenuous exercise. Finally, to test the quality and accuracy of the signal acquired purely from laryngeal muscle movement, we performed examinations comparing normal speaking and voiceless speaking, as shown in Fig. 3e.
The five successive signals of a participant saying "Go Bruins!" with and without vocal fold vibration were compared in Fig. 3f and g, respectively. Both tests generated consistent signals, and the syllables of each word were represented by distinguishable waveforms. Comparing the test results of normal speaking and voiceless speaking, we observed only a slight loss of maximum amplitude in the voiceless signal. This could be explained by the fact that the vibration of the vocal folds requires more and stronger muscle movements, thus generating stronger signals. Furthermore, a clear loss of high-frequency components in the voiceless signals, compared to the signals with vocal fold vibration, was observed in Fig. 3h, i after the Fourier transform of both signals. This finding was consistent with our hypothesis that the high-frequency part of the vibration, generated by the intrinsic muscles and vocal folds, is absent in the voiceless signals, leaving a smoother yet distinguishable waveform. Hence, the device was proven to capture recognizable and unique laryngeal muscle movement signals for further analysis.
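The acquisition chain of Fig. 3b (amplify, low-pass filter, then inspect the spectrum with a Fourier transform) can be sketched numerically. In the sketch below, a simple first-order IIR filter stands in for the analog filter, and the gain, cutoff frequency, sampling rate, and toy two-tone signal are all illustrative assumptions rather than the paper's circuit values.

```python
import numpy as np

FS = 2000      # sampling rate in Hz (assumption)
CUTOFF = 100   # low-pass cutoff in Hz (assumption)

def lowpass(x, fs=FS, fc=CUTOFF):
    """First-order IIR low-pass: y[n] = y[n-1] + a*(x[n] - y[n-1])."""
    a = 1.0 - np.exp(-2 * np.pi * fc / fs)
    y = np.empty_like(x)
    acc = 0.0
    for n, v in enumerate(x):
        acc += a * (v - acc)
        y[n] = acc
    return y

def dominant_freq(x, fs=FS):
    """Frequency bin with the largest FFT magnitude (DC excluded)."""
    spec = np.abs(np.fft.rfft(x))
    spec[0] = 0.0
    return np.fft.rfftfreq(x.size, 1 / fs)[np.argmax(spec)]

# Toy trace: a 10 Hz "muscle" component plus 400 Hz interference
t = np.arange(0, 1.0, 1 / FS)
raw = np.sin(2 * np.pi * 10 * t) + 0.8 * np.sin(2 * np.pi * 400 * t)
filtered = lowpass(5.0 * raw)  # a gain of 5 stands in for the amplifier

print("dominant frequency after filtering:", dominant_freq(filtered), "Hz")
```

The same FFT step is what exposes the difference between the voiced and voiceless spectra in Fig. 3h, i: the low-frequency extrinsic-muscle content passes through, while high-frequency components are attenuated or absent.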

a Schematic illustration of the extrinsic muscle and its vibration. Created with BioRender.com. b Circuit diagram of the system for collecting the extrinsic muscle movement signal. c Sensor output for different throat movements: coughing, humming, nodding, swallowing, and yawning. d Device signal output for a participant pronouncing "UCLA" under different body movements. e Sensor output for a participant pronouncing "Go Bruins!" with vocal fold vibration (upper, gray) and voicelessly (lower, red). Enlarged waveform of the participant pronouncing "Go Bruins!" with vocal fold vibration (f) and voicelessly (g). Amplitude-frequency spectrum of the signal with vocal fold vibration (h) and voiceless (i).

With the generated laryngeal muscle movement data, a machine-learning algorithm was employed to classify the semantic meaning of the signal and select a corresponding voice signal for output through the actuation component of the system. A schematic flow chart of the machine-learning algorithm is presented in Fig. 4a. The algorithm consists of two steps: training and classifying a set of n sentences for which assisted speaking is required. First, the filtered training data were fed to the algorithm for model training. The electrical signal of each of the n sentences was compacted into an Nth-order matrix for feature extraction with principal component analysis (PCA) (Fig. 4b). N is determined by the sampling window, which is the length of the longest sentence's signal. PCA is applied to remove redundancy and prepare the signal for classification. Multi-class support vector classification (SVC) was chosen as the classification algorithm with a one-vs.-rest decision function shape. For each sentence to be classified, the remaining n-1 sentences were considered as a whole to generate a binary classification boundary that discriminates the target sentence. A brief illustration of the support vector machine (SVM) process is depicted in Fig. 4c. The margin of the linear boundary between the two target data groups undergoes a series of optimization steps and is maximized using support vectors. Details of PCA and multi-class SVC are discussed in Methods. After the classifier was trained with the pre-fed training data, it was used to classify newly collected laryngeal muscle movement signals. The real-time data were fed to the classifier, and the class (which sentence) of the signal was output for voice signal selection. Subsequently, the corresponding pre-recorded voice signal was played by the actuation component, realizing assisted speaking.
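The steps above (fixed-length windowing, PCA feature extraction, then multi-class SVC with a one-vs.-rest decision function) can be sketched with scikit-learn. The synthetic per-sentence signal templates, the number of principal components, and the RBF kernel below are our assumptions for illustration, not the paper's settings; the real pipeline is trained on recorded muscle-movement windows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
N_SENTENCES, REPEATS, WINDOW = 5, 40, 256

# Stand-in "muscle signals": one template waveform per sentence,
# with each repeat being a noisy copy of its template.
t = np.linspace(0, 1, WINDOW)
templates = [np.sin(2 * np.pi * (3 + k) * t) for k in range(N_SENTENCES)]
X = np.vstack([tmpl + 0.3 * rng.standard_normal((REPEATS, WINDOW))
               for tmpl in templates])
y = np.repeat(np.arange(N_SENTENCES), REPEATS)

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),                               # feature extraction
    SVC(kernel="rbf", decision_function_shape="ovr"),   # one-vs.-rest SVC
)

# Simple holdout split: the last 10 repeats of each sentence for testing
test_mask = np.tile(np.arange(REPEATS) >= 30, N_SENTENCES)
clf.fit(X[~test_mask], y[~test_mask])
acc = clf.score(X[test_mask], y[test_mask])
print(f"holdout accuracy: {acc:.2f}")
```

At inference time, `clf.predict` on a newly captured window plays the role of the "which sentence" decision that selects the pre-recorded voice signal for the actuator.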

a Flow chart of the machine-learning-assisted wearable sensing-actuation system. b Illustration depicting the process of data segmentation and principal component analysis (PCA) applied to the muscle movement signal captured by the sensor. Yellow indicates one sentence, and red indicates another. c Optimization process of data classification after PCA with the support vector machine (SVM) algorithm. d Contour plot of the classification results with SVM; class 1 indicates 100% probability of the target sentence, and dotted lines are the probability boundaries between the target sentence and the others. e Bar chart exhibiting 7 participants' accuracy on both the validation set and the testing set. f Confusion matrix of the 8th participant's validation set, with an overall accuracy of 98%. g Confusion matrix of the 8th participant's testing set, with an overall accuracy of 96.5%. h Demonstration of the machine-learning-assisted wearable sensing-actuation system in assisted speaking. The left panel shows the muscle movement signal captured by the sensor as the participant pronounces the sentence voicelessly, while the right panel shows the corresponding output waveform produced by the system's actuation component. i The SPL and temperature trends over time while the device is worn by participants; no notable temperature increase or SPL decrease was seen for up to 40 min. j The device's SPL while outputting participant-specific sound signals, both with and without sweat present. Each participant was asked to repeat the test N = 3 times for both scenarios. Data are presented as mean values ± SD. The p-value between the dry and sweaty states is 0.818, indicating no significant difference in the device's performance between the two cases. k The device's SPL across various conversation angles while worn by the participant. Created with BioRender.com.

A brief demonstration was made with five sentences that we selected for training the algorithm (S1: "Hi Rachel, how you are doing today?"; S2: "Hope your experiments are going well!"; S3: "Merry Christmas!"; S4: "I love you!"; S5: "I don't trust you."). Each participant repeated each sentence 100 times for data collection. The resulting contour plot in Fig. 4d shows an example of the classification result, with the red dots indicating the target sentence and the yellow dots indicating the others. A probability contour was drawn to classify whether a newly input sentence point belonged to the target sentence or not. With the trained classifier, the laryngeal movement signal was recognized as the corresponding sentence that the participant wished to express. To test the robustness and user-adaptability of the algorithm, the device was tested with eight participants, each repeating each sentence 120 times in total, with 100 repeats selected for the training set and 20 set aside as the testing set. Of the 100 repeats, 20 were selected as the validation set. Figure 4e shows the validation and testing results of seven of the eight participants, while Fig. 4f, g presents a detailed illustration of the confusion matrices of the 8th participant for the validation and testing sets, respectively. Although slightly lower than the validation accuracy, each participant's testing-set accuracy exceeded 93%. Figure S35 shows the detailed confusion matrices of both the validation and testing sets and the accuracy of every other participant. The overall prediction accuracy of the model was 94.68%, and it worked well across different participants. Each participant's voice signal was played by the actuation component, realizing the demonstration in Fig. 4h. The left panel shows the muscle movement signal being transferred into the correct voice signal, with the waveform shown in the right panel.
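The evaluation protocol above (120 repeats per sentence split into 80 training / 20 validation / 20 testing, with accuracy read off a confusion matrix) can be sketched as follows. The predictions in this sketch are simulated placeholders with a few injected errors, not real classifier output, and the shuffling seed is arbitrary.

```python
import numpy as np

def split_indices(n_repeats=120, n_test=20, n_val=20, seed=0):
    """Shuffle repeat indices, then carve off test and validation sets."""
    idx = np.random.default_rng(seed).permutation(n_repeats)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t_lbl, p_lbl in zip(y_true, y_pred):
        cm[t_lbl, p_lbl] += 1
    return cm

train, val, test = split_indices()
print(len(train), len(val), len(test))  # 80 20 20

# Simulated testing labels/predictions for 5 sentences x 20 repeats
y_true = np.repeat(np.arange(5), 20)
y_pred = y_true.copy()
y_pred[::15] = (y_pred[::15] + 1) % 5   # inject a few misclassifications
cm = confusion_matrix(y_true, y_pred, 5)
accuracy = np.trace(cm) / cm.sum()
print(f"test accuracy: {accuracy:.3f}")
```

Reading accuracy as the trace of the confusion matrix divided by its total is exactly how the per-participant figures in Fig. 4f, g summarize to a single number.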
Further, we extended our analysis to validate the practical usability of the device for vocal output after the selection of the correct voice signal by the algorithm. As demonstrated in Fig. 4i, an evaluation of the SPL and temperature of the device during use revealed no significant drop in SPL or rise in temperature, even after an extended working period of 40 min. This suggests the device is durable in voice output and safe to use. In Fig. 4j, we display the SPL of the device as it produces voice signals for seven participants, both with and without sweat. We noted consistent performance across different participants, with no evident signal attenuation despite the presence of perspiration. Finally, Fig. 4k illustrates the device's SPL during voice output at various normal conversation angles while worn by the participant. The device demonstrated reliable sound performance across all angles, thereby enabling assisted speaking in multiple real-life scenarios. In conclusion, the device can convert laryngeal muscle movement into voice signals, providing patients with voice disorders a feasible means to communicate during recovery.
