For scientists, faking or making up data has obvious connotations and, thanks to some high-profile cases of scientific misconduct, they're generally not positive ones. Chemists may, for example, be aware of a 2022 case in which a respected journal retracted two papers by a Japanese chemistry group that were found to contain manipulated or fabricated data. Or the case of Bengü Sezen, the Columbia University chemist who, during the 2000s, falsified, fabricated and plagiarised data to get her work on chemical bonding published, including fixing her NMR spectra with correcting fluid.
Synthetic data, however, is different from dishonestly made-up data: it is created in a systematic way, usually by a machine, and for a variety of legitimate reasons. Synthetic data is familiar to machine learning experts, and increasingly to computational chemists, but relatively unknown to the wider chemistry community, as Keith Butler, a materials researcher who works with machine learning methods at University College London in the UK, acknowledges.
'I don't think very many people at all, in chemistry, would refer to synthetic data,' he says, adding that it's probably a confusing term for chemists not just because of the apparent associations with scientific misconduct, but because 'synthetic' in chemistry has another important meaning, relating to the way chemical compounds are made. But why are chemists making up their data, how is it any different to the simulations they've been making for decades and what are the implications?
In areas like health and finance, synthetic data is often used to replace real data from real people due to privacy concerns, or to deal with issues of imbalance, such as when people from certain ethnic groups aren't well-represented. While researchers might like to use people's personal medical records, for example, to inform their understanding of a new disease, that data is difficult to access in the quantities or levels of detail that would be most useful. Here, the creation of synthetic data, which mirrors the statistical features of the real-world data, offers a solution. 'You're not taking a whole dataset and just masking it,' explains Benjamin Jacobsen, a sociologist at the University of York in the UK, whose work focuses partly on the use of synthetic data. 'The promise of synthetic data is that you can train a model to understand the overall distribution of the particular dataset.' In this way, synthetic data draws from real-world sources, but can't be traced back to real individuals. In the chemical sciences, though, synthetic data relates more to the behaviour of molecules than people and so, as Butler notes, it's used for a very different reason.
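The idea of learning a dataset's overall distribution and then sampling from it, as described above for health and finance data, can be illustrated with a deliberately simple sketch. It assumes a single numeric column modelled as a Gaussian; the values in `real_ages` and the function names are hypothetical, and real synthetic-data generators are far more elaborate than this.

```python
import random
import statistics

def fit_gaussian(values):
    """Summarise a numeric column by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from the fitted Gaussian (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical "real" measurements, e.g. ages in a small patient cohort.
real_ages = [34, 45, 29, 61, 52, 38, 47, 55, 41, 66]

mean, stdev = fit_gaussian(real_ages)
synthetic_ages = sample_synthetic(mean, stdev, n=1000)
```

The synthetic values follow the fitted statistics rather than copying any individual record, which is the property the passage describes. A one-column Gaussian is only an assumption made for illustration and says nothing about how well a given generator protects privacy in practice.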
In some ways, legitimately made-up data is nothing very new in chemistry. Areas such as materials and drug discovery, for example, have long used what's referred to as 'simulated' or 'calculated' data to expand the chemical space for exploration; the data might describe predicted properties of new materials or potential drug compounds. What's different now is that made-up data, whether it's considered synthetic, simulated or calculated, is being used in combination with machine learning models. These AI models are algorithms capable of making sense of vast quantities of data, which could be from real experiments or made up (or a combination of both). They learn patterns in the data and use them to make classifications and predictions that provide valuable insights to humans, whether or not it's clear how the machines have delivered them.
Synthetic data is not just an input for AI models; it's an output from them too. Jacobsen, in fact, previously defined synthetic data as data 'generated by algorithms and for algorithms', although chemists may not be concerned with having such a strict definition. Some of the techniques commonly used to create it are related to those used in making deepfakes. In the same way that deepfakers might ask their machines to generate realistic-looking faces and speech, chemists might prompt theirs to generate realistic-looking chemical structures.
Another option for generating synthetic data is large language modelling, the basis of generative AI tools like ChatGPT. This is an approach Butler's team recently used to build an app that can produce a hypothetical crystal structure for a compound based on a chemical formula, which is typed in as a prompt (like a question in a chat). More advanced versions of such tools, which would know more of the rules of chemistry, could prove invaluable in the search for new materials. 'If you could prompt [it] by saying "produce me a plausible chemical structure that absorbs light well and is made only from Earth-abundant elements", that's actually interesting,' says Butler. 'The problem being that you want to make sure that it's reasonable and viable so that you don't start telling people things that are impossible to make.'
Applications for synthetic data in the chemical sciences abound. However, for those who are less well-acquainted with machine learning, it can be hard to get a grip on what synthetic data is and how it can be so useful to chemists when it's not actually real. Here, an example may help.
Most students of chemistry learn, fairly early on, to sketch out a chemical structure by hand, but transferring that structure from a piece of paper to a computer can be a tedious task. It would be useful if there were a quick way of doing it: say, by taking a picture of the sketch on your phone. At Stanford University in California, US, Todd Martinez and his team set about tackling this problem, their idea being to teach a machine learning model how to recognise hand-drawn chemical structures so that they could be quickly converted into virtual versions. However, to do this using real data, they would have needed to train their data-hungry model with a vast dataset of hand-drawn structures. As Martinez notes, even someone who can draw really fast is only going to be able to churn out a handful a minute. 'We did try,' he recalls. 'We got 30 people together and spent hours just doing this, but we were only able to get 1000 structures or something. You need this data in the hundreds of thousands.'
So, instead, they developed a process for roughing up half a million clean, artificially generated chemical structures (made with well-known software called RDKit) and sticking them onto backgrounds to simulate photographs of hand-drawn structures. Having ingested this synthetic data, combined with a much smaller sample of real hand-drawn structures, their machine learning approach was able to correctly identify hand-drawn hydrocarbon structures 70% of the time, compared to never when trained just with their limited hand-drawn data (and only 56% of the time with the clean RDKit structures). More recently, they developed an app that turns hand-drawn structures directly into 3D visualisations of molecules on a mobile device.
Had it been fed half a million genuine, hand-drawn structures, Martinez's model might have performed even better, but the real data just wasn't available. So in this case, using synthetic data was a solution to the problem of data sparsity, a problem described as one of the main barriers to the adoption of AI in the chemical sciences. As Martinez puts it, 'There are lots of problems in chemistry (actually, most problems, I would say) where there is insufficient data to really apply machine learning methods the way that practitioners would like to.' The cost savings to be made by using synthetic data, he adds, could be much larger than in his molecular recognition example, because elsewhere this made-up data wouldn't just be replacing data that it takes someone a few seconds to draw, but data from expensive, real-life experiments.
There's certainly no shortage of such experimental data at the Rutherford Appleton Laboratory in Oxfordshire. Here, 'petabytes and petabytes' of data are produced, Butler says, by real-life experiments in which materials are bombarded with subatomic particles in order to probe their structure. But Butler's collaborators at the facility face a different problem: a massive surplus of unlabelled data, which is effectively meaningless to many machine learning models. To understand why, think back to the previous example, where each hand-drawn chemical structure would need a label attached to teach the machine learning model how to recognise it. Without labels, it's harder for the AI to learn anything. 'The problem is that it's really expensive to label that kind of experimental data,' Butler says. Synthetic data, though, can be generated ready-labelled and then used as training data to help machine learning models learn to interpret the masses of real data produced by the laboratory's neutron-scattering experiments, a line of thinking Butler and co-workers explored in a 2021 study.
In the study, they trained their models with thousands of synthetic spectra: images in the same format as those they got from real neutron scattering experiments, but created via theoretical calculations. What they realised, though, was that when it came to interpreting real spectra, the models trained on simulated data just weren't very good, because they weren't used to all the imperfections that exist in real-world experimental spectra. Like the fake, hand-drawn chemical structures made by Martinez's team, they needed roughing up. As a solution, Butler's team came up with a way to add experimental artefacts, including noise, to the clean synthetic data. He describes it as akin to a filter you might apply to a selfie, except instead of giving a photo of your face the style of a Van Gogh painting, it gives a perfect, simulated spectrum the style of a messier one from a real experiment. This restyling of synthetic data could be useful more broadly than in neutron scattering experiments, according to Butler.
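To make the idea of ready-labelled but roughed-up synthetic spectra concrete, here is a minimal sketch. It is not the method used by Butler's team: it swaps their restyling approach for plain additive Gaussian noise and a sloping background, and the single-peak spectrum, function names and parameter values are all assumptions chosen for illustration.

```python
import math
import random

def clean_spectrum(peak_position, n_points=200, width=5.0):
    """One Gaussian peak on a flat baseline: a stand-in for a simulated spectrum."""
    return [math.exp(-((i - peak_position) ** 2) / (2 * width ** 2))
            for i in range(n_points)]

def rough_up(spectrum, noise_level=0.05, background_slope=0.001, seed=0):
    """Add Gaussian noise and a linear background to mimic experimental artefacts."""
    rng = random.Random(seed)
    return [y + background_slope * i + rng.gauss(0.0, noise_level)
            for i, y in enumerate(spectrum)]

# Each synthetic example is labelled by construction: the peak position is known
# because we chose it when generating the clean spectrum.
training_set = []
for k, peak in enumerate([40, 80, 120, 160]):
    noisy = rough_up(clean_spectrum(peak), seed=k)
    training_set.append((noisy, peak))
```

Each entry in `training_set` pairs a messy-looking spectrum with its known label, so no manual labelling step is needed before a model is trained on it.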
Another area in which synthetic data could have an important impact is in combination with AI approaches in drug discovery. Though, as in other fields, the terminology could be a little off-putting. 'People are a bit shy to accept synthetic data [perhaps because] it sounds like it's lower quality than real data,' says Ulrich Zachariae, a drugs researcher at the University of Dundee in the UK. Zachariae's recent work has focused on searching for new compounds to target antibiotic-resistant Gram-negative bacteria like Pseudomonas aeruginosa, which causes life-threatening infections in hospital. One of the issues slowing down the search is that these bugs' outer shells are virtually impenetrable, and while machine learning models could help make useful predictions about which compounds might work, there's a lack of data on bacterial permeability to feed the models.
To start tackling the permeability problem, Zachariae's team first constructed a model with what data they had (data that came from existing antibiotics) and used it to predict whether other compounds would be good permeators, or not. This worked well, but didn't explain anything about why one compound was better than another. The researchers then wanted to probe the effects of small differences in chemical structure on permeability, but this required lots more data. So, to generate more, they used their own machine learning model to predict the properties of hundreds of thousands of (real) compounds for which there was no experimental data on permeability, creating a huge new synthetic dataset. 'That dataset provided us with enough input data for [the additional analysis] that we could understand what was going on and why these compounds were good permeators or not,' Zachariae explains. They were then able to suggest chemical features, including amine, thiophene and halide groups, that medicinal chemists should look out for in their hunt for new Gram-negative antibiotics.
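The workflow described above, in which a model trained on a small measured set is used to label a much larger pool of unmeasured compounds, can be sketched as follows. The article does not say what kind of model Zachariae's team used, so a nearest-centroid classifier stands in for it here, and the two-number descriptors and their values are entirely made up.

```python
import random

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def train_nearest_centroid(labelled):
    """Return one centroid per class from (features, label) pairs."""
    by_class = {}
    for features, label in labelled:
        by_class.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_class.items()}

def predict(model, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(model, key=lambda label: dist2(model[label]))

# Hypothetical descriptors (two made-up numbers per compound) with measured labels.
labelled = [
    ([0.9, 0.2], "permeator"), ([0.8, 0.3], "permeator"), ([0.7, 0.1], "permeator"),
    ([0.2, 0.8], "non-permeator"), ([0.3, 0.9], "non-permeator"), ([0.1, 0.7], "non-permeator"),
]
model = train_nearest_centroid(labelled)

# A much larger pool of compounds with descriptors but no measurements.
rng = random.Random(0)
unlabelled = [[rng.random(), rng.random()] for _ in range(10_000)]

# The model's predictions become the labels of a new, synthetic dataset.
synthetic_dataset = [(x, predict(model, x)) for x in unlabelled]
```

The resulting `synthetic_dataset` is large enough for follow-up analysis, but every label in it inherits whatever errors the original model makes.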
For Martinez, understanding why machine learning models make the predictions they do is another key motivation for using synthetic data. The internal workings of AI models can be difficult to unravel (they're widely referred to as 'black boxes') but Martinez says he thinks of synthetic data as 'a tool to sharpen a particular model that is producing that data, or to understand its essence in a simpler form'. 'I think you can see examples of people groping towards this, but I don't know that it's clearly stated,' he muses. Martinez's interests lie mainly in quantum chemistry, where more traditional computational models are used to solve theoretical problems. Addressing the same problems with machine learning may be a way to get to solutions faster while also, with the help of synthetic data, getting to the heart of what the AI models have learned. In this way, chemists may be able to improve their more traditional models.
But what are the risks of using data that isn't real? It's hard to answer this question at this point. In other fields, the risks often relate to people whose data was absorbed to generate synthetic data, or who are affected by the decisions it is used to make. As Jacobsen notes, the risks are going to vary depending on the application area. 'Chemists have to delineate exactly: "How do we frame what risk is in this context?"' he says.
In the drug discovery space, Zachariae wonders if there is any more risk associated with artificially generated data than with simulated data used in the past. 'From a purely scientific perspective, I don't see how it's any different from previous cycles of predictions, where we just used physical models,' he says, adding that any hit molecule identified using AI and synthetic data would still have to go through rigorous safety testing.
Martinez, though, sees a potential problem in areas of theoretical chemistry where there is no real or experimental data on which to train machine learning models, because that data can only be arrived at by computational means. Here, synthetic data may effectively be the norm, although 'the magic words' often aren't mentioned, he says, because chemists aren't familiar with them. In quantum chemistry, for example, a molecule's geometry can be used to compute its energy, and now machine learning models are trained, based on existing theory, to take in geometries and spit out energies, just in faster and cheaper ways. However, since traditional methods are more accurate for smaller molecules than bigger ones and there's no way of experimentally checking the results, machine learning algorithms trained to spit out energies for big molecules could be doing a poor job, which could be a concern if millions of data points are being generated. 'The interesting point about synthetic data in this context is that these issues do not seem to always be at the forefront of the community's thinking,' Martinez says. 'This seems to be because the synthetic nature of the data implies tight control over the training dataset and this can give a false confidence in data correctness and coverage.'
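One facet of the concern Martinez raises, coverage, can be shown with a toy calculation. The sketch below assumes a made-up reference function in place of a real quantum chemistry method and a straight-line fit in place of a machine learning model; every name and number in it is hypothetical. The surrogate is fitted only on small systems and then asked about a much larger one.

```python
def reference_energy(n_atoms):
    """Hypothetical 'ground truth': a made-up, mildly non-linear function of size."""
    return -1.0 * n_atoms - 0.02 * n_atoms ** 2

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Train the surrogate only on small systems (1 to 10 atoms).
small = list(range(1, 11))
a, b = fit_line(small, [reference_energy(n) for n in small])

def surrogate(n_atoms):
    """The cheap model's prediction for a system of the given size."""
    return a * n_atoms + b

def abs_error(n_atoms):
    """Absolute difference between the surrogate and the reference."""
    return abs(surrogate(n_atoms) - reference_energy(n_atoms))

in_range_error = max(abs_error(n) for n in small)   # worst case inside the training range
out_of_range_error = abs_error(100)                 # a system far outside it
```

Inside the training range the surrogate looks accurate, while far outside it the error is hundreds of times larger. In this toy case the gap is visible only because the reference function is known everywhere; in the situation Martinez describes, no such check is available for the large molecules.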
In materials discovery, similar concerns surround the use of machine learning to predict stable chemical structures, as Google DeepMind researchers have done, in chemical spaces where the existing theory was not necessarily that accurate. The risk, says Butler, is that we lose credibility (and funding) if the properties of predicted materials don't match up with reality. So, while making up data may mean something different these days, it's worth remembering that there could still be a lot at stake if it's not done well.
Hayley Bennett is a science writer based in Bristol, UK
View original post here:
Why are computational chemists making up their data? - Chemistry World