
Liquidity is key to unlocking the value in data, researchers say – MIT Sloan News


Like financial assets, data assets can have different levels of liquidity. Certificates of deposit tie up money for a certain length of time, preventing other use of the funds. Siloed business applications tie up data, which makes it difficult, even impossible, to use that data in other ways across an organization.

A recent research briefing, "Build Data Liquidity to Accelerate Data Monetization," defines data liquidity as the ease of data asset reuse and recombination. The briefing was written by Barbara Wixom, principal research scientist at the MIT Center for Information Systems Research (CISR), and Gabriele Piccoli of Louisiana State University and the University of Pavia.

Unlike physical capital assets listed on a corporate balance sheet, such as buildings and equipment, data does not deteriorate over time. In fact, it can become more valuable as it is used in different ways.

While data is inherently reusable and recombinable, an organization must activate these characteristics by creating strategic data assets and building out its data monetization capabilities, the authors explain.

Typically, companies use data in linear value-creation cycles, where data is trapped in silos and local business processes. Over time, the data becomes incomplete, inaccurate, and poorly classified or defined.

To increase data liquidity, organizations need to decontextualize the data, divorcing it from a specific condition or context. The authors suggest using best practices in data management including metadata management, data integration and taxonomy/ontology to ensure each data asset is accurate, complete, current, standardized, searchable, and understandable throughout the enterprise.

Such data management practices build key enterprise capabilities like data platform, data science, acceptable data use, and customer understanding, which increases data's monetization potential.

"As a company's strategic data assets become more highly liquid and their number grows, data is made increasingly available for conversion to value, and the company's data monetization accelerates," write the authors.

In explaining how an organization can create highly liquid data assets for use across an enterprise, the authors cite the example of Fidelity Investments, a Boston-based financial services company.

The firm is combining more than 100 data warehouses and analytics stores into one common analytics platform, built upon five foundational structures:

Fidelity's goal is to organize the data around a common set of priorities such as customer, employee, and investible security. The result will be strategic data assets that are integrated and easily consumable. "We want to create long-term data assets for creating value, not only immediately, but also for use cases that are yet to be identified," says Mihir Shah, Fidelity's enterprise head of data, in the briefing.

As long as Fidelity's internal data consumers follow core rules, they can combine data from different sources and build for specific requirements. Not only has Fidelity already created valuable data assets through this platform, it has begun to identify value-add opportunities using data that were never before possible: activities that would add value for customers, revenue, and efficiency, according to the authors.

Once data is highly liquid, future-ready companies can use it to produce value in three ways, according to MIT CISR research:

Fidelity's effort is one of more than 70 strategic data asset initiatives that MIT CISR researchers uncovered in the course of interviews with its member organizations. The projects illustrate how "the beauty lies not in a single use of data, but in the recurring reuse and recombination of the carefully curated underlying strategic data assets," the authors write.

As companies transform into future-ready entities, they need to view their strategic digital initiatives not simply as a way to exploit digital possibilities, but also as opportunities for reshaping their data into highly liquid strategic data assets, they conclude.


MSK Study Identifies Biomarker That May Help Predict Benefits of Immunotherapy – On Cancer – Memorial Sloan Kettering

In recent years, immune-based treatments for cancer have buoyed the hopes of doctors and patients alike. Drugs called immune checkpoint inhibitors have provided lifesaving benefits to a growing list of people with several types of cancer, including melanoma, lung cancer, bladder cancer, and many more.

Despite the excitement surrounding these medications, a frustrating sticking point has been the inability of doctors to predict who will benefit from them and who will not.

On August 25, 2021, a group of researchers from Memorial Sloan Kettering Cancer Center reported in the journal Science Translational Medicine that a specific pattern, or signature, of markers on immune cells in the blood is a likely biomarker of response to checkpoint immunotherapy. Within this immune signature, one molecule, LAG-3, provided key information identifying patients with poorer outcomes.

This link was discovered in a group of patients with metastatic melanoma and validated in a second group of patients with metastatic bladder cancer, suggesting that this potential biomarker may be broadly applicable to patients with a variety of cancers.

According to Margaret Callahan, an investigator with the Parker Institute for Cancer Immunotherapy at MSK and the physician-researcher who led the study, the large patient cohorts, robust clinical follow-up, and rigorous statistical approach of the study give her enthusiasm that "this immune signature is telling us something important about who responds to immunotherapy and why."

The findings pave the way for prospective clinical trials designed to test whether incorporating this biomarker into patient care can improve outcomes for those who are less likely to benefit from existing therapies.


In making their discoveries, the researchers had data on their side. As one of the first cancer centers in the world to begin treating large numbers of patients with immunotherapy, MSK has a cache of stored blood from hundreds of patients treated over the years, efforts pioneered by MSK researchers Jedd Wolchok and Phil Wong, co-authors on the study. The investigators of this study made their discoveries using pre-treatment blood samples collected from patients enrolled on seven different clinical trials open at MSK between 2011 and 2017.

To mine the blood for clues, researchers used a technique called flow cytometry. Flow cytometry is a tool that rapidly analyzes attributes of single cells as they flow past a laser. The investigators' goal was to identify markers found on patients' immune cells that correlated with their response to immunotherapy, primarily PD-1-targeting drugs like nivolumab (Opdivo) and pembrolizumab (Keytruda). But this wasn't a job for ordinary human eyeballs.

"When you think about the fact that there are hundreds of thousands of blood cells in a single patient blood sample, and that we're mapping out the composition of nearly 100 different immune cell subsets, it's a real challenge to extract clinically relevant information effectively," says Ronglai Shen, a statistician in the Department of Epidemiology and Biostatistics at MSK who developed some of the statistical tools used in the study. "That's where we as data scientists were able to help Dr. Callahan and the other physician-researchers on the study. It was a perfect marriage of skills."

The statistical tools that Dr. Shen and fellow data scientist Katherine Panageas developed allowed the team to sort patients into three characteristic immune signatures, or immunotypes, based on unique patterns of blood markers.
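As an illustration only, and not the custom statistical pipeline the MSK team developed, the sketch below groups patients into three immunotypes by clustering a matrix of flow cytometry marker frequencies with an off-the-shelf algorithm; the patient counts, marker counts and values are synthetic.

```python
# Illustrative sketch only: grouping patients into immunotypes by clustering
# flow cytometry marker frequencies. This is a generic stand-in (k-means),
# not the statistical method developed for the MSK study; data are synthetic.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_patients, n_markers = 188, 100          # e.g., ~100 immune cell subsets per sample
X = rng.random((n_patients, n_markers))   # fraction of cells positive for each marker

X_scaled = StandardScaler().fit_transform(X)
immunotype = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(immunotype))            # number of patients per immunotype
```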

The immunotype that jumped out was a group of patients who had high levels of a protein called LAG-3 expressed on various T cell subsets. Patients with this LAG+ immunotype, the team found, had a much shorter survival time compared with patients with a LAG- immunotype: For melanoma patients, there was a difference in median survival of more than four years (22.2 months compared with 75.8 months) and the difference was statistically significant.
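A minimal sketch of the kind of survival comparison described above, using the lifelines library on made-up durations; only the overall idea of comparing LAG+ against LAG- survival curves comes from the study, not the numbers generated here.

```python
# Minimal sketch of a Kaplan-Meier comparison between LAG+ and LAG- groups.
# Uses the lifelines library on made-up data; only the idea (compare survival
# curves and test the difference) mirrors the study.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)
# Hypothetical overall-survival times (months) and event indicators (1 = death observed)
t_lag_pos = rng.exponential(22.0, size=60)
t_lag_neg = rng.exponential(75.0, size=128)
e_pos = np.ones_like(t_lag_pos)
e_neg = np.ones_like(t_lag_neg)

km = KaplanMeierFitter()
km.fit(t_lag_pos, e_pos, label="LAG+ immunotype")
print("LAG+ median survival:", km.median_survival_time_)
km.fit(t_lag_neg, e_neg, label="LAG- immunotype")
print("LAG- median survival:", km.median_survival_time_)

result = logrank_test(t_lag_pos, t_lag_neg, event_observed_A=e_pos, event_observed_B=e_neg)
print("log-rank p-value:", result.p_value)
```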


LAG-3 (short for lymphocyte-activation gene 3) belongs to a family of molecules called immune checkpoints. Like the more well-known checkpoints CTLA-4 and PD-1, LAG-3 has an inhibitory effect on immune responses, meaning it tamps them down. Several drugs targeting LAG-3 are currently in clinical development, although defining who may benefit from them the most has been challenging.

When Dr. Callahan and her colleagues started this research, they did not plan to focus on LAG-3 specifically. "We let the data lead us and LAG-3 is what shook out," she says.

One strength of the study is its use of both a discovery set and a validation set. What this means is that the investigators performed their initial analysis on one set of blood samples from a large group of patients, in this case 188 patients with melanoma. Then, they asked whether the immune signature they identified in the discovery set could predict outcomes in an entirely different batch of patients: 94 people with bladder cancer.

It could, and quite well.

"When we looked at our validation cohort of bladder cancer patients who received checkpoint blockade, those who had the LAG+ immunotype had a 0% response rate," Dr. Callahan says. "Zero. Not one of them responded. That's compared with a 49% response rate among people who had the LAG- immunotype."

Because of the large data set, the scientists were also able to ask how their LAG+ immunotype compares with other known biomarkers of response, specifically PD-L1 status and tumor mutation burden. What they found was the immunotype provided new and independent information about patient outcomes, rather than just echoing these other biomarkers.

Biomarkers are important in cancer for several reasons. They may help clinicians and patients select the best treatment and may allow them to avoid unnecessary treatment or treatment that is unlikely to work.

"Immunotherapy drugs are not without potential toxicity," Dr. Panageas says. "So, if we can spare someone the potential risks of a treatment because we know they're not likely to respond, that's a big advance."

The second reason is cost. Immunotherapy drugs are expensive, so having a means to better match patients with available drugs is vital.

And, because the researchers identified this biomarker using patient blood samples, it raises the pleasing prospect that patients could be assessed for this marker using a simple blood draw. Other biomarkers currently in use rely on tumor tissue typically obtained by a biopsy.

"If I told you that you could have a simple blood draw and in a couple of days have information to make a decision about what therapy you get, I'd say it doesn't get much better than that," Dr. Callahan says. "Of course, there is still much work to be done before these research findings can be applied to patients in the clinic, but we are really enthusiastic about the potential to apply these findings."

A limitation of the study is that it is retrospective, meaning that the data that were analyzed came from blood samples that were collected years ago and stored in freezers. To confirm that the findings have the potential to benefit patients, investigators will need to test their hypothesis in a prospective study, meaning one where patients are enrolled on a clinical trial specifically designed to test the idea that using this immunotype in treatment decisions can improve patient outcomes.

"What I'm most excited about is prospectively evaluating the idea that not only can we identify patients who won't do as well with the traditional therapies but that we can also give these patients other treatments that might help them, based on our knowledge of what LAG-3 is doing biologically," Dr. Callahan says.



Understanding The Macroscope Initiative And GeoML – Forbes

How is it possible to harness high volumes of data on a planetary scale to discover spatial and temporal patterns that escape human perception? The convergence of technologies such as LIDAR and machine learning is allowing for the creation of macroscopes, which have many applications in monitoring and risk analysis for enterprises and governments.

Microscopes have been around for centuries, and they are tools that allow individuals to visualize and research phenomena that are too small to be perceived by the human eye. Macroscopes can be thought of as carrying out the opposite function; they are systems that are designed to uncover spatial and temporal patterns that are too large or slow to be perceived by humans. In order to function, they require both the ability to gather planetary-scale information over specified periods of time, as well as the compute technologies that can deal with such data and provide interactive visualization. Macroscopes are similar to geographic information systems, but include other multimedia and ML-based tools.

Dr. Mike Flaxman, Spatial Data Science Practice Lead at OmniSci

In an upcoming Data for AI virtual event with OmniSci, Dr. Mike Flaxman, Spatial Data Science Practice Lead, and Abhishek Damera, Data Scientist, will be giving a presentation on building planetary geoML and the macroscope initiative. OmniSci is an accelerated analytics platform that combines data science, machine learning, and GPU-accelerated computing to query and visualize big data. They provide solutions for visual exploration of data that can aid in monitoring and forecasting different kinds of conditions for large geospatial areas.

The Convergence of Data and Technologies

In a world where the amount and importance of data continue to grow exponentially, it is increasingly important for organizations to be able to harness that data. The fact that data now flows digitally changes how we collect and integrate data from many different sources across varying formats. Because of this, getting data from its raw condition to the state of being analysis-ready and then actually performing the analysis can be challenging and often requires very complex pipelines. Traditional software approaches generally do not scale very well, resulting in teams that are increasingly looking to machine learning algorithms and pipelines to perform tasks such as feature classification, extraction, and condition monitoring. This is why companies like OmniSci are applying ML as part of a larger macroscope pipeline to provide analytics methods for applications such as powerline risk analysis and naval intelligence.

One way that OmniSci is using their technology is in monitoring powerline vegetation by district at a national level in Portugal. Partnering with Tesselo, they are using a combination of imagery and LIDAR technologies to build a more detailed and temporally flexible portrait of land cover that can be updated weekly. Using stratified sampling for ML and GPU analytics for real-time integration, they are able to extract and render billions of data points from sample sites for vegetation surrounding transmission lines.

For large-scale projects such as the above, there are most often two common requirements: extremely high volumes of data are required to provide accurate representations of specific geographical locations, and machine learning is needed for data classification and continuous monitoring. OmniSci aims to address the question of how, on a technical level, these two requirements can be integrated in a manner that is dynamic, fast, and efficient. The OmniSci software is a platform that consists of three layers, each of which can be independently used or combined together. The first layer, OmniSci DB, is a SQL database that provides fast queries and embedded ML. The middle component, the Render Engine, provides server-side rendering that acts similarly to a map server and can be combined with the database layer to render results as images. The final layer, OmniSci Immerse, is an interactive front-end component that allows the user to play around with charts and data and request queries from the backend. Together, the OmniSci ecosystem can take in data from many different sources and formats and talk to other SQL databases through well-established protocols. Data scientists can use traditional data science tools jointly, making it easy to analyze the information. OmniSci's solution centers on the notion of "moving the code to the data rather than the data to the code."
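As a rough sketch of how the database layer might be reached from standard data science tooling, the snippet below assumes the pymapd DB-API client that served as OmniSciDB's Python connector at the time; the credentials, table and column names are hypothetical.

```python
# Sketch: querying the OmniSciDB layer from Python and pulling the result into pandas.
# Assumes the pymapd DB-API client; connection details, table and columns are hypothetical.
import pandas as pd
import pymapd

con = pymapd.connect(user="admin", password="HyperInteractive",
                     host="localhost", dbname="omnisci")
cur = con.cursor()
cur.execute("""
    SELECT district, AVG(vegetation_height) AS avg_height
    FROM powerline_lidar
    WHERE risk_score > 0.8
    GROUP BY district
""")
df = pd.DataFrame(cur.fetchall(), columns=["district", "avg_height"])
print(df.head())
```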

Case Study in Firepower-Transmission Line Risk Analysis

A specific case study for OmniSci Immerse demonstrates the ability to perform firepower-transmission line risk analysis. Growing vegetation can pose high risks to power lines for companies such as PG&E, and it can be inefficient and challenging to assess changing risks accurately. However, by combining imagery and LIDAR data, OmniSci is providing a better way to map out the physical structures of different geographic areas, such as in Northern California, to analyze risk without needing on-site visits. OmniSci's platform combines three factors, physical structure, vegetation health over time, and varying wind speeds over space, to determine firepower strike tree risk. They are addressing both the issues of scale and detail to allow utility companies to determine appropriate actions through continuous monitoring.

In addition to the firepower-transmission line risk analysis example, there are many other use cases for macroscope technologies and methods. OmniSci is providing a way to perform interactive analyses on multi-billion row datasets, and they can provide efficient methods for critical tasks such as anomaly detection. To learn more about the technology behind OmniSci solutions as well as the potential use cases, make sure to join the Data for AI community for the upcoming virtual event.


Cancer Informatics for Cancer Centers: Scientific Drivers for Informatics, Data Science, and Care in Pediatric, Adolescent, and Young Adult Cancer -…


JCO Clin Cancer Inform. 2021 Aug;5:881-896. doi: 10.1200/CCI.21.00040.

ABSTRACT

Cancer Informatics for Cancer Centers (CI4CC) is a grassroots, nonprofit 501(c)(3) organization intended to provide a focused national forum for engagement of senior cancer informatics leaders, primarily aimed at academic cancer centers anywhere in the world but with a special emphasis on the 70 National Cancer Institute-funded cancer centers. This consortium has regularly held topic-focused biannual face-to-face symposiums. These meetings are a place to review cancer informatics and data science priorities and initiatives, providing a forum for discussion of the strategic and pragmatic issues that we faced at our respective institutions and cancer centers. Here, we provide meeting highlights from the latest CI4CC Symposium, which was delayed from its original April 2020 schedule because of the COVID-19 pandemic and held virtually over three days (September 24, October 1, and October 8) in the fall of 2020. In addition to the content presented, we found that holding this event virtually once a week for 6 hours was a great way to keep the kind of deep engagement that a face-to-face meeting engenders. This is the second such publication of CI4CC Symposium highlights, the first covering the meeting that took place in Napa, California, from October 14-16, 2019. We conclude with some thoughts about using data science to learn from every child with cancer, focusing on emerging activities of the National Cancer Institute's Childhood Cancer Data Initiative.

PMID:34428097 | DOI:10.1200/CCI.21.00040


Empowering the Intelligent Data-Driven Enterprise in the Cloud – CDOTrends

Businesses realize that the cloud offers a lot more than digital infrastructure. Around the world, organizations are turning to the cloud to democratize data access, harness advanced AI and analytics capabilities, and make better data-driven business decisions.

But despite heavy investments in building data repositories, setting up advanced database management systems (DBMS), and building large data warehouses on-premises, many enterprises are still challenged with poor business outcomes, observed Anthony Deighton, chief product officer at Tamr.

Deighton was speaking at the "Empowering the intelligent data-driven enterprise in the cloud" event by Tamr and Google Cloud in conjunction with CDOTrends. Attended by top innovation executives, data leaders, and data scientists from Asia Pacific, the virtual panel discussion looked at how forward-looking businesses might kick off the next phase of data transformation.

Why a DataOps strategy makes sense

"Despite this massive [and ongoing] revolution in data, customers still can't get a view of their customers, their suppliers, and the materials they use in their business. Their analytics are out-of-date, or their AI initiatives are using bad data and therefore making bad recommendations. The result is that people don't trust the data in their systems," said Deighton.

"As much as we've seen a revolution in the data infrastructure space, we're not seeing a better outcome for businesses. To succeed, we need to think about changing the way we work with data," he explained.

And this is where a DataOps strategy comes into play. A direct play on the popular DevOps strategy for software development, DataOps relies on an automated, process-oriented methodology to improve data quality for data analytics. Deighton thinks the DevOps revolution in software development can be replicated with data through a continuous collaborative approach with best-of-breed systems and the cloud.

"Think of Tamr working in the backend to clean and deliver this centralized master data in the cloud, offering clean, curated answers to questions such as: Who are my customers? What products have we sold? What vendors do we do business with? What are my sales transactions? And of course, for every one of your [departments], there's a different set of these clean, curated business topics that are relevant to you."

Data in an intelligent cloud

But won't an on-premises data infrastructure work just as well? What benefits does the cloud offer? Deighton outlined two distinct advantages to explain why he considers the cloud the linchpin of the next phase of data transformation.

"You can store infinite amounts of data in the cloud, and you can do that very cost-effectively. It's far less costly to store data in the cloud than it is to try to store it on-premises, in [your own] data lakes," he said.

"Another really powerful capability of Google Cloud is its highly scalable elastic compute infrastructure. We can leverage its highly elastic compute and the fact that the data is already there. And then we can run our human-guided machine learning algorithms cost-effectively and get on top of that data quickly."

Andrew Psaltis, the APAC Technology Practice Lead at Google Cloud, drew attention to the synergy between Tamr and Google Cloud.

"You can get data into [Google] BigQuery in different ways, but what you really want is clean, high-quality data. That quality allows you to have confidence in your advanced analytics, machine learning, and the entire breadth of our analytics and AI platform. We have an entire platform to enable you to collaborate with your data science team; we have the tooling to do so without code, packaged AI solutions, tools for those who prefer to write their code, and everywhere in between."
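A minimal sketch of landing a cleaned, curated table in BigQuery and querying it with the official google-cloud-bigquery client; the project, dataset and table names are hypothetical, and the tiny DataFrame stands in for curated master data.

```python
# Sketch: loading a cleaned customer table into BigQuery and querying it.
# Uses the google-cloud-bigquery client; project/dataset/table names are hypothetical.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

# A small, already-cleaned master data table (stand-in for curated output)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Acme Pte Ltd", "Globex KK", "Initech GmbH"],
    "region": ["APAC", "APAC", "EMEA"],
})
table_id = "my-analytics-project.master_data.customers"
client.load_table_from_dataframe(customers, table_id).result()  # wait for the load job

# Downstream analytics can now query the curated table directly
query = f"SELECT region, COUNT(*) AS n FROM `{table_id}` GROUP BY region"
print(client.query(query).result().to_dataframe())
```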

Bridging the data silos

A handful of polls were conducted as part of the panel event, which saw participants quizzed about their ongoing data-driven initiatives. When asked about how they are staffing their data science initiatives, the largest group (46%) responded that they have multiple teams across various departments handling their data science initiatives.

The rest are split between either having a central team collecting, processing, and analyzing data or a combination of a central team working with multiple project teams across departments.

Deighton observed that multiple work teams typically result in multiple data silos: "Each team has their silo of data. Maybe the team is tied to a specific business unit, a specific product team, or maybe a specific customer sales team."

"The way to break the data barriers is to bring data together in the cloud to give users a view of the data across teams," he says. "And it may sound funny, but sometimes, the way to break the interpersonal barriers is by breaking the data barriers."

"Your customers don't care how you are organized internally. They want to do business with you, with your company. If you think about it, not from the perspective of the team, but the customer, then you need to put more effort into resolving your data challenges to best serve your customers."

Making the move

When asked about their big data change initiatives for the next three years, the response was almost unanimous: participants want to democratize analytics, build a data culture, and make decisions faster (86%). Unsurprisingly, the top roadblocks are that IT takes too long to deliver the systems data scientists need (62%) and the cost of data solutions (31%).

"The cloud makes sense, given how it enables better work efficiency, lowers operational expenses, and is inherently secure," said Psaltis. Workers are moving to the cloud, Psaltis noted as he shared an anecdote about an unnamed organization that loaded the cloud with up to a petabyte of data in relatively short order.

This was apparently done without the involvement or knowledge of the IT department. It might be better if the move to the cloud were done under more controlled circumstances, with the approval and participation of IT, says Psaltis.

Finally, it is imperative that data is cleaned and kept clean as it is migrated to the cloud. "Simply moving it into the cloud isn't enough. Without cleaning the data first, you will end up with poor quality, disparate data in the cloud, where each application's data sits within a silo, with more silos than before, and difficulty making quality business decisions," summed up Deighton.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [emailprotected].

Image credit: iStockphoto/Artystarty


The Winners Of Weekend Hackathon - Tea Story at MachineHack – Analytics India Magazine

The Weekend Hackathon Edition #2 (The Last Hacker Standing): Tea Story challenge concluded successfully on 19 August 2021. The challenge involved creating a time series analysis model that forecasts 29 weeks ahead. It drew 240+ participants and 110+ solutions posted on the leaderboard.

Based on the leaderboard score, we have the top 4 winners of the Tea Story Time Series Challenge, who will get free passes to the virtual Deep Learning DevCon 2021, to be held on 23-24 Sept 2021. Here, we look at the winners' journeys, solution approaches and experiences at MachineHack.

First Rank: Vybhav Nath C A

Vybhav Nath is a final-year student at IIT Madras. He entered this field during his second year of college and started participating in MachineHack hackathons last year. He plans to take up a career in Data Science.

Approach

He says the problem was unique in the sense that many columns in the test set had a lot of null values, so this was a challenging task to solve. He kept his preprocessing steps restricted to imputation and replacing "N.S" entries. This was the first competition where he didn't use any ML model. Since many columns had null values, he interpolated the columns to get a fully populated test set. Then the final prediction was just the mean of these Price columns. He thinks this was a total doosra by the cool MachineHack team.

Experience

He says, "I always participate in MH hackathons whenever possible. There is a wide variety of problems which test multiple areas. I also get to compete with many professionals, which I found to be a good pointer about where I stand among them."

Check out his solution here.

Second prize: Shubham Bharadwaj

Shubham has been working as a Data Scientist and with large datasets for about 7 years, starting off with SQL, then big data analytics, then data engineering, and finally data science. But he is new to hackathons; this is only the fourth hackathon in which he has participated. He loves to solve complex problems.

Approach

The data provided was very raw in nature, with around 70 percent missing values in the test dataset. From his point of view, finding the best imputation method was the backbone of this challenge.

Preprocessing steps followed:

1. Converting the columns to correct data types,

2. Imputing the missing values: he tried various methods, like filling the null values with the mean of each column, the mean of that row, and MICE, but the best was a KNN imputer with n_neighbors set to 3.

For removing the outliers, he used the IQR (interquartile range), which helped in reducing the mean squared error; a minimal sketch of these two steps appears below.

Models tried were logistic regression, then XGBRegressor, ARIMA, TPOT, and finally H2OAutoML, which yielded the best result.
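The imputation and outlier-handling steps described above might look roughly like the following sketch; the column names and values are hypothetical stand-ins, not the winning notebook.

```python
# Sketch of the preprocessing described above: KNN imputation (n_neighbors=3)
# followed by IQR-based outlier filtering. Data and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "price_region_a": [101.0, np.nan, 99.0, 250.0, 102.0],
    "price_region_b": [98.0, 97.0, np.nan, 96.0, 95.0],
})

# 1. Impute missing values with a KNN imputer (3 nearest rows)
imputed = pd.DataFrame(KNNImputer(n_neighbors=3).fit_transform(df), columns=df.columns)

# 2. Keep rows that fall inside the interquartile-range fences of every column
q1, q3 = imputed.quantile(0.25), imputed.quantile(0.75)
iqr = q3 - q1
mask = ((imputed >= q1 - 1.5 * iqr) & (imputed <= q3 + 1.5 * iqr)).all(axis=1)
clean = imputed[mask]
print(clean)
```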

Experience

Shubham says, "I am new to the MachineHack family, and one thing is for sure: I am here to stay. It's a great place, and I have already learned so much. The datasets are of wide variety and the problem statements are unique, puzzling and complex. It's a must for every aspiring and professional data scientist to upskill themselves."

Check out his solution here.

Third prize: Panshul Pant

Panshul is a Computer Science and Engineering graduate. He has picked up data science mostly from online platforms like Coursera, HackerEarth and MachineHack, and by watching videos on YouTube. Going through articles on websites like Analytics India Magazine has also helped him in this journey. This problem was based on a time series, which made it unique, though he solved it using machine learning algorithms rather than traditional time series approaches.

Approach

He explains, "There were certain string values like N.S, No sale etc. in all numerical columns, which I changed to null values and then imputed. I tried various ways to impute NaNs, like zero, mean, ffill and bfill methods; out of these, the forward and backward filling methods performed significantly better." Exploring the data, he noticed that the prices increased over the months and years, showing a trend. The target column's values were also very closely related to the average of the prices of all the independent columns. He kept all data, including the outliers, without much change, as tree-based models are quite robust to outliers.

As the prices were related to time, he extracted time-based features as well, out of which day of week proved to be useful. An average-based feature, holding the average of all the numerical columns, was extremely useful for good predictions. He tried some aggregate-based features as well, but they were not of much help. For predictions he used tree-based models like LightGBM and XGBoost; the combination of both of them using a weighted average gave the best results.
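A rough sketch of that recipe on synthetic stand-in data, not Panshul's actual notebook: fill missing values forward and backward, add day-of-week and row-average features, then blend LightGBM and XGBoost with a weighted average.

```python
# Sketch of the approach described above: forward/backward fill, a day-of-week
# feature, a row-average feature, and a weighted blend of LightGBM and XGBoost.
# The data here are synthetic stand-ins for the tea price columns.
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

dates = pd.date_range("2019-01-01", periods=300, freq="D")
rng = np.random.default_rng(0)
prices = pd.DataFrame(rng.normal(100, 10, size=(300, 5)),
                      columns=[f"price_{i}" for i in range(5)], index=dates)
prices.iloc[::7, 2] = np.nan                    # simulate missing values
prices = prices.ffill().bfill()                 # forward then backward fill

X = pd.DataFrame(index=dates)
X["dayofweek"] = dates.dayofweek
X["row_avg"] = prices.mean(axis=1)              # average of the price columns
y = prices.mean(axis=1) * 1.02 + rng.normal(0, 1, 300)   # stand-in target

train_X, test_X, train_y = X.iloc[:270], X.iloc[270:], y.iloc[:270]
lgb = LGBMRegressor(n_estimators=200).fit(train_X, train_y)
xgb = XGBRegressor(n_estimators=200).fit(train_X, train_y)
blend = 0.5 * lgb.predict(test_X) + 0.5 * xgb.predict(test_X)   # weighted average
print(blend[:5])
```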

Experience

Panshul says, "It was definitely a valuable experience. The challenges set up by the organisers are always exciting and unique. Participating in these challenges has helped me hone my skills in this domain."

Check out his solution here.

Fourth prize: Shweta Thakur

Shweta's fascination with data science started when she realised how numbers can guide decision making. She did a PGP-DSBA course from Great Learning. Even though her professional work does not involve data science activity, she loves to challenge herself by working on data science projects and participating in hackathons.

Approach

Shweta says that the fact that it is a time series problem makes it unique. She observed the trend and seasonality in the dataset and the high correlation between various variables. She didn't treat the outliers but tried to treat the missing values with the interpolate (linear, spline) method, ffill, bfill, and replacing with other values from the dataset. Even though some of the features were not significant in identifying the target, removing them didn't improve the RMSE. She tried only SARIMAX.
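A minimal SARIMAX sketch along those lines, using statsmodels on a synthetic weekly series; the interpolation choice and the order terms are illustrative placeholders, not Shweta's tuned parameters.

```python
# Sketch: interpolate missing values, then fit SARIMAX and forecast 29 weeks ahead.
# Synthetic weekly series; the (p,d,q)(P,D,Q,s) orders are illustrative placeholders.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range("2016-01-03", periods=260, freq="W")
rng = np.random.default_rng(0)
y = pd.Series(100 + 0.1 * np.arange(260)
              + 5 * np.sin(np.arange(260) * 2 * np.pi / 52)
              + rng.normal(0, 1, 260), index=idx)
y.iloc[::13] = np.nan
y = y.interpolate(method="linear")          # linear interpolation of missing weeks

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 0, 1, 52))
fit = model.fit(disp=False)
forecast = fit.forecast(steps=29)           # the challenge asked for a 29-week horizon
print(forecast.head())
```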

Experience

Shweta says, "It was a great experience to compete with people from different backgrounds and expertise."

Check out her solution here.

Once again, join us in congratulating the winners of this exciting hackathon, who indeed were the Last Hackers Standing of Tea Story - Weekend Hackathon Edition 2. We will be back next week with the winning solutions of the ongoing challenge, the Soccer Fever Hackathon.


Taktile makes it easier to leverage machine learning in the financial industry – TechCrunch

Meet Taktile, a new startup that is working on a machine learning platform for financial services companies. This isn't the first company that wants to leverage machine learning for financial products. But Taktile wants to differentiate itself from competitors by making it way easier to get started and switch to AI-powered models.

A few years ago, when you could read machine learning and artificial intelligence in every single pitch deck, some startups chose to focus on the financial industry in particular. It makes sense as banks and insurance companies gather a ton of data and know a lot of information about their customers. They could use that data to train new models and roll out machine learning applications.

New fintech companies put together their own in-house data science team and started working on machine learning for their own products. Companies like Younited Credit and October use predictive risk tools to make better lending decisions. They have developed their own models and they can see that their models work well when they run them on past data.

But what about legacy players in the financial industry? A few startups have worked on products that can be integrated in existing banking infrastructure. You can use artificial intelligence to identify fraudulent transactions, predict creditworthiness, detect fraud in insurance claims, etc.

Some of them have been thriving, such as Shift Technology with a focus on insurance in particular. But a lot of startups build proof-of-concepts and stop there. There's no meaningful, long-term business contract down the road.

Taktile wants to overcome that obstacle by building a machine learning product that is easy to adopt. It has raised a $4.7 million seed round led by Index Ventures with Y Combinator, firstminute Capital, Plug and Play Ventures and several business angels also participating.

The product works with both off-the-shelf models and customer-built models. Customers can customize those models depending on their needs. Models are deployed and maintained by Taktiles engine. It can run in a customers cloud environment or as a SaaS application.

After that, you can leverage Taktile's insights using API calls. It works pretty much like integrating any third-party service in your product. The company tried to provide as much transparency as possible with explanations for each automated decision and detailed logs. As for data sources, Taktile supports data warehouses and data lakes, as well as ERP and CRM systems.
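The article does not document Taktile's actual API, but integrating such a decision engine typically comes down to one HTTP call per decision; the endpoint, payload fields and response shape below are entirely hypothetical.

```python
# Entirely hypothetical illustration of calling a hosted decision/inference endpoint;
# the URL, payload fields and response format are invented, not Taktile's actual API.
import requests

payload = {"applicant_id": "A-1042", "monthly_income": 4200, "requested_amount": 15000}
resp = requests.post(
    "https://api.example.com/v1/decisions/credit-risk",   # hypothetical endpoint
    json=payload,
    headers={"Authorization": "Bearer <api-key>"},
    timeout=10,
)
decision = resp.json()
print(decision.get("score"), decision.get("explanations"))  # hypothetical fields
```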

It's still early days for the startup, and it's going to be interesting to see whether Taktile's vision pans out. But the company has already managed to convince some experienced backers. So let's keep an eye on them.


Avalo uses machine learning to accelerate the adaptation of crops to climate change – TechCrunch

Climate change is affecting farming all over the world, and solutions are seldom simple. But if you could plant crops that resisted the heat, cold or drought instead of moving a thousand miles away, wouldn't you? Avalo helps plants like these become a reality using AI-powered genome analysis that can reduce the time and money it takes to breed hardier plants for this hot century.

Founded by two friends who thought they'd take a shot at a startup before committing to a life of academia, Avalo has a very direct value proposition, but it takes a bit of science to understand it.

Big seed and agriculture companies put a lot of work into creating better versions of major crops. By making corn or rice ever so slightly more resistant to heat, insects, drought or flooding, they can make huge improvements to yields and profits for farmers, or alternatively make a plant viable to grow somewhere it couldnt before.

"There are big decreases to yields in equatorial areas and it's not that corn kernels are getting smaller," said co-founder and CEO Brendan Collins. "Farmers move upland because salt water intrusion is disrupting fields, but they run into early spring frosts that kill their seedlings. Or they need rust-resistant wheat to survive fungal outbreaks in humid, wet summers. We need to create new varieties if we want to adapt to this new environmental reality."

To make those improvements in a systematic way, researchers emphasize existing traits in the plant; this isn't about splicing in a new gene but bringing out qualities that are already there. This used to be done by the simple method of growing several plants, comparing them, and planting the seeds of the one that best exemplifies the trait, like Mendel in Genetics 101.

Nowadays, however, we have sequenced the genome of these plants and can be a little more direct. By finding out which genes are active in the plants with a desired trait, better expression of those genes can be targeted for future generations. The problem is that doing this still takes a long time, as in a decade.

The difficult part of the modern process stems (so to speak) from the issue that traits, like survival in the face of a drought, aren't just single genes. They may be any number of genes interacting in a complex way. Just as there's no single gene for becoming an Olympic gymnast, there isn't one for becoming drought-resistant rice. So when the companies do what are called genome-wide association studies, they end up with hundreds of candidates for genes that contribute to the trait, and then must laboriously test various combinations of these in living plants, which even at industrial rates and scales takes years to do.

Numbered, genetically differentiated rice plants being raised for testing purposes. Image Credits: Avalo

"The ability to just find genes and then do something with them is actually pretty limited as these traits become more complicated," said Mariano Alvarez, co-founder and CSO of Avalo. "Trying to increase the efficiency of an enzyme is easy; you just go in with CRISPR and edit it. But increasing yield in corn, there are thousands, maybe millions of genes contributing to that. If you're a big strategic [e.g., Monsanto] trying to make drought-tolerant rice, you're looking at 15 years, 200 million dollars. It's a long play."

This is where Avalo steps in. The company has built a model for simulating the effects of changes to a plant's genome, which they claim can reduce that 15-year lead time to two or three years, and the cost by a similar ratio.

"The idea was to create a much more realistic model for the genome that's more evolutionarily aware," said Collins. That is, a system that models the genome and genes on it that includes more context from biology and evolution. With a better model, you get far fewer false positives on genes associated with a trait, because it rules out far more as noise, unrelated genes, minor contributors and so on.

He gave the example of a cold-tolerant rice strain that one company was working on. A genome-wide association study found 566 genes of interest, and to investigate each costs somewhere in the neighborhood of $40,000 due to the time, staff and materials required. That means investigating this one trait might run up a $20 million tab over several years, which naturally limits both the parties who can even attempt such an operation, and the crops that they will invest the time and money in. If you expect a return on investment, you can't spend that kind of cash improving a niche crop for an outlier market.

"We're here to democratize that process," said Collins. In that same body of data relating to cold-tolerant rice, "we found 32 genes of interest, and based on our simulations and retrospective studies, we know that all of those are truly causal. And we were able to grow 10 knockouts to validate them, three in a three-month period."

In each graph, dots represent confidence levels in genes that must be tested. The Avalo model clears up the data and selects only the most promising ones. Image Credits: Avalo

To unpack the jargon a little there, from the start Avalo's system ruled out more than 90% of the genes that would have had to be individually investigated. They had high confidence that these 32 genes were not just related, but causal, having a real effect on the trait. And this was borne out with brief knockout studies, where a particular gene is blocked and the effect of that studied. Avalo calls its method "gene discovery via informationless perturbations," or GDIP.

Part of it is the inherent facility of machine learning algorithms when it comes to pulling signal out of noise, but Collins noted that they needed to come at the problem with a fresh approach, letting the model learn the structures and relationships on its own. And it was also important to them that the model be explainable; that is, that its results don't just appear out of a black box but have some kind of justification.

This latter issue is a tough one, but they achieved it by systematically swapping out genes of interest in repeated simulations with what amount to dummy versions, which don't disrupt the trait but do help the model learn what each gene is contributing.
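The article does not spell out GDIP's internals, but the idea of swapping a candidate gene for an "informationless" dummy and watching whether the predicted trait degrades is close in spirit to permutation importance; the sketch below shows that generic analog on synthetic data, not Avalo's implementation.

```python
# Generic analog of the dummy-swap idea (not Avalo's actual GDIP implementation):
# shuffle one candidate feature at a time and see how much the trait prediction degrades.
# Synthetic data: only features 0 and 1 truly influence the "trait".
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                     # 20 candidate genes/markers
trait = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, 300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, trait)
result = permutation_importance(model, X, trait, n_repeats=20, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:5]
print("most likely causal candidates:", top)       # should rank features 0 and 1 first
```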

Avalo co-founders Mariano Alvarez (left) and Brendan Collins by a greenhouse. Image Credits: Avalo

"Using our tech, we can come up with a minimal predictive breeding set for traits of interest. You can design the perfect genotype in silico [i.e., in simulation] and then do intensive breeding and watch for that genotype," said Collins. And the cost is low enough that it can be done by smaller outfits or with less popular crops, or for traits that are outside possibilities: since climate change is so unpredictable, who can say whether heat- or cold-tolerant wheat would be better 20 years from now?

"By reducing the capital cost of undertaking this exercise, we sort of unlock this space where it's economically viable to work on a climate-tolerant trait," said Alvarez.

Avalo is partnering with several universities to accelerate the creation of other resilient and sustainable plants that might never have seen the light of day otherwise. These research groups have tons of data but not a lot of resources, making them excellent candidates to demonstrate the company's capabilities.

The university partnerships will also establish that the system works for fairly undomesticated plants that need some work before they can be used at scale. For instance it might be better to supersize a wild grain thats naturally resistant to drought instead of trying to add drought resistance to a naturally large grain species, but no one was willing to spend $20 million to find out.

On the commercial side, they plan to offer the data handling service first, one of many startups offering big cost and time savings to slower, more established companies in spaces like agriculture and pharmaceuticals. With luck Avalo will be able to help bring a few of these plants into agriculture and become a seed provider as well.

The company just emerged from the IndieBio accelerator a few weeks ago and has already secured $3 million in seed funding to continue their work at greater scale. The round was co-led by Better Ventures and Giant Ventures, with At One Ventures, Climate Capital, David Rowan and of course IndieBio parent SOSV participating.

"Brendan convinced me that starting a startup would be way more fun and interesting than applying for faculty jobs," said Alvarez. "And he was totally right."


Improve Machine Learning Performance by Dropping the Zeros – ELE Times

KAUST researchers have found a way to significantly increase the speed of training: large machine learning models can be trained much faster by observing how frequently zero results are produced in distributed machine learning that uses large training datasets.

AI models develop their intelligence by being trained on datasets that have been labelled to tell the model how to differentiate between different inputs and then respond accordingly. The more labelled data that goes in, the better the model becomes at performing whatever task it has been assigned to do. For complex deep learning applications, such as self-driving vehicles, this requires enormous input datasets and very long training times, even when using powerful and expensive highly parallel supercomputing platforms.

During training, small learning tasks are assigned to tens or hundreds of computing nodes, which then share their results over a communications network before running the next task. One of the biggest sources of computing overhead in such parallel computing tasks is actually this communication among computing nodes at each model step.

"Communication is a major performance bottleneck in distributed deep learning," explains the KAUST team. "Along with the fast-paced increase in model size, we also see an increase in the proportion of zero values that are produced during the learning process, which we call sparsity. Our idea was to exploit this sparsity to maximize effective bandwidth usage by sending only non-zero data blocks."

Building on an earlier KAUST development called SwitchML, which optimized internode communications by running efficient aggregation code on the network switches that process data transfer, Fei, Marco Canini and their colleagues went a step further by identifying zero results and developing a way to drop transmission without interrupting the synchronization of the parallel computing process.

"Exactly how to exploit sparsity to accelerate distributed training is a challenging problem," says the team. "All nodes need to process data blocks at the same location in a time slot, so we have to coordinate the nodes to ensure that only data blocks in the same location are aggregated. To overcome this, we created an aggregator process to coordinate the workers, instructing them on which block to send next."
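A toy sketch of the block-sparse idea described above, not the actual OmniReduce implementation: each worker splits its gradient into fixed-size blocks and transmits only the non-zero ones with their indices, and the aggregator sums whatever arrives for each block index.

```python
# Toy illustration of block-sparse aggregation (not the actual OmniReduce code):
# workers send only non-zero gradient blocks plus their indices; the aggregator
# sums whatever arrives for each block index and leaves missing blocks at zero.
import numpy as np

BLOCK = 4

def nonzero_blocks(grad):
    """Split a flat gradient into fixed-size blocks and keep only non-zero ones."""
    blocks = grad.reshape(-1, BLOCK)
    keep = np.flatnonzero(np.any(blocks != 0, axis=1))
    return {int(i): blocks[i] for i in keep}

def aggregate(worker_msgs, length):
    """Sum the sparse block messages from all workers into a dense gradient."""
    total = np.zeros(length)
    for msg in worker_msgs:
        for idx, block in msg.items():
            total[idx * BLOCK:(idx + 1) * BLOCK] += block
    return total

rng = np.random.default_rng(0)
grads = [np.where(rng.random(16) < 0.7, 0.0, rng.normal(size=16)) for _ in range(3)]
msgs = [nonzero_blocks(g) for g in grads]            # what each worker actually transmits
print("dense sum ok:", np.allclose(aggregate(msgs, 16), np.sum(grads, axis=0)))
```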

The team demonstrated their OmniReduce scheme on a testbed consisting of an array of graphics processing units (GPUs) and achieved an eight-fold speed-up for typical deep learning tasks.


New imaging, machine-learning methods speed effort to reduce crops’ need for water – University of Illinois News

CHAMPAIGN, Ill. Scientists have developed and deployed a series of new imaging and machine-learning tools to discover attributes that contribute to water-use efficiency in crop plants during photosynthesis and to reveal the genetic basis of variation in those traits.

The findings are described in a series of four research papers led by University of Illinois Urbana-Champaign graduate students Jiayang (Kevin) Xie and Parthiban Prakash, and postdoctoral researchers John Ferguson, Samuel Fernandes and Charles Pignon.

"The goal is to breed or engineer crops that are better at conserving water without sacrificing yield," said Andrew Leakey, a professor of plant biology and of crop sciences at the University of Illinois Urbana-Champaign, who directed the research.

"Drought stress limits agricultural production more than anything else," Leakey said. And scientists are working to find ways to minimize water loss from plant leaves without decreasing the amount of carbon dioxide the leaves take in.

Plants breathe in carbon dioxide through tiny pores in their leaves called stomata. That carbon dioxide drives photosynthesis and contributes to plant growth. But the stomata also allow moisture to escape in the form of water vapor.

A new approach to analyzing the epidermis layer of plant leaves revealed that the size and shape of the stomata (lighter green pores) in corn leaves strongly influence the crop's water-use efficiency.

Micrograph by Jiayang (Kevin) Xie


"The amount of water vapor and carbon dioxide exchanged between the leaf and atmosphere depends on the number of stomata, their size and how quickly they open or close in response to environmental signals," Leakey said. "If rainfall is low or the air is too hot and dry, there can be insufficient water to meet demand, leading to reduced photosynthesis, productivity and survival."

To better understand this process in plants like corn, sorghum and grasses of the genus Setaria, the team analyzed how the stomata on their leaves influenced the plants' water-use efficiency.

"We investigated the number, size and speed of closing movements of stomata in these closely related species," Leakey said. "This is very challenging because the traditional methods for measuring these traits are very slow and laborious."

For example, determining stomatal density previously involved manually counting the pores under a microscope. The slowness of this method means scientists are unable to analyze large datasets, Leakey said.

"There are a lot of features of the leaf epidermis that normally don't get measured because it takes too much time," he said. "Or, if they get measured, it's in really small experiments. And you can't discover the genetic basis for a trait with a really small experiment."

To speed the work, Xie took a machine-learning tool originally developed to help self-driving cars navigate complex environments and converted it into an application that could quickly identify, count and measure thousands of cells and cell features in each leaf sample.

Jiayang (Kevin) Xie converted a machine-learning tool originally designed to help self-driving cars navigate complex environments into an application that can quickly analyze features on the surface of plant leaves.

Photo by L. Brian Stauffer


"To do this manually, it would take you several weeks of labor just to count the stomata on a season's worth of leaf samples," Leakey said. "And it would take you months more to manually measure the sizes of the stomata or the sizes of any of the other cells."
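The team's code isn't shown in the article, but once a segmentation model has produced a binary mask of stomata, the counting and sizing step can be as simple as connected-component labelling; here is a generic scikit-image sketch on a toy mask, not the tool Xie built.

```python
# Generic sketch of counting and measuring pores from a binary segmentation mask,
# using connected-component labelling in scikit-image. The mask here is a toy array;
# in practice it would come from the trained segmentation model.
import numpy as np
from skimage.measure import label, regionprops

mask = np.zeros((64, 64), dtype=bool)
mask[5:12, 5:9] = True          # toy "stoma" 1
mask[30:40, 20:26] = True       # toy "stoma" 2

labeled = label(mask)
print("stomatal count:", labeled.max())
for region in regionprops(labeled):
    print("area (px):", region.area,
          "major axis (px):", round(region.major_axis_length, 1))
```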

The team used sophisticated statistical approaches to identify regions of the genome and lists of genes that likely control variation in the patterning of stomata on the leaf surface. They also used thermal cameras in field and laboratory experiments to quickly assess the temperature of leaves as a proxy for how much water loss was cooling the leaves.

"This revealed key links between changes in microscopic anatomy and the physiological or functional performance of the plants," Leakey said.

By comparing leaf characteristics with the plants water-use efficiency in field experiments, the researchers found that the size and shape of the stomata in corn appeared to be more important than had previously been recognized, Leakey said.

Along with the identification of genes that likely contribute to those features, the discovery will inform future efforts to breed or genetically engineer crop plants that use water more efficiently, the researchers said.

The new approach provides an unprecedented view of the structure and function of the outermost layer of plant leaves, Xie said.

"There are so many things we don't know about the characteristics of the epidermis, and this machine-learning algorithm is giving us a much more comprehensive picture," he said. "We can extract a lot more potential data on traits from the images we've taken. This is something people could not have done before."

Leakey is an affiliate of the Carl R. Woese Institute for Genomic Biology at the U. of I. He and his colleagues report their findings in a study published in The Journal of Experimental Botany and in three papers in the journal Plant Physiology (see below).

The National Science Foundation Plant Genome Research Program, the Advanced Research Projects Agency-Energy, the Department of Energy Biosystems Design Program, the Foundation for Food and Agriculture Research Graduate Student Fellows Program, The Agriculture and Food Research Initiative from the U.S. Department of Agriculture National Institute of Food and Agriculture, and the U. of I. Center for Digital Agriculture supported this research.
