Category Archives: Data Science

iPhone Creator Suggests Opinions Drive Innovation, not Data – Towards Data Science

Source: DALL-E

"We need to be data-driven," says everyone. And yes, I agree 90% of the time, but it shouldn't be taken as a blanket statement. Like everything else in life, recognizing where it does and doesn't apply is important.

In a world obsessed with data, it's the bold, opinionated decisions that break through to revolutionary innovation.

The Economist wrote about the rumoured, critical blunders of McKinsey in the 1980s, during the early days of the mobile phone era. AT&T asked McKinsey to project the size of the mobile phone market.

McKinsey, presumably after rigorous projections, expert calls, and data crunching, shared that the estimated total market would be about 900,000 mobile phones. They based it on data, specifically the data available at the time: phones were bulky and heavy, a necessary evil only for people who had to be mobile. Data lags.

AT&T pulled out initially, in part due to those recommendations, before diving back into the market to compete. Some weren't as lucky. Every strategy consultant in South Korea will know the rumours of McKinsey giving similar advice to one of the largest conglomerates that used to go head-to-head with Samsung: LG. They pulled out of the market and lost even the chance to take a shot at becoming a global leader in this estimated 500-billion-dollar market.

Today, the World Economic Forum shared in a recent analysis that there are more smartphones than people on Earth, with roughly 8.6 billion phone subscriptions.

Tony Fadell, the designer and builder of the iPhone and Nest, shares in his book Build that decisions are driven by some proportion of opinions and data, and that the very first version of a revolutionary product, as opposed to an evolutionary one, is by definition opinion-driven. The two are useful for different types of innovation.

See original here:

iPhone Creator Suggests Opinions Drive Innovation, not Data - Towards Data Science

Advanced Selection from Tensors in Pytorch | by Oliver S | Feb, 2024 – Towards Data Science

In some situations, you'll need to do some advanced indexing / selection with PyTorch, e.g. answer the question: how can I select elements from Tensor A following the indices specified in Tensor B?

In this post we'll present the three most common methods for such tasks, namely torch.index_select, torch.gather, and torch.take. We'll explain all of them in detail and contrast them with one another.

Admittedly, one motivation for this post was me forgetting how and when to use which function, ending up googling, browsing Stack Overflow, and consulting the, in my opinion, relatively brief and not too helpful official documentation. Thus, as mentioned, we here do a deep dive into these functions: we motivate when to use which, give examples in 2D and 3D, and show the resulting selection graphically.

I hope this post will bring clarity about said functions and remove the need for further exploration. Thanks for reading!

And now, without further ado, let's dive into the functions one by one. For each, we first start with a 2D example and visualize the resulting selection, then move to a somewhat more complex example in 3D. Further, we re-implement the executed operation in simple Python so that you can look at pseudocode as another source of information about what these functions do. At the end, we summarize the functions and their differences in a table.

torch.index_select selects elements along one dimension, while keeping the others unchanged. That is: keep all elements from all other dimensions, but pick elements in the target dimension following the index tensor. Let's demonstrate this with a 2D example, in which we select along dimension 1:
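The article's own code listing is not reproduced in this excerpt; the following is a minimal sketch of such a 2D selection along dimension 1 (the tensor contents and indices are illustrative, not the article's values):

```python
import torch

# A 2D tensor of shape [3, 4]
x = torch.arange(12).reshape(3, 4)
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])

# Pick columns 0 and 2 along dimension 1
indices = torch.tensor([0, 2])
selected = torch.index_select(x, dim=1, index=indices)
# tensor([[ 0,  2],
#         [ 4,  6],
#         [ 8, 10]])
```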

The resulting tensor has shape [len_dim_0, num_picks]: for every element along dimension 0, we have picked the same element from dimension 1. Let's visualize this:
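The figure itself is not included in this excerpt. In the spirit of the plain-Python re-implementations the article promises, here is a minimal sketch of the same selection logic for a nested-list input (an illustrative assumption, not the article's own code):

```python
def index_select_2d(x, dim, index):
    """Plain-Python sketch of torch.index_select for a 2D nested list."""
    if dim == 0:
        # keep only the selected rows
        return [x[i] for i in index]
    # dim == 1: for every row, pick the selected columns
    return [[row[j] for j in index] for row in x]

x = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(index_select_2d(x, dim=1, index=[0, 2]))  # [[0, 2], [4, 6], [8, 10]]
```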

Read more here:

Advanced Selection from Tensors in Pytorch | by Oliver S | Feb, 2024 - Towards Data Science

Mosaic Data Science’s Neural Search Solution Named the Top Insight Engine of 2024 by CIO Review – Newswire

Press Release Feb 28, 2024 10:00 EST

Mosaic Data Science has been recognized as the Top Insight Engines Solutions Provider of 2024 by CIO Review magazine for its Neural Search Engine framework.

LEESBURG, Va., February 28, 2024 (Newswire.com) - In a significant acknowledgment of its pioneering efforts in the realm of insight engines, Mosaic Data Science has been recognized as the Top Insight Engines Solutions Provider of 2024 by CIO Review magazine for its Neural Search Engine framework. The accolade is a testament to Mosaic's ability to address and solve complex customer challenges using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) architectures, positioning the company at the forefront of innovation in the Generative AI landscape.

The Neural Search Engine has revolutionized how businesses comb through vast amounts of data, automating text, image, video, and audio information retrieval from all corporate documents and significantly enhancing efficiency and productivity. With its advanced modeling and architecture frameworks, Neural Search provides firms with a robust set of templates for the secure tuning of AI models, tailoring them to an organization's specific data and requirements.

Mosaic's Neural Search Engine is designed for versatility. Whether organizations have already deployed a production-grade AI search system and seek assistance with nuanced queries or contextualized results, or are exploring the right LLM for their needs, Mosaic offers a custom-built, cutting-edge solution. The engine's ability to understand the nuances of human language and deliver actionable insights empowers businesses to make informed, data-driven decisions, effectively transforming how companies access and leverage information.

The Insight Engines award from CIO Review highlights Mosaic's commitment to a vendor-agnostic approach, ensuring seamless integration with existing data sources, infrastructure, AI, and governance tools. By adopting Mosaic's Neural Search Engine, businesses can embrace the future of search technology without discarding their current investments, taking what works and integrating it.

The recognition includes a feature in the print edition of CIO Review's Insight Engines special. This accolade is not just a win for Mosaic but a win for the future of efficient, intelligent search solutions that cater to the evolving needs of businesses.

Source: Mosaic Data Science

Read the original post:

Mosaic Data Science's Neural Search Solution Named the Top Insight Engine of 2024 by CIO Review - Newswire

Researchers receive National Science Foundation grant for long-term data research – Virginia Tech

Predicting the future of ecosystems requires a plethora of accurate data available in real-time.

To enable real-time data collection, three Virginia Tech researchers have received a prestigious five-year National Science Foundation grant to help better predict the future of ecosystems.

The $450,000 Long-Term Research in Environmental Biology (LTREB) grant will support the enhancement and continuation of field monitoring and data sharing at two freshwater drinking water supply reservoirs in Roanoke.

This grant will allow the researchers to create a cutting-edge ecological monitoring program with real-time data access and publishing, a process that normally takes weeks to years after data collection.

"We are developing one of the first open-source automated forecasting systems in the world by using the reservoirs as a test bed for exploring new data collection, data access, and forecasting methods," said Cayelan Carey, professor and the Roger Moore and Mejdeh Khatam-Moore Faculty Fellow in the Department of Biological Sciences. "We will use our new designation as an official Long-Term Research in Environmental Biology monitoring site as a platform for scaling and disseminating our data so that other researchers can similarly start to forecast water quality in lakes and reservoirs around the globe."

The LTREB program is one of the National Science Foundation's premier environmental science programs to support long-term monitoring at select exemplar terrestrial, coastal, and freshwater ecosystems across the U.S.

Carey leads the Virginia Reservoirs LTREB team with Professor Madeline Schreiber in the Department of Geosciences and Associate Professor Quinn Thomas, who has a joint appointment in the Departments of Forest Resources and Environmental Conservation and Biological Sciences.

With a group of co-mentored students, technicians, and postdoctoral researchers, Carey, Schreiber, and Thomas have been monitoring biological, chemical, and physical metrics of water quality in the two reservoirs for the past decade in partnership with the Western Virginia Water Authority, which owns and manages the reservoirs for drinking water.

See the rest here:

Researchers receive National Science Foundation grant for long-term data research - Virginia Tech

From Algorithms to Answers: Demystifying AI’s Impact on Scientific Discovery – Argonne National Laboratory

Following the explosion of tools like ChatGPT, the use of artificial intelligence seems to be everywhere. But what exactly does it mean for the world of scientific discovery? This presentation aims to unravel the complexity surrounding AI's role in scientific processes, shedding light on how algorithms and machine learning have become critical tools for researchers.

Presenters will share real-world examples showcasing AI's integration with traditional scientific methods and highlight the critical leadership role Argonne is playing in framing ethical use. By explaining the technical aspects, ethical considerations, and practical applications, this presentation will demystify the relationship between AI and science, fostering a deeper understanding of the innovative landscape that lies ahead for scientific discovery.

Ian Foster, Introductory Remarks: Director, Data Science and Learning Division

Sean Jones, Moderator: Deputy Laboratory Director for Science & Technology

Arvind Ramanathan, Panelist: Computational Biologist

Mathew Cherukara, Panelist: Computational Scientist

Casey Stone, Panelist: Computational Scientist

Go here to read the rest:

From Algorithms to Answers: Demystifying AI's Impact on Scientific Discovery - Argonne National Laboratory

Diffusion Transformer Explained. Exploring the architecture that brought | by Mario Namtao Shianti Larcher | Feb, 2024 – Towards Data Science

Exploring the architecture that brought transformers into image generation. Image generated with DALL-E.

After shaking up NLP and moving into computer vision with the Vision Transformer (ViT) and its successors, transformers are now entering the field of image generation. They are gradually becoming an alternative to the U-Net, the convolutional architecture upon which all the early diffusion models were built. This article looks into the Diffusion Transformer (DiT), introduced by William Peebles and Saining Xie in their paper Scalable Diffusion Models with Transformers.

DiT has influenced the development of other transformer-based diffusion models like PIXART-α, Sora (OpenAI's astonishing text-to-video model), and, as I write this article, Stable Diffusion 3. Let's start exploring this emerging class of architectures that are contributing to the evolution of diffusion models.

Given that this is an advanced topic, I'll have to assume a certain familiarity with recurring concepts in AI and, in particular, in image generation. If you're already familiar with this field, this section will help refresh these concepts, providing you with further references for a deeper understanding.

If you want an extensive overview of this world before reading this article, I recommend reading my previous article below, where I cover many diffusion models and related techniques, some of which we'll revisit here.

At an intuitive level, diffusion models function by first taking images, introducing noise (usually Gaussian), and then training a neural network to reverse this noise-adding process.
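To make that intuition concrete, here is a minimal sketch of the forward noising step used by typical DDPM-style diffusion models; the noise schedule, tensor shapes, and timestep below are illustrative assumptions, not taken from the article:

```python
import torch

def add_noise(x0, t, betas):
    """Corrupt a clean image x0 with Gaussian noise at diffusion step t."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]        # cumulative signal retention at step t
    noise = torch.randn_like(x0)                       # Gaussian noise
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    return x_t, noise                                  # the model learns to predict `noise` from x_t

betas = torch.linspace(1e-4, 0.02, 1000)               # assumed linear schedule
x0 = torch.randn(1, 3, 32, 32)                         # stand-in for a clean image batch
x_t, target_noise = add_noise(x0, t=500, betas=betas)
```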

Read the original post:

Diffusion Transformer Explained. Exploring the architecture that brought | by Mario Namtao Shianti Larcher | Feb, 2024 - Towards Data Science

GenAI and LLM: Key Concepts You Need to Know – DataScienceCentral.com – Data Science Central

It is difficult to follow all the new developments in AI. How can you discriminate between fundamental technology that is here to stay and the hype? How do you make sure you are not missing important developments? The goal of this article is to provide a short summary, presented as a glossary. I focus on recent, well-established methods and architectures.

I do not cover the different types of deep neural networks, loss functions, or gradient descent methods: in the end, these are the core components of many modern techniques, but they have a long history and are well documented. Instead, I focus on new trends and emerging concepts such as RAG, LangChain, embeddings, diffusion, and so on. Some may be quite old (embeddings), but have gained considerable popularity in recent times, due to widespread use in new ground-breaking applications such as GPT.

The landscape evolves in two opposite directions. On one side, well-established GenAI companies implement neural networks with trillions of parameters, growing ever larger, using considerable amounts of GPU power, and costing a great deal. People working on these products believe that the easiest fix to current problems is to use the same tools with bigger training sets. After all, it also generates more revenue. And indeed, it can solve some sampling issues and deliver better results. There is some emphasis on faster implementations, but speed, and especially size, are not top priorities. In short, more brute force is the key to optimization.

On the other side, new startups, including my own, focus on specialization. The goal is to extract as much useful data as you can from much smaller, carefully selected training sets, to deliver highly relevant results to specific audiences. After all, there is no single best evaluation metric: depending on whether you are a layman or an expert, your criteria for assessing quality are very different, even opposite. In many cases, the end users are looking for solutions to deal with their small internal repositories and relatively small number of users. More and more companies are concerned with costs and ROI on GenAI initiatives. Thus, in my opinion, this approach has more long-term potential.

Still, even with specialization, you can process the entire body of human knowledge (the whole Internet) with a fraction of what OpenAI needs (much less than one terabyte), much faster, with better results, even without neural networks: in many instances, much faster algorithms can do the job, and do it better, for instance by reconstructing and leveraging taxonomies. One potential architecture consists of multiple specialized LLMs or sub-LLMs, one per top category. Each one has its own set of tables and embeddings. The cost is dramatically lower, and the results are more relevant to the user, who can specify categories along with the prompt. If, in addition, you allow users to choose the parameters of their liking, you end up with self-tuned LLMs and/or customized output. I discuss some of these new trends in more detail in the next section. It is not limited to LLMs only.
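As a rough illustration of that multi-sub-LLM idea, here is a hypothetical sketch of routing a prompt to a per-category engine that owns its own embedding table; all names and the category list are invented for the example, not taken from the author's system:

```python
from dataclasses import dataclass, field

@dataclass
class SubLLM:
    category: str
    embeddings: dict = field(default_factory=dict)  # term -> vector, built from this category's corpus

    def answer(self, prompt: str) -> str:
        # placeholder for category-specific retrieval and generation
        return f"[{self.category}] answer to: {prompt}"

# one specialized engine per top category
sub_llms = {c: SubLLM(c) for c in ["finance", "health", "engineering"]}

def route(prompt: str, category: str) -> str:
    # the user specifies the category along with the prompt
    return sub_llms[category].answer(prompt)

print(route("What is a covered call?", category="finance"))
```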

The list below is in alphabetical order. In many cases, the description highlights how I use the concepts in question in my own open-source technology.

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author, and patent owner (one patent related to LLMs). Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.

Read the rest here:

GenAI and LLM: Key Concepts You Need to Know - DataScienceCentral.com - Data Science Central

Building a Data Warehouse. Best practice and advanced techniques | by Mike Shakhomirov | Feb, 2024 – Towards Data Science

Best practice and advanced techniques for beginners

In this story, I would like to talk about data warehouse design and how we organise the process. Data modelling is an essential part of data engineering. It defines the database structure, the schemas we use, and the data materialisation strategies for analytics. Designed in the right way, it helps ensure our data warehouse runs efficiently, meeting all business requirements and cost optimisation targets. We will touch on some well-known best practices in data warehouse design, using the dbt tool as an example. We will take a closer look at some examples of how to organise the build process, test our datasets, and use advanced techniques with macros for better workflow integration and deployment.

Let's say we have a data warehouse and lots of SQL to deal with the data we have in it.

In my case it is Snowflake: a great tool and one of the most popular solutions in the market right now, definitely among the top three tools for this purpose.

So how do we structure our data warehouse project? Consider the starter project folder structure below. This is what we have after we run the dbt init command.
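The original listing of that structure is not reproduced in this excerpt; the layout below is a typical sketch of what dbt init creates (exact file names vary by dbt version, and the example model names are illustrative):

```
my_dbt_project/
├── dbt_project.yml
├── models/
│   └── example/
│       ├── my_first_dbt_model.sql     # e.g. "table_a"
│       ├── my_second_dbt_model.sql    # e.g. "table_b"
│       └── schema.yml
├── seeds/
├── snapshots/
├── macros/
├── analyses/
└── tests/
```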

At the moment we can see only one model called example, with table_a and table_b objects. These can be any data warehouse objects that relate to each other in a certain way, i.e. views, tables, dynamic tables, etc.

When we start building our data warehouse, the number of these objects will inevitably grow, and it is best practice to keep them organised.

The simple way of doing this would be to organise the models folder so that it is split into base models (basic row transformations) and analytics models. In the analytics subfolder, we would typically have data deeply enriched and

Excerpt from:

Building a Data Warehouse. Best practice and advanced techniques | by Mike Shakhomirov | Feb, 2024 - Towards Data Science

For whoever thought data scientists knew churn prediction – Times of India

I have learned three things over time: 1) Just because an excuse is correct doesn't make it any less of an excuse. 2) More value is added to a discussion by asking the right questions than by giving the right answers. 3) You cannot say half of the job is done in stand-up comedy by just standing up. I am a little carried away by the second one today. I am going to discuss the topic of churn analytics by only asking questions. You tell me whether the questions guide you in the right direction.

How do we define a churn so that we identify exactly how many churns have happened in the past, and when? Is it the event when the status of the customer was updated as inactive in the IT system? Would defining churn as a reduction in product or service usage by more than a threshold be more helpful? Is it when the customer didn't make bill payments for several consecutive months? What signals or patterns are more likely to occur for customers about to churn? How do patterns differ among customers who are not at risk of churn? Should we consider churn a binary event (churn, no churn) or a multi-level event with various gradations of churn?

How far in advance should we predict churn? How much lead time does the business team need to act on at-risk customers? What offers or campaigns suit a customer predicted to churn in three months?

What urgent incentives do you offer a customer predicted to churn next week? Is it OK to let specific customers churn? Should we consider a customer's lifetime value (LTV) prediction to decide how much criticality to assign to saving the customer? How do we calculate the effectiveness of a churn prediction? If a customer is predicted to churn but doesn't, is it a good churn-prevention action or a bad prediction? Isn't it always a wrong prediction if the customer is not predicted to churn but churns? Don't you think that because the customer was considered not at risk, no action was taken?

How does a customer's journey look up to the point of churn? Does that give us a good peek into the factors causing churn? Was the outcome of a customer's touch point with the organisation negative? How many touch points went negative for a customer? Which channels (automated IVR, call centre, mobile app, website, stores, in-person interaction, SMS, email, WhatsApp) mattered more in influencing churn? On average, have we taken more time to solve tickets for customers who eventually churned compared to those who didn't? Were quantitative responses to customer survey questions harsher from customers who churned? Was the sentiment derived from a qualitative response negative for a customer who churned? Does churn increase with decreasing NPS and vice versa?

Which product and service areas have a higher churn rate? How does the tenure of customers correlate with the churn rate? Are new customers more likely to churn compared to older ones? Do customers who have a higher frequency of touch points churn more or less? Did the customer have a good experience during onboarding? What does the trend of the churn rate look like? Is there any seasonality? Is there any other pattern in the movement? How many times has the customer defaulted on bill payments?

Should we frame the churn problem as a binary classification problem or a multi-class classification problem? Before building a churn prediction model, should we first cluster the customers into different groups through an unsupervised learning technique? Which influencers of churn can be controlled (for example, customer experience)? Which influencers of churn are outside the organisation's control (for example, an economic slowdown)? When I say something impacts churn negatively, isn't it ambiguous? Instead, shouldn't I say something decreases the churn rate or something reduces the churn numbers? Shouldn't I be clear on whether new customer additions offset churn numbers? Or is churn calculation independent of how many new customers the organisation has added? Are churn numbers and churn rates calculated at a monthly frequency? Half-yearly? Annually? What is the target reduction in churn rate through this prediction solution? Do we know how much in monetary savings it translates to if we reduce churn by the target rate?
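For readers who want to see the binary-classification framing mentioned above in code, here is a minimal, hypothetical sketch; the dataset, feature names, and the three-month label definition are assumptions for illustration, not taken from the article:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# Hypothetical customer table with engineered features and a churn label
df = pd.read_csv("customers.csv")
features = ["tenure_months", "negative_touchpoints", "missed_payments", "usage_drop_pct"]
X, y = df[features], df["churned_within_3_months"]   # label defined with a 3-month lead time

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```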

Views expressed above are the author's own.


See the rest here:

For whoever thought data scientists knew churn prediction - Times of India

What is the difference between data science and data analytics? – Fortune

Data is in demand. And it is no surprise that jobs that help to collect, discern, and utilize data are growing, and fast.

Occupations that deal with data are projected to have strong job growth by 2031, according to the U.S. Bureau of Labor Statistics. For context, on average, all occupations are expected to grow by 5%. Data scientists, as just one example of data-related occupations, are projected to grow by over seven times that amount, at 36%.

Moreover, data-related occupations had a median annual wage above the median for all occupations in 2022, and data scientists in particular make double the median wage in the U.S.

The terms data science and data scientist have only been popularized within the past decade, according to Wade Fagen-Ulmschneider, a teaching associate professor of computer science at the University of Illinois. Data analytics, on the other hand, has been around for longer and is a field that many students of statistics, economics, and even some social sciences end up pursuing.

With the two areas often discussed in conjunction with one another, you may be wondering: what's the real difference between data science and data analytics? Fortune is here to help.

Broadly speaking, data science is the study of using and applying data to solve real-world problems. It encompasses multiple areas, including AI, machine learning, and algorithms, and it intersects closely with subjects like computer science, statistics, and business. It can also encompass data analytics itself.

"Data science is typically about estimation of unknown phenomena and prediction of future events," Joel Shapiro, clinical associate professor of managerial economics and decision sciences at Northwestern University's Kellogg School of Management, tells Fortune. "Data science can include capturing and managing data, building algorithms, and articulating the implications of results."

Fagen-Ulmschneider previously told Fortune that he believes data science skills will soon be as ubiquitous as knowing Microsoft Office skills.

Instead of looking at the future, data analytics focuses more on the past, as well as the now.

"Data analytics uses historical data to identify trends and articulate the implications of those trends," Shapiro says. Experts in the field also tend to be adept at data visualization techniques.

The field is important, Shapiro adds, because it helps uncover the stories that otherwise may not be seen or found.

"There are so many things that can be measured, and it is impossible for any person to track all of them, let alone really understand how they relate to one another. Those trends and relationships enable us to understand and synthesize past events, which then can be used to make future-looking decisions," Shapiro says.

There's no question that data science and data analytics are inherently similar. And from a business perspective, both can be critical components of decision-making.

In terms of skills, those working in data science and data analytics will likely be working as part of a team of experts, so effective communication and collaboration skills are important.

On the more technical side, Fagen-Ulmschneider notes that both data science and data analytics benefit from skills in statistics, mathematics, and computer science. For those particularly interested in data science, he suggests leaning heavily on computer science, while students wanting to become consultants should focus on statistics or even actuarial science and finance.

Shapiro goes further and notes that data science requires deeper knowledge of things like statistics, machine learning, coding, experimentation, and predictive modeling. Data science, he adds, is better suited to individualized applications, like customized customer experiences, optimized pricing, and differentiated messaging for digital users.

On the other hand, data analytics, Shapiro says, typically requires knowledge of basic data management, some statistics and data visualization techniques and technologies.

So, if things like AI, machine learning, and predictive models excite you, focusing on data science may be for you; whereas if you are drawn to using data to identify and visualize trends, you may want to take a closer look at the analytical side.

Overall, though, data science and data analytics lean on each other, and many of the skills and expertise needed to succeed in either area are similar. Data science and data analytics are not mutually exclusive, and both play a major role in solving the biggest problems in today's world.

See more here:

What is the difference between data science and data analytics? - Fortune