Category Archives: Data Science

Exploring Medusa and Multi-Token Prediction | by Matthew Gunton | Jul, 2024 – Towards Data Science

Speculative Decoding was introduced as a way to speed up inference for an LLM. You see, LLMs are autoregressive, meaning we take the output token that we just predicted and use it to help predict the next token we want. Typically we are predicting one token at a time (one token per forward pass of the neural network). However, because the attention pattern for the next token is very similar to the attention pattern from the previous one, we are repeating most of the same calculations without gaining much new information.

Speculative decoding means that rather than doing one forward pass for one token, after each forward pass we try to accept as many tokens as we can. In general there are three steps for this:

(1) Generate the candidates

(2) Process the candidates

(3) Accept certain candidates

Medusa is a type of speculative decoding, and so its steps map directly onto these. Medusa appends decoding heads to the final layer of the model as its implementation of (1). Tree attention is how it processes the candidates for (2). Finally, Medusa uses either rejection sampling or a typical acceptance scheme to accomplish (3). Let's go through each of these in detail.

A decoding head takes the internal representation of the hidden state produced by a forward pass of the model and then creates the probabilities that correspond to different tokens in the vocabulary. In essence, it is converting the things the model has learned into probabilities that will determine what the next token is.

Medusa adjusts the architecture of a typical Transformer by appending multiple decoding heads to the last hidden layer of the model. By doing so, it can predict more than just one token given a forward pass. Each additional head that we add predicts one token further. So if you have 3 Medusa heads, you are predicting the first token from the forward pass, and then 3 more tokens after that with the Medusa heads. In the paper, the authors recommend using 5, as they saw this gave the best balance between speed-up and quality.

To accomplish this, the authors of the paper proposed the below decoder head for Medusa:
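
Written out as described in the next paragraph (h_t is the last hidden state for token t, and W1, W2 are the k-th head's trained weights), the head computes:

p_t(k) = softmax( W2 · ( SiLU(W1 · h_t) + h_t ) )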

This equation gives us the probability of token t from the k-th head. We start off by taking the weights we've found through training the Medusa head, W1, and multiplying them by our internal state for token t. We use the SiLU activation function to pass through only selective information (SiLU = x * sigmoid(x)). We add the internal state a second time as part of a skip connection, which allows the model to be more performant by not losing information during the SiLU activation. We then multiply the sum by the second set of weights we've trained for the head, W2, and run that product through a softmax to get our probability.
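
As a rough illustration, a single Medusa head can be written as a small PyTorch module (the shapes and names here are assumptions for the sketch, not the authors' implementation):

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One Medusa-style decoding head: softmax(W2 (SiLU(W1 h) + h))."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, hidden_size)              # W1
        self.w2 = nn.Linear(hidden_size, vocab_size, bias=False)   # W2 (vocabulary projection)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # SiLU(W1 h) + h is the skip connection described above
        h = nn.functional.silu(self.w1(hidden_state)) + hidden_state
        # Softmax over the vocabulary gives this head's token probabilities
        return torch.softmax(self.w2(h), dim=-1)
```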

The first Medusa head gives the model probabilities to consider based off the forward pass, but the subsequent Medusa heads need to figure out which token to pick based off what the prior Medusa heads chose.

Naturally, the more options the earlier Medusa heads put forward (the hyperparameter s_k for head k), the more options future heads need to consider. For example, when we consider just the top two candidates from head 1 (s1 = 2) and the top three from head 2 (s2 = 3), we wind up with 6 different continuations we need to compute.
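
A quick sketch of this combinatorial growth (the tokens are made up for illustration):

```python
from itertools import product

head1_candidates = ["It", "I"]            # s1 = 2
head2_candidates = ["is", "was", "will"]  # s2 = 3

# Every candidate continuation that must be verified: 2 * 3 = 6 paths
candidate_paths = list(product(head1_candidates, head2_candidates))
print(len(candidate_paths))  # 6
```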

Due to this expansion, we would like to generate and verify these candidates as concurrently as possible.

The above matrix shows how we can run all of these calculations within the same batch via tree attention. Unlike typical causal self-attention, only the tokens from the same continuation are considered relevant for the attention pattern. As the matrix illustrates, within this limited space we can fit all of our candidates into one batch and run attention on them concurrently.

The challenge here is that each prediction needs to consider only the candidate tokens that would be directly behind it. In other words, if we choose "It" from head 1, and we are evaluating which token should come next, we do not want the attention pattern for "I" to be used for those tokens.

The authors avoid this kind of interference by using a mask to avoid passing data about irrelevant tokens into the attention calculation. By using this mask, they can be memory efficient while they calculate the attention pattern & then use that information in the decoding head to generate the subsequent token candidates.
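
To make the idea concrete, here is a minimal sketch of how such a mask could be built for a small candidate tree (the tree layout and indices are made up for illustration; the paper's actual implementation is more involved):

```python
import torch

# Hypothetical candidate tree: two head-1 candidates ("It", "I"), then their continuations.
# parent[i] gives the index of node i's parent; -1 means the node hangs off the base token.
parents = [-1, -1, 0, 0, 1]

n = len(parents)
mask = torch.zeros(n, n, dtype=torch.bool)
for i in range(n):
    j = i
    while j != -1:        # walk up the tree, allowing attention only to ancestors (and itself)
        mask[i, j] = True
        j = parents[j]

print(mask.int())         # 1 where attention is allowed, 0 where it is masked out
```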

While the above matrix shows us considering every prediction the same, if we have a probability for each prediction, we can treat these differently based on how likely they are to be the best choice. The below tree visualizes just that.

Read more:

Exploring Medusa and Multi-Token Prediction | by Matthew Gunton | Jul, 2024 - Towards Data Science

An Off-Beat Approach to Train-Test-Validation Split Your Dataset | by Amarpreet Singh | Jul, 2024 – Towards Data Science

Generated with Microsoft Designer

We all need to sample our population to perform statistical analysis and gain insights. When we do so, the aim is to ensure that our sample's distribution closely matches that of the population.

For this, we have various methods: simple random sampling (where every member of the population has an equal chance of being selected), stratified sampling (which involves dividing the population into subgroups and sampling from each subgroup), cluster sampling (where the population is divided into clusters and entire clusters are randomly selected), systematic sampling (which involves selecting every nth member of the population), and so on. Each method has its advantages and is chosen based on the specific needs and characteristics of the study.

In this article, we won't be focusing on sampling methods themselves per se, but rather on using these concepts to split the dataset used for machine learning approaches into Train-Test-Validation sets. These approaches work for all kinds of tabular data. We will be working in Python here.

Below are some approaches that you might already know:

This approach uses the random sampling method. Example code:
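
A minimal sketch with scikit-learn (the toy data and 70/15/15 split are illustrative choices, not the article's exact code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data stands in for the article's (unspecified) tabular dataset
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)
```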

This approach ensures that the splits maintain the same proportion of classes as the original dataset (with random sampling again of course), which is useful for imbalanced datasets. This approach will work when your target variable is not a continuous variable.
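Continuing with the X and y from the previous sketch, stratification is a matter of passing `stratify` (again, an illustrative sketch):

```python
from sklearn.model_selection import train_test_split

# `stratify` keeps the class proportions of y in every subset
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)
```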

In K-Fold cross-validation, the dataset is split into k subsets (folds). The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times.
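A minimal K-Fold sketch (k = 5 is an arbitrary choice), reusing the toy X and y from above:

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit and evaluate the model on this fold here
```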

As the name suggests, this is a combination of Stratified sampling and K-fold cross-validation.

Full example usage:
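
A plausible end-to-end sketch (the classifier choice is an assumption) that wires Stratified K-Fold into model evaluation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(random_state=42)

# Each fold keeps the class balance of y; `scores` holds one accuracy value per fold
scores = cross_val_score(model, X, y, cv=skf)
print(scores.mean(), scores.std())
```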

Now, you can use these methods to split your dataset, but they have the following limitations:

Now, suppose you have a small total number of observations in your dataset and it's difficult to ensure similar distributions amongst your splits. In that case, you can combine clustering and random sampling (or stratified sampling).

Below is how I did it for my problem at hand:

In this method, first, we cluster our dataset and then use sampling methods on each cluster to obtain our data splits.

For example, using HDBSCAN:
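
A minimal sketch with the hdbscan package (the parameter value is illustrative), clustering the toy X from above:

```python
import hdbscan  # pip install hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(X)  # label -1 marks points treated as noise
```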

You can also use other clustering methods according to your problem for eg. K-Means clustering:
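
Or, with scikit-learn's K-Means (five clusters is an arbitrary choice):

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)
```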

Now you can also add levels of granularity (any categorical variable) to your dataset to get more refined clusters, as follows:
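
One way to sketch this (the categorical column here is hypothetical) is to one-hot encode the category and cluster on the augmented features:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical categorical variable (e.g. a region or product type)
category = np.random.choice(["A", "B", "C"], size=len(X))

# Append the one-hot encoded category so clusters respect this extra granularity
X_aug = np.hstack([X, pd.get_dummies(category).to_numpy()])
cluster_labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X_aug)
```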

Once you have obtained cluster labels from any clustering method, you can use random sampling or stratified sampling to select samples from each cluster.

We will select indices randomly and then use these indices to select our train-test-val sets as follows:
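
A sketch of that per-cluster random selection (the 70/15/15 proportions are an assumption):

```python
import numpy as np

rng = np.random.default_rng(42)
train_idx, val_idx, test_idx = [], [], []

# Draw a random 70/15/15 split inside each cluster, then pool the indices
for label in np.unique(cluster_labels):
    idx = rng.permutation(np.where(cluster_labels == label)[0])
    n = len(idx)
    train_idx.extend(idx[: int(0.70 * n)])
    val_idx.extend(idx[int(0.70 * n): int(0.85 * n)])
    test_idx.extend(idx[int(0.85 * n):])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```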

As per my use case, it was useful to sort my target variable y and then assign every 1st, 2nd, and 3rd index to the train, test, and validation set respectively (all mutually exclusive), a.k.a. systematic random sampling, as below:
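
A minimal sketch of that systematic assignment (applied here to the full toy dataset for brevity):

```python
import numpy as np

order = np.argsort(y)        # sort observations by the target variable
train_idx = order[0::3]      # every 1st index of each consecutive triplet
test_idx = order[1::3]       # every 2nd index
val_idx = order[2::3]        # every 3rd index

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
X_val, y_val = X[val_idx], y[val_idx]
```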

The above-discussed approaches of combining clustering with different sampling methods are very useful when you have a small number of observations in your dataset, as they help maintain similar distributions amongst the Train, Test, and Validation sets.

Thanks for reading, and I hope you find this article helpful!

Visit link:

An Off-Beat Approach to Train-Test-Validation Split Your Dataset | by Amarpreet Singh | Jul, 2024 - Towards Data Science

Perception-Inspired Graph Convolution for Music Understanding Tasks | by Emmanouil Karystinaios | Jul, 2024 – Towards Data Science

This article discusses MusGConv, a perception-inspired graph convolution block for symbolic musical applications.

In the field of Music Information Research (MIR), the challenge of understanding and processing musical scores has continually been met with new methods and approaches. Most recently, many graph-based techniques have been proposed as a way to target music understanding tasks such as voice separation, cadence detection, composer classification, and Roman numeral analysis.

This blog post covers one of my recent papers in which I introduced a new graph convolutional block, called MusGConv, designed specifically for processing music score data. MusGConv takes advantage of music perceptual principles to improve the efficiency and the performance of graph convolution in Graph Neural Networks applied to music understanding tasks.

Traditional approaches in MIR often rely on audio or symbolic representations of music. While audio captures the intensity of sound waves over time, symbolic representations like MIDI files or musical scores encode discrete musical events. Symbolic representations are particularly valuable as they provide higher-level information essential for tasks such as music analysis and generation.

However, existing techniques based on symbolic music representations often borrow from computer vision (CV) or natural language processing (NLP) methodologies. For instance, representing music as a pianoroll in a matrix format and treating it similarly to an image, or representing music as a series of tokens and treating it with sequential models or transformers. These approaches, though effective, can fall short in fully capturing the complex, multi-dimensional nature of music, which includes hierarchical note relations and intricate pitch-temporal relationships. Some recent approaches have been proposed to model the musical score as a graph and apply Graph Neural Networks to solve various tasks.

The fundamental idea of GNN-based approaches to musical scores is to model a musical score as a graph where notes are the vertices and edges are built from the temporal relations between the notes. To create a graph from a musical score we can consider four types of edges (see Figure below for a visualization of the graph on the score):

A GNN can then operate on the graph created from the notes and these four types of relations.

MusGConv is designed to leverage music score graphs and enhance them by incorporating principles of music perception into the graph convolution process. It focuses on two fundamental dimensions of music: pitch and rhythm, considering both their relative and absolute representations.

Absolute representations refer to features that can be attributed to each note individually, such as the note's pitch or spelling, its duration, or any other feature. On the other hand, relative features are computed between pairs of notes, such as the musical interval between two notes, their onset difference (i.e. the difference between the times at which they occur), etc.

The importance and coexistence of the relative and absolute representations can be understood from a transpositional perspective in music. Imagine the same musical content transposed. Then, the intervallic relations between notes stay the same, but the pitch of each note is altered.

To fully understand the inner workings of the MusGConv convolution block it is important to first explain the principles of Message Passing.

In the context of GNNs, message passing is a process where vertices within a graph exchange information with their neighbors to update their own representations. This exchange allows each node to gather contextual information from the graph, which is then used for predictive tasks.

The message passing process is defined by the following steps:
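
In generic terms (this plain-PyTorch sketch is illustrative, not the paper's code), one round of message passing computes a message per edge, aggregates the messages arriving at each node, and updates the node representation:

```python
import torch

def message_passing_step(x, edge_index, W_msg, W_upd):
    """x: (num_nodes, d) node features; edge_index: (2, num_edges) src/dst indices."""
    src, dst = edge_index
    messages = x[src] @ W_msg                                      # 1) compute a message per edge
    aggregated = torch.zeros_like(x).index_add_(0, dst, messages)  # 2) sum messages per destination node
    return torch.relu(x @ W_upd + aggregated)                      # 3) update node representations
```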

MusGConv alters the standard message passing process mainly by incorporating both absolute features as node features and relative musical features as edge features. This design is tailored to fit the nature of musical data.

The MusGConv convolution is defined by the following steps:

By designing the message passing mechanism in this way, MusGConv attempts to preserve the relative perceptual properties of music (such as intervals and rhythms), leading to more meaningful representations of musical data.

Should edge features be absent or deliberately not provided, MusGConv computes the edge features between two nodes as the absolute difference between their node features. The version of MusGConv with edge features is named MusGConv(+EF) in the experiments.
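
A rough sketch of that fallback inside a MusGConv-like message computation (the projection and concatenation scheme are assumptions for illustration, not the paper's exact formulation):

```python
import torch

def musgconv_messages(x_abs, edge_index, W, edge_feats=None):
    """x_abs: absolute node features; edge_feats: relative features per edge (optional)."""
    src, dst = edge_index
    if edge_feats is None:
        # Fallback described above: relative features as |x_i - x_j|
        edge_feats = (x_abs[src] - x_abs[dst]).abs()
    # Combine absolute (node) and relative (edge) information in the message
    messages = torch.cat([x_abs[src], edge_feats], dim=-1) @ W
    return torch.zeros(x_abs.shape[0], W.shape[1]).index_add_(0, dst, messages)
```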

To demonstrate the potential of MusGConv, I discuss below the tasks and experiments conducted in the paper. All models, independent of the task, are designed with the pipeline shown in the figure below. When MusGConv is employed, the GNN blocks are replaced by MusGConv blocks.

I decided to apply MusGConv to four tasks: voice separation, composer classification, Roman numeral analysis, and cadence detection. Each of these tasks presents a different taxonomy from a graph learning perspective. Voice separation is a link prediction task, composer classification is a global classification task, cadence detection is a node classification task, and Roman numeral analysis can be viewed as a subgraph classification task. Therefore, we are exploring the suitability of MusGConv not only from a musical analysis perspective but throughout the spectrum of the graph deep learning task taxonomy.

Read the original:

Perception-Inspired Graph Convolution for Music Understanding Tasks | by Emmanouil Karystinaios | Jul, 2024 - Towards Data Science

A Weekend AI Project: Object Detection with YOLO on PC and Raspberry Pi | by Dmitrii Eliuseev | Jul, 2024 – Towards Data Science

Running the Latest YOLO v10 Model on Different Hardware

YOLO Object Detection, Image by author

Computer vision can be an important part of ML apps of different scales, from $20,000 Tesla Bots or self-driving cars to smart doorbells and vacuum cleaners. It is also a challenging task because, compared to a cloud infrastructure, on real edge devices, the hardware specs are often much more constrained.

YOLO (You Only Look Once) is a popular object detection library; its first version was released in 2015. YOLO is particularly interesting for embedded devices because it can run almost anywhere; there are not only Python but also C++ (ONNX and OpenVINO) and Rust versions available. A year ago, I tested YOLO v8 on a Raspberry Pi 4. Nowadays, many things have changed: a new Raspberry Pi 5 became available, and a newer YOLO v10 was released. So I expect a new model on new hardware to work faster and more precisely.

The code presented in this article is cross-platform, so readers who don't have a Raspberry Pi can run it on a Windows, Linux, or OS X computer as well.

Without further ado, let's see how it works!
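
As a taste of what that looks like, here is a minimal detection sketch with the ultralytics package (the checkpoint name and image path are assumptions; the article's own code may differ):

```python
from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov10n.pt")        # small YOLO v10 checkpoint
results = model("test_image.jpg")  # run object detection on an image
results[0].show()                  # display the boxes and class labels
```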

For someone who may have never heard about the Raspberry Pi, let's make a short

Read more here:

A Weekend AI Project: Object Detection with YOLO on PC and Raspberry Pi | by Dmitrii Eliuseev | Jul, 2024 - Towards Data Science

Catching online scammers: our model combines data and behavioural science to map the psychological games cybercriminals play – The Conversation Canada

When fiction's most famous detective, Sherlock Holmes, needed to solve a crime, he turned to his sharp observational skills and deep understanding of human nature. He used this combination more than once when facing off against his arch-nemesis, Professor James Moriarty, a villain adept at exploiting human weaknesses for his gain.

This classic battle mirrors today's ongoing fight against cybercrime. Like Moriarty, cybercriminals use cunning strategies to exploit their victims' psychological vulnerabilities. They send deceptive emails or messages that appear to be from trusted sources such as banks, employers, or friends. These messages often contain urgent requests or alarming information to provoke an immediate response.

For example, a phishing email might claim there has been suspicious activity on a victim's bank account and prompt them to click on a link to verify their account details. Once the victim clicks the link and enters their information, the attackers capture their credentials for malicious use. Or individuals are manipulated into divulging confidential information to compromise their own or a company's security.

Holmes had to outsmart Moriarty by understanding and anticipating his moves. Modern cybersecurity teams and users must stay vigilant and proactive to outmanoeuvre cybercriminals who continuously refine their deceptive tactics.


What if those trying to prevent cybercrime could harness Holmes's skills? Could those skills complement existing, more data-driven ways of identifying potential threats? I am a professor of information systems whose research focuses on, among other things, integrating data science and behavioural science through a sociotechnical lens to investigate the deceptive tactics used by cybercriminals.

Recently, I worked with Shiven Naidoo, a master's student in data science, to understand how behavioural science and data science could join forces to combat cybercrime.

Our study found that, just as Holmes's analytical genius and his sidekick Dr John Watson's practical approach were complementary, behavioural scientists and data scientists can collaborate to make cybercrime detection and prevention models more effective.

Data science uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

When its powerful algorithms are applied to complex and large datasets, they can identify patterns that indicate potential cyber threats. Predictive analysis helps cybersecurity teams anticipate and prevent large-scale attacks. This is done through, for instance, detecting anomalies in sentence structure to spot scams.

However, relying solely on data science often overlooks the human factors that drive cybercriminal behaviour.

The behavioural sciences study human behaviour. They consider the principles that influence decision-making and compliance. We drew extensively from US psychologist Robert Cialdini's social influence model in our study.

This model has been applied in cybersecurity studies to explain how cybercriminals exploit psychological tendencies.


For example, cybercriminals exploit humans' tendency to be obedient to authority by impersonating trusted figures to spread disinformation. They also exploit urgency and scarcity principles to prompt hasty actions. Social proof, the tendency to follow the actions of those similar to us, is another tool, used to manipulate users into complying with fraudulent requests. For instance, cybercriminals might create fake reviews or testimonials, prompting users to fall for a scam.

We adapted the social influence model to detect cybercriminal tactics in scam datasets by combining behavioural and data science. Scam datasets consist of unstructured data, which includes complex text data such as phishing emails and fake social media posts. Our data consisted of known scams such as phishing and other malicious activities. It came from FraudWatch Internationals Cyber Intelligence Datafeed, which collects information on cybercrime incidents.

It's tough to draw insights from unstructured data. Models can't easily discern between meaningful data points and those that are irrelevant or misleading (we call this noisy data). Data scientists rely on feature engineering to cut through the noise. This process identifies and labels meaningful data points using knowledge from other fields.

We used domain knowledge from behavioural science to engineer and label meaningful features in unstructured scam data. Scams were labelled based on how they used Cialdini's social influence principles, transforming raw text data into meaningful features. For example, a phishing email might use the principle of urgency by saying "your account will be locked in 24 hours if you do not respond!" The raw text is transformed into a meaningful feature labelled urgency, which can be analysed for patterns. Then we used machine learning to analyse and visualise the labelled dataset.
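
As a toy illustration of that kind of feature engineering (the keyword lists and messages are made up, not the study's actual features):

```python
import re
import pandas as pd

# Toy scam messages; the keyword patterns below are illustrative stand-ins
messages = pd.Series([
    "Your account will be locked in 24 hours if you do not respond!",
    "Your manager asked you to buy gift cards immediately.",
])

principles = {
    "urgency": r"\b(24 hours|immediately|urgent|now)\b",
    "authority": r"\b(manager|bank|CEO|police)\b",
}

# One boolean feature per social influence principle, derived from the raw text
features = pd.DataFrame({
    name: messages.str.contains(pattern, flags=re.IGNORECASE)
    for name, pattern in principles.items()
})
print(features)
```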

The results showed that certain social influence principles such as liking and authority were frequently used together in scams. We also found that phishing scams often employed a mix of several principles. This made them more sophisticated and harder to detect.

The results gave us valuable insights into how often different types of social influence principles (such as urgency, trust, familiarity) are exploited by cybercriminals, as well as where more than one type is used at a time. Analysing unstructured text data like phishing emails and fake social media posts allowed us to identify patterns that indicated manipulative tactics.

Overall, our work yielded high quality insights from complex scam datasets.

It's important to mention that our dataset was not exhaustive. However, we believe our results are invaluable for mining insights from complex cybercrime data. This kind of analysis can be used by cybersecurity professionals, data scientists, cybersecurity firms and organisations involved in cybersecurity research. It can help improve automated detection systems and inform targeted training.

Visit link:

Catching online scammers: our model combines data and behavioural science to map the psychological games cybercriminals play - The Conversation Canada

Online Data Science Training Programs Market size is set to grow by USD 6.54 billion from 2024-2028, Increasing job prospects boost the market,…

NEW YORK, July 9, 2024 /PRNewswire/ -- The global online data science training programs market size is estimated to grow by USD 6.54 billion from 2024 to 2028, according to Technavio, with a CAGR of 34.73% during the forecast period. This growth is driven by increasing job prospects and a trend towards microlearning and gamification. However, the advent of open-source learning materials poses a challenge. Key market players include 2U Inc., Coursera Inc., DataCamp Inc., Harvard University, Intellipaat Software Solutions Pvt. Ltd., Simplilearn, Udacity Inc., and upGrad Education Pvt. Ltd.

Get a detailed analysis on regions, market segments, customer landscape, and companies - View the snapshot of this report

Online Data Science Training Programs Market Scope

Report Coverage: Details
Base year: 2023
Historic period: 2018 - 2022
Forecast period: 2024-2028
Growth momentum & CAGR: Accelerate at a CAGR of 34.73%
Market growth 2024-2028: USD 6542.6 million
Market structure: Fragmented
YoY growth 2022-2023 (%): 25.9
Regional analysis: North America, APAC, Europe, South America, and Middle East and Africa
Performing market contribution: APAC at 47%
Key countries: US, China, Canada, India, and Germany
Key companies profiled: 2U Inc., Alison, AnalytixLabs, Coursera Inc., DataCamp Inc., Dataquest Labs Inc., Great Lakes E-Learning Services Pvt. Ltd., Harvard University, Henry Harvin Education Inc., Intellipaat Software Solutions Pvt. Ltd., InventaTeq, Kaplan Inc., Manipal Academy of Higher Education, NIIT Ltd., NYC Data Science Academy, Simplilearn, Udacity Inc., Udemy Inc., and upGrad Education Pvt. Ltd.

Market Driver

The global online data science training programs market is witnessing a rise in the popularity of microlearning. Microlearning delivers content in short, easily digestible formats, including videos and infographics, which can be accessed on-demand. Corporations are increasingly adopting microlearning due to its compatibility with mobile devices and its ability to provide just-in-time learning. Vendors like DataCamp offer mobile versions with gamified, interactive microlearning lessons, enabling efficient knowledge acquisition and addressing learning gaps. This trend towards microlearning and gamification is projected to fuel market expansion during the forecast period.

Online Data Science training programs have seen a significant surge in popularity due to the technological advancement in online education. These programs cover essential topics like Statistics, Math, Data management, Data visualization, Statistical programming, Machine learning, and more. Students and working professionals can now access high-quality Data Science education from anywhere in the world, thanks to live streaming and recorded sessions. Online Statistics courses are particularly beneficial for data science beginners, helping them grasp fundamental statistical concepts. Education technology companies offer flexible and convenient solutions, including digital-curriculum, tutoring platforms, and collaboration opportunities. Certifications and Masters programs add industry recognition for career advancement. Remote work trends and the increasing demand for skilled data scientists make online training programs an attractive option for organizations and decision-makers. Mid-Range Data Scientists can also benefit from these platforms, addressing technical issues and providing access to the latest industry knowledge. Online learning platforms offer a cost-effective alternative to college education, with the added benefits of access to textbooks, communication apps, and artificial intelligence tools. Overall, online Data Science training programs provide flexibility, convenience, and global accessibility, making them an essential part of the Data Science education landscape.

Discover a 360° analysis of this market. For complete information, schedule your consultation - Book Here!

Market Challenges

For more insights on drivers and challenges - Request a sample report!

Segment Overview

This online data science training programs market report extensively covers market segmentation by

1.1 Professional degree courses - The professional degree course segment is a significant player in the global online data science training market. It offers intensive and detailed training programs, designed to arm professionals with essential data science skills and knowledge. The curriculum covers essential data science areas like statistical analysis, machine learning, data visualization, and data engineering. Industry experts and academic professionals deliver these courses, ensuring top-notch education. Students engage in live lectures, webinars, and group discussions, fostering a collaborative learning environment. Practical assignments and projects enable learners to apply their knowledge in real-world scenarios, enhancing their proficiency in data science techniques and tools. Upon completion, students receive industry-relevant certifications, increasing their competitiveness in the job market. These certifications validate their acquired knowledge and skills, providing a significant edge in pursuing data science careers. The professional degree course segment's active and comprehensive approach to education is expected to fuel the growth of the global online data science training programs market.

For more information on market segmentation with geographical analysis including forecast (2024-2028) and historic data (2017-2021) - Download a Sample Report

Research Analysis

Online Data Science Training Programs have gained significant popularity in recent years due to the increasing demand for skilled data scientists and the convenience of remote learning. These programs cover various aspects of data science, including Statistics, Math, Data Management, Data Visualization, Statistical Programming, Machine Learning, and more. Students and professionals can learn fundamental statistical concepts, big data technologies, and advanced machine learning algorithms from the comfort of their homes. Online training programs offer Industry recognition through certifications and even Masters degrees, making it an attractive option for career advancement. With remote work trends on the rise, online learning platforms provide Global accessibility, making data science education accessible to anyone with an internet connection. Organizations and decision-makers benefit from having a skilled workforce, and online training programs offer an efficient and cost-effective solution. Online statistics courses are available for data science beginners, and advanced programs cater to professionals looking to expand their skillset. Overall, online data science training programs offer a flexible, convenient, and accessible way to learn data science and advance your career.

Market Research Overview

Online Data Science Training Programs: Unleashing the Power of Statistics, Math, and Machine Learning

Online Data Science training programs have gained significant traction in recent times, offering Students and Working Professionals an opportunity to master this high-demand field from the comfort of their homes or workplaces. These programs cover essential topics such as Statistics, Math, Data Management, Data Visualization, Statistical Programming, Machine Learning, and more. Online Education technology companies have been at the forefront of this technological advancement, providing live streaming and recorded sessions, digital-curriculum, tutoring platforms, and collaboration opportunities with industry partners. These platforms offer Mid-Range to advanced courses, catering to Data Scientists, decision-makers, and data science beginners. Online learning platforms provide Flexibility, Convenience, and Global Accessibility, making it an attractive alternative to traditional College education. Industry recognition and career advancement opportunities are significant benefits, as skilled data scientists are in high demand by organizations. Customized training programs, hands-on projects, and real-world applications ensure learners gain practical experience and upskilling/reskilling opportunities. Online certification exams, designed with integrity and security, provide learners with Industry recognition. However, concerns regarding Quality, Industry relevance, and accreditation are valid. Informed decisions should be made based on the platform's reputation, industry partnerships, and the ability to offer practical experience and hands-on training in the online format. Digital technology and Mobile technology have revolutionized data science education, with automation and communication apps streamlining the learning process. Despite these advancements, technical issues may arise, necessitating effective assessments, evaluations, and customer support. In conclusion, Online Data Science Training Programs offer a cost-effective, flexible, and convenient alternative to traditional education methods, providing learners with the skills and knowledge required to excel in this dynamic field.

Table of Contents:

1 Executive Summary
2 Market Landscape
3 Market Sizing
4 Historic Market Size
5 Five Forces Analysis
6 Market Segmentation
7 Customer Landscape
8 Geographic Landscape
9 Drivers, Challenges, and Trends
10 Company Landscape
11 Company Analysis
12 Appendix

About Technavio

Technavio is a leading global technology research and advisory company. Their research and analysis focuses on emerging market trends and provides actionable insights to help businesses identify market opportunities and develop effective strategies to optimize their market positions.

With over 500 specialized analysts, Technavio's report library consists of more than 17,000 reports and counting, covering 800 technologies, spanning across 50 countries. Their client base consists of enterprises of all sizes, including more than 100 Fortune 500 companies. This growing client base relies on Technavio's comprehensive coverage, extensive research, and actionable market insights to identify opportunities in existing and potential markets and assess their competitive positions within changing market scenarios.

Contacts

Technavio Research
Jesse Maida
Media & Marketing Executive
US: +1 844 364 1100
UK: +44 203 893 3200
Email: [emailprotected]
Website: www.technavio.com/

SOURCE Technavio

See the original post:

Online Data Science Training Programs Market size is set to grow by USD 6.54 billion from 2024-2028, Increasing job prospects boost the market,...

Master This Data Science Skill and You Will Land a Job In Big Tech Part I | by Khouloud El Alami | Jul, 2024 – Towards Data Science


Are you a data scientist dreaming of landing a job in Big Tech, but you're not sure what skills you need to get there?

Well, I've got a secret weapon that could be just what you need to land your dream job in top tech companies.

A few months ago, I wrote this article about all the essential skills you need to get hired by the best tech firms, and today, we're going to focus on one of those crucial skills: Experimentation.

Experimentation is a statistical approach that helps us isolate and evaluate the impact of product changes: launching features, UX updates, and all!

But why is experimentation so important for standing out among other data scientists?

It's simple. The biggest tech companies are all about creating great products, and experimentation is a vital tool in achieving that.

If you can become an expert in experimentation, you'll have a significant advantage over other candidates because most job seekers overlook this skill and don't know how to develop it.

Excerpt from:

Master This Data Science Skill and You Will Land a Job In Big Tech Part I | by Khouloud El Alami | Jul, 2024 - Towards Data Science

LLM Apps, Crucial Data Skills, Multi-AI Agent Systems, and Other July Must-Reads – Towards Data Science

Feeling inspired to write your first TDS post? We're always open to contributions from new authors.

If it's already summer where you live, we hope you're making the most of the warm weather and (hopefully? maybe?) more relaxed daily rhythms. Learning never stops, of course, at least not for data scientists, so if your idea of a good time includes diving into new challenges and exploring cutting-edge tools and workflows, you're in for a treat.

Our July highlights, made up of the articles that created the biggest splash among our readers last month, cover a wide range of practical topics, and many of them are geared towards helping you raise your own bar and expand your skill set. Let's dive in!

Every month, we're thrilled to see a fresh group of authors join TDS, each sharing their own unique voice, knowledge, and experience with our community. If you're looking for new writers to explore and follow, just browse the work of our latest additions, including Mengliu Zhao, Robbie Geoghegan, Alex Dremov, Torsten Walbaum, Jeremi Nuer, Jason Jia, Akchay Srivastava, Roman S, James Teo, Luis Fernando PREZ ARMAS, Ph.D., Lea Wu, W. Caden Hamrick, Jack Moore, Eddie Forson, Carsten Frommhold, Danila Morozovskii, Biman Chakraborty, Jean Meunier-Pion, Ken Kehoe, Robert Lohne, Pranav Jadhav, Cornellius Yudha Wijaya, Vito Rihaldijiran, Justin Laughlin, Yiit Ak, Teemu Sormunen, Lars Wiik, Rhea Goel, Ryan D'Cunha, Gonzalo Espinosa Duelo, Akila Somasundaram, Mel Richey, PhD, Loren Hinkson, Jonathan R. Williford, PhD, Daniel Low, Nicole Ren, Daniel Pollak, Stefan Todoran, Daniel Khoa Le, Avishek Biswas, Eyal Trabelsi, Ben Olney, Michael B Walker, Eleanor Hanna, and Magda Ntetsika.

Follow this link:

LLM Apps, Crucial Data Skills, Multi-AI Agent Systems, and Other July Must-Reads - Towards Data Science

A data science roadmap for open science organizations engaged in early-stage drug discovery – Nature.com

Consistent data processing: a critical prelude to building AI models

The critical nature of precise storage, management, and dissemination of data in the realm of drug discovery is universally recognized. This is because the extraction of meaningful insights depends on the data being readily accessible, standardized, and maintained with the highest possible consistency. However, the implementation of good data practices can vary greatly and depends on the goals, culture, resources, and expertise of research organizations. A critical, yet sometimes underestimated, aspect is the initial engineering task of data preprocessing, which entails transforming raw assay data into a format suitable for downstream analysis. For instance, quantifying sequencing reads from DNA-encoded library screens into counts is required for the subsequent hit identification data science analysis step. Ensuring the correctness of this initial data processing step is imperative, but it may be given too little priority, potentially leading to inaccuracies in subsequent analyses. Standardization of raw data processing is an important step to enable subsequent machine learning studies of DEL data. Currently, this step is done by companies or organizations that generate and screen DEL libraries, and the respective protocols are reported if a study is published (see the Methods section in McCloskey et al. 18). Making data processing pipelines open source will help establish best practices to allow for scrutiny and revisions if necessary. While this foundational step is vital for harnessing data science, it is worth noting that it will not be the focus of this discussion.

In drug discovery, data science presents numerous opportunities to increase the efficiency and speed of the discovery process. Initially, data science facilitates the analysis of huge experimental data, e.g., allowing researchers to identify potential bioactive compounds in large screening data. Machine learning models can be trained on data from DEL or ASMS and, in turn, be used for hit expansion in extensive virtual screens. For example, a model trained to predict the read counts of a specific DEL screen can be used to identify molecules from other large compound libraries, which are likely to bind to the target protein under consideration18.

As the drug discovery process advances to compound optimization, data science can be used to analyse and predict the pharmacokinetic and dynamic properties of potential drug candidates. This includes model-based evaluation of absorption, distribution, metabolism, excretion, and toxicity (ADMET) profiles. ADMET parameters are crucial in prioritizing and optimizing candidate molecules. Acknowledging their importance, the pharmaceutical industry has invested substantially in developing innovative assays and expanding testing capacities. Such initiatives have enabled the characterization of thousands of compounds through high-quality in-vitro ADMET assays, serving as a prime example of data curation in many pharmaceutical companies37. The knowledge derived from accumulated datasets has the potential to impact research beyond the projects where the data was originally produced. Computational teams utilize these data to understand the principles governing ADMET endpoints as well as to develop in-silico models for the prediction of ADMET properties. These models can help prioritize compound sets lacking undesired liabilities and thus guide researchers in their pursuit to identify the most promising novel drug candidates.

Major approaches in early drug discovery data science encompass classification, regression, or ranking models. They are, for example, employed in drug discovery to classify molecules as mutagenic, predict continuous outcomes such as the binding affinity to a target, and rank compounds in terms of their solubility. Incorporating prior domain knowledge can further enhance the predictive power of these models. Often, assays or endpoints that are correlated can be modelled together, even if they represent independent tasks. By doing so, the models can borrow statistical strength from each individual task, thereby improving overall performance compared to modelling them independently. For example, multitask learning models can predict multiple properties concurrently, as demonstrated by a multitask graph convolutional approach used for predicting physicochemical ADMET endpoints38.

When confronted with training data that have ambiguous labels, utilizing multiple-instance learning can be beneficial. Specifically, in the context of bioactivity models, this becomes relevant when multiple 3D conformations are considered, as the bioactive conformation is often unknown39. A prevalent challenge in applying data science for predictive modelling of chemical substances is choosing a suitable molecular representation. Different representations, such as Continuous Data-Driven Descriptor (CDDD)40 from SMILES strings, molecular fingerprints41 or 3D representations42, capture different facets of the molecular structure and properties43. It is vital to select an appropriate molecular representation as this determines how effectively the nuances of the chemical structures are captured. The choice of the molecular representation influences the prediction performance of various downstream tasks, making it a critical factor in AI-driven drug discovery, as discussed in detail in David et al.s43 review and practical guide on molecular representations in AI-driven drug discovery. Recent studies have found that simple k-nearest neighbours on molecular fingerprints can match or outperform much more complicated deep learning approaches on some compound potency prediction benchmarks44,45. On the other hand, McCloskey et al. 18 have discovered hits by training graph neural networks on data from DEL screens, which are not close to the training set using established molecular similarity calculations. Whether a simple molecular representation, infused with chemical knowledge, or a complex, data-driven deep learning representation is more suitable for the task at hand depends strongly on the training data and needs to be carefully evaluated on a case-by-case basis to obtain a fast and accurate model.
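
As a flavour of the "simple baseline" end of that spectrum, a k-nearest-neighbour potency model on Morgan/ECFP fingerprints might look like this (the SMILES, potency values, and hyperparameters are toy placeholders):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.neighbors import KNeighborsRegressor

def ecfp(smiles, radius=2, n_bits=2048):
    """Morgan/ECFP-style bit fingerprint as a boolean numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(list(fp), dtype=bool)

train_smiles = ["CCO", "CCN", "CCC", "c1ccccc1O"]   # toy training compounds
train_potency = np.array([5.1, 5.4, 4.8, 6.2])      # toy pIC50-like labels

X = np.stack([ecfp(s) for s in train_smiles])
knn = KNeighborsRegressor(n_neighbors=3, metric="jaccard").fit(X, train_potency)
print(knn.predict(np.stack([ecfp("CCCO")])))         # predict a new compound
```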

Sound strategies for splitting data into training and test sets are crucial to ensure robust model performance. These strategies include random splitting, which involves dividing the data into training and test sets at random, ensuring a diverse mix of data points in both sets. Temporal splitting arranges data chronologically, training the model on older data and testing it on more recent data, which is useful for predicting future trends. Compound cluster-wise splitting divides training and test sets into distinct chemical spaces. Employing these strategies is essential, as inconsistencies between the distributions of training and test data can lead to unreliable model outputs, negatively impacting decision-making processes in drug discovery46.
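
A compound cluster-wise split can be sketched with scikit-learn's group-aware splitter (the cluster assignments below are toy placeholders for, e.g., scaffold or k-means cluster labels):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 compounds assigned to 4 hypothetical scaffold clusters
X = np.random.rand(10, 2048)
cluster_ids = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])

# Whole clusters go to either train or test, so the two sets
# cover distinct regions of chemical space.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=cluster_ids))
```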

The successful application of machine learning requires keeping their domain of applicability in mind at all stages. This includes using the techniques described in the previous section for data curation and model development. However, it is equally important to be able to estimate the reliability of a prediction made by an AI model. While generalization to unseen data is theoretically well understood for classic machine learning techniques, it is still an active area of research for deep learning. Neural networks can learn complex data representations through successive nonlinear transformations of the input. As a downside of this flexibility, these models are more sensitive to so-called adversarial examples, i.e., instances outside the domain of applicability that are seemingly close to the training data from the human perspective44. For this reason, deep learning models often fall short of providing reliable confidence estimates for their predictions. Several empirical techniques can be used to obtain uncertainty estimates: Neural network classifiers present a probability distribution indicative of prediction confidence, which is inadequately calibrated but can be adjusted on separate calibration data45. For regression tasks, techniques such as mixture density networks47 or Bayesian dropout48 can be employed to predict distributions instead of single-point estimates. For both classification and regression, the increased variance of a model ensemble indicates that the domain of applicability has been left49.
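
One common empirical recipe from the list above, ensemble variance as an uncertainty signal, can be sketched as follows (toy data; how to threshold "too uncertain" is left to the application):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for fingerprints and measured potencies
X = np.random.rand(50, 128)
y = np.random.rand(50)

# The spread of predictions across the ensemble is a rough uncertainty signal:
# high variance suggests a query lies outside the domain of applicability.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
query = np.random.rand(20, 128)
per_tree = np.stack([tree.predict(query) for tree in forest.estimators_])
mean_pred, uncertainty = per_tree.mean(axis=0), per_tree.std(axis=0)
```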

With the methods described in the previous paragraphs, we possess the necessary methodological stack to establish a data-driven feedback loop from experimental data, a crucial component for implementing active learning at scale. By leveraging predictive models that provide uncertainty estimates, we can create a dynamic and iterative data science process for the design-make-test-analyse (DMTA) cycle. For instance, these predictive models can be utilized to improve the potency of a compound by identifying and prioritizing molecules that are predicted to have high affinity yet are uncertain. Similarly, the models can be used to increase the solubility of a compound by selecting molecules that are likely to be more soluble, thus improving delivery and absorption. This process continuously refines predictions and prioritizes the most informative data points for subsequent experimental testing and retraining the predictive model, thereby enhancing the efficiency and effectiveness of drug discovery efforts. An important additional component is the strategy to pick molecules for subsequent experiments. By intelligently selecting the most informative samples, possibly those that the model is most uncertain about, the picking strategy ensures that each iteration contributes maximally to refining the model and improving predictions. For example, in the context of improving compound potency, the model might prioritize molecules that are predicted to have high potency but with a high degree of uncertainty. These strategies optimize the DMTA process by ensuring that each experimental cycle contributes to the refinement of the predictive model and the overall efficiency of the drug discovery process.
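
Continuing the sketch above, a simple acquisition rule for the pick step might favour compounds that are predicted to be potent but remain uncertain (the weighting is an arbitrary assumption):

```python
# Rank candidates by predicted potency plus an exploration bonus for uncertainty,
# then send the top picks into the next design-make-test-analyse cycle.
acquisition = mean_pred + 1.0 * uncertainty
next_batch = np.argsort(acquisition)[::-1][:10]
```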

When applying the computational workflow depicted in Fig. 3 to large compound libraries, scientists encounter a rather uncommon scenario for machine learning: usually, the training of deep neural networks incurs the highest computational cost, since many iterations over large datasets are required, while comparatively few predictions will later be required from the trained model within a similar time frame. However, when inference is to be performed on a vast chemical space, we face the inverse situation. Assessing billions of molecules for their physicochemical parameters and bioactivity is an extremely costly procedure, potentially requiring thousands of graphics processing unit (GPU) hours. Therefore, not only predictive accuracy but also the computational cost of machine learning methods is an important aspect that should be considered when evaluating the practicality of a model.

Fig. 3: Computational workflow for predicting molecular properties, starting with molecular structure encoding, followed by model selection and assessment, and concluding with the application of models to virtually screen libraries and rank these molecules for potential experimental validation. The process can be cyclical, allowing iterative refinement of models based on empirical data. ADMET: absorption, distribution, metabolism, and excretion-toxicity. ECFP: Extended Connectivity Fingerprints. CDDD: Continuous Data-Driven Descriptor, a type of molecular representation derived from SMILES strings. Entropy: Shannon entropy descriptors50,51.

Originally posted here:

A data science roadmap for open science organizations engaged in early-stage drug discovery - Nature.com

Data science instruction comes of age – The Hechinger Report

This is an edition of our Future of Learning newsletter. Sign up today to get it delivered straight to your inbox.

I've been reporting on data science education for two years now, and it's become clear to me that what's missing is a national framework for teaching data skills and literacy, similar to the Common Core standards for math or the Next Generation Science Standards.

Data literacy is increasingly critical for many jobs in science, technology and beyond, and so far schools in 28 states offer some sort of data science course. But those classes vary widely in content and approach, in part because there's little agreement around what exactly data science education should look like.

Last week, there was finally some movement on this front: a group of K-12 educators, students, higher ed officials and industry leaders presented initial findings on what they believe students should know about data by the time they graduate from high school.

Data Science 4 Everyone, an initiative based at the University of Chicago, assembled 11 focus groups that met over five months to debate what foundational knowledge on data and artificial intelligence students should acquire not only in dedicated data science classes but also in math, English, science and other subjects.

Among the group's proposals for what every graduating high schooler should be able to do:

On August 15, Data Science 4 Everyone plans to release a draft of its initial recommendations, and will be asking educators, parents and others across the country to vote on those ideas and give other feedback.

Here are a few key stories to bring you up to speed:

Data science under fire: What math do high schoolers really need?

Earlier this year, I reported on how a California school district created a data science course in 2020, to offer an alternative math course to students who might struggle in traditional junior and senior math courses such as Algebra II, Pre-Calculus and Calculus, or didn't plan to pursue science or math fields or attend a four-year college. California has been at the center of the debate on how much math, and what math, students need to know before high school graduation.

Eliminating advanced math tracks often prompts outrage. Some districts buck the trend

Hechinger contributor Steven Yoder wrote about how districts that try to detrack, or stop sorting students by perceived ability, often face parental pushback. But he identified a handful of districts that have forged ahead successfully with detracking.

PROOF POINTS: Stanford's Jo Boaler talks about her new book MATH-ish and takes on her critics

My colleague Jill Barshay spoke with Boaler, the controversial Stanford math education professor who has advocated for data science education, detracking and other strategies to change how math is taught. Jill writes that the academic fight over Boaler's findings reflects wider weaknesses in education research.

What's next: This summer and fall I'm reporting on other math topics, including a program to get more Black and Hispanic students into and through Calculus, and efforts by some states to revise algebra instruction. I'd love to hear your thoughts on these topics and other math ideas you think we should be writing about.

More on the Future of Learning

How did students pitch themselves to colleges after last year's affirmative action ruling?, The Hechinger Report

PROOF POINTS: This is your brain. This is your brain on screens, The Hechinger Report

Budget would require districts to post plans to educate kids in emergencies, EdSource

Turmoil surrounds LA's new AI student chatbot as tech firm furloughs staff just 3 months after launch, The 74

Oklahoma education head discusses why he's mandating public schools teach the Bible, PBS

This story about data science standards was produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education.

The Hechinger Report provides in-depth, fact-based, unbiased reporting on education that is free to all readers. But that doesn't mean it's free to produce. Our work keeps educators and the public informed about pressing issues at schools and on campuses throughout the country. We tell the whole story, even when the details are inconvenient. Help us keep doing that.

Join us today.

Go here to read the rest:

Data science instruction comes of age - The Hechinger Report